Table 4 Performance comparison on the GPQA benchmark (Diamond Set and Main Set). Models are grouped by architecture family and sorted by performance within each group. Bold highlights the best-performing model with DR-CoT enhancement, while italics indicate scores surpassed by ModernBERT-large with DR-CoT. The bolded row demonstrates that DR-CoT enables a significantly smaller model to achieve competitive performance against larger, more resource-intensive models.
From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models
| Model | Diamond Set (%) | Main Set (%) |
|---|---|---|
| **GPT-4 Models** | | |
| Few-Shot GPT-4 | 39.3 | 38.1 |
| Few-Shot CoT GPT-4 | 38.8 | 39.7 |
| Zero-Shot CoT GPT-4 | 35.7 | 39.5 |
| Zero-Shot GPT-4 | 34.2 | 32.1 |
| **Smaller Models with DR-CoT Enhancement** | | |
| **ModernBERT-large (DR-CoT)** | **32.9** | **29.5** |
| ELECTRA-base (DR-CoT) | 29.3 | 21.4 |
| ELECTRA-large (DR-CoT) | 28.5 | 26.1 |
| RoBERTa-large (DR-CoT) | 28.5 | 23.2 |
| BERT-large (DR-CoT) | 26.3 | 26.8 |
| RoBERTa-base (DR-CoT) | 25.8 | 20.1 |
| ModernBERT-base (DR-CoT) | 25.0 | 23.9 |
| BERT-base (DR-CoT) | 23.7 | 19.4 |
| **GPT-3.5 Models** | | |
| Zero-Shot GPT-3.5-turbo-16k | *30.6* | 29.8 |
| Few-Shot CoT GPT-3.5-turbo-16k | *29.6* | *28.0* |
| Zero-Shot CoT GPT-3.5-turbo-16k | *28.1* | *28.9* |
| Few-Shot GPT-3.5-turbo-16k | *26.0* | *28.9* |
| **Llama-2 Models** | | |
| Zero-Shot CoT Llama-2-70B-chat | *31.1* | *28.5* |
| Few-Shot CoT Llama-2-70B-chat | *28.1* | *29.1* |
| Zero-Shot Llama-2-70B-chat | *26.5* | *27.6* |
| Few-Shot Llama-2-70B-chat | *24.0* | *26.9* |
| **Additional Baselines** | | |
| GPT-4 with search | *27.6* | *28.7* |