Table 4 Performance comparison on the GPQA benchmark (Diamond Set and Main Set). Models are grouped by architecture family and sorted by performance within each group. Bold highlights the best-performing model with DR-CoT enhancement, while italics indicate scores surpassed by ModernBERT-large with DR-CoT. The bolded row demonstrates that DR-CoT enables a significantly smaller model to achieve competitive performance against larger, more resource-intensive models.
From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models
| Model | Diamond Set (%) | Main Set (%) |
|---|---|---|
| **GPT-4 Models** | | |
| Few-Shot GPT-4 | 39.3 | 38.1 |
| Few-Shot CoT GPT-4 | 38.8 | 39.7 |
| Zero-Shot CoT GPT-4 | 35.7 | 39.5 |
| Zero-Shot GPT-4 | 34.2 | 32.1 |
| **Smaller Models with DR-CoT Enhancement** | | |
| **ModernBERT-large (DR-CoT)** | **32.9** | **29.5** |
| ELECTRA-base (DR-CoT) | 29.3 | 21.4 |
| ELECTRA-large (DR-CoT) | 28.5 | 26.1 |
| RoBERTa-large (DR-CoT) | 28.5 | 23.2 |
| BERT-large (DR-CoT) | 26.3 | 26.8 |
| RoBERTa-base (DR-CoT) | 25.8 | 20.1 |
| ModernBERT-base (DR-CoT) | 25.0 | 23.9 |
| BERT-base (DR-CoT) | 23.7 | 19.4 |
| **GPT-3.5 Models** | | |
| Zero-Shot GPT-3.5-turbo-16k | *30.6* | 29.8 |
| Few-Shot CoT GPT-3.5-turbo-16k | *29.6* | *28.0* |
| Zero-Shot CoT GPT-3.5-turbo-16k | *28.1* | *28.9* |
| Few-Shot GPT-3.5-turbo-16k | *26.0* | *28.9* |
| **Llama-2 Models** | | |
| Zero-Shot CoT Llama-2-70B-chat | *31.1* | *28.5* |
| Few-Shot CoT Llama-2-70B-chat | *28.1* | *29.1* |
| Zero-Shot Llama-2-70B-chat | *26.5* | *27.6* |
| Few-Shot Llama-2-70B-chat | *24.0* | *26.9* |
| **Additional Baselines** | | |
| GPT-4 with search | *27.6* | *28.7* |