Table 2 Pass@1 accuracy (%) on the GPQA Diamond and AIME2024 benchmarks for frontier models with and without DR-CoT. Bold and italic cells represent the best-performing model for each benchmark.
From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models
Model | GPQA Diamond | AIME2024 |
---|---|---|
Grok 3 Beta (Think) | 84.6 | 83.9 |
Grok 3 mini Beta (Think) | 80.0 | 89.5 |
DeepSeek-R1 | 71.5 | 79.8 |
Gemini 2.0 Flash Thinking | 74.2 | 73.3 |
o1 | 78.0 | 83.3 |
o3 mini (high) | 79.7 | 87.3 |
o3 mini (medium) | 76.8 | 79.6 |
Models with DR-CoT Enhancement | | |
Grok 3 Beta (Think) + DR-CoT | 87.3 | 86.8 |
o3 mini + DR-CoT | 79.4 | 81.5 |
Gemini 2.0 Flash Thinking + DR-CoT | 75.7 | 76.2 |
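
To make the with/without comparison easier to read, the following is a minimal sketch that computes the Pass@1 gain in percentage points for the models that appear in both halves of the table. The scores are copied from the rows above; the dictionary layout and the `gains` helper are illustrative assumptions, not part of the paper. The "o3 mini + DR-CoT" row is left out because the table does not say whether it corresponds to the high or medium baseline.

```python
# Pass@1 scores (%) copied from Table 2, as (GPQA Diamond, AIME2024).
# Only models with an unambiguous baseline row are included.
baseline = {
    "Grok 3 Beta (Think)": (84.6, 83.9),
    "Gemini 2.0 Flash Thinking": (74.2, 73.3),
}
with_dr_cot = {
    "Grok 3 Beta (Think)": (87.3, 86.8),
    "Gemini 2.0 Flash Thinking": (75.7, 76.2),
}


def gains(base, enhanced):
    """Return per-model (GPQA, AIME) gains in percentage points, rounded to 1 decimal."""
    return {
        model: tuple(round(e - b, 1) for b, e in zip(base[model], enhanced[model]))
        for model in base
    }


deltas = gains(baseline, with_dr_cot)
for model, (gpqa, aime) in deltas.items():
    print(f"{model}: GPQA Diamond {gpqa:+.1f}, AIME2024 {aime:+.1f}")
```

Under these assumptions, Grok 3 Beta (Think) gains 2.7 and 2.9 points, and Gemini 2.0 Flash Thinking gains 1.5 and 2.9 points, on GPQA Diamond and AIME2024 respectively.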