Table 2 Pass@1 accuracy (%) on the GPQA Diamond and AIME2024 benchmarks for frontier models with and without DR-CoT. Bold and italic cells mark the best score on each benchmark.

Model	GPQA Diamond	AIME2024
Grok 3 Beta (Think)	84.6	83.9
Grok 3 mini Beta (Think)	80.0	*89.5*
DeepSeek-R1	71.5	79.8
Gemini 2.0 Flash Thinking	74.2	73.3
o1	78.0	83.3
o3 mini (high)	79.7	87.3
o3 mini (medium)	76.8	79.6
Models with DR-CoT Enhancement
Grok 3 Beta (Think) + DR-CoT	*87.3*	86.8
o3 mini + DR-CoT	79.4	81.5
Gemini 2.0 Flash Thinking + DR-CoT	75.7	76.2
