Table 2. Pass@1 accuracy on the GPQA Diamond and AIME2024 benchmarks for frontier models with and without DR-CoT. Bold and italic cells mark the best-performing model for each benchmark.

From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models

Model                                 GPQA Diamond   AIME2024

Grok 3 Beta (Think)                   84.6           83.9
Grok 3 mini Beta (Think)              80.0           89.5
DeepSeek-R1                           71.5           79.8
Gemini 2.0 Flash Thinking             74.2           73.3
o1                                    78.0           83.3
o3 mini (high)                        79.7           87.3
o3 mini (medium)                      76.8           79.6

Models with DR-CoT Enhancement

Grok 3 Beta (Think) + DR-CoT          87.3           86.8
o3 mini + DR-CoT                      79.4           81.5
Gemini 2.0 Flash Thinking + DR-CoT    75.7           76.2
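
The with/without comparison in the table above can be read off as absolute differences between each DR-CoT row and its baseline row. The following is a minimal sketch of that arithmetic, not part of the original article. It covers only the two models whose baseline row is unambiguous. The o3 mini + DR-CoT row is left out because the table does not state whether it builds on the high or the medium setting. The names `baseline`, `with_dr_cot`, and `absolute_gains` are illustrative.

```python
# Scores copied from the table above as (GPQA Diamond, AIME2024) pairs.
baseline = {
    "Grok 3 Beta (Think)": (84.6, 83.9),
    "Gemini 2.0 Flash Thinking": (74.2, 73.3),
}
with_dr_cot = {
    "Grok 3 Beta (Think)": (87.3, 86.8),
    "Gemini 2.0 Flash Thinking": (75.7, 76.2),
}


def absolute_gains(before, after):
    """Return per-model score differences (after - before), rounded to one decimal."""
    return {
        model: tuple(round(a - b, 1) for b, a in zip(before[model], after[model]))
        for model in before
    }


gains = absolute_gains(baseline, with_dr_cot)
for model, (gpqa, aime) in gains.items():
    print(f"{model}: GPQA Diamond {gpqa:+.1f}, AIME2024 {aime:+.1f}")
```

Under these assumptions, the differences are simple subtractions of the reported scores. They say nothing about statistical significance, which the table does not report.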