Table 4 Performance comparison on the GPQA benchmark (Diamond Set and Main Set). Models are grouped by architecture family and sorted by performance within each group. Bold with underline marks the best-performing model with DR-CoT enhancement, while italics mark scores that are surpassed by ModernBERT-large with DR-CoT. The highlighted row demonstrates that DR-CoT enables a significantly smaller model to achieve competitive performance against larger, more resource-intensive models.

From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models

| Model | Diamond Set (%) | Main Set (%) |
|---|---|---|
| **GPT-4 Models** | | |
| Few-Shot GPT-4 | 39.3 | 38.1 |
| Few-Shot CoT GPT-4 | 38.8 | 39.7 |
| Zero-Shot CoT GPT-4 | 35.7 | 39.5 |
| Zero-Shot GPT-4 | 34.2 | 32.1 |
| **Smaller Models with DR-CoT Enhancement** | | |
| **ModernBERT-large (DR-CoT)** | **32.9** | **29.5** |
| ELECTRA-base (DR-CoT) | 29.3 | 21.4 |
| ELECTRA-large (DR-CoT) | 28.5 | 26.1 |
| RoBERTa-large (DR-CoT) | 28.5 | 23.2 |
| RoBERTa-base (DR-CoT) | 25.8 | 20.1 |
| BERT-large (DR-CoT) | 26.3 | 26.8 |
| ModernBERT-base (DR-CoT) | 25.0 | 23.9 |
| BERT-base (DR-CoT) | 23.7 | 19.4 |
| **GPT-3.5 Models** | | |
| Zero-Shot GPT-3.5-turbo-16k | 30.6 | 29.8 |
| Few-Shot CoT GPT-3.5-turbo-16k | 29.6 | 28.0 |
| Zero-Shot CoT GPT-3.5-turbo-16k | 28.1 | 28.9 |
| Few-Shot GPT-3.5-turbo-16k | 26.0 | 28.9 |
| **Llama-2 Models** | | |
| Zero-Shot CoT Llama-2-70B-chat | 31.1 | 28.5 |
| Few-Shot CoT Llama-2-70B-chat | 28.1 | 29.1 |
| Zero-Shot Llama-2-70B-chat | 26.5 | 27.6 |
| Few-Shot Llama-2-70B-chat | 24.0 | 26.9 |
| **Additional Baselines** | | |
| GPT-4 with search | 27.6 | 28.7 |
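The caption's italic convention (marking baseline scores surpassed by ModernBERT-large with DR-CoT) can be recomputed directly from the table. The sketch below is illustrative only and not from the paper; the model selection and dictionary layout are assumptions, with scores copied from Table 4.

```python
# Illustrative sketch (not from the paper): recompute, per split, which
# baseline scores fall below ModernBERT-large (DR-CoT), i.e. the scores
# the caption marks in italics. Scores are taken from Table 4.
MODERNBERT_LARGE = {"diamond": 32.9, "main": 29.5}

# A few representative baselines from the table (subset chosen for brevity).
baselines = {
    "Few-Shot GPT-4": {"diamond": 39.3, "main": 38.1},
    "Zero-Shot GPT-3.5-turbo-16k": {"diamond": 30.6, "main": 29.8},
    "Zero-Shot CoT Llama-2-70B-chat": {"diamond": 31.1, "main": 28.5},
    "GPT-4 with search": {"diamond": 27.6, "main": 28.7},
}

def surpassed_by_modernbert(scores: dict) -> dict:
    """Return, per split, whether ModernBERT-large (DR-CoT) scores higher."""
    return {split: MODERNBERT_LARGE[split] > s for split, s in scores.items()}

for name, scores in baselines.items():
    print(name, surpassed_by_modernbert(scores))
```

Under this rule, a baseline row may be surpassed on one split but not the other (e.g. Zero-Shot GPT-3.5-turbo-16k is below ModernBERT-large on the Diamond Set but above it on the Main Set), which is why the italics in the table apply per cell rather than per row.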