Table 5 Pass@1 accuracy on the HumanEval code-generation benchmark. Standard large language models are ranked by performance; the \(\hookrightarrow\) rows mark where the DR-CoT-enhanced smaller models (1.3B–1.5B parameters) would place in that ranking, illustrating how the technique lets these models achieve performance competitive with models 10–50x their size.

From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models

| Model | HumanEval (%) | Rank |
| --- | --- | --- |
| Large Language Models | | |
| GPT-3.5 (May 2023) | 73.2 | 1 |
| WizardCoder-Python-34B-V1.0 | 73.2 | 1 |
| OpenChat-3.5-7B-0106 | 72.6 | 3 |
| CodeLlama-70B-Instruct | 72.0 | 4 |
| WhiteRabbitNeo-33B-v1 | 72.0 | 4 |
| \(\hookrightarrow\) Qwen2.5Coder-1.5B-Instruct + DR-CoT would rank here (71.4%) | | |
| Phind-CodeLlama-34B-v2 | 71.3 | 6 |
| speechless-coder-ds-6.7B | 71.3 | 6 |
| Magicoder-S-CL-7B | 70.7 | 8 |
| Claude-3-Sonnet (Mar 2024) | 70.7 | 8 |
| Llama3.1-8B-Instruct | 69.5 | 10 |
| Mistral Large (Mar 2024) | 69.5 | 10 |
| Claude-2 (Mar 2024) | 69.5 | 10 |
| Qwen1.5-72B-Chat | 68.3 | 13 |
| Gemini Pro 1.5 | 68.3 | 13 |
| StarCoder2-15B-Instruct-v0.1 | 67.7 | 15 |
| speechless-starcoder2-15b | 67.1 | 16 |
| Code-290k-6.7B-Instruct | 64.6 | 18 |
| Phi-3-mini-4k-instruct | 64.6 | 18 |
| \(\hookrightarrow\) DeepseekCoder-1.3B-Instruct + DR-CoT would rank here (64.1%) | | |
| Command-R+ | 64.0 | 20 |
| dolphin-2.6-mixtral-8x7b | 64.0 | 20 |
| Gemini Pro 1.0 | 63.4 | 22 |
| Models With DR-CoT | | |
| Qwen2.5Coder-1.5B-Instruct | 54.5 | – |
| Qwen2.5Coder-1.5B-Instruct + DR-CoT | 71.4 | (\(\uparrow\) to 4th) |
| DeepseekCoder-1.3B-Instruct | 57.3 | – |
| DeepseekCoder-1.3B-Instruct + DR-CoT | 64.1 | (\(\uparrow\) to 18th) |
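For context on the metric in this table: HumanEval pass@1 is conventionally computed with the unbiased pass@k estimator from the original HumanEval evaluation protocol, averaged over the benchmark's 164 problems. The sketch below is a minimal illustration of that computation, not code from the DR-CoT paper; the function names and the example sample counts are hypothetical. With a single greedy sample per problem, pass@1 reduces to the fraction of problems whose completion passes all unit tests.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval/Codex paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn for a
    problem and c of them passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


def benchmark_pass_at_k(num_samples: Sequence[int],
                        num_correct: Sequence[int],
                        k: int = 1) -> float:
    """Average the per-problem estimator over the whole benchmark,
    returned as a percentage. With greedy decoding (n = 1), pass@1
    is simply the fraction of problems solved."""
    scores = [pass_at_k(n, c, k) for n, c in zip(num_samples, num_correct)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical example: 164 HumanEval problems, one greedy sample each,
    # 117 of them passing -> 117/164 = 71.3% pass@1.
    n_problems = 164
    solved = 117
    samples = [1] * n_problems
    correct = [1] * solved + [0] * (n_problems - solved)
    print(f"pass@1 = {benchmark_pass_at_k(samples, correct, k=1):.1f}%")
```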