Table 5 Pass@1 accuracy on the HumanEval code-generation benchmark. Standard large language models are ranked by performance; the \(\hookrightarrow\) rows mark where the DR-CoT-enhanced smaller models (1.3B–1.5B parameters) would place in that ranking, illustrating how the technique lets these models achieve performance competitive with models 10–50x their size.

From: DR-CoT: dynamic recursive chain of thought with meta reasoning for parameter efficient models

| Model | HumanEval (%) | Rank |
| --- | --- | --- |
| Large Language Models | | |
| GPT-3.5 (May 2023) | 73.2 | 1 |
| WizardCoder-Python-34B-V1.0 | 73.2 | 1 |
| OpenChat-3.5-7B-0106 | 72.6 | 3 |
| CodeLlama-70B-Instruct | 72.0 | 4 |
| WhiteRabbitNeo-33B-v1 | 72.0 | 4 |
| \(\hookrightarrow\) Qwen2.5Coder-1.5B-Instruct + DR-CoT would rank here (71.4%) | | |
| Phind-CodeLlama-34B-v2 | 71.3 | 6 |
| speechless-coder-ds-6.7B | 71.3 | 6 |
| Magicoder-S-CL-7B | 70.7 | 8 |
| Claude-3-Sonnet (Mar 2024) | 70.7 | 8 |
| Llama3.1-8B-Instruct | 69.5 | 10 |
| Mistral Large (Mar 2024) | 69.5 | 10 |
| Claude-2 (Mar 2024) | 69.5 | 10 |
| Qwen1.5-72B-Chat | 68.3 | 13 |
| Gemini Pro 1.5 | 68.3 | 13 |
| StarCoder2-15B-Instruct-v0.1 | 67.7 | 15 |
| speechless-starcoder2-15b | 67.1 | 16 |
| Code-290k-6.7B-Instruct | 64.6 | 18 |
| Phi-3-mini-4k-instruct | 64.6 | 18 |
| \(\hookrightarrow\) DeepseekCoder-1.3B-Instruct + DR-CoT would rank here (64.1%) | | |
| Command-R+ | 64.0 | 20 |
| dolphin-2.6-mixtral-8x7b | 64.0 | 20 |
| Gemini Pro 1.0 | 63.4 | 22 |
| Models With DR-CoT | | |
| Qwen2.5Coder-1.5B-Instruct | 54.5 | – |
| Qwen2.5Coder-1.5B-Instruct + DR-CoT | 71.4 | (\(\uparrow\) to 4th) |
| DeepseekCoder-1.3B-Instruct | 57.3 | – |
| DeepseekCoder-1.3B-Instruct + DR-CoT | 64.1 | (\(\uparrow\) to 18th) |
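For context on the metric in this table: HumanEval pass@1 is conventionally computed with the unbiased pass@k estimator from the original HumanEval evaluation protocol, averaged over the benchmark's 164 problems. The sketch below is a minimal illustration of that computation, not code from the DR-CoT paper; the function names and the example sample counts are hypothetical. With a single greedy sample per problem, pass@1 reduces to the fraction of problems whose completion passes all unit tests.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval/Codex paper:
    1 - C(n - c, k) / C(n, k), where n samples were drawn for a
    problem and c of them passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


def benchmark_pass_at_k(num_samples: Sequence[int],
                        num_correct: Sequence[int],
                        k: int = 1) -> float:
    """Average the per-problem estimator over the whole benchmark,
    returned as a percentage. With greedy decoding (n = 1), pass@1
    is simply the fraction of problems solved."""
    scores = [pass_at_k(n, c, k) for n, c in zip(num_samples, num_correct)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical example: 164 HumanEval problems, one greedy sample each,
    # 117 of them passing -> 117/164 = 71.3% pass@1.
    n_problems = 164
    solved = 117
    samples = [1] * n_problems
    correct = [1] * solved + [0] * (n_problems - solved)
    print(f"pass@1 = {benchmark_pass_at_k(samples, correct, k=1):.1f}%")
```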