Table 5 Pass@1 accuracy on the HumanEval benchmark for code generation. The table ranks standard large language models by performance; the \(\hookrightarrow\) rows (shaded gray in the original table) mark where DR-CoT-enhanced smaller models would place in that ranking, showing how the technique enables smaller models (1.3B-1.5B parameters) to compete with models 10-50x their size.
| Model | HumanEval (%) | Rank |
| --- | --- | --- |
| Large Language Models | | |
| GPT-3.5 (May 2023) | 73.2 | 1 |
| WizardCoder-Python-34B-V1.0 | 73.2 | 1 |
| OpenChat-3.5-7B-0106 | 72.6 | 3 |
| CodeLlama-70B-Instruct | 72.0 | 4 |
| WhiteRabbitNeo-33B-v1 | 72.0 | 4 |
| \(\hookrightarrow\) Qwen2.5Coder-1.5B-Instruct + DR-CoT would rank here (71.4%) | | |
| Phind-CodeLlama-34B-v2 | 71.3 | 6 |
| speechless-coder-ds-6.7B | 71.3 | 6 |
| Magicoder-S-CL-7B | 70.7 | 8 |
| Claude-3-Sonnet (Mar 2024) | 70.7 | 8 |
| Llama3-1.8B-Instruct | 69.5 | 10 |
| Mistral Large (Mar 2024) | 69.5 | 10 |
| Claude-2 (Mar 2024) | 69.5 | 10 |
| Qwen1.5-72B-Chat | 68.3 | 13 |
| Gemini Pro 1.5 | 68.3 | 13 |
| StarCoder2-15B-Instruct-v0.1 | 67.7 | 15 |
| speechless-starcoder2-15b | 67.1 | 16 |
| Code-290k-6.7B-Instruct | 64.6 | 18 |
| Phi-3-mini-4k-instruct | 64.6 | 18 |
| \(\hookrightarrow\) DeepseekCoder-1.3B-Instruct + DR-CoT would rank here (64.1%) | | |
| Command-R+ | 64.0 | 20 |
| dolphin-2.6-mixtral-8x7b | 64.0 | 20 |
| Gemini Pro 1.0 | 63.4 | 22 |
| Models With DR-CoT | | |
| Qwen2.5Coder-1.5B-Instruct | 54.5 | – |
| Qwen2.5Coder-1.5B-Instruct + DR-CoT | 71.4 | (\(\uparrow\) to 4th) |
| DeepseekCoder-1.3B-Instruct | 57.3 | – |
| DeepseekCoder-1.3B-Instruct + DR-CoT | 64.1 | (\(\uparrow\) to 18th) |
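The HumanEval scores above are pass@1 percentages: the fraction of the benchmark's 164 programming problems for which a generated completion passes all hidden unit tests. As a minimal sketch (assuming the standard unbiased pass@k estimator of Chen et al., 2021, since the paper's exact evaluation harness is not reproduced here), pass@1 can be computed as follows; the `results` list of hypothetical `(n, c)` pairs is illustrative only.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn without replacement from n generations
    of which c pass the unit tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark pass@1 is the mean per-problem estimate. With a single greedy
# sample per problem (n = 1, k = 1), it reduces to the fraction of problems
# whose completion passes all tests.
results = [(1, 1), (1, 0), (1, 1), (1, 1)]  # hypothetical (n, c) pairs
pass_at_1 = float(np.mean([pass_at_k(n, c, 1) for n, c in results]))
print(f"pass@1 = {pass_at_1:.1%}")  # 75.0% on this toy set
```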