Table 1 DDI Results for Different Models for \(\theta \in \{50, 80, 90, 95, 99\}\) on the HumanEval dataset. \(R^2\) indicates exponential fit quality: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), Poor (\(R^2 < 0.7\)). Models with \(\lambda = \text{None}\) had too few data points for exponential fitting after zero-effectiveness values were filtered out.
From: Measuring and mitigating debugging effectiveness decay in code language models
| Model | \(E_0\) | \(\lambda\) | \(A_0\) | \(t_\theta\) | \(R^2\) |
|---|---|---|---|---|---|
| claude-3-7-sonnet-20250219 | 93.902 | None | 100.00 | [] | None |
| codegemma:7b | 51.219 | 0.9309 | 66.463 | [1, 2, 3, 4, 5] | Excellent |
| codellama:7b | 21.341 | 0.2467 | 45.122 | [3, 7, 10, 13, 19] | Poor |
| codestral:22b | 58.537 | 0.3388 | 89.024 | [3, 5, 7, 9, 14] | Good |
| deepseek-coder-v2:16b | 71.951 | 0.9692 | 84.146 | [1, 2, 3, 4, 5] | Excellent |
| deepseek-coder:6.7b | 45.732 | 0.4737 | 74.390 | [2, 4, 5, 7, 10] | Excellent |
| devstral:24b | 84.146 | 0.6438 | 94.512 | [2, 3, 4, 5, 8] | Excellent |
| gemma2:9b | 59.146 | 0.7632 | 76.219 | [1, 3, 4, 4, 7] | Excellent |
| gpt-3.5-turbo | 73.781 | 1.3297 | 82.317 | [1, 2, 2, 3, 4] | Excellent |
| gpt-3.5-turbo-1106 | 70.732 | 0.7553 | 85.976 | [1, 3, 4, 4, 7] | Excellent |
| gpt-4-1106-preview | 90.244 | 0.7619 | 96.951 | [1, 3, 4, 4, 7] | Poor |
| granite3.3:8b | 68.902 | 0.9482 | 82.317 | [1, 2, 3, 4, 5] | Excellent |
| llama2:7b | 3.659 | 0.1185 | 10.976 | [6, 14, 20, 26, 39] | Poor |
| llama3.1:8b | 56.707 | 1.1142 | 72.561 | [1, 2, 3, 3, 5] | Excellent |
| mistral:instruct | 29.878 | 0.5291 | 54.268 | [2, 4, 5, 6, 9] | Excellent |
| phi4-reasoning:14b | 59.146 | 0.6052 | 81.098 | [2, 3, 4, 5, 8] | Excellent |
| phi4:14b | 83.537 | 0.7680 | 93.293 | [1, 3, 3, 4, 6] | Excellent |
| qwen2.5-coder | 76.219 | 0.4624 | 94.159 | [2, 4, 5, 7, 10] | Excellent |
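A minimal sketch of how the \(t_\theta\) column relates to \(\lambda\), assuming the exponential decay model \(E(t) = A_0 e^{-\lambda t}\) and rounding up to the next whole iteration (this reproduces the table's values; the `decay_times` helper and the ceiling convention are our assumptions, not taken from the paper):

```python
import math

def decay_times(lam, thetas=(50, 80, 90, 95, 99)):
    """Number of iterations until effectiveness decays to (100 - theta)%
    of its initial amplitude, assuming E(t) = A_0 * exp(-lam * t).
    Solving A_0 * exp(-lam * t) = A_0 * (100 - theta) / 100 gives
    t = ln(100 / (100 - theta)) / lam, rounded up to a whole iteration."""
    return [math.ceil(math.log(100 / (100 - th)) / lam) for th in thetas]

# codegemma:7b (lambda = 0.9309) from the table:
print(decay_times(0.9309))  # -> [1, 2, 3, 4, 5]

# codellama:7b (lambda = 0.2467), a slower-decaying model:
print(decay_times(0.2467))  # -> [3, 7, 10, 13, 19]
```

Note that \(t_\theta\) depends only on \(\lambda\): a smaller decay rate (e.g. llama2:7b, \(\lambda = 0.1185\)) stretches every threshold crossing further out, while \(A_0\) only scales the curve's starting amplitude.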