Table 1 DDI Results for Different Models for \(\theta \in \{50, 80, 90, 95, 99\}\) on the HumanEval dataset. \(R^2\) indicates exponential fit quality: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), Poor (\(R^2 < 0.7\)). Models with \(\lambda = \text{None}\) had insufficient data points for exponential fitting after filtering zero effectiveness values.

From: Measuring and mitigating debugging effectiveness decay in code language models

| Model | \(E_0\) | \(\lambda\) | \(A_0\) | \(t_\theta\) | \(R^2\) |
|-------|---------|-------------|---------|--------------|---------|
| claude-3-7-sonnet-20250219 | 93.902 | None | 100.00 | [] | None |
| codegemma:7b | 51.219 | 0.9309 | 66.463 | [1, 2, 3, 4, 5] | Excellent |
| codellama:7b | 21.341 | 0.2467 | 45.122 | [3, 7, 10, 13, 19] | Poor |
| codestral:22b | 58.537 | 0.3388 | 89.024 | [3, 5, 7, 9, 14] | Good |
| deepseek-coder-v2:16b | 71.951 | 0.9692 | 84.146 | [1, 2, 3, 4, 5] | Excellent |
| deepseek-coder:6.7b | 45.732 | 0.4737 | 74.390 | [2, 4, 5, 7, 10] | Excellent |
| devstral:24b | 84.146 | 0.6438 | 94.512 | [2, 3, 4, 5, 8] | Excellent |
| gemma2:9b | 59.146 | 0.7632 | 76.219 | [1, 3, 4, 4, 7] | Excellent |
| gpt-3.5-turbo | 73.781 | 1.3297 | 82.317 | [1, 2, 2, 3, 4] | Excellent |
| gpt-3.5-turbo-1106 | 70.732 | 0.7553 | 85.976 | [1, 3, 4, 4, 7] | Excellent |
| gpt-4-1106-preview | 90.244 | 0.7619 | 96.951 | [1, 3, 4, 4, 7] | Poor |
| granite3.3:8b | 68.902 | 0.9482 | 82.317 | [1, 2, 3, 4, 5] | Excellent |
| llama2:7b | 3.659 | 0.1185 | 10.976 | [6, 14, 20, 26, 39] | Poor |
| llama3.1:8b | 56.707 | 1.1142 | 72.561 | [1, 2, 3, 3, 5] | Excellent |
| mistral:instruct | 29.878 | 0.5291 | 54.268 | [2, 4, 5, 6, 9] | Excellent |
| phi4-reasoning:14b | 59.146 | 0.6052 | 81.098 | [2, 3, 4, 5, 8] | Excellent |
| phi4:14b | 83.537 | 0.7680 | 93.293 | [1, 3, 3, 4, 6] | Excellent |
| qwen2.5-coder | 76.219 | 0.4624 | 94.159 | [2, 4, 5, 7, 10] | Excellent |
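The \(t_\theta\) columns in the table are consistent with reading \(t_\theta\) as the first whole iteration at which the exponential decay \(A_0 e^{-\lambda t}\) has lost \(\theta\%\) of its initial value, i.e. \(t_\theta = \lceil \ln(100/(100-\theta))/\lambda \rceil\). The caption does not state this formula explicitly, so the following is a minimal sketch under that assumption; it reproduces the listed values for the rows checked:

```python
import math

def t_theta(lam, thetas=(50, 80, 90, 95, 99)):
    """First whole iteration t at which A0*exp(-lam*t) has decayed by
    theta percent: solve exp(-lam*t) = 1 - theta/100 for t, i.e.
    t = ln(100/(100-theta)) / lam, rounded up to the next iteration.
    (Assumed definition; lam is the fitted decay rate from the table.)"""
    return [math.ceil(math.log(100.0 / (100.0 - th)) / lam) for th in thetas]

print(t_theta(0.9309))  # codegemma:7b  -> [1, 2, 3, 4, 5]
print(t_theta(0.2467))  # codellama:7b  -> [3, 7, 10, 13, 19]
```

Under this reading, a larger \(\lambda\) (e.g. gpt-3.5-turbo at 1.3297) means effectiveness collapses within a few debugging rounds, while a small \(\lambda\) (llama2:7b at 0.1185) stretches the same percentage decay over dozens of rounds.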