Introduction

The advent of large language models (LLMs) has transformed automated code generation, enabling developers to produce functional code with remarkable speed and scale1. Recent efforts have shifted toward debugging-based code generation, where LLMs iteratively refine their output based on compiler feedback or error messages, mirroring traditional software development practices2,3,4,5. This iterative approach represents a fundamental departure from single-pass generation, yet the underlying dynamics of debugging-based LLM-guided code generation remain critically underexplored. Existing implementations often apply an arbitrary number of debugging attempts without examining their optimal extent or effectiveness over continuous iterations2,3. This approach can incur significant computational costs and lacks methodological rigour in determining when additional iterations cease to yield meaningful improvements. Preliminary research and our analysis suggest that LLM-guided self-debugging typically follows an exponential decay pattern, where debugging effectiveness diminishes rapidly with successive attempts4. However, no systematic work has been conducted to characterise this decay phenomenon or explore strategies to break these patterns for improved performance. This pattern of diminishing returns in iterative LLM approaches extends beyond code generation, with recent research on reasoning models demonstrating similar complexity-dependent limitations where self-correction capabilities plateau and models either overthink simple problems or fail entirely on complex ones6, suggesting a natural ceiling that warrants systematic investigation.

Furthermore, as debugging-based LLM-guided code generation becomes increasingly prevalent, evaluation metrics must evolve beyond traditional single-pass assessments7,8 to account for the iterative nature of the process. Current evaluation approaches treat code as static artefacts rather than as the product of a dynamic development process, overlooking the significant quality enhancements that often emerge through systematic debugging and refinement9. This limitation becomes increasingly problematic as the field moves toward debugging-based approaches that more closely align with human software development practices10. Single-pass metrics such as pass@k7 measure the probability that at least one correct solution exists among k independently generated candidates. Such metrics fail to account for the iterative debugging process that is central to practical software development workflows10 and rely solely on manually written test cases11.

This study examines the effectiveness of repeated debugging attempts in LLM-based code generation and investigates strategic interventions to enhance the debugging process. To address the limitations of existing evaluation metrics, we propose a novel evaluation framework: the Debugging Decay Index (DDI). The DDI metric provides a unified assessment of LLM coding proficiency by modelling the exponential effectiveness decay observed in iterative debugging processes. Our framework computes strategic intervention timing \(t_\theta\) based on configurable effectiveness decay thresholds \(\theta\), returning a comprehensive evaluation tuple \((E_0, \lambda , t_\theta , R^2)\) that captures initial performance, decay sustainability, strategic stopping points, and model fit quality. This multi-dimensional approach enables distinctive evaluation across different aspects of the code generation and debugging pipeline. Our investigation addresses the following research questions:

  • RQ1 (Debugging Window): How many debugging attempts maximise the effectiveness of LLM-generated code before further iterations yield diminishing returns, and how do these attempt windows vary across different model architectures and problem characteristics?

  • RQ2 (DDI): How can we develop a unified evaluation metric that comprehensively assesses LLM code generation and debugging capabilities, quantifying initial performance, sustained effectiveness, and iterative refinement capability encompassing both reasoning proficiency and instruction-following competency across diverse model architectures?

  • RQ3 (Strategic Fresh Starts): Based on the optimal debugging windows identified in RQ1 and the decay characteristics quantified in RQ2, to what extent can implementing fresh start strategies after reaching effectiveness thresholds improve overall accuracy compared to continued iterative refinement within the same generation context?

Literature review

Evaluation metric

Code-generating LLMs are typically evaluated based on functional correctness or whether the generated code effectively solves the given task. In this paradigm, the pass@k metric7 has become a standard measure. Pass@k is the probability that at least one of k independently generated solutions to a problem passes all unit tests. Pass@k can be written as:

$$\begin{aligned} pass@k = 1 - \mathbb {P}(\textit{all incorrect}) \end{aligned}$$

The unbiased7,8 estimation formula is:

$$\begin{aligned} pass@k = 1 - \frac{\left( {\begin{array}{c}n-c\\ k\end{array}}\right) }{\left( {\begin{array}{c}n\\ k\end{array}}\right) } \end{aligned}$$

Where n is the total number of samples generated, \(n\ge k\), and c of them pass. One can draw \(n\ge k\) samples and count the number of solutions c that pass8. Numerous subsequent works on LLM-guided code generation have used pass@k. For example, CodeT12 and Top Pass13 evaluated various models on standard benchmarks using the pass@k metric. In MBR-EXEC14, the authors measured pass@k on HumanEval7 and Mostly Basic Python Programming (MBPP)15 to compare instruction tuning. Code generation benchmark leaderboards and evaluations of programming-focused large language models consistently report pass@k metrics (typically k=1, 5, 10, and occasionally up to k=100) as a standard method for model comparison16,17,18,19,20. The elegance of this metric lies in its simplicity and direct correlation with functionality; a model that can generate at least one correct solution within k attempts demonstrates meaningful capability in code generation tasks. Importantly, pass@k is a binary, functional metric; it only cares whether any generated solution is entirely correct.
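The estimator above is often implemented by expanding the binomial ratio as a product, which keeps intermediate values small instead of computing large binomial coefficients. A minimal sketch, not tied to any particular benchmark harness:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that pass all tests, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    # Expand the binomial ratio as a product of k small factors.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 4 samples of which c = 2 pass, pass@2 is \(1 - \binom{2}{2}/\binom{4}{2} = 5/6\).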

Building upon this foundation, researchers have conducted thorough investigations into the pass@k metric’s characteristics, examining its sensitivity to both the sample size (k) and the inherent difficulty of programming problems16,20,21,22. A critical limitation identified is the metric’s sole reliance on provided test suites, which may not comprehensively verify all aspects of code correctness or efficiency21. This concern was empirically validated when researchers augmented the standard HumanEval benchmark with more rigorous test cases (creating HumanEval-ET), resulting in a significant performance drop of approximately 20–30% across various models20. A more fundamental concern relates to how optimising for pass@k can distort model behaviour and evaluation priorities. Top Pass13 introduced a ranking model that directly optimises for this metric, revealing a key limitation: pass@k rewards getting one solution correct over producing multiple near-correct solutions. This approach fails to reward quick convergence and may allow models to game the metric by generating variants of the same algorithm rather than exploring diverse approaches. Complementary work revealed that 42% of code generations failing unit tests were still rated valuable by programmers and proposed a hybrid metric23 combining functional correctness with syntactic similarity, which achieved a 14% stronger correlation with programmer-perceived value. These findings suggest that evaluation metrics should consider not only binary correctness but also how effectively code can be refined through debugging. In response to these limitations, researchers have proposed several variations of pass@k. The count@k metric24 counts how many of k attempts are correct, while AlphaCode introduced n@k16, which generalises pass@k to measure exactly n correct solutions out of k attempts.
Addressing the need to recognise partially correct solutions, the \(pass-ratio@n\) metric25 averages the squared test-pass ratio across n generated code samples. This approach gives partial credit to nearly-correct solutions, addressing the granularity that pass@k lacks.
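Based on the description above, \(pass-ratio@n\) averages the squared per-sample test-pass ratio; a minimal sketch of that computation (the function name and input format are illustrative):

```python
def pass_ratio_at_n(test_pass_ratios: list[float]) -> float:
    """pass-ratio@n: mean of squared test-pass ratios over n samples.

    test_pass_ratios[i] is the fraction of unit tests passed by sample i.
    Squaring weights nearly-correct solutions much more than half-correct ones.
    """
    n = len(test_pass_ratios)
    return sum(r * r for r in test_pass_ratios) / n
```

A sample passing all tests contributes 1.0 to the average, while one passing half of them contributes only 0.25, which is how the metric grants partial yet discounted credit.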

While these functionality-based metrics dominate code generation evaluation, many researchers still report non-functional metrics such as BLEU26, CodeBLEU27, or ROUGE28 to measure syntactic similarity. These metrics are not replacements for pass@k but often accompany it to gauge quality aspects beyond functional correctness. While a few orthogonal approaches exist, they all fail to capture the iterative nature of code development and the debugging capabilities of LLMs.

Our proposed Debugging Decay Index (DDI) addresses this gap by focusing on the iterative path to functional correctness rather than arbitrary sampling. Unlike traditional metrics, DDI measures how effectively models leverage iterative debugging feedback to improve a solution until it achieves functional correctness. This approach acknowledges that real-world programming rarely involves generating multiple independent attempts; instead, developers iteratively refine their code through debugging cycles. By quantifying the efficiency of this debugging process, DDI provides a reliable evaluation of how models would perform in practical software development contexts, where strategic iteration, rather than random sampling, is the path to functional code.

Debugging

Researchers have explored dynamic approaches to incorporate execution feedback and debugging capabilities in LLM-guided code generation. Recent work29 investigated debugging in two distinct contexts: in-context debugging, which involves inspecting intermediate execution states, and post-context debugging, which focuses on analysing error results after complete execution. Building on this foundation, the SELF-DEBUGGING framework5 demonstrated how LLMs can analyse execution results and explain their own generated code line by line, mirroring approaches developed initially for human developers30. The framework allowed for a maximum of 10 debugging attempts, but the researchers observed that successful debugging typically concluded within just three iterations. By comparison, MapCoder3 implemented a more extensive debugging protocol, allowing up to 25 attempts, but limiting them to a maximum of 5 attempts per individual plan. The authors reported that while increased debugging iterations generally improved performance, this relationship was not strictly linear across all datasets. Notably, their results for HumanEval-ET did not follow the expected proportional improvement trend, indicating potential dataset-specific considerations in debugging efficacy. Similarly, the Large Language Model Debugger (LDB)2 employed 10 debugging attempts in their standard configuration, with additional experiments using up to 20 attempts on the HumanEval dataset. Their findings revealed a continuous but diminishing improvement trend, with gains becoming increasingly marginal after the fifth attempt. The subsequent 15 attempts collectively yielded only 2.4% additional improvement. PyCapsule4 implemented a more streamlined approach compared to MapCoder while still achieving state-of-the-art (SOTA) performance across several benchmark datasets. 
The framework employed five debugging attempts beyond the initial solution and fitted the resulting normalised debugging effectiveness to an exponential decay function, revealing that effectiveness usually diminishes dramatically after the third attempt and follows an exponential decay pattern. Their analysis further demonstrated that debugging effectiveness varies significantly across model architectures: OpenAI’s GPT-417 exhibited complete loss of debugging effectiveness (relative to the first attempt) by the third iteration, while GPT-3.517 showed similar exhaustion by the fourth attempt. In contrast, Qwen2.5-coder-instruct18 maintained some debugging capability until the fifth attempt, suggesting model-specific patterns in debugging performance decay. These findings highlight a critical research gap: the need for a standardised approach to quantify and optimise debugging capability for LLM code generation.

Empirical evidence across debugging frameworks reveals consistent diminishing returns, though the specific decay characteristics vary systematically across model architectures, suggesting model-specific debugging signatures that remain unexplored as evaluation criteria. Existing approaches treat these decay patterns as inevitable limitations rather than quantifiable characteristics of the model. This systematic variation in debugging persistence presents an opportunity to develop methodologies that both measure debugging capability through decay modelling (RQ1, RQ2) and identify possible optimal intervention strategies when effectiveness diminishes beyond an acceptable threshold (RQ3).

Methodology

RQ1: debugging window

We introduce the concept of a “debugging window” in the context of LLMs for code generation: the number of debugging attempts beyond which further iterations yield diminishing returns. While diminishing effectiveness will always occur with continued debugging efforts, establishing this window allows us to determine a practical cutoff point that balances debugging effectiveness with computational efficiency. To model the effectiveness of each debugging attempt over time, this study employs the exponential decay function (Equation 1), defined as follows:

$$\begin{aligned} E(t) = E_0 e^{-\lambda t} \end{aligned}$$
(1)

In this study, E(t) represents the effectiveness of debugging at attempt t, while \(E_0\) denotes the initial effectiveness corresponding to the very first attempt. The decay constant \(\lambda\) represents the rate of effectiveness loss over successive attempts and serves as our primary metric for characterising iterative debugging capability. Models with lower \(\lambda\) values maintain their effectiveness longer across debugging iterations, and t represents the discrete number of debugging attempts, allowing us to model the temporal progression of debugging effectiveness. To further analyse the decay process, we examine the half-life \(t_{1/2}\), which represents the number of debugging attempts after which the effectiveness reduces to half its initial value \(E_0\). By definition and from Equation 1, we get:

$$\begin{aligned} E(t_{1/2}) = \frac{1}{2} E_0 \implies t_{1/2} = \frac{\ln (2)}{\lambda } \end{aligned}$$
(2)

We can generalise Equation 2 to determine the number of debugging attempts required for any given decay percentage. For a decay threshold where effectiveness can lose up to \(\theta \%\) of its initial value (meaning \((100-\theta )\%\) effectiveness remains), the number of debugging attempts \(t_\theta\) is given by:

$$\begin{aligned} t_\theta = \frac{\ln \left( \frac{100}{100-\theta }\right) }{\lambda } \end{aligned}$$
(3)

This generalised formula enables us to calculate the debugging window for any threshold \(\theta\), providing the flexibility to determine when diminishing effectiveness justifies terminating the debugging process based on specific computational constraints.
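Equation 3 (and its half-life special case, Equation 2) translates directly into code; the result is rounded up because attempts are discrete. A minimal sketch:

```python
import math

def debugging_window(lam: float, theta: float) -> int:
    """Debugging window t_theta for decay rate lam and loss threshold theta (%).

    Solves E_0 * exp(-lam * t) = E_0 * (100 - theta) / 100 for t (Equation 3),
    then rounds up because debugging attempts are discrete integers.
    """
    t_continuous = math.log(100.0 / (100.0 - theta)) / lam
    return math.ceil(t_continuous)

# theta = 50 recovers the half-life of Equation 2: ln(2) / lambda.
```

For instance, a model with \(\lambda = 0.330\) and \(\theta = 80\) gets \(\lceil \ln(5)/0.330 \rceil = 5\) attempts before losing 80% of its initial effectiveness.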

RQ2: The Debugging Decay Index (DDI)

Our proposed DDI integrates our exponential decay analysis from RQ1 to create a comprehensive evaluation framework for LLM debugging capabilities. Unlike traditional metrics that focus solely on final outcomes, DDI captures the efficiency and capability of the debugging process and the final accuracy.

Framework implementation

The DDI is formulated as a function

$$\begin{aligned} DDI(data, \theta ) \rightarrow (E_0, \lambda , t_\theta , R^2) \end{aligned}$$

that accepts data, the normalised debugging effectiveness measurements across multiple iterative attempts; and \(\theta\), the effectiveness decay threshold(s) representing the maximum acceptable performance degradation. Following the PyCapsule4 framework, the normalised debugging effectiveness data represents the independent influence of each debugging attempt. The DDI framework identifies strategic intervention points \(t_\theta\) where debugging effectiveness would degrade by \(\theta \%\) from the initial value. In RQ3, we leverage these DDI-calculated intervention points to evaluate whether implementing fresh start strategies at the predicted timing can improve overall accuracy compared to continued iterative refinement within the same generation context. Fresh starts involve reinitiating the debugging process with the original problem statement only. DDI returns a four-element tuple:

  • \(E_0\) (Initial Effectiveness): \(E_0\) represents the initial effectiveness, calculated as \(E_0 = N_{solved\_at\_attempt\_0} / N_{total}\). This metric is directly comparable to pass@1 and represents the model’s inherent code generation capability before any debugging.

  • \(\lambda\) (Decay Rate): The decay constant extracted from fitting the exponential decay function (Equation 1) to normalised debugging effectiveness data. A lower \(\lambda\) indicates slower decay in effectiveness and more persistent debugging behaviour, reflecting sustained instruction following and reasoning consistency across iterations.

  • \(t_\theta\) (Optimal Intervention Points): \(t_\theta\) represents the maximum number of debugging attempts before effectiveness drops by \(\theta \%\) from the initial value. This represents the strategic intervention threshold corresponding to the \(\theta\) value, calculated using Equation 3. Since debugging attempts must be discrete integers, we apply the ceiling function to convert the continuous mathematical solutions into practical stopping points. This ensures that the debugging window provides sufficient attempts to reach at least the specified effectiveness threshold.

  • \(R^2\) (Fit Quality): The coefficient of determination measuring how well the exponential decay model explains the observed debugging effectiveness patterns. We interpret the results using the following categories: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), or Poor (\(R^2 < 0.7\)). High \(R^2\) values indicate predictable exponential decay behaviour, while low values suggest erratic or non-exponential debugging patterns that may require alternative evaluation approaches.

Evaluation process and interpretation

The DDI evaluation proceeds through four core steps: initial assessment records \(E_0 = N_{solved\_at\_0} / N_{total}\); iterative debugging tracks effectiveness at each attempt; decay analysis fits Equation 1 using nonlinear least squares regression to extract \(\lambda\), setting \(\lambda = \text {None}\) when insufficient data points (\(n < 3\)) exist; and threshold calculation determines strategic intervention timing \(t_\theta\) using Equation 3.
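These four steps can be sketched compactly. Note that the paper fits Equation 1 with nonlinear least squares; the stand-in below uses a simpler log-linear fit after filtering zero effectiveness values, and the function name `fit_ddi` is illustrative:

```python
import math

def fit_ddi(effectiveness: list[float], theta: float = 80.0):
    """Sketch of DDI(data, theta) -> (E_0, lambda, t_theta, R^2).

    effectiveness[i] is the normalised debugging effectiveness at attempt i,
    so effectiveness[0] plays the role of E_0. Returns lambda = None when
    fewer than 3 usable points remain after filtering zeros.
    """
    e0 = effectiveness[0]
    pts = [(t, e) for t, e in enumerate(effectiveness) if e > 0]
    if len(pts) < 3:
        return e0, None, None, None
    ts = [t for t, _ in pts]
    ys = [math.log(e) for _, e in pts]           # log-linearise E(t) = E0 e^{-lam t}
    n = len(pts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / \
            sum((t - t_bar) ** 2 for t in ts)
    lam = -slope
    # R^2 of the fitted exponential against the observed effectiveness values.
    e0_fit = math.exp(y_bar + lam * t_bar)
    obs = [e for _, e in pts]
    preds = [e0_fit * math.exp(-lam * t) for t in ts]
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, preds))
    ss_tot = sum((o - sum(obs) / n) ** 2 for o in obs)
    r2 = 1.0 - ss_res / ss_tot
    # Equation 3, rounded up to a discrete stopping point.
    t_theta = math.ceil(math.log(100.0 / (100.0 - theta)) / lam)
    return e0, lam, t_theta, r2
```

On synthetic data generated from a perfect exponential decay, the sketch recovers the decay constant and reports \(R^2 \approx 1\).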

The DDI outputs provide comprehensive model characterisation of code generation and debugging capabilities, requiring interpretation of both effectiveness metrics and fit quality. For models with high \(R^2\) values (\(\ge 0.7\)), the combination of \(E_0\) and \(\lambda\) reveals distinct model archetypes: high \(E_0\) and low \(\lambda\) indicates both strong reasoning and persistent debugging (ideal), low \(E_0\) and low \(\lambda\) suggests consistent but ineffective approaches, high \(E_0\) and high \(\lambda\) indicates strong initial reasoning but poor debugging persistence, while low \(E_0\) and high \(\lambda\) represents both weak reasoning and rapid debugging degradation. However, for models with poor fit quality (\(R^2 < 0.7\)), the exponential decay assumption may not apply, indicating that a different mathematical function may be required to fully characterise the model behaviour. In such cases, evaluation should rely primarily on \(E_0\) when using DDI. Pseudocode for DDI is provided in Appendix: DDI Pseudocode.

RQ3: strategic fresh starts

To investigate whether strategic interventions can mitigate the debugging decay phenomenon identified in RQ2, we implement fresh start strategies at DDI-calculated strategic intervention points. A fresh start completely clears conversation history and begins anew with only the original problem statement. This mechanism addresses the rapid degradation of effectiveness observed in the exponential decay pattern, particularly when models become trapped in the low-effectiveness tail, where continued debugging attempts yield negligible improvement. The fresh start strategy operates on the hypothesis that reinitialising the generation process shifts the model from exploiting failing solution approaches back to exploring alternative solution spaces. Based on empirical evidence from RQ2, we observe varied suitable intervention points \(t_\theta\) across different models, as demonstrated in Table 1. Given the variance in decay patterns, we strategically implement fresh starts at DDI-calculated intervention thresholds, enabling each model to benefit from reinitialisation at its optimal timing. To ensure a fair comparison with existing approaches, we maintain the same total attempt budget as previous works3,4, consisting of six attempts (initial generation plus five debugging iterations). Our approach strategically allocates these attempts while triggering fresh starts at DDI-calculated intervention points, testing whether strategic reinitialisation can overcome debugging decay while maintaining strict comparability with baseline methods.
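The budgeted restart loop can be sketched as follows; `generate()` and `debug()` are placeholders for the model calls (assumptions for illustration, not the framework’s actual API):

```python
def debug_with_fresh_starts(generate, debug, t_theta: int, budget: int = 6):
    """Sketch of the RQ3 loop: clear the conversation history after t_theta
    failed debugging attempts, within a fixed total attempt budget.

    generate() -> (solution, solved) starts from the problem statement only;
    debug(history) -> (solution, solved) refines using accumulated feedback.
    """
    history = []
    attempts_in_context = 0  # attempts since the last fresh start
    for attempt in range(budget):
        if attempts_in_context == 0:
            solution, solved = generate()      # fresh start: problem only
        else:
            solution, solved = debug(history)  # feedback-driven refinement
        if solved:
            return solution, attempt
        history.append(solution)
        attempts_in_context += 1
        if attempts_in_context > t_theta:      # DDI-calculated threshold hit
            history = []                       # clear conversation history
            attempts_in_context = 0
    return None, budget
```

With \(t_\theta = 1\), for example, the loop alternates one fresh generation with one debugging attempt, spending the same six-attempt budget as continuous debugging.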

Evaluation and experimental setup

To address our research questions regarding debugging windows (RQ1) and the DDI framework (RQ2), we initially applied our methodology to eighteen language models using the HumanEval7 dataset. HumanEval’s 164 function-level Python problems with well-defined test cases provide a controlled environment for isolating debugging effectiveness patterns across diverse model architectures. While this single-dataset analysis establishes the prevalence of exponential decay patterns, we subsequently conducted cross-dataset validation to verify that DDI characteristics generalise beyond HumanEval.

Fig. 1

Exponential decay curves fitted to debugging effectiveness data for four language models. The grey dashed lines indicate effectiveness thresholds at different \(\theta\) values. The \(\lambda\) (decay rate) and \(R^2\) (goodness-of-fit) values are displayed for each model.

Debugging protocol

Our evaluation employs an iterative debugging protocol using the self-correction framework from PyCapsule4. The debugging process is briefly described as follows:

Initial Generation (Attempt 0): Each model receives a problem specification and generates an initial solution. This code is executed against test cases, establishing the baseline effectiveness \(E_0\).

Iterative Refinement (Attempts 1–5): For unsuccessful solutions, the model receives structured feedback containing: the original problem specification, previously generated code, execution results (error messages, stack traces, or failed test case outputs) and an explicit instruction to debug and correct the code. The model then generates a revised solution, which is again executed against the same test cases. This process continues for a maximum of five debugging attempts beyond the initial generation, totalling six attempts per problem.

Effectiveness Measurement: Following PyCapsule’s normalisation approach, we measure the independent contribution of each debugging attempt. Given N total problems, let \(S_0\) denote problems solved at attempt 0, leaving \(N_1 = N - S_0\) unsolved. At each subsequent attempt \(i \ge 1\), \(S_i\) additional problems are solved from the remaining \(N_i = N - \sum _{j=0}^{i-1} S_j\) unsolved problems. The normalised effectiveness at attempt i is \(I_i = \frac{S_i}{N_i}\).

This normalisation isolates the independent debugging contribution at each attempt, removing the cumulative effect of previous successes. The DDI framework models the exponential decay of these normalised effectiveness values \(I_i\) across attempts.
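The normalisation can be sketched as follows (the function name is illustrative):

```python
def normalised_effectiveness(solved_per_attempt: list[int], n_total: int) -> list[float]:
    """I_i = S_i / N_i: independent contribution of each debugging attempt.

    solved_per_attempt[i] = S_i, problems newly solved at attempt i;
    the unsolved pool N_i shrinks by the successes of all earlier attempts.
    """
    remaining = n_total
    effectiveness = []
    for s_i in solved_per_attempt:
        effectiveness.append(s_i / remaining if remaining > 0 else 0.0)
        remaining -= s_i
    return effectiveness
```

For example, 100 problems with 50 solved at attempt 0, then 25 and 10 on the next two attempts, gives \(I = [0.5, 0.5, 0.4]\): the second debugging attempt solved a smaller share of its remaining pool even though it solved fewer problems in absolute terms.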

Evaluation

Table 1 DDI Results for Different Models for \(\theta \in \{50, 80, 90, 95, 99\}\) on the HumanEval dataset. \(R^2\) indicates exponential fit quality: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), Poor (\(R^2 < 0.7\)). Models with \(\lambda = \text {None}\) had insufficient data points for exponential fitting after filtering zero effectiveness values.

Our experimental design systematically evaluates the decay patterns of debugging effectiveness across diverse model architectures, ranging from smaller, specialised models like DeepSeek-Coder 6.7b31 to larger, general-purpose models such as Claude-3-7-sonnet-2025021932, GPT-417, and GPT-3.517. Using normalised debugging effectiveness data from HumanEval7, we extracted model-specific decay constants \(\lambda\). For each model, we calculated \(E_0\), \(\lambda\), \(t_\theta\) for \(\theta \in \{50, 80, 90, 95, 99\}\), and \(R^2\). Additionally, we report \(A_0\) values representing the final accuracy achieved after six attempts without any fresh start interventions (same as PyCapsule4), providing a baseline performance metric for comparison with our strategic restart approaches in RQ3; see Table 2.

Table 1 and Fig. 1 present our comprehensive analysis of debugging decay characteristics across these LLMs. The debugging window calculations reveal distinct performance characteristics across model architectures.

Claude-3.7-Sonnet demonstrated remarkable performance, achieving 100% accuracy (\(A_0 = 100\%\)) essentially within two attempts, which prevented fitting the exponential decay model and resulted in \(\lambda = \text{None}\). This exceptional performance represents a unique case where conventional debugging window calculations may not apply.

Conversely, the Phi-433 model comparison provides particularly revealing insights into the relationship between reasoning capabilities and debugging sustainability. While phi4:14b33 \((E_0=83.537\%, \lambda =0.76)\) significantly outperformed phi4-reasoning:14b33 \((E_0=59.146\%, \lambda =0.60)\) in initial effectiveness by approximately 24%, likely due to phi4-reasoning not being instruction fine-tuned and thus more challenging to parse, the reasoning model demonstrated remarkable debugging improvement capacity. Despite starting from a substantially lower baseline, phi4-reasoning achieved a final accuracy of 81.098% compared to phi4:14b’s 93.293%, representing an improvement of 21.95% versus only 9.75%, respectively. The reasoning model improved more than twice as much as the standard model through iterative debugging. These findings suggest that the decay constant \(\lambda\) captures not only debugging efficiency but also underlying reasoning capabilities and the models’ susceptibility to instructional feedback. Models with lower \(\lambda\) values demonstrate a greater capacity to integrate corrective guidance into their subsequent debugging actions.

The reasoning model’s lower \(\lambda\) value indicates superior debugging sustainability, enabling it to extract more value from iterative refinement processes. This reveals that reasoning-capable models, although potentially harder to prompt initially, possess an enhanced capacity for systematic error correction and solution refinement, a crucial characteristic for extended debugging sessions where sustained improvement matters more than initial performance.

GPT variants exhibit relatively fast effectiveness decay, with gpt-3.5-turbo17 showing the highest decay rate, reaching the 80% threshold by attempts 2–3. In contrast, models like codestral:22b34 and deepseek-coder:6.7b31 demonstrate more sustained debugging capabilities with lower decay rates (\(\lambda =0.375 \text { and } \lambda =0.330\) respectively), extending debugging windows to 5–7 attempts for the same threshold. DDI reveals nuanced debugging characteristics that would be missed by simple effectiveness metrics alone. The case of phi4-reasoning:14b33 exemplifies this.

Strategic fresh start

Table 2 Performance comparison showing baseline accuracy \(A_0\) achieved within six attempts without intervention, versus fresh start strategies implemented at DDI-calculated intervention points where \(\theta \in \{50, 80\}\). \(A_{50}\) and \(A_{80}\) represent final accuracy when fresh starts are triggered at \(t_{50}\) and \(t_{80}\) thresholds respectively. The corresponding intervention timing (\(t_\theta\) values) for each model can be found in Table 1. Bold values indicate performance improvements over the baseline \(A_0\), demonstrating cases where strategic reinitialisation outperforms continued iterative debugging within the same debugging context with no additional token usage.

To evaluate the effectiveness of strategic fresh starts proposed in RQ3, we implemented restart interventions at the calculated strategic thresholds for \(\theta \in \{50, 80\}\) effectiveness degradation. Table 2 presents the comparative performance results, demonstrating the impact of strategic reinitialisation versus continued iterative debugging. The results reveal that strategic fresh starts can significantly improve debugging performance across most models without requiring any additional computational resources. Since fresh starts only involve clearing conversation history at predetermined intervention points while maintaining the same attempt budget, the computational overhead remains equivalent with similar or reduced token usage on average compared to continuous debugging sessions. For example, DeepSeek-Coder-V2-16B reduced token consumption from 108,289 to approximately 89,000 tokens on average, while Codestral-22B maintained usage around 94,000 tokens compared to 97,000 in the continuous sessions.

All six models evaluated showed performance improvements when fresh starts were applied at DDI-calculated intervention points. Notably, llama3.1:8b35 showed the largest improvement, raising its baseline accuracy from 72.56% to 82.82%, while deepseek-coder-v2:16b31 achieved the second largest gain, with its baseline accuracy increasing from 84.1% to 92.1%. Similarly, Mistral:Instruct36 demonstrated consistent gains across both thresholds, improving from 54.3% to 62.8% and 57.3%. This demonstrates that the strategic timing of fresh starts, rather than simply increasing attempt counts, can overcome debugging decay patterns and improve overall effectiveness. Analysis of the normalised debugging effectiveness patterns (Fig. 2) reveals that fresh start interventions successfully break the exponential decay curve observed in RQ1. Rather than following the predicted decay trajectory, models implementing fresh starts at strategic intervention points demonstrate renewed effectiveness spikes, essentially resetting the decay pattern and enabling continued productive debugging. This empirical evidence supports our hypothesis that strategic reinitialisation shifts models from exploitation of failing solution approaches back to exploration of alternative solution spaces.

Fig. 2

Normalised debugging effectiveness trajectories compared to baseline continuous debugging (\(A_0\)) with fresh start strategies implemented at \(\theta \in \{50, 80\}\). The distinctive spikes in \(A_{50} \text { and } A_{80}\) demonstrate successful intervention effects, where fresh starts reset the debugging process and break the monotonic decay pattern observed in baseline approaches. These spikes represent moments where strategic reinitialisation successfully shifts models from failed solution exploitation back to productive exploration, enabling recovery from debugging decay within the same computational budget.

Cross-dataset validation

To evaluate whether DDI characteristics generalise beyond HumanEval, we conducted cross-dataset validation with three representative models: GPT-4-1106-preview (frontier capabilities), GPT-3.5-turbo-1106 (mainstream deployment), and Qwen2.5-coder (open-source, specialised code generation). Due to computational constraints associated with running extensive iterative debugging sessions across multiple datasets, we focused this validation on a limited number of models spanning proprietary and open-source architectures with diverse original performance characteristics. These three models were evaluated across four datasets in total: HumanEval7, HumanEval-ET20, MBPP15, and MBPP-ET20, which vary in difficulty and problem characteristics.

Table 3 Cross-dataset DDI decay (\(\lambda\)) and strategic intervention thresholds (\(t_\theta\)) for \(\theta \in \{50, 80, 90, 95, 99\}\). Values demonstrate model-specific stability patterns across diverse problem distributions.

Table 3 reveals that cross-dataset \(\lambda\) stability correlates strongly with the original R\(^2\) fit quality from Table 1. Qwen2.5-coder, which exhibited an excellent exponential fit on HumanEval, demonstrates remarkable consistency with mean \(\bar{\lambda }_{qwen} = 0.503\), closely matching the original HumanEval value (\(\lambda = 0.462\)) and indicating stable, predictable debugging behaviour regardless of problem characteristics. GPT-3.5-turbo-1106 shows very little variation, with mean \(\bar{\lambda }_{gpt3.5} = 0.718\), maintaining excellent consistency around its original HumanEval value (\(\lambda = 0.755\)). The \(\lambda\) distribution across the four datasets suggests some sensitivity to problem characteristics, though decay patterns remain exponential throughout. In contrast, GPT-4-1106-preview exhibits substantial variation (\(\lambda \in [0.573, 0.743]\), mean = 0.634), diverging significantly from its original HumanEval value (\(\lambda = 0.761\)). This instability aligns with the poor R\(^2\) fit quality observed on HumanEval, indicating that while DDI's R\(^2\) component captures consistent debugging patterns in some models, it fails to generalise across all model behaviours. Critically, even for models with high \(\lambda\) variation such as GPT-4, the derived intervention thresholds remain practically stable: \(t_{80}\) varies by \(\approx 1\) debugging attempt across datasets, which is sufficient precision for configuring production systems where consistent resource allocation is essential for online coding assistance.

These findings empirically validate R\(^2\) as a predictor of cross-dataset reliability: models with excellent fit quality (R\(^2 \ge 0.9\)) maintain consistent \(\lambda\) values regardless of problem distribution, while poor-fit models show dataset-dependent variation. This robustness indicates that DDI provides reliable resource allocation guidance across diverse problem distributions, particularly for models with high R\(^2\) fit quality.
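Assuming per-attempt effectiveness follows \(E(t) = E_0 e^{-\lambda t}\), both the decay rate and a fit-quality score can be recovered from a debugging trace by a log-linear least-squares fit. The sketch below is illustrative and not necessarily the estimation procedure used in the study (for instance, a nonlinear fit in linear space would give slightly different estimates):

```python
import numpy as np


def fit_ddi(effectiveness):
    """Fit E(t) = E0 * exp(-lam * t) to per-attempt effectiveness values
    (attempt t = 0, 1, ...) by least squares in log space.
    Assumes strictly positive effectiveness values.
    Returns (E0, lam, r2), where r2 is goodness of fit in log space."""
    t = np.arange(len(effectiveness), dtype=float)
    y = np.log(np.asarray(effectiveness, dtype=float))
    slope, intercept = np.polyfit(t, y, 1)  # highest-degree coeff first
    lam, E0 = -slope, np.exp(intercept)
    resid = y - (intercept + slope * t)
    r2 = 1.0 - resid.var() / y.var()
    return E0, lam, r2
```

On synthetic data generated with a known \(E_0\) and \(\lambda\), the fit recovers both parameters with \(R^2 \approx 1\); on real traces, a low \(R^2\) flags exactly the kind of non-exponential behaviour observed for GPT-4 above.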

Additionally, a one-way ANOVA confirms that the three models exhibit significantly different decay characteristics: the F statistic of 10.04 exceeds the critical value \(F_{0.01}(2, 9) = 8.02\) at significance level \(\alpha = 0.01\), so we reject the null hypothesis at \(p < 0.01\). Furthermore, the effect size is \(\eta^2 = 0.691\), with model identity explaining 69.1% of the variance in \(\lambda\) values. This large effect size indicates that decay rates are predominantly model-intrinsic properties rather than dataset-dependent artefacts (by Cohen's guidelines, \(\eta^2 > 0.14\) is large). The within-model consistency combined with between-model differences validates DDI's ability to characterise distinct debugging behaviours across models. We note that the HumanEval \(\lambda\) values in Table 3 differ slightly from those reported in Table 1. This variation reflects the stochastic nature of LLM sampling, where identical prompts can yield different code generations and thus different debugging trajectories. Such run-to-run variance is well-documented in code generation evaluation2,3,4 and does not invalidate the core findings.
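Readers wishing to reproduce this style of analysis on their own \(\lambda\) measurements can compute the F statistic and \(\eta^2\) directly from the between- and within-group sums of squares. The sketch below uses synthetic placeholder values, not the study's measurements; with three groups of four values it yields the same degrees of freedom, (2, 9), as the test reported above:

```python
import numpy as np


def one_way_anova(groups):
    """One-way ANOVA over lists of lambda values (one list per model).
    Returns (F, eta_squared)."""
    all_vals = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand = all_vals.mean()
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum()
                    for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    F = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)  # variance explained
    return F, eta_sq


# Synthetic per-dataset lambda values for three hypothetical models.
groups = [[0.46, 0.50, 0.52, 0.53],
          [0.70, 0.72, 0.74, 0.71],
          [0.57, 0.63, 0.74, 0.60]]
```

Equivalent results can be obtained with `scipy.stats.f_oneway`; the manual version is shown only to make the \(\eta^2\) computation explicit, since SciPy's function does not report it.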

Discussion and limitations

Interpreting DDI parameters for model selection

The DDI framework provides practitioners with quantitative guidance for model selection based on task requirements and computational constraints. The interplay between initial effectiveness (\(E_0\)) and decay rate (\(\lambda\)) reveals distinct model characteristics that inform deployment decisions. Models exhibiting rapid decay (\(\lambda > 1.0\)) exhaust their debugging capacity within 1–2 iterations. Such models are best suited for scenarios where computational resources are limited or where quick single-pass generation is prioritised over iterative refinement. The high decay rate suggests that continued debugging attempts yield diminishing returns, making additional iterations computationally inefficient. Conversely, models with low decay rates (\(\lambda < 0.5\)) maintain debugging effectiveness across multiple iterations. Codestral-22B (\(\lambda = 0.34\)) and DeepSeek-Coder-6.7B (\(\lambda = 0.47\)) exemplify this category, sustaining useful debugging capability through five or more attempts. These models are appropriate for complex programming tasks requiring extended refinement cycles, where the problem space is large or initial solutions are unlikely to be correct. The sustained effectiveness indicates that these models can productively utilise additional computational resources through continued debugging. Whilst our study focuses on characterising and quantifying decay patterns rather than establishing causal mechanisms, several hypotheses emerge from our observations that warrant future investigation.
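Under the exponential model, the intervention point for a degradation threshold \(\theta\) has a closed form: solving \(E(t)/E_0 = 1 - \theta/100\) gives \(t_\theta = \ln\!\big(100/(100-\theta)\big)/\lambda\). The helper below is a hypothetical illustration of this relationship (it assumes \(t_\theta\) is defined this way, which the closed form implies but the text does not state explicitly):

```python
import math


def intervention_point(lam, theta):
    """Attempt index at which effectiveness has decayed by `theta` percent
    of its initial value, assuming E(t) = E0 * exp(-lam * t)."""
    if not 0 < theta < 100:
        raise ValueError("theta must be a percentage in (0, 100)")
    return math.log(100.0 / (100.0 - theta)) / lam
```

For \(\theta = 80\), a slow-decay model such as Codestral-22B (\(\lambda = 0.34\)) retains capacity for roughly 4–5 attempts, whereas a rapid-decay model with \(\lambda > 1.0\) reaches the same threshold within about 1–2 attempts, consistent with the categories described above.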

The exponential decay pattern may reflect progressive saturation of the model’s context window. As debugging iterations accumulate, the conversation history grows to include multiple code versions, error messages, and refinement attempts. This expanding context may overwhelm the model’s ability to maintain focus on the original problem specification, leading to degraded performance. Models with lower \(\lambda\) values might possess superior context management capabilities, enabling them to filter relevant information from accumulated debugging history.

An alternative mechanism involves gradual drift from the original instructions coupled with error compounding. Each debugging attempt introduces the risk of new errors whilst attempting to fix existing ones. Models with high \(\lambda\) values may lack the systematic error correction strategies necessary to avoid introducing additional problems during refinement. The phi4-reasoning model’s superior debugging sustainability (\(\lambda = 0.60\)) despite lower initial effectiveness suggests that reasoning-capable models may employ more systematic approaches to debugging that resist this drift.

The fresh start strategy’s effectiveness (RQ3) provides indirect evidence for an exploitation-exploration trade-off. Models may become trapped in local minima within the solution space, repeatedly attempting variations of fundamentally flawed approaches. The exponential decay could represent decreasing probability of escaping these local minima as the model increasingly commits to its initial approach. Fresh starts force re-exploration of the solution space, explaining the performance improvements observed in Table 2.

Generalisation limitations

This work quantifies and characterises debugging effectiveness decay in LLM-based code generation. Whilst we discuss implications for model selection and computational resource allocation, several related questions fall outside the scope of this study. These include how problem characteristics (complexity, domain, structure) influence decay patterns, how different prompting strategies or inference parameters might alter debugging behaviour, and the interaction between DDI-measured capabilities and broader development workflows. While the DDI framework offers a systematic approach to measuring and characterising iterative refinement effectiveness in LLMs, it does not encompass the full spectrum of factors that influence code generation quality or affect practical deployment. Investigating these interactions represents valuable future work.

DDI parameters reflect the interaction between model capabilities and problem characteristics. While the exponential decay phenomenon appears robust across multiple debugging contexts, specific \(\lambda\) and \(t_\theta\) values vary with problem set characteristics. Our cross-dataset validation demonstrates that models with high R\(^2\) values maintain consistent \(\lambda\) across diverse problem distributions, indicating that DDI captures model-intrinsic debugging characteristics rather than dataset-specific artefacts. Comprehensive validation across fundamentally different coding paradigms (e.g., problem solving, web development, systems programming) remains future work.

Additionally, while our fresh start interventions demonstrate performance improvements across almost all evaluated models, the magnitude of these improvements critically depends on the selected effectiveness threshold \(\theta\). Although we observe consistent benefits regardless of threshold selection, selecting \(\theta\) values for maximum performance gains represents a crucial but unexplored aspect of our framework. The systematic selection of strategic intervention thresholds falls outside the scope of this study and represents an important direction for future investigation.

Conclusion

This work introduces the Debugging Decay Index (DDI), a novel evaluation framework that characterises the exponential effectiveness decay patterns inherent in LLM-guided iterative debugging processes. Through a systematic analysis of eighteen language models on HumanEval, we demonstrate that debugging effectiveness typically follows predictable exponential decay trajectories, enabling principled determination of optimal intervention timing rather than relying on arbitrary attempt limits. Our key contributions include: (1) mathematical characterisation of debugging decay patterns across diverse model architectures; (2) the DDI framework, which provides unified assessment of coding and debugging capabilities through initial effectiveness (\(E_0\)), decay rate (\(\lambda\)), strategic intervention timing (\(t_\theta\)), and model fit quality (\(R^2\)); and (3) demonstration that strategic fresh start interventions at DDI-calculated thresholds can break exponential decay patterns and improve final accuracy without incurring additional computational costs. DDI provides practical support for optimising debugging workflows in production and reveals core properties of iterative refinement in LLMs. It also accommodates non-exponential decay functions, such as linear or polynomial forms, making it applicable to a wider range of model behaviours, including settings beyond code generation. Future research directions include developing adaptive threshold selection strategies that respond to problem complexity, comparative analysis of human versus AI debugging patterns to validate theoretical foundations of effectiveness degradation, and integration within comprehensive software engineering workflows, including testing frameworks37,38 and code structure analysis. The mathematical simplicity and interpretability of DDI make it well-suited for interdisciplinary investigation of iterative problem-solving across both artificial and biological intelligence systems.