Introduction

The advent of large language models (LLMs) has transformed automated code generation, enabling developers to produce functional code with remarkable speed and scale1. Recent efforts have shifted toward debugging-based code generation, where LLMs iteratively refine their output based on compiler feedback or error messages, mirroring traditional software development practices2,3,4,5. This iterative approach represents a fundamental departure from single-pass generation, yet the underlying dynamics of debugging-based LLM-guided code generation remain critically underexplored. Existing implementations often apply an arbitrary number of debugging attempts without examining their optimal extent or effectiveness over continuous iterations2,3. This approach can incur significant computational costs and lacks methodological rigour in determining when additional iterations cease to yield meaningful improvements. Preliminary research and our analysis suggest that LLM-guided self-debugging typically follows an exponential decay pattern, where debugging effectiveness diminishes rapidly with successive attempts4. However, no systematic work has been conducted to characterise this decay phenomenon or explore strategies to break these patterns for improved performance. This pattern of diminishing returns in iterative LLM approaches extends beyond code generation, with recent research on reasoning models demonstrating similar complexity-dependent limitations where self-correction capabilities plateau and models either overthink simple problems or fail entirely on complex ones6, suggesting a natural ceiling that warrants systematic investigation.

Furthermore, as debugging-based LLM-guided code generation becomes increasingly prevalent, evaluation metrics must evolve beyond traditional single-pass assessments7,8 to account for the iterative nature of the process. Current evaluation approaches treat code as static artefacts rather than as the product of a dynamic development process, overlooking the significant quality enhancements that often emerge through systematic debugging and refinement9. This limitation becomes increasingly problematic as the field moves toward debugging-based approaches that more closely align with human software development practices10. Single-pass metrics such as pass@k7 measure the probability that at least one correct solution exists among k independently generated candidates. Such metrics fail to account for the iterative debugging process that is central to practical software development workflows10 and rely solely on manually written test cases11.

This study examines the effectiveness of repeated debugging attempts in LLM-based code generation and investigates strategic interventions to enhance the debugging process. To address the limitations of existing evaluation metrics, we propose a novel evaluation framework: the Debugging Decay Index (DDI). The DDI metric provides a unified assessment of LLM coding proficiency by modelling the exponential effectiveness decay observed in iterative debugging processes. Our framework computes strategic intervention timing \(t_\theta\) based on configurable effectiveness decay thresholds \(\theta\), returning a comprehensive evaluation tuple \((E_0, \lambda , t_\theta , R^2)\) that captures initial performance, decay sustainability, strategic stopping points, and model fit quality. This multi-dimensional approach enables distinctive evaluation across different aspects of the code generation and debugging pipeline. Our investigation addresses the following research questions:

  • RQ1 (Debugging Window): How many debugging attempts maximise the effectiveness of LLM-generated code before further iterations yield diminishing returns, and how do these attempt windows vary across different model architectures and problem characteristics?

  • RQ2 (DDI): How can we develop a unified evaluation metric that comprehensively assesses LLM code generation and debugging capabilities, quantifying initial performance, sustained effectiveness, and iterative refinement capability encompassing both reasoning proficiency and instruction-following competency across diverse model architectures?

  • RQ3 (Strategic Fresh Starts): Based on the optimal debugging windows identified in RQ1 and the decay characteristics quantified in RQ2, to what extent can implementing fresh start strategies after reaching effectiveness thresholds improve overall accuracy compared to continued iterative refinement within the same generation context?

Literature review

Evaluation metric

Code-generating LLMs are typically evaluated based on functional correctness or whether the generated code effectively solves the given task. In this paradigm, the pass@k metric7 has become a standard measure. Pass@k is the probability that at least one of k independently generated solutions to a problem passes all unit tests. Pass@k can be written as:

$$\begin{aligned} pass@k = 1 - \mathbb {P}(\textit{all incorrect}) \end{aligned}$$

The unbiased7,8 estimation formula is:

$$\begin{aligned} pass@k = 1 - \frac{\left( {\begin{array}{c}n-c\\ k\end{array}}\right) }{\left( {\begin{array}{c}n\\ k\end{array}}\right) } \end{aligned}$$

Where n is the total number of samples generated, \(n\ge k\), and c of them pass. One can draw \(n\ge k\) samples and count the number of solutions c that pass8. Numerous subsequent works on LLM-guided code generation have used pass@k. For example, CodeT12 and Top Pass13 evaluated various models on standard benchmarks using the pass@k metric. In MBR-EXEC14, the authors measured pass@k on HumanEval7 and Mostly Basic Python Programming (MBPP)15 to compare instruction tuning. Code generation benchmark leaderboards and evaluations of programming-focused large language models consistently report pass@k metrics (typically k=1, 5, 10, and occasionally up to k=100) as a standard method for model comparison16,17,18,19,20. The elegance of this metric lies in its simplicity and direct correlation with functionality; a model that can generate at least one correct solution within k attempts demonstrates meaningful capability in code generation tasks. Importantly, pass@k is a binary, functional metric; it only cares whether any generated solution is entirely correct.
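The estimator above is often implemented by expanding the binomial ratio as a product, which keeps intermediate values small instead of computing large binomial coefficients. A minimal sketch, not tied to any particular benchmark harness:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that pass all tests, k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    # Expand the binomial ratio as a product of k small factors.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 4 samples of which c = 2 pass, pass@2 is \(1 - \binom{2}{2}/\binom{4}{2} = 5/6\).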

Building upon this foundation, researchers have conducted thorough investigations into the pass@k metric’s characteristics, examining its sensitivity to both the sample size (k) and the inherent difficulty of programming problems16,20,21,22. A critical limitation identified is the metric’s sole reliance on provided test suites, which may not comprehensively verify all aspects of code correctness or efficiency21. This concern was empirically validated when researchers augmented the standard HumanEval benchmark with more rigorous test cases (creating HumanEval-ET), resulting in a significant performance drop of approximately 20–30% across various models20. A more fundamental concern relates to how optimising for pass@k can distort model behaviour and evaluation priorities. Top Pass13 introduced a ranking model that directly optimises for this metric, revealing a key limitation: pass@k rewards getting one solution correct over producing multiple near-correct solutions. This approach fails to reward quick convergence and may allow models to game the metric by generating variants of the same algorithm rather than exploring diverse approaches. Complementary work revealed that 42% of code generations failing unit tests were still rated valuable by programmers and proposed a hybrid metric23 combining functional correctness with syntactic similarity, which achieved a 14% stronger correlation with programmer-perceived value. These findings suggest that evaluation metrics should consider not only binary correctness but also how effectively code can be refined through debugging. In response to these limitations, researchers have proposed several variations of pass@k. The count@k metric24 counts how many of k attempts are correct, while AlphaCode introduced n@k16, which generalises pass@k to measure exactly n correct solutions out of k attempts.
Addressing the need to recognise partially correct solutions, the \(pass-ratio@n\) metric25 averages the squared test-pass ratio across n generated code samples. This approach gives partial credit to nearly-correct solutions, addressing the granularity that pass@k lacks.
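Based on the description above, \(pass-ratio@n\) averages the squared per-sample test-pass ratio; a minimal sketch of that computation (the function name and input format are illustrative):

```python
def pass_ratio_at_n(test_pass_ratios: list[float]) -> float:
    """pass-ratio@n: mean of squared test-pass ratios over n samples.

    test_pass_ratios[i] is the fraction of unit tests passed by sample i.
    Squaring weights nearly-correct solutions much more than half-correct ones.
    """
    n = len(test_pass_ratios)
    return sum(r * r for r in test_pass_ratios) / n
```

A sample passing all tests contributes 1.0 to the average, while one passing half of them contributes only 0.25, which is how the metric grants partial yet discounted credit.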

While these functionality-based metrics dominate code generation evaluation, many researchers still report non-functional metrics such as BLEU26, CodeBLEU27, or ROUGE28 to measure syntactic similarity. These metrics are not replacements for pass@k but often accompany it to gauge quality aspects beyond functional correctness. While a few orthogonal approaches exist, they all fail to capture the iterative nature of code development and the debugging capabilities of LLMs.

Our proposed Debugging Decay Index (DDI) addresses this gap by focusing on the iterative path to functional correctness rather than arbitrary sampling. Unlike traditional metrics, DDI measures how effectively models leverage iterative debugging feedback to improve a solution until it achieves functional correctness. This approach acknowledges that real-world programming rarely involves generating multiple independent attempts; instead, developers iteratively refine their code through debugging cycles. By quantifying the efficiency of this debugging process, DDI provides a reliable evaluation of how models would perform in practical software development contexts, where strategic iteration, rather than random sampling, is the path to functional code.

Debugging

Researchers have explored dynamic approaches to incorporate execution feedback and debugging capabilities in LLM-guided code generation. Recent work29 investigated debugging in two distinct contexts: in-context debugging, which involves inspecting intermediate execution states, and post-context debugging, which focuses on analysing error results after complete execution. Building on this foundation, the SELF-DEBUGGING framework5 demonstrated how LLMs can analyse execution results and explain their own generated code line by line, mirroring approaches developed initially for human developers30. The framework allowed for a maximum of 10 debugging attempts, but the researchers observed that successful debugging typically concluded within just three iterations. By comparison, MapCoder3 implemented a more extensive debugging protocol, allowing up to 25 attempts, but limiting them to a maximum of 5 attempts per individual plan. The authors reported that while increased debugging iterations generally improved performance, this relationship was not strictly linear across all datasets. Notably, their results for HumanEval-ET did not follow the expected proportional improvement trend, indicating potential dataset-specific considerations in debugging efficacy. Similarly, the Large Language Model Debugger (LDB)2 employed 10 debugging attempts in their standard configuration, with additional experiments using up to 20 attempts on the HumanEval dataset. Their findings revealed a continuous but diminishing improvement trend, with gains becoming increasingly marginal after the fifth attempt. The subsequent 15 attempts collectively yielded only 2.4% additional improvement. PyCapsule4 implemented a more streamlined approach compared to MapCoder while still achieving state-of-the-art (SOTA) performance across several benchmark datasets. 
The framework employed five debugging attempts beyond the initial solution and fitted the resulting normalised debugging effectiveness to an exponential decay function, revealing that effectiveness usually diminishes dramatically after the third attempt and follows an exponential decay pattern. Their analysis further demonstrated that debugging effectiveness varies significantly across model architectures: OpenAI’s GPT-417 exhibited complete loss of debugging effectiveness (relative to the first attempt) by the third iteration, while GPT-3.517 showed similar exhaustion by the fourth attempt. In contrast, Qwen2.5-coder-instruct18 maintained some debugging capability until the fifth attempt, suggesting model-specific patterns in debugging performance decay. These findings highlight a critical research gap: the need for a standardised approach to quantify and optimise debugging capability for LLM code generation.

Empirical evidence across debugging frameworks reveals consistent diminishing returns, though the specific decay characteristics vary systematically across model architectures, suggesting model-specific debugging signatures that remain unexplored as evaluation criteria. Existing approaches treat these decay patterns as inevitable limitations rather than quantifiable characteristics of the model. This systematic variation in debugging persistence presents an opportunity to develop methodologies that both measure debugging capability through decay modelling (RQ1, RQ2) and identify possible optimal intervention strategies when effectiveness diminishes beyond an acceptable threshold (RQ3).

Methodology

RQ1: debugging window

We introduce the concept of a “debugging window” in the context of LLMs for code generation: the number of debugging attempts beyond which further iterations yield diminishing returns. While diminishing effectiveness will always occur with continued debugging efforts, establishing this window allows us to determine a practical cutoff point that balances debugging effectiveness with computational efficiency. To model the effectiveness of each debugging attempt over time, this study employs the exponential decay function (Equation 1), defined as follows:

$$\begin{aligned} E(t) = E_0 e^{-\lambda t} \end{aligned}$$
(1)

In this study, E(t) represents the effectiveness of debugging at attempt t, while \(E_0\) denotes the initial effectiveness corresponding to the very first attempt. The decay constant \(\lambda\) represents the rate of effectiveness loss over successive attempts and serves as our primary metric for characterising iterative debugging capability. Models with lower \(\lambda\) values maintain their effectiveness longer across debugging iterations, and t represents the discrete number of debugging attempts, allowing us to model the temporal progression of debugging effectiveness. To further analyse the decay process, we examine the half-life \(t_{1/2}\), which represents the number of debugging attempts after which the effectiveness reduces to half its initial value \(E_0\). By definition and from Equation 1, we get:

$$\begin{aligned} E(t_{1/2}) = \frac{1}{2} E_0 \implies t_{1/2} = \frac{\ln (2)}{\lambda } \end{aligned}$$
(2)

We can generalise Equation 2 to determine the number of debugging attempts required for any given decay percentage. For a decay threshold where effectiveness can lose up to \(\theta \%\) of its initial value (meaning \((100-\theta )\%\) effectiveness remains), the number of debugging attempts \(t_\theta\) is given by:

$$\begin{aligned} t_\theta = \frac{\ln \left( \frac{100}{100-\theta }\right) }{\lambda } \end{aligned}$$
(3)

This generalised formula enables us to calculate the debugging window for any threshold \(\theta\), providing the flexibility to determine when diminishing effectiveness justifies terminating the debugging process based on specific computational constraints.
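Equation 3 (and its half-life special case, Equation 2) translates directly into code; the result is rounded up because attempts are discrete. A minimal sketch:

```python
import math

def debugging_window(lam: float, theta: float) -> int:
    """Debugging window t_theta for decay rate lam and loss threshold theta (%).

    Solves E_0 * exp(-lam * t) = E_0 * (100 - theta) / 100 for t (Equation 3),
    then rounds up because debugging attempts are discrete integers.
    """
    t_continuous = math.log(100.0 / (100.0 - theta)) / lam
    return math.ceil(t_continuous)

# theta = 50 recovers the half-life of Equation 2: ln(2) / lambda.
```

For instance, a model with \(\lambda = 0.330\) and \(\theta = 80\) gets \(\lceil \ln(5)/0.330 \rceil = 5\) attempts before losing 80% of its initial effectiveness.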

RQ2: The Debugging Decay Index (DDI)

Our proposed DDI integrates our exponential decay analysis from RQ1 to create a comprehensive evaluation framework for LLM debugging capabilities. Unlike traditional metrics that focus solely on final outcomes, DDI captures the efficiency and capability of the debugging process and the final accuracy.

Framework implementation

The DDI is formulated as a function

$$\begin{aligned} DDI(data, \theta ) \rightarrow (E_0, \lambda , t_\theta , R^2) \end{aligned}$$

that accepts data, the normalised debugging effectiveness measurements across multiple iterative attempts; and \(\theta\), the effectiveness decay threshold(s) representing the maximum acceptable performance degradation. Following the PyCapsule4 framework, the normalised debugging effectiveness data represents the independent influence of each debugging attempt. The DDI framework identifies strategic intervention points \(t_\theta\) where debugging effectiveness would degrade by \(\theta \%\) from the initial value. In RQ3, we leverage these DDI-calculated intervention points to evaluate whether implementing fresh start strategies at the predicted timing can improve overall accuracy compared to continued iterative refinement within the same generation context. Fresh starts involve reinitiating the debugging process with the original problem statement only. DDI returns a four-element tuple:

  • \(E_0\) (Initial Effectiveness): \(E_0\) represents the initial effectiveness, calculated as \(E_0 = N_{solved\_at\_attempt\_0} / N_{total}\). This metric is directly comparable to pass@1 and represents the model’s inherent code generation capability before any debugging.

  • \(\lambda\) (Decay Rate): The decay constant extracted from fitting the exponential decay function (Equation 1) to normalised debugging effectiveness data. A lower \(\lambda\) indicates slower decay in effectiveness and more persistent debugging behaviour, reflecting sustained instruction following and reasoning consistency across iterations.

  • \(t_\theta\) (Optimal Intervention Points): \(t_\theta\) represents the maximum number of debugging attempts before effectiveness drops by \(\theta \%\) from the initial value. This represents the strategic intervention threshold corresponding to the \(\theta\) value, calculated using Equation 3. Since debugging attempts must be discrete integers, we apply the ceiling function to convert the continuous mathematical solutions into practical stopping points. This ensures that the debugging window provides sufficient attempts to reach at least the specified effectiveness threshold.

  • \(R^2\) (Fit Quality): The coefficient of determination measuring how well the exponential decay model explains the observed debugging effectiveness patterns. We interpret the results using the following categories: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), or Poor (\(R^2 < 0.7\)). High \(R^2\) values indicate predictable exponential decay behaviour, while low values suggest erratic or non-exponential debugging patterns that may require alternative evaluation approaches.

Evaluation process and interpretation

The DDI evaluation proceeds through four core steps: initial assessment records \(E_0 = N_{solved\_at\_0} / N_{total}\); iterative debugging tracks effectiveness at each attempt; decay analysis fits Equation 1 using nonlinear least squares regression to extract \(\lambda\), setting \(\lambda = \text {None}\) when insufficient data points (\(n < 3\)) exist; and threshold calculation determines strategic intervention timing \(t_\theta\) using Equation 3.
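These four steps can be sketched compactly. Note that the paper fits Equation 1 with nonlinear least squares; the stand-in below uses a simpler log-linear fit after filtering zero effectiveness values, and the function name `fit_ddi` is illustrative:

```python
import math

def fit_ddi(effectiveness: list[float], theta: float = 80.0):
    """Sketch of DDI(data, theta) -> (E_0, lambda, t_theta, R^2).

    effectiveness[i] is the normalised debugging effectiveness at attempt i,
    so effectiveness[0] plays the role of E_0. Returns lambda = None when
    fewer than 3 usable points remain after filtering zeros.
    """
    e0 = effectiveness[0]
    pts = [(t, e) for t, e in enumerate(effectiveness) if e > 0]
    if len(pts) < 3:
        return e0, None, None, None
    ts = [t for t, _ in pts]
    ys = [math.log(e) for _, e in pts]           # log-linearise E(t) = E0 e^{-lam t}
    n = len(pts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / \
            sum((t - t_bar) ** 2 for t in ts)
    lam = -slope
    # R^2 of the fitted exponential against the observed effectiveness values.
    e0_fit = math.exp(y_bar + lam * t_bar)
    obs = [e for _, e in pts]
    preds = [e0_fit * math.exp(-lam * t) for t in ts]
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, preds))
    ss_tot = sum((o - sum(obs) / n) ** 2 for o in obs)
    r2 = 1.0 - ss_res / ss_tot
    # Equation 3, rounded up to a discrete stopping point.
    t_theta = math.ceil(math.log(100.0 / (100.0 - theta)) / lam)
    return e0, lam, t_theta, r2
```

On synthetic data generated from a perfect exponential decay, the sketch recovers the decay constant and reports \(R^2 \approx 1\).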

The DDI outputs provide comprehensive model characterisation of code generation and debugging capabilities, requiring interpretation of both effectiveness metrics and fit quality. For models with high \(R^2\) values (\(\ge 0.7\)), the combination of \(E_0\) and \(\lambda\) reveals distinct model archetypes: high \(E_0\) and low \(\lambda\) indicates both strong reasoning and persistent debugging (ideal), low \(E_0\) and low \(\lambda\) suggests consistent but ineffective approaches, high \(E_0\) and high \(\lambda\) indicates strong initial reasoning but poor debugging persistence, while low \(E_0\) and high \(\lambda\) represents both weak reasoning and rapid debugging degradation. However, for models with poor fit quality (\(R^2 < 0.7\)), the exponential decay assumption may not apply, indicating that a different mathematical function may be required to fully characterise the model behaviour. In such cases, evaluation should rely primarily on \(E_0\) when using DDI. Pseudocode for DDI is provided in Appendix: DDI Pseudocode.

RQ3: strategic fresh starts

To investigate whether strategic interventions can mitigate the debugging decay phenomenon identified in RQ2, we implement fresh start strategies at DDI-calculated strategic intervention points. A fresh start completely clears conversation history and begins anew with only the original problem statement. This mechanism addresses the rapid degradation of effectiveness observed in the exponential decay pattern, particularly when models become trapped in the low-effectiveness tail, where continued debugging attempts yield negligible improvement. The fresh start strategy operates on the hypothesis that reinitialising the generation process shifts the model from exploiting failing solution approaches back to exploring alternative solution spaces. Based on empirical evidence from RQ2, we observe varied suitable intervention points \(t_\theta\) across different models, as demonstrated in Table 1. Given the variance in decay patterns, we strategically implement fresh starts at DDI-calculated intervention thresholds, enabling each model to benefit from reinitialisation at its optimal timing. To ensure a fair comparison with existing approaches, we maintain the same total attempt budget as previous works3,4, consisting of six attempts (initial generation plus five debugging iterations). Our approach strategically allocates these attempts while triggering fresh starts at DDI-calculated intervention points, testing whether strategic reinitialisation can overcome debugging decay while maintaining strict comparability with baseline methods.
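The budgeted restart loop can be sketched as follows; `generate()` and `debug()` are placeholders for the model calls (assumptions for illustration, not the framework’s actual API):

```python
def debug_with_fresh_starts(generate, debug, t_theta: int, budget: int = 6):
    """Sketch of the RQ3 loop: clear the conversation history after t_theta
    failed debugging attempts, within a fixed total attempt budget.

    generate() -> (solution, solved) starts from the problem statement only;
    debug(history) -> (solution, solved) refines using accumulated feedback.
    """
    history = []
    attempts_in_context = 0  # attempts since the last fresh start
    for attempt in range(budget):
        if attempts_in_context == 0:
            solution, solved = generate()      # fresh start: problem only
        else:
            solution, solved = debug(history)  # feedback-driven refinement
        if solved:
            return solution, attempt
        history.append(solution)
        attempts_in_context += 1
        if attempts_in_context > t_theta:      # DDI-calculated threshold hit
            history = []                       # clear conversation history
            attempts_in_context = 0
    return None, budget
```

With \(t_\theta = 1\), for example, the loop alternates one fresh generation with one debugging attempt, spending the same six-attempt budget as continuous debugging.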

Evaluation and experimental setup

To address our research questions regarding debugging windows (RQ1) and the DDI framework (RQ2), we initially applied our methodology to eighteen language models using the HumanEval7 dataset. HumanEval’s 164 function-level Python problems with well-defined test cases provide a controlled environment for isolating debugging effectiveness patterns across diverse model architectures. While this single-dataset analysis establishes the prevalence of exponential decay patterns, we subsequently conducted cross-dataset validation to verify that DDI characteristics generalise beyond HumanEval.

Fig. 1

Exponential decay curves fitted to debugging effectiveness data for four language models. The grey dashed lines indicate effectiveness thresholds at different \(\theta\) values. The \(\lambda\) (decay rate) and \(R^2\) (goodness-of-fit) values are displayed for each model.

Debugging protocol

Our evaluation employs an iterative debugging protocol using the self-correction framework from PyCapsule4. The debugging process is briefly described as follows:

Initial Generation (Attempt 0): Each model receives a problem specification and generates an initial solution. This code is executed against test cases, establishing the baseline effectiveness \(E_0\).

Iterative Refinement (Attempts 1–5): For unsuccessful solutions, the model receives structured feedback containing: the original problem specification, previously generated code, execution results (error messages, stack traces, or failed test case outputs) and an explicit instruction to debug and correct the code. The model then generates a revised solution, which is again executed against the same test cases. This process continues for a maximum of five debugging attempts beyond the initial generation, totalling six attempts per problem.

Effectiveness Measurement: Following PyCapsule’s normalisation approach, we measure the independent contribution of each debugging attempt. Given N total problems, let \(S_0\) denote problems solved at attempt 0, leaving \(N_1 = N - S_0\) unsolved. At each subsequent attempt \(i \ge 1\), \(S_i\) additional problems are solved from the remaining \(N_i = N - \sum _{j=0}^{i-1} S_j\) unsolved problems. The normalised effectiveness at attempt i is \(I_i = \frac{S_i}{N_i}\).

This normalisation isolates the independent debugging contribution at each attempt, removing the cumulative effect of previous successes. The DDI framework models the exponential decay of these normalised effectiveness values \(I_i\) across attempts.
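The normalisation can be sketched as follows (the function name is illustrative):

```python
def normalised_effectiveness(solved_per_attempt: list[int], n_total: int) -> list[float]:
    """I_i = S_i / N_i: independent contribution of each debugging attempt.

    solved_per_attempt[i] = S_i, problems newly solved at attempt i;
    the unsolved pool N_i shrinks by the successes of all earlier attempts.
    """
    remaining = n_total
    effectiveness = []
    for s_i in solved_per_attempt:
        effectiveness.append(s_i / remaining if remaining > 0 else 0.0)
        remaining -= s_i
    return effectiveness
```

For example, 100 problems with 50 solved at attempt 0, then 25 and 10 on the next two attempts, gives \(I = [0.5, 0.5, 0.4]\): the second debugging attempt solved a smaller share of its remaining pool even though it solved fewer problems in absolute terms.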

Evaluation

Table 1 DDI Results for Different Models for \(\theta \in \{50, 80, 90, 95, 99\}\) on the HumanEval dataset. \(R^2\) indicates exponential fit quality: Excellent (\(R^2 \ge 0.9\)), Good (\(0.7 \le R^2 < 0.9\)), Poor (\(R^2 < 0.7\)). Models with \(\lambda = \text {None}\) had insufficient data points for exponential fitting after filtering zero effectiveness values.

Our experimental design systematically evaluates the decay patterns of debugging effectiveness across diverse model architectures, ranging from smaller, specialised models like DeepSeek-Coder 6.7b31 to larger, general-purpose models such as Claude-3-7-sonnet-2025021932, GPT-417, and GPT-3.517. Using normalised debugging effectiveness data from HumanEval7, we extracted model-specific decay constants \(\lambda\). For each model, we calculated \(E_0\), \(\lambda\), \(t_\theta\) for \(\theta \in \{50, 80, 90, 95, 99\}\), and \(R^2\). Additionally, we report \(A_0\) values representing the final accuracy achieved after six attempts without any fresh start interventions (same as PyCapsule4), providing a baseline performance metric for comparison with our strategic restart approaches in RQ3; see Table 2.

Table 1 and Fig. 1 present our comprehensive analysis of debugging decay characteristics across these LLMs. The debugging window calculations reveal distinct performance characteristics across model architectures.

Claude-3.7-Sonnet demonstrated remarkable performance, achieving 100% accuracy (\(A_0 = 100\%\)) essentially within two attempts, which prevented fitting the exponential decay model and resulted in \(\lambda = \text{None}\). This exceptional performance represents a unique case where conventional debugging window calculations may not apply.

Conversely, the Phi-433 model comparison provides particularly revealing insights into the relationship between reasoning capabilities and debugging sustainability. While phi4:14b33 \((E_0=83.537\%, \lambda =0.76)\) significantly outperformed phi4-reasoning:14b33 \((E_0=59.146\%, \lambda =0.60)\) in initial effectiveness by approximately 24%, likely due to phi4-reasoning not being instruction fine-tuned and thus more challenging to parse, the reasoning model demonstrated remarkable debugging improvement capacity. Despite starting from a substantially lower baseline, phi4-reasoning achieved a final accuracy of 81.098% compared to phi4:14b’s 93.293%, representing an improvement of 21.95% versus only 9.75%, respectively. The reasoning model improved more than twice as much as the standard model through iterative debugging. These findings suggest that the decay constant \(\lambda\) captures not only debugging efficiency but also underlying reasoning capabilities and the models’ susceptibility to instructional feedback. Models with lower \(\lambda\) values demonstrate a greater capacity to integrate corrective guidance into their subsequent debugging actions.

The reasoning model’s lower \(\lambda\) value indicates superior debugging sustainability, enabling it to extract more value from iterative refinement processes. This reveals that reasoning-capable models, although potentially harder to prompt initially, possess an enhanced capacity for systematic error correction and solution refinement, a crucial characteristic for extended debugging sessions where sustained improvement matters more than initial performance.

GPT variants exhibit relatively fast effectiveness decay, with gpt-3.5-turbo17 showing the highest decay rate, reaching the 80% threshold by attempts 2–3. In contrast, models like codestral:22b34 and deepseek-coder:6.7b31 demonstrate more sustained debugging capabilities with lower decay rates (\(\lambda =0.375 \text { and } \lambda =0.330\) respectively), extending debugging windows to 5–7 attempts for the same threshold. DDI reveals nuanced debugging characteristics that would be missed by simple effectiveness metrics alone. The case of phi4-reasoning:14b33 exemplifies this.

Strategic fresh start

Table 2 Performance comparison showing baseline accuracy \(A_0\) achieved within six attempts without intervention, versus fresh start strategies implemented at DDI-calculated intervention points where \(\theta \in \{50, 80\}\). \(A_{50}\) and \(A_{80}\) represent final accuracy when fresh starts are triggered at \(t_{50}\) and \(t_{80}\) thresholds respectively. The corresponding intervention timing (\(t_\theta\) values) for each model can be found in Table 1. Bold values indicate performance improvements over the baseline \(A_0\), demonstrating cases where strategic reinitialisation outperforms continued iterative debugging within the same debugging context with no additional token usage.

To evaluate the effectiveness of strategic fresh starts proposed in RQ3, we implemented restart interventions at the calculated strategic thresholds for \(\theta \in \{50, 80\}\) effectiveness degradation. Table 2 presents the comparative performance results, demonstrating the impact of strategic reinitialisation versus continued iterative debugging. The results reveal that strategic fresh starts can significantly improve debugging performance across most models without requiring any additional computational resources. Since fresh starts only involve clearing conversation history at predetermined intervention points while maintaining the same attempt budget, the computational overhead remains equivalent with similar or reduced token usage on average compared to continuous debugging sessions. For example, DeepSeek-Coder-V2-16B reduced token consumption from 108,289 to approximately 89,000 tokens on average, while Codestral-22B maintained usage around 94,000 tokens compared to 97,000 in the continuous sessions.

All six models evaluated showed performance improvements when fresh starts were applied at DDI-calculated intervention points. Notably, llama3.1:8b35 showed the largest improvement, raising its baseline accuracy from 72.56% to 82.82%, while deepseek-coder-v2:16b31 achieved the second largest gain, with its baseline accuracy increasing from 84.1% to 92.1%. Similarly, Mistral:Instruct36 demonstrated consistent gains across both thresholds, improving from 54.3% to 62.8% and 57.3%. This demonstrates that the strategic timing of fresh starts, rather than simply increasing attempt counts, can overcome debugging decay patterns and improve overall effectiveness. Analysis of the normalised debugging effectiveness patterns (Fig. 2) reveals that fresh start interventions successfully break the exponential decay curve observed in RQ1. Rather than following the predicted decay trajectory, models implementing fresh starts at strategic intervention points demonstrate renewed effectiveness spikes, essentially resetting the decay pattern and enabling continued productive debugging. This empirical evidence supports our hypothesis that strategic reinitialisation shifts models from exploitation of failing solution approaches back to exploration of alternative solution spaces.

Fig. 2

Normalised debugging effectiveness trajectories compared to baseline continuous debugging (\(A_0\)) with fresh start strategies implemented at \(\theta \in \{50, 80\}\). The distinctive spikes in \(A_{50} \text { and } A_{80}\) demonstrate successful intervention effects, where fresh starts reset the debugging process and break the monotonic decay pattern observed in baseline approaches. These spikes represent moments where strategic reinitialisation successfully shifts models from failed solution exploitation back to productive exploration, enabling recovery from debugging decay within the same computational budget.

Cross-dataset validation

To evaluate whether DDI characteristics generalise beyond HumanEval, we conducted cross-dataset validation with three representative models: GPT-4-1106-preview (frontier capabilities), GPT-3.5-turbo-1106 (mainstream deployment), and Qwen2.5-coder (open-source, specialised code generation). Due to computational constraints associated with running extensive iterative debugging sessions across multiple datasets, we focused this validation on a limited number of models spanning proprietary and open-source architectures with diverse original performance characteristics. These three models were evaluated across four datasets in total: HumanEval7, HumanEval-ET20, MBPP15, and MBPP-ET20, which vary in difficulty and problem characteristics.

Table 3 Cross-dataset DDI decay (\(\lambda\)) and strategic intervention thresholds (\(t_\theta\)) for \(\theta \in \{50, 80, 90, 95, 99\}\). Values demonstrate model-specific stability patterns across diverse problem distributions.

Table 3 reveals that cross-dataset \(\lambda\) stability correlates strongly with the original R\(^2\) fit quality from Table 1. Qwen2.5-coder, which exhibited an excellent exponential fit on HumanEval, demonstrates remarkable consistency with mean \(\bar{\lambda }_{qwen} = 0.503\), closely matching the original HumanEval value (\(\lambda = 0.462\)) and indicating stable, predictable debugging behaviour regardless of problem characteristics. GPT-3.5-turbo-1106 shows very little variation, with mean \(\bar{\lambda }_{gpt3.5} = 0.718\), maintaining excellent consistency around its original HumanEval value (\(\lambda = 0.755\)). The \(\lambda\) distribution across the four datasets suggests some sensitivity to problem characteristics, though decay patterns remain exponential throughout. In contrast, GPT-4-1106-preview exhibits substantial variation (\(\lambda \in [0.573, 0.743]\), mean = 0.634), diverging significantly from its original HumanEval value (\(\lambda = 0.761\)). This instability aligns with the poor R\(^2\) fit quality observed on HumanEval, indicating that while DDI's R\(^2\) component captures consistent debugging patterns in some models, it fails to generalise across all model behaviours. Critically, even for models with high \(\lambda\) variation such as GPT-4, the derived intervention thresholds remain practically stable: \(t_{80}\) varies by \(\approx 1\) debugging attempt across datasets, which is sufficient precision for configuring production systems where consistent resource allocation is essential for online coding assistance.

These findings empirically validate R\(^2\) as a predictor of cross-dataset reliability: models with excellent fit quality (R\(^2 \ge 0.9\)) maintain consistent \(\lambda\) values regardless of problem distribution, while poor-fit models show dataset-dependent variation. This robustness indicates that DDI provides reliable resource allocation guidance across diverse problem distributions, particularly for models with high R\(^2\) fit quality.
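Assuming per-attempt effectiveness follows \(E(t) = E_0 e^{-\lambda t}\), both the decay rate and a fit-quality score can be recovered from a debugging trace by a log-linear least-squares fit. The sketch below is illustrative and not necessarily the estimation procedure used in the study (for instance, a nonlinear fit in linear space would give slightly different estimates):

```python
import numpy as np


def fit_ddi(effectiveness):
    """Fit E(t) = E0 * exp(-lam * t) to per-attempt effectiveness values
    (attempt t = 0, 1, ...) by least squares in log space.
    Assumes strictly positive effectiveness values.
    Returns (E0, lam, r2), where r2 is goodness of fit in log space."""
    t = np.arange(len(effectiveness), dtype=float)
    y = np.log(np.asarray(effectiveness, dtype=float))
    slope, intercept = np.polyfit(t, y, 1)  # highest-degree coeff first
    lam, E0 = -slope, np.exp(intercept)
    resid = y - (intercept + slope * t)
    r2 = 1.0 - resid.var() / y.var()
    return E0, lam, r2
```

On synthetic data generated with a known \(E_0\) and \(\lambda\), the fit recovers both parameters with \(R^2 \approx 1\); on real traces, a low \(R^2\) flags exactly the kind of non-exponential behaviour observed for GPT-4 above.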

Additionally, a one-way ANOVA confirms that the three models exhibit significantly different decay characteristics: the F statistic of 10.04 exceeds the critical value \(F_{0.01}(2, 9) = 8.02\) at significance level \(\alpha = 0.01\), so we reject the null hypothesis at \(p < 0.01\). Furthermore, the effect size is \(\eta^2 = 0.691\), with model identity explaining 69.1% of the variance in \(\lambda\) values. This large effect size indicates that decay rates are predominantly model-intrinsic properties rather than dataset-dependent artefacts (by Cohen's guidelines, \(\eta^2 > 0.14\) is large). The within-model consistency combined with between-model differences validates DDI's ability to characterise distinct debugging behaviours across models. We note that the HumanEval \(\lambda\) values in Table 3 differ slightly from those reported in Table 1. This variation reflects the stochastic nature of LLM sampling, where identical prompts can yield different code generations and thus different debugging trajectories. Such run-to-run variance is well-documented in code generation evaluation2,3,4 and does not invalidate the core findings.
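Readers wishing to reproduce this style of analysis on their own \(\lambda\) measurements can compute the F statistic and \(\eta^2\) directly from the between- and within-group sums of squares. The sketch below uses synthetic placeholder values, not the study's measurements; with three groups of four values it yields the same degrees of freedom, (2, 9), as the test reported above:

```python
import numpy as np


def one_way_anova(groups):
    """One-way ANOVA over lists of lambda values (one list per model).
    Returns (F, eta_squared)."""
    all_vals = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand = all_vals.mean()
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum()
                    for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    F = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)  # variance explained
    return F, eta_sq


# Synthetic per-dataset lambda values for three hypothetical models.
groups = [[0.46, 0.50, 0.52, 0.53],
          [0.70, 0.72, 0.74, 0.71],
          [0.57, 0.63, 0.74, 0.60]]
```

Equivalent results can be obtained with `scipy.stats.f_oneway`; the manual version is shown only to make the \(\eta^2\) computation explicit, since SciPy's function does not report it.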

Discussion and limitations

Interpreting DDI parameters for model selection

The DDI framework provides practitioners with quantitative guidance for model selection based on task requirements and computational constraints. The interplay between initial effectiveness (\(E_0\)) and decay rate (\(\lambda\)) reveals distinct model characteristics that inform deployment decisions. Models exhibiting rapid decay (\(\lambda > 1.0\)) exhaust their debugging capacity within 1–2 iterations. Such models are best suited for scenarios where computational resources are limited or where quick single-pass generation is prioritised over iterative refinement. The high decay rate suggests that continued debugging attempts yield diminishing returns, making additional iterations computationally inefficient. Conversely, models with low decay rates (\(\lambda < 0.5\)) maintain debugging effectiveness across multiple iterations. Codestral-22B (\(\lambda = 0.34\)) and DeepSeek-Coder-6.7B (\(\lambda = 0.47\)) exemplify this category, sustaining useful debugging capability through five or more attempts. These models are appropriate for complex programming tasks requiring extended refinement cycles, where the problem space is large or initial solutions are unlikely to be correct. The sustained effectiveness indicates that these models can productively utilise additional computational resources through continued debugging. Whilst our study focuses on characterising and quantifying decay patterns rather than establishing causal mechanisms, several hypotheses emerge from our observations that warrant future investigation.
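Under the exponential model, the intervention point for a degradation threshold \(\theta\) has a closed form: solving \(E(t)/E_0 = 1 - \theta/100\) gives \(t_\theta = \ln\!\big(100/(100-\theta)\big)/\lambda\). The helper below is a hypothetical illustration of this relationship (it assumes \(t_\theta\) is defined this way, which the closed form implies but the text does not state explicitly):

```python
import math


def intervention_point(lam, theta):
    """Attempt index at which effectiveness has decayed by `theta` percent
    of its initial value, assuming E(t) = E0 * exp(-lam * t)."""
    if not 0 < theta < 100:
        raise ValueError("theta must be a percentage in (0, 100)")
    return math.log(100.0 / (100.0 - theta)) / lam
```

For \(\theta = 80\), a slow-decay model such as Codestral-22B (\(\lambda = 0.34\)) retains capacity for roughly 4–5 attempts, whereas a rapid-decay model with \(\lambda > 1.0\) reaches the same threshold within about 1–2 attempts, consistent with the categories described above.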

The exponential decay pattern may reflect progressive saturation of the model’s context window. As debugging iterations accumulate, the conversation history grows to include multiple code versions, error messages, and refinement attempts. This expanding context may overwhelm the model’s ability to maintain focus on the original problem specification, leading to degraded performance. Models with lower \(\lambda\) values might possess superior context management capabilities, enabling them to filter relevant information from accumulated debugging history.

An alternative mechanism involves gradual drift from the original instructions coupled with error compounding. Each debugging attempt introduces the risk of new errors whilst attempting to fix existing ones. Models with high \(\lambda\) values may lack the systematic error correction strategies necessary to avoid introducing additional problems during refinement. The phi4-reasoning model’s superior debugging sustainability (\(\lambda = 0.60\)) despite lower initial effectiveness suggests that reasoning-capable models may employ more systematic approaches to debugging that resist this drift.

The fresh start strategy’s effectiveness (RQ3) provides indirect evidence for an exploitation-exploration trade-off. Models may become trapped in local minima within the solution space, repeatedly attempting variations of fundamentally flawed approaches. The exponential decay could represent decreasing probability of escaping these local minima as the model increasingly commits to its initial approach. Fresh starts force re-exploration of the solution space, explaining the performance improvements observed in Table 2.

Generalisation limitations

This work quantifies and characterises debugging effectiveness decay in LLM-based code generation. Whilst we discuss implications for model selection and computational resource allocation, several related questions fall outside the scope of this study. These include how problem characteristics (complexity, domain, structure) influence decay patterns, how different prompting strategies or inference parameters might alter debugging behaviour, and the interaction between DDI-measured capabilities and broader development workflows. While the DDI framework offers a systematic approach to measuring and characterising iterative refinement effectiveness in LLMs, it does not encompass the full spectrum of factors that influence code generation quality or affect practical deployment. Investigating these interactions represents valuable future work.

DDI parameters reflect the interaction between model capabilities and problem characteristics. While the exponential decay phenomenon appears robust across multiple debugging contexts, specific \(\lambda\) and \(t_\theta\) values vary with problem set characteristics. Our cross-dataset validation demonstrates that models with high R\(^2\) values maintain consistent \(\lambda\) across diverse problem distributions, indicating that DDI captures model-intrinsic debugging characteristics rather than dataset-specific artefacts. Comprehensive validation across fundamentally different coding paradigms (e.g., problem solving, web development, systems programming) remains future work.

Additionally, while our fresh start interventions demonstrate performance improvements across almost all evaluated models, the magnitude of these improvements critically depends on the selected effectiveness threshold \(\theta\). Although we observe consistent benefits regardless of threshold selection, selecting \(\theta\) values for maximum performance gains represents a crucial but unexplored aspect of our framework. The systematic selection of strategic intervention thresholds falls outside the scope of this study and represents an important direction for future investigation.

Conclusion

This work introduces the Debugging Decay Index (DDI), a novel evaluation framework that characterises the exponential effectiveness decay patterns inherent in LLM-guided iterative debugging processes. Through a systematic analysis of eighteen language models on HumanEval, we demonstrate that debugging effectiveness typically follows predictable exponential decay trajectories, enabling principled determination of optimal intervention timing rather than relying on arbitrary attempt limits. Our key contributions include: (1) mathematical characterisation of debugging decay patterns across diverse model architectures; (2) the DDI framework, which provides unified assessment of coding and debugging capabilities through initial effectiveness (\(E_0\)), decay rate (\(\lambda\)), strategic intervention timing (\(t_\theta\)), and model fit quality (\(R^2\)); and (3) demonstration that strategic fresh start interventions at DDI-calculated thresholds can break exponential decay patterns and improve final accuracy without incurring additional computational costs. DDI provides practical support for optimising debugging workflows in production and reveals core properties of iterative refinement in LLMs. It also accommodates non-exponential decay functions, such as linear or polynomial forms, making it applicable to a wider range of model behaviours, including settings beyond code generation. Future research directions include developing adaptive threshold selection strategies that respond to problem complexity, comparative analysis of human versus AI debugging patterns to validate theoretical foundations of effectiveness degradation, and integration within comprehensive software engineering workflows, including testing frameworks37,38 and code structure analysis. The mathematical simplicity and interpretability of DDI make it well-suited for interdisciplinary investigation of iterative problem-solving across both artificial and biological intelligence systems.