Introduction

Novel research findings require independent confirmation before attaining acceptance. This process often involves terms such as “repeatability,” “replicability,” and “reproducibility.” The reliability of confirmation has been questioned from multiple perspectives. Low statistical power and researcher bias lead to the reporting of incorrect findings1. Pressure to publish and selective publication of results also reduce reproducibility2. Post hoc hypothesis formulation and flexibility in analyses are further problems3. These and similar studies suggest that the credibility of science is eroding4,5. Responding to these concerns, the US National Academies of Sciences, Engineering, and Medicine assessed the status of the confirmation process and recommended measures to improve the rigor and transparency of research, resulting in the report Reproducibility and Replicability in Science6, hereafter referred to as the “NASEM Report.”

Over the last four decades, a large body of research has examined how agricultural practices affect the sustainability of production systems, what adaptations might help ensure sustainable production, and what mitigation opportunities exist for issues such as greenhouse gas emissions. These endeavors have generated controversies related to whether reported results are robust or how applicable they are across environments or cropping practices. Examples include tillage effects on soil carbon7,8, cover crop effects on nitrate leaching9, and crop responses to climate change, including atmospheric CO2 ([CO2])10,11,12, temperature13,14,15 and relative humidity16.

Scrutiny of sustainability research will only intensify. As issues such as food security, non-point source water pollution, and climate change gain attention in political and economic arenas, policies are being proposed with large impacts on producers and other stakeholders. Both proponents and opponents of specific policies will demand that supporting research be robust. Examples of concerns with potentially significant economic or legal implications include the reliability of soil carbon credits17 and the estimated impacts of aerosols on global warming18.

Additional trends may further erode research integrity and confirmation. “Publish or perish” policies that link publishing in high-impact journals to job advancement may induce researchers to cut corners, including decreasing internal confirmation prior to publication19 and manipulating data or analyses to enhance apparent statistical significance, often critiqued as “p-hacking”20. Funding agencies may support only “novel” or “innovative” research rather than confirmation studies. Journals heavily reliant on manuscript fees may favor lax peer review21,22. Papers may be generated from fake data5,23, while content generated using artificial intelligence may involve plagiarism and reduce creativity and innovation24. As field research costs increase25, researchers may reduce sampled area, replication, or number of trials, further compromising confirmation.

Given the interest in research confirmation and the increasing likelihood of controversies, the confirmation of agricultural research merits examination. The foremost concern is to increase the robustness of findings, but improved confirmation should also reduce the diversion of resources to refuting ill-founded results26,27.

Focusing on research to support sustainable production, we review the levels of confirmation embodied in the terms “repeatability”, “replicability”, and “reproducibility”, and propose steps toward strengthening reproducibility for both field research and modeling-based studies. Compared to other areas of agricultural research, reproducibility is especially relevant for research on sustainable production. Firstly, sustainable production often involves more complex management than in conventional systems, so research may require more complex treatments and measures of performance, challenging efforts to reproduce research. Secondly, sustainable practices often involve matching inputs to specific, local conditions, making thorough characterization of field environments paramount. Thirdly, the time frames considered for sustainability are longer than for other research, introducing challenges both for documenting management and environmental conditions as well as for independent duplication of field research. Finally, emphasizing on- and off-site impacts implies a direct connection with environmental concerns that invite scrutiny by multiple stakeholders. In this context, numerical modeling and related tools are invaluable for examining how multiple factors may interact in production scenarios and how measured responses may vary spatially and temporally, especially given climate uncertainty.

To frame our discussion of research confirmation, it is helpful to consider a single-season experiment. A specific crop phenotype or environmental property (Pt) at a time (t) is usefully described as a function of the field’s initial conditions (time t = 0, Ft=0), the crop’s genetics (G), the environment (Et), crop management (Mt) and εt, representing random error (errors from measurements of Pt, input data, model structure, parameter estimation, emergent properties, and other sources of variation):

$$P_{t}=f(F_{t=0},G,E_{t},M_{t})+\varepsilon_{t}.$$
(1)

Experimental treatments are applied by varying G, Et, or Mt at one or more locations or time sequences. Experimental units may involve individual plants or experimental plots.
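The structure of Eq. (1) can be sketched in code. In the sketch below, the linear form of f and all coefficients are hypothetical, chosen only to illustrate that the deterministic term is exactly repeatable while the error term εt makes individual observations differ even under identical Ft=0, G, Et, and Mt:

```python
import random

def simulate_P(F0, G, E, M):
    """Hypothetical deterministic f(F0, G, E, M); coefficients are illustrative only."""
    return 0.5 * F0 + 1.2 * G + 0.8 * E + 0.3 * M

def observe_P(F0, G, E, M, rng):
    """One realization of Eq. (1): deterministic response plus random error epsilon."""
    return simulate_P(F0, G, E, M) + rng.gauss(0.0, 0.2)

# The deterministic part is repeatable: identical inputs give identical outputs.
assert simulate_P(1.0, 2.0, 3.0, 4.0) == simulate_P(1.0, 2.0, 3.0, 4.0)

# Two field "observations" under identical F0, G, E, M still differ through epsilon.
rng = random.Random(42)
obs1 = observe_P(1.0, 2.0, 3.0, 4.0, rng)
obs2 = observe_P(1.0, 2.0, 3.0, 4.0, rng)
print(obs1 != obs2)  # almost surely True
```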

Reproducing a series of Pt values requires conducting confirmatory studies under conditions of G, Et, and Mt that are relevant to the underlying research problem. In field research, natural variation in Et precludes perfect duplication of prior results. Researchers attempt to confirm findings via replicated experiments conducted over multiple locations or seasons, usually also seeking to understand how Pt varies with Et. In numerical modeling, while estimating values of Pt given Ft=0, G, Et, and Mt may appear straightforward, efforts to duplicate work often encounter difficulties due to factors including our incomplete understanding of crop processes, uncertainty of model inputs, and software issues, as examined later.

Terminology for confirmation of research findings

The terminology for confirmation of research varies greatly. “Repeatability,” “reproducibility,” and “replicability” are often interchanged28. In many disciplines, their meanings vary with how identical the confirmatory process is to the original experiment or analysis29. The NASEM Report defined “reproducibility” as the ability to obtain consistent results using the same input data, computations, methods, code, and conditions of analysis, thus focusing exclusively on computations and rendering the term synonymous with “computational reproducibility.” “Replicability” was defined as the ability to obtain consistent results across studies that are directed at the same research question, each study obtaining its own data. Two studies are replicated if they provide consistent results within the expected uncertainty of the study system. “Repeatability” was considered a specialized term from metrology relating to measurements repeated close in time and using the same conditions and equipment.

Use of the terms “repeatability,” “replicability,” and “reproducibility” in agricultural research shows limited consistency (Table 1) and often diverges from the NASEM Report (Table 2). In agricultural research, “repeatability” usually refers to the ability of a research group to obtain essentially identical results when an analysis or experiment is repeated within a study or under the same conditions as an initial study. Given the inherent variability of Et in field experiments, arising from weather, soils, and biotic factors, repeatability is difficult to achieve and not fully expected. However, the term appears applicable for individual measurements over short time intervals and for laboratory analyses. For modeling and numerical analyses, repeatability implies that data, scripts, software or processing environments are unchanged from previous work within a research group, and that results are essentially identical.
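For the computational side of repeatability, a low-cost safeguard is to record cryptographic checksums of data files and scripts so that a later rerun within the group can verify that nothing has drifted. A minimal sketch using only the Python standard library (the file name and contents are illustrative):

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory

def fingerprint(paths):
    """SHA-256 digest over the contents of each file, keyed by file name."""
    return {Path(p).name: hashlib.sha256(Path(p).read_bytes()).hexdigest()
            for p in paths}

# Demonstration with a stand-in file; in practice these would be the
# archived input data and analysis scripts of the original study.
with TemporaryDirectory() as d:
    data = Path(d) / "weather.csv"
    data.write_text("date,tmax,tmin\n2024-05-01,21.3,9.8\n")
    original = fingerprint([data])

    # Re-running later: identical digests confirm the inputs are unchanged.
    assert fingerprint([data]) == original

    # Any edit, however small, is detected.
    data.write_text("date,tmax,tmin\n2024-05-01,21.3,9.9\n")
    changed = fingerprint([data]) != original
    print(changed)  # True
```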

Table 1 Examples of how the terms “repeatability,” “replicability” and “reproducibility” appear in agricultural research
Table 2 Proposed terminology for confirmation of results in agricultural research on climate change, and the equivalent term from the National Academies of Sciences, Engineering, and Medicine (NASEM) report6

We consider “replicability” to be the ability of a single research group to obtain consistent results from a previous study when using the same methods, including numerical analyses (Table 2). The concept includes single-season field experiments repeated over multiple seasons or locations. Replication increases confidence that results hold true, and quantifying the effects of G, Et and Mt helps indicate how responses might vary across seasons or regions. Computational replicability is seldom discussed in agricultural research but seems synonymous with “repeatability”.

“Reproducibility” refers to obtaining comparable results from a study independent of the original. A new field experiment might obtain results that confirm prior research, but involve different cultivars, crops, locations, or management. Such work often seeks both to confirm the original study and to understand whether the results are robust for a broader range of situations. Analogously, reproducibility in modeling involves two situations. The first is when independent researchers use data from the original study and the same or other models to confirm the original results. The second arises when new sets of data are modeled in a manner similar to the original study.

The most notable differences between agricultural research and the NASEM Report (Table 2) are that the Report lacks equivalents for “replicability” in agricultural research and that it thus considers our use of “reproducibility” to be “replicability”, while limiting “reproducibility” to computations. A further difference is that we consider confirmation of computations, taken to include modeling and other numerical analyses, as crucial both for original research and independent, external confirmation.

Confirmation of field research

Considering Pt = f(Ft=0, G, Et, Mt), the first challenge for strengthening confirmation of field research is to describe Ft=0, G, Et, and Mt in sufficient detail that other researchers can understand how the results were obtained and, if desired, reproduce the experiment within the constraints inherent to reproducing the Et, including soil, weather and biotic conditions. A second challenge is to describe protocols used to obtain values of observed Pt.

For describing Ft=0, G, Et, and Mt, the standards first developed by the International Benchmark Sites Network for Agrotechnology Transfer (IBSNAT) project and subsequently revised by the International Consortium for Agricultural Systems Applications (ICASA) provide a useful vocabulary and data architecture for documenting experiments30,31. The standards were used by the Agricultural Model Intercomparison and Improvement Project (AgMIP)32 and formed the core of the AgMIP data management system33.

Protocols for measuring Pt present further challenges. Economic yield might be described by the plot area, whether border- or end-rows were excluded, the threshing process, and how moisture contents were determined. Traits such as leaf photosynthesis, canopy reflectance, and soil nutrient concentrations require descriptions of instrument configurations and calibrations, sampling criteria and procedures, conditions during measurements, and data processing, among other metadata. The Prometheus web resource34 hosts protocols for ecological and environmental plant physiology. The platform protocols.io35 provides tools for entering, editing, and sharing protocols within a research group, and finalized protocols may be associated with a unique DOI (digital object identifier). However, neither platform is widely used in agricultural research.

Improper data manipulation and misuse of statistical tests can lead to erroneous results. Practices such as modifying analyses to achieve statistical significance (“p-hacking”), formulating hypotheses after observing the results (“HARKing”), and publishing only positive findings (“publication bias”) increase the risk of reporting non-existent effects or relationships (false positives or Type I errors), which are seldom reproducible20,36.
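The inflation of false positives from p-hacking is easy to demonstrate: under the null hypothesis each test’s p-value is uniformly distributed, so reporting only the smallest of several p-values raises the chance of a spurious “significant” result well above the nominal level. A small simulation (the number of studies and outcomes are arbitrary):

```python
import random

def min_p_experiment(n_outcomes, rng):
    """Under the null hypothesis, each test's p-value is uniform on (0, 1).
    A 'p-hacked' study reports only the smallest of n_outcomes p-values."""
    return min(rng.random() for _ in range(n_outcomes))

rng = random.Random(0)
n_studies, n_outcomes, alpha = 10_000, 10, 0.05

false_positive_rate = sum(
    min_p_experiment(n_outcomes, rng) < alpha for _ in range(n_studies)
) / n_studies

# Analytically, the rate is 1 - (1 - alpha)**n_outcomes, about 0.40 here,
# far above the nominal 0.05 a reader would assume.
print(round(false_positive_rate, 2))
```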

Even for series of well-characterized field experiments, reaching a consensus on implications of results can be challenging. Studies of crop response to elevated atmospheric CO2 (e[CO2]) consistently show that growth increases with e[CO2], but the estimated responses vary37,38. Comparisons are difficult due to differences in G, Et, and Mt among experiments and the methods used to induce e[CO2], notably the use of enclosed or open-top chambers (OTCs) vs. Free-Air CO2 Enrichment (FACE). Kimball et al.39 listed twelve ways that the environment inside OTCs can differ from outside. From the literature, they calculated an average increase in growth inside OTCs of 10% at ambient [CO2]. Such growth increases could be amplified by e[CO2], leading to greater growth during the exponential phase of crop growth, further enhancing growth responses in OTCs compared to FACE, as first reported by Long and co-workers10. In contrast, a 2020 review12 concluded that fluctuations in e[CO2], which are prominent in FACE systems, reduce assimilation and growth compared to steady-state e[CO2]. Hence, responses measured with FACE may be too low, while those from OTCs may be too high. Building consensus remains difficult in the absence of studies directly comparing methods for elevating [CO2].

Confirmation of crop simulations

We first consider process-based crop simulation models, as these are the most widely used models in sustainable agriculture. Confirmation of models can involve three facets. The simplest concerns the consistency of numerical results: if identical numerical inputs are processed with the original model version and computational environment, outputs should be identical to the originals40. The second concerns the confirmation of the mathematical representations of the biophysical processes embodied in each model, whether examined as component processes or complete models, recognizing that model developers differ in their approaches or hypotheses underlying how processes are represented. This is essentially model evaluation, typically comparing simulations of one or more models with process-specific experimental data and conducting sensitivity analyses41,42. The third facet is the confirmation of results from model applications under different assumptions, which for sustainability usually involves simulating long-term crop rotations or sequences, potentially varying climatic factors or [CO2] to mimic climate change. Model confirmation in applications may involve comparisons with field experiments, historical production records, or outputs from other models43.

Confirming model outputs can involve the three levels outlined in Table 2. In theory, for any level, the process only requires re-running the specified model with the associated datasets. However, obtaining identical results often proves difficult, especially when independent parties attempt to reproduce results, even using the same model. Comparing 455 models of biological processes expressed using Systems Biology Markup Language (SBML), Tiwari and co-workers were unable to reproduce results from half of the models44.

Factors constraining the confirmation of modeling studies include differences in inputs, parameters, and model versions, the use of stochastic processes such as weather generation, difficulties in interpreting code, and software dependencies (Table 3). Minor differences arise simply from compiling model code under different operating systems, language versions, and compiler settings40.
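Because floating-point arithmetic is not associative, mathematically equivalent code paths can yield different bits, so numerical confirmation is better framed as agreement within a stated tolerance than as bitwise identity. A minimal Python illustration (the tolerance shown is arbitrary):

```python
import math

# Floating-point addition is not associative, so mathematically equivalent
# evaluation orders (e.g., under different compiler optimizations) can
# produce different bit patterns.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False on IEEE-754 doubles

def outputs_match(a, b, rel_tol=1e-9):
    """Compare two output series element-wise within a relative tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))

# Within a sensible tolerance, the two results confirm each other.
assert outputs_match([left_to_right], [right_to_left])
```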

Table 3 Examples of potential causes of failure to confirm results from simulation modeling or other numerical models

Determining how accurately biological, chemical, and physical processes are embodied in crop models is widely discussed as model evaluation45,46. Foremost among factors constraining model evaluation are the scarcity of detailed field data and uncertainties in inputs such as genotype-specific traits, soil physical parameters, and initial conditions47. Furthermore, field data and simulation outputs are often mismatched due to differences between variables measured in the field and those modeled as state variables48. Failure to consider a lack of independence among observed data, typically involving differences among treatments or environments, can inflate apparent model accuracy49. Additionally, while field measurements usually describe “realized” crop growth where multiple abiotic and biotic factors limit growth, crop models simulate greater, “attainable” growth constrained by explicitly modeled effects such as water or nutrient deficits or non-optimal temperatures. Thus, simulated growth usually represents an upper boundary for measured values. Finally, crop phenotypes are emergent properties that arise from interacting processes within a complex system. Crop models often include parameters in process equations whose values are estimated because they are difficult to measure directly (e.g., for gene effects). A crop model thus represents an abductive learning framework whereby only a subset of possible solutions is allowed, yet it is impossible to pinpoint a unique solution. This lack of unique solutions, termed equifinality50, further compromises reproducibility.
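Common evaluation statistics such as root mean square error (RMSE) and Willmott’s index of agreement are simple to script, and sharing such scripts alongside results removes one source of ambiguity in model evaluation. A sketch with hypothetical observed and simulated yields:

```python
import math

def rmse(obs, sim):
    """Root mean square error between observed and simulated values."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def willmott_d(obs, sim):
    """Willmott's index of agreement (0 = no agreement, 1 = perfect)."""
    o_bar = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for o, s in zip(obs, sim))
    den = sum((abs(s - o_bar) + abs(o - o_bar)) ** 2 for o, s in zip(obs, sim))
    return 1.0 - num / den

# Hypothetical observed vs. simulated grain yields (t/ha).
obs = [4.2, 5.1, 6.3, 7.0]
sim = [4.5, 5.0, 6.0, 7.4]
print(round(rmse(obs, sim), 3), round(willmott_d(obs, sim), 3))
```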

AgMIP crop model intercomparisons have partially addressed reproducibility by comparing different models run using identical inputs15,51. The intercomparisons, however, have largely focused on how well models described crop growth or environmental effects in specific field trials, rather than on reproducibility among models. An implicit assumption may have been that the field data, including associated G, Et, and Mt, had low uncertainty compared to differences among models. In an analysis of modeling datasets for 426 potato experiments, errors occurred in all elements of the inputs, parameters, and evaluation data52. Weather data appeared especially problematic, possibly because errors in weather data are easier to detect through cross-checking than errors in soil, management, and crop growth data.

Confirmation of other numerical models

Statistical and geospatial models are also used to investigate issues relating to sustainability, especially for climate change53,54. Again, while confirmation of such models seems straightforward, attempts to reproduce analyses from other disciplines have encountered difficulties55. In geosciences, Konkol and co-workers recreated analyses and resulting maps or graphs from 41 open-access papers56. Analyses from two papers were reproduced without issue; for 33 papers, issues arose but were readily resolved. For two further papers, issues were only partially resolved, and four papers were considered irreproducible. Similar studies from psychology found that published values frequently could only be reproduced after author consultation57,58. Difficulties mainly involved how analytic procedures were reported, and the primary research conclusions were unaffected.

The underlying causes of failure to reproduce numerical analyses parallel those for crop modeling (Table 3). Data may inadvertently be modified over time, such as in the handling of outliers or missing values or by values from databases being updated. Workflows that include different software tools may require manual manipulation of files, increasing the potential for errors. Even when workflows are documented, problems arise. Two analyses of research using the open-source computational notebook Jupyter Notebook (https://jupyter.org/, verified 2025-02-12) found that results were often unreproducible because the order in which the computational steps (“cells”) were actually executed differed from the order in which they were presented59.
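One automatable check for this out-of-order problem is that the execution count recorded for each code cell in a notebook’s JSON should increase strictly from top to bottom. A sketch of such a check (the notebook structures below are minimal illustrations, not real files):

```python
def cells_run_in_order(notebook):
    """True if the code cells of a Jupyter notebook (parsed JSON) were last
    executed top to bottom, i.e., execution counts strictly increase."""
    counts = [c.get("execution_count") for c in notebook["cells"]
              if c.get("cell_type") == "code"]
    if any(c is None for c in counts):  # a cell was never executed
        return False
    return all(a < b for a, b in zip(counts, counts[1:]))

in_order = {"cells": [
    {"cell_type": "code", "execution_count": 1},
    {"cell_type": "markdown"},
    {"cell_type": "code", "execution_count": 2},
]}
out_of_order = {"cells": [
    {"cell_type": "code", "execution_count": 3},
    {"cell_type": "code", "execution_count": 1},
]}
print(cells_run_in_order(in_order), cells_run_in_order(out_of_order))
```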

Towards improved reproducibility and confirmation

Strengthening the confirmation of sustainability research requires a substantial shift in research culture, including changes in the attitudes of individuals, teams, and funding agencies6,60,61. Researchers must recognize the importance of thoroughly and accurately describing their experiments and analyses, and sharing their data, data collection and analysis protocols, and software in ways that enable reproduction. The planning of field experiments, modeling, and any subsequent numerical analyses should seek to maximize repeatability, replicability, and reproducibility (Table 2). Guided by a project that assessed the reproducibility of computer code62, we suggest researchers consider whether their work could be reproduced ten years from now. Our admittedly subjective rationale is that if agricultural research can be reproduced after a decade, it should be reproducible over a longer period, acknowledging the inherent uncertainties in instruments, germplasm, environments, and other factors.

A recurring recommendation, formulated in various manners, is to follow “best” practices for data management and processing that enhance reproducibility63,64. For agriculture, key practices for researchers are outlined in Table 4. Publishers might create a certification process that assesses completeness, nomenclature and formatting, including materials, methods, datasets, and software. Researchers in ecology and evolution proposed eight review criteria, including how well a manuscript describes metadata, data processing steps, and sources of secondary data65. The journal PLOS Computational Biology implemented a pilot system for peer review of reproducibility66. Ideally, compliance would be tested prior to submission, using tools similar to Turnitin (https://turnitin.com, verified 2025-02-12) or iThenticate (https://www.ithenticate.com/, verified 2025-02-12).

Table 4 Recommended actions to strengthen the reproducibility of data management and analysis in agricultural research

Researchers may resist change out of concern that it will divert resources from advancing their objectives7,67,68. However, changes can enhance research impact, reduce errors, discourage unfounded challenges, and improve compliance with open science directives. A balance must be struck between providing too little data, rendering studies irreproducible, and requiring so much documentation that research suffers.

We suggest resource concerns may be overstated. Valuable information is often available but unreported simply because its importance is unrecognized (e.g., row spacing, sowing depth, fertilizer composition). Similarly, some data are not collected because their perceived value does not justify the cost. For instance, soil samples are often taken before planting but limited to the upper 30 cm. Extending sampling to the maximum rooting depth provides a more complete nutrient balance at minimal extra cost since sampling sites are already established.

Meta-research examining how published research is evolving in terms of replication and reproducibility might identify constraints and needs69. A 2011 analysis of methodologies for simulating climate change impacts helped strengthen subsequent studies, although reproducibility was not explicitly addressed70.

Field research

The first step for field experiments is to improve reporting of Ft=0, G, Et, and Mt. Adequately describing the weather, soil profile characteristics, and crop management is essential. Often, data on management exist but are not in an organized digital format. The ICASA standards provide one option for documenting the field environment and crop management30,31.

Given that sustainability research often concerns quantitative responses of crops or soils to nutrients, temperature, precipitation, [CO2], and other abiotic factors, the question arises of whether studies can measure responses more accurately, thus strengthening confirmation, especially considering interactions of G, Et, and Mt. For experiments involving multiple quantitative factors, response surface methods can capture nonlinear responses with a reduced number of plots, while maintaining statistical power71. Studies combining e[CO2] with other factors have predominantly used only two levels per factor, sufficient to detect an interaction with [CO2] but insufficient to infer the shape of the responses71.
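The limitation of two-level designs can be shown algebraically: two levels of a factor determine only a straight line, whereas three levels suffice to fit a quadratic and locate an interior optimum. A sketch with hypothetical nitrogen and yield values:

```python
def fit_quadratic(p0, p1, p2):
    """Exact quadratic y = a + b*x + c*x**2 through three (x, y) points,
    computed via divided differences."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    s01 = (y1 - y0) / (x1 - x0)
    s12 = (y2 - y1) / (x2 - x1)
    c = (s12 - s01) / (x2 - x0)          # curvature; zero would mean linear
    b = s01 - c * (x0 + x1)
    a = y0 - b * x0 - c * x0 ** 2
    return a, b, c

# Hypothetical yield (t/ha) response to nitrogen (kg N/ha) at three levels.
a, b, c = fit_quadratic((0, 4.0), (100, 6.5), (200, 7.0))
optimum_n = -b / (2 * c)  # vertex of the fitted parabola
# A two-level design (e.g., 0 and 200 kg N/ha) could not detect this
# interior optimum because it can only estimate a straight line.
print(round(optimum_n))
```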

Confirmation also benefits from coordination among trials to standardize elements of G, Et, and Mt or measurement protocols. The “China Wheat” study partially standardized wheat (Triticum aestivum L.) cultivars, nitrogen and water regimes across five locations from Texas, USA to Alberta, Canada, and used coordinated protocols for growth, spectral reflectance, and canopy temperature measurements72. GRACEnet (Greenhouse gas Reduction through Agricultural Carbon Enhancement network)73 and the Long Term Agroecosystem Research (LTAR) network74 specifically address sustainability.

Crop modeling and other numerical models

Multiple actions to enhance reproducibility of simulation modeling and other numerical approaches merit consideration. As far as possible, modeling per se and associated analyses should employ peer-reviewed, open-source software. The open-source framework Crop2ML allows interchanging modules among models, which should enhance reproducibility75,76, although Crop2ML still requires comparisons with external data to identify the most promising approaches. Model inputs, parameters, and control scripts should be placed in public repositories. If the model or analytic software is not open source, then the equations and parameters should be reported in detail. Version control systems such as Git, along with the cloud-based GitHub repository, can assist researchers in tracking model development, and code can also be shared as appendices to journal articles, research websites, or model repositories such as the CoMSES Net Model Library (https://www.comses.net/)77.
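Whether or not the modeling software itself can be shared, a machine-readable run manifest recording the model version, parameter values, and checksums of input files lets others verify exactly what was run. A minimal sketch; the model name, parameters, and file contents are all illustrative:

```python
import hashlib
import json

def run_manifest(model, version, parameters, input_files):
    """Assemble a provenance record for one simulation run.
    input_files maps file name -> raw bytes (in practice, read from disk)."""
    return {
        "model": model,
        "version": version,
        "parameters": parameters,
        "inputs": {name: hashlib.sha256(data).hexdigest()
                   for name, data in input_files.items()},
    }

manifest = run_manifest(
    model="example-crop-model",  # hypothetical model name
    version="2.1.0",
    parameters={"rue": 1.4, "base_temp_c": 8.0},
    input_files={"weather.csv": b"date,tmax,tmin\n"},
)
# The manifest can be archived alongside outputs in a public repository.
print(json.dumps(manifest, indent=2, sort_keys=True))
```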

Crop model intercomparisons have strengthened model-based research for climate change impacts but also highlighted challenges in improving models per se, designing simulation experiments, and analysis of modeling studies. A major constraint remains the scarcity of datasets combining a range of treatments or environmental conditions with adequate information on soils and crop management. Furthermore, data on crop growth and development often constrain accurate model parameterization. There are numerous calls to follow the FAIR Data Principles of datasets being Findable, Accessible, Interoperable, and Reusable78, and funding sources increasingly require datasets to be released in digital formats. However, datasets in repositories such as the USDA Ag Data Commons (https://agdatacommons.nal.usda.gov/browse; verified 2025-02-12) frequently lack data describing Et and Mt.

Coordinated field experiments carefully designed to fill confirmation gaps in model evaluation and application, and that follow adequate protocols for data collection and sharing, are essential to reduce the current bottleneck in experimental data. Design of field trials and protocols would benefit from greater collaboration among experimentalists and crop modelers.

Model intercomparisons might investigate sources of uncertainty such as model inputs, including initial conditions, model structure, and model parameters79. In ecological modeling, researchers cited the need to “break” models, defined as determining “under what conditions the mechanisms represented in a model can no longer explain observed phenomena”80. This approach was embodied in the temperature-based sensitivity analyses used to evaluate modeled responses for sorghum (Sorghum bicolor (L.) Moench) and dry bean (Phaseolus vulgaris L.)81.
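“Breaking” a model can start with simply sweeping one driver beyond its calibration range and noting where simulated behavior ceases to be plausible. The sketch below uses a generic trilinear temperature-response function with hypothetical cardinal temperatures, not any published model:

```python
def development_rate(temp_c, t_base=8.0, t_opt=30.0, t_max=42.0):
    """Generic trilinear temperature response: 0 at/below t_base, rising
    linearly to 1 at t_opt, then falling linearly to 0 at/above t_max.
    Cardinal temperatures here are hypothetical."""
    if temp_c <= t_base or temp_c >= t_max:
        return 0.0
    if temp_c <= t_opt:
        return (temp_c - t_base) / (t_opt - t_base)
    return (t_max - temp_c) / (t_max - t_opt)

# Sweep temperature well beyond the calibration range to find where the
# modeled mechanism no longer explains growth (here: development stops).
sweep = {t: round(development_rate(t), 2) for t in range(0, 55, 5)}
breaking_points = [t for t, r in sweep.items() if r == 0.0]
print(breaking_points)
```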

Conclusion

Research related to sustainable agricultural production will face increasing scrutiny and pressure to quantify responses more accurately for factors including soil carbon and nutrient levels, air temperature, precipitation, [CO2], water deficits, flooding, crop genetics, and biotic factors. Addressing these challenges requires sustained efforts by individual researchers, research groups, funding agencies, and others to strengthen the independent confirmation of scientific results. While the NASEM report increased the visibility of the confirmation process, NASEM terminology seems too narrow for agricultural research. We urge the use of the broader senses of repeatability, replicability, and reproducibility, and emphasize their relevance in field research, crop simulation modeling, and other numerical analyses.

Agricultural research should explicitly plan for reproducibility, recognizing that natural variability in the local environment (Et) constrains reproducibility of field studies. A useful benchmark is whether the descriptions of experiments and analyses are detailed enough to enable reproduction of the research 10 years later. Further actions include strengthening the digital description of data and protocols, improving experimental designs and statistical analyses, and simulating crops growing under challenging, extreme conditions. These actions are crucial for research to accurately characterize the responses of agricultural systems to multiple challenges, especially in the context of sustainable production and climate uncertainty. Attaining the needed changes may require additional resources, but the benefits of improved confirmation should justify the investment.