A few years ago, we introduced an article format called Reusability Reports to highlight good practices in code sharing and reporting. A renewed focus on reproducibility and transparency in code reporting seems warranted, as research output has accelerated with the widespread adoption of large language models.
Reproducibility is a fundamental principle of rigorous scientific research, enabling others to verify, replicate and build on existing findings. Digital research artefacts, such as datasets, software and code, often have a crucial role in supporting the main results of a study and should therefore be described and shared in a transparent and consistent way. To promote this goal, several community guidelines have been developed over the past decade. Among the most influential are the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scholarly data, which were published in 2016 (ref. 1). These guidelines were later extended to research software in 2022 through the FAIR4RS principles2.
In a recent mini-review article on research software practices, Zhang et al.3 argue that two additional guiding principles could further strengthen existing frameworks, recognizing the role of peer review in assessing the quality and reproducibility of research code and software. The authors introduce the principle of ‘reviewability’, which encourages researchers to consider the needs of referees when preparing their software. In particular, this principle highlights the importance of clarity, legibility and completeness, ensuring that code can be readily examined and evaluated during the review process.
They also propose the related principle of ‘supportability’, which focuses on clearly demonstrating how code underpins the main findings of a paper. This may involve providing an explicit evidence chain that links data inputs, code and computational results to the figures and plots present in the article.
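One way to picture such an evidence chain is a small manifest that records, for each figure, which data files, scripts and result files stand behind it, together with a content hash of each file. The following is a minimal sketch under stated assumptions, not a method described by Zhang et al.: the figure label, the role names (`data`, `code`, `output`) and the file names are all hypothetical and chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical layout: figure label -> role -> files relative to the repository root.
Figures = dict[str, dict[str, list[str]]]


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, figures: Figures) -> dict:
    """Record path and hash for every file that supports each figure."""
    return {
        label: {
            role: [{"path": rel, "sha256": sha256_of(root / rel)} for rel in rels]
            for role, rels in roles.items()
        }
        for label, roles in figures.items()
    }


def verify_manifest(root: Path, manifest: dict) -> list[str]:
    """Return the paths whose current hash no longer matches the manifest."""
    stale = []
    for roles in manifest.values():
        for entries in roles.values():
            for entry in entries:
                if sha256_of(root / entry["path"]) != entry["sha256"]:
                    stale.append(entry["path"])
    return stale


def demo():
    """Build a manifest for a toy repository, then change one input file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Illustrative files only; a real repository would already contain these.
        (root / "data.csv").write_text("x,y\n1,2\n")
        (root / "plot.py").write_text("print('plot')\n")
        (root / "fig1.csv").write_text("1,2\n")
        figures = {
            "Fig. 1": {"data": ["data.csv"], "code": ["plot.py"], "output": ["fig1.csv"]}
        }
        manifest = build_manifest(root, figures)
        before = verify_manifest(root, manifest)
        (root / "data.csv").write_text("x,y\n1,3\n")  # simulate a silent data change
        after = verify_manifest(root, manifest)
    return manifest, before, after
```

A referee who receives a manifest of this kind alongside the repository could check that the files they are inspecting are the ones that produced a given figure, and could see at a glance which script to run for which panel.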
Greater attention to the challenges of reviewing research software seems welcome. Our editorial policies require that datasets and code repositories be available during peer review, and we typically conduct an editorial assessment to determine whether the repository is sufficiently well structured and complete. However, referees often face practical barriers — including difficulties accessing, installing, running and testing the code — alongside the time constraints inherent to the review process. To facilitate effective review, authors should ensure that their code repositories are well organized, with a logical folder structure, a comprehensive ReadMe file to explain the repository contents, and an overview of software dependencies. Where relevant, checkpoints and pre-trained models should also be provided.
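Authors could run a simple self-check on these points before submission. The sketch below is illustrative only and does not represent the journal's editorial assessment: the accepted file names and the four checks are assumptions, and a real repository may declare its dependencies in other ways.

```python
import tempfile
from pathlib import Path

# Assumed conventions for this sketch; adjust to the project's own tooling.
README_NAMES = ("README.md", "README.rst", "README.txt", "README")
DEPENDENCY_FILES = ("requirements.txt", "environment.yml", "pyproject.toml", "Dockerfile")


def check_repository(root: Path) -> dict[str, bool]:
    """Run a few structural checks on the top level of a repository."""
    # Compare names case-insensitively so that 'ReadMe.md' also counts.
    entries = {p.name.lower(): p for p in root.iterdir()}
    readme = next(
        (entries[name.lower()] for name in README_NAMES if name.lower() in entries),
        None,
    )
    return {
        "readme_present": readme is not None,
        "readme_nonempty": readme is not None and readme.stat().st_size > 0,
        "dependencies_declared": any(name.lower() in entries for name in DEPENDENCY_FILES),
        "has_subfolders": any(
            p.is_dir() and not p.name.startswith(".") for p in root.iterdir()
        ),
    }


def missing_items(report: dict[str, bool]) -> list[str]:
    """List the checks that failed, in the order they were run."""
    return [name for name, ok in report.items() if not ok]


if __name__ == "__main__":
    # Example: check the current working directory.
    print(missing_items(check_repository(Path("."))))
```

Passing these checks says nothing about whether the code runs or reproduces the results; it only catches the omissions that most often slow a referee down at the first step.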
To facilitate the review process, we encourage authors to create executable compute capsules that allow code to be run within a ready-made environment. Through our continuing partnership with CodeOcean, prospective authors submitting to Nature Machine Intelligence can upload their code and data to the CodeOcean platform and designate a script or command that demonstrates what the code is doing (for example, training a model or evaluating it on a test dataset). Referees can then modify and execute the capsule directly on the CodeOcean web platform or download it as a Docker file that contains all required software dependencies and configurations. Until the paper is accepted, the capsule can remain confidential, ensuring that only editors and reviewers have access to it.
To further highlight the benefits of well-structured and high-quality code repositories associated with published papers, we introduced Reusability Reports in 2020 (ref. 4) as a dedicated article format in Nature Machine Intelligence. These peer-reviewed research articles focus on evaluating the robustness, extensibility and reusability of previously published code. Although such articles are typically commissioned by the editors and often linked to papers in the journal, we also welcome proposals from the community.
A recent example is a Reusability Report from Butt and Walker5, who evaluated a molecular bioactivity foundation model called ActFound. Foundation models are typically pre-trained on large datasets and can be adapted to a range of downstream tasks with fine-tuning. In their assessment, the authors5 found that ActFound could be readily fine-tuned on a natural product dataset using the provided Colab tool, demonstrating the practical accessibility of the released code and resources.
In a different domain, solving combinatorial problems, Li et al.6 examined an approach based on hypergraph neural networks known as HypOp. The original study proposed an efficient framework to tackle higher-order combinatorial problems by training HypOp in a distributed and parallel manner. In their Reusability Report, the authors6 reproduced the findings of the original paper on the well-known MaxCut problem while also investigating how performance changes when the number of GPUs used for training varies. In addition, the report extends the evaluation of HypOp to other combinatorial optimization tasks, including the maximum clique and quadratic assignment problems. Together, these analyses broaden the applicability of the original methodology and provide further insight into HypOp’s capabilities across different computational environments.
As noted at the start of this article, the increasing use of large language models (LLMs) in scientific workflows presents new challenges for maintaining standards of reproducibility. The outputs of LLMs are inherently non-deterministic and may vary even when the same prompt is used7. Moreover, commercial LLMs are frequently updated with new versions or modified settings, which can further complicate attempts to reproduce results over time.
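Run-to-run variation of this kind can be quantified by sending the same prompt several times and reporting how often the most frequent output occurs. The sketch below assumes a generic `generate(prompt)` callable; the stub model is a toy stand-in that samples from a fixed list and is not a real LLM or any vendor's API. Its deterministic behaviour at temperature zero is a simplification of the toy and should not be read as a claim about real models.

```python
import random
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, n_runs: int = 10) -> float:
    """Fraction of repeated runs that return the single most common output."""
    outputs = [generate(prompt) for _ in range(n_runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / n_runs


def make_stub_model(seed: int, temperature: float) -> Callable[[str], str]:
    """Toy stand-in for an LLM (hypothetical): samples one of three fixed answers."""
    rng = random.Random(seed)
    answers = ["42", "forty-two", "41"]

    def generate(prompt: str) -> str:
        # In this toy, temperature 0 always returns the first answer;
        # any other value samples uniformly from the list.
        if temperature == 0.0:
            return answers[0]
        return rng.choice(answers)

    return generate


if __name__ == "__main__":
    prompt = "What is six times seven?"
    print(agreement_rate(make_stub_model(seed=0, temperature=0.0), prompt))
    print(agreement_rate(make_stub_model(seed=0, temperature=1.0), prompt, n_runs=30))
```

Reporting a figure like this alongside the model version and sampling settings gives readers a sense of how much of a reported result could change on a rerun.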
In addition, computational frameworks that incorporate LLMs and autonomous agents can be particularly challenging to review. As we outlined in an Editorial published in 2024 (ref. 8), we ask authors to provide clear documentation of how LLMs are used within their computational frameworks, including details such as the model version, as well as representative prompts and outputs. Such transparency helps reviewers to better understand how these systems contribute to the reported results.
We agree with Zhang et al.3 that a greater emphasis on the reviewability of software and code is timely. This is especially important given the continuously growing number of submitted papers, an increase that is itself partly driven by the expanding use of LLMs in research. Strengthening practices around transparency, documentation and code accessibility will be crucial to maintain robust standards of reproducibility and peer review in this evolving landscape.
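The documentation of LLM use described above (model version, representative prompts and outputs) could be captured in a structured log that travels with the code. This is a minimal sketch, assuming a JSON Lines file with one record per call; the field names and the example values are hypothetical and do not correspond to a format required by the journal.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LLMCallRecord:
    """One documented LLM call; all field names are illustrative assumptions."""
    model_name: str
    model_version: str
    prompt: str
    output: str
    settings: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl(records: list[LLMCallRecord]) -> str:
    """Serialize records as JSON Lines, one call per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_jsonl(text: str) -> list[LLMCallRecord]:
    """Parse JSON Lines text back into records, skipping blank lines."""
    return [LLMCallRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    record = LLMCallRecord(
        model_name="example-model",       # placeholder, not a real model
        model_version="2024-01-01",       # placeholder version string
        prompt="Summarize the dataset description.",
        output="The dataset contains ...",
        settings={"temperature": 0.2},
    )
    print(to_jsonl([record]))
```

A plain-text log of this kind can be read by a referee without running anything, which matters when the underlying commercial model is no longer available in the version the authors used.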
References
1. Wilkinson, M. et al. Sci. Data 3, 160018 (2016).
2. Barker, M. et al. Sci. Data 9, 622 (2022).
3. Zhang, H. et al. Comput. Struct. Biotech. J. 23, 3989–3998 (2024).
4. Nat. Mach. Intell. 2, 729 (2020).
5. Butt, C. M. & Walker, A. S. Nat. Mach. Intell. 8, 270–275 (2026).
6. Li, X. Nat. Mach. Intell. 7, 1870–1878 (2025).
7. Atil, B. et al. In Proc. 5th Workshop on Evaluation and Comparison of NLP Systems 135–148 (ACL, 2025).
8. Nat. Mach. Intell. 6, 845 (2024).
Cite this article
Recognizing reproducibility and reusability in times of fast science. Nat Mach Intell 8, 293–294 (2026). https://doi.org/10.1038/s42256-026-01219-7