At Nature Biomedical Engineering, most of our submissions describe new devices, instruments, methods or tools that are meant to improve human health. We take a number of factors into consideration when deciding what to send for peer review and ultimately publish1. These factors include whether the paper fits our scope, whether it will be of interest to our exceptionally broad audience, whether it explores new ideas, its potential translational impact and the overall practical advance.

When the proposed innovation lies in the overall practical advance, we gauge the final outcomes based on our understanding of a field, the demonstrations shown and, most importantly, benchmarking data. Indeed, time and time again, we find that a key concern raised both in our own discussions about a study and in referee feedback is the absence of the standard experiments needed to validate performance, and of comparisons to relevant alternative approaches, tools or therapies. In many cases, these are the crucial points that distinguish a good paper that seems compelling and technically sound from a great paper that clearly warrants further consideration.

Expected benchmarking varies widely by field. In some disciplines, benchmarking largely comprises carrying out comparable experiments across papers, such that indirect performance comparisons can be made. For example, most manuscripts describing new cancer therapeutics show applications in orthotopic mouse models of cancer to gauge performance in reducing tumour volume and increasing survival. These papers almost always include proper internal controls, and sometimes compare with approved therapeutics (or report performance in combination with approved therapeutics), but rarely compare directly with related alternative tools.

Of course, the world of cancer therapeutics is vast, and there is no possibility of comparing each proposed strategy to all the latest alternatives. However, we argue that to convince experts of the validity and strength of, for example, a new nanoparticle formulation for drug delivery, a few alternatives, especially those of similar classes of nanoparticle, would ideally be assessed in side-by-side comparisons. At the very least, gold-standard therapies should be included alongside controls. These types of experiment are crucial not only for editors and reviewers to understand the level of advance, but also for the field more generally to recognize when a new technology has truly made a big splash in a sea of valuable but more incremental advances.

By contrast, papers in the digital pathology field often feature thorough benchmarking. In these cases, most relevant alternative tools that are freely available and can actually be installed can be compared directly. Although questions of fairness and best practice in benchmarking sometimes arise during peer review, such as whether machine-learning models were trained fairly or their parameters properly optimized, these issues can usually be resolved reasonably, and tools can be assessed with metrics broadly accepted in the field. This type of comparison is enabled by the relative ease and affordability of benchmarking computational tools, and by field-wide expectations for measuring performance.
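
To illustrate just how cheap such side-by-side comparisons can be in computational fields, the minimal sketch below scores two hypothetical segmentation tools against the same ground-truth annotations using the Dice coefficient, an overlap metric broadly accepted in digital pathology. The tool outputs here are randomly perturbed masks of our own invention, standing in for real predictions; this is a sketch of the comparison pattern, not any published benchmark.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between binary masks (1.0 = perfect overlap)."""
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

# Hypothetical outputs: binary segmentation masks from a new tool and an
# existing alternative, both evaluated against the same expert-annotated
# ground truth. XOR with a sparse random mask flips a small fraction of pixels.
rng = np.random.default_rng(0)
ground_truth = rng.random((512, 512)) > 0.5
new_tool_mask = ground_truth ^ (rng.random((512, 512)) > 0.95)       # ~5% disagreement
existing_tool_mask = ground_truth ^ (rng.random((512, 512)) > 0.90)  # ~10% disagreement

for name, mask in [("new tool", new_tool_mask), ("existing tool", existing_tool_mask)]:
    print(f"{name}: Dice = {dice_coefficient(mask, ground_truth):.3f}")
```

The key point is that both tools are scored on identical inputs with an identical, field-standard metric, which is exactly the kind of comparison that is often prohibitively expensive in wet-lab settings.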

Thorough comparison with existing approaches, demonstrating the degree of advance offered by a new technology, is a sign of a healthy research ecosystem with continuous innovation. The point of benchmarking in an individual paper is rarely to show that an existing approach should not be used but, again, to show the degree of advance a new strategy offers over the next-best alternative, or where an approach fits within a crowded space.

When designing benchmarking experiments, authors should consider their readers. To reach potential users, especially those who are already content with existing tools, it is important to show that the benefits of switching to the new approach are worth the time and effort of trying something new. To make the case to other developers that the work represents a breakthrough worthy of further consideration and development, it is crucial to showcase the current and potential future benefits of an approach. To interest clinicians, the demonstrations and benchmarking must show that the work could represent a clear advance over gold-standard methods for the health of their patients. Finally, to stand out to our editorial team and reviewers, a study must satisfy all or most of the above.

Benchmarking can also be multi-faceted. Although an approach may dominate in terms of performance metrics, those gains may come at a cost. Returning to our earlier examples, computational papers typically report runtimes and computing hardware requirements, and papers on therapeutics should experimentally assess potential side effects, such as inflammation and toxicity, as well as clearance and other important aspects of performance. In these cases, benchmarking is essential to paint the complete picture.
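
For the computational case, reporting costs alongside performance need not be onerous. Below is a minimal sketch, using only the Python standard library, of how wall-clock runtime and peak memory might be recorded for two tools on the same input; the two 'tools' are trivial stand-ins of our own invention, and a real benchmark would of course also document the hardware used.

```python
import time
import tracemalloc

def benchmark(fn, *args):
    """Report wall-clock runtime and peak Python memory for a single call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Hypothetical analysis routines standing in for a new and an existing tool.
def new_tool(data):
    return sorted(data)

def existing_tool(data):
    return sorted(sorted(data))  # deliberately does redundant work

data = list(range(1_000_000, 0, -1))
for name, fn in [("new tool", new_tool), ("existing tool", existing_tool)]:
    _, seconds, peak_bytes = benchmark(fn, data)
    print(f"{name}: {seconds:.2f} s, peak {peak_bytes / 1e6:.1f} MB")
```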

We are keenly aware that direct comparisons are not always possible, often for valid reasons. Sometimes comparing to the state of the art could mean spending a month troubleshooting someone else’s poorly documented code, or chasing scientists for code that was never made available. In other cases, custom reagents cannot be obtained or readily synthesized outside a particular research group’s laboratory, making direct comparisons unfeasible. In situations like these, it is critically important to cite and discuss the relevant literature and to state clearly, in a data-supported manner, the limitations that are addressed by the proposed approach. Simply asserting that other methods are more complex or time-consuming than the newly described strategy is generally not a convincing argument.

As editors, we are aware of the expense, time and animal lives that can go into additional benchmarking. For these reasons, we seek to promote smart experimental planning that includes appropriate benchmarking at the outset, rather than adding it later under pressure from the peer-review process. We also argue that, when done wisely, such experiments are almost always a good investment, because they are so crucial to clarifying the potential impact of a study.

As a guide, we are happy to point readers to some well-benchmarked studies published in our own pages on topics such as vaccines2,3, gene editing4,5, improved therapies6,7,8 and digital pathology9. More generally, we invite authors to look at our published works in a given field to see the types of comparative data we will expect.

Finally, there are times when benchmarking studies in and of themselves are valuable for a community, even in the absence of tool development. They serve to clarify the state of the art, highlight the weaknesses and strengths of existing tools, and point developers to the best starting points. For example, in medical image analysis, many ongoing competitions serve as field-wide benchmarks through which new tools can be continually compared with the latest winners. While these competitions are not without their own weaknesses10, they can have tremendous value for growing fields. We welcome comparative studies of this kind, such as a recent Article from Kather and colleagues on benchmarking digital pathology foundation models11, which we think offers important insights both to biomedical users and to those developing next-generation tools for digital pathology.

Ultimately, one may have the best new approach in the world, but without comparative data to back up the claims, its importance can be easy to overlook.