Main

Diagnosis is fundamental to delivering effective healthcare. Clinical information within electronic health records (EHRs), imaging, laboratory tests and pathology can facilitate the timely and accurate detection of diseases1,2,3. For patients, this can provide an explanation for their health condition and guide clinicians to choose appropriate treatments, potentially improving patient outcomes4,5. Public and global health measures are also principally guided by effective diagnostic workflows6.

Diagnostic research is often at risk of producing biased results due to flaws in methodological design and lack of transparency7. It has also long been a concern that the reporting of diagnostic test research is inadequate and inconsistent, leading to substantial research misrepresentation and waste8,9,10. Furthermore, it is often incorrectly assumed that the diagnostic accuracy of a test is a fixed characteristic; it is now well understood that common diagnostic accuracy measures (for example, sensitivity and specificity) can vary across clinical contexts, target populations, disease severity and different definitions of a reference standard11,12. Key information about the study design, setting, participants, index tests, reference standards, analysis and outcomes should be reported in all diagnostic test accuracy studies. Missing or unclear information hampers safe translation into clinical practice, as key stakeholders, such as healthcare professionals, regulators and policymakers, are unable to evaluate the evidence base of a diagnostic test.

In response to these concerns, the Standards for Reporting of Diagnostic Accuracy Studies (STARD) statement was developed in 2003 and subsequently updated in 2015 (STARD 2015) to standardize the reporting of diagnostic accuracy research13,14. By outlining a list of 30 minimum essential items that should be reported for every diagnostic test accuracy study, STARD can improve the quality of study reporting, help stakeholders judge the risk of bias and applicability of the findings and enhance research reproducibility. The accompanying explanation and elaboration document provides the rationale for each item with examples of good reporting15. STARD has since been extended to provide guidance for reporting studies in conference abstracts (STARD for Abstracts)16. Evidence suggests that adherence to STARD improves the reporting of key information in diagnostic test accuracy studies17,18.

The landscape of clinical diagnostics has shifted considerably since the release of STARD 2015. Advances in understanding diseases at both population and molecular levels19,20,21,22, as well as technological breakthroughs such as artificial intelligence (AI)23,24, could enhance diagnostic capacity and efficacy. As a technology, AI may have the unique potential both to improve the performance of diagnostic systems and to streamline workflows, easing the burden on healthcare resources25. Moreover, diagnostics constitutes a substantial proportion of clinical AI activity, with most AI devices that have achieved regulatory approval thus far belonging to the diagnostic field26. However, research in this field has to date been conducted without a suitable reporting guideline that accounts for the unique properties of AI-driven diagnostic systems and the associated challenges.

For the purposes of this guideline, AI refers to computer systems that can perform tasks that typically require human intelligence, such as classification, prediction or pattern recognition. This includes, but is not limited to, machine learning and deep learning models, natural language processing tools or foundation models that generate or support diagnostic outputs. Systems that include static or manually programmed rules without adaptive learning, such as simple decision trees, were not included in the scope. AI introduces several additional potential sources of bias that are currently not always reported by study authors or accounted for by existing guidelines27. These may be related to study design, patient selection, dataset handling, ethical considerations, index test and reference standard conduct, statistical methods, reporting of results and discussion and interpretation of findings. Therefore, an accurate evaluation of the clinical applicability of AI-centered diagnostic systems is not always possible.

To strengthen the reporting of AI-centered diagnostic accuracy studies, the STARD-AI statement was developed. STARD-AI provides a checklist of minimum criteria that should be reported in every diagnostic test accuracy study evaluating an AI system. It joins several complementary EQUATOR Network initiatives that outline reporting guidelines for clinical AI studies, including CONSORT-AI for clinical trials of AI interventions28, SPIRIT-AI for trial protocols29, TRIPOD+AI for prediction and prognostic models30 and CLAIM for medical imaging studies31. Relevant reporting guidelines and their scopes are summarized in Table 1. The aim of STARD-AI is to improve completeness and transparency in study reporting, supporting stakeholders in evaluating the robustness of study methodology, assessing the risk of bias and judging the applicability and generalizability of study findings. This article outlines STARD-AI and describes the process of its development.

Table 1 Reporting guidelines for AI-based medical devices and their scope

The STARD-AI statement

The final STARD-AI statement consists of 40 items that are considered essential in reporting of AI-centered diagnostic accuracy studies (Table 2). The development process can be visualized in Fig. 1. A downloadable, user-friendly version of the checklist can be found in Supplementary Table 2. Four items were modified from the STARD 2015 statement (items 1, 3, 7 and 25), and 14 new items were introduced to account for AI-specific considerations (items 6, 11, 12, 13, 14, 15b, 15d, 23, 28, 29, 35, 39, 40a and 40b). In a structure similar to STARD 2015, the checklist contains items relating to the title or abstract (item 1), abstract (item 2), introduction (items 3 and 4), methods (items 5–23), results (items 24–32), discussion (items 33–35) and other important information (items 36–40). Subsections are included within methods and results to make the checklist clearer to follow and interpret. The methods section is subdivided into study design, ethics, participants, dataset, test methods and analysis subsections, and the results section contains subitems relating to the participants, dataset and test results. In line with STARD 2015, a diagram illustrating the flow of participants is expected in reports (item 24); a template diagram is available in the STARD 2015 publication14. The rationale for new or modified items is outlined in Supplementary Table 3. For convenience, the STARD for Abstracts checklist is reproduced in Table 3 (ref. 16).

Table 2 The STARD-AI checklist
Fig. 1: STARD-AI checklist development process.

The checklist was developed through a multistage process, including a literature review, an expert survey, a patient and public involvement and engagement (PPIE) exercise, Delphi surveys and a final consensus meeting. The numbers of participants and items assessed at each stage are shown.

Table 3 STARD for Abstracts16

AI raises considerations across several domains that are often not encountered in traditional diagnostic test accuracy studies. In particular, STARD-AI introduces several items that focus on data handling practices. These include detailing the eligibility criteria at both a dataset level and a participant level (item 7); the source of the data and how they were collected (item 11); dataset annotation (item 12); data capture devices and software versions (item 13); data acquisition protocols and preprocessing (item 14); the partitioning of datasets into training, validation and test sets (item 15b); the characteristics of the test set (item 25); and whether the test set represents the target condition (item 28). These items can substantially affect the diagnostic accuracy outcomes of a study and influence the risk of bias and applicability. As well as aiding evaluation of study findings, sufficient reporting of these items, in addition to clear explanations of the index test and reference standard, may facilitate reproducibility and the replication of studies. In line with collaborative open science practices, STARD-AI encourages disclosure of commercial interests (item 39), public availability of datasets and code (item 40a) and the external audit or evaluation of outputs (item 40b).
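
As a concrete illustration of the data partitioning described in item 15b, the minimal sketch below shows one way to split a dataset at the participant level so that no individual contributes data to more than one of the training, validation and test sets; the DataFrame columns (patient_id, feature, label) and split proportions are hypothetical and chosen purely for illustration, not prescribed by STARD-AI.

```python
# Minimal sketch of participant-level dataset partitioning (cf. item 15b), assuming a
# pandas DataFrame with hypothetical columns `patient_id`, `feature` and `label`.
# Splitting by patient rather than by individual record helps prevent data leakage
# between the training, validation and test sets.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4, 5, 6, 7, 8],
    "feature":    [0.2, 0.4, 0.1, 0.9, 0.8, 0.3, 0.7, 0.5, 0.6, 0.2],
    "label":      [0, 0, 1, 1, 1, 0, 1, 0, 1, 0],
})

# Hold out roughly 20% of patients as the test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(outer.split(df, groups=df["patient_id"]))
dev, test = df.iloc[dev_idx], df.iloc[test_idx]

# Split the remaining patients into training and validation sets.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(inner.split(dev, groups=dev["patient_id"]))
train, val = dev.iloc[train_idx], dev.iloc[val_idx]

# No patient should appear in more than one partition.
assert set(train["patient_id"]).isdisjoint(test["patient_id"])
assert set(val["patient_id"]).isdisjoint(test["patient_id"])
```

Reporting the partitioning strategy at this level of detail allows readers to judge whether leakage between partitions could have inflated the reported accuracy.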

Use of STARD-AI can aid the comprehensive reporting of research that assesses AI diagnostic accuracy using either single or combined test data and can be applied across a broad range of diagnostic modalities. Examples include imaging, such as X-rays or computed tomography scans32; pathology through digital whole-slide images33; and clinical information in the form of EHRs34. In addition, studies may express diagnostic performance in ways other than test accuracy, including incremental accuracy gains within diagnostic pathways or clinical utility measures35,36. STARD-AI also supports the evaluation of multimodal diagnostic tools and can be used in studies that assess the diagnostic accuracy of large language models (LLMs), where the output consists of a diagnostic classification or differential diagnosis. By contrast, if the study focuses on the development or evaluation of a multivariable prediction model using regression, machine learning or LLM-based approaches to predict diagnostic or prognostic outcomes, use of TRIPOD+AI or TRIPOD-LLM is more appropriate30,37. CLAIM may be considered for the development or validation of a medical imaging AI model31, whereas STARD-AI is more applicable where the diagnostic accuracy of a model is the primary focus. Where relevant, authors can consider referring to multiple checklists but may select the guideline most aligned with the study’s primary aim and evaluation framework for pragmatic reasons.

Discussion

STARD-AI is a new reporting guideline that supports the reporting of AI-centered diagnostic test accuracy studies. It was developed through a multistage process consisting of a comprehensive item generation phase followed by an international multistakeholder consensus. STARD-AI addresses considerations unique to AI technology, predominantly related to algorithmic and data practices, that are not accounted for by its predecessor, STARD 2015. Although it proposes a set of items that should be reported in every study, many studies may benefit from reporting additional information related to individual study methodology and outcomes. STARD-AI should, therefore, be seen as a minimum set of essential items and not as an exhaustive list.

Research into clinical diagnostics using AI tools has thus far mostly focused on establishing the diagnostic accuracy of models. However, there are many challenges to successfully translating AI models to a clinical setting, including the limited number of well-conducted external evaluation studies to date; the lack of comparative and prospective trials; the use of study metrics that may not reflect clinical efficacy; and difficulties in achieving generalizability to new populations38. The deployment of these models into clinical scenarios outside research settings has raised concerns that intrinsic biases could propagate or entrench population health inequalities or even cause patient harm39. Therefore, it is crucial for potential users of diagnostic AI tools to focus not only on model performance but also on the robustness of the underlying evidence base, primarily through identifying flaws in study design or conduct that could lead to biases and poor applicability. STARD-AI can help on this front by guiding authors to include the important information needed for readers to evaluate a study.

Specific AI diagnostic elements to consider include transparency of AI models, bias, generalizability, algorithm explainability, clinical pathway integration, data provenance and quality, validation and robustness, and ethical and regulatory considerations. As diagnostic tools currently dominate the landscape of regulatory-approved AI devices26, guidelines such as STARD-AI may help to enhance the quality and transparency of studies reported for these devices. Ultimately, this may aid the development and deployment of AI models that lead to healthcare outcomes that are fair, appropriate, valid, effective and safe40. It may also support the deployment of AI models that align with Coalition for Health AI principles for trustworthy AI, namely algorithms that are reliable, testable, usable and beneficial41,42.

STARD-AI provides many new criteria that outline appropriate dataset and algorithmic practices, stresses the need to identify and mitigate algorithmic biases and requires authors to consider fairness in both the methods (item 23) and the discussion (item 35) sections. In this context, fairness refers to the equitable treatment of individuals or groups across key attributes, including demographic factors or socioeconomic status. This includes the expectation that an AI-based system should not systematically underperform or misclassify subgroups of patients in a manner that may reinforce existing health disparities. Ensuring model fairness is especially imperative in the context of diagnostic AI technology, as these systems may eventually be deployed to assist clinical decision-making in population-wide diagnostic or screening strategies. If fairness is not considered sufficiently, equitable healthcare delivery may be hampered on a population level, and disparities between demographic groups may be exacerbated43. Datasets used for training, validation and testing should ideally be diverse and representative of the intended target population of the evaluated index test. Additional algorithmic practices can further reduce fairness gaps while maintaining performance39.
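
To illustrate the kind of subgroup analysis that items 23 and 35 invite authors to report, the minimal sketch below computes sensitivity and specificity separately for two demographic subgroups; the labels, predictions and group assignments are entirely illustrative and do not correspond to any real dataset.

```python
# Minimal sketch of a subgroup fairness check (cf. items 23 and 35), assuming
# hypothetical arrays of binary reference-standard labels, index test predictions
# and a demographic attribute. Per-subgroup accuracy measures can reveal systematic
# underperformance that an aggregate metric would hide.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0])
group  = np.array(["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"])

def sensitivity_specificity(truth, pred):
    tp = np.sum((truth == 1) & (pred == 1))
    fn = np.sum((truth == 1) & (pred == 0))
    tn = np.sum((truth == 0) & (pred == 0))
    fp = np.sum((truth == 0) & (pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

for g in np.unique(group):
    mask = group == g
    sens, spec = sensitivity_specificity(y_true[mask], y_pred[mask])
    print(f"Subgroup {g}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```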

The addition of 10 main items and four subitems (14 new items in total) increases the length of the checklist compared to STARD 2015. Although this may be seen as a barrier to implementation, it was deemed necessary to address AI-specific considerations that may substantially impact the quality of study reporting. Notably, other checklists, such as TRIPOD+AI and CLAIM, contain a similar number of total items and subitems30,31. We intend to release an explanation and elaboration document to provide examples and rationale for each new or modified item in STARD-AI, which we briefly outline in Supplementary Table 3. However, many of the items remain unchanged from STARD 2015, reflecting that the general principles of reporting diagnostic accuracy studies are still essential for AI tools. In the meantime, the STARD 2015 explanation and elaboration document provides rationale and examples of appropriate reporting for the unchanged items15.

STARD-AI is designed to support the reporting of studies that evaluate the diagnostic accuracy of an AI tool. However, the increasing integration of AI systems into clinical workflows highlights the growing importance of AI–human collaboration. In many real-world scenarios, AI tools are intended not to replace clinical decision-making but, rather, to inform or enhance it. Therefore, future studies should also assess the impact of AI assistance on end-user performance, in addition to reporting the standalone accuracy of the AI system. This should ideally include a comparison to a baseline in which clinical decisions are made without AI, which will aid in evaluating the clinical utility of AI for decision-making and workflows44. The experience and expertise of end users will also be important in determining performance outcomes. Addressing these elements may require the development of a separate consensus in the future.
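
As one illustration of such a comparative design, the minimal sketch below contrasts unassisted and AI-assisted reads of the same cases and applies an exact McNemar test to the discordant pairs; the case-level data and variable names are hypothetical, and other measures of clinical utility may be equally appropriate.

```python
# Minimal sketch of a paired comparison between unassisted and AI-assisted reads of the
# same cases, assuming hypothetical arrays of reader decisions (1 = correct, 0 = incorrect).
# An exact McNemar test on the discordant pairs is one way to assess whether AI assistance
# changes reader performance.
import numpy as np
from scipy.stats import binomtest

unassisted = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1])
assisted   = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1])

# Discordant pairs: correct only with assistance (b) or only without assistance (c).
b = int(np.sum((assisted == 1) & (unassisted == 0)))
c = int(np.sum((assisted == 0) & (unassisted == 1)))

print(f"Unassisted accuracy:  {unassisted.mean():.2f}")
print(f"AI-assisted accuracy: {assisted.mean():.2f}")
print(f"Exact McNemar p-value: {binomtest(min(b, c), b + c, 0.5).pvalue:.3f}")
```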

Although STARD-AI was developed prior to the wider introduction of generative AI and LLMs, many of its items nevertheless remain applicable to generative AI models whose diagnostic accuracy is reported. Unlike classical AI models, which are typically trained on labeled datasets for specific tasks, LLMs and transformer-based architectures are generally pretrained on large-scale, unstructured datasets and can subsequently be fine-tuned for specific diagnostic tasks. Although STARD-AI can be applied to studies that investigate generative AI and future advances in AI platforms, it is likely that STARD-AI and other complementary guidelines will need to be regularly updated in response to the rapidly shifting nature of this field. Next-generation generative AI technology may consist of multimodal and generalist models that take medical and biomedical data as input to improve predictions45,46,47. Further advances in fields such as reinforcement learning48,49, graph neural networks50,51 and explainable AI (XAI) solutions52 may also substantially change the landscape of health AI and require new considerations in the next iteration of reporting guidelines.

The rapid pace of technological advancement may also present inherent limitations to reporting guidelines. Although many of the STARD-AI items remain applicable to newer forms of AI, including foundation models and multiagent systems, the increasing complexity and versatility of these tools may challenge traditional concepts of diagnostic evaluation. Emerging systems may provide differential diagnoses ranked by probability or even interact dynamically with users via natural language and adapt their outputs based on population characteristics or user expertise. These capabilities extend beyond conventional frameworks and may not be fully captured by traditional diagnostic accuracy metrics alone. Although STARD-AI offers a strong foundation for transparent reporting, complementary frameworks such as CRAFT-MD may be better suited for evaluating different forms of AI-driven clinical support53.

We are confident that STARD-AI will prove useful to many stakeholders. STARD-AI provides study authors with a set of minimum criteria to improve the quality of reporting, although it does not aim to provide prescriptive step-by-step instructions. If adopted as a reporting standard before or during manuscript submission, STARD-AI may enable journal editors and reviewers to appraise submissions more effectively; its use by journals may also help to ensure that all information essential for readers is included in the published article. In the future, AI-based tools, such as LLMs, may assist in prescreening manuscripts for STARD-AI adherence, offering a scalable means to support checklist compliance during peer review and editorial assessment. Beyond the academic field, policymakers, regulators and industry partners are encouraged to incorporate STARD-AI, as well as complementary reporting guidelines within the EQUATOR Network54, into clinical AI product and policy assessments, where the requirement for transparency of evidence is universally recognized, to better guide downstream decisions and recommendations. End users such as clinicians may be better able to evaluate the clinical utility of AI systems for their patient populations prior to use, and patients may benefit from the eventual outcome of higher-quality research.

Conclusion

Diagnostic pathways stand to benefit substantially from the use of AI. For this to happen, researchers should report their findings in sufficient detail to facilitate transparency and reproducibility. Similarly, readers and other decision-makers should have the necessary information to judge the risk of bias, the determinants of diagnostic test accuracy, the clinical context and the applicability of study findings. STARD-AI is a consensus-based reporting guideline that clarifies these requirements.

Methods

STARD-AI is an international initiative that seeks to provide a multistakeholder consensus on a reporting guideline for AI-centered diagnostic test accuracy studies. A Project Team comprising experts in this field (V.S., X.L., G.S.C., A.K., S.R.M., R.M.G., A.K.D., S. Shetty, D.M., P.M.B., A.D. and H.A.) coordinated the development process, made key methodological decisions and managed day-to-day operations. In addition, a Steering Committee was selected by the Project Team to provide strategic oversight of the guideline development process; it consisted of a diverse panel of international stakeholders with expertise in healthcare, computer science, academia, journal editing, epidemiology, statistics, industry, medical regulation and health policymaking. The Consensus Group, distinct from the Project Team and Steering Committee, included invited stakeholders who participated in the Delphi process and consensus meeting. Additional Delphi participants, who were not part of the Consensus Group or committees, contributed to the online survey rounds. The development process is visualized in Fig. 1. A full list of members of the Steering Committee and Consensus Group is provided in a footnote at the end of the article.

STARD-AI was announced in 2020 after the publication of a correspondence highlighting the need for an AI-specific guideline in this field55. The initiative to develop the reporting guideline was registered with the EQUATOR Network in June 2020, and its development adhered to the EQUATOR Network toolkit for reporting guidelines54. A protocol that outlined the process for developing STARD-AI was subsequently published56.

Ethical approval was granted by the Imperial College London Joint Research Compliance Office (SETREC reference number: 19IC5679). Written informed consent was obtained from all participants in the online scoping survey, the patient focus group and the Delphi consensus study.

Candidate item generation

A three-stage approach was employed to generate candidate items, consisting of a systematic review, an online survey of experts and a patient and public involvement and engagement (PPIE) exercise. Details of this stage can be found in the study protocol56. First, a systematic review was conducted to identify relevant articles. A member of the Project Team (V.S.) performed a systematic search of MEDLINE and Embase databases through the Ovid platform, as well as a non-systematic exploration of Google Scholar, social networking platforms and articles personally recommended by Project Team members. Two authors (V.S. and H.A.) independently screened abstracts and full texts to identify eligible studies, with any disagreements resolved by discussion. This review built upon the findings of a prior systematic review conducted by members of the STARD-AI team, which evaluated the diagnostic accuracy of deep learning in medical imaging and highlighted widespread variability in study design, methodological quality and reporting practices32. Themes and material extracted from included articles were used to establish considerations unique to AI-based diagnostic accuracy studies and to highlight possible additions, removals or amendments to STARD 2015 items. These considerations were subsequently framed as potential candidate items.

Second, an online survey of 80 international experts was carried out. This generated over 2,500 responses, relating to existing STARD 2015 items and potential new items or considerations. Experts were selected to reflect the full diagnostic AI continuum, including those with expertise in conventional diagnostic modalities, AI development and statistical methods for diagnostic accuracy. This breadth of expertise was intended to ensure that candidate items reflected both the technical and clinical aspects of AI-centered diagnostic evaluation. Responses were grouped thematically to generate candidate items. Patients and members of the public were then invited to an online focus group, held through Zoom (Zoom Video Communications), to provide input as part of a PPIE exercise. This provided a patient perspective on issues that were not uncovered during the literature review or expert survey. Although no new domains were introduced from the PPIE exercise, participants placed increased emphasis on the importance of ethics and fairness, particularly in relation to how AI may impact different patient subgroups or exacerbate existing health disparities. As these elements were not a major focus of the original STARD guideline, their prioritization during the consensus process helped to refine the framing and inclusion of items in the final checklist. A list of 55 items, including 10 terminology-related items and 45 candidate checklist items, was finalized by the Project Team and Steering Committee and entered the modified Delphi consensus process.

Modified Delphi consensus process

Experts were invited to join the STARD-AI Consensus Group and participate in the online Delphi surveys as well as the consensus meeting. The Project Team and Steering Committee identified participants on the basis of their stakeholder roles, ensuring geographic and demographic diversity to maintain a representative panel. All invited participants were provided with written information about the study and given 3 weeks to respond to the initial invitation. The Delphi process included more than 240 international participants, including healthcare professionals, clinician scientists, academics, computer scientists, machine learning engineers, statisticians, epidemiologists, journal editors, industry leaders, health regulators, funders, patients, ethicists and health policymakers.

The first two rounds of the Delphi process were online surveys conducted on DelphiManager software (version 4.0), which is maintained by the Core Outcome Measures in Effectiveness Trials (COMET) initiative. Participants were asked to rate each item on a five-point Likert scale (1, very important; 2, important; 3, moderately important; 4, slightly important; 5, not at all important). Items receiving 75% or higher ratings of ‘very important’ or ‘important’ were immediately put forward for discussion in the final round. Items achieving 75% or more responses of ‘slightly important’ or ‘not at all important’ were excluded. Items that did not achieve either threshold were entered into the next round of the Delphi process. The 75% threshold was pre-set before the beginning of the process. Participants were also given the opportunity to provide free-text comments on any of the items considered or to suggest new items. These were used by the Project Team to rephrase, merge or generate new items for subsequent rounds. The stakeholder groups represented in the Delphi rounds are outlined in Supplementary Table 1. A full list of participants in the online survey and Delphi rounds is provided in the Supplementary Note.
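
For clarity, the prespecified 75% consensus rule can be expressed as a simple classification of each item's Likert ratings, as in the minimal sketch below; the example ratings are illustrative and are not actual Delphi responses.

```python
# Minimal sketch of the prespecified 75% consensus rule applied to Likert ratings
# (1 = very important ... 5 = not at all important). The example ratings are
# hypothetical, not actual Delphi data.
from collections import Counter

def classify_item(ratings, threshold=0.75):
    counts = Counter(ratings)
    n = len(ratings)
    top = (counts[1] + counts[2]) / n      # 'very important' or 'important'
    bottom = (counts[4] + counts[5]) / n   # 'slightly' or 'not at all important'
    if top >= threshold:
        return "put forward to the consensus meeting"
    if bottom >= threshold:
        return "exclude"
    return "carry into the next Delphi round"

print(classify_item([1, 1, 2, 1, 2, 2, 1, 3, 2, 1]))  # 90% top-two ratings -> put forward
print(classify_item([3, 2, 4, 1, 3, 5, 2, 3, 4, 2]))  # neither threshold met -> carry over
```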

The first round was conducted between 6 January and 20 February 2021. Invitations were extended to 528 participants in total, of whom 240 responded (response rate of 45%). Of the participants who responded, 209 fully completed the survey (completion rate of 87%). Forty-five candidate checklist items were rated after the multistage evidence generation process. Free-text comments were collected for these items and also for the 10 terminology items. Twenty-three candidate items achieved consensus for ‘very important’ or ‘important’ and were formally moved into the consensus meeting. Fifteen items were removed or replaced by an amended item based on participant feedback. Seven items did not achieve consensus, and 19 additional items were constructed after feedback from participants, resulting in 26 total items put forward to the second round. The second round was conducted between 21 April and 4 June 2021. Invitations were sent to 235 participants, of whom 203 responded (response rate of 86%), and 143 completed the survey (completion rate of 70%). Participants were again asked to rate each item and add free-text comments. A majority consensus was achieved for 22 items.

Forty-five items reached consensus over the first two rounds. As this was deemed too many to include in an instrument, a pre-consensus survey of 37 members of the Project Team, Steering Committee and other key external stakeholders was conducted to agree on a final list of items for discussion at the consensus meeting, achieving a 100% response rate. Participants were asked to rate whether each item should be included in the instrument as a standalone item, included in the accompanying explanation and elaboration document or excluded from the process. Twenty-two items received a majority consensus for inclusion in the final checklist; 13 items did not reach the 75% predefined threshold; and 10 items were excluded from the process. In total, 35 items were finalized for discussion at the consensus meeting.

The virtual consensus meeting took place on 1 November 2021 and was chaired by D.M. An information sheet was pre-circulated to all participants, and individual consent was obtained. In total, 22 delegates representing all of the key stakeholder groups attended the meeting. Items were discussed in turn to gain insight into content that warranted inclusion in the checklist, particularly focusing on the 13 items that did not reach consensus in the Delphi process. Voting on each item was anonymized using the Mentimeter software platform. After this, key members of the Steering Committee met to finalize the checklist based on the outcomes of the consensus meeting.