A practical framework for appropriate implementation and review of artificial intelligence (FAIR-AI) in healthcare

Wells, Brian J.; Nguyen, Hieu M.; McWilliams, Andrew; Pallini, Matt; Bovi, Amy; Kuzma, Andrew; Kramer, Justin; Chou, Shih-Hsiung; Hetherington, Timothy; Corn, Patricia; Taylor, Yhenneko J.; Cuison, Audrey; Gagen, Mary; Isreal, McKenzie

doi:10.1038/s41746-025-01900-y

Article
Open access
Published: 11 August 2025

Brian J. Wells¹,
Hieu M. Nguyen²,
Andrew McWilliams³,
Matt Pallini⁴,
Amy Bovi⁵,
Andrew Kuzma⁵,
Justin Kramer⁶,
Shih-Hsiung Chou⁴,
Timothy Hetherington⁴,
Patricia Corn⁷,
Yhenneko J. Taylor²,
Audrey Cuison⁸,
Mary Gagen⁸,
McKenzie Isreal² &
FAIR-AI Consortium

npj Digital Medicine volume 8, Article number: 514 (2025)

Abstract

Health systems face the challenge of balancing innovation and safety to responsibly implement artificial intelligence (AI) solutions. The rapid proliferation, growing complexity, ethical considerations, and rising demand for these tools require timely and efficient processes for rigorous evaluation and ongoing monitoring. Current AI evaluation frameworks often lack the practical guidance for health systems to address these challenges. To fill this gap, we developed a prescriptive evaluation framework informed by a literature review, in-depth interviews with key stakeholders, including patients, and a multidisciplinary design workshop. The resulting framework provides health systems an outline of the resources, structures, criteria, and template documents to enable pre-implementation evaluation and post-implementation monitoring of AI solutions. Health systems will need to treat this or any alternative framework as a living document to maintain relevance and effectiveness as the AI landscape and regulations continue to evolve.

Introduction

The healthcare industry is at an inflection point as the use of artificial intelligence-based tools rapidly expands, driven by the enhanced capabilities of modern electronic health record (EHR) systems and the advancement in artificial intelligence (AI) methods. The latest advancements in AI offer tremendous potential to improve patient outcomes, enhance patient experience, and increase efficiencies¹. However, if hasty deployment of AI solutions bypasses rigorous evaluation steps, AI may paradoxically produce untoward results, such as introducing or amplifying health inequities, creating wasteful care, and causing harm to those intended to be helped².

While AI has been used for clinical decision support in medicine for almost 50 years, evaluating the initial computer-based knowledge systems was relatively straightforward³. As AI use cases in healthcare expand, appropriately evaluating and monitoring AI solutions has become increasingly challenging due to more complex and, at times, inherently opaque AI models and methods with massive data requirements⁴. These challenges, combined with the rapid pace with which technology is being introduced and the increasing interest in utilizing innovative technologies, highlight the need for health systems to adopt new approaches for AI evaluation and governance. The approaches need to be consistent with the historically high standards healthcare has maintained for responsibly adopting new technology.

The necessity for oversight in healthcare is reflected in numerous publications demonstrating the gravity of potential risks that are uniquely present when AI intersects with decisions of consequence^5,6,7. To harness the benefits of AI while appropriately managing its risks, health systems need to implement intentional, practical AI evaluation and governance strategies. Despite the recent hype and growing ubiquity of AI solutions, standardized approaches for guiding the pre-implementation review and post-implementation monitoring of AI in healthcare remain limited. Although the European Union (EU) AI Act is legally enforceable in Europe, it has drawn criticism for its lack of clarity and flexibility in defining “high-risk” AI, particularly in healthcare, where risk is highly context-dependent, varying by the specific tool, degree of human oversight, and clinical use case. In the United States, frameworks such as the Food and Drug Administration (FDA)’s Software as a Medical Device (SaMD) guidance, National Institute of Standards and Technology (NIST)’s AI Risk Management Framework, and the AI Bill of Rights have emerged, but they are non-binding and provide limited practical guidance for implementation within real-world healthcare systems.

In the context of enterprise risk management, health systems seek to understand, quantify, and manage risk to all stakeholders, be that to patients, employees, or the organization. To effectively address the direct and indirect risks of implementing AI solutions, evaluation frameworks must be comprehensive, standardized, repeatable, and transparent. However, existing evaluation frameworks often fail to meet these criteria, as they tend to be overly theoretical, lack practical and actionable guidance, or focus too narrowly on specific aspects of risk^8,9,10,11.

Considering these limitations, our organization, a large health system spanning the southeast and midwestern U.S., set out to create a practical, comprehensive AI framework focused on responsible AI implementation that can be applied in various healthcare settings. This project, Framework for the Appropriate Implementation and Review of AI (FAIR-AI) in healthcare, was guided by three aims: (1) to incorporate best practice recommendations from existing frameworks, guidelines, and regulations; (2) to understand the expectations and needs for an AI evaluation framework from a diverse set of health system stakeholders including patients, providers, operational leaders, and AI developers; and (3) to leverage a multidisciplinary group to synthesize best practice guidance and align stakeholder needs into a practical framework.

Results

Best practices and key considerations—narrative review

As a first step to inform the construct of FAIR-AI, we conducted a narrative review to identify the best practices and key considerations related to responsibly deploying AI in healthcare; these are summarized in Table 1. The results are organized into several themes including validation, usefulness, transparency, and equity.

Table 1 Best practices and key considerations in implementation of artificial intelligence

Numerous publications and guidelines such as TRIPOD and TRIPOD-AI have described the reporting necessary to properly evaluate a risk prediction model, regardless of the underlying statistical or machine learning method^12,13. An important consideration in model validation is careful selection of performance metrics¹⁴. Beyond discrimination metrics like AUC, it is important to assess other aspects of model performance, such as calibration and the F-score, the latter of which is particularly useful in settings with imbalanced data. For models that produce a continuous risk probability, decision thresholds can be adjusted to maximize classification measures such as positive predictive value (PPV) depending on the specific clinical scenario. Decision Curve Analysis can help evaluate the tradeoff between true positives and false positives to determine whether a model offers practical value at a given clinical threshold¹⁵. For regression problems, besides Mean Square Error (MSE), other metrics such as Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) can also be examined¹⁶. It is important to establish a model’s real-world applicability through dedicated validation studies^17,18. The strength of evidence supporting validation and minimum performance standards should align with the intended use case, its potential risks, and the likelihood of performance variability once deployed based on the analytic approach or data sources (Supplementary Fig. 1)^14,17,18. Applying these traditional standards to evaluate the validity of generative AI models is uniquely challenging and frequently not possible. While the literature in this area is nascent, evaluation should still be performed and may require qualitative metrics such as user feedback and expert reviews, which can provide insights into performance, risks, and usefulness^19,20.
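To make the metrics discussed above concrete, the following is a minimal Python sketch (illustrative only, not code from the FAIR-AI project). It shows how PPV, sensitivity, and the F-score shift as the decision threshold moves, and how MSE, MAE, and MAPE are computed for a regression model. The function names and the toy data are hypothetical.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """PPV, sensitivity, and F1 when predicting positive for p >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    # Return None where a metric is undefined (zero denominator).
    ppv = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None
    return {"ppv": ppv, "sensitivity": sensitivity, "f1": f1}


def regression_errors(actual, predicted):
    """MSE, MAE, and MAPE (in percent); MAPE is undefined if any actual is 0."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    mape = 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
    return {"mse": mse, "mae": mae, "mape": mape}


# Hypothetical toy data: observed outcomes and predicted risk probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.55, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

for threshold in (0.5, 0.65):
    print(threshold, threshold_metrics(y_true, y_prob, threshold))

print(regression_errors([100, 200, 400], [110, 180, 400]))
```

In this toy example, raising the threshold increases PPV at the cost of sensitivity, which is the kind of tradeoff that should be weighed against the specific clinical scenario.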

Deploying and maintaining AI solutions in healthcare requires significant resources and carries the potential for both risk and benefits, making it essential to evaluate whether a tool delivers actual usefulness, or a net benefit, to the organization, clinical team, and patients^21,22. Decision analyses can quantify the expected value of medical decisions, but they often require detailed cost estimates and complex modeling. Formal net benefit calculations simplify this process by integrating the relative value of benefits versus harms into a single metric^18,23. However, a lack of objective data, the specific context, or the nature of the solution may render these calculations impractical. In these cases, net benefit provides a construct to guide qualitative discussions among subject matter experts, helping to weigh benefits and risks while considering workflows that mitigate risks. Additionally, a thorough assessment of clinical utility may require an impact study to evaluate a solution’s effects on factors such as resource utilization, time savings, ease of use, workflow integration, end-user perception, alert characteristics (e.g., mode, timing, and targets), and unintended consequences^9,22,24.
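As an illustration of a formal net benefit calculation, the sketch below uses the standard decision curve analysis definition, net benefit = TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability, and compares a model against the "treat everyone" strategy. This is a minimal sketch under stated assumptions: the function names and toy data are hypothetical and are not taken from the FAIR-AI materials.

```python
def _threshold_odds(threshold):
    """Odds of the threshold probability; weights the harm of a false positive."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    return threshold / (1 - threshold)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions with p >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * _threshold_odds(threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every patient regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * _threshold_odds(threshold)


# Hypothetical toy data: observed outcomes and predicted risk probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.55, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

for pt in (0.2, 0.5):
    print(pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
```

A model offers practical value at a given threshold only if its net benefit exceeds both the treat-all value and zero (the net benefit of treating no one).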

Given the potential for ethical and equity risks when deploying AI solutions in healthcare, transparency should be present to the degree that it is possible across all levels of the design, development, evaluation, and implementation of AI solutions to ensure fairness and accountability (https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf; http://data.europa.eu/eli/reg/2024/1689/oj)^25,26. Specifically, because AI can perpetuate biases that could result in over- or under-treatment of certain populations, there must be a clear and defensible justification for including predictor variables that have historically been associated with discrimination, such as those outlined in the PROGRESS-Plus framework: place of residence, race/ethnicity/culture/language, occupation, gender/sex, religion, education, socioeconomic status, social capital, and personal characteristics linked to discrimination (e.g., age, disability, sexual orientation)^21,27,28,29. This is particularly important when these variables may act as proxies for other, more meaningful determinants of health. It is equally important to evaluate for patterns of algorithmic bias by monitoring outcomes for discordance between patient subgroups, as well as ensuring equal access to the AI solution itself when applicable^10,25,30,31. Once an AI solution is implemented, transparency for end-users becomes a critical element for building trust and confidence, as well as empowering users to play a role in vigilance for potential unintended consequences. To achieve this post-implementation transparency, end-users should have information readily available that explains an AI solution’s intended use, limitations, and potential risks (https://www.fda.gov/medical-devices/software-medical-device-samd/transparency-machine-learning-enabled-medical-devices-guiding-principles)³².

Transparency is also critical from the patient’s perspective. There is an ethical imperative to notify patients when AI is being used and, when appropriate, to obtain their consent, particularly in sensitive or high-stakes situations^33,34. This obligation is heightened when there is no human oversight, when the technology is experimental, or when the use of AI is not readily apparent. Failing to disclose the use of AI in such contexts may undermine patient autonomy and erode trust in the healthcare system. Generative AI presents unique challenges in terms of transparency. For example, deep learning relies on vast numbers of parameters drawn from increasingly large datasets and may be inherently unexplainable. When transparency is lacking, there should be a greater emphasis on human oversight and education on limitations and risks, and this is an area of ongoing research²⁰.
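The subgroup monitoring described above (checking for discordance in outcomes between patient subgroups) can be sketched as follows. This is a hypothetical illustration, not the authors' method: the record layout, the function names, and the choice of sensitivity and PPV as the compared metrics are all assumptions, and any tolerance for an acceptable gap would need to be set locally.

```python
from collections import defaultdict


def subgroup_rates(records, threshold):
    """Per-subgroup sensitivity and PPV.

    records: iterable of (group, y_true, y_prob) tuples (hypothetical layout).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, y, p in records:
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            counts[group]["tp"] += 1
        elif predicted_positive and y == 0:
            counts[group]["fp"] += 1
        elif not predicted_positive and y == 1:
            counts[group]["fn"] += 1
    rates = {}
    for group, c in counts.items():
        pos = c["tp"] + c["fn"]
        flagged = c["tp"] + c["fp"]
        rates[group] = {
            "sensitivity": c["tp"] / pos if pos else None,
            "ppv": c["tp"] / flagged if flagged else None,
        }
    return rates


def max_gap(rates, metric):
    """Largest difference in a metric across subgroups (None if undefined)."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values) if values else None


# Hypothetical toy records for two subgroups, A and B.
records = [
    ("A", 1, 0.9), ("A", 1, 0.8), ("A", 0, 0.3), ("A", 0, 0.6),
    ("B", 1, 0.7), ("B", 1, 0.4), ("B", 0, 0.2), ("B", 0, 0.1),
]
rates = subgroup_rates(records, threshold=0.5)
print(rates)
print("sensitivity gap:", max_gap(rates, "sensitivity"))
```

A large gap does not by itself establish bias, but it marks a discordance that reviewers would want to investigate and explain.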

Stakeholder needs and priorities—interviews

Several systematic reviews emphasize the importance of stakeholder engagement in the design and implementation of AI solutions in healthcare; however, this aspect is often overlooked in the existing frameworks^35,36. To create a practical and useful framework for health systems, we borrowed from user-centric design principles to first assess stakeholders’ priorities for an AI framework and their criteria for evaluating its successful implementation. We interviewed stakeholders including health system leaders, AI developers, providers, and patients. Our findings were previously presented at the 17th Annual Conference on the Science of Dissemination and Implementation, hosted by AcademyHealth³⁷.

The stakeholders expressed multiple priorities for an AI framework, particularly the need for: (1) risk tolerance assessments to weigh the potential patient harms of an AI solution against expected benefits, (2) a human in the loop for any medical decisions made using an AI solution, (3) consideration that available, rigorous evidence may be limited when reviewing new AI solutions, and (4) awareness that solutions may not have been developed on diverse patient populations or data similar to the population in which a use case is proposed. Interviewees also highlighted the importance of ensuring that AI solutions are matched to institutional priorities and conform to all relevant regulations. They noted regulations can pose unique challenges for large, multi-state health systems. While patient safety and outcomes were identified as paramount, stakeholders also detailed the need for an AI framework to evaluate the impact of potential solutions on health system employees.

When evaluating the successful implementation and utilization of an AI framework, stakeholders were consistent in explaining that the review process must operate in a timely manner, provide clear guidelines for AI developers, and ensure fair and consistent review processes that are applicable for both internally and externally developed solutions. Multiple interviewees cited the challenges presented by the rapid pace of AI innovation, expressing concerns that an overly bureaucratic and time-consuming review process could hinder the health system’s ability to keep pace with the wider healthcare market. Similarly, multiple senior leaders and AI developers explained that a successful AI framework would both encourage internal innovation and streamline the implementation of AI solutions in a safe manner.

Framework for the appropriate implementation and review of AI (FAIR-AI) in healthcare

Findings from stakeholder interviews informed our design workshop efforts, which included health system leaders and experts in AI, with workshop participants providing explicit guidance on how to best construct the FAIR-AI to meaningfully integrate stakeholder feedback. The project team leveraged design workshop activities and participant expertise to develop a set of requirements for health systems seeking to implement AI responsibly. FAIR-AI provides a detailed outline of: (i) foundational health system requirements—artifacts, personnel, processes, and tools; (ii) inclusion and exclusion criteria that specifically detail which AI solutions ought to be evaluated by FAIR-AI, thus defining scope and ensuring accountability; (iii) review questions in the form of a low-risk screening checklist and an in-depth review that provides a comprehensive evaluation of risk and benefits across the areas of development, validation, performance, ethics and equity, usefulness, compliance and regulations; (iv) discrete risk categories that map to the review criteria and are assigned to each AI solution and its intended use case; (v) safe implementation plans including monitoring and transparency requirements; (vi) an AI Label that consolidates information in an understandable format. These core components of FAIR-AI are also displayed in Fig. 1.

Implementing a responsible AI framework requires that health systems have certain foundational elements in place: (i) artifacts, including a set of guiding principles for AI implementation and an AI ethics statement (examples are shown in Supplementary Table 1), both of which should be endorsed at the highest level of the organization; (ii) personnel, including an individual (or a team) with data science training who is accountable for reviews; (iii) a process for escalation to an institutional decision-making body with the multidisciplinary expertise needed to assess ethical, legal, technical, operational, and clinical implications, with the authority to act; and (iv) an inventory tool that serves as a single-source-of-truth catalog and enables accountability for review, monitoring, and transparency requirements. It is important to establish that the AI evaluation framework does not replace but rather supports existing governance structures. Additionally, while the overarching structure of an AI governance framework like FAIR-AI may remain consistent over time, the rapid pace of change in technology and regulations requires a process for regular review and updating by subject matter experts.

As the first step in FAIR-AI, an AI solution needs to go through an intake process. Individual leaders who are responsible for the deployment of AI solutions within the enterprise are designated as business owners; for clinical solutions, the business owner is a clinical leader. In this framework, we require the business owner of an AI solution to provide a set of descriptive items through an intake form including: (i) existing problem to solve; (ii) clearly outlined intended use case; (iii) expected benefits; (iv) risks including worst-case scenario(s); (v) published and unpublished information on development, validation, and performance; and (vi) FDA approvals, if applicable.
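The six intake items listed above lend themselves to a simple structured record with a completeness check. The sketch below is one hypothetical way to represent it. The field names, the treatment of FDA approvals as the only optional item, and the helper function are assumptions made for illustration, not the actual FAIR-AI intake form.

```python
from dataclasses import dataclass, fields

# Assumption: FDA approvals are provided only "if applicable", so they are optional.
OPTIONAL_ITEMS = {"fda_approvals"}


@dataclass
class IntakeForm:
    """Hypothetical intake record mirroring intake items (i) through (vi)."""
    problem: str = ""            # (i) existing problem to solve
    intended_use: str = ""       # (ii) clearly outlined intended use case
    expected_benefits: str = ""  # (iii) expected benefits
    risks_worst_case: str = ""   # (iv) risks including worst-case scenario(s)
    evidence: str = ""           # (v) development, validation, performance info
    fda_approvals: str = ""      # (vi) FDA approvals, if applicable


def missing_items(form):
    """Names of required intake items that were left blank."""
    return [
        f.name
        for f in fields(form)
        if f.name not in OPTIONAL_ITEMS and not getattr(form, f.name).strip()
    ]


# Hypothetical submission with only two of the required items completed.
form = IntakeForm(
    problem="Missed follow-up on incidental findings",
    intended_use="Flag radiology reports that recommend follow-up imaging",
)
print(missing_items(form))
```

A check of this kind gives reviewers an objective basis for returning a request to the business owner before any review effort is spent.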

Next, we describe the inclusion and exclusion criteria for AI solutions to be applicable to FAIR-AI. Based on the premise that enterprise risk management must cast a wide net to be aware of potential risks, the inclusion for FAIR-AI review starts with a broad, general definition of AI solutions, which intentionally also includes solutions that do not directly relate to clinical care. We adopted the definition of AI from Matheny et al., as “computer system(s) capable of activities normally associated with human cognitive effort”³⁸. We then provide additional scope specificity by excluding three general areas of AI. First, simple scoring systems and rules-based tools for which an end-user can reasonably be expected to evaluate and take responsibility for performance. Second, any physical medical device that also incorporates AI into its function, as there are well-established FDA regulations in place to evaluate and monitor risks associated with these devices (https://www.fda.gov/medical-devices/classify-your-medical-device/how-determine-if-your-product-medical-device). Third, any AI solution being considered under an Institutional Review Board (IRB)-approved research protocol that includes informed consent for the use of AI when human subjects are involved. Inclusion and exclusion criteria like these will need to be adapted to a health system’s local context.

Risk evaluation considers the magnitude and importance of adverse consequences from a decision, which in the case of FAIR-AI is the decision to implement a new AI solution³⁹. As there are numerous approaches and nomenclatures to define risk, local consensus on a clear definition is a critical initial step for a health system. We aimed for simplicity in our risk definition and the number of risk categories to ensure interpretability by diverse stakeholders. Additionally, we opted to pursue a qualitative determination of risk and avoid a purely quantitative, composite risk score approach. The requisite data rarely exist to perform such risk calculations reliably, and composites of weighted scores have the potential to dilute important individual risk factors as well as the nuance of risk mitigation offered by the workflows surrounding AI solutions (for example, requiring a human review of AI output before an action is taken). Thus, FAIR-AI determines the magnitude and importance of potential adverse effects through consensus between subject matter experts from a data science team, the business leader requesting the AI solution, and ad hoc consultation when additional expertise is needed. In this exercise, the group leverages published data and expert opinion to outline hypothetical worst-case scenarios and the harms that could occur as an indirect or direct result of output from the proposed AI solution. The consensus determines whether those harms are minor or not minor and, if not minor, whether they are sufficiently mitigated by the related implementation workflow and monitoring plan. This risk framework is similar to that proposed by the International Medical Device Regulators Forum (https://www.imdrf.org/documents/software-medical-device-possible-framework-risk-categorization-and-corresponding-considerations). It is important to note here that every AI solution should be reviewed within the context of its intended use case, which includes the surrounding implementation workflows.

As prioritized by our stakeholders, a responsible AI framework should be nimble enough to allow quick but thorough reviews of AI solutions that have a low chance of causing any harm to an individual or the organization. To that end, FAIR-AI incorporates a 2-step process: an initial low-risk screening pathway and a subsequent in-depth review pathway for all solutions that do not pass through the low-risk screen. For an AI solution to be designated low-risk, it must pass all the low-risk screening questions (Table 2). Should answers to any of the screening questions suggest potential risks, the AI solution moves on to an in-depth review guided by the questions presented in Table 3. The in-depth review involves closer scrutiny of the AI solution by the data scientist and business owner and mandates a higher burden of proof that the potential benefits of the solution outweigh the potential risks identified during the screening process. If any of the in-depth review questions results in a determination of high risk, then the solution is considered high risk. It is also possible that the discussion between the data scientist and business owner will lead to a better understanding of the solution that changes the answers to one or more of the low-risk screening questions, resulting in a low-risk designation.

Table 2 Low-risk screening questions

Table 3 In-depth review questions

After the FAIR-AI review, which is described in detail in the next section, each AI solution is designated as low, moderate, or high risk according to the following definitions (Fig. 2):

Low risk: Potential adverse effects are expected to be minor and should be apparent to the end-user and business owner. No ethical, equity, compliance, or regulatory concerns were identified during a low-risk screen.
Moderate risk: Based on an in-depth review, one or more of the following are present: (1) potential adverse effects are not minor but are adequately addressed by workflows; (2) ethical, equity, compliance, or regulatory issues are suspected or present, but are appropriately mitigated.
High risk: Based on an in-depth review, one or more of the following are present: (1) potential adverse effects are notable and could have a significant negative impact on patients, teammates, individuals, or the enterprise; (2) ethical, equity, compliance, or regulatory issues are suspected or present, but are not adequately addressed; (3) insufficient evidence exists to recommend proceeding with implementation.

**Fig. 2: Risk categories as determined by FAIR-AI evaluation and escalation to AI Governance.**

For our health system, all AI solutions designated as high risk are escalated to the AI Governance committee where they undergo a multidisciplinary discussion. The discussion results in one of three final designations: (i) proceed to implementation under high-risk conditions; (ii) proceed to a pilot or research study; or (iii) do not proceed, because implementation would create an intolerable risk for the organization.
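The two-step review and the escalation rule described above can be summarized as a small decision function. The sketch below is a simplified illustration under stated assumptions: the actual determinations are qualitative and reached by consensus, so the boolean inputs here merely stand in for the reviewers' answers to the Table 2 and Table 3 questions, and all names are hypothetical.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


def designate_risk(screen_passed, in_depth_high=None):
    """Assign a risk category from reviewer determinations.

    screen_passed: one bool per low-risk screening question (True = passed).
    in_depth_high: one bool per in-depth review question (True = judged high
        risk); required only when the low-risk screen is not fully passed.
    """
    if not screen_passed:
        raise ValueError("at least one screening answer is required")
    if all(screen_passed):
        return Risk.LOW
    if in_depth_high is None:
        raise ValueError("in-depth review is required after a failed screen")
    # Any single high-risk determination makes the whole solution high risk.
    return Risk.HIGH if any(in_depth_high) else Risk.MODERATE


def needs_governance_escalation(risk):
    """High-risk solutions go to the AI Governance committee."""
    return risk is Risk.HIGH


# Hypothetical reviews.
print(designate_risk([True, True, True]))
print(designate_risk([True, False, True], in_depth_high=[False, False]))
print(designate_risk([True, False, True], in_depth_high=[False, True]))
```

The function deliberately avoids a weighted composite score, consistent with the rationale that a single important risk factor should not be diluted by averaging.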

The FAIR-AI framework is designed to encompass the full range of AI solutions in healthcare, including many that will not require in-depth review and can be designated low risk—such as those supporting back-office functions, cybersecurity, or administrative automation. Examples of moderate-risk AI tools in healthcare include solutions that support—but do not replace—clinical or administrative decision-making. These tools may influence patient care or documentation, but their outputs are generally explainable, subject to human review, and integrated into existing workflows that help mitigate risk. Examples of high-risk AI tools in healthcare include those that directly influence clinical care, diagnostics, or billing—particularly when used without consistent human oversight. They may also be deployed in sensitive contexts, such as end-of-life care or other high-stakes medical decisions. These tools often rely on complex, opaque models that can perpetuate bias, affect decision-making, and lead to significant downstream consequences if not rigorously validated and continuously monitored.

After application of the low-risk screening questions, the in-depth review questions (if necessary), and completion of the AI Governance committee review (if necessary), the proposed solution is assigned a final risk category, and a FAIR-AI Summary Statement is completed (an example is presented in Supplementary Box 1). At this point, an AI solution may need to go through other traditional governance requirements like a cyber security review, financial approvals, etc. If the AI solution ultimately is designated to move forward with implementation, then the data science team and business owners collaboratively develop a Safe AI Plan as outlined below.

The first component of the Safe AI Plan concerns monitoring requirements. Implemented AI solutions need continuous monitoring as they may fail to adapt to new data or practice changes, which can lead to inaccurate results and increasing bias over time^40,41. Similarly, when AI solutions are made readily available in workflows, it becomes easier for the solution to be used outside of its approved intended use case, which may change its inherent risk profile. For these reasons, FAIR-AI requires a monitoring plan for every deployed AI solution consisting of an attestation by the business owner at regular intervals. The attestation affirms that: (i) the deployment is still aligned with the approved use case; (ii) the underlying data and related workflows have not substantially changed; (iii) the AI solution is delivering the expected benefit(s); (iv) no unforeseen risks have been identified; and (v) there are no concerns noted related to new regulations. If the original FAIR-AI review identified specific risks, then the attestation also includes an approach to evaluate each risk along with metrics (if applicable). These evaluation metrics may range from repeating a standard model performance evaluation to obtaining periodic end-user feedback on accuracy (e.g., for a generative AI solution). The second component of the Safe AI Plan is transparency requirements. All solutions categorized as high risk also require an AI Label (Fig. 3) and end-user education at regular intervals. In situations where an end-user could potentially not be aware they are interacting with AI instead of a human, the business owner must also design implementation workflows that create transparency for the end-user (e.g., an alert, disclaimer, or consent as applicable).
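The Safe AI Plan requirements described above can be sketched as a periodic attestation record plus a rule for transparency requirements. This is a hypothetical illustration: the field names paraphrase attestation items (i) through (v), and the functions and string labels are assumptions rather than the actual FAIR-AI templates.

```python
from dataclasses import dataclass, fields


@dataclass
class Attestation:
    """Business owner's periodic attestation, items (i) through (v)."""
    aligned_with_approved_use_case: bool
    data_and_workflows_unchanged: bool
    delivering_expected_benefit: bool
    no_unforeseen_risks: bool
    no_new_regulatory_concerns: bool


def failed_items(attestation):
    """Attestation items that could not be affirmed and need follow-up."""
    return [
        f.name for f in fields(attestation)
        if not getattr(attestation, f.name)
    ]


def transparency_requirements(risk_category, may_be_mistaken_for_human):
    """Transparency steps implied by the risk category and the workflow."""
    requirements = []
    if risk_category == "high":
        requirements.append("AI Label")
        requirements.append("end-user education at regular intervals")
    if may_be_mistaken_for_human:
        requirements.append("workflow disclosure (alert, disclaimer, or consent)")
    return requirements


# Hypothetical attestation in which the underlying data have changed.
attestation = Attestation(True, False, True, True, True)
print(failed_items(attestation))
print(transparency_requirements("high", may_be_mistaken_for_human=True))
```

Any item that cannot be affirmed would prompt a fresh look at the solution, since a change in data, workflow, or use case may alter its risk profile.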

Discussion

Health systems are under growing pressure to adopt an increasingly wide array of AI solutions, some of which have enormous potential to transform healthcare, but many of which also introduce complex potential risks. The FAIR-AI framework described in this paper offers a prescriptive, practical, and scalable approach for evaluating AI solutions for use in healthcare. We have distilled the approach into a concise set of questions that a data science team member can use to quickly triage AI solutions, triggering a more time-intensive, rigorous review only when necessary. For example, since the implementation of FAIR-AI within our health system, approximately 50% of the reviewed AI solutions have been triaged as low-risk. This practical approach is necessary given the volume of new solutions released and as AI becomes more ubiquitous across healthcare. By establishing formal review criteria and a consistent risk assessment process, institutions can ensure well-documented, defensible recommendations. Ultimately, by implementing FAIR-AI or a similar framework, health systems can foster a culture that upholds high standards for both internally developed and vendor-developed AI solutions, protecting patients and the care team while remaining early adopters that harness the actual benefits of AI.

There are many challenges to implementing and maintaining the framework we have developed. Successful implementation requires support from institutional leadership, along with the allocation of resources to maintain documentation, manage new requests, and ensure proper monitoring. Team members tasked with screening requests must be empowered to reject requests for solutions that do not provide adequate documentation for a thorough review; otherwise, the process may become slow and inefficient as they search for information. In our early experience, we have found many AI solutions lack the evidence needed to support implementation and first require further research or pilot testing, which demands substantial resources from either the health system or the vendor. Generative AI solutions present significant challenges when they intersect with patient care, particularly around the difficulty in explaining how a tool functions, the opaque nature of the data used for training, the lack of standardized performance, the extensive manual effort required to review output, the need for infrastructure to obtain user feedback, and mechanisms for reporting inaccuracies. An often overlooked but critical challenge to the responsible implementation of AI is the significant training required for both evaluators and end-users. Several recently published guidelines provide structured approaches for assessing the reliability and transparency of large language models in healthcare. We recognize the importance of these emerging frameworks and plan to expand our AI evaluation framework to incorporate relevant elements from them. However, integrating these considerations will take time, as adapting existing validation strategies for generative AI requires careful refinement to ensure a practical, efficient, and reproducible process that aligns with stakeholder needs^42,43.

At our organization, we plan to review and adapt FAIR-AI at least annually, due to the rapid changes in the field and regulatory environment. For example, AI tools themselves are being used increasingly to monitor other AI solutions for safety, and future iterations of FAIR-AI will need to account for this evolving area. As AI solutions become pervasive across most workflows, all teammates play a role in being vigilant with an awareness of AI’s inherent limitations, security risks, and ethical considerations. To address this need to democratize responsibility, we are developing accompanying education that will enhance our organization’s responsible AI culture.

There are numerous limitations to our approach to evaluating AI solutions as described in this paper. Our evaluation and monitoring processes require a significant commitment of time and resources. Some health systems may choose to rely only on evaluations provided by other entities, which reduces the burden on the health system and speeds up the adoption of new AI tools; however, this may introduce inherent bias and conflicts of interest. For smaller healthcare systems, regional partnerships or strategic relationships will likely need to be considered as an alternate escalation pathway for high-risk solutions, but this topic is beyond the scope of this manuscript. Regardless, our framework can help smaller organizations develop a structured approach to weighing the risks and benefits of AI. While the screening and in-depth review questions provide a structured approach, they are not exhaustive, and the effectiveness of the framework depends on the diligence and expertise of the evaluators. Additionally, organizations will need to modify this framework to meet their needs and risk tolerance and to ensure alignment with local regulatory requirements. Modifications may also be needed to ensure the screening and in-depth review questions are clear and yield consistent risk determinations across different reviewers. Future qualitative evaluations can explore areas that are unclear or that lead to discrepancies between reviewers and thus need further refinement.

FAIR-AI provides a practical template for health systems to adopt a process for the rigorous evaluation and monitoring of AI solutions. The prescriptive framework guided by explicit criteria is intentionally designed for health systems to use at the speed and scale required in real-world settings. This framework will enable institutions to balance the desire to adopt innovative solutions with the need to maintain the highest standards for patient and care team safety.

Methods

Best practices and key considerations—narrative review

We conducted a narrative review to inform the development of our AI evaluation framework, opting for a pragmatic and expert-guided approach rather than a formal and focused scoping or systematic review. Given the existence of recent systematic reviews on this topic, our objective was not to comprehensively catalog or compare existing frameworks, but to synthesize insights most relevant to real-world implementation^35,36. This approach allowed us to prioritize issues based on stakeholder input, domain knowledge, and practical relevance.

For the narrative review, we utilized Google Scholar as the primary search engine to locate pertinent published frameworks and papers. Search terms included: framework, guideline, evaluation, monitoring, transparency, explainability, artificial intelligence, validation, informatics, clinical decision support, ethics, equity, regulatory, legal, usefulness, risks, benefits, implementation, deployment, predictive model, machine learning, clinical utility, health. Additionally, we reviewed institutional guidelines from the European Union’s Artificial Intelligence Act, the National Institute of Standards and Technology (NIST), and the U.S. Food and Drug Administration (FDA) and conducted citation tracking to identify influential works.

Stakeholder needs and priorities—interviews

From March to April 2024, we conducted semi-structured interviews with executive leadership (N = 3), senior risk, compliance, and legal leaders (N = 6), data developers (N = 4), providers (N = 5), and patients (N = 5) from across our health system. We utilized purposive sampling methods to ensure we obtained stakeholder feedback from the five user domains (e.g., executive leadership) that we felt would be most impacted by the implementation of an AI framework^44. Interviewees were identified by members of the study team, with recruitment outreach occurring either via email (for health system teammates) or telephone (patients). All potential participants were provided with information on the scope of the project, with interviews being scheduled for those interested. An interview guide was collaboratively developed by the study team, which included physicians, faculty, and health system leaders with expertise in ethics, equity, data science, and care delivery (Supplementary Note 1). Each interview lasted approximately 30 minutes, was completed via telephone or videoconference, and was facilitated by a male member of the study team, who is a PhD-level sociologist (JK) and holds a faculty appointment. All participants provided verbal consent prior to interviews commencing. Interviews were audio recorded and transcribed verbatim, with ATLAS.ti software aiding data analysis efforts. Transcripts were analyzed using both inductive and deductive coding methodologies, with thematic analysis employed to identify and organize emergent themes in the data. Three members of the study team collaboratively developed the coding dictionary (BJW, JK, AM), with the qualitative lead (JK) independently coding all transcripts and bringing any questions back to the team for review. Participants did not assist the study team with transcription verification, data analysis, or interpretation of findings. We followed the Consolidated criteria for Reporting Qualitative research for sharing our findings (Supplementary Note 2).

Expert consensus—design workshop

We convened a half-day, in-person workshop to synthesize the best practices identified from the literature review, the priorities outlined by stakeholders, and the consensus recommendations from a diverse team of subject matter experts. This workshop provided an opportunity to bring health system leaders and AI experts together to review study team findings and leverage their expertise to advance FAIR-AI development. While this workshop included ample discussion and a review of study team findings, it was not itself a data collection activity; rather, it focused on progressing the structure and development of FAIR-AI. Workshop participants included individuals with expertise in legal affairs, regulatory compliance, cyber security, ethics, clinical care, clinical informatics, data science, and research (N = 33). As the starting point for the workshop, the primary project team created a draft framework outline. This outline, along with background information, pertinent literature, and summaries of stakeholder needs, was shared with attendees for review prior to the meeting.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request. Interview data are not made publicly available to protect the confidentiality of the interviewees, including senior leader participants.

References

1. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
2. Dedehayir, O. & Steinert, M. The hype cycle model: A review and future directions. Technol. Forecast. Soc. Change 108, 28–41 (2016).
3. Shortliffe, E. H. & Buchanan, B. G. A model of inexact reasoning in medicine. Math. Biosci. 23, 351–379 (1975).
4. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
5. Birkstedt, T., Minkkinen, M., Tandon, A. & Mäntymäki, M. AI governance: themes, knowledge gaps and future agendas. Internet Res. 33, 133–167 (2023).
6. Taeihagh, A. Governance of artificial intelligence. Policy Soc. 40, 137–157 (2021).
7. Mäntymäki, M., Minkkinen, M., Birkstedt, T. & Viljanen, M. Defining organizational AI governance. AI Ethics 2, 603–609 (2022).
8. Bedoya, A. D. et al. A framework for the oversight and local deployment of safe and high-quality prediction models. J. Am. Med. Inform. Assoc. JAMIA 29, 1631–1636 (2022).
9. Reddy, S. et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform. 28, e100444 (2021).
10. Vollmer, S. et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 368, l6927 (2020).
11. van der Vegt, A. H. et al. Implementation frameworks for end-to-end clinical AI: derivation of the SALIENT framework. J. Am. Med. Inform. Assoc. JAMIA 30, 1503–1515 (2023).
12. Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
13. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. 13, 1–10 (2015).
14. Steyerberg, E. W. et al. Assessing the performance of prediction models: A framework for some traditional and novel measures. Epidemiol. Camb. Mass 21, 128 (2010).
15. Vickers, A. J. & Elkin, E. B. Decision curve analysis: A novel method for evaluating prediction models. Med. Decis. Mak. Int. J. Soc. Med. Decis. Mak. 26, 565–574 (2006).
16. Botchkarev, A. A new typology design of performance metrics to measure errors in machine learning regression algorithms. Interdiscip. J. Inf. Knowl. Manag. 14, 045–076 (2019).
17. Moons, K. G. M., Altman, D. G., Vergouwe, Y. & Royston, P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ 338, b606 (2009).
18. Moons, K. G. M. et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart 98, 691–698 (2012).
19. Park, Y.-J. et al. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med. Inform. Decis. Mak. 24, 72 (2024).
20. Bandi, A., Adapa, P. V. S. R. & Kuchi, Y. E. V. P. K. The power of generative AI: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet 15, 260 (2023).
21. Wiens, J. et al. Do no harm: A roadmap for responsible machine learning for health care. Nat. Med. 25, 1337–1340 (2019).
22. Scott, I., Carter, S. & Coiera, E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health Care Inform. 28, e100251 (2021).
23. Kappen, T. H. et al. Evaluating the impact of prediction models: Lessons learned, challenges, and recommendations. Diagn. Progn. Res. 2, 11 (2018).
24. Osheroff, J. A. et al. A roadmap for national action on clinical decision support. J. Am. Med. Inform. Assoc. JAMIA 14, 141 (2007).
25. Blackman, R. Ethical Machines: Your Concise Guide to Totally Unbiased, Transparent, and Respectful AI. (Harvard Business Review Press, 2022).
26. Dankwa-Mullan, I. et al. A Proposed Framework on Integrating Health Equity and Racial Justice into the Artificial Intelligence Development Lifecycle. J. Health Care Poor Underserved 32, 300–317 (2021).
27. Paulus, J. K. & Kent, D. M. Race and ethnicity: A part of the equation for personalized clinical decision making? Circ. Cardiovasc. Qual. Outcomes 10, e003823 (2017).
28. Paulus, J. K. & Kent, D. M. Predictably unequal: Understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digit. Med. 3, 99 (2020).
29. O’Neill, J. et al. Applying an equity lens to interventions: Using PROGRESS ensures consideration of socially stratifying factors to illuminate inequities in health. J. Clin. Epidemiol. 67, 56–64 (2014).
30. Liu, X. et al. The medical algorithmic audit. Lancet Digit. Health 4, e384–e397 (2022).
31. Vasey, B. et al. Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. BMJ 377, e070904 (2022).
32. Zerilli, J., Bhatt, U. & Weller, A. How transparency modulates trust in artificial intelligence. Patterns 3, 100455 (2022).
33. Hurley, M. E., Lang, B. H., Kostick-Quenet, K. M., Smith, J. N. & Blumenthal-Barby, J. Patient consent and the right to notice and explanation of AI systems used in health care. Am. J. Bioeth. AJOB 25, 102–114 (2025).
34. Pflanzer, M. Balancing transparency and trust: Reevaluating AI disclosure in healthcare. Am. J. Bioeth. AJOB 25, 153–156 (2025).
35. de Hond, A. A. H. et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. npj Digit. Med. 5, 2 (2022).
36. Crossnohere, N. L., Elsaid, M., Paskett, J., Bose-Brill, S. & Bridges, J. F. P. Guidelines for artificial intelligence in medicine: Literature review and content analysis of frameworks. J. Med. Internet Res. 24, e36823 (2022).
37. Kramer, J. et al. Developing a Framework for the Review and Oversight of Artificial Intelligence at a Large Healthcare Enterprise: Assessing the Needs and Priorities of Senior Health System Leadership, Providers, and Community Stakeholders. in (AcademyHealth, 2024).
38. Matheny, M. E., Whicher, D. & Thadaney Israni, S. Artificial Intelligence in Health Care: A Report From the National Academy of Medicine. JAMA 323, 509–510 (2020).
39. Fischhoff, B., Watson, S. R. & Hope, C. Defining risk. Policy Sci. 17, 123–139 (1984).
40. Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 1–9 (2022).
41. Lu, J. et al. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. pp. 1–1 https://doi.org/10.1109/TKDE.2018.2876857 (2018).
42. Gallifant, J. et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat. Med. 31, 60–69 (2025).
43. Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 1–20 (2024).
44. Campbell, S. et al. Purposive sampling: Complex or simple? Research case examples. J. Res. Nurs. JRN 25, 652 (2020).

Acknowledgements

The study was supported by the Duke Endowment under award number AWD00002292. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Duke Endowment. We would like to express our gratitude to Dr. Reid Blackman for his valuable feedback on the design of the framework and Sally Baek and Michael Johnson from Atrium Health for their critical support with organizing the design workshop.

Author information

Authors and Affiliations

Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA
Brian J. Wells & Nicholas M. Pajewski
Center for Health System Sciences, Atrium Health, Charlotte, NC, USA
Hieu M. Nguyen, Yhenneko J. Taylor & McKenzie Isreal
Department of Internal Medicine, Division of Hospital Medicine, Atrium Health, Charlotte, NC, USA
Andrew McWilliams
Advanced Analytics, Enterprise Data Science, Advocate Health, Charlotte, NC, USA
Matt Pallini, Shih-Hsiung Chou, Timothy Hetherington, Michael S. Carroll & Jason Heuay
Conflict, Ethics, and Influence Program, Advocate Health, Milwaukee, WI, USA
Amy Bovi, Andrew Kuzma & Natalie Hardy
Department of Family and Community Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
Justin Kramer
Compliance and Integrity, Advocate Health, Winston-Salem, NC, USA
Patricia Corn
Advanced Analytics, Advocate Health, Charlotte, NC, USA
Audrey Cuison & Mary Gagen
Department of Cardiovascular Medicine, Wake Forest University School of Medicine, Winston-Salem, NC, USA
Oguz Akbilgic
Advocate Health, Milwaukee, WI, USA
Katie Barr
Innovation and Commercialization, Advocate Health, Charlotte, NC, USA
Alicia Bowers
Clinical Ethics, Advocate Health, Chicago, IL, USA
Rikki Caffrey
Family Medicine, Atrium Health, Charlotte, NC, USA
Matthew CiRullo
Department of Pediatrics, Wake Forest University School of Medicine, Winston-Salem, NC, USA
Stephen M. Downs
Advocate Health, Oakbrook, IL, USA
Kristina Katzovitz
Advocate Health, Winston-Salem, NC, USA
Eric Kirkendall
Patient Safety, Advocate Health, Milwaukee, WI, USA
Elsie Lindgren
Office of the General Counsel, Atrium Health, Charlotte, NC, USA
Lindsey Lonergan & Gabe Wright
Cybersecurity Governance, Risk and Compliance, Advocate Health, Milwaukee, WI, USA
Elissa McKinley
Audit Services and Enterprise Risk Management, Advocate Health, OakBrook, IL, USA
Laura Sak-Castellano
Virtual Critical Care (Southeast Region Critical Care), Atrium Health, Charlotte, NC, USA
Erika Setliff

Consortia

FAIR-AI Consortium

Oguz Akbilgic, Katie Barr, Amy Bovi, Alicia Bowers, Rikki Caffrey, Michael S. Carroll, Shih-Hsiung Chou, Matthew CiRullo, Patricia Corn, Audrey Cuison, Stephen M. Downs, Mary Gagen, Natalie Hardy, Timothy Hetherington, Jason Heuay, McKenzie Isreal, Kristina Katzovitz, Eric Kirkendall, Justin Kramer, Andrew Kuzma, Elsie Lindgren, Lindsey Lonergan, Elissa McKinley, Andrew McWilliams, Hieu M. Nguyen, Nicholas M. Pajewski, Matt Pallini, Laura Sak-Castellano, Erika Setliff, Yhenneko J. Taylor, Brian J. Wells & Gabe Wright

Contributions

A.M. and B.J.W. supervised the study. Members of the FAIR-AI Consortium contributed to the conception and design of the study. A.B., S.C., P.C., A.C., M.G., T.H., M.I., J.K., A.K., A.M., H.M.N., M.P., Y.J.T. and B.J.W. performed acquisition, analysis, and interpretation of the data and drafted and revised the manuscript.

Corresponding author

Correspondence to Hieu M. Nguyen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics

The study was approved by the Advocate Health—Wake Forest University School of Medicine IRB (#00109544) with a verbal consent procedure, as permitted under federal regulation 45 CFR 46.117(c). Avoiding written consent eliminated the need to collect any personally identifying information, thereby reducing the risk of a confidentiality breach.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wells, B.J., Nguyen, H.M., McWilliams, A. et al. A practical framework for appropriate implementation and review of artificial intelligence (FAIR-AI) in healthcare. npj Digit. Med. 8, 514 (2025). https://doi.org/10.1038/s41746-025-01900-y

Received: 06 February 2025
Accepted: 21 July 2025
Published: 11 August 2025
Version of record: 11 August 2025
DOI: https://doi.org/10.1038/s41746-025-01900-y
