Introduction

Artificial intelligence (AI) is rapidly reshaping medicine. From guiding diagnostic decisions to supporting patient-facing applications to generating clinical documentation, AI systems now influence multiple facets of healthcare delivery1,2,3. This expansion reflects the rapid maturation of machine-learning models and their integration into routine clinical workflows. Interpreter services are a key area of healthcare that is ripe for the implementation of AI tools. However, it remains unclear whether academic research and rigorous evaluation have kept pace with the emergence of new applications of AI, particularly those that are patient-facing.

Using four AI-enabled knowledge platforms (ChatGPT, Perplexity, Gemini, and OpenEvidence), we ran structured exploratory queries to estimate the number of publications from the past five years addressing AI in healthcare or AI in health services. We then narrowed our query to publications addressing interpreter services, and finally to those including patient perspectives, patient experience, or patient-reported outcomes. Outputs addressing all three concepts were classified as “all criteria.” Results were reported verbatim as returned by each platform. We found that AI-related healthcare publications increased from ~11,500 in 2019 to over 28,000 in 2024 (Fig. 1). Despite this rapid expansion, research specifically exploring the impact of AI on services that shape the patient experience, including interpreter services for patients with a non-English language preference (NELP), remains scarce (Table 1).

Fig. 1

Estimated number of publications per year related to AI in healthcare (source: ChatGPT o3).

Table 1 Evidence gap: publications on AI interpreter services and inclusion of patient perspectives (2020–2025a)

Lessons from algorithmic bias in healthcare

Our results in Table 1 highlight a pressing research and policy need. When deployed without thorough evaluation across diverse linguistic and cultural groups, AI-supported communication tools may introduce or amplify inequities.

Recent work has documented communication-specific risks across language technologies. Automated speech recognition (ASR) systems show reduced accuracy for certain accents and dialects, while neural machine translation (NMT) models demonstrate variable performance across languages and clinical contexts4,5,6,7. A recent systematic review of NMT technologies in healthcare settings reported substantial variability in translation accuracy and emphasized the need for systematic evaluation before clinical implementation8. The limitations of existing language technologies underscore the need for context-specific assessment of AI-based interpreter systems prior to use in high-stakes communication.

Currently, there is no established methodology to ensure the accuracy of AI-based interpretation. As new technologies emerge that promise to aid interpretation, it is essential that we develop clear metrics with which to assess accuracy. Without careful evaluation of their performance across languages and cultural contexts, these novel technologies risk increasing communication-based disparities in clinical encounters.

The emerging role of AI in interpreter services

AI-powered language access tools, including ASR, real-time translation or interpretation systems, and video-based avatars, are increasingly being considered as alternatives to traditional interpretive services, which typically rely on in-person or phone/video-based live human interpretation.

To better understand the role of these tools in clinical communication, it is important to distinguish between translation and interpretation, as they represent different communication tasks. Translation refers primarily to the conversion of written text from one language to another, whereas interpretation refers to the real-time conversion of spoken language during an interaction between individuals who speak different languages. Although emerging AI systems can approximate real-time speech-to-speech communication by combining speech recognition, machine translation, and audio synthesis, these systems largely rely on machine translation pipelines rather than on the human interpretive processes used in professional medical interpretation6,7,8,9. The technology may therefore lack some of the benefits afforded by in-person interpretation, which is particularly valuable in dynamic, high-stakes clinical encounters where meaning must be conveyed accurately while accounting for tone, ambiguity, emotional nuance, and conversational context6,7.

In parallel, a two-wave cross-sectional survey of clinicians demonstrated growing expectations for diagnostic AI tools and emphasized the importance of usability and workflow integration10. Although studies of this kind do not evaluate AI-based interpretation directly, they underscore the accelerating pace of technological adoption and the need for careful evaluation frameworks before deploying such tools in interpreter-mediated, high-stakes communication. AI interpretation tools promise cost savings and expanded access in under-resourced clinical settings, especially for less common languages for which human interpreters may not be readily available. However, their performance in complex, high-stakes clinical encounters remains poorly understood. A 2024 systematic review found that AI interpretation tools performed best in simple, low-risk interactions11. Real-world medicine often involves emotionally complex discussions around diagnoses, treatment options, or end-of-life care, in which appreciating cultural nuance and maintaining patient trust are essential. Using physician-coded clinical severity ratings, in which harm was defined as a mistranslation that could cause a patient to take or fail to take an action that would delay or endanger care, a recent study found that ~5% of discharge instruction translations contained at least one error with potential for clinically significant or life-threatening harm, with high inter-rater reliability (kappa 86–97%)12. Based on these findings, continual assessment and oversight are necessary whenever AI is used in higher-risk healthcare communications.
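The inter-rater reliability statistic quoted above can be made concrete with a short calculation. The following is a minimal sketch of Cohen's kappa for two raters, not the cited study's actual procedure: the severity labels and the ten ratings are hypothetical and chosen only for illustration. Kappa is conventionally reported on a scale where 1 indicates perfect agreement, so we assume the percentages quoted above correspond to roughly 0.86 to 0.97 on that scale.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical severity codes for ten translated instructions (not study data).
physician_1 = ["none", "none", "minor", "harm", "none",
               "minor", "none", "none", "harm", "none"]
physician_2 = ["none", "none", "minor", "harm", "none",
               "none", "none", "none", "harm", "none"]

kappa = cohens_kappa(physician_1, physician_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on nine of ten items, yet kappa is lower than 0.9 because part of that agreement would be expected by chance alone given how often each rater uses the "none" label.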

As they exist now, AI-based language tools may be most appropriate for a limited set of scenarios: low-risk written translation of patient-facing documents (e.g., hospital navigation instructions or general patient education materials); communication in languages commonly supported by current AI systems, either when certified interpreters are not immediately available or as temporary support while a connection to a live interpreter is being arranged; and, as the technology continues to evolve, encounters involving rare languages. Even in these settings, their use should be approached cautiously and evaluated for accuracy, safety, and patient acceptability11,12.

Current gaps in the literature

Despite the explosion of AI-related healthcare research, studies that investigate the patient perspective are exceedingly rare. The few that have explored this topic have generally found patient apprehension regarding AI’s safety and efficacy13. While patients appear to be enthusiastic about AI’s potential, they continue to have reservations about particular use cases13,14. As our query found, fewer than 0.4 percent of AI-in-healthcare publications mention patient perspectives and no publications about interpreter services do so. Querying ChatGPT15 yielded only 20–30 papers in 2024 that mention AI and interpreter services, and only one or two discussing patient experience (Table 1). Using the same search criteria, OpenEvidence16 surfaced just a single 2024 article (Barwise et al.)17 that focused on patient perspectives but did not include patient-reported outcomes. Importantly, patient-reported outcome measures (PROMs) require formal processes of translation, back-translation, and psychometric validation, and therefore represent a distinct methodological consideration from interpretation or translation in clinical encounters. Gemini18 and Perplexity19 returned no papers meeting all three criteria. Table 1 displays the output from this search. Notably, the models differ in how they express certainty, with some reporting numerical ranges and others using terms such as “sparse,” “none identified,” or “no match.” The convergence of these independent sources points to a systemic blind spot: the rapid technical development of AI interpreter tools is not being matched by research evaluating their impact on communities with NELP. Addressing this research gap is a prerequisite for optimal performance and equitable adoption.

Further, using AI-based research assistants (ChatGPT, Perplexity, Gemini, and OpenEvidence) to generate comparative literature estimates represents a novel methodological approach for assessing gaps in emerging fields. As more researchers employ AI for these purposes, we believe transparency about such methods is critical for evaluating the accuracy of their results against traditional research methods. We currently lack an understanding of whether AI interpretation helps or hinders patient comprehension, enhances or erodes provider trust, and improves or worsens disparities in care delivery.

Language access as a health equity imperative

Language represents a foundational component of effective communication and health equity. Patients with NELP experience higher rates of misdiagnosis, poor comprehension of treatment plans, and dissatisfaction with care20. Language barriers are linked to increased emergency room use, lower adherence to medications, and reduced engagement in shared decision-making20,21. Interpreter services are integral to safe, equitable, high-quality care. Furthermore, interpreter services are federally mandated. Compliance with the Emergency Medical Treatment and Labor Act (EMTALA) necessitates access to interpreters, and Section 1557 of the Affordable Care Act requires “meaningful access” to medical care for patients with limited English proficiency, which necessitates available interpreters22,23.

Extensive evidence demonstrates that certified medical interpreters improve communication accuracy, reduce clinical errors, enhance shared decision-making, and in some cases decrease hospital length of stay and readmission rates24,25. These well-documented benefits underscore the risks of substituting trained interpreters with emerging AI-based tools that have not yet undergone validation in high-stakes clinical settings. The adoption of AI tools should not be driven solely by cost or availability. Replacing trained interpreters with unvalidated technologies risks miscommunication in critical moments, especially if the technology is not perceived by patients as trustworthy or accurate.

A patient-centered research agenda

A robust patient-centered research agenda is essential to guide the successful utilization of AI interpreter services.

First, careful evaluation of accuracy and safety is critical. Comparative-effectiveness studies should benchmark AI systems against certified human interpreters on clinical-communication accuracy, shared-decision-making scores, visit length, cost-utility, and performance in simulated high-stakes scenarios24,25,26. Accuracy alone, however, is insufficient to determine whether these tools are appropriate for real-world clinical environments.

Second, research must examine patient perception and trust. Mixed-methods studies should capture how patients with NELP perceive and trust these tools, for example by combining post-visit patient experience surveys with qualitative interviews, near-real-time smartphone experience-sampling, and multilingual sentiment analysis. All outcomes should be stratified by language, health-literacy level, and socioeconomic status to illuminate intersectional inequities.

Third, feasibility and usability must be evaluated. Early assessments should examine ease of use, learnability, efficiency within clinical workflows, perceived usefulness among clinicians and patients, acceptability, task-completion error rates, and integration with existing communication practices.

Fourth, systems for continuous error monitoring are needed. This may include human-in-the-loop annotation pipelines through which bilingual clinicians flag clinically significant mistranslations, real-time confidence scoring that escalates uncertain outputs to live interpreters, and electronic health record (EHR)-embedded safety dashboards that track usage, error type, and override rates, mirroring established pharmacovigilance models27,28,29.
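One way to picture the confidence-based escalation described above is as a simple routing rule applied to each translated segment. The sketch below is illustrative only: the `Segment` structure, the 0.85 threshold, and the assumption that the system reports a per-segment confidence score between 0 and 1 are all hypothetical, and any real threshold would need to be set and validated clinically.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical cut-off; a deployed value would require clinical validation.
ESCALATION_THRESHOLD = 0.85


@dataclass
class Segment:
    text: str          # source-language utterance or sentence
    confidence: float  # assumed model-reported score in [0, 1]


def route(segment, threshold=ESCALATION_THRESHOLD):
    """Return 'deliver' for confident output, 'escalate' for uncertain output."""
    if not 0.0 <= segment.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return "deliver" if segment.confidence >= threshold else "escalate"


def summarize(segments, threshold=ESCALATION_THRESHOLD):
    """Aggregate routing decisions into counts a safety dashboard could track."""
    decisions = Counter(route(s, threshold) for s in segments)
    total = sum(decisions.values())
    return {
        "total": total,
        "escalated": decisions["escalate"],
        "escalation_rate": decisions["escalate"] / total if total else 0.0,
    }


# Hypothetical encounter; texts and scores are invented for illustration.
encounter = [
    Segment("Take one tablet twice daily.", 0.97),
    Segment("Stop the blood thinner two days before the procedure.", 0.62),
    Segment("Call the clinic if the fever returns.", 0.91),
]

for segment in encounter:
    print(route(segment), "|", segment.text)
print(summarize(encounter))
```

A rule like this only helps if the confidence scores are well calibrated across languages, which is itself something the monitoring pipeline would need to verify against clinician-flagged errors.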

Fifth, equity-focused implementation science, including interrupted time-series analyses of quality metrics and geographically diverse pragmatic studies, must evaluate whether AI interpreter services narrow or widen disparities in adverse events, readmissions, and patient-reported outcomes24,25,30.
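For readers less familiar with interrupted time-series analysis, a common formulation is segmented regression, which estimates a baseline level and trend and then a change in level and in trend after an intervention begins. The sketch below fits that model by ordinary least squares using only the standard library. It is a simplified illustration under stated assumptions: the monthly values are synthetic and noise-free, the function names are hypothetical, and a real analysis would also need to address autocorrelation, seasonality, and stratification by language group.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_interrupted_time_series(outcomes, start_index):
    """Least-squares fit of y = b0 + b1*t + b2*post + b3*(t - start)*post."""
    rows = []
    for t in range(len(outcomes)):
        post = 1.0 if t >= start_index else 0.0
        rows.append([1.0, float(t), post, (t - start_index) * post])
    k = 4
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(k)]
    b0, b1, b2, b3 = solve_linear_system(xtx, xty)
    return {
        "baseline_level": b0,
        "baseline_trend": b1,
        "level_change": b2,
        "trend_change": b3,
    }


# Synthetic monthly readmission rates (%): 12 months before and 12 months
# after a hypothetical roll-out at month 12. Not real data.
observed = [
    20 - 0.1 * t + (-2.0 - 0.2 * (t - 12) if t >= 12 else 0.0)
    for t in range(24)
]

fit = fit_interrupted_time_series(observed, start_index=12)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

Fitting the same model separately for patients with NELP and for English-preferring patients, and comparing the level and trend changes, is one way such an analysis could show whether a deployment narrows or widens a disparity.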

Finally, participatory design and strong governance frameworks are key. Establishing community advisory boards of patients with NELP, professional interpreters, and cultural brokers can ensure that system development reflects community needs. As these technologies evolve, health-system oversight councils with diverse representation and authority over high-risk deployments will be necessary, alongside policy advocacy for modality-neutral interpreter reimbursement to prevent premature substitution of certified interpreters27,28,29,30.

Ultimately, successful implementation will depend not only on patient-centered evidence, but also on institutional governance that proactively identifies and mitigates algorithmic risk, an area where physician-informaticists and health-system leaders play a pivotal role31 (Table 2).

Table 2 Patient-centered research agenda items

Conclusion

AI shows significant promise in interpreter services but also carries risk. As healthcare systems increasingly consider AI-mediated language tools, patient-centered evidence will be essential to understand their impact on patient experience and clinical outcomes. We must prioritize high-quality, inclusive research that assesses the impact of these technologies for patients with NELP and centers the voices of those patients.

Importantly, discussion about AI interpreter services should also include the perspectives of professional medical interpreters, whose expertise in linguistic nuance, cultural mediation, and clinical communication is essential to safe and equitable care.

Finally, it will be critical to evaluate how accurate interpretation in clinical encounters influences trust and engagement in the healthcare system. Will these technologies ultimately strengthen or weaken trust in healthcare? Only through careful research can we ensure that emerging language technologies advance, rather than erode, equitable care.