Artificial intelligence (AI) is increasingly utilized in healthcare, including in language access services, but certain aspects remain understudied. We offer a research agenda to guide the development of evidence on how AI language access services are perceived by patients and how they impact trust and comprehension in clinical encounters, and to inform implementation strategies. We recommend a governance system to mitigate potential harm and capitalize on benefits for patients with a non-English language preference.
Introduction
Artificial intelligence (AI) is rapidly reshaping medicine. From guiding diagnostic decisions to supporting patient-facing applications to generating clinical documentation, AI systems now influence multiple facets of healthcare delivery1,2,3. This expansion reflects the rapid maturation of machine-learning models and their integration into routine clinical workflows. Interpreter services are a key area of healthcare ripe for the implementation of AI tools. However, it remains unclear whether academic research and rigorous evaluations have kept pace with the emergence of new applications of AI, particularly those that are patient-facing.
Using four AI-enabled knowledge platforms, ChatGPT, Perplexity, Gemini, and OpenEvidence, we ran structured exploratory queries to generate estimates of the number of publications from the past five years addressing AI in healthcare or AI in health services. We then narrowed our query to those addressing interpreter services, and finally to those including patient perspectives, patient experience, or patient-reported outcomes. Outputs meeting all three concepts were classified as “all criteria.” Results were reported verbatim as returned by each platform. We found that annual AI-related healthcare publications increased from ~11,500 in 2019 to over 28,000 in 2024 (Fig. 1). However, despite this rapid expansion, research specifically exploring the impact of AI on services that shape the patient experience, including interpreter services for patients with a non-English language preference (NELP), remains scarce (Table 1).
Fig. 1: Estimated number of publications per year related to AI in healthcare (source: ChatGPT o3).
Lessons from algorithmic bias in healthcare
Our results in Table 1 highlight a pressing research and policy need. When deployed without thorough evaluation across diverse linguistic and cultural groups, AI-supported communication tools may introduce or amplify inequities.
Recent work has documented communication-specific risks across language technologies. Automated speech recognition (ASR) systems show reduced accuracy for certain accents and dialects, while neural machine translation (NMT) models demonstrate variable performance across languages and clinical contexts4,5,6,7. A recent systematic review of NMT technologies in healthcare settings reported substantial variability in translation accuracy and emphasized the need for systematic evaluation before clinical implementation8. The limitations of existing language technologies underscore the need for context-specific assessment of AI-based interpreter systems prior to use in high-stakes communication.
Currently, there is no established methodology to ensure the accuracy of AI-based interpretation. As new technologies emerge that promise to aid interpretation, it is essential that we develop clear metrics with which to assess their accuracy. Without careful evaluation of their performance across languages and cultural contexts, these novel technologies risk increasing communication-based disparities in clinical encounters.
The emerging role of AI in interpreter services
AI-powered language access tools, including ASR, real-time translation or interpretation systems, and video-based avatars, are increasingly being considered as alternatives to traditional interpretive services, which typically rely on in-person or phone/video-based live human interpretation.
To better understand the role of these tools in clinical communication, it is important to distinguish between translation and interpretation, as they represent different communication tasks. Translation refers primarily to the conversion of written text from one language to another, whereas interpretation refers to the real-time conversion of spoken language during an interaction between individuals who speak different languages. Although emerging AI systems can approximate real-time speech-to-speech communication by combining speech recognition, machine translation, and audio synthesis, these systems largely rely on machine translation pipelines rather than human interpretive processes used in professional medical interpretation6,7,8,9. The technology may lack some of the benefits afforded by in-person interpretation, which is particularly valuable in dynamic, high-stakes clinical encounters where meaning must be conveyed accurately while accounting for tone, ambiguity, emotional nuance, and conversational context6,7.
In parallel, a two-wave cross-sectional survey of clinicians demonstrated growing expectations for diagnostic AI tools and emphasized the importance of usability and workflow integration10. Although these studies do not evaluate AI-based interpretation directly, they underscore the accelerating pace of technological adoption and the need for careful evaluation frameworks before deploying such tools in interpreter-mediated, high-stakes communication. These tools promise cost savings and expanded access in under-resourced clinical settings, especially for less common languages for which human interpreters may not readily be available. However, their performance in complex, high-stakes clinical encounters remains poorly understood. A 2024 systematic review found that AI interpretation tools performed best in simple, low-risk interactions11. Real-world medicine often involves emotionally complex discussions around diagnoses, treatment options, or end-of-life care, in which appreciating cultural nuance and maintaining patient trust are essential. Using physician-coded clinical severity ratings, in which harm was defined as a mistranslation that could cause a patient to take or fail to take an action that would delay or endanger care, a recent study found that ~5% of discharge instruction translations contained at least one error with potential for clinically significant or life-threatening harm, with high inter-rater reliability (kappa 86–97%)12. Based on these findings, continual assessment and oversight are necessary when using AI in non-low-risk healthcare communications.
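The inter-rater agreement statistic cited above can be illustrated with a minimal sketch of Cohen's kappa for two raters. The severity codes below are hypothetical and are not data from the cited study, the function name cohen_kappa is an illustrative choice, and the cited study may have used a different kappa variant.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical severity codes assigned to ten translated instructions.
rater_1 = ["none", "none", "minor", "none", "harm", "none", "minor", "none", "none", "harm"]
rater_2 = ["none", "none", "minor", "none", "harm", "none", "none", "none", "none", "harm"]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

Kappa discounts the agreement two raters would reach by chance, which matters when most translations are coded as error-free and raw percent agreement is therefore high by default.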
As they exist now, AI-based language tools may be most appropriate for limited scenarios: low-risk written translation of patient-facing documents (e.g., hospital navigation instructions or general patient education materials); communication in languages well supported by current AI systems when certified interpreters are not immediately available, or as temporary support while connection to a live interpreter is being arranged; and, as the technology continues to evolve, encounters involving rare languages. Even in these settings, their use should be approached cautiously and evaluated for accuracy, safety, and patient acceptability11,12.
Current gaps in the literature
Despite the explosion of AI-related healthcare research, studies that investigate the patient perspective are exceedingly rare. The few that have explored this topic have generally found patient apprehension regarding AI’s safety and efficacy13. While patients appear to be enthusiastic about AI’s potential, they continue to have reservations about particular use cases13,14. As our query found, fewer than 0.4 percent of AI-in-healthcare publications mention patient perspectives and no publications about interpreter services do so. Querying ChatGPT15 yielded only 20–30 papers in 2024 that mention AI and interpreter services, and only one or two discussing patient experience (Table 1). Using the same search criteria, OpenEvidence16 surfaced just a single 2024 article (Barwise et al.)17 that focused on patient perspectives but did not include patient-reported outcomes. Importantly, patient-reported outcome measures (PROMs) require formal processes of translation, back-translation, and psychometric validation, and therefore represent a distinct methodological consideration from interpretation or translation in clinical encounters. Gemini18 and Perplexity19 returned no papers meeting all three criteria. Table 1 displays the output from this search. Notably, the models differ in their expressed levels of certainty, displaying numerical ranges or using the words “sparse” vs “none identified” vs “no match.” The convergence of these independent sources points to a systemic blind spot: the rapid technical development of AI interpreter tools is not being matched by research evaluating their impact on communities with NELP. Addressing this research gap is a prerequisite for optimal performance and equitable adoption. Further, using AI-based research assistants (ChatGPT, Perplexity, Gemini, and OpenEvidence) to generate comparative literature estimates represents a novel methodological approach for assessing gaps in emerging fields.
As more researchers employ AI for these purposes, we believe transparency is critical to evaluating the accuracy of results compared with traditional research methods. We currently lack an understanding of whether AI interpretation helps or hinders patient comprehension, enhances or erodes provider trust, and improves or worsens disparities in care delivery.
Language access as a health equity imperative
Language represents a foundational component of effective communication and health equity. Patients with NELP experience higher rates of misdiagnosis, poor comprehension of treatment plans, and dissatisfaction with care20. Language barriers are linked to increased emergency room use, lower adherence to medications, and reduced engagement in shared decision-making20,21. Interpreter services are integral to safe, equitable, high-quality care. Furthermore, interpreter services are federally mandated. Compliance with the Emergency Medical Treatment and Labor Act (EMTALA) necessitates access to interpreters, and Section 1557 of the Affordable Care Act requires “meaningful access” to medical care for patients with limited English proficiency, which necessitates available interpreters22,23.
Extensive evidence demonstrates that certified medical interpreters improve communication accuracy, reduce clinical errors, enhance shared decision-making, and in some cases decrease hospital length of stay and readmission rates24,25. These well-documented benefits underscore the risks of substituting trained interpreters with emerging AI-based tools that have not yet undergone validation in high-stakes clinical settings. The adoption of AI tools should not be driven solely by cost or availability. Replacing trained interpreters with unvalidated technologies risks miscommunication in critical moments, especially if the technology is not perceived by patients as trustworthy or accurate.
A patient-centered research agenda
A robust patient-centered research agenda is essential to guide the successful utilization of AI interpreter services.
First, careful evaluation of accuracy and safety is critical. Comparative-effectiveness studies should benchmark AI systems against certified human interpreters on clinical-communication accuracy, shared-decision-making scores, visit length, cost-utility, and performance in simulated high-stakes scenarios24,25,26. Accuracy alone, however, is insufficient to determine whether these tools are appropriate for real-world clinical environments.
Second, research must examine patient perception and trust. Mixed-methods studies should capture how patients with NELP perceive and trust these tools, such as combining post-visit patient experience surveys with qualitative interviews, near-real-time smartphone experience-sampling, and multilingual sentiment analysis. All outcomes should be stratified by language, health-literacy level, and socioeconomic status to illuminate intersectional inequities.
Third, feasibility and usability must be evaluated. Early assessments should examine ease of use, learnability, efficiency within clinical workflows, perceived usefulness among clinicians and patients, acceptability, task-completion error rates, and integration with existing communication practices.
Fourth, systems for continuous error monitoring are needed. This may include human-in-the-loop annotation pipelines through which bilingual clinicians flag clinically significant mistranslations, real-time confidence scoring that escalates uncertain outputs to live interpreters, and electronic health record (EHR)-embedded safety dashboards that track usage, error type, and override rates, mirroring established pharmacovigilance models27,28,29.
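One way to picture the confidence-based escalation and usage tracking described above is the minimal sketch below. It assumes the AI system reports a per-segment confidence score between 0 and 1; the threshold value, the class names InterpretedSegment and SafetyLog, and the example segments are all hypothetical and would need calibration and validation in any real deployment.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed cut-off for illustration only; not a validated value.
ESCALATION_THRESHOLD = 0.85


@dataclass
class InterpretedSegment:
    source_text: str
    target_text: str
    confidence: float  # assumed model-reported score in [0, 1]


def route_segment(segment, threshold=ESCALATION_THRESHOLD):
    """Return 'escalate' for low-confidence output, otherwise 'deliver'."""
    return "escalate" if segment.confidence < threshold else "deliver"


class SafetyLog:
    """Tallies routing decisions and clinician-flagged error types."""

    def __init__(self):
        self.decisions = Counter()
        self.error_types = Counter()

    def record(self, segment):
        decision = route_segment(segment)
        self.decisions[decision] += 1
        return decision

    def flag_error(self, error_type):
        # Called when a bilingual reviewer marks a mistranslation.
        self.error_types[error_type] += 1

    def escalation_rate(self):
        total = sum(self.decisions.values())
        return self.decisions["escalate"] / total if total else 0.0


# Hypothetical segments from one encounter.
log = SafetyLog()
segments = [
    InterpretedSegment("Take one tablet daily.", "Tome una tableta al día.", 0.97),
    InterpretedSegment("Stop warfarin before surgery.", "(uncertain output)", 0.62),
]
decisions = [log.record(s) for s in segments]
log.flag_error("omission")
print(decisions, log.escalation_rate(), dict(log.error_types))
```

Counts of this kind are what a dashboard would aggregate over time; the sketch does not address how confidence scores are produced or whether they are well calibrated across languages, which is itself an open evaluation question.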
Fifth, equity-focused implementation science, including interrupted time-series analyses of quality metrics and geographically diverse pragmatic studies, must evaluate whether AI interpreter services narrow or widen disparities in adverse events, readmissions, and patient-reported outcomes24,25,30.
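For readers less familiar with interrupted time-series designs, one common formulation is segmented regression, sketched below. The symbols are illustrative assumptions rather than a model specified in this article: Y_t is a quality metric in period t, T_t is time since the start of observation, T_0 is the period in which the AI interpreter service was introduced, and X_t equals 0 before introduction and 1 afterwards.

```latex
Y_t = \beta_0 + \beta_1 T_t + \beta_2 X_t + \beta_3 (T_t - T_0) X_t + \varepsilon_t
```

Here \beta_1 is the pre-existing trend, \beta_2 the immediate change in level at introduction, and \beta_3 the change in trend afterwards. Fitting the model separately by language group, or adding group interaction terms, would indicate whether disparities narrowed or widened.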
Finally, participatory design and strong governance frameworks are key. Establishing community advisory boards of patients with NELP, professional interpreters, and cultural brokers can ensure that system development reflects community needs. As these technologies evolve, health-system oversight councils with diverse representation and authority over high-risk deployments will be necessary, alongside policy advocacy for modality-neutral interpreter reimbursement to prevent premature substitution of certified interpreters27,28,29,30.
Ultimately, successful implementation will depend not only on patient-centered evidence, but also on institutional governance that proactively identifies and mitigates algorithmic risk, an area where physician-informaticists and health-system leaders play a pivotal role31 (Table 2).
Conclusion
AI shows significant promise in interpreter services but also carries risk. As healthcare systems increasingly consider AI-mediated language tools, patient-centered evidence will be essential to understand their impact on patient experience and clinical outcomes. We must prioritize high-quality, inclusive research that assesses the impact of these technologies for patients with NELP and centers the voices of those patients.
Importantly, discussion about AI interpreter services should also include the perspectives of professional medical interpreters, whose expertise in linguistic nuance, cultural mediation, and clinical communication is essential to safe and equitable care.
Finally, it will be critical to evaluate how accurate interpretation in clinical encounters influences trust and engagement in the healthcare system. Will these technologies ultimately strengthen or weaken trust in healthcare? Only through careful research can we ensure that emerging language technologies advance, rather than erode, equitable care.
Data availability
No datasets were generated or analyzed during the current study.
References
Sahni, N. R. & Carrus, B. Artificial intelligence in U.S. health care delivery. N. Engl. J. Med. 389, 348–358 (2023).
Haug, C. J. & Drazen, J. M. Artificial intelligence and machine learning in clinical medicine, 2023. N. Engl. J. Med. 388, 1201–1208 (2023).
Brunyé, T. T., Mitroff, S. R. & Elmore, J. G. Artificial intelligence and computer-aided diagnosis in diagnostic decisions: 5 questions for medical informatics and human-computer interface research. J. Am. Med. Inform. Assoc. ocaf123, https://doi.org/10.1093/jamia/ocaf123 (2025).
Colacci, M. et al. Sociodemographic bias in clinical machine learning models: a scoping review of algorithmic bias instances and mechanisms. J. Clin. Epidemiol. 178, 111606 (2025).
Ferryman, K., Mackintosh, M. & Ghassemi, M. Considering biased data as informative artifacts in AI-assisted health care. N. Engl. J. Med. 389, 833–838 (2023).
Ng, J. J. W. et al. Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review. BMC Med. Inf. Decis. Mak. 25, 236 (2025).
Xu, Z. et al. Voice for all: evaluating the accuracy and equity of automatic speech recognition systems in transcribing patient communications in home healthcare. Stud. Health Technol. Inf. 329, 1904–1906 (2025).
Karakus, I. et al. Bridging language gaps in healthcare: a systematic review of the practical implementation of neural machine translation technologies in clinical settings. J. Am. Med. Inform. Assoc. ocaf150, https://doi.org/10.1093/jamia/ocaf150 (2025).
Singh, K., Prabhu, A. & Kaur, N. The impact and role of artificial intelligence (AI) in healthcare: a systematic review. Curr. Top. Med. Chem. CTMC-EPUB-146975, https://doi.org/10.2174/0115680266339394250225112747 (2025).
Cabral, B. P. et al. Future use of AI in diagnostic medicine: 2-wave cross-sectional survey study. J. Med. Internet Res. 27, e53892 (2025).
Genovese, A. et al. Artificial intelligence in clinical settings: a systematic review of its role in language translation and interpretation. Ann. Transl. Med. 12, 117 (2024).
Kong, M. et al. Evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions: a comparative analysis. BMJ Qual. Saf. 0, 1–9 (2025).
Richardson, J. P. et al. Patient apprehensions about the use of artificial intelligence in healthcare. NPJ Digit Med. 4, 140 (2021).
Young, A. T. et al. Patient and general public attitudes towards clinical artificial intelligence: a mixed methods systematic review. Lancet Digit Health 3, e599–e611 (2021).
OpenAI. ChatGPT (August 6 version) [Large Language Model] (OpenAI, 2025).
OpenEvidence. OpenEvidence [AI research assistant] (OpenEvidence, 2025).
Barwise, A. K. et al. Using artificial intelligence to promote equitable care for inpatients with language barriers and complex medical needs: clinical stakeholder perspectives. J. Am. Med. Inform. Assoc. 31, 611–621 (2024).
Google. Gemini (Aug 6 version) [Large language model] (Google, 2025).
Perplexity. Perplexity.ai (Aug 6 version) [AI search engine] (Perplexity, 2025).
Pandey, M. et al. Impacts of English language proficiency on healthcare access, use, and outcomes among immigrants: a qualitative study. BMC Health Serv. Res. 21, 741 (2021).
Sarver, J. & Baker, D. W. Effect of language barriers on follow-up appointments after an emergency department visit. J. Gen. Intern. Med. 15, 256–264 (2000).
State Operations Manual, Appendix V - Interpretive Guidelines - Responsibilities of Medicare Participating Hospitals in Emergency Cases (Centers for Medicare and Medicaid Services, 2019).
Rainer, M. F. Language Access Provisions of the Final Rule Implementing Section 1557 of the Affordable Care Act (Department of Health and Human Services, Office for Civil Rights, 2024).
Karliner, L. S. et al. Do professional interpreters improve clinical care for patients with limited English proficiency? A systematic review of the literature. Health Serv. Res. 42, 727–754 (2007).
Lindholm, M. et al. Professional language interpretation and inpatient length of stay and readmission rates. J. Gen. Intern. Med. 27, 1294–1299 (2012).
Radu, I. et al. Digital health for migrants, ethnic and cultural minorities and the role of participatory development: a scoping review. Int. J. Environ. Res. Public Health 20, 6962 (2023).
Selbst, A. D. & Barocas, S. The intuitive appeal of explainable machines. Fordham Law Rev. 87, 1085–1139 (2019).
American Medical Association. CPT Code Set: Language Interpreter Services Proposed (American Medical Association, 2024).
OSTP, U. S. Blueprint for an AI Bill of Rights: Technical Companion (OSTP, U. S., 2022).
Beaton, D. E. et al. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine 25, 3186–3191 (2000).
Obermeyer, Z. et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
Acknowledgements
G.O. is supported by the NIH NIMHD under award number K23MD016129, the Brigham and Women’s Hospital Center for Academic Development and Enrichment Faculty Career Development Award, the H. Richard Nesson Fellowship at Brigham and Women’s Hospital, and the Gordon and Betty Moore Foundation in partnership with the Council of Medical Specialty Societies through the National Academy of Medicine Scholars in Diagnostic Excellence program. This work was supported by G.O.’s funding. G.O. and R.W.B. also disclosed grant funding from Mass General Brigham to support language-concordant surgical care in otolaryngology–head and neck surgery, which is unrelated to the submitted work. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Contributions
Conception (O.F.L., E.E.W., A.F., J.R., H.B., S.S., M.H., D.W.B., K.J., R.W.B., G.O.); study design (O.F.L., E.E.W., A.F., J.R., H.B., S.S., M.H., D.W.B., K.J., R.W.B., G.O.); data collection (O.F.L., E.E.W., A.F., J.R., H.B., S.S., M.H., D.W.B., K.J., R.W.B., G.O.); data analysis (O.F.L., E.E.W., A.F., J.R., H.B., S.S., M.H., D.W.B., K.J., R.W.B., G.O.); manuscript drafting (O.F.L., E.E.W., A.F., J.R., M.A.M.R., H.B., S.S., M.H., D.W.B., K.J., R.W.B., G.O.). All authors critically reviewed the manuscript.
Ethics declarations
Competing interests
R.W.B. discloses unrelated clinical trial grant funding from I-Mab Biopharma and unrelated research consulting funding from Analysis Group. D.W.B. has received consulting fees and/or stock options from AESOP, FeelBetter, Guided Clinical Solution, ValeraHealth, Clew, and MDClone, as well as consulting fees from Relyens, all outside the submitted work. All other authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lynch, O.F., Witt, E.E., Fernandez, A. et al. Beyond translation: a patient-centered research agenda for artificial intelligence interpreter services in healthcare. npj Digit. Med. 9, 376 (2026). https://doi.org/10.1038/s41746-026-02764-6
