Introduction

Patients with a non-English language preference (NELP) experience significant barriers to high-quality health care, resulting in higher rates of emergency department visits and hospital readmissions compared with patients with an English language preference1,2. These disparities are driven by multilevel factors, including insufficient use of interpretation services (real‑time spoken language support)3,4,5 as well as limited availability of bilingual providers6,7,8,9. Another key contributing factor is the lack of translated medical information (written text). When patients require translated medical information, such as discharge instructions or informed consent forms, certified translators either craft the translated documents de novo or use computer-aided translation (CAT) software, such as translation memory tools, grammar checkers, or terminology managers. These processes are costly, labor-intensive, and lack necessary customization10,11,12. In instances where in-house expertise is insufficient, health systems rely on third-party companies for translations. Despite these workflows, health systems continue to fail to routinely translate vital documents for patients with NELP12,13,14, even though the Civil Rights Act of 1964 and Department of Health and Human Services (HHS) regulations require U.S. hospitals to provide reasonable access to language assistance services, including written translations15.

Machine-assisted translation (MAT) using large language models (LLMs) has the potential to bridge this gap. Before 2023, neural machine translation (NMT) was the leading automated translation technology16. However, NMT systems struggle with specialized terminology and generally offer limited contextual understanding, particularly on longer passages17. Meanwhile, existing CAT software is hampered by limited adaptability and static glossaries, requiring considerable effort from translators to refine each piece of text. In contrast, LLMs provide deeper contextual understanding, enabling them to process entire passages and handle vocabulary and linguistic nuances more precisely18. They also support multi-task learning, allowing them to translate, summarize, or simplify text—capabilities crucial for producing accessible patient materials19. Although a single-center study reported that LLM translation achieved results comparable to those of human translators for languages with abundant online resources, such as Spanish and Portuguese20, concerns remain regarding its accuracy in languages with less digital content. Examples include Quechua and Yorùbá, which are considered digitally underrepresented languages in machine translation research due to the scarcity of comprehensive online corpora21,22. Additionally, there remains a lack of guidance on how to safely and effectively implement LLM-supported MAT technology into healthcare, particularly as generative artificial intelligence (AI) tools in medicine expand19,23,24,25,26,27.

To advance care for the millions of individuals with NELP, health systems and policymakers must take the lead in implementing LLM-supported MAT. Section 1557 of the Affordable Care Act has already established a fundamental requirement: machine-generated translations must undergo human review before reaching patients28. Beyond this high‑level mandate, there is no detailed federal or professional‑body guidance on how to operationalize LLM‑based MAT (e.g., evaluation, privacy safeguards, or workflow integration), leaving health systems to devise their own policies. In this perspective, we apply the Consolidated Framework for Implementation Research (CFIR)29—a comprehensive implementation science framework encompassing five domains and up to 67 constructs—as a conceptual lens to examine these challenges and considerations. By mapping these factors across CFIR domains, we present a roadmap that can help health systems judiciously operationalize MAT (Tables 1–5).

Table 1 Innovation domain: examines characteristics of an intervention that influence its adoption, implementation, and sustainability
Table 2 Individuals domain: focuses on the characteristics of individuals involved in implementation, including their knowledge, beliefs, and personal attributes
Table 3 Inner setting domain: encompasses the organizational context in which implementation occurs, including structural characteristics, available resources, and culture
Table 4 Implementation process domain: focuses on the active strategies and actions taken to adopt, execute, and sustain an intervention
Table 5 Outer setting domain: examines external influences on implementation, including policies, incentives, and interorganizational networks

CFIR for machine-assisted translation

I. Innovation domain

For machine-assisted translation to succeed, health systems must learn to manage LLM limitations. For example, LLMs tend to “hallucinate” (generate plausible but factually incorrect text)30, lose context (fail to retrieve and use information located in the middle of long passages)31, and erode human accountability (users become reliant on the LLM and fail to evaluate its outputs)32. The implementation of LLM MAT must therefore include appropriate safeguards. Translators should be trained to recognize the common pitfalls (e.g., hallucinations, context loss, bias) of LLMs in MAT, and, to counteract overreliance, translators could be intermittently prompted to justify leaving a sentence unedited in a translation, similar to safety checks in electronic health records (EHRs) for potential medication-medication interactions33.

Patient privacy is another design requirement that must be considered. Closed-source LLMs should never be used in ways that expose protected health information (PHI) via unprotected application programming interfaces (APIs). We recommend leveraging zero-data-retention (ZDR) endpoints34 (API interfaces that process each request in memory only and do not log or persist any input or output data) or private instances through industry partnerships. For example, Stanford Health Care has established a dedicated pathway to the Azure OpenAI Service under full institutional control, enabling high-throughput, privacy-compliant LLM queries35. Such a model enables LLM-based MAT while adhering to data protection standards.
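To make this concrete, the sketch below shows how a translation request might be routed through an institutionally controlled Azure OpenAI deployment using the openai Python library. The endpoint URL, deployment name, and environment variable are hypothetical placeholders; actual values, API versions, and contractual ZDR terms are institution-specific.

```python
# Minimal sketch of querying a privately provisioned Azure OpenAI deployment
# for a draft translation. Endpoint, deployment name, and secret names are
# hypothetical placeholders for institution-specific configuration.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],                # institution-managed secret
    api_version="2024-06-01",
    azure_endpoint="https://example-health.openai.azure.com",  # private instance
)

discharge_text = "Take lisinopril 10 mg by mouth once daily for blood pressure."

response = client.chat.completions.create(
    model="translation-gpt4o",  # hypothetical deployment name
    messages=[
        {
            "role": "system",
            "content": "Translate the user's discharge instructions into Spanish. "
                       "Preserve medication names, doses, and dates exactly.",
        },
        {"role": "user", "content": discharge_text},  # PHI stays inside the private tenant
    ],
    temperature=0.0,  # deterministic output simplifies auditing
)
print(response.choices[0].message.content)  # draft for human translator review
```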

Cost considerations also play a significant role in deciding whether to adopt LLMs. Operating large open-source models at scale can be prohibitively expensive for resource-constrained health systems. For example, running a single NVIDIA A100 80GB GPU on a cloud platform (e.g., Google Cloud) may cost around $4.74 per hour (as of May 2025)36. Because payors do not reimburse LLM use, these costs must be covered by an institution’s operating budget. Two factors, however, can offset that burden. First, as secure ZDR endpoints become more widely available, organizations can adopt a pay-as-you-go model that reduces both infrastructure demands and overall expenses. These options can help democratize access to powerful LLMs for health systems of varying sizes while maintaining patient privacy. Second, patients with NELP have consistently longer stays, higher readmission rates, and greater care costs when language-concordant services are lacking37,38. Investing in LLM-based MAT can reduce those downstream expenses and improve value-based care metrics that payors already incentivize, such as Centers for Medicare & Medicaid Services (CMS) readmission penalties.
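A rough, illustrative comparison of these two cost models, using the GPU rate cited above and deliberately hypothetical per-token pricing and request volumes, shows why pay-as-you-go endpoints can be attractive for modest translation workloads.

```python
# Back-of-the-envelope cost comparison (illustrative assumptions, not quotes):
# a continuously running dedicated cloud GPU vs. a pay-as-you-go endpoint.
GPU_HOURLY_USD = 4.74          # A100 80GB rate cited above (May 2025)
HOURS_PER_MONTH = 24 * 30

dedicated_monthly = GPU_HOURLY_USD * HOURS_PER_MONTH  # ≈ $3,413 per month

# Hypothetical usage-based pricing; actual rates vary by vendor and contract.
DOCS_PER_MONTH = 2_000         # translation requests
TOKENS_PER_DOC = 3_000         # source note plus generated draft
USD_PER_1K_TOKENS = 0.01

api_monthly = DOCS_PER_MONTH * TOKENS_PER_DOC / 1_000 * USD_PER_1K_TOKENS  # ≈ $60
print(f"Dedicated GPU: ${dedicated_monthly:,.0f}/mo; usage-based API: ${api_monthly:,.0f}/mo")
```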

II. Individuals domain

For translators, health systems should accommodate existing workflows. Though workflows vary, a common one involves a staff member submitting a request, after which a translator is assigned the document and works in the EHR—either translating from scratch or using templates. In some organizations, a second translator reviews the draft before the original translator finalizes formatting and delivers the material to the patient. LLM-based MAT can be woven into this existing workflow by integrating into the EHR, enabling automatic draft generation as soon as a translation request is initiated. The translator can then open the draft within the same interface they already use to make edits. Compatibility with CAT tools, allowing translators to run grammar checks or insert pre-translated templates from a memory database directly into the EHR, ensures they retain access to their preferred resources; if a second review is standard, that reviewer can still validate the final text.

On the clinician side, translation accuracy depends on the quality of the English source text. High-quality notes are well-structured, free from spelling errors, and written in plain language. To enhance accuracy, clinicians should prepare high-quality discharge summaries. Alternatively, the MAT workflow may use an LLM’s zero-shot capabilities39 to first refine messy or jargon-heavy notes via a “preparation” prompt and then generate a translation with a second prompt, as sketched below. This two-prompt approach can be automatically applied to all incoming English text or toggled on/off by the translation team. By improving the quality of the source text, health systems can reduce the risk of inaccuracies.
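A minimal sketch of this two-prompt workflow follows. The `chat` argument stands in for any chat-completion call (such as the private deployment shown earlier), and both prompts are illustrative drafts rather than validated clinical prompts.

```python
# Sketch of the two-prompt "prepare, then translate" workflow described above.
# Prompts are illustrative; `chat` is any callable that sends a system/user
# pair to an LLM and returns the completion text.
from typing import Callable

PREP_PROMPT = (
    "Rewrite the following clinical note in plain English: expand abbreviations, "
    "correct spelling, and keep all medications, doses, and dates unchanged."
)
TRANSLATE_PROMPT = (
    "Translate the following text into {language}. Do not add or omit information."
)

def draft_translation(
    chat: Callable[[str, str], str],  # chat(system_prompt, user_text) -> completion
    note: str,
    language: str,
    prepare: bool = True,  # the on/off toggle described above
) -> str:
    """Return an LLM draft for human translator review."""
    text = chat(PREP_PROMPT, note) if prepare else note       # pass 1: clean the source
    return chat(TRANSLATE_PROMPT.format(language=language), text)  # pass 2: translate
```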

Patients are also key stakeholders in MAT implementation. To secure their buy‑in, health systems must be transparent about their use of MAT and proactively seek patient input. Strategies may include brief post‑visit surveys, patient advisory boards, and focus groups dedicated to specific languages to elicit language‑specific guidance. Involving patients in this way is essential for both routine quality‑improvement loops and for the formal evaluation of LLM‑based MAT (see “Implementation Process Domain”). Additionally, surveying patients about their comfort with LLM-supported translations in various care settings (e.g., adult versus pediatric) can help create a more patient-centered approach to MAT deployment.

III. Inner setting domain

An important Inner Setting factor is the translator workforce, which in many health systems is too small to meet the increasing demand for timely translations. Translators often face excessive workloads, causing lengthy turnaround times and leaving some patients without materials in their preferred language. Integrating LLM-based MAT can offload the initial draft, shortening the translation cycle and easing the translator’s workload. This approach also allows organizations to reallocate translator time to more complex tasks. However, it is crucial to recognize that a poorly generated LLM translation may demand more effort to correct than a fully manual translation. A practical approach to mitigate this risk is to first identify which documents and language pairs MAT consistently handles well (e.g., Spanish routine discharge summaries) and where it struggles (e.g., Korean surgery informed consent forms). In the early stages, organizations might limit MAT to documents and languages where the model performs reliably while relying on human translations for more complex tasks. As data from these difficult cases accumulate, the model can be fine-tuned and iteratively improved, gradually expanding its applicability. Accuracy remains paramount: tracking turnaround times alongside quality metrics ensures faster translations still meet patient needs. By weaving these measures into routine workflows, translation teams can maintain both efficiency and quality. We detail fine-tuning and evaluation in the Implementation Process Domain section.
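One simple way to operationalize this triage is to stratify logged quality scores by language and document type; a pandas sketch follows. The column names, the COMET-style score, and the 0.80 eligibility threshold are assumptions for illustration, not validated cutoffs.

```python
# Illustrative sketch: stratify automated quality scores from the audit log by
# target language and document type to decide where MAT drafts are reliable
# enough for early deployment. Column names and threshold are assumptions.
import pandas as pd

log = pd.read_csv("translation_log.csv")  # hypothetical export of the audit log

summary = (
    log.groupby(["target_language", "doc_type"])["comet_score"]
       .agg(["mean", "count"])
       .reset_index()
)
# Route low-scoring document/language pairs to fully manual translation.
summary["mat_eligible"] = summary["mean"] >= 0.80
print(summary.sort_values("mean"))
```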

Many health systems rely on third-party translation services rather than (or in addition to) in-house translators. In-house teams may offer greater control over workflows, direct communication with clinicians, and familiarity with local patient populations, but they require sustained funding, staffing, and expertise across multiple languages. Third-party services can support smaller organizations that may lack specialized linguistic support, though they may introduce logistical hurdles such as limited oversight and slower turnaround times. Some of these challenges are more difficult to resolve, but others, like slower turnaround times, could be mitigated if third-party vendors adopt post-editing services. Rather than translating documents entirely de novo, an in-house LLM-based system could generate a first draft for the vendor to refine. This approach preserves the time-saving benefits of LLM-assisted translation while still leveraging the expertise and capacity of external resources.

Organizational culture is another key pillar of the Inner Setting. Health system leaders must emphasize that AI complements—rather than replaces—human translators, preserving a translator-in-the-loop model that aligns with Section 1557 of the Affordable Care Act28. This approach values the expertise of translators, many deeply rooted in local communities, and ensures patients with NELP receive the same high standard of care as English-speaking counterparts. By positioning human translators as indispensable for quality control, cultural adaptation, and patient engagement, health systems maintain equity while harnessing AI’s efficiency and scalability40.

IV. Implementation process domain

This section outlines practical steps for integrating LLM-supported MAT, drawing on the implementation of the Crisis Message Detector-1 (CMD-1), an AI-driven tool for triaging mental health messages41, as an example of embedding AI solutions into existing clinical workflows.

1. Aligning with existing workflows: Introduce MAT within platforms or software that translators and clinicians already use, thereby preserving established workflows. For instance, embedding an MAT draft feature into the EHR can streamline adoption. In the CMD-1 project, researchers integrated their AI model into Slack, an application providers already relied on, to minimize workflow disruptions.

2. Co-designing with end users: Actively involve translators and clinicians in MAT design—refining prompts, interfaces, and workflows—to address bottlenecks and foster trust. For example, if translators want an interactive interface with grading buttons, feedback fields, or error flags, co-designing these features into the EHR can improve implementation success. In CMD-1, clinicians set the risk tolerance for false positives and false negatives.

3. Retrospective testing: A critical first step in implementation is to evaluate MAT on retrospective data before deployment. In CMD-1, the team first tested the model on historical data in an offline environment. This safe environment enabled the identification of errors without compromising patient safety. For MAT, retrospective testing should follow LLM changes (e.g., fine-tuning) or significant updates to translator workflows.

4. Prospective testing: After retrospective evaluation, health systems could start with small-scale deployments—limited to one frequently requested language, document type, or a few translators—and track usability, error rates, and impact over a defined period (e.g., two weeks). Once implementation is proven effective under these conditions, they could expand coverage to additional languages, documents, or translators. Key takeaways from these pilots identify areas needing improvement and highlight where MAT struggles most (e.g., certain languages or note types), guiding selective deployment and future fine-tuning. Prospective testing should follow significant model updates (e.g., post-fine-tuning) or major workflow changes.

5. Plan for deployment infrastructure: Even the best-performing model can fail without robust technical and operational support. Health systems must first select a secure data-storage solution, such as BigQuery42, Snowflake43, or Azure44, to store PHI. Just as CMD-1 relied on a dedicated platform to store and audit crisis-detection outputs, MAT requires a similarly comprehensive logging framework. All translations should be logged, whether generated entirely by a human translator or initially drafted through the LLM-supported workflow and then refined by a human. Each translation request should log a unique identifier, timestamp, LLM used, the model’s initial translation, the final version, note type, target language, translator ID, flags for harmful outputs, and metrics such as generation and editing times (see the record-schema sketch after this list). Storing all of this in a single row for each translation record enables auditing, failure analysis, and model fine-tuning as part of ongoing clinical quality improvement.

6. Fine-tuning to meet deployment needs: Health systems can use the comprehensive logging framework to fine-tune LLM-based translation models in a supervised manner45. Specifically, final translator-approved versions (paired with the original English texts) provide gold-standard data for train/validation/test splits. The model is trained on the “train” split and monitored on the “validation” split using automatic metrics such as chrF++ or COMET, then evaluated on the “test” split to avoid overfitting. If a health system is using an open-source MAT model, it can reduce computational overhead by adopting parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which trains only a small set of added low-rank weight matrices rather than the full model45 (see the fine-tuning sketch after this list). By analyzing prospective testing results, health systems can identify note types or languages where the model struggles and target those for fine-tuning. Over time, this iterative process broadens the LLM’s capabilities, improving performance across diverse document types and languages. It is also important to note that fine-tuning efforts may evolve from solely optimizing for translation accuracy to incorporating translator preferences. Once the model consistently drafts high-quality translations, further fine-tuning—using approaches like Direct Preference Optimization46—can tailor outputs to better align with the needs of individual translators or entire teams.

7. Measuring real-world impact: Success should be defined by practical outcomes, such as faster translation turnaround times, higher user satisfaction (both for translators and patients), and improved patient outcomes, rather than technical metrics alone. Tracking adoption rates and downstream effects (e.g., patient comprehension or readmission rates) offers a more comprehensive view of the model’s value. These operational metrics can be monitored in real time using data dashboards such as Looker Studio47 to ensure the system meets clinical and patient needs.
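To illustrate step 5, one possible shape for a single audit-log row is sketched below as a Python dataclass. Field names and types are illustrative rather than a published standard, and would map onto a table in whichever data warehouse the institution selects.

```python
# One possible row schema for the translation audit log described in step 5.
# Field names and types are illustrative, not a published standard.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TranslationRecord:
    request_id: str             # unique identifier for the translation request
    created_at: datetime        # timestamp of the request
    model_name: str             # LLM used (version-pinned deployment name)
    source_text: str            # original English note
    llm_draft: str              # model's initial translation ("" if fully manual)
    final_text: str             # translator-approved version delivered to the patient
    note_type: str              # e.g., "discharge_summary", "consent_form"
    target_language: str        # e.g., "es", "ko"
    translator_id: str          # human reviewer accountable for the final text
    harmful_output_flag: bool   # translator-reported hallucination or unsafe content
    generation_seconds: float   # LLM drafting time
    editing_seconds: float      # human post-editing time
```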
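For step 6, the sketch below shows a minimal LoRA setup with the Hugging Face peft and transformers libraries, assuming an open-source sequence-to-sequence translation model; the model name, target modules, and hyperparameters are illustrative starting points rather than tuned recommendations.

```python
# Minimal LoRA fine-tuning setup; hyperparameters are illustrative defaults.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE = "facebook/nllb-200-distilled-600M"  # example open-source MT model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                 # rank of the added low-rank matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, train on the "train" split of translator-approved pairs with a
# standard transformers Seq2SeqTrainer, monitoring chrF++/COMET on validation.
```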

Evaluation is a critical component of implementation that can be approached through multiple methods.

1. Translation quality: The Multidimensional Quality Metrics (MQM) framework—which rates accuracy, fluency, terminology, style, and locale appropriateness and tags each error as critical, major, or minor—provides a robust gauge of quality48. Other healthcare MT studies have used the validated 5-point fluency-adequacy-meaning-severity rubric, which scores meaning preservation and grammar while labeling errors by clinical severity49. However, these frameworks are time-intensive and are best applied periodically (e.g., monthly) on a representative sample of translations. One way to form this sample is by computing sentence embeddings of the source text and selecting a diverse subset based on similarity scores50 (see the sampling sketch after this list). Ideally, manual evaluation should be performed by the translation team, leveraging their expertise for nuanced feedback that guides ongoing model improvements. For routine monitoring, we recommend a combination of automated metrics: chrF++ for character‑level similarity51 and COMET for semantic adequacy and fluency52 (see the metrics sketch after this list). BLEU, which measures n-gram overlap between the LLM output and the final human-approved text53, could be included for historical comparability. However, its correlation with human judgments is inconsistent, so it should be interpreted with caution and supplemented by newer metrics that show stronger alignment with human ratings54.

2. Operational metrics: Translation turnaround time and the proportion of patients with NELP who receive language-concordant discharge instructions should be used to measure system efficiency, pinpoint areas for workflow refinement, and ensure equitable care delivery across diverse language groups.

3. Clinical outcomes: Health systems need to track process and clinical outcomes. They can focus on measures prioritized by CMS, such as readmission and mortality rates for conditions like heart failure or chronic obstructive pulmonary disease1. These outcomes can be tracked by extracting data from clinical notes55,56 or ICD-10 codes57,58 for patients with NELP.

4. Patient understandability and actionability: The Patient Education Materials Assessment Tool (PEMAT), which assesses the clarity, ease of understanding, and actionability of health care materials, could be used to evaluate MAT59. Similar to MQM, it is time-consuming to administer and best applied periodically (e.g., monthly) on a representative subset of translations50. PEMAT evaluation should be carried out by patients or patient advisory boards. Another approach to evaluation involves having a clinician identify five key points from a discharge summary, then convert those into open-ended questions. A patient representative reads the translated text and answers these questions (Fig. 1). By grading the accuracy of responses, teams gauge how well LLM-supported translations convey essential clinical information.

Fig. 1: An illustrative patient-level evaluation workflow for LLM-supported MAT.

Here, a Spanish discharge summary is followed by five open-ended questions covering key clinical concepts, which the patient then answers. Each question has a “gold-standard” answer generated from the source English note, allowing evaluators to compare the patient’s responses with the intended information. This approach measures both the understandability and actionability of the translation by assessing how well patients grasp critical details from the LLM-generated text.

V. Outer setting domain

As LLM adoption increases, organizations such as the Joint Commission should conduct independent quality assurance checks of MAT workflows. As part of its accreditation process, the Joint Commission assesses whether hospitals comply with Section 1557 of the Affordable Care Act, which mandates that patients with NELP receive appropriate language access services28. In doing so, the Joint Commission verifies that hospitals have processes in place to deliver patient materials that are accurate, timely, and linguistically appropriate. As health systems begin to integrate MAT, these evaluations can be extended to examine whether MAT meets the same standards required of traditional translation methods, using the evaluation methods described in the “Implementation Process Domain.”

Health systems and public health agencies should collaborate to create a shared clinical‑translation corpus, similar to open resources such as MIMIC‑IV, which provides de‑identified clinical notes for public research use60. Each participating institution (academic medical centers, community hospitals, rural and safety‑net facilities) would contribute de‑identified source notes across the full spectrum of document types (e.g., post‑operative instructions, discharge summaries, consent forms) paired with translator‑verified versions. The corpus would (i) maintain distinct training and held‑out test datasets and (ii) grow through a community‑driven process in which sites donate a small batch of new notes, de‑identified with methods such as “hiding in plain sight”61. Efforts should be made to collect translations for digitally underrepresented languages. Over time, this iterative, collaborative approach will enrich clinical translation data available for fine‑tuning and will maintain an evolving benchmark for evaluating LLM-based MAT.

New policy guidelines are needed to support MAT implementation, beginning with national standards for LLM use. The HHS Office of Minority Health developed the National Standards for Culturally and Linguistically Appropriate Services (CLAS) in Health and Health Care. These standards, last updated in 2013, provide a framework for culturally and linguistically appropriate care62 but require revision to ensure accurate, reliable, and timely MAT. HHS should also revisit Section 1557 of the Affordable Care Act to aid implementation. Its May 6, 2024, rule revision was the first to address machine translation28, but further guidance is needed on safeguarding patient data, acceptable error rates, and critical error reporting. Given the substantial variation in LLM performance across different languages, policy guidelines need to be language-specific rather than a single blanket standard. For instance, health systems could rank languages based on an LLM’s internal representations63, examine the distribution of languages in the LLM’s training corpus (when made publicly available)64, or evaluate performance on an in-house testing dataset to determine the languages in which a given model meets MAT standards. The next rule revision must set baseline qualifications for LLMs used in MAT, ensuring that only suitable models are implemented.

Conclusion

Equitable access to linguistically concordant information remains a major unmet need for the millions of people in the United States who prefer a language other than English. Machine translation with LLMs offers promising solutions—reducing translator workload, shortening turnaround times, and extending translation services to resource-constrained settings. However, it also introduces new challenges, including context loss and inaccuracies in digitally underrepresented languages. For safe, equitable outcomes, health systems must continuously evaluate accuracy, address biases, and refine workflows. Moving forward, hybrid effectiveness-implementation studies across various clinical settings65 will be essential for determining whether LLM-based MAT truly reduces language-related disparities in practice.