Application of large language models in medicine

Liu, Fenglin; Zhou, Hongjian; Gu, Boyang; Zou, Xinyu; Huang, Jinfa; Wu, Jinge; Li, Yiru; Chen, Sam S.; Hua, Yining; Zhou, Peilin; Liu, Junling; Mao, Chengfeng; You, Chenyu; Wu, Xian; Zheng, Yefeng; Clifton, Lei; Li, Zheng; Luo, Jiebo; Clifton, David A.

doi:10.1038/s44222-025-00279-5

Review Article
Published: 07 April 2025

Nature Reviews Bioengineering volume 3, pages 445–464 (2025)

Abstract

Large language models (LLMs), such as ChatGPT, have received great attention owing to their capabilities for understanding and generating human language. Despite a trend in researching the application of LLMs in supporting different medical tasks (such as enhancing clinical diagnostics and providing medical education), a comprehensive assessment of their development, practical applications and outcomes in the medical space is still missing. Therefore, this Review aims to provide an overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we discuss the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. In terms of deployment, we compare different LLMs across various medical tasks and with state-of-the-art lightweight models.

Key points

Existing medical large language models (LLMs), ranging from 110 million to 520 billion parameters, are mainly developed through pre-training, fine-tuning and prompting methods using large-scale medical corpora from different sources.
Their performance is mostly evaluated based on exam-style question-answering tasks. Combining different fine-tuning and prompting methods enables LLMs to achieve comparable or even better results than experts.
LLMs perform poorly in non-question-answering tasks without pre-set options, thus requiring further improvements before integration into real clinical decision-making processes.
Medical LLMs are being adapted to various clinical applications, but large-scale clinical trials are still missing.
Mitigating hallucinations; establishing robust data, benchmarks and metrics; and addressing ethical, safety and regulatory concerns through interdisciplinary collaborations will help to accelerate the integration of LLMs into clinical practice.

**Fig. 1: Model size for different medical LLMs.**

**Fig. 3: Development of medical LLMs over time.**

**Fig. 4: Application of LLMs in medicine.**

References

Zhao, W. X. et al. A survey of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.18223 (2023).
Yang, J. et al. Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 160 (2024).
Article Google Scholar
Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at arXiv https://doi.org/10.48550/arXiv.2204.02311 (2022).
Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.13971 (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.09288 (2023).
Brown, T. et al. Language models are few-shot learners. In Proc. 34th Int. Conf. Neural Inform. Process. Syst.s (eds Larochelle, H. et al.) 1877–1901 (2020).
OpenAI et al. GPT-4 technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.08774 (2023).
Du, Z. et al. GLM: general language model pretraining with autoregressive blank infilling. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 320–335 (ACL, 2022).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Article Google Scholar
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. https://doi.org/10.1038/s41591-024-03423-7 (2025).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.16452 (2023).
Wu, C. et al. PMC-LLaMA: toward building open-source language models for medicine. J. Am. Med. Inform. Assoc. 31, 1833–1843 (2024).
Article Google Scholar
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Article Google Scholar
Li, Y. et al. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 15, 6 (2023).
Google Scholar
Han, T. et al. MedAlpaca — an open-source collection of medical conversational AI models and training data. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.08247 (2023).
Wang, H. et al. HuaTuo: tuning LLaMA model with Chinese medical knowledge. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.06975 (2023).
Toma, A. et al. Clinical Camel: an open-source expert-level medical language model with dialogue-based knowledge encoding. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.12031 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Article Google Scholar
Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? Lancet Digit. Health 5, e107–e108 (2023).
Article Google Scholar
Yang, X. et al. A large language model for electronic health records. npj Digit. Med. 5, 194 (2022).
Article Google Scholar
Abd-Alrazaq, A. et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med. Educ. 9, e48291 (2023).
Article Google Scholar
Peng, C. et al. A study of generative large language model for medical research and healthcare. npj Digit. Med. 6, 210 (2023).
Article Google Scholar
Alsentzer, E. et al. Publicly available clinical BERT embeddings. Preprint at arXiv https://doi.org/10.48550/arXiv.1904.03323 (2019).
Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
Article Google Scholar
Wu, J. et al. Clinical text datasets for medical artificial intelligence and large language models — a systematic review. NEJM AI 1, AIra2400012 (2024).
Article Google Scholar
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
Article Google Scholar
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).
Article Google Scholar
Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).
Article Google Scholar
Ye, Q. et al. Qilin-Med: multi-stage knowledge injection advanced medical large language model. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.09089 (2023).
Xiong, H. et al. DoctorGLM: fine-tuning your Chinese doctor is not a herculean task. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.01097 (2023).
Yang, S. et al. Zhongjing: enhancing the Chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Proc. AAAI Conf. Artif. Intell. (eds Wooldridge, M.J. et al.) 19368–19376 (AAAI, 2023).
Zhang, S. et al. Instruction tuning for large language models: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.10792 (2023).
He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. Inf. Fusion 118, 102963 (2025).
Article Google Scholar
Byambasuren, O. et al. Preliminary study on the construction of Chinese medical knowledge graph. J. Chin. Inf. Process. 33, 1–9 (2019).
Google Scholar
Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. 3rd Mach. Learn. Health Symp. (eds Hegselmann, S. et al.) 353–367 (PMLR, 2023).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Ann. Conf. Neural Inform. Process. Syst. (eds Oh, A. et al.) 28541–28564 (Curran Associates, 2023).
Saab, K. et al. Capabilities of Gemini models in medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.18416 (2024).
Hyland, S. L. et al. MAIRA-1: a specialised large multimodal model for radiology report generation. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13668 (2023).
Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.02463 (2023).
Zhang, X. et al. AlpaCare: instruction-tuned large language models for medical application. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.14558 (2023).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2106.09685 (2021).
Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. In Proc. 59th Annu. Meet. Assoc. Comput. Linguist. (eds Zong, C. et al.) 4582–4597 (ACL, 2021).
Liu, X. et al. P-Tuning: prompt tuning can be comparable to fine-tuning across scales and tasks. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 61–68 (ACL, 2022).
Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In Proc. 36th Int. Conf. Mach. Learn. (eds Chaudhuri, K. & Salakhutdinov, R.) 2790–2799 (PMLR, 2019).
Xu, C., Guo, D., Duan, N. & McAuley, J. Baize: an open-source chat model with parameter-efficient tuning on self-chat data. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 6268–6278 (ACL, 2023).
Shoham, O. B. & Rappoport, N. CPLLM: clinical prediction with large language models. PLoS Digit. Health 3, e0000680 (2024).
Article Google Scholar
Dong, Q. et al. A survey on in-context learning. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Al-Onaizan, Y. et al.) 1107–1128 (ACL, 2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 24824–24837 (Curran Associates, 2022).
Liu, Z. et al. DeID-GPT: zero-shot medical text de-identification by GPT-4. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.11032 (2023).
Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. In Proc. Confe. Empir. Methods Nat. Lang. Process. (eds Moens, M.-F. et al.) 3045–3059 (ACL, 2021).
Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2312.10997 (2023).
Luo, Y. et al. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08747 (2023).
Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13178 (2024).
Li, X. & Li, J. AnglE-optimized text embeddings. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.12871 (2023).
Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.16291 (2023).
Chen, J. et al. M3-Embedding: multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings Assoc. Comput. Linguist. (eds Ku, L. et al.) 2318–2335 (ACL, 2024).
Shao, Z. et al. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings Assoc. Comput. Linguist. (eds Bouamor, H. et al.) 9248–9274 (ACL, 2023).
Trivedi, H., Balasubramanian, N., Khot, T. & Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proc. 61st Annu. Meet. Asso. Comput. Linguist. (eds Rogers, A. et al.) 10014–10037 (ACL, 2023).
Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-rag: learning to retrieve, generate, and critique through self-reflection. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.11511 (2023).
Zakka, C. et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).
Article Google Scholar
Kim, J. & Min, M. From RAG to QA-RAG: integrating generative AI for pharmaceutical regulatory compliance process. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.01717 (2024).
Shi, W. et al. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-making. In Proc. 14th ACM Int. Conf. Bioinform. Comput. Biol. Health Inform. (ACM, 2023).
Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Med. 6, 158 (2023).
Article Google Scholar
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Article Google Scholar
Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am. Med. Inform. Assoc. 29, 1976–1988 (2022).
Article Google Scholar
Liu, F. et al. Retrieve, reason, and refine: generating accurate and faithful patient instructions. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 18864–18877 (Curran Associates, 2022).
Joseph, S. et al. Multilingual simplification of medical texts. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 16662–16692 (ACL, 2023).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. Conf. Empir. Methods Nat. Lang. Process. & 9th Int. Joint Conf. Nat. Lang. Process. (eds Inui, K. et al.) 2567–2577 (ACL, 2019).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proc. Conf. Health Inference Learn. (eds Flores, G. et al.) 248–260 (PMLR, 2022).
Omar, M., Nadkarni, G. N., Klang, E. & Glicksberg, B. S. Large language models in medicine: a review of current clinical trials across healthcare applications. PLoS Digit. Health 3, e0000662 (2024).
Article Google Scholar
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
Article Google Scholar
Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
Article Google Scholar
Gao, Y. et al. Leveraging a medical knowledge graph into large language models for diagnosis prediction. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.14321 (2023).
Chung, H. W. et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, 1–53 (2024).
Google Scholar
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2312.00164 (2023).
Kraljevic, Z. et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit. Health 6, e281–e290 (2024).
Article Google Scholar
Jin, Q. et al. Matching patients to clinical trials with large language models. Na. Commun. 15, 9074 (2024).
Article Google Scholar
US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06002425 (2024).
German Clinical Trials Register. drks.de https://drks.de/search/en/trial/DRKS00033775 (2024).
Wang, S., Zhao, Z., Ouyang, X., Wang, Q. & Shen, D. Interactive computer-aided diagnosis on medical image using large language models. Commun. Eng. 3, 133 (2024).
Article Google Scholar
Huang, C.-W., Tsai, S.-C. & Chen, Y.-N. PLM-ICD: automatic ICD coding with pretrained language models. In Proc. 4th Clin. Nat. Lang. Process. Workshop (eds Naumann, T et al.) 10–20 (ACL, 2022).
Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. DRG–LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. npj Digit. Med. 7, 16 (2024).
Article Google Scholar
Liu, J., Yang, S., Peng, T., Hu, X. & Zhu, Q. ChatICD: prompt learning for few-shot ICD coding through ChatGPT. In 2023 IEEE Int. Conf. Bioinform. Biomed. (eds Jian,g, X. et al.) 4360–4367 (IEEE, 2023).
Yang, Z., Batra, S. S., Stremmel, J. & Halperin, E. Surpassing GPT-4 medical coding with a two-stage approach. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13735 (2023).
Elkin, P. L. & Brown, S. H. in Terminology, Ontology and their Implementations 2nd edn (ed. Elkin, P. L.) 367–370 (Springer, 2023).
Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at arXiv https://doi.org/10.48550/arXiv.1907.11692 (2019).
Liu, F., Wu, X., Ge, S., Fan, W. & Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report generation. In 2021 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 13748–13757 (IEEE, 2021).
Liu, X. et al. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat. Med. 25, 1467–1469 (2019).
Article Google Scholar
Ma, C. et al. An iterative optimizing framework for radiology report summarization with ChatGPT. In IEEE Trans. Artif. Intell. 4163–4175 (IEEE, 2024).
Van Veen, D. et al. RadAdapt: radiology report summarization via lightweight domain adaptation of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.01146 (2023).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Proc. Workshop Text Summarization Branches Out 74–81 (ACL, 2004).
Papineni, K., Roukos, S., Ward, T. & Zhu, W. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annu. Meet. Assoc. Comput. Linguist. (eds Isabelle, P. et al.) 311–318 (ACL, 2002).
Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 100802 (2023).
Article Google Scholar
US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06263855 (2024).
Xie, Q. et al. Faithful AI in medicine: a systematic review with large language models and beyond. Preprint at medRxiv https://doi.org/10.1101/2023.04.18.23288752 (2023).
Dupont, P. E. A decade retrospective of medical robotics research from 2010 to 2020. Sci. Robot. 6, eabi8017 (2021).
Article Google Scholar
Xu, H. et al. Enhancing surgical robots with embodied intelligence for autonomous ultrasound scanning. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.00461 (2024).
Wang, J. et al. Large language models for robotics: opportunities, challenges, and perspectives. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.04334 (2024).
Moghani, M. et al. SuFIA: language-guided augmented dexterity for robotic surgical assistants. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.05226 (2024).
Killeen, B. D., Chaudhary, S., Osgood, G. & Unberath, M. Take a shot! natural language control of intelligent robotic X-ray systems in surgery. Int. J. Comput. Assist. Radiol. Surg. 19, 1165–1173 (2024).
Article Google Scholar
Weerarathna, I. N., Raymond, D. & Luharia, A. Human-robot collaboration for healthcare: a narrative review. Cureus 15, e49210 (2023).
Google Scholar
García-Ferrero, I. et al. Medical mT5: an open-source multilingual text-to-text LLM for the medical domain. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.07613 (2024).
Wang, X. et al. Apollo: lightweight multilingual medical LLM towards democratizing medical AI to 6b people. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.03640 (2024).
Pieri, S. et al. BIMediX: bilingual medical mixture of experts LLM. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13253 (2024).
Tang, C., Wang, S., Goldsack, T. & Lin, C. Improving biomedical abstractive summarisation with knowledge aggregation from citation papers. In Proc. Conf. Empir. Methods Nat. Lang. Proces. (eds Bouamor, H. et al.) 606–618 (ACL, 2023).
Guo, Y., Qiu, W., Leroy, G., Wang, S. & Cohen, T. Retrieval augmentation of large language models for lay language generation. J. Biomed. Inform. 149, 104580 (2024).
Article Google Scholar
Chen, Y., Arunasalam, A. & Celik, Z. B. Can large language models provide security & privacy advice? Measuring the ability of LLMs to refute misconceptions. In Proc. 39th Annu. Comput. Secur. Appl. Conf. 366–378 (ACL, 2023).
Karabacak, M. et al. The advent of generative language models in medical education. JMIR Med. Educ. 9, e48163 (2023).
Article Google Scholar
Biri, S. K. et al. Assessing the utilization of large language models in medical education: insights from undergraduate medical students. Cureus 15, e47468 (2023).
Google Scholar
Ahn, S. The impending impacts of large language models on medical education. Korean J. Med. Educ. 35, 103–107 (2023).
Article Google Scholar
Peacock, J., Austin, A., Shapiro, M., Battista, A. & Samuel, A. Accelerating medical education with ChatGPT: an implementation guide. MedEdPublish 13, 64 (2023).
Article Google Scholar
Tian, Q. et al. Iteratively refined ChatGPT outperforms clinical mentors in generating high-quality interprofessional education clinical scenarios: a comparative study. Preprint at Res. Sq. https://doi.org/10.21203/rs.3.rs-4637356/v1 (2024).
Veras, M. et al. Usability and efficacy of artificial intelligence chatbots (ChatGPT) for health sciences students: protocol for a crossover randomized controlled trial. JMIR Res. Protoc. 12, e51873 (2023).
Article Google Scholar
Rawte, V., Sheth, A. & Das, A. A survey of hallucination in large foundation models. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.05922 (2023).
Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S. & Torous, J. B. Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Can. J. Psychiatry 64, 456–464 (2019).
Article Google Scholar
Stock, A., Schlögl, S. & Groth, A. Tell me, what are you most afraid of? Exploring the effects of agent representation on information disclosure in human–chatbot interaction. In Proc. Int. Conf. Hum. Comput. Interact. (eds Degen, H. et al.) 179–191 (Springer, 2023). https://doi.org/10.1007/978-3-031-35894-4_13.
Liu, J. M. et al. ChatCounselor: a large language models for mental health support. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.15461 (2023).
Robinson, N., Connolly, J., Suddrey, G. & Kavanagh, D. J. A brief wellbeing training session delivered by a humanoid social robot: a pilot randomized controlled trial. Int. J. Soc. Robot. 16, 937–951 (2024).
Article Google Scholar
Lai, T. et al. Psy-LLM: scaling up global mental health psychological services with AI-based large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.11991 (2023).
Qiu, H., Li, A., Ma, L. & Lan, Z. PsyChat: a client-centric dialogue system for mental health support. In Proc. 27th Int. Conf. Comput. Support. Coop. Work Des. (CSCWD) 2979–2984 (IEEE, 2024).
US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06346496 (2024).
Xu, X. et al. Mental-LLM: leveraging large language models for mental health prediction via online text data. In Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. Vol. 8 1–32 (ACM, 2024).
Ma, Z., Mei, Y. & Su, Z. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. AMIA Annu. Symp. Proc. 2023, 1105 (2023).
Google Scholar
Chung, N. C., Dyer, G. & Brocki, L. Challenges of large language models for mental health counseling. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13857 (2023).
Ren, Z., Zhan, Y., Yu, B., Ding, L. & Tao, D. Healthcare Copilot: eliciting the power of general LLMs for medical consultation. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13408 (2024).
Tu, T. et al. Towards conversational diagnostic AI. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.05654 (2024).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Article Google Scholar
Chinese Clinical Trial Register. ChiCTR.org.cn https://www.chictr.org.cn/showproj.html?proj=220887 (2024).
Stokel-Walker, C. ChatGPT listed as author on research papers: many scientists disapprove. Nature 613, 620–621 (2023).
Article Google Scholar
Roit, P. et al. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proc. 61st Annu. Meet. Assoc. Comput. Linguist. (eds Rogers, A. et al.) 6252–6272 (ACL, 2023).
Chern, I.-C. et al. Improving factuality of abstractive summarization via contrastive reward learning. In Proc. 3rd Workshop Trustworthy Nat. Lang. Process. (eds Ovalle, A. et al.) 55–60 (ACL, 2023).
Manakul, P., Liusie, A. & Gales, M. J. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 9004–9017 (ACL, 2023).
Shuster, K., Poff, S., Chen, M., Kiela, D. & Weston, J. Retrieval augmentation reduces hallucination in conversation. In Find. Assoc. Comput. Linguist. (eds Moens, M. et al.) 3784–3803 (ACL, 2021).
Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.11495 (2023).
Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 3214–3252 (ACL, 2022).
Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y. & Wen, J.-R. HaluEval: a large-scale hallucination evaluation benchmark for large language models. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 6449–6464 (ACL, 2023).
Liu, F. et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Proc. 35th Int. Conf. Neural Inform. Process. Syst. (eds Ranzato, M. et al.) 16266–16279 (Curran Associates, 2021).
Shumailov, I. et al. Model dementia: generated data makes models forget. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.17493 (2023).
Hoelscher-Obermaier, J., Persson, J., Kran, E., Konstas, I. & Barez, F. Detecting edit failures in large language models: an improved specificity benchmark. In Find. Assoc. Comput. Linguist. (eds Rogers, A. et al.) 11548–11559 (ACL, 2023).
Liu, F. et al. A medical multimodal large language model for future pandemics. npj Digit. Med. 6, 226 (2023).
Article Google Scholar
Yao, Y. et al. Editing large language models: problems, methods, and opportunities. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 10222–10240 (ACL, 2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Adv. Neural Inform. Process. Syst. (eds Larochelle, H. et al.) 9459–9474 (Curran Associates, 2020).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. 36th Int. Conf. Neural Inf. Process. Syst. (eds Koyejo, S. et al.) 27730–27744 (Curran Associates, 2022).
Glaese, A. et al. Improving alignment of dialogue agents via targeted human judgements. Preprint at arXiv https://doi.org/10.48550/arXiv.2209.14375 (2022).
Xi, Z. et al. The rise and potential of large language model based agents: a survey. Sci. China Inf. Sci. 68, 121101 (2025).
Article Google Scholar
Liu, H., Sferrazza, C. & Abbeel, P. Chain of hindsight aligns language models with feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.02676 (2023).
Sallam, M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare 11, 887 (2023).
Article Google Scholar
Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 25, bbad493 (2024).
Article Google Scholar
Li, H. et al. Multi-step jailbreaking privacy attacks on ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.05197 (2023).
Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: how does LLM safety training fail? In Adv. Neural Inform. Process. Syst. (eds Oh, A. et al.) 80079–80110 (Curran Associates, 2023).
Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digit. Med. 6, 120 (2023).
Article Google Scholar
Derraz, B. et al. New regulatory thinking is needed for AI-based personalised drug and cell therapies in precision oncology. npj Precis. Oncol. 8, 23 (2024).
Article Google Scholar
Mökander, J., Schuett, J., Kirk, H. R. & Floridi, L. Auditing large language models: a three-layered approach. AI Ethics 4, 1085–1115 (2024).
Article Google Scholar
Liu, F. et al. Large language models are poor clinical decision-makers: a comprehensive benchmark. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Al-Onaizan, Y. et al.) 13696–13710 (ACL, 2024).
Yin, S. et al. A survey on multimodal large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.13549 (2023).
Huang, H. et al. ChatGPT for shaping the future of dentistry: the potential of multi-modal large language model. Int. J. Oral Sci. 15, 29 (2023).
Article Google Scholar
Li, J., Liu, C., Cheng, S., Arcucci, R. & Hong, S. Frozen language model helps ECG zero-shot learning. Proc. Mach. Learn. Res. 227, 402–415 (2023).
Google Scholar
Englhardt, Z. et al. Exploring and characterizing large language models for embedded system development and debugging. In Proc. Extend. Abstr. CHI Conf. Hum. Factor. Comput. Syst. (eds Mueller,F. et al.) 150:1–150:9 (ACM, 2024).
Mello, M. M. & Guha, N. ChatGPT and physicians’ malpractice risk. JAMA Health Forum 4, e231938 (2023).
Article Google Scholar
Mekki, Y. M. & Zughaier, S. M. Teaching artificial intelligence in medicine. Nat. Rev. Bioeng. 2, 450–451 (2024).
Article Google Scholar
Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. Conf. Empir. Methods Nat. Lang. Process. & 9th Int. Joint Conf. Nat. Lang. Proces. (eds Inui, K. et al.) 3615–3620 (ACL, 2019).
Alrowili, S. & Shanker, V. Large biomedical question answering models with ALBERT and ELECTRA. In Conf. Labs Eval. Forum 213–220 (2021).
Gururangan, S. et al. Don’t stop pretraining: adapt language models to domains and tasks. In Proc. 58th Annu. Meet. Assoc. Comput. Linguist. (eds Jurafsky, D. et al.) 8342–8360 (ACL, 2020).
Yasunaga, M., Leskovec, J. & Liang, P. Linkbert: pretraining language models with document links. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. Vol. 1 (eds Muresan, S. et al.) 8003–8016 (ACL, 2022).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. 18th BioNLP Workshop Shared Task (eds Demner-Fushman, D. et al.) 58–65 (ACL, 2019).
Phan, L. N. et al. SciFive: a text-to-text transformer model for biomedical literature. Preprint at arXiv https://doi.org/10.48550/arXiv.2106.03598 (2021).
Lu, Q., Dou, D. & Nguyen, T. ClinicalT5: a generative language model for clinical text. In Find. Assoc. Comput. Linguist. (eds Goldberg, Y. et al.) 5436–5443 (ACL, 2022).
Jin, Q. et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, btad651 (2023).
Yasunaga, M. et al. Deep bidirectional language-knowledge graph pretraining. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 37309–37323 (Curran Associates, 2022).
Venigalla, A., Frankle, J. & Carbin, M. BioMedLM: a domain-specific large language model for biomedical text. MosaicML https://medium.com/@MosaicML/pubmed-gpt-a-domain-specific-large-language-model-for-biomedical-text-567b18e2b11 (2022).
Gao, W. et al. OphGLM: training an ophthalmology large language-and-vision assistant based on instructions and dialogue. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.12174 (2023).
Chen, Y. et al. BianQue: balancing the questioning and suggestion ability of health LLMs with multi-turn health conversations polished by ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.15896 (2023).
Wang, G., Yang, G., Du, Z., Fan, L. & Li, X. ClinicalGPT: large language models finetuned with diverse medical data and comprehensive evaluation. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.09968 (2023).
Zhang, H. et al. HuatuoGPT, towards taming language model to be a doctor. In Find. Assoc. Comput. Linguist. (eds Bouamor, H. et al.) 10859–10885 (ACL, 2023).
Luo, Y. et al. BioMedGPT: open multimodal generative pre-trained transformer for biomedicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.09442 (2023).
Ferber, D. et al. GPT-4 for information retrieval and comparison of medical oncology guidelines. NEJM AI 1, AIcs2300235 (2024).
Chen, Z. et al. MEDITRON-70B: scaling medical pretraining for large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.16079 (2023).
He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering. Preprint at arXiv https://doi.org/10.48550/arXiv.2003.10286 (2020).
Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.03162 (2024).
Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, 100943 (2024).
Sun, Z., Luo, C., Liu, Z. & Huang, Z. Conversational disease diagnosis via external planner-controlled large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.04292 (2024).
Dong, H. et al. Automated clinical coding: what, why, and where we are? npj Digit. Med. 5, 159 (2022).
D’Onofrio, G. et al. Emotion recognizing by a robotic solution initiative (EMOTIVE project). Sensors 22, 2861 (2022).
Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In Proc. 14th Int. Conf. Neural Inform. Process. Syst. (eds Leen, T. K. et al.) 893–899 (MIT Press, 2000).
Mikolov, T., Karafiát, M., Burget, L., Černocký, J. & Khudanpur, S. Recurrent neural network based language model. In Proc. 11th Annu. Conf. Int. Speech Commun. Assoc. (eds Kobayashi, T. et al.) 1045–1048 (ISCA, 2010).
Sundermeyer, M., Ney, H. & Schlüter, R. From feedforward to recurrent LSTM neural networks for language modeling. In IEEE/ACM Transact. Audio Speech Lang. Process. (ed. Li, H.) 517–529 (IEEE, 2015).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Conf. North Am. Chapt. Assoc. Comput. Linguist. (eds Burstein, J. et al.) 4171–4186 (ACL, 2019).
Vaswani, A. et al. Attention is all you need. In Proc. 31st Int. Conf. Neural Inform. Process. Syst. (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates, 2017).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2001.08361 (2020).
Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 30016–30030 (Curran Associates, 2022).
He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention. Preprint at arXiv https://doi.org/10.48550/arXiv.2006.03654 (2021).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).
The Vicuna team. Vicuna: an open-source chatbot impressing GPT-4 with 90% ChatGPT quality. LMSYS ORG https://lmsys.org/blog/2023-03-30-vicuna/ (2023).
Jiang, A. Q. et al. Mistral 7B. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.06825 (2023).
Bai, J. et al. Qwen technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.16609 (2023).
Lewis, M. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. 58th Annu. Meet. Assoc. Comput. Linguist. (eds Jurafsky, D. et al.) 7871–7880 (ACL, 2020).
Tay, Y. et al. UL2: unifying language learning paradigms. Preprint at arXiv https://doi.org/10.48550/arXiv.2205.05131 (2022).

Acknowledgements

This work was supported in part by the Pandemic Sciences Institute at the University of Oxford, the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, an NIHR Research Professorship, a Royal Academy of Engineering Research Chair, the Wellcome Trust-funded VITAL project, UK Research and Innovation, the Engineering and Physical Sciences Research Council, the InnoHK Hong Kong Centre for Cerebro-cardiovascular Engineering (COCHE), the Clarendon Fund and the Magdalen Graduate Scholarship.

Author information

These authors contributed equally: Fenglin Liu, Hongjian Zhou, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu.
These authors jointly supervised this work: Fenglin Liu, Hongjian Zhou, Zheng Li, Jiebo Luo, David A. Clifton.

Authors and Affiliations

Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK
Fenglin Liu, Hongjian Zhou & David A. Clifton
Department of Computing, Imperial College London, London, UK
Boyang Gu
Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
Xinyu Zou
Department of Computer Science, University of Rochester, Rochester, NY, USA
Jinfa Huang & Jiebo Luo
Institute of Health Informatics, University College London, London, UK
Jinge Wu
Western University, London, Ontario, Canada
Yiru Li
Department of Kinesiology, University of Georgia, Athens, GA, USA
Sam S. Chen
Harvard T.H. Chan School of Public Health, Boston, MA, USA
Yining Hua
Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Peilin Zhou
School of ECE, Peking University, Shenzhen, China
Junling Liu
Massachusetts Institute of Technology, Cambridge, MA, USA
Chengfeng Mao
Stony Brook University, Stony Brook, NY, USA
Chenyu You
Jarvis Research Center, Tencent YouTu Laboratory, Beijing, China
Xian Wu & Yefeng Zheng
Medical Artificial Intelligence Laboratory, Westlake University, Hangzhou, China
Yefeng Zheng
Applied Digital Health (ADH), Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
Lei Clifton
Amazon, Palo Alto, CA, USA
Zheng Li
Oxford-Suzhou Centre for Advanced Research, Suzhou, China
David A. Clifton

Authors

Fenglin Liu
Hongjian Zhou
Boyang Gu
Xinyu Zou
Jinfa Huang
Jinge Wu
Yiru Li
Sam S. Chen
Yining Hua
Peilin Zhou
Junling Liu
Chengfeng Mao
Chenyu You
Xian Wu
Yefeng Zheng
Lei Clifton
Zheng Li
Jiebo Luo
David A. Clifton

Contributions

H.Z. and F.L. conceived and designed the study. H.Z., F.L., B.G., X.Z., J.H. and J.W. conducted the literature review, performed data analysis and drafted the manuscript. All authors contributed to the interpretation and final manuscript preparation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fenglin Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Bioengineering thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, F., Zhou, H., Gu, B. et al. Application of large language models in medicine. Nat Rev Bioeng 3, 445–464 (2025). https://doi.org/10.1038/s44222-025-00279-5

Accepted: 20 January 2025
Published: 07 April 2025
Version of record: 07 April 2025
Issue date: June 2025
DOI: https://doi.org/10.1038/s44222-025-00279-5
