  • Review Article

Application of large language models in medicine

Abstract

Large language models (LLMs), such as ChatGPT, have received great attention owing to their capabilities for understanding and generating human language. Despite a trend in researching the application of LLMs in supporting different medical tasks (such as enhancing clinical diagnostics and providing medical education), a comprehensive assessment of their development, practical applications and outcomes in the medical space is still missing. Therefore, this Review aims to provide an overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we discuss the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. In terms of deployment, we compare different LLMs across various medical tasks and with state-of-the-art lightweight models.

Key points

  • Existing medical large language models (LLMs), ranging from 110 million to 520 billion parameters, are mainly developed through pre-training, fine-tuning and prompting methods using large-scale medical corpora from different sources.

  • Their performance is mostly evaluated based on exam-style question-answering tasks. Combining different fine-tuning and prompting methods enables LLMs to achieve comparable or even better results than experts.

  • LLMs perform poorly in non-question-answering tasks without pre-set options, thus requiring further improvements before integration into real clinical decision-making processes.

  • Medical LLMs are being adapted to various clinical applications, but large-scale clinical trials are still missing.

  • Mitigating hallucinations; establishing robust data, benchmarks and metrics; and addressing ethical, safety and regulatory concerns through interdisciplinary collaborations will help to accelerate the integration of LLMs into clinical practice.

Fig. 1: Model size for different medical LLMs.
Fig. 2: Performance comparison.
Fig. 3: Development of medical LLMs over time.
Fig. 4: Application of LLMs in medicine.

References

  1. Zhao, W. X. et al. A survey of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.18223 (2023).

  2. Yang, J. et al. Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 18, 160 (2024).

    Article  Google Scholar 

  3. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at arXiv https://doi.org/10.48550/arXiv.2204.02311 (2022).

  4. Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.13971 (2023).

  5. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.09288 (2023).

  6. Brown, T. et al. Language models are few-shot learners. In Proc. 34th Int. Conf. Neural Inform. Process. Syst.s (eds Larochelle, H. et al.) 1877–1901 (2020).

  7. OpenAI et al. GPT-4 technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.08774 (2023).

  8. Du, Z. et al. GLM: general language model pretraining with autoregressive blank infilling. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 320–335 (ACL, 2022).

  9. Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

    Article  Google Scholar 

  10. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. https://doi.org/10.1038/s41591-024-03423-7 (2025).

  11. Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.16452 (2023).

  12. Wu, C. et al. PMC-LLaMA: toward building open-source language models for medicine. J. Am. Med. Inform. Assoc. 31, 1833–1843 (2024).

    Article  Google Scholar 

  13. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

    Article  Google Scholar 

  14. Li, Y. et al. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 15, 6 (2023).

    Google Scholar 

  15. Han, T. et al. MedAlpaca — an open-source collection of medical conversational AI models and training data. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.08247 (2023).

  16. Wang, H. et al. HuaTuo: tuning LLaMA model with Chinese medical knowledge. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.06975 (2023).

  17. Toma, A. et al. Clinical Camel: an open-source expert-level medical language model with dialogue-based knowledge encoding. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.12031 (2023).

  18. Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

    Article  Google Scholar 

  19. Patel, S. B. & Lam, K. ChatGPT: the future of discharge summaries? Lancet Digit. Health 5, e107–e108 (2023).

    Article  Google Scholar 

  20. Yang, X. et al. A large language model for electronic health records. npj Digit. Med. 5, 194 (2022).

    Article  Google Scholar 

  21. Abd-Alrazaq, A. et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med. Educ. 9, e48291 (2023).

    Article  Google Scholar 

  22. Peng, C. et al. A study of generative large language model for medical research and healthcare. npj Digit. Med. 6, 210 (2023).

    Article  Google Scholar 

  23. Alsentzer, E. et al. Publicly available clinical BERT embeddings. Preprint at arXiv https://doi.org/10.48550/arXiv.1904.03323 (2019).

  24. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

    Article  Google Scholar 

  25. Wu, J. et al. Clinical text datasets for medical artificial intelligence and large language models — a systematic review. NEJM AI 1, AIra2400012 (2024).

    Article  Google Scholar 

  26. Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).

    Article  Google Scholar 

  27. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

    Article  Google Scholar 

  28. Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

    Article  Google Scholar 

  29. Ye, Q. et al. Qilin-Med: multi-stage knowledge injection advanced medical large language model. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.09089 (2023).

  30. Xiong, H. et al. DoctorGLM: fine-tuning your Chinese doctor is not a herculean task. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.01097 (2023).

  31. Yang, S. et al. Zhongjing: enhancing the Chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Proc. AAAI Conf. Artif. Intell. (eds Wooldridge, M.J. et al.) 19368–19376 (AAAI, 2023).

  32. Zhang, S. et al. Instruction tuning for large language models: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.10792 (2023).

  33. He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. Inf. Fusion 118, 102963 (2025).

    Article  Google Scholar 

  34. Byambasuren, O. et al. Preliminary study on the construction of Chinese medical knowledge graph. J. Chin. Inf. Process. 33, 1–9 (2019).

    Google Scholar 

  35. Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. 3rd Mach. Learn. Health Symp. (eds Hegselmann, S. et al.) 353–367 (PMLR, 2023).

  36. Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Ann. Conf. Neural Inform. Process. Syst. (eds Oh, A. et al.) 28541–28564 (Curran Associates, 2023).

  37. Saab, K. et al. Capabilities of Gemini models in medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.18416 (2024).

  38. Hyland, S. L. et al. MAIRA-1: a specialised large multimodal model for radiology report generation. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13668 (2023).

  39. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.02463 (2023).

  40. Zhang, X. et al. AlpaCare: instruction-tuned large language models for medical application. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.14558 (2023).

  41. Hu, E. J. et al. LoRA: low-rank adaptation of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2106.09685 (2021).

  42. Li, X. L. & Liang, P. Prefix-tuning: optimizing continuous prompts for generation. In Proc. 59th Annu. Meet. Assoc. Comput. Linguist. (eds Zong, C. et al.) 4582–4597 (ACL, 2021).

  43. Liu, X. et al. P-Tuning: prompt tuning can be comparable to fine-tuning across scales and tasks. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 61–68 (ACL, 2022).

  44. Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In Proc. 36th Int. Conf. Mach. Learn. (eds Chaudhuri, K. & Salakhutdinov, R.) 2790–2799 (PMLR, 2019).

  45. Xu, C., Guo, D., Duan, N. & McAuley, J. Baize: an open-source chat model with parameter-efficient tuning on self-chat data. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 6268–6278 (ACL, 2023).

  46. Shoham, O. B. & Rappoport, N. CPLLM: clinical prediction with large language models. PLoS Digit. Health 3, e0000680 (2024).

    Article  Google Scholar 

  47. Dong, Q. et al. A survey on in-context learning. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Al-Onaizan, Y. et al.) 1107–1128 (ACL, 2024).

  48. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 24824–24837 (Curran Associates, 2022).

  49. Liu, Z. et al. DeID-GPT: zero-shot medical text de-identification by GPT-4. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.11032 (2023).

  50. Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. In Proc. Confe. Empir. Methods Nat. Lang. Process. (eds Moens, M.-F. et al.) 3045–3059 (ACL, 2021).

  51. Gao, Y. et al. Retrieval-augmented generation for large language models: a survey. Preprint at arXiv https://doi.org/10.48550/arXiv.2312.10997 (2023).

  52. Luo, Y. et al. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.08747 (2023).

  53. Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking retrieval-augmented generation for medicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13178 (2024).

  54. Li, X. & Li, J. AnglE-optimized text embeddings. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.12871 (2023).

  55. Wang, G. et al. Voyager: an open-ended embodied agent with large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.16291 (2023).

  56. Chen, J. et al. M3-Embedding: multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings Assoc. Comput. Linguist. (eds Ku, L. et al.) 2318–2335 (ACL, 2024).

  57. Shao, Z. et al. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings Assoc. Comput. Linguist. (eds Bouamor, H. et al.) 9248–9274 (ACL, 2023).

  58. Trivedi, H., Balasubramanian, N., Khot, T. & Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proc. 61st Annu. Meet. Asso. Comput. Linguist. (eds Rogers, A. et al.) 10014–10037 (ACL, 2023).

  59. Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-rag: learning to retrieve, generate, and critique through self-reflection. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.11511 (2023).

  60. Zakka, C. et al. Almanac—retrieval-augmented language models for clinical medicine. NEJM AI 1, AIoa2300068 (2024).

    Article  Google Scholar 

  61. Kim, J. & Min, M. From RAG to QA-RAG: integrating generative AI for pharmaceutical regulatory compliance process. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.01717 (2024).

  62. Shi, W. et al. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-making. In Proc. 14th ACM Int. Conf. Bioinform. Comput. Biol. Health Inform. (ACM, 2023).

  63. Tang, L. et al. Evaluating large language models on medical evidence summarization. npj Digit. Med. 6, 158 (2023).

    Article  Google Scholar 

  64. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

    Article  Google Scholar 

  65. Ondov, B., Attal, K. & Demner-Fushman, D. A survey of automated methods for biomedical text simplification. J. Am. Med. Inform. Assoc. 29, 1976–1988 (2022).

    Article  Google Scholar 

  66. Liu, F. et al. Retrieve, reason, and refine: generating accurate and faithful patient instructions. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 18864–18877 (Curran Associates, 2022).

  67. Joseph, S. et al. Multilingual simplification of medical texts. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 16662–16692 (ACL, 2023).

  68. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. Conf. Empir. Methods Nat. Lang. Process. & 9th Int. Joint Conf. Nat. Lang. Process. (eds Inui, K. et al.) 2567–2577 (ACL, 2019).

  69. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proc. Conf. Health Inference Learn. (eds Flores, G. et al.) 248–260 (PMLR, 2022).

  70. Omar, M., Nadkarni, G. N., Klang, E. & Glicksberg, B. S. Large language models in medicine: a review of current clinical trials across healthcare applications. PLoS Digit. Health 3, e0000662 (2024).

    Article  Google Scholar 

  71. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).

    Article  Google Scholar 

  72. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).

    Article  Google Scholar 

  73. Gao, Y. et al. Leveraging a medical knowledge graph into large language models for diagnosis prediction. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.14321 (2023).

  74. Chung, H. W. et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, 1–53 (2024).

    Google Scholar 

  75. McDuff, D. et al. Towards accurate differential diagnosis with large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2312.00164 (2023).

  76. Kraljevic, Z. et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit. Health 6, e281–e290 (2024).

    Article  Google Scholar 

  77. Jin, Q. et al. Matching patients to clinical trials with large language models. Na. Commun. 15, 9074 (2024).

    Article  Google Scholar 

  78. US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06002425 (2024).

  79. German Clinical Trials Register. drks.de https://drks.de/search/en/trial/DRKS00033775 (2024).

  80. Wang, S., Zhao, Z., Ouyang, X., Wang, Q. & Shen, D. Interactive computer-aided diagnosis on medical image using large language models. Commun. Eng. 3, 133 (2024).

    Article  Google Scholar 

  81. Huang, C.-W., Tsai, S.-C. & Chen, Y.-N. PLM-ICD: automatic ICD coding with pretrained language models. In Proc. 4th Clin. Nat. Lang. Process. Workshop (eds Naumann, T et al.) 10–20 (ACL, 2022).

  82. Wang, H., Gao, C., Dantona, C., Hull, B. & Sun, J. DRG–LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. npj Digit. Med. 7, 16 (2024).

    Article  Google Scholar 

  83. Liu, J., Yang, S., Peng, T., Hu, X. & Zhu, Q. ChatICD: prompt learning for few-shot ICD coding through ChatGPT. In 2023 IEEE Int. Conf. Bioinform. Biomed. (eds Jian,g, X. et al.) 4360–4367 (IEEE, 2023).

  84. Yang, Z., Batra, S. S., Stremmel, J. & Halperin, E. Surpassing GPT-4 medical coding with a two-stage approach. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13735 (2023).

  85. Elkin, P. L. & Brown, S. H. in Terminology, Ontology and their Implementations 2nd edn (ed. Elkin, P. L.) 367–370 (Springer, 2023).

  86. Liu, Y. et al. RoBERTa: a robustly optimized BERT pretraining approach. Preprint at arXiv https://doi.org/10.48550/arXiv.1907.11692 (2019).

  87. Liu, F., Wu, X., Ge, S., Fan, W. & Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report generation. In 2021 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 13748–13757 (IEEE, 2021).

  88. Liu, X. et al. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat. Med. 25, 1467–1469 (2019).

    Article  Google Scholar 

  89. Ma, C. et al. An iterative optimizing framework for radiology report summarization with ChatGPT. In IEEE Trans. Artif. Intell. 4163–4175 (IEEE, 2024).

  90. Van Veen, D. et al. RadAdapt: radiology report summarization via lightweight domain adaptation of large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.01146 (2023).

  91. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Proc. Workshop Text Summarization Branches Out 74–81 (ACL, 2004).

  92. Papineni, K., Roukos, S., Ward, T. & Zhu, W. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annu. Meet. Assoc. Comput. Linguist. (eds Isabelle, P. et al.) 311–318 (ACL, 2002).

  93. Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 100802 (2023).

    Article  Google Scholar 

  94. US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06263855 (2024).

  95. Xie, Q. et al. Faithful AI in medicine: a systematic review with large language models and beyond. Preprint at medRxiv https://doi.org/10.1101/2023.04.18.23288752 (2023).

  96. Dupont, P. E. A decade retrospective of medical robotics research from 2010 to 2020. Sci. Robot. 6, eabi8017 (2021).

    Article  Google Scholar 

  97. Xu, H. et al. Enhancing surgical robots with embodied intelligence for autonomous ultrasound scanning. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.00461 (2024).

  98. Wang, J. et al. Large language models for robotics: opportunities, challenges, and perspectives. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.04334 (2024).

  99. Moghani, M. et al. SuFIA: language-guided augmented dexterity for robotic surgical assistants. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.05226 (2024).

  100. Killeen, B. D., Chaudhary, S., Osgood, G. & Unberath, M. Take a shot! natural language control of intelligent robotic X-ray systems in surgery. Int. J. Comput. Assist. Radiol. Surg. 19, 1165–1173 (2024).

    Article  Google Scholar 

  101. Weerarathna, I. N., Raymond, D. & Luharia, A. Human-robot collaboration for healthcare: a narrative review. Cureus 15, e49210 (2023).

    Google Scholar 

  102. García-Ferrero, I. et al. Medical mT5: an open-source multilingual text-to-text LLM for the medical domain. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.07613 (2024).

  103. Wang, X. et al. Apollo: lightweight multilingual medical LLM towards democratizing medical AI to 6b people. Preprint at arXiv https://doi.org/10.48550/arXiv.2403.03640 (2024).

  104. Pieri, S. et al. BIMediX: bilingual medical mixture of experts LLM. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13253 (2024).

  105. Tang, C., Wang, S., Goldsack, T. & Lin, C. Improving biomedical abstractive summarisation with knowledge aggregation from citation papers. In Proc. Conf. Empir. Methods Nat. Lang. Proces. (eds Bouamor, H. et al.) 606–618 (ACL, 2023).

  106. Guo, Y., Qiu, W., Leroy, G., Wang, S. & Cohen, T. Retrieval augmentation of large language models for lay language generation. J. Biomed. Inform. 149, 104580 (2024).

    Article  Google Scholar 

  107. Chen, Y., Arunasalam, A. & Celik, Z. B. Can large language models provide security & privacy advice? Measuring the ability of LLMs to refute misconceptions. In Proc. 39th Annu. Comput. Secur. Appl. Conf. 366–378 (ACL, 2023).

  108. Karabacak, M. et al. The advent of generative language models in medical education. JMIR Med. Educ. 9, e48163 (2023).

    Article  Google Scholar 

  109. Biri, S. K. et al. Assessing the utilization of large language models in medical education: insights from undergraduate medical students. Cureus 15, e47468 (2023).

    Google Scholar 

  110. Ahn, S. The impending impacts of large language models on medical education. Korean J. Med. Educ. 35, 103–107 (2023).

    Article  Google Scholar 

  111. Peacock, J., Austin, A., Shapiro, M., Battista, A. & Samuel, A. Accelerating medical education with ChatGPT: an implementation guide. MedEdPublish 13, 64 (2023).

    Article  Google Scholar 

  112. Tian, Q. et al. Iteratively refined ChatGPT outperforms clinical mentors in generating high-quality interprofessional education clinical scenarios: a comparative study. Preprint at Res. Sq. https://doi.org/10.21203/rs.3.rs-4637356/v1 (2024).

  113. Veras, M. et al. Usability and efficacy of artificial intelligence chatbots (ChatGPT) for health sciences students: protocol for a crossover randomized controlled trial. JMIR Res. Protoc. 12, e51873 (2023).

    Article  Google Scholar 

  114. Rawte, V., Sheth, A. & Das, A. A survey of hallucination in large foundation models. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.05922 (2023).

  115. Vaidyam, A. N., Wisniewski, H., Halamka, J. D., Kashavan, M. S. & Torous, J. B. Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Can. J. Psychiatry 64, 456–464 (2019).

    Article  Google Scholar 

  116. Stock, A., Schlögl, S. & Groth, A. Tell me, what are you most afraid of? Exploring the effects of agent representation on information disclosure in human–chatbot interaction. In Proc. Int. Conf. Hum. Comput. Interact. (eds Degen, H. et al.) 179–191 (Springer, 2023). https://doi.org/10.1007/978-3-031-35894-4_13.

  117. Liu, J. M. et al. ChatCounselor: a large language models for mental health support. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.15461 (2023).

  118. Robinson, N., Connolly, J., Suddrey, G. & Kavanagh, D. J. A brief wellbeing training session delivered by a humanoid social robot: a pilot randomized controlled trial. Int. J. Soc. Robot. 16, 937–951 (2024).

    Article  Google Scholar 

  119. Lai, T. et al. Psy-LLM: scaling up global mental health psychological services with AI-based large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2307.11991 (2023).

  120. Qiu, H., Li, A., Ma, L. & Lan, Z. PsyChat: a client-centric dialogue system for mental health support. In Proc. 27th Int. Conf. Comput. Support. Coop. Work Des. (CSCWD) 2979–2984 (IEEE, 2024).

  121. US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/study/NCT06346496 (2024).

  122. Xu, X. et al. Mental-LLM: leveraging large language models for mental health prediction via online text data. In Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. Vol. 8 1–32 (ACM, 2024).

  123. Ma, Z., Mei, Y. & Su, Z. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. AMIA Annu. Symp. Proc. 2023, 1105 (2023).

    Google Scholar 

  124. Chung, N. C., Dyer, G. & Brocki, L. Challenges of large language models for mental health counseling. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.13857 (2023).

  125. Ren, Z., Zhan, Y., Yu, B., Ding, L. & Tao, D. Healthcare Copilot: eliciting the power of general LLMs for medical consultation. Preprint at arXiv https://doi.org/10.48550/arXiv.2402.13408 (2024).

  126. Tu, T. et al. Towards conversational diagnostic AI. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.05654 (2024).

  127. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

    Article  Google Scholar 

  128. Chinese Clinical Trial Register. ChiCTR.org.cn https://www.chictr.org.cn/showproj.html?proj=220887 (2024).

  129. Stokel-Walker, C. ChatGPT listed as author on research papers: many scientists disapprove. Nature 613, 620–621 (2023).

    Article  Google Scholar 

  130. Roit, P. et al. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proc. 61st Annu. Meet. Assoc. Comput. Linguist. (eds Rogers, A. et al.) 6252–6272 (ACL, 2023).

  131. Chern, I.-C. et al. Improving factuality of abstractive summarization via contrastive reward learning. In Proc. 3rd Workshop Trustworthy Nat. Lang. Process. (eds Ovalle, A. et al.) 55–60 (ACL, 2023).

  132. Manakul, P., Liusie, A. & Gales, M. J. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. 2023 Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 9004–9017 (ACL, 2023).

  133. Shuster, K., Poff, S., Chen, M., Kiela, D. & Weston, J. Retrieval augmentation reduces hallucination in conversation. In Find. Assoc. Comput. Linguist. (eds Moens, M. et al.) 3784–3803 (ACL, 2021).

  134. Dhuliawala, S. et al. Chain-of-verification reduces hallucination in large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.11495 (2023).

  135. Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. (eds Muresan, S. et al.) 3214–3252 (ACL, 2022).

  136. Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y. & Wen, J.-R. HaluEval: a large-scale hallucination evaluation benchmark for large language models. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 6449–6464 (ACL, 2023).

  137. Liu, F. et al. Auto-encoding knowledge graph for unsupervised medical report generation. In Proc. 35th Int. Conf. Neural Inform. Process. Syst. (eds Ranzato, M. et al.) 16266–16279 (Curran Associates, 2021).

  138. Shumailov, I. et al. Model dementia: generated data makes models forget. Preprint at arXiv https://doi.org/10.48550/arXiv.2305.17493 (2023).

  139. Hoelscher-Obermaier, J., Persson, J., Kran, E., Konstas, I. & Barez, F. Detecting edit failures in large language models: an improved specificity benchmark. In Find. Assoc. Comput. Linguist. (eds Rogers, A. et al.) 11548–11559 (ACL, 2023).

  140. Liu, F. et al. A medical multimodal large language model for future pandemics. npj Digit. Med. 6, 226 (2023).

    Article  Google Scholar 

  141. Yao, Y. et al. Editing large language models: problems, methods, and opportunities. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Bouamor, H. et al.) 10222–10240 (ACL, 2023).

  142. Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Adv. Neural Inform. Process. Syst. (eds Larochelle, H. et al.) 9459–9474 (Curran Associates, 2020).

  143. Ouyang, L. et al. Training language models to follow instructions with human feedback. In Proc. 36th Int. Conf. Neural Inf. Process. Syst. (eds Koyejo, S. et al.) 27730–27744 (Curran Associates, 2022).

  144. Glaese, A. et al. Improving alignment of dialogue agents via targeted human judgements. Preprint at arXiv https://doi.org/10.48550/arXiv.2209.14375 (2022).

  145. Xi, Z. et al. The rise and potential of large language model based agents: a survey. Sci. China Inf. Sci. 68, 121101 (2025).

    Article  Google Scholar 

  146. Liu, H., Sferrazza, C. & Abbeel, P. Chain of hindsight aligns language models with feedback. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.02676 (2023).

  147. Sallam, M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare 11, 887 (2023).

    Article  Google Scholar 

  148. Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 25, bbad493 (2024).

    Article  Google Scholar 

  149. Li, H. et al. Multi-step jailbreaking privacy attacks on ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.05197 (2023).

  150. Wei, A., Haghtalab, N. & Steinhardt, J. Jailbroken: how does LLM safety training fail? In Adv. Neural Inform. Process. Syst. (eds Oh, A. et al.) 80079–80110 (Curran Associates, 2023).

  151. Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digit. Med. 6, 120 (2023).

    Article  Google Scholar 

  152. Derraz, B. et al. New regulatory thinking is needed for AI-based personalised drug and cell therapies in precision oncology. npj Precis. Oncol. 8, 23 (2024).

    Article  Google Scholar 

  153. Mökander, J., Schuett, J., Kirk, H. R. & Floridi, L. Auditing large language models: a three-layered approach. AI Ethics 4, 1085–1115 (2024).

    Article  Google Scholar 

  154. Liu, F. et al. Large language models are poor clinical decision-makers: a comprehensive benchmark. In Proc. Conf. Empir. Methods Nat. Lang. Process. (eds Al-Onaizan, Y. et al.) 13696–13710 (ACL, 2024).

  155. Yin, S. et al. A survey on multimodal large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.13549 (2023).

  156. Huang, H. et al. ChatGPT for shaping the future of dentistry: the potential of multi-modal large language model. Int. J. Oral Sci. 15, 29 (2023).

    Article  Google Scholar 

  157. Li, J., Liu, C., Cheng, S., Arcucci, R. & Hong, S. Frozen language model helps ECG zero-shot learning. Proc. Mach. Learn. Res. 227, 402–415 (2023).

    Google Scholar 

  158. Englhardt, Z. et al. Exploring and characterizing large language models for embedded system development and debugging. In Proc. Extend. Abstr. CHI Conf. Hum. Factor. Comput. Syst. (eds Mueller,F. et al.) 150:1–150:9 (ACM, 2024).

  159. Mello, M. M. & Guha, N. ChatGPT and physicians’ malpractice risk. JAMA Health Forum 4, e231938 (2023).

    Article  Google Scholar 

  160. Mekki, Y. M. & Zughaier, S. M. Teaching artificial intelligence in medicine. Nat. Rev. Bioeng. 2, 450–451 (2024).

  161. Beltagy, I., Lo, K. & Cohan, A. SciBERT: a pretrained language model for scientific text. In Proc. Conf. Empir. Methods Nat. Lang. Process. & 9th Int. Joint Conf. Nat. Lang. Process. (eds Inui, K. et al.) 3615–3620 (ACL, 2019).

  162. Alrowili, S. & Shanker, V. Large biomedical question answering models with ALBERT and ELECTRA. In Conf. Labs Eval. Forum 213–220 (2021).

  163. Gururangan, S. et al. Don’t stop pretraining: adapt language models to domains and tasks. In Proc. 58th Annu. Meet. Assoc. Comput. Linguist. (eds Jurafsky, D. et al.) 8342–8360 (ACL, 2020).

  164. Yasunaga, M., Leskovec, J. & Liang, P. LinkBERT: pretraining language models with document links. In Proc. 60th Annu. Meet. Assoc. Comput. Linguist. Vol. 1 (eds Muresan, S. et al.) 8003–8016 (ACL, 2022).

  165. Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proc. 18th BioNLP Workshop Shared Task (eds Demner-Fushman, D. et al.) 58–65 (ACL, 2019).

  166. Phan, L. N. et al. SciFive: a text-to-text transformer model for biomedical literature. Preprint at arXiv https://doi.org/10.48550/arXiv.2106.03598 (2021).

  167. Lu, Q., Dou, D. & Nguyen, T. ClinicalT5: a generative language model for clinical text. In Find. Assoc. Comput. Linguist. (eds Goldberg, Y. et al.) 5436–5443 (ACL, 2022).

  168. Jin, Q. et al. MedCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 39, btad651 (2023).

  169. Yasunaga, M. et al. Deep bidirectional language-knowledge graph pretraining. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 37309–37323 (Curran Associates, 2022).

  170. Venigalla, A., Frankle, J. & Carbin, M. BioMedLM: a domain-specific large language model for biomedical text. MosaicML https://medium.com/@MosaicML/pubmed-gpt-a-domain-specific-large-language-model-for-biomedical-text-567b18e2b11 (2022).

  171. Gao, W. et al. OphGLM: training an ophthalmology large language-and-vision assistant based on instructions and dialogue. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.12174 (2023).

  172. Chen, Y. et al. BianQue: balancing the questioning and suggestion ability of health LLMs with multi-turn health conversations polished by ChatGPT. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.15896 (2023).

  173. Wang, G., Yang, G., Du, Z., Fan, L. & Li, X. ClinicalGPT: large language models finetuned with diverse medical data and comprehensive evaluation. Preprint at arXiv https://doi.org/10.48550/arXiv.2306.09968 (2023).

  174. Zhang, H. et al. HuatuoGPT, towards taming language model to be a doctor. In Find. Assoc. Comput. Linguist. (eds Bouamor, H. et al.) 10859–10885 (ACL, 2023).

  175. Luo, Y. et al. BioMedGPT: open multimodal generative pre-trained transformer for biomedicine. Preprint at arXiv https://doi.org/10.48550/arXiv.2308.09442 (2023).

  176. Ferber, D. et al. GPT-4 for information retrieval and comparison of medical oncology guidelines. NEJM AI 1, AIcs2300235 (2024).

  177. Chen, Z. et al. MEDITRON-70B: scaling medical pretraining for large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.16079 (2023).

  178. He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering. Preprint at arXiv https://doi.org/10.48550/arXiv.2003.10286 (2020).

  179. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).

  180. Yang, L. et al. Advancing multimodal medical capabilities of Gemini. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.03162 (2024).

  181. Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, 100943 (2024).

  182. Sun, Z., Luo, C., Liu, Z. & Huang, Z. Conversational disease diagnosis via external planner-controlled large language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.04292 (2024).

  183. Dong, H. et al. Automated clinical coding: what, why, and where we are? npj Digit. Med. 5, 159 (2022).

  184. D’Onofrio, G. et al. Emotion recognizing by a robotic solution initiative (EMOTIVE project). Sensors 22, 2861 (2022).

  185. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In Proc. 14th Int. Conf. Neural Inform. Process. Syst. (eds Leen, T. K. et al.) 893–899 (MIT Press, 2000).

  186. Mikolov, T., Karafiát, M., Burget, L., Černocký, J. & Khudanpur, S. Recurrent neural network based language model. In Proc. 11th Annu. Conf. Int. Speech Commun. Assoc. (eds Kobayashi, T. et al.) 1045–1048 (ISCA, 2010).

  187. Sundermeyer, M., Ney, H. & Schlüter, R. From feedforward to recurrent LSTM neural networks for language modeling. In IEEE/ACM Transact. Audio Speech Lang. Process. (ed. Li, H.) 517–529 (IEEE, 2015).

  188. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  189. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Conf. North Am. Chapt. Assoc. Comput. Linguist. (eds Burstein, J. et al.) 4171–4186 (ACL, 2019).

  190. Vaswani, A. et al. Attention is all you need. In Proc. 31st Int. Conf. Neural Inform. Process. Syst. (eds von Luxburg, U. et al.) 6000–6010 (Curran Associates, 2017).

  191. Kaplan, J. et al. Scaling laws for neural language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2001.08361 (2020).

  192. Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Proc. 36th Int. Conf. Neural Inform. Process. Syst. (eds Koyejo, S. et al.) 30016–30030 (Curran Associates, 2022).

  193. He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention. Preprint at arXiv https://doi.org/10.48550/arXiv.2006.03654 (2021).

  194. Radford, A. et al. Language models are unsupervised multitask learners. OpenAI blog 1, 9 (2019).

  195. The Vicuna team. Vicuna: an open-source chatbot impressing GPT-4 with 90% ChatGPT quality. LMSYS ORG https://lmsys.org/blog/2023-03-30-vicuna/ (2023).

  196. Jiang, A. Q. et al. Mistral 7B. Preprint at arXiv https://doi.org/10.48550/arXiv.2310.06825 (2023).

  197. Bai, J. et al. Qwen technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2309.16609 (2023).

  198. Lewis, M. et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. 58th Annu. Meet. Assoc. Comput. Linguist. (eds Jurafsky, D. et al.) 7871–7880 (ACL, 2020).

  199. Tay, Y. et al. UL2: unifying language learning paradigms. Preprint at arXiv https://doi.org/10.48550/arXiv.2205.05131 (2022).

Acknowledgements

This work was supported in part by the Pandemic Sciences Institute at the University of Oxford, the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre, an NIHR Research Professorship, a Royal Academy of Engineering Research Chair, the Wellcome Trust-funded VITAL project, UK Research and Innovation, the Engineering and Physical Sciences Research Council, the InnoHK Hong Kong Centre for Cerebro-cardiovascular Engineering (COCHE), the Clarendon Fund, and the Magdalen Graduate Scholarship.

Author information

Contributions

H.Z. and F.L. conceived and designed the study. H.Z., F.L., B.G., X.Z., J.H. and J.W. conducted the literature review, performed data analysis and drafted the manuscript. All authors contributed to the interpretation and final manuscript preparation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fenglin Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Bioengineering thanks Jakob Kather and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Alpaca: https://github.com/tatsu-lab/stanford_alpaca

Bard: https://gemini.google.com/

ChatGPT: https://chat.openai.com/

Claude-3: https://www.anthropic.com/news/claude-3-family

HealthcareMagic: https://www.healthcaremagic.com/

iCliniq: https://www.icliniq.com/

LLaMA-3: https://github.com/meta-llama/llama3

MedLlama3-v20: https://huggingface.co/ProbeMedicalYonseiMAILab/medllama3-v20

OpenBioLLM: https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B

PubMed: https://pubmed.ncbi.nlm.nih.gov/

PubMed Central (PMC): https://www.ncbi.nlm.nih.gov/pmc/

ShareGPT: https://sharegpt.com/

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, F., Zhou, H., Gu, B. et al. Application of large language models in medicine. Nat Rev Bioeng 3, 445–464 (2025). https://doi.org/10.1038/s44222-025-00279-5

  • DOI: https://doi.org/10.1038/s44222-025-00279-5
