Abstract
Physicians frequently face time-sensitive decisions under uncertainty and need reliable tools for forecasting clinical outcomes. Although clinical predictive models could assist in these decisions, their widespread adoption is hindered by complexities in data handling, model development, and integration into clinical workflows. This study introduces a framework (Hopkins-LLM) that uses structured electronic health record (EHR) data to develop and deploy clinical large language models (LLMs) as multi-task predictive engines that support clinically constrained decision-support tasks with minimal barriers to implementation. Built on the LLaMA architecture with 7 billion parameters, our model was pre-trained on a comprehensive corpus and subsequently fine-tuned and tested on a dataset of 42,160 patients within the Johns Hopkins Health System, addressing a spectrum of clinical and operational prediction tasks. We validated the model across three diverse external health systems, involving 1,329 patients, on four key prediction tasks: 30-day all-cause readmission, 90-day all-cause mortality, 30-day intensive care unit (ICU) admission, and treatment recommendation. The proposed Hopkins-LLM framework achieved a mean area under the receiver operating characteristic curve (ROC-AUC) of 0.84 [0.82, 0.88], a statistically significant improvement of 0.28 over zero-shot baseline LLMs (p < 0.05). These findings underscore the promise of LLMs as unified, user-friendly clinical prediction systems that can reason across diverse data sources to enhance decision-making at the point of care.
Data availability
The multimodal imaging and EHRs datasets used in this study were derived from the Johns Hopkins Health System under Institutional Review Board (IRB) approval and contain protected health information. Due to patient privacy considerations, the raw data cannot be publicly shared. De-identified subsets of the data may be made available upon reasonable request to the corresponding author, subject to completion of a Data Use Agreement (DUA) and approval by the Johns Hopkins IRB.
Code availability
All deep learning models were implemented in Python (version 3.10) using PyTorch (version 2.1.2). The following libraries were used for model development and evaluation: NumPy (1.26.4), pandas (2.2.1), transformers (4.36.1), vLLM (0.2.5), scikit-learn (1.2.1), matplotlib (3.7.1), and SciPy (1.11.3). Custom code modules were used for data input/output pipelines and distributed parallelization across computing nodes and GPUs. All source code supporting this study is available for scientific research and non-commercial use at: https://github.com/YuliWanghust/Hopkins_LLM.
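For readers who want to reproduce the software environment, the following is a minimal sketch that checks a local installation against the versions listed above. It is not taken from the repository: the helper names (`check_environment`, `requirements_text`) are hypothetical, and the PyPI distribution names (for example `torch` for PyTorch) are assumptions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the Code availability statement. The PyPI
# distribution names used as keys (e.g. "torch" for PyTorch) are assumptions.
PINNED = {
    "torch": "2.1.2",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "transformers": "4.36.1",
    "vllm": "0.2.5",
    "scikit-learn": "1.2.1",
    "matplotlib": "3.7.1",
    "scipy": "1.11.3",
}


def check_environment(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "2.1.2+cu121" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


def requirements_text(pinned=PINNED):
    """Render the pinned versions in pip requirements format."""
    return "\n".join(f"{name}=={ver}" for name, ver in pinned.items())


if __name__ == "__main__":
    for name, (expected, installed) in check_environment().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running the script prints one line per package whose installed version differs from the listed one and prints nothing when the environment matches. The output of `requirements_text()` can be saved to a file and passed to `pip install -r` to build a matching environment under Python 3.10.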
Acknowledgements
This work was supported by the American Heart Association (Award No. 25IPA1454088), the National Institutes of Health (Award No. 1R03CA286693-01A1 and Award No. 1R01CA291826-01A1), the U.S. Department of Defense (Award No. HT94252510807), and the National Science Foundation (Award No. 2545071).
Author information
Contributions
Conceptualization: Y.W. and H.B. Methodology: Y.W. and Y.D. Investigation: R.W., T.M., P.T., and T.V. Visualization: C.L., Z.J., I.K., and J.W. Supervision: L.Y. and H.B. Writing, original draft: Y.W. and Y.D. Writing, review, and editing: H.B.
Ethics declarations
Competing interests
Harrison Bai, MD, serves as an Associate Editor of npj Digital Medicine. He was not involved in the peer-review process or editorial decision-making for this manuscript.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Y., Dai, Y., Wang, R. et al. Integrating large language models for enhanced predictive analytics in healthcare. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02572-y