Integrating large language models for enhanced predictive analytics in healthcare

Wang, Yuli; Dai, Yuwei; Wang, Robin; Mehta, Tej; Trivedi, Premal; Vu, Thao; Lin, Cheng Ting; Yang, Li; Jiao, Zhicheng; Kamel, Ihab; Wu, Jing; Bai, Harrison

doi:10.1038/s41746-026-02572-y

Article
Open access
Published: 02 April 2026

Yuli Wang^1,2,3,na1, Yuwei Dai^1,4,na1, Robin Wang^5, Tej Mehta^2, Premal Trivedi^1, Thao Vu^6, Cheng Ting Lin^2, Li Yang^4,7,na2, Zhicheng Jiao^8, Ihab Kamel^1, Jing Wu^9,na2 & Harrison Bai^1,na2

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Physicians frequently confront time-sensitive decisions under uncertain conditions, necessitating reliable tools for forecasting clinical outcomes. Although clinical predictive models have the potential to assist in these critical decisions, their widespread adoption is hindered by complexities in data handling, model development, and integration into clinical workflows. This study introduces a novel framework (Hopkins-LLM) that leverages structured electronic health record (EHR) data to develop and deploy clinical large language models (LLMs) that act as multi-task predictive engines for clinically constrained decision-support tasks with minimal barriers to implementation. Using the LLaMA architecture with 7 billion parameters, our model was pre-trained on a comprehensive corpus and subsequently fine-tuned and tested on a dataset of 42,160 patients within the Johns Hopkins Health System, addressing a spectrum of clinical and operational prediction tasks. We validated our model on 1,329 patients across three diverse external health systems and four key prediction tasks: 30-day all-cause readmissions, 90-day all-cause mortality, 30-day intensive care unit (ICU) admissions, and treatment recommendations. The proposed Hopkins-LLM framework achieved a mean area under the receiver operating characteristic curve (ROC-AUC) of 0.84 [0.82, 0.88], a statistically significant improvement of 0.28 over zero-shot baseline LLMs (p < 0.05). These findings underscore the promise of LLMs as unified, user-friendly clinical prediction systems, adept at reasoning across diverse data sources to enhance decision-making at the point of care.
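
For readers less familiar with the headline metric, the following is a minimal sketch of how a ROC-AUC and an accompanying interval can be computed for a binary outcome such as 30-day readmission. The abstract does not state how the reported interval was obtained; a percentile bootstrap is assumed here purely for illustration. The function names (roc_auc, bootstrap_auc_interval) and the toy labels and scores are hypothetical and are not taken from the Hopkins-LLM code.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive case
    receives a higher score than a random negative case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tied scores count as half a win
    return wins / (len(pos) * len(neg))


def bootstrap_auc_interval(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC (illustrative assumption,
    not necessarily the procedure used in the study)."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap an AUC")
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample has a single class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy example: 1 = outcome occurred, scores are model-predicted risks.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_interval = bootstrap_auc_interval(toy_labels, toy_scores, n_boot=200)
print(toy_auc, toy_interval)
```

In this toy case three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75. A mean over several tasks, as reported in the abstract, would average such per-task values.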

Data availability

The multimodal imaging and EHR datasets used in this study were derived from the Johns Hopkins Health System under Institutional Review Board (IRB) approval and contain protected health information. Due to patient privacy considerations, the raw data cannot be publicly shared. De-identified subsets of the data may be made available upon reasonable request to the corresponding author, subject to completion of a Data Use Agreement (DUA) and approval by the Johns Hopkins IRB.

Code availability

All deep learning models were implemented in Python (version 3.10) using PyTorch (version 2.1.2). The following libraries were used for model development and evaluation: NumPy (1.26.4), pandas (2.2.1), transformers (4.36.1), vLLM (0.2.5), scikit-learn (1.2.1), matplotlib (3.7.1), and SciPy (1.11.3). Custom code modules were used for data input/output pipelines and distributed parallelization across computing nodes and GPUs. All source code supporting this study is available for scientific research and non-commercial use at: https://github.com/YuliWanghust/Hopkins_LLM.
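
As a convenience for anyone trying to reproduce the environment, the sketch below checks whether locally installed packages match the versions listed above. The version numbers come from this statement; the PyPI distribution names (for example "torch" for PyTorch), the helper names, and the decision to ignore local build suffixes such as "+cu121" are assumptions made for illustration and are not part of the released repository.

```python
from importlib import metadata

# Versions as stated in the code availability statement.
# Keys are the assumed PyPI distribution names.
EXPECTED = {
    "torch": "2.1.2",
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "transformers": "4.36.1",
    "vllm": "0.2.5",
    "scikit-learn": "1.2.1",
    "matplotlib": "3.7.1",
    "scipy": "1.11.3",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected=EXPECTED, lookup=installed_version):
    """Map each package whose installed version differs from the expected one
    to a (wanted, found) pair. Local build suffixes after '+' are ignored."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for pkg, (wanted, found) in find_mismatches().items():
        print(f"{pkg}: expected {wanted}, found {found}")
```

An empty result means every listed package is installed at the stated version; packages that are missing are reported with a found value of None.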

Acknowledgements

This work was supported by the American Heart Association (Award No. 25IPA1454088), the National Institutes of Health (Award No. 1R03CA286693-01A1 and Award No. 1R01CA291826-01A1), the U.S. Department of Defense (Award No. HT94252510807), and the National Science Foundation (Award No. 2545071).

Author information

These authors contributed equally: Yuli Wang, Yuwei Dai.
These authors jointly supervised this work: Li Yang, Jing Wu, Harrison Bai.

Authors and Affiliations

Department of Radiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Yuli Wang, Yuwei Dai, Premal Trivedi, Ihab Kamel & Harrison Bai
Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, MD, USA
Yuli Wang, Tej Mehta & Cheng Ting Lin
Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, USA
Yuli Wang
Department of Neurology, Second Xiangya Hospital, Central South University, Changsha, Hunan, China
Yuwei Dai & Li Yang
Department of Radiology, Stanford University School of Medicine, Stanford, CA, USA
Robin Wang
Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Thao Vu
Clinical Medical Research Center for Stroke Prevention and Treatment of Hunan Province, The Second Xiangya Hospital, Central South University, Changsha, China
Li Yang
Department of Diagnostic Imaging, Brown University Health, Providence, RI, USA
Zhicheng Jiao
Department of Radiology, Second Xiangya Hospital, Central South University, Changsha, Hunan, China
Jing Wu

Contributions

Conceptualization: Y.W. and H.B. Methodology: Y.W. and Y.D. Investigation: R.W., T.M., P.T., and T.V. Visualization: C.L., Z.J., I.K., and J.W. Supervision: L.Y. and H.B. Writing, original draft: Y.W. and Y.D. Writing, review, and editing: H.B.

Corresponding authors

Correspondence to Li Yang, Jing Wu or Harrison Bai.

Ethics declarations

Competing interests

Harrison Bai, MD, serves as an Associate Editor of npj Digital Medicine. He was not involved in the peer-review process or editorial decision-making for this manuscript.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Y., Dai, Y., Wang, R. et al. Integrating large language models for enhanced predictive analytics in healthcare. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02572-y

Received: 15 December 2025
Accepted: 13 March 2026
Published: 02 April 2026
DOI: https://doi.org/10.1038/s41746-026-02572-y