Abstract
Discharge summaries are critical for patient care continuity, clinical decision-making, and legal documentation, yet their creation is labor-intensive. Clinicians must manually integrate diverse data from multiple sources under time constraints, often leading to delays, inconsistencies, and potential omissions. This study introduces a novel framework that automates discharge summary generation using advanced natural language processing (NLP) techniques, aiming to reduce clinician workload while ensuring accurate, complete, and standardized documentation. We combine the Weight-Decomposed Low-Rank Adaptation (DoRA) fine-tuning method with a novel self-evaluation mechanism to enhance large language models (LLMs) for medical text generation. DoRA efficiently adapts pre-trained LLMs to the specialized medical domain, outperforming traditional methods such as LoRA and QLoRA with improvements in BERTScore and reductions in Perplexity across all evaluated models. The self-evaluation mechanism, inspired by cognitive psychology, iteratively re-feeds generated summaries together with segmented clinical data into the model, allowing it to systematically detect and correct omissions in each data segment and thereby ensuring that the outputs accurately and comprehensively represent the original input. This approach was rigorously compared against few-shot prompting and Chain-of-Thought (CoT) methods. Extensive experiments show that self-evaluation improves BERTScore by 6.9% and 4.1% and ROUGE-L by 69.6% and 0.4% relative to the few-shot and CoT baselines, respectively, while qualitative metrics show consistent gains in accuracy and completeness. Our results demonstrate substantial improvements in the quality and consistency of generated discharge summaries while reducing the time required for their creation.
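The DoRA reparameterization mentioned above decomposes a pre-trained weight into a learnable magnitude and a low-rank-updated direction. A minimal NumPy sketch of that decomposition (dimensions and initializations are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-trained weight matrix (frozen during fine-tuning).
d_out, d_in, r = 8, 6, 2
W0 = rng.normal(size=(d_out, d_in))

# Low-rank update, as in LoRA: only B and A are trained.
B = np.zeros((d_out, r))          # zero-initialized so training starts at W0
A = rng.normal(size=(r, d_in))

# DoRA additionally trains a per-column magnitude vector m,
# initialized from the column norms of W0.
m = np.linalg.norm(W0, axis=0)    # shape (d_in,)

# Merged weight: magnitude times the column-normalized direction of W0 + BA.
V = W0 + B @ A
W_merged = m * (V / np.linalg.norm(V, axis=0))

# With B = 0, the merged weight reproduces W0 exactly.
assert np.allclose(W_merged, W0)
```

During fine-tuning only m, B, and A receive gradients, which is what keeps the adaptation parameter-efficient while allowing magnitude and direction to be adjusted separately.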
This research underscores the potential of AI-driven tools in healthcare documentation. The findings indicate promising prospects for automated medical documentation that adheres to high standards of accuracy and relevance.
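The self-evaluation loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for a call to the fine-tuned model, and the segment names and prompts are assumptions.

```python
def generate(prompt: str) -> str:
    # Placeholder for the fine-tuned LLM call (e.g. a request to a local
    # inference server). Here it simply echoes a prefix for demonstration.
    return "SUMMARY: " + prompt[:60]

def self_evaluate(segments: dict[str, str], max_rounds: int = 3) -> str:
    """Draft a summary, then repeatedly re-feed it together with each
    clinical data segment so the model can detect and fill omissions."""
    summary = generate("Write a discharge summary from: "
                       + " ".join(segments.values()))
    for _ in range(max_rounds):
        revised = summary
        for name, segment in segments.items():
            # Re-feed the current summary with one data segment and ask
            # the model to correct any omissions from that segment.
            prompt = (f"Check this summary against the '{name}' data and "
                      f"add anything missing.\n"
                      f"DATA: {segment}\nSUMMARY: {revised}")
            revised = generate(prompt)
        if revised == summary:      # no further corrections: stop early
            break
        summary = revised
    return summary

segments = {"diagnoses": "Type 2 diabetes; hypertension",
            "medications": "Metformin 500 mg twice daily"}
result = self_evaluate(segments)
```

Checking the summary against one segment at a time, rather than the full record at once, is what lets omissions be localized to a specific data source.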
Data availability
Due to ethical restrictions, the raw data cannot be made publicly available. However, de-identified data may be obtained from the first author upon reasonable request.
References
Wilson, S., Ruscoe, W., Chapman, M. & Miller, R. General practitioner–hospital communications: A review of discharge summaries. J. Qual. Clin. Pract. 21, 104–108 (2001).
Patel, S. B. & Lam, K. ChatGPT: The future of discharge summaries? Lancet Digit. Health 5(3), e107–e108 (2023).
van Walraven, C. & Rokosh, E. What is necessary for high-quality discharge summaries? Am. J. Med. Qual. 14, 160–169 (1999).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Brown, T. B. et al. Language models are few-shot learners. Preprint at https://arxiv.org/abs/2005.14165 (2020).
Floridi, L. & Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Mind. Mach. 30, 681–694 (2020).
Zhou, R., Chen, L. & Yu, K. Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 9340–9351 (2024).
He, Z. et al. Quality of answers of generative large language models versus peer users for interpreting laboratory test results for lay patients: Evaluation study. J. Med. Internet Res. 26, e56655 (2024).
Wiest, I. C. et al. Anonymizing medical documents with local, privacy-preserving large language models: The LLM-Anonymizer. Preprint at medRxiv (2024).
Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. Preprint at https://arxiv.org/abs/2106.09685 (2021).
Liu, S.-Y. et al. DoRA: Weight-decomposed low-rank adaptation. Preprint at https://arxiv.org/abs/2402.09353 (2024).
Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).
Jung, H. et al. Enhancing clinical efficiency through LLM: Discharge note generation for cardiac patients. Preprint at https://arxiv.org/abs/2404.05144 (2024).
Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).
Bai, J. et al. Qwen technical report. Preprint at https://arxiv.org/abs/2309.16609 (2023).
Yang, A. et al. Qwen2 technical report. Preprint at https://arxiv.org/abs/2407.10671 (2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural. Inf. Process. Syst. 35, 24824–24837 (2022).
Tang, A. Q., Zhang, X. & Dinh, M. N. Ignition Innovators at "Discharge Me!": Chain-of-thought instruction finetuning large language models for discharge summaries. Preprint at https://arxiv.org/abs/2407.17636 (2024).
Brown, H., Lin, L., Kawaguchi, K. & Shieh, M. Self-evaluation as a defense against adversarial attacks on LLMs. Preprint at https://arxiv.org/abs/2407.03234 (2024).
McAleese, N., Pokorny, R. M., Uribe, J. F. C., Nitishinskaya, E., Trebacz, M. & Leike, J. LLM critics help catch LLM bugs. Preprint at https://arxiv.org/abs/2407.00215 (2024).
Shinn, N. et al. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural. Inf. Process. Syst. 36, 8634–8652 (2023).
Manakul, P., Liusie, A. & Gales, M. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9004–9017 (2023).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Neira, R. A. Q., de Vries, G.-J., Caffarel, J. & Stretton, E. Extraction of data from a hospital information system to perform process mining. In MEDINFO 2017: Precision Healthcare Through Informatics, 554–558 (IOS Press, 2017).
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
Wenbin Li conceptualized the research idea and wrote the main manuscript text. Hui Feng and Minpeng Xu conducted the literature review and contributed to drafting and editing portions of the manuscript. Chao Hu and Longlong Cheng compiled and analyzed the latest developments in machine learning techniques, wrote the technical methods section, and assisted in final manuscript revisions. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, W., Feng, H., Hu, C. et al. Accurate discharge summary generation using fine tuned large language models with self evaluation. Sci Rep (2026). https://doi.org/10.1038/s41598-026-35552-z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-35552-z