
  • Article

A domain-adapted large language model to support clinicians in psychiatric clinical practice

Abstract

Mental disorders affect nearly one billion individuals worldwide, yet professional psychiatric care remains constrained by workforce shortages and experience-dependent decision-making. Despite recent advances in large language models (LLMs), current applications in mental health are primarily patient-oriented and lack alignment with real-world psychiatric clinical workflows. Here we present PsychFound, a domain-adapted and clinician-oriented LLM developed to support psychiatric clinical practice. Built through a three-phase framework using expert-curated psychiatric corpora and 64,588 real-world Chinese electronic health records, PsychFound integrates psychiatric professional knowledge, clinical reasoning capabilities and adaptation to the full spectrum of psychiatric clinical tasks across diagnosis, treatment planning and longitudinal management in Chinese clinical settings. In retrospective evaluations spanning three professional knowledge assessments and five clinical task benchmarks, the 7B-parameter PsychFound delivered the top overall performance among 22 LLMs. In a real-world, two-arm prospective study, resident psychiatrists assisted by PsychFound demonstrated higher consultation quality, higher diagnostic accuracy, more appropriate medication selection and reduced documentation time (all P < 0.01). A reader study with 60 psychiatrists (20 residents, 20 attendings and 20 seniors) showed that PsychFound’s clinical reasoning performance matched that of attending psychiatrists. These findings demonstrate that PsychFound provides an interpretable, expert-level decision support tool capable of improving consistency, efficiency and standardization in psychiatric clinical care.


Fig. 1: Overview of PsychFound: data foundations, model development pipeline and clinical integration.
Fig. 2: Comprehensive evaluation framework for PsychFound.
Fig. 3: Quantitative benchmarking of PsychFound on psychiatric expertise and clinical tasks.
Fig. 4: Performance benchmarking of PsychFound against representative LLMs across clinical tasks.
Fig. 5: Ablation study of PsychFound training strategies and their impact on diagnostic performance.
Fig. 6: Real-world prospective evaluation and reader-study assessment of PsychFound’s clinical utility.


Data availability

This study utilized two datasets for model development: PsychCorpus and PsychClinical. PsychCorpus consists of publicly available psychiatric texts and is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50). PsychClinical comprises de-identified real-world EHRs from multiple psychiatric centres and cannot be publicly released due to privacy and data-governance restrictions. Researchers may request controlled access from the corresponding author, subject to institutional and regulatory approval. For evaluation, we used publicly accessible domain-specific test sets and de-identified clinical cases from PsychBench, available via GitHub at https://github.com/wrx33/PsychBench. The study also incorporated a real-world prospective cohort. Due to ethical and data-governance constraints, the de-identified prospective study data are not publicly available. Researchers may request access from the corresponding author. All requests will be reviewed in accordance with the institution’s policies and data usage agreements, and responses will be provided within 4 weeks. Source data are provided with this paper.

Code availability

The code for scientific research and non-commercial use is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50).

References

  1. World Mental Health Report: Transforming Mental Health For All (World Health Organization, 2022).

  2. Huang, Y. et al. Prevalence of mental disorders in China: a cross-sectional epidemiological study. Lancet Psychiatry 6, 211–224 (2019).

  3. Mental Health Atlas 2020: Review of the Eastern Mediterranean Region (World Health Organization, 2022).

  4. Chen, R., Zhang, W. & Wu, X. Mental health policy and implementation from 2009 to 2020 in China. SSM - Ment. Health 4, 100244 (2023).

  5. Stein, D. J. et al. Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414 (2022).

  6. Feuerriegel, S. et al. Using natural language processing to analyse text data in behavioural science. Nat. Rev. Psychol. 4, 96–111 (2025).

  7. Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP Digit. Psychiatry Neurosci. 2, 8 (2024).

  8. Mukherjee, S. S. et al. Natural language processing-based quantification of the mental state of psychiatric patients. Comput. Psychiatry 4, 76–106 (2020).

  9. Jacob, K. Patient experience and psychiatric discourse. The Psychiatrist 36, 414–417 (2012).

  10. Murad, M. H. et al. Measuring documentation burden in healthcare. J. Gen. Intern. Med. 39, 2837–2848 (2024).

  11. Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182, 564–566 (2022).

  12. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  13. Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat. Med. 30, 2886–2896 (2024).

  14. Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).

  15. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

  16. Lamichhane, B. Evaluation of ChatGPT for NLP-based mental health applications. Preprint at https://arxiv.org/abs/2303.15727 (2023).

  17. Amin, M., Cambria, E. & Schuller, B. Will affective computing emerge from foundation models and general AI? A first evaluation on ChatGPT. Preprint at http://arxiv.org/abs/2303.03186 (2023).

  18. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th International Conference on Neural Information Processing Systems 24824–24837 (2022).

  19. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).

  20. Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).

  21. Wang, N. et al. RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024 14743–14777 (Association for Computational Linguistics, 2024).

  22. Yang, Q. et al. PsychoGAT: a novel psychological measurement paradigm through interactive fiction games with LLM agents. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers 14470–14505 (Association for Computational Linguistics, 2024).

  23. Rathje, S. et al. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl Acad. Sci. USA 121, e2308950121 (2024).

  24. She, D., Zhang, C., Yao, X., Gao, Y. & Jin, Z. MindChat-R0: a large language model for emotionally supportive dialogue through reinforcement learning. In Companion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing 1209–1216 (Association for Computing Machinery, 2025).

  25. Team, E. EmoLLM: reinventing mental health support with large language models. Preprint at https://arxiv.org/abs/2406.16442 (2024).

  26. Chen, Y. et al. SoulChat: improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023 1170–1183 (Association for Computational Linguistics, 2023).

  27. Hu, J. et al. PsycoLLM: enhancing LLM for psychological understanding and evaluation. IEEE Trans. Comput. Soc. Syst. 12, 539–551 (2024).

  28. Hiemke, C. et al. Consensus guidelines for therapeutic drug monitoring in neuropsychopharmacology: update 2017. Pharmacopsychiatry 51, 9–62 (2018).

  29. Wicha, S. G. et al. From therapeutic drug monitoring to model-informed precision dosing for antibiotics. Clin. Pharmacol. Ther. 109, 928–941 (2021).

  30. Relling, M. & Klein, T. CPIC: clinical pharmacogenetics implementation consortium of the pharmacogenomics research network. Clin. Pharmacol. Ther. 89, 464–467 (2011).

  31. Hicks, J. K. et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline for CYP2D6 and CYP2C19 genotypes and dosing of selective serotonin reuptake inhibitors. Clin. Pharmacol. Ther. 98, 127–134 (2015).

  32. Liu, S. et al. PsychBench: a comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice. Preprint at https://arxiv.org/abs/2503.01903 (2025).

  33. Liu, J. et al. Benchmarking large language models on CMExam—a comprehensive Chinese medical exam dataset. In Proc. 37th International Conference on Neural Information Processing Systems 52430–52452 (2023).

  34. Sun, Y. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).

  35. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).

  36. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).

  37. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) https://openreview.net/pdf?id=SkeHuCVFDr (2020).

  38. International Statistical Classification of Diseases and Related Health Problems: Alphabetical Index (World Health Organization, 2004).

  39. Yang, A. et al. Qwen2.5-1M technical report. Preprint at https://arxiv.org/abs/2501.15383 (2025).

  40. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  41. Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).

  42. Zhang, T. et al. Prevalence of personality disorders using two diagnostic systems in psychiatric outpatients in Shanghai, China: a comparison of uni-axial and multi-axial formulation. Soc. Psychiatry Psychiatr. Epidemiol. 47, 1409–1417 (2012).

  43. Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).

  44. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).

  45. Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).

  46. Thieme, A., Belgrave, D. & Doherty, G. Machine learning in mental health: a systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Trans. Comput. Hum. Interact. 27, 1–53 (2020).

  47. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  48. Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. Preprint at https://arxiv.org/abs/2402.03300 (2024).

  49. Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention. In Proc. 29th Symposium on Operating Systems Principles 611–626 (Association for Computing Machinery, 2023).

  50. Wang, R. et al. PsychFound: PsychFound code and dataset. Zenodo https://doi.org/10.5281/zenodo.17768150 (2025).


Acknowledgements

PsychFound is an in-depth extension of the PsychGPT research, jointly developed by Shanghai Jiao Tong University and Beijing Anding Hospital, Capital Medical University. We thank the Chinese Psychiatric Innovation Alliance for providing data support. We are grateful to the Expert Review Committee of Beijing Anding Hospital, Capital Medical University, for their rigorous review and validation of clinical data. We also acknowledge the National Clinical Research Centre for Mental Disorders for their guidance on study design, and the Information Technology Center of Beijing Anding Hospital, Capital Medical University, for providing computational resources. We extend our sincere appreciation to all psychiatrists who participated in the prospective cohort study and the reader evaluation study. This study was funded by the Brain Science and Brain-like Intelligence Technology-National Science and Technology Major Project (grant no. 2021ZD0200600) (G.W.), General Program of National Natural Science Foundation of China (grant no. 62576210) (C.J.), Natural Science Foundation of Shanghai (grant no. 25ZR1401179) (C.J.) and Capital’s Funds for Health Improvement and Research (grant no. CFH 2024-2-1174) (L.Z.).

Author information

Authors and Affiliations

Contributions

R.W., S.L., L.Z. and X.Z. contributed equally to this work. C.J. and G.W. are the corresponding authors. Specifically, R.W., S.L., G.W. and C.J. all made contributions to the conception and design of the work. R.W., S.L., L.Z. and X.Z. further performed acquisition, analysis and interpretation of data for the work. R.W. and S.L. performed the development and evaluation of PsychFound. J.H., X.Y. and Y.W. organized the prospective study and the reader study. L.Z., X.Z., Z.Y. and R.Y. performed analysis of the evaluation results. H.W. assisted in data collection, computing resource allocation and model development. In writing, R.W. and S.L. drafted the work. G.W. and C.J. reviewed it critically for important intellectual content. All authors reviewed the manuscript and provided meaningful feedback. All authors approve of the version to be published and agree to be accountable for all aspects of the work to ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding authors

Correspondence to Gang Wang or Cheng Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Joseph Kambeitz and Jiyeong Kim for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.

The results plotted in the radar chart have undergone normalization processing.

Source data

Extended Data Fig. 2 Error pattern distributions across the five core PsychBench tasks.

Bar plots summarize the major categories of model errors for each task: Task 1 (clinical information summarization), where most errors arose from onset-pattern misjudgment; Tasks 2 & 3 (diagnosis and differential diagnosis), dominated by inaccuracies in associated-symptom assessment; Task 4 (medication recommendation), where overly conservative treatment decisions represented the majority of errors; and Task 5 (long-term course management), where limitations were primarily attributable to remote-information and detailed-information retention. Percentages represent the proportion of each error type within the task-specific error set.

Source data

Extended Data Fig. 3 Diagnostic category distribution and accuracy of PsychFound on original English psychiatric cases.

The bar charts summarize case counts and diagnostic accuracy across ICD-10 categories: F0 (Organic and symptomatic mental disorders), F1 (Mental and behavioural disorders due to psychoactive substance use), F2 (Schizophrenia, schizotypal, and delusional disorders), F3 (Mood [affective] disorders), F4 (Neurotic, stress-related, and somatoform disorders), F5 (Behavioural syndromes associated with physiological disturbance and physical factors), F6 (Disorders of adult personality and behaviour), F7 (Mental retardation), F8 (Disorders of psychological development), and F9 (Behavioural and emotional disorders with onset usually occurring in childhood and adolescence).

Source data

Extended Data Fig. 4 PsychFound’s sensitivity to incremental perturbations in clinical information.

a, A real-world bipolar disorder case with psychotic features was used to examine the model’s responsiveness to stepwise removal of key clinical elements. b, With complete information, including manic and depressive episodes with psychotic symptoms, PsychFound correctly identified F31.5. c, Removing psychotic symptoms led the model to adjust the diagnosis to F31.4. d, Removing both psychotic symptoms and manic history shifted the output to F33.3, consistent with recurrent depressive disorder. e, When only a single depressive episode remained, the model updated the diagnosis to F32.3.
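The stepwise information-removal design above generalizes to any diagnostic model: fix a case, delete one clinical element per step, and record the diagnosis after each deletion. A minimal sketch follows; `diagnose` is a hypothetical rule-based stand-in (not the PsychFound API, which is not specified here) that encodes only the ICD-10 outcomes reported in this caption, and the element names are illustrative.

```python
def diagnose(case_elements: set[str]) -> str:
    """Toy stand-in for a diagnostic model: maps the clinical elements
    still present in the case to an ICD-10 code (illustrative rules only)."""
    if {"manic_episode", "depressive_episode", "psychotic_symptoms"} <= case_elements:
        return "F31.5"  # bipolar disorder, severe depression with psychotic symptoms
    if {"manic_episode", "depressive_episode"} <= case_elements:
        return "F31.4"  # bipolar disorder, severe depression without psychotic symptoms
    if "recurrent_depression" in case_elements:
        return "F33.3"  # recurrent depressive disorder
    return "F32.3"      # single severe depressive episode


def perturbation_trace(full_case, removal_steps, model=diagnose):
    """Remove elements step by step and record the diagnosis after each step."""
    case = set(full_case)
    trace = [model(case)]          # diagnosis on complete information
    for removed in removal_steps:  # each step deletes one or more elements
        case -= set(removed)
        trace.append(model(case))
    return trace


# Reproduce the panel sequence b -> c -> d -> e:
trace = perturbation_trace(
    {"manic_episode", "depressive_episode", "psychotic_symptoms", "recurrent_depression"},
    [["psychotic_symptoms"], ["manic_episode"], ["recurrent_depression"]],
)
print(trace)  # ['F31.5', 'F31.4', 'F33.3', 'F32.3']
```

Swapping `diagnose` for a real model call turns the same harness into a sensitivity probe over any case.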

Extended Data Fig. 5 The prospective study design.

RP represents resident psychiatrist. HAMD represents the Hamilton Depression Rating Scale. BPRS represents the Brief Psychiatric Rating Scale. SRAS represents the Suicide Risk Assessment Scale. CGI represents the Clinical Global Impression Scale.

Extended Data Table 1 Performance comparison of PsychFound and other LLMs in domain-specific knowledge test
Extended Data Table 2 Post-study questionnaire assessment by resident psychiatrists using PsychFound
Extended Data Table 3 Characteristics summary of PsychClinical dataset
Extended Data Table 4 Comparison of demographic and clinical characteristics between inpatients in the control group and experimental group

Supplementary information

Supplementary Information

Supplementary Section A: design of psychiatry-specific function calling set. Supplementary Section B: Supplementary Figs. 1–31 and Supplementary Tables 1–10. Supplementary Section C: study protocol of real-world prospective study.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 3

Quantitative evaluation results of PsychFound and comparator LLMs on knowledge test and clinical tasks.

Source Data Fig. 4

Comparative evaluation results of PsychFound and representative LLMs across clinical tasks.

Source Data Fig. 5

Ablation results of different training strategies on diagnostic performance.

Source Data Fig. 6

Results of performance of resident psychiatrists in prospective study.

Source Data Extended Data Fig. 1

Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.

Source Data Extended Data Fig. 2

Statistics of error analysis on five psychiatric clinical tasks.

Source Data Extended Data Fig. 3

Results of diagnosis accuracy on original English psychiatric cases.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, R., Liu, S., Zhang, L. et al. A domain-adapted large language model to support clinicians in psychiatric clinical practice. Nat Mach Intell (2026). https://doi.org/10.1038/s42256-026-01224-w


  • DOI: https://doi.org/10.1038/s42256-026-01224-w
