Abstract
Mental disorders affect nearly one billion individuals worldwide, yet professional psychiatric care remains constrained by workforce shortages and experience-dependent decision-making. Despite recent advances in large language models (LLMs), current applications in mental health are primarily patient-oriented and lack alignment with real-world psychiatric clinical workflows. Here we present PsychFound, a domain-adapted and clinician-oriented LLM developed to support psychiatric clinical practice. Developed through a three-phase framework using expert-curated psychiatric corpora and 64,588 Chinese real-world electronic health records, PsychFound integrates psychiatric professional knowledge, clinical reasoning capabilities and adaptation to the full spectrum of psychiatric clinical tasks across diagnosis, treatment planning and longitudinal management in Chinese clinical settings. In retrospective evaluations spanning three professional knowledge assessments and five clinical task benchmarks, the 7B-parameter PsychFound delivered the top overall performance among 22 LLMs. In a real-world, two-arm prospective study, resident psychiatrists assisted by PsychFound demonstrated higher consultation quality, higher diagnostic accuracy, more appropriate medication selection and reduced documentation time (all P < 0.01). A reader study with 60 psychiatrists (20 residents, 20 attendings and 20 seniors) showed that PsychFound’s clinical reasoning performance matched that of attending psychiatrists. These findings demonstrate that PsychFound provides an interpretable, expert-level decision support tool capable of improving consistency, efficiency and standardization in psychiatric clinical care.
Data availability
This study utilized two datasets for model development: PsychCorpus and PsychClinical. PsychCorpus consists of publicly available psychiatric texts and is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50). PsychClinical comprises de-identified real-world EHRs from multiple psychiatric centres and cannot be publicly released due to privacy and data-governance restrictions. Researchers may request controlled access from the corresponding author, subject to institutional and regulatory approval. For evaluation, we used publicly accessible domain-specific test sets and de-identified clinical cases from PsychBench, available via GitHub at https://github.com/wrx33/PsychBench. The study also incorporated a real-world prospective cohort. Due to ethical and data-governance constraints, the de-identified prospective study data are not publicly available. Researchers may request access from the corresponding author. All requests will be reviewed in accordance with the institution’s policies and data usage agreements, and responses will be provided within 4 weeks. Source data are provided with this paper.
Code availability
The code for scientific research and non-commercial use is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50).
References
World Mental Health Report: Transforming Mental Health For All (World Health Organization, 2022).
Huang, Y. et al. Prevalence of mental disorders in China: a cross-sectional epidemiological study. Lancet Psychiatry 6, 211–224 (2019).
Mental Health Atlas 2020: Review of the Eastern Mediterranean Region (World Health Organization, 2022).
Chen, R., Zhang, W. & Wu, X. Mental health policy and implementation from 2009 to 2020 in China. SSM - Ment. Health 4, 100244 (2023).
Stein, D. J. et al. Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414 (2022).
Feuerriegel, S. et al. Using natural language processing to analyse text data in behavioural science. Nat. Rev. Psychol. 4, 96–111 (2025).
Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP Digit. Psychiatry Neurosci. 2, 8 (2024).
Mukherjee, S. S. et al. Natural language processing-based quantification of the mental state of psychiatric patients. Comput. Psychiatry 4, 76–106 (2020).
Jacob, K. Patient experience and psychiatric discourse. The Psychiatrist 36, 414–417 (2012).
Murad, M. H. et al. Measuring documentation burden in healthcare. J. Gen. Intern. Med. 39, 2837–2848 (2024).
Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182, 564–566 (2022).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat. Med. 30, 2886–2896 (2024).
Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Lamichhane, B. Evaluation of ChatGPT for NLP-based mental health applications. Preprint at https://arxiv.org/abs/2303.15727 (2023).
Amin, M., Cambria, E. & Schuller, B. Will affective computing emerge from foundation models and general AI? A first evaluation on ChatGPT. Preprint at http://arxiv.org/abs/2303.03186 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th International Conference on Neural Information Processing Systems 24824–24837 (2022).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).
Wang, N. et al. RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024 14743–14777 (Association for Computational Linguistics, 2024).
Yang, Q. et al. PsychoGAT: a novel psychological measurement paradigm through interactive fiction games with LLM agents. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers 14470–14505 (Association for Computational Linguistics, 2024).
Rathje, S. et al. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl Acad. Sci. USA 121, e2308950121 (2024).
She, D., Zhang, C., Yao, X., Gao, Y. & Jin, Z. MindChat-R0: a large language model for emotionally supportive dialogue through reinforcement learning. In Companion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing 1209–1216 (Association for Computing Machinery, 2025).
Team, E. EmoLLM: reinventing mental health support with large language models. Preprint at https://arxiv.org/abs/2406.16442 (2024).
Chen, Y. et al. SoulChat: improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023 1170–1183 (Association for Computational Linguistics, 2023).
Hu, J. et al. PsycoLLM: enhancing LLM for psychological understanding and evaluation. IEEE Trans. Comput. Soc. Syst. 12, 539–551 (2024).
Hiemke, C. et al. Consensus guidelines for therapeutic drug monitoring in neuropsychopharmacology: update 2017. Pharmacopsychiatry 51, 9–62 (2018).
Wicha, S. G. et al. From therapeutic drug monitoring to model-informed precision dosing for antibiotics. Clin. Pharmacol. Ther. 109, 928–941 (2021).
Relling, M. & Klein, T. CPIC: clinical pharmacogenetics implementation consortium of the pharmacogenomics research network. Clin. Pharmacol. Ther. 89, 464–467 (2011).
Hicks, J. K. et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline for CYP2D6 and CYP2C19 genotypes and dosing of selective serotonin reuptake inhibitors. Clin. Pharmacol. Ther. 98, 127–134 (2015).
Liu, S. et al. PsychBench: a comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice. Preprint at https://arxiv.org/abs/2503.01903 (2025).
Liu, J. et al. Benchmarking large language models on CMExam—a comprehensive Chinese medical exam dataset. In Proc. 37th International Conference on Neural Information Processing Systems 52430–52452 (2023).
Sun, Y. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) https://openreview.net/pdf?id=SkeHuCVFDr (2020).
International Statistical Classification of Diseases and Related Health Problems: Alphabetical Index (World Health Organization, 2004).
Yang, A. et al. Qwen2.5-1M technical report. Preprint at https://arxiv.org/abs/2501.15383 (2025).
Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).
Zhang, T. et al. Prevalence of personality disorders using two diagnostic systems in psychiatric outpatients in Shanghai, China: a comparison of uni-axial and multi-axial formulation. Soc. Psychiatry Psychiatr. Epidemiol. 47, 1409–1417 (2012).
Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).
Thieme, A., Belgrave, D. & Doherty, G. Machine learning in mental health: a systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Trans. Comput. Hum. Interact. 27, 1–53 (2020).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. Preprint at https://arxiv.org/abs/2402.03300 (2024).
Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention. In Proc. 29th Symposium on Operating Systems Principles 611–626 (Association for Computing Machinery, 2023).
Wang, R. et al. PsychFound: PsychFound code and dataset. Zenodo https://doi.org/10.5281/zenodo.17768150 (2025).
Acknowledgements
PsychFound is an in-depth extension of the PsychGPT research, jointly developed by Shanghai Jiao Tong University and Beijing Anding Hospital, Capital Medical University. We thank the Chinese Psychiatric Innovation Alliance for providing data support. We are grateful to the Expert Review Committee of Beijing Anding Hospital, Capital Medical University, for their rigorous review and validation of clinical data. We also acknowledge the National Clinical Research Centre for Mental Disorders for their guidance on study design, and the Information Technology Center of Beijing Anding Hospital, Capital Medical University, for providing computational resources. We extend our sincere appreciation to all psychiatrists who participated in the prospective cohort study and the reader evaluation study. This study was funded by the Brain Science and Brain-like Intelligence Technology-National Science and Technology Major Project (grant no. 2021ZD0200600) (G.W.), General Program of National Natural Science Foundation of China (grant no. 62576210) (C.J.), Natural Science Foundation of Shanghai (grant no. 25ZR1401179) (C.J.) and Capital’s Funds for Health Improvement and Research (grant no. CFH 2024-2-1174) (L.Z.).
Author information
Authors and Affiliations
Contributions
R.W., S.L., L.Z. and X.Z. contributed equally to this work. C.J. and G.W. are the corresponding authors. Specifically, R.W., S.L., G.W. and C.J. all made contributions to the conception and design of the work. R.W., S.L., L.Z. and X.Z. further performed acquisition, analysis and interpretation of data for the work. R.W. and S.L. performed the development and evaluation of PsychFound. J.H., X.Y. and Y.W. organized the prospective study and the reader study. L.Z., X.Z., Z.Y. and R.Y. performed analysis of the evaluation results. H.W. assisted in data collection, computing resource allocation and model development. In writing, R.W. and S.L. drafted the work. G.W. and C.J. reviewed it critically for important intellectual content. All authors reviewed the manuscript and provided meaningful feedback. All authors approve of the version to be published and agree to be accountable for all aspects of the work to ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Joseph Kambeitz and Jiyeong Kim for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.
The results plotted in the radar chart have been normalized.
Extended Data Fig. 2 Error pattern distributions across the five core PsychBench tasks.
Bar plots summarize the major categories of model errors for each task: Task 1 (clinical information summarization), where most errors arose from onset-pattern misjudgment; Tasks 2 and 3 (diagnosis and differential diagnosis), dominated by inaccuracies in associated-symptom assessment; Task 4 (medication recommendation), where overly conservative treatment decisions represented the majority of errors; and Task 5 (long-term course management), where limitations were primarily attributable to remote-information and detailed-information retention. Percentages represent the proportion of each error type within the task-specific error set.
Extended Data Fig. 3 Diagnostic category distribution and accuracy of PsychFound on original English psychiatric cases.
The bar charts summarize case counts and diagnostic accuracy across ICD-10 categories: F0 (Organic and symptomatic mental disorders), F1 (Mental and behavioural disorders due to psychoactive substance use), F2 (Schizophrenia, schizotypal, and delusional disorders), F3 (Mood [affective] disorders), F4 (Neurotic, stress-related, and somatoform disorders), F5 (Behavioural syndromes associated with physiological disturbance and physical factors), F6 (Disorders of adult personality and behaviour), F7 (Mental retardation), F8 (Disorders of psychological development), and F9 (Behavioural and emotional disorders with onset usually occurring in childhood and adolescence).
Extended Data Fig. 4 PsychFound’s sensitivity to incremental perturbations in clinical information.
a, A real-world bipolar disorder case with psychotic features was used to examine the model’s responsiveness to stepwise removal of key clinical elements. b, With complete information—including manic and depressive episodes with psychotic symptoms—PsychFound correctly identified F31.5. c, Removing psychotic symptoms (strikeout ①) led the model to adjust the diagnosis to F31.4. d, Removing both psychotic symptoms and manic history (strikeout ① and ②) shifted the output to F33.3, consistent with recurrent depressive disorder. e, When only a single depressive episode remained (strikeout ①, ②, and ③), the model updated the diagnosis to F32.3.
Extended Data Fig. 5 The prospective study design.
RP, resident psychiatrist; HAMD, Hamilton Depression Rating Scale; BPRS, Brief Psychiatric Rating Scale; SRAS, Suicide Risk Assessment Scale; CGI, Clinical Global Impression Scale.
Supplementary information
Supplementary Information
Supplementary Section A: design of psychiatry-specific function calling set. Supplementary Section B: Supplementary Figs. 1–31 and Supplementary Tables 1–10. Supplementary Section C: study protocol of real-world prospective study.
Source data
Source Data Fig. 3
Quantitative evaluation results of PsychFound and comparator LLMs on knowledge test and clinical tasks.
Source Data Fig. 4
Comparative evaluation results of PsychFound and representative LLMs across clinical tasks.
Source Data Fig. 5
Ablation results of different training strategies on diagnostic performance.
Source Data Fig. 6
Results of performance of resident psychiatrists in prospective study.
Source Data Extended Data Fig. 1
Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.
Source Data Extended Data Fig. 2
Statistics of error analysis on five psychiatric clinical tasks.
Source Data Extended Data Fig. 3
Results of diagnosis accuracy on original English psychiatric cases.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, R., Liu, S., Zhang, L. et al. A domain-adapted large language model to support clinicians in psychiatric clinical practice. Nat Mach Intell (2026). https://doi.org/10.1038/s42256-026-01224-w
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s42256-026-01224-w