
  • Article

A domain-adapted large language model to support clinicians in psychiatric clinical practice

Abstract

Mental disorders affect nearly one billion individuals worldwide, yet professional psychiatric care remains constrained by workforce shortages and experience-dependent decision-making. Despite recent advances in large language models (LLMs), current applications in mental health are primarily patient-oriented and lack alignment with real-world psychiatric clinical workflows. Here we present PsychFound, a domain-adapted and clinician-oriented LLM developed to support psychiatric clinical practice. Built through a three-phase framework using expert-curated psychiatric corpora and 64,588 real-world Chinese electronic health records, PsychFound integrates psychiatric professional knowledge, clinical reasoning capabilities and adaptation to the full spectrum of psychiatric clinical tasks across diagnosis, treatment planning and longitudinal management in Chinese clinical settings. In retrospective evaluations spanning three professional knowledge assessments and five clinical task benchmarks, the 7B-parameter PsychFound delivered the top overall performance among 22 LLMs. In a real-world, two-arm prospective study, resident psychiatrists assisted by PsychFound demonstrated higher consultation quality, higher diagnostic accuracy, more appropriate medication selection and reduced documentation time (all P < 0.01). A reader study with 60 psychiatrists (20 residents, 20 attendings and 20 seniors) showed that PsychFound’s clinical reasoning performance matched that of attending psychiatrists. These findings demonstrate that PsychFound provides an interpretable, expert-level decision support tool capable of improving consistency, efficiency and standardization in psychiatric clinical care.


Fig. 1: Overview of PsychFound: data foundations, model development pipeline and clinical integration.
Fig. 2: Comprehensive evaluation framework for PsychFound.
Fig. 3: Quantitative benchmarking of PsychFound on psychiatric expertise and clinical tasks.
Fig. 4: Performance benchmarking of PsychFound against representative LLMs across clinical tasks.
Fig. 5: Ablation study of PsychFound training strategies and their impact on diagnostic performance.
Fig. 6: Real-world prospective evaluation and reader-study assessment of PsychFound’s clinical utility.


Data availability

This study utilized two datasets for model development: PsychCorpus and PsychClinical. PsychCorpus consists of publicly available psychiatric texts and is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50). PsychClinical comprises de-identified real-world EHRs from multiple psychiatric centres and cannot be publicly released due to privacy and data-governance restrictions. Researchers may request controlled access from the corresponding author, subject to institutional and regulatory approval. For evaluation, we used publicly accessible domain-specific test sets and de-identified clinical cases from PsychBench, available via GitHub at https://github.com/wrx33/PsychBench. The study also incorporated a real-world prospective cohort. Due to ethical and data-governance constraints, the de-identified prospective study data are not publicly available. Researchers may request access from the corresponding author. All requests will be reviewed in accordance with the institution’s policies and data usage agreements, and responses will be provided within 4 weeks. Source data are provided with this paper.

Code availability

The code for scientific research and non-commercial use is available via GitHub at https://github.com/wrx33/PsychFound (ref. 50).

References

  1. World Mental Health Report: Transforming Mental Health For All (World Health Organization, 2022).

  2. Huang, Y. et al. Prevalence of mental disorders in China: a cross-sectional epidemiological study. Lancet Psychiatry 6, 211–224 (2019).

  3. Mental Health Atlas 2020: Review of the Eastern Mediterranean Region (World Health Organization, 2022).

  4. Chen, R., Zhang, W. & Wu, X. Mental health policy and implementation from 2009 to 2020 in China. SSM - Ment. Health 4, 100244 (2023).

  5. Stein, D. J. et al. Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414 (2022).

  6. Feuerriegel, S. et al. Using natural language processing to analyse text data in behavioural science. Nat. Rev. Psychol. 4, 96–111 (2025).

  7. Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP Digit. Psychiatry Neurosci. 2, 8 (2024).

  8. Mukherjee, S. S. et al. Natural language processing-based quantification of the mental state of psychiatric patients. Comput. Psychiatry 4, 76–106 (2020).

  9. Jacob, K. Patient experience and psychiatric discourse. The Psychiatrist 36, 414–417 (2012).

  10. Murad, M. H. et al. Measuring documentation burden in healthcare. J. Gen. Intern. Med. 39, 2837–2848 (2024).

  11. Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182, 564–566 (2022).

  12. Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  13. Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat. Med. 30, 2886–2896 (2024).

  14. Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).

  15. Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

  16. Lamichhane, B. Evaluation of ChatGPT for NLP-based mental health applications. Preprint at https://arxiv.org/abs/2303.15727 (2023).

  17. Amin, M., Cambria, E. & Schuller, B. Will affective computing emerge from foundation models and general AI? A first evaluation on ChatGPT. Preprint at http://arxiv.org/abs/2303.03186 (2023).

  18. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th International Conference on Neural Information Processing Systems 24824–24837 (2022).

  19. Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).

  20. Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).

  21. Wang, N. et al. RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024 14743–14777 (Association for Computational Linguistics, 2024).

  22. Yang, Q. et al. PsychoGAT: a novel psychological measurement paradigm through interactive fiction games with LLM agents. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers 14470–14505 (Association for Computational Linguistics, 2024).

  23. Rathje, S. et al. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl Acad. Sci. USA 121, e2308950121 (2024).

  24. She, D., Zhang, C., Yao, X., Gao, Y. & Jin, Z. MindChat-R0: a large language model for emotionally supportive dialogue through reinforcement learning. In Companion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing 1209–1216 (Association for Computing Machinery, 2025).

  25. Team, E. EmoLLM: reinventing mental health support with large language models. Preprint at https://arxiv.org/abs/2406.16442 (2024).

  26. Chen, Y. et al. SoulChat: improving LLMs' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023 1170–1183 (Association for Computational Linguistics, 2023).

  27. Hu, J. et al. PsycoLLM: enhancing LLM for psychological understanding and evaluation. IEEE Trans. Comput. Soc. Syst. 12, 539–551 (2024).

  28. Hiemke, C. et al. Consensus guidelines for therapeutic drug monitoring in neuropsychopharmacology: update 2017. Pharmacopsychiatry 51, 9–62 (2018).

  29. Wicha, S. G. et al. From therapeutic drug monitoring to model-informed precision dosing for antibiotics. Clin. Pharmacol. Ther. 109, 928–941 (2021).

  30. Relling, M. & Klein, T. CPIC: clinical pharmacogenetics implementation consortium of the pharmacogenomics research network. Clin. Pharmacol. Ther. 89, 464–467 (2011).

  31. Hicks, J. K. et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline for CYP2D6 and CYP2C19 genotypes and dosing of selective serotonin reuptake inhibitors. Clin. Pharmacol. Ther. 98, 127–134 (2015).

  32. Liu, S. et al. PsychBench: a comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice. Preprint at https://arxiv.org/abs/2503.01903 (2025).

  33. Liu, J. et al. Benchmarking large language models on CMExam—a comprehensive Chinese medical exam dataset. In Proc. 37th International Conference on Neural Information Processing Systems 52430–52452 (2023).

  34. Sun, Y. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).

  35. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).

  36. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).

  37. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) https://openreview.net/pdf?id=SkeHuCVFDr (2020).

  38. International Statistical Classification of Diseases and Related Health Problems: Alphabetical Index (World Health Organization, 2004).

  39. Yang, A. et al. Qwen2.5-1M technical report. Preprint at https://arxiv.org/abs/2501.15383 (2025).

  40. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  41. Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).

  42. Zhang, T. et al. Prevalence of personality disorders using two diagnostic systems in psychiatric outpatients in Shanghai, China: a comparison of uni-axial and multi-axial formulation. Soc. Psychiatry Psychiatr. Epidemiol. 47, 1409–1417 (2012).

  43. Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).

  44. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).

  45. Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).

  46. Thieme, A., Belgrave, D. & Doherty, G. Machine learning in mental health: a systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Trans. Comput. Hum. Interact. 27, 1–53 (2020).

  47. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  48. Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. Preprint at https://arxiv.org/abs/2402.03300 (2024).

  49. Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention. In Proc. 29th Symposium on Operating Systems Principles 611–626 (Association for Computing Machinery, 2023).

  50. Wang, R. et al. PsychFound: PsychFound code and dataset. Zenodo https://doi.org/10.5281/zenodo.17768150 (2025).


Acknowledgements

PsychFound is an in-depth extension of the PsychGPT research, jointly developed by Shanghai Jiao Tong University and Beijing Anding Hospital, Capital Medical University. We thank the Chinese Psychiatric Innovation Alliance for providing data support. We are grateful to the Expert Review Committee of Beijing Anding Hospital, Capital Medical University, for their rigorous review and validation of clinical data. We also acknowledge the National Clinical Research Centre for Mental Disorders for their guidance on study design, and the Information Technology Center of Beijing Anding Hospital, Capital Medical University, for providing computational resources. We extend our sincere appreciation to all psychiatrists who participated in the prospective cohort study and the reader evaluation study. This study was funded by the Brain Science and Brain-like Intelligence Technology-National Science and Technology Major Project (grant no. 2021ZD0200600) (G.W.), General Program of National Natural Science Foundation of China (grant no. 62576210) (C.J.), Natural Science Foundation of Shanghai (grant no. 25ZR1401179) (C.J.) and Capital’s Funds for Health Improvement and Research (grant no. CFH 2024-2-1174) (L.Z.).

Author information

Authors and Affiliations

Contributions

R.W., S.L., L.Z. and X.Z. contributed equally to this work. C.J. and G.W. are the corresponding authors. Specifically, R.W., S.L., G.W. and C.J. all made contributions to the conception and design of the work. R.W., S.L., L.Z. and X.Z. further performed acquisition, analysis and interpretation of data for the work. R.W. and S.L. performed the development and evaluation of PsychFound. J.H., X.Y. and Y.W. organized the prospective study and the reader study. L.Z., X.Z., Z.Y. and R.Y. performed analysis of the evaluation results. H.W. assisted in data collection, computing resource allocation and model development. In writing, R.W. and S.L. drafted the work. G.W. and C.J. reviewed it critically for important intellectual content. All authors reviewed the manuscript and provided meaningful feedback. All authors approve of the version to be published and agree to be accountable for all aspects of the work to ensure that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Corresponding authors

Correspondence to Gang Wang or Cheng Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Joseph Kambeitz and Jiyeong Kim for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.

The results plotted in the radar chart have undergone normalization processing.

Source data

Extended Data Fig. 2 Error pattern distributions across the five core PsychBench tasks.

Bar plots summarize the major categories of model errors for each task: Task 1 (clinical information summarization), where most errors arose from onset-pattern misjudgment; Tasks 2 & 3 (diagnosis and differential diagnosis), dominated by inaccuracies in associated-symptom assessment; Task 4 (medication recommendation), where overly conservative treatment decisions represented the majority of errors; and Task 5 (long-term course management), where limitations were primarily attributable to remote-information and detailed-information retention. Percentages represent the proportion of each error type within the task-specific error set.

Source data

Extended Data Fig. 3 Diagnostic category distribution and accuracy of PsychFound on original English psychiatric cases.

The bar charts summarize case counts and diagnostic accuracy across ICD-10 categories: F0 (Organic and symptomatic mental disorders), F1 (Mental and behavioural disorders due to psychoactive substance use), F2 (Schizophrenia, schizotypal, and delusional disorders), F3 (Mood [affective] disorders), F4 (Neurotic, stress-related, and somatoform disorders), F5 (Behavioural syndromes associated with physiological disturbance and physical factors), F6 (Disorders of adult personality and behaviour), F7 (Mental retardation), F8 (Disorders of psychological development), and F9 (Behavioural and emotional disorders with onset usually occurring in childhood and adolescence).

Source data

Extended Data Fig. 4 PsychFound’s sensitivity to incremental perturbations in clinical information.

a, A real-world bipolar disorder case with psychotic features was used to examine the model’s responsiveness to stepwise removal of key clinical elements. b, With complete information, including manic and depressive episodes with psychotic symptoms, PsychFound correctly identified F31.5. c, Removing psychotic symptoms led the model to adjust the diagnosis to F31.4. d, Removing both psychotic symptoms and manic history shifted the output to F33.3, consistent with recurrent depressive disorder. e, When only a single depressive episode remained, the model updated the diagnosis to F32.3.
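The stepwise information-removal design above generalizes to any diagnostic model: fix a case, delete one clinical element per step, and record the diagnosis after each deletion. A minimal sketch follows; `diagnose` is a hypothetical rule-based stand-in (not the PsychFound API, which is not specified here) that encodes only the ICD-10 outcomes reported in this caption, and the element names are illustrative.

```python
def diagnose(case_elements: set[str]) -> str:
    """Toy stand-in for a diagnostic model: maps the clinical elements
    still present in the case to an ICD-10 code (illustrative rules only)."""
    if {"manic_episode", "depressive_episode", "psychotic_symptoms"} <= case_elements:
        return "F31.5"  # bipolar disorder, severe depression with psychotic symptoms
    if {"manic_episode", "depressive_episode"} <= case_elements:
        return "F31.4"  # bipolar disorder, severe depression without psychotic symptoms
    if "recurrent_depression" in case_elements:
        return "F33.3"  # recurrent depressive disorder
    return "F32.3"      # single severe depressive episode


def perturbation_trace(full_case, removal_steps, model=diagnose):
    """Remove elements step by step and record the diagnosis after each step."""
    case = set(full_case)
    trace = [model(case)]          # diagnosis on complete information
    for removed in removal_steps:  # each step deletes one or more elements
        case -= set(removed)
        trace.append(model(case))
    return trace


# Reproduce the panel sequence b -> c -> d -> e:
trace = perturbation_trace(
    {"manic_episode", "depressive_episode", "psychotic_symptoms", "recurrent_depression"},
    [["psychotic_symptoms"], ["manic_episode"], ["recurrent_depression"]],
)
print(trace)  # ['F31.5', 'F31.4', 'F33.3', 'F32.3']
```

Swapping `diagnose` for a real model call turns the same harness into a sensitivity probe over any case.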

Extended Data Fig. 5 The prospective study design.

RP represents resident psychiatrist. HAMD represents the Hamilton Depression Rating Scale. BPRS represents the Brief Psychiatric Rating Scale. SRAS represents the Suicide Risk Assessment Scale. CGI represents the Clinical Global Impression Scale.

Extended Data Table 1 Performance comparison of PsychFound and other LLMs in domain-specific knowledge test
Extended Data Table 2 Post-study questionnaire assessment by resident psychiatrists using PsychFound
Extended Data Table 3 Characteristics summary of PsychClinical dataset
Extended Data Table 4 Comparison of demographic and clinical characteristics between inpatients in the control group and experimental group

Supplementary information

Supplementary Information

Supplementary Section A: design of psychiatry-specific function calling set. Supplementary Section B: Supplementary Figs. 1–31 and Supplementary Tables 1–10. Supplementary Section C: study protocol of real-world prospective study.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 3

Quantitative evaluation results of PsychFound and comparator LLMs on knowledge test and clinical tasks.

Source Data Fig. 4

Comparative evaluation results of PsychFound and representative LLMs across clinical tasks.

Source Data Fig. 5

Ablation results of different training strategies on diagnostic performance.

Source Data Fig. 6

Results of performance of resident psychiatrists in prospective study.

Source Data Extended Data Fig. 1

Leaderboard of the comprehensive performance of all tested LLMs on the five clinical tasks of PsychBench.

Source Data Extended Data Fig. 2

Statistics of error analysis on five psychiatric clinical tasks.

Source Data Extended Data Fig. 3

Results of diagnosis accuracy on original English psychiatric cases.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, R., Liu, S., Zhang, L. et al. A domain-adapted large language model to support clinicians in psychiatric clinical practice. Nat Mach Intell (2026). https://doi.org/10.1038/s42256-026-01224-w


  • DOI: https://doi.org/10.1038/s42256-026-01224-w
