Introduction

Mental health issues have long been a global health concern, given their profound impact on individuals and societies, and the urgency has only grown in recent years. Nearly 1% of all global deaths annually are now due to suicide, with approximately 800,000 people dying by suicide each year1. In the United States alone, the annual public mental health expenditure exceeded $16.1 billion, including a $2.21 billion budget for the National Institute of Mental Health (NIMH) and $13.9 billion spent on mental healthcare2. Even so, the United States psychiatry workforce is projected to face a pressing shortage through 2024, with a potential shortfall of 14,280 to 31,091 psychiatrists3,4. In low- and middle-income countries, the situation is even worse, with up to 85% of people receiving no treatment for their mental health conditions5.

In response to the growing mental health crisis and the projected shortage of mental health professionals, artificial intelligence (AI)-driven mental health applications such as chatbots are emerging as vital tools to bridge the treatment gap. These technologies offer scalable, accessible, and cost-effective support, particularly in areas where traditional mental health services, including psychiatric care, are insufficient or unavailable. As of 2023, the global market for mental health apps has grown rapidly, with over 10,000 apps collectively serving millions of users6. AI-driven platforms increasingly incorporate psychiatric assessments, medication management reminders, and monitoring tools that assist in the management of conditions such as depression, anxiety, and bipolar disorder. Studies suggest these tools can help reduce symptoms and improve patient outcomes, making them a promising avenue for addressing mental health challenges, especially in regions with limited access to psychiatric professionals. They are also increasingly being integrated into broader mental health care strategies to help meet the growing demand7,8.

The introduction of large language models (LLMs) like OpenAI’s ChatGPT9, Google’s Bard10, and Anthropic’s Claude11 marks a transformative advancement in AI-driven mental health care, offering capabilities far beyond those of earlier AI tools. Unlike previous models, which were limited to scripted interactions and specific tasks, LLMs generate human-like language and can engage in dynamic, context-aware conversations that feel more natural and personalized. This allows them to provide tailored emotional support, detect subtle cues indicating changes in mental health, and adjust their guidance to meet individual user needs. Increasingly, research is exploring anthropomorphic features such as empathy, politeness, and other human-like traits in these models to enhance their effectiveness in delivering more realistic and supportive mental health care12.

Despite the promising potential, these tools are still in the early stages of development and evaluation. Users often do not understand the models they are interacting with, including the limitations and biases inherent in the AI’s design. Unfortunately, there is currently no standardized framework for evaluating the effectiveness and safety of these models in mental health applications. Many studies, including those focused on evaluating LLMs, develop their own metrics and methods, leading to inconsistent and sometimes unreliable results. The lack of standardized evaluation hinders the comparison of models and the assessment of their true impact on mental health outcomes. Concerns about data privacy, the potential for misuse, and the ethical implications of relying on AI for sensitive mental health care decisions further underscore the need for rigorous oversight. Considering these promises and challenges, a scoping review of the current applications of LLMs in mental health care is essential from the perspective of psychiatrists and clinical informaticians. Our review aims to synthesize existing research with a focus on clinical relevance, identify gaps in understanding from a mental health practice standpoint, and provide clear guidelines for future development and evaluation of these technologies in real-world settings.

Background

Subfields of mental health care and the potential of generative AI

The potential of generative AI in mental health care is broad given the many different treatment approaches employed today for care delivery. These approaches generally fall into three main categories: psychotherapy, psychiatry, and general mental health support.

Psychotherapy is one of the most common forms of mental health care. However, access to psychotherapy is often limited by factors such as a shortage of therapists, long wait times, and high costs. Generative AI could help address these issues by offering on-demand support, providing education about mental health, and guiding people through therapeutic exercises when they cannot see a therapist in person.

Psychiatry focuses on the medical side of mental health care, including diagnosing, treating, and preventing mental disorders. Like psychotherapy, psychiatry also faces challenges, particularly a shortage of psychiatrists. Generative AI could support psychiatrists by helping monitor patients’ symptoms, reminding patients to take their medication, and providing initial assessments, which could reduce the strain on the healthcare system and improve patient outcomes.

General mental health support includes a wide range of services designed to promote mental well-being and prevent mental health problems, such as community programs, self-help resources, peer support networks, and public health initiatives. These services are important for early intervention, managing stress, and preventing more serious mental health issues from developing. However, many people do not take advantage of these resources, often because of stigma, lack of awareness, or insufficient availability. Generative AI could help make these resources more accessible by providing anonymous, personalized support through chatbots and apps that offer mental health education, coping strategies, and encouragement to seek help in a way that feels safe and non-judgmental.

Large language models (LLMs)

Although LLMs gained widespread attention with the release of OpenAI’s ChatGPT, the concept has existed for some time, though there is no single unified definition. In the natural language processing (NLP) community, LLMs are generally understood as large generative AI models capable of producing text by predicting the next word or phrase based on vast amounts of training data. NLP has evolved drastically over time, with early models being task-specific and limited in their ability to understand context and nuance. The introduction of advanced deep learning frameworks marked a major improvement, as these models are designed to better capture contextual language meaning. However, they still struggled with generating coherent, contextually appropriate text over longer conversations, which is crucial for mental health applications. LLMs have advanced this further by leveraging large datasets and transformer architectures to predict and generate highly coherent and context-aware text. This enables them to mimic human conversation, making them valuable for creating therapeutic content, offering psychoeducation, and simulating therapy sessions—important tools for expanding access to mental health care. For clinicians, LLMs offer promising tools to support mental health services by providing personalized, scalable interactions. However, it is important to recognize that most current LLMs are general-purpose models and do not perform as well as specialized pre-trained models for domain-specific tasks such as prediction and classification. For example, Bidirectional Encoder Representations from Transformers (BERT) models, which represent word segments (tokens) using both the segments before and after them, are more accurate and efficient for these purposes. As a result, pretraining and fine-tuning become crucial steps, as they provide the model with contextual knowledge and linguistic patterns specific to mental health applications. This fine-tuning and pretraining process can incorporate emotional cues and expert-written examples to enhance the model’s interpretability and responsiveness, improving the performance of LLMs on specific generative tasks.
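To illustrate the kind of domain-specific fine-tuning described here, the following is a minimal, hypothetical sketch of fine-tuning a BERT-style encoder for a binary mental-health text classification task using the Hugging Face transformers and datasets libraries; the CSV files, label scheme, and hyperparameters are assumptions for demonstration rather than a setup from any reviewed study.

```python
# Minimal sketch: fine-tuning a BERT-style encoder for a mental-health
# classification task (e.g., labeling posts for depressive symptoms).
# File names, labels, and hyperparameters are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Convert raw text into the token IDs the encoder expects.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-mental-health",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

Such an encoder-based classifier is the kind of specialized model the paragraph above contrasts with general-purpose generative LLMs for prediction and classification tasks.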

Results

Mental disorders, conditions, and subconstructs

Mental disorders referenced in the included studies vary widely in terms of definitions, measurement instruments, and the use of standards. While some studies focus on clinically confirmed diagnoses, relying on established criteria like those found in the DSM-513, others take a less structured approach. In such cases, mental health constructs are often defined arbitrarily using user-expressed keywords or affects rather than expert knowledge or validated measures. This is especially common in studies conducted outside the medical or clinical domain, where mental health constructs may be interpreted more loosely or tailored to the context of the AI models. Such inconsistencies in the use and understanding of validated measures highlight a potential gap when applying AI models to various targeted mental health constructs—including affect, symptoms, diagnosis, and treatment—reflecting a broader issue in this interdisciplinary field. Therefore, we categorized the targeted mental health disorders, conditions, and subconstructs into two groups: 1) those measured or defined with validated approaches, relying on standard diagnostic criteria and validated clinical knowledge; and 2) those assessed with non-validated measures, lacking a clear definition, standard, or validated method for assessment or diagnosis.

As shown in Table 1, eight studies out of the sixteen reviewed included validated measures for mental health constructs14,15,16,17,18,19,20,21, while nine relied on ad-hoc (less well-established) approaches16,17,21,22,23,24,25,26,27, and three studies included constructs with a mix of both types of measurement16,17,21. Across both groups, depression14,16,17,18,19,21,24,25,26 was the most frequently studied mental health construct. The Patient Health Questionnaire-9 (PHQ-9)16,19 and the Center for Epidemiologic Studies Depression Scale for Children (CES-DC)16 were adopted as inclusion criteria and outcome measures16,19, while another study used the PHQ-9 as an exclusion criterion21. Other clinically valid constructs include anxiety14,16,18, positive and negative affect (PANAS)14, Attention-Deficit/Hyperactivity Disorder (ADHD)15,20, bipolar disorder19, loneliness17, and stress16.

One study evaluated GPT’s performance on 100 clinical case vignettes, comparing GPT against psychiatrists across different clinical constructs and covering a wide range of disorders28. However, not all studies that referenced clinical mental health constructs provided specific criteria. For example, one study diagnosed study subjects through clinical interviews “using screening instruments over different disorders”15. Other studies incorporated expert judgments from mental health providers without describing the specific process or referring to well-established criteria22,23,28. Depression17,24,25,26 and suicidality16,17,22,23,27 have also been frequently studied with less well-established and customized constructs. For instance, one study associated the construct of depression with self-identified feelings of depression17 or simply with the word “sad”23. Another study filtered social media posts related to suicidal ideation and self-harm using regular expressions (e.g., “.(commit suicide).”, “.(cut).”)29. More specific subconstructs of mental health care include psychological challenges due to social emotions28,29, cognitive distortion and negative thoughts19,20, and abuse19. These studies used less well-established and more arbitrary standards for definitions and assessment (Tables 2, 3).

Cognitive Behavior Therapy (CBT)30 is the most referenced treatment method for anxiety, cognitive distortion, depression, and loneliness15,16,17,18,22. It is an evidence-based, well-established psychological treatment. Elements and techniques from CBT30, such as cognitive restructuring22,23 and mindfulness18, have been incorporated into LLM models to provide digital self-guided interventions. Other evidence-based treatment approaches include occupational therapy20, which is used to support children with ADHD, and peer support27, where the chat agent simulates individuals with similar experiences to provide empathetic emotional support.

Table 1 Mental disorders, conditions, and subconstructs in generative applications of LLMs for mental health care

Applications and model information

Existing generative applications of LLMs in mental health care can be categorized into six main types based on model functionalities: Clinical Assistant, Counselling17,29, Therapy17,23, Emotional Support16,17,31,32, Positive Psychology Intervention14,22,23, and Education15,33. Among them, the Clinical Assistant application includes attempts to develop and evaluate LLMs for supporting mental health professionals by generating management strategies and diagnoses for psychiatric conditions. In the Counselling category, LLMs are used to interact with participants, such as engaging Spanish teenagers in discussions about mental health disorders15 and providing relationship advice in single-session interventions34. Emotional Support applications have focused on offering empathetic responses and support in various contexts, such as mitigating loneliness and suicide risk among students17. In the Therapy category, LLMs are integrated into treatments for conditions like ADHD, enhancing care through simulated therapy scenarios35 and immersive therapy experiences using virtual reality (VR)18. Positive Psychology Interventions involve using LLMs to personalize recommendations and facilitate cognitive restructuring, thereby reducing negative thoughts and emotional intensity14,22. Finally, in Education, LLMs have been employed to train medical students in communication skills, providing a realistic and positive simulated patient experience33, as well as promoting awareness of mental health among young people15. Most of these studies support only text-based input/output modalities14,15,19,22,23,24,26,27,31,32,34. A subset of systems17,18,20,35 supports multimodal input/output, incorporating speech, images, or video for a richer user experience. Some applications incorporate physical embodiment through VR17,18 or robotics20,35. These applications are seen across various target user groups, including healthcare providers19,26, patients14,16,18,20,22,23,24,31,32,34, and the general public15,32,33.

OpenAI’s GPT series models are the most studied, appearing in 14 studies14,18,19,20,22,23,24,28,31,32,34,35, with 11 using more recent models such as GPT-3.5, ChatGPT, GPT-4, and customized GPTs, while four studies used the earlier GPT-3 model. Other LLMs used23,26 include Huawei’s PanGu26, T520, and DialoGPT36, which are open-source. Some studies did not specify the platforms they employed, while many used digital platforms such as websites and mobile phones. Some studies developed agents with physical embodiments22, and others21,35 used the Raspberry Pi, a type of single-board computer (Supplementary Table 2). Among those that used OpenAI’s models, three were based on OpenAI’s web interface24,28,34, one did not directly state its platform but appeared to use the API based on the structure of its methods19, and only eight (57.1%) explicitly referenced API use or temperature parameters14,18,20,28,32. Language support varied beyond English: three applications supported multiple languages17,20,35, and 14 supported a single language—seven in English18,19,22,23,24,32,34, three in Chinese14,26,29, two in Korean16,31, and two in Spanish15,33.

Table 2 Overview of input/output modalities, models, and target users in generative applications of LLMs in mental health care

Task performance and clinical effectiveness

The study designs and evaluations of existing research are highly heterogeneous and often inconsistent, making it challenging to accurately assess their task performance and clinical effectiveness. Thus, we provide a high-level summary of the findings here; a detailed summary of each study’s task, performance/results, sample size, clinical validation method, and participant demographics can be found in Supplementary Table 3.

Several studies have explored the use of LLMs for clinical decision support in psychiatry. In one study, ChatGPT-3.5 was evaluated using 100 clinical case vignettes covering diverse psychiatric conditions28. The model achieved a “Grade A” rating in 61% of cases, “Grade B” in 31%, and “Grade C” in 8%, indicating different levels of diagnostic accuracy in simulated scenarios. However, this study did not involve real patients, and no clinical validation was performed. Similarly, another study assessed GPT-4’s performance in clinical decision-making for bipolar depression cases. GPT-4 selected optimal treatments in 50.8% of cases, slightly outperforming community clinicians19. Although promising, these results are based on hypothetical cases, and the model’s effectiveness in actual clinical practice remains unverified. Overall, while LLMs demonstrate potential in generating clinically relevant information, the lack of clinical validation and reliance on simulated vignettes limit the evidence for their effectiveness in real-world diagnostic support.

Several studies have investigated the application of LLMs in aspects of therapeutic interventions, particularly in cognitive restructuring and positive psychology. Liu et al.14 conducted randomized controlled trials with 326 participants to test GPT-based chatbots delivering Positive Psychology Interventions (PPIs). The chatbot provided personalized recommendations and engaged users in multi-round dialogues, resulting in improvements in mental well-being, reductions in anxiety, and increased life satisfaction. This suggests that LLMs can effectively facilitate interventions aimed at enhancing psychological well-being. Another study explored the use of LLMs in self-reflective journaling among 28 psychiatric outpatients diagnosed with Major Depressive Disorder22. Clinicians reported that the LLM-assisted journaling system enriched patient records and provided better insights into patients’ conditions. In a large-scale randomized controlled trial involving over 15,000 participants34, Sharma et al. evaluated an LLM’s assistance in cognitive restructuring for self-guided mental health interventions. The study found that 67% of participants reported reduced emotional intensity, and 65% overcame negative thoughts after interacting with the LLM. These results indicate the potential scalability and effectiveness of LLMs in supporting cognitive-behavioral techniques.

LLMs have also been used to provide emotional support and enhance engagement, particularly among youth and marginalized populations. Mármol-Romero et al. examined a GPT-based chatbot’s engagement with Spanish-speaking teenagers on mental health topics15. The observational study involved 102 students, and the chatbot facilitated open discussions on anxiety and depression. The engagement led to meaningful conversations with 44 participants, indicating potential for early outreach and mental health education among adolescents. Another study investigated the use of the Replika chatbot among 1006 students17. The study found that 3% of participants reported cessation of suicidal ideation after interacting with the chatbot, and 75% reported feeling less lonely, suggesting that LLM chatbots can provide immediate emotional support. However, the lack of long-term outcome data across these studies is notable.

Table 3 Summary of unified evaluation constructs

Evaluation methods, scales, and constructs

A standardized and well-established set of constructs and scales is essential in systematically measuring mental health interventions, particularly when evaluating new technologies. Constructs refer to specific concepts or characteristics being measured, such as privacy, safety, or user experience. They provide a clear focus on what is being assessed in a study, which is crucial for ensuring that the evaluation is meaningful and relevant. Scales, in turn, offer a structured and standardized approach to quantify these constructs. This standardization is necessary for consistency across different studies, allowing researchers to compare results and draw more robust conclusions.

Given the diversity in how constructs are defined and measured across studies, it is important to use a framework that can harmonize these variations. While there are many approaches, we used a hierarchical framework37 inspired by the American Psychiatric Association app evaluation model. A 2024 review of evaluation models38 noted that this framework “is straightforward, comprehensive, flexible, and relevant to diverse contexts,” making it a promising starting point. This framework categorizes constructs into three levels: (1) Safety, Privacy, and Fairness; (2) Trustworthiness and Usefulness; and (3) Design and Operational Effectiveness. The pyramid framework ensures that each level of evaluation builds on the previous one; for example, without ensuring that an intervention is safe, it would be premature to evaluate its usability or cost-effectiveness.
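Purely as an illustration of how such a hierarchy can be operationalized during data extraction (not tooling used in this review), the sketch below encodes the three levels as a simple mapping and tallies which constructs a set of hypothetical article annotations assessed; the subconstruct placements and article entries are assumptions.

```python
# Illustrative sketch only: encode the three-level evaluation pyramid and
# count how many (hypothetical) articles assessed each construct.
from collections import Counter

PYRAMID = {
    1: "Safety, Privacy, and Fairness",
    2: "Trustworthiness and Usefulness",
    3: "Design and Operational Effectiveness",
}

# Assumed mapping of example subconstructs to levels; real placements should
# follow the published framework.
CONSTRUCT_LEVEL = {
    "Safety": 1, "Privacy": 1, "Fairness": 1,
    "Trustworthiness": 2, "Usefulness": 2,
    "Usability": 3, "Cost-effectiveness": 3,
}

# Hypothetical per-article annotations of which constructs were evaluated.
articles = {
    "study_A": ["Usability", "Usefulness"],
    "study_B": ["Safety", "Usability"],
}

counts = Counter(c for constructs in articles.values() for c in constructs)
for level, name in PYRAMID.items():
    level_constructs = [c for c, lvl in CONSTRUCT_LEVEL.items() if lvl == level]
    tallies = {c: counts.get(c, 0) for c in level_constructs}
    print(f"Level {level} ({name}): {tallies}")
```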

Among the studies reviewed, those that involved direct participant feedback (n = 5)14,17,18,22,23 generally focused on user-centric constructs. These studies typically involved larger sample sizes, ranging from 28 to over 15,000 participants, and assessed constructs such as accessibility, ease of use, personalized engagement, user experience, and cost-effectiveness. They provide direct insights into the real-world user experience of LLMs. On the other hand, studies that focused on evaluating LLM performance—typically involving expert assessments—concentrated more on foundational and core efficacy constructs. These studies often used smaller sample sizes, ranging from 12 to 100 cases, focusing on technical or functional aspects of the LLMs. Additionally, one study23 designed and incorporated automated metrics for Rationality, Positivity, and Empathy, using NLP models to evaluate LLM outputs. These automated evaluations offer a more detailed, algorithmic perspective on the LLM’s performance, complementing human judgments.

The heterogeneous use of scales remains a problem in the mental health field. We observed that 12 studies developed their own scales15,18,19,20,21,23,26,28,32,34,35 or adapted existing ones for their evaluations. Most of the studies using validated scales were those directly measuring patient outcomes, such as anxiety, where the Generalized Anxiety Disorder-7 (GAD-7) was employed14,32. However, many articles created their own scales without a clear rationale and often lacked references to support their methods, raising concerns about the validity and reliability of their approaches.

Figure 1 presents a pyramid-shaped schematic of the current status of evaluated constructs in the generative applications of LLMs for mental health care, based on the health AI-chatbot evaluation framework37. The figure includes the number of articles counted for each level-2 construct, with gray text indicating constructs never evaluated by existing research. The foundational levels are less frequently assessed: only three studies evaluated the fundamental construct “Safety, Privacy, and Fairness”; thirteen studies assessed the second-level construct “Trustworthiness and Usefulness”; and 11 articles evaluated the third-level construct “Design and Operational Effectiveness.” Although “Trustworthiness and Usefulness” is the most evaluated category, more than half of its subconstructs remain unassessed. Across the framework, constructs such as “Accountability,” “Transparency,” “Explainability and Interpretability,” “Testability,” “Security,” and “Resilience” have never been evaluated.

Fig. 1: Pyramid framework of evaluation constructs in generative applications of LLMs in mental health care.

Constructs in gray were not assessed in any study. “N” represents the number of unique articles that assessed each construct. Foundational areas like “Safety, Privacy, and Fairness” are rarely evaluated, highlighting key gaps in critical aspects such as “Accountability,” “Transparency,” and “Security”.

Discussion

Our review suggests that there is great enthusiasm for LLM-based mental health interventions and that many teams are creating interesting and unique applications. We found chatbots already developed to serve as clinical assistants, counselors, emotional support agents, and positive psychology interventions. However, despite the enthusiasm for applying LLMs in mental health care, the current evidence regarding their task performance and clinical effectiveness is limited and varies across studies. Many studies lack rigorous clinical validation, standardized outcome measures, and adequate sample sizes, which hampers the ability to draw definitive conclusions. Furthermore, the inconsistent use and understanding of well-established measurement methods across studies complicate the evaluation of these interventions. We observed that mental health constructs were often referenced without accompanying well-established instruments and measurements, and in some cases, researchers tailored the definition or assessment to fit their specific AI models, leading to challenges in consistent categorization. This inconsistency underscores a broader issue within the interdisciplinary field of AI and mental health: the variation in how constructs like affect, mood, diagnosis, and treatment are applied complicates efforts to maintain clear distinctions between mental health constructs with and without validated measurements.

The evaluation of LLM-based mental health interventions is hindered by the lack of unified guidelines for scale development and reporting. While this may be acceptable for feasibility testing, it limits the ability to understand the actual clinical potential of these new chatbots. With the majority of studies using ad-hoc scales that are not well established and without addressing their validity and reliability, there is both an opportunity for the next wave of research to better establish credibility and a clear need for guidelines to standardize the reporting and scales used in this field. While effective evaluation is still nascent, the results presented above highlight that the current focus overlooks foundational privacy and safety concerns. LLM-based mental health chatbots are multifaceted, with privacy, technical, engagement, legal, and clinical considerations. Our team recently introduced a simplified framework to unify these many evaluations, suggesting that safety and privacy should be the foundation of any evaluation37. This is not to minimize the value of evaluating design and effectiveness (level 3) or usefulness and trustworthiness (level 2), but rather to note that these should not take priority over safety, privacy, and fairness (level 1). Without these level 1 considerations, LLM-based mental health interventions may be impressive but unfit for healthcare or clinical use.

Our results also show that the focus of current LLMs today is directed more at patients and less at clinicians. This approach is logical, as direct-to-consumer/patient approaches often avoid complex healthcare regulations and clinical workflow barriers. However, it also risks fragmenting the potential of LLM-based mental health interventions to influence care, as there is strong evidence that clinician engagement is required for more sustained and impactful patient use of any digital technology10. There is strong data that clinicians are interested in using LLMs in care, but they first require, and are asking for, more training and support on how to use these tools in care39.

The LLMs reviewed in this paper target a wide variety of disorders. Over half of the studies reviewed included clinically valid disorders, with other studies targeting general mental health constructs. However, we found that many studies did not offer sufficient details on the target population, and the difference between mental health risk factors and mental health conditions was poorly delineated. We acknowledge that psychiatric nosology is challenging, as highlighted in recent literature40, but this challenge highlights how the evaluation of AI systems in mental health may quickly reach an impasse. For example, constructs like depression were often mentioned in a broad and non-specific manner, without reference to diagnostic criteria or standardized and well-established metrics such as the PHQ-9 or GAD-7. This was particularly pronounced in studies conducted by researchers outside the medical or clinical domains. Such inconsistent use of constructs and measurement methods complicates efforts to maintain a clear distinction between mental health constructs with and without validated measures, calling attention to a broader issue within the interdisciplinary field of AI and mental health. Similarly, while one study specified a population of children and adolescents aged 12 to 18 years15, most studies lacked detailed demographic information. Given that only one study emphasized data security, with conversations proceeding through a HIPAA-compliant environment18, the lack of more clinical use cases is perhaps appropriate. Another issue is the dependence on proprietary models, such as OpenAI’s GPT-3.5 and GPT-4, in many mental health applications. This reliance raises concerns about transparency and customization, as the use of closed-source models limits external validation of reliability and safety, which is crucial in mental health research. To improve measurement specificity for particular populations or disorders, model pretraining and fine-tuning are key aspects to consider41. More models and studies should include domain- and audience-specific models pre-trained on clinical data, with more rigorous application of standardized diagnostic tools. Promoting the use of open-source models and improving transparency can enhance the scientific and ethical standards of these applications.

To advance the scalability and scientific rigor of LLM-based mental health interventions, the research community must also adopt more controlled methodologies. Some studies, particularly those utilizing ChatGPT, rely on the website interface for research purposes. While this approach is convenient, it should be discouraged for rigorous scientific investigations. Research should instead be conducted using the API, where hyperparameters such as the temperature can be controlled, ensuring replicability of the results. The website interface should primarily be used for testing third-level constructs such as Design and Operational Effectiveness and potentially for assessing the safety and transparency of the user-facing system. However, researchers must also address factors such as backend model updates and stochastic elements in the sampling process to ensure consistent reproducibility and reliability.
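As a minimal sketch of this recommendation, assuming the openai Python package, an API key configured in the environment, and illustrative model and prompt choices, an API-based call can pin the sampling parameters that a web interface hides:

```python
# Minimal sketch: querying an LLM via API with pinned sampling parameters,
# rather than through a web interface. Model name, system prompt, and user
# message are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",      # report the exact model snapshot and access date used
    temperature=0,      # minimize sampling randomness for replicability
    seed=42,            # best-effort determinism where the backend supports it
    messages=[
        {"role": "system",
         "content": "You are a supportive, non-judgmental mental health assistant."},
        {"role": "user",
         "content": "I have been feeling down and unmotivated lately."},
    ],
)
print(response.choices[0].message.content)
```

Even with temperature and seed fixed, backend model updates can still change behavior, so reporting the model snapshot, access dates, and full prompts remains necessary for reproducibility.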

Finally, the global applicability of LLM-based mental health tools warrants careful consideration. Public health, especially mental health care, is a global issue, and it is crucial to develop and deploy mental health chatbots in countries and regions where resources are limited and where stigma may be higher. English is often not the primary language in these areas. It is encouraging that 10 of the 17 studies (58.8%) support non-English languages, either in a single other language or as multilingual chatbots, which is a positive step toward language equity and global health. But this also raises an issue, beyond the scope of this paper, of whether these chatbots offer the same level of correctness, consistency, and verifiability as English-trained chatbots, given that research suggests this is often not the case42.

Future directions for LLMs in mental health care should prioritize expanding their applications beyond narrow prediction tasks, especially given that only 17 studies over the past five years have explored generative tasks prospectively involving human participants for evaluation. Human-centered studies provide critical insights into how LLMs interact with individuals, particularly in sensitive contexts like mental health care, where nuances in communication and emotional understanding are vital. Addressing current limitations such as small sample sizes and a lack of diverse participant demographics, future research should employ larger, more representative samples to enhance the generalizability of findings. To improve the rigor and credibility of LLM-based mental health interventions, studies should prioritize the development of standardized evaluation guidelines. These guidelines should include the creation of validated and reliable scales that can be universally applied across studies, ensuring consistent and accurate assessments of clinical potential. By standardizing evaluation metrics, researchers can overcome the variability that currently impedes comparability and synthesis of results across different studies. To enhance transparency and overcome the limitations of proprietary models, researchers should move away from using web interfaces like ChatGPT for rigorous scientific studies, as these platforms lack the necessary controls for reproducibility. Instead, APIs and locally deployable models that allow for control over hyperparameters should be used to ensure the replicability of results. This approach will mitigate concerns about reproducibility and allow for more precise manipulation of model parameters, leading to more reliable outcomes. Finally, studies focused on critical constructs such as beneficence, validity, and reproducibility should adopt rigorous evaluation methods and well-established scales, moving beyond metrics like recall and F1 scores, to establish a more comprehensive understanding of model accuracy and clinical relevance. Incorporating ethical considerations and addressing privacy and safety concerns in study designs will also enhance the trustworthiness of LLM applications in mental health care. Equally important is the advancement of novel methodologies and rigorous standards to ensure fairness. A recent study has demonstrated strategies to mitigate biases and promote equity in LLM applications, including assessing demographic disparities in empathy, implementing demographic-aware prompting, and evaluating subgroup performance in mental health contexts. Future studies should explore new fairness metrics tailored specifically to mental health contexts, such as cultural adaptability or intersectional biases43.
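To make the subgroup-auditing idea concrete, below is a purely hypothetical sketch, not drawn from the cited study: it groups empathy ratings of model responses by user demographic and compares subgroup means, which is one simple way to surface the disparities that demographic-aware prompting or fine-tuning would then target. The group names and ratings are invented for illustration.

```python
# Hypothetical sketch of a subgroup performance audit for fairness checks:
# compare mean empathy ratings of model responses across demographic groups.
from collections import defaultdict
from statistics import mean

# (demographic_group, empathy_rating) pairs, e.g., from expert raters on a 1-5 scale.
ratings = [
    ("adolescents", 4.2), ("adolescents", 3.8),
    ("older_adults", 3.1), ("older_adults", 3.4),
]

by_group = defaultdict(list)
for group, score in ratings:
    by_group[group].append(score)

for group, scores in by_group.items():
    print(f"{group}: mean empathy rating = {mean(scores):.2f}")

# A large gap between subgroup means would flag a potential demographic
# disparity that demographic-aware prompting or fine-tuning could target.
```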

We would like to acknowledge the limitations of the evidence in this review, which are primarily rooted in the absence of standardized evaluation criteria across studies, resulting in challenges for the comparison and synthesis of findings. Many studies depend on ad-hoc scales that are not well established and lack thorough clinical validation, which undermines the robustness and generalizability of their conclusions. Furthermore, the frequent use of proprietary LLMs, such as OpenAI’s GPT series, introduces issues of transparency and reproducibility, as closed-source settings hinder independent verification and limit replicability. The review process also has limitations, as inconsistent reporting practices in the included studies lead to gaps in essential metrics, demographic detail, and evaluation frameworks, all of which are critical for cross-study analysis. Collectively, these factors highlight an urgent need for a unified, rigorous framework to assess and validate LLM applications in mental health systematically. Addressing these gaps through standardization will be essential for improving the reliability of findings and ensuring that LLMs contribute meaningfully and safely to mental health care.

Methods

We adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines44 to ensure a transparent and reproducible search process (Fig. 2). Our search included four databases: APA PsycNet, Scopus, PubMed, and Web of Science. To ensure comprehensiveness, we employed a combination of generative AI keywords and LLM keywords and used the shortest matching strings (e.g., “psychiatr”) to capture all lexical variations. Our search query was as follows, with different variations used across database platforms (detailed in Supplementary Table 1):

(“generative artificial intelligence” OR “large language models” OR “generative model” OR “chatbot”) AND (“mental” OR “psychiatr” OR “psycho” OR “emotional support”)

Fig. 2: The PRISMA figure of the search and screening process.

The PRISMA diagram shows the systematic process of study selection. Of 1204 articles initially identified across databases, 726 unique records remained after duplicates were removed. Further screening yielded 16 articles meeting the inclusion criteria.

We conducted the search in the title or abstract of articles, covering the period from January 1, 2020, to July 19, 2024, without language restrictions. The search results included 259 articles from PubMed, 444 articles from Scopus, 1 article from APA PsycNet (PsycInfo and PsycArticles), and 500 articles from Web of Science. The initial search yielded 1204 articles, with 14 additional articles identified from sources such as Google Scholar, the ACM Digital Library, and reverse referencing. After removing 492 duplicates, we were left with a total of 726 unique articles.

We applied the following inclusion criteria to select studies for our review: first, the study must involve using an LLM to generate responses (a generative task); second, the study must focus specifically on mental health care, distinguishing it from studies in related fields like psycholinguistics; third, the study must include human validation rather than relying purely on automated evaluation. In this study, an LLM is defined as “transformer-based models with more than ten billion parameters, which are trained on massive text data and excel at a variety of complex generation tasks,” following a highly cited review from the NLP community45. We excluded reviews, meta-analyses, and clinical trials from our selection. We then removed seven studies that did not meet our inclusion criteria upon full-text review. The final analysis includes 16 articles, comprising 15 full-length papers and one brief communication.

Data extraction was conducted by one or two authors for each section, with a second author independently reviewing for accuracy. For mental health conditions, data were extracted to categorize disorders, symptoms, care settings, interventions, assessments, and diagnostic sources, with a distinction made between clinically validated disorders and general mental health constructs. For applications and model details, we extracted data on input/output modalities, model types, embodiment, open-source availability, and target user populations. Regarding tasks and clinical effectiveness, we collected data on the primary tasks involving LLMs, sample sizes, demographic characteristics, and methods of clinical validation. Evaluation methods were categorized, with constructs mapped to a hierarchical evaluation framework, producing a harmonized pyramid to systematically assess LLMs across various levels of evidence. Further details on the screening process, data extraction, and synthesis are provided in Supplementary Note 1.