Introduction

Emotional disorders are among the most prevalent mental health conditions worldwide1,2, accounting for at least $6.5 trillion in healthcare-related costs globally3. Among these conditions, anxiety and depressive disorders are particularly common4,5,6, with prevalence rates of approximately 4.8% and 3.2%, respectively, in the general population7. While timely screening of emotional disorders, followed by appropriate treatment, is crucial for individuals’ well-being8, this process can be time-consuming and labour-intensive, often requiring comprehensive interviews, collection of history and background information, and substantial effort from healthcare professionals9,10.

For effective screening, it is important to understand the specific clinical characteristics associated with these disorders. Anxiety and depressive disorders are characterized by distinct diagnostic criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5)10. For example, depression or Major Depressive Disorder (MDD) is characterized by persistent depressed mood, diminished interest in activities, significant weight changes, sleep disturbances, and other symptoms, while anxiety or Generalized Anxiety Disorder (GAD) involves excessive anxiety and worry, accompanied by restlessness, fatigue, difficulty concentrating, and other physical symptoms. However, assessing such information at scale remains a challenge, particularly in settings with limited clinical resources11,12, highlighting the need for innovative approaches to support efficient screening13.

To address this challenge, we propose utilizing Large Language Models (LLMs) to facilitate the screening of anxiety and depressive disorders. Our approach aims to enable automated assessment of clinical symptoms while maintaining the structured nature of professional interviews. LLMs are deep learning systems trained on massive collections of text to predict words in a sequence14. There has been growing interest in applying LLMs to mental health-related problems15,16,17,18. Existing studies have explored the potential of using textual information to identify psychiatric symptoms and life events19, detect depression based on post histories on social media20, and provide explainable screening for emotional disorders21,22,23,24. However, most existing studies focus on data from online platforms (i.e., social media) for screening emotional disorders. While these studies show the potential of utilizing LLMs for large-scale screening, they may not be applicable in clinical contexts because such data lack the detailed backgrounds, symptoms, and diagnoses found in real clinical cases.

Brief clinical interviews could be a more reliable method for screening compared to using data from online platforms, since they provide an opportunity to gather more relevant information25. Previous studies have leveraged semi-structured clinical interviews or questions from scales to obtain clients’ information, which were further used for identifying emotional disorders26,27. Recognizing the effectiveness of these methods, efforts have been made to develop LLMs incorporating real-life clinical interviews and expert diagnoses28. Although the study of utilizing clinical interviews for developing LLMs has shown potential, the process of conducting interviews for data collection is highly time-consuming and expensive, making it less feasible for training large-scale models. Moreover, the sensitive nature of human studies and stringent privacy regulations often limit access to real clinical data for training LLMs29. To address this issue, we propose a short-term solution involving the use of scalable data synthesis techniques to generate realistic clinical scenarios while preserving participant confidentiality. However, the challenge remains in generating sufficiently large and diverse interviewing data that accurately capture the necessary information from clients with emotional disorders.

In light of the aforementioned challenges, we proposed a data-generative pipeline that synthesizes clinical interviews on emotional disorders, thereby facilitating the development of clinical LLMs. This pipeline was built following the comprehensive guidelines of psychiatric interviews. Utilizing this pipeline, we generated PsyInterview, a dataset derived from published materials, which comprises multi-turn interviews, corresponding screenings, and accompanying explanations. By automating the generation of clinical interview data, our approach could address the scarcity and sensitivity of real-world clinical data, enabling the training of LLMs on a larger and more diverse set of scenarios.

This study investigates the potential of using synthesized data (i.e., the PsyInterview dataset) to boost Large Language Models (LLMs) for emotional disorder screening. To the best of our knowledge, this is the first study to generate structured clinical interviews for emotional disorders and integrate both a screening and an interview-assisting agent into a single system. Utilizing the PsyInterview dataset, we develop an LLM agent specifically designed for screening emotional disorders from clinical interviews. It distinguishes between coarse categories such as anxiety disorders and depressive disorders, and also identifies fine-grained conditions such as Major Depressive Disorder and Generalized Anxiety Disorder. Moreover, the agent generates explanations for its screening results to improve interpretability. In addition, PsyInterview supports the development of an interviewing agent that simulates the initial stages of clinical interviews. Together, these form two key components: a screening agent that provides reasoning for its diagnostic suggestions, and an interviewing agent designed to assist with initial patient assessments. We evaluate the two agents through both automated metrics and human experts’ evaluation. By integrating the screening and interviewing agents, EmoScan provides a unified framework for the initial assessment of emotional disorders. Overall, this study presents a generative pipeline for synthesizing clinically relevant interview data and demonstrates the potential of EmoScan in supporting emotional disorder screening.

Methods

The overview of the methods section can be found in Fig. 1. This study was approved by the Human Research Ethics Committee of the University of Hong Kong (approval number: EA240276).

Fig. 1: Overview of the Study.

a Our proposed generative pipeline and model training process. The pipeline transforms various formats of case descriptions/information into clinical interviews. We recruited licensed psychiatrists and clinical psychologists to evaluate the quality of the generated interviews. We then used the generated data to train EmoScan, which consists of two agents designed to screen for emotional disorders and provide relevant explanations, and to conduct brief clinical interviews, respectively. b The screening agent screens for emotional disorders based on the conversational history. We evaluated its screening performance on the testing dataset and its generalization on an external dataset (D4). c The interviewing agent is designed to communicate effectively with users. We evaluated its interviewing performance by conducting pairwise comparisons with other LLMs. Briefly, we instructed GPT-4 to act as a client and chat with the studied LLMs (EmoScan, Mistral, Llama3, and GPT-4). Subsequently, another GPT-4 rater and human experts separately reviewed the conversation history to assess the interviewing skills of these LLMs. The icons used in the creation of Fig. 1 were sourced from ICONS8.

Data preparation

To acquire high-quality data for training effective LLMs in clinical settings, we first developed a four-stage data-generative pipeline that transforms various forms of case descriptions or information into refined psychiatrist-client dialogues. We then gathered 1,157 cases involving emotional and other psychiatric disorders and converted them into interactive interviews using this pipeline. To ensure the quality of the generated interview data, we recruited three experts to evaluate the generated dialogues. Finally, we present the resulting dataset, PsyInterview. Details of the data preparation are given below.

Data generative pipeline

We have developed a data-generative pipeline designed to transform varying formats of case descriptions or information with reliable labels into polished psychiatrist-client dialogues. These refined dialogues can later be used to train models or conduct evaluations.

The pipeline comprises four main stages (Fig. 2). The first stage involves gathering detailed client information or descriptions. These data can be extracted from casebooks and clinical notes describing the client’s experiences or stories, as well as from scientific literature and databases containing dialogue resources. Once the raw data are compiled, the second stage involves extracting key components such as the client’s complaints, medical history, history of drug or alcohol abuse, etc. To achieve this, we adapted a standardized template30 conventionally employed for psychiatric evaluation. This approach ensures a thorough extraction of the client’s complete background. The details of the template can be found in Supplementary Information: Information Extraction Template. The third stage converts the extracted data into a raw conversation following a topic flow based on Morrison’s31 guidebook for psychiatric interviews. Briefly, the psychiatrist first asks about the client’s identification and complaints, then collects the medical and psychiatric histories and family history, and finally the personal and social history. The final stage polishes the raw conversation; for example, removing sensitive personal information (e.g., name and location) and deleting duplicate content. The polishing rules were modified from those in Wang et al.’s32 study on synthetic clinical interviewing conversations. Prompts for each of these steps are provided in Supplementary Information: Polishing Rules.
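The four stages above can be sketched as a simple composition of functions. This is an illustrative skeleton only: `call_llm` is a hypothetical stand-in for whatever LLM API drives the pipeline, the prompts are placeholders rather than the actual prompts from the Supplementary Information, and the polishing step shows just two possible rules (redacting tagged identifiers and removing duplicate lines).

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; here it simply echoes the prompt."""
    return prompt

def extract_information(case_description: str) -> str:
    # Stage 2: extract complaints, histories, etc. following the evaluation template.
    return call_llm(f"Extract the client's background using the template:\n{case_description}")

def generate_raw_interview(extracted: str) -> str:
    # Stage 3: convert the extracted profile into a psychiatrist-client dialogue
    # following the topic flow (identification -> histories -> social history).
    return call_llm(f"Write a multi-turn interview covering:\n{extracted}")

def polish(dialogue: str) -> str:
    # Stage 4: remove sensitive identifiers and delete duplicate lines.
    dialogue = re.sub(r"\[(?:NAME|LOCATION)\]", "[REDACTED]", dialogue)
    kept, seen = [], set()
    for line in dialogue.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)

def pipeline(case_description: str) -> str:
    # Stage 1 (collection) happens upstream; the collected description enters here.
    return polish(generate_raw_interview(extract_information(case_description)))
```

In the actual pipeline each stage is driven by its own prompt; the composition shown here only conveys how a case description flows through the four stages.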

Fig. 2: The Interview Generative Pipeline.

a Collect case descriptions from clinical casebooks, clinical notes, scientific literature, and other related sources. b Extract information from the case description following a screening template. c Generate raw interviews from the extracted information. The conversations should follow an interviewing flow. d Polish the generated conversations to remove private information and prevent unexpected information leakage. The icons used in the creation of Fig. 2 were sourced from ICONS8.

Data Source

To train a model capable of identifying emotional disorders in a diverse population, we collected cases encompassing emotional disorders; other mental disorders of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), excluding depressive and anxiety disorders (e.g., Schizophrenia Spectrum and Other Psychotic Disorders); and healthy controls. Case descriptions were collected from a wide array of sources, including clinical casebooks, research papers, and an open-source dataset33, which collectively provided a diverse range of 1,157 cases. Of these, 144 cases involved emotional disorders, including depressive and anxiety disorders, and 269 involved other mental disorders such as schizophrenia spectrum and other psychotic disorders. Given that mental disorders often exhibit shared clinical symptoms, including cases of other mental disorders could enhance the LLMs’ ability to effectively characterize individuals with emotional disorders. By incorporating a broader range of cases, the LLMs can learn to distinguish between different mental disorders. The screening results and explanations for these cases adhere to the diagnostic criteria outlined in the DSM-5. A detailed list of sources is provided in Supplementary Information: Detailed list of sources. The process by which explanations were generated can be found in Supplementary Information: Process of explanation generation.

The remaining 744 healthy control cases were sourced from the PESC dataset33 (https://github.com/chengjl19/PAL). The PESC dataset was adapted from the ESConv dataset34, which was created by recruiting crowd-workers and instructing them to engage in conversations while acting as help-seekers and supporters, with the goal of producing natural dialogues that closely mimic real-life situations. ESConv was primarily developed to provide emotional support for non-clinical individuals facing common emotional concerns. Building on ESConv, PESC extracted each case’s persona, which provided additional important information (i.e., demographic information, social status, personality, etc.) about the case’s identity. These personas enrich the clients’ profiles, thereby enhancing the informativeness of the responses within emotional support conversations. Accordingly, we adapted the extracted personas and conversations to serve as case information for healthy controls.

We then used the interview generative pipeline to transform the 1,157 case descriptions into corresponding interactive interviews. The sample distribution of the cases can be found in Fig. 3a. On average, each conversation consists of 14 utterances from either the psychiatrist or the client, with each utterance averaging 24 words. The detailed statistics of the synthetic interviews are shown in Fig. 3b. All data used for evaluation were sourced exclusively from publicly available, published materials, rather than directly from patients. Before being processed by the LLMs, all data were rigorously de-identified to ensure the removal of any personal identifiers. This process included a manual review of both input and output data to confirm that no personally identifiable information was used or inadvertently generated. As the study utilized fully anonymized, secondary data from the public domain, direct patient consent was not applicable. The transfer of this non-identifiable data for evaluation purposes was conducted in full compliance with privacy regulations and was within the scope of our institutional ethics approval.

Fig. 3: Data distribution and quality evaluation.

a Disorder distribution of the PsyInterview. The green bar represents healthy controls, the two pink bars correspond to emotional disorders, and the remaining blue bars indicate other disorders. b Statistics of the PsyInterview, including the number of dialogues, average dialogue length, and average utterance length in terms of both word and token. c Data quality-check results, comprising average scores for dialogue quality and interviewing skill. The green line denotes the threshold line.

Data quality-check

To ensure the authenticity of the generated conversations in clinical settings, we recruited three experts to evaluate the quality of the generated dialogues. One of them is a psychiatrist, and two are clinical psychologists, each with over 10 years of clinical experience. The two clinical psychologists have expertise in different therapeutic approaches: one specializes in Cognitive Behavioural Therapy (CBT), while the other has training in Parent-Child Interaction Therapy (PCIT), Dialectical Behaviour Therapy (DBT), and Acceptance and Commitment Therapy (ACT). The psychiatrist focuses on early psychosis intervention and cognitive impairments in schizophrenia. We randomly selected 50 cases from the training dataset, with approximately one-third of these cases representing depressive disorders, anxiety disorders, and healthy controls each. Each case was rated by two experts based on (1) information alignment between case description/information, dialogue, and explanations; (2) naturalness and consistency of the dialogues; and (3) logicality and compliance of explanations. The experts were asked to rate each item on a 5-point Likert scale (1 = very misaligned/not natural at all/etc., 5 = very aligned/very natural/etc.). The evaluation criteria and scales are adapted from established principles and practices that have been used in previous studies evaluating psychiatric interviews and empathic dialogues35,36,37.

To further assess the reliability of our data, we adopted an interview skill assessment developed by Morrison31. Since the original assessment was designed for extended interviews and our project focused on relatively short dialogues (eight rounds), we selected dimensions aligned with our context: history, ending the interview, establishing rapport, and use of interviewing techniques. History covers the psychiatrist’s enquiry about medical history, family history of mental disorder, history of present illness, and personal and social history. Ending the interview involves giving a warning that the interview is concluding and expressing interest and appreciation at the end. Establishing rapport focuses on building a relationship with the client, while use of interviewing techniques refers to the psychiatrist’s approach to gathering client information. The history and ending-the-interview dimensions are particularly important for our interviewing agent, as obtaining an accurate history is crucial for achieving a reliable screening result38, and employing proper ending tactics keeps the conversation focused and prevents it from dragging on indefinitely. Meanwhile, the two supplementary dimensions, establishing rapport and use of interviewing techniques, which are more suited to longer interviews, ensure that the interviewing style of our generated corpus is acceptable. We collected responses from the three experts and computed an average score for each dimension. If the average score of a dimension exceeds 50% of the maximum, the responses for that dimension are considered generally positive and acceptable. As shown in Fig. 3c, our data received positive responses across all dimensions.
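The acceptance rule for the quality-check scores reduces to a one-line computation. This is a sketch; the function name and the 5-point maximum are ours for illustration:

```python
def dimension_passes(scores, max_score=5):
    """Average expert ratings for one dimension and compare against 50% of the maximum."""
    avg = sum(scores) / len(scores)
    return avg, avg > 0.5 * max_score

# e.g., three experts rating one dimension on a 5-point scale
avg, ok = dimension_passes([4, 5, 3])  # average 4.0, above the 2.5 threshold
```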

Model training and evaluation

Using the generated interview data from PsyInterview, we trained a system called EmoScan and evaluated its effectiveness in screening for emotional disorders. We also assessed its performance in conducting interviewing tasks with GPT-4 acting as the client. In this section, we outline the training procedures for EmoScan’s screening and interviewing agents, as well as the evaluation methods used to compare EmoScan’s performance with baselines.

Training

We developed a system that consists of a screening agent and an interviewing agent. The screening agent was a fine-tuned Mistral-7B39 trained on conversational histories and screening outputs (i.e., a combination of screening results and explanations), while the interviewing agent was fine-tuned from the same base model on the multi-turn conversational data synthesized by the generative pipeline; Mistral-7B and related models excel in diverse benchmarks39 and clinical tasks40,41. Based on the conversational history, the screening agent provides a result with an explanation describing why the client received that result according to the DSM-5, and the interviewing agent is capable of conducting psychiatric interviews with clients. Training details can be found in Supplementary Information: Model training details.

Evaluation

We compared EmoScan with recent widely used LLMs: OpenAI’s GPT-4 (gpt-4-0613)42, Llama 3 (Meta-Llama-3-70B), and Mistral-7B39. These three LLMs served as baselines in the screening/explainability and interviewing evaluations. GPT-4 was accessed via the OpenAI API, while Llama 3 and Mistral-7B were implemented using the Deep Infra API. To ensure fair comparisons across models, we used consistent hyperparameter settings. All models were run with default sampling parameters: a temperature of 1.0 and a top_p value of 1.0. These settings preserve each model’s original probability distribution and allow for natural variability in responses43,44. We maintained identical experimental conditions across all models, including the same prompt content and formats (for zero-shot, few-shot, and chain-of-thought settings; prompt content can be found in Supplementary Information: Chain of Thought (COT) prompts and Supplementary Information: Prompt structure), shared few-shot examples, and the same context length. Additionally, the same evaluation metrics, such as F1-score and BERTScore, were applied to all model outputs. This controlled setup minimizes potential confounding factors and ensures that observed differences reflect true model capabilities.

Research Question 1: Does the screening agent have the ability to do screening and provide explanations?

We compared the screening performance of EmoScan and the baselines for both coarse- and fine-grained classification of emotional disorders. At the coarse-grained level, we assessed the LLMs’ ability to distinguish individuals with depressive disorders or anxiety disorders from healthy controls. At the fine-grained level, we evaluated the LLMs’ capability to identify specific emotional disorders (e.g., Major Depressive Disorder, Generalized Anxiety Disorder) among positive cases. Classification performance was evaluated using the weighted F1-score45, the harmonic mean of precision and recall computed per class and weighted by each class’s sample size. To determine whether there were significant differences in classification performance among the LLMs, we ran each model on the test set three times and applied independent two-sample t-tests to assess the statistical significance of the performance differences between EmoScan and the baselines. Additionally, we used BERTScore46, ROUGE47, and BLEU48 to measure the quality of explanations generated by the four LLMs, comparing their similarity to the ground-truth explanations.
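As a sketch of this evaluation step (with made-up labels and per-run scores, and assuming scikit-learn and SciPy are available):

```python
from scipy.stats import ttest_ind
from sklearn.metrics import f1_score

# Weighted F1 on one run: each class's F1 is weighted by its support.
y_true = ["depressive", "anxiety", "healthy", "healthy", "anxiety"]
y_pred = ["depressive", "healthy", "healthy", "healthy", "anxiety"]
f1 = f1_score(y_true, y_pred, average="weighted")

# Independent two-sample t-test over the F1 scores from three runs per model
# (illustrative numbers, not the study's actual results).
emoscan_runs = [0.75, 0.74, 0.75]
baseline_runs = [0.55, 0.57, 0.54]
t_stat, p_value = ttest_ind(emoscan_runs, baseline_runs)
```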

To explore the potential of the baselines, we further employed few-shot49 and chain-of-thought (CoT)50 prompting techniques. Few-shot prompting provides LLMs with a few task examples during inference, which can improve performance without changing model weights. In our study, we randomly selected four cases from the training data to serve as few-shot samples. The samples included diverse cases: one Depressive Disorder case, one Anxiety Disorder case, one case with other disorders, and one healthy control case. Each example consisted of the input dialogue and the corresponding structured ground-truth output. CoT prompting offers intermediate reasoning steps for LLMs, enhancing their performance on complex reasoning tasks. Our study adapted an emotional disorder screening guideline from a comprehensive mental health disorder diagnosis handbook51 as the CoT prompt. The detailed CoT prompt can be found in Supplementary Information: Chain of Thought (COT) prompts. Finally, we included a condition with both few-shot and CoT prompts, in which we simply combined the CoT prompt and the four examples, measuring whether the baselines’ performance could be improved with both reasoning steps and samples. All prompting strategies (zero-shot, few-shot, CoT, few-shot + CoT) began with a system prompt instructing the LLM on its core task (screening for emotional disorders based on DSM-5), the required output structure, and the definitions of coarse-grained (Anxiety, Depressive) and fine-grained disorders. Following the few-shot/CoT prompts and input dialogue, a final instruction was added to guide the LLM on how to format its response. The detailed prompting structure can be found in Supplementary Information: Prompt structure.
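The prompting structure described above can be sketched as simple string assembly. Every string below is a hypothetical placeholder, not the actual prompt wording from the Supplementary Information:

```python
def build_prompt(dialogue, cot=None, examples=()):
    """Assemble a screening prompt: system instruction, optional CoT steps,
    optional few-shot examples, the input dialogue, and a format reminder."""
    parts = ["System: You screen for emotional disorders per DSM-5. "
             "Output a coarse label (Anxiety/Depressive/Other/Healthy), "
             "a fine-grained label, and an explanation."]
    if cot:  # chain-of-thought: intermediate reasoning steps
        parts.append(f"Reasoning steps:\n{cot}")
    for ex_dialogue, ex_output in examples:  # few-shot examples
        parts.append(f"Example input:\n{ex_dialogue}\nExample output:\n{ex_output}")
    parts.append(f"Input dialogue:\n{dialogue}")
    parts.append("Respond in the structured format described above.")
    return "\n\n".join(parts)
```

Setting `cot` alone, `examples` alone, or both reproduces the four conditions (zero-shot, CoT, few-shot, few-shot + CoT).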

To test the generalizability of EmoScan, we compared our system with the base model on an external out-of-domain dataset, D452. This dataset was developed to screen for depression in simulated conversations between two crowdsource workers. Since the dataset was originally in Chinese, we first translated it to English using the Youdao API and manually checked the translations of the testing dataset, and then compared the screening performance of our model and the base model Mistral-7B. We specifically chose to compare with Mistral-7B because it serves as EmoScan’s foundation model, allowing us to directly evaluate how our generated dataset enhances cross-domain performance while controlling for model architecture.

RQ2: Does the interviewing agent have acceptable interviewing performance?

Obtaining essential key information about the client during an interview is crucial for achieving accurate screening results. To evaluate the interviewing performance of different models, we applied two fundamental dimensions in the interviewing assessment rated by experts: history and ending the interview. We compared EmoScan with baselines on these two essential dimensions.

To simulate the interviewing process between clients and psychiatrists, we first built a patient simulator using Autogen53 to interact with all four LLMs (i.e., EmoScan, GPT-4, Llama 3, Mistral-7B). During this process, we instructed GPT-4 to act as a client, responding to the psychiatrist’s questions, while providing it with the relevant patient information. Simultaneously, we assigned one of the four LLMs to act as the psychiatrist, responsible for asking questions. The interviewing dialogues were recorded for subsequent evaluation.

Using LLMs as raters is an effective evaluation method that reduces expensive and time-consuming human labour. Previous studies have applied GPT-4 to evaluate conversational responses, demonstrating a strong correlation with human raters54,55. Therefore, in this study, we also used a separate GPT-4 agent to act as a judge, assessing interviewing performance through a comparative analysis of the interviewing dialogues generated by our model versus those produced by the other three models. For each evaluation, raters were presented with two conversations, one generated by EmoScan’s interviewing agent and one by the other model, in which the LLMs acted as psychiatrists and talked with the same simulated client. Model names were masked, and the order of the two conversations was randomized each time to avoid bias. The rater then voted for one of them on each of the two dimensions. Finally, we calculated the winning rate of the four models. To ensure the reliability of GPT-4’s ratings, we randomly selected 90 conversation pairs and recruited six human experts with backgrounds in psychology to rate these samples based on the same guideline provided to GPT-4. We conducted Chi-Square tests to compare the ratings provided by GPT-4 with those given by the human experts.
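Computing the winning rate from the pairwise votes is straightforward; a sketch with illustrative vote labels (the label strings are ours, not the study's):

```python
from collections import Counter

def winning_rates(votes):
    """votes: list of 'emoscan', 'baseline', or 'tie' outcomes for one model pairing."""
    counts = Counter(votes)
    n = len(votes)
    return {outcome: counts[outcome] / n for outcome in ("emoscan", "baseline", "tie")}

# four illustrative pairwise judgements
rates = winning_rates(["emoscan", "emoscan", "tie", "baseline"])
```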

Statistics and reproducibility

Statistical analyses were conducted using Python 3.9 (scipy.stats for hypothesis testing and sklearn.metrics for performance metrics) and SPSS 26.0. For comparing screening performance between EmoScan and baseline models, independent two-sample t-tests were applied to F1-score results across three runs of each model on the test set, with significance determined at p < 0.05. Chi-Square tests (using scipy.stats.chi2_contingency) were performed to assess agreement between GPT-4 and human expert ratings of interviewing performance, with degrees of freedom = 2 and the significance threshold set at p < 0.05.
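A minimal sketch of this Chi-Square test, using `scipy.stats.chi2_contingency` on an illustrative (not actual) 2 × 3 contingency table of vote counts; note that the degrees of freedom come out to (2 − 1) × (3 − 1) = 2, matching the reported df:

```python
from scipy.stats import chi2_contingency

# Rows: rater (GPT-4, human experts); columns: outcome counts
# (EmoScan win, baseline win, tie). Counts below are made up for illustration.
table = [[50, 25, 15],
         [48, 27, 15]]
chi2, p, dof, expected = chi2_contingency(table)
```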

Performance metrics including weighted F1-score, recall, precision, BERTScore, ROUGE, and BLEU were calculated using standard implementations (e.g., sklearn for F1-score, transformers library for BERTScore). Sample sizes for evaluations were as follows: 129 dialogues in the internal test set, 90 conversation pairs for human-GPT-4 rating comparisons, and the external D4 dataset (translated to English) for generalizability testing. All experiments were replicated three times with consistent hyperparameters (temperature = 1.0, top_p = 1.0) to ensure reliability.

Reproducibility is supported by public access to the dataset, code, and source data for key figures via the GitHub repository (https://github.com/Junemengyuan/EmoScan). This includes scripts for data generation, model training, and evaluation metrics calculation, enabling exact replication of reported results.

Results

Evaluation of EmoScan screening agent

Table 1 summarizes the screening performance comparison between EmoScan and the baselines based on the weighted F1-score45. As demonstrated, EmoScan significantly outperformed all baselines under zero-shot, few-shot, and chain-of-thought prompting (F1 = 0.7467; t ∈ [6.7143, 25.4563], p ∈ [1.4 × 10⁻⁵, 2.562 × 10⁻³] in independent-sample t-tests), with improvements in the classification of both depressive (F1 = 0.6333) and anxiety disorders (F1 = 0.8567). This result may be attributed to EmoScan’s more cautious approach in identifying cases as positive (i.e., with emotional disorders) compared to the baselines, as its precision was much higher for both depressive and anxiety disorders. To further illustrate the models’ performance, we present a summary of true positive, false positive, true negative, and false negative cases for each model, showing that EmoScan produced the highest number of correct classifications overall. Details can be found in Supplementary Table 1. Meanwhile, although the introduction of fine-grained categories may have increased the difficulty of the second classification task, EmoScan’s performance (F1 = 0.2567) still exhibited considerable improvements over the base model (highest F1 = 0.0467), showing the efficacy of PsyInterview.

Table 1 Screening results comparison

When screening for emotional disorders, EmoScan provides an explanation for each screening result, which can help psychiatrists understand the underlying logic of the screening outputs generated by the LLMs. To assess the effectiveness of these explanations, we utilized multiple metrics: ROUGE47, BLEU48, and BERTScore46. ROUGE assesses the overlap between the generated and reference texts (i.e., ground-truth explanations); BLEU evaluates textual similarity based on n-gram overlap; and BERTScore, known for capturing semantic similarity more effectively, focuses on contextual meaning. EmoScan demonstrated exceptional performance in BERTScore (0.9408), indicating a high level of semantic alignment. It also achieved high scores in BLEU (0.0660) for matching short sequences and ROUGE-1 (0.3951) for matching single words. These results underscore EmoScan’s capability to produce accurate and contextually appropriate explanations. Although EmoScan’s performance on the ROUGE-2 (0.1132) and ROUGE-L (0.2086) metrics was slightly below that of Llama3, it outperformed all other baselines on these two metrics, showing its proficiency in capturing longer text dependencies. Overall, EmoScan’s performance and explainability remain superior, as demonstrated in Table 2.

Table 2 Screening explanations on cases with emotional disorders
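The study used standard implementations of these metrics; purely to illustrate what unigram overlap measures, a minimal pure-Python sketch of BLEU-1-style precision and ROUGE-1-style recall (omitting BLEU's brevity penalty and ROUGE's stemming) is:

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str):
    """Clipped unigram counts shared by the candidate and reference texts."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / max(sum(cand.values()), 1)   # BLEU-1-style
    recall = overlap / max(sum(ref.values()), 1)       # ROUGE-1-style
    return precision, recall

# four of five candidate words appear among the seven reference words
p, r = unigram_overlap("low mood and poor sleep",
                       "persistent low mood poor sleep appetite loss")
```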

Generalizability of EmoScan screening agent

We compared the performance on the D4 dataset before and after training the model to highlight EmoScan’s generalizability in related tasks. In the original study, psychiatrists categorized D4 cases into two groups: one with a risk of depression and the other without a depression risk. EmoScan (F1 = 0.67) outperformed the base model Mistral-7B (F1 = 0.64) when classifying the two groups.

Although EmoScan slightly outperformed Mistral-7B in the external validation, its performance declined compared to the internal validation, where it achieved an F1-score of 0.92 in classifying depression cases within the depression-related subset of our PsyInterview dataset. This drop may indicate limitations in cross-domain generalization. To better understand why EmoScan did not maintain its performance in the external validation, we conducted several additional analyses. First, we found notable differences in text length between the two datasets: texts in the PsyInterview dataset were shorter on average (mean = 2,646 words; SD = 315), while D4 texts were longer (mean = 3,770 words; SD = 1,077), with greater variability across samples. Second, we analyzed differences in language style using TF-IDF vectorization56 combined with PCA for dimensionality reduction; these methods have been used in previous studies to identify linguistic patterns in text data57. As shown in Supplementary Fig. 1, PsyInterview samples were more widely distributed in the feature space, reflecting greater linguistic diversity, while D4 samples clustered tightly, suggesting a more uniform writing style. These differences likely contributed to EmoScan’s reduced generalization performance on the D4 dataset.
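The style analysis can be sketched with scikit-learn; the four documents here are illustrative stand-ins, not actual transcripts:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # stand-ins for interview transcripts from the two datasets
    "I have felt down and hopeless for weeks",
    "My chest tightens and I worry constantly",
    "Work is fine but I cannot sleep at night",
    "I avoid crowds because panic sets in",
]
# TF-IDF turns each transcript into a term-weight vector...
tfidf = TfidfVectorizer().fit_transform(docs)
# ...and PCA projects those vectors to 2-D for visual comparison.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())
# The spread of each dataset's points in `coords` indicates its linguistic diversity.
```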

Evaluation of EmoScan interviewing agent

EmoScan outperformed Mistral, Llama3, and GPT-4 in most dimensions of interviewing performance, as evident from both GPT-4 and human experts’ rating results (Fig. 4). In each round of evaluation, the rater (either GPT-4 or a human expert) was shown two anonymized conversations, one generated by EmoScan and one by a baseline, without knowing which model produced which. The rater then voted for the conversation showing better interviewing performance, with three possible outcomes: “EmoScan win”, “The other LLM win”, or “Tie”. Ratings were based on the interview skill assessment used in the data quality-check section, which covers six dimensions: the psychiatrist’s enquiry about medical history, family history of mental disorder, history of present illness, and personal and social history; a warning that the interview is ending; and a conclusion expressing interest and appreciation. This methodology allowed us to quantitatively assess the relative strengths of EmoScan compared to commonly used LLMs.

Fig. 4: Ratings on Interviewing performance by GPT-4 and Human experts.

The blue-green section (above) shows the ratings by GPT-4, and the yellow-red section (below) displays the ratings provided by human experts. The association between GPT-4 and human ratings is significant across all six dimensions, with the highest p-value below 0.05 (X2 ranging from 9.6004 to 69.8743, p ranging from 0.0477 to 0.0016) (n = 90 conversation pairs).

To assess the reliability of GPT-4 as an automated rater, we conducted Chi-Square tests between GPT-4 and aggregated human expert ratings. The Chi-Square test assesses whether there is a significant association between two categorical variables—in this case, the evaluation outcomes from GPT-4 and those from human raters. Specifically, we used the chi2_contingency function from the scipy.stats library to calculate the Chi-Square statistic and p-values. A threshold of p < 0.05 was used to determine statistical significance: p-values below this threshold suggest a strong association between GPT-4 and human ratings, supporting GPT-4’s reliability as an evaluator. Human ratings were collected from six independent evaluators. For each evaluation dimension, we constructed contingency tables comparing the frequency of three possible outcomes (EmoScan win, baseline win, tie) between GPT-4 and human raters. The Chi-Square statistic (X2) measured the divergence between GPT-4 and human votes, with degrees of freedom (df=2) reflecting the three-category comparison.
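The Chi-Square statistic itself can be computed directly from the contingency table; the sketch below uses invented vote counts for illustration, while `scipy.stats.chi2_contingency`, as used in the study, additionally returns the p-value:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rater and outcome
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical vote counts over (EmoScan win, baseline win, tie)
votes = [
    [40, 30, 20],   # GPT-4
    [35, 30, 25],   # human experts (aggregated)
]
stat = chi2_statistic(votes)    # df = (2 - 1) * (3 - 1) = 2
```

With df = 2, the statistic is compared against the chi-square distribution to obtain the p-value reported in the study.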

Discussion

The present study introduced a pipeline for synthesizing clinical interviews to screen for emotional disorders, generated an interviewing dataset, PsyInterview, and trained a system, EmoScan, capable of both providing screening results with explanations and conducting interviews. Upon evaluation, EmoScan demonstrated superior performance in screening emotional disorders, provided more robust screening explanations than baselines, conducted effective interviews, and showed greater generalizability than its base model. In a clinical context, EmoScan has the potential to save expert clinicians’ time by effectively communicating with individuals and identifying those with emotional disorders. Additionally, it can provide explanations for its output, enabling experts to comprehend the decision-making logic and gain confidence in the results. However, it is worth noting that while EmoScan shows promise, it still requires further validation. This includes reducing false positives and negatives, testing in clinical settings with oversight from clinicians, and refining the system based on real-world feedback to ensure it is reliable, safe, and effective before it can be used in practice.

Consequently, our pipeline represents a meaningful step forward in developing tools that could support mental health screening. While the results are promising, this work is still at the research stage and not yet ready for clinical use. It is important to note that EmoScan is designed to support clinicians, not replace them, and professional judgement should always guide the final diagnosis. Bringing EmoScan into real-world practice will require several intermediate steps. First, large-scale studies will be needed to evaluate the effectiveness and long-term performance of EmoScan. Second, the system must be retrained and tested on different populations to improve accuracy. In contexts with abundant clinical resources, minimizing false negatives and reducing under-diagnosis is essential to ensure that individuals in need receive care58,59. Conversely, in places with limited resources, reducing false positives becomes more critical to avoid unnecessary referrals and overburdening healthcare systems60. Third, the tool should be tested in real clinics to evaluate its usability. Feedback from both clinicians and patients will be important for refining the system61. Lastly, after multiple stages of refinement and evaluation, EmoScan could be considered for use in clinical settings once it has demonstrated both low false-positive and low false-negative rates in supporting clinical diagnosis.

The performance of EmoScan on interviewing can be largely attributed to the clinical information contained in the PsyInterview dataset, which closely simulates patient histories, and to the structured dialogue format that ensures focused and clinically relevant interactions, drawing from clinical literature that emphasizes structured enquiry to elicit comprehensive and reliable responses from clients62,63. These factors differentiate EmoScan from general-purpose models trained on less specific data. Specifically, the clients’ profiles in PsyInterview, such as their detailed medical history, family history of mental disorders, and personal and social background, enable EmoScan to ask more targeted and clinically relevant questions during interviews. In addition, the structured generation pipeline ensures a consistent and logical flow of conversation, including effective cues for initiating and concluding interviews, which further enhances the quality of the interaction.

Researchers have underscored the potential of AI to complement professional mental health expertise64, while emphasizing a domain-specific scope to avoid risks associated with applying general LLMs to clinical tasks65. Our findings strongly support these recent perspectives on utilizing LLMs for mental health applications. Our results showed that EmoScan achieved significantly higher overall F1 scores than all the baseline general LLMs, though its recall was marginally lower than a few of them. This could be attributed to EmoScan being a cautious system with high precision, especially when screening for anxiety disorders, where general LLMs tended to underperform and incorrectly label many healthy individuals as patients. As a cautious system, EmoScan minimized the likelihood of false diagnoses of emotional disorders, thus preventing unnecessary costs related to further treatment. However, we acknowledge that under-diagnosis can lead to serious consequences, including delayed treatment, worsening of symptoms, and reduced trust in the healthcare system66,67. Therefore, to decrease the risk of under-diagnosis, we suggest that users chat and screen with the models over multiple sessions to capture temporal variation in their symptoms and obtain a more reliable screening outcome. Importantly, EmoScan is designed as a supportive screening tool to assist clinical expertise, not as a replacement. In cases where model outputs are inconsistent across sessions or explanation patterns are unclear, clinicians should step in to provide more professional screening outcomes. In summary, EmoScan holds potential for application in clinical screening, offering valuable support to clinicians in making more efficient diagnoses.

One of our contributions was the development of a scalable pipeline for generating data efficiently. Previous research in this field has often relied on data that either incurred high costs or came from pre-existing datasets annotated using questionnaires. For instance, the depression-related conversation corpus D452, employed as a validation dataset in our study, was created by recruiting individuals to role-play patients or doctors. While the data quality of D4 was good, the associated costs of human role-playing were relatively high. Other notable studies have explored emotional disorders using conversational datasets such as DAIC-WOZ, a multi-modal dataset developed for depression diagnosis27,68. While increased access to real clinical data would be ideal for training robust LLMs, this remains a challenge due to privacy concerns and regulatory requirements69,70. As with DAIC-WOZ, the sample size of such datasets is often relatively small, limiting their application in automated clinical screening. To address this challenge, we believe a dual approach is necessary. In the short term, data synthesis provides a practical solution by generating clinical notes and interview transcripts. While the field works toward solutions for secure and ethical sharing of clinical data71, there is a critical need for practical interim approaches to developing effective mental health AI systems. Our data-generative pipeline offers a workable and scalable solution. The balanced method, incorporating both automation and selective human monitoring, not only ensured high data quality but also reduced the costs and ethical concerns associated with real-life conversations. In the long term, we advocate for the development of secure, community-driven data-sharing platforms to de-identify and share real clinical notes. Together, these strategies are complementary and essential for advancing the development and application of LLMs in mental health care.

One limitation of our study was the small sample size for each fine-grained emotional disorder. While our dataset comprised around 20 different emotional disorders, the number of samples for each fine-grained disorder was limited by the constraints of available sources. The limited training sample size for each disorder posed a challenge for many LLMs, including EmoScan, as it prevented them from learning underlying patterns, ultimately resulting in lower accuracy when identifying each fine-grained disorder. To improve model performance, future studies could consider enlarging the sample size for each fine-grained disorder. Anxiety frequently co-occurs with depression72,73; however, our current dataset contains only four cases exhibiting both conditions. This is largely due to the use of textbook-style cases that focus on prototypical symptoms of single disorders, which limits the system’s applicability to populations with comorbid presentations. Future work should address this gap by incorporating more cases involving comorbidity, which would improve the model’s generalizability and relevance to real-world clinical settings. Additionally, future researchers may integrate multi-modal information to train LLMs. For example, previous research has underscored the benefits of incorporating acoustic speech information within LLM frameworks for depression detection27,74,75. Our study focused on textual data for easier implementation, yet incorporating multi-modal inputs could help researchers enhance screening accuracy. Moreover, while EmoScan currently prioritizes precision to minimize false positives, this approach may reduce its ability to detect all true cases, potentially increasing the risk of under-diagnosis.
Future versions of EmoScan could incorporate hybrid models to dynamically balance recall and precision based on population risk profiles (e.g., prioritizing higher recall for high-risk demographic subgroups) to reduce the risk of under-diagnosis and misdiagnosis, thereby enhancing patient safety and trust. Additionally, collaborating with global clinics to evaluate model performance across diverse demographic groups will be crucial for bias detection and mitigation. We also recognize that transparency and accountability are essential ethical considerations, and future work must prioritize enhancing these aspects to ensure that clinicians and patients can clearly understand and trust the system’s outputs. Furthermore, bias detection was not performed in this study due to the lack of demographic information in the training data. Future research needs to evaluate the efficacy of our pipeline across diverse population subgroups to support a more equitable and precise application of EmoScan.
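One simple form such a recall-precision balancing strategy could take is subgroup-specific decision thresholds applied to the model’s output score; the thresholds and group names below are purely hypothetical, not part of EmoScan:

```python
def screen(prob_disorder: float, risk_group: str) -> bool:
    """Flag a case as positive using a subgroup-specific decision threshold.

    Illustrative thresholds only: a lower threshold for high-risk groups
    trades precision for recall, reducing the chance of under-diagnosis.
    """
    thresholds = {"high_risk": 0.3, "general": 0.5, "low_prevalence": 0.7}
    return prob_disorder >= thresholds[risk_group]

# The same model score can lead to different decisions by subgroup
flag_high = screen(0.4, "high_risk")    # flagged: recall prioritized
flag_gen = screen(0.4, "general")       # not flagged: precision prioritized
```

In practice, such thresholds would need to be calibrated on held-out data for each subgroup and audited for fairness rather than fixed by hand.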

Conclusion

Our study introduced a pipeline for effective screening and interviewing of emotional disorders using Large Language Models (LLMs). Using our pipeline, we synthesized data from existing clinical interviews and created the PsyInterview dataset. Subsequently, we developed EmoScan, an LLM-based system trained on our collected dataset for screening emotional disorders with explanations and conducting interviews to collect clinical information. EmoScan demonstrated superior accuracy, robust explanations, and strong interviewing skills, significantly outperforming baseline models. Our pipeline and models hold considerable potential for efficient and effective clinical screening and interviewing, providing mental health professionals with a valuable tool to support their work.