Introduction

Emotional disorders are among the most prevalent mental health conditions worldwide1,2, accounting for at least $6.5 trillion in healthcare-related costs globally3. Among these conditions, anxiety and depressive disorders are particularly common4,5,6, with prevalence rates of approximately 4.8% and 3.2%, respectively, in the general population7. While timely screening of emotional disorders, followed by appropriate treatment, is crucial for individuals’ well-being8, this process can be time-consuming and labour-intensive, often requiring comprehensive interviews, collection of history and background information, and substantial effort from healthcare professionals9,10.

For effective screening, it is important to understand the specific clinical characteristics associated with these disorders. Anxiety and depressive disorders are characterized by distinct diagnostic criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5)10. For example, depression or Major Depressive Disorder (MDD) is characterized by persistent depressed mood, diminished interest in activities, significant weight changes, sleep disturbances, and other symptoms, while anxiety or Generalized Anxiety Disorder (GAD) involves excessive anxiety and worry, accompanied by restlessness, fatigue, difficulty concentrating, and other physical symptoms. However, assessing such information at scale remains a challenge, particularly in settings with limited clinical resources11,12, highlighting the need for innovative approaches to support efficient screening13.

To address this challenge, we propose utilizing Large Language Models (LLMs) to facilitate the screening of anxiety and depressive disorders. Our approach aims to enable automated assessment of clinical symptoms while maintaining the structured nature of professional interviews. LLMs are deep learning systems trained on massive collections of text to predict words in a sequence14. There has been growing interest in applying LLMs to mental health-related problems15,16,17,18. Existing studies have explored the potential of using textual information to identify psychiatric symptoms and life events19, detect depression based on post histories on social media20, and provide explainable screening for emotional disorders21,22,23,24. However, most existing studies focus on data from online platforms (i.e., social media) for screening emotional disorders. While these studies show the potential of utilizing LLMs for large-scale screening, they may not be applicable in clinical contexts because such data lack the detailed backgrounds, symptoms, and diagnoses found in real clinical cases.

Brief clinical interviews could be a more reliable method for screening compared to using data from online platforms, since they provide an opportunity to gather more relevant information25. Previous studies have leveraged semi-structured clinical interviews or questions from scales to obtain clients’ information, which were further used for identifying emotional disorders26,27. Recognizing the effectiveness of these methods, efforts have been made to develop LLMs incorporating real-life clinical interviews and expert diagnoses28. Although the study of utilizing clinical interviews for developing LLMs has shown potential, the process of conducting interviews for data collection is highly time-consuming and expensive, making it less feasible for training large-scale models. Moreover, the sensitive nature of human studies and stringent privacy regulations often limit access to real clinical data for training LLMs29. To address this issue, we propose a short-term solution involving the use of scalable data synthesis techniques to generate realistic clinical scenarios while preserving participant confidentiality. However, the challenge remains in generating sufficiently large and diverse interviewing data that accurately capture the necessary information from clients with emotional disorders.

In light of the aforementioned challenges, we proposed a data-generative pipeline that synthesizes clinical interviews on emotional disorders, thereby facilitating the development of clinical LLMs. This pipeline was built following the comprehensive guidelines of psychiatric interviews. Utilizing this pipeline, we generated PsyInterview, a dataset derived from published materials, which comprises multi-turn interviews, corresponding screenings, and accompanying explanations. By automating the generation of clinical interview data, our approach could address the scarcity and sensitivity of real-world clinical data, enabling the training of LLMs on a larger and more diverse set of scenarios.

This study investigates the potential of using synthesized data (i.e., the PsyInterview dataset) to boost Large Language Models (LLMs) for emotional disorder screening. To the best of our knowledge, this is the first study to generate structured clinical interviews for emotional disorders and integrate both a screening and an interview-assisting agent into a single system. Utilizing the PsyInterview dataset, we develop an LLM agent specifically designed for screening emotional disorders from clinical interviews. It distinguishes between coarse categories such as anxiety disorders and depressive disorders, and also identifies fine-grained conditions such as Major Depressive Disorder and Generalized Anxiety Disorder. Moreover, the agent generates explanations for its screening results to improve interpretability. In addition, PsyInterview supports the development of an interviewing agent that simulates the initial stages of clinical interviews. Together, these form two key components: a screening agent that provides reasoning for its diagnostic suggestions, and an interviewing agent designed to assist with initial patient assessments. We evaluate the two agents through both automated metrics and human experts’ evaluation. By integrating the screening and interviewing agents, EmoScan provides a unified framework for the initial assessment of emotional disorders. Overall, this study presents a generative pipeline for synthesizing clinically relevant interview data and demonstrates the potential of EmoScan in supporting emotional disorder screening.

Methods

The overview of the methods section can be found in Fig. 1. This study was approved by the Human Research Ethics Committee of the University of Hong Kong (approval number: EA240276).

Fig. 1: Overview of the Study.

a Our proposed generative pipeline and model training process. The pipeline transforms various formats of case descriptions/information into clinical interviews. We recruited licensed psychiatrists and clinical psychologists to evaluate the quality of the generated interviews. We then used the generated data to train EmoScan, which consists of two agents designed to screen for emotional disorders and provide relevant explanations, and to conduct brief clinical interviews, respectively. b The screening agent screens for emotional disorders based on the conversational history. We evaluated its screening performance on the testing dataset and its generalization on an external dataset (D4). c The interviewing agent is designed to communicate effectively with users. We evaluated its interviewing performance by conducting pairwise comparisons with other LLMs. Briefly, we instructed GPT-4 to act as a client and chat with the studied LLMs (EmoScan, Mistral, Llama3, and GPT-4). Subsequently, another GPT-4 rater and human experts separately reviewed the conversation history to assess the interviewing skills of these LLMs. The icons used in the creation of Fig. 1 were sourced from ICONS8.

Data preparation

To acquire high-quality data for training effective LLMs in clinical settings, we first developed a four-stage data-generative pipeline that transforms various forms of case descriptions or information into refined psychiatrist-client dialogues. We then gathered 1,157 cases involving emotional and other psychiatric disorders and converted them into interactive interviews using this pipeline. To ensure the quality of the generated interview data, we recruited three experts to evaluate the generated dialogues. Finally, we present the resulting dataset, PsyInterview. Details of the data preparation are given below.

Data generative pipeline

We have developed a data-generative pipeline designed to transform varying formats of case descriptions or information with reliable labels into polished psychiatrist-client dialogues. These refined dialogues can later be used to train models or conduct evaluations.

The pipeline comprises four main stages (Fig. 2). The first stage involves gathering detailed client information or descriptions. These data can be extracted from casebooks and clinical notes describing the client’s experiences or stories, as well as from scientific literature and databases containing dialogue resources. Once the raw data are compiled, the second stage involves extracting key components such as the client’s complaints, medical history, history of drug or alcohol abuse, etc. To achieve this, we adapted a standardized template30 conventionally employed for psychiatric evaluation. This approach ensures a thorough extraction of the client’s complete background. The details of the template can be found in Supplementary Information: Information Extraction Template. The third stage converts the extracted data into a raw conversation following a topic flow based on Morrison’s31 guidebook for psychiatric interviews. Briefly, the psychiatrist first asks about the client’s identification and complaints, then collects the medical and psychiatric histories and family history, and finally the personal and social history. The final stage polishes the raw conversation; for example, removing sensitive personal information (e.g., name and location) and deleting duplicate content. The polishing rules were modified from those in Wang et al.’s32 study on synthetic clinical interviewing conversations. Prompts for each of these steps are provided in Supplementary Information: Polishing Rules.
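The four stages above can be sketched as a simple composition of functions. This is an illustrative skeleton only: `call_llm` is a hypothetical stand-in for whatever LLM API drives the pipeline, the prompts are placeholders rather than the actual prompts from the Supplementary Information, and the polishing step shows just two possible rules (redacting tagged identifiers and removing duplicate lines).

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; here it simply echoes the prompt."""
    return prompt

def extract_information(case_description: str) -> str:
    # Stage 2: extract complaints, histories, etc. following the evaluation template.
    return call_llm(f"Extract the client's background using the template:\n{case_description}")

def generate_raw_interview(extracted: str) -> str:
    # Stage 3: convert the extracted profile into a psychiatrist-client dialogue
    # following the topic flow (identification -> histories -> social history).
    return call_llm(f"Write a multi-turn interview covering:\n{extracted}")

def polish(dialogue: str) -> str:
    # Stage 4: remove sensitive identifiers and delete duplicate lines.
    dialogue = re.sub(r"\[(?:NAME|LOCATION)\]", "[REDACTED]", dialogue)
    kept, seen = [], set()
    for line in dialogue.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)

def pipeline(case_description: str) -> str:
    # Stage 1 (collection) happens upstream; the collected description enters here.
    return polish(generate_raw_interview(extract_information(case_description)))
```

In the actual pipeline each stage is driven by its own prompt; the composition shown here only conveys how a case description flows through the four stages.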

Fig. 2: The Interview Generative Pipeline.

a Collect case descriptions from clinical casebooks, clinical notes, scientific literature, and other related sources. b Extract information from the case description following a screening template. c Generate raw interviews from the extracted information. The conversations should follow an interviewing flow. d Polish the generated conversations to remove private information and prevent unexpected information leakage. The icons used in the creation of Fig. 2 were sourced from ICONS8.

Data Source

To train a model capable of identifying emotional disorders in a diverse population, we collected cases encompassing emotional disorders; other mental disorders of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), excluding depressive and anxiety disorders (e.g., Schizophrenia Spectrum and Other Psychotic Disorders); and healthy controls. Case descriptions were collected from a wide array of sources, including clinical casebooks, research papers, and an open-source dataset33, which collectively provided a diverse range of 1,157 cases. Of these, 144 cases involved emotional disorders, including depressive and anxiety disorders, and 269 involved other mental disorders such as schizophrenia spectrum and other psychotic disorders. Given that mental disorders often exhibit shared clinical symptoms, including cases of other mental disorders could enhance the LLMs’ ability to effectively characterize individuals with emotional disorders. By incorporating a broader range of cases, the LLMs can learn to distinguish between different mental disorders. The screening results and explanations for these cases adhere to the diagnostic criteria outlined in the DSM-5. A detailed list of sources is provided in Supplementary Information: Detailed list of sources. The process by which explanations were generated can be found in Supplementary Information: Process of explanation generation.

The remaining 744 healthy control cases were sourced from the PESC dataset33 (https://github.com/chengjl19/PAL). The PESC dataset was adapted from the ESConv dataset34, which was created by recruiting crowd-workers and instructing them to engage in conversations while acting as help-seekers and supporters, with the goal of producing natural dialogues that closely mimic real-life situations. ESConv was primarily developed to provide emotional support for non-clinical individuals facing common emotional concerns. Building on ESConv, PESC extracted each case’s persona, which provided additional important information (i.e., demographic information, social status, personality, etc.) about the case’s identity. These personas enrich the clients’ profiles, thereby enhancing the informativeness of the responses within emotional support conversations. Accordingly, we adapted the extracted personas and conversations to serve as case information for healthy controls.

We then used the interview generative pipeline to transform the 1,157 case descriptions into corresponding interactive interviews. The sample distribution of the cases can be found in Fig. 3a. On average, each conversation consists of 14 utterances from either the psychiatrist or the client, with each utterance averaging 24 words. The detailed statistics of the synthetic interviews are shown in Fig. 3b. All data used for evaluation were sourced exclusively from publicly available, published materials, rather than directly from patients. Before being processed by the LLMs, all data were rigorously de-identified to ensure the removal of any personal identifiers. This process included a manual review of both input and output data to confirm that no personally identifiable information was used or inadvertently generated. As the study utilized fully anonymized, secondary data from the public domain, direct patient consent was not applicable. The transfer of this non-identifiable data for evaluation purposes was conducted in full compliance with privacy regulations and was within the scope of our institutional ethics approval.

Fig. 3: Data distribution and quality evaluation.

a Disorder distribution of the PsyInterview. The green bar represents healthy controls, the two pink bars correspond to emotional disorders, and the remaining blue bars indicate other disorders. b Statistics of the PsyInterview, including the number of dialogues, average dialogue length, and average utterance length in terms of both word and token. c Data quality-check results, comprising average scores for dialogue quality and interviewing skill. The green line denotes the threshold line.

Data quality-check

To ensure the authenticity of the generated conversations in clinical settings, we recruited three experts to evaluate the quality of the generated dialogues. One of them is a psychiatrist, and two are clinical psychologists, each with over 10 years of clinical experience. The two clinical psychologists have expertise in different therapeutic approaches: one specializes in Cognitive Behavioural Therapy (CBT), while the other has training in Parent-Child Interaction Therapy (PCIT), Dialectical Behaviour Therapy (DBT), and Acceptance and Commitment Therapy (ACT). The psychiatrist focuses on early psychosis intervention and cognitive impairments in schizophrenia. We randomly selected 50 cases from the training dataset, with approximately one-third of these cases representing depressive disorders, anxiety disorders, and healthy controls each. Each case was rated by two experts based on (1) information alignment between case description/information, dialogue, and explanations; (2) naturalness and consistency of the dialogues; and (3) logicality and compliance of explanations. The experts were asked to rate each item on a 5-point Likert scale (1 = very misaligned/not natural at all/etc., 5 = very aligned/very natural/etc.). The evaluation criteria and scales are adapted from established principles and practices that have been used in previous studies evaluating psychiatric interviews and empathic dialogues35,36,37.

To further assess the reliability of our data, we adopted an interview skill assessment developed by Morrison31. Since the original assessment was designed for extended interviews and our project focused on relatively short dialogues (eight rounds), we selected dimensions aligned with our context: history, ending the interview, establishing rapport, and use of interviewing techniques. History covers the psychiatrist’s enquiry about medical history, family history of mental disorder, history of present illness, and personal and social history. Ending the interview involves giving a warning that the interview is concluding and expressing interest and appreciation at the end. Establishing rapport focuses on building a relationship with the client, while use of interviewing techniques refers to the psychiatrist’s approach to gathering client information. The history and ending-the-interview dimensions are particularly important for our interviewing agent, as obtaining an accurate history is crucial for achieving a reliable screening result38, and employing proper ending tactics keeps the conversation focused and prevents it from dragging on indefinitely. Meanwhile, the two supplementary dimensions, establishing rapport and use of interviewing techniques, which are more suited to longer interviews, ensure that the interviewing style of our generated corpus is acceptable. We collected responses from the three experts and computed an average score for each dimension. If the average score of a dimension exceeds 50% of the maximum, the responses for that dimension are considered generally positive and acceptable. As shown in Fig. 3c, our data received positive responses across all dimensions.
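The acceptance rule for the quality-check scores reduces to a one-line computation. This is a sketch; the function name and the 5-point maximum are ours for illustration:

```python
def dimension_passes(scores, max_score=5):
    """Average expert ratings for one dimension and compare against 50% of the maximum."""
    avg = sum(scores) / len(scores)
    return avg, avg > 0.5 * max_score

# e.g., three experts rating one dimension on a 5-point scale
avg, ok = dimension_passes([4, 5, 3])  # average 4.0, above the 2.5 threshold
```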

Model training and evaluation

Using the generated interview data from PsyInterview, we trained a system called EmoScan and evaluated its effectiveness in screening for emotional disorders. We also assessed its performance in conducting interviewing tasks with GPT-4 acting as the client. In this section, we outline the training procedures for EmoScan’s screening and interviewing agents, as well as the evaluation methods used to compare EmoScan’s performance with baselines.

Training

We developed a system that consists of a screening agent and an interviewing agent. The screening agent was a fine-tuned Mistral-7B39 trained on conversational histories and screening outputs (i.e., a combination of screening results and explanations), while the interviewing agent was fine-tuned from the same base model on the multi-turn conversational data synthesized by the generative pipeline; Mistral-7B and related models excel in diverse benchmarks39 and clinical tasks40,41. Based on the conversational history, the screening agent provides a result with an explanation describing why the client received that result according to the DSM-5, and the interviewing agent is capable of conducting psychiatric interviews with clients. Training details can be found in Supplementary Information: Model training details.

Evaluation

We compared EmoScan with recent widely used LLMs: OpenAI’s GPT-4 (gpt-4-0613)42, Llama 3 (Meta-Llama-3-70B), and Mistral-7B39. These three LLMs served as baselines in the screening/explainability and interviewing evaluations. GPT-4 was accessed via the OpenAI API, while Llama 3 and Mistral-7B were implemented using the Deep Infra API. To ensure fair comparisons across models, we used consistent hyperparameter settings. All models were run with default sampling parameters: a temperature of 1.0 and a top_p value of 1.0. These settings preserve each model’s original probability distribution and allow for natural variability in responses43,44. We maintained identical experimental conditions across all models, including the same prompt content and formats (for zero-shot, few-shot, and chain-of-thought settings; prompt content can be found in Supplementary Information: Chain of Thought (COT) prompts and Supplementary Information: Prompt structure), shared few-shot examples, and the same context length. Additionally, the same evaluation metrics, such as F1-score and BERTScore, were applied to all model outputs. This controlled setup minimizes potential confounding factors and ensures that observed differences reflect true model capabilities.

Research Question 1: Does the screening agent have the ability to do screening and provide explanations?

We compared the screening performance of EmoScan and the baselines for both coarse- and fine-grained classification of emotional disorders. At the coarse-grained level, we assessed the LLMs’ ability to distinguish individuals with depressive disorders or anxiety disorders from healthy controls. At the fine-grained level, we evaluated the LLMs’ capability to identify specific emotional disorders (e.g., Major Depressive Disorder, Generalized Anxiety Disorder) among positive cases. Classification performance was evaluated using the weighted F1-score45, the harmonic mean of precision and recall computed per class and weighted by each class’s sample size. To determine whether there were significant differences in classification performance among the LLMs, we ran each model on the test set three times and applied independent two-sample t-tests to assess the statistical significance of the performance differences between EmoScan and the baselines. Additionally, we used BERTScore46, ROUGE47, and BLEU48 to measure the quality of explanations generated by the four LLMs, comparing their similarity to the ground-truth explanations.
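As a sketch of this evaluation step (with made-up labels and per-run scores, and assuming scikit-learn and SciPy are available):

```python
from scipy.stats import ttest_ind
from sklearn.metrics import f1_score

# Weighted F1 on one run: each class's F1 is weighted by its support.
y_true = ["depressive", "anxiety", "healthy", "healthy", "anxiety"]
y_pred = ["depressive", "healthy", "healthy", "healthy", "anxiety"]
f1 = f1_score(y_true, y_pred, average="weighted")

# Independent two-sample t-test over the F1 scores from three runs per model
# (illustrative numbers, not the study's actual results).
emoscan_runs = [0.75, 0.74, 0.75]
baseline_runs = [0.55, 0.57, 0.54]
t_stat, p_value = ttest_ind(emoscan_runs, baseline_runs)
```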

To explore the potential of the baselines, we further employed few-shot49 and chain-of-thought (CoT)50 prompting techniques. Few-shot prompting provides LLMs with a few task examples during inference, which can improve performance without changing model weights. In our study, we randomly selected four cases from the training data to serve as few-shot samples. The samples included diverse cases: one Depressive Disorder case, one Anxiety Disorder case, one case with other disorders, and one healthy control case. Each example consisted of the input dialogue and the corresponding structured ground-truth output. CoT prompting offers intermediate reasoning steps for LLMs, enhancing their performance on complex reasoning tasks. Our study adapted an emotional disorder screening guideline from a comprehensive mental health disorder diagnosis handbook51 as the CoT prompt. The detailed CoT prompt can be found in Supplementary Information: Chain of Thought (COT) prompts. Finally, we included a condition with both few-shot and CoT prompts, in which we simply combined the CoT prompt and the four examples, measuring whether the baselines’ performance could be improved with both reasoning steps and samples. All prompting strategies (zero-shot, few-shot, CoT, few-shot + CoT) began with a system prompt instructing the LLM on its core task (screening for emotional disorders based on DSM-5), the required output structure, and the definitions of coarse-grained (Anxiety, Depressive) and fine-grained disorders. Following the few-shot/CoT prompts and input dialogue, a final instruction was added to guide the LLM on how to format its response. The detailed prompting structure can be found in Supplementary Information: Prompt structure.
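The prompting structure described above can be sketched as simple string assembly. Every string below is a hypothetical placeholder, not the actual prompt wording from the Supplementary Information:

```python
def build_prompt(dialogue, cot=None, examples=()):
    """Assemble a screening prompt: system instruction, optional CoT steps,
    optional few-shot examples, the input dialogue, and a format reminder."""
    parts = ["System: You screen for emotional disorders per DSM-5. "
             "Output a coarse label (Anxiety/Depressive/Other/Healthy), "
             "a fine-grained label, and an explanation."]
    if cot:  # chain-of-thought: intermediate reasoning steps
        parts.append(f"Reasoning steps:\n{cot}")
    for ex_dialogue, ex_output in examples:  # few-shot examples
        parts.append(f"Example input:\n{ex_dialogue}\nExample output:\n{ex_output}")
    parts.append(f"Input dialogue:\n{dialogue}")
    parts.append("Respond in the structured format described above.")
    return "\n\n".join(parts)
```

Setting `cot` alone, `examples` alone, or both reproduces the four conditions (zero-shot, CoT, few-shot, few-shot + CoT).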

To test the generalizability of EmoScan, we compared our system with the base model on an external out-of-domain dataset, D452. This dataset was developed to screen for depression in simulated conversations between two crowdsource workers. Since the dataset was originally in Chinese, we first translated it to English using the Youdao API and manually checked the translations of the testing dataset, and then compared the screening performance of our model and the base model Mistral-7B. We specifically chose to compare with Mistral-7B because it serves as EmoScan’s foundation model, allowing us to directly evaluate how our generated dataset enhances cross-domain performance while controlling for model architecture.

RQ2: Does the interviewing agent have acceptable interviewing performance?

Obtaining essential key information about the client during an interview is crucial for achieving accurate screening results. To evaluate the interviewing performance of different models, we applied two fundamental dimensions in the interviewing assessment rated by experts: history and ending the interview. We compared EmoScan with baselines on these two essential dimensions.

To simulate the interviewing process between clients and psychiatrists, we first built a patient simulator using Autogen53 to interact with all four LLMs (i.e., EmoScan, GPT-4, Llama 3, Mistral-7B). During this process, we instructed GPT-4 to act as a client, responding to the psychiatrist’s questions, while providing it with the relevant patient information. Simultaneously, we assigned one of the four LLMs to act as the psychiatrist, responsible for asking questions. The interviewing dialogues were recorded for subsequent evaluation.

Using LLMs as raters is an effective evaluation method that reduces expensive and time-consuming human labour. Previous studies have applied GPT-4 to evaluate conversational responses, demonstrating a strong correlation with human raters54,55. Therefore, in this study, we also used a separate GPT-4 agent to act as a judge, assessing interviewing performance through a comparative analysis of the interviewing dialogues generated by our model versus those produced by the other three models. For each evaluation, raters were presented with two conversations, one generated by EmoScan’s interviewing agent and one by the other model, in which the LLMs acted as psychiatrists and talked with the same simulated client. Model names were masked, and the order of the two conversations was randomized each time to avoid bias. The rater then voted for one of them on each of the two dimensions. Finally, we calculated the winning rate of the four models. To ensure the reliability of GPT-4’s ratings, we randomly selected 90 conversation pairs and recruited six human experts with backgrounds in psychology to rate these samples based on the same guideline provided to GPT-4. We conducted Chi-Square tests to compare the ratings provided by GPT-4 with those given by the human experts.
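Computing the winning rate from the pairwise votes is straightforward; a sketch with illustrative vote labels (the label strings are ours, not the study's):

```python
from collections import Counter

def winning_rates(votes):
    """votes: list of 'emoscan', 'baseline', or 'tie' outcomes for one model pairing."""
    counts = Counter(votes)
    n = len(votes)
    return {outcome: counts[outcome] / n for outcome in ("emoscan", "baseline", "tie")}

# four illustrative pairwise judgements
rates = winning_rates(["emoscan", "emoscan", "tie", "baseline"])
```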

Statistics and reproducibility

Statistical analyses were conducted using Python 3.9 (scipy.stats for hypothesis testing and sklearn.metrics for performance metrics) and SPSS 26.0. For comparing screening performance between EmoScan and baseline models, independent two-sample t-tests were applied to F1-score results across three runs of each model on the test set, with significance determined at p < 0.05. Chi-Square tests (using scipy.stats.chi2_contingency) were performed to assess agreement between GPT-4 and human expert ratings of interviewing performance, with degrees of freedom = 2 and the significance threshold set at p < 0.05.
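A minimal sketch of this Chi-Square test, using `scipy.stats.chi2_contingency` on an illustrative (not actual) 2 × 3 contingency table of vote counts; note that the degrees of freedom come out to (2 − 1) × (3 − 1) = 2, matching the reported df:

```python
from scipy.stats import chi2_contingency

# Rows: rater (GPT-4, human experts); columns: outcome counts
# (EmoScan win, baseline win, tie). Counts below are made up for illustration.
table = [[50, 25, 15],
         [48, 27, 15]]
chi2, p, dof, expected = chi2_contingency(table)
```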

Performance metrics including weighted F1-score, recall, precision, BERTScore, ROUGE, and BLEU were calculated using standard implementations (e.g., sklearn for F1-score, transformers library for BERTScore). Sample sizes for evaluations were as follows: 129 dialogues in the internal test set, 90 conversation pairs for human-GPT-4 rating comparisons, and the external D4 dataset (translated to English) for generalizability testing. All experiments were replicated three times with consistent hyperparameters (temperature = 1.0, top_p = 1.0) to ensure reliability.

Reproducibility is supported by public access to the dataset, code, and source data for key figures via the GitHub repository (https://github.com/Junemengyuan/EmoScan). This includes scripts for data generation, model training, and evaluation metrics calculation, enabling exact replication of reported results.

Results

Evaluation of EmoScan screening agent

Table 1 summarizes the screening performance comparison between EmoScan and the baselines based on the weighted F1-score45. As demonstrated, EmoScan significantly outperformed all baselines under zero-shot, few-shot, and chain-of-thought prompting (F1 = 0.7467; t ∈ [6.7143, 25.4563], p ∈ [1.4 × 10⁻⁵, 2.562 × 10⁻³] in independent-sample t-tests), with improvements in the classification of both depressive (F1 = 0.6333) and anxiety disorders (F1 = 0.8567). This result may be attributed to EmoScan’s more cautious approach in identifying cases as positive (i.e., with emotional disorders) compared to the baselines, as its precision was much higher for both depressive and anxiety disorders. To further illustrate the models’ performance, we present a summary of true positive, false positive, true negative, and false negative cases for each model, showing that EmoScan produced the highest number of correct classifications overall. Details can be found in Supplementary Table 1. Meanwhile, although the introduction of fine-grained categories may have increased the difficulty of the second classification task, EmoScan’s performance (F1 = 0.2567) still exhibited considerable improvements over the base model (highest F1 = 0.0467), showing the efficacy of PsyInterview.

Table 1 Screening results comparison

When screening for emotional disorders, EmoScan provides an explanation for each screening result, which can help psychiatrists understand the underlying logic of the screening outputs generated by the LLMs. To assess the effectiveness of these explanations, we utilized multiple metrics: ROUGE47, BLEU48, and BERTScore46. ROUGE assesses the overlap between the generated and reference texts (i.e., ground-truth explanations); BLEU evaluates textual similarity based on n-gram overlap; and BERTScore, known for capturing semantic similarity more effectively, focuses on contextual meaning. EmoScan demonstrated exceptional performance in BERTScore (0.9408), indicating a high level of semantic alignment. It also achieved high scores in BLEU (0.0660) for matching short sequences and ROUGE-1 (0.3951) for matching single words. These results underscore EmoScan’s capability to produce accurate and contextually appropriate explanations. Although EmoScan’s performance on the ROUGE-2 (0.1132) and ROUGE-L (0.2086) metrics was slightly below that of Llama3, it outperformed all other baselines on these two metrics, showing its proficiency in capturing longer text dependencies. Overall, EmoScan’s performance and explainability remain superior, as demonstrated in Table 2.

Table 2 Screening explanations on cases with emotional disorders
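The study used standard implementations of these metrics; purely to illustrate what unigram overlap measures, a minimal pure-Python sketch of BLEU-1-style precision and ROUGE-1-style recall (omitting BLEU's brevity penalty and ROUGE's stemming) is:

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str):
    """Clipped unigram counts shared by the candidate and reference texts."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / max(sum(cand.values()), 1)   # BLEU-1-style
    recall = overlap / max(sum(ref.values()), 1)       # ROUGE-1-style
    return precision, recall

# four of five candidate words appear among the seven reference words
p, r = unigram_overlap("low mood and poor sleep",
                       "persistent low mood poor sleep appetite loss")
```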

Generalizability of EmoScan screening agent

We compared the performance on the D4 dataset before and after training the model to highlight EmoScan’s generalizability in related tasks. In the original study, psychiatrists categorized D4 cases into two groups: one with a risk of depression and the other without a depression risk. EmoScan (F1 = 0.67) outperformed the base model Mistral-7B (F1 = 0.64) when classifying the two groups.

Although EmoScan slightly outperformed Mistral-7B in the external validation, its performance declined compared to the internal validation, where it achieved an F1-score of 0.92 in classifying depression cases within the depression-related subset of our PsyInterview dataset. This drop may indicate limitations in cross-domain generalization. To better understand why EmoScan did not maintain its performance in the external validation, we conducted several additional analyses. First, we found notable differences in text length between the two datasets: texts in the PsyInterview dataset were shorter on average (mean = 2,646 words; SD = 315), while D4 texts were longer (mean = 3,770 words; SD = 1,077), with greater variability across samples. Second, we analyzed differences in language style using TF-IDF vectorization56 combined with PCA for dimensionality reduction; these methods have been used in previous studies to identify linguistic patterns in text data57. As shown in Supplementary Fig. 1, PsyInterview samples were more widely distributed in the feature space, reflecting greater linguistic diversity, while D4 samples clustered tightly, suggesting a more uniform writing style. These differences likely contributed to EmoScan’s reduced generalization performance on the D4 dataset.
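The style analysis can be sketched with scikit-learn; the four documents here are illustrative stand-ins, not actual transcripts:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # stand-ins for interview transcripts from the two datasets
    "I have felt down and hopeless for weeks",
    "My chest tightens and I worry constantly",
    "Work is fine but I cannot sleep at night",
    "I avoid crowds because panic sets in",
]
# TF-IDF turns each transcript into a term-weight vector...
tfidf = TfidfVectorizer().fit_transform(docs)
# ...and PCA projects those vectors to 2-D for visual comparison.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())
# The spread of each dataset's points in `coords` indicates its linguistic diversity.
```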

Evaluation of EmoScan interviewing agent

EmoScan outperformed Mistral, Llama3, and GPT-4 in most dimensions of interviewing performance, as evident from both GPT-4 and human experts’ rating results (Fig. 4). In each round of evaluation, the rater (either GPT-4 or a human expert) was shown two anonymized conversations, one generated by EmoScan and one by a baseline, without knowing which model produced which. The rater then voted for the conversation showing better interviewing performance, with three possible outcomes: “EmoScan win”, “The other LLM win”, or “Tie”. Ratings were based on the interview skill assessment used in the data quality-check section, which covers six dimensions: the psychiatrist’s enquiry about medical history, family history of mental disorder, history of present illness, and personal and social history; a warning that the interview is ending; and a conclusion expressing interest and appreciation. This methodology allowed us to quantitatively assess the relative strengths of EmoScan compared to commonly used LLMs.

Fig. 4: Ratings on Interviewing performance by GPT-4 and Human experts.

The blue-green section (above) shows the ratings by GPT-4, and the yellow-red section (below) displays the ratings provided by human experts. The association between GPT-4 and human ratings is significant across all six dimensions, with the highest p-value below 0.05 (X2 ranging from 9.6004 to 69.8743, p ranging from 0.0477 to 0.0016) (n = 90 conversation pairs).

To assess the reliability of GPT-4 as an automated rater, we conducted Chi-Square tests between GPT-4 and aggregated human expert ratings. The Chi-Square test assesses whether there is a significant association between two categorical variables—in this case, the evaluation outcomes from GPT-4 and those from human raters. Specifically, we used the chi2_contingency function from the scipy.stats library to calculate the Chi-Square statistic and p-values. A threshold of p < 0.05 was used to determine statistical significance: p-values below this threshold suggest a strong association between GPT-4 and human ratings, supporting GPT-4’s reliability as an evaluator. Human ratings were collected from six independent evaluators. For each evaluation dimension, we constructed contingency tables comparing the frequency of three possible outcomes (EmoScan win, baseline win, tie) between GPT-4 and human raters. The Chi-Square statistic (X2) measured the divergence between GPT-4 and human votes, with degrees of freedom (df=2) reflecting the three-category comparison.
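The Chi-Square statistic itself can be computed directly from the contingency table; the sketch below uses invented vote counts for illustration, while `scipy.stats.chi2_contingency`, as used in the study, additionally returns the p-value:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rater and outcome
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical vote counts over (EmoScan win, baseline win, tie)
votes = [
    [40, 30, 20],   # GPT-4
    [35, 30, 25],   # human experts (aggregated)
]
stat = chi2_statistic(votes)    # df = (2 - 1) * (3 - 1) = 2
```

With df = 2, the statistic is compared against the chi-square distribution to obtain the p-value reported in the study.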

Discussion

The present study introduced a pipeline for synthesizing clinical interviews to screen for emotional disorders, generated an interviewing dataset, PsyInterview, and trained a system, EmoScan, capable of both providing screening results with explanations and conducting interviews. Upon evaluation, EmoScan demonstrated superior performance in screening emotional disorders, provided more robust screening explanations than baselines, conducted effective interviews, and showed greater generalizability than its base model. In a clinical context, EmoScan has the potential to save expert clinicians’ time by effectively communicating with individuals and identifying those with emotional disorders. Additionally, it can provide explanations for its output, enabling experts to comprehend the decision-making logic and gain confidence in the results. However, it is worth noting that while EmoScan shows promise, it still requires further validation. This includes reducing false positives and negatives, testing in clinical settings with oversight from clinicians, and refining the system based on real-world feedback to ensure it is reliable, safe, and effective before it can be used in practice.

Consequently, our pipeline represents a meaningful step forward in developing tools that could support mental health screening. While the results are promising, this work is still at the research stage and not yet ready for clinical use. It is important to note that EmoScan is designed to support clinicians, not replace them, and professional judgement should always guide the final diagnosis. Bringing EmoScan into real-world practice will require several intermediate steps. First, large-scale studies will be needed to evaluate the effectiveness and long-term performance of EmoScan. Second, the system must be retrained and tested on different populations to improve accuracy. In contexts with abundant clinical resources, minimizing false negatives and reducing under-diagnosis is essential to ensure that individuals in need receive care58,59. Conversely, in places with limited resources, reducing false positives becomes more critical to avoid unnecessary referrals and overburdening healthcare systems60. Third, the tool should be tested in real clinics to evaluate its usability. Feedback from both clinicians and patients will be important for refining the system61. Lastly, after multiple stages of refinement and evaluation, EmoScan could be considered for use in clinical settings once it has demonstrated both low false-positive and low false-negative rates in supporting clinical diagnosis.

The performance of EmoScan on interviewing can be largely attributed to the clinical information contained in the PsyInterview dataset, which closely simulates patient histories, and to the structured dialogue format that ensures focused and clinically relevant interactions, drawing from clinical literature that emphasizes structured enquiry to elicit comprehensive and reliable responses from clients62,63. These factors differentiate EmoScan from general-purpose models trained on less specific data. Specifically, the clients’ profiles in PsyInterview, such as their detailed medical history, family history of mental disorders, and personal and social background, enable EmoScan to ask more targeted and clinically relevant questions during interviews. In addition, the structured generation pipeline ensures a consistent and logical flow of conversation, including effective cues for initiating and concluding interviews, which further enhances the quality of the interaction.

Researchers have underscored the potential of AI to complement professional mental health expertise64, while emphasizing a domain-specific scope to avoid risks associated with applying general LLMs to clinical tasks65. Our findings strongly support these recent perspectives on utilizing LLMs for mental health applications. Our results showed that EmoScan achieved significantly higher overall F1 scores than all the baseline general LLMs, though its recall was marginally lower than a few of them. This could be attributed to EmoScan being a cautious system with high precision, especially when screening for anxiety disorders, where general LLMs tended to underperform and incorrectly label many healthy individuals as patients. As a cautious system, EmoScan minimized the likelihood of false diagnoses of emotional disorders, thus preventing unnecessary costs related to further treatment. However, we acknowledge that under-diagnosis can lead to serious consequences, including delayed treatment, worsening of symptoms, and reduced trust in the healthcare system66,67. Therefore, to decrease the risk of under-diagnosis, we suggest that users chat and screen with the models over multiple sessions to capture temporal variation in their symptoms and obtain a more reliable screening outcome. Importantly, EmoScan is designed as a supportive screening tool to assist clinical expertise, not as a replacement. In cases where model outputs are inconsistent across sessions or explanation patterns are unclear, clinicians should step in to provide more professional screening outcomes. In summary, EmoScan holds potential for application in clinical screening, offering valuable support to clinicians in making more efficient diagnoses.

One of our contributions was the development of a scalable pipeline for generating data efficiently. Previous research in this field has often relied on data that either incurred high costs or came from pre-existing datasets annotated using questionnaires. For instance, the depression-related conversation corpus D452, employed as a validation dataset in our study, was created by recruiting individuals to role-play patients or doctors. While the data quality of D4 was good, the associated costs of human role-playing were relatively high. Other notable studies have explored emotional disorders using conversational datasets such as DAIC-WOZ, a multi-modal dataset developed for depression diagnosis27,68. While increased access to real clinical data would be ideal for training robust LLMs, this remains a challenge due to privacy concerns and regulatory requirements69,70. As with DAIC-WOZ, the sample size of such datasets is often relatively small, limiting their application in automated clinical screening. To address this challenge, we believe a dual approach is necessary. In the short term, data synthesis provides a practical solution by generating clinical notes and interview transcripts. While the field works toward solutions for secure and ethical sharing of clinical data71, there is a critical need for practical interim approaches to developing effective mental health AI systems. Our data-generative pipeline offers a workable and scalable solution. The balanced method, incorporating both automation and selective human monitoring, not only ensured high data quality but also reduced the costs and ethical concerns associated with real-life conversations. In the long term, we advocate for the development of secure, community-driven data-sharing platforms to de-identify and share real clinical notes. Together, these strategies are complementary and essential for advancing the development and application of LLMs in mental health care.

One limitation of our study was the small sample size for each fine-grained emotional disorder. While our dataset comprised around 20 different emotional disorders, the number of samples for each fine-grained disorder was limited by the constraints of available sources. The limited training sample size for each disorder posed a challenge for many LLMs, including EmoScan, as it prevented them from learning underlying patterns, ultimately resulting in lower accuracy when identifying each fine-grained disorder. To improve model performance, future studies could consider enlarging the sample size for each fine-grained disorder. Anxiety frequently co-occurs with depression72,73; however, our current dataset contains only four cases exhibiting both conditions. This is largely due to the use of textbook-style cases that focus on prototypical symptoms of single disorders, which limits the system’s applicability to populations with comorbid presentations. Future work should address this gap by incorporating more cases involving comorbidity, which would improve the model’s generalizability and relevance to real-world clinical settings. Additionally, future researchers may integrate multi-modal information to train LLMs. For example, previous research has underscored the benefits of incorporating acoustic speech information within LLM frameworks for depression detection27,74,75. Our study focused on textual data for easier implementation, yet incorporating multi-modal inputs could help researchers enhance screening accuracy. Moreover, while EmoScan currently prioritizes precision to minimize false positives, this approach may reduce its ability to detect all true cases, potentially increasing the risk of under-diagnosis.
Future versions of EmoScan could incorporate hybrid models to dynamically balance recall and precision based on population risk profiles (e.g., prioritizing higher recall for high-risk demographic subgroups) to reduce the risk of under-diagnosis and misdiagnosis, thereby enhancing patient safety and trust. Additionally, collaborating with global clinics to evaluate model performance across diverse demographic groups will be crucial for bias detection and mitigation. We also recognize that transparency and accountability are essential ethical considerations, and future work must prioritize enhancing these aspects to ensure that clinicians and patients can clearly understand and trust the system’s outputs. Furthermore, bias detection was not performed in this study due to the lack of demographic information in the training data. Future research needs to evaluate the efficacy of our pipeline across diverse population subgroups to support a more equitable and precise application of EmoScan.
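One simple form such a recall-precision balancing strategy could take is subgroup-specific decision thresholds applied to the model’s output score; the thresholds and group names below are purely hypothetical, not part of EmoScan:

```python
def screen(prob_disorder: float, risk_group: str) -> bool:
    """Flag a case as positive using a subgroup-specific decision threshold.

    Illustrative thresholds only: a lower threshold for high-risk groups
    trades precision for recall, reducing the chance of under-diagnosis.
    """
    thresholds = {"high_risk": 0.3, "general": 0.5, "low_prevalence": 0.7}
    return prob_disorder >= thresholds[risk_group]

# The same model score can lead to different decisions by subgroup
flag_high = screen(0.4, "high_risk")    # flagged: recall prioritized
flag_gen = screen(0.4, "general")       # not flagged: precision prioritized
```

In practice, such thresholds would need to be calibrated on held-out data for each subgroup and audited for fairness rather than fixed by hand.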

Conclusion

Our study introduced a pipeline for effective screening and interviewing of emotional disorders using Large Language Models (LLMs). Using our pipeline, we synthesized data from existing clinical interviews and created the PsyInterview dataset. Subsequently, we developed EmoScan, an LLM-based system trained on our collected dataset for screening emotional disorders with explanations and conducting interviews to collect clinical information. EmoScan demonstrated superior accuracy, robust explanations, and strong interviewing skills, significantly outperforming baseline models. Our pipeline and models hold considerable potential for efficient and effective clinical screening and interviewing, providing mental health professionals with a valuable tool to support their work.