Abstract
In recent years, Transformer-based large language models (LLMs) have significantly improved text generation capability. Mental health is a serious concern that can be addressed using LLM-based automated mental health counselors. These systems can provide empathetic responses to individuals in need while considering the negative beliefs, stigma, and taboos associated with mental health issues. However, the large size of these LLMs makes it difficult to deploy such automated counselors on low-cost/low-resource devices such as edge devices. Therefore, the present study analyzes the effectiveness of lightweight LLMs in the development of automated mental health counseling systems. In this study, lightweight open-source LLMs, namely Google's T5 (small variant) and FLAN-T5 (small variant), Facebook's BART (base variant), and Microsoft's GODEL (base variant), have been fine-tuned for the automated mental health counseling task utilizing a diverse set of publicly available datasets. The experimental results reveal that BART's base variant outperformed the other models across all key metrics, namely ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, with scores of 0.4727, 0.2665, 0.3554, and 25.3993, respectively. In comparison to the other models, the BART-base model generated empathetic and emotionally supportive responses. These findings highlight the potential of lightweight (small-size) LLMs in advancing LLM-based mental health counseling solutions and underscore the need for further exploration of lightweight LLMs for this use case. The code for this work is available at the following link: https://github.com/diviitmg03/Comparative-analysis-of-LLMs-.git.
Introduction
The mental health crisis is a serious global challenge; according to the WHO report, one out of every eight individuals suffers from mental health-related issues1. As per the ICMR report (2017), 14.3% of the total population suffers from mental disorders2. India faces a severe shortage of mental health professionals, with only 0.75 psychiatrists available per 100,000 people, far below the WHO's recommended ratio of 3 per 100,0003.
In recent years, large language models (LLMs) have been used for a wide variety of natural language processing applications, including question answering, text generation, language summarization, etc.4. Despite the advancement of LLMs in different fields, their application to mental health counseling remains largely unexplored. The large size of these models poses significant challenges to their deployment on low-resource computing devices such as edge devices. Therefore, the present study aims to explore the potential of open-source lightweight LLMs in the development of automated mental health counseling systems.
Most of the automated chatbots developed earlier for mental health counseling were either rule-based systems or relied on conventional machine learning models5. Chatbots developed using such methods were designed for domain-specific issues such as depression, anxiety, suicide prevention, or stress management. The responses generated by these conventional AI-based conversational agents were based on pre-defined rules, without considering the needs of the individual seeking mental health counseling. The major limitation of such systems was their lack of generalisation to a wide variety of mental health issues.
With the advancement of Transformer-based LLMs in recent years, it has become possible to understand, analyze, and respond in a meaningful manner to different mental health issues using LLM-based automated chatbots6. These LLM-based chatbots overcome the limitations of the rule-based chatbots earlier used for addressing diverse mental health issues: an LLM, once fine-tuned on a large dataset, can be deployed for a wide variety of mental health counseling-related issues. However, the large size (billions of parameters) of these LLMs prevents their deployment on edge devices, thereby limiting their scalability. Considering this limitation, the proposed work analyses the effectiveness of lightweight LLMs (with parameter counts in the few millions) in the development of automated chatbots for mental health counseling-related tasks.
The primary research question of this study is: how do different lightweight language models perform in terms of accuracy and contextual relevance when generating responses for mental health counseling tasks under diverse scenarios? To investigate this question, different lightweight LLMs, including T5S7, BARTB8, FLAN-T5S9, and GODELB10, have been fine-tuned on a mental health conversation dataset, and the performance and contextual relevance of the responses generated by these models were analyzed. The dataset used for fine-tuning comprised publicly available mental health counseling conversations between patients and experts, structured in the form of questions and corresponding responses, curated from different online platforms. A comprehensive evaluation of these models was performed to check their efficacy in generating contextually relevant and accurate responses within different mental health counseling scenarios. Key performance metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BLEU (Bilingual Evaluation Understudy), and perplexity were used to benchmark the performance of these models. This detailed analysis provided valuable insights into the suitability and effectiveness of these lightweight LLMs for mental health counseling applications.
Related works
Some of the works related to the present study have been discussed in this section.
The recent advancement of LLMs in the field of natural language processing has opened new possibilities in healthcare, particularly in addressing the mental health crisis, and several studies have explored the potential of artificial intelligence in mental health. In one work11, researchers used rule-based systems to develop chatbots addressing mental health-related issues; these systems generate synthetic responses rather than the therapeutically relevant responses necessary for mental health counseling. In other work, researchers reviewed the use of AI in addressing diverse challenges in the field of mental health, such as disease prevention, diagnosis, and treatment interventions12. Researchers have also proposed a method that uses data from an individual's various digital footprints, such as social media posts and smartphone usage data, to analyze patterns and predict the individual's mental health status13. Other work analyzed the successful integration of AI into mental healthcare with respect to issues that need to be addressed, such as privacy, bias, and diagnostic accuracy14. Researchers have likewise analyzed the ethical issues and risk factors involved in developing such AI-enabled chatbots: over-reliance on generated responses and the lack of emotional intelligence in these systems limit their adoption in real-life use cases15. Schyff et al.16 proposed a conversational AI-based mental health support chatbot named Leora, which provides support to users with mild symptoms of anxiety and depression. However, the responses generated by these AI-based conversational agents were generalized in nature and did not provide personalized assistance.
Rathnayaka et al.17 proposed an AI and behavioral activation-based personalized assistant for recurrent emotional support, catering to the needs of individuals seeking therapeutic responses to mental health-related queries.
Recent advancements in natural language processing with the development of Transformer-based LLMs have significantly transformed healthcare. Peng et al.18 developed a model by fine-tuning the GPT-3 architecture with up to 20 billion parameters for healthcare text generation and biomedical natural language processing tasks. Singhal et al.19 developed the Med-PaLM 2 LLM for the medical question answering task; this model achieved a score of 67.2% on the MedQA dataset. Yang et al.20 proposed an ensemble of LLMs for medical question answering. He et al.21 fine-tuned a BERT model on knowledge of different diseases, such as their signs, symptoms, diagnosis, and treatment. Despite the significant advancement of LLMs in healthcare, their adoption in mental healthcare is at a nascent stage and has not been systematically explored. Challenges like data scarcity, low user engagement, and high dropout rates hinder the deployment of these LLMs in mental healthcare22,23. Yadav et al.24 compared state-of-the-art LLMs with and without fine-tuning and found that the fine-tuned LLMs performed better, with improved generalization, for the automatic generation of diagnostic summaries for mental state screening. Zheng et al.25 generated a dataset named ExTES (ExTensible Emotional Support dialogue dataset) and fine-tuned a LLaMA model for mental healthcare with emotional support.
LLMs like Med-PaLM 2 have been successfully deployed in the mental health diagnosis task, achieving 92.5% accuracy in correctly diagnosing depressive disorder26. Similarly, the PaLM 2 model, when fine-tuned on medical domain data, generates a more comprehensive list of psychiatric diagnoses than medical experts27. Some researchers have proposed a framework for examining issues such as bias, stereotyping, privacy violation, and the exacerbation of inequalities in LLMs28. In other work, researchers have proposed a sociocultural-technical approach to address challenges like technical costs, literacy gaps, biases, and inequalities29.
The large size (billions of parameters) of these LLMs makes it impractical to deploy them on low-resource edge devices. Therefore, in this work, open-source lightweight LLMs (with parameters in the few millions) have been explored for the mental health counseling-related task, with a special focus on generating empathetic, coherent, and contextually relevant responses.
Methodology
The proposed method evaluates the effectiveness of four open-source lightweight LLMs (T5-small, FLAN-T5-small, BART-base, and GODEL-base) fine-tuned on curated mental health counseling conversation datasets. The proposed method involves steps such as data preprocessing, tokenization, model fine-tuning, and evaluation using different metrics (ROUGE, BLEU, perplexity), along with preliminary human feedback. The general flow of the proposed methodology is shown in Fig. 1, and a detailed pictorial representation is shown in Fig. 3. The following subsections provide a detailed explanation of the sub-processes involved in the proposed methodology.
Overall methodology framework used in this study for fine-tuning and evaluating lightweight LLMs in mental health counseling.
Data collection
The datasets used in this work were curated from diverse online sources such as Hugging Face and GitHub. The collected data was organised in a standardised format as question-response pairs, suitable for training the lightweight LLMs compared in this study. The following datasets were used for fine-tuning the LLMs: the Aditya Mental Health Counselling Dataset30, Mental Health Counselling Chat31, the Counsel Chat Dataset32, and Amod-Mental Health Counselling Conversations33. A diagrammatic representation of the distribution of data among these datasets is shown in Fig. 2.
Table 1 provides detailed information, such as source, size, and several other details, about all four datasets used in this study for fine-tuning the lightweight LLMs. The combined dataset obtained by merging the conversation instances from all four datasets contained 20,500 instances, each consisting of a question and its corresponding answer in English. The combined dataset covers mental health-related conversations on a wide variety of topics, such as depression, anxiety, self-esteem, coping mechanisms, stress management, and relationship issues.
Distribution of Mental Health Datasets.
Each dataset provides unique scenarios, enabling the fine-tuned language models to capture the details of various mental health concerns and respond with contextually appropriate and empathetic dialogue. The diversity in topics and conversational styles within the datasets supports comprehensive training, promoting the development of language models capable of addressing complex mental health issues.
Data cleaning
The first step in the data cleaning process involved the deletion of irrelevant columns and duplicate entries to avoid repetition of data. Text fields in the dataset were cleaned by removing HTML tags, extraneous strings, and system-generated prompts. The dataset was then organized into dialogue pairs in the form of question and response. All four datasets, cleaned in this manner, were combined for fine-tuning the different lightweight LLMs.
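The cleaning steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the record field names (`question`, `response`) are assumed for illustration and need not match the actual dataset schemas:

```python
import re

def clean_text(text):
    """Strip HTML tags and collapse extra whitespace left behind by scraping."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    return text

def build_dialogue_pairs(records):
    """Clean, deduplicate, and organise raw records into (question, response) pairs."""
    seen, pairs = set(), []
    for rec in records:
        q, a = clean_text(rec["question"]), clean_text(rec["response"])
        if q and a and (q, a) not in seen:    # drop empty fields and duplicates
            seen.add((q, a))
            pairs.append((q, a))
    return pairs

raw = [
    {"question": "<p>How do I cope with   stress?</p>",
     "response": "Try <b>breathing</b> exercises."},
    {"question": "<p>How do I cope with   stress?</p>",   # duplicate entry
     "response": "Try <b>breathing</b> exercises."},
]
print(build_dialogue_pairs(raw))
# [('How do I cope with stress?', 'Try breathing exercises.')]
```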
Flow graph for the proposed methodology.
Data pre-processing
The pre-processing pipeline starting from data preparation to tokenisation involved the following key steps:
Input preparation and Tokenization: Each question-answer pair in the dataset was formatted into a form suitable for the model's input. In the case of the \(\hbox {T5}_{S}\) and FLAN-T5S models, each question in the dataset was prefixed with \(``<question>''\) to contextualize the input. The contextualized input was then tokenised using the 'T5Tokenizer' from the Hugging Face Transformers library, which is based on the SentencePiece tokenisation algorithm. The maximum sequence length was set to 512 to ensure that long conversations are truncated and short ones padded as required, maintaining consistency in the length of input sequences.
Target Encoding: Transformers like T5 and BART expect separate tokenisation of the response corresponding to each question to generate the target sequences. Tokenisation was applied to the responses using the same tokeniser as for the input questions, with the maximum sequence length again set to 512. The encoded responses were assigned as “labels” for supervised learning so that each generated response could be compared with its ground-truth response.
Dataset Mapping: Dataset mapping was applied to both the training and validation sets to ensure batch-wise tokenisation of question-answer pairs for efficient processing. This mapping ensures that all input sequences are properly structured for further processing by the LLMs.
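The preprocessing steps above can be illustrated with a schematic sketch. The actual pipeline uses the SentencePiece-based `T5Tokenizer`; here a toy whitespace tokenizer stands in for it, so that the `<question>` prefixing, truncation/padding to a fixed length, and assignment of target ids as `labels` can be shown in a self-contained way:

```python
MAX_LEN = 512  # maximum sequence length used in this study
PAD_ID = 0     # illustrative padding token id

def encode(text, vocab, max_len=MAX_LEN):
    """Toy whitespace 'tokenizer' standing in for SentencePiece:
    map tokens to integer ids, truncate to max_len, pad to a fixed length."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]
    ids = ids[:max_len]                     # truncate long conversations
    ids += [PAD_ID] * (max_len - len(ids))  # pad short ones
    return ids

def preprocess(question, answer, vocab, max_len=MAX_LEN):
    """Prefix the question with <question>, encode input and target,
    and attach the target ids as 'labels' for supervised fine-tuning."""
    return {
        "input_ids": encode("<question> " + question, vocab, max_len),
        "labels": encode(answer, vocab, max_len),
    }

vocab = {}
ex = preprocess("How do I manage anxiety?", "Try grounding techniques.",
                vocab, max_len=8)
print(ex["input_ids"])  # ids for '<question>' + question tokens, padded to length 8
print(ex["labels"])     # ids for the target response, padded to length 8
```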
Selection and fine-tuning of lightweight language models
Different lightweight sequence-to-sequence LLMs, each having a few million parameters, were selected and fine-tuned for the mental health counseling conversation task. A detailed description of the selected lightweight LLMs compared in this study is presented as follows:
T5 (Text-to-Text Transfer Transformer): A prominent Seq2Seq model developed by Google with only 60 million parameters, making it suitable for assessment on the present mental health counseling task. It is based on a text-to-text framework, taking text as input and generating text as output, which makes it suitable for the present counseling conversation task. The model is efficient for a wide variety of text generation tasks, such as document summarization and question answering. In this work, the 'small' variant of the T5 model has been used for fine-tuning7.
BART (Bidirectional and Auto-Regressive Transformers): The BART model, developed by Facebook, takes inspiration from the bidirectional text understanding capability of BERT (Bidirectional Encoder Representations from Transformers)-based encoders and the auto-regressive capability of GPT (Generative Pre-trained Transformer)-based decoders. It is pre-trained on text corrupted by a noise function and performs well on text generation tasks like summarization, translation, question answering, and text classification. In this work, the 'base' variant of the BART model has been chosen, considering its Seq2Seq framework with a limited number (139 million) of parameters, which makes it suitable for fine-tuning on the mental health counseling conversation task34.
FLAN-T5: FLAN is a family of improved Transformers; in contrast to T5, FLAN-T5 has been instruction-tuned and improves upon T5's performance across a diverse range of text generation tasks such as summarization, reasoning, and question answering. FLAN-T5 is found to be more effective in zero-shot, few-shot, and chain-of-thought reasoning settings. In this work, FLAN-T5's small variant, with 80 million parameters, is fine-tuned on the mental health counseling dataset9.
GODEL (Grounded Open Dialogue Language Model): Developed by Microsoft Research, GODEL is an open-source pre-trained language model and a successor of Microsoft's DialoGPT model, primarily developed for goal-oriented dialogue. The model is based on an encoder-decoder architecture and is well trained for generating responses conditioned on external text10. It was chosen for its suitability for the dialogue generation task with only 220 million parameters.
A detailed description of each of these models is provided in Table 2. The models were fine-tuned on the pre-processed combined dataset using the hyperparameters detailed in the experimental results section.
Response generation and evaluation of fine-tuned LLMs
The combined dataset was split into train and test sets: 70% of the combined dataset was used for training, whereas 30% was used for testing. The questions in the test set were given as input to the fine-tuned lightweight LLMs, and the responses generated were recorded and compared with the actual responses present in the dataset. Several quantitative metrics, such as ROUGE35, BLEU36, and perplexity37, have been used to evaluate the effectiveness of the responses generated by these fine-tuned models. These scores help in evaluating the quality, relevance, and reliability of the generated responses.
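The 70/30 split described above can be sketched as follows; the random seed and the helper name are illustrative assumptions, not details reported in the study:

```python
import random

def train_test_split(pairs, train_frac=0.7, seed=42):
    """Shuffle reproducibly and split question-answer pairs into
    train and test sets, as done for the combined dataset in this study."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)      # fixed seed for reproducibility
    cut = int(round(len(pairs) * train_frac))
    return pairs[:cut], pairs[cut:]

# Toy stand-in for the 20,500 question-answer pairs of the combined dataset.
data = [(f"q{i}", f"a{i}") for i in range(20500)]
train, test = train_test_split(data)
print(len(train), len(test))  # 14350 6150
```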
The ROUGE score evaluates the overlap between the generated text and the target response by measuring recall across n-grams, indicating how well the generated response is aligned with the target response.
The BLEU (Bilingual Evaluation Understudy) score measures the precision of n-gram overlap between the generated response and the target response. High BLEU scores indicate more fluent and relevant responses, which may help improve user engagement in mental health counseling conversations.
The perplexity score quantifies the model's uncertainty, or conversely its confidence, in the generated response: a lower perplexity score denotes higher confidence in the model's predictions, and a higher score denotes lower confidence. A low perplexity score supports consistent and coherent interactions during mental health counseling conversations.
Using the ROUGE and BLEU scores, the informativeness and lexical similarity of the generated responses have been measured, whereas a lower perplexity score indicates a more coherent and well-trained fine-tuned model. Altogether, these metrics provide a comprehensive evaluation framework assessing the text similarity and predictability of the fine-tuned LLMs; in addition, the empathy and contextual relevance of the generated responses have been assessed by a human evaluator.
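As a self-contained illustration of the three metric families (the study itself would rely on standard metric implementations), the sketch below computes ROUGE-N recall, the modified n-gram precision at the core of the BLEU score, and perplexity as the exponential of the mean negative log-likelihood of the generated tokens:

```python
import math
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N: n-gram recall of the candidate against the reference."""
    ngrams = lambda toks: Counter(zip(*[toks[i:] for i in range(n)]))
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    overlap = sum((ref & cand).values())    # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

def ngram_precision(reference, candidate, n=1):
    """Modified n-gram precision, the core ingredient of the BLEU score."""
    ngrams = lambda toks: Counter(zip(*[toks[i:] for i in range(n)]))
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    overlap = sum((ref & cand).values())
    return overlap / max(sum(cand.values()), 1)

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood the model
    assigns to each generated token; lower means more confident."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

ref = "it is normal to cry during therapy"
cand = "it is normal to cry"
print(round(rouge_n_recall(ref, cand), 4))   # 0.7143 (5 of 7 reference unigrams matched)
print(round(ngram_precision(ref, cand), 4))  # 1.0 (all 5 candidate unigrams appear in reference)
print(round(perplexity([0.25] * 4), 2))      # 4.0 (uniform probability over 4 outcomes)
```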
The proposed methodology is outlined in Fig. 3, covering data collection, pre-processing, splitting, fine-tuning, and evaluation of the responses generated by the lightweight LLMs for mental health counseling conversation. The responses generated by these models were compared against the target responses, and ROUGE, BLEU, and perplexity scores were computed; in addition, the generated responses were evaluated for contextual coherence and empathy under different counseling scenarios by a human evaluator.
Experimental results and discussions
Experiment setup and performance metrics
All experiments were performed on an NVIDIA T4 GPU with 16 GB of RAM. The test set consisted of approximately 4,000 mental health counseling dialogues, used for testing the trained models. The evaluation was performed by comparing the model-generated responses against the reference responses.
Hyperparameters used
The hyperparameters used for fine-tuning the LLMs are summarized in Table 3. The batch size was set to 8 and the learning rate to \(5 \times 10^{-5}\). The models were fine-tuned for 50 epochs to ensure proper convergence without signs of overfitting. Among the other hyperparameters, the weight decay was set to 0.01 for regularization, and mixed precision (fp16) was used for better compute and memory efficiency.
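The reported hyperparameters can be collected into a single configuration. The keyword names below follow Hugging Face's `Seq2SeqTrainingArguments` API as a plausible vehicle; the exact argument set used in the study may differ:

```python
# Hyperparameters reported in Table 3, expressed as keyword arguments in the
# style of Hugging Face's Seq2SeqTrainingArguments (the field names follow
# that API; the study's exact training script is not reproduced here).
training_config = {
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 5e-5,
    "num_train_epochs": 50,
    "weight_decay": 0.01,  # regularization
    "fp16": True,          # mixed precision for compute and memory efficiency
}
print(training_config["learning_rate"])  # 5e-05
```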
Tokenisers
The details of the tokenisers and the tokenisation approach used with the different fine-tuned LLMs in this work are presented in Table 4.
Performance comparison
The performance metrics used to compare the responses generated by the fine-tuned models with the target responses are presented in Table 5. Performance metrics such as ROUGE-1, ROUGE-2, ROUGE-L, and BLEU scores were used to compare the models; comparing the fine-tuned models on these metrics indicates whether the responses they generate are contextually relevant and coherent.
Among all the models, the BARTB model achieved the highest scores across all metrics, with a ROUGE-1 score of 0.4727, ROUGE-2 of 0.2665, ROUGE-L of 0.3554, and a BLEU score of 25.3993. These scores indicate that the responses generated by the BARTB model are both contextually relevant and coherent. The GODELB model achieved moderate performance, with a BLEU score of 6.6183 and reasonable ROUGE scores; its response, as presented in Table 6, was found to be empathetic, though it lacked the specificity observed in the response generated by the BARTB model. The performance of the FLAN-T5S and T5S models was found unsatisfactory: the FLAN-T5S model achieved a ROUGE-1 score of 0.2632 and a BLEU score of 3.0431, whereas the T5S model achieved a ROUGE-1 score of 0.2585 and a BLEU score of 3.0649, suggesting the limitations of these models in capturing the linguistic subtleties and psychological nuances necessary for generating comprehensive counseling responses.
To analyse the linguistic coherence and psychological relevance of the responses generated by these fine-tuned models, the perplexity scores obtained when a common question, “Can you help me with understanding how to deal with anxiety?”, was posed to each model were compared, as presented in Table 6.
The T5S model achieved the lowest perplexity score of 1.78, indicating the fluency and coherence of the generated text. However, the response it generated to the common question was found to be repetitive and lacking in practical utility, undermining the use of the T5S model in real-life mental health counseling applications.
GODELB achieved a perplexity score of 3.40, and its response was found to be empathetic in tone and supportive regarding anxiety, but lacking actionable detail. FLAN-T5S, with a higher perplexity score of 6.20, generated a concise but less fluent and psychologically less relevant response.
Although the BARTB model achieved a perplexity score of 3.75, considering the fluency, psychological depth, actionable advice, and well-framed structure of its generated response, it provides the most coherent response among the models, making it a suitable choice for implementing LLM-based mental health counseling. In contrast, the low perplexity score of the T5S model suggests that it prioritizes fluency over the depth of the generated response, whereas the responses generated by the GODELB and FLAN-T5S models focus more on empathy than on providing the actionable advice necessary for the therapeutic counseling use case.
Analysis of real-time response generation
To analyse the real-time response generation capabilities of the fine-tuned models used in this study, the responses generated by these models were assessed for contextual coherence, relevance, and empathy in addressing mental health-related queries. For this purpose, two contextually different questions were posed to each fine-tuned model, and the generated responses are presented in Tables 7 and 8.
Response to question 1: crying during therapy
The responses generated by the fine-tuned models for the question “Is it normal for people to cry during therapy, or is it just me?”, presented in Table 7, are analysed in this subsection. The response generated by the BARTB model was found to be coherent and empathetic, addressing emotional support and providing reassurance in a compassionate tone. The response generated by the FLAN-T5S model was also empathetic and relevant; however, occasional repetitiveness diminished its fluency. In contrast, the responses generated by the GODELB and T5S models suffered from redundancy and contradictions, diminishing the reliability of these models in generating contextually coherent and empathetic responses.
Response to question 2: solutions for panic attacks
The responses generated by the fine-tuned models for the question “Give me solution for my panic attack.”, presented in Table 8, are analysed in this subsection. The response generated by the BARTB model was found to be structured and practical, suggesting suitable coping mechanisms and actionable advice useful in a therapeutic context. The FLAN-T5S model responded in a supportive tone but lacked detailed actionable advice, making its response less effective than that of the BARTB model. The GODELB model provided a brief and generic response without any actionable advice, and the response generated by the T5S model was found to be contextually irrelevant, with contradictory statements.
Comparing the responses generated by the fine-tuned lightweight LLMs for Question 1 and Question 2, as presented in Tables 7 and 8, reveals that the BARTB model generated detailed, actionable, and empathetic responses aligned with the principles of mental health counseling. The FLAN-T5S model generated responses with an empathetic tone but lacked the depth required for practical use cases. The responses generated by the GODELB and T5S models lacked detail and suffered from redundancy and contradictions.
Table 9 presents a comparative analysis of the language models used in this study on the basis of their memory requirements and inference speed. Despite the differences in memory requirements among these models, all of them can be deployed on mobile devices, given their small memory footprint and fast inference speed.
All the models evaluated in this study are lightweight enough to be deployed on low-cost GPUs or even on mobile System-on-Chip (SoC) devices; their low memory footprint and fast inference make them well suited to real-world scenarios. Although LLMs like GPT-3 and LLaMA have shown decent performance on mental health-related conversation tasks in previous studies26,27, their high computational demands in terms of memory and compute still make their deployment on end devices impractical.
These findings highlight the potential of the BARTB model for application in the implementation of mental healthcare support systems.
Limitations and future work
Though the present study provides valuable insight into the application of lightweight LLMs to the automated mental health question-answering task, the following limitations should also be acknowledged. Although the datasets utilised in this study were collected from diverse online resources, the limited verifiability of the collected data restricts its use in diverse mental healthcare contexts. Moreover, since the present study focuses on small LLMs with limited parameters, medium- or large-sized LLMs trained on more data may improve performance in real-life mental healthcare use cases.
While automatic metrics such as ROUGE, BLEU, and perplexity scores have been used to evaluate the responses generated by the language models in this study, a preliminary human assessment by a single non-expert evaluator has also been made to assess the coherence and empathy of the generated responses. However, we acknowledge the limitations of this informal evaluation. In future work, we plan to involve mental health counselors and users in a human-in-the-loop framework, providing domain-specific feedback and a more rigorous assessment of the emotional and therapeutic effectiveness of the model-generated responses.
Ethical considerations
The deployment of language models for mental health counseling raises several ethical concerns. Among these are the generation of inappropriate or misleading responses and hallucinations, which could be harmful if interpreted as professional advice. Therefore, users of AI-based conversational systems should be informed that all advice is AI-generated, and the responses generated by the language models should not be relied upon without consulting an expert psychologist.
Another ethical consideration concerns the privacy of sensitive personal information. Moreover, inherent biases in the training data might lead to biased responses, requiring future efforts in bias detection and mitigation.
The responsible deployment of these systems should follow a human-in-the-loop framework and must involve mental health professionals to assure safety and fairness in practical use.
Conclusion
This study investigated the potential of lightweight Large Language Models (LLMs) for AI-driven mental health counseling, focusing on four models (GODEL-base, BART-base, T5-small, and FLAN-T5-small) fine-tuned on a curated dataset derived from diverse, high-quality mental health counseling resources. Among these models, BART-base consistently achieved the highest ROUGE-1, ROUGE-2, ROUGE-L, and BLEU scores, demonstrating its superior capability to generate coherent, contextually relevant, and linguistically accurate responses. The responses generated by the fine-tuned lightweight LLMs were also analyzed for empathy and emotional supportiveness. These results underscore the viability of lightweight LLMs like BART-base for mental health applications, even in scenarios requiring nuanced and emotionally supportive communication.
Furthermore, the study highlights the critical role of a curated and diverse dataset in fine-tuning LLMs for specialized tasks such as mental health counseling. The dataset not only enhanced model performance but also served as a valuable foundation for fine-tuning LLMs to this sensitive domain. With further improvements in data quality and validation, it could enable the development of robust AI-powered mental health counseling tools.
Overall, this research demonstrates the potential of lightweight, accessible LLMs to contribute to mental health interventions, offering scalable and cost-effective solutions. Future research should explore refining these models for broader generalizability, integrating ethical frameworks to ensure safe deployment, and addressing the challenges of real-world implementation to advance the field of AI-driven mental health counseling support systems. It should also focus on curating and validating a more diversified and high-quality dataset to enhance the robustness and effectiveness of these models.
Data availability
The datasets used in this study were collected from the following sources: the Mental Health Counselling Dataset on Hugging Face, the CounselChat data on GitHub, the Counsel Chat dataset for counseling dialogues on Hugging Face, and the Mental Health Counseling Conversations dataset on Hugging Face.
References
WHO. World mental health report. https://www.who.int/teams/mental-health-and-substance-use/world-mental-health-report.
India State-Level Disease Burden Initiative Mental Disorders Collaborators. The burden of mental disorders across the states of India: The Global Burden of Disease Study 1990–2017. Lancet Psychiatry 7, 148–161. https://doi.org/10.1016/S2215-0366(19)30475-4 (2020).
National mental health survey. https://science.thewire.in/health/the-case-to-expand-psychiatric-education-for-mbbs-students/.
Bubeck, S. et al. Sparks of artificial general intelligence: Early experiments with GPT-4. Preprint at arXiv:2303.12712 (2023).
The rise of AI in mental health care. https://trendsresearch.org/insight/smart-therapy-solutions-the-rise-of-ai-in-mental-health-care/.
Denecke, K., Abd-Alrazaq, A. & Househ, M. Artificial intelligence for chatbots in mental health: Opportunities and challenges. https://doi.org/10.1007/978-3-030-67303-1_10 (2021).
Google. T5 small. https://huggingface.co/google-t5/t5-small (2023).
Facebook. Bart base. https://huggingface.co/facebook/bart-base (2023).
Google. Flan-t5 small. https://huggingface.co/google/flan-t5-small (2023).
Microsoft. Godel v1.1 large seq2seq. https://huggingface.co/microsoft/GODEL-v1_1-large-seq2seq (2023).
D’Alfonso, S. AI in mental health. Curr. Opin. Psychol. 36, 112–117. https://doi.org/10.1016/j.copsyc.2020.04.005 (2020).
Feng, X., Hu, M. & Guo, W. Application of artificial intelligence in mental health and mental illnesses. In Proceedings of the 3rd International Symposium on Artificial Intelligence for Medicine Sciences, 506–511, https://doi.org/10.1145/3570773.3570834 (Association for Computing Machinery, New York, NY, USA, 2022).
Darzi, P. Could artificial intelligence be a therapeutic for mental issues?. Sci. Insights 43, 1111–1113. https://doi.org/10.15354/si.23.co132 (2023).
Shimada, K. The role of artificial intelligence in mental health: A review. Sci. Insights 43, 1119–1127. https://doi.org/10.15354/si.23.re820 (2023).
Abd-Alrazaq, A. A. et al. An overview of the features of chatbots in mental health: A scoping review. Int. J. Med. Inform. 132, 103978. https://doi.org/10.1016/j.ijmedinf.2019.103978 (2019).
van der Schyff, E., Ridout, B., Amon, K., Forsyth, R. & Campbell, A. Providing self-led mental health support through an artificial intelligence-powered chatbot (Leora) to meet the demand of mental health care. J. Med. Internet Res. 25, e46448. https://doi.org/10.2196/46448 (2023).
Rathnayaka, P. et al. A mental health chatbot with cognitive skills for personalised behavioural activation and remote health monitoring. Sensors https://doi.org/10.3390/s22103653 (2022).
Peng, C. et al. A study of generative large language model for medical research and healthcare. npj Digital Med. 6, 210. https://doi.org/10.1038/s41746-023-00958-w (2023).
Singhal, K. et al. Towards expert-level medical question answering with large language models (2023). arXiv:2305.09617.
Yang, H. et al. One LLM is not enough: Harnessing the power of ensemble learning for medical question answering. https://doi.org/10.1101/2023.12.21.23300380 (2023).
He, Y., Zhu, Z., Zhang, Y., Chen, Q. & Caverlee, J. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition (2020). arXiv:2010.03746.
Stade, E. C. et al. Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. https://doi.org/10.1038/s44184-024-00056-z (2024).
Lai, T. et al. Supporting the demand on mental health services with AI-based conversational large language models (LLMs). BioMedInformatics 4, 8–33. https://doi.org/10.3390/biomedinformatics4010002 (2024).
Yadav, M., Sahu, N. K., Chaturvedi, M., Gupta, S. & Lone, H. R. Fine-tuning large language models for automated diagnostic screening summaries (2024). arXiv:2403.20145.
Zheng, Z., Liao, L., Deng, Y. & Nie, L. Building emotional support chatbots in the era of llms (2023). arXiv:2308.11584.
Galatzer-Levy, I. R., McDuff, D., Natarajan, V., Karthikesalingam, A. & Malgaroli, M. The capability of large language models to measure psychiatric functioning. arXiv preprint arXiv:2308.01834, https://doi.org/10.48550/arXiv.2308.01834 (2023).
McDuff, D., Schaekermann, M., Tu, T. et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164, https://doi.org/10.48550/arXiv.2312.00164 (2023).
Solaiman, I. et al. Evaluating the social impact of generative AI systems in systems and society. arXiv preprint arXiv:2306.05949, https://doi.org/10.48550/arXiv.2306.05949 (2023).
Malgaroli, M. et al. Large language models for the mental health community: Framework for translating code to care. Lancet Dig. Health 7, e282–e285. https://doi.org/10.1016/S2589-7500(24)00034-6 (2024).
Aditya. Mental health counselling dataset. https://huggingface.co/datasets/Aditya149/Mental_Health_Counselling_Dataset (2023).
Bertagnolli, N. Counselchat dataset. https://github.com/nbertagnolli/counsel-chat/blob/master/data/counselchat-data.csv (2020).
Bertagnolli, N. Counsel chat: A dataset for counseling dialogues. https://huggingface.co/datasets/nbertagnolli/counsel-chat (2023).
Amod. Mental health counseling conversations. https://huggingface.co/datasets/Amod/mental_health_counseling_conversations (2023).
BART model overview. https://huggingface.co/docs/transformers/en/model_doc/bart.
Hugging Face. ROUGE metric (2023).
GeeksforGeeks. NLP: BLEU score for evaluating neural machine translation in Python. https://www.geeksforgeeks.org/nlp-bleu-score-for-evaluating-neural-machine-translation-python/ (2023).
Hugging Face. Perplexity in Transformers. https://huggingface.co/docs/transformers/en/perplexity (2023).
Funding
Open access funding provided by Manipal Academy of Higher Education, Manipal.
Author information
Authors and Affiliations
Contributions
Ritesh Maurya conceptualised and designed the study and prepared the manuscript. Nikhil Rajput and M. G. Diviit contributed to the implementation of the proposed framework. Satyajit Mahapatra contributed to data collection and critically reviewed the manuscript. Manish Kumar Ojha provided valuable insights into the testing of the proposed framework.
Corresponding author
Ethics declarations
Competing interest
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Maurya, R., Rajput, N., Diviit, M.G. et al. Exploring the potential of lightweight large language models for AI-based mental health counselling task: a novel comparative study. Sci Rep 15, 22463 (2025). https://doi.org/10.1038/s41598-025-05012-1
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-05012-1