Introduction

India is a vast country where more than 151 million people depend on agriculture for their livelihood. Approximately 60% of the Indian population works in this sector, which contributes roughly 18% to India’s GDP1. The green revolution of the 1960s and 1970s brought drastic changes to Indian agricultural practices, in which the wealthy gained power and wealth at the expense of farmers2. In the 1990s, liberalization and globalization also reshaped Indian agricultural policy: open markets were prioritized, and the entry of international companies into the agriculture sector adversely affected small farmers3. Policy revisions of the 1990s also reduced public investment in agriculture, contributing to farmer suicides and poor mental health3. Owing to these pressures, the share of the population engaged in farming had fallen to 32% by the end of 2022, and farmers continued to suffer losses4. Many Indian farmers have limited formal education and are therefore unable to present their demands effectively to the government, a situation that some political parties attempt to exploit. As a result, Indian farmers suffer from mental distress, depression, and anxiety, and the suicide rate among them continues to rise. According to data from the National Crime Records Bureau (NCRB), more than 358,164 farmers committed suicide between 1995 and 2019, and the situation remains largely unchanged5. Reports show that the farmer suicide rate in India is 1.5 times higher than the national average6. These facts show that the condition of farmers in India is poor and that action must be taken to reduce the suicide rate and improve their mental health. This study seeks to explore several questions related to the development of an intelligent model to examine the mental health of farmers in India. It aims to understand the primary mental health challenges faced by Indian farmers and to investigate whether vocal and speech-based characteristics can effectively serve to detect mental health conditions. Another important area of inquiry is how such a digital model can be integrated into rural healthcare systems to enable mental health screening. The study also considers the usability and accessibility challenges associated with deploying this AI-based technology for mental health in rural settings.

To overcome these shortcomings, the present study introduces a novel deep learning–based diagnostic model that assesses the mental health of Indian farmers by analyzing their spoken responses to a structured questionnaire. The farmers’ responses, collected in the vernacular, are converted into spectrogram representations and then analyzed using a convolutional neural network (CNN). We use CNNs rather than traditional machine learning models because of their ability to capture spatial hierarchies of features from the spectrogram inputs automatically, without manual feature engineering. The purpose of the paper is to develop and validate a scalable and user-friendly CNN architecture for assessing the mental health of Indian farmers, with a focus on stress levels, coping mechanisms, and social support.

Research Hypotheses:

H1

A farmer voice-based CNN model can classify mental health status with higher accuracy than traditional approaches.

H2

The system will exhibit good usability, allowing farmers to perform tasks efficiently and satisfactorily.

To test these hypotheses, a structured questionnaire was constructed that included questions about stress, coping strategies, and social support. Information was collected from 350 farmers through face-to-face communication in the field. The audio data set is used to train and test the proposed model. In addition to predictive performance, the validation of the model was structured according to six usability aspects: learnability, efficiency, configurability, satisfaction, understandability, and effectiveness. This two-fold approach (i.e., focus on technical performance and user-driven validation) guarantees that the model is not only precise but also practical for rural use.

The contribution of this work is therefore twofold: it offers (i) a CNN-based mental health assessment model designed for the agricultural community, and (ii) an evidence-based usability validation framework that demonstrates the feasibility of adopting the model in rural health outreach programs. By addressing the gaps in availability and diagnostic reliability, this study provides a replicable model to facilitate intervention and policy measures aimed at improving farmer well-being in India.

This paper is organized into five further sections. Section “Related work” reviews the related literature, Sect. “Problem statement and solution” discusses the problem statement and its possible solution, Sect. “Proposed methodology” describes the proposed work, Sect. “Results and discussions” analyzes the results, and the final section presents the conclusion of this research along with its future scope.

Related work

Reasons for poor mental health of Indian farmers

Agriculture is the pillar of the Indian economy, employing more than 151 million people, almost 60% of the total population. Despite this socioeconomic significance, Indian farmers suffer severe mental distress driven by a host of socioeconomic, environmental, and policy-induced causes. These include chronic financial indebtedness, fluctuating crop production due to unpredictable climate patterns, weak government support mechanisms, volatile market prices, and recurrent natural disasters in the form of droughts, floods, and cyclones1. In many instances, such stressors are compounded by social pressures, lack of awareness, and the absence of community-focused mental health services. Consequently, farmers’ mental health remains a greatly overlooked aspect of public health care in rural India.

Agrarian policies have historically had a deep influence on farmers’ mental well-being. The Green Revolution of the 1960s, for example, vastly increased agricultural production but benefited wealthy farmers disproportionately; by widening economic disparities, it unintentionally pushed small farmers to the margins and created new sources of mental health stress2. Similarly, the economic liberalization and globalization policies of the 1990s opened India to international competition, exposing domestic farmers to price fluctuations, withdrawal of subsidies, and reduced bargaining power3. These changes increased operational risk for marginal farmers and heightened their existing vulnerabilities, creating conditions conducive to psychological distress.

The structural weaknesses of the agrarian economy were starkly revealed in 2018, when frustration turned into mass farmer protests. Around 200 farmers protested in Delhi, Mumbai, and other cities, demanding improved minimum support prices (MSP) and special parliamentary debates on agrarian distress. These protests soon spread across states, sometimes turned violent, and led to injuries and deaths, including suicides, highlighting the extent of the attendant mental health concerns6. Statistical analyses indicate that the suicide rate among Indian farmers is 47% higher than the national rate: almost 48 suicides are registered each day, and about 16,500 suicides occur each year7.

The COVID-19 pandemic further intensified these strains, causing unprecedented shocks throughout the agricultural value chain. Farmers struggled with acute labour shortages, reduced access to seeds and fertilizers, disrupted transport logistics, and restricted market access. Financial relief from government agencies was often delayed or inadequate, deepening desperation and hopelessness in the agricultural community. Suicides among farm workers increased by 18% in 2020 relative to preceding years, as the compounding effects of economic insecurity and the health crisis took their toll. These figures underscore the cumulative stress farmers face at the nexus of policy failure, economic instability, and health crises.

Present state of research in Indian farmers’ mental health

There is growing academic focus on the mental well-being of Indian farmers owing to increasing rates of suicide and widespread psychosocial stress in agricultural populations. A 2018 economic study recorded 13,754 farmer suicides in 2012 and cited underlying socioeconomic causes, including fragmentation of land, crop failure, and cycles of indebtedness10. Kallakuri et al.12 examined mental well-being in 12 Andhra Pradesh villages and reported rates of anxiety (10.8%), depression (14.4%), and suicidal thoughts (3.5%). These statistics highlight the necessity of incorporating mental health services into the primary care system in rural areas.

A 2019 systematic review collated key risk indicators such as environmental fluctuation, insecure sources of income, chronic illness, and isolation from institutional support networks11. In a seminal study of South Indian farmers, more than 90% of respondents exhibited symptoms of depression, highlighting the magnitude of mental distress in rural areas12. A 2020 study in Maharashtra reported convergent findings, revealing anxiety and sleeplessness in 55% of the sample and somatic symptoms in 34.7%; the mean mental health index was 0.58 with a standard deviation of 0.49, reflecting far-reaching psychological impairment13.

Chitrasena et al.14 underscored that declining market returns and falling commodity prices have exacerbated the psychological load on farming households. A meta-analysis of 92 peer-reviewed papers documented a range of methodological approaches to mental health measurement—quantitative surveys, ethnographic case studies, empirical field research, and mixed-method designs—underscoring the dynamic interplay between environmental stressors and mental health15. The evidence clearly shows that mental health in rural India cannot be considered in isolation but must be recognized as a multidimensional construct driven by structural, economic, and environmental determinants.

The COVID-19 pandemic exacerbated these conditions, triggering chain reactions of social, financial, and mental health crises in the agrarian sector. Various post-pandemic studies have documented these multi-faceted effects, focusing on both immediate and residual mental health consequences16. A targeted 2022 review of 29 papers evaluated the link between disasters and psychological effects in farmer populations; the evidence consistently showed elevated rates of post-traumatic stress, anxiety disorders, and depression in the wake of such incidents17. These effects call for interdisciplinary action spanning public health, social welfare, and agricultural development.

In response, technological interventions have also been considered as mitigating tools. Internet-based applications and online counselling portals have proven to be potentially scalable and accessible avenues for providing mental health assistance to farmers. These digital interventions not only help individual farmers but also empower paramedical staff with decision-making support and remote monitoring systems18. Meanwhile, global climate change—marked by rising temperatures, unpredictable monsoons, and depleting groundwater levels—has increased rural stress. Such environmental deterioration influences mental health both directly, through physical exhaustion and crop losses, and indirectly, through economic insecurity and pressure to migrate. Researchers advocate multi-stakeholder interventions involving policy reforms, community mental health activities, government-run counselling services, and resilience-based models adapted to agrarian contexts19,20.

Problem statement and solution

The literature review shows growing recognition of the need for mental health interventions for farmers in India, including counselling and support services. There is also a growing trend toward using digital technologies such as mobile health (mHealth) and telemedicine. An intelligent computer-based digital model can be used to examine farmers’ mental health, yet this research area is still in its early stages in India. Although some studies and initiatives have aimed to use digital technologies to improve access to mental health services for farmers in India, much more work is needed to fully understand the potential benefits and challenges of such models.

To address the issues discussed above, research should be carried out on the development of intelligent computer-based models that provide efficient, easy-to-use tools for examining farmers’ mental health. This paper is an attempt in this direction: a CNN-based intelligent computer model is proposed to examine the mental health of Indian farmers. Because Indian farmers are often unfamiliar with the latest technologies, data are collected in the form of audio recorded while questions are asked of them. The questionnaire is prepared based on the different factors that affect the mental health of farmers.

Intelligent computer-based digital model

The Intelligent Computer-Based Digital Model is proposed to provide an efficient assessment tool for farmers’ mental health. The model utilizes artificial intelligence and machine learning techniques to examine the mental health status of Indian farmers by processing audio data, considering different factors such as stress levels, coping mechanisms, and social support systems. The model can be used to monitor the well-being of farmers over time, providing early warning signs of potential mental health problems.

Convolutional neural network

A CNN is a widely used deep neural network, applied mainly in computer vision, that also performs well on audio datasets represented as spectrograms. A CNN comprises several layers. The input layer receives the input data on which subsequent operations are performed. The convolutional layers are responsible for extracting local features from the input. The pooling layer then summarizes the features generated by the previous layer. Just before the output layer there is a fully connected layer, which connects neurons across layers23. This basic CNN architecture is shown in Fig. 1. In this work, vocal data captured from Indian farmers while they answer the questionnaire serve as the input and are classified into seven categories (anger, disgust, fear, happiness, sadness, astonishment, and neutral).

Fig. 1
figure 1

Structure of the proposed CNN model detailing sequence and construct of convolution, pooling and fully connected layers for processing spectrogram inputs to predict emotion categories.

Reinforcement learning

Reinforcement learning is an algorithm in which an intelligent agent seeks to maximize reward points by taking appropriate actions21,22. In this work, farmers are treated as agents, and the accuracy of the results together with the usability factor values is treated as the reward. The proposed model applies reinforcement learning (RL) whenever the accuracy of the results is low or a usability factor value falls below 0.5. Neither condition arose in this work, since both accuracy and the usability factor values were high; nevertheless, we cross-validated the model with reinforcement learning, as described in Sect. "Quantitative evaluation of the proposed model".

FMHA Action required for updating the proposed model to examine the farmer’s mental health (FMH), depending on the initial results.

FMHIR Initial results to examine farmer’s mental health and usability factor value.

FMHRP Reward points to validate the proposed model.

FMHP Policy decided for farmer’s mental health model according to FMHRP.

FMHQ Changes required in the strategy for updating the questionnaire.

Learning rate (β): 0.1–0.001.

Discount Factor (γ): 0–1.

Algorithm 1
figure a

Reinforcement learning
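To make the feedback loop above concrete, the following minimal sketch shows a tabular Q-learning-style update using the notation defined earlier (FMHIR as state, FMHA as action, FMHRP as reward, FMHP as the greedy policy over the learned values). The state and action sets, the reward rule, and the function names are illustrative assumptions rather than the implementation used in this study.

```python
import numpy as np

# Minimal tabular Q-learning sketch of the feedback loop described above.
# States (FMHIR): discretized initial results, e.g. "low_accuracy", "low_usability", "acceptable".
# Actions (FMHA): possible updates to the model or questionnaire (FMHQ).
# Rewards (FMHRP): derived from accuracy and the usability factor value.
# All names and the reward rule are illustrative assumptions, not the paper's code.

states = ["low_accuracy", "low_usability", "acceptable"]
actions = ["retrain_model", "revise_questionnaire", "no_change"]

beta = 0.01    # learning rate, within the 0.001-0.1 range stated above
gamma = 0.9    # discount factor, within the 0-1 range stated above

Q = np.zeros((len(states), len(actions)))   # action-value table (FMHP is greedy over Q)

def reward(accuracy: float, usability: float) -> float:
    """FMHRP: positive reward only when both accuracy and usability exceed 0.5."""
    return 1.0 if accuracy > 0.5 and usability > 0.5 else -1.0

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """Standard Q-learning update: Q(s,a) += beta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += beta * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# Example of one feedback step: low usability observed, questionnaire revised,
# new evaluation yields accuracy 0.99 and usability 0.92 -> positive reward.
q_update(s=states.index("low_usability"),
         a=actions.index("revise_questionnaire"),
         r=reward(accuracy=0.99, usability=0.92),
         s_next=states.index("acceptable"))
```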

Proposed methodology

The proposed iCDAM (Intelligent Computer-Based Digital Model) methodology to assess the mental health of Indian farmers is a multimodal audio processing framework that leverages deep learning for emotion-driven inference. The complete methodology is described in Fig. 2 and consists of the following phases:

Fig. 2
figure 2

Architecture of the proposed intelligent computer-based digital model (iCDAM) showing fused audio input pre-processing, spectrogram generation, and CNN-based classification to assess farmers’ mental health.

Collect farmer’s requirements

Initially, Indian farmers’ requirements are gathered, for which a survey was conducted with Indian farmers of different geographical regions.

Identification of problem

After interacting with Indian farmers of different geographical regions, it was identified that there is a lack of communication between farmers and others (leaders, government, professional bodies, etc.). Indian farmers are not educated enough to be able to use the latest tools and technologies and communicate their problems efficiently. They need an efficient and intelligent computer model to share their opinions and thoughts. Therefore, we have decided to propose an intelligent model that can be trained using audio data from Indian farmers.

Vocal dataset collection using a questionnaire for training of the model

A questionnaire was prepared considering stress levels, coping mechanisms, social support systems, and related factors, and vocal data from Indian farmers were collected while they answered the questions. Inclusion criteria were farmers (aged 20–65 years) who had farmed for at least the last five consecutive years, were able to understand and converse in Hindi or the local dialect, and were willing to participate and be audio recorded; farmers with a diagnosed severe psychiatric disorder or other disorders, hearing or speech impairments, or unwillingness to have their sessions recorded were excluded. Data were collected in five local sessions, in each of which approximately 70 farmers participated after giving informed consent and receiving a brief introduction to the objectives of the study. A pretested structured questionnaire—based on a review of the literature, modifications of the WHO well-being index, and expert opinion—was used to assess stress (frequency, intensity, and causes), coping (adaptive and maladaptive), and social support (family, peers, and societal institutions). To remove literacy barriers and capture natural language use, participants’ responses were recorded as audio files using handheld digital recorders (44.1 kHz), with each session limited to 40 min to keep the recordings consistent. The recordings were later transformed into spectrograms for CNN-based analysis, because the spectrogram is a widely used visual representation in speech emotion recognition that allows hierarchical features to be learned without manual feature extraction. As part of model training, participant feedback was used to assess the six usability constructs—learnability, efficiency, effectiveness, configurability, satisfaction, and understandability—based on the ISO 9241-11 human–computer interaction standard, to ensure that the system is not only technically robust but also operationally applicable in rural healthcare settings. The audio dataset was annotated according to factors present in the responses, such as emotion, behaviour, and average response delay, into seven categories: anger, disgust, fear, happiness, sadness, astonishment, and neutral. The collected audio data are shown in Fig. 3.

Fig. 3
figure 3

The bar plot represents the amplitude distribution of audio samples collected from the farmers via questionnaire, showcasing signal variation across time frames. https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en.

Dataset for validation of the proposed model

This work uses a gender-dominated Hindi speech database known as Common Voice Hindi (Indian), assumed to be a subset of the Mozilla Common Voice corpus filtered and recorded specifically for Indian accents. The dataset consists of 2498 audio samples in MP3 format. Each audio clip is short, on the order of 10 s, making the dataset amenable to real-time or lightweight scenarios. Dataset link: https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en

Data pre-processing and splitting

The following pre-processing steps were applied to improve the reliability and generalization of the model. First, noise filtering: a Butterworth bandpass filter (300 Hz–3.4 kHz) was applied. Second, non-speech removal: silence periods and non-speech segments were removed using an energy-based voice activity detection (VAD) method. Third, segmentation: the audio files were split into fixed-length windows of 2 s with 25% overlap to retain context. Finally, the signal amplitude was normalized using the root-mean-square (RMS) energy. Figure 4 shows the pre-processed data, displaying the original audio waveforms alongside their trimmed and normalized versions.

Fig. 4
figure 4

Audio data collected from farmers via questionnaire, (a) Original Waveform and (b) Trimmed and Normalized Waveforms.
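The following sketch illustrates how these four steps could be implemented in Python with Librosa and SciPy. The filter order, the VAD threshold (top_db), and the function structure are assumptions made for illustration; only the stated parameters (300 Hz–3.4 kHz band, 2 s windows, 25% overlap, RMS normalization) come from the text above.

```python
import numpy as np
import librosa
from scipy.signal import butter, sosfiltfilt

SR = 44100          # recording sample rate stated in the data collection section
WIN_SEC = 2.0       # fixed-length analysis window
OVERLAP = 0.25      # 25% overlap between consecutive windows

def preprocess(path: str) -> list[np.ndarray]:
    """Illustrative pre-processing pipeline: bandpass filtering, energy-based
    silence removal, fixed-length segmentation with overlap, and RMS normalization.
    Parameter choices follow the text; the exact implementation is an assumption."""
    y, sr = librosa.load(path, sr=SR)

    # 1. Noise filtering: 4th-order Butterworth bandpass, 300 Hz - 3.4 kHz.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    y = sosfiltfilt(sos, y)

    # 2. Energy-based removal of silent / non-speech segments (simple VAD).
    intervals = librosa.effects.split(y, top_db=30)
    y = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y

    # 3. Segmentation into 2 s windows with 25% overlap.
    win = int(WIN_SEC * sr)
    hop = int(win * (1 - OVERLAP))
    segments = [y[i:i + win] for i in range(0, max(len(y) - win, 1), hop)]

    # 4. Amplitude normalization by root-mean-square energy.
    return [seg / (np.sqrt(np.mean(seg ** 2)) + 1e-8) for seg in segments]
```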

Spectrogram generation

The cleaned audio was converted into 2D logarithmic spectrograms using the following parameters: window size 512 samples, hop length 256 samples, 128 Mel bands, and a spectrogram resolution of 224 × 224 pixels, implemented with the Librosa library (Python). Mathematically, the audio signal is converted into a time–frequency representation using the short-time Fourier transform (STFT), as given in Eq. (1):

$$X[n,t] = \sum_{l=1}^{L} x[t \cdot H + l]\, w[l]\, e^{-i\, 2\pi n l / N}$$
(1)

where

\(X[n,t]\) is the complex-valued spectrogram at frequency bin \(n\) and time frame \(t\); \(x[t \cdot H + l]\) is the audio signal sample at index \(t \cdot H + l\); \(w[l]\) is a window function (e.g., Hann or Hamming) applied to a short segment of the audio signal; \(e^{-i\,2\pi n l / N}\) is the complex exponential, with \(N\) the FFT length and \(l\) the sample index within the window; \(L\) is the number of samples in the window; and \(H\) is the hop size (overlap) between consecutive windows.

These spectrograms were saved as image arrays and used as input to the CNN, as shown in Fig. 5.

Fig. 5
figure 5

Spectrogram representation of the preprocessed dataset, displayed as image arrays with a resolution of 224 × 224 pixels.
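A minimal Librosa-based sketch of this step is given below, assuming the stated parameters (512-sample window, 256-sample hop, 128 Mel bands, 224 × 224 output); the use of scikit-image for resizing and the min–max normalization are illustrative choices, not necessarily those of the original implementation.

```python
import numpy as np
import librosa
from skimage.transform import resize

def to_log_mel_spectrogram(segment: np.ndarray, sr: int = 44100) -> np.ndarray:
    """Illustrative spectrogram generation using the parameters stated above
    (window 512, hop 256, 128 Mel bands, 224x224 output). A sketch, not the
    authors' exact code."""
    # Mel-scale power spectrogram via the short-time Fourier transform (Eq. (1)).
    S = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=512, hop_length=256, n_mels=128
    )
    # Logarithmic (dB) scaling of the power spectrogram.
    S_db = librosa.power_to_db(S, ref=np.max)

    # Rescale to the fixed 224 x 224 image resolution used for the CNN input.
    img = resize(S_db, (224, 224), anti_aliasing=True)

    # Min-max normalize to [0, 1] so the array can be stored and used as an image.
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    return img.astype(np.float32)
```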

In addition, the dataset was divided into training, validation, and test subsets comprising 70%, 15%, and 15% of the data, respectively. Usability was evaluated along six dimensions: learnability (ratio of tasks completed to total tasks), efficiency (proportion of farmers who completed the task within the time budget), effectiveness (average completion ratio across sessions), as well as configurability, satisfaction, and understandability (rated values). For all usability scores, descriptive statistics (mean and standard deviation percentages) were calculated. Inferential analyses included chi-square tests to compare completion rates across demographic variables (e.g., age, education) and t-tests or analysis of variance (ANOVA) to compare efficiency across groups. A significance level of p < 0.05 was used. Data preprocessing and CNN training were performed in Python 3.11 using TensorFlow 2.15, Keras, and Librosa; statistical analysis was performed using SPSS v26 and Python packages (scipy, statsmodels).
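The sketch below illustrates a stratified 70/15/15 split and the kind of chi-square and t-test comparisons described above. All arrays and counts in it are synthetic placeholders (the real spectrograms, labels, completion counts, and response times are not reproduced here), so the printed p values are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from scipy import stats

# Synthetic stand-ins for the spectrogram arrays and labels produced above
# (illustrative only; 350 participants, seven annotated emotion categories).
rng = np.random.default_rng(42)
X = rng.random((350, 224, 224))
y = rng.integers(0, 7, size=350)

# 70% / 15% / 15% train / validation / test split, stratified by label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# Inferential checks analogous to those described above (hypothetical counts and times):
# chi-square on completion vs. non-completion across two education groups,
# and an independent-samples t-test on efficiency (response time) between two groups.
completion_table = [[170, 10], [166, 4]]                     # assumed contingency counts
chi2, p_chi, _, _ = stats.chi2_contingency(completion_table)
times_a = rng.normal(32, 4, 100)                             # assumed response times (min)
times_b = rng.normal(34, 4, 100)
t_stat, p_t = stats.ttest_ind(times_a, times_b)
print(f"chi-square p = {p_chi:.3f}, t-test p = {p_t:.3f} (alpha = 0.05)")
```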

Working of the proposed model

In this section, we present the complete working of the proposed 2D-CNN classification model, whose 13 layers are listed in Table 1 and visualized in Fig. 6. Vocal spectrogram images of size 128 × 128 × 3 (height, width, RGB channels) are fed into the input layer of the proposed system. These spectrogram images are derived from the audio responses gathered using the farmer-interaction questionnaire system.

Table 1 Internal architecture of proposed model.
Fig. 6
figure 6

Flowchart of the proposed model architecture illustrating the key components and sequential processing steps.

The methodology begins with identifying the farmer’s requirements and collecting vocal data through structured questionnaires. These audio responses are preprocessed using trimming, normalization, and spectrogram generation using the Mel scale. The resulting spectrogram images are resized to 128 × 128 × 3 for CNN compatibility.

These input spectrograms are passed through the first Conv2D layer, which generates a feature map of size 126 × 126 × 32. This layer uses 32 filters of size 3 × 3 and applies ReLU activation for non-linearity. The shape of the output data is computed using Eq. (2):

$$\left( {OUT_{H} ,OUT_{W} } \right) = \left( {\frac{{{\text{h}} + 2{\text{p}} - {\text{fh}}}}{s} + 1,\frac{{{\text{w}} + 2{\text{p}} - {\text{fw}}}}{s} + 1} \right)$$
(2)

where \({OUT}_{H}\) and \({OUT}_{W}\) are the output height and output width.

h, w = input height and width; p = padding size; fh, fw = filter height and width; s = stride. For the first convolutional layer (h = w = 128, p = 0, fh = fw = 3, s = 1), Eq. (2) gives (128 − 3)/1 + 1 = 126, which matches the 126 × 126 × 32 feature map. This is followed by batch normalization, which stabilizes and accelerates the training process. Next, MaxPooling2D with a 2 × 2 filter and stride of 2 reduces the spatial dimensions to 63 × 63 × 32.

This stage is passed to the second Conv2D layer, which increases the depth to 64 while reducing the spatial size to 61 × 61. ReLU is applied for non-linearity, followed again by Batch Normalization and MaxPooling2D, resulting in a size of 30 × 30 × 64.

The process continues through a third Conv2D layer, expanding the depth to 128 and reducing the size to 28 × 28. Again, Batch Normalization and MaxPooling2D reduce the shape to 14 × 14 × 128. These progressive convolutions help capture deeper hierarchical features. The final feature map is flattened into a 1D vector of size 25,088, which is passed into a dense layer with 128 neurons and ReLU activation. To mitigate overfitting, dropout is applied with a rate of 0.5. The final dense layer uses a SoftMax activation function to classify the input into one of four categories (e.g., different levels or types of mental health states). The SoftMax function computes the probability distribution across the target classes, defined as:

$$\sigma \left( z \right)_{i} = \frac{{e^{{z_{i} }} }}{{\mathop \sum \nolimits_{j = 1}^{n} e^{{z_{j} }} }}$$
(3)

where \({z}_{i}\) is the input score for class \(i\), and n is the total number of classes.

The output values range between 0 and 1 and sum up to 1, allowing confident classification. Post-inference, the model’s predictions are validated using usability factors such as accuracy, interpretability, and consistency. If the validation score exceeds a predefined usability threshold (e.g., > 0.5), the model is deemed successfully validated. Otherwise, the system uses reinforcement learning to refine the questionnaire, iteratively improving the input data quality and model accuracy.
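The layer stack described above can be sketched in Keras as follows. The layer sizes reproduce those reported in the text (126 × 126 × 32 down to the 25,088-element flattened vector); the optimizer, loss function, and the choice of seven output classes are assumptions made for illustration, since the text mentions both seven emotion categories and a four-class mental-health grouping.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_icdam_cnn(num_classes: int = 7) -> tf.keras.Model:
    """Sketch of the 2D-CNN described above (input 128x128x3, three
    Conv-BatchNorm-MaxPool stages, Dense(128) + Dropout(0.5), softmax output).
    Hyperparameters not stated in the text (optimizer, loss) are assumptions."""
    model = models.Sequential([
        layers.Input(shape=(128, 128, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),    # -> 126 x 126 x 32
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),                     # -> 63 x 63 x 32
        layers.Conv2D(64, (3, 3), activation="relu"),    # -> 61 x 61 x 64
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),                     # -> 30 x 30 x 64
        layers.Conv2D(128, (3, 3), activation="relu"),   # -> 28 x 28 x 128
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),                     # -> 14 x 14 x 128
        layers.Flatten(),                                # -> 25,088
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"), # Eq. (3)
    ])
    model.compile(optimizer="adam",                      # assumed optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_icdam_cnn(num_classes=7)   # seven emotion categories; the text
# model.summary()                          # also mentions a four-class variant
```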

Usability factors to validate the proposed model

To validate the effectiveness of this model, a usability test was performed using six different usability factors. If the value of each factor is greater than 0.5, the proposed model can be said to provide an effective solution for examining farmers’ mental health. The usability factors used for this validation are as follows.

Learnability

Learnability can be calculated by dividing the number of tasks a user can achieve by the total number of tasks22. Farmers were asked questions, and their audio responses were captured while they answered. This experiment was carried out on 350 farmers, of whom 337 answered all the questions (100%), 8 answered 94% of the questions, and the remaining 5 answered 92% of the questions. Learnability is calculated as the average of these responses using the following equation.

$$Learnability = { }\frac{{No.{ }of{ }achievable{ }Tasks}}{{Total{ }Number{ }of{ }Tasks}}$$
(4)

From the above equation, the estimated value of learnability is approximately 95.33%, the mean of the three observed completion rates ((100 + 94 + 92)/3).

Efficiency

The model’s efficiency can be gauged by how long it takes farmers to answer. A total of 40 min was given to perform this task; of the 350 farmers, 336 answered all the questions in time, while the remaining farmers did not answer all the questions within that time. The average efficiency of the proposed model can therefore be taken as 336/350 ≈ 0.96, or 96%.

Configurability

It can be calculated using the following equation24:

$$Configurability = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {if\; easily \;configured\; in\; another\; environment} \hfill \\ {0,} \hfill & {otherwise} \hfill \\ \end{array} } \right\}$$
(5)

The proposed model can be configured in any environment and updated by adding or changing questions, so its configurability value is 1. This value is intended to convey the theoretical adaptability of the model to different computing platforms, since the core system is implemented using standard Python-based modules. However, it should not be interpreted as empirical validation, as no multi-platform or cross-environment deployment was conducted.

Satisfaction

Satisfaction measures how many users are satisfied with the proposed model. 88% of farmers rated the model as excellent, 7% rated it as good and flexible, and the remaining farmers felt the proposed model needs some improvement. The overall satisfaction level can be taken as 0.95, or 95%.

Understandability

The existence of meta-information (EMI) can be used to measure the understandability of any model. The value of EMI is 1 if metadata exist; otherwise, its value is 024. To help farmers, all instructions were made available as metadata, and personal assistance was also provided. In this case, the value of EMI can be taken as 1.

Effectiveness

The percentage of tasks completed by farmers can be used to denote the effectiveness of the proposed model24. The experiment was carried out in five separate sessions, with 70 farmers participating in each session. The number of farmers who completed 100% of the tasks in successive sessions was 64, 67, 62, 63, and 65. The following equation is used to calculate the average value of effectiveness25.

$$\overline{E} = \frac{{\mathop \sum \nolimits_{y = 1}^{F} \mathop \sum \nolimits_{x = 1}^{N} m_{xy} }}{FN}*100$$
(6)

In this equation, N denotes the total number of sessions, F denotes the number of farmers in each session, and \(m_{xy}\) indicates whether farmer y completed the tasks in session x.

$$\overline{E} = \left( {\frac{64 + 67 + 62 + 63 + 65}{{70{*}5}}} \right){*}100$$
(7)
$$\overline{E} = \left( {\frac{321}{{350}}} \right){*}100$$
(8)
$$\overline{E} = 91.7\% \approx 0.92$$
(9)
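For reference, the following short sketch reproduces the arithmetic behind three of the reported usability values (learnability, efficiency, and effectiveness) using the figures given above; it is a worked example, not part of the original evaluation code.

```python
# Worked computation of three usability factors from the figures reported above.
completion_rates = [100, 94, 92]                        # % of questions answered by the three groups
learnability = sum(completion_rates) / len(completion_rates) / 100      # ~0.9533

efficiency = 336 / 350                                  # farmers finishing within the 40-minute budget

completed_per_session = [64, 67, 62, 63, 65]            # farmers completing 100% of tasks, 5 sessions of 70
effectiveness = sum(completed_per_session) / (70 * 5)   # Eq. (6): 321 / 350 ~ 0.92

print(f"Learnability ~ {learnability:.4f}, Efficiency ~ {efficiency:.2f}, Effectiveness ~ {effectiveness:.2f}")
```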

Results and discussions

Qualitative result analysis

The proposed CNN model was trained using 70% of the available dataset to evaluate its effectiveness in detecting and classifying emotions related to farmers’ mental health. The remaining data were split between validation and testing. The model was trained over 20 epochs. The training and validation curves are shown in Fig. 7.

Fig. 7
figure 7

Performance curves showing training and validation accuracy and loss for the proposed model.

The accuracy graph in Fig. 7a shows a rapid improvement in both training and validation accuracy during the first few epochs. Training accuracy reaches nearly 100%, and validation accuracy closely follows, peaking around 98–99% by epoch 10. This indicates that the model is highly effective in capturing patterns in the training data while maintaining strong performance on validation data, suggesting good generalization capability. However, there is minor fluctuation in validation accuracy after epoch 10, which might hint at early signs of overfitting, but the differences are relatively small. The model maintains high validation accuracy throughout the training process, demonstrating its robustness.

The loss graph in Fig. 7b complements the accuracy plot by showing a steep drop in training and validation loss initially. Training loss quickly converges near zero, and validation loss also drops significantly from above 5.0 to below 0.5 within the first few epochs. After epoch 5, both losses stabilize at low values. While the validation loss remains low, there are some small spikes, particularly after epoch 15. These could indicate moments of slight instability or noise in the validation data, but they do not significantly impact overall accuracy. The persistent low training loss and relatively flat validation loss trend suggest that the model is not severely overfitting. The model exhibits excellent training behaviour, achieving near-perfect accuracy with minimal loss. Validation metrics are consistently high, showing strong generalization to new data. Minor fluctuations in validation performance after several epochs are within acceptable limits and do not compromise the overall effectiveness of the model.

Upon completion of training, the model demonstrated strong generalization capabilities. It achieved a training accuracy of 99.67% with a validation accuracy of 99.73%, while maintaining a validation loss of just 10%, indicating minimal overfitting and high robustness.

The emotion-wise classification results are summarized in Table 2. The model performed consistently across all seven emotion categories—Anger, Astonishment, Disgust, Fear, Happiness, Neutral, and Sadness—achieving F1-scores ranging from 0.9800 to 0.9950, with particularly high performance in detecting Happiness (F1-score: 0.9950) and Disgust (F1-score: 0.9900). Even the lowest-performing category, Fear, achieved a high F1-score of 0.9800, which reflects the model’s effectiveness across varied emotional states.

Table 2 Performance evaluation of the proposed model showing precision, recall, and F1-scores for emotion classification.

The performance of the model is represented through the confusion matrix in Fig. 8, where the model demonstrates high precision and recall, as evidenced by the diagonal dominance in the confusion matrix. Most values lie on the diagonal, indicating that the model correctly classified most instances for each emotion. Specifically, happiness was predicted with perfect accuracy (100/100 correct), and disgust, neutral, and anger also show strong performance, with only 1 misclassification each. Fear, astonishment, and sadness had 2 misclassifications each but still retained high accuracy (98%). Anger was once confused with astonishment, and vice versa, possibly due to similar vocal intensities or expression patterns in the dataset. Fear and sadness were sometimes confused with each other, which is expected due to overlapping emotional cues. Despite these misclassifications, no class suffered from a major confusion pattern, and the errors appear to be minimal and randomly distributed rather than systematic.

Fig. 8
figure 8

Confusion matrix depicting the performance of the proposed emotion classification model across seven emotional categories. The matrix highlights the number of correctly and incorrectly classified instances for each emotion.

These results highlight the model’s ability to accurately identify and differentiate between emotional states that are indicative of farmers’ mental well-being. The overall accuracy of 98.71%, combined with macro- and weighted averages above 98.70%, reflects a well-balanced classifier with minimal bias towards any emotion class. The high performance proves that the proposed CNN model is highly effective for use in emotion recognition applications, particularly in the context of monitoring mental health in agricultural communities. This could prove invaluable in developing scalable, technology-driven solutions for mental health issues among farmers.

Statistical validation of the proposed model using a t-test

A pairwise t-test was performed on the F1-score of each emotion class to statistically verify the consistency of the performance of the developed deep learning model across emotion classes. Our goal was to determine whether the differences in model performance across emotion classes are statistically significant.

The F1-scores for each class (Table 3 and Fig. 9) varied between 0.9849 and 1.0000, with a macro-average of 0.9872 and a weighted average of 0.9872. The model was robust for most classes, such as Disgust, Neutral, and Happiness, with F1-scores of 0.99 or above.

Table 3 Performance evaluation of the proposed model across different emotion categories.
Fig. 9
figure 9

F1-score distribution across different emotion classes, illustrating the model’s performance in balancing precision and recall for each emotional category.

The boxplot shows that most F1-scores lie at or near the upper bound of 1.0, with the exception of two classes (sadness and astonishment), which fall slightly below it. This was tested statistically using a one-sample t-test of the class-wise F1-scores against the ideal benchmark value of 1.0.

The p value of the t-test provides statistical evidence of the consistency of the model’s performance. The null hypothesis (H0) was that the class-wise F1-scores did not deviate from the ideal score of 1.0, and the alternative hypothesis (H1) stated that there was a statistically significant deviation from the ideal score. Based on the analysis, the p value obtained was greater than 0.05, indicating that the observed differences are not statistically significant. As a result, we fail to reject the null hypothesis. Additionally, effect sizes (Cohen’s d) across all classes were < 0.2, confirming negligible practical differences. Furthermore, the 95% confidence intervals for the class-wise F1-scores (0.982–1.000) demonstrated narrow variability and high reliability. This combined evidence shows that variations in F1-scores across emotion classes are statistically non-significant, with trivial effect sizes, thereby confirming that the proposed model is both robust and uniformly consistent across emotional categories.
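The procedure can be reproduced with SciPy as sketched below. The F1 values in the array are placeholders spanning the range reported in Table 3 rather than the exact per-class figures, so the printed statistics will not match the paper's results; the sketch only illustrates the test, the effect-size computation, and the confidence interval.

```python
import numpy as np
from scipy import stats

# One-sample t-test of the class-wise F1-scores against the ideal benchmark of 1.0.
# The values below are illustrative placeholders within the reported range, not the
# exact per-class figures from Table 3.
f1_scores = np.array([0.9900, 0.9849, 0.9950, 0.9900, 1.0000, 0.9920, 0.9850])

t_stat, p_value = stats.ttest_1samp(f1_scores, popmean=1.0)

# Cohen's d for a one-sample comparison: (mean - benchmark) / sample standard deviation.
cohens_d = (f1_scores.mean() - 1.0) / f1_scores.std(ddof=1)

# 95% confidence interval for the mean F1-score.
ci = stats.t.interval(0.95, df=len(f1_scores) - 1,
                      loc=f1_scores.mean(),
                      scale=stats.sem(f1_scores))

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, d = {cohens_d:.3f}, 95% CI = {ci}")
```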

Proposed model validation on usability factors

Six usability factors are used to validate the effectiveness of the proposed model. These factors scored significantly high, with estimated values between 0.92 and 1.0, indicating that the model is not only valid but also easy to use and stable. The findings indicate that the model is suitable for field applications, especially given the sensitive nature of assessing farmers’ mental health. Configurability and understandability scored perfectly (1.0), meaning that users found the system highly intuitive and easy to adapt. Likewise, efficiency achieved a very high value (about 0.96), reflecting smooth operation and the ease with which information can be conveyed. Learnability and satisfaction were also high (around 0.95), which means that the system is easily learned by new users and generally satisfying to use.

Effectiveness received the lowest usability score (~ 0.92); this is, however, well above the lower limit of 0.5, which means that the model is reliably successful in reaching its purposes. This small drop points to an area for future optimization, such as collecting more training data or improving how users perceive the impact of their interactions with the interface.

The high usability scores further corroborate the efficacy of the model, not just in terms of predictive performance (99.67% accuracy) but also in terms of user interaction, trust, and practical utility. This finding establishes the model’s readiness for real-world deployment in mental health monitoring, especially in rural and agriculture-based localities requiring intuitive and dependable instruments.

These results of the validation test are summarized in Table 4, and Fig. 10 provides its graphical representation.

Table 4 Validation of the proposed model on the six usability factors.
Fig. 10
figure 10

Estimated usability scores across key evaluation factors, including learnability, efficiency, configurability, satisfaction, understandability, and effectiveness.

Quantitative Evaluation of the proposed model

Speech Emotion Recognition (SER) is a publicly available dataset that combines four different datasets (CREMA-D, SAVEE, TESS, and RAVDESS)26,33. This subsection evaluates the performance of the proposed 2D-CNN model, developed to classify and monitor the mental health of farmers using spectrogram representations of vocal data. The experiment was carried out using the SAVEE dataset, which contains labelled emotional speech data. After training on the training subset, the final training accuracy reached 99.77%, with a validation accuracy of 99.73%, indicating the model’s strong generalizability. The validation loss was observed to be 10%, indicating minimal overfitting and efficient learning. The testing phase further validated the model’s robustness: the test accuracy was approximately 99.12%, closely aligned with the training and validation accuracies. This consistency reinforces the effectiveness of the proposed CNN architecture in extracting deep emotional patterns from spectrograms. The overall precision of 99.20%, recall of 99.95%, and F1-score of 99.07%, reported in Table 5, confirm the model’s ability to detect emotional states accurately across genders. Furthermore, the training and validation accuracy and loss curves for 20 epochs are shown in Fig. 11. The graphs reflect steady learning with high accuracy and decreasing loss over time, and the validation curves closely follow the training curves, indicating minimal variance and a well-tuned model.

Table 5 Validation of the proposed CNN-based model on the SAVEE dataset.
Fig. 11
figure 11

Training and validation accuracy and loss graphs of the proposed model on the SAVEE dataset.

These results confirm the feasibility of the model for use in the real world in assessing the mental health of farmers through their vocal expressions. The integration of CNN with audio-derived spectrogram features has proven to be a powerful approach for emotion classification and mental state analysis.

The performance of the proposed model is also compared with existing state-of-the-art approaches on the SAVEE dataset, as given in Table 6.

Table 6 Performance comparison of proposed model with existing work on a similar dataset25.

Application of reinforcement learning for cross validation of the proposed model

To improve the performance and robustness of the CNN model, reinforcement learning (RL) was incorporated into the framework to support dynamic feature selection and classification decision-making. The combined model (CNN + RL) achieved precision, recall, F1-score, and accuracy of 0.99639, 0.99652, 0.99634, and 0.99638, respectively. Interestingly, the same results were obtained with the baseline CNN model without the RL component. This suggests that although reinforcement learning offers adaptive learning advantages and holds promise for exploration–exploitation trade-offs, in this specific scenario it did not add value over the standalone CNN architecture. This result may be explained by the already high baseline performance of the CNN model, which leaves little room for improvement, or by a reward strategy in the RL module that was not well suited to the task. In either case, these results indicate that the CNN model is strong and resilient for the target task, even without reinforcement learning.

Conclusion and future scope

This research work proposes an intelligent computer-based digital model that uses a convolutional neural network to examine the mental health of Indian farmers, taking multiple factors into account. For this purpose, the audio responses of Indian farmers, captured while they answered the questionnaire, were used as the data source. The proposed model provided results with a high accuracy rate (99.67%). The model was successfully validated through a usability test based on six factors (learnability, efficiency, configurability, satisfaction, understandability, and effectiveness), whose values support its suitability for rural settings that face literacy barriers and limited access to mental health professionals, and show that the proposed model can be used as a promising tool to examine the mental health of Indian farmers. The proposed model was found to outperform existing approaches in terms of accurate prediction and can provide a scalable, practical, and culturally fair solution for early detection.

In future work, the sample size could be expanded to better analyze subgroups by gender, age group, and region (limited in this study by an unbalanced dataset). The findings could inform mobile applications and AI-based platforms that provide farmers with real-time support and resources, while hybrid algorithms and longitudinal studies could improve predictive precision by capturing long-term relationships between agricultural stressors and mental health outcomes. However, impediments remain, including the low participation of women in the sample, the small numbers available for age-group-wise stratification, implementation cost concerns, and limited digital infrastructure and literacy among Indian farmers. Addressing these will require targeted awareness campaigns, user-friendly interfaces, and integration with national mental health initiatives. Training healthcare workers to use AI tools can help sustain deployment. With such advancements, these models can support not only early detection but also instantaneous decision support, strengthening the existing mental health ecosystem for agriculture.