Abstract
Stress is widely acknowledged as a significant contributor to health issues. Recognizing stress involves assessing an individual’s physiological and psychological responses to stressors, which is crucial for human well-being. Physiological signal-based stress assessment offers greater accuracy and objectivity compared to traditional methods. To enhance stress level detection, we propose a novel approach using deep learning models that classify mental stress states (stress, baseline, amusement) based on multimodal physiological signals converted into RGB images through Gramian Summation Angular Field (GASF), Gramian Difference Angular Field (GADF), and Markov Transition Field (MTF) transformations. Experimental findings showcase the effectiveness of the proposed model, achieving an accuracy of 90.96% and an F1-score of 91.67%. The consistently high F1 scores across all categories demonstrate the model’s exceptional performance. Experimental results underscore the method’s effectiveness in capturing the relationship between multimodal physiological signals and stress, offering a promising tool for mental stress recognition.
Introduction
Stress is recognized as a significant factor contributing to health problems and is defined as the body’s response to mental, physical, and emotional stimuli controlled by the sympathetic nervous system (SNS) and the hypothalamic-pituitary-adrenocortical axis (HPA axis)1,2. It directly impacts health through autonomic and neuroendocrine responses, and it can also indirectly affect health by altering behaviors3. Research indicates that mental stress has direct physiological consequences, potentially leading to various disorders such as stroke, cardiovascular disease, cognitive impairments, language disorders, and depression3,4,5,6. It can also disrupt sleep, decision-making7, disturb skin homeostasis, and exacerbate interpersonal conflicts8.
Stress assessment helps people detect and prevent mental and physical health problems caused by chronic stress early and adjust their state promptly. In special industries such as military, aviation, and medical emergency, workers often face high-pressure situations, making stress assessment crucial for ensuring safety. Therefore, researchers have developed various methods for stress assessment.
Traditional assessment of mental stress has relied on subjective methods. These approaches are highly subjective and susceptible to influences from various factors during the measurement process. Other tests detect stress by observing facial expressions or assessing physical behaviors under stress. However, physical behaviors can be intentionally or unintentionally controlled9, thus these tests do not provide objective and accurate measurements of stress levels.
Recently, researchers have started using physiological signals for automatic stress recognition. Physiological signal characteristics are difficult to conceal and largely involuntary, making them a more accurate and objective method for assessing stress compared to traditional mental stress assessment methods.
Studies have shown that stress is associated with various physiological factors and is accompanied by diverse symptoms10. Currently, stress can be measured using multiple physiological signals including electrocardiogram (ECG), blood volume pulse (BVP), electromyogram (EMG), electrodermal activity (EDA), respiration (RESP), skin temperature (SKT), pupil diameter (PD), eye movement, and speech recordings11,12.
Single physiological signals struggle to accurately reflect mood changes, whereas integrating multimodal physiological signals can enhance stress recognition13. This is because different physiological signals are captured from various perspectives when observing sympathetic nervous system (SNS) activities. Each modality can compensate for the deficiencies of others, and combining different modal signals can leverage intermodal complementarity to enhance classification accuracy14,15,16.
Recently, the development of stress recognition based on wearable devices has shown promise. Wearable devices can capture multimodal physiological signals. Schmidt et al. introduced the WESAD dataset for wearable stress and emotion detection17; Singh et al. utilized inputs from wearable physiological sensors to propose a hybrid deep learning-based ensemble approach for automatically extracting features in mental healthcare systems and classifying various stress states18; Siirtola et al. employed wrist-worn sensors to gather biosignals for emotion prediction and evaluated the performance of different prediction models19; Mitro et al. developed an AI wristband as a physiological monitoring tool for critical and emergency events20.
Wearable sensors can capture multimodal physiological signals more easily and with minimal intrusiveness. In this context, our research draws inspiration from previous studies on the use of physiological data in wearable technology and aims to open a unique path for stress recognition.
In this paper, our goal is to detect stress levels by analyzing data provided by wearable sensors. Most previous stress detection methods based on physiological signals rely on one-dimensional signals. Inspired by the recent success of computer vision, we aim to link the fields of affective computing, mental stress detection, and computer vision by encoding one-dimensional data into images, enabling machines to “visualize,” classify, and learn structures and patterns. In this study, we propose a method to convert one-dimensional data into three types of two-dimensional representations: the Gramian Summation Angular Field (GASF), the Gramian Difference Angular Field (GADF), and the Markov Transition Field (MTF). The converted 2D data are encoded into RGB images, which are then fed into our proposed deep learning model for the detection of mental stress levels.
The main contributions of this paper are as follows:
-
Our proposed image coding method is, to the best of our knowledge, the first to encode one-dimensional data into RGB three-channel images for stress recognition.
-
We utilize three methods, GASF, GADF, and MTF, to convert one-dimensional signals into three different types of two-dimensional data. The converted 2D data are then encoded into RGB three-channel images, which are subsequently fed into our proposed deep learning network model for mental stress level detection.
-
We evaluated our proposal on the WESAD dataset to verify its feasibility, and our approach achieved better results than the compared existing methods.
The rest of the paper is organized as follows: the second part describes related work in the field of stress recognition, the third part specifies the proposed methodology, the fourth part presents the experimental results, the fifth part concludes the study, and the sixth part provides a discussion.
Related work
In recent years, physiological signals have become increasingly utilized to measure stress. Physiological signals are more accurate and objective than traditional psychological assessments; furthermore, the use of physiological signals for the detection of mental stress is less intrusive and complex than biological analysis. Data from different modalities are often used in stress identification, allowing for the construction of more reliable and accurate stress identification models by evaluating and integrating these data sources21.
Healey et al. proposed a method to analyze stress under driving conditions using electrocardiogram (ECG), electromyography (EMG), skin conductivity (SC), and respiration (RESP)22. Ghaderi et al. extracted statistical features from physiological signals such as respiration (RESP), galvanic skin activity (GSR), heart rate (HR), and electromyography (EMG) and fed them into K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) classifiers for stress detection23. Jung and Yoon predicted mental stress levels by monitoring health parameters such as electroencephalogram (EEG), blood pressure (BP), heart rate (HR), and respiratory rate (RR) in dynamic situations. Decision tree models and random forest algorithms were used to classify mental stress levels based on these parameters24. Sriramprakash et al. proposed an ML model to analyze work-related stress using electrocardiogram (ECG) and galvanic skin activity (GSR). The best model achieved an accuracy of 92.75% for both stress levels, and the authors found that ECG and GSR were the most effective feature combination25. Betti et al. extracted features from electrocardiogram (ECG), galvanic skin activity (GSR), and electroencephalogram (EEG) data and analyzed them using the Support Vector Machine (SVM) classification algorithm to classify stress based on significant features26. Vaitheeshwari et al. used a combination of electrocardiogram (ECG), galvanic skin activity (GSR), and eye movement signals to classify stress conditions using machine learning (ML) and deep learning (DL) models27. Ding et al. used eight features, such as heart rate and body temperature, to predict stress using machine learning techniques and proposed a hybrid model combining gradient boosting machine (GBM) and random forest (RF)28.
Physiological signal data such as ECG, respiration, body temperature, and galvanic skin response are easily collected using wearable devices. Researchers have explored the correlation between human stress and physiological signals obtained from these devices29,30. Chatterjee et al. proposed a lightweight deep neural network for detecting mental stress using physiological signals. They collected ECG, GSR, skin temperature, and EMG signals from wearable devices to classify three stress levels31. Almadhor et al. proposed a joint learning-based approach using deep neural network (DNN) models for stress classification based on electrical skin activity from wearable devices32.
Given that deep convolutional neural networks can learn complex nonlinear relationships for accurate stress detection, research on using deep learning to detect stress has increased rapidly in recent years. Through end-to-end neural network architectures, researchers are able to automatically learn discriminative features from raw signals, significantly improving classification accuracy33,34,35. Hwang et al. proposed a CNN-LSTM network that utilizes ultra-short-term raw ECG signals to identify stress states without the need for feature engineering36. He et al. also proposed a deep CNN for analyzing 10-second ECG signals, and the results confirmed the feasibility of using short-term ECG for stress detection37. Seo et al. collected data such as electrocardiogram (ECG) and respiration (RESP) and proposed a deep learning method to detect work-related stress using multimodal signals14. Tanwar et al. proposed a hybrid deep learning model framework based on attention, which consists of CNNs and LSTMs with attentional mechanisms for stress recognition. The inputs to the model include physiological signals such as ECG and EDA21. Pan et al. effectively improved the accuracy of EEG-based mental state assessment through multi-domain feature fusion and decision-level fusion strategies38. In addition, Li et al. proposed a method combining domain generalization and residual networks to enhance cross-subject emotion recognition34. Other studies have employed self-supervised learning and transfer learning strategies to improve the performance of emotion recognition models under small-sample conditions2,39,40.
To detect stress levels by analyzing measurements provided by wearable sensors, we propose a deep neural network model based on image coding. The RGB images generated from this coding are then fed into the model for mental stress detection.
Proposed methodology
The primary objective of this research is to present a novel and promising method for stress recognition. The proposed framework integrates a model that combines image coding with Convolutional Neural Networks (CNNs). The model classifies baseline, stress, and amusement states based on RGB three-channel images converted from multimodal physiological signals. The image coding-based stress recognition model proposed in this paper comprises three main components: data preprocessing of multimodal signals, RGB image coding of the preprocessed data, and classification of the coded images using the proposed deep learning model. As illustrated in Fig. 1, these three components are sequentially integrated within the framework. This section provides a detailed description of the stages involved in the study and the framework’s components.
Stress recognition model based on image coding.
Dataset
In this study, the WESAD dataset was utilized for stress recognition. WESAD is a publicly available dataset for wearable stress and emotion detection, containing physiological and motion data recorded from 15 subjects using wrist-worn and chest-worn devices17. The data are labeled with three emotional states: baseline, stress, and amusement. Under these three states, the dataset records BVP, ECG, EDA, EMG, RESP, body temperature, and three-axis acceleration. The wrist-worn device was an Empatica E417, which records BVP, EDA, TEMP, and ACC at 64 Hz, 4 Hz, 4 Hz, and 32 Hz, respectively. The chest-worn device was a RespiBAN Professional17, equipped with sensors for measuring RESP; data recording utilizes four analog ports for ECG, EDA, EMG, and TEMP. All chest signals are sampled at 700 Hz. The RESP signal is recorded using a respiratory inductance sensor placed around the subject’s chest.
Data preprocessing
Data preprocessing is a crucial step. Sensor signals are segmented using a sliding window; based on the studies of Yang et al.41, Chung et al.42, and Kreibig43, as well as our own experiments, we employ a window length of 30 s. Each 30-second segment is analyzed independently, with no overlap between consecutive windows. Longer windows allow the model to better account for changes and trends in physiological signals that may carry relevant information about the subject’s emotional state, whereas shorter windows may lose important temporal relationships and patterns. However, longer windows also increase the computational load, memory requirements, and processing time of the model, which is especially critical for real-time applications that require efficient processing. A balance must therefore be struck between capturing sufficient temporal information from physiological signals and maintaining manageable computational complexity. Finally, the multimodal signals are synchronized using their timestamps to ensure temporal alignment.
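As a minimal sketch of this segmentation step (the helper name `segment_signal` is illustrative, not from the original implementation; the 700 Hz rate matches the WESAD chest device):

```python
import numpy as np

def segment_signal(signal, fs, window_s=30.0):
    """Split a 1-D signal into non-overlapping windows of window_s seconds.

    Trailing samples that do not fill a complete window are discarded.
    """
    win = int(fs * window_s)
    n_windows = len(signal) // win
    return np.asarray(signal[:n_windows * win]).reshape(n_windows, win)

# Example: a 700 Hz chest signal lasting 95 s -> three complete 30-s windows
x = np.arange(700 * 95, dtype=float)
windows = segment_signal(x, fs=700)
print(windows.shape)  # (3, 21000)
```

The same routine is applied per modality after timestamp alignment, with `fs` set to each sensor's sampling rate.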
Due to the physical limitations of the sensors, raw signals may contain random noise that can affect the accuracy of subsequent training and validation; this noise is therefore removed during preprocessing. Electrodermal activity (EDA), controlled by the sympathetic nervous system (SNS), is particularly sensitive to highly aroused states. In the proposed preprocessing stage, we use a low-pass filter with a cutoff frequency of 1 Hz to remove high-frequency noise from the EDA signals. To decompose the EDA signal into its phasic and tonic components, we employed the cvxEDA (Convex Optimization-Based Electrodermal Activity Decomposition) method44. cvxEDA formulates EDA decomposition as a convex optimization problem, providing a robust way to separate rapid sympathetic responses (phasic) from slow-changing baseline variations (tonic).
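The 1 Hz low-pass step might be implemented as in the following sketch, which uses a zero-phase Butterworth filter from SciPy (the filter order and the synthetic test signal are our own assumptions; the cvxEDA decomposition itself is a separate method and is not shown):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_eda(eda, fs, cutoff_hz=1.0, order=4):
    """Zero-phase low-pass filter for EDA; removes high-frequency noise."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, eda)

# Synthetic example: a slow 0.1 Hz component plus 50 Hz noise, sampled at 700 Hz
fs = 700
t = np.arange(0, 10, 1 / fs)
slow = np.sin(2 * np.pi * 0.1 * t)
eda = slow + 0.2 * np.sin(2 * np.pi * 50 * t)
clean = lowpass_eda(eda, fs)
```

After filtering, the 50 Hz component is strongly attenuated while the slow sympathetic trend is preserved.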
The processed physiological signals were used to extract a variety of statistical features. These features were carefully chosen to capture both the global and local characteristics of each signal. Specifically, we extracted the following features:
Blood Volume Pulse (BVP): BVP mean, standard deviation, minimum, maximum, and peak frequency. These features characterize the amplitude and variability of peripheral blood volume, which is modulated by cardiac activity. The peak frequency provides insight into the dominant heart rate, while variability measures (e.g., standard deviation) can reflect changes in autonomic nervous system balance.
Electrodermal Activity (EDA): EDA mean, standard deviation, minimum, maximum, phasic, smoothed, and tonic components. We also extracted features from the phasic and tonic components, including mean, standard deviation, minimum, and maximum values. These features capture both short-term (phasic) and long-term (tonic) variations in skin conductance, which are known to be sensitive to emotional and stress responses.
Respiration: Respiration mean, standard deviation, minimum, and maximum. These features characterize breathing patterns that may change under stress, including the depth and regularity of breathing.
Temperature: Skin temperature mean, standard deviation, minimum, maximum, and slope. Skin temperature fluctuations can be linked to emotional and stress states, and the slope helps capture any rapid changes in skin temperature over time.
Accelerometer (ACC): Mean, standard deviation, minimum, and maximum values for the total accelerometer signal (‘net_acc’) as well as for the individual axes (ACC_x, ACC_y, ACC_z). These features describe the overall body movement as well as movement patterns in each spatial direction.
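A minimal sketch of this statistical feature extraction (helper names such as `basic_stats`, `temp_features`, and `acc_features` are illustrative, not from the original implementation):

```python
import numpy as np

def basic_stats(x, prefix):
    """Mean / std / min / max features for one windowed signal."""
    return {
        f"{prefix}_mean": float(np.mean(x)),
        f"{prefix}_std": float(np.std(x)),
        f"{prefix}_min": float(np.min(x)),
        f"{prefix}_max": float(np.max(x)),
    }

def temp_features(temp, fs):
    """Temperature stats plus slope (degrees per second) via a linear fit."""
    feats = basic_stats(temp, "temp")
    t = np.arange(len(temp)) / fs
    feats["temp_slope"] = float(np.polyfit(t, temp, 1)[0])
    return feats

def acc_features(acc_xyz):
    """Per-axis stats plus stats of the 3-axis acceleration magnitude."""
    feats = {}
    for i, axis in enumerate("xyz"):
        feats.update(basic_stats(acc_xyz[:, i], f"acc_{axis}"))
    net = np.linalg.norm(acc_xyz, axis=1)      # 'net_acc' magnitude
    feats.update(basic_stats(net, "net_acc"))
    return feats

temp = np.linspace(30.0, 31.0, 120)            # 30 s of 4 Hz skin temperature
f = temp_features(temp, fs=4)
acc = np.tile(np.array([[3.0, 4.0, 0.0]]), (10, 1))
fa = acc_features(acc)
```

BVP peak frequency and the EDA phasic/tonic statistics follow the same pattern, applied to the spectrum and to the cvxEDA components, respectively.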
Additionally, data input variables may exhibit highly skewed or non-standard distributions, potentially due to outliers, multimodal distributions, or highly exponential distributions. Prior to transforming the data into images, we applied quantile normalization to the data. Quantile Normalization (QN) is a statistical method primarily used to address inconsistent distributions across multiple datasets and is particularly useful for comparing datasets from different groups or samples45. The core idea is to ensure that data from different groups have equal values at each quantile, thereby making the distribution of data from each group consistent for subsequent comparison and analysis. Therefore, we sorted the feature values in ascending order to capture the overall distribution of each feature. Then, we calculated the mean value for each quantile across all samples and mapped each original value to its corresponding quantile mean. This approach ensures that the data distributions from different groups or samples become statistically comparable.
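The quantile normalization described above — sort, average per rank, map back — can be sketched as follows (columns are treated as the groups to be made comparable; ties are broken by order in this simplified version):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: columns are groups/samples, rows are observations.

    Each value is replaced by the mean of the values sharing its rank across
    all columns, so every column ends up with the same distribution.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value per column
    rank_means = np.sort(X, axis=0).mean(axis=1)       # mean at each rank across columns
    return rank_means[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
```

After normalization, all three columns contain the same set of values (the per-rank means), each arranged according to that column's original ordering.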
Encoding dataset to images
The Gramian Summation Angular Field (GASF), Gramian Difference Angular Field (GADF), and Markov Transition Field (MTF) methods are used to encode one-dimensional data into two-dimensional representations. Subsequently, these two-dimensional representations are converted into RGB images.
Wang and Oates introduced the concepts of the Gramian Angular Field (GAF) and Markov Transition Field (MTF) in 201546,47. The GAF technique maps a one-dimensional sequence into a two-dimensional image by calculating the directional relationships between sequence elements to reveal underlying periodicity and structural features. The summation and difference methods correspond to two operations for constructing Gram matrices, each capturing intrinsic data patterns from different perspectives. Camara et al. converted ECG signals into GASF images, developed an ECG recognition system, and tested it in both resting and active states48. Ghosh et al. also encoded data into GASF images to detect individual stress levels49. Inspired by these studies, we innovatively encoded the extracted features of EDA, ACC, BVP, Resp, and Temp signals into GASF and GADF images.
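A compact numpy implementation of the GASF/GADF encodings (rescaling to [-1, 1] before the polar mapping, following Wang and Oates46,47; the helper name is illustrative):

```python
import numpy as np

def gramian_angular_fields(x):
    """Encode a 1-D sequence as GASF and GADF matrices.

    The sequence is rescaled to [-1, 1] and mapped to angles phi = arccos(x);
    GASF[i, j] = cos(phi_i + phi_j) and GADF[i, j] = sin(phi_i - phi_j).
    """
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

x = np.sin(np.linspace(0, 2 * np.pi, 64))
gasf, gadf = gramian_angular_fields(x)
```

Note the structural properties this encoding guarantees: GASF is symmetric, while GADF is antisymmetric with a zero diagonal.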
MTF is a sequence analysis method based on Markov chain theory, revealing statistical laws and dynamic properties of state transitions by analyzing transfer probabilities and converting sequences into image representations. Ji et al. proposed an intelligent ECG diagnosis method using ResNet, where MTF converts ECG signals into two-dimensional matrices for ResNet input50. Zhang et al. developed a robust deep learning framework for dominant instability pattern recognition in power systems using MTF to encode time series into images51. Li et al. employed MTF and deep neural networks for classifying surface EMG signals based on emotional stimuli52. Inspired by these studies, we innovatively encoded the extracted features of EDA, ACC, BVP, Resp, and Temp signals into MTF images.
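A minimal MTF sketch under the same framing (quantile binning and a first-order transition matrix; the number of bins is an illustrative choice, not taken from the paper):

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """Encode a 1-D sequence as a Markov Transition Field.

    Values are discretized into quantile bins, a first-order transition
    matrix W is estimated, and MTF[i, j] = W[bin(x_i), bin(x_j)].
    """
    x = np.asarray(x, dtype=float)
    # Internal quantile edges; digitize maps each sample to a bin index 0..n_bins-1
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    # First-order Markov transition matrix, row-normalized
    W = np.zeros((n_bins, n_bins))
    for a, b in zip(bins[:-1], bins[1:]):
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)
    return W[bins[:, None], bins[None, :]]

mtf = markov_transition_field(np.sin(np.linspace(0, 4 * np.pi, 64)))
```

Each pixel is thus a transition probability, so the resulting field lies in [0, 1].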
We transform physiological signals into images using the Gramian Angular Field (GAF) and the Markov Transition Field (MTF). GAF maps time series into a polar coordinate system, preserving temporal dependencies and capturing non-linear relationships, making it well suited to signals such as ECG and EDA. MTF encodes time series as images based on state transitions, effectively extracting the signals’ dynamic features.
In addition to GASF, GADF, and MTF, other time series imaging methods such as Recurrence Plots (RP)53 have been proposed. However, RP may not effectively capture temporal dependencies and dynamic transitions essential to our task. In contrast, GASF and GADF encode temporal correlations in polar coordinates, while MTF captures transition dynamics, making them better suited for our task and our dataset.
Subsequently, we combined the GASF, GADF, and MTF images derived from the aforementioned transformations to construct a three-channel image. Each transformation method (GASF, GADF, and MTF) corresponds to one distinct channel. This approach provides comprehensive, multi-dimensional input features for subsequent deep learning models, thereby enhancing the ability to capture deeper patterns and associations54,55.
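Stacking the three fields into one image might look like the following sketch (per-channel min-max scaling to 8-bit is our assumption; the paper does not specify the exact scaling):

```python
import numpy as np

def to_rgb(gasf, gadf, mtf):
    """Stack GASF, GADF, and MTF matrices into one H x W x 3 uint8 image.

    Each channel is min-max rescaled to [0, 255] independently.
    """
    def scale(c):
        c = (c - c.min()) / max(c.max() - c.min(), 1e-12)
        return (255 * c).astype(np.uint8)
    return np.stack([scale(gasf), scale(gadf), scale(mtf)], axis=-1)

# Dummy channels standing in for the three encodings of one 30-s window
rng = np.random.default_rng(0)
img = to_rgb(rng.normal(size=(64, 64)),
             rng.normal(size=(64, 64)),
             rng.uniform(size=(64, 64)))
print(img.shape)  # (64, 64, 3)
```

The resulting uint8 array can be saved or fed directly to a CNN expecting RGB input.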
Framework of deep learning
After data preprocessing and image encoding, a series of RGB images are generated. These RGB images are then input into a deep learning model for stress recognition.
The architecture of our proposed model is detailed in Table 1, where Params # denotes the number of parameters in each layer. These parameters, including learnable weights and biases, are optimized during training and determine the model’s complexity and representational capacity. The first layer is a 2D convolutional layer with 64 filters, a kernel size of 7, and ReLU (Rectified Linear Unit) as the activation function. We tested different settings (32, 64, and 128 filters) and found that 64 filters provided an optimal balance between computational cost and feature extraction capability. Because our model processes images derived from multimodal physiological signals, this first convolutional layer is designed to capture broad spatial dependencies across modalities: a 7 × 7 filter provides a larger receptive field for extracting global structure and cross-modal information, and experimental comparisons with smaller filter sizes (3 × 3, 5 × 5) showed that 7 × 7 achieved the best balance between feature extraction and computational efficiency. The second layer is a 2D max pooling layer. It is followed by four 2D convolutional layers with a kernel size of 3, containing 64, 64, 128, and 256 filters, respectively; each uses ReLU activation and is followed by a batch normalization layer that normalizes the contributions of each mini-batch. Next, a 2D max pooling layer with a pool size of 2 selects the maximum values from the input feature map. Finally, two dense layers are added, with a Softmax activation for multi-class classification. Batch normalization is employed to stabilize training, accelerate convergence, and improve generalization; our experiments show that models with batch normalization converge faster and exhibit lower validation loss than those without it.
This finding is consistent with previous studies, which have highlighted the benefits of batch normalization in deep learning architectures56,57,58.
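For reference, the per-layer parameter counts of the convolutional stack follow the standard formula (k·k·c_in + 1)·c_out. The sketch below reproduces that arithmetic for the filter sizes given in the text (the 3-channel RGB input is taken from the paper; the dense-layer counts in Table 1 additionally depend on the input image size, so they are not computed here):

```python
def conv_params(k, c_in, c_out):
    """Learnable parameters of a 2-D conv layer: (k*k*c_in + 1)*c_out."""
    return (k * k * c_in + 1) * c_out

# Conv stack from the text: a 7x7 conv with 64 filters on RGB input,
# then four 3x3 convs with 64, 64, 128, and 256 filters
layers = [(7, 3, 64), (3, 64, 64), (3, 64, 64), (3, 64, 128), (3, 128, 256)]
counts = [conv_params(*layer) for layer in layers]
print(counts)  # [9472, 36928, 36928, 73856, 295168]
```

Each batch normalization layer contributes a further 2 learnable parameters (scale and shift) per channel.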
Results
This section presents the proof-of-concept results of using coded RGB images for mental stress detection.
Evaluation indicator
In this work, we use several performance evaluation metrics to assess the classification results of the model, including accuracy, precision, recall, and F1 score. These metrics are derived from the confusion matrix. The confusion matrix for binary classification is illustrated in Table 2. TP represents the number of samples correctly predicted as positive cases, FP represents the number of samples incorrectly predicted as positive cases, TN represents the number of samples correctly predicted as negative cases, and FN represents the number of samples incorrectly predicted as negative cases.
Accuracy is a key indicator of classification performance, defined as the ratio of the total number of correct predictions to the total number of samples. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the proportion of true positive predictions among all samples predicted as positive, and it is calculated as:
Precision = TP / (TP + FP)
Recall represents the proportion of actual positive cases that were correctly predicted out of all actual positive cases. It is calculated using the following formula:
Recall = TP / (TP + FN)
Precision and recall are two complementary metrics: a classifier with higher precision predicts positive samples more reliably, but may still miss many actual positives. The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both aspects of classification performance. It is calculated as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
In multiclass classification tasks, the rationale for each evaluation metric is similar to that in binary classification. The labels can be categorized into positive and negative classes for each class individually. Specifically, for multiclass classification, each class is treated as the positive class, and the confusion matrix is computed for binary classification between this class and all others. After obtaining the confusion matrices for each class, the computation of evaluation metrics such as precision and recall in multiclass classification follows the same principles as in binary classification.
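The per-class computation described above can be sketched directly from a multiclass confusion matrix (the 3 × 3 matrix below is a toy example, not our experimental result):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class from a multiclass confusion matrix.

    cm[i, j] = number of samples with true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                                   # correct predictions per class
    precision = tp / np.maximum(cm.sum(axis=0), 1)     # column sums = predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)        # row sums = actual counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1

# Toy 3-class example (0 = amusement, 1 = baseline, 2 = stress)
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
p, r, f1 = per_class_metrics(cm)
acc = np.trace(np.asarray(cm)) / np.sum(cm)            # overall accuracy
```

Macro averages are the unweighted means of these per-class vectors; weighted averages weight each class by its row sum (support).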
In this study, potential class imbalance in the dataset may affect the model’s performance, particularly for underrepresented classes. To assess the model’s performance more comprehensively, we have included a detailed analysis of the confusion matrix, which highlights the misclassifications per stress category. Furthermore, we examined additional metrics such as precision, recall, and F1-score to better capture the model’s ability to distinguish between classes.
In addition to these performance metrics, we used training and validation accuracy plots, along with training and validation loss plots, for graphical analysis. Accuracy plots help us monitor the neural network’s performance over time. Loss plots help diagnose learning issues that may lead to underfitting or overfitting of the model.
Test results
We employed the proposed model to conduct separate experiments with different modal combinations. The models used for these experiments with different modal combinations or multimodal inputs are similar to those described in Table 1 but with slightly varying input shapes.
Table 3 presents the results of the proposed model with various combinations of modal inputs. Our experiments show that accuracy, precision, recall, and F1 score generally improve as the number of modalities increases. The highest accuracy, precision, recall, and F1 score are achieved with the combination of EDA, ACC, BVP, and Temp, resulting in an accuracy of 90.96% and an F1 score of 91.67%. When combining all five modalities, the test accuracy reaches 90.68% and the F1 score reaches 91.41%.
Different multimodal combinations achieved high accuracy and F1 scores. Performance plateaued as the number of modalities increased, likely due to increased data complexity and potential noise from additional modalities. These results underscore the potential of the proposed model for accurately identifying stress from multimodal data.
For the combination of EDA, ACC, BVP, and Temp, the categorization of each stress category—baseline, stress, and amusement—is detailed in Table 4, including precision, recall, and F1 scores. Here, 0 indicates amusement, 1 indicates baseline, and 2 indicates stress. The table also provides macro averages, weighted averages, and detailed accuracies for the quantitative metrics. The macro-average indicates that all categories contribute equally to the final mean, while the weighted average reflects each category’s contribution to the mean according to its prevalence. The weighted averages for precision, recall, and F1-score are 90.74%, 90.96%, and 90.72%, respectively.
The confusion matrix generated by the image-coding-based deep learning network model using the WESAD dataset is shown in Fig. 2. The figure includes eight different modal combinations. Here, the x-axis represents the predicted labels, while the y-axis shows the actual labels of the data. As shown in the figure, the classification accuracy for the stress state is higher compared to the baseline and amusement states. We hypothesize that this may be due to potential class imbalance in the dataset. Specifically, for the combined modality EDA + ACC + BVP + Temp, the classification accuracy for the stress category reaches 99%. Figure 3 illustrates the plots of the loss function and accuracy over time. These plots correspond to the combination of all five modalities used in the proposed model.
We further evaluated the performance of three transformation methods—GASF, GADF, and MTF—on our model and compared them with the results of our proposed three-channel construction. The results are presented in Table 5.
Confusion matrix for different modal combinations of the proposed model.
Graphical plots of training and validation vs. epochs: (a) accuracy; (b) cross-entropy loss.
Comparison with existing stress detection models
Table 6 presents a comparison of the accuracy of our proposed method with those reported in other studies. We reimplemented these methods while ensuring that the preprocessing steps and dataset partitioning remained consistent with our study. For a fair comparison, we included only those studies that used the WESAD dataset to classify stress into three categories: baseline, stress, and amusement. As shown in Table 6, our method achieves higher accuracy and F1 scores.
Conclusion
In this study, we propose a novel image-coding-based method for detecting stress levels, utilizing a deep learning model to classify stress, baseline, and amusement states based on multimodal physiological signals. This method converts one-dimensional signals into three-channel, two-dimensional images, which are then fed into our deep learning model, demonstrating excellent performance. The proposed model was evaluated on the WESAD dataset, and by using a combination of various multimodal physiological signals, it achieved a maximum accuracy of 90.96% and an F1 score of 91.67%. The model’s F1 scores were consistently high across all categories, confirming its excellent performance. The experimental results demonstrate that our method effectively captures the relationship between multimodal physiological signals and stress. Our work is expected to serve as an effective auxiliary tool for mental stress recognition.
Discussion
The experimental results indicate that increasing the number of modalities improves test accuracy, F1 score, and other performance metrics. Single physiological signals can be unstable due to individual differences, environmental changes, and other factors. In contrast, combining multiple modalities provides more comprehensive information, enhances the model’s adaptability and robustness to varying conditions, and mitigates the impact of external interference. Multimodal signals integrate characteristics from different physiological parameters and leverage complementary information across modalities, thereby improving the accuracy and reliability of stress recognition.
The experimental results also indicate that five-modal combinations perform slightly worse than the best four-modal combination (EDA, ACC, BVP, and Temp). While multimodal physiological signal combinations offer several advantages, such as improved recognition accuracy, enhanced robustness, and better matching rates, more modalities are not always better. The effectiveness of multimodal fusion depends on several factors. Each additional modality must provide high-quality data that is consistent with the existing modalities; low-quality or inconsistent data may introduce noise and degrade overall performance. Moreover, modalities can be informationally redundant: too many modalities may repeatedly convey similar or highly correlated features, offering limited improvement in recognition accuracy while increasing the processing burden. Finally, as the number of modalities grows, model interpretability may decrease, making the model harder to understand and debug.
In this study, we did not apply specific techniques to address potential class imbalances in the dataset. To further investigate this, we have included an analysis of the confusion matrix (Fig. 2). Future work may explore re-sampling techniques or class-weighted loss functions to mitigate any potential bias.
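To illustrate the class-weighting idea mentioned above (a sketch only, not part of the reported experiments; the function names and the toy label distribution are hypothetical), inverse-frequency weights can be derived from the label counts and folded into a cross-entropy loss so that the under-represented class contributes more per sample:

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * np.maximum(counts, 1))

def weighted_cross_entropy(probs, labels, weights):
    """Mean cross-entropy with each sample scaled by its class weight."""
    eps = 1e-12
    per_sample = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.mean(weights[labels] * per_sample)

# Toy example: class 2 (e.g. amusement) is under-represented.
labels = np.array([0] * 50 + [1] * 40 + [2] * 10)
w = class_weights(labels, 3)
print(np.round(w, 2))  # the rarest class carries the largest weight
```

The same weights could equivalently be passed to a framework loss (for example, the `weight` argument of a weighted cross-entropy in a deep learning library) rather than computed by hand.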
This study has several limitations that should be considered. For example, it may be constrained by the specific dataset used, which could limit the diversity and generalizability of the results. Furthermore, the lack of real-time evaluation may impede the model’s practical applicability in time-sensitive scenarios. In future work, we will further address these limitations, enhance the interpretability and generalization performance of the proposed model, and explore its eventual deployment on real-world wearable devices.
Data availability
The datasets analysed during the current study are publicly available, and can be downloaded from https://ubicomp.eti.uni-siegen.de/home/datasets/icmi18/.
References
Guo, J. et al. Environmental toxicant exposure and depressive symptoms. JAMA Netw. Open. 7, e2420259 (2024).
Gan, Y., Huang, H., Wu, X. & Meng, M. What doesn’t kill us makes us stronger: Insights from neuroscience studies and molecular genetics. Curr. Opin. Behav. Sci. 59, 101431 (2024).
O’Connor, D. B., Thayer, J. F. & Vedhara, K. Stress and health: A review of psychobiological processes. Annu. Rev. Psychol. 72, 663–688 (2021).
Zou, G. J. et al. Microglial activation in the medial prefrontal cortex after remote fear recall participates in the regulation of auditory fear extinction. Eur. J. Pharmacol. 978, 176759 (2024).
Du, Y. et al. Neurometabolite levels in the brains of patients with autism spectrum disorders: A meta-analysis of proton magnetic resonance spectroscopy studies (N = 1501). Mol. Psychiatry. 28, 3092–3103 (2023).
Li, J. et al. Is stress always bad? The role of job stress in producing innovative ideas. Knowl. Manage. Res. Pract. 23, 77–88 (2025).
Hao, S. et al. Group identity modulates bidding behavior in repeated lottery contest: neural signatures from event-related potentials and electroencephalography oscillations. Front. Neurosci. 17, 1184601 (2023).
Hao, S. et al. Group membership modulates the hold-up problem: an event-related potentials and oscillations study. Soc. Cognit. Affect. Neurosci. 18, nsad071 (2023).
Yang, K. How to prevent deception: A study of digital deception in visual poverty livestream. New Media Soc. https://doi.org/10.1177/14614448241285443 (2024).
Alberdi, A., Aztiria, A. & Basarab, A. Towards an automatic early stress recognition system for office environments based on multimodal measurements: A review. J. Biomed. Inf. 59, 49–75 (2016).
Ou, J. et al. Detecting muscle fatigue among community-dwelling senior adults with shape features of the probability density function of sEMG. J. Neuroeng. Rehabil. 21, 196 (2024).
Zhao, Y. et al. Highly sensitive, wearable piezoresistive Methylcellulose/chitosan@MXene aerogel sensor array for real-time monitoring of physiological signals of pilots. Sci. China Mater. 68, 542–551 (2025).
Kim, N., Seo, W., Kim, S. & Park, S. M. Electrogastrogram: Demonstrating feasibility in mental stress assessment using sensor fusion. IEEE Sens. J. 21, 14503–14514 (2021).
Seo, W., Kim, N., Park, C. & Park, S. M. Deep learning approach for detecting work-related stress using multimodal signals. IEEE Sens. J. 22, 11892–11902 (2022).
Zhang, J., Yin, Z., Chen, P. & Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inform. Fusion. 59, 103–126 (2020).
Yin, J., Qiao, Z., Han, L. & Zhang, X. EEG-based emotion recognition with autoencoder feature fusion and MSC-TimesNet model. Comput. Methods Biomech. Biomed. Eng. 1–18. https://doi.org/10.1080/10255842.2025.2477801 (2025).
Schmidt, P. et al. Introducing WESAD, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM International Conference on Multimodal Interaction 400–408 (ACM, Boulder, CO, USA, 2018). https://doi.org/10.1145/3242969.3242985
Singh, G., Phukan, O. C. & Kumar, R. Stress recognition with multi-modal sensing using bootstrapped ensemble deep learning model. Expert Syst. 40, e13239 (2023).
Siirtola, P., Tamminen, S., Chandra, G., Ihalapathirana, A. & Röning, J. Predicting emotion with biosignals: A comparison of classification and regression models for estimating valence and arousal level using wearable sensors. Sensors (Basel) 23, 1598 (2023).
Mitro, N. et al. AI-enabled smart wristband providing real-time vital signs and stress monitoring. Sensors (Basel) 23, 2821 (2023).
Tanwar, R., Phukan, O. C., Singh, G., Pal, P. K. & Tiwari, S. Attention based hybrid deep learning model for wearable based stress recognition. Eng. Appl. Artif. Intell. 127, 107391 (2024).
Healey, J. & Picard, R. W. Detecting stress during real-world driving tasks using physiological sensors. IEEE Trans. Intell. Transp. Syst. 6, 156–166 (2005).
Ghaderi, A., Frounchi, J. & Farnam, A. Machine learning-based signal processing using physiological signals for stress detection. In 2015 22nd Iranian Conference on Biomedical Engineering (ICBME) 93–98 (2015). https://doi.org/10.1109/ICBME.2015.7404123
Jung, Y. & Yoon, Y. I. Multi-level assessment model for wellness service based on human mental stress level. Multimed Tools Appl. 76, 11305–11317 (2017).
Sriramprakash, S., Prasanna, V. D. & Murthy, O. V. R. Stress detection in working people. Procedia Comput. Sci. 115, 359–366 (2017).
Betti, S. et al. Evaluation of an integrated system of wearable physiological sensors for stress monitoring in working environments by using biological markers. IEEE Trans. Biomed. Eng. 65, 1748–1758 (2018).
Vaitheeshwari, R., Yeh, S. C., Wu, E. H. K., Chen, J. Y. & Chung, C. R. Stress recognition based on multiphysiological data in high-pressure driving VR scene. IEEE Sens. J. 22, 19897–19907 (2022).
Ding, C., Zhang, Y. & Ding, T. A systematic hybrid machine learning approach for stress prediction. PeerJ Comput. Sci. 9, e1154 (2023).
Anusha, A. S., Jose, J., Preejith, S. P., Jayaraj, J. & Mohanasankar, S. Physiological signal based work stress detection using unobtrusive sensors. Biomed. Phys. Eng. Express. 4, 065001 (2018).
Can, Y. S., Arnrich, B. & Ersoy, C. Stress detection in daily life scenarios using smart phones and wearable sensors: A survey. J. Biomed. Inform. 92, 103139 (2019).
Chatterjee, D., Dutta, S., Shaikh, R. & Saha, S. A lightweight deep neural network for detection of mental states from physiological signals. Innov. Syst. Softw. Eng. https://doi.org/10.1007/s11334-022-00470-6 (2022).
Almadhor, A. et al. Wrist-based electrodermal activity monitoring for stress detection using federated learning. Sensors (Basel) 23, 3984 (2023).
Pan, H., Li, Z., Fu, Y., Qin, X. & Hu, J. Reconstructing visual stimulus representation from EEG signals based on deep visual representation model. IEEE Trans. Human-Mach. Syst. 54, 711–722 (2024).
Li, J., Li, J., Wang, X., Zhan, X. & Zeng, Z. A domain generalization and residual network-based emotion recognition from physiological signals. Cyborg Bionic Syst. 5, 0074 (2024).
Zhu, C. Computational intelligence-based classification system for the diagnosis of memory impairment in psychoactive substance users. J. Cloud Comp. 13, 119 (2024).
Hwang, B. et al. Deep ECGNet: An optimal deep learning framework for monitoring mental stress using ultra short-term ECG signals. Telemedicine e-Health 24, 753–772 (2018).
He, J., Li, K., Liao, X., Zhang, P. & Jiang, N. Real-time detection of acute cognitive stress using a convolutional neural network from electrocardiographic signal. IEEE Access. 7, 42710–42717 (2019).
Pan, H., Tong, S., Song, H. & Chu, X. A miner mental state evaluation scheme with decision level fusion based on multidomain EEG information. IEEE Trans. Human-Mach. Syst. 55, 289–299 (2025).
Si, X., He, H., Yu, J. & Ming, D. Cross-subject emotion recognition brain–computer interface based on fNIRS and DBJNet. Cyborg Bionic Syst. 4, 0045 (2023).
Cai, H., Wang, Y., Luo, Y. & Mao, K. A dual-channel collaborative transformer for continual learning. Appl. Soft Comput. 171, 112792 (2025).
Yang, K. et al. Behavioral and physiological signals-based deep multimodal approach for mobile emotion recognition. IEEE Trans. Affect. Comput. 14, 1082–1097 (2023).
Chung, S., Lim, J., Noh, K. J., Kim, G. & Jeong, H. Sensor data acquisition and multimodal sensor fusion for human activity recognition using deep learning. Sensors 19, 1716 (2019).
Kreibig, S. D. Autonomic nervous system activity in emotion: A review. Biol. Psychol. 84, 394–421 (2010).
Greco, A., Valenza, G., Lanata, A., Scilingo, E. P. & Citi, L. CvxEDA: A convex optimization approach to electrodermal activity processing. IEEE Trans. Biomed. Eng. 63, 797–804 (2016).
Bolstad, B. M., Irizarry, R. A., Åstrand, M. & Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193 (2003).
Wang, Z. & Oates, T. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. Proc. Workshops at the Twenty-Ninth AAAI Conf. on Artificial Intelligence 1, 1–7 (2015).
Wang, Z. & Oates, T. Imaging Time-Series to Improve Classification and Imputation. Preprint at https://doi.org/10.48550/arXiv.1506.00327 (2015).
Camara, C., Peris-Lopez, P., Safkhani, M. & Bagheri, N. ECG identification based on the Gramian angular field and tested with individuals in resting and activity states. Sensors 23, 937 (2023).
Ghosh, S., Kim, S., Ijaz, M. F., Singh, P. K. & Mahmud, M. Classification of mental stress from wearable physiological sensors using image-encoding-based deep neural network. Biosensors 12, 1153 (2022).
Ji, L., Wei, Z., Hao, J. & Wang, C. An intelligent diagnostic method of ECG signal based on Markov transition field and a ResNet. Comput. Methods Programs Biomed. 242, 107784 (2023).
Zhang, R. et al. Encoding time series as images: A robust and transferable framework for power system DIM identification combining rules and VGGNet. IEEE Trans. Power Syst. 38, 5781–5793 (2023).
Li, R. et al. Emotion stimuli-based surface electromyography signal classification employing Markov transition field and deep neural networks. Measurement 189, 110470 (2022).
Zbilut, J. P. & Webber, C. L. Embeddings and delays as derived from quantification of recurrence plots. Phys. Lett. A. 171, 199–203 (1992).
Yao, F., Zhang, H., Gong, Y., Zhang, Q. & Xiao, P. A study of enhanced visual perception of marine biology images based on diffusion-GAN. Complex. Intell. Syst. 11, 227 (2025).
Wang, B. et al. Single-shot super-resolved fringe projection profilometry (SSSR-FPP): 100,000 frames-per-second 3D imaging with deep learning. Light Sci. Appl. 14, 70 (2025).
Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Preprint at https://doi.org/10.48550/arXiv.1502.03167 (2015).
Santurkar, S., Tsipras, D., Ilyas, A. & Madry, A. How Does Batch Normalization Help Optimization? Preprint at https://doi.org/10.48550/arXiv.1805.11604 (2019).
Bjorck, J., Gomes, C., Selman, B. & Weinberger, K. Q. Understanding Batch Normalization. Preprint at https://doi.org/10.48550/arXiv.1806.02375 (2018).
Acknowledgements
We thank all colleagues and researchers involved in this research for their valuable group discussion and suggestions on this study.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities of South-Central Minzu University (Grant Numbers CZH25005, CZQ23029, and CZQ23031) and the Fundamental Research Funds of the South-Central Minzu University (Grant Number YZZ18011).
Author information
Authors and Affiliations
Contributions
Conceptualization, S. Y., Y. G. and L. Z.; Methodology, S. Y., Y. G. and Y. Z.; Visualization, S. Y.; Data curation, S. Y.; Formal analysis, S. Y.; Validation, S. Y.; Investigation, S. Y.; Software, S. Y.; Project administration, S. Y.; Writing – review & editing, S. Y., Y. G., L. Z., Q. X., X. L., F. W. and Z. Z.; Writing – original draft, S. Y. and Y. Z.; Supervision, Y. G., L. Z., Q. X., X. L., F. W. and Z. Z.; Funding acquisition, Y. G. and F. W. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, S., Gao, Y., Zhu, Y. et al. A deep learning approach to stress recognition through multimodal physiological signal image transformation. Sci Rep 15, 22258 (2025). https://doi.org/10.1038/s41598-025-01228-3