Abstract
A multi-scale sparse temporal autoencoder Transformer (3D-CNN-MSTA-Transformer) is proposed for sentiment classification of EEG signals. The overall framework of this model comprises raw EEG data input, 3D feature extraction, a CNN feature transformation module, an improved MSTA-Transformer classifier, and classification result output. Three-dimensional feature extraction mines information of emotion-recognition value from the EEG data; the CNN feature transformation module decomposes 3D convolution kernels to reduce computational cost; and the MSTA-Transformer classification module consists of four parts, data preparation, a sparse time-block autoencoder, hidden-space embedding, and a temporal Transformer, using a multi-head self-attention mechanism to perform sentiment classification. Experiments show that, compared with the classic models ResNet-34, ShuffleNet V2, and MobileNet V2 on the SEED, SEED-IV-1, and SEED-IV-2 datasets, the accuracy, recall, specificity, and F1 score of our method are significantly superior on all three datasets, verifying the effectiveness of the proposed algorithm.
Introduction
The generation of emotions is a complex, multi-level process closely related to various physiological signals, and this relationship is a two-way interaction. During an emotional experience, the body undergoes a series of physiological changes, including changes in heart rate, blood pressure, respiratory rate, skin conductance response, and hormone levels. These physiological signals are regulated by the autonomic nervous system and require the cooperation of multiple organ systems; understanding the interrelationships between these systems has long been a topic of research1,2.
Changes in physiological signals not only reflect the surface manifestations of emotions but also reveal, at a deeper level, the complex response mechanisms of emotions within the human body. This complex physiological response network provides a valuable data foundation for researchers, making it possible to capture these signals with external devices and to understand and analyze emotions. With the development of technology, especially the advancement of wearable devices, scientists can record and analyze physiological data more accurately and in real time, further promoting research in the field of emotion recognition3.
Wearable sensors are one of the important areas of technological progress in recent years. These devices can continuously monitor and collect users' physiological data, providing unprecedented opportunities for emotion recognition based on physiological data. Neuroscience research shows that the generation and change of human emotions are closely related to brain activity, and EEG collected from the brain contains a large amount of information related to emotional changes. Reference4 classified EEG dynamics according to the self-reported emotional states of participants listening to music, in order to explore the relationship between emotional states and brain activity; a support vector machine (SVM) was used to classify four emotional states (joy, anger, sadness, and pleasure), with an average classification accuracy of 82.29% across 26 participants. Reference5 took the differential features extracted from multi-channel EEG as input and integrated deep belief networks with hidden Markov models, achieving an accuracy of 87.62%. Reference6 proposed a method based on empirical mode decomposition for detecting dynamically evolving emotional patterns in electrocardiograms; the study showed that when the induction method was effective for the subjects, the responsiveness of the electrocardiogram to emotions was higher. Reference7 analyzed EEG signals using a fusion of CNN and LSTM and achieved a recognition accuracy of 93.74% on the SEED dataset. Reference8 used a deep convolutional neural network (DCNN) to automatically detect and diagnose epileptic seizures in EEG signals, achieving an accuracy of 88.76%.
Physiological signals play a crucial role in identifying emotional expressions9. Changes in heart rate are often associated with stress or excitement, changes in skin conductance level can reflect anxiety or relaxation, and the electrical signals of facial muscle activity (such as smiling or frowning) recorded by electromyography directly express emotions such as happiness or anger. Therefore, comprehensive analysis of multiple physiological signals can identify an individual's emotional state more accurately than a single type of signal. Reference10 designed and implemented a specialized experimental protocol to induce specific emotional states while acquiring three types of peripheral physiological signals (ECG, GSR, and RESP); nonlinear methods were used to extract features instead of standard features, which improved the recognition of the physiological signals. Reference11 induced emotions in participants through videos and collected their EEG, ECG, and GSR; multiple machine learning algorithms and deep learning models were used to classify the physiological features into emotional and personality categories. Reference12 used a combined feature set of ECG and GSR to classify three emotional states (happy, sad, and neutral) and evaluated the effectiveness of GSR and ECG in emotional state recognition. Reference7 proposed an emotion recognition model based on a multi-layer long short-term memory recurrent neural network (LSTM-RNN) with two attention mechanisms, which achieved higher accuracy on the Mahnob-HCI database. Reference13 used LSTM for emotion recognition in natural conversations based on HR, ST, and EDA signals. Reference14 used EEG and ECG for emotion recognition of musicians, and the results showed that combining EEG and ECG features achieved the best classification performance, more accurate than using ECG alone. Reference15 fused ECG and EDA signals to recognize music-induced emotions, with an accuracy 4% to 6% higher than single-modal methods.
Transformers and graph networks are two types of deep learning models with different strengths in handling data structures and relationships. The Transformer is a deep learning architecture based on the self-attention mechanism, whose core advantages are powerful parallel computing, long-sequence processing capability, and model performance; it is widely used in natural language processing (NLP), computer vision (CV), and robot interaction. For example, Reference16 proposed an edge-assisted epipolar Transformer for industrial scene reconstruction, designing an edge-detection branch to constrain the consistency of epipolar geometry and edge features so that dense scene representations can be reconstructed with limited memory blocks. Reference17 proposed a robust depth estimation method based on parallax attention for aerial scene perception, which applies 3D convolution to capture feature-matching correspondences and, through a Transformer-based end-to-end disparity estimation network, alleviates the region jitter caused by disparity changes and the overfitting caused by cost-volume regularization in direct disparity regression. Graph networks specialize in processing graph-structured data, such as social networks, molecular structures, and knowledge graphs; graph-structured data consist of nodes and edges and can naturally represent relationships between entities. For example, Reference18 proposed a graph-network approach to fMRI-based brain disease diagnosis that selects brain regions important for classification, thereby locating the active regions related to brain diseases and achieving better diagnostic performance. Reference19 proposed a graph-based individual-level fMRI segmentation algorithm that fully utilizes fMRI connectivity information, depends less on spatial structure, and yields better functional segmentation results. In comparison, the Transformer has strong parallel computing capability, can efficiently process long EEG sequences, and, through self-attention, can effectively capture long-distance dependencies in the sequence; this is the main reason for choosing the Transformer for emotion recognition and classification in this paper.
This article proposes a multi-scale sparse temporal autoencoder Transformer for sentiment classification of EEG signals, based on three-dimensional feature extraction and convolutional feature transformation of EEG signals. The model is referred to as the 3D-CNN-MSTA-Transformer. The overall framework of this model includes raw EEG data input, 3D feature extraction, CNN feature transformation module, improved MSTA Transformer classification, and classification result output. The experimental results show that the model has good classification accuracy for various emotions on the SEED, SEED-IV-1, and SEED-IV-2 datasets, verifying the effectiveness of the proposed algorithm.
Related research
Basic emotion theory
Emotion classification provides a theoretical framework and reference standards for emotion recognition. By clarifying the categories and criteria of emotion classification, emotion recognition systems can design and optimize recognition algorithms in a more targeted way. Common emotion theory models are shown in Table 1.
Basic emotion theory is the earliest method of emotion classification, proposed mainly by the psychologist Ekman in the 1960s. Through cross-cultural research he found six basic emotions in human facial expressions: happiness, sadness, anger, fear, surprise, and disgust, and argued that these basic emotions are universal and not influenced by culture. Reference20 points out that humans and animals show many similarities in expressing emotions and that these expressions have biologically adaptive functions; Ekman's basic emotion theory not only supports Darwin's viewpoint but also provides experimental evidence for it. In addition, reference21 proposed another basic theory of emotions, in which emotions are composed of eight basic emotions along the three dimensions of intensity, similarity, and polarity, and can be combined in pairs to form complex emotions; for example, anger and disgust combine to form contempt, while joy and surprise combine to form delight. The emotion theory model in reference22 intuitively demonstrates the relationships between emotions through geometric shapes.
Dimension emotion model
The dimensional emotion model compensates for the limitations of the discrete emotion model by mapping discrete emotions into a coordinate space, describing emotions as continuous variables along different dimensions. Reference23 proposed a three-dimensional spatial model of emotion with three dimensions, Pleasure, Arousal, and Dominance, as shown in Fig. 1. This three-dimensional representation is commonly referred to as the PAD (Pleasure-Arousal-Dominance) model.
Pleasure represents the degree of emotional pleasantness, also known as valence; its positive and negative ends correspond to positive and negative emotions, respectively. Arousal reflects the intensity of the emotion, with positive values indicating high emotional arousal and negative values indicating low arousal. Dominance represents the degree of influence exerted on, or received from, the outside world: positive dominance indicates a high, controlling state, while negative dominance indicates a low, submissive state. Dominance is therefore also known as the energy dimension.
Circular emotion model
In an in-depth study of the PAD model, reference24 found that pleasure and arousal alone can distinguish the vast majority of emotion types. He therefore proposed the circular (circumplex) emotion model, which uses only the two dimensions of pleasure and arousal and holds that emotions are continuously distributed along these two dimensions, evenly arranged around a circle. Because of its horizontal- and vertical-axis structure, this model is also known as the VA (Valence-Arousal) emotion model, as shown in Fig. 2.
The two-dimensional structure of the VA model makes emotion classification both simple and easy to understand, effectively capturing the complexity of emotions, making it currently the most commonly used emotion model. The sentiment ring model proposed in reference25 emphasizes the continuity and diversity of emotions, providing a new perspective for sentiment classification. Compared to basic emotion theory, the emotion dimension model is more flexible and can describe the intensity and complexity of emotions. This model can map various emotional states into a two-dimensional space for quantitative analysis and comparison.
In addition, there are other dimensional emotion models, such as the Valence-Arousal-Dominance space theory proposed in reference26, which includes motivational, evaluative, and control dimensions; the Positive-Negative Affect (PANA) model proposed in reference27; and the Energy-Tension (ET) model proposed in reference28.
Complex emotion model
The complex emotion model is based on basic emotion and emotion dimension models, further refining emotion types and describing the complexity and diversity of emotions. For example, reference29 proposes a four-dimensional emotion model that includes pleasure, tension, excitement, and conviction. He believes that human emotions are composed of basic emotions, complex emotions, and social emotions. Basic emotions are innate, complex emotions are a combination of basic emotions, and social emotions are influenced by culture and social environment.
The methods for complex emotion classification also include emotion vocabulary classification, emotion state classification, etc. For example, psychologist Barnett proposed a classification method for emotional vocabulary, which summarizes different emotional categories through analysis of emotional vocabulary. This method helps to understand the language expression of emotions and the relationships between different emotional vocabulary.
In order to better analyze and utilize features from different emotion related aspects, a hybrid model combining basic emotion features and dimension-based emotion features can be constructed.
Let \({{\varvec{x}}_b}\) represent the feature vector extracted from basic emotion recognition. Assuming there are n basic emotions, \({{\varvec{x}}_b}=\left( {{x_{b1}},{x_{b2}}, \cdots ,{x_{bn}}} \right)\), where \({x_{bi}}\) is the feature value corresponding to the i-th basic emotion.
Let \({{\varvec{x}}_d}\) represent the feature vector extracted from dimension-based emotion recognition. If we consider a three-dimensional PAD model, then \({{\varvec{x}}_d}=\left( {{x_{d1}},{x_{d2}},{x_{d3}}} \right)\), where \({x_{d1}}\), \({x_{d2}}\), and \({x_{d3}}\) correspond to the values of the pleasure, arousal, and dominance dimensions, respectively.
This hybrid model combines these two types of features. A simple approach is a weighted linear combination, so the overall feature vector \({\varvec{x}}\) of the hybrid model can be expressed as:

$${\varvec{x}}=\alpha {{\varvec{x}}_b} \oplus \beta {{\varvec{x}}_d}$$

where ⊕ denotes vector concatenation, and α and β are weight coefficients that control the contributions of the basic emotional features and the dimension-based emotional features. These weight coefficients can be determined experimentally, for example by grid search on the validation set to find the values that maximize the performance of the emotion recognition model.
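As a concrete illustration, the minimal Python sketch below builds the fused vector from hypothetical basic-emotion and PAD feature vectors; the function name, the example feature values, and the weights α = 0.6 and β = 0.4 are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def fuse_features(x_b: np.ndarray, x_d: np.ndarray,
                  alpha: float = 0.6, beta: float = 0.4) -> np.ndarray:
    """Weighted concatenation of basic-emotion features x_b (length n)
    and dimensional PAD features x_d (length 3) into one hybrid vector.
    alpha and beta are the weight coefficients described above; the
    defaults are placeholders to be tuned by grid search on a validation set."""
    return np.concatenate([alpha * x_b, beta * x_d])

# Toy usage: 6 basic-emotion scores and one PAD triple
x_b = np.array([0.10, 0.70, 0.05, 0.05, 0.05, 0.05])
x_d = np.array([0.80, 0.30, 0.50])
x = fuse_features(x_b, x_d)   # fused feature vector of length 9
```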
Description of emotion recognition issues
Algorithm process
The overall framework of the 3D-CNN-MSTA-Transformer model is shown in Fig. 3, which mainly includes raw EEG data input, 3D feature extraction, CNN feature transformation module, improved MSTA-Transformer classification (three or four categories), and classification result output.
The main functions of the three modules in the figure are as follows.

(1) Three-dimensional (3D) feature extraction. This module extracts information of emotion-recognition value from the raw EEG data. A sample of duration T is first divided, without overlap, into S segments using a time window W, retaining the temporal information. Five frequency bands, θ, α, β, γ, and γ', are extracted from each EEG segment; the 51–75 Hz band is included, going beyond the 50 Hz limit of most previous work. The differential entropy features of each segment are then computed in the five bands; these features are stable and perform well in EEG emotion recognition. To preserve spatial information, the differential features of the 62-channel EEG are arranged into a two-dimensional feature matrix, with zeros filled in at positions without electrodes. Finally, the matrices of the frequency bands are stacked and concatenated to obtain a three-dimensional feature matrix that integrates spatial, temporal, and frequency information, providing rich features for subsequent analysis.

(2) CNN feature transformation module. This module consists of local spatial and time-frequency convolutional layers, a global spatial convolutional layer, and average pooling. Inspired by pseudo-3D convolution, the 3D convolution kernel is decomposed into 2D-CNN and 1D-CNN kernels to reduce computational cost and accelerate training. The local spatial convolutional layer uses 64 kernels of size 3 × 3 × 1 to mine spatial information of adjacent electrode channels while preventing the loss of edge information; the time-frequency convolutional layer extracts time-frequency information with 64 kernels of size 1 × 1 × 5. The kernel of the global spatial convolutional layer has the same size as the two-dimensional spatial matrix, integrating the spatial information of all electrode channels, compressing and fusing features, and reducing the number of parameters. Batch normalization is applied after each convolutional layer, features are compressed through average pooling, and, after a dimensional transformation, deep features are obtained that match the input of the Transformer module.

(3) MSTA-Transformer classification module. Logically, this module consists of four parts: data preparation, a sparse time-block autoencoder, hidden-space embedding, and a temporal Transformer. Data preparation supplies serialized vector data that meet the model's requirements; the sparse time-block autoencoder improves the embedding layer of the temporal Transformer, enhancing temporal-dependency modeling through lightweight temporal position encoding and producing the hidden space; hidden-space embedding improves the original Transformer embedding layer by encoding positions with a parallel single-layer fully connected neural network and summing the codes to obtain the final embedding vector; and the temporal Transformer uses a multi-head self-attention mechanism to capture the global dependency patterns of the original temporal signal. After multiple transformations in the encoder and decoder, it performs sentiment classification and outputs the results.
Three-dimensional feature organization
Figure 4 shows the specific process of turning a single raw EEG sample into the 3D feature structure. First, a sample of duration T is segmented, without overlap, into S segments using a time window W. The EEG in each window W is treated as one frame of the sample, so that one EEG sample contains S consecutive frames and the temporal information of the EEG is preserved. Research has shown that the high-frequency part of the EEG has a greater impact on emotion recognition and that fusing multiple frequency bands works better than a single band. Therefore, five frequency bands, θ (4–7 Hz), α (8–13 Hz), β (14–30 Hz), γ (31–50 Hz), and \({\gamma}^{\prime}\) (51–75 Hz), were extracted from each EEG frame. It is worth noting that most previous studies only extracted bands below 50 Hz, whereas this article also uses the 51–75 Hz band. In addition, research has shown that differential entropy (DE) is one of the most suitable and stable features in EEG emotion recognition, superior to other commonly used features such as PSD. The differential entropy of each EEG frame in the 5 frequency bands was therefore calculated as:

$$h\left( X \right)= - \int_{{ - \infty }}^{{+\infty }} {f\left( x \right)\log f\left( x \right){\text{d}}x}$$

where \(f(x)\) represents the probability density function of the continuous signal.
For a fixed-length EEG segment that approximately follows a Gaussian distribution \(N\left( {\mu ,{\sigma ^2}} \right)\), its differential entropy can be expressed as:

$$h\left( X \right)=\frac{1}{2}\log \left( {2\pi e{\sigma ^2}} \right)$$

where e is Euler's number, and µ and σ are the mean and standard deviation, respectively.
In order to maintain the spatial information of the EEG electrode positions, a method of constructing a two-dimensional matrix was adopted, as shown in the dashed box in the middle of Fig. 4.
Based on the positions of the EEG electrodes on the scalp, the differential entropy features of the 62-channel EEG were converted into a two-dimensional feature matrix, with zeros filled in at positions without electrodes to avoid introducing redundant information. The two-dimensional feature matrices of all frequency bands in each EEG frame are then stacked to obtain a three-dimensional matrix of size h × w × b, where h and w are the maximum numbers of electrode positions in the vertical and horizontal directions, respectively, and b is the number of frequency bands. Finally, the three-dimensional matrices of the S frames are concatenated along the frequency-band dimension in temporal order, yielding a sample-level three-dimensional feature matrix \({{\varvec{X}}_n} \in {{\varvec{R}}^{h \times w \times (b \times S)}}\) that fuses the spatial, temporal, and frequency information of the EEG. In this article, the sample length T and time window W are 6 and 1, respectively, meaning that each sample contains 6 consecutive frames, and the values of h, w, and b are 9, 9, and 5, respectively.
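The following Python sketch, given only for illustration, assembles such a feature tensor from band-filtered frames; the differential entropy follows the Gaussian form above, while the channel-to-grid mapping used here is a hypothetical placeholder rather than the actual 10–20-based layout used in the paper.

```python
import numpy as np

def differential_entropy(x: np.ndarray) -> float:
    """DE of a band-filtered EEG frame under the Gaussian assumption:
    0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * np.var(x) + 1e-12)

def build_3d_feature(sample: np.ndarray, channel_pos: dict,
                     h: int = 9, w: int = 9) -> np.ndarray:
    """sample: array of shape (62, 5, S, L) holding, for each of 62 channels,
    5 band-filtered signals for S frames of L points each.
    channel_pos maps a channel index to its (row, col) cell in the h x w
    electrode grid; unfilled cells stay zero.
    Returns an (h, w, 5*S) tensor stacking bands and frames in time order."""
    n_ch, n_band, n_frame, _ = sample.shape
    feat = np.zeros((h, w, n_band * n_frame))
    for c in range(n_ch):
        r, col = channel_pos[c]
        for s in range(n_frame):
            for b in range(n_band):
                feat[r, col, s * n_band + b] = differential_entropy(sample[c, b, s])
    return feat

# Toy usage with a placeholder mapping (the real mapping follows the 10-20 layout)
channel_pos = {c: divmod(c, 9) for c in range(62)}
sample = np.random.randn(62, 5, 6, 200)       # S = 6 frames of 1 s at 200 Hz
X = build_3d_feature(sample, channel_pos)     # shape (9, 9, 30)
```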
CNN module
The CNN module mainly consists of three parts: local spatial time-frequency convolutional layer, global spatial convolutional layer, and average pooling. Inspired by the idea of pseudo 3D convolution, the traditional 3D convolution kernel with a size of k×k×l is equivalently decomposed into the k×k×1 convolution kernel of 2D-CNN and the 1 × 1×l convolution kernel of 1D-CNN, which can effectively reduce computational costs and accelerate training speed. Therefore, this article decomposes the 3D convolution kernel of the local spatial time-frequency convolutional layer into a local spatial convolutional layer and a time-frequency convolutional layer. The details of the relevant layer parameters of the CNN module are shown in Table 2.
In Table 2, the local spatial convolution layer contains 64 convolution kernels with a size of 3 × 3 × 1 and a stride of 1. This layer is used to mine the spatial information of adjacent electrode channels and adopts a zero-padding mode to prevent the loss of edge spatial information. The time-frequency convolutional layer contains 64 convolution kernels with a size of 1 × 1 × 5 and a stride of 1. This layer is used to extract information in the time-frequency dimension. The kernel size of the global spatial convolutional layer is 9 × 9 × 1, which is the same size as the two-dimensional spatial matrix. This layer is used to integrate the spatial information of the global electrode channels, and the number of kernels is set to 32 to further compress and fuse feature information, reducing the number of parameters.
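A quick per-filter weight count (ignoring biases and input-channel counts) illustrates why the decomposition in Table 2 is cheaper than an undecomposed 3 × 3 × 5 kernel; this worked comparison is an editorial illustration rather than a result reported in the paper:

$$\underbrace{3 \times 3 \times 5}_{\text{full 3D kernel}}=45\ \text{weights per filter}\quad \text{vs.}\quad \underbrace{3 \times 3 \times 1}_{\text{spatial}}+\underbrace{1 \times 1 \times 5}_{\text{time-frequency}}=14\ \text{weights per filter}.$$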
A batch normalization (BN) operation is added after each convolutional layer to accelerate training and suppress overfitting. Next, an average pooling layer with a kernel size of 1 × 1 × 6 and a stride of 4 further compresses the feature information along the time-frequency dimension. Finally, the obtained feature map undergoes a dimensional transformation to match the input of the Transformer module. The operation of the CNN module can be written as:

$${{\varvec{Y}}_n}={\text{CN}}{{\text{N}}_{{\text{module}}}}\left( {{{\varvec{X}}_n}} \right)$$

where \({\text{CN}}{{\text{N}}_{{\text{module}}}}\left( \cdot \right)\) represents all operations of the CNN module and \({{\varvec{Y}}_n} \in {{\mathbf{R}}^{6 \times 32}}\) is the extracted deep feature information, in which 6 is the number of features and 32 is the dimension of each feature, i.e., one token of the Transformer network; N is the number of samples.
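A minimal Keras sketch of this module is given below for illustration; the padding modes and ReLU activations are assumptions chosen so that the output matches the stated 6 × 32 token map, and are not specified in the paper.

```python
from tensorflow.keras import layers, models

def cnn_module(h=9, w=9, depth=30):           # depth = 5 bands x 6 frames
    inp = layers.Input(shape=(h, w, depth, 1))
    # local spatial convolution: 64 kernels of 3x3x1 with zero padding
    x = layers.Conv3D(64, (3, 3, 1), padding='same', activation='relu')(inp)
    x = layers.BatchNormalization()(x)
    # time-frequency convolution: 64 kernels of 1x1x5
    x = layers.Conv3D(64, (1, 1, 5), padding='valid', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    # global spatial convolution: 32 kernels of 9x9x1 fuse all electrodes
    x = layers.Conv3D(32, (h, w, 1), padding='valid', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    # average pooling 1x1x6 with stride 4 along the time-frequency axis
    x = layers.AveragePooling3D(pool_size=(1, 1, 6), strides=(1, 1, 4))(x)
    # dimensional transformation to 6 tokens of width 32 for the Transformer
    out = layers.Reshape((6, 32))(x)
    return models.Model(inp, out)

model = cnn_module()
model.summary()   # final feature map: (batch, 6, 32)
```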
MSTA transformer model
Sparse autoencoder
A sparse autoencoder (SAE) consists of an encoder and a decoder. The input of the encoder is the original image space of the data, the output of the decoder is the value-domain space, and the output of the encoder (equivalently, the input of the decoder) is the hidden space. Denoting the encoder input by \({A_0}\), the hidden space by \({A_1}\), and the decoder output by \({A_2}\), a standard formulation with a sigmoid activation \(\sigma \left( z \right)=\frac{1}{{1+{e^{ - z}}}}\) is:

$${A_1}=\sigma \left( {{W_1}{A_0}+{b_1}} \right),\quad {A_2}=\sigma \left( {{W_2}{A_1}+{b_2}} \right)$$

where z is any real number, \({W_1}\) and \({b_1}\) are the encoder weights and bias, and \({W_2}\) and \({b_2}\) are the decoder weights and bias.
The optimization objective is to minimize the reconstruction loss while approximating the target activation distribution, so the network loss function is:

$$L=\frac{1}{M}\sum\limits_{{i=1}}^{M} {\left\| {A_{2}^{{(i)}} - A_{0}^{{(i)}}} \right\|^{2}} +\beta \sum\limits_{j} {{\text{KL}}\left( {p\left\| {{{\hat {p}}_j}} \right.} \right)} +\lambda {\left\| {\varvec{W}} \right\|^2}$$

where \({\hat {p}_j}=\frac{1}{M}\sum\nolimits_{{i=1}}^{M} {a_{j}^{{(i)}}}\) and \(p=\frac{1}{M}\sum\nolimits_{{i=1}}^{M} {s_{j}^{{(i)}}}\); \(a_{j}^{{(i)}}\) is the output value of the j-th neuron of the i-th sample in the hidden space \({A_1}\); \(s_{j}^{{(i)}}\) is the input value of the j-th neuron of the i-th sample in the encoder input \({A_0}\); M is the total number of samples; β is the sparsity constraint coefficient; and λ is the regularization coefficient. After optimization under these constraints, the hidden space obtained by the SAE is a low-rank approximation of the original image space.
In the proposed MSTA-Transformer, the SAE is designed as a lightweight structure compared with the original Transformer: (1) the encoder does not use a large number of neurons, which could cause overfitting and high computational cost; for example, the number of hidden-layer neurons was set to 64 in the experiments, much smaller than the typical width of the original Transformer embedding layer. (2) The decoder structure is simple: a single-layer fully connected neural network reconstructs the input data from the hidden space, further reducing the parameter count and computational complexity. (3) Sparsity constraints are introduced during SAE training: the sparsity coefficient β in the loss function above is set to a non-zero value (e.g., β = 0.1), which promotes sparse activation of the hidden neurons and supports the lightweight design.
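The sketch below illustrates such a lightweight SAE in Keras; the 64-unit hidden layer, the single-layer decoder, and β = 0.1 follow the description above, while the sigmoid activations, the target sparsity ρ = 0.05, and the weight-decay value λ = 1e-4 are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

class KLSparsity(regularizers.Regularizer):
    """KL-divergence sparsity penalty on the mean hidden activation."""
    def __init__(self, rho=0.05, beta=0.1):
        self.rho, self.beta = rho, beta
    def __call__(self, a):
        p_hat = tf.reduce_mean(a, axis=0) + 1e-10     # mean activation per neuron
        kl = tf.reduce_sum(
            self.rho * tf.math.log(self.rho / p_hat) +
            (1 - self.rho) * tf.math.log((1 - self.rho) / (1 - p_hat + 1e-10)))
        return self.beta * kl

def build_sae(input_dim: int, hidden_dim: int = 64, lam: float = 1e-4):
    inp = layers.Input(shape=(input_dim,))
    # lightweight encoder: 64 sigmoid units with KL sparsity + L2 regularization
    hidden = layers.Dense(hidden_dim, activation='sigmoid',
                          activity_regularizer=KLSparsity(),
                          kernel_regularizer=regularizers.l2(lam))(inp)
    # single-layer fully connected decoder reconstructing the input
    recon = layers.Dense(input_dim, activation='sigmoid')(hidden)
    sae = models.Model(inp, recon)
    encoder = models.Model(inp, hidden)     # hidden-space (A1) extractor
    sae.compile(optimizer='adam', loss='mse')
    return sae, encoder
```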
Modern optimization theory
The general criteria for evaluating optimization algorithms in modern optimization theory are the convergence of the objective value and the time complexity of the algorithm. Depending on whether an algorithm uses Hessian information, optimization algorithms can be divided into first-order and second-order algorithms. Common gradient-descent methods are first-order algorithms, such as gradient descent (GD) and stochastic gradient descent (SGD). Because GD must compute the gradient over all sample points at each iteration, SGD, its stochastic variant, is generally used instead in large-scale optimization. Adaptive moment estimation (Adam), as an approximate first-order algorithm, has the advantage of not requiring the samples to satisfy the independent-and-identically-distributed assumption, which better matches the data distribution in real-world EEG emotion classification scenarios. However, Adam is noisy during training, and its theoretical convergence value deviates somewhat from the true analytical optimum; for high-precision applications, alternative algorithms are therefore often needed. Stochastic variance reduction gradient (SVRG) is an optimization algorithm based on a double-loop iteration that periodically reduces the gradient variance, lowering training noise and improving the final prediction accuracy of the model. In this paper, the SVRG algorithm is used to further optimize the model.
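For reference, a minimal NumPy sketch of the SVRG double loop on a toy least-squares problem is given below; it shows the snapshot full gradient and the variance-reduced inner update, and is a generic illustration rather than the exact optimizer implementation used for the network in this paper.

```python
import numpy as np

def svrg(grad_fn, w0, data, lr=0.01, outer_iters=20, inner_iters=100, seed=0):
    """Minimal SVRG loop for a finite-sum objective.
    grad_fn(w, sample) returns the gradient of one sample's loss at w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(outer_iters):
        w_snap = w.copy()
        # outer loop: full gradient at the snapshot
        mu = np.mean([grad_fn(w_snap, d) for d in data], axis=0)
        for _ in range(inner_iters):
            d = data[rng.integers(len(data))]
            # inner loop: variance-reduced stochastic gradient
            g = grad_fn(w, d) - grad_fn(w_snap, d) + mu
            w -= lr * g
    return w

# Toy usage: least squares on synthetic data
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 5)), rng.normal(size=200)
data = list(zip(A, b))
grad = lambda w, d: 2 * (d[0] @ w - d[1]) * d[0]
w_star = svrg(grad, np.zeros(5), data)
```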
Algorithm framework
This article proposes a multi-scale sparse temporal autoencoder Transformer (MSTA-Transformer) sentiment classification model, which logically comprises four parts: data preparation, a sparse time-block autoencoder, hidden-space embedding, and a temporal Transformer. Compared with traditional prediction models, the temporal Transformer proposed here adopts a multi-head self-attention mechanism to capture and recognize sequence-dependent patterns, modeling the long-term temporal dependence of the sequences more effectively and thereby achieving emotion classification of the EEG physiological signals. The overall architecture of the model is shown in Fig. 5.
In the sentiment classification model shown in Fig. 5, there are four stages: data preparation, sparse time block autoencoder, hidden space embedding, and temporal Transformer.
1. Data preparation stage. Data preparation is the bridge between the raw EEG physiological signal data and the prediction model, and logically includes two parts: data collection and storage, and preprocessing. Preprocessing can be further divided into data cleaning, normalization, and temporal organization. The data are collected in real time by wearable sensors, and the time resolution of the data selected in this article is 1 minute. The resulting raw EEG physiological signal data include two fields: the collection time and the real-time EEG physiological signal. Data cleaning removes records with mismatched fields or missing values from the raw data, retaining the record values valuable for research. Considering the scaling effect of data dimensions and ranges on the neurons of a deep learning model during backpropagation parameter updates, the data are normalized to the interval (0, 1) using formula (8):
$$x = \frac{{z - z_{{\min }} }}{{z_{{\max }} - z_{{\min }} }}\quad(8)$$

where z is the actual EEG physiological signal value in the raw data; x is the corresponding normalized value; and \({z_{\min }}\) and \({z_{\max }}\) are the minimum and maximum EEG physiological signal values in the raw data, respectively.
Temporal organization refers to the process of organizing data in a format that conforms to the model’s solving objectives based on normalized data. The goal of solving the model in this article is to fit the following conditional probability distribution function:
$${P_{{\text{obj}}}}=P\left( {{x_{{t_0}+1:{t_0}+T}}\left| {{x_{1:{t_0}}}} \right.} \right)\quad(9)$$

where \({t_0}\) is the input window size and T is the prediction window size. Formula (9) describes a single-value prediction problem. Accordingly, after data preparation, each sample point is a serialized vector with an input window of size \({t_0}\) and a prediction window of size T, and each element of the vector is a normalized value derived from the cleaned raw data.
2. Sparse time-block autoencoder. The sparse time-block autoencoder is an improvement of the SAE network applied to the embedding layer of the temporal Transformer; it enhances the modeling of temporal dependence through lightweight temporal position encoding. Its basic structure is consistent with the SAE network described in the "Sparse autoencoder" section. The difference is that the sparse time-block autoencoder first combines the serialized sample points into time blocks according to their temporal semantic proximity, with the combination coefficient set to 9. The resulting time blocks \(\varvec{B}_{1} ,\varvec{B}_{2} , \cdots ,\varvec{B}_{n}\) are high-dimensional tensors in memory; to pass through the subsequent SAE network, they are transformed into n machine vectors by a single fully connected layer (with bias parameters). The obtained machine vectors serve as the original image space input for training the SAE network, whose optimization objective, given above, contains two parts: minimizing the reconstruction loss and minimizing the KL divergence. The hidden space lies between the encoder and decoder of the SAE network; its numerical form is the encoding vector of a time block, which uniquely identifies the current time-block information. After the SAE network is trained, the hidden space is a low-rank approximation, at the multi-nearest-neighbor scale, of the original EEG temporal signal.
3. Hidden-space embedding. The advantage of using a Transformer for time-series analysis lies in its ability to capture the global dependency patterns of the original temporal signal through multi-head self-attention and, with the support of the sparse time-block autoencoder, to comprehensively consider the neighboring elements of the signal around any physically meaningful node. At this point, the machine-encoded vectors may contain duplicate values with a certain probability, and, owing to its structural characteristics, the Transformer cannot distinguish the center time of each input node. This paper therefore improves the original Transformer embedding layer: a parallel single-layer fully connected neural network encodes, respectively, the hidden-space output of the sparse time-block autoencoder and the timestamp information of the original EEG sample points, and the two encoding vectors are then summed to obtain the final embedding vector (a minimal sketch of this embedding path is given after this list).
4. Temporal Transformer. In the temporal Transformer encoder, the embedding vector is first transformed into a query, key, and value by three separate linear layers and passed into a multi-head self-attention layer. The dot-product vector obtained through this nonlinear transformation is added to the original embedding vector and passed into a layer-normalization network; the resulting vector is transformed by a single linear layer, added again to its input, and the sum forms the encoder output. Similarly, in the decoder, the embedding vector is first transformed into a query, key, and value by three linear layers; unlike in the encoder, these are fed into the multi-head self-attention layer together with a ladder (look-ahead) mask. After this nonlinear transformation, the result is added to the original embedding vector and converted into a value vector through layer normalization. The encoder output is then used as the query and key and, together with this value vector, is passed into a third multi-head self-attention layer for nonlinear transformation; the resulting vector is added to the value vector, layer-normalized, and finally passed through a linear layer to obtain the predicted EEG physiological signal at the desired time.
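To make the embedding path of steps 2 and 3 concrete, the Keras sketch below groups a serialized signal into time blocks of 9 points, maps each block to a vector with a single dense layer standing in for the trained sparse time-block encoder, adds a parallel dense encoding of the block timestamps, and feeds the summed embedding into one multi-head self-attention block; the sequence length, model width, and head count are illustrative assumptions rather than the paper's exact settings.

```python
from tensorflow.keras import layers, models

def embedding_branch(seq_len=54, block=9, d_model=64):
    assert seq_len % block == 0
    n_blocks = seq_len // block
    signal = layers.Input(shape=(seq_len, 1), name='eeg_sequence')
    stamps = layers.Input(shape=(n_blocks, 1), name='block_timestamps')

    # group consecutive points into time blocks (combination coefficient 9)
    blocks = layers.Reshape((n_blocks, block))(signal)
    # single fully connected layer -> hidden-space code of each block
    hidden = layers.Dense(d_model, activation='relu')(blocks)
    # parallel single-layer FC encoding of timestamp / position information
    pos = layers.Dense(d_model, activation='relu')(stamps)
    # summation gives the final embedding vector per block
    emb = layers.Add()([hidden, pos])

    # one multi-head self-attention block of the temporal Transformer encoder
    att = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(emb, emb)
    enc = layers.LayerNormalization()(layers.Add()([emb, att]))
    return models.Model([signal, stamps], enc)

model = embedding_branch()
model.summary()   # output: (batch, 6 time blocks, 64)
```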
Experimental analyses
Experimental setup
The experimental hardware is an i7-12700H 2.3 GHz CPU with an NVIDIA GeForce GTX 2060 GPU, and the software environment uses the Python 3.8 Keras framework to build the network model. The SVRG (stochastic variance reduction gradient) optimizer is used to compute batch gradients over the training set. The initial learning rate is set to 0.0004 and decays according to \({l_r}=0.0004/\left( {1+decay\_rate \cdot epoch} \right)\), where \({l_r}\) is the current learning rate, \(decay\_rate\) is the decay rate (set to 0.001), and \(epoch\) is the current training epoch. The batch size for each iteration is 128. The cross-entropy loss is used; for a sample with true labels y and predicted class probabilities p, the loss is \(L= - \sum\nolimits_{{i=1}}^{n} {{y_i}\log \left( {{p_i}} \right)}\), where n is the number of classes. The model is trained for up to 100 epochs, and training stops if the validation loss does not decrease for 10 consecutive epochs. All experiments use 5-fold cross-validation: the dataset is randomly divided into 5 approximately equal, non-overlapping subsets; in each fold, 4 subsets are used for training and the remaining subset for validation, and the final result is averaged over the 5 folds to evaluate the stability of the model.
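The stated schedule can be expressed with standard Keras callbacks as sketched below; because SVRG is not a built-in Keras optimizer, the compile/fit lines are left as illustrative comments and the optimizer would be supplied separately.

```python
# Training schedule: initial learning rate 4e-4 with
# lr = 0.0004 / (1 + 0.001 * epoch), batch size 128, at most 100 epochs,
# and early stopping after 10 epochs without validation improvement.
from tensorflow.keras import callbacks

def lr_schedule(epoch, lr=None):
    """lr = 0.0004 / (1 + decay_rate * epoch), decay_rate = 0.001."""
    return 4e-4 / (1.0 + 1e-3 * epoch)

train_callbacks = [
    callbacks.LearningRateScheduler(lr_schedule),
    callbacks.EarlyStopping(monitor='val_loss', patience=10,
                            restore_best_weights=True),
]

# model.compile(optimizer=..., loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=128, callbacks=train_callbacks)
```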
The SEED and SEED-IV datasets are both public emotion datasets collected by the BCMI Laboratory of Shanghai Jiao Tong University. The specific situation is as follows:
1. The SEED dataset contains three types of emotion: positive, neutral, and negative. It includes EEG data from 15 subjects (7 males and 8 females) with an average age of about 23 years. Using the international 10–20 system, EEG was recorded from 62 electrodes while the subjects watched the related videos. Each subject participated in three sessions and watched the same 15 film clips, each lasting about 4 min, with 5 positive, 5 neutral, and 5 negative clips. There was a 45 s self-assessment and a 15 s break between clips.
2. The SEED-IV dataset includes four emotions: happy, sad, fearful, and neutral. It contains EEG data collected in the same manner from 15 subjects (6 males and 9 females), except that each subject watched 24 different videos in each of three sessions, each video lasting about 2 min. In this article, the EEG of each session of the SEED-IV dataset is mixed, each trial is segmented into samples without re-registration, and the data remaining below one sample length T at the end of each trial are discarded.
EEG preprocessing of the SEED and SEED-IV datasets consists of: (1) Data cleaning. If the amplitude of a raw EEG data point exceeds the normal range of the EEG signal, it is treated as a missing or damaged value and deleted. (2) Data filtering. A 0–75 Hz band-pass filter with finite impulse response (FIR) characteristics removes low-frequency noise (DC offset) and high-frequency noise outside the EEG frequency range. (3) Data downsampling. The signals are downsampled to 200 Hz by keeping every n-th sample, where n is the ratio of the original sampling rate to the target rate (200 Hz). The format of the preprocessed SEED and SEED-IV datasets is shown in Table 3.
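A possible SciPy rendering of these preprocessing steps is sketched below; the raw sampling rate of 1000 Hz, the 100 µV amplitude threshold, the filter order, and the use of clipping rather than deletion for out-of-range samples are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def preprocess(eeg: np.ndarray, fs_in: int = 1000, fs_out: int = 200,
               band_hi: float = 75.0, amp_limit: float = 100.0) -> np.ndarray:
    """eeg: (channels, samples) raw signal in microvolts (assumed)."""
    # 1. data cleaning: clip samples outside the plausible amplitude range
    #    (the paper deletes such values; clipping is used here for simplicity)
    eeg = np.clip(eeg, -amp_limit, amp_limit)
    # 2. FIR filtering up to 75 Hz (zero-phase via filtfilt); this also acts
    #    as the anti-alias filter for the later decimation to 200 Hz
    taps = firwin(numtaps=101, cutoff=band_hi, fs=fs_in, pass_zero=True)
    eeg = filtfilt(taps, [1.0], eeg, axis=-1)
    # 3. downsampling: keep every n-th sample, n = fs_in / fs_out
    step = fs_in // fs_out
    return eeg[..., ::step]

raw = np.random.randn(62, 10 * 1000) * 30     # 10 s of 62-channel toy data
clean = preprocess(raw)                        # shape (62, 2000) at 200 Hz
```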
Using the hardware and software environment and training settings described above, the experiments were organized as follows. For the SEED dataset, a holistic classification experiment involving all subjects was conducted, which better matches the practical application of emotional brain-computer interfaces. For the SEED-IV dataset, an overall classification experiment was conducted for each session, and the final result was taken as the average over the three sessions. To better evaluate the stability of the model, all experiments used 5-fold cross-validation.
Classification accuracy test
The sentiment recognition and classification results on the three datasets are shown in Table 4; the model demonstrates good classification accuracy for each type of emotion.
From Table 4 it can be seen that the model performs well in recognizing and classifying the various emotions across the datasets. On the SEED dataset, the classification metrics are generally high: for the "calm" emotion, the precision (P) reaches 95.113 ± 1.0%, the recall (R) 93.347 ± 1.2%, and the F1 value 94.216 ± 1.1%, indicating accurate and comprehensive recognition of this emotion. Comparing SEED with SEED-IV-1, for the "happy" emotion the precision increases from 94.687 ± 1.1% to 96.732 ± 0.9%, the recall from 92.439 ± 1.0% to 97.245 ± 0.8%, and the F1 value from 93.540 ± 0.9% to 97.098 ± 0.7%, indicating a clear enhancement of the model's ability to recognize "happy" on SEED-IV-1. Comparing SEED-IV-1 with SEED-IV-2, the precision for the "sad" emotion on SEED-IV-1 is 93.117 ± 1.0%, with a recall of 88.139 ± 1.3% and an F1 value of 90.543 ± 1.2%, whereas on SEED-IV-2 the precision drops to 86.513 ± 1.4%, the recall to 84.108 ± 1.5%, and the F1 value to 85.398 ± 1.3%, indicating weaker recognition of "sad" on SEED-IV-2. The statistical p-values for most emotions are below 0.05, showing that the model's recognition of the various emotions on the different datasets is statistically significant and further verifying its effectiveness. Overall, the model performs well on most emotion categories, with some differences between datasets that may relate to sample distribution and feature differences; the influence of these factors can be explored further to optimize emotion recognition in different scenarios.
In the SEED-IV-1 dataset, the recall rates for fear and sadness are relatively low but still reach 87.65% and 88.24%, respectively. The confusion matrix is shown in Fig. 6; the dominant diagonal shows that the model classifies most emotion categories well, indicating excellent recognition performance. In summary, the model achieves good emotion recognition performance on the vast majority of the three datasets, with only a few misclassifications for a small number of emotions, which is related to the unavoidable noise and systematic errors in the signal acquisition process. The data also vary with the acquisition and segmentation intervals used, which limits direct comparison with other research methods.
The confusion matrix in Fig. 6 also shows a relatively high misclassification rate between the fear and sadness emotions. We carried out a causal analysis of the EEG signal characteristics of these two emotions: (1) Frequency-band analysis. The power spectral densities of the fear and sadness EEG signals are somewhat similar in the θ (4–7 Hz) and α (8–13 Hz) bands. (2) Temporal pattern analysis. The time-series plots of the fear and sadness EEG signals show relatively slow patterns of change in certain electrode channels (such as the frontal electrodes), without significant rapid fluctuations. (3) Spatial distribution analysis. The activated brain regions of the two emotions overlap to some extent, with activity fluctuations in both the prefrontal cortex and the temporal lobe, indicating that the spatial features of fear and sadness extracted from the EEG signals may not be sufficiently distinct.
Comparison of algorithm performance
To evaluate the efficiency of the proposed model, it was compared with classical CNN models, including ResNet-3430, ShuffleNet V231, and MobileNet V232, all of which are image classification architectures. In addition, to further verify the effectiveness of the proposed algorithm for EEG signal processing, EEGNet33 and the vanilla Transformer34, two algorithms suited to EEG data structures, were selected for comparative experiments. The results are shown in Table 5, which compares the average values of the evaluation metrics of the various emotion recognition classification models.
According to the experimental data in Table 5, on the SEED dataset the proposed model performs excellently in accuracy (A), recall (R), specificity (S), and F1 score. Compared with ResNet-34, its A value is 4.463% higher, its R value 4.051% higher, its S value 0.728% higher, and its F1 value 4.541% higher, and its computation time of 3.006 s is far lower than ResNet-34's 8.524 s. ShuffleNet V2 and MobileNet V2, as lightweight models, perform markedly worse than the proposed model, with clear gaps in all indicators. The vanilla Transformer achieves reasonable performance, but its computation time of 7.246 s is relatively long; the proposed model is more computationally efficient while maintaining performance. EEGNet comes closest in performance, but the proposed model still leads on every indicator. A similar trend is observed on the SEED-IV-1 dataset, where the proposed model outperforms the other models on all indicators: compared with ResNet-34, its A value is 3.79% higher, its R value 3.645% higher, its S value 0.928% higher, and its F1 value 3.8% higher, with a markedly shorter computation time. On the SEED-IV-2 dataset, the proposed model maintains its advantage, improving on accuracy, recall, and the other indicators relative to the compared models while having the shortest computation time. This indicates that the proposed model delivers stable, excellent performance on different datasets, effectively reducing computation time and improving efficiency while maintaining high classification accuracy.
Ablation experiment
1. To further analyze the impact of each module on model performance, ablation experiments were conducted on the three datasets. The setup is as follows: Experiment 1 removes the sparse autoencoder module; Experiment 2 removes the hidden-space embedding module; Experiment 3 uses the original Transformer for sentiment classification. As shown in Fig. 7, the accuracy of Experiment 1 decreases by 2.712%, 1.023%, and 4.186% on the three datasets, respectively, indicating that the sparse autoencoder is crucial to recognition performance across datasets and is needed to highlight the key emotional information in the temporal signal. The accuracy of Experiment 2 decreases by 2.457%, 7.108%, and 4.134%, respectively, indicating that the hidden-space embedding plays a key role in handling the multi-scale features of the emotional information; its effect is most evident for data whose emotional features change markedly across scales. The accuracy of Experiment 3 decreases by 3.658%, 6.553%, and 5.541%, respectively, indicating that the sparse autoencoder and the hidden-space embedding jointly improve the performance of the model.
2. To explore the impact of the amount of extracted data on recognition accuracy, 4, 8, 12, and 16 frames were extracted separately; the results are shown in Fig. 8. The recognition accuracy on all three datasets gradually improves as the number of extracted EEG frames increases, indicating that a higher frame count provides richer temporal information about the EEG signal and markedly improves emotion recognition performance; emotional expression is complex and varied, so more information is needed to capture subtle emotional changes. However, as the knowledge learned by the model becomes more comprehensive, performance reaches a bottleneck, and further increasing the amount of data no longer yields a significant improvement in accuracy.
To verify the effectiveness of decomposing the 3D convolution kernel into 2D-CNN and 1D-CNN kernels, the following experiments were designed: (1) Experiment 4 uses an undecomposed 3D convolution kernel for feature extraction while keeping the other modules unchanged. (2) Experiment 5 decomposes the 3D convolution kernel into 2D-CNN and 1D-CNN kernels of different sizes (2D kernel size 4 × 4 × 1, 1D kernel size 1 × 1 × 4) for feature extraction. The experimental results are shown in Table 6.
The results in Table 6 show that on the SEED dataset the accuracy of Experiment 5 is 92.732%, higher than Experiment 4's 90.567% and the original method's 89.234%, and its time of 22.7 s is less than Experiment 4's 25.3 s and the original method's 25.1 s. On the SEED-IV-1 dataset, Experiment 5's accuracy of 92.970% is clearly ahead of Experiment 4's 88.901% and the original method's 87.654%, and its time of 23.1 s is also more favorable. On the SEED-IV-2 dataset, the accuracy of Experiment 5, 88.153%, is higher than Experiment 4's 86.789% and the original method's 85.321%, although the improvement is relatively small, and its time of 24.8 s is less than the original method's 29.0 s. Overall, decomposing the 3D convolution kernel into 2D-CNN and 1D-CNN kernels improves accuracy and reduces time in most cases, verifying its effectiveness.
Conclusion
This article focuses on the problem of sentiment classification of EEG signals and proposes an innovative 3D-CNN-MSTA-Transformer sentiment classification model. This model deeply explores the information closely related to emotion recognition in EEG data through three-dimensional feature extraction, laying a solid foundation for subsequent analysis. The CNN feature transformation module cleverly decomposes the 3D convolution kernel into 2D-CNN and 1D-CNN convolution kernels, effectively reducing computational costs while accelerating training speed and improving the overall efficiency of the model. The improved MSTA Transformer classification module achieves efficient classification of emotions through the collaborative work of data preparation, sparse time block autoencoder, hidden space embedding, and temporal Transformer, utilizing advanced technologies such as multi head self-attention mechanism.
Although the model proposed in this article achieves good results on the emotion recognition classification task, many unknown areas in emotion recognition remain to be explored. In the future, we will continue to deepen our research in the following directions. (1) Model optimization and improvement: in response to the performance fluctuations of the existing model across datasets, we will further optimize the model structure to enhance its adaptability and stability under different data distributions and emotional expressions. (2) Multimodal emotion recognition: considering the limitations of a single physiological signal, we will investigate how to integrate multiple physiological signals (such as EEG, ECG, and GSR) as well as non-physiological signals such as facial expressions and speech to construct an emotion recognition system, improving comprehensiveness and accuracy by combining multiple information sources. (3) Real-time emotion recognition and application: with the popularity of wearable devices and mobile computing, real-time emotion recognition has become a promising research direction; we will develop emotion recognition algorithms and systems that operate efficiently in real-time environments, supporting fields such as mental health monitoring, human-computer interaction, and intelligent education. (4) Cross-cultural emotion recognition: people from different cultural backgrounds differ in emotional expression and cognitive patterns; we will study cross-cultural emotion recognition and explore how to construct models that adapt to different cultural environments, further improving the adaptability of the algorithm.
Data availability
All data generated or analysed during this study are included in this published article.
References
Khare, S. K. et al. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inform. Fusion. 102, 102019 (2024).
Jafari, M. et al. Emotion recognition in EEG signals using deep learning methods: A review. Comput. Biol. Med. 165, 107450 (2023).
Geetha, A. V. et al. Emotion recognition with deep learning: advancements, challenges, and future directions. Inform. Fusion. 105, 102218 (2024).
Rashad, M. et al. CCNN-SVM: automated model for emotion recognition based on custom convolutional neural networks with SVM. Information 15(7), 384 (2024).
Krishna, B. H. et al. Emotion-net: automatic emotion recognition system using optimal feature selection-based hidden Markov CNN model. Ain Shams Eng. J. 15(12), 103038 (2024).
Mishra, S. P., Warule, P. & Deb, S. Improvement of emotion classification performance using multi-resolution variational mode decomposition method. Biomed. Signal Process. Control. 89, 105708 (2024).
Geethanjali, R. & Valarmathi, A. A novel hybrid deep learning IChOA-CNN-LSTM model for modality-enriched and multilingual emotion recognition in social media. Sci. Rep. 14(1), 22270 (2024).
Bhangale, K. B. & Kothandaraman, M. A novel two-way feature extraction technique using multiple acoustic and wavelets packets for deep learning based speech emotion recognition. Multimedia Tools Appl. 84(15), 14529–14552 (2025).
Elsheikh, R. A. et al. Improved facial emotion recognition model based on a novel deep convolutional structure. Sci. Rep. 14(1), 29050 (2024).
Sreedivya, R. S. & Sreelatha, G. Emotion classification in virtual reality: An RMS-optimized RNN approach using ECG, GSR, and eye-tracking signals. in 2025 4th International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS) 580–585 (IEEE, 2025).
Dessai, A. & Virani, H. And multidomain feature fusion for emotion classification based on electrocardiogram and galvanic skin response signals. Sci 6(1), 10 (2024).
Gahlan, N. & Sethia, D. Federated learning inspired privacy sensitive emotion recognition based on multi-modal physiological sensors. Cluster Comput. 27(3), 3179–3201 (2024).
Makhmudov, F., Kutlimuratov, A. & Cho, Y. I. Hybrid LSTM–attention and CNN model for enhanced speech emotion recognition. Appl. Sci. 14(23), 11342 (2024).
Bhatlawande, S. et al. Emotion recognition based on the fusion of vision, EEG, ECG, and EMG signals. Int. J. Electr. Comput. Eng. Syst. 15(1), 41–58 (2024).
Kumar, P. S. et al. Deep learning-based automated emotion recognition using physiological signals and time-frequency methods. IEEE Trans. Instrum. Meas. 73, 1–12 (2024).
Tong, W. et al. Edge-assisted epipolar transformer for industrial scene reconstruction. IEEE Trans. Autom. Sci. Eng. 22, 701–711 (2024).
Tong, W. et al. Robust depth Estimation based on parallax attention for aerial scene perception. IEEE Trans. Industr. Inf. 20(9), 10761–10769 (2024).
Tong, W. et al. fMRI-based brain disease diagnosis: A graph network approach. IEEE Trans. Med. Rob. Bionics. 5(2), 312–322 (2023).
Tong, K. W. et al. Individual-level fMRI segmentation based on graphs. IEEE Trans. Cogn. Dev. Syst. 15(4), 1773–1782 (2023).
Gaddanakeri, R. D. et al. Analysis of EEG signals in the DEAP dataset for emotion recognition using deep learning algortihms. in 2024 IEEE 9th International Conference for Convergence in Technology (I2CT) 1–7 (IEEE, 2024).
Bardak, F. K., Seyman, M. N. & Temurtaş, F. Adaptive neuro-fuzzy based hybrid classification model for emotion recognition from EEG signals. Neural Comput. Appl. 36(16), 9189–9202 (2024).
Bravo, L. et al. A systematic review on artificial Intelligence-Based dialogue systems capable of emotion Recognition. Technol. Interact. 9(3), 28 (2025).
Jha, S. K., Suvvari, S., & Kumar, M. Band representations exploring the impact of KNN and MLP classifiers on valence-arousal emotion recognition using EEG: An analysis of DEAP dataset and EEG. in International Conference on Advances in Computing and Data Sciences 3–13 (Springer Nature Switzerland, Cham, 2024).
Aslan, M., Baykara, M. & Alakuş, T. B. Analysis of brain areas in emotion recognition from EEG signals with deep learning methods. Multimedia Tools Appl. 83(11), 32423–32452 (2024).
Jha, S. K., Suvvari, S. & Kumar, M. Emotion recognition from electroencephalogram (EEG) signals using a multiple column convolutional neural network model. SN Comput. Sci. 5(2), 213 (2024).
Huo, Y. & Ge, Y. VAD-Net: Multidimensional emotion recognition from facial expression images. in 2024 International Joint Conference on Neural Networks (IJCNN) 1–7 (IEEE, 2024).
Ramaswamy, M. P. A. & Palaniswamy, S. Emotion recognition: A comprehensive review, trends, and challenges. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 14(6), e1563 (2024).
Pring, E. X. et al. Music communicates social emotions: evidence from 750 music excerpts. Sci. Rep. 14(1), 27766 (2024).
Heng, L. & McAdams, S. The function of timbre in the perception of affective intentions: effect of enculturation in different musical traditions. Musicae Sci. 28(4), 675–702 (2024).
Alif, M. A. R. State-of-the-art Bangla handwritten character recognition using a modified resnet-34 architecture. Int. J. Innov. Sci. Res. Technol. 9, 438–448 (2024).
Ahmad, F. et al. Emotion recognition of the driver based on KLT algorithm and ShuffleNet V2. Signal. Image Video Process. 18(4), 3643–3660 (2024).
Zhu, Q. et al. A study on expression recognition based on improved mobilenetV2 network. Sci. Rep. 14(1), 8121 (2024).
Khushiyant, Mathur, V. et al. REEGNet: A resource efficient eegnet for EEG trail classification in healthcare. Intell. Decis. Technol. 18(2), 1463–1476 (2024).
Yang, C. et al. SIMformer: Single-layer vanilla transformer can learn free-space trajectory similarity. arXiv preprint arXiv:2410.14629 (2024).
Author information
Contributions
W.C. and H.W. wrote the main manuscript text, H.L. prepared relevant references and experimental equipment, and X.Z. and J.G. participated in the manuscript writing and related experimental implementation. All authors have reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chu, W., Wen, H., Liu, H. et al. Fusion of EEG feature extraction and CNN-MSTA transformer emotion recognition classification model. Sci Rep 15, 45779 (2025). https://doi.org/10.1038/s41598-025-28470-z