Introduction

Emotions are important aspects of our daily lives. They are essential in interpersonal communication, cognitive processing, and decision-making, and they reflect personal psychological and physiological conditions1. Emotional responses occur at two levels: external expression and internal change. Facial expressions, gestures, and language are the main examples of external expression2. In contrast, physiological indicators, such as the electroencephalogram (EEG) and magnetoencephalogram (MEG), together with skin resistance, heart rate, blood pressure, and respiratory rate, reflect internal changes. Although emotional reactions are essentially subjective experiences, external patterns exhibit a significant degree of complexity and ambiguity. Presently, EEG signals are widely used in clinical research, especially when emotions are involved, because they are non-invasive, affordable, and unaffected by language and cultural differences3. The application of EEG in emotion recognition research mainly involves five processes: EEG data collection, preprocessing, feature extraction, feature optimization/selection, and emotion classification.

EEG signals are commonly divided into frequency bands, including the widely known \(\delta\)-wave (1-4 Hz), \(\theta\)-wave (4-7 Hz), \(\alpha\)-wave (8-13 Hz), \(\beta\)-wave (13-30 Hz), and \(\gamma\)-wave (31-50 Hz)4. To effectively retain the key information embedded in each frequency band, the current study operates on these five bands, while feature extraction focuses on three key domains: the time, frequency, and time-frequency domains5. Time-domain features (TDF) capture the temporal characteristics of EEG signals; three commonly used examples are statistical features, Hjorth parameters6, and fractal dimension features7. Frequency-domain features (FDF) describe changes in EEG frequency content, with power spectral density (PSD)8 and differential entropy9 the most commonly used. Finally, time-frequency domain features (TFDF) integrate information from both axes, providing a comprehensive description of the signal dynamics over time and frequency. The main time-frequency representations include the short-time Fourier transform (STFT)10, the continuous wavelet transform11, and the wavelet transform12. Overall, these feature extraction methods complement one another, serving as robust tools for in-depth research on EEG signals. Considered collectively, they not only illuminate the fundamental characteristics of brain activity but also provide significant support for effectively capturing and extracting EEG features in emotion recognition tasks.
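For illustration, the short sketch below shows how such a band decomposition could be performed with zero-phase Butterworth band-pass filters; the array shape, sampling rate, and filter order are hypothetical assumptions, and this is a minimal example rather than the exact pipeline used in this work.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Canonical EEG bands (Hz); boundaries follow the ranges cited in the text.
BANDS = {"delta": (1, 4), "theta": (4, 7), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (31, 50)}

def split_into_bands(eeg, fs, order=4):
    """Return a dict of band name -> band-passed signal (channels x samples)."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering
    return out

# Example with synthetic data: 62 channels, 4 s at 200 Hz.
eeg = np.random.randn(62, 800)
bands = split_into_bands(eeg, fs=200)
```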

Figure 1

Architecture of the proposed MCNN-CA model. The input data first pass through the feature extraction module, then the feature fusion module, and finally an emotion recognition module consisting of fully connected layers produces the results.

However, EEG features are often high-dimensional and sparse, and using them directly may decrease emotion recognition performance while increasing computational costs. Therefore, it is crucial to filter or optimize these features after extraction. Despite significant progress in this field, extracting richer and more relevant information remains vital for enhancing overall system effectiveness. Additionally, while deep neural network (DNN) models are widely employed in current emotion recognition techniques, they still encounter challenges in EEG-based emotion classification tasks. Specifically, issues such as neglecting the correlations between EEG channels and the incompleteness or unreliability of the extracted features directly affect the accuracy and robustness of emotion recognition systems.

In this study, we introduce a multi-branch convolutional neural network model utilizing a cross-attention mechanism (MCNN-CA) to achieve accurate recognition of various emotions. The proposed model is designed to proficiently extract key information from the different feature dimensions inherent to EEG signals and then seamlessly merge them through cross-feature fusion. Compared with existing methods documented in contemporary literature, our model goes beyond traditional paradigms by capturing complex, subtle differences across feature dimensions and dynamically fine-tuning the weighting coefficients of different high-dimensional information flows to highlight the cues required for emotional state recognition. The main highlights of our work are as follows:

  • To effectively learn high-dimensional signal features, our model employs multiple convolutional neural network modules to extract features across varying dimensions. Each module incorporates a parametric rectified linear unit (PReLU) activation function and a batch normalization (BN) layer to expedite high-dimensional information extraction and mitigate the vanishing-gradient issue during training. This architecture not only enriches the information available for subsequent feature fusion but also lays the groundwork for enhancing overall system performance.

  • In developing the feature fusion module, we integrate an efficient channel attention mechanism into the multi-feature fusion process of the proposed model. This module effectively prioritizes task-relevant information, significantly reducing focus on ancillary data and selectively eliminating incongruent information. This strategic approach ameliorates the potential dilemma of information overload associated with the availability of diverse features, thus substantially enhancing the effectiveness and precision of the subsequent classifier module in performing classification tasks.

  • In this study, we constructed a comprehensive end-to-end emotion recognition model that processes raw signals as inputs and delivers corresponding emotion labels for prediction. In the realm of EEG emotion recognition, we undertook a series of experiments to validate our model’s performance. Moreover, we reported extensive evaluations using a multimodal dataset that includes both EEG and text inputs. The results of our experiments clearly demonstrate the substantive advantages of our proposed model in terms of accuracy, recall, and F1-score in multimodal emotion recognition tasks.

Details of the enumerated contributions of our proposed model are presented in the remainder of this work. Specifically, Section 2 provides a literature review of EEG-based emotion recognition techniques. Section 3 elucidates the distinct modules that constitute the architecture of our model. Section 4 describes the preprocessing operations and data augmentation methods applied to the datasets. Finally, Section 5 presents the results of experiments on three datasets and compares the outcomes with those of contemporary studies.

Literature review on EEG emotion recognition

As mentioned earlier, EEG is vital for emotion recognition tasks due to its unique characteristics and adaptability. Refining the features extracted from EEG signals or supplementing them with patterns from other sources is essential for improving the accuracy of subsequent emotion recognition networks. In this section, we provide a detailed overview of research focused on emotion classification using EEG, as well as studies involving EEG in emotion classification within the context of multimodal fusion.

EEG emotion recognition

In EEG-based emotion recognition, feature extraction is a vital step. This process focuses on identifying and transforming key features or relevant information from EEG signals into interpretable representations. These features reflect the physiological and anatomical characteristics of underlying brain activity, but managing the large volume of data requires substantial computational resources and sophisticated algorithms. To tackle this complexity and enable meaningful interpretation, feature extraction techniques are essential for simplifying and optimizing the signal representation.

One widely used feature extraction technique is the Hilbert-Huang transform (HHT), which is particularly effective for analyzing nonlinear and non-stationary signals, a common characteristic of EEG data13. HHT decomposes the signal into intrinsic mode functions and calculates instantaneous frequency and power, preserving important temporal information from peak time-frequency analysis while maintaining linear properties. This makes HHT a valuable tool for EEG feature extraction. Another commonly utilized technique is principal component analysis (PCA), which simplifies large datasets by identifying patterns and emphasizing similarities and differences within the data14. PCA works by rotating the dataset to find the directions (principal components) where the variance is maximized, effectively reducing the dimensionality of the data while retaining most of its informational content. This method not only helps to reduce memory usage and computational demands but also has become a popular choice in EEG feature extraction.
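As a minimal illustration of the PCA step, the following sketch reduces flattened EEG feature vectors while retaining most of the variance; the feature matrix here is a hypothetical placeholder, not the data of the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 300 trials, each described by 62 channels x 40 features.
X = np.random.randn(300, 62 * 40)

# Keep the principal components that explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```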

Various studies have recently focused on filtering or optimizing features extracted from EEG signals. For instance, Valenza et al.15 applied PCA to reduce the dimensionality of high-dimensional data while minimizing information loss. They combined PCA with a quadratic discriminant classifier to enhance recognition rates, reporting accuracies of 92.29% for arousal and 90.36% for valence. Yang et al.6 developed a cross-subject emotion recognition method that integrates sequential backward selection with significance testing and a support vector machine (SVM) model, effectively improving EEG emotion recognition accuracy while reducing computational overhead. Furthermore, Wu et al.16 utilized the Laplacian matrix to convert functional connectivity features, such as phase locking value, Pearson correlation coefficient, spectral coherence, and mutual information, into positive semi-definite form. They applied a max operator to ensure the positive definiteness of the transformed features and employed a symmetric positive definite (SPD) network to extract deep spatial information, which was validated through fully connected layers. A decision-level fusion strategy was then employed to achieve more accurate and stable recognition results.

On the other hand, classifiers can generally be categorized into machine learning (ML) classifiers and deep learning classifiers. SVM remains one of the most widely used ML classifiers. For instance, Tuncer et al.17 employed a tunable Q-factor wavelet transform based on a fractal pattern feature extraction method to generate multi-level features, achieving a remarkable 99.82% accuracy using the SVM algorithm. Similarly, Algumaei et al.18 reported higher accuracy compared to SVM on the SEED dataset (i.e., the SJTU emotion EEG dataset19) by using linear discriminant analysis (LDA) as the classifier. Unlike traditional ML methods, deep neural networks (DNNs) can capture underlying data structures and extract deeper feature representations through training, thus simplifying the feature engineering process that is typically complex in ML. As a result, many recent studies have adopted deep learning classifiers for emotion recognition tasks. For example, Tao et al.20 introduced a channel attention mechanism to assign adaptive weights to different EEG channels, coupled with an extended self-attention mechanism to explore the temporal dependencies in EEG signals. Similarly, Jia et al.21 proposed a two-stream network with an attention mechanism that adaptively focuses on important patterns. Their approach highlights the significance of integrating information across different regions and adaptively learning key information, demonstrating the importance of constructing a unified EEG emotion recognition framework. Given that information from different domains is based on distinct features, understanding and leveraging the interrelationships between these unique features is essential.

Existing studies primarily concentrate on extracting and selecting the optimal feature sets from raw EEG data. However, the feature extraction process may inadvertently lead to the loss of valuable information, which can hinder the model’s ability to learn critical aspects of the data. In this study, we aim to address this challenge by automatically extracting features from EEG data across multiple domains, specifically the time-domain, frequency-domain, and time-frequency domain, using advanced neural network architectures. Furthermore, we employ the efficient channel attention mechanism and integrate it into the multiple features fusion procedure of the proposed model. This approach not only enhances the interpretability of the data but also significantly improves the accuracy and performance of emotion recognition tasks.

Multimodal emotion recognition

Humans typically convey emotions through diverse means such as language, facial expressions, and body movements22. As a result, approaches that integrate multiple features have gained considerable traction in emotion recognition studies. Feature fusion amalgamates information from various sources to produce more informative representations, thereby bridging the gap between independent features. Researchers are now developing multimodal emotion recognition systems by merging physiological signals, notably EEG, with other perceptual modalities including speech, images, and text data. Similarly, in the realm of natural language processing, unimodal methods have certain limitations, particularly in their ability to capture the complexity of human language. To overcome these, Linzen et al.23 advocated for actively exploring the potential of multimodal data to accelerate computers’ understanding and generalization of natural language. Leveraging physiological signals is particularly intriguing for simulating human-like language learning phenomena.

In the midst of these studies, Hulliyah et al.24 designed an innovative random forest classifier that analyzes the emotional tones evoked by EEG signals, enabling effective classification of Twitter comments. Their model demonstrated excellent performance in a four-class emotion recognition task, with experimental results validating its particular effectiveness in accurately identifying anger emotions. Meanwhile, Gupta et al.25 presented a comprehensive end-to-end system that not only identifies emotional information in EEG data but also incorporates it into text to enrich emotional expression. The experimental results highlighted promising outcomes for the text enhancement task and showcased the overall robustness of the end-to-end system. Furthermore, Wang et al.26 introduced a sequence-to-sequence decoding method that enhances text sequences using a pre-trained language model, seamlessly fusing them with EEG features. This approach achieved remarkable results in integrating word-level text and EEG information, further advancing the field of emotion recognition.

However, practical challenges emerge when attempting to utilize diverse modalities, such as EEG signals and text, simultaneously due to data heterogeneity and variability. Despite the significant progress made in these studies, a universal method for transforming text and EEG signals into common feature vectors for fusion remains elusive. Moreover, many of these studies rely on EEG features to adjust and supplement emotion classification results, implying that errors in EEG emotion recognition may not effectively correct text emotion classifications. Therefore, this study employs a feature fusion module that enhances the inter-correlation between features using a channel-efficient attention mechanism. This innovation proves effective and robust in fusing distinctive features within a single mode and across different modalities.

To address these challenges, this study proposes a multi-branch convolutional neural network model that employs a multi-feature fusion mechanism for accurate emotion recognition. The following sections detail the proposed methodology in line with the previously highlighted contributions.

Architecture of the proposed model

In this section, our main goal is to build an intelligent network that can elucidate the complex mapping between data features and emotional labels. Given that EEG data is characterized by a non-Gaussian distribution and non-stationarity, along with the subtle emotional nuances inherent in text information, achieving accurate emotion recognition from either EEG signals or text is challenging. To overcome this, we formulate a multi-branch convolutional neural network framework, whose basic architecture is presented in Fig. 1. In the feature extraction module, we use enhanced convolutional neural networks to extract EEG features from different domains, including the time, frequency, and time-frequency domains. Furthermore, to process text data for multimodal emotion recognition, we developed a BiLSTM network model to extract features from text data. Subsequently, the feature fusion module processes the extracted features, facilitating seamless integration. Ultimately, the fused features are fed into the emotion recognition module, which produces the final classification results. In the remainder of this section, we explain the working principles of each module and detail its implementation.

Table 1 EEG feature extraction module based on convolutional networks.

Feature extraction module

Feature extraction plays a central role in machine learning and data analysis27. It involves transforming raw data into more informative and practical feature representations28. The primary purpose of feature extraction is to improve an algorithm's understanding and use of the data, thereby facilitating its application in classification, clustering, regression, and related tasks29. In the following subsections, we describe the details of our EEG and text feature extraction modules.

Figure 2

Structure of the designed BiLSTM-CNN network.

EEG feature extraction module

In the EEG feature extraction process, we obtain relevant information from three perspectives: the time domain, frequency domain, and time-frequency domain. Time-domain feature extraction focuses on the signal’s dynamic changes over time. By calculating parameters such as mean and standard deviation, we reveal the overall fluctuation level of the signal, laying the groundwork for understanding EEG dynamics. Frequency-domain feature extraction emphasizes the signal’s characteristics in the frequency dimension. Analyzing parameters like power spectral density and band energy helps us understand energy distribution across frequencies and differentiate between various brainwave activities. We will employ multiple algorithms to extract features from the time or frequency domain, utilizing a one-dimensional (1D) convolutional neural network to optimize these features and enhance the model’s performance. Finally, time-frequency domain feature extraction integrates analytical methods from both domains. Techniques such as short-time Fourier transform, wavelet transform, and time-frequency spectrograms provide a comprehensive view of the signal’s time-varying characteristics. By combining time and frequency data, we employ a two-dimensional (2D) convolutional neural network to further optimize the extracted features, improving the overall effectiveness of the model. The specific dimensions of extracted features and the network architecture are detailed in Table 1.
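To make the three descriptor families concrete, the sketch below computes simple time-domain statistics, a Welch PSD, and an STFT spectrogram for a single synthetic channel. The window lengths, sampling rate, and exact statistic set are illustrative assumptions rather than the configuration in Table 1.

```python
import numpy as np
from scipy.signal import welch, stft
from scipy.stats import skew, kurtosis

def time_domain_features(x):
    """Eight simple statistics of a 1-D signal (mean, peak, std, skewness, kurtosis, max, min, ZCR)."""
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)  # zero-crossing rate
    return np.array([x.mean(), np.abs(x).max(), x.std(), skew(x),
                     kurtosis(x), x.max(), x.min(), zcr])

def frequency_domain_features(x, fs):
    """Power spectral density via Welch's method."""
    freqs, psd = welch(x, fs=fs, nperseg=fs)
    return freqs, psd

def time_frequency_features(x, fs):
    """Magnitude STFT spectrogram (frequency x time)."""
    f, t, Z = stft(x, fs=fs, nperseg=fs)
    return np.abs(Z)

x = np.random.randn(800)              # one channel, 4 s at 200 Hz (synthetic)
tdf = time_domain_features(x)
_, psd = frequency_domain_features(x, fs=200)
tfdf = time_frequency_features(x, fs=200)
```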

For convenience, we represent the time domain, frequency domain, and time-frequency domain as \(D_T\), \(D_F\), and \(D_{T\text {-}F}\), respectively. In the designed feature extraction network, we let \(x^{l-1}_{T}\in {\mathbb {R}}^{L\times 1\times C}\) and \(x^{l-1}_F\in {\mathbb {R}}^{L \times 1 \times C}\) be the input dimension of the 1D convolutional network used to extract \(D_T\) and \(D_F\), and \(k^{l}_{1D}\in {\mathbb {R}}^{K\times 1\times C\times N}\) be the size of the convolutional kernel dimension in 1D convolution, then for a 2D convolutional network extracting \(D_{T\text {-}F}\), the input dimension is \(x^{l-1}_{T-F}\in {\mathbb {R}}^{W \times H \times C}\) and the convolutional kernel size is \(k^{l}_{2D}\in {\mathbb {R}}^{K \times K \times C \times N}\). Among them, the convolution operation of the l-th layer \(u^l\) is shown by Eq. (1):

$$\begin{aligned} u^l = k^l *x^{l-1} + b = \sum _{c=1}^{C} k_c^l *x^{l-1}_c + b^{l}, \end{aligned}$$
(1)

where \(b^l\) denotes the bias, \(*\) indicates the convolution operation, and C represents the number of input channels.

In our model, the design of the convolutional modules depends on the sequence of convolutional neural network layers. First, the convolutional layer uses a set of learnable convolutional kernels as a basis to process the input data through convolution operations. However, since emotion recognition is not a linear problem, using only convolutional layers for feature learning can lead to problems such as information loss, which in turn causes the model to overfit and reduces test-set accuracy. To address this issue, we integrated a PReLU activation function layer and a BN layer into the convolutional module. Overall, the introduction of the PReLU activation and BN layers helps the convolutional modules learn the features of the input data more effectively. This process can be expressed mathematically in the form presented in Eq. (2):

$$\begin{aligned} x^l = \sigma (\phi ( \omega (x^{l-1}))), \end{aligned}$$
(2)

where \(x^l\) represents the feature map of the lth convolutional layer, \(\sigma (\cdot )\) is the PReLU activation function, \(\phi (\cdot )\) is the batch normalization operation, and \(\omega (\cdot )\) denotes the convolutional operation.

Last but not least, a common approach employed to highlight the key features of convolution operations and to enlarge the receptive field is the introduction of a pooling layer after the convolutional layer. As an independent neural layer, the pooling operation has no parameters, and its main purpose is to filter out irrelevant features while retaining important representations. Therefore, the feature maps generated by the pooling layers capture the key information of the original data better. The nth feature map of the lth pooling layer \(y_n^l\) could be expressed by Eq. (3):

$$\begin{aligned} y_n^l = pool(x_n^l,k,s), \end{aligned}$$
(3)

where \(x_n^l\) is the nth output feature map of the lth convolutional layer, \(pool(\cdot )\) is the max pooling operation, k is the pooling kernel size, and s represents the stride of the pooling kernel.

After processing features through the extraction network, we observed that the dimensions of optimized features vary across different extraction networks, which creates difficulties for subsequent feature fusion operations. Dimensionality reduction of high-dimensional features is a common and effective strategy during the optimization process. Thus, we flatten the 2D features along the channel dimension, converting them into 1D features that can then undergo 1D convolution operations for dimensionality reduction. The output feature map \(F_z\) for this process can be formulated using Eq. (4):

$$\begin{aligned} F_z = conv1d(f(x_{TFDF}^l)), \end{aligned}$$
(4)

where \(f(\cdot )\) denotes the flattening operation, \(x_{TFDF}^l\) represents the input features, and \(conv1d(\cdot )\) indicates the 1D convolution operation. As a result, our approach to EEG feature extraction combines insights from the time, frequency, and time-frequency domains using advanced convolutional networks to improve accuracy in emotion recognition tasks.
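As a sketch of one such convolutional block (convolution followed by batch normalization, PReLU activation, and max pooling, then flattening), the PyTorch snippet below is a simplified stand-in; the kernel sizes, channel counts, and input shape are placeholders rather than the values listed in Table 1.

```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """Conv1d -> BatchNorm -> PReLU -> MaxPool, as described for the EEG branches."""
    def __init__(self, in_ch, out_ch, k=3, pool=2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.PReLU()
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):                  # x: (batch, channels, length)
        return self.pool(self.act(self.bn(self.conv(x))))

# Example: a time-domain branch with 62 input channels and a flattened feature output.
branch = nn.Sequential(ConvBlock1D(62, 128), ConvBlock1D(128, 64), nn.Flatten())
f_t = branch(torch.randn(16, 62, 8))       # hypothetical (batch, channels, features) input
```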

Text feature extraction module

To achieve effective fusion of text and EEG features, it is equally essential to extract and optimize the text features. In this process, we employ the BiLSTM-CNN network for text feature extraction. This network combines the strengths of bidirectional long short-term memory (BiLSTM) networks and convolutional neural networks (CNNs), facilitating the capture of contextual information and local structures within the text. As illustrated in Fig. 2, BiLSTM can capture contextual information in sentences, which helps the model to better understand the patterns of word vectors in sentences, and effectively alleviates the problem of gradient vanishing that may occur during the model’s learning process. Subsequently, by effectively capturing local features through CNNs, this combination enhances the model’s comprehension of the text data, leading to improved performance.

In BiLSTM, the storage, updating, and output of historical information are precisely controlled by storage units, input gates, forgetting gates, and output gates. Input gates are used to regulate the effect of input vectors on the state of storage units, ensuring that information is properly integrated. By this configuration, the output gate allows the state of the memory unit to affect the final output result. Consequently, the function of the forgetting gate is to allow memory cells to selectively retain or forget previous information as required to meet the needs of the current task. Before feeding text features into our proposed BiLSTM-CNN network, the initial step involves generating word vectors for the text data. To achieve this, we utilize a GloVe (global vectors for word representation) dictionary to convert the text features into word vectors. GloVe is a word embedding method based on global word frequency statistics that learns vector representations for each word by constructing a co-occurrence matrix of words. This representation effectively captures the semantic information and contextual relationships of words, establishing a strong basis for subsequent text feature extraction.

The text feature input to the BiLSTM-CNN is defined as \(X=[X_1, X_2, \cdots , X_T]\), where each \(X_t\) is a d-dimensional word embedding vector. These word vectors subsequently serve as input to the BiLSTM network. The hidden layer size of the BiLSTM network is set to H to effectively capture contextual information within the text sequences. The remaining network parameters include the input weight matrix \(W \in {\mathbb {R}}^{H \times d}\), the hidden state weight matrix \(U \in {\mathbb {R}}^{H \times H}\), and the bias vector \(b \in {\mathbb {R}}^{H \times 1}\). During training, these parameters (W, U, b) are learned and updated through the back-propagation algorithm, enabling the model to capture long-term dependencies and mitigate the vanishing gradient problem. At time step t, the information \(x_t\) first passes through the forgetting gate \(f_t\) for information filtering, as described in Eq. (5):

$$\begin{aligned} f_t = \sigma (W_{f}x_{t} + U_{f}h_{t-1} + b_f), \end{aligned}$$
(5)

where \(h_{t\text {-}1}\) represents the feature mapping learned by the network at time \(t\text {-}1\), and \(\sigma (\cdot )\) is the sigmoid function. Subsequently, the features are further processed through the input gate \(i_t\), as shown in Eq. (6):

$$\begin{aligned} i_t = \sigma (W_{i}x_{t} + U_{i}h_{t-1} + b_i). \end{aligned}$$
(6)

In addition, the candidate cell state, which represents the current input information without considering the previous cell state, is computed at each time step. This candidate state is derived from the current input and the previous hidden state, processed through specific weight matrices and activated using the hyperbolic tangent function, as represented in Eq. (7):

$$\begin{aligned} c_{t1} = \tanh (W_{c}x_{t} + U_{c}h_{t-1} + b_{c}). \end{aligned}$$
(7)

Here, \(\tanh (\cdot )\) serves as a normalization function that maps its input to a value between -1 and 1. The current cell state \(c_{t}\) is then determined by the previous cell state \(c_{t-1}\), the candidate state \(c_{t1}\), the forgetting gate \(f_t\), and the input gate \(i_t\). Moreover, similar to Eqs. (5) and (6), the feature vector must pass through the output gate \(o_t\) to obtain the necessary information. The hidden state \(h_t\) at the current time step, which is crucial for information transfer and feature representation, is computed as

$$\begin{aligned} h_{t} = o_{t} \otimes \tanh (c_{t}). \end{aligned}$$
(8)

In this equation, \(\otimes\) denotes element-wise multiplication.

The hidden state conveys information from the current time step to the next, enabling the model to capture and retain long-range dependencies. It also regulates the flow of information through the gating mechanisms, including the forgetting gate, input gate, and output gate, which determine what information to retain, add, or discard. Acting as a compressed representation of the input sequence up to the current time step, the hidden state encodes all relevant contextual information for classification, prediction, or generation tasks. In a BiLSTM, it integrates information from both forward and backward sequences, allowing for a simultaneous understanding of context from both directions. Moreover, the hidden state is crucial for network training, as it is updated through the back propagation algorithm, enabling the LSTM to effectively handle sequential data and perform various sequence-related tasks.

To extract information from text features and integrate it with data from other sources, we employed a CNN architecture with three inner layers that filter data and adjust the dimensionality of the text features. This design effectively consolidates multiple features, enhancing overall performance. The subsequent experimental sections will demonstrate how our CNN architecture addressed challenges related to varying feature dimensions during the fusion process, further refining the information in the text features.
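A condensed PyTorch sketch of this BiLSTM-CNN idea is given below; the embedding dimension, hidden size, and output width are illustrative assumptions, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    """BiLSTM for context, then 1D convolutions over the hidden states for local patterns."""
    def __init__(self, embed_dim=300, hidden=128, out_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(
            nn.Conv1d(2 * hidden, out_dim, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.AdaptiveMaxPool1d(1),         # collapse the time axis
        )

    def forward(self, x):                    # x: (batch, seq_len, embed_dim) GloVe vectors
        h, _ = self.bilstm(x)                # (batch, seq_len, 2*hidden)
        h = h.transpose(1, 2)                # (batch, 2*hidden, seq_len) for Conv1d
        return self.conv(h).squeeze(-1)      # (batch, out_dim) text feature vector

text_feat = BiLSTMCNN()(torch.randn(16, 20, 300))   # 16 sentences of 20 tokens each
```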

Figure 3

Flowchart of the feature fusion module. The two fused features highlight their respective important information through the ECA model. The highlighted features are then filtered using an MLP network and finally fused.

Figure 4

Data augmentation: (a) Original EEG data, (b) Rotated EEG data, (c) Vertically flipped EEG data, and (d) Noisy EEG data.

Feature fusion module

A single feature is insufficient to effectively address EEG emotion recognition and multimodal emotion recognition tasks. To tackle the challenges of EEG feature fusion and multimodal feature integration, we have developed a strategy for incorporating diverse features into the model. This strategy first optimizes the dimensions of the various features to facilitate fusion on a unified dimensional scale, and then extracts useful information from the generated modal data. Simply applying traditional feature fusion methods may introduce redundant information, which can adversely affect subsequent emotion classification tasks. Therefore, we propose a cross-feature intersection fusion module based on the integration of different features. This module employs an efficient channel attention mechanism that dynamically adjusts the weights of channel features, emphasizing important features while suppressing redundant information. The workflow of the feature fusion module is illustrated in Fig. 3.

In the feature fusion mechanism, the global features of a single modality are especially important for classification. The feature interaction mechanism is introduced to exploit this global information so that each branch contains essential information from the other branches. Given the feature mappings \(F_x, F_y \in {\mathbb {R}}^{B \times I \times C}\) extracted from two types of modal data, where B is the batch size, C is the number of channels, and I is the length of each feature map, we perform this interaction between any two of the three domains. Because the feature interaction operation is asymmetric, interacting features from all three domains simultaneously would fuse features containing redundant information. For convenience, we use only \(F_x\) and \(F_y\) as examples. First, \(F_y\) is aggregated through channel-wise max pooling to obtain globally expressive features, which are then concatenated with \(F_x\) at the same level, as expressed in Eq. (9):

$$\begin{aligned} \overline{F_x} = \psi (F_x, {pool}_m(F_y)), \end{aligned}$$
(9)

where \(\overline{F_x}\in {\mathbb {R}}^{B\times (I+1)\times C}\), \(\psi (\cdot )\) denotes the connection operation, and \(pool_m(\cdot )\) represents global maximum pooling.

In this process, to retain the essential information of features \(F_x\) and \(F_y\) for emotion recognition, we utilize the efficient channel attention (ECA) module to assign appropriate weights to these features. The ECA module incorporates a critical global average pooling (GAP) component, which allows us to effectively extract global information from the features. Furthermore, we use a multi-layer perceptron to adjust the weight of feature \(F_y\) and apply a maximum operator to fuse the two features. This operator computes the maximum of the two feature maps \(\overline{F_x}\) and \(\overline{F_y}\) at the same position a of the cth channel, as defined in Eq. (10):

$$\begin{aligned} \Gamma _{max}^{a,c} = max{(\overline{F_{x}}^{a,c},\overline{F_{y}}^{a,c})}. \end{aligned}$$
(10)

In the end, these two features are merged to obtain the final feature vector using the feature fusion method that was introduced in the previous section.
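The snippet below sketches one plausible reading of this cross-fusion step (global pooling, concatenation as in Eq. (9), ECA reweighting, an MLP on one branch, and the element-wise maximum of Eq. (10)). The kernel size, MLP shape, and tensor dimensions are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP over the feature axis, then a light 1D conv over channels."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: (B, I, C) as in the text
        w = x.mean(dim=1, keepdim=True)           # global average pool over length -> (B, 1, C)
        w = torch.sigmoid(self.conv(w))           # channel weights in (0, 1)
        return x * w                              # reweight the channels

def cross_fuse(fx, fy, eca, mlp):
    """Concatenate the max-pooled summary of the other branch, apply ECA, then max-operator fusion."""
    fx_bar = torch.cat([fx, fy.max(dim=1, keepdim=True).values], dim=1)  # Eq. (9): (B, I+1, C)
    fy_bar = torch.cat([fy, fx.max(dim=1, keepdim=True).values], dim=1)
    fx_bar, fy_bar = eca(fx_bar), mlp(eca(fy_bar))                       # MLP adjusts the F_y branch
    return torch.maximum(fx_bar, fy_bar)                                 # element-wise max, Eq. (10)

B, I, C = 16, 64, 128
fused = cross_fuse(torch.randn(B, I, C), torch.randn(B, I, C), ECA(), nn.Linear(C, C))
```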

In our model, the time-domain and frequency-domain features of EEG signals are processed through a feature extraction network and adjusted to a consistent dimension, allowing for seamless integration with text features during the subsequent fusion phase. The text features are similarly processed using a BiLSTM-CNN network to ensure their dimensions match those of the EEG features, thus achieving consistency across multi-modal data in the feature space. This uniform dimensional setting not only simplifies the feature fusion process but also provides a stable foundation for the emotion classification task. However, given the complexity of the data structure of time-frequency features, directly merging them with other features could lead to information redundancy or dimension mismatch. To address this issue, our model first applies a dimensionality reduction technique to the time-frequency features using Eq. (4), ensuring their dimensions align with those of the time-domain, frequency-domain, and text features. This reduction method aims to retain key information from the time-frequency features, enabling their effective integration within a unified feature space. By ensuring consistency and integrity in the feature space, our model can effectively synthesize information from different modalities, ultimately improving the accuracy and robustness of emotion classification.

Auxiliary parameters of the proposed model

In this section, we present further details of the emotion recognition module as well as the optimizer and loss function used to train and test the model. The emotion recognition module consists of four fully connected layers (FCLs); the first three contain 2048, 512, and 128 neurons, respectively, and each is followed by a dropout layer and a rectified linear unit (ReLU) activation function. The final FCL contains a number of neurons equal to the number of emotion classes, completing the emotion classification task.
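A sketch of such a classification head in PyTorch is shown below; the input dimension and dropout rate are placeholders rather than the values used in our experiments.

```python
import torch.nn as nn

def make_classifier(in_dim, num_classes, p=0.5):
    """Three hidden FCLs (2048, 512, 128) with ReLU and dropout, plus the output layer."""
    return nn.Sequential(
        nn.Linear(in_dim, 2048), nn.ReLU(), nn.Dropout(p),
        nn.Linear(2048, 512),    nn.ReLU(), nn.Dropout(p),
        nn.Linear(512, 128),     nn.ReLU(), nn.Dropout(p),
        nn.Linear(128, num_classes),
    )

head = make_classifier(in_dim=4096, num_classes=3)   # e.g., three SEED emotion classes
```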

Figure 5

Structure of the EEG emotion recognition model using TDF, PSD, and STFT features.

Meanwhile, among the many optimization algorithms available for neural networks, we use AdaGrad as the optimizer for training. This choice is motivated by the fact that AdaGrad is a gradient descent optimization method with an adaptive learning rate: it performs larger updates for infrequent parameters and smaller updates for frequent parameters, which makes it well suited to sparse data. Another advantage of AdaGrad is that it eliminates the need to manually tune the learning rate, since it continually adjusts the learning rate during the iteration process and allows each parameter in the objective function to have its own learning rate.

Furthermore, since the emotion recognition task is a multi-class classification problem, the cross-entropy function is used as the loss function, as defined in Eq. (11):

$$\begin{aligned} \ell =\frac{1}{N} \sum _i L_i = - \frac{1}{N} \sum _i \sum _{c=1}^{M} y_{ic} \log (p_{ic}). \end{aligned}$$
(11)

Here, \(y_{ic}\) is a binary indicator representing the true label of the ith sample for the cth class, and \(p_{ic}\) is the predicted probability that the ith sample belongs to the cth class. Additionally, M and N denote the number of classes and the number of samples in a batch, respectively.
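In code, the training objective and optimizer described above amount to something like the following sketch; the stand-in network and learning rate are placeholders rather than the tuned settings reported later.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 128), nn.ReLU(), nn.Linear(128, 3))  # stand-in network
criterion = nn.CrossEntropyLoss()                      # implements Eq. (11) for a batch
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)  # per-parameter adaptive rates

def train_step(features, labels):
    """One optimization step: forward pass, cross-entropy loss, backward pass, update."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)          # labels are integer class indices
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(16, 4096), torch.randint(0, 3, (16,)))
```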

Experimental results and comparative analysis

In this section, a series of experiments were conducted to evaluate the proposed model. We begin by outlining the data processing operations and data augmentation methods used in our approach. To facilitate a more in-depth analysis of the model's classification performance, we used accuracy (Acc), recall (Rec), precision (Pre), and F1-score (F1) as evaluation criteria and compared the results with relevant research from recent years. In EEG emotion recognition, subject-independent experiments are conducted to verify the stability of the model; therefore, the standard deviation (STD) was also included as an assessment metric. In addition, we report comparisons and analysis of the experimental results on three datasets: SEED, SEED-IV, and ZuCo. To verify the generalizability of our proposed model, we compared it with recent EEG emotion recognition models that also employed the SEED and SEED-IV datasets, as well as with other models on the ZuCo dataset. Through extensive testing, we found that the model achieved optimal performance with a learning rate of 9e-10, a batch size of 16, and 1000 training epochs; unless otherwise noted, all experiments were conducted with this configuration. These experiments enabled us to comprehensively evaluate the model's performance across different datasets, validating its effectiveness and reliability in the field of EEG emotion recognition.

Dataset description and preprocessing

We use the min-max normalization technique to standardize the EEG data. This technique maps the range of the original signal to the interval [0,1] while preserving the inherent waveform characteristics. The normalization procedure takes the form of Eq. (12):

$$\begin{aligned} x^{*} = \frac{x[n] - x[n]_{min}}{x[n]_{max} - x[n]_{min}} , \end{aligned}$$
(12)

where given the original signal x[n], \(x[n]_{min}\) and \(x[n]_{max}\) represent the minimum and maximum values within the signal, respectively. For text data used for multimodal emotion recognition, we simply standardize the capitalization of sentences and remove spaces and non-emotional symbols. Finally, as outlined earlier, the GloVe model is used to convert sentences into feature vectors.
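A NumPy equivalent of Eq. (12), applied per channel, might look as follows; the per-channel axis choice and the small epsilon (added to avoid division by zero on flat channels) are assumptions.

```python
import numpy as np

def minmax_normalize(x, axis=-1):
    """Map each channel of x to [0, 1] as in Eq. (12)."""
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    return (x - x_min) / (x_max - x_min + 1e-12)

eeg_norm = minmax_normalize(np.random.randn(62, 800))   # 62 channels, 800 samples (synthetic)
```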

Figure 6

Structure of the multimodal emotion recognition model for EEG and text data.

Dataset preprocessing

The SEED dataset encompasses three distinct emotion categories: positive, negative, and neutral19. This dataset was collected by recording EEG data from 15 participants as they watched stimulus videos. Each participant engaged in three separate experiments, spaced approximately one week apart. Each experiment comprised 15 film clips, with the participants asked to provide feedback immediately after each experiment. The dataset records EEG data from 62 signal channels at an original sampling frequency of 1000 Hz, subsequently downsampled to 200 Hz. In our experiments, we divided the raw EEG data into five frequency bands and extracted features such as PSD and differential entropy (DE) from each band. To further explore the signal characteristics, we also extracted eight time-domain features per channel, namely the mean, peak, standard deviation, skewness, kurtosis, maximum, minimum, and zero-crossing rate, and applied an STFT to the original EEG signals to generate 2D time-frequency spectra. The dimensions of the time-domain feature matrix are \(62 \times 8\). Combining the results of these analyses, we obtained a comprehensive feature set in which the PSD and DE features are represented as matrices with dimensions of \(62 \times 249 \times 106\), while the dimensions of the time-frequency domain feature matrix are \(62 \times 257 \times 206\).

The SEED-IV dataset comprises four distinct emotion categories: happiness, sadness, fear, and neutral emotions30. The experiment involved 15 participants, each of whom completed three experiments consisting of 24 video clips, with 6 clips representing each emotion type. Following each experiment, participants provided feedback, and their EEG signals from 62 channels were recorded. The EEG signals were divided into non-overlapping 4-second segments and downsampled to a sampling rate of 128 Hz. In our implementation, we employed the same data processing method as for the SEED dataset. The dimensions of the time-domain signal feature matrix are \(62 \times 8\), the DE feature matrix is \(62 \times 250 \times 106\), the PSD feature matrix is \(62 \times 249 \times 106\), and the STFT feature matrix is \(62 \times 257 \times 206\).

The ZuCo dataset comprises natural reading tasks divided into three categories: sentiment analysis, natural reading, and task-specific reading31. In the sentiment analysis dataset, participants were presented with 400 sentences from the Stanford sentiment tree database, covering positive, neutral, and negative sentiment markers32. EEG signals were recorded during participants’ natural reading tasks, capturing both sentence-level and word-level EEG features. Word-level EEG features were recorded based on the first fixation duration of each word, capturing the EEG signal when a subject’s eyes first encountered the word. For both word-level and sentence-level EEG signals, data were recorded in eight frequency bands at a sampling frequency of 500 Hz, including theta1 (4-6 Hz), theta2 (6.5-8 Hz), alpha1 (8.5-10 Hz), alpha2 (10.5-13 Hz), beta1 (13.5-18 Hz), beta2 (18.5-30 Hz), gamma1 (30.5-40 Hz), and gamma2 (40-49.5 Hz). In our study, the same data processing methods were applied to both sentence-level and word-level EEG signals. Furthermore, we extracted eight time-domain features from these eight frequency bands, including mean, peak, standard deviation, skewness, kurtosis, maximum, minimum, and zero crossing rate. The eight frequency bands were standardized and used as frequency domain features. Additionally, we utilized STFT to generate time-frequency domain features. As a result, the combined time-domain and frequency-domain features constituted a matrix with dimension \(104 \times 8\). The overall dimensionality of the resulting time-frequency domain features matrix is \(104 \times 257 \times 56\).

Data augmentation

When faced with data constraints, augmentation methods are necessary to mitigate the adverse effects of data scarcity33. In this section, we will apply data augmentation techniques to the dataset. The text data, as mentioned in the previous section, is derived from another emotional dataset. To minimize data errors, we will focus on using three different data augmentation techniques specifically for the EEG signals in the three datasets: rotation, flipping, and the application of Gaussian noise. The augmentation process involves expanding the dataset by introducing a scaling factor, denoted as N. Figure 4 illustrates the EEG data augmentation process for an EEG signal from the SEED dataset [see Fig. 4(a)], which consists of ten EEG channels.

Table 2 Experiments on EEG emotion recognition using SEED and SEED-IV datasets.

Rotation of EEG signal Rotation is a common signal enhancement operation that is widely used in image processing and digital signal processing. In image processing, the direction or angle of an image can be changed by rotating it, which is especially useful for correcting skewed images or aligning images. In digital signal processing, rotation can be used for phase adjustment or signal correction to ensure that the signal is aligned with the desired direction or frequency. By rotating an EEG signal, represented as (I, Q), around the origin, we obtain enhanced signal samples (\(I^{'}\), \(Q^{'}\)) using Eq. (13):

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} =\begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} I \\ Q \end{bmatrix}, \end{aligned}$$
(13)

where \(\theta\) is the angle of rotation. In this study, we rotate the EEG signal 180 degrees around the center position. This is partly because the 180-degree rotation is a safe method for data enhancement, which does not distort the signal content or cause loss of information. Moreover, it may be more resistant to some environmental disturbances and noise, thereby allowing the model to cope better with signals from different directions. The plot in Fig. 4(b) suggests that this type of rotation can augment the data without causing any loss of data.

Flip of EEG signal Flipping is another signal enhancement technique that is extensively used in image processing. Common flipping operations include horizontal flipping and vertical flipping, which we describe below. To flip an EEG signal (I, Q) horizontally, the I component of the signal is negated while the Q component is left unchanged; vertical flipping negates the Q component while leaving the I component unchanged. These are defined respectively as

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} -I \\ Q \end{bmatrix} \text {and} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} I \\ -Q \end{bmatrix}. \end{aligned}$$
(14)

Horizontal flipping is more suitable for most applications; however, when the signal source is symmetric, a horizontally flipped signal does not differ from the original. Vertical flipping, by contrast, produces effective data augmentation and introduces greater variation, allowing the model to learn more features and improving its performance and robustness. We therefore adopted vertical flipping for our data augmentation; as shown in Fig. 4(c), the diversity of the data processed with this operation increases.

Gaussian Noise in EEG signals Gaussian noise is a common type of noise used to simulate various natural noises, such as thermal noise in electronic components or random noise in the environment. In signal enhancement, Gaussian noise can be used to assess the robustness or noise immunity of signal processing algorithms. A common method for dealing with Gaussian noise is to apply filters or denoising algorithms to minimize its effect on the signal and thereby improve signal quality. By applying Gaussian noise \(\mathcal {N}(0, \sigma ^2)\) to the modulated EEG signal (I, Q), we obtain enhanced signal samples \((I^{'}, Q^{'})\) using Eq. (15):

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} I \\ Q \end{bmatrix} + \mathcal {N}(0, \sigma ^2). \end{aligned}$$
(15)

Meanwhile, since the probability density function of Gaussian noise follows a normal distribution, we represent the noise by a zero-mean Gaussian distribution, where \(\sigma ^2\) represents its variance.

Using Gaussian noise to augment EEG signals enables the model to learn robustness to interference during training while improving its ability to generalize. Additionally, the parameters of the Gaussian noise can be adjusted to control its intensity and keep its effect on the original data within a controllable range. In this study, we set \(\sigma\) to 0.2 because, at this value, the impact on the data is minimal. As seen in Fig. 4(d), substantial variation is noticeable in the data situated at the center, and the magnitude of the change diminishes toward the edges. Moreover, these changes adhere to a Gaussian distribution.
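Putting the three augmentation operations together, the sketch below applies Eq. (13) with \(\theta = 180^{\circ }\), the vertical flip of Eq. (14), and the additive noise of Eq. (15) to a hypothetical (I, Q) signal pair; the array layout is an assumption for illustration.

```python
import numpy as np

def rotate_180(iq):
    """Eq. (13) with theta = 180 degrees: rotate the (I, Q) pair around the origin."""
    theta = np.pi
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ iq

def flip_vertical(iq):
    """Eq. (14), vertical flip: negate the Q component only."""
    out = iq.copy()
    out[1] = -out[1]
    return out

def add_gaussian_noise(iq, sigma=0.2):
    """Eq. (15): additive zero-mean Gaussian noise with the sigma used in this study."""
    return iq + np.random.normal(0.0, sigma, size=iq.shape)

iq = np.random.randn(2, 800)                 # rows: I and Q components of one channel
augmented = [rotate_180(iq), flip_vertical(iq), add_gaussian_noise(iq)]
```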

Experiments on EEG emotion recognition

As outlined in the SEED and SEED-IV data processing description in Section 4.1, we extracted four distinct features from these two datasets to enhance our analysis. The network flowchart using time-domain (TDF), PSD, and STFT features is presented in Fig. 5, showcasing how each feature contributes to the overall framework for emotion recognition.

Figure 7

Bar chart illustration of ablation experiments on the feature fusion and extraction modules at the sentence level in the ZuCo dataset.

To establish the veracity of the experimental results, in the SEED dataset we randomly selected 12 experimental data groups as the training set, while the remaining 3 groups were designated as the test set, i.e., a 12:3 split. Meanwhile, to assess the impact of distinctive features on the model, we first combined TDF, PSD, DE, and STFT in pairs to obtain four sets of experimental results. Second, three experiments, TDF+PSD+STFT, TDF+DE+STFT, and PSD+DE+STFT, were undertaken, with their results reported in Table 2. Overall, we observed significant differences in performance, with the PSD+DE+STFT method recording the best performance across all reported metrics (highlighted in bold). This method produced scores of 91.50% for Acc, 91.62% for Rec, 91.49% for Pre, and 91.70% for F1. The TDF+DE+STFT method also recorded a strong performance, with an Acc of 89.01%, Rec of 89.00%, Pre of 89.07%, and an F1 of 89.03%. While the PSD+STFT and DE+STFT combinations recorded satisfactory results, they exhibited slightly lower Acc values of 87.00% and 88.50%, respectively. Notably, the TDF+PSD combination recorded the least favourable results, with an Acc of 81.25%. A further analysis of the results in Table 2 indicates that combinations involving STFT performed better in the experiments. This is attributed to STFT's ability to capture both time- and frequency-domain information in EEG signals while effectively filtering out random noise from the original signals, thereby emphasizing periodic components. Additionally, we observed that using time-domain statistical features often led to a decrease in experimental results. As such, our experimental results confirmed that combinations including DE features are more effective in enhancing performance than those including PSD features.

In the SEED-IV dataset experiments, we arbitrarily chose 20 experimental data groups as the training set and the remaining 4 as the test set, i.e., a 20:4 split. At the same time, we conducted extensive research using the multidimensional features of the EEG data, pairing these critical domains to promote the fusion of complementary features. For this purpose, we followed the same experimental approach as above, combining the four distinctive features in pairs and in three-feature scenarios. The results of these experiments are presented in Table 2, which reveals the significant impact of the different data processing methods on performance. The PSD+DE+STFT combination excels in all performance indicators, achieving the highest accuracy, recall, precision, and F1-score (91.50%, 91.62%, 91.49%, and 91.70%, respectively). The TDF+DE+STFT combination also performs well, recording scores of 89.01% for Acc, 89.00% for Rec, 89.07% for Pre, and 89.03% for F1. Conversely, TDF+PSD exhibits the weakest performance across all indicators, with an Acc of only 80.25%. In contrast to the SEED experiments, the results of the three-feature combinations on the SEED-IV dataset are significantly superior to those of the pairwise combinations. Additionally, the results indicate that STFT-based combinations are generally better than other combinations, while DE-based combinations outperform PSD-based ones. This again validates the conclusions drawn from the SEED dataset.

In addition, given the extensive research on the SEED and SEED-IV datasets, we selected several different network architectures for comparative analysis. To ensure the credibility of the experimental results, we separated the test and training sets in the same way as other studies and conducted subject-independent experiments. Specifically, we employed a leave-one-subject-out cross-validation procedure to evaluate the performance of the model: in each round, the EEG signals of one participant are selected as the test set and those of the other 14 participants as the training set. Fifteen rounds of experiments were conducted, each with a different test subject, and the average results were recorded as the final results. Considering the similarity among the outcomes of the TDF+PSD+STFT, TDF+DE+STFT, and PSD+DE+STFT experiments, we compared these three groups of results with those reported in similar recent studies, as presented in Table 3 for the SEED and SEED-IV datasets. We discuss these results below.
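For reference, the leave-one-subject-out protocol described above can be expressed as the following sketch; the per-subject data containers and the training routine are hypothetical placeholders.

```python
import numpy as np

def leave_one_subject_out(subject_data, train_and_eval):
    """subject_data: dict subject_id -> (features, labels). Returns mean accuracy and STD over folds."""
    accs = []
    for test_subj in subject_data:
        test_set = subject_data[test_subj]
        train_set = [subject_data[s] for s in subject_data if s != test_subj]
        accs.append(train_and_eval(train_set, test_set))   # user-supplied train/evaluate routine
    return np.mean(accs), np.std(accs)                     # average accuracy and STD across 15 folds
```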

Table 3 A comparison of the proposed scheme and related works applied to the SEED dataset and SEED-IV dataset.
Table 4 Experiments on multimodal emotion recognition using ZuCo dataset.

The following is a brief overview of the comparative models. As shown in Table 3, the PR-PL method45 achieved the highest accuracy among the baselines on both the SEED and SEED-IV datasets, with accuracy rates of 85.56% and 74.92%, respectively, indicating strong performance. Similarly, our PSD+DE+STFT method demonstrated impressive accuracy, recording 85.85% on the SEED dataset and a solid 73.50% on the SEED-IV dataset. In contrast, the RGNN39 and Bi-HDM41 methods also showed high accuracy on the SEED dataset, with scores of 85.30% and 85.40%, respectively, but their performance was noticeably lower on the SEED-IV dataset. The SVM35 and DGCNN37 methods underperformed on the SEED-IV dataset, with accuracies of 56.61% and 52.82%, respectively. The TDF+PSD+STFT combination yielded an accuracy of 79.03% on the SEED dataset and 69.04% on the SEED-IV dataset, which is less competitive compared with current research. In contrast, the PSD+DE+STFT experiment outperformed the DGCNN model37 by 5.9% on the SEED dataset and surpassed the leading PR-PL model by 0.3% in the comparison experiment. Furthermore, on the SEED-IV dataset, our PSD+DE+STFT results were only about 1% lower than those of the top PR-PL model45. Through extensive experimentation, we found that when extracting features from three different domains, concatenating all five frequency bands may introduce redundancy in the final feature set. Additionally, the CNN feature extraction network tends to focus on global features, which can lead to the retention of redundant features and may adversely affect subsequent feature fusion and emotion recognition.

Experiments on multimodal emotion recognition

To validate the generalization capability of our model, we applied it to multimodal emotion recognition tasks involving EEG and text. As discussed in Section 2.2, combining EEG and text data for emotion recognition offers a more comprehensive understanding of emotional states, enhances recognition performance, and allows for a deeper exploration of the mechanisms underlying emotions. The multimodal emotion recognition model is illustrated in Fig. 6. To establish the performance of our model in multimodal emotion recognition, we used the publicly available multimodal dataset ZuCo, which provides EEG information for each sentence and each word. Using it, we designed two different fusion methods: a multimodal emotion recognition model based on fusing words with word-level EEG, and one based on fusing sentences with sentence-level EEG.

Since emotional labels can only be expressed at the word level, we proceed as follows: first, we concatenate the word-level EEG features and word vectors according to the order of the words in the sentence; second, we feed these concatenated features into the multi-feature fusion module; and finally, we use a classifier for emotion recognition. We strictly followed the feature extraction procedures provided with the dataset for both the EEG and the word information. For comparison with other contemporary studies, we randomly shuffled the data and divided it into 80% for training and 20% for testing. We denote the word features as Text, and use TDF, FDF, and TFDF to represent the EEG time-domain, frequency-domain, and time-frequency-domain features, respectively. We conducted seven experiments: TDF+Text, FDF+Text, TFDF+Text, TDF+FDF+Text, TDF+TFDF+Text, FDF+TFDF+Text, and TDF+FDF+TFDF+Text, whose details are presented in Table 4.
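The following sketch illustrates this word-level fusion step under stated assumptions: word-level EEG features and word embeddings are concatenated in reading order, fused, pooled into a sentence representation, and classified. The `FusionHead` name, layer sizes, and three-class output are illustrative choices, not the paper's exact architecture.

```python
# Hedged sketch of word-level EEG + text fusion followed by classification.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, eeg_dim=105, word_dim=300, hidden=256, n_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(eeg_dim + word_dim, hidden), nn.ReLU(), nn.Dropout(0.5))
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, eeg_feats, word_vecs):
        # eeg_feats: (batch, n_words, eeg_dim); word_vecs: (batch, n_words, word_dim)
        x = torch.cat([eeg_feats, word_vecs], dim=-1)   # concatenate per word, in order
        x = self.fuse(x)                                # per-word fused features
        x = x.mean(dim=1)                               # pool words into a sentence vector
        return self.classifier(x)

# Toy usage with random tensors standing in for ZuCo features.
model = FusionHead()
eeg = torch.randn(8, 20, 105)    # 8 sentences, 20 words, word-level EEG features
words = torch.randn(8, 20, 300)  # matching word embeddings
logits = model(eeg, words)       # (8, n_classes)
```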

Table 5 Ablation experiment using SEED dataset.

We can see that the TDF+TFDF+Text combination achieves outstanding results in both accuracy (94.32%) and F1-score (94.22%), evidence of its effectiveness in this classification task. The TDF+FDF+Text combination also performed well, with an accuracy of 94.03%, recall of 93.78%, precision of 93.54%, and F1-score of 93.04%. Meanwhile, the RNN-multimodal and CNN-multimodal models showed high recall but underperformed on the other indicators. Given the scarcity of recent studies on this task, we compare our results with those of Hollenstein et al.46, who report recall, precision, and F1-score as evaluation indicators. Our TDF+TFDF+Text combination outperformed the RNN-multimodal baseline by roughly 20% on these indicators and surpassed the other reported results by about 22%. We also observed that TDF+TFDF+Text achieved a higher accuracy (94.32%), while the accuracy of TDF+FDF+TFDF+Text was relatively lower. From these outcomes, we surmise that during the final feature fusion stage, combining too many sources of information introduced a degree of redundancy, leading to a decrease in accuracy.

Similar to the word-level fusion, we also conducted experiments on sentence-level fusion. We feed pre-processed sentences and EEG information directly into the model, enabling it to autonomously extract features and learn from the data. The sentence features are denoted as Text, while the EEG time-domain, frequency-domain, and time-frequency-domain features are represented as TDF, FDF, and TFDF, respectively. We conducted seven experiments with different feature combinations: TDF+Text, FDF+Text, TFDF+Text, TDF+FDF+Text, TDF+TFDF+Text, FDF+TFDF+Text, and TDF+FDF+TFDF+Text; the results of these experiments are presented in Table 4. Given the limited research on sentence-level EEG and multimodal emotion recognition, we included several commonly used classifiers (MLP47, ResNet5048, Transformer49) for comparative experiments. To enhance the reliability of the experiments, we shuffled all the data randomly, allocating 80% for training and the remaining 20% for testing.
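A minimal sketch of how one such feature-combination experiment could be assembled is shown below, assuming placeholder feature matrices; the concatenation of the selected blocks and the shuffled 80/20 split mirror the setup described above, while all array names and shapes are illustrative.

```python
# Sketch: assemble a feature combination and draw a shuffled 80/20 split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sentences = 400
features = {
    "TDF":  rng.normal(size=(n_sentences, 64)),   # time-domain EEG features
    "FDF":  rng.normal(size=(n_sentences, 64)),   # frequency-domain EEG features
    "TFDF": rng.normal(size=(n_sentences, 64)),   # time-frequency EEG features
    "Text": rng.normal(size=(n_sentences, 300)),  # sentence embeddings
}
labels = rng.integers(0, 3, size=n_sentences)

def make_combination(names):
    """Concatenate the selected feature blocks column-wise."""
    return np.concatenate([features[n] for n in names], axis=1)

X = make_combination(["TDF", "FDF", "TFDF", "Text"])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, shuffle=True, random_state=42)
```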

We can conclude from Table 4 that the combination of TDF+FDF+TFDF+Text excels in terms of accuracy, recall, precision, and F1-score (highlighted in bold). It records an accuracy of 96.95% and an impressive F1-score of 97.01%, making it the best-performing combination. Following closely, the TDF+Text combination delivers strong accuracy at 95.64% but a slightly lower F1-score. The ResNet50 (TDF+FDF+TFDF+Text) and Transformer (TDF+FDF+TFDF+Text) baselines reach F1-scores of 52.88% and 77.38%, respectively, but with lower accuracy, while MLP (TDF+FDF+TFDF+Text) performs lower still, with diminished accuracy and F1-score. Consequently, Table 4 demonstrates that our model offers a significant advantage over the MLP, ResNet50, and Transformer baselines across all four evaluation indicators; notably, our results consistently fall in the range of 96% to 97%.

These experimental results underscore the generality of our model. Another important advantage becomes apparent when the model is applied to different datasets, particularly the ZuCo dataset; this advantage was, however, limited on the SEED-IV dataset, which consists of EEG data alone. The foregoing results and analysis suggest that the choice of distinctive features has a substantial impact on the final results. Additionally, the use of nearly identical feature extraction networks for the distinct features contributes to the relatively minor differences among the final features.

Finally, it is essential to highlight what distinguishes our approach from other emotion recognition methods. Our model accepts the raw signal as input and extracts various signal features through its dedicated feature extraction module. In addition, we incorporate attention mechanisms to accentuate differences among data with different emotional labels and to minimize disparities within data sharing the same emotional label. This design helps the model learn emotional distinctions across different individuals, which leads to improved recognition accuracy.
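As one plausible realization of the attention re-weighting described here, the sketch below uses a squeeze-and-excitation style channel-attention block; the proposed model's actual attention design may differ, and all dimensions are illustrative.

```python
# Hedged sketch of a channel-attention block that re-weights CNN feature maps.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # squeeze each channel to a scalar
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, channels, length) feature maps from a CNN branch
        w = self.pool(x).squeeze(-1)                 # (batch, channels)
        w = self.fc(w).unsqueeze(-1)                 # per-channel weights in (0, 1)
        return x * w                                 # emphasize informative channels

feats = torch.randn(8, 32, 128)
attended = ChannelAttention(32)(feats)               # same shape, re-weighted
```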

Ablation experiments

As outlined in previous sections, our model consists of two key modules: a feature extraction module and a feature fusion module. To assess the contribution of each module to the model's performance, we conducted a series of ablation experiments. Because the feature extraction module integrates all the obtained features into a matrix of unified size, we replace it with a three-layer CNN when ablating it. To ablate the feature fusion module, we replace it with the commonly used concatenation-based fusion of the obtained features. This section reports the ablation of the feature fusion module on the EEG dataset (SEED), as presented in Table 5. The ablation experiments for the feature extraction module and the feature fusion module on the ZuCo dataset are presented in Tables 6 and 7.
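For concreteness, the sketch below shows the two ablation stand-ins described above under assumed layer sizes: a plain three-layer 1-D CNN in place of the feature extraction module, and simple concatenation in place of the feature fusion module.

```python
# Sketch of the two ablation baselines; layer sizes and shapes are illustrative.
import torch
import torch.nn as nn

class ThreeLayerCNN(nn.Module):
    """Ablation stand-in for the feature extraction module."""
    def __init__(self, in_channels, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))

    def forward(self, x):                 # x: (batch, channels, time)
        return self.net(x).squeeze(-1)    # (batch, out_dim)

def concat_fusion(feature_list):
    """Ablation stand-in for the feature fusion module: plain concatenation."""
    return torch.cat(feature_list, dim=-1)

tdf = torch.randn(8, 62, 200)                         # e.g. 62-channel time-domain input
feat = ThreeLayerCNN(62)(tdf)                          # (8, 128)
fused = concat_fusion([feat, torch.randn(8, 128)])     # (8, 256)
```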

The experimental results presented in Table 5 were obtained using the same experimental procedure as in Table 2. They reveal the significant contribution of the feature fusion module to our results. Notably, the feature fusion module had the most pronounced impact on combinations involving DE, yielding improvements of up to 30%. The PSD+DE experiments showed enhancements of at least 27%, and even the PSD+STFT experiment improved by a remarkable 30% when the feature fusion module was applied.

Table 6 Ablation experiment on the feature fusion module at the sentence level in ZuCo.

Comparing the experimental results in Tables 6 and 7 with those in Table 4 reveals interesting insights, as depicted in Fig. 7. The FDF+Text and TDF+Text combinations yield substantial improvements when the feature extraction module and the feature fusion module are employed, respectively. For most other combinations, the results obtained using only the feature fusion module outperform those obtained using only the feature extraction module. Notably, the TDF+FDF+TFDF+Text combination did not exhibit a significant performance improvement in either table. We attribute this to the TFDF features, which encompass both time-domain and frequency-domain characteristics and thus lend a certain stability to the results; this combination also appears to provide sufficient information for the model to learn from, diminishing the need to refine key information with the feature extraction and feature fusion modules. Furthermore, we conducted ablation experiments that removed both the feature extraction and the feature fusion modules on the ZuCo dataset; the results of all seven experiments fell in the 30% to 40% range, suggesting that these results were caused by overfitting. We therefore conclude that the feature extraction and feature fusion modules provide significant advantages in emotion recognition tasks.

Table 7 Ablation experiment on the feature extraction module at the sentence level in ZuCo.

Conclusion

In this study, we introduced an intuitive approach to improving the performance of emotion recognition tasks. Our method leverages feature fusion and attention mechanisms to enhance multi-branch CNN models. In particular, we developed a network capable of processing raw EEG signals, autonomously extracting features, training itself, and performing emotion recognition. Our extensive experiments illustrate the distinct advantages of the proposed model in applications that require EEG-based emotion recognition. Moreover, we demonstrated the versatility of the model by conducting experiments on multimodal emotion recognition that incorporates EEG and text data. Through ablation experiments, we underscored the substantial value of the proposed feature extraction and feature fusion modules in improving recognition accuracy.

Our study follows the data distributions commonly employed in EEG-based emotion recognition studies34,35. In real-world applications of social emotion recognition, however, more intricate and challenging scenarios are encountered. In this study, our model was validated through rigorous experiments encompassing EEG emotion recognition as well as multimodal emotion recognition combining EEG and text data, and these findings provide a solid foundation for our approach. Looking ahead, we plan to extend the model to diverse classification tasks, including emotion recognition based on images and speech, and multi-object detection. To further improve classification performance, we will draw on the ideas in the literature50 and combine the proposed model with a gated recurrent unit. The research on the unified topic semantic model for the semantic relevance of geographic terms51 likewise offers important insights into the generalization ability of our model in complex tasks. We are also mindful of several current limitations. First, the feature extraction and feature fusion modules process data slowly; we will develop frameworks to optimize the temporal and spatial complexity of these modules and thereby enhance the model's capacity for emotion recognition. Second, EEG signals are highly individual: different people may exhibit varying intensities and patterns in their EEG responses, leading to potential deviations in recognition results. To address this challenge, we plan to explore personalized models that adapt to the unique characteristics of each individual's EEG data, thereby improving the robustness of our emotion recognition system. Third, we aim to explore additional multimodal approaches. Incorporating diverse data sources, such as video and audio, into our framework can capture a richer context for emotion recognition; this may involve advanced signal processing and machine learning techniques to better integrate these modalities, ultimately enhancing the accuracy and reliability of our predictions. These directions, together with concomitant developments in technology, will help meet the escalating demands for emotion recognition in real-world societal contexts.