Introduction

Emotions are important aspects of our daily lives. They are essential in interpersonal communication, cognitive processing, and decision-making, and they reflect personal psychological and physiological conditions1. Emotional responses occur at two levels: external expression and internal change. Facial expressions, gestures, and language are the main examples of external expression2. In contrast, physiological indicators, such as the electroencephalogram (EEG) and magnetoencephalogram (MEG), together with skin resistance, heart rate, blood pressure, and respiratory rate, reflect internal changes. Although emotional reactions are essentially subjective experiences, external patterns exhibit a significant degree of complexity and ambiguity. Presently, EEG signals are widely used in clinical research, especially when emotions are involved, because they are non-invasive, affordable, and unaffected by language and cultural differences3. The application of EEG in emotion recognition research mainly involves five processes: EEG data collection, preprocessing, feature extraction, feature optimization/selection, and emotion classification.

EEG signals are commonly divided into frequency bands, including the widely known \(\delta\)-wave (1-4 Hz), \(\theta\)-wave (4-7 Hz), \(\alpha\)-wave (8-13 Hz), \(\beta\)-wave (13-30 Hz), and \(\gamma\)-wave (31-50 Hz)4. To effectively retain the key information embedded in each frequency band, the current study operates on these five bands, while feature extraction focuses on three key domains: the time, frequency, and time-frequency domains5. Time-domain features (TDF) capture the temporal characteristics of EEG signals; three commonly used examples are statistical features, Hjorth parameters6, and fractal dimension features7. Frequency-domain features (FDF) describe changes in EEG frequency content, with power spectral density (PSD)8 and differential entropy9 the most commonly used. Finally, time-frequency domain features (TFDF) integrate information from both axes, providing a comprehensive description of the signal dynamics over time and frequency. The main time-frequency representations include the short-time Fourier transform (STFT)10, the continuous wavelet transform11, and the wavelet transform12. Overall, these feature extraction methods complement one another, serving as robust tools for in-depth research on EEG signals. Considered collectively, they not only illuminate the fundamental characteristics of brain activity but also provide significant support for effectively capturing and extracting EEG features in emotion recognition tasks.
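For illustration, the short sketch below shows how such a band decomposition could be performed with zero-phase Butterworth band-pass filters; the array shape, sampling rate, and filter order are hypothetical assumptions, and this is a minimal example rather than the exact pipeline used in this work.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Canonical EEG bands (Hz); boundaries follow the ranges cited in the text.
BANDS = {"delta": (1, 4), "theta": (4, 7), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (31, 50)}

def split_into_bands(eeg, fs, order=4):
    """Return a dict of band name -> band-passed signal (channels x samples)."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering
    return out

# Example with synthetic data: 62 channels, 4 s at 200 Hz.
eeg = np.random.randn(62, 800)
bands = split_into_bands(eeg, fs=200)
```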

Figure 1

Architecture of the proposed MCNN-CA model. The input data first pass through the feature extraction module, then the feature fusion module, and finally an emotion recognition module consisting of fully connected layers produces the results.

However, EEG features are often high-dimensional and sparse, and using them directly may decrease emotion recognition performance while increasing computational costs. Therefore, it is crucial to filter or optimize these features after extraction. Despite significant progress in this field, extracting richer and more relevant information remains vital for enhancing overall system effectiveness. Additionally, while deep neural network (DNN) models are widely employed in current emotion recognition techniques, they still encounter challenges in EEG-based emotion classification tasks. Specifically, issues such as neglecting the correlations between EEG channels and the incompleteness or unreliability of the extracted features directly affect the accuracy and robustness of emotion recognition systems.

In this study, we introduce a multi-branch convolutional neural network model utilizing a cross-attention mechanism (MCNN-CA) to achieve accurate recognition of various emotions. The proposed model is designed to proficiently extract key information from the different feature dimensions inherent to EEG signals and then seamlessly merge them through cross-feature fusion. Compared with existing methods documented in contemporary literature, our model goes beyond traditional paradigms by capturing complex, subtle differences across feature dimensions and dynamically fine-tuning the weighting coefficients of different high-dimensional information flows to highlight the cues required for emotional state recognition. The main highlights of our work are as follows:

  • To effectively learn high-dimensional signal features, our model employs multiple convolutional neural network modules to extract features across varying dimensions. Each module incorporates a parametric rectified linear unit (PReLU) activation function and a batch normalization (BN) layer to expedite high-dimensional information extraction and mitigate the vanishing-gradient issue during training. This architecture not only enriches the information available for subsequent feature fusion but also lays the groundwork for enhancing overall system performance.

  • In developing the feature fusion module, we integrate an efficient channel attention mechanism into the multi-feature fusion process of the proposed model. This module effectively prioritizes task-relevant information, significantly reducing focus on ancillary data and selectively eliminating incongruent information. This strategic approach ameliorates the potential dilemma of information overload associated with the availability of diverse features, thus substantially enhancing the effectiveness and precision of the subsequent classifier module in performing classification tasks.

  • In this study, we constructed a comprehensive end-to-end emotion recognition model that processes raw signals as inputs and delivers corresponding emotion labels for prediction. In the realm of EEG emotion recognition, we undertook a series of experiments to validate our model’s performance. Moreover, we reported extensive evaluations using a multimodal dataset that includes both EEG and text inputs. The results of our experiments clearly demonstrate the substantive advantages of our proposed model in terms of accuracy, recall, and F1-score in multimodal emotion recognition tasks.

Details of the enumerated contributions of our proposed model are presented in the remainder of this work. Specifically, Section 2 provides a literature review of EEG-based emotion recognition techniques. Section 3 elucidates the distinct modules that constitute the architecture of our model. Section 4 describes the preprocessing operations and data augmentation methods applied to the datasets. Finally, Section 5 presents the results of experiments on three datasets and compares the outcomes with those of contemporary studies.

Literature review on EEG emotion recognition

As mentioned earlier, EEG is vital for emotion recognition tasks due to its unique characteristics and adaptability. Refining the features extracted from EEG signals or supplementing them with patterns from other sources is essential for improving the accuracy of subsequent emotion recognition networks. In this section, we provide a detailed overview of research focused on emotion classification using EEG, as well as studies involving EEG in emotion classification within the context of multimodal fusion.

EEG emotion recognition

In EEG-based emotion recognition, feature extraction is a vital step. This process focuses on identifying and transforming key features or relevant information from EEG signals into interpretable representations. These features reflect the physiological and anatomical characteristics of underlying brain activity, but managing the large volume of data requires substantial computational resources and sophisticated algorithms. To tackle this complexity and enable meaningful interpretation, feature extraction techniques are essential for simplifying and optimizing the signal representation.

One widely used feature extraction technique is the Hilbert-Huang transform (HHT), which is particularly effective for analyzing nonlinear and non-stationary signals, a common characteristic of EEG data13. HHT decomposes the signal into intrinsic mode functions and calculates instantaneous frequency and power, preserving important temporal information from peak time-frequency analysis while maintaining linear properties. This makes HHT a valuable tool for EEG feature extraction. Another commonly utilized technique is principal component analysis (PCA), which simplifies large datasets by identifying patterns and emphasizing similarities and differences within the data14. PCA works by rotating the dataset to find the directions (principal components) where the variance is maximized, effectively reducing the dimensionality of the data while retaining most of its informational content. This method not only helps to reduce memory usage and computational demands but also has become a popular choice in EEG feature extraction.
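As a minimal illustration of the PCA step, the following sketch reduces flattened EEG feature vectors while retaining most of the variance; the feature matrix here is a hypothetical placeholder, not the data of the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 300 trials, each described by 62 channels x 40 features.
X = np.random.randn(300, 62 * 40)

# Keep the principal components that explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```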

Various studies have recently focused on filtering or optimizing features extracted from EEG signals. For instance, Valenza et al.15 applied PCA to reduce the dimensionality of high-dimensional data while minimizing information loss. They combined PCA with a quadratic discriminant classifier to enhance recognition rates, reporting accuracies of 92.29% for arousal and 90.36% for valence. Yang et al.6 developed a cross-subject emotion recognition method that integrates sequential backward selection with significance testing and a support vector machine (SVM) model, effectively improving EEG emotion recognition accuracy while reducing computational overhead. Furthermore, Wu et al.16 utilized the Laplacian matrix to convert functional connectivity features, such as phase locking value, Pearson correlation coefficient, spectral coherence, and mutual information, into positive semi-definite form. They applied a max operator to ensure the positive definiteness of the transformed features and employed a symmetric positive definite (SPD) network to extract deep spatial information, which was validated through fully connected layers. A decision-level fusion strategy was then employed to achieve more accurate and stable recognition results.

On the other hand, classifiers can generally be categorized into machine learning (ML) classifiers and deep learning classifiers. SVM remains one of the most widely used ML classifiers. For instance, Tuncer et al.17 employed a tunable Q-factor wavelet transform based on a fractal pattern feature extraction method to generate multi-level features, achieving a remarkable 99.82% accuracy using the SVM algorithm. Similarly, Algumaei et al.18 reported higher accuracy compared to SVM on the SEED dataset (i.e., the SJTU emotion EEG dataset19) by using linear discriminant analysis (LDA) as the classifier. Unlike traditional ML methods, deep neural networks (DNNs) can capture underlying data structures and extract deeper feature representations through training, thus simplifying the feature engineering process that is typically complex in ML. As a result, many recent studies have adopted deep learning classifiers for emotion recognition tasks. For example, Tao et al.20 introduced a channel attention mechanism to assign adaptive weights to different EEG channels, coupled with an extended self-attention mechanism to explore the temporal dependencies in EEG signals. Similarly, Jia et al.21 proposed a two-stream network with an attention mechanism that adaptively focuses on important patterns. Their approach highlights the significance of integrating information across different regions and adaptively learning key information, demonstrating the importance of constructing a unified EEG emotion recognition framework. Given that information from different domains is based on distinct features, understanding and leveraging the interrelationships between these unique features is essential.

Existing studies primarily concentrate on extracting and selecting the optimal feature sets from raw EEG data. However, the feature extraction process may inadvertently lead to the loss of valuable information, which can hinder the model’s ability to learn critical aspects of the data. In this study, we aim to address this challenge by automatically extracting features from EEG data across multiple domains, specifically the time-domain, frequency-domain, and time-frequency domain, using advanced neural network architectures. Furthermore, we employ the efficient channel attention mechanism and integrate it into the multiple features fusion procedure of the proposed model. This approach not only enhances the interpretability of the data but also significantly improves the accuracy and performance of emotion recognition tasks.

Multimodal emotion recognition

Humans typically convey emotions through diverse means such as language, facial expressions, and body movements22. As a result, approaches that integrate multiple features have gained considerable traction in emotion recognition studies. Feature fusion amalgamates information from various sources to produce more informative representations, thereby bridging the gap between independent features. Researchers are now developing multimodal emotion recognition systems by merging physiological signals, notably EEG, with other perceptual modalities including speech, images, and text data. Similarly, in the realm of natural language processing, unimodal methods have certain limitations, particularly in their ability to capture the complexity of human language. To overcome these, Linzen et al.23 advocated for actively exploring the potential of multimodal data to accelerate computers’ understanding and generalization of natural language. Leveraging physiological signals is particularly intriguing for simulating human-like language learning phenomena.

In the midst of these studies, Hulliyah et al.24 designed an innovative random forest classifier that analyzes the emotional tones evoked by EEG signals, enabling effective classification of Twitter comments. Their model demonstrated excellent performance in a four-class emotion recognition task, with experimental results validating its particular effectiveness in accurately identifying anger emotions. Meanwhile, Gupta et al.25 presented a comprehensive end-to-end system that not only identifies emotional information in EEG data but also incorporates it into text to enrich emotional expression. The experimental results highlighted promising outcomes for the text enhancement task and showcased the overall robustness of the end-to-end system. Furthermore, Wang et al.26 introduced a sequence-to-sequence decoding method that enhances text sequences using a pre-trained language model, seamlessly fusing them with EEG features. This approach achieved remarkable results in integrating word-level text and EEG information, further advancing the field of emotion recognition.

However, practical challenges emerge when attempting to utilize diverse modalities, such as EEG signals and text, simultaneously due to data heterogeneity and variability. Despite the significant progress made in these studies, a universal method for transforming text and EEG signals into common feature vectors for fusion remains elusive. Moreover, many of these studies rely on EEG features to adjust and supplement emotion classification results, implying that errors in EEG emotion recognition may not effectively correct text emotion classifications. Therefore, this study employs a feature fusion module that enhances the inter-correlation between features using a channel-efficient attention mechanism. This innovation proves effective and robust in fusing distinctive features within a single mode and across different modalities.

To address these challenges, this study proposes a multi-branch convolutional neural network model that employs a multi-feature fusion mechanism for accurate emotion recognition. The following sections detail the proposed methodology in line with the previously highlighted contributions.

Architecture of the proposed model

In this section, our main goal is to build an intelligent network that can elucidate the complex mapping between data features and emotional labels. Given that EEG data is characterized by a non-Gaussian distribution and non-stationarity, along with the subtle emotional nuances inherent in text information, achieving accurate emotion recognition from either EEG signals or text is challenging. To overcome this, we formulate a multi-branch convolutional neural network framework, whose basic architecture is presented in Fig. 1. In the feature extraction module, we use enhanced convolutional neural networks to extract EEG features from different domains, including the time, frequency, and time-frequency domains. Furthermore, to process text data for multimodal emotion recognition, we developed a BiLSTM network model to extract features from text data. Subsequently, the feature fusion module processes the extracted features, facilitating seamless integration. Ultimately, the fused features are fed into the emotion recognition module, which produces the final classification results. In the remainder of this section, we explain the working principles of each module and detail its implementation.

Table 1 EEG feature extraction module based on convolutional networks.

Feature extraction module

Feature extraction plays a central role in machine learning and data analysis27. It involves transforming raw data into more informative and practical feature representations28. The primary purpose of feature extraction is to improve an algorithm's understanding and use of the data, thereby facilitating its application in classification, clustering, regression, and related tasks29. In the following subsections, we describe the details of our EEG and text feature extraction modules.

Figure 2

Structure of the designed BiLSTM-CNN network.

EEG feature extraction module

In the EEG feature extraction process, we obtain relevant information from three perspectives: the time domain, frequency domain, and time-frequency domain. Time-domain feature extraction focuses on the signal’s dynamic changes over time. By calculating parameters such as mean and standard deviation, we reveal the overall fluctuation level of the signal, laying the groundwork for understanding EEG dynamics. Frequency-domain feature extraction emphasizes the signal’s characteristics in the frequency dimension. Analyzing parameters like power spectral density and band energy helps us understand energy distribution across frequencies and differentiate between various brainwave activities. We will employ multiple algorithms to extract features from the time or frequency domain, utilizing a one-dimensional (1D) convolutional neural network to optimize these features and enhance the model’s performance. Finally, time-frequency domain feature extraction integrates analytical methods from both domains. Techniques such as short-time Fourier transform, wavelet transform, and time-frequency spectrograms provide a comprehensive view of the signal’s time-varying characteristics. By combining time and frequency data, we employ a two-dimensional (2D) convolutional neural network to further optimize the extracted features, improving the overall effectiveness of the model. The specific dimensions of extracted features and the network architecture are detailed in Table 1.
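To make the three descriptor families concrete, the sketch below computes simple time-domain statistics, a Welch PSD, and an STFT spectrogram for a single synthetic channel. The window lengths, sampling rate, and exact statistic set are illustrative assumptions rather than the configuration in Table 1.

```python
import numpy as np
from scipy.signal import welch, stft
from scipy.stats import skew, kurtosis

def time_domain_features(x):
    """Eight simple statistics of a 1-D signal (mean, peak, std, skewness, kurtosis, max, min, ZCR)."""
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)  # zero-crossing rate
    return np.array([x.mean(), np.abs(x).max(), x.std(), skew(x),
                     kurtosis(x), x.max(), x.min(), zcr])

def frequency_domain_features(x, fs):
    """Power spectral density via Welch's method."""
    freqs, psd = welch(x, fs=fs, nperseg=fs)
    return freqs, psd

def time_frequency_features(x, fs):
    """Magnitude STFT spectrogram (frequency x time)."""
    f, t, Z = stft(x, fs=fs, nperseg=fs)
    return np.abs(Z)

x = np.random.randn(800)              # one channel, 4 s at 200 Hz (synthetic)
tdf = time_domain_features(x)
_, psd = frequency_domain_features(x, fs=200)
tfdf = time_frequency_features(x, fs=200)
```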

For convenience, we represent the time domain, frequency domain, and time-frequency domain as \(D_T\), \(D_F\), and \(D_{T\text {-}F}\), respectively. In the designed feature extraction network, we let \(x^{l-1}_{T}\in {\mathbb {R}}^{L\times 1\times C}\) and \(x^{l-1}_F\in {\mathbb {R}}^{L \times 1 \times C}\) be the input dimension of the 1D convolutional network used to extract \(D_T\) and \(D_F\), and \(k^{l}_{1D}\in {\mathbb {R}}^{K\times 1\times C\times N}\) be the size of the convolutional kernel dimension in 1D convolution, then for a 2D convolutional network extracting \(D_{T\text {-}F}\), the input dimension is \(x^{l-1}_{T-F}\in {\mathbb {R}}^{W \times H \times C}\) and the convolutional kernel size is \(k^{l}_{2D}\in {\mathbb {R}}^{K \times K \times C \times N}\). Among them, the convolution operation of the l-th layer \(u^l\) is shown by Eq. (1):

$$\begin{aligned} u^l = k^l *x^{l-1} + b = \sum _{c=1}^{C} k_c^l *x^{l-1}_c + b^{l}, \end{aligned}$$
(1)

where \(b^l\) denotes the bias, \(*\) indicates the convolution operation, and C represents the number of input channels.

In our model, the design of the convolutional modules depends on the sequence of convolutional neural network layers. First, the convolutional layer uses a set of learnable convolutional kernels as a basis to process the input data through convolution operations. However, since emotion recognition is not a linear problem, using only convolutional layers for feature learning can lead to problems such as information loss, which in turn causes the model to overfit and reduces test-set accuracy. To address this issue, we integrated a PReLU activation function layer and a BN layer into the convolutional module. Overall, the introduction of the PReLU activation and BN layers helps the convolutional modules learn the features of the input data more effectively. This process can be expressed mathematically in the form presented in Eq. (2):

$$\begin{aligned} x^l = \sigma (\phi ( \omega (x^{l-1}))), \end{aligned}$$
(2)

where \(x^l\) represents the feature map of the lth convolutional layer, \(\sigma (\cdot )\) is the PReLU activation function, \(\phi (\cdot )\) is the batch normalization operation, and \(\omega (\cdot )\) denotes the convolutional operation.

Last but not least, a common approach employed to highlight the key features of convolution operations and to enlarge the receptive field is the introduction of a pooling layer after the convolutional layer. As an independent neural layer, the pooling operation has no parameters, and its main purpose is to filter out irrelevant features while retaining important representations. Therefore, the feature maps generated by the pooling layers capture the key information of the original data better. The nth feature map of the lth pooling layer \(y_n^l\) could be expressed by Eq. (3):

$$\begin{aligned} y_n^l = pool(x_n^l,k,s), \end{aligned}$$
(3)

where \(x_n^l\) is the nth output feature map of the lth convolutional layer, \(pool(\cdot )\) is the max pooling operation, k is the pooling kernel size, and s represents the stride of the pooling kernel.

After processing features through the extraction network, we observed that the dimensions of optimized features vary across different extraction networks, which creates difficulties for subsequent feature fusion operations. Dimensionality reduction of high-dimensional features is a common and effective strategy during the optimization process. Thus, we flatten the 2D features along the channel dimension, converting them into 1D features that can then undergo 1D convolution operations for dimensionality reduction. The output feature map \(F_z\) for this process can be formulated using Eq. (4):

$$\begin{aligned} F_z = conv1d(f(x_{TFDF}^l)), \end{aligned}$$
(4)

where \(f(\cdot )\) denotes the flattening operation, \(x_{TFDF}^l\) represents the input features, and \(conv1d(\cdot )\) indicates the 1D convolution operation. As a result, our approach to EEG feature extraction combines insights from the time, frequency, and time-frequency domains using advanced convolutional networks to improve accuracy in emotion recognition tasks.
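As a sketch of one such convolutional block (convolution followed by batch normalization, PReLU activation, and max pooling, then flattening), the PyTorch snippet below is a simplified stand-in; the kernel sizes, channel counts, and input shape are placeholders rather than the values listed in Table 1.

```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """Conv1d -> BatchNorm -> PReLU -> MaxPool, as described for the EEG branches."""
    def __init__(self, in_ch, out_ch, k=3, pool=2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.PReLU()
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):                  # x: (batch, channels, length)
        return self.pool(self.act(self.bn(self.conv(x))))

# Example: a time-domain branch with 62 input channels and a flattened feature output.
branch = nn.Sequential(ConvBlock1D(62, 128), ConvBlock1D(128, 64), nn.Flatten())
f_t = branch(torch.randn(16, 62, 8))       # hypothetical (batch, channels, features) input
```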

Text feature extraction module

To achieve effective fusion of text and EEG features, it is equally essential to extract and optimize the text features. In this process, we employ the BiLSTM-CNN network for text feature extraction. This network combines the strengths of bidirectional long short-term memory (BiLSTM) networks and convolutional neural networks (CNNs), facilitating the capture of contextual information and local structures within the text. As illustrated in Fig. 2, BiLSTM can capture contextual information in sentences, which helps the model to better understand the patterns of word vectors in sentences, and effectively alleviates the problem of gradient vanishing that may occur during the model’s learning process. Subsequently, by effectively capturing local features through CNNs, this combination enhances the model’s comprehension of the text data, leading to improved performance.

In BiLSTM, the storage, updating, and output of historical information are precisely controlled by storage units, input gates, forgetting gates, and output gates. Input gates are used to regulate the effect of input vectors on the state of storage units, ensuring that information is properly integrated. By this configuration, the output gate allows the state of the memory unit to affect the final output result. Consequently, the function of the forgetting gate is to allow memory cells to selectively retain or forget previous information as required to meet the needs of the current task. Before feeding text features into our proposed BiLSTM-CNN network, the initial step involves generating word vectors for the text data. To achieve this, we utilize a GloVe (global vectors for word representation) dictionary to convert the text features into word vectors. GloVe is a word embedding method based on global word frequency statistics that learns vector representations for each word by constructing a co-occurrence matrix of words. This representation effectively captures the semantic information and contextual relationships of words, establishing a strong basis for subsequent text feature extraction.

The text feature input to the BiLSTM-CNN is defined as \(X=[X_1, X_2, \cdots , X_T]\), where each \(X_t\) is a d-dimensional word embedding vector. These word vectors subsequently serve as input to the BiLSTM network. The hidden layer size of the BiLSTM network is set to H to effectively capture contextual information within the text sequences. The remaining network parameters include the input weight matrix \(W \in {\mathbb {R}}^{H \times d}\), the hidden state weight matrix \(U \in {\mathbb {R}}^{H \times H}\), and the bias vector \(b \in {\mathbb {R}}^{H \times 1}\). During training, these parameters (W, U, b) are learned and updated through the back-propagation algorithm, enabling the model to capture long-term dependencies and mitigate the vanishing gradient problem. At time step t, the information \(x_t\) first passes through the forgetting gate \(f_t\) for information filtering, as described in Eq. (5):

$$\begin{aligned} f_t = \sigma (W_{f}x_{t} + U_{f}h_{t-1} + b_f), \end{aligned}$$
(5)

where \(h_{t\text {-}1}\) represents the feature mapping learned by the network at time \(t\text {-}1\), and \(\sigma (\cdot )\) is the sigmoid function. Subsequently, the features are further processed through the input gate \(i_t\), as shown in Eq. (6):

$$\begin{aligned} i_t = \sigma (W_{i}x_{t} + U_{i}h_{t-1} + b_i). \end{aligned}$$
(6)

In addition, the candidate cell state, which represents the current input information without considering the previous cell state, is computed at each time step. This candidate state is derived from the current input and the previous hidden state, processed through specific weight matrices and activated using the hyperbolic tangent function, as represented in Eq. (7):

$$\begin{aligned} c_{t1} = \tanh (W_{c}x_{t} + U_{c}h_{t-1} + b_{c}). \end{aligned}$$
(7)

Here, \(\tanh (\cdot )\) serves as a normalization function that maps its input to a value between -1 and 1. The current cell state \(c_{t}\) is then determined by the previous cell state \(c_{t-1}\), the candidate state \(c_{t1}\), the forgetting gate \(f_t\), and the input gate \(i_t\). Moreover, similar to Eqs. (5) and (6), the feature vector must pass through the output gate \(o_t\) to obtain the necessary information. The hidden state \(h_t\) at the current time step, which is crucial for information transfer and feature representation, is computed as

$$\begin{aligned} h_{t} = o_{t} \otimes \tanh (c_{t}). \end{aligned}$$
(8)

In this equation, \(\otimes\) denotes element-wise multiplication.

The hidden state conveys information from the current time step to the next, enabling the model to capture and retain long-range dependencies. It also regulates the flow of information through the gating mechanisms, including the forgetting gate, input gate, and output gate, which determine what information to retain, add, or discard. Acting as a compressed representation of the input sequence up to the current time step, the hidden state encodes all relevant contextual information for classification, prediction, or generation tasks. In a BiLSTM, it integrates information from both forward and backward sequences, allowing for a simultaneous understanding of context from both directions. Moreover, the hidden state is crucial for network training, as it is updated through the back propagation algorithm, enabling the LSTM to effectively handle sequential data and perform various sequence-related tasks.

To extract information from text features and integrate it with data from other sources, we employed a CNN architecture with three inner layers that filter data and adjust the dimensionality of the text features. This design effectively consolidates multiple features, enhancing overall performance. The subsequent experimental sections will demonstrate how our CNN architecture addressed challenges related to varying feature dimensions during the fusion process, further refining the information in the text features.
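A condensed PyTorch sketch of this BiLSTM-CNN idea is given below; the embedding dimension, hidden size, and output width are illustrative assumptions, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class BiLSTMCNN(nn.Module):
    """BiLSTM for context, then 1D convolutions over the hidden states for local patterns."""
    def __init__(self, embed_dim=300, hidden=128, out_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Sequential(
            nn.Conv1d(2 * hidden, out_dim, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.AdaptiveMaxPool1d(1),         # collapse the time axis
        )

    def forward(self, x):                    # x: (batch, seq_len, embed_dim) GloVe vectors
        h, _ = self.bilstm(x)                # (batch, seq_len, 2*hidden)
        h = h.transpose(1, 2)                # (batch, 2*hidden, seq_len) for Conv1d
        return self.conv(h).squeeze(-1)      # (batch, out_dim) text feature vector

text_feat = BiLSTMCNN()(torch.randn(16, 20, 300))   # 16 sentences of 20 tokens each
```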

Figure 3

Flowchart of the feature fusion module. The two fused features highlight their respective important information through the ECA model. The highlighted features are then filtered using an MLP network and finally fused.

Figure 4

Data augmentation: (a) Original EEG data, (b) Rotated EEG data, (c) Vertically flipped EEG data, and (d) Noisy EEG data.

Feature fusion module

A single feature is insufficient to effectively address EEG emotion recognition and multimodal emotion recognition tasks. To tackle the challenges of EEG feature fusion and multimodal feature integration, we have developed a strategy for incorporating diverse features into the model. This strategy first optimizes the dimensions of the various features to facilitate fusion on a unified dimensional scale, and then extracts useful information from the generated modal data. Simply applying traditional feature fusion methods may introduce redundant information, which can adversely affect subsequent emotion classification tasks. Therefore, we propose a cross-feature intersection fusion module based on the integration of different features. This module employs an efficient channel attention mechanism that dynamically adjusts the weights of channel features, emphasizing important features while suppressing redundant information. The workflow of the feature fusion module is illustrated in Fig. 3.

In the feature fusion mechanism, the global features of a single modality are especially important for classification. The feature interaction mechanism is introduced to exploit this global information so that each branch contains essential information from the other branches. Given the feature mappings \(F_x, F_y \in {\mathbb {R}}^{B \times I \times C}\) extracted from two types of modal data, where B is the batch size, C is the number of channels, and I is the length of each feature map, we perform this interaction between any two of the three domains. Because the feature interaction operation is asymmetric, interacting features from all three domains simultaneously would fuse features containing redundant information. For convenience, we use only \(F_x\) and \(F_y\) as examples. First, \(F_y\) is aggregated through channel-wise max pooling to obtain globally expressive features, which are then concatenated with \(F_x\) at the same level, as expressed in Eq. (9):

$$\begin{aligned} \overline{F_x} = \psi (F_x, {pool}_m(F_y)), \end{aligned}$$
(9)

where \(\overline{F_x}\in {\mathbb {R}}^{B\times (I+1)\times C}\), \(\psi (\cdot )\) denotes the connection operation, and \(pool_m(\cdot )\) represents global maximum pooling.

In this process, to retain the essential information of features \(F_x\) and \(F_y\) for emotion recognition, we utilize the efficient channel attention (ECA) module to assign appropriate weights to these features. The ECA module incorporates a critical global average pooling (GAP) component, which allows us to effectively extract global information from the features. Furthermore, we use a multi-layer perceptron to adjust the weight of feature \(F_y\) and apply a maximum operator to fuse the two features. This operator computes the maximum of the two feature maps \(\overline{F_x}\) and \(\overline{F_y}\) at the same position a of the cth channel, as defined in Eq. (10):

$$\begin{aligned} \Gamma _{max}^{a,c} = max{(\overline{F_{x}}^{a,c},\overline{F_{y}}^{a,c})}. \end{aligned}$$
(10)

In the end, these two features are merged to obtain the final feature vector using the feature fusion method that was introduced in the previous section.
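The snippet below sketches one plausible reading of this cross-fusion step (global pooling, concatenation as in Eq. (9), ECA reweighting, an MLP on one branch, and the element-wise maximum of Eq. (10)). The kernel size, MLP shape, and tensor dimensions are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP over the feature axis, then a light 1D conv over channels."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: (B, I, C) as in the text
        w = x.mean(dim=1, keepdim=True)           # global average pool over length -> (B, 1, C)
        w = torch.sigmoid(self.conv(w))           # channel weights in (0, 1)
        return x * w                              # reweight the channels

def cross_fuse(fx, fy, eca, mlp):
    """Concatenate the max-pooled summary of the other branch, apply ECA, then max-operator fusion."""
    fx_bar = torch.cat([fx, fy.max(dim=1, keepdim=True).values], dim=1)  # Eq. (9): (B, I+1, C)
    fy_bar = torch.cat([fy, fx.max(dim=1, keepdim=True).values], dim=1)
    fx_bar, fy_bar = eca(fx_bar), mlp(eca(fy_bar))                       # MLP adjusts the F_y branch
    return torch.maximum(fx_bar, fy_bar)                                 # element-wise max, Eq. (10)

B, I, C = 16, 64, 128
fused = cross_fuse(torch.randn(B, I, C), torch.randn(B, I, C), ECA(), nn.Linear(C, C))
```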

In our model, the time-domain and frequency-domain features of EEG signals are processed through a feature extraction network and adjusted to a consistent dimension, allowing for seamless integration with text features during the subsequent fusion phase. The text features are similarly processed using a BiLSTM-CNN network to ensure their dimensions match those of the EEG features, thus achieving consistency across multi-modal data in the feature space. This uniform dimensional setting not only simplifies the feature fusion process but also provides a stable foundation for the emotion classification task. However, given the complexity of the data structure of time-frequency features, directly merging them with other features could lead to information redundancy or dimension mismatch. To address this issue, our model first applies a dimensionality reduction technique to the time-frequency features using Eq. (4), ensuring their dimensions align with those of the time-domain, frequency-domain, and text features. This reduction method aims to retain key information from the time-frequency features, enabling their effective integration within a unified feature space. By ensuring consistency and integrity in the feature space, our model can effectively synthesize information from different modalities, ultimately improving the accuracy and robustness of emotion classification.

Auxiliary parameters of the proposed model

In this section, we present further details of the emotion recognition module as well as the optimizer and loss function used to train and test the model. The emotion recognition module consists of four fully connected layers (FCLs); the first three contain 2048, 512, and 128 neurons, respectively, and each is followed by a dropout layer and a rectified linear unit (ReLU) activation function. The final FCL contains a number of neurons equal to the number of emotion classes, completing the emotion classification task.
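A sketch of such a classification head in PyTorch is shown below; the input dimension and dropout rate are placeholders rather than the values used in our experiments.

```python
import torch.nn as nn

def make_classifier(in_dim, num_classes, p=0.5):
    """Three hidden FCLs (2048, 512, 128) with ReLU and dropout, plus the output layer."""
    return nn.Sequential(
        nn.Linear(in_dim, 2048), nn.ReLU(), nn.Dropout(p),
        nn.Linear(2048, 512),    nn.ReLU(), nn.Dropout(p),
        nn.Linear(512, 128),     nn.ReLU(), nn.Dropout(p),
        nn.Linear(128, num_classes),
    )

head = make_classifier(in_dim=4096, num_classes=3)   # e.g., three SEED emotion classes
```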

Figure 5

Structure of the EEG emotion recognition model using TDF, PSD, and STFT features.

Meanwhile, among the many optimization algorithms available for neural networks, we use AdaGrad as the optimizer for training. This choice is motivated by the fact that AdaGrad is a gradient descent optimization method with an adaptive learning rate: it performs larger updates for infrequent parameters and smaller updates for frequent parameters, which makes it well suited to sparse data. Another advantage of AdaGrad is that it eliminates the need to manually tune the learning rate, since it continually adjusts the learning rate during the iteration process and allows each parameter in the objective function to have its own learning rate.

Furthermore, since the emotion recognition task is a multi-class classification problem, the cross-entropy function is used as the loss function, as defined in Eq. (11):

$$\begin{aligned} \ell =\frac{1}{N} \sum _i L_i = - \frac{1}{N} \sum _i \sum _{c=1}^{M} y_{ic} \log (p_{ic}). \end{aligned}$$
(11)

Here, \(y_{ic}\) is a binary indicator representing the true label of the ith sample for the cth class, and \(p_{ic}\) is the predicted probability that the ith sample belongs to the cth class. Additionally, M and N denote the number of classes and the number of samples in a batch, respectively.
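In code, the training objective and optimizer described above amount to something like the following sketch; the stand-in network and learning rate are placeholders rather than the tuned settings reported later.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 128), nn.ReLU(), nn.Linear(128, 3))  # stand-in network
criterion = nn.CrossEntropyLoss()                      # implements Eq. (11) for a batch
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)  # per-parameter adaptive rates

def train_step(features, labels):
    """One optimization step: forward pass, cross-entropy loss, backward pass, update."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels)          # labels are integer class indices
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(16, 4096), torch.randint(0, 3, (16,)))
```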

Experimental results and comparative analysis

In this section, a series of experiments were conducted to evaluate the proposed model. We begin by outlining the data processing operations and data augmentation methods used in our approach. To facilitate a more in-depth analysis of the model's classification performance, we used accuracy (Acc), recall (Rec), precision (Pre), and F1-score (F1) as evaluation criteria and compared the results with relevant research from recent years. In EEG emotion recognition, subject-independent experiments are conducted to verify the stability of the model; therefore, the standard deviation (STD) was also included as an assessment metric. In addition, we report comparisons and analysis of the experimental results on three datasets: SEED, SEED-IV, and ZuCo. To verify the generalizability of our proposed model, we compared it with recent EEG emotion recognition models that also employed the SEED and SEED-IV datasets, as well as with other models on the ZuCo dataset. Through extensive testing, we found that the model achieved optimal performance with a learning rate of 9e-10, a batch size of 16, and 1000 training epochs; unless otherwise noted, all experiments were conducted with this configuration. These experiments enabled us to comprehensively evaluate the model's performance across different datasets, validating its effectiveness and reliability in the field of EEG emotion recognition.

Dataset description and preprocessing

We use the min-max normalization technique to standardize the EEG data. This technique maps the range of the original signal to the interval [0,1] while preserving the inherent waveform characteristics. The normalization procedure takes the form of Eq. (12):

$$\begin{aligned} x^{*} = \frac{x[n] - x[n]_{min}}{x[n]_{max} - x[n]_{min}} , \end{aligned}$$
(12)

where given the original signal x[n], \(x[n]_{min}\) and \(x[n]_{max}\) represent the minimum and maximum values within the signal, respectively. For text data used for multimodal emotion recognition, we simply standardize the capitalization of sentences and remove spaces and non-emotional symbols. Finally, as outlined earlier, the GloVe model is used to convert sentences into feature vectors.
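A NumPy equivalent of Eq. (12), applied per channel, might look as follows; the per-channel axis choice and the small epsilon (added to avoid division by zero on flat channels) are assumptions.

```python
import numpy as np

def minmax_normalize(x, axis=-1):
    """Map each channel of x to [0, 1] as in Eq. (12)."""
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    return (x - x_min) / (x_max - x_min + 1e-12)

eeg_norm = minmax_normalize(np.random.randn(62, 800))   # 62 channels, 800 samples (synthetic)
```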

Figure 6

Structure of the multimodal emotion recognition model for EEG and text data.

Dataset preprocessing

The SEED dataset encompasses three distinct emotion categories: positive, negative, and neutral19. This dataset was collected by recording EEG data from 15 participants as they watched stimulus videos. Each participant engaged in three separate experiments, spaced approximately one week apart. Each experiment comprised 15 film clips, with the participants asked to provide feedback immediately after each experiment. The dataset records EEG data from 62 signal channels at an original sampling frequency of 1000 Hz, subsequently downsampled to 200 Hz. In our experiments, we divided the raw EEG data into five frequency bands and extracted features such as PSD and differential entropy (DE) from each band. To further explore the signal characteristics, we also extracted eight time-domain features per channel, namely the mean, peak, standard deviation, skewness, kurtosis, maximum, minimum, and zero-crossing rate, and applied an STFT to the original EEG signals to generate 2D time-frequency spectra. The dimensions of the time-domain feature matrix are \(62 \times 8\). Combining the results of these analyses, we obtained a comprehensive feature set in which the PSD and DE features are represented as matrices with dimensions of \(62 \times 249 \times 106\), while the dimensions of the time-frequency domain feature matrix are \(62 \times 257 \times 206\).

The SEED-IV dataset comprises four distinct emotion categories: happiness, sadness, fear, and neutral emotions30. The experiment involved 15 participants, each of whom completed three experiments consisting of 24 video clips, with 6 clips representing each emotion type. Following each experiment, participants provided feedback, and their EEG signals from 62 channels were recorded. The EEG signals were divided into non-overlapping 4-second segments and downsampled to a sampling rate of 128 Hz. In our implementation, we employed the same data processing method as for the SEED dataset. The dimensions of the time-domain signal feature matrix are \(62 \times 8\), the DE feature matrix is \(62 \times 250 \times 106\), the PSD feature matrix is \(62 \times 249 \times 106\), and the STFT feature matrix is \(62 \times 257 \times 206\).

The ZuCo dataset comprises natural reading tasks divided into three categories: sentiment analysis, natural reading, and task-specific reading31. In the sentiment analysis dataset, participants were presented with 400 sentences from the Stanford sentiment tree database, covering positive, neutral, and negative sentiment markers32. EEG signals were recorded during participants’ natural reading tasks, capturing both sentence-level and word-level EEG features. Word-level EEG features were recorded based on the first fixation duration of each word, capturing the EEG signal when a subject’s eyes first encountered the word. For both word-level and sentence-level EEG signals, data were recorded in eight frequency bands at a sampling frequency of 500 Hz, including theta1 (4-6 Hz), theta2 (6.5-8 Hz), alpha1 (8.5-10 Hz), alpha2 (10.5-13 Hz), beta1 (13.5-18 Hz), beta2 (18.5-30 Hz), gamma1 (30.5-40 Hz), and gamma2 (40-49.5 Hz). In our study, the same data processing methods were applied to both sentence-level and word-level EEG signals. Furthermore, we extracted eight time-domain features from these eight frequency bands, including mean, peak, standard deviation, skewness, kurtosis, maximum, minimum, and zero crossing rate. The eight frequency bands were standardized and used as frequency domain features. Additionally, we utilized STFT to generate time-frequency domain features. As a result, the combined time-domain and frequency-domain features constituted a matrix with dimension \(104 \times 8\). The overall dimensionality of the resulting time-frequency domain features matrix is \(104 \times 257 \times 56\).

Data augmentation

When faced with data constraints, augmentation methods are necessary to mitigate the adverse effects of data scarcity33. In this section, we will apply data augmentation techniques to the dataset. The text data, as mentioned in the previous section, is derived from another emotional dataset. To minimize data errors, we will focus on using three different data augmentation techniques specifically for the EEG signals in the three datasets: rotation, flipping, and the application of Gaussian noise. The augmentation process involves expanding the dataset by introducing a scaling factor, denoted as N. Figure 4 illustrates the EEG data augmentation process for an EEG signal from the SEED dataset [see Fig. 4(a)], which consists of ten EEG channels.

Table 2 Experiments on EEG emotion recognition using SEED and SEED-IV datasets.

Rotation of EEG signal Rotation is a common signal enhancement operation that is widely used in image processing and digital signal processing. In image processing, the direction or angle of an image can be changed by rotating it, which is especially useful for correcting skewed images or aligning images. In digital signal processing, rotation can be used for phase adjustment or signal correction to ensure that the signal is aligned with the desired direction or frequency. By rotating an EEG signal, represented as (I, Q), around the origin, we obtain enhanced signal samples (\(I^{'}\), \(Q^{'}\)) using Eq. (13):

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} =\begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} I \\ Q \end{bmatrix}, \end{aligned}$$
(13)

where \(\theta\) is the angle of rotation. In this study, we rotate the EEG signal 180 degrees around the center position. This is partly because the 180-degree rotation is a safe method for data enhancement, which does not distort the signal content or cause loss of information. Moreover, it may be more resistant to some environmental disturbances and noise, thereby allowing the model to cope better with signals from different directions. The plot in Fig. 4(b) suggests that this type of rotation can augment the data without causing any loss of data.

Flip of EEG signal Flipping is another signal enhancement technique that is extensively used in image processing. Common flipping operations include horizontal flipping and vertical flipping, which we describe below. To flip an EEG signal (I, Q) horizontally, the I component of the signal is negated while the Q component is left unchanged; vertical flipping negates the Q component while leaving the I component unchanged. These are defined respectively as

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} -I \\ Q \end{bmatrix} \text {and} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} I \\ -Q \end{bmatrix}. \end{aligned}$$
(14)

Horizontal flipping is more suitable for most applications; however, when the signal source is symmetric, a horizontally flipped signal does not differ from the original. Vertical flipping, by contrast, produces effective data augmentation and introduces greater variation, allowing the model to learn more features and improving its performance and robustness. We therefore adopted vertical flipping for our data augmentation; as shown in Fig. 4(c), the diversity of the data processed with this operation increases.

Gaussian Noise in EEG signals Gaussian noise is a common type of noise used to simulate various natural noises, such as thermal noise in electronic components or random noise in the environment. In signal enhancement, Gaussian noise can be used to assess the robustness or noise immunity of signal processing algorithms. A common method for dealing with Gaussian noise is to apply filters or denoising algorithms to minimize its effect on the signal and thereby improve signal quality. By applying Gaussian noise \(\mathcal {N}(0, \sigma ^2)\) to the modulated EEG signal (I, Q), we obtain enhanced signal samples \((I^{'}, Q^{'})\) using Eq. (15):

$$\begin{aligned} \begin{bmatrix} I^{'} \\ Q^{'} \end{bmatrix} = \begin{bmatrix} I \\ Q \end{bmatrix} + \mathcal {N}(0, \sigma ^2). \end{aligned}$$
(15)

Meanwhile, since the probability density function of Gaussian noise follows a normal distribution, we represent the noise by a zero-mean Gaussian distribution, where \(\sigma ^2\) represents its variance.

Using Gaussian noise to augment EEG signals enables the model to learn robustness to interference during training while improving its ability to generalize. Additionally, the parameters of the Gaussian noise can be adjusted to control its intensity and keep its effect on the original data within a controllable range. In this study, we set \(\sigma\) to 0.2 because, at this value, the impact on the data is minimal. As seen in Fig. 4(d), substantial variation is noticeable in the data situated at the center, and the magnitude of the change diminishes toward the edges. Moreover, these changes adhere to a Gaussian distribution.
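Putting the three augmentation operations together, the sketch below applies Eq. (13) with \(\theta = 180^{\circ }\), the vertical flip of Eq. (14), and the additive noise of Eq. (15) to a hypothetical (I, Q) signal pair; the array layout is an assumption for illustration.

```python
import numpy as np

def rotate_180(iq):
    """Eq. (13) with theta = 180 degrees: rotate the (I, Q) pair around the origin."""
    theta = np.pi
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ iq

def flip_vertical(iq):
    """Eq. (14), vertical flip: negate the Q component only."""
    out = iq.copy()
    out[1] = -out[1]
    return out

def add_gaussian_noise(iq, sigma=0.2):
    """Eq. (15): additive zero-mean Gaussian noise with the sigma used in this study."""
    return iq + np.random.normal(0.0, sigma, size=iq.shape)

iq = np.random.randn(2, 800)                 # rows: I and Q components of one channel
augmented = [rotate_180(iq), flip_vertical(iq), add_gaussian_noise(iq)]
```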

Experiments on EEG emotion recognition

As outlined in the SEED and SEED-IV data processing description in Section 4.1, we extracted four distinct features from these two datasets to enhance our analysis. The network flowchart using time-domain (TDF), PSD, and STFT features is presented in Fig. 5, showcasing how each feature contributes to the overall framework for emotion recognition.

Figure 7

Bar chart illustration of ablation experiments on the feature fusion and extraction modules at the sentence level in the ZuCo dataset.

To establish the veracity of the experimental results, in the SEED dataset we randomly selected 12 experimental data groups as the training set, while the remaining 3 groups were designated as the test set, i.e., a 12:3 split. Meanwhile, to assess the impact of distinctive features on the model, we first combined TDF, PSD, DE, and STFT in pairs to obtain four sets of experimental results. Second, three experiments, TDF+PSD+STFT, TDF+DE+STFT, and PSD+DE+STFT, were undertaken, with their results reported in Table 2. Overall, we observed significant differences in performance, with the PSD+DE+STFT method recording the best performance across all reported metrics (highlighted in bold). This method produced scores of 91.50% for Acc, 91.62% for Rec, 91.49% for Pre, and 91.70% for F1. The TDF+DE+STFT method also recorded a strong performance, with an Acc of 89.01%, Rec of 89.00%, Pre of 89.07%, and an F1 of 89.03%. While the PSD+STFT and DE+STFT combinations recorded satisfactory results, they exhibited slightly lower Acc values of 87.00% and 88.50%, respectively. Notably, the TDF+PSD combination recorded the least favourable results, with an Acc of 81.25%. A further analysis of the results in Table 2 indicates that combinations involving STFT performed better in the experiments. This is attributed to STFT's ability to capture both time- and frequency-domain information in EEG signals while effectively filtering out random noise from the original signals, thereby emphasizing periodic components. Additionally, we observed that using time-domain statistical features often led to a decrease in experimental results. As such, our experimental results confirmed that combinations including DE features are more effective in enhancing performance than those including PSD features.

In the SEED-IV dataset experiments, we arbitrarily chose 20 experimental data groups as the training set and the remaining 4 as the test set, i.e., a 20:4 split. At the same time, we conducted extensive research using the multidimensional features of the EEG data, pairing these critical domains to promote the fusion of complementary features. For this purpose, we followed the same experimental approach as above, combining the four distinctive features in pairs and in three-feature scenarios. The results of these experiments are presented in Table 2, which reveals the significant impact of the different data processing methods on performance. The PSD+DE+STFT combination excels in all performance indicators, achieving the highest accuracy, recall, precision, and F1-score (91.50%, 91.62%, 91.49%, and 91.70%, respectively). The TDF+DE+STFT combination also performs well, recording scores of 89.01% for Acc, 89.00% for Rec, 89.07% for Pre, and 89.03% for F1. Conversely, TDF+PSD exhibits the weakest performance across all indicators, with an Acc of only 80.25%. In contrast to the SEED experiments, the results of the three-feature combinations on the SEED-IV dataset are significantly superior to those of the pairwise combinations. Additionally, the results indicate that STFT-based combinations are generally better than other combinations, while DE-based combinations outperform PSD-based ones. This again validates the conclusions drawn from the SEED dataset.

In addition, given the extensive research on the SEED and SEED-IV datasets, we selected several different network architectures for comparative analysis. To ensure the credibility of the experimental results, we separated the test and training sets in the same way as other studies and conducted subject-independent experiments. Specifically, we employed a leave-one-subject-out cross-validation procedure to evaluate the performance of the model: in each round, the EEG signals of one participant are selected as the test set and those of the other 14 participants as the training set. Fifteen rounds of experiments were conducted, each with a different test subject, and the average results were recorded as the final results. Considering the similarity among the outcomes of the TDF+PSD+STFT, TDF+DE+STFT, and PSD+DE+STFT experiments, we compared these three groups of results with those reported in similar recent studies, as presented in Table 3 for the SEED and SEED-IV datasets. We discuss these results below.
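For reference, the leave-one-subject-out protocol described above can be expressed as the following sketch; the per-subject data containers and the training routine are hypothetical placeholders.

```python
import numpy as np

def leave_one_subject_out(subject_data, train_and_eval):
    """subject_data: dict subject_id -> (features, labels). Returns mean accuracy and STD over folds."""
    accs = []
    for test_subj in subject_data:
        test_set = subject_data[test_subj]
        train_set = [subject_data[s] for s in subject_data if s != test_subj]
        accs.append(train_and_eval(train_set, test_set))   # user-supplied train/evaluate routine
    return np.mean(accs), np.std(accs)                     # average accuracy and STD across 15 folds
```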

Table 3 A comparison of the proposed scheme and related works applied to the SEED dataset and SEED-IV dataset.
Table 4 Experiments on multimodal emotion recognition using ZuCo dataset.

The following is a brief overview of the comparative models. As shown in Table 3, the PR-PL method45 achieved the highest accuracy among the baselines on both the SEED and SEED-IV datasets, with accuracy rates of 85.56% and 74.92%, respectively, indicating strong performance. Similarly, our PSD+DE+STFT method demonstrated impressive accuracy, recording 85.85% on the SEED dataset and a solid 73.50% on the SEED-IV dataset. In contrast, the RGNN39 and Bi-HDM41 methods also showed high accuracy on the SEED dataset, with scores of 85.30% and 85.40%, respectively, but their performance was noticeably lower on the SEED-IV dataset. The SVM35 and DGCNN37 methods underperformed on the SEED-IV dataset, with accuracies of 56.61% and 52.82%, respectively. The TDF+PSD+STFT combination yielded an accuracy of 79.03% on the SEED dataset and 69.04% on the SEED-IV dataset, which is less competitive compared with current research. In contrast, the PSD+DE+STFT experiment outperformed the DGCNN model37 by 5.9% on the SEED dataset and surpassed the leading PR-PL model by 0.3% in the comparison experiment. Furthermore, on the SEED-IV dataset, our PSD+DE+STFT results were only about 1% lower than those of the top PR-PL model45. Through extensive experimentation, we found that when extracting features from three different domains, concatenating all five frequency bands may introduce redundancy in the final feature set. Additionally, the CNN feature extraction network tends to focus on global features, which can lead to the retention of redundant features and may adversely affect subsequent feature fusion and emotion recognition.

Experiments on multimodal emotion recognition

To validate the generalization capability of our model, we applied it to multimodal emotion recognition tasks involving EEG and text. As discussed in Section 2.2, combining EEG and text data for emotion recognition offers a more comprehensive understanding of emotional states, enhances recognition performance, and allows for a deeper exploration of the mechanisms underlying emotions. The multimodal emotion recognition model is illustrated in Fig. 6. To establish the performance of our model in multimodal emotion recognition, we used the publicly available multimodal dataset ZuCo, which provides EEG information for each sentence and each word. Using it, we designed two different fusion methods: a multimodal emotion recognition model based on fusing words with word-level EEG, and one based on fusing sentences with sentence-level EEG.

Since emotional labels can only be expressed at the word level, we proceed as follows: first, we concatenate the word-level EEG features and word vectors according to the order of the words in the sentence; second, we feed these concatenated features into the multi-feature fusion module; and finally, we use a classifier for emotion recognition. We strictly followed the feature extraction procedures provided with the dataset for both the EEG and the word information. For comparison with other contemporary studies, we randomly shuffled the data and divided it into 80% for training and 20% for testing. We denote the word features as Text, and use TDF, FDF, and TFDF to represent the EEG time-domain, frequency-domain, and time-frequency-domain features, respectively. We conducted seven experiments: TDF+Text, FDF+Text, TFDF+Text, TDF+FDF+Text, TDF+TFDF+Text, FDF+TFDF+Text, and TDF+FDF+TFDF+Text, whose details are presented in Table 4.
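The following sketch illustrates this word-level fusion step under stated assumptions: word-level EEG features and word embeddings are concatenated in reading order, fused, pooled into a sentence representation, and classified. The `FusionHead` name, layer sizes, and three-class output are illustrative choices, not the paper's exact architecture.

```python
# Hedged sketch of word-level EEG + text fusion followed by classification.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, eeg_dim=105, word_dim=300, hidden=256, n_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(eeg_dim + word_dim, hidden), nn.ReLU(), nn.Dropout(0.5))
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, eeg_feats, word_vecs):
        # eeg_feats: (batch, n_words, eeg_dim); word_vecs: (batch, n_words, word_dim)
        x = torch.cat([eeg_feats, word_vecs], dim=-1)   # concatenate per word, in order
        x = self.fuse(x)                                # per-word fused features
        x = x.mean(dim=1)                               # pool words into a sentence vector
        return self.classifier(x)

# Toy usage with random tensors standing in for ZuCo features.
model = FusionHead()
eeg = torch.randn(8, 20, 105)    # 8 sentences, 20 words, word-level EEG features
words = torch.randn(8, 20, 300)  # matching word embeddings
logits = model(eeg, words)       # (8, n_classes)
```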

Table 5 Ablation experiment using SEED dataset.

We can see that the TDF+TFDF+Text combination achieves outstanding results in both accuracy (94.32%) and F1-score (94.22%), evidence of its effectiveness in this classification task. The TDF+FDF+Text combination also performed well, with an accuracy of 94.03%, recall of 93.78%, precision of 93.54%, and F1-score of 93.04%. Meanwhile, the RNN-multimodal and CNN-multimodal models showed high recall but underperformed on the other indicators. Given the scarcity of recent studies on this task, we compare our results with those of Hollenstein et al.46, who report recall, precision, and F1-score as evaluation indicators. Our TDF+TFDF+Text combination outperformed the RNN-multimodal baseline by roughly 20% on these indicators and surpassed the other reported results by about 22%. We also observed that TDF+TFDF+Text achieved a higher accuracy (94.32%), while the accuracy of TDF+FDF+TFDF+Text was relatively lower. From these outcomes, we surmise that during the final feature fusion stage, combining too many sources of information introduced a degree of redundancy, leading to a decrease in accuracy.

Similar to the word-level fusion, we also conducted experiments on sentence-level fusion. We feed pre-processed sentences and EEG information directly into the model, enabling it to autonomously extract features and learn from the data. The sentence features are denoted as Text, while the EEG time-domain, frequency-domain, and time-frequency-domain features are represented as TDF, FDF, and TFDF, respectively. We conducted seven experiments with different feature combinations: TDF+Text, FDF+Text, TFDF+Text, TDF+FDF+Text, TDF+TFDF+Text, FDF+TFDF+Text, and TDF+FDF+TFDF+Text; the results of these experiments are presented in Table 4. Given the limited research on sentence-level EEG and multimodal emotion recognition, we included several commonly used classifiers (MLP47, ResNet5048, Transformer49) for comparative experiments. To enhance the reliability of the experiments, we shuffled all the data randomly, allocating 80% for training and the remaining 20% for testing.
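A minimal sketch of how one such feature-combination experiment could be assembled is shown below, assuming placeholder feature matrices; the concatenation of the selected blocks and the shuffled 80/20 split mirror the setup described above, while all array names and shapes are illustrative.

```python
# Sketch: assemble a feature combination and draw a shuffled 80/20 split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sentences = 400
features = {
    "TDF":  rng.normal(size=(n_sentences, 64)),   # time-domain EEG features
    "FDF":  rng.normal(size=(n_sentences, 64)),   # frequency-domain EEG features
    "TFDF": rng.normal(size=(n_sentences, 64)),   # time-frequency EEG features
    "Text": rng.normal(size=(n_sentences, 300)),  # sentence embeddings
}
labels = rng.integers(0, 3, size=n_sentences)

def make_combination(names):
    """Concatenate the selected feature blocks column-wise."""
    return np.concatenate([features[n] for n in names], axis=1)

X = make_combination(["TDF", "FDF", "TFDF", "Text"])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, shuffle=True, random_state=42)
```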

We can conclude from Table 4 that the combination of TDF+FDF+TFDF+Text excels in terms of accuracy, recall, precision, and F1-score (highlighted in bold). It records an accuracy of 96.95% and an impressive F1-score of 97.01%, making it the best-performing combination. Following closely, the TDF+Text combination delivers strong accuracy at 95.64% but a slightly lower F1-score. The ResNet50 (TDF+FDF+TFDF+Text) and Transformer (TDF+FDF+TFDF+Text) baselines reach F1-scores of 52.88% and 77.38%, respectively, but with lower accuracy, while MLP (TDF+FDF+TFDF+Text) performs lower still, with diminished accuracy and F1-score. Consequently, Table 4 demonstrates that our model offers a significant advantage over the MLP, ResNet50, and Transformer baselines across all four evaluation indicators; notably, our results consistently fall in the range of 96% to 97%.

These experimental results underscore the generality of our model. Another important advantage becomes apparent when the model is applied to different datasets, particularly the ZuCo dataset; this advantage was, however, limited on the SEED-IV dataset, which consists of EEG data alone. The foregoing results and analysis suggest that the choice of distinctive features has a substantial impact on the final results. Additionally, the use of nearly identical feature extraction networks for the distinct features contributes to the relatively minor differences among the final features.

Finally, it is essential to highlight what distinguishes our approach from other emotion recognition methods. Our model accepts the raw signal as input and extracts various signal features through its dedicated feature extraction module. In addition, we incorporate attention mechanisms to accentuate differences among data with different emotional labels and to minimize disparities within data sharing the same emotional label. This design helps the model learn emotional distinctions across different individuals, which leads to improved recognition accuracy.
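As one plausible realization of the attention re-weighting described here, the sketch below uses a squeeze-and-excitation style channel-attention block; the proposed model's actual attention design may differ, and all dimensions are illustrative.

```python
# Hedged sketch of a channel-attention block that re-weights CNN feature maps.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # squeeze each channel to a scalar
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # x: (batch, channels, length) feature maps from a CNN branch
        w = self.pool(x).squeeze(-1)                 # (batch, channels)
        w = self.fc(w).unsqueeze(-1)                 # per-channel weights in (0, 1)
        return x * w                                 # emphasize informative channels

feats = torch.randn(8, 32, 128)
attended = ChannelAttention(32)(feats)               # same shape, re-weighted
```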

Ablation experiments

As outlined in previous sections, our model consists of two key modules: a feature extraction module and a feature fusion module. To assess the contribution of each module to the model's performance, we conducted a series of ablation experiments. Because the feature extraction module integrates all the obtained features into a matrix of unified size, we replace it with a three-layer CNN when ablating it. To ablate the feature fusion module, we replace it with the commonly used concatenation-based fusion of the obtained features. This section reports the ablation of the feature fusion module on the EEG dataset (SEED), as presented in Table 5. The ablation experiments for the feature extraction module and the feature fusion module on the ZuCo dataset are presented in Tables 6 and 7.
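For concreteness, the sketch below shows the two ablation stand-ins described above under assumed layer sizes: a plain three-layer 1-D CNN in place of the feature extraction module, and simple concatenation in place of the feature fusion module.

```python
# Sketch of the two ablation baselines; layer sizes and shapes are illustrative.
import torch
import torch.nn as nn

class ThreeLayerCNN(nn.Module):
    """Ablation stand-in for the feature extraction module."""
    def __init__(self, in_channels, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))

    def forward(self, x):                 # x: (batch, channels, time)
        return self.net(x).squeeze(-1)    # (batch, out_dim)

def concat_fusion(feature_list):
    """Ablation stand-in for the feature fusion module: plain concatenation."""
    return torch.cat(feature_list, dim=-1)

tdf = torch.randn(8, 62, 200)                         # e.g. 62-channel time-domain input
feat = ThreeLayerCNN(62)(tdf)                          # (8, 128)
fused = concat_fusion([feat, torch.randn(8, 128)])     # (8, 256)
```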

The experimental results presented in Table 5 were obtained using the same experimental procedure as in Table 2. They reveal the significant contribution of the feature fusion module to our results. Notably, the feature fusion module had the most pronounced impact on combinations involving DE, yielding improvements of up to 30%. The PSD+DE experiments showed enhancements of at least 27%, and even the PSD+STFT experiment improved by a remarkable 30% when the feature fusion module was applied.

Table 6 Ablation experiment on the feature fusion module at the sentence level in ZuCo.

Comparing the experimental results in Tables 6 and 7 with those in Table 4 reveals interesting insights, as depicted in Fig. 7. The FDF+Text and TDF+Text combinations yield substantial improvements when the feature extraction module and the feature fusion module are employed, respectively. For most other combinations, the results obtained using only the feature fusion module outperform those obtained using only the feature extraction module. Notably, the TDF+FDF+TFDF+Text combination did not exhibit a significant performance improvement in either table. We attribute this to the TFDF features, which encompass both time-domain and frequency-domain characteristics and thus lend a certain stability to the results; this combination also appears to provide sufficient information for the model to learn from, diminishing the need to refine key information with the feature extraction and feature fusion modules. Furthermore, we conducted ablation experiments that removed both the feature extraction and the feature fusion modules on the ZuCo dataset; the results of all seven experiments fell in the 30% to 40% range, suggesting that these results were caused by overfitting. We therefore conclude that the feature extraction and feature fusion modules provide significant advantages in emotion recognition tasks.

Table 7 Ablation experiment on the feature extraction module at the sentence level in ZuCo.

Conclusion

In this study, we introduced an intuitive approach to improving the performance of emotion recognition tasks. Our method leverages feature fusion and attention mechanisms to enhance multi-branch CNN models. In particular, we developed a network capable of processing raw EEG signals, autonomously extracting features, training itself, and performing emotion recognition. Our extensive experiments illustrate the distinct advantages of the proposed model in applications that require EEG-based emotion recognition. Moreover, we demonstrated the versatility of the model by conducting experiments on multimodal emotion recognition that incorporates EEG and text data. Through ablation experiments, we underscored the substantial value of the proposed feature extraction and feature fusion modules in improving recognition accuracy.

Our study follows the data distributions commonly employed in EEG-based emotion recognition studies34,35. In real-world applications of social emotion recognition, however, more intricate and challenging scenarios are encountered. In this study, our model was validated through rigorous experiments encompassing EEG emotion recognition as well as multimodal emotion recognition combining EEG and text data, and these findings provide a solid foundation for our approach. Looking ahead, we plan to extend the model to diverse classification tasks, including emotion recognition based on images and speech, and multi-object detection. To further improve classification performance, we will draw on the ideas in the literature50 and combine the proposed model with a gated recurrent unit. The research on the unified topic semantic model for the semantic relevance of geographic terms51 likewise offers important insights into the generalization ability of our model in complex tasks. We are also mindful of several current limitations. First, the feature extraction and feature fusion modules process data slowly; we will develop frameworks to optimize the temporal and spatial complexity of these modules and thereby enhance the model's capacity for emotion recognition. Second, EEG signals are highly individual: different people may exhibit varying intensities and patterns in their EEG responses, leading to potential deviations in recognition results. To address this challenge, we plan to explore personalized models that adapt to the unique characteristics of each individual's EEG data, thereby improving the robustness of our emotion recognition system. Third, we aim to explore additional multimodal approaches. Incorporating diverse data sources, such as video and audio, into our framework can capture a richer context for emotion recognition; this may involve advanced signal processing and machine learning techniques to better integrate these modalities, ultimately enhancing the accuracy and reliability of our predictions. These directions, together with concomitant developments in technology, will help meet the escalating demands for emotion recognition in real-world societal contexts.