Introduction

Brain-computer interfaces (BCIs) represent an emerging technological field, offering an innovative approach to human-computer interaction. By enabling direct neural communication, BCIs allow individuals to control external devices or systems solely through brain activity, bypassing traditional motor pathways. BCIs hold significant promise for applications in healthcare, rehabilitation, entertainment, and education. In the medical sector, they provide hope for individuals with motor impairments, enabling the restoration of control over bodily functions. For instance, BCIs have been crucial in helping individuals with spinal cord injuries operate prosthetic limbs and have supported stroke survivors in regaining mobility1,2.

A critical BCI modality is EEG-based motor imagery (MI), which utilizes electroencephalographic (EEG) signals to deduce a user’s intent for limb movement. MI signals, which are the brain’s response to the mental rehearsal of motor actions, are essential for a BCI to identify the intended limb movement and control external devices accordingly.

Researchers have traditionally relied on pattern recognition and machine learning methods, using handcrafted features to classify EEG data. These approaches have proven highly effective, enabling the development of communication aids for stroke and epilepsy patients, brainwave-controlled devices like wheelchairs and robots for individuals with mobility impairments, and remote pathology detection systems based on EEG3,4,5. Despite these advancements, creating effective BCI systems remains a significant challenge. The limited spatial resolution, low signal-to-noise ratio (SNR), and dynamic nature of MI signals complicate the extraction of reliable features. Additionally, the substantial inherent noise in EEG data adds another layer of complexity to the analysis of brain dynamics and the precise classification of EEG signals.

Traditional BCIs generally encompass five main processing stages: data acquisition, signal processing, feature extraction, classification, and feedback6. Each stage often relies on manually specified signal processing7, feature extraction8, and classification methods9, requiring significant expertise and prior knowledge of the expected EEG signals. For instance, preprocessing steps are typically tailored to specific EEG features of interest, such as band-pass filtering for certain frequency ranges, which might exclude other potentially relevant EEG features outside the band-pass range. As BCI technology expands into new application areas, the demand for robust feature extraction techniques continues to grow10,11,12,13.

Early research in EEG signal classification has significantly contributed to our understanding of MI and other cognitive tasks. For example, the study on the classification of MI BCI using multivariate empirical mode decomposition (MEMD) demonstrated the effectiveness of MEMD in dealing with data nonstationarity, low SNR, and closely spaced frequency bands of interest. This approach allows for enhanced localization of frequency information in EEG, providing a highly localized time-frequency representation14. Another study focused on emotional state classification from EEG data using machine learning approaches, highlighting the importance of power spectrum features and feature smoothing methods in improving classification accuracy15. Additionally, research on mu rhythm (de)synchronization and EEG single-trial classification illustrated the importance of event-related desynchronization (ERD) and synchronization (ERS) patterns in discriminating between different MI tasks16.

These early studies have laid a solid foundation for the field, deepening our understanding of EEG signals and MI classification. The insights gained from these traditional methods have significantly influenced the development of subsequent technologies, which continue to benefit from the robust feature extraction techniques and classification strategies established by prior research. As a result, contemporary models are better equipped to handle the complexities of EEG data, leveraging the advancements made by earlier studies to achieve improved performance in various BCI applications.

In recent years, BCI technology has gained significant attention in the classification of MI tasks. Traditional MI classification methods mainly rely on manual feature extraction and machine learning algorithms. While these methods have achieved some success, they also have certain limitations, such as the cumbersome feature extraction process and the high demand for domain expertise17,18,19. The advent of deep learning has brought new possibilities for MI classification by learning discriminative features directly from raw EEG data, thereby reducing the need for manual feature extraction. Among deep learning methods, Convolutional Neural Networks (CNNs) have become foundational due to their layered feature extraction capabilities and end-to-end learning potential. Various CNN architectures, such as Inception-CNN, Residual CNN, 3D-CNN, and Multi-scale CNN, have been widely applied in MI classification20,21,22,23,24,25.

In addition to CNNs, Recurrent Neural Networks (RNNs) and Temporal Convolutional Networks (TCNs) have been used to capture the temporal dynamics in EEG signals26,27. For instance, Kumar et al.28 proposed an LSTM model combined with FBCSP features and an SVM classifier, while Luo and Chao26 utilized FBCSP features as inputs to a Gated Recurrent Unit (GRU) model, which showed superior performance over LSTM. To overcome the limitations of individual models, researchers have attempted to combine different deep learning models; for example, CNNs have been combined with LSTMs to leverage their respective strengths23. Additionally, TCNs, a variant of CNNs designed for time-series modeling and classification, can exponentially expand the receptive field size while increasing the number of parameters only linearly, thereby avoiding the gradient issues faced by RNNs29.

The emergence of attention mechanisms has further advanced EEG signal decoding. Since Bahdanau et al.30 introduced attention-based models, these mechanisms have been widely applied in various fields, such as Natural Language Processing (NLP) and Computer Vision (CV)30,31. Recent efforts in MI classification have begun to harness the potential of transformer models, yielding promising results24,32.

Despite the impressive capabilities of deep learning, it also faces significant challenges. For instance, RNNs, while adept at capturing temporal dynamics, are difficult to train, computationally costly, and susceptible to gradient vanishing problems. Similarly, CNNs excel in local feature extraction but may struggle with capturing global information. Transformer models, although effective with sequential data, often require large datasets to converge, posing a limitation with the typically scarce EEG data. In the realm of motor imagery (MI) classification, these challenges are compounded by the limited availability of publicly accessible EEG MI datasets, leading to overfitting in models with extensive parameter spaces2,33. Unlike more mature fields like computer vision and natural language processing, where deep learning has benefited from abundant data, EEG data presents unique hurdles such as high variability, low signal-to-noise ratio, and non-stationarity, complicating model training and generalization34. While transfer learning offers potential, the distinct characteristics of EEG signals necessitate customized approaches35,36. Thus, there is a pressing need for deep learning models specifically optimized for EEG data, alongside further research to improve data understanding and model robustness in this emerging field.

Our proposed model amalgamates the contextual processing prowess of transformers with the nuanced temporal dynamics captured by temporal convolutional networks (TCNs). This amalgamation is meticulously engineered to discern both the global and local dependencies that are characteristic of EEG signals. In our pursuit, we have also integrated cutting-edge developments from transformer architectures to bolster our model’s efficacy. Our methodology represents a concerted effort to refine the interplay between transformers and TCNs, with the objective of bolstering the robustness and precision of EEG signal classification in a systematic and empirical fashion.

Our contribution: In this paper, we introduce EEGEncoder, a novel model for EEG-based MI classification that effectively combines the temporal dynamics captured by TCNs with the advanced attention mechanisms of Transformers. This integration is further augmented by incorporating recent technical enhancements in Transformer architectures. Moreover, we have developed a new parallel structure within EEGEncoder to bolster its robustness. Our work aims to provide a robust and efficient tool to the MI classification community, thereby facilitating progress in brain-computer interface technology. Notably, our model has demonstrated outstanding performance on the BCI Competition IV dataset 2a37, highlighting its potential and effectiveness in real-world applications.

Methods

The input to the EEGEncoder model consists of segmented EEG data recorded during motor imagery tasks. These segments are preprocessed through the Downsampling Projector, which employs multiple layers of convolution to reduce the dimensionality and noise of the input signals. The processed signals are then fed into the DSTS blocks for feature extraction.

The output of the model is a classification of the EEG segments into one of several categories, which correspond to the intended movements as labeled in the training dataset. The number of categories is determined by the specific dataset used. For instance, in the BCIC IV 2a dataset, there are four categories: left hand, right hand, feet, and tongue.

The proposed EEGEncoder model, as depicted in Fig. 1, is designed to classify motor imagery (MI) EEG signals into specific movement categories. The architecture of EEGEncoder primarily consists of a Downsampling Projector and multiple parallel Dual-Stream Temporal-Spatial (DSTS) blocks. Each DSTS block integrates Temporal Convolutional Networks (TCN) and stable transformers to capture both temporal and spatial features of EEG signals. To enhance the model’s robustness, dropout layers are introduced before each parallel DSTS branch. The following sections provide a detailed description of the structure and function of each module.

Fig. 1

Architecture of the EEGEncoder. The figure illustrates the data processing pipeline within the EEGEncoder, highlighting the novel application of parallel dropout layers to enrich the diversity of the hidden state representations.

Downsampling projector for EEG signal preprocessing

The Downsampling projector module within our EEG-based deep learning framework is designed to preprocess Motor Imagery EEG data, preparing it for analysis by the subsequent Transformer and Temporal Convolutional Network (TCN) layers. This module reshapes high-dimensional EEG sequences, characterized by 1125 time samples across 22 channels, into a format that is conducive to convolutional processing. The main purpose of this stage is to shorten the sequence by passing the continuous EEG signals through simple convolutional and average pooling layers.

Treating the EEG data as analogous to an image with dimensions (1125, 22, 1), our approach applies convolutional layers to extract spatial-temporal features while concurrently mitigating noise and reducing inter-channel latency effects.

The core architecture of the Downsampling projector, as illustrated in Fig. 2, comprises three convolutional layers. The first convolutional layer is designed to initiate the feature extraction process without the application of an activation function. In contrast, the second and third convolutional layers are each followed by a batch normalization (BN) layer and an exponential linear unit (ELU) activation layer to stabilize the learning process and introduce non-linear dynamics into the model. The ELU38 activation function is defined as:

$$\begin{aligned} ELU(x) = {\left\{ \begin{array}{ll} x & \text {if } x > 0 \\ \alpha (e^x - 1) & \text {if } x \le 0 \end{array}\right. } \end{aligned}$$
(1)

where \(\alpha\) is a hyperparameter that defines the value to which an ELU saturates for negative net inputs.

Fig. 2

Architecture of the Downsampling projector. The figure provides a detailed schematic of the Downsampling projector’s architecture. It includes three convolutional layers, with the second and third layers each followed by a batch normalization (BN) layer and an ELU activation layer. Additionally, two average pooling layers and two dropout layers are incorporated to foster model generalization. Specific parameters, such as the kernel size and stride for the convolutional layers, and the kernel size for the average pooling layers, are also depicted. For example, "Conv 1x16, (64,1)" signifies a convolutional layer transitioning from an input channel depth of 1 to an output channel depth of 16, with a stride of 64 along the width and 1 along the height of the input feature map.

The second convolutional layer employs filters of size (1, 22) to compress the channel dimension, effectively encoding channel-wise information into a singular spatial dimension. This strategic choice is informed by the understanding that variations among EEG channels are generally subtle and often predominantly due to noise.

Following this, average pooling layers with a stride of 7 are applied to reduce the temporal dimension. Interspersed dropout layers serve to promote regularization.
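As a rough illustration of how these pieces fit together, the following PyTorch-style sketch assembles a projector with the structure described above; the kernel sizes, filter counts, and dropout rate are illustrative placeholders rather than the exact values shown in Fig. 2.

```python
import torch
import torch.nn as nn

class DownsamplingProjector(nn.Module):
    """Sketch of the Downsampling projector described above.

    Input: EEG trials shaped (batch, 1, time=1125, channels=22), treated like a
    single-channel image. Kernel sizes, filter counts, and the dropout rate are
    illustrative placeholders, not the exact published values.
    """
    def __init__(self, n_filters=16, dropout=0.3):
        super().__init__()
        # First convolution: temporal feature extraction, no activation.
        self.conv1 = nn.Conv2d(1, n_filters, kernel_size=(64, 1), padding=(32, 0))
        # Second convolution: a (1, 22) kernel collapses the electrode dimension.
        self.conv2 = nn.Conv2d(n_filters, n_filters, kernel_size=(1, 22))
        self.bn2 = nn.BatchNorm2d(n_filters)
        # Third convolution: further temporal refinement.
        self.conv3 = nn.Conv2d(n_filters, n_filters, kernel_size=(16, 1), padding=(8, 0))
        self.bn3 = nn.BatchNorm2d(n_filters)
        self.elu = nn.ELU()
        # Average pooling with stride 7 shrinks the temporal axis; dropout regularizes.
        self.pool = nn.AvgPool2d(kernel_size=(7, 1), stride=(7, 1))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                      # x: (B, 1, 1125, 22)
        x = self.conv1(x)                      # no activation after the first conv
        x = self.elu(self.bn2(self.conv2(x)))  # electrode dimension collapsed to 1
        x = self.drop(self.pool(x))
        x = self.elu(self.bn3(self.conv3(x)))
        x = self.drop(self.pool(x))
        # Flatten to a (batch, sequence, feature) layout for the DSTS blocks.
        return x.squeeze(-1).transpose(1, 2)   # (B, T', n_filters)
```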

Stabilizing the transformer layer

In the subsequent modules, we employ a modified Transformer layer39, which has been adapted with recent technological advancements to enhance training stability and model efficacy. Here, we detail the specific alterations applied to the Transformer architecture.

Pre-normalization is a widely adopted strategy in deep learning, particularly for large-scale natural language processing (NLP) models like the Transformer. It is instrumental in stabilizing the training of very deep networks by addressing the vanishing and exploding gradient issues.

Unlike the standard Transformer architecture, where each sub-layer (such as self-attention and feed-forward layers) is succeeded by a residual connection and layer normalization (post-normalization), pre-normalization involves applying LayerNorm before each sub-layer.

Below is the simplified pseudocode for a Transformer block utilizing pre-normalization:

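A minimal PyTorch rendering of such a pre-normalized block is given below; the module names and the GELU feed-forward are illustrative choices, and the stable Transformer used in EEGEncoder replaces LayerNorm and the feed-forward activation with RMSNorm and SwiGLU, as described next.

```python
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """Illustrative pre-normalized Transformer block (module names are placeholders)."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize *before* each sub-layer, then add the residual.
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.drop(h)
        h = self.ff(self.norm2(x))
        return x + self.drop(h)
```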

The advantages of pre-normalization are manifold:

  • Enhanced Gradient Flow: By normalizing inputs prior to each layer, we mitigate the risk of gradient vanishing or exploding during backpropagation, thus enabling the training of deeper architectures.

  • Stable Training Dynamics: Normalization ensures a consistent distribution of inputs across layers, fostering stability throughout the training phase.

  • Quicker Convergence: Pre-normalization has been associated with faster convergence rates in training models.

Our approach also incorporates RMSNorm, or Root Mean Square Layer Normalization40, as the normalization function. RMSNorm diverges from traditional Layer Normalization by rescaling activations without re-centering them; it regulates only their scale rather than both mean and variance. It achieves this by dividing the activations by their root mean square, which maintains gradient scale and facilitates the training of deep networks.

$$\begin{aligned} \text {RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{N}\sum _{i=1}^{N}x_i^2 + \epsilon }} \end{aligned}$$
(2)

Here, x represents the layer input, N is the input’s dimensionality, and \(\epsilon\) is a small constant to prevent division by zero. This equation computes the RMS of the input and normalizes the input by that value.
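A compact implementation of Eq. (2) might look as follows; the learnable per-feature gain is a common addition in RMSNorm implementations and is not part of the equation itself.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (Eq. 2) with a learnable gain."""
    def __init__(self, dim, eps=1e-8):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # per-feature scale (an extra, common addition)

    def forward(self, x):
        # Normalize by the RMS over the last dimension; no mean subtraction.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```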

The key benefits of RMSNorm include:

  • Reduced Computational Burden: RMSNorm obviates the need to compute the mean, thereby reducing computational demands relative to Layer Normalization.

  • Stable Training: By normalizing activation scales, RMSNorm aids in gradient flow, enhancing the overall stability of the training regimen.

  • Compatibility with Deep Networks: RMSNorm is particularly advantageous for deep networks, where it helps avert the typical gradient issues associated with such architectures.

To further enhance our model, we have replaced the typically used ReLU activation with the Swish Gated Linear Unit (SwiGLU)41. SwiGLU is defined as the componentwise product of two linear transformations of the input, one of which is Swish-activated:

$$\begin{aligned} \text {SwiGLU}(x,W,V,b,c) = Swish_\beta (xW + b) \odot (xV + c) \end{aligned}$$
(3)

In the equation, \(x\) is the input, \(W\) and \(V\) are weight matrices, and \(b\) and \(c\) are bias vectors; Swish42 is defined as \(x \cdot \sigma (\beta x)\), where \(\sigma (z) = (1 + \exp (-z))^{-1}\) is the sigmoid function and \(\beta\) is either a constant or a trainable parameter. SwiGLU’s principal advantages are listed below, with a minimal implementation sketch following the list:

  • Computational Efficiency: The gating mechanism of SwiGLU is notably efficient.

  • Augmented Model Capacity: It empowers the model to encapsulate more complex functionalities.

  • Performance Enhancement: SwiGLU typically boosts model performance across a range of tasks.
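The sketch below implements Eq. (3) with \(\beta\) fixed to 1 (the SiLU case); the hidden width is an assumed hyperparameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU unit (Eq. 3); beta is fixed to 1 here for simplicity."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden)   # Swish-gated branch (xW + b)
        self.v = nn.Linear(d_model, d_hidden)   # linear branch (xV + c)

    def forward(self, x):
        # Swish(z) = z * sigmoid(z); F.silu is the beta = 1 case.
        return F.silu(self.w(x)) * self.v(x)
```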

Dual-stream temporal-spatial block

The Dual-Stream Temporal-Spatial Block (DSTS Block), as shown in Fig. 3, presents an architecture specifically designed for the analysis of electroencephalogram (EEG) data during Motor Imagery (MI) tasks. This architecture integrates Temporal Convolutional Networks (TCNs) with stable Transformer modules, capitalizing on their complementary strengths to capture the temporal and spatial characteristics inherent in EEG signals.

Fig. 3

Architecture of the DSTS Block. The DSTS Block integrates a TCN for local temporal feature extraction with a self-attention block for global spatial context analysis, enabling a detailed examination of EEG signals for MI classification tasks.

TCNs utilize causal convolutions to process time-series data, effectively capturing temporal features with a high level of detail. The convolutional approach simplifies training and enhances feature extraction, which is particularly advantageous when dealing with the noisy and redundant nature of EEG data. However, the local focus of TCNs may result in insufficient representation of global dependencies, a notable challenge when analyzing extensive EEG sequences.
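To make the causal-convolution idea concrete, the following sketch shows a single causal, dilated 1-D convolution of the kind TCNs stack; the channel count, kernel size, and number of layers are illustrative.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """One causal, dilated 1-D convolution: the output at time t sees only inputs up to t."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # Pad only on the left so no future samples leak into the output.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially while the parameter count grows only linearly.
tcn = nn.Sequential(*[CausalConv1d(16, dilation=2 ** i) for i in range(4)])
```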

In contrast, Transformers employ a global self-attention mechanism that allows for the integration of contextual information across entire sequences. This capability enables the Transformer to perceive the broader context within the data, addressing a limitation of TCNs. Nonetheless, training Transformers can be complex, especially initially, and their performance may be less than optimal with the inherently noisy and complex EEG data.

The DSTS Block is engineered to leverage the TCN’s proficiency in local feature extraction and the Transformer’s capacity for global context comprehension, thus aiming to provide a comprehensive analysis of EEG data. We also adopt the relative position representations as proposed by Shaw et al. in their seminal work43. This dual-stream approach is anticipated to improve the model’s ability to identify patterns relevant to MI tasks by enhancing its analytical complexity.

EEG data is processed through two distinct yet parallel pathways within the DSTS Block:

  • The TCN pathway focuses on extracting local temporal features (\(H_{\text {temporal}}\)), utilizing causal convolutions to prioritize recent inputs and maintain temporal continuity.

  • The Transformer pathway is dedicated to identifying global spatial relationships (\(H_{\text {spatial}}\)), applying self-attention to consider inputs across the full sequence for a holistic spatial analysis.

To preserve the temporal sequence of EEG signals, a causal mask is integrated into the stable Transformer, ensuring information flow remains unidirectional. This approach is essential for maintaining the sequence’s integrity, as it guarantees that predictions are based solely on past and present data:

$$\begin{aligned}&H_{\text {temporal}}^{\prime } = \text {TCN}(H_{\text {temporal}})[:, -1, :] \end{aligned}$$
(4)
$$\begin{aligned}&H_{\text {spatial}}^{\prime } = \text {StableTransformer}(H_{\text {spatial}}, \text {mask}=\text {causal})[:, -1, :] \end{aligned}$$
(5)

The variables \(H_{\text {temporal}}^{\prime }\) and \(H_{\text {spatial}}^{\prime }\) denote the final hidden states from the TCN and stable Transformer pathways, respectively, extracted from the last element in the sequence dimension. This selection strategy captures the accumulated temporal and spatial information up to the current moment.

These final hidden states are then integrated to create a composite feature representation, which is processed by a multi-layer perceptron (MLP) for the classification task:

$$\begin{aligned}&H_{\text {integrated}}^{\prime } = H_{\text {temporal}}^{\prime } + H_{\text {spatial}}^{\prime } \end{aligned}$$
(6)
$$\begin{aligned}&H_{\text {output}} = \text {MLP}(H_{\text {integrated}}^{\prime }) \end{aligned}$$
(7)

The integration of TCN and Transformer pathways within the DSTS Block is designed to balance their respective strengths and limitations, enhancing the robustness and precision of BCI applications.
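The sketch below ties Eqs. (4)-(7) together in PyTorch form; the tcn and stable_transformer arguments stand in for the pathway modules described above, and the MLP width is an assumed choice.

```python
import torch
import torch.nn as nn

class DSTSBlock(nn.Module):
    """Sketch of the dual-stream combination in Eqs. (4)-(7).

    `tcn` and `stable_transformer` are placeholders for the pathway modules;
    both are assumed to map (batch, time, features) to the same shape.
    """
    def __init__(self, tcn, stable_transformer, d_model, n_classes):
        super().__init__()
        self.tcn = tcn
        self.transformer = stable_transformer
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ELU(),
                                 nn.Linear(d_model, n_classes))

    @staticmethod
    def causal_mask(t, device):
        # True above the diagonal marks positions to be masked, so each time
        # step attends only to itself and earlier steps.
        return torch.triu(torch.ones(t, t, dtype=torch.bool, device=device), diagonal=1)

    def forward(self, h):                                  # h: (B, T, d_model)
        h_temporal = self.tcn(h)[:, -1, :]                 # Eq. (4)
        mask = self.causal_mask(h.size(1), h.device)
        h_spatial = self.transformer(h, mask)[:, -1, :]    # Eq. (5)
        h_integrated = h_temporal + h_spatial              # Eq. (6)
        return self.mlp(h_integrated)                      # Eq. (7)
```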

EEG signal classification with EEGEncoder

The EEGEncoder architecture represents a novel approach to the classification of electroencephalogram (EEG) signals. Traditional methodologies in this domain have frequently employed moving window techniques to extract temporal features from EEG data. These methods involve slicing the EEG sequence into overlapping temporal windows, which are then fed into the model to capture the dynamic aspects of the signal.

However, our architecture departs from this convention by harnessing the Transformer’s intrinsic capability to contextualize data across the entire sequence. We postulate that this feature of the Transformer reduces the dependency on moving window slicing, thereby preserving the continuity and integrity of the temporal sequence.

To introduce variability and enhance the robustness of the model, we incorporate multiple parallel dropout layers. These layers independently introduce perturbations to the hidden states of the EEG sequence, a strategy designed to improve the model’s performance by simulating a form of ensemble learning within the architecture itself.

After extensive experimentation and comparative analysis, we have optimized the EEGEncoder by configuring the DSTS block with a stable transformer consisting of four layers and two attention heads. Additionally, we have integrated five parallel branches, each comprising a dropout layer followed by a DSTS block. This configuration was determined to strike an optimal balance between model complexity and performance, leading to improvements in classification accuracy and generalizability.
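Putting the pieces together, the following sketch shows one way the parallel dropout + DSTS branches could be arranged; how the branch outputs are merged is not detailed in the text, so the sketch simply averages them, and the projector and DSTS constructors are placeholders for the modules described earlier.

```python
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    """Top-level sketch: projector followed by parallel dropout + DSTS branches.

    Averaging the branch outputs is an assumption made for illustration;
    `projector` and `make_dsts_block` stand in for the modules sketched above.
    """
    def __init__(self, projector, make_dsts_block, n_branches=5, dropout=0.3):
        super().__init__()
        self.projector = projector
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Dropout(dropout), make_dsts_block())
            for _ in range(n_branches)
        )

    def forward(self, x):
        h = self.projector(x)                        # (B, T', d_model)
        # Each branch perturbs the shared hidden states independently via dropout.
        logits = [branch(h) for branch in self.branches]
        return torch.stack(logits, dim=0).mean(dim=0)
```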

Results

In this section, we provide a detailed evaluation of the EEGEncoder model, demonstrating its classification capabilities on the BCI Competition IV 2a dataset37. We compare the performance of our model with various established models to underscore its effectiveness in decoding the complex patterns inherent in EEG signals for motor imagery tasks. The subsequent subsections elaborate on the model’s performance metrics, a comparative analysis with other models, and discuss the significance of these results for the progression of brain-computer interface technologies.

Dataset

In our study, we primarily utilized the BCI Competition IV dataset 2a (BCI-2a) for training and evaluating the EEGEncoder model. The BCI-2a dataset comprises recordings from nine healthy subjects, each performing four different motor imagery (MI) tasks: left hand (class 1), right hand (class 2), feet (class 3), and tongue (class 4) movements.

Each subject participated in two sessions recorded on different days. Each session consisted of six runs, with each run containing 48 trials (12 trials per MI task), resulting in a total of 288 trials per session. The EEG signals were recorded using 22 Ag/AgCl electrodes at a sampling rate of 250 Hz. The signals were bandpass filtered between 0.5 Hz and 100 Hz, and a 50 Hz notch filter was applied to reduce power line interference.

At the beginning of each session, a recording of approximately 5 minutes was performed to estimate the EOG influence, divided into three blocks: two minutes with eyes open, one minute with eyes closed, and one minute with eye movements. Due to technical problems, the EOG block for subject A04T contains only the eye movement condition.

During the experiments, subjects were seated in a comfortable armchair in front of a computer screen. Each trial began with a fixation cross appearing on a black screen, accompanied by a short acoustic warning tone. Two seconds later, a cue in the form of an arrow (pointing left, right, down, or up) appeared for 1.25 seconds, prompting the subject to perform the corresponding motor imagery task until the fixation cross disappeared at six seconds.

For our research, one session was used for model training, while the other was reserved for evaluation testing. The raw MI EEG signals from all bands and channels were fed into the model in the form of a \(C \times T\) two-dimensional matrix. Minimal preprocessing was applied to the raw data, employing a standard scaler to normalize the signals to have zero mean and unit variance.
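The text does not specify whether the scaler is fit per channel or over whole trials; the sketch below standardizes each channel using statistics estimated on the training session only, which is one common choice.

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling per EEG channel.

    X_*: arrays shaped (trials, channels, time). Statistics are estimated on
    the training session only and reused for the evaluation session.
    """
    mean = X_train.mean(axis=(0, 2), keepdims=True)   # one mean per channel
    std = X_train.std(axis=(0, 2), keepdims=True) + 1e-8
    return (X_train - mean) / std, (X_test - mean) / std
```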

Our research concentrates on the BCI-2a dataset due to its increased complexity and the greater challenge it presents, which better demonstrates the performance capabilities of our model. The dataset is well-documented and has been extensively used in the BCI community, ensuring its reliability and relevance for evaluating EEG classification models.


Performance metrics

To evaluate the performance of the EEGEncoder, we employ three key metrics: accuracy, Cohen’s kappa, and ITR. These metrics provide a comprehensive assessment of the model’s classification capabilities.

Accuracy (Acc) is calculated as follows:

$$\begin{aligned} \text {Acc} = \frac{\sum ^n_{i=1} \frac{TP_i}{I_i}}{n} \end{aligned}$$
(8)

where n is the number of categories, \(TP_i\) represents the true positive count for class i, and \(I_i\) is the total number of samples in class i. Accuracy measures the proportion of correctly classified samples, offering a straightforward evaluation of the model’s overall performance.

Cohen’s kappa (Kappa) is computed using the formula:

$$\begin{aligned} \text {Kappa} = \frac{P_a - P_e}{1 - P_e} \end{aligned}$$
(9)

where \(P_a\) denotes the actual percentage of agreement, and \(P_e\) represents the expected percentage of agreement by chance. Kappa is particularly important for this task as it adjusts for chance agreement, providing a more reliable measure of the model’s performance, especially in scenarios with imbalanced class distributions.

Information Transfer Rate (ITR) is another crucial metric in the field of BCI, as it quantifies the speed and efficiency of information transmission from the brain to the computer. ITR is calculated using the formula:

$$\begin{aligned} \text {ITR} = \frac{60}{T} \left[ \log _2 N + P \log _2 P + (1 - P) \log _2 \left( \frac{1 - P}{N - 1} \right) \right] \end{aligned}$$
(10)

where T is the average time in seconds per trial, N is the number of possible targets or classes, and P is the classification accuracy. ITR is measured in bits per minute (bits/min) and provides an insight into how efficiently the system can convert brain signals into actionable commands.
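For reference, Eq. (10) can be evaluated directly; the four-second trial duration in the example below is an assumed value, not one taken from the dataset description.

```python
import math

def itr_bits_per_min(p, n_classes=4, trial_seconds=4.0):
    """Information transfer rate from Eq. (10).

    p: classification accuracy in (1/N, 1]; trial_seconds is an assumed
    average trial duration, not a value quoted in the text.
    """
    if p >= 1.0:                       # the log terms vanish at perfect accuracy
        bits_per_trial = math.log2(n_classes)
    else:
        bits_per_trial = (math.log2(n_classes)
                          + p * math.log2(p)
                          + (1 - p) * math.log2((1 - p) / (n_classes - 1)))
    return (60.0 / trial_seconds) * bits_per_trial

# Example: 4 classes, 80% accuracy, 4-second trials -> about 14.4 bits/min.
print(round(itr_bits_per_min(0.80), 1))
```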

Accuracy and Cohen’s kappa are standard metrics for evaluating the performance of classification tasks. Accuracy provides a direct measure of the model’s ability to correctly classify EEG segments and is typically expressed as a percentage. However, in datasets with imbalanced classes, relying solely on accuracy may not be sufficient. Cohen’s kappa addresses this issue by accounting for the possibility of chance agreement, offering a more reliable evaluation metric. Kappa is reported as a decimal, where 0 indicates chance-level agreement and 1 indicates perfect agreement.

ITR is particularly important in the BCI domain as it not only considers the accuracy but also the speed of communication, making it a vital metric for practical applications where both precision and efficiency are critical. By incorporating ITR, we ensure that our evaluation captures the real-world usability of the EEGEncoder in BCI applications.

This dual evaluation approach, now augmented with ITR, ensures a comprehensive assessment of the model’s effectiveness, reliability, and efficiency in classifying MI-EEG signals. Additionally, these metrics are commonly used in similar studies, making them suitable for our evaluation.

Training configuration

The model is trained with a specific set of parameters, as outlined in Table 1.

Table 1 EEGEncoder Training Configurations.

The CrossEntropyLoss function is employed with label smoothing, which is set to a value of 0.1 to soften the target distributions, potentially improving the generalization of the model. To further regularize the training process and prevent overfitting, a dropout ratio of 0.3 is applied across the network, and weight decay with a coefficient of 0.5 is applied to all MLP layers.
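A sketch of this configuration in PyTorch is shown below; the optimizer type and learning rate are illustrative assumptions, since they come from Table 1 rather than the surrounding text, and the parameter-name filter used to identify the MLP layers is likewise an assumption.

```python
import torch
import torch.nn as nn

def build_training_objects(model: nn.Module):
    """Loss and optimizer matching the configuration quoted above.

    Only the label smoothing (0.1) and the MLP weight decay (0.5) come from
    the text; the AdamW optimizer and 1e-3 learning rate are placeholders.
    """
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    # Apply weight decay of 0.5 to MLP parameters only (identified by name here).
    mlp_params = [p for n, p in model.named_parameters() if "mlp" in n]
    other_params = [p for n, p in model.named_parameters() if "mlp" not in n]
    optimizer = torch.optim.AdamW(
        [{"params": mlp_params, "weight_decay": 0.5},
         {"params": other_params, "weight_decay": 0.0}],
        lr=1e-3,
    )
    return criterion, optimizer
```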

Results on BCI IV 2a dataset

The EEGEncoder model underwent a comprehensive evaluation using the BCI Competition IV dataset 2a. Performance was assessed across three key metrics: accuracy, Cohen’s kappa, and ITR. We compared EEGEncoder with four state-of-the-art models: ATCNet, TCNetFusion, EEGTCNet, and D-ATCNet. To ensure the robustness and reliability of the results, we conducted experiments with five distinct random seeds. Each model was trained and tested under identical experimental settings, and the average results from these five iterations were reported.

It is important to note that while we reimplemented ATCNet, TCNetFusion, and EEGTCNet using the official implementations provided by Altaheri et al.32, the source code for D-ATCNet was not available. Consequently, we used the average performance metrics reported in the D-ATCNet paper as a basis for comparison. Due to this limitation, we did not compute ITR for D-ATCNet. The classification results for all nine subjects, along with detailed comparisons across models, are presented in Table 2.

In terms of accuracy and Cohen’s kappa metrics, the EEGEncoder outperformed the comparative models in eight out of nine subjects, with the exception of subject 4. The model demonstrated particularly significant performance improvements in subjects 2 and 5, highlighting its enhanced ability to manage the EEG signal variations observed in these individuals.

In terms of ITR comparison, the EEGEncoder model outperforms other models. ITR primarily evaluates the model’s prediction accuracy and speed, and we attribute EEGEncoder’s advantages to three main factors. First, the architecture of EEGEncoder significantly compresses sequential information at its base, allowing for four layers of transformers without introducing excessive complexity. Second, the optimizations in PyTorch accelerate the computation of attention mechanisms, enhancing overall model efficiency. Finally, the use of Stable Transformers, which are faster and consume less memory compared to traditional transformers, contributes to quicker predictions. Consequently, EEGEncoder demonstrates superior speed alongside its already impressive accuracy when compared to other models.

Table 2 Classification Performance for BCIC IV 2a subjects 1-9. Comparison of EEGEncoder, ATCNet, TCNetFusion, EEGTCNet, and D-ATCNet Models in Terms of Accuracy and Kappa Coefficient.

The results indicate that the EEGEncoder not only excels in overall performance but also shows resilience in subjects where other models tend to falter. This resilience could be attributed to the model’s architecture, which may be more adept at capturing the nuances of EEG signals across diverse cognitive tasks. However, further studies are warranted to confirm these findings and to explore the full potential of EEGEncoder in real-world BCI applications.

Ablation study

To validate the efficacy of the various enhancements applied to the EEGEncoder, we conducted a series of ablation experiments. We began by consolidating the data from all nine subjects, merging their respective training and testing sets into single, comprehensive datasets. This approach allowed us to more effectively evaluate the generalizability of the model’s improvements across different subjects.

Here, we present a selection of key experiments that were instrumental in assessing the impact of specific modifications. These experiments included removing the transformer component from the DSTS block, using five shifted windows instead of five dropout branches, varying the number of transformer layers, adjusting the quantity of DSTS branches within the EEGEncoder, and comparing the performance of our modified stable transformer against the Vanilla Transformer. To ensure the statistical significance of our results, we averaged the outcomes across five iterations, each initialized with a different random seed. The summarized results are displayed in Table 3.

Table 3 Performance Comparison of EEGEncoder With and Without Various Improvements.

The data in Table 3 illustrates the impact of each modification on the EEGEncoder’s performance. The removal of the transformer component led to a noticeable decrease in both accuracy and Cohen’s kappa, underscoring its contribution to the model’s effectiveness. Adjusting the number of transformer layers showed that a balance is needed to optimize performance, as evidenced by the slight decrease in accuracy with eight layers compared to two. Similarly, the number of DSTS branches was found to be a factor, with a single branch reducing performance and ten branches not improving it significantly. Lastly, the comparison between our stable transformer and the Vanilla Transformer variant indicates the importance of our modifications for achieving higher accuracy and Cohen’s kappa.

Discussion

In our research, we have innovatively designed a model based on Temporal Convolutional Networks (TCN) and Transformers, specifically optimized for the classification of Motor Imagery (MI) signals derived from electroencephalograms (EEG). Our model introduces the DSTS block, a novel component that enhances the extraction of both local and global information from EEG data. By incorporating the Stable Transformer, we have stabilized the training process of the Transformer and reduced computational complexity. Furthermore, we have replaced the commonly used window shift technique with parallel multi-branch dropout+DSTS, which adds robustness and diversity to the feature extraction process.

Fig. 4

Confusion matrices of the classification results for four models: EEGEncoder, ATCNet, TCNetFusion, and EEGTCNet. These matrices represent the average performance across nine subjects. The EEGEncoder model demonstrates superior accuracy in the ’Foot’ category compared to the other models, while the ’Tongue’ category shows relatively lower performance.

The empirical evaluation of our proposed model has yielded promising results on the BCI Competition IV dataset 2a, where we achieved commendable performance without the need for complex preprocessing, relying only on a simple standard scaler. Looking ahead, our goal is to extend the training and validation of our model across a more diverse and extensive range of datasets. We aim to incorporate cutting-edge deep learning techniques, such as pre-training, to enhance the model’s complexity and effectiveness. Ultimately, we aspire to achieve superior performance in MI classification tasks across a broader spectrum of categories, supported by larger datasets and more sophisticated model architectures.

Additionally, we computed the confusion matrices for the classification results of the four models, as shown in Fig. 4. These confusion matrices were derived by averaging the performance of the models across the nine subjects. From Fig. 4, it is evident that, compared to the other three models (ATCNet, TCNetFusion, EEGTCNet), our EEGEncoder significantly outperforms them in the ’Foot’ category. This superior performance in the ’Foot’ category is a key factor contributing to the overall better performance of the EEGEncoder.

However, the ’Tongue’ category exhibits the lowest performance among all categories. This lower performance might be due to the fact that the other three actions involve larger motor imagery movements of major body parts, whereas ’Tongue’ involves relatively finer and smaller movements. If more data of similar but distinct categories were available for training, it could potentially enhance the classification accuracy for the ’Tongue’ category significantly.

Moreover, the confusion matrices reveal that for the ’Right hand’ and ’Left hand’ categories, the EEGEncoder is not the best performing model among the four. Nevertheless, when considering the average performance across all four categories, the EEGEncoder exhibits the best overall performance. Additionally, the performance differences between categories are the smallest for EEGEncoder, which can be attributed to our DSTS module’s ability to extract global information and utilize multiple branches to enhance model robustness.