Introduction

Brain-computer interface (BCI) technology opens the door to direct communication and control of the biological brain with peripheral devices1. Electroencephalogram (EEG) acquired on the scalp is the most commonly employed neurophysiological signal for constructing a BCI system owing to its low cost and high time resolution. So far, EEG-based BCIs have been widely applied in many fields such as neural rehabilitation, control of assistive technologies, entertainment and intraoperative awareness detection2,3.

The choice of an experimental paradigm is crucial in the study of BCI systems. Motor imagery (MI)4,5, steady-state visual evoked potential (SSVEP)6,7 and event-related potentials8,9 are widely studied and applied paradigms. In particular, SSVEP-based BCI (SSVEP-BCI) has become a hot spot in current research due to its high information transfer rate (ITR) and short training time1. SSVEP is the brain signal evoked by a repetitive visual stimulus and its maximal amplitude is located over the occipital region. SSVEP comprises the fundamental and harmonic signals of a stimulus frequency. In general, an SSVEP-BCI system includes multiple stimuli. When a user gazes at different stimuli, he/she can send varying instructions for controlling an electronic device.

EEG decoding is one core issue in EEG-based BCI systems. Due to the low signal-to-noise ratio (SNR) and high inter-trial variability of EEG signals, accurately decoding them and recognizing the stimulus frequency poses a huge challenge, limiting the real-world applications of SSVEP-BCIs. EEG signals can be decoded by either a traditional machine learning (ML) algorithm or a deep learning (DL) network. According to the requirement for training data, the decoding methods are divided into training-free and training-based ones. Compared to the latter, the former requires a long data segment to achieve an acceptable classification accuracy, and thus degrades the ITR, which is a major performance metric of an SSVEP-BCI system. For example, based on two typical datasets, Benchmark and BETA32,33, the training-based system achieved the highest ITRs of 250 bits/min and 186.76 bits/min respectively, whereas the highest ITRs of the training-free system were only 210.81 bits/min and 129.41 bits/min respectively29. Therefore, the existing BCI studies focus mainly on the training-based method. Traditional ML algorithms aim to boost the SNR of multi-channel EEG signals through spatial filtering, so the emphasis is placed on the optimization of spatial filters using training data. So far, several typical spatial filtering algorithms have been proposed for SSVEP-BCIs, such as minimum energy combination10, canonical correlation analysis (CCA) and its variants11,12,13,14,15,16 and task-related component analysis (TRCA)17. Among them, TRCA achieved better performance than the other algorithms and has become a benchmark method for SSVEP classification. TRCA suppresses noise in SSVEP by maximizing the inter-trial covariance. By incorporating filter bank analysis15 and ensemble spatial filter17, the performance of TRCA can be further augmented.

Recently, deep learning (DL) techniques provide new solutions for the BCI classification task18,19,20,21,22,23,24,25,26,27,28,29. DL can automatically extract features applied to classification from original signals and improve classification accuracy. Feature extraction and pattern recognition are combined in the one frame to avoid information loss caused by different objective functions of the two stages. Lawhern et al.18 proposed a compact CNN named EEGNet for EEG-based BCIs and achieved great success in four BCI paradigms. They employed depthwise and separable convolutions to create an EEG-based specific network that packages the concept of feature extraction including spatial and temporal filtering. Subsequently, Waytowich et al.19 revised the EEGNet and applied it to an asynchronous SSVEP-BCI system. Guney et al.20 proposed a DNN structure (we rename it as sbCNN in the study) for processing sub-band SSVEP signals with convolutions across the sub-bands of harmonics, channels and time and classifying with a fully connected layer. A fine-tuning strategy was used for training the network to boost the intra-subject classification. The sbCNN network achieved the highest-ever ITRs on two benchmark SSVEP data sets. Zhang et al.21 proposed a bidirectional Siamese correlation analysis (bi-SiamCA) method for the detection of SSVEP signals. Two long short-term memory (LSTM)-based parallel subnetworks are used to extract features from the SSVEP signal and template signals and then calculate the similarity between the outputs of the two branches. The experimental results on two SSVEP datasets indicate that the network can significantly improve the classification accuracy compared with the prominent traditional and DL methods especially at short data lengths. Pan et al.22 proposed an efficient SSVEP network (SSVEPNet) for frequency recognition, which is based on one-dimensional convolution and LSTM module. 
Spectral normalization and label smoothing technologies are utilized to enhance network performance. Experimental results on two datasets validated the effectiveness of the network for SSVEP classification. Recently, the attention mechanism23 based on the Transformer model has attracted wide interest in the DL field and has been applied to SSVEP-BCIs24,25. Bagchi and Bathula24 proposed an EEG-ConTransformer network that incorporates multi-head self-attention modules to capture inter-region interaction patterns and convolutional filters to learn temporal patterns. Chen et al.25 proposed a Transformer-based DNN model (SSVEPformer) for SSVEP classification. They adopted the complex spectrum features of SSVEP data as the model input to enable the model to jointly explore the spectral and spatial information. The two models achieved better classification accuracies and ITRs than other baseline methods.

Although both traditional ML methods and DL methods have achieved high decoding accuracy of SSVEP signals, the classification performance of neither can fully meet the needs of practical applications. Both have their own strengths and limitations. The former are able to extract discriminative features based on the expertise of the researchers, but have a weak learning capability; the latter have a stronger ability to automatically represent high-level abstractions and can accurately model complex EEG signals, but require large amounts of labeled data, which are difficult to obtain for BCI research. Therefore, combining these two kinds of methods for decoding SSVEP signals is expected to improve the frequency detection accuracy by making use of their respective advantages.

Recently, it has been considered that single SSVEP classification models often suffer from overfitting or underfitting problems and may have low classification performance26. Yao et al.27 proposed a model called FB-EEGNet for SSVEP frequency detection, which fuses the features of multiple neural networks, resulting in much higher classification accuracy than a single model. Meanwhile, Li et al.28 also proposed a new DNN named Conv-CA that combines a DL model with a traditional ML model, which achieves higher classification accuracy and ITR through the fact that the outputs of the two parallel branches used for signal and reference are ultimately correlated and decoded with each other. This approach of combining DL models with traditional ML models provides new ideas for the processing of EEG signals, demonstrating its feasibility. Deng et al.29 proposed a new algorithm termed TRCA-Net to increase the classification performance of SSVEP signals, which first utilizes TRCA algorithm to create spatial filters for extracting task-related features, then the features from different filters are rearranged as new multi-channel signals and finally the rearranged features are classified with a deep CNN.

Motivated by the above-mentioned three studies, we propose a novel classification framework named eTRCA + sbCNN that combines an ensemble task-related component analysis (eTRCA)17 and a sub-band convolutional neural network (sbCNN)20 for recognizing the frequency of SSVEP signals. In principle, any traditional ML method and any DL method can be selected as the two base methods for the combination. eTRCA and sbCNN are adopted in the study because they are state-of-the-art ML and DL methods respectively. Specifically, based on sub-band filtered EEG data, the eTRCA and the sbCNN are first trained separately with training data, then the trained models are each used for classifying a single-trial testing signal, next the two classification score vectors are fused by an addition operation, and finally the frequency corresponding to the maximal summed score is decided as the stimulus frequency of the SSVEP signal. The contributions of this article are as follows:

(1) In order to fully exploit the knowledge of eTRCA and the learning ability of sbCNN, a parallel model-combining framework, eTRCA + sbCNN, is proposed for detecting the frequencies of SSVEP signals. A feature fusion method is further proposed for enhancing SSVEP classification by summing the two classification score vectors of a testing trial yielded by eTRCA and sbCNN;

(2) The performance of eTRCA + sbCNN is analyzed in depth and validated using two SSVEP datasets containing a total of 105 subjects and compared with that of eTRCA, sbCNN and other state-of-the-art traditional ML and DL methods in terms of classification accuracy and ITR at different lengths of data, numbers of channels and training trials. The experimental results validated the superiority of the method.

Methods

The flowchart of the eTRCA + sbCNN framework is shown in Fig. 1, which includes the components of data preprocessing, model training and testing-trial classification of eTRCA and sbCNN, addition of two classification score vectors, and frequency recognition. We will detail all these components in the following subsections except for the data preprocessing, which will be elaborated in the section Data Acquisition and Preprocessing.

Fig. 1

The algorithmic flowchart of the eTRCA + sbCNN framework.

Sub-band convolutional neural network (sbCNN)

(1) Network structure and parameters. The sbCNN20 is an end-to-end system that receives temporally filtered signals with three sub-bands. The network architecture is shown in Fig. 2a. The network consists mainly of four convolutional layers and one fully connected layer: the first convolutional layer is used for sub-band combining, the second for channel combining, the third and fourth for extracting features, and the fully connected layer predicts the stimulus frequency by selecting the frequency with the highest probability returned by the final softmax function. This sbCNN has 12 layers in total, whose parameters are shown in Fig. 2b.

(2) Model training and classification. The model training and classification method for sbCNN is shown in Fig. 3. The sbCNN is trained in two stages using a transfer learning method. In the first stage, the training data from all subjects are used for training. The weights of the first layer are initialized with 1, and the weights of the other layers are initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. In addition, dropout probabilities of 0.1, 0.1, and 0.95 are applied between the second and third layers, the third and fourth layers, and the fourth and fifth layers respectively. The network is trained in each iteration based on the training batch \(\left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}_{i = 1}^{{D_{b} }}\), where \(D_{b}\) is the number of trials in the batch, by minimizing the following categorical cross-entropy loss via the Adam optimizer with a learning rate of \(\nu = 0.0001\)

$$\left( {1/D_{b} } \right)\sum\limits_{i = 1}^{{D_{b} }} { - \log \left( {s_{i} (y_{i} )} \right) + \lambda \left| W \right|^{2} }$$
(1)

where \(\lambda = 0.001\) is the constant of the L2 regularization, \(s_{i} \in [0,1]^{{N_{F} \times 1}}\) is the softmax output for the instance \(x_{i}\), \(N_{F}\) is the number of class labels, \(s_{i} (y_{i} )\) is the \(y_{i}\)th entry of \(s_{i}\), and \(W\) denotes the weights of all layers in the sbCNN. The final prediction is made by \(\hat{y}_{i} = \arg \max_{j} s_{i} (j)\).
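As a rough illustration of Eq. (1), the following sketch evaluates the loss for a toy batch in plain Python. It assumes that the L2 penalty is added once per batch rather than once per trial, and the names `batch_loss`, `predict` and the toy weights are hypothetical and not part of the sbCNN implementation.

```python
import math


def batch_loss(softmax_outputs, labels, weights, lam=0.001):
    """Mean cross-entropy over a batch plus an L2 penalty on the weights (Eq. 1).

    Assumption: the penalty lam * |W|^2 is added once per batch.
    """
    d_b = len(labels)
    # -log of the softmax probability assigned to the true class of each trial
    cross_entropy = sum(-math.log(s[y]) for s, y in zip(softmax_outputs, labels)) / d_b
    # squared L2 norm of all (flattened) network weights
    l2_penalty = lam * sum(w * w for w in weights)
    return cross_entropy + l2_penalty


def predict(softmax_output):
    """Index of the largest softmax entry, i.e. the predicted class."""
    return max(range(len(softmax_output)), key=lambda j: softmax_output[j])


# Toy batch: D_b = 2 trials, N_F = 3 classes
s = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
y = [0, 1]
W = [0.5, -0.5]  # hypothetical flattened weights
loss = batch_loss(s, y, W)
predictions = [predict(s_i) for s_i in s]
```

In a real training loop this quantity would be computed by the DL framework and minimized with Adam; the sketch only makes the individual terms of the loss explicit.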

Fig. 2

(a) sbCNN network architecture; (b) Parameters of the sbCNN network.

Fig. 3

Model training and classification method for sbCNN.

In the second stage, only the training data from the testing subject are used for training. The weights and biases of each layer are initialized with the values yielded by the first-stage training. The dropout probabilities between the second and third layers and between the third and fourth layers are changed to 0.6 for the Benchmark dataset and 0.7 for the BETA dataset due to the reduced amount of training data.

For both stages, training stops when the maximal number of epochs is reached, which is 1000 for the Benchmark dataset and 800 for the BETA dataset. The batch size is 100 in the first stage and 200 in the second stage for the Benchmark dataset, and 100 in the first stage and 120 in the second stage for the BETA dataset.
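The hyperparameters of the two training stages can be collected in a small configuration structure, as sketched below. The class name `StageConfig` and the dictionary layout are hypothetical. The sketch also assumes that the learning rate, the L2 constant and the dropout probability between the fourth and fifth layers stay unchanged in the second stage, which the text above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    epochs: int
    batch_size: int
    dropout: tuple  # between layers (2-3, 3-4, 4-5)
    learning_rate: float = 1e-4  # Adam learning rate (assumed equal in both stages)
    l2_lambda: float = 1e-3      # L2 regularization constant


TRAINING = {
    "Benchmark": {
        # stage 1: data from all subjects; stage 2: data from the testing subject only
        "stage1": StageConfig(epochs=1000, batch_size=100, dropout=(0.1, 0.1, 0.95)),
        "stage2": StageConfig(epochs=1000, batch_size=200, dropout=(0.6, 0.6, 0.95)),
    },
    "BETA": {
        "stage1": StageConfig(epochs=800, batch_size=100, dropout=(0.1, 0.1, 0.95)),
        "stage2": StageConfig(epochs=800, batch_size=120, dropout=(0.7, 0.7, 0.95)),
    },
}
```

Freezing the dataclass keeps the stage settings immutable, so the second stage cannot accidentally overwrite the first-stage values.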

After the second-stage training, the subject-specific model \(net\) is obtained, which is employed for classifying the test data. For a single-trial testing signal \(\tilde{X}_{t} ,t = 1,2, \ldots ,N_{t}\), where \(N_{t}\) is the number of testing trials, the classification score vector \(s_{t}^{sbCNN} \in R^{{1 \times N_{F} }}\), where \(N_{F}\) is the number of classes or stimulus frequencies, and the stimulus frequency \(f_{t}^{sbCNN}\) are obtained using the neural network classification function as

$$\left[ {s_{t}^{sbCNN} ,f_{t}^{sbCNN} } \right] = classify(net,\tilde{X}_{t} )$$
(2)

Ensemble task-related component analysis (eTRCA)

(1) eTRCA Algorithm: TRCA16 is one of the most popular algorithms for recognizing SSVEP signals. It is used for creating a spatial filter by maximizing the reproducibility of task-related components. Assume that the individual training data from the nth stimulus and the kth sub-band are denoted as \(x_{n}^{k} \in R^{{N_{C} \times N_{S} \times N_{T} }} ,n = 1,2, \ldots ,N_{F} ,k = 1,2, \ldots ,N_{K}\), where \(N_{C} ,N_{S}\), \(N_{T} ,N_{F} \;{\text{and}}\;N_{K}\) are the numbers of channels, sampling points in a trial, trials for each stimulus, visual stimuli and sub-bands respectively, and let \(x_{n,i}^{k} \in R^{{N_{C} \times N_{S} }}\) denote the ith trial. TRCA aims to optimize a spatial filter for each sub-band and each stimulus by maximizing the sum of inter-trial covariances after projecting the multi-channel signal into a single-channel signal using the spatial filter. Thereby, the objective of this algorithm is to find a spatial filter that maximizes the covariance as follows

$$\begin{aligned} w_{n}^{k} = & \mathop {\arg \max }\limits_{{w_{n}^{k} }} \sum\limits_{\begin{subarray}{l} i,j = 1 \\ i \ne j \end{subarray} }^{{N_{T} }} {{\text{cov}} \left( {\left( {w_{n}^{k} } \right)^{T} x_{n,i}^{k} ,\left( {w_{n}^{k} } \right)^{T} x_{n,j}^{k} } \right)} \\ = & \mathop {\arg \max }\limits_{{w_{n}^{k} }} \left( {w_{n}^{k} } \right)^{T} \left( {\sum\limits_{\begin{subarray}{l} i,j = 1 \\ i \ne j \end{subarray} }^{{N_{T} }} {{\text{cov}} \left( {x_{n,i}^{k} ,x_{n,j}^{k} } \right)} } \right)w_{n}^{k} = \mathop {\arg \max }\limits_{{w_{n}^{k} }} \left( {w_{n}^{k} } \right)^{T} S_{n}^{k} w_{n}^{k} \\ \end{aligned}$$
(3)

where \(S_{n}^{k}\) denotes the sum of cross-covariance matrices between all pairs of trials for the nth stimulus and the kth sub-band. To yield a finite solution, the variance of the spatially filtered signal is constrained to one

$$\sum\limits_{i = 1}^{{N_{T} }} {{\text{Var}}\left( {\left( {w_{n}^{k} } \right)^{T} x_{n,i}^{k} } \right)} = \left( {w_{n}^{k} } \right)^{T} \left( {\sum\limits_{i = 1}^{{N_{T} }} {{\text{Cov}}\left( {x_{n,i}^{k} } \right)} } \right)w_{n}^{k} = \left( {w_{n}^{k} } \right)^{T} Q_{n}^{k} w_{n}^{k} = 1$$
(4)

where \(Q_{n}^{k}\) denotes the sum of self-covariance matrices for the nth stimulus and the kth sub-band. With this constraint, the optimization boils down to a Rayleigh–Ritz eigenvalue decomposition problem and the spatial filter is estimated as

$$\hat{w}_{n}^{k} = \mathop {\arg \max }\limits_{{w_{n}^{k} }} \frac{{\left( {w_{n}^{k} } \right)^{T} S_{n}^{k} w_{n}^{k} }}{{\left( {w_{n}^{k} } \right)^{T} Q_{n}^{k} w_{n}^{k} }}.$$
(5)

The solution of the above equation is the eigenvector of the matrix \(\left( {Q_{n}^{k} } \right)^{ - 1} S_{n}^{k}\) corresponding to the maximum eigenvalue.
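A minimal NumPy sketch of Eqs. (3)-(5) for one stimulus and one sub-band is given below. It assumes trials are stored as an array of shape trials × channels × samples; the function names and the synthetic data are illustrative only. The common factor \(1/N_{S}\) of the covariance estimates is omitted because it cancels in the ratio of Eq. (5).

```python
import numpy as np


def trca_matrices(trials):
    """Return (S, Q) for trials of shape (N_T, N_C, N_S)."""
    # Remove each channel's mean so that products of trials act as covariances
    centered = trials - trials.mean(axis=2, keepdims=True)
    # Q: sum of the per-trial covariance matrices (Eq. 4)
    Q = sum(x @ x.T for x in centered)
    # S: sum of cross-covariances over all pairs i != j (Eq. 3). The sum over
    # all pairs including i == j equals (sum_i x_i)(sum_j x_j)^T, so subtract Q.
    total = centered.sum(axis=0)
    S = total @ total.T - Q
    return S, Q


def trca_filter(trials):
    """Spatial filter maximizing w'Sw / w'Qw (Eq. 5)."""
    S, Q = trca_matrices(trials)
    # Leading eigenvector of Q^{-1} S
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Q, S))
    return np.real(eigvecs[:, np.argmax(np.real(eigvals))])


# Synthetic example: a 10 Hz source mixed into 3 channels, 6 noisy trials
rng = np.random.default_rng(0)
t = np.arange(250) / 250.0
source = np.sin(2 * np.pi * 10 * t)
mixing = np.array([1.0, 0.5, 0.0])
trials = mixing[None, :, None] * source[None, None, :] + 0.5 * rng.standard_normal((6, 3, 250))
w = trca_filter(trials)
```

The returned filter is defined only up to scale and sign, which does not matter for the correlation-based classification that follows.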

The ensemble TRCA (eTRCA) is an extended version of TRCA. The ensemble spatial filter for the kth sub-band, \(W^{k} \in R^{{N_{C} \times N_{F} }}\), is obtained by concatenating the spatial filters from all stimuli

$$W^{k} = \left[ {w_{1}^{k} ,w_{2}^{k} , \ldots ,w_{{N_{F} }}^{k} } \right]$$
(6)

(2) Model training and classification: Different from the training method for sbCNN, eTRCA is trained with a subject-specific method, i.e., only the training data from the testing subject are used for training the eTRCA model. The model training and classification for eTRCA are illustrated in Fig. 4.

Fig. 4

Model training and classification for eTRCA.

The spatial filter is used for filtering a testing trial from the kth sub-band \(\tilde{X}_{t}^{k} \in R^{{N_{C} \times N_{S} }}\) and the template signal from the kth sub-band and the nth stimulus, which is the average of all training trials, i.e., \(\overline{X}_{n}^{k} = (1/N_{T} )\sum\nolimits_{i = 1}^{{N_{T} }} {X_{n,i}^{k} } \in R^{{N_{C} \times N_{S} }}\). Their Pearson correlation coefficient \(\rho\) can be calculated as

$$r_{t,n}^{k} = \rho \left( {\left( {\tilde{X}_{t}^{k} } \right)^{T} W^{k} ,\left( {\overline{X}_{n}^{k} } \right)^{T} W^{k} } \right)$$
(7)

The classification score is yielded by integrating the correlation coefficients from \(N_{K}\) sub-bands using the following formula16

$$s_{t,n}^{eTRCA} = \sum\limits_{k = 1}^{{N_{K} }} {\left( {k^{ - 1.25} + 0.25} \right)r_{t,n}^{k} }$$
(8)

Classification score vector \(s_{t}^{eTRCA} = [s_{t,1}^{eTRCA} ,s_{t,2}^{eTRCA} , \cdots ,s_{{t,N_{F} }}^{eTRCA} ]\) is yielded accordingly. Finally, the stimulus frequency of the testing trial predicted by eTRCA can be decided as follows:

$$f_{t}^{eTRCA} = \mathop {\arg \max }\limits_{n} s_{t,n}^{eTRCA}$$
(9)
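The scoring steps of Eqs. (7)-(9) can be sketched in NumPy as follows. This is an illustration under stated assumptions: trials are stored as channels × samples and projected as \(X^{T} W\), and the Pearson correlation between the two projected matrices is computed on their flattened entries. The function name `etrca_scores` and the synthetic templates are hypothetical.

```python
import numpy as np


def etrca_scores(test_trial, templates, ensemble_filters):
    """Scores of one test trial for all stimuli and the predicted class index.

    test_trial:       (N_K, N_C, N_S) sub-band filtered test trial
    templates:        (N_K, N_F, N_C, N_S) trial-averaged training data
    ensemble_filters: (N_K, N_C, N_F) ensemble spatial filters W^k
    """
    n_bands, n_stimuli = templates.shape[:2]
    # Filter-bank weights k^(-1.25) + 0.25 for k = 1..N_K (Eq. 8)
    k = np.arange(1, n_bands + 1, dtype=float)
    band_weights = k ** -1.25 + 0.25
    scores = np.zeros(n_stimuli)
    for b in range(n_bands):
        W = ensemble_filters[b]
        projected_test = (test_trial[b].T @ W).ravel()
        for n in range(n_stimuli):
            projected_template = (templates[b, n].T @ W).ravel()
            # Pearson correlation of the flattened projections (Eq. 7)
            r = np.corrcoef(projected_test, projected_template)[0, 1]
            scores[n] += band_weights[b] * r
    # Eq. 9: the stimulus with the largest score
    return scores, int(np.argmax(scores))


# Synthetic check: the test trial equals the template of stimulus index 2
t = np.arange(100) / 250.0
n_bands, n_stimuli, n_channels = 2, 3, 2
templates = np.zeros((n_bands, n_stimuli, n_channels, t.size))
for b in range(n_bands):
    for n, f in enumerate([8.0, 9.0, 10.0]):
        wave = np.sin(2 * np.pi * f * (b + 1) * t)
        templates[b, n] = np.vstack([wave, 0.5 * wave])
filters = np.ones((n_bands, n_channels, n_stimuli))  # placeholder filters
scores, predicted = etrca_scores(templates[:, 2], templates, filters)
```

Because the test trial coincides with one template here, its correlation is 1 in every sub-band and its score equals the sum of the filter-bank weights.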

Feature combination

Currently, sbCNN and eTRCA are the state-of-the-art DL and traditional ML algorithms respectively, which have yielded good performance in SSVEP-BCIs. Nevertheless, the eTRCA algorithm is limited in effectively utilizing information from other subjects, which is precisely the strength of DL methods. Therefore, effective integration of the eTRCA and sbCNN approaches is expected to improve the classification performance of SSVEP-BCIs. In this study, a feature combination framework was developed to detect the stimulus frequency of SSVEP signals. Under the condition of small samples in the training set, it is expected to improve the classification performance of SSVEP-BCIs and the robustness of frequency detection, thus promoting their practical application.

We treat the two classification score vectors of a testing trial yielded by eTRCA and sbCNN respectively as two feature vectors, and fuse them into a single feature vector for SSVEP classification. There are two commonly used methods for feature fusion: (1) Normalized sum. The two vectors are first normalized by their respective maximal score values and then summed up; (2) Weighted sum. The two vectors are first weighted by the training accuracies generated by eTRCA and sbCNN respectively and then summed up. Unfortunately, the classification results of these two methods were not satisfactory. Instead, we fuse the two feature vectors by a direct sum.

Specifically, for a testing trial \(t\), the two classification score vectors \(s_{t}^{sbCNN} \in R^{{1 \times N_{F} }}\) and \(s_{t}^{eTRCA} \in R^{{1 \times N_{F} }}\) derived from the sbCNN and the eTRCA classification respectively are summed together

$$s_{t} = s_{t}^{sbCNN} + s_{t}^{eTRCA} \in R^{{1 \times N_{F} }}$$
(10)

where \(N_{F}\) is the number of stimuli (or stimulus frequencies). Then the stimulus frequency \(f_{t}\) of the testing trial \(\tilde{X}_{t}\) is decided as the frequency with the maximal score value, where \(s_{t,n}\) denotes the nth entry of \(s_{t}\)

$$f_{t} = \mathop {\arg \max }\limits_{n} s_{t,n} ,\quad n = 1,2, \ldots ,N_{F}$$
(11)
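Eqs. (10) and (11) amount to an element-wise sum followed by an argmax, as the short sketch below shows. The function name `fuse_and_classify` and the toy score values are illustrative assumptions; no normalization or weighting is applied, in line with the direct sum described above.

```python
def fuse_and_classify(scores_sbcnn, scores_etrca, frequencies):
    """Sum the two score vectors (Eq. 10) and pick the best frequency (Eq. 11)."""
    if not (len(scores_sbcnn) == len(scores_etrca) == len(frequencies)):
        raise ValueError("score vectors and frequency list must have equal length")
    fused = [a + b for a, b in zip(scores_sbcnn, scores_etrca)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, frequencies[best]


# Toy example with three candidate frequencies
frequencies = [8.0, 8.2, 8.4]
scores_sbcnn = [0.2, 0.5, 0.3]   # hypothetical sbCNN scores
scores_etrca = [0.9, 0.4, 0.3]   # hypothetical eTRCA scores
fused, detected = fuse_and_classify(scores_sbcnn, scores_etrca, frequencies)
```

In this toy case sbCNN alone would select 8.2 Hz, whereas the summed scores select 8.0 Hz because eTRCA favors it by a large margin.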

Remark 1

Model combining is a strategy commonly used in the field of machine learning and data analytics30,31, aiming at obtaining comprehensive performance that is more accurate, robust, or generalizable by integrating the outputs of several different models or methods. The starting point of this approach is that individual models may have unique strengths in different aspects, and by combining them effectively, they are able to compensate for their respective shortcomings and thus improve the overall performance. Common model combination methods include sequential combination29 and parallel combination27,28, among which the latter improves the overall performance, reduces the risk of overfitting, and improves robustness by integrating the outputs of several different models. Thereby, the parallel combination is used in the study.

Remark 2

Although the method for feature combination is straightforward, it is highly effective. The reason can be analyzed from three aspects as follows.

(a) If each of the two scores corresponding to the correct stimulus frequency takes the maximum value in its classification score vector, the sum of the two scores must also take the maximum value, and thereby the combined model correctly identifies the stimulus frequency of a single-trial testing signal;

(b) If one of the two scores corresponding to the correct stimulus frequency does not take the maximum value in its classification score vector, it is very likely to take the second largest value or a relatively large value. The sum of the two scores corresponding to the correct stimulus frequency can then still take the maximum value, and thus the combined model can correctly identify the stimulus frequency;

(c) If neither of the two scores corresponding to the correct stimulus frequency takes the maximum value in its classification score vector, each of them is very likely to take the second largest value or a relatively large value. In this case, the sum of the two scores may still take the maximum value, and thus the combined model may still correctly identify the stimulus frequency, as long as the two largest scores do not occur at the same stimulus frequency.
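Case (c) is the least obvious of the three, so a small numerical illustration may help. The score values below are invented for the example; class index 1 plays the role of the correct stimulus frequency.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


correct = 1

# Both models rank the correct class second, but their top choices differ
sbcnn = [0.45, 0.40, 0.15]   # top choice: class 0
etrca = [0.10, 0.50, 0.55]   # top choice: class 2
fused = [a + b for a, b in zip(sbcnn, etrca)]
recovered = argmax(fused)    # the correct class wins after summation

# Counter-example: both models make the same wrong top choice (class 0)
etrca_same_error = [0.55, 0.50, 0.10]
fused_same_error = [a + b for a, b in zip(sbcnn, etrca_same_error)]
not_recovered = argmax(fused_same_error)
```

The first configuration shows the summation correcting two individually wrong decisions, while the second shows why the two largest scores must not coincide at the same wrong frequency.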

Data acquisition and preprocessing

Data acquisition

The proposed method is evaluated on two publicly available SSVEP datasets, Benchmark32 and BETA33. The main differences between them lie in the number of subjects and the number of blocks performed by each subject. In addition, their experimental settings were also different.

(1) Benchmark dataset. The dataset was acquired using 64 EEG channels from an SSVEP-based spelling experiment containing 40 stimulus targets (or frequencies), which were modulated by a joint frequency and phase coding approach. The stimulus frequencies ranged from 8 Hz to 15.8 Hz at 0.2 Hz intervals, while the stimulus phases ranged from 0 to 1.5π rad at 0.5π rad intervals. The sampling rate was 1000 Hz. Thirty-five healthy subjects (17 females, average age of 22 years) took part in the experiment. The experiment contained 6 blocks and each block consisted of 40 trials, which correspond to 40 stimulus targets prompted in randomized order, i.e., every target contained 6 trials. Every trial lasted 6 s, including 0.5 s for the visual cue, 5 s for the visual stimulus and 0.5 s for relaxing. The dataset was collected in a laboratory with electromagnetic shielding.

(2) BETA dataset. The dataset was acquired in an SSVEP-based spelling experiment similar to that of the Benchmark dataset. Seventy healthy subjects (42 males, average age of 25 years) participated in the experiment. The differences from the Benchmark dataset are as follows: each subject performed 4 blocks and each block contained 40 trials corresponding to 40 stimulus targets prompted in randomized order, i.e., each target contained 4 trials. Every trial lasted 3 s for the first 15 subjects or 4 s for the remaining subjects, including 0.5 s for the visual cue, 2 s or 3 s for the visual stimulus and 0.5 s for relaxing. Besides, the BETA dataset was collected outside the laboratory without electromagnetic shielding.
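The layout of the two datasets can be summarized with a few lines of arithmetic. The dictionary keys and the helper name below are illustrative; the numbers are taken from the descriptions above.

```python
# Benchmark stimulus frequencies: 8 Hz to 15.8 Hz in 0.2 Hz steps (40 targets)
frequencies = [round(8.0 + 0.2 * i, 1) for i in range(40)]

DATASETS = {
    "Benchmark": {"subjects": 35, "blocks": 6, "targets": 40},
    "BETA": {"subjects": 70, "blocks": 4, "targets": 40},
}


def trials_per_subject(name):
    """Each block presents every target once, so trials = blocks x targets."""
    d = DATASETS[name]
    return d["blocks"] * d["targets"]


total_subjects = sum(d["subjects"] for d in DATASETS.values())
```

With these values a Benchmark subject contributes 240 trials and a BETA subject 160, and the two datasets together cover 105 subjects.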

Data preprocessing

For each of the two datasets, EEG data from the 9 electrodes over the occipital lobe (Pz, PO5, PO3, POz, PO4, PO6, O1, Oz and O2) were employed for the study. The raw EEG data were down-sampled from 1000 to 250 Hz. Single-trial data were segmented in the temporal window [0.64 s, (0.64 + d) s] and [0.63 s, (0.63 + d) s] for the Benchmark and BETA dataset respectively, where d denotes the data length used for frequency recognition, 0.5 s is the time for gaze shifting, and 0.14 s and 0.13 s are the latency delays in the visual system for the Benchmark and BETA dataset respectively30,31. All data segments from the nine channels were filtered in the frequency range [m × 8 Hz, 90 Hz] with an IIR filter of Chebyshev type I, where \(m = 1,2, \cdots ,N_{SB}\) is the index of sub-bands and \(N_{SB}\) is the number of sub-bands (denoted \(N_{K}\) in the Methods section). Five and three sub-bands are used for the eTRCA and sbCNN methods respectively17,20. The temporal filtering was done forward and backward using the Matlab function filtfilt to avoid phase distortion.
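The segmentation window and the sub-band edges described above can be expressed as follows. This sketch only computes sample indices and passband edges; it does not design the Chebyshev filter, whose order and ripple are not given here. The function names and the rounding of fractional sample positions to the nearest sample are assumptions.

```python
def segment_indices(dataset, d, fs=250):
    """Sample indices [start, stop) of the analysis window at sampling rate fs."""
    # Window onset = 0.5 s gaze shifting + visual latency (0.14 s or 0.13 s)
    onset = {"Benchmark": 0.64, "BETA": 0.63}[dataset]
    start = round(onset * fs)       # assumption: round to the nearest sample
    stop = start + round(d * fs)
    return start, stop


def sub_band_edges(n_sub_bands, upper=90.0):
    """Passband [m * 8 Hz, 90 Hz] for m = 1..n_sub_bands."""
    return [(m * 8.0, upper) for m in range(1, n_sub_bands + 1)]


benchmark_window = segment_indices("Benchmark", d=0.6)
beta_window = segment_indices("BETA", d=0.6)
etrca_bands = sub_band_edges(5)   # five sub-bands for eTRCA
sbcnn_bands = sub_band_edges(3)   # three sub-bands for sbCNN
```

At 250 Hz a data length of 0.6 s corresponds to 150 samples per channel, which is the temporal size of each segment passed to both models.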

Results

The performance of the eTRCA + sbCNN framework was evaluated by comparing it with those of eTRCA and sbCNN from the following aspects: classification accuracy, simulated ITR, feature distribution and confusion matrix in terms of different lengths of data, number of channels and number of training trials used for classification. In addition, the accuracy and ITR of eTRCA + sbCNN were also compared with those of several state-of-the-art traditional ML and DL models to further verify its superiority. Classification accuracy is the ratio of the number of correctly recognized testing trials to the total number of testing trials, whereas ITR in bits/min1 is formulated as

$$ITR = \left( {\log_{2} M + P\log_{2} P + (1 - P)\log_{2} \left[ {\frac{1 - P}{{M - 1}}} \right]} \right)\left( \frac{60}{T} \right)$$
(12)

where M is the number of stimuli, P is the classification accuracy of stimuli, and T is the mean time in seconds for a detection. For the calculation of ITRs, the 0.5 s for gaze shifting was added in the time for target detection.

The ITR represents the amount of information transmitted per unit time by a communication system. It increases with the number of stimuli and the detection accuracy, and decreases with the detection time of a stimulus. In contrast, classification accuracy reflects only the proportion of correctly recognized trials, without considering the detection time. Thereby, ITR is a more practical performance measure of SSVEP BCI systems than classification accuracy in real-world scenarios.
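Eq. (12) can be evaluated with a few lines of Python. The function name is illustrative, and the boundary cases P = 0 and P = 1 are handled by dropping the vanishing term, using the convention that 0 · log 0 = 0.

```python
import math


def itr_bits_per_min(n_targets, accuracy, detection_time_s):
    """Information transfer rate of Eq. (12) in bits/min."""
    M, P, T = n_targets, accuracy, detection_time_s
    if not 0.0 <= P <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(M)
    if P > 0.0:
        bits += P * math.log2(P)
    if P < 1.0:
        bits += (1.0 - P) * math.log2((1.0 - P) / (M - 1))
    return bits * 60.0 / T


# Detection time = data length + 0.5 s for gaze shifting
data_length = 0.4
itr_perfect = itr_bits_per_min(40, 1.0, data_length + 0.5)
itr_chance = itr_bits_per_min(40, 1.0 / 40, data_length + 0.5)
```

The perfect-accuracy value gives the upper bound of the ITR for a given data length, while accuracy at chance level (1/M) yields an ITR of zero.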

Classification accuracy and ITR

Since the performance of an SSVEP BCI is severely affected by the length of data, the number of channels and the number of the training trials used for frequency detection, we investigated the relationship between classification accuracy and ITR and these parameters. It is noted that both accuracy and ITR are the averaged accuracy and ITR across all subjects in a dataset.

Figure 5 shows the accuracies and ITRs yielded in each of the two datasets at five different data lengths ranging from 0.2 s to 1 s with a stride of 0.2 s. The number of channels was fixed at 9 for the two datasets, whereas the number of training blocks was fixed at 5 for the Benchmark dataset and 3 for the BETA dataset. It is easily observed from the figure that, for each dataset, the accuracies of the three models increase with the data length, whereas the ITRs of the three models first increase with the data length, reach a peak and then drop continually. eTRCA + sbCNN significantly outperformed both eTRCA and sbCNN in terms of accuracy and ITR at all data lengths except for 0.2 s, at which eTRCA + sbCNN and sbCNN show no statistically significant difference in accuracy. The reason is that, at the data length of 0.2 s, the performance of eTRCA is too poor compared to that of sbCNN for their complementarity to be reflected. The highest ITRs, 249.47 bits/min and 188.55 bits/min, are achieved by eTRCA + sbCNN at the data length of 0.4 s for the Benchmark and the BETA dataset respectively. These results validate that the method for model combination is effective for improving the performance of SSVEP BCIs.

Fig. 5

Classification accuracies (a) and ITRs (c) of the three models at five data lengths for the Benchmark dataset; Classification accuracies (b) and ITRs (d) of the three models at five data lengths for the BETA dataset. Error bars denote standard errors. The statistically significant difference between two algorithms (i.e., p value yielded by the paired t-test) is indicated by asterisks: * p < 0.05; ** p < 0.01; *** p < 0.001.

Figure 6 illustrates the accuracies and ITRs for each of the two datasets at four different groups of channels, containing 3, 5, 7 and 9 channels respectively. The data length was fixed at 0.6 s for the two datasets, whereas the number of training blocks was fixed at 5 for the Benchmark and 3 for the BETA dataset. For simplicity, the channels in each group were sequentially chosen from the nine channels employed for the study. It is seen from the figure that, for each dataset, both the accuracies and the ITRs of the three models rise with the number of channels used for frequency recognition. In terms of accuracy and ITR, eTRCA + sbCNN is significantly superior to both eTRCA and sbCNN at 5, 7, and 9 channels, but is significantly inferior to, or shows no statistically significant difference from, sbCNN at 3 channels. These results demonstrate that the method for model combination is effective for improving BCI performance when more than 3 channels are used. The reason is that too few EEG recording channels severely degrade the performance of the eTRCA algorithm, and hence that of the combination of eTRCA and sbCNN.

Fig. 6

Classification accuracies (a) and ITRs (c) of the three models at four groups of channels for the Benchmark dataset; Classification accuracies (b) and ITRs (d) of the three models at four groups of channels for the BETA dataset.

Figure 7 depicts the classification accuracies and ITRs of the three models at four numbers of training trials for the Benchmark dataset and two numbers of training trials for the BETA dataset. The data length and the number of channels used for frequency recognition were fixed at 0.6 s and 9 respectively for the two datasets. As shown in the figure, both the accuracies and ITRs of the three models rise with the number of training trials. For either dataset, eTRCA + sbCNN significantly outperforms both eTRCA and sbCNN in terms of accuracy and ITR at each number of training trials. These results indicate that the method for model combination is effective for improving BCI performance across different amounts of training data.

Fig. 7

Classification accuracies (a) and ITRs (c) of the three models at four numbers of training trials for the Benchmark dataset; classification accuracies (b) and ITRs (d) of the three models at two numbers of training trials for the BETA dataset.
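The ITRs compared above depend on the accuracy, the number of targets and the time needed for one selection. As a minimal sketch, assuming the widely used Wolpaw formula and assuming that a gaze-shift time (0.5 s in the usage line, an assumed value) is added to the data length, the conversion from accuracy to ITR could be written as follows. The function name and its arguments are illustrative and are not taken from the paper.

```python
import math


def itr_bits_per_min(accuracy, n_classes, selection_time_s):
    """Wolpaw-style information transfer rate in bits/min.

    accuracy          -- classification accuracy P in [0, 1]
    n_classes         -- number of targets N (40 for the datasets discussed here)
    selection_time_s  -- time per selection in seconds (assumed here to be
                         data length plus gaze-shift time)
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_classes < 2 or selection_time_s <= 0:
        raise ValueError("need at least 2 classes and a positive time")

    p = accuracy
    bits = math.log2(n_classes)
    # The terms P*log2(P) and (1-P)*log2(...) tend to 0 at P = 1 and P = 0
    # respectively, so they are skipped there to avoid log2(0).
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_classes - 1))
    return 60.0 / selection_time_s * bits


# Usage: 0.6 s of data plus an assumed 0.5 s gaze shift, 40 targets.
example_itr = itr_bits_per_min(0.90, 40, 0.6 + 0.5)
```

Because the selection time sits in the denominator, a shorter data length raises the ITR only as long as the accuracy does not fall too quickly, which is why accuracy and ITR are reported side by side.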

Feature distribution

To further explore the performance of the proposed framework, we employed 2-dimensional t-SNE34 to compare the 40-dimensional features, i.e., the classification score vectors. Figure 8 shows the two-dimensional feature distributions of the three models for the two datasets. For ease of observation, only the feature distributions of the first eight stimulus frequencies (i.e., the first 8 categories), starting from 8 Hz with an interval of 1 Hz, are shown. The raw 40-dimensional feature vectors were generated with 0.6 s-long data, 9 channels and 5 training blocks for the Benchmark dataset, and with 0.6 s-long data, 9 channels and 3 training blocks for the BETA dataset. Each point in the figure represents one testing trial and the colors denote different categories. Based on leave-one-block-out cross validation, each category includes a total of 35 (subjects) × 6 (blocks) = 210 testing trials for the Benchmark dataset or 70 (subjects) × 4 (blocks) = 280 testing trials for the BETA dataset. The results show that in each row, from left to right, each model produces tighter clusters and more separable categories in the 2-dimensional feature space than the model to its left. This arises from the fact that, compared to eTRCA, sbCNN leverages other subjects’ data for training the DL model, and the additional information improves the quality of the feature signals; eTRCA + sbCNN, in turn, exploits the advantages of both models, reduces the overall model bias and improves the robustness of the feature signals.

Fig. 8

For the first eight stimulus frequencies from 8 to 15 Hz with a stride of 1 Hz, 2-dimensional feature distributions of testing trials from all blocks and all subjects yielded by t-SNE for the three models in the Benchmark dataset (a) and the BETA dataset (b).


Confusion matrix

Due to space limitations, only the Benchmark dataset is used to calculate the confusion matrix of eTRCA + sbCNN averaged across subjects, which is shown in Fig. 9. The confusion matrix was obtained using 1 s-long data, 9 channels and 5 training blocks. The total number of testing trials for each stimulus frequency (i.e., category) equals 210 (i.e., 6 (blocks) × 35 (subjects)). The experimental results show that the correct recognition rates of these stimuli were high and that there were no large differences in accuracy among different stimuli.

Fig. 9

Averaged confusion matrix of the proposed eTRCA + sbCNN model across 35 subjects in the Benchmark dataset yielded at the data length of 1.0 s.
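To make the construction of such a matrix concrete, the following is a minimal sketch of how a confusion matrix can be accumulated from true and predicted labels and then normalized per category. It assumes that rows index the true stimulus and columns the predicted one, which is a common convention and not necessarily the layout of Fig. 9; the function names are illustrative.

```python
def confusion_counts(true_labels, predicted_labels, n_classes):
    """Count trials for every (true, predicted) label pair.

    Rows are true categories, columns are predicted categories
    (an assumed convention for this sketch).
    """
    counts = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true][predicted] += 1
    return counts


def row_normalize(counts):
    """Turn counts into per-category recognition rates (rows sum to 1)."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


# Toy usage with 3 categories and 2 testing trials per category.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
counts = confusion_counts(true_labels, predicted_labels, n_classes=3)
rates = row_normalize(counts)
```

The diagonal of the normalized matrix gives the correct recognition rate of each stimulus, so similar diagonal values correspond to the observation that accuracy varies little among stimuli.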

Comparison with other models

To further validate the performance of eTRCA + sbCNN, we compare it with other state-of-the-art methods, including traditional ML algorithms and DL networks. For the Benchmark dataset, the following methods are compared: (a) CCA11; (b) FBCCA15; (c) TRCA17; (d) compact-CNN18; (e) Conv-CA28; (f) bi-SiamCA21. Their classification accuracies and ITRs are reported in Table 1. The table shows that both the accuracies and the ITRs of the proposed method are higher than those of all other methods at the five data lengths. Paired t-tests at the 95% confidence level show that eTRCA + sbCNN is significantly better than every other method in both accuracy and ITR at each of the five data lengths, with all p values smaller than 0.05.

Table 1 Averaged classification accuracies (%) and ITRs (bits/min, in brackets) of the seven compared models across all subjects in the Benchmark dataset at the data lengths of 0.2 s, 0.4 s, 0.6 s, 0.8 s and 1.0 s.

For the BETA dataset, the following methods are compared: (a) FBCCA15; (b) ms-eCCA35; (c) ITCCA13; (d) TRCA17; (e) Conv-CA28. Since most of these methods provided classification results only at the data length of 1.0 s, their accuracies and ITRs at this data length are reported in Table 2. Paired t-tests at the 95% confidence level show that eTRCA + sbCNN is significantly better than every other method in both accuracy and ITR at this data length, with all p values smaller than 0.05.

Table 2 Averaged classification accuracies (%) and ITRs (bits/min) of the six compared models across all subjects in the BETA dataset at the data length of 1.0 s.
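The significance statements above rest on paired t-tests, in which the per-subject results of two methods are compared subject by subject. The sketch below computes the paired t statistic and its degrees of freedom using only the Python standard library; the accuracy values in the usage lines are made up for illustration and are not results from the paper. To obtain the p value, the statistic would be compared against a t distribution, for example with SciPy's `scipy.stats.ttest_rel` if that library is available.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-subject scores.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the per-subject
    differences a - b and stdev is the sample standard deviation.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods need one score per subject")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two subjects are required")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies (%) of two methods for three subjects.
method_a = [91.0, 84.0, 78.0]
method_b = [90.0, 82.0, 75.0]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Pairing by subject removes the large between-subject variability typical of EEG data from the comparison, so only the within-subject differences between methods enter the test.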

Computational complexity

The total training time of eTRCA + sbCNN includes that of sbCNN and that of eTRCA. The former is long because sbCNN comprises two stages, global training and fine-tuning, whereas the latter is short and negligible by comparison. The testing stage consists of classifying a testing signal with the trained sbCNN and eTRCA models, summing the two classification score vectors and predicting the label of the testing trial. This experiment was performed under Matlab R2023a on a computer configured with a 12th Gen Intel (R) Core (TM) i5-12400F CPU @2.50 GHz, 32 GB RAM, a GeForce 3080 GPU and 64-bit Windows 10. Training takes about 2.4 h and testing a single trial takes less than 0.1 s. This training time is too long for online applications. However, the training time of sbCNN can be significantly reduced by replacing the global model with one already trained on other subjects' data, so that eTRCA + sbCNN is still applicable to online experiments.
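The final step of the testing stage described above, summing the two classification score vectors and choosing the label with the maximal summed score, can be sketched in a few lines. The score values below are invented for illustration, and the sketch assumes that both models output one score per stimulus frequency on comparable scales; any normalization applied in the actual implementation is not shown.

```python
def combine_and_predict(etrca_scores, sbcnn_scores):
    """Sum two score vectors and return (predicted_label, summed_scores)."""
    if len(etrca_scores) != len(sbcnn_scores):
        raise ValueError("score vectors must have one entry per category")
    summed = [e + s for e, s in zip(etrca_scores, sbcnn_scores)]
    # The predicted label is the index of the largest summed score.
    predicted = max(range(len(summed)), key=summed.__getitem__)
    return predicted, summed


# Toy example with 3 categories (the datasets discussed here have 40).
etrca_scores = [0.2, 0.6, 0.1]   # eTRCA alone would pick category 1
sbcnn_scores = [0.1, 0.1, 0.5]   # sbCNN alone would pick category 2
label, summed = combine_and_predict(etrca_scores, sbcnn_scores)
```

In this toy case the two models disagree, and the summed scores favor the category on which the models agree more strongly overall. The operation is a single vector addition and an argmax, which is consistent with the negligible cost of the combination step at test time.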

Discussion

Model combining is regarded as a promising approach to improving the classification performance of a pattern recognition system when the training set is small. In this study, we propose a novel model combining-based classification method for improving the performance of SSVEP-BCI systems. By adding the two feature vectors derived from the state-of-the-art models eTRCA and sbCNN, the combined model eTRCA + sbCNN achieves significantly higher classification accuracy and ITR than both eTRCA and sbCNN on two commonly used SSVEP datasets.

Model combination can be done by either a sequential method29 or a parallel method27,28. The method proposed by Deng et al.29 belongs to the former, in which an eTRCA model is utilized for feature extraction and a serially connected sbCNN model is employed for feature abstraction and classification. The method proposed in this study belongs to the latter, in which the two models eTRCA and sbCNN are connected in parallel and their feature signals are summed for classification. The advantage of the former method is that introducing eTRCA filters prior to the sbCNN network improves the SNR of its input data, whereas that of the latter is that it avoids the loss of useful information in the raw EEG data.

The proposed method for model combination works well because of the complementary nature of the two models. The DL-based model sbCNN is entirely data driven, has a strong ability to automatically represent high-level abstractions through multiple convolutional layers, and is able to extract generic features by exploiting the data from other subjects. On the other hand, the traditional model eTRCA is built on neurophysiological knowledge of a specific BCI paradigm and can extract appropriate features that fit this paradigm. Their combination takes into account both the universality and the specificity of the feature signals and can consequently achieve better performance than either single model.

This paper provides a framework for combining a traditional ML model and a DL model to improve the performance of SSVEP BCIs. Any one traditional ML model and any one DL model can be combined in an appropriate fashion. However, the performance of the two models must not differ too much; otherwise their combination will not improve, and may even degrade, the overall classification performance. As shown in Figs. 5 and 6, when the data length is 0.2 s, the classification accuracy of eTRCA is much lower than that of sbCNN, so that the accuracy of eTRCA + sbCNN does not differ significantly from that of sbCNN for the two datasets. Likewise, when the number of EEG channels is 3, the classification accuracy of eTRCA is much lower than that of sbCNN, so that in accuracy eTRCA + sbCNN does not differ significantly from sbCNN for the Benchmark dataset and is inferior to sbCNN for the BETA dataset, while in ITR it does not differ significantly from sbCNN for either dataset. In the future, we will explore new methods for model combination, such as weighting the two models, to further improve the classification of SSVEP signals.
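The weighting idea mentioned as future work is not specified further in the paper. Purely as a hypothetical sketch of one way it could be realized, the code below forms a convex combination of the two score vectors and selects the weight by accuracy on held-out trials. The function names, the weight grid and the use of a validation set are all assumptions of this sketch, not the authors' method.

```python
def weighted_predict(etrca_scores, sbcnn_scores, weight):
    """Predict with weight * eTRCA + (1 - weight) * sbCNN scores.

    weight = 0.5 reproduces the ranking of the plain sum; weight = 0 or 1
    falls back to a single model.
    """
    mixed = [weight * e + (1.0 - weight) * s
             for e, s in zip(etrca_scores, sbcnn_scores)]
    return max(range(len(mixed)), key=mixed.__getitem__)


def select_weight(validation_trials, candidate_weights):
    """Pick the candidate weight with the highest validation accuracy.

    validation_trials is a list of (etrca_scores, sbcnn_scores, true_label)
    tuples; on ties the first candidate in the list is kept.
    """
    def accuracy(weight):
        hits = sum(weighted_predict(e, s, weight) == label
                   for e, s, label in validation_trials)
        return hits / len(validation_trials)

    return max(candidate_weights, key=accuracy)


# Toy validation trial in which only sbCNN points to the true category.
trials = [([0.2, 0.6, 0.1], [0.1, 0.1, 0.5], 2)]
best = select_weight(trials, candidate_weights=[0.0, 0.5, 1.0])
```

A scheme of this kind would let the combination lean on sbCNN in regimes where eTRCA is weak, such as very short data lengths or very few channels, which are the cases where the plain sum brings no gain.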

The limitation of the proposed method is that the training time is too long when the sbCNN model is trained using transfer learning. For example, the total training time is about 2.4 h with 0.4 s-long data for the Benchmark dataset comprising 35 subjects. This is impractical for a new user because of the tedious calibration procedure. However, most of the training time is spent in the first stage, in which the training involves a large amount of data from other subjects. In practice, the first-stage training can be removed by directly transferring a global model pre-trained on the existing dataset to a new user, so that only the second-stage training for fine-tuning is performed with his/her calibration data. In addition, only an offline analysis of the proposed algorithm was conducted in this study. We will explore its online implementation in the future.

Conclusion

In this paper, we present a model-combining framework, eTRCA + sbCNN, specialized for frequency identification in SSVEP BCIs. For a testing trial, the framework combines the traditional ML model eTRCA and the sub-band signal-based DL model sbCNN by summing their classification score vectors, and predicts the testing label (i.e., stimulus frequency) by choosing the label corresponding to the maximal summed score. We apply the proposed framework to two SSVEP BCI datasets including a total of 105 subjects and evaluate it by classification accuracy and ITR at different data lengths, channel subsets and numbers of training blocks. The experimental results demonstrate that the eTRCA + sbCNN framework significantly outperforms both eTRCA and sbCNN, as well as several state-of-the-art traditional ML and DL models, and shows promising potential for the decoding of EEG signals in SSVEP BCIs.