Introduction

BCIs can restore communication for paralyzed patients who have lost the ability to move or speak. Imagery handwriting has become a focus of BCI research: participants write letters only in their minds, without performing the actual movement1,2. By interpreting the neural signals of imagined characters, a BCI allows users to communicate with external devices efficiently. The imagery-based handwriting paradigm is widely used in the BCI field, where neural information generated by internally imagined actions is translated into control signals. Two main deep learning strategies exist for decoding imagined-character tasks: CNN-based and RNN-based3,4,5,6,7.

CNNs have an advantage in perceiving local information. Pei et al.5 first applied independent component analysis to decompose electroencephalogram (EEG) signals and extract features for training a CNN-based classifier to recognize handwritten characters. Ullah et al.3 introduced a system that uses a deep convolutional neural network to recognize visual/mental imagery of English alphabet letters, enabling direct typing through brain signals. However, CNNs are limited in processing the temporal features of neural signals because of the fixed size of the convolution kernel6,7.

RNNs exhibit advantages in processing sequence data, such as text and time series8,9,10,11,12,13. Willett et al.14 introduced a brain-to-text communication method that records neural signals from the motor cortex and decodes them with a recurrent neural network. Sun et al.15 proposed the Brain2Char architecture, which decodes character sequences from electrocorticography (ECoG) signals using 3D Inception layers, bidirectional recurrent layers, and dilated convolution layers. However, RNNs suffer from vanishing or exploding gradients when processing long data sequences16.

We argue that CNN and RNN decoding techniques restrict the capture of global context information in neural signals. Moreover, the temporal and spatial characteristics of neural signals overlap during activity17, and this interplay cannot be fully captured by CNN or RNN approaches.

The Transformer is a deep neural network that uses the self-attention mechanism to capture long-range dependencies in sequential data18,19,20,21. It has been applied to neural signal decoding, exploiting its ability to capture temporal information along a time series22,23. Furthermore, cross-attention is a technique employed in neural networks to identify intricate connections between multiple sequences or multimodal data24. Deployed in the Transformer, it allows one sequence to attend to another, establishing dependencies between them. Inspired by cross-attention24, we introduced a Temporal-Spatial Cross Attention Network (TSCA-Net) for decoding imagined characters. The contributions of this paper are as follows:

  • We cascaded LSTM and Transformer in a Temporal Feature (TF) module to extract temporal features from MEA signals. By harnessing the benefits of LSTM’s gating mechanism, we capture continuous temporal features within the time dimension of MEA signals. Subsequently, we employ the Transformer to capture the global dependencies among distinct temporal sequences.

  • We introduced the Transformer into the Spatial Feature (SF) module for the extraction of spatial features. It allowed the Transformer to capture long-term dependencies in the channel sequence, enabling the extraction of spatial features from the MEA signals.

  • We developed a Temporal-Spatial Cross (TSCross) module that comprises two essential sub-modules, namely TSCross-SingleT and TSCross-SingleC. It is designed to extract the interplay between temporal and spatial features.

The organization of this paper is as follows. The section “Introduction” presents a concise overview of previous studies that have employed deep learning models for feature extraction from character signals. The section “Methods” introduces the overall structure of TSCA-Net and provides details about its key components. The section “Experiments” compares the classification accuracy of TSCA-Net with that of other models and reports ablation experiments assessing the importance of each component. Finally, we summarize the contributions of this paper and identify future research directions.

Figure 1

The overall architecture of TSCA-Net. (a) TSCA-Net consists of four modules: TF, SF, TSCross, and Classifier, as illustrated. (b) The TF module combines cascaded LSTM and Transformer models to extract temporal features from MEA signals. (c) In the SF module, the Transformer model is employed to capture long-term dependencies among the channel sequences. (d) The TSCross module leverages a cross-attention mechanism to capture the interdependencies of both temporal and spatial features. It comprises two submodules, namely TSCross-SingleT and TSCross-SingleC. The Classifier module, consisting of a global pooling layer and two fully connected layers, is utilized to predict the category labels of input MEA signals. Notably, only the encoder of the Transformer model is deployed within the TSCA-Net framework.

Methods

To effectively capture the temporal and spatial features of imagined character signals, we developed the TSCA-Net model, which consists of four modules: TF, SF, TSCross, and Classifier, as shown in Fig. 1. The detailed procedures for each module in TSCA-Net are described below.

Overall architecture

The TF module was implemented to capture long-term temporal dependencies in neural signals. The time series of the signals was first input into the LSTM module, which captures continuous temporal features. The resulting temporal representation was then fed into the Transformer module, which extracts the inter-dependencies among different positions by calculating attention scores25.
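For illustration, a minimal PyTorch sketch of this cascade is given below. The input dimensions (201 time steps, 192 channels) follow the dataset described later, and the width of 128 with 16 attention heads mirrors the defaults reported in the ablation study; the class and argument names are ours, not the original implementation.

```python
import torch
import torch.nn as nn

class TemporalFeature(nn.Module):
    """Sketch of a TF-style module: LSTM followed by a Transformer encoder.

    Input is a batch of MEA trials shaped (batch, T, C), e.g. T=201 time
    steps and C=192 electrodes for the handwriting dataset.
    """
    def __init__(self, n_channels=192, d_model=128, n_heads=16, n_layers=1):
        super().__init__()
        # LSTM captures continuous temporal dynamics along the time axis.
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=d_model,
                            batch_first=True)
        # Transformer encoder models global dependencies among time steps.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, x):                 # x: (batch, T, C)
        h, _ = self.lstm(x)               # (batch, T, d_model)
        return self.encoder(h)            # (batch, T, d_model)

# Example: one batch of 8 trials, 201 time steps, 192 channels.
tf = TemporalFeature()
out = tf(torch.randn(8, 201, 192))        # -> (8, 201, 128)
```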

The SF module was designed to capture the spatial relationships between different channels on a global scale. In the SF procedure, the channel series of the signals is input directly into the Transformer module, which computes attention scores to capture the inter-dependencies among channels. Unlike the TF, the SF uses the raw channel sequences as input.

The TSCross module captures the correlation between temporal and spatial features of neural signals. Due to the distinct dependency relationship between time and channels, the TSCross module was divided into two sub-modules: TSCross-SingleT and TSCross-SingleC. TSCross-SingleT computed the attention degree of acquisition channels at different time points, while TSCross-SingleC computed the attention degree of different time points on the acquisition channels.

In the TSCross-SingleT processing flow, both the channel sequences of the neural signals and the representation obtained by the TF were fed into the TSCross. The channel sequences were encoded as the query vector to calculate the attention scores, while the representation obtained was encoded as the key and value vectors for attention calculation.

Similarly, in the TSCross-SingleC processing flow, the time sequence and the representation captured by the SF were used as inputs to the TSCross. The time sequence was encoded as the query vector for attention calculation, while the obtained representation was encoded as the key and value vectors.
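The asymmetric use of queries, keys, and values described above can be sketched in PyTorch as follows; the tensor shapes are illustrative, and an embedding of the raw channel sequence to the model width is assumed rather than taken from the original code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Illustrative cross-attention block in the style of TSCross."""
    def __init__(self, d_model=128, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query_seq, context_seq):
        # query_seq attends to context_seq: Q comes from one stream,
        # K and V from the other, as in TSCross-SingleT / TSCross-SingleC.
        out, _ = self.attn(query=query_seq, key=context_seq, value=context_seq)
        return out

# TSCross-SingleT style: the channel sequence (C tokens, embedded to d_model)
# queries the temporal representation produced by the TF module (T tokens).
channel_tokens = torch.randn(8, 192, 128)   # (batch, C, d_model)
tf_tokens = torch.randn(8, 201, 128)        # (batch, T, d_model)
fused = CrossAttention()(channel_tokens, tf_tokens)   # -> (8, 192, 128)
```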

The Classifier is responsible for mapping the features of the input data to category labels or probability distributions. It takes the fused representation captured by TF, SF, and TSCross as its input. The fused features pass through two fully connected layers before being fed into the softmax function, which generates a probability distribution over the possible classes. The predicted label is the category with the highest probability.
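A minimal sketch of such a classifier head is shown below, assuming the fused representation is a token sequence of width 128; the hidden width of 64 is an illustrative choice, not a reported hyperparameter.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Global pooling followed by two fully connected layers and softmax."""
    def __init__(self, d_model=128, hidden=64, n_classes=26):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)      # global pooling over tokens
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (batch, tokens, d_model)
        x = self.pool(x.transpose(1, 2)).squeeze(-1)  # (batch, d_model)
        logits = self.fc2(torch.relu(self.fc1(x)))    # (batch, n_classes)
        return torch.softmax(logits, dim=-1)          # class probabilities

probs = Classifier()(torch.randn(8, 201, 128))        # -> (8, 26)
```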

Multi-head self-attention

The Multi-head Self-Attention (MSA) mechanism allows the model to focus on multiple positions of a sequence simultaneously26. Each element of the input sequence is mapped to three vectors: query, key, and value, and the output is computed as a weighted sum of the value vectors. The MSA consists of several heads, each with its own query, key, and value projections, which attend to different pieces of information within the sequence. Finally, the output vectors of all heads are combined through concatenation (or weighted averaging) to produce the final output.

$$\begin{aligned} &\text {MSA}(Q, K, V)={\text {concat}}\left( \text {head}_1, \ldots , \text {head}_{N_h}\right) W^o\\&\text {head}_i=\text {Attention}\left( Q W_i^Q, K W_i^K, V W_i^V\right) \\&\text {Attention}\left( Q_i, K_i, V_i\right) ={\text {Softmax}}\left( \frac{Q_i K_i^{\text {T}}}{\sqrt{d_k}}\right) V_i \end{aligned}$$
(1)

The MSA formula is shown in Eq. (1), where \(\text {head}_i\) refers to the i-th attention head out of a total of \(N_h\) heads. The linear transformation matrix that maps the concatenated head outputs to the output space is denoted as \(W^o\in {\mathbb {R}}^{N_h d_{v}\times d_{\text {model}}}\), where \(d_{\text {model}}\) represents the dimension of the input embedding. The query, key, and value vectors of the i-th head are given by \(Q W_i^Q\), \(K W_i^K\) and \(V W_i^V\), respectively, where \(W_i^Q\in {\mathbb {R}}^{d_{\text {model}}\times d_q}\), \(W_i^K\in {\mathbb {R}}^{d_{\text {model}}\times d_k}\), and \(W_i^V\in {\mathbb {R}}^{d_{\text {model}}\times d_v}\) are the corresponding weight matrices, and \(d_q\), \(d_k\) and \(d_v\) are the dimensions of the query, key, and value vectors.
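For reference, the computation in Eq. (1) can be written compactly in PyTorch as below, with the per-head projections stacked into single linear layers; this is a generic sketch rather than the TSCA-Net implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention following Eq. (1)."""
    def __init__(self, d_model=128, n_heads=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)    # stacks all W_i^Q
        self.k_proj = nn.Linear(d_model, d_model)    # stacks all W_i^K
        self.v_proj = nn.Linear(d_model, d_model)    # stacks all W_i^V
        self.out_proj = nn.Linear(d_model, d_model)  # W^o

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, s, _ = x.shape
        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # scaled dot product
        heads = torch.softmax(scores, dim=-1) @ v    # (batch, heads, seq, d_k)
        heads = heads.transpose(1, 2).reshape(b, s, -1)  # concatenate heads
        return self.out_proj(heads)

y = MultiHeadSelfAttention()(torch.randn(8, 201, 128))   # same shape as input
```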

Relative position embedding

The traditional Transformer has only a weak notion of position within the input sequence26,27,28; therefore, a position embedding is required to enhance positional awareness when modelling with a Transformer. The ViT model introduced a learnable absolute position embedding that adds a fixed vector to each position of the input sequence18. However, this method cannot accurately depict the relative relationships between positions.

To enhance the positional awareness of the Transformer, TSCA-Net deploys a relative position embedding that dynamically encodes the pairwise offsets between positions29. Specifically, the relative position information in TSCA-Net is represented by a two-dimensional matrix whose dimensions equal the length of the input sequence. During the attention computation, the position biases indexed via the relative position matrix are added to the attention scores. The enhanced attention is given by Eq. (2), where B denotes the relative position bias.

$$\begin{aligned} \text {Attention}\left( Q,K,V\right) =\text {SoftMax}\left( QK^\text {T}/\sqrt{d}+B\right) V, \end{aligned}$$
(2)
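One simple way to realize such a bias, assuming a learnable table indexed by the relative offset between positions (the exact parameterization used in TSCA-Net may differ), is sketched below.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable bias B[i, j] indexed by the relative offset (i - j)."""
    def __init__(self, seq_len, n_heads=16):
        super().__init__()
        # One bias per head and per possible offset in [-(L-1), L-1].
        self.table = nn.Parameter(torch.zeros(n_heads, 2 * seq_len - 1))
        idx = torch.arange(seq_len)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + seq_len - 1)

    def forward(self, scores):                       # scores: (batch, heads, L, L)
        bias = self.table[:, self.rel_idx]           # (heads, L, L)
        return scores + bias.unsqueeze(0)            # add B before the softmax

# Usage inside attention: softmax((Q K^T / sqrt(d)) + B) V, as in Eq. (2).
scores = torch.randn(8, 16, 201, 201)
biased = RelativePositionBias(seq_len=201)(scores)
```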

Cross-entropy loss function

The cross-entropy function is a convex function that helps prevent the model from getting stuck in suboptimal solutions. During backpropagation, the model parameters are updated continuously using gradient information from the loss function30. The detailed formula is defined in Eq. (3), where \(N_t\) represents the number of trials and \(N_c\) represents the number of categories. \(y_{m}^{n}\) and \({\hat{y}}_m^n\) denote the ground-truth label and the predicted probability of the n-th category for the m-th trial, respectively.

$$\begin{aligned} {\mathscr {L}}=-\frac{1}{N_t} \sum _{n=1}^{N_c} \sum _{m=1}^{N_t} y_{m}^{n} \log \left( {{\hat{y}}_m^n}\right) , \end{aligned}$$
(3)
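In a PyTorch implementation, Eq. (3) corresponds to the standard nn.CrossEntropyLoss, as illustrated by this minimal example.

```python
import torch
import torch.nn as nn

# Eq. (3) is the standard multi-class cross-entropy. In PyTorch,
# nn.CrossEntropyLoss fuses log-softmax with the negative log-likelihood,
# so it is applied to raw logits rather than to softmax outputs.
criterion = nn.CrossEntropyLoss()

logits = torch.randn(8, 26, requires_grad=True)   # N_t = 8 trials, N_c = 26 classes
labels = torch.randint(0, 26, (8,))               # ground-truth letter indices
loss = criterion(logits, labels)
loss.backward()                                   # gradients for parameter updates
```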

Experiments

Data acquisition and preprocessing

To evaluate the effectiveness of our proposed model, we conducted experiments on the Imagery Handwritten Character dataset14. The dataset records the neural activity of one participant from two microelectrode arrays placed in the hand area of the precentral gyrus over 10 days. The participant had a high-level spinal cord injury and was paralyzed from the neck down, with hand movements limited to twitching and micromotion. He was instructed to imagine holding a pen and writing letters and symbols by hand on ruled paper, as if not paralyzed.

The raw handwriting-imagination dataset comprises 26 English letters and five special characters (commas, apostrophes, question marks, periods, and spaces). The special characters serve as pauses or cues during the imagined handwriting. After a time-warping step was applied, as described in the literature14, the five special characters were removed from the dataset. By merging the data collected over the ten days, we obtained 3,172 trials of spiking activity covering the 26 English lowercase letters. Each trial was represented as a two-dimensional matrix of 201 time steps by 192 electrodes. In our experiments, we trained the model to classify the 26 lowercase letters.
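For orientation, the resulting trial layout can be pictured as follows; the arrays here are placeholders and do not reflect the released file format.

```python
import numpy as np

# Illustrative layout: after time-warping and merging the ten sessions, the
# data can be arranged as 3,172 trials x 201 time steps x 192 electrodes,
# with one integer label (0-25) per trial for the 26 lowercase letters.
n_trials, n_steps, n_channels = 3172, 201, 192
X = np.zeros((n_trials, n_steps, n_channels), dtype=np.float32)  # spiking activity
y = np.zeros(n_trials, dtype=np.int64)                           # letter labels in [0, 25]
print(X.shape, y.shape)   # (3172, 201, 192) (3172,)
```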

Implementation details of the TSCA-Net

To explain the TSCA-Net employed in the experiments, we presented detailed information about its key modules: TF, SF, TSCross, and Classifier. Specific implementations and parameters are listed in Table 1.

Table 1 Details of the TSCA-Net Architecture. In the table, C represents the number of MEA channels, T represents the number of time points, and N represents the number of output classes.

Platform environment

Our experiments were conducted on a Linux operating system with the TITAN XP GPU. Our model was implemented with PyTorch 1.10.2 and Python 3.7.12. We utilized the Adam optimizer with a weight decay of 0.0001 and an initial learning rate of 3e-5, and trained our model for 400 epochs with a batch size of 8.
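A minimal training configuration matching these settings is sketched below; a trivial stand-in model and random data are used so the snippet is self-contained, whereas in practice the TSCA-Net model and the real dataset would be plugged in.

```python
import torch
import torch.nn as nn

# Settings from the text: Adam, lr = 3e-5, weight decay = 1e-4,
# 400 epochs, batch size 8.
model = nn.Sequential(nn.Flatten(), nn.Linear(201 * 192, 26))   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

data = torch.utils.data.TensorDataset(torch.randn(32, 201, 192),
                                      torch.randint(0, 26, (32,)))
loader = torch.utils.data.DataLoader(data, batch_size=8, shuffle=True)

for epoch in range(400):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```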

Metrics

To evaluate the performance of our proposed model, we used four metrics: accuracy, precision, recall, and F1 score. Accuracy is the proportion of correctly classified samples, precision measures the proportion of true positive samples among all samples classified as positive, and recall measures the proportion of true positive samples among all actual positive samples. The F1 score, the harmonic mean of precision and recall, provides a balanced measure of the model’s performance. The formulas for these metrics are given in Eqs. (4), (5), (6), and (7). Because of the limited size of the experimental dataset, we conducted five-fold cross-validation: the dataset was split into training and testing sets in a 4:1 ratio, the metrics were computed on the testing set of each fold, and the average over the five folds was used to assess the model’s performance.

$$\begin{aligned} {\textrm{Accuracy}}= \frac{TP+TN}{TP+FN+TN+FP}\times 100\%, \end{aligned}$$
(4)
$$\begin{aligned} {\textrm{Precision}}= \frac{TP}{TP+FP}\times 100\%, \end{aligned}$$
(5)
$$\begin{aligned} {\textrm{Recall}} = \frac{TP}{TP+FN}\times 100\%, \end{aligned}$$
(6)
$$\begin{aligned} {\textrm{F1}} = \frac{2\times {\textrm{Precision}} \times {\textrm{Recall}}}{{\textrm{Precision}}+{\textrm{Recall}}}\times 100\%. \end{aligned}$$
(7)

where TP is the true positive, FN is the false negative, TN is the true negative, and FP is the false positive.
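These metrics and the five-fold protocol can be reproduced, for example, with scikit-learn as sketched below; macro averaging over the 26 classes is an assumption, and the features and predictions here are random placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Macro-averaged metrics over the 26 classes, reported as percentages."""
    return {
        "accuracy":  100 * accuracy_score(y_true, y_pred),
        "precision": 100 * precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    100 * recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        100 * f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Five-fold split (4:1 train/test), metrics averaged over the folds.
X = np.random.randn(3172, 8)                 # placeholder features
y = np.random.randint(0, 26, 3172)           # placeholder labels
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    y_pred = np.random.randint(0, 26, len(test_idx))   # stand-in predictions
    fold_scores.append(evaluate(y[test_idx], y_pred))
mean_acc = np.mean([s["accuracy"] for s in fold_scores])
```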

Comparison experiments

We conducted comparison experiments with seven commonly used classification models, namely EEG-Net31, EEG-TCNet32, GRU14, LSTM33, S3T22, R-Transformer34, and ViT18, to evaluate the performance of TSCA-Net in decoding signals for classification tasks.

  • EEG-Net and EEG-TCNet are convolution-based models that utilize CNNs or spatiotemporal convolutional neural networks to extract temporal and spatial features from EEG signals.

  • GRU and LSTM are recurrent neural networks used for time series analysis and sequence modeling.

  • S3T, R-Transformer, and ViT are three Transformer-type models. S3T, developed at South China University of Technology, models the spatiotemporal features of EEG signals. R-Transformer, developed at Michigan State University, combines RNNs with multi-head attention. ViT, a vision transformer developed by Google, addresses long-range dependencies in image classification.

Table 2 Performance comparison of TSCA-Net with seven state-of-the-art models.

After the five-fold cross-validation experiments, the eight models were evaluated in terms of accuracy, precision, recall, and F1 score; the results are presented in Table 2. TSCA-Net achieved the highest performance, with accuracy, precision, recall, and F1 score of 92.66\(\%\), 92.77\(\%\), 92.70\(\%\), and 92.58\(\%\), respectively. EEG-TCNet showed the weakest performance among all the compared models. Of the two CNN models, EEG-Net performed slightly better than EEG-TCNet, possibly because its simpler architecture is easier to train and optimize. Of the two RNN models, LSTM exhibited better classification performance than GRU. Among the Transformer models, ViT had the lowest performance, trailing the other models of this type. Even so, the Transformer models as a group outperformed the CNN models, which suggests that Transformer models are more effective than CNN models for recognizing imagined characters.

Table 3 Cross-validation results of comparative models.
Figure 2

Statistics of standard deviation and mean of accuracy.

The statistical results of the comparative models obtained through the five-fold cross-validation experiments are shown in Table 3. TSCA-Net achieved the highest mean accuracy and the lowest standard deviation across the five folds. Moreover, Fig. 2 shows that even the lowest accuracy of TSCA-Net exceeds the best performance of the other comparison models. Overall, TSCA-Net performed consistently well and outperformed the other comparison models.

Figure 3

Confusion matrix. The horizontal axis represents the predicted categories by the TSCA-Net, and the vertical axis represents the true labels of the 26 lowercase letters.

Table 4 The precision and recall of TSCA-Net for recognizing 26 characters.
Figure 4

Visualization of data distribution using t-SNE.

From the confusion matrix in Fig. 3, we can see that the diagonal elements have the darkest colors, indicating that TSCA-Net has relatively low error rates in identifying imagery handwritten characters. The model achieved 100\(\%\) precision and recall for characters such as ’m’ and ’u’.

To further validate the performance of TSCA-Net, we compared the distribution of the raw character data with the distribution of the features extracted by TSCA-Net. In Fig. 4a, the distribution of the 26 letters in the raw dataset appears highly chaotic: characters of the same category do not form clusters in the 2D space and are distributed almost randomly. In contrast, Fig. 4b shows that after feature extraction by TSCA-Net, the 26 letters form 26 distinct clusters; letters of the same category are tightly grouped, and the boundaries between categories are clearly distinguishable.

Table 4 reports the per-character precision and recall of the best TSCA-Net model for character signal decoding, which achieved an overall accuracy of 94.30\(\%\). The table shows that the model attained 100\(\%\) precision and recall for the characters ’m’ and ’u’. However, the precision for character ’c’ was the lowest among all characters, as was the recall for character ’r’. These findings suggest that the proposed TSCA-Net model is a robust and reliable method for decoding MEA signals in imagery handwriting tasks and that it outperforms the other models.

Ablation experiments

In the ablation experiments, we analyzed the effect of three factors on the performance of the TSCA-Net model: its components, the position embedding, and the core parameters of the multi-head attention.

Ablation experiments of model components

In our experiments evaluating TSCA-Net, we analyzed the impact of three components: TF, SF, and TSCross. The results are presented in Table 5. Our findings show that the absence of the TF has the most significant impact on the accuracy of the TSCA-Net model: the accuracy decreased by 5.69\(\%\) compared with the optimal model. Similarly, the absence of the SF component resulted in a decrease of 2.44\(\%\), with the accuracy of TSCA-Net dropping to 91.86\(\%\).

Table 5 Ablation study of TSCA-Net components. The effects of LSTM and Transformer in TF, SingleT, and SingleC in TSCross, and SF modules on the performance of the TSCA-Net model were explored. Here SingleT represents TSCross-SingleT, SingleC represents TSCross-SingleC.

As a part of our study, we analyzed the effect of TSCross-SingleT and TSCross-SingleC components on accuracy, both individually and combined. We observed that the removal of both components resulted in a decrease in accuracy by 4.43\(\%\). However, when only TSCross-SingleT was removed, the accuracy decreased by 2.85\(\%\), whereas the removal of only TSCross-SingleC caused a decrease of 1.89\(\%\).

The experiments suggest that the temporal features have the most significant impact on character recognition accuracy, followed by the spatiotemporal features acquired through the spatiotemporal interaction module, while the impact of the spatial features is comparatively small. We also found that using the spatial channels, rather than time, as the query in the MSA of the TSCross module has the larger impact on character recognition accuracy.

Table 6 Ablation study of position embedding. Absolute, Sin/Cos, Relative, and None are four distinct position embedding methods, where None indicates the absence of position embedding in the model.

Ablation experiments of position embedding

In the ablation experiments, we assessed the accuracy of four position embedding methods: relative, absolute, Sin/Cos, and none. From the experimental results presented in Table 6, it can be concluded that the TSCA-Net accuracy is highest using relative position embedding, while Sin/Cos position embedding produces the lowest accuracy. Specifically, the accuracy with relative position embedding is 2.56\(\%\) higher than without position embedding, 3.00\(\%\) higher than with absolute position embedding, and 3.35\(\%\) higher than with Sin/Cos position embedding. These findings suggest that relative position embedding is the most effective approach to improve the accuracy of TSCA-Net in character recognition tasks.

Table 7 Ablation study of the multi-head self-attention with different hyperparameters. Heads and Dims represent the number of attention heads and the embedding dimension, respectively.

Ablation experiments of the core parameters of the MSA

We performed experiments to determine the core parameters of multi-head attention in the TSCA-Net model. Specifically, we varied the number of heads and the encoding dimension. Based on the results presented in Table 7, we found that setting the number of heads to 16 and the encoding dimension to 128 led to the highest character recognition accuracy of 94.30\(\%\). Therefore, we set these values as the default for the TSCA-Net model.

Conclusion

This paper proposed TSCA-Net, a spatiotemporal cross-attention network for recognizing imagined characters. TSCA-Net consisted of four modules: TF, SF, TSCross, and Classifier. The TF module combined LSTM and Transformer to capture the temporal features of neural signals, while the SF module focused on capturing spatial information. The TSCross module was introduced to learn the relationship between the temporal and spatial features.

Experimental results on the Imagery Handwritten Character dataset demonstrated that TSCA-Net outperformed other comparison models in accuracy, precision, recall, and F1 score. Additionally, TSCA-Net provided an effective approach for extracting MEA signal features in brain-computer interface systems.

However, we manually selected the hyperparameters of TSCA-Net based on cross-validated recognition accuracy. The absence of nested cross-validation for hyperparameter selection introduces a potential risk of overfitting, a limitation to be addressed in future research.