Introduction

BCIs can restore communication for paralyzed patients who have lost the ability to move or speak. Imagery handwriting has become a focus of BCI research: participants write letters only in their minds, without performing the actual movement1,2. By interpreting the neural signals of imagined characters, a BCI allows users to communicate with external devices efficiently. The imagery-based handwriting paradigm is widely used in the BCI field, where neural information generated by internally imagined actions is translated into control signals. Two main deep learning strategies exist for decoding imagined-character tasks: CNN-based and RNN-based3,4,5,6,7.

CNNs have an advantage in perceiving local information. Pei et al.5 first applied independent component analysis to decompose electroencephalogram (EEG) signals and extract features for training a CNN-based classifier to recognize handwritten characters. Ullah et al.3 introduced a system that uses a deep convolutional neural network to recognize visual/mental imagery of English alphabet letters, enabling direct typing through brain signals. However, CNNs are limited in processing the temporal features of neural signals because of the fixed size of the convolution kernel6,7.

RNNs exhibit advantages in processing sequence data, such as text and time series8,9,10,11,12,13. Willett et al.14 introduced a brain-to-text communication method that records neural signals from the motor cortex and decodes them with a recurrent neural network. Sun et al.15 proposed the Brain2Char architecture, which decodes character sequences from electrocorticography (ECoG) signals using 3D Inception layers, bidirectional recurrent layers, and dilated convolution layers. However, RNNs suffer from vanishing or exploding gradients when processing long data sequences16.

We argue that CNN and RNN decoding techniques restrict the capture of global context information in neural signals. Moreover, the temporal and spatial characteristics of neural signals overlap during activity17, and this interplay cannot be fully captured by CNN or RNN approaches.

The Transformer is a deep neural network that uses the self-attention mechanism to capture long-range dependencies in sequential data18,19,20,21. It has been applied to neural signal decoding, exploiting its ability to capture temporal information along a time series22,23. Furthermore, cross-attention is a technique employed in neural networks to identify intricate connections between multiple sequences or multimodal data24. Deployed in the Transformer, it allows one sequence to attend to another, establishing dependencies between them. Inspired by cross-attention24, we introduced a Temporal-Spatial Cross Attention Network (TSCA-Net) for decoding imagined characters. The contributions of this paper are as follows:

  • We cascaded LSTM and Transformer in a Temporal Feature (TF) module to extract temporal features from MEA signals. By harnessing the benefits of LSTM’s gating mechanism, we capture continuous temporal features within the time dimension of MEA signals. Subsequently, we employ the Transformer to capture the global dependencies among distinct temporal sequences.

  • We introduced the Transformer into the Spatial Feature (SF) module for the extraction of spatial features. It allowed the Transformer to capture long-term dependencies in the channel sequence, enabling the extraction of spatial features from the MEA signals.

  • We developed a Temporal-Spatial Cross (TSCross) module that comprises two essential sub-modules, namely TSCross-SingleT and TSCross-SingleC. It is designed to extract the interplay between temporal and spatial features.

The organization of this paper is as follows. The section “Introduction” presents a concise overview of previous studies that have employed deep learning models for feature extraction from character signals. The section “Methods” introduces the overall structure of TSCA-Net and provides details about its key components. The section “Experiments” compares the classification accuracy of TSCA-Net with that of other models and reports ablation experiments assessing the importance of each component. Finally, we summarize the contributions of this paper and identify future research directions.

Figure 1

The overall architecture of TSCA-Net. (a) TSCA-Net consists of four modules: TF, SF, TSCross, and Classifier, as illustrated. (b) The TF module combines cascaded LSTM and Transformer models to extract temporal features from MEA signals. (c) In the SF module, the Transformer model is employed to capture long-term dependencies among the channel sequences. (d) The TSCross module leverages a cross-attention mechanism to capture the interdependencies of both temporal and spatial features. It comprises two submodules, namely TSCross-SingleT and TSCross-SingleC. The Classifier module, consisting of a global pooling layer and two fully connected layers, is utilized to predict the category labels of input MEA signals. Notably, only the encoder of the Transformer model is deployed within the TSCA-Net framework.

Methods

To effectively capture the temporal and spatial features of imagined character signals, we developed the TSCA-Net model, which consists of four modules: TF, SF, TSCross, and Classifier, as shown in Fig. 1. The detailed procedures for each module in TSCA-Net are described below.

Overall architecture

The TF module was implemented to capture long-term temporal dependencies in neural signals. The time series of the signals was first input into the LSTM module, which captures continuous temporal features. The resulting temporal representation was then fed into the Transformer module, which extracts the inter-dependencies among different positions by calculating attention scores25.
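For illustration, a minimal PyTorch sketch of this cascade is given below. The input dimensions (201 time steps, 192 channels) follow the dataset described later, and the width of 128 with 16 attention heads mirrors the defaults reported in the ablation study; the class and argument names are ours, not the original implementation.

```python
import torch
import torch.nn as nn

class TemporalFeature(nn.Module):
    """Sketch of a TF-style module: LSTM followed by a Transformer encoder.

    Input is a batch of MEA trials shaped (batch, T, C), e.g. T=201 time
    steps and C=192 electrodes for the handwriting dataset.
    """
    def __init__(self, n_channels=192, d_model=128, n_heads=16, n_layers=1):
        super().__init__()
        # LSTM captures continuous temporal dynamics along the time axis.
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=d_model,
                            batch_first=True)
        # Transformer encoder models global dependencies among time steps.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, x):                 # x: (batch, T, C)
        h, _ = self.lstm(x)               # (batch, T, d_model)
        return self.encoder(h)            # (batch, T, d_model)

# Example: one batch of 8 trials, 201 time steps, 192 channels.
tf = TemporalFeature()
out = tf(torch.randn(8, 201, 192))        # -> (8, 201, 128)
```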

The SF module was designed to capture the spatial relationships between different channels on a global scale. In the SF procedure, the channel series of the signals is input directly into the Transformer module, which computes attention scores to capture the inter-dependencies among channels. Unlike the TF, the SF uses the raw channel sequences as input.

The TSCross module captures the correlation between temporal and spatial features of neural signals. Due to the distinct dependency relationship between time and channels, the TSCross module was divided into two sub-modules: TSCross-SingleT and TSCross-SingleC. TSCross-SingleT computed the attention degree of acquisition channels at different time points, while TSCross-SingleC computed the attention degree of different time points on the acquisition channels.

In the TSCross-SingleT processing flow, both the channel sequences of the neural signals and the representation obtained by the TF were fed into the TSCross. The channel sequences were encoded as the query vector to calculate the attention scores, while the representation obtained was encoded as the key and value vectors for attention calculation.

Similarly, in the TSCross-SingleC processing flow, the time sequence and the representation captured by the SF were used as inputs to the TSCross. The time sequence was encoded as the query vector for attention calculation, while the obtained representation was encoded as the key and value vectors.
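The asymmetric use of queries, keys, and values described above can be sketched in PyTorch as follows; the tensor shapes are illustrative, and an embedding of the raw channel sequence to the model width is assumed rather than taken from the original code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Illustrative cross-attention block in the style of TSCross."""
    def __init__(self, d_model=128, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query_seq, context_seq):
        # query_seq attends to context_seq: Q comes from one stream,
        # K and V from the other, as in TSCross-SingleT / TSCross-SingleC.
        out, _ = self.attn(query=query_seq, key=context_seq, value=context_seq)
        return out

# TSCross-SingleT style: the channel sequence (C tokens, embedded to d_model)
# queries the temporal representation produced by the TF module (T tokens).
channel_tokens = torch.randn(8, 192, 128)   # (batch, C, d_model)
tf_tokens = torch.randn(8, 201, 128)        # (batch, T, d_model)
fused = CrossAttention()(channel_tokens, tf_tokens)   # -> (8, 192, 128)
```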

The Classifier is responsible for mapping the features of the input data to category labels or probability distributions. It takes the fused representation captured by TF, SF, and TSCross as its input. The fused features pass through two fully connected layers before being fed into the softmax function, which generates a probability distribution over the possible classes. The predicted label is the category with the highest probability.
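A minimal sketch of such a classifier head is shown below, assuming the fused representation is a token sequence of width 128; the hidden width of 64 is an illustrative choice, not a reported hyperparameter.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Global pooling followed by two fully connected layers and softmax."""
    def __init__(self, d_model=128, hidden=64, n_classes=26):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)      # global pooling over tokens
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (batch, tokens, d_model)
        x = self.pool(x.transpose(1, 2)).squeeze(-1)  # (batch, d_model)
        logits = self.fc2(torch.relu(self.fc1(x)))    # (batch, n_classes)
        return torch.softmax(logits, dim=-1)          # class probabilities

probs = Classifier()(torch.randn(8, 201, 128))        # -> (8, 26)
```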

Multi-head self-attention

The Multi-head Self-Attention (MSA) mechanism allows the model to focus on multiple positions of a sequence simultaneously26. Each element of the input sequence is mapped to three vectors: query, key, and value, and the output is computed as a weighted sum of the value vectors. The MSA consists of several heads, each with its own query, key, and value projections, which attend to different pieces of information within the sequence. Finally, the output vectors of all heads are combined through concatenation (or weighted averaging) to produce the final output.

$$\begin{aligned} &\text {MSA}(Q, K, V)={\text {concat}}\left( \text {head}_1, \ldots , \text {head}_{N_h}\right) W^o\\&\text {head}_i=\text {Attention}\left( Q W_i^Q, K W_i^K, V W_i^V\right) \\&\text {Attention}\left( Q_i, K_i, V_i\right) ={\text {Softmax}}\left( \frac{Q_i K_i^{\text {T}}}{\sqrt{d_k}}\right) V_i \end{aligned}$$
(1)

The MSA formula is shown in Eq. (1), where \(\text {head}_i\) refers to the i-th attention head out of a total of \(N_h\) heads. The linear transformation matrix that maps the concatenated head outputs to the output space is denoted as \(W^o\in {\mathbb {R}}^{N_h d_{v}\times d_{\text {model}}}\), where \(d_{\text {model}}\) represents the dimension of the input embedding. The query, key, and value vectors of the i-th head are given by \(Q W_i^Q\), \(K W_i^K\) and \(V W_i^V\), respectively, where \(W_i^Q\in {\mathbb {R}}^{d_{\text {model}}\times d_q}\), \(W_i^K\in {\mathbb {R}}^{d_{\text {model}}\times d_k}\), and \(W_i^V\in {\mathbb {R}}^{d_{\text {model}}\times d_v}\) are the corresponding weight matrices, and \(d_q\), \(d_k\) and \(d_v\) are the dimensions of the query, key, and value vectors.
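For reference, the computation in Eq. (1) can be written compactly in PyTorch as below, with the per-head projections stacked into single linear layers; this is a generic sketch rather than the TSCA-Net implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention following Eq. (1)."""
    def __init__(self, d_model=128, n_heads=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)    # stacks all W_i^Q
        self.k_proj = nn.Linear(d_model, d_model)    # stacks all W_i^K
        self.v_proj = nn.Linear(d_model, d_model)    # stacks all W_i^V
        self.out_proj = nn.Linear(d_model, d_model)  # W^o

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, s, _ = x.shape
        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # scaled dot product
        heads = torch.softmax(scores, dim=-1) @ v    # (batch, heads, seq, d_k)
        heads = heads.transpose(1, 2).reshape(b, s, -1)  # concatenate heads
        return self.out_proj(heads)

y = MultiHeadSelfAttention()(torch.randn(8, 201, 128))   # same shape as input
```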

Relative position embedding

The traditional Transformer has only a weak notion of position within the input sequence26,27,28; therefore, a position embedding is required to enhance positional awareness when modelling with a Transformer. The ViT model introduced a learnable absolute position embedding that adds a fixed vector to each position of the input sequence18. However, this method cannot accurately depict the relative relationships between positions.

To enhance the positional awareness of the Transformer, TSCA-Net deploys a relative position embedding that dynamically encodes the pairwise offsets between positions29. Specifically, the relative position information in TSCA-Net is represented by a two-dimensional matrix whose dimensions equal the length of the input sequence. During the attention computation, the position biases indexed via the relative position matrix are added to the attention scores. The enhanced attention is given by Eq. (2), where B denotes the relative position bias.

$$\begin{aligned} \text {Attention}\left( Q,K,V\right) =\text {SoftMax}\left( QK^\text {T}/\sqrt{d}+B\right) V, \end{aligned}$$
(2)
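One simple way to realize such a bias, assuming a learnable table indexed by the relative offset between positions (the exact parameterization used in TSCA-Net may differ), is sketched below.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable bias B[i, j] indexed by the relative offset (i - j)."""
    def __init__(self, seq_len, n_heads=16):
        super().__init__()
        # One bias per head and per possible offset in [-(L-1), L-1].
        self.table = nn.Parameter(torch.zeros(n_heads, 2 * seq_len - 1))
        idx = torch.arange(seq_len)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + seq_len - 1)

    def forward(self, scores):                       # scores: (batch, heads, L, L)
        bias = self.table[:, self.rel_idx]           # (heads, L, L)
        return scores + bias.unsqueeze(0)            # add B before the softmax

# Usage inside attention: softmax((Q K^T / sqrt(d)) + B) V, as in Eq. (2).
scores = torch.randn(8, 16, 201, 201)
biased = RelativePositionBias(seq_len=201)(scores)
```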

Cross-entropy loss function

The cross-entropy function is a convex function that helps prevent the model from getting stuck in suboptimal solutions. During backpropagation, the model parameters are updated continuously using gradient information from the loss function30. The detailed formula is defined in Eq. (3), where \(N_t\) represents the number of trials and \(N_c\) represents the number of categories. \(y_{m}^{n}\) and \({\hat{y}}_m^n\) denote the ground-truth label and the predicted probability of the n-th category for the m-th trial, respectively.

$$\begin{aligned} {\mathscr {L}}=-\frac{1}{N_t} \sum _{n=1}^{N_c} \sum _{m=1}^{N_t} y_{m}^{n} \log \left( {{\hat{y}}_m^n}\right) , \end{aligned}$$
(3)
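In a PyTorch implementation, Eq. (3) corresponds to the standard nn.CrossEntropyLoss, as illustrated by this minimal example.

```python
import torch
import torch.nn as nn

# Eq. (3) is the standard multi-class cross-entropy. In PyTorch,
# nn.CrossEntropyLoss fuses log-softmax with the negative log-likelihood,
# so it is applied to raw logits rather than to softmax outputs.
criterion = nn.CrossEntropyLoss()

logits = torch.randn(8, 26, requires_grad=True)   # N_t = 8 trials, N_c = 26 classes
labels = torch.randint(0, 26, (8,))               # ground-truth letter indices
loss = criterion(logits, labels)
loss.backward()                                   # gradients for parameter updates
```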

Experiments

Data acquisition and preprocessing

To evaluate the effectiveness of our proposed model, we conducted experiments on the Imagery Handwritten Character dataset14. The dataset records the neural activity of one participant from two microelectrode arrays placed in the hand area of the precentral gyrus over 10 days. The participant had a high-level spinal cord injury and was paralyzed from the neck down, with hand movements limited to twitching and micromotion. He was instructed to imagine holding a pen and writing letters and symbols by hand on ruled paper, as if not paralyzed.

The raw handwriting-imagination dataset comprises 26 English letters and five special characters (commas, apostrophes, question marks, periods, and spaces). The special characters serve as pauses or cues during the imagined handwriting. After a time-warping step was applied, as described in the literature14, the five special characters were removed from the dataset. By merging the data collected over the ten days, we obtained 3,172 trials of spiking activity covering the 26 English lowercase letters. Each trial was represented as a two-dimensional matrix of 201 time steps by 192 electrodes. In our experiments, we trained the model to classify the 26 lowercase letters.
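For orientation, the resulting trial layout can be pictured as follows; the arrays here are placeholders and do not reflect the released file format.

```python
import numpy as np

# Illustrative layout: after time-warping and merging the ten sessions, the
# data can be arranged as 3,172 trials x 201 time steps x 192 electrodes,
# with one integer label (0-25) per trial for the 26 lowercase letters.
n_trials, n_steps, n_channels = 3172, 201, 192
X = np.zeros((n_trials, n_steps, n_channels), dtype=np.float32)  # spiking activity
y = np.zeros(n_trials, dtype=np.int64)                           # letter labels in [0, 25]
print(X.shape, y.shape)   # (3172, 201, 192) (3172,)
```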

Implementation details of the TSCA-Net

To explain the TSCA-Net employed in the experiments, we presented detailed information about its key modules: TF, SF, TSCross, and Classifier. Specific implementations and parameters are listed in Table 1.

Table 1 Details of the TSCA-Net Architecture. In the table, C represents the number of MEA channels, T represents the number of time points, and N represents the number of output classes.

Platform environment

Our experiments were conducted on a Linux operating system with the TITAN XP GPU. Our model was implemented with PyTorch 1.10.2 and Python 3.7.12. We utilized the Adam optimizer with a weight decay of 0.0001 and an initial learning rate of 3e-5, and trained our model for 400 epochs with a batch size of 8.
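A minimal training configuration matching these settings is sketched below; a trivial stand-in model and random data are used so the snippet is self-contained, whereas in practice the TSCA-Net model and the real dataset would be plugged in.

```python
import torch
import torch.nn as nn

# Settings from the text: Adam, lr = 3e-5, weight decay = 1e-4,
# 400 epochs, batch size 8.
model = nn.Sequential(nn.Flatten(), nn.Linear(201 * 192, 26))   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

data = torch.utils.data.TensorDataset(torch.randn(32, 201, 192),
                                      torch.randint(0, 26, (32,)))
loader = torch.utils.data.DataLoader(data, batch_size=8, shuffle=True)

for epoch in range(400):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```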

Metrics

To evaluate the performance of our proposed model, we used four metrics: accuracy, precision, recall, and F1 score. Accuracy is the proportion of correctly classified samples, precision measures the proportion of true positive samples among all samples classified as positive, and recall measures the proportion of true positive samples among all actual positive samples. The F1 score, the harmonic mean of precision and recall, provides a balanced measure of the model’s performance. The formulas for these metrics are given in Eqs. (4), (5), (6), and (7). Because of the limited size of the experimental dataset, we conducted five-fold cross-validation: the dataset was split into training and testing sets in a 4:1 ratio, the metrics were computed on the testing set of each fold, and the average over the five folds was used to assess the model’s performance.

$$\begin{aligned} {\textrm{Accuracy}}= \frac{TP+TN}{TP+FN+TN+FP}\times 100\%, \end{aligned}$$
(4)
$$\begin{aligned} {\textrm{Precision}}= \frac{TP}{TP+FP}\times 100\%, \end{aligned}$$
(5)
$$\begin{aligned} {\textrm{Recall}} = \frac{TP}{TP+FN}\times 100\%, \end{aligned}$$
(6)
$$\begin{aligned} {\textrm{F1}} = \frac{2\times {\textrm{Precision}} \times {\textrm{Recall}}}{{\textrm{Precision}}+{\textrm{Recall}}}\times 100\%. \end{aligned}$$
(7)

where TP is the true positive, FN is the false negative, TN is the true negative, and FP is the false positive.
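These metrics and the five-fold protocol can be reproduced, for example, with scikit-learn as sketched below; macro averaging over the 26 classes is an assumption, and the features and predictions here are random placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Macro-averaged metrics over the 26 classes, reported as percentages."""
    return {
        "accuracy":  100 * accuracy_score(y_true, y_pred),
        "precision": 100 * precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall":    100 * recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1":        100 * f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

# Five-fold split (4:1 train/test), metrics averaged over the folds.
X = np.random.randn(3172, 8)                 # placeholder features
y = np.random.randint(0, 26, 3172)           # placeholder labels
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    y_pred = np.random.randint(0, 26, len(test_idx))   # stand-in predictions
    fold_scores.append(evaluate(y[test_idx], y_pred))
mean_acc = np.mean([s["accuracy"] for s in fold_scores])
```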

Comparison experiments

We conducted comparison experiments with seven commonly used classification models, namely EEG-Net31, EEG-TCNet32, GRU14, LSTM33, S3T22, R-Transformer34, and ViT18, to evaluate the performance of TSCA-Net in decoding signals for classification tasks.

  • EEG-Net and EEG-TCNet are convolution-based models that utilize CNNs or spatiotemporal convolutional neural networks to extract temporal and spatial features from EEG signals.

  • GRU and LSTM are recurrent neural networks used for time series analysis and sequence modeling.

  • S3T, R-Transformer, and ViT are three Transformer-type models. S3T, developed at South China University of Technology, models the spatiotemporal features of EEG signals. R-Transformer, developed at Michigan State University, combines RNNs with multi-head attention. ViT, a vision transformer developed by Google, addresses long-range dependencies in image classification.

Table 2 Performance comparison of TSCA-Net with seven state-of-the-art models.

After the five-fold cross-validation experiments, the eight models were evaluated in terms of accuracy, precision, recall, and F1 score; the results are presented in Table 2. TSCA-Net achieved the highest performance, with accuracy, precision, recall, and F1 score of 92.66\(\%\), 92.77\(\%\), 92.70\(\%\), and 92.58\(\%\), respectively. EEG-TCNet showed the weakest performance among all the compared models. Of the two CNN models, EEG-Net performed slightly better than EEG-TCNet, possibly because its simpler architecture is easier to train and optimize. Of the two RNN models, LSTM exhibited better classification performance than GRU. Among the Transformer models, ViT had the lowest performance, trailing the other models of this type. Even so, the Transformer models as a group outperformed the CNN models, which suggests that Transformer models are more effective than CNN models for recognizing imagined characters.

Table 3 Cross-validation results of comparative models.
Figure 2

Statistics of standard deviation and mean of accuracy.

The statistical results of the comparative models obtained through the five-fold cross-validation experiments are shown in Table 3. TSCA-Net achieved the highest mean accuracy and the lowest standard deviation across the five folds. Moreover, Fig. 2 shows that even the lowest accuracy of TSCA-Net exceeds the best performance of the other comparison models. Overall, TSCA-Net performed consistently well and outperformed the other comparison models.

Figure 3

Confusion matrix. The horizontal axis represents the predicted categories by the TSCA-Net, and the vertical axis represents the true labels of the 26 lowercase letters.

Table 4 The precision and recall of TSCA-Net for recognizing 26 characters.
Figure 4

Visualization of data distribution using t-SNE.

From the confusion matrix in Fig. 3, we can see that the diagonal elements have the darkest colors, indicating that TSCA-Net has relatively low error rates in identifying imagery handwritten characters. The model achieved 100\(\%\) precision and recall for characters such as ’m’ and ’u’.

To further validate the performance of TSCA-Net, we compared the distribution of the raw character data with the distribution of the features extracted by TSCA-Net. In Fig. 4a, the distribution of the 26 letters in the raw dataset appears highly chaotic: characters of the same category do not form clusters in the 2D space and are distributed almost randomly. In contrast, Fig. 4b shows that after feature extraction by TSCA-Net, the 26 letters form 26 distinct clusters; letters of the same category are tightly grouped, and the boundaries between categories are clearly distinguishable.

Table 4 reports the per-character precision and recall of the best TSCA-Net model for character signal decoding, which achieved an overall accuracy of 94.30\(\%\). The table shows that the model attained 100\(\%\) precision and recall for the characters ’m’ and ’u’. However, the precision for character ’c’ was the lowest among all characters, as was the recall for character ’r’. These findings suggest that the proposed TSCA-Net model is a robust and reliable method for decoding MEA signals in imagery handwriting tasks and that it outperforms the other models.

Ablation experiments

In the ablation experiments, we analyzed the effect of three factors on the performance of the TSCA-Net model: its components, the position embedding, and the core parameters of the multi-head attention.

Ablation experiments of model components

In our experiments evaluating TSCA-Net, we analyzed the impact of three components: TF, SF, and TSCross. The results are presented in Table 5. Our findings show that the absence of the TF has the most significant impact on the accuracy of the TSCA-Net model: the accuracy decreased by 5.69\(\%\) compared with the optimal model. Similarly, the absence of the SF component resulted in a decrease of 2.44\(\%\), with the accuracy of TSCA-Net dropping to 91.86\(\%\).

Table 5 Ablation study of TSCA-Net components. The effects of LSTM and Transformer in TF, SingleT, and SingleC in TSCross, and SF modules on the performance of the TSCA-Net model were explored. Here SingleT represents TSCross-SingleT, SingleC represents TSCross-SingleC.

As a part of our study, we analyzed the effect of TSCross-SingleT and TSCross-SingleC components on accuracy, both individually and combined. We observed that the removal of both components resulted in a decrease in accuracy by 4.43\(\%\). However, when only TSCross-SingleT was removed, the accuracy decreased by 2.85\(\%\), whereas the removal of only TSCross-SingleC caused a decrease of 1.89\(\%\).

The experiments suggest that the temporal features have the most significant impact on character recognition accuracy, followed by the spatiotemporal features acquired through the spatiotemporal interaction module, while the impact of the spatial features is comparatively small. We also found that using the spatial channels, rather than time, as the query in the MSA of the TSCross module has the larger impact on character recognition accuracy.

Table 6 Ablation study of position embedding. Absolute, Sin/Cos, Relative, and None are four distinct position embedding methods, where None indicates the absence of position embedding in the model.

Ablation experiments of position embedding

In the ablation experiments, we assessed the accuracy of four position embedding methods: relative, absolute, Sin/Cos, and none. From the experimental results presented in Table 6, it can be concluded that the TSCA-Net accuracy is highest using relative position embedding, while Sin/Cos position embedding produces the lowest accuracy. Specifically, the accuracy with relative position embedding is 2.56\(\%\) higher than without position embedding, 3.00\(\%\) higher than with absolute position embedding, and 3.35\(\%\) higher than with Sin/Cos position embedding. These findings suggest that relative position embedding is the most effective approach to improve the accuracy of TSCA-Net in character recognition tasks.

Table 7 Ablation study of the multi-head self-attention with different hyperparameters. Heads and Dims represent the number of attention heads and the embedding dimension, respectively.

Ablation experiments of the core parameters of the MSA

We performed experiments to determine the core parameters of multi-head attention in the TSCA-Net model. Specifically, we varied the number of heads and the encoding dimension. Based on the results presented in Table 7, we found that setting the number of heads to 16 and the encoding dimension to 128 led to the highest character recognition accuracy of 94.30\(\%\). Therefore, we set these values as the default for the TSCA-Net model.

Conclusion

This paper proposed TSCA-Net, a spatiotemporal cross-attention network for recognizing imagined characters. TSCA-Net consisted of four modules: TF, SF, TSCross, and Classifier. The TF module combined LSTM and Transformer to capture the temporal features of neural signals, while the SF module focused on capturing spatial information. The TSCross module was introduced to learn the relationship between the temporal and spatial features.

Experimental results on the Imagery Handwritten Character dataset demonstrated that TSCA-Net outperformed other comparison models in accuracy, precision, recall, and F1 score. Additionally, TSCA-Net provided an effective approach for extracting MEA signal features in brain-computer interface systems.

However, we manually selected the hyperparameters of TSCA-Net based on cross-validated recognition accuracy. The absence of nested cross-validation for hyperparameter selection introduces a potential risk of overfitting, a limitation to be addressed in future research.