Introduction

Language facilitates human communication and information exchange, and the human brain plays a crucial role in this process, enabling the complex cognitive operations that support linguistic perception, comprehension and production. Despite advancing insights into the human brain, linguistic representation and processing remain underexplored, making it challenging to explain their internal mechanisms intuitively1. Deep learning presents a viable solution for understanding language in the brain by utilizing large-scale trainable parameters to map the correlation between external stimuli and neural activity. This paper summarizes representative solutions and current progress in linguistic neural decoding, addressing the potential gains from leveraging large language models (LLMs). Progress in this field involves the joint efforts of neuroscientists and artificial intelligence researchers. We introduce the neurological foundations supporting neural decoding with deep networks and illustrate multiple model architectures. We classify task forms into several standardized paradigms to help researchers advance their work, and conclude by discussing the challenges faced by related fields and proposing directions for potential applications. It is important to note that the language discussed in this paper is a synthesis of semantic and syntactic information, featuring specific content presented in a defined format, primarily in text and speech form. Visual image reconstruction is excluded, as it contains semantic content but lacks a linguistic syntactic presentation. Similarly, motor tasks such as handwriting are not considered, given their reliance on bodily movements and minimal relevance to language.

Supplementary Note 1 summarizes the main content of this survey. We begin by discussing the neurological basis of linguistic decoding. Neural tracking ensures the temporal alignment of brain responses with linguistic properties, while continuous neural prediction supports the integration of contextual information. Stimuli recognition is the simplest form of neural decoding, involving the differentiation of linguistic stimuli by analyzing the subject's evoked brain responses. For text stimuli reconstruction, decoding is performed at the word or sentence level using classifiers, embedding models, and custom network modules. Considering the dynamics of the speech stream, restoring the speech envelope, mel-frequency cepstral coefficients (MFCC), and speech waveforms presents broader challenges. Brain recording translation paradigms are applied in natural reading and listening scenarios, where the decoding system generates the stimulus sequence in textual or speech form from the evoked brain response. This task is analogous to machine translation, treating brain activity as the source language and translating it into human-understandable text. Speech neuroprosthesis focuses on decoding inner or vocal speech based on human intentions; the field has progressed from phoneme-level recognition to open-vocabulary sentence decoding. Brain-to-speech technology is a promising direction, with spectrograms generated through matching algorithms or by considering speech properties, synthesizer parameters, and articulator movements. Additionally, to assist neuroscientists and artificial intelligence researchers in developing better decoding systems, we introduce evaluation metrics adapted from deep learning tasks before presenting the brain decoding solutions (Section 2), and provide a concise summary of the machine learning models and algorithms discussed in this review (Supplementary Note 1). Compared to previous reviews of neural decoding2,3, our article includes recent advances and expands the task formats to a larger scope. Furthermore, our work focuses on the specification of task paradigms and methodology, which complements work on the internal mechanisms of language models and human language systems4.

Brain-network alignment

Brain signal recordings measure and quantify the biometric neural response of the human brain and can be divided into two categories: invasive and non-invasive. Non-invasive methods, including functional magnetic resonance imaging (fMRI), electroencephalography (EEG) and magnetoencephalography (MEG), are affected by transcranial attenuation and therefore suffer from a lower signal-to-noise ratio (SNR)5. Invasive methods such as electrocorticography (ECoG), on the other hand, are hampered by limited public availability due to the necessity of neurosurgery. The essence of deep learning lies in leveraging the inherent correlations within data to perform prediction, regression and generation, and the alignment of neural activities and linguistic representations is crucial for enabling these capabilities. Specifically, neural tracking makes temporally continuous decoding from evoked brain activity theoretically possible, while the neural prediction process underscores the benefit of contextual information integration, which is commonly exploited in current neural decoding approaches.

In this paper, the process by which the brain receives external language stimuli and transforms them into specific neural representations is referred to as perception, which primarily involves neural encoding. During this process, external stimuli are transformed into specific neural response patterns, with neural tracking ensuring the association between language and neural representations6, as shown in Fig. 1a. Cortical activity automatically tracks the dynamics of speech as well as various linguistic properties, including surprisal, phonetic sequences, word sequences, and other linguistic representations7,8,9,10. Only a minor time shift has been observed between stimulus presentation and neural response, which ensures the temporal alignment of brain recordings with linguistic representations and facilitates serialized, temporal modeling of cortical activity. As shown in Fig. 1b, language stimuli are encoded into regular evoked brain responses. In contrast, linguistic neural decoding aims to reconstruct the perceived stimuli or the expressed intention from high-dimensional brain responses. In Fig. 1c, the brain undergoes the following processes in communication: perception converts external linguistic stimuli into specific neural patterns; comprehension involves steps such as semantic extraction, understanding, and reasoning; generation (production) entails outputting responses in a specific form, for example, by guiding the vocal organs to produce speech. In natural listening settings, the human brain encodes a wide range of acoustic features and processes external language stimuli temporally through prediction, highlighting the importance of contextual information in cortical perception, even at the level of single neurons11,12. Predictive processing fundamentally shapes the comprehension mechanisms, occurring hierarchically at both the acoustic and linguistic levels13,14,15,16. This phenomenon underscores the profound impact of context on the forecasting and tracking of ongoing speech streams, necessitating the use of contextual representations to investigate cortical responses17,18,19. This characteristic resembles language models constructed from neural networks, where the same stimuli presented in varying contexts are mapped onto diverse semantic features. Despite ample evidence supporting the predictive characteristics of human language processing, it has recently been suggested that the benefits actually come from the capacity of models to predict brain responses20. Regardless of the mechanisms underlying perception, language models have the potential to understand and infer neural responses.

Fig. 1: The formation of linguistic representation in the human brain.

a The human brain tracks the dynamic flow of speech and linguistic properties with minor response delay, and the neural response is performed in a continuous predictive manner. b The human brain and the neural networks can both encode textual or verbal stimuli into specific representations, and the decoding process aims to reconstruct the linguistic information. c In vocal communications, the brain processes the perception, comprehension, and generation of language. The processor and communication icons are from Vecteezy and Dreamstime.

When processing natural language, artificial neural networks exhibit patterns of functional specialization similar to those of cortical language networks21. Research, particularly on Transformers and LLMs, shows that the representations in these models account for a significant portion of the variance observed in the human brain22,23. Extending this analogy, it has been verified that brain encoding models and pre-trained LLMs follow scaling laws, in which model performance increases as the number of parameters grows; given sufficient data and other necessary conditions, this indicates that larger systems will be needed to bridge brain activity patterns and human linguistic representations24,25. A recent study26 has indicated that, in addition to model scaling, the amount of data utilized during training positively influences the similarity of representations between the brain and neural networks. Furthermore, alignment training is deemed an effective approach to enhancing this similarity.
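To make the encoding-model comparison concrete, the following minimal sketch (not drawn from any specific study above; all sizes and data are illustrative) predicts brain responses voxel-wise from stand-in LLM contextual embeddings with ridge regression and scores each voxel by Pearson correlation, the usual way the variance explained by model representations is quantified.

```python
# Minimal sketch of a voxel-wise encoding model: predict brain responses from
# (hypothetical) LLM contextual embeddings with ridge regression, then score
# each voxel by the Pearson correlation between predicted and held-out data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trs, n_features, n_voxels = 600, 768, 2000       # illustrative sizes
X = rng.standard_normal((n_trs, n_features))        # stand-in for LLM embeddings per TR
Y = X @ rng.standard_normal((n_features, n_voxels)) * 0.05 + rng.standard_normal((n_trs, n_voxels))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)

model = Ridge(alpha=100.0)                          # alpha normally chosen by cross-validation
model.fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

def pearson_per_column(a, b):
    """Per-voxel Pearson correlation between measured and predicted responses."""
    a = (a - a.mean(0)) / (a.std(0) + 1e-8)
    b = (b - b.mean(0)) / (b.std(0) + 1e-8)
    return (a * b).mean(0)

scores = pearson_per_column(Y_te, Y_hat)
print(f"median voxel correlation: {np.median(scores):.3f}")
```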

Neural decoding division and evaluation

Linguistic neural decoding aims to generate the corresponding external stimuli or inner intention from the activated brain signals. The field has lacked a fine-grained taxonomy, which has hindered systematic progress. In this review, previous research is categorized according to the experimental design, stimulus type and decoding target (Table 1). Stimuli recognition is the simplest form and usually requires a modest candidate set and limited sequence length. For speech stimuli, in addition to identifying the textual content, some work considers reconstructing simple speech features and waveforms. These tasks are typically treated as simple classification or regression. Brain recording translation differs in its ability to handle open-vocabulary continuous decoding, which implies a sharply larger search space and results in degraded accuracy unless invasive signals are introduced. This task focuses on semantic consistency rather than the exact identity of the text. Speech neuroprosthesis aims to generate inner speech from spontaneous neural activation patterns: the subjects do not receive external stimuli but perform imagined- or attempted-speech tasks. Researchers have achieved high-precision, word-level continuous decoding with invasive recordings.

Table 1 Divided categories and their corresponding characteristics

As an interdisciplinary field of neuroscience and artificial intelligence, early work on neural decoding mainly follows the paradigms of classification, recognition and sequence decoding. These paradigms are closely related to machine translation (MT), text-to-speech (TTS), and automatic speech recognition (ASR). Table 2 summarizes the evaluation metrics. In the textual stimulus classification paradigm, accuracy is widely used to measure the percentage of correct instances. For sequential decoding, ASR and MT generate text sequences under distinct accuracy requirements; the evaluation metrics for the latter focus on semantic consistency and are extensively employed in brain recording translation. To be more specific, BLEU (bilingual evaluation understudy)27 calculates the precision of n-grams compared to reference translations, and ROUGE (recall-oriented understudy for gisting evaluation) pays more attention to recall. BERTScore28 is a recent metric leveraging deep contextualized embeddings from BERT29 to capture semantic similarity instead of matching exact n-grams. When invasive data are used, as in inner speech recognition for speech neuroprosthesis, ASR metrics become more applicable. WER (word error rate) is a common metric for ASR systems; it measures the accuracy of decoded hypotheses word by word. In addition to the word-level calculation, CER (character error rate) and PER (phoneme error rate) are computed at the character and phoneme level, respectively.
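For intuition, the sketch below gives minimal reference implementations of WER and a simplified single-reference BLEU; published evaluations normally rely on standard toolkits (e.g., jiwer, sacrebleu) rather than hand-rolled code like this.

```python
# Minimal reference implementations of WER and a simplified BLEU, for intuition only.
from collections import Counter
import math

def wer(ref, hyp):
    """Word error rate = (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i
    for j in range(len(h) + 1): d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(r)

def bleu(ref, hyp, max_n=4):
    """Geometric mean of modified n-gram precisions with a brevity penalty (single reference)."""
    r, h = ref.split(), hyp.split()
    precisions = []
    for n in range(1, max_n + 1):
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        h_ngrams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
        precisions.append((overlap + 1e-9) / (max(sum(h_ngrams.values()), 1) + 1e-9))
    bp = 1.0 if len(h) > len(r) else math.exp(1 - len(r) / max(len(h), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(wer("the brain decodes language", "the brain decoded language"))            # 0.25
print(round(bleu("the brain decodes language", "the brain decoded language"), 3))
```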

Table 2 Evaluation metrics for linguistic neural decoding

In natural listening and speaking scenarios, metrics derived from TTS are mainly used in speech reconstruction tasks, because the decoding outputs of both are speech waveforms. The simplest method is to calculate the statistical correlation between the generated and reference speech, with the PCC (Pearson correlation coefficient) being the most widely used; it measures the linear relationship between two continuous variables. STOI (short-time objective intelligibility)30 evaluates speech intelligibility and is designed to provide an objective measure that correlates well with human subjective intelligibility ratings. FFE (F0 frame error)30 and MCD (mel-cepstral distortion)31 evaluate the accuracy of pitch and MFCC, respectively, and have been widely used in TTS. MOS (mean opinion score) is commonly used to estimate the perceived quality of audio, video, and multimedia content. It provides a subjective measure of quality based on human judgments and typically uses a five-point scale on which participants rate the quality of synthesized speech segments.
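The sketch below illustrates two of these speech metrics, PCC between a reconstructed and a reference envelope and MCD between time-aligned mel-cepstral frames; the constant 10/ln(10)·√2 is the conventional MCD scaling, and the synthetic data are purely illustrative.

```python
# Small sketch of two speech-reconstruction metrics: Pearson correlation (PCC) and
# mel-cepstral distortion (MCD) between time-aligned frames (c0 conventionally excluded).
import numpy as np

def pcc(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.corrcoef(x, y)[0, 1])

def mcd(mc_ref, mc_hyp):
    """mc_*: arrays of shape (frames, n_cepstra); frames assumed already aligned (e.g., by DTW)."""
    diff = mc_ref[:, 1:] - mc_hyp[:, 1:]                       # drop the energy coefficient c0
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())                             # in dB, lower is better

rng = np.random.default_rng(0)
env_ref = rng.random(1000)
env_rec = env_ref + 0.3 * rng.standard_normal(1000)
print(f"PCC: {pcc(env_ref, env_rec):.2f}")

mc_a = rng.standard_normal((200, 25))
mc_b = mc_a + 0.1 * rng.standard_normal((200, 25))
print(f"MCD: {mcd(mc_a, mc_b):.2f} dB")
```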

Stimuli recognition

As shown in Fig. 2a, compared with fine-grained decoding, only a moderate set of candidates is required for stimulus recognition. The subjects passively receive external information by reading text or listening to podcasts, and deep learning methods are adopted to identify the original stimuli from the evoked brain signals.

Fig. 2: Stimuli recognition of evoked brain activity.

a An overview of the stimuli recognition task. The subject receives textual or vocal information while the active brain signals are collected. The raw brain recordings are processed into a feature space, followed by classifiers, networks or pre-trained models to distinguish the original stimuli depending on the complexity and candidate size. Several approaches adopted word embeddings (e.g., word2vec87) to compare the decoded vector in a semantic space. b In natural listening scenarios, restoring the original speech features and waveform is a more complex task. Regression models (e.g., ridge regression), CNN- and RNN-based network modules, and prominent generative models (e.g., GANs) are widely used. c The decoding architecture for various speech-related targets. The speech envelope can be easily reconstructed with CNNs, while more complex networks are necessary for the decoding of MFCC61,65. The most difficult task is to synthesize the stimulus wave, where an encoder-generator-vocoder architecture has been shown to be effective70. The non-invasive collection icon is from Vecteezy.

Textual stimuli classification

Language presented as text highly condenses information and avoids the temporal variability of the corresponding speech signal. Early work focused on recovering language information from text stimuli. This paradigm distinguishes the original information provided to the subject from several candidates. Early approaches defined a word set of concrete nouns to avoid neural representations of abstract concepts32,33, and classifiers were adopted to distinguish which word had been perceived by the subject. Subsequent studies extended the setting to abstract nouns, demonstrating the superiority of text-based models over visually grounded approaches34, and eight different word embedding models were evaluated for predicting neural activation patterns from word representations and vice versa35. In ref. 36, the researchers presented a neural decoding system based on a semantic space trained on massive text corpora; the decoded representations were detailed enough to differentiate between sentences with similar meanings. Larger vocabularies bring greater difficulties. In ref. 37, a network module with dense layers and a regression-based decoder was implemented to classify an fMRI scan directly over a 180-word vocabulary, and the recognition accuracy far exceeded chance level (5.22% top-1 and 13.59% top-5 accuracy). Following these achievements, researchers predicted masked words and phrases38; the proposed approach utilized an encoder-decoder paradigm and achieved 18.20% and 7.95% top-1 accuracy over a 2000-word vocabulary on the two tasks, respectively.
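A common pattern in these word-level studies is to regress brain features onto a word-embedding space and classify by nearest neighbor over the candidate vocabulary. The sketch below illustrates that pattern with synthetic data and placeholder embeddings; it is not the architecture of any particular study cited above.

```python
# Sketch of word-level decoding (illustrative data): regress brain features onto a
# word-embedding space, then pick the candidate word whose embedding is closest
# (cosine similarity) to the predicted vector.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
vocab = [f"word_{i}" for i in range(180)]                 # e.g., a 180-word candidate set
emb = rng.standard_normal((len(vocab), 300))              # stand-in for word2vec/GloVe vectors
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Simulated training pairs: brain feature vector -> embedding of the perceived word.
labels = rng.integers(0, len(vocab), size=800)
brain = emb[labels] @ rng.standard_normal((300, 2000)) * 0.1 + rng.standard_normal((800, 2000))

decoder = Ridge(alpha=10.0).fit(brain, emb[labels])

def decode_word(brain_vec, top_k=5):
    pred = decoder.predict(brain_vec[None, :])[0]
    pred /= np.linalg.norm(pred) + 1e-8
    scores = emb @ pred                                   # cosine similarity to every candidate
    return [vocab[i] for i in np.argsort(scores)[::-1][:top_k]]

print(decode_word(brain[0]), "target:", vocab[labels[0]])
```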

Starting from these approaches, some work treated sentence-level responses as a combination of latent word effects, bridging the relationship between the neural processing of individual words and of whole sentences39,40,41. Following these approaches, the holistic encoding of sentence stimuli was proposed42,43. Studies further evaluated various distributed semantic models for predicting or deciphering brain responses to textual sentences, with a Transformer-based model achieving the best performance44. Another classification task was performed at the passage level: the researchers predicted the evoked brain response during natural reading and classified the corresponding brain activity by its distance to the synthesized brain image45. In ref. 46, the approach bridged textual stimulus patterns and MEG recordings using multiple network architectures, with BERT showing the best performance.

Textual stimulus classification is greatly limited by its decoding range and is almost always performed on dozens or hundreds of candidates, which is far removed from real-world applications. As an initial attempt, however, this task illustrates the possibility of recovering textual information from the evoked cortex and has gradually developed into open-vocabulary sequence decoding.

Speech stimuli reconstruction

Speech perception entails processes that convert acoustic signals into neural representations. In neuroscience, this includes the complete pathway from the cochlear nerve to the auditory cortex areas. Previous research has demonstrated that the hierarchical structure in neural networks trained on speech representations aligns with that of the ascending auditory pathway, supporting the feasibility of deep learning approaches47.

Speech stimulus reconstruction aims to recover semantic information, acoustic features, and synthesized perceived speech from evoked brain activity (Fig. 2). Classifiers were used to distinguish perceived stimuli before deep learning methods were applied: logistic regression was used to classify the speech stimuli perceived by a subject unseen during training48. Inspired by ASR systems, phoneme-level Viterbi decoding was introduced to recognize the heard utterance in a question-answering setting49. Another work introduced a contrastive learning model inspired by CLIP50 to predict the correct segment out of 1000 possibilities51; it leveraged the correlation between speech waves and EEG/MEG time series, with wav2vec 2.052 and convolutional neural networks (CNNs) as the speech and brain modules, respectively. Research on content and subject recognition is not fully separable, since the speech stream can be identified in both spaces; one attempt adopted variational autoencoders to transform the EEG space into disentangled latent spaces representing the content and subject distributions, respectively53.
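The following PyTorch sketch conveys the CLIP-style objective behind the segment-identification approach described above: two encoders map paired brain and speech segments into a shared space and are trained with a symmetric InfoNCE loss. The small convolutional encoders, channel counts and segment sizes are placeholder assumptions, not the published models (which used wav2vec 2.0 features on the speech side).

```python
# Minimal sketch of CLIP-style contrastive alignment between brain segments and
# speech segments; the encoders here are small placeholder CNNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    def __init__(self, in_channels, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(128, dim, kernel_size=5, padding=2), nn.AdaptiveAvgPool1d(1),
        )
    def forward(self, x):                       # x: (batch, channels, time)
        z = self.net(x).squeeze(-1)
        return F.normalize(z, dim=-1)           # unit-norm embeddings

brain_enc = SegmentEncoder(in_channels=64)      # e.g., 64 EEG/MEG channels
speech_enc = SegmentEncoder(in_channels=80)     # e.g., 80 mel bands

def clip_loss(zb, zs, temperature=0.07):
    logits = zb @ zs.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(zb.size(0))
    # Symmetric InfoNCE: matched brain/speech pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

brain = torch.randn(32, 64, 300)                # 32 paired segments (illustrative sizes)
speech = torch.randn(32, 80, 300)
loss = clip_loss(brain_enc(brain), speech_enc(speech))
loss.backward()
```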

The speech envelope refers to the variations in amplitude and intensity of a speech signal over time. It plays a crucial role in speech perception and understanding, as our brains are tuned to these variations, which help us recognize speech sounds, syllables, and words54,55. Earlier work focused on signal processing and linear models to align the envelope representation with brain activity56. Later research implemented convolutional models57,58 or methods based on mutual information analysis59. In ref. 60, the researchers evaluated the envelope reconstruction performance of ridge regression, convolutional layers and fully connected layers. More in-depth research led to the development of VLAAI, a convolution-based architecture achieving more precise reconstruction61. Considering the highly robust correlation between the envelope and linguistic information, some studies extended this to a cocktail-party setting, where the attended speech envelope was predicted with a context-aware neural network62. A recent work adopted a Transformer-based encoder-decoder architecture63. Compared with the speech envelope, MFCC is a widely used feature in speech recognition that represents the short-term power spectrum of sound. The parallels between speech recognition and brain-to-text technologies inspired the prediction of MFCC from brain recordings using custom networks, regression and generative models64,65. Subsequent research extended this approach to various acoustic features, predicting 16 different types using an attention-based regression model66.
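As a concrete example of the linear baselines mentioned above, the sketch below implements a backward (stimulus-reconstruction) model: time-lagged EEG samples are stacked into features, ridge regression is fitted to the speech envelope, and the result is scored with Pearson correlation. All signals here are synthetic and the lag window is an arbitrary choice.

```python
# Sketch of a backward (stimulus-reconstruction) model for the speech envelope:
# stack time-lagged EEG samples and fit a ridge regression to the envelope.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
fs, n_channels, n_samples = 64, 64, 64 * 120         # 2 minutes at 64 Hz (illustrative)
eeg = rng.standard_normal((n_samples, n_channels))
envelope = np.convolve(eeg[:, 0], np.ones(16) / 16, mode="same") + 0.5 * rng.standard_normal(n_samples)

def lagged(x, n_lags):
    """Stack x(t), x(t-1), ..., x(t-n_lags+1) into one feature vector per time step."""
    feats = [np.roll(x, lag, axis=0) for lag in range(n_lags)]
    return np.concatenate(feats, axis=1)[n_lags:]     # drop wrapped-around rows

n_lags = 16                                           # 250 ms of context at 64 Hz
X, y = lagged(eeg, n_lags), envelope[n_lags:]
split = int(0.8 * len(y))
model = Ridge(alpha=1.0).fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("reconstruction PCC:", round(np.corrcoef(pred, y[split:])[0, 1], 3))
```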

Instead of reconstructing acoustic features, synthesizing speech directly from brain recordings is more challenging, yet it holds greater practical significance and application prospects. In ref. 67, the researchers opened up the possibility of speech restoration from evoked brain recordings; this approach implemented a linear spectrogram model under strict recording-quality and word-selection requirements. Subsequent studies investigated the reconstruction performance of linear and non-linear models based on the speech spectrogram and the vocoder parameters of a synthesizer68, demonstrating the importance of non-linear neural networks. Other studies leveraged the Wasserstein GAN (wGAN)69 for generator pre-training to obtain spectrogram representations70, and the dual generative adversarial network (DualGAN)71 for cross-domain mapping between EEG signals and speech waves72. In this field, network optimization contributes to performance improvement, with the self-attention module proving superior to multi-layer perceptrons (MLPs) and CNNs for restoring the spectrogram73.

Compared with text, speech is more variable and carries richer information, which makes restoring speech stimuli more challenging. From the current perspective, reconstructing recognizable speech waveforms will require further iteration on both recording quality and network architecture.

Brain recording translation

Decoding natural sentences from brain signals remains a significant challenge. Unlike simpler tasks that convert brain signals into categorical labels, brain recording translation directly decodes linguistic stimuli into word sequences (Fig. 3). This process borrows concepts from machine translation, as both tasks map representations between two different spaces. Brain recording translation involves open-vocabulary decoding from neural patterns, which implies a vast search space. However, it fundamentally differs from machine translation: the stimulus text or speech is deterministic, whereas in machine translation the potential targets can be numerous. Given the resolution limitations of non-invasive neuroimaging, this task illustrates the trade-off between brain recording quality and recognition granularity.

Fig. 3: The experiment setting and model architecture of brain recording translation.

a For natural reading, the subjects are exposed to text while the active brain signals are collected. Eye movements are typically recorded to determine the text transcription corresponding to the brain data at each time step. A sequence-to-sequence model processes the evoked brain recordings to determine the related words and then forms the decoded sentences. b A feasible translation model architecture, including feature extraction, feature transformation and a pre-trained encoder–decoder to generate the decoded sentence. Both pre-trained language models (e.g., BART) and speech models (e.g., Whisper) have been verified to be effective. The non-invasive collection icon is from Vecteezy.

The brain recordings are typically collected during natural reading and listening (Fig. 3a), and researchers reconstruct the text stimuli through deep learning. In ref. 74, the authors first introduced the concept of machine translation into neural decoding. Although this work decoded word sequences during attempted speech, the serialization of text generation provided new insights for subsequent work. The architecture contained temporal convolutions to model contextual relations and encoder–decoder recurrent neural networks (RNNs) to generate the predicted text; the experiment was conducted on ECoG recordings with a vocabulary of several hundred words. Subsequent work turned to BLEU and ROUGE scores75 and greatly expanded the decoding vocabulary (~50,000 words) by fully leveraging the inference capabilities of pre-trained LLMs: a multi-layer Transformer encoder maps non-invasive EEG features into the embedding space of the BART tokenizer76, and the decoded sentence is generated by its decoder. Following these achievements, the paradigm was advanced by interpreting raw brain signals directly with contrastive learning and by introducing discrete encoding of the EEG representation borrowed from VQ-VAE77,78. However, these models were overestimated because a teacher-forcing scheme was used during evaluation79: instead of feeding the model's previous predictions at each time step, the ground-truth target tokens were used during inference, which prevents such systems from generating meaningful sentences in real-life applications.
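A conceptual sketch of this brain-to-BART pattern is shown below (PyTorch with Hugging Face Transformers): a small Transformer projects word-level EEG/MEG features into the encoder embedding space of a pre-trained BART model, training uses the usual teacher-forced cross-entropy, and evaluation runs free autoregressive generation rather than teacher forcing. The brain encoder, feature dimensions and data are illustrative assumptions, and the exact generate() usage may vary across library versions.

```python
# Conceptual sketch of the brain-to-BART pattern: a brain encoder feeds BART's
# encoder embedding space; evaluation uses free-running generation, not teacher forcing.
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

class BrainEncoder(nn.Module):
    """Projects per-word brain features into BART's 768-d embedding space."""
    def __init__(self, in_dim=840, d_model=768, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
    def forward(self, x):                        # x: (batch, seq_len, in_dim)
        return self.encoder(self.proj(x))

brain_enc = BrainEncoder()
feats = torch.randn(2, 20, 840)                  # 2 sentences, 20 word-level brain features each
labels = tok(["a sample target sentence", "another target"],
             return_tensors="pt", padding=True).input_ids
labels[labels == tok.pad_token_id] = -100        # ignore padding in the loss

# Training: standard teacher-forced cross-entropy.
loss = bart(inputs_embeds=brain_enc(feats), labels=labels).loss
loss.backward()

# Evaluation: run the encoder once, then autoregressive generation (no ground-truth tokens).
with torch.no_grad():
    enc_out = bart.get_encoder()(inputs_embeds=brain_enc(feats))
    pred_ids = bart.generate(encoder_outputs=enc_out, max_new_tokens=20, num_beams=4)
print(tok.batch_decode(pred_ids, skip_special_tokens=True))
```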

Fig. 4: Overview of speech neuroprosthesis.

a The experimental setting for inner speech recognition. From a neurological perspective, brain waves control the movement of the articulatory system to complete the pronunciation of each phoneme in series, indicating a mapping from evoked brain signals to articulator movements to phonemes. A classification and recognition module is adopted to generate the corresponding phoneme sequences before leveraging a language model to form word sequences. b The comparison between ASR and inner speech recognition (ISR). The raw time-series signals are processed for feature extraction and then fed into the acoustic and brain models, respectively. Both models aim to bridge the relationship between learnable acoustic-related features and phoneme sequences. The Viterbi decoding algorithm is performed on the sum of the phoneme probability from the acoustic/brain model and the language probability derived from a language model trained on an extensive corpus to generate the decoded word sequences. c The brain model can be implemented to decode various modalities. For inner speech recognition, the phoneme and word sequences are decoded with the aid of language models. For brain-to-speech decoding, the speech waves are synthesized according to articulator gestures, synthesizer parameters or speech properties. By modeling the articulator gesture probability and adopting a gesture-animation system, a talking head can be generated. The different modalities are associated through TTS, ASR, talking head generation (THG) and synthesis methods. d Acoustic-related brain activity shows the potential to support communication-aided BCIs for ALS patients, considering the decoding feasibility of text, speech and facial expressions. The articulation and ALS icons are from Oxford Academic, Springer Open, and Iconfinder. The talking head image is from ref. 153.

Alternatively, a solution with implementation potential was proposed to generate text directly from MEG recordings without teacher forcing80. The proposed architecture, NeuSpeech, utilized MEG instead of EEG or fMRI and incorporated a Whisper model; during training, only a small portion of the parameters within the encoder were fine-tuned, while the Transformer layers in the encoder and the entire decoder remained frozen. A more advanced solution produced an open-vocabulary MEG-to-text translation model capable of generating unseen text81, in which multiple alignments were performed between the MEG recordings and the speech audio: the brain module was mapped to Whisper representations at three levels, the Mel spectrogram, the hidden states and the decoded text. Another work proposed simultaneously leveraging the inference ability of LLMs and implemented an fMRI encoder to learn a suitable prompt in an auditory-decoding setting; the prompts of the text and fMRI modalities were aligned through a contrastive loss82. In ref. 83, the researchers directly used the representation decoded from fMRI as the input to LLMs and found a closer alignment with content deemed surprising by the LLM backbone. As for improvements from modeling strategies, PREDFT utilized predictive encoding with a side network to generate predictive representations through a multi-head self-attention module84.

The setting of brain recording translation is practical. Under this paradigm, more work has emerged that implements LLMs to translate brain signals over large vocabularies, including schemes using contrastive learning and curriculum learning85. By constructing positive and negative sample pairs from the EEG of different subjects exposed to identical or different sentence stimuli, the method pulls together the representations of semantically similar inputs while pushing apart dissimilar ones; more similar sample pairs are considered harder, and the curriculum progresses from easy to difficult. A similar approach was also used for decoding fMRI signals with an encoder-decoder architecture and BART as the text generator86. The reconstruction loss of the fMRI signals was used to train a better encoder, and the discretized EEG signals and the word2vec87 text vectors were fed into the contrastive learning module as EEG-text pairs, aligning the EEG representation with pre-trained language models. Another experimental approach collected brain recordings while participants listened to narrative stories88; the fMRI data were passed through a feature extractor and then into GPT to complete the sequence generation task. In the same experimental context, ref. 89 employed encoders and projectors to align the distributions of fMRI and text: an external large model, GPT, samples candidate words, and the option whose distribution is closest to the predicted fMRI signal is selected, completing sequence decoding in an autoregressive manner.

The models for brain recording translation, especially the architectures proposed in the past year, and their performance on various datasets are shown in Supplementary Table 1. The word sequences decoded from non-invasive brain signals show great disparity with the original text, as reflected in the high WER, while remaining semantically consistent with it, achieving promising BERTScores. Considering the deployment prospects of non-invasive signal acquisition equipment, this is a feasible experimental design, as it does not require exact recovery of the text but focuses on semantic reconstruction.

Speech neuroprosthesis

Some neurological diseases can result in the loss of communication abilities. Many patients rely on a brain–computer interface (BCI) to spell words90,91, move the computer cursor92, and direct handwriting93. Although these systems can improve patients' quality of life, communication efficiency remains a concern. A major challenge is to overcome the limitations of current spelling-based methods and achieve a natural rate of communication. The goal of speech neuroprosthesis (SN) is to decode the words or speech waves that the experiment participants intend to speak directly from their brain signals (Fig. 4). This represents a promising path toward devices that assist in voice communication.

Inner speech recognition

Inner speech was first referred to as imagined speech in a two-phoneme classification task94. In typical settings, the subjects have lost the ability to produce recognizable sounds, and brain signals are recorded as they attempt to speak; in some experiments, brain signals during vocal speech are also collected. Unlike brain recording translation, inner speech recognition demands high-quality brain recordings, as high-resolution neural recordings improve the accuracy of speech decoding95. The task is closely related to ASR, for both: (1) model the relation between diverse temporal features and deterministic textual information; (2) correlate with pronunciation and acoustics; and (3) aim to generate language-compliant text. A recent study shows that, even at the level of single neurons, there are neural representations related to inner and vocalized speech that are sufficient to discriminate words from a small vocabulary96.

Phonemes, recognized as the foundational units of speech pronunciation, have historically been the focus of initial studies aiming to decipher human articulatory patterns from brain activity. Previous studies have provided evidence for the neural representation of phonemes and other acoustic features during speech perception97,98. Pioneering work applied instance-based matching algorithms and demonstrated the feasibility of text decoding from brain recordings even without learned features99,100,101. Subsequent research concentrated on identifying these phonetic units, framing the task as classification owing to the relatively narrow range of phoneme varieties. Experiments have been conducted using linear classifiers102, support vector machines (SVMs)103,104,105, naive Bayes classifiers106, k-nearest neighbor classifiers94, linear discriminant analysis (LDA) classifiers107,108,109, flexible discriminant analysis (FDA)110, and classifiers operating on brain recording features after principal component analysis (PCA)111. The above work was conducted with a few phoneme candidates having clear acoustic boundaries. Following this, researchers achieved full-set phoneme decoding of American English112 and implemented a similar approach with brainwaves recorded by mobile EEG devices113.

Progressing from phonemes, researchers advanced toward decoding brain signals into words within a modest vocabulary. Many investigations were conducted on severely restricted sets with clearly distinguishable pronunciations. Because of the small vocabulary (typically several, a dozen, or several dozen candidates), such approaches employed classifiers. In refs. 114,115,116, the researchers introduced a human-defined lexicon, where a multiclass SVM and a relevance vector machine were used for intended speech decoding. Another classification-based work distinguished five words in Spanish and focused on the multi-modality fusion of text, sound and EEG signals117. The most recent achievements demonstrated speech-related representations at the single-neuron level96, where an LDA classifier was adopted to distinguish six words and two pseudowords. Deep learning methods have also been applied to the recognition of imagined speech. The first attempt implemented several networks to classify the imagined words "yes" and "no"118, followed by research utilizing deep belief networks for brain activity feature extraction as well as phoneme and word recognition119. Cascade approaches divided the pipeline into convolution-based modules, including an MFCC prediction module and a word classification model65. Network structures with more parameters are suitable for more complex recognition units, for instance, long-word recognition using a mixed network module containing CNNs and RNNs120. To test the recognition performance of network models on longer units, researchers investigated the decoding of five imagined and spoken phrases with fully connected layers and CNNs121.
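The classification setting used throughout this small-vocabulary work can be summarized by the sketch below: simple band-power-like features per EEG epoch, a linear discriminant analysis classifier, and cross-validated accuracy compared against chance level. The data are synthetic and the log-variance feature is only a crude stand-in for the spectral features used in practice.

```python
# Sketch of small-vocabulary imagined-speech classification: per-epoch features,
# an LDA classifier, and cross-validated accuracy versus chance level.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_trials, n_channels, n_samples, n_words = 240, 32, 256, 6
epochs = rng.standard_normal((n_trials, n_channels, n_samples))
labels = rng.integers(0, n_words, size=n_trials)        # imagined word index per trial

# Toy feature: log-variance (a crude band-power proxy) of each channel.
features = np.log(epochs.var(axis=2))

clf = LinearDiscriminantAnalysis()
acc = cross_val_score(clf, features, labels, cv=5)
print(f"accuracy: {acc.mean():.2%} (chance ~ {1 / n_words:.2%})")
```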

The challenge of low SNR in brain recordings, primarily from non-invasive techniques, is a significant obstacle to expanding the decoding space5. In ref. 74, the authors achieved word sequence decoding over a vocabulary of 250 words using an RNN-based encoder-decoder architecture with invasive ECoG recordings. The most promising approach to generating sentences originates from speech recognition. Specifically, hybrid ASR includes an acoustic model, a language model and a lexicon: the acoustic model scores the recognition units, and these scores are combined with the language model scores to generate the decoding hypothesis. Cascade speech neuroprostheses replace the acoustic model with a brain model and decode the corresponding phoneme or small-vocabulary word hypotheses before generating sentences122,123. These works typically adopted Gaussian mixture models (GMMs) to fit the data distribution of invasive brain activity. Such approaches did not make a major impact until the replacement of GMMs with artificial neural networks contributed to a steady improvement124,125. This groundbreaking work used RNNs to model the mapping between invasive brain activity and phonemes; the phoneme scores, combined with an n-gram language model trained on a large external text corpus, were used by the Viterbi search algorithm to decode sentence hypotheses, with a lexicon establishing the connection from phonemes to words. Through this work, researchers achieved a 25.8% WER on a vocabulary of 125,000 words within acceptable performance bounds126, at a rate of 62 words per minute. A similar earlier work127 used an encoder-decoder architecture with a feature regularization module to decode character sequences from ECoG recordings; however, the regularization consumed acoustic and articulatory kinematic features, which are unavailable for ALS patients. Continuous speech decoding has been extended to logosyllabic languages such as Mandarin Chinese, with three CNNs designed to predict the initials, tones and finals of Pinyin, a phonetic input system based on the Latin alphabet128. The prediction of initials was based on articulatory features, including the place and manner of articulation and whether the sound is voiced or aspirated. A more convincing result appeared in multilingual recognition, where the participant was presented with target phrases in either English or Spanish129. In ref. 130, encoder-decoder RNNs were implemented to recognize vocal speech, where representations generated by a revised wav2vec131 yielded superior decoding performance compared with the original ECoG data. Another recent line of work introduced end-to-end frameworks with pre-trained LLMs for decoding invasive brain signals, leveraging the inference capabilities of GPT-2, OPT, and LLaMA2132,133,134,135. As an initial attempt, this approach achieved performance comparable to the cascade model, demonstrating a promising avenue.
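The cascade decoding idea, brain-model phoneme scores combined with a language model through a search over word hypotheses, can be illustrated with the deliberately simplified sketch below. It assumes phoneme posteriors are already segmented into word-sized blocks, uses a toy two-word lexicon and bigram LM, and replaces Viterbi/CTC alignment with a plain beam search; real systems operate over time-aligned phoneme lattices and vocabularies of tens of thousands of words.

```python
# Deliberately simplified cascade decoding: per-slot phoneme log-probabilities from a
# "brain model" are combined with a bigram language model over a tiny lexicon via beam search.
import numpy as np

phonemes = ["HH", "AH", "L", "OW", "W", "ER", "D"]
lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
bigram_logp = {(None, "hello"): -0.5, (None, "world"): -1.5,
               ("hello", "world"): -0.3, ("world", "hello"): -1.2,
               ("hello", "hello"): -2.0, ("world", "world"): -2.0}

def word_logp(word, seg_logprobs):
    """Sum phoneme log-probs of the word over pre-segmented phoneme slots."""
    idx = [phonemes.index(p) for p in lexicon[word]]
    return sum(seg_logprobs[t, i] for t, i in enumerate(idx))

def beam_decode(utterance_segs, beam_size=3, lm_weight=1.0):
    beams = [((), 0.0)]                                   # (word sequence, score)
    for seg_logprobs in utterance_segs:                   # one block of slots per word
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for word in lexicon:
                s = score + word_logp(word, seg_logprobs) \
                    + lm_weight * bigram_logp.get((prev, word), -5.0)
                candidates.append((seq + (word,), s))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    return beams[0][0]

# Two word-sized blocks of phoneme log-probabilities, biased toward "hello" then "world".
seg1 = np.full((4, 7), -3.0); seg2 = np.full((4, 7), -3.0)
for t, p in enumerate(lexicon["hello"]): seg1[t, phonemes.index(p)] = -0.2
for t, p in enumerate(lexicon["world"]): seg2[t, phonemes.index(p)] = -0.2
print(beam_decode([seg1, seg2]))                          # ('hello', 'world')
```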

Since cascade inner speech recognition and LLM-augmented approaches have achieved efficient and accurate performance, breakthroughs in this field have accelerated. However, invasive data collection carries medical risks, making it difficult to promote among patient groups. Additionally, it has been verified that brain patterns vary over time and across subjects124. We believe that inner speech recognition is the most promising solution for communication-aided BCIs, but it is still some distance from a secure, high-quality, and low-latency strategy.

Brain-to-speech

Another challenging approach is to synthesize speech waves directly from brain signals. Speech-synthesis neuroprostheses employ deep learning models to sequentially convert brain activity recordings into synthesizer commands136,137, kinematic features (e.g., the amplitude envelope), or acoustic features (e.g., pitch and MFCC)138,139, thereby reconstructing the original speech signal. For instance, a study implemented the DenseNet regression model140 to map ECoG features to the spectrogram141. Articulatory-based speech synthesizers generate intelligible speech signals from the primary speech articulators using articulator representations142 or electromagnetic articulography (EMA)143,144. EMA measures the positions of the mouth articulators: the tongue, lips, velum, jaw, and larynx. This method is based on the finding that, during speech production, activity in the brain's sensorimotor cortex closely aligns with articulatory characteristics145. Additionally, various features related to synthesized speech, such as vocal pitch146, articulatory kinematic trajectories147,148, and speech energy149, can be identified from brain activity. Speech synthesis without deep learning, such as unit selection, has also been extensively studied150. Besides synthesizing intelligible waveforms, researchers are also focusing on generating spontaneous speech, including speech with accurate lexical tones; a feasible approach constructed specific neural networks to separately decode the neural activities of tones and syllables, then used the combined decoded features to synthesize tonal speech151. Synthesis delay is an important factor in realizing speech-centric BCI systems. In ref. 152, an online speech synthesis system was proposed, with neural voice activity detection to extract speech-related neural segments, a bidirectional decoding model to estimate acoustic features, and a vocoder to obtain the corresponding speech wave.
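In its simplest form, the brain-to-speech pipeline can be sketched as a frame-wise regression from neural features to a mel spectrogram followed by a vocoder; the example below uses synthetic "neural" features, ridge regression and librosa's Griffin-Lim-based mel inversion, whereas the published systems rely on invasive recordings, deep regression models and neural vocoders.

```python
# Sketch of the brain-to-speech pipeline in its simplest form: a linear model maps
# synthetic neural features to a mel spectrogram, which is inverted to a waveform.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

sr = 16000
speech = librosa.tone(220, sr=sr, duration=2.0)           # stand-in for recorded speech
mel = librosa.feature.melspectrogram(y=speech, sr=sr, n_mels=80)
log_mel = np.log(mel + 1e-6).T                            # (frames, 80)

# Synthetic "neural" features correlated with the target spectrogram.
rng = np.random.default_rng(5)
neural = log_mel @ rng.standard_normal((80, 256)) + 0.5 * rng.standard_normal((log_mel.shape[0], 256))

model = Ridge(alpha=1.0).fit(neural, log_mel)             # frame-wise regression
pred_mel = np.exp(model.predict(neural)).T                # back to (80, frames), linear scale

# Griffin-Lim-based inversion of the predicted mel spectrogram to audio.
waveform = librosa.feature.inverse.mel_to_audio(pred_mel, sr=sr)
print(waveform.shape)
```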

In addition to speech synthesis, information related to other modalities can be obtained from invasive brain signals. The most intuitive attempt is to leverage articulator gestures for facial movement synthesis125, which can be achieved by decoding orofacial representations in the speech motor cortex142,147. It has been verified that facial movements can be generated using an avatar-animation system, and progress on talking head generation has inspired restoring the patient's own face153. In theory, multiple elements for building a digital human can be obtained from invasive brain activity, including textual sentences, speech waves, facial movements, as well as body movements not related to language154,155. This may be the future direction of communication-aided BCIs, restoring the patient's dignity to the greatest extent possible and allowing them to communicate with the outside world through a virtual image comparable to that of an able-bodied person (Fig. 4d). For patients who are confined to bed and unable to move, especially ALS patients, this could greatly improve quality of life.

Progress, challenges, and future

Progress toward ideal BCIs and current challenges

Language is the primary means of human communication, and decoding linguistic information from brain activity is crucial for the development of future BCIs. We summarize the gap between neural decoding systems and ideal BCIs from the following aspects, addressing both progress and challenges (Fig. 5):

  • Neural signal collection: Although invasive recording offers superior brain-imaging quality, the required surgery and the associated medical risks prevent its widespread adoption among patients. The collection of high-quality non-invasive data is a prerequisite for word-level, fine-grained sequence decoding. Limited by the current level of neural recording and the noise resistance of network architectures, an acceptable level of open-vocabulary continuous decoding has not yet been achieved with non-invasive data. A feasible alternative, brain recording translation, focuses on the semantic consistency of the decoded text: it does not require exact recovery of the corresponding text or high-fidelity restoration of the speech, but aims for a considerable level of semantic accuracy.

  • Subject- and time-invariance: For the same neural stimulation, brain activity varies across subjects and acquisition times124. On a small vocabulary, a 3-month clinical trial in an ALS patient showed that speech commands could be accurately detected and decoded without recalibrating or retraining the model156, and another study showed that the developed decoding system worked successfully in two human patients96. However, experiments on a wider population with an open vocabulary have not yet been carried out, and the generalization of models trained on a single data source remains to be examined.

  • High precision, low latency and multi-functionality: The upper bound of speech-related BCIs can be viewed as that of a corresponding ASR system, given the shared neural-network backend and the additional noise superimposed on the brain's response to speech. The development of more sophisticated and responsive BCIs could revolutionize how we interact with machines, offering applications in medical rehabilitation, verbal communication and even entertainment. Furthermore, integrating multiple modalities, such as visual and auditory inputs, can enhance the functionality of BCIs, enabling more comprehensive communication solutions. Experiments on multiple tasks have shown that text, speech and visual reconstruction from neural signals can restore semantic features80,152,157,158,159, which points to a potential solution through modality fusion and system integration. It must be emphasized, however, that the detailed restoration performance of these experiments still needs substantial improvement, and open questions remain on striking a balance that avoids error accumulation while using auxiliary information to support the main modalities.

  • Privacy preservation: Ethical debates regarding collecting and decoding neural signals from the human brain remain an important limiting factor160,161. Invasive data collection is only carried out on a small population due to its surgical risk and usually requires that the craniotomy be medically necessary. Non-invasive data have far greater potential for widespread use but also carry significant risks of privacy leakage; a more comprehensive data usage convention may be necessary, covering the standardization of the collection process, the requirements on experimental subjects, the decoding granularity and vocabulary, and safeguards against violations of personal privacy. To have widespread potential, BCI systems must be privacy-preserving and ethically sound. The system should be clearly aware of what information can be accessed, displayed, or made public, and choose to conceal or ignore content that touches on personal privacy or inner thoughts. A responsible stance within society that firmly opposes the misuse of neurodata would serve as the ethical guide for the future advancement of neurotechnology.

Fig. 5: Characteristics of an ideal BCI system for communication and the solutions for achieving them.

The BCI system requires high-quality brain recordings and addresses the problem of individual and temporal differences through strategies such as domain alignment. Additionally, advances in network architecture, especially the application of LLMs, provide ideas for high-precision, low-latency, and multi-functional interaction. Some icons are from Dreamstime, Vecteezy and Iconfinder.

Future directions

Even though we are still a long way from efficient and harmless BCIs, some directions have shown bright prospects. A unified brain representation could be the next big breakthrough in neural decoding, as unified representations have already had a great impact on other modalities. There is a consensus on the individual and temporal variability of neural signals. Performing individualized data collection and model training is not a feasible solution, considering the required recording duration and computational resources. Instead, a unified neural representation, fine-tuned with limited user-specific recordings to form personalized decoding systems, is a promising alternative. This requires expanding previous experiments to group populations and collecting far more dynamic neural recordings, with self-supervised learning providing a strong link to semantic information130,162,163.

While invasive neural decoding has demonstrated superior performance, the main limitation of non-invasive signals is their significant noise level. It is worthwhile to perform data augmentation and denoising on neural signals, and existing solutions are mainly based on generative models such as GANs and diffusion models164,165. Research on the robustness of model architectures is still in its early stages, especially as LLMs have recently demonstrated extraordinary reasoning performance. Considering the significant mismatch between the tokenized text space and the neural space, robust neural networks with stable training strategies could boost generation performance.

Large language models possess powerful understanding, reasoning, and generation capabilities, and previous studies have shown that LLMs trained on vast text corpora can be aligned with other modalities through smaller-scale fine-tuning, thereby generating content with strong semantic consistency and vivid details. The same phenomenon also applies to neural data, where the most significant trend in linguistic neural decoding has been implementing a textual LLM as the backend decoder for text generation82,85,135. As shown in Fig. 5, in initial attempts the LLMs were adopted to generate hypothesis candidates, with a separate module scoring each potential sentence89,166. A more promising approach treats the LLM as the inference core to generate correlated textual information135, gradually evolving into a unified decoding system with multi-modality inputs and user-specified outputs167. We believe that the update and iteration of LLMs will promote qualitative changes in neural decoding, thereby reaching application level in the near future.

Parallel to model improvement, a pressing issue in neuroscience is the precise collection of neural recordings related to language processing, including its acoustic and phonological aspects. This requires identifying the specific neuronal populations and brain areas involved in language functions11,12,168. High-resolution scanners, wearable neurotechnology devices and other advanced equipment are also necessary169, and more reasonable experimental settings need to be explored. An important step is to unify the data collection framework and explore the possibility of building a massive neural corpus from multiple resources, covering diverse stimuli and subject conditions. The neural recordings from a single experimental trial are typically suitable only for training small networks, whereas pre-training and fine-tuning at larger scales are likely to require data spanning several orders of magnitude more.

As for privacy preservation and technology regulation, strict management and supervision need to cover the entire process of data collection, model training and deployment161. The premise is to form clear data usage standards, minimize the dimensions and duration of neural recordings while ensuring decoding performance, and strictly discard potentially privacy-sensitive instances. The dissemination and use of neural data must ensure that the goal of the corresponding experiment serves human welfare, and data encryption, differential privacy and federated learning are protection measures that need to be considered. As the modalities and experimental populations of neural decoding expand, we strongly call for the formation of a unified ethical perspective, such as human rights guidelines, which requires neural computing companies and the researchers involved to assume the corresponding scientific responsibilities.

The interaction between the brain and the environment is bidirectional. This article mainly addresses the direction of neural decoding, that is, from neural recordings to linguistic stimuli or intended messages. Stimulus encoding, in which tiny stimulation currents are applied to the cortex to generate evoked brain activity, might be a solution for sensory loss, including blindness and deafness. Guiding brain cognition through artificial stimulation, commonly known as deep brain stimulation (DBS), is a promising direction for disease treatment and has emerged as an effective treatment for neurological conditions such as Alzheimer's170 and Parkinson's disease171. Another question is whether BCIs can improve the efficiency of information transmission. Information interaction via voice or visual text is limited by the rate of speech flow and visual refresh, while the brain's information reception rate may far exceed both thresholds. When machine operating efficiency reaches a certain level, a large-scale industrial revolution may come from a leap in information transmission efficiency. In general, brain linguistic decoding is a cross-disciplinary collaboration, and we expect a further revolution from strengthened cooperation between biology, engineering, and machine intelligence to promote innovation and accelerate the development of brain signal recording technology and its applications.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.