Introduction

Intracranial recordings have yielded novel insights into how focal neuronal populations encode articulatory kinematics, latent phonetics1,2,3, and vocal modulation4. These insights have driven the creation of brain-computer interfaces (BCIs) to enable communication in speech apraxia. Thus far, these data have been recorded from intact sensorimotor cortex in anarthric individuals with damaged efferent pathways or end effectors5,6,7. Models derived from these data are highly individualized5,8,9 and not readily extensible to patients with cortical loss due to brain injury. In such aphasic, as opposed to anarthric, individuals, a BCI that combines sparse data from brain regions with residual language capacity with a transfer model derived from a population of normal individuals would allow us to bridge gaps in clinical translation and develop a generalizable prosthesis.

In service of this goal, we implemented a tongue twister paradigm10,11, designed to load the articulatory system, in a cohort of 25 patients using over 3600 stereoelectroencephalographic (sEEG) depth electrodes. We used sequence-to-sequence models to decode phonemes12,13,14,15,16 from distributed speech hubs3,17,18,19,20,21 and assessed the contributions of the number of channels and the number of trials (a surrogate for the quantity of neural data used for training), evaluating decoding performance not only during but also prior to articulation. We then developed a grouped transfer learning technique to train population neural latents22,23 and assessed the combined effects of each of these factors to generate a robust, reliable training manifold for speech decoding. These manifolds were then implemented as generalizable decoders on patients not used to train them and demonstrated improved inference in individuals with limited coverage of the speech motor cortex (akin to missing these brain regions due to injury).

By leveraging multi-site and multi-subject cortical data, this architecture is initialized on diverse neural codes, enabling a pre-trained nonlinear neural encoder that maps onto a linear readout effector. While others have focused on pre-training at the stimulus level to provide stronger priors for speech waveform reconstruction24,25, we restricted the complexity of the decoder output to phonemic sequences and instead built stronger priors for the encoder. This approach pushed the informational limits of neural data, creating a rich latent feature set from models that learn subject-independent representations of articulation. These generalizable manifolds of speech production, coupled with transfer learning, allowed us to estimate planned phonemic trajectories in patients lacking sufficient data to construct the latent feature set. This framework can potentially facilitate neural prosthetics for aphasic patients who lack the typical levels of word production fluency needed to initialize decoding models.

Results

Across the task, average accuracy for pronouncing all words correctly in a tongue twister trial was 87% (±4% S.D.). Trials with articulatory errors (8% ± 3% S.D.) or dysfluency (5% ± 2% S.D.) were excluded. Across the cohort of 25 patients, we recorded comprehensively from peri-sylvian frontotemporal language sites (Fig. 1B, C). A mixed effects multi-level analysis (SB-MEMA, Kadipasaoglu et al.26,27) was used to aggregate data and revealed expected loci of activation in subcentral gyrus (SCG), superior temporal gyrus (STG), posterior middle temporal gyrus, premotor cortex, and inferior frontal gyrus (IFG) (Fig. 1D).

Fig. 1: Experimental overview.
figure 1

A Tongue twister paradigm and phonetic transcript of a trial with and without speech dysfluency. B Aggregate of surface recording zones represented on an inflated brain surface to represent the density of electrode coverage across the cohort of subjects. C 3641 (blue) electrodes in 25 patients were used (dark electrodes were excluded due to noise). D A surface-based mixed effects multilevel analysis (sbMEMA) was used to derive a population map of activation related to articulation of the first word (0–500 ms) scaled by the extent of activation and corrected for the number of contributors. Created in BioRender. Singh, A. (2025) https://BioRender.com/026o8yt.

We deployed a sequential state-based model (Fig. 2A) based on the premise that it would outperform a linear model at reconstructing phoneme sequences from continuous samples of neural data. This sequence-to-sequence (Seq2Seq) model performed significantly better than chance (10% accuracy, S.D. 8%) and outperformed a linear model (24% accuracy, S.D. 5%) in predicting phonemes across all patients. As our model targets phoneme-level decoding, we report Phoneme Error Rate (PER) rather than Word Error Rate (WER), allowing us to assess neural representations at a finer granularity appropriate to our objectives. The model achieved a median PER of 27% (S.D. 6%) when decoding from articulatory periods, and 34% (S.D. 6%) when decoding from pre-articulatory periods (Fig. 2B). Additionally, we tested the ability to predict variable-length phoneme sequences using a teacher-forcing variant of the Seq2Seq model and again achieved significantly better performance than chance (10% accuracy, S.D. 8%) and a linear model (20% accuracy, S.D. 6%). The variable-length decoder achieved a median PER of 44% (S.D. 4%) using data from articulation periods, and 56% (S.D. 6%) when decoding from pre-articulatory periods. The lowest PERs for a single subject were 13% and 24% during articulation for the fixed and variable-length phoneme sequences, respectively, and 26% and 43% when decoding from pre-articulatory intervals.

Fig. 2: Seq2Seq models for phoneme neural decoding.
figure 2

A Schematic representation of a sequence-to-sequence model: neural data with variable cortical coverage were processed by a temporal convolutional layer, a recurrent neural network, and a linear readout layer to isolate phoneme identity probabilities for each index in the phoneme sequence. These predicted phoneme sequences (example predicted trial is depicted) are then compared using a distance metric to evaluate a phoneme error rate (PER). B Phoneme sequences were decoded utilizing frames of neural activity during articulation and prior to articulation with a fixed and variable length Seq2Seq model as well as a linear model for comparison. PERs were computed across these conditions to evaluate effects of time window and model architecture. C, D Cohort level trial and channel statistics from controlled analyses driving decoding performance, and extrapolated values for optimal number of trials and channels for high decoding accuracy (1-PER). E Regional electrode occlusion (REO) analysis created for decoding architectures for broad lobe-based and region-specific analysis, employing a linear mixed effects model with random effects for patients across different time windows preceding articulation. Box plots (center/bounds/whiskers): Linear 76%/70–82%/65–90% (articulatory), 80%/75–85%/70–92% (pre-articulatory); Fixed 27%/22–32%/18–42% (articulatory), 34%/29–39%/25–47% (pre-articulatory); Variable 44%/38–50%/32–58% (articulatory), 56%/50–62%/45–68% (pre-articulatory). Outliers beyond 1.5× IQR shown. Statistical significance by repeated measures ANOVA, two-sided (*p < 0.05, **p < 0.01, n = 25 subjects). Created in BioRender. Singh, A. (2025) https://BioRender.com/22ktqfb.

Given the variability in decoding performance across the cohort, we explored key factors in the training dataset that may contribute to this variance. The two primary parameters influencing each subject’s dataset size were the number of electrodes implanted, based on the anatomical trajectories required for seizure localization, and the duration for which each subject was able to perform the task. Controlling for the number of electrodes (Fig. 2C), we observed a significant correlation (R2 = 0.64) between the number of trials and decoding accuracy, with decoding projected to reach <10% PER (90% accuracy) at approximately 180 min of articulation. To evaluate the impact of channels (Fig. 2D), we subsampled channels from each subject while controlling for trials. Certain subject clusters showed early separation and optimal performance with 500 channels, while others required up to 1500 channels. Notably, significant separation between subject clusters (p < 0.05) was observed with as few as 100 channels. Even after controlling for subject-specific improvements in decoding accuracy using a linear mixed-effects model, the number of channels remained a significant factor (p < 0.05) in enhancing decoding accuracy.

To further examine the variability in decoding performance, a regional electrode occlusion (REO) analysis (Fig. 2E) was performed by generating multiple recurrent models that, during training, left out specific sets of electrodes in critical speech production hubs based on their anatomical locations. This allowed us to use individual regional activity as a proxy for each region’s contribution to the identity and position of each phoneme. When electrodes in the ventral sensorimotor cortex were removed, the distribution of decoding accuracy for the all-channels model versus the ablated model was sharply skewed to the left, demonstrating significant effects and indicating the crucial role that sensorimotor cortex electrodes play in phoneme decoding. Similarly, a pronounced leftward shift in performance was observed when temporal lobe electrodes were removed. We continued our REO analysis in smaller regions of interest implicated in the speech production network using our within-subject models. Pre-articulatory activity in SCG, pSTG, and STS each contributed significantly (p < 0.01) to phonological processing, with early involvement of STS and SCG and later involvement of pSTG at the onset of articulation. Activity in IFG and aSTG showed no significant effect on PER when removed from the model trained on pre-articulatory data. This suggests that individual variations in recording sites render some subjects more valuable for decoding, underscoring the importance of electrode placement and regional contributions in explaining variability across the cohort.

Transfer manifolds across individuals

Influence of data size and recording locations

As the preceding analyses show, inter-subject variability in the amount of training data and the number of electrodes clearly affected decoding performance at the cohort level, motivating a practical implementation that exploits these features to improve decoder performance across subjects. We selected the participant with the best decoding performance and implemented a transfer learning technique that generalized the model trained on this individual’s data across the cohort. This mapping of learned latent features to subject-specific neural data resulted in remarkably good decoding performance, with no significant difference in PER (p = 0.72) from models trained within-subject (Fig. 3A). Optimal decoding performance resulted from keeping all components of the subject-specific model trainable, except for the recurrent layer (p < 0.001). By utilizing the learned embeddings from this best-decoding subject, the recurrent layer effectively embeds optimal latent features that are then enhanced by individual data from the rest of the cohort. Thus, while some subject specificity is important to achieve baseline decoding performance, core recurrent non-linearities can be learned and transferred from one patient to another to improve decoding performance.

Fig. 3: Applying transfer learning to Seq2Seq models.
figure 3

A Assessing the transferability of model components through PERs by comparing subject-independent models; transferring all layers of a trained model and freezing their weights in the inference model; transferring and freezing the readout layer; and transferring and freezing the recurrent layer. p < 0.001 for recurrent layer transfer vs within subject performance, p < 0.05 for readout transfer vs within subject performance, and p > 0.05 for full model transfer vs within subject. Transfer decoding allows improvements in decoding (∆PER) employing training subjects with B increased number of trials. C Subject electrode location correlations calculated and a group of 14 similarly distributed subjects (to ensure decoding changes are not due to completely dissimilar electrode coverages) from the cohort are sub-selected to show that D increased number of channels and E shared coverage correlation with the inference subject additionally are strong drivers for decoding performance. Box plots (center/bounds/whiskers): Within Subject 0.49/0.47–0.51/0.44–0.54; Full Transfer 0.46/0.44–0.48/0.42–0.50; Readout Transfer 0.45/0.43–0.47/0.41–0.49; Recurrent Transfer 0.43/0.41–0.45/0.39–0.47. Outliers beyond 1.5× IQR. Statistical significance by paired t-tests, one-sided (***p < 0.001, **p < 0.01, *p < 0.05, n = 25 subjects). Created in BioRender. Singh, A. (2025) https://BioRender.com/1o97awd.

To ensure the feasibility of this methodology in clinical practice, where the size of the training data is dictated by clinical needs (electrode placements are guided by epilepsy localization and recording time by the highly variable time available for research during a patient’s stay in the epilepsy monitoring unit), we assessed the value of a shared decoding model. Specifically, we examined whether decoder performance could be improved when inferring on minimal, partial, or incomplete datasets. We mapped the transfer decoding framework across our cohort and assessed, in a pairwise analysis, the critical features that lead to the most significant improvement. Using linear mixed effects modeling to control for variance between performance changes in training-inference subject pairs, phoneme decoding performance showed significant (p < 0.05) improvement for a subject when initialized with recurrent embeddings from a subject with a greater number of trials (Fig. 3B). To better understand how increased channel density in speech-motor cortex improves decoder performance across the cohort, we first needed to functionally categorize each subject’s electrode implantation scheme. We therefore used the Destrieux parcellation in FreeSurfer to map cortical coverage into 50 regions, focusing on key language and speech production hubs. To avoid biasing channel effects due to distinctly unmatched electrode trajectories, we focused on 14 of the 25 subjects who had distributed coverage and an average shared correlation score over 10%. In this subgroup, we observed a significant improvement in decoder performance (Fig. 3D) when models were initialized with recurrent embeddings from subjects with a higher number of implanted electrodes. This approach also allowed us to quantify channel density per region, enabling assessment of how similarities in coverage patterns between a training and an inference dataset impact decoder performance. Transfer learning performance improved significantly (p < 0.05) when the training subject had a greater correlation of sites of cortical coverage with the inference subject (Fig. 3E).

Transfer manifolds using an idealized training set

Isolating optimal characteristics

We evaluated whether a shared recurrent layer across multiple subjects could enhance performance compared to using a single “ideal” subject, enabling creation of a generalizable training manifold (Fig. 4A) beyond a single optimal subject. This approach allows the recurrent layer to integrate generalizable, subject-invariant information about articulatory processes. By selecting information that improves decoding accuracy across multiple subjects, the model encodes features that are not specific to one subject but generalizable across different perspectives of a task and neural activity within a specific region. To test the robustness of this architecture, we conducted REO (Fig. 4B) in both the sensorimotor cortex and temporal lobe within the multi-subject models. The results demonstrate that the multi-subject model is resilient to such electrode occlusions, minimizing performance degradation. Occluding sensorimotor cortex electrodes in the multi-subject model disproportionately affects top performers, with the rate of PER degradation increasing as decoding accuracy increases. This effect is less pronounced when regionally occluding electrodes in the temporal lobe, with the distributions lying primarily along the diagonal, showing that REO in the temporal lobe has minimal effect on phoneme decoding performance for the group model.

Fig. 4: Multi-subject model integrated with transfer learning approaches.
figure 4

A Schematic of transfer learning for sequence-to-sequence modeling with multiple subjects. Each subject has a unique temporal convolutional layer allowing for subject-specific nonlinear dimensional reduction. Subject-specific features were concatenated, and hidden states were derived across all samples using a shared recurrent layer. The shared encoded features were separated back into subject-specific trials and a linear readout layer was used to decode phonemes sequentially like the above examples. The shared recurrent layer was then frozen and transferred to a subject held-out from the group model. B Multi-subject models show robustness to REO across sensorimotor cortex and temporal lobe. Created in BioRender. Singh, A. (2025) https://BioRender.com/ifs6xbo.

We then performed an analysis across the entire cohort using the five candidates that were not only in the top quartile of decoding performance but also had dense sensorimotor cortex coverage, especially in SCG. We assumed that the sensorimotor cortex would contribute the most to decoding, given the prior REO analyses across the single- and multi-subject architectures. A group-based model built from these five subjects showed that utilizing a shared recurrent layer resulted in significantly lower PER on held-out trials (p < 0.01) for each subject in the group model (Fig. 5A). By applying the population-level manifold learned from the group model to the held-out subject, we observed a remarkable enhancement in performance (Fig. 5B). When we used a single subject with comprehensive coverage of the speech and auditory systems, we observed a significant decrease in PER, from 57% in the within-subject model to 49% (p < 0.001). As more subjects were incorporated into the model, decoding accuracy improved, reaching a PER of 45% after integrating four subjects. To determine the optimal number of subjects needed for enhancing decoder performance, we applied a stopping criterion based on whether additional subjects significantly improved performance. Significant improvements were observed with three subjects, and no further change in PER with the inclusion of a fourth.

Fig. 5: Across subject neural representation stability via multi-subject transfer learning.
figure 5

A Pre-articulatory zero-shot decoding performances from single and group (n = 5) subject models. B Group model trained on pre-articulatory activity from 1 to 5 subjects (n) with dense sensorimotor cortex coverage significantly improves decoding performance when transferred to a participant with frontotemporal electrodes and no sensorimotor cortex coverage. C Heatmap of %∆PER improvement across 20 subjects with variable electrode coverage, applying transfer learning with a recurrent layer trained on increasingly more participants in a specific set and order with the group-based model (n = 1:5). D Peak %∆ improvement in decoding accuracy (PER) for each subject in the cohort, along with the optimal number of subjects required in the group model to achieve this performance. 5A) Box plots (center/bounds/whiskers): Single Subject Models—S1: 0.52/0.50–0.54/0.46–0.57, S2: 0.51/0.49–0.53/0.44–0.59, S3: 0.52/0.50–0.54/0.46–0.60, S4: 0.53/0.48–0.56/0.40–0.58, S5: 0.53/0.51–0.55/0.44–0.60; Group Models—S1: 0.46/0.40–0.49/0.34–0.52, S2: 0.46/0.43–0.49/0.34–0.52, S3: 0.47/0.39–0.48/0.34–0.56, S4: 0.47/0.40–0.49/0.32–0.54, S5: 0.45/0.41–0.49/0.33–0.51. Outliers beyond 1.5× IQR. Statistical significance by paired comparisons, two-sided (****p < 0.0001, ***p < 0.001, **p < 0.01). n = 5 subjects. 5B) Box plots (center/bounds/whiskers): Within Subject 0.57/0.55–0.59/0.52–0.60; Group Models—n = 1: 0.50/0.47–0.50/0.45–0.53, n = 2: 0.47/0.46–0.48/0.44–0.49, n = 3: 0.48/0.46–0.49/0.44–0.49, n = 4: 0.45/0.43–0.47/0.41–0.48. Outliers beyond 1.5× IQR. Statistical significance by repeated measures analysis, two-sided (****p < 0.0001, ***p < 0.001, **p < 0.01, *p < 0.05, n.s. p > 0.05). Sample sizes n = 1 to n = 4 for group models. Created in BioRender. Singh, A. (2025) https://BioRender.com/2brioq6.

We conducted the same analysis with the five subjects selected by the criteria above and assessed improvement in decoding across the cohort when utilizing latent states built from this specific set and ordering of subjects. Increasing the number of subjects is critical for some members of the cohort (Fig. 5C); however, others fail to improve significantly in decoding performance beyond three subjects. The optimal number of subjects in the group model depends on the inference subject’s affinity for the subjects included in the group model; however, for most subjects (Fig. 5D) decoding is comparable or improved when utilizing the transfer learning architecture. These findings highlight the substantial benefits of transferring group-derived latent articulatory features to subjects with incomplete cortical sampling, often due to clinical electrode placements.

Discussion

We show the effectiveness of large-cohort intracranial sEEG in training lightweight, subject-independent models with high decoding accuracy for predicting articulated utterances before speech production. This ability to learn a shared phonemic representation across the cortex using pre-trained group models enhances performance even for subjects with limited coverage. These results have implications for the development of brain-to-text iBCIs leveraging phoneme decoding. The superior performance of this multi-subject model suggests that leveraging data from multiple individuals can help overcome subject-specific variations to improve model generalizability and enable decoding even without coverage of critical speech cortex. sEEG recordings additionally provide a very broad sampling of the language production process, augmenting the richness of information that can be harnessed by the decoders14,15,20,28. This approach allows for inputs from a distributed speech network that orchestrates the planning, sequencing, and execution of speech information, offering vantage points29 for speech decoding. These insights are pertinent to the creation of robust phoneme decoding systems that accurately translate neural activity into speech outputs across different users.

The importance of initialization in machine learning models has long been underscored in computational research. Proper model initialization and subsequent fine-tuning to a specific context are critical for datasets with limited training samples, and deep learning architectures are more likely to succeed when trained on large banks of datasets with hyperparameter optimization. This approach has more recently been realized in computational neuroscience by pre-training on neural data from a host of tasks, sources, and modalities, enhancing robustness during inference and real-time performance30. However, the same aspects that afford pre-trained models their utility, namely the number of flexible parameters and the time it takes to optimize them for a specific context, become a computational wall. We show the effectiveness of shallow networks in building generalizable solutions, which require a fraction of the compute time yet attain high information transfer rates.

State-of-the-art decoding methods3,5,6,7,31 have focused on speech decoding frameworks in individuals, with great success, yet they do not allow for shared-subject analysis techniques. These studies, with variable model architectures and modalities, endeavored to show generalizable results across subjects, yet none has shown significant improvement of decoding performance while leveraging transfer learning8,32,33. While certain aspects of our work build on established findings (e.g., transfer learning for cross-subject BCIs and neuroanatomical effects on decoding), this work introduces the application of transfer learning to intracranial speech decoding with a focus on detailed articulatory complexity; a cohort-wide systematic analysis of how data/AI interactions influence decoding performance across subjects, which we believe provides valuable insights for designing future BCIs; and the integration of these findings into a framework for optimizing transfer learning in real-world neuroprosthesis applications. Existing demonstrations of high-performance speech decoding6,7 require participants to perform training sessions over ~2 weeks to achieve stable accuracies, as well as lengthy calibration processes due to signal nonstationarities and electrode micro-movements. A more recent account5 demonstrated a fast-calibrating speech neuroprosthesis by training the decoder on a given day’s neural data together with previous-day neural data, achieving stable asymptotic accuracy over a week of training. Nevertheless, these works hinged on the subject voluntarily participating in hours of training each day. Our proposed group-level speech decoding framework could, in the near future, accelerate this process in subjects who cannot provide such ample training data by flexibly pre-training the decoder with a group-level representation. This work accommodates variable electrode positioning and shared-subject analysis, employing supervised dimensionality reduction to nonlinearly weight groups of channels. Additionally, our flexible model architecture allows automated subject-specific cross-validation for a variable number of trials per subject during group model pre-training, so the pipeline is not limited by a minimum number of trials in a single participant.

Scaling speech decoding models across subjects with variable cortical sampling of a distributed language network during multi-word speech production is feasible because neural datasets themselves are high dimensional while task-specific signals are low dimensional. This feature of the latent neural space is critical in enabling the decoder to become generalizable. A goal of this model is to retain a latent feature space, through subject-invariant pre-training, that can be fine-tuned for datasets with limited information (either due to incomplete cortical sampling or damage to speech production hubs) and used to impute embeddings for a population-based decoder. This approach aligns with neurocomputational theories like the DIVA model, which conceptualize speech motor control as emerging from feedforward articulatory plans refined by sensory feedback34. The latent trajectories decoded here may reflect such transferable motor representations, robust to anatomical and functional variability. When paired with a simple linear decoder, the model can map those same latent features of articulation to signals encoded across the distributed language network, through a supervised filter already trained on distributed language networks, potentially with sensorimotor cortex coverage. Studies have demonstrated the effectiveness of these architectures by building more efficient motor decoders that remain stable for longer with less training data35,36,37. Low-dimensional neural manifolds built from single-unit data across multiple subjects have enabled long-term stable cortical-kinematic embeddings, which can then be used to initialize decoders across animals performing similar tasks and behaviors35,37. Harnessing such manifolds to pre-train on a host of neural datasets with the neural data transformer architecture enables much stronger inference on held-out trials30. This groundwork sets the stage for exploring the potential of multi-subject models in high-performance large-cohort speech and language decoding, as is particularly evident in recent studies showcasing their effectiveness at the local field potential level for decoding single-word speech8,9,25, especially with transformer models8,20,38,39.

Work with deep learning architectures has, in the past, included attribution or occlusion analyses to identify which layer weights, aspects of the model architecture, or components of the dataset allow the model to converge toward favorable solutions. In our view, the most crucial aspect of this work is the training dataset, and for this reason we conducted a REO analysis. However, in contrast to most current research that conducts this occlusion analysis at the single-electrode level31, we examined network-level electrode occlusion effects on decoding performance. Occluding speech production-specific regions and showing their effect on the phoneme decoding architecture allows us to confer some neurobiological validity on the nonlinear dimensionality reduction boundaries that our architecture applies when separating neural responses at the phonemic level.

Critically, REO significantly affects single-subject models, whereas group-level models remain robust to single-region electrode occlusion. This resilience is pivotal for generalizing the architecture to datasets with dysfunctional language hubs, tapping into the distributed system of the speech production network. In single-subject models, electrodes removed from vSMC, pSTG, and STS regions profoundly impact pre-articulatory speech decoding, resulting in significant PER degradation and highlighting their central role in temporal and phonetic encoding during speech preparation40,41,42. These regions have additionally been implicated in prior work43,44 dissecting the lexical and phonological route of reading, which maps orthography to meaning and sound, respectively, in comprehension. Conversely, for group-level models, removing electrodes from these regions does not diminish the performance improvement they offer compared to within-subject models. However, the extent of improvement is constrained by region availability and coverage density across subjects in the training dataset. Greater electrode coverage in the SCG region within the training set markedly enhances inference performance for subjects with predominantly frontotemporal coverage, with REO exerting a notable effect. Limited sampling in this region does not erase the benefits of group-model transfer—potentially due to separable learned representations that mirror the dual-stream framework of speech processing, with a dorsal stream supporting articulatory-motor integration that remains robust in our decoding architecture, and a ventral stream more involved in comprehension45.

Our study involves an encoder-decoder framework for speech decoding, similar to previous approaches that utilized convolutional1,46 or LSTM28,47,48,49 sequence-based decoding models. Compared with linear and CNN models, these sequence-based models effectively capture the latent temporal articulatory6,7,50,51 and acoustic information15,32,52 in speech production. However, linear models still serve as an effective baseline with interpretable results; they are widely recognized as standard benchmarks in machine learning and provide a reference point to evaluate the performance gains achieved by more complex models. This ensures that the observed improvements are due to model complexity rather than inherent task or data properties. Additionally, comparing performance across models of varying complexity offers practical insights for speech BCIs, where computational efficiency and resource constraints often influence model selection. While deep learning models provide superior performance, the linear model is more often used in real-time speech decoding practice and additionally provides interpretable patterns53 encoded across neural activity for validating neuroscientific findings from other literature. The use of a teacher-forcing model that allows variable-length sequential phoneme predictions optimizes alignment with non-stationarity in speech rate and neural processing, and carries greater potential for real-time speech decoding owing to its naturalistic language modeling approach. Additionally, by using only phoneme identity and position, our approach minimizes data requirements and proves particularly valuable for non-speaking patients, as it does not rely on spoken speech spectrograms for training.

While a model trained on healthy subjects may face limitations in adapting to patients with speech impairments due to the underrepresentation of abnormal brain activity patterns, our approach aims to provide a stable training manifold to bridge this gap. The group model aggregates neural activity across multiple subjects, offering a robust initialization for decoders, particularly in patients with incomplete or partial datasets. Although this approach is constrained by the cortical regions sampled and the available task data, the use of accumulated training datasets and iterative dense sampling of the sensorimotor cortex provides a promising framework for generalizing to dysfunctional language networks and speech disorders due to motor pathway damage, offering a more effective starting point than training on limited patient-specific data alone.

One limitation in our study is the lack of analysis of the inner speech trials in our task; however, these trials present inherent limitations. There is no definitive way to verify whether participants covertly articulated the cued words or engaged in other cognitive processes, such as silent reading or ignoring the cues entirely. This aligns with prior findings suggesting that inner speech may be impoverished in its phonemic and featural representations compared to overt speech10. Future work will incorporate tasks that, due to their less effortful nature, are better suited to reliably bias participants toward engaging in covert speech, enabling more precise decoding and validation of inner speech activity.

Another limitation is that these were healthy subjects overtly performing a complicated tongue twister task, which would be difficult for many BCI users, especially those who have lost their voice due to conditions like anarthria. However, because our text-based model does not rely on acoustic features, we can more directly test the causal transfer capability54 of our decoder by collating sEEG data across a cohort of patients rather than relying on auditory feedback.

Lastly, we would like to emphasize that the data collected from each patient totals approximately 1 h of training and testing data, given clinical constraints. Despite this, we accomplish PERs as low as 21%, whereas an intelligible BCI trained on days or weeks of neural activity approaches PERs of 10%–20% over weeks of training and recalibration5,6,7. One potential reason for this efficiency, apart from the model architecture and distributed cortical coverage, is the demanding articulatory stimuli themselves. Tongue twisters challenge the speech production network to actively monitor speech errors, heightening task-related activity. In addition to errors, recording closely competing stimuli allows the model to capture bifurcation-type dynamics such as Hopfield-like attractor states, further sharpening the tuning of sampled neural populations and enhancing the signal-to-noise ratio of articulatory regions.

Our study highlights the effectiveness of sEEG in capturing neural features across the distributed language network, enabling precise prediction of phonemic sequences prior to articulation. These findings validate neurophysiological insights into speech production and identify key regions crucial for minimally invasive speech-BCIs. The versatility of both the decoding model and sEEG data underscores their capacity to unveil shared information across subjects—deepening our understanding of impaired motor, sensory, and cognitive processes in neurodegenerative disorders. Additionally, our transfer learning architecture allows generalization across diverse speech production scenarios, making real-time decoding feasible and efficient. Future research should replicate these decoding frameworks with a broader language corpus, aiming to improve accessibility and patient outcomes, while targeting prosody and volume variations, enhancing the model’s applicability in real-world settings.

Methods

Participants

Twenty-five patients (12 male, 19–51 years, 4 left-handed, verbal IQ 95.6 ± 10.4, age of epilepsy onset 16 ± 9 years) with intractable epilepsy who had sEEG monitoring for seizure onset localization provided written informed consent to participate in this research study. The study was conducted under a protocol approved by the Committee for the Protection of Human Subjects (CPHS) at the University of Texas Health Science Center at Houston (protocol number HSC-MS-06-0835). Patients were excluded if they had right-hemisphere language dominance, large structural malformations or prior cortical resections.

Task design

All participants engaged in a task involving the production of sequences of tongue twisters (Fig. 1A). The stimuli were displayed on a 15.4-inch LCD screen with a resolution of 2880 × 1800 pixels, positioned at eye level approximately 2–3 feet from the participants. A total of 64 distinct sets of stimuli were presented, each consisting of four words following either a consonant-vowel-consonant (CVC) or CVCC structure. These four words were categorized along two axes: phonological and lexical bias, as detailed in Hickok et al. In the phonological bias condition, the word sequence imposed an increased phonological load by aligning the production-space similarity between the initial utterances of the first and third words. In the lexical bias condition, if an error occurred on the third word, specifically a targeted error (switching the initial utterances of the first and third words), the third word transformed into a real word instead of a pseudoword. During the task, participants were instructed to mentally simulate speech production upon the first appearance of the four words on the screen. For the following two trials, they were prompted to overtly articulate the words displayed, and for the last two trials, the words were removed from the screen, requiring participants to articulate them from memory. In total, 128 distinct stimuli containing 218 unique words, arranged as sets of four words making up a tongue twister, were used. The trials represented 36 of the 44 phonemes in the English language, with occurrences ranging from 5 for the TH phoneme to 280 for the B phoneme. Stimuli were presented using the Psychophysics Toolbox in MATLAB, in lowercase Arial font with a height of 150 pixels (2.2° visual angle). Each stimulus remained on the screen for 1500 ms, with a 2000 ms inter-stimulus interval. The stimuli were presented across two recording sessions, each encompassing the presentation of 64 stimuli in a pseudorandom order without repetition. In this analysis, we focused only on trials in which the participant articulated the tongue twister correctly, to avoid instability in the linear articulatory decoding models.

Behavioral analysis

The mean accuracy across all task conditions surpassed 80%, with trials excluded if any words were articulated incorrectly or if there were any disruptions in fluency, such as stuttering or delays (average percent incorrect = 5.5%; average dysfluency trials = 6.6%).

Data recording and preprocessing

Neural recordings from each participant were acquired from multiple sEEG probes (14–21) implanted for clinical purposes of seizure localization using a Robotic Surgical Assistant (ROSA; Medtech). Each probe contained 8–16 platinum-iridium electrodes of 0.5 mm or 2 mm in length with a center-to-center spacing of 0.5–4.43 mm (PMT Corporation). Electrodes were localized by co-registering a pre-operative anatomical MRI scan with a post-operative CT scan and displayed on a 3D cortical surface model generated in Freesurfer (Dale et al.55).

Neural activity was recorded at 2 kHz using the NeuroPort recording system (Blackrock Neurotech). The neural data were visually inspected to remove channels contaminated with line noise, epileptic activity, and artifacts. Data from each electrode was then re-referenced using common-average referencing.

The raw electrode data were first bandpass filtered into broadband gamma activity (BGA; 70–150 Hz), with line noise simultaneously removed using zero-phase second-order Butterworth band-stop filters. A frequency-domain bandpass Hilbert transform with paired sigmoid flanks and a half-width of 1.5 Hz was then applied. The resulting analytic amplitude was smoothed using a Savitzky–Golay finite impulse response filter (third order, 201 ms frame length). BGA is expressed as a percentage change from baseline, defined as the period 500–100 ms preceding each stimulus presentation.
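
A minimal sketch of this preprocessing chain is shown below, assuming common-average-referenced data sampled at 2 kHz. Standard SciPy filters are used as stand-ins for the frequency-domain Hilbert transform with sigmoid flanks described above, and all variable names are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, savgol_filter

FS = 2000  # sampling rate (Hz)

def extract_bga(raw, stim_onsets):
    """raw: (n_channels, n_samples) common-average-referenced sEEG.
    stim_onsets: stimulus onset times in samples."""
    x = raw.copy()

    # Zero-phase 2nd-order Butterworth band-stops around line-noise harmonics.
    for f0 in (60, 120, 180):
        sos_notch = butter(2, [f0 - 2, f0 + 2], btype="bandstop", fs=FS, output="sos")
        x = sosfiltfilt(sos_notch, x, axis=-1)

    # Zero-phase band-pass into the broadband gamma range (70-150 Hz).
    sos_bga = butter(2, [70, 150], btype="bandpass", fs=FS, output="sos")
    x = sosfiltfilt(sos_bga, x, axis=-1)

    # Analytic amplitude (time-domain Hilbert transform as a simplification of
    # the frequency-domain transform with sigmoid flanks used in the paper).
    amp = np.abs(hilbert(x, axis=-1))

    # Savitzky-Golay smoothing: 3rd order, ~201 ms frame (odd number of samples).
    frame = int(0.201 * FS) | 1
    amp = savgol_filter(amp, window_length=frame, polyorder=3, axis=-1)

    # Percent change from the 500-100 ms pre-stimulus baseline.
    baseline = np.concatenate(
        [amp[:, t - FS // 2: t - FS // 10] for t in stim_onsets], axis=-1
    ).mean(axis=-1, keepdims=True)
    return 100.0 * (amp - baseline) / baseline
```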

Continuous audio was also recorded synchronously through a lapel microphone at 30 kHz using the same recording system and then segmented into individual trials. Articulation onset and offset times of the individual words and phonemes spoken during each trial were extracted from the audio signal using the Montreal Forced Aligner (MFA) (Montreal, Quebec, Canada) (McAuliffe et al.56). The segmentations produced by MFA were then validated manually in Praat.

Statistical analysis

To ensure statistically robust and topologically precise estimates of BGA in electrocorticography (ECoG), we employed surface-based mixed-effects multilevel analysis (SB-MEMA) (Kadipasaoglu et al.26,27) to generate population-level representations. SB-MEMA analysis was performed on a window of 500 ms time-locked to the onset of articulation for the first word in the tongue twister stimuli. Significance levels were determined at a corrected alpha level of 0.01, implementing family-wise error rate corrections for multiple comparisons. The minimum family-wise error rate criterion was established through white-noise clustering analysis (Monte Carlo simulations, 5000 iterations) on data with matching dimension and smoothness as the analyzed dataset (Kadipasaoglu et al.26). Following this, a geodesic Gaussian smoothing filter (3 mm full-width at half-maximum) was applied. Additionally, ECoG results were confined to regions with a minimum of three patients contributing to coverage and BGA percentage change exceeding 5%.

Sequence-based modeling

For the construction of a sequence prediction model, neural data underwent processing at either the phrase or word level, depending on the nature of the analysis. A bidirectional recurrent neural network from the TensorFlow library, with a Keras backend, was employed. This network featured an encoder-decoder structure capable of predicting either a predetermined length of phonemes (CVC model) or a variable length of phonemes. The latter utilized a teacher-forcing style decoder trained on phonemes within the closed dictionary of tongue twister stimuli. Model optimization, involving the determination of the number of units and layers, was accomplished through hyperparameter tuning on a validation dataset. The optimal number of layers and units depended on the number of channels, regions, and the type of target involved in the analysis. The base model for variable-length, phrase-level phoneme decoding using all clean channels (disregarding region selectivity) had two LSTM layers of 64 units each: one acted as the neural encoder and the other as the decoder, whose encoded hidden states were passed to a final dense layer projecting to the number of unique phoneme classes within the target dictionary. PERs reported in the final analysis were derived from predictions on the test set for each subject. The pre-articulatory analysis involved windows of data preceding phrase onset, defined here as the onset of articulation of the first word.
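
As a concrete illustration, the sketch below builds a fixed-length (CVC × 4 words) encoder-decoder of this general form in Keras; the channel count, time-window length, and phoneme inventory size are placeholders rather than the exact published configuration.

```python
from tensorflow.keras import layers, Model

N_CHANNELS = 128      # sEEG channels after artifact rejection (example value)
N_TIMEPOINTS = 200    # neural frames per trial (example value)
N_PHONEMES = 12       # fixed sequence length: CVC x 4 words
N_CLASSES = 37        # phoneme inventory plus blank token (example value)

neural = layers.Input(shape=(N_TIMEPOINTS, N_CHANNELS), name="neural")
# Temporal convolution: nonlinear, subject-specific dimensionality reduction.
x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(neural)
# Encoder LSTM summarizes the trial into a single latent state.
state = layers.Bidirectional(layers.LSTM(64))(x)
# Decoder LSTM unrolls the latent state over the fixed phoneme sequence.
x = layers.RepeatVector(N_PHONEMES)(state)
x = layers.LSTM(64, return_sequences=True)(x)
# Linear readout to phoneme identity probabilities at each position.
probs = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax"))(x)

model = Model(neural, probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```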

In the evaluation of sequence prediction mechanisms, two models were constructed. The first model adhered to task-specific constraints, training and testing solely on sequences following a predefined order of CVCs repeated four times to form the four unique words constituting the entire phrase. This model, designed for fixed-length sequences, faced limitations in adapting to variable-length sequences, posing challenges for generalizability to natural speech events with varying utterance lengths. To address these limitations, a second encoder-decoder model was introduced. The decoder of this model demonstrated greater flexibility through three key methods. First, it operated in a teacher-forcing style, facilitating information transfer on a phoneme-by-phoneme basis. Second, the target features accommodated blank tokens, start tokens, and end-of-sentence tokens commonly used in natural language processing models, providing additional information about speech pauses and breaks in utterances. Lastly, a connectionist temporal classification (CTC) loss function was implemented, allowing for the marginalization of various forms of alignment between predicted and articulated phoneme sequences. This approach facilitated optimal handling of merging, concatenation, and deletion of extra predicted tokens.
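
The alignment-marginalizing objective can be expressed compactly with TensorFlow's built-in CTC loss; the snippet below is a simplified sketch of that loss over per-step phoneme logits only, not of the full teacher-forced decoder, and the function name and arguments are illustrative.

```python
import tensorflow as tf

def ctc_sequence_loss(labels, logits, label_lengths, logit_lengths, blank_index):
    """labels: (batch, max_label_len) int32 phoneme ids, padded.
    logits: (batch, max_time, n_classes) unnormalized decoder outputs.
    Marginalizes over all monotonic alignments between logits and labels."""
    per_trial_loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=blank_index,
    )
    return tf.reduce_mean(per_trial_loss)
```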

Accuracy metrics

In assessing subject performance, we refrained from assuming equal class distributions due to variations in the number of correctly articulated trials across subjects. Consequently, conventional accuracy metrics were unsuitable for our analysis. Instead, we opted for macro F1 scores to compute class-balanced accuracy, incorporating both precision and sensitivity (recall). For sequential predictions, we employed the PER, quantifying the number of substitutions, insertions, or deletions required to align with the ground truth sequence of phonemes.
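
For concreteness, the PER can be computed as a length-normalized edit distance; the sketch below shows this alongside a macro F1 call from scikit-learn, using illustrative inputs.

```python
from sklearn.metrics import f1_score

def phoneme_error_rate(predicted, reference):
    """(substitutions + insertions + deletions) / reference length."""
    n, m = len(reference), len(predicted)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == predicted[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[n][m] / max(n, 1)

# Example: one substitution and one deletion against a 6-phoneme reference.
ref = ["B", "AE", "D", "K", "AE", "T"]
hyp = ["B", "AE", "T", "K", "AE"]
print(phoneme_error_rate(hyp, ref))   # 2/6 ~ 0.33

# Class-balanced accuracy for single-phoneme classification analyses.
macro_f1 = f1_score(y_true=[0, 1, 2, 2], y_pred=[0, 1, 1, 2], average="macro")
```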

To establish a data-driven chance performance, we implemented a shuffled control scheme in which both trials and phoneme sequences were randomized while keeping sEEG signals intact. This preserves phoneme distributions but disrupts trial-specific neural-phoneme alignments, ensuring that model performance reflects meaningful neural decoding rather than frequency-based biases.
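
A minimal sketch of this control, assuming trial-aligned label lists: the phoneme-sequence labels are permuted across trials while the neural features are left untouched, and the decoder is then trained and evaluated on the shuffled pairing to yield the chance PER.

```python
import numpy as np

def shuffle_labels(phoneme_sequences, seed=0):
    """Permute phoneme-sequence labels across trials; sEEG features untouched."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(phoneme_sequences))
    return [phoneme_sequences[i] for i in perm]

# Training/evaluating the decoder on (X_trials, shuffle_labels(y_trials))
# preserves phoneme frequencies but breaks trial-specific neural alignment.
```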

Coverage correlation analysis

To evaluate the contribution of specific cortical coverage across the cohort, we parcellated the cortex into 50 regions using the Destrieux parcellation (FreeSurfer v4.5, 2009 Destrieux atlas57) and then contextualized these regions as canonical language and speech production hubs. For each anatomical region in this parcellation, we calculated a channel density metric, defined as the number of electrodes in the region divided by the total number of electrodes implanted in that subject. This parcellation enabled us to generate a barcode for each subject based on the channel density per parcel and to compute pairwise similarity scores across the cohort using cosine similarity.
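
The coverage "barcode" and similarity score can be sketched as follows, assuming a Destrieux parcel index has already been assigned to each electrode; function names are illustrative.

```python
import numpy as np

N_PARCELS = 50

def coverage_barcode(parcel_ids):
    """parcel_ids: Destrieux parcel index (0-49) for each implanted electrode."""
    counts = np.bincount(parcel_ids, minlength=N_PARCELS).astype(float)
    return counts / counts.sum()          # channel density per parcel

def coverage_similarity(barcode_a, barcode_b):
    """Cosine similarity between two subjects' coverage barcodes."""
    return float(np.dot(barcode_a, barcode_b) /
                 (np.linalg.norm(barcode_a) * np.linalg.norm(barcode_b)))
```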

Regional electrode occlusion (REO) analysis

To evaluate the interpretability of the architectures employed for phonemic decoding, a REO analysis was conducted in which important hubs of language production were removed from the grouped dataset at a regional level. These regions were seeded from nodes across the cortical surface as inferred by SB-MEMA. Electrodes were selected from individual regions to be excluded from the analysis, and a PER was calculated for each region of interest in each patient. Prior analyses generated all-channel decoding performances for each subject, so this error rate could be compared with that from the REO analysis to determine whether the absence of a region increased the PER (i.e., degraded decoding performance). To control for effects across individual patients, linear mixed effects models were used to evaluate whether removing a region produced a significant increase in PER, with an intercept fitted for each patient. This analysis was repeated for different time windows, with a separate linear mixed effects model constructed for each window to understand the sequencing of regional activity and its context in phonological processing.
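
A minimal sketch of the mixed-effects comparison is shown below, assuming a hypothetical results table with one row per subject, region, occlusion condition (coded 0 for all channels, 1 for region removed), time window, and PER; statsmodels provides the random-intercept model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: columns subject, region, occluded (0/1), time_window, per.
df = pd.read_csv("reo_results.csv")

for (region, window), d in df.groupby(["region", "time_window"]):
    # Random intercept per patient; the fixed effect of `occluded` tests whether
    # removing this region's electrodes significantly increases PER.
    result = smf.mixedlm("per ~ occluded", data=d, groups=d["subject"]).fit()
    print(region, window, result.params["occluded"], result.pvalues["occluded"])
```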

Subject-independent decoding analysis

For transfer learning between subjects using our sequence-to-sequence models, we add a simple 1D convolutional layer in front of the LSTM and affine layers. We pre-train the model on a single subject, and the core LSTM encoder layer and affine layer are then frozen, meaning their weights are not adjusted during backpropagation when training on a new subject’s data and labels. However, we keep the convolutional layer trainable to allow the model to extract subject-relevant features from the variable electrode configurations arising from patient-specific anatomical electrode trajectories as the model is transferred from one participant to another. A model trained on a single subject is then transferred to each of the other 24 subjects who performed this task, with the convolutional layer trainable while the core LSTM layer and phoneme output layer are frozen. Training on a new subject is done for only 100 epochs, compared with 500 epochs of pre-training on the original subject.
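
A minimal Keras sketch of this transfer step is shown below, assuming the pretrained model is a simple chain of Conv1D, recurrent encoder/decoder, and readout (as in the earlier sketch); the pretrained layers downstream of the convolution are frozen and reused, while a new subject-specific Conv1D remains trainable. Names and sizes are illustrative.

```python
from tensorflow.keras import layers, Model

def build_transfer_model(pretrained, n_channels_new, n_timepoints):
    """Reuse a pretrained single-subject model on a new subject's montage."""
    neural = layers.Input(shape=(n_timepoints, n_channels_new))
    # New, trainable front end mapping this subject's electrodes into the
    # feature space expected by the frozen shared layers.
    x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(neural)
    # Freeze and reuse everything downstream of the pretrained Conv1D
    # (assumes the chain-structured model sketched earlier: Input, Conv1D, ...).
    for layer in pretrained.layers[2:]:
        layer.trainable = False
        x = layer(x)
    return Model(neural, x)

# Fine-tuning on the new subject (100 epochs vs. ~500 for pre-training):
# model = build_transfer_model(pretrained_model, n_channels_new=96, n_timepoints=200)
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(X_new, y_new, epochs=100)
```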

For multi-subject models, the only change we make to the architecture is a concatenation layer, which allows all training trials from multiple subjects to be collated together; the model then randomly samples trials across all subjects to create each training batch. This allows the model to evaluate the loss function collectively for the group and to build latent embeddings from each subject’s neural activity. After the across-subject neural embedding is built, the trials are split back into their respective subjects and the same linear readout decoder architecture is employed to predict the sequential phonemes corresponding to each individual subject’s training dataset. This provides a shared representation of the neural embeddings while mapping to subject-specific behavioral responses. To implement a zero-shot decoding evaluation, we applied a group K-fold cross-validation scheme with K = 5, leaving entire phrases from the tongue twister stimuli out. The held-out stimuli were removed from all subjects’ trials before the model was trained and served as the test dataset against which the group model was evaluated.
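
The sketch below illustrates this group architecture as a multi-input Keras model (the published pipeline instead concatenates trials across subjects into shared batches), together with a group K-fold phrase split for zero-shot evaluation; all layer sizes and names are placeholders.

```python
import numpy as np
from tensorflow.keras import layers, Model
from sklearn.model_selection import GroupKFold

def build_group_model(channel_counts, n_timepoints, n_phonemes, n_classes):
    # Shared layers: a common latent articulatory space and readout.
    shared_encoder = layers.Bidirectional(layers.LSTM(64))
    shared_decoder = layers.LSTM(64, return_sequences=True)
    shared_readout = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))

    inputs, outputs = [], []
    for n_ch in channel_counts:                        # one branch per subject
        inp = layers.Input(shape=(n_timepoints, n_ch))
        # Subject-specific temporal convolution (trainable per subject).
        x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(inp)
        x = shared_encoder(x)
        x = layers.RepeatVector(n_phonemes)(x)
        x = shared_decoder(x)
        outputs.append(shared_readout(x))
        inputs.append(inp)
    return Model(inputs, outputs)

def zero_shot_splits(phrase_ids, k=5):
    """Hold out entire phrases (pooled across subjects) in each fold."""
    gkf = GroupKFold(n_splits=k)
    for train_idx, test_idx in gkf.split(np.zeros(len(phrase_ids)), groups=phrase_ids):
        yield train_idx, test_idx      # train/test trials share no phrases
```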

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.