Introduction

Intracranial recordings have yielded novel insights into how focal neuronal populations encode articulatory kinematics, latent phonetics1,2,3, and vocal modulation4. These insights have driven the creation of brain-computer interfaces (BCIs) to enable communication in speech apraxia. Thus far, these data have been recorded from intact sensorimotor cortex in anarthric individuals with damaged efferent pathways or end effectors5,6,7. Models derived from these data are highly individualized5,8,9 and not readily extensible to patients with cortical loss due to brain injury. In such aphasic, as opposed to anarthric, individuals, a BCI that combines sparse data from brain regions with residual language capacity with a transfer model derived from a population of normal individuals would allow us to bridge gaps in clinical translation and develop a generalizable prosthesis.

In service of this goal, we implemented a tongue twister paradigm10,11, designed to load the articulatory system, in a cohort of 25 patients using over 3600 stereoelectroencephalographic (sEEG) depth electrodes. We used sequence-to-sequence models to decode phonemes12,13,14,15,16 from distributed speech hubs3,17,18,19,20,21 and assessed the contributions of the number of channels and the number of trials (a surrogate for the quantity of neural data used for training), evaluating decoding performance not only during but also prior to articulation. We then developed a grouped transfer learning technique to train population neural latents22,23 and assessed the combined effects of each of these factors to generate a robust, reliable training manifold for speech decoding. These manifolds were then implemented as generalizable decoders on patients not used to train them and demonstrated improved inference in individuals with limited coverage of the speech motor cortex (akin to missing these brain regions due to injury).

By leveraging multi-site and multi-subject cortical data, this architecture is initialized on diverse neural codes, enabling a pre-trained nonlinear neural encoder that maps onto a linear readout effector. While others have focused on pre-training at the stimulus level to provide stronger priors for speech waveform reconstruction24,25, we restricted the complexity of the decoder output to phonemic sequences and instead built stronger priors for the encoder. This approach pushed the informational limits of neural data, creating a rich latent feature set from models that learn subject-independent representations of articulation. These generalizable manifolds of speech production, coupled with transfer learning, allowed us to estimate planned phonemic trajectories in patients lacking sufficient data to construct the latent feature set. This framework can potentially facilitate neural prosthetics for aphasic patients who lack the typical levels of word production fluency needed to initialize decoding models.

Results

Across the task, average accuracy for pronouncing all words correctly in a tongue twister trial was 87% (±4% S.D.). Trials with articulatory errors (8% ± 3% S.D.) or dysfluency (5% ± 2% S.D.) were excluded. Across the cohort of 25 patients, we recorded comprehensively from peri-sylvian frontotemporal language sites (Fig. 1B, C). A mixed effects multi-level analysis (SB-MEMA, Kadipasaoglu et al.26,27) was used to aggregate data and revealed expected loci of activation in subcentral gyrus (SCG), superior temporal gyrus (STG), posterior middle temporal gyrus, premotor cortex, and inferior frontal gyrus (IFG) (Fig. 1D).

Fig. 1: Experimental overview.
figure 1

A Tongue twister paradigm and phonetic transcript of a trial with and without speech dysfluency. B Aggregate of surface recording zones represented on an inflated brain surface to represent the density of electrode coverage across the cohort of subjects. C 3641 (blue) electrodes in 25 patients were used (dark electrodes were excluded due to noise). D A surface-based mixed effects multilevel analysis (sbMEMA) was used to derive a population map of activation related to articulation of the first word (0–500 ms) scaled by the extent of activation and corrected for the number of contributors. Created in BioRender. Singh, A. (2025) https://BioRender.com/026o8yt.

We deployed a sequential state-based model (Fig. 2A) based on the premise that it would outperform a linear model at reconstructing phoneme sequences from continuous samples of neural data. This sequence-to-sequence (Seq2Seq) model performed significantly better than chance (10% accuracy, S.D. 8%) and outperformed a linear model (24% accuracy, S.D. 5%) in predicting phonemes across all patients. As our model targets phoneme-level decoding, we report Phoneme Error Rate (PER) rather than Word Error Rate (WER), allowing us to assess neural representations at a finer granularity appropriate to our objectives. The model achieved a median PER of 27% (S.D. 6%) when decoding from articulatory periods, and 34% (S.D. 6%) when decoding from pre-articulatory periods (Fig. 2B). Additionally, we tested the ability to predict variable-length phoneme sequences using a teacher-forcing variant of the Seq2Seq model and again achieved significantly better performance than chance (10% accuracy, S.D. 8%) and a linear model (20% accuracy, S.D. 6%). The variable-length decoder achieved a median PER of 44% (S.D. 4%) using data from articulation periods, and 56% (S.D. 6%) when decoding from pre-articulatory periods. The lowest PERs for a single subject were 13% and 24% during articulation for the fixed and variable-length phoneme sequences, respectively, and 26% and 43% when decoding from pre-articulatory intervals.

Fig. 2: Seq2Seq models for phoneme neural decoding.
figure 2

A Schematic representation of a sequence-to-sequence model: neural data with variable cortical coverage were processed by a temporal convolutional layer, a recurrent neural network, and a linear readout layer to isolate phoneme identity probabilities for each index in the phoneme sequence. These predicted phoneme sequences (example predicted trial is depicted) are then compared using a distance metric to evaluate a phoneme error rate (PER). B Phoneme sequences were decoded utilizing frames of neural activity during articulation and prior to articulation with a fixed and variable length Seq2Seq model as well as a linear model for comparison. PERs were computed across these conditions to evaluate effects of time window and model architecture. C, D Cohort level trial and channel statistics from controlled analyses driving decoding performance, and extrapolated values for optimal number of trials and channels for high decoding accuracy (1-PER). E Regional electrode occlusion (REO) analysis created for decoding architectures for broad lobe-based and region-specific analysis, employing a linear mixed effects model with random effects for patients across different time windows preceding articulation. Box plots (center/bounds/whiskers): Linear 76%/70–82%/65–90% (articulatory), 80%/75–85%/70–92% (pre-articulatory); Fixed 27%/22–32%/18–42% (articulatory), 34%/29–39%/25–47% (pre-articulatory); Variable 44%/38–50%/32–58% (articulatory), 56%/50–62%/45–68% (pre-articulatory). Outliers beyond 1.5× IQR shown. Statistical significance by repeated measures ANOVA, two-sided (*p < 0.05, **p < 0.01, n = 25 subjects). Created in BioRender. Singh, A. (2025) https://BioRender.com/22ktqfb.

Given the variability in decoding performance across the cohort, we explored key factors in the training dataset that may contribute to this variance. The two primary parameters influencing each subject’s dataset size were the number of electrodes implanted, based on the anatomical trajectories required for seizure localization, and the duration for which each subject was able to perform the task. Controlling for the number of electrodes (Fig. 2C), we observed a significant correlation (R2 = 0.64) between the number of trials and decoding accuracy, with decoding projected to reach <10% PER (90% accuracy) at approximately 180 min of articulation. To evaluate the impact of channels (Fig. 2D), we subsampled channels from each subject while controlling for trials. Certain subject clusters showed early separation and optimal performance with 500 channels, while others required up to 1500 channels. Notably, significant separation between subject clusters (p < 0.05) was observed with as few as 100 channels. Even after controlling for subject-specific improvements in decoding accuracy using a linear mixed-effects model, the number of channels remained a significant factor (p < 0.05) in enhancing decoding accuracy.

To further examine the variability in decoding performance, a regional electrode occlusion (REO) analysis (Fig. 2E) was performed by generating multiple recurrent models that, during training, left out specific sets of electrodes in critical speech production hubs based on their anatomical locations. This allowed us to use individual regional activity as a proxy for each region’s contribution to the identity and position of each phoneme. When electrodes in the ventral sensorimotor cortex were removed, the distribution of decoding accuracy for the all-channels model versus the ablated model was sharply skewed to the left, demonstrating significant effects and indicating the crucial role that sensorimotor cortex electrodes play in phoneme decoding. Similarly, a pronounced leftward shift in performance was observed when temporal lobe electrodes were removed. We continued our REO analysis in smaller regions of interest implicated in the speech production network using our within-subject models. Pre-articulatory activity in SCG, pSTG, and STS each contributed significantly (p < 0.01) to phonological processing, with early involvement of STS and SCG and later involvement of pSTG at the onset of articulation. Activity in IFG and aSTG showed no significant effect on PER when removed from the model trained on pre-articulatory data. This suggests that individual variations in recording sites render some subjects more valuable for decoding, underscoring the importance of electrode placement and regional contributions in explaining variability across the cohort.

Transfer manifolds across individuals

Influence of data size and recording locations

As the preceding analyses show, inter-subject variability in the amount of training data and the number of electrodes clearly affected decoding performance at the cohort level, motivating a practical implementation that exploits these features to improve decoder performance across subjects. We selected the participant with the best decoding performance and implemented a transfer learning technique that generalized the model trained on this individual’s data across the cohort. This mapping of learned latent features to subject-specific neural data resulted in remarkably good decoding performance, with no significant difference in PER (p = 0.72) from models trained within-subject (Fig. 3A). Optimal decoding performance resulted from keeping all components of the subject-specific model trainable, except for the recurrent layer (p < 0.001). By utilizing the learned embeddings from this best-decoding subject, the recurrent layer effectively embeds optimal latent features that are then enhanced by individual data from the rest of the cohort. Thus, while some subject specificity is important to achieve baseline decoding performance, core recurrent non-linearities can be learned and transferred from one patient to another to improve decoding performance.

Fig. 3: Applying transfer learning to Seq2Seq models.
figure 3

A Assessing the transferability of model components through PERs by comparing subject-independent models; transferring all layers of a trained model and freezing their weights in the inference model; transferring and freezing the readout layer; and transferring and freezing the recurrent layer. p < 0.001 for recurrent layer transfer vs within subject performance, p < 0.05 for readout transfer vs within subject performance, and p > 0.05 for full model transfer vs within subject. Transfer decoding allows improvements in decoding (∆PER) employing training subjects with B increased number of trials. C Subject electrode location correlations calculated and a group of 14 similarly distributed subjects (to ensure decoding changes are not due to completely dissimilar electrode coverages) from the cohort are sub-selected to show that D increased number of channels and E shared coverage correlation with the inference subject additionally are strong drivers for decoding performance. Box plots (center/bounds/whiskers): Within Subject 0.49/0.47–0.51/0.44–0.54; Full Transfer 0.46/0.44–0.48/0.42–0.50; Readout Transfer 0.45/0.43–0.47/0.41–0.49; Recurrent Transfer 0.43/0.41–0.45/0.39–0.47. Outliers beyond 1.5× IQR. Statistical significance by paired t-tests, one-sided (***p < 0.001, **p < 0.01, *p < 0.05, n = 25 subjects). Created in BioRender. Singh, A. (2025) https://BioRender.com/1o97awd.

To ensure the feasibility of this methodology in clinical practice, where the size of the training data is dictated by clinical needs (electrode placements are guided by epilepsy localization and recording time by the highly variable time available for research during a patient’s stay in the epilepsy monitoring unit), we assessed the value of a shared decoding model. Specifically, we examined whether decoder performance could be improved when inferring on minimal, partial, or incomplete datasets. We mapped the transfer decoding framework across our cohort and assessed, in a pairwise analysis, the critical features that lead to the most significant improvement. Using linear mixed effects modeling to control for variance between performance changes in training-inference subject pairs, phoneme decoding performance showed significant (p < 0.05) improvement for a subject when initialized with recurrent embeddings from a subject with a greater number of trials (Fig. 3B). To better understand how increased channel density in speech-motor cortex improves decoder performance across the cohort, we first needed to functionally categorize each subject’s electrode implantation scheme. We therefore used the Destrieux parcellation in FreeSurfer to map cortical coverage into 50 regions, focusing on key language and speech production hubs. To avoid biasing channel effects due to distinctly unmatched electrode trajectories, we focused on 14 of the 25 subjects who had distributed coverage and an average shared correlation score over 10%. In this subgroup, we observed a significant improvement in decoder performance (Fig. 3D) when models were initialized with recurrent embeddings from subjects with a higher number of implanted electrodes. This approach also allowed us to quantify channel density per region, enabling assessment of how similarities in coverage patterns between a training and an inference dataset impact decoder performance. Transfer learning performance improved significantly (p < 0.05) when the training subject had a greater correlation of sites of cortical coverage with the inference subject (Fig. 3E).

Transfer manifolds using an idealized training set

Isolating optimal characteristics

We evaluated whether a shared recurrent layer across multiple subjects could enhance performance compared to using a single “ideal” subject, enabling creation of a generalizable training manifold (Fig. 4A) beyond a single optimal subject. This approach allows the recurrent layer to integrate generalizable, subject-invariant information about articulatory processes. By selecting information that improves decoding accuracy across multiple subjects, the model encodes features that are not specific to one subject but generalizable across different perspectives of a task and neural activity within a specific region. To test the robustness of this architecture, we conducted REO (Fig. 4B) in both the sensorimotor cortex and temporal lobe within the multi-subject models. The results demonstrate that the multi-subject model is resilient to such electrode occlusions, minimizing performance degradation. Occluding sensorimotor cortex electrodes in the multi-subject model disproportionately affects top performers, with the rate of PER degradation increasing as decoding accuracy increases. This effect is less pronounced when regionally occluding electrodes in the temporal lobe, with the distributions lying primarily along the diagonal, showing that REO in the temporal lobe has minimal effect on phoneme decoding performance for the group model.

Fig. 4: Multi-subject model integrated with transfer learning approaches.
figure 4

A Schematic of transfer learning for sequence-to-sequence modeling with multiple subjects. Each subject has a unique temporal convolutional layer allowing for subject-specific nonlinear dimensional reduction. Subject-specific features were concatenated, and hidden states were derived across all samples using a shared recurrent layer. The shared encoded features were separated back into subject-specific trials and a linear readout layer was used to decode phonemes sequentially like the above examples. The shared recurrent layer was then frozen and transferred to a subject held-out from the group model. B Multi-subject models show robustness to REO across sensorimotor cortex and temporal lobe. Created in BioRender. Singh, A. (2025) https://BioRender.com/ifs6xbo.

We then performed an analysis across the entire cohort using the five candidates that were not only in the top quartile of decoding performance but also had dense sensorimotor cortex coverage, especially in SCG. We assumed that the sensorimotor cortex would contribute the most to decoding, given the prior REO analyses across the single- and multi-subject architectures. A group-based model built from these five subjects showed that utilizing a shared recurrent layer resulted in significantly lower PER on held-out trials (p < 0.01) for each subject in the group model (Fig. 5A). By applying the population-level manifold learned from the group model to the held-out subject, we observed a remarkable enhancement in performance (Fig. 5B). When we used a single subject with comprehensive coverage of the speech and auditory systems, we observed a significant decrease in PER, from 57% in the within-subject model to 49% (p < 0.001). As more subjects were incorporated into the model, decoding accuracy improved, reaching a PER of 45% after integrating four subjects. To determine the optimal number of subjects needed for enhancing decoder performance, we applied a stopping criterion based on whether additional subjects significantly improved performance. Significant improvements were observed with three subjects, and no further change in PER with the inclusion of a fourth.

Fig. 5: Across subject neural representation stability via multi-subject transfer learning.
figure 5

A Pre-articulatory zero-shot decoding performances from single and group (n = 5) subject models. B Group model trained on pre-articulatory activity from 1 to 5 subjects (n) with dense sensorimotor cortex coverage significantly improves decoding performance when transferred to a participant with frontotemporal electrodes and no sensorimotor cortex coverage. C Heatmap of %∆PER improvement across 20 subjects with variable electrode coverage, applying transfer learning with a recurrent layer trained on increasingly more participants in a specific set and order with the group-based model (n = 1:5). D Peak %∆ improvement in decoding accuracy (PER) for each subject in the cohort, along with the optimal number of subjects required in the group model to achieve this performance. 5A) Box plots (center/bounds/whiskers): Single Subject Models—S1: 0.52/0.50–0.54/0.46–0.57, S2: 0.51/0.49–0.53/0.44–0.59, S3: 0.52/0.50–0.54/0.46–0.60, S4: 0.53/0.48–0.56/0.40–0.58, S5: 0.53/0.51–0.55/0.44–0.60; Group Models—S1: 0.46/0.40–0.49/0.34–0.52, S2: 0.46/0.43–0.49/0.34–0.52, S3: 0.47/0.39–0.48/0.34–0.56, S4: 0.47/0.40–0.49/0.32–0.54, S5: 0.45/0.41–0.49/0.33–0.51. Outliers beyond 1.5× IQR. Statistical significance by paired comparisons, two-sided (****p < 0.0001, ***p < 0.001, **p < 0.01). n = 5 subjects. 5B) Box plots (center/bounds/whiskers): Within Subject 0.57/0.55–0.59/0.52–0.60; Group Models—n = 1: 0.50/0.47–0.50/0.45–0.53, n = 2: 0.47/0.46–0.48/0.44–0.49, n = 3: 0.48/0.46–0.49/0.44–0.49, n = 4: 0.45/0.43–0.47/0.41–0.48. Outliers beyond 1.5× IQR. Statistical significance by repeated measures analysis, two-sided (****p < 0.0001, ***p < 0.001, **p < 0.01, *p < 0.05, n.s. p > 0.05). Sample sizes n = 1 to n = 4 for group models. Created in BioRender. Singh, A. (2025) https://BioRender.com/2brioq6.

We conducted the same analysis with the five subjects selected by the criteria above and assessed improvement in decoding across the cohort when utilizing latent states built from this specific set and ordering of subjects. Increasing the number of subjects is critical for some members of the cohort (Fig. 5C); however, others fail to improve significantly in decoding performance beyond three subjects. The optimal number of subjects in the group model depends on the inference subject’s affinity for the subjects included in the group model; however, for most subjects (Fig. 5D) decoding is comparable or improved when utilizing the transfer learning architecture. These findings highlight the substantial benefits of transferring group-derived latent articulatory features to subjects with incomplete cortical sampling, often due to clinical electrode placements.

Discussion

We show the effectiveness of large-cohort intracranial sEEG in training lightweight, subject-independent models with high decoding accuracy for predicting articulated utterances before speech production. This ability to learn a shared phonemic representation across the cortex using pre-trained group models enhances performance even for subjects with limited coverage. These results have implications for the development of brain-to-text iBCIs leveraging phoneme decoding. The superior performance of this multi-subject model suggests that leveraging data from multiple individuals can help overcome subject-specific variations to improve model generalizability and enable decoding even without coverage of critical speech cortex. sEEG recordings additionally provide a very broad sampling of the language production process, augmenting the richness of information that can be harnessed by the decoders14,15,20,28. This approach allows for inputs from a distributed speech network that orchestrates the planning, sequencing, and execution of speech information, offering vantage points29 for speech decoding. These insights are pertinent to the creation of robust phoneme decoding systems that accurately translate neural activity into speech outputs across different users.

The importance of initialization in machine learning models has long been underscored in computational research. Proper model initialization and subsequent fine-tuning to a specific context are critical for datasets with limited training samples, and deep learning architectures are more likely to succeed when trained on large banks of datasets with hyperparameter optimization. This approach has more recently been realized in computational neuroscience by pre-training on neural data from a host of tasks, sources, and modalities, enhancing robustness during inference and real-time performance30. However, the same aspects that afford pre-trained models their utility, namely the number of flexible parameters and the time it takes to optimize them for a specific context, become a computational wall. We show the effectiveness of shallow networks in building generalizable solutions, which require a fraction of the compute time yet attain high information transfer rates.

State-of-the-art decoding methods3,5,6,7,31 have focused on speech decoding frameworks in individuals, with great success, yet they do not allow for shared-subject analysis techniques. These studies, with variable model architectures and modalities, endeavored to show generalizable results across subjects, yet none has shown significant improvement of decoding performance while leveraging transfer learning8,32,33. While certain aspects of our work build on established findings (e.g., transfer learning for cross-subject BCIs and neuroanatomical effects on decoding), this work introduces the application of transfer learning to intracranial speech decoding with a focus on detailed articulatory complexity; a cohort-wide systematic analysis of how data/AI interactions influence decoding performance across subjects, which we believe provides valuable insights for designing future BCIs; and the integration of these findings into a framework for optimizing transfer learning in real-world neuroprosthesis applications. Existing demonstrations of high-performance speech decoding6,7 require participants to perform training sessions over ~2 weeks to achieve stable accuracies, as well as lengthy calibration processes due to signal nonstationarities and electrode micro-movements. A more recent account5 demonstrated a fast-calibrating speech neuroprosthesis by training the decoder on a given day’s neural data together with previous-day neural data, achieving stable asymptotic accuracy over a week of training. Nevertheless, these works hinged on the subject voluntarily participating in hours of training each day. Our proposed group-level speech decoding framework could, in the near future, accelerate this process in subjects who cannot provide such ample training data by flexibly pre-training the decoder with a group-level representation. This work accommodates variable electrode positioning and shared-subject analysis, employing supervised dimensionality reduction to nonlinearly weight groups of channels. Additionally, our flexible model architecture allows automated subject-specific cross-validation for a variable number of trials per subject during group model pre-training, so the pipeline is not limited by a minimum number of trials in a single participant.

Scaling speech decoding models across subjects with variable cortical sampling of a distributed language network during multi-word speech production is feasible because neural datasets themselves are high dimensional while task-specific signals are low dimensional. This feature of the latent neural space is critical in enabling the decoder to become generalizable. A goal of this model is to retain a latent feature space, through subject-invariant pre-training, that can be fine-tuned for datasets with limited information (either due to incomplete cortical sampling or damage to speech production hubs) and used to impute embeddings for a population-based decoder. This approach aligns with neurocomputational theories like the DIVA model, which conceptualize speech motor control as emerging from feedforward articulatory plans refined by sensory feedback34. The latent trajectories decoded here may reflect such transferable motor representations, robust to anatomical and functional variability. When paired with a simple linear decoder, the model can map those same latent features of articulation to signals encoded across the distributed language network, through a supervised filter already trained on distributed language networks, potentially with sensorimotor cortex coverage. Studies have demonstrated the effectiveness of these architectures by building more efficient motor decoders that remain stable for longer with less training data35,36,37. Low-dimensional neural manifolds built from single-unit data across multiple subjects have enabled long-term stable cortical-kinematic embeddings, which can then be used to initialize decoders across animals performing similar tasks and behaviors35,37. Harnessing such manifolds to pre-train on a host of neural datasets with the neural data transformer architecture enables much stronger inference on held-out trials30. This groundwork sets the stage for exploring the potential of multi-subject models in high-performance large-cohort speech and language decoding, as is particularly evident in recent studies showcasing their effectiveness at the local field potential level for decoding single-word speech8,9,25, especially with transformer models8,20,38,39.

Work with deep learning architectures has, in the past, included attribution or occlusion analyses to identify which layer weights, aspects of the model architecture, or components of the dataset allow the model to converge toward favorable solutions. In our view, the most crucial aspect of this work is the training dataset, and for this reason we conducted a REO analysis. However, in contrast to most current research that conducts this occlusion analysis at the single-electrode level31, we examined network-level electrode occlusion effects on decoding performance. Occluding speech production-specific regions and showing their effect on the phoneme decoding architecture allows us to confer some neurobiological validity on the nonlinear dimensionality reduction boundaries that our architecture applies when separating neural responses at the phonemic level.

Critically, REO significantly affects single-subject models, whereas group-level models remain robust to single-region electrode occlusion. This resilience is pivotal for generalizing the architecture to datasets with dysfunctional language hubs, tapping into the distributed system of the speech production network. In single-subject models, electrodes removed from vSMC, pSTG, and STS regions profoundly impact pre-articulatory speech decoding, resulting in significant PER degradation and highlighting their central role in temporal and phonetic encoding during speech preparation40,41,42. These regions have additionally been implicated in prior work43,44 dissecting the lexical and phonological route of reading, which maps orthography to meaning and sound, respectively, in comprehension. Conversely, for group-level models, removing electrodes from these regions does not diminish the performance improvement they offer compared to within-subject models. However, the extent of improvement is constrained by region availability and coverage density across subjects in the training dataset. Greater electrode coverage in the SCG region within the training set markedly enhances inference performance for subjects with predominantly frontotemporal coverage, with REO exerting a notable effect. Limited sampling in this region does not erase the benefits of group-model transfer—potentially due to separable learned representations that mirror the dual-stream framework of speech processing, with a dorsal stream supporting articulatory-motor integration that remains robust in our decoding architecture, and a ventral stream more involved in comprehension45.

Our study involves an encoder-decoder framework for speech decoding, similar to previous approaches that utilized convolutional1,46 or LSTM28,47,48,49 sequence-based decoding models. Compared with linear and CNN models, these sequence-based models effectively capture the latent temporal articulatory6,7,50,51 and acoustic information15,32,52 in speech production. However, linear models still serve as an effective baseline with interpretable results; they are widely recognized as standard benchmarks in machine learning and provide a reference point to evaluate the performance gains achieved by more complex models. This ensures that the observed improvements are due to model complexity rather than inherent task or data properties. Additionally, comparing performance across models of varying complexity offers practical insights for speech BCIs, where computational efficiency and resource constraints often influence model selection. While deep learning models provide superior performance, the linear model is more often used in real-time speech decoding practice and additionally provides interpretable patterns53 encoded across neural activity for validating neuroscientific findings from other literature. The use of a teacher-forcing model that allows variable-length sequential phoneme predictions optimizes alignment with non-stationarity in speech rate and neural processing, and carries greater potential for real-time speech decoding owing to its naturalistic language modeling approach. Additionally, by using only phoneme identity and position, our approach minimizes data requirements and proves particularly valuable for non-speaking patients, as it does not rely on spoken speech spectrograms for training.

While a model trained on healthy subjects may face limitations in adapting to patients with speech impairments due to the underrepresentation of abnormal brain activity patterns, our approach aims to provide a stable training manifold to bridge this gap. The group model aggregates neural activity across multiple subjects, offering a robust initialization for decoders, particularly in patients with incomplete or partial datasets. Although this approach is constrained by the cortical regions sampled and the available task data, the use of accumulated training datasets and iterative dense sampling of the sensorimotor cortex provides a promising framework for generalizing to dysfunctional language networks and speech disorders due to motor pathway damage, offering a more effective starting point than training on limited patient-specific data alone.

One limitation in our study is the lack of analysis of the inner speech trials in our task; however, these trials present inherent limitations. There is no definitive way to verify whether participants covertly articulated the cued words or engaged in other cognitive processes, such as silent reading or ignoring the cues entirely. This aligns with prior findings suggesting that inner speech may be impoverished in its phonemic and featural representations compared to overt speech10. Future work will incorporate tasks that, due to their less effortful nature, are better suited to reliably bias participants toward engaging in covert speech, enabling more precise decoding and validation of inner speech activity.

Another limitation is that these were healthy subjects overtly performing a complicated tongue twister task, which would be difficult for many BCI users, especially those who have lost their voice due to conditions like anarthria. However, because our text-based model does not rely on acoustic features, we can more directly test the causal transfer capability54 of our decoder by collating sEEG data across a cohort of patients rather than relying on auditory feedback.

Lastly, we would like to emphasize that the data collected from each patient totals approximately 1 h of training and testing data, given clinical constraints. Despite this, we accomplish PERs as low as 21%, whereas an intelligible BCI trained on days or weeks of neural activity approaches PERs of 10%–20% over weeks of training and recalibration5,6,7. One potential reason for this efficiency, apart from the model architecture and distributed cortical coverage, is the demanding articulatory stimuli themselves. Tongue twisters challenge the speech production network to actively monitor speech errors, heightening task-related activity. In addition to errors, recording closely competing stimuli allows the model to capture bifurcation-type dynamics such as Hopfield-like attractor states, further sharpening the tuning of sampled neural populations and enhancing the signal-to-noise ratio of articulatory regions.

Our study highlights the effectiveness of sEEG in capturing neural features across the distributed language network, enabling precise prediction of phonemic sequences prior to articulation. These findings validate neurophysiological insights into speech production and identify key regions crucial for minimally invasive speech-BCIs. The versatility of both the decoding model and sEEG data underscores their capacity to unveil shared information across subjects—deepening our understanding of impaired motor, sensory, and cognitive processes in neurodegenerative disorders. Additionally, our transfer learning architecture allows generalization across diverse speech production scenarios, making real-time decoding feasible and efficient. Future research should replicate these decoding frameworks with a broader language corpus, aiming to improve accessibility and patient outcomes, while targeting prosody and volume variations, enhancing the model’s applicability in real-world settings.

Methods

Participants

Twenty-five patients (12 male, 19–51 years, 4 left-handed, verbal IQ 95.6 ± 10.4, age of epilepsy onset 16 ± 9 years) with intractable epilepsy who had sEEG monitoring for seizure onset localization provided written informed consent to participate in this research study. The study was conducted under a protocol approved by the Committee for the Protection of Human Subjects (CPHS) at the University of Texas Health Science Center at Houston (protocol number HSC-MS-06-0835). Patients were excluded if they had right-hemisphere language dominance, large structural malformations or prior cortical resections.

Task design

All participants engaged in a task involving the production of sequences of tongue twisters (Fig. 1A). The stimuli were displayed on a 15.4-inch LCD screen with a resolution of 2880 × 1800 pixels, positioned at eye level approximately 2–3 feet from the participants. A total of 64 distinct sets of stimuli were presented, each consisting of four words following either a consonant-vowel-consonant (CVC) or CVCC structure. These four words were categorized along two axes: phonological and lexical bias, as detailed in Hickok et al. In the phonological bias condition, the word sequence imposed an increased phonological load by aligning the production-space similarity between the initial utterances of the first and third words. In the lexical bias condition, if an error occurred on the third word, specifically a targeted error (switching the initial utterances of the first and third words), the third word transformed into a real word instead of a pseudoword. During the task, participants were instructed to mentally simulate speech production upon the first appearance of the four words on the screen. For the following two trials, they were prompted to overtly articulate the words displayed, and for the last two trials, the words were removed from the screen, requiring participants to articulate them from memory. In total, 128 distinct stimuli containing 218 unique words, arranged as sets of four words making up a tongue twister, were used. The trials represented 36 of the 44 phonemes in the English language, with occurrences ranging from 5 for the TH phoneme to 280 for the B phoneme. Stimuli were presented using the Psychophysics Toolbox in MATLAB, in lowercase Arial font with a height of 150 pixels (2.2° visual angle). Each stimulus remained on the screen for 1500 ms, with a 2000 ms inter-stimulus interval. The stimuli were presented across two recording sessions, each encompassing the presentation of 64 stimuli in a pseudorandom order without repetition. In this analysis, we focused only on trials in which the participant articulated the tongue twister correctly, to avoid instability in the linear articulatory decoding models.

Behavioral analysis

The mean accuracy across all task conditions surpassed 80%, with trials excluded if any words were articulated incorrectly or if there were any disruptions in fluency, such as stuttering or delays (average percent incorrect = 5.5%; average dysfluency trials = 6.6%).

Data recording and preprocessing

Neural recordings from each participant were acquired from multiple sEEG probes (14–21) implanted for clinical purposes of seizure localization using a Robotic Surgical Assistant (ROSA; Medtech). Each probe contained 8–16 platinum-iridium electrodes of 0.5 mm or 2 mm in length with a center-to-center spacing of 0.5–4.43 mm (PMT Corporation). Electrodes were localized by co-registering a pre-operative anatomical MRI scan with a post-operative CT scan and displayed on a 3D cortical surface model generated in Freesurfer (Dale et al.55).

Neural activity was recorded at 2 kHz using the NeuroPort recording system (Blackrock Neurotech). The neural data were visually inspected to remove channels contaminated with line noise, epileptic activity, and artifacts. Data from each electrode was then re-referenced using common-average referencing.

The raw electrode data were first bandpass filtered into broadband gamma activity (BGA; 70–150 Hz), with line noise simultaneously removed using zero-phase second-order Butterworth band-stop filters. A frequency-domain bandpass Hilbert transform with paired sigmoid flanks and a half-width of 1.5 Hz was then applied. The resulting analytic amplitude was smoothed using a Savitzky–Golay finite impulse response filter (third order, 201 ms frame length). BGA is expressed as a percentage change from baseline, defined as the period 500–100 ms preceding each stimulus presentation.
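
A minimal sketch of this preprocessing chain is shown below, assuming common-average-referenced data sampled at 2 kHz. Standard SciPy filters are used as stand-ins for the frequency-domain Hilbert transform with sigmoid flanks described above, and all variable names are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, savgol_filter

FS = 2000  # sampling rate (Hz)

def extract_bga(raw, stim_onsets):
    """raw: (n_channels, n_samples) common-average-referenced sEEG.
    stim_onsets: stimulus onset times in samples."""
    x = raw.copy()

    # Zero-phase 2nd-order Butterworth band-stops around line-noise harmonics.
    for f0 in (60, 120, 180):
        sos_notch = butter(2, [f0 - 2, f0 + 2], btype="bandstop", fs=FS, output="sos")
        x = sosfiltfilt(sos_notch, x, axis=-1)

    # Zero-phase band-pass into the broadband gamma range (70-150 Hz).
    sos_bga = butter(2, [70, 150], btype="bandpass", fs=FS, output="sos")
    x = sosfiltfilt(sos_bga, x, axis=-1)

    # Analytic amplitude (time-domain Hilbert transform as a simplification of
    # the frequency-domain transform with sigmoid flanks used in the paper).
    amp = np.abs(hilbert(x, axis=-1))

    # Savitzky-Golay smoothing: 3rd order, ~201 ms frame (odd number of samples).
    frame = int(0.201 * FS) | 1
    amp = savgol_filter(amp, window_length=frame, polyorder=3, axis=-1)

    # Percent change from the 500-100 ms pre-stimulus baseline.
    baseline = np.concatenate(
        [amp[:, t - FS // 2: t - FS // 10] for t in stim_onsets], axis=-1
    ).mean(axis=-1, keepdims=True)
    return 100.0 * (amp - baseline) / baseline
```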

Continuous audio was also recorded synchronously through a lapel microphone at 30 kHz using the same recording system and then segmented into individual trials. Articulation onset and offset times of the individual words and phonemes spoken during each trial were extracted from the audio signal using the Montreal Forced Aligner (MFA) (Montreal, Quebec, Canada) (McAuliffe et al.56). The segmentations produced by MFA were then validated manually in Praat.

Statistical analysis

To ensure statistically robust and topologically precise estimates of BGA in electrocorticography (ECoG), we employed surface-based mixed-effects multilevel analysis (SB-MEMA) (Kadipasaoglu et al.26,27) to generate population-level representations. SB-MEMA analysis was performed on a window of 500 ms time-locked to the onset of articulation for the first word in the tongue twister stimuli. Significance levels were determined at a corrected alpha level of 0.01, implementing family-wise error rate corrections for multiple comparisons. The minimum family-wise error rate criterion was established through white-noise clustering analysis (Monte Carlo simulations, 5000 iterations) on data with matching dimension and smoothness as the analyzed dataset (Kadipasaoglu et al.26). Following this, a geodesic Gaussian smoothing filter (3 mm full-width at half-maximum) was applied. Additionally, ECoG results were confined to regions with a minimum of three patients contributing to coverage and BGA percentage change exceeding 5%.

Sequence-based modeling

For the construction of a sequence prediction model, neural data underwent processing at either the phrase or word level, depending on the nature of the analysis. A bidirectional recurrent neural network from the TensorFlow library, with a Keras backend, was employed. This network featured an encoder-decoder structure capable of predicting either a predetermined length of phonemes (CVC model) or a variable length of phonemes. The latter utilized a teacher-forcing style decoder trained on phonemes within the closed dictionary of tongue twister stimuli. Model optimization, involving the determination of the number of units and layers, was accomplished through hyperparameter tuning on a validation dataset. The optimal number of layers and units depended on the number of channels, regions, and the type of target involved in the analysis. The base model for variable-length, phrase-level phoneme decoding using all clean channels (disregarding region selectivity) had two LSTM layers of 64 units each: one acted as the neural encoder and the other as the decoder, whose encoded hidden states were passed to a final dense layer projecting to the number of unique phoneme classes within the target dictionary. PERs reported in the final analysis were derived from predictions on the test set for each subject. The pre-articulatory analysis involved windows of data preceding phrase onset, defined here as the onset of articulation of the first word.
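
As a concrete illustration, the sketch below builds a fixed-length (CVC × 4 words) encoder-decoder of this general form in Keras; the channel count, time-window length, and phoneme inventory size are placeholders rather than the exact published configuration.

```python
from tensorflow.keras import layers, Model

N_CHANNELS = 128      # sEEG channels after artifact rejection (example value)
N_TIMEPOINTS = 200    # neural frames per trial (example value)
N_PHONEMES = 12       # fixed sequence length: CVC x 4 words
N_CLASSES = 37        # phoneme inventory plus blank token (example value)

neural = layers.Input(shape=(N_TIMEPOINTS, N_CHANNELS), name="neural")
# Temporal convolution: nonlinear, subject-specific dimensionality reduction.
x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(neural)
# Encoder LSTM summarizes the trial into a single latent state.
state = layers.Bidirectional(layers.LSTM(64))(x)
# Decoder LSTM unrolls the latent state over the fixed phoneme sequence.
x = layers.RepeatVector(N_PHONEMES)(state)
x = layers.LSTM(64, return_sequences=True)(x)
# Linear readout to phoneme identity probabilities at each position.
probs = layers.TimeDistributed(layers.Dense(N_CLASSES, activation="softmax"))(x)

model = Model(neural, probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```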

In the evaluation of sequence prediction mechanisms, two models were constructed. The first model adhered to task-specific constraints, training and testing solely on sequences following a predefined order of CVCs repeated four times to form the four unique words constituting the entire phrase. This model, designed for fixed-length sequences, faced limitations in adapting to variable-length sequences, posing challenges for generalizability to natural speech events with varying utterance lengths. To address these limitations, a second encoder-decoder model was introduced. The decoder of this model demonstrated greater flexibility through three key methods. First, it operated in a teacher-forcing style, facilitating information transfer on a phoneme-by-phoneme basis. Second, the target features accommodated blank tokens, start tokens, and end-of-sentence tokens commonly used in natural language processing models, providing additional information about speech pauses and breaks in utterances. Lastly, a connectionist temporal classification (CTC) loss function was implemented, allowing for the marginalization of various forms of alignment between predicted and articulated phoneme sequences. This approach facilitated optimal handling of merging, concatenation, and deletion of extra predicted tokens.
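
The alignment-marginalizing objective can be expressed compactly with TensorFlow's built-in CTC loss; the snippet below is a simplified sketch of that loss over per-step phoneme logits only, not of the full teacher-forced decoder, and the function name and arguments are illustrative.

```python
import tensorflow as tf

def ctc_sequence_loss(labels, logits, label_lengths, logit_lengths, blank_index):
    """labels: (batch, max_label_len) int32 phoneme ids, padded.
    logits: (batch, max_time, n_classes) unnormalized decoder outputs.
    Marginalizes over all monotonic alignments between logits and labels."""
    per_trial_loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=blank_index,
    )
    return tf.reduce_mean(per_trial_loss)
```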

Accuracy metrics

In assessing subject performance, we refrained from assuming equal class distributions due to variations in the number of correctly articulated trials across subjects. Consequently, conventional accuracy metrics were unsuitable for our analysis. Instead, we opted for macro F1 scores to compute class-balanced accuracy, incorporating both precision and sensitivity (recall). For sequential predictions, we employed the PER, quantifying the number of substitutions, insertions, or deletions required to align with the ground truth sequence of phonemes.
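
For concreteness, the PER can be computed as a length-normalized edit distance; the sketch below shows this alongside a macro F1 call from scikit-learn, using illustrative inputs.

```python
from sklearn.metrics import f1_score

def phoneme_error_rate(predicted, reference):
    """(substitutions + insertions + deletions) / reference length."""
    n, m = len(reference), len(predicted)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == predicted[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[n][m] / max(n, 1)

# Example: one substitution and one deletion against a 6-phoneme reference.
ref = ["B", "AE", "D", "K", "AE", "T"]
hyp = ["B", "AE", "T", "K", "AE"]
print(phoneme_error_rate(hyp, ref))   # 2/6 ~ 0.33

# Class-balanced accuracy for single-phoneme classification analyses.
macro_f1 = f1_score(y_true=[0, 1, 2, 2], y_pred=[0, 1, 1, 2], average="macro")
```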

To establish a data-driven chance performance, we implemented a shuffled control scheme in which both trials and phoneme sequences were randomized while keeping sEEG signals intact. This preserves phoneme distributions but disrupts trial-specific neural-phoneme alignments, ensuring that model performance reflects meaningful neural decoding rather than frequency-based biases.
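
A minimal sketch of this control, assuming trial-aligned label lists: the phoneme-sequence labels are permuted across trials while the neural features are left untouched, and the decoder is then trained and evaluated on the shuffled pairing to yield the chance PER.

```python
import numpy as np

def shuffle_labels(phoneme_sequences, seed=0):
    """Permute phoneme-sequence labels across trials; sEEG features untouched."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(phoneme_sequences))
    return [phoneme_sequences[i] for i in perm]

# Training/evaluating the decoder on (X_trials, shuffle_labels(y_trials))
# preserves phoneme frequencies but breaks trial-specific neural alignment.
```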

Coverage correlation analysis

To evaluate the contribution of specific cortical coverage across the cohort, we parcellated the cortex into 50 regions using the Destrieux parcellation (FreeSurfer v4.5, 2009 Destrieux atlas57) and then contextualized these regions as canonical language and speech production hubs. For each anatomical region in this parcellation, we calculated a channel density metric, defined as the number of electrodes in the region divided by the total number of electrodes implanted in that subject. This parcellation enabled us to generate a barcode for each subject based on the channel density per parcel and to compute pairwise similarity scores across the cohort using cosine similarity.
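
The coverage "barcode" and similarity score can be sketched as follows, assuming a Destrieux parcel index has already been assigned to each electrode; function names are illustrative.

```python
import numpy as np

N_PARCELS = 50

def coverage_barcode(parcel_ids):
    """parcel_ids: Destrieux parcel index (0-49) for each implanted electrode."""
    counts = np.bincount(parcel_ids, minlength=N_PARCELS).astype(float)
    return counts / counts.sum()          # channel density per parcel

def coverage_similarity(barcode_a, barcode_b):
    """Cosine similarity between two subjects' coverage barcodes."""
    return float(np.dot(barcode_a, barcode_b) /
                 (np.linalg.norm(barcode_a) * np.linalg.norm(barcode_b)))
```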

Regional electrode occlusion (REO) analysis

To evaluate the interpretability of the architectures employed for phonemic decoding, a REO analysis was conducted in which important hubs of language production were removed from the grouped dataset at a regional level. These regions were seeded from nodes across the cortical surface as inferred by SB-MEMA. Electrodes were selected from individual regions to be excluded from the analysis, and a PER was calculated for each region of interest in each patient. Prior analyses generated all-channel decoding performances for each subject, so this error rate could be compared with that from the REO analysis to determine whether the absence of a region increased the PER (i.e., degraded decoding performance). To control for effects across individual patients, linear mixed effects models were used to evaluate whether removing a region produced a significant increase in PER, with an intercept fitted for each patient. This analysis was repeated for different time windows, with a separate linear mixed effects model constructed for each window to understand the sequencing of regional activity and its context in phonological processing.
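
A minimal sketch of the mixed-effects comparison is shown below, assuming a hypothetical results table with one row per subject, region, occlusion condition (coded 0 for all channels, 1 for region removed), time window, and PER; statsmodels provides the random-intercept model.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: columns subject, region, occluded (0/1), time_window, per.
df = pd.read_csv("reo_results.csv")

for (region, window), d in df.groupby(["region", "time_window"]):
    # Random intercept per patient; the fixed effect of `occluded` tests whether
    # removing this region's electrodes significantly increases PER.
    result = smf.mixedlm("per ~ occluded", data=d, groups=d["subject"]).fit()
    print(region, window, result.params["occluded"], result.pvalues["occluded"])
```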

Subject-independent decoding analysis

For transfer learning between subjects using our sequence-to-sequence models, we add a simple 1D convolutional layer in front of the LSTM and affine layers. We pre-train the model on a single subject, and the core LSTM encoder layer and affine layer are then frozen, meaning their weights are not adjusted during backpropagation when training on a new subject’s data and labels. However, we keep the convolutional layer trainable to allow the model to extract subject-relevant features from the variable electrode configurations arising from patient-specific anatomical electrode trajectories as the model is transferred from one participant to another. A model trained on a single subject is then transferred to each of the other 24 subjects who performed this task, with the convolutional layer trainable while the core LSTM layer and phoneme output layer are frozen. Training on a new subject is done for only 100 epochs, compared with 500 epochs of pre-training on the original subject.
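
A minimal Keras sketch of this transfer step is shown below, assuming the pretrained model is a simple chain of Conv1D, recurrent encoder/decoder, and readout (as in the earlier sketch); the pretrained layers downstream of the convolution are frozen and reused, while a new subject-specific Conv1D remains trainable. Names and sizes are illustrative.

```python
from tensorflow.keras import layers, Model

def build_transfer_model(pretrained, n_channels_new, n_timepoints):
    """Reuse a pretrained single-subject model on a new subject's montage."""
    neural = layers.Input(shape=(n_timepoints, n_channels_new))
    # New, trainable front end mapping this subject's electrodes into the
    # feature space expected by the frozen shared layers.
    x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(neural)
    # Freeze and reuse everything downstream of the pretrained Conv1D
    # (assumes the chain-structured model sketched earlier: Input, Conv1D, ...).
    for layer in pretrained.layers[2:]:
        layer.trainable = False
        x = layer(x)
    return Model(neural, x)

# Fine-tuning on the new subject (100 epochs vs. ~500 for pre-training):
# model = build_transfer_model(pretrained_model, n_channels_new=96, n_timepoints=200)
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(X_new, y_new, epochs=100)
```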

For multi-subject models, the only change we make to the architecture is a concatenation layer, which allows all training trials from multiple subjects to be collated together; the model then randomly samples trials across all subjects to create each training batch. This allows the model to evaluate the loss function collectively for the group and to build latent embeddings from each subject’s neural activity. After the across-subject neural embedding is built, the trials are split back into their respective subjects and the same linear readout decoder architecture is employed to predict the sequential phonemes corresponding to each individual subject’s training dataset. This provides a shared representation of the neural embeddings while mapping to subject-specific behavioral responses. To implement a zero-shot decoding evaluation, we applied a group K-fold cross-validation scheme with K = 5, leaving entire phrases from the tongue twister stimuli out. The held-out stimuli were removed from all subjects’ trials before the model was trained and served as the test dataset against which the group model was evaluated.
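
The sketch below illustrates this group architecture as a multi-input Keras model (the published pipeline instead concatenates trials across subjects into shared batches), together with a group K-fold phrase split for zero-shot evaluation; all layer sizes and names are placeholders.

```python
import numpy as np
from tensorflow.keras import layers, Model
from sklearn.model_selection import GroupKFold

def build_group_model(channel_counts, n_timepoints, n_phonemes, n_classes):
    # Shared layers: a common latent articulatory space and readout.
    shared_encoder = layers.Bidirectional(layers.LSTM(64))
    shared_decoder = layers.LSTM(64, return_sequences=True)
    shared_readout = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))

    inputs, outputs = [], []
    for n_ch in channel_counts:                        # one branch per subject
        inp = layers.Input(shape=(n_timepoints, n_ch))
        # Subject-specific temporal convolution (trainable per subject).
        x = layers.Conv1D(64, kernel_size=5, strides=2, activation="relu")(inp)
        x = shared_encoder(x)
        x = layers.RepeatVector(n_phonemes)(x)
        x = shared_decoder(x)
        outputs.append(shared_readout(x))
        inputs.append(inp)
    return Model(inputs, outputs)

def zero_shot_splits(phrase_ids, k=5):
    """Hold out entire phrases (pooled across subjects) in each fold."""
    gkf = GroupKFold(n_splits=k)
    for train_idx, test_idx in gkf.split(np.zeros(len(phrase_ids)), groups=phrase_ids):
        yield train_idx, test_idx      # train/test trials share no phrases
```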

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.