Background & Summary

Brain-computer interface (BCI) devices aim to restore communicative abilities to individuals who have lost motor function. BCI speech decoding devices must balance the need for a user-friendly interface with task performance1. Neural signals from electrocorticography (ECoG) provide a measure of task-relevant brain activity with minimal processing latency and a superior signal-to-noise ratio2,3. However, ECoG studies must be conducted with surgical patients, who differ in many respects from the end users of BCI devices. Alternatively, electroencephalography (EEG) allows BCI devices to be tested directly among the target population. But because neural signals become distorted as they pass through the skull and scalp to reach EEG sensors4, BCI applications are often limited to the detection of stimulus-evoked potentials5,6,7,8, resulting in non-naturalistic paradigms that remain comparatively slow and inflexible9,10. Although a practical solution to speech decoding will ultimately require the quick and accurate classification of a large inventory of context-dependent speech sounds in rapid succession11, this goal remains out of reach for state-of-the-art EEG decoding methods12.

To utilize EEG effectively for these purposes, researchers would benefit from datasets that incrementally increase the degree of “naturalness” of stimulus items, such that existing models may be tested with similar yet successively more complex datasets and adjusted to compensate for this increase in complexity13. Such data would comprise stimuli that incorporate systematic linguistic regularities and rules regarding phonological, phonotactic, or semantic content that could affect the decoding of naturalistic speech. This would entail stimuli in which the same sounds are presented in different linguistic environments, and word items that comprise real or pseudowords, which differentially engage semantic and motor neural networks. While much of the decoding literature is based on what we assume is neural motor activation, decoding from semantic activation remains a goal within the field14, and to transition to the type of predictive networks that have been proposed for real-time decoding of propositional or conversational content, decoding that incorporates semantic neural networks may be necessary.

Currently, methodological paradigms for speech decoding generally include covert speech and auditory comprehension tasks15,16,17,18,19,20,21,22,23,24,25,26,27, with innovations consisting largely of modifications of these paradigms. The resulting analyses are heterogeneous in nature and few authors make their data publicly available, complicating efforts to reproduce and compare different decoding methods: the publication of EEG datasets allows researchers to benchmark model performance28,29. The available datasets typically focus on a task of high difficulty (e.g., imagined or inner speech) yet remain minimal in scope, with a small number of trials, stimulus types, and subjects12. Moreover, there is little coherence between different stimulus types that would allow researchers to test their models against progressively more naturalistic data. BCI devices tend to rely upon the neural signals associated with muscle movements30, and although the utility of classification schemes based on speech articulation has been noted by numerous researchers31,32,33,34,35, this factor has yet to be included as an organizing principle in published datasets. Finally, the overfitting of highly complex machine learning models is frequently discussed in the literature36 but rarely tested overtly. Independently collected datasets of the same data types are needed to ensure that models can generalize across subjects and sessions. We present two datasets for EEG speech decoding that partially address these limitations:

  • Naturalistic speech does not consist of isolated speech sounds37. The phonetic environment surrounding phonemes affects their quality38,39, complicating accurate category designation40,41. Existing datasets lack a wide range of co-articulated phonemes. We provide single phonemes, phoneme pairs, and phoneme triplets for six consonants (/b/, /p/, /d/, /t/, /s/, /z/) and five vowels (/i/, /ɛ/, /ɑ/, /u/, /oʊ/) to assess classification accuracy in progressively more complex phonetic environments.

  • The decoding literature devotes considerable attention to the articulatory properties of phonemes31,32,33,34,35. The most successful decoding models integrate this knowledge22,42,43. Available datasets do not select phonemes that fall within overlapping articulatory categories (place, manner, voicing). We provide consonants that represent unique combinations of features (bilabial/alveolar; stop/fricative; voiced/unvoiced) to assess articulatory features as a classification parameter.

  • The integration of probabilistic language models into BCI devices has predisposed researchers to anticipate better results when training on real words44,45,46,47. EEG datasets tend to comprise word stimuli that are real12. Yet functional magnetic resonance imaging (fMRI) studies reveal a greater hemodynamic response to pseudowords48,49. Effortful processing may strengthen neural signals and facilitate decoding50. We provide real and pseudowords to test this hypothesis.

  • Speech decoding papers typically publish one or more analyses conducted on a single dataset12, raising concerns about overfitting and how well the model might perform on additional data without significant modification51. We provide two datasets (N=8 and N=16) collected at different time points for the same stimulus types and/or participants. The second serves as an external validation set to allow model performance to be assessed on independently derived data52.

  • Modification of the EEG signal may prove useful for the decoding of neural signals as an initial training aid or as an augmentation technique. We provide data from a study which found that transcranial magnetic stimulation (TMS) may improve decoding from EEG signals53. We provide data from a control condition and two TMS conditions by phoneme type and stimulation target54 for further experimentation and testing.

  • Robust speech decoding models must compensate for a high degree of noise from various sources. In addition to the limitations of EEG data (noise from within the signal), sounds are often heard inaccurately in the presence of environmental noise or when the participant is inattentive to stimuli55,56,57. Noise may also occur as a feature of accompanying techniques, such as TMS.

Methods

Our aim was to create a systematically structured, multifaceted dataset that would represent a novel contribution to the publicly available data for EEG speech decoding. This involved the selection of stimuli that allow a scaffolded progression in the type of classification analyses that can be conducted in terms of the choice of parameters and the complexity of the task (single phonemes, phoneme pairs, phoneme triplets/words). The dataset includes comprehension and production tasks collected during either neuromodulation or a control condition. There is evidence that TMS may improve subject performance in phoneme discrimination by administering two closely spaced TMS pulses prior to phoneme presentation while targeting motor cortex regions that control the muscles involved in the articulation of a specific phoneme category58. Likewise, errors may be induced when stimulation occurs in an unrelated region of the motor cortex, prompting the perception of a different phoneme category58. Most of these data are published here for the first time. A subset of the data (CV pairs) has previously been published in conjunction with a pilot study that investigated whether the perceptual effects would translate into a similar bias in decoding accuracy during the neural decoding of EEG signals recorded during the perception task53. The pilot study illustrated proof-of-concept; by collecting data independently at two separate time points (2019 and 2021) and now offering the full dataset, we invite researchers (i) to develop a model on the larger dataset (2021) and assess the robustness of the model in a second, smaller and noisier validation dataset (2019), and (ii) to test their models on progressively larger units of speech (2021).

Participants

Participants aged between 20 and 40 were recruited from the UCLA campus using flyers. Ten participants (6 female) were recruited for the first round of data collection in 2019 and twenty participants (10 female) were recruited for a second round of data collection in 2021. Inclusion criteria were defined as no diagnosis of any neurological, psychiatric, or developmental disorders, self-reported normal hearing, and no contraindications for TMS or MRI procedures (implanted medical devices, implanted metal, pregnancy, personal or family history of seizures, and exclusionary medications). During an initial screening session, participants completed an abbreviated version of the experimental task to ensure that participants understood the task directions and could perform the phoneme discrimination task. Left-hemisphere lateralization of the language processing regions in all participants was established during an fMRI scan in which participants performed a similar phoneme discrimination task, slightly modified for the MRI scanner. Two individuals (1 female) were excluded from the 2019 dataset due to necessary modifications that were made to the stimulus audio files after their participation. In the 2021 dataset, one participant (male) was excluded when he exhibited a biased response strategy upon stimulation (i.e., failure to select from the full set of phonemes), and three participants (male) were excluded due to complications with the TMS equipment that may have led to imprecise targeting. Participants provided informed consent and were paid for two sessions. The experimental protocol was approved by the UCLA Institutional Review Board (IRB#21-000333). The same participant recruitment and data collection procedures described below are presented in an abbreviated form in our pilot study publication53.

Experimental design

The study was conducted in three sessions (Fig. 1). In the first session, participants underwent a directed interview to ensure that they met the study inclusion criteria and possessed no contraindications. Participants who were able to perform an abbreviated phoneme perception task with at least 75% accuracy were enrolled. In the second session, participants underwent an MRI scan to aid in neuronavigation for the TMS procedure. They performed a modified phoneme discrimination task in the MRI scanner to lateralize their primary language processing areas for consonant and word stimuli. The final session involved the recording of EEG signals while participants performed the phoneme perception task. TMS was targeted to areas of the motor cortex associated with the production of specific phonemes. In the second round of data collection that was conducted in 2021, participants also listened to and repeated single phonemes and performed the perception task for phoneme triplets.

Fig. 1

Session organization. (A) The experiment was conducted in three sessions that were held on separate days. Participant eligibility was confirmed in the first session. MRI scanning and TMS-EEG data collection were conducted independently. (B) Additional participants and trial types were included in 2021.

Data collection

MRI scanning

Scanning was conducted in the UCLA Center for Cognitive Neuroscience with a Siemens Prisma-FIT 3T Scanner. Participants were provided with ear protectors and headphones for a 45 to 60 dB reduction of the noise associated with scanning to ensure that participants could hear the stimuli clearly and that the noise level was not uncomfortably loud. Participants were asked to lie with their head motionless during all scanning procedures. High-resolution anatomical images were acquired, followed by a functional scan in which participants were directed to either relax passively while looking at a fixation cross or to perform the button-press phoneme discrimination task. Stimuli were grouped by consonant or word type in a block design to increase the statistical power of the fMRI analysis. Functional data were acquired in the block design with a BOLD-weighted echoplanar imaging sequence aligned in parallel to the bicommissural plane, thus yielding 36 slices covering the whole brain, each 3 mm thick with a 1 mm gap between slices. Each slice was acquired as a 64 × 64 matrix yielding an in-plane resolution of 1.5 × 1.5 mm. The total duration of the scanning session was 40 minutes.

TMS-EEG

The TMS-EEG procedure was conducted in the Neuromodulation Division of the Semel Institute for Neuroscience and Human Behavior at UCLA. The TMS equipment utilized for the procedure included a Magstim Super Rapid Plus1 stimulator and a figure-of-eight 40 mm coil. The EEG system included an eego™ sports WaveGuard 64-channel EEG cap and an eego mylab system compatible with electromagnetic stimulation. Targeting was completed using the Visor 2 neuronavigation system. The electrode positions were digitized and registered to individual participant MRIs using the ANT Neuro Xensor. EEG signals were band-pass filtered from 0.1 to 350 Hz, sampled at 2000 Hz, and referenced to the CPz electrode. All electrode impedances were kept below 5 kΩ. PsychoPy59 stimulus presentation software initiated the audio routines and recorded reaction time data.

The appropriate stimulation intensity for TMS studies is determined on an individual basis60. Prior to the experimental session, the resting motor threshold (rMT) of each participant was determined by eliciting motor-evoked potentials (MEPs) in the first dorsal interosseous (FDI) muscle of the dominant hand. Single TMS pulses were delivered to locations in the motor cortex contralateral to the dominant hand. The intensity of the stimulation was gradually lowered until reaching the minimum level of stimulator output at which 5 out of 10 MEPs in the hand muscle had an amplitude of at least 50 microvolts. Potentials evoked during TMS represent the net sum of excitatory and inhibitory stimulation effects61,62,63. The literature has found that excitation increases at intensities of 110–120% rMT. In accordance with our reference study, stimulation was administered at 110% of the FDI rMT58. A physician observed the motor thresholding procedure to ensure that no negative effects were incurred by participants.
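The 5-out-of-10 thresholding rule can be expressed as a short routine; this is an illustrative sketch, and the function and variable names are assumptions, not taken from the study's software.

```python
def meets_rmt_criterion(mep_amplitudes_uv, min_amplitude_uv=50.0, min_count=5):
    """True if at least `min_count` MEPs reach `min_amplitude_uv` microvolts
    (the 5-out-of-10 rule described in the text)."""
    return sum(a >= min_amplitude_uv for a in mep_amplitudes_uv) >= min_count


def find_rmt(intensities_descending, meps_by_intensity):
    """Walk stimulator intensities from high to low and return the lowest
    intensity (% of maximum output) that still satisfies the criterion."""
    rmt = None
    for intensity in intensities_descending:
        if meets_rmt_criterion(meps_by_intensity[intensity]):
            rmt = intensity  # criterion still met; keep lowering intensity
        else:
            break  # criterion failed; the previous intensity is the rMT
    return rmt
```

Experimental stimulation would then be delivered at 110% of the value returned, per the reference study.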

TMS targeted areas of the motor cortex involved in (i) lip and tongue movements (which produce bilabial and alveolar consonants, respectively) or (ii) processing of real and pseudowords. Stimulation targets were defined as the MNI coordinates of peak motor cortex activation in LipM1 and TongueM1 during lip and tongue articulatory movements (lips: −56, −8, 46; tongue: −60, −10, 25), taken from the literature64 and the reference study58. However, cortical functional localization is known to show individual variation65. Therefore, the coordinates were overlaid on the activation map of the task results for each participant to ensure an overlap between the targets and individual task localization. The target was taken as the nearest peak to the MNI coordinate. Broca’s area (BA 44: −51, 7, 23) was the target for real words and a region implicated in verbal memory (BA 6: −46, 1, 41) was the target for pseudowords66.

Behavioral task

The phoneme discrimination task consisted of listening to speech sounds and identifying stimuli with a button-press response. Auditory stimuli (Fig. 2) were presented via laptop speakers: (i) single phonemes, (ii) paired consonant-vowel phonemes (CV, VC), and (iii) real or pseudowords constructed of phoneme triplets (CVC). Consonant stimuli included four phonemes in the pair and triplet conditions (/b/, /p/, /d/, /t/), with two additional phonemes (/s/, /z/) in the single condition. Vowel stimuli included five phonemes in all conditions (/i/, /ɛ/, /ɑ/, /u/, /oʊ/). These sets yielded 11 individual phonemes (6 consonants and 5 vowels), 40 phoneme pairs (20 CV/20 VC), and 40 phoneme triplets (20 real/20 pseudowords). Participants were asked to listen to and repeat the phoneme in the single condition, to identify the consonant phoneme in paired conditions, and to identify phoneme triplets as real or pseudowords. Multiple classification analyses may be conducted on each stimulus type (Fig. 3).
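The single-phoneme and phoneme-pair inventories can be enumerated directly to verify the counts above; a sketch, in which the string phoneme labels stand in for the IPA symbols (the triplet set is not enumerable this way, since it was constrained by the real/pseudoword stipulation):

```python
from itertools import product

consonants_single = ["b", "p", "d", "t", "s", "z"]  # six consonants (single condition)
consonants_paired = ["b", "p", "d", "t"]            # four stops (pair/triplet conditions)
vowels = ["i", "ɛ", "ɑ", "u", "oʊ"]                 # five vowels (all conditions)

# 11 single phonemes: 6 consonants + 5 vowels
singles = consonants_single + vowels

# 40 phoneme pairs: 4 consonants x 5 vowels in CV order, plus the same in VC order
cv_pairs = [c + v for c, v in product(consonants_paired, vowels)]
vc_pairs = [v + c for c, v in product(consonants_paired, vowels)]
```

The counts match the inventory in the text: 11 singles and 20 CV plus 20 VC pairs.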

Fig. 2

Stimulus types. (A) Consonant and vowel phonemes possess unique articulatory features. Consonants can be described by three parameters: place of articulation (bilabial, alveolar), manner of articulation (stop, fricative), and voicing (voiced, unvoiced). Vowels can be described by four parameters: tongue height (from close to open), tongue position (front, back), tongue tension (tense, lax), and lip position (rounded, unrounded). (B) Phoneme pairs included eight instances of each combination of stop consonants and vowels. (C) Phoneme triplets were limited by the stipulation to create real and pseudowords. Stimuli included eight instances of each vowel in the real and pseudoword conditions.

Fig. 3

Analysis. (A) Single phonemes can be classified by modality, category, and articulation. (B) Phoneme pairs – by target, category, and articulation. (C) Phoneme triplets – by target, category, articulation, and word type.

TMS elicits a period of excitatory activation with an onset latency of 50-80 ms after stimulation67. We reproduced the design of our reference study58 to ensure an excitatory neural response that would translate into task facilitation. Each trial delivered paired TMS pulses at one of the stimulation targets, separated by a short interpulse interval (50 ms). Excitation of the cortical region not involved in stimulus production (e.g., TMS at LipM1 during alveolar phoneme presentation) results in neural noise that interferes with the perception task. The audio stimulus followed 50 ms after the second TMS pulse. One target was stimulated per run (counterbalanced across participants). Details of the experimental protocol are illustrated in Fig. 4.
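The within-trial timing described above (two pulses 50 ms apart, audio onset 50 ms after the second pulse) can be written out explicitly; taking the first pulse as the time origin is an illustrative choice, not a statement about the recorded timestamps:

```python
def trial_event_times_ms(first_pulse_ms=0.0, interpulse_ms=50.0, pulse_to_audio_ms=50.0):
    """Within-trial event schedule: two TMS pulses separated by the
    interpulse interval, then the audio stimulus after a fixed delay."""
    second_pulse = first_pulse_ms + interpulse_ms
    audio_onset = second_pulse + pulse_to_audio_ms
    return {
        "tms_pulse_1": first_pulse_ms,
        "tms_pulse_2": second_pulse,
        "audio_onset": audio_onset,
    }
```

With the defaults above, the audio stimulus begins 100 ms after the first pulse.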

Fig. 4

Experimental protocol. (A) Software controlled the experimental task and sent triggers to initiate TMS and to create timestamps for each pulse. (B) The TMS coil was positioned at the necessary stimulation site. Despite some overlap in the induced magnetic field, only the targeted region received maximum stimulation intensity. (C) The audio stimuli were immersed in white noise. Two TMS pulses were administered 50 ms prior to stimulus onset. (D) The run design in 2019 was reproduced in 2021 with the addition of more frequent breaks between blocks. A block of single phonemes was introduced in 2021.

Participants listened to audio clips immersed in 500 ms of white noise. The white noise created a mild background distraction for participants to ensure that they did not perform the phoneme discrimination task at ceiling. Participants were instructed to respond as fast as possible with a button press after they had identified the phoneme. In the case of multiple button presses, correct trials were determined from the initial button press. Participants who exhibited a non-random response strategy (i.e., failure to select from the full set of phonemes) were excluded. In 2021, participants were instructed to listen to single phonemes without TMS and to repeat the sound they heard immediately after stimulus presentation (300 ms from trial onset).

Two lists of stimulus items were used, with one list assigned to each block. In 2019, the runs were split into two blocks. The first block presented CV pairs, followed by a block of VC pairs. In 2021, four blocks were administered per run. The first two blocks presented CV pairs, followed by two blocks of CVC stimuli (real and pseudowords). A five-minute break was provided between runs. Participants completed 120 trials in each run: 80 with TMS and 40 random catch trials. In 2021, each run of the task was preceded by the presentation of 220 single-phoneme trials (20 trials per phoneme). Stimuli in all conditions were presented in a pseudo-randomized order. The total run time of the experiment was 49 minutes in 2019 and 58 minutes in 2021. As noted in the preceding sections, minimal modifications to the procedure were required for the intake and scanning sessions. For the initial assessment, half of the task was administered. During fMRI scanning, the full-length task was administered with stimuli presented in blocks of the same type (bilabial, alveolar, real words, pseudowords).
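A run's trial list (80 TMS trials plus 40 catch trials in pseudo-random order) might be assembled as follows; the stimulus sampling and seeded shuffle are assumptions for illustration, not the study's presentation code:

```python
import random

def build_run_trials(stim_items, n_tms=80, n_catch=40, seed=0):
    """Assemble one run: n_tms TMS trials plus n_catch catch trials,
    shuffled into a reproducible pseudo-random order."""
    rng = random.Random(seed)
    trials = [{"stim": rng.choice(stim_items), "tms": True} for _ in range(n_tms)]
    trials += [{"stim": rng.choice(stim_items), "tms": False} for _ in range(n_catch)]
    rng.shuffle(trials)
    return trials
```

Each run therefore contains 120 trials, of which exactly 80 carry TMS pulses.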

Data characterization

Classification of acquired data

The task required sustained attention during a lengthy TMS procedure. The mean reaction time and standard deviation were calculated to confirm that participants were attentive to the task throughout the procedure. These metrics are documented in .csv files uploaded to the data repository68. In the 2019 dataset, some variation in trial numbers is observed due to missed trials and rotation in the list of stimuli administered to each participant. No subject performed fewer than 90% of the total list, with the exception of P04 in the VC condition with LipTMS, whose excluded trials resulted from missed trials. In the 2021 dataset, all trials were uploaded irrespective of a button-press response. Two subjects performed an abbreviated list of phoneme triplets, and one also performed an abbreviated list of single phonemes. The number of tagged trials is shown in Tables 1 and 2.
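The attentiveness metrics above amount to a mean and standard deviation over each participant's reaction times; a minimal sketch using the standard library (the sample standard deviation is an implementation choice):

```python
from statistics import mean, stdev

def rt_summary(reaction_times_s):
    """Mean and standard deviation of one participant's reaction times,
    as used to confirm sustained attention across the session."""
    return {"mean": mean(reaction_times_s), "sd": stdev(reaction_times_s)}
```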

Table 1 Trial Numbers: 2019 Dataset.
Table 2 Trial Numbers: 2021 Dataset.

Data processing

The continuous raw data from the EEG recordings have been uploaded to two data repositories68,69 in BIDS format and as .cnt files so that researchers may apply their preferred pre-processing and processing pipeline, as necessary for alternative speech decoding models. However, in the interest of transparency, we have also made available three stages of data cleaning and signal processing that were utilized for our data validation section: (i) data normalized within each data window to zero mean and unit variance (for DDA); (ii) data resampled at 256 Hz, filtered (notch filter in the 59-61 Hz band and band-pass filter from 0.1 to 100 Hz), separated by trial, and with bad channels and TMS artifacts removed (for ERPs); and (iii) data that additionally underwent 1-2 rounds of cleaning of unwanted ICA components (for ERPs).
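Stage (i), the per-window normalization used for DDA, can be sketched in a few lines; this is an illustrative re-implementation, not the study's own code, and the use of the population standard deviation is an assumption:

```python
from statistics import mean, pstdev

def zscore_window(window):
    """Normalize one data window to zero mean and unit variance,
    matching stage (i) of the processing described above."""
    m = mean(window)
    s = pstdev(window)
    return [(x - m) / s for x in window]
```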

A MATLAB routine based on the EEGLAB library70 was designed for data analysis and pre-processing. The code68,71 has been made available in several separate sections, each responsible for a part of the data processing. This makes it possible to organize, streamline, and automate the analysis, eliminating extensive use of the EEGLAB interface where it is not strictly necessary. The routine is structured as shown in Fig. 5 and consists of seven main sections: removal of unwanted channels, event setup based on the information tables, resampling and filtering, separation of trials based on events, visual inspection for cleaning up bad trials, ICA decomposition, and manual removal of unwanted ICA components. In addition, the code offers optional sections for interpolation of TMS signals, generation of signal state images, and other tools that provide information on changes made during processing.

Fig. 5

Code structure for processing the data. The seven fundamental sections are shown in gray, the optional sections in brown, and the extra functions for organizing the data in orange. The icons below each section indicate operations applied after the data processing step is complete.

After each data processing stage, the data were stored in specific folders consisting of .set files containing the pre-processing steps, a summary of the manual modifications made, and .mat files that document the event-related potentials and power spectral density obtained for the analysis conducted. As a result, the outputs are organized and clearly annotated to ensure reproducibility and access to all stages of the pipeline. For DDA, all data were utilized without filtering or downsampling.

Load data, remove bad channels, and set events

The pipeline was built to run one participant at a time. Initially, all the variables used to store the location of the necessary files were defined, then the folders for organizing the outputs were created, and the .cnt file was imported and saved as a .set file. Only the EEG recordings are of interest for this analysis, so the EOG and BIPs channels were removed. In addition, M1 and M2 were discarded. In total, 62 channels were processed, with CPz used as the reference and AFz as the ground electrode. Next, the database for each subject was updated with seven events for each phoneme pair task-related trial, namely: baseline, trial onset, first TMS pulse, second TMS pulse, first auditory stimulus, second auditory stimulus, and trial end.

Resampling and filtering

The data were resampled to 256 Hz and two filters were applied: a notch filter with cutoff frequencies at 59 Hz and 61 Hz and a band-pass filter with cutoff frequencies at 0.1 Hz and 100 Hz. The output generated by this last step is used for the subsequent analyses, differentiated only by the type of trial studied: (i) control, (ii) TMS applied to the lip target region, or (iii) TMS applied to the tongue target region.
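One practical consequence of resampling is that event timestamps expressed as sample indices at the 2000 Hz acquisition rate must be rescaled to the 256 Hz time base; a minimal sketch (the rounding convention is an assumption):

```python
def rescale_sample_index(idx, fs_orig=2000, fs_new=256):
    """Map an event's sample index from the acquisition sampling rate
    to the resampled rate, rounding to the nearest sample."""
    return round(idx * fs_new / fs_orig)
```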

Trial separation

This section selects the analysis type and defines the first sound stimulus as the base event for ERP construction. Epochs are aligned to this event, which avoids undesired effects from the high-amplitude TMS spikes.
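A minimal sketch of this epoch extraction: the window is locked to the first sound stimulus so that the preceding TMS pulses fall outside it. The window bounds and function names are illustrative assumptions, not the pipeline's actual parameters.

```python
def epoch_around_event(signal, event_idx, fs=256, tmin_s=0.0, tmax_s=1.0):
    """Extract one epoch from a single-channel signal (list of samples),
    time-locked to the event at `event_idx`. Starting at the event itself
    (tmin 0 s) keeps the earlier TMS spikes out of the epoch."""
    start = event_idx + int(tmin_s * fs)
    stop = event_idx + int(tmax_s * fs)
    return signal[start:stop]
```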

Trial inspection and data cleaning

Two rounds of visual inspection were conducted on the data in search of trials with contaminated signals and channels with high-amplitude artifacts. To do so, the power spectral density (PSD) plots generated from the data were analyzed, in addition to the EEG recordings themselves. Using the EEGLAB interface, the unwanted segments were selected and removed from the analysis. After each round, ICA decomposition was applied using the library-adapted infomax ICA algorithm72. During ICA, 35 components were inspected and those that clearly exhibited an artifact signal were removed.

Data Records

The entire dataset69 can be found at OpenNeuro (https://doi.org/10.18112/openneuro.ds006104.v1.0.0) in BIDS format. In addition, the complete data records can be found in the Open Science Framework repository68. Studies are labeled chronologically. Each primary folder contains subfolders for the raw data in .cnt format, processed data, and trial characteristics. The raw and processed data are grouped individually, with one subject per folder, and labeled as per Tables 1 and 2.

The routines for the analyses in the technical validation section and the results for both signal processing techniques in the data processing section are located in the Study/EEG_Data_Processing/Code folder. ERPs were obtained using ICA alone, and signal cleaning was performed using the pipeline described in Fig. 5, based on the EEGLAB library (versions 2022.0 and 2022.1) running in MATLAB. The EEG data records can be found in the OpenNeuro repository69 in BIDS format. The dataset follows the BIDS convention with the following structure: /sub-[subject]/ses-[session]/eeg/. Subject labels are P01-P08 for Study 1 and S01-S16 for Study 2 to avoid confusion about the origin of the data files. Session is 01 for Study 1 and 02 for Study 2.
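The BIDS directory layout above can be assembled programmatically; a small sketch with illustrative names, where `root` is wherever the dataset was downloaded:

```python
from pathlib import Path

def eeg_dir(root, subject, session):
    """Build the BIDS EEG directory /sub-[subject]/ses-[session]/eeg/
    for one subject and session of the dataset."""
    return Path(root) / f"sub-{subject}" / f"ses-{session}" / "eeg"
```

For example, Study 1 subject P01 (session 01) resolves to `sub-P01/ses-01/eeg` under the dataset root.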

Raw and pre-processed EEG data

Raw EEG files were stored in the .cnt format. This format contains continuous EEG recordings saved over the EEG-TMS sessions. A total of 66 channels were recorded, with electrode placement according to Fig. 4. Pre-processed EEG data have also been made available in .set and .mat files, according to the steps described in Fig. 5.

Event timestamps and behavioral data

For each trial, event timestamps are provided in .csv format, with one file for each recording session (Fig. 6). The events include (i) the second (final) TMS pulse of the pair, (ii) the sound stimulus onset, and (iii) the subsequent phoneme onsets. In addition to timestamps, the files provide labels for presented (true) and identified sound stimuli (phoneme or real/nonce word).
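As a sketch of how these files might be consumed, the following computes the proportion of trials in which the identified stimulus matches the presented one; the column names are assumptions for illustration, so consult the repository files for the actual headers:

```python
import csv

def trial_accuracy(csv_path, true_col="presented", resp_col="identified"):
    """Proportion of trials where the identified label matches the
    presented (true) label in one session's event .csv file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    correct = sum(r[true_col] == r[resp_col] for r in rows)
    return correct / len(rows)
```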

Fig. 6

Files characterizing the EEG data. (A) The 2019 dataset provides labels for CV and VC phoneme pairs. The TMS condition in which control trials were collected is noted, as well as the articulatory features and category of each consonant phoneme per trial. A timestamp is provided for the onset of each phoneme and the final TMS pulse prior to stimulus presentation. Stimulation was conducted at only one cortical site per run. The control items are labeled according to the run in which they were collected. (B) The 2021 dataset provides the same information, when relevant, for single phonemes. (C) CV phoneme pairs are described in full, similar to the 2019 dataset. (D) Each component phoneme in the word trials is indicated and marked with a time stamp. Stimuli are categorized by word type.

Technical Validation

Two sets of analyses were performed to support the technical quality of the datasets. Firstly, we extracted the grand mean event-related potentials (ERPs) by means of independent component analysis (ICA)72 to illustrate evidence of a stimulus-locked response across participants in each condition. We selected this method primarily due to its widespread use in the investigation of human cognitive information processing and therefore its familiarity among the electrophysiology research community. However, ICA can be subjective in its implementation by individual researchers, and the method may not be ideal for the analysis of specific types of data73,74. In particular, substantial attention has been paid to the need to remove the TMS artifact from TMS-EEG data75,76,77. Therefore, we performed a second analysis with delay differential analysis (DDA)78,79,80, a non-linear signal processing technique that requires minimal pre-processing and is noise insensitive81,82. The two analyses provide complementary evidence for the presence of a condition-dependent response in the EEG data. In particular, the DDA analysis illustrates excitatory activity during the time window of interest for our cognitive task, which differs by TMS condition.

Event-related potentials

The processing steps described in the Data Processing section were applied to the raw data to define the format of the auditory event-related potential (ERP) in the control condition and each TMS condition. The pipeline shown in Fig. 5 was executed for all participants to ensure homogeneity in the analysis. The mean and standard deviation of the potentials in the 1-second window after the first sound stimulus are represented for each of the 61 channels, separated by participant, in Fig. 9, while Fig. 8 represents the mean ERP from channel CP5. The reference pictures provided in Fig. 7 are meant to provide general guidance in interpreting the waveforms; please refer to the cited papers for their original findings. We observe that the ERPs approximate the expected auditory-evoked potential (AEP) induced by phoneme pairs composed of stop consonants and vowels (see Fig. 7A,B). Deviations from the anticipated AEP may occur due to noise (our stimuli were immersed in white noise) and the exact combination of consonants and vowels in each stimulus item83,84. The shape of the TMS-evoked potential (TEP) will depend on the number of pulses delivered, the interpulse interval, and whether stimulation is subthreshold or suprathreshold. A wide variety of TMS paradigms have been tested with conflicting results, such that it may be better to observe the TEP in order to identify whether the paradigm was excitatory or inhibitory, or to consider the effect by means of an additional measure, such as a behavioral task (see Fig. 7C)85. The TMS paradigm used for collection of the dataset produced a facilitatory effect on performance in a phoneme perception task54,58.
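The grand-mean averaging behind these figures reduces to a sample-by-sample mean over time-locked epochs; a minimal sketch in Python, not the study's MATLAB pipeline, assuming equal-length epochs:

```python
def grand_mean_erp(epochs):
    """Average time-locked epochs (one list of samples per trial or
    participant) sample by sample to obtain a grand-mean ERP."""
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]
```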

Fig. 7

Expected waveforms. All plots were modified and reproduced with permission from the publishers. (A) The auditory-evoked potential (AEP) for stop (/d/, /t/) consonant-vowel pairs exhibits a small N100 potential followed by a larger P200 potential; timing varies with the stimulus type and in the presence of noise83. (B) Vowels and consonants each produce a unique waveform84, such that the overall shape depends on the contribution of each. (C) The shape of the TMS-evoked potential (TEP) differs according to the cortical region targeted, and whether TMS creates an excitatory or inhibitory response can be observed in the shape of the resulting TEP; in the motor cortex, greater activity between 25 and 125 ms accompanies excitatory paradigms85. In each panel, the line illustrates the characteristic waveform for that type of neural response and the darker envelope its standard deviation.

Fig. 8

Event-related potentials (ERPs) from trial onset for the channel CP5. The grand mean average ERPs to (i) control stimuli, (ii) stimuli with LipM1 stimulation, and (iii) stimuli with TongueM1 stimulation are displayed for (A) the 2019 and (B) 2021 datasets. The TMS conditions show rectified EEG activity to allow for comparison with the reference study.

Fig. 9 provides an overview of the analyzed data, in which each recorded channel is represented by its mean and standard deviation. Note that the dispersion of the control trial group is smaller than that of the other conditions, as expected, and that some channels in certain subjects show high signal variation, especially those at frontal and medial electrode sites, where stimulation occurred. This is unsurprising, given that TMS affects a unique subset of cortical neurons in each individual based on the position and orientation of neurons relative to the stimulation coil86,87,88,89,90,91.

Fig. 9

Characterization of the data by the mean and standard deviation of each channel per participant in response to (i) control stimuli, (ii) stimuli with LipM1 stimulation, and (iii) stimuli with TongueM1 stimulation, displayed for (A) the 2019 and (B) the 2021 datasets. The channels are arranged for each participant from most anterior on the left to most posterior on the right.

Delay differential analysis

Delay differential analysis (DDA) is a signal processing technique that combines differential embeddings with linear and nonlinear nonuniform functional delay embeddings. The integration of nonlinear dynamics allows information to be detected in the data that may not be observable with traditional linear methods. DDA requires minimal pre-processing, which eliminates a highly subjective step in the data analysis. Sparse DDA models have several advantages over the high-dimensional feature spaces of other signal processing techniques: (i) the risk of overfitting is greatly reduced; (ii) the sparse model concentrates on the overall dynamics of the system and cannot additionally model noise; (iii) DDA is computationally fast; and (iv) no pre-processing is needed beyond normalizing each data window to zero mean and unit variance, which discards amplitude information and concentrates on the dynamics of the system. The DDA model consists of two sets of parameters: (i) the delays and the model form are fixed parameters that are kept constant throughout the analysis; (ii) the coefficients (a1, a2, a3) and the fitting error of the model are free parameters. The coefficients are used as features to distinguish different dynamics in the data.

The DDA model used in this analysis is

$$\dot{x}={a}_{1}{x}_{1}+{a}_{2}{x}_{2}+{a}_{3}{x}_{1}^{2},$$
(1)

where xi = x(t − τi). In this analysis, the fixed parameters are the same as in Ref. 92. We found that one of the free parameters, a3, can be used to describe neural activity in a manner similar to ERPs; note, however, that an ERP and a3 are not strictly the same phenomenon (for details, see Ref. 92). Note also that in most cases there is no direct relation between frequencies and any of the model parameters, as explained in Ref. 78. In the current analysis, the delays are τ1 = 6 δt and τ2 = 16 δt, with \(\delta t=\frac{1}{{f}_{s}}\), where the sampling rate is fs = 2000 Hz. These delays are double those in Ref. 92 because the sampling rate is double. The window length is 30 ms and the window shift is 1 ms. In Fig. 10, we observe waveforms that display the same dynamics as the reference studies in Fig. 7. We observe neural activity 200 ms and 400 ms after stimulus onset, the same time window where activity is observed in Fig. 7A. We also observe a sharp spike in activity 25-125 ms after the final TMS pulse, which corresponds to the results illustrated in Fig. 7C. This finding suggests excitatory activity.
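To make the procedure concrete, the following is a minimal sketch of fitting Eq. (1) in sliding windows by least squares, using a synthetic test signal rather than the dataset itself. The derivative estimate (centered differences) and the toy signal are illustrative assumptions; the delays, window length, and window shift follow the values stated above.

```python
import numpy as np

fs = 2000                       # sampling rate (Hz); delta_t = 1/fs
tau1, tau2 = 6, 16              # delays in samples (6*dt and 16*dt)
win = int(0.030 * fs)           # 30 ms window length
shift = int(0.001 * fs)         # 1 ms window shift

# Toy signal standing in for one EEG channel.
rng = np.random.default_rng(1)
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

features = []                   # (a1, a2, a3, fitting error) per window
for start in range(0, x.size - win, shift):
    w = x[start:start + win]
    w = (w - w.mean()) / w.std()            # zero mean, unit variance per window
    idx = np.arange(tau2, win - 1)          # indices where delays and dx/dt exist
    x1, x2 = w[idx - tau1], w[idx - tau2]   # delayed copies x(t - tau_i)
    xdot = (w[idx + 1] - w[idx - 1]) * fs / 2   # centered-difference derivative
    A = np.column_stack([x1, x2, x1 ** 2])      # design matrix for Eq. (1)
    a, *_ = np.linalg.lstsq(A, xdot, rcond=None)
    err = np.sqrt(np.mean((A @ a - xdot) ** 2))
    features.append([*a, err])

features = np.asarray(features)
print(features.shape)           # one row of (a1, a2, a3, error) per window
```

The a3 column of `features`, averaged over trials, would correspond to the waveforms plotted in Fig. 10.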

Fig. 10

DDA coefficient a3 from Eq. (1), from trial onset. (A) The grand mean average of the DDA coefficient in response to (i) control stimuli, (ii) stimuli with LipM1 stimulation, and (iii) stimuli with TongueM1 stimulation. In each plot, the lighter line represents the 2019 dataset and the darker line the 2021 dataset. (B) The heatmaps for all individual participants are shown for the 2021 dataset.

Limitations and final remarks

This dataset is intended to provide researchers with a means to systematically test the classification accuracy of speech decoding models against naturalistic speech stimuli of increasing complexity, within and across datasets that manipulate the cortical state of participants. To our knowledge, this is the first EEG dataset for neural speech decoding that (i) augments neural activity by means of neuromodulation and (ii) provides stimulus categories constructed in accordance with principles of phoneme articulation and coarticulation. Nonetheless, several limitations of the dataset can be noted.

First, the experimental task involves aspects of comprehension, production, and motor activity (in the form of a button-press response), which may overlap in the neural signal. In particular, in single phoneme trials, speech may have been produced while potentials relevant to comprehension of the speech sound were still ongoing. However, it is well known that the neural networks underlying motor and language functions are not strictly dissociable: they are frequently coactivated, even in covert speech or comprehension paradigms48,49. Therefore, we believe this phenomenon underlies most if not all speech decoding paradigms, to a greater or lesser degree, and likely represents neural processing in a naturalistic context.

Second, inner speech is widely adopted in the speech decoding literature, where it is often considered the most intuitive way of controlling a BCI device. However, inner speech paradigms may not accurately mark the onset of individual stimuli or component phonemes in the recorded data. We believe that, before transitioning to an inner speech paradigm, researchers would benefit from developing models that can target specific features of the speech stream, such as articulatory features and coarticulation. This kind of systematic study of the speech input may lead to more robust models overall and a better understanding of how this process occurs, rather than "black box" models that must be trained on huge amounts of data or that rely more heavily on predictive language models than on actual decoding accuracy. Both of these trends in the literature address issues that are orthogonal to the improvement of the decoding model itself.

Questions may also arise as to how neuromodulation might be integrated into a BCI device. We believe that greater attention should be paid to the possibilities for applying neuromodulation during speech decoding. At this time, there is no viable BCI device for evoked paradigms that can be used by the target user population, even with an inner speech paradigm, so any discussion of a fully functional device that is ready for end users would be premature. As the study of this procedure continues, we may find ways to utilize neuromodulation for model or participant training, and new possibilities for fast and accurate neuromodulation techniques that could be integrated into a headset continue to be developed.

Finally, we have noted that the two datasets differ in size and quality. We recommend that the larger dataset be used for model development and that the smaller dataset be used to validate the model under more stringent conditions. Our own tests have shown that, by means of DDA, we can successfully perform a classification analysis with both datasets54.
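The recommended develop-then-validate workflow can be sketched as follows. This is a deliberately simple nearest-centroid classifier on synthetic feature vectors standing in for per-trial DDA coefficients; the data, class separation, and classifier choice are all illustrative assumptions, not the analysis reported in Ref. 54.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for per-trial DDA features (a1, a2, a3, error):
# two conditions with slightly shifted means, in a "large" and a "small" set.
def make_set(n_per_class):
    X0 = rng.normal(0.0, 1.0, (n_per_class, 4))
    X1 = rng.normal(0.8, 1.0, (n_per_class, 4))
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n_per_class, int), np.ones(n_per_class, int)]
    return X, y

X_dev, y_dev = make_set(200)     # larger set: model development
X_val, y_val = make_set(50)      # smaller set: more stringent validation

# Nearest-centroid classifier fit on the development set only:
centroids = np.stack([X_dev[y_dev == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X_val[:, None, :] - centroids) ** 2).sum(-1), axis=1)
acc = (pred == y_val).mean()
print(f"validation accuracy: {acc:.2f}")
```

Fitting only on the development set and scoring on the held-out set mirrors the cross-dataset validation recommended above.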

Usage Notes

The raw .cnt EEG files can be read in MATLAB with the FieldTrip Toolbox93 or with the Brainstorm94 eepv4_read.m function, and in Python with the libeep library. The pre-processed files can be read in MATLAB with EEGLab70 or FieldTrip93, and in Python with MNE95. The already published data corresponding to the experimental pilot study may also be found at OSF (https://doi.org/10.17605/OSF.IO/E82P9). This repository96 duplicates a subset of the larger dataset: all data, folders, and code pertaining to CV and VC pair stimuli collected in 2019 and 2021.