Abstract
Speech is essential for human communication, but millions of people lose the ability to speak due to conditions such as amyotrophic lateral sclerosis (ALS) or stroke. Assistive technologies such as brain-computer interfaces (BCIs) can convert brain signals into speech. However, these technologies still face challenges in decoding accuracy, which are especially acute for tonal languages like Mandarin Chinese. Furthermore, most existing speech datasets are based on Indo-European languages, which hinders our understanding of how tonal information is encoded in the brain. To address this, we introduce a comprehensive open dataset comprising multimodal signals from 30 subjects using Mandarin Chinese across overt, silent, and imagined speech modes, covering electroencephalogram (EEG), surface electromyogram (sEMG), and speech recordings. This dataset lays valuable groundwork for exploring the neural encoding of tonal languages, investigating tone-related brain dynamics, and improving assistive communication strategies. It supports cross-linguistic speech processing research and contributes to data-driven innovations in neural speech decoding technology.
Background & Summary
The continuous advancement of BCI technology has sparked a growing interest in providing alternative communication solutions for individuals with speech impairments. Existing devices, such as P300 spellers and steady-state visual evoked potential (SSVEP) spellers, convert neural activity into computer commands, enabling patients to communicate. However, these technologies have several limitations, including restricted freedom of movement, lengthy training periods, the necessity for users to concentrate on specific tasks, and communication rates that are significantly slower than natural speech1,2,3. These challenges emphasize the need for more efficient and natural communication methods, leading to the development of speech-brain-computer interfaces. Speech BCIs decode neural signals associated with various modes of speech production, such as overt, silent, and imagined speech4. Overt speech BCIs capture neural activity during actual vocalization, often utilizing signals like EEG or ECoG, and may produce real-time speech or text output5,6. Silent speech BCIs focus on interpreting neural signals linked to subvocalization, typically using EEG, although MEG can also be used7,8. Imagined speech BCIs decode neural signals related to thought-based speech, which are often more complex to interpret and generally require advanced signal processing techniques, such as deep learning, to achieve accurate results9,10. Speech production involves five stages: cognitive language processing, encoding, conduction, vocalization generation, and perceptual feedback11,12. In the overt speech process, the brain’s language areas generate language content, which is translated into motor commands and sent through the nervous system to the vocal organs, resulting in sound production via airflow and vocal cord vibrations13,14. Similar brain regions are activated during silent and imagined speech processes, but speech-related signals are produced without vocalization, involving neural activity linked to internal speech or mental rehearsal of speech15,16.
The speech process involves neural signals, muscle activity, and speech generation. Decoding speech requires extracting meaningful patterns from neural or muscular signals and integrating acoustic information, which can significantly enhance decoding accuracy and robustness. In this study, we designed a specific speech paradigm and developed a publicly available dataset that includes multi-dimensional physiological signals across three speech conditions. Each mode offers unique insights into the brain's speech processing: overt speech involves actual vocalization, silent speech focuses on subvocalization, and imagined speech investigates the neural activity underlying thought-based speech. By transferring relevant findings from overt speech research to silent and imagined speech, it is possible to overcome challenges related to data scarcity and significantly improve decoding performance.
Neural signals are divided into invasive and non-invasive categories. Invasive signals include electrocorticogram (ECoG) and intracranial electroencephalogram (iEEG); non-invasive signals include functional near-infrared spectroscopy (fNIRS), magnetoencephalography (MEG), and electroencephalography (EEG), among others. Invasive signals, collected via electrodes implanted on or near the cortex, offer high signal-to-noise ratios and strong spatial resolution but require surgical implantation, which carries high risks and limits their applicability17,18. In contrast, non-invasive signals, though lower in signal quality and spatial resolution, are safe and easy to use, making them attractive for practical applications19,20. EEG, in particular, stands out for its low acquisition cost, high temporal resolution, and strong portability, making it a popular choice for speech BCI research21.
Surface electromyography (sEMG) is frequently selected in speech-related research due to its non-invasive nature, ease of use, and ability to capture muscle activity involved in speech production. It is particularly valuable for monitoring facial, tongue, and throat muscle movements during overt and silent speech22,23. Despite its sensitivity to noise from surrounding tissues, advancements in signal processing techniques have greatly improved sEMG’s accuracy, making it a practical tool for speech-related technologies24,25. When combined with neural signals, such as those from EEG, sEMG provides a powerful approach to enhance speech decoding accuracy, particularly in silent speech interfaces and prosthetic speech systems26,27,28. This integration of neural and muscular data improves the precision of brain-computer interfaces (BCIs), enabling more effective communication solutions for individuals with speech impairments29,30,31. Furthermore, sEMG-based systems have demonstrated significant progress in real-time speech recognition and prosthetics, with studies reporting the development of multi-modal interfaces that utilize both sEMG and neural signals for more robust performance32,33. These advances highlight sEMG’s growing role in assistive technologies that can enhance communication for individuals with speech disorders34,35.
Speech is a conduit between neural processing and motor execution, converting linguistic cognition into audible signals. Neural activity in brain regions such as Broca and Wernicke areas generates motor commands for articulators like the tongue, lips, and vocal cords. These movements produce sound waves with distinct acoustic properties, such as frequency, amplitude, and temporal dynamics, that carry linguistic content and speaker-specific characteristics36,37,38. Critically, speech also provides real-time feedback via auditory and tactile mechanisms, allowing speakers to adjust and refine their output. For example, the auditory feedback loop ensures that the acoustic output aligns with the intended speech, enabling adaptability to noise and disruptions39. From a neuroscience perspective, speech perception and production rely on multiple functional regions of the brain cortex, with significant overlap in neural activity between overt speech, silent speech, and imagined speech40,41. During imagined speech, the planning of vocal actions primarily occurs in the premotor cortex. However, due to the absence of actual vocalization, the flow of information typically stops at the primary motor cortex42,43.
This dataset44 provides substantial advantages for advancing speech decoding research, particularly within the context of BCI systems. The combination of non-invasive data acquisition and multi-modal signals makes it a valuable resource for exploring diverse decoding strategies. Furthermore, it focuses on Mandarin Chinese speech, offering a culturally relevant foundation for research. The dataset facilitates the integration of neural and muscular signals with acoustic features, enhancing decoding accuracy and robustness across various speech production modes. Researchers can utilize this resource to investigate the neural correlates of overt, silent, and imagined speech, thus contributing to a more nuanced understanding of speech perception and production mechanisms.
Compared to existing datasets, ours uniquely fills a critical gap by focusing on Mandarin Chinese—a tonal language under-represented in current resources. Prior datasets such as the English-centric ZuCo 2.045, DREAMER46, and DECAF47 provide high-density EEG for semantic or affective tasks, yet none address tonal encoding. Recent Chinese releases, ChineseEEG48 and its extension ChineseEEG-249, advance Mandarin coverage but remain limited to silent reading EEG without accompanying audio or electromyography. Similarly, VocalMind-sEEG50 offers Mandarin speech production conditions yet lacks sEMG and acoustic recordings, preventing a unified view of articulatory and neural dynamics.
By simultaneously integrating EEG, sEMG, and high-quality speech recordings across overt, silent, and imagined speech modes, our dataset offers a complete view of speech-related activity. The integration of muscle activity and the emphasis on imagined speech offer a new avenue for researchers to explore novel aspects of speech decoding. This dataset’s multifaceted nature enhances our understanding of speech-related neural processes. It facilitates developing and benchmarking more sophisticated algorithms for real-time speech decoding, accelerating advancements in BCI systems and speech processing research. Additionally, it provides a valuable foundation for exploring the neural encoding of tonal languages, investigating tone-related brain dynamics, improving assistive communication strategies, supporting cross-linguistic speech processing research, and contributing to data-driven neural speech decoding technology innovations.
Methods
Participants
A total of 30 participants were recruited, aged between 17 and 25 years (mean age 21 years; 14 females). Because the data include three minors (S20, S24, and S28), we fully explained the research purpose and data usage to their guardians separately and obtained their written, signed informed consent. The data from subjects S01-S03 represent the pre-experimental phase. During this phase, the experimental paradigm was identical to the subsequent formal experiment, with only the total number of trials adjusted. The pre-experimental data from these three subjects are included in the dataset and served as a preliminary test before formal data collection; this step was essential to assess the feasibility and reliability of the experimental setup and protocols. All participants were native Chinese speakers, one left-handed and the others right-handed, and none had speech impairments or neurological disorders. The experiment was approved by the Ethics Committee of Tianjin University (ethics approval number TJUE-2024-378). The data are partially de-identified but contain the participants' real voices; as part of the informed consent process, participants agreed that their data may be shared openly. All participants were informed of the research procedure and signed the informed consent form. The information on all subjects is shown in Table 1.
Experimental setup
The experimental design includes three modes (Fig. 1a). In the overt speech experiment, participants read the corresponding text aloud in Mandarin Chinese at a normal pace after the prompt material appears (only once). In the silent speech experiment, participants perform the corresponding articulatory movements without producing any sound when the prompt appears. In the imagined speech experiment, participants silently read the text in their minds at a normal pace in Mandarin Chinese (only once), again without making any sound.
Experimental framework. (a) The three experimental modes: imagined speech, silent speech, and overt speech. (b) Example of the experimental procedure for overt speech. (c) The data collection site; EEG, sEMG, and speech recordings were gathered from the subjects to construct a comprehensive dataset.
The experimental materials were selected from the General Standard Chinese Characters Table51, which consists of 8105 characters divided into three levels. Level one includes 3500 commonly used characters, meeting the basic needs of education and cultural communication. Level two comprises 3000 characters, fulfilling the general requirements of publishing, printing, dictionary compilation, and information processing. Level three includes 1605 characters, such as those used in surnames, personal names, place names, scientific and technical terms, and classical Chinese texts in primary and secondary school textbooks. Based on clinicians' guidance and on high-frequency use in the daily life of patients with speech disorders, 5 groups of Chinese characters (2 characters per group) were selected, totaling 10 Chinese characters. The selected characters all rank within the top 1% of the SUBTLEX-CH film subtitle corpus and cover 82.3% of basic imperative and declarative sentences in natural dialogue52. The specific materials used in the experiment and their corresponding descriptions are shown in Table 2. To reduce visual fatigue during the experiment, each material is presented as white Chinese characters centered on a gray background. The relevant code has been uploaded; see the Code availability section for details.
Experimental procedure
The experimental procedures of the three speech modes are the same; only the speech activity itself differs. The procedure is outlined here using the overt speech mode as an example, as shown in Fig. 1b. Before the formal start of the experiment, participants sat in a chair and maintained a comfortable posture. After an EEG cap was fitted and the muscle-activity collection devices were attached, they rested for approximately 20 seconds to prepare for the sensory cues. Participants were seated approximately 80 cm from the display screen. The ten materials were each presented in 50 randomized, non-consecutive repetitions. Participants performed the corresponding speech activity following the visual cue; subsequently, a fixation cross appeared at the center of the screen. The speech duration for each material was 1000 ms. During the experiment, participants were required to focus their attention and remain as still as possible. Each participant's session lasted approximately 48 minutes.
To mitigate potential order effects, we adopted a block-wise design in which each experimental block contained trials from only one speech modality (imagined, silent, or overt); no intermixed trials were employed. This choice was made on pragmatic grounds: during preliminary equipment validation with two volunteers who did not participate in the main study, an intermixed-trial arrangement noticeably increased participant-reported fatigue and produced more movement-related EEG artefacts. Given the differing difficulty and mental demands of the speech modes, the three modality-homogeneous blocks were executed in the fixed order imagined → silent → overt. Imagined speech, the most abstract and cognitively demanding condition, was collected first, when participants were maximally alert; silent speech, which is less demanding but still requires focus, followed; and overt speech, the most natural and least taxing mode, was performed last. This ordering lets participants adapt gradually, reducing fatigue and maintaining data quality. Participants were given the option to rest after completing each block to minimize cognitive load, which also helps reduce the effect of time-related EEG correlations on the results.
EEG acquisition
The experimental setup and dataset composition are shown in Fig. 1c. The experimental interface was implemented using Psychtoolbox. EEG data were acquired with a NeuroScan SynAmps2 system, configured with 64 electrodes following the international 10–20 placement system. Electrode impedance was maintained below 5 kΩ, and data were sampled at 1000 Hz. Raw EEG recordings were collected using Curry 9.0 software and exported in the Curry format (.cdt) for further analysis.
sEMG acquisition
For sEMG data collection, a NeuSen WM series wireless electromyography system with 6 channels was utilized. The grounding electrode was placed at the subject's collarbone. Sampling was conducted at 1000 Hz. Anatomical studies of the face, neck, and speech-related nerves identified the critical muscles involved in speech production. Double-button electrodes (44 × 22 mm) were strategically placed on specific muscle groups, including the levator labii superioris, risorius, zygomaticus, mentalis, anterior digastric, and mylohyoid muscles. Electrode polarity and orientation were adjusted according to muscle alignment to optimize signal capture. The sEMG acquisition sensor placement area and corresponding muscle distribution are shown in Fig. 2. The raw sEMG data were exported using Neuracle acquisition software. Figure 2 was based on human facial anatomy53 and was drawn using AI tools.
The sEMG acquisition sensor placement area and corresponding muscle distribution. The distance from the center line of the nose tip quantifies the placement of channels 1–653.
Speech recordings acquisition
Speech recordings were made using Adobe Audition 2023 software, a YAMAHA UR22C sound card, and a microphone. The microphone was placed approximately 20 cm directly in front of the subject's lips, facing the subject during sound collection. It was tilted downward at a 10-degree angle to minimize breath interference and adjusted to align horizontally with the corners of the mouth. The chamber walls were lined with sound-absorbing materials. This setup captured the speech signals and allowed monitoring of subject behavior during overt speech tasks, specifically recording vocalization onset and duration, and it verified the absence of audible sounds during the silent and imagined speech tasks. All audio data were sampled at a rate of 44100 Hz.
Multimodal Synchronization Protocol
Hardware triggering for EEG/sEMG: Visual stimulus onset events triggered TTL pulses via MATLAB’s lptwrite() function (Parallel Port: 0x3FF8). Identical event codes (1–10 corresponding to stimulus materials) were simultaneously sent to both NeuroScan SynAmps2 (DIN port) and NeuSen WM (digital input) systems using the custom TriggerBox class (see Code Availability). Hardware triggering achieved sub-millisecond synchronization precision between EEG and sEMG systems, consistent with manufacturer specifications (NeuroScan SynAmps2 DIN latency: <0.1 ms; NeuSen WM digital input latency: <0.5 ms). Actual synchronization precision was quantified by analyzing TTL pulse timing across all trials, revealing a median inter-system latency of 0.23 ms.
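To illustrate how such a latency check can be reproduced from exported event timestamps, the following Python sketch computes the per-trial inter-system latency and its median. The file names and the text-file export format are assumptions for illustration; this is not the validation script used to obtain the reported 0.23 ms figure.

```python
import numpy as np

# Hypothetical per-trial trigger timestamps (in seconds), one value per TTL pulse,
# exported separately from the EEG and sEMG acquisition systems.
eeg_events = np.loadtxt("sub-01_eeg_trigger_times.txt")    # NeuroScan SynAmps2
semg_events = np.loadtxt("sub-01_semg_trigger_times.txt")  # NeuSen WM

assert eeg_events.shape == semg_events.shape, "trial counts must match"

# Inter-system latency per trial, converted to milliseconds
latency_ms = (semg_events - eeg_events) * 1000.0
print(f"median inter-system latency: {np.median(np.abs(latency_ms)):.2f} ms")
print(f"95th percentile: {np.percentile(np.abs(latency_ms), 95):.2f} ms")
```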
Audio alignment: Due to equipment constraints, speech recordings were temporally aligned post-hoc using sEMG activity as a physiological anchor. Audio segments were defined from 350 ms before detected speech onset (via spectrogram energy thresholding) to 900 ms post-stimulus, based on known neurophysiological latencies of articulatory muscle activation preceding vocalization54.
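A minimal Python sketch of this segmentation rule is given below. It assumes a hypothetical per-trial WAV file whose time axis starts at stimulus onset; the STFT parameters and the energy threshold are illustrative choices rather than the values used for the released data.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

PRE_ONSET_S = 0.350   # articulatory lead before detected speech onset
POST_STIM_S = 0.900   # epoch end relative to stimulus onset (assumed at t = 0)

def detect_speech_onset(audio, fs, thresh_ratio=0.1):
    """First time (s) at which broadband spectrogram energy exceeds
    a fraction of the maximum frame energy."""
    _, t, Z = stft(audio, fs=fs, nperseg=1024, noverlap=512)
    frame_energy = np.sum(np.abs(Z) ** 2, axis=0)
    above = np.where(frame_energy > thresh_ratio * frame_energy.max())[0]
    return t[above[0]] if above.size else None

fs, audio = wavfile.read("sub-01_trial-001.wav")   # hypothetical per-trial file
audio = audio.astype(float)
if audio.ndim > 1:                                 # collapse stereo to mono
    audio = audio.mean(axis=1)

onset = detect_speech_onset(audio, fs)
if onset is not None:
    start = max(onset - PRE_ONSET_S, 0.0)
    segment = audio[int(start * fs):int(POST_STIM_S * fs)]
```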
Data Records
The dataset is named T-MSPD (Tianjin University Multidimensional Speech Patterns Dataset) and can be accessed at https://cstr.cn/31253.11.sciencedb.24416 (ref. 44). The publicly accessible data provided via the supplementary upload link are released under the data usage agreement. The repository comprises four types of data: experimental materials, raw data, preprocessed EEG and sEMG, and speech recordings, along with the relevant code and instructions. This structured organization facilitates efficient data access and processing. The dataset comprises data from 30 participants, with approximately 24 hours of EEG data, 24 hours of sEMG data, and 8 hours of speech recordings in total. The overall dataset structure is depicted in Fig. 3:
1. Materials: Contains the materials used in the speech experiment and their sources.

2. Raw Data: Comprises unprocessed data, including EEG data (in '.cdt' format), sEMG data (in 'data.bdf' and 'evt.bdf' formats), and speech data (in '.wav' format); a minimal loading sketch for these formats follows this list.

3. Preprocessed Data: Includes EEG and sEMG data that have been preprocessed and saved in '.set' format with corresponding '.fdt' files. Preprocessed speech data are saved in '_pre.wav' format, with '.mat' files available for further post-processing. The file named 'Subject ID_recording.mat' contains speech data that has been manually annotated using the synchronization labels in the overt speech folder. We also provide the lexical information of each subject for the different materials, as well as spectrograms of the different materials for each subject in the overt speech mode.

4. Other Instructions: Supplementary information, such as the dataset structure files and other auxiliary materials. The dataset structure files contain all the information needed to use this dataset efficiently.
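The sketch below illustrates one way to load the main file types with MNE-Python and SciPy. Subject IDs, folder names, and file names are placeholders and should be adapted to the actual repository layout described in the dataset structure files.

```python
import mne
from scipy.io import wavfile

sub = "S05"   # placeholder subject ID

# Raw EEG exported by Curry 9.0 (.cdt)
raw_eeg = mne.io.read_raw_curry(f"RawData/{sub}/overt_eeg.cdt", preload=True)

# Raw sEMG stored as BDF (data.bdf, with events in a separate evt.bdf)
raw_semg = mne.io.read_raw_bdf(f"RawData/{sub}/data.bdf", preload=True)

# Preprocessed, epoched EEG saved by EEGLAB (.set with a companion .fdt)
epochs = mne.read_epochs_eeglab(f"PreprocessedData/{sub}/overt_eeg_pre.set")

# Speech recording (.wav), sampled at 44.1 kHz
fs, audio = wavfile.read(f"RawData/{sub}/overt_speech.wav")

print(raw_eeg.info["sfreq"], raw_semg.info["sfreq"], len(epochs), fs)
```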
The overall dataset structure and the composition of EEG-BIDS. (a–d) show the directories of the four folders, and (e) shows the directory structure of the EEG-BIDS files, including subject-specific subfolders and metadata. (f) shows the raw data and the data acquisition devices; (g) shows the preprocessed and aligned data.
Technical Validation
EEG data preprocessing
The onset of the EEG data was defined as 0 ms, with the segment from −100 ms to 0 ms used for baseline correction; each epoch therefore spans −100 ms to 900 ms. Preprocessing of the EEG data was conducted using EEGLAB. First, irrelevant electrodes were excluded based on their spatial location: M1 and M2 were not used as reference electrodes, and CB1 and CB2, which are associated with cerebellar localization, were excluded from the analysis. The power spectrum of each electrode signal was computed, and electrodes with power exceeding twice the standard deviation of the average power spectrum were identified as bad channels. These were corrected using spatial interpolation, replacing their data with the average of adjacent electrodes. At the trial level, two measures were computed: the median variance of each electrode and the median deviation of each electrode from its mean within a trial. Trials in which either measure exceeded twice the standard deviation were flagged, and the union of these flagged trials was excluded. The reference for the EEG data was then recalculated as the average across all channels. Data were filtered using a 0–120 Hz bandpass filter55 and a 50 Hz notch filter to remove powerline noise. Following this, data were down-sampled to 256 Hz, segmented according to the labels, and baseline-corrected. Segments identified as bad were removed. Finally, independent component analysis (ICA) was applied to remove artifacts caused by muscle activity or unexpected movements, ensuring cleaner EEG data for subsequent analysis. As shown in Table 3, we provide the counts of accepted and rejected EEG trials for the 27 subjects in the formal experiment under the different speech modes (27 × 500 × 3 trials). The average percentage of rejected trials was 3.9%, indicating that our EEG data collection process is reliable.
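The pipeline above was implemented in EEGLAB; for readers working in Python, the following MNE-based sketch approximates the same sequence of steps (electrode exclusion, bad-channel detection and interpolation, average reference, filtering, down-sampling, epoching with baseline correction, and ICA). The file name, the exact bad-channel rule, and the ICA settings are illustrative assumptions.

```python
import numpy as np
import mne

raw = mne.io.read_raw_curry("sub-01_overt.cdt", preload=True)   # hypothetical file name
raw.drop_channels([ch for ch in ("M1", "M2", "CB1", "CB2") if ch in raw.ch_names])

# Flag channels whose broadband power exceeds the channel mean by more than 2 SD,
# then repair them by interpolation from neighbouring electrodes.
power = raw.compute_psd(fmax=120).get_data().mean(axis=1)
raw.info["bads"] = [raw.ch_names[i]
                    for i in np.where(power > power.mean() + 2 * power.std())[0]]
raw.set_montage("standard_1020", on_missing="ignore")
raw.interpolate_bads()

raw.set_eeg_reference("average")          # common average reference
raw.filter(l_freq=None, h_freq=120.0)     # low-pass edge of the 0-120 Hz band
raw.notch_filter(50.0)                    # powerline noise
raw.resample(256)

events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id, tmin=-0.1, tmax=0.9,
                    baseline=(-0.1, 0.0), preload=True)

# ICA to remove muscle/ocular artefacts; components would be inspected before exclusion.
ica = mne.preprocessing.ICA(n_components=20, random_state=97)
ica.fit(epochs)
epochs_clean = ica.apply(epochs.copy())
```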
In this work, we focused on comparing brain activity across three speech production modes, overt, silent, and imagined, to investigate how neural dynamics vary across these modes. We did not examine feature differences between individual speech materials within each mode, as this approach allowed us to concentrate on how each speech mode engages distinct brain regions, avoiding unnecessary complexity.
EEG source localization
We applied source localization methods to reverse-engineer the origin of EEG signals, revealing the brain regions activated during different speech modes. This not only further validates the effectiveness of the EEG data but also helps us understand the brain’s working mechanisms during the execution of various speech activities. In this work, we selected the cortex surface as the source space and OpenMEEG BEM56 as the forward modeling method. The cortex surface, constructed from high-resolution MRI, provides an anatomically precise model of the brain’s cortical surface with approximately 15000–20000 source points. For forward modeling, OpenMEEG BEM uses a three-layer head model (scalp: 0.33 S/m, skull: 0.0042 S/m, brain: 0.33 S/m), accurately reflecting the electrical conductivity differences between head tissues. Finally, the sLORETA57 was employed for source localization, leveraging its standardized approach to produce unbiased and reliable cortical activation maps. Together, this methodological framework ensures high spatial accuracy and robust interpretability of brain dynamics under different speech tasks. The average source localization results of all subjects are shown in Figs. 4–6.
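As an illustration of this inverse-modeling chain (cortical source space, three-layer BEM with the stated conductivities, sLORETA), the sketch below uses MNE-Python with the fsaverage template anatomy. The original analysis was performed in Brainstorm with OpenMEEG, so this is only an approximation; the epoch file name and the condition label 'wo' are assumptions.

```python
import os.path as op
import mne
from mne.datasets import fetch_fsaverage
from mne.minimum_norm import make_inverse_operator, apply_inverse

fs_dir = fetch_fsaverage(verbose=False)          # template anatomy stands in for MRI
subjects_dir = op.dirname(fs_dir)

epochs = mne.read_epochs_eeglab("sub-01_overt_pre.set")   # hypothetical file name
epochs.set_montage("standard_1020", on_missing="ignore")
epochs.set_eeg_reference("average", projection=True)
evoked = epochs["wo"].average()                  # condition label assumed

src = mne.setup_source_space("fsaverage", spacing="oct6",
                             subjects_dir=subjects_dir, add_dist=False)
# Three-layer BEM with the conductivities stated above
# (brain 0.33, skull 0.0042, scalp 0.33 S/m).
model = mne.make_bem_model("fsaverage", ico=4,
                           conductivity=(0.33, 0.0042, 0.33),
                           subjects_dir=subjects_dir)
bem = mne.make_bem_solution(model)

fwd = mne.make_forward_solution(evoked.info, trans="fsaverage",
                                src=src, bem=bem, eeg=True, meg=False)
cov = mne.compute_covariance(epochs, tmax=0.0)   # noise covariance from the baseline
inv = make_inverse_operator(evoked.info, fwd, cov)
stc = apply_inverse(evoked, inv, lambda2=1.0 / 9.0, method="sLORETA")
```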
The temporal variation source analysis results under the overt speech mode. Taking the EEG data source reconstruction of all participants in the material ‘wo 我 (I)’ under three different speech modes as an example, we performed source localization analysis on each epoch (−100 to 900 ms) with a 100 ms sliding window, obtaining the time-varying results of the source reconstruction.
The temporal variation source analysis results under the silent speech mode. The time window division and analysis methodology are identical to those described in Fig. 4.
The temporal variation source analysis results under the imagined speech mode. The time window division and analysis methodology are identical to those described in Fig. 4.
The distinct neural activation patterns observed across the three speech modes validate the reliability and richness of the dataset. In overt speech, activation progresses from motor planning regions, including the supplementary motor area (SMA), the primary motor cortex (M1), and the anterior cingulate cortex (ACC), in the early stage, to motor execution and auditory feedback areas, namely M1 and the auditory cortex (A1), in the middle stage, and finally to feedback integration regions, such as the dorsomedial prefrontal cortex (dmPFC) and the ventral prefrontal cortex (vPFC), in the late stage, reflecting a complete speech production process. Silent speech, while lacking phonation, activates the Broca area and the premotor cortex, with reduced motor and auditory cortex involvement, highlighting reliance on internal language processing. In imagined speech, cognitive control regions, including SMA and the prefrontal cortex (PFC), dominate across all stages, with minimal motor or sensory feedback activation, demonstrating the absence of overt production54. These consistent and stage-specific activation patterns across modalities confirm the robustness and potential for studying speech mechanisms in varying conditions.
EEG relative power spectrum analysis
The relative power of EEG data across different frequency bands reflects the brain's functional activity and neural dynamics in specific states. After preprocessing the EEG data and extracting the signals corresponding to the various labels from the 30 participants, we used the Yule-Walker method within an autoregressive (AR) model58 to compute the average relative power spectral density (PSD) for five commonly studied EEG frequency bands (δ: 1–4 Hz, θ: 4–8 Hz, α: 8–12 Hz, β: 12–30 Hz, γ: 30–120 Hz) across the three speech modes: overt, silent, and imagined. The average relative power topography of the five EEG frequency bands at the Cz electrode under the different speech modes is shown in Fig. 7. It can be observed that overt speech and silent speech primarily activate regions in the left frontal lobe, parietal lobe, and motor-related areas. In contrast, imagined speech predominantly activates areas in the medial temporal lobe, bilateral inferior frontal gyrus, right superior frontal gyrus, and anterior cingulate gyrus59. These regions correspond to the Broca and Wernicke areas associated with speech activity60.
Topographic maps of EEG relative power in different frequency bands during three speech modes: (a) overt speech, (b) silent speech, and (c) imagined speech. The color bar indicates the relative power as a percentage, representing the contribution of each frequency band’s power to the total power across all frequency bands.
Additionally, activity in the delta band gradually decreases across all three speech modes, while activity in the theta, alpha, and beta bands increases, indicating that the frequency bands closely related to imagined speech are concentrated in the 4–30 Hz range. This conclusion is also consistent with the findings of previous studies55. We verified the close correlation between the collected EEG data and speech activity through classic EEG data feature analysis methods. This conclusion provides a reliable data basis for future research, allowing researchers to further explore the specific features of different speech modes according to their own needs.
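A compact Python sketch of this band-power computation is shown below: it fits an autoregressive model by solving the Yule-Walker equations, evaluates the resulting AR spectrum, and integrates it over the five bands to obtain relative power. The AR order and frequency grid are illustrative choices, not necessarily those used for Fig. 7.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.integrate import trapezoid

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 120)}

def yule_walker_psd(x, order=16, fs=256.0, nfreq=512):
    """AR power spectrum from the Yule-Walker equations (biased autocorrelation)."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    a = solve_toeplitz(r[:order], r[1:order + 1])        # AR coefficients
    sigma2 = r[0] - np.dot(a, r[1:order + 1])            # innovation variance
    freqs = np.linspace(0.0, fs / 2, nfreq)
    e = np.exp(-2j * np.pi * np.outer(freqs, np.arange(1, order + 1)) / fs)
    return freqs, sigma2 / np.abs(1.0 - e @ a) ** 2

def relative_band_power(x, fs=256.0):
    freqs, psd = yule_walker_psd(x, fs=fs)
    in_total = (freqs >= 1) & (freqs <= 120)
    total = trapezoid(psd[in_total], freqs[in_total])
    rel = {}
    for name, (lo, hi) in BANDS.items():
        sel = (freqs >= lo) & (freqs < hi)
        rel[name] = trapezoid(psd[sel], freqs[sel]) / total
    return rel

# Example: one 1-second Cz epoch at 256 Hz (random data as a stand-in)
print(relative_band_power(np.random.default_rng(0).standard_normal(256)))
```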
EEG data classification verification
In this work, we further assessed the reliability of the EEG data from a classification perspective. Specifically, we constructed and trained a classification model using EEGNet61, a compact convolutional neural network that effectively extracts the spatiotemporal features of EEG signals (Fig. 8a), thereby validating the stability and effectiveness of the data. We divided the preprocessed EEG data from all participants into training, validation, and test sets in a 6:2:2 ratio. To improve the stability of model evaluation, we used 5-fold cross-validation during the classification process. The classification results for the 30 participants are shown in Fig. 8b, with an overall average accuracy of 45.73%. Although the classification accuracy for some participants was lower than the average, it was still significantly higher than the 10% chance level of the ten-class task, indicating that the classification model effectively distinguishes differences in EEG signals. Furthermore, the performance evaluation of the classification model provides strong support for the reliability of the EEG data, ensuring the credibility of subsequent analyses and applications.
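For orientation, a minimal EEGNet-style model in PyTorch is sketched below, following the published architecture (temporal convolution, depthwise spatial convolution, separable convolution, average pooling, and a linear classifier). The channel count, epoch length, and hyperparameters are assumptions and do not claim to reproduce the exact configuration used for Fig. 8.

```python
import torch
import torch.nn as nn

class EEGNetLike(nn.Module):
    """Compact EEGNet-style CNN (after Lawhern et al., 2018); a sketch, not the authors' code."""
    def __init__(self, n_channels=60, n_samples=256, n_classes=10,
                 F1=8, D=2, F2=16, kern=64, drop=0.5):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(1, F1, (1, kern), padding=(0, kern // 2), bias=False),  # temporal conv
            nn.BatchNorm2d(F1),
            nn.Conv2d(F1, F1 * D, (n_channels, 1), groups=F1, bias=False),    # depthwise spatial conv
            nn.BatchNorm2d(F1 * D),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Dropout(drop),
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8), groups=F1 * D, bias=False),
            nn.Conv2d(F1 * D, F2, (1, 1), bias=False),                        # separable conv
            nn.BatchNorm2d(F2),
            nn.ELU(),
            nn.AvgPool2d((1, 8)),
            nn.Dropout(drop),
        )
        with torch.no_grad():                      # infer the flattened feature size
            feat = self.block2(self.block1(torch.zeros(1, 1, n_channels, n_samples)))
        self.classify = nn.Linear(feat.numel(), n_classes)

    def forward(self, x):                          # x: (batch, 1, channels, samples)
        x = self.block2(self.block1(x))
        return self.classify(x.flatten(1))

model = EEGNetLike()
logits = model(torch.randn(4, 1, 60, 256))         # 4 epochs, 60 channels, 1 s at 256 Hz
print(logits.shape)                                # torch.Size([4, 10])
```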
sEMG data preprocessing
The collected sEMG data underwent basic preprocessing, including removing irrelevant electrodes (channel 7 and channel 8), 1–300 Hz bandpass filtering, and 50 Hz notch filtering. In overt and silent speech modes, the sEMG data labels fully correspond to the EEG data labels. Therefore, we also set the data duration to −100 ms to 900 ms for each segment.
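These sEMG preprocessing steps can be expressed in a few lines of Python with SciPy; the filter order and the zero-phase application below are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

FS = 1000.0   # sEMG sampling rate (Hz)

def preprocess_semg(x):
    """Band-pass 1-300 Hz plus a 50 Hz notch, applied zero-phase along the last axis."""
    b, a = butter(4, [1.0, 300.0], btype="bandpass", fs=FS)
    x = filtfilt(b, a, x, axis=-1)
    b, a = iirnotch(50.0, Q=30.0, fs=FS)
    return filtfilt(b, a, x, axis=-1)

# Example: one hypothetical 6-channel trial spanning -100 to 900 ms at 1000 Hz
trial = np.random.randn(6, 1000)
clean = preprocess_semg(trial)
```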
Mean absolute value of sEMG
To evaluate the reliability of the sEMG data, we selected the mean absolute value (MAV) of the EMG signal as a key time-domain feature. MAV effectively reflects the overall amplitude level of the signal, offering advantages such as simplicity of computation and strong robustness, and it demonstrates excellent discriminative capability in various motion-pattern classification tasks. After preprocessing, the sEMG signals were segmented into fixed-length, non-overlapping windows, and the MAV of each segment was calculated according to Eq. 1 to generate feature vectors:

$$\mathrm{MAV}=\frac{1}{N}\sum_{n=1}^{N}\left|x[n]\right| \qquad (1)$$

Here, x[n] represents the signal amplitude at the n-th sampling point, and N denotes the length of the signal segment.
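A direct implementation of Eq. 1 over non-overlapping windows might look as follows; the window length is an assumption, as the exact value is not specified here.

```python
import numpy as np

def mav_features(x, win=100):
    """Mean absolute value (Eq. 1) over fixed-length, non-overlapping windows."""
    n_win = len(x) // win
    frames = np.abs(np.asarray(x, float)[:n_win * win]).reshape(n_win, win)
    return frames.mean(axis=1)          # one MAV value per window

# Example: a 1000-sample sEMG channel split into 100-sample windows
print(mav_features(np.random.randn(1000)))
```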
We focused on overt speech due to its clear articulatory sEMG signals and well-defined temporal structure, which allows precise trial alignment and ensures that variability in the sEMG data is primarily due to the recording system. As shown in Fig. 9, for the same subject, the sEMG recorded multiple times using the same material exhibited a high level of consistency. The statistical analysis revealed no significant differences between the trials, thereby strongly supporting the reliability of the sEMG data.
To assess the classification performance of the sEMG data during overt speech tasks, we applied EEGNet with the same methodology used for the EEG data. The classification results for the 30 participants are shown in Fig. 10, with an overall accuracy of 71%. Although some participants had accuracy below the cohort average, the results were still significantly above the 10% chance level, demonstrating the model's ability to distinguish differences in sEMG signals. These results strongly support the reliability of the sEMG data, confirming its suitability for subsequent classification tasks.
Raw speech recording data evaluation
The speech recordings were trimmed to remove the irrelevant portions before and after the experiment. We comprehensively assessed speech quality by evaluating multiple objective metrics, including signal-to-noise ratio (SNR), energy contrast, and spectral noise proportion. The audio was segmented into 1-second frames, and the short-time energy (STE) was computed for each frame. Signal energy was taken as the energy of the current frame, and noise energy was estimated from low-energy silent frames. The SNR was calculated as the ratio of signal energy to noise energy in decibels (dB), with the result normalized to the range [0, 1] by assuming an ideal SNR of 30 dB. Energy contrast was determined by calculating the logarithmic ratio of the mean energy over all frames to the noise energy, with an ideal energy contrast of 10 dB; this score was likewise normalized to [0, 1]. Spectral noise proportion was assessed by calculating the energy ratio between high-frequency (>4000 Hz) and total spectral energy, with a lower ratio indicating cleaner speech; this value was inverted so that cleaner speech receives a higher score. The final speech quality score was derived by aggregating these metrics using weighted summation, with weights of 0.5 for SNR, 0.3 for energy contrast, and 0.2 for spectral noise proportion, and was then normalized to the range [0, 1], where higher scores indicate better quality. The results were visualized in a radar plot, with the normalized scores for SNR, energy contrast, and spectral noise proportion represented on the axes, providing an intuitive way to assess speech quality across dimensions. The quality assessment scores of the raw speech data for all participants are summarized in Table 4. As an example, the visualized results of the average speech quality assessment for S30 are presented in Fig. 11.
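The following Python sketch approximates the described scoring. The silent-frame noise estimate (quietest 10% of frames) and the file name are assumptions, and details of the normalization may differ from the released assessment code.

```python
import numpy as np
from scipy.io import wavfile

def speech_quality(path, ideal_snr_db=30.0, ideal_contrast_db=10.0,
                   weights=(0.5, 0.3, 0.2)):
    """Weighted combination of SNR, energy contrast, and spectral noise proportion."""
    fs, x = wavfile.read(path)
    x = x.astype(float)
    if x.ndim > 1:
        x = x.mean(axis=1)

    # Short-time energy over 1-second frames
    n = len(x) // fs
    ste = np.array([np.sum(x[i * fs:(i + 1) * fs] ** 2) for i in range(n)])
    noise = np.mean(np.sort(ste)[:max(1, n // 10)]) + 1e-12   # quietest frames ~ noise

    snr_db = 10 * np.log10(np.mean(ste / noise))
    snr_score = float(np.clip(snr_db / ideal_snr_db, 0, 1))

    contrast_db = 10 * np.log10(np.mean(ste) / noise)
    contrast_score = float(np.clip(contrast_db / ideal_contrast_db, 0, 1))

    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    hf_ratio = spec[freqs > 4000].sum() / (spec.sum() + 1e-12)
    spectral_score = float(np.clip(1.0 - hf_ratio, 0, 1))     # lower HF energy -> cleaner

    scores = np.array([snr_score, contrast_score, spectral_score])
    return float(np.clip(np.dot(weights, scores), 0, 1))

print(speech_quality("sub-30_overt_speech.wav"))   # hypothetical file name
```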
Visualization of speech features and quality assessment scores for S01. (a) Waveform: Displays the original waveform of the speech signal, reflecting the changes in audio data over time. (b) Short-time energy map: Displays the short-time energy changes of the speech signal in different periods. (c) Spectrum: Displays the intensity distribution of the speech signal at different frequencies and times. (d) Radar chart: Visualizes the relative performance of the three indicators of signal-to-noise ratio, energy contrast, and spectrum-to-noise ratio. The higher the score, the farther the point of the radar chart is from the center, indicating that the speech quality is better. The score range of the radar chart is 0 to 1, where 1 indicates ideal speech quality, and 0 indicates poor speech quality.
We combined both time-domain and frequency-domain metrics to provide a detailed and comprehensive assessment of speech quality. The weighting for each metric was carefully assigned based on its relevance to overall speech quality, with a higher weight given to the SNR due to its direct impact on speech clarity. At the same time, energy contrast and spectral noise proportion were also included to account for variations in signal strength and spectral characteristics. This approach ensures a balanced and robust evaluation. Additionally, the code for speech quality assessment has been uploaded with the dataset for transparency and reproducibility.
Label and temporal alignment of multimodal data
In this dataset, we provide multimodal neurophysiological data across three speech modes: overt, silent, and imagined. These data are initially analyzed independently to explore the similarities and differences in the neural mechanisms underlying each mode. Including overt speech data is particularly valuable in addressing the data scarcity challenge when training silent and imagined speech models, thereby advancing transfer learning in speech decoding. Although speech data is collected synchronously with EEG and sEMG signals during overt speech production, automatic annotation of the speech data is not feasible.
We developed an efficient method for aligning the data labels to tackle this challenge. The speech data are first segmented based on the experiment's start and end times during preliminary preprocessing. It is crucial to note that muscle activity typically precedes sound onset in speech production, reflecting the neuromuscular transmission involved in sound formation57. Relying solely on spectrogram-based segmentation may therefore introduce alignment errors. To address this, we visually examined the sEMG data from the same channels across all participants, segmented by the different labels. The pronounced changes in sEMG signal amplitude between 300 and 400 ms, depicted in Fig. 12 for all six channels, provide a physiological anchor point for temporal alignment. To provide a more detailed illustration, we selected the sEMG data from Channel 1 for further analysis. Based on this neurophysiological evidence, we implemented the alignment formula

$$t_{\mathrm{speech,start}} = t_{\mathrm{onset}} - 350\ \mathrm{ms},$$

where \(t_{\mathrm{onset}}\) is the time at which the vocalization first appears in the spectrogram; the formula leverages the consistent sequence of articulator movement → vocalization. Validation against manual annotation showed that this method achieves a mean absolute alignment error of 42 ± 18 ms across subjects. To ensure precise alignment between the speech recordings, EEG, and sEMG data, we adjusted for this time lag by setting the beginning of each speech segment 350 ms before the first appearance of the vocalization in the spectrogram. This approach guarantees that both the initial speech and subsequent segments remain accurately synchronized with the EEG and sEMG data, facilitates precise alignment for analyzing speech production, and ensures that the multimodal data (EEG, sEMG, and speech recordings) can be used together as complementary inputs in classification and decoding studies.
Next, the preprocessed speech data is divided into trials, enabling precise extraction of experiment-related speech segments. Finally, these speech segments are aligned with the labels using the synchronized markers from the EEG and sEMG signals. This allowed us to identify the corresponding segments of the speech data accurately. At the same time, we align the three types of data according to the labels, as shown in Fig. 13. Manual segmentation and the addition of synchronized labels further refined the alignment, ensuring the completeness and accuracy of the speech data and providing precise temporal alignment of multimodal data. This provides a solid foundation for subsequent multimodal analysis and the development of speech decoding models.
Usage Notes
The experiment and data analysis codes have been released together with the dataset. The repository contains four categories of resources: experimental materials, raw data, pre-processed EEG and sEMG, and speech recordings, along with the corresponding code and documentation. Because speech recordings are inherently sensitive and can directly identify individuals, the dataset is distributed under a restricted-access model that requires prior acceptance of the data usage agreement. To apply, users should register an account at Science Data Bank (https://cstr.cn/31253.11.sciencedb.24416)44, click the “Request Access” button, fill in the affiliation and project details, and submit the request; approval is normally granted within 5–7 business days. The data usage agreement permits the use of data for scientific research and technical development (including non-commercial and commercial purposes), prohibits any attempt to re-identify participants or redistribute the data without written permission from the dataset authors, and is publicly available on the dataset page before application.
Data availability
The dataset is publicly accessible via the ScienceDB platform. All of the data are online (https://cstr.cn/31253.11.sciencedb.24416)44.
Code availability
The code for all modules is openly available at https://cstr.cn/31253.11.sciencedb.24416 (ref. 44). All code is stored in a subfolder named 'code' and organized according to the implemented functionality. All scripts were developed in MATLAB 2023b and Python 3.8. The experimental paradigm was presented using Psychtoolbox-3-3.0.19.0. EEG and sEMG data preprocessing was performed using the EEGLAB toolbox, version v2024.2.1. The software used for speech data preprocessing was Adobe Audition 2022. For source analysis, we used the Brainstorm toolbox, version 3.241203. Researchers interested in this work can conduct further research based on the preprocessing and analysis code we provided.
References
He, H. & Wu, D. Transfer Learning for Brain-Computer Interfaces: A Euclidean Space Data Alignment Approach. IEEE Trans. Biomed. Eng. 67(2), 399–410, https://doi.org/10.1109/TBME.2019.2913914 (2020).
Pan, J. et al. Advances in P300 brain–computer interface spellers: toward paradigm design and performance evaluation. Front. Hum. Neurosci. 16, 1077717, https://doi.org/10.3389/fnhum.2022.1077717 (2022).
Velikoborets, G. S., Nagornova, Z. V. & Shemyakina, N. V. BCI-Speller’s Data and Approaches as the Basis for Cognitive BCI Applications (from Communication to Creation). Hum. Physiol. 49(1), S133–S146, https://doi.org/10.1134/S0362119723600479 (2023).
Soroush, P. Z. et al. The nested hierarchy of overt, mouthed, and imagined speech activity evident in intracranial recordings. NeuroImage 269, 119913, https://doi.org/10.1016/j.neuroimage.2023.119913 (2023).
Makin, J. G., Moses, D. A. & Chang, E. F. Machine translation of cortical activity to text with an encoder–decoder framework. Nat. Neurosci. 23(4), 575–582, https://doi.org/10.1038/s41593-020-0608-8 (2020).
Willett, F. R. et al. A high-performance speech neuroprosthesis. Nature 620(7976), 1031–1036, https://doi.org/10.1038/s41586-023-06377-x (2023).
Wang, Y. et al. Silent Speech Decoding Using Spectrogram Features Based on Neuromuscular Activities. Brain Sci. 10(7), 442, https://doi.org/10.3390/brainsci10070442 (2020).
Yoo, H. et al. Silent Speech Recognition with Strain Sensors and Deep Learning Analysis of Directional Facial Muscle Movement. ACS Appl. Mater. Interfaces 14(48), 54157–54169, https://doi.org/10.1021/acsami.2c14918 (2022).
Li, F. et al. Decoding imagined speech from EEG signals using hybrid-scale spatial-temporal dilated convolution network, J. Neural Eng., https://doi.org/10.1088/1741-2552/ac13c0 (2021).
Lee, D.-Y., Lee, M. & Lee, S.-W. Decoding Imagined Speech Based on Deep Metric Learning for Intuitive BCI Communication. IEEE Trans. Neural Syst. Rehabil. Eng. Publ. IEEE Eng. Med. Biol. Soc. 29, 1363–1374, https://doi.org/10.1109/TNSRE.2021.3096874 (2021).
Hickok, G. The dual stream model of speech and language processing. In Handbook of Clinical Neurology (Aphasia), vol. 185 (eds Hillis, A. E. & Fridriksson, J.) 57–69, https://doi.org/10.1016/B978-0-12-823384-9.00003-7 (Elsevier, 2022).
Silva, A. B. et al. A Neurosurgical Functional Dissection of the Middle Precentral Gyrus during Speech Production. J. Neurosci. Off. J. Soc. Neurosci. 42(45), 8416–8426, https://doi.org/10.1523/JNEUROSCI.1614-22.2022 (2022).
Drijvers, L., Small, S. L. & Skipper, J. I. Language is widely distributed throughout the brain. Nat. Rev. Neurosci. 26(3), 189–189, https://doi.org/10.1038/s41583-024-00903-0 (2025).
Zheng, Y., Zhang, J., Yang, Y. & Xu, M. Neural representation of sensorimotor features in language-motor areas during auditory and visual perception. Commun. Biol. 8(1), 1–15, https://doi.org/10.1038/s42003-025-07466-5 (2025).
Tang, J., LeBel, A., Jain, S. & Huth, A. G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. 26(5), 858–866, https://doi.org/10.1038/s41593-023-01304-9 (2023).
Kröger, B. J., Stille, C. M., Blouw, P., Bekolay, T. & Stewart, T. C. Hierarchical Sequencing and Feedforward and Feedback Control Mechanisms in Speech Production: A Preliminary Approach for Modeling Normal and Disordered Speech. Front. Comput. Neurosci. 14, 573554, https://doi.org/10.3389/fncom.2020.573554 (2020).
Rabbani, Q. et al. Iterative alignment discovery of speech-associated neural activity. J. Neural Eng. 21(4), 046056, https://doi.org/10.1088/1741-2552/ad663c (2024).
Parvizi, J. & Kastner, S. Promises and limitations of human intracranial electroencephalography. Nat. Neurosci. 21(4), 474–483, https://doi.org/10.1038/s41593-018-0108-2 (2018).
MEG Sensor Selection for Neural Speech Decoding. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/33204579/ (Accessed: 2024).
Kwak, Y., Song, W.-J. & Kim, S.-E. FGANet: fNIRS-Guided Attention Network for Hybrid EEG-fNIRS Brain-Computer Interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. Publ. IEEE Eng. Med. Biol. Soc. 30, 329–339, https://doi.org/10.1109/TNSRE.2022.3149899 (2022).
Zhou, J. et al. Speech2EEG: Leveraging Pretrained Speech Model for EEG Signal Recognition. IEEE Trans. Neural Syst. Rehabil. Eng. Publ. IEEE Eng. Med. Biol. Soc. 31, 2140–2153, https://doi.org/10.1109/TNSRE.2023.3268751 (2023).
Deng, Z. et al. Silent Speech Recognition Based on Surface Electromyography Using a Few Electrode Sites Under the Guidance From High-Density Electrode Arrays. IEEE Trans. Instrum. Meas. 72, 1–11, https://doi.org/10.1109/TIM.2023.3244849 (2023).
Liu, H. et al. An epidermal sEMG tattoo-like patch as a new human-machine interface for patients with loss of voice. Microsyst. Nanoeng. 6, 16, https://doi.org/10.1038/s41378-019-0127-5 (2020).
Towards optimizing electrode configurations for silent speech recognition based on high-density surface electromyography. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/33181497/ (Accessed: 2024).
Meltzner, G. S. et al. Development of sEMG sensors and algorithms for silent speech recognition. J. Neural Eng. 15(4), 046031, https://doi.org/10.1088/1741-2552/aac965 (2018).
Chen, X., Zhang, X., Chen, X. & Chen, X. Decoding Silent Speech Based on High-Density Surface Electromyogram Using Spatiotemporal Neural Network. IEEE Trans. Neural Syst. Rehabil. Eng. Publ. IEEE Eng. Med. Biol. Soc. 31, 2069–2078, https://doi.org/10.1109/TNSRE.2023.3266299 (2023).
Galego, J. S., Casas, O. V., Rossato, D., Simes, A. & Balbinot, A. Surface Electromyography and Electroencephalography processing in dysarthric patients for verbal commands or speaking intention characterization. Measurement 175(1), 109147, https://doi.org/10.1016/j.measurement.2021.109147 (2021).
Tang, X. et al. Enhancing EEG and sEMG Fusion Decoding Using a Multi-Scale Parallel Convolutional Network With Attention Mechanism. IEEE Trans. Neural Syst. Rehabil. Eng. Publ. IEEE Eng. Med. Biol. Soc. 32, 212–222, https://doi.org/10.1109/TNSRE.2023.3347579 (2024).
Li, H. et al. Silent Speech Interface With Vocal Speaker Assistance Based on Convolution-Augmented Transformer. IEEE Trans. Instrum. Meas. 72, 1–11, https://doi.org/10.1109/TIM.2023.3273660 (2023).
Li, Y., Zhang, W., Zhang, Q. & Zheng, N. Transfer Learning-Based Muscle Activity Decoding Scheme by Low-frequency sEMG for Wearable Low-cost Application. IEEE Access 9, 22804–22815, https://doi.org/10.1109/ACCESS.2021.3056412 (2021).
Li, H. et al. Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language. Brain Sci. 12(7), 818, https://doi.org/10.3390/brainsci12070818 (2022).
Li, W. et al. sEMG-based technology for silent voice recognition. Comput. Biol. Med. 152, 106336, https://doi.org/10.1016/j.compbiomed.2022.106336 (2023).
Zhu, B., Zeng, X., Chen, Z., Zhang, D. & Xie, L. SSR Using Portable sEMG Acquisition System With Electrode Layout Optimization. IEEE Sens. J. 24(15), 24997–25005, https://doi.org/10.1109/JSEN.2024.3410401 (2024).
Vojtech, J. M. et al. Surface Electromyography-Based Recognition, Synthesis, and Perception of Prosodic Subvocal Speech. J. Speech Lang. Hear. Res. JSLHR 64(6S), 2134–2153, https://doi.org/10.1044/2021_JSLHR-20-00257 (2021).
Sharon, R. A., Narayanan, S. S., Sur, M. & Murthy, A. H. Neural Speech Decoding During Audition, Imagination and Production. IEEE Access 8, 149714–149729, https://doi.org/10.1109/ACCESS.2020.3016756 (2020).
Verwoert, M. et al. Whole-brain dynamics of articulatory, acoustic and semantic speech representations. Commun. Biol. 8(1), 1–10, https://doi.org/10.1038/s42003-025-07862-x (2025).
Beyene, M. & Meininger, R. A case study of distress mechanism(s) in a concrete structure foundation in the saturated zone and above the saturated zone. J. Microsc. 286(2), 114–119, https://doi.org/10.1111/jmi.13068 (2022).
Ter-Ovanesov, M. D., Levitskiy, A. V., Lesnidze, E. E., Sokhlikov, A. A. & Gaboyan, A. S. Ascending aorta injury followed by cardiac tamponade after sternal puncture, Khirurgiia (Sofiia), 5, 125–128, https://doi.org/10.17116/hirurgia2021051125 (2021).
Neural mechanisms underlying auditory feedback control of speech. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/18035557/ (Accessed: 2024).
Hansen, J. Y. et al. Integrating brainstem and cortical functional architectures. Nat. Neurosci. 27(12), 2500–2511, https://doi.org/10.1038/s41593-024-01787-0 (2024).
Zhang, X.-H. et al. The cell-type underpinnings of the human functional cortical connectome. Nat. Neurosci. 28(1), 150–160, https://doi.org/10.1038/s41593-024-01812-2 (2025).
Kearney, E. & Guenther, F. H. Articulating: The Neural Mechanisms of Speech Production. Lang. Cogn. Neurosci. 34(9), 1214–1229, https://doi.org/10.1080/23273798.2019.1589541 (2019).
Bakhshali, M. A., Khademi, M. & Ebrahimi-Moghadam, A. Investigating the neural correlates of imagined speech: An EEG-based connectivity analysis. Digit. Signal Process. 123, 103435, https://doi.org/10.1016/j.dsp.2022.103435 (2022).
Zhao, R. et al. An open dataset of multidimensional signals based on different speech patterns in pragmatic Mandarin[DS/OL]. V6. Science Data Bank, [2025-10-22], https://doi.org/10.57760/sciencedb.24416 (2025).
Hollenstein, N. et al. ZuCo 2.0: A Dataset of Eye-Tracking and EEG Recordings During Natural Reading. Sci Data 7, 151 (2020).
Katsigiannis, S. & Ramzan, N. DREAMER: A Database for Emotion Recognition Through EEG and ECG. IEEE Access 5, 13420–13433 (2017).
Song, S. et al. DECAF: A Multimodal Dataset for Decoding Affective Physiological Responses. IEEE Access 8, 75085–75096 (2020).
Mou, X. et al. ChineseEEG: A Chinese Linguistic Corpora EEG Dataset for Semantic Alignment and Neural Decoding. Sci Data 11, 458 (2024).
Chen, Y. et al. ChineseEEG-2: Extending Mandarin EEG Corpora to Overt and Listening Conditions. arXiv:2508.04240 (2025).
Li, J. et al. VocalMind-sEEG: A Stereo-EEG Dataset of Mandarin Overt, Silent, and Imagined Speech. Sci Data 12, 202 (2025).
Table of General Standard Chinese Characters. [Online]. Available: http://hanzidb.org/character-list/general-standard (Accessed: 2025).
Cai, Q. & Brysbaert, M. SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLOS ONE 5(6), e10729, https://doi.org/10.1371/journal.pone.0010729 (2010).
Vojtech, J. M. & Stepp, C. E. Electromyography, in Manual of Clinical Phonetics, Routledge (2021).
Nuttall, H. E., Kennedy-Higgins, D., Devlin, J. T. & Adank, P. Modulation of intra- and inter-hemispheric connectivity between primary and premotor cortex during speech perception. Brain Lang. 187, 74–82, https://doi.org/10.1016/j.bandl.2017.12.002 (2018).
Proix, T. et al. Imagined speech can be decoded from low- and cross-frequency intracranial EEG features. Nat. Commun. 13(1), 48, https://doi.org/10.1038/s41467-021-27725-3 (2022).
Gramfort, A., Papadopoulo, T., Olivi, E. & Clerc, M. OpenMEEG: opensource software for quasistatic bioelectromagnetics. Biomed. Eng. Online 9, 45, https://doi.org/10.1186/1475-925X-9-45 (2010).
Pascual-Marqui, R. D. Standardized low-resolution brain electromagnetic tomography (sLORETA): technical details. Methods Find. Exp. Clin. Pharmacol. 24(Suppl D), 5–12 (2002).
Shakeel, A., Onojima, T., Tanaka, T. & Kitajo, K. Real-Time Implementation of EEG Oscillatory Phase-Informed Visual Stimulation Using a Least Mean Square-Based AR Model. J. Pers. Med. 11(1), 38, https://doi.org/10.3390/jpm11010038 (2021).
Stephan, F., Saalbach, H. & Rossi, S. The Brain Differentially Prepares Inner and Overt Speech Production: Electrophysiological and Vascular Evidence. Brain Sci. 10(3), 148, https://doi.org/10.3390/brainsci10030148 (2020).
Schroeder, M. L. et al. Mapping cortical activations underlying covert and overt language production using high-density diffuse optical tomography. NeuroImage 276, 120190, https://doi.org/10.1016/j.neuroimage.2023.120190 (2023).
Lawhern, V. J. et al. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. J. Neural Eng. 15(5), 056013, https://doi.org/10.1088/1741-2552/aace8c (2018).
Acknowledgements
This work is funded by the National Key Research and Development Program of China (No. 2023YFF1203500) and the Haihe Laboratory of Brain-computer Interaction and Human-machine Integration Independent Project (24HHNJSS00015). We also thank all research assistants who provided general support in participant recruiting and data collection.
Author information
Contributions
Ran Zhao: Conceptualization, Methodology, Formal analysis, Writing-original draft, Writing-review & editing. Yanru Bai: Conceptualization, Methodology, Formal analysis, Writing-original draft, Writing-review & editing. Shuming Zhang: Methodology, Data curation. Jianing Zhu: Data preprocessing. Hongxing Liu: Methodology. Guangjian Ni: Conceptualization, Methodology, Writing-review & editing, Funding acquisition, Project administration.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhao, R., Bai, Y., Zhang, S. et al. An open dataset of multidimensional signals based on different speech patterns in pragmatic Mandarin. Sci Data 12, 1934 (2025). https://doi.org/10.1038/s41597-025-06213-z