Abstract
The paper introduces PAVSig: Polish Audio-Visual child speech dataset for computer-aided diagnosis of Sigmatism (lisp). The study aimed to gather data on articulation, acoustics, and visual appearance of the articulators in different child speech patterns, particularly in sigmatism. The data was collected in 2021–2023 in six kindergarten and school facilities in Poland during the speech and language therapy examinations of 201 children aged 4–8. The diagnosis was performed simultaneously with data recording, including 15-channel spatial audio signals and a dual-camera stereovision stream of the speaker’s oral region. The data record comprises audiovisual recordings of 51 words and 17 logotomes containing all 12 Polish sibilants and the corresponding speech and language therapy diagnoses from two independent speech and language therapy experts. In total, we share 66,781 audio-video segments, including 12,830 words and 53,951 phonemes (12,576 sibilants).
Background & Summary
Computer-aided speech diagnosis (CASD) or computer-aided speech therapy (CAST) is an emerging field with increasing access to data acquisition and processing capabilities. It is strongly connected to particular languages, as it relies on acoustics and articulation, which differ significantly between language groups1. The distinction can also concern the speaker's age, as the therapy can involve children, adolescents, or adults. Nonetheless, the earlier the diagnosis occurs, the more efficient the therapy can be. Therefore, it is essential to support the speech and language therapist/pathologist (SLP) in treating even very young children2. Sigmatism (lisp) is one of the most common types of speech sound disorders3,4,5,6,7,8,9,10,11,12. It refers to an incorrect articulation of sibilant sounds, which differ across languages. In Polish, there are 12 sibilants: four denti-alveolars /s/, /z/, /t͡s/, /d͡z/; four retroflexes /ʂ/, /ʐ/, /t͡ʂ/, /d͡ʐ/; and four alveolo-palatals /ɕ/, /ʑ/, /t͡ɕ/, /d͡ʑ/13.
Many detailed features can describe articulation, e.g., the manner of articulation, active and passive articulators, airflow direction, or voicing14,15,16,17. However, measurement methods that would systematize and objectify the assessment of speech production are still not well defined. CASD tools can support the therapy by providing additional information to the therapist or by improving speech screening tests in schools and kindergartens. Finally, automated articulation analysis can be implemented in applications for speech exercises performed autonomously at home between speech and language therapy (SLT) sessions.
To be applicable in practice, CASD systems have to be based on data that are reliable and easily recordable without disrupting natural articulation. That excludes many specialized invasive systems used in articulation research, such as electromagnetic articulography (EMA)18,19 or electropalatography (EPG)20,21. On the other hand, acoustic analysis has been performed in this area for years11,16,17,22,23,24,25. A less common idea involves video recording of the speaker's face to monitor the appearance of the articulatory organs (articulators): tongue, lips, or teeth26,27,28,29,30. Both audio and video can be recorded with everyday devices offering sufficiently good quality. Access to annotated audiovisual (AV) data covering normal and distorted child speech is necessary to make the analysis reliable and repeatable and to enable training and validation of automated CASD models.
This study was a part of research project no. 2018/30/E/ST7/00525: “Hybrid System for Acquisition and Processing of Multimodal Signal in the Analysis of Sigmatism in Children”, financed by National Science Center, Poland, in 2019–2024. It aimed at finding relationships between articulation, acoustics, and visual appearance of the articulators in different child speech patterns. A literature review and available solutions showed a need for an adequate dataset for the Polish language. Therefore, we prepared a detailed framework for the SLT examination with a data recording session, including multichannel spatial audio signals and a dual-camera stereovision stream of the speaker’s oral region (Fig. 1). As a result, we collected an extensive multimodal PAVSig (Polish Audio-Visual speech dataset for computer-aided diagnosis of Sigmatism) dataset of 201 children aged 4–8, along with the corresponding SLT diagnoses from two independent experts.
Methods
Research sample
Our interdisciplinary research team, including biomedical engineers and SLPs, performed the SLT examinations and data recording sessions in six kindergarten and school facilities in Myslowice, Katowice, Ruda Slaska, and Zabrze, Poland, from October 2021 to June 2023. The research sample initially covered 208 children, but several factors (non-native Polish speakers, data acquisition failures, and others) limited it to 201 (107 girls and 94 boys) aged 4–8 (see Table 1). Including a child in the research sample required written consent from their parents or legal guardians to participate in the study and share the data as described in this paper. The child also had to agree verbally to participate in the study. The exclusion criteria covered: (1) diagnosed disabilities, including hearing impairment, deafness, low vision, visual impairment, aphasia, autism spectrum disorder, and intellectual disability, and (2) a history of epilepsy. The study received a positive recommendation from the Bioethics Committee for Scientific Research at the Jerzy Kukuczka University of Physical Education in Katowice, Poland (Decision No. 3/2021).
Speech and language therapy assessment with a data recording session
The assessment consisted of three stages, with two involving data recording:
1. In the first part, a dedicated multimodal data acquisition device (MDAD) registered the child's speech while naming pictures visible on the screen (Fig. 2a).
Fig. 2 Illustration of the SLT examination with a data recording session: (a) part 1: data recording while naming graphics visible on the screen; (b) part 2: data recording while repeating words or logotomes and undergoing SLT assessment; (c) part 3: SLT examination. The individuals in the pictures or their legal guardians consented to publishing their image in the manuscript.
2. In the second part, the speaker was recorded while repeating selected words and one- or two-syllable logotomes following the SLP. This stage also involved various tongue movements, swallowing, or smiling (Fig. 2b).
3. The third part was the SLT examination according to the dedicated diagnostic protocol for sigmatism-related speech assessment (Fig. 2c). It was performed by the SLP, and no data was recorded at this point.
Each session produced a record of multimodal data (15-channel spatial audio and dual-camera video stream) and a filled diagnostic questionnaire of the case. At least two biomedical engineers and one SLP were present during each recording session. The second SLP prepared their independent diagnosis another day without data recording and with no access to the previous assessment outcomes.
Speech corpus
The sibilant-related linguistic material prepared and collected in the study consisted of 51 words and 12 one-syllable logotomes containing all Polish sibilants (see Table 2). The corpus organizes isolated words with sibilants in different phases of articulation (word positions): at the beginning, in the middle, and at the end of the phrase31,32 (final positions only for voiceless sibilants). Our assumption was to use words in which the sibilant is surrounded by the vowel /a/ wherever possible. However, the priority was that the words were known and unambiguous to a preschool-age child and easy to represent graphically, as the child's task was to name the object they saw in a picture. During the selection of the language material, we encountered a disproportion in how often particular sibilants occur in words applicable to picture naming. We reviewed picture tests available for Polish children and considered their use, but their sound distributions were diversified as well. Some tests (e.g., from Krajna and Bryndal33) employed words featuring sibilants in various phonetic neighborhoods or with facultative pronunciation (e.g., jam, Polish: dżem, which may be produced with an affricate /d͡ʐ/ or asynchronously as /d/ + /ʐ/). Additionally, we included a set of words that did not follow the described selection criteria but had been used in our previous experiments; we consider them an added value to the core dictionary. We assume that the language material covered in the database may be used for different purposes and filtered according to specific needs.
Most words (38) were displayed graphically on the screen in part 1 of the examination; a single illustration accompanied each word. Due to their difficulty and graphic ambiguity, the remaining words (13) and logotomes (12) were produced by the SLP in part 2 of the examination, while the speaker's task was to repeat the phrase. The word order was the same in all measurements. Since four words contain two sibilants each (książka—book, strażak—firefighter, ciastka—cookies, sadzawka—pond), the total number of unique occurrences of sibilants in the speech corpus is 67.
There were five more types of logotomes with vowels only in the speech corpus (Table 2, bottom section). They can be used as an additional, sibilant-free resource for the articulation assessment.
Multimodal data acquisition device
We collected the data using a dedicated, self-designed multimodal data acquisition device (MDAD, Fig. 3). It was invented and redesigned multiple times with the milestone versions described and validated in34,35 (Fig. 3a). The only adjustments introduced in the most recent device were the construction updates, e.g., to reduce weight or improve visual user-friendliness (Fig. 3b). The equipment records the audio signal from 15 spatially distributed channels (a semicylindrical microphone array) and captures the video data using a dual-camera stereovision module (Fig. 3c).
Fig. 3 Multimodal data acquisition device: (a) closed construction, prototype from35; (b) open construction, recent version; (c) inside view of the measuring part; red numbers denote the microphone (audio channel) numbers, "LC" and "RC" indicate the left and right camera, respectively; (d) sample dual-camera view; the picture comes from horizontal concatenation of the left and right camera frames. The individuals in the pictures or their legal guardians consented to publishing their image in the manuscript.
The MDAD comprises a 5 V-powered central unit and three recording arcs (Fig. 3c). Each arc uses five electret Panasonic WM-61a microphones with omnidirectional characteristics36. Fifteen audio signals are recorded at a 44.1 kHz sampling rate and synchronized in time in a semicylindrical 3 × 5 array with ca. 5 cm spacing between adjacent microphones. A pair of Arducam 8MP 1080P Auto Focus cameras37 installed between the two bottom arcs is used to produce the stereovision stream. Both cameras capture an unobstructed view of the articulators during speech production from a distance of ca. 15 cm at 30 frames per second with a 480 × 640 resolution each (Fig. 3d). We added LED lighting to illuminate the speaker's oral area and improve the quality of the image data. The main technical parameters are given in Table 3. The software for data recording was developed in Matlab38.
The construction elements were mostly printed in 3D, and the structure resembling a bicycle helmet was adapted to the preschool children’s characteristics and limitations. The MDAD was made more subject-friendly with additional elements that were visually attractive to the child, e.g., artificial rabbit ears or a plume (Fig. 3b).
Before each recording session, the device was placed safely and comfortably on the speaker's head and, if needed, repositioned by the operator using its mobile part to adjust the distance from the sound source to the sensors. We prepared a dedicated adjustment interface to make interspeaker and intraspeaker data acquisition as repeatable as possible. Despite sufficient mobility, the MDAD remains mechanically stable with respect to the sound source and the scene during measurements. We used two versions of the MDAD to record the data: ver. 1 (closed construction, Fig. 3a) for the first 53 speakers and ver. 2 (open construction, Fig. 3b) for the remaining 148.
Data preprocessing
The AV data shared in our dataset was prepared through a sequence of preprocessing operations. First, the 15-channel audio recording and the dual-camera video stream were synchronized in time, with both video frames concatenated horizontally (Fig. 3d). Then, we manually segmented the audio data into words, logotomes, and phones (employing an inventory of 37 phonemes39). The segmentation was prepared in Audacity40 based on the time series and spectrogram representations of the central microphone signal. No normalization or other audio data processing took place. Based on the audio segmentation results, we trimmed the AV stream of each segment in time so that the numbers of video frames and audio samples matched, given both sampling rates.
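For orientation, the frame-to-sample alignment follows from the two sampling rates alone: at 44.1 kHz audio and 30 fps video, one video frame corresponds to 1,470 audio samples. A minimal Python sketch of such trimming is given below; the function and its rounding policy are illustrative assumptions, not the published preprocessing code.

AUDIO_FS = 44100   # audio sampling rate, Hz
VIDEO_FPS = 30     # video frame rate, frames per second

def aligned_lengths(n_audio_samples: int) -> tuple[int, int]:
    """Trim a segment so that its audio and video lengths describe the same duration.

    Returns (n_frames, n_samples): the number of whole video frames fitting into
    the audio segment and the number of audio samples covering exactly those
    frames (1,470 samples per frame at 44.1 kHz / 30 fps).
    """
    samples_per_frame = AUDIO_FS // VIDEO_FPS           # 1470
    n_frames = n_audio_samples // samples_per_frame     # whole frames only
    n_samples = n_frames * samples_per_frame
    return n_frames, n_samples

# Example: a 0.53 s sibilant segment
print(aligned_lengths(int(0.53 * AUDIO_FS)))            # (15, 22050)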
Finally, we applied additional cropping to the video frames to limit the scene size for improved anonymity (Fig. 4). The procedure involved the determination of the speaker- and recording-wise vertical and horizontal mouth midlines and cropping with fixed margins to show a limited oral area. The midlines were obtained for each frame using a YOLO v6 object detector trained to recognize the lips region of interest (ROI)41,42. Then, we determined the midline valid for a particular participant and examination/recording part as the median across all related frames. Finally, all frames were cropped with the fixed limits shown in Fig. 4. Therefore, the output frame size is 240 × 640.
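The vertical part of the cropping step can be illustrated with a short sketch. The lip-ROI centers are assumed to be given (e.g., by a YOLO detector), and the margin is chosen only so that the output height matches the 240-pixel frames in the dataset; this is an illustration, not the authors' implementation.

import numpy as np

def crop_oral_region(frames, mouth_centers, half_height=120):
    """Crop all frames of one recording around the recording-wise median mouth row.

    frames: list of concatenated dual-camera frames, 480 pixels high.
    mouth_centers: list of (row, col) lip-ROI centers per frame, e.g., from a YOLO detector.
    Returns the frames cropped vertically to 2 * half_height rows (240 by default).
    """
    rows = np.array([c[0] for c in mouth_centers])
    mid_row = int(np.median(rows))                          # recording-wise vertical midline
    top = int(np.clip(mid_row - half_height, 0, 480 - 2 * half_height))
    return [f[top:top + 2 * half_height] for f in frames]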
We published the AV data of each segment in two forms. First, the uncompressed 15-channel audio stream was stored in the WAV format. Second, we prepared a video stream in the MP4 format, using an H.264/MPEG-4 AVC encoder43 with a high-quality constant rate factor (CRF) of 18. The latter representation is treated as video data, but we added a synchronized audio track for easier database browsing. The single-channel audio comes from the central microphone (channel #8), compressed using the advanced audio coding (AAC) standard at a 192 kbps bitrate.
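For readers who want to produce a comparable AV representation themselves, the settings above map onto standard ffmpeg options. The sketch below builds such a command from Python; it is our reconstruction with placeholder file names, not the authors' processing pipeline.

import subprocess

def encode_segment(video_in: str, central_channel_wav: str, mp4_out: str) -> None:
    """Encode one segment as H.264 video (CRF 18) with a 192 kbps AAC audio track."""
    cmd = [
        "ffmpeg", "-y",
        "-i", video_in,                 # cropped dual-camera frames
        "-i", central_channel_wav,      # single-channel audio (microphone #8)
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        mp4_out,
    ]
    subprocess.run(cmd, check=True)

# encode_segment("zaba_frames.avi", "zaba_ch08.wav", "zaba.mp4")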
Speech examination questionnaire
Each participant's articulation was examined in detail using a dedicated diagnostic questionnaire prepared by our SLT team. The description addressed the child's speech production (especially sibilants) and anatomical and physiological issues (regarding the tongue frenulum, upper lip, palate, and teeth; e.g., swallowing, breathing, and tongue mobility). The experts also assessed the production of individual sibilants. The questionnaire consisted of 196 fields (including general data on the examination and descriptive elements), all filled in Polish. For this study, we combined the original items and obtained a concise subset of 95 fields, mostly of categorical type (Table 4). Note that the definitions of articulatory features are provided with the dataset in a PDF file (see section Dataset files).
The goal was to prepare two independent diagnoses by two SLPs. One was made by the expert attending the examination simultaneously with the recording session (see section Speech and language therapy assessment with a data recording session). The other was prepared by the second expert on another day, without data recording. We collected double diagnoses for 181 out of 201 participants and single diagnoses for the remaining 20 cases. The dataset contains 185 diagnoses from expert E1 and 197 from expert E2.
Data Records
Database structure
The PAVSig dataset is available under the following DOI44: https://doi.org/10.7910/DVN/IHZRGB. The structure of the main folder of the PAVSig repository is shown in Fig. 5a. There is a separate folder with the audio, video, and speech diagnosis data of each participant, named 00XXX (XXX stands for the anonymized three-digit ID of a participant; the lowest and highest folder names are 00030 and 00237), five CSV files with the respective dataset specifications, and a PDF file presenting the diagnosis dictionary.
All CSV files with dataset summaries use a semicolon as the delimiter and are encoded in UTF-8. Two types of special characters must be imported carefully: Polish letters with diacritics and IPA (International Phonetic Alphabet) symbols. Although the dataset is likely to be most valuable for Polish speech research, we also took care to present all resources in English.
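A minimal import example (assuming pandas) that respects the delimiter and encoding; the dataValidity filter refers to the field described later in Table 7.

import pandas as pd

# Semicolon-delimited, UTF-8 encoded summaries (Polish diacritics and IPA symbols)
participants = pd.read_csv("participantSummary.csv", sep=";", encoding="utf-8")
segments = pd.read_csv("segmentSummary.csv", sep=";", encoding="utf-8")

# Example: keep only the fully valid segments (dataValidity field, Table 7)
valid = segments[segments["dataValidity"] == 1.0]
print(len(participants), len(segments), len(valid))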
Participant folder
A complete participant folder contains two audio data subfolders, two video data subfolders, and a CSV file with a double-expert speech diagnosis (Fig. 5b). In the case of a missing recording of one of two parts, there are only single audio and video subfolders. The subfolder naming rule is: 00XXX-R-audio or 00XXX-R-video, where R = 1 or 2 refers to the first or second part of the recording session. The participant diagnosis from one or two SLPs is stored in a CSV file named 00XXX-diag.csv.
Audio data
Each audio data folder contains a complete set of WAV files with audio segments extracted from the recording of the corresponding part of the session (Fig. 5c). The segments contain either words or phones within words. Each WAV file stores an uncompressed 15-channel audio stream recorded at 16 bits and 44.1 kHz in a setup described in the Methods section (the order of channels in the WAV file corresponds to the arrangement shown in Fig. 3c).
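A short sketch of reading one multichannel segment with the soundfile package; the path is illustrative, and selecting index 7 for the central microphone #8 assumes that all 15 channels are present in the order shown in Fig. 3c.

import soundfile as sf

# Illustrative path: participant 00030, part 1, word segment "zaba"
data, fs = sf.read("00030/00030-1-audio/zaba.wav")    # data shape: (n_samples, n_channels)
print(fs, data.shape)                                  # 44100 and typically 15 channels

central = data[:, 7]                                   # microphone #8 (central), zero-based index 7
print(f"duration: {len(central) / fs:.3f} s")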
The file nomenclature protocol is as follows:
- The file name for a word or logotome is <word>.wav (word is the Polish word with diacritics and spaces removed).
- The file name for a phoneme p within a word parentWord is <parentWord_p>.wav. There are some special cases here:
– To stay within the Latin alphabet, sibilants are written as they sound in Polish. That makes p a one-, two-, or even three-letter pattern (e.g., see ci denoting /t͡ɕ/ in Fig. 5c, or zi and drz instead of /ʑ/ and /d͡ʐ/, respectively, in Fig. 5d). For more details, see Table 6, field sibilant.
– If a word contains more than one phoneme of a certain type, a counter value follows p in the second and any subsequent occurrences (see zaba_a.wav and zaba_a2.wav in Fig. 5c).
– Some speakers produced the same word twice. In such cases, the second occurrence is indicated by adding 2 after the word name, e.g., owoce2.wav (note that the second “o” here is stored in a file named owoce2_o2.wav).
– Finally, some words were produced in a different form, e.g., "dzwon" instead of "dzwonek" or "siatkówka" instead of "siatka". In such cases, word and parentWord in the file name are always the correct form consistent with Table 2, although all existing phonemes are stored (e.g., "siatkówka" produces a word segment siatka.wav and the phoneme segments siatka_si.wav, siatka_a.wav, siatka_t.wav, siatka_k.wav, siatka_u.wav, siatka_f.wav, siatka_k2.wav, siatka_a2.wav).
The word change is indicated in the dataset through a mechanism described in the Data Validation section.
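Given these rules, segment file names can be decomposed programmatically. The sketch below is a hedged illustration: the regular expression is ours and covers only the cases described in this section.

import re

# <word>.wav or <word>_<phoneme>[counter].wav, with an optional trailing repetition
# digit on the word itself (e.g., owoce2_o2.wav for the second production of "owoce")
SEGMENT_RE = re.compile(r"^(?P<word>[a-z]+?)(?P<rep>2)?(?:_(?P<phoneme>[a-z]+?)(?P<occ>\d)?)?\.wav$")

def parse_segment(name: str) -> dict:
    m = SEGMENT_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "word": m.group("word"),
        "repetition": 2 if m.group("rep") else 1,
        "phoneme": m.group("phoneme"),              # None for word/logotome segments
        "occurrence": int(m.group("occ") or 1),
    }

print(parse_segment("zaba_a2.wav"))     # {'word': 'zaba', 'repetition': 1, 'phoneme': 'a', 'occurrence': 2}
print(parse_segment("owoce2_o2.wav"))   # {'word': 'owoce', 'repetition': 2, 'phoneme': 'o', 'occurrence': 2}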
Moreover, to provide data that is easily applicable in standard speech analysis software (Praat45), we added a separate text file for each word, containing its segmentation and annotation data in the TextGrid format46. The file naming rule is as follows: annotations of the word/logotome in the audio file <word>.wav are included in the text file <word>.txt. In this case, the phoneme labels are transcribed using IPA.
Video data
The video data folder has the same number of MP4 files as the audio folder, with the same nomenclature rules and identical file names (Fig. 5d). Each file stores a single audiovisual segment (word or phoneme) of the recording: a dual-camera view cropped to a 240 × 640 size, as described in the Methods section, with a single-channel audio signal from the central microphone #8.
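A short sketch of reading one segment's MP4 with OpenCV and splitting each 240 × 640 frame into its left- and right-camera halves; the file path is illustrative.

import cv2

cap = cv2.VideoCapture("00030/00030-1-video/zaba.mp4")
left_frames, right_frames = [], []
while True:
    ok, frame = cap.read()                  # frame shape: (240, 640, 3), BGR
    if not ok:
        break
    left_frames.append(frame[:, :320])      # left-camera half of the concatenated view
    right_frames.append(frame[:, 320:])     # right-camera half
cap.release()
print(len(left_frames), "frames per camera")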
Participant diagnosis
The participant diagnosis file 00XXX-diag.csv includes six columns: three with the diagnosis in Polish and three with the English translation. In either triplet, the first column contains the questionnaire field name (e.g., the name of articulation or phonetic cue), and the next two store the responses from SLPs E1 and E2. If one of the experts did not examine the child, the corresponding column is empty. Note that the content of the participant diagnosis CSV file is a portion of data extracted from the complete diagnosis dataset described in the Diagnosis summary section.
Dataset files
Participant summary
The participant dataset participantSummary.csv gathers the anonymized data of the children participating in the study. The dataset fields (columns) are specified in Table 5. The articulation field is an attempt to assess each participant's articulation with a single, simplified label. For each SLT assessment, we took the three most significant features per sibilant (place of articulation, manner of articulation, and voicing). A single SLT diagnosis yielded the typical label if all features were assessed as typical, and atypical otherwise. Overall, the participant's articulation was classified as typical if no SLT assessment detected distorted pronunciation. However, this field should be treated with caution, as articulation diagnosis is complex, and some deviations from the target norm are natural at certain stages of articulation development. We recommend that dataset users review the individual features and select a categorization scheme suited to their research needs.
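The labeling logic can be summarized in a few lines of Python; the dictionary layout and the "typical"/"atypical" strings below are illustrative placeholders rather than the exact encodings used in the diagnosis files.

def simplified_label(assessments: list) -> str:
    """Collapse one participant's SLT assessments into 'typical'/'atypical'.

    Each assessment maps every sibilant to its three key features (place of
    articulation, manner of articulation, voicing), each marked 'typical' or
    'atypical' by the expert.
    """
    for assessment in assessments:              # one dict per available expert
        for features in assessment.values():    # one entry per sibilant
            if any(value != "typical" for value in features.values()):
                return "atypical"
    return "typical"

example = [{"sz": {"place": "typical", "manner": "atypical", "voicing": "typical"}}]
print(simplified_label(example))                # atypical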
Sibilant summary
The sibilant dataset sibilantSummary.csv specifies all sibilants in the speech corpus. The data dictionary is given in Table 6 (also refer to Table 2).
Segment summary
The segmentSummary.csv file describes all AV segments available in the dataset: words, logotomes, and phonemes. Each entry in the segment summary has a corresponding WAV file in the audio subfolder and an MP4 file in the video subfolder. Table 7 presents the segment data dictionary.
Diagnosis summary
Two files store the complete set of SLT annotations: diagnosisSummaryPL.csv with the original diagnoses in Polish and diagnosisSummaryEN.csv with the English translation. In either case, the 95 SLT examination questionnaire fields (e.g., the participant or expert ID, or the name of the articulation cue) are organized in columns, with the field name in the first row. The entries (rows) are sorted in ascending order of participant ID and then expert ID. The participant diagnosis placed in the participant folder is obtained by extracting the appropriate subarrays (the field names and the participant-related rows) from the Polish and English datasets, transposing them into 95 × 3 tables, and concatenating them horizontally.
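The per-participant files can thus be reproduced from the summary tables. The pandas sketch below uses placeholder column labels (participant_id, expert_id) for the actual ID fields of the summary, so it illustrates the described mechanism rather than serving as a drop-in script.

import pandas as pd

diag = pd.read_csv("diagnosisSummaryEN.csv", sep=";", encoding="utf-8")

def participant_diagnosis(diag: pd.DataFrame, participant_id: int) -> pd.DataFrame:
    """Rebuild the English half of 00XXX-diag.csv for one participant."""
    rows = diag[diag["participant_id"] == participant_id].sort_values("expert_id")
    table = rows.transpose()                    # questionnaire fields become rows
    table.insert(0, "field", table.index)       # first column: the field name
    return table.reset_index(drop=True)

# participant_diagnosis(diag, 30)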
The diagnosisDictionary.pdf file contains the definitions of articulation-related fields of the speech examination questionnaire.
Technical Validation
Equipment validation
A thorough technical validation of the audio data acquisition component of the MDAD was performed and presented in detail in34. Below, we briefly report the experiments and results. Since then, we have redesigned the MDAD in terms of construction and usability; however, the microphones, their arrangement, and the processing hardware have remained the same.
1. In the first experiment, we tested the MDAD using synthetic signals in accordance with the Polish standard PN-EN ISO 374647, which specifies acoustic measurements of the sound level (SL) in conditions close to the free field. The experiments were performed in an acoustically adapted room whose noise rating (NR) was in the NR 25–30 range, acceptable for recording studios. Based on measurements of the SL and the signal-to-noise ratio (SNR), we found that all 15 microphones record the signal consistently across tone frequencies between 1 and 8 kHz. Depending on the tone, the mean SNR was between 61.3 and 65.7 dB, comfortably acceptable for medium-class recording equipment48,49 (see the SNR sketch after this list).
2. In the second experiment, we verified the MDAD's ability to detect abnormal air outflow during articulation50,51. For this purpose, a human speaker simulated various air blows (central, left, and right outflow, each repeated three times) in ten attempts. The energy distributions indicated the appropriate reactions of the sensors to the directional acoustic stimuli.
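For reference, the SNR values quoted in item 1 follow the standard definition based on RMS levels. The minimal sketch below uses a synthetic 1 kHz tone and an illustrative noise floor, not the actual measurement data.

import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB from the RMS levels of a recorded tone and the background noise."""
    def rms(x: np.ndarray) -> float:
        return float(np.sqrt(np.mean(np.square(x.astype(float)))))
    return 20.0 * np.log10(rms(signal) / rms(noise))

fs, f0 = 44100, 1000                            # 1 kHz test tone, 1 s at 44.1 kHz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * f0 * t)
noise = 5e-4 * np.random.randn(fs)              # illustrative noise floor
print(round(snr_db(tone + noise, noise), 1))    # ca. 63 dB for these illustrative levels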
The dual-camera video recording system was added to the MDAD and originally presented in35. In that study, we demonstrated the ability to support repeatable interspeaker and intraspeaker data acquisition by adjusting the mask position on a subject's head through a dedicated visualization interface. We superimposed reference lines on the camera images to help the operator reliably place the MDAD on the speaker's head and to align the stereovision viewpoints with characteristic points of the face, e.g., the philtrum. We also estimated the extrinsic and intrinsic geometric parameters of the stereo system for calibration purposes by finding the geometrical relationship between the two cameras observing the same point52,53. We used a chessboard template with known dimensions and geometry to calibrate the individual cameras separately. Then, we determined the translation and rotation matrices between the cameras, yielding a mean calibration error of 0.39 pixels54.
Data completeness remarks
During the PAVSig collection, we faced some issues with data completeness, leading to the exclusion of participants or missing data; some of these have already been mentioned in the paper. This section explains these categories.
Unsuitable participants or missing recording session. Of 208 participants under consideration, four were excluded due to the following reasons:
- Two speakers were of non-Polish origin (Ukrainian) and thus non-native Polish speakers.
- One child's recordings and examination were unreliable and disrupted by the presence of drains in their ears.
- One child was examined by an SLP but was later unavailable to participate in the examination with the recording session. Thus, they were excluded from the study with nothing more than a single diagnostic questionnaire.
Data acquisition failures. In rare cases, we experienced technical problems with the data acquisition. These concerned both audio (unacceptable noise, missing samples) and video (missing frames, synchronization issues) and led to the exclusion of three participants, producing the final research sample size of 201. Moreover, there are four speakers with one part of the recording unavailable because of such failures (three in part 1 and one in part 2).
Incomplete audio data. For 21 speakers, especially in the early stage of the study involving the first version of the device, there was an issue with a part of the multichannel audio stream. Due to technical reasons, the signal from microphones #1–5 (top recording arc) was damaged and unavailable. The remaining ten channels (#6–15, including the central channel #8) are complete. We indicated these cases in the recording_1 and recording_2 fields of the participant summary (Table 5).
Missing diagnoses. Due to organizational reasons, we were not always able to perform the second SLT examination. The total number of missing diagnoses is 20, but each child has at least one questionnaire (see section Speech examination questionnaire).
Data validation
We performed an extensive review and validation of the recorded and processed data. The validation covered a manual assessment of ca. 13k word and logotome segments, performed after all the exclusions and adjustments to the research sample and participant-related data described in the Data completeness remarks section.
The expected number of word/logotome segments was 13,668: 201 participants × (38 words in part 1 + 30 words and logotomes in part 2). With three speakers missing the part 1 recording and one missing the part 2 recording, this number was reduced by 144 (3 × 38 + 1 × 30) to 13,524. A portion of segments (718, i.e., 3.5 per speaker on average) was missing for several reasons:
- the children did not produce them at all, mostly in part 1, based on naming the pictures shown on the screen;
- the children used a synonym to name the picture (e.g., "mazaki"—"pisaki", "lekarz"—"pan doktor", or "sznurek"—"lina");
- speech was severely disturbed by noise or other sounds.
On the other hand, in 24 cases, the child produced the same word or logotome twice. Hence, the total number of word/logotome segments we share is 12,830 (13,524 − 718 + 24).
To make the data easy and reliable to use, we performed an extended validation of the available set of words and logotomes, primarily for sibilant analysis. We assigned one of three data validity levels (DVL) to each segment:
- DVL = 1.0 – the segment is considered correct and suitable for the analysis.
- DVL = 0.9 – the segment presents a slightly different word version or declension than required. However, the change does not affect the sibilant and its environment, so the segment is suitable for the analysis. Examples: "dzwonek"—"dzwon", "żaba"—"żabka", "ciastka"—"ciasto", "parasol"—"parasolka". Note that the change may affect the number of syllables or the word stress.
- DVL = 0.5 – the segment presents a significantly different word version or declension that affects the sibilant environment. Examples: "pies"—"piesek", "jeże"—"jeż", "strażak"—"straż", "kaczka"—"kaczuszka". All words and logotomes modified in any other way also fall into this category.
Table 8 presents detailed statistics of our dataset regarding the total number of specific segments—words, logotomes, and phonemes—together with the DVL distributions. Note that a phoneme's DVL is always inherited from its parent word or logotome. There are ca. 2% and 1% of segments with minor and major issues, respectively, so 97% of segments are considered correct. The DVLs are stored in the segment dataset under the dataValidity field (Table 7).
There is a total of 66,781 segments in PAVSig (12,830 words and logotomes, 53,951 phonemes). The number of sibilant occurrences varies between 593 in “zi” (3.0 per speaker) and 2,364 in “sz” (11.8 per speaker). The total number of sibilants is 12,576 (62.6 per speaker).
Usage Notes
The dataset is available under the Data Use Agreement (DUA) with data access requirements given in the repository44.
Code availability
No custom code was used to generate or process the data described in the manuscript.
References
Demenko, G., Wagner, A. & Cylwik, N. The use of speech technology in foreign language pronunciation training. Archives of Acoustics 35, 309–330 (2010).
Minczakiewicz, E. Dyslalia in the Context of Other Speech Defects and Disorders in Preschool and School Children, (PL) Dyslalia na tle innych wad i zaburzeń mowy u dzieci w wieku przedszkolnym i szkolnym. Konteksty pedagogiczne 1, 149–169 (2017).
Gacka, E. & Kaźmierczak, M. Speech screening examinations as an example of activity in the field of speech-language therapy. Logopaedica Lodziensia 1, 31–42, https://doi.org/10.18778/2544-7238.01.04 (2017).
Smit, A. B. Phonologic error distributions in the Iowa-Nebraska articulation norms project. Journal of Speech, Language, and Hearing Research 36, 533–547, https://doi.org/10.1044/jshr.3603.533 (1993).
Lockenvitz, S., Tetnowski, J. A. & Oxley, J. The sociolinguistics of lisping: a review. Clinical Linguistics & Phonetics 34, 1169–1184, https://doi.org/10.1080/02699206.2020.1788167 (2020).
Amr Rey, O., Sánchez Delgado, P., Salvador Palmer, R., Ortiz De Anda, M. C. & Paredes Gallardo, V. Exploratory study on the prevalence of speech sound disorders in a group of Valencian school students belonging to 3rd grade of infant school and 1st grade of primary school. Psicologia educativa 28, 195–207, https://doi.org/10.5093/PSED2022A1 (2022).
Grigorova, E., Ristovska, G. & Jordanova, N. P. Prevalence of phonological articulation disorders in preschool children in the city of Skopje. PRILOZI 41, 31–37, https://doi.org/10.2478/prilozi-2020-0043 (2020).
Van Borsel, J., Van Rentergem, S. & Verhaeghe, L. The prevalence of lisping in young adults. Journal of Communication Disorders 40, 493–502, https://doi.org/10.1016/j.jcomdis.2006.12.001 (2007).
Harrison, J. E., Weber, S., Jakob, R. & Chute, C. G. ICD-11: an international classification of diseases for the twenty-first century. BMC Medical Informatics and Decision Making 21, https://doi.org/10.1186/s12911-021-01534-6 (2021).
Benselam, Z., Guerti, M. & Bencherif, M. Arabic Speech Pathology Therapy Computer-Aided System. Journal of Computer Science 3, 685–692 (2007).
Valentini-Botinhao, C. et al. Automatic detection of sigmatism in children. In Proc. WOCCI 2012 - Workshop on Child, Computer and Interaction, 1–4 (Portland, USA, 2012).
Anjos, I. et al. A model for sibilant distortion detection in children. In DMIP ’18 (2018).
Żygis, M. & Padgett, J. A perceptual study of polish fricatives, and its implications for historical sound change. Journal of Phonetics 38, 207–226, https://doi.org/10.1016/j.wocn.2009.10.003 (2010).
Bickford, A. C. & Floyd, R. Articulatory Phonetics: Tools for Analyzing the World’s Languages (SIL International, Dallas, 2006).
Zsiga, E. C. The Sounds of Language: An Introduction to Phonetics and Phonology (Wiley-Blackwell, 2012).
Miodonska, Z., Badura, P. & Mocko, N. Noise-based acoustic features of Polish retroflex fricatives in children with normal pronunciation and speech disorder. Journal of Phonetics 92, 101149, https://doi.org/10.1016/j.wocn.2022.101149 (2022).
Krecichwost, M., Mocko, N. & Badura, P. Automated detection of sigmatism using deep learning applied to multichannel speech signal. Biomedical Signal Processing and Control 68, 102612, https://doi.org/10.1016/j.bspc.2021.102612 (2021).
Katz, W., Mehta, S., Wood, M. & Wang, J. Using electromagnetic articulography with a tongue lateral sensor to discriminate manner of articulation. The Journal of the Acoustical Society of America 141, 57–63, https://doi.org/10.1121/1.4973907 (2017).
Kroos, C. Evaluation of the measurement precision in three-dimensional electromagnetic articulography (Carstens AG500). Journal of Phonetics 40, 453–465, https://doi.org/10.1016/j.wocn.2012.03.002 (2012).
Wood, S., Wishart, J., Hardcastle, W., Cleland, J. & Timmins, C. The use of electropalatography (EPG) in the assessment and treatment of motor speech disorders in children with Down’s syndrome: Evidence from two case studies. Developmental Neurorehabilitation 12, 66–75, https://doi.org/10.1080/17518420902738193 (2009).
Cleland, J., Timmins, C., Wood, S., Hardcastle, W. & Wishart, J. Electropalatographic therapy for children and young people with Down’s syndrome. Clinical Linguistics & Phonetics 23, 926–939, https://doi.org/10.3109/02699200903061776 (2009).
Bilibajkić, R., Vojnović, M. & Šarić, Z. Detection of lateral sigmatism using support vector machine. Speech and Language 2019, 322–328 (2019).
Król, D., Lorenc, A. & Święciński, R. Detecting Laterality and Nasality in Speech with the use of a Multi-channel Recorder. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’15, 5147–5151, https://doi.org/10.1109/ICASSP.2015.7178952 (2015).
Lorenc, A., Król, D. & Klessa, K. An acoustic camera approach to studying nasality in speech: The case of polish nasalized vowels. The Journal of the Acoustical Society of America 144, 3603–3617, https://doi.org/10.1121/1.5084038 (2018).
Wei, S., Hu, G., Hu, Y. & Wang, R.-H. A new method for mispronunciation detection using support vector machine based on pronunciation space models. Speech Communication 51, 896–905, https://doi.org/10.1016/j.specom.2009.03.004 (2009).
Bílková, Z. et al. Automatic evaluation of speech therapy exercises based on image data. In Lecture Notes in Computer Science, 397–404, https://doi.org/10.1007/978-3-030-27202-9_36 (Springer International Publishing, 2019).
Bilkova, Z. et al. ASSISLT: Computer-aided speech therapy tool. In 2022 30th European Signal Processing Conference (EUSIPCO), 598–602, https://doi.org/10.23919/eusipco55093.2022.9909627 (IEEE, 2022).
Sage, A. et al. Deep learning approach to automated segmentation of tongue in camera images for computer-aided speech diagnosis. In Advances in Intelligent Systems and Computing, 41–51, https://doi.org/10.1007/978-3-030-49666-1_4 (Springer International Publishing, 2020).
Chotikkakamthorn, K., Ritthipravat, P., Kusakunniran, W., Tuakta, P. & Benjapornlert, P. A lightweight deep learning approach to mouth segmentation in color images. Applied Computing and Informatics, https://doi.org/10.1108/aci-08-2022-0225 (2022).
Miled, M., Messaoud, M. A. B. & Bouzid, A. Lip reading of words with lip segmentation and deep learning. Multimedia Tools and Applications 82, 551–571, https://doi.org/10.1007/s11042-022-13321-0 (2022).
Milewski, S. Phoneme frequency in preschool children’s spoken texts, (PL) Frekwencja fonemów w tekstach mówionych dzieci w wieku przedszkolnym. Logopedia 24, 67–83 (1997).
Milewski, S. & Binkuńska, E. The phonostatistic and phonotactic structure of selected Polish difficult word sequences (tongue twisters), (PL) Struktura fonostatystyczno-fonotaktyczna wybranych polskich słownych ciagów trudnych (lingwołamek). Prace Jezykoznawcze 26, 143–160, https://doi.org/10.31648/pj.10593 (2024).
Krajna, E. & Bryndal, M. 100-Word Articulation Test: Auditory Analysis of Recordings and Attempt at Test Normalization, (PL) 100-wyrazowy test artykulacyjny. Analiza słuchowa nagrań i próba normalizacji testu. Audiofonologia XIV, 137–174 (1999).
Krecichwost, M., Miodonska, Z., Trzaskalik, J. & Badura, P. Multichannel speech acquisition and analysis for computer-aided sigmatism diagnosis in children. IEEE Access 8, 98647–98658, https://doi.org/10.1109/ACCESS.2020.2996413 (2020).
Krecichwost, M., Sage, A., Miodonska, Z. & Badura, P. 4D multimodal speaker model for remote speech diagnosis. IEEE Access 10, 93187–93202, https://doi.org/10.1109/ACCESS.2022.3203572 (2022).
Panasonic. Omnidirectional Back Electret Condenser Microphone Cartridge, Series: WM-61A, WM-61B. https://www.alldatasheet.com/datasheet-pdf/pdf/528408/PANASONIC/WM-61A.html [accessed: 17-May-2024].
ArduCam. Arducam 8MP 1080P Auto Focus USB Camera Module with Microphone [accessed 20-March-2023].
The MathWorks Inc. MATLAB version: 9.13.0 (R2022b). https://www.mathworks.com (2022).
Jassem, W. Polish. Journal of the International Phonetic Association 33, 103–107, https://doi.org/10.1017/S0025100303001191 (2003).
Audacity Team. Audacity (R): Free Audio Editor and Recorder. https://www.audacityteam.org/ (2023).
Sage, A. & Badura, P. Detection and segmentation of mouth region in stereo stream using YOLOv6 and DeepLab v3+ models for computer-aided speech diagnosis in children. Applied Sciences 14, https://doi.org/10.3390/app14167146 (2024).
Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv:2209.02976, https://doi.org/10.48550/ARXIV.2209.02976 (2022).
International Telecommunication Union Telecommunication Standardization Sector (ITU-T). H.264 : Advanced video coding for generic audiovisual services (2003).
Krecichwost, M. et al. PAVSig: Polish multichannel Audio-Visual child speech dataset with double-expert Sigmatism diagnosis, https://doi.org/10.7910/DVN/IHZRGB (2025).
Boersma, P. & Weenink, D. Praat: doing phonetics by computer. https://www.fon.hum.uva.nl/praat/. [accessed: 3-April-2025]
Boersma, P. & Weenink, D. TextGrid file formats. https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_formats.html [accessed: 3-April-2025].
Polish Committee for Standardization (PCS). Acoustics – Determination of sound power levels and sound energy levels of a noise source on the basis of sound pressure measurements – an indicative method using the surrounding measuring surface above the reflecting plane, (PL) Akustyka – Wyznaczanie poziomów mocy akustycznej i poziomów energii akustycznej źródeł hałasu na podstawie pomiarów ciśnienia akustycznego – Metoda orientacyjna z zastosowaniem otaczającej powierzchni pomiarowej nad płaszczyzną odbijającą dźwięk. Standard, PN-EN ISO 3746:2011, PCS (2017).
Rosen, S. & Howell, P. Signals and Systems for Speech and Hearing (Brill, 2011).
Baken, R. & Orlikoff, R. Clinical Measurement of Speech and Voice. Speech Science (Singular Thomson Learning, 2000).
Klatt, D. H., Stevens, K. N. & Mead, J. Studies of articulatory activity and airflow during speech. Annals of the New York Academy of Sciences 155, 42–55, https://doi.org/10.1111/j.1749-6632.1968.tb56748.x (1968).
Lorenc, A., Żygis, M., Łukasz, M., Pape, D. & Sóskuthy, M. Articulatory and acoustic variation in Polish palatalised retroflexes compared with plain ones. Journal of Phonetics 96, 101181, https://doi.org/10.1016/j.wocn.2022.101181 (2023).
Zhang, Z. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1330–1334, https://doi.org/10.1109/34.888718 (2000).
Heikkila, J. & Silven, O. A four-step camera calibration procedure with implicit image correction. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1106–1112, https://doi.org/10.1109/CVPR.1997.609468 (1997).
Bier, A. & Luchowski, L. Error analysis of stereo calibration and reconstruction. In Gagalowicz, A. & Philips, W. (eds.) Computer Vision/Computer Graphics Collaboration Techniques, 230–241, https://doi.org/10.1007/978-3-642-01811-4_21 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2009).
Acknowledgements
This work was supported by the National Science Centre, Poland, research project No. 2018/30/E/ST7/00525: “Hybrid System for Acquisition and Processing of Multimodal Signal in the Analysis of Sigmatism in Children”, partially by the Foundation for Polish Science (FNP), partially by the Polish Ministry of Science, Poland, statutory financial support No. 07/010/BK_25/1047, and partially by the European Funds for Silesia 2021-2027 Program co-financed by the Just Transition Fund Project “Development of the Silesian Biomedical Engineering Potential in the Face of the Challenges of the Digital and Green Economy (BioMeDiG)” under Grant FESL.10.25-IZ.01-07G5/23. The authors want to thank Natalia Moćko, PhD, Marcin Biesok, and Wojciech Galiński for their valuable help in audio data analysis.
Author information
Contributions
All authors are justifiably credited with authorship. The detailed contribution is as follows: M.K., Z.M., J.T., and P.B. conceived and designed the analysis, M.K. designed, produced, and validated the equipment, M.K., Z.M., A.S., J.T., E.K., and P.B. collected the data, M.K., Z.M., A.S., and P.B. contributed data or analysis tools, M.K. and P.B. designed and prepared the database, J.T. and E.K. provided SLT assessments, M.K., Z.M., A.S., and P.B. wrote the paper. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Krecichwost, M., Miodonska, Z., Sage, A. et al. Polish multichannel audio-visual child speech dataset with double-expert sigmatism diagnosis. Sci Data 12, 1612 (2025). https://doi.org/10.1038/s41597-025-05896-8