Background & Summary

Computer-aided speech diagnosis (CASD) and computer-aided speech therapy (CAST) form an emerging field built on modern data acquisition and processing capabilities. It is strongly tied to particular languages, as it relies on acoustics and articulation, which differ significantly across language groups1. The distinction also concerns the speaker’s age, as therapy can involve children, adolescents, or adults. Nonetheless, the earlier the diagnosis occurs, the more effective the therapy can be. Therefore, it is essential to support the speech and language therapist/pathologist (SLP) in treating even very young children2. Sigmatism (lisp) is one of the most common types of speech sound disorders3,4,5,6,7,8,9,10,11,12. It refers to an incorrect articulation of sibilant sounds, which differ across languages. In Polish, there are 12 sibilants: four denti-alveolars /s/, /z/, /t͡s/, /d͡z/; four retroflexes /ʂ/, /ʐ/, /t͡ʂ/, /d͡ʐ/; and four alveolo-palatals /ɕ/, /ʑ/, /t͡ɕ/, /d͡ʑ/13.

Many detailed features can describe articulation, e.g., the manner of articulation, active and passive articulators, airflow direction, or voicing14,15,16,17. Measurement methods that systematize and objectify the assessment of speech production are still not well defined. CASD tools can support therapy by providing additional information to the therapist or by improving speech screening tests in schools and kindergartens. Finally, automated articulation analysis can be implemented in applications for speech exercises performed autonomously at home between speech and language therapy (SLT) sessions.

To be applicable in practice, CASD systems have to be based on data that are reliable and easily recordable without disrupting natural articulation. That excludes many specialized invasive systems used in articulation research, like electromagnetic articulography (EMA)18,19 or electropalatography (EPG)20,21. On the other hand, acoustic analysis has been performed in this area for years11,16,17,22,23,24,25. Another, less common approach involves video recording of the speaker’s face to monitor the appearance of the articulatory organs (articulators): tongue, lips, or teeth26,27,28,29,30. Both audio and video can be recorded with everyday devices that offer sufficiently good quality. Access to annotated audiovisual (AV) data covering typical and distorted child speech is necessary to make the analysis reliable and repeatable and to enable training and validation of automated CASD models.

This study was a part of research project no. 2018/30/E/ST7/00525: “Hybrid System for Acquisition and Processing of Multimodal Signal in the Analysis of Sigmatism in Children”, financed by the National Science Centre, Poland, in 2019–2024. The project aimed to find relationships between articulation, acoustics, and the visual appearance of the articulators in different child speech patterns. A review of the literature and available solutions showed the need for an adequate dataset for the Polish language. Therefore, we prepared a detailed framework for the SLT examination with a data recording session, including multichannel spatial audio signals and a dual-camera stereovision stream of the speaker’s oral region (Fig. 1). As a result, we collected the extensive multimodal PAVSig (Polish Audio-Visual speech dataset for computer-aided diagnosis of Sigmatism) dataset of 201 children aged 4–8, along with the corresponding SLT diagnoses from two independent experts.

Fig. 1

Schematic overview of the study.

Methods

Research sample

Our interdisciplinary research team, including biomedical engineers and SLPs, performed the SLT examinations and data recording sessions in six kindergarten and school facilities in Mysłowice, Katowice, Ruda Śląska, and Zabrze, Poland, from October 2021 to June 2023. The research sample initially covered 208 children, but several factors (non-native Polish speakers, data acquisition failures, and others) limited it to 201 (107 girls and 94 boys) aged 4–8 (see Table 1). Including a child in the research sample required written consent from their parents or legal guardians to participate in the study and to share the data as described in this paper. The child also had to agree verbally to participate in the study. The exclusion criteria covered: (1) diagnosed disabilities, including hearing impairment, deafness, low vision, visual impairment, aphasia, autism spectrum disorder, and intellectual disability, and (2) a history of epilepsy. The study received a positive recommendation from the Bioethics Committee for Scientific Research at the Jerzy Kukuczka University of Physical Education in Katowice, Poland (Decision No. 3/2021).

Table 1 Specification of the research sample.

Speech and language therapy assessment with a data recording session

The assessment consisted of three stages, with two involving data recording:

  1.

    In the first part, a dedicated multimodal data acquisition device (MDAD) registered the child’s speech while naming pictures visible on the screen (Fig. 2a).

    Fig. 2

    Illustration of the SLT examination with a data recording session: (a) part 1: data recording while naming graphics visible on the screen; (b) part 2: data recording while repeating words or logotomes and undergoing SLT assessment; (c) part 3: SLT examination. The individuals in the pictures or their legal guardians consented to publishing their image in the manuscript.

  2.

    In the second part, the speaker was recorded while repeating selected words and one- or two-syllable logotomes following the SLP. This stage also involved various tongue movements, swallowing, or smiling (Fig. 2b).

  3.

    The third part was the SLT examination according to the dedicated diagnostic protocol for sigmatism-related speech assessment (Fig. 2c). It was performed by the SLP, and no data was recorded at this point.

Each session produced a record of multimodal data (15-channel spatial audio and a dual-camera video stream) and a completed diagnostic questionnaire for the case. At least two biomedical engineers and one SLP were present during each recording session. The second SLP prepared their independent diagnosis on another day, without data recording and with no access to the previous assessment outcomes.

Speech corpus

The sibilant-related linguistic material prepared and collected in the study consisted of 51 words and 12 one-syllable logotomes containing all Polish sibilants (see Table 2). The corpus organized isolated words with sibilants in different word positions: at the beginning, in the middle, and at the end of the word31,32 (the final position only for voiceless sibilants). Our intention was to use words in which the sibilants are surrounded by the vowel /a/ wherever possible. However, the priority was that the words be known and unambiguous to a preschool-age child and easy to represent graphically, as the child’s task was to name the object they saw in a picture. During the selection of the material, we encountered an imbalance in the presence of different sibilants among words suitable for picture naming. We reviewed picture tests available for Polish children and considered their use, but their sound distributions were similarly uneven. Some tests (e.g., from Krajna and Bryndal33) employed words featuring sibilants in different phonetic neighborhoods or with facultative pronunciation (e.g., jam, Polish: dżem, which may be produced with an affricate or asynchronously). Additionally, we included a set of words that did not follow the described selection criteria but had been used in our previous experiments; we consider them an added value to the core dictionary. The language material covered in the database may thus be used for different purposes and filtered according to the user’s needs.

Table 2 A set of words with highlighted sibilants.

Most words (38) were displayed graphically on the screen in part 1 of the examination; a single illustration accompanied each word. Due to their difficulty and graphic ambiguity, the remaining words (13) and logotomes (12) were produced by the SLP in part 2 of the examination, and the speaker’s task was to repeat the phrase. The word order was the same in all measurements. Since four words contain two sibilants each (the words for book, firefighter, cookies, and pond), the total number of unique sibilant occurrences in the speech corpus is 67.

The speech corpus also contains five more types of vowel-only logotomes (Table 2, bottom section). They can be used as an additional, sibilant-free resource for articulation assessment.

Multimodal data acquisition device

We collected the data using a dedicated, self-designed multimodal data acquisition device (MDAD, Fig. 3). It was designed and redesigned multiple times, with the milestone versions described and validated in34,35 (Fig. 3a). The only adjustments introduced in the most recent device were construction updates, e.g., to reduce weight or improve visual user-friendliness (Fig. 3b). The equipment records the audio signal from 15 spatially distributed channels (a semicylindrical microphone array) and captures video data using a dual-camera stereovision module (Fig. 3c).

Fig. 3

Multimodal data acquisition device: (a) closed construction, prototype from35; (b) open construction, recent version; (c) inside view to the measuring part; red numbers present the microphone (audio channel) numbers, “LC” and “RC” indicate the left and right camera, respectively; (d) sample dual-camera view; the picture comes from horizontal concatenation of the left and right camera frame. The individuals in the pictures or their legal guardians consented to publishing their image in the manuscript.

The MDAD comprises a 5 V-powered central unit and three recording arcs (Fig. 3c). Each arc uses five electret Panasonic WM-61a microphones with omnidirectional characteristics36. Fifteen audio signals are recorded at a 44.1 kHz sampling rate and synchronized in time in a semicylindrical 3 × 5 array with ca. 5 cm distances between adjacent microphones. A pair of Arducam 8MP 1080P Auto Focus cameras37 installed between the two bottom arcs produces the stereovision stream. Both capture an unobstructed view of the articulators during speech production from a distance of ca. 15 cm, at 30 frames per second and a 480 × 640 resolution each (Fig. 3d). We added LED lighting to illuminate the speaker’s oral area and improve the quality of the image data. The main technical parameters are given in Table 3. The software for data recording was developed in Matlab38.

Table 3 Technical parameters of the multimodal data acquisition device.

The construction elements were mostly 3D-printed, and the structure, resembling a bicycle helmet, was adapted to the characteristics and limitations of preschool children. The MDAD was made more subject-friendly with additional elements that were visually attractive to the child, e.g., artificial rabbit ears or a plume (Fig. 3b).

Before each recording session, the device was placed safely and comfortably on the speaker’s head and, if necessary, repositioned by the operator using its mobile part to adjust the distance from the sound source to the sensors. We prepared a dedicated adjustment interface to make interspeaker and intraspeaker data acquisition as repeatable as possible. Despite this mobility, the MDAD remains mechanically stable with respect to the sound source and the scene during measurements. We used two versions of the MDAD to record the data: ver. 1 (closed construction, Fig. 3a) for the first 53 speakers and ver. 2 (open construction, Fig. 3b) for the remaining 148.

Data preprocessing

The AV data shared in our dataset was prepared through a sequence of preprocessing operations. First, the 15-channel audio recording and the dual-camera video stream were synchronized in time, with both video frames concatenated horizontally (Fig. 3d). Then, we manually segmented the audio data into words, logotomes, and phones (employing an inventory of 37 phonemes39). The segmentation was prepared in Audacity40 based on the time series and spectrogram representations of the central microphone signal. No normalization or other audio processing took place. Based on the audio segmentation results, we trimmed the AV stream of each segment in time so that the numbers of video frames and audio samples match according to the two sampling rates.
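As an illustration of this alignment, the mapping between audio sample indices and video frame indices follows directly from the two sampling rates (a minimal sketch, not the authors’ processing code; the function name is ours):

```python
# Minimal sketch: converting an audio sample span to the matching video frame span,
# given the audio and video rates used in PAVSig.
AUDIO_FS = 44_100   # Hz, 15-channel audio
VIDEO_FPS = 30      # frames per second, dual-camera stream

def audio_span_to_frames(start_sample: int, end_sample: int) -> tuple[int, int]:
    """Map an audio segment [start_sample, end_sample) to video frame indices."""
    start_frame = int(start_sample / AUDIO_FS * VIDEO_FPS)
    end_frame = int(round(end_sample / AUDIO_FS * VIDEO_FPS))
    return start_frame, end_frame

# Example: a 0.5 s segment starting 2 s into the recording
print(audio_span_to_frames(2 * AUDIO_FS, int(2.5 * AUDIO_FS)))  # (60, 75)
```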

Finally, we applied additional cropping to the video frames to limit the scene size and improve anonymity (Fig. 4). The procedure determined the vertical and horizontal mouth midlines per speaker and recording and cropped the frames with fixed margins to show only a limited oral area. The midlines were first obtained for each frame using a YOLO v6 object detector trained to recognize the lips region of interest (ROI)41,42. Then, the midline valid for a particular participant and examination/recording part was determined as the median across all related frames. Finally, all frames were cropped with the fixed limits shown in Fig. 4, resulting in an output frame size of 240 × 640.
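A minimal sketch of this midline-and-margins cropping step is given below (the lips detector itself is omitted; the array layout and margin handling are illustrative assumptions, and only the 240-pixel output height follows from the text):

```python
import numpy as np

def crop_frames(frames: np.ndarray, mouth_centers: np.ndarray,
                out_height: int = 240) -> np.ndarray:
    """frames: (N, H, W, 3) video frames of one recording;
    mouth_centers: (N, 2) per-frame (row, col) mouth positions from a lips-ROI detector."""
    cy = int(np.median(mouth_centers[:, 0]))      # one midline per participant and recording
    top = max(0, cy - out_height // 2)
    return frames[:, top:top + out_height, :, :]  # fixed vertical margins around the mouth
```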

Fig. 4

Illustration of the video cropping procedure. The individuals in the pictures or their legal guardians consented to publishing their image in the manuscript.

We published the AV data of each segment in two forms. First, the uncompressed 15-channel audio stream was stored in WAV format. Second, we prepared a video stream in MP4 format using an H.264/MPEG-4 AVC encoder43 with a high-quality constant rate factor (CRF) of 18. The latter representation is treated as video data, but we added a synchronized audio track for easier database browsing. The single-channel audio comes from the central microphone (channel #8), compressed using the Advanced Audio Coding (AAC) standard at a 192 kbps bitrate.
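The encoding pipeline itself is not distributed with the dataset; purely as an illustration of the parameters listed above, a comparable re-encoding could be invoked from Python as follows (assuming ffmpeg is installed; the file names are placeholders):

```python
# Illustrative re-encoding call mirroring the published MP4 parameters:
# H.264 video at CRF 18 with an AAC audio track at 192 kbps.
import subprocess

def encode_segment(video_in: str, mono_audio_in: str, mp4_out: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,            # cropped dual-camera stream
        "-i", mono_audio_in,       # central microphone (channel #8)
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        mp4_out,
    ], check=True)
```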

Speech examination questionnaire

Each participant’s articulation was examined in detail using a dedicated diagnostic questionnaire prepared by our SLT team. The description addressed the child’s speech production (especially sibilants) as well as anatomical and physiological issues (regarding the tongue frenulum, upper lip, palate, and teeth, e.g., swallowing, breathing, and tongue mobility). The experts also assessed the production of individual sibilants. The questionnaire consisted of 196 fields (including general data on the examination and descriptive elements), all completed in Polish. For this study, we combined the original items and obtained a concise subset of 95 fields, mostly of categorical type (Table 4). Note that the definitions of the articulatory features are provided with the dataset in a PDF file (see section Dataset files).

Table 4 General specification of the speech examination questionnaire items.

The goal was to prepare two independent diagnoses by two SLPs. One was made by the expert attending the examination simultaneously with the recording session (see section Speech and language therapy assessment with a data recording session). The other was prepared by the second expert on another day, without data recording. We collected double diagnoses for 181 out of 201 participants and single diagnoses for the remaining 20 cases. The dataset contains 185 diagnoses from expert E1 and 197 from expert E2.

Data Records

Database structure

The PAVSig dataset is available under the following DOI44: https://doi.org/10.7910/DVN/IHZRGB. The structure of the main folder of the PAVSig repository is shown in Fig. 5a. There is a separate folder with the audio, video, and speech diagnosis data of each participant, named 00XXX (XXX stands for the anonymized three-digit ID of a participant; the folder names range from 00030 to 00237), along with five CSV files with the respective dataset specifications and a PDF file presenting the diagnosis dictionary.

Fig. 5

Illustration of the data repository structure at different levels of the folder tree: (a) main folder, (b) participant folder, (c) audio folder, (d) video folder.

All CSV files with dataset summaries use a semicolon as the delimiter and are encoded using the UTF-8 standard. Two types of special characters must be imported carefully: Polish letters with diacritics and IPA (International Phonetic Alphabet) symbols. Although the dataset is likely to be most valuable for Polish speech research, we also took care to present all resources in English.
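For example, the summary tables described in the Dataset files section can be loaded as follows (a short pandas sketch; only the delimiter and encoding come from the text above):

```python
# Reading the dataset summary tables (semicolon-delimited, UTF-8).
import pandas as pd

participants = pd.read_csv("participantSummary.csv", sep=";", encoding="utf-8")
segments = pd.read_csv("segmentSummary.csv", sep=";", encoding="utf-8")
print(participants.shape, segments.shape)
```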

Participant folder

A complete participant folder contains two audio data subfolders, two video data subfolders, and a CSV file with the double-expert speech diagnosis (Fig. 5b). If the recording of one of the two parts is missing, there is only a single audio and a single video subfolder. The subfolders are named 00XXX-R-audio and 00XXX-R-video, where R = 1 or 2 refers to the first or second part of the recording session. The participant diagnosis from one or two SLPs is stored in a CSV file named 00XXX-diag.csv.

Audio data

Each audio data folder contains a complete set of WAV files with audio segments extracted from the recording of the corresponding part of the session (Fig. 5c). The segments contain words, logotomes, or phones within them. Each WAV file stores an uncompressed 15-channel audio stream recorded at 16 bits and 44.1 kHz in the setup described in the Methods section (the order of channels in the WAV file corresponds to the arrangement shown in Fig. 3c).
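For example, a single 15-channel segment can be loaded with the soundfile package as in the sketch below (the path is hypothetical; the file naming rules are given next):

```python
# Loading a 15-channel word segment; channel #8 (index 7) is the central microphone.
import soundfile as sf

audio, fs = sf.read("00030/00030-1-audio/zaba.wav")   # hypothetical path
assert fs == 44_100 and audio.shape[1] == 15
central = audio[:, 7]   # channel #8, also used as the MP4 audio track
```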

The file nomenclature protocol is as follows:

  • The file name for a word or logotome is <word>.wav (word is the Polish word with diacritics and spaces removed).

  • The file name for a phoneme p within a word parentWord is <parentWord_p>.wav. There are some special cases here:

– To stay within the Latin alphabet, sibilants are written as they sound in Polish. That makes p a one-, two-, or even three-letter pattern (e.g., ci in Fig. 5c, or zi and drz in Fig. 5d, used in place of the corresponding IPA symbols). For more details, see Table 6, field sibilant.

– If a word contains more than one phoneme of a certain type, a counter value follows p in the second and any subsequent occurrences (see zaba_a.wav and zaba_a2.wav in Fig. 5c).

– Some speakers produced the same word twice. In such cases, the second occurrence is indicated by adding 2 after the word name, e.g., owoce2.wav (note that the second “o” here is stored in a file named owoce2_o2.wav).

– Finally, some words were produced in a different form, e.g., “dzwon” instead of “dzwonek” or “siatkówka” instead of “siatka”. In such cases, word and parentWord in the filename are always the correct form consistent with Table 2, although all phonemes actually produced are stored (e.g., “siatkówka” produces the word segment siatka.wav and the phoneme segments siatka_si.wav, siatka_a.wav, siatka_t.wav, siatka_k.wav, siatka_u.wav, siatka_f.wav, siatka_k2.wav, and siatka_a2.wav).

The word change is indicated in the dataset through a mechanism described in the Data Validation section.
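The naming rules above are regular enough to parse programmatically; the following sketch (not part of the dataset tooling, all names are ours) recovers the parent word, phone label, and occurrence counter from a segment file name:

```python
# Illustrative parser for the segment file names described above
# (word level: <word>.wav; phoneme level: <parentWord>_<p><counter>.wav).
import re
from pathlib import Path

def parse_segment_name(path: str) -> dict:
    stem = Path(path).stem
    if "_" not in stem:
        return {"kind": "word", "word": stem}
    word, phone = stem.split("_", 1)
    m = re.fullmatch(r"([a-z]+?)(\d*)", phone)
    return {"kind": "phoneme", "word": word,
            "phone": m.group(1), "occurrence": int(m.group(2) or 1)}

print(parse_segment_name("zaba_a2.wav"))
# {'kind': 'phoneme', 'word': 'zaba', 'phone': 'a', 'occurrence': 2}
```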

Moreover, to make the data easily applicable to standard speech analysis software (Praat45), we added a separate text file for each word containing segmentation and annotation data in the TextGrid format46. The naming rule is as follows: annotations of the word/logotome in the audio file <word>.wav are included in the text file <word>.txt. In this case, the phoneme labels are transcribed using IPA.

Video data

The video data folder contains the same number of MP4 files as the audio folder, following the same nomenclature rules and using identical file names (Fig. 5d). Each file stores a single audiovisual segment (word or phoneme) of the recording: a dual-camera view cropped to 240 × 640 pixels, as described in the Methods section, with a single-channel audio signal from the central microphone #8.
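A segment video can be read and split back into the two camera views, e.g., with OpenCV (a sketch; the path is hypothetical, and splitting the frame at half its width assumes the left and right views occupy equal halves of the concatenated image):

```python
# Reading a 240 x 640 segment video and splitting each frame into the camera halves.
import cv2

cap = cv2.VideoCapture("00030/00030-1-video/zaba.mp4")   # hypothetical path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]                              # expected 240 x 640
    left, right = frame[:, : w // 2], frame[:, w // 2 :]
cap.release()
```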

Participant diagnosis

The participant diagnosis file 00XXX-diag.csv includes six columns: three with the diagnosis in Polish and three with the English translation. In either triplet, the first column contains the questionnaire field name (e.g., the name of an articulation or phonetic cue), and the next two store the responses from SLPs E1 and E2. If one of the experts did not examine the child, the corresponding column is empty. Note that the content of the participant diagnosis CSV file is a portion of data extracted from the complete diagnosis dataset described in the Diagnosis summary section.

Dataset files

Participant summary

The participant dataset participantSummary.csv gathers the anonymized data of the children participating in the study. The dataset fields (columns) are specified in Table 5. The articulation field is an attempt to assess each participant’s articulation with a single, simplified label. For each SLT assessment, we took the three most significant features per sibilant (place and manner of articulation, voicing). A single SLT diagnosis yielded the typical label if all features were assessed as typical, and atypical otherwise. Overall, the participant’s articulation was classified as typical if none of the SLT assessments detected distorted pronunciation. However, this field should be treated with caution, as articulation diagnosis is complex, and some alterations from the target norm are natural during articulation development. We recommend that dataset users review the individual features and select a categorization scheme based on their research needs.
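The simplified labeling rule can be summarized by the following sketch (the data layout and value names are illustrative placeholders rather than the actual questionnaire fields):

```python
# Sketch of the simplified articulation label: typical only if every SLT assessment
# rated place, manner, and voicing of every sibilant as typical.
def participant_label(diagnoses: list[dict]) -> str:
    """diagnoses: one dict per SLT assessment, mapping each sibilant to
    {'place': ..., 'manner': ..., 'voicing': ...} with values 'typical'/'atypical'."""
    for assessment in diagnoses:
        for sibilant_features in assessment.values():
            if any(sibilant_features[f] != "typical"
                   for f in ("place", "manner", "voicing")):
                return "atypical"
    return "typical"
```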

Table 5 Data dictionary for the participant dataset.

Sibilant summary

The sibilant dataset sibilantSummary.csv specifies all sibilants in the speech corpus. The data dictionary is given in Table 6 (also refer to Table 2).

Table 6 Data dictionary for the sibilant dataset.

Segment summary

The segmentSummary.csv file describes all AV segments available in the dataset: words, logotomes, and phonemes. Each entry in the segment summary has a corresponding WAV file in the audio subfolder and an MP4 file in the video subfolder. Table 7 presents the segment data dictionary.

Table 7 Data dictionary for the segment dataset.

Diagnosis summary

Two files store the complete set of SLT annotations: diagnosisSummaryPL.csv with the original diagnoses in Polish and diagnosisSummaryEN.csv with the English translation. In either case, the 95 SLT examination questionnaire fields (e.g., the participant ID, the expert ID, or the name of an articulation cue) are organized in columns, with the field names in the first row. The entries (rows) are sorted in ascending order of participant ID and then expert ID. The participant diagnosis file placed in the participant folder is created by extracting the appropriate subarrays (the field names and participant-related rows) from the Polish and English datasets, transposing them into 95 × 3 tables, and concatenating them horizontally.
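As an illustration, the same extraction can be performed with pandas (a sketch; the identifier column name below is an assumption and should be checked against the actual headers):

```python
# Rebuilding a per-participant diagnosis table from the two summary files.
import pandas as pd

pl = pd.read_csv("diagnosisSummaryPL.csv", sep=";", encoding="utf-8")
en = pd.read_csv("diagnosisSummaryEN.csv", sep=";", encoding="utf-8")

def participant_diag(pid: int, id_column: str = "participant_ID") -> pd.DataFrame:
    blocks = []
    for table in (pl, en):
        rows = table[table[id_column] == pid]   # up to two rows (experts E1, E2)
        blocks.append(rows.T.reset_index())     # transpose: fields become rows
    return pd.concat(blocks, axis=1)            # PL block next to EN block, as in 00XXX-diag.csv
```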

The diagnosisDictionary.pdf file contains the definitions of articulation-related fields of the speech examination questionnaire.

Technical Validation

Equipment validation

A thorough technical validation of the audio data acquisition component of the MDAD was performed and presented in detail in34. Below, we briefly report the experiments and results. Since then, we have redesigned the MDAD in terms of construction and usability; however, we have kept the same microphones and their arrangement for audio recording, as well as the same processing hardware.

  1.

    In the first experiment, we tested the MDAD using synthetic signals in accordance with the Polish standard PN-EN ISO 374647, which specifies acoustic measurements of sound level (SL) in conditions close to the free field. The experiments were performed in an acoustically adapted room whose noise rating (NR) fell within the NR 25–30 range acceptable for recording studios. Based on measurements of the SL and the signal-to-noise ratio (SNR), we found that all 15 microphones record the signal consistently for tone frequencies between 1 and 8 kHz. Depending on the tone, the mean SNR was between 61.3 and 65.7 dB, well within the range acceptable for medium-class recording equipment48,49.

  2.

    In the second experiment, we verified the MDAD’s ability to detect abnormal air outflow during articulation50,51. For this purpose, a human speaker simulated various air blows (central, left, and right outflow, each repeated three times) in ten attempts. The energy distributions indicated appropriate sensor responses to the directional acoustic stimuli.

The dual-camera video recording system was added to the MDAD and first presented in35. In that study, we demonstrated the ability to support repeatable interspeaker and intraspeaker data acquisition by adjusting the mask position on a subject’s head through a dedicated visualization interface. We superimposed reference lines on the camera images to help the operator reliably place the MDAD on the speaker’s head and to align the stereovision viewpoints with characteristic points of the face, e.g., the philtrum. We also estimated the extrinsic and intrinsic geometric parameters of the stereo system for potential calibration purposes by finding the geometric relationship between the two cameras observing the same points52,53. We used a chessboard template with known dimensions and geometry to calibrate the individual cameras separately. Then, we determined the translation and rotation matrices between the cameras, yielding a mean calibration error of 0.39 pixels54.
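For readers who wish to repeat such a calibration on their own recordings, a comparable OpenCV-based procedure is sketched below (the chessboard geometry, square size, and image file names are placeholders; this is not the authors’ calibration code):

```python
# Sketch of chessboard-based stereo calibration with OpenCV.
import cv2
import numpy as np

PATTERN = (9, 6)   # inner chessboard corners (assumed; the actual board is not specified)
SQUARE = 5.0       # square size in mm (assumed)

objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

pair_files = [("left_01.png", "right_01.png")]       # placeholder calibration image pairs
obj_pts, left_pts, right_pts = [], [], []
for lf, rf in pair_files:
    left = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, c_l = cv2.findChessboardCorners(left, PATTERN)
    ok_r, c_r = cv2.findChessboardCorners(right, PATTERN)
    if ok_l and ok_r:
        obj_pts.append(objp); left_pts.append(c_l); right_pts.append(c_r)

size = left.shape[::-1]                               # (width, height)
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
err, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
print("mean reprojection error [px]:", err)           # the paper reports 0.39 px
```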

Data completeness remarks

During PAVSig collection, we faced some data completeness issues that led to the exclusion of participants or missing data, some of which have already been mentioned above. This section explains these categories.

Unsuitable participants or missing recording session. Of the 208 participants under consideration, four were excluded for the following reasons:

  • Two speakers were of non-Polish origin (Ukrainian) and thus non-native Polish speakers.

  • One child’s recordings and examination were unreliable and disrupted by the presence of drains in their ears.

  • One child was examined by an SLP but was later unavailable to participate in the examination with the recording session. Thus, they were excluded from the study, leaving only a single diagnostic questionnaire.

Data acquisition failures. In rare cases, we experienced technical problems with data acquisition. These concerned both audio (unacceptable noise, missing samples) and video (missing frames, synchronization issues) and led to the exclusion of three participants, producing the final research sample size of 201. Moreover, four speakers have one part of the recording unavailable because of such failures (three in part 1 and one in part 2).

Incomplete audio data. For 21 speakers, especially in the early stage of the study involving the first version of the device, part of the multichannel audio stream is affected. For technical reasons, the signal from microphones #1–5 (top recording arc) was damaged and is unavailable. The remaining ten channels (#6–15, including the central channel #8) are complete. These cases are indicated in the recording_1 and recording_2 fields of the participant summary (Table 5).

Missing diagnoses. For organizational reasons, we were not always able to perform the second SLT examination. The total number of missing diagnoses is 20, but each child has at least one questionnaire (see section Speech examination questionnaire).

Data validation

We performed an extensive review and validation of the recorded and processed data. The validation covered a manual assessment of ca. 13k word and logotome segments, after applying all exclusions to the research sample and participant data described in the Data completeness remarks section.

The expected number of word/logotome segments was 13,668: 201 participants × (38 words in part 1 + 30 words and logotomes in part 2). With three speakers missing the part 1 recording and one missing the part 2 recording, this number was reduced by 144 (3 × 38 + 1 × 30) to 13,524. A further 718 segments (3.5 per speaker on average) were missing for several reasons:

  • the children did not produce them at all, mostly in part 1, which was based on naming the pictures shown on the screen;

  • the children used a synonym to name the picture (e.g., “mazaki”—"pisaki”, “lekarz”—"pan doktor”, or “sznurek”—"lina”);

  • speech was severely disturbed by noise or other sounds.

On the other hand, in 24 cases, the child produced the same word or logotome twice. Hence, the total number of word/logotome segments we share is 12,830 (13,524 − 718 + 24).

To make the data easy and reliable to use, we performed an extended validation of the available set of words and logotomes, primarily for sibilant analysis. We assigned each segment one of three data validity levels (DVL):

  • DVL = 1.0 – the segment is considered correct and suitable for the analysis.

  • DVL = 0.9 – the segment presents a slightly different word form or inflection than required. However, the change does not affect the sibilant or its environment, so the segment remains suitable for the analysis. Examples: “dzwonek” produced as “dzwon”, “żaba” as “żabka”, “ciastka” as “ciasto”, “parasol” as “parasolka”. Note that the change may affect the number of syllables or the word stress.

  • DVL = 0.5 – the segment presents a significantly different word form or inflection that affects the sibilant environment. Examples: “pies” produced as “piesek”, “jeże” as “jeż”, “strażak” as “straż”, “kaczka” as “kaczuszka”. All words and logotomes modified in any other way also fall into this category.

Table 8 presents detailed statistics of the dataset regarding the total numbers of specific segments (words, logotomes, and phonemes), along with the DVL distributions. Note that the phoneme DVL is always inherited from its parent word or logotome. Approximately 2% and 1% of segments have minor and major issues, respectively, so 97% of segments are considered fully correct. The DVLs are stored in the segment dataset under the dataValidity field (Table 7).
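For example, an analysis restricted to fully correct segments can filter on this field (a short pandas sketch; the column name comes from Table 7):

```python
# Selecting only segments validated as fully correct (DVL = 1.0).
import pandas as pd

segments = pd.read_csv("segmentSummary.csv", sep=";", encoding="utf-8")
clean = segments[segments["dataValidity"] == 1.0]
print(len(clean), "of", len(segments), "segments pass the strictest filter")
```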

Table 8 Segments distribution in the dataset.

There is a total of 66,781 segments in PAVSig (12,830 words and logotomes, 53,951 phonemes). The number of sibilant occurrences varies between 593 in “zi” (3.0 per speaker) and 2,364 in “sz” (11.8 per speaker). The total number of sibilants is 12,576 (62.6 per speaker).

Usage Notes

The dataset is available under the Data Use Agreement (DUA) with data access requirements given in the repository44.