Background & Summary

In recent decades, Machine Learning (ML) has grown rapidly as an industry sector, driving significant advances in the recognition and classification of human speech. Although early research on speech processing dates back to the 1960s, technological limitations hindered widespread adoption until cheap and accessible Graphics Processing Units (GPUs) became available. The wide availability of GPUs opened new research avenues in Neural Networks (NNs). This progress led to a general improvement in ML performance, notably in Natural Language Processing (NLP), image processing, and speech processing. The most prominent field in speech processing is Automatic Speech Recognition (ASR), which transcribes speech from audio recordings. In addition to ASR, supplemental tasks such as Gender Identification (GID), Language Identification (LID), and Speaker Identification (SID) can be performed. The growing importance of metadata related to speech transcriptions has created a market for recognizing emotions, health, age, and other information from speech. Stress detection, in particular, is developing rapidly due to its relevance in key areas of human activity.

The concept of stress has been known since ancient Rome1, but its systematic study in a physiological sense did not begin until the 19th century with Claude Bernard’s theory of “milieu intérieur”2 and Walter Cannon’s extension of this concept to a theory of homeostasis3. Cannon also linked psychological and psychosomatic symptoms and proposed that prolonged exposure to fear could result in death4. The Fight-or-Flight response, which he developed with Philip Bard, is a widely accepted theory stating that a mix of physiological processes prepares the body for fighting or fleeing in response to an acute stressor. Building on this work, further research has been conducted on the application of speech-based features to stress estimation and on the development of multimodal datasets. John Hansen’s5 early work explored the features of stressed speech, but limited data hindered progress. Hansen later collected data in cooperation with the North Atlantic Treaty Organization (NATO) to establish initial stress-related databases. He identified four main features for stress estimation: intensity, pitch, duration of words, and vocal tract spectrum. The Lombard effect needs to be taken into account with respect to these features, as it may reduce their descriptive power. Tet Fei Yap’s doctoral thesis6 explored the effects of cognitive load on speech and found that formant frequencies, although lower-dimensional than Mel-Frequency Cepstral Coefficients (MFCC), were comparable in performance for cognitive load classification systems.

These advances help to study how stress manifests in speech, but there is still a considerable lack of datasets to support such research. Currently, the available data sources for stress analysis from speech face two main problems: data availability and data accuracy. Only the Speech Under Simulated & Actual Stress (SUSAS)7,8 database from the Linguistic Data Consortium (LDC) is easily accessible for academic purposes, but its nature (particularly its small number of samples and the high variance of its stress scenarios) makes it unsuitable for NN training.
The other databases (Cognitive Load Corpus with Speech and Performance Data from a Symbol-Digit Dual-Task (CoLoSS), Speech under Stress Conditions-0 (SUSC-0), Speech under Stress Conditions-1 (SUSC-1), and partly Cognitive Load with Speech and EGG (CLSE)6) are difficult to access and do not meet the requirements for conversational, spontaneous sentences with appropriate stress-load labels. To address this need, we created a new database with both subjective, self-reported and objective, biologically based ground-truth labels. In this paper, we present a dataset collected using a dedicated methodology and experimental protocol, the Brno Extended Stress and Speech Test (BESST)9, which aims to provide empirical data for the development of pre-emptive intervention systems based on stress estimation from speech.

Methods

To collect the necessary data, the BESST protocol, an adaptation of the original Maastricht Acute Stress Test (MAST)10, was developed for the Czech context. It involves participants performing stress-inducing tasks while being recorded by multiple cameras and microphones. The first task involves immersion of the non-dominant hand in ice water, while the second is a dual task based on the Reading Span task (RSPAN). The data presented in this study were collected in April 2022 by the Faculty of Information Technology - Brno University of Technology (FIT-BUT) and the Department of Psychology at the Faculty of Arts at Masaryk University (PSYCH-PHIL-MUNI).

Participants

The research sample consisted of 90 people (21F, 69M; Caucasian young adults). Due to technical issues and data loss, 79 cases (19F, 60M) were eventually included in the final dataset. All participants had to be between 19 and 26 years old. Participants were recruited through an ad on social networks, university websites, and posters placed around the FIT-BUT. Some were also recruited through the snowball method or by an email invitation from Masaryk University (MUNI) to students of the Faculty of Arts. A reward, a food voucher worth about 100 CZK (around 4 EUR), was promised for participation irrespective of whether the participant finished the whole session. Participants were informed of the exclusion criteria (epilepsy, current medications, acute psychological or physiological issues, treatments or illnesses, heart conditions), and it was checked whether Czech was their native language. Heart conditions could have been dangerous for participants exposed to ice water, and the Czech-language requirement was imposed to increase the homogeneity of the speech data for analysis. Each participant was given a unique ID to allow pseudonymization and later complete anonymization of the data.

Ethics statement

This study was assessed and approved by the Ethical Committee Faculty of Electrical Engineering and Communication For Biomedical Research, Brno University of Technology, Brno, Czech Republic under number EK:02b/2022 (“Stress and Cognitive Loads Effects on Speech”). The study was carried out according to relevant guidelines and regulations following the principles of the Declaration of Helsinki.

Participation was voluntary; participants were explicitly informed that they could withdraw from the experiment at any time without any negative consequences or loss of the promised remuneration. The details of the experiment, the recorded modalities (e.g. face, biosignal, and voice recordings), and the overall conditions of the experiment were specified in the informed consent form. The informed consent form was read and signed by each participant with three separate signatures: 1) consent to participate in the experiment, 2) the GDPR clause, and 3) consent to data distribution.

Procedure

The goal of the research was to explore speech production under stress and to use the findings of this work to create an automatic speech classifier. Each subject completed a 45-60 minute recorded session at FIT-BUT. During the session, noninvasive measurements of the participants’ physical state were performed. Specifically, 1) heart electrical activity (Electrocardiogram (ECG)) was measured using the Faros sensor and 2) Electro-Dermal Activity (EDA) using the Empatica bracelet; the session was 3) recorded with multiple cameras, and 4) the voice signal was recorded with several microphones. All introductory information about the experiment was provided to participants upon arrival in the experimental room, and then informed consent was signed. Prior to the experimental run, ECG electrodes and a bracelet measuring EDA, skin temperature, and heart rate were attached to the participants. Electrodes were placed on the anterior chest, one under the right clavicle and the other on the left side of the rib cage, at the farthest point from the heart. The participants then moved to the testing area, where instructions were presented on a computer screen and they were asked to complete the Perceived Stress Scale 14 (PSS14)11 questionnaire, which assesses long-term stress levels, on a tablet. Cameras were turned on, recording the participant from different angles (face-frontal, bucket, posture-left, posture-right), and a headset was turned on to record the voice. Another microphone, placed on the table in front of the participant at a distance of approximately 50 cm, also recorded the voice.

After filling in the questionnaire, the session started with a 3-minute relaxation segment, during which the participants were instructed to stay calm. All lights in the room were turned off for the duration of the relaxation. Baseline measurements of the participants’ resting state (resting ECG and EDA) were taken. After the resting segment, participants were asked to complete the State-Trait Anxiety Inventory-Y1 (STAI-Y1) questionnaire about their current anxiety level. To better manipulate the stress level of the participants, the researcher responsible for preparing the participant for the experiment and conducting the final interview behaved in a friendly and warm manner, while the other researcher, who led the experimental part, behaved coldly and directively toward the participants. Another factor intended to raise the participants’ psychosocial stress was the camera output, which was visible on computer screens in front of the participant throughout the experimental phase.

The experimental phase consisted of two alternating tasks, as shown in Fig. 1:

  • Hand Immersion Task (HIT) - The participant’s hand was immersed in cold water while they described the pictures shown on the screen.

  • Reading Span task (RSPAN) - Sentences in the form of rebuses were presented to participants, who were asked to read them out loud and remember pre-defined words.

Fig. 1

Scheme of the BESST experimental session; the depicted materials have been translated for the purpose of this paper; during data collection, they were presented in Czech.

In HIT, participants had to immerse their hand up to the elbow in a bucket of ice-cold water (approximately 5-10 °C) while describing what they saw in the pictures presented to them on the computer screen. At the end of each segment, participants rated how subjectively difficult the task was on a scale from 1 (“Easy”) to 9 (“Impossible to complete”). The RSPAN task required participants to read short texts in which some words were replaced with an image of the same meaning. Participants were asked to say these words out loud while reading the full paragraphs. To increase cognitive load, participants had to memorize the last word of each text and, at the end of each segment, reproduce these words in the correct order. The number and difficulty of the texts increased with each block. Again, after each block, participants were asked to rate the difficulty of the task on a scale from 1 to 9. The two tasks alternated, with HIT performed five times and RSPAN four times, totaling nine segments. The experimental phase was preceded by a practice presentation of both tasks, during which participants did not immerse their hand in cold water. At the end of the experimental procedure, participants completed the second STAI-Y1 questionnaire, relaxed again for 3 minutes, and answered the NASA Task Load Index (NASA-TLX) questionnaire. A short interview was then conducted on their experience of the experiment. Each participant subsequently received their reward, and the sensors were removed.

For more details on the acquisition process, see12.

Sensors & Instruments

Self-reported measures

  1. Perceived Stress Scale 14 (PSS14)11 - Participants were asked to report their trait stress level before the experimental session started. The PSS-14 questionnaire contains 14 items and was designed to measure the degree to which situations in one’s life are appraised as stressful. The questions focus on assessing subjective feelings of uncontrollability, unpredictability, and overload in the participants’ lives. Participants are asked to respond using a five-point Likert scale from “never” (0) to “very often” (4) to items such as “In the last month, how often have you felt anxious for something that happened unexpectedly?” or “In the last month, how often have you felt that you were on top of things?”.

  2. State-Trait Anxiety Inventory (STAI)13 - Participants were asked to report their current stress level before the experimental session started. The State-Trait Anxiety Inventory Questionnaire is a widely used tool for measuring anxiety. It consists of two parts, each with 20 items. In this research, the section that measures state anxiety (STAI-Y1) was used to assess participants’ current/state feelings. The participants rated their agreement with statements such as “I feel tense” or “I feel excited” on a scale of “not at all” (1), “a little” (2), “rather yes” (3), or “very” (4).

  3. Difficulty rating (Likert scale14) - After each block (HIT/RSPAN), participants were asked to rate the subjective load of the block they had just completed on a scale from 1 (“Easy”) to 9 (“Impossible to complete”).

  4. NASA Task Load Index (NASA-TLX)15 - The questionnaire explored the cognitive load experienced during the entire experimental session, as reported by the participants. It consists of six 20-point Likert rating scales, with their end values marked as “low” or “high” and the rest of the scale unlabeled. Participants rate the completed task on these six scales, concerning mental demand, physical demand, temporal demand (time required to perform the task), performance, effort, and frustration level. A minimal scoring sketch for these self-reported measures is given below.
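
The dataset documentation does not prescribe scoring code; the following is only a minimal sketch of how the self-reported measures above could be turned into numeric scores. The reverse-scored item indices and the simplified STAI-Y1 summation are illustrative assumptions; the official PSS-14 and STAI-Y1 scoring keys should be consulted before applying this to real data.

```python
"""Illustrative scoring helpers for the self-reported measures.

The reverse-scored item indices below are placeholders; consult the official
PSS-14 and STAI-Y1 scoring keys before applying this to real data.
"""
from typing import Sequence


def score_pss14(responses: Sequence[int], reverse_items: Sequence[int] = ()) -> int:
    """Sum 14 PSS items rated 0-4, reverse-scoring the given 1-based items."""
    rev = set(reverse_items)
    return sum((4 - r) if i in rev else r for i, r in enumerate(responses, start=1))


def score_stai_y1(responses: Sequence[int]) -> int:
    """Sum 20 STAI-Y1 items rated 1-4 (reverse-scored items omitted for brevity)."""
    return sum(responses)


def raw_nasa_tlx(subscales: Sequence[float]) -> float:
    """Unweighted ('raw') NASA-TLX score: the mean of the six subscale ratings."""
    return sum(subscales) / len(subscales)


if __name__ == "__main__":
    # Hypothetical example responses, not real participant data.
    print(score_pss14([2] * 14, reverse_items=(4, 5)))  # placeholder reverse items
    print(score_stai_y1([2] * 20))
    print(raw_nasa_tlx([5, 3, 7, 10, 6, 4]))
```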

Objective measures

  1. Heart electrical activity captured by the Faros 180 (https://www.bittium.com/medical/bittium-faros) sensor

  2. Electro-Dermal Activity (EDA) captured by the Empatica E4 (https://www.empatica.com/research/e4/) bracelet

  3. Audio recordings captured by multiple microphones (1x Shure SM-35XLR7 headset; 1x XY microphone through the Zoom H4n recorder)

  4. Video recordings captured by 4x Panasonic HC-VX9805 cameras.

Data Acquisition and Pre-processing

Data segmentation tool

The BESSTiANNO tool was designed to provide precise segmentation and manual verification of audio and video streams. It was inspired by VEST: Video Event Segmentation Tool (https://github.com/jcheong0428/vest), an open-source application for annotating events in video clips. However, the original VEST tool is a purely browser-based JavaScript application, which was not suitable for larger and distributed annotation tasks. To address this issue, the BESSTiANNO implementation adds a Python Flask backend and an SQLite database, making it a more reliable solution for larger-scale annotation operations. An example of the BESSTiANNO interface is shown in Fig. 2.
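
BESSTiANNO itself is not reproduced here; the snippet below is only a minimal sketch of the kind of Flask backend with an SQLite store that the paragraph above describes. The route names, table schema, and database file name are hypothetical and do not reflect the actual tool.

```python
"""Minimal sketch of a Flask + SQLite annotation backend in the spirit of
BESSTiANNO; route names, schema, and file names are hypothetical."""
import sqlite3

from flask import Flask, g, jsonify, request

DB_PATH = "annotations.db"  # assumed database file name
app = Flask(__name__)


def get_db() -> sqlite3.Connection:
    """Open one SQLite connection per request and ensure the table exists."""
    if "db" not in g:
        g.db = sqlite3.connect(DB_PATH)
        g.db.execute(
            "CREATE TABLE IF NOT EXISTS segments ("
            "participant TEXT, label TEXT, t_start REAL, t_end REAL)"
        )
    return g.db


@app.teardown_appcontext
def close_db(exc):
    """Close the per-request connection when the request ends."""
    db = g.pop("db", None)
    if db is not None:
        db.close()


@app.post("/segments")
def add_segment():
    """Store one manually verified segment boundary sent by the front end."""
    payload = request.get_json()
    db = get_db()
    db.execute(
        "INSERT INTO segments VALUES (?, ?, ?, ?)",
        (payload["participant"], payload["label"],
         payload["t_start"], payload["t_end"]),
    )
    db.commit()
    return jsonify(status="ok")


@app.get("/segments/<participant>")
def list_segments(participant: str):
    """Return all stored segments for one participant."""
    rows = get_db().execute(
        "SELECT label, t_start, t_end FROM segments WHERE participant = ?",
        (participant,),
    ).fetchall()
    return jsonify(segments=rows)


if __name__ == "__main__":
    app.run(debug=True)
```

Keeping the segment boundaries in a server-side SQLite store, rather than in the browser as in the original VEST, is what allows several annotators to work against the same annotation database.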

Fig. 2

Screenshot of one annotated session within the BESSTiANNO tool.

Data Records

A high-level data summary is given in Table 1. The dataset is available online9.

Table 1 Dataset summary.

Data streams

The captured data comprise the streams specified in Table 2.

Table 2 Available data streams.

Per-participant state attributes

  • Self-reported difficulty perception (NASA-TLX)

  • Current level of stress (STAI-Y1)

  • General level of stress (PSS14)

  • Gender

Dataset structure

The dataset is partitioned into subfolders, with a different modality in each folder. The directory structure of the dataset is shown in detail in Fig. 3.

Fig. 3

Dataset directory structure.

Dataset descriptive statistics

The duration statistics of the speech segments can be seen in Table 3. Histograms for these segments are in Fig. 4.

Table 3 Speech durations.
Fig. 4

Speech segment durations.

Participants’ self-reported score statistics are in Table 4, with histograms in Fig. 5. The score histogram for the rebus part is shown in Fig. 6.

Table 4 Self-reported loads.
Fig. 5

Self-reported subjective loads.

Fig. 6

Rebus score histogram.

Technical Validation

The presented data were collected using established psychological experimental methodologies and state-of-the-art technologies covering the entire data lifecycle: data collection, pre-processing, and post-processing.

All data streams are time-synchronized. The time t0 is set to the moment when the audio recording from the Zoom recorder started. The Zoom recorder and the video cameras have no way to programmatically set the correct time from an external source (e.g. the Network Time Protocol (NTP)). Instead, we set the time of each device manually, with an acceptable margin of error of 1 second.

The Empatica E4 device implicitly synchronizes its internal Real-Time Clock (RTC) every time data are downloaded from the bracelet. The Faros 180 offers an explicit way to synchronize its internal RTC when the device is connected to the provided software.

In our pilot trial, the Empatica E4 bracelet was our single source of the correct time. The bracelet allows an event to be marked in the data stream by pushing a button. This mechanism was not used to mark the beginning of the experiment, but a few accidental event markers were created that were also visible on the video streams. The relative shift of the cameras’ internal clocks was computed from these events, which allowed us to align the video streams to the Empatica data stream.

Subsequently, the alignment of the audio streams to the video streams was computed using the audio track from the video as the ground-truth time source. Using the Audalign (https://github.com/benfmiller/audalign) project, we obtained the relative lag of each audio stream; specifically, we used the cross-correlation and fingerprinting methods from the Audalign package.
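
The Audalign calls themselves are not reproduced here; the sketch below only illustrates the underlying idea of estimating a relative lag between two recordings by cross-correlation, using NumPy/SciPy. The file names are placeholders, and this is not the Audalign API.

```python
"""Sketch of estimating the relative lag between two audio streams via
cross-correlation (the principle Audalign applies); not the Audalign API."""
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, correlation_lags


def estimate_lag_seconds(ref_wav: str, other_wav: str) -> float:
    """Return how many seconds `other_wav` lags behind `ref_wav`."""
    fs_ref, ref = wavfile.read(ref_wav)
    fs_other, other = wavfile.read(other_wav)
    assert fs_ref == fs_other, "resample first if the sampling rates differ"

    # Use a single channel and zero-mean floats for a stable correlation peak.
    ref = np.asarray(ref, dtype=float)
    other = np.asarray(other, dtype=float)
    if ref.ndim > 1:
        ref = ref[:, 0]
    if other.ndim > 1:
        other = other[:, 0]
    ref -= ref.mean()
    other -= other.mean()

    corr = correlate(other, ref, mode="full")
    lags = correlation_lags(len(other), len(ref), mode="full")
    return lags[np.argmax(corr)] / fs_ref


# Hypothetical file names; the real streams would come from the Zoom recorder,
# the headset, and the camera audio tracks.
# lag = estimate_lag_seconds("zoom_track.wav", "camera_audio.wav")
```

Fingerprint-based alignment, as offered by Audalign, tends to be more robust to noise and gain differences, while plain cross-correlation is simpler and works well for clean, overlapping recordings.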

Finally, we relied on the internal RTC of the Faros 180 device to obtain the alignment with the rest of the data streams. Unfortunately, we found that the Faros 180 RTC has a non-trivial time drift issue, which led to incorrect timestamps in our Faros 180 streams. As we had already extracted RR intervals from the Faros ECG, and the Empatica E4 provides an auxiliary RR-interval stream, we computed the cross-correlation between these two streams per participant and obtained corrected Faros ECG timestamps.
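
As an illustration of this correction step (not the exact pipeline), the sketch below resamples two irregularly timestamped RR-interval series onto a common uniform grid and cross-correlates them to estimate the clock offset; the 4 Hz grid and the variable names are assumptions.

```python
"""Sketch of estimating the Faros-vs-Empatica clock offset from RR intervals.
The 4 Hz resampling grid and all names are illustrative assumptions."""
import numpy as np
from scipy.signal import correlate, correlation_lags


def rr_clock_offset(t_faros, rr_faros, t_empatica, rr_empatica, fs=4.0):
    """Estimate the time offset (s) between two RR-interval series.

    The sign convention should be verified against a segment with a known
    shared event before applying the correction.
    """
    # Resample both irregularly sampled RR series onto a shared uniform grid.
    t0 = max(t_faros[0], t_empatica[0])
    t1 = min(t_faros[-1], t_empatica[-1])
    grid = np.arange(t0, t1, 1.0 / fs)
    a = np.interp(grid, t_faros, rr_faros)
    b = np.interp(grid, t_empatica, rr_empatica)
    a -= a.mean()
    b -= b.mean()

    # The lag of the cross-correlation peak gives the offset in grid samples.
    corr = correlate(a, b, mode="full")
    lags = correlation_lags(len(a), len(b), mode="full")
    return lags[np.argmax(corr)] / fs
```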

To help with manual annotations and corrections of the data, the BESSTiANNO tool was developed and employed. This tool enables precise segmentation and manual verification of both audio and video streams, contributing to the reliability of the dataset. Researchers interested in utilizing or inspecting the BESSTiANNO tool can access it online.

Further, the RR-interval extraction was applied as follows. In all ECG recordings, QRS complexes were detected, and outliers were manually verified and corrected. First, a robust QRS complex detection algorithm16 was used. This algorithm is based on an ensemble of three independent QRS detectors, each based on a different principle (the continuous wavelet transform, the Stockwell transform, and the phasor transform), with individual adaptive thresholding16. The positions of the QRS complexes were automatically refined to lie on top of the R waves. The detections and positions of the QRS complexes were visually checked by an ECG expert beat by beat. Missing and additional detections and inaccurate positions were manually corrected by the expert using the SignalPlant software17. The RR intervals were extracted from the QRS position series; ectopic beats and erroneously long RR intervals were carefully visually inspected and removed from further analysis.
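
The ensemble QRS detector16 and the SignalPlant corrections are not reproduced here; the sketch below only illustrates the final step of converting verified R-peak positions into RR intervals and flagging implausible values. The thresholds are illustrative assumptions, not the values used in the published pipeline.

```python
"""Turn verified R-peak sample indices into RR intervals and flag outliers.
Thresholds are illustrative, not the values used in the published pipeline."""
import numpy as np


def rr_from_r_peaks(r_peak_samples: np.ndarray, fs: float) -> np.ndarray:
    """RR intervals in milliseconds from R-peak positions given as sample indices."""
    return np.diff(r_peak_samples) / fs * 1000.0


def plausible_rr_mask(rr_ms: np.ndarray,
                      lo: float = 300.0, hi: float = 2000.0,
                      max_rel_dev: float = 0.3) -> np.ndarray:
    """Boolean mask of RR intervals kept for analysis.

    Drops intervals outside a broad physiological range and intervals that
    deviate by more than `max_rel_dev` from the median of the in-range
    intervals, a crude stand-in for the manual ectopic-beat screening
    described above.
    """
    keep = (rr_ms > lo) & (rr_ms < hi)
    med = np.median(rr_ms[keep]) if keep.any() else np.nan
    keep &= np.abs(rr_ms - med) <= max_rel_dev * med
    return keep


if __name__ == "__main__":
    fs = 500.0  # assumed ECG sampling rate
    peaks = np.array([120, 620, 1130, 1650, 2900, 3420])  # toy example
    rr = rr_from_r_peaks(peaks, fs)
    print(rr, plausible_rr_mask(rr))
```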

After filtration and pre-processing, each participant’s data went through a manual check of alignment and dynamic range. Potential discrepancies in the data were subsequently fixed. In cases where the data were corrupted beyond repair (e.g. some Empatica RR intervals), the modality was discarded from the participant’s dataset. An example of the final check screen is shown in Fig. 7.

Fig. 7

Final check screen for all modalities.

Regarding the validation of the stress induction, participants were asked to report their anxiety levels as a measure of self-experienced discomfort. These levels were measured before and after the stress-induction phase. To assess the expected effect of the stress-induction tasks, the results of the two STAI-Y1 questionnaires were compared. Given the non-normal distribution of the data, the Wilcoxon signed-rank test was used. There was a statistically significant difference (V = 2436.5, p < 0.001) between the levels of self-reported anxiety measured before (mean = 38.76; median = 37.00; SD = 8.95) and after (mean = 43.80; median = 42.00; SD = 11.24) the stress induction, indicating that the intervention significantly increased subjectively perceived anxiety.
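
For readers who want to reproduce this check on the released questionnaire scores, a minimal sketch using SciPy is shown below. The arrays are random placeholders for the per-participant STAI-Y1 sums, and SciPy reports a W statistic that is parameterized differently from the V statistic quoted above.

```python
"""Minimal sketch of the paired pre/post STAI-Y1 comparison with a Wilcoxon
signed-rank test; `stai_pre`/`stai_post` are placeholders for real scores."""
import numpy as np
from scipy.stats import wilcoxon

# Placeholder arrays: one STAI-Y1 sum per participant, before and after the
# stress-induction phase (replace with values loaded from the dataset).
rng = np.random.default_rng(0)
stai_pre = rng.normal(39, 9, size=79).round()
stai_post = stai_pre + rng.normal(5, 6, size=79).round()

stat, p_value = wilcoxon(stai_pre, stai_post)
print(f"W = {stat:.1f}, p = {p_value:.4g}")
print(f"pre:  mean={stai_pre.mean():.2f}, median={np.median(stai_pre):.2f}")
print(f"post: mean={stai_post.mean():.2f}, median={np.median(stai_post):.2f}")
```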

Usage Notes and Limitations

The BESST dataset can be used for research and development by academic and commercial entities. The license agreement must be signed before the download link is provided. The results and a demonstration of how the data corpus can be used are provided in a separate article18. The final data may be affected by the overall stressful experimental situation: the experimental context may have increased the stress level in participants generally, and we consider this a potential limitation that should be taken into account when the dataset is used. The limited ecological validity of the presented stress situations can also be seen as a limitation of this experimental approach. The BESST protocol can reliably provide speech signals affected by stress; however, it cannot effectively simulate the whole variety of real-world situations and scenarios, so generalizations made from the provided data should reflect the experimental context of the measurements.

Future Work

The provided dataset includes experimentally controlled measurements of the human reaction to direct physiological stress and cognitive load. The BESST protocol was originally designed to collect stressed-speech data for training a stress classifier from speech samples. However, the entire procedure and setting can be used to explore various research questions related to stress or stressful situations. The relationships between physiological and psychological responses in stressful contexts should be further explored. The entire dataset, which includes various measures, can also be investigated beyond the context of speech stress. The original dataset contained 90 participants; data from participants with missing signals or protocol errors were removed, so the final dataset includes 79 cases with full records. All participants were Caucasian young adults, and the dataset contains only a Czech-language sample. Given these limitations, larger datasets should be collected, possibly using the BESST experimental protocol.