Background & Summary

Emotion Recognition in Conversation (ERC) is an emerging field focused on empowering machines to comprehend human emotions. Successful deployment of ERC technologies has the potential to lead to more engaging, personalized, and human-like interactions with artificial intelligence systems1,2.

It is important to acknowledge the complexity of emotions and the considerable variation in their manifestation across individuals3. Moreover, emotion is a psycho-physiological process ignited by both conscious and unconscious perceptions, and the ability to perceive emotional states significantly impacts human interactions, particularly in conversational scenarios4.

Numerous studies have delved into the intricacies of emotion recognition, focusing on different modalities such as acoustic, visual, or physiological information derived from individuals. Each modality presents its distinct challenges and benefits, thus leading to a multifaceted approach in the pursuit of advanced emotion recognition systems.

Indeed, facial expressions and speech signals are commonly utilized modalities in emotion recognition due to their relatively straightforward capture methods. They provide intuitive and abundant insights into an individual’s emotional state, rendering them a valuable resource in the interpretation and classification of emotions.

Conversely, the EEG modality can capture the outcomes of complex, interconnected brain activity, providing a more direct measurement of emotions than audiovisual data. This directness can improve emotion classification, particularly among participant groups who have difficulty expressing their emotions accurately. Because EEG assesses brain activity directly, bypassing external behavioral displays, it can yield more genuine insights into an individual’s emotional state, irrespective of their capacity or inclination to externally manifest their emotions.

The experimental setup for behavioral emotion studies is more challenging, as it requires active participation and engagement from multiple individuals5. In EEG-based studies, passive tasks have predominantly been employed to evoke emotions, such as watching videos or images, listening to music, or recalling memories6,7,8,9,10,11,12. Only a limited number of studies have explored EEG measurements in a conversational context. Saffaryazdi et al. introduced the PEGCONV13 dataset, in which they recorded 16 channels of EEG along with peripheral physiological activity (PPG, GSR) during spontaneous conversations. Park et al. presented the K-EmoCon dataset14, which includes single-channel EEG data, peripheral physiological signals, and audiovisual recordings.

In the domain of audiovisual measurements, a comprehensive collection of datasets has been established, featuring larger conversational corpora across various scenarios15,16,17,18,19,20,21,22,23,24,25. These dialogues were skillfully performed by professional actors15,17,18,20,22, or they involved participants engaging with examiners16,19 in predefined interaction scenarios designed to elicit specific target emotions. Other datasets were generated by extracting conversation segments from films23, TV series24, or video clips25.

Emotion annotation presents one of the most demanding challenges in this context, and the strategies employed may vary based on the experimental conditions. Self-report annotation can be the most reliable means of labeling; however, individuals often mask their true emotions, or they may struggle to accurately identify their own emotions. This challenge intensifies in spontaneous and natural settings, where emotions can fluctuate unpredictably during conversations. In response to this challenge, Park et al.14 incorporated three types of annotations: those from the participant, their conversation partner, and observers, to ensure comprehensive assessments from different viewpoints.

Emotion recognition systems have predominantly relied on a single primary modality from the EEG/audio/video spectrum. For instance, Soleymani et al.6 reported EEG accuracy alone, despite also recording audio and video data. Similarly, other studies measured EEG and audiovisual data simultaneously, yet their analyses predominantly highlighted the performance of a single modality13,14. Koelstra et al.12 also measured EEG and visual data; however, the visual data were of limited use because the paradigm was a passive task (watching video clips) in which visual cues do not dynamically engage the participant. Table 1 summarizes relevant multimodal emotion recognition datasets pertinent to our study, including the database title, primary modalities, language of the audio/video stimuli, elicitation method, and types of stimuli. Notably, no prior research has explored the concurrent use of EEG alongside facial or audio signals for emotion recognition within conversational contexts.

Table 1 Summary of existing emotion recognition datasets, including the primary modalities, language of the audio/video stimuli, elicitation method, and types of stimuli.

The aim of this study is to introduce a multimodal dataset designed to facilitate a more comprehensive understanding of human emotional behavior. This is achieved by analyzing both direct neuronal signals through electroencephalography (EEG) and indirect cues obtained from audiovisual sources. To create this dataset, we established a conversational setting in which participants voluntarily selected scripts that were significantly associated with the target emotions during a screening session. Subsequently, these participants engaged in interactions with a cue-based conversation system, expressing their emotions through facial cues and voice modulations. In total, 42 participants took part in this study, and their responses were captured using a 30-channel EEG and audiovisual recordings. The trials for each emotion were thoughtfully balanced to ensure a precise representation across a wide spectrum of emotions.

To enhance its suitability for classification tasks, we pursued the following objectives: 1) Attaining balanced class distributions. 2) Structuring data in suitable formats. 3) Maintaining consistent conversation duration in each trial. Given the use of scripted dialogues, the conversations were not entirely spontaneous but were guided by cues to ensure uniformity.

Our multimodal approach and the dataset we have developed hold the promise of making significant contributions to several research directions: 1) Techniques for integrating data from multiple sources26,27,28; 2) Analysis of disparities in emotions between observable human behavior and underlying brain functions29,30; 3) Examination of time-variant or contextual emotion analysis31; 4) Synthesis of the facial or vocal outputs32,33; and 5) Exploration of emerging machine learning topics, such as multimodal contrastive learning34,35, generative models36, or knowledge transfer learning31.

Methods

The following section outlines the methodological framework employed in conducting a multimodal experiment for emotion recognition. The experiment recorded participants’ responses to emotional stimuli across three distinct modalities: audio, visual (via camera recordings), and neurophysiological (via EEG). This study focused on the recognition of five primary emotions: anger, happiness, sadness, calmness, and a neutral state.

Each dataset within these modalities adheres to a consistent format, and class distributions are balanced. In the controlled cue-based conversation scenario, participants deliberately selected scripts that strongly evoked specific emotions, allowing us to infer that our class labels reliably represent their genuine emotional states.

Ethics statement

This study was conducted with approval from the Institutional Research Ethics Committee of Nazarbayev University (NU-IREC 777/09102023) and the Institutional Review Board of Korea University (KUIRB-2021-0248-02). All participants provided written informed consent prior to their involvement and were fully informed about the nature of the research, including the recording of audio, video, and EEG data. They were also made aware of the data distribution, which is restricted to research purposes under a strictly supervised Data Usage Agreement by the IREC. Furthermore, stringent protocols were implemented to ensure the protection and confidentiality of all participant data, in accordance with established ethical guidelines for research involving human subjects.

Participants

The study enlisted a cohort of 42 individuals from Nazarbayev University, including both students and members of the general population. These participants were proficient English speakers who had either lived in or studied in English-speaking countries. The age range of the participants fell between 20 and 30 years.

A high level of language proficiency was a requirement for all participants, given the nature of the experiment, which involved engaging in dialogues and discerning emotional nuances within the English language.

Participants were provided with prior information about the screening and main experiment dates, allowing them to schedule their involvement accordingly. Before participating, all participants provided informed consent and were given a comprehensive explanation of the study’s objectives and procedures. Their right to withdraw from the study at any point was emphasized to ensure ethical compliance throughout the research.

Experiment Protocol

The primary objective of this study is to investigate five distinct emotional classes, which are represented as follows: A for anger, H for happiness, S for sadness, C for calmness, and N for a neutral emotional state. These emotional classes can be further categorized into two overarching dimensions: positive and negative valence, as well as high and low arousal levels.

Our study involved the creation of conversation scenarios in which human participants engaged in emotional dialogues with a conversational system. These dialogues encompassed multiple iterations of listening and speaking interactions. Initially, the dialogues were generated using OpenAI’s ChatGPT and were subsequently meticulously reviewed and edited by human researchers to ensure emotional clarity and appropriateness for the study.

It is important to note that the dialogues were intentionally designed to be universally accessible and straightforward, resembling common, everyday conversations. The length of sentences in each interaction was thoughtfully adjusted to ensure they were of an appropriate duration, allowing the speaker to engage in speech for more than 20 seconds.

In our study, the dialogues are structured as paired listening/speaking sets. A psychologist reviewed the entire collection of scripts to ensure that the target emotions were correctly induced and to verify that no script contained excessively harsh sentences that could adversely affect the participants.

The experiment was designed to include 100 repeated interactions. In each interaction, participants initially watched a pre-recorded video and then responded to the corresponding dialogue for a duration of 20 seconds. When participants spoke, the dialogue was presented in the central area of the monitor, guiding them to read and deliver the script as instructed (see Fig. 1).

Fig. 1
figure 1

Illustration of the experimental setup. (a) The participants wore a cap for EEG recordings and multimodal data were synchronously recorded, (b) Listening condition: a prerecorded video is displayed to the participant, prompting their interaction, (c) Speaking Condition: Scripts are provided at the center of the monitor. Participants were encouraged to freely express themselves beyond the confines of the given script. The individual(s) depicted in this figure provided informed consent to have their image openly published.

Each iteration commenced with a neutral class dialogue, followed by four interactions corresponding to a specific emotional class, as shown in Table 3. The neutral dialogue in the first interaction was intended to establish a baseline, ensuring that participants began each conversation in a neutral emotional state before transitioning to emotionally charged dialogue. This approach also helped maintain a balanced distribution of the five emotional classes, guaranteeing that each class was represented in 20 interactions. In total, our dataset encompasses 20 iterations, comprising a cumulative total of 100 interactions.

Despite providing participants with ample time to rest and return to a neutral emotional state, they reported encountering challenges when transitioning between contrasting emotions, such as moving from anger to happiness. This issue has been documented in previous studies as well. Consequently, we thoughtfully arranged the sequence of emotional classes, strategically placing the ‘calm’ class between such contrasting emotions to facilitate smoother transitions.

We employed a pseudo-random class-iteration sequence: [A, A, C, C, S, S, H, A, C, H, H, S, S, A, A, C, C, H, H, S] to optimize the emotional transitions experienced by the participants. In this sequence, each emotional class corresponds to a specific iteration, and the neutral class consistently begins each iteration. This approach serves two purposes: it maintains the participant’s current emotional state and facilitates a gradual transition to contrasting emotional classes. To prevent any potential bias resulting from rapid emotional shifts in participants and to ensure data consistency, we applied this pseudo-randomized sequence uniformly to all participants.
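
To make the resulting trial structure concrete, the following short Python sketch (with hypothetical variable and function names) expands the pseudo-random class sequence into the full interaction order, prepending the neutral dialogue to every iteration as described above.

# Illustrative sketch of the trial ordering described in the text.
# Each iteration starts with one neutral (N) interaction, followed by four
# interactions of that iteration's emotion class.
CLASS_SEQUENCE = ['A', 'A', 'C', 'C', 'S', 'S', 'H', 'A', 'C', 'H',
                  'H', 'S', 'S', 'A', 'A', 'C', 'C', 'H', 'H', 'S']

def build_schedule(class_sequence):
    schedule = []
    for iteration, emotion in enumerate(class_sequence, start=1):
        for interaction, label in enumerate(['N'] + [emotion] * 4, start=1):
            schedule.append({'iteration': iteration,
                             'interaction': interaction,
                             'class': label})
    return schedule

schedule = build_schedule(CLASS_SEQUENCE)
print(len(schedule))  # 100 interactions = 20 iterations x 5 interactions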

We must note that, while the sequence set was the same for all participants, the dialogues within each iteration varied as participants selected different sets of dialogues for each emotional class.

The experiment was carried out in two separate sessions, separated by a one-day interval between them. This approach was adopted to ensure optimal participant engagement and to minimize the effects of fatigue. For a comprehensive overview of the procedures implemented in each session, please refer to Table 2.

Table 2 The experimental procedures for data measurement consisted of two sessions: screening and recording, lasting approximately 1 hour and 2 hours, respectively.
Table 3 An example of a single conversation consisting of 5 interactions, including 1 neutral (N) interaction and 4 interactions (Listening (L) + Speaking (S)) on one of the four emotion classes (e.g., H: Happiness).

Session I: Screening

The primary purpose of the screening session was to familiarize participants with the experiment’s overview and the research format. During the first session, participants were presented with a selection of dialogues and tasked with choosing the dialogues that they believed would most effectively elicit and reflect their emotional responses, as detailed in Table 2. A pool of 50 dialogues was offered, with 10 dialogues available per emotional class. Each participant was required to choose five dialogues for each emotional class. Each dialogue consisted of four interactions, encompassing both listening and speaking components.

In previous studies, class labels were assigned after participants had experienced emotions from specific stimuli in a natural setting. This approach presented challenges in achieving consistent and balanced data, as label assignments could vary significantly among individuals. Variability in labels for the same stimulus could introduce conflicts in machine learning models. In contrast, our study provided distinct dialogues for each emotion class. Participants then voluntarily selected the scripts they found most evocative, ensuring that class labels were established prior to the experiment.

Following the selection of dialogues, participants followed an experimental protocol based on their choices. They engaged in a practice session without data recording, allowing them to become familiar with the procedure under the guidance of psychologists. If a chosen dialogue and its associated video failed to elicit the desired emotion, participants were given the option to select a different script.

Because our paradigm did not replicate a naturalistic setting and involved predetermined conversation scripts, there was a potential risk of participants merely reading the provided text without genuine emotional engagement. To mitigate this, participants were explicitly instructed to emotionally align with the specific scenarios. Those who did not adequately exhibit this alignment or emotional comprehension were excluded from the study.

Session II: Main Experiment

The second session, which constituted the main experiment, spanned approximately 2 hours. Before commencing the experiment, we assessed each participant’s psychological condition. If a participant exhibited signs of biased mental states, such as high levels of fatigue or stress, we either excluded them from the study or rescheduled their session.

During the main experiment, participants sequentially engaged in 20 iterations of conversation, with each iteration lasting around 4 minutes. To allow participants to return to their baseline mental state, we provided ample rest time between iterations. The subsequent conversation only commenced once the participant signaled their readiness for the next trial. At this point, the experimenter alerted the instructor to initiate the next dialogue phase.

After each iteration, participants were asked to provide self-reported ratings for their arousal level and valence score pertaining to the dialogue they had just experienced. The rating scale ranged from -5 to 5, enabling participants to indicate their perceived emotional arousal and the valence (positive or negative) associated with the dialogue (refer to Fig. 2). Additionally, our instructor independently provided a score to assess the participant’s emotional state. Upon completing the experiment, we collected scores from both the participant and the instructor for all 20 iterations.

Fig. 2
figure 2

Subjective self-assessment of emotion on valence and arousal levels.

Fig. 3
figure 3

Comparative analysis of emotion ratings between participants and experimenters: arousal and valence levels across four emotional states (Happy, Sad, Calm, and Angry). N denotes the total number of rating data points across all participants.

Participants received explicit instructions to actively convey their emotions, both mentally and through their vocal and facial expressions, throughout the conversations. This guidance was intended to ensure that the physiological data (EEG signals), visual data (facial expressions), and audio data (voice modulation) all correspond to the intended emotional class labels. During the experiment, participants were encouraged to make movements and gestures freely to express their emotions.

It is crucial to emphasize that participants had the autonomy to halt the recording at any point during both sessions if they experienced any discomfort. Consequently, eight participants chose to withdraw from the study due to personal reasons.

Experimental Settings

Participants were comfortably seated in front of a computer monitor during the experiment. They interacted with a 27-inch monitor featuring a 60 Hz refresh rate, which displayed the dialogues and other visual stimuli (see Fig. 1a).

The experiment was conducted in a controlled environment to ensure optimal data collection. The controlled settings encompassed the following aspects:

  1. Lighting and Background: To minimize visual distractions in facial recordings, a white screen was positioned behind the participant. Additionally, a light kit was used to provide uniform illumination of the participant’s face.

  2. EEG Equipment: We utilized a BrainAmp system (Brain Products; Munich, Germany) to record brain activities. EEG data were collected through 30 Ag/AgCl electrodes placed at specific locations on the scalp: Fp1, Fp2, F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, CP5, CP1, CP2, CP6, P7, P3, Pz, P4, P8, PO9, O1, Oz, O2, PO10. The data were sampled at a rate of 500 Hz. The electrodes were referenced to the mastoids and grounded through the AFz electrode. Throughout the experiment, we ensured that the impedance of the EEG electrodes remained consistently below 10 kΩ. The EEG dataset was initially recorded in BrainVision Core Data Format and was subsequently converted into Matlab (.mat) format for ease of use and analysis.

  3. Audio Equipment: To accurately capture the participant’s verbal responses with minimal distortion or interference, we utilized a high-quality microphone. This choice of equipment facilitated a precise analysis of voice modulations corresponding to various emotional states. The audio data was recorded and stored in the WAV (Waveform Audio File Format) for further analysis.

  4. Video Equipment: A web camera was securely attached to the monitor and positioned to focus directly on the participant’s face. This setup was designed to consistently capture facial expressions and emotions throughout the course of the experiment. The data was initially recorded and stored in the AVI (Audio Video Interleave) format, and subsequently, it was converted into the MPEG-4 (.MP4) format to reduce file size.

To ensure that the EEG electrodes were not visible and to eliminate any unnecessary components from the video recording, the electrodes were concealed beneath a black-colored cap.

The entire experiment was conducted using the PsychoPy software, which is based on the Python programming language. This software played a crucial role in various aspects of the experiment, including the presentation of videos and scripts, the management of recordings, ensuring precise timing throughout the study, and facilitating communication with each modality.
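
As an illustration only, the sketch below shows how a single listening/speaking interaction could be driven from PsychoPy; the window settings, stimulus file name, and script text are placeholders rather than the exact experimental code.

# Minimal PsychoPy sketch of one listening/speaking interaction (illustrative only).
from psychopy import visual, core
from psychopy.constants import FINISHED

win = visual.Window(size=(1920, 1080), fullscr=True, color='black')

# Listening condition: play a pre-recorded dialogue video (hypothetical file name).
movie = visual.MovieStim3(win, 'dialogue_happiness_01.mp4')
while movie.status != FINISHED:
    movie.draw()
    win.flip()

# Speaking condition: present the script at the center of the screen for 20 s.
script = visual.TextStim(win, text='Scripted response text goes here.',
                         height=0.05, wrapWidth=1.5)
timer = core.Clock()
while timer.getTime() < 20.0:
    script.draw()
    win.flip()

win.close()
core.quit()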

Data Records

All data files are accessible via the Zenodo general-purpose open repository37. The data is available under the terms and conditions of the data usage agreement (DUA).

The repository is structured to offer comprehensive insights into the data while maintaining a standardized format. It’s worth noting that we have segmented the continuous recordings into 20-second data streams.

The root folder, EAV, contains participant folders labeled as subject{SUBNUM}, where {SUBNUM} ranges from 1 to 42. Each participant folder includes three unimodal data folders: Video, Audio, and EEG (see Fig. 6).

Video Data: The ‘Video’ subfolder houses segmented video clips, each 20 seconds in length. The naming convention for these clips is:

{NUM_INTS}_Trial_{NUM_TRIAL}_{TASK}_{CLASSNAME}.MP4
  • NUM_INTS: Represents the instance index number, ranging from 001 to 200.

  • NUM_TRIAL: Denotes the trial index within each conversation, spanning from 1 to 10 (5 interactions, each with a listening and a speaking trial).

  • TASK: Specifies the activity being performed, either ‘listening’ or ‘speaking’.

  • CLASSNAME: Indicates the associated emotion class.

Collectively, a single subject folder possesses 200 video clips, derived from [5 emotion classes  × 2 tasks  × 20 iterations].

Audio Data: In the ‘Audio’ folder, the audio files follow the same naming format as the video files, with ‘_aud’ appended to the file name. Note that audio files exist only for the ‘speaking’ task; the ‘listening’ task is omitted because, during listening, the recorded audio contains only the pre-recorded actress’s voice. Each subject therefore has 100 audio files, derived from [5 classes  × 1 task  × 20 conversations].

To offer an intuitive understanding, a single iteration typically comprises the following three video/audio files (a file-name parsing sketch is provided after this list):

  • ‘001_Trial_01_Listening_Neutral.MP4’

  • ‘002_Trial_02_Speaking_Neutral.MP4’

  • ‘002_Trial_02_Speaking_Neutral_Aud.WAV’
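
As a convenience, the regular-expression sketch below (assuming the naming pattern and examples above, including the ‘_aud’ suffix for audio files) shows one way to recover trial metadata from a file name.

import re

# Parse metadata from a file name following the naming convention above
# (case-insensitive; the optional '_aud' suffix marks audio files).
PATTERN = re.compile(
    r'(?P<instance>\d+)_Trial_(?P<trial>\d+)_(?P<task>Listening|Speaking)_'
    r'(?P<emotion>[A-Za-z]+?)(?:_aud)?\.(?:MP4|WAV)$',
    re.IGNORECASE,
)

def parse_filename(name):
    match = PATTERN.match(name)
    if match is None:
        raise ValueError('Unexpected file name: ' + name)
    return match.groupdict()

print(parse_filename('001_Trial_01_Listening_Neutral.MP4'))
# {'instance': '001', 'trial': '01', 'task': 'Listening', 'emotion': 'Neutral'}
print(parse_filename('002_Trial_02_Speaking_Neutral_Aud.WAV'))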

EEG Data: The EEG data were initially recorded continuously with dimensions [Time  × Channels]. For preprocessing, we band-pass filtered the data between 0.5 and 50 Hz using a fifth-order Butterworth filter to mitigate facial-muscle artifacts and electrical line noise. Subsequently, the data were segmented using the time markers for each event (trigger), resulting in a structure of [Instances  × Time  × Channels].

Given an initial sampling rate of 500 Hz, our processed EEG data adopts the structure: [200 instances  × 10,000 time points (20s  × 500 Hz)  × 30 channels].
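
The snippet below is a minimal SciPy sketch of the preprocessing just described (band-pass filtering of the continuous recording and segmentation around event markers); details such as the zero-phase filter application used here are assumptions, and the variable names are hypothetical.

import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 500          # original sampling rate (Hz)
TRIAL_SEC = 20    # duration of each segment (s)

def bandpass(continuous, low=0.5, high=50.0, fs=FS, order=5):
    # Fifth-order Butterworth band-pass applied along the time axis.
    sos = butter(order, [low, high], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, continuous, axis=0)

def segment(continuous, onsets, fs=FS, trial_sec=TRIAL_SEC):
    # Cut [Time x Channels] data into [Instances x Time x Channels] epochs.
    n_samples = trial_sec * fs
    return np.stack([continuous[o:o + n_samples] for o in onsets])

# continuous: [Time x Channels] array; onsets: trigger sample indices
# epochs = segment(bandpass(continuous), onsets)   # -> (200, 10000, 30)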

The labels for this data use a one-hot encoding format, structured as 200 trials by 10 classes (5 emotions multiplied by 2 tasks). The files adhere to the following naming conventions:

subject{NUM_SUB}_eeg.mat
subject{NUM_SUB}_eeg_label.mat

Note that the label information can be applied across all modalities since all recordings, regardless of the modality, were conducted synchronously. This ensures uniform annotations throughout the dataset.

Missing Data

During the course of the experiment, we encountered several challenges, including technical malfunctions with the equipment, unexpected interruptions, and instances where participants missed the task, among other issues. These missing segments, particularly the iterations in which they occurred, were diligently documented. Subsequently, after the initial experiment was completed, these specific iterations were revisited and conducted again.

This approach was undertaken to maintain data integrity, and the re-positioned iterations were placed in their correct order. Consequently, the entire dataset maintains consistent class labels and formats. It’s worth emphasizing that all participants had a balanced dataset, ensuring that there are no discrepancies in the data format across any of the participants.

Technical Validation

The principal objective of this study is to explore open topics in emotion recognition within a single or multimodal setting, rather than to conduct a performance comparison across each modality. It is crucial to recognize the distinct characteristics of EEG, Audio, and Visual signals, as each exhibits unique signal patterns that necessitate specialized analysis pipelines. Performance in these modalities can vary significantly and can be sensitive to the specific techniques employed in each domain. For instance, while certain augmentation techniques in computer vision can substantially enhance performance, their application to audio or EEG data may not consistently yield similar results.

Given the complexity of the task and the diverse nature of the modalities involved, we adopted a widely-used, straightforward end-to-end CNN model and, additionally, more recent transformer architectures38,39. We used a simple data split methodology to ensure robust and consistent results.

The dataset was initially annotated with ten evenly distributed class labels: the five emotions (happiness, sadness, anger, calmness, and neutrality), each under listening and speaking conditions. However, for consistent analysis, only the speaking trials were considered, as the audio data exclusively captures these trials.

For each modality, a single participant was represented by 100 trials, each lasting 20 seconds. However, a 20-second duration is relatively lengthy and not ideal for emotion recognition classification. To address this, we divided the 20-second data streams into 5-second intervals. This division effectively expanded the initial 100 trials and their labels into 400 trials. The dataset was then split into 70% training and 30% testing subsets, with the first 280 trials designated for training and the remaining 120 for testing. Consequently, the performance metrics reported in this study were derived from this consistent set of training and testing data across all modalities. We designed a separate CNN model for each modality, and to establish the baseline performance of these CNN models on the original dataset, we refrained from employing advanced techniques such as knowledge transfer40, augmentation41, or signal optimization42,43.
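
A minimal NumPy sketch of this segmentation and chronological split is given below; the helper name and the assumption that labels are provided per 20-second trial are illustrative.

import numpy as np

def split_trials(data, labels, n_segments=4, train_ratio=0.7):
    # data:   [trials, time, ...] (e.g., 100 x 10000 x 30 for EEG at 500 Hz)
    # labels: [trials, n_classes] one-hot labels per 20-s trial
    n_trials, n_time = data.shape[0], data.shape[1]
    seg_len = n_time // n_segments
    # Reshape each 20-s trial into four consecutive 5-s segments.
    segments = data[:, :seg_len * n_segments].reshape(
        n_trials * n_segments, seg_len, *data.shape[2:])
    seg_labels = np.repeat(labels, n_segments, axis=0)
    n_train = int(train_ratio * len(segments))   # 280 of 400
    return (segments[:n_train], seg_labels[:n_train],
            segments[n_train:], seg_labels[n_train:])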

To evaluate model performance in multi-class classification, we calculated the mean accuracy and the F1-score (the harmonic mean of precision and recall) across all participants.
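
For reference, these metrics can be computed per participant with scikit-learn as sketched below; the choice of macro averaging across the five classes is an assumption, since the exact averaging scheme is not specified.

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Accuracy and F1-score (harmonic mean of precision and recall),
    # macro-averaged over classes (assumed averaging scheme).
    return {'accuracy': accuracy_score(y_true, y_pred),
            'f1': f1_score(y_true, y_pred, average='macro')}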

EEG Data Analysis

The segmented EEG data were downsampled to 100 Hz and then band-pass filtered within a frequency range of 0.5–50 Hz. The EEG data were then formatted as (400, 30, 500, 1), representing the total number of trials (training + test), channels, data points (5 seconds at 100 Hz), and depth.

DeepConvNet44 and ShallowConvNet44 are straightforward CNN models that take the raw EEG input directly, without explicitly modeling its spatial structure. A recent study applied preprocessing steps to identify user-specific frequency ranges43, and the resulting images were then fed to a CNN model. Lawhern et al.45 demonstrated the successful integration of spectral-spatial filtering into an end-to-end CNN model. In this study, the baseline validation for EEG data was performed with EEGNet45.

The EEGNet architecture is configured with standard parameters: it includes a 2D convolutional layer (F1 = 8, kernel length = 300), a depthwise convolutional layer (D = 8), and a separable convolutional layer (F2 = 16). These layers are followed by a dense layer with a dropout rate of 0.5 and a softmax output layer. The model utilized categorical cross-entropy as the loss function and employed the Adam optimizer. Training was performed with a batch size of 32 across 100 epochs.
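
The sketch below outlines an EEGNet-style Keras model with the hyperparameters listed above; details not stated in the text (pooling sizes, activation, and constraints) follow the original EEGNet description and should be treated as assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_eegnet(n_channels=30, n_samples=500, n_classes=5,
                 F1=8, D=8, F2=16, kern_length=300, dropout=0.5):
    inputs = layers.Input(shape=(n_channels, n_samples, 1))

    # Temporal convolution.
    x = layers.Conv2D(F1, (1, kern_length), padding='same', use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)

    # Depthwise (spatial) convolution across all channels.
    x = layers.DepthwiseConv2D((n_channels, 1), depth_multiplier=D, use_bias=False,
                               depthwise_constraint=tf.keras.constraints.max_norm(1.0))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 4))(x)
    x = layers.Dropout(dropout)(x)

    # Separable convolution.
    x = layers.SeparableConv2D(F2, (1, 16), padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 8))(x)
    x = layers.Dropout(dropout)(x)

    x = layers.Flatten()(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# model = build_eegnet()
# model.fit(x_train, y_train, batch_size=32, epochs=100)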

Video Data Analysis

Each 5-second video clip contains 150 frames, recorded at 30 frames per second. We extracted 25 frames per clip by selecting every 6th frame, resulting in a total of 10,000 frames (training and testing combined) across all emotion categories. For training, we used the DeepFace model46, a deep neural network comprising a series of convolutional, pooling, and fully connected layers, totaling nine layers. We adjusted the final softmax layer to match our number of classes. A dropout layer with a rate of 0.5 was applied to the first fully connected (FC) layer. The model was trained using the Adam optimizer with a learning rate of 0.001 and a cross-entropy loss function. Training was conducted with a batch size of 32 over 100 epochs.
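
The frame subsampling can be sketched with OpenCV as below; the file name is hypothetical and the color conversion is an assumption about the downstream model’s expected input.

import cv2

def sample_frames(video_path, step=6):
    # Keep every 6th frame of a clip (150 frames at 30 fps -> 25 frames).
    frames, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# frames = sample_frames('003_Trial_03_Speaking_Happiness.MP4')  # hypothetical name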

For performance evaluation on each test trial, we predicted the outputs for all frames and then assigned the most frequently occurring class as the prediction.
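
A minimal sketch of this majority-vote rule over frame-level predictions is shown below.

import numpy as np

def majority_vote(frame_predictions):
    # Assign the trial-level label as the most frequent frame-level prediction.
    values, counts = np.unique(np.asarray(frame_predictions), return_counts=True)
    return values[np.argmax(counts)]

# trial_label = majority_vote(model.predict(frames).argmax(axis=1))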

Audio Data Analysis

To validate the audio data, we employed the Librosa library47 to preprocess the audio data into a usable format. Subsequently, we used a standard feature extraction method known as Mel-Frequency Cepstral Coefficients (MFCCs)48. MFCCs have been widely validated on extensive audio datasets and are a reliable choice for audio feature extraction.

During the pre-processing phase, we extracted essential audio features, including MFCCs, Chroma features, and Mel spectrograms, which were then concatenated for further processing. We utilized a Sequential Convolutional Neural Network (SCNN) architecture49,50 with one-dimensional data processing capabilities. The model consists of four 1D convolutional layers with Rectified Linear Unit (ReLU) activation, interspersed with dropout layers to enhance generalization. Additionally, L1 and L2 regularization techniques were applied to reduce overfitting. The network starts with 256 filters and progressively decreases to 128 filters, followed by max pooling for down-sampling. The architecture concludes with a densely connected layer employing softmax activation. Model optimization was performed using the Adam optimizer, with a weight decay of 1e-6 and a learning rate of 1e-3.
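
The sketch below approximates this audio pipeline with Librosa features and a Keras 1D CNN; the feature sizes, kernel sizes, dropout rates, and regularization constants are not fully specified in the text, so the values used here are assumptions.

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def extract_features(wav_path, sr=22050):
    # Concatenate time-averaged MFCC, chroma, and mel-spectrogram features.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel])  # 1D feature vector per clip

def build_audio_cnn(n_features, n_classes=5):
    # Four 1D convolutional layers (256 -> 128 filters) with dropout and
    # L1/L2 regularization, followed by max pooling and a softmax output.
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(256, 5, activation='relu',
                      kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
        layers.Dropout(0.3),
        layers.Conv1D(256, 5, activation='relu'),
        layers.Dropout(0.3),
        layers.Conv1D(128, 5, activation='relu'),
        layers.Conv1D(128, 5, activation='relu'),
        layers.MaxPooling1D(2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(n_classes, activation='softmax'),
    ])
    # weight_decay requires a recent TensorFlow version (>= 2.11).
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-6)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model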

Figure 4 provides a concise overview of the classification pipelines for each data type. The multimodal data samples depicted in this figure were extracted from one of the participants (participant30), representing the happiness class.

Fig. 4
figure 4

Multimodal input data and their corresponding processing pipelines for emotion classification. Each trial has a duration of 5 seconds. The audio data are preprocessed into input images, while the raw video/EEG data are fed directly to their respective CNN models. The resulting outputs from these CNN models provide softmax predictions for five emotional states: Neutral (N), Happiness (H), Calmness (C), Anger (A), and Sadness (S). The individual(s) depicted in this figure provided informed consent to have their image openly published.

Transformer architecture

The Transformer was initially developed for language modeling and has since been widely adopted in place of CNNs owing to its ability to model long-range dependencies. In our study, the performance of the EEG, audio, and visual data was validated using modality-specific Transformer architectures, namely EEGformer51, AST38, and ViViT39, respectively. The model architectures and parameters were used as described in their original publications.

Performance evaluation

The dataset was partitioned into two subsets: a training set and a test set. The initial 70% of the video/audio/EEG data was designated for training, while the remaining 30% was reserved for testing. Specifically, for the audio data, each participant contributed 400 samples, each lasting 5 seconds; 280 samples were drawn from the first 70% of the conversations, and the remaining 120 were used for testing. This ratio can be flexibly adjusted to align with the specific objectives of a study. In our baseline, each model was trained for a fixed number of epochs without a validation set.

Table 4 presents the emotion recognition performance in the uni-modal condition. For the individual CNN models, the EEG data yielded a mean accuracy of 60.0% ± 0.09 and an average F1-score of 0.58 ± 0.09; the audio data yielded a mean accuracy of 61.9% ± 0.08 and an average F1-score of 0.61 ± 0.09; and the video data yielded a mean accuracy of 71.4% ± 0.08 and a mean F1-score of 0.70 ± 0.08. Note that the scores reported in Table 4 may vary with different machine learning techniques and data split configurations.

Table 4 The performance of emotion recognition was evaluated independently in each modality.

Table 5 provides a comparative analysis of the mean accuracy and mean F1-score for the Transformer models. The results demonstrate that pretrained Transformer models enhance classification performance in the audio and video modalities, achieving mean accuracies of 62.7% and 74.5%, respectively. Conversely, EEGformer51 did not yield comparable improvements, achieving 53.5% accuracy and an F1-score of 0.52. This performance discrepancy may be attributed to the imbalance in the availability of pre-trained models across modalities.

Table 5 Performance comparison of Transformer models with EEG, audio, and visual modalities.

Figure 5 shows the confusion matrices for the EEG, Audio, and Video modalities, each assessed over 1008 trials per emotion category across all participants. The EEG modality shows marked accuracy in classifying the Happiness and Neutral emotions. The Audio modality excels in discerning high-arousal states such as Anger and Happiness. The EEG and Audio modalities exhibit a congruent trend, notably mispredicting low-arousal classes such as Neutral, Sadness, and Calm, an observation that aligns with established findings in the field. In contrast, the Video modality shows substantial mispredictions, frequently misclassifying low-arousal classes as high-arousal emotions, specifically Anger and Happiness.

Fig. 5
figure 5

Cumulative confusion matrices for classifying five emotion classes across all participants using EEG, Audio, and Video data. The matrices are shown for each emotion: Neutral (N), Sadness (S), Anger (A), Happiness (H), and Calm (C). The total number of test trials per class across all participants is 1008 (24 speaking tasks  × 42 participants).

Fig. 6
figure 6

The data repository is structured for intuitive navigation. Under the primary level, folders are named ‘subject{idx}’, where ‘idx’ denotes the participant identifier. Each of these participant-specific directories contains three sub-directories: ‘Video’, ‘Audio’, and ‘EEG’. Correspondingly, data files within these sub-directories adhere to their specific formats: video files are saved as *.MP4, audio recordings are in *.WAV format, and EEG data is stored as *.MAT files.

Valence and arousal ratings from the 42 participants were analyzed for the five distinct emotions. Within each emotion, participants engaged in four separate conversations, and for each conversation their valence and arousal levels were rated on a scale ranging from -5 (lowest) to 5 (highest) (see Fig. 2). On average, emotions traditionally associated with higher arousal, such as excitement or anger, consistently scored above 3.5, whereas emotions typically associated with lower arousal, such as sadness or calmness, scored below 2.5 (see Fig. 3). Similarly, positive emotions, such as happiness, frequently had a valence level exceeding 4, whereas negative emotions, such as sadness, were frequently below 2 on the valence scale.

Usage Notes

This work is under a Data Usage Agreement (DUA) at Zenodo37. The approval process and usage restrictions are outlined in the accompanying application form, which can be accessed through the repository. The data access will be granted to applicants who agree to the terms and conditions outlined in the Data Usage Agreement. In order to access and use the EAV dataset, you will be asked to provide your full name, affiliation, position/title, and a brief description of your intended use of the dataset.

For EEG data analysis, the BBCI toolbox (https://github.com/bbci/bbci_public)52 and OpenBMI53 offer a wide range of signal processing functions in Matlab, including artifact rejection, spectral/spatial filtering, resampling, and re-referencing. The MNE toolbox (https://min2net.github.io)45 provides similar functions and visualization methods in Python.

For audiovisual data analysis, the CNN models implemented in this study can be found in two repositories: DeepFace (https://github.com/serengil/deepface)46 for video analysis and 1D-Speech-Emotion-Recognition (https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition)40 for audio analysis.

Limitations

Our conversational scenario is designed using cue-based and posed conditions to standardize the conversation, ensure consistency, and achieve balanced class labels. However, such posed scenarios and our experimental setting come with acknowledged limitations:

  • The scripted dialogues may not fully capture the spontaneity and genuine emotional expression that occurs in real-life conversations.

  • Human reactions may be exaggerated or downplayed due to cue-based conversation scenarios, which could result in mismatched emotion trials.

  • General participants may lack familiarity with expressing emotions facially; for this reason, our dataset is particularly well-suited for multimodal settings in which other modalities can compensate.

  • The EEG cap could constrain facial expressions and covered the forehead area.

  • EEG cleaning methods, such as artifact rejection, spatial filtering, and normalization, were not applied, in order to maintain consistent baseline results.

  • Participants were limited to a young age group and were not native English speakers. This may bias results and reduce performance when applying language models or pretrained models.

  • The predefined emotional classes may not encompass the full spectrum of human emotions, potentially limiting the dataset’s representativeness.

Recognizing emotions from videos of conversations presents unique challenges, primarily due to the interference introduced by non-verbal cues54. In contrast, the audio signal offers a more discernible pathway to emotion identification, driven by nuanced auditory features such as pitch, tone, and voice dynamics. A review of the literature on prevalent visual-audio datasets reveals significant disparities in emotion recognition capability between the video and audio modalities. For instance, the authors of one study21 reported accuracies of 71.77% for audio-only and 47.44% for video-only modalities. A similar trend was observed in another study55, which recorded accuracies of 42.16% for video and 72.95% for audio.

In EEG-based emotion studies, participants provide ratings on scales (e.g., valence), and the ground truth is determined from these self-assessed scores12,18. In contrast, in audio/video studies with conversational scenarios, the emotion classes are mostly pre-determined, and participants are expected to align their emotions with the given scenario15,17,20. Furthermore, annotations can be created from diverse perspectives14,19. Brain-focused studies tend to focus on participants’ internal/direct feelings, whereas vision/audio studies concentrate more on their external/indirect expressions.

In our research, we aim to reduce this gap by synchronizing participants’ internal feelings with their external expressions. Participants selected the scripts that resonated most with them emotionally, and they were then asked to fully express their emotions through voice and facial expressions.

Our dataset includes two supplementary sets of labels: one based on participants’ reported arousal-valence scores and the other on a supervisor’s observations. Participant feedback may be more indicative of direct emotions, whereas the supervisor’s observations may align more with indirect emotions. Note that our dataset can be annotated differently based on these reports, catering to researchers’ interests in specific modalities and emotional foci, whether direct or indirect.

We believe that despite these limitations, our study provides valuable insights into multimodal emotion recognition in controlled conversational settings, providing a strong foundation for further research and the development of emotion recognition models.

In conclusion, this study has introduced a valuable resource in the form of the EEG-Audio-Video (EAV) dataset, encompassing EEG, audio, and video recordings collected during cue-based conversations. By addressing the critical need to understand emotional states for human-machine interfaces, we have taken a significant step toward developing AI models with human-like attributes. The EAV dataset is the first of its kind to include these three primary modalities, enabling comprehensive emotion recognition within conversational contexts. We believe that this dataset will play a pivotal role in advancing our understanding of human emotions from both neuroscience and machine learning perspectives, offering a robust foundation for future research and model development.

Future studies can build upon the foundation laid by the EAV dataset and contribute to advancements in emotion recognition, human-computer interaction, and affective computing, ultimately benefiting a wide range of applications and industries. Expanding the dataset to include participants from various cultural backgrounds can lead to insights into how emotions are expressed and recognized differently across cultures, enabling more culturally sensitive emotion recognition models. Moreover, exploring practical applications of emotion recognition, such as in online education, customer service, and mental health support, can be a relevant future research direction. Implementing and evaluating emotion recognition systems in real-world scenarios can provide valuable insights and enhance the practicality of this technology.