Abstract
Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. Capturing this complexity in full requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) captures these high-dimensional, intricate emotion structures, including their capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., from Gemini or GPT). We evaluated performance not only at the individual video level but also at the level of emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models holds at the single-item level or the coarse-category level, we applied Gromov–Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the models could infer human emotional experiences at the coarse-category level. Our results suggest that current state-of-the-art MLLMs broadly capture complex, high-dimensional emotion structures at the coarse-category level, while also revealing clear limitations in accurately capturing entire structures at the single-item level.
Introduction
Recent studies have revealed that human emotions possess an exceptionally complex and high-dimensional structure1,2. Traditionally, emotion research has focused on simplifying emotions into lower-dimensional models to make them more manageable. For instance, basic emotion theory3 proposes that human emotions can be categorized into six fundamental types, while the dimensional approach4,5 maps emotions onto a two-dimensional space defined by arousal (active–inactive) and valence (positive–negative). These models have been widely applied in fields such as facial expression recognition and continue to exert significant influence in psychology and artificial intelligence6,7. However, recent studies employing data-driven approaches1,2,8 suggest that human emotions cannot be fully captured within such low-dimensional frameworks but instead exhibit a more complex, high-dimensional structure. For example, using large-scale self-reports collected while participants watched videos, Cowen & Keltner1 identified 27 distinct emotional dimensions, and Koide-Majima et al.2 identified 18–36 brain-correlated emotion dimensions per participant using fMRI and 80 emotion categories, highlighting the subtle differences and intricate interrelationships among emotions. This evidence indicates that accurate modeling and understanding of human emotions requires a more sophisticated approach that explicitly accounts for the high-dimensional nature of emotional experiences.
Against this backdrop, an emerging approach is to leverage large language models (LLMs), particularly multimodal LLMs (MLLMs), to understand human emotions. MLLMs, which integrate the processing of text, images, and audio into LLMs, have advanced rapidly over recent years and can now handle multiple modalities beyond text alone. They have already demonstrated high performance in emotion inference tasks based on external expressions, such as facial expression recognition9,10, text-based sentiment analysis11,12,13, and tri-modal emotion recognition from speech, textual content, and facial expressions14,15. If these MLLMs can move beyond inference from external expressions and accurately replicate the complexity of human affective responses, they could serve as a valuable new tool for emotion research.
However, whether MLLMs can accurately infer the high-dimensional structure of the emotions humans experience internally, for example while watching videos, remains an open and challenging question. This is because the task requires a multi-step inference process that goes beyond simple feature extraction16. Specifically, predicting how a person will feel while watching a video involves two key steps: first, accurately recognizing what is depicted in the video, and second, reasoning about how the viewer will respond emotionally to it. This process is inherently complex and depends on multiple factors, such as narrative context and the viewer’s prior knowledge. It is therefore fundamentally different from the simple analysis of expressed emotions. While recent studies suggest that MLLMs have the potential to infer subjective sensory experiences such as color perception and auditory pitch17,18, their ability to accurately estimate more abstract, context-dependent emotions remains uncertain.
In this study, we investigated the extent to which current MLLMs can accurately predict the emotions that people experience when watching videos (Fig. 1A). To this end, we used two datasets from previous studies in which participants reported emotion ratings they experienced while watching video clips1,2. We then instructed MLLMs, including Gemini19 and GPT20, to report emotion ratings for these videos (Fig. 1A), and assessed how well the models’ predicted emotions matched the human ratings.
In evaluating these ratings from humans and models, we not only examined agreement in emotion ratings for each video but also focused on patterns and relationships across multiple videos – that is, the emotion structure (Fig. 1B). Emotion structure refers to the relational structure of the emotions elicited by videos; in this study, these relations are the similarities or dissimilarities among the multidimensional emotional responses evoked by different videos. Figure 1B visually represents these relationships by mapping each video’s emotion ratings into a multidimensional space. For instance, Video 1 (dog) and Video 3 (cat) are both associated with joy and are therefore located close to each other in the space. In contrast, Video 2 (insect), which evokes horror, is positioned further away. The spatial distances and distribution patterns among these videos collectively represent the emotion structure.
The reason we focused on comparing emotion structures across multiple videos rather than performing direct, one-to-one comparisons of emotion ratings is that people and models can differ in how they interpret and use emotion terms. For example, even a seemingly straightforward emotion like “joy” may be used differently for emotion ratings by different people or computational models. Indeed, previous research has shown that emotion ratings are influenced by individual cognitive tendencies and model-specific biases, leading to considerable variability among humans21,22 and among computational models23. In contrast, focusing on emotion structures allows us to abstract away from differences in the specific use of emotion terms and instead concentrate on relative similarity or dissimilarity among emotional responses. This structural approach primarily evaluates whether relationships between emotional content – such as the distinction between “joy” and “horror” videos – are consistently represented, irrespective of the exact emotion terms used to express the emotions elicited by these videos. Thus, this method enables us to assess how accurately models capture human emotional recognition patterns based on relational structures, and minimizes confounding related to terminological variability.
In this study, we explored similarities and differences in emotion similarity structures by applying two complementary comparison strategies, supervised and unsupervised, following the methodologies proposed in previous studies24,25,26. As shown schematically in Fig. 1C and D, the supervised approach assumes a fixed one-to-one correspondence between emotional responses to the same videos. Conventional representational similarity analysis (RSA) takes this approach and evaluates the similarity between structures by computing the correlation between representational dissimilarity matrices (RDMs) of different domains (e.g., humans and models). In contrast, the unsupervised approach, specifically Gromov-Wasserstein Optimal Transport (GWOT), which we use in this study, searches for the mapping that best aligns the two structures purely from their internal relational geometry, allowing different elements to be optimally paired. For example, in Fig. 1D, ghost aligns with skull and dog with cat, resulting in categorical rather than item-level agreement. When such mappings occur, we may conclude that the two structures are “categorically” matched: joyful videos – dog, cat, baby – are correctly mapped to the same category, and horror videos – ghost, skull, insect – are likewise correctly mapped to the same category, but the mapping is not a fine, item-level one-to-one correspondence. The juxtaposition of these methods allows us to distinguish fine item-level alignment from coarser category-level alignment (see previous studies24,25 for details), thus providing a comprehensive assessment of how well model-predicted emotion structures match human judgments, from individual videos to higher-order abstract organizations.
Overview of the analytical framework for comparing similarity structures of emotions across two domains (e.g., humans vs. model). A Acquisition of emotion ratings. Participants and models watch a series of video clips and report emotion ratings on multiple dimensions, such as calmness, joy, horror, anger. The elements of the matrix represent the intensity of each emotion category for each video reported by participants or models. B Emotion structures. Each video’s emotion ratings, as reported by humans and models, are represented as points in a multidimensional space to illustrate the relational structure of emotions (emotion structure). The points corresponding to videos that evoke similar emotional responses, such as Video 1 (dog) and Video 3 (cat) associated with joy, are positioned closer together, while videos eliciting distinct emotions, such as Video 2 (insect) associated with horror, are placed further apart. Dissimilarity between videos is represented by distance, namely the black lines between points. C Supervised comparison of emotion structures. Supervised comparison of emotion structures between two domains based on fixed mapping between the same videos, which is represented by blue lines. D Unsupervised comparison of emotion structures. A conceptual illustration of unsupervised comparison based on Gromov-Wasserstein Optimal Transport (GWOT), which searches for optimal mappings based solely on internal relations (RDMs). The optimal mappings are shown as red lines. In the figure, groups of videos that evoke similar emotions (categories) are surrounded by gray outlines. In this case, the mappings are categorical but not exact at the fine-item level, e.g., ghost is mapped to skull and dog is mapped to cat, but these are appropriately paired within the same category. Emoji graphics from Twemoji, licensed under CC BY 4.0 by Twitter, Inc. and other contributors (https://creativecommons.org/licenses/by/4.0/).
Results
Datasets and analysis methods
In the following analysis to investigate the structures of human emotions, we used two datasets from previous studies: Koide-Majima et al. (2020) and Cowen and Keltner (2017). Both datasets are subjective emotion ratings of human participants while watching short video clips (see Methods for the details).
The following results are organized as follows. First, before evaluating the MLLMs, we examined whether there were common emotion structures across human participants. The degree of commonality between human participants can be regarded as an approximate upper bound on the degree of commonality we can expect between actual human emotion structures and those inferred by MLLMs. For this analysis we used data from Koide-Majima et al. (2020) only, due to the limited data availability of Cowen and Keltner (2017). We then evaluated the degree of similarity between the emotion structures of humans and those inferred by MLLMs using both datasets.
To evaluate the degree of commonality in emotion structures between human participants or between humans and models, we performed two types of analyses on each dataset. The first analysis focused on the similarity of emotion reports for individual videos, while the second analysis considered the overall similarity structure of emotions elicited by all videos, rather than each video in isolation. For the second analysis of the emotion similarity structure, we employed two metrics: the correlation between emotion similarity structures, i.e., conventional Representational Similarity Analysis (RSA), which assumes a correspondence between videos; and Gromov–Wasserstein Optimal Transport (GWOT), which does not assume such a correspondence (see Methods for details). The advantage of GWOT is that it finds the optimal mapping between emotion structures without using video labels; based on the optimal mapping, we can then check whether the same videos correspond to each other in the emotion structures of humans and models. We evaluated the matching rate of one-to-one correspondences between individual videos to assess whether the structures match at the fine-item level. Because the structures may fail to match at the fine-item level yet still match at the coarse-category level, we also evaluated the matching rate of correspondences between groups of videos (categories).
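As a concrete illustration of the unsupervised comparison, the sketch below shows how a GWOT alignment and the one-to-one matching rate could be computed with the POT (Python Optimal Transport) library. This is a minimal sketch under stated assumptions, not the exact pipeline used in this study: the variable names are hypothetical, uniform mass over videos is assumed, and the entropic regularization strength is fixed here, whereas a full analysis would typically treat it as a hyperparameter to be tuned.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.spatial.distance import pdist, squareform

def gwot_align(ratings_a, ratings_b, epsilon=0.02):
    """Align two emotion structures without using video labels.

    ratings_a, ratings_b: (n_videos, n_emotions) rating matrices
    (hypothetical inputs, e.g., two participant groups, or humans vs. a model).
    Returns the optimal transportation plan and the one-to-one matching rate.
    """
    # Representational dissimilarity matrices based on cosine distance
    rdm_a = squareform(pdist(ratings_a, metric="cosine"))
    rdm_b = squareform(pdist(ratings_b, metric="cosine"))

    n = rdm_a.shape[0]
    p = np.full(n, 1.0 / n)  # uniform mass over videos in domain A
    q = np.full(n, 1.0 / n)  # uniform mass over videos in domain B

    # Entropic Gromov-Wasserstein optimal transport between the two RDMs
    plan = ot.gromov.entropic_gromov_wasserstein(
        rdm_a, rdm_b, p, q, loss_fun="square_loss", epsilon=epsilon
    )

    # One-to-one matching rate: a video counts as matched if the largest entry
    # in its row of the transportation plan lies on the diagonal (an
    # approximation of the non-zero-diagonal criterion described in the text).
    matching_rate = np.mean(np.argmax(plan, axis=1) == np.arange(n))
    return plan, matching_rate
```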
Commonality of emotion structure between different participant groups
In this section, we examine the extent to which different groups of participants shared a common emotion structure, using data from Koide-Majima et al. (2020)2 (Fig. 1A). In the dataset, participants watched 550 short video clips and continuously rated the intensity of their emotional responses on a scale from 0 to 100. Each participant was assigned to one or two specific emotion categories (e.g., “joy” or “fear”) and evaluated only the assigned emotions while watching all videos. The assignment was designed such that each of the 80 emotion categories was rated by four different participants. As a result, each emotion category had four independent sets of ratings, yielding data for 80 distinct emotion categories in total (see Methods and Table 5 for details).
To evaluate the consistency of emotion structures across participants, we randomly split the data within each emotion category into two groups and treated each group as a “pseudo-participant.” Each group consisted of two participants per category, and the ratings were averaged within each group to produce a representative profile for that category. Since most participants were assigned to two emotion categories, it was possible for the same individual to appear in different groups across different categories. However, we ensured that no participant’s data appeared in both groups within the same emotion category. This grouping procedure prevented artificial inflation of similarity between groups due to individual-level bias.
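To make the grouping procedure concrete, the following is a minimal sketch of how the split and within-group averaging could be implemented; the array layout (emotion category × rater × video) and the seeded random split are illustrative assumptions rather than the exact code used.

```python
import numpy as np

def make_pseudo_participants(ratings, seed=0):
    """Split raters into two pseudo-participant groups per emotion category.

    ratings: array of shape (n_categories, 4, n_videos) holding the four
    independent ratings per emotion category (hypothetical layout).
    Returns two (n_videos, n_categories) matrices, one per group, with the
    two raters in each group averaged; the split is disjoint within each
    category, so no rater contributes to both groups for the same emotion.
    """
    rng = np.random.default_rng(seed)
    n_cat, n_raters, n_videos = ratings.shape
    group1 = np.empty((n_videos, n_cat))
    group2 = np.empty((n_videos, n_cat))
    for c in range(n_cat):
        order = rng.permutation(n_raters)            # random split of the 4 raters
        group1[:, c] = ratings[c, order[:2]].mean(axis=0)
        group2[:, c] = ratings[c, order[2:]].mean(axis=0)
    return group1, group2
```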
Similarity between emotion ratings for each video
Histogram of the Pearson correlation for each video clip between human ratings in the Koide-Majima et al. dataset. The blue histogram represents the distribution of the correlation between the ratings of Participant group 1 and Participant group 2 for each video, and the gray histogram represents the distribution of the correlation between the Participant group 1 ratings and the shuffled Participant group 1 ratings, which served as the null distribution. The dashed line represents the mean of the correlation, 0.313, between the ratings of Participant group 1 and group 2 (blue histogram).
The analysis of similarity between the emotion ratings of the two participant groups for each video revealed that the ratings are relatively consistent between participant groups (Fig. 2). The blue histogram represents the distribution of Pearson correlation coefficients between the two groups (Participant group 1 and group 2) for each video, with the mean correlation of 0.313 shown by the dashed line. To assess statistical significance, we also estimated chance-level correlation values by computing Pearson correlation coefficients between the responses of Participant group 1 and the shuffled responses of Participant group 1. The gray histogram shows the distribution obtained from a single shuffling of the responses (see Methods for details). Cohen’s D between the distribution of correlation coefficients (blue) and the null distribution from the one-time shuffled data (gray) was 2.33; Cohen’s D computed from 1,000 shuffles was 2.64, indicating a large separation between the original distribution and the null distribution. Although the correlation values are clearly above chance, the average value of 0.313 is moderate rather than high, indicating that considerable individual differences in emotion ratings exist between participant groups at the level of individual videos.
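The per-video analysis and its shuffle-based null distribution can be sketched as follows. This is illustrative only: the input matrices and function names are hypothetical, the null follows the scheme described above (correlating group 1 ratings with shuffled group 1 ratings), and pooling all shuffles into one null distribution for Cohen’s d is one of several reasonable choices.

```python
import numpy as np
from scipy.stats import pearsonr

def per_video_agreement(x1, x2, n_shuffles=1000, seed=0):
    """Per-video correlation between two sets of emotion ratings.

    x1, x2: (n_videos, n_emotions) matrices (e.g., the two pseudo-participant
    groups). Returns the observed per-video correlations, the pooled null
    distribution, and Cohen's d between the two distributions.
    """
    rng = np.random.default_rng(seed)
    n_videos = x1.shape[0]
    observed = np.array([pearsonr(x1[i], x2[i])[0] for i in range(n_videos)])

    null_all = []
    for _ in range(n_shuffles):
        perm = rng.permutation(n_videos)   # pair each video with a shuffled video
        null_all.extend(pearsonr(x1[i], x1[perm[i]])[0] for i in range(n_videos))
    null_all = np.asarray(null_all)

    # Cohen's d between the observed and pooled null correlation distributions
    pooled_sd = np.sqrt((observed.var(ddof=1) + null_all.var(ddof=1)) / 2)
    d = (observed.mean() - null_all.mean()) / pooled_sd
    return observed, null_all, d
```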
Similarity between similarity structures of evoked emotions from all videos
We performed two analyses, conventional RSA and GWOT, to evaluate the similarity between the entire similarity structures formed by the emotion ratings of all videos between the participant groups.
Unsupervised comparison of the similarity structures for all videos in the Koide-Majima et al. dataset between Participant group 1 and group 2 based on Gromov-Wasserstein Optimal Transport (GWOT). A Representational Dissimilarity Matrices (RDMs) of Participant group 1 and group 2. The elements of the RDMs represent the dissimilarity between the emotion ratings of the videos, quantified by cosine similarity. B Optimal transportation plan obtained by GWOT between the RDMs of Participant group 1 and group 2. Green lines represent the category boundaries of the videos.
Correlation between similarity structures. To evaluate the similarity of the emotion structures, we first calculated the correlation between the similarity structures of the emotion reports for all videos across participant groups. Figure 3A shows the representational dissimilarity matrices (RDMs) for Participant groups 1 (top) and 2 (bottom). To create the RDMs, we used cosine similarity to measure the similarity between the emotion ratings of the videos. We found that the correlation coefficient between the RDMs was markedly high, 0.859, which significantly exceeds the chance level, estimated via 1,000 shuffles with a 95% percentile interval (\([-0.000509, 0.00108]\), Table 2). The similarity at the level of representational structures is thus markedly higher than the similarity at the level of individual videos (0.31 on average) evaluated in the previous section.
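A minimal sketch of this RDM-level comparison, including a permutation-based chance level, is given below; the cosine-distance RDM and the row/column shuffling scheme follow the description above, while the function and variable names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

def rsa_correlation(ratings_a, ratings_b, n_shuffles=1000, seed=0):
    """Supervised (RSA-style) comparison of two emotion structures.

    ratings_a, ratings_b: (n_videos, n_emotions) matrices. The RDMs are
    cosine-distance matrices, and only their off-diagonal entries are
    correlated. Returns the observed Pearson r and a permutation null
    obtained by shuffling the video order of one RDM.
    """
    rng = np.random.default_rng(seed)
    tri_a = pdist(ratings_a, metric="cosine")            # condensed RDM of A
    rdm_b = squareform(pdist(ratings_b, metric="cosine"))

    r_obs, _ = pearsonr(tri_a, squareform(rdm_b, checks=False))

    null = []
    n = rdm_b.shape[0]
    for _ in range(n_shuffles):
        perm = rng.permutation(n)
        permuted = rdm_b[np.ix_(perm, perm)]             # shuffle rows and columns
        null.append(pearsonr(tri_a, squareform(permuted, checks=False))[0])
    return r_obs, np.asarray(null)
```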
GWOT: one-to-one matching. Next, by performing GWOT, we found that the one-to-one matching rate of the optimal transportation plan is also high (see Fig. 3). As shown in the optimal transportation plan (Fig. 3B), many diagonal elements have non-zero values. To be precise, the matching rate, i.e., the percentage of non-zero diagonal elements (see Methods for details), is 16.36%, which significantly exceeds both the theoretical chance level (0.182%) and the empirically measured chance level, estimated via 10 shuffles with a 95% percentile interval (\([1.01\%, 1.23\%]\), Table 2). This means that even at the fine-item level and under this strict unsupervised alignment condition, the emotion structures of the participant groups have some degree of commonality: 16.36% of videos are mapped to themselves without any supervision. This level of agreement can be treated as a rough estimate of the upper bound on the agreement between humans and models evaluated in subsequent sections.
GWOT: category matching. Finally, by evaluating the matching rate of the optimal transportation plan obtained by GWOT at the coarse-category level, we found that the category matching rate is also high (Table 2 and Fig. 3). For the analysis of category matching, we classified the videos into 10 categories via hierarchical clustering of the participants’ emotion reports (see Methods for details). In Fig. 3, the rows are sorted according to the hierarchical clustering results, and the green lines indicate category boundaries. The non-zero values in the optimal transportation plan (Fig. 3) concentrate within the diagonally outlined boxes (green lines), indicating a high degree of category alignment – 66.18% – which is significantly higher than the empirical chance level (\([17.3\%, 19.4\%]\), Table 2).
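A sketch of how the coarse categories and the category matching rate could be derived is shown below; Ward linkage on the human rating vectors is an illustrative assumption (the exact clustering settings are described in Methods), and the transportation plan is assumed to come from a GWOT alignment such as the one sketched earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def category_labels(human_ratings, n_categories=10):
    """Cluster videos into coarse categories from human emotion reports.

    human_ratings: (n_videos, n_emotions) matrix. Ward linkage on the raw
    rating vectors is an assumption for illustration.
    """
    tree = linkage(human_ratings, method="ward")
    return fcluster(tree, t=n_categories, criterion="maxclust")

def category_matching_rate(plan, labels):
    """Fraction of videos whose best GWOT match falls in the same category.

    plan: (n, n) optimal transportation plan; labels: (n,) category labels.
    """
    best = np.argmax(plan, axis=1)       # best-matching video in the other domain
    return np.mean(labels[best] == labels)
```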
This result means that a common emotion structure between different participant groups is also present at the coarse-category level. Similarly to the case with one-to-one matching, this level of agreement can be treated as a rough estimate of the upper bound of agreement between humans and models.
Evaluation of MLLM’s estimation of human emotion structure for data from Koide-Majima et al. (2020)
To assess whether an MLLM can infer the high-dimensional similarity structure of human emotions, we compared the similarity structure of emotion ratings obtained from an MLLM with that of humans. We specifically used Gemini, selected according to several criteria (see Methods, ‘Selection of Multimodal LLMs’, for details). Using the same 550 videos employed in the human psychology experiments, we obtained Gemini’s estimates of the emotion intensities for each video based on the prompt described in the Methods section.
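The sketch below illustrates how such per-video emotion estimates might be collected with the google-generativeai Python SDK. It is a hypothetical example: the prompt shown is a placeholder rather than the prompt described in Methods, the emotion list is truncated, and API usage details (credentials, rate limits, response parsing) are simplified or omitted.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder credential
model = genai.GenerativeModel("gemini-2.0-flash")

# Placeholder prompt; the actual prompt used in this study is given in Methods.
PROMPT = ("Watch the video and rate, on a scale from 0 to 100, how strongly it "
          "would make a viewer feel each of the following emotions: "
          "joy, fear, calmness, ... Return one number per emotion as JSON.")

def rate_video(path):
    """Upload one video clip and request emotion-intensity estimates."""
    video = genai.upload_file(path=path)
    while video.state.name == "PROCESSING":      # wait until the file is ready
        time.sleep(2)
        video = genai.get_file(video.name)
    response = model.generate_content([video, PROMPT])
    return response.text                         # parse the JSON downstream

ratings_text = rate_video("273.mp4")             # hypothetical local file path
```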
Similarity between emotion ratings on each video
Analysis based on the correlation calculated for each video revealed that Gemini’s emotion estimates exhibit a degree of agreement with human ratings that is significantly higher than the chance level (Fig. 4). Specifically, the blue histogram shows the distribution of Pearson correlation coefficients between Gemini’s estimates and the participant ratings for each video, while the gray histogram represents the distribution of correlations obtained after shuffling the human ratings. The dashed line indicates the mean correlation of the original data (0.374), and Cohen’s D computed from 1,000 shuffles was 3.23, demonstrating a substantial difference between the original distribution and the null distribution. Furthermore, when comparing the distribution of correlations between Gemini’s estimates and the participant ratings with the correlations observed between different participant groups (Figs. 2 and 4), the correlation between Gemini and the participants was found to be slightly higher than that observed among the participant groups. These results indicate that – at least at the level of emotion ratings for each video – Gemini can provide human-like emotion ratings, with variability comparable to that among human participants.
The histograms show the Pearson correlation of each video clip between the human ratings and Gemini’s estimation in the Koide-Majima et al. dataset. The blue histogram represents the distribution of the correlation between the human ratings and Gemini’s estimation for each video, and the gray histogram represents the distribution of the correlation between the human ratings and the shuffled human ratings, which served as the null distribution. The dashed line represents the mean of the correlation, 0.374, between the human ratings and Gemini’s estimation (blue histogram).
Well-estimated and poorly-estimated videos. To gain insight into possible reasons why some videos yielded higher similarity and others did not, we examined the content of the top 25 and bottom 25 ranked videos in terms of Pearson correlation. We selected five representative videos that were particularly typical in content, and summarized them in Table 1.
As Table 1 shows, Gemini appears to estimate emotions with high accuracy for videos that can evoke a strong emotional response from just a single frame. Specifically, in 273.mp4 and 477.mp4, which feature infants or pets (e.g., dogs or cats), the model’s predictions of emotions such as “cuteness” and “love” closely matched the human participants’ reports of “adorableness” and “calmness.” Likewise, in 048.mp4 and 527.mp4, which depict a dark setting with a black-haired woman wearing a white dress, even a single frame can strongly evoke “fear,” and the model accurately reproduced participants’ reports of “fear.” Additionally, 024.mp4 shows a scene where surgical scissors and forceps are used to make an incision in the eyelid, and the model most strongly predicted the emotion “empathic pain.” Such videos are presumably those in which visual features alone nearly fully determine the emotion, requiring minimal contextual or narrative information, which in turn likely facilitates the model’s accurate representation of the emotion structure.
By contrast, an analysis of the bottom five videos with the lowest correlation revealed two main reasons why Gemini struggled to accurately estimate emotions. First, videos that require contextual interpretation tended to pose difficulties for the model. In particular, when a video’s emotional meaning could shift dramatically depending on the context, the model often relied too heavily on surface-level visual features and produced incorrect estimates. For example, in 188.mp4, a man is seen kissing a pregnant woman on the neck, followed by a scene in which he forcefully pushes her against a wall multiple times. While a single frame might appear romantic, the full video conveys an impression of sexual violence and threat, evoking discomfort and fear. In fact, whereas participants reported negative emotions, the model predicted high intensities of “romantic” and “sexy.” This shows that the model made its inference based solely on the visual feature of “a couple kissing,” and failed to recognize the contextual cues that indicate the woman was in danger. Second, in short videos that provide insufficient information for even human viewers to comprehend the narrative and evoke appropriate emotions, the model likewise exhibited a decline in emotion prediction accuracy. For instance, 292.mp4 is part of a movie trailer that depicts a man and woman in a decaying urban setting. However, based on the brief video alone, it is difficult to determine whether the story is about romance, a runaway, or something else entirely. Even human viewers may feel unsure about which emotion to report, and such ambiguity can lead to inconsistencies in participants’ responses and lower prediction accuracy for the model.
Taken together, these findings show that while MLLMs demonstrate high accuracy when emotions can be inferred directly from visual features, their performance remains limited in situations that require contextual understanding.
Similarity structure of all videos
Next, to evaluate the similarity between the human and the model’s similarity structures, we performed the same sets of analyses as we did when comparing the emotion structures of different participant groups in the previous sections. In terms of correlation on each video in the previous section, it was shown that Gemini predicted human emotions with higher accuracy than the level of agreement typically observed among human participants (Figs. 2 and 4). However, higher correlations for each video do not necessarily mean that the overall emotion structures of all videos are similar between the model and humans. This section focuses on the structural level comparison between the model and humans based on conventional RSA and GWOT.
Correlation between similarity structures. We observed a moderately high degree of correlation between the representational dissimilarity matrices (RDMs) of all videos for humans and Gemini (Fig. 5A). The top panel in Fig. 5A depicts the participants’ RDM and the bottom panel depicts Gemini’s RDM, both using cosine similarity as the metric. Pearson’s correlation coefficient between the two RDMs is 0.558, which is significantly higher than the chance level (\([-0.000509, 0.00108]\), Table 2), although this value is lower than the correlation coefficient between different participant groups (0.859).
GWOT: one-to-one matching. Although the correlation between the RDMs of humans and Gemini is moderately high (0.558), we found that the one-to-one matching rate from GWOT is low, at 2.36% (Table 2). As can be seen in the optimal transportation plan (Fig. 5B), the high values are scattered away from the diagonal elements, resulting in the low matching rate. Although this rate is significantly higher than both the theoretical chance level (0.182%) and the empirically measured chance level (\([1.01\%, 1.23\%]\), Table 2), it still remains substantially below the matching rate between participant groups (16.36%).
GWOT: category matching. Despite the low one-to-one matching rate, we found that the category matching rate is reasonably high. To analyze category matching, we classified the videos into 10 categories via hierarchical clustering of the participants’ emotion reports (see Methods for details). From Fig. 5B, we observe that the high values are concentrated within the diagonally aligned boxes outlined by the green lines, indicating a high category matching rate. In Fig. 5A and B, the rows are sorted according to the hierarchical clustering results, and the green lines represent the category boundaries. The model’s category matching rate across all videos based on the optimal transportation plan is 50.5%, which clearly exceeds the chance level (\([17.3\%, 19.4\%]\), Table 2) and is slightly lower than the category matching rate between participant groups (66.18%).
Similarity structure of selected videos
While the results in the previous section indicate that the model does not adequately capture the emotion structure of all videos at the fine-item level, we conducted further analyses on videos with high similarity scores to investigate whether the model might capture the structure of a subset of the videos. To this end, we selected the top 100 and top 250 videos in which the model’s predictions showed the highest correlation with human ratings in Fig. 4, and conducted the same set of analyses as performed for all videos. Note that the purpose of this additional analysis was not to claim that the GWOT matching rate or RSA correlation increases when videos are selected – some increase is expected by construction – but rather to use these values as a measure of partial structural alignment that can be compared across analyses and models.
Focusing on the top 100 and top 250 videos, the model’s predicted ratings demonstrated a high degree of structural agreement with human ratings (Fig. 5C, D and Table 2). Inspection of the transport matrices shows that many diagonal elements have non-zero values, indicating a high one-to-one matching rate, and that most of the non-zero elements fall within the green, diagonally outlined regions, indicating a high category matching rate. Specifically, the one-to-one matching rate was 17.0% for the top 100 videos and 8.4% for the top 250 videos, considerably exceeding the chance levels (\([4.90\%, 5.90\%]\) for the top 100 videos and \([2.28\%, 2.60\%]\) for the top 250 videos, Table 2). The category matching rate reached 69.0% for the top 100 videos and 71.6% for the top 250 videos, likewise far surpassing the chance levels (\([19.1\%, 25.5\%]\) for the top 100 videos and \([18.2\%, 21.1\%]\) for the top 250 videos, Table 2). These results show that, for the selected videos, the emotion structures of humans and the model are similar enough that unsupervised mapping of the emotion structure could be performed with relatively high accuracy.
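The selection step itself is straightforward; a minimal sketch is shown below, where per_video_r refers to the hypothetical per-video correlations from the earlier sketch, and the returned submatrices can be fed back into rsa_correlation and gwot_align as defined above.

```python
import numpy as np

def top_k_subset(per_video_r, human_ratings, model_ratings, k=100):
    """Select the k videos with the highest human-model per-video correlation.

    per_video_r: (n_videos,) array of Pearson correlations.
    Returns the human and model rating matrices restricted to those videos.
    """
    idx = np.argsort(per_video_r)[::-1][:k]      # indices of the top-k videos
    return human_ratings[idx], model_ratings[idx]
```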
Unsupervised comparison of human similarity structures of videos in the Koide-Majima et al. dataset with similarity structures estimated by Gemini based on Gromov-Wasserstein Optimal Transport (GWOT). A The Representational Dissimilarity Matrices (RDMs) of the human participants and Gemini. The elements of the RDMs represent the dissimilarity between the emotion ratings of the videos, quantified by cosine similarity. B Optimal transportation plan obtained by GWOT between the human and Gemini RDMs. Green lines represent the category boundaries of the videos. C Optimal transportation plan for the selected top 100 videos. D Optimal transportation plan for the selected top 250 videos.
Evaluation of MLLM’s estimation of human emotion structure for data from Cowen & Keltner (2017)
To further investigate the MLLMs’ performance in emotion rating, we conducted the same sets of analyses using a different dataset, from Cowen & Keltner1. In this dataset, 2185 short video clips (averaging about 5 seconds in length and containing no audio) were presented to participants, who reported multiple emotions from 34 emotion categories while watching each clip. As with the Koide-Majima et al. (2020) dataset, we provided the full video input to Gemini-2.0-flash, since it can directly handle entire video sequences. Moreover, because these videos are relatively short, we also adopted an approach in which we extracted six frames per video and presented them as input to GPT-4.1 and Molmo-7B-D, which do not accept raw video (see Methods for details). This design enabled us to compare how different MLLMs beyond Gemini perform under the same dataset conditions. In the following sections, we primarily show results for Gemini and then present the results from the other models for comparison.
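For the models that do not accept raw video, frame extraction could be implemented as in the sketch below using OpenCV. The six evenly spaced frames follow the description above, while the sampling scheme and JPEG/base64 encoding are illustrative assumptions about how frames would be passed to an image-only API.

```python
import base64
import cv2  # OpenCV
import numpy as np

def extract_frames(video_path, n_frames=6):
    """Extract n evenly spaced frames and return them as base64 JPEG strings,
    a common input format for image-only MLLM APIs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:                                   # failed to open or empty video
        cap.release()
        return []
    indices = np.linspace(0, total - 1, n_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))   # seek to the target frame
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
    cap.release()
    return frames
```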
Using a prompt similar to the one used for the Koide-Majima et al. dataset (shown in Methods), we obtained the intensity of emotion ratings from Gemini for all video clips in the Cowen & Keltner (2017) dataset. While Gemini provided responses for all 2,185 videos, one video was excluded from further analysis because all of its emotion scores were rated as zero, making it impossible to compute cosine similarity with other videos. As a result, 2,184 videos were used for subsequent analyses. In the following, we quantitatively evaluate the similarity between the emotion structures of humans and Gemini, using conventional RSA and GWOT.
Similarity between emotion ratings for individual videos
By analyzing the correlation on a per-video basis, we found that Gemini’s emotion estimates align with participants’ ratings at a level exceeding chance (Fig. 6A). The blue histogram shows the distribution of Pearson correlations between Gemini and human ratings for each video, whereas the gray histogram corresponds to correlations obtained after shuffling the human ratings. The dashed line marks the mean correlation of the original data (0.553), and Cohen’s D calculated over 1000 shuffles was 3.18. Compared to the Koide-Majima et al. dataset, this higher mean correlation and Cohen’s D suggest that Gemini estimates video-specific emotions more accurately for this dataset.
Well-estimated and poorly-estimated videos. Similarly to the Koide-Majima et al. dataset, we examined the content of the top 25 well estimated videos and those of the bottom 25 badly estimated videos. We selected five representative videos that were particularly typical in content, and summarize them in Table 3.
Examination of the top 25 videos revealed, consistent with insights from the Koide-Majima et al. dataset, that Gemini exhibits high estimation accuracy for video content that evokes strong emotional responses from a single frame. Specifically, in 1487.mp4 and 1828.mp4, where insects are shown gathering in dirty places, Gemini was able to predict the emotion of “disgust” reported by participants. Furthermore, in 0073.mp4, which depicts a needle piercing human skin, and 1362.mp4, which shows a man covered in blood, the model successfully inferred emotions such as “disgust” and “horror.” These results are consistent with the previous section’s conclusion that videos whose emotions can be clearly triggered by visual information alone tend to yield higher estimation accuracy.
In contrast, an analysis of the bottom 25 videos with the lowest correlation revealed two main reasons why Gemini may have failed to accurately estimate emotions–similar to the findings with the Koide-Majima et al. dataset. First, Gemini tended to struggle with videos that required contextual understanding or background reasoning. For instance, in 2130.mp4, a person narrowly avoids being hit by a car while sledding. While participants predominantly reported feelings of “relief,” the model’s strongest prediction was “amusement.” This discrepancy shows that the model had difficulty capturing the contextual nuance of a near-miss accident. Additionally, in 0882.mp4, a player attempts to high-five a teammate but is ignored. While participants commonly reported the emotion “awkward,” the model predicted emotions such as “interest” and “admiration.” This shows that the model struggled to interpret the subtle social context of the discomfort or embarrassment resulting from a rejected social interaction. Second, some videos make it difficult even for human viewers to pinpoint a single, clear emotion. For example, 0664.mp4 (featuring birds hung upside down) and 0888.mp4 (depicting writhing tentacles) induce a vague sense of discomfort, yet participants themselves may be uncertain which emotion to report. Such ambiguity can lead to substantial variability in human ratings, thereby complicating the model’s ability to accurately estimate emotions.
Similarity structure of all and selected videos
To evaluate the similarity structure between participants and Gemini, we conducted two analyses, RSA and GWOT (Fig. 6B, C, D, Table 4).
Correlation between similarity structures. We observed a moderately high degree of correspondence between the representational dissimilarity matrices (RDMs) of humans and Gemini for all videos in this dataset (Fig. 6B). The left panel in Fig. 6B shows the participants’ RDM and the right panel shows the RDM of Gemini; both use cosine similarity as their metric. Notably, this dataset is approximately four times larger than the Koide-Majima et al. dataset, yet we still obtained a correlation coefficient of 0.555 – a reasonably high value for such a sizable collection of videos. The Pearson correlation of 0.555 is well above chance (Table 4), showing that Gemini can approximate the overall emotion structure at an even larger scale than that examined in the Koide-Majima et al. dataset.
GWOT: one-to-one matching. Despite the high correlation of the RDMs between humans and Gemini (0.555), we found that the one-to-one matching rate was low, at 1.69% (Fig. S1, Table 4), which is close to the chance level (0.229%).
We then also evaluated the similarity structures of selected videos, using the top 250 and 750 videos by Pearson correlation to investigate the possibility that Gemini accurately captures the similarity structures for selected videos (Fig. 6C, D). Both panels show that there are many non-zero values at the diagonal elements and the matching rate is 18.8 % for the top 250 videos and 7.47 % for the top 750 videos, which are significantly higher than the chance level (\([3.01\%, 3.46\%]\) and 0.667%, respectively).
These results show that while Gemini did not estimate the overall fine-item structure of all the videos well, it estimated the structure of a subset of videos well enough that unsupervised mapping was possible to some extent. This finding is consistent with the results from the Koide-Majima et al. dataset and strengthens the robustness of the conclusion that Gemini is able to estimate the high-dimensional structure of human emotions.
GWOT: category matching. Despite the low one-to-one matching rate for all videos, we found that the category matching rate was reasonably high, as was similarly observed in the analysis of the Koide-Majima et al. dataset. For the analysis of category matching, we classified the videos into 10 categories via hierarchical clustering of the participants’ emotion reports (see Methods for details). In Fig. 6B–D and Fig. S1, the rows are sorted according to the hierarchical clustering results, and the green lines indicate category boundaries. As shown in Figs. S1, 6C and D, the matching values are concentrated within the diagonally outlined boxes (green lines), indicating a high level of category alignment. The category matching rates were 78.8% for the top 250 videos and 69.2% for the top 750 videos, both substantially above the chance levels (\([19.1\%, 25.4\%]\) and 21.2%, respectively; see Table 4). The category matching rate for all videos was 54.6%, which also significantly exceeds the chance level (15.6%; see Table 4). These findings demonstrate that Gemini broadly captured the emotion structure of the videos at the coarse-category level.
Comparison of the human similarity structures of videos in the Cowen & Keltner dataset with the similarity structures estimated by Gemini. A Histograms of the Pearson correlation of each video clip between the human ratings and Gemini’s estimation. The blue histogram represents the distribution of correlation between the human ratings and Gemini’s estimation for each video, and the gray histogram represents the distribution of correlation between the human ratings and the shuffled human ratings, which served as the null distribution. The dashed line represents the mean of the correlation, 0.553, between the human ratings and Gemini’s estimation (blue histogram). B Representational Dissimilarity Matrices (RDMs) of the human participants and Gemini. The elements of the RDMs represent the dissimilarity between the emotion ratings of the videos, quantified by cosine similarity. C Optimal transportation plan obtained for the selected top 250 videos by GWOT between the human and Gemini RDMs. Green lines represent the category boundaries of the videos. D Optimal transportation plan for the selected top 750 videos.
Comparison with other MLLMs
In this section, we conducted the same experiment using GPT-4.127 and Molmo-7B-D28, two of the latest MLLMs, to evaluate their emotion estimation accuracy. Since these models cannot process video input directly, we adopted an approach where multiple frames were extracted from each video and input as still images to enable a fair comparison (See Methods, ‘Selection model for data from Cowen & Keltner (2017)’, for details). As a result, we obtained responses for 2184 videos from GPT and 1973 videos from Molmo, and used these data for further analysis.
GPT-4.1 achieved a performance roughly comparable to or slightly below that of Gemini (Table 4). For instance, its RSA value was 0.486 – about 0.06 lower than Gemini’s. When focusing on the top 250 or top 750 videos selected based on the correlation between each model’s predictions and human ratings, GPT-4.1 achieved one-to-one matching rates of 16.8% for the top 250 and 4.67% for the top 750 videos, and category-level matching rates of 68.6% for the top 250 and 57.6% for the top 750 videos. While these values are slightly lower than those of Gemini, they can still be considered very favorable results. These observations indicate that although GPT-4.1 does not achieve the same level of accuracy as Gemini, it appears to capture human emotion structure to a reasonable degree in one-to-one matching and to quite a high degree in category matching.
Meanwhile, although Molmo-7B-D exceeded the chance level on all evaluation metrics, its accuracy remained lower than that of Gemini and GPT-4.1 (Table 4). For example, although its RSA value was 0.229 – over 0.2 points below Gemini’s – it was still clearly above the chance level. Furthermore, in analyses focusing on the top 250 and top 750 videos selected based on the correlation between each model’s predictions and human ratings, Molmo-7B-D’s performance in category matching was 36.8% for the top 250 videos and 36.9% for the top 750 videos. Although these values exceed the chance level (\([19.1\%, 25.4\%]\) and 21.2%, respectively), they remain lower than those of Gemini and GPT-4.1.
Discussion
This study investigated the extent to which modern multimodal large language models (MLLMs) can reproduce the high-dimensional structure of human emotions. Several clear findings were obtained. First, in the supervised approach using Representational Similarity Analysis (RSA), state-of-the-art models such as Gemini and GPT showed high structural similarity with human emotional representations. This suggests that MLLMs are capable of accurately capturing the overall patterns of emotional responses elicited in humans during video viewing. While the open-weight model Molmo performed worse than the commercial models, it still demonstrated a reasonable level of reproducibility, indicating that this capability is not limited to proprietary systems. Second, the unsupervised approach using Gromov–Wasserstein Optimal Transport (GWOT) revealed important characteristics of how emotion structures align between humans and models. Specifically, while the models struggled to achieve strict one-to-one matching across all videos, they demonstrated a high degree of alignment at the level of categories. This result underscores the significance of employing GWOT in the present study. Unlike RSA, which requires a predefined correspondence between items, GWOT automatically searches for the optimal mapping in a flexible manner, allowing it to reveal meaningful correspondences at a coarser category level even when precise item-level alignment is difficult. In other words, although there are limitations in capturing fine-grained, item-specific correspondence, the findings suggest that models are beginning to capture the broader categorical structure of human emotional space. The use of GWOT was thus highly effective in uncovering this level-dependent distinction in alignment. Taken together, current MLLMs have evolved to the point where they can capture the coarse categorical framework of human emotion structure. However, they still fall short of enabling precise one-to-one alignment with individual emotional items. This mixed performance profile, with strong alignment at the category level but limitations at the fine-grained level, serves as a critical starting point for the remainder of the Discussion.
In the following sections, we explore why current MLLMs exhibit such a performance profile and what technical and conceptual directions might help overcome these limitations. Our discussion proceeds along three perspectives: First, in the section “Technological Leaps in MLLMs as a Foundation for Emotion Inference,” we discuss how recent advances in MLLMs have enhanced their ability to accurately perceive visual content (i.e., “Step 1”) and how this foundation supports the more complex task of emotion inference (“Step 2”). We highlight the improved visual interpretability in models like Gemini and GPT as a critical enabler of deeper affective understanding. Second, in “The Role of Context: What MLLMs Can and Cannot Predict,” we analyze cases where models succeed or fail in predicting human emotional responses. We point out that while models perform well when emotions can be inferred directly from visual cues, they continue to struggle with emotions that require deeper contextual or cultural understanding. Finally, in “Towards More Human-like Emotional Understanding,” we propose future directions for achieving more human-like affective intelligence. These include training with more emotionally rich datasets, incorporating bodily and interoceptive signals, and developing new evaluation frameworks that go beyond structural similarity to assess the subjective alignment of emotional interpretations. Through these discussions, we aim to clarify the current capabilities and limitations of MLLMs and provide a roadmap toward the development of more emotionally intelligent systems.
Technological leaps in MLLMs as a foundation for emotion inference
Based on the results of this study, along with benchmarking results on various tasks from early models to the latest MLLMs, we first emphasize that the rapid and significant technological leap in overall MLLM performance from the initial models that appeared around 2023 to the state-of-the-art models of 2025 has, for the first time, enabled the level of accuracy in emotion inference demonstrated in this study. Early MLLMs such as BLIP229 and LLaVA-1.530, which appeared around 2023, demonstrated the potential of multimodal processing but faced numerous challenges in terms of visual recognition accuracy, instruction following, and contextual understanding31. The subsequent release of GPT-4V marked significant progress in benchmarks10,32. In our evaluations, we began to observe responses that correlated highly with human emotion ratings. However, GPT-4V still exhibited inaccuracies in visual recognition and contextual understanding, often misclassifying innocuous content as inappropriate (e.g., mislabeling it as sexual or violent)33. Due to such misclassification, the percentage of videos for which GPT-4V produced a response is much lower (507/2185, 23%) than that for GPT-4.1 (2184/2185, 99%). For this reason, we judged that a comparison with GPT-4V would be difficult and decided not to include it in our analysis. With the introduction of GPT-4.1 and Gemini-2.0-flash in 2025, the accuracy of Step 1 improved dramatically, as indicated in benchmarking results27,34. This advance has enabled accurate visual recognition and effective handling of complex instructions, and provided a solid foundation for emotion inference. As a result, the challenging and ambiguous task of inferring emotional structure (Step 2) became practically feasible. In our evaluations, misclassifications were significantly reduced and consistent responses were obtained across many videos. Notably, we were able to achieve stable, high correlations with human emotional structures, as demonstrated in this study. These results highlight that MLLMs have undergone rapid evolution in both Step 1 (recognition and comprehension) and Step 2 (emotion inference), signaling not just incremental improvement but a clear transition to a stage where multimodal AI can address complex and context-dependent tasks.
At the same time, this study also revealed that even models with excellent recognition capabilities (Step 1) still face limitations when handling the integrated challenge of emotion structure inference (Step 2), indicating both the difficulty of the task and the exceptional capabilities of state-of-the-art models. For example, Molmo-7B-D, which we used as a comparison model, demonstrated outstanding performance in Step 1 across multiple benchmarks28. Molmo-7B-D outperformed GPT-4V in the multimodal benchmark28 and even surpassed then-state-of-the-art models like Gemini 1.5 Pro and GPT-4o-0513 in tasks such as VQA v235 and TextVQA36. However, emotion structure inference (Step 2) cannot be solved by the simple application of recognized information. It requires estimating latent emotional structures that are inherently subjective and context-dependent. Therefore, high accuracy in Step 1 alone is insufficient; models must also be able to interpret recognized information within deep contextual frameworks to derive emotional meaning. While Molmo exhibited strong performance in Step 1, it showed a clear performance gap compared to GPT and Gemini when tackling tasks involving latent emotion inference. This does not indicate a deficiency in Molmo’s capabilities but rather highlights that such tasks demand more than recognition accuracy or general reasoning; they require complex, multi-layered processing. Notably, GPT and Gemini achieved further improvements and stability in Step 1, enabling them to effectively manage this previously challenging integrated task (Step 2). The fact that these SOTA models successfully perform tasks where even highly capable models like Molmo struggle strongly suggests that MLLMs have advanced beyond gradual improvements, reaching a new stage where AI can address integrated and context-sensitive challenges.
The role of context: what MLLMs can and cannot predict
To better understand the nature of these improvements, we examined cases in which emotion estimation worked particularly well. These analyses revealed that MLLMs exhibit high predictive accuracy from a structural perspective, particularly for emotions that do not strongly depend on contextual cues. For instance, videos featuring babies or cats and dogs consistently elicited predictions of “cuteness” or “fondness.” Similarly, scenes showing a long-haired woman in a white dress in a dimly lit setting repeatedly triggered predictions of “fear.” These results suggest that MLLMs are capable of appropriately reconstructing emotional responses primarily driven by visual features37.
On the other hand, MLLMs struggled to estimate emotions in videos that relied heavily on contextual cues. For example, in one video, a player attempts a high-five with a teammate and is ignored, while other players in the background are seen successfully high-fiving. Visually, this scene might suggest a positive emotion such as “admiration.” However, when considering the broader context, a more accurate emotional interpretation would be “awkwardness.” These findings indicate that emotional understanding cannot be derived from visual features alone; it requires the interpretation of situational and social context. When MLLMs fail to appropriately interpret these contextual cues, their emotion predictions tend to diverge from human judgments.
Towards more human-like emotional understanding
In addition to contextual cues, it would be beneficial to design models that can integrate not only social contextual information but also interoceptive signals such as heart rate and bodily sensations, which are known to fundamentally affect emotion perception22,38,39. Our findings indicate that while current MLLMs perform well in estimating emotions based on explicit visual features, they still face limitations in understanding more complex forms of context, such as social relationships and temporal dynamics. Crucially, such contextual understanding involves not only external cues but also internal bodily states. In fact, some of the videos that were particularly challenging for the models involved emotions such as “heart-pounding.” These types of emotions might not be fully interpretable through visual input alone and may require sensitivity to interoceptive signals in combination with environmental and social factors. Therefore, future advances in MLLMs should involve incorporating diverse modalities into the training process to enable more human-like, context-sensitive emotion understanding.
However, another possibility is that the relatively lower accuracy observed for bodily-driven emotions could partly stem from insufficient representation in pretraining datasets rather than solely from the absence of interoceptive signal integration. Emotions strongly influenced by bodily sensations may be less frequently articulated explicitly in textual form, possibly limiting their prevalence in standard pretraining data. Consequently, incorporating direct interoceptive signals into the model architecture might not be strictly necessary. Instead, it could be beneficial to fine-tune existing pretrained models on new datasets specifically designed to better represent these emotion-scene associations. Considering the promising zero-shot performance of MLLMs demonstrated in our results, fine-tuning might provide a viable path forward. Future research should investigate whether fine-tuning pretrained models can improve prediction accuracy and unsupervised alignment for these potentially challenging emotional contexts.
Taken as a whole, the fact that MLLMs are beginning to reproduce the structural patterns of emotions induced by visual features represents a significant advance in affective understanding. However, challenges remain in estimating emotions that are highly dependent on contextual and interoceptive factors, and full replication of human emotional experience has yet to be achieved. Nevertheless, the emerging ability of MLLMs to approximate emotional structures that were previously difficult to model suggests a new frontier for affective computing and offers a crucial foundation for future research and practical applications.
Methods
Two emotion rating datasets and data processing
We analyzed two datasets of emotion ratings during video viewing from previous studies. One is from Koide-Majima et al. (2020) and the other is from Cowen and Keltner (2017). For full details of each experiment and the resulting data, please refer to the original papers; here we provide only the information necessary to understand the present study (Table 5).
Data from Koide-Majima et al. (2020)
Videos. In total, 550 different video clips were presented, chosen to evoke a diverse range of emotional responses. The genres of the selected videos included horror, violent drama, comedy, romance, fantasy, everyday life scenes, and action. The duration of each video clip was 10–20 seconds, and about 15 seconds on average.
80 emotion categories used for subjective reports. Eighty emotion categories were drawn from various sources to cover a wide range of emotions. Note that these emotion categories were presented as Japanese words to the Japanese participants in the experiments. The English translations of the 80 Japanese emotion categories are as follows: (1) love, (2) amusement, (3) craving, (4) joy, (5) nostalgia, (6) boredom, (7) calmness, (8) relief, (9) romance, (10) sadness, (11) admiration, (12) aesthetic appreciation, (13) awe, (14) confusion, (15) entrancement, (16) interest, (17) satisfaction, (18) excitement, (19) sexual desire, (20) surprise, (21) nervousness, (22) tension, (23) anger, (24) anxiety, (25) awkwardness, (26) disgust, (27) empathic pain, (28) fear, (29) horror (bloodcurdling), (30) laughing, (31) happiness, (32) friendliness, (33) ridiculousness, (34) affection, (35) liking, (36) shedding tears, (37) emotional hurt, (38) sympathy, (39) lethargy, (40) empathy, (41) compassion, (42) curiousness, (43) unrest, (44) exuberance, (45) appreciation of beauty, (46) fever, (47) scare (feel a chill), (48) daze, (49) positive-expectation, (50) throb, (51) sexiness, (52) indecency, (53) embarrassment, (54) oddness, (55) contempt, (56) alertness, (57) eeriness, (58) positive-emotion, (59) vigor, (60) longing, (61) tenderness, (62) pensiveness, (63) melancholy, (64) relaxedness, (65) acceptance, (66) unease, (67) negative-emotion, (68) hostility, (69) levity, (70) protectiveness, (71) elation, (72) coolness, (73) cuteness, (74) attachment, (75) encouragement, (76) annoyance, (77) positive-fear, (78) aggressiveness, (79) distress, and (80) stress.
Emotion ratings from participants. A total of 166 Japanese annotators rated emotions while viewing the 550 video clips. Each annotator was instructed to rate how well an emotion category (e.g., “laughing”) matched their personal feelings elicited by the video scene, using a scale from 0 (not matched at all) to 100 (perfectly matched). Importantly, they were told to base their ratings on their own emotional responses, not those of the characters in the videos. During the rating process, annotators continuously indicated the degree of matching by moving a mouse while viewing the video stimuli, with ratings recorded at one-second intervals. For each of the 80 emotion categories, four independent ratings were obtained by assigning four different annotators to that category, with each annotator rating one or two emotion categories. When an annotator was assigned two categories, they first rated one emotion (e.g., “disgust”) throughout the entire set of video clips and then watched the entire sequence again to rate the second emotion category (e.g., “satisfaction”).
Group splitting into two groups for human–human comparison. To investigate the extent to which emotion structures are shared among humans, we split the dataset into two groups. For each emotion category, four participants had provided ratings; these were divided into two pairs, yielding two independent participant groups per category. Within each group, we then averaged the two participants’ ratings for the same emotion category, treating each group as an independent observer.
Although we aimed to make the ratings of the two groups as independent as possible, the structure of the experimental data made it impossible to split the participants into two groups with no overlap at all. Note, however, that within any single emotion category there is no overlap of participants between the two split groups. Thus, although the split used in this study does not perfectly guarantee independence (participant overlap across categories would tend to increase the similarity between the two groups), the ratings are independent at the level of each emotion category.
Data processing. Emotion ratings were averaged along both the temporal and participant dimensions for each video and each emotion category. First, temporal averaging was performed. For each participant, the continuously recorded emotion ratings at 1-second intervals were averaged over time for each video and each emotion category. As a result, for each video, we obtained an 80-dimensional emotion vector per participant, with a total of four vectors per video. Next, participant-wise averaging was conducted. When the data were used in the section comparing emotion structures between humans, we divided the participants into two groups and averaged the two emotion vectors per category within each group. In contrast, when comparing emotion structures between humans and models, we averaged the vectors of all four participants assigned to the same category. In this way, we obtained a single 80-dimensional emotion rating vector for each video.
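As a concrete illustration of this two-stage averaging, the following Python sketch processes a single video; the array layout, sizes, and values are placeholders rather than the actual data or analysis code used in the study.

```python
import numpy as np

# Hypothetical input for one video: `ratings` has shape
# (80 emotion categories, 4 annotators, T seconds) with 0-100 ratings
# recorded at 1-second intervals.
rng = np.random.default_rng(0)
T = 15
ratings = rng.uniform(0, 100, size=(80, 4, T))   # stand-in for real data

# 1) Temporal averaging: one value per category and annotator.
per_annotator = ratings.mean(axis=2)             # shape (80, 4)

# 2a) Human-human comparison: split the four annotators into two pairs
#     and average within each pair.
group1_vector = per_annotator[:, :2].mean(axis=1)   # 80-dim vector, group 1
group2_vector = per_annotator[:, 2:].mean(axis=1)   # 80-dim vector, group 2

# 2b) Human-model comparison: average across all four annotators.
human_vector = per_annotator.mean(axis=1)            # single 80-dim vector
```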
Data from Cowen & Keltner (2017)
Videos. These investigators collected 2,185 short videos to cover a wide range of emotion-elicited situations. On average, each clip was about 5 seconds long.
34 emotion categories. 34 English emotion categories were selected from emotion taxonomies of prominent theories and used in the experiments. The 34 emotion categories were: (1) admiration, (2) adoration, (3) aesthetic appreciation, (4) amusement, (5) anger, (6) anxiety, (7) awe, (8) awkwardness, (9) boredom, (10) calmness, (11) confusion, (12) contempt, (13) craving, (14) disappointment, (15) disgust, (16) empathic pain, (17) entrancement, (18) envy, (19) excitement, (20) fear, (21) guilt, (22) horror, (23) interest, (24) joy, (25) nostalgia, (26) pride, (27) relief, (28) romance, (29) sadness, (30) satisfaction, (31) sexual desire, (32) surprise, (33) sympathy, and (34) triumph.
Emotion ratings from participants. The participants selected multiple emotions from the 34 English emotion categories elicited by each video. Participants were instructed to choose at least one emotion category but could choose as many as desired. Each of the 2,185 videos was judged by 9 to 17 observers. The ratings of each emotion category were averaged for each video.
Data processing. Cowen’s published emotion ratings represent the proportion of evaluators who selected each emotion category for each video. In this study, we treated this proportion as the intensity of the emotion experienced by participants and performed our analyses accordingly. Moreover, only the average frequency of emotions across all participants was publicly available, and no participant-level data were disclosed. This limitation prevented any comparison of emotion structures among individual participants. Consequently, for this experiment, we performed analyses solely between humans and the model.
Selection of multimodal LLMs
Based on the results of preliminary evaluations, we selected Gemini-2.0-flash, GPT-4.1, and Molmo-7B-D as the multimodal Large Language Models (MLLMs) for use in this study. Gemini and GPT were chosen because they consistently rank at the top in multiple benchmark tests40,41 and on the latest Chatbot Arena leaderboard as of April 202542, demonstrating high accuracy in predictions for multimodal inputs, including videos and images. Because these commercial models are closed-source, we additionally selected Molmo-7B-D28, one of the most high-performing and accessible open-source models available, to provide transparency and reproducibility. Molmo has been reported to achieve performance comparable to Gemini-1.5-Pro and GPT-4o across several benchmarks (Table 1 in Deitke et al. (2024)28). In particular, we found that Molmo exhibited strong instruction-following capabilities in our task. Moreover, because the internal mechanisms of commercial SOTA models are opaque, including a competitive open-source model like Molmo provides a valuable point of comparison, enabling us to examine relative performance and generalization patterns in multimodal emotion inference from an external perspective. Such comparisons may offer useful insights for the future development and evaluation of open-source models. Due to computational resource constraints, we adopted the lightweight 7B configuration instead of larger models such as the 70B variant. Although Llama 343 is also recognized as an excellent open-source model, it was excluded from this study to limit the computational cost associated with GWOT analysis, which increases with the number of models evaluated. Finally, due to dataset limitations (explained in the following section), we used only Gemini for the Koide-Majima et al. dataset, and Gemini, GPT-4.1, and Molmo-7B-D for the Cowen & Keltner dataset.
Selected model for data from Koide-Majima et al. (2020)
We used Gemini-2.0-flash-001 for analysis of the Koide-Majima et al. dataset. Because the Koide-Majima et al. dataset consists of videos averaging 15 seconds in length that include audio and were rated by Japanese participants, the model used must be able to process videos containing audio and produce output in Japanese. Among the three models discussed in the previous section, only Gemini-2.0-flash-001 meets all of these criteria. Consequently, we employed this model only for our analyses of the Koide-Majima et al. dataset.
Selected model for data from Cowen & Keltner (2017)
The Cowen & Keltner dataset comprises short, audio-free videos of about five seconds in length, rated by English-speaking participants. For this study, we selected Gemini-2.0-flash-001, GPT-4.1-2025-04-14, and Molmo-7B-D-0924 as suitable models for processing this dataset. Given that these videos have no audio and are only about five seconds long, we deemed it feasible to capture nearly the same amount of information by extracting multiple frames and assembling them into a single image. Therefore, any model capable of image input and handling English was considered appropriate, and on this basis we adopted GPT-4.1 and Molmo-7B-D. Gemini-2.0-flash-001, which can additionally handle videos with audio, can also process this dataset without issue, as the Cowen & Keltner data involve short, audio-free videos rated by English speakers. Consequently, we compared the performance of the three models (Gemini-2.0-flash-001, GPT-4.1-2025-04-14, and Molmo-7B-D-0924) on the Cowen & Keltner dataset.
Inputting the videos into Molmo and GPT. Because Molmo cannot accept video input directly, we extracted six frames from each video: the video was divided into six equal segments, and the first frame of each segment was selected. For Molmo, these six frames were concatenated horizontally into one continuous image used as its input. By contrast, GPT can handle multiple images simultaneously, so we provided the same six frames at once without merging them into a single image.
The choice of using six frames was determined after exploring how many frames could be placed within a single image such that each frame remained clearly identifiable by the model. Using too few frames might omit critical information from the video, whereas including too many frames could reduce image resolution or increase computational overhead. Ultimately, six frames provided a suitable balance between retaining essential information and managing resource constraints.
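The sketch below illustrates this frame-extraction step with OpenCV; the file name is hypothetical and the code is an illustration under the assumptions described above, not the pipeline used in the study.

```python
import cv2
import numpy as np

def six_frame_strip(video_path: str, n_frames: int = 6) -> np.ndarray:
    """Illustrative sketch: split a video into n_frames equal segments,
    take the first frame of each segment, and concatenate the frames
    horizontally into one image."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        idx = int(i * total / n_frames)          # first frame of the i-th segment
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # All frames of one clip share the same height, so horizontal stacking works.
    return np.hstack(frames)

# Hypothetical usage:
# strip = six_frame_strip("clip_0001.mp4")
# cv2.imwrite("clip_0001_strip.png", strip)
```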
Collecting responses from models
We collected responses (outputs) from the models (Gemini-2.0-flash, GPT-4.1, Molmo-7B-D) to evaluate each output using the following procedure. The details of these procedures are described in the following sections. All models were used with their default parameter settings provided by the respective APIs or repositories, without any additional fine-tuning or modification.
Handling sensitive videos. Some videos contained sensitive content (e.g., violence, sexual themes), and in such cases the model might not return any response; whether a valid output was obtained varied probabilistically across attempts. Therefore, for videos deemed sensitive, we submitted the same prompt up to ten times and, if at least one valid response was generated, adopted that response for our analysis. If no response was obtained after ten attempts, the video was excluded from the dataset. This procedure maximized the likelihood of obtaining responses even for sensitive content.
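A minimal sketch of this retry policy is given below; `query_fn` stands in for a hypothetical wrapper around the model API and is not part of any actual SDK.

```python
import time

def query_with_retries(query_fn, prompt, video, max_attempts=10):
    """Retry policy for sensitive videos: submit the same prompt up to
    max_attempts times and adopt the first valid response; return None
    if every attempt is refused (the video is then excluded)."""
    for _ in range(max_attempts):
        response = query_fn(prompt, video)   # hypothetical API wrapper
        if response is not None:
            return response                  # first valid response is adopted
        time.sleep(1.0)                      # brief pause before retrying
    return None
```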
Data from Koide-Majima et al. (2020)
To obtain emotion ratings from the model, we carefully aligned both the input and output formats with the experimental conditions used for human participants. For visual input, each video in the Koide-Majima et al. dataset was presented to the model directly to generate emotion intensity predictions. The model used, Gemini, supports video input; therefore, no preprocessing such as frame extraction or conversion to still images was necessary, and the original video files were used as-is. Additionally, the scale for emotion intensity ratings was matched to that used by human participants: the model was instructed to output a score ranging from 0 (not matched at all) to 100 (perfectly matched) for each emotion category.
To obtain emotion ratings from the model, we tested multiple prompt formats. We explored variations such as presenting all emotion categories at once or splitting them into multiple prompts, as well as including role-inducing expressions. Within the range we tested, these differences did not lead to substantial variation in the model’s output performance. Therefore, we adopted a prompt that was as simple as possible and closely aligned with the instructions used in the actual human experiment. Additionally, since the participants’ evaluations were conducted in Japanese, the prompt was also presented to the model in Japanese. Specifically, we translated the following English instruction into Japanese for use with the model:
Prompt:

Please watch the video clip provided. This is a short video clip. Please estimate the intensity of each emotion category listed below that people might feel when watching this video, according to the rating rules given below.

Emotion Categories:
love, amusement, craving, joy, nostalgia, boredom, calmness, relief, romance, sadness, admiration, aesthetic appreciation, awe, confusion, entrancement, interest, satisfaction, excitement, sexual desire, surprise, nervousness, tension, anger, anxiety, awkwardness, disgust, empathic pain, fear, horror, laughing, happiness, friendliness, ridiculousness, affection, liking, shedding tears, emotional hurt, sympathy, lethargy, empathy, compassion, curiousness, unrest, exuberance, appreciation of beauty, fever, scare, daze, positive-expectation, throb, sexiness, indecency, embarrassment, oddness, contempt, alertness, eeriness, positive-emotion, vigor, longing, tenderness, pensiveness, melancholy, relaxedness, acceptance, unease, negative-emotion, hostility, levity, protectiveness, elation, coolness, cuteness, attachment, encouragement, annoyance, positive-fear, aggressiveness, distress, stress.

Rating Rules:
- Rate the intensity of each emotion that people might feel upon both the images and the sounds that make up the scene of the video.
- Rate the intensity of each emotion on a scale of 0 to 100, where 0 indicates ‘not matched at all’ and 100 indicates ‘perfectly matched’.
- Pay attention to the trivial connections between each emotion and the scene of the video, and rate them as carefully as possible.

Please rate each emotion individually, following this format:
love: [numerical value 0-100]
amusement: [numerical value 0-100]
...
stress: [numerical value 0-100]

Respond with numerical values only for each emotion, without additional explanation.
For non-sensitive videos, we used the same prompt three times and took the average of the three outputs obtained from the model. A preliminary experiment indicated that collecting a fourth or further response did not improve the correlation with human ratings; hence, we limited retrieval to three outputs to maintain efficiency.
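For illustration, the sketch below parses responses in the requested “emotion: value” format and averages them across repetitions; the category list is truncated and the response strings are invented examples, not model outputs from the study.

```python
import re
import numpy as np

# Truncated illustration; the real list contains all 80 emotion categories.
EMOTION_CATEGORIES = ["love", "amusement", "stress"]

def parse_response(text: str, categories) -> np.ndarray:
    """Parse lines of the form 'emotion: value' into a rating vector."""
    scores = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w\s\-()]+?)\s*:\s*(\d+)", line)
        if m:
            scores[m.group(1).strip().lower()] = float(m.group(2))
    return np.array([scores.get(c, np.nan) for c in categories])

# Average the vectors parsed from three independent model responses.
responses = [
    "love: 10\namusement: 80\nstress: 5",
    "love: 15\namusement: 70\nstress: 0",
    "love: 5\namusement: 90\nstress: 10",
]
vectors = np.stack([parse_response(r, EMOTION_CATEGORIES) for r in responses])
mean_vector = np.nanmean(vectors, axis=0)   # final rating vector for the video
```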
Data from Cowen & Keltner (2017)
In this study, we obtained emotion intensity predictions by inputting each video along with a prompt into the model. For models that supported video input, the original video files were used directly as visual input without any preprocessing. For models that did not support video input, we instead used still images created by extracting frames from the videos (see Methods, “Selection of Multimodal LLMs”, for details.)
Additionally, based on the emotion rating format of the Cowen & Keltner dataset, we designed the model’s output format to be comparable to the aggregated human evaluations. The Cowen & Keltner dataset does not provide individual-level participant ratings; instead, it only offers the proportion of raters who selected each emotion category for each video. Strictly replicating this format would require collecting multiple responses from the model for each video, which is impractical in terms of both cost and labor. Therefore, in this study, we interpreted these proportions as emotion intensity ratings and instructed the model to output a single intensity score for each emotion category (see Methods, ‘Dealing with human ratings’, for details).
We also explored several variations in prompt format. Similar to the Koide-Majima et al. dataset, we compared presenting emotion categories all at once versus separately, and whether to include role-defining expressions. Since the intensity scale for emotions can be arbitrarily set, we tried several options, but found no substantial differences in output. Therefore, we adopted a 0–9 scale in this study and normalized the resulting scores to a 0–1 range by dividing them by 10.
Given that the original experiment with human participants was conducted in English, we inputted the following prompt:
Prompt:

Please watch the video clip provided. After viewing, please estimate the intensity of each listed emotion that people might feel upon viewing the video clip. Rate each emotion on a scale from 0 to 9, where 0 means ‘not at all’ and 9 indicates ‘very strongly’.

Emotion Categories:
Admiration, Adoration, Aesthetic Appreciation, Amusement, Anger, Anxiety, Awe, Awkwardness, Boredom, Calmness, Confusion, Contempt, Craving, Disappointment, Disgust, Empathic Pain, Entrancement, Envy, Excitement, Fear, Guilt, Horror, Interest, Joy, Nostalgia, Pride, Relief, Romance, Sadness, Satisfaction, Sexual Desire, Surprise, Sympathy, Triumph.

Please rate each emotion individually, following this format:
Admiration: [numerical value 0-9]
Adoration: [numerical value 0-9]
...
Triumph: [numerical value 0-9]

Respond with numerical values only for each emotion, without additional explanation.
In the Cowen & Keltner dataset, we obtained only a single response from the model for each video to evaluate its performance. This dataset comprises 2,185 video clips, which is approximately four times larger than the Koide-Majima et al. dataset (550 videos). Given this substantial scale, collecting multiple responses per video and averaging them would be impractical in terms of computational resources and processing time. Therefore, to maintain consistency across the dataset while keeping the computational cost manageable, we adopted a strategy of using a single response per video. Furthermore, preliminary experiments confirmed that even a single response provides a reasonably reliable estimate, supporting the adequacy of this approach for large-scale evaluation.
Analyzing the commonality of emotion structures
In this study, Representational Similarity Analysis (RSA) and Gromov–Wasserstein Optimal Transport (GWOT) were both employed to compare emotion structures. RSA is a supervised approach that measures correlations or other metrics between different representational structures based on a predefined label mapping. Its primary advantage lies in the straightforward assessment of the overall similarity in structure. In contrast, GWOT is an unsupervised method that compares structures by automatically searching for optimal correspondences among elements or labels, allowing it to identify flexible relationships without fixing them in advance.
Thus, while RSA excels at quantifying the pure correlation of structural patterns along a fixed mapping, it can overlook relationships that extend beyond the original assumptions. GWOT, on the other hand, adaptively aligns each video’s emotion data, making it possible to evaluate precisely which human-reported emotional experiences the model’s estimates most closely correspond to. Specifically, RSA evaluates “how similar the patterns are” using measures like correlation coefficients, whereas GWOT measures “how well a model’s estimated emotions match human emotions” through the optimization of correspondences across the dataset.
Both methods are valuable for comparing emotion structures, yet they differ in their reliance on prior mappings and in their perspectives on similarity. By combining the results from both methods, one can capture different aspects of the emotion structure. In this work, we leveraged these two approaches to offer a multifaceted evaluation of how humans and the model represent emotions when watching videos.
Representational similarity analysis
Representational Dissimilarity Matrix (RDM). We constructed Representational Dissimilarity Matrices (RDMs) in preparation for RSA. We compiled the emotion-category information for each video into a single “emotion vector,” which had 80 dimensions for the Koide-Majima et al. dataset and 34 dimensions for the Cowen & Keltner dataset. We then computed the cosine similarities between these emotion vectors for different videos and transformed them as \(1 - \text {cosine similarity}\), thereby constructing an RDM that reflects the differences in emotion ratings across videos.
Correlation between RDMs, RSA. We evaluated the overall similarity between RDMs by calculating the correlation coefficient between RDMs. This analysis is known as conventional Representational Similarity Analysis (RSA). In this method, we calculated Pearson correlations only from the upper triangular elements of the matrix. It should be noted, however, that this method inherently assumes that emotions induced by the same video are aligned across different similarity structures, which is considered a supervised comparison as opposed to an unsupervised comparison such as Gromov-Wasserstein Optimal Transport, which is discussed below.
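The following Python sketch illustrates these two steps on placeholder rating matrices; it is not the analysis code used in the study.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

# Stand-ins for the (n_videos x n_emotions) rating matrices
# (80 emotion columns for Koide-Majima et al., 34 for Cowen & Keltner).
rng = np.random.default_rng(0)
human_ratings = rng.random((550, 80))
model_ratings = rng.random((550, 80))

# pdist with the 'cosine' metric returns 1 - cosine similarity for every
# video pair, i.e., exactly the upper-triangular part of the RDM.
human_rdm_upper = pdist(human_ratings, metric="cosine")
model_rdm_upper = pdist(model_ratings, metric="cosine")

# Conventional RSA: Pearson correlation over the upper-triangular elements.
r, p = pearsonr(human_rdm_upper, model_rdm_upper)
print(f"RSA correlation: r = {r:.3f}")

# squareform() recovers the full n x n RDMs needed for the GWOT analysis below.
human_rdm = squareform(human_rdm_upper)
model_rdm = squareform(model_rdm_upper)
```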
Gromov–Wasserstein optimal transport
Histogram matching. Prior to running Gromov-Wasserstein optimal transport (GWOT), we equalized the marginal similarity distributions of the two datasets to remove global scale or bias differences while keeping the procedure fully unsupervised.
Let \(D\in {\mathbb {R}}^{n\times n}\) and \(D'\in {\mathbb {R}}^{m\times m}\) be the pairwise-similarity matrices. Extracting the upper-triangular elements (excluding the diagonal) gives vectors \({\textbf{u}}\) and \({\textbf{v}}\) of lengths \(L=n(n-1)/2\) and \(L'=m(m-1)/2\). After sorting both vectors in descending order, \({\textbf{u}}_{(1)}\ge \ldots \ge {\textbf{u}}_{(L)}\) and \({\textbf{v}}_{(1)}\ge \ldots \ge {\textbf{v}}_{(L')}\), we replace each entry of \({\textbf{v}}\) by the value with the same rank in \({\textbf{u}}\), i.e., \({\textbf{v}}_{(r)} \leftarrow {\textbf{u}}_{(r)}\) for \(r = 1, \ldots , L'\) (here \(L = L'\), since the two structures contain the same number of videos).
Because monotone rearrangement minimizes the Wasserstein-1 distance on \({\mathbb {R}}\), this rank-wise replacement realizes the optimal one-dimensional transport between the two empirical distributions. Only the histogram of similarities is normalized; no pairwise ordering in \(D\) is altered and no cross-sample correspondences are introduced. GWOT therefore operates on inputs that share identical marginal statistics, allowing it to focus on higher-order structural alignment.
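A minimal sketch of this rank-wise replacement is given below; the toy vectors are invented for illustration only.

```python
import numpy as np

def histogram_match(target_upper: np.ndarray, source_upper: np.ndarray) -> np.ndarray:
    """Rank-wise histogram matching: each upper-triangular dissimilarity in
    `source_upper` is replaced by the value of the same rank in `target_upper`
    (equal lengths, since both structures cover the same videos)."""
    order = np.argsort(source_upper)        # indices that sort the source ascending
    matched = np.empty_like(source_upper)
    matched[order] = np.sort(target_upper)  # same-rank value taken from the target
    return matched

# Toy example: the matched vector carries the target's values in the source's rank order.
u = np.array([0.9, 0.5, 0.1, 0.7])
v = np.array([0.2, 0.8, 0.4, 0.6])
print(histogram_match(u, v))   # -> [0.1 0.9 0.5 0.7]
```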
GWOT. To assess the similarity between emotion structures in an unsupervised manner, we applied the Gromov-Wasserstein optimal transport (GWOT) alignment method (Fig. 7A). GWOT is an unsupervised alignment technique that determines the optimal transportation plan between point sets in two different domains without using information about the correspondence between individual indices. In this study, we defined the dissimilarity between videos i and j in Emotion structure 1 as \(D_{ij}\), and that between videos k and l in Emotion structure 2 as \(D'_{kl}\), both based on cosine dissimilarity of emotion ratings. To align the structural patterns of these matrices, we obtained the optimal transport plan \(\Gamma\) by minimizing the following Gromov-Wasserstein distance (GWD):

$$\begin{aligned} \mathrm {GWD}(\Gamma ) = \sum _{i,j,k,l} \left( D_{ij} - D'_{kl}\right) ^{2}\, \Gamma _{ik}\Gamma _{jl}, \end{aligned}$$

(1)

which quantifies how well the similarity structures of the two domains correspond to each other.
Each element \(\Gamma _{ik}\) of the optimal transportation plan \(\Gamma\) can be interpreted as the probability that the emotional experience elicited by the i-th video in one domain corresponds to that of the k-th video in the other domain. Figure 7A illustrates the concept of this optimization. The upper blue region represents emotion structure 1, and the lower orange region represents emotion structure 2. Each point within the structures corresponds to a video, and the arrows indicate the dissimilarity relationships between videos. GWOT identifies the optimal alignment that minimizes the discrepancy in the internal dissimilarity patterns between the two structures. Figure 7B shows the resulting optimal transport plan \(\Gamma\), where the rows and columns correspond to videos in emotion structures 1 and 2, respectively. The color represents the transport probability, with darker cells indicating stronger correspondence. For example, the dog video in emotion structure 1 corresponds to the cat video in emotion structure 2, and the baby video corresponds to the baby video across both structures.
Although the GWD can also be optimized with an added entropy term (entropic GWOT), we used GWOT without the entropy term in this paper because it was faster in terms of computation time and because the overall performance, as evaluated by the matching rate, was higher. Note that when the entropy term is not added, the optimal transportation plan is typically sparse and binary, i.e., each row contains only a single non-zero entry. Thus, in this optimization, the optimal transport matrix can be thought of as a permutation matrix, indicating which row (an item in emotion structure 1) corresponds to which column (an item in emotion structure 2).
To obtain satisfactory local minima, we utilized the GWTune toolbox25 to randomly initialize the transport matrix \(\Gamma\) and performed multiple optimizations. Since the computational cost of a single optimization increases with the size of the matrix, we adjusted the number of random initializations according to the matrix scale, aiming to balance computational efficiency and alignment accuracy. Specifically, we conducted 10,000 random initializations when the number of videos was approximately 500 or fewer, 1000 for around 750 videos, and 200 for matrices approaching 2000 videos. From among the solutions obtained from these trials, we selected the transport plan that minimized the Gromov-Wasserstein distance (GWD).
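For illustration only, the sketch below computes a single GWOT alignment with the POT (Python Optimal Transport) library rather than the GWTune toolbox used in the study, and omits the multi-start random initialization described above; the input matrices are random stand-ins.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.spatial.distance import pdist, squareform

# Stand-in RDMs; in the actual analysis these would be the human RDM and the
# histogram-matched model RDM from the preceding steps.
rng = np.random.default_rng(0)
D1 = squareform(pdist(rng.random((100, 80)), metric="cosine"))
D2 = squareform(pdist(rng.random((100, 80)), metric="cosine"))

n = D1.shape[0]
p = np.ones(n) / n   # uniform mass over videos in emotion structure 1
q = np.ones(n) / n   # uniform mass over videos in emotion structure 2

# Transport plan Gamma minimizing the GWD with a squared loss (no entropy term).
gamma = ot.gromov.gromov_wasserstein(D1, D2, p, q, loss_fun="square_loss")

# Quick check: fraction of videos whose strongest correspondence lands on the
# same video index in the other structure (the one-to-one matching rate below).
matching_rate = np.mean(np.argmax(gamma, axis=1) == np.arange(n))
print(f"one-to-one matching rate: {matching_rate:.3f}")
```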
Fig. 7. Schematic of the Gromov–Wasserstein optimal transport. (A) Each element of \(D\) and \(D'\) represents the dissimilarity between the emotion ratings of the videos. The optimal transportation plan \(\Gamma\) is obtained by minimizing the Gromov-Wasserstein distance (GWD) between the two emotion structures. (B) The obtained transportation plan matrix \(\Gamma\). Each cell \(\Gamma _{ij}\) represents the probability of correspondence between the two videos i and j. Emoji graphics from Twemoji, licensed under CC BY 4.0 by Twitter, Inc. and other contributors (https://creativecommons.org/licenses/by/4.0/).
Evaluation of GWOT
One-to-one matching. This analysis evaluates (i) how consistently the same video elicits comparable emotion structures across participants and (ii) how closely a model’s predicted emotions mirror those structures. For each video, we ask whether the emotion reports from two domains (e.g., Participant group 1 vs. Participant group 2, or humans vs. the model) refer to the same underlying video. If they do, the match is deemed correct and contributes to an overall agreement score.
We compute the correct matching rate based on the optimal transport (OT) plan \(\Gamma\). Let the binary ground-truth assignment matrix be \(C\), where \(C_{ik}=1\) if items i and k are a true pair and 0 otherwise. Using \(C\), the OT-based matching rate is

$$\begin{aligned} \text {Matching rate} = \frac{1}{n} \sum _{i=1}^{n} {\textbf{1}}\left( C_{i,\,k^{*}(i)} = 1\right) , \qquad k^{*}(i) = \arg \max _{k} \Gamma _{ik}, \end{aligned}$$

(2)

where \({\textbf{1}}(\cdot )\) is the indicator function.
Because we use GWOT without the entropy term and the two domains contain the same number of items, \(\Gamma\) is a binary permutation matrix. Under this setting, the above expression simplifies to the fraction of non-zero diagonal elements in \(\Gamma\); equivalently, it is the percentage of items whose OT mapping lands on the correct counterpart.
Category matching. To reveal latent common features of emotion structure that cannot be fully captured by strict one-to-one correspondence, we employed a category matching approach. This method relaxes the strict one-to-one evaluation criterion by considering video pairs assigned to the same category as correct matches. Consequently, even if the emotion-report indices do not align perfectly, this approach enables us to assess the similarity among videos that evoke similar emotions.
The category matching rate is computed according to the following procedure. We redefine the “correct” assignment matrix \(C\) as \(C_{ik}=1\) if videos i and k belong to the same category, and \(C_{ik}=0\) otherwise.
Subsequently, by performing calculations analogous to those in Eq. (2), the category matching rate is obtained.
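The sketch below illustrates both the one-to-one matching rate of Eq. (2) and its category-level variant; the transport plan and cluster labels are invented for illustration.

```python
import numpy as np

def matching_rate(gamma: np.ndarray, C: np.ndarray) -> float:
    """Matching rate of Eq. (2): for each row i, check whether the column
    receiving the most mass corresponds to a 'correct' pair in C."""
    best = np.argmax(gamma, axis=1)
    return float(np.mean(C[np.arange(len(best)), best] == 1))

def category_assignment(labels1: np.ndarray, labels2: np.ndarray) -> np.ndarray:
    """C for category matching: 1 where the two videos share a cluster label."""
    return (labels1[:, None] == labels2[None, :]).astype(int)

# Hypothetical example with 5 videos in 3 clusters.
labels = np.array([0, 0, 1, 1, 2])
gamma_demo = np.eye(5)[[1, 0, 3, 2, 4]]   # a permutation-like transport plan
print(matching_rate(gamma_demo, np.eye(5, dtype=int)))                 # one-to-one: 0.2
print(matching_rate(gamma_demo, category_assignment(labels, labels)))  # category: 1.0
```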
Determination of video categories To evaluate the category matching rate, we derived categories purely from the data because the video stimuli were not annotated with any explicit category labels. A naive approach would be to assign each video to the single emotion with the highest intensity rating; however, we judged that taking the full multivariate pattern of emotion ratings into account would yield a categorization that better reflects the structure of the data. We therefore adopted hierarchical agglomerative clustering – a simple yet standard data-driven technique – applied to the emotion-rating matrix of Participant group 1 for human-human comparison and the averaged human emotion-rating matrix for human-model comparison (Ward’s linkage, Euclidean distance).
Hierarchical clustering does require pre-specifying the number of clusters, a choice that is unavoidably somewhat arbitrary. In the present context, however, our aim is merely to construct a category-level matching metric; any reasonably large number of clusters suffices for this purpose. Pilot analyses across 10–30 clusters produced virtually identical qualitative results, indicating that the precise cut point does not materially affect the downstream matching scores. For the sake of visual clarity and interpretability, we report the 10-cluster solution in the main text.
Videos grouped into the same cluster are assumed to elicit similar composite emotional responses. These data-driven clusters serve as the basis for evaluating the commonality of emotion structures both between participants and between participants and models at the categorical level. The category-level matching rate thus complements stricter one-to-one matching measures by capturing broader correspondences in the geometry of emotion space.
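A minimal sketch of this clustering step with SciPy is shown below; the rating matrix is a random placeholder rather than the actual reference data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for the (n_videos x n_emotions) human emotion-rating matrix used
# as the reference (group 1 for human-human, averaged humans for human-model).
rng = np.random.default_rng(0)
ratings = rng.random((550, 80))

# Ward's linkage on Euclidean distances, cut into 10 flat clusters.
Z = linkage(ratings, method="ward", metric="euclidean")
video_categories = fcluster(Z, t=10, criterion="maxclust")   # labels 1..10

# These labels define the category assignment matrix C used for the
# category matching rate (see the sketch above).
```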
Control model of shuffling human’s emotion ratings
In this study, we constructed a shuffled dataset of participants’ emotion ratings as a control model to estimate the lower bound of performance. By comparing the estimation accuracy of the actual model to this control model, one can gauge the extent to which the model legitimately captures the video-specific emotion structure. In other words, the shuffled dataset serves as a baseline that reveals how the model would perform if it failed to account for the unique emotional signatures of individual videos, focusing instead on the broad tendencies in the emotion ratings.
Specifically, all reported emotion ratings in the original dataset were randomly permuted, severing the original pairing between each video and its associated ratings. Through this process, the shuffled dataset retains only the marginal distribution of the emotion ratings, thereby approximating a scenario in which the model “knows the overall distribution of reported emotions, but not their correspondence to specific videos.”
The selection of videos that were “well estimated” in each shuffled dataset was performed by ranking the videos according to the correlation between the shuffled human ratings and the original human ratings. Based on these rankings, we extracted various sets of top-performing videos, such as the Top 100, Top 250, and Top 750. Because the selection is based on correlations with the original human ratings, the top-ranked videos vary across shuffles and may also differ from those selected using the actual model outputs. Since video selection in all cases is consistently based on the correlation between shuffled and original human ratings, the comparisons remain valid and fair across conditions.
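The following sketch illustrates the shuffling procedure and the per-video correlations used for ranking; the rating matrix is a random placeholder, not the actual data.

```python
import numpy as np

# Stand-in for the (n_videos x n_emotions) matrix of original human ratings.
rng = np.random.default_rng(0)
human_ratings = rng.random((550, 80))

# Randomly permute the rows, severing the pairing between videos and ratings
# while preserving the marginal distribution of the ratings.
perm = rng.permutation(human_ratings.shape[0])
shuffled_ratings = human_ratings[perm]

# Per-video correlation between shuffled and original ratings, used both to
# summarize baseline performance and to rank "well estimated" videos
# (e.g., Top 100 / Top 250 / Top 750).
per_video_r = np.array([
    np.corrcoef(shuffled_ratings[i], human_ratings[i])[0, 1]
    for i in range(human_ratings.shape[0])
])
top_100 = np.argsort(per_video_r)[::-1][:100]
```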
Mean correlation per video and correlation between RDMs. To estimate the degree of variability introduced by the shuffling procedure, we performed 1,000 shuffles of the emotion ratings using different random seeds. First, for each video, we calculated the correlation coefficient between the shuffled human ratings and the original human ratings. We then derived the 95% percentile interval from the distribution of these mean correlation values across videos. This result is reported in the “mean of correlation on each video” row under “shuffled human ratings” in Table 2 and Table 4. Second, we constructed representational dissimilarity matrices (RDMs) from each of the 1,000 shuffled datasets and computed the correlation between each shuffled RDM and the RDM obtained from the original human ratings. The 95% percentile interval was then derived from the distribution of these correlation values. This analysis corresponds to the “correlation of RDMs” row under “shuffled human ratings” in Table 2 and Table 4. Through these procedures, we quantitatively assessed the variability of both video-level agreement and structural similarity under random shuffling.
GWOT. Due to the high computational cost of GWOT analysis, we adopted different procedures depending on the number of videos analyzed. For datasets with approximately 500 videos or fewer, we conducted 10 iterations of shuffling using different random seeds and computed the 95% percentile interval based on the distribution of the results. In contrast, for datasets with more than 500 videos, we performed only a single shuffle due to the significant time required for computation. This approach allowed us to balance the reliability and feasibility of the GWOT analysis using shuffled data.
Data availability
The Cowen & Keltner dataset (10.1073/pnas.1702247114) can be requested here: https://goo.gl/forms/XErJw9sBeyuOyp5Q2. The Koide-Majima et al. dataset (10.1016/j.neuroimage.2020.117258) is available from Shinji Nishimoto upon request.
References
Cowen, A. S. & Keltner, D. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc. Natl. Acad. Sci. U. S. A. 114, E7900–E7909 (2017).
Koide-Majima, N., Nakai, T. & Nishimoto, S. Distinct dimensions of emotion in the human brain and their representation on the cortical surface. Neuroimage 222, 117258 (2020).
Ekman, P. & Friesen, W. V. Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 17, 124–129 (1971).
Russell, J. A. A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980).
Russell, J. A. & Barrett, L. F. Core affect, prototypical emotional episodes, and other things called emotion: dissecting the elephant. J. Pers. Soc. Psychol. 76, 805–819 (1999).
Maithri, M. et al. Automated emotion recognition: Current trends and future perspectives. Comput. Methods Programs Biomed. 215, 106646 (2022).
Breazeal, C. Designing Sociable Robots (MIT Press, 2004).
Keltner, D., Sauter, D., Tracy, J. & Cowen, A. Emotional expression: Advances in basic emotion theory. J. Nonverbal Behav. 43, 133–160 (2019).
Foteinopoulou, N. M. & Patras, I. EmoCLIP: A vision-language method for zero-shot video facial expression recognition. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 1–10 (2024).
Lian, Z. et al. GPT-4V with emotion: A zero-shot benchmark for generalized emotion recognition. Inf. Fusion 108, 102367 (2024).
Liang, P. et al. Holistic evaluation of language models. arXiv [cs.CL] (2022).
Zeng, A. et al. GLM-130B: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations (2023).
Zhang, T., Irsan, I. C., Thung, F. & Lo, D. Revisiting sentiment analysis for software engineering in the era of large language models. ACM Trans. Softw. Eng. Methodol. 34, 1–30 (2025).
Li, A., Xu, L., Ling, C., Zhang, J. & Wang, P. EmoVerse: Exploring multimodal large language models for sentiment and emotion understanding. arXiv [cs.CL] (2024).
Cheng, Z. et al. Emotion-LLaMA: Multimodal emotion recognition and reasoning with instruction tuning. arXiv [cs.AI] (2024).
Fei, H. et al. Video-of-thought: Step-by-step video reasoning from perception to cognition. arXiv [cs.AI] (2024).
Kawakita, G., Zeleznikow-Johnston, A., Tsuchiya, N. & Oizumi, M. Gromov–Wasserstein unsupervised alignment reveals structural correspondences between the color similarity structures of humans and large language models. Sci. Rep. 14, 1–10 (2024).
Marjieh, R., Sucholutsky, I., van Rijn, P., Jacoby, N. & Griffiths, T. L. Large language models predict human sensory judgments across six modalities. Sci. Rep. 14, 21445 (2024).
Pichai, S. Introducing gemini: our largest and most capable AI model. (accessed 28 Mar 2025). https://blog.google/technology/ai/google-gemini-ai/ (2023).
OpenAI. Introducing ChatGPT. (accessed 28 Mar 2025). https://openai.com/index/chatgpt/ (2022).
DiGirolamo, M. A., Neupert, S. D. & Isaacowitz, D. M. Emotion regulation convoys: Individual and age differences in the hierarchical configuration of emotion regulation behaviors in everyday life. Affect. Sci. 4, 630–643 (2023).
Barrett, L. F. The theory of constructed emotion: an active inference account of interoception and categorization. Soc. Cogn. Affect. Neurosci. 12, 1–23 (2017).
Vaiani, L., Cagliero, L. & Garza, P. Emotion recognition from videos using multimodal large language models. Future Internet 16, 247 (2024).
Takeda, K., Abe, K., Kitazono, J. & Oizumi, M. Unsupervised alignment reveals structural commonalities and differences in neural representations of natural scenes across individuals and brain areas. iScience 28, 112427 (2025).
Takeda, K., Sasaki, M., Abe, K. & Oizumi, M. Unsupervised alignment in neuroscience: Introducing a toolbox for Gromov–Wasserstein optimal transport. J. Neurosci. Methods 419, 110443 (2025).
Kawakita, G., Zeleznikow-Johnston, A., Takeda, K., Tsuchiya, N. & Oizumi, M. Is my “red” your “red”?: Evaluating structural correspondences between color similarity judgments using unsupervised alignment. iScience 28, 112029 (2025).
OpenAI. Introducing GPT-4.1 in the API. (accessed 17 May 2025). https://openai.com/index/gpt-4-1/ (2025).
Deitke, M. et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv [cs.CV] (2024).
Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 19730–19742 (PMLR, 2023).
Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 26286–26296 (2023).
Fu, C. et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv [cs.CV] (2023).
OpenAI et al. GPT-4 technical report. arXiv [cs.CL] (2023).
Wang, W. et al. Can’t see the forest for the trees: Benchmarking multimodal safety awareness for multimodal LLMs. arXiv [cs.CL] (2025).
Mallick, S. B. & Kilpatrick, L. Gemini 2.0: Flash, flash-lite and pro. (accessed 28 April 2025). https://developers.googleblog.com/en/gemini-2-family-expands/ (2025).
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D. & Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6904–6913 (IEEE, 2017).
Singh, A. et al. Towards VQA models that can read. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2019).
Kragel, P. A., Reddan, M. C., LaBar, K. S. & Wager, T. D. Emotion schemas are embedded in the human visual system. Sci. Adv. 5, eaaw4358 (2019).
Barrett, L. F. How Emotions are Made: The Secret Life of the Brain (Pan Macmillan, 2017).
Ohira, H. Predictive processing of interoception, decision-making, and allostasis: A computational framework and implications for emotional intelligence. Psihol. Teme 29, 1–16 (2020).
Pătrăucean, V. et al. Perception test: A diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 36, 42748–42761 (2023).
xAI. Grok-1.5 vision preview. (accessed 25 April 2025). https://x.ai/news/grok-1.5v (2024).
Chiang, W.-L. et al. Chatbot arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning (2024).
Grattafiori, A. et al. The llama 3 herd of models. arXiv [cs.AI] (2024).
Acknowledgements
M.O. was supported by JST Moonshot R&D Grant No. JPMJMS2012 and JSPS KAKENHI, Grant Number 20H05712. M.O. and T.H. were supported by JSPS KAKENHI Grant Number 23H04834. S.N. was supported by JSPS KAKENHI Grant Number JP24H00619.
Author information
Contributions
H.A., T.H., and M.O. conceptualized the study. H.A., K.N. and M.O. collected data from multimodal LLMs. H.A. performed data analysis. N.K. and S.N. provided experimental data from Koide-Majima et al. (2020) and offered insights regarding data analysis. H.A. and M.O. drafted the initial manuscript. All authors reviewed, edited, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Asanuma, H., Koide-Majima, N., Nakamura, K. et al. Correspondence of high dimensional emotion structures elicited from video clips between humans and multimodal LLMs. Sci Rep 15, 32175 (2025). https://doi.org/10.1038/s41598-025-14961-6