Introduction

Generative artificial intelligence (GenAI) based on large language models (LLMs) is becoming central to human–computer interaction (HCI), demonstrating impressive capabilities in interpreting human intentions1,2 and understanding human cognitive, social, and emotional processes. Facial expressions are a key aspect of social–emotional functioning and provide valuable information about human goals, emotions, and psychological states3.

LLMs have expanded beyond traditional text-based tasks and can now process and integrate multimodal inputs such as vision, speech, and text. They have shown promise in social cognition tasks such as “theory of mind,” sometimes matching or exceeding human performance on mentalistic inference2. However, these results are largely based on text-only stimuli and are less robust in assessments where context is critical2,4,5. Studies evaluating LLMs' visual emotion recognition have yielded mixed results, with some models performing no better than chance6.

GenAI’s ability to interpret facial expressions holds promise for HCI applications, particularly in behavioral healthcare7,8,9. Subtle expression changes may indicate mental health conditions like depression, anxiety, or even suicidal ideation10,11. AI-powered systems trained to recognize these nuanced expressions could potentially enable earlier diagnosis, real-time monitoring, and adaptive interventions.

Facial expressions and their interpretation can vary by culture12 and context13, highlighting the importance of using diverse stimuli with validated ground-truth labels and normative human performance data. Moreover, the need to evaluate performance across diverse actors (i.e., across sex, race, and ethnicity) is well recognized6,14.

Results

Agreement

Cohen’s Kappa (κ) across all stimuli and expressions was 0.83 (95% CI: 0.80–0.85) for ChatGPT 4o, 0.81 (95% CI: 0.77–0.84) for Gemini 2.0 Experimental, and 0.70 (95% CI: 0.67–0.74) for Claude 3.5 Sonnet. Kappas by emotion class can be found in Table 1 and Fig. 1a.

Fig. 1: LLM model performance.
figure 1

a Agreement with NimStim human performance benchmark and b overall accuracy and recall by emotion class. Pink = NimStim Benchmark; Blue = ChatGPT 4o; Green = Gemini 2.0 Experimental; Red = Claude 3.5 Sonnet.

Table 1 Description of validity ratings for LLM emotional expression estimates across all emotions

Confusion matrix

Overall accuracy across all actors and expressions was 86% (95% CI: 84–89%) for ChatGPT 4o, 84% (95% CI: 81–87%) for Gemini 2.0 Experimental, and 74% (95% CI: 71–78%) for Claude 3.5 Sonnet. Accuracy by emotion class can be found in Table 1 and Fig. 1b. For ChatGPT 4o and Gemini 2.0 Experimental, there was little variability in performance across emotion categories, except for fear, which was misclassified as surprise 52.50% and 36.25% of the time, respectively (see Figs. 2a-b and 3a-b). For Claude 3.5 Sonnet, there was more variability across emotion categories, with sadness misclassified as disgust 20.24% of the time and fear misclassified as surprise 36.25% of the time (see Figs. 2c and 3c).

Fig. 2: Confusion matrix.
figure 2

a ChatGPT 4o, b Gemini 2.0 Experimental, and c Claude 3.5 Sonnet.

Fig. 3: Alluvial plot.
figure 3

a ChatGPT 4o, b Gemini 2.0 Experimental, c Claude 3.5 Sonnet. The left column is the ground truth and the right column is the model prediction.

Lastly, there were no significant differences in model performance for accuracy, recall, or kappa based on the sex or race of the actor (see Table 2).

Table 2 LLM agreement by sex and race

Discussion

This study evaluated three leading LLMs (ChatGPT 4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet) on facial emotion recognition using the NimStim dataset. ChatGPT 4o and Gemini 2.0 Experimental demonstrated “almost perfect”15,16 agreement and high accuracy with ground-truth labels overall, with performance comparable to or exceeding human raters on some emotions. Claude 3.5 Sonnet exhibited lower overall agreement and accuracy than the other two models.

There was significant variability in Cohen’s Kappa and recall within and between emotion classes. All models performed relatively well on Happy, Calm/Neutral, and Surprise, but had difficulty recognizing Fear, often misclassifying it as Surprise. ChatGPT 4o achieved the best performance across emotions and significantly outperformed Claude 3.5 Sonnet on several emotions, including Calm/Neutral, Sad, Disgust, and Surprise. Gemini 2.0 Experimental also outperformed Claude 3.5 Sonnet for Calm/Neutral, Disgust, and Surprise. When comparing these models’ performance to human observers in the NimStim dataset, the overall 95% confidence intervals for kappa overlapped for humans, ChatGPT 4o, and Gemini 2.0 Experimental, indicating similar overall levels of reliability. In contrast, Claude 3.5 Sonnet’s 95% CI did not overlap with that of humans, suggesting lower overall reliability. At the level of individual model-by-emotion comparisons, most 95% CIs overlapped; however, three exceptions emerged: ChatGPT 4o showed higher reliability than humans for Surprise and Calm/Neutral, Gemini 2.0 Experimental outperformed humans for Surprise, and Claude 3.5 Sonnet was less reliable than humans for Calm/Neutral.

The literature has previously documented LLM biases, but the current findings indicate that facial emotion recognition performance did not differ by actor sex or race. Furthermore, prior CNN models on this dataset achieved only moderate classification performance (42% accuracy overall, with large emotion-specific variability)17. In contrast, zero-shot vision-language models, used without training, fine-tuning, or architectural customization, may offer stronger generalization.

Although these findings show promise for foundation models in affective computing, limitations remain. All stimuli were static images18 of actors aged 21–30, most of whom were European American, which may limit generalizability. Verbal context can modify the meaning of facial expressions, highlighting the need for future multimodal emotion classification work that includes auditory stimuli19. Furthermore, although we selected the NimStim dataset because it is accessible only to researchers upon request and has not appeared in LLM publications, thereby minimizing the likelihood that it was included in model training and positively biased results, relying on a single dataset may limit the generalizability of our findings. While we tested three general-purpose models, specialized large models designed for facial expression and micro-expression recognition (e.g., ExpLLM, MELLM) are also available; future research should evaluate these models on this dataset to compare their performance with general-purpose LLMs. Prompt wording varied slightly across models due to interface constraints, potentially affecting results. Specific healthcare applications may benefit from fine-tuning models or incorporating the Facial Action Coding System into retrieval-augmented generation frameworks to improve recognition of more subtle or complex emotions, such as fear. Understanding when and why models succeed or fail will be critical for guiding responsible integration. Future research should also evaluate open-weight models such as Llama or DeepSeek, which can support more transparent evaluation, local deployment, and stronger privacy protections, all important considerations for clinical applications.

Overall, this study provides an initial benchmark for evaluating LLMs’ socioemotional capabilities. Although ChatGPT 4o and Gemini 2.0 Experimental demonstrated reliability comparable to human observers across emotion categories, Claude 3.5 Sonnet showed lower overall reliability, so caution is warranted when translating these findings and using general-purpose LLMs in applied settings. Further testing with ecologically valid, multimodal, and demographically diverse stimuli is essential to understand their limitations and potential.

Methods

Study design

The current study was deemed exempt by the Beth Israel Deaconess Medical Center Institutional Review Board (protocol 2025P000198).

Facial expression stimuli

The NimStim Set of Facial Expressions, a large multiracial image dataset, was used as the facial expression stimuli15. It is a comprehensive collection of 672 images depicting facial expressions posed by 43 professional actors (18 female, 25 male) aged between 21 and 30 years. The actors represent diverse racial backgrounds, including African-American (10 actors), Asian-American (6 actors), European-American (25 actors), and Latino-American (2 actors). Each actor portrays eight distinct emotional expressions: neutral, happy, sad, angry, surprised, fearful, disgusted, and calm. Psychometric evaluations with naive observers have demonstrated a high proportion correct of 0.81 (SD = 0.19; 95% CI: 0.77–0.85), high agreement between raters (kappa = 0.79, SD = 0.17; 95% CI: 0.75–0.83), and high test-retest reliability of 0.84 (SD = 0.08; 95% CI: 0.82–0.86)15. The dataset has been used extensively in research, with over 2000 citations20,21,22,23. The authors have obtained written consent to publish images of models #01, 03, 18, 21, 28, 40, and 45.

The NimStim dataset provides an independent benchmark, as it is proprietary and restricted to authorized research institutions through licensing agreements that explicitly prohibit public distribution. Our verification process, including extensive web searches, found no public availability of the NimStim data, suggesting it was unlikely to have been included in LLM training datasets. NimStim calm and neutral expressions were recoded as calm_neutral, consistent with Tottenham et al.15, who noted minimal perceptual differences between the two and treated either label as correct. Results separating calm and neutral are provided in Supplementary Table 2.
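As a concrete illustration, the recoding simply collapses the two labels into one category before scoring. The minimal sketch below assumes label vectors named truth and pred; these names are illustrative placeholders rather than the study's original analysis code.

```r
## Collapse "calm" and "neutral" into a single calm_neutral category before
## scoring; `truth` and `pred` are placeholder names for the ground-truth
## and model-generated labels.
recode_calm <- function(x) ifelse(x %in% c("calm", "neutral"), "calm_neutral", x)
truth <- recode_calm(truth)
pred  <- recode_calm(pred)
```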

Large language models

OpenAI GPT-4o, Google Gemini 2.0 Experimental, and Anthropic Claude 3.5 Sonnet were used for facial expression recognition.

Procedures

All 672 NimStim images were individually uploaded twice to each LLM for facial emotion processing using the user-facing interface rather than the API, because, at the time of testing, only OpenAI offered batching of multiple image inputs through the API for the selected models. Standardizing the methodology on the user interface ensured that each model’s response remained grounded in the initial instruction. Prompts varied slightly across models because initial responses indicated an inability to follow the prompt, likely due to built-in constraints and safety guardrails (see Supplementary Table 1).

Analyses

All analyses were conducted in R v4.3.1.

Agreement

We assessed agreement between each LLM’s output and the ground-truth labels by conducting a stratified bootstrap analysis of Cohen’s kappa (κ) to address repeated measures within participants and imbalances across emotion categories. For each of 1000 bootstrap iterations, participants were sampled with replacement, and within-participant emotion categories were balanced via oversampling. We report the mean κ and 95% confidence intervals and interpreted agreement using standard thresholds (moderate: 0.4–0.6, substantial: 0.6–0.8, almost perfect: ≥0.8)15,16. We applied the same oversampling bootstrap method to calculate κ for emotion class, sex, and race categories separately. Finally, we benchmarked model performance against the κ values reported for the NimStim dataset by comparing 95% confidence interval overlap15.
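A minimal R sketch of this bootstrap is shown below. It assumes a data frame named ratings with columns actor, truth, and pred, and it reports percentile intervals; these names and the interval type are our assumptions rather than the study's original analysis code.

```r
## Minimal sketch of the stratified bootstrap for (unweighted) Cohen's kappa.
## Assumes a data frame `ratings` with columns `actor`, `truth`, and `pred`;
## the column names and the percentile CI are illustrative assumptions.

cohens_kappa <- function(truth, pred) {
  lev <- union(as.character(truth), as.character(pred))
  tab <- table(factor(truth, lev), factor(pred, lev))
  n   <- sum(tab)
  po  <- sum(diag(tab)) / n                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance-expected agreement
  (po - pe) / (1 - pe)
}

boot_kappa <- function(ratings, n_boot = 1000, seed = 1) {
  set.seed(seed)
  actors <- unique(ratings$actor)
  kappas <- replicate(n_boot, {
    # Resample actors with replacement (repeated measures within actor) ...
    boot <- do.call(rbind, lapply(sample(actors, replace = TRUE), function(a) {
      d     <- ratings[ratings$actor == a, ]
      n_max <- max(table(d$truth))
      # ... and oversample emotion categories within each actor to balance classes.
      do.call(rbind, lapply(split(d, d$truth, drop = TRUE), function(e)
        e[sample(nrow(e), n_max, replace = TRUE), ]))
    }))
    cohens_kappa(boot$truth, boot$pred)
  })
  c(kappa = mean(kappas), quantile(kappas, c(0.025, 0.975)))
}
```

In this sketch the class balancing happens within each resampled actor, so every bootstrap replicate preserves the repeated-measures structure while equalizing emotion frequencies before κ is computed.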

Confusion matrix, accuracy, recall, precision, and F1

To evaluate the classification performance of each LLM, we computed confusion matrices and derived standard metrics, including accuracy, precision, recall, and F1-score, for each model across emotion categories. The matrix quantifies classification performance by showing the count of samples for each combination of actual and predicted emotions, with correct classifications on the diagonal; row totals reflect the total occurrences of each actual emotion across the dataset, column totals reflect the number of times each emotion was predicted by the model, and a grand total represents the overall number of samples in the analysis. Note that the per-class balanced accuracy was equivalent to recall, a common metric in multi-class classification. Metrics were calculated per class and overall, with 95% confidence intervals estimated. κ and accuracy were also stratified by sex and race.
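The derived metrics follow directly from the confusion matrix; the brief base-R sketch below, again using the placeholder truth and pred label vectors, illustrates one way to compute them.

```r
## Confusion matrix with margins, plus per-class metrics, from placeholder
## `truth` (actual) and `pred` (predicted) emotion labels.
lev <- union(as.character(truth), as.character(pred))
cm  <- table(Actual = factor(truth, lev), Predicted = factor(pred, lev))
addmargins(cm)                          # adds row/column totals and a grand total

accuracy  <- sum(diag(cm)) / sum(cm)    # overall proportion of correct labels
recall    <- diag(cm) / rowSums(cm)     # per-class recall (sensitivity)
precision <- diag(cm) / colSums(cm)     # per-class precision
f1        <- 2 * precision * recall / (precision + recall)
round(cbind(recall, precision, f1), 3)
```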

Methods of model comparison to NimStim

We benchmarked the performance of the LLMs against the κ reported for untrained human observers in the NimStim dataset. However, it is important to note that the original authors did not specify how they calculated κ. Tottenham et al.15 presented κ for each emotion separately for open-mouth and closed-mouth stimuli. To obtain a single κ estimate per emotion category, comparable to the results in the current study, we aggregated κ across the two mouth states. First, κ values and their associated standard deviations (SDs) were extracted separately for the open- and closed-mouth conditions. The mean κ for each emotion was computed as the arithmetic average of the κ values from the two mouth states. To account for variability across mouth-state conditions, we calculated the pooled SD as the square root of the mean of the squared SD values, ensuring equal weighting across conditions. This approach provided a single, representative estimate of κ for each emotion while preserving the contributions from both facial configurations. Finally, to determine whether the LLMs performed similarly, we assessed whether the 95% confidence intervals of these κ values overlapped, indicating comparable (or different) levels of agreement.
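To make the aggregation concrete, the short sketch below expresses the pooling and the interval-overlap check as functions; the arguments are placeholders, and the published κ and SD values themselves are not reproduced here.

```r
## Pool the human-benchmark kappa across the open- and closed-mouth conditions
## with equal weights; arguments are placeholders, not the published values.
pool_kappa <- function(k_open, k_closed, sd_open, sd_closed) {
  data.frame(
    kappa = (k_open + k_closed) / 2,             # arithmetic mean of the two kappas
    sd    = sqrt((sd_open^2 + sd_closed^2) / 2)  # pooled SD (root mean square of the SDs)
  )
}

## Two estimates were treated as comparable when their 95% CIs overlapped.
ci_overlap <- function(lo1, hi1, lo2, hi2) lo1 <= hi2 & lo2 <= hi1
```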