Abstract
Accurate emotion recognition is a foundational component of social cognition, yet human biases can compromise its reliability. The emergent capabilities of multimodal large language models (MLLMs) offer a potential avenue for objective analysis, but their performance has been tested mainly with ethnically homogeneous stimuli. This study provides a systematic cross-ethnic evaluation of leading MLLMs on an emotion recognition task to assess their accuracy and consistency across diverse groups. We evaluated three models: ChatGPT-4, ChatGPT-4o, and Claude 3 Opus. Performance was tested twice using three “Reading the Mind in the Eyes Test” (RMET) versions featuring White, Black, and Korean faces. We analyzed accuracy against chance (25%) and compared scores to established human normative data for each ethnic version. ChatGPT-4o performed significantly above chance across all tests (p < .001), with large effect sizes (Cohen’s h = 1.253–1.619; RD = 0.583–0.694). The model obtained a mean accuracy of 83.3% (30/36) on the White RMET, 94.4% (34/36) on the Black RMET, and 86.1% (31/36) on the Korean RMET, placing it in the 85th, 94th, and 90th percentiles of human norms, respectively. This high accuracy remained consistent across ethnic stimuli. In contrast, ChatGPT-4 performed near the human average, while Claude 3 Opus performed near chance. These preliminary findings illustrate the rapid evolution of MLLMs, with a marked performance leap between consecutive versions. They suggest that ChatGPT-4o exceeded average human accuracy on this specific task of recognizing complex emotions from static images of the eye region, with performance remaining consistent across ethnic groups. While these results are notable, the pronounced performance gaps between models and the inherent limitations of the RMET task underscore the need for continuous validation and careful ethical consideration to fully understand the capabilities and boundaries of this technology.
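For readers who wish to check the reported statistics, the minimal sketch below reproduces the effect sizes from the raw ChatGPT-4o accuracy scores. It assumes a one-sided exact binomial test against the 25% chance level and the standard arcsine formulation of Cohen's h; it is an illustrative reconstruction, not the authors' analysis code, and the exact test the authors used is not specified in the abstract.

```python
import math
from scipy.stats import binomtest

CHANCE = 0.25  # four answer options per RMET item
N_ITEMS = 36   # items per RMET version

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: arcsine-transformed difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# ChatGPT-4o scores reported in the abstract (items correct out of 36)
scores = {"White RMET": 30, "Black RMET": 34, "Korean RMET": 31}

for version, correct in scores.items():
    p_hat = correct / N_ITEMS
    # One-sided exact binomial test against chance (assumed test choice)
    p_value = binomtest(correct, N_ITEMS, CHANCE, alternative="greater").pvalue
    h = cohens_h(p_hat, CHANCE)   # 1.253 for 30/36; 1.619 for 34/36
    rd = p_hat - CHANCE           # risk difference: 0.583 for 30/36
    print(f"{version}: acc = {p_hat:.1%}, h = {h:.3f}, RD = {rd:.3f}, p = {p_value:.1e}")
```

Under these assumptions, the sketch yields h = 1.253–1.619 and RD = 0.583–0.694, matching the ranges reported above, with all three one-sided p-values well below .001.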
Data availability
In line with open science principles, the full prompts, study materials, and raw data are publicly available at http://osf.io/6rh8m. A detailed description of the manual data collection procedure is also provided in the repository.
References
Kosinski, M. Evaluating large language models in theory of mind tasks. Proc. Natl. Acad. Sci. USA. 121, e2405460121. https://doi.org/10.1073/pnas.2405460121 (2024).
van Duijn, M. J. et al. Theory of mind in large language models: Examining performance of 11 state-of-the-art models vs. children aged 7–10 on advanced tests. Preprint at https://doi.org/10.18653/v1/2023.conll-1.25 (2023).
Lee, J., Choi, Y., Song, M. & Park, S. ChatFive: Enhancing user experience in likert scale personality test through interactive conversation with LLM agents. Proc. 6th ACM Conf. Conversational User Interfaces. 1–8. https://doi.org/10.1145/3640794.3665572 (2024).
Hamalwa, G. D. Mind the (AI) gap: Psychometric profiling of GPT models for bias exploration. (2024).
Zhang, Y., Zou, C., Lian, Z., Tiwari, P. & Qin, J. SarcasmBench: Towards evaluating large language models on sarcasm understanding. IEEE Trans. Affect. Comput. https://doi.org/10.1109/taffc.2025.3604806 (2025).
Hall, J. A., Harrigan, J. A. & Rosenthal, R. Nonverbal behavior in clinician–patient interaction. Appl. Prev. Psychol. 4, 21–37. https://doi.org/10.1016/S0962-1849(05)80049-6 (1995).
Tian, Y., Kanade, T. & Cohn, J. F. Facial expression recognition. In Handbook of Face Recognition 487–519 (Springer, London, 2011). https://doi.org/10.1007/978-0-85729-932-1_19
Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154. https://doi.org/10.1038/s41586-024-07856-5 (2024).
Zack, T. et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: A model evaluation study. Lancet Digit. Health. 6, e12–e22. https://doi.org/10.1016/S2589-7500(23)00225-X (2024).
Atzil-Slonim, D. et al. Therapists’ empathic accuracy toward their clients’ emotions. J. Consult Clin. Psychol. 87, 33. https://doi.org/10.1037/ccp0000354 (2019).
Kashner, T. M. et al. Impact of structured clinical interviews on physicians’ practices in community mental health settings. Psychiatr Serv. 54, 712–718. https://doi.org/10.1176/appi.ps.54.5.712 (2003).
Mobbs, R., Makris, D. & Argyriou, V. Emotion recognition and generation: A comprehensive review of face, speech, and text modalities. Preprint at https://doi.org/10.48550/arXiv.2502.06803 (2025).
Picard, R. W. Affective Computing (MIT Press, 2000).
Schlegel, K., Sommer, N. R. & Mortillaro, M. Large language models are proficient in solving and creating emotional intelligence tests. Commun. Psychol. 3, 80. https://doi.org/10.1038/s44271-025-00258-x (2025).
Kramer, R. S. Comparing ChatGPT with human judgements of social traits from face photographs. Comput. Hum. Behav. Artif. Hum. 4, 100156 (2025).
Nelson, B. et al. Evaluating the performance of large language models in identifying human facial emotions: GPT 4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet. Preprint at https://doi.org/10.31234/osf.io/pxq5h_v1 (2025).
Baron-Cohen, S., Wheelwright, S., Hill, J., Raste, Y. & Plumb, I. The ‘Reading the Mind in the Eyes’ Test revised version: A study with normal adults, and adults with Asperger syndrome or high-functioning autism. J. Child. Psychol. Psychiatry. 42, 241–251. https://doi.org/10.1017/s0021963001006643 (2001).
Handley, G., Kubota, J. T., Li, T. & Cloutier, J. Black ‘Reading the Mind in the Eyes’ task: The development of a task assessing mentalizing from Black faces. PLoS One. 14, e0221867. https://doi.org/10.1371/journal.pone.0221867 (2019).
Elyoseph, Z. et al. Capacity of generative AI to interpret human emotions from visual and textual data: Pilot evaluation study. JMIR Ment Health. 11, e54369. https://doi.org/10.2196/54369 (2024).
Scherer, K. R., Clark-Polner, E. & Mortillaro, M. In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. Int. J. Psychol. 46, 401–435. https://doi.org/10.1080/00207594.2011.626049 (2011).
Yan, X., Andrews, T. J., Jenkins, R. & Young, A. W. Cross-cultural differences and similarities underlying other-race effects for facial identity and expression. Q. J. Exp. Psychol. https://doi.org/10.1080/17470218.2016.1146312 (2016).
Flade, F. & Imhoff, R. Closing a conceptual gap in race perception research: A functional integration of the other-race face recognition and who said what? paradigms. J. Pers. Soc. Psychol. 127, 1. https://doi.org/10.1037/pspa0000388 (2024).
Hadar-Shoval, D., Asraf, K., Mizrachi, Y., Haber, Y. & Elyoseph, Z. Assessing the alignment of large language models with human values for mental health integration: Cross-sectional study using Schwartz’s theory of basic values. JMIR Ment Health. 11, e55988. https://doi.org/10.2196/55988 (2024).
Hadar-Shoval, D., Asraf, K., Shinan-Altman, S., Elyoseph, Z. & Levkovich, I. Embedded values-like shape ethical reasoning of large language models on primary care ethical dilemmas. Heliyon 10, e38056. https://doi.org/10.1016/j.heliyon.2024.e38056 (2024).
Fiske, A., Henningsen, P. & Buyx, A. Your robot therapist will see you now: Ethical implications of embodied artificial intelligence in psychiatry, psychology, and psychotherapy. J. Med. Internet Res. 21, e13216. https://doi.org/10.2196/13216 (2019).
Koo, S. J. et al. Reading the Mind in the Eyes Test: Translated and Korean versions. Psychiatry Investig. 18, 295. https://doi.org/10.30773/pi.2020.0289 (2021).
OpenAI. GPT-4. (2023). https://openai.com/product/gpt-4
OpenAI. ChatGPT-4o. (2024). https://openai.com/index/hello-gpt-4o/
Anthropic. Claude AI. (2023). https://www.anthropic.com/claude
Jiao, J. & Chang, A. Evaluating sentiment and spatial patterns of EV charging station user experience with AI-agents. Int. J. Urban Sci. 1–29. https://doi.org/10.1080/12265934.2025.2547792 (2025).
Fisher, J. et al. Political neutrality in AI is impossible—but here is how to approximate it. Preprint at http://arxiv.org/abs/2503.05728 (2025).
Huang, S. et al. Collective constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency 1395–1417. https://doi.org/10.1145/3630106.3658979 (2024).
Nguyen, C., Carrion, D. & Badawy, M. K. Comparative performance of Claude and GPT models in basic radiological imaging tasks. Preprint at medRxiv. https://doi.org/10.1101/2024.11.16.24317414 (2024).
Atreides, K. & Kelley, D. J. Cognitive biases in natural language: Automatically detecting, differentiating, and measuring bias in text. Cogn. Syst. Res. 88, 101304. https://doi.org/10.2139/ssrn.4568851 (2024).
Luczak, A. How artificial intelligence reduces human bias in diagnostics? AIMS Bioeng. 12, 69–89. https://doi.org/10.3934/bioeng.2025004 (2025).
Gupta, M., Virostko, J. & Kaufmann, C. Large language models in radiology: Fluctuating performance and decreasing discordance over time. Eur. J. Radiol. 182, 111842. https://doi.org/10.1016/j.ejrad.2024.111842 (2025).
Kocak, B. et al. Radiology AI and sustainability paradox: Environmental, economic, and social dimensions. Insights Imaging 16, 88. https://doi.org/10.1186/s13244-025-01962-2 (2025).
Higgins, W. C., Kaplan, D. M., Deschrijver, E. & Ross, R. M. Why most research based on the Reading the Mind in the Eyes Test is unsubstantiated and uninterpretable: A response to Murphy and Hall (2024). Clin. Psychol. Rev. 115, 102530. https://doi.org/10.1016/j.cpr.2024.102530 (2025).
Cuff, B. M., Brown, S. J., Taylor, L. & Howat, D. J. Empathy: A review of the concept. Emot. Rev. 8, 144–153. https://doi.org/10.1177/1754073914558466 (2016).
Yager, J., Kay, J. & Kelsay, K. Clinicians’ cognitive and affective biases and the practice of psychotherapy. Am. J. Psychother. 74, 119–126 (2021).
Sumsion, A., Torrie, S., Lee, D. J. & Sun, Z. Surveying racial bias in facial recognition: Balancing datasets and algorithmic enhancements. Electronics 13, 2317. https://doi.org/10.3390/electronics13122317 (2024).
Refoua, E. et al. The next frontier in mindreading? Assessing generative artificial intelligence (GAI)’s social-cognitive capabilities using dynamic audiovisual stimuli. Comput. Hum. Behav. Rep. 100702. https://doi.org/10.1016/j.chbr.2025.100702 (2025).
Konstantin, G. E., Nordgaard, J. & Henriksen, M. G. Methodological issues in social cognition research in autism spectrum disorder and schizophrenia spectrum disorder: A systematic review. Psychol. Med. 53, 3281–3292. https://doi.org/10.1017/S0033291723001095 (2023).
Vellante, M. et al. The ‘Reading the Mind in the Eyes’ Test: Systematic review of psychometric properties and a validation study in Italy. Cogn. Neuropsychiatry. 18, 326–354. https://doi.org/10.1080/13546805.2012.721728 (2013).
Hamdoun, S., Monteleone, R., Bookman, T. & Michael, K. AI-based and digital mental health apps: Balancing need and risk. IEEE Technol. Soc. Mag. 42, 25–36. https://doi.org/10.1109/MTS.2023.3241309 (2023).
Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013). https://doi.org/10.4324/9780203771587
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Author information
Authors and Affiliations
Contributions
E.R. led the project, contributed to the study conception and design, and wrote the initial manuscript draft. D.H. primarily supervised the research, contributed to data acquisition and analyses, and led the major revisions of the manuscript. Z.E. and G.M. contributed to the study’s conception and design, participated in data acquisition and analysis, and critically revised the manuscript. D.P. and A.G. assisted with data acquisition and analysis and reviewed the manuscript drafts. All authors reviewed and approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable. This study evaluated the performance of publicly available Generative AI models and did not involve human participants, and therefore did not require ethics approval or consent to participate.
Consent for publication
Not applicable. This study did not involve human participants.
Competing interests
Dr. Gunther Meinlschmidt (GM) received funding from various sources including the Stanley Thomas Johnson Stiftung, Gottfried und Julia Bangerter-Rhyner-Stiftung, Gesundheitsförderung Schweiz, Swiss Heart Foundation, Research Foundation of the International Psychoanalytic University (IPU) Berlin, German Federal Ministry of Education and Research, Hasler Foundation, Swiss State Secretariat for Education, Research and Innovation (SERI), and Wings Health. GM receives royalties from publishing companies as an author, including a book published by Springer. GM received an honorarium from Lundbeck for speaking at a symposium. GM is a co-founder of Therayou AG and owns stock in this company. GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator (‘Selbsterfahrungsleiter’), and for postgraduate training of psychotherapists, psychosomatic specialists, and supervisors. Elad Refoua (ER), Dr. Zohar Elyoseph (ZE), Dr. Dorit Hadar-Shoval (DH), David Piterman (DP), and Alon Geller (AG) have declared no competing interests.
Clinical trial registration
Not applicable. This study is not a clinical trial and did not require registration.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Refoua, E., Elyoseph, Z., Piterman, D. et al. Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the Reading the Mind in the Eyes Test. Sci Rep (2026). https://doi.org/10.1038/s41598-026-39292-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-39292-y


