  • Article
  • Open access
  • Published: 20 February 2026

Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test

  • Elad Refoua1,
  • Zohar Elyoseph2,7,
  • David Piterman3,
  • Alon Geller4,
  • Gunther Meinlschmidt5,6 &
  • Dorit Hadar Shoval3 

Scientific Reports, Article number: (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Health care
  • Mathematics and computing
  • Medical research
  • Psychology

Abstract

Accurate emotion recognition is a foundational component of social cognition, yet human biases can compromise its reliability. The emergent capabilities of multimodal large language models (MLLMs) offer a potential avenue for objective analysis, but their performance has been tested mainly with ethnically homogeneous stimuli. This study provides a systematic cross-ethnic evaluation of leading MLLMs on an emotion recognition task to assess their accuracy and consistency across diverse groups. We evaluated three leading MLLMs: ChatGPT-4, ChatGPT-4o, and Claude 3 Opus. Performance was tested twice on each of three “Reading the Mind in the Eyes Test” (RMET) versions featuring White, Black, and Korean faces. We analyzed accuracy against chance (25%) and compared scores with established human normative data for each ethnic version. ChatGPT-4o performed significantly above chance across all tests (p < .001), with large effect sizes indicating robust performance (Cohen’s h = 1.253–1.619; RD = 0.583–0.694). The model obtained a mean accuracy of 83.3% (30/36) on the White RMET, 94.4% (34/36) on the Black RMET, and 86.1% (31/36) on the Korean RMET, placing it in the 85th, 94th, and 90th percentiles of human norms, respectively. This high accuracy remained consistent across ethnic stimuli. In contrast, ChatGPT-4 performed near the human average, while Claude 3 Opus performed near chance level. These preliminary findings highlight the rapid evolution of MLLMs, including a marked performance leap between consecutive versions. On this specific task, ChatGPT-4o exceeded average human accuracy in recognizing complex emotions from static images of the eye region, and its performance remained consistent across different ethnic groups. While these results are notable, the pronounced performance gaps between models and the inherent limitations of the RMET underscore the need for continuous validation and careful ethical consideration to fully understand the capabilities and boundaries of this technology.
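The statistical comparison reported above can be illustrated with a short worked example. The sketch below is not the authors’ analysis code; it assumes a one-sided exact binomial test against the 25% chance level, Cohen’s h computed via the arcsine transform of proportions, and the risk difference (RD) taken as the observed proportion minus chance, applied to ChatGPT-4o’s reported scores (30, 34, and 31 correct out of 36 items).

```python
# Minimal sketch (assumptions as stated in the lead-in), reproducing the
# chance-level comparison for ChatGPT-4o's reported RMET scores.
import math

def binom_tail(k: int, n: int, p0: float) -> float:
    """One-sided exact binomial p-value: P(X >= k) when each item has chance p0."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

N_ITEMS, CHANCE = 36, 0.25
scores = {"White RMET": 30, "Black RMET": 34, "Korean RMET": 31}  # ChatGPT-4o, from the abstract

for version, k in scores.items():
    p_obs = k / N_ITEMS
    print(f"{version}: {k}/{N_ITEMS} ({p_obs:.1%}), "
          f"p = {binom_tail(k, N_ITEMS, CHANCE):.1e}, "
          f"h = {cohens_h(p_obs, CHANCE):.3f}, "
          f"RD = {p_obs - CHANCE:.3f}")
```

Under these assumptions the sketch yields h ≈ 1.253 and RD ≈ 0.583 for 30/36, and h ≈ 1.619 and RD ≈ 0.694 for 34/36, matching the range reported in the abstract; the exact procedure used by the authors may differ.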


Data availability

In line with open science principles, the full prompts, study materials, and raw data are publicly available at http://osf.io/6rh8m. A detailed description of the manual data collection procedure is also provided in the repository.


Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Authors and Affiliations

  1. Department of Psychology, Bar-Ilan University, Ramat-Gan, Israel

    Elad Refoua

  2. School of Counseling and Human Development, University of Haifa, Haifa, Israel

    Zohar Elyoseph

  3. Department of Psychology, University of Kiryat Shmona and the Galilee, Tel-Hai, Israel

    David Piterman & Dorit Hadar Shoval

  4. Ruppin Academic Center, Emek Hefer, Israel

    Alon Geller

  5. Clinical Psychology and Psychotherapy – Methods and Approaches, Department of Psychology, Trier University, Trier, Germany

    Gunther Meinlschmidt

  6. Department of Digital and Blended Psychosomatics and Psychotherapy, Psychosomatic Medicine, University of Basel and University Hospital Basel, Basel, Switzerland

    Gunther Meinlschmidt

  7. Imperial College London, London, UK

    Zohar Elyoseph


Contributions

E.R. led the project, contributed to the study conception and design, and wrote the initial manuscript draft. D.H. primarily supervised the research, contributed to data acquisition and analyses, and led the major revisions of the manuscript. Z.E. and G.M. contributed to the study’s conception and design, participated in data acquisition and analysis, and critically revised the manuscript. D.P. and A.G. assisted with data acquisition and analysis and reviewed the manuscript drafts. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Elad Refoua or Zohar Elyoseph.

Ethics declarations

Ethics approval and consent to participate

Not applicable. This study evaluated the performance of publicly available generative AI models and did not involve human participants; therefore, it did not require ethics approval or consent to participate.

Consent for publication

Not applicable. This study did not involve human participants.

Competing interests

Dr. Gunther Meinlschmidt (GM) received funding from various sources, including the Stanley Thomas Johnson Stiftung, Gottfried und Julia Bangerter-Rhyner-Stiftung, Gesundheitsförderung Schweiz, Swiss Heart Foundation, Research Foundation of the International Psychoanalytic University (IPU) Berlin, German Federal Ministry of Education and Research, Hasler Foundation, Swiss State Secretariat for Education, Research and Innovation (SERI), and Wings Health. GM receives royalties from publishing companies as an author, including for a book published by Springer. GM received an honorarium from Lundbeck for speaking at a symposium. GM is a co-founder of Therayou AG and owns stock in this company. GM is compensated for providing psychotherapy to patients, acting as a supervisor, serving as a self-experience facilitator (‘Selbsterfahrungsleiter’), and for postgraduate training of psychotherapists, psychosomatic specialists, and supervisors. Elad Refoua (ER), Dr. Zohar Elyoseph (ZE), Dr. Dorit Hadar-Shoval (DH), David Piterman (DP), and Alon Geller (AG) have declared no competing interests.

Clinical trial registration

Not applicable. This study is not a clinical trial and did not require registration.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Refoua, E., Elyoseph, Z., Piterman, D. et al. Evaluation of cross-ethnic emotion recognition capabilities in multimodal large language models using the reading the mind in the eyes test. Sci Rep (2026). https://doi.org/10.1038/s41598-026-39292-y


  • Received: 22 July 2025

  • Accepted: 04 February 2026

  • Published: 20 February 2026

  • DOI: https://doi.org/10.1038/s41598-026-39292-y


Keywords

  • Generative artificial intelligence (GenAI)
  • Emotion recognition
  • Cross-cultural psychology
  • Psychiatric diagnosis
  • Reading the mind in the eyes test (RMET)
  • Bias
  • Mental health

Associated content

Collection

Artificial Emotional Intelligence
