Scientific Reports

Table 3 Mann–Whitney U test statistics comparing GPT-4o and human expert evaluation scores across all ToM task categories. No statistically significant differences were observed (all p > 0.14), indicating strong alignment between model and expert assessments.

From: Large language models for autism: evaluating theory of mind tasks in a gamified environment

Category	U Statistic	p-value
Whole Dataset	548509.5	0.7495
Faux pas	69918.0	0.3711
Irony	23163.0	0.1439
Hinting Task	22304.0	0.7834
Strange Stories	30482.0	0.1635
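
For readers who want to see how a U statistic and p-value of this kind can be obtained, the following is a minimal sketch of a two-sided Mann–Whitney U test. It assumes the normal approximation with tie correction and no continuity correction, and it reports U for the first sample; the article does not state which software or settings produced the values above, so results from this sketch may differ slightly from those in the table. The function name `mann_whitney_u` and the toy scores are hypothetical and are not taken from the study's data.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample x, p-value). Assumption: no continuity correction.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                      key=lambda t: t[0])

    ranks = [0.0] * n
    tie_term = 0  # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    # Rank sum of the first sample, converted to its U statistic.
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all pooled values identical
        return u1, 1.0

    z = (u1 - mean_u) / math.sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, min(p, 1.0)


if __name__ == "__main__":
    # Hypothetical toy scores, for illustration only (not the study's data).
    model_scores = [2, 1, 2, 0, 1, 2, 2, 1]
    expert_scores = [2, 1, 1, 0, 2, 2, 1, 1]
    u, p = mann_whitney_u(model_scores, expert_scores)
    print(f"U = {u}, p = {p:.4f}")
```

A large p-value from such a test means the data give no evidence that one group's scores tend to be higher than the other's; it does not by itself quantify how closely the two sets of scores agree.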
