Table 3 Mann–Whitney U test statistics comparing GPT-4o and human expert evaluation scores across all ToM task categories. No statistically significant differences were observed (all p > 0.14), indicating strong alignment between model and expert assessments.

From: Large language models for autism: evaluating theory of mind tasks in a gamified environment

| Category        | U Statistic | p-value |
|-----------------|-------------|---------|
| Whole Dataset   | 548509.5    | 0.7495  |
| Faux pas        | 69918.0     | 0.3711  |
| Irony           | 23163.0     | 0.1439  |
| Hinting Task    | 22304.0     | 0.7834  |
| Strange Stories | 30482.0     | 0.1635  |
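
For readers unfamiliar with the test, the following is a minimal sketch of how a two-sided Mann–Whitney U statistic and p-value can be computed from two lists of scores. It is not the authors' code: the table does not say which software or settings were used, the function name `mann_whitney_u` and the score lists are hypothetical, and the p-value here comes from a tie-corrected normal approximation without continuity correction, so it may differ slightly from what a statistics package such as SciPy's `scipy.stats.mannwhitneyu` reports.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample x, approximate two-sided p-value).
    Illustrative sketch only; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))

    # Assign each distinct value the average (mid) rank of its tied group.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # U statistic for sample x from its rank sum.
    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Mean and tie-corrected variance of U under the null hypothesis.
    mean_u = n1 * n2 / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference

    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return u_x, p_two_sided


# Hypothetical scores for illustration; these are NOT data from the study.
model_scores = [3, 4, 4, 5, 2, 3, 4, 5]
expert_scores = [3, 4, 5, 5, 2, 3, 4, 4]

u, p = mann_whitney_u(model_scores, expert_scores)
print(f"U = {u:.1f}, p = {p:.4f}")
```

A large p-value, as in every row of the table, means the test found no evidence that one group's scores tend to be higher than the other's. It does not by itself prove that the two sets of scores are equivalent.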