Table 3. Mann–Whitney U test statistics comparing GPT-4o and human expert evaluation scores across all ToM task categories. No statistically significant differences were observed (all p > 0.14), indicating strong alignment between model and expert assessments.
From: Large language models for autism: evaluating theory of mind tasks in a gamified environment
| Category | U statistic | p-value |
|---|---|---|
| Whole Dataset | 548509.5 | 0.7495 |
| Faux pas | 69918.0 | 0.3711 |
| Irony | 23163.0 | 0.1439 |
| Hinting Task | 22304.0 | 0.7834 |
| Strange Stories | 30482.0 | 0.1635 |
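
For readers who want to see how a U statistic and two-sided p-value of this kind can be obtained, the following is a minimal sketch in pure Python. It is not the authors' analysis code. It assumes a normal approximation with tie correction and no continuity correction, and it reports U for the first sample; the paper does not state which conventions were used. The function name `mann_whitney_u` and the two score lists are hypothetical and are not data from the study.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (illustrative sketch).

    Assumptions: normal approximation, tie-corrected variance,
    no continuity correction. Returns (U for sample x, p-value).
    """
    n1, n2 = len(x), len(y)
    combined = sorted(list(x) + list(y))
    n = n1 + n2

    # Assign midranks: tied values share the average of their rank positions.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # U for the first sample, from its rank sum.
    rank_sum_x = sum(ranks[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Mean and tie-corrected standard deviation of U under the null.
    mu = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u_x, 1.0  # all observations identical: no evidence of a shift

    z = (u_x - mu) / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical rating scores, for illustration only (not the study's data).
gpt_scores = [4, 5, 3, 4, 4, 5, 2, 4]
human_scores = [4, 4, 3, 5, 4, 5, 3, 4]
u_stat, p_value = mann_whitney_u(gpt_scores, human_scores)
```

Because rating scores take few distinct values, ties are common, which is why the sketch uses midranks and a tie-corrected variance; it is also why U can be a half-integer, as in the "Whole Dataset" row. A p-value above the chosen significance level means only that the test did not detect a difference in the score distributions.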