Table 2 Summary of evaluation metrics for each model on the Koide-Majima et al. dataset. For the shuffled human ratings, the 95% percentile interval of the null distribution is reported for each metric (see Methods for details).

From: Correspondence of high dimensional emotion structures elicited from video clips between humans and multimodal LLMs

| Model | Data pattern | Evaluation method | Top 100 videos | Top 250 videos | All video |
| --- | --- | --- | --- | --- | --- |
| Human | Video | Mean of correlation on each video | – | – | 0.313 |
| | | Correlation between RDMs | 0.965 | 0.931 | 0.859 |
| | | Matching rate of GW alignment | 41.0% | 33.6% | 16.36% |
| | | 10 hierarchical category matching rate | 81.0% | 85.6% | 66.18% |
| Gemini-2.0-flash-001 | Video | Mean of correlation on each video | – | – | 0.374 |
| | | Correlation between RDMs | 0.938 | 0.818 | 0.558 |
| | | Matching rate of GW alignment | 17.0% | 8.4% | 2.36% |
| | | 10 hierarchical category matching rate | 69.0% | 71.6% | 50.5% |
| Shuffled human ratings | Video | Mean of correlation on each video | – | – | \([0.321, 1.72]\times 10^{-2}\) |
| | | Correlation between RDMs | [0.0461, 0.505] | [0.0154, 0.0179] | \([-0.509, 1.08]\times 10^{-3}\) |
| | | Matching rate of GW alignment | \([4.90, 5.90]\%\) | \([2.28, 2.60]\%\) | \([1.01, 1.23]\%\) |
| | | 10 hierarchical category matching rate | \([19.1, 25.5]\%\) | \([18.2, 21.1]\%\) | \([17.3, 19.4]\%\) |
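The shuffled-ratings entries are reported as 95% percentile intervals of a null distribution. The following is a minimal sketch of how such an interval can be computed for the "mean of correlation on each video" metric. It is not taken from the paper's Methods: the shuffling scheme (permuting which video each rating vector belongs to), the number of shuffles, the matrix shapes, and the function names `mean_rowwise_correlation` and `shuffle_null_interval` are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def mean_rowwise_correlation(a, b):
    """Mean Pearson correlation between matching rows (videos) of a and b."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return float(np.mean(num / den))


def shuffle_null_interval(ratings_a, ratings_b, n_shuffles=1000, level=95.0, seed=0):
    """Percentile interval of the metric after shuffling the video order of ratings_b.

    Assumption: the null is built by permuting rows, which breaks the
    video-to-video pairing while keeping each rating vector intact.
    """
    rng = np.random.default_rng(seed)
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = rng.permutation(ratings_b.shape[0])
        null[i] = mean_rowwise_correlation(ratings_a, ratings_b[perm])
    tail = (100.0 - level) / 2.0
    low, high = np.percentile(null, [tail, 100.0 - tail])
    return float(low), float(high)


# Synthetic stand-in data: 200 hypothetical videos x 20 hypothetical emotion categories.
rng = np.random.default_rng(42)
ratings_a = rng.normal(size=(200, 20))
ratings_b = ratings_a + rng.normal(scale=1.0, size=(200, 20))

observed = mean_rowwise_correlation(ratings_a, ratings_b)
low, high = shuffle_null_interval(ratings_a, ratings_b)
print(f"observed = {observed:.3f}, null 95% interval = [{low:.4f}, {high:.4f}]")
```

Under this reading, an observed value that lies above the upper end of the interval, as the human and Gemini rows do relative to the shuffled rows in the table, indicates agreement beyond what unpaired ratings would produce.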