Table 2 Summary of evaluation metrics for each model on the Koide-Majima et al. dataset. For the shuffled human ratings, the 95% percentile interval of the null distribution is reported for each metric (see Methods for details).
Model | Data pattern | Evaluation method | Top 100 videos | Top 250 videos | All videos |
---|---|---|---|---|---|
Human | Video | Mean of correlation on each video | – | – | 0.313 |
 |  | Correlation between RDMs | 0.965 | 0.931 | 0.859 |
 |  | Matching rate of GW alignment | 41.0% | 33.6% | 16.36% |
 |  | 10 hierarchical category matching rate | 81.0% | 85.6% | 66.18% |
Gemini-2.0-flash-001 | Video | Mean of correlation on each video | – | – | 0.374 |
 |  | Correlation between RDMs | 0.938 | 0.818 | 0.558 |
 |  | Matching rate of GW alignment | 17.0% | 8.4% | 2.36% |
 |  | 10 hierarchical category matching rate | 69.0% | 71.6% | 50.5% |
Shuffled human ratings | Video | Mean of correlation on each video | – | – | \([0.321, 1.72]\times 10^{-2}\) |
 |  | Correlation between RDMs | \([0.0461, 0.505]\) | \([0.0154, 0.0179]\) | \([-0.509, 1.08]\times 10^{-3}\) |
 |  | Matching rate of GW alignment | \([4.90, 5.90]\%\) | \([2.28, 2.60]\%\) | \([1.01, 1.23]\%\) |
 |  | 10 hierarchical category matching rate | \([19.1, 25.5]\%\) | \([18.2, 21.1]\%\) | \([17.3, 19.4]\%\) |
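The shuffled-ratings rows report a 95% percentile interval of a null distribution. As a rough illustration of how such an interval could be obtained for the "Correlation between RDMs" metric, the sketch below shuffles the item order of one rating matrix, recomputes the RDM correlation each time, and takes the 2.5th and 97.5th percentiles. The function names, the use of 1 minus Pearson correlation as the dissimilarity, the row-shuffling scheme, and the number of shuffles are all assumptions made for this example; the actual procedure is the one described in Methods.

```python
import numpy as np

def rdm(ratings):
    """Dissimilarity matrix between items (rows); assumes 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(ratings)

def rdm_correlation(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return float(np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1])

def shuffled_null_interval(ratings_a, ratings_b, n_shuffles=1000, seed=0):
    """95% percentile interval of RDM correlations under shuffled item order.

    Hypothetical helper: shuffles the rows of ratings_a so that item
    identities no longer correspond to those in ratings_b.
    """
    rng = np.random.default_rng(seed)
    rdm_b = rdm(ratings_b)
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        shuffled = rng.permutation(ratings_a, axis=0)  # shuffled copy of the rows
        null[i] = rdm_correlation(rdm(shuffled), rdm_b)
    lo, hi = np.percentile(null, [2.5, 97.5])
    return float(lo), float(hi)
```

An observed RDM correlation that lies above the upper bound of this interval would be read as exceeding what shuffled ratings produce, which is how the human and model rows compare against the shuffled rows in the table.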