Extended Data Table 2. In-distribution predictability results of 15 LLMs for the ADeLe battery using tenfold cross-validation, averaged across ten seeds

From: General scales unlock AI evaluation with explanatory and predictive power

  1. The first two columns give the name of each subject LLM and its overall accuracy on the ADeLe battery. The remaining three pairs of columns give the AUROC and ECE of three assessors: RF using demands, RF using average GloVe embeddings, and fine-tuned LLaMA-3.1-8B. For a single subject LLM, training takes 4 s for the demand-based assessor and 160 s for the embeddings-based assessor on an M3 Pro CPU, whereas the fine-tuned LLaMA assessor takes 300 h on a single V100 GPU. The weighted average is only indicative, for ease of comparison; it uses the normalized LLM accuracy as the weight in the mean, giving more relevance to more powerful models, which are more representative now and in the near future. Asterisks indicate a statistically significant difference (α = 0.05) between the demand-based assessor and the strongest baseline (fine-tuned LLaMA), using the Wilcoxon signed-rank test. Across subject LLMs, the RF assessor's s.d. over the ten seeds ranges between 0.0004 and 0.001 for AUROC and between 0.0006 and 0.002 for ECE. Given these low s.d. values, we omit confidence intervals (which are very narrow) for the sake of clarity.
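
To make two of the quantities in the caption above concrete, here is a minimal Python sketch of an ECE computation and of an accuracy-weighted mean. It rests on assumptions the caption does not confirm: ECE is computed with ten equal-width probability bins, and "normalized LLM accuracy" is taken to mean each accuracy divided by the sum of all accuracies. The function names `expected_calibration_error` and `accuracy_weighted_mean` are hypothetical, and the authors' implementation may differ.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed binning scheme, not confirmed by the caption).

    probs  : assessor's predicted probability that the subject LLM succeeds
    labels : 1 if the subject LLM actually succeeded on the instance, else 0
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and of equal length")

    # Group (probability, label) pairs into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))

    # Weighted mean of |observed success rate - mean predicted probability|.
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(success_rate - mean_conf)
    return ece


def accuracy_weighted_mean(
    scores: Sequence[float], accuracies: Sequence[float]
) -> float:
    """Mean of per-LLM scores weighted by accuracy / sum(accuracies) (assumed normalization)."""
    if len(scores) != len(accuracies) or len(scores) == 0:
        raise ValueError("scores and accuracies must be non-empty and of equal length")
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return sum((a / total) * s for s, a in zip(scores, accuracies))
```

For example, with illustrative values that are not taken from the table, `accuracy_weighted_mean([0.8, 0.9], [0.25, 0.75])` weights the second model's score three times as heavily as the first, which is the sense in which more accurate subject LLMs receive more relevance in the summary row.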