Table 2 Sensitivity and specificity analysis of a subset of 20 benchmarks in ADeLe

From: General scales unlock AI evaluation with explanatory and predictive power

Benchmark	Claims to measure (their terminology)	Claims to measure (our terminology)	Sensitive to…	Mean	s.d.	Accuracy
ChemLLMBench	• Generation of descriptions for molecules	CEc^•, CEe˚, KNf^•, KNn˚	AS˚	3.2	1.0	27.2
	• Generation of new molecules		CEc^•	2.6	1.2
	• Chemical name understanding		KNf^•	4.0	1.2
	• Chemical reaction products prediction		MCr˚	3.1	1.1
	• Identification of target molecules		QLq˚	2.2	1.5
			SNs˚	3.2	1.3
Civil Service Examination	• Logical reasoning	QLI˚	KNa˚	2.1	1.8	73.5
			KNs˚	2.0	1.8
Data Analysis	• Data analysis	CL˚, KNf˚, QLq˚	KNc˚	2.2	1.0	69.7
LSAT	• Analytical reasoning	CEc˚, CL˚, MC˚, QLI˚	KNc˚	2.9	1.1	81.6
	• Logical reasoning		KNs˚	2.1	1.8
	• Reading comprehension
MMLU-Pro	• Knowledge	KNa^•, KNc˚, KNf˚, KNn˚, KNs˚, QL˚	KNa^•	2.9	1.6	87.6
	• Reasoning
Math	• Mathematics	CL˚, KNf˚, QLI^•, QLq˚	AT˚	2.4	1.1	59.0
			MCu˚	2.4	1.0
			QLI^•	2.7	1.2
MedCalcBench	• Medical calculation knowledge	KNf˚, KNn˚, QLq˚	AS˚	2.0	1.2	88.0
	• Patient attributes extraction
	• Final results arithmetic
MenatQA	• Event temporal reasoning	QL˚	KNc˚	2.8	1.0	72.8
OmniMath	• Mathematical reasoning at Olympiad level	CL^•, KNf˚, QLI^•, QLq˚	AS˚	2.3	1.3	34.4
			CL^•	3.1	1.2
			MCr˚	2.6	1.1
			QLI^•	3.4	1.0
Reasoning	• Spatial Reasoning	QLI˚, SN˚	CL˚	2.6	1.1	48.2
	• Logical Reasoning		MCr˚	2.8	1.1
SAT	• Critical thinking	MC˚	AT˚	2.2	1.1	98.3
	• Problem-solving
	• Analytical skills
SciBench	• Scientific problem-solving	KNa^•, KNf˚, KNn^•, KNs˚, MC˚	KNa^•	3.0	1.6	83.7
			KNn^•	2.8	1.7
TempReason	• Event temporal reasoning	QL˚	KNc˚	2.8	1.1	71.2
TimeQA	• Event temporal reasoning	QL˚	KNc˚	2.4	1.1	89.0

Benchmarks are chosen such that at least one dimension satisfies our two sensitivity criteria (s.d. ≥ 1 and mean ≥ 2). The second column shows what the benchmarks claim to measure, using their own terminology (sources are described in Supplementary Table 28). The third column shows the dimensions that the benchmarks claim to measure, expressed in our terminology. The fourth column shows the dimensions that the benchmarks actually measure, that is, those to which they are sensitive. For every dimension to which a benchmark is sensitive, we provide the mean and s.d. of the levels for that dimension (fifth and sixth columns, respectively). The seventh (last) column shows the average accuracy of GPT-4o per benchmark as a reference. A superscript ∘ in the third column indicates a missed dimension (claimed but not measured: lack of sensitivity); in the fourth column it indicates an extra dimension (measured but not claimed: lack of specificity). A superscript • indicates a dimension that is both claimed and actually measured (high sensitivity). The other six benchmarks from the ADeLe battery, which do not meet the two sensitivity criteria above, are Date Arithmetic, GRE & GMAT, Language, MCTACO, TimeDial and TruthQuest (with GPT-4o accuracies of 98.9, 95.6, 72.4, 95.1, 98.8 and 43.0, respectively). Takeaway: no benchmark has both high sensitivity and high specificity, indicating a clear lack of construct validity.
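The two sensitivity criteria and the ∘/• marking described in the caption can be illustrated with a minimal Python sketch. The function names `is_sensitive` and `compare_dimensions` are hypothetical, the example level lists are made up for illustration, and the use of the population standard deviation is an assumption, since the caption does not say which estimator is used. The dimension codes for the Math benchmark are copied from the table as they appear there.

```python
from statistics import mean, pstdev

# Thresholds stated in the caption: mean >= 2 and s.d. >= 1.
MEAN_THRESHOLD = 2.0
SD_THRESHOLD = 1.0


def is_sensitive(levels):
    """Return True if a dimension's levels meet both criteria.

    Assumption: population s.d. (pstdev); the source does not specify
    whether the sample or population estimator is used.
    """
    return mean(levels) >= MEAN_THRESHOLD and pstdev(levels) >= SD_THRESHOLD


def compare_dimensions(claimed, sensitive):
    """Split dimensions into the three groups marked in the table."""
    claimed, sensitive = set(claimed), set(sensitive)
    return {
        "measured": sorted(claimed & sensitive),  # marked with a bullet
        "missed": sorted(claimed - sensitive),    # circle in third column
        "extra": sorted(sensitive - claimed),     # circle in fourth column
    }


# Illustrative (made-up) level lists for a single dimension.
varied_levels = [0, 2, 4, 2, 4, 0, 3, 5]
flat_levels = [3, 3, 3, 3]

# Claimed and sensitive dimensions for the Math row of the table.
math_row = compare_dimensions(
    claimed=["CL", "KNf", "QLI", "QLq"],
    sensitive=["AT", "MCu", "QLI"],
)

if __name__ == "__main__":
    print(is_sensitive(varied_levels), is_sensitive(flat_levels))
    print(math_row)
```

For the Math row, the sketch puts QLI in the claimed-and-measured group, CL, KNf and QLq in the missed group, and AT and MCu in the extra group, which matches the superscripts shown in the table.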
