Table 2 Sensitivity and specificity analysis of a subset of the 20 benchmarks in ADeLe

From: General scales unlock AI evaluation with explanatory and predictive power

| Benchmark | Claims to measure (their terminology) | Claims to measure (our terminology) | Sensitive to… | Mean | s.d. | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| ChemLLMBench | • Generation of descriptions for molecules | CEc, CEe˚, KNf, KNn˚ | AS˚ | 3.2 | 1.0 | 27.2 |
| | • Generation of new molecules | | CEc | 2.6 | 1.2 | |
| | • Chemical name understanding | | KNf | 4.0 | 1.2 | |
| | • Chemical reaction products prediction | | MCr˚ | 3.1 | 1.1 | |
| | • Identification of target molecules | | QLq˚ | 2.2 | 1.5 | |
| | | | SNs˚ | 3.2 | 1.3 | |
| Civil Service Examination | • Logical reasoning | QLI˚ | KNa˚ | 2.1 | 1.8 | 73.5 |
| | | | KNs˚ | 2.0 | 1.8 | |
| Data Analysis | • Data analysis | CL˚, KNf˚, QLq˚ | KNc˚ | 2.2 | 1.0 | 69.7 |
| LSAT | • Analytical reasoning | CEc˚, CL˚, MC˚, QLI˚ | KNc˚ | 2.9 | 1.1 | 81.6 |
| | • Logical reasoning | | KNs˚ | 2.1 | 1.8 | |
| | • Reading comprehension | | | | | |
| MMLU-Pro | • Knowledge | KNa, KNc˚, KNf˚, KNn˚, KNs˚, QL˚ | KNa | 2.9 | 1.6 | 87.6 |
| | • Reasoning | | | | | |
| Math | • Mathematics | CL˚, KNf˚, QLI, QLq˚ | AT˚ | 2.4 | 1.1 | 59.0 |
| | | | MCu˚ | 2.4 | 1.0 | |
| | | | QLI | 2.7 | 1.2 | |
| MedCalcBench | • Medical calculation knowledge | KNf˚, KNn˚, QLq˚ | AS˚ | 2.0 | 1.2 | 88.0 |
| | • Patient attributes extraction | | | | | |
| | • Final results arithmetic | | | | | |
| MenatQA | • Event temporal reasoning | QL˚ | KNc˚ | 2.8 | 1.0 | 72.8 |
| OmniMath | • Mathematical reasoning at Olympiad level | CL, KNf˚, QLI, QLq˚ | AS˚ | 2.3 | 1.3 | 34.4 |
| | | | CL | 3.1 | 1.2 | |
| | | | MCr˚ | 2.6 | 1.1 | |
| | | | QLI | 3.4 | 1.0 | |
| Reasoning | • Spatial Reasoning | QLI˚, SN˚ | CL˚ | 2.6 | 1.1 | 48.2 |
| | • Logical Reasoning | | MCr˚ | 2.8 | 1.1 | |
| SAT | • Critical thinking | MC˚ | AT˚ | 2.2 | 1.1 | 98.3 |
| | • Problem-solving | | | | | |
| | • Analytical skills | | | | | |
| SciBench | • Scientific problem-solving | KNa, KNf˚, KNn, KNs˚, MC˚ | KNa | 3.0 | 1.6 | 83.7 |
| | | | KNn | 2.8 | 1.7 | |
| TempReason | • Event temporal reasoning | QL˚ | KNc˚ | 2.8 | 1.1 | 71.2 |
| TimeQA | • Event temporal reasoning | QL˚ | KNc˚ | 2.4 | 1.1 | 89.0 |

  1. Benchmarks are chosen such that at least one dimension satisfies our two sensitivity criteria (s.d. ≥ 1 and mean ≥ 2). The second column shows what each benchmark claims to measure in its own terminology (sources are described in Supplementary Table 28). The third column shows the dimensions that the benchmark claims to measure, expressed in our terminology. The fourth column shows the dimensions that the benchmark actually measures, that is, the dimensions to which it is sensitive. For every dimension to which the benchmark is sensitive, we provide the mean and s.d. of the levels for that dimension (fifth and sixth columns, respectively). The seventh (last) column shows the average accuracy of GPT-4o on each benchmark as a reference. A superscript ˚ in the third and fourth columns indicates a missed dimension (lack of sensitivity) or an extra dimension (lack of specificity), respectively. A superscript • indicates the dimensions that are claimed to be measured and are actually measured (high sensitivity). The other six benchmarks from the ADeLe battery, which do not meet the two aforementioned sensitivity criteria, are Date Arithmetic, GRE & GMAT, Language, MCTACO, TimeDial and TruthQuest (with GPT-4o accuracies of 98.9, 95.6, 72.4, 95.1, 98.8 and 43.0, respectively). Takeaway: no benchmark has both high sensitivity and high specificity, indicating a clear lack of construct validity.
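The comparison described in the note can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `is_sensitive` and `compare` are hypothetical, dimension codes are matched by exact string equality (an assumption; the table also uses group codes such as MC or QL), and the input is the ChemLLMBench row from the table, which lists only the dimensions that already pass both criteria.

```python
# Thresholds quoted in the table note: mean >= 2 and s.d. >= 1.
MIN_MEAN = 2.0
MIN_SD = 1.0


def is_sensitive(mean, sd):
    """Return True if a dimension passes both sensitivity criteria."""
    return mean >= MIN_MEAN and sd >= MIN_SD


def compare(claimed, level_stats):
    """Split dimensions into matched, missed and extra sets.

    claimed:     set of dimension codes the benchmark claims to measure
    level_stats: dict mapping dimension code -> (mean, s.d.) of levels
    Matching is by exact code (an assumption made for this sketch).
    """
    measured = {d for d, (m, s) in level_stats.items() if is_sensitive(m, s)}
    return {
        "matched": sorted(claimed & measured),  # claimed and measured
        "missed": sorted(claimed - measured),   # lack of sensitivity
        "extra": sorted(measured - claimed),    # lack of specificity
    }


# ChemLLMBench row, values copied from the table.
chem_claimed = {"CEc", "CEe", "KNf", "KNn"}
chem_stats = {
    "AS": (3.2, 1.0),
    "CEc": (2.6, 1.2),
    "KNf": (4.0, 1.2),
    "MCr": (3.1, 1.1),
    "QLq": (2.2, 1.5),
    "SNs": (3.2, 1.3),
}

result = compare(chem_claimed, chem_stats)
print(result)
```

For this row the sketch reports CEc and KNf as matched, CEe and KNn as missed, and AS, MCr, QLq and SNs as extra, which agrees with the ˚ marks on the ChemLLMBench entries in the third and fourth columns.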