Table 2 Sensitivity and specificity analysis of a subset of 20 benchmarks in ADeLe
From: General scales unlock AI evaluation with explanatory and predictive power
| Benchmark | Claims to measure (their terminology) | Claims to measure (our terminology) | Sensitive to… | Mean | s.d. | Accuracy |
|---|---|---|---|---|---|---|
| ChemLLMBench | • Generation of descriptions for molecules | CEc•, CEe˚, KNf•, KNn˚ | AS˚ | 3.2 | 1.0 | 27.2 |
| | • Generation of new molecules | | CEc• | 2.6 | 1.2 | |
| | • Chemical name understanding | | KNf• | 4.0 | 1.2 | |
| | • Chemical reaction products prediction | | MCr˚ | 3.1 | 1.1 | |
| | • Identification of target molecules | | QLq˚ | 2.2 | 1.5 | |
| | | | SNs˚ | 3.2 | 1.3 | |
| Civil Service Examination | • Logical reasoning | QLI˚ | KNa˚ | 2.1 | 1.8 | 73.5 |
| | | | KNs˚ | 2.0 | 1.8 | |
| Data Analysis | • Data analysis | CL˚, KNf˚, QLq˚ | KNc˚ | 2.2 | 1.0 | 69.7 |
| LSAT | • Analytical reasoning | CEc˚, CL˚, MC˚, QLI˚ | KNc˚ | 2.9 | 1.1 | 81.6 |
| | • Logical reasoning | | KNs˚ | 2.1 | 1.8 | |
| | • Reading comprehension | | | | | |
| MMLU-Pro | • Knowledge | KNa•, KNc˚, KNf˚, KNn˚, KNs˚, QL˚ | KNa• | 2.9 | 1.6 | 87.6 |
| | • Reasoning | | | | | |
| Math | • Mathematics | CL˚, KNf˚, QLI•, QLq˚ | AT˚ | 2.4 | 1.1 | 59.0 |
| | | | MCu˚ | 2.4 | 1.0 | |
| | | | QLI• | 2.7 | 1.2 | |
| MedCalcBench | • Medical calculation knowledge | KNf˚, KNn˚, QLq˚ | AS˚ | 2.0 | 1.2 | 88.0 |
| | • Patient attributes extraction | | | | | |
| | • Final results arithmetic | | | | | |
| MenatQA | • Event temporal reasoning | QL˚ | KNc˚ | 2.8 | 1.0 | 72.8 |
| OmniMath | • Mathematical reasoning at Olympiad level | CL•, KNf˚, QLI•, QLq˚ | AS˚ | 2.3 | 1.3 | 34.4 |
| | | | CL• | 3.1 | 1.2 | |
| | | | MCr˚ | 2.6 | 1.1 | |
| | | | QLI• | 3.4 | 1.0 | |
| Reasoning | • Spatial Reasoning | QLI˚, SN˚ | CL˚ | 2.6 | 1.1 | 48.2 |
| | • Logical Reasoning | | MCr˚ | 2.8 | 1.1 | |
| SAT | • Critical thinking | MC˚ | AT˚ | 2.2 | 1.1 | 98.3 |
| | • Problem-solving | | | | | |
| | • Analytical skills | | | | | |
| SciBench | • Scientific problem-solving | KNa•, KNf˚, KNn•, KNs˚, MC˚ | KNa• | 3.0 | 1.6 | 83.7 |
| | | | KNn• | 2.8 | 1.7 | |
| TempReason | • Event temporal reasoning | QL˚ | KNc˚ | 2.8 | 1.1 | 71.2 |
| TimeQA | • Event temporal reasoning | QL˚ | KNc˚ | 2.4 | 1.1 | 89.0 |