Extended Data Fig. 5: metabench and CogBench results.
From: A foundation model to predict and capture human cognition

a, Results for metabench (ref. 34), a sparse benchmark that distills several canonical benchmarks from the machine learning literature. We find that Centaur matches Llama's level of performance, indicating that finetuning on human behavior did not degrade performance on other tasks (ARC: z = −0.126, p = 0.9; GSM8K: z = −0.529, p = 0.597; HellaSwag: z = 0.0, p = 1.0; MMLU: z = 0.0, p = 1.0; Winogrande: z = −0.556, p = 0.578). Performance on TruthfulQA (ref. 71), which measures how strongly models reproduce common human falsehoods, even improved significantly with finetuning (z = 2.312, p = 0.021; all z-tests were two-sided).

b, Performance-based metrics from CogBench (ref. 33), a benchmark comprising ten behavioral metrics derived from seven cognitive psychology experiments. We find that, relative to Llama, Centaur's performance improves numerically in every experiment, significantly so in three of the six shown (Probabilistic reasoning: z = 6.371, p ≤ 0.0001; Horizon task: z = 22.176, p ≤ 0.0001; Restless bandit: z = 7.317, p ≤ 0.0001; Instrumental learning: z = 0.126, p = 0.45; Two-step task: z = 1.458, p = 0.072; Balloon analog risk task: z = 1.496, p = 0.067; all z-tests were one-sided).

c, Behavioral metrics from CogBench. We observe that Centaur moves numerically closer to human subjects on all ten behavioral metrics, with significant shifts on five of them (Prior weighting: z = 2.176, p = 0.015; Likelihood weighting: z = 1.131, p = 0.129; Directed exploration: z = 0.525, p = 0.3; Random exploration: z = 2.014, p = 0.022; Meta-cognition: z = 2.206, p = 0.014; Learning rate: z = 0.477, p = 0.317; Optimism bias: z = 0.78, p = 0.218; Model-basedness: z = 9.608, p ≤ 0.0001; Temporal discounting: z = 2.594, p = 0.005; Risk taking: z = 1.612, p = 0.053; all z-tests were one-sided).
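For readers who want a concrete picture of the tests reported above, the following Python sketch shows how a two-sided z-test (as in panel a) or a one-sided z-test (as in panels b and c) on the difference between two models' mean scores could be computed. The legend does not specify the exact estimator used in the paper, so this is a minimal sketch under the assumption that the z statistic is a difference of means divided by its pooled standard error; the function name z_test, the variables centaur_scores and llama_scores, and the placeholder data are hypothetical.

```python
# Sketch of a large-sample z-test on the difference of two means,
# covering the two-sided (panel a) and one-sided (panels b, c) cases.
import numpy as np
from scipy import stats

def z_test(scores_a, scores_b, alternative="two-sided"):
    """z-test on mean(scores_a) - mean(scores_b).

    alternative: 'two-sided' tests for any difference;
                 'greater' tests scores_a > scores_b (one-sided).
    For benchmark accuracies, 0/1 per-item scores make this a
    (non-pooled) two-proportion z-test as a special case.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    # Standard error of the difference of independent sample means.
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    if alternative == "two-sided":
        p = 2 * stats.norm.sf(abs(z))
    else:  # 'greater'
        p = stats.norm.sf(z)
    return z, p

# Hypothetical usage with placeholder data, comparing Centaur
# against Llama on a single performance metric.
rng = np.random.default_rng(0)
centaur_scores = rng.normal(0.62, 0.1, size=200)
llama_scores = rng.normal(0.60, 0.1, size=200)
z, p = z_test(centaur_scores, llama_scores, alternative="greater")
print(f"z = {z:.3f}, p = {p:.4f}")
```

The one-sided alternative in panels b and c matches the directional hypothesis that finetuning improves (rather than merely changes) performance and human-alignment, whereas panel a uses two-sided tests because deterioration was a live possibility.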