Fig. 5: Ability of psychometric tests to accurately predict downstream LLM behaviour.
From: A psychometric framework for evaluating and shaping personality traits in large language models

The ability of LLM psychometric tests to accurately predict synthetic personality levels in a downstream text generation task (writing social media status updates), compared with human baselines reported in previous work (ref. 48) and quantified as Pearson’s correlations. On average, LLM IPIP-NEO scores outperformed human IPIP-NEO scores in predicting text-based levels of personality, indicating that LLM personality test responses accurately capture latent LLM personality levels manifested in downstream behaviour. N = 9,000 total LLM observations; n = 2,250 per model. All LLM correlations are statistically significant at P < 0.0001 (two-sided P values computed using Student’s t-distribution).
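
As an illustration of the statistic named in the caption, the following minimal sketch computes Pearson's r between two score vectors and converts it to a t statistic with n - 2 degrees of freedom. It is not the authors' analysis code: the function names are hypothetical, and the P value uses a normal approximation to Student's t-distribution, which is an assumption that is reasonable only for large samples such as n = 2,250.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


def two_sided_p_normal_approx(t):
    """Two-sided P value using a standard normal in place of Student's t.

    Assumption: the degrees of freedom are large enough that the two
    distributions are close; this is not an exact t-based P value.
    """
    return math.erfc(abs(t) / math.sqrt(2))
```

For example, `two_sided_p_normal_approx(t_statistic(r, 2250))` gives an approximate two-sided P value for a per-model correlation `r`. An exact value would instead use the Student's t cumulative distribution function from a statistics library.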