Extended Data Fig. 1: Distributions of (a) IPIP-NEO and (b) BFI personality domain scores across models.
From: A psychometric framework for evaluating and shaping personality traits in large language models

Box plots show each model's median score with its interquartile range and outlier values. As models increased in size (for example, Flan-PaLM from 8B to 540B parameters), (a) IPIP-NEO scores were more stable than (b) BFI scores, for which scores on socially desirable traits increased while neuroticism (NEU) scores decreased. n = 1,250 observations per model, per test; N = 22,500 total observations per test.
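
For readers who want to reproduce the summary statistics behind a box plot of this kind, the following is a minimal sketch, not the authors' code. It assumes the common 1.5 × IQR rule for flagging outliers, which the caption does not specify. The function name `box_plot_summary` and the example scores are hypothetical and are not data from the paper.

```python
import statistics


def box_plot_summary(scores, whisker=1.5):
    """Return the median, quartiles, IQR, and outliers for one set of scores.

    Assumption: outliers are points beyond `whisker` * IQR from the quartiles
    (the conventional box-plot rule); the caption does not state the rule used.
    """
    # Quartiles by linear interpolation, treating the data as the full sample.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [s for s in scores if s < lower or s > upper]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical domain scores on a 1-5 scale (illustrative only).
example_scores = [3.0, 3.2, 3.4, 3.5, 3.6, 3.8, 4.0, 1.0]
summary = box_plot_summary(example_scores)
print(summary)
```

In practice this would be applied once per model and per personality domain, so that each box in panels (a) and (b) corresponds to one call on that model's 1,250 observations.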