Extended Data Table 1 IPIP-NEO reliability metrics per model for proprietary (closed-source) models

From: A psychometric framework for evaluating and shaping personality traits in large language models

  1. Consistent with human standards, we interpreted a given reliability metric RM (that is, α, λ6, ω) < 0.50 as unacceptable; 0.50 ≤ RM < 0.60 as poor; 0.60 ≤ RM < 0.70 as questionable; 0.70 ≤ RM < 0.80 as acceptable; 0.80 ≤ RM < 0.90 as good; and RM ≥ 0.90 as excellent. * RMs for these subscales were calculated after removing one item with zero variance, since reliability cannot be computed for items with zero variance. N = 10,000 test observations, with n = 1,250 observations per model. Each IPIP-NEO subscale comprises 60 items.
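
The interpretation bands in the note above can be written as a small lookup, shown here as a minimal sketch rather than the authors' code. The function names `interpret_reliability` and `cronbach_alpha` are hypothetical. The second function illustrates the zero-variance handling described in the note using the standard Cronbach's α formula only; λ6 and ω are not covered, and the choice of sample variance is an assumption.

```python
from statistics import variance


def interpret_reliability(rm: float) -> str:
    """Map a reliability metric (alpha, lambda-6 or omega) to the label used in the note."""
    if rm < 0.50:
        return "unacceptable"
    if rm < 0.60:
        return "poor"
    if rm < 0.70:
        return "questionable"
    if rm < 0.80:
        return "acceptable"
    if rm < 0.90:
        return "good"
    return "excellent"


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha for a subscale (hypothetical helper).

    rows holds one list of item scores per observation. Items whose
    variance is zero are dropped before the computation, mirroring the
    adjustment described in the table note.
    """
    columns = list(zip(*rows))  # one tuple of scores per item
    kept = [col for col in columns if variance(col) > 0]
    k = len(kept)
    if k < 2:
        raise ValueError("need at least two items with non-zero variance")
    item_variance_sum = sum(variance(col) for col in kept)
    totals = [sum(scores) for scores in zip(*kept)]  # per-observation sum score
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("sum scores have zero variance")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Toy data: the third item is constant and is therefore dropped.
    toy = [[1, 2, 3], [2, 3, 3], [3, 4, 3], [4, 5, 3]]
    alpha = cronbach_alpha(toy)
    print(round(alpha, 3), interpret_reliability(alpha))
```

The half-open bands (lower bound inclusive, upper bound exclusive) follow the inequalities in the note, so a value of exactly 0.70 is labelled acceptable and a value of exactly 0.90 is labelled excellent.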