Table 1 Results summary across experiments, parameters and tested models

		Construct validity				Shaping
Model	Variant	Reliability	Convrg. ↑	Discr. ↑	Criter.	Single	Multi.	Dwnstr.	Overall
PaLM 62B	Base	− −	0.05	−0.24	− −	NT	NT	NT	− −
Flan-PaLM
8B	IT	+	0.69	0.23	−	+	−	NT	−
62B	IT	+	0.87	0.41	+	+	+	NT	+
540B	IT	+ +	0.90	0.51	+	+ +	+ +	+ +	+ +
Flan-PaLMChilla 62B	CO, IT	+^a	0.87	0.48	+ +	+	+	NT	+
Llama 2
7B	Base	− −	−0.01	−0.03	− −	NT	NT	NT	− −
13B	Base	− −	−0.01	−0.05	− −	NT	NT	NT	− −
70B	Base	− −	0.00	−0.02	− −	NT	NT	NT	− −
Llama 2-Chat
7B	IT	+	0.59	0.15	−	−	−	NT	−
13B	IT	+ +	0.82	0.29	+ +	−	+	NT	+
70B	IT	+ +	0.82	0.39	+ +	+	+	+ +	+
Mistral 7B
v0.1	Base	− −	0.03	−0.01	− −	NT	NT	NT	− −
Instruct v0.1	IT	−	0.28	0.09	+	− −	− −	NT	− −
Mixtral 8x7B
v0.1	MoE, Base	− −	0.04	0.01	− −	NT	NT	NT	− −
Instruct v0.1	MoE, IT	+ +	0.80	0.40	+ +	−	+	+ +	+
GPT-
3.5 Turbo	IT	+ +	0.84	0.28	+ +	−	−	NT	−
4o mini	MM, IT	+ +	0.81	0.38	+ +	+	+	NT	+
4o	MM, IT	+ +	0.90	0.48	+ +	+ +	+ +	+ +	+ +
Prompt set parameters
Personality profiles		0				45	32	45
Biographic descriptions		50				50	50	50
Item instructions		5				1	1	0
Items		419				300	300	0
Item postambles		5				1	1	0
Simulated response profiles		1,250				2,250	1,600	2,250
Responses per model		523,750				675,000	480,000	56,250
Section/Supplementary Note		‘Reliability results’/6.3	‘Convergent and discriminant validity results’		‘Criterion validity results’/5	‘Shaping results’/8.1	‘Shaping results’/8.2	‘Real-world task results’/10
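
The prompt-set rows above combine multiplicatively, which the following minimal Python sketch checks. It assumes that an entry of 0 means the component was not used in that experiment; this is a reading of the table, not something the table states. The dictionary keys and function names are illustrative.

```python
from math import prod

# Prompt-set parameters copied from Table 1.
# Assumption: an entry of 0 means the component was not used in that experiment.
PROMPT_SETS = {
    "construct_validity": dict(personality_profiles=0, biographic_descriptions=50,
                               item_instructions=5, items=419, item_postambles=5),
    "single_trait_shaping": dict(personality_profiles=45, biographic_descriptions=50,
                                 item_instructions=1, items=300, item_postambles=1),
    "multi_trait_shaping": dict(personality_profiles=32, biographic_descriptions=50,
                                item_instructions=1, items=300, item_postambles=1),
    "downstream_task": dict(personality_profiles=45, biographic_descriptions=50,
                            item_instructions=0, items=0, item_postambles=0),
}

# Components that define one simulated response profile (hypothetical grouping).
PROFILE_FACTORS = ("personality_profiles", "biographic_descriptions",
                   "item_instructions", "item_postambles")


def simulated_response_profiles(params):
    """Multiply the prompt components that were actually used (non-zero)."""
    return prod(params[name] for name in PROFILE_FACTORS if params[name] > 0)


def item_responses(params):
    """Responses per model for item-based experiments: profiles x items."""
    return simulated_response_profiles(params) * params["items"]


for name, params in PROMPT_SETS.items():
    print(name, simulated_response_profiles(params), item_responses(params))
```

For the three item-based experiments the products match the 'Simulated response profiles' and 'Responses per model' rows. The downstream task uses no items, so its 56,250 responses are not a profiles-by-items product; dividing by its 2,250 profiles gives 25 responses per profile, a figure inferred here from the arithmetic rather than stated in the table.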

Convergent validity (Convrg.) is summarized by the average convergent correlation between IPIP-NEO and BFI domain scores (Extended Data Fig. 2 and Extended Data Table 3). Discriminant validity (Discr.) is summarized by the average absolute difference between each IPIP-NEO domain’s convergent correlation and all of its respective discriminant correlations (see Extended Data Table 3). Criterion validity (Criter.) is summarized from Fig. 3; single-trait shaping performance (Single) from Extended Data Table 6; multiple-trait shaping performance (Multi.) from Extended Data Fig. 3 and Extended Data Table 7; and shaping performance in the downstream text-generation task (Dwnstr.) from Fig. 5.
Results are reported over LLM variants: base, instruction-tuned (IT), compute-optimally trained (CO), mixture of experts (MoE) and multi-modal (MM). Overall performance per model is summarized across all experiments. Ratings: − − unacceptable; − poor to neutral; + neutral to good; + + excellent.
Some models were not tested (NT) in the shaping experiments. We conducted independent and concurrent personality-shaping experiments only on models whose personality test data were sufficiently reliable. Personality shaping in a downstream task was tested on the most capable model per family.
^aTwo items with no variance were removed to compute reliability metrics.
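
The two numeric summaries defined in the note (Convrg. and Discr.) can be sketched in a few lines of Python. This is one reading of the definitions, not the authors' code: it assumes a square matrix with IPIP-NEO domains as rows and BFI domains as columns, where the diagonal holds the convergent correlations and the off-diagonal entries of a row are that domain's discriminant correlations. The matrix values are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def summarize_validity(corr):
    """Return (average convergent correlation, average discriminant gap).

    corr[i][j] is the correlation between IPIP-NEO domain i and BFI domain j.
    Assumption: diagonal entries are convergent correlations and the other
    entries in row i are the discriminant correlations for domain i.
    """
    n = len(corr)
    convergent = [corr[i][i] for i in range(n)]
    per_domain_gap = [
        mean([abs(corr[i][i] - corr[i][j]) for j in range(n) if j != i])
        for i in range(n)
    ]
    return mean(convergent), mean(per_domain_gap)


# Toy 5 x 5 matrix (illustrative values only): 0.8 on the diagonal, 0.3 elsewhere.
toy = [[0.8 if i == j else 0.3 for j in range(5)] for i in range(5)]

avg_convergent, avg_gap = summarize_validity(toy)
print(round(avg_convergent, 2), round(avg_gap, 2))
```

Under this reading, a model scores well when its diagonal correlations are high (Convrg.) and sit well above the off-diagonal ones (Discr.), which matches the pattern of the instruction-tuned rows in the table.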
