Table 1 Results summary across experiments, parameters and tested models

From: A psychometric framework for evaluating and shaping personality traits in large language models

  

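Column groups: Construct validity (Convrg., Discr., Criter.); Shaping (Single, Multi., Dwnstr.)

Model | Variant | Reliability | Convrg. | Discr. | Criter. | Single | Multi. | Dwnstr. | Overall
PaLM 62B | Base | − − | 0.05 | −0.24 | − − | NT | NT | NT | − −
Flan-PaLM 8B | IT | + | 0.69 | 0.23 | + | NT
Flan-PaLM 62B | IT | + | 0.87 | 0.41 | + | + | + | NT | +
Flan-PaLM 540B | IT | + + | 0.90 | 0.51 | + | + + | + + | + + | + +
Flan-PaLMChilla 62B | CO, IT | +ᵃ | 0.87 | 0.48 | + + | + | + | NT | +
Llama 2 7B | Base | − − | −0.01 | −0.03 | − − | NT | NT | NT | − −
Llama 2 13B | Base | − − | −0.01 | −0.05 | − − | NT | NT | NT | − −
Llama 2 70B | Base | − − | 0.00 | −0.02 | − − | NT | NT | NT | − −
Llama 2-Chat 7B | IT | + | 0.59 | 0.15 | NT
Llama 2-Chat 13B | IT | + + | 0.82 | 0.29 | + + | + | NT | +
Llama 2-Chat 70B | IT | + + | 0.82 | 0.39 | + + | + | + | + + | +
Mistral 7B v0.1 | Base | − − | 0.03 | −0.01 | − − | NT | NT | NT | − −
Mistral 7B Instruct v0.1 | IT | 0.28 | 0.09 | + | − − | − − | NT | − −
Mixtral 8x7B v0.1 | MoE, Base | − − | 0.04 | 0.01 | − − | NT | NT | NT | − −
Mixtral 8x7B Instruct v0.1 | MoE, IT | + + | 0.80 | 0.40 | + + | + | + + | +
GPT-3.5 Turbo | IT | + + | 0.84 | 0.28 | + + | NT
GPT-4o mini | MM, IT | + + | 0.81 | 0.38 | + + | + | + | NT | +
GPT-4o | MM, IT | + + | 0.90 | 0.48 | + + | + + | + + | + + | + +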

Prompt set parameters | Reliability, Convrg., Discr., Criter. | Single | Multi. | Dwnstr.
Personality profiles | 0 | 45 | 32 | 45
Biographic descriptions | 50 | 50 | 50 | 50
Item instructions | 5 | 1 | 1 | 0
Items | 419 | 300 | 300 | 0
Item postambles | 5 | 1 | 1 | 0
Simulated response profiles | 1,250 | 2,250 | 1,600 | 2,250
Responses per model | 523,750 | 675,000 | 480,000 | 56,250
Section/Supplementary Note | ‘Reliability results’/6.3; ‘Convergent and discriminant validity results’; ‘Criterion validity results’/5 | ‘Shaping results’/8.1 | ‘Shaping results’/8.2 | ‘Real-world task results’/10
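
The prompt set counts in Table 1 appear to be internally consistent, and a short check makes the arithmetic explicit. The sketch below is only an illustration: the dictionary keys are made-up names, and it assumes that a count of 0 means the component was not used in that experiment (so it is skipped rather than multiplied in). The table itself does not state this rule.

```python
from math import prod

# Counts copied from the "Prompt set parameters" rows of Table 1.
# Keys are illustrative names chosen here, not identifiers from the paper.
PROMPT_SETS = {
    "reliability_validity": {"profiles": 0, "bios": 50, "instructions": 5,
                             "items": 419, "postambles": 5,
                             "simulated": 1_250, "responses": 523_750},
    "single_trait": {"profiles": 45, "bios": 50, "instructions": 1,
                     "items": 300, "postambles": 1,
                     "simulated": 2_250, "responses": 675_000},
    "multi_trait": {"profiles": 32, "bios": 50, "instructions": 1,
                    "items": 300, "postambles": 1,
                    "simulated": 1_600, "responses": 480_000},
    "downstream": {"profiles": 45, "bios": 50, "instructions": 0,
                   "items": 0, "postambles": 0,
                   "simulated": 2_250, "responses": 56_250},
}


def simulated_profiles(p):
    """Multiply the prompt components that are in use.

    Assumption: a count of 0 marks an unused component, so it is skipped.
    """
    factors = (p["profiles"], p["bios"], p["instructions"], p["postambles"])
    return prod(f for f in factors if f > 0)


for name, p in PROMPT_SETS.items():
    # The product of the used components should match the reported profile count.
    assert simulated_profiles(p) == p["simulated"], name
    per_profile = p["responses"] // p["simulated"]
    print(f"{name}: {p['simulated']:,} profiles x {per_profile} "
          f"responses each = {p['responses']:,}")
```

For the three item-based prompt sets, the responses per simulated profile equal the item count (419, 300 and 300). The downstream task has no items and works out to 25 responses per profile, a factor the table does not itemize.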

 
Convergent validity (Convrg.) is summarized by the average convergent correlation between IPIP-NEO and BFI domain scores (Extended Data Fig. 2 and Extended Data Table 3). Discriminant validity (Discr.) is summarized by the average absolute difference between an IPIP-NEO domain’s convergent correlation and each of its respective discriminant correlations (see Extended Data Table 3). Criterion validity (Criter.) is summarized from Fig. 3; single-trait shaping performance (Single) from Extended Data Table 6; multiple-trait shaping performance (Multi.) from Extended Data Fig. 3 and Extended Data Table 7; and shaping performance in a downstream text-generation task (Dwnstr.) from Fig. 5.

Results are reported over LLM variants: base, instruction-tuned (IT), compute-optimally trained (CO), mixture of experts (MoE) and multi-modal (MM). Overall performance per model is summarized across all experiments.

Ratings: − − unacceptable; − poor to neutral; + neutral to good; + + excellent.

ᵃTwo items with no variance were removed to compute reliability metrics. Some models were not tested (NT) across shaping experiments. We conducted independent and concurrent personality-shaping experiments on models where personality test data were sufficiently reliable. Personality shaping in a downstream task was tested on the most capable model per family.
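
The two summary statistics behind the Convrg. and Discr. columns can be sketched in a few lines. This is a minimal illustration of the definitions in the note above, not the authors' code. It assumes a square cross-correlation matrix whose rows are IPIP-NEO domains and whose columns are BFI domains, with convergent correlations on the diagonal and a domain's discriminant correlations taken to be the off-diagonal entries in its row. The function name and the 3 × 3 matrix are made up for the example.

```python
from statistics import fmean


def validity_summaries(r):
    """Return (convergent, discriminant) summaries for a square matrix r.

    Assumption: r[i][j] is the correlation between IPIP-NEO domain i and
    BFI domain j, so r[i][i] is the convergent correlation for domain i.
    """
    n = len(r)
    convergent = [r[i][i] for i in range(n)]
    # Per domain: mean absolute gap between the convergent correlation
    # and each off-diagonal (discriminant) correlation in the same row.
    per_domain_gap = [
        fmean(abs(r[i][i] - r[i][j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return fmean(convergent), fmean(per_domain_gap)


# Toy values for illustration only; they are not results from the paper.
toy = [
    [0.8, 0.2, 0.1],
    [0.3, 0.9, 0.1],
    [0.0, 0.2, 0.7],
]
convrg, discr = validity_summaries(toy)
print(f"Convrg. = {convrg:.2f}, Discr. = {discr:.2f}")
```

Under this reading, a higher Discr. value means a domain's convergent correlation stands further apart from its correlations with unrelated domains.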