Table 2 Discrepancy analyses between ChatGPT model predictions and human judgments for current stimuli.

From: ChatGPT does not replicate human moral judgments: the importance of examining metrics beyond correlation to assess agreement

 

| Variable | N | Mean | SD | 95% CI bounds | t | p (Uncorr) | p (BH-FDR) | p (Bonf) | Cohen’s d |
|---|---|---|---|---|---|---|---|---|---|
| **davinci-human simp. diff** | 60 | 0.20 | 1.04 | [−0.07, 0.47] | 1.49 | 0.141 | 0.141 | > 0.99 | 0.19 |
| moral scenarios | 20 | 0.80 | 0.49 | [0.57, 1.03] | 7.28 | < 0.001 | < 0.001 | < 0.001 | 1.63 |
| neutral scenarios | 20 | 0.81 | 0.77 | [0.44, 1.17] | 4.65 | < 0.001 | < 0.001 | 0.004 | 1.04 |
| immoral scenarios | 20 | −1.00 | 0.51 | [−1.24, −0.76] | −8.76 | < 0.001 | < 0.001 | < 0.001 | −1.96 |
| **davinci-human abs. diff** | 60 | 0.91 | 0.54 | [0.77, 1.05] | 13.06 | < 0.001 | < 0.001 | < 0.001 | 1.69 |
| moral scenarios | 20 | 0.81 | 0.47 | [0.59, 1.03] | 7.68 | < 0.001 | < 0.001 | < 0.001 | 1.72 |
| neutral scenarios | 20 | 0.91 | 0.64 | [0.61, 1.21] | 6.34 | < 0.001 | < 0.001 | < 0.001 | 1.42 |
| immoral scenarios | 20 | 1.01 | 0.50 | [0.77, 1.24] | 9.08 | < 0.001 | < 0.001 | < 0.001 | 2.03 |
| **davinci-human sq. diff** | 60 | 1.11 | 1.09 | [0.83, 1.39] | 7.89 | < 0.001 | < 0.001 | < 0.001 | 1.02 |
| moral scenarios | 20 | 0.86 | 0.65 | [0.56, 1.17] | 5.89 | < 0.001 | < 0.001 | < 0.001 | 1.32 |
| neutral scenarios | 20 | 1.22 | 1.39 | [0.57, 1.87] | 3.93 | < 0.001 | 0.001 | 0.022 | 0.88 |
| immoral scenarios | 20 | 1.25 | 1.11 | [0.73, 1.77] | 5.02 | < 0.001 | < 0.001 | 0.002 | 1.12 |
| **gpt-4o-human simp. diff** | 60 | 0.32 | 0.79 | [0.12, 0.52] | 3.15 | 0.003 | 0.003 | 0.062 | 0.41 |
| moral scenarios | 20 | 0.58 | 0.27 | [0.45, 0.71] | 9.48 | < 0.001 | < 0.001 | < 0.001 | 2.12 |
| neutral scenarios | 20 | 0.96 | 0.56 | [0.70, 1.22] | 7.65 | < 0.001 | < 0.001 | < 0.001 | 1.71 |
| immoral scenarios | 20 | −0.58 | 0.43 | [−0.78, −0.38] | −6.03 | < 0.001 | < 0.001 | < 0.001 | −1.35 |
| **gpt-4o-human abs. diff** | 60 | 0.72 | 0.44 | [0.60, 0.83] | 12.56 | < 0.001 | < 0.001 | < 0.001 | 1.62 |
| moral scenarios | 20 | 0.58 | 0.27 | [0.45, 0.71] | 9.48 | < 0.001 | < 0.001 | < 0.001 | 2.12 |
| neutral scenarios | 20 | 1.00 | 0.49 | [0.77, 1.22] | 9.18 | < 0.001 | < 0.001 | < 0.001 | 2.05 |
| immoral scenarios | 20 | 0.58 | 0.42 | [0.38, 0.78] | 6.15 | < 0.001 | < 0.001 | < 0.001 | 1.38 |
| **gpt-4o-human sq. diff** | 60 | 0.71 | 0.77 | [0.51, 0.91] | 7.16 | < 0.001 | < 0.001 | < 0.001 | 0.92 |
| moral scenarios | 20 | 0.41 | 0.32 | [0.26, 0.56] | 5.62 | < 0.001 | < 0.001 | < 0.001 | 1.26 |
| neutral scenarios | 20 | 1.22 | 0.99 | [0.75, 1.68] | 5.52 | < 0.001 | < 0.001 | < 0.001 | 1.23 |
| immoral scenarios | 20 | 0.51 | 0.59 | [0.23, 0.79] | 3.85 | 0.001 | 0.001 | 0.026 | 0.86 |

  1. Bolded rows show the deviations collapsed across the three scenario subgroups (moral, neutral, immoral).
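
The t and Cohen’s d columns can be related to the Mean, SD, and N columns with a short calculation. The sketch below assumes each row is a one-sample t-test of the mean difference against zero; the table does not state this, but the reported values are consistent with it. The function name `one_sample_summary` is hypothetical and introduced only for illustration.

```python
import math


def one_sample_summary(mean, sd, n):
    """Return (t, d) for a one-sample test against zero from summary statistics.

    Assumption: the test compares the mean difference to zero,
    so t = mean / (sd / sqrt(n)) and Cohen's d = mean / sd.
    """
    standard_error = sd / math.sqrt(n)
    t = mean / standard_error
    d = mean / sd
    return t, d


# Values from the "davinci-human simp. diff" row (N = 60, Mean = 0.20, SD = 1.04).
t, d = one_sample_summary(mean=0.20, sd=1.04, n=60)
print(round(t, 2), round(d, 2))  # 1.49 0.19, matching the row's t and d
```

Because the table reports Mean and SD rounded to two decimals, recomputing other rows this way can differ from the printed t or d in the last digit.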