Table 2 Discrepancy analyses between ChatGPT model predictions and human judgments for current stimuli.

From: ChatGPT does not replicate human moral judgments: the importance of examining metrics beyond correlation to assess agreement

Variable	N	Mean	SD	95% CI	t	p (uncorrected)	p (BH-FDR)	p (Bonferroni)	Cohen’s d
davinci-human simp. diff	60	0.20	1.04	[−0.07, 0.47]	1.49	0.141	0.141	> 0.99	0.19
moral scenarios	20	0.80	0.49	[0.57, 1.03]	7.28	< 0.001	< 0.001	< 0.001	1.63
neutral scenarios	20	0.81	0.77	[0.44, 1.17]	4.65	< 0.001	< 0.001	0.004	1.04
immoral scenarios	20	−1.00	0.51	[−1.24, −0.76]	−8.76	< 0.001	< 0.001	< 0.001	−1.96
davinci-human abs. diff	60	0.91	0.54	[0.77, 1.05]	13.06	< 0.001	< 0.001	< 0.001	1.69
moral scenarios	20	0.81	0.47	[0.59, 1.03]	7.68	< 0.001	< 0.001	< 0.001	1.72
neutral scenarios	20	0.91	0.64	[0.61, 1.21]	6.34	< 0.001	< 0.001	< 0.001	1.42
immoral scenarios	20	1.01	0.50	[0.77, 1.24]	9.08	< 0.001	< 0.001	< 0.001	2.03
davinci-human sq. diff	60	1.11	1.09	[0.83, 1.39]	7.89	< 0.001	< 0.001	< 0.001	1.02
moral scenarios	20	0.86	0.65	[0.56, 1.17]	5.89	< 0.001	< 0.001	< 0.001	1.32
neutral scenarios	20	1.22	1.39	[0.57, 1.87]	3.93	< 0.001	0.001	0.022	0.88
immoral scenarios	20	1.25	1.11	[0.73, 1.77]	5.02	< 0.001	< 0.001	0.002	1.12
gpt-4o-human simp. diff	60	0.32	0.79	[0.12, 0.52]	3.15	0.003	0.003	0.062	0.41
moral scenarios	20	0.58	0.27	[0.45, 0.71]	9.48	< 0.001	< 0.001	< 0.001	2.12
neutral scenarios	20	0.96	0.56	[0.70, 1.22]	7.65	< 0.001	< 0.001	< 0.001	1.71
immoral scenarios	20	−0.58	0.43	[−0.78, −0.38]	−6.03	< 0.001	< 0.001	< 0.001	−1.35
gpt-4o-human abs. diff	60	0.72	0.44	[0.60, 0.83]	12.56	< 0.001	< 0.001	< 0.001	1.62
moral scenarios	20	0.58	0.27	[0.45, 0.71]	9.48	< 0.001	< 0.001	< 0.001	2.12
neutral scenarios	20	1.00	0.49	[0.77, 1.22]	9.18	< 0.001	< 0.001	< 0.001	2.05
immoral scenarios	20	0.58	0.42	[0.38, 0.78]	6.15	< 0.001	< 0.001	< 0.001	1.38
gpt-4o-human sq. diff	60	0.71	0.77	[0.51, 0.91]	7.16	< 0.001	< 0.001	< 0.001	0.92
moral scenarios	20	0.41	0.32	[0.26, 0.56]	5.62	< 0.001	< 0.001	< 0.001	1.26
neutral scenarios	20	1.22	0.99	[0.75, 1.68]	5.52	< 0.001	< 0.001	< 0.001	1.23
immoral scenarios	20	0.51	0.59	[0.23, 0.79]	3.85	0.001	0.001	0.026	0.86

Rows with N = 60 (bolded in the original table) represent the deviations collapsed across the three scenario subgroups (moral, neutral, immoral).
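
The columns above appear consistent with one-sample tests of each difference metric against zero: for most rows, Cohen’s d matches Mean / SD and t is close to Mean / (SD / √N), up to rounding. The following is a minimal sketch of how such summaries could be computed, not the authors' analysis code. The function names, the direction of the difference (model minus human), and the example ratings are assumptions made for illustration only.

```python
import math
from statistics import mean, stdev


def discrepancy_stats(model, human):
    """Summarize simple, absolute, and squared model-human differences.

    Assumes the difference is taken as model minus human and that each
    metric is tested against zero with a one-sample t statistic.
    """
    simple = [m - h for m, h in zip(model, human)]
    metrics = {
        "simp": simple,
        "abs": [abs(x) for x in simple],
        "sq": [x * x for x in simple],
    }
    summary = {}
    for name, values in metrics.items():
        n = len(values)
        avg = mean(values)
        sd = stdev(values)  # sample SD (n - 1 denominator)
        summary[name] = {
            "n": n,
            "mean": avg,
            "sd": sd,
            "t": avg / (sd / math.sqrt(n)),  # one-sample t against 0
            "d": avg / sd,                   # Cohen's d for a one-sample test
        }
    return summary


def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical ratings, for illustration only (not data from the study).
example = discrepancy_stats(model=[2.0, 4.0, 6.0], human=[1.0, 2.0, 3.0])
```

Because the absolute and squared metrics discard the sign of each difference, they stay large when positive and negative deviations cancel in the simple difference, which is the pattern visible in the davinci-human rows collapsed across subgroups.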
