Table 4 Consistency rates of ChatGPT-4o and DeepSeek in English and Chinese versions across different groups (mean ± standard deviation).

	Group A	Group B	Group C	Group D	Group E	Total	p¹	ICC
ChatGPT-4o	95.56 ± 0.093 (96.67 ± 0.105)	100 ± 0 (100 ± 0)	100 ± 0 (96.67 ± 0.105)	100 ± 0 (100 ± 0)	100 ± 0 (100 ± 0)	99.11 ± 0.043 (98.67 ± 0.065)	0.086 (0.574)	0.982 (0.984)
DeepSeek	100 ± 0 (94.44 ± 0.141)	100 ± 0 (100 ± 0)	100 ± 0 (98.89 ± 0.035)	100 ± 0 (96.67 ± 0.105)	100 ± 0 (100 ± 0)	100 ± 0 (98 ± 0.079)	1.000 (0.437)	1.000 (0.978)
p²	0.146 (0.543)	1.000 (1.000)	1.000 (0.942)	1.000 (0.317)	1.000 (1.000)	1.000 (0.412)

Each cell presents the performance of the AI model on both the English and Chinese versions of the same question, shown as English value (Chinese value).
¹Kruskal-Wallis test. ²Mann–Whitney U test.
