Table 2 Accuracy rates of ChatGPT-4o and DeepSeek on the English and Chinese versions across different groups.

	ChatGPT-4o			DeepSeek			p	Cohen’s d
	Value (%)	95% CI (Wald binomial)	Standard deviation	Value (%)	95% CI (Wald binomial)	Standard deviation	p	Cohen’s d
Group A	82.2^a (83.3^a)	72.7–89.5 (74–90.4)	0.038 (0.037)	50.0^a (55.6^a)	39.3–60.7 (44.7–66)	0.050 (0.050)	< 0.001* (< 0.001*)	0.720 (0.629)
Group B	80.0^a (80.0^a)	70.2–87.7 (70.2–87.7)	0.040 (0.040)	90.0^b (80.0^b)	81.9–95.3 (70.2–87.7)	0.030 (0.040)	0.061 (1.000)	0.281 (0.000)
Group C	90.0^b (93.3^b)	81.9–95.3 (86.1–97.5)	0.030 (0.025)	100.0^c (98.9^c)	96–100 (94–100)	0.000 (0.011)	0.002* (0.055)	0.469 (0.289)
Group D	100.0^cd (100.0^cd)	96–100 (96–100)	0.000 (0.000)	100.0^cd (93.3^cd)	96–100 (86.1–97.5)	0.000 (0.025)	1.000 (0.013*)	0.000 (0.373)
Group E	100.0^cd (100.0^cd)	96–100 (96–100)	0.000 (0.000)	100.0^cd (90.0^cd)	96–100 (81.9–95.3)	0.000 (0.030)	1.000 (0.002*)	0.000 (0.465)
Total	90.4 (91.3)	87.3–93 (88.3–93.8)	0.029 (0.028)	88.0 (83.6)	84.6–90.9 (79.8–86.9)	0.033 (0.037)	0.247 (< 0.001*)	0.077 (0.234)

Each cell presents the performance of the AI model on both the English and Chinese versions of the same question, shown as English value (Chinese value).
Pairwise comparisons within each model and language version were conducted using the Mann-Whitney U test with a Bonferroni correction for multiple comparisons. Values within the same column sharing a common superscript letter are not statistically different at the p < 0.05 level. Confidence intervals were calculated using the Wald binomial method.
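As a minimal sketch of the Wald binomial interval named in the note above: the function below computes the normal-approximation interval p̂ ± z·sqrt(p̂(1 − p̂)/n) for a proportion. The function name `wald_interval`, the default z = 1.96, and the example counts are illustrative assumptions. The number of questions per group is not stated in this table, so the example is not expected to reproduce the tabulated intervals.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Wald (normal-approximation) confidence interval for a binomial proportion.

    `successes` and `n` are hypothetical inputs: the per-group question
    counts are not given in the table. z = 1.96 corresponds to a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p_hat = successes / n
    # Standard error of the sample proportion.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1], since a proportion cannot fall outside that range.
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, lower, upper


if __name__ == "__main__":
    # Hypothetical example: 45 correct answers out of 90 questions.
    p_hat, lower, upper = wald_interval(45, 90)
    print(f"{100 * p_hat:.1f}% ({100 * lower:.1f}-{100 * upper:.1f})")
```

The plain Wald formula collapses to a zero-width interval when the observed proportion is exactly 0 or 1, because the standard error is then zero. Software that reports a non-degenerate interval in that case is typically applying a correction or a different interval method.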
