Table 2 Accuracy rates of ChatGPT-4o and DeepSeek in English and Chinese versions across different groups.

From: Evaluation of ChatGPT-4o and DeepSeek as tools for orthodontic health literacy in public dental education

 

| Group | ChatGPT-4o: Value (%) | ChatGPT-4o: 95% CI (Wald binomial) | ChatGPT-4o: Standard deviation | DeepSeek: Value (%) | DeepSeek: 95% CI (Wald binomial) | DeepSeek: Standard deviation | p | Cohen’s d |
|---|---|---|---|---|---|---|---|---|
| Group A | 82.2a (83.3a) | 72.7–89.5 (74–90.4) | 0.038 (0.037) | 50.0a (55.6a) | 39.3–60.7 (44.7–66) | 0.050 (0.050) | < 0.001* (< 0.001*) | 0.720 (0.629) |
| Group B | 80.0a (80.0a) | 70.2–87.7 (70.2–87.7) | 0.040 (0.040) | 90.0b (80.0b) | 81.9–95.3 (70.2–87.7) | 0.030 (0.040) | 0.061 (1.000) | 0.281 (0.000) |
| Group C | 90.0b (93.3b) | 81.9–95.3 (86.1–97.5) | 0.030 (0.025) | 100.0c (98.9c) | 96–100 (94–100) | 0.000 (0.011) | 0.002* (0.055) | 0.469 (0.289) |
| Group D | 100.0cd (100.0cd) | 96–100 (96–100) | 0.000 (0.000) | 100.0cd (93.3cd) | 96–100 (86.1–97.5) | 0.000 (0.025) | 1.000 (0.013*) | 0.000 (0.373) |
| Group E | 100.0cd (100.0cd) | 96–100 (96–100) | 0.000 (0.000) | 100.0cd (90.0cd) | 96–100 (81.9–95.3) | 0.000 (0.030) | 1.000 (0.002*) | 0.000 (0.465) |
| Total | 90.4 (91.3) | 87.3–93 (88.3–93.8) | 0.029 (0.028) | 88.0 (83.6) | 84.6–90.9 (79.8–86.9) | 0.033 (0.037) | 0.247 (< 0.001*) | 0.077 (0.234) |

  1. Each cell presents the performance of the AI model on both the English and Chinese versions of the same question, shown as English value (Chinese value).
  2. Pairwise comparisons within each model and language version were conducted using the Mann-Whitney U test with a Bonferroni correction for multiple comparisons. Values within the same column sharing a common superscript letter are not statistically different at the p < 0.05 level. Confidence intervals were calculated using the Wald binomial method.
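
As an illustration of the Wald binomial method named in note 2, the following is a minimal Python sketch of the textbook normal-approximation interval, p ± z·sqrt(p(1 − p)/n). The function name `wald_ci` and the example counts (74 correct out of 90) are assumptions made for illustration only; the table does not state the number of items per group. Statistical packages differ in how they treat corrections and proportions at 0% or 100% (where the plain formula collapses to a zero-width interval), so this sketch is not expected to reproduce the tabulated bounds exactly.

```python
import math


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Textbook Wald interval for a binomial proportion, clipped to [0, 1].

    Hypothetical helper for illustration; not taken from the source article.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    # Standard error of the proportion under the normal approximation.
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed example counts: 74 correct answers out of 90 (about 82.2%).
low, high = wald_ci(74, 90)
print(f"{100 * low:.1f}-{100 * high:.1f} %")
```

The interval is symmetric about the observed proportion unless clipping at 0 or 1 applies, which is a quick way to check whether a reported interval follows this plain form.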