Table 2 Performance of LLM chatbots in addressing questions with English prompts

From: Large language model comparisons between English and Chinese query performance for cardiovascular prevention

| English Prompts | BARD | ChatGPT 3.5 | ChatGPT 4 | P (BARD vs. 3.5) | P (BARD vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum score, mean (SD)ᵃ | 5.40 (0.93) | 5.45 (1.06) | 5.65 (0.67) | 0.74 | 0.057 | 0.16 |
| Appropriate, n (%) | 66 (88.0) | 69 (92.0) | 73 (97.3) | 0.33 | 0.021 | 0.24 |
| Borderline, n (%) | 9 (12.0) | 5 (6.7) | 1 (1.3) | | | |
| Inappropriate, n (%) | 0 (0.0) | 1 (1.3) | 1 (1.3) | | | |

| Chinese Prompts | ERNIE | ChatGPT 3.5 | ChatGPT 4 | P (ERNIE vs. 3.5) | P (ERNIE vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum score, mean (SD)ᵃ | 4.99 (1.85) | 5.25 (1.62) | 5.07 (1.74) | 0.34 | 0.78 | 0.49 |
| Appropriate, n (%) | 63 (84.0) | 66 (88.0) | 64 (85.3) | 0.71 | 0.96 | 0.85 |
| Borderline, n (%) | 3 (4.0) | 3 (4.0) | 3 (4.0) | | | |
| Inappropriate, n (%) | 9 (12.0) | 6 (8.0) | 8 (10.7) | | | |

  1. ᵃ Gradings from the three cardiologists were scored as follows: "Appropriate" = 2, "Borderline" = 1, and "Inappropriate" = 0. The p-values in the table represent the following comparisons: P (BARD vs. 3.5), Google Bard versus ChatGPT-3.5; P (BARD vs. 4), Google Bard versus ChatGPT-4; P (3.5 vs. 4), ChatGPT-3.5 versus ChatGPT-4.
  2. SD, standard deviation.
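
The sum score described in footnote (a) can be sketched as follows; this is a minimal illustration of the arithmetic (each of three cardiologists contributes 0, 1, or 2 points per response, so the sum ranges from 0 to 6). The function name and data structure are illustrative assumptions, not taken from the paper.

```python
# Scoring scheme from footnote (a): three cardiologists each grade a response.
GRADE_SCORES = {"Appropriate": 2, "Borderline": 1, "Inappropriate": 0}

def sum_score(grades):
    """Sum the three cardiologists' grades for one response (range 0-6)."""
    return sum(GRADE_SCORES[g] for g in grades)

# Example: two "Appropriate" grades and one "Borderline" grade.
print(sum_score(["Appropriate", "Appropriate", "Borderline"]))  # → 5
```

A response graded "Appropriate" by all three reviewers reaches the maximum sum score of 6, which is why mean scores close to 6 (e.g., 5.65 for ChatGPT 4 with English prompts) indicate broadly appropriate answers.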