Table 2 Performance of LLM chatbots in addressing questions with English prompts

From: Large language model comparisons between English and Chinese query performance for cardiovascular prevention

| English Prompts | BARD | ChatGPT 3.5 | ChatGPT 4 | P (BARD vs. 3.5) | P (BARD vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum score, mean (SD)ᵃ | 5.40 (0.93) | 5.45 (1.06) | 5.65 (0.67) | 0.74 | 0.057 | 0.16 |
| Appropriate, n (%) | 66 (88.0) | 69 (92.0) | 73 (97.3) | 0.33 | 0.021 | 0.24 |
| Borderline, n (%) | 9 (12.0) | 5 (6.7) | 1 (1.3) | | | |
| Inappropriate, n (%) | 0 (0.0) | 1 (1.3) | 1 (1.3) | | | |

| Chinese Prompts | ERNIE | ChatGPT 3.5 | ChatGPT 4 | P (ERNIE vs. 3.5) | P (ERNIE vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum score, mean (SD)ᵃ | 4.99 (1.85) | 5.25 (1.62) | 5.07 (1.74) | 0.34 | 0.78 | 0.49 |
| Appropriate, n (%) | 63 (84.0) | 66 (88.0) | 64 (85.3) | 0.71 | 0.96 | 0.85 |
| Borderline, n (%) | 3 (4.0) | 3 (4.0) | 3 (4.0) | | | |
| Inappropriate, n (%) | 9 (12.0) | 6 (8.0) | 8 (10.7) | | | |

  1. ᵃ Gradings from the three cardiologists were scored as follows: "Appropriate" = 2, "Borderline" = 1, and "Inappropriate" = 0. The p-values in the table represent the following comparisons: P (BARD vs. 3.5), Google Bard versus ChatGPT-3.5; P (BARD vs. 4), Google Bard versus ChatGPT-4; P (3.5 vs. 4), ChatGPT-3.5 versus ChatGPT-4.
  2. SD, standard deviation.
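
The sum score described in footnote (a) can be sketched as follows; this is a minimal illustration of the arithmetic (each of three cardiologists contributes 0, 1, or 2 points per response, so the sum ranges from 0 to 6). The function name and data structure are illustrative assumptions, not taken from the paper.

```python
# Scoring scheme from footnote (a): three cardiologists each grade a response.
GRADE_SCORES = {"Appropriate": 2, "Borderline": 1, "Inappropriate": 0}

def sum_score(grades):
    """Sum the three cardiologists' grades for one response (range 0-6)."""
    return sum(GRADE_SCORES[g] for g in grades)

# Example: two "Appropriate" grades and one "Borderline" grade.
print(sum_score(["Appropriate", "Appropriate", "Borderline"]))  # → 5
```

A response graded "Appropriate" by all three reviewers reaches the maximum sum score of 6, which is why mean scores close to 6 (e.g., 5.65 for ChatGPT 4 with English prompts) indicate broadly appropriate answers.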