Table 3 Performance of LLM-Chatbots in refining suboptimal responses with updated model iterations (English)

English Prompts	BARD	ChatGPT 3.5	ChatGPT 4
Number of Suboptimal Responses, n	9	6	2
Temporal Improvement, n (%)^a	6 (66.7)	4 (66.7)	2 (100)
Self-check, n (%)^b	7 (77.8)	1 (16.7)	2 (100)

Chinese Prompts	ERNIE	ChatGPT 3.5	ChatGPT 4
Number of Suboptimal Responses, n	12	9	11
Temporal Improvement, n (%)^a	11 (91.7)	2 (22.2)	6 (54.5)
Self-check, n (%)^b	11 (91.7)	1 (11.1)	5 (45.5)

^aTo evaluate the models’ temporal improvement over the study period, suboptimal responses (those graded “borderline” or “inappropriate”) were regenerated using the latest iteration of each LLM (26^th June and 10^th July), and the new responses were assessed by the cardiologist graders.
^bTo test self-check performance, “please check if above answer is correct” was entered as a follow-up prompt. A self-check was considered successful if the chatbot either recognized whether its response was correct or additionally provided an appropriate response.
LLM large language model, SD standard deviation
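
Each percentage in the table is the count divided by the number of suboptimal responses for that chatbot and prompt language. The following is a minimal sketch of that arithmetic, not the authors' analysis code; the names `pct`, `COUNTS`, and `summarize` are hypothetical, and the counts are transcribed from the table. Values are rounded to one decimal place, so a result may differ in the last digit from a figure that was truncated instead.

```python
def pct(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# (language, chatbot) -> (suboptimal responses, temporal improvement, self-check)
# Counts transcribed from the table; the structure itself is an assumption.
COUNTS = {
    ("English", "BARD"): (9, 6, 7),
    ("English", "ChatGPT 3.5"): (6, 4, 1),
    ("English", "ChatGPT 4"): (2, 2, 2),
    ("Chinese", "ERNIE"): (12, 11, 11),
    ("Chinese", "ChatGPT 3.5"): (9, 2, 1),
    ("Chinese", "ChatGPT 4"): (11, 6, 5),
}


def summarize(counts):
    """Map each (language, chatbot) to (temporal improvement %, self-check %)."""
    return {
        key: (pct(improved, total), pct(checked, total))
        for key, (total, improved, checked) in counts.items()
    }


if __name__ == "__main__":
    for (language, chatbot), (improved, checked) in summarize(COUNTS).items():
        print(f"{language:8s}{chatbot:12s} improvement {improved}%  self-check {checked}%")
```

Because the denominators are small (2 to 12 responses), a single response shifts a percentage by roughly 8 to 50 points, so the counts are more informative than the percentages alone.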
