Table 3 Performance of LLM-Chatbots in refining suboptimal responses with updated model iterations (English)

From: Large language model comparisons between English and Chinese query performance for cardiovascular prevention

| English Prompts | BARD | ChatGPT 3.5 | ChatGPT 4 |
| --- | --- | --- | --- |
| Number of Suboptimal Responses, n | 9 | 6 | 2 |
| Temporal Improvement, n (%)ᵃ | 6 (66.7) | 4 (66.7) | 2 (100) |
| Self-check, n (%)ᵇ | 7 (77.8) | 1 (16.7) | 2 (100) |

| Chinese Prompts | ERNIE | ChatGPT 3.5 | ChatGPT 4 |
| --- | --- | --- | --- |
| Number of Suboptimal Responses, n | 12 | 9 | 11 |
| Temporal Improvement, n (%)ᵃ | 11 (91.6) | 2 (22.2) | 6 (54.5) |
| Self-check, n (%)ᵇ | 11 (91.6) | 1 (11.1) | 5 (45.4) |

  1. ᵃ To evaluate the models’ temporal improvement over the study period, suboptimal responses (those graded “borderline” or “inappropriate”) were regenerated using the latest iteration of each LLM (26th June and 10th July), and the new responses were assessed by the cardiologist graders.
  2. ᵇ To test self-check performance, “please check if above answer is correct” was entered as a follow-up prompt. A successful self-check was defined as either recognizing whether the response was correct or, additionally, providing an appropriate response.
  3. Abbreviations: LLM, large language model; SD, standard deviation.
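
The percentages in the table are each count divided by the number of suboptimal responses for that model and prompt language. A minimal sketch of that arithmetic follows, assuming the counts are transcribed correctly from the table; the names `TABLE3`, `pct`, and `summarize` are hypothetical and are not taken from the article.

```python
# Counts transcribed from Table 3 as:
# (number of suboptimal responses, temporal improvement n, self-check n).
# The dictionary layout is an illustrative assumption, not the authors' data format.
TABLE3 = {
    ("English", "BARD"): (9, 6, 7),
    ("English", "ChatGPT 3.5"): (6, 4, 1),
    ("English", "ChatGPT 4"): (2, 2, 2),
    ("Chinese", "ERNIE"): (12, 11, 11),
    ("Chinese", "ChatGPT 3.5"): (9, 2, 1),
    ("Chinese", "ChatGPT 4"): (11, 6, 5),
}


def pct(count, total):
    """Return count as an unrounded percentage of total."""
    return 100.0 * count / total


def summarize(table):
    """Map each (language, model) pair to (improvement %, self-check %)."""
    rows = {}
    for key, (total, improved, self_checked) in table.items():
        rows[key] = (pct(improved, total), pct(self_checked, total))
    return rows


if __name__ == "__main__":
    for (language, model), (improved, checked) in summarize(TABLE3).items():
        print(f"{language:8s} {model:12s} "
              f"improvement {improved:6.2f}%  self-check {checked:6.2f}%")
```

Recomputed this way, 11/12 is about 91.67% and 5/11 is about 45.45%, so the printed values 91.6 and 45.4 look truncated rather than rounded to one decimal place, while the English-prompt values match conventional rounding. The table values are left as published.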