Table 1 Comparing the accuracy of responses to FRCOphth Part 2 written questions with different LLM-chatbots.

From: Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams

| Covariate | Levels | Inaccurate response | Accurate response | Univariable OR (95% CI, p) | Multilevel OR (95% CI, p) |
|---|---|---|---|---|---|
| LLM-chatbot | ChatGPT-3.5 | 65 (50.4) | 64 (49.6) | – | – |
| | Google Bard | 62 (48.1) | 67 (51.9) | 1.10 (0.67–1.79, p = 0.71) | 1.16 (0.63–2.14, p = 0.64) |
| | Bing Chat | 22 (17.1) | 107 (82.9) | 4.94 (2.82–8.92, p < 0.001)*** | 11.90 (5.54–25.53, p < 0.001)*** |
| | ChatGPT-4.0 | 27 (20.9) | 102 (79.1) | 3.84 (2.24–6.71, p < 0.001)*** | 8.10 (3.95–16.62, p < 0.001)*** |
| | ChatGPT-4.0 prompted | 5 (11.6) | 38 (88.4) | 7.72 (3.10–23.52, p < 0.001)*** | 23.36 (6.51–83.80, p < 0.001)*** |
| Difficulty | Mean (SD) | 2.8 (0.8) | 2.4 (0.8) | 0.60 (0.48–0.75, p < 0.001)*** | 0.52 (0.21–1.25, p = 0.14) |
| Topicᵃ | Investigations | 26 (40.0) | 39 (60.0) | – | – |
| | Trauma | 16 (61.5) | 10 (38.5) | 0.42 (0.16–1.05, p = 0.07) | 0.11 (0.01–1.92, p = 0.13) |
| | Oculoplastic & Orbit | 13 (50.0) | 13 (50.0) | 0.67 (0.26–1.67, p = 0.39) | 0.39 (0.03–5.40, p = 0.48) |
| | Glaucoma | 24 (61.5) | 15 (38.5) | 0.42 (0.18–0.93, p = 0.035)* | 0.13 (0.01–1.41, p = 0.09) |
| | Strabismus | 10 (38.5) | 16 (61.5) | 1.07 (0.42–2.77, p = 0.89) | 0.68 (0.05–9.31, p = 0.77) |
| | Paediatrics | 7 (26.9) | 19 (73.1) | 1.81 (0.69–5.19, p = 0.24) | 3.75 (0.23–60.75, p = 0.35) |
| | Retina | 23 (29.5) | 55 (70.5) | 1.59 (0.80–3.21, p = 0.19) | 1.37 (0.19–10.03, p = 0.76) |
| | Cataract | 5 (12.8) | 34 (87.2) | 4.53 (1.68–14.57, p = 0.005)** | 2.66 (0.16–45.68, p = 0.50) |
| | Cornea & External Eye | 2 (3.8) | 50 (96.2) | 16.67 (4.60–107.44, p < 0.001)*** | 23.55 (1.42–390.93, p = 0.028)* |
| | Uveitis & Oncology | 10 (25.6) | 29 (74.4) | 1.93 (0.82–4.79, p = 0.14) | 2.22 (0.23–21.91, p = 0.49) |
| | Neurology | 13 (25.0) | 39 (75.0) | 2.00 (0.91–4.55, p = 0.09) | 1.59 (0.16–15.53, p = 0.69) |
| | Genetics | 6 (46.2) | 7 (53.8) | 0.78 (0.23–2.66, p = 0.68) | 1.70 (0.05–61.67, p = 0.77) |
| | Pharmacology | 12 (46.2) | 14 (53.8) | 0.78 (0.31–1.96, p = 0.59) | 0.89 (0.06–12.26, p = 0.93) |
| | Miscellaneous | 14 (26.9) | 38 (73.1) | 1.81 (0.83–4.05, p = 0.14) | 1.42 (0.16–12.53, p = 0.75) |

ᵃ Effect estimates of question topic on accurate responses, adjusted for LLM-chatbot and difficulty, relative to the reference topic of "Investigations". Significant differences are indicated by * for p < 0.05, ** for p < 0.01, and *** for p < 0.001. For the performance of individual LLM-chatbots on different topics, please see Supplemental Table 2.

LLM, large language model; OR, odds ratio; SD, standard deviation.
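The univariable odds ratios above follow directly from the printed counts: for each chatbot, the odds of an accurate response (accurate ÷ inaccurate) are divided by the corresponding odds for the reference, ChatGPT-3.5. A minimal sketch reproducing the point estimates, assuming only the counts in Table 1 (confidence intervals and p-values are not reproduced here):

```python
# Sketch: recompute the univariable odds ratios in Table 1 from the raw
# accurate/inaccurate counts, with ChatGPT-3.5 as the reference model.
# Counts are taken verbatim from the table.
counts = {  # model: (inaccurate, accurate)
    "ChatGPT-3.5": (65, 64),
    "Google Bard": (62, 67),
    "Bing Chat": (22, 107),
    "ChatGPT-4.0": (27, 102),
    "ChatGPT-4.0 prompted": (5, 38),
}

def univariable_or(model: str, reference: str = "ChatGPT-3.5") -> float:
    """Odds ratio of an accurate response for `model` vs. the reference."""
    inacc, acc = counts[model]
    ref_inacc, ref_acc = counts[reference]
    return (acc / inacc) / (ref_acc / ref_inacc)

for model in counts:
    if model != "ChatGPT-3.5":
        # e.g. Google Bard: (67/62) / (64/65) ≈ 1.10, matching the table
        print(f"{model}: OR = {univariable_or(model):.2f}")
```

The multilevel ORs cannot be recovered this way, since they come from a regression adjusting for difficulty and topic.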