Table 1 Comparing the accuracy of responses to FRCOphth Part 2 written questions with different LLM-chatbots.

From: Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams

Covariate	Levels	Inaccurate response, n (%)	Accurate response, n (%)	Univariable OR (CI, p value)	Multilevel OR (CI, p value)
LLM-chatbot	ChatGPT-3.5	65 (50.4)	64 (49.6)	-	-
	Google Bard	62 (48.1)	67 (51.9)	1.10 (0.67–1.79, p = 0.71)	1.16 (0.63–2.14, p = 0.64)
	Bing Chat	22 (17.1)	107 (82.9)	4.94 (2.82–8.92, p < 0.001)***	11.90 (5.54–25.53, p < 0.001)***
	ChatGPT-4.0	27 (20.9)	102 (79.1)	3.84 (2.24–6.71, p < 0.001)***	8.10 (3.95–16.62, p < 0.001)***
	ChatGPT-4.0 prompted	5 (11.6)	38 (88.4)	7.72 (3.10–23.52, p < 0.001)***	23.36 (6.51–83.80, p < 0.001)***
Difficulty	Mean (SD)	2.8 (0.8)	2.4 (0.8)	0.60 (0.48–0.75, p < 0.001)***	0.52 (0.21–1.25, p = 0.14)
Topic^a	Investigations	26 (40.0)	39 (60.0)	-	-
	Trauma	16 (61.5)	10 (38.5)	0.42 (0.16–1.05, p = 0.07)	0.11 (0.01–1.92, p = 0.13)
	Oculoplastic & Orbit	13 (50.0)	13 (50.0)	0.67 (0.26–1.67, p = 0.39)	0.39 (0.03–5.40, p = 0.48)
	Glaucoma	24 (61.5)	15 (38.5)	0.42 (0.18–0.93, p = 0.035)*	0.13 (0.01–1.41, p = 0.09)
	Strabismus	10 (38.5)	16 (61.5)	1.07 (0.42–2.77, p = 0.89)	0.68 (0.05–9.31, p = 0.77)
	Paediatrics	7 (26.9)	19 (73.1)	1.81 (0.69–5.19, p = 0.24)	3.75 (0.23–60.75, p = 0.35)
	Retina	23 (29.5)	55 (70.5)	1.59 (0.80–3.21, p = 0.19)	1.37 (0.19–10.03, p = 0.76)
	Cataract	5 (12.8)	34 (87.2)	4.53 (1.68–14.57, p = 0.005)**	2.66 (0.16–45.68, p = 0.50)
	Cornea & External Eye	2 (3.8)	50 (96.2)	16.67 (4.60–107.44, p < 0.001)***	23.55 (1.42–390.93, p = 0.028)*
	Uveitis & Oncology	10 (25.6)	29 (74.4)	1.93 (0.82–4.79, p = 0.14)	2.22 (0.23–21.91, p = 0.49)
	Neurology	13 (25.0)	39 (75.0)	2.00 (0.91–4.55, p = 0.09)	1.59 (0.16–15.53, p = 0.69)
	Genetics	6 (46.2)	7 (53.8)	0.78 (0.23–2.66, p = 0.68)	1.70 (0.05–61.67, p = 0.77)
	Pharmacology	12 (46.2)	14 (53.8)	0.78 (0.31–1.96, p = 0.59)	0.89 (0.06–12.26, p = 0.93)
	Miscellaneous	14 (26.9)	38 (73.1)	1.81 (0.83–4.05, p = 0.14)	1.42 (0.16–12.53, p = 0.75)

^aEffect estimates of question topic on accurate responses, adjusted for LLM-chatbot and difficulty, compared to the reference topic of “Investigations”. Significant differences are indicated by * for P < 0.05, ** for P < 0.01, and *** for P < 0.001. For the performance of individual LLM-chatbots on different topics, please see Supplemental Table 2.
LLM Large language model, OR Odds ratio, SD Standard deviation.
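
As a reading aid, the univariable odds ratios for the LLM-chatbot rows can be recovered from the response counts alone: each is the odds of an accurate response for that chatbot divided by the same odds for the reference level, ChatGPT-3.5. The Python sketch below is illustrative and is not the authors' analysis code; the names `counts` and `odds_ratio` are hypothetical. It reproduces only the point estimates. The confidence intervals, the p values, and the multilevel column depend on fitted models whose details this table does not give.

```python
# Counts of (inaccurate, accurate) responses, copied from Table 1.
counts = {
    "ChatGPT-3.5": (65, 64),  # reference level
    "Google Bard": (62, 67),
    "Bing Chat": (22, 107),
    "ChatGPT-4.0": (27, 102),
    "ChatGPT-4.0 prompted": (5, 38),
}


def odds_ratio(inaccurate, accurate, ref_inaccurate, ref_accurate):
    """Crude odds ratio of an accurate response versus the reference level."""
    odds = accurate / inaccurate
    ref_odds = ref_accurate / ref_inaccurate
    return odds / ref_odds


reference = counts["ChatGPT-3.5"]

# Point estimates rounded to two decimals, as reported in the table.
crude_or = {
    name: round(odds_ratio(*row, *reference), 2)
    for name, row in counts.items()
    if name != "ChatGPT-3.5"
}

for name, value in crude_or.items():
    print(f"{name}: OR = {value:.2f}")
```

The same ratio applies to the Topic rows with “Investigations” (26 inaccurate, 39 accurate) as the reference; for example, Trauma gives (10/16) / (39/26), which rounds to 0.42.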
