Table 2 Original research articles on the use of large language models for patient education.

From: Large language models in ophthalmology: a scoping review on their utility for clinicians, researchers, patients, and educators

| First author | LLM(s) used | Subspecialty | Type of data used | LLM assessed against | Ophthalmologist verification? | Important findings |
|---|---|---|---|---|---|---|
| Tailor et al. [22] | ChatGPT 3.5, ChatGPT 4, Claude 2, Bing, Bard | Retina | Theoretical questions | Expert opinion | Yes | LLM responses were comparable with experts' in quality, empathy, and safety; expert-edited LLM responses performed better than expert responses alone |
| Desideri et al. [28] | ChatGPT 3.5, Bing, Bard | Retina | Theoretical questions | Expert opinion | Yes | ChatGPT 3.5 gave the most accurate responses on intravitreal injection advice for patients with age-related macular degeneration (AMD) |
| Tailor et al. [24] | ChatGPT 4 | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 4 provided appropriate responses to 79% of patient questions across ophthalmic specialties |
| Bernstein et al. [66] | ChatGPT 3.5 | General ophthalmology | Theoretical questions | Expert opinion | Yes | Ophthalmologists correctly discerned human vs chatbot responses 61% of the time; the two response types did not differ significantly in the amount of incorrect information |
| Wu et al. [29] | ChatGPT 3.5, Bard, Google Assistant, Alexa | Retina | Theoretical questions | Actual outcome | Yes | LLM responses required higher reading comprehension than American Academy of Ophthalmology (AAO) materials and could miss the urgency of a diagnosis such as retinal detachment |
| Cohen et al. [67] | ChatGPT 3.5 | Cataract | Theoretical questions | Expert opinion | Yes | ChatGPT's responses were written at a higher reading level than Google's and contained 6% inaccuracies; ChatGPT's answers were favoured 66% of the time |
| Pushpanathan et al. [23] | ChatGPT 3.5, ChatGPT 4, Google Bard | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 4 demonstrated superior performance (89.2%) compared with ChatGPT 3.5 (59.5%) and Google Bard (40.5%) in addressing ocular symptom queries |
| Kianian et al. [26] | ChatGPT 4, Bard | Uveitis | Theoretical questions | Actual outcome | Yes | ChatGPT 4 had a significantly lower Flesch-Kincaid Grade Level (FKGL) score, i.e. higher readability, with fewer complex words (see the FKGL sketch after this table) |
| Wilhelm et al. [68] | ChatGPT 3.5, Command, Claude-instant-v1.0, BigScience | External disease/Cornea, Dermatology, Orthopaedics | Theoretical questions | Actual outcome | Yes | Overall accuracy was highest with ChatGPT 3.5 Turbo (88.3%) across ophthalmology, dermatology, and orthopaedic questions |
| Dihan et al. [27] | ChatGPT 3.5, ChatGPT 4, Bard | Glaucoma | Theoretical questions | Actual outcome | Yes | All LLMs' responses were poorly actionable; ChatGPT 4 produced the most effective and easily understandable patient education material for childhood glaucoma while achieving a 6th-grade reading level |
| Xue et al. [69] | Xiaoqing, HuaTuo, Ivy GPT, ChatGPT 3.5, ChatGPT 4 | Glaucoma | Theoretical questions | Expert opinion | Yes | Xiaoqing outperformed the other LLMs in informativeness and readability for glaucoma patients |
| Barclay et al. [25] | ChatGPT 3.5, ChatGPT 4 | Cornea | Theoretical questions | Expert opinion | Yes | ChatGPT 4 gave consistent and more frequently correct responses (89%) to questions on endothelial keratoplasty and Fuchs dystrophy; the study also highlights bias and hallucinations |
| Biswas et al. [70] | ChatGPT 3.5 | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 3.5 provided good responses on myopia, rated at 73% |
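Several of the studies above (e.g., Kianian et al. [26] and Dihan et al. [27]) judge patient education material by reading grade level. The Flesch-Kincaid Grade Level (FKGL) referenced in the table is a simple formula over word, sentence, and syllable counts. The minimal Python sketch below illustrates the standard formula only; the vowel-group syllable counter and the example sentences are assumptions for illustration and do not reproduce the tooling or texts used in any of the cited studies.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels.
    This heuristic is an assumption for illustration only."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    A lower score corresponds to an easier (lower-grade) reading level."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Hypothetical example: two phrasings of the same advice.
plain = "A detached retina is an emergency. See an eye doctor today."
dense = ("Rhegmatogenous retinal detachment constitutes an ophthalmic "
         "emergency necessitating immediate specialist evaluation.")
print(f"plain: FKGL = {fkgl(plain):.1f}")
print(f"dense: FKGL = {fkgl(dense):.1f}")
```

The plainer phrasing scores several grade levels lower, which is the sense in which a "lower FKGL" in the table corresponds to more readable patient material.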