Table 2 Original research articles on the use of large language models for patient education.
First author | LLM used | Subspecialty | Type of data used | LLM assessed against | Ophthalmologist verification? | Important findings |
---|---|---|---|---|---|---|
Tailor et al. [22] | ChatGPT 3.5, ChatGPT 4, Claude 2, Bing, Bard | Retina | Theoretical questions | Expert opinion | Yes | LLM responses were comparable to expert responses in quality, empathy, and safety; expert-edited LLM responses performed better than expert responses alone |
Desideri et al. [28] | ChatGPT 3.5, Bing, Bard | Retina | Theoretical questions | Expert opinion | Yes | ChatGPT 3.5 provided the most accurate responses on intravitreal injection advice for patients with AMD |
Tailor et al. [24] | ChatGPT 4 | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 4 provided appropriate responses to 79% of patient questions across ophthalmic subspecialties |
Bernstein et al. [66] | ChatGPT 3.5 | General ophthalmology | Theoretical questions | Expert opinion | Yes | Ophthalmologists accurately distinguished human from chatbot responses 61% of the time; the two did not differ significantly in the amount of incorrect information |
Wu et al. [29] | ChatGPT 3.5, Bard, Google Assistant, Alexa | Retina | Theoretical questions | Actual outcome | Yes | LLM responses required a higher reading level than American Academy of Ophthalmology (AAO) materials and could miss the urgency of a diagnosis such as retinal detachment |
Cohen et al. [67] | ChatGPT 3.5 | Cataract | Theoretical questions | Expert opinion | Yes | ChatGPT’s responses were written at a higher reading level than Google’s, containing 6% inaccuracies. ChatGPT’s answers were favoured 66% of the time |
Pushpanathan et al. [23] | ChatGPT 3.5, ChatGPT 4, Google Bard | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 4 demonstrated superior performance (89.2%) compared with ChatGPT 3.5 (59.5%) and Google Bard (40.5%) in addressing ocular symptom queries |
Kianian et al. [26] | ChatGPT 4, Bard | Uveitis | Theoretical questions | Actual outcome | Yes | ChatGPT 4 had a significantly lower Flesch-Kincaid Grade Level (FKGL) score, indicating higher readability, with fewer complex words (the FKGL formula is sketched below the table) |
Wilhelm et al. [68] | ChatGPT 3.5, Command, Claude-instant-v1.0, BigScience | External disease/Cornea, Dermatology, Orthopaedics | Theoretical questions | Actual outcome | Yes | Overall accuracy was highest with GPT 3.5 Turbo (88.3%) across ophthalmology, dermatology, and orthopaedic questions |
Dihan et al. [27] | ChatGPT 3.5, ChatGPT 4, Bard | Glaucoma | Theoretical questions | Actual outcome | Yes | All LLMs’ responses were poorly actionable; ChatGPT 4 produced the most effective and easily understandable patient education material for childhood glaucoma, achieving a 6th-grade reading level |
Xue et al. [69] | Xiaoqing, HuaTuo, Ivy GPT, ChatGPT 3.5, ChatGPT 4 | Glaucoma | Theoretical questions | Expert opinion | Yes | Xiaoqing outperformed other LLMs in informativeness and readability for glaucoma patients |
Barclay et al. [25] | ChatGPT 3.5, ChatGPT 4 | Cornea | Theoretical questions | Expert opinion | Yes | ChatGPT 4 offered consistent and more frequently correct responses (89%) to questions on endothelial keratoplasty and Fuchs dystrophy; the study highlights risks of bias and hallucination |
Biswas et al. [70] | ChatGPT 3.5 | General ophthalmology | Theoretical questions | Expert opinion | Yes | ChatGPT 3.5 provided good responses on myopia, rated at 73% |
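
Several studies in Table 2 report readability using the Flesch-Kincaid Grade Level (FKGL) [26, 27], where a lower score corresponds to easier text (e.g., a score of 6 approximates a 6th-grade reading level). As a minimal illustrative sketch only, and not the tooling used in the cited studies, the standard FKGL formula, 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59, can be computed as follows; the syllable counter here is a simplified heuristic.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels,
    discounting a trailing silent 'e' (heuristic only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)


if __name__ == "__main__":
    # Hypothetical patient-education snippet, used only to illustrate the calculation.
    sample = ("Retinal detachment is an emergency. "
              "If you notice sudden flashes or floaters, seek care immediately.")
    print(f"Estimated FKGL: {fkgl(sample):.1f}")
```

Published studies typically use validated readability tools rather than a hand-rolled counter; the sketch is intended only to show how sentence length and word complexity drive the grade-level estimates reported above.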