Introduction

A large language model (LLM) is a type of generative artificial intelligence (AI) that uses deep neural networks and large text-based training datasets derived from articles, books, and webpage content comprising billions of words [1]. LLMs are advanced natural language processing (NLP) models focused on the interaction between computers and human language, and they are capable of learning and discerning contextual information [2]. These models can act as conversational agents useful for day-to-day activities. Public interest in these models surged when ChatGPT (OpenAI, San Francisco, CA, USA), trained on over 300 billion words, was launched in November 2022, providing chatbot conversations through a user-friendly interface [3]. Since then, other generative language models have been introduced and have become widely available to the public. Their widespread adoption is evident not only among individual consumers but also in biomedical research. Growing interest from clinicians and investigators in the utility of LLMs is reflected in a rising number of studies, particularly on their use in ophthalmology. This review aims to report published original research data on the utility of LLMs, emphasizing four scopes of research: clinical assistance, medical education, patient education, and research.

Methods

The study examined published articles from peer-reviewed journals. Articles were identified from the PubMed online database on 6 June 2024 using the following constructed search query: "(large language model* OR LLM OR LLMs* OR GPT) AND (ophthalmology* OR ophthalmic* OR eye)". All identified articles were categorized by article type, and only peer-reviewed original research articles were included in the review. The search query yielded 165 articles, 46 of which were identified as original research articles (Fig. 1). The University of California San Diego Library provided access to full-text articles. Included articles were parsed and analysed according to the following: journal, LLM (and model version, if available) used, purpose of the study, LLM application, specialty/subspecialty discussed, and important findings and implications.

Fig. 1

Flow diagram illustrating the number of records retrieved from the PubMed database on 6 June 2024, detailing the progression from initial literature identification through final inclusion in the manuscript.

Applications of LLM

The applications of LLMs in ophthalmology offer tremendous potential for assisting clinicians, students, educators, and even patients. Tables 1–4 list all articles included in the study, grouped by application. Investigators have studied different LLMs and their subsequent versions across most ophthalmology subspecialties. The findings have been insightful, with some offering implications for clinical use. In the sections that follow, we discuss the findings of the articles reviewed. First, we describe our findings based on the number of articles retrieved, the LLM application and version used, and the specialty/subspecialty related to its use. This is followed by the specific findings reported by the authors.

Table 1 Original research articles on the use of large language models for clinical assistance.

Clinical assistance

Our literature review yielded 14 original articles describing the utility of LLMs in clinical assistance (Table 1). Of these, three discussed its use in general ophthalmology, while eleven addressed subspecialty topics (six in retina, two in glaucoma, and one each in uveitis, oculoplastics and lacrimal, and refractive surgery). Seven LLMs were described: ChatGPT 3.5 and 4, Bing (Microsoft Corporation, Washington, USA), Glass 1.0 (Glass Health, San Francisco, California, USA), Google Gemini (formerly Bard; Google, California, USA), and Llama 2 (Meta AI, New York, USA).

The potential of using LLMs to determine diagnoses [4], triage urgency [4, 5], and prepare operative notes and discharge summaries [6] has been described in simulated general ophthalmology settings. Zandi et al. [4] analysed the ability of ChatGPT 4 and Bard to diagnose 40 common ophthalmic conditions and assign urgency of care-seeking, using common ophthalmic complaints posed from a patient perspective. They found that both chatbots were significantly better at ophthalmic triage (assessing urgency) than at identifying the correct leading diagnosis, with ChatGPT 4 performing better in terms of appropriateness of triage recommendations, grader satisfaction for patient use, and lower rates of potential harm. A study by Lyons et al. [5] similarly examined both diagnosis and triage urgency for 44 clinical case vignettes representing common ophthalmic complaints. Their study found that both LLMs were able to list the correct diagnosis among the top three differentials in most cases and to determine triage urgency correctly. ChatGPT 4 also performed better than Bing Chat and had diagnostic and triage accuracy comparable to that of ophthalmology trainees, with no grossly inaccurate statements. In the study by Singh et al. [6], responses to 24 prompts to create discharge summaries and operative notes were evaluated by three surgeons across various criteria. They found encouraging results and trainable potential while acknowledging some inaccuracies and generic text responses.

Studies describing the use of LLMs in the field of retina focused on diabetic retinopathy (DR), diabetic macular oedema (DMO), retinal vascular diseases, and retinal detachment. Two studies examined the utility of LLMs in developing recommendations for DR screening and DMO management by comparing their responses with those of clinicians on hypothetical AI-generated case scenarios [7, 8]. In the study of Gopalakrishnan et al. [7], recommendations generated by 5 clinicians and 3 LLMs (ChatGPT 3.5, ChatGPT 4, and Bing) using a three-point multiple-choice question on the urgency of DR screening were compared. The study found fair inter-rater reliability among clinicians (κ = 0.25), among LLMs (κ = 0.29), and between the majority response of clinicians and AI (κ = 0.32). Meanwhile, the study by Choudhary et al. [8] compared the single best management per eye for DMO, and for co-existing ocular comorbidities if any, between 5 clinicians and the same 3 LLMs (ChatGPT 3.5, ChatGPT 4, and Bing). For the management of DMO, the study showed moderate agreement among clinicians (κ = 0.60) and among LLMs (κ = 0.58). A strong agreement was found between the majority clinician response and the majority AI response for the same variable (κ = 0.69). For the management of co-existing ocular comorbidities, there was substantial agreement among clinicians (κ = 0.80) but only fair agreement among LLMs (κ = 0.36), and moderate agreement (κ = 0.49) between the majority clinician response and the majority AI response. The studies of Gopalakrishnan and Choudhary show the difficulty both clinicians and LLMs may have in reaching strong agreement, which may reflect the inherent complexity of managing diabetic eye disease. It cannot be discounted, however, that the use of AI-generated cases, which may have been included in the LLMs' training data, may have introduced bias into the results.
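For readers less familiar with the agreement statistics reported above, Cohen's kappa corrects raw percent agreement for the agreement expected by chance alone. A minimal sketch of how such a comparison could be computed is shown below; the management labels are purely illustrative and are not data from the cited studies.

```python
# Minimal sketch: agreement between a clinician majority and an LLM majority
# on hypothetical management choices (illustrative labels, not study data).
from sklearn.metrics import cohen_kappa_score

# One label per simulated case, e.g. the single best management option chosen.
clinician_majority = ["observe", "anti-VEGF", "anti-VEGF", "laser", "observe", "anti-VEGF"]
llm_majority       = ["observe", "anti-VEGF", "laser",     "laser", "anti-VEGF", "anti-VEGF"]

kappa = cohen_kappa_score(clinician_majority, llm_majority)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0 = chance-level agreement, 1 = perfect agreement
```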

Liu et al. [9] uniquely assessed the utility of ChatGPT 3.5 in diagnosing various retinal vascular diseases using Chinese-language prompts of fluorescein angiography reports, comparing the results against those achieved with English prompts as well as against the diagnostic performance of ophthalmologists and ophthalmology interns. Their study revealed that, although ChatGPT 3.5 achieved acceptable accuracy with Chinese prompts (F1-score of 70.47%), its diagnostic performance was lower than that achieved with English prompts (80.05%) and lower than that of ophthalmologists (89.35%) and ophthalmology interns (82.69%), suggesting room for improvement in the development of LLMs to address disparities in applicability across settings and populations [10].
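As context for the F1-scores cited above, the F1-score is the harmonic mean of precision and recall. A brief illustrative computation is sketched below; the counts are made up for illustration and are not taken from the cited study.

```python
# Illustrative F1 computation from hypothetical counts (not study data).
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Example: 74 correct diagnoses, 21 incorrect diagnoses issued, 17 cases missed.
print(f"F1 = {f1_score(74, 21, 17):.1%}")  # prints F1 = 79.6%
```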

The group of Chen et al. published two articles describing the use of AI to interpret fluorescein angiography [11] and indocyanine green angiography [12] images and to guide treatment decisions in a two-tier process. Both studies uniquely employed an image-to-text alignment module (the Bootstrapping Language-Image Pre-training, or BLIP, framework) to generate reports, which were then input into the LLM Llama 2 for analysis to answer prompted questions [11, 12]. Ophthalmologists were then asked to evaluate both the reports generated by the BLIP framework and the answers generated by Llama 2, yielding satisfactory quality and agreement at both steps for both fluorescein angiography and indocyanine green angiography. As the applicability of LLMs is limited to text prompts in most studies, the use of BLIP in these studies presents promising potential for the analysis of ophthalmic images. This approach has also been described extensively in previous studies [13, 14].
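The two-tier design described above can be illustrated with a minimal sketch: an image-captioning model first converts an angiography image into text, and that text is then inserted into a prompt for a text-only LLM. The sketch below uses generic public Hugging Face checkpoints purely for illustration; it is not the fine-tuned pipeline used by Chen et al., and the input file name is hypothetical.

```python
# Minimal two-tier sketch: image -> report text -> LLM question answering.
# Uses generic public checkpoints for illustration; not the authors' pipeline.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline

# Tier 1: image-to-text (a BLIP captioning checkpoint stands in for the
# report-generation module described in the studies).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("angiogram.png").convert("RGB")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")
report_ids = captioner.generate(**inputs, max_new_tokens=60)
report_text = processor.decode(report_ids[0], skip_special_tokens=True)

# Tier 2: feed the generated report to a text-only LLM for a clinical question.
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
prompt = (f"Angiography report: {report_text}\n"
          "Question: Is treatment indicated based on this report? Explain briefly.")
print(llm(prompt, max_new_tokens=128)[0]["generated_text"])
```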

A study by Carlà et al. [15] explored the role of three LLMs in analysing retinal detachment cases to guide appropriate surgical management. Across 50 retinal detachment cases, the surgical choices of ChatGPT 3.5, ChatGPT 4, and Google Gemini agreed with those of three expert vitreoretinal surgeons in 80%, 84%, and 70% of cases, respectively. Both ChatGPT models outperformed Gemini in terms of the overall quality of their responses. The authors found that ChatGPT 4 was the only LLM able to suggest combined phacovitrectomy for cases in which the presence of a cataract was highlighted.

Two papers have described the use of LLMs to aid glaucoma specialists in the clinic. Huang et al. [16] explored the potential of ChatGPT to predict the conversion of ocular hypertension to glaucoma using patient data from the Ocular Hypertension Treatment Study [17], and found a higher accuracy of 75% with ChatGPT 4 compared to 61% with ChatGPT 3. This study shows the potential of LLMs to prognosticate disease development when provided with properly structured input data. A paper by Carlà et al. [18] described the use of ChatGPT 4 and Google Gemini to analyse glaucoma case descriptions and generate an appropriate surgical plan. In this study, ChatGPT 4 achieved greater consistency with senior glaucoma specialists (58%) than Gemini (32%) and better response quality than the latter (p = 0.002).

One paper assessed the performance of LLMs in diagnosing six uveitis cases against that of uveitis specialists. This study by Rojas-Carabali et al. [19] showed a diagnostic success rate of 66% using ChatGPT versus only 33% using Glass. It is important to note that in this study the uveitis specialists provided the cases and were able to evaluate images, whereas the LLMs received only text prompts from the clinicians. This suggests that the current LLMs' inability to assess images likely contributes, at least in part, to the limited accuracy of their responses.

A study by Cirkovic and Katz [20] aimed to validate ChatGPT 4's capability to suggest refractive surgery options using data from 100 patients. The authors found significant agreement between human raters and ChatGPT 4, which was stronger when the options were narrowed from six categories to two (laser refractive surgery versus other options). This suggests that LLMs perform better in simple binary clinical scenarios, such as screening and triage.

Another study, by Ali et al. [21], found that ChatGPT 3.5 provided correct responses only 40% of the time in the context of lacrimal drainage disorders. Additionally, ChatGPT produced factual inaccuracies and non-evidence-based recommendations, such as suggesting silicone intubation and mitomycin C in certain cases. This highlights that constant validation and confirmation are essential to ensure the reliability and accuracy of chatbot responses.

In summary, numerous studies have explored the vast potential of LLMs to assist ophthalmologists in the clinical workflow, which may improve clinical efficiency, diagnostic accuracy, and even prognostication. Although these studies were widely heterogeneous in population and methodology, most compared the accuracy of LLM responses against those of clinicians, whether trainees or consultant experts. The results consistently show a wide range of accuracy across LLMs, generally falling short of clinician performance. Other limitations identified include the inability of most LLMs to process visual prompts or non-English languages, the reliance of some studies on AI-generated clinical scenarios, and the lack of standardized, objective measures for assessing clinical accuracy.

Patient education

The literature review yielded 13 original articles describing the role of LLMs in patient education (Table 2). The articles covered various ophthalmology specialties, with general ophthalmology and retina being the most frequently discussed. Two articles focused on glaucoma, and one article each covered cataract, uveitis, cornea, oculoplastic and reconstructive surgery, and a combination of external disease and cornea. Thirteen LLMs were described: ChatGPT 3.5 and 4, Bing, Google Bard (recently rebranded Gemini; Google, California, USA), Claude 2 and Claude-instant-v1.0 (Anthropic), Google Assistant (Google, California, USA), Alexa (Amazon, Seattle, USA), BLOOMZ (BigScience), Command (Cohere), Xiaoqing, HuaTuo, and Ivy GPT.

Table 2 Original research articles on the use of large language models for patient education.

The advent of LLMs continues to reshape patient education in medicine. Numerous LLMs, such as ChatGPT, Bing, and Bard, among others, have created a new landscape for patient learning that is accessible, affordable, and available [22]. These tools are especially valuable in geographically remote and disadvantaged areas where access to healthcare remains a major challenge for many patients.

In ophthalmology, the performance of different LLMs has been assessed in response to common patient queries regarding ocular symptoms. ChatGPT 4 performed best at 89.2%, compared with ChatGPT 3.5 and Google Bard [23]. Another study found that ChatGPT 4 produced appropriate responses to patient information questions across ocular specialties in 79% of cases overall, with the highest appropriateness in ocular oncology (100%) and the lowest in uveitis (55%) [24]. Similarly, facts related to endothelial keratoplasty and Fuchs dystrophy were assessed in collaboration with corneal specialists, and ChatGPT 4 provided correct responses 89% of the time [25]. The readability of patient-targeted information is also crucial when assessing patient education. The generation of general uveitis information was assessed by comparing Google Bard with ChatGPT 4; the latter had a significantly better readability score, as evidenced by a lower Flesch-Kincaid Grade Level (FKGL) and fewer complex words, with easy-to-understand passages [26]. Another study, by Dihan et al. [27], demonstrated that ChatGPT 4 generated the most readable patient education materials compared to ChatGPT 3.5 and Bard, using Simplified Measure of Gobbledygook (SMOG) and FKGL scores. Such improvements can help patients with low literacy comprehend complex ophthalmology concepts more effectively, which can ultimately promote engagement and treatment adherence.
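For context, the readability indices mentioned above are simple formulas over sentence, word, and syllable counts; FKGL, for instance, maps a text onto an approximate US school grade level. A minimal sketch with a crude syllable heuristic is shown below for illustration only; published studies generally rely on validated readability tools rather than this simplification.

```python
# Minimal FKGL sketch with a crude syllable heuristic (illustrative only).
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of vowels, ignore a trailing silent 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fkgl(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid Grade Level formula.
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = "Uveitis is inflammation inside the eye. It can blur vision and cause pain."
print(f"Approximate grade level: {fkgl(sample):.1f}")
```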

The accuracy of information is a crucial aspect of patient education, fostering understanding and awareness of conditions that ultimately affect patient decisions regarding treatment options. For patients with age-related macular degeneration (AMD), ChatGPT 3.5 consistently offered the most accurate and satisfactory responses to general AMD questions and questions about intravitreal injections, compared with Bing and Bard [28].

Although LLMs provide promising assistance and results, certain limitations exist that require appropriate discretion. A study by Wu et al. [29] examined LLM responses to questions about "floaters" and showed that certain LLMs required a higher reading level than AAO.org, a website curated by ophthalmologists worldwide to educate the public about ocular disease. The LLMs also failed to convey a sense of urgency to see a physician for floaters, a symptom that can herald retinal detachment.

LLMs offer valuable tools for improving patient education across various ophthalmology specialties. In most studies, ChatGPT 4 consistently outperformed its predecessors and other models in generating correct and accurate responses on different ocular conditions. However, it is essential to address inaccuracies and potential bias carefully to prevent misinformation that could cause more harm than good. In summary, LLMs show potential for integration into patient education, with a strong emphasis on constant validation and improvement of the models to better educate patients about their ocular conditions.

Medical education

The literature search yielded 17 original articles describing the utility of LLMs in medical education (Table 3). Most articles (13) in this category addressed general ophthalmology. One article each covered paediatric ophthalmology, uveitis, glaucoma, and low vision. Eight LLMs were described: ChatGPT 3.5, 4, and Plus; Bard; Legacy; Bing; LLaMA; and PaLM 2 (Google LLC).

Table 3 Original research articles on the use of large language models for medical education.

LLMs were generally used and tested for their performance on multiple-choice board certification examinations, mainly from the United States, where they consistently answered at least 55% of questions correctly [30,31,32,33,34]. Bard answered 62.4% of ophthalmology board exam practice questions correctly [33]. ChatGPT 4 achieved 62.9% correct answers on ophthalmology qualifying examinations (Ophthalmic Knowledge Assessment Program - OKAP, American Board of Ophthalmology - ABO, United States Medical Licensure Examination - USMLE) [34]. Legacy achieved 55.8% on the Basic and Clinical Science Course (BCSC) [32]. In the Part 1 Fellowship of the Royal College of Ophthalmologists (FRCOphth) Multiple Choice Question (MCQ) examination, ChatGPT 4 outperformed the historical averages of past candidates [35]. For the Part 2 FRCOphth MCQ examination, ChatGPT 4 achieved a score of 69%, surpassing ChatGPT 3.5 (48%), LLaMA (32%), and PaLM 2 (56%); its performance was comparable to that of expert ophthalmologists (median score: 76%) and exceeded that of ophthalmology trainees (median score: 59%) [36]. For the French-language version of the European Board of Ophthalmology (EBO) examination, ChatGPT 4 achieved a 91.2% success rate [37]. However, on the Japanese-language ophthalmology board examinations of the Japanese Ophthalmology Society, ChatGPT 3.5 and ChatGPT 4 answered only 22.4% and 45.8% of questions correctly, respectively, which was attributed to ChatGPT's inability to access specialized literature databases [38]. In direct comparisons, ChatGPT 4 consistently outperformed ChatGPT 3.5 by more than 15% [34, 39,40,41].

In paediatric ophthalmology, ChatGPT 4, ChatGPT 3.5, and Bard achieved 80.6%, 61.3%, and 54.8% accuracy, respectively, in answering myopia-related questions [42]. ChatGPT 3.5 provided relatively high-accuracy responses to various questions related to uveitis [43]. The accuracy of ChatGPT 3.5 in diagnosing glaucoma cases was 72.7%, exceeding that of one of three ophthalmology residents [44]. In low vision, ChatGPT 4 and ChatGPT 3.5 achieved 82.4% and 65.9%, respectively, on self-assessment tests [45]. Enhanced prompting strategies can improve LLM performance in complex clinical scenarios [30]. As demonstrated by Aeyeconsult, using ophthalmology textbooks as the source of reliable medical knowledge improves the accuracy of ChatGPT 4 [46].

LLMs offer a wide range of applications in ophthalmology. ChatGPT 4 consistently performed at or above the level of human examinees, demonstrating its potential impact on medical education. Despite their good performance on multiple-choice board certification examinations, LLMs have a propensity to hallucinate, generating information that is not present in their training data, and do not cite their sources, making it difficult for users to verify the information provided [31, 42, 46]. LLMs are also unable to interpret images or figures, perform less reliably with non-English prompts, and may miss data that is unavailable on the internet [37]. Another limitation identified by the present review is that studies exploring the use of LLMs in medical education describe only the LLMs' ability to answer examination questions. While an important aspect of medical education, examination questions do not encompass the practical and clinical metrics by which medical students are also evaluated. The authors believe that LLMs are useful as complements to medical education, but caution should be exercised when clinical judgment is involved. Further studies on developing open-access LLMs trained on accurate and reliable ophthalmology data are recommended.

Research

Our search yielded two relevant articles on the utility of LLMs in research (Table 4). The study by Raja et al. [47] demonstrated the potential of LLMs to categorize and analyse scientific literature, showing that accurate and rapid classification of scientific papers can enhance research productivity by enabling efficient information retrieval and better knowledge organization. Another study, by Mohammadi and Nguyen, explored the use of LLMs in pre-processing fundus images for machine learning and in analysing data using programming languages (R and Python). A remarkable outcome of this study is that, even without coding experience, users may be aided by LLMs (in this case, ChatGPT 4) in analysing images. Compared with the other potential uses of LLMs described above, studies on their use in research may be limited by multiple factors. Ethical considerations regarding their use have been carefully discussed; editors of reputable ophthalmology journals have expressed concern regarding authorship and the quality of data provided by LLMs [48,49,50]. Because the training data of some LLMs is limited (for example, ChatGPT 3.5's dataset extends only to September 2021), newer and more relevant information can be missed; for instance, medications to slow the progression of advanced dry age-related macular degeneration were approved in 2023 [51, 52]. Other LLMs can retrieve up-to-date information through web access, although some are paywall-restricted, which promotes inequity for those in low-resource settings. In addition, their propensity to hallucinate by providing fabricated references should be considered [53]. The use of LLMs in research should be investigated further, as its potential can have a considerable impact on the research community.
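As an illustration of the kind of code an LLM can draft for a non-programmer in this setting, the sketch below shows a generic fundus-image preprocessing step (resize and normalize) in Python. It is a hypothetical example, not the pipeline from the cited study, and the folder name is an assumption.

```python
# Generic fundus-image preprocessing sketch (hypothetical; not the cited study's code).
from pathlib import Path
import numpy as np
from PIL import Image

def preprocess_fundus(path: Path, size: tuple[int, int] = (224, 224)) -> np.ndarray:
    """Load a fundus photograph, resize it, and scale pixel values to [0, 1]."""
    image = Image.open(path).convert("RGB").resize(size)
    return np.asarray(image, dtype=np.float32) / 255.0

# Stack all images in a folder into an array ready for a machine-learning model.
folder = Path("fundus_images")  # hypothetical folder of .png photographs
batch = np.stack([preprocess_fundus(p) for p in sorted(folder.glob("*.png"))])
print(batch.shape)  # (n_images, 224, 224, 3)
```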

Table 4 Original research articles on the use of large language models for research.

Ethical considerations, potential drawbacks, and recommendations

While the promise of LLMs in ophthalmology continues to grow, certain cautions must be addressed. Challenges such as hallucinations, inaccuracies, and potential biases could limit their use and lead to harmful outcomes. In addition, cloud-based LLMs introduce significant privacy risks owing to the potential exposure of sensitive patient data during processing and storage.

To help mitigate these limitations, several methodological advancements have been introduced, including chain-of-thought (CoT) prompting, which breaks down complex queries into intermediate steps; fine-tuning (FT), which trains models on domain-specific datasets; reinforcement learning from human feedback (RLHF), which aligns outputs with human preferences; and retrieval-augmented generation (RAG), which retrieves information from external sources [54,55,56,57]. When deployed locally with privacy-preserving models such as LLaMA, these strategies not only enhance performance but also support data privacy by removing reliance on the cloud [58].
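Of these strategies, retrieval-augmented generation is particularly relevant to keeping answers grounded in vetted ophthalmology sources. The sketch below illustrates the basic retrieve-then-prompt pattern under stated assumptions: `embed()` and `generate()` are hypothetical stand-ins for whatever embedding model and locally deployed LLM an implementation would actually use.

```python
# Minimal retrieve-then-prompt (RAG) sketch. `embed` and `generate` are
# hypothetical placeholders for an embedding model and a locally hosted LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector representation of `text`."""
    raise NotImplementedError  # e.g. a sentence-embedding model would go here

def generate(prompt: str) -> str:
    """Placeholder: return a completion from a locally deployed LLM."""
    raise NotImplementedError

def answer_with_rag(question: str, corpus: list[str], k: int = 3) -> str:
    # 1. Embed the curated ophthalmology passages and the question.
    doc_vecs = np.stack([embed(doc) for doc in corpus])
    q_vec = embed(question)
    # 2. Rank passages by cosine similarity and keep the top k.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(corpus[i] for i in np.argsort(sims)[::-1][:k])
    # 3. Prompt the LLM with the retrieved context so the answer stays grounded.
    prompt = (f"Answer using only the context below. If unsure, say so.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)
```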

In clinical settings, such use of LLMs raises safety concerns, as errors may lead to inappropriate patient management and misinformation [59]. Given the dynamic nature of these models, continuous research and verification are essential to safeguard accuracy and reliability in clinical and educational settings [60]. To mitigate bias and improve inclusion, training datasets should include population-specific data, especially from historically underrepresented groups [61,62,63]. Furthermore, establishing global and institutional guidelines on the proper use of LLMs in ophthalmology will help promote best practices while ensuring alignment with ethical standards. Recognizing this, the American Academy of Ophthalmology, through its AI Task Force Committee, provides tools and guidance on how to engage with LLMs responsibly [64]. Regular integration of the latest research and clinical evidence into these models will enhance their relevance, ensuring that their outputs remain consistent with the current standard of care.

To ensure safe and responsible use, LLM applications require careful adherence to the four ethical principles of medicine: beneficence, nonmaleficence, justice, and autonomy [65]. As the evaluation of LLMs in ophthalmology currently remains limited to controlled settings, further model refinement, particularly through alignment with real-time clinical data and information, is essential before they can be considered for clinical trials.

Conclusion

LLMs show great promise to enhance and transform the healthcare system. Here, we reviewed and highlighted original research studies on LLM applications in ophthalmology, emphasizing their roles in clinical assistance, patient education, medical education, and research. Integrating LLMs into practice could enhance workflow efficiency and diagnostic capabilities, improve patient access to information, complement medical learning, and boost research productivity. Targeted monitoring of model training, continuous validation, and ongoing assessment, together with strict adherence to ethical standards, are needed to ensure that these LLM-based tools benefit everyone.

Summary

What is known about this topic

  • Large language models have increasingly been adopted and studied across various fields of medicine, including ophthalmology, since the public release of ChatGPT in 2022.

What this study adds

  • This review summarizes the results of all available original articles examining the use of large language models in ophthalmology, and classifies these into four major uses: clinical assistance, patient education, medical education, and research.