Abstract
Since the introduction of ChatGPT in November 2022, large language models (LLMs) have gained widespread adoption among individual consumers and medical practitioners, with a consequent increase in publications describing their utility in healthcare. This review highlights original research articles on how LLMs can be utilized by various stakeholders in ophthalmology through clinical assistance, patient education, medical education, and research. ChatGPT consistently responds with better accuracy and quality than other LLMs across various studies employing different methodologies, with newer iterations offering further advantages. Studies have likewise identified limitations of LLMs, which include hallucination, inability to interpret image-based prompts, and limited performance in non-English languages. As newer and more advanced models with image-processing capabilities are introduced, generative artificial intelligence should be continuously monitored for its implications in eye care.
Introduction
A large language model (LLM) is a type of generative artificial intelligence (AI) that uses deep neural networks and large text-based training datasets derived from articles, books, and webpage content comprising billions of words [1]. LLMs are an advanced application of natural language processing (NLP), the field concerned with the interaction between computers and human language, and are capable of learning and discerning contextual information [2]. These models can perform as conversational agents, useful for day-to-day activities. Public interest in these models surged when ChatGPT (OpenAI, San Francisco, CA, USA), trained on over 300 billion words, was launched in November 2022, providing chatbot conversations in a user-friendly interface [3]. Since then, other generative language models have been introduced and have become widely available to the public. The widespread adoption of these models has been evident not only among individual consumers but also in biomedical research. Growing interest from clinicians and investigators in the utility of LLMs is reflected in the rising number of studies, particularly on their use in ophthalmology. This review aims to report published original research data regarding the utility of LLMs, emphasizing four scopes of application: clinical assistance, medical education, patient education, and research.
Methods
The study examined published articles from peer-reviewed journals. Articles were identified from the PubMed online database on 6 June 2024 using the following search query: (large language model* OR LLM OR LLMs* OR GPT) AND (ophthalmology* OR ophthalmic* OR eye). All identified articles were categorized by article type, and only peer-reviewed original research articles were included in the review. The search query yielded 165 articles, 46 of which were identified as original research articles (Fig. 1). The University of California San Diego Library provided access to full-text articles. Included articles were parsed and analysed according to the following: journal, LLM (and model version, if available) used, the purpose of the study, LLM application, specialty/subspecialty discussed, and important findings and implications discovered.
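For readers who wish to reproduce a comparable search programmatically, a minimal sketch using Biopython's Entrez wrapper for the NCBI E-utilities is shown below; the email address and result cap are placeholders, and the query string mirrors the one described above.

```python
# Minimal sketch of running the PubMed search programmatically with Biopython.
# The email address and retmax cap are placeholders; the query mirrors the
# search string described in the Methods.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # required by NCBI; placeholder

query = ('(large language model* OR LLM OR LLMs* OR GPT) '
         'AND (ophthalmology* OR ophthalmic* OR eye)')

# Search PubMed and retrieve matching PMIDs
handle = Entrez.esearch(db="pubmed", term=query, retmax=500)
record = Entrez.read(handle)
handle.close()

pmids = record["IdList"]
print(f"{record['Count']} records found; first PMIDs: {pmids[:5]}")
```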
Applications of LLM
The applications of LLMs in ophthalmology offer tremendous potential in assisting clinicians, students, educators, and even patients. Table 1 shows all articles included in the study. Investigators have studied different LLMs and their subsequent versions across most ophthalmology subspecialties. The findings have been insightful, and several carry implications for their use. In the succeeding sections, we describe the articles reviewed, first by the number of articles retrieved, the LLM application and version used, and the specialty/subspecialty involved, followed by the findings reported by the authors.
Clinical assistance
Our literature review yielded 14 original articles describing the utility of LLMs in clinical assistance (Table 1). Of these, three articles discussed its use in general ophthalmology, while eleven were on subspecialty topics (six articles in the field of retina, two in glaucoma, and one each for uveitis, oculoplastics and lacrimal, and refractive surgery). Seven LLMs were described: ChatGPT 3.5 and 4, Bing (Microsoft Corporation, Washington, USA), Glass 1.0 (Glass Health, San Francisco, California, USA), Google Gemini (formerly Bard, Google, California, USA), and Llama 2 (Meta AI, New York, USA).
The potential of using LLMs to determine diagnosis [4], triage urgency [4, 5], and prepare operative notes and discharge summaries [6] has been described in simulated general ophthalmology settings. Zandi et al. [4] analysed the ability of ChatGPT 4 and Bard to diagnose 40 common ophthalmic conditions and to assign urgency of seeking care using common ophthalmic complaints framed from a patient's perspective. They found that both chatbots were significantly better at ophthalmic triage (assessing urgency) than at identifying the correct leading diagnosis, with ChatGPT 4 performing better in terms of appropriateness of triage recommendations, grader satisfaction for patient use, and lower potential harm rates. A study by Lyons et al. [5] similarly examined both diagnosis and triage urgency for 44 clinical case vignettes representing common ophthalmic complaints. Their study found that both LLMs were able to include the correct diagnosis among the top three differentials in most cases and to determine triage urgency correctly. ChatGPT 4 also performed better than Bing Chat and had diagnostic and triage accuracy comparable to that of ophthalmology trainees, with no grossly inaccurate statements. In the study of Singh et al. [6], responses to 24 prompts to create discharge summaries and operative notes were evaluated by three surgeons across various criteria. They found encouraging results and trainable potential while acknowledging some inaccuracies and generic text responses.
Studies describing the use of LLMs in the field of retina focused on diabetic retinopathy (DR), diabetic macular oedema (DMO), retinal vascular diseases, and retinal detachment. Two studies examined the utility of LLMs in developing recommendations for DR screening and DMO management by comparing their responses with those of clinicians in hypothetical AI-generated case scenarios [7, 8]. In the study of Gopalakrishnan et al. [7], recommendations generated by 5 clinicians and 3 LLMs (ChatGPT 3.5, ChatGPT 4, and Bing) on a three-point multiple-choice question on the urgency of DR screening were compared. The study found fair inter-rater reliability among clinicians (κ = 0.25), among LLMs (κ = 0.29), and between the majority response of clinicians and AI (κ = 0.32). Meanwhile, the study by Choudhary et al. [8] compared the single best management per eye for DMO, and for co-existing ocular comorbidities if present, chosen by 5 clinicians and the same 3 LLMs (ChatGPT 3.5, ChatGPT 4, and Bing). For the management of DMO, the study showed moderate agreement among clinicians (κ = 0.60) and among LLMs (κ = 0.58), and strong agreement between the majority clinician response and the majority AI response (κ = 0.69). For the management of co-existing ocular comorbidities, there was substantial agreement among clinicians (κ = 0.80) but only fair agreement among LLMs (κ = 0.36), with moderate agreement (κ = 0.49) between the majority clinician response and the majority AI response. The studies of Gopalakrishnan and Choudhary show the potential difficulties of both clinicians and LLMs in achieving strong agreement, which may be influenced by the innate complexity of managing diabetic eye diseases. It cannot be discounted, however, that processing AI-generated cases, which may have been included in the LLMs' training data, may have introduced bias into the results.
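The agreement values reported in these studies are κ statistics. As an illustration only, with hypothetical ratings rather than the study data, Cohen's κ between two raters on a three-point urgency scale can be computed as follows:

```python
# Illustrative computation of Cohen's kappa between two raters on a
# three-point urgency scale; the ratings below are hypothetical, not study data.
from sklearn.metrics import cohen_kappa_score

clinician = ["urgent", "routine", "urgent", "semi-urgent", "routine", "urgent"]
llm       = ["urgent", "routine", "semi-urgent", "semi-urgent", "urgent", "urgent"]

kappa = cohen_kappa_score(clinician, llm)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```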
Liu et al. [9] uniquely evaluated the ability of ChatGPT 3.5 to diagnose various retinal vascular diseases from Chinese-language prompts of fluorescein angiography reports, comparing these results against those achieved with English prompts as well as against the diagnostic performance of ophthalmologists and ophthalmology interns. Their study revealed that, although ChatGPT achieved acceptable accuracy with Chinese prompts (F1-score of 70.47%), this was lower than the performance with English prompts (80.05%) and below that of ophthalmologists (89.35%) and ophthalmology interns (82.69%), suggesting that there is room for improvement in the development of LLMs to account for disparities in applicability across other settings and populations [10].
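The F1-score cited here is the harmonic mean of precision and recall; for multi-class diagnostic tasks it is usually averaged across disease categories. A brief sketch with hypothetical labels, not the study data:

```python
# Illustrative multi-class F1 computation with scikit-learn; the labels
# are hypothetical diagnoses, not data from the cited study.
from sklearn.metrics import f1_score

y_true = ["BRVO", "CRVO", "DR", "BRVO", "DR", "CRVO"]   # reference diagnoses
y_pred = ["BRVO", "DR",   "DR", "BRVO", "DR", "CRVO"]   # model diagnoses

# Weighted averaging accounts for class imbalance across disease categories
print(f"F1 = {f1_score(y_true, y_pred, average='weighted'):.2%}")
```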
The group of Chen et al. published two articles describing the use of AI in interpreting fluorescein angiography [11] and indocyanine green angiography [12] images and in applying these interpretations to guide treatment decisions in a two-tier process. Both studies uniquely employed an image-to-text alignment module (the Bootstrapping Language-Image Pre-training, or BLIP, framework) to generate reports, which were then input into the LLM Llama 2 to answer prompted questions [11, 12]. Ophthalmologists were then asked to evaluate both the reports generated by the BLIP framework and the answers generated by Llama 2, yielding satisfactory quality and agreement at both steps for both fluorescein angiography and indocyanine green angiography. As the applicability of LLMs is limited to text prompts in most studies, the use of BLIP in these studies presents promising potential in the analysis of ophthalmic images. This approach has also been described extensively in previous studies [13, 14].
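Conceptually, this two-tier design couples an image-captioning model with a text-only LLM. The sketch below illustrates the idea using publicly available Hugging Face checkpoints and a placeholder image file as stand-ins; it is not the authors' published FFA-GPT or ICGA-GPT implementation.

```python
# Conceptual sketch of a two-tier image-to-text-to-answer pipeline:
# (1) a BLIP captioning model turns an angiography frame into text,
# (2) a text-only LLM answers questions about the generated report.
# The checkpoints, image file, and prompt are stand-ins, not the published pipeline.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.open("ffa_frame.png").convert("RGB")        # placeholder file
inputs = processor(images=image, return_tensors="pt")
report_ids = captioner.generate(**inputs, max_new_tokens=60)
report = processor.decode(report_ids[0], skip_special_tokens=True)

# Tier 2: pass the generated report to a text-only LLM for question answering
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
prompt = (f"Angiography report: {report}\n"
          "Question: Is there evidence of macular leakage?\nAnswer:")
print(llm(prompt, max_new_tokens=80)[0]["generated_text"])
```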
A study by Carlà et al. [15] explored the role of 3 LLMs in analysing retinal detachment cases to provide guidance on appropriate surgical management. In 50 retinal detachment cases, the surgical choices of ChatGPT 3.5, ChatGPT 4, and Google Gemini agreed with those of 3 expert vitreoretinal surgeons in 80%, 84%, and 70% of cases, respectively. Both ChatGPT versions outperformed Gemini in terms of overall response quality. The authors found that ChatGPT 4 was the only LLM able to suggest a combined phacovitrectomy for cases in which the presence of a cataract was highlighted.
Two papers have described the use of LLMs to aid glaucoma specialists in the clinic. Huang et al. [16] explored the potential of ChatGPT in predicting the conversion of ocular hypertension to glaucoma using patient data from the Ocular Hypertension Treatment Study [17], and found a better accuracy rate of 75% using ChatGPT 4 compared with 61% using ChatGPT 3. This study shows the potential of LLMs in prognosticating disease development with properly inputted data. A paper by Carlà et al. [18] described the use of ChatGPT 4 and Google Gemini in analysing glaucoma case descriptions to generate an appropriate surgical plan. In this study, ChatGPT 4 achieved greater consistency with senior glaucoma specialists (58%) than Gemini (32%) and better response quality than the latter (p = 0.002).
One paper assessed the diagnostic performance of LLMs against that of uveitis specialists in the diagnosis of six uveitis cases. This study by Rojas-Carabali et al. [19] showed a diagnostic success rate of 66% using ChatGPT versus only 33% using Glass. It is important to note that in this study, the uveitis specialists provided the cases and were able to evaluate images, whereas the LLMs received only text prompts from the clinicians. This suggests that the current LLMs' inability to assess images likely contributes, at least in part, to the limited accuracy of their responses.
A study by Cirkovic and Katz [20] aimed to validate ChatGPT 4's capability to suggest refractive surgery options using data from 100 patients. The authors found a significant association between human raters and ChatGPT 4, which was stronger when the options were narrowed from six categories to two (laser refractive surgery versus other options). This suggests that LLMs perform better in simple binary clinical scenarios, such as screening and triage.
Another study by Ali et al. [21] found that ChatGPT 3.5 provided correct responses only 40% of the time in the context of lacrimal drainage disorders. Additionally, ChatGPT produced factual inaccuracies and non-evidence-based recommendations, such as suggesting silicone intubation and mitomycin C in certain cases. This highlights that constant validation and confirmation are essential to ensure the reliability and accuracy of the chatbot's responses.
In summary, numerous studies have explored the vast potential of LLMs to assist ophthalmologists in the clinical workflow, which may help improve clinical efficiency, diagnostic accuracy, and even prognostication. Although these studies were widely heterogeneous in terms of populations and methodologies, most compared the accuracy of LLM responses against that of clinicians, whether trainees or consultant experts. The results consistently show a wide range of accuracy across LLMs, generally falling short of clinician performance. Other limitations identified include the inability of current LLMs to process visual prompts and non-English languages, the reliance of some studies on AI-generated clinical scenarios, and the lack of standardized objective measures for assessing clinical accuracy.
Patient education
The literature review yielded 13 original articles describing the role of LLMs in patient education (Table 2). The articles covered various ophthalmology specialties, with general ophthalmology and retina being the most frequently discussed. Two articles focused on glaucoma, and one article each covered cataract, uveitis, cornea, oculoplastic and reconstructive surgery, and a combination of external disease and cornea. Thirteen LLMs were described: ChatGPT 3.5 and 4, Bing, Google Bard (recently rebranded Gemini, Google, California, USA), Claude 2 and Claude-instant-v1.0 (Anthropic), Google Assistant (Google, California, USA), Alexa (Amazon, Seattle, USA), Bloomz (BigScience), Command (Cohere), Xiaoqing, HuaTuo, and Ivy GPT.
The advent of LLMs continues to revolutionize patient education in medicine. Numerous LLMs, such as ChatGPT, Bing, and Bard, among others, have created a new landscape of patient learning that is accessible, affordable, and widely available [22]. These tools are especially valuable in geographically remote and disadvantaged areas, where access to healthcare still poses a major challenge for many patients.
In ophthalmology, the performance of different LLMs was assessed in response to common patient queries regarding ocular symptoms; ChatGPT 4 had superior accuracy at 89.2% compared with ChatGPT 3.5 and Google Bard [23]. Another study found that ChatGPT 4 produced appropriate responses to patient information queries in different ocular subspecialties 79% of the time overall, with the highest appropriateness in ocular oncology (100%) and the lowest in uveitis (55%) [24]. Similarly, responses related to endothelial keratoplasty and Fuchs dystrophy were assessed in collaboration with corneal specialists, and ChatGPT 4 provided correct responses 89% of the time [25]. Readability of patient-targeted information is also crucial in patient education. The generation of general uveitis information was assessed by comparing Google Bard with ChatGPT 4, where the latter had a significantly better readability score, as evidenced by a lower Flesch-Kincaid Grade Level (FKGL) and fewer complex words with easy-to-understand passages [26]. Another study by Dihan et al. [27] demonstrated that ChatGPT 4 generated the most readable patient education materials compared with ChatGPT 3.5 and Bard using Simplified Measure of Gobbledygook (SMOG) and FKGL scores. Such improvements can help patients with poor literacy comprehend complex ophthalmology concepts more effectively, which can ultimately promote engagement and treatment adherence.
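FKGL and SMOG are standard readability formulas based on sentence length and syllable counts. The sketch below, using the open-source textstat package on a placeholder passage, illustrates how such scores are typically obtained when evaluating LLM-generated patient materials.

```python
# Illustrative readability scoring of a patient-education passage using the
# open-source textstat package; the passage is a placeholder, not study material.
import textstat

passage = ("Glaucoma is an eye disease that damages the optic nerve. "
           "Eye drops can lower the pressure inside the eye and protect your sight.")

print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(passage))
# SMOG is formally defined for samples of 30 or more sentences; shown here
# only to illustrate the call.
print("SMOG index:", textstat.smog_index(passage))
```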
Accuracy of information is a crucial aspect of patient education, as it fosters understanding and awareness of conditions and ultimately affects patient decisions regarding treatment options. In patients with age-related macular degeneration (AMD), ChatGPT 3.5 consistently offered the most accurate and satisfactory responses to general AMD questions and questions on intravitreal injections compared with Bing and Bard [28].
Although LLMs provide promising assistance and results, certain limitations exist that require appropriate discretion. A study by Wu et al. [29] examined LLMs' responses to questions about "floaters" and showed that certain LLMs required a higher reading comprehension level than AAO.org, a website curated by ophthalmologists globally to educate the public about ocular disease. The LLMs also failed to convey a sense of urgency to see a physician for new floaters, which can be a warning sign of retinal detachment.
LLMs offer valuable tools for improving patient education across various ophthalmology specialties. In most studies, ChatGPT 4 consistently outperformed its predecessors and other models in generating correct and accurate responses on different ocular conditions. However, it is essential to carefully address inaccuracies and potential bias to prevent risks of misinformation that could cause more harm than good. Overall, integrating LLMs into patient education holds promise, with a strong emphasis on constant validation and improvement of the models to better educate patients on their ocular conditions.
Medical education
The literature search yielded 17 original articles describing the utility of LLMs in medical education (Table 3). Most articles (13) in this category addressed general ophthalmology; one article each was described for paediatric ophthalmology, uveitis, glaucoma, and low vision. Eight LLMs were described: ChatGPT 3.5, 4, and Plus, Bard, Legacy, Bing, LLaMA, and PaLM 2 (Google LLC).
LLMs were generally tested for their performance on multiple-choice board certification examinations, mainly from the United States, where they consistently answered at least 55% of questions correctly [30,31,32,33,34]. Bard answered 62.4% of ophthalmology board exam practice questions correctly [33]. ChatGPT 4 achieved 62.9% correct answers across ophthalmology-related examinations (Ophthalmic Knowledge Assessment Program [OKAP], American Board of Ophthalmology [ABO], and United States Medical Licensing Examination [USMLE]) [34]. Legacy achieved 55.8% on the Basic and Clinical Science Course (BCSC) [32]. In the Part 1 Fellowship of the Royal College of Ophthalmologists (FRCOphth) Multiple Choice Question (MCQ) examination, ChatGPT 4 outperformed the historical averages of past candidates [35]. For the Part 2 FRCOphth MCQ examination, ChatGPT 4 achieved a score of 69%, surpassing ChatGPT 3.5 (48%), LLaMA (32%), and PaLM 2 (56%); its performance was also comparable to that of expert ophthalmologists (median score: 76%) and exceeded that of ophthalmology trainees (median score: 59%) [36]. For the French-language version of the European Board of Ophthalmology (EBO) examination, ChatGPT 4 achieved a 91.2% success rate [37]. However, for the Japanese-language ophthalmology board examinations of the Japanese Ophthalmology Society, ChatGPT 3.5 and ChatGPT 4 answered only 22.4% and 45.8% of questions correctly, respectively, which the authors attributed to ChatGPT's inability to access specialized literature databases [38]. In direct comparisons, ChatGPT 4 consistently outperformed ChatGPT 3.5 by more than 15% [34, 39,40,41].
In paediatric ophthalmology, ChatGPT 4, ChatGPT 3.5, and Bard achieved 80.6%, 61.3%, and 54.8% accuracy, respectively, in answering myopia-related questions [42]. ChatGPT 3.5 was able to provide relatively high-accuracy responses to various questions related to uveitis [43]. The accuracy of ChatGPT 3.5 in diagnosing glaucoma cases was 72.7%, even better than that of 1 of 3 ophthalmology residents [44]. In low vision, GPT 4 and GPT 3.5 achieved 82.4% and 65.9% accuracy, respectively, in answering self-assessment tests [45]. Enhanced prompting strategies can improve LLMs' performance in complex clinical scenarios [30]. As demonstrated by Aeyeconsult, using ophthalmology textbooks as the source of reliable medical knowledge improves the accuracy of ChatGPT 4 [46].
LLMs offer a wide range of applications in ophthalmology education. ChatGPT 4 consistently performed at or above the level of human exam takers, demonstrating its potential impact on medical education. Despite their good performance on multiple-choice board certification examinations, LLMs have a propensity to hallucinate, i.e., generate information that is not present in their training data, and do not cite their sources, making it difficult for users to verify the information provided [31, 42, 46]. LLMs are also unable to interpret images or figures, perform less well with non-English language prompts, and may miss data that is unavailable on the internet [37]. Another limitation identified by the present review is that studies exploring the use of LLMs in medical education only describe the LLMs' ability to answer examination questions. While an important aspect of medical education, examination questions do not encompass the practical and clinical metrics by which medical students are also evaluated. The authors believe that LLMs are useful complements to medical education, but caution should be exercised when clinical judgment is involved. Further studies on developing open-access LLMs trained with accurate and reliable ophthalmology data are recommended.
Research
Our search yielded two relevant articles on the utility of LLMs in research (Table 4). The study by Raja et al. [47] demonstrated the potential use of LLMs in categorizing and analysing scientific literature. The study provided insights into research productivity by enabling accurate and quick classification of scientific papers, which can support efficient information retrieval and better organizational knowledge. Another study by Mohammadi and Nguyen [10] explored the use of LLMs in pre-processing fundus images for use in machine learning and in analysing data using programming languages (R and Python). A remarkable outcome of this study is that LLMs (in this case, ChatGPT 4) may aid users without coding experience in analysing images. Compared with the other potential utilities of LLMs described above, studies on their use in research may be limited by multiple factors. Ethical considerations regarding their use have been carefully discussed: editors of different reputable ophthalmology journals have expressed concern regarding authorship and the data quality of research produced with LLMs [48,49,50]. Because the training data of some LLMs is limited (for example, ChatGPT 3.5's dataset extends only to September 2021), newer and more relevant information can be missed; for example, medications to slow the progression of advanced dry age-related macular degeneration were approved in 2023 [51, 52]. Other LLMs may retrieve up-to-date information through web access, although some are paywall-restricted, which promotes inequity for users in low-resource settings. In addition, the propensity of LLMs to hallucinate by providing made-up references should be considered [53]. The use of LLMs in research should be investigated further, as its potential can have a considerable impact on the research community.
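To illustrate the kind of literature-categorization workflow described by Raja et al., the hypothetical sketch below submits an abstract to a chat-completion endpoint and requests a single category label; the model name, category list, and prompt are assumptions rather than the published method.

```python
# Hypothetical sketch of LLM-assisted categorization of an abstract into a
# predefined topic; the model name, categories, and prompt are assumptions,
# not the pipeline used in the cited study. Requires an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

categories = ["clinical assistance", "patient education",
              "medical education", "research"]
abstract = ("We evaluated a chatbot's answers to frequently asked questions "
            "about cataract surgery ...")  # placeholder abstract

system_msg = ("Classify the ophthalmology abstract into exactly one of these "
              f"categories: {', '.join(categories)}. Reply with the category only.")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute any available chat model
    messages=[{"role": "system", "content": system_msg},
              {"role": "user", "content": abstract}],
)
print(response.choices[0].message.content)
```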
Ethical considerations, potential drawbacks, and recommendations
While the promise of LLMs in ophthalmology continues to grow, certain cautions must be addressed. Challenges such as hallucinations, inaccuracies, and potential biases could limit their use and lead to harmful outcomes. In addition, cloud-based LLMs introduce significant privacy risks owing to the potential exposure of sensitive patient data during processing and storage.
To help mitigate these limitations, several methodological advancements have been introduced, including chain-of-thought (CoT) prompting, which breaks down complex queries into intermediate steps; fine-tuning (FT), which trains models on domain-specific datasets; reinforcement learning from human feedback (RLHF), which aligns outputs with human preferences; and retrieval-augmented generation (RAG), which grounds responses in information retrieved from external sources [54,55,56,57]. When deployed locally with open models such as LLaMA, these strategies not only enhance performance but also support data privacy by eliminating reliance on cloud processing [58].
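To make the RAG concept concrete, the minimal sketch below retrieves the most relevant passage from a small local corpus and builds an augmented prompt that grounds the answer of whichever LLM is subsequently used; the corpus, question, and TF-IDF retriever are simplifying assumptions, as a production system would use a curated ophthalmic knowledge base and dense embeddings.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant passage from a local corpus with TF-IDF, then build an augmented
# prompt for an LLM. The corpus and question are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Acute angle-closure glaucoma presents with pain, halos, and a mid-dilated pupil.",
    "Intravitreal anti-VEGF therapy is first-line for centre-involving diabetic macular oedema.",
    "Posterior vitreous detachment may cause new floaters and photopsia.",
]
question = "What is the first-line treatment for centre-involving diabetic macular oedema?"

vectorizer = TfidfVectorizer().fit(corpus + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(corpus))[0]
best_passage = corpus[scores.argmax()]

# The augmented prompt grounds the model's answer in the retrieved text; it
# would then be passed to whichever (ideally locally hosted) LLM is in use.
prompt = f"Context: {best_passage}\nQuestion: {question}\nAnswer using the context only."
print(prompt)
```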
In clinical settings, such use of LLMs raises safety concerns and may lead to inappropriate patient management and misinformation [59]. Given the dynamic nature of these models, continuous research and verification are essential to safeguard accuracy and reliability in clinical and educational settings [60]. To mitigate bias and improve inclusion, training datasets should include population-specific data, especially from historically underrepresented groups [61,62,63]. Furthermore, establishing global and institutional guidelines on the proper use of LLMs in ophthalmology will help promote best practices while ensuring alignment with ethical standards. Recognizing this, the American Academy of Ophthalmology, through the formation of its AI Task Force Committee, provides tools and guidance on how to engage with LLMs responsibly [64]. Regular integration of the latest and ever-evolving research and clinical evidence into these models will enhance their relevance, ensuring that their outputs remain consistent with the current standard of care.
To ensure safe and responsible use, LLM application necessitates careful adherence to the four ethical principles of medicine: beneficence, nonmaleficence, justice, and autonomy [65]. As the evaluation of LLMs in ophthalmology currently remains limited to controlled settings, further model refinement, particularly through alignment with real-time clinical data and information, is essential before they can be considered for clinical trials.
Conclusion
LLMs show great promise to enhance and transform the healthcare system. Here, we reviewed and highlighted various original research studies on LLM applications in ophthalmology, emphasizing roles in clinical assistance, patient education, medical education, and research. Integrating LLMs into practice could enhance workflow efficiency and diagnostic capabilities, improve patient access to information, complement medical learning, and boost research productivity. Targeted monitoring of model training, continuous validation and assessment, and strict adherence to ethical standards are needed to ensure that these LLM-based tools benefit everyone.
Summary
What is known about this topic
- Large language models have been increasingly adopted and studied across various fields of medicine, including ophthalmology, since the introduction of ChatGPT in 2022.
What this study adds
- This review summarizes the results of all available original articles examining the use of large language models in ophthalmology and classifies them into four major uses: clinical assistance, patient education, medical education, and research.
References
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40. https://doi.org/10.1038/s41591-023-02448-8.
Cascella M, Semeraro F, Montomoli J, Bellini V, Piazza O, Bignami E. The breakthrough of large language models release for medical applications: 1-year timeline and perspectives. J Med Syst. 2024;48:22. https://doi.org/10.1007/s10916-024-02045-3.
Eysenbach G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: a conversation with ChatGPT and a call for papers. JMIR Med Educ. 2023;9:e46885. https://doi.org/10.2196/46885.
Zandi R, Fahey JD, Drakopoulos M, Bryan JM, Dong S, Bryar PJ, et al. Exploring diagnostic precision and triage proficiency: a comparative study of GPT-4 and bard in addressing common ophthalmic complaints. Bioengineering. 2024;11:120. https://doi.org/10.3390/bioengineering11020120.
Lyons RJ, Arepalli SR, Fromal O, Choi JD, Jain N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol. 2024;59:e301–e308. https://doi.org/10.1016/j.jcjo.2023.07.016.
Singh S, Djalilian A, Ali MJ. ChatGPT and ophthalmology: exploring its potential with discharge summaries and operative notes. Semin Ophthalmol. 2023;38:503–7. https://doi.org/10.1080/08820538.2023.2209166.
Gopalakrishnan N, Joshi A, Chhablani J, Yadav NK, Reddy NG, Rani PK, et al. Recommendations for initial diabetic retinopathy screening of diabetic patients using large language model-based artificial intelligence in real-life case scenarios. Int J Retin Vitreous. 2024;10:11. https://doi.org/10.1186/s40942-024-00533-9.
Choudhary A, Gopalakrishnan N, Joshi A, Balakrishnan D, Chhablani J, Yadav NK, et al. Recommendations for diabetic macular edema management by retina specialists and large language model-based artificial intelligence platforms. Int J Retin Vitreous. 2024;10:22. https://doi.org/10.1186/s40942-024-00544-6.
Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, et al. Uncovering language disparity of ChatGPT on retinal vascular disease classification: cross-sectional study. J Med Internet Res. 2024;26:e51926. https://doi.org/10.2196/51926.
Mohammadi SS, Nguyen QD. A user-friendly approach for the diagnosis of diabetic retinopathy using ChatGPT and automated machine learning. Ophthalmol Sci. 2024;4:100495. https://doi.org/10.1016/j.xops.2024.100495.
Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, Shi D, et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digit Med. 2024;7:111. https://doi.org/10.1038/s41746-024-01101-z.
Chen X, Zhang W, Zhao Z, Xu P, Zheng Y, Shi D, et al. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br J Ophthalmol. 2024;108:1450–6. https://doi.org/10.1136/bjo-2023-324446.
Lin Z, Zhang D, Shi D, Xu R, Tao Q, Wu L, et al. Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation. J Biomed Inf. 2023;138:104281. https://doi.org/10.1016/j.jbi.2023.104281.
Chen X, Xu P, Li Y, Zhang W, Song F, He M, et al. ChatFFA: an ophthalmic chat system for unified vision-language understanding and question answering for fundus fluorescein angiography. iScience. 2024;27:110021. https://doi.org/10.1016/j.isci.2024.110021.
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, et al. Exploring AI-chatbots’ capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol. 2024;108:1457–69. https://doi.org/10.1136/bjo-2023-325143.
Huang X, Raja H, Madadi Y, Delsoz M, Poursoroush A, Kahook MY, et al. Predicting glaucoma before onset using a large language model chatbot. Am J Ophthalmol. 2024;266:289–99. https://doi.org/10.1016/j.ajo.2024.05.022.
Kass MA, Heuer DK, Higginbotham EJ, Johnson CA, Keltner JL, Miller JP, et al. The ocular hypertension treatment study: a randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma. Arch Ophthalmol. 2002;120:701–13. https://doi.org/10.1001/archopht.120.6.701.
Carlà MM, Gambini G, Baldascino A, Boselli F, Giannuzzi F, Margollicci F, et al. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol. 2024;262:2945–59. https://doi.org/10.1007/s00417-024-06470-5.
Rojas-Carabali W, Sen A, Agarwal A, Tan G, Cheung CY, Rousselot A, et al. Chatbots Vs. human experts: evaluating diagnostic performance of chatbots in uveitis and the perspectives on AI adoption in ophthalmology. Ocul Immunol Inflamm. 2024;32:1591–8. https://doi.org/10.1080/09273948.2023.2266730.
Ćirković A, Katz T. Exploring the potential of ChatGPT-4 in predicting refractive surgery categorizations: comparative study. JMIR Form Res. 2023;7:e51798. https://doi.org/10.2196/51798.
Ali MJ. ChatGPT and lacrimal drainage disorders: performance and scope of improvement. Ophthalmic Plast Reconstr Surg. 2023;39:221–5. https://doi.org/10.1097/IOP.0000000000002418.
Tailor PD, Dalvin LA, Chen JJ, Iezzi R, Olsen TW, Scruggs BA, et al. A comparative study of responses to retina questions from either experts, expert-edited large language models, or expert-edited large language models alone. Ophthalmol Sci. 2024;4:100485. https://doi.org/10.1016/j.xops.2024.100485.
Pushpanathan K, Lim ZW, Er Yew SM, Chen DZ, Hui’En Lin HA, Lin Goh JH, et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience. 2023;26:108163. https://doi.org/10.1016/j.isci.2023.108163.
Tailor PD, Xu TT, Fortes BH, Iezzi R, Olsen TW, Starr MR, et al. Appropriateness of ophthalmology recommendations from an online chat-based artificial intelligence model. Mayo Clin Proc Digit Health. 2024;2:119–28. https://doi.org/10.1016/j.mcpdig.2024.01.003.
Barclay KS, You JY, Coleman MJ, Mathews PM, Ray VL, Riaz KM, et al. Quality and agreement with scientific consensus of ChatGPT information regarding corneal transplantation and Fuchs dystrophy. Cornea. 2024;43:746–50. https://doi.org/10.1097/ICO.0000000000003439.
Kianian R, Sun D, Crowell EL, Tsui E. The use of large language models to generate education materials about uveitis. Ophthalmol Retin. 2024;8:195–201. https://doi.org/10.1016/j.oret.2023.09.008.
Dihan Q, Chauhan MZ, Eleiwa TK, Hassan AK, Sallam AB, Khouri AS, et al. Using large language models to generate educational materials on childhood glaucoma. Am J Ophthalmol. 2024;265:28–38. https://doi.org/10.1016/j.ajo.2024.04.004.
Ferro Desideri L, Roth J, Zinkernagel M, Anguita R. Application and accuracy of artificial intelligence-derived large language models in patients with age related macular degeneration. Int J Retin Vitreous. 2023;9:71. https://doi.org/10.1186/s40942-023-00511-7.
Wu G, Zhao W, Wong A, Lee DA. Patients with floaters: answers from virtual assistants and large language models. Digit Health. 2024;10:20552076241229933. https://doi.org/10.1177/20552076241229933.
Milad D, Antaki F, Milad J, Farah A, Khairy T, Mikhail D, et al. Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases. Br J Ophthalmol. 2024;108:1398–405. https://doi.org/10.1136/bjo-2023-325053.
Antaki F, Milad D, Chia MA, Giguère CÉ, Touma S, El-Khoury J, et al. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. Br J Ophthalmol. 2024;108:1371–8. https://doi.org/10.1136/bjo-2023-324438.
Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3:100324. https://doi.org/10.1016/j.xops.2023.100324.
Botross M, Mohammadi SO, Montgomery K, Crawford C. Performance of Google’s artificial intelligence chatbot “Bard” (now “Gemini”) on ophthalmology board exam practice questions. Cureus. 2024;16:e57348. https://doi.org/10.7759/cureus.57348.
Haddad F, Saade JS. Performance of ChatGPT on ophthalmology-related questions across various examination levels: observational study. JMIR Med Educ. 2024;10:e50842. https://doi.org/10.2196/50842.
Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2024;108:1379–83. https://doi.org/10.1136/bjo-2023-324091.
Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, Hassan R, et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: a head-to-head cross-sectional study. PLOS Digit Health. 2024;3:e0000341. https://doi.org/10.1371/journal.pdig.0000341.
Panthier C, Gatinel D. Success of ChatGPT, an AI language model, in taking the French language version of the European Board of Ophthalmology examination: a novel approach to medical knowledge assessment. J Fr Ophtalmol. 2023;46:706–11. https://doi.org/10.1016/j.jfo.2023.05.006.
Sakai D, Maeda T, Ozaki A, Kanda GN, Kurimoto Y, Takahashi M. Performance of ChatGPT in board examinations for specialists in the Japanese Ophthalmology Society. Cureus. 2023;15:e49903. https://doi.org/10.7759/cureus.49903.
Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of generative large language models on ophthalmology board-style questions. Am J Ophthalmol. 2023;254:141–9. https://doi.org/10.1016/j.ajo.2023.05.024.
Jiao C, Edupuganti NR, Patel PA, Bui T, Sheth V. Evaluating the artificial intelligence performance growth in ophthalmic knowledge. Cureus. 2023;15:e45700. https://doi.org/10.7759/cureus.45700.
Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial intelligence in ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions. Cureus. 2023;15:e40822. https://doi.org/10.7759/cureus.40822.
Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. https://doi.org/10.1016/j.ebiom.2023.104770.
Marshall RF, Mallem K, Xu H, Thorne J, Burkholder B, Chaon B, et al. Investigating the accuracy and completeness of an artificial intelligence large language model about uveitis: an evaluation of ChatGPT. Ocul Immunol Inflamm. 2024;32:2052–5. https://doi.org/10.1080/09273948.2024.2317417.
Delsoz M, Raja H, Madadi Y, Tang AA, Wirostko BM, Kahook MY, et al. The use of ChatGPT to assist in diagnosing glaucoma based on clinical case reports. Ophthalmol Ther. 2023;12:3121–32. https://doi.org/10.1007/s40123-023-00805-x.
Taloni A, Borselli M, Scarsi V, Rossi C, Coco G, Scorcia V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13:18562. https://doi.org/10.1038/s41598-023-45837-2.
Singer MB, Fu JJ, Chow J, Teng CC. Development and evaluation of aeyeconsult: a novel ophthalmology Chatbot leveraging verified textbook knowledge and GPT-4. J Surg Educ. 2024;81:438–43. https://doi.org/10.1016/j.jsurg.2023.11.019.
Raja H, Munawar A, Mylonas N, Delsoz M, Madadi Y, Elahi M, et al. Automated category and trend analysis of scientific articles on ophthalmology using large language models: development and usability study. JMIR Form Res. 2024;8:e52462. https://doi.org/10.2196/52462.
Dupps WJ Jr. Artificial intelligence and academic publishing. J Cataract Refract Surg. 2023;49:655–6. https://doi.org/10.1097/j.jcrs.0000000000001223.
Van Gelder RN. The pros and cons of artificial intelligence authorship in ophthalmology. Ophthalmology. 2023;130:670–1. https://doi.org/10.1016/j.ophtha.2023.05.018.
Bressler NM. What artificial intelligence chatbots mean for editors, authors, and readers of peer-reviewed ophthalmic literature. JAMA Ophthalmol. 2023;141:514–5. https://doi.org/10.1001/jamaophthalmol.2023.1370.
Apellis Pharmaceuticals. FDA approves Syfovre (pegcetacoplan) injection, the first and only in its class. 2023. Available at: https://investors.apellis.com/news-releases/news-release-details/fda-approves-syfovretm-pegcetacoplan-injection-first-and-only. Accessed August 18, 2024.
EyesOnEyeCare. FDA approves IVERIC bio’s IZERVAY (avacincaptad pegol intravitreal solution) for geographic atrophy. 2023. Available at: https://glance.eyesoneyecare.com/stories/2023-08-07/fda-approves-iveric-bio-s-izervay-for-ga/. Accessed August 18, 2024.
Volpe NJ, Mirza RG. Chatbots, artificial intelligence, and the future of scientific reporting. JAMA Ophthalmol. 2023;141:824–5. https://doi.org/10.1001/jamaophthalmol.2023.3344.
Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv. 2022, https://arxiv.org/abs/2201.11903.
Anisuzzaman DM, Malins JG, Friedman PA, Attia ZI. Fine-tuning large language models for specialized use cases. Mayo Clin Proc Digit Health. 2024;3:100184. https://doi.org/10.1016/j.mcpdig.2024.11.005.
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, et al. Training language models to follow instructions with human feedback. In: Proceedings of the Neural Information Processing Systems (NeurIPS) 2022; 2022. https://doi.org/10.48550/arXiv.2203.02155.
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020); 2020. https://doi.org/10.5555/3495724.3496517.
Nguyen Q, Nguyen DA, Dang K, Liu S, Nguyen K, Wang SY, et al. Advancing question-answering in ophthalmology with retrieval-augmented generation (RAG): Benchmarking open-source and proprietary large language models. J-GLOBAL. 2024. Available from: https://jglobal.jst.go.jp/en/detail?JGLOBAL_ID=202402211872512470.
Chen JS, Reddy AJ, Al-Sharif E, Shoji MK, Kalaw FGP, Eslani M, et al. Analysis of ChatGPT responses to ophthalmic cases: can ChatGPT think like an ophthalmologist. Ophthalmol Sci. 2024;5:100600. https://doi.org/10.1016/j.xops.2024.100600.
Ullah E, Parwani A, Baig MM, Singh R. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology - a recent scoping review. Diagn Pathol. 2024;19:43. https://doi.org/10.1186/s13000-024-01464-7.
Celi LA, Cellini J, Charpignon ML, Dee EC, Dernoncourt F, Eber R, et al. Sources of bias in artificial intelligence that perpetuate healthcare disparities-a global review. PLOS Digit Health. 2022;1:e0000022. https://doi.org/10.1371/journal.pdig.0000022.
Dychiao RGK, Alberto IRI, Artiaga JCM, Salongcay RP, Celi LA. Large language model integration in Philippine ophthalmology: early challenges and steps forward. Lancet Digit Health. 2024;6:e308. https://doi.org/10.1016/S2589-7500(24)00064-5.
Restrepo D, Wu C, Tang Z, Shuai Z, Phan TNM, Ding J-E, et al. Multi-OphthaLingua: a multilingual benchmark for assessing and debiasing LLM ophthalmological QA in LMICs. AAAI. 2025;39:28321–30.
Tom E, Keane PA, Blazes M, Pasquale LR, Chiang MF, Lee AY, et al. Protecting data privacy in the age of AI-enabled ophthalmology. Transl Vis Sci Technol. 2020;9:36. https://doi.org/10.1167/tvst.9.2.36.
Kalaw FGP, Baxter SL. Ethical considerations for large language models in ophthalmology. Curr Opin Ophthalmol. 2024;35:438–46. https://doi.org/10.1097/ICU.0000000000001083.
Bernstein IA, Zhang YV, Govil D, Majid I, Chang RT, Sun Y, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6:e2330320. https://doi.org/10.1001/jamanetworkopen.2023.30320.
Cohen SA, Brant A, Fisher AC, Pershing S, Do D, Pan C. Dr. Google vs. Dr. ChatGPT: exploring the use of artificial intelligence in ophthalmology by comparing the accuracy, safety, and readability of responses to frequently asked patient questions regarding cataracts and cataract surgery. Semin Ophthalmol. 2024;39:472–9. https://doi.org/10.1080/08820538.2024.2326058.
Wilhelm TI, Roos J, Kaczmarczyk R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J Med Internet Res. 2023;25:e49324. https://doi.org/10.2196/49324.
Xue X, Zhang D, Sun C, Shi Y, Wang R, Tan T, et al. Xiaoqing: A Q&A model for glaucoma based on LLMs. Comput Biol Med. 2024;174:108399. https://doi.org/10.1016/j.compbiomed.2024.108399.
Biswas S, Logan NS, Davies LN, Sheppard AL, Wolffsohn JS. Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol Opt. 2023;43:1562–70. https://doi.org/10.1111/opo.13207.
Funding
FGPK - National Institutes of Health Bridge2AI (AI-READI Salutogenesis Grand Challenge) Grant OT2OD032644.
Author information
Contributions
JCMA, MCBG, GMNS, FGPK - designed the study, acquired, parsed, and interpreted the data, drafted and revised the manuscript, and approved the final version of the manuscript. APA, IDN – designed the study, acquired the data, and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Artiaga, J.C.M., Guevarra, M.C.B., Sosuan, G.M.N. et al. Large language models in ophthalmology: a scoping review on their utility for clinicians, researchers, patients, and educators. Eye 39, 2752–2761 (2025). https://doi.org/10.1038/s41433-025-03935-7