Comparative analysis of generic vision-language models in detecting and diagnosing inherited retinal diseases using fundus photographs

Abstract

Background

To evaluate the clinical applicability of three generic Vision-Large-Language Models (VLLMs), namely OpenAI’s GPT-4omni (GPT-4o) and GPT-4V(ision) and Google’s Gemini, in detecting and diagnosing inherited retinal diseases (IRDs) using fundus photographs.

Methods

This head-to-head comparative study curated 60 ultra-widefield (UWF) fundus images from 30 IRD patients at the National University Hospital, Singapore. Ten normal, open-source UWF fundus images were included for comparison. The three VLLMs analysed all 70 fundus images using standardised prompts to generate descriptions of ten specified retinal features and to provide clinical insights. Three blinded consultant-level graders rated each description on a three-point scale (0 = poor, 1 = borderline, 2 = good), giving each VLLM 2100 scores (70 images × 10 features × 3 graders). Clinical insights, including disease detection, diagnosis and pathological gene inference, were evaluated against the clinical ground truth.
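The scoring design described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the constant names and the `mean_quality` helper are hypothetical, and only the counts (60 + 10 images, ten features, three graders) and the 0/1/2 scale come from the text.

```python
# Study design constants taken from the Methods paragraph.
N_IRD_IMAGES = 60      # UWF images from 30 IRD patients
N_NORMAL_IMAGES = 10   # open-source normal UWF images
N_FEATURES = 10        # retinal features described per image
N_GRADERS = 3          # blinded consultant-level graders

n_images = N_IRD_IMAGES + N_NORMAL_IMAGES
scores_per_model = n_images * N_FEATURES * N_GRADERS


def mean_quality(scores):
    """Mean of ratings on the three-point scale (hypothetical helper).

    Raises ValueError for an empty list or a rating other than 0, 1 or 2.
    """
    if not scores:
        raise ValueError("no ratings supplied")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("ratings must be 0 (poor), 1 (borderline) or 2 (good)")
    return sum(scores) / len(scores)


print(n_images, scores_per_model)  # 70 2100
```

With real grader data, `mean_quality` would be applied to the 2100 ratings collected for each model.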

Results

GPT-4o achieved the highest mean quality score for feature description (1.64 [0.697], mean [SEM]), outperforming GPT-4V (1.57 [0.738]) and Gemini (1.46 [0.800]; both p < 0.001). All models demonstrated high detection accuracy (≥81.4%), but Gemini incorrectly classified all normal fundus images as IRD. GPT-4o (65.7%) outperformed GPT-4V (50%) and Gemini (60%) in diagnostic accuracy. Gene inference precision remained low (≤20.3%) across all models. Concordance was high across all models, both between feature descriptions and diagnoses (≥97.1%) and between diagnoses and clinical recommendations (100%).
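As a rough sanity check, the reported percentages can be converted back into image counts. This sketch assumes that every accuracy figure is computed over all 70 images, which the abstract does not state explicitly; the helper names are hypothetical and the resulting counts are inferred, not reported by the authors.

```python
N_IMAGES = 70   # 60 IRD + 10 normal images, from the Methods
N_NORMAL = 10   # normal images, all misclassified by Gemini


def implied_count(percent, n=N_IMAGES):
    """Nearest whole number of images matching a reported percentage."""
    return round(percent / 100 * n)


def as_percent(count, n=N_IMAGES):
    """Percentage of n images, rounded to one decimal place."""
    return round(100 * count / n, 1)


# Reported diagnostic accuracy per model (percent).
diagnosis = {"GPT-4o": 65.7, "GPT-4V": 50.0, "Gemini": 60.0}
counts = {model: implied_count(p) for model, p in diagnosis.items()}

# If all normal images are misclassified, detection accuracy cannot
# exceed the share of IRD images (under the 70-image assumption).
gemini_ceiling = as_percent(N_IMAGES - N_NORMAL)

print(counts)
print(gemini_ceiling)
```

Under this assumption, each reported percentage maps to a whole number of images, which is consistent with a 70-image denominator but does not prove it.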

Conclusions

GPT-4o and GPT-4V demonstrated promising potential in detecting IRDs from fundus photographs, with good feature extraction capabilities and high detection accuracy. Gemini struggled with normal fundus images, misclassifying them as IRD. All three VLLMs require further refinement to improve diagnostic accuracy and gene inference.

Fig. 1: Case examples of fundus images for six IRD-related genes curated in the study.
Fig. 2: Flowchart of the overall study design.
Fig. 3: The models’ mean quality scores, representing feature description capability.
Fig. 4: The models’ performance in classifying fundus images.
Fig. 5: Heatmap of predicted genes for GPT-4o, GPT-4V and Gemini.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Funding

This work was supported by grants from the National Medical Research Council, Singapore (MOH-CSASI22jul-0001; to CYC). XM acknowledges the support of the China Scholarship Council program (project ID: 202306010300).

Author information

Contributions

Conception and design of the study (YCT and CYC); Acquisition, analysis, and interpretation of data (XM, KP, HTL and WMW); Drafting the work (XM); Revising the work (XM, CCX, MW, HM and LPY); Consultant-grade evaluation for clinical diagnostic assessment (WMW, LJC and LPC); Supervision (HWC, YCT and CYC); Final approval of the version to be published (all authors).

Corresponding authors

Correspondence to Yih-Chung Tham or Ching-Yu Cheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Meng, X., Wong, W.M., Pushpanathan, K. et al. Comparative analysis of generic vision-language models in detecting and diagnosing inherited retinal diseases using fundus photographs. Eye 39, 3187–3194 (2025). https://doi.org/10.1038/s41433-025-04013-8

  • DOI: https://doi.org/10.1038/s41433-025-04013-8
