Abstract
The rising prevalence of vision-threatening retinal diseases places a significant burden on global healthcare systems. Although deep learning (DL) techniques offer promising avenues for improving diagnostic efficiency, data scarcity and class imbalance hinder the training of robust diagnostic models, particularly for rare eye diseases. Here, we introduce EyeDiff, a generative foundation model capable of synthesizing lesion-preserving ophthalmic images from textual descriptions. Both objective metrics and expert human evaluations confirmed EyeDiff’s ability to generate high-fidelity images across multiple imaging modalities, accurately reflecting textual descriptions of diverse retinal diseases and lesion types. By augmenting minority classes across 11 globally sourced datasets, EyeDiff consistently boosted diagnostic accuracy for both common and rare eye diseases across different foundation model types, including modality-specific, multimodal and vision-language foundation models trained solely on real data. These results underscore EyeDiff’s potential as a general-purpose text-to-image foundation model, offering a scalable and flexible approach to generating balanced, disease-relevant data for advancing retinal disease diagnosis.
Data availability
The data for model training in the current study are available as open data through the following links: Retina Image Bank (https://imagebank.asrs.org/), EyePACS (https://www.kaggle.com/c/diabetic-retinopathy-detection/data), OCTDL (https://ieee-dataport.org/documents/octdl-optical-coherence-tomography-dataset-image-based-deep-learning-methods), REFUGE (https://bitbucket.org/woalsdnd/refuge/src/master/), ORIGA (https://figshare.com/articles/dataset/Retinal_Fundus_Glaucoma_Image_dataset/24549217?file=43119880), RIM-ONE (https://bit.ly/rim-one-dl-images), DRISHTI (https://www.kaggle.com/datasets/lokeshsaipureddi/drishtigs-retina-dataset-for-onh-segmentation) and GAMMA (https://paperswithcode.com/dataset/gamma-challenge). The data for validation in the current study are available as open data through the following links: IDRID (https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid), MESSIDOR-2 (https://www.adcis.net/en/third-party/messidor2/), APTOS-2019 (https://www.kaggle.com/competitions/aptos2019-blindness-detection/data), PAPILA (https://figshare.com/articles/dataset/PAPILA/14798004/1), Glaucoma Fundus (https://dataverse.harvard.edu/dataset.xhtml?persistentId=https://doi.org/10.7910/DVN/1YRRAC), JSIEC (https://zenodo.org/record/3477553), Retina (https://www.kaggle.com/datasets/jr2ngb/cataractdataset), OCTID (https://borealisdata.ca/dataverse/OCTID) and OCTDL (https://ieee-dataport.org/documents/octdl-optical-coherence-tomography-dataset-image-based-deep-learning-methods).
Code availability
The deep-learning model was developed using PyTorch (http://pytorch.org) and trained on an NVIDIA V100 GPU. The code for deep-learning model development can be accessed at https://github.com/huggingface/diffusers/tree/main/examples/dreambooth.
Acknowledgements
We thank the American Society of Retina Specialists for providing the valuable Retina Image Bank and the InnoHK HKSAR Government for its support. The study was supported by the Start-up Fund for RAPs under the Strategic Hiring Scheme (P0048623) from HKSAR, the Global STEM Professorship Scheme (P0046113), and the Henry G. Leong Endowed Professorship in Elderly Vision Health. The sponsors and funding organizations had no role in the design or conduct of this research.
Author information
Authors and Affiliations
Contributions
D.S. conceived the study. D.S. built the deep learning model. D.S., R.C., and W.Z. conducted the literature search and analyzed the data. R.C. and X.C. completed the human evaluation. W.Z. performed validation of downstream tasks and quantitative evaluation. R.C. wrote the manuscript. R.C., B.L., P.X., S.L., and X.W. organized the figures and tables in this study. M.H. provided the data and facilities. All authors critically revised the manuscript. All authors have read and approved the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, R., Zhang, W., Liu, B. et al. Boosting foundation models for rare eye disease diagnosis via a multimodal text-to-image generative framework. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02560-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-026-02560-2

