Abstract
The rising prevalence of vision-threatening retinal diseases places a significant burden on global healthcare systems. Although deep learning (DL) techniques offer promising avenues for improving diagnostic efficiency, data scarcity and class imbalance hinder the training of robust diagnostic models, particularly for rare eye diseases. Here, we introduce EyeDiff, a generative foundation model capable of synthesizing lesion-preserving ophthalmic images from textual descriptions. Both objective metrics and expert human evaluations confirmed EyeDiff’s ability to generate high-fidelity images across multiple imaging modalities, accurately reflecting textual descriptions of diverse retinal diseases and lesion types. By augmenting minority classes across 11 globally sourced datasets, EyeDiff consistently boosted diagnostic accuracy for both common and rare eye diseases across different foundation model types, including modality-specific, multimodal and vision-language foundation models trained solely on real data. These results underscore EyeDiff’s potential as a general-purpose text-to-image foundation model, offering a scalable and flexible approach to generating balanced, disease-relevant data for advancing retinal disease diagnosis.
Data availability
The data for model training in the current study are available as open data through the following links: Retina Image Bank (https://imagebank.asrs.org/), EyePACS (https://www.kaggle.com/c/diabetic-retinopathy-detection/data), OCTDL (https://ieee-dataport.org/documents/octdl-optical-coherence-tomography-dataset-image-based-deep-learning-methods), REFUGE (https://bitbucket.org/woalsdnd/refuge/src/master/), ORIGA (https://figshare.com/articles/dataset/Retinal_Fundus_Glaucoma_Image_dataset/24549217?file=43119880), RIM-ONE (https://bit.ly/rim-one-dl-images), DRISHTI (https://www.kaggle.com/datasets/lokeshsaipureddi/drishtigs-retina-dataset-for-onh-segmentation) and GAMMA (https://paperswithcode.com/dataset/gamma-challenge). The data for validation in the current study are available as open data through the following links: IDRID (https://ieee-dataport.org/open-access/indian-diabetic-retinopathy-image-dataset-idrid), MESSIDOR-2 (https://www.adcis.net/en/third-party/messidor2/), APTOS-2019 (https://www.kaggle.com/competitions/aptos2019-blindness-detection/data), PAPILA (https://figshare.com/articles/dataset/PAPILA/14798004/1), Glaucoma Fundus (https://dataverse.harvard.edu/dataset.xhtml?persistentId=https://doi.org/10.7910/DVN/1YRRAC), JSIEC (https://zenodo.org/record/3477553), Retina (https://www.kaggle.com/datasets/jr2ngb/cataractdataset), OCTID (https://borealisdata.ca/dataverse/OCTID) and OCTDL (https://ieee-dataport.org/documents/octdl-optical-coherence-tomography-dataset-image-based-deep-learning-methods).
Code availability
The deep-learning model was developed using PyTorch (http://pytorch.org) and trained on an NVIDIA V100 GPU. The code for deep-learning model development can be accessed at https://github.com/huggingface/diffusers/tree/main/examples/dreambooth.
Acknowledgements
We thank the American Society of Retina Specialists for providing the valuable Retina Image Bank and the InnoHK HKSAR Government for its support. The study was supported by the Start-up Fund for RAPs under the Strategic Hiring Scheme (P0048623) from HKSAR, the Global STEM Professorship Scheme (P0046113), and the Henry G. Leong Endowed Professorship in Elderly Vision Health. The sponsors and funding organizations had no role in the design or conduct of this research.
Author information
Authors and Affiliations
Contributions
D.S. conceived the study. D.S. built the deep learning model. D.S., R.C., and W.Z. conducted the literature search and analyzed the data. R.C. and X.C. completed the human evaluation. W.Z. performed validation of downstream tasks and quantitative evaluation. R.C. wrote the manuscript. R.C., B.L., P.X., S.L., and X.W. organized the figures and tables in this study. M.H. provided the data and facilities. All authors critically revised the manuscript. All authors have read and approved the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, R., Zhang, W., Liu, B. et al. Boosting foundation models for rare eye disease diagnosis via a multimodal text-to-image generative framework. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02560-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-026-02560-2

