Introduction

Pathological changes in the retina can cause severe vision loss or even blindness, and the accurate diagnosis of retinal lesions is extremely important1. However, the number of ophthalmologists in China is severely mismatched with the size of the population, especially in many rural and underdeveloped towns. The shortage of ophthalmologists, coupled with outdated equipment, makes it difficult to diagnose retinal diseases and some pathological changes.

The implementation of artificial intelligence (AI)2 in applications such as medical imaging, automated diagnosis, and robotic surgery has become increasingly popular in the modern medical industry3,4,5,6. Recent studies have shown that deep learning (DL) models can detect and diagnose various retinal diseases by interpreting ocular data from different diagnostic modalities, including fundus photographs, optical coherence tomography (OCT), and visual fields7. The integration of DL algorithms into ophthalmology could play a crucial role in the screening and diagnosis of eye diseases in the future, especially in areas where ophthalmologists are scarce8.

AI has the potential to revolutionize disease diagnosis and management by performing classification tasks on behalf of human experts and by rapidly reviewing immense numbers of images9. The analysis of fundus photographs based on DL algorithms has been reported for the automated screening and diagnosis of common retinal diseases, including diabetic retinopathy (DR)3,10,11, age-related macular degeneration (AMD)12,13,14,15, retinopathy of prematurity (ROP)16,17, glaucoma18,19,20, papilledema21, and anaemia22. Choi et al.23 applied a deep convolutional neural network for the automated detection of twelve retinal diseases using a small database of fundus photographs, but their model could detect fewer than 10 retinal diseases. Kim et al.24 conducted a secondary diagnosis of normal conditions and eight retinal diseases using AI tools based on fundus photographs. Cen et al.25 established a two-tier grading system for 39 diseases and conditions using various fundus image datasets. All of the methods mentioned above relied on fundus photographs.

Recently, the analysis of OCT images based on DL algorithms has been reported for the automated screening and diagnosis of retinal diseases. Compared with 2D fundus photographs, OCT provides 3D imaging of the retinal structure, which can increase the sensitivity of detecting retinal diseases at an early stage26. Kermany et al.27 developed an effective transfer learning algorithm for retinal disease diagnosis based on OCT images from 4686 patients across multiple centers; however, their system is limited to detecting only three retinal diseases. Hassan et al.28 presented a novel incremental cross-domain adaptation (cross-DA) approach that enables classification networks to retain their prior knowledge while incrementally learning new cross-modality retinopathy screening tasks, achieving the diagnosis of five retinal diseases on OCT and six on fundus photographs. However, the diagnostic accuracy for some retinal diseases with limited samples remains relatively low. In this study, we developed an AI-assisted diagnosis system based on extremely large-scale multicenter OCT data, which can detect 15 retinal anomalies (including retinal diseases and some pathological changes) and achieves an average performance level comparable to that of attending ophthalmologists.

The analysis of big retinal disease data can elucidate complex interrelationships among multiple disease factors, thereby enhancing our understanding of disease onset and progression. Healthcare data analytics has the potential to bring dramatic changes to the healthcare industry by streamlining processes and improving the quality of care29. Dias et al.30 emphasized the importance of collecting real-time data for effective medical decision support. Chen et al.31 investigated the national burden of eye diseases in China from 1990 to 2019, utilizing statistical data from the Global Health Data Exchange32; yet their study covered only six eye diseases. In this study, we developed a cloud platform, a big data center, and an AI-assisted diagnosis algorithm for retinal disease diagnosis. The system has been deployed in 207 medical institutions across 29 provinces in China. This widespread adoption underscores its significant clinical value and highlights our focus on AI-based diagnosis of retinal diseases in real clinical environments.

In summary, this study developed an AI-empowered cloud platform, AI-PORAS, which can detect 15 retinal anomalies in OCT images with an average accuracy of 93.16% on a testing dataset of 165,384 eyes with 3,551,959 OCT B-scan slices. The proposed AI diagnostic model (OCT-AI), based on a domain adaptation method, accommodates the real-world evolution of retinal anomaly categories and the variability of OCT images across different OCT scanners. AI-PORAS has been adapted to a commercial OCT scanner, the BV1000, and has been deployed in 207 medical institutions across 29 provinces in China, including remote areas where resources are limited. The deployed system had remotely diagnosed 116,717 patients for 15 retinal anomalies at the time of submission. In addition, extensive statistical analysis was conducted on the diagnosed patients in terms of age, gender, geographical location within China, and disease growth patterns, with the goal of identifying potential factors contributing to retinal anomalies and providing valuable decision-making guidance for government regulatory bodies. To the best of our knowledge, this is the first large-scale real-world study to use AI systems to simultaneously localize, classify, and statistically analyze 15 retinal anomalies in OCT images. It is also the most extensively deployed AI cloud system for OCT-based retinal disease screening, serving 207 medical institutions in 29 provinces in China. Moreover, efforts are underway to promote the deployment of AI-PORAS in more medical institutions worldwide, benefiting people across various countries and regions.

Results

Localization of retinal anomalies on the source domain testing dataset

The OCT-AI model was trained for retinal disease localization on the source domain dataset. Figure 1 shows four example OCT B-scan slices with bounding boxes placed by the OCT-AI model for the identified diseases. Mean intersection over union (IoU) values achieved by OCT-AI and by the different groups of ophthalmologists are shown in Fig. 2. The OCT-AI model achieved an average IoU of 0.44 across all 14 diseases, which was better than the performance of the attending and resident ophthalmologist groups, each of which obtained an average IoU of 0.42. The OCT-AI model outperformed or was comparable to both ophthalmologist groups in localizing 9 of the 14 anomalies and was inferior in localizing the remaining 5. All IoU values were calculated against the ground truth manually annotated by the chief ophthalmologist group. On average, OCT-AI achieved the highest IoU of 0.44, indicating that the diseased regions identified by the OCT-AI model were close to the chief ophthalmologists' annotations.
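For reference, the IoU between a predicted bounding box and a ground-truth annotation can be computed as follows. This is an illustrative sketch (not the study's implementation), assuming boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 indicates perfect overlap and 0.0 no overlap; the reported averages of 0.42–0.44 correspond to moderate but consistent agreement with the chief ophthalmologists' boxes.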

Fig. 1: Four example results of retinal anomaly detection.

Left: Annotations by the chief ophthalmologist group (ground truth). Right: Diseased regions identified by OCT-AI model. a PVD, HE, CME, ERM, and Drusen; b PVD and Drusen; c PVD, HE, and SRF; d FTMH. Red: PVD. Dark Green: ERM; Bright Green: Drusen; Blue: HE; Dark Yellow: SRF; Bright Yellow: CME; Magenta: FTMH; Cyan: IS/OS. The values represent the predicted probability of retinal anomalies.

Fig. 2: Comparison of OCT-AI performance with ophthalmologists on the source domain dataset.

a Left: average IoUs between the ground truth and the bounding boxes localized by OCT-AI and the two ophthalmologist groups on the source domain testing dataset. b Right: ROC curves and AUC values in the four-fold cross-validation on the source domain training dataset.

Performance on the source domain testing dataset

As shown in Fig. 3, OCT-AI trained on the source domain training dataset achieved the best average accuracy, precision, F1 score, FNR, and Kappa, and a slightly worse FPR, compared with the attending ophthalmologist group on the testing dataset. ROC curves and AUCs were also calculated for the competing methods, as shown in Fig. 4. The average ROC curve of OCT-AI over all 14 anomalies had an AUC of 0.8559 (Fig. 4a). Individual ROC curves for the 14 anomalies are shown in Fig. 4b. Overall, OCT-AI reached or slightly exceeded the attending ophthalmologists' expertise level in classifying the 14 retinal anomalies and performed much better than the resident ophthalmologists. In the retinal disease localization task (Fig. 2), OCT-AI achieved a slightly better average IoU (0.44) than the attending ophthalmologists (0.42) and the resident ophthalmologists (0.42), indicating that the OCT-AI system can localize the 14 retinal anomalies at an expertise level slightly above that of the attending ophthalmologists. It is worth noting that annotations by ophthalmologists are binary, so only a single point on the ROC curve can be calculated for each ophthalmologist group.

Fig. 3: Localization performance of different cohorts on source-domain test datasets.

Accuracy (a), precision (b), F1 score (c), FPR (d), FNR (e), and Kappa (f) were compared among the OCT-AI, attending ophthalmologist, and resident ophthalmologist cohorts.

Fig. 4: ROC curves and AUC values of the OCT-AI model.

Results of retinal anomaly classification on the source domain testing dataset. a Average performance over the 14 retinal anomalies. b Performance on each retinal disease.

Performances on the target domain adaptation dataset

The target domain adaptation dataset was collected with the commercial BV1000 OCT scanner, whereas the training dataset was collected with two other brands of OCT scanners (Optovue and Heidelberg), whose image characteristics differ from those of the BV1000. The four experiments described in Section OCT-AI Model Development were conducted for comparison, and all experiments used the target domain adaptation testing dataset for testing. As shown in Table 1, BV-DA-CNN achieved the best FNR and AUC. FR-CNN and YOLOv12 are direct applications of OCT-AI to the testing dataset collected by the BV1000 without adaptation, and FR-CNN had the best FPR. However, the FNR, ACC, and AUC scores of FR-CNN were much worse than those of DA-CNN, BV-FR-CNN, and BV-DA-CNN. Although YOLOv12 outperformed FR-CNN in terms of ACC and AUC (94.38% vs. 93.62%, 88.2% vs. 70.68%), FR-CNN performed better on the critical metrics of FNR and FPR. DA-CNN was not as good as BV-FR-CNN and BV-DA-CNN. BV-FR-CNN had the best ACC, and its FNR, FPR, and AUC were slightly worse than those of BV-DA-CNN. BV-DA-CNN achieved the best ROC curve, with an AUC of 0.9406, while FR-CNN was much worse than the other three methods, as indicated by its AUC of 0.7068 (Fig. 5a). The best method, BV-DA-CNN, achieved performance comparable to that on the source domain testing data, indicating that the model adaptation is effective.

Fig. 5: ROC curves and AUC values by various backbones.

a Left: values by FR-CNN, YOLOv12, DA-CNN, BV-FR-CNN, and BV-DA-CNN in the domain adaptation study. b Right: values by BV-FR-CNN and BV-DA-CNN on the external dataset.

Table 1 Results of domain adaptation dataset and external dataset

Performance metrics (AUC) were calculated from 50 repeated random subsamples (each 50% of the target domain adaptation dataset, n = 50) to substantiate the significance of the performance differences. As shown in Supplementary Fig. 1, ANOVA revealed statistically significant differences among the five methods (p < 0.0001). Paired Student's t tests further demonstrated that: 1) DA-CNN significantly outperformed both FR-CNN and YOLOv12 (p < 0.0001); 2) BV-FR-CNN showed a statistically significant improvement over FR-CNN (p < 0.0001); 3) BV-DA-CNN achieved significant gains versus BV-FR-CNN (p < 0.0001).
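The significance-testing procedure above can be sketched in SciPy as follows. The AUC values here are synthetic stand-ins for the 50 subsample measurements (the means and spreads are illustrative assumptions only, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical AUC values from 50 repeated 50% subsamples per method
# (illustrative assumptions only; not the study's measurements).
auc = {
    "FR-CNN":    rng.normal(0.71, 0.005, 50),
    "YOLOv12":   rng.normal(0.88, 0.005, 50),
    "DA-CNN":    rng.normal(0.91, 0.005, 50),
    "BV-FR-CNN": rng.normal(0.93, 0.005, 50),
    "BV-DA-CNN": rng.normal(0.94, 0.005, 50),
}

# One-way ANOVA across the five methods
f_stat, p_anova = stats.f_oneway(*auc.values())

# Paired Student's t tests on the same 50 subsamples
_, p1 = stats.ttest_rel(auc["DA-CNN"], auc["FR-CNN"])        # DA-CNN vs FR-CNN
_, p2 = stats.ttest_rel(auc["BV-FR-CNN"], auc["FR-CNN"])     # BV-FR-CNN vs FR-CNN
_, p3 = stats.ttest_rel(auc["BV-DA-CNN"], auc["BV-FR-CNN"])  # BV-DA-CNN vs BV-FR-CNN
print(f"ANOVA p = {p_anova:.2e}; paired t-test p-values: {p1:.2e}, {p2:.2e}, {p3:.2e}")
```

The paired test (`ttest_rel`) is the appropriate choice here because each method is evaluated on the same 50 subsamples, so per-subsample differences can be compared directly.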

Performances on the external dataset

As of submission, we had diagnosed 116,717 patients with 204,491 eyes using the AI-PORAS platform deployed across 29 provinces and municipalities in China. We utilized 20% of the dataset (39,053 eyes with 861,314 OCT B-scan slices) to retrain the OCT-AI model and the remaining 80% (165,384 eyes with 3,551,959 OCT B-scan slices) for testing. As shown in Table 1 and Fig. 5b, BV-FR-CNN achieved better performance than BV-DA-CNN. While BV-DA-CNN used a source domain dataset for domain adaptation, BV-FR-CNN did not, indicating that extremely large-scale data expansion has a more pronounced impact on model performance than domain adaptation. However, in real-world scenarios, obtaining an extremely large-scale labeled dataset at once is impractical. Therefore, before accumulating a larger-scale target domain dataset, we used the better-performing BV-DA-CNN as the backbone for OCT-AI. In the future, we plan to replace it with BV-FR-CNN, retrained on larger-scale data, which is expected to demonstrate superior performance.

Detailed results on the source domain dataset

The OCT-AI model was trained on the source domain training dataset and tested on the source domain testing dataset for the classification of multiple retinal anomalies; the performances are shown in Fig. 3. The manual annotations made by the chief ophthalmologist group were used as ground truth, and the performance of OCT-AI was compared with that of the attending and resident ophthalmologist groups. Accuracy, precision, F1 score, false positive rate (FPR), false negative rate (FNR), and the Kappa coefficient (Kappa) were used as performance metrics. In summary, the OCT-AI system achieved similar or better performance than the attending ophthalmologist group and was significantly better than the resident ophthalmologist group, especially for diseases with sufficient training data. Detailed results are presented below.
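All six per-disease metrics can be derived from a 2×2 confusion matrix. The sketch below is our illustrative reconstruction (not the study's evaluation code) of how they relate to the raw counts:

```python
def binary_metrics(tp, fp, fn, tn):
    """Per-disease metrics from confusion counts (true/false positives/negatives)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among all detections
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0          # false alarms among disease-free cases
    fnr = fn / (fn + tp) if fn + tp else 0.0          # missed among annotated regions
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_o = accuracy
    p_e = (((tp + fp) / total) * ((tp + fn) / total)
           + ((fn + tn) / total) * ((fp + tn) / total))
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return dict(accuracy=accuracy, precision=precision, f1=f1,
                fpr=fpr, fnr=fnr, kappa=kappa)
```

For example, `binary_metrics(tp=40, fp=10, fn=10, tn=40)` yields accuracy 0.8, precision 0.8, F1 0.8, FPR 0.2, FNR 0.2, and Kappa 0.6.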

OCT-AI achieved an average accuracy of 93.12% in classifying the 14 anomalies (Fig. 3a), better than that achieved by the attending physicians (89.36%) and the resident physicians (87.24%). OCT-AI outperformed the two groups of physicians in classifying 11 out of 14 anomalies and was comparable or slightly worse in classifying three diseases, FTMH, RD and LMH. Accuracies of OCT-AI ranged from a low of 87.13% for Drusen to a high of 97.31% for PED.

Precision for a disease represents the fraction of correctly classified diseased regions among all regions identified for that disease. Figure 3b shows that 82.32% of all regions identified by OCT-AI were correct, averaged across the 14 retinal anomalies, outperforming the attending ophthalmologist group (76.21%) and the resident ophthalmologist group (76.28%). The highest precision achieved by OCT-AI was on RD (100%), and the lowest was on ERM (non-macular), with a precision of 36.65%.

The average F1 score of 0.78 achieved by OCT-AI was much higher than that of the attending physicians (0.58) and the resident physicians (0.48), as shown in Fig. 3c. F1 scores by OCT-AI across the 14 diseases ranged from a low of 0.48 on ERM (non-macular) to a high of 0.91 on both SRF and CME. OCT-AI obtained good F1 scores for the diseases with sufficient training data. For diseases with limited training data, e.g., FTMH (n = 75), RD (n = 79), IS/OS (n = 127), and LMH (n = 37), OCT-AI achieved F1 scores of 0.57, 0.51, 0.54, and 0.51, respectively, which were much lower than those of the attending ophthalmologist group. Overall, OCT-AI outperformed both ophthalmologist groups.

FPRs by OCT-AI on the 14 anomalies ranged from 0% to 11.01%, with the lowest on RD (0%) and the highest on Drusen (11.01%; Fig. 3d). The average FPR of OCT-AI was 3.69%, which was lower than that of the attending physicians (4.04%) but higher than that of the resident physicians (2.86%). The worst FPRs by OCT-AI, the attending physicians, and the resident physicians were 11.01% (Drusen), 11.97% (ERM, non-macular), and 13.63% (ERM), respectively.

FNR for a disease represents the ratio of missed diseased regions among all annotated regions for that disease. Figure 3e shows an average FNR of 21.46% for OCT-AI, which was much better than that of the attending ophthalmologist group (40.79%) and the resident ophthalmologist group (58.00%). OCT-AI performed worst on FTMH (n = 75), RD (n = 79), and IS/OS (n = 127), with FNRs of 57.94%, 65.49%, and 62.94%, respectively, where training data were limited. In contrast, the attending ophthalmologist group performed relatively well on these diseases, achieving FNRs of 2.80%, 7.96%, and 34.01%, respectively, illustrating that human experts drew on clinical experience that was independent of the training dataset. Conversely, human experts performed poorly on CNV, Drusen, and Retinoschisis (FNRs all above 66.00%), while OCT-AI performed much better because these three diseases had much more training data (n = 477, 873, and 462, respectively).

Kappa is a metric for consistency testing; higher values are better. Figure 3f shows a trend similar to that of FNR: for diseases with limited training data, the attending ophthalmologist group performed better than OCT-AI, and vice versa. On average, OCT-AI (0.76) was much better than the attending physicians (0.60) and the resident physicians (0.47). The best Kappa values achieved by OCT-AI were for diseases with relatively more training samples, e.g., SRF (Kappa = 0.88, n = 439), CME (Kappa = 0.85, n = 545), and HE (Kappa = 0.80, n = 656).

Four-fold cross-validation was conducted on the training dataset to assess the proposed OCT-AI model for the classification of retinal anomalies. The training dataset was divided into four parts; the system was trained on three of the four parts and tested on the remaining part. This procedure was repeated four times so that each part was used for testing exactly once. Performance metrics for each fold are listed in Table 2. Receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values are shown in Fig. 2b. All performance metrics across the four folds show strong internal consistency, confirming the reliability of the data annotation process.
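The fold construction can be sketched with scikit-learn as follows. This is an illustrative split over slice indices only; `train_model` and `evaluate` stand for the (hypothetical) training and evaluation routines, which are not reproduced here:

```python
import numpy as np
from sklearn.model_selection import KFold

n_slices = 25978  # size of the source domain training dataset
indices = np.arange(n_slices)

kf = KFold(n_splits=4, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(indices), start=1):
    # model = train_model(indices[train_idx])        # hypothetical training call
    # metrics = evaluate(model, indices[test_idx])   # hypothetical evaluation call
    fold_sizes.append(len(test_idx))
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```

Each slice appears in exactly one test fold, so the four held-out parts together cover the full training dataset.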

Table 2 Results of four-fold cross-validation on the training dataset (Averaged on the 14 retinal anomalies)

Analysis of OCT-AI failure cases

FPR and FNR are critical metrics for evaluating detection model performance. As shown in Supplementary Fig. 2a, b, we used a quantitative approach based on conditional false positive and conditional false negative rates to reveal inter-anomaly misclassification correlations. 1) Conditional false positive – target false negative correlation: for all false-positive eyes of a given conditional anomaly, we calculated the false negative rate of each other target anomaly (Supplementary Fig. 2a). Specifically, all false-positive cases for each anomaly were input to the model and the false negative rates of the other anomalies were estimated (e.g., among all RD false-positive eyes, the PVD false negative rate was 18.86% and the ERM false negative rate was 15.2%). 2) Conditional false negative – target false positive correlation: for all false-negative eyes of a given conditional anomaly, we calculated the false positive rate of each other target anomaly (Supplementary Fig. 2b). Specifically, all false-negative cases for each anomaly were input to the model and the false positive rates of the other anomalies were estimated (e.g., among all RD false-negative cases, the PVD false positive rate was 32.02% and the ERM false positive rate was 13.98%).
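One way to compute such a conditional rate is sketched below. This is our reconstruction of the described procedure under the assumption that per-eye labels are stored as boolean matrices; the function name and data layout are illustrative:

```python
import numpy as np

def conditional_fn_given_fp(y_true, y_pred, cond, target):
    """Among eyes falsely predicted positive for anomaly `cond`, the false
    negative rate of anomaly `target` (present in ground truth but missed).
    y_true, y_pred: boolean arrays of shape (n_eyes, n_anomalies)."""
    # Select eyes that are false positives for the conditional anomaly
    fp_cond = y_pred[:, cond] & ~y_true[:, cond]
    sub_true, sub_pred = y_true[fp_cond], y_pred[fp_cond]
    # Among those eyes, find where the target anomaly is truly present
    present = sub_true[:, target]
    if present.sum() == 0:
        return 0.0
    # ...and count how often the model missed it
    missed = present & ~sub_pred[:, target]
    return float(missed.sum() / present.sum())
```

The symmetric quantity (conditional false negative – target false positive) follows the same pattern with the roles of `y_true` and `y_pred` swapped when selecting the conditioning eyes.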

The analysis revealed that: 1) no false-negative anomalies occurred in CE false-positive eyes, and the other false-positive anomalies exhibited a consistent impact on the CE false negative rate; 2) among all false-positive eyes, PVD had a large impact on the false negative rate of EZ (31.1%), and PED influenced the false negative rate of macular hole (MH) (30.0%); 3) in the false-negative cases of each anomaly, no false-positive CE was observed; 4) in PED false-negative cases, the false positive rates of SRF, HE, and CNV reached 74.5%, 64.7%, and 61.9%, respectively. These findings indicate that CE exhibits minimal involvement in the misdiagnosis or missed detection of other anomalies, whereas the missed detection of PED serves as a major confounding factor contributing to false positives in the detection of other retinal anomalies.

Supplementary Fig. 2c–e show misclassification cases by the OCT-AI model (green bounding boxes: ground truth; yellow bounding boxes: BV-DA-CNN predictions), in which RD was incorrectly classified as SRF. In these slices, owing to the large area involved in the detachment, it is challenging to distinguish whether the lesion is RD or SRF. Notably, such ambiguous cases also challenge clinical diagnosis, highlighting the need to enhance the model's sensitivity to subtle morphological distinctions. Supplementary Fig. 2e involves a slice with RD in which the image quality is poor. Owing to the overall low brightness (possibly caused by scanning parameters or patient cooperation issues), the detached neurosensory layer was not clearly visualized, and consequently the model failed to detect the abnormality. This result highlights the model's limitations on low-quality OCT images, which aligns with the practical difficulties faced by clinicians.

These representative case analyses reveal potential directions for model improvement in complex clinical scenarios. In the future, we will explore multi-scale feature fusion strategies based on medical foundation models to enhance the model’s ability to detect small or blurred lesions, as well as advanced image enhancement techniques to improve the model’s robustness to low-quality images.

Discussion

Millions of people worldwide are affected by retinal diseases such as diabetic retinopathy (DR)33, age-related macular degeneration (AMD)14, glaucoma34, and retinal detachment (RD). Without accurate diagnosis and timely, appropriate treatment, these retinal diseases can lead to irreversible blurred vision, metamorphopsia, visual field defects, or even blindness. However, rural and remote areas, especially in developing countries, suffer from a severe scarcity of medical resources. Consequently, patients with retinal diseases may not receive timely treatment in primary healthcare settings owing to a deficiency of ophthalmologists and equipment. In China, there is also a shortage of ophthalmologists. According to China's White Paper on Eye Health, there were 44,800 ophthalmologists nationwide by 2020, equating to 1.6 ophthalmologists per 50,000 people; in remote areas, the figure is even less than 0.6 per 50,000 people35.

In recent years, the analysis of retinal images based on DL algorithms has been reported for the automated screening and diagnosis of retinal diseases. However, most studies have shown that models trained on fundus images can detect only 10 or fewer retinal diseases23,24,36,37,38,39,40, with a few capable of detecting over 1025,41,42,43 and one capable of detecting 39 retinal diseases25, with accuracies between 30.5% and 87.42% and AUC values between 94.1% and 99.84%. The OCT slice image-based studies conducted by Kermany27, Ozdas44, and Tayal45 can detect only 3 or fewer retinal diseases, with accuracies of 96.6%, 95.7%, and 96.5%, respectively, and the study by De Fauw et al.46 can make referral recommendations for 4 retinal diseases. The ultra-widefield fundus (UWF) image-based studies performed by Sun47 and Cui48 can detect 8 and 5 retinal diseases, respectively, reaching the accuracy level of retina specialists. The study by Hassan28 showed that a deep learning algorithm trained on both fundus and OCT modalities was able to detect 6 retinal diseases in OCT and 5 in fundus images, with an average accuracy of 98.26%. Deep learning algorithms trained with OCT images provided higher accuracies than those trained with fundus images but could detect only fewer retinal diseases.

Recent advancements highlight AI's potential in diagnosing retinal diseases such as DR, AMD, and glaucoma49. However, despite deep learning models achieving expert-level performance in retinal fundus image analysis3,50,51, their clinical deployment remains limited by persistent challenges, including data heterogeneity, algorithmic transparency, and regulatory compliance50. Studies in radiology and oncology highlight that AI deployment requires rigorous multi-phase validation (e.g., retrospective testing, prospective trials) to mitigate biases52,53,54, yet these fields face similar barriers in data quality, algorithmic transparency, and ethical concerns. Likewise, ophthalmology urgently requires clinically deployed and validated AI frameworks. Pioneering studies such as De Fauw et al.46 validated the clinical applicability of AI in a tertiary hospital, while commercial solutions such as EyRIS (approved in seven countries) demonstrate partial success in DR/AMD screening55. Critically, their applications remain constrained by narrow disease coverage and localized workflows. Additionally, there are no large cloud services or big data centers powered by AI techniques that are specifically designed for clinical retinal disease diagnosis and analysis.

This paper introduces an Artificial Intelligence Cloud Platform for OCT-based Retinal Disease Screening (AI-PORAS). OCT images contain more information than fundus images, and AI-PORAS can diagnose up to 15 retinal anomalies from them. It streamlines and expedites the remote diagnosis of retinal diseases, leading to enhanced diagnostic accuracy and a reduced workload for ophthalmologists. AI-PORAS also facilitates large-scale data management for advanced data analytics. The platform has successfully diagnosed 116,717 individuals across 29 provinces, municipalities, and autonomous regions in China, and this number continues to grow. Statistical analysis of the patient dataset showed that females exhibit a slightly higher prevalence of retinal diseases, ~3.16% higher than males. The incidence rates of CA, ERM, and EZ continued to increase in 2023, a trend that warrants long-term attention. Additionally, a higher incidence rate of retinal diseases was observed in central China, while southern China exhibited the lowest incidence rates. Details of the statistical analysis of the correlations of retinal disease with gender (Supplementary Fig. 6a and Supplementary Tables 4, 5), age (Supplementary Figs. 6b, 7 and Supplementary Tables 6–8), trends (Supplementary Fig. 8 and Supplementary Table 9), and region (Supplementary Fig. 9 and Supplementary Tables 10–13) are presented in Statistical Analysis on the External Dataset in the Supplementary Information. AI-PORAS has demonstrated significant clinical value, is being adopted by an increasing number of healthcare institutions, and has high potential for further promotion to more countries and regions.

Different types of OCT scanners are available on the market, and the resulting OCT images may have different characteristics that could affect the performance of OCT-AI models. Additionally, AI system development typically spans several years, and the understanding of retinal diseases is constantly updated with new research findings; therefore, the categories of diseases automatically identified by the model must be adjusted accordingly. For example, our historically collected source domain dataset was obtained from Optovue and Heidelberg OCT scanners and covers 14 retinal anomalies, while AI-PORAS was deployed across China with the self-developed commercial OCT scanner, the BV1000, covering 15 retinal anomalies. Therefore, domain adaptation techniques across different scanners and disease categories are necessary to effectively utilize the historically accumulated source domain dataset from Optovue and Heidelberg for training the deep learning model.

The OCT-AI model, trained on 25,978 OCT B-scan slices from the source domain dataset, performed at a level comparable to attending ophthalmologists. It achieved an IoU of 0.44 for abnormality localization in OCT images and demonstrated an average accuracy of 93.12%, precision of 82.32%, F1 score of 0.78, FPR of 3.69%, FNR of 21.46%, and Kappa of 0.76 across the 14 retinal disease classifications. All metrics except FPR surpassed those of the ophthalmologists. Given the impracticality of acquiring an extremely large-scale dataset at once in a real clinical setting, domain adaptation techniques were employed to assess the generalization capability of OCT-AI. This approach enabled the model to quickly adapt to a new OCT scanner and new disease categories (BV-DA-CNN), achieving an accuracy of 96.62%, an AUC of 94.06%, an FPR of 1.59%, and an FNR of 21.94% on 243,361 OCT B-scan slices in the target domain dataset. Compared with the baseline method (FR-CNN), three metrics (accuracy, AUC, and FNR) represent improvements of 3%, 23.38%, and 45.57%, respectively. Our study demonstrated that extremely large-scale data expansion has a more significant impact on model performance than domain adaptation, as shown in Table 1. On a larger external dataset comprising 4,413,273 OCT B-scan slices, BV-DA-CNN achieved an average accuracy of 93.16%, an AUC of 93.64%, an FPR of 6.82%, and an FNR of 7.87%. Although these scores were slightly lower than those of BV-FR-CNN, which benefits more from extremely large-scale data expansion, BV-DA-CNN remains the optimal choice for early clinical deployment. All data used in this study were de-identified and did not require subject notification.

In summary, reading and diagnosing retinal anomalies in OCT images requires intensive medical training. Many remote regions lack such resources, and even primary hospitals in developed regions are unable to meet the high demand for human expertise given the unprecedented increase in diagnostic imaging volume. AI-PORAS is a potential solution to this challenge, providing remote image reading and interpretation services to hospitals in remote areas. As patient volume continues to grow, analysis of the accumulated data will offer more insightful guidance on these diseases, and the data can be used to continuously improve the performance of AI-PORAS. To the best of our knowledge, this is the first large-scale real-world study to use AI systems to simultaneously localize, classify, and statistically analyze 15 retinal anomalies in OCT images. It is also the most extensively deployed AI cloud system for OCT-based retinal disease screening, serving 207 medical institutions in 29 provinces in China.

Our future work on improving AI-PORAS includes: 1) incorporating a retinal stratification algorithm to conduct a comprehensive thickness analysis of retinal layers across diverse regions, nationalities, and age groups, which could significantly enhance the understanding of eye health and provide valuable guidance and decision-making suggestions for healthcare professionals in delivering more effective and personalized care to patients; 2) integrating patient factors such as gender, region, and age group into the algorithm design to increase the specificity of AI-PORAS; our objective is to validate the efficacy of OCT-AI in a clinical setting that mimics real-world scenarios, which not only enhances trust in AI technology within the medical domain but also furnishes valuable insights and guidance for future clinical applications; and 3) moving beyond the current design, which is a systematic refinement of standard frameworks rather than a fundamental network redesign: future work will explore attention mechanisms, develop new foundation model-based frameworks, and explore hybrid architectures combining Transformers and CNNs to further optimize small-lesion detection performance.

Our study has limitations. First, the current statistical analysis is conducted from four perspectives: region, gender, age, and trends; more subtle and potentially valuable information remains to be explored in depth, such as the underlying correlations between the incidences of certain diseases. Second, the OCT-AI model was trained to detect 15 common retinal anomalies, so quite a few retinal diseases still cannot be handled by our system, partly because these anomalies are rare and there are not enough annotated images for training. Finally, accurately localizing and classifying all diseases in OCT images is challenging, and there are large variations among human experts, which need to be addressed to develop a robust OCT-AI model.

Methods

Study design

We developed the Artificial Intelligence Cloud Platform for OCT-based Retinal Disease Screening (AI-PORAS) for retinal disease diagnosis, as illustrated in Fig. 6. AI-PORAS consists of four key components: a cloud server powered by an in-house AI-assisted diagnosis model known as OCT-AI, an expert team of twelve ophthalmologists, OCT terminals at 207 medical institutions, and the Big Data Analysis Platform (Supplementary Fig. 3). Terminal OCT scanners across China capture images together with essential patient data, including age, gender, and region, and transfer the data to the cloud server for further processing. In the cloud server, OCT-AI first provides auxiliary diagnostic results, which are then transmitted to the expert team for confirmation and possible correction to reach a consensus diagnosis. The diagnostic results are subsequently transmitted back to the terminals for clinical use. Additionally, the Big Data Analysis Platform performs a comprehensive statistical analysis of the patient pool, including disease growth trends, patient ages, genders, and regions.

Fig. 6: Architecture of AI-PORAS.

AI-PORAS consists of four key components: a cloud server powered by an in-house developed AI-assisted diagnosis model known as OCT-AI, an expert team of twelve ophthalmologists, OCT terminals at 207 medical institutions, and the Big Data Analysis Platform.

Prior to the deployment of AI-PORAS, the OCT-AI system was pretrained with the source domain dataset shown in Supplementary Table 1, which consists of 30,716 OCT B-scan slices collected from 4243 eyes at five Chinese hospitals using two OCT scanners (Optovue and Heidelberg). The trained OCT-AI system was then adapted to a commercial OCT scanner, the BV1000 made by Big Vision, using the domain adaptation dataset (243,361 OCT B-scan slices from 2948 eyes) shown in Supplementary Table 2 and a domain adaptation approach. The AI-PORAS platform, equipped with the BV1000 OCT terminal scanner, has been deployed at 207 medical institutions across China and, by the time of manuscript submission, had diagnosed 116,717 patients (204,491 eyes, 4,413,273 OCT B-scan slices), as shown in Supplementary Table 3; we refer to this as the external dataset. Diagnostic results on the source domain, target domain adaptation, and external datasets, as well as statistical analysis results on the external dataset after the platform's deployment, are reported in this study.

Our source domain dataset was collected using commercial OCT scanners (Optovue and Heidelberg). Subsequently, we developed the BV1000 scanner for the remotely deployed AI-PORAS system. However, the BV1000 has characteristics that differ from those of the two commercial OCT scanners, necessitating adaptation of the trained model to the newly developed scanner. Therefore, the BV1000 was used to collect the domain adaptation dataset to adapt AI-PORAS to the BV1000 before its deployment. Domain adaptation methods can minimize image-level shifts, including image style and illumination, as well as instance-level shifts, such as object appearance and size56. This allows full utilization of the historically accumulated source domain dataset collected with the Optovue and Heidelberg scanners, effectively simulating the effect of dataset expansion.

According to a multicenter study covering all 31 provinces of mainland China57, fifteen retinal anomalies are commonly observed both in the general population and in ophthalmology outpatient departments across multiple institutions. The anomalies investigated in this study include Posterior Vitreous Detachment (PVD), Retinal Pigment Epithelial Detachments (PED), Drusen, Choroidal Atrophy, Epiretinal Membrane (ERM), Ellipsoid Zone Disruption (EZ), Hard Exudate (HE), Retinoschisis (RC), Subretinal Fluid (SRF), Cystoid Macular Edema (CME), Retinal Hemorrhage (RH), Choroidal Neovascularization (CNV), Macular Hole (MH), Choroidal Excavation (CE), Retinal Detachment (RD) and Retinal Atrophy (RA). Notably, the included pathologies represent hallmark features of major blinding diseases: Drusen, PED, CNV, and SRF are pathognomonic features of AMD, which affected 196 million people globally in 2020 and is projected to affect 288 million by 204058; CME, SRF, HE, and RH are commonly observed in Diabetic Retinopathy (DR), particularly in Diabetic Macular Edema (DME); and CME, SRF, and RH are also typical manifestations of Retinal Vein Occlusion, the second-leading retinal vascular blinding disease after DR59. This systematic inclusion of high-incidence and high-risk blinding retinal anomalies ensures the clinical generalizability and diagnostic significance of our AI model.

Expert team and data annotation

The OCT-AI model first locates potential abnormal regions and marks a bounding box for each identified region. It then classifies the tissue in each bounding box as one of the retinal anomalies. Thus, a ground truth annotation includes two parts: the location of the bounding box and the disease type of the tissue within it. To balance a comprehensive evaluation of model accuracy with practical deployment efficiency, we designed a specialized fine-grained annotation protocol for the source domain data and implemented a simplified yet efficient annotation workflow for the target domain data.

To precisely annotate the source domain training and testing datasets, 9 clinical ophthalmologists were recruited and divided into three groups by level of experience: chief ophthalmologists, attending ophthalmologists, and resident ophthalmologists. Each group, consisting of a group leader and two members, operated as follows: each member first annotated all the OCT B-scan slices independently; the group leader then reviewed all the annotations, resolved inter-annotator discrepancies, and confirmed the final annotations. The final annotations by the chief ophthalmologist group served as ground truth for computing performance metrics, while those by the other two groups were used in a comparative study. Each OCT B-scan slice was annotated as none, one, or more of the 14 retinal anomalies. Examples of annotated OCT B-scan slices are shown in Supplementary Fig. 4.

To efficiently annotate the target domain adaptation dataset, 3 OCT specialists with experience equivalent to that of the chief ophthalmologist group were recruited. Two group members first manually annotated the entire dataset, including the locations of the bounding boxes and their corresponding disease types. The team leader then reviewed all the annotations, resolved inter-annotator discrepancies, and confirmed the final annotations. The final annotations were used as ground truth for training and domain adaptation. Each OCT B-scan slice was annotated as none, one, or more of the 15 retinal anomalies. It is worth noting that the target domain adaptation dataset was annotated for 15 anomalies, some of which differed from the 14 anomalies annotated for the source domain dataset.

The collection and annotation of the source domain dataset lasted about two years, while the collection and annotation of the target domain adaptation dataset occurred more recently over a relatively short period. During this time, the expert team's understanding of the diseases evolved with data accumulation, leading to annotation differences between the datasets. Accordingly, an AI model based on a domain adaptation method was designed to adapt to such dynamic changes in real clinical scenarios.

Standardization protocols of diverse imaging platforms

Despite variations in hardware configurations among different OCT scanners, including but not limited to the splitting ratio of the fiber optic couplers (e.g., 1/9 or 2/8) and the light source (broadband low-coherence sources for spectral-domain OCT (SD-OCT) versus wavelength-swept lasers for swept-source OCT (SS-OCT)), all systems share the fundamental imaging principle of low-coherence interferometry60. Through complex signal analysis (Fourier or inverse Fourier transform) of the coherent light signals, these devices reconstruct cross-sectional fundus images. This universal physical mechanism ensures consistency in the core imaging process across the major commercial systems (Optovue, Heidelberg, BV1000). Achieving inter-device consistency therefore primarily relies on standardized acquisition protocols. Accordingly, in this study, all OCT imaging platforms were required to ensure the visibility of key fundus structures and to center the imaging on the macula.

In addition, OCT devices from different manufacturers exhibit significant variations in imaging resolution, speckle noise intensity, and other factors, which introduces a “domain shift” to the model (Supplementary Fig. 5). No explicit means were implemented on the imaging platforms themselves to enforce consistency across different devices. Therefore, we designed the BV-DA-CNN network with a fully supervised domain adaptation approach to minimize inter-device domain discrepancies. This enables the model to learn robust features from source devices (e.g., Optovue and Heidelberg) and apply them to target devices (e.g., BV1000) for disease detection.

To further ensure cross-device image consistency, we implemented a standardized preprocessing pipeline compliant with the input requirements of the maskrcnn-benchmark framework61. Specifically, all input images (shorter edge: \(x\), longer edge: \(y\)) underwent uniform resizing: \(x\) was first resized to a predefined minimum dimension \({S}_{\min }\) while maintaining the original aspect ratio. If this operation resulted in \(y\) exceeding a maximum threshold of \({S}_{\max }\) pixels, we instead constrained \(y\) to \({S}_{\max }\) pixels, proportionally adjusting \(x\). This approach preserved the original aspect ratio while ensuring scale consistency of the network inputs, effectively mitigating the impact of variations in acquisition parameters or inherent resolution. More critically, the proposed fully supervised domain adaptation method achieved consistent lesion representation through feature space alignment, effectively reducing the discrepancies between similar lesions captured by different imaging platforms at the feature level.
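The resizing rule above can be sketched as follows. The default values \(S_{\min}=800\) and \(S_{\max}=1333\) are the maskrcnn-benchmark defaults and are assumed here; the text does not state the thresholds actually used.

```python
def resize_dims(x, y, s_min=800, s_max=1333):
    """Compute resized (shorter, longer) edge lengths per the rule in the
    text: scale so the shorter edge x becomes s_min, unless the longer
    edge y would then exceed s_max, in which case scale so that y becomes
    exactly s_max. Aspect ratio is preserved either way. The default
    thresholds are the maskrcnn-benchmark values (an assumption)."""
    scale = s_min / x
    if y * scale > s_max:
        scale = s_max / y
    return round(x * scale), round(y * scale)
```

For example, a 480 x 640 B-scan is scaled up so its shorter edge reaches 800, while a very elongated 400 x 1600 image is instead capped so its longer edge equals 1333.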

OCT-AI model development

The proposed OCT-AI model advances the Faster R-CNN62 with an innovative domain adaptation framework tailored for OCT-based AI diagnostics. Lesions in OCT images typically exhibit multi-layer structures with blurred boundaries and irregular shapes, presenting high diversity. Compared to state-of-the-art detection architectures such as the YOLO series (e.g., YOLOv5/YOLOv8/YOLOv10) and the transformer-based DETR series (e.g., DETR, RT-DETR), Faster R-CNN demonstrates significant advantages in detecting medical lesions due to its RoIAlign mechanism for fine-grained feature extraction62,63. Furthermore, to address domain shifts arising from heterogeneous OCT scanners, we designed a novel domain adaptation framework with four specialized strategies: FR-CNN, DA-CNN, BV-FR-CNN and BV-DA-CNN.

We trained the OCT-AI model for retinal abnormality localization and classification. The model first identifies abnormal regions indicated by bounding boxes and then classifies each region as one of the annotated anomalies using a fully connected (FC) layer, as shown in the green box in Fig. 7. OCT-AI, similar to Faster R-CNN, structurally integrates region proposal, feature extraction, bounding box regression, and classification for fast object detection. The feature extractor can be any image classification backbone, typically pretrained on ImageNet64, such as VGG65, ResNet66, Inception V1-V467, or DenseNet68. We chose ResNet-10166 because its residual architecture makes training very effective.

Fig. 7: Diagram of the domain adaptive faster R-CNN model.

The green box contains the basic Faster R-CNN model, and the blue boxes are the modifications made to the Domain Adaptive Faster R-CNN in BV-DA-CNN. RPN Region proposal network, FC Fully connected, ROI Region of interest.

In our study, the source domain dataset was collected using Optovue and Heidelberg OCT scanners, while the target domain adaptation dataset was collected with the commercial BV1000 scanner, which has different specifications from those of the Optovue and Heidelberg systems. Additionally, it is worth noting that the target domain adaptation dataset was annotated for 15 anomalies, some of which differ from the 14 anomalies annotated in the source domain dataset. The OCT-AI model was designed based on domain adaptation to accommodate changes in the understanding of retinal anomaly categories and the variability of OCT images across different OCT scanners. This approach also enables the OCT-AI model to fully leverage historical source domain data before acquiring larger-scale target domain data. We first trained and tested OCT-AI with the source domain dataset collected by the Optovue and Heidelberg scanners, and then adapted the trained model using the target domain adaptation dataset collected by the BV1000, before deploying it to the real system equipped with the BV1000.

We designed four domain adaptation strategies; all experiments used the domain adaptation dataset for testing. 1) FR-CNN: we trained the Faster R-CNN model with the re-annotated source domain training dataset and then directly applied the trained model to the testing split of the target domain adaptation dataset. 2) DA-CNN: we adapted FR-CNN using the Domain Adaptive Faster R-CNN56 method with the unannotated training images in the domain adaptation dataset, making this method semi-supervised. The Domain Adaptive Faster R-CNN56 model employs two domain-adaptive classification components to align image-level and instance-level feature representations, as shown in the red box of Fig. 7, and uses consistency regularization to improve robustness at both levels. 3) BV-FR-CNN: we trained Faster R-CNN directly with the annotated training images in the target domain adaptation dataset. 4) BV-DA-CNN: we adapted the trained FR-CNN with the annotated training images in the domain adaptation dataset, making this method fully supervised. Notably, in BV-DA-CNN, we made the following modifications to DA-CNN: we (I) added multi-label classification to the image-level classifier to enhance overall anomaly prediction, as shown in the blue box of Fig. 7, (II) changed the instance-level classifier from binary classification to multi-class classification, and (III) removed the consistency regularization of the original Domain Adaptive Faster R-CNN, since both domains had labels for training.
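To make modifications (I)-(III) concrete, the following is a minimal numeric sketch of how the BV-DA-CNN objective could compose its terms: a multi-label image-level loss, a multi-class instance-level loss, and no consistency regularizer. The trade-off weight `lam` and the exact averaging are our assumptions, not details taken from the paper.

```python
import math

def multilabel_bce(probs, labels):
    """Image-level multi-label loss (modification I): one sigmoid output
    per anomaly class, averaged binary cross-entropy."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(probs)

def multiclass_ce(probs, label):
    """Instance-level loss (modification II): multi-class cross-entropy
    over a per-proposal class distribution."""
    return -math.log(probs[label])

def bv_da_cnn_loss(det_loss, img_probs, img_labels,
                   inst_probs, inst_labels, lam=0.1):
    """Hypothetical composition of the BV-DA-CNN objective: detection
    loss plus weighted image- and instance-level terms; the consistency
    regularizer of the original DA Faster R-CNN is dropped (modification
    III). `lam` is an assumed trade-off weight."""
    img_term = multilabel_bce(img_probs, img_labels)
    inst_term = sum(multiclass_ce(p, l)
                    for p, l in zip(inst_probs, inst_labels)) / len(inst_probs)
    return det_loss + lam * (img_term + inst_term)
```

This sketch only illustrates the loss structure; in the actual model, the per-class probabilities would come from the image-level and instance-level classifier heads shown in Fig. 7.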

During the training of the OCT-AI object detection model, we employed a balanced 1:1 sampling ratio when selecting foreground and background proposals in the Region Proposal Network (RPN). The same sampling strategy was applied in the detection head responsible for fine-grained classification to maintain the balance between foreground and background. Although no additional sampling techniques were applied for inter-class lesion balancing, data augmentation techniques such as random rotation, vertical flipping, and horizontal flipping were used to increase the diversity of the data distribution and mitigate overfitting.
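As an illustration, the 1:1 foreground/background proposal sampling can be sketched as below. The fallback behavior when one class is scarce (letting the other class fill the remainder) is a common convention that we assume here; the text does not specify it.

```python
import random

def sample_balanced(fg_idx, bg_idx, batch, rng=random):
    """Sample proposal indices at a 1:1 foreground/background ratio, as
    used in the RPN and detection head. If one side has fewer proposals
    than half the batch, the other side fills the remainder (an assumed
    convention)."""
    n_fg = min(len(fg_idx), batch // 2)
    n_bg = min(len(bg_idx), batch - n_fg)
    return rng.sample(fg_idx, n_fg) + rng.sample(bg_idx, n_bg)
```

With plentiful proposals on both sides, a batch of 8 yields 4 foreground and 4 background samples; with a single foreground proposal, the batch is completed with 7 background samples.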

AI-PORAS system deployment

Retinal anomalies have become more prevalent in recent years, making it challenging to diagnose them relying solely on outpatient doctors, especially in rural areas where medical resources are limited. There is an urgent need for an efficient alternative diagnostic tool that can provide quick screening and early diagnosis of retinal disease to alleviate the uneven distribution of medical resources. For this purpose, we deployed our AI-PORAS platform at 207 medical institutions nationwide in China, through strategic partnerships or public health screenings, to store, view, process, and accelerate the screening and diagnosis of retinal anomalies, as shown in Fig. 6. Given that the BV-DA-CNN model demonstrated the best performance (Table 1) before larger-scale target domain data became available, it was selected as the backbone for the OCT-AI model, and the trained BV-DA-CNN was uploaded to our self-built private cloud. This enables remote users to transmit data through the network and obtain results. After a patient is examined, their clinical information and images are uploaded to the cloud server, which archives the data and requests the deployed OCT-AI model to perform the analysis. Once the analysis is completed, medical personnel log into the platform through a browser to review the original OCT data and the AI analysis results, and then write a report for the patient.

Implementation details

The system was implemented in PyTorch and trained with the Stochastic Gradient Descent (SGD) optimizer for 180,000 iterations on a deep learning server with 8 GeForce GTX 1080Ti Graphics Processing Units (GPUs). The initial learning rate was set to 0.0025 and the weight decay to 0.0001. To improve generalization, the training data were augmented by random flipping, random cropping, and noise addition.
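For reference, a single SGD update with the stated hyperparameters behaves as below. This is a bare update rule, not the training code: momentum and any learning rate schedule are omitted, and the actual training presumably used PyTorch's built-in optimizer.

```python
def sgd_step(params, grads, lr=0.0025, weight_decay=0.0001):
    """One plain SGD update with L2 weight decay, using the reported
    initial learning rate (0.0025) and weight decay (1e-4); momentum
    and scheduling are omitted for brevity."""
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, grads)]
```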

Performance metrics

Intersection over Union (IoU) between the predicted and ground truth bounding boxes was used to assess the performance of disease localization. For retinal disease classification, the false negative rate (FNR, also known as the missed diagnosis rate), false positive rate (FPR, also known as the misdiagnosis rate), accuracy (ACC), precision (PRE), Kappa coefficient, and F1 score were used as performance metrics. In addition, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were calculated for model comparison. These metrics are defined as:

$$FNR=\frac{FN}{FN+TP},FPR=\frac{FP}{FP+TN},ACC=\frac{TN+TP}{TN+TP+FP+FN}$$
(1)
$$PRE=\frac{TP}{TP+FP},F1score=\frac{2\times PRE\times Recall}{PRE+Recall}$$
(2)
$$Kappa=\frac{2\times (TP\times TN-FN\times FP)}{(TP+FP)\times (FP+TN)+(TP+FN)\times (FN+TN)}$$
(3)

where \({Recall}=\frac{{TP}}{{TP}+{FN}}\), and \({TP},{TN},{FP},{FN}\) denote true positive, true negative, false positive, and false negative, respectively.
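The metrics in Eqs. (1)-(3), together with the IoU used for localization, can be computed directly from the confusion-matrix counts and box coordinates. The (x1, y1, x2, y2) corner convention for boxes is an assumption; the paper does not specify its box encoding.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)
    corner coordinates (convention assumed)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def classification_metrics(tp, tn, fp, fn):
    """FNR, FPR, ACC, PRE, F1, and Kappa exactly as in Eqs. (1)-(3)."""
    pre = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FNR": fn / (fn + tp),
        "FPR": fp / (fp + tn),
        "ACC": (tn + tp) / (tn + tp + fp + fn),
        "PRE": pre,
        "F1": 2 * pre * recall / (pre + recall),
        "Kappa": 2 * (tp * tn - fn * fp)
                 / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)),
    }
```

For instance, two unit-overlap 2 x 2 boxes give IoU = 1/7, and counts TP = 40, TN = 50, FP = 10, FN = 0 give ACC = 0.9 and Kappa = 0.8.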