Introduction

The cornea is the transparent, dome-shaped structure at the front of the eye that plays a crucial role in focusing light onto the retina1,2. Corneal epithelial staining is a significant indicator of epithelial defects in various ocular surface diseases3,4,5 (e.g., dry eye (DE), neurotrophic keratopathy, and various types of keratitis) and certain systemic conditions (e.g., diabetes-related corneal damage)6,7. These conditions substantially increase the risk of visual impairment and can negatively impact overall quality of life7,8. Fluorescein dye staining allows for the visualization of corneal epithelial lesions9 associated with epithelial erosions or defects in the epithelial cell barrier10, making it an important diagnostic and assessment tool for evaluating the severity of ocular surface diseases11. However, current staining assessment relies on manual grading, which is time-consuming and labor-intensive. Moreover, subjective variation arising from physicians' experience and individual perspectives hampers large-scale data analysis and reduces data comparability between centers. While computer-assisted quantification systems have emerged to enhance objective evaluation12,13,14,15,16,17,18, their diagnostic accuracy, sensitivity, and clinical utility warrant further validation.

Prior approaches have introduced image processing and traditional machine learning techniques into corneal staining evaluation, including thresholding16, edge detection15,17, and random forests18. However, these methods rely on pre-selected and typically fixed thresholds and hyperparameters; they are challenged, for example, by confluent staining19 (a stained area comprising multiple punctate lesions with indistinct borders) when counting punctate staining, and show limited performance on images with large individual variability. Compared with feature-based machine learning, deep learning (DL) enables end-to-end prediction20 with superior feature discrimination and state-of-the-art performance. Several existing digital methods have introduced DL into corneal staining evaluation. Deng et al. introduced a dataset for the classification of corneal ulcers and the segmentation of flaky ulcers12. Building upon this work, Wang et al. proposed a framework for the segmentation of patchy ulcers using adjacent scale fusion and position embedding21. Wang et al. combined patchy ulcer segmentation with staining grading, employing a Transformer-based network for corneal staining evaluation with an additional patchy ulcer segmentation branch13. Qu et al. designed a DL grading system for epithelial erosions using ResNet3414, but it provided no interpretability.

However, all of the aforementioned AI-based corneal staining methods encounter several limitations in clinical application. Firstly, while DL excels in detecting well-defined patchy lesions13, it underperforms in identifying punctate lesions13. Punctate lesions appear upon staining as scattered dots or confluent areas confined to the epithelium; they are often associated with DE, are frequently found surrounding the primary site of most corneal ulcers in relation to inflammatory attack and epithelial repair, indicate disease progression and prognosis22, and occur commonly in most ocular surface diseases6,8,23,24. However, annotating these lesions is costly due to their small and discrete nature25, and they are often accompanied by "noise", e.g., dye residue, which poses a significant obstacle to the training of existing AI models13,26. Numerous approaches have been developed to tackle the issue of learning from noisy labels, including adversarial training27, Bayesian learning28, and knowledge distillation29; however, no such study has applied noisy-label learning to quantitative analysis of corneal staining. Secondly, current discrete classification-based AI-assisted staining evaluation systems risk plateau effects, where the result may remain within the same clinical grading range even when the number of stained points decreases substantially16. This limitation could misrepresent treatment response in clinical practice. Lastly, there is an urgent need for real-world evaluation of corneal staining AI systems to facilitate precise evaluation of disease severity across medical institutions and provide accurate endpoints for large-scale, multi-center clinical trials of ocular surface disease medications.

Our study aims to address these challenges in the evaluation of corneal punctate staining. Firstly, we proposed a deep learning framework for general corneal punctate staining grading and introduced a fine-grained corneal staining scoring model using knowledge distillation (FKD-CSS), which features dual decoder branches for CSS grading and staining segmentation. This framework not only optimizes grading performance but also provides interpretability through staining-area visualization, enhancing clinical applicability and accessibility. Secondly, this study focuses on DE owing to its high prevalence (5% to 50% globally22,30) and its characteristic epithelial punctate staining31. Additionally, corneal staining is a recognized standard for diagnosis and severity assessment in this population32,33 and is an FDA-approved efficacy endpoint11. Therefore, DE datasets prospectively recruited under multi-center, real-world clinical conditions were used as the primary training and validation data for the FKD-CSS model, contributing to the robustness and representativeness of our study. Furthermore, an independent external test set of multiple ocular surface abnormalities was also used to validate the model's generalizability to different staining patterns.

Results

Datasets characteristics

The images of the development, internal validation, and primary external test datasets were mainly collected prospectively from a 3-month follow-up dataset of DE patients from 37 tertiary hospitals in seven regions of China (Fig. 1), where the model is most likely to be adopted. After the model's automatic quality-control step, we retained 1471 of the 1477 corneal fluorescein staining images from Eastern China in the DE dataset for training and 5-fold cross-validation. These images came from 14 tertiary hospitals in seven provinces and municipalities (Supplementary Tables 2-4, 12). The remaining images from the six other regions of China (Central, South, North, Northeast, Northwest, and Southwest) were used for multi-center external testing of the model in the DE scenario, yielding 2376 of 2386 images from 23 hospitals in 15 provinces and municipalities (Supplementary Tables 4-7, 12). Additionally, 231 images of 18 common ocular surface diseases other than DE (e.g., Acanthamoeba keratitis, viral keratitis, neuroparalytic keratitis), showing punctate and patchy lesions, were collected from Zhongshan Ophthalmic Center and annotated for external testing of the generalization capability of FKD-CSS in disease staging (Supplementary Table 9). The assignment of datasets is shown in Supplementary Fig. 1, and their detailed distribution is described in the "Datasets" subsection of the "Methods" section.

Fig. 1: The distribution and collection of the datasets encompasses the entirety of China.
figure 1

DE dry eye, MOSD multiple ocular surface diseases.

The images were captured with various commonly used clinical cameras (Supplementary Table 10), following standard practice for corneal staining photography (Supplementary Table 1). The images were labeled with the CSS34 (0-4), which has been used as the primary efficacy assessment in various clinical trials34,35,36. Using a triple-read and arbitration method (Supplementary Fig. 2), labels were required to be accurate to 0.5 points, resulting in 9 classes ranging from 0 to 4 in 0.5 intervals as the ground truth of the CSS. To facilitate comparison of model performance with the commonly used clinical classification of corneal staining, the CSS (score \(x\)) was also converted into discrete categories of ocular epithelium damage34,35,37,38: Staining Negative or Without Clinical Significance (SNWC), \(x < 0.5\); Mild, \(0.5\le x < 1.5\), indicating mild damage; Moderate, \(1.5\le x < 2.5\), indicating mild inflammation and worsening defects; Severe, \(2.5\le x < 3.5\), indicating chronic inflammation and extensive damage; and Critical, \(x\ge 3.5\), indicating severe inflammation or significant epithelial damage. Additionally, senior ophthalmologists coarsely annotated 100 images randomly selected from the development datasets, providing their best clinical judgment of the CSS based on all morphological features of the lesions within the Region of Interest (ROI); these annotations allow the FKD-CSS grading framework to transfer prior knowledge about the correlation between staining grades and lesion locations to a larger and more general dataset through knowledge distillation. In addition, 80 annotated images from each of the two external test sets were used to validate staining detection in the DE and multi-disease scenarios. Notably, assessment of the DE dataset focused primarily on the lower third of the cornea rather than the complete cornea, as this region most commonly presents the highest degree of corneal staining in symptomatic DE patients39,40, whereas evaluation covered the entire corneal area in the dataset of multiple ocular surface diseases (MOSD). For details, see the "corneal staining test" section in Methods.
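
For reference, this score-to-category conversion reduces to a simple thresholding rule. The following Python sketch mirrors the cut-offs listed above; the function name is illustrative and not part of the study's code.

```python
# Minimal sketch of the CSS-to-severity mapping described above; the function
# name and structure are illustrative, not taken from the study's code base.
def css_to_category(score: float) -> str:
    """Map a continuous CSS (0-4) to the discrete severity categories."""
    if score < 0.5:
        return "SNWC"      # staining negative or without clinical significance
    if score < 1.5:
        return "Mild"
    if score < 2.5:
        return "Moderate"
    if score < 3.5:
        return "Severe"
    return "Critical"

# Example: a score of 1.7 falls in the Moderate band.
assert css_to_category(1.7) == "Moderate"
```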

System architecture

We developed an advanced automatic CSS grading system that utilizes fine-grained staining information to output continuous and consistent scores. As illustrated in Fig. 2, the framework’s training pipeline encompasses three key stages: ROI cropping, segmentation teacher network training, and CSS grading network training. In the inference phase, the system initially detects the ROI areas in the cornea before directly outputting the CSS score. Notably, due to the incorporation of fine-grained knowledge distillation in the grading network’s training, the system is adept at identifying crucial staining areas that significantly influence CSS grading. A comprehensive explanation of this process is detailed in the Methods: “Algorithm Construction of FKD-CSS” section.

Fig. 2: The proposed FKD-CSS framework consists of four components.
figure 2

(1) a corneal segmentation network to extract the region of interest and perform quality control, (2) a weakly-supervised teacher segmentation network to guide the grading model towards fine-grained lesions, (3) a novel prototype memory bank to cluster similar semantic information across different CSS grades, and (4) an interpretable grading model to predict the CSS and disease severity. ROI region of interest. FKD fine-grained knowledge distillation. CSS corneal staining score.

Performance of baseline model between regression and categorization

Many existing methods treat corneal grading as a classification problem13,14; however, the suitability of these methods for finer-grained lesions is questionable. Classification focuses solely on the accuracy of individual grades, overlooking the overall consistency and correlation among different grades. To address this, we conducted experiments on ResNet50, the fundamental deep learning backbone of FKD-CSS, under both classification and regression settings. The primary difference between these two models lies in their final fully-connected layer: the classification model generates a probability distribution across 9 discrete output values, corresponding to each CSS grade increment (ranging from 0 to 4 in 0.5-step intervals), whereas the deep regression model directly predicts a continuous CSS value. Our results indicate the classification model's limitations in CSS tasks, achieving only a Pearson correlation of 0.727 and an area under the curve of the receiver operating characteristic (AUC-ROC) of 0.438. In contrast, the regression model achieves 0.828 and 0.803, respectively. It is evident that the regression model is more suitable for providing consistent and robust CSS grading, aligning closely with the nuanced demands of CSS assessment.
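
For illustration, the only architectural difference between the two settings is the final fully-connected layer of the shared ResNet50 backbone; the sketch below assumes the torchvision implementation and is not the study's exact training code.

```python
import torch.nn as nn
from torchvision.models import resnet50

# Classification head: a probability distribution over the 9 discrete CSS
# grades (0 to 4 in 0.5 steps); trained with a cross-entropy-style objective.
def build_classifier():
    model = resnet50(weights=None)  # the study pre-trains the backbone on ImageNet
    model.fc = nn.Linear(model.fc.in_features, 9)
    return model

# Regression head: a single continuous CSS value; trained with an L1-type loss.
def build_regressor():
    model = resnet50(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 1)
    return model
```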

Performance and interpretability of FKD-CSS compared with other deep learning algorithms

To assess the performance of the proposed FKD-CSS model, we compared it with established medical image feature extractors that are widely used for ocular staining grading or segmentation: ResNeXt5041, DenseNet12142, and ResNeSt5043. We adapted all of these models as regression models. However, despite their ability to extract high-level visual features owing to pre-training on ImageNet, these methods exhibit deficiencies in the extraction of staining features, as illustrated in Table 1. The performance of ResNeSt50 (r = 0.834, AUC = 0.802), DenseNet121 (r = 0.842, AUC = 0.818), and ResNeXt50 (r = 0.858, AUC = 0.818) still falls short of the proposed FKD-CSS model (r = 0.898, AUC = 0.881). Furthermore, we compared the performance of FKD-CSS with the other methods using samples from SNWC to evaluate the model's ability to distinguish normal populations. FKD-CSS exhibits the best performance (AUC = 0.881) compared with the comparative methods (AUC: 0.825–0.847) (Supplementary Table 11).

We provide a feature visualization comparison between the FKD-CSS model and the baseline backbone, ResNet50. The visualization for ResNet50 is generated with gradient-weighted class activation mapping plus plus (Grad-CAM++)44. As depicted in Fig. 3, the FKD-CSS model successfully captures fine-grained staining areas in corneal staining images across different CSS grades, whereas ResNet50 captures only coarse-grained staining areas. These visualization results serve as visual evidence of the proposed FKD-CSS model's ability to generate meaningful and informative interpretability.

Fig. 3: Feature visualization of the proposed FKD-CSS model and ResNet50 under randomly chosen samples with different levels of dry eye.
figure 3

The first row is the original raw image. The second row is the Grad-CAM++ map of ResNet50. The third row is the fine-grained lesion detection visualization from the proposed FKD-CSS model. FKD fine-grained knowledge distillation. CSS corneal staining score.

Achieving expert performance in clinical internal validation

In comparison with six ophthalmologists (Pearson's r ≤ 0.858), the FKD-CSS model achieved the highest Pearson's r of 0.898, indicating a strong positive correlation with the ground-truth CSS grading and affirming its accuracy. Additionally, it attained the highest AUC of 0.881, which surpasses that of the senior ophthalmologists (AUCs ≤ 0.761), exemplifying its superior ability to distinguish between different CSS grades (Fig. 4, Table 1).

Fig. 4: Performance of FKD-CSS in internal cross-validation set.
figure 4

a the correlation map between the FKD-CSS and the ground-truth CSS grading. b the AUC-ROC curves of the FKD-CSS model and ophthalmologists with different seniority. FKD fine-grained knowledge distillation. CSS corneal staining score. AUC area under the curve, ROC receiver operating characteristic. SNWC staining negative or without clinical significance.

Table 1 Comparison of the performance of the proposed FKD-CSS model and different AI models and ophthalmologists on internal cross-validation

Achieving expert performance in 6 external datasets of dry eye

We conducted a comprehensive performance test on six external test datasets of DE, sourced from 6 of the 7 regions across China. The Pearson's r between the FKD-CSS model's predictions and the actual clinical assessments ranges from 0.844 to 0.899, and the AUCs, reflecting the classification performance of the FKD-CSS model against the ground-truth CSS grading, range from 0.804 to 0.883 (Fig. 5), demonstrating performance on par with senior ophthalmologists (AUC: 0.691–0.806) (Table 2). Notably, the model's best classification performance is most frequently observed for SNWC (AUC: 0.921–0.966) and Critical (AUC: 0.927–0.990), followed by Mild (AUC: 0.753–0.868) and Severe (AUC: 0.782–0.871), while its worst performance is most commonly seen in distinguishing Moderate severity (AUC: 0.705–0.818) across the test sets from the six regions. In contrast, senior ophthalmologists show lower performance at these stages, with AUCs of 0.780–0.911 for SNWC, 0.587–0.814 for mild severity, 0.518–0.749 for moderate severity, 0.609–0.789 for severe severity, and 0.620–0.934 for critical severity. These findings highlight the FKD-CSS model's robustness, outperforming senior ophthalmologists in these critical assessments.

Fig. 5: Performance of FKD-CSS in 6 external test datasets of dry eye.
figure 5

a Correlation map between the FKD-CSS model and clinical assessments (ground-truth) on six external test datasets. b AUC-ROC of the FKD-CSS model and different levels of ophthalmologists on six external validation datasets. FKD fine-grained knowledge distillation. CSS corneal staining score. AUC area under the curve. ROC receiver operating characteristic. SNWC staining negative or without clinical significance.

Table 2 Comparison of the performance of the proposed FKD-CSS model and different ophthalmologists on six external validation datasets

To assess the explainability of the FKD-CSS model, we randomly selected 80 images from the external test dataset of DE (20 images for each category from 1 to 4) for expert annotation, which was used for lesion detection analysis. The FKD-CSS model successfully detected 95.3% of these annotated lesion areas. This high detection rate (Recall) demonstrates the model’s capability in providing insights that align with the expertise of ophthalmologists in identifying corneal staining lesions.

Model generalizes to multiple ocular surface diseases

To further assess the generalization ability of the proposed FKD-CSS model, we conducted a clinical test on a MOSD dataset encompassing 18 common ocular surface diseases other than DE. The FKD-CSS model achieved a Pearson's r of 0.816 and an AUC of 0.807 (0.716–0.968 for the specific classifications), demonstrating its robust generalization capabilities across different ocular surface diseases (Fig. 6).

Fig. 6: Performance of FKD-CSS in the external test dataset of MOSD.
figure 6

a The correlation map between the model and clinical assessments (ground truth). b The ROC curve of the model compared with clinical assessments (ground truth). FKD fine-grained knowledge distillation. CSS corneal staining score. AUC area under the curve, ROC receiver operating characteristic. MOSD multiple ocular surface disease. SNWC staining negative or without clinical significance.

In evaluating the FKD-CSS model’s fine-grained detection ability, senior ophthalmologists annotated the fine-grained staining areas of 80 images randomly selected from the MOSD dataset. Remarkably, the FKD-CSS model detected 78.5% of these areas. We also present visualization examples of the FKD-CSS model applied to various diseases. These visualizations highlight the model’s effectiveness in capturing the nuanced staining areas characteristic of these diseases, further illustrating its capability to accurately assess the severity of fine-grained staining areas in a range of ocular surface conditions (Fig. 7).

Fig. 7: The visualization of fine-grained features of FKD-CSS model of randomly chosen samples in MOSD dataset.
figure 7

The first row is the original raw image, and the second row is the fine-grained lesion detection visualization from the proposed FKD-CSS. FKD fine-grained knowledge distillation. CSS corneal staining score. MOSD multiple ocular surface disease.

Interactive visualization tool

Given the rich and varied nature of our grading data, which encompasses inputs from six different ophthalmologists across seven regions and multiple diseases, we have developed an interactive visualization software for more effective data analysis (Supplementary Software 1). This tool is designed to present comprehensive data distributions and facilitate various comparisons. Users can interactively select different options to compare the performance of various levels of ophthalmologists and the AI system across different regions and disease categories. Such functionality allows for in-depth exploration and a better understanding of the data, enhancing the overall analysis process.

Discussion

The contribution of this study is three-fold: (1) Rather than segmenting only patchy lesions or merely generating a CSS grade, the FKD-CSS model produces the CSS grade and a coarse staining segmentation simultaneously with its dual-decoder architecture, addressing the gap in identifying punctate corneal staining lesions with interpretability through fine-grained knowledge distillation. (2) Unlike traditional methods that offer only categorical grading, FKD-CSS captures rich, fine-grained pathological features, producing a continuous CSS that reflects the severity of pathological epithelial damage, which is valuable for detailed assessments in clinical trials, and delivering fine-grained lesion detection with visualization of the shape, size, and location of lesions, which carry etiological information39,40; the AI system thereby assists clinicians in evaluating etiology, severity, and treatment efficacy. (3) The FKD-CSS showed satisfactory performance for evaluating CSS in real-world settings of DE and in an external set of multiple ocular surface diseases using prospectively collected corneal staining images, and thus could be implemented and adopted for clinical care.

Due to limitations in both data size and quality, traditional DL methods are unable to extract rich features of punctate lesions for CSS grading training. While transfer learning can help by pre-training on large-scale natural image datasets, the biases introduced may not optimally represent CSS-specific features. The complexity of staining patterns—irregular morphologies, heterogeneous intensities, and sparse distributions—requires task-specific inductive biases to align feature learning with clinical criteria, reducing reliance on extensive high-quality datasets45,46. Incorporating fine-grained information through multi-task learning and segmentation supervision presents a solution, yet the discrete nature of punctate lesions complicates precise or even coarse labeling on a large scale. To overcome this issue, we propose a fine-grained knowledge distillation approach that leverages annotations from a small, coarsely labeled dataset focusing on punctate lesions in a weakly supervised manner. In fact, coarse annotations are valuable in weakly supervised learning for medical image analysis47, enabling practical, scalable, and effective training of models while alleviating the challenges of obtaining high-quality, detailed annotations48, thus enhancing a model's robustness and generalizability44. This allows the model to make educated guesses about finer details, effectively learning to localize or segment the target structures even without guidance from accurate pixel-level annotations. The coarsely labeled dataset consists of diverse corneal staining images across various severity levels, allowing the AI model to learn punctate patterns effectively (details are presented in Sec. 2.1). A teacher segmentation network, trained on this dataset, serves as a foundation to initialize and provide fine-grained supervision for the grading network, especially for large-scale data without manual staining mask annotations. By utilizing this pre-trained segmentation teacher network, the grading model initializes robust encoder parameters for subsequent CSS grading training. Furthermore, the knowledge distillation approach minimizes the Kullback–Leibler divergence (KL-divergence) between the staining segmentation distributions of the teacher and grading networks. This alignment of fine-grained information extraction across the segmentation and grading tasks substantially enhances the model's robustness in complex real-world scenarios49,50,51,52. It is important to note that knowledge distillation is applied only during the training phase of our study. Once the training dataset has been quality-controlled to ensure high quality, knowledge distillation can effectively transfer fine-grained features from the lesion detection task to the CSS grading task. However, given the high cost and imprecision of the annotations, direct quantitative analysis poses challenges; instead, we assess the impact through the improvement in grading performance. As shown in Table 3, the integration of segmentation knowledge distillation significantly enhances the FKD-CSS model's grading performance, allowing it to share latent features between the segmentation and grading components and utilize spatial information effectively. Additionally, we evaluated the segmentation performance on coarsely annotated images from the external DE and MOSD datasets (descriptions are shown in the Results section).
The coarse nature of these annotations, often broader than the actual punctate staining areas, introduces the potential for false positives, leading us to choose Recall as our evaluation metric. Such visualizations do not merely delineate crucial diagnostic areas but also allow clinicians to see and comprehend the model's decision-making process, thereby augmenting both transparency and interpretability, in contrast to many existing AI-assisted image reading methods, which are still often criticized for their 'black box' nature44,53. Although knowledge distillation has been explored in medical image analysis, this study introduces a method that opens new avenues for intelligent evaluation of multi-disciplinary medical conditions using pretext assessments such as grading (not limited to ocular surface diseases), particularly for conditions requiring early diagnosis through subtle lesion detection.

Table 3 Ablation study of the proposed FKD-CSS model on the internal dataset cross-validation

Existing DL-assisted models for corneal staining focus solely on discrete classification tasks, assigning fixed categorical grades (e.g., mild, moderate, or severe) to corneal staining samples14,54, and are therefore relatively insensitive to clinically significant subtle changes, especially in treatment response55. The continuous regression grading approach of the FKD-CSS model aligns more effectively with clinical assessments, offering a nuanced measurement of the relative distances between different scores. For instance, in a regression context, the error between scores of 3.5 and 1.0 is considered larger than that between 2.0 and 1.0, whereas a categorization approach would treat these differences as equivalent. This nuanced understanding is crucial in clinical settings where subtle variations in severity can have significant implications. However, current deep learning methods often face challenges in regression tasks, primarily due to imbalanced data distributions56. To counter this, the FKD-CSS introduces a memory bank that stores key features for each discrete category, effectively bridging the gap between regression and classification tasks. This memory bank allows for a balanced approach, harnessing the strengths of both methodologies to enhance overall performance and accuracy in grading. As a result, the continuous CSS output of FKD-CSS showed considerable consistency with the expert team, which will assist in clinical monitoring of disease progression and therapeutic response and can serve as a sensitive endpoint for treatment trials. On the other hand, the continuous score of FKD-CSS can be converted into categories by applying reasonable thresholds, which can correspond to various existing corneal staining scoring methods (e.g., the Oxford scale24, Baylor scale57, and NEI/Industry scale58) and can flexibly adapt to clinical needs for disease severity assessment.

The ultimate goal of medical AI is to serve clinical needs, necessitating thorough testing and validation in real-world settings before deployment59. Previous studies have shown only moderate intra- and inter-rater agreement in manually grading punctate staining26, with correlation coefficients ranging from 0.65 to 0.96116,54,60 and kappa values between 0.417 and 0.46326,61. This variability stems from difficulties in counting fine lesions and ambiguities in defining adjacent CSS grades, inducing subjectivity in clinician evaluations. Our study also found relatively low correlation values for senior doctors' independent scoring compared with a consensus ground truth. Therefore, AI-assisted CSS offers significant advantages in corneal damage assessment owing to its stability and repeatability. To date, only two studies have utilized deep learning to evaluate corneal staining14,54. One focuses only on punctate erosions with limited interpretability, validating on a single internal test set (153 images)14, while the other employed external testing from a single center (93 images)54. Such single-center, small-scale datasets pose risks of overfitting and limited generalizability62,63. In contrast, our model's primary testing datasets were sourced from a multicenter clinical trial across China, encompassing 37 hospitals in 22 provinces from all seven major regions, offering an in-depth comparison across different camera types, slit-lamp platforms, and regions with different population distributions, and ensuring a realistic evaluation of performance across diverse settings. It is important to recognize that there are no absolutely precise annotations for CSS, which complicates performance evaluation of all existing CSS approaches24,34,58,64. This study aims to establish a reference close to the true ground truth based on consensus among senior doctors. If our automated approach matches or surpasses the consistency and accuracy of experienced ophthalmologists, it is commendable. The FKD-CSS model exhibited performance (r: 0.844–0.899, AUC: 0.804–0.883) comparable to that of experienced ocular surface experts (r: 0.802–0.915, AUC: 0.620–0.806), which is also excellent when compared with the currently reported correlation of existing DL-assisted methods with human grading in a single center (r: 0.863)54. This automation alleviates the workload on doctors and eases the burden on remote hospitals lacking specialists. The stable and consistent scoring facilitates long-term patient tracking, enabling better assessment of recovery processes and the evaluation of clinical drugs in both dry eye and various ocular surface diseases. In addition, differences in model performance across centers exist, and future research should explore how demographic characteristics and other confounding variables influence model performance to optimize the model's real-world application. Furthermore, this study provides an interactive data visualization tool to present raw data from the large-scale, multicenter research (Supplementary Software 1), illustrating variations in data distribution and grading consistency across physicians and facilitating a comprehensive evaluation of the model's performance in complex environments.

Unlike human grading, AI-based evaluation is not bound to a specific scale or disease, but simply evaluates predefined affected areas26. The focus on the inferior one-third of the cornea aligns with existing dry eye staining assessment methods, specifically the inferior cornea staining score (ICSS), which has been a primary efficacy measure in various clinical trials. Previous studies indicate that staining in the inferior cornea is the most severe and typical sign of dry eye34,35,36 and shows significant improvement after treatment39,40. Moreover, limited-zone assessment is highly efficient and avoids the plateau effect caused by weighted calculations of staining scores in other regions, making it ideal for evaluating severity and treatment effectiveness in DE. Nevertheless, the vulnerable areas of the cornea vary across different diseases6,7,39,65, and evaluating the entire cornea may provide more comprehensive pathological information. The ability to switch between an ROI within part of the cornea and the entire corneal surface is crucial for a digital CSS model's generalizability across different ocular surface abnormalities. We tested the FKD-CSS model externally on a MOSD dataset, which includes multiple ocular surface diseases with punctate and patchy lesions. The model performed well, consistently categorizing and scoring corneal staining and correctly recognizing 78.5% of staining areas across the different staining patterns. These results indicate that the techniques can be applied beyond dry eye and punctate lesions to various corneal staining images across multiple diseases. Despite these advantages, the robustness of AI algorithms in complex clinical scenarios is still a challenge, as misidentifying key staining areas or subtle corneal changes could lead to incorrect diagnoses. Existing automated grading methods often struggle to provide reliable confidence measures: deep regression approaches lack uncertainty quantification, and classification methods suffer from overconfidence owing to their reliance on the cross-entropy loss54,66,67. Weakly supervised methods with coarse annotations can indeed introduce risks due to the absence of fine-grained labels. Our approach calculates confidence scores based on the distance between the image of interest and the prototypes (Eq. 3), reducing the risk of false negatives and positives through knowledge distillation and prototype-based confidence estimation. This allows for reliable anomaly alerts for manual review when necessary. Such uncertainty analysis has already been widely used in fields such as autonomous driving68,69. Additionally, the AI model offers practical benefits: it is built on the widely used ResNet50 framework, known for easy implementation across various platforms and low computational resource requirements, and it operates without an internet connection, making it suitable for deployment in diverse medical and research settings, even in areas with limited connectivity, thus facilitating its real-world application.

This study has the following limitations. Firstly, the model's grading of specific severities performed worst in the Moderate category in most of the external test sets, with subpar performance in the adjacent Mild and Severe categories as well. This may be attributed to the semantic ambiguity in the image reading criteria used as a reference for expert decision-making, from which the model learned. Specifically, the distinction between the Moderate and Mild categories defined by the image reading criteria is whether the lesions are countable, while the distinction from the Severe category is whether they are coalescent34. These definitions are prone to subjective variability in image reading, which is an inherent limitation of all existing corneal staining scoring criteria24,34,58,65. Consequently, the accuracy of model performance evaluation is constrained by this limitation. Secondly, it remains a significant challenge to balance the model's generalization performance with its accuracy in specific diseases. The real-world settings of this study mainly focus on DE; while the model preliminarily demonstrates promising generalization to multiple ocular surface diseases, it is recommended to expand the training and testing samples as much as possible to enhance the model's capability before applying it to specific ocular surface diseases other than DE. Thirdly, this study explored only an Asian population; subsequent research could expand the range of ethnic diversity to improve the model's generalization capacity. Moreover, the database included in this study was drawn only from tertiary hospitals; the main challenges are the lack of imaging resources in primary or community hospitals and the difficulty of data quality control. Lastly, the proposed AI model still relies heavily on the training distribution and is unable to handle data from uncontrolled healthcare settings, which has the potential to produce unforeseen and inexplicable outcomes. Moving forward, our plan is to incorporate robust abnormality analysis and uncertainty estimation methods into the FKD-CSS framework. It is worth mentioning that unconventional corneal staining, such as adhesion of secretions or filaments on the corneal surface or noticeable ocular surface irregularities leading to dye accumulation, was not within the application scenario of this study, since it is uncommon and lacks clear clinical implications. For future work, we plan to extend our analysis to include these out-of-distribution (OOD) data. By incorporating a broader range of corneal staining types, we aim to enhance the model's robustness and generalizability. This will allow us to better understand and address the variability in clinical presentations and improve the model's applicability to a wider array of ocular surface conditions.

Overall, this study proposes a knowledge distillation strategy for fine-grained corneal staining score assessment, addressing the gap in identifying fine-grained lesions with interpretability. The model also provides flexible output, both continuous scores and severity grades, for treatment efficacy assessment. The performance of the FKD-CSS model has been validated in real-world settings of DE, demonstrating its potential to facilitate standardized and precise medical care and to support multicenter clinical trials of DE, and it also exhibits promising generalizability across diverse ocular surface diseases. This work represents a foundational step toward linking automatic scoring with fine-grained features in specific conditions, supporting further analysis of correlations between feature matrices, pathological information, and final scores, and providing a new direction for quantifying clinical evaluations.

Methods

Ethics approvals

This study was approved by the medical ethics committee of Zhongshan Ophthalmic Centre, Sun Yat-sen University, China (protocol number: 2020YWPJ001). All procedures were conducted following the Declaration of Helsinki (1983). Informed consent was obtained from each of the participants prior to data collection.

Datasets

The images of the development, internal validation, and primary external test datasets were mainly collected prospectively from a DE dataset of a multicenter phase III study of cyclosporine-A eye drops (ClinicalTrials.gov; identifier: NCT04541888), involving 37 tertiary hospitals from seven regions of China (Fig. 1). The dataset consists of one corneal staining image per eye per follow-up visit of DE patients who tested positive for epithelial staining at baseline between November 2020 and October 2021. The follow-up visits include baseline and post-treatment improvement or exacerbation, as well as subjects who recovered to corneal staining negativity after treatment, encompassing abundant punctate, discrete, and coalescent patchy corneal epithelial lesions. For eyes that completely returned to normal after treatment, we included only one staining-negative image to avoid duplication. Images from different eyes of the same patient, taken at various time points, can be deemed independent, since the sampling intervals (≥2 weeks) exceed the epithelial cell turnover time (≤1 week), which is sufficient for significant morphological changes in corneal epithelial lesions70. The multicenter operators had all received standard training before capturing corneal fluorescein staining images. Additionally, the generalizability of FKD-CSS was tested on an independent annotated corneal staining image set of multiple ocular surface diseases (MOSD) without DE from Zhongshan Ophthalmic Center. There was no overlap between patients in the development set and any test set.

Development and internal validation datasets

All images from Eastern China in the phase III study dataset were enrolled for development and internal validation of the FKD-CSS model, resulting in 1477 corneal fluorescein staining images of 299 DE patients from 14 tertiary hospitals in seven provinces and municipalities. The internal validation was conducted via 5-fold cross-validation, with 30% of the training data in each fold used for hyper-parameter selection. Predictions for the external validation were obtained by ensemble averaging across the five models trained during the cross-validation process.
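
A minimal sketch of this partitioning and ensemble inference is given below; `train_model` and `predict` are hypothetical placeholders for the FKD-CSS training and inference routines, and the split seeds are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def cross_validate_and_ensemble(image_ids, external_images, train_model, predict):
    """5-fold cross-validation with ensemble-averaged external predictions."""
    models = []
    for fold_train, fold_val in KFold(n_splits=5, shuffle=True, random_state=0).split(image_ids):
        # 30% of each fold's training data is held out for hyper-parameter selection
        fit_ids, tune_ids = train_test_split(fold_train, test_size=0.3, random_state=0)
        models.append(train_model(fit_ids, tune_ids, fold_val))
    # External test: average the continuous CSS predicted by the five fold models
    scores = np.stack([predict(m, external_images) for m in models], axis=0)
    return scores.mean(axis=0)
```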

Specifically, to establish a correlation between CSS grading and staining lesions, we randomly selected 100 DE images with CSS grades 1-4 from the development dataset (25 images per grade). These images were coarsely annotated for staining lesions by senior ophthalmologists. Integrating these fine-grained but imprecise annotations allows the FKD-CSS framework to effectively transfer prior knowledge about the correlation between staining grades and lesion locations to a larger and more generalized dataset using knowledge distillation. We then conducted extensive ablation studies to analyze the contribution of key architectural components of the FKD-CSS model and compared its performance and interpretability with representative deep learning methods in internal validation.

External test datasets for Dry eye

The proposed model was tested in six external independent sets from 6 regions of China (Central, South, North, Northeast, Northwest, and Southwest China). Each external set involved 2–7 tertiary hospitals, yielding a total of 2386 photographs of 476 DE patients from 23 hospitals in 15 provinces and municipalities. Each dataset contains all ranges of CSS.

External test for multiple ocular surface diseases

To assess the generalizability of FKD-CSS, 231 corneal staining images of 231 patients with multiple ocular surface diseases from Zhongshan Ophthalmic Center were collected retrospectively (between October 2019 and November 2023). The inclusion criteria focused on patients with corneal epithelial injuries from any cause except DE. The exclusion criteria included factors that could lead to false-positive sodium fluorescein staining, such as adhesion of secretions or filaments on the surface of the cornea, noticeable ocular surface irregularities that can lead to significant dye accumulation, and any condition that the investigator felt might confound the study results. Notably, the FKD-CSS model was initially trained on DE cases, focusing primarily on the lower third of the cornea; however, multiple ocular surface diseases can affect the entire corneal area. To address this discrepancy, we divided each image into upper and lower sub-images and independently predicted their CSS. The final CSS for each image in the MOSD dataset was then determined by selecting the higher value from the upper or lower corneal area.
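
This upper/lower strategy can be sketched as follows; `grade_roi` is a hypothetical wrapper around the trained FKD-CSS model, and the half-and-half split is an assumption for illustration.

```python
import numpy as np

def grade_full_cornea(image: np.ndarray, grade_roi) -> float:
    """Score the upper and lower corneal halves separately and keep the higher CSS."""
    h = image.shape[0]
    upper, lower = image[: h // 2], image[h // 2 :]
    return max(grade_roi(upper), grade_roi(lower))
```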

Reading process of corneal staining images

All images were assigned to three senior ophthalmologists who graded each image three times. If there was a dispute, they discussed and reached a consensus. If the reading group encountered difficult cases, they could apply for expert arbitration performed by two ocular surface experts. The CSS was scored by evaluating the distribution and density of punctate staining34 as follows: no staining = 0, few/rare punctate lesions = 1, discrete and countable lesions = 2, lesions too numerous to count but not coalescent = 3, coalescent = 4. The manual scoring label was required to be accurate to 0.5 points, in keeping with precise evaluation in practice. A total of 228 corneal staining images were randomly re-read to check grading consistency; the overall eligibility rate of this spot check was 84.21%. To establish a detailed correlation between CSS and staining lesions, we randomly selected 100 DE images with CSS ranging from 1 to 4 from the development dataset, 25 images for each grade. The staining lesions in these images were then coarsely annotated by a team of 3 experienced doctors using the PAIR® annotation software package (https://www.aipair.com.cn/en/, Version 2.7, RayShape, Shenzhen, China). It is worth noting that the annotation process for punctate lesions is both expensive and prone to inaccuracies; consequently, there is a notable limitation in the quantity and quality of these lesion annotations. However, the integration of these fine-grained yet imprecise annotations enables the proposed FKD-CSS framework to effectively transfer prior knowledge regarding the correlation between staining grades and lesion locations to a more expansive and generalized dataset by using knowledge distillation. Since all of these patients had DE, corneal staining was annotated and scored in the inferior 1/3 of the cornea, which has been demonstrated to present the highest degree of corneal staining in symptomatic DE patients most commonly39,40.

Algorithm construction of FKD-CSS

The proposed FKD-CSS framework consists of three stages (Fig. 2). Initially, a corneal segmentation network is utilized to extract the inferior corneal region, followed by a quality control step to exclude low-quality images with indistinct corneal regions to ensure that each image has clearly visible corneal boundaries and consistent, uniform illumination. Subsequently, a weakly-supervised teacher segmentation network is introduced in the second stage to guide the CSS grading model towards the detection of detailed but coarsely-grained lesions. Additionally, a prototype-based stratified screening mechanism was incorporated to enable self-correction and uncertainty estimation for misclassified samples, ensuring the model’s reliability and suitability for real-world clinical applications.

We trained a U-Net on the SUSTech-SYSU dataset12, utilizing a ResNet50 backbone pre-trained on ImageNet, for automatic corneal segmentation. To maintain topological accuracy, we applied ellipse fitting as a post-processing technique. However, it is essential to highlight that the ellipse fitting process may encounter challenges with low-quality images that have unclear corneal boundaries or uneven illumination. To address this, we implemented an automatic exclusion strategy to ensure quality control. As a result of this quality control step, we retained 1471 out of 1477 images for training and 5-fold cross-validation, 2376 out of 2386 images for the DE external test, and 231 images for the MOSD external test. The training-validation ratio in cross-validation is set at 7:3 in each fold.
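
A minimal sketch of the ellipse-fitting post-processing and automatic exclusion step is shown below, assuming an OpenCV implementation; the coverage threshold is an illustrative assumption rather than the study's exact setting.

```python
import cv2
import numpy as np

def fit_corneal_ellipse(mask: np.ndarray, min_coverage: float = 0.9):
    """Fit an ellipse to a predicted corneal mask; return None to exclude the image."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                      # nothing segmented: exclude
    contour = max(contours, key=cv2.contourArea)
    if len(contour) < 5:                 # cv2.fitEllipse requires >= 5 points
        return None
    ellipse = cv2.fitEllipse(contour)
    # Exclude images where the fitted ellipse agrees poorly with the raw mask,
    # a proxy for indistinct corneal boundaries or uneven illumination.
    ellipse_mask = np.zeros_like(mask, dtype=np.uint8)
    cv2.ellipse(ellipse_mask, ellipse, 1, -1)
    overlap = np.logical_and(mask > 0, ellipse_mask > 0).sum()
    if overlap / max((mask > 0).sum(), 1) < min_coverage:
        return None
    return ellipse
```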

CSS grading depends heavily on the accurate identification of punctate staining areas, yet existing DL methods often fall short due to their inability to extract such fine-grained information during training. To address this difficulty, we introduced a weakly-supervised segmentation teacher network utilizing a limited number of coarse lesion segmentation annotations. Specifically, we manually annotated 100 images from the development dataset with coarse annotations, as mentioned in the previous section. We then utilized a U-Net71 with a ResNet50 backbone, pre-trained from our previous corneal segmentation task. The teacher network's primary objective is to infuse spatial staining information into the subsequent student grading model, thereby enriching the automated grading process with robust explanations, as detailed in the following paragraphs. By leveraging weak segmentation supervision alongside spatial context, our proposed approach aims to enhance the accuracy and reliability of automated CSS grading while being cost-effective and efficient, requiring only a few coarse annotations.
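
The teacher set-up can be sketched as follows, assuming the segmentation_models_pytorch package for the U-Net with ResNet50 encoder; the study's exact implementation and weight-initialization details may differ.

```python
import torch
import segmentation_models_pytorch as smp

# U-Net teacher with a ResNet50 encoder; in the study the encoder is initialised
# from the previously trained corneal segmentation task (file name hypothetical).
teacher = smp.Unet(encoder_name="resnet50", encoder_weights=None, classes=1)
# teacher.encoder.load_state_dict(torch.load("cornea_seg_encoder.pth"))

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-4)

def teacher_step(images, coarse_masks):
    """One optimisation step on the coarsely annotated lesion masks."""
    logits = teacher(images)                # (B, 1, H, W) lesion logits
    loss = criterion(logits, coarse_masks)  # weak supervision from coarse masks
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```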

The grading network approaches the CSS grading task as a regression problem, generating continuous predictions that can be easily converted into discrete grades by applying a specified threshold. To optimize this regression task, the network aims to minimize the L1 loss function, which is defined in Eq. (1):

$${{\mathcal{L}}}_{\mathrm{score}}=\frac{1}{B}\mathop{\sum }\limits_{i=1}^{B}{|}{|}\,\hat{{y}_{i}}-{y}_{i}\,{|}{|},$$
(1)

where \(B\) is the batch size, \(\hat{{y}_{i}}\) denotes the predicted grade, and \({y}_{i}\) is the ground-truth CSS grade. As discussed earlier, most of the existing methods lack explainability and do not yield interpretable results and fine-grained highlights of staining areas. In response to this issue, we adopt a novel approach of utilizing intermediate features for pretext tasks, specifically referred to as interpretable segmentation distillation. Our method employs a straightforward knowledge distillation technique, drawing insights from the pretrained weakly supervised segmentation network, as shown in Eq. (2):

$${{\mathcal{L}}}_{\mathrm{dist}}=\frac{1}{B}\mathop{\sum }\limits_{i=1}^{B}\,\mathrm{KL}\,({s}_{i}^{s},{s}_{i}^{t}),$$
(2)

where \({s}_{i}^{t}\) is the pseudo soft label generated by the teacher segmentation network, and \({s}_{i}^{s}\) denotes the segmentation probabilities based on the features from the CSS grading encoder, obtained through a U-Net segmentation head. The KL-divergence is used to measure the difference between these two probability distributions. This segmentation knowledge distillation task effectively aligns the CSS grading features with staining locations. Furthermore, the segmentation branch also provides interpretable results, making it valuable for clinical applications. In accordance with the CSS grading standard, annotations are divided into nine discrete classes, ranging from 0 to 4 in increments of 0.5. We operate under the assumption that subjects in each category should display similar levels of clinical symptoms. To capitalize on this assumption, we propose the integration of a CSS Prototype Memory Bank (CPMB). The CPMB is specifically designed to group similar semantic information from predicted successive grades; this clustering facilitates more effective knowledge representation and retrieval, enabling the system to draw upon a rich database of categorized symptoms and corresponding grades. To employ it, we adopt the InfoNCE loss20 to align similar samples of the same category, as in Eq. (3):

$${{\mathcal{L}}}_{\text{proto}}=-\frac{1}{{\rm{B}}}{\sum }_{{\rm{i}}=1}^{{\rm{B}}}\log \frac{\exp \left({{\rm{f}}}_{{\rm{i}}}\cdot {{\rm{p}}}_{{{\rm{y}}}_{{\rm{i}}}}/{\rm{\tau }}\right)}{{\sum }_{{\rm{j}}=1}^{{\rm{C}}}\exp \left({{\rm{f}}}_{{\rm{i}}}\cdot {{\rm{p}}}_{{\rm{j}}}/{\rm{\tau }}\right)},$$
(3)

where \({f}_{i}\) represents the feature vector of \(i\)-th sample, and \({p}_{j}\) denotes the \(j\)-th prototype in CPMB, \(C\) is the number of classes in CPMB (which is set as 9 in this study) and \(\tau\) is the temperature parameter. We employ a moving average (MA) strategy to update the clustering center in CPMB. Given the \(i\)-th sample labeled with \({y}_{i}\), the MA process is formulated in Eq. (4):

$${p}_{{y}_{i}}=\alpha {p}_{{y}_{i}}+(1-\alpha ){f}_{i},$$
(4)

where \(\alpha\) represents the momentum coefficient. The overall objective function is given by Eq. (5):

$${\mathcal{L}}={{\mathcal{L}}}_{\text{score}}+{\lambda }_{\text{proto}}{{\mathcal{L}}}_{\text{proto}}+{\lambda }_{\text{dist}}{{\mathcal{L}}}_{\text{dist.}}$$
(5)

The hyper-parameters \({\lambda }_{\text{proto}}\) and \({\lambda }_{\text{dist}}\) are configured as 0.5 and 0.05, respectively. We set the initial learning rate to 0.0001 and utilize a cosine decay scheduler for learning rate adjustment. The student grading network is initialized using the parameters of the teacher network encoder and is trained for a total of 50 epochs. A batch size of 64 is employed during training. For distillation, the temperature parameter \({\rm{\tau }}\) is set to 0.1 and the momentum coefficient \(\alpha\) is set to 0.999. The proposed framework is implemented using PyTorch and is trained on a single NVIDIA Titan RTX GPU. We plan to make the code publicly available once the paper is accepted.
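
For concreteness, the overall objective in Eqs. (1)-(5) can be sketched in PyTorch as follows; the module structure, tensor shapes, feature dimensionality, and channel-wise segmentation softmax are assumptions for illustration and may differ from the released code.

```python
import torch
import torch.nn.functional as F

class FKDLoss(torch.nn.Module):
    def __init__(self, num_classes=9, feat_dim=2048, tau=0.1, alpha=0.999,
                 lambda_proto=0.5, lambda_dist=0.05):
        super().__init__()
        # CSS Prototype Memory Bank (CPMB): one prototype per 0.5-step grade
        self.register_buffer("prototypes", torch.randn(num_classes, feat_dim))
        self.tau, self.alpha = tau, alpha
        self.lambda_proto, self.lambda_dist = lambda_proto, lambda_dist

    def forward(self, pred_score, true_score, student_seg_logits,
                teacher_seg_logits, features, class_idx):
        # Eq. (1): L1 regression loss on the continuous CSS
        l_score = F.l1_loss(pred_score, true_score)

        # Eq. (2): KL divergence between the student's segmentation probabilities
        # and the frozen teacher's pseudo soft labels
        l_dist = F.kl_div(F.log_softmax(student_seg_logits, dim=1),
                          F.softmax(teacher_seg_logits.detach(), dim=1),
                          reduction="batchmean")

        # Eq. (3): InfoNCE loss aligning each feature with its class prototype
        logits = (F.normalize(features, dim=1)
                  @ F.normalize(self.prototypes, dim=1).t()) / self.tau
        l_proto = F.cross_entropy(logits, class_idx)

        # Eq. (4): moving-average update of the prototypes in the CPMB
        with torch.no_grad():
            for i, c in enumerate(class_idx):
                self.prototypes[c] = (self.alpha * self.prototypes[c]
                                      + (1 - self.alpha) * features[i])

        # Eq. (5): weighted combination of the three terms
        return l_score + self.lambda_proto * l_proto + self.lambda_dist * l_dist
```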

Ablation study

To evaluate the specific impact of each component within the FKD-CSS model, we undertook an ablation study utilizing our internal validation set. The quantitative outcomes of this study, detailing various combinations of three critical components, are presented in Table 3. The “ROI extraction” component refers to the preprocessing step that crops the ROI areas, focusing on the 1/3 inferior area of the cornea for Dry Eye cases and the entire cornea for other conditions. The “CPMB” stands for the CSS Prototype Memory Bank training strategy, mathematically denoted as \({{\mathcal{L}}}_{\text{proto}}\). Lastly, “Knowledge distillation” indicates the fine-grained segmentation training strategy, represented as \({{\mathcal{L}}}_{\text{dist}}\).

Compared with the performance of the baseline (r = 0.828, AUC = 0.803), the successive inclusion of ROI extraction, knowledge distillation, and CPMB elevated the algorithm's r to 0.865, 0.883, and 0.889, and increased the AUC to 0.848, 0.861, and 0.871, respectively. When all three components were combined, the model achieved the best performance (r = 0.898, AUC = 0.881) (Table 3). The findings from this study clearly demonstrate the critical role played by each component in ensuring accurate CSS grading. Notably, the removal of any single component from the model led to a significant decline in its performance.

Statistical analysis

All statistical analyses were conducted using SPSS version 24.0 (IBM, Armonk, New York). Since the output of FKD-CSS initially consisted of continuous scores that could be converted into the discrete categories, we employed two statistical metrics to evaluate regression consistency and categorization performance, namely the correlation coefficient and the AUC-ROC. Normality of the continuous CSS scores was confirmed via the D'Agostino-Pearson test (p > 0.05). We evaluated the correlation between clinical grading scores and automated grading scores with the Pearson test; r was used as the correlation coefficient. AUC-ROCs were computed by using the absolute error between the predicted grades and the ground truth as the threshold. ROC curves were generated by sweeping error thresholds from 0 (perfect match) to the maximum observed error, calculating true/false positive rates at each grade. The overall AUC was derived from a clinical prevalence-weighted average of the five grade-specific AUCs. All tests were two-tailed; p < 0.001 was considered statistically significant.
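
One plausible implementation of these two metrics is sketched below, assuming continuous ground-truth and predicted CSS arrays; the per-category decision score and prevalence weighting reflect our reading of the description above and may not match the study's exact ROC construction.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def css_metrics(y_true, y_pred, bins=(0.5, 1.5, 2.5, 3.5)):
    """Pearson r plus a prevalence-weighted one-vs-rest AUC over 5 severity bands."""
    r, _ = pearsonr(y_true, y_pred)               # regression consistency
    true_cat = np.digitize(y_true, bins)          # 0..4 severity categories
    aucs, weights = [], []
    for c in range(5):
        positives = true_cat == c
        if positives.sum() in (0, len(true_cat)): # category absent: skip it
            continue
        # decision score: closeness of the predicted CSS to the band centre c
        aucs.append(roc_auc_score(positives, -np.abs(np.asarray(y_pred) - c)))
        weights.append(positives.mean())          # clinical prevalence weight
    return r, np.average(aucs, weights=weights)
```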

Comparison with ophthalmologists

Six clinicians at an academic ophthalmology center, with varying levels of clinical experience, were assigned to grade the severity of corneal staining in the validation and test images. Their experience ranged as follows: two had less than 5 years of experience, two had between 5 and 10 years, and two had more than 10 years. All of these clinicians were instructed to make a referral grading decision on each image in the development and internal validation datasets and the six external test datasets.

Staining detection test

To assess the fine-grained capability of our model, we performed a staining detection test on 80 randomly selected images from the external DE test dataset and 80 images from the MOSD test dataset. Notably, assessment of the DE dataset focused primarily on the lower third of the cornea, which has been demonstrated to be the most predictive key zone for DE severity, while evaluation covered the entire corneal area in the dataset of multiple ocular surface diseases.

We used the segmentation branch integral to the knowledge distillation process for feature visualization. This branch shares latent features with the grading network, meaning that its segmentation outputs are indicative of the area most relevant to CSS grading. We harness these outputs to create a probability map, which is then employed as a heatmap for detecting staining areas. This approach allows for a visual representation of the model’s focus areas, thereby highlighting its precision in identifying key regions relevant to CSS grading. It is worth noting the inherent challenge of annotating point staining, where the ground truth area is typically larger than discrete staining points. Therefore, we use the recall of the segmentation results as a metric to evaluate the effectiveness of our model in detecting staining.
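
A minimal sketch of this Recall metric is given below, assuming binary pixel masks for the model's staining map and the expert's coarse annotation; a per-lesion variant would follow the same idea.

```python
import numpy as np

def staining_recall(pred_mask: np.ndarray, annot_mask: np.ndarray) -> float:
    """Fraction of expert-annotated staining pixels covered by the predicted map."""
    pred, annot = pred_mask > 0, annot_mask > 0
    true_positive = np.logical_and(pred, annot).sum()
    return true_positive / max(annot.sum(), 1)
```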

Uncertainty estimation

To further mitigate potential risks of misdiagnosis by the FKD-CSS, our algorithm incorporates a novel confidence estimation mechanism. By leveraging prototype distances (Eq. 3), it calculates confidence scores to gauge uncertainty. Testing on an external validation set showed a statistically significant confidence gap (p < \(1\times {10}^{-10}\)) between correctly graded (error < 0.5) and misgraded samples (error ≥ 0.5). With a confidence threshold of 0.27, the model detected 97.77% of misgraded samples, flagging low-confidence cases for manual review.
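
A minimal sketch of this flagging rule is shown below; the 0.27 threshold is taken from the text, while deriving the confidence as a softmax over prototype similarities is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def flag_for_review(features, prototypes, tau=0.1, threshold=0.27):
    """Return True for samples whose prototype-based confidence falls below the threshold."""
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t() / tau
    confidence = F.softmax(sims, dim=1).max(dim=1).values
    return confidence < threshold
```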

This confidence estimation mechanism enhances reliability in clinical AI applications, ensuring safe and effective large-scale screenings. By achieving performance comparable to senior clinicians while integrating safeguards for patient safety, our FKD-CSS model fulfills the demands of clinical use with reduced risk.