Introduction

Glaucoma is the second-leading cause of blindness worldwide, characterized by progressive vision loss (Fig. 1a)1. It affects over 3 million patients in the US, and projections indicate that by 2050 this number will rise to 6.3 million owing to an aging population, with African Americans and Hispanics disproportionately affected2. The irreversible nature of visual impairment from glaucoma progression underscores the importance of detecting visual dysfunction in its early stages3. Glaucoma progression has traditionally been assessed by tracking the decline in visual function sensitivity over time using standard automated perimetry, though this approach is susceptible to substantial test-retest variability and the confounding effects of age-related changes4. Current standard clinical care for glaucoma also includes optical coherence tomography (OCT) imaging, which allows for the detection of progressive morphological changes, such as retinal nerve fiber layer (RNFL) thinning, that are closely associated with vision loss5. Recent studies have developed deep learning-based methods to detect visual field (VF) deterioration from longitudinal OCT scans and RNFL thickness, achieving promising area under the ROC curve (AUC) scores6,7. However, identifying progression status typically requires OCT measurements over an extended period, often spanning several years8, which makes it challenging to collect sufficient data for training a generalizable progression prediction model. In addition, methods that rely on RNFL thickness may be affected by OCT layer segmentation artifacts9,10. Hence, we developed a deep learning model that uses baseline OCT scans to predict glaucoma progression status, where progression was defined by longitudinal VF data from at least five consecutive visits over a six-year period (Fig. 1b).

Fig. 1: Illustration of glaucoma progression and the proposed equity-aware deep learning model.
figure 1

a Illustration of progressive vision loss in a glaucoma patient. b Examples of glaucoma progression identified from longitudinal visual field tests. c The proposed equity-aware EfficientNet, called EqEffNet, for training a glaucoma detection model. d The proposed framework, called FairDist, generalizes the pretrained glaucoma detection model to enhance the demographic equity in glaucoma progression prediction through knowledge distillation.

On the other hand, while deep learning models have the potential to improve the efficiency and accessibility of glaucoma diagnosis for a broader population11,12,13,14, their performance across different demographic groups remains underexplored. A growing body of research indicates that deep learning algorithms can reinforce and amplify biases present in the data15,16,17,18, raising significant fairness concerns, particularly for certain racial and ethnic minorities. For example, a recent study found that Black individuals experience significantly lower performance outcomes than Whites and Asians when RNFL thickness and OCT scans are used for glaucoma detection18. Two other studies offered a comprehensive analysis of inequity concerns in artificial intelligence (AI) across various healthcare applications17,19, especially as they impact underrepresented communities. The unfairness of AI models can stem from several factors. First, skewed data distributions in diagnosis labels and sensitive attributes may result in inadequate feature representation for specific groups. Second, anatomical differences between subgroups can create varying levels of difficulty for AI models to provide accurate diagnoses. Third, annotation inconsistencies arising from clinicians’ varying preferences based on patient attributes can confuse AI models and contribute to bias. Last, the AI model itself can perpetuate unfairness: conventional training processes prioritize maximizing overall performance, which may inadvertently widen performance gaps between subgroups. Regardless of the cause, it is of paramount importance to explicitly address the potential unfairness of AI models for equitable medical diagnosis.

In this work, we propose an equity-aware deep learning model to address these limitations. From the analyses above, two obstacles may hinder the development of a responsible progression prediction model: (1) the limited availability of longitudinal data with annotations and (2) potential disparities in prediction performance across different demographic groups. To address the challenge of limited data, we propose training a glaucoma detection model (Fig. 1c) on a large dataset and subsequently generalizing it for progression prediction through a knowledge distillation process (Fig. 1d). Additionally, we introduce an equity-aware feature learning mechanism that adjusts the importance of OCT scan features based on patient identity information. The proposed integration of knowledge distillation and equitable feature learning aims to enhance both the performance and fairness of deep learning models for glaucoma progression prediction. The terms equity and fairness are used interchangeably in this paper.

Results

Dataset collection

In this study, we utilized a large-scale glaucoma detection dataset (Table 1) to develop a glaucoma detection model. The dataset contains 10,000 reliable (signal strength greater than 6) spectral-domain OCT (Cirrus, Carl Zeiss Meditec, Dublin, California) samples from 10,000 patients seen at the Massachusetts Eye and Ear glaucoma service between 2010 and 2022. The average age of patients was 60.7 ± 6.3 years. Each 3D OCT sample contains 200 B-scans, each with a dimension of 200 × 200. The self-reported patient demographic information was as follows (Fig. 2a): 57.0% of the patients are female and 43.0% male; racially, 8.5% are Asian, 14.9% are Black, and 76.6% are White. 51.3% of the patients are categorized as non-glaucoma and 48.7% as glaucoma.

Fig. 2: Distributions of demographics and labels in the adopted datasets.
figure 2

a The distributions of demographics and glaucomatous status of 10,000 patients in the glaucoma detection dataset. b The distributions of demographics, glaucomatous and progression status of 500 patients in the progression prediction dataset.

Table 1 Dataset characteristics

Additionally, we used a public dataset for both glaucoma detection and progression prediction tasks (Table 1). It includes 500 spectral-domain Cirrus OCT samples from 500 patients at the Massachusetts Eye and Ear. The average age of patients was 62.9 ± 6.4 years. Each OCT sample contains 200 B-scans and is accompanied by both glaucoma detection and progression prediction labels. Demographic information was as follows: gender distribution is 54.0% female and 46.0% male; racial composition is 9.4% Asian, 15.6% Black, and 75.0% White. The patients were divided into non-glaucoma and glaucoma categories, making up 55.2% and 44.8% of the dataset, respectively. Two types of glaucoma progression, Mean Deviation (MD) fast progression and Total Deviation (TD) pointwise progression, were explored in this work. For MD fast progression, 91.2% of patients are categorized as non-progression and 8.8% as progression20. For TD progression, the percentages are 90.6% non-progression and 9.4% progression.

Glaucoma detection results

We integrated an equity-aware feature attention learning layer into EfficientNet to develop the EqEffNet model. For gender groups, EqEffNet improved the overall AUC by 0.01 (p < 0.05), with the AUC for the Male group improving from 0.83 to 0.85 (Fig. 3). For racial groups, both the overall AUC and ES-AUC of EqEffNet improved by 0.02 (p < 0.01) compared with EfficientNet, driven by a significant 0.04 AUC improvement for the White group, although the Asian and Black groups showed no improvement (Fig. 3).

Fig. 3: The comparison between EfficientNet and EqEffNet for glaucoma detection.
figure 3

a Gender. b Race.

MD fast progression prediction results

Among the five comparative methods (VGG, DenseNet, ResNet, ViT, and EfficientNet) without unfairness mitigation strategies (Fig. 4a), ResNet achieved the highest overall AUC of 0.69 and ES-AUC of 0.67 (p < 0.05) for the gender attribute, outperforming VGG (0.65 and 0.63), DenseNet (0.68 and 0.65), ViT (0.65 and 0.61), and EfficientNet (0.66 and 0.63). EfficientNet achieved the highest sensitivity of 0.67 (p < 0.01), while both VGG and ResNet demonstrated the highest specificity of 0.66 (p < 0.05) among the five methods (Fig. 4a). For gender groups, FairDist outperformed all five methods without unfairness mitigation strategies, achieving the highest overall AUC of 0.74 and ES-AUC of 0.69 (p < 0.01). The group AUCs for Females (0.75) and Males (0.79) were consistently higher than those of other methods. Additionally, FairDist matched EfficientNet with the highest sensitivity of 0.67 (Fig. 4a). Compared with the two methods with unfairness mitigation on gender groups (Fig. 4b), FairDist improved the overall AUC, ES-AUC, and sensitivity by 0.06, 0.11, and 0.20 over EfficientNet + Adversarial (p < 0.01), and by 0.05, 0.15, and 0.27 over EqEffNet (p < 0.01), respectively.

Fig. 4: Prediction performance of MD Fast progression.
figure 4

a Comparison of baseline methods and the proposed FairDist model for gender groups. b Comparison of methods with unfairness mitigation strategies for gender groups. c Comparison of baseline methods and the proposed FairDist model for racial groups. d Comparison of methods with unfairness mitigation strategies for racial groups. MD visual field mean deviation, Adversarial with adversarial training.

For racial groups, ResNet retained the highest AUC of 0.69 among the five baseline methods without fairness considerations. However, its ES-AUC (0.49) and sensitivity (0.53) were lower than those of EfficientNet, which achieved an ES-AUC of 0.56 and sensitivity of 0.67 (Fig. 4c). FairDist achieved the highest AUC of 0.78, ES-AUC of 0.68, and sensitivity of 0.73 compared with all other methods, with or without unfairness mitigation strategies (Fig. 4c). Compared with EqEffNet, FairDist improved the AUCs for Asians and Blacks by 0.20 and 0.14 (p < 0.01), respectively. Compared with EfficientNet + Adversarial, FairDist improved the AUC in the Black group by 0.14 (p < 0.01), although the corresponding AUCs for Asians and Whites dropped slightly (Fig. 4d).

TD pointwise progression prediction results

On the gender attribute, EfficientNet achieved an overall AUC of 0.72, ES-AUC of 0.70, and sensitivity of 0.71, outperforming (p < 0.05) the other four methods without fairness learning components, namely VGG, DenseNet (except for sensitivity), ResNet, and ViT (Fig. 5a). FairDist achieved the highest AUC (0.74) and ES-AUC (0.72) compared with all five methods without fairness considerations (Fig. 5a), and also outperformed EfficientNet + Adversarial (0.71 and 0.62) and EqEffNet (0.72 and 0.66), both with fairness considerations (Fig. 5b). In addition, FairDist achieved the highest average of sensitivity and specificity (0.67; Fig. 5a, b), compared with VGG (0.60), ResNet (0.60), ViT (0.56), EfficientNet (0.65), EfficientNet + Adversarial (0.66), and EqEffNet (0.62). FairDist achieved the same specificity of 0.70 as DenseNet, but lower sensitivity (0.64) than DenseNet (0.71).

Fig. 5: Prediction performance of TD Pointwise progression.
figure 5

a Comparison of baseline methods and the proposed FairDist model for gender groups. b Comparison of methods with unfairness mitigation strategies for gender groups. c Comparison of baseline methods and the proposed FairDist model for racial groups. d Comparison of methods with unfairness mitigation strategies for racial groups. MD visual field mean deviation, Adversarial with adversarial training.

On the race attribute, FairDist achieved the highest AUC (0.75), ES-AUC (0.65), and specificity (0.76) compared with all other methods (p < 0.05, Fig. 5c, d). Although EqEffNet achieved a perfect AUC of 1.0 for Blacks and a better AUC than FairDist in the White group (Fig. 5d), its AUC in the Asian group was significantly lower than that of FairDist, resulting in a much lower ES-AUC (0.51) than FairDist’s (0.65). FairDist also improved the sensitivity and specificity by 0.07 and 0.09 (p < 0.01), respectively, compared with EqEffNet (Fig. 5d).

Discussion

Deep learning approaches have been increasingly adopted to analyze clinical data such as VF, fundus images, and OCT scans for automated glaucoma detection21,22. However, glaucoma progression prediction remains underexplored compared to glaucoma detection, primarily due to the scarcity of available data23. This challenge is further compounded by the significant fairness concerns that deep learning models have raised17. Unlike existing progression prediction methods, which often require large datasets and overlook demographic biases, we propose a novel approach that addresses both data scarcity and group disparities in progression prediction.

Compared to the baseline methods, the advantages of the proposed approach stem from two key factors. First, the EqEffNet model with an equity-aware feature attention layer enhances both performance and equity, as demonstrated in glaucoma detection (Fig. 3) across gender and racial attributes. The equity-aware attention layer enables the model to adjust feature importance in OCT scans according to demographic attributes (Fig. 6a), giving the model improved capacity to balance feature learning across groups. Additionally, EqEffNet generally outperformed methods that do not consider group disparities in progression prediction, with notable improvements for race in MD Fast progression (Fig. 4c, d) and for both gender and race in TD Pointwise progression (Fig. 5). Second, the pretrained detection model enhances the glaucoma progression prediction model through knowledge distillation, as shown by the comparison between FairDist and EqEffNet (Figs. 4 and 5). This allows an equity-enhanced and well-performing glaucoma detection model to guide the feature learning of a progression prediction model, which is especially beneficial when data scarcity makes it challenging to train a good model. The averaged gradient-weighted class activation map of FairDist (Fig. 6b), which closely resembles that of EqEffNet (Fig. 6a), also demonstrates that knowledge distillation effectively reshaped the progression model to extract meaningful features from OCT scans for enhanced performance and equity. Notably, features around the OCT retinal layers were more important, and these were better captured by FairDist than by the baseline EfficientNet model (Fig. 6c). Although FairDist did not consistently outperform baseline models on all metrics, it achieved improved overall performance and equity across demographic groups, which are key factors for reliable and trustworthy clinical deployment. For example, while FairDist may not always achieve the highest sensitivity, it often achieves a better balance between sensitivity and specificity, which is crucial for clinical applications that require minimizing both false negatives and false positives.

Fig. 6: Averaged gradient-weighted class activation maps across all OCT scans.
figure 6

a EfficientNet versus EqEffNet on glaucoma classification. b EfficientNet versus FairDist on MD Fast progression prediction. c EfficientNet versus FairDist on TD Pointwise progression prediction.

It is interesting to note that EqEffNet improved the overall AUC and model fairness on the race attribute at the cost of compromised group AUCs for Asians and Blacks (Fig. 3b). This seemingly contradictory trend can be explained by two major factors. First, EfficientNet inherently tends to favor the Asian group in the studied dataset, whereas EqEffNet prioritizes overall performance and fairness through an optimized balance across groups globally, without intentionally biasing specific groups. Second, patients can fall into the same or different groups depending on the identity attribute considered (i.e., gender or race), which can hinder a unified model from achieving consistent improvements across attributes and groups. Additionally, although FairDist consistently achieved the highest AUC and ES-AUC scores compared with other methods, including VGG, DenseNet, ResNet, ViT, EfficientNet, and EqEffNet with and without integrated fairness learning, it did not always improve the AUCs for every group. For example, in the prediction of MD Fast progression, FairDist achieved the best Female-group AUC of 0.75 (p < 0.05) among all methods, but its Male-group AUC (0.79) was lower than that of EqEffNet (0.90) (Fig. 4b). For TD Pointwise progression, FairDist excelled in neither the Female (AUC: 0.77) nor the Male (AUC: 0.73) group, where EfficientNet + Adversarial led the Female group (AUC: 0.79) and EqEffNet led the Male group (AUC: 0.76), respectively (Fig. 5b). Once again, this is because mitigating the unfairness of deep learning models involves a multi-objective optimization process that seeks to reduce group disparities without sacrificing overall performance24,25. This process inevitably rebalances different groups, increasing performance for some groups and decreasing it for others.

However, the proposed equitable model and its evaluation have several limitations. First, we studied the critical task of glaucoma progression prediction and also evaluated the model on the glaucoma detection task, but we have not evaluated model performance and equity on other disease progression tasks due to a lack of public datasets with demographic attributes. Additionally, we focused on OCT scans in this work, whereas other modalities, such as fundus images, should be integrated to study a multimodal fairness learning problem, especially since different data modalities may contribute unevenly to group disparities. Second, we primarily used the equity-scaled AUC to assess model equity, while other metrics such as demographic parity and equalized odds can also be used26,27. However, it is generally impossible to satisfy all such metrics simultaneously owing to their differing definitions and constraints, especially with highly imbalanced distributions of demographics and labels; addressing this could be an interesting future direction28. Third, we primarily built the equitable deep learning model on EfficientNet, since it generally outperformed the other baseline models for progression prediction in this work. However, it would be meaningful to explore more powerful architectures and training paradigms, such as adapting pretrained foundation models with low-rank model adaptation techniques to boost performance29. Fourth, we designed an equity-aware attention layer, which helped differentiate image feature importance for improved performance and equity; future work requires a comprehensive comparison of different unfairness mitigation strategies and of their generalizability to various demographic groups15. Last, the progression prediction dataset included in this evaluation is small, a critical challenge that hinders the development of equitable deep learning models. Future work will test the generality of the proposed model on larger datasets.

In conclusion, we proposed an equity-aware deep learning model with knowledge distillation to enhance the model performance and equity of glaucoma progression prediction. The model has been tested on MD Fast and TD Pointwise progressions, achieving improved performance compared with methods with and without fairness learning components. Our approach has the potential to improve demographic equity in glaucoma progression, and it can also be generalized to other disease progression prediction tasks.

Methods

Two datasets were used to develop deep learning models for glaucoma progression prediction. This study complied with the guidelines outlined in the Declaration of Helsinki. In light of the study’s retrospective design, the requirement for informed consent was waived.

Dataset description

In both the glaucoma detection and progression prediction datasets, glaucomatous status was defined based on a reliable Humphrey VF test, defined as a fixation loss of ≤33%, a false-positive rate of ≤20%, and a false-negative rate of ≤20%. A VF test conducted within 30 days of OCT was used to identify glaucoma patients. Criteria for the presence of glaucoma were a VF mean deviation less than −3 dB, coupled with abnormal results on both the glaucoma hemifield test and the pattern standard deviation20. For progression prediction, two criteria were used to assess glaucoma progression based on VF maps collected over a minimum of five visits within a six-year period30, with each map represented by a vector of 52 TD values ranging from −38 dB to 26 dB. The criteria are: (1) MD Fast Progression: eyes with an MD slope ≤ −1 dB; (2) TD Progression: eyes with at least three locations exhibiting a TD slope ≤ −1 dB. Note that the 500 samples included are baseline OCT scans, and the VF data used to define the labels are private.
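As a minimal sketch, the two progression labels above can be derived from per-visit VF summaries with ordinary least-squares slopes. This is not the authors' labeling code; the function names and the assumption that slopes are computed per year are ours:

```python
import numpy as np


def label_progression(years, md_values, td_matrix,
                      md_thresh=-1.0, td_thresh=-1.0, min_locations=3):
    """Label MD Fast and TD Pointwise progression from longitudinal VF data.

    years: visit times in years (at least five visits within six years)
    md_values: per-visit mean deviation in dB
    td_matrix: visits x 52 array of total-deviation values in dB
    """
    # Least-squares MD slope; np.polyfit returns [slope, intercept] for deg=1
    md_slope = np.polyfit(years, md_values, 1)[0]
    md_fast = bool(md_slope <= md_thresh)

    # Per-location TD slopes: polyfit fits all 52 columns at once for a 2D y
    td_slopes = np.polyfit(years, td_matrix, 1)[0]
    td_pointwise = bool((td_slopes <= td_thresh).sum() >= min_locations)
    return md_fast, td_pointwise
```

An eye with an MD decline of about 1.2 dB per visit-year would be labeled as MD Fast progression under this sketch.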

The proposed equity-aware deep learning model with knowledge distillation

The proposed approach includes two phases. First, we developed an equity-aware deep learning model, EqEffNet, for glaucoma detection using a large dataset (Fig. 1c). Then, the pretrained EqEffNet was generalized to a glaucoma progression prediction model, FairDist, through knowledge distillation (Fig. 1d). Specifically, EqEffNet uses EfficientNet as the backbone model and integrates a fair attention mechanism to extract features from the OCT scans. The EfficientNet architecture is organized into seven blocks, each containing multiple MBConv layers, where an MBConv layer is a mobile inverted bottleneck convolution that incorporates squeeze-and-excitation optimization. These blocks differ in filter depth and size, as reflected by their kernel dimensions and the number of layers within each block. Starting with an initial standard convolution layer, the network progresses through the MBConv blocks, which increase in complexity and depth. Each block represents a stage of feature extraction in which similar operations are applied to the input features. This hierarchical structure allows EfficientNet to manage computational resources efficiently while maximizing learning and representational capacity, making it highly effective for tasks requiring detailed image analysis and classification. However, EfficientNet may exhibit biases towards certain demographic groups (e.g., Asians and Blacks) and lead to notable group disparities. To address this, we introduced a fairness-aware attention layer that adjusts feature learning based on demographic attributes. Specifically, we utilized a multilayer perceptron (MLP) encoder to process the binary-encoded demographic attributes associated with the input scans. The learned attribute and image features were then used to calculate importance weights for each image feature.
Assuming \({{\bf{X}}}_{\mathrm{pred}}\) and \({{\bf{A}}}_{\mathrm{pred}}\) are the raw image and demographic attribute features, the fairness-aware attention layer can be defined as:

$${{\bf{h}}}_{\mathrm{pred}}\,=\,\mathrm{EfficientNet}\left({{\bf{X}}}_{\mathrm{pred}}\right)$$
(1)
$${{\bf{h}}}_{\mathrm{attr}}^{\mathrm{pred}}\,=\,\mathrm{MLP}\left({{\bf{A}}}_{\mathrm{pred}}\right)$$
(2)
$${{\bf{h}}}_{\mathrm{pred}}\,=\,{{\bf{h}}}_{\mathrm{pred}}{{\bf{W}}}^{{\bf{1}}}\bullet \mathrm{softmax}\left(\frac{{{\bf{h}}}_{\mathrm{attr}}^{\mathrm{pred}}{{\bf{W}}}^{{\bf{2}}}{{\bf{h}}}_{\mathrm{pred}}{{\bf{W}}}^{{\bf{3}}}}{\sqrt{{\rm{d}}}}\right)$$
(3)

where \({{\bf{h}}}_{\mathrm{pred}}\) and \({{\bf{h}}}_{\mathrm{attr}}^{\mathrm{pred}}\) represent the image and attribute features learned by the EfficientNet and MLP, respectively. \(d\) is the dimension of the latent image and attribute features. \({{\bf{W}}}^{{\bf{1}}}\), \({{\bf{W}}}^{{\bf{2}}}\), and \({{\bf{W}}}^{{\bf{3}}}\) are learnable weight parameters. With the fairness-aware attention layer, the importance of image features can be adjusted according to the identity attributes, yielding balanced attention and contributions to the final glaucoma detection and progression prediction outcomes. The glaucoma detection model EqEffNet was trained using the Glaucoma Detection Dataset of 10,000 OCT samples, with 60%, 10%, and 30% of the samples used for model training, validation, and testing, respectively.
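The forward pass of Eqs. (1)-(3) can be sketched in a few lines of numpy. The backbone is abstracted to a precomputed feature vector, and the two-layer MLP, batch shapes, and the elementwise interpretation of the attention product are our assumptions rather than the authors' implementation:

```python
import numpy as np


def softmax(x):
    """Numerically stable row-wise softmax."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def mlp_encoder(attrs, U1, U2):
    """Two-layer MLP for binary-encoded demographic attributes (Eq. 2)."""
    return np.maximum(attrs @ U1, 0.0) @ U2


def fair_attention(h_img, h_attr, W1, W2, W3):
    """Eq. (3): reweight image features with attribute-conditioned attention.

    h_img: (B, img_dim) backbone features; h_attr: (B, d) attribute features;
    W1, W3: (img_dim, d) and W2: (d, d) learnable projections.
    """
    d = W1.shape[1]
    scores = (h_attr @ W2) * (h_img @ W3) / np.sqrt(d)   # scaled compatibility
    return (h_img @ W1) * softmax(scores)                # per-feature importance
```

The softmax produces a per-sample importance distribution over the projected image features, so groups encoded differently in the attribute vector can emphasize different features.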

Next, the pretrained glaucoma detection model was used to initialize a glaucoma detection model that guided the training of a glaucoma progression prediction model, termed FairDist (Fig. 1d). Both the glaucoma detection and progression prediction models used EqEffNet to learn features; the two EqEffNet models took the same OCT scans as input but performed the glaucoma detection and progression prediction tasks, respectively. At the same time, the pretrained glaucoma detection model (as a teacher) empowered the progression prediction model (as a student) through a knowledge distillation process that minimizes the attribute and image feature distances based on the Kullback–Leibler (KL) divergence31:

$${D}_{\mathrm{KL}}\,=\,\alpha {D}_{\mathrm{KL}}^{\mathrm{img}}+\beta {D}_{\mathrm{KL}}^{\mathrm{attr}}$$
(4)

where \({D}_{\mathrm{KL}}^{\mathrm{img}}\) and \({D}_{\mathrm{KL}}^{\mathrm{attr}}\) represent the KL divergences of the image and attribute features, respectively. \(\alpha\) and \(\beta\) are hyperparameters that control the image and attribute feature similarities between the two EqEffNet models in FairDist. FairDist was trained using the Progression Prediction Dataset, which contains OCT scans with glaucomatous status and progression outcomes. 70% and 30% of the samples were used for model training and testing, respectively.
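Eq. (4) can be sketched as follows, under the assumption that teacher and student features are first converted to distributions with a softmax before computing the KL divergence; the temperature and smoothing constant are our additions, not details from the paper:

```python
import numpy as np


def softmax(x, tau=1.0):
    """Numerically stable row-wise softmax with temperature tau."""
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return z / z.sum(axis=-1, keepdims=True)


def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch, with eps smoothing for stability."""
    return float((p * np.log((p + eps) / (q + eps))).sum(axis=-1).mean())


def distill_loss(stu_img, tea_img, stu_attr, tea_attr, alpha=1.0, beta=0.05):
    """Eq. (4): D_KL = alpha * D_KL^img + beta * D_KL^attr."""
    d_img = kl_divergence(softmax(tea_img), softmax(stu_img))
    d_attr = kl_divergence(softmax(tea_attr), softmax(stu_attr))
    return alpha * d_img + beta * d_attr
```

With the defaults α = 1.0 and β = 0.05 reported in the parameter settings, the image-feature term dominates the distillation signal.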

Comparative methods, evaluation metrics, and statistical analysis

We evaluated the progression prediction performance and demographic equity of five popular deep learning models for processing medical images: VGG32, DenseNet33, ResNet34, Vision Transformer (ViT)35, and EfficientNet36. In addition, we compared with an adversarial training approach for mitigating model biases37. All deep learning modeling and statistical analyses were performed in Python 3.8 (available at http://www.python.org) on a Linux system. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. To assess demographic equity, we adopted the equity-scaled AUC (ES-AUC)18, defined as \({\mathrm{AUC}}_{\mathrm{ES}}\,=\,{\mathrm{AUC}}_{\mathrm{overall}}/(1\,+\,\Sigma |{\mathrm{AUC}}_{\mathrm{overall}}\,-\,{\mathrm{AUC}}_{\mathrm{group}}|)\), which balances overall AUC against performance disparities among groups. Statistical significance was evaluated using t-tests and bootstrapping to compare AUC and ES-AUC values between models with and without FAS. Bootstrapping provided confidence intervals and standard error estimates. Results with p < 0.05 were considered statistically significant.
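The ES-AUC can be computed directly from its definition. Below is a self-contained sketch that uses the rank-based (Mann-Whitney) formulation of the AUC; it assumes binary labels and no tied scores, and is our illustration rather than the reference implementation:

```python
import numpy as np


def auc(y_true, y_score):
    """Rank-based (Mann-Whitney) AUC; assumes binary labels and no tied scores."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def es_auc(y_true, y_score, groups):
    """Equity-scaled AUC: AUC_overall / (1 + sum |AUC_overall - AUC_group|)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    overall = auc(y_true, y_score)
    gap = sum(abs(overall - auc(y_true[groups == g], y_score[groups == g]))
              for g in np.unique(groups))
    return overall / (1.0 + gap)
```

When every group attains the overall AUC, the denominator is 1 and ES-AUC equals the overall AUC; larger group disparities shrink the score.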

Parameter settings

The CNN models, including VGG, DenseNet, ResNet, and EfficientNet, were trained using a learning rate of 1e−4 for 10 epochs with a batch size of 6. For the ViT, we followed settings from the literature38: training for 50 epochs with a layer decay of 0.55, weight decay of 0.01, dropout rate of 0.1, batch size of 64, and a base learning rate of 5e−4. All models were optimized using AdamW39. In FairDist, the default values for α and β were set to 1.0 and 0.05, respectively.
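For reference, the reported hyperparameters can be consolidated into a single configuration sketch; the dictionary names and keys are illustrative, not from the authors' code release:

```python
# Hyperparameters reported in this section; names and keys are illustrative.
CNN_CONFIG = {          # VGG, DenseNet, ResNet, EfficientNet
    "learning_rate": 1e-4,
    "epochs": 10,
    "batch_size": 6,
    "optimizer": "AdamW",
}
VIT_CONFIG = {          # ViT, following settings from ref. 38
    "epochs": 50,
    "layer_decay": 0.55,
    "weight_decay": 0.01,
    "dropout": 0.1,
    "batch_size": 64,
    "base_learning_rate": 5e-4,
    "optimizer": "AdamW",
}
FAIRDIST_CONFIG = {"alpha": 1.0, "beta": 0.05}  # distillation weights in Eq. (4)
```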