Introduction

Diabetic Retinopathy (DR), a microvascular complication of diabetes mellitus, has emerged as a leading cause of blindness among working-age adults (20–74 years) worldwide1,2. According to the International Diabetes Federation (IDF), 451 million adults (18–99 years old) have diabetes worldwide, and by 2045, that number is expected to reach 693 million3,4. Approximately half (49.7%) of people living with diabetes remain undiagnosed, significantly delaying critical ophthalmologist interventions5. Among diagnosed patients, global estimates indicate that 34.6% develop DR, with 10.2% progressing to vision-threatening stages1.

The epidemiological situation in Ethiopia reflects concerning trends, with diabetes prevalence ranging from 2.0% to 6.5% across regions6. Recent studies report DR prevalence of 19.48% among Ethiopian diabetic patients, with 10.7% having vision-threatening DR (VTDR)7,8. This poses substantial public health challenges given Ethiopia’s severe ophthalmologist shortage (1:1,200,000 ratio) and limited eye care access (available to only 26% of the population)9.

Current clinical guidelines recommend that type 1 diabetics undergo initial retinal examination 5 years post-diagnosis, while type 2 diabetics require immediate screening at diagnosis10,11. However, poor compliance persists due to multifactorial barriers including limited health literacy, inadequate infrastructure, and insufficient insurance coverage12,13. The diagnostic process itself remains labor-intensive, requiring ophthalmologists to manually screen fundus images for characteristic findings, including microaneurysms, intra-retinal hemorrhages, venous beading, exudates, and neovascularization14. With only 232,866 ophthalmologists globally serving millions of potential DR cases, this manual approach creates critical bottlenecks15.

Lesions indicate the severity level of DR14. They can be broadly classified into four categories: soft exudates, hard exudates (EX), hemorrhages (HM), and microaneurysms (MA). MA, an early sign of DR caused by weakening of the vessel walls, appears as tiny, round red dots on the retina; these dots have clear boundaries and are usually less than 125 micrometers in diameter16. In contrast, HMs appear as larger patches on the retina with irregular edges and diameters greater than 125 micrometers. There are two forms of HM: flame and blot17; flame refers to superficial hemorrhages, while blot indicates deeper ones. Hard EX are caused by blood leakage and appear as yellow patches with clear borders in the outer layers of the retina18. Soft EX are observable white ovals in the retina that result from swelling of nerve fibers16. HMs and MA typically appear as red lesions, whereas both types of EX appear as white or yellow lesions. The DR lesion details are shown in Fig. 1. These lesions are critical for ML models, as they serve as the primary features from which the models learn.

Fig. 1

Visualization of an unhealthy retina with its respective lesions19: (a) Macula, (b) Fovea, (c) Hard Exudate, (d) Hemorrhages, (e) Optic Disc, (f) Abnormal Blood Vessel Growth, (g) Aneurysms, (h) Soft Exudate.

AI-enabled systems, particularly leveraging ML and DL, have become useful in automating DR screening by analyzing retinal fundus images for key lesions such as microaneurysms, hemorrhages, and exudates20. Among these, convolutional neural networks (CNNs) have demonstrated remarkable success, offering superior diagnostic accuracy, faster processing, and greater consistency compared to traditional manual screening methods, thereby reducing dependency on specialized ophthalmologists21,22. These advancements facilitate large-scale, cost-effective screening programs, making them especially valuable in resource-limited settings like Ethiopia, where infrastructure and personnel constraints hinder traditional approaches23.

However, despite their transformative potential, existing AI models face significant challenges, including large model sizes, high computational demands, and excessive memory footprints. These limitations hinder real-time usability, particularly on edge devices such as portable fundus cameras, mobile health platforms, or tele-ophthalmology setups, resulting in high energy consumption20. While tele-ophthalmology initiatives in Africa have provided partial solutions, delays persist due to the need for centralized professional interpretation of images24. Overcoming these barriers is critical for deploying AI-enabled DR screening in remote, low-resource environments.

This study addresses these challenges by developing a lightweight model for DR screening. Our research leverages knowledge distillation to develop a lightweight student model that achieves good diagnostic performance while significantly reducing computational requirements and model complexity. By bridging the gap between AI advancements and practical deployment constraints, this research contributes to clinical practice in diabetes-related vision care and yields a solution well-suited for deployment in real-world, resource-constrained environments, thereby facilitating accessible and scalable DR screening.

The remaining sections of this paper are organized as follows. Section 2 reviews the related works. Section 3 discusses the methodology employed in our research. Section 4 presents the experimental setup, result analysis, and performance comparison. Section 5 highlights the contributions of the proposed models. Finally, Section 6 summarizes the key findings and challenges and outlines future directions for further research.

Related work

DR screening is an active research area focused on finding better techniques to assist physicians in diagnosing DR. As a result, several research papers have been published on DR screening, particularly in the context of binary and multi-class classification.

Anoop et al.25 and Ishtiaq et al.26 employed custom-designed CNN models for DR classification; both involve a large number of trainable parameters and are therefore resource-intensive25,26. Ishtiaq et al.27 designed a custom CNN model to extract complex patterns of retinal lesions and used a classical ML classifier for classification. This combination improves overall performance by leveraging CNNs for feature extraction and ML classifiers for classification, despite the model’s computational intensity. In contrast, Bala et al.28 developed a computationally efficient model resembling existing architectures29, using four dense convolutional blocks with shortcut connections, which help maintain gradients during back-propagation. This model has 1.1 million parameters, making it lighter28.

Pre-trained CNN models have also been used for binary classification of DR30,31,32,33. These models, including EfficientNet32, ResNet33, Inception-V330, and DenseNet31, are not architecturally identical but rely on the same fundamental CNN operations. While pre-trained models provide powerful feature extractors, fine-tuning their substantial number of trainable parameters on relatively few samples raises concerns, particularly given their high computational demands30,31,32,33. Beghriche et al.34 compared the performance of pre-trained models on DR classification, finding that fine-tuned XCeption outperformed DenseNet121 and MobileNetV234.

The integration of custom-designed CNN models with techniques like active deep learning35 and Siamese networks36 offers a robust solution for image classification, especially when data is limited. Qureshi et al.37 demonstrated active deep learning for DR classification, allowing the algorithm to select informative image patches, thereby optimizing model performance. Additionally, integrating Siamese networks with custom CNNs, along with hierarchical clustering of image patches for feature extraction, further enhances performance38. This combined approach has proven effective in overcoming data limitations and improving classification outcomes37.

Islam et al.39 employed knowledge distillation to transfer knowledge from a teacher model, a fusion of ResNet152V2 and the Swin Transformer, to a student model, XCeption, enhanced with a Convolutional Block Attention Module (CBAM). Despite using knowledge distillation, the teacher model remains resource-intensive, with 145.8 million parameters and 84.4 MB of memory, while the student model, although reduced, still retains 21.4 million parameters and 82 MB39.

Some studies40,41 have addressed the complexities of DR classification by re-categorizing inseparable classes, which simplifies the model design and handling of data features, improving DR grading accuracy.

The VGG model, known for its hierarchical architecture, has been extensively applied to DR classification42,43,44. Khan et al.44 enhanced VGG with stacked spatial pyramid pooling and network-in-network (NiN) layers, which improve scale invariance and non-linearity, both important for identifying DR at varying image scales. However, VGG is computationally demanding, and its depth can lead to vanishing gradients during training, prompting the use of genetic algorithms (GA) as an optimization tool, though GA is also resource-intensive45.

MobileNet and DenseNet offer computationally efficient alternatives for DR classification, especially on resource-constrained devices. MobileNet, designed for mobile and embedded applications, provides a lightweight option46, while DenseNet’s dense connectivity enhances supervision and reduces model complexity29,47. According to Ayala et al.48, DenseNet excels in parametric efficiency, while MobileNet’s lighter structure makes it more suitable for mobile DR screening applications. InceptionV3, with its inception modules for multi-scale lesion detection, has demonstrated efficacy in DR grading by capturing features at varying scales. Although segmenting images into smaller patches can improve feature extraction, it may be suboptimal because convolutional operations already capture localized information49,50. Advanced models such as InceptionResNetV2 and graph neural networks (GNNs) further expand DR classification capabilities51.

Study52 proposed a novel semi-local centrality measure that identifies influential nodes in complex networks by integrating multidimensional factors (SLCMF). Unlike traditional metrics, SLCMF integrates structural, social, and semantic factors, enhancing both accuracy and scalability. It employs distributed local subgraphs, redefines centrality using the average shortest path, and captures latent relationships through semantic graph embedding. On the other hand, an augmentation of binary grey wolf optimization through quantum computing methodology was used for vision-threatening DR53.

Study54 identified Lipocalin-2 (LCN2) as a key mediator of neuroinflammation in retinal ischemia-reperfusion injury, suggesting its potential as a biomarker for glaucoma. Additionally, study55 linked endocrine disruptors to diabetes through mitochondrial dysfunction, highlighting disruptions in oxidative phosphorylation and ROS generation. Technological advances include a CRDS-based breath analyzer56 for non-invasive metabolic monitoring and deep learning methods like CS-Net57 and AM-Net58 for real-time ultrasound super-resolution imaging. In metabolic regulation, study59 emphasized the role of selenoproteins, while study60 characterized pembrolizumab-induced uveitis as a treatable immune-related adverse event. Generally, these works54,55,56,57,58,59,60 underscore interdisciplinary progress in pathophysiology and precision diagnostics.

Furthermore, ML techniques offer promising accuracy for automated glaucoma detection by analyzing retinal images through preprocessing, feature extraction, and classification, providing valuable clinical support in identifying glaucomatous symptoms61. A hybrid model, ML and Nature-Inspired Model for Coronavirus (MLNI-COVID-19), combines ML and nature-inspired algorithms to enhance the classification and optimization of brain Magnetic Resonance Imaging (MRI) scans in COVID-19 patients, demonstrating improved diagnostic accuracy, sensitivity, and specificity62. However, computational complexity remains a challenge in resource-constrained environments. Soft-computing-based gravitational search optimization has been used for feature selection to eliminate unnecessary features, enhance performance, and reduce computational complexity in glaucoma prediction63. Similarly, a differential-evolution-based multi-objective feature selection approach built on genetic algorithms has been used to retain only the most important features64.

Methodology

Dataset

DL algorithms rely heavily on large datasets to understand image patterns that depict infections or lesions, as well as normal conditions. Training ML or DL models with substantial and high-quality datasets enhances model performance.

The Asian Pacific Tele-Ophthalmology Society (APTOS 2019)65 is one of the most widely used publicly available retinal fundus image datasets, published on Kaggle. It contains 3662 retinal fundus images with varying resolutions, collected from Aravind Eye Hospital in India. The dataset was specifically designed for DR screening, making it highly suitable for training and evaluating DR detection models. The class distribution of the dataset is shown in Fig. 2, providing insights into the prevalence of different DR severity levels within the dataset.

In addition to APTOS 2019, a primary dataset was collected from local eye clinic centers, including WAGA Ophthalmology Center, Biruh Vision Specialized Eye Care Center, and KENESER Specialized Eye Clinic. Retinal fundus images were captured at these centers using Topcon 3D OCT-1 Maestro 2 and DRI OCT Triton funduscopic machines. The collected images first underwent an expert-based filtering process to remove low-quality or noisy data. In the second stage, the remaining images were independently annotated by two experienced ophthalmologists. The dataset is categorized into three classes: No DR (normal retinal fundus), NPDR (non-proliferative diabetic retinopathy), and PDR (proliferative diabetic retinopathy). The class distribution is presented in Fig. 2.

Fig. 2

APTOS 2019 and Primary datasets class distribution.

The samples of retinal fundus images for each class are shown in Fig. 3, highlighting the lesions or features that distinguish one class from another.

Fig. 3

Severity level of diabetic retinopathy65: (a) No DR; (b) NPDR; (c) PDR.

Data preprocessing

The image pre-processing steps involve noise removal, quality enhancement, and preparation of retinal fundus images to be suitable for the model.

Retinal fundus images often contain black borders around the actual retina, which do not contribute useful features for class differentiation. To address this, the cropping process converts the image to grayscale, applies a threshold (set to 7) to create a binary mask, and retains only the areas with pixel values above the threshold, effectively removing irrelevant dark regions66. Additionally, a bi-linear interpolation down-scaling algorithm is used to reduce computational load and memory requirements while adjusting the image size to fit the model’s input size. Since fundus images are typically in RGB format, all channels are retained to capture comprehensive features, although the green channel is often most useful for highlighting blood vessels. Furthermore, contrast-limited adaptive histogram equalization (CLAHE) with a clipLimit of 2.0 and tileGridSize of (8, 8) is applied channel-wise to enhance image contrast. CLAHE improves local contrast, making features in the retinal image more visible and easier for the model to analyze. The pre-processing techniques are illustrated in Fig. 4 and Algorithm 1.

Fig. 4

Retinal data pre-processing.

Algorithm 1

Pseudocode for data processing and model training
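To make these steps concrete, the following is a minimal sketch of the cropping, down-scaling, and CLAHE stages using OpenCV. The function name and the 512\(\times\)512 target size are illustrative (the input size is described later); the parameters mirror those stated above (threshold 7, bilinear interpolation, clipLimit 2.0, tileGridSize (8, 8)):

```python
import cv2
import numpy as np

def preprocess_fundus(path, target_size=(512, 512), threshold=7):
    """Crop black borders, downscale bilinearly, and apply channel-wise CLAHE."""
    img = cv2.imread(path)  # BGR image
    # Crop: keep the bounding box of pixels brighter than the threshold.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mask = gray > threshold
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]
    img = img[rmin:rmax + 1, cmin:cmax + 1]
    # Bilinear down-scaling to the model's input size.
    img = cv2.resize(img, target_size, interpolation=cv2.INTER_LINEAR)
    # Channel-wise CLAHE with the stated parameters.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    channels = [clahe.apply(c) for c in cv2.split(img)]
    return cv2.merge(channels)
```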

The dataset was initially split into two subsets: 85% of the pre-processed data as the training dataset and 15% as the testing dataset. The training dataset was then further divided into 85% for training and 15% for validation. Following these splitting steps, data augmentation was applied to the training set to enhance dataset diversity, thereby improving model generalization and performance. This process was applied consistently to both the primary dataset and the APTOS 2019 dataset for model training.
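A minimal sketch of this split-then-augment pipeline is shown below, assuming pre-processed image arrays X and one-hot labels y; the stratification, random seed, and the specific augmentation transforms are illustrative assumptions, as the text does not specify them:

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 85/15 train/test split, then an 85/15 train/validation split of the training part.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y.argmax(axis=1), random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15,
    stratify=y_trainval.argmax(axis=1), random_state=42)

# Augmentation is applied to the training set only; these transforms are assumptions.
augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True,
                               vertical_flip=True, zoom_range=0.1)
train_flow = augmenter.flow(X_train, y_train, batch_size=32)
```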

Model

MobileNet is a family of lightweight CNN architectures optimized for efficient deployment on mobile and embedded devices46. In this study, a MobileNet-like structure is chosen for its ability to significantly reduce computational complexity and memory footprint through depthwise separable convolutions. This design makes the model highly suitable for resource-constrained devices such as smartphones or portable fundus cameras. Moreover, MobileNet has demonstrated competitive performance in various computer vision tasks, including medical imaging, while maintaining a compact architecture. These characteristics align with the goal of developing a lightweight, reliable DR screening model deployable in remote or low-resource settings. To further reduce the model’s computational complexity, knowledge distillation (KD) is used to transfer knowledge from a larger, pre-trained “Teacher” model to a smaller “Student” model. This process ensures that the smaller model achieves comparable performance while significantly reducing computational requirements, making it ideal for resource-constrained devices67,68. Figure 5 shows the architecture of the “Teacher” model and “Student” model.

Fig. 5

Teacher and Student model architectural differences.

A lightweight Student model was developed from MobileNet by simplifying the teacher network, reducing the number of network layers and the sizes of filters, as outlined by Gou et al.67. This alignment also minimizes the “model capacity gap”, where a significant difference can hinder the Student model’s ability to effectively gain knowledge from the Teacher67. Deeper blocks were then systematically trimmed, filter sizes adjusted, stride patterns modified, and a custom fully connected classification head added to achieve an optimal balance between computational complexity and predictive performance. These choices were specifically tailored for high-resolution (512\(\times\)512) medical images and resource-constrained deployment scenarios. Reducing the deeper layers of a Student model can preserve important feature-extraction ability while balancing computational efficiency and performance. Deeper layers frequently encode complex, high-level information, yet they are computationally intensive. Therefore, by removing deep layers and retaining the shallower layers, which capture the core structural and low-level elements, the Student model achieves a good balance between preserving useful information and minimizing overall size and latency. Our approach aligns with Wang et al.69, who explained that simplified models focus on retaining the main features while offloading complex, task-specific details to improve interpretability and performance on resource-constrained devices. The detailed Student model architecture is shown in Fig. 6 and Table 1.

Fig. 6

Architecture of the proposed Student model.

Table 1 Detailed architecture of the proposed student model.
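To illustrate the building block this architecture stacks, the following is a minimal Keras sketch of a depthwise separable convolution block and a five-block Student of the kind described above; the filter counts, strides, and head width are illustrative assumptions (the exact configuration is given in Table 1):

```python
import tensorflow as tf
from tensorflow.keras import layers

def ds_block(x, filters, stride=1):
    """One depthwise separable convolution block: depthwise 3x3 -> pointwise 1x1."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Illustrative Student: a stem, five depthwise separable blocks, GAP, and a
# two-layer fully connected head; filter counts and strides are assumptions.
inputs = layers.Input((512, 512, 3))
x = layers.Conv2D(8, 3, strides=2, padding='same', use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
for f, s in [(16, 1), (32, 2), (32, 1), (64, 2), (64, 2)]:
    x = ds_block(x, f, s)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(32, activation='relu')(x)
outputs = layers.Dense(3)(x)  # logits for No DR / NPDR / PDR
student = tf.keras.Model(inputs, outputs)
```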

The distillation of knowledge on MobileNet involves key components and steps as mentioned in Algorithm 2. The distillation process employs a composite loss function combining the standard task-specific loss with a distillation loss that encourages the Student to mimic the Teacher’s output distribution. The task-specific loss is computed using Categorical Cross-Entropy, while the distillation loss is calculated using the Kullback-Leibler (KL) divergence applied to softened probability distributions from both models.

Algorithm 2

Pseudocode for Knowledge Distillation

The Teacher model is a robust, pre-trained neural network that serves as the source of knowledge. It generates “soft labels”, softened probability distributions over classes derived from its logits, which provide richer information than hard labels. The Student model, on the other hand, is a smaller, lightweight network designed to replicate the performance of the Teacher model. The distillation loss is based on a composite loss function that integrates both soft and hard losses. Figure 7 shows the distillation process and how the losses are calculated.

Fig. 7

MobileNet knowledge distillation process.

The predictions of the Teacher and Student models are softened using a temperature parameter \(T\), as described in Eqs. (1) and (2). The application of the SoftMax function with \(T\) smooths the probability distributions of the Teacher, making them less sharp (peaky) and more informative. This softened output allows the Student model to better capture important patterns during training.

$$\begin{aligned} & \text {softLabels} ={\textit{softMax}}\left( \frac{{\textit{Logits}}_{teacher}}{T}\right) \end{aligned}$$
(1)
$$\begin{aligned} & \text {softPredictions} = {\textit{softMax}}\left( \frac{{\textit{Logits}}_{student}}{T}\right) \end{aligned}$$
(2)

To quantify how well the Student’s prediction distribution approximates the Teacher’s softened prediction distribution, the Kullback-Leibler (KL) Divergence is used. Therefore, the distillation loss is computed by applying KL Divergence to the softened labels (Eq. 1) and softened predictions (Eq. 2):

$$\begin{aligned} \text {distillation}\_\text{loss} = \sum _x {\textit{SoftLabels}}(x)\log \left( \frac{{\textit{SoftLabels}}(x)}{{\textit{SoftPredictions}}(x)} \right) \end{aligned}$$
(3)

Here, x indexes the output classes. The KL divergence quantifies how closely the Student’s softened probability distribution approximates that of the Teacher.

The total loss for training the Student model combines the task-specific loss (student_loss), which uses CategoricalCrossentropy, and the distillation loss. The total loss is defined in Eq. (4):

$$\begin{aligned} \text {loss} = \alpha \cdot \text {student\_loss} + (1 - \alpha ) \cdot \text {distillation\_loss} \end{aligned}$$
(4)

Here, \(\alpha\) is a hyperparameter that controls the trade-off between learning from the ground truth labels (student_loss) and learning from the Teacher’s knowledge (distillation_loss). When \(\alpha \rightarrow 1\), the Student model relies more heavily on the task-specific loss; when \(\alpha \rightarrow 0\), the Student model relies predominantly on the Teacher’s softened knowledge.
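A minimal TensorFlow sketch of this composite objective, combining Eqs. (1)–(4), is shown below; the function name is hypothetical, and the defaults \(T = 10\) and \(\alpha = 0.5\) follow the tuned values reported in the next paragraph:

```python
import tensorflow as tf

def distillation_losses(y_true, teacher_logits, student_logits, T=10.0, alpha=0.5):
    """Composite KD objective: alpha * hard-label loss + (1 - alpha) * soft loss."""
    # Task-specific loss on ground-truth labels using the Student's logits.
    student_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(
        y_true, student_logits)
    # Temperature-softened distributions, Eqs. (1) and (2).
    soft_labels = tf.nn.softmax(teacher_logits / T)
    soft_predictions = tf.nn.softmax(student_logits / T)
    # KL divergence between the softened distributions, Eq. (3).
    distillation_loss = tf.keras.losses.KLDivergence()(soft_labels, soft_predictions)
    # Weighted total, Eq. (4).
    return alpha * student_loss + (1.0 - alpha) * distillation_loss
```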

While the Distiller class allows flexibility in selecting \(\alpha\) and temperature \(T\), systematic hyperparameter tuning was conducted using Keras-Tuner’s Hyperband algorithm to identify optimal values. The search space for \(\alpha\) was set to [0.1, 0.9] with increments of 0.1, and for temperature \(T\) to [0, 50] with increments of 5, optimizing for validation accuracy. The best-performing configuration was found to be \(\alpha = 0.5\) and \(T = 10\), which was then used for knowledge distillation from the Teacher model, leading to improved Student model performance.
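The tuning setup might look like the following sketch, where Distiller, make_student(), and teacher are assumed to be defined as in Algorithm 2 and Fig. 7; the temperature grid here starts at 5 rather than 0 to avoid division by zero in Eqs. (1) and (2):

```python
import keras_tuner as kt
import tensorflow as tf

def build_distiller(hp):
    # Search alpha in [0.1, 0.9] (step 0.1) and T in [5, 50] (step 5).
    alpha = hp.Float('alpha', min_value=0.1, max_value=0.9, step=0.1)
    temperature = hp.Int('temperature', min_value=5, max_value=50, step=5)
    distiller = Distiller(student=make_student(), teacher=teacher)
    distiller.compile(
        optimizer='adam',
        metrics=['accuracy'],
        student_loss_fn=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
        distillation_loss_fn=tf.keras.losses.KLDivergence(),
        alpha=alpha,
        temperature=temperature)
    return distiller

tuner = kt.Hyperband(build_distiller, objective='val_accuracy', max_epochs=30)
tuner.search(X_train, y_train, validation_data=(X_val, y_val))
```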

Experiment and result analysis

Experimental setup

The proposed model was trained and tested on a machine equipped with a 13th Gen Intel(R) Core(TM) i7-13700H CPU (2400 MHz) and an 8 GB NVIDIA GPU (driver 546.12, CUDA 12.3). The proposed model was implemented using the Keras API on the TensorFlow framework, version 2.10.0.

Performance evaluation metrics

Performance metrics are quantitative measures used to evaluate our model’s effectiveness on the dataset. The formulas for the metrics used in our study are as follows:

$$\begin{aligned} Accuracy= & \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
(5)
$$\begin{aligned} Precision= & \frac{TP}{TP + FP} \end{aligned}$$
(6)
$$\begin{aligned} Recall= & \frac{TP}{TP + FN} \end{aligned}$$
(7)
$$\begin{aligned} F1= & \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(8)
$$\begin{aligned} AUC= & \int _{0}^{1} TPR \, d(FPR) \end{aligned}$$
(9)
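These metrics can be computed with scikit-learn as in the sketch below, assuming integer test labels y_true and predicted class probabilities y_prob; macro averaging and the one-vs-rest AUC are assumptions for the multi-class case:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = y_prob.argmax(axis=1)  # hard predictions from class probabilities
results = {
    'accuracy':  accuracy_score(y_true, y_pred),
    'precision': precision_score(y_true, y_pred, average='macro'),
    'recall':    recall_score(y_true, y_pred, average='macro'),
    'f1':        f1_score(y_true, y_pred, average='macro'),
    # Eq. (9): area under the ROC curve, one-vs-rest for the ternary case.
    'auc':       roc_auc_score(y_true, y_prob, multi_class='ovr'),
}
```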

Result analysis

Teacher model selection

This research utilizes MobileNet, a family of lightweight CNNs optimized for efficient use on mobile and embedded devices. Both MobileNet and MobileNetV2 are evaluated to find the best Teacher model, one that can effectively lead to the development of a lightweight, accurate Student model. During the Teacher model selection phase, the models are evaluated on ternary classification: No-DR, Non-proliferative DR, and Proliferative DR. The model configurations, including the width multiplier \(\alpha\) evaluated at various values, are detailed in Table 2. Our experiments showed that MobileNet with an \(\alpha\) value of 0.25 achieved the optimal combination of performance and efficiency, making it the ideal Teacher model for our research. Table 4 and Fig. 8 show the results of MobileNet and MobileNetV2 at various \(\alpha\) values. These visualizations provide a clear overview of each model’s strengths and help us identify a high-performing Teacher model, which will later be refined into an even lighter Student model suitable for real-time application on computationally constrained devices.

Table 2 Summary model configurations for Teacher, Student model, and Distiller.
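Instantiating a candidate Teacher at a given width multiplier might look like the sketch below; the classification head and weights=None are assumptions (ImageNet weights in Keras MobileNet are tied to the stock input sizes rather than 512\(\times\)512):

```python
import tensorflow as tf

def make_teacher(alpha=0.25, num_classes=3):
    """Candidate Teacher: Keras MobileNet at width multiplier `alpha`."""
    base = tf.keras.applications.MobileNet(
        input_shape=(512, 512, 3), alpha=alpha, weights=None, include_top=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    outputs = tf.keras.layers.Dense(num_classes)(x)  # class logits
    return tf.keras.Model(base.input, outputs)

teacher = make_teacher(alpha=0.25)  # the best-performing configuration
```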

Furthermore, the model training setup and configuration were thoughtfully designed to ensure smooth convergence and optimal generalization of the proposed student model enhanced through knowledge distillation. To support effective learning and systematically monitor the model’s performance, several callback mechanisms were integrated during training. Specifically, a ReduceLROnPlateau callback was employed to dynamically adjust the learning rate by a factor of 0.2 when the validation loss failed to improve for three consecutive epochs, with the minimum learning rate capped at \(1 \times 10^{-12}\). To mitigate overfitting, EarlyStopping was applied with a patience of 13 epochs, ensuring that the model reverted to the best-performing weights observed during training. The complete details of these configurations are summarized in Table 3.
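These callback settings translate directly into Keras, as in the following sketch; the epoch budget is an illustrative assumption:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

callbacks = [
    # Scale the learning rate by 0.2 after 3 epochs without val-loss improvement.
    ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-12),
    # Stop after 13 stagnant epochs and revert to the best weights seen.
    EarlyStopping(monitor='val_loss', patience=13, restore_best_weights=True),
]
history = distiller.fit(train_flow, validation_data=(X_val, y_val),
                        epochs=100, callbacks=callbacks)
```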

Table 3 Model Training Configuration and Callback Settings.
Table 4 Comparison of Teacher models selection on ternary classification on APTOS 2019 dataset.
Fig. 8

The performance of MobileNet and MobileNetV2 at different alpha values on APTOS 2019 dataset.

Binary classification

In binary classification, our objective is to evaluate the performance of the model in classifying retinal fundus images into two groups: ’No_DR’ and ’DR’. The model performed well in the binary classification task, indicating that it is capable of accurately distinguishing the two classes. This high degree of separability demonstrates the model’s ability to efficiently differentiate normal retinal fundus images from those containing DR features. The Teacher model demonstrates superior performance on the APTOS 2019 and primary datasets compared to the Student models, both with and without KD. Notably, the Student model with KD achieves performance comparable to the Teacher model across these datasets. This outcome shows the effectiveness of KD techniques in successfully transferring insights from the Teacher model to the Student model. Additionally, training on the primary dataset begins by initializing the model with the weights obtained from the APTOS 2019 dataset. This approach transfers the model’s generalization capacity from the APTOS 2019 dataset to the primary dataset, ensuring its robustness and effectiveness when deployed in a local eye clinic.

Table 5 presents the performance of the Teacher and Student models, with and without knowledge distillation, on the APTOS 2019 dataset. Similarly, Table 6 shows primary datasets. The Student model trained with KD consistently outperformed students without KD, achieving an accuracy of 98.36% on APTOS 2019 and 93.20% on the primary dataset. These results demonstrate the effectiveness of KD in enhancing model generalization and accuracy, especially under limited data conditions. While all models performed best on APTOS 2019, the performance drop on the primary dataset is likely due to the small dataset and variability. Nonetheless, KD proved beneficial in transferring knowledge from the Teacher to the Student model. Additionally, the model’s performance is shown in detail in Fig. 9.

Table 5 Models Performance of Binary Classification on APTOS 2019 Dataset.
Table 6 Model’s performance of binary classification on a primary dataset.

Cross-validation is a resampling technique used to assess the performance of a machine learning model by partitioning the dataset into multiple subsets. This method ensures that the model is evaluated on unseen data, reducing the risk of overfitting and providing a more reliable estimate of its generalization performance.

Furthermore, cross-validation is employed to evaluate the robustness of the proposed model as shown in Table 7. Given the high class imbalance in the dataset, the StratifiedKFold method is utilized to maintain the original class distribution across training and validation splits, ensuring a fair evaluation.
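A sketch of this evaluation with scikit-learn’s StratifiedKFold follows; the number of folds, the epoch budget, and the build_student_with_kd() helper are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

y_onehot = tf.keras.utils.to_categorical(y)   # y holds integer class labels
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for tr_idx, va_idx in skf.split(X, y):        # splits preserve class ratios
    model = build_student_with_kd()           # fresh model per fold (assumed helper)
    model.fit(X[tr_idx], y_onehot[tr_idx],
              validation_data=(X[va_idx], y_onehot[va_idx]),
              epochs=50, verbose=0)
    # Assumes the model was compiled with accuracy as its only metric.
    _, acc = model.evaluate(X[va_idx], y_onehot[va_idx], verbose=0)
    fold_scores.append(acc)

print(f"CV accuracy: {np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}")
```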

Table 7 Comparison of Student with knowledge distillation model cross-validation on APTOS 2019 and Primary Dataset.
Fig. 9

Comparison of models’ performances on APTOS 2019 and primary datasets for binary classification.

The model’s learning curve provides useful information about the training behavior of the Teacher model and of the Student model with and without KD, indicating whether the model has reached optimal performance. It also helps detect signs of over-fitting or under-fitting and shows whether the model is failing to learn or progressively improving over time. The learning curves of our models demonstrate strong performance and convergence to an optimum, with no signs of under-fitting or over-fitting, indicating well-fitted models. However, the learning curves on the primary dataset show slight fluctuations, indicating difficulty in capturing the unique aspects of the primary data, particularly for the Student model without KD. Despite these fluctuations, the curves gradually converge, indicating that the model has been successfully trained to distinguish DR from normal retinal fundus images. Figures 10 and 11 show the learning curves of the Student model with KD, whereas Figs. 17 and 18 show the learning curves of the Teacher and Student models without knowledge distillation, on the APTOS 2019 and primary datasets, respectively.

Fig. 10

Training curves of student with KD on binary classification using APTOS dataset; (a) Accuracy, (b) AUC, and (c) Loss.

Fig. 11

Training curves of student with KD on binary classification using primary dataset; (a) Accuracy, (b) AUC, and (c) Loss.

A confusion matrix provides a clearer picture of class-wise model performance, especially on imbalanced datasets, making it essential for targeted model improvements. Our model’s confusion matrix for binary classification shows that it identifies the class of samples with little confusion on both the APTOS 2019 and primary datasets. The model predicts True Positives and True Negatives well, while False Positives and False Negatives occur rarely, showing that its predictions are very close to the actual labels. Figure 12 shows the confusion matrices of the Teacher model and of the Student model with and without KD on each dataset.

Fig. 12
figure 12

Confusion matrices comparing model performance across datasets: (i-iii) APTOS 2019, (iv-vi) Primary Dataset.

Ternary classification

The main objective of ternary classification is to build a robust model for automated grading of retinal fundus images, in which images are classified into DR categories. This classification is crucial for the early detection and treatment of DR. The three categories considered for ternary classification are as follows70:

  1. No DR: The retina is free of detectable signs of DR, indicating a healthy retinal condition.

  2. Non-proliferative DR (NPDR): This category represents blood or fluid leakage within the retina at the back of the eye. It encompasses mild, moderate, and severe grades of diabetic retinopathy.

  3. Proliferative DR (PDR): This category identifies advanced DR, characterized by the growth of abnormal blood vessels. PDR poses a high risk of vision impairment if left untreated and requires timely ophthalmologist intervention.

In our experiment, the proposed model achieves good performance across all datasets, demonstrating its robustness in ternary classification. The results are presented in Tables 8 and 9 for the APTOS 2019 and primary datasets, respectively. The performance on the primary datasets is slightly lower than on the APTOS 2019 dataset, as the primary datasets are smaller. Nevertheless, the proposed model shows results comparable to the Teacher model, achieving up to 93.09% accuracy on the APTOS 2019 dataset and 85.51% on the primary dataset. The scores, particularly in recall (the ability to correctly identify all relevant cases) and F1-score (which balances precision and recall), indicate reliable classification and strong generalization capability. Additionally, the model’s performance is shown in detail in Fig. 13. Despite a 74% reduction in trainable parameters and floating-point operations (FLOPs), the model maintains strong performance. As our main objective is to build a model with low computational intensity while maintaining performance comparable to the pre-trained Teacher model, a lightweight Student model is successfully developed with strong performance in ternary classification as well (Table 10).

Table 8 Model’s performance of ternary classification on APTOS 2019 dataset.
Table 9 Model’s performance of ternary classification on a primary dataset.

Furthermore, cross-validation is also employed for the ternary classes to evaluate the class-wise performance and robustness of the proposed model, as shown in Table 10.

Table 10 Comparison of Student with knowledge distillation model cross-validation on APTOS 2019, and Primary Dataset.
Fig. 13

Comparison of model performances on different datasets.

The learning curves illustrate the training patterns of our models, showing smooth convergence for each model. They also depict the learning behavior of the Teacher model and the Student models, both with and without KD. Figures 14 and 15 show the learning curves of the Student model with KD, whereas Figs. 19 and 20 show those of the Teacher and Student models without KD on the APTOS 2019 and primary datasets, respectively.

Fig. 14

Training curves of student with KD on ternary classification using APTOS dataset; (a) Accuracy, (b) AUC, and (c) Loss.

Fig. 15
figure 15

Training curves of student with KD on ternary classification using primary dataset; (a) Accuracy, (b) AUC, and (c) Loss.

On the other hand, the confusion matrices illustrate the classification performance of the models. The Student model without knowledge distillation (KD) on the APTOS 2019 dataset shows significant confusion, indicating difficulty in distinguishing between classes. The confusion matrices for each model across all datasets are presented in Fig. 16. Overall, the results demonstrate that the models have effectively learned and extracted essential features and lesion patterns during training.

Fig. 16

Confusion matrices for ternary classification across datasets: (i-iii) APTOS 2019 dataset, (iv-vi) Primary dataset.

Performance comparison

Binary classification

The performance comparison of the proposed model with various state-of-the-art techniques is shown in Table 11. While many of the reviewed models demonstrate competitive accuracy, our proposed Student model with knowledge distillation (KD) achieves a superior balance between classification performance and computational efficiency.

Chetoui et al.32 employed the EfficientNet-B7 architecture, achieving a recall of 98.1%. However, its large model size, approximately 66.7 million trainable parameters, renders it impractical for deployment on resource-constrained devices. In contrast, our Student model with KD achieves a nearly equivalent recall of 98.18% while using only 71,362 parameters, offering a drastic reduction in complexity.

Similarly, Anoop et al.25 achieved 94.6% accuracy and 86% recall using a CNN model with an enormous 184 million parameters. Despite its high accuracy, the substantial parameter count limits scalability. Our Student model with KD not only surpasses this model in accuracy (98.18%) and recall (98.18%) but does so with a model that is more than 2,500 times smaller.

Bala et al.28 proposed a CNN model with 1.1 million parameters, reporting 97.54% accuracy and an F1-score of 0.97. While their model is relatively lightweight compared to others, our KD-enabled Student model achieves comparable or better performance with less than 7% of the parameters, further emphasizing its efficiency.

In another notable work, Islam et al.39 utilized a hybrid ResNet152V2 + Vision Transformer (ViT) as a Teacher model, comprising 145.8 million parameters and yielding 95.15% accuracy. Our Teacher model, based on MobileNet, achieves a higher accuracy of 98.55% using only 279,378 parameters, underscoring our model’s efficiency without compromising accuracy. Furthermore, their corresponding Student model, built using XCeption with CBAM, achieved 99% accuracy with 21.4 million parameters. In comparison, our KD-based Student model achieves a competitive 98.18% accuracy with just 71,362 parameters.

Table 11 Binary classification performance on APTOS 2019 datasets.

Ternary classification

The performance comparison of the proposed model with state-of-the-art techniques for ternary classification is shown in Table 12. The results demonstrate that our proposed models consistently maintain good classification performance on the APTOS 2019 dataset while significantly reducing computational requirements. Although existing studies have achieved good results, many involve models with a large number of trainable parameters, limiting their practicality in resource-constrained environments.

Athira et al.40 utilized a ResNet50 model to achieve 94% accuracy, precision, and recall, using 25.6 million parameters. In comparison, our Student model with KD achieves a closely matching performance of 93% across all metrics, while requiring only 71,491 parameters, demonstrating a substantial reduction in model size.

Rao et al.73 proposed an InceptionResNet model and reported 88% accuracy, precision, recall, and F1-score with 55.9 million parameters. Our Student model with KD outperforms this model in all metrics, with nearly 780 times fewer parameters.

Kobat et al.74 used a DenseNet architecture combined with a Cubic SVM classifier, achieving 93.85% accuracy and strong precision. Butt et al.41 introduced a hybrid model combining GoogleNet, ResNet-18, and SVM, reporting 89% across metrics. Although effective, these models likely carry higher computational loads compared to our student model with KD.

Our Teacher model, based on MobileNet, achieves 94% across all metrics on the APTOS dataset with only 279,378 parameters. Meanwhile, the Student model without KD shows a notable performance drop (70% accuracy and recall), highlighting the effectiveness of knowledge distillation in enhancing lightweight models. The KD-based Student model demonstrates strong performance (93%) while remaining highly efficient.

These results show that our proposed KD-based Student model provides a good balance of performance and efficiency. Its lightweight architecture makes it particularly well-suited for real-time and mobile applications, where memory and processing power are often limited.

Table 12 Ternary classification performance on APTOS 2019 datasets.

Discussion

A key limitation of many ML models is that they are heavyweight. In this regard, our proposed model is lightweight yet achieves performance comparable to the heavy, computationally intensive models proposed in prior work. Furthermore, knowledge distillation has significantly aided the transfer of knowledge from the Teacher model to the Student model: the large performance gap between the Student model trained from scratch and the Student model trained with knowledge distillation demonstrates the significant role of KD techniques in this transfer. Additionally, the proposed model has been evaluated on APTOS 2019 and on primary datasets collected from local eye clinic centers; its performance on the primary datasets shows the robustness of the model across different datasets.

Thorough experiments on the separability of NPDR severity levels showed that the distinctions between them often lead to overlaps in the key image features (lesions). Furthermore, a case study presented at the TensorFlow Dev Summit 201775 demonstrated that even domain experts (ophthalmologists) may provide varying gradings for the same retinal images, and these grading inconsistencies are particularly challenging in cases of NPDR. Therefore, for better alignment with clinical practice, clarity, and appropriateness of medication and treatment for each category, ternary classification was chosen as our multi-class experiment.

On the other hand, t-SNE was employed to demonstrate the model’s ability to cluster the test dataset effectively. We visualized the feature representations extracted from the Global Average Pooling (GAP) layer and the Output layer to assess how well the models are clustering the data. In the binary classification task, the models show a clear ability to group the test dataset according to their respective classes, demonstrated on the APTOS 2019 and primary dataset in Figs. 21, 22, 23, and 24. However, in the ternary classification task, the models exhibit slight confusion when distinguishing between the three classes, particularly the Student model without KD. This indicates that while the model performs well in binary classification, it faces more challenges when dealing with ternary classes.

To further validate these observations, the Silhouette Score and Davies–Bouldin Index (DBI) for both binary and ternary classification on the primary and APTOS 2019 datasets are shown in Tables 14, 15, 16, and 17. These metrics quantitatively assess the quality of grouping in the features extracted from the GAP and output layers of each model. The Output layer consistently achieves better feature separability than the GAP layer and the raw test dataset. Furthermore, the Student model with KD generally demonstrates superior clustering behavior compared to the Student model without KD, supporting the effectiveness of KD in improving the model’s feature representations. While the model demonstrated good feature separability for classifying No DR and DR cases in binary classification, its performance notably decreased for ternary classification, indicating slight confusion in capturing lesion variations between categories such as PDR and NPDR.
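The feature extraction and cluster-quality evaluation can be sketched as follows; the GAP layer name is an assumption that depends on how the model was built:

```python
import tensorflow as tf
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Extract features from the GAP layer of a trained model; the layer name
# 'global_average_pooling2d' is an assumption.
feature_model = tf.keras.Model(
    model.input, model.get_layer('global_average_pooling2d').output)
features = feature_model.predict(X_test)

# 2-D t-SNE embedding for visual inspection of class clusters.
embedded = TSNE(n_components=2, random_state=42).fit_transform(features)

# Cluster-quality metrics against the true labels (higher Silhouette and
# lower DBI indicate better separability).
print('Silhouette score:', silhouette_score(features, y_test))
print('Davies-Bouldin index:', davies_bouldin_score(features, y_test))
```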

Knowledge distillation is a technique that enables a smaller Student model to learn from a larger, more complex Teacher model. Through this process, the Student model captures essential knowledge and approximates the performance of the Teacher model, even with reduced computational complexity. As a result, knowledge distillation produces a highly efficient Student model that is compact enough for deployment on resource-constrained devices. Table 13 shows the detailed model computational complexity of the teacher model and the proposed student model on the primary dataset.

Table 13 Comparison of Teacher and Student Model Computational Complexity.

Generally, the main strengths and contributions of the study presented in this article include:

  • A lightweight student model was developed from MobileNet, by the principle of simplifying the teacher network by reducing the number of network layers and the sizes of filters, as outlined by Gou et al.67.

  • The significance of knowledge distillation in building lightweight models with performance comparable to the Teacher model.

  • The robustness of the proposed model, demonstrated by evaluating its performance on the APTOS 2019 dataset and the primary dataset for both binary and ternary classification.

Conclusion and future work

This study presented a knowledge distillation technique to transfer knowledge from the Teacher model to the proposed Student model. The Student model comprises only five Depthwise Convolutional Blocks, followed by two fully connected layers for classification. With knowledge distillation, the proposed model demonstrated promising results, whereas the Student model without knowledge distillation struggled to perform effectively.

The models were evaluated on APTOS 2019 and primary datasets, demonstrating robustness across diverse datasets. A comparison with state-of-the-art techniques revealed that the proposed model achieves comparable performance while being significantly less resource-intensive. For binary classification, our proposed model achieved an accuracy of 98.38% on the APTOS 2019 dataset. Furthermore, the student model with knowledge distillation achieved an accuracy of 93.03% for ternary classification on APTOS 2019.

Additionally, the proposed model demonstrates strong class separability in feature space, as visualized using t-SNE applied to features extracted from the Global Average Pooling (GAP) and Output layers on the test data.

The findings of this study show that the proposed lightweight model with knowledge distillation achieves good performance and is suitable for deployment in resource-constrained devices. Future work will focus on enhancing the model’s performance using alternative knowledge distillation techniques like FitNets, Hint-based KD, self-distillation, and using Neural Architecture Search or pruning for selecting important nodes, an algorithmic solution for inseparable classes, and employing interpretability techniques to enhance the understanding and clarity of the model’s predictions.