Introduction

Eye diseases represent an enormous vision-health challenge, affecting billions of people worldwide and remaining a leading cause of visual impairment and blindness. According to WHO estimates, approximately 2.2 billion people suffer from vision loss or blindness, with common disorders such as glaucoma, cataracts, and DR being primary contributors to this burden1. DR, a complication of diabetes mellitus, damages the blood vessels of the retina and can lead to critical loss of sight when treatment is missed. Glaucoma involves the gradual deterioration of the optic nerve, and its symptoms typically remain unnoticed during the early stages2. Cataracts, which cloud the eye’s natural lens, are a major cause of lasting blindness. To protect individual well-being and vision health, it is essential to detect eye disease as early as possible and to take pre-emptive measures against sight loss. The progress of artificial intelligence (AI), especially DL, now allows us to build fusion-based architectures that identify and categorize such diseases in a timely manner. A decade ago, traditional diagnostic approaches were widely relied upon by doctors; these approaches were prone to misinterpretation of eye disease conditions, and in some cases incorrect diagnoses became a contributing cause of visual blindness. They also markedly delayed treatment and required certified medical personnel to carry out the entire process efficiently and effectively.

Image classification approaches hold strong potential in medical image analysis (MIA), enabling reliable techniques for analyzing medical images pixel by pixel, such as MRI, X-ray, or CT scans. These approaches aim to address the fundamental problems encountered with traditional methods. Numerous tools and techniques have emerged in recent years, including Convolutional Neural Networks (CNNs), which are widely used for image analysis tasks. However, CNNs still have limitations in fully meeting the specific needs of medical diagnosis and clinical treatment. For instance, retinal fundus images require laser-sharp focus to accurately identify individual lesions before making disease predictions. One important requirement is to capture diverse features from fundus images in order to identify and classify multi-class eye diseases3, 4, including DR, which is increasingly found in younger patients5, 6, and cataract, a prime cause of ocular impairment7, 8, 9. Moreover, myopia has been shown to increase the risk of developing cataracts, making it a significant contributing factor in the progression of other eye diseases; it is estimated that approximately 50% of the world’s population will have myopia by 205010, 11. Continued innovation in image-classification algorithms and the evolution of graphics processing technology and computing infrastructure in recent years have catalyzed the introduction of multiple DL frameworks12. EfficientNetB013 models exhibit a clear advantage in image-recognition performance: these CNNs require comparatively few resources thanks to compound scaling, which balances depth, width, and input resolution. These are the reasons researchers often choose EfficientNet-based models for MIA. Likewise, InceptionV314 and AlexNet15 have established resilient performance records in computer-vision tasks.

The main contributions of this research on eye disease classification are as follows:

  • Single CNN architectures frequently miss critical diagnostic patterns due to their in-built architectural constraints. Consequently, we propose a dual-backbone fusion framework that combines complementary strengths from multiple architectures.

  • We have systematically evaluated all 12 fusion configurations by pairing EfficientNetB013 with ResNet5016, InceptionV314, and AlexNet15, assessing each combination across four fusion schemes: concatenation, element-wise summation, weighted fusion, and majority voting. Every architecture provides distinct competencies: EfficientNetB0 delivers efficient universal feature extraction, ResNet50 provides deep residual learning, InceptionV3 captures multi-scale patterns, and AlexNet offers computational efficiency.

  • We have evaluated all fusion configurations through rigorous internal and external assessment on distinct datasets17, 18, 19 to mimic real-world generalization. To ensure clinical accountability, we have employed Score-CAM (Score-weighted Class Activation Mapping)20, 21, 22 and Local Interpretable Model-agnostic Explanations (LIME)23 as explainability and interpretability techniques that reveal which eye regions drive diagnostic decisions24.

  • This comprehensive study highlights fusion-based architectures as accurate, interpretable, and robust solutions for multi-class eye disease categorization across normal, cataract, glaucoma, and DR cases, suitable for real-world medical deployment.

The rest of the article is organized as follows. Section 2 reviews past studies on eye disease classification and notes the contribution of each author. Section 3 explains the methods and datasets, covering how we gathered, arranged, split, and prepared the images, and describes the fusion models and the training setup. Section 4 presents the results, shows how each fusion strategy affects the 12 models, and uses Score-CAM, LIME, and SLIC to explain how the models make decisions; we also compare the proposed models with existing works to assess their strength. Section 5 ends with concluding remarks and ideas for future improvements.

Literature review

Traditional approaches in eye disease detection

Ten years ago, doctors relied on traditional techniques to identify and recognize specific types of retinal disease. However, these techniques often introduced significant errors, sometimes leading to serious flaws in clinical treatment. The problems were not restricted to treatment mistakes; precise and deep patterns critical for accurately cataloguing the diseases were also missed. At that time, manual preprocessing was heavily used, including optic disc localization (ODL), vessel segmentation (VS), and microaneurysm identification. Hand-engineered feature extraction algorithms such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), and handcrafted texture analysis were also applied25. These algorithms are effective under controlled settings, when fundus image quality is excellent and variance is limited, and they have solid track records. However, the problems we face today include uneven datasets, varied data sources, and image quality that shifts depending on the camera or smartphone employed. Even current algorithms struggle to overcome these issues, resulting in high errors in production and a large data-preprocessing overhead when forecasting eye disease classes26.

Modern models can predict diseases by analyzing signs and patient history, opening new avenues for healthcare and eye examination. Studies show that early detection helps prevent severe damage and blindness. Rural areas often lack eye doctors, limiting access to care, but AI technologies can help bridge this gap and offer safe guidance in such remote locations27. Manual checks, relying on human notes, are prone to errors28.

DL in ophthalmology

Artificial Intelligence, especially deep learning (DL), has revolutionized eye care following two pivotal studies29, 30. These studies revealed that convolutional neural networks (CNNs) can detect diabetic retinopathy (DR) as effectively as, or even better than, human specialists, igniting new developments in medical AI and eye care31. Another study used optic nerve images and visual field data to detect glaucoma with remarkable accuracy32, reshaping the direction of subsequent research. A subsequent review33 examined deep learning’s application to fundus images; it covered tasks such as segmenting and classifying eye problems and highlighted the absence of standardized data, which hinders progress.

Fusion techniques in MIA

Accuracy is crucial in detecting eye diseases, but challenges persist. Missing data and low image clarity hinder model performance. Ensemble and fusion methods address these issues by combining features from multiple models before the final decision step. Numerous studies test many models, but strong head-to-head comparisons are rare. Work on EfficientNet shows high speed & accuracy, which motivated trials that pair EfficientNet with ResNet, InceptionV3, and AlexNet. Each model adds its own value: ResNet is known for enabling the training of deep neural networks (DNNs) with residual connections that circumvent the vanishing gradient problem; InceptionV3 is acknowledged for capturing multi-scale features through its inception modules; and AlexNet is an older model but still offers a reliable baseline for many medical applications. The researchers’ main aim is an explainable model with a good accuracy trade-off. One survey34 covers pixel, feature, and decision fusion, also discussing attention models such as transformers & generative models that capture more detail; it calls for clear model output in heart, brain, and cancer care & warns about data risks and system attacks. Another study35 fuses CT, MRI, & PET with CNNs, GANs, and autoencoders, showing that newer fusion models capture fine details, perform better than older tools, and support medical work. A further study36 showed strong results across datasets with CNN fusion, and another37 mapped six fusion types, explained how fusion improves test accuracy, and pointed out limits that affect real use. Fusion also helps in eye disease prediction, yet the accuracy & explainability gap and missing datasets still limit single models, causing weak feature detail & missed patterns.

Methodology

Dataset description

One key issue for researchers is class imbalance in medical data. Consent rules and privacy limits reduce access to varied samples38, 39. We used a balanced eye disease dataset17 to avoid this problem. No class dominated the set. It came from several sources, including IDRiD40 and the Ocular Recognition Database18. This mix helps support broad use and clear model behavior. Subsequently, we conducted experiments on these datasets. This approach also contributes to the creation of a generalized system and eliminates biases and favoritism associated with specific diseases.

This dataset contains 4,217 high-resolution retinal fundus images17, as shown in Fig. 1. The images are organized into four disease directories, each containing nearly 1,000 images per class. Each class has its own unique and crucial features, essential for identifying the presence or absence of a particular disease. The four disease categories are Normal, Cataract, Glaucoma, and DR.

Normal images show the absence of any disease, i.e., a normal eye structure. Cataract images depict an eye affected by lens clouding, which reduces clarity and alters the brightness and sharpness of the captured visual image. Glaucoma images illustrate hallmark signs of optic nerve damage, such as an enlarged cup-to-disc ratio, thinning of the neuro-retinal rim, and peripapillary changes, all symptoms of progressive glaucoma. DR images contain lesions such as microaneurysms, haemorrhages, and exudates, which are typical manifestations of diabetes-related retinal damage.

Fig. 1 Eye disease dataset sample images.

Data collection and organization

Each class went into its own folder. We used a custom define_paths function to read file paths and set labels from the folder names. The define_df function then combined these paths and labels into one DataFrame.
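
A minimal sketch of this step is shown below, assuming a folder layout with one sub-directory per class; the dataset folder name and the DataFrame column names ('filepaths', 'labels') are illustrative assumptions, not values reported by the study.

```python
# Minimal sketch of folder-based path collection and labelling.
import os
import pandas as pd

def define_paths(data_dir):
    """Collect image file paths and derive each label from its class folder name."""
    filepaths, labels = [], []
    for class_name in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        for fname in os.listdir(class_dir):
            filepaths.append(os.path.join(class_dir, fname))
            labels.append(class_name)
    return filepaths, labels

def define_df(filepaths, labels):
    """Combine paths and labels into one DataFrame."""
    return pd.DataFrame({"filepaths": filepaths, "labels": labels})

df = define_df(*define_paths("eye_disease_dataset"))  # hypothetical dataset folder
```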

Data splitting

We have split the dataset into three parts: 80% for training, 10% for validation & 10% for testing. We have used train_test_split with stratified sampling to keep class balance, and set random_state to 123 for repeatable results.
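
A minimal sketch of this split, assuming the DataFrame `df` built above and scikit-learn's train_test_split; only the split percentages and random_state value come from the text.

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% in half for validation and test,
# stratifying on the label column to keep class balance.
train_df, temp_df = train_test_split(
    df, train_size=0.80, stratify=df["labels"], random_state=123)
valid_df, test_df = train_test_split(
    temp_df, train_size=0.50, stratify=temp_df["labels"], random_state=123)
```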

Image preprocessing and augmentation

We have used the Keras ImageDataGenerator function to prepare the images. It formed batches & set up streams for training, validation & testing. The steps are as follows, with a minimal code sketch after the list:

  1. Every retina image was resized to 224 × 224 pixels for a fixed input size.

  2. All images were loaded as RGB with a 224 × 224 × 3 shape.

  3. We have used horizontal flipping to augment the training samples and reduce model overfitting.

  4. We have added a simple placeholder preprocessing function that can be updated as required.

  5. We have used the same batch size for both training & testing and kept shuffling off to keep the image order fixed.
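
The sketch below illustrates one way to realize these steps with the Keras ImageDataGenerator; the batch size of 32, the identity placeholder function, and shuffling the training stream are assumptions rather than values reported by the study.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def scalar(img):
    # Placeholder preprocessing step (step 4); update as required.
    return img

train_aug = ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
plain = ImageDataGenerator(preprocessing_function=scalar)

def make_flow(generator, frame, shuffle):
    return generator.flow_from_dataframe(
        frame, x_col="filepaths", y_col="labels",
        target_size=(224, 224), color_mode="rgb",
        class_mode="categorical", batch_size=32, shuffle=shuffle)

train_gen = make_flow(train_aug, train_df, shuffle=True)
valid_gen = make_flow(plain, valid_df, shuffle=False)
test_gen = make_flow(plain, test_df, shuffle=False)   # fixed order (step 5)
```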

Proposed fusion architecture

The proposed models employ feature-level fusion by combining EfficientNetB0 with three standard CNNs, namely ResNet50, InceptionV3 & AlexNet, as shown in Fig. 2. These models pick up deep & local features in retinal images; their merger helps the system learn strong visual cues & reach high scores.

Fig. 2 Model architecture diagram.

EfficientNetB0 + ResNet concat fusion (Exp01)

The first experiment begins with the EfficientNetB013 and ResNet5016 models, coupled to extract features individually. EfficientNet is renowned for its exceptional computational efficiency, while ResNet is illustrious for its deep residual learning, making them powerful feature extractors for fundus images.

We start by feeding 224 × 224 RGB fundus images into both architectures concurrently, so each model processes the same input differently. EfficientNetB0 excels at capturing features cost-effectively because of its balanced scaling, while ResNet50 is known for depth and stability thanks to its residual connections, helping the architecture learn more convoluted patterns. After each model has extracted features individually, we forward those features into GlobalAveragePooling2D to convert the final convolution outputs into compressed feature vectors. These two vectors are then combined into a single enriched representation, absorbing information from both architectures.

For classification, the model uses a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and a Dropout layer that randomly turns off 50% of the neurons. Another Dense layer with 256 neurons follows the same configuration. The final output layer uses Softmax with 4 neurons for multi-class disease prediction.

Throughout training, the model uses the Adam optimizer and categorical cross-entropy loss, and monitors validation accuracy with early stopping to prevent overfitting and limit the use of computational resources. Learning-rate reduction is applied when progress slows.
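
The following is a minimal sketch of the Exp01 concatenation fusion and its classification head in TensorFlow/Keras; the learning rate, callback patience values, and the omission of backbone-specific input preprocessing are assumptions, not details reported by the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB0, ResNet50

inputs = layers.Input(shape=(224, 224, 3))

# Both ImageNet-pretrained backbones see the same fundus image (classifier tops removed).
eff = EfficientNetB0(include_top=False, weights="imagenet", input_tensor=inputs)
res = ResNet50(include_top=False, weights="imagenet", input_tensor=inputs)
eff.trainable = False   # frozen at the start (see Training settings)
res.trainable = False

# Compress each backbone's final feature maps into vectors and concatenate them.
eff_vec = layers.GlobalAveragePooling2D()(eff.output)
res_vec = layers.GlobalAveragePooling2D()(res.output)
fused = layers.Concatenate()([eff_vec, res_vec])

# Head: Dense(512) -> BN -> Dropout(0.5) -> Dense(256) -> BN -> Dropout(0.5) -> Softmax(4)
x = layers.Dense(512, activation="relu")(fused)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # assumed learning rate
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3),
]
```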

EfficientNetB0 + inceptionV3 concat fusion (Exp05)

The second experiment combines two distinct architectures, EfficientNetB013 and InceptionV314, by concatenating their retinal feature representations. This configuration not only takes advantage of multi-scale and balanced scaling, but also reduces the use of computational resources while efficiently extracting subtle retinal abnormalities and details from both architectures.

The classifier layers of both pre-trained models were removed by setting `include_top = False`, so that we could attach our own classification head to predict the retinal diseases.

For the model input, we feed 224 × 224 RGB fundus images into both architectures independently, so each architecture learns and extracts features individually; when these are concatenated, we obtain a unified feature vector that captures the spatial information of eye abnormalities. To compress the large feature maps into compact vectors, we use GlobalAveragePooling2D, and the resulting vectors are then merged via the concatenation layer.

For classification, we employed a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and Dropout; Dropout randomly deactivates 50% of the neurons to prevent overfitting. The next Dense layer halves the number of neurons, producing a more concise and compact representation. This reduction in neurons helped decrease MFLOPs, inference time, and model scale. The final layer is a Softmax classifier over the four eye disease classes.

For training, we used the Adam optimizer, the categorical cross-entropy loss function, and validation accuracy as the main evaluation metric. To keep the training of each model efficient and stable, Early Stopping halted training when the validation loss stopped improving, and ReduceLROnPlateau lowered the learning rate when progress slowed so the model could continue to adjust its weights.

EfficientNetB0 + alexnet concat fusion (Exp09)

The last experiment pairs a modern architecture with a vintage one: the EfficientNetB013 and AlexNet15 concatenation model. This combination is well known as a lightweight option that does not compromise robust diagnostic and identification performance.

For the model input, we feed 224 × 224 RGB fundus images into both architectures independently, so each architecture learns and extracts retinal features individually; when these are concatenated, we obtain a unified feature vector that captures the spatial information of eye abnormalities. GlobalAveragePooling2D is used to compact the spatial feature maps into fixed-length vectors, and the two independent vectors are then concatenated to form one combined feature representation for classifying the eye abnormalities.

For classification, we employed a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and Dropout; Dropout randomly deactivates 50% of the neurons to prevent overfitting. The next Dense layer halves the number of neurons, producing a more concise and compact representation. This reduction in neurons helped decrease MFLOPs, inference time, and model scale. The final layer is a Softmax classifier over the four eye disease classes.

For training, we used the Adam optimizer, the categorical cross-entropy loss function, and validation accuracy as the main evaluation metric. To keep the training of each model efficient and stable, Early Stopping halted training when the validation loss stopped improving, and ReduceLROnPlateau lowered the learning rate when progress slowed so the model could continue to adjust its weights.

This fusion method offers a strong balance between efficiency and feature diversity. In our experiments it remained an excellent option for resource-limited clinical settings where fast inference and lower hardware demands are priorities, making it well suited for point-of-care devices, mobile screening tools, and clinics with limited computational resources.

Training settings

To make a fair comparison of all 12 models, we trained them on an identical setup of hardware, packages, and libraries.

Each model typically trained for around 3.2 h on a P100 GPU, and most models achieved their best performance between 15 and 30 epochs, showing effective learning. All experiments were performed using Kaggle Notebooks and TensorFlow version 2.9.1, on NVIDIA Tesla P100 GPU accelerators with CUDA 11.2 and cuDNN 8.1 to speed up computation and gradient optimization. To reduce training time and obtain optimal weights for retinal disease identification, we employed transfer learning with pre-trained ImageNet weights for all 12 models.

Initially, we froze all convolutional layers of each architecture, ensuring that the models retained their pre-trained ImageNet weights; only the fusion and classification layers were trained at the start. Later, we gradually unfroze the convolutional layers to extract intricate details of the retinal fundus images. The fine-tuning approach used a lower learning rate, letting the models refine high-level features without ruining the beneficial low-level features acquired from ImageNet.
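
A minimal sketch of this two-stage schedule, reusing the `model`, generators, and `callbacks` from the earlier sketches; the epoch counts and learning rates here are illustrative assumptions.

```python
import tensorflow as tf

# Stage 1: backbones frozen, only the fusion and classification layers learn.
model.fit(train_gen, validation_data=valid_gen, epochs=10, callbacks=callbacks)

# Stage 2: unfreeze the convolutional layers and fine-tune with a lower learning
# rate so the useful low-level ImageNet features are preserved.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=valid_gen, epochs=20, callbacks=callbacks)
```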

Results and discussion

Ablation study on feature fusion strategy

We have conducted 12 experiments with various fusion setups to find the best approach for eye disease classification. These setups use well-known CNN models and let us check how each fusion step affects the model performance.

We have utilized three backbone pairs: EfficientNetB0 + ResNet50, EfficientNetB0 + InceptionV3 & EfficientNetB0 + AlexNet. Every backbone offers a clear strength in feature extraction and affects size, accuracy & speed in its own way; EfficientNetB0 is the base model that we pair with the others. InceptionV3 handles features at many scales due to its block design. ResNet50 is a well-established, older model for training deep convolutional networks and passing large feature maps using its residual connections. AlexNet was chosen as a lightweight solution for eye disease detection in remote locations, offering a good balance between model accuracy and computational cost.

For each backbone combination, we applied four fusion strategies: concatenation (preserving all features by stacking them), element-wise summation (combining features into a more compact representation), weighted fusion (learning the relative importance of features through trainable weights), and majority voting (making final decisions by combining individual model predictions).

The first three methods integrate features within the network, while majority voting takes place at the classification stage. Comparing these approaches helped us determine whether early feature-level fusion or late decision-level fusion offers better diagnostic accuracy.
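The sketch below shows one plausible Keras realization of the three feature-level strategies, reusing the pooled vectors `eff_vec` and `res_vec` from the earlier sketch; the 512-unit projections and the gating form of the weighted fusion are assumptions. Majority voting, in contrast, operates on the class predictions of separately trained models at the decision level.

```python
from tensorflow.keras import layers

# 1) Concatenation: stack both vectors, keeping every feature (dimension d1 + d2).
concat_feat = layers.Concatenate()([eff_vec, res_vec])

# 2) Element-wise summation: project both branches to a common width, then add.
a = layers.Dense(512, activation="relu")(eff_vec)
b = layers.Dense(512, activation="relu")(res_vec)
sum_feat = layers.Add()([a, b])

# 3) Weighted fusion: a trainable gate learns the relative importance of each branch.
gate = layers.Dense(512, activation="sigmoid")(layers.Concatenate()([a, b]))
inv_gate = layers.Lambda(lambda g: 1.0 - g)(gate)
weighted_feat = layers.Add()([layers.Multiply()([gate, a]),
                              layers.Multiply()([inv_gate, b])])
```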

Each model was trained under the same circumstances and parameters: the same dataset, splitting mechanism, image preprocessing, categorical cross-entropy loss, identical optimizer, and a fixed mini-batch setup. These settings are important so that performance differences can be attributed to the architecture rather than external changes. We also report a range of evaluation metrics, such as the MCC and mIoU scores, which indicate robustness to class imbalance and suitability for deployment in clinical environments. Together, these scores help decide which fusion strategies really work for retinal disease classification.
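
A minimal sketch of these metrics with scikit-learn, assuming integer class labels `y_true` and `y_pred` obtained from the test generator; mIoU is computed here as the macro-averaged Jaccard score.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef, jaccard_score

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)                  # robust under class imbalance
miou = jaccard_score(y_true, y_pred, average="macro")    # mean IoU over the 4 classes
```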

Internal ablation study

To find out the influence of different architectural combinations and fusion strategies on eye disease classification performance, we conducted a comprehensive ablation study involving twelve experiments using our internal dataset17. The overall internal validation findings are presented in Table 1.

Table 1 Performance metrics of internal validation.

A comprehensive set of experiments on the internal validation dataset surfaced several key findings. Notably, the fusion strategy plays a vital role in deciding model performance: feature-level concatenation models outperform the others, a clear indication that concatenation-based fusion overshadows the remaining fusion types such as sum, voting, or weighted.

Our internal results show that EffNetB0 + ResNet50 (Exp01) is the top performer of all experiments with 95.26% accuracy, followed by EffNetB0 + InceptionV3 (Exp05) and EffNetB0 + AlexNet (Exp09) at around 94.79% and 93.60%, respectively. The weighted and voting pairs struggle and are unable to match the benchmark set by the concatenation-based models in extracting the rarer disease classes. The concatenation-based models perform well in all scenarios and achieve excellent AU-ROC scores above 0.988. In terms of training statistics, we observed that ResNet50 concat models took longer to train, generally converging between epochs 11 and 27, while the AlexNet-based models converged rapidly between epochs 9 and 23.

For real-world deployment where speed and accuracy must be balanced, the EfficientNetB0 + ResNet50 combination is an excellent all-around choice. If fast processing is your top priority, EfficientNetB0 + AlexNet is a reliable and efficient option. But if maximum accuracy is the main goal, EfficientNetB0 + InceptionV3 remains the strongest performer. Although concatenation-based fusion validated well on internal datasets, additional external testing would further strengthen confidence in these findings.

Fig. 3 Fusion strategy metrics.

In Fig. 3 we evaluate all four fusion strategies (concatenation, summation, weighted aggregation, and majority voting) across three pre-trained backbone pairings: EfficientNetB0 with ResNet50, InceptionV3, and AlexNet. Among these combinations, the EfficientNetB0 + ResNet50 model produced the highest overall results, reaching its peak accuracy of 95.26% when using concatenation fusion. EfficientNetB0 + InceptionV3 also demonstrated consistently resilient performance across all methods, achieving its best accuracy of 94.31% under the weighted fusion approach.

Intriguingly, the EfficientNetB0 + AlexNet fusions showed comparable stability irrespective of which fusion method was applied, suggesting that even this legacy architecture still contributes useful features. Voting consistently performed worst and struggled to capture the deeper feature interactions in retinal images, whether normal, cataract, or glaucoma.

Training and validation curves

Figures 4, 5 and 6 present the training and validation loss & accuracy curves for each model across all 12 fusion experiments. By closely examining the curves, we can analyze the convergence speed and the minimisation of validation loss for each model. The left side of each graph shows accuracy & the right side shows loss.

Fig. 4 Train and val curves of all fusion strategies of Exp01 to Exp04.

Fig. 5 Training and validation curves of all fusion strategies of Exp05 to Exp08.

Fig. 6 Training and validation curves of all fusion strategies of Exp09 to Exp12.

All 12 fusion experiments exhibited distinctive convergence patterns that depend heavily on the backbone architecture and the feature-fusion strategy. Across all tested conditions, the InceptionV3-based fusions (Exp05–Exp08) produced the most stable training curves. Multi-scale processing lets the InceptionV3 branch converge smoothly by utilizing different kernel scales concurrently, an approach that is particularly effective for retinal disease classification, as the graphs clearly show. The weighted fusion approach, which helps manage input feature maps at multiple scales, performed well, with validation loss remaining steady at around 0.20 on average. This adaptability of EfficientNetB0 and InceptionV3 allowed the model to adjust effectively even when small parameter changes altered the learning curves. The fusion strategy and architecture therefore play a vital role in determining the type of convergence observed.

The ResNet50-based fusions (Exp01–Exp04) showed the quickest convergence, reaching about 90.97% accuracy, thanks to residual links that avoid the vanishing gradient problem. On critical analysis, each ResNet50-based model had a steady 4–5% gap from its validation accuracy and the highest loss fluctuations (0.25 to 0.74); they tended to latch onto training-specific details rather than deeper, generalized patterns. Among the ResNet-based concat models, Exp01 achieved a near-perfect training loss (0.002), which indicates overfitting. Fusion in Exp02 exhibited inconsistent behaviour, oscillating between solutions as if the optimizer was struggling to converge. Weighted fusion (Exp03) achieved 92.26% validation accuracy by learning which features were most important, thereby improving its adaptability as feature importance shifted. The voting method (Exp04) failed near epoch 19; accuracy stayed near 67% because the base models drifted apart and most votes carried little weight. The AlexNet-type models (Exp09–Exp12) hit the same limits as the original AlexNet: their simple design did not pair well with EfficientNet, and training was slow and uneven. Early accuracy was close to 60% with strong overfitting; later runs reached 93%, but the validation loss stayed high at 0.25–0.40. The loss curves show that concatenation kept all features but raised the feature space to 4096 dimensions, which increased overfitting. Sum fusion reduced the feature size and gave a small regularizing effect. The best gains came from learning feature weights, but this step was sensitive to tuning. Sum and voting fusion also pushed the models out of sync.

The key point is that models with very different feature scales needed a balanced setup. The weighted InceptionV3 fusion (Exp07) achieved that balance. It showed stable curves with a 4% gap between training and validation and strong results. It is the safest and most practical option for MIA.

Confusion matrices (CM) of internal dataset evaluation

The CMs for all 12 fusion models, shown in Fig. 7 and tested on the internal dataset17, show how each model performs and where it needs improvement. They reveal how well the models distinguish different disease types41 and highlight both strengths and weaknesses42, 43.

Fig. 7 CM of internal ablation study.

Across all 12 experiments, EfficientNetB0 paired with other networks gave steady results for eye disease classification. EfficientNetB0 + ResNet50 (Exp01–Exp04) worked well. Exp03 did best because weighted fusion handled similar-looking cases. The top result came from EfficientNetB0 + InceptionV3 (Exp06/07), reaching 96.51% validation accuracy with 1.8% error. InceptionV3 captured small, multi-scale details that matched EfficientNet’s features.

EfficientNetB0 + AlexNet (Exp09–Exp12) worked for cataracts and DR but struggled with glaucoma. AlexNet is older and misses subtle signs. Weighted or sum fusion (Exp03, 06, 07, 10) improved results by focusing on key features. Modern networks with smart fusion handled hard tasks like spotting glaucoma in healthy eyes.

Class-wise performance heatmap of internal validation

Figure 8 shows precision, recall & F1-score for each model across the four eye conditions44, 45.

Fig. 8 Class-wise performance heat-map of the Internal dataset.

Observing Exp01–Exp12 on dataset17, EfficientNetB0 + ResNet50 (Exp01–Exp04) retains high recall & precision for cataracts & DR. Performance for glaucoma & normal eyes varies. Weighted fusion in Exp03 lowers this variation.

EfficientNetB0 + InceptionV3 (Exp05–Exp08) is the overall best. Exp06 & Exp07 showed how InceptionV3’s multi-scale interpretation & EfficientNet’s features work together, making healthy vs. diseased retinas easy to distinguish.

EfficientNetB0 + AlexNet (Exp09–Exp12) is fine but less consistent for glaucoma & normal eyes. AlexNet struggles with subtle features, and fusion adds some help. The heatmaps show that weighted & sum fusion give more balanced & reliable results for each disease.

Clinical Interpretability/Explainability of internal evaluation

We have used Score-CAM & LIME (with SLIC segmentation) to see which parts of the eye the model mainly focuses on.

For DR, heatmaps highlight the optic disc, macula & main blood vessels, where early signs appear. For glaucoma, activations stay on the optic nerve head, matching real structural changes when the neuroretina rim thins. Normal retinal images don’t show any strong hotspots, while cataract images show softer, more spread-out activations that match the overall haze & lowered contrast typical of the condition. Overall, the activation patterns make sense clinically & match the real anatomical changes associated with each disease as presented in Fig. 9.
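
A minimal sketch of the LIME explanation with SLIC superpixels is given below, assuming the `lime` and `scikit-image` packages and the trained `model` from the earlier sketches; the segment count, sample number, and variable `image` (one 224 × 224 × 3 fundus image) are illustrative assumptions.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import slic, mark_boundaries

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image.astype("double"),
    classifier_fn=lambda batch: model.predict(np.array(batch)),
    segmentation_fn=lambda img: slic(img, n_segments=100, compactness=10),
    top_labels=4, num_samples=1000)

# Keep only the superpixels that most support the predicted class and outline them.
img_out, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(img_out / 255.0, mask)
```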

Fig. 9 Score-CAM of internal ablation study.

External ablation study

To rigorously test whether the proposed eye disease classification models could handle real-world clinical conditions beyond the training data, we performed external validation using two completely independent datasets: Messidor-219 and the Ocular Disease Intelligent Recognition (ODIR) dataset18. Messidor-219 provided 1,748 high-quality fundus images from 874 DR screening exams, complete with DR severity grades and quality ratings, which we carefully preprocessed to remove distracting black borders. ODIR18 presented an even tougher challenge, containing fundus photographs from both eyes of 5,000 patients collected across multiple Chinese hospitals using different camera brands and models, resulting in images with varying resolutions, lighting conditions, and quality levels that doctors actually encounter in everyday practice.

We carefully formed a balanced external validation dataset with 100 professionally annotated images for each of the four diagnostic categories: Cataract, Glaucoma, Normal & DR. This test set is significant because it reflects real-world circumstances: images come from diverse devices and patient groups and show natural quality differences. This lets us check whether the model handles real clinical cases or just memorizes training patterns. The results are presented in Table 2.

Table 2 Performance metrics of external validation.

The external dataset had about 400 retinal images from the two sources18, 19, covering different hospitals, devices, and patient groups. All fusion models scored over 95% accuracy. The top model, EfficientNetB0 + InceptionV3 with voting fusion (Exp08), reached 97.99%, and its weighted counterpart (Exp07) reached 97.74%. The lower-performing experiments (Exp06 & Exp12) still reached around 96.49%, showing the architecture is robust.

MCC scores were similar. Exp07 and Exp08 had 0.973. The lightweight AlexNet Sum fusion (Exp10) reached 0.980, sometimes beating heavier models on high-quality images. For class-level agreement (mIoU), InceptionV3 fusions led: Exp05 (0.961), Exp08 (0.960), and Exp07 (0.956). ResNet50 and some AlexNet variants were lower (Exp01: 0.918; Exp12: 0.932).

EfficientNetB0 + InceptionV3 balances good features and computation (≈ 27–30 M parameters, ~ 6,500 MFLOPs). AlexNet models were fast (0.130–0.263 s/image) and light but slightly less accurate. Concatenation fusion was the most stable, weighted fusion gave small gains, and voting worked well but sometimes lowered accuracy.

Every model had a strong AUROC of ≥ 0.991, with Exp08 reaching 0.999. EfficientNetB0 + InceptionV3 with concatenation or weighted fusion performed best, while the AlexNet pairings were efficient & nearly as accurate (see Fig. 10).

Fig. 10 Radar plot of external ablation study.

We tested all 12 fusion models to compare accuracy and computation. Lightweight models like Exp09 (EfficientNetB0 + AlexNet, concatenation) use 11.86 M parameters and 2736.20 MFLOPs. They can process an image in 0.133 s, which works well for busy clinics or emergencies.

Heavier models like Exp01 (EfficientNetB0 + ResNet50, concatenation) use 29.47 M parameters and 8555.65 MFLOPs. They need stronger hardware and more time, but they catch details that lighter models might miss. Concatenation gives richer features but costs more compute. Sum or voting methods reduce load while keeping good performance. Processing times range from 0.133 s (Exp09) to 0.363 s (Exp05).

No model fits all cases. Large hospitals can focus on accuracy. Rural clinics may need efficiency. Emergency units need fast processing.

CM of external dataset evaluation

The CMs for all 12 models show performance and errors on the external datasets18, 19. They reveal how well the models separate classes & where improvement is needed41, 42, 43.

Fig. 11 CM of external ablation study.

EfficientNetB0 + ResNet50 (Exp01–Exp04) is reliable but often misclassifies Glaucoma. Concatenation models (Exp01, Exp05, Exp09) confuse Glaucoma with Normal. Sum fusion (Exp02, Exp06, Exp10) reduces these errors. Weighted and Voting models are mixed. Exp03, Exp07, and Exp12 detect Glaucoma at 98–100%. Normal cases do better with fusion methods other than concatenation.

Cataract and DR are classified well in all experiments. Voting helps maintain accuracy and reduce errors. EfficientNetB0 + InceptionV3 (Exp05–Exp08) produces clean matrices, while EfficientNetB0 + AlexNet (Exp09–Exp12) shows more errors, which Weighted or Voting fusion partly fixes. Voting is the most reliable strategy, especially for Glaucoma and Normal, consistent with Fig. 11.

Class-wise performance heatmap of external validation

Figure 12 shows class-level performance. Diabetic Retinopathy is easy to identify with high recall. Glaucoma is hardest, especially in Concatenation models (Exp01, Exp05, Exp09). Normal cases vary by fusion method; Sum and Voting improve stability.

InceptionV3 models (Exp05–Exp08) are the most consistent. AlexNet models are weaker initially but Weighted (Exp11) and Voting (Exp12) improve results. Voting (Exp04, Exp08, Exp12) gives stable predictions across diseases. Weighted fusion helps models with uneven feature contributions44, 45.

Fig. 12 Class-wise performance heat-map of the external dataset.

Clinical interpretability/explainability of external evaluation

We checked model predictions on external data18, 19 using Score-CAM20 and LIME (SLIC)46. The model focuses on relevant retinal regions, not the background (Fig. 13).

Score-CAM shows correct attention: normal images are faint; DR highlights macula and vessels; Glaucoma focuses on the optic nerve; Cataracts show a diffused lens pattern. This confirms attention is clinically meaningful.

Fig. 13 Score-CAM of the external ablation study.

Comparative study

External accuracy matched or exceeded internal results (Fig. 14), improving 0.4–4.6% with no drops. Top models (Exp03, Exp05, Exp11) merge features well across datasets.

Weighted fusion is consistent (Exp03, Exp07, Exp11), learning how much to trust each backbone. Summation is steady, especially for AlexNet (Exp10). Concatenation and voting give smaller gains. Weighted fusion is best for combining multiple backbones efficiently.

Fig. 14 Internal and external validation comparison.

Few studies fuse EfficientNetB0 with other CNNs for multi-class eye disease classification, showing a need for models that generalize, as in Table 3.

Table 3 Comparative Analysis.

Discussion of results

Internal and external results are close. EfficientNetB0 + InceptionV3 (Exp05–Exp08) and + ResNet50 (Exp01–Exp04) scored 92–95% with strong recall and efficiency. Concatenation and Weighted fusion captured subtle features and reduced errors. AlexNet models, though weaker, still captured fine details.

External testing improved results. Most models reached 97–98% accuracy. Weighted and Sum fusion combined multi-scale features, increasing reliability. EfficientNetB0 + InceptionV3 and + ResNet50 AUROC was 0.996–0.999. AlexNet-based models also scored well with optimized fusion, e.g., Exp10 MCC 0.980.

Clinical significance and inferences

The Score-CAM visualizations reveal two connected sides of the same story: what the model actually learns from the data, and how well that learning carries over to completely new hospitals and imaging systems58, 59, 60. When we examine the internal dataset, the activation maps clearly show the model focusing on the correct anatomical regions. In glaucoma cases, the attention locks onto the optic nerve head, where damage typically appears. For diabetic retinopathy, the model spreads attention across the posterior part of the eye and the macular region, where disease indicators are usually found. Normal retinas show balanced attention across the optic discs of both eyes, key anatomical reference points, while cataract cases highlight the central lens, where cloudiness develops. These patterns show that the model is learning real clinical features. The real test comes from the external datasets18, 19 from different hospitals, with changes in lighting, cameras, and image quality. Even with these changes, the attention patterns stay consistent: glaucoma predictions focus on the optic nerve, DR predictions focus on the back of the retina, normal eyes show attention on both optic discs, and cataract predictions focus on the lens. This shows the model is not memorizing one system but learning what the diseases look like in general.

Testing on multiple datasets is important for trust. It shows the model works under controlled conditions and handles real-world differences. Score-CAM maps show where the model struggles or looks at unexpected areas. These cases point to situations where human judgment is needed.

The model stays reliable even with poor images. It also shows when it is unsure. This makes it useful for hospitals, clinics, rural centers, or mobile units, no matter the equipment available.

Conclusion & future work

This study tests several dual-backbone fusion models for eye disease classification. EfficientNetB0 is paired with ResNet50, InceptionV3, and AlexNet under four fusion types. Tests on both Internal and External datasets show high accuracy and good performance across sources. The models can identify cataracts, glaucoma, DR, and normal eyes reliably. On the Internal dataset, accuracy is 92–95%, with EfficientNetB0 + InceptionV3 (Exp05–Exp08) and EfficientNetB0 + ResNet50 (Exp01–Exp04) performing best.

External dataset results are higher, with 94.99–97.99% accuracy and AUROC of 0.991–0.999. Top setups Exp03, Exp07, Exp08 & Exp10 separate disease classes well & achieve high mIoU (0.960–0.980). AlexNet-based models, including Exp10 with sum fusion, show that smaller networks can still work well with the right fusion. Heatmaps & CMs support the reliability of the results. Concatenation & weighted fusion handle hard pairs such as glaucoma vs. normal eyes. Score-CAM, Grad-CAM & LIME (SLIC) point to key parts of the fundus, such as the optic disc, macula, nerves & vessels, while avoiding areas that do not help.

This study also offers clear guidance on model design, testing & performance for multi-class eye disease work. It supports future AI tools for clinics. The models exceed 95% accuracy & AUROC above 0.991, but larger & more diverse datasets are needed. Multimodal inputs, such as fundus with OCT, may help with early disease diagnosis. Vessel modeling, mixture-of-experts & graph-based work may also help track change and improve fusion.

Current explainability tools highlight the right regions across varied ages & image quality. Future work should add counterfactual views & uncertainty scores so clinicians can judge model confidence. Work on real-time use, edge devices & compression is also needed. Clinical testing with ophthalmologists will be the key step to move this system into practice. The strong external results show that it can adapt to many imaging setups.