Introduction

Eye diseases represent an enormous vision-health challenge, affecting billions of people worldwide and remaining a leading cause of visual impairment and blindness. According to WHO estimates, approximately 2.2 billion people suffer from vision loss or blindness, with common disorders such as glaucoma, cataracts, and DR being primary contributors to this burden1. DR, a complication of diabetes mellitus, damages the blood vessels of the retina and can lead to critical loss of sight when treatment is missed. Glaucoma involves the gradual deterioration of the optic nerve, and its symptoms typically remain unnoticed during the early stages2. Cataracts, which cloud the eye’s natural lens, are a major cause of lasting blindness. To protect individual well-being and vision health, it is essential to detect eye disease as early as possible and to take pre-emptive measures against sight loss. The progress of artificial intelligence (AI), especially DL, now allows us to build fusion-based architectures that identify and categorize such diseases in a timely manner. A decade ago, traditional diagnostic approaches were widely relied upon by doctors; these approaches were prone to misinterpretation of eye disease conditions, and in some cases incorrect diagnoses became a contributing cause of visual blindness. They also markedly delayed treatment and required certified medical personnel to carry out the entire process efficiently and effectively.

Image classification approaches hold strong potential in medical image analysis (MIA), enabling reliable techniques for analyzing medical images pixel by pixel, such as MRI, X-ray, or CT scans. These approaches aim to address the fundamental problems encountered with traditional methods. Numerous tools and techniques have emerged in recent years, including Convolutional Neural Networks (CNNs), which are widely used for image analysis tasks. However, CNNs still have limitations in fully meeting the specific needs of medical diagnosis and clinical treatment. For instance, retinal fundus images require laser-sharp focus to accurately identify individual lesions before making disease predictions. One important requirement is to capture diverse features from fundus images in order to identify and classify multi-class eye diseases3, 4, including DR, which is increasingly found in younger patients5, 6, and cataract, a prime cause of ocular impairment7, 8, 9. Moreover, myopia has been shown to increase the risk of developing cataracts, making it a significant contributing factor in the progression of other eye diseases; it is estimated that approximately 50% of the world’s population will have myopia by 205010, 11. Continued innovation in image-classification algorithms and the evolution of graphics processing technology and computing infrastructure in recent years have catalyzed the introduction of multiple DL frameworks12. EfficientNetB013 models exhibit a clear advantage in image-recognition performance: these CNNs require comparatively few resources thanks to compound scaling, which balances depth, width, and input resolution. These are the reasons researchers often choose EfficientNet-based models for MIA. Likewise, InceptionV314 and AlexNet15 have established resilient performance records in computer-vision tasks.

The main contributions of this research on eye disease classification are as follows:

  • Single CNN architectures frequently miss critical diagnostic patterns due to their in-built architectural constraints. Consequently, we propose a dual-backbone fusion framework that combines complementary strengths from multiple architectures.

  • We have systematically evaluated all 12 fusion configurations by pairing EfficientNetB013 with ResNet5016, InceptionV314, and AlexNet15, assessing each combination across four fusion schemes: concatenation, element-wise summation, weighted fusion, and majority voting. Every architecture provides distinct competencies: EfficientNetB0 delivers efficient universal feature extraction, ResNet50 provides deep residual learning, InceptionV3 captures multi-scale patterns, and AlexNet offers computational efficiency.

  • We have evaluated all fusion configurations through rigorous internal and external assessment on distinct datasets17, 18, 19 to mimic real-world generalization. To ensure clinical accountability, we have employed Score-CAM (Score-weighted Class Activation Mapping)20, 21, 22 and Local Interpretable Model-agnostic Explanations (LIME)23 as explainability and interpretability techniques that reveal which eye regions drive diagnostic decisions24.

  • This comprehensive study highlights fusion-based architectures as accurate, interpretable, and robust solutions for multi-class eye disease categorization across normal, cataract, glaucoma, and DR cases, suitable for real-world medical deployment.

The rest of the article is organized as follows. Section 2 reviews past studies on eye disease classification and notes the contribution of each author. Section 3 explains the methods and datasets, covering how we gathered, arranged, split, and prepared the images, and describes the fusion models and the training setup. Section 4 presents the results, shows how each fusion strategy affects the 12 models, and uses Score-CAM, LIME, and SLIC to explain how the models make decisions; we also compare the proposed models with existing works to assess their strength. Section 5 ends with concluding remarks and ideas for future improvements.

Literature review

Traditional approaches in eye disease detection

Ten years ago, doctors relied on traditional techniques to identify and recognize specific types of retinal disease. However, these techniques often introduced significant errors, sometimes leading to serious flaws in clinical treatment. The problems were not restricted to treatment mistakes; precise and deep patterns critical for accurately cataloguing the diseases were also missed. At that time, manual preprocessing was heavily used, including optic disc localization (ODL), vessel segmentation (VS), and microaneurysm identification. Hand-engineered feature extraction algorithms such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), and handcrafted texture analysis were also applied25. These algorithms are effective under controlled settings, when fundus image quality is excellent and variance is limited, and they have solid track records. However, the problems we face today include uneven datasets, varied data sources, and image quality that shifts depending on the camera or smartphone employed. Even current algorithms struggle to overcome these issues, resulting in high errors in production and a large data-preprocessing overhead when forecasting eye disease classes26.

Modern models can predict diseases by analyzing signs and patient history, opening new avenues for healthcare and eye examination. Studies show that early detection helps prevent severe damage and blindness. Rural areas often lack eye doctors, limiting access to care, but AI technologies can help bridge this gap and offer safe guidance in such remote locations27. Manual checks, relying on human notes, are prone to errors28.

DL in ophthalmology

Artificial Intelligence, especially deep learning (DL), has revolutionized eye care following two pivotal studies29, 30. These studies revealed that convolutional neural networks (CNNs) can detect diabetic retinopathy (DR) as effectively as, or even better than, human specialists, igniting new developments in medical AI and eye care31. Another study used optic nerve images and visual field data to detect glaucoma with remarkable accuracy32, reshaping the direction of subsequent research. A subsequent review33 examined deep learning’s application to fundus images; it covered tasks such as segmenting and classifying eye problems and highlighted the absence of standardized data, which hinders progress.

Fusion techniques in MIA

Accuracy is crucial in detecting eye diseases, but challenges persist. Missing data and low image clarity hinder model performance. Ensemble and fusion methods address these issues by combining features from multiple models before the final decision step. Numerous studies test many models, but strong head-to-head comparisons are rare. Work on EfficientNet shows high speed & accuracy, which motivated trials that pair EfficientNet with ResNet, InceptionV3, and AlexNet. Each model adds its own value: ResNet is known for enabling the training of deep neural networks (DNNs) with residual connections that circumvent the vanishing gradient problem; InceptionV3 is acknowledged for capturing multi-scale features through its inception modules; and AlexNet is an older model but still offers a reliable baseline for many medical applications. The researchers’ main aim is an explainable model with a good accuracy trade-off. One survey34 covers pixel, feature, and decision fusion, also discussing attention models such as transformers & generative models that capture more detail; it calls for clear model output in heart, brain, and cancer care & warns about data risks and system attacks. Another study35 fuses CT, MRI, & PET with CNNs, GANs, and autoencoders, showing that newer fusion models capture fine details, perform better than older tools, and support medical work. A further study36 showed strong results across datasets with CNN fusion, and another37 mapped six fusion types, explained how fusion improves test accuracy, and pointed out limits that affect real use. Fusion also helps in eye disease prediction, yet the accuracy & explainability gap and missing datasets still limit single models, causing weak feature detail & missed patterns.

Methodology

Dataset description

One key issue for researchers is class imbalance in medical data. Consent rules and privacy limits reduce access to varied samples38, 39. We used a balanced eye disease dataset17 to avoid this problem. No class dominated the set. It came from several sources, including IDRiD40 and the Ocular Recognition Database18. This mix helps support broad use and clear model behavior. Subsequently, we conducted experiments on these datasets. This approach also contributes to the creation of a generalized system and eliminates biases and favoritism associated with specific diseases.

This dataset contains 4,217 high-resolution retinal fundus images17, as shown in Fig. 1. The images are organized into four disease directories, each containing nearly 1,000 images per class. Each class has its own unique and crucial features, essential for identifying the presence or absence of a particular disease. The four disease categories are Normal, Cataract, Glaucoma, and DR.

Normal images show the absence of any disease, i.e., a normal eye structure. Cataract images depict an eye affected by lens clouding, which reduces clarity and alters the brightness and sharpness of the captured visual image. Glaucoma images illustrate hallmark signs of optic nerve damage, such as an enlarged cup-to-disc ratio, thinning of the neuro-retinal rim, and peripapillary changes, all symptoms of progressive glaucoma. DR images contain lesions such as microaneurysms, haemorrhages, and exudates, which are typical manifestations of diabetes-related retinal damage.

Fig. 1 Eye disease dataset sample images.

Data collection and organization

Each class went into its own folder. We used a custom define_paths function to read file paths and set labels from the folder names. The define_df function then combined these paths and labels into one DataFrame.
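
A minimal sketch of this step is shown below, assuming a folder layout with one sub-directory per class; the dataset folder name and the DataFrame column names ('filepaths', 'labels') are illustrative assumptions, not values reported by the study.

```python
# Minimal sketch of folder-based path collection and labelling.
import os
import pandas as pd

def define_paths(data_dir):
    """Collect image file paths and derive each label from its class folder name."""
    filepaths, labels = [], []
    for class_name in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        for fname in os.listdir(class_dir):
            filepaths.append(os.path.join(class_dir, fname))
            labels.append(class_name)
    return filepaths, labels

def define_df(filepaths, labels):
    """Combine paths and labels into one DataFrame."""
    return pd.DataFrame({"filepaths": filepaths, "labels": labels})

df = define_df(*define_paths("eye_disease_dataset"))  # hypothetical dataset folder
```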

Data splitting

We have split the dataset into three parts: 80% for training, 10% for validation & 10% for testing. We have used train_test_split with stratified sampling to keep class balance, and set random_state to 123 for repeatable results.
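
A minimal sketch of this split, assuming the DataFrame `df` built above and scikit-learn's train_test_split; only the split percentages and random_state value come from the text.

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% in half for validation and test,
# stratifying on the label column to keep class balance.
train_df, temp_df = train_test_split(
    df, train_size=0.80, stratify=df["labels"], random_state=123)
valid_df, test_df = train_test_split(
    temp_df, train_size=0.50, stratify=temp_df["labels"], random_state=123)
```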

Image preprocessing and augmentation

We have used the Keras ImageDataGenerator function to prepare the images. It formed batches & set up streams for training, validation & testing. The steps are as follows, with a minimal code sketch after the list:

  1. Every retina image was resized to 224 × 224 pixels for a fixed input size.

  2. All images were loaded as RGB with a 224 × 224 × 3 shape.

  3. We have used horizontal flipping to augment the training samples and reduce model overfitting.

  4. We have added a simple placeholder preprocessing function that can be updated as required.

  5. We have used the same batch size for both training & testing and kept shuffling off to keep the image order fixed.
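
The sketch below illustrates one way to realize these steps with the Keras ImageDataGenerator; the batch size of 32, the identity placeholder function, and shuffling the training stream are assumptions rather than values reported by the study.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def scalar(img):
    # Placeholder preprocessing step (step 4); update as required.
    return img

train_aug = ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
plain = ImageDataGenerator(preprocessing_function=scalar)

def make_flow(generator, frame, shuffle):
    return generator.flow_from_dataframe(
        frame, x_col="filepaths", y_col="labels",
        target_size=(224, 224), color_mode="rgb",
        class_mode="categorical", batch_size=32, shuffle=shuffle)

train_gen = make_flow(train_aug, train_df, shuffle=True)
valid_gen = make_flow(plain, valid_df, shuffle=False)
test_gen = make_flow(plain, test_df, shuffle=False)   # fixed order (step 5)
```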

Proposed fusion architecture

The proposed models employ feature-level fusion by combining EfficientNetB0 with three standard CNNs, namely ResNet50, InceptionV3 & AlexNet, as shown in Fig. 2. These models pick up deep & local features in retinal images; their merger helps the system learn strong visual cues & reach high scores.

Fig. 2 Model architecture diagram.

EfficientNetB0 + ResNet concat fusion (Exp01)

The first experiment begins with the EfficientNetB013 and ResNet5016 models, coupled to extract features individually. EfficientNet is renowned for its exceptional computational efficiency, while ResNet is illustrious for its deep residual learning, making them powerful feature extractors for fundus images.

We start by feeding 224 × 224 RGB fundus images into both architectures concurrently, so each model processes the same input differently. EfficientNetB0 excels at capturing features cost-effectively because of its balanced scaling, while ResNet50 is known for depth and stability thanks to its residual connections, helping the architecture learn more convoluted patterns. After each model has extracted features individually, we forward those features into GlobalAveragePooling2D to convert the final convolution outputs into compressed feature vectors. These two vectors are then combined into a single enriched representation, absorbing information from both architectures.

For classification, the model uses a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and a Dropout layer that randomly turns off 50% of the neurons. Another Dense layer with 256 neurons follows the same configuration. The final output layer uses Softmax with 4 neurons for multi-class disease prediction.

Throughout training, the model uses the Adam optimizer and categorical cross-entropy loss, and monitors validation accuracy with early stopping to prevent overfitting and limit the use of computational resources. Learning-rate reduction is applied when progress slows.
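
The following is a minimal sketch of the Exp01 concatenation fusion and its classification head in TensorFlow/Keras; the learning rate, callback patience values, and the omission of backbone-specific input preprocessing are assumptions, not details reported by the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB0, ResNet50

inputs = layers.Input(shape=(224, 224, 3))

# Both ImageNet-pretrained backbones see the same fundus image (classifier tops removed).
eff = EfficientNetB0(include_top=False, weights="imagenet", input_tensor=inputs)
res = ResNet50(include_top=False, weights="imagenet", input_tensor=inputs)
eff.trainable = False   # frozen at the start (see Training settings)
res.trainable = False

# Compress each backbone's final feature maps into vectors and concatenate them.
eff_vec = layers.GlobalAveragePooling2D()(eff.output)
res_vec = layers.GlobalAveragePooling2D()(res.output)
fused = layers.Concatenate()([eff_vec, res_vec])

# Head: Dense(512) -> BN -> Dropout(0.5) -> Dense(256) -> BN -> Dropout(0.5) -> Softmax(4)
x = layers.Dense(512, activation="relu")(fused)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),   # assumed learning rate
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3),
]
```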

EfficientNetB0 + inceptionV3 concat fusion (Exp05)

The second experiment combines two distinct architectures, EfficientNetB013 and InceptionV314, by concatenating their retinal feature representations. This configuration not only takes advantage of multi-scale and balanced scaling, but also reduces the use of computational resources while efficiently extracting subtle retinal abnormalities and details from both architectures.

The classifier layers of both pre-trained models were removed by setting `include_top = False`, so that we could attach our own classification head to predict the retinal diseases.

For the model input, we feed 224 × 224 RGB fundus images into both architectures independently, so each architecture learns and extracts features individually; when these are concatenated, we obtain a unified feature vector that captures the spatial information of eye abnormalities. To compress the large feature maps into compact vectors, we use GlobalAveragePooling2D, and the resulting vectors are then merged via the concatenation layer.

For classification, we employed a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and Dropout; Dropout randomly deactivates 50% of the neurons to prevent overfitting. The next Dense layer halves the number of neurons, producing a more concise and compact representation. This reduction in neurons helped decrease MFLOPs, inference time, and model scale. The final layer is a Softmax classifier over the four eye disease classes.

For training, we used the Adam optimizer, the categorical cross-entropy loss function, and validation accuracy as the main evaluation metric. To keep the training of each model efficient and stable, Early Stopping halted training when the validation loss stopped improving, and ReduceLROnPlateau lowered the learning rate when progress slowed so the model could continue to adjust its weights.

EfficientNetB0 + alexnet concat fusion (Exp09)

The last experiment pairs a modern architecture with a vintage one: the EfficientNetB013 and AlexNet15 concatenation model. This combination is well known as a lightweight option that does not compromise robust diagnostic and identification performance.

For the model input, we feed 224 × 224 RGB fundus images into both architectures independently, so each architecture learns and extracts retinal features individually; when these are concatenated, we obtain a unified feature vector that captures the spatial information of eye abnormalities. GlobalAveragePooling2D is used to compact the spatial feature maps into fixed-length vectors, and the two independent vectors are then concatenated to form one combined feature representation for classifying the eye abnormalities.

For classification, we employed a Dense layer with 512 neurons and ReLU activation, followed by Batch Normalization and Dropout; Dropout randomly deactivates 50% of the neurons to prevent overfitting. The next Dense layer halves the number of neurons, producing a more concise and compact representation. This reduction in neurons helped decrease MFLOPs, inference time, and model scale. The final layer is a Softmax classifier over the four eye disease classes.

For training, we used the Adam optimizer, the categorical cross-entropy loss function, and validation accuracy as the main evaluation metric. To keep the training of each model efficient and stable, Early Stopping halted training when the validation loss stopped improving, and ReduceLROnPlateau lowered the learning rate when progress slowed so the model could continue to adjust its weights.

This fusion method offers a strong balance between efficiency and feature diversity. In our experiments it remained an excellent option for resource-limited clinical settings where fast inference and lower hardware demands are priorities, making it well suited for point-of-care devices, mobile screening tools, and clinics with limited computational resources.

Training settings

To make a fair comparison of all 12 models, we trained them on an identical setup of hardware, packages, and libraries.

Each model typically trained for around 3.2 h on a P100 GPU, and most models achieved their best performance between 15 and 30 epochs, showing effective learning. All experiments were performed using Kaggle Notebooks and TensorFlow version 2.9.1, on NVIDIA Tesla P100 GPU accelerators with CUDA 11.2 and cuDNN 8.1 to speed up computation and gradient optimization. To reduce training time and obtain optimal weights for retinal disease identification, we employed transfer learning with pre-trained ImageNet weights for all 12 models.

Initially, we froze all convolutional layers of each architecture, ensuring that the models retained their pre-trained ImageNet weights; only the fusion and classification layers were trained at the start. Later, we gradually unfroze the convolutional layers to extract intricate details of the retinal fundus images. The fine-tuning approach used a lower learning rate, letting the models refine high-level features without ruining the beneficial low-level features acquired from ImageNet.
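
A minimal sketch of this two-stage schedule, reusing the `model`, generators, and `callbacks` from the earlier sketches; the epoch counts and learning rates here are illustrative assumptions.

```python
import tensorflow as tf

# Stage 1: backbones frozen, only the fusion and classification layers learn.
model.fit(train_gen, validation_data=valid_gen, epochs=10, callbacks=callbacks)

# Stage 2: unfreeze the convolutional layers and fine-tune with a lower learning
# rate so the useful low-level ImageNet features are preserved.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=valid_gen, epochs=20, callbacks=callbacks)
```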

Results and discussion

Ablation study on feature fusion strategy

We have conducted 12 experiments with various fusion setups to find the best approach for eye disease classification. These setups use well-known CNN models and let us check how each fusion step affects the model performance.

We have utilized three backbone pairs: EfficientNetB0 + ResNet50, EfficientNetB0 + InceptionV3 & EfficientNetB0 + AlexNet. Every backbone offers a clear strength in feature extraction and affects size, accuracy & speed in its own way; EfficientNetB0 is the base model that we pair with the others. InceptionV3 handles features at many scales due to its block design. ResNet50 is a well-established, older model for training deep convolutional networks and passing large feature maps using its residual connections. AlexNet was chosen as a lightweight solution for eye disease detection in remote locations, offering a good balance between model accuracy and computational cost.

For each backbone combination, we applied four fusion strategies: concatenation (preserving all features by stacking them), element-wise summation (combining features into a more compact representation), weighted fusion (learning the relative importance of features through trainable weights), and majority voting (making final decisions by combining individual model predictions).

The first three methods integrate features within the network, while majority voting takes place at the classification stage. Comparing these approaches helped us determine whether early feature-level fusion or late decision-level fusion offers better diagnostic accuracy.
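The sketch below shows one plausible Keras realization of the three feature-level strategies, reusing the pooled vectors `eff_vec` and `res_vec` from the earlier sketch; the 512-unit projections and the gating form of the weighted fusion are assumptions. Majority voting, in contrast, operates on the class predictions of separately trained models at the decision level.

```python
from tensorflow.keras import layers

# 1) Concatenation: stack both vectors, keeping every feature (dimension d1 + d2).
concat_feat = layers.Concatenate()([eff_vec, res_vec])

# 2) Element-wise summation: project both branches to a common width, then add.
a = layers.Dense(512, activation="relu")(eff_vec)
b = layers.Dense(512, activation="relu")(res_vec)
sum_feat = layers.Add()([a, b])

# 3) Weighted fusion: a trainable gate learns the relative importance of each branch.
gate = layers.Dense(512, activation="sigmoid")(layers.Concatenate()([a, b]))
inv_gate = layers.Lambda(lambda g: 1.0 - g)(gate)
weighted_feat = layers.Add()([layers.Multiply()([gate, a]),
                              layers.Multiply()([inv_gate, b])])
```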

Each model was trained under the same circumstances and parameters: the same dataset, splitting mechanism, image preprocessing, categorical cross-entropy loss, identical optimizer, and a fixed mini-batch setup. These settings are important so that performance differences can be attributed to the architecture rather than external changes. We also report a range of evaluation metrics, such as the MCC and mIoU scores, which indicate robustness to class imbalance and suitability for deployment in clinical environments. Together, these scores help decide which fusion strategies really work for retinal disease classification.
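
A minimal sketch of these metrics with scikit-learn, assuming integer class labels `y_true` and `y_pred` obtained from the test generator; mIoU is computed here as the macro-averaged Jaccard score.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef, jaccard_score

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)                  # robust under class imbalance
miou = jaccard_score(y_true, y_pred, average="macro")    # mean IoU over the 4 classes
```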

Internal ablation study

To find out the influence of different architectural combinations and fusion strategies on eye disease classification performance, we conducted a comprehensive ablation study involving twelve experiments using our internal dataset17. The overall internal validation findings are presented in Table 1.

Table 1 Performance metrics of internal validation.

A comprehensive set of experiments on the internal validation dataset surfaced several key findings. Notably, the fusion strategy plays a vital role in deciding model performance: feature-level concatenation models outperform the others, a clear indication that concatenation-based fusion overshadows the remaining fusion types such as sum, voting, or weighted.

Our internal results show that EffNetB0 + ResNet50 (Exp01) is the top performer of all experiments with 95.26% accuracy, followed by EffNetB0 + InceptionV3 (Exp05) and EffNetB0 + AlexNet (Exp09) at around 94.79% and 93.60%, respectively. The weighted and voting pairs struggle and are unable to match the benchmark set by the concatenation-based models in extracting the rarer disease classes. The concatenation-based models perform well in all scenarios and achieve excellent AU-ROC scores above 0.988. In terms of training statistics, we observed that ResNet50 concat models took longer to train, generally converging between epochs 11 and 27, while the AlexNet-based models converged rapidly between epochs 9 and 23.

For real-world deployment where speed and accuracy must be balanced, the EfficientNetB0 + ResNet50 combination is an excellent all-around choice. If fast processing is your top priority, EfficientNetB0 + AlexNet is a reliable and efficient option. But if maximum accuracy is the main goal, EfficientNetB0 + InceptionV3 remains the strongest performer. Although concatenation-based fusion validated well on internal datasets, additional external testing would further strengthen confidence in these findings.

Fig. 3 Fusion strategy metrics.

In Fig. 3 we evaluate all four fusion strategies (concatenation, summation, weighted aggregation, and majority voting) across three pre-trained backbone pairings: EfficientNetB0 with ResNet50, InceptionV3, and AlexNet. Among these combinations, the EfficientNetB0 + ResNet50 model produced the highest overall results, reaching its peak accuracy of 95.26% when using concatenation fusion. EfficientNetB0 + InceptionV3 also demonstrated consistently resilient performance across all methods, achieving its best accuracy of 94.31% under the weighted fusion approach.

Intriguingly, the EfficientNetB0 + AlexNet fusions showed comparable stability irrespective of which fusion method was applied, suggesting that even this legacy architecture still contributes useful features. Voting consistently performed worst and struggled to capture the deeper feature interactions in retinal images, whether normal, cataract, or glaucoma.

Training and validation curves

Figures 4, 5 and 6 present the training and validation loss & accuracy curves for each model across all 12 fusion experiments. By closely examining the curves, we can analyze the convergence speed and the minimisation of validation loss for each model. The left side of each graph shows accuracy & the right side shows loss.

Fig. 4 Train and val curves of all fusion strategies of Exp01 to Exp04.

Fig. 5 Training and validation curves of all fusion strategies of Exp05 to Exp08.

Fig. 6 Training and validation curves of all fusion strategies of Exp09 to Exp12.

All 12 fusion experiments exhibited distinctive convergence patterns that depend heavily on the backbone architecture and the feature-fusion strategy. Across all tested conditions, the InceptionV3-based fusions (Exp05–Exp08) produced the most stable training curves. Multi-scale processing lets the InceptionV3 branch converge smoothly by utilizing different kernel scales concurrently, an approach that is particularly effective for retinal disease classification, as the graphs clearly show. The weighted fusion approach, which helps manage input feature maps at multiple scales, performed well, with validation loss remaining steady at around 0.20 on average. This adaptability of EfficientNetB0 and InceptionV3 allowed the model to adjust effectively even when small parameter changes altered the learning curves. The fusion strategy and architecture therefore play a vital role in determining the type of convergence observed.

The ResNet50-based fusions (Exp01–Exp04) showed the quickest convergence, reaching about 90.97% accuracy, thanks to residual links that avoid the vanishing gradient problem. On critical analysis, each ResNet50-based model had a steady 4–5% gap from its validation accuracy and the highest loss fluctuations (0.25 to 0.74); they tended to latch onto training-specific details rather than deeper, generalized patterns. Among the ResNet-based concat models, Exp01 achieved a near-perfect training loss (0.002), which indicates overfitting. Fusion in Exp02 exhibited inconsistent behaviour, oscillating between solutions as if the optimizer was struggling to converge. Weighted fusion (Exp03) achieved 92.26% validation accuracy by learning which features were most important, thereby improving its adaptability as feature importance shifted. The voting method (Exp04) failed near epoch 19; accuracy stayed near 67% because the base models drifted apart and most votes carried little weight. The AlexNet-type models (Exp09–Exp12) hit the same limits as the original AlexNet: their simple design did not pair well with EfficientNet, and training was slow and uneven. Early accuracy was close to 60% with strong overfitting; later runs reached 93%, but the validation loss stayed high at 0.25–0.40. The loss curves show that concatenation kept all features but raised the feature space to 4096 dimensions, which increased overfitting. Sum fusion reduced the feature size and gave a small regularizing effect. The best gains came from learning feature weights, but this step was sensitive to tuning. Sum and voting fusion also pushed the models out of sync.

The key point is that models with very different feature scales needed a balanced setup. The weighted InceptionV3 fusion (Exp07) achieved that balance. It showed stable curves with a 4% gap between training and validation and strong results. It is the safest and most practical option for MIA.

Confusion matrices (CM) of internal dataset evaluation

The CMs for all 12 fusion models, shown in Fig. 7 and tested on the internal dataset17, show how each model performs and where it needs improvement. They reveal how well the models distinguish different disease types41 and highlight both strengths and weaknesses42, 43.

Fig. 7 CM of internal ablation study.

Across all 12 experiments, EfficientNetB0 paired with other networks gave steady results for eye disease classification. EfficientNetB0 + ResNet50 (Exp01–Exp04) worked well. Exp03 did best because weighted fusion handled similar-looking cases. The top result came from EfficientNetB0 + InceptionV3 (Exp06/07), reaching 96.51% validation accuracy with 1.8% error. InceptionV3 captured small, multi-scale details that matched EfficientNet’s features.

EfficientNetB0 + AlexNet (Exp09–Exp12) worked for cataracts and DR but struggled with glaucoma. AlexNet is older and misses subtle signs. Weighted or sum fusion (Exp03, 06, 07, 10) improved results by focusing on key features. Modern networks with smart fusion handled hard tasks like spotting glaucoma in healthy eyes.

Class-wise performance heatmap of internal validation

Figure 8 shows precision, recall & F1-score for each model across the four eye conditions44, 45.

Fig. 8 Class-wise performance heat-map of the Internal dataset.

Observing Exp01–Exp12 on dataset17, EfficientNetB0 + ResNet50 (Exp01–Exp04) retains high recall & precision for cataracts & DR. Performance for glaucoma & normal eyes varies. Weighted fusion in Exp03 lowers this variation.

EfficientNetB0 + InceptionV3 (Exp05–Exp08) is the overall best. Exp06 & Exp07 showed how InceptionV3’s multi-scale interpretation & EfficientNet’s features work together, making healthy vs. diseased retinas easy to distinguish.

EfficientNetB0 + AlexNet (Exp09–Exp12) is fine but less consistent for glaucoma & normal eyes. AlexNet struggles with subtle features, and fusion adds some help. The heatmaps show that weighted & sum fusion give more balanced & reliable results for each disease.

Clinical Interpretability/Explainability of internal evaluation

We have used Score-CAM & LIME (with SLIC segmentation) to see which parts of the eye the model mainly focuses on.

For DR, heatmaps highlight the optic disc, macula & main blood vessels, where early signs appear. For glaucoma, activations stay on the optic nerve head, matching real structural changes when the neuroretina rim thins. Normal retinal images don’t show any strong hotspots, while cataract images show softer, more spread-out activations that match the overall haze & lowered contrast typical of the condition. Overall, the activation patterns make sense clinically & match the real anatomical changes associated with each disease as presented in Fig. 9.
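
A minimal sketch of the LIME explanation with SLIC superpixels is given below, assuming the `lime` and `scikit-image` packages and the trained `model` from the earlier sketches; the segment count, sample number, and variable `image` (one 224 × 224 × 3 fundus image) are illustrative assumptions.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import slic, mark_boundaries

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image.astype("double"),
    classifier_fn=lambda batch: model.predict(np.array(batch)),
    segmentation_fn=lambda img: slic(img, n_segments=100, compactness=10),
    top_labels=4, num_samples=1000)

# Keep only the superpixels that most support the predicted class and outline them.
img_out, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(img_out / 255.0, mask)
```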

Fig. 9 Score-CAM of internal ablation study.

External ablation study

To rigorously test whether the proposed eye disease classification models could handle real-world clinical conditions beyond the training data, we performed external validation using two completely independent datasets: Messidor-219 and the Ocular Disease Intelligent Recognition (ODIR) dataset18. Messidor-219 provided 1,748 high-quality fundus images from 874 DR screening exams, complete with DR severity grades and quality ratings, which we carefully preprocessed to remove distracting black borders. ODIR18 presented an even tougher challenge, containing fundus photographs from both eyes of 5,000 patients collected across multiple Chinese hospitals using different camera brands and models, resulting in images with varying resolutions, lighting conditions, and quality levels that doctors actually encounter in everyday practice.

We carefully formed a balanced external validation dataset with 100 professionally annotated images for each of the four diagnostic categories: Cataract, Glaucoma, Normal & DR. This test set is significant because it reflects real-world circumstances: images come from diverse devices and patient groups and show natural quality differences. This lets us check whether the model handles real clinical cases or just memorizes training patterns. The results are presented in Table 2.

Table 2 Performance metrics of external validation.

The external dataset had about 400 retinal images from the two sources18, 19, covering different hospitals, devices, and patient groups. All fusion models scored over 95% accuracy. The top model, EfficientNetB0 + InceptionV3 with voting fusion (Exp08), reached 97.99%, and its weighted counterpart (Exp07) reached 97.74%. The lower-performing experiments (Exp06 & Exp12) still reached around 96.49%, showing the architecture is robust.

MCC scores were similar. Exp07 and Exp08 had 0.973. The lightweight AlexNet Sum fusion (Exp10) reached 0.980, sometimes beating heavier models on high-quality images. For class-level agreement (mIoU), InceptionV3 fusions led: Exp05 (0.961), Exp08 (0.960), and Exp07 (0.956). ResNet50 and some AlexNet variants were lower (Exp01: 0.918; Exp12: 0.932).

EfficientNetB0 + InceptionV3 balances good features and computation (≈ 27–30 M parameters, ~ 6,500 MFLOPs). AlexNet models were fast (0.130–0.263 s/image) and light but slightly less accurate. Concatenation fusion was the most stable, weighted fusion gave small gains, and voting worked well but sometimes lowered accuracy.

Every model had a strong AUROC of ≥ 0.991, with Exp08 reaching 0.999. EfficientNetB0 + InceptionV3 with concatenation or weighted fusion performed best, while the AlexNet pairings were efficient & nearly as accurate (see Fig. 10).

Fig. 10 Radar plot of external ablation study.

We tested all 12 fusion models to compare accuracy and computation. Lightweight models like Exp09 (EfficientNetB0 + AlexNet, concatenation) use 11.86 M parameters and 2736.20 MFLOPs. They can process an image in 0.133 s, which works well for busy clinics or emergencies.

Heavier models like Exp01 (EfficientNetB0 + ResNet50, concatenation) use 29.47 M parameters and 8555.65 MFLOPs. They need stronger hardware and more time, but they catch details that lighter models might miss. Concatenation gives richer features but costs more compute. Sum or voting methods reduce load while keeping good performance. Processing times range from 0.133 s (Exp09) to 0.363 s (Exp05).

No model fits all cases. Large hospitals can focus on accuracy. Rural clinics may need efficiency. Emergency units need fast processing.

CM of external dataset evaluation

The CMs for all 12 models show performance and errors on the external datasets18, 19. They reveal how well the models separate classes & where improvement is needed41, 42, 43.

Fig. 11 CM of external ablation study.

EfficientNetB0 + ResNet50 (Exp01–Exp04) is reliable but often misclassifies Glaucoma. Concatenation models (Exp01, Exp05, Exp09) confuse Glaucoma with Normal. Sum fusion (Exp02, Exp06, Exp10) reduces these errors. Weighted and Voting models are mixed. Exp03, Exp07, and Exp12 detect Glaucoma at 98–100%. Normal cases do better with fusion methods other than concatenation.

Cataract and DR are classified well in all experiments. Voting helps maintain accuracy and reduce errors. EfficientNetB0 + InceptionV3 (Exp05–Exp08) produces clean matrices, while EfficientNetB0 + AlexNet (Exp09–Exp12) shows more errors, which Weighted or Voting fusion partly fixes. Voting is the most reliable strategy, especially for Glaucoma and Normal, consistent with Fig. 11.

Class-wise performance heatmap of external validation

Figure 12 shows class-level performance. Diabetic Retinopathy is easy to identify with high recall. Glaucoma is hardest, especially in Concatenation models (Exp01, Exp05, Exp09). Normal cases vary by fusion method; Sum and Voting improve stability.

InceptionV3 models (Exp05–Exp08) are the most consistent. AlexNet models are weaker initially but Weighted (Exp11) and Voting (Exp12) improve results. Voting (Exp04, Exp08, Exp12) gives stable predictions across diseases. Weighted fusion helps models with uneven feature contributions44, 45.

Fig. 12 Class-wise performance heat-map of the external dataset.

Clinical interpretability/explainability of external evaluation

We checked model predictions on external data18, 19 using Score-CAM20 and LIME (SLIC)46. The model focuses on relevant retinal regions, not the background (Fig. 13).

Score-CAM shows correct attention: normal images are faint; DR highlights macula and vessels; Glaucoma focuses on the optic nerve; Cataracts show a diffused lens pattern. This confirms attention is clinically meaningful.

Fig. 13 Score-CAM of the external ablation study.

Comparative study

External accuracy matched or exceeded internal results (Fig. 14), improving 0.4–4.6% with no drops. Top models (Exp03, Exp05, Exp11) merge features well across datasets.

Weighted fusion is consistent (Exp03, Exp07, Exp11), learning how much to trust each backbone. Summation is steady, especially for AlexNet (Exp10). Concatenation and voting give smaller gains. Weighted fusion is best for combining multiple backbones efficiently.

Fig. 14 Internal and external validation comparison.

Few studies fuse EfficientNetB0 with other CNNs for multi-class eye disease classification, showing a need for models that generalize, as in Table 3.

Table 3 Comparative Analysis.

Discussion of results

Internal and external results are close. EfficientNetB0 + InceptionV3 (Exp05–Exp08) and + ResNet50 (Exp01–Exp04) scored 92–95% with strong recall and efficiency. Concatenation and Weighted fusion captured subtle features and reduced errors. AlexNet models, though weaker, still captured fine details.

External testing improved results. Most models reached 97–98% accuracy. Weighted and Sum fusion combined multi-scale features, increasing reliability. EfficientNetB0 + InceptionV3 and + ResNet50 AUROC was 0.996–0.999. AlexNet-based models also scored well with optimized fusion, e.g., Exp10 MCC 0.980.

Clinical significance and inferences

The Score-CAM visualizations reveal two connected sides of the same story: what the model actually learns from the data, and how well that learning carries over to completely new hospitals and imaging systems58, 59, 60. When we examine the internal dataset, the activation maps clearly show the model focusing on the correct anatomical regions. In glaucoma cases, the attention locks onto the optic nerve head, where damage typically appears. For diabetic retinopathy, the model spreads attention across the posterior part of the eye and the macular region, where disease indicators are usually found. Normal retinas show balanced attention across the optic discs of both eyes, key anatomical reference points, while cataract cases highlight the central lens, where cloudiness develops. These patterns show that the model is learning real clinical features. The real test comes from the external datasets18, 19 from different hospitals, with changes in lighting, cameras, and image quality. Even with these changes, the attention patterns stay consistent: glaucoma predictions focus on the optic nerve, DR predictions focus on the back of the retina, normal eyes show attention on both optic discs, and cataract predictions focus on the lens. This shows the model is not memorizing one system but learning what the diseases look like in general.

Testing on multiple datasets is important for trust. It shows the model works under controlled conditions and handles real-world differences. Score-CAM maps show where the model struggles or looks at unexpected areas. These cases point to situations where human judgment is needed.

The model stays reliable even with poor images. It also shows when it is unsure. This makes it useful for hospitals, clinics, rural centers, or mobile units, no matter the equipment available.

Conclusion & future work

This study tests several dual-backbone fusion models for eye disease classification. EfficientNetB0 is paired with ResNet50, InceptionV3, and AlexNet under four fusion types. Tests on both Internal and External datasets show high accuracy and good performance across sources. The models can identify cataracts, glaucoma, DR, and normal eyes reliably. On the Internal dataset, accuracy is 92–95%, with EfficientNetB0 + InceptionV3 (Exp05–Exp08) and EfficientNetB0 + ResNet50 (Exp01–Exp04) performing best.

External dataset results are higher, with 94.99–97.99% accuracy and AUROC of 0.991–0.999. Top setups Exp03, Exp07, Exp08 & Exp10 separate disease classes well & achieve high mIoU (0.960–0.980). AlexNet-based models, including Exp10 with sum fusion, show that smaller networks can still work well with the right fusion. Heatmaps & CMs support the reliability of the results. Concatenation & weighted fusion handle hard pairs such as glaucoma vs. normal eyes. Score-CAM, Grad-CAM & LIME (SLIC) point to key parts of the fundus, such as the optic disc, macula, nerves & vessels, while avoiding areas that do not help.

This study also offers clear guidance on model design, testing & performance for multi-class eye disease work. It supports future AI tools for clinics. The models exceed 95% accuracy & AUROC above 0.991, but larger & more diverse datasets are needed. Multimodal inputs, such as fundus with OCT, may help with early disease diagnosis. Vessel modeling, mixture-of-experts & graph-based work may also help track change and improve fusion.

Current explainability tools highlight the right regions across varied ages & image quality. Future work should add counterfactual views & uncertainty scores so clinicians can judge model confidence. Work on real-time use, edge devices & compression is also needed. Clinical testing with ophthalmologists will be the key step to move this system into practice. The strong external results show that it can adapt to many imaging setups.