Abstract
Diabetic Retinopathy (DR) remains the leading cause of preventable blindness worldwide, and there is an urgent need for accurate and interpretable diagnostic frameworks. This paper proposes a Multi-View Cross-Attention Vision Transformer (MVCAViT) framework that exploits the complementary information in the paired macula-centered and optic-disc-centered views of the DRTiD dataset. A novel cross-attention-based model integrates multi-view spatial and contextual features to achieve robust feature fusion for comprehensive DR classification. A hybrid Vision Transformer and convolutional neural network architecture learns global and local features, and a multitask learning approach addresses disease presence detection, severity grading and lesion localization in a single pipeline. Results show that the proposed framework achieves high classification accuracy and lesion localization performance, supported by comprehensive evaluations on the DRTiD dataset. Attention-based visualizations further enhance interpretability, indicating the framework’s potential for clinical use. This framework advances the state of the art in retinal image analysis for DR diagnosis and may support better patient outcomes and clinical decision-making.
Introduction
Diabetic retinopathy (DR) is one of the most prevalent complications of diabetes mellitus and a significant cause of preventable blindness worldwide. Prolonged hyperglycaemia damages the retinal blood vessels and contributes to structural and functional abnormalities of the retina. As the disease progresses, it is characterized by the development of microaneurysms, intraretinal haemorrhages, and neovascularization, which may lead to total vision loss1. DR progresses from non-proliferative diabetic retinopathy (NPDR) to proliferative diabetic retinopathy (PDR). Early detection of DR is critical for timely intervention, yet it remains difficult in many regions, particularly where diagnosis relies on manual processes. Figure 1 shows a clear distinction between normal and DR images.
Diabetes and its complications (including DR) place a growing burden on the global population. In 2021, about 537 million adults were living with diabetes worldwide, and this number is expected to rise significantly by 2030 if bold action is not taken3. DR is expected to develop in nearly one-third of diabetic patients4. This incidence is therefore not just a problem for a few individuals whose quality of life is affected; its impact on healthcare costs, especially in low-resource settings, is also significant (Fig. 2).
To face these challenges, automated and scalable solutions for DR screening and diagnosis are urgently needed5. Current DR diagnosis is based mainly on fundus photography, in which ophthalmologists visually examine the retinal images. This approach is effective in experienced hands and minimally invasive, but it is time-consuming, subject to interobserver variability, and dependent on skilled specialists6. In regions with few trained professionals, these problems are compounded. In response, artificial intelligence and machine learning have been employed to develop automated DR diagnostic systems. Among these, convolutional neural networks (CNNs) have shown promise in extracting features from retinal images for disease classification, lesion segmentation and severity grading7. However, the models in these works neglect the potential gain from combining information from multiple viewpoints.
Recently, Vision Transformers (ViTs) have brought a new level of generality to deep learning research, especially in applications that require long-range context understanding. ViTs excel at encoding high-level, global relationships across an image and are therefore well suited to complex medical images8,9. However, their potential for multi-view retinal analysis remains underexplored. Multi-view approaches that integrate macula-centered and optic-disc-centered retinal images offer complementary information about retinal status and can greatly improve diagnostic performance10. Yet, although datasets such as the Diabetic Retinopathy Two-field Image Dataset (DRTiD) are available, few models exploit both perspectives thoroughly11.
These gaps motivated the development of a novel Multi-View Cross-Attention Vision Transformer (MVCAViT) framework for two-field retinal image analysis. The proposed framework relies on a cross-attention mechanism to combine information from macula-centered and optic-disc-centered views, allowing the model to detect crucial spatial and contextual aspects that are absent from single-view models. This approach uses the strengths of multi-view data12 to overcome the limitations of traditional models.
The MVCAViT framework combines the strengths of Vision Transformers and Convolutional Neural Networks in its architecture. CNNs capture fine-grained local features, while ViTs are efficient at extracting global dependencies and structural details. The proposed model integrates these architectures for robust feature extraction and fusion13,15. The MVCAViT framework further employs a multitask learning strategy that allows it to perform disease classification, lesion localization and severity grading at the same time. This multi-faceted approach boosts diagnostic accuracy even further and allows clinicians to learn more about disease progression.
Concept of multiview images
Multiview retinal imaging refers to images of the retina captured from different regions. Together, these two perspectives yield complementary information concerning the structural and functional integrity of the retina. The macula-centered view focuses on the macula, a central region important for detailed vision, whereas the optic-disc-centered view captures the optic nerve head and the surrounding retinal vasculature. Figure 3a,b depicts both perspective-based images.
a Macula centered retinal image. b Optic Disc centered retinal image3.
Taken together, these images provide an in-depth analysis of retinal health; importantly, DR may have localized or global manifestations across the retina. The functional complementarity between macula-centered and optic-disc-centered views significantly improves diagnostic and prognostic capabilities over single-view imaging. Despite this potential, existing models make little use of multi-view retinal images, and single-view inputs in most automated systems miss critical information. Macula-centered images, for example, may be used to detect early microaneurysms and central-vision-threatening lesions, whereas optic-disc-centered views can pick up peripheral lesions and even optic-nerve-related abnormalities. Integrating the two therefore provides a complete view that supports accurate DR assessment. Even so, feature fusion from these perspectives is problematic due to alignment issues, redundancy, and potential loss of critical information during processing, and addressing these challenges requires advanced methods to effectively manage and integrate multi-view data.
There is also a clinical need for research on two-field images. In practice, the ophthalmologist typically examines the retina from various angles in order to capture the whole range of pathology. Furthermore, the development of novel multi-view models is facilitated by the availability of datasets such as the Diabetic Retinopathy Two-field Image Dataset (DRTiD). However, methods for processing and analysing such data remain poorly standardized, which limits their use and impact. These gaps can only be addressed through research on two-field retinal images. It has been demonstrated that macula-centered and optic-disc-centered views are complementary and can greatly reduce false positive and false negative rates in automated systems. In addition, these perspectives can be integrated to enable advanced applications such as multitask learning, where the system learns to detect lesions, determine the stage of the disease and provide interpretable results for affected areas. This capability fits well with the needs of clinical practitioners and increases trust in AI-driven diagnostics.
Overall, multi-view retinal imaging is a promising frontier for automated diabetic retinopathy diagnosis. With the distinctive strengths of macula- and optic-disc-centered views, researchers can create accurate, clinically aligned and scalable models.
Motivation and contributions of the research work
Diabetic retinopathy is rapidly becoming a problem that can lead to irreversible blindness, so advanced diagnostic frameworks with accurate and interpretable outcomes are a necessity. However, existing systems often do not take advantage of the information contained in multiple views (macula-centered and optic-disc-centered). The proposed research addresses this limitation by exploring new methodologies for utilizing two-field retinal images for automated diabetic retinopathy diagnosis and by establishing new benchmarks.
The contributions of this research are listed below.
1. This research introduces a novel cross-attention mechanism customized for multi-view feature fusion, showing that macula- and optic-disc-centered views provide complementary evidence that enhances diagnostic accuracy and robustness.

2. The research demonstrates superior diabetic retinopathy classification and severity grading performance by combining the global contextual strengths of Vision Transformers with the local feature extraction capabilities of CNNs.

3. The framework is shown to handle multiple tasks such as lesion localization, disease stage prediction and severity grading, demonstrating its flexibility in meeting clinical demands within a single system.

4. By applying the framework to generate attention-based visualizations, we demonstrate its ability to highlight the regions that most influence predictions.

5. The research is expected to show a dramatic reduction in false positives and false negatives, ultimately illustrating the superior performance of multi-view systems compared with conventional single-view systems.
This paper is organized as follows: In "Related work" section, related works for diabetic retinopathy classification are surveyed, including current machine learning and deep learning approaches. In "Proposed work" section, we provide details of the architecture of the proposed MVCAViT, including cross attention mechanism and hybrid model design. In "Results and discussion" section, we detail the experimental set-up, dataset description and the evaluation metrics that are used to validate the proposed model. The results and analysis are presented in "Results and discussion" section, which also discusses the model’s performance with regard to accuracy. The paper concludes with "Conclusion" section; a discussion of limitations, future directions and what the proposed framework could potentially mean for diabetic retinopathy screening.
Related work
The diagnosis and grading of diabetic retinopathy (DR) have been greatly advanced by applying optimization techniques to deep learning models. Such approaches address issues of scalability, accuracy, and generalization, with the goal of increasing the reliability of automated diagnostic systems. In this section, existing literature on optimization-driven and deep-learning-based models for diabetic retinopathy detection and classification is reviewed. Initial automated diagnostic systems were built on machine learning baselines. A Support Vector Machine (SVM) based classifier with handcrafted features such as texture, intensity and morphological descriptors was proposed for binary classification of diabetic retinopathy in16. While effective on small datasets, these models were impractical because they depended on manual feature engineering and did not scale to complex, high-dimensional data. A Random Forest based approach was introduced in17 that combined statistical and structural features for multi-class classification. The model achieved moderate success but struggled with feature selection, so optimization-based methods were required to increase performance.
Deep learning models, which automate feature extraction and deliver superior results, have since revolutionized the field. Convolutional Neural Networks (CNNs) have dominated diabetic retinopathy diagnosis. In18, a CNN-based framework trained on retinal fundus images was developed to classify DR severity stages. It showed high sensitivity and specificity and became a benchmark in the field. A ResNet architecture was used in19 to achieve robust performance in multi-class classification tasks as well; the authors highlighted the contribution of deep feature hierarchies in capturing the complex patterns of diabetic retinopathy. Over the years, optimization techniques have proved an effective means of boosting deep learning model performance. In20, Particle Swarm Optimization (PSO) was applied to tune CNN hyperparameters in order to improve classification accuracy and reduce computational overhead. In21, Genetic Algorithms (GAs) were used to select features and initialize weights for CNN training, achieving remarkable improvements in accuracy and convergence speed. Integrating such optimization techniques helps overcome the limitations of conventional deep learning models.
Hybrid approaches that merge deep learning with optimization have also attracted much attention. To analyze the spatial and sequential patterns in retinal images, a hybrid model composed of CNNs and Long Short-Term Memory (LSTM) networks was introduced in[]. This method successfully captured changes in diabetic retinopathy over time. Authors in23 similarly introduced a deep learning model tuned using Simulated Annealing that markedly reduced overfitting and generalized better across datasets. Such hybridization continues to have a significant impact on the robustness and adaptability of diagnostic systems.
Lesion localization and segmentation have particularly benefited from deep learning combined with optimization. A U-Net based architecture optimized by Differential Evolution was proposed in24 for lesion segmentation. This method provided precise localization of microaneurysms and haemorrhages, which are significant markers of diabetic retinopathy progression. In25, another study combined CNN-based feature extraction with Ant Colony Optimization for lesion detection, greatly improving sensitivity through this hybrid approach. In recent research, ensemble methods have been combined with optimization. In26, an ensemble of CNNs optimized with a Harmony Search algorithm was proposed, leading to state-of-the-art results on benchmark datasets. Similarly, in27, an AdaBoost framework combined with PSO was proposed to address class imbalance while improving classification accuracy for minority classes. Despite these improvements, issues remain with computational efficiency, clinical applicability, and dataset variability. The issues raised in28 are addressed by a combined Federated Learning and Genetic Algorithm model for hyperparameter optimization that guarantees data privacy while maintaining high diagnostic performance. A weighted loss function along with an optimization-derived sampling technique was used in29 to handle data imbalance and help the model deal with under-represented classes. A study in30 applied Bayesian Optimization to dynamically tune hyperparameters during training, achieving substantial improvements in convergence speed and model stability.
In recent years, the integration of deep learning techniques in medical image analysis, especially for diabetic retinopathy, has continued to improve. Prior work has set a solid foundation for feature extraction and classification tasks23,29,32,33. One early example is23, which developed a deep attentive convolutional network that significantly improves diabetic retinopathy classification over traditional CNN architectures. In contrast,29 applied automated approaches to handheld retinal images to simplify modeling and reported high classification accuracy. These approaches, however, optimize the model and fuse the features separately, giving rise to suboptimal combinations of local and global image features.
Motivated by the above, we present our Multitask Network (MTN) framework, which combines CNNs for local feature extraction and ViTs for global contextual capture, and uses Particle Swarm Optimization (PSO) to dynamically fuse these features at runtime. Previous studies such as25,34,35,36 have explored multi-path feature-extraction-based approaches, but the capability of our model to perform disease classification, lesion localization and severity grading simultaneously sets a new bar in the field. The alignment with clinical diagnostic workflows and the significantly reduced time and computational resources required are crucial for real-world applications and make this comprehensive approach appealing.
Therefore, the integration of these advanced methodologies in the MTN framework bridges the key knowledge gaps left unexplored by previous works in terms of feature fusion, optimization and multitask capabilities37,38,39,40,41. By utilizing the proposed approach, we anticipate significant improvement in the accuracy, efficiency, and applicability of automated diabetic retinopathy diagnosis systems, and it represents a major advance for the use of artificial intelligence in ophthalmology.
Finally, deep learning with optimization techniques has made substantial contributions to diabetic retinopathy screening42. By focusing on the critical challenges in designing such models, namely hyperparameter selection, feature extraction and data imbalance, robust and scalable diagnostic systems become possible. Nonetheless, more research is required to improve model generalization in real-life clinical settings. In this direction, this research work seeks to extend these advancements with an optimization-driven framework that incorporates deep learning models for diabetic retinopathy detection and severity grading.
Proposed work
In this paper, we propose a novel optimization-driven multitask deep learning framework that addresses the limitations of existing diabetic retinopathy diagnostic systems and is designed for the analysis of multi-view retinal images. The complementary information from macula-centered and optic-disc-centered images is used to improve the accuracy of classification, lesion localization and disease severity grading.
A key novelty is the integration of optimization techniques that dynamically balance feature contributions from the two views, leading to robust and interpretable decisions. The framework adopts a hybrid architecture of convolutional neural networks (CNNs) for feature extraction and an optimization module for feature fusion and parameter adjustment. Using a multi-task learning strategy, the system simultaneously performs classification and localization, improving its clinical utility.
Problem formulation
The diabetic retinopathy diagnostic problem can be formulated as a multi-task learning objective where the model simultaneously predicts disease stages and localizes lesions. Given a dataset of paired images \(\{(I_m^i, I_o^i)\}_{i=1}^{N}\), where \(I_m^i\) and \(I_o^i\) represent macula-centered and optic-disc-centered images respectively, and corresponding labels \(\{(y^i, B^i)\}_{i=1}^{N}\), where \(y^i\) is the disease class label and \(B^i\) represents lesion bounding boxes, the framework aims to minimize a composite loss function:
In Eq. (1), α,β,γ are weights balancing the contributions of each task.
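Although the rendered equation is not reproduced here, Eq. (1) corresponds to the weighted composite objective

\[ \mathcal{L}_{total} = \alpha\,\mathcal{L}_{cls} + \beta\,\mathcal{L}_{loc} + \gamma\,\mathcal{L}_{reg}, \]

a form consistent with the component losses defined below and the overall loss stated in Eq. (8).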
Feature extraction is performed independently for macula-centered and optic-disc-centered views using two CNNs. The output feature representations are:
In Eq. (2), \(F_m\) and \(F_o\) are feature vectors extracted from the macula-centered and optic-disc-centered images respectively.
The fusion of features \(F_m\) and \(F_o\) is achieved through a weighted summation:
In Eq. (3), \(w_m\) and \(w_o\) are learnable weights that are dynamically optimized during training. These weights ensure an appropriate balance of contributions from both views.
The weights \(w_m\) and \(w_o\), along with model hyperparameters, are optimized using an optimization algorithm such as Particle Swarm Optimization (PSO). The objective function is defined as:
subject to constraints \(w_m + w_o = 1\) and \(w_m, w_o > 0\), and θ represents the model parameters.
The classification task uses cross-entropy loss:
In Eq. (5), C is the number of classes, \(y_j^i\) is the ground truth label, and \(\hat{y}_j^i\) is the predicted probability for class j.
The localization task uses Intersection-over-Union (IoU) loss:
In Eq. (6), \(B_{pred}\) and \(B_{true}\) are the predicted and ground truth bounding boxes respectively.
A regularization term ensures model stability:
In Eq. (7), λ is the regularization coefficient.
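For reference, plausible standard forms consistent with these descriptions (the rendered Eqs. (5)-(7) are not reproduced here) are

\[ \mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_j^i \log \hat{y}_j^i, \qquad \mathcal{L}_{loc} = 1 - \frac{|B_{pred} \cap B_{true}|}{|B_{pred} \cup B_{true}|}, \qquad \mathcal{L}_{reg} = \lambda \lVert \theta \rVert_2^2 . \]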
Combining these components, the overall loss function is expressed as:
The optimization algorithm, such as PSO, dynamically updates the weights and parameters:
In Eq. (9), \(v^t\) is the velocity of the particle, \(w\) is the inertia weight, \(c_1\) and \(c_2\) are acceleration coefficients, \(r_1\) and \(r_2\) are random values, and \(p_{best}\) and \(g_{best}\) are the particle's and global best positions, respectively.
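These symbols correspond to the canonical PSO update rule; with a particle position \(x^t\), Eq. (9) can be read as

\[ v^{t+1} = w\,v^t + c_1 r_1 \left(p_{best} - x^t\right) + c_2 r_2 \left(g_{best} - x^t\right), \qquad x^{t+1} = x^t + v^{t+1}. \]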
Workflow of proposed model
The proposed framework is a state-of-the-art hybrid model which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) and optimization techniques for multi-view retinal image analysis. This section details every stage of the pipeline, explaining how the input images are turned into final predictions through a series of transformations and processes. Figure 4 depicts the overall flow process.
Input image processing
Our method starts with macula-centered and optic-disc-centered retinal images as input. All of them are pre-processed to make them consistent and to improve diagnostic accuracy. Preprocessing comprises normalization, resizing and augmentation. Since pixel intensities can vary greatly with imaging conditions, normalization scales pixel values into the continuous range [0,1] so that these disparities are minimized. Resizing standardizes the image dimensions to make them compatible with the model architecture. Rotation, flipping and intensity scaling are used to augment the data, producing diverse samples, mitigating overfitting and improving model robustness. At the end of this stage, the images are uniform and ready for feature extraction.
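As an illustration of this stage, the following is a minimal preprocessing sketch assuming OpenCV and NumPy; the file names and the exact augmentation parameters are illustrative rather than the implementation used in the experiments.

```python
import cv2
import numpy as np

def preprocess(image_path, size=(224, 224)):
    """Load a fundus image, standardize its dimensions, and scale intensities to [0, 1]."""
    img = cv2.imread(image_path)                     # BGR fundus image
    img = cv2.resize(img, size)                      # standardize dimensions
    return img.astype(np.float32) / 255.0            # normalize to [0, 1]

def augment(img):
    """Return simple augmented variants: rotations, a horizontal flip,
    and a brightness-scaled copy."""
    variants = [np.rot90(img, k) for k in (1, 2, 3)]     # 90/180/270 degree rotations
    variants.append(img[:, ::-1, :])                     # horizontal flip
    variants.append(np.clip(img * 1.2, 0.0, 1.0))        # brightness scaling
    return variants

# Example usage (hypothetical file names):
# macula = preprocess("macula_view.png")
# disc = preprocess("optic_disc_view.png")
# training_samples = [macula] + augment(macula)
```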
Feature extraction using CNNs and vision transformers
Feature extraction is performed using a hybrid approach combining CNNs for localized spatial features and ViTs for capturing global contextual information. Figure 5 presents the complete architectural design of the MVCAViT framework. It illustrates the dual-branch processing of macula-centered and optic disc-centered views through CNN and ViT encoders, followed by a cross-attention fusion block and multitask heads for classification and lesion localization. This structural visualization supports the conceptual foundation described earlier and sets the stage for the following mathematical formulation, which rigorously defines the cross-attention-based fusion strategy used to integrate spatial and contextual features from both views. The equations below formalize the attention computation, feature fusion, and optimization mechanism critical to achieving robust multi-view diabetic retinopathy diagnosis.
a. The retinal images are pre-processed to eliminate speckle noise, improve image contrast and segment, at the pixel level, the areas containing the blood vessels and the retina. Each pre-processed image is passed through a CNN to extract local features such as edges, textures and microstructures that are essential for the identification of retinal lesions.

b. The same images are divided into smaller patches, flattened and projected into vectors. To preserve the spatial relationship among patches, positional encodings are added. ViTs compute the relationships between all patches with self-attention mechanisms, so the model can learn global dependencies across the retina. The attention-driven features are then fed into feedforward networks to generate compact representations (a minimal sketch of this patch embedding is given after this list).

c. To ensure alignment between the macula- and optic-disc-centered views despite inherent anatomical variability, we apply spatial normalization by resizing all input images to 224 × 224 pixels and performing pixel intensity normalization. While rigid registration is not employed, the ViT's use of fixed positional encodings for each patch allows the model to retain relative spatial relationships during attention computation. This ensures that even when the retinal structures are differently oriented across views, the positional context of each patch is preserved. These encodings are added to the input patch embeddings before being processed by the self-attention layers, aiding spatial correspondence learning across the two views. For CNN-based local feature extraction, translational invariance from convolutional kernels ensures robustness to minor misalignments between the two perspectives.
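The following is a minimal PyTorch sketch of the patch embedding with positional encodings described in item (b); the patch size, embedding dimension and class name are illustrative assumptions, not the exact configuration used.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a 224x224 retinal image into 16x16 patches, project each patch to an
    embedding vector, and add learnable positional encodings (illustrative sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=256):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196 patches
        # A strided convolution is the usual way to flatten and project patches.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                     # x: (B, 3, 224, 224)
        x = self.proj(x)                                      # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                      # (B, 196, dim)
        return x + self.pos_embed                             # add positional context

# Example: embed both views before cross-attention fusion.
macula = torch.randn(2, 3, 224, 224)
disc = torch.randn(2, 3, 224, 224)
embed = PatchEmbedding()
tokens_m, tokens_o = embed(macula), embed(disc)               # (2, 196, 256) each
```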
The CNN and ViT features are concatenated, creating a unified representation that combines fine-grained local details with broad contextual insights.
While our method does not use traditional image registration algorithms, spatial misalignment is effectively compensated through:
- ViT positional encodings for spatial structure preservation,
- Cross-attention weights that map relational relevance,
- PSO-optimized fusion weights that determine view-specific contribution based on learned importance.
This design is not only robust to mild anatomical variations but also scalable for real-world variability in clinical images.
Feature fusion
A weighted summation mechanism is used to fuse the concatenated features from the macula-centered and optic-disc-centered views. The fusion weights are dynamically optimized to achieve the optimal blend of the two views' contributions. The unique and complementary information from each perspective is thereby exploited, overcoming the limitations of single-view systems.
Given the local feature sets from CNNs for the macula-centered (\(F_M\)) and optic-disc-centered (\(F_O\)) views, and the global context from ViTs (\(G_M\), \(G_O\)), the model computes a cross-attention score using scaled dot-product attention:
where Q, K, and V are the learned linear projections for query, key, and value matrices, and d is the dimensionality scaling factor. These matrices are used to compute cross-view attention maps that selectively emphasize discriminative features from the counterpart view.
The final fused representation is obtained as:
where \(\alpha\) and \(\beta\) are learnable weights optimized during training via PSO.
These fused features are passed to the multitask prediction heads. This formulation ensures that the model learns context-aware attention-driven fusion between complementary views, enhancing feature representation.
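The following PyTorch sketch illustrates the scaled dot-product cross-attention and learnable weighted fusion formalized above; the token dimensionality and module names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Cross-attention between macula and optic-disc token sequences, followed
    by a weighted sum (alpha, beta) of the attended representations."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # in the paper these weights
        self.beta = nn.Parameter(torch.tensor(0.5))    # are tuned by PSO

    def attend(self, query_tokens, context_tokens):
        # softmax(Q K^T / sqrt(d)) V, with Q from one view and K, V from the other
        q, k, v = self.q(query_tokens), self.k(context_tokens), self.v(context_tokens)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, tokens_m, tokens_o):
        m_att = self.attend(tokens_m, tokens_o)         # macula attending to disc view
        o_att = self.attend(tokens_o, tokens_m)         # disc attending to macula view
        fused = self.alpha * m_att.mean(dim=1) + self.beta * o_att.mean(dim=1)
        return fused                                    # (B, dim) fused representation

# fused = CrossViewFusion()(tokens_m, tokens_o)
```

Here the fusion weights are shown simply as learnable parameters for clarity; in the full framework they are optimized by PSO rather than by gradient descent alone.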
Optimization using PSO
The framework is further combined with PSO to optimize the fusion weights and key hyperparameters. PSO works as follows: it maintains a swarm of candidate solutions (particles). Each particle represents a potential set of fusion weights or hyperparameters and is evaluated by a fitness value given by the composite loss function. Particles move according to their own best-known solution and the global best solution. The fusion strategy and parameter settings are thus dynamically refined so that the model exhibits superior performance while coordinating the contributions of both data sources.
Figure 6 depicts how the proposed Multi-Task Network (MTN) optimizes the feature fusion process with the Particle Swarm Optimization (PSO) algorithm. The dynamic optimization of the fusion weights generated from macula-centered and optic-disc-centered views is illustrated. The process begins with an initialization phase in which particles describing potential solutions are scattered across the solution space. Over the iterations, these particles converge towards optimal positions, guided by their individual experience and the global best value. This balances the contributions of the CNN local spatial features and the ViT global contextual insights.
The figure also shows the integration of a fitness function that combines classification accuracy and localization precision. During the optimization process, the weights are iteratively adjusted to fine-tune the multi-view fusion mechanism. This makes the proposed model adaptive and improves its diagnostic accuracy in detecting diabetic retinopathy (DR).
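The sketch below illustrates the PSO loop for the two fusion weights under the constraint \(w_m + w_o = 1\); the fitness function is a toy placeholder standing in for the composite validation loss used in the actual framework.

```python
import numpy as np

def fitness(weights):
    """Hypothetical stand-in: in the real pipeline this would run a validation
    pass and return the composite loss for the given fusion weights."""
    w_m, w_o = weights
    return (w_m - 0.6) ** 2 + (w_o - 0.4) ** 2   # toy objective for illustration

def pso(num_particles=20, iters=30, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    pos = rng.random((num_particles, 2))
    pos /= pos.sum(axis=1, keepdims=True)                 # enforce w_m + w_o = 1
    vel = np.zeros_like(pos)
    p_best = pos.copy()
    p_best_fit = np.array([fitness(p) for p in pos])
    g_best = p_best[p_best_fit.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (p_best - pos) + c2 * r2 * (g_best - pos)
        pos = np.clip(pos + vel, 1e-3, None)
        pos /= pos.sum(axis=1, keepdims=True)             # keep weights normalized
        fit = np.array([fitness(p) for p in pos])
        improved = fit < p_best_fit
        p_best[improved], p_best_fit[improved] = pos[improved], fit[improved]
        g_best = p_best[p_best_fit.argmin()].copy()
    return g_best

print(pso())   # approximately [0.6, 0.4] for the toy objective
```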
Multitask learning
The fused feature representation serves as input to a multi-task learning framework designed to simultaneously handle classification and lesion localization tasks.
a. After passing the fused features through a fully connected layer and applying a softmax function to the logits, the generated class probabilities are used for classification. This allows the model to classify the input image into a diabetic retinopathy stage: mild, moderate, severe, or proliferative.

b. In addition, bounding box coordinates for retinal lesions are predicted from the fused features. These allow locating the position and extent of critical abnormalities such as microaneurysms, haemorrhages and exudates.
As shown in Fig. 7, the MTN simultaneously performs DR grading and lesion localization in a multitask learning workflow. The PSO-fused unified representation is processed by a classification head and a localization head.
The classification and localization modules predict the severity stage of DR and identify bounding box coordinates for lesions, respectively. Figure 6 illustrates how effectively the model is trained through the interaction between the regularized loss functions for classification and localization. For robust stage prediction, the classification head uses cross-entropy loss, while the localization head uses Intersection over Union (IoU) loss for precise lesion detection. Regularization terms are also added to prevent overfitting and keep the model stable. Figure 6 captures the contribution of these components, showing a framework able to perform complex, clinically realistic tasks in an integrated fashion, a necessary quality for application to real-world cases.
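A minimal PyTorch sketch of the two task heads operating on the fused representation is given below; the hidden sizes and the assumption of a single bounding box per image are illustrative simplifications.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Classification head (5 DR grades) and localization head (one bounding box,
    as normalized [x1, y1, x2, y2]) sharing the fused feature vector."""
    def __init__(self, dim=256, num_classes=5):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                        nn.Linear(128, num_classes))
        self.localizer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                       nn.Linear(128, 4), nn.Sigmoid())

    def forward(self, fused):                      # fused: (B, dim)
        logits = self.classifier(fused)            # softmax applied inside the loss
        boxes = self.localizer(fused)              # normalized box coordinates
        return logits, boxes

# logits, boxes = MultiTaskHeads()(fused)
```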
Loss function and propagation
The composite loss function includes classification, localization and regularization losses that guide the training process. Cross-entropy quantifies the classification loss, while Intersection over Union (IoU) measures how well the predicted bounding box matches the ground truth. Regularization terms impose a cost on large weights, thereby suppressing overfitting. Backpropagation computes gradients of the loss function with respect to the model parameters, which are updated iteratively to minimize the overall loss.
Multi-task learning strategy and loss formulation
The proposed MVCAViT framework is designed to simultaneously perform diabetic retinopathy (DR) classification and lesion localization, thereby offering a clinically relevant multitask diagnostic system. To this end, the framework adopts a multi-task learning (MTL) strategy with distinct yet synergistically trained branches for classification and localization.
Individual task loss functions
a. Classification Loss – Cross-Entropy. The classification branch predicts the DR severity stage from five categories (No DR, Mild, Moderate, Severe, Proliferative). The standard categorical cross-entropy loss is used:
where C is the number of classes, \(y_j\) is the ground truth label, and \(p_j\) is the predicted probability for class j.
b. Localization Loss – Intersection over Union (IoU). The localization branch outputs bounding box coordinates for lesions such as microaneurysms, haemorrhages, and exudates. The IoU loss measures overlap between predicted and ground truth bounding boxes:
where \(B_p\) and \(B_{gt}\) represent the predicted and true bounding boxes.
c. Regularization Loss – L2 Norm. To reduce overfitting and improve generalizability, an L2 regularization term is added:
where \(\theta\) denotes the trainable parameters and \(\lambda\) is the regularization coefficient.
d. Total Loss Function and Optimization Scheme. The overall training objective is a weighted sum of the individual losses:
where α = 1.0, β = 0.7, and γ = 0.01, determined empirically to balance classification and localization effectiveness while maintaining model regularity.
The network is trained end-to-end using the Adam optimizer with a learning rate of 1 × 10−4, batch size of 16, and 50 training epochs. In parallel, the fusion weights used to combine the macula and optic disc features are optimized using Particle Swarm Optimization (PSO) based on the composite loss function as the fitness criterion. Dedicated task heads are used after the shared backbone, allowing task-specific feature transformations. Cross-attention-based fusion aligns semantically relevant features between views before multitask branching, enhancing consistency between classification and localization cues. The contribution of each task to the total loss is dynamically monitored during training. If necessary, the weights α and β are adjusted based on the rate of convergence to maintain task balance. Empirical evaluations show that joint training improves lesion localization (IoU score increased by 2.3%) and classification (F1-score improved by 1.6%) compared to task-specific models. This confirms that shared learning between the tasks enhances model robustness, leading to more reliable and interpretable diagnostic outcomes.
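The following PyTorch sketch mirrors the composite objective and optimizer settings described above (cross-entropy, IoU-based box loss, L2 regularization, and Adam with a learning rate of 1 × 10−4); the box format and helper names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

ALPHA, BETA, GAMMA = 1.0, 0.7, 0.01      # task weights reported above

def iou_loss(pred_boxes, true_boxes):
    """1 - IoU for axis-aligned boxes given as [x1, y1, x2, y2] (illustrative)."""
    x1 = torch.max(pred_boxes[:, 0], true_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], true_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], true_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], true_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_t = (true_boxes[:, 2] - true_boxes[:, 0]) * (true_boxes[:, 3] - true_boxes[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-6)
    return (1.0 - iou).mean()

def composite_loss(model, logits, labels, pred_boxes, true_boxes):
    l_cls = nn.functional.cross_entropy(logits, labels)            # classification
    l_loc = iou_loss(pred_boxes, true_boxes)                       # localization
    l_reg = sum(p.pow(2).sum() for p in model.parameters())        # L2 norm of weights
    return ALPHA * l_cls + BETA * l_loc + GAMMA * l_reg

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # batch size 16, 50 epochs
```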
Once trained, the model produces two outputs for each input image pair: the predicted diabetic retinopathy class probabilities and the bounding box coordinates highlighting detected lesions. These outputs serve as both a diagnosis and actionable insight for clinicians, making the system useful for practical deployment. The framework is also scalable and efficient: parallel processing of large numbers of multi-view images keeps the system feasible at scale, and the ViT's self-attention mechanism is exploited in a way that achieves high performance without excessive computational overhead. The pseudocode of the entire process is formulated below.
The sub-routine for the self-attention module utilized in the overall proposed work is provided below.
Results and discussion
In pursuit of improving the diagnosis of diabetic retinopathy (DR), we propose the Multi-Task Network (MTN), based on a unique fusion of Convolutional Neural Networks (CNN), Vision Transformers (ViT) and Particle Swarm Optimization (PSO). The framework integrates macula-centered and optic-disc-centered views to provide improved precision and robustness in DR grading as well as lesion localization. The value of this work lies in its ability to simultaneously process multi-view inputs, optimize feature fusion, and achieve superior diagnostic performance with good computational efficiency. The incorporation of advanced optimization and attention mechanisms suggests its suitability for real-world clinical deployment.
The experimentation is performed on DRTiD3, the Diabetic Retinopathy Two-field Image Dataset, consisting of 3,100 two-field fundus image pairs with paired macula- and optic-disc-centered views. The dataset is annotated with DR severity grades (No DR, Mild, Moderate, Severe, Proliferative DR) and lesion-specific localization information for each image. Because the images were captured under varying conditions, the dataset provides diversity and supports a stronger evaluation. The Diabetic Retinopathy Two-field Image Dataset (DRTiD) consists of a total of 3,100 retinal image pairs, amounting to 6,200 fundus images. Each pair includes a macula-centered and an optic disc-centered image captured from the same patient's eye. These image pairs are annotated with both diabetic retinopathy (DR) severity grades and bounding box coordinates for key lesions, including microaneurysms, haemorrhages, and exudates. The distribution across severity grades is as follows: No DR–900 image pairs, Mild DR–550 pairs, Moderate DR–750 pairs, Severe DR–500 pairs, and Proliferative DR–400 pairs. To ensure robust model evaluation and prevent data leakage, we strictly followed a patient-wise splitting protocol, such that no patient's images appear in both training and testing sets.
A standardized preprocessing pipeline was applied to all images before input to the model. This included resizing each image to 224 × 224 pixels, pixel intensity normalization to the [0,1] range, and histogram equalization to enhance retinal feature contrast. Noise reduction was performed using Gaussian blurring, and data augmentation techniques such as image rotation (90°, 180°, 270°), horizontal flipping, and brightness scaling were used to increase training diversity and reduce overfitting. These steps ensured that the model was exposed to a wide range of imaging conditions, thereby enhancing its generalization capability. All experiments were conducted using these preprocessed and augmented datasets, with each image pair being processed as a dual-input stream through the feature extraction and fusion components of the MVCAViT framework.
In the preprocessing pipeline, we normalized pixel intensities to a standard range, resized images to 224 × 224 for compatibility with deep learning models, and used rotations and flips to simulate real-world variability. The proposed MTN was trained and evaluated on this dataset with an 80-20 split for training and testing, respectively. Model generalization on unseen data was ensured by fivefold cross-validation.
The experiments were implemented in Python using deep learning frameworks such as TensorFlow and PyTorch, and the model was trained on an NVIDIA Tesla V100 GPU with 16 GB memory. A learning rate of 1 × 10−4, a batch size of 16, and 50 epochs were used as hyperparameters. Parameter updates were performed with the Adam optimizer, and regularization was applied to mitigate overfitting.
For evaluation, we report metrics including accuracy, precision, recall, F1-score and Area Under the Curve (AUC) for classification, and Intersection over Union (IoU) for localization. Together these metrics comprehensively evaluate the effectiveness of the MTN for the DR grading and lesion localization tasks.
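A minimal sketch of how these metrics can be computed with scikit-learn and NumPy is shown below; the label and probability arrays are toy placeholders, not experimental outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def classification_metrics(y_true, y_pred, y_prob):
    """Accuracy, macro precision/recall/F1, and one-vs-rest AUC for DR grades 0-4."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "auc": roc_auc_score(y_true, y_prob, multi_class="ovr"),
    }

def iou(box_a, box_b):
    """IoU between two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

# Toy example with all five grades present:
# print(classification_metrics([0, 1, 2, 3, 4], [0, 1, 2, 3, 3], np.eye(5)[[0, 1, 2, 3, 4]]))
```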
The input retinal images, after undergoing a series of pre-processing steps important for standardization, are depicted in Fig. 8. To reduce noise, a Gaussian blur filter was applied, while histogram equalization was used to enhance contrast and better resolve fine retinal features such as microaneurysms and haemorrhages. To allow compatibility with deep learning frameworks, the images were resized to 224 × 224. To enhance the robustness of the trained model, we added augmentations such as rotation (90°, 180°, 270°), horizontal flipping, and brightness adjustments to simulate different imaging conditions. These preprocessing steps ensure that downstream feature extraction is not disturbed by variability in image quality or acquisition conditions.
Figures 9 and 10 present CNN deep feature maps showing the hierarchical features extracted from the optic-disc- and macula-centered images, respectively. To extract local spatial patterns such as retinal vessel bifurcations and lesions, 3 × 3 convolutional filters are used in the first layers. Subsequent layers use ReLU (Rectified Linear Unit) activations to introduce the nonlinearity needed to detect complex patterns such as exudates. Max-pooling layers are also employed to reduce feature dimensionality while preserving only the most salient features. For images centered on the macula, macular edema and similar abnormalities are emphasized, and for images centered on the optic disc, features of the optic nerve head and peripapillary atrophy are prominent. These feature maps confirm the aptitude of CNNs for encapsulating the localized anatomical information important for DR staging. They provide a visual explanation of the patterns and features detected by the filters of each convolutional layer: feature maps describe how the network processes the input image and captures specific characteristics such as edges, textures, and shapes. Feature maps in the first few layers of a CNN capture low-level features, including basic patterns such as horizontal and vertical edges, corners and gradients, because the filters in these layers respond to small features common across all images. In retinal images, such maps might indicate differences between blood vessels and the surrounding tissue, or subtle edges surrounding a lesion. The activations in these feature maps are obtained by convolution with a filter followed by a non-linear activation (e.g. ReLU).
As we progress deeper into the network, the feature maps represent increasingly abstract, higher-level patterns, including textures, regions of interest and specific structures such as exudates or microaneurysms in retinal images. Building on the low-level features identified earlier, the deeper layers form a hierarchical understanding of the image, concentrating on complex shapes and significant patterns that are critical for tasks such as classification or localization. The number of feature maps at each layer equals the number of filters applied in that layer: if a layer contains 16 filters, 16 feature maps are created, one per filter, as shown in Figs. 8 and 9. If a filter is not strongly activated by the image, its map may appear empty; this is normal and usually means the filter is tuned to patterns that are not present in the current image.
CNNs build their understanding of the data through these internal feature maps. When we visualize them, we can see which areas of the image the network attends to and how it builds its interpretation of the input. For example, the feature maps may indicate the presence of lesions or the boundary of the optic disc, revealing how the network detects abnormalities. This interpretability is valuable for refining the model and ensuring that what is being captured is clinically relevant.
The attention maps produced by the Vision Transformer (ViT) are shown for the macula- and optic-disc-centered views in Figs. 11 and 12, respectively. The pattern of interactions learnt by the ViT's self-attention mechanism across the retinal field captures long-range dependencies and detects subtle patterns indicative of early-stage DR.
For example, the attention maps concentrate on regions containing microaneurysms, while maps around the optic nerve highlight structural deviations. Multi-head attention allows the model to represent both localized lesions and diffuse retinal abnormalities at once. This global extraction capability complements the localized patterns captured by the CNN.
The trajectories of the fusion weights (\(w_m, w_o\)) over successive PSO iterations show a dynamic convergence path towards the optimal values. The particles are initially scattered randomly across the solution space, which corresponds to the highest degree of exploration. In early iterations, the particle trajectories vary significantly as better solutions are sought. As the optimization progresses, the trajectories gradually stabilize, indicating that the particles are approaching the optimal fusion weights. Many particles settle near optimal values well before the 20th iteration, demonstrating that PSO finds a good balance between exploration and exploitation. The final weights form an optimized fusion strategy for robust image representation, arising from a synergy between local (CNN) and global (ViT) features. This plot shows how well PSO converges and identifies optimal weights within a well-defined parameter space.
The convergence trajectory of the Particle Swarm Optimization (PSO) algorithm when fine-tuning the fusion weights that integrate the multi-view features is shown in Fig. 13. In the initial iterations, there is large variability caused by the exploration phase, in which particles search the solution space. By around the 20th iteration, the particles start aligning towards the optimal fusion weights. The interplay between the CNN local features and the ViT global insights is thereby optimized so that the unified representation best captures all relevant patterns. The PSO optimization process is directed by a fitness function combining classification accuracy and localization precision.
Figure 14 shows the combined output of the CNN and ViT features, which together make up the unified feature map. Using the PSO-optimized weights, the weighted-sum fusion strategy brings the local and global features together. This fusion improves the model's ability to detect both small, localized lesions and large, diffuse pathology patterns. The unified map shows how features extracted via different paradigms are combined in the proposed MTN to provide a holistic representation of retinal pathology.
Figure 15 shows the confusion matrix of the proposed MTN for classifying different DR severity stages. Robustness of the feature extraction and fusion mechanisms is reflected in high true positive rates across all classes. Integration between the ability of CNN to capture the fine scale features and that of ViT to analyze broader context information led to improved performance especially in differentiating mild from moderate DR. The balanced approach prevents the misclassifications that are often seen for overlapping clinical features.
The overlay of heatmaps in Fig. 16 on the original retinal images highlights the regions discovered as lesions by the proposed MTN framework. These overlays show detailed localization of clinically relevant abnormalities such as microaneurysms, haemorrhages, and cotton wool spots. In the MTN, we make use of cross attention mechanisms to allow the model to focus on regions of higher diagnostic relevance, thereby decreasing false positives. The model’s ability to exploit the complementary strengths of macula- and optic disc centered views is shown by high-resolution heatmaps.
Figure 17 shows the IoU scores between predicted and ground truth lesion bounding boxes, which serve as a quantitative measure of localization overlap. The precision of the proposed framework in lesion localization is demonstrated by scores above 85% for all classes. The multitask loss function balances the classification and localization objectives and drives this improved IoU performance. Furthermore, cross-view attention allows the model to accurately pinpoint lesions even in difficult cases, i.e. those with low contrast or ambiguous boundaries.
Figure 18 shows the convergence of the classification, localization and regularization losses over the training epochs. The training process is stable, showing a smooth and rapid decrease in all loss components. The regularizer in the loss function prevents overfitting, and the balanced weights on the classification and localization losses allow both tasks to be optimized simultaneously. This convergence pattern demonstrates the validity of multitask learning.
Figure 19 shows the progression of the fitness value over PSO iterations, indicating the algorithm's ability to tune the fusion weights and hyperparameters. Classification accuracy, localization precision and efficiency are merged into a single fitness function, which directs the particles towards the best solution. This iterative refinement improves the performance of the MTN without a significant increase in computational overhead.
Figure 20 shows the Pareto front of classification accuracy traded off against localization precision, which demonstrates that the MTN can optimize multiple objectives at once. The model's proximity to the Pareto-optimal curve implies that it achieves balanced performance on both tasks. This balance is of prime importance for real-world clinical applications in which accurate grading and precise localization are both key.
The contributions of macula centered and optic disc centered features to final predictions are compared in Fig. 21. Our results indicate that identifying central vision threatening lesions primarily relies on macula centered features, whereas optic disc centered features are critical for the detection of structural abnormalities. Both collectively underscore the capacity for multi view analysis in retinal disease diagnosis.
Figure 22 shows the relationship between training time and model parameters, demonstrating that the MTN framework is computationally efficient. Although it is built with sophisticated components such as the ViT and cross-attention, it maintains scalability through lightweight architectures. This efficiency makes the MTN well suited for deployment in real-world clinical settings where computational resources may be limited.
The proposed model has high precision and fewer false positives, as shown in Fig. 23, thanks to the cross-attention mechanism and PSO-optimized fusion. These outputs validate the ability of the framework to yield clinically relevant and interpretable results.
Table 1 provides the ablation study analysis.
To validate the contribution of the proposed cross-attention-based feature alignment and PSO-guided fusion, we conducted an ablation study comparing four architectural variants: (M1) single-view model using macula-centered images only, (M2) dual-view input with naïve feature concatenation, (M3) dual-view model incorporating cross-attention fusion without PSO optimization, and (M4) the complete proposed MVCAViT model with PSO-optimized fusion.
The baseline model (M1) achieved a classification accuracy of 93.4% and an IoU score of 82.1%, reflecting the performance limitations of relying solely on single-view input, which lacks complementary spatial and contextual cues from the optic disc view. Incorporating both views through simple concatenation (M2) improved accuracy to 95.7% and IoU to 84.5%, showing that multi-view information is beneficial but insufficient when not properly aligned or fused. The introduction of a cross-attention mechanism (M3) enabled soft alignment between macula and optic disc representations in feature space, increasing performance to 97.2% accuracy and 86.4% IoU. This demonstrates that relational attention-based feature integration captures inter-view dependencies effectively, even without explicit spatial registration.
The final architecture (M4), which includes both cross-attention and PSO-optimized fusion weights, achieved the highest classification accuracy of 98.9% and localization IoU of 88.7%. This underscores the importance of dynamic weight tuning for combining multi-view features based on task-specific relevance. The results collectively validate the effectiveness of our proposed alignment and fusion strategy, demonstrating that both components—cross-attention for semantic alignment and PSO for adaptive fusion—are essential for maximizing diagnostic accuracy and lesion localization performance in multi-view diabetic retinopathy analysis.
Table 2 compares the DR grading performance of the proposed model with existing work.
The visual representation for analysis of accuracy amongst existing methods is depicted in Fig. 24.
A comparative analysis against recent existing methods on different datasets is tabulated in Table 2, with a special emphasis on how metrics such as Accuracy, Precision, Recall and F1-Score vary. The proposed MTN (CNN + ViT + PSO Fusion) is clearly superior in performance terms, with near-optimal results: accuracy, precision, recall and F1-score of around 98.9%, 98.8%, 98.7% and 98.8%, respectively, on the DRTiD dataset. The integration of Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) and Particle Swarm Optimization (PSO) in the MTN model accounts for this performance. CNNs are renowned for their strong ability to extract initial features from complex image datasets, particularly their capacity to process spatial hierarchies in images. Vision Transformers, in contrast, use attention mechanisms to attend to different parts of an image, allowing the model to discriminate fine details critical for accurate classification. PSO helps optimize the combined network and fine-tune its parameters, converging on the best solution faster than traditional methods.
The superior performance of the MTN model over other models, e.g. “Sobel Segmentation + Linear SVC” or “Attention U-Net”, is evident. For example, “Sobel Segmentation + Linear SVC” was the lowest-scoring model in the table, reaching only 44.34% in all metrics, highlighting the limited power of linear models to deal with complex image features compared with more elaborate techniques. Even against strongly performing models such as “DenseNetEnsemble 121”, which achieved perfect scores of 100, the MTN results are noteworthy not only for their values but also for the model's adaptability and robustness across various datasets, which allows it to be applied broadly. Additionally, the combination employed in the MTN helps overcome the shortcomings of each component acting alone: CNNs struggle with global context, the addition of the ViT widens the viewpoint, and PSO prevents the model from overfitting or getting stuck in local minima, both of which are key for sustained high performance across very different data types, reflected in near-perfect F1-scores. This comprehensive approach not only improves performance but also improves generalization, a desirable property for real-world applications where data are variable.
To ensure that the superior performance of the proposed MVCAViT framework is not an artifact of a particular data split, we conducted a five-fold cross-validation on the DRTiD dataset. For each fold, the dataset was randomly divided into 80% training and 20% testing sets, ensuring stratified distribution of DR severity classes. We computed the key evaluation metrics for each fold and report their mean along with 95% confidence intervals (CI) to demonstrate statistical consistency.
The confidence intervals were calculated using a t-distribution, which is appropriate for the small sample size (n = 5). The results are summarized in Table 3:
As observed in Table 3, these narrow confidence intervals indicate low variance and strong stability across different cross-validation folds. Notably, both the classification and localization tasks demonstrate consistently high performance, highlighting the generalization capability of the model even under variability in data partitioning. While this study does not include direct p-value comparisons against prior works due to lack of access to their fold-wise predictions or datasets, we acknowledge the value of such statistical testing. In future work, we intend to conduct paired statistical significance tests (e.g., Wilcoxon signed-rank test or McNemar’s test) by implementing and benchmarking against baseline CNN-only and single-task models trained under identical conditions. This will further validate whether the observed performance improvements are statistically significant.
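For reproducibility, the snippet below illustrates the aggregation described above: the mean of a per-fold metric and its 95% confidence interval under a t-distribution with n − 1 = 4 degrees of freedom. The fold values shown are hypothetical placeholders, not the actual fold results.

```
# Mean and 95% CI of a per-fold metric using a t-distribution (illustrative values).
import numpy as np
from scipy import stats

fold_accuracy = np.array([0.987, 0.990, 0.988, 0.991, 0.989])  # hypothetical folds
mean = fold_accuracy.mean()
sem = stats.sem(fold_accuracy)                  # standard error of the mean
# 95% CI with n - 1 = 4 degrees of freedom.
low, high = stats.t.interval(0.95, len(fold_accuracy) - 1, loc=mean, scale=sem)
print(f"accuracy: {mean:.4f} (95% CI {low:.4f}-{high:.4f})")
```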
To assess the practical feasibility of the MVCAViT framework, we measured its inference speed, computational complexity, and memory usage on a system equipped with an NVIDIA RTX 2060 GPU (6GB VRAM) and 16 GB RAM, representing a moderate hardware configuration commonly available in clinical screening setups. On this setup, the average inference time per dual-view image pair was approximately 115 milliseconds, allowing the model to process nearly 8–9 patient cases per second in a continuous inference scenario. The total parameter count of the model is approximately 34.2 million, and the computational complexity is estimated at 22.6 GFLOPs per dual-view pass. During inference, the peak memory usage was approximately 3.4 GB, which remains manageable under standard workstation settings. While the model does incorporate both CNN and Vision Transformer components, we have used lightweight convolutional encoders and restricted transformer depth to maintain a balance between diagnostic accuracy and computational cost. Additionally, the attention maps and lesion localization overlays are generated in parallel to classification, avoiding sequential bottlenecks.
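A rough benchmarking recipe along these lines is sketched below; the stand-in model, input resolution, and iteration counts are placeholders, so absolute numbers will differ from the figures reported above and across hardware.

```
# Rough latency, parameter-count, and memory sketch (placeholder model).
import time
import torch
import torch.nn as nn

# Stand-in dual-view model; replace with the trained network for real numbers.
model = nn.Sequential(nn.Conv2d(6, 32, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(32, 5)).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f} M")

# Dual-view pair stacked along the channel axis purely for this sketch.
x = torch.randn(1, 6, 512, 512, device=device)
with torch.no_grad():
    for _ in range(5):                          # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(50):                         # timed iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"mean latency: {(time.perf_counter() - t0) / 50 * 1000:.1f} ms")
if device == "cuda":
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```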
These results suggest that the proposed framework is reasonably suited for clinical tele-ophthalmology environments, especially where desktop-level hardware is available. For more constrained platforms, such as mobile units or embedded systems, future enhancements including model pruning, quantization, and transformer distillation are planned to further reduce latency and memory footprint without compromising diagnostic reliability.
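As one example of the planned optimizations, post-training dynamic quantization of linear layers can shrink the memory footprint for CPU inference. The toy model below only demonstrates the mechanics and is not the MVCAViT network.

```
# Post-training dynamic quantization of linear layers (toy stand-in model).
import torch
import torch.nn as nn

fp32_model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 5)).eval()
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(int8_model(x).shape)   # same interface as the fp32 model, smaller weight storage
```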
Conclusion
By combining CNNs, a ViT, and PSO-based feature fusion in a multitask network (MTN), the proposed model demonstrates strong efficacy in diabetic retinopathy (DR) diagnosis, achieving a classification accuracy of 98.9% with high precision and recall. The integration of macula- and optic-disc-centered views enables a more complete analysis of retinal pathology, supporting precise DR grading and lesion localization. The framework balances local feature extraction via CNNs with global contextual analysis via the ViT to address the inherent challenges of multi-view learning, while PSO-based feature fusion optimizes the contribution of each view and increases the robustness of the final prediction. Validation across multiple metrics, including precision, recall, and F1-score, suggests the model’s potential for clinical implementation. The multi-view and multi-task integration shows promise for enhancing diabetic retinopathy screening; however, clinical validation and multi-center evaluations are required before deployment readiness can be confirmed. This study therefore lays a foundational step toward more interpretable and scalable AI-based DR diagnostics.
In spite of promising results, the proposed framework has several limitations. First, the model is trained and validated solely on the DRTiD dataset; generalizability to other datasets or imaging devices has not yet been tested. Second, the model’s reliance on precise bounding box annotations limits its applicability in scenarios where such annotations are unavailable or inconsistent. Third, although inference is feasible on workstation-level hardware, deployment in mobile or embedded clinical devices requires further optimization through pruning or quantization. Additionally, the PSO optimization process, while effective, may increase training time and requires careful hyperparameter tuning. Lastly, this study does not include statistical hypothesis testing (e.g., McNemar’s test) to validate the significance of performance improvements over baseline models. Future work will focus on addressing these limitations by extending the framework to other ocular diseases, incorporating semi-supervised learning to reduce annotation dependency, and conducting prospective clinical evaluations.
Data availability
The DRTiD dataset used in this study is publicly available and can be accessed from its official repository for research purposes. All relevant data and materials are included in the dataset as published by its authors. https://www.kaggle.com/datasets/arfaahmed/drtid-dataset.
Abbreviations
CNN: Convolutional neural network
ViT: Vision transformer
PSO: Particle swarm optimization
DR: Diabetic retinopathy
MTN: Multi-task network
SVM: Support vector machine
LBP: Local binary patterns
NPDR: Non-proliferative diabetic retinopathy
PDR: Proliferative diabetic retinopathy
IoU: Intersection over union
AUC: Area under the curve
LSTM: Long short-term memory
GA: Genetic algorithm
ReLU: Rectified linear unit
DRTiD: Diabetic retinopathy two-field image dataset
APTOS: Asia Pacific tele-ophthalmology society
References
Patil, A. D., Biousse, V. & Newman, N. J. Artificial intelligence in ophthalmology: An insight into neurodegenerative disease. Curr. Opin. Ophthalmol. 33 (5), 432–439. https://doi.org/10.1097/ICU.0000000000000877 (2022).
Singh, L. K., Pooja & Garg, H. Detection of glaucoma in retinal fundus images using fast fuzzy C means clustering approach. In International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) 397–403 (Greater Noida, India, 2019). https://doi.org/10.1109/ICCCIS48478.2019.8974539
Hou, J. et al. Cross-field transformer for diabetic retinopathy grading on two-field fundus images. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE Computer Society 985–990. (2022).
Singh, L. K., Pooja & Garg, H. Detection of glaucoma in retinal fundus images based on multi-objective approach. Int. J. Appl. Evolut. Comput., 397–403, (2020). https://doi.org/10.1109/ICCCIS48478.2019.8974539
Patel, R. K., & Kashyap, M. Automated screening of glaucoma stages from retinal fundus images using BPS and LBP based GLCM features. Int. J. Imaging Syst. Technol. 33 (1), 246–261. https://doi.org/10.1002/ima.22797 (2022).
Mazlan, N., Yazid, H., Arof, H. & Mohd, H. Automated microaneurysms detection and classification using multilevel thresholding and multilayer perceptron. J. Med. Biol. Eng. https://doi.org/10.1007/s40846-020-00509-8 (2020).
Chudzik, P., & Majumdar, S. Micro-aneurysm detection using fully convolutional neural networks. Comput. Methods Programs Biomed. 58, 185–192. (2018).
Qiao, L., Zhu, Y. & Zhou, H. Diabetic retinopathy detection using prognosis of micro-aneurysm and early diagnosis system for non-proliferative diabetic retinopathy based on deep learning algorithms. IEEE Access. 8, 104292–104302 (2020).
Narhari, B. B., Murlidhar, B. K., Sayyad, A. D. & Sable, G. S. Automated diagnosis of diabetic retinopathy enabled by optimized thresholding-based blood vessel segmentation and hybrid classifier. Bio-Algorithms Med-Syst. 17 (1), 9–23 (2021).
Abdelmaksoud, E., El-Sappagh, S., Barakat, S., Abuhmed, T. & Elmogy, M. Automatic diabetic retinopathy grading system based on detecting multiple retinal lesions. IEEE Access, 6, 15939–15960, (2021). https://doi.org/10.1109/ACCESS.2021.3052870
Mohan, N. J. et al. Optimal hybrid feature selection technique for diabetic retinopathy grading using fundus images. Sadhana 48, 1–15. https://doi.org/10.1007/s12046-023-02175-3 (2023).
Khan, S. M. et al. A global review of publicly available datasets for ophthalmological imaging: Barriers to access, usability, and generalisability. Lancet Digit. Health. 3 (1), e51–e66. https://doi.org/10.1016/S2589-7500(20)30240-5 (2021).
Hardas, M., Mathur, S., Bhaskar, A. & Kalla, A. Retinal fundus image classification for diabetic retinopathy using SVM predictions. Phys. Eng. Sci. Med. 45 (3), 781–791. https://doi.org/10.1007/s13246-022-01143-1 (2022).
Li, F., Tang, S., Chen, Y. & Zou, H. Deep attentive convolutional neural network for automatic grading of imbalanced diabetic retinopathy in retinal fundus images. Biomed. Opt. Express. 13 (11), 5813–5835. https://doi.org/10.1364/BOE.472176 (2022).
Singh, L. K. & Garg, H. Automated glaucoma type identification using machine learning or deep learning techniques. In Advancement of Machine Intelligence in Interactive Medical Image Analysis. Algorithms for Intelligent Systems (eds Verma, O. et al.) (Springer, 2020). https://doi.org/10.1007/978-981-15-1100-4_12.
Gargeya, R. & Leng, T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 124 (7), 962–969. https://doi.org/10.1016/j.ophtha.2017.02.008 (2017).
Deepa, V., Kumar, C. S. & Cherian, T. Ensemble of multi-stage deep convolutional neural networks for automated grading of diabetic retinopathy using image patches. J. King Saud Univ. 34 (8), 6255–6265. https://doi.org/10.1016/j.jksuci.2021.05.009 (2022).
Ashwini, K. & Dash, R. Grading diabetic retinopathy using multiresolution CNN. Biomed. Signal. Process. 86 (Part B). https://doi.org/10.1016/j.bspc.2023.105210 (2023).
Ali, G., Dastgir, A., Iqbal, M. W., Anwar, M. & Faheem, M. A hybrid convolutional neural network model for automatic diabetic retinopathy classification from fundus images. IEEE J. Transl. Eng. Health Med. 11, 341–350 (2023). https://doi.org/10.1109/JTEHM.2023.3282104
D’Alberto, P. et al. xDNN: inference for deep convolutional neural networks. ACM Trans. Reconfigurable Technol. Syst. 15 (2), 1–29. https://doi.org/10.1145/3473334 (2022).
Laily, F. T. A. S. Feature extraction and classification of retinal images using Sobel segmentation and linear SVC. Int. J. Artif. Intell. Med. Issues. 2 (2), 136–148. https://doi.org/10.56705/ijaimi.v2i2.153 (2024).
Fousiya, T. T. & Munir, A. M. K. V. Diabetic retinopathy classification based on segmented retinal vasculature of fundus images using attention U-NET. In 2022 IEEE 19th India Council International Conference (INDICON) 1–6 (Kochi, India, 2022). https://doi.org/10.1109/INDICON56171.2022.10039734
Li, F., Tang, S., Chen, Y. & Zou, H. Deep attentive convolutional neural network for automatic grading of imbalanced diabetic retinopathy in retinal fundus images. Biomed. Opt. Express. 13 (11), 5813–5835. https://doi.org/10.1364/BOE.472176 (2022).
Costaner, L., Lisnawita, L., Guntoro, G. & Abdullah, A. Feature extraction analysis for diabetic retinopathy detection using machine learning techniques. Sistemasi Sistem Informasi. https://doi.org/10.32520/stmsi.v13i5.4600 (2024).
Gayathri, S., Gopi, G. P. & Palanisamy, P. Diabetic retinopathy classification based on multipath CNN and machine learning classifiers. Phys. Eng. Scie Med. 44 (3), 639–653. https://doi.org/10.1007/s13246-021-01012-3 (2021).
Chilukoti, S. et al. A reliable diabetic retinopathy grading via transfer learning and ensemble learning with quadratic weighted kappa metric. BMC Med. Inf. Decis. Mak. 24 (37). https://doi.org/10.1186/s12911-024-02446-x (2024).
Yadav, Y., Chand, S., Sahoo, R. C. & Sahoo, B. M. Comparative analysis of detection and classification of diabetic retinopathy by using transfer learning of CNN based models. J. Intell. Fuzzy Syst. 43 (6), 1–15. https://doi.org/10.3233/JIFS-212771 (2022).
Ali, A. A. & Dawood, F. A. A. Deep learning of diabetic retinopathy classification in fundus images. J. Eng. 29 (12), 139–152. https://doi.org/10.31026/j.eng.2023.12.09 (2023).
Jacoba, C. M. P. et al. Performance of automated machine learning for diabetic retinopathy image classification from multi-field handheld retinal images. Ophthalmol. Retina. 7 (8), 703–712. https://doi.org/10.1016/j.oret.2023.03.003 (2023).
Talukder, M. A., Sarkar, A., Akter, S. & Nuhi-Alamin, M. An improved model for diabetic retinopathy detection by using transfer learning and ensemble learning. Electr. Eng. Syst. Sci. https://doi.org/10.48550/arXiv.2308.05178 (2023).
Zhang, R., Lin, Y., Wu, Y., Deng, L., Zhang, H., Liao, M. MvMRL: A multi-view molecular representation learning method for molecular property prediction. Briefings Bioinform. 25(4), bbae298. (2024) https://doi.org/10.1093/bib/bbae298.
Hu, F., Yang, H., Qiu, L., Wei, S., Hu, H. et al. Spatial structure and organization of the medical device industry urban network in China: Evidence from specialized, refined, distinctive, and innovative firms. Front. Public Health 13, 1518327 (2025). https://doi.org/10.3389/fpubh.2025.1518327
Li, Z. et al. ASMNet: Action and style-conditioned motion generative network for 3D human motion generation. Cyborg Bionic Syst. 5, Article 0090 (2024). https://doi.org/10.34133/cbsystems.0090
Li, Z. et al. MonoAux: Fully exploiting auxiliary information and uncertainty for monocular 3D object detection. Cyborg Bionic Syst. 5, Article 0097 (2024). https://doi.org/10.34133/cbsystems.0097
Wu, Z., Sun, W. & Wang, C. Clinical characteristics, treatment, and outcomes of pembrolizumab-induced uveitis. Investig. New Drugs. 42 (5), 510–517. https://doi.org/10.1007/s10637-024-01464-w (2024).
Liang, J. et al. The regulation of selenoproteins in diabetes: A new way to treat diabetes. Curr. Pharm. Design. 30 (20), 1541–1547. https://doi.org/10.2174/0113816128302667240422110226 (2024).
Luan, S. et al. Deep learning for fast super-resolution ultrasound microvessel imaging. Phys. Med. Biol. 68 (24), 245023. https://doi.org/10.1088/1361-6560/ad0a5a (2023).
Yu, X. et al. Deep learning for fast denoising filtering in ultrasound localization microscopy. Phys. Med. Biol. 68 (20), 205002. https://doi.org/10.1088/1361-6560/acf98f (2023).
Sun, J. et al. Real-time accurate detection and analysis of breath acetone using CRDS: Toward metabolic dynamic monitoring and potential application. Sens. Actuators B. 433, 137422. https://doi.org/10.1016/j.snb.2025.137422 (2025).
Hu, T. et al. LCN2 deficiency mitigates the neuroinflammatory damage following acute glaucoma. Theranostics 15 (7), 2967–2990. https://doi.org/10.7150/thno.104752 (2025).
Bilal, A. et al. Quantum chimp-enhanced SqueezeNet for precise diabetic retinopathy classification. Sci. Rep. 15, 12890. https://doi.org/10.1038/s41598-025-97686-w (2025).
Funding
The author did not receive support from any organization for the submitted work.
Author information
Contributions
S. Vijayalakshmi – research proposal, construction of the workflow and model, final drafting; J. Samuel Manoharan – survey of existing works, improvisation of the proposed model; B. Nivetha – initial drafting of the paper, collection of datasets and assessment of their suitability, formulation of pseudocode; A. Sathiya – survey of existing works, collection of datasets and assessment of their suitability.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Vijayalakshmi, S., Manoharan, J.S., Nivetha, B. et al. Multi-task deep learning framework combining CNN, vision transformers and PSO for accurate diabetic retinopathy diagnosis and lesion localization. Sci Rep 15, 35076 (2025). https://doi.org/10.1038/s41598-025-18742-z