Introduction

Cancer is a life-threatening disorder and remains a leading cause of mortality worldwide1. Renal Cell Carcinoma (RCC) is the most common kind of kidney cancer1. Over the last two decades, the global incidence of RCC has increased by roughly two percent annually. According to GLOBOCAN 2022, there were approximately 155,700 deaths from kidney cancer (predominantly RCC) worldwide in 2022, comprising 100,200 in men (~ 64%) and 55,500 in women (~ 36%)2. In addition, recent data from the studies of3 and4 indicate a rising global incidence of the disease, which demands an effective system for addressing future challenges. Staging and grading of renal tumors are central to kidney cancer diagnosis. Staging considers tumor size, location, and whether the cancer has spread to nearby lymph nodes. Grading considers how abnormal the cancer cells appear compared with normal cells under microscopic observation5. Grading systems for RCC differ, but one commonly used four-tier system is the Fuhrman system, as utilized in6. Within this system, Grade-1 nuclei resemble ordinary cells in terms of uniformity and tubular structure. Grade-2 nuclei exhibit irregular shapes, Grade-3 nuclei show readily recognizable abnormalities, and Grade-4 nuclei display markedly odd shapes, including pleomorphic cells and mitotic figures.

The grade and stage of tumor regions reveal how a tumor grows and its potential for metastasis. Physicians then use these stages and grades to recommend distinct treatment strategies. Manual grading of these complicated histology patterns is extremely time-consuming, prone to errors, and requires highly specialized expertise7. This creates a crucial need for an automated yet accurate renal cancer grading system based on histopathology images. Artificial intelligence (AI) has seen notable adoption in various fields, including vital sign monitoring8 and fire detection9. AI applications can enhance efficiency, objectivity, and uniformity in the evaluation of renal cancer, thereby supporting pathologists in their diagnostic procedures compared with traditional methods. The challenge, however, remains open given the wide range of neural network architectures and the absence of extensive evaluations of training strategies to identify the best models and methodologies for particular histopathology tasks. To this end, the paper introduces a fully automated RCC grading framework based on deep learning, called MobileDANet. The proposed MobileDANet architecture leverages the advantageous features of transfer learning and attention mechanisms.

Transfer learning (TL) models are pre-trained neural networks fine-tuned for new tasks using knowledge from previous tasks. This approach saves time, data, and computational resources, and it works well even for smaller datasets10. MobileNetV2 is a lightweight and efficient CNN architecture developed, in particular, for mobile and embedded vision applications11. However, its narrow structure and linear bottlenecks may limit the model's ability to learn complex features, which can reduce accuracy on image classification tasks. To capture global contextual relationships within image data, researchers increasingly use self-attention mechanisms in vision transformers (ViTs). ViTs can learn global representations, but at the cost of higher computational complexity and more parameters than transfer learning models. With these considerations, the proposed MobileDANet provides a good balance between classification performance and computational complexity. In contrast to existing frameworks, the highlights of our proposed MobileDANet architecture are given below:

  1.

    The framework extracts deep-learned features effectively using the MobileNetV2 model, which acts as a backbone of the proposed architecture. The selection ensures balanced computational efficiency with classification performance.

  2.

    A novel dynamic attention (DA) block is introduced over the MobileNetV2 features to enhance feature representations using multi-head attention mechanisms. This enables the model to capture both global and local dependencies within histopathological images, thereby refining critical features and improving the classification of distinct cancer grades.

  3.

    Unlike many models that are designed for a single type of cancer classification, the proposed framework extends to classify not only renal cell carcinoma but also colon and breast cancer histopathology images. This proves the proposed architecture’s versatility across different cancer types.

  4.

    The saliency maps using the GradCAM technique are attained for visual analysis of the architecture predictions using histopathological image inputs.

The remainder of the manuscript is organized as follows: the relevant works are discussed in detail in the "Related works" section, and an elaborate description of the methodology follows in the "Materials and methods" section. The "Experimentation and results" section describes the multi-cancer databases, the training carried out, the ablation experiments, and the obtained results. Finally, a concise summary is presented in the "Conclusion" section.

Related works

This section reviews research works addressing classification problems using histopathology images. The deep learning (DL) approaches considered include Convolutional Neural Network (CNN)/transfer-learning pipelines, attention/Transformer-based models, and weakly-supervised or multi-modal formulations. These are discussed in the sub-sections below.

CNN/transfer-learning pipelines for RCC grading

Several works in the literature have graded RCC using either CNNs or Transfer-Learning (TL) approaches. Shyam et al.12, together with Inception-V313 and DenseNet-16914 implementations, explore efficiency by targeting FPGA platforms. These works are promising for throughput but less accessible, owing to their specialized hardware and expertise requirements. Kun et al.15 proposed enhancing ResNet-50 with multi-scale information fusion and dynamic channel allocation. This improved feature richness but is potentially less robust for RCC grading tasks. RCCGNet, proposed by Amit Kumar et al.5, introduces shared channel-residual blocks and released a public RCC grading dataset. However, reliance on a single source cohort may constrain generalizability across demographics and acquisition routines. On the whole, these CNN/TL pipelines16,17,18 motivate better feature representation while exposing gaps in scalability and cross-cohort robustness.

Attention and transformer-based models

Attention mechanisms generally aim to improve long-range dependency modeling. Ritesh et al.19 apply multi-layer attention over a ResNet-18 TL backbone. The authors reported better results across magnifications, albeit with increased practical complexity. Recent Transformer-style approaches such as RDTNet20 and broader CNN-ViT hybrid models21,22,23 capture global context effectively. These works perform well, often at higher parameter costs than compact CNNs, and they motivate hybrid designs that retain efficiency while gaining global context. Mahmood et al.24 proposed RCG-Net, a CNN + Transformer hybrid model using an adaptive convolution block (separable + dilated convolutions) and a dynamic attention block with a transformer encoder for histopathology image classification. In contrast, MobileDANet leverages MobileNetV2 transfer learning with a lightweight DA head, prioritizing compactness and deployability while contextualizing DA among classic attention mechanisms.

Weakly-supervised, multi-modal, and prognostic models

Beyond fully supervised grading, weak supervision with human-machine collaboration has been proposed in25. This reported improved clinical alignment, but the outcome depends on cohort-specific factors. Furthermore, as reported in26, stain normalization with instance learning and tri-multiscale CNNs provided enhanced results at the expense of processing time and resources. Multi-modal frameworks combining genomics and imaging (e.g., ResNet-based pipelines) can enhance performance with increased computational demands27. Moreover, imaging-based survival prediction with ResNet-18 illustrates prognostic utility but remains resource-intensive28. Overall, the aforementioned works address practical clinical endpoints but may be difficult to operationalize widely.

As a final point of summary, in contrast to (1) purely CNN pipelines that may under-capture global context, (2) Transformer-only models that are heavier, and (3) resource-intensive weak/multi-modal solutions, the proposed MobileDANet integrates a compact MobileNetV2 backbone with a lightweight Dynamic Attention (DA) block. In this way, the work intends to balance local–global feature learning and efficiency. Moreover, the work demonstrates applicability beyond RCC by extending to breast (BreakHis) and colorectal (CRCH) datasets. Also, the work provides interpretability through Grad-CAM, aligning with clinical transparency goals. In addition, the comparative summary of related works illustrating framework, characteristics, and how the proposed MobileDANet differs is presented in Table 1.

Table 1 Comparative summary of related works.

Materials and methods

This section provides a detailed discussion of the proposed MobileDANet framework for multi-target histopathology image classification. Initially, the section presents the input datasets and their pre-processing, followed by an outline of the overall workflow of the proposed framework (Fig. 1). Next, the backbone (MobileNetV2) and the Dynamic Attention (DA) module are detailed, and their integration into MobileDANet is discussed (Figs. 6 and 7).

Fig. 1

Overview of the proposed MobileDANet framework: the pipeline begins with publicly available histopathology datasets (RCC/KMC, BreakHis, CRCH), which are standardized and have undergone basic augmentations. The images are then fed into a MobileNetV2 backbone (efficient local feature extraction), followed by a Dynamic Attention block (refines global contextual relationships). The resulting representation is classified into dataset-specific grades.

Input datasets

The evaluation of the proposed MobileDANet architecture is done through three public datasets, namely the Kasturba Medical College (KMC)5, breast cancer histology (BreakHis)31, and colorectal cancer histology (CRCH)32 datasets. In 2023, Amit et al.5 created a dataset of kidney histopathological images, popularly known as the KMC kidney histopathology database. The database contains histopathological images of five grades, where Grade-0 corresponds to normal or non-cancerous tissue and the remaining four grades correspond to cancerous malignant severities. The authors formulated this database at the Pathology Department of KMC, Mangalore, India, using multiple samples of kidney tissues collected during open surgical biopsy procedures. Histopathological slides were prepared from the obtained samples, and thereafter non-overlapping histopathological patches were created and resized to \(224\times 224\) pixels. These patches were then graded by pathologists as non-cancerous (Grade-0) or malignant (Grades-1, 2, 3, and 4). As a whole, the database contains three data folders corresponding to training (3432), validation (503), and testing (142) images. The proposed study performs experimentation using the same data splits and ratios as given in the original dataset, which ensures that the results of our study can be compared unbiasedly against previous works. Figure 2 illustrates image samples of the KMC database.

Fig. 2

KMC dataset’s image samples.

The next one, the BreakHis database31, corresponds to breast cancer histopathological images. This dataset contains both benign and malignant cases acquired at Brazil's Laboratory of Pathological Anatomy and Cytopathology through several clinical studies conducted on eighty-two patients. It comprises approximately 7,910 microscope-level images at magnification levels ranging from 40× to 400×. The images were acquired during open surgical biopsy procedures on different breast tissues, and the samples were characterized and labeled by experienced pathologists. For an unbiased comparison with existing works, our study conducts five independent trials for each experiment, following the openly accessible five-fold training and testing setup provided alongside the BreakHis database; the reported results represent the average of these trials. Furthermore, our experimentation uses images at the 40× magnification level, since this resolution is consistent with the maximum scanning capability of most commercial slide scanners. Figure 3 illustrates image samples of the BreakHis database.

Fig. 3

BreakHis dataset’s image samples.

The third dataset, the CRCH dataset32, comprises colorectal cancer histopathological images, with 100,000 training and 7,180 testing images. All images are uniformly available at a size of \(224\times 224\) pixels. The dataset includes twenty-five slides of colorectal cancer tissues. Figure 4 illustrates image samples of the CRCH database. As shown, the dataset contains nine different tissue classes: colorectal adenocarcinoma epithelium, background, adipose, debris, lymphocytes, complex stroma, normal colon mucosa, mucus, and muscle. Building on the exclusions made in previous studies, our work also removed the background and adipose classes. For validation, we randomly reserved ten percent of the training data. Following data augmentation, the training dataset contains 20,000 samples for each class. To ensure a fair comparison with state-of-the-art methods, we maintained consistent dataset splits throughout our experiments. For evaluating the proposed model on the above datasets, the study utilized a computer system with 16 GB RAM, a 1 TB hard disk, and an Intel Core i7 processor running the Windows 10 operating system. The entire evaluation of the proposed model is done using Python 3.6 on this system.

Fig. 4

CRCH dataset’s image samples.

Data pre-processing and augmentation

Several challenges arise when applying artificial intelligence models to histopathological images, including large image sizes and varying quality, since different labs and institutions stain samples in different ways. To tackle this and make the analysis more reliable, researchers often apply stain normalization methods. However, the employed datasets, as discussed in the previous sub-section, were already pre-processed by their providers, so staining variations and related issues are already handled. Consequently, our work focuses solely on developing a promising deep-learning model, using these pre-processed databases for training and validation.

In the case of the KMC database5, the slides were prepared using the paraffin procedure by experienced pathologists, who then carefully examined the slides under the microscope to identify tumorous regions. From this careful examination, the pathologists extracted non-overlapping square patches and converted them to a uniform \(224\times 224\) size. The second dataset, BreakHis31, contains breast cancer samples prepared from breast tissues. In this database, images were acquired at \(752\times 582\) pixels in the red-green-blue (RGB) color space; due to dark borders and annotation text, the images are cropped to \(700\times 460\) pixels. Next, in the case of the CRCH database32, the providers used both normal histology and multi-center human colorectal tissues, from which uniform patches of \(224\times 224\) pixels were extracted. The Macenko stain normalization approach is then applied to these patches to address staining irregularities and achieve color normalization.

In machine learning and deep learning, augmentation plays a key role in artificially increasing both the diversity and the quantity of training inputs. The limited generalization ability of architectures trained on insufficient data can be addressed using data augmentation33. Typical approaches augment training data through scaling, cropping, rotation, flipping, and noise addition. In this work, data augmentation combines added noise (salt-and-pepper, Gaussian, and speckle) with vertical and horizontal image flipping. As a result of training data augmentation, the KMC dataset contains 2,000 images in each class; for the BreakHis database the average number of images per class is 2,500, and for the CRCH dataset it is 20,000. Sample illustrations of data augmentation are given in Fig. 5.
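As a concrete illustration, a minimal sketch of such a noise-and-flip augmentation routine is given below; the noise levels and the salt-and-pepper fraction are assumed values for illustration and not necessarily the exact settings used in the study.

```python
import numpy as np

def augment(image, rng=np.random.default_rng(0)):
    """Return augmented copies of a [0, 1]-scaled RGB image of shape (H, W, 3)."""
    augmented = []
    # Additive Gaussian noise (assumed standard deviation of 0.05)
    augmented.append(np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0))
    # Multiplicative speckle noise
    augmented.append(np.clip(image * (1.0 + rng.normal(0.0, 0.05, image.shape)), 0.0, 1.0))
    # Salt-and-pepper noise on roughly 2% of the pixels
    sp = image.copy()
    mask = rng.random(image.shape[:2])
    sp[mask < 0.01] = 0.0  # pepper
    sp[mask > 0.99] = 1.0  # salt
    augmented.append(sp)
    # Horizontal and vertical flips
    augmented.append(image[:, ::-1, :])
    augmented.append(image[::-1, :, :])
    return augmented
```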

Fig. 5

Sample illustration of data augmentation.

In summary, to address class imbalance, dataset-specific strategies are adopted. For KMC, class-weighted cross-entropy loss could be applied, assigning higher weight to under-represented grades. For BreakHis, balanced mini-batch sampling could ensure equal representation of benign and malignant cases. Moreover, for the CRCH dataset, performance could be reported using weighted F1-scores to account for skewed category distributions. In addition, the previously discussed data augmentation (flips, rotations, noise) further contributed to balancing the effective sample distribution.
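As one possible realization, the class-weighting strategy mentioned above could be configured as in the sketch below; the per-grade counts are illustrative placeholders rather than the actual KMC distribution.

```python
import numpy as np

# Illustrative per-grade training counts (placeholders, not the real KMC figures).
class_counts = np.array([900, 600, 750, 500, 682], dtype=float)

# Inverse-frequency weights: under-represented grades receive larger weights.
class_weight = {i: float(class_counts.sum() / (len(class_counts) * c))
                for i, c in enumerate(class_counts)}

# With tf.keras, passing these to model.fit(..., class_weight=class_weight)
# weights the categorical cross-entropy loss towards minority grades.
print(class_weight)
```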

Outline of the proposed framework

The overall workflow of the proposed MobileDANet framework is illustrated in Fig. 1. The study presents a novel framework to classify multi-targeted histopathological images taken from several public research repositories, covering the cancer grades of kidney, colon, and breast tumors. For this, the proposed framework utilizes openly accessible databases that were pre-processed by their source providers, ensuring reliability and standardization in our experimental analysis. The proposed framework focuses on the development of the MobileDANet architecture, as shown in Fig. 6. As given in Fig. 1, data augmentation techniques are employed to improve robustness and overcome the overfitting problems of the model. The augmented data are used to train the proposed model, followed by evaluating the architecture using test data. Afterward, the study conducts experiments to achieve robust classification performance for severity classification across distinct histopathological datasets of kidney, colon, and breast cancer.

Fig. 6

Structure of the MobileNetV2 used for Feature Extraction: an input patch undergoes stem convolution and stacked inverted residual blocks with linear bottlenecks. Each block expands channels, applies depthwise convolution, and projects back using residual shortcuts. This efficient design retains spatial detail with low computation. The resulting compact feature map is sent to the DA block for context refinement before classification.

As illustrated in Fig. 1, the proposed framework involves two primary blocks: MobileNetV2, a transfer learning model, and a Dynamic Attention (DA) module. Here, the MobileNetV2 model leverages the effectiveness of depth-wise separable convolutions and inverted residuals, significantly reducing computation while providing a better feature representation of the data inputs. The subsequent DA module further refines these features into richer representations, using a multi-head attention mechanism, inspired by vision transformers, to capture the complicated dependencies in the histopathology images.

Detailed architecture of the proposed framework

Convolutional neural networks (CNNs) have become the base models in several computer vision applications, owing to their ability to learn complex patterns efficiently through spatial inductive biases34. Building on this, transfer learning models have become popular for their effectiveness in acquiring local spatial information. The primary disadvantage of such models, however, is their focus on local spatial relationships in the data. This has led researchers in computer vision to adopt vision transformers (ViTs) for capturing global contextual relationships in the applied data35, which makes ViTs suitable for a wider range of image classification problems. At the same time, ViTs tend to have higher memory requirements and larger model sizes than transfer learning models, and this computational demand limits their practical use in resource-constrained environments35. To capture both local and global information with balanced computational requirements, researchers have combined convolutional networks and vision transformers. This integration aims to leverage the spatial inductive biases of CNNs (for effective extraction of local features) while taking advantage of the global contextual learning of vision transformers.

The study proposes a novel MobileDANet architecture for multi-target classification of histopathological images. The proposed model can capture long-range relationships while maintaining a balanced computational efficiency. In this way, the proposed methodology offers an advantageous approach to histopathological image processing tasks compared to the state-of-the-art (SOTA) methods. As illustrated in Fig. 1, the primary blocks constituting the proposed architecture are the MobileNetV2 model and DA blocks. The inclusive details of each unit of the proposed architecture are discussed in the subsequent sub-sections.

MobileNetV2 model for multi-target histopathological images

MobileNetV2 is chosen as the backbone of the proposed classification framework because it offers an effective trade-off between computational efficiency and classification accuracy, particularly in low-resource environments11. Compared to ResNet50 and DenseNet, it reduces computational overhead while maintaining competitive accuracy, and unlike ShuffleNet or EfficientNet-lite, it demonstrated more stable convergence in image classification tasks10,11. The model uses depthwise separable convolutions and an inverted residual structure with linear bottlenecks for its operation. As illustrated in Fig. 6, the MobileNetV2 model is primarily built on the following two key concepts11:

  1.

    Depthwise separable convolution: This form of factorized convolution uses reduced computation by splitting a standard convolution into two stages: depthwise and pointwise convolution processes.

  2.

    Inverted residuals with linear bottlenecks: This is a type of structure where the bottleneck layer expands the feature dimension, which is followed by a depthwise convolution, and finally compresses it back.

In depthwise convolution, a single convolution filter is applied to each input channel individually. In contrast, pointwise convolution performs a \(1\times 1\) convolution to combine the outputs of the depthwise convolution. This can be expressed mathematically as in Eqs. (1) to (3). For standard convolution, the output \(Y\) can be computed for an input tensor \(X\) of shape \(H\times W\times D\) (height, width, depth) and a kernel tensor \(K\) of shape \(k\times k\times D\times D^{\prime}\) (kernel size \(k\times k\), input depth \(D\), output depth \(D^{\prime}\)) as11:

$$Y\left[i,j,d^{\prime}\right]=\sum_{h=1}^{k}\sum_{w=1}^{k}\sum_{d=1}^{D}K\left[h,w,d,d^{\prime}\right]\cdot X\left[i+h,j+w,d\right]$$
(1)

For depthwise convolution, considering each input channel \(d\), a separate filter \({K}_{d}\) of size \(k\times k\) is applied, then its output can be computed as:

$${Y}_{d}\left[i,j\right]=\sum_{h=1}^{k}\sum_{w=1}^{k}{K}_{d}\left[h,w\right]\cdot {X}_{d}\left[i+h,j+w\right]$$
(2)

For pointwise convolution, a \(1\times 1\) convolution operation is then applied to combine the depthwise convolution outputs as given below.

$$Y^{\prime}\left[i,j,d^{\prime}\right]=\sum_{d=1}^{D}K^{\prime}\left[d,d^{\prime}\right]\cdot {Y}_{d}\left[i,j\right]$$
(3)

From Eqs. (1) to (3), it can be seen that the computational cost is reduced by a factor of approximately \(\frac{1}{D^{\prime}}+\frac{1}{{k}^{2}}\) compared with the standard convolution process.
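To make this cost reduction concrete, the following Keras sketch contrasts a standard convolution with its depthwise separable factorization; the kernel size and channel depths are assumed values chosen for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

k, D, D_prime = 3, 64, 128                      # assumed kernel size and depths
inp = tf.keras.Input(shape=(56, 56, D))

# Standard convolution (Eq. 1): roughly k*k*D*D' weights.
standard = layers.Conv2D(D_prime, k, padding="same")(inp)

# Depthwise (Eq. 2) followed by pointwise (Eq. 3): roughly k*k*D + D*D' weights.
depthwise = layers.DepthwiseConv2D(k, padding="same")(inp)
separable = layers.Conv2D(D_prime, 1)(depthwise)

print(tf.keras.Model(inp, standard).count_params())   # ~73,856
print(tf.keras.Model(inp, separable).count_params())  # ~8,960
```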

Furthermore, the concept of inverted residuals is a key innovation in the MobileNetV2 model, enabling it to achieve higher efficiency. In the inverted residual block, each block expands the input dimension by means of a pointwise convolution, then applies depthwise convolution, and finally projects it back through another pointwise convolution operation. This mimics the “inverted” structure since it first expands and then compresses the channels, in contrast to conventional residual block operations. This can be mathematically given in the equations below11 with the consideration of an input tensor \(X\) of shape \(H\times W\times D\) (height, width, depth).

The first one is the expansion process using pointwise convolution11 as given in Eq. (4).

$$X^{\prime}=\mathrm{ReLU6}\left(X{W}_{1}\right)$$
(4)

where \({W}_{1}\) denotes the weight matrix for the \(1\times 1\) convolution process through an expansion of depth to \(tD\) with \(t\) as an expansion factor. Next, the depthwise convolution can be mathematically represented as in Eq. (5)11.

$$X^{\prime\prime}=\mathrm{ReLU6}\left(\mathrm{DepthwiseConv}\left(X^{\prime}\right)\right)$$
(5)

Afterward, the compression process using pointwise convolution is attained using Eq. (6)11.

$$Y=\mathrm{Linear}\left(X^{\prime\prime}{W}_{2}\right)$$
(6)

where \({W}_{2}\) denotes the weight matrix for the final \(1\times 1\) convolution, which reduces the depth back to \(D\). Finally, when the input and output dimensions are equal, a residual connection is added as given in Eq. (7)11.

$$Y=X+Y$$
(7)

Here, the linear bottleneck uses a linear activation (in place of the ReLU function) in the last layer to prevent information loss, particularly in low-dimensional manifolds. In summary, the MobileNetV2 architecture is built by stacking the inverted residual blocks discussed above, constructed with varying expansion factors and output dimensions. As illustrated in Fig. 6, the MobileNetV2 backbone begins with a shallow convolutional stem. It proceeds through inverted residual blocks that (1) expand channels, (2) apply depthwise spatial filtering, and (3) project features back through a linear bottleneck with residual connections when shapes align. Downsampling is performed by selected blocks (stride > 1), progressively reducing spatial resolution while increasing channel capacity, which results in a compact, high-level feature representation. The MobileNetV2 output is designed to retain discriminative tissue morphology with low computation and is subsequently provided to the DA block for modelling longer-range relationships before the classification phase.
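For reference, a minimal sketch of one inverted residual block following Eqs. (4) to (7) is given below; the expansion factor t = 6 and the batch-normalization placement follow the original MobileNetV2 design, while the remaining settings are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, t=6, stride=1):
    """Sketch of one MobileNetV2 inverted residual block (Eqs. 4-7)."""
    d = x.shape[-1]
    y = layers.Conv2D(t * d, 1, use_bias=False)(x)            # expansion (Eq. 4)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(y)             # depthwise (Eq. 5)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    y = layers.Conv2D(d, 1, use_bias=False)(y)                # linear bottleneck (Eq. 6)
    y = layers.BatchNormalization()(y)
    if stride == 1:                                           # residual shortcut (Eq. 7)
        y = layers.Add()([x, y])
    return y

# Example: one block applied to a 56 x 56 x 24 feature map.
features = tf.keras.Input(shape=(56, 56, 24))
block_out = inverted_residual(features)
```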

Dynamic attention (DA) block

The dynamic attention block is designed to refine and enhance the focus of MobileNetV2 extracted features using the attention mechanism. This ensures that the significant areas of the applied inputs are emphasized during the classification phase. The architectural details of the Dynamic Attention block are illustrated in Fig. 7.

Fig. 7

Architectural details of the Dynamic Attention block: the DA block adapts MobileNetV2 features via a 1 × 1 convolution, applies multi-head self-attention with residual connections and normalization for stable learning, and refines outputs using a lightweight MLP before classification. Unlike ViTs, it avoids tokenization and positional encoding, enabling efficient global context modelling.

As given in Fig. 7, the DA block takes as input the extracted features \((7\times 7\times 320)\) of the MobileNetV2 architecture. The DA unit establishes a dynamic attention mechanism and works similarly to vision transformers for further refining the learned representation. The block involves three principal modules, each of which influences learning and feature representation. The DA block starts with a \((1\times 1)\) convolution layer, placed here to adjust the dimensions of the input data. The primary purpose of this layer is to enable effective transformation and feature refinement, extracting the essential information needed by subsequent data handling steps. This convolution thus shapes the input's representational ability and provides a foundation for feature extraction and learning in later stages. In particular, this layer reduces the number of channels from 320 to 256 for computational efficiency, making the subsequent operations less expensive without losing much information. This can be mathematically represented as given in Eq. (8).

$$X^{\prime}={W}_{1\times 1}*X$$
(8)

In Eq. (8), \(X^{\prime}\) denotes the output of the convolution layer applied to the feature inputs \(X\), while \({W}_{1\times 1}\) and \(*\) denote the convolution weights and the convolution operation, respectively. A Mish activation is employed after the pointwise convolution layer to enhance feature representation, ensuring better handling of complex patterns present in the features. The work considers two smooth, non-monotonic activations for the DA block interface, namely Swish and Mish, defined as \(Swish\left(x\right)=x\sigma (x)\) and \(Mish\left(x\right)=x\cdot \text{tanh}\left(softplus\left(x\right)\right)\). After the \(1\times 1\) projection (Eq. 8), feature channels are approximately zero-centered and subsequently consumed by multi-head attention with LayerNorm. In this setting, Mish's soft negative outputs (rather than hard truncation) and smooth higher-order derivatives help retain low-magnitude yet informative responses from fine-grained tissue textures, while providing stable gradients through the attention-normalization stacks.
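For clarity, the two candidate activations can be written directly as follows (a minimal sketch; the DA block adopts Mish).

```python
import tensorflow as tf

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x * tf.math.sigmoid(x)

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth and non-monotonic, so small
    # negative responses are attenuated rather than truncated to zero.
    return x * tf.math.tanh(tf.math.softplus(x))
```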

Afterward, the next multi-head attention block applies an attention mechanism to distinct parts of the obtained feature space. Herein, the intention is to provide attention to significant regions that are more crucial for multi-target histopathology image classification. This provides an output having the same size as the applied inputs, but the attention-weighted features are refined. This process is based on the principle of the self-attention mechanism. Here, the computation of attention scores is based on the mathematical Eq. (9).

$$Attention\left(Q,K,V\right)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(9)

where \(Q=X{W}_{Q}\), \(K=X{W}_{K}\), and \(V=X{W}_{V}\) represent the query, key, and value projections of the input \(X\), respectively. In Eq. (9), \({d}_{k}\) denotes the dimension of the keys, and \({W}_{Q}\), \({W}_{K}\), \({W}_{V}\) are the weight matrices for each head. This mechanism extends naturally to multi-head attention as given in Eq. (10), allowing the architecture to attend jointly to information from distinct representation sub-spaces.

$$MultiHead\left(Q,K,V\right)=Concat\left({head}_{1},{head}_{2},\dots ,{head}_{h}\right){W}_{O}$$
(10)

In Eq. (10), \(h\) indicates the number of attention heads and \({W}_{O}\) is the output projection. Afterward, as shown in Fig. 7, a residual connection adds the original input of the multi-head attention to its output, which ensures the preservation of important base features. Layer normalization is then employed to normalize the feature distributions and stabilize learning. Subsequently, a multi-layer perceptron network is incorporated to further strengthen the model's adaptability by adding non-linearity and depth to the features. The final step of the block projects the processed feature vectors into the desired output space, which is then passed to the classification head for robust classification of multi-targeted histopathological images into different grades of severity.
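A minimal sketch of the attention computation in Eqs. (9) and (10) is shown below; the per-head projections are assumed to be supplied by the caller, and in practice the same operation is available through standard multi-head attention layers.

```python
import tensorflow as tf

def scaled_dot_product_attention(Q, K, V):
    """Eq. (9): softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)
    return tf.matmul(tf.nn.softmax(scores, axis=-1), V)

# Eq. (10): each head applies this operation to its own projections Q = X W_Q,
# K = X W_K, V = X W_V; the head outputs are concatenated and projected by W_O.
```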

The proposed dynamic attention block, as discussed, is inspired by the operation of vision transformers30, but it introduces task-oriented adaptations to ensure compactness and efficiency. Unlike ViTs, which tokenize image patches and apply positional encoding for global attention, the DA block (Fig. 7) operates directly on the convolutional feature maps from the MobileNetV2 backbone without tokenization. Attention is applied at the channel and feature-map level, which enables refinement of critical local-global dependencies while avoiding the parameter overhead of token-based transformers.

Furthermore, the DA block integrates a lightweight design with the Mish activation. This might improve feature representation in histopathology tasks, and it is validated in the subsequent ablation study. In this way, the DA block is not a mere replication of multi-head attention but a streamlined and computationally efficient adaptation, tailored specifically for multi-targeted histopathology image classification.
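Putting these pieces together, the sketch below shows one plausible assembly of the DA block on top of the \(7\times 7\times 320\) MobileNetV2 features; the flattening of the spatial grid into 49 feature vectors, the MLP width, and the five-class head are assumptions made for illustration rather than the exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mish(x):
    return x * tf.math.tanh(tf.math.softplus(x))

def dynamic_attention_block(features, proj_dim=256, num_heads=4, num_classes=5):
    """Sketch of the DA block over MobileNetV2 features of shape (7, 7, 320)."""
    # 1x1 projection from 320 to 256 channels with Mish activation (Eq. 8)
    x = layers.Conv2D(proj_dim, 1)(features)
    x = layers.Activation(mish)(x)
    # Flatten the 7 x 7 grid into 49 feature vectors (no patch tokenization or
    # positional encoding, unlike a ViT)
    h, w = x.shape[1], x.shape[2]
    seq = layers.Reshape((h * w, proj_dim))(x)
    # Multi-head self-attention with residual connection and layer normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=proj_dim // num_heads)(seq, seq)
    seq = layers.LayerNormalization()(layers.Add()([seq, attn]))
    # Lightweight MLP refinement with a second residual connection
    mlp = layers.Dense(proj_dim, activation=mish)(seq)
    seq = layers.LayerNormalization()(layers.Add()([seq, mlp]))
    # Pool and classify into the dataset-specific grades
    pooled = layers.GlobalAveragePooling1D()(seq)
    return layers.Dense(num_classes, activation="softmax")(pooled)

# Example wiring from a 7 x 7 x 320 feature map (e.g., the MobileNetV2 output).
features = tf.keras.Input(shape=(7, 7, 320))
da_head = tf.keras.Model(features, dynamic_attention_block(features))
```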

Experimentation and results

This section reports the experimental setup and findings. The section first describes the training setup and evaluation metrics used. Then it presents the ablation study, followed by comparative results in KMC, BreakHis, and CRCH datasets. Next, the section explores the extension of the model to different cancer datasets, and finally concludes with visual insights and key findings.

Data training

The proposed MobileDANet architecture is based on supervised learning and relies on training with labeled data from the employed databases. In order to perform a comparative evaluation in an unbiased manner, the training data for all databases is the same as that used in the state-of-the-art approaches. For training the MobileDANet model, the Adam optimizer and the categorical cross-entropy loss are used. The training hyperparameters are a learning rate of 1 × 10−4 (selected after exploring values from 1 × 10−3 to 1 × 10−5), 100 epochs, and a batch size of 16. An early-stopping approach tracking validation accuracy is employed to avoid excessive training epochs and overfitting. Furthermore, as illustrated in Fig. 8, the proposed architecture's performance is monitored through accuracy and loss curves during training and validation. These visualizations helped to confirm the effective convergence of the training process. In addition, the plots in Fig. 8 illustrate the MobileDANet architecture's capability to reduce overfitting on the training data.
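A minimal sketch of this training configuration is given below, with a simple classification head standing in for the full MobileDANet model; the early-stopping patience and the dataset object names are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# MobileNetV2 backbone with a simple softmax head standing in for MobileDANet.
backbone = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                             include_top=False,
                                             weights="imagenet")
x = layers.GlobalAveragePooling2D()(backbone.output)
outputs = layers.Dense(5, activation="softmax")(x)   # five KMC grades
model = tf.keras.Model(backbone.input, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping on validation accuracy (patience is an assumed value).
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                              patience=10,
                                              restore_best_weights=True)

# model.fit(train_ds, validation_data=val_ds, epochs=100,
#           batch_size=16, callbacks=[early_stop])   # hypothetical dataset objects
```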

Fig. 8

Accuracy and loss performance curves of MobileDANet on KMC dataset.

Performance metrics

Standard performance measures are used to assess the proposed model: accuracy (Acc), recall (Rec), precision (Pre), and the F1 score. The F1 score provides a comprehensive view by combining precision and recall, whereas accuracy measures the fraction of correct predictions out of all outcomes. To ensure an unbiased comparative analysis, the performance measures and data ratios follow existing SOTA approaches. The mathematical definitions of the employed metrics, with \(T\) as the total number of output targets, are given in Eqs. (11) to (14)33.

$$Acc=\frac{1}{T}\sum_{z=1}^{T}\frac{{TP}_{z}+{TN}_{z}}{{TP}_{z}+{TN}_{z}+{FP}_{z}+{FN}_{z}}$$
(11)
$$Rec=\frac{1}{T}\sum_{z=1}^{T}\frac{{TP}_{z}}{{TP}_{z}+{FN}_{z}}$$
(12)
$$Pre=\frac{1}{T}\sum_{z=1}^{T}\frac{{TP}_{z}}{{TP}_{z}+{FP}_{z}}$$
(13)
$$F1 Score=2\times \frac{Pre\times Rec}{Pre+Rec}$$
(14)

The metrics of Eqs. (11) to (14) are characterized by the confusion matrix elements, specifically True and False positives and negatives (TP, FP, TN, and FN). In the case of the BreakHis database, a patient-level classification metric, recognition rate31, is considered in the paper. This can be mathematically denoted in Eqs. (15) and (16)31 with \({N}_{P}\) and \({N}_{rec}\) as the number of image inputs corresponding to patient \(P\) and the number of correctly classified occurrences.

$$\text{Patient Score } \left(PS\right)=\frac{{N}_{rec}}{{N}_{P}}$$
(15)
$$\text{Recognition Rate } \left(RR\right)= \frac{\sum PS}{\text{Total number of patients}}$$
(16)

In addition to the above metrics, the weighted average F1-Score \((\widehat{F})\) is utilized for assessing the proposed model for the CRCH dataset. This can be mathematically denoted as given in Eq. (17)32 with \(T\) as the total number of output targets of the CRCH dataset.

$$\widehat{F}=\frac{\sum_{i=1}^{T}{n}_{i}\,\text{F1 Score}(i)}{t}$$
(17)

Here, \({n}_{i}\) and \(t\) are the number of test samples in the \({i}^{th}\) class and the overall number of test samples, respectively. This metric is used together with the F1 score to tackle the class imbalance issue by averaging the per-class F1 scores weighted by class support. In this way, the metric offers better sensitivity to minority classes and ensures that the overall score reflects the contributions of both minority and majority classes.
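For illustration, the weighted-average F1 score and the patient-level recognition rate could be computed as in the sketch below, assuming scikit-learn is available; the labels shown are placeholder values.

```python
import numpy as np
from sklearn.metrics import f1_score

# Weighted-average F1 (Eq. 17): per-class F1 weighted by class support.
y_true = np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 1])   # placeholder labels
y_pred = np.array([0, 1, 1, 2, 2, 2, 3, 4, 3, 1])
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

def recognition_rate(patient_ids, y_true, y_pred):
    """Patient-level recognition rate (Eqs. 15-16): mean over patients of the
    fraction of each patient's images that are classified correctly."""
    scores = []
    for p in np.unique(patient_ids):
        idx = patient_ids == p
        scores.append(np.mean(y_true[idx] == y_pred[idx]))   # patient score
    return float(np.mean(scores))
```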

Ablation study and its experimentations on the KMC dataset

This sub-section discusses systematic experimentation with the selective removal of modules from the proposed architecture or from the training process. The intention is to assess the individual effect of these adaptations on the model's performance, which in turn enables a better interpretation of the impact of each module constituting the MobileDANet architecture. The experimentation is carried out on the KMC database for comprehensive analysis. The initial ablation study examines the impact of the simple baseline architecture of the proposed model; the result of this experiment is presented in Table 2. For this, the study selects hyperparameters through an empirical grid-search strategy. A learning rate of 1 × 10−4 was chosen after exploring 10−3 to 10−5; higher rates caused unstable convergence, and lower rates slowed training without improvement. For the DA block, the number of attention heads was tested with values of \(\{2, 4, 6\}\), where 4 provided the best trade-off between accuracy and parameter efficiency. The projection dimension \((256)\) was selected to reduce the MobileNetV2 bottleneck (320-dim) while retaining representational capacity, since 128 degraded accuracy and 512 increased FLOPs disproportionately. Sensitivity checks around these choices altered final accuracy by less than 1.5%, confirming that MobileDANet is not overly dependent on precise hyperparameter tuning.

Table 2 Ablation study experimentations of the proposed model using the KMC dataset.

The graphical representation illustrating how each addition impacts performance is given in Fig. 9. As shown in Table 2 and Fig. 9, the experimentation starts with a baseline MobileNetV2 architecture without any attention mechanism. This foundation model lacks fine-grained feature refinement and therefore provides relatively lower performance, with an accuracy of 71.42%, precision of 72.18%, recall of 70.47%, and F1 score of 70.91%. The next step introduces standard SE (Squeeze-and-Excitation) and CBAM (Convolutional Block Attention Module) blocks, which boost performance. This reveals that attention helps in capturing more relevant feature areas, yielding improvements in accuracy (74.90–77.15%), precision (76.11–78.15%), recall (74.05–76.95%), and F1 score (74.76–77.98%). This motivated the inclusion of the Dynamic Attention (DA) block in the baseline architecture to further improve classification performance, since the DA block enables the architecture to focus on complex dependencies and salient regions in the histopathological images. This combination yields a noticeable improvement, with an accuracy of 81.73%, precision of 82.85%, recall of 80.48%, and F1 score of 81.95%. The next step seeks further performance improvement by changing the activation function.

Fig. 9

Performance insights on each additional module of the proposed model (KMC dataset).

In this way, the typical ReLU (Rectified Linear Unit) activation is replaced by the Swish activation function. This activation is known for smoother gradient flow and thus yields a modest enhancement in performance compared with the previous DA block setup. This minor improvement stems from better handling of the challenging features involved in histopathological image classification. Subsequently, the next experiment uses the Mish activation function instead of Swish. This makes the proposed model perform better in classification than with the previously used activation functions, owing to the non-monotonic and smoother gradient flow of Mish, which suits the DA block and yields improved feature learning. Empirically, replacing Swish with Mish in MobileNetV2 + DA improved the classification performance without augmentation (F1 score from 82.46 to 86.99% (+ 4.53), precision from 83.41 to 88.62% (+ 5.21), recall from 81.82 to 84.96% (+ 3.14), and accuracy from 82.09 to 84.33% (+ 2.24)). Thus, this version of the proposed model (MobileDANet) effectively balances computational efficiency with promising classification results, as shown in Fig. 9. After these experiments, image augmentation is applied to the proposed model, and the classification performance increases significantly to an accuracy of 90.71%, precision of 91.29%, recall of 90.75%, and F1 score of 90.94%. This reveals that augmentation exposes the model to varied data perspectives, improving robustness and the handling of variability across different classes of input histopathological images. Finally, the obtained classification performance indicates that combining the MobileDANet model with diverse training data prepares the architecture for generalization to other cancer data. In addition, on the KMC test data, MobileDANet with augmentation achieved 90.71% accuracy compared to 71.42% for the MobileNetV2 baseline. A chi-squared test of proportions confirmed that this improvement is statistically significant (\({\chi }^{2}=16.79, p=3.8\times {10}^{-5}\)). The effect size, measured by Cohen's h (\(h=0.52\)), corresponds to a medium practical impact. This indicates that MobileDANet provides a substantially higher and practically meaningful classification accuracy on KMC data. Figure 10 shows the visualization of a histopathological image of the KMC dataset (Grade 4) at an intermediate convolution layer of MobileDANet. Here, the colormap used for the plot is 'hot', illustrating how well the regions of the image are learned for the final prediction.
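As a hedged illustration, the reported significance test can be approximately reproduced as follows, assuming SciPy is available; the per-model correct counts are rounded from the reported percentages, so the resulting statistic may differ slightly from the value quoted above.

```python
import numpy as np
from scipy.stats import chi2_contingency

n = 142                              # KMC test images
correct_da = round(0.9071 * n)       # MobileDANet with augmentation
correct_base = round(0.7142 * n)     # MobileNetV2 baseline

table = np.array([[correct_da, n - correct_da],
                  [correct_base, n - correct_base]])
chi2, p, dof, expected = chi2_contingency(table)   # chi-squared test of proportions

# Cohen's h for two proportions: h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
p1, p2 = correct_da / n, correct_base / n
h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, Cohen's h = {h:.2f}")
```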

Fig. 10

Intermediate convolution layer visualization of MobileDANet (KMC dataset).

Extending experimentations on different cancer datasets

In this sub-section, the proposed MobileDANet architecture is applied to other cancer datasets, and its performance is compared with recent existing works. The other cancer datasets used are the breast cancer histology (BreakHis) and colorectal cancer histology (CRCH) datasets. These two datasets consist of histopathology images, but of different cancers (breast and colorectal). In addition, the results obtained for the KMC dataset are compared against recent existing studies. Tables 3, 4, and 5 present this comparative analysis, in which the proposed model outperforms the existing works. The superior classification performance of the proposed MobileDANet model is due to the adaptive characteristics of the dynamic attention module. The DA block is designed to refine and focus the MobileNetV2 extracted features using the attention mechanism, which ensures that the significant areas of the applied inputs are emphasized during the classification phase, thus yielding superior performance. For the KMC dataset, the comparison findings of the proposed MobileDANet architecture are presented in Table 3. According to these findings, the proposed architecture achieved higher classification performance than existing models.

Table 3 Comparative analysis of MobileDANet and existing works (KMC dataset).
Table 4 Comparative analysis of MobileDANet and existing works (BreakHis dataset).
Table 5 Comparative analysis of MobileDANet and existing works (CRCH dataset).

For the BreakHis dataset, the comparative analysis uses different existing feature extractors such as the completed local binary pattern (CLBP)49, gray-level co-occurrence matrix (GLCM)50, local binary pattern (LBP)51, local phase quantization (LPQ)52, oriented FAST and rotated BRIEF (ORB)53, and parameter-free threshold adjacency statistics (PFTAS)54. Furthermore, the analysis uses different classification algorithms such as 1-nearest neighbor (1-NN)55, quadratic discriminant analysis (QDA)56, support vector machine (SVM)57, and random forests (RF)58. With these algorithms, the comparative analysis of the proposed MobileDANet model is presented in Table 4 for the BreakHis dataset. The table reveals that the proposed model outperforms previous models, showing that it works effectively across distinct cancer datasets. Similarly, for the CRCH dataset, the comparison findings of the proposed MobileDANet model are presented in Table 5, confirming the promising performance of our proposed model across different histopathology datasets. Figure 11 illustrates the ROC curves for the three datasets. These plots reflect MobileDANet's ability to generalize to complex multi-class tissue differentiation with robust classification performance. Collectively, these analyses confirm that MobileDANet not only achieves higher accuracy and F1 scores (Tables 3, 4 and 5) but also maintains strong discriminative power across datasets. This underscores the proposed model's potential for reliable deployment in varied histopathological classification scenarios. Efficiency is also a critical criterion for artificial intelligence-based approaches, including the total number of parameters and floating-point operations (FLOPs) required to achieve the desired outcomes. The total number of learnable parameters and FLOPs is comparatively analyzed and summarized in Table 6. This reveals that the proposed MobileDANet achieves superior efficiency across these criteria, facilitating the adoption of AI-based solutions.

Fig. 11

Receiver operating characteristic (ROC) curves of the MobileDANet across three datasets: (a) KMC dataset with five RCC grades, (b) BreakHis dataset (benign and malignant), and (c) CRCH dataset with seven classes. These plots display per-class ROC curves together with micro- and macro-averaged curves. This plot demonstrates the model’s ability to achieve high area under the curve (AUC) values across both binary and multi-class settings.

Table 6 Computational complexity analysis.

Visual insights of the findings

The paper follows a systematic method for visual analytics on the feature maps, aiming to localize and highlight the key regions in histopathological inputs that support the model's predictions. Traditional prediction approaches consider various feature vectors but typically offer minimal context about which features are critically important. To overcome this problem, the study employs gradient-weighted class activation mapping67, known as the Grad-CAM approach. This methodology identifies the critical regions in the applied inputs that are crucial for accurate prediction of targets, and it thus enables the work to provide valuable analytics from the proposed model.
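A minimal sketch of the Grad-CAM computation used for these visualizations is given below; the convolutional layer name and the class index are assumptions supplied by the caller.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Grad-CAM heatmap: weight the chosen convolutional feature maps by the
    pooled gradients of the target class score, then average and rectify."""
    grad_model = tf.keras.Model(model.input,
                                [model.get_layer(conv_layer_name).output,
                                 model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))          # global-average-pooled grads
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()    # normalized [0, 1] heatmap
```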

The class activation maps and the maps superimposed on the inputs, obtained using the Grad-CAM approach for sample image inputs of the KMC dataset, are presented in Fig. 12. As shown, an input from each grade (0, 1, 2, 3, and 4) is taken, together with its class activation map and superimposed output. This reveals that our proposed model identified critical discriminative regions, as indicated by the red and yellow zones. The model's stability is qualitatively checked by looking at how it "sees" different images using the Grad-CAM maps. For high-grade tumors (Grades 2–4) of the KMC data, the model consistently focused on dense tumor cells and abnormal tissue, ignoring the surrounding healthy tissue. For Grade 0 (healthy tissue), its focus was more spread out, and Grade 1 showed a mix of these patterns. This indicates that the model's focus is reliable and not random. It is thus evident that the proposed approach extracts significant features from the appropriate areas for precise classification of severities across different cancer datasets. Moreover, the shown visualization enhances transparency and understanding of our model's prediction mechanism, ensuring alignment with explainable AI (XAI) principles and facilitating informed medical decision-making.

Fig. 12

(a) Input image (b) GradCAM activation maps (c) Superimposed GradCAM output (KMC dataset).

Findings of the work

Histopathological image analysis with accurate prediction is highly useful for the early detection of critical diseases and can support timely diagnosis, potentially saving many lives. With the advances in computer vision research, several studies are evolving for image classification and analysis. The top-line findings of this research work are given below:

  • A significant step in the research flow is image augmentation, which is used when training data is limited or of lower quality. This work employed a robust image augmentation method that, unlike conventional approaches, combines different types of noise (Gaussian, speckle, and salt-and-pepper) with vertical and horizontal flipping. This diversified augmentation enhances model robustness, effectively tackling variations and facilitating generalizability across distinct scenarios.

  • The work introduces a dynamic attention (DA) block (inspired by ViTs) over the MobileNetV2 features for enhancing feature representations using multi-head attention mechanisms. This enables the model to capture both global and local dependencies within histopathological images. The utilization of MobileNetV2 and ViT-inspired DA block provides a balanced computational complexity and is thus helpful in refining critical features and improving the classification of distinct cancer grades.

  • The multi-head attention mechanism and multi-layer perceptron concepts make the proposed model robust to the identification of intricate patterns in histopathology images. This approach highlights and recognizes the input image’s non-local dependencies with the use of multi-head attention that exhibits dynamic adaptability.

  • The research utilizes the Grad-CAM approach to demonstrate our method’s efficacy in identifying and prioritizing salient regions within histopathology images. As shown in Fig. 12, this feature highlights the proposed architecture’s potential for explainable AI, making it suitable for applications that require transparent decision-making.

Practical considerations for clinical deployment

Despite promising performance across the RCC, breast, and colorectal datasets, the following points outline checks and safeguards that future researchers should consider on the path towards clinical deployment of the model.

  • Histopathology datasets often exhibit skewed class distributions. While the work employed macro- or micro-averaged metrics, clinical deployment should also consider class-balanced objectives, priority-weighted thresholds tuned on validation data, and per-class operating points aligned with clinical risk tolerances.

  • While augmentation with Gaussian, speckle, and salt-and-pepper noise improved robustness during training, the explicit evaluation of MobileDANet under additional degradations, such as reduced resolution, motion blur, or compression artifacts, has not been carried out. These represent realistic challenges in digital pathology, and systematic robustness testing across such conditions will be pursued in future work.

  • Stains, scanners, labs, and patient populations differ across sites. These changes can reduce model accuracy. Our cross-cancer data evaluation shows some improved robustness. However, this improvement is not enough in clinical settings. Real validation should use multi-site, same-organ data with fixed hyperparameters. Therefore, our future work will use site-holdout tests and domain adaptation to reduce shift effects.

  • Grad-CAM highlights regions, but only at a coarse level. Since it does not fully reveal decision logic, our future work will plan to complement saliency with attention rollout/perturbation analyses, quantitative fidelity measures (e.g., overlap with expert annotations or saliency IoU metrics) and probability calibration.

Conclusion

The work presented a lightweight and interpretable deep learning framework, MobileDANet, for multi-targeted histopathology image classification. By integrating MobileNetV2's efficiency with the contextual modelling of a Dynamic Attention (DA) block, the proposed framework effectively balances local feature extraction and global dependency capture. The work involves extensive experimentation on three public datasets, namely KMC (RCC), BreakHis (breast), and CRCH (colorectal). This demonstrated that MobileDANet consistently outperformed baseline CNN architectures while maintaining computational efficiency. The use of Grad-CAM further enhanced interpretability, supporting the clinical relevance of the results.

Limitations and future directions

Despite these robust outcomes, the limitations are as follows: first, although three distinct datasets were used, their size and source variability are limited, and cross-dataset validation was not carried out due to divergent organs of interest and label definitions. This restricts conclusions about the model's generalizability across domains. Second, while Grad-CAM provided intuitive visual explanations, it offers only coarse heatmaps and does not fully capture complex decision processes. Finally, the current evaluation is retrospective; the framework has not been validated in real-world clinical workflows with pathologists in the loop. Future work will therefore focus on multiple directions. From a methodological perspective, model compression techniques (pruning and quantization), advanced augmentations, advanced imbalance strategies, and automated hyperparameter optimization will be investigated to further optimize performance-efficiency trade-offs. From an interpretability perspective, advanced explainable AI methods (attention rollout, perturbation-based analyses, quantitative analysis) will be explored to further strengthen transparency. From a data perspective, validation of the model on larger, multi-institutional, and multi-modal cohorts is planned to evaluate robustness under domain shift and patient heterogeneity. Finally, future work will aim at embedding MobileDANet into a prototype clinical decision-support system by overcoming the aforementioned issues.