Abstract
Kidney failure represents a pressing global health concern, further exacerbated by the widespread shortage of nephrologists, thereby necessitating the development of Artificial Intelligence (AI)-driven systems for automated renal disease diagnosis. This study focuses on the diagnosis of three major renal conditions: kidney stones, tumors, and cysts. Recent advancements in Deep Learning (DL) have highlighted the potential of attention mechanisms in enhancing the performance of Convolutional Neural Networks (CNNs), particularly in medical image analysis. In this context, we propose a novel method termed Pyramid Channel and Spatial Attention (PCSA), which uses pyramidal multiscale convolution to reconstruct feature representations by extracting both spatial and channel attention weights. This dual-weight extraction facilitates the precise integration of multiscale contextual information, thereby improving the model's capability to localize and focus on complex regions within medical images. The PCSA module is designed as a plug-and-play component that can be seamlessly integrated into various CNN backbone architectures to enhance diagnostic accuracy. To validate its effectiveness, we incorporate the PCSA module into several backbone networks and evaluate its performance. Experimental results demonstrate that PCSA-enhanced networks outperform multiple state-of-the-art image classification methods, achieving superior accuracy in renal disease classification. Although the current study focuses on three specific renal conditions, the modular architecture of PCSA-Net allows for future adaptation to a broader spectrum of renal pathologies. These findings underscore the potential of the proposed PCSA module to support automated, accurate, and scalable kidney disease diagnosis in clinical settings. The modular design also enhances the model's suitability for real-world deployment, enabling integration into diverse diagnostic workflows.
Introduction
Kidney disease remains a major global public health challenge and continues to rise in prevalence despite significant advancements in preventive and therapeutic strategies1. Early diagnosis is essential to slowing the progression of Chronic Kidney Disease (CKD) and mitigating its potentially severe consequences. Current estimates indicate that more than 10% of the global population is affected by CKD, which is expected to become the sixth leading cause of death worldwide by 20402. Alarmingly, CKD currently affects approximately 850 million individuals, more than double the global prevalence of diabetes and twenty times that of cancer3. This growing burden is primarily attributed to the increasing incidence of common risk factors, such as diabetes mellitus and obesity, which contribute substantially to the development and progression of CKD.
Among the most prevalent and clinically significant kidney disorders are renal cell carcinoma, kidney cysts, and nephrolithiasis4. Kidney cysts are fluid-filled sacs enclosed by a thin membrane that typically form on the kidney surface. These cysts may vary in number and size, often exhibiting water-like density with Hounsfield Unit (HU) values ranging from 0 to 20. Nephrolithiasis, or kidney stone disease, is characterized by the formation of crystalline deposits within the renal system and it affects approximately 12% of the global population. Renal cell carcinoma, the most common form of kidney cancer, ranks among the top ten malignancies worldwide and poses a substantial health burden5.
Computed Tomography (CT), Magnetic Resonance Imaging (MRI), X-ray, and B-mode ultrasound (US) are commonly employed alongside pathological assessments for the diagnosis of renal disorders6. Among these modalities, CT imaging offers distinct advantages, utilizing X-ray beams to generate cross-sectional views that provide detailed three-dimensional anatomical information of the targeted regions. This high-resolution capability makes CT particularly effective for kidney evaluations, as it facilitates sequential imaging with precise localization and contrast enhancement of structural abnormalities7. Timely identification and management of renal pathologies, including stones, cysts, and tumors, are essential to preventing disease progression toward renal failure. Consequently, early detection plays a pivotal role in reducing the risk of kidney failure and improving patient prognosis8.
However, the rising global burden of kidney diseases, coupled with the critical shortage of radiologists and nephrologists, highlights the urgent need for AI-based diagnostic tools that can rapidly and accurately interpret renal imaging data. Such AI models have the potential to significantly support clinical decision-making and alleviate the diagnostic workload of healthcare professionals. Despite progress in this area, the volume of published research remains limited. Moreover, the scarcity of publicly-available annotated datasets for kidney disease diagnosis continues to hinder large-scale investigation and model development. Existing studies have predominantly employed traditional machine learning approaches, often focusing on the classification of isolated disease categories, such as kidney stones, cysts, and tumors, rather than offering a unified diagnostic framework capable of handling multiple renal pathologies.
In light of these challenges, our study emphasizes the importance of integrating multiscale processing with attention mechanisms to develop a more adaptable and discriminative diagnostic framework. This strategy has been shown to outperform conventional feature fusion methods in the context of medical image analysis by enhancing both spatial precision and contextual understanding.
Convolutional Neural Network (CNN)-based models have achieved significant success across a wide range of computer vision applications by leveraging prior knowledge from large-scale image datasets. Despite these advancements, CNNs continue to encounter several limitations, including inefficient fusion of interlayer features in deep architectures, high computational complexity due to large parameter counts, and a restricted capacity for learning inter-channel relationships. Attention mechanisms have emerged as effective solutions to these challenges by enabling models to selectively emphasize the most informative regions of an image, while suppressing irrelevant or redundant features. This selective focus enhances the model's ability to learn salient patterns, thereby improving performance in key tasks such as image segmentation, classification, and object detection9,10,11.
Recent studies have underscored the critical role of multiscale feature representation in improving the performance of medical image analysis systems. For instance, Agarwal et al.12 introduced a multi-scale dual-channel feature embedding decoder for biomedical image segmentation, effectively integrating contextual information across varying resolutions using a dual-stream architecture. Their approach demonstrated the importance of preserving both fine-grained details and global structural context. Similarly, Prakash et al.13 proposed a multiscale feature fusion framework that integrates ResNet50 and EfficientNet within a convolutional autoencoder to enhance tumor detection from CT images. While both studies illustrate the benefits of multiscale learning, they are predominantly focused on segmentation or autoencoding tasks.
In contrast, the present work introduces PCSA-Net, a novel architecture that integrates a pyramid-based multiscale convolutional structure with parallel channel and spatial attention mechanisms, specifically optimized for classification tasks. This attention-guided multiscale design enhances the network's ability to adapt across different backbone architectures, and it is particularly effective in capturing the complex morphological variations associated with renal pathologies in CT imaging.
In this study, we propose PCSA-Net, a novel DL framework that incorporates a PCSA module into conventional CNN backbone architectures. The PCSA module is designed to concurrently capture multiscale spatial context and adaptively reweight channel-wise features, thereby enabling the network to concentrate more effectively on clinically-relevant patterns within medical images. By embedding this dual-attention mechanism within a pyramid convolutional structure, the model gains the capacity to extract and integrate both fine-grained and coarse-grained features across multiple spatial resolutions.
To evaluate the performance of PCSA-Net, we conducted experiments on a large-scale multiclass CT kidney dataset comprising over 12,000 images categorized into four clinically significant classes: normal, cyst, stone, and tumor. The proposed model is integrated with multiple backbone networks and rigorously compared to several recent state-of-the-art classification methods. Furthermore, we performed extensive ablation studies to isolate the individual contributions of each module and assess the robustness of the overall framework using stratified cross-validation.
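The stratified cross-validation mentioned above can be illustrated with a minimal sketch. The fold count (`k=5`), the random seed, and the toy class sizes are assumptions made for illustration only; this section does not state the settings actually used. The function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy label list (hypothetical counts, not the real class sizes of the dataset).
labels = ["normal"] * 50 + ["cyst"] * 30 + ["stone"] * 10 + ["tumor"] * 20
folds = stratified_folds(labels, k=5)
```

Each fold serves once as the validation set while the remaining folds are used for training, so every class, including the smallest one, appears in every validation split.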
This work contributes to the field by addressing key limitations in existing attention-based architectures, particularly in terms of multiscale integration and dual-domain feature enhancement. The proposed PCSA-Net demonstrates effective generalization across diverse renal pathologies and establishes a clinically-relevant, scalable approach to multiclass kidney disease classification using CT imaging.
Research problem
Renal diseases, including tumors, cysts, and kidney stones, present substantial diagnostic and therapeutic challenges on a global scale. Timely and accurate identification of these conditions is essential for preventing progression to renal failure and for reducing associated morbidity and mortality. However, current diagnostic workflows are heavily dependent on radiological expertise, which remains critically limited, especially in low-resource and underserved healthcare settings.
Although conventional machine learning techniques have demonstrated utility in certain diagnostic applications, they frequently encounter significant limitations. These include poor scalability, inadequate support for multiscale feature extraction, and limited capacity to effectively integrate complex spatial and channel-wise information from medical images. Such constraints compromise their reliability and applicability for comprehensive renal disease diagnosis in real-world clinical environments.
Given these challenges, there is an urgent need for advanced computational frameworks capable of delivering high diagnostic precision, while maintaining robustness and adaptability across diverse imaging datasets. A successful solution must effectively leverage multiscale and attention-based mechanisms to overcome existing limitations in feature learning, thereby supporting automated, scalable, and clinically-viable kidney disease detection.
Research motivation
The rising global burden of renal diseases, coupled with a persistent shortage of trained radiologists and nephrologists, underscores the critical need for automated, intelligent diagnostic systems. Recent advancements in DL, particularly the development of attention mechanisms, present a compelling opportunity to address these challenges by enabling more effective feature extraction, pattern recognition, and clinical decision support.
While CNNs have shown considerable promise in medical image analysis, many existing approaches fall short in capturing multiscale contextual information or in dynamically emphasizing diagnostically relevant regions within complex imaging data. These limitations often lead to diminished classification accuracy and restricted generalization in real-world clinical scenarios.
This study is driven by the need to overcome such shortcomings through the design of a robust and adaptive attention mechanism that not only enhances diagnostic precision but also improves model interpretability and computational efficiency. By bridging the gap between clinical requirements and current technological capabilities, the proposed approach aims to support healthcare professionals in delivering timely, accurate, and scalable diagnoses of renal diseases.
Main contribution
This study presents PCSA, a novel attention mechanism specifically developed to enhance multiscale feature integration and adaptive learning for renal disease diagnosis. The PCSA block leverages pyramidal multiscale convolution to extract both spatial and channel attention weights, enabling precise feature reconstruction and targeted focus on diagnostically significant image regions. This design substantially improves the model's ability to capture and discriminate intricate anatomical patterns commonly found in renal pathologies.
The PCSA block is implemented as a modular, plug-and-play component, facilitating seamless integration into a variety of CNN backbone architectures. To demonstrate its effectiveness, the PCSA block was incorporated into the ResNet architecture, resulting in the development of the PCSANet, an enhanced model that achieves superior diagnostic accuracy and computational efficiency compared to existing state-of-the-art methods, while maintaining a reduced parameter footprint. Moreover, the flexibility of the PCSA module allows it to be readily adapted for broader medical imaging applications beyond renal disease classification.
The primary contributions of this paper are summarized as follows:
- We introduce a novel dual-attention mechanism, the PCSA, that concurrently captures and integrates multiscale spatial and channel information to enhance feature representation.
- The PCSA is designed as a lightweight, modular component that can be easily embedded into various backbone architectures to improve model performance.
- By integrating the PCSA into the ResNet architecture, we construct the PCSANet, a set of models that effectively learn complex multiscale features, while significantly reducing the number of parameters.
- Experimental results demonstrate that PCSANet models outperform several state-of-the-art methods in terms of diagnostic accuracy, adaptability, and efficiency through precise and adaptive channel-wise weighting.
This paper is organized as follows. Section “Related work” reviews the related work. Section “Proposed methodology” details the proposed methodology. Section “Results and discussions” presents the experimental setup and results. Section “Conclusion and future works” concludes the study and discusses the findings and their implications.
Related work
Deep learning (DL) and machine learning (ML) algorithms have shown considerable effectiveness in the prediction and diagnosis of various complex diseases. In recent years, the early detection of chronic conditions, particularly CKD, has garnered increased attention from both clinicians and researchers. Numerous emerging studies have explored the application of DL techniques for improving diagnostic accuracy of CKD and related renal disorders. This section provides a comprehensive review of recent contributions in this domain, highlighting both the advancements and the persisting limitations of existing approaches.
In 5, the authors utilized the publicly available CT kidney dataset, Normal-Cyst-Tumor-Stone, comprising 12,446 annotated CT images to evaluate the performance of six models. These included three advanced Vision Transformer (ViT) architectures (Swin Transformer, Compact Convolutional Transformer (CCT), and External Attention Network (EANet)), as well as three DL models, namely ResNet50, Inception V3, and VGG16. Among these, the Swin Transformer exhibited the highest classification performance, achieving a maximum accuracy of 99.30%, demonstrating its effectiveness in renal image analysis. Another approach14 presented a Deep Neural Network (DNN) framework for the early detection and prediction of CKD. This model uses Recursive Feature Elimination (RFE) to identify key clinical predictors, including packed cell volume, specific gravity, hemoglobin levels, red blood cell count, serum creatinine, albumin, and hypertension. Using the UCI kidney dataset, the DNN model's performance was compared with that of traditional classifiers such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression, Naive Bayes (NB), and Random Forest (RF). Remarkably, the DNN model achieved a perfect classification accuracy of 100%, highlighting its diagnostic potential in structured clinical datasets.
In15, a pyramidal DL pipeline was proposed for the classification of kidney whole-slide histology images. The architecture incorporated three CNNs, each processing image patches of varying sizes to enhance diagnostic precision through multiscale analysis. The preprocessing stage involves adaptive histogram equalization and edge enhancement to emphasize key structural features within the patches. Additionally, a Generalized Gauss–Markov Random Field (GGMRF) smoothing technique was applied to address inconsistencies in pixel-level predictions. The dataset consists of de-identified Whole-Slide Images (WSIs) obtained from archived pathology specimens at Indiana University, and segmented into overlapping patches. The final classification results were obtained through majority voting, and the pixelwise segmentation maps were refined using GGMRF smoothing. This pipeline achieved superior accuracy, sensitivity, and specificity in distinguishing four tissue types, outperforming baseline models including the pretrained ResNet18 and ResNet34.
In a separate study, Sudharson et al.16 proposed a collection of DNNs for the automated classification of kidney ultrasound images. Their approach relies on transfer learning, leveraging pretrained CNN architectures such as MobileNet-V2, ShuffleNet, and ResNet-101 for feature extraction. The final classification was performed using an SVM classifier. The authors addressed a multiclass classification problem by categorizing ultrasound images into four clinically relevant classes: normal, stone, tumor, and cyst. Experimental results demonstrated robust performance, with a maximum classification accuracy of 96.54% for high-quality images and 95.58% for images with noise, indicating the model's resilience to imaging artifacts.
In17, an ML-based framework was introduced for classifying CKD into three progressive stages: normal, mild/moderate, and severe, using ultrasound imaging. The methodology depends on two core components: feature extraction and classification. For feature extraction, the authors utilized the Gray-Level Co-occurrence Matrix (GLCM) technique to derive 19 texture-based features from three anatomically significant Regions of Interest (ROIs), the medulla, the cortex, and the cortical-medullary boundary. Image preprocessing steps, including histogram equalization and range filtering, were applied to enhance visual quality and suppress noise artifacts. In addition to texture features, kidney size was included as an important morphological indicator to improve classification accuracy. The extracted features were then used to train an Artificial Neural Network (ANN) classifier, which achieved an overall classification accuracy of 95.4%, demonstrating its reliability in CKD stage differentiation based on ultrasound data.
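To make the GLCM step concrete, the following minimal sketch builds a normalized co-occurrence matrix for a single pixel offset and derives three common texture descriptors from it. It is an illustration only: the 19 features used in the cited study are not listed here, and the offset, the number of gray levels, the toy image, and the function name `glcm_features` are assumptions.

```python
import numpy as np


def glcm_features(image, levels, dx=1, dy=0):
    """Normalized GLCM for one non-negative offset (dx, dy), plus three texture features."""
    glcm = np.zeros((levels, levels))
    h, w = image.shape
    for r in range(h - dy):
        for c in range(w - dx):
            # Count how often gray level a is followed by gray level b at the offset.
            glcm[image[r, c], image[r + dy, c + dx]] += 1
    glcm /= glcm.sum()

    i, j = np.indices(glcm.shape)
    contrast = np.sum(glcm * (i - j) ** 2)             # large for strong local variation
    asm = np.sum(glcm ** 2)                            # angular second moment (uniformity)
    homogeneity = np.sum(glcm / (1.0 + (i - j) ** 2))  # large when mass lies near the diagonal
    return glcm, contrast, asm, homogeneity


# Toy 4-level image standing in for a quantized region of interest.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
glcm, contrast, asm, homogeneity = glcm_features(image, levels=4)
```

In a pipeline of the kind described, such descriptors would be computed per region of interest and concatenated into the feature vector passed to the classifier.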
In18, the authors proposed an alternative approach for CKD classification by integrating a Multikernel Support Vector Machines (MKSVM) classifier with a Fruit Fly Optimization Algorithm (FFOA). In this hybrid model, the FFOA was used to identify and select the most informative features from the dataset, thereby enhancing classification efficiency. The MKSVM classifier then categorizes the samples into two groups: normal and abnormal CKD. This approach was validated using multiple benchmark CKD datasets, and experimental results confirmed its high classification performance, revealing strong potential for real-world diagnostic deployment.
In19, ML techniques were employed to classify CKD using a dataset comprising 400 patient records, which includes 250 cases of CKD and 150 healthy controls. Feature selection was performed using both the chi-square test and Recursive Feature Elimination (RFE) to identify the most significant predictors for disease classification. Multiple ML models were developed and evaluated, including KNN, logistic regression, ANN, NB, and SVM classifiers. Among these, logistic regression, when combined with optimal feature selection, achieved the highest accuracy of 98.75%, outperforming other models such as SVM and ANN, which also demonstrated strong performance. Key predictive features identified in this study included blood glucose level, white blood cell count, blood urea, and serum creatinine, highlighting their diagnostic relevance in CKD detection.
In20, the authors introduced a hybrid framework for the identification and segmentation of kidney disorders in ultrasound images, combining classification with image segmentation. This framework comprises four key modules: (1) preprocessing, which depends on a median filter to remove noise; (2) feature extraction, in which 22 GLCM features are computed from the denoised images; (3) feature selection, using the Crow Search optimization Algorithm (CSA) to identify the most relevant features; and (4) classification, in which an ANN classifier is trained on the optimized feature set to distinguish among normal, stone, and tumor categories. For the segmentation task, a multikernel k-means clustering algorithm was employed, incorporating both linear and quadratic kernels to delineate stone and tumor regions within abnormal cases. This framework achieved high performance across evaluation metrics, including sensitivity and specificity, with a segmentation accuracy of 99.61%, demonstrating its robustness in ultrasound-based kidney disorder analysis.
In21, the authors utilized a dataset comprising 8,400 CT scan images from 120 adult patients with suspected kidney masses, collected at King Abdullah University Hospital (KAUH) in Jordan, to classify CKD. Several DL models were developed for kidney cancer detection and classification using this dataset, including a six-layer 2D Convolutional Neural Network (CNN-6), ResNet50, VGG16, and a custom four-layer 2D CNN (CNN-4) tailored for classification tasks. Experimental results demonstrated that CNN-6 and ResNet50 achieved high detection accuracies of 97% and 96%, respectively, while CNN-4 yielded a classification accuracy of 92%, highlighting the effectiveness of deep CNN architectures in analyzing renal CT scans.
In22, the authors proposed a DL-based system for categorizing kidney disorders using CT images. This system depends on the YOLOv8 object detection architecture to classify kidney images into four categories: tumor, cyst, stone, and normal. A curated dataset of 12,446 CT urogram and abdominal images, specifically representing various kidney abnormalities, was used. The dataset was divided into training and validation subsets, and data augmentation techniques were applied to enhance model generalizability. The YOLOv8 model achieved an accuracy of 82.52%, with a precision of 85.76%, recall of 75.28%, F1-score of 75.72%, and specificity of 93.12%, reflecting its potential for multi-class renal disorder detection from CT images.
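The metrics quoted above can all be read off a confusion matrix. The sketch below shows the per-class definitions on a made-up four-class matrix; the numbers are invented for illustration, and how the cited study averaged the per-class values (macro or weighted) is not stated here. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(confusion, cls):
    """Precision, recall, F1 and specificity for one class.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[r][cls] for r in range(n)) - tp
    tn = sum(sum(row) for row in confusion) - tp - fn - fp

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, f1, specificity


# Made-up confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [8, 1, 1, 0],
    [2, 6, 0, 2],
    [0, 1, 9, 0],
    [1, 0, 0, 9],
]
precision, recall, f1, specificity = per_class_metrics(confusion, cls=0)
```

Because the true negatives of one class include every correctly handled sample of the other three classes, specificity in a multiclass setting is often noticeably higher than recall, which matches the pattern in the figures reported above.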
In23, the authors developed a three-dimensional Convolutional Neural Network (3D CNN) model to classify CKD using volumetric image data. This model was built with the Faimed3D library based on the ResNet3D-18 architecture. The dataset used includes 321 patients, categorized into three groups according to their estimated Glomerular Filtration Rate (eGFR). Experimental findings showed that the 3D CNN model achieves optimal performance when utilizing bilateral kidney imaging data combined with Intensity Projection (IP) images, achieving a classification accuracy of 0.862 ± 0.036, indicating its suitability for volumetric medical imaging in renal disease classification.
While the studies reviewed above illustrate significant progress in the application of DL and ML techniques for kidney disease diagnosis and classification, several key limitations remain unaddressed. Many existing approaches continue to rely on handcrafted features or traditional ML models, which often lack the capacity to effectively integrate multiscale contextual information or dynamically prioritize diagnostically salient image regions. Although advanced architectures such as CNNs and Vision Transformers have demonstrated notable success, they frequently suffer from high computational overhead and limited modularity, restricting their adaptability across diverse clinical frameworks. Additionally, the datasets employed in prior research are often constrained in size, class diversity, or clinical heterogeneity, thereby impeding the generalizability and robustness of the presented models.
To overcome these limitations, this study introduces the PCSA, a novel dual-attention module designed to facilitate enhanced integration of spatial and channel-wise features. Unlike previous methods, the PCSA is implemented as a lightweight, modular plug-and-play component that can be easily embedded into various CNN backbones to improve diagnostic performance without incurring substantial computational cost. When incorporated into ResNet architectures, the resulting PCSANet models provide more robust and computationally-efficient solutions for kidney disease classification, achieving superior results across multiple datasets. This work directly addresses the critical gaps identified in earlier research by delivering a scalable, adaptable, and clinically-relevant DL framework.
Proposed methodology
This section outlines the architectural framework and theoretical foundation of the proposed PCSA-Net, a DL model designed for efficient and accurate renal disease classification using CT images. The development of the PCSA module is motivated by the limitations observed in existing attention mechanisms such as Squeeze-and-Excitation Networks (SE-Nets), Convolutional Block Attention Module (CBAM), and EPSA-Net. These previous methods either focus exclusively on channel or spatial attention or inadequately capture multiscale contextual information, which is critical for identifying subtle and heterogeneous patterns in medical images.
To address these shortcomings, the PCSA module is designed with a dual-branch architecture that integrates both spatial and channel attention within a pyramid convolutional framework. This configuration enables adaptive emphasis on both high-level structural patterns and localized diagnostic cues across a range of receptive fields. Theoretically, this design leverages the complementary strengths of parallel attention pathways and multiscale feature fusion, enhancing the model's ability to learn discriminative representations, while minimizing feature redundancy.
The proposed PCSA module is seamlessly integrated into a ResNet backbone, forming the core of PCSA-Net. This integration enhances the network's capacity to selectively focus on salient features across spatial and channel dimensions, thus improving the robustness and interpretability of deep feature learning in complex medical images.
We begin by presenting the theoretical rationale for embedding attention mechanisms into CNN backbones. These mechanisms are essential for guiding the model's focus toward clinically relevant features, ultimately leading to improvements in both computational efficiency and classification accuracy. Specifically, we analyze the complementary roles of Channel Attention (CA) and Spatial Attention (SA).
- Channel Attention (CA) prioritizes feature channels that contain highly informative content, enhancing the discriminative power of the model.
- Spatial Attention (SA) highlights spatial regions within the feature maps that are most indicative of the target pathology.
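The two roles listed above can be sketched in a few lines of NumPy. This is a generic illustration, not the PCSA implementation: the channel branch follows the squeeze-and-excitation pattern and the spatial branch follows the CBAM pattern, and the weight matrices `w1`, `w2`, the `kernel`, and the tensor sizes are random placeholders assumed for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """SE-style channel weighting for a feature map x of shape (C, H, W)."""
    squeezed = x.mean(axis=(1, 2))            # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # bottleneck + ReLU -> (C // r,)
    weights = sigmoid(w2 @ hidden)            # one weight per channel -> (C,)
    return x * weights[:, None, None]


def spatial_attention(x, kernel):
    """CBAM-style spatial weighting: pool over channels, then a small convolution."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    logits = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(logits)[None, :, :]     # one weight per spatial position


# Placeholder feature map and randomly initialized (untrained) parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
w1 = rng.standard_normal((2, 8)) * 0.1        # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((8, 2)) * 0.1
kernel = rng.standard_normal((2, 3, 3)) * 0.1
ca_out = channel_attention(x, w1, w2)
sa_out = spatial_attention(x, kernel)
```

Channel attention produces one scalar per channel and answers "which feature maps matter", whereas spatial attention produces one scalar per pixel location and answers "where in the image to look".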
The following subsections detail the internal structure of the PCSA module and its implementation across different convolutional backbones, demonstrating its modularity and effectiveness in medical image analysis.
Building upon the foundational principles of multiscale learning and attention mechanisms, we introduce the PCSA module, which synergistically combines CA and SA within a pyramid-based convolutional structure. In contrast to existing attention modules that often apply spatial or channel weighting in isolation or at a single resolution, the PCSA module operates across multiple scales, extracting hierarchical features before attention weighting is applied. This multiscale architecture allows the model to effectively capture both global contextual dependencies and localized discriminative patterns, which are essential for identifying the heterogeneous manifestations of renal pathologies in CT imaging.
We present PCSANet, a modified version of the ResNet architecture wherein PCSA modules are strategically embedded at key layers. This integration not only enhances the network's ability to discriminate among fine-grained medical features but also maintains a lightweight and computationally efficient structure. As a result, PCSANet achieves a superior balance between diagnostic performance and model complexity, offering a clear advantage over existing DL models that often require significantly more parameters and computational resources to attain comparable accuracy.
Figure 1 illustrates the overall framework of the proposed PCSANet methodology. The pipeline begins with input CT images, which are first processed through a pyramid multiscale convolution block to extract hierarchical features across multiple receptive fields. These initial feature maps are then passed through parallel attention branches, comprising channel weight extraction and spatial weight extraction modules. These modules assign adaptive weights to emphasize semantically-meaningful and spatially-relevant features, respectively. The resulting weighted features are fused and routed through the PCSA integration layer, which reconstructs an enhanced feature map that incorporates both multiscale context and dual-attention refinement. This enriched representation is subsequently processed by a classification layer, which categorizes the input image into one of four clinically-significant classes: normal, stone, cyst, or tumor. This architecture showcases the core contribution of our study: a modular, attention-based enhancement mechanism that combines pyramid convolutional processing with channel and spatial attention, leading to improved diagnostic precision in kidney disease classification.
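The data flow described for Figure 1 can be traced with a small NumPy sketch. It is a simplified stand-in under stated assumptions, not the authors' implementation: the kernel sizes (3, 5, 7), the branch widths, the omission of learned layers inside the two weight-extraction branches, and the fusion by element-wise multiplication are all choices made for illustration. The names `conv2d_same` and `pcsa_sketch` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(x, kernels):
    """Naive 'same' convolution. x: (C_in, H, W), kernels: (C_out, C_in, k, k)."""
    c_out, _, k, _ = kernels.shape
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out


def pcsa_sketch(x, branch_kernels):
    # 1) Pyramid multiscale convolution: one branch per kernel size.
    branches = [conv2d_same(x, kernels) for kernels in branch_kernels]
    feats = np.concatenate(branches, axis=0)          # (C_total, H, W)

    # 2a) Channel weight extraction (learned layers omitted in this sketch).
    channel_w = sigmoid(feats.mean(axis=(1, 2)))      # (C_total,)
    # 2b) Spatial weight extraction (learned convolution omitted in this sketch).
    spatial_w = sigmoid(feats.mean(axis=0))           # (H, W)

    # 3) Fusion: both weightings applied to the multiscale features (assumed form).
    return feats * channel_w[:, None, None] * spatial_w[None, :, :]


# Placeholder input feature map and random, untrained kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
branch_kernels = [rng.standard_normal((2, 4, k, k)) * 0.1 for k in (3, 5, 7)]
out = pcsa_sketch(x, branch_kernels)
```

The output keeps the spatial size of the input and stacks the channels of all pyramid branches, so a classification head can be attached to it in the same way as to an ordinary convolutional feature map.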
Attention mechanisms in backbone networks
Deep Learning (DL) methods, particularly CNNs, have demonstrated remarkable effectiveness in capturing spatial hierarchies and contextual dependencies, making them widely adopted in a variety of image recognition tasks24. However, the performance improvements of CNNs are often achieved by increasing the network depth and width, which substantially increases computational cost and training complexity.
To mitigate these limitations, the integration of attention mechanisms into CNN architectures has emerged as a compelling strategy. Attention modules allow the network to adaptively prioritize informative regions and suppress less relevant features, thus improving feature representation without the need for excessively large models. This selective focus mechanism enhances the network's discriminative capacity, while maintaining efficiency in both computation and memory usage.
Attention mechanisms have been effectively embedded into various CNN backbone networks, significantly boosting their performance across a wide range of computer vision tasks, including image classification, object detection, and semantic segmentation25. Among the most commonly-used backbones are VGGNet and ResNet, both of which have served as foundational architectures for integrating attention modules due to their modular structure and widespread adoption in the DL community.
In this study, we emphasize the integration of attention mechanisms within the ResNet architecture. One of the foundational advancements in this area is the Squeeze-and-Excitation Network (SENet), which incorporates a Squeeze-and-Excitation (SE) module into the residual blocks of ResNet to adaptively recalibrate channel-wise feature responses. Building on this concept, several extensions, such as FcaNet and ECANet, have been developed to further enhance the representational power of ResNet through more efficient channel attention strategies.
Moving beyond conventional attention mechanisms, the Efficient Pyramid Squeeze Attention (EPSA) module introduces multiscale convolutional processing to capture richer contextual dependencies. In EPSANet, the traditional convolutional layers of ResNet are replaced by Pyramid Squeeze Attention (PSA) modules, which aggregate information across multiple scales using channel attention weights. This approach significantly improves the network's capacity to learn discriminative features in complex visual tasks.
Building on these advancements, we propose PCSANet, a novel extension of EPSANet, in which the newly designed PCSA module is integrated into ResNet34 and ResNet50 architectures. The PCSA module not only enhances multiscale feature extraction, but also introduces a dual attention pathway, channel and spatial, leading to improved classification performance and more robust feature learning in medical image analysis.
Although PCSANet draws conceptual inspiration from the multiscale architecture of EPSANet, it introduces several key architectural innovations that substantially enhance feature learning and attention modeling capabilities. Unlike EPSANet, which relies on channel attention across pyramid-scale branches using global average pooling, the proposed PCSA module uses a parallel dual-attention mechanism that integrates both channel attention and spatial attention. This design enables the network to simultaneously capture inter-channel dependencies and spatial saliency, thereby improving its capacity to model complex contextual relationships within medical images.
Furthermore, PCSA replaces the standard convolutional blocks used in EPSANet with grouped dilated convolutions, which expand the receptive field while controlling the number of parameters, making the architecture more efficient. The resulting multiscale feature maps are subsequently reweighted through a unified attention mechanism, allowing the network to adaptively emphasize diagnostically relevant anatomical structures within CT images.
This dual-attention, multiscale fusion design not only enhances the localization sensitivity of the model, but also allows the PCSA module to function as a general-purpose, plug-and-play enhancement compatible with various CNN backbones. As a result, PCSANet provides a scalable and efficient solution for improving classification performance in medical imaging tasks.
Channel attention
The Channel Attention (CA) mechanism is integrated into the proposed PCSA module to enable the model to effectively capture and exploit inter-channel feature dependencies. By adaptively recalibrating channel-wise feature responses, the CA mechanism enhances the network's sensitivity to informative and semantically meaningful features across different channels. This selective weighting process allows the network to prioritize channels that are more relevant for identifying pathological patterns in CT images.
In the PCSA module, the CA is computed using a combination of max pooling and average pooling operations applied independently across each feature channel. These pooled descriptors are then passed through a shared Multi-Layer Perceptron (MLP) comprising two fully-connected layers with a non-linear activation function in between. The outputs are fused and passed through a sigmoid activation function to generate the final channel-wise attention weights, which are used to rescale the original feature map via element-wise multiplication26.
Figure 2 illustrates the architecture of the CA mechanism. The dual-pathway design captures complementary global descriptors, maximum and average pooled representations, which are processed through the shared MLP to yield refined channel weights. This mechanism allows the model to focus more effectively on diagnostically-critical features, while suppressing irrelevant or redundant information.
The CA mechanism generates two distinct spatial contextual descriptors for each channel by applying global average pooling and global maximum pooling to the input feature map F. These pooled descriptors represent the channel-wise mean and extreme activations, capturing complementary statistical properties. To model channel interdependencies, each descriptor is passed through a shared MLP with a bottleneck structure. The resulting vectors are then combined via element-wise addition, followed by a sigmoid activation to produce the final channel attention vector, which is used to recalibrate the input feature map. This process is formally defined as:
\[\text{CA}(F)=\sigma \left(\text{MLP}\left(\text{AvgPool}(F)\right)+\text{MLP}\left(\text{MaxPool}(F)\right)\right),\quad \text{MLP}(X)={W}_{1}\,\delta \left({W}_{0}X\right)\]
where,
- \({W}_{0}\in {\mathbb{R}}^{c\times \frac{c}{r}}\) and \({W}_{1}\in {\mathbb{R}}^{\frac{c}{r}\times c}\) are the weight matrices of the two 1 × 1 convolutional layers (replacing conventional fully-connected layers),
- δ(⋅) denotes the ReLU activation function,
- σ(⋅) is the sigmoid function, and
- X is the pooled input, derived from either MaxPool(F) or AvgPool(F).
Unlike traditional MLPs built from fully-connected layers, this formulation uses 1 × 1 convolutions to improve computational efficiency, while preserving the channel-wise interactions. This structure enables effective and lightweight learning of cross-channel dependencies, making it suitable for integration into resource-constrained medical imaging models.
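As an illustration of the channel attention computation described above, the following minimal NumPy sketch applies both pooling operations, the shared bottleneck, and the sigmoid. It is not the authors' implementation: the function name, the toy tensor sizes, the reduction ratio r = 4, and the (output, input) weight layout are assumptions, and each 1 × 1 convolution is written as a matrix product because it acts on a 1 × 1 spatial descriptor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w0, w1):
    """feat: (C, H, W); w0: (C // r, C); w1: (C, C // r).

    Returns channel weights of shape (C, 1, 1) with values in (0, 1).
    """
    avg_desc = feat.mean(axis=(1, 2))  # global average pooling, shape (C,)
    max_desc = feat.max(axis=(1, 2))   # global max pooling, shape (C,)

    def shared_mlp(x):
        hidden = np.maximum(w0 @ x, 0.0)  # bottleneck followed by ReLU
        return w1 @ hidden                # back to C channels

    # The same weights process both descriptors; the results are summed.
    weights = sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))
    return weights[:, None, None]

# Toy example with assumed sizes (not taken from the paper).
rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4
feat = rng.standard_normal((C, H, W))
w0 = rng.standard_normal((C // r, C)) * 0.1
w1 = rng.standard_normal((C, C // r)) * 0.1

ca = channel_attention(feat, w0, w1)
rescaled = feat * ca  # broadcasting rescales every channel by its weight
```

Because the weights are broadcast over the spatial axes, every pixel of a given channel is scaled by the same factor.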
Spatial attention
The Spatial Attention (SA) mechanism is designed to refine spatial feature representations by emphasizing salient regions in the spatial domain, thereby enhancing the model's focus on anatomically critical locations in medical images27. Unlike CA, which focuses on inter-channel relationships, SA identifies contextually important pixel locations, making it a complementary mechanism for improving feature expressiveness28. In the proposed PCSA module, the SA mechanism is employed to compute pixel-level spatial attention weights, guiding the network to prioritize regions with high diagnostic relevance29. The process begins by applying global average pooling and global max pooling operations independently across all channels of the input feature map. These two 2D spatial descriptors capture both the mean intensity distribution and the strongest activations, respectively.
The resulting two single-channel feature maps are concatenated along the channel axis to form a unified representation that preserves both average and maximal spatial cues. This concatenated map is then passed through a convolutional layer with a 3 × 3 kernel, which aggregates local context, followed by a sigmoid activation function to produce the final SA map. The attention map is subsequently used to reweight the original feature map via element-wise multiplication, amplifying salient spatial features while suppressing irrelevant ones.
Figure 3 illustrates the structure of the SA mechanism. It highlights the parallel use of max and average pooling, followed by a shared convolutional operation and sigmoid activation to generate the attention map. This design enables the network to dynamically attend to the most informative spatial regions, enhancing localization and detection accuracy in medical image classification tasks.
The SA mechanism can be mathematically formulated as:
\[\text{SA}({F}_{i})=\sigma \left({\text{Conv}}_{3\times 3}\left(\left[\text{AvgPool}({F}_{i});\,\text{MaxPool}({F}_{i})\right]\right)\right)\]
where:
- \({\text{Conv}}_{3\times 3}\) denotes a convolutional layer with a 3 × 3 kernel,
- σ(⋅) represents the sigmoid activation function,
- AvgPool(Fi) and MaxPool(Fi) are the global average pooling and global max pooling operations applied across the channel dimension of the input feature map Fi, and
- [⋅ ; ⋅] denotes channel-wise concatenation of the pooled maps.
Focusing on the spatial distribution of salient features enables the network to dynamically emphasize diagnostically relevant regions, while suppressing background or irrelevant noise. The SA mechanism thus complements the CA by enhancing the localization precision of the model, contributing to a more accurate and robust feature representation for downstream classification tasks.
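The spatial attention steps can be sketched in the same way. The NumPy example below pools along the channel axis, stacks the two maps, applies a single 3 × 3 convolution with zero padding, and squashes the result with a sigmoid. The function name, the explicit loop-based convolution, the zero padding, and the toy sizes are assumptions made for readability rather than details given in the text.

```python
import numpy as np

def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, 3, 3), one 3 x 3 filter per pooled map.

    Returns a spatial attention map of shape (1, H, W) with values in (0, 1).
    """
    avg_map = feat.mean(axis=0)            # average over channels, (H, W)
    max_map = feat.max(axis=0)             # maximum over channels, (H, W)
    stacked = np.stack([avg_map, max_map])  # channel-wise concatenation, (2, H, W)

    # Zero padding of one pixel keeps the output at H x W.
    padded = np.pad(stacked, ((0, 0), (1, 1), (1, 1)))
    _, height, width = stacked.shape
    response = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            response[i, j] = np.sum(padded[:, i:i + 3, j:j + 3] * kernel)

    return (1.0 / (1.0 + np.exp(-response)))[None]

# Toy example with assumed sizes (not taken from the paper).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 6))
kernel = rng.standard_normal((2, 3, 3)) * 0.1

sa = spatial_attention(feat, kernel)
refined = feat * sa  # the same spatial map rescales every channel
```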
Pyramid channel and spatial attention (PCSA)
The proposed PCSA mechanism builds upon and extends the capabilities of the PSA architecture by introducing a more accurate and computationally efficient framework for attention modeling. The PCSA module is designed to jointly exploit multiscale spatial and channel-level feature interactions, thereby improving the network's ability to capture both fine-grained and global contextual information critical for medical image classification.
The PCSA mechanism operates through the following four key stages:
1. Pyramidal Multiscale Convolution Module: An enhanced pyramid convolutional structure is employed to generate grouped multiscale feature maps. This design captures image representations at multiple receptive fields, enabling the network to learn hierarchical features spanning both local and global contexts.
2. Simultaneous Channel and Spatial Attention Extraction: For each scale-specific feature map, CA and SA attributes are extracted in parallel, allowing the network to simultaneously model inter-channel relationships and spatial saliency across all receptive fields.
3. Attention Weight Normalization and Feature Reconstruction: The extracted attention maps are normalized using the Softmax function to produce probabilistic channel and spatial weight matrices. These weights are then used to recalibrate the multiscale feature maps, selectively enhancing informative regions and suppressing irrelevant noise.
4. Feature Fusion via 1 × 1 Convolution: The recalibrated feature maps from each scale are linearly fused through a 1 × 1 convolutional layer, generating the final output feature representation. This step not only aggregates attention-enhanced features but also reduces dimensionality, ensuring a computationally efficient representation suitable for downstream classification tasks.
The PCSA module is architected to be modular and scalable, making it suitable for integration into various CNN backbones. Its ability to jointly model multiscale channel and spatial dependencies makes it particularly well-suited for complex medical imaging applications, such as renal disease diagnosis from CT images.
Pyramid convolution module
In medical image analysis, it is essential to capture semantic information across multiple receptive fields, as anatomical structures and pathological features often vary in size, shape, and spatial distribution. To address this issue, the proposed PCSA module incorporates a pyramid convolution strategy that depends on parallel dilated convolutions with varying dilation rates. This design enables the model to extract a rich hierarchy of features, ranging from fine-grained textures to coarse contextual cues, which is particularly advantageous when distinguishing between visually-similar renal abnormalities, such as cysts and stones.
Our approach draws inspiration from pyramid-based models such as PCANet30, which introduced pyramidal convolutional attention for semantic segmentation. However, unlike PCANet, which applies attention independently within each scale, our design enhances computational efficiency and feature discriminability by aggregating multiscale representations through grouped dilated convolutions, followed by a unified dual-branch attention mechanism (channel and spatial). This holistic fusion improves compactness, and it is more suitable for classification tasks, particularly in medical contexts.
Compared to traditional convolution operations, the pyramid convolution module reduces computational cost by integrating pointwise group convolution with pyramid-shaped multiscale dilated group convolution. Specifically, the input feature map is divided into four groups, each processed with a unique receptive field, [3, 5, 7, 9], and group size, [1, 4, 8, 16]. Given an input feature map \(X\in {\mathbb{R}}^{C\times W\times H}\), each group Xi contains \(\frac{C}{4}\) channels, and it is passed through a scale-specific convolution branch. The outputs from all branches are concatenated along the channel axis to form the final multiscale representation.
where,
- \({G}_{i}\in \left\{1, 4, 8, 16\right\}\) denotes the group size,
- \({k}_{i}\in \left\{3, 3, 7, 5\right\}\) specifies the kernel size,
- \({d}_{i}\in \left\{0, 1, 0, 1\right\}\) represents the dilation rate, and
- Xi is the \(\frac{C}{4}\)-channel input for the ith group.
The final multiscale feature map, denoted as \(F\in {\mathbb{R}}^{C\times W\times H}\), is obtained by concatenating DFi across all branches. This multiscale representation serves as the input to the subsequent attention modules, allowing the network to perform refined semantic reasoning across spatial and channel dimensions.
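The relation between the listed kernel sizes, dilation rates, and receptive fields can be checked with a few lines of arithmetic. The sketch below assumes that the dilation value counts the zeros inserted between neighbouring kernel taps (so 0 denotes an ordinary convolution) and that the kernel, dilation, and group lists are paired by position; under these assumptions the four branches cover extents of 3, 5, 7, and 9 pixels. The weight-count helper and the total channel count C = 256 are illustrative choices, not values from the paper.

```python
def effective_kernel(k, d):
    """Extent covered by a k x k kernel with d zeros inserted between taps."""
    return k + (k - 1) * d

def grouped_conv_params(c_in, c_out, k, groups):
    """Number of weights in a grouped k x k convolution (bias ignored)."""
    return (c_in // groups) * c_out * k * k

kernels = [3, 3, 7, 5]
dilations = [0, 1, 0, 1]
groups = [1, 4, 8, 16]

fields = [effective_kernel(k, d) for k, d in zip(kernels, dilations)]

# Hypothetical channel count: each of the four branches receives C / 4 channels.
C = 256
branch_channels = C // 4
params = [
    grouped_conv_params(branch_channels, branch_channels, k, g)
    for k, g in zip(kernels, groups)
]
dense_3x3 = grouped_conv_params(C, C, 3, 1)  # ungrouped 3 x 3 layer for comparison

print(fields)
print(params, sum(params), dense_3x3)
```

Larger kernels are paired with larger group sizes, which keeps the per-branch weight count from growing with the square of the kernel size.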
Pointwise convolution (PWC)
To compensate for potential information loss caused by the grouped convolution in the pyramid module, a PWC operation is incorporated. This mechanism helps retain global contextual information, while maintaining computational efficiency.
In this design, the input feature map X is first divided into four channel-wise partitions, each denoted as Xi, where i = 1,2,3,4. A 1 × 1 convolution is then applied to each partition, producing corresponding pointwise feature maps PFi. This operation can be mathematically expressed as:
\[{PF}_{i}={\text{Conv}}_{1\times 1}\left({X}_{i}\right)\]
These pointwise outputs PFi are subsequently added elementwise to the corresponding outputs from the dilated group convolution DFi (as defined in Eq. 4). This fusion step effectively restores information lost during grouped processing and ensures better feature continuity across channels:
\[{F}_{i}={DF}_{i}+{PF}_{i}\]
This combination of multiscale grouped convolution and pointwise convolution enables the network to maintain both fine-grained local detail and broad contextual awareness, which are crucial for accurate classification of renal abnormalities in CT images.
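A compact sketch of the pointwise branch and its elementwise fusion is given below. A 1 × 1 convolution mixes channels independently at every pixel, so it is written here as a matrix applied along the channel axis. The dilated grouped convolution is replaced by an identity placeholder, and the channel count, weights, and helper name are assumptions chosen only to keep the example short.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1 x 1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 6, 6))   # assumed toy input with C = 16
partitions = np.split(X, 4, axis=0)   # four groups of C / 4 channels

fused = []
for x_i in partitions:
    w_i = rng.standard_normal((4, 4)) * 0.1
    pf_i = pointwise_conv(x_i, w_i)   # pointwise feature map PF_i
    df_i = x_i                        # placeholder standing in for the dilated grouped output DF_i
    fused.append(df_i + pf_i)         # elementwise addition of the two branches

F = np.concatenate(fused, axis=0)     # multiscale map with the original channel count
```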
Channel attention and spatial attention
After generating the multiscale features Fi through the fusion of grouped dilated and pointwise convolutions, the next step involves computing CA and SA for each scale to enhance the discriminative capacity of the network.
For each feature map Fi, where i = 0,1,2,3, the Channel Weight (CW) and Spatial Weight (SW) are computed using dedicated attention extraction functions:
where,
- \({CA}_{i}\in {\mathbb{R}}^{{C}_{i}\times 1\times 1}\) denotes the channel attention map, emphasizing inter-channel dependencies, and
- \({SA}_{i}\in {\mathbb{R}}^{1\times W\times H}\) represents the spatial attention map, focusing on spatially salient regions within the feature map.
To ensure adaptive and normalized weighting, both attention maps are passed through a Softmax function across their respective dimensions:
This normalization ensures that the attention weights are distributed proportionally across all branches, enabling the network to adaptively prioritize the most informative channels and spatial locations for each scale.
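One way to read this normalization is as a Softmax taken over the branch axis, so that the four scales compete for each channel and for each pixel. The text does not state the axis explicitly, so the choice below is an assumption, as are the array shapes and variable names.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable Softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
branches, channels, height, width = 4, 16, 6, 6

ca_scores = rng.standard_normal((branches, channels, 1, 1))    # one CA map per branch
sa_scores = rng.standard_normal((branches, 1, height, width))  # one SA map per branch

cw = softmax(ca_scores, axis=0)  # channel weights, summing to 1 across branches
sw = softmax(sa_scores, axis=0)  # spatial weights, summing to 1 across branches
```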
The resulting attention-enhanced features are then ready for feature reconstruction and integration, contributing to a refined and context-aware representation that improves classification accuracy, particularly for complex and heterogeneous medical imaging scenarios such as renal disease identification.
Feature fusion
Once the channel and spatial attention weights are computed and normalized, they are applied to the corresponding multiscale feature maps to emphasize diagnostically-significant patterns. The refined features are then combined through an elementwise operation that merges both attention pathways:
where ⊙ denotes elementwise multiplication, \(\:{CW}_{i}\:\) and \(\:{SW}_{i}\) represent the normalized channel and spatial attention weights, respectively, and Fi is the feature map from the ith scale.
Next, the attention-weighted outputs Zi from all pyramid branches are concatenated along the channel axis to generate a unified multiscale representation:
This concatenated feature map \(\:Z\) captures rich contextual and structural information from diverse receptive fields, effectively blending global semantics and local details.
To further enhance the nonlinear representational power and to model inter-channel dependencies, the concatenated feature map is passed through a final 1 × 1 convolutional layer. This step preserves spatial resolution, while reducing dimensional redundancy and facilitating inter-feature interaction.
By integrating channel and spatial attention mechanisms across multiple scales, the proposed PCSA module ensures precise feature recalibration and context-aware fusion, ultimately improving the model's ability to focus on clinically relevant patterns and enhancing classification performance in complex medical image processing tasks.
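The fusion stage can be sketched as follows, under the assumption that the two attention pathways are merged by multiplying each branch feature map with both its channel and its spatial weights. The exact elementwise operation is not fully specified here, so this is one plausible reading; the helper name, shapes, and constant weights are illustrative only.

```python
import numpy as np

def fuse_branches(features, channel_w, spatial_w, fusion_weight):
    """features[i]: (C_i, H, W); channel_w[i]: (C_i, 1, 1); spatial_w[i]: (1, H, W).

    fusion_weight: (C_out, sum of C_i) matrix acting as the final 1 x 1 convolution.
    """
    # Assumed merge rule: both weights rescale the branch features elementwise.
    reweighted = [cw * sw * f for f, cw, sw in zip(features, channel_w, spatial_w)]
    z = np.concatenate(reweighted, axis=0)              # concatenate along channels
    return np.einsum("oc,chw->ohw", fusion_weight, z)   # 1 x 1 convolution

# Toy inputs: four branches of two channels each on a 3 x 3 grid.
features = [np.ones((2, 3, 3)) for _ in range(4)]
channel_w = [np.full((2, 1, 1), 0.5) for _ in range(4)]
spatial_w = [np.full((1, 3, 3), 0.5) for _ in range(4)]

out = fuse_branches(features, channel_w, spatial_w, np.eye(8))
```

The 1 × 1 convolution leaves the spatial resolution unchanged and only mixes channels, which matches the description of the final fusion layer.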
Theoretical justification and mathematical formulation
The proposed PCSA module significantly enhances the representational power of CNNs by integrating multiscale dilated convolutions with parallel channel and spatial attention mechanisms. This design enables the network to capture rich semantic information across varying receptive fields, while selectively emphasizing diagnostically salient features.
Let \(X\in {\mathbb{R}}^{C\times H\times W}\) denote the input feature map, where C, H, and W represent the number of channels, height, and width, respectively. To capture contextual information at multiple resolutions, a set of pyramid convolutions with different dilation rates \(d\in \left\{1, 2, 3\right\}\) is applied:
The resulting multiscale feature maps \({F}_{1}, {F}_{2}, {F}_{3}\) are then concatenated along the channel dimension and fused via a pointwise (1 × 1) convolution to form the aggregated pyramid features:
To enhance feature discriminability, CA is applied by first computing the global average pooling and global max pooling of \({F}_{pyramid}\).
These descriptors are forwarded through a shared two-layer Multi-Layer Perceptron (MLP) with ReLU activation and combined using a sigmoid activation function as follows:
where \({W}_{0}\in {\mathbb{R}}^{C\times C/r}\), \({W}_{1}\in {\mathbb{R}}^{C/r\times C}\), δ(⋅) denotes the ReLU function, and σ(⋅) is the sigmoid activation.
Simultaneously, SA is applied by concatenating the average and max pooled feature maps along the channel axis, followed by a 7 × 7 convolution and sigmoid activation:
The final attention-refined output feature map is obtained via elementwise multiplication of the pyramid features with the channel and spatial attention maps:
This dual-attention strategy enables the model to selectively amplify meaningful spatial and channel features, while maintaining multiscale contextual integrity. Unlike existing methods, such as SE-Net31 and CBAM32, which apply attention either sequentially or in isolation, PCSA processes spatial and channel dependencies in parallel and integrates them within a pyramid-based multiscale structure.
By enriching both the depth and diversity of semantic representations, the PCSA module provides a robust and efficient foundation for multiclass kidney disease classification, as evidenced by the superior performance demonstrated in our experimental evaluations.
PCSA in ResNet: PCSANet
The integration of the PCSA module into the ResNet architecture leads to the development of PCSANet, a robust and efficient DL model optimized for multiscale feature extraction and precise image classification. The PCSA module enhances the network's representational power by aggregating multiscale features and modeling both channel interdependencies and spatial saliency. This is achieved through the combination of residual learning, 1 × 1 convolution, and grouped dilated convolutions.
The modified PCSANet architecture, illustrated in Fig. 4, replaces the standard 3 × 3 convolution layers in ResNet with the proposed PCSA modules. This structural adjustment allows the model to capture long-range spatial dependencies, while maintaining computational efficiency. By processing features across multiple receptive fields and applying attention mechanisms in parallel, the PCSANet effectively emphasizes diagnostically-critical regions in medical images.
The integration is performed vertically within the ResNet backbone, ensuring compatibility with the original residual block framework. The lightweight design of ResNet18 is utilized in conjunction with the modular block structure of ResNet50, enabling PCSANet to achieve a favorable trade-off between classification accuracy and computational complexity.
Overall, PCSANet preserves the core advantages of ResNet, such as gradient stability and residual learning, while augmenting it with the advanced attention-based capabilities of the PCSA module. This hybrid configuration significantly enhances the model's ability to learn discriminative features, particularly in challenging multiclass medical imaging tasks such as renal disease classification.
Table 1 presents a detailed structural comparison between the baseline ResNet-50 and the proposed PCSANet architecture. The table outlines the configuration at each stage of the network and highlights the integration of the PCSA modules. In PCSANet, the standard 3 × 3 convolutional layers within the bottleneck blocks are replaced with PCSA modules, significantly enhancing attention-driven feature extraction. This design improves the model's ability to capture multiscale contextual information and focus on diagnostically relevant regions, all while maintaining computational efficiency. The result is a more effective and scalable architecture for image classification tasks, particularly in medical image processing.
Results and discussion
This section presents a comprehensive performance analysis of the proposed PCSANet model, based on extensive experiments conducted on a multiclass kidney CT image dataset. The classification capability of the model is evaluated using the top-1 accuracy metric, which quantifies the proportion of test samples for which the predicted class label matches the ground truth. To ensure the statistical reliability and robustness of the evaluation, the experiments were repeated multiple times under identical conditions, thereby accounting for inherent variability in DL model training. The final reported accuracy represents the average performance across these independent trials, offering a more stable and generalizable measure of the model's diagnostic efficacy. The results demonstrate the superiority of PCSANet in capturing subtle renal pathologies and distinguishing between the four clinical categories: normal, cyst, stone, and tumor.
Dataset
This study uses a CT kidney dataset obtained from the Picture Archiving and Communication System (PACS) of multiple hospitals located in Dhaka, Bangladesh. The dataset encompasses cases with diagnoses of normal kidneys, tumors, cysts, and kidney stones. Axial and coronal views were extracted from both non-contrast and contrast-enhanced CT scans, adhering to standard imaging protocols for the whole urogram and abdominal examinations.
Each DICOM study was meticulously sorted based on radiological diagnosis, resulting in well-structured image sets aligned with specific pathological findings. All patient-identifiable metadata were removed to preserve anonymity, and the DICOM images were subsequently converted into lossless JPEG format for efficient storage and processing. To ensure data integrity and diagnostic reliability, each case was independently validated by a radiologist and a medical technician, confirming the accuracy of classification.
The final dataset includes 12,446 CT images, distributed as follows: 5,077 normal, 3,709 cyst, 2,283 tumor, and 1,377 stone cases. This dataset is publicly available on Kaggle33. Figure 5 illustrates representative CT images from each diagnostic category, normal, cyst, stone, and tumor, used in this study.
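The per-class counts above can be checked against the stated total and turned into class shares, which helps when interpreting the macro and weighted averages reported later. The short sketch below uses only the figures quoted in this section; the variable names are illustrative.

```python
counts = {"normal": 5077, "cyst": 3709, "tumor": 2283, "stone": 1377}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```

The normal class is several times larger than the stone class, so the dataset is moderately imbalanced.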
Performance evaluation
The quantitative assessment of the proposed PCSANet model was conducted using a comprehensive set of standard evaluation metrics, including accuracy34, precision, recall35, F1-score24, support, macro average, and weighted average36.
- Accuracy measures the ratio of correctly predicted instances to the total number of predictions.
- Recall (also known as sensitivity) represents the proportion of true positive predictions relative to the actual positives in a given class.
- Precision quantifies the ratio of true positive predictions to the total predicted positives.
- F1-score is the harmonic mean of precision and recall, providing a balance between the two, especially when class distributions are uneven.
- Macro average is the unweighted mean of precision, recall, or F1-score across all classes, so every class counts equally regardless of its size, which makes it informative for unbalanced datasets.
- Weighted average takes into account the relative support (i.e., the number of true instances per class) when averaging performance metrics, offering a more representative performance summary in the case of class imbalance.
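The difference between the two averaging schemes is easiest to see on a small example. In the sketch below, the per-class recall values and supports are hypothetical numbers chosen for illustration and are not results from this study.

```python
def macro_average(values):
    """Unweighted mean: every class contributes equally."""
    return sum(values) / len(values)

def weighted_average(values, supports):
    """Mean weighted by support, i.e. the number of true instances per class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total

# Hypothetical two-class case: a large class recalled perfectly, a small one poorly.
recalls = [1.0, 0.5]
supports = [90, 10]

macro = macro_average(recalls)
weighted = weighted_average(recalls, supports)
print(macro, weighted)
```

The macro average exposes the weak minority class, whereas the weighted average is dominated by the majority class.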
To further interpret the classification outcomes, a confusion matrix was constructed. This matrix organizes predictions into four categories:
- True Positives (TP): CKD cases correctly identified.
- False Negatives (FN): CKD cases incorrectly predicted as non-CKD.
- False Positives (FP): Non-CKD cases incorrectly predicted as CKD.
- True Negatives (TN): Non-CKD cases correctly classified.
The confusion matrix provides insights into misclassification patterns across classes and highlights the robustness of the model in distinguishing between similar diagnostic categories. In this representation, rows correspond to actual class labels, while columns represent predicted class labels. The formal definitions of the employed evaluation metrics are presented in Table 2.
Here, N is the number of classes, wi denotes the support (weight) for class i, and Xi refers to the corresponding metric value for that class. These parameters are crucial for calculating the weighted average and other metrics, ensuring that the evaluation accounts for the significance of each class in the dataset.
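For a four-class problem, the TP, FP, FN, and TN counts are obtained per class by treating that class as positive and all others as negative. The sketch below follows the row and column convention stated above (rows are actual labels, columns are predictions); the matrix entries are hypothetical and do not come from the reported experiments.

```python
def per_class_counts(cm, k):
    """One-vs-rest counts for class k; cm[i][j] = actual class i predicted as j."""
    n = len(cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                        # rest of row k
    fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
    tn = sum(map(sum, cm)) - tp - fn - fp       # everything else
    return tp, fp, fn, tn

# Hypothetical confusion matrix, class order: normal, cyst, stone, tumor.
cm = [
    [50, 0, 0, 0],
    [0, 45, 3, 2],
    [0, 4, 46, 0],
    [0, 1, 0, 49],
]

tp, fp, fn, tn = per_class_counts(cm, 1)  # counts for the cyst class
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```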
Parameter configuration
To ensure consistency and reproducibility across experiments, the parameter configurations for the proposed PCSANet model were kept identical throughout the study. Model training uses the cross-entropy loss function with label smoothing, where a smoothing factor of 0.1 is applied to reduce overfitting and improve generalization by softening the target label distributions.
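The effect of label smoothing on the training target can be illustrated with a small stand-alone example. The sketch assumes the common variant in which the smoothing mass is spread uniformly over all classes, including the true one; the text does not say which variant was used, and the function names are illustrative.

```python
import math

def smoothed_targets(true_class, num_classes, eps=0.1):
    """Soft target: (1 - eps) on the true class plus eps / K on every class."""
    off = eps / num_classes
    return [1.0 - eps + off if k == true_class else off for k in range(num_classes)]

def cross_entropy(logits, targets):
    """Cross-entropy between soft targets and the Softmax of the logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(targets, logits))

# Four classes as in this study; the true class index 2 is an arbitrary example.
targets = smoothed_targets(true_class=2, num_classes=4, eps=0.1)
loss = cross_entropy([0.5, -1.0, 3.0, 0.0], targets)
print(targets, loss)
```

With four classes and a factor of 0.1, the true class keeps a target of 0.925 and each other class receives 0.025, so the model is no longer pushed toward fully saturated outputs.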
Optimization is performed using the Stochastic Gradient Descent (SGD) algorithm with a weight decay regularization factor of 0.01 to mitigate overfitting and enhance convergence stability. The initial learning rate is set to 0.05 and is halved every 30 epochs to facilitate gradual convergence and avoid local minima.
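The resulting step schedule can be written as a one-line function. The sketch assumes that epochs are counted from 0 and that the halving takes effect at epochs 30, 60, and so on; the function and parameter names are illustrative.

```python
def step_lr(epoch, base_lr=0.05, gamma=0.5, step=30):
    """Learning rate after multiplying base_lr by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate at a few points of a 300-epoch run.
schedule = {epoch: step_lr(epoch) for epoch in (0, 29, 30, 60, 299)}
print(schedule)
```

Over 300 epochs the rate is halved nine times, so the final value is 0.05 / 512.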
The model is trained using a batch size of 128, and the maximum number of training epochs is set to 300, providing sufficient iterations for the model to learn robust feature representations. These hyperparameter settings were empirically selected based on preliminary experiments to balance training efficiency and classification performance.
Results analysis
The performance of the proposed PCSANet model was evaluated on the kidney CT dataset using a comprehensive set of evaluation metrics, including precision, recall, F1-score, accuracy, macro average, weighted average, and support. To ensure statistical robustness and mitigate performance variability, a tenfold cross-validation strategy was adopted. The average results from these folds were used to construct the confusion matrix, compute evaluation metrics, and plot the Receiver Operating Characteristic (ROC) curve.
To further refine the model's architectural design, we investigated various strategies for integrating the PCSA module into the ResNet-50 framework, as illustrated in Fig. 6. These architectural variants were evaluated to determine the most effective configuration for maximizing feature discrimination and classification accuracy. The explored strategies include:
- Standard Integration: The PCSA module is embedded after the ResNet block within the residual structure, allowing attention refinement on the extracted features.
- Pre-PCSA Integration: The PCSA module is placed before the ResNet block within the residual structure, enabling attention-guided feature refinement prior to residual learning.
- Post-PCSA Integration: The PCSA module is inserted outside the residual structure, directly processing the residual output.
These configurations were empirically compared to assess how the timing and positioning of attention mechanisms influence the network's ability to capture relevant spatial and channel information. The results demonstrate that the integration strategy significantly impacts classification performance, with certain configurations yielding superior multiscale feature learning and generalization.
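The three placements differ only in where the attention module sits relative to the skip connection, which can be made concrete with scalar stand-ins. In the sketch below, `block` and `pcsa` are hypothetical placeholder callables rather than real layers, and the compositions follow the verbal descriptions of the three strategies given above.

```python
def standard_integration(x, block, pcsa):
    """Attention after the block, inside the residual branch."""
    return x + pcsa(block(x))

def pre_pcsa_integration(x, block, pcsa):
    """Attention before the block, inside the residual branch."""
    return x + block(pcsa(x))

def post_pcsa_integration(x, block, pcsa):
    """Attention outside the residual structure, applied to the summed output."""
    return pcsa(x + block(x))

# Toy stand-ins: the block adds 1, the attention module scales by 0.5.
block = lambda v: v + 1.0
pcsa = lambda v: 0.5 * v

x = 3.0
outputs = (
    standard_integration(x, block, pcsa),
    pre_pcsa_integration(x, block, pcsa),
    post_pcsa_integration(x, block, pcsa),
)
print(outputs)
```

Only the Post-PCSA placement rescales the identity path as well, whereas the other two leave the skip connection untouched.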
The experimental findings, summarized in Tables 3, 4 and 5, demonstrate the effectiveness of the PCSANet model with different integration strategies. Among these, the standard integration approach consistently delivered superior performance across all evaluated metrics, including accuracy, precision, recall, and F1-score.
Table 3 presents the results for the standard integration of the PCSA module within the ResNet architecture. This configuration achieved perfect scores (1.00) across all classes and metrics, indicating a highly discriminative and well-generalized model.
In contrast, Table 4 shows the performance of the Pre-PCSA integration, where the PCSA block precedes the ResNet layers. While this approach still yielded high performance, minor degradations in precision and recall, particularly for cyst and tumor classifications, were observed, reducing overall F1-scores to 0.95.
Table 5 details the results for the Post-PCSA integration, where the PCSA block follows the ResNet output. This strategy preserves good performance for most classes but suffers from diminished precision and recall in the “Stone” category, likely due to suboptimal attention refinement at the post-processing stage.
These results reveal that the standard integration of the PCSA block into the residual structure offers the most balanced and effective strategy for multiscale feature learning and fine-grained attention modulation. The PCSA algorithm significantly improves the model's ability to isolate diagnostically important regions and adaptively extract meaningful channel and spatial features, yielding higher generalization and reliability compared to existing attention-enhanced methods.
To comprehensively evaluate the performance of the implemented models, visual analyses were conducted using accuracy and loss curves, along with confusion matrices for the classification of normal, cyst, stone, and tumor categories within the CKD dataset. These evaluations were carried out for the Standard, Pre-PCSA, and Post-PCSA integration strategies to facilitate a comparative assessment of their effectiveness.
For the Standard model, the accuracy and loss curves are shown in Figs. 7 and 8, respectively, while the corresponding training and validation confusion matrices are presented in Figs. 9 and 10. The Pre-PCSA model performance is illustrated in Figs. 11 and 12 for accuracy and loss curves, and in Figs. 13 and 14 for the training and validation confusion matrices. Similarly, Figs. 15 and 16 illustrate the accuracy and loss curves for the Post-PCSA model, with Figs. 17 and 18 displaying the respective confusion matrices. These graphical representations provide valuable insights into the classification capabilities of each model and help elucidate their comparative strengths and limitations.
Among the three configurations, the Post-PCSA model exhibits good convergence and improved generalization, as evidenced by the minimal divergence between training and validation curves. Notably, the loss curve demonstrates rapid convergence and lower final loss values relative to the other models. Furthermore, the confusion matrices reveal enhanced class-wise prediction consistency, particularly in identifying challenging categories such as cysts and tumors, which were more frequently misclassified by the Standard and Pre-PCSA configurations. These results underscore the Post-PCSA model's improved ability to capture fine-grained pathological features.
Nevertheless, despite these advantages, the overall evaluation reveals that the Standard integration approach consistently achieves the highest classification performance across all metrics. As such, the Standard model is adopted in the final PCSANet architecture. Its tight alignment between training and validation accuracy further indicates strong generalization and minimal overfitting. While the proposed PCSA-enhanced framework demonstrates strong potential for accurate renal disease classification, it remains in the experimental phase and has yet to be validated in real-time clinical environments.
In addition, the proposed PCSANet models were rigorously evaluated using 5-fold cross-validation to ensure methodological soundness and reduce the risk of overfitting. This evaluation encompassed three distinct configurations: Standard PCSA integration (embedded within the residual block), Pre-PCSA (placed before the residual block), and Post-PCSA (positioned after the residual block). Each configuration was tested independently across five folds, with the results averaged to derive a reliable and comprehensive performance estimate. The evaluation metrics included accuracy, precision, recall, and F1-score, all of which are essential for assessing multiclass classification effectiveness.
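The cross-validation protocol described here can be sketched as follows. This is an illustrative outline only: stratified fold assignment is an assumption (the text does not state how the folds were drawn), `stratified_folds` and `cross_validate` are hypothetical helper names, and `placeholder_evaluate` stands in for training a model on the training indices and scoring it on the validation indices.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices round-robin
    return folds


def cross_validate(labels, evaluate_fold, k=5, seed=0):
    """Call evaluate_fold(train_idx, val_idx) on each split and average every metric."""
    folds = stratified_folds(labels, k, seed)
    per_fold = []
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        per_fold.append(evaluate_fold(train_idx, val_idx))
    averaged = {name: mean(m[name] for m in per_fold) for name in per_fold[0]}
    return per_fold, averaged


def placeholder_evaluate(train_idx, val_idx):
    """Placeholder: a real version would train on train_idx and return
    accuracy, precision, recall, and F1 measured on val_idx."""
    return {"val_fraction": len(val_idx) / (len(train_idx) + len(val_idx))}


# Toy label list (not the paper's data): 30 samples over four classes.
labels = ["normal"] * 10 + ["cyst"] * 10 + ["stone"] * 5 + ["tumor"] * 5
per_fold, averaged = cross_validate(labels, placeholder_evaluate)
print(averaged)
```

Each sample is used for validation exactly once, and the reported figure for a configuration is the mean of its five per-fold values.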
Table 6 presents a detailed description of the results for each fold alongside the mean values for all metrics across the three configurations. The Standard integration consistently achieved superior performance, with an average accuracy of 99.92%, demonstrating high predictive reliability. In contrast, the Pre-PCSA configuration yielded slightly lower average metrics across all folds, indicating a modest reduction in effectiveness when attention is applied prior to residual processing. The Post-PCSA approach produced intermediate performance, with improvements over the Pre-PCSA model but still slightly below the Standard model. These findings reaffirm that integrating the PCSA module within the residual block offers optimal performance benefits.
Figures 19, 20 and 21 present the aggregated confusion matrices across all five folds for each PCSA integration strategy: Standard, Pre-PCSA, and Post-PCSA. These matrices offer a comprehensive visual representation of model performance, highlighting the classification accuracy and misclassification patterns across all renal condition classes.
The Standard PCSA integration consistently outperformed the other strategies across all evaluation metrics, including classification accuracy, precision, recall, and F1-score. This superior performance is clearly illustrated in the aggregated confusion matrix, which exhibits strong diagonal dominance and minimal misclassification, indicating robust and reliable class differentiation. These findings affirm that embedding the PCSA module within the residual blocks of the ResNet backbone effectively captures multiscale spatial and channel dependencies essential for distinguishing among renal disease types.
In contrast, the Pre-PCSA configuration, in which the attention module precedes the residual units, demonstrated comparatively lower performance. This was particularly evident in the increased confusion between classes with similar radiological features, such as cysts and normal tissues, as reflected by the elevated off-diagonal values in the confusion matrix. The Post-PCSA integration strategy achieved moderate improvements over Pre-PCSA, yet fell short of matching the classification efficacy of the Standard approach, consistent with the corresponding confusion matrix.
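To make the reading of an aggregated confusion matrix concrete, the sketch below sums per-fold matrices and derives per-class precision, recall, and F1-score from the result. The two 4 × 4 matrices are made-up toy counts for illustration, not results from this study, and the row/column convention (rows are true classes, columns are predicted classes) is an assumption.

```python
def aggregate(matrices):
    """Element-wise sum of per-fold confusion matrices (rows = true, columns = predicted)."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def per_class_metrics(cm, class_names):
    """Precision, recall, and F1 for each class of a square confusion matrix."""
    n = len(cm)
    report = {}
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column total
        actual = sum(cm[c])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[class_names[c]] = {"precision": precision, "recall": recall, "f1": f1}
    return report


classes = ["normal", "cyst", "stone", "tumor"]
# Toy per-fold counts: one cyst predicted as normal, one stone predicted as tumor.
fold_1 = [[10, 0, 0, 0], [1, 9, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
fold_2 = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 9, 1], [0, 0, 0, 10]]

total_cm = aggregate([fold_1, fold_2])
report = per_class_metrics(total_cm, classes)
print(accuracy(total_cm))
print(report["cyst"])
```

Off-diagonal entries lower the recall of the true class (its row) and the precision of the predicted class (its column), which is how confusion between classes such as cysts and normal tissue shows up in the per-class scores.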
Despite the near-perfect performance of the Standard model, the adoption of 5-fold cross-validation across a diverse dataset comprising over 12,000 samples helps mitigate the risk of overfitting. Nonetheless, further evaluation on an independent external dataset is warranted to confirm clinical applicability and generalizability. Overall, these results underscore the effectiveness of the PCSA mechanism in enhancing convolutional network performance for renal disease diagnosis, with the Standard integration emerging as the best configuration in terms of accuracy, efficiency, and diagnostic reliability.
Extended evaluation of PCSA on multiple backbones
To further assess the generalizability and robustness of the proposed PCSA module, we conducted an extended evaluation by integrating it into three additional widely-adopted convolutional neural network architectures: VGG16, MobileNetV2, and DenseNet121. These experiments were designed to determine whether the performance gains achieved with ResNet could be replicated across different backbone networks. Each modified architecture was evaluated using 5-fold cross-validation on the same CT kidney dataset to maintain consistency in experimental conditions. The quantitative results are summarized in Table 7, while Figs. 22, 23 and 24 present the aggregated confusion matrices, offering a visual comparison of class-wise classification performance for each backbone.
These results demonstrate that the PCSA module consistently enhances performance across all tested backbones, with the ResNet-based implementation achieving the highest metrics overall. The VGG16, MobileNetV2, and DenseNet121 variants also exhibit strong classification performance, confirming the adaptability and effectiveness of the PCSA mechanism across different network architectures.
The results presented in Table 7 demonstrate that the PCSA module consistently enhances classification performance across various backbone architectures. Among them, the PCSA-ResNet variant achieves the highest scores across all evaluation metrics, affirming that embedding the PCSA module within residual blocks is particularly effective for learning robust multiscale feature representations. Its corresponding confusion matrix reveals minimal misclassifications, underscoring the model's strong generalization ability and low susceptibility to overfitting across cross-validation folds.
The PCSA-VGG16 model, although not matching the performance of ResNet, still delivers strong results. Its confusion matrix shows minor misclassifications, particularly between cyst and normal cases, as well as between tumor and stone categories. The small number of such errors indicates that, despite the absence of residual connections, the integration of PCSA significantly enhances VGG16's ability to focus on discriminative spatial and channel features.
The PCSA-MobileNetV2 model, designed as a lightweight architecture for mobile and embedded applications, achieves good performance, albeit with slightly lower precision and recall. The broader spread in its confusion matrix indicates a higher rate of misclassification, likely due to MobileNetV2's limited representational capacity. Nevertheless, the inclusion of the PCSA module clearly contributes to improved classification stability and SA compared to the unmodified baseline.
The PCSA-DenseNet model benefits from the intrinsic strengths of dense connectivity, which promotes feature reuse and improved gradient flow. It demonstrates performance metrics that surpass those of MobileNetV2 and closely approach those of VGG16. The corresponding confusion matrix displays more compact class-wise prediction boundaries, indicating that the DenseNet architecture synergizes well with the PCSA module to enhance multiscale feature integration.
Overall, these findings confirm that the PCSA functions as a modular and architecture-agnostic enhancement capable of improving performance across diverse CNN backbones. While the degree of improvement varies depending on the complexity and capacity of the underlying architecture, all tested configurations exhibit measurable gains and strong generalization under 5-fold cross-validation. This extended evaluation underscores the adaptability and effectiveness of the PCSA as a plug-and-play module suitable for deployment in a variety of medical image analysis frameworks.
Ablation study on channel attention
To assess the individual contribution of the proposed CA mechanism within the PCSA module, we conducted a targeted ablation study by comparing multiple CA configurations. Unlike conventional SE-based channel attention, which employs only global average pooling, our proposed design combines both global average and global max pooling. This dual-pooling strategy aims to generate a more informative channel descriptor by capturing both prominent and distributed activation patterns.
Three variants of the PCSANet model were implemented using the ResNet-50 backbone, all trained on the same CT kidney dataset under identical hyperparameters and 5-fold cross-validation.
- Model A (Baseline – No Channel Attention): The CA component was removed, retaining only the SA and pyramid multiscale convolution modules.
- Model B (SE-Style Channel Attention): A traditional SE attention mechanism was implemented, using only global average pooling followed by a shared MLP.
- Model C (Proposed Channel Attention): Both global average and max pooling were incorporated, followed by a shared MLP implemented via 1 × 1 convolutions.
All models shared the same spatial attention module and multiscale pyramid convolution structure to ensure a controlled comparison environment.
Table 8 presents the classification results for each variant across five folds. Eliminating the channel attention module entirely (Model A) led to a noticeable reduction in accuracy, precision, recall, and F1-score, underscoring the importance of channel recalibration for effective feature emphasis.
In Model B, where SE-style attention using only average pooling was applied, performance improved relative to the baseline, validating the known benefits of channel recalibration mechanisms in deep networks. However, the proposed design (Model C) achieved the highest scores across all metrics. Specifically, it improved F1-score by +0.9% and accuracy by +0.74%, compared to the SE-style attention approach.
These improvements are attributed to the complementary strengths of the two pooling operations: max pooling captures salient, dominant activations, while average pooling accounts for distributed feature responses. Their combination yields a more expressive and discriminative channel descriptor, leading to superior attention-weighted feature maps.
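The difference between the average-only and dual-pooling descriptors can be illustrated with a small pure-Python sketch. It is not the authors' code: the way the two pooled paths are merged (summing the shared-MLP outputs before a sigmoid, as in CBAM-style channel attention) is an assumption, the weight matrices `w1` and `w2` are hypothetical identity matrices chosen for readability, and the feature map is a nested list indexed as `[channel][row][column]`.

```python
import math


def global_avg_pool(fmap):
    """One mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def global_max_pool(fmap):
    """One maximum value per channel."""
    return [max(max(row) for row in ch) for ch in fmap]


def shared_mlp(v, w1, w2):
    """Two dense layers with a ReLU in between; on a 1 x 1 map this matches two 1 x 1 convolutions."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_weights(fmap, w1, w2, use_max=True):
    """use_max=False gives SE-style weights (Model B); use_max=True adds the max-pooled path (Model C)."""
    logits = shared_mlp(global_avg_pool(fmap), w1, w2)
    if use_max:
        max_logits = shared_mlp(global_max_pool(fmap), w1, w2)
        logits = [a + m for a, m in zip(logits, max_logits)]  # assumed merge: sum, then sigmoid
    return [sigmoid(z) for z in logits]


def apply_channel_attention(fmap, weights):
    """Scale every value of channel c by its attention weight."""
    return [[[weights[c] * v for v in row] for row in ch] for c, ch in enumerate(fmap)]


# Two toy channels with the same mean (1.0) but different peaks (4.0 versus 1.0).
fmap = [
    [[0.0, 0.0], [0.0, 4.0]],  # one strong, localized activation
    [[1.0, 1.0], [1.0, 1.0]],  # uniform, distributed activation
]
w1 = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights, not learned values
w2 = [[1.0, 0.0], [0.0, 1.0]]

se_style = channel_weights(fmap, w1, w2, use_max=False)
dual_pool = channel_weights(fmap, w1, w2, use_max=True)
reweighted = apply_channel_attention(fmap, dual_pool)
print(se_style)
print(dual_pool)
```

In this toy case the average-only descriptor cannot separate the two channels, because their means are equal, whereas adding the max-pooled path assigns a larger weight to the channel with the localized peak.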
This ablation study provides clear evidence that the proposed dual-pooling CA mechanism contributes significantly to model performance. The enhancements go beyond superficial architectural modifications and result in statistically and practically meaningful improvements in classification outcomes. These findings reinforce the architectural merit of the proposed PCSA design, positioning it as an effective alternative to traditional attention modules such as SENet.
Comparison with existing methods
Table 9 provides a comparative analysis of the proposed PCSANet model against existing state-of-the-art methods applied to CKD datasets and related medical imaging datasets. While some prior studies, such as that by Senan et al.37, report 100% accuracy, the results were often obtained on small-scale datasets (e.g., 400 samples), limiting the methods' generalizability and real-world applicability. Additionally, works such as that by Singh et al.14 focused solely on binary classification, which does not capture the complexity of multiclass kidney disease diagnosis.
In contrast, the proposed PCSANet model was rigorously evaluated on a comprehensive, large-scale CT kidney dataset comprising 12,446 images, representing four diagnostic categories: normal, cyst, tumor, and stone. This scale and diversity provide a more realistic and challenging benchmark, reinforcing the robustness and clinical relevance of the model. By integrating the novel PCSA module, PCSANet effectively captures multiscale contextual information and intricate feature dependencies, leading to superior classification performance.
Unlike many traditional methods that rely on handcrafted features or basic ML models, PCSANet represents a DL framework enhanced with a plug-and-play attention mechanism. For instance, Kadhim et al.38 reported high performance using SVM and MLP models on CT images, but their approach lacked adaptability to multiclass scenarios. Similarly, Hama et al.39 utilized SVM with Local Binary Patterns (LBP) for stone detection, effective for texture analysis but limited in semantic feature understanding. Ghosh et al.40 proposed a fuzzy ensemble with TransferNet, which demonstrated good interpretability but introduced added complexity and training overhead.
PCSANet, in contrast, delivers a streamlined yet highly effective architecture by embedding PCSA blocks into residual structures, allowing it to generalize across diverse image resolutions and pathology types. Its modular design was successfully applied to various CNN backbones, validating its adaptability and scalability—features often missed in prior models.
The results summarized in Table 9 clearly show that PCSANet outperforms conventional ML approaches and existing DL solutions, achieving 100% accuracy, recall, precision, and F1-score on the CKD dataset. This demonstrates the model's exceptional ability to generalize across a large and complex dataset. These findings affirm the effectiveness of the PCSA mechanism in enhancing deep CNN architectures for robust medical image classification.
Conclusion and future works
This study introduced the PCSA mechanism, an advanced attention framework designed to improve feature extraction in convolutional neural networks. The PCSA leverages enhanced pyramid multiscale convolution to capture rich feature representations across varying receptive fields and generates both channel and spatial attention weights to effectively recombine multiscale features. This mechanism optimizes spatial information flow and improves the network's focus on relevant regions. By replacing standard convolutions with dilated pyramid group convolutions, the PCSA reduces computational overhead, while pointwise convolution is employed to compensate for potential information loss inherent in grouped operations. As a lightweight and modular attention block, the PCSA is compatible with a wide range of deep learning architectures and can be integrated as a plug-and-play component. In this work, the integration of the PCSA into ResNet architectures led to the development of the PCSANet, a novel model that demonstrated superior classification accuracy and robustness compared to several state-of-the-art methods across a large-scale CT kidney dataset. Looking forward, future research will focus on extending the application of the PCSA to additional computer vision tasks such as object detection, semantic segmentation, and lesion localization, particularly within medical imaging domains. To further validate clinical utility, we plan to conduct external evaluations using multi-institutional datasets that incorporate heterogeneous patient demographics and imaging protocols. Additionally, to facilitate deployment in resource-constrained environments, we will explore model compression techniques and design lightweight variants of the PCSA module. These future directions aim to broaden the applicability, scalability, and practical adoption of the proposed attention mechanism in real-world healthcare and AI systems.
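The computational argument for grouped, dilated convolutions can be checked with simple arithmetic. The sketch below uses the standard weight-count formula for a 2-D convolution and the standard effective-kernel formula for dilation; the channel count (64), group count (4), and dilation rates are illustrative assumptions, not the settings used in PCSANet.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Number of weights in a 2-D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def effective_kernel(kernel, dilation):
    """Span covered along one axis by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


channels = 64  # assumed width, for illustration only
standard = conv_params(channels, channels, kernel=3)
grouped = conv_params(channels, channels, kernel=3, groups=4)
pointwise = conv_params(channels, channels, kernel=1)

print(standard)             # dense 3 x 3 convolution
print(grouped + pointwise)  # grouped 3 x 3 followed by a 1 x 1 convolution
print([effective_kernel(3, d) for d in (1, 2, 3)])
```

Under these assumptions the grouped convolution plus the pointwise convolution uses fewer weights than a single dense 3 × 3 convolution, while dilation widens the receptive field without adding weights; the pointwise layer restores mixing between channel groups that the grouped operation keeps separate.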
Data availability
This study utilized a CT kidney dataset sourced from the Picture Archiving and Communication System (PACS) across various hospitals in Dhaka, Bangladesh. The dataset comprises patients diagnosed with normal kidneys, tumors, cysts, or stones. Axial and coronal cuts were chosen from both noncontrast and contrast-enhanced scans, following established protocols for whole urogram and abdomen imaging. The dataset is publicly accessible on Kaggle at https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone.
References
Schoolwerth, A. C. et al. Chronic Kidney Disease: A Public Health Problem That Needs a Public Health Action Plan. (2006).
Kovesdy, C. P. Epidemiology of chronic kidney disease: an update 2022. Kidney Int. Suppl. 12, 7–11 (2022).
Francis, A. et al. Chronic kidney disease and the global public health agenda: an international consensus. Nat. Rev. Nephrol 20, (2024).
Goksu, S. Y., Leslie, S. W. & Khattar, D. Renal Cystic Disease. The Kelalis-King-Belman Textbook of Clinical Pediatric Urology Study Guide 63–65 (2023). https://doi.org/10.5005/jp/books/12792_10
Islam, M. N. et al. Vision transformer and explainable transfer learning models for auto detection of kidney cyst, stone and tumor from CT-radiography. Sci. Rep. 12, (2022).
Torres, H. R. et al. Kidney segmentation in ultrasound, magnetic resonance and computed tomography images: A systematic review. Comput. Methods Programs Biomed. 157, 49–67 (2018).
Hermena, S. & Young, M. CT-scan Image Production Procedures. StatPearls (2023).
Thurman, J. M. & Gueler, F. Recent advances in renal imaging. F1000Res 7, 1–14 (2018).
Yang, L., Zhang, R. Y., Li, L. & Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. 11863–11874 (2021). https://proceedings.mlr.press/v139/yang21o.html.
Li, J., Wen, Y. & He, L. SCConv: Spatial and channel reconstruction convolution for feature redundancy. 6153–6162 (2023). https://doi.org/10.1109/CVPR52729.2023.00596.
Kumar, A., Shivakumara, P., Chowdhury, P. N., Pal, U. & Liu, C. L. DPAM: A new deep parallel attention model for multiple license plate number recognition. Proc. - Int. Conf. Pattern Recognit. 2022-August, 1485–1491 (2022).
Agarwal, R., Ghosal, P., Sadhu, A. K., Murmu, N. & Nandi, D. Multi-scale dual-channel feature embedding decoder for biomedical image segmentation. Comput. Methods Programs Biomed. 257, 108464 (2024).
Prakash, U. M. et al. Multi-scale feature fusion of deep convolutional neural networks on cancerous tumor detection and classification using biomedical images. Sci. Rep. 15, 1–23 (2025).
Singh, V., Asari, V. K. & Rajasekaran, R. A. Deep neural network for early detection and prediction of chronic kidney disease. Diagnostics. 12, 116 (2022).
Abdeltawab, H. et al. A pyramidal deep learning pipeline for kidney whole-slide histology images classification. Sci. Rep. 11, (2021).
Sudharson, S. & Kokil, P. An ensemble of deep neural networks for kidney ultrasound image classification. Comput. Methods Programs Biomed. 197, (2020).
Kim, D. H. & Ye, S. Y. Classification of chronic kidney disease in sonography using the Glcm and artificial neural network. Diagnostics 11, (2021).
Jerlin Rubini, L. & Perumal, E. Efficient classification of chronic kidney disease by using multi-kernel support vector machine and fruit fly optimization algorithm. Int. J. Imaging Syst. Technol. 30, 660–673 (2020).
Poonia, R. C. et al. Intelligent diagnostic prediction and classification models for detection of kidney disease. Healthcare (Switzerland) 10, (2022).
Nithya, A., Appathurai, A., Venkatadri, N., Ramji, D. R. & Anna Palagan, C. Kidney disease detection and segmentation using artificial neural network and multi-kernel k-means clustering for ultrasound images. Measurement (Lond) 149, (2020).
Alzu’Bi, D. et al. Kidney Tumor Detection and Classification Based on Deep Learning Approaches: A New Dataset in CT Scans. J. Healthc. Eng. (2022).
Pande, S. D. & Agarwal, R. Multi-Class kidney abnormalities detecting novel system through computed tomography. IEEE Access. 12, 21147–21155 (2024).
Nagawa, K. et al. Three-dimensional convolutional neural network-based classification of chronic kidney disease severity using kidney MRI. Sci. Rep. 14, (2024).
Tawfik, N. et al. Enhancing early detection of lung Cancer through advanced image processing techniques and deep learning architectures for CT scans. Comput. Mater. Continua. 81, 271–307 (2024).
Zheng, M. et al. Attention-based CNNs for image classification: A survey. J. Phys. Conf. Ser 2171, (2022).
Wang, Q. et al. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 11534–11542 (2020).
Guo, C. et al. SA-UNET: Spatial attention U-net for retinal vessel segmentation. Proceedings - International Conference on Pattern Recognition 1236–1242 (2020). https://doi.org/10.1109/ICPR48806.2021.9413346.
Yu, Y., Zhang, Y., Cheng, Z., Song, Z. & Tang, C. Multi-scale Spatial pyramid attention mechanism for image recognition: an effective approach. Eng. Appl. Artif. Intell. 133, 108261 (2024).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: Convolutional Block Attention Module. In Proceedings of the European conference on computer vision (ECCV) 3–19 (2018).
Sang, H., Zhou, Q. & Zhao, Y. PCANet: Pyramid convolutional attention network for semantic segmentation. Image Vis. Comput. 103, 103997 (2020).
Hu, J., Shen, L. & Sun, G. Squeeze-and-Excitation networks. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 7132-7141 https://doi.org/10.1109/CVPR.2018.00745 (2018).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: Convolutional Block Attention Module. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11211 LNCS, 3–19 (2018).
CT KIDNEY DATASET. Normal-Cyst-Tumor and Stone. https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone.
Sajjadi, M. S. M., Bousquet, O., Bachem, O., Lucic, M. & Gelly, S. Assessing generative models via precision and recall. Adv. Neural Inf. Process. Syst. 2018-December, 5228–5237 (2018).
Deeks, J. J., Takwoingi, Y., Macaskill, P. & Bossuyt, P. M. Understanding test accuracy measures. Cochrane Handb. Syst. Reviews Diagn. Test. Accuracy. 53–72. https://doi.org/10.1002/9781119756194.CH4 (2023).
Saif, D., Sarhan, A. M. & Elshennawy, N. M. Early prediction of chronic kidney disease based on ensemble of deep learning models and optimizers. J. Electr. Syst. Inform. Technol. 11, (2024).
Senan, E. M. et al. Diagnosis of Chronic Kidney Disease Using Effective Classification Algorithms and Recursive Feature Elimination Techniques. J. Healthc Eng. 1004767 (2021).
Kadhim, D. A. & Mohammed, M. A. Advanced machine learning models for accurate kidney Cancer classification using CT images. Mesopotamian J. Big Data. 2025, 1–25 (2025).
Hama, H. K., Majeed, H. D. & Nariman, G. S. Enhanced kidney stone detection and classification using SVM and LBP features. UHD J. Sci. Technol. 9, 10–17 (2025).
Ghosh, A. & Chaki, J. Fuzzy enhanced kidney tumor detection: integrating machine learning operations for a fusion of twin transferable network and weighted ensemble machine learning classifier. IEEE Access. https://doi.org/10.1109/ACCESS.2025.3526272 (2025).
Jongbo, O. A., Adetunmbi, A. O., Ogunrinde, R. B. & Badeji-Ajisafe, B. Development of an ensemble approach to chronic kidney disease diagnosis. Sci. Afr. 8, e00456 (2020).
Ma, F., Sun, T., Liu, L. & Jing, H. Detection and diagnosis of chronic kidney disease using deep learning-based heterogeneous modified artificial neural network. Future Generation Comput. Syst. 111, 17–26 (2020).
Chittora, P. et al. Prediction of chronic kidney Disease - A machine learning perspective. IEEE Access. 9, 17312–17334 (2021).
Alsuhibany, S. A. et al. Ensemble of Deep Learning Based Clinical Decision Support System for Chronic Kidney Disease Diagnosis in Medical Internet of Things Environment. Comput. Intell. Neurosci. (2021).
Sawhney, R., Malik, A., Sharma, S. & Narayan, V. A comparative assessment of artificial intelligence models used for early prediction and evaluation of chronic kidney disease. Decis. Analytics J. 6, 100169 (2023).
Ramu, K. et al. Hybrid CNN-SVM model for enhanced early detection of chronic kidney disease. Biomed. Signal. Process. Control. 100, 107084 (2025).
Acknowledgements
The authors extend their appreciation to the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University for supporting this work through the Research Groups Program, Grant number RGP-1444-0054.
Funding
This work was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University, through the Research Groups Program Grant number RGP-1444-0054.
Author information
Contributions
The authors confirm their contribution to the paper as follows: study conception and design: Nahed Tawfik, and Heba M. Emara; data collection: Walid El-Shafai; analysis and interpretation of results: Naglaa F. Soliman, Abeer D. Algarni, and Fathi E. Abd El-Samie; draft manuscript preparation: Nahed Tawfik, Heba M. Emara, and Walid El-Shafai. All authors reviewed the results and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Consent to participate
All authors contributed to the current work and agreed to its submission.
Consent to publish
All authors agreed to submit and publish this work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tawfik, N., Emara, H.M., El-Shafai, W. et al. PCSA-Net: pyramid channel and spatial attention network for multiclass renal disease diagnosis using CT images. Sci Rep 16, 5953 (2026). https://doi.org/10.1038/s41598-025-12335-6
DOI: https://doi.org/10.1038/s41598-025-12335-6