Introduction

Alzheimer’s disease (AD) is a neurodegenerative brain disease and the most prevalent form of dementia, requiring significant medical attention1,2,3. Early and accurate diagnosis of AD is crucial to initiate timely therapeutic interventions and ensure effective patient care1,4. Research findings indicate that in 2018, approximately 50 million people were living with dementia worldwide5. According to the World Health Organization (WHO), the number of people affected by AD continues to grow and is projected to exceed 152 million worldwide by 2050, surpassing global cancer patient statistics. Additionally, AD ranks fifth in terms of the worldwide mortality ratio, causing the death of numerous individuals globally5,6,7. AD is characterized as a persistent neurological brain disorder that progressively damages brain cells, leading to memory loss and cognitive difficulties. Ultimately, it accelerates the decline in the ability to carry out everyday activities in real-life scenarios5,6. This state leads to the shrinkage of the human brain, affecting memory and causing the decline of behavioral, social, and reasoning abilities. Alzheimer’s is thought to be caused by the build-up of protein fragments in the brain, which leads to the formation of plaques and tangles around nerve cells. As a result, it severely impacts the lobes and hippocampus, changing their shape, shrinking them, or enlarging the ventricles8,9,10.

AD is an incurable and ultimately fatal disease that causes patients to suffer for the remainder of their lives. Consequently, it places numerous burdens on the patient’s family, including financial, mental, and physical issues11,12. Researchers have not yet identified the origin of AD, and there are no effective therapies or medications to protect people from this disease; consequently, doctors and researchers are unable to reverse the dementia of AD-affected people. However, researchers have identified distinct stages of AD: Mild Cognitive Impairment (MCI) is recognized as the pre-clinical stage of AD and signifies a transitional state between normal aging and the onset of AD6. Early AD detection and assessment of its risk and severity are therefore crucial13,14. Many traditional diagnostic centers in various countries rely on computer-assisted or neuro-imaging systems, but researchers have found that these suffer from low accuracy during the initial stage of AD. Computed tomography (CT) and positron emission tomography (PET) scans are medical imaging techniques that provide detailed information about the internal structures and functions of the body. CT scans use X-rays to create cross-sectional images, while PET scans involve the injection of a radioactive tracer to visualize metabolic activity. These scans help diagnose and monitor various medical conditions, revolutionizing the medical diagnosis of AD6. In addition, MRI is a practical, non-invasive method for collecting valuable information about the human body. This information is widely used in the medical diagnosis of AD, which has become a significant recent trend in computer-aided AD diagnosis8,9.

In recent years, researchers worldwide have developed numerous ML15 and DL16 algorithms to identify and categorize AD. While some research has achieved strong results using DL algorithms, there is still room for improvement. A series of DL models has been introduced in this domain, including a hybrid CNN that incorporates slice selection and histogram stretching17,18,19. Other researchers proposed a CNN architecture that incorporates skull-stripping techniques20, and a CNN model that includes slicing samples for pre-processing was introduced in21. Nevertheless, these deep models primarily emphasize classification tasks, which can be attributed to the inherent black-box nature of CNNs. Several datasets are available online, and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset stands out as one of the most challenging among them. The ADNI dataset is widely used due to its publicly available and comprehensive collection of clinical, imaging, and genetic data from individuals with AD, MCI, and healthy controls. Numerous researchers have been diligently working to develop AD recognition systems using various DL technologies, aiming to enhance performance accuracy and efficiency on the ADNI dataset.

Kamal et al. developed a CNN-SVC system to improve performance accuracy on the Kaggle dataset22. Similarly, researchers developed DEMNET23, CNN24, TriAD25, and FDCT-WR26, reporting improved accuracy over previous models on the Kaggle dataset. Moreover, Sharma et al. used a hybrid system combining transfer learning and machine learning algorithms, DenseNet201-SVM27, and reported 91.00% accuracy on the Kaggle dataset. Fareed et al.1 employed image augmentation techniques to overcome the challenge of balancing class labels among AD disease classes; they utilized a CNN-based feature extraction and classification module, namely ADD-Net, reporting 97.05% accuracy on the Kaggle dataset. Al-Alhadia et al.28 explored pre-trained models such as AlexNet and ResNet50 for AD diagnosis and reported an accuracy of 94.53%. The AlexNet-ResNet5028, ADD-Net1, DEMNET23, and Bi-Vision Transformer (BiViT)29 models improve performance accuracy, but they still face computational complexity and accuracy limitations. Additionally, these models require more training parameters than our model. In this scenario, it is crucial to develop an AD detection module that overcomes both accuracy and efficiency challenges. To this end, we propose a simple, lightweight stacked CNN with a channel attention network (SCCAN) module, in which a series of convolutional blocks with diverse deep layers is integrated with a channel attention module to achieve outstanding classification results in terms of accuracy and efficiency, particularly in the early stages of AD. Our research makes significant contributions to the field, advancing the SOTA in AD detection through the following key contributions:

  • Innovative Model Architecture: We introduce a Stacked Convolutional Neural Network with a Channel Attention Network (SCCAN) explicitly designed for AD detection. After applying data augmentation, our model extracts features in two stages: a stacked CNN and a channel attention module. The stacked CNN integrates five CNN modules, generating a hierarchical understanding of features through multi-level extraction, effectively reducing noise and enhancing the weights’ efficacy.

  • Channel-wise Feature Extraction: We incorporate a channel attention module to capture the nuances of channel-wise features from the stacked CNN’s spatial features. This module enhances the model’s capability to extract relevant features along the channel dimension, aligning with the intricacies of AD classification from MRI images. Finally, we employ a classification module.

  • Comparative Evaluation: We conducted an extensive comparative evaluation of the proposed SCCAN model on the ADNI1 Complete 1Yr 1.5T, ADNI Kaggle, and OASIS-1 datasets. The results unequivocally demonstrate the superior performance of our proposed SCCAN approach, marking a substantial advancement in AD classification using MRI images. Our code and other materials for preprocessing, feature extraction, selection, and classification are available at the GitHub link (https://github.com/najm-h/Alzheimer/).

The organization of our paper is outlined as follows: Section 2 delves into the related work. The datasets are described in Section 3. Section 4 presents the proposed methodology, including dataset preprocessing, the feature extraction procedure, and the classification method. Section 5 provides detailed results, including optimal parameter values and a comparative analysis. Finally, Section 5.6 and Section 6 present the discussion and conclusion of this paper, respectively.

Related Work

The accurate detection and classification of medical images pose a challenging task due to the complex nature of acquiring medical datasets30. Therefore, institutions and organizations that provide medical datasets, such as the ADNI31 and the Open Access Series of Imaging Studies (OASIS)32, implement a stringent screening process. Recently, many researchers have been working to develop Alzheimer’s disease detection systems utilizing the ADNI31 and OASIS32 datasets. Islam et al.33 utilized the OASIS dataset, which comprises 416 3D data samples, to construct a multi-layer CNN model. They assessed the accuracy of their model by comparing it with two distinct pre-trained architectures, namely Inception V434 and ResNet34. Khan et al.35 similarly used a twelve-layer CNN structure integrating convolution and pooling operations with the OASIS dataset. To mitigate the vanishing-gradient issue, other researchers used the Leaky Rectified Linear Unit (Leaky ReLU)36 with MaxPooling as the activation function instead of ReLU37. Recently, many researchers have employed pre-trained models, including MobileNetV2, VGG19, InceptionV3, and Xception, to enhance the performance accuracy and efficiency of AD systems38,39,40. Ebrahimighahnavieh et al.16 proposed a hybrid framework combining InceptionV4 with ResNetV2, using residual connections to provide skip connections35, and reported 79.12% accuracy on the OASIS dataset. Similarly, Pradhan et al.41 employed VGG19 and DenseNet169 models for a comparative analysis of AD classification42 and reported accuracies of 88% and 87%, respectively, for these two transfer learning methods on the OASIS dataset. Meanwhile, Battineni et al.43 implemented a five-layer CNN model using the OASIS-3 dataset, focusing on classifying three distinct early stages of AD.

El-aal et al.44 selected potential features from whole DL models to reduce computational cost and reported better accuracy. To enhance DL feature selection, some researchers used a refined genetic algorithm (RGA)45 and probability binary particle swarm optimization (PBPSO)46 to select effective features. Additionally, several existing works present attention-mechanism-based approaches47,48,49,50,51 for MRI image diagnosis. Several recent studies incorporate attention mechanisms to improve feature extraction in AD detection models. For example, Illakiya et al.47 developed an adaptive hybrid attention network, achieving 98.53% accuracy on the ADNI dataset; however, their model suffers from high computational demands, making it less practical for real-time deployment. Attention-based methods like these are effective but require substantial resources, which limits their utility in clinical settings where efficiency is essential. Yen et al.52 developed a model based on attention mechanisms for the classification of AD and reported an accuracy of 85.24%. Beyond managing computational complexity, many researchers have worked to improve performance accuracy using deep learning models trained from scratch22,23,24,26,27. Kamal et al.22 proposed an AD detection system using SpinalNet with a CNN module and reported 97.60% accuracy on the Kaggle MRI AD dataset. Chabib et al.26 utilized a curvelet transform-based CNN module, DeepCurvMRI, achieving 98.62% accuracy on the same dataset. Kim et al.24 employed an ensemble of ten 1D CNNs to classify AD and reported 98.60% accuracy with a loss of 0.0386. Murugan et al.23 proposed a CNN network as a framework for AD detection using the Kaggle dataset. Sharma et al.27 introduced a hybrid AI-based model that combines permutation-based ML and a transfer learning-based voting classifier in two phases of feature extraction; they then employed several machine learning algorithms for classification, reporting 91.75% accuracy and a 96.50% F1 score.

Transformer-based models like Vision Transformers (ViT) have shown promise due to their ability to handle large-scale features53,54, as evidenced by Dhinaagar et al.54, who achieved 82.69% accuracy. Fareed et al.1 employed image augmentation techniques to balance class labels among AD disease classes; they utilized a CNN-based feature extraction and classification module, reporting a 97.05% accuracy with the Kaggle dataset.

However, the above-mentioned methods often lack the interpretability and efficiency of CNNs, especially when applied to smaller datasets, where overfitting and generalization issues may arise. These existing AD recognition systems often struggle with interpretability, efficiency, overfitting, and generalization, especially on smaller datasets; high computational demands and poor handling of dataset imbalance further limit their accuracy and suitability for real-time use, resulting in suboptimal performance on standard datasets like ADNI1 Complete 1Yr 1.5T, Kaggle, and OASIS-1. To overcome these problems, our proposed SCCAN model distinguishes itself from other recent SOTA methods in two key aspects: it achieves an average accuracy improvement of 1.31% across three datasets, and we analyze the FLOPs of our model to determine its efficiency. The proposed Stacked Convolutional Neural Network with a Channel Attention Network (SCCAN) aims to enhance feature extraction through a hierarchical multi-level approach, reduce computational complexity by focusing on the most relevant features, and improve generalization and robustness. By addressing these specific drawbacks, our model demonstrates superior performance and suitability for real-time deployment, as evidenced by our experimental results on multiple datasets. As shown in Tables 4, 5, 6, and 7, SCCAN achieves higher accuracy and reduced FLOPs, making it suitable for real-time applications; these comparison tables include most state-of-the-art AD recognition methods with various parameters. Our research contributes to advancing the state-of-the-art (SOTA) in AD detection through the following key innovations:

  • Unlike traditional CNN and attention-based models, SCCAN employs a stacked CNN structure with five integrated CNN modules for multi-level feature extraction. This hierarchical design enhances feature representation while reducing noise and computational demands, evidenced by lower FLOPs compared to conventional attention mechanisms (see Table 5).

  • Our model introduces a channel attention module that refines feature selection based on channel dimensions. This approach addresses the limitations of earlier models focused solely on spatial features, increasing SCCAN’s accuracy by capturing more nuanced AD characteristics in MRI data.

  • SCCAN’s performance was evaluated against recent SOTA models, demonstrating an average accuracy improvement of 1.31% across the ADNI1 Complete 1-Year 1.5T, Kaggle, and OASIS datasets. This evaluation, along with detailed FLOP and accuracy metrics (see Tables 4, 5, 6, and 7), highlights SCCAN’s efficiency and suitability for real-time applications in AD detection.

AD Dataset

In this study, we used three datasets: ADNI1 Complete 1-Year 1.5T (https://adni.loni.usc.edu), Kaggle55, and OASIS-156,57 to evaluate the proposed model.

ADNI1 1Yr 1.5T

We used the ADNI1 Complete 1-Year 1.5T dataset to perform multiclass classification. The primary objective of ADNI is to identify biomarkers for AD to facilitate early diagnosis, improve clinical treatments, and enhance our comprehension of the pathophysiology of AD. The dataset contains 2294 MRI scan files from a total of 639 subjects labeled AD, CN, and MCI. The subjects range in age from 50 to 91 years, and the data was collected at 57 sites across the United States and Canada. The 639 subjects are divided into three categories based on disease stage, namely AD: 133, CN: 195, and MCI: 311. Following previous studies1,53, we divided the images into 60% training, 20% testing, and 20% validation; each image has dimensions of \(256\times 256\) for the experiment.

Kaggle Dataset

This dataset comprises MRI scan images of unidentified patients and the corresponding class label information. It is worth mentioning that this is a multiclass dataset encompassing diverse perspectives and four distinct classes: a normal non-demented category and three categories representing various early stages of AD. The dataset has a total sample count of 6400; samples are three-channel RGB images with dimensions of \(176 \times 208\) pixels, categorized into four classes: mild demented (MID), moderately demented (MOD), non-demented (NOD), and very mild demented (VMD), as shown in Figure 1. The distribution of images for each class is depicted in Table 1. Notably, the dataset is not balanced. To address this issue, we applied the SMOTETomek technique58,59 to create synthetic data for each under-represented class to achieve balance with the other classes. The advantage of combining the SMOTE and Tomek-links techniques is the ability to mitigate knowledge loss and minimize overfitting60; a balancing sketch is given below. Following the existing data-splitting protocol1,22,23,27, we split the dataset into training, validation, and test sets with proportions of 60-20-20, respectively.
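The balancing step can be illustrated with a minimal sketch using imbalanced-learn’s SMOTETomek. The demo arrays below are stand-ins for the loaded Kaggle images and labels, and the sampler settings are illustrative assumptions rather than the paper’s exact configuration.

```python
import numpy as np
from imblearn.combine import SMOTETomek

# Stand-ins for the loaded Kaggle images (176x208 RGB) and integer class labels 0-3.
X = np.random.rand(200, 176, 208, 3).astype(np.float32)
y = np.random.randint(0, 4, size=200)

# SMOTETomek resamples 2D feature matrices, so flatten the images first.
X_flat = X.reshape(len(X), -1)
X_bal, y_bal = SMOTETomek(random_state=42).fit_resample(X_flat, y)

# Restore the image shape expected by the CNN.
X_bal = X_bal.reshape(-1, 176, 208, 3)
```

SMOTE synthesizes minority-class samples by interpolating between nearest neighbors, while the Tomek-link step removes borderline majority samples; this combination is what mitigates knowledge loss while limiting overfitting.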

Fig. 1
figure 1

Example image samples of the four classes from the AD dataset.

Table 1 AD datasets distribution.

OASIS-1 dataset

We collected the OASIS-1 dataset, which consists of four classes: NOD, VMD, MID, and MOD, and contains approximately 80k MRI images. For dataset balancing, a previous study used three classes: NOD, VMD, and MID61. Table 1 depicts the distribution of images for each class. Following that work, we also used three classes in this study; together they contain a total of 9600 MRI images, each of size \(176 \times 208\) pixels, divided equally among the classes. We divided the data into training, validation, and test sets at a 60-20-20 ratio, respectively, based on the existing protocol1,62.

Proposed methodology

In this study, we first preprocess the data and then feed it into the feature extraction and classification modules. Figure 2 demonstrates the preprocessing pipeline, visualizing the conversion of the .nii format to RGB format along the axial, sagittal, and coronal axes. The proposed Stacked CNN with Channel Attention Network (SCCAN) modules are designed for feature extraction and classification; the architecture details are shown in Figures 3, 4, and 5, respectively. Figure 3 depicts the proposed SCCAN model in more detail, including the convolutional blocks, dense blocks, and channel attention blocks. CNN architectures used in the biomedical field are loosely inspired by the structure of the human brain. Their predominant use lies in computer vision domains such as image classification, image segmentation, and object identification. Both recent and earlier machine learning and CNN-based methods make it possible to extract meaningful features directly from the dataset45, and CNN-based methods outperform traditional methods based on predefined features in most computer vision and image processing tasks. We propose a lightweight Stacked CNN with a Channel Attention Network (SCCAN) that performs well in AD classification. The proposed model comprises five CNN blocks (CNNBs) followed by a channel attention block. Each CNNB is equipped with a ReLU activation function and a 2D average pooling layer. Additionally, the model includes two dropout layers, two dense layers, and a Softmax layer for the classification task. The architecture of the proposed SCCAN network, detailing the structural design and data flow, along with the model summary, which includes specific layer configurations and parameters, is comprehensively examined in Table 2. This architecture consists of four modules: the convolutional blocks, channel attention, dense block, and classification module. M and N represent the width and height of the images, which vary by dataset; M=176 and N=208 for the Kaggle dataset. Additional details of each module are succinctly provided in the following subsections.

Table 2 Total number of parameters for the proposed CNN with a channel attention-based model.
Fig. 2
figure 2

Dataset gathering and preprocessing.

Preprocessing

We downloaded the ADNI-1 Complete 1Yr 1.5T dataset from the website (https://adni.loni.usc.edu/) as Neuroimaging Informatics Technology Initiative (NIfTI) files (.nii), containing a total of 2294 MRI scans from 639 subjects labeled as AD: 133, CN: 195, and MCI: 311. In preprocessing, we follow steps from previous studies25,63,64,65: first, we convert the .nii format into MRI images (.jpg format) for each class, i.e., AD, CN, and MCI; the entire preprocessing pipeline is available at this link (https://github.com/najm-h/Alzheimer/). Following previous studies1,53, the dataset is divided in a 60-20-20 ratio for training, testing, and validation. Next, we extracted these MRI images into three views, axial, sagittal, and coronal, using med2image (https://github.com/FNNDSC/med2image). We selected 150 subjects spanning MCI, CN, and AD (three classes), with 50 subjects from each class. From the axial slices, only those with indices 60 to 120 were used in this study, under the assumption that these images cover the areas with the most important features for the classification task, as in previous studies64,65. After checking the quality of the MRI images, we selected 60 slices per subject, resulting in a total of \(150\times 60 = 9000\) images. Figure 2 shows the preprocessing stage and the three planes automatically extracted from the ADNI dataset. Brain MRI images contain a wealth of information that can visualize organs in three planes: axial, sagittal, and coronal.
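For illustration, the following sketch shows the slice-export step with nibabel and Pillow instead of med2image, which the actual pipeline uses; the function name is hypothetical, and treating the third array axis as the axial axis assumes a canonically oriented volume.

```python
import os
import nibabel as nib
import numpy as np
from PIL import Image

def export_axial_slices(nii_path, out_dir, lo=60, hi=120):
    """Save axial slices with indices lo..hi of a NIfTI volume as RGB JPEGs."""
    vol = nib.load(nii_path).get_fdata()
    os.makedirs(out_dir, exist_ok=True)
    for idx in range(lo, min(hi, vol.shape[2])):
        sl = vol[:, :, idx]
        # Normalize intensities to 0-255 for 8-bit image export.
        sl = (255 * (sl - sl.min()) / (np.ptp(sl) + 1e-8)).astype(np.uint8)
        Image.fromarray(sl).convert("RGB").save(
            os.path.join(out_dir, f"slice_{idx:03d}.jpg"))
```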

Feature extraction based on convolutional blocks

To enhance the spatial information in the input data, we utilized a stack of Convolutional Neural Network Blocks (CNNBs), where “stack” denotes a series of CNNBs designed to overcome overfitting and extract hierarchical features. Each CNNB consists of a 2D convolutional layer, an activation function, and a pooling layer, as depicted in Figure 4(a). The convolutional layer captures spatial patterns, while the activation and pooling layers ensure efficient learning and dimensionality reduction. A kernel initializer was employed to initialize the weight matrix in the convolutional layer.

The convolutional operation applied to the input is mathematically described as:

$$\begin{aligned} F_{i,j,k}^{(n)} = \sum _{p=1}^{P} \sum _{q=1}^{Q} X_{i+p,j+q}^{(n-1)} \cdot W_{p,q,k}^{(n)} + b_k^{(n)} \end{aligned}$$
(1)

where \(F_{i,j,k}^{(n)}\) is the output feature map at position \((i,j)\) in the n-th convolutional layer for the k-th channel, \(X_{i+p,j+q}^{(n-1)}\) is the input feature map from the previous layer, \(W_{p,q,k}^{(n)}\) is the convolutional kernel, and \(b_k^{(n)}\) is the bias term.

To introduce non-linearity and enable faster learning, we applied the ReLU activation function to the output of the convolutional layer:

$$\begin{aligned} A_{i,j,k}^{(n)} = \text {ReLU}(F_{i,j,k}^{(n)}) = \max (0, F_{i,j,k}^{(n)}) \end{aligned}$$
(2)

Finally, to reduce the dimensionality of the feature maps while retaining the most important information, we employed an average pooling layer, defined as:

$$\begin{aligned} P_{i,j,k}^{(n)} = \frac{1}{m \times m} \sum _{p=1}^{m} \sum _{q=1}^{m} A_{i+p,j+q,k}^{(n)} \end{aligned}$$
(3)

where \(P_{i,j,k}^{(n)}\) is the pooled output, and \(m \times m\) is the size of the pooling window. The convolutional layers in the early stages of the model extract low-level patterns, such as lines, edges, and curves, which form the basis for subsequent short-range dependency features. As the data moves through deeper CNNBs, high-level features are extracted, improving the model’s ability to classify images accurately. To capture more detailed and comprehensive features, we employed five CNNBs in sequence. Each block extracts multilevel compelling features that help the model adapt to variations in the data while effectively reducing noise and disregarding irrelevant variations. The hierarchical feature map after N convolutional blocks can be represented as:

$$\begin{aligned} F_{StackCN}=F^{(N)} = \text {CNNB}_N\left( \text {CNNB}_{N-1}(... \text {CNNB}_1(X))\right) \end{aligned}$$
(4)

where \(F^{(N)}\) is the final hierarchical feature map obtained after the N-th convolutional block, which we denote as \(F_{StackCN}\) and feed into the channel attention module. This process ensures the extraction of both low-level and high-level spatial patterns, allowing the model to handle complex and variable input data effectively. The integration of multiple CNNBs ensures robustness in feature extraction, reducing overfitting while enhancing generalization and accuracy.
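A minimal Keras sketch of one CNNB (Eqs. 1–3) and the five-block stack (Eq. 4) follows; the filter counts, kernel size, and initializer are illustrative assumptions, since the exact layer configurations are those listed in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cnn_block(x, filters, kernel_size=3, pool_size=2):
    """One CNNB: Conv2D (Eq. 1) -> ReLU (Eq. 2) -> average pooling (Eq. 3)."""
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      kernel_initializer="he_normal")(x)  # initialized weight matrix
    x = layers.Activation("relu")(x)
    return layers.AveragePooling2D(pool_size)(x)

inputs = tf.keras.Input(shape=(208, 176, 3))  # (N, M, channels) for the Kaggle images
x = inputs
for filters in (32, 64, 64, 128, 128):        # five stacked CNNBs (Eq. 4)
    x = cnn_block(x, filters)
f_stack_cnn = x                               # F_StackCN, fed to channel attention
```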

Fig. 3
figure 3

Proposed Alzheimer recognition architecture.

Fig. 4
figure 4

Details of the (a) Convolutional Block (b) Dense Block (c) Channel Attention Block.

Fig. 5
figure 5

Classification Module.

Channel-wise Feature Selection

We implemented the channel attention (CA) module on the output of the stacked CNN to enhance the representation of features along the channels of the CNN layers. The CA module calculates a channel-wise weight matrix and then ranks the weights among the channels, retaining high-weight channels and discarding low-weight ones. Our aim in using the CA module66,67,68 is to select the effective (optimal) features and suppress the less optimal ones. This effective feature selection enhances the generalizability and discriminative capability of the proposed SCCAN model. In this research, the CA module enhanced the extracted feature maps by subjecting them to global average pooling, producing a distinctive output for each channel. Afterwards, every channel underwent a series of fully connected (FC) layers, followed by a batch normalization layer, and was associated with ReLU activation, generating either positive values or 0. The powerful feature vector emerges by multiplying the activation function’s output with the input features, enhancing the channel attention mechanism. To clarify, the CA module allocates a positive value to promising features and 0 to less impactful features. After the multiplication operation, the crucial features are isolated from the stacked CNN features, and unimportant features are transformed into zeros69. Figure 4(c) depicts the architecture of the CA employed in this study, showcasing how global average pooling receives input from N channels. We utilized a dense layer of size N/4, following70, and passed it through a batch normalization layer to mitigate internal covariate shift and avoid excessively small gradients. After the ReLU activation layer, we employed another FC layer of size N, whose output yields the per-channel attention scores (Equation 5). The preference for ReLU activation in the hidden layer was based on its lower computational complexity compared to the sigmoid function. Mathematically, the channel attention mechanism is defined in Equation 5, where \(F_{StackCN}\) is passed through the channel attention module. Global average pooling (GAP) is applied to aggregate spatial information across the channels, and the resulting vector is passed through two fully connected layers to compute the channel-wise weights. This process is described by:

$$\begin{aligned} F_{CA}=F^{CA}_c = \sigma \left( W_2 \cdot \text {ReLU}\left( W_1 \cdot \left( \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} F_{StackCN,i,j,c}\right) + b_1\right) + b_2\right) \cdot F_{StackCN,c} \end{aligned}$$
(5)

Where:

  • \(F_{StackCN,i,j,c}\) represents the feature value at location \((i,j)\) for channel c from \(F_{StackCN}\).

  • GAP aggregates the spatial information over the height (H) and width (W) of each channel.

  • \(W_1\) and \(W_2\) are the weights of the fully connected layers, and \(b_1\) and \(b_2\) are their biases.

  • \(\sigma\) is the sigmoid function, producing the attention score for each channel.

  • \(F^{CA}_c\) is the final output feature map after the channel attention is applied.

By multiplying the attention weights with the original feature map \(F_{StackCN,c}\), we obtain the refined feature map \(F^{CA}_c\), where important channels are emphasized and less relevant channels are suppressed; the result is denoted \(F_{CA}\) and fed into the classification module in the next step. This refined feature map captures the most critical information across channels, enhancing the model’s ability to discriminate between different features for classification.
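A hedged Keras sketch of the module is given below: GAP, an FC bottleneck of size N/4 with batch normalization and ReLU, an FC layer of size N, and channel-wise rescaling. The final activation follows the sigmoid \(\sigma\) of Eq. 5; the stand-in input shape is an assumption matching the earlier stack sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, reduction=4):
    """GAP -> FC(N/4) -> BatchNorm -> ReLU -> FC(N) sigmoid gate -> rescale (Eq. 5)."""
    n = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)        # aggregate over H x W per channel
    w = layers.Dense(n // reduction)(w)           # FC of size N/4
    w = layers.BatchNormalization()(w)            # mitigates internal covariate shift
    w = layers.Activation("relu")(w)
    w = layers.Dense(n, activation="sigmoid")(w)  # per-channel attention scores
    w = layers.Reshape((1, 1, n))(w)              # broadcast over spatial dimensions
    return layers.Multiply()([x, w])              # F_CA = scores * F_StackCN

f_stack_cnn = tf.keras.Input(shape=(6, 5, 128))   # stand-in for the stacked-CNN output
f_ca = channel_attention(f_stack_cnn)
```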

Classification module

In the classification module, the flatten layer transforms the multi-dimensional feature map into a 1D array, followed by a dropout layer, which serves as a regularization method, as shown in Figure 5. Finally, the softmax layer converts the raw outputs into a probability distribution, selecting the class with the highest probability for precise class predictions. The main components are explained as follows:

Flatten layer

The flatten layer converts the feature map from the channel attention module into a one-dimensional vector. If the input feature map has dimensions \((H \times W \times C)\), the flattening process reshapes it to a vector of size \(H \times W \times C\), denoted as:

$$\begin{aligned} \mathbf {F_{flatten}} = \text {Flatten}(F^{CA}) \in \mathbb {R}^{H \times W \times C} \end{aligned}$$
(6)

This operation ensures that the attention-weighted features are compatible with the fully connected layers, which follow the flattening.

Dense block

The dense block, shown in Figure 4(b), consists of a fully connected layer with ReLU. The ReLU activation function is applied to hidden layers for computational efficiency:

$$\begin{aligned} \textbf{h}^{(l)} = \text {ReLU}(W^{(l)} \cdot \mathbf {F_{flatten}} + b^{(l)}) \end{aligned}$$
(7)

where \(W^{(l)}\) and \(b^{(l)}\) are the weight matrix and bias for the dense layer l, and \(\textbf{h}^{(l)}\) is the output after applying ReLU activation.

Dropout Layer

The dropout layer is introduced after the flattening and dense blocks to reduce the risk of overfitting by randomly turning off some neurons during training. Dropout is controlled by a probability \(p_{dropout}\) that governs the proportion of neurons to deactivate:

$$\begin{aligned} \hat{h}^{(l)} = {\left\{ \begin{array}{ll} 0, & \text {with probability } p_{dropout} \\ h^{(l)}, & \text {with probability } 1 - p_{dropout} \end{array}\right. } \end{aligned}$$
(8)

where \(h^{(l)}\) represents the activations of layer l, and \(\hat{h}^{(l)}\) is the result after dropout is applied. By deactivating a portion of neurons, the model becomes less sensitive to overfitting, improving its generalization performance. The output of the dropout layer is fed into the next dense layer.

Probabilistic map with softmax activation

Finally, the output of the dense layer is passed through the SoftMax function for multi-class classification. The SoftMax function converts the logits into probabilities for each class, enabling accurate prediction:

$$\begin{aligned} P(y = i \mid X) = \frac{\exp (W_i \cdot \textbf{h} + b_i)}{\sum _{j=1}^{C} \exp (W_j \cdot \textbf{h} + b_j)} \end{aligned}$$
(9)

where \(P(y = i \mid X)\) is the probability that the input X belongs to class i, C is the number of classes, and \(W_i\) and \(b_i\) are the weights and bias for class i. The final class prediction is determined by selecting the class with the highest probability:

$$\begin{aligned} \hat{y} = \arg \max _{i} P(y = i \mid X) \end{aligned}$$
(10)

where \(\hat{y}\) is the predicted class label. Thus, the classification module combines dropout, flattening, and fully connected layers to generate accurate multi-class predictions while preventing overfitting and ensuring generalization.
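Putting Eqs. 6–10 together, a minimal Keras sketch of the classification module follows. The dropout rate, dense-layer sizes, and stand-in input shape are illustrative assumptions; num_classes = 4 corresponds to the Kaggle setting.

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 4                               # e.g., MID / MOD / NOD / VMD
f_ca = tf.keras.Input(shape=(6, 5, 128))      # stand-in for the channel-attention output

x = layers.Flatten()(f_ca)                    # Eq. 6
x = layers.Dropout(0.3)(x)                    # Eq. 8 (rate is an assumption)
x = layers.Dense(128, activation="relu")(x)   # Eq. 7, first dense layer
x = layers.Dropout(0.3)(x)                    # second dropout layer
x = layers.Dense(64, activation="relu")(x)    # second dense layer
outputs = layers.Dense(num_classes, activation="softmax")(x)  # Eqs. 9-10

classifier = tf.keras.Model(f_ca, outputs, name="classification_module")
```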

Experimental evaluation

To evaluate the proposed SCCAN model, we utilized three Alzheimer’s benchmark datasets and included state-of-the-art comparisons. We describe the environmental setup, conduct an ablation study, and present the experimental evaluation results for each dataset in the following sections.

Hyperparameter setting and environmental setup

We implemented the system in a Python environment using various TensorFlow modules. During hyperparameter tuning, we experimented with different configurations to optimize model performance. The initial learning rate was set to 0.01, which empirical evaluation showed to provide stable convergence. We tested several optimization algorithms, including Adam, RMSProp, and Stochastic Gradient Descent (SGD), and after evaluating performance, we selected SGD as the optimizer due to its superior accuracy and convergence across all three datasets. We also tuned the batch size and determined that a size of 12 was optimal for balancing computational efficiency and model accuracy. Additionally, for the classification tasks, we opted for Categorical Cross-Entropy (CCE) as the loss function, as it consistently outperformed Mean Squared Error (MSE) when paired with the SoftMax output layer. The model was trained on an NVIDIA GeForce RTX 3090 GPU machine running Ubuntu with CUDA 12.1 and 48 GB RAM. We ensured a comprehensive assessment of model robustness by employing various metrics, including accuracy, loss, AUC, ROC curve, extended ROC curve, F1-Score, precision, and confusion matrix. These metrics provided a detailed analysis of the model’s performance, offering a holistic understanding of its effectiveness in classification tasks.
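The configuration above can be expressed as the following training sketch, assuming `model` is the assembled SCCAN from the earlier sketches and `X_train`/`y_train`, `X_val`/`y_val` are the prepared image arrays with one-hot labels; the epoch count is an assumption, as it is not stated in the text.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # selected over Adam/RMSProp
    loss="categorical_crossentropy",                        # CCE with the SoftMax output
    metrics=["accuracy",
             tf.keras.metrics.AUC(name="auc"),
             tf.keras.metrics.Precision(name="precision")])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=12,   # tuned batch size
                    epochs=100)      # assumed epoch count
```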

Ablation study

We performed an ablation study of the proposed model to demonstrate the system’s superiority, as shown in Table 3. We analyzed the impact of different configurations on the model’s accuracy, providing insights into the effectiveness of specific components and techniques in the proposed SCCAN model. Our objective in using a stacked CNN is to extract long-range dependency features, which reveal complex relationships among the pixels in the spatial domain, thereby enhancing the feature maps. We then apply a channel attention module to select the most significant features from these long-range dependency features. In the ablation study, we enhanced spatial feature maps using 3, 4, and 5 convolutional blocks and found that using two blocks or more than five blocks decreased accuracy. Additionally, we highlighted the necessity of channel attention for feature selection and the impact of data balancing techniques. While the ablation study may appear straightforward, our goal was to illustrate the power of stacked CNN features with and without channel attention. Our findings indicate that fewer than three convolutional blocks produce low accuracy, suggesting that overly simple relationships fail to capture the essential features needed to represent the AD classes accurately. Conversely, overly complex relationships with more than five blocks do not effectively represent the actual AD classes either. In the first ablation study, we stacked 4 CNN modules with channel attention and balancing, which achieved 98.29% accuracy. We also experimented with 3 CNN blocks with the balancing technique, which yielded 98.21% accuracy. Using five CNN blocks without the channel attention module (with data balancing) produced 97.24% accuracy, while five CNN blocks with both the channel attention module and data balancing achieved 99.22% accuracy.

Table 3 Ablation study of the proposed SCCAN model with Kaggle dataset.

Experimental evaluations with Kaggle dataset

We evaluated the performance of the proposed SCCAN model using the overall metrics depicted in Table 4. We compared the results of the stacked CNN with channel attention network (SCCAN) to the SOTA algorithms in terms of model accuracy, loss, ROC curve, extension of the ROC curve, and confusion matrix. These comparative metrics are briefly explained as follows.

Performance metrics: Accuracy, Precision, F1-Score, and Confusion matrix

The experimental performance metrics and the SOTA comparison are shown in Table 4, including the existing methods, dataset name, number of images, and data splitting ratio. We observe that the proposed model achieves remarkable accuracy, precision, and F1-Score compared to all other models. Figure 8 visualizes the accuracy curves of the proposed model and SOTA models after each epoch, while Figure 10 presents the confusion matrix. Besides the proposed SCCAN model, we also experimented with recent models to make the SOTA comparison, including InceptionResNet V2, ResNet50, AlexNet28, VGG1671, Deep-ensemble72, Demnet23, CNN-SVC22, DenseNet201-SVM27, ADD-Net1, FDCT-WR26, and CNN24, in terms of accuracy, as depicted in Table 4. Among all existing systems, we observed the most comparable performance accuracy in Demnet23, CNN-SVC22, DenseNet201-SVM27, ADD-Net1, FDCT-WR26, Conv-BLSTM with SMOTE73, and CNN24. Kamal et al.22 proposed an AD detection system using SpinalNet with a CNN module on the Kaggle MRI AD dataset, also employing microarray gene expression data to recognize the disease with several machine learning algorithms, and achieved 97.60% accuracy. Chabib et al.26 utilized a curvelet transform-based CNN module, namely DeepCurvMRI, achieving 98.62% accuracy. Kim et al.24 employed an ensemble of ten 1D CNNs to classify AD and reported 98.60% accuracy with a loss of 0.0386. Murugan et al.23 proposed a CNN network for AD detection using MRI images; they evaluated their model on the Kaggle dataset and reported 95.23% accuracy and a 97% AUC. Sharma et al.27 introduced a hybrid AI-based model that combines permutation-based ML and a transfer learning-based voting classifier in two phases of feature extraction, reporting 91.75% accuracy and a 96.50% F1 score. Fareed et al.1 employed image augmentation techniques to balance class labels among AD disease classes; they utilized a CNN-based feature extraction and classification module, reporting 97.05% accuracy. Our proposed model achieved 99.22% accuracy, surpassing all existing models in this domain.

Table 4 F1 score, Recall, Precision, AUC, and Accuracy of existing algorithms on the Kaggle dataset.
Fig. 6
figure 6

ROC curves of the Add-Net and proposed SCCAN model with the Kaggle dataset.

Fig. 7
figure 7

Extensive ROC curves for the multiclass case of the Add-Net and the proposed model with the Kaggle dataset.

Fig. 8
figure 8

Accuracy curves of the Add-Net and proposed SCCAN model with the Kaggle dataset.

Loss and required model parameters

Table 5 details the number of parameters, model loss, and FLOPs of the proposed SCCAN model in comparison to SOTA models. The SCCAN model achieved a loss value of 0.0284 while requiring fewer parameters (1,241,140) and fewer FLOPs, demonstrating superior performance. Figure 9 illustrates the loss curves of the proposed SCCAN model and the ADD-Net model after each epoch.

Fig. 9
figure 9

Loss curves of the Add-Net and proposed SCCAN model with the Kaggle dataset.

Table 5 Computational cost comparison with state-of-the-art models.

ROC, extensive ROC curves

The AUC of the proposed model is shown in Table 4, where we observe that the proposed SCCAN model achieved the best AUC value of 99.92%. The ROC and extensive ROC curves of the proposed model are demonstrated in Figures 6 and 7, respectively, alongside the SOTA Add-Net1 model.

Experimental evaluations with ADNI1 Complete 1Yr 1.5T dataset

We evaluate the proposed SCCAN model using the “ADNI1 Complete 1Yr 1.5T” dataset and compare its performance with existing models, as shown in Table 6. The table includes performance metrics along with details such as reference number, type of input, number of MRIs, number of images, dataset splitting, and methods, where reported. The proposed model is most comparable with recent models such as VGGNet1664, DCNN75, VGGNet1676, CNN-VGG1665, Deep ensemble72, TriAD25, and ViT53. Gunawardena et al.76 developed two methods, SVM-based and CNN-based, utilizing slice-based MRI images with preprocessing steps: they converted the 3D MRI volumes to 2D images, selected 1615 coronal-plane images, and then randomly chose 1292 images for training and 323 for testing. Billones et al.64 used different tools for the reconstruction process and obtained coronal image slices of size \(256\times 256\) in PNG format. They selected slice indices 111 to 130 under the assumption that these slices cover the most informative areas for feature extraction. They used a CNN with VGGNet16 for the classification of AD, CN, and MCI and reported an accuracy of 91.85%. Jain et al.65 selected an ADNI subset of 150 subjects, including 50 AD, 50 CN, and 50 MCI, for the classification task. Each brain MRI is in NIfTI format; the NIfTI images are volumetric (3D) images of size \(256\times 256\times 256\) after preprocessing, containing 2D images called MRI slices. They selected 4800 slices (150 subjects \(\times\) 32 slices, consisting of 1600 AD, 1600 NC, and 1600 MCI slices). They used a transfer learning-based approach, including a CNN architecture and the pre-trained VGG16 model for feature extraction, and reported an accuracy of 95.18%. Kundram et al.75 proposed a DL-based method using the ADNI-1 dataset for classifying subjects with AD, MCI, and NC and reported a good accuracy of 98%. Mercaldo et al.25 extracted 6768 MRI images from the same ADNI dataset and developed a DL-based model, TriAD, consisting of two CNN blocks called T-blocks for AD detection; they reported an accuracy of 95.18%. Alp et al.53 proposed a vision transformer (ViT)-based model to extract features from the ADNI1 Complete 1Yr 1.5T dataset and utilized a time-series transformer to classify the features. They achieved an accuracy of 91.24%, though at the cost of computational efficiency. Similarly, Liu et al.74 proposed an attention-based CNN mechanism to improve feature representation capability. They performed segmentation tasks using ADNI MRI data to obtain gray matter and white matter, carried out binary classification tasks, and reported an accuracy of 95.37% for AD vs. MCI. Abdulazeem et al.77 proposed a CNN-based model for AD classification using 211,655 images extracted from the ADNI dataset and reported an accuracy of 97.50% for multi-class classification. Similarly, Savas et al.63 used preprocessing steps comparable to ours: they first converted the .nii format to .png with Python code, exporting 166 frames from each .nii image and selecting the middle images. They selected 4364 slices and split them into training, testing, and validation sets at an 80:10:10 ratio. They then compared 29 pre-trained models and determined that EfficientNetB3 achieved the highest accuracy, 97.28%. The main difference is that their method explores pre-trained models, whereas our proposed model is a novel architecture.
In summary, following the same preprocessing steps as previous studies25,63,64,65, the proposed SCCAN model achieved the best performance with an accuracy of 99.58%, a loss of 0.0174, a precision of 99.58%, a recall of 99.58%, and an F1 score of 99.66%, demonstrating that it outperforms the existing methods. Figure 11 and Figure 12(a) depict the proposed model’s loss/accuracy curves and ROC curve, respectively, demonstrating that our model performs well during training and validation. Figure 12(b) shows the confusion matrix of the SCCAN model.

Table 6 Performance of the proposed SCCAN model for ADNI1 Complete 1Yr 1.5T dataset with SOTA.
Fig. 10
figure 10

Confusion matrix of the Add-Net and proposed SCCAN model with the Kaggle dataset.

Fig. 11
figure 11

Loss and accuracy curves for the ADNI1 Complete 1Yr 1.5T dataset.

Fig. 12
figure 12

ROC curve (a) and confusion matrix (b) of the proposed SCCAN model with the ADNI1 Complete 1Yr 1.5T dataset.

Fig. 13
figure 13

Loss and accuracy curves of the proposed SCCAN model with OASIS-1 dataset.

Experimental evaluations with OASIS-1 dataset

In Table 7, we present the numerical results regarding performance accuracy for AD classification on the OASIS-1 dataset alongside existing models. Among the existing models, Mohammad et al.78 proposed a hybrid model consisting of AlexNet+SVM and ResNet-50+SVM for AD classification and reported an accuracy of 94.80%. Kabir et al.79 presented a DL-based approach with an 18-layer architecture using the same dataset for multi-class classification and reported an accuracy of 92.00%, though at considerable computational cost. The existing methods with accuracy most comparable to ours are those in62,72. Loddo et al.72 presented a deep ensemble strategy using the OASIS dataset with slice-based AD diagnosis and reported an accuracy of 98.51%. Almufareh et al.62 employed attention-based techniques with a Vision Transformer (ViT) approach for AD detection using the same OASIS dataset and reported an accuracy of 99.06%. In comparison, our proposed SCCAN model achieved an accuracy of 99.70%, demonstrating a notable improvement over existing methods. Figure 13 shows the accuracy and loss curves for the proposed model; the curves indicate that our model performs well on this dataset, with consistent accuracy and loss metrics for both the training and testing phases.

Table 7 Performance of the proposed SCCAN model for OASIS dataset.

Discussion

Our contribution lies in the novel combination and adaptation of these methodologies to address the specific challenges of AD classification using MRI images. Our approach, SCCAN, integrates multiple CNN modules in a stacked architecture, enhancing feature extraction and reducing noise in MRI images. The stacked CNN approach provides comprehensive multi-level feature extraction, surpassing traditional single-CNN architectures in capturing complex patterns in medical images. The channel attention mechanism in our model prioritizes important features, achieving higher accuracy, especially with limited training data. Our model achieves superior performance with accuracies of 99.58%, 99.22%, and 99.70% on the ADNI-1, Kaggle, and OASIS-1 datasets, respectively, demonstrating robustness and generalizability across different datasets. This performance proves the effectiveness of our approach in advancing the state-of-the-art in this domain. We compared previous models, both DL-based and channel attention-based, with our proposed model in terms of performance accuracy, as demonstrated in Tables 4, 6, and 7. We recognize the need for a more detailed error analysis and have provided insights based on the confusion matrices for the multi-class classification tasks depicted in Figures 10 and 12(b). Most errors occurred between closely related stages, such as VMD and NOD in the four-class problem and CN and MCI in the three-class problem. Despite these minimal error rates (less than 1%), further feature refinement and data augmentation could enhance the model’s performance. Overall, the model’s strong handling of false positives and negatives confirms its reliability for real-world diagnostic applications. The computational complexity of the proposed SCCAN model is addressed through a detailed comparison with SOTA models, as shown in Table 5. Besides the reduced parameter count, we evaluated the model in terms of FLOPs and model loss, which are critical indicators of computational efficiency. The significant reduction in parameters and FLOPs indicates that our model is computationally more efficient while maintaining competitive accuracy. These reductions are essential for clinical deployment, where real-time performance and resource efficiency are necessary. Lower FLOP values ensure faster inference times and reduced computational resource needs, making our model suitable for deployment in environments with limited hardware, such as portable medical devices or systems with limited computational power. Our proposed model’s architecture, combining stacked CNN modules with a channel attention mechanism, is highly versatile and can be applied to fields beyond Alzheimer’s disease classification, such as cancer detection, cardiovascular disease diagnosis, anomaly detection, and facial recognition.

Conclusion

In this study, we proposed the SCCAN model as a groundbreaking solution for AD classification through Magnetic Resonance Imaging (MRI). We introduced several advancements aligning deep learning techniques with the contemporary needs of medical image diagnosis. The model’s distinctive feature extraction mechanism, leveraging a hierarchical approach through multi-level extraction, contributes to a reduced-parameter model tailored for efficient training on smaller datasets. Moreover, the SCCAN model enhances interpretability by leveraging channel attention mechanisms for effective feature selection. Extensive experiments demonstrated that our proposed SCCAN model achieves better performance than existing SOTA models. For future work, we plan to collect more data, explore the integration of advanced transfer learning techniques, and investigate the SCCAN model’s performance across diverse datasets, which could provide valuable insights and further enhance its applicability in broader medical imaging contexts.