Introduction

With the aging of the population worldwide, Alzheimer's disease (AD) has become a rapidly growing global public health concern1. The progression of AD is irreversible, but early diagnosis and intervention can effectively slow its onset and progression, thereby improving the quality of life for elderly patients2. The development of AD involves complex changes in brain structure and function, shaped by intricate interactions among multiple pathological mechanisms3. Structurally, individuals with AD commonly exhibit cortical thinning, reduced gray matter volume, and hippocampal atrophy1. Functionally, AD patients often show weakened connectivity in the default mode network4,5. However, current AD diagnostic studies are typically limited to a single modality, focusing either on brain structure or function; this approach reflects only part of the changes and fails to fully capture the distinctive pathological features of early AD patients6,7. Combining clinical data from brain imaging and neuropsychological assessments offers a comprehensive view of the brain, enhancing diagnostic accuracy and informing treatment strategies. This integrative approach can detect subtle brain changes that are not visible with a single imaging modality. By incorporating multimodal clinical information, our study aims to enhance the early detection and classification of AD, ultimately contributing to better patient outcomes.

Magnetic resonance imaging (MRI) techniques, including functional MRI (fMRI) and structural MRI (sMRI), in conjunction with neuropsychological features, have emerged as critical tools for the early diagnosis of AD8,9. fMRI, which utilizes the blood-oxygen-level-dependent (BOLD) contrast10, serves as a precise tool for measuring neural activity and functional changes in the brain. Functional connectivity (FC) can be assessed by analyzing the correlation among BOLD signals from different brain regions, which sensitively detects functional changes associated with AD pathology11,12. sMRI enables a comprehensive examination of brain structures in individuals with AD13. Texture features, such as gray-level co-occurrence matrices (GLCMs), obtained from sMRI images provide valuable information on gray-level directions and variations, which are significant for the diagnosis and prognosis of AD14,15,16. Additionally, neuropsychological assessment scales can evaluate individuals' cognitive, emotional, and behavioral functions. The mini-mental state examination (MMSE) and the clinical dementia rating (CDR) are convenient, commonly used tools for the early clinical diagnosis of AD. Together, these multimodal data capture biological information across the dimensions of brain structure, brain function, and cognitive level, offering new perspectives for disease research and intelligent diagnostic assistance compared with single-modal structural or functional information.

Recently, convolutional neural networks (CNNs) have progressively emerged as the leading and most effective tools in medical image feature extraction and classification tasks17,18,19. However, current CNN architectures face considerable challenges that limit their application and performance in AD diagnosis. Specifically, the fixed-size convolutional kernels in CNNs limit their ability to learn recurring patterns and lack sufficient receptive fields to capture global features20,21. This limitation narrows the effective field of view for feature extraction and hinders the exchange of information between distant features, which may diminish the accuracy of feature recognition and mapping.

To address these challenges, this study introduces an innovative group self-calibrated coordinate attention network (GSCANet). GSCANet utilizes a group self-calibrated convolution module22 and a coordinate attention module23 to facilitate the identification of multiscale features and improve the exchange of information between feature maps. The group self-calibrated convolution module22 significantly improves the exchange of information between feature mappings, enabling more comprehensive feature extraction. This method employs differently sized convolutional kernels within the same network layer to identify multiscale features, thereby expanding the receptive field and enhancing the recognition accuracy of feature mapping20. Recent research has shown that expanding the receptive field has the potential to improve classification performance in AD diagnosis24,25. The implementation of group self-calibrated convolution thus yields a more comprehensive and diverse set of AD brain image features, resulting in more accurate classification.

Furthermore, we employed a coordinate attention module that assigns different weights to input features, effectively eliminating redundant information and accelerating CNN convergence, thereby improving model performance26. There are two primary types of attention mechanisms: channel-wise and spatial-wise. Channel-wise attention assigns different weights to feature map channels, enabling the network to prioritize specific attributes, such as brightness and texture. Spatial-wise attention allocates weights across spatial locations within the feature map, highlighting important areas, such as the hippocampus27, that contain critical information. However, previous models ignored interactions between channel and spatial attention28. To address this issue, we employed coordinate attention23 to compress spatial information into channel descriptors. Specifically, in this study, channel relations and long-term dependencies are encoded as precise positional information, sensitive to channel and spatial variations when capturing cross-channel information. Feature processing is coordinated and enhanced by selectively focusing on spatial and channel information, which helps suppress irrelevant details. By integrating coordinate attention into our CNN model, we can effectively capture a more diverse and critical set of AD brain image features, achieving improved performance in classification tasks.

In conclusion, we proposed a CNN-based GSCANet model that incorporates group self-calibrated convolution and coordinate attention modules. The key contributions of our study are as follows:

(1) We used multimodal features including GLCM, FC, and neuropsychological features (CDR/MMSE) for classification, as changes in brain structure and function in AD are not independent. This approach captures the complex relationships among brain structural morphology, functional dynamics, and cognitive impairment, improving classification accuracy compared with traditional single-modality methods that rely solely on structural MRI images.

(2) We incorporated group self-calibrated convolution into our model to expand the receptive field through varied-sized convolutional kernels within the same layer. This technique enhances feature extraction by capturing a broader and more detailed representation of AD brain images.

(3) We employed a coordinate attention module that selectively focuses on critical spatial and channel information to suppress less relevant details. This targeted focus enables the capture of a diverse and critical set of features, leading to improved performance in classifying AD brain images.

The overall workflow is shown in Fig. 1. We extracted features from multimodal data which were input into the GSCANet model. Subsequently, the model was applied to the multi-classification study of AD, MCI and NC using ten-fold cross-validation to evaluate its performance.

Figure 1

Workflow of this study. Abbreviations: GSCANet, group self-calibrated coordinate attention network; AD, Alzheimer's disease; LMCI, late mild cognitive impairment; EMCI, early mild cognitive impairment; NC, normal control; sMRI, structural MRI; FC, functional connectivity.

Materials and methods

Subjects

Two sources of data were utilized in this study: the Alzheimer's Disease Neuroimaging Initiative (ADNI) (http://adni.loni.usc.edu/) and the database of the First Hospital of Jilin University. The ADNI database provided baseline MRI scans of 637 participants (143 AD, 264 MCI, and 143 NC). Based on the severity of disease symptoms, the MCI group was subdivided into 92 EMCI and 172 LMCI participants.

The First Hospital of Jilin University contributed data from 195 participants (60 AD, 62 MCI, and 73 NC). This study was approved by the ethics committee of the First Hospital of Jilin University, and all procedures were conducted in accordance with the Declaration of Helsinki. All participants signed written informed consent before the experiment and underwent cognitive psychological evaluations, including the MMSE and the CDR. Combined with the scale information, the diagnosis and enrollment of the participants were completed by experienced clinical neurologists.

MRI acquisition

All ADNI participants were scanned using a Philips 3T MRI scanner. The fMRI parameters were TR = 3000 ms, TE = 30 ms, flip angle = 90°, number of slices = 48, slice thickness = 3.3 mm, FOV = 256 × 256 mm, and 140 time points. For T1-weighted imaging (T1w), the parameters were TR = 8.9 ms, TE = 3.9 ms, flip angle = 8°, slice thickness = 1 mm, and FOV = 256 × 256 mm.

Participants recruited from the First Hospital of Jilin University were scanned using a 3T Siemens MRI scanner equipped with a standard head coil to acquire resting-state MRI images of the brain. Prior to acquisition, the participants were instructed to keep their eyes closed and remain awake. Both T1w and rs-fMRI images were obtained for all participants. The fMRI parameters were TR = 2500 ms, TE = 27 ms, flip angle = 90°, and FOV = 230 × 230 mm; the T1w parameters were TR = 8.5 ms, TE = 3.3 ms, flip angle = 12°, and FOV = 256 × 256 mm.

Data preprocessing

Visual quality control was conducted on the T1w MRI images. Images of low quality, marked by incomplete brain coverage, a low signal-to-noise ratio, or visible artifacts, were excluded from the analysis. Preprocessing of the T1w MRI images was performed using the Statistical Parametric Mapping (SPM) software (version 12)29. The preprocessing involved skull stripping, denoising, alignment of the T1w MRI to the Montreal Neurological Institute (MNI) space, and resampling to a resolution of 1 × 1 × 1 mm, all of which facilitated subsequent analysis.

The preprocessing protocol for the fMRI data was implemented using SPM 1229. The protocol included removal of the initial ten time points, slice-timing correction, motion correction, and registration and spatial normalization to the MNI template. Linear drift was corrected, followed by the application of a bandpass filter in the range 0.01–0.1 Hz. Covariates, including six motion parameters as well as the mean signals from the white matter and cerebrospinal fluid, were regressed out to mitigate their potential influence. Finally, a visual quality control check was performed to discard any incorrect alignments from the dataset.

Haralick texture features extraction

GLCM describes images by quantifying the relationship between pairs of gray values in grayscale images. Haralick30 defined 14 statistical parameters to characterize texture features. In this study, we focused on a specific set of these parameters (Fig. 2). Ten feature maps, including the original image, were employed to describe the texture information of the brain image. These feature values were selected based on their effectiveness in differentiating between normal and abnormal brain tissues and were used in subsequent analyses.

Figure 2

Computation of the gray-level co-occurrence matrix (GLCM). The GLCM was utilized to extract attribute values for texture features, which were employed to describe the structural information of the image. Abbreviations: Std, standard deviation; ASM, angular second moment; Max, maximum.

Consider an image I with a pixel at coordinates (x, y) with gray value \(l_i\), and its neighboring pixel at coordinates (x + Δx, y + Δy) with gray value \(l_j\). The GLCM counts the number of occurrences of pixel pairs with gray values (\(l_i\), \(l_j\)) along a direction θ, separated by d steps, denoted as M(i, j | d, θ). Typically, θ takes the values 0°, 45°, 90°, or 135°. The GLCM reflects the grayscale variation of the image. For an image I with L gray levels, each element of the L × L matrix can be represented by C(\(l_i\), \(l_j\)). The corresponding formula is shown in (1).

$$C(l_i,l_j)=\frac{M(i,j\mid d,\theta)}{\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}M(i,j\mid d,\theta)}$$
(1)

After calculating the GLCM, it was normalized to obtain P(i, j), as given by (2).

$$P(i,j)=\begin{bmatrix}C(l_0,l_0)&C(l_0,l_1)&\cdots&C(l_0,l_j)&\cdots&C(l_0,l_{L-1})\\C(l_1,l_0)&C(l_1,l_1)&\cdots&C(l_1,l_j)&\cdots&C(l_1,l_{L-1})\\\vdots&\vdots&&\vdots&&\vdots\\C(l_i,l_0)&C(l_i,l_1)&\cdots&C(l_i,l_j)&\cdots&C(l_i,l_{L-1})\\\vdots&\vdots&&\vdots&&\vdots\\C(l_{L-1},l_0)&C(l_{L-1},l_1)&\cdots&C(l_{L-1},l_j)&\cdots&C(l_{L-1},l_{L-1})\end{bmatrix}$$
(2)


Using this approach, we extracted features from the image by computing frequencies of specific gray voxel value pairs. Nine features were extracted from T1w MRI using the GLCM feature extraction method.
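To make this construction concrete, the following Python sketch builds a normalized GLCM for one 2D slice. The quantization to 16 gray levels and the offset convention (θ = 0° with d = 1 corresponding to (Δx, Δy) = (1, 0)) are illustrative assumptions, not the exact settings used in this study.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=16):
    """Normalized GLCM of a 2D slice for the offset (dx, dy), cf. Eqs. (1)-(2).

    theta = 0 deg with d = 1 corresponds to (dx, dy) = (1, 0); other angles
    follow by changing the offset, e.g. 90 deg -> (0, 1).
    """
    # Quantize intensities into `levels` gray levels (an assumed setting).
    q = np.floor(img.astype(float) / (img.max() + 1e-8) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)

    M = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(max(0, -dy), min(h, h - dy)):      # stay inside the image
        for x in range(max(0, -dx), min(w, w - dx)):
            M[q[y, x], q[y + dy, x + dx]] += 1        # count pair (l_i, l_j)

    return M / M.sum()                                 # P(i, j), sums to 1
```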

The mean value, Mean, is a statistical parameter that describes the degree of regularity of the texture; it is defined as

$$Mean=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}P(i,j)\cdot i,$$
(3)

where \(P(i,j)\) is the normalized GLCM. The standard deviation, Std, which measures the deviation of the true image from the Mean, is defined as

$$Std=\sqrt{\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}P(i,j)\cdot(i-Mean)^{2}}.$$
(4)

Contrast (Con) is another statistical parameter that quantifies the local variation present in the image; it is given by

$$Con=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}(i-j)^{2}P(i,j).$$
(5)

Dissimilarity (Dis) is a measure of the total amount of local contrast variation present in the image; it is defined as

$$Dis=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}|i-j|\,P(i,j).$$
(6)

Homogeneity (Hom) quantifies the local smoothness of an image; it is defined as

$$Hom=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}\frac{P(i,j)}{1+(i-j)^{2}}.$$
(7)

The angular second moment (ASM) is a metric that describes the uniformity of the image grayscale distribution and the texture thickness; it is calculated as

$$ASM=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}\left[P(i,j)\right]^{2}.$$
(8)

Energy (\(S_E\)) measures whether a textured pattern is relatively homogeneous and regularly varying; it is defined as

$$S_{E}=\sqrt{ASM}.$$
(9)

The maximum value (Max) is the largest entry of the normalized GLCM, given by

$$Max=\max_{i,j}P(i,j).$$
(10)


Entropy (ENT) quantifies the randomness and complexity of the image; it is defined as

$$ENT=-\sum_{i=0}^{L-1}\sum_{j=0}^{L-1}P(i,j)\cdot\log P(i,j).$$
(11)

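Given a normalized GLCM P, the nine features above reduce to a few array operations. The sketch below mirrors Eqs. (3)–(11) directly in NumPy; it is a minimal illustration rather than the exact extraction code used in this study.

```python
import numpy as np

def haralick_features(P):
    """The nine texture features of Eqs. (3)-(11) from a normalized GLCM P."""
    L = P.shape[0]
    i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    mean = np.sum(P * i)                                # Eq. (3)
    return {
        "Mean": mean,
        "Std": np.sqrt(np.sum(P * (i - mean) ** 2)),    # Eq. (4)
        "Con": np.sum((i - j) ** 2 * P),                # Eq. (5)
        "Dis": np.sum(np.abs(i - j) * P),               # Eq. (6)
        "Hom": np.sum(P / (1.0 + (i - j) ** 2)),        # Eq. (7)
        "ASM": np.sum(P ** 2),                          # Eq. (8)
        "Energy": np.sqrt(np.sum(P ** 2)),              # Eq. (9)
        "Max": P.max(),                                 # Eq. (10)
        "ENT": -np.sum(P[P > 0] * np.log(P[P > 0])),    # Eq. (11)
    }
```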

Functional connectivity feature extraction

Previous studies have demonstrated FC changes in patients with AD. These connectivity alterations can be explored by measuring both global and local network topology. We calculated the Pearson correlation coefficients between the resting-state fMRI (rs-fMRI) time series of each brain region and those of all other regions using the GRETNA toolbox and the AAL-90 template, resulting in a 90 × 90 resting-state FC (rs-FC) matrix (Fig. 3). In the FC network, each brain region is represented as a node, and the strengths of connections are described as edges. After constructing the FC network, graph theory was employed to calculate both the local and global properties of the brain network, including the average clustering coefficient, average shortest path length, clustering coefficient, small-worldness, local efficiency, assortativity, and synchronization31. These parameters serve as the functional features in the GSCANet model. The definitions and calculation formulas of the functional connectivity features are as follows:

Figure 3

Process of feature extraction using fMRI data. Abbreviations: fMRI, functional magnetic resonance imaging; AAL, anatomical automatic labeling atlas; BOLD, blood-oxygen-level-dependent.

The clustering coefficient characterizes the degree of clustering among the nodes of the network, indicating its local connectivity. The average clustering coefficient (aCp) is obtained by averaging the clustering coefficients of all nodes in the network (Fig. 3); for a single node i, the clustering coefficient is

$$C_{i}=\frac{n}{C_{k}^{2}}=\frac{2n}{k(k-1)}$$
(12)

where n is the number of edges that exist among the neighbors of node i, and k is the number of neighboring nodes of node i.

The average shortest path length (aLp) is the average of the shortest path lengths over all pairs of nodes in the network. It measures the optimal path for information transfer from one node to another and reflects the global information-transfer capability of the network. For a single node i, it is calculated as

$$L_{i}=\frac{1}{N-1}\sum_{j\in G,\,j\ne i}\min\{L_{i,j}\}$$
(13)

where \(L_{i,j}\) denotes the lengths of the paths between nodes i and j, and min{\(L_{i,j}\)} is the shortest path length between nodes i and j.

The local efficiency (\(E_{i\text{-}loc}\)) of a node represents the tolerance of the network for this node (i.e., the effect of removing this node on the transmission efficiency of the sub-network). It is given by

$$E_{i\text{-}loc}=\frac{1}{N_{G_{i}}(N_{G_{i}}-1)}\sum_{j,k\in G_{i}}\frac{1}{L_{j,k}},$$
(14)

where \(L_{j,k}\) is the shortest path length between each pair of nodes directly adjacent to node i, and \(N_{G_{i}}\) is the total number of nodes in the sub-network formed by the direct neighbors of node i.

The network local efficiency (aEloc) is the average of the local efficiencies of all nodes; it is given by

$$E_{loc}=\left\langle E(i)\right\rangle=\frac{1}{N}\sum_{i\in V}E(i).$$
(15)

Assortativity is used to analyze the likelihood that similar nodes tend to connect to each other.

Synchronization refers to the degree to which changes in the connectivity of nodes in a network are similar when the network undergoes disturbances or perturbations.
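The sketch below illustrates this pipeline in Python. The study itself used the GRETNA toolbox, so networkx stands in here, and the sparsity-based binarization threshold is an assumed choice; small-worldness and synchronization are omitted because they require additional null-model comparisons.

```python
import numpy as np
import networkx as nx

def fc_graph_features(ts, sparsity=0.2):
    """rs-FC matrix and graph metrics for one subject (a sketch).

    ts: (T, 90) array of region-averaged BOLD time series (AAL-90).
    Keeping the strongest `sparsity` fraction of edges is an assumed
    thresholding rule, not the exact GRETNA configuration.
    """
    fc = np.corrcoef(ts.T)                       # 90 x 90 Pearson rs-FC matrix
    np.fill_diagonal(fc, 0.0)
    w = np.abs(fc[np.triu_indices_from(fc, k=1)])
    thr = np.quantile(w, 1.0 - sparsity)         # sparsity threshold
    G = nx.from_numpy_array((np.abs(fc) >= thr).astype(int))

    aLp = (nx.average_shortest_path_length(G)    # Eq. (13), averaged over nodes
           if nx.is_connected(G) else float("nan"))
    return {
        "aCp": nx.average_clustering(G),         # Eq. (12), averaged over nodes
        "aLp": aLp,
        "aEloc": nx.local_efficiency(G),         # Eq. (15)
        "assortativity": nx.degree_assortativity_coefficient(G),
    }
```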

Neuropsychological scores feature extraction

Cognitive data were extracted from the "ADNIMERGE" file, which incorporates merged data sets from ADNI 1/GO/2. The file contains clinical data and numeric summaries for assessing cognitive function over time. The neuropsychological variables used in the analysis of cognitive changes were the clinical dementia rating (CDR) and the mini-mental state examination (MMSE). All participants from the First Hospital of Jilin University underwent psychological cognitive assessments, including the MMSE and CDR. We performed a one-way analysis of variance (ANOVA) for each of the two scales across the groups in both databases. The results are summarized in Table 1. Significant differences were observed between the groups on both scales.

Table 1 Description of the database.

GSCANet model

CNNs feature local-area connectivity and weight sharing. The receptive field is the size of the region of the input image to which each pixel on a layer's output feature map corresponds. Selecting a receptive field that is too small leads to information loss, whereas a very large receptive field results in computational overload and information redundancy. To address this trade-off, we propose a new method that combines group self-calibrated convolution with coordinate attention, enabling efficient expansion of the receptive field without additional computational cost.

The model framework is shown in Fig. 4A. It follows a series structure, in which the outputs of each layer are the inputs to the next. The overall deep-learning model comprises one convolutional layer, one batch normalization layer, one maximum pooling layer, two activation layers, 16 residual blocks, one fully connected layer, one flatten layer, and a softmax layer. In this study, the GLCM features extracted in the preliminary stage are fed into the model. Initially, the convolutional layer extracts different features from the input. The normalization layer mitigates gradient explosion and vanishing. To minimize redundant features, the rectified linear unit (ReLU) is used as the nonlinear activation function, while the maximum pooling layer is utilized for multiscale feature learning32. Additionally, the model employs the residual blocks of the classical deep residual network (ResNet), which enhances robustness by deepening the network while bypassing less effective layers. The Resblock module is illustrated in Fig. 4B. Each residual block has three convolutional layers, including group self-calibrated coordinate attention convolutions (Fig. 4C) and coordinate attention (Fig. 4D). The flatten layer collapses the multidimensional feature maps into a single vector, which is concatenated with the numerical information (FC features, MMSE scores, and CDR scores) to fuse the image and numerical data. Finally, classification of subjects is achieved by passing these multimodal features through the fully connected layer followed by the softmax layer.
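The following PyTorch skeleton sketches this series structure and the image/numeric fusion. The `resblocks` argument stands in for the 16 residual blocks of Fig. 4B, and the channel widths and feature dimensions are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class GSCANet(nn.Module):
    """Series structure of Fig. 4A with image/numeric fusion (a sketch)."""
    def __init__(self, resblocks, img_feat_dim, num_feat_dim, n_classes):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1),  # convolutional layer
            nn.BatchNorm3d(32),                          # batch normalization
            nn.ReLU(inplace=True),                       # activation
            nn.MaxPool3d(2),                             # maximum pooling
        )
        self.blocks = nn.Sequential(*resblocks)          # 16 residual blocks
        self.flatten = nn.Flatten()                      # flatten layer
        self.fc = nn.Linear(img_feat_dim + num_feat_dim, n_classes)

    def forward(self, glcm_img, numeric):
        x = self.flatten(self.blocks(self.stem(glcm_img)))
        x = torch.cat([x, numeric], dim=1)  # fuse image + FC/MMSE/CDR features
        logits = self.fc(x)                 # fully connected layer
        # Softmax layer; note that CrossEntropyLoss takes `logits` directly.
        return torch.softmax(logits, dim=1)
```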

Figure 4

Schematic diagram of the GSCANet model architecture. (A) Overall modeling framework for AD diagnosis and prediction. (B) Residual blocks. (C) Group self-calibrated coordinate attention module. (D) Coordinate attention module. Abbreviations: Cov, convolutional layer; BatchNorm, batch normalization layer; ReLU, rectified linear unit; Resblock, residual block; FC, fully connected layer.

Group self-calibrated operation

Figure 4C illustrates the group self-calibrated convolutional framework. This framework splits the input data into multiple groups for parallel processing; each group has the same structure, and the convolution filter of each group is divided into multiple non-uniform parts. Each part is trained with a different feature extraction method to calibrate features across multiple scales. Instead of uniformly processing all features in the original space, self-calibrated convolution22 first splits the input data into two parts. One part performs direct feature mapping, whereas the other performs feature extraction over the global brain after down-sampling to enhance the signal-to-noise ratio, increase the receptive field, and reduce dimensionality. The group self-calibrated operation uses the lower-dimensional patterns of the latter filtered transform to calibrate the convolutional transform of the former filter. In this study, the group self-calibrated convolution enables effective information communication among filters following multiple heterogeneous convolution operations. This approach enables precise localization of target features and enhances feature discrimination, particularly as the convolutional receptive field expands. Finally, the features processed by the group self-calibration are input to the attention mechanism.

The group self-calibrated convolution comprises four heterogeneous filters—K1, K2, K3, and K4—that process the input data X. To achieve this, X is split into two parts, X1 and X2. In the first part, the original spatial information is preserved, and the features are directly mapped using K1, which helps prevent feature loss. Specifically, this is computed as Y1 = F1(X1), where F1(X1) = X1·K1. In the second part, the remaining three filters are applied to down-sample X2 and generate Y2. This operation implements self-calibration over the entire input and performs calibrated operations within Y2. Finally, Y1 and Y2 are concatenated and subjected to coordinate attention to enable multiscale information interactions in remote contexts, leading to the extraction of relevant features from the data.

Self-calibrated convolution enhances the connection between individual feature maps through average pooling operations, which capture the contextual information of features and reduce overfitting. Accordingly, an average pooling operation in 3D space was performed on the input X2: \(M_2=\mathrm{AvgPool3d}(X_2)\). The pooled features \(M_2\) were mapped through the K2 filter, \(X_2'=F_2(M_2)\). The original X2 was then combined with \(X_2'\) and passed through the sigmoidal activation function S, \(X_2''=S(X_2+X_2')\). \(X_2''\) was used as the calibration weight in an element-wise multiplication of congruent elements, \(Y_2'=F_3(X_2)\cdot X_2''\). The final feature transformation of the calibrated data can be expressed as \(Y_2=F_4(Y_2')\). The calibrated output features were then combined with the features of the original spatial context, and the concatenated features Y′ underwent the coordinate attention operation, yielding the final output features Y.
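A minimal PyTorch sketch of this operation is given below, following the equations above. The four filters K1–K4 are modeled as 3 × 3 × 3 convolutions and the pooling rate is assumed; the paper's exact settings may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupSelfCalibratedConv(nn.Module):
    """One self-calibrated group, following the equations above (a sketch)."""
    def __init__(self, channels, pool=2):
        super().__init__()
        c = channels // 2
        self.k1 = nn.Conv3d(c, c, 3, padding=1)  # F1: direct mapping of X1
        self.k2 = nn.Conv3d(c, c, 3, padding=1)  # F2: mapping of pooled M2
        self.k3 = nn.Conv3d(c, c, 3, padding=1)  # F3: transform to be calibrated
        self.k4 = nn.Conv3d(c, c, 3, padding=1)  # F4: final transform
        self.pool = nn.AvgPool3d(pool)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)        # split the input into X1, X2
        y1 = self.k1(x1)                                     # Y1 = F1(X1)
        m2 = self.pool(x2)                                   # M2 = AvgPool3d(X2)
        x2p = F.interpolate(self.k2(m2), size=x2.shape[2:])  # X2' = F2(M2)
        gate = torch.sigmoid(x2 + x2p)                       # X2'' = S(X2 + X2')
        y2 = self.k4(self.k3(x2) * gate)                     # Y2 = F4(F3(X2)*X2'')
        return torch.cat([y1, y2], dim=1)        # Y' -> coordinate attention
```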

Coordinate attention module

The attention mechanism enhances or selects important information concerning the target using attention distribution coefficients or weight parameters. To leverage both spatial location and channel information, we propose a coordinate attention mechanism based on a ResNet architecture, which enables efficient feature extraction. Several alternatives to traditional pooling methods have emerged to prevent the loss of spatial and channel information caused by simple, brute-force pooling (Fig. 4D). In this study, we use average pooling to retain long-range interactions between location and channel information and employ the attention mechanism to embed location information into channel attention. The image is first downscaled: the 3D image is transformed into one-dimensional (1D) feature encodings using average pooling along three directions. The transformation can be expressed as

$$z_{c}=\frac{1}{H\times W\times L}\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{l=1}^{L}y_{c}(i,j,l)$$
(16)

where H, W, and L are the height, width, and length of the 3D image Y, respectively; i, j, and l index the height, width, and length, respectively; \(y_c\) is the input feature of the cth channel of Y, on which the dimensionality reduction is performed; and \(z_c\) is the output of the cth channel. By applying 1D pooling operations to the 3D image, the features are aggregated into three separate direction-aware feature maps along the different directions of the input features. Specifically, the features are transformed with three pooling kernels, (H, 1, 1), (1, W, 1), and (1, 1, L), along the different dimensions as follows:

$$\begin{array}{c}H'=\text{AdaptiveAvgPool3d}(H,1,1)\\W'=\text{AdaptiveAvgPool3d}(1,W,1)\\L'=\text{AdaptiveAvgPool3d}(1,1,L),\end{array}$$
(17)

where \(H'\), \(W'\), and \(L'\) represent the height, width, and length features after pooling, respectively. The features pooled along the \(H'\), \(W'\), and \(L'\) dimensions are concatenated according to \(M=cat(H',W',L')\), and the combined data are convolved as a whole: \(M'=Conv3d(M)\). Batch normalization is then used to mitigate distribution bias in the data. After nonlinear activation, the features are separated according to the three dimensions of the image and convolved again separately. Finally, the overall features are reweighted after applying the sigmoidal activation function to preserve important feature information.
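The module can be sketched in PyTorch as below, following Eqs. (16)–(17). The reduction ratio r and the trick of rotating each pooled tensor so its retained axis lines up for concatenation are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordinateAttention3D(nn.Module):
    """3D coordinate attention following Eqs. (16)-(17) (a sketch)."""
    def __init__(self, channels, r=8):
        super().__init__()
        mid = max(channels // r, 4)
        self.conv1 = nn.Conv3d(channels, mid, 1)   # joint 1x1x1 convolution
        self.bn = nn.BatchNorm3d(mid)              # batch normalization
        self.act = nn.ReLU(inplace=True)           # nonlinear activation
        self.conv_h = nn.Conv3d(mid, channels, 1)
        self.conv_w = nn.Conv3d(mid, channels, 1)
        self.conv_l = nn.Conv3d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w, l = x.shape
        ph = F.adaptive_avg_pool3d(x, (h, 1, 1))                         # H'
        pw = F.adaptive_avg_pool3d(x, (1, w, 1)).permute(0, 1, 3, 2, 4)  # W'
        pl = F.adaptive_avg_pool3d(x, (1, 1, l)).permute(0, 1, 4, 3, 2)  # L'
        m = self.act(self.bn(self.conv1(torch.cat([ph, pw, pl], dim=2))))
        mh, mw, ml = torch.split(m, [h, w, l], dim=2)  # separate the directions
        ah = torch.sigmoid(self.conv_h(mh))                         # (b,c,h,1,1)
        aw = torch.sigmoid(self.conv_w(mw.permute(0, 1, 3, 2, 4)))  # (b,c,1,w,1)
        al = torch.sigmoid(self.conv_l(ml.permute(0, 1, 4, 3, 2)))  # (b,c,1,1,l)
        return x * ah * aw * al                  # reweight the input features
```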

The coordinated attention mechanism employs three 1D global pooling operations to encode global information. This approach enables the model to capture long-term dependencies among spatial locations, which is crucial for tasks related to disease diagnosis. Specifically, GSCANet was used to integrate the input features and perform feature extraction with an expanded effective receptive field, thereby ensuring the accurate extraction of features. This approach enabled coordinated attention to capture remote dependencies between spatial locations, thus rendering it highly effective for analyzing medical images.

Performance evaluation

This study compares the performance of various neural network-based classification models across different datasets to assess the generalization capability of deep learning models. The AD vs. MCI vs. NC three-class and AD vs. NC binary classification models were trained using the ADNI dataset and validated using the in-house dataset. The remaining experiments (the AD vs. LMCI vs. EMCI vs. NC four-class classification and the EMCI vs. LMCI and EMCI vs. NC binary classifications) were trained using the ADNI dataset and validated using an ADNI test set whose subjects are distinct from those in the training set. To balance the sample sizes across categories within the ADNI dataset, we utilized multimodal data from the same individuals collected over multiple years. To mitigate bias arising from data similarity, we ensured that data from the same individual were never included in both the training and testing sets simultaneously.

All training and testing in this study were carried out on the same hardware and software platform. The hardware environment was a Windows Server 2016 (64-bit) operating system, an Intel® Xeon® Gold 6238R CPU, and a V100S-PCIE-32GB GPU. The software environment for GSCANet was Python 3.8.18, PyTorch 1.10.1, and CUDA 10.2. The learning rate (lr) was set to 3e-4, and the number of training epochs was set to 100. Training was performed using the cross-entropy loss with incremental gradient descent.

Furthermore, we conducted ten-fold cross-validation for multiple classifications to assess the variability in the classification performance. Specifically, we randomly divided all participants into ten folds, using the data of participants in nine folds to build and train the model. The scans of participants in the remaining fold were used for model testing. We performed iterative testing using subjects in the remaining fold to prevent bias in the random assignment of data during cross-validation. Specifically, 15 replications per experiment were conducted using different combinations of training and test sets.
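A sketch of this protocol is shown below, assuming scikit-learn's StratifiedGroupKFold to enforce the subject-level separation described above; `build_and_train` and `evaluate` are hypothetical stand-ins for the GSCANet training and scoring code.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def cross_validate(features, labels, groups, n_repeats=15, seed=0):
    """Ten-fold protocol with subject-level grouping (a sketch).

    `groups` holds one subject ID per scan so that repeated scans of the
    same individual never span the train/test split; `build_and_train`
    and `evaluate` are hypothetical stand-ins.
    """
    scores = []
    for rep in range(n_repeats):                # 15 replications per experiment
        cv = StratifiedGroupKFold(n_splits=10, shuffle=True,
                                  random_state=seed + rep)
        for tr, te in cv.split(features, labels, groups):
            model = build_and_train(features[tr], labels[tr])
            scores.append(evaluate(model, features[te], labels[te]))
    return float(np.mean(scores)), float(np.std(scores))
```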

Finally, the diagnostic performance was evaluated using quantitative metrics, namely accuracy (ACC), sensitivity (SEN), specificity (SPE), and the area under the receiver operating characteristic curve (AUC), which are expressed as

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$
(18)
$$SEN=\frac{TP}{TP+FN}$$
(19)
$$SPE=\frac{TN}{TN+FP}$$
(20)

where TP, TN, FN, and FP represent true positive, true negative, false negative, and false positive, respectively. The classification results were averaged for ten-fold validation.
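For reference, these metrics follow directly from the confusion-matrix counts, as in this minimal sketch (binary case, with label 1 denoting the patient class):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """ACC, SEN, and SPE of Eqs. (18)-(20)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    acc = (tp + tn) / (tp + tn + fp + fn)        # Eq. (18)
    sen = tp / (tp + fn)                         # Eq. (19)
    spe = tn / (tn + fp)                         # Eq. (20)
    return acc, sen, spe
```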

Results

Analysis of GSCANet’s early diagnostic results on ADNI data and its comparison with other models

First, we trained our model on the ADNI dataset and tested it on a completely non-overlapping ADNI dataset using ten-fold cross-validation. Note that the in-house dataset was not used in this cross-validation process. As shown in Table 2, the final accuracy for differentiating among the four categories of subjects (AD vs. LMCI vs. EMCI vs. NC) was 78.70% (95% CI: 77.5–81.5). For the binary classification of EMCI and LMCI, the final accuracy was 84.67% (95% CI: 84.16–85.18), and the accuracy for classifying EMCI and NC was 82% (95% CI: 81.79–82.21).

The GSCANet framework proposed in this study demonstrated improved performance compared with previous studies12,33,34,35,36,37,38,39. Table 2 shows that the four-class accuracy was significantly improved (by approximately 14%) compared with the adaptive sparse learning model34. In the EMCI vs. LMCI binary study, the SPE improved by 15% compared with that of the SWRLS dFC model37. The accuracy of EMCI vs. NC classification in our study improved by 2.75% compared with that of the GDCA model40. These results highlight the superior performance of our model in complex classification tasks.

Table 2 Performance comparisons of our model with those of other studies reported in the literature for AD vs. LMCI vs. EMCI vs. NC, EMCI vs. LMCI, and EMCI vs. NC.

Generalizability investigation of the GSCANet model with an independent database and comparison to other models

To further investigate the generalizability and reproducibility of the GSCANet model, training was performed on the ADNI dataset while testing was conducted on the in-house dataset. The classification performance was then explored for the three-class (AD vs. MCI vs. NC) and binary (AD vs. NC) tasks. The results are presented in Table 3. The final accuracy for the AD vs. MCI vs. NC three-class task was 83.33% (95% CI 77.5 to 81.5), an improvement over 3D-CNN41, THAN42, STNet33, and LSTM-Robust43. The final accuracy for the AD vs. NC dichotomy was 92.81% (95% CI 77.5 to 81.5), which was better than that of BB + pA-blocks (A) + Bili44, 3D-ResAttNet3445, THAN42, PT DCN26, and Dense CNN46. The method proposed in the present study achieved higher classification accuracy than those of other studies. Table 3 shows that the three-class result improved by 7.37% compared with the LSTM-Robust model43, and our AD vs. NC result improved by 8% in SEN compared with the DenseCNN2 model46. Based on these experiments, our proposed model has good generalizability and reproducibility in diagnosing AD and shows better classification performance than some previous models.

Table 3 Performance comparisons of our model with those of other studies reported in the literature for AD vs. MCI vs. NC and AD vs. NC binary classifications.

A comparative study using the same features with other existing deep learning methods for diagnosing AD

Because different datasets and features are used across studies, direct comparisons between methods can be unfair. We therefore utilized the same dataset and experimental environment and changed only the parts of the experiment under comparison to fully validate the effectiveness of the proposed method47. Our model was compared with classical models, such as the VGG, ResNet, and ResNeSt networks. All models used the same features, namely GLCM, FC, and neuropsychological scores, and we compared the results of the four-class and three-class classifications on the different models. Specifically, the four-class classification (AD vs. LMCI vs. EMCI vs. NC) was trained on the ADNI dataset and tested on a completely non-overlapping ADNI dataset, whereas the three-class classification (AD vs. MCI vs. NC) was trained on the ADNI dataset and tested on the in-house dataset. As shown in Table 4, our model outperformed VGG48 by 8.5% in the four-class study. Furthermore, compared with traditional models such as the 50-layer deep residual network (ResNet50)49 and the split-attention network (ResNeSt50)50, our model exhibited improvements of 11.09% and 22.96%, respectively. In the three-class AD vs. MCI vs. NC experiment, our model exhibited improvements of 5.45%, 7.32%, and 22.96% over the VGG, ResNet50, and ResNeSt50 models, respectively. These results demonstrate that GSCANet has stronger feature extraction capabilities than the other networks.

Table 4 Performance results of our model compared to the classical model.

We used confusion matrices to present the results from Tables 2 and 3 and assess the network's performance on the validation data for each class (Fig. 5). A confusion matrix shows the relationship between the predicted and actual categories, helping to identify where the model performs well and where it is prone to errors. Figure 5A, B, and C illustrate the results of training and testing on the ADNI dataset. Figure 5D and E show the results of training on the ADNI dataset and testing on the in-house dataset. In the four-class classification, NC was frequently misclassified as MCI (including EMCI) (Fig. 5A). This may be attributed to the subjective nature of diagnoses and the multi-site data sources, which can lead to similarities between NC and MCI. In the three-class classification, two AD subjects were misclassified as MCI, eleven MCI subjects were misclassified as AD, three MCI subjects were misclassified as NC, and sixteen NC subjects were misclassified as MCI. In the binary classifications, subjects were more often misclassified into a more severe disease stage than into a less severe one. For example, in the EMCI vs. NC classification, 48 NC subjects were misclassified as EMCI, whereas no EMCI subjects were misclassified as NC. This behavior is desirable in medical diagnostics, where it is preferable to flag individuals with potential disease rather than to predict negatives incorrectly and miss those with the condition.

Figure 5

Confusion matrix for classification of Alzheimer’s disease diagnosis.

Figure 6

ROC curve of the multiclassification model in this study. (a) Binary classification. (b) Graph plots of the three classification results, including those for other model validation methods. (c) Graph plots of four classification results, including other model validation methods. Abbreviations: ROC, receiver operating characteristic; AUC, area under curve.

To better display our classification results, we plotted the receiver operating characteristic (ROC) curves of the aforementioned binary, three-class, and four-class classifications, as shown in Fig. 6. The AUC for the AD vs. MCI vs. NC three-class study was 0.92. The AUC for the AD vs. LMCI vs. EMCI vs. NC four-class study was 0.95, and for the binary tasks, the AUCs for EMCI vs. LMCI and AD vs. NC were 0.89 and 0.82, respectively.

Figure 7

Interpretability results. Attention maps.

Ablation experiment

To thoroughly validate the effectiveness of the method proposed in this study, we conducted ablation tests on both the feature and model modules to assess the contributions of GLCM, FC, MMSE/CDR modalities, as well as the group self-calibrated convolution module and coordinate attention module to the overall performance. Specifically, we utilized the same dataset and experimental setup to conduct three-class classification (AD vs. MCI vs. NC), altering only the components under comparison in each experiment.

In the feature ablation experiments, we selected one or more of the GLCM, FC, and MMSE&CDR features on the basis of the original model to evaluate the contribution of each modality to the overall performance. When classifying with features of a single modality, the accuracy of the GLCM features alone was 73.39% (Table 5), higher than that of FC (58.49%) and CDR&MMSE (65.56%). Combining the three modal features enhances the model's predictive power and achieves better classification results, demonstrating that multimodal data provide supplementary information for AD diagnosis.

In the module ablation experiments, we chose VGG as the backbone network and incorporated the group self-calibrated convolution module, the coordinate attention module, or both, to compare the impact of different configurations on the model's parameters and recognition accuracy. The comparative results, shown in Table 6, demonstrate how adding these components step by step improves the model's performance. The baseline accuracy of 77.88% was improved by both the group self-calibrated convolution module (80.38%) and the coordinate attention module (79.43%), and their combination achieved the highest accuracy of 83.33%. These results clearly indicate the effectiveness of both modules in enhancing feature extraction and facilitating interactions among channels and spatial locations.

Table 5 Results of the ablation study among multiple modalities.

Model interpretability

In computer-aided diagnosis, identifying the specific brain regions closely related to the predictions of deep learning models is vital. Understanding these regions helps explain the model's decision-making process, enhancing its interpretability and reliability. In our study, we utilized the gradient-weighted class activation mapping (Grad-CAM)51 technique to extract feature mappings from the proposed network. Grad-CAM was applied to the final convolutional layer of the proposed GSCANet model, before the concatenation that produces the classification output. The generated attention maps serve as visual descriptions of the network, highlighting the image regions essential for determining the target class. The attention maps and visualizations created using the Grad-CAM algorithm for the GLCM images of NC and EMCI subjects are shown in Fig. 7.
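A minimal sketch of this procedure is shown below, assuming an image-only forward pass (the numeric branch is omitted for brevity) and PyTorch's standard hook API; it illustrates the Grad-CAM computation rather than reproducing the exact code used in this study.

```python
import torch

def grad_cam_3d(model, target_layer, volume, class_idx):
    """Grad-CAM attention map for one 3D input (a sketch).

    `target_layer` is the final convolutional layer of the model; hooks
    capture its activations and gradients during one forward/backward pass.
    """
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    score = model(volume)[0, class_idx]          # class score for this volume
    model.zero_grad()
    score.backward()                             # gradients w.r.t. activations
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3, 4), keepdim=True)  # channel-wise weights
    cam = torch.relu((w * acts[0]).sum(dim=1))      # weighted sum + ReLU
    return cam / (cam.max() + 1e-8)                 # normalized attention map
```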

In our model, the energy feature is used less frequently in classification, which may imply that its discriminative power is weaker in the current dataset and model. In contrast, the ASM and entropy features hold the highest proportions, indicating that they play a more significant role in the model's prediction of AD, possibly because they capture texture information more relevant to the disease. Our findings indicate that key regions, including the hippocampus, medial superior frontal gyrus, precuneus, middle temporal gyrus, posterior cingulate gyrus, lingual gyrus, and dorsolateral superior frontal gyrus—critical areas in AD pathology—provide the richest features for our model's predictions. Changes in these regions are not only markers of the disease but also provide vital information to the model, improving classification accuracy and reliability. This visual evidence enhances our understanding of the model's predictions. It also indicates that the self-calibration and coordinate attention mechanisms enhance feature extraction and the interactions between channels and spatial regions, enabling the model to focus on the critical areas of brain images associated with AD.

Table 6 Results of the ablation study among the proposed blocks.

Discussion

This study employed a combination of local Haralick texture features based on sMRI, global FC features based on fMRI, and cognitive information based on neuropsychological scores to diagnose AD. Our distinctive contribution lies in the proposed GSCANet model, which employs group self-calibrated convolution to extract meaningful features and expand the receptive field, and integrates coordinate attention to facilitate significant interactions among channels and spatial locations. The outcomes of our experiments consistently demonstrated the high performance of our model in classification tasks, including when tested on external datasets.

The results demonstrate that the multimodal feature combination, encompassing Haralick texture features, functional connectivity, and neuropsychological scores, yields higher classification accuracy than traditional feature combination methods for both the four-class and three-class tasks. This is because the proposed method extracts complementary information from multiple modalities and reduces the classification error, thus improving classification accuracy52,53. The approach provides a more comprehensive understanding of brain function by integrating the unique information of each modality54. The study used image intensity-derived GLCMs to calculate the Haralick texture features and extract subtle structural changes that may be indicative of early AD55. This approach characterizes the local patterns and arrangement rules of brain images, providing subtle, high-resolution texture information of the brain surface56. Recent studies have shown that patients with early AD exhibit subtle structural changes before significant brain tissue atrophy occurs57,58. These changes may lead to mild brain atrophy that manifests as various symptoms in AD patients, such as forgetfulness24. Therefore, the use of advanced features, such as subtle changes extracted from the original image, enables the diagnosis of AD57,59. Functional connectivity features can detect the functional changes in the brain that are related to AD pathology and can enhance the diagnostic accuracy for brain disorders associated with AD-related cognitive impairments11. Moreover, neuropsychological scales, such as the MMSE and CDR, each offer distinct advantages; they are among the most commonly used tools for assessing cognitive, emotional, and behavioral functions, providing valuable insights for diagnosis. In summary, the use of multimodal features, including Haralick texture, functional network, and neuropsychological score features, provides a more comprehensive understanding of brain function and yields higher accuracy in AD diagnosis. This approach has the potential to facilitate early AD detection and intervention by characterizing subtle structural changes in the brain.

This study proposed a parameter-efficient GSCANet framework that exhibited improved performance compared with previous studies34,43,46. Tables 2 and 4 show that our results in the AD vs. NC experiment improved compared with those of LSTM-Robust43, the four-class accuracy was significantly improved compared with the adaptive sparse learning model34, and the accuracy of EMCI vs. NC classification improved compared with the GDCA model40. These results indicate that our model performs better in complex classification tasks. The improved performance of GSCANet can be attributed to several factors. First, the framework incorporates additional group self-calibrated convolution operations into the residual blocks after multiple sets of treatments, enabling multiscale information interaction through heterogeneous convolutional kernels. This expands the receptive field and supplements the global information of features, facilitating the early identification of brain differences among subjects. Second, GSCANet utilizes multiple group self-calibrated coordinate attention modules for feature extraction, wherein each block set operates independently. This approach effectively aggregates hierarchical and refined channel and spatial information by using the group self-calibrated features to coordinate attention. By encoding channel relationships and long-term dependencies with accurate location information, GSCANet facilitates the effective interaction of channel and spatial information, enhancing the model's sensitivity to the semantic information among features and thereby improving diagnostic accuracy60.

The effectiveness of the proposed method has been thoroughly validated. In comparison with classic network models such as VGG, ResNet50, and ResNeSt50, our model demonstrated superior performance: it surpassed the VGG48 model in the four-class classification study and displayed substantial improvements over ResNet49 and ResNeSt50 in the three-class AD vs. MCI vs. NC experiment. These results demonstrate that GSCANet has stronger feature extraction capabilities than the other networks. We speculate that the group self-calibrated coordinate attention in the residual block can identify potential features in the data and that networks with attention mechanisms or multiscale extraction possess robust recognition capabilities60. Our model enables multiscale information interactions by expanding the receptive field and aggregating spatial-channel information, thus enabling more efficient and faster detection of subtle pre-dementia brain changes. The proposed framework ensures good classification performance and improves flexibility and generalization, indicating its potential to identify the early onset of AD more effectively than other methods. Our network model is based on a general framework that permits the flexible implementation and design of computer-aided diagnostic systems for disease diagnosis problems.

Our analysis demonstrated that integrating features from all three modalities—GLCM from sMRI, FC from fMRI, and cognitive scales—achieves higher classification accuracy. To identify which feature contributes most significantly, we conducted ablation experiments. The results revealed that the GLCM features alone provided superior classification accuracy compared with the FC and cognitive scale features individually. This highlights the critical role of GLCM features, as they offer deeper insights into localized texture information and brain abnormalities associated with AD, beyond what functional connectivity can provide. Additionally, we explored which network module contributes most to classification improvement through further ablation experiments. Both the group self-calibrated convolution and coordinate attention modules enhanced the network's performance, and their combination resulted in a significant improvement in classification accuracy. Furthermore, the visualization results indicate that key regions, including the hippocampus, medial superior frontal gyrus, precuneus, middle temporal gyrus, posterior cingulate gyrus, lingual gyrus, and dorsolateral superior frontal gyrus—critical areas in AD pathology61,62—provide the richest features for our model's predictions. This visual evidence not only enhances our understanding of the model's predictions but also paves the way for verifying the diagnosis of AD.

However, this study has a few limitations. First, deep learning-based methods require large and diverse datasets to efficiently extract features and ensure robust performance, but the amount of data used in this study is small, consisting mainly of Caucasian participants (ADNI dataset) and Asian participants (in-house dataset), which may limit the model's generalizability. Future research should include more diverse datasets to improve the model's robustness and applicability across different populations. Second, potential biases in the current datasets, stemming from demographic factors, imaging protocols, and clinical settings, could affect the generalization of our model. Future work will involve careful consideration and mitigation of these biases to enhance the model's applicability across different populations and settings. Third, the interpretability of complex deep learning models remains a significant challenge in medical contexts. Ensuring transparency and trust in the model's predictions is crucial; future work will focus on improving model interpretability, for example by incorporating visualization techniques and explainable AI approaches, to provide clearer insights into the decision-making process and help clinicians understand and trust the model's predictions. Lastly, the external testing set did not include EMCI and LMCI subjects, which restricts the demonstrated applicability of our model. Expanding the testing set to include these subjects in future studies will be essential to fully evaluate the model's performance and utility in diagnosing different stages of AD.

Conclusion

In this study, we proposed a novel approach to the early diagnosis of AD using the GSCANet CNN model. Our approach extracts effective features through group self-calibration to expand the receptive field and employs coordinate attention to enable characteristic interactions between channels and spatial locations. The experimental results demonstrated that our model consistently performed well in the evaluation tasks and in generalization testing on external datasets. Our approach effectively identifies subgroups of subjects with different patterns of AD progression and has potential clinical applications.