Introduction

Gastric cancer (GC), or stomach cancer, is a formidable health challenge with significant global implications. It has a long-standing history and remains one of the most prevalent and deadly cancers worldwide1. Early detection and timely treatment are crucial for improving patient outcomes and reducing the mortality associated with this disease. Recently, increasing attention has been paid to understanding GC’s epidemiology, risk factors, and biological characteristics. Such knowledge contributes to developing effective prevention strategies, diagnostic approaches, and treatment modalities. GC’s prevalence, morbidity, and mortality rates necessitate continuous efforts to improve detection methods and ensure early intervention for optimal patient care. According to recent statistics, GC is the fifth most prevalent cancer globally and the fourth leading cause of cancer-related death, making it a significant public health concern2,3. It remains among the leading causes of cancer mortality worldwide3. These statistics emphasize the urgent need for improved detection and management strategies to address the impact of GC on global health.

The current diagnostic methods for GC mainly involve endoscopic examination, biopsy, and histopathological analysis. Endoscopy allows direct visualization and tissue sampling, enabling clinicians to identify suspicious lesions and collect biopsy samples for further analysis. Tissue staining techniques are employed for the examination of anatomical connectivity4,5,6, cancer progression7, forensic pathology8, tissue morphology9,10, disease surveillance11, and genetic alterations12,13. Other applications of immunohistochemistry staining are discussed in detail elsewhere14.

Histopathological examination constitutes the gold standard for identifying GC15. GC is mainly diagnosed through pathological biopsy, stained with hematoxylin and eosin (H&E). The histopathological examination provides crucial information about tumor characteristics, including histological type, grade, and stage. The nuclei and cytoplasm of tissue sections are examined in the H&E-stained slides, which highlight the fine structure of cells and tissues for the observing physician.

However, these diagnostic approaches have limitations, including invasiveness, sampling errors, and interobserver variability, which may impact diagnostic accuracy16. Under a microscope, the biopsy’s morphology and tissue properties are scrutinized, and the physician’s expertise is synthesized to reach a diagnosis. Nonetheless, individual pathologists rely on their own experience and contextual circumstances when making diagnoses, potentially leading to discrepancies in their interpretations of tissue pathology images. Additionally, pathologists are responsible for analyzing numerous histology images regularly; maintaining continuous focus over extended working hours may increase the probability of diagnostic errors. Consequently, precise detection of stomach cancer by pathologists remains a significant challenge17. In addition, early diagnosis is paramount in achieving favorable outcomes for GC patients. Detection at an early stage allows for more effective treatment options, including curative surgery, and can significantly improve survival rates. Therefore, developing reliable, accurate, and sensitive screening and diagnostic methods is essential to ensure accurate and early detection of GC.

The above-mentioned problems could be addressed by introducing a computer-aided diagnosis (CAD) system that identifies pathological images of GC, alleviating the shortage of pathologists and lowering the incidence of misdiagnosis in histological examination18. Advanced algorithms can shorten processing time and allow CAD systems to make objective decisions19,20,21,22, perform classification23,24,25,26,27,28,29 and segmentation30, and support the detection of cervical cancer31,32, skin cancer33, and neurological disorders4,34. The rapid development of CAD technology for GC, which can more rapidly and reliably identify cancer locations, has been made possible by constant advances in image processing, machine learning (ML), and pattern recognition algorithms1,35,36. These algorithms utilize ML and deep learning (DL) techniques to analyze diagnostic data, such as imaging, biomarkers, and clinical parameters. Although these algorithms promise to improve diagnostic accuracy, they also have limitations. Factors such as dataset heterogeneity, lack of standardization, and limited interpretability of results may hinder their widespread implementation in clinical practice37,38. Moreover, the conventional ML techniques used in traditional CAD19,24 approaches operate in two steps: first, visual attributes such as shape, color, and texture are extracted manually; a classifier then categorizes the extracted features39. Convolutional neural network (CNN) models allow features to be learned automatically, replacing the subjectivity of manual feature extraction in ML, which has significantly improved the accuracy and effectiveness of CAD20,21,22,40. A drawback of CNN models is that they do not extract reliable features from small datasets. Because of this limitation, it is crucial to integrate CNN models with an attention mechanism.

Recent studies on GC classification using histopathological images face two major challenges: the lack of interpretability of the models and the limited generalizability of the datasets. Interpretability is crucial for clinical adoption, as it builds clinicians’ trust in model predictions. Although some studies have incorporated attention mechanisms41, they do not provide proper visualization of the decision-making process. To address this, we integrate Grad-CAM visualizations within our multi-channel attention-based framework, enhancing model transparency. In addition, dataset heterogeneity poses a significant challenge due to variations in staining techniques, scanner types, and demographic differences between medical centers. Traditional models often struggle to generalize well under these conditions. Our approach mitigates this issue by using a multi-scale feature extraction mechanism and a transfer learning-based pipeline42 trained on the diverse GasHisSDB and HCRF histopathology datasets. This enhances the model’s adaptability to different clinical environments. These contributions fill critical gaps in the literature, providing a more interpretable and robust framework for GC classification.

Attention mechanism

According to cognitive research, humans take in only a small portion of all observable information owing to processing bottlenecks. Inspired by the human visual system, attention mechanisms are techniques for directing focus to the most crucial image regions while ignoring irrelevant ones43. They prioritize the most informative signal components while allocating computing resources accordingly44. Researchers have sought models of visual selective attention to mimic how people perceive visual information, to model how attention is distributed when viewing still and moving images, and to broaden the models’ usefulness. Attention methods have been shown to enhance model performance and are also congruent with the perceptual processes of the human brain and eyes. Most research integrating DL with visual attention in computer vision, for instance, focuses on using masks. Under the masking concept, a new layer with new weights is used to identify the essential characteristics in the image data, and deep neural networks can learn, through training, which portions of each new image require attention. As attention mechanisms have developed into several categories, different models stress distinct feature domains. These models are used for various tasks, including classification, detection, segmentation, model improvement, video processing, and more. Attention mechanisms can be categorized into channel attention, spatial attention, mixed attention, and self-attention. Channel attention approaches, typified by the Squeeze-and-Excitation Network (SENet)45, Efficient Channel Attention (ECANet)46, and Style-based Recalibration Module (SRM)47, produce an attention mask over the channel domain and use it to select significant channels. Spatial Transformer Networks (STN)48 and Gather-Excite Networks (GENet)49 are two examples of spatial attention approaches that produce attention masks over the spatial domain and use them to select significant spatial locations. The Convolutional Block Attention Module (CBAM)50 and coordinate attention51 are examples of combined channel and spatial attention techniques that merge the benefits of both to create 3-D attention maps. Other recent techniques concentrating on branch and temporal attention have also been proposed52,53.

Attention mechanism enhanced CNN

CNN performance has been boosted by attention mechanisms, driving advances across a range of visual tasks, including classification, detection, segmentation, model improvement, video analysis, and more54. Attention techniques often take the form of plug-and-play attention modules that can enhance a block’s convolutional outputs and help the entire network learn more informative representations55. Owing to the integration of attention modules into some advanced CNN designs, such as the SE module added to MobileNet V3, that network version performs better than MobileNet V1 and MobileNet V256. To overcome challenges such as complex backgrounds, dispersed lesions, and inter-class resemblance, as in abnormality detection and normal cell identification, researchers in image classification are increasingly incorporating attention modules into their custom network designs57,58,59,60,61,62. However, building these attention modules frequently entails complex elements, such as pooling choices, which can add parameters and computational load and be unsuitable for lightweight network topologies.

Our research provides a unique strategy to address the intrinsic complexity of medical image data, where intricate components and subtle differences across disease stages make it difficult to identify relevant attention areas using a single AM. In particular, this study provides a learning paradigm that uses a multi-channel attention mechanism (MCAM). The proposed framework improves the accuracy of GC histopathology image classification by overcoming the challenges posed by complex medical images. The flow chart of the proposed framework is depicted in Fig. 1. The methodology has two phases: training and testing. The MCAM model, which consists of three channels, the multi-scale global information channel (MGIC), spatial information channel (SIC), and multi-scale spatial information channel (MSIC), is used for learning. After numerous epochs of training on the training images, the learned parameters of the MCAM model are combined through a weighted voting technique. The optimized model parameters are then retained, and the test images are input to obtain the GC histopathological image classification results.

The main contributions of this research study are as follows:

  • A multi-channel attention mechanism (MCAM)-based framework using transfer learning (TL) is introduced as an efficient GC classifier. Three channels, the multi-scale global information channel (MGIC), spatial information channel (SIC), and multi-scale spatial information channel (MSIC), use attention mechanisms to extract comprehensive multi-scale local, global, and spatial information and are integrated and deployed with TL, resulting in an effective classification approach.

  • The reliability of the proposed MCAM model is underscored by its consistent performance across two distinct datasets, highlighting the model’s inherent robustness.

  • The proposed model has achieved the highest evaluation metrics compared to the conventional deep learning approaches and previously existing competitive studies on GC classification using histopathology images.

  • The growing need for transparent AI tools in medical diagnostics is met by including attention mechanisms and strengthening model interpretability. The regions of interest are depicted using Grad-CAM visualizations, which promote clinical confidence and provide insights into the decision-making process. A comparative analysis with cutting-edge deep learning models, including VGG-16, Xception, Vision Transformers (ViT), and ensemble approaches, highlights the superior performance of the proposed MCAM framework.

The hypothesis of this study was that an MCAM-based framework can improve the classification of gastric cancer histopathology images by overcoming the limitations of conventional deep learning models through enhanced feature extraction, dynamically focusing on relevant features, and capturing intricate relationships in medical data. The study offers a thorough and efficient solution for GC classification by tackling dataset heterogeneity, interpretability issues, and the lack of robustness in earlier approaches. The paradigm differs from other approaches in the field in that it incorporates MCAM, transfer learning, and a focus on interpretability.

Fig. 1
figure 1

A general overview of the proposed methodology. The dotted red line separates the (a) training and (b) testing phases.

Related work

We undertake two explorations in this section. First, a full and in-depth description of deep learning techniques is presented, exploring their fundamental ideas and wide range of uses. Next, we focus on a comprehensive analysis of GC identification and categorization using a thorough investigation of the DL techniques utilized in previous competitive research. GC detection and classification are two areas in which this two-pronged approach seeks to provide the reader with a deep understanding of DL techniques and a nuanced understanding of their particular applications.

Overview of deep learning methods

CNN models are the most popular DL techniques used in computer vision tasks. Transformer and multilayer perceptron (MLP) models have also gained popularity because of their continual improvement. In particular, many biological image analysis tasks, such as histological image analysis63,64,65,66,67, cytopathological image analysis68,69,70,71, microorganism image analysis72,73,74, COVID-19 identification75,76, and sperm image analysis77,78, make extensive use of DL techniques. These models can translate low-level aspects of the data into high-level abstract features, a trait that makes DL models stronger than shallow ML models in feature representation79,80. The ongoing advancements in CNN models specifically address three main areas: the network’s depth, its width, and hybrid combinations of both81,82. The VGG84, DenseNet85, and ResNet83 models increase network depth by employing small convolution kernels, dense connections, and residual mechanisms, respectively, to enhance model performance. The Xception86 and Inception-V387 models increase network width by using separable convolutional blocks and multi-scale inception blocks. Some models, such as ResNeXt88 and InceptionResNet89, efficiently combine residual mechanisms and inception blocks during feature extraction. InceptionResNet increases both network depth and width; consequently, classification performance is significantly improved, representing an important step in network optimization.

In the contemporary landscape of AI research, transformer models90 are finding promising applications in unraveling complex challenges within computer vision. Transformer models fall into two primary factions: those fused with convolutional neural networks (CNNs) and pure transformer models91. Pure transformer models include the ViT92, CaiT93, DeiT94, and T2T-ViT95 models. Transformer models combined with CNNs include the CoaT96, LeViT97, and BoTNet98 models, which feed the feature maps created by convolution of images into the transformer encoder. MLP models are variants of transformer models obtained by substituting the self-attention layers of the ViT92 model with multilayer perceptrons.

Gastric cancer detection using deep learning methods

DL is a type of ML that can identify more abstract information from input data over time99,100,101. DL has recently caught oncologists’ interest. Oncology has seen significant advancements in DL, which is superior to traditional ML methods102,103. DL on pathology images for the spatial organization and molecular correlation of tumor-infiltrating lymphocytes was presented104.

A study105 proposed a DL system for evaluating lymph node and tumor locations using whole-slide images, suggesting that DL models could aid pathologists in diagnosing lymph nodes and identifying new prognostic markers that are challenging to quantify manually. In a recent study30, a Naive Bayes classifier with a Gaussian mixture model and a novel, improved fuzzy c-means clustering algorithm were proposed for improved classification and segmentation, respectively. A binary image segmentation method enables cancer detection at the pixel level by utilizing a CNN with the DeepLab v3 architecture106. On the GC dataset used, the authors report that their AI assistance system achieves an average specificity of 0.806 and a sensitivity of 0.996. Another study made a whole-slide gastric histopathology dataset (GasHisSDB) publicly available67. In addition, three CNN classifiers, a novel transformer-based classifier, and seven traditional ML classifiers were tested on this dataset67. It was found that the accuracy rates of the different classifiers differ significantly; the highest DL accuracy was 0.965 and the lowest was 0.862. A study107 presented an automated method using TensorFlow DL packages to classify tumor type by categorizing a GC dataset of whole-slide images. In another study108, DL-based models were used to identify tumors and forecast the course of GC by examining pathological images. A study by109 included Epstein-Barr virus (EBV)-positive and microsatellite instability (MSI)/mismatch repair deficient (dMMR) tumors and used a histology-based DL model to screen for immunotherapy-sensitive subgroups. Likewise, another study110 proposed an efficient DL model to detect EBV-associated GC using H&E-stained images. An ensemble model that combines the decisions of multiple DL models attained high accuracy for GC detection using histopathology images41. The authors attribute the improved performance to the extraction of important features, even from smaller patches; the limitations, however, include higher computational costs. Another DL-based ensemble model using H&E-stained images was presented111 to identify lymphovascular invasion, an indirect predictor in GC. A further study112 proposed an ensemble approach that combines the capabilities of ResNet50, VGGNet, and ResNet34 and outperforms models such as EfficientNet and ViTNet. The ensemble achieves promising accuracy as a result of integrating the mentioned models, demonstrating the effectiveness of ensemble models in capturing key features and offering a significant advantage in GC classification. A hybrid DL and gradient-boosting approach has proven highly effective in classifying gastric histopathology images113. Grad-CAM visualizations confirm that the model focuses on relevant histological features, enhancing interpretability, and its consistent accuracy and robust performance across metrics demonstrate its potential for reliable GC screening. Feature fusion strategies114 with a support vector machine and a random forest were applied to classify histopathology images for GC classification. Cross-magnification experiments yielded promising results, achieving accuracies of nearly 80% and 90% when tested on unseen images at varying resolutions.

In another study, radiopathomics models were developed using logistic regression, Naive Bayes, and support vector machine classifiers, integrating pathomics with radiomics features to classify GC stage115. A DL-based prediction was made116 using the primary tumor slide score and histopathological lymph node status. A multimodal fusion DL model was proposed using histopathology images to predict the GC tumor mutational level117. In short, DL approaches have shown better results in detecting and categorizing GC118. However, a significant remaining issue is the further improvement of assessment metrics to boost the reliability and robustness of these approaches.

In a related study119, the authors classified biopsy histopathology whole-slide images of the stomach and colon into three categories, adenocarcinoma, adenoma, and non-neoplastic, employing a CNN combined with recurrent neural networks (RNNs), demonstrating a promising approach for the efficient classification of whole-slide images in gastrointestinal pathology. To improve the algorithm’s resilience to visual changes and provide a regularization effect, several data augmentation approaches were used in conjunction with the conventional Inception-V3 network architecture. The trained Inception-V3 network served as a feature extractor and provided input to an RNN model that can handle sequences of varying length and generate a single output. To confirm the methodology, the study used external datasets from the TCGA-STAD and TCGA-COAD programs, which are publicly accessible via the Genomic Data Commons portal. The work120 addresses issues such as label noise and feature aggregation redundancy in multi-instance learning for cancer diagnosis using whole-slide images. Inter-bag discrimination and fine-grained feature encoding are enhanced by the suggested dual-curriculum contrastive MIL technique. Its potential to improve whole-slide image-based cancer prognostic analysis has been demonstrated by experiments performed on public datasets, which show better performance compared to state-of-the-art techniques. To address the unpredictability and predictive constraints of the Laurén classification, a study121 developed a DL model for GC classification. The DL model demonstrated strong classification performance and superior patient survival stratification compared to pathologists, demonstrating its promise as a diagnostic and prognostic tool; it was trained using TCGA data (N=166) and externally verified on European (N=322) and Japanese (N=243) cohorts. Researchers examined the shortcomings of conventional staining methods, such as IHC and EBER-ISH, in precisely distinguishing GC molecular subclasses122. To predict molecular subclasses directly from hematoxylin-eosin-stained histology, they utilized an ensemble CNN. The TCGA-based decision tree for GC subtyping was challenged by the model’s identification of intra-tumoral heterogeneity and overlapping subclass traits. A study developed deep learning-based models, GastroMIL and MIL-GC, to assist in diagnosing GC and predicting overall survival using hematoxylin and eosin-stained pathological images108. Trained on cohorts from Renmin Hospital of Wuhan University and the Cancer Genome Atlas, with external validation from the National Human Genetic Resources Sharing Service Platform, the models achieved a diagnostic accuracy of 0.920, comparable to expert pathologists. While the focus of this review is on gastric cancer, it is noteworthy that deep learning-based approaches have also been successfully applied to other types of cancer, including urological cancers. For instance, studies on urological cancers123 have demonstrated the effectiveness of AI models in diagnosing, predicting, and treating various subtypes such as prostate124, bladder125, and renal cancers126, with detection accuracies ranging from 77% to 95%.

The motivation for this study arises from the limitations of current GC detection and diagnosis methods, which have primarily relied on traditional ML models. Although DL models have shown potential, they still require further refinement to improve their effectiveness. Past investigations highlight that attention mechanisms enhance DL model efficiency, but there is significant untapped potential in using multiple attention mechanisms to extract multi-scale information. Additionally, integrating attention mechanisms with transfer learning could improve diagnostic efficiency. We have found voids in the existing literature, including the extraction of comprehensive multi-scale information and the incorporation of multiple attention mechanisms for enhanced diagnostic performance in GC. Therefore, this study aims to develop an MCAM framework utilizing a transfer learning approach to create a more robust and efficient automated GC diagnostic system.

Materials and methods

This section delves into an in-depth examination of the three key components fundamental to our suggested framework: TL, attention mechanisms, and CNNs. We believe our detailed explanation of these fundamental components will give readers the knowledge they need to appreciate the subtleties of our suggested framework. The following explanation thoroughly explains the MCAM architecture, as illustrated in Fig. 2. This step-by-step dissection is designed to promote coherent comprehension, guaranteeing that readers can assimilate the framework’s theoretical foundations and architectural nuances in an orderly fashion.

Fig. 2
figure 2

A complete architecture of the proposed methodology having three channels, namely MGIC, SIC, and MSIC, using SE, SimAM, and ECA attention mechanisms with Inception-V3, VGG-16, and Xception CNN models, respectively. The (a) training and (b) testing phases are separated by a red dotted line.

Convolutional neural network

A CNN is a feedforward neural network distinguished by its distinct design, which incorporates convolution operations and deep layer stacking. CNNs are made up of many layers, each with a distinct function. The convolutional layer, which uses convolution kernels to extract image features, is the main component. The pooling layer then condenses the input feature map, highlighting key features. The fully connected layer creates connections between all features and, as the last step, performs classification using a classifier. In the context of CNNs, the information retrieved by the convolutional layers can be divided into two main categories: global and local. Global information is the comprehensive representation of an image within its class. Local information, often known as spatial information, examines the characteristics of narrow, isolated regions within the image. Smaller convolution kernels often extract this type of information, enabling the network to recognize finer details and localized features essential for classification tasks. In our proposed methodology, we have employed three CNN architectures: Inception-V387, VGG-1684, and Xception86. Each has a special layout and set of features. These networks have been extensively employed for various computer vision tasks, including object identification, feature extraction, and image categorization. The specific needs of the task and the available processing resources influence the choice of architecture.
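As an illustration of the layer roles just described (convolution for feature extraction, pooling for condensation, and a fully connected classifier), the following PyTorch sketch builds a deliberately tiny network; it is only a didactic example under assumed input size and class count, not any of the backbones used in this work.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolution extracts features, pooling condenses them,
    and a fully connected layer performs the final classification."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution kernels extract local features
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pooling condenses the feature map
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                       # global pooling before the classifier
        )
        self.classifier = nn.Linear(32, num_classes)       # fully connected classification layer

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Example: a batch of four 224x224 RGB patches -> two-class logits
logits = TinyCNN()(torch.randn(4, 3, 224, 224))
```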

Transfer learning

Training CNN models from scratch requires a lot of data and computing power, resulting in lengthy training durations. The training issue is further exacerbated by the peculiarities of medical datasets. TL stands out in this situation as an effective approach to overcoming these difficulties in CAD work127. TL is an ML technique that reuses a previously trained model for a different task128. The TL procedure consists of two parts. The first step is choosing a source dataset and pre-training on it. The second step involves fine-tuning the pre-trained model using the target task’s dataset.

ImageNet is a widely used dataset with over a million images across 1000 classes for image processing applications129,130,131. The ImageNet dataset, recognized for its extensive and varied collection of images, serves as the source dataset for pre-training the model in this work. However, using the conventional TL technique to pre-train MCAM models directly presents significant difficulties because of limitations in workstation computing capacity. We have therefore modified the TL technique to work around these computational constraints. This modified method involves loading, layer by layer, the pre-training parameters from conventional CNN models, such as VGG-16, Inception-V3, and Xception, made available through the PyTorch Vision package, into the MCAM model. The spatial information channel (SIC), multi-scale global information channel (MGIC), and multi-scale spatial information channel (MSIC) components, described in Fig. 2, are the elements of the MCAM architecture to which these parameters belong. Notably, during training, these loaded layers remain frozen. The fully connected layers and AM layers are at the center of the fine-tuning process, where the model adjusts to the specifics of the target CAD task. Additionally, a weighted voting system assigns the channels their proper weights, ensuring the model successfully incorporates information from each source. This approach maximizes the utility of pre-trained models by utilizing their generic feature extraction capabilities and customizing them to the unique requirements of the CAD task. By combining TL with selective fine-tuning, a balance is struck between utilizing prior knowledge and adapting the model to the specifics of medical image analysis.
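To make the layer-freezing step concrete, the sketch below loads ImageNet-pretrained weights for one backbone from torchvision, freezes its convolutional layers, and replaces the classification head with a two-class layer so that only the head (and, in the full framework, the attention layers) is fine-tuned. It is a simplified sketch, not the full MCAM loading procedure; VGG-16 is used because Xception is not shipped with torchvision, and the weights-enum API assumes a recent torchvision release.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained VGG-16 (Inception-V3 can be loaded analogously via models.inception_v3).
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the pretrained convolutional layers so they stay fixed during fine-tuning.
for param in backbone.features.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new two-class head (normal vs. abnormal).
backbone.classifier[-1] = nn.Linear(backbone.classifier[-1].in_features, 2)

# Only the unfrozen parameters (the new head) are handed to the optimizer.
trainable_params = [p for p in backbone.parameters() if p.requires_grad]
```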

Multi-channel attention mechanism

One of the most critical ideas in the field of DL is the AM method132. When only one AM is used, it may be difficult to distinguish between important details and extraneous information, resulting in the decision-making process including extraneous or redundant information. Therefore, the accuracy and effectiveness of the model’s predictions may be jeopardized. Innovative methods, like MCAM, that concentrate on concurrently recording connections across several channels or feature maps, are necessary to overcome these constraints. By doing this, MCAMs improve the model’s capacity to identify important patterns and eliminate superfluous or duplicate data, thereby increasing the precision and dependability of the predictions made by the model. We propose an MCAM model that uses three channels, MGIC, SIC, and MSIC, to extract characteristics from various viewpoints. These three complementing channels improve the accuracy of categorization tasks and the precision of identifying attention areas.

MGIC: The model in the MGIC is intended to extract multi-scale global information. The Inception-V3 model87, rooted in GoogleNet133, is widely regarded as a CNN model well suited to capturing comprehensive global information. The Inception-V3 model employs a distinctive convolution technique, breaking down large filter sizes through parallel and factorized convolution rather than increasing the number of network layers. The term “inception structure” encompasses the entire decomposition module. This model also features five distinct inception structures, each with unique elements. The Inception-V3 model substantially reduces parameters relative to other models by adopting an Inception module instead of a large convolution kernel. Furthermore, it replaces a fully connected layer with a global average pooling (GAP) layer. Because of its parallel convolution structure and comparatively large convolution kernels, Inception-V3 excels among CNN models at extracting global multi-scale information. Therefore, the Inception-V3 model is chosen to extract features in the MGIC. The Inception-V3 model implements the extraction of multi-scale information by concatenating receptive fields of various sizes, and each feature map’s channel domain reflects the multi-scale capability of the Inception-V3 model. The SE attention mechanism, which yields a good distribution of channel weights, is chosen for the MGIC to increase the weighting of the channel features45. The structure of the SE attention mechanism is shown in Fig. 3. Squeeze and excitation are the two stages of the SE attention process. The squeeze phase applies global average pooling to encode all spatial features into a single global feature, producing channel-wise statistics. The excitation phase uses two fully connected layers, a dimensionality-reduction layer and a dimensionality-increasing layer, to determine the channel-wise importance; the sigmoid activation function then determines the final channel-wise weights. The SE module includes channel and spatial attention modules, as shown within the dotted border in Fig. 3. The channel and spatial modules help the network learn “what” and “where” to pay attention to along the channel and spatial axes. The spatial attention module uses the inter-spatial relations of features to produce a spatial attention map. A convolution operation (kernel: [1, 1], stride: [1, 1], channels: 1) is used to obtain \(x_{s}\) (H \(\times\) W \(\times\) 1) from the input x (H \(\times\) W \(\times\) C). Here, H, W, and C represent height, width, and channel, respectively. By spatially multiplying the input x with \(x_s\), the channel dimension is transformed from \(C_1\) to \(C_2\), and the spatial attention map \(x_{spatial}\) (H \(\times\) W \(\times\) \(C_1\)) is produced. This transformation of \(C_1\) to \(C_2\) and back to \(C_1\) in the spatial attention module is illustrated within the outlined border in Fig. 3.

The channel attention module creates a channel attention map and can selectively boost helpful features while suppressing invalid ones. A GAP operation on the input x produces \(x_c\) (1 \(\times\) 1 \(\times\) \(C_1\)). A full convolution (channels: \(C_3\), where \(C_3\) = \(C_1\)/4) and ReLU are applied to \(x_c\) to produce \(x'_c\) (1 \(\times\) 1 \(\times\) \(C_3\)). A further full convolution (channels: \(C_1\)) and sigmoid activation are then applied to \(x'_c\), yielding \(x''_c\) (1 \(\times\) 1 \(\times\) \(C_1\)). The channel-wise multiplication of the input x with \(x''_c\) yields the channel attention map \(x_{channel}\) (H \(\times\) W \(\times\) \(C_1\)). After the two attention maps are added, convolution (kernel: [3, 3], stride: [1, 1]), batch normalization, and ReLU are applied sequentially to obtain the output of the attention block.
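For the channel-weighting path described above, a minimal squeeze-and-excitation block in PyTorch is sketched below; it covers only the squeeze (GAP) and excitation (two fully connected layers plus sigmoid) stages, omitting the spatial branch of Fig. 3, and the reduction ratio of 4 mirrors the \(C_3 = C_1/4\) reduction mentioned in the text.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pooling -> channel statistics.
    Excitation: two fully connected layers + sigmoid -> channel-wise weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # dimensionality-reduction layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # dimensionality-increasing layer
            nn.Sigmoid(),                                # final channel-wise weights
        )

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                # squeeze: (B, C)
        w = self.fc(w).view(b, c, 1, 1)       # excitation: per-channel weights
        return x * w                          # re-weight the input channels

# Example: re-weight a 64-channel feature map
out = SEBlock(64)(torch.randn(2, 64, 32, 32))
```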

Fig. 3
figure 3

Structure of squeeze-and-excitation (SE) module after each inception block in multi-scale global information channel (MGIC). The outlined border shows the structure of the spatial and channel attention module. H, C, and W represent height, channel, and width, respectively. Abbreviation: GAP stands for global average pooling.

SIC: This channel is designed to extract the most useful spatial information. The SimAM attention mechanism allocates weights to features along the spatial dimension134. Fig. 4 visually represents the architecture of SimAM. In visual neuroscience, the most informative neurons are those whose firing patterns differ from those of surrounding neurons, and an active neuron also suppresses the activity of its neighbors, a phenomenon known as spatial suppression135. Measuring the linear separability between a target neuron and the other neurons is the quickest way to identify these spatially distinctive neurons. The edge features of images frequently play a significant role in classification problems in computer vision, and, just like the edge elements of images, spatial suppression neurons frequently display extraordinarily high contrast with the surrounding colors and textures. The energy function from neuroscience is thus used by the SimAM attention mechanism to assign weights to different spatial regions. Because the energy function treats every pixel of a feature map as an individual neuron, the minimal energy of a neuron can be expressed as in Eq. (1).

$$\begin{aligned} { e^*_{x} } = \frac{4(\sigma ^2 + \omega )}{(x-\mu )^2 + 2\sigma ^2 + 2\omega } \end{aligned}$$
(1)

Where x is the target neuron, \(\sigma\) and \(\mu\) are the variance and mean calculated over all neurons except the target neuron, and \(\omega\) is a coefficient added to the variance to smooth its effect, thereby controlling the attention mechanism’s sensitivity to the variance of the features. The coefficient \(\omega\) is set to \(10^{-4}\), following the setting used on the CIFAR datasets in134. Spatially suppressed neurons have a higher linear separability than other neurons, which results in a considerable deviation between x and \(\mu\) and a low \(e^*_{x}\). Correspondingly, it is believed in neuroscience that neurons with lower energy are more distinct from nearby neurons. Therefore, \(e^*_{x}\) can be used to determine each neuron’s weight. A scaling operator, Eq. (2), completes the SimAM attention mechanism.

$$\begin{aligned} { \tilde{F} = \mathrm {sigmoid} \left( \frac{1}{E} \right) F} \end{aligned}$$
(2)

where \(\tilde{F}\) and F are the output and input feature maps, and E groups all \(e^*_{x}\) across the channel and spatial dimensions. A sigmoid is added to limit excessively high values, so the sigmoid activation function determines each neuron’s confidence at each location. The output of the SimAM block is a feature map with the same dimensions as the input block; however, the feature values are altered based on the attention weights to highlight significant regions and suppress less significant ones. This helps the model draw conclusions and learn from the most pertinent features in the data. The VGG-16 model84 was introduced by the Visual Geometry Group (VGG). Its novel contributions were to increase network depth from 8 to 16 layers and to split large convolution kernels, such as 9 \(\times\) 9 and 7 \(\times\) 7, into multiple small 3 \(\times\) 3 convolution kernels. Due to its deep and consistent architecture, which stacks many layers of 3 \(\times\) 3 convolutional filters, VGG-16 excels at extracting spatial information. This design allows the network to capture complex spatial patterns and hierarchies. VGG-16 builds a hierarchy of feature maps with progressively decreasing spatial dimensions and increasing feature channels to encode low-level and high-level spatial details. Additionally, the pre-trained VGG-16 models, trained on expansive datasets such as ImageNet, offer a solid foundation for spatial feature extraction, making VGG-16 an excellent option for computer vision tasks demanding accurate spatial understanding. After AlexNet136, it represents another major step in DL and serves as a benchmark for evaluating new approaches. The VGG model has many benefits137; it uses small convolution kernels to improve the extraction of spatial information.
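A minimal sketch of Eqs. (1)-(2), following the reference formulation of SimAM134, is given below; the constant offset and the default \(\omega = 10^{-4}\) come from that formulation, and the function is a parameter-free stand-in rather than the exact SIC integration.

```python
import torch

def simam(x: torch.Tensor, omega: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM weighting (Eqs. 1-2). x: feature map (B, C, H, W)."""
    n = x.shape[2] * x.shape[3] - 1                     # number of "other" neurons per channel
    d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2     # squared deviation from the channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n             # variance over the remaining neurons
    inv_energy = d / (4 * (v + omega)) + 0.5            # low energy -> large weight
    return x * torch.sigmoid(inv_energy)                # scale features by the attention weights

# Example: apply SimAM to a 64-channel feature map
out = simam(torch.randn(2, 64, 32, 32))
```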

MSIC: The depth-separable convolution of the Xception model86 is used to implement the MSIC channel. To properly extract multi-scale spatial information, depth-separable convolution diversifies the information derived from the individual channels within the feature map. MSIC thereby diversifies information extraction within each channel while efficiently capturing multi-scale spatial details. After each flow, the Xception model employs the ECA attention mechanism to improve its capacity to capture information at multiple scales. The ECA attention mechanism uses a fast method to weigh the significance of each feature map’s channel information46. The ECA attention mechanism initially uses GAP to collect channel-specific data, followed by a 1-D convolution with a kernel of size k to gather cross-channel interaction data, and finally a sigmoid activation function to obtain the channel-wise weights. This approach enhances the model’s ability to extract valuable features while economizing computational resources, making it an excellent choice for tasks requiring precise channel attention. Fig. 5 presents the ECA attention mechanism architecture. The Xception model86 combines depth-separable convolution with a residual mechanism to enhance the Inception-V3 model87. Contrary to conventional convolution, depth-separable convolution processes each channel in the feature map independently138. The benefit of Xception is the integration of depth-separable convolution with the residual structure: the image’s multi-scale characteristics are successfully extracted using depth-separable convolution, and the network converges quickly thanks to the residual mechanism. In contrast to the Inception-V3 model, the small convolution kernels in depth-separable convolution give the Xception model excellent local multi-scale information extraction capabilities.
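The ECA weighting just described (GAP, a 1-D convolution of kernel size k across channels, then a sigmoid) can be sketched as follows; the kernel size k = 3 is an illustrative choice, since in practice it is usually derived adaptively from the channel count46.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention (sketch): GAP -> 1-D conv over channels -> sigmoid."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # GAP: channel-specific statistics (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)    # cross-channel interaction (B, C)
        w = torch.sigmoid(w)[:, :, None, None]      # channel-wise weights (B, C, 1, 1)
        return x * w

# Example: re-weight a 64-channel feature map
out = ECABlock(k=3)(torch.randn(2, 64, 28, 28))
```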

Fig. 4
figure 4

Structure of SimAM in the spatial information channel (SIC). H, C, and W represent height, channel, and width, respectively. Abbreviation: EF stands for energy function.

Fig. 5
figure 5

Structure of ECA module in Multi-scale spatial information channel. H, C, and W represent height, channel, and width, respectively. Abbreviations: GAP stands for global average pooling.

Multi-channel ensemble strategy: By combining the strengths of numerous data sources or channels, a multi-channel fusion strategy can greatly improve classification performance. This methodology generates complementary insights by combining data from many channels, enriching the feature representation and enhancing the model’s ability to distinguish between classes. To increase classification performance, this method uses an integrated classifier that depends on the weights and classification decision values of the various channels69. Additionally, it encourages robustness by allowing the classifier to adjust to changes and difficulties in particular channels, lessening the influence of noise or uncertainties. It optimizes decision-making through fusion techniques, such as weighted voting or feature concatenation, reducing the likelihood of misclassification and bolstering overall accuracy. This method essentially combines many data streams into a single, comprehensive perspective, producing a classification system that is more accurate and efficient across a variety of applications and domains. In this experiment, the final feature maps of MGIC, SIC, and MSIC are passed through pooling, fully connected, and softmax layers to obtain the classification decision values for each channel. To produce the classification decision values for the MCAM model, the decision values of each channel are then weighted and combined using grid-weighted voting. The formula for weighted majority voting combines the votes of multiple classifiers, each weighted by its importance or reliability; the final classification decision is the class label that receives the most weighted votes. Let \(\omega _{i}\) be the weight of channel i and \(v_{i}(j)\) its vote for label \(l_{j}\). The weighted vote for class label \(l_{j}\) is computed using Eq. (3).

$$\begin{aligned} { V(j) = \sum \limits _{i=1}^{n} \omega _{i} v_{i}(j)} \end{aligned}$$
(3)

The class with the maximum classification decision value of the MCAM model is then taken as the final classification outcome. The final classification decision C is the class label with the highest weighted vote, which is calculated using Eq. (4).

$$\begin{aligned} { C = argmax_{j\in {\{1, . . . , k\}}} V(j)} \end{aligned}$$
(4)

In short, the weighted vote for each class label is computed, followed by determining the class label with the highest weighted vote. This ensures the final decision considers individual classifiers’ votes and their respective weights, leading to a reliable classification.
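A minimal sketch of the weighted-voting rule in Eqs. (3)-(4) is shown below; the channel weights and the per-channel softmax decision values are placeholders chosen only for illustration.

```python
import numpy as np

def weighted_vote(channel_probs: list, weights: list) -> int:
    """channel_probs: one softmax decision vector per channel (MGIC, SIC, MSIC);
    weights: importance w_i of each channel.
    Returns argmax_j of V(j) = sum_i w_i * v_i(j)  (Eqs. 3-4)."""
    votes = sum(w * p for w, p in zip(weights, channel_probs))
    return int(np.argmax(votes))

# Example: illustrative decision values for one image (class 0 = normal, 1 = abnormal)
probs = [np.array([0.3, 0.7]), np.array([0.6, 0.4]), np.array([0.2, 0.8])]
label = weighted_vote(probs, weights=[0.4, 0.3, 0.3])   # -> 1 (abnormal)
```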

The feature map F is defined as \(F\in R^{C_{1}\times H\times W}\). All the input feature points \(x_i\) share weights across the M input and \(\hat{M}\) output channels. The feature map is fed into a convolutional layer to obtain \(\{A,B,C\}\in R^{\hat{M}\times H\times W}\), which are reshaped to \(\{A,B,C\}\in R^{\hat{M}\times N}\), where N is the feature map size, and A and C are transposed. A matrix multiplication of A and B, normalized row-wise, generates the attention map as expressed in Eq. (5).

$$\begin{aligned} AM_{ab} = \frac{\exp \left( \sum \limits _{k=1}^{\hat{M}} A_{ak} B_{kb}\right) }{\sum \limits _{b'=1}^{N} \exp \left( \sum \limits _{k=1}^{\hat{M}} A_{ak} B_{kb'}\right) } \end{aligned}$$
(5)

The channel attention module’s input-output relation is expressed in the Eq. (6).

$$\begin{aligned} { z_{a} = \frac{1}{c(x)} \sum \limits _{\forall b} f(x_a, x_b)\, x_b} \end{aligned}$$
(6)

\(x_a\) and \(z_a\) are the channel’s input and output feature maps. To reduce computation, the feature maps are flattened into 1-D column vectors, \(\{x_a, x_b\} \in R^{N}\). The correlation function is defined in Eq. (7).

$$\begin{aligned} { f(x_{a},x_{b}) = e^{Q((x_{a} - Q(x_{a})) . (x_{b} - Q(x_{b})))}} \end{aligned}$$
(7)

Here \(Q((x_{a} - Q(x_{a})) \cdot (x_{b} - Q(x_{b})))\) is the covariance of \(x_{a}\) and \(x_{b}\), and \(Q(x_a)\) is approximated by the mean of \(x_a\). Normalizing the covariance by N gives the form shown in Eq. (8).

$$\begin{aligned} { f(x_{a},x_{b}) = e^\frac{{Q((x_{a} - Q(x_{a})) . (x_{b} - Q(x_{b})))}}{N}} \end{aligned}$$
(8)
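For concreteness, the correlation measure of Eqs. (7)-(8) can be written as a small function; the inputs are assumed to be flattened feature vectors of equal length N, and this is only an illustrative reading of the formula, not the exact implementation used in the channel attention module.

```python
import torch

def correlation(xa: torch.Tensor, xb: torch.Tensor) -> torch.Tensor:
    """f(x_a, x_b) = exp(cov(x_a, x_b) / N) for flattened feature vectors of length N."""
    n = xa.numel()
    cov = ((xa - xa.mean()) * (xb - xb.mean())).mean()   # Q(.) approximated by the mean
    return torch.exp(cov / n)                            # normalization by N as in Eq. (8)

# Example on two random feature vectors of length 64
f_ab = correlation(torch.randn(64), torch.randn(64))
```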

The operational approach of the multi-channel ensemble model is outlined through Algorithm 1.

Algorithm 1
figure a

Pseudocode of the algorithm followed by each channel in the ensemble framework.

Experimental results and analysis

This section delves into the experimental setup, giving an overview of the conditions that led to the thorough testing of our proposed framework. A detailed discussion of the classification experiment results and an analysis of the long-term experiment results are provided. This thorough investigation seeks to provide a nuanced understanding of our experimental setup’s performance metrics and results. Through thoroughly examining the results under various conditions and scenarios, we offer readers a thorough understanding of the efficiency and resilience of our suggested framework in various experimental settings, thus assisting in a comprehensive assessment of its capabilities.

Experimental environment

This section thoroughly investigates the experimental environment, covering essential components like dataset information, dataset partitioning, experimental parameter configurations, and evaluation metrics used to gauge the effectiveness of the suggested framework. A thorough analysis of the segmentation procedures and comprehensive insights into the make-up and properties of the datasets used in our experiments are presented. Comprehensive explanations of the experimental parameter settings critical to the framework’s functionality provide insight into the decisions made during the experimentation process. Moreover, the assessment metrics employed to determine the efficacy of the suggested framework are elaborated upon, offering a thorough summary of the methodological factors and standards utilized for a comprehensive appraisal of its capacities.

Dataset

GasHisSDB is a recently released histopathology image dataset with 245,196 images. The dataset is divided into three cropped sub-datasets with image sizes of 160\(\times\)160, 120\(\times\)120, and 80\(\times\)80 pixels. Each sub-dataset contains separate folders of normal and abnormal images. The total numbers of normal and abnormal images are 148,120 and 97,076, respectively. Table 1 shows the GasHisSDB dataset distribution. The normal images are generally free from any cancerous region; in addition, the nuclei of the cells in the micrograph are regularly arranged in a single layer with essentially little mitosis67. Therefore, an image under an optical microscope can be determined to be normal if no canceration of any cells or tissues is seen and the criteria of a normal image are met139. The abnormal images with malignant cells show that GC typically takes the form of an ulcer. Cancer nests spread as the condition worsens, invading the muscle, serosal, and mucosal layers. The tumor has a rough texture and is frequently gray or white. The cancer cells can be grouped in a nest, acinar, tubular, or cord shape when observed under a microscope, and the border with the stroma is typically distinct. However, the line dividing the cancer cells from the stroma is blurred when they invade it67. Normal and abnormal sample images from the three sub-datasets, A, B, and C, are shown in Fig. 6. Based on the aforementioned information, a pathological image can be determined to be abnormal when cells are seen to form gland or adenoid structures that are uneven in size, varied in shape, or arranged irregularly. The malignant cells are frequently irregularly distributed in multiple layers in the abnormal images, and the nuclei display a variety of sizes and division phenomena15,140,141,142. The GasHisSDB dataset contains a diverse collection of histopathology images, captured under different imaging conditions and representing a wide range of patient cases. This dataset includes variations in staining techniques and tissue structures, making it well suited for evaluating the robustness of the proposed framework. Additionally, it contains both normal and abnormal samples across multiple resolution levels (160\(\times\)160, 120\(\times\)120, and 80\(\times\)80 pixels), ensuring a comprehensive assessment of the model’s ability to adapt to different image scales. The structured nature of the dataset facilitates rigorous testing, allowing the model to learn the discriminative features essential for accurate GC classification across varied clinical scenarios.
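Assuming the per-class folder layout described above (separate normal and abnormal folders inside each sub-size directory; the path and folder names below are placeholders), one sub-dataset can be loaded with torchvision's ImageFolder, for example:

```python
from torchvision import datasets, transforms

# Placeholder path to one sub-size dataset, e.g. the 160x160 sub-dataset A
root = "GasHisSDB/160"

tfm = transforms.Compose([
    transforms.Resize((160, 160)),   # keep the sub-dataset's native patch size
    transforms.ToTensor(),
])

# ImageFolder maps each class folder (e.g. Abnormal/Normal) to an integer label
dataset = datasets.ImageFolder(root, transform=tfm)
print(dataset.classes, len(dataset))
```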

Table 1 GasHisSDB dataset distribution description.
Fig. 6
figure 6

Sample images from the GasHisSDB Database: Sub-datasets A, B, and C with resolutions of 160x160, 120x120, and 80x80 Pixels, respectively, showcasing both normal and abnormal class samples.

Data setting

The GasHisSDB dataset distribution technique has been developed for extensive evaluation and reliable model training. We used a meticulous procedure for each of the sub-datasets A, B, and C separately. First, each sub-dataset, featuring both normal and abnormal classes, is subjected to a randomized split into training and testing sets with a proportional 70:30 distribution. The training data are then further divided, with training and validation sets randomly assigned at a 70:30 ratio. This internal division enables a refined model creation approach comprising training and validation phases. Importantly, to account for the randomness of the data splits, the training data are randomly re-partitioned into training and validation sets four times. This strategy reduces the effects of data-partitioning randomness on the outcomes and ensures the robustness of the performance evaluation of the ML model. Moreover, this data distribution and experimental approach allows the model’s generalization ability to be better understood and ensures that the model’s performance is reliable and independent of any specific random partition.
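A sketch of the 70:30 splits described above is given below using scikit-learn; the label counts are illustrative placeholders, the random seed is arbitrary, and stratification is an assumption made to keep the class proportions of the original split.

```python
from sklearn.model_selection import train_test_split

# Placeholder labels for one sub-dataset: 0 = normal, 1 = abnormal (counts are illustrative)
labels = [0] * 700 + [1] * 300
indices = list(range(len(labels)))

# 70:30 split into training and testing sets
train_idx, test_idx = train_test_split(
    indices, test_size=0.30, stratify=labels, random_state=0)

# The training portion is split again 70:30 into training and validation sets
train_labels = [labels[i] for i in train_idx]
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.30, stratify=train_labels, random_state=0)
```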

The images were normalized using min-max normalization to scale the pixel intensity values to a standard range. This ensures uniformity in the data, reduces the effects of varying pixel intensities, and enhances the efficiency of the training process for deep learning models. The normalization was applied to each pixel intensity value x in the images using the Eq. (9).

$$\begin{aligned} x' = \frac{x - \min (x)}{\max (x) - \min (x)} \end{aligned}$$
(9)
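A direct sketch of Eq. (9) applied to an image patch is shown below; the small epsilon is an added safeguard against a constant-intensity patch and is not part of the original formula.

```python
import numpy as np

def min_max_normalize(img: np.ndarray) -> np.ndarray:
    """Min-max normalization (Eq. 9): rescales pixel intensities to [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)   # epsilon guards against division by zero

# Example: an 8-bit patch with intensities in 0..255 is mapped to [0, 1]
patch = np.random.randint(0, 256, size=(80, 80, 3)).astype(np.float32)
normalized = min_max_normalize(patch)
```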

The data settings for sub-datasets A, B, and C are listed in Tables 2, 3, and 4, respectively, and the distribution of the complete GasHisSDB database is listed in Table 5.

Table 2 Sub-dataset A distribution for training, validation, and testing.
Table 3 Sub-dataset B distribution for training, validation, and testing.
Table 4 Sub-dataset C distribution for training, validation, and testing.
Table 5 Complete GasHisSDB database distribution for training, validation, and testing.

Hyper-parameters setting

To achieve optimal performance on the GasHisSDB dataset, the hyperparameter settings for the proposed MCAM model were empirically tuned. Based on preliminary tests, the main hyperparameters, including the learning rate, batch size, and optimizer settings, were methodically varied to strike a balance between generalization, stability, and training effectiveness. The model was trained for 100 epochs with a batch size of 16, chosen after experimenting with smaller and larger values: smaller batch sizes led to increased gradient noise, while larger batch sizes resulted in higher memory requirements without a significant performance gain. The learning rate was set to \(2 \times 10^{-3}\), selected after testing a range of values between \(1 \times 10^{-4}\) and \(1 \times 10^{-2}\). The chosen value provided a suitable balance between convergence speed and stability, ensuring effective optimization without overshooting the minima. The AdamW stochastic optimizer was used for optimization due to its ability to handle weight decay effectively, which is critical for regularization. The optimizer’s parameters were carefully configured as follows: the epsilon \((\epsilon )\) was set to \(1 \times 10^{-8}\) to ensure numerical stability during gradient updates, the weight decay was set to \(1 \times 10^{-2}\) to regularize the model and prevent overfitting, and the momentum parameters \((\beta _{1}, \beta _{2})\) were set to [0.9, 0.999], which are commonly used defaults for Adam-based optimizers and provide a balance between convergence speed and model generalization.

The model parameters were assessed on the validation set following each training cycle to guarantee strong generalization. The training process was conducted with the parameters that yielded the best validation accuracy. With this method, it was guaranteed that the model configuration with the best performance would be used for additional testing and assessment. Furthermore, early stopping was used to end training if no discernible improvement was seen on the validation set over a predetermined number of consecutive epochs, even though the model was trained for a maximum of 100 epochs. This helped to mitigate overfitting. These steps, combined with the modified transfer learning approach discussed above, provided a systematic framework for parameter tuning and optimization, ensuring that the MCAM model achieved high accuracy and robustness for GC classification.
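The optimizer configuration reported above, together with a simple early-stopping helper, can be sketched as follows; the stand-in model and the patience value are assumptions made only to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in for the MCAM model

# AdamW configured with the hyperparameters reported in the text
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, eps=1e-8,
                              weight_decay=1e-2, betas=(0.9, 0.999))

class EarlyStopping:
    """Signals a stop when validation accuracy has not improved for `patience` epochs."""
    def __init__(self, patience: int = 10):        # patience value is an assumption
        self.patience, self.best, self.wait = patience, 0.0, 0

    def step(self, val_acc: float) -> bool:
        if val_acc > self.best:
            self.best, self.wait = val_acc, 0       # improvement: reset the counter
            return False
        self.wait += 1
        return self.wait >= self.patience           # True -> stop training
```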

Evaluation metrics

Selecting the right evaluation criteria is essential to avoid bias when comparing different algorithms. The most common measures for assessing classification performance are sensitivity (Sens.), specificity (Spec.), average accuracy (Avg. Acc.), and F1-score. These metrics are defined using true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The assessment parameters Sens. (Rec.), Spec., Avg. Acc., F1, and Pre. are calculated using Eqs. (10), (11), (12), (13), and (14), respectively. Sensitivity, also known as recall, measures the proportion of correctly classified positive samples among all actual positive samples. Conversely, specificity measures the model’s capacity to distinguish negative instances accurately and represents the proportion of correctly classified negatives among all actual negatives. A key indicator of how well a model predicts outcomes is accuracy, which considers both true positives and true negatives relative to all samples; it is the most typical and fundamental evaluation criterion. When aiming for a unified evaluation of classification models, the F1-score’s combination of precision and recall provides a thorough review that balances the trade-off between false positives and false negatives. Precision calculates the proportion of TP results among all positive predictions made by the model, whereas recall calculates the proportion of TP among all actual positive instances. These metrics help evaluate and improve a model’s performance for particular application domains by providing critical insights into its strengths and weaknesses.

The evaluation metrics used in this study are clinically significant in GC classification. Sensitivity is crucial to ensure that positive cases are correctly identified, thereby reducing the likelihood of missed cancer diagnoses, which can lead to delayed treatment. High specificity is equally important, as it minimizes false positives, preventing unnecessary invasive procedures such as biopsies. The F1-score, which balances precision and recall, is particularly useful in histopathology image classification, where an imbalance between normal and abnormal samples can impact model reliability. A high F1-score indicates that the model performs well across both categories, ensuring a more dependable decision-support tool for pathologists. By achieving high values in these metrics, our proposed framework demonstrates its potential for clinical application, aiding in accurate, efficient, and early detection of GC.

$$\begin{aligned} Sens. = \frac{TP}{TP+FN} \end{aligned}$$
(10)
$$\begin{aligned} Spec. = \frac{TN}{TN+FP} \end{aligned}$$
(11)
$$\begin{aligned} Avg. Acc. = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(12)
$$\begin{aligned} F1 = \frac{2 \times TP}{2 \times TP+FP+FN} \end{aligned}$$
(13)
$$\begin{aligned} Pre. = \frac{TP}{TP+FP} \end{aligned}$$
(14)
Fig. 7

Confusion matrices for sub-dataset A from three randomized experiments using the proposed MCAM model. (a)-(c) represent results on validation data, while (d)-(f) correspond to results from randomized experiments on the testing dataset. Each column corresponds to one experiment. The green blocks indicate the counts and percentages of true positive and true negative cases, while the red blocks represent false positive and false negative cases. In the last row, the first block shows sensitivity for normal cases and specificity for abnormal cases, the middle block shows sensitivity for abnormal cases and specificity for normal cases, and the last block represents the overall classification accuracy as a percentage. This visualization highlights the model’s consistent performance across all experiments.

Fig. 8

Confusion matrices for sub-dataset B from three randomized experiments using the proposed MCAM model. (a)-(c) represent results on validation data, while (d)-(f) correspond to results from randomized experiments on the testing dataset. Each column corresponds to one experiment. The green blocks indicate the counts and percentages of true positive and true negative cases, while the red blocks represent false positive and false negative cases. In the last row, the first block shows sensitivity for normal cases and specificity for abnormal cases, the middle block shows sensitivity for abnormal cases and specificity for normal cases, and the last block represents the overall classification accuracy as a percentage. This visualization highlights the model’s consistent performance across all experiments.

Fig. 9

Confusion matrices for sub-dataset C from three randomized experiments using the proposed MCAM model. (a)-(c) represent results on validation data, while (d)-(f) correspond to results from randomized experiments on the testing dataset. Each column corresponds to one experiment. The green blocks indicate the counts and percentages of true positive and true negative cases, while the red blocks represent false positive and false negative cases. In the last row, the first block shows sensitivity for normal cases and specificity for abnormal cases, the middle block shows sensitivity for abnormal cases and specificity for normal cases, and the last block represents the overall classification accuracy as a percentage. This visualization highlights the model’s consistent performance across all experiments.

Fig. 10

Confusion matrices for complete GasHisSDB database from three randomized experiments using the proposed MCAM model. (a)-(c) represent results on validation data, while (d)-(f) correspond to results from randomized experiments on the testing dataset. Each column corresponds to one experiment. The green blocks indicate the counts and percentages of true positive and true negative cases, while the red blocks represent false positive and false negative cases. In the last row, the first block shows sensitivity for normal cases and specificity for abnormal cases, the middle block shows sensitivity for abnormal cases and specificity for normal cases, and the last block represents the overall classification accuracy as a percentage. This visualization highlights the model’s consistent performance across all experiments.

Classification assessment

This section provides a detailed analysis of the performance of our suggested model by presenting a thorough exposition of the experimental results obtained on the individual sub-datasets and the entire dataset. The discussion includes in-depth analyses of the contrast-experiment results, illuminating how our model compares with pertinent benchmarks. Furthermore, we perform extensive experiments to thoroughly evaluate the adaptability and stability of our suggested model in a range of scenarios. This comprehensive analysis provides a nuanced understanding of the model's performance across various data subsets and its flexibility in different experimental scenarios, thereby aiding in a comprehensive assessment of its effectiveness and potential utility.

Experimental results

Confusion matrices are generated to comprehensively assess the outcomes of our proposed MCAM model in three randomized experiments on sub-datasets A, B, and C and the whole GasHisSDB database. A confusion matrix is invaluable for analyzing a model's performance in a classification challenge: it offers a succinct description of how closely the model's predictions match the actual ground-truth (GT) labels. Figures 7, 8, 9, and 10 present the confusion matrices for the three randomized experiments on sub-datasets A, B, and C and the whole GasHisSDB database, respectively.
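The sketch below shows one common way to tally such a binary confusion matrix from GT and predicted labels using scikit-learn; the label vectors are toy examples, not the paper's data.

```python
# Tallying a binary confusion matrix (0 = normal, 1 = abnormal) with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # toy model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
print(f"overall accuracy = {(tp + tn) / len(y_true):.2%}")
```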

In the presented confusion matrices, the 1st column provides a detailed breakdown of results for the normal class. True negative (TN) instances are indicated by the 1st-row values, which are given as a percentage of TNs to all input samples. False positive (FP) cases are indicated in the second row by the percentage of FPs in all input samples. The last row displays the normal cases' percentage sensitivity (in green). The first row in the 2nd column shows false negatives (FN) and their percentage relative to the total number of samples. True positives (TP) are displayed in the second row, along with their percentage of all input data samples. The final row displays the normal cases' specificity percentage (in green text). In the 3rd column, the 1st row highlights the percentage value (in green text) of TN cases relative to the sum of TN and FN cases. The 2nd row displays the percentage value of the FP cases (in green text) relative to the sum of FP and TP cases. The last row provides the overall accuracy percentage value in green text. This detailed breakdown offers a comprehensive view of the performance metrics associated with each category. Moreover, the values of these randomized experiments, along with their averages, are reported in Table 6. The highest average values are highlighted in bold to facilitate reader comprehension, while the second highest values are underlined. Table 6 provides a thorough performance evaluation of the proposed model, showing its efficacy using the evaluation metrics, including sensitivity, specificity, average accuracy, and the F1-score, for sub-datasets A, B, and C and the whole GasHisSDB dataset separately.

In Table 6, it is evident that the average accuracy of the proposed MCAM model surpasses 99.50% for sub-datasets A and B. However, for sub-dataset C, the average accuracy declines to 98.31%. This discrepancy can be attributed to the lower resolution of the images in sub-dataset C, which inherently provides fewer detailed features for the model to analyze than higher-resolution images. Consequently, the model’s classification performance is slightly compromised on this subset. To address this concern comprehensively, we will analyze these results in the context of the samples and provide detailed explanations for each comparison, highlighting the best values using bold font to ensure clarity and emphasis. The detailed experimental findings for all sub-datasets and the complete dataset will be elaborated upon in the subsequent section, facilitating a comprehensive understanding of the model’s performance across different scenarios.

Sub-dataset A: In the 1st experiment, for the validation set (see Fig. 7 (a)), 11 images in the normal category were incorrectly identified as abnormal, while 16 abnormal images were incorrectly classified as normal. For the test set, in Fig. 7 (d), 26 normal images were incorrectly identified as abnormal, whereas 19 abnormal images were incorrectly classified as normal.

In the 2nd experiment, for the validation set, Fig. 7 (b) shows 12 images in the normal category were mistakenly labeled as abnormal, whereas 16 abnormal images were wrongly labeled as normal. In the test set, Fig. 7 (e) shows 25 normal images were wrongly classified as abnormal, whereas 17 abnormal images were incorrectly classified as normal.

In the third experiment, for the validation set (see Fig. 7 (c)), 15 images in the normal category were incorrectly identified as abnormal, and 12 abnormal images were incorrectly classified as normal. For the test set, 28 normal images were incorrectly identified as abnormal, whereas 15 abnormal images were incorrectly classified as normal; see Fig. 7 (f).

For sub-dataset A, the sensitivity, specificity, F1-score, and precision values averaged over the three randomized experiments are 99.71/99.40, 99.40/99.71, 99.58/99.56, and 99.36/99.46 for the normal/abnormal classes on the validation set, respectively. For the testing set, these values are 99.48/99.48, 99.98/99.48, 99.57/99.57, and 99.66/99.43. The average accuracies on the validation and testing sets are 99.94 and 99.57, respectively.

Sub-dataset B: In the 1st experiment, for the validation set (see Fig. 8 (a)), 15 images in the normal category were incorrectly identified as abnormal, and the same number of abnormal images were incorrectly classified as normal. For the test set, in Fig. 8 (d), 42 normal images were incorrectly identified as abnormal, whereas 32 abnormal images were incorrectly classified as normal.

In the 2nd experiment, for the validation set, Fig. 8 (b) shows 18 images in the normal category were mistakenly labeled as abnormal, whereas 14 abnormal images were wrongly labeled as normal. In the test set, Fig. 8 (e) shows 52 normal images were wrongly classified as abnormal, whereas 37 abnormal images were incorrectly classified as normal.

In the third experiment, for the validation set (see Fig. 8 (c)), 20 images in the normal category were incorrectly identified as abnormal, and 13 abnormal images were incorrectly classified as normal. For the test set, 32 normal images were incorrectly identified as abnormal, whereas 26 abnormal images were incorrectly classified as normal; see Fig. 8 (f).

The evaluation metrics for sub-dataset B are presented in Table 6. For the validation set, the sensitivity, specificity, and F1-score values averaged over the three randomized experiments are 99.61/99.76, 99.76/99.61, and 99.69/99.68 for the normal/abnormal classes, respectively. For the testing set, these values are 99.97/99.94, 99.94/99.97, and 99.95/99.95. The average accuracies on the validation and testing sets are 99.94 and 99.60, respectively.

Sub-dataset C: In the 1st experiment, for the validation set (see Fig. 9 (a)), 121 images in the normal category were incorrectly identified as abnormal, and 82 abnormal images were incorrectly classified as normal. For the test set, in Fig. 9 (d), 423 normal images were incorrectly identified as abnormal, whereas 213 abnormal images were incorrectly classified as normal.

In the 2nd experiment, for the validation set, Fig. 9 (b) shows 149 images in the normal category were mistakenly labeled as abnormal, whereas 96 abnormal images were wrongly labeled as normal. In the test set, Fig. 9 (e) shows 484 normal images were wrongly classified as abnormal, whereas 201 abnormal images were incorrectly classified as normal.

In the third experiment, for the validation set (see Fig. 9 (c)), 133 images in the normal category were incorrectly identified as abnormal, and 88 abnormal images were incorrectly classified as normal. For the test set, 382 normal images were incorrectly identified as abnormal, whereas 188 abnormal images were incorrectly classified as normal; see Fig. 9 (f).

The evaluation metrics for sub-dataset C are presented in Table 6. For the validation set, the sensitivity, specificity, and F1-score values averaged over the three randomized experiments are 99.08/99.13, 99.13/99.08, and 99.25/99.24 for the normal/abnormal classes, respectively. For the testing set, these values are 98.40/99.16, 99.16/98.40, and 98.62/98.95. The average accuracies on the validation and testing sets are 99.48 and 98.31, respectively.

A comprehensive error analysis was conducted to examine misclassification patterns, with a specific focus on false positives and false negatives. The primary reason for misclassifications was the reduced resolution of 80\(\times\)80 pixel images, which resulted in a loss of structural details crucial for distinguishing between normal and abnormal tissue. Additionally, some cancerous and non-cancerous regions exhibited overlapping morphological features, leading to occasional confusion in classification. False positives were observed in cases where normal tissue contained irregular structural formations, causing the model to misclassify them as abnormal. Conversely, false negatives occurred in cases where cancerous regions had mild morphological variations, making them appear similar to normal tissues.

Complete GasHisSDB database: In the 1st experiment, for the validation set (see Fig. 10 (a)), 210 images in the normal category were incorrectly identified as abnormal, and 227 abnormal images were incorrectly classified as normal. For the test set, in Fig. 10 (d), 592 normal images were incorrectly identified as abnormal, whereas 406 abnormal images were incorrectly classified as normal.

In the 2nd experiment, for the validation set, Fig. 10 (b) shows 240 images in the normal category were mistakenly labeled as abnormal, whereas 231 abnormal images were wrongly labeled as normal. In the test set, Fig. 10 (e) shows 603 normal images were wrongly classified as abnormal, whereas 387 abnormal images were incorrectly classified as normal.

In the third experiment, for the validation set (see Fig. 10 (c)), 257 images in the normal category were incorrectly identified as abnormal, and 223 abnormal images were incorrectly classified as normal. For the test set, 567 normal images were incorrectly identified as abnormal, whereas 358 abnormal images were incorrectly classified as normal; see Fig. 10 (f).

The evaluation metrics for the complete GasHisSDB dataset are presented in Table 6. For the validation set, the sensitivity, specificity, and F1-score values averaged over the three randomized experiments are 99.08/99.13, 99.13/99.08, and 99.10/99.10 for the normal/abnormal classes, respectively. For the testing set, these values are 98.76/98.53, 98.53/98.76, and 98.98/98.98. The average accuracies on the validation and testing sets are 99.07 and 98.48, respectively.

Figure 11 shows the plot of Sens. against 1 - Spec. (the receiver operating characteristic curve) obtained using the MCAM model for classification. A curve close to the upper-left corner indicates a large area under the curve (AUC) and high accuracy. As shown in Fig. 11, the model achieves an AUC value of 0.9866.
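The sketch below shows how such a sensitivity versus 1 - specificity curve and its AUC can be produced from predicted class probabilities; the scores here are synthetic and do not reproduce the 0.9866 value reported in the text.

```python
# ROC curve (sensitivity vs. 1 - specificity) and AUC from synthetic scores.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                     # toy GT labels
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, size=200), 0, 1)   # toy abnormal-class probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)    # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.4f}")
plt.plot([0, 1], [0, 1], linestyle="--")    # chance-level diagonal
plt.xlabel("1 - Specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()
```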

A notable finding from examining the model's performance is that the differences between the test and validation accuracies remain remarkably small, never exceeding 1.00%. This result highlights the extensibility and resilience of the proposed MCAM model. The model's stability and capacity for effective generalization are evident because it maintains consistent accuracy levels across the training and validation data and when exposed to new, unseen data (the test set). Such minor differences in accuracy between the validation and test sets show the model's ability to adjust to different data distributions and support its potential as a reliable tool.

Table 6 Performance evaluation of the proposed MCAM model in three randomized experiments on validation and testing sets. The best-achieved results for all sub-datasets are in bold, whereas the second-best results are underlined. “C” is the class that has “N” and “A,” which represent normal and abnormal labels, whereas GHS represents the complete GasHisSDB dataset. [Values in %].

Contrast experiments of GC diagnosis and classification

The following are the three contrast investigations: the initial comparison assesses the MCAM framework against standard DL models, while the second scrutinizes its performance in contrast to models without TL. The third comparison evaluates the MCAM framework against models lacking attention mechanisms.

Proposed MCAM versus competitive deep learning models: To affirm the superior performance of our MCAM framework in the task of GC diagnosis, we benchmark it against 18 different DL models spanning ViT, CNN, and MLP architectures. The ViT models include ViT92, CaiT93, DeiT94, CoaT96, BoTNet-5098, LeViT97, and T2T-ViT95. The CNN models are VGG-1684, Xception86, Inception-V387, AlexNet136, DenseNet-12185, InceptionResNet-V189, ResNet-5083, and ResNeXt-5088. The MLP models include gMLP143, MLP-Mixer144, and ResMLP145. A comparison of these DL models with our proposed MCAM model is reported in Table 7. The assessment of the proposed model's performance measures involves aggregating outcomes from three randomized experiments performed on the complete GasHisSDB dataset. Within the normal category, EfficientNetV2 displayed the highest sensitivity (98.37%) and F1-score (98.40%), while VGG-16 demonstrated the best specificity (98.50%). Conversely, in the abnormal group, Xception achieved the maximum sensitivity (98.55%), while Inception-V3 provided the top values for specificity (98.71%) and F1-score (98.24%). Notably, EfficientNetV2 displayed the highest average accuracy at 98.06%. The CNN models consistently outperformed the other DL models, and the suggested MCAM framework displayed higher performance than all traditional DL models. Compared with the best results from the traditional models, MCAM's assessment metrics demonstrated gains of 0.08, 0.63, and 0.75 for sensitivity, specificity, and F1-score, respectively, in the abnormal category. For the normal category, these gains are 0.63, 0.03, and 0.72. Although these improvements are modest, they underline the suggested framework's potential.

The findings of the comparative experiment, which contrasted the performance of the suggested MCAM framework with that of classic DL approaches, demonstrate a remarkable advancement in the capabilities of the MCAM model for the task of GC detection and classification. The MCAM model greatly surpassed the traditional DL model in accuracy and effectiveness, emphasizing its more significant potential and efficacy in this crucial diagnostic task.

Table 7 Assessing the efficacy of the proposed model against conventional DL models using the test dataset. The best-achieved values are in bold, while the second-highest values are underlined. The top values are individually highlighted in bold and underlined for both normal and abnormal categories. “N” and “A,” represent normal and abnormal labels. [Values in %].
Fig. 11

Graph of Sens. versus 1 - Spec. obtained using the MCAM model for classification.

Proposed MCAM versus competitive ensemble models: To evaluate the performance of the proposed MCAM model, it was compared against state-of-the-art hybrid and ensemble models from competitive studies. As shown in Table 8, our proposed model consistently outperforms existing models across all sub-datasets. Although previously reported accuracies were already high, our model achieves small but consistent improvements, indicating potential for further enhancement. Specifically, the average accuracy on the 160x160 dataset increased by 0.37%, on the 120x120 dataset by 0.91%, and on the 80x80 dataset by 0.62%.

Table 8 Performance comparison of our proposed MCAM model with the previous state-of-the-art hybrid models from competitive studies on the GasHisSDB dataset. The best-achieved results are in bold. [Values in %].

MCAM framework with and without TL: To analyze the impact of TL on the experiment's efficacy, we performed a comparative analysis between a model trained with TL and one trained without TL during the retraining phase. The findings of these experiments are presented in Table 9. Within the abnormal class, without TL, the MCAM model attained an F1-score of 98.11%, a specificity of 98.12%, and a sensitivity of 98.48%. In contrast, the inclusion of TL led to enhanced assessment measures, with values of 98.97% for sensitivity, 99.20% for specificity, and 98.91% for F1, corresponding to improvements of 0.49%, 1.08%, and 0.80%, respectively. Within the normal category, in the absence of TL, the MCAM model recorded sensitivity, specificity, and F1 values of 98.00%, 98.12%, and 98.21%, respectively. Conversely, when TL was integrated, considerable gains in the assessment measures were observed, with values of 98.80% for sensitivity, 98.91% for specificity, and 99.00% for F1, corresponding to improvements of 0.80%, 0.79%, and 0.79%, respectively. The average accuracies for the unfrozen (without TL) and frozen (with TL) configurations are 98.25% and 98.87%, respectively; that is, the model with TL is 0.62% higher than the model without TL. In short, the proposed MCAM model with TL performs better than the model without TL.
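The freeze/unfreeze distinction can be illustrated with the short torchvision sketch below; the choice of VGG-16 as the backbone and of freezing only the convolutional layers is an assumption for illustration, not the authors' exact layer configuration.

```python
# With TL: load ImageNet weights and freeze the convolutional feature extractor.
# Without TL: random initialization with all layers trainable.
import torch.nn as nn
from torchvision import models

def build_backbone(use_tl: bool) -> nn.Module:
    weights = models.VGG16_Weights.IMAGENET1K_V1 if use_tl else None
    net = models.vgg16(weights=weights)
    if use_tl:
        for param in net.features.parameters():
            param.requires_grad = False          # frozen layers are not retrained
    net.classifier[6] = nn.Linear(4096, 2)       # new normal/abnormal head
    return net

model_with_tl = build_backbone(use_tl=True)
model_without_tl = build_backbone(use_tl=False)
```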

Table 9 Model performance comparison with and without TL by freezing/unfreezing the network layer. [Values in %].

Ensemble model without attention mechanism: To examine the utility of the attention mechanism modules, we replaced the MGIC, SIC, and MSIC channels with the conventional Inception-V3, VGG-16, and Xception models, forming an ensemble model. The results of this ensemble model, averaged across three randomized experiments, are compared with those of the MCAM framework in Table 10. Within the abnormal category, the ensemble model displayed sensitivity, specificity, and F1 values of 99.00%, 98.14%, and 98.64%, respectively. In contrast, our MCAM model surpassed the ensemble model with sensitivity, specificity, and F1 values of 99.12%, 98.91%, and 98.82%, showing improvements of 0.12%, 0.77%, and 0.18%, respectively. In the normal category, the ensemble model achieves sensitivity, specificity, and F1 values of 98.29%, 98.14%, and 98.45%, respectively, whereas our proposed MCAM model achieves 98.04%, 98.89%, and 98.88%. While our model exhibits a 0.25% decrease in sensitivity compared to the ensemble model, it achieves a 0.43% improvement in the crucial F1 score. The average accuracy of our model is also 0.57% higher than that of the ensemble model. These findings emphasize the positive influence of the attention mechanisms within the MCAM framework in boosting accuracy and resilience compared to the ensemble of traditional DL models.
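A minimal sketch of such an attention-free ensemble is given below; the tiny stand-in CNNs replace the full Inception-V3, VGG-16, and Xception backbones so the snippet runs anywhere, and probability averaging (soft voting) is assumed as the fusion rule.

```python
# Soft-voting ensemble of three backbones without attention modules.
import torch
import torch.nn as nn

def tiny_cnn() -> nn.Module:    # stand-in for a full backbone such as VGG-16
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))

backbones = [tiny_cnn(), tiny_cnn(), tiny_cnn()]

@torch.no_grad()
def ensemble_predict(x: torch.Tensor) -> torch.Tensor:
    probs = [m(x).softmax(dim=1) for m in backbones]   # per-model class probabilities
    return torch.stack(probs).mean(dim=0)              # averaged (soft-voting) prediction

x = torch.randn(4, 3, 160, 160)       # toy batch at the sub-dataset A patch size
print(ensemble_predict(x).shape)      # torch.Size([4, 2])
```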

Table 10 Model performance comparison with and without attention mechanism. [Values in %].

Performance analysis of the base models: The Gradient-weighted Class Activation Mapping (Grad-CAM) maps shown in Fig. 12 highlight the regions of the input image that are most influential in the base model's classification decision. Grad-CAM generates a heatmap in which warmer colors indicate regions with greater influence on the prediction, making the output easier to interpret. This fuller understanding of model behavior aids debugging and can improve performance by revealing when the model focuses on irrelevant features.
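For reference, the snippet below is a minimal Grad-CAM implementation built from forward and backward hooks; the torchvision VGG-16 backbone, its last convolutional layer as the target, and the random input are illustrative choices, not the authors' exact setup.

```python
# Minimal Grad-CAM: weight feature maps by pooled gradients of the target class.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(weights=None).eval()
target_layer = model.features[28]              # last convolutional layer of VGG-16

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)                # toy input image
logits = model(x)
logits[0, logits.argmax()].backward()          # gradients w.r.t. the predicted class

weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # channel importance
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))  # weighted combination
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # heatmap in [0, 1]
```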

Sample normal and abnormal images processed with the base Inception, VGG-16, and Xception models across all sub-datasets are shown in Fig. 12. It is interesting to observe that each base model focused on different areas within the images for classification. Consequently, when an ensemble model was employed, it could analyze a broader set of features, and this comprehensive analysis resulted in consistently superior performance to the individual base models. The accuracy of the base models improved as the resolution increased, moving from the low-resolution sub-dataset C (80x80) to the high-resolution sub-dataset A (160x160). This improvement is expected since higher-resolution images provide more detailed features for the model to analyze.

Fig. 12

Visual explanations for three sub-datasets using deep networks Inception, VGG-16, and Xception models for normal and abnormal cases. The first row displays the original images, followed by three rows showcasing Grad-CAM results for each model and the final row illustrates Grad-CAM++ results for the Xception model.

Extended experiments

Here, we conduct a series of ablation experiments in which we methodically break down the elements of our suggested model to identify their respective roles. Concurrently, we expand our assessment to include the HCRF dataset149 on gastric histopathology, offering a more comprehensive view of the model's effectiveness on various datasets. Interchangeability experiments are carried out on the three essential modules of the MCAM framework, which enriches our analysis and provides valuable insights into the flexibility and cooperation of these parts. This final section also covers the computational time and experimental setup, rounding out the evaluation of the framework's effectiveness. Furthermore, a comparative study with conventional DL models is provided, offering insight into the relative benefits of our proposed MCAM framework.

Ablation experiments

We systematically carried out a series of ablation tests using the experimental parameters provided in Section 4.1.3 to identify the precise contributions of the three channels within the MCAM framework. The results of these ablation tests are shown in Table 13, highlighting each channel’s unique importance and influence within the framework.

Firstly, the first row shows that the average accuracy when using only MGIC is only 0.18% lower than that of MCAM. It is noted that in the third ablation experiment, the sensitivity reported using only MGIC for the normal category equals that of MCAM, and in the abnormal category the sensitivity is 0.02% higher than that of MCAM. When MGIC is removed (sixth row), the average accuracy is reduced by a notable 0.33%. These results highlight the essential role that MGIC performs within the MCAM framework.

Secondly, in the third row of the ablation experiment, it is clear that the average accuracy obtained with MSIC alone is only 0.43% less than that of MCAM. In the second ablation experiment, it is noteworthy that the sensitivity for the abnormal category when only MSIC is used exceeds that of MCAM by 0.21%. The average accuracy decreases by 0.16% in the absence of the MSIC channel, as shown in the fourth row. These results underscore the indispensable role played by MSIC within the comprehensive MCAM framework.

Finally, the ablation experiment in the second row demonstrates that the average accuracy achieved with the SIC channel alone is 0.84% less than that of the MCAM framework. The average accuracy drops by 0.05% in the fifth row of the ablation experiment after removing the SIC. Compared with the MCAM model, the first ablation experiment without SIC showed higher sensitivity and F1 values for the abnormal category and higher specificity for the normal category. Furthermore, in contrast to the MCAM framework, the sensitivity, specificity, and F1 were higher in the second ablation experiment without an SIC channel. The evaluation metrics for the SIC-less model in the third ablation experiment were lower than those for MCAM. These experimental results show a limited but meaningful role for the SIC in the overall MCAM framework. It is readily apparent from the analysis of the ablation experiments that, in the broader MCAM framework, both the MGIC and MSIC channels play a crucial and distinctive role; the overall performance of the framework is noticeably worse in their absence. On the other hand, even though SIC plays a smaller role, it nevertheless enhances the framework's functionality.

HCRF image classification

We conduct experiments on the publicly available H&E-stained gastric histopathological HCRF dataset at 20x magnification150, which is available in151 and shown in Fig. 13, to confirm that the MCAM framework has good generalization ability. The dataset images are in the format "*.tiff" or "*.png". The dataset comprises 560 abnormal images with corresponding GT annotations and 140 normal images, each with a resolution of 2048x2048 pixels. The dataset size is increased sixfold through augmentation: flipping images horizontally and vertically and rotating them by 90, 180, and 270 degrees. Moreover, the images are cropped to 256 x 256 pixels because the original gastric histopathology images are too large to process directly. Table 11 displays the data augmentation information. The HCRF dataset was selected for this study due to its diverse set of histopathology images captured under varying conditions, including differences in staining, illumination, and scanner resolutions. It includes images from multiple patient categories, ensuring variability in tissue morphology and pathological characteristics. This diversity enhances the model's ability to generalize across different clinical settings, making it more robust to real-world variations.
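The sixfold augmentation and patch cropping described above can be sketched as follows; the blank placeholder image and the non-overlapping cropping grid are assumptions for illustration.

```python
# Sixfold augmentation (original, two flips, three rotations) and 256x256 cropping.
import torchvision.transforms.functional as TF
from PIL import Image

def augment_sixfold(img):
    return [img,
            TF.hflip(img),            # horizontal flip
            TF.vflip(img),            # vertical flip
            TF.rotate(img, 90),
            TF.rotate(img, 180),
            TF.rotate(img, 270)]

def crop_patches(img, size: int = 256):
    w, h = img.size
    return [img.crop((x, y, x + size, y + size))
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

slide = Image.new("RGB", (2048, 2048))     # placeholder for a 2048x2048 HCRF image
patches = [p for aug in augment_sixfold(slide) for p in crop_patches(aug)]
print(len(patches))                        # 6 augmentations x 64 patches = 384
```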

The augmented dataset images are randomly divided into 70% training and 30% testing data. The training portion is then further split at random into 70% training and 30% validation sets. Table 12 shows the dataset distribution for training, validation, and testing. The proposed MCAM framework obtained an average accuracy of 99.87% on the validation set and 99.64% on the testing set. The confusion matrices on the validation and test sets are shown in Fig. 14. The percentage accuracies obtained on the validation and testing datasets are 99.84% and 99.65%, respectively. The calculated values of all the evaluation metrics are reported in Table 15. In the future, sophisticated segmentation techniques23 could show promising results in quantifying the abnormal region.
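A sketch of this nested random split is shown below; the placeholder file list and the fixed random seed are illustrative assumptions.

```python
# 70/30 train-test split, then 70/30 train-validation split of the training part.
from sklearn.model_selection import train_test_split

all_images = [f"patch_{i:05d}.png" for i in range(10_000)]   # placeholder file names
all_labels = [i % 2 for i in range(10_000)]                  # placeholder 0/1 labels

train_x, test_x, train_y, test_y = train_test_split(
    all_images, all_labels, test_size=0.30, random_state=0)
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.30, random_state=0)

print(len(train_x), len(val_x), len(test_x))   # 4900 2100 3000
```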

Fig. 13

Examples of the stained gastric histopathological images from the HCRF dataset149. (a) shows the original images with a normal condition, (b) shows the original images with an abnormal condition, and (c) represents the corresponding ground truth of the abnormal images provided in the dataset.

Table 11 HCRF data augmentation.
Table 12 HCRF dataset distribution for training, validation, and testing.
Fig. 14

Confusion matrices for the HCRF dataset149 using the proposed MCAM model. (a) shows the results on validation data, and (b) shows the results on testing data. The green blocks indicate the counts and percentages of true positive and true negative cases, while the red blocks represent false positive and false negative cases. In the last row, the first block shows sensitivity for normal cases and specificity for abnormal cases, the middle block shows sensitivity for abnormal cases and specificity for normal cases, and the last block represents the overall classification accuracy as a percentage. This visualization highlights the model's consistent performance on both the validation and testing sets.

Table 13 Performance evaluation of the proposed MCAM model in three randomized experiments on the testing dataset. The best-achieved values are in bold, while the second-highest values are underlined. The top values are individually highlighted in bold and underlined for normal and abnormal categories. (\(\bullet\) indicates the used channels) [Values in %].

Interchangeability experiments

We conduct an extended experiment under the constraints described in Section 4.1.3 to validate the interchangeability of the three modules inside the MCAM framework.

Since SimAM134 and CBAM50 both assign weights to the spatial information of the VGG-16 model, CBAM50 is used in SIC as a substitute for SimAM134. In the context of MGIC, SRM47 and ECA46 are similar to SE45 in that they allocate weights to information channels, enhancing the Inception-V3 model's capacity to extract global information at multiple scales. As a result, SRM47 and ECA46 are used in place of SE45. In MSIC, SRM and SE show similarities to ECA, increasing the Xception model's ability to extract multi-scale global information. As a result, SE45 and SRM47 are used in place of ECA46. The results of the interchanged modules are listed in Table 14; the first row corresponds to the proposed MCAM model, while the second to fifth rows correspond to the interchanged attention mechanisms. The substituted models' average classification accuracies range from 98.10% to 98.51%, a deviation of no more than 0.80% from the proposed MCAM model's performance, which is well within an acceptable threshold. In short, the three channels in the proposed MCAM framework are interchangeable to a reasonable degree.
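As one example of the channel-attention blocks being swapped in these experiments, the following is a minimal squeeze-and-excitation (SE) module; the reduction ratio of 16 is a common default and an assumption here, not a value taken from the paper.

```python
# Minimal squeeze-and-excitation (SE) channel-attention block.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                  # excitation: per-channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                       # reweight the feature-map channels

feat = torch.randn(2, 64, 40, 40)      # toy feature map
print(SEBlock(64)(feat).shape)         # torch.Size([2, 64, 40, 40])
```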

Table 14 Performance evaluation of the proposed MCAM model in three randomized experiments on testing datasets. The best-achieved values are in bold, while the second-highest values are underlined. The top values are individually highlighted in bold and underlined for normal and abnormal categories. [Values in %].
Table 15 Performance evaluation of the proposed MCAM model on validation and testing sets of HCRF dataset. [Values in %].

Testing environment and computational time

The experiments were run on a workstation with an Intel Core i7-8850H processor (3.60 GHz), 32 GB of RAM, an NVIDIA GeForce RTX 4060 8 GB GPU, and Windows 10 Professional. The workstation was configured with Python 3.10.8, Torchvision 0.14.0, and PyTorch 1.13.0. The proposed MCAM model takes 1.17 hours (4212 seconds) to train.

Discussion

Recently, DL models have become ever more significant in the field of medical diagnosis due to their rapid advancement. In particular, classifying histopathological images related to GC has become crucial for the prompt identification and prevention of the disease. To classify GasHisSDB, this paper presents and utilizes the MCAM framework, producing notable and efficient results. The proposed model is a modest yet pivotal stride toward advancing the automated diagnosis of GC.

Medical images are larger than conventional images and pose a unique challenge due to the non-uniform distribution of focused attention regions within the same class. For this peculiarity to be effectively analyzed, specialized approaches are frequently needed. Although highly effective, traditional CNN models tend to overcommit computational resources to edge information extraction because they mainly depend on convolutional kernels. This overemphasis on edges might not align with the subtle qualities of medical images, which necessitate a deeper comprehension of intricate patterns and structures. Consequently, it becomes necessary to implement alternative strategies, like integrating attention mechanisms, to guarantee that the computational resources of the model are optimally distributed among different relevant features. The combined framework addresses these particular issues, improving the models’ capacity to capture spatial details and multi-scale information while emphasizing the significance of modifying conventional techniques to meet the particular needs of medical image analysis.

One relevant aspect is that medical images are inherently complex, with complex anatomical structures and subtle variations that require the analytical models to be extremely sensitive. Furthermore, these models’ interpretability becomes critical in the medical domain, where obtaining high accuracy is not as important as comprehending the reasoning behind predictions. Another critical component is addressing issues with limited labeled data, a common problem in the medical field. Effective TL and data augmentation techniques can reduce this difficulty.

Our method incorporates an attention mechanism and a multichannel strategy to overcome this limitation. This combined framework aims to extract multi-scale information more easily by utilizing the benefits of a wide range of channels and attention mechanisms. By doing this, our model provides a more sophisticated and practical solution for image analysis in medical diagnostics, addressing the difficulties brought on by the special qualities of medical images. The VGG-16, Inception-V3, and Xception models are well known for their exceptional ability to extract essential data, such as multi-scale local features, multi-scale global information, and spatial details. Apart from their widely recognized ability to extract spatial details and multi-scale local and global attributes, the Xception, Inception-V3, and VGG-16 models provide a range of additional benefits. These models perform exceptionally well in TL, using their extensive pre-training on large datasets to show efficacy in situations with sparsely labeled data. Additionally, their architectures make it easier to extract robust hierarchical features, which is useful for tasks involving complex patterns and capturing both low-level and high-level representations. Because VGG-16, Inception-V3, and Xception have different architectural approaches, researchers and practitioners can select a model that best fits the demands of their particular tasks. These models have proven versatile beyond computer vision, finding use in various fields, including feature extraction, object detection, and image classification. The research and practitioner communities have widely adopted and supported these models, which has resulted in a wealth of resources, pre-trained models, and fine-tuning strategies that streamline the development and implementation process. Furthermore, these models' scalability allows modifications to meet the requirements of particular tasks or the complexity and size of particular datasets. Their tiered architecture also facilitates interpretability, offering a better understanding of the models' decision-making processes and insight into the hierarchical features learned during training.

Fascinatingly, these models become much more effective when attention mechanisms like SimAM, SE, and ECA are included, which significantly improves recognition accuracy. The synergistic integration of these attention mechanisms complements the inherent strengths of the base models. This integration highlights the models' expertise and represents a sophisticated method of information extraction. Combining the SimAM, SE, and ECA attention mechanisms yields a model that extracts information more thoroughly and accurately, producing a noticeable improvement in recognition performance. The proposed MCAM framework uses three different channels (SIC, MSIC, and MGIC) to support the depth of information extraction and guarantee the complementarity of the learned insights. Meanwhile, three attention mechanisms are implemented to enhance the model's depth further and protect the accuracy of the extracted data in each assigned channel. This combined method strengthens classification performance in both width and depth and creates a subtle synergy between the channels and attention mechanisms. Essentially, the choice of the previously mentioned models forms the basis for building the overall MCAM model, resulting in a novel framework that best utilizes width and depth considerations for improved classification abilities.

To enable a comprehensive comparison of the proposed methodologies with various traditional DL models, Table 16 presents an overview of the model parameters and training times. First, the suggested MCAM model performs very well, demonstrating a significant improvement in classification outcomes compared to conventional automatic techniques. Moreover, even though other model types like MLP and ViT generally outperform traditional CNN models in standard tasks and have proven adept at extracting global information, it is worth noting that these models' performance in this particular experiment was subpar because of overfitting problems. The experimental results validate the notion that the small size of the medical training set is a major cause of overfitting when used to train large or complex models. Interestingly, the ViT and CaiT models, which have large parameter counts, did not produce acceptable results. On the other hand, DeiT and T2T-ViT showed excellent classification performance. Similar trends are seen in MLP models, where better performance against the limited medical training set is attained through careful model architecture selection that favors more compact designs. In short, some small-scale models are computationally demanding due to the complexity of their network structures. In contrast, the MCAM framework uses simple convolutional and AM blocks across three channels: SIC, MGIC, and MSIC. By effectively reducing computation time across the three channels through parallel training techniques, the MCAM framework's training efficiency is maintained even when the model parameter count is significant.

Table 16 The training time and model parameters are compared between the suggested methods and other conventional DL models.

Successfully incorporating DL models into practical medical applications calls for a sophisticated strategy that goes beyond algorithmic aptitude. Domain-specific knowledge must be incorporated because it enables researchers and developers to customize models to the specifics of medical diagnostic and imaging procedures. This necessitates deeply comprehending pathological variations, anatomical structures, and medically specific imaging nuances. An essential component of this process is collaboration with medical professionals, which helps to close the knowledge gap between technical proficiency and clinical judgment. Involving pathologists, radiologists, and other medical specialists improves dataset annotations and helps ensure that the model's predictions make sense in the context of medicine, which improves the model's clinical relevance and interoperability.

Implementing DL models in real-world medical settings depends on the availability and effective use of computational resources, making this a crucial factor to consider before moving forward. Due to their high-resolution scans and large datasets, medical images are inherently complex and require significant processing power for training and inference. The computational intensity of tasks is increased by the size and depth of state-of-the-art models like VGG-16, Inception-V3, and Xception, which present difficulties in environments with limited resources. For real-time applications, where quick and precise diagnosis is essential, strong hardware and software optimized for efficient operation are required. Moreover, there is a constant need for more processing power due to the ongoing advancement of DL architectures and the exploration of ever-more complex models. The ability of computational infrastructure to scale up or down is crucial when models transition from experimental to practical application. While cloud-based solutions and distributed computing frameworks can potentially alleviate resource limitations, other considerations such as data privacy, network latency, and cost add to the complexity. Developing optimized model architectures, utilizing hardware accelerators, and investigating edge computing options are essential to improving the effectiveness of DL applications in healthcare environments. Maintaining accessibility and practicality in various healthcare settings while meeting the demanding requirements of medical workflows requires balancing computational power, energy efficiency, and real-time performance. Consequently, to fully realize the potential of DL models in transforming patient care and medical diagnostics, an integrated approach is required to address the impact of computational resources.

The proposed framework, which incorporates three attention-based channels alongside CNNs, increases computational complexity. Although transfer learning reduces the burden of training from scratch, fine-tuning still requires high-performance GPUs. This resource requirement may limit deployment in resource-constrained clinical settings, where optimization techniques such as model pruning, quantization, and knowledge distillation may be required. Furthermore, while the framework has been validated on two publicly available datasets, histopathological images vary between clinical settings due to differences in staining protocols, scanner resolutions, and patient demographics. Ensuring robustness across diverse environments requires domain adaptation techniques and testing on multi-center datasets. The Grad-CAM visualizations provide a degree of interpretability; however, clinicians may require more detailed reasoning. Another limitation is the potential class imbalance in rare gastric cancer subtypes, which could introduce bias. Lastly, for real-world adoption, the model must support real-time processing and offer a user-friendly interface for pathologists. Future efforts will focus on optimizing inference speed and integrating the framework into digital pathology workflows to ensure seamless clinical implementation.

Conclusion and future work

This study proposes a novel MCAM framework for GC detection in histopathological images using AMs with TL. The proposed MCAM framework uses a variety of AMs to facilitate automatic learning, showing notable improvements in GC detection over conventional DL models. The evaluation metrics, acquired through extensive testing, confirm the MCAM approach's efficacy. In addition, three extensive sets of experiments are carried out: ablation experiments clarify the unique function of every channel in the proposed model; interchangeability experiments confirm that the channels can feasibly be interchanged; and experiments on the HCRF dataset149 demonstrate the MCAM framework's generalization ability. Together, these results highlight the suggested framework's encouraging potential as a reliable and flexible tool for precisely identifying GC in histopathological images.

Our strategy fills essential gaps in current methods while offering a more interpretable and sophisticated deep-learning framework, improving the field of GC classification. Our model successfully captures fine-grained cellular features and broader tissue-level patterns by integrating multiscale feature extraction and attention methods. This provides pathologists with simple, interpretable visual indicators and improves the accuracy of cancer versus non-cancer differentiation. Furthermore, our thorough validation on an additional dataset, including external cohorts, supports the model's robustness and flexibility in actual clinical situations. By doing this, our work improves the accuracy and usefulness of AI-driven diagnosis and lays the groundwork for a smoother transition of these tools into clinical practice, ultimately contributing to better patient outcomes.

The findings of this study have significant clinical implications, as the proposed MCAM framework provides an interpretable and highly accurate approach for GC classification using histopathology images. By enhancing feature extraction and integrating attention-based visual explanations, the framework can assist pathologists in reducing diagnostic errors. The ability to accurately distinguish between normal and abnormal cases, even in lower-resolution images, suggests that the framework could be integrated into digital pathology workflows, supporting early cancer detection and treatment planning. Additionally, the model’s robust performance across different datasets highlights its potential for real-world deployment in multi-center clinical settings.

Future directions in DL-based medical image analysis research are promising and highlight the ongoing pursuit of improved capabilities and responsible applications. One prominent area of focus is novel architectures, specifically the investigation of architectures adapted to particular medical imaging modalities and pathologies. Tailored models can potentially optimize the extraction of clinically relevant information by utilizing insights specific to a given domain. A more thorough understanding of medical conditions may also result from exploring the integration of multi-modal information, such as merging imaging data with genomics, patient records, or other contextual data. Another area of research is optimization techniques, where scientists are trying to find a way to balance computational efficiency and model complexity. DL models could be more easily integrated into point-of-care settings if techniques were developed that guarantee quick and accurate inference on various hardware configurations, including edge devices. Additionally, improving interpretability and explainability frameworks is essential to increasing healthcare practitioners' trust in these models. This entails creating techniques to draw attention to pertinent aspects consistent with clinical reasoning and enhancing the transparency of model predictions. The deployment of DL models in medical contexts should be guided by ethical frameworks, which warrant further exploration in future research; ethical considerations remain paramount. This entails dealing with concerns of justice, accountability, and bias and ensuring that strong procedures for informed consent, data privacy, and regulatory compliance are in place. Responsible guidelines for integrating DL models into routine medical practices will largely be shaped by collaborative initiatives between technologists, medical professionals, ethicists, and regulatory bodies. Finally, our suggested approach focuses on cropped or representative sections of whole-slide images. This patch-based approach effectively captures the necessary local and global contextual information for classification while being computationally efficient. However, applying the proposed method directly to whole-slide images remains an important future direction. In future work, we plan to extend the framework by incorporating whole-slide image-level processing, which could include techniques such as multiscale patch extraction, whole-slide image-level aggregation, and advanced attention mechanisms to validate the generalizability and robustness of the proposed model in handling whole-slide images for real-world clinical applications. In summary, there is considerable potential for future research in DL-based medical image analysis. Researchers can play a significant role in the ongoing evolution of these models by investigating novel architectures, improving optimization techniques, and developing ethical frameworks. By doing so, they can ensure that these models' capabilities align with the complex nature of healthcare while adhering to principles of ethical conduct and responsibility.