Research on liver cancer pathology image recognition based on deep learning image processing

Yang, Mingyuan; Niu, Huanzhang

doi:10.1038/s41598-025-24834-7

Download PDF

Article
Open access
Published: 02 January 2026

Research on liver cancer pathology image recognition based on deep learning image processing

Mingyuan Yang¹ &
Huanzhang Niu¹

Scientific Reports volume 16, Article number: 467 (2026) Cite this article

1156 Accesses
Metrics details

Subjects

Abstract

The accurate diagnosis of Liver cancer through pathological image analysis remains challenging due to the complexity and heterogeneity of histopathological features. This study proposes MSAF-Net, a novel multi-space attention fusion network that systematically integrates five complementary feature spaces (R, B, Y, entropy, and LBP) with an SE-block enhanced fusion mechanism and EfficientNet-Lite based feature extraction. The proposed framework establishes a new state-of-the-art in pathological image analysis by effectively combining engineered feature spaces with deep learning, offering both high diagnostic reliability and computational efficiency for clinical applications. Experimental results demonstrate superior performance with 94.7% accuracy, 93.2% sensitivity, and 95.8% specificity, representing significant improvements of 6.3%, 7.1%, and 5.6% respectively over conventional single-space methods.

MANet: a multi-attention network for automatic liver tumor segmentation in computed tomography (CT) imaging

Article Open access 16 November 2023

A novel liver cancer diagnosis method based on patient similarity network and DenseGCN

Article Open access 26 April 2022

Liver CT image segmentation network based on multi-scale feature fusion

Article Open access 26 September 2025

Introduction

Traditional pathological diagnosis relies heavily on the subjective interpretation of histopathological images by experienced pathologists, which can be time-consuming and prone to inter-observer variability¹. By leveraging advanced artificial intelligence (AI) and deep learning techniques, automated image recognition systems can assist pathologists in identifying malignant features with higher consistency and speed, reducing diagnostic errors and workload². This technology can also facilitate the detection of subtle morphological patterns that might be overlooked by the human eye, enabling earlier intervention and better prognostic stratification. Moreover, the integration of AI-driven tools into clinical workflows can democratize access to high-quality diagnostic services, particularly in resource-limited settings where specialized pathologists are scarce. Beyond diagnosis, such systems can contribute to research by uncovering novel biomarkers and correlations between histological features and therapeutic responses, paving the way for personalized medicine. The development of robust and generalizable models for liver cancer pathology recognition thus holds promise not only for improving individual patient care but also for advancing our understanding of the disease’s heterogeneity and progression. As AI continues to evolve, its application in this field represents a transformative step toward more precise, efficient, and equitable healthcare solutions.

Over the past years, pathology image recognition technology has made significant breakthroughs in the field of liver cancer diagnosis, and its core challenge is to deal with the complex morphological features and low contrast of tissue sections. Khan³ proposes the Salp Swarm algorithm to optimize the integrated network provides new ideas for liver cancer pathology image analysis: the CLAHE contrast enhancement and DWT multi-resolution fusion strategy validated in this study for diabetic retinopathy diagnosis can be directly migrated to liver adenocarcinoma tissue section preprocessing, and its 88.52% cross-modal accuracy indicates weighted integration of DenseNet169, MobileNetV1 and other heterogeneous models can effectively capture the microscopic variations of alveolar structure. Khan⁴ further extend the 3D pathology block analysis method, which is complementary to the ScienceDirect literature on brain tumor MRI with the Multi-sequence image fusion technique to complement - the latter’s strategy to enhance glioma classification performance through T1/T2-weighted image fusion, revealing that liver cancer pathology can be combined with HE staining, immunohistochemistry, and other multimodal section data.

Deep learning-based computer vision techniques⁵ have emerged as powerful tools for the analysis of liver cancer pathology images, offering unprecedented capabilities in automated detection⁶, classification⁷, and prediction⁸. Traditional histopathological diagnosis relies on manual examination by pathologists, which is labor-intensive, subjective, and prone to variability. In contrast, convolutional neural networks (CNNs)⁹ and other deep learning architectures can process whole-slide images (WSIs) at scale, identifying malignant patterns with high accuracy. These models excel at recognizing complex morphological features, such as nuclear atypia¹⁰, glandular formation, and stromal invasion, which are critical for distinguishing between adenocarcinoma, squamous cell carcinoma, and other liver cancer subtypes. Transfer learning¹¹, where pre-trained models like ResNet¹² or Vision Transformers (ViTs)¹³ are fine-tuned on pathology datasets, has proven particularly effective in overcoming limited annotated data. Additionally, weakly supervised learning techniques, such as multiple instance learning (MIL), enable the analysis of gigapixel WSIs using only slide-level labels, reducing annotation burdens¹⁴. Beyond diagnosis, deep learning facilitates the quantification of tumor microenvironment characteristics, including immune cell infiltration and fibrosis, which are increasingly recognized as prognostic indicators. Despite these advances, challenges remain, including model generalizability across different staining protocols, scanner variations, and inter-institutional datasets, necessitating robust domain adaptation and data harmonization strategies.

The integration of computer vision in liver cancer pathology extends beyond diagnostic applications, playing a pivotal role in predictive and precision oncology. Deep learning models can correlate histopathological features with molecular alterations, such as EGFR¹⁵ or KRAS mutations¹⁶, enabling non-invasive biomarker discovery and guiding targeted therapies. Furthermore, graph neural networks (GNNs)¹⁷ and attention mechanisms¹⁸ enhance spatial context modeling, capturing interactions between tumor cells and surrounding tissues, which are crucial for understanding tumor progression and treatment resistance. Explainable AI (XAI) techniques, such as gradient-weighted class activation mapping (Grad-CAM)¹⁹, provide interpretable visualizations of model decisions, fostering trust among clinicians and facilitating clinical adoption. However, the translation of these technologies into routine practice faces hurdles, including regulatory approval, integration with hospital information systems, and ethical considerations regarding data privacy. As deep learning continues to evolve, its synergy with digital pathology promises to transform liver cancer care by enabling earlier, more accurate diagnoses and personalized therapeutic strategies, ultimately improving patient survival and quality of life.

Combining the current research results, we can find that traditional methods usually rely on a single color space such as RGB, which makes it difficult to comprehensively capture the complex features of pathological images; and the traditional features (e.g., SMLBP, ACHLAC) rely on manual design and have limited generalization ability.

This study proposes a novel deep learning framework named MSAF-Net for liver cancer pathological image recognition, which systematically integrates multi-space image mapping, deep feature extraction with adaptive fusion, and ensemble learning-based classification. The pipeline begins with a multi-space transformation module that decomposes the original histopathological images into five distinct feature spaces (R, B, Y, entropy, and LBP spaces) to comprehensively capture diverse visual characteristics and address the limitation of single-color space representation. The subsequent feature extraction module employs EfficientNet-Lite, a lightweight CNN architecture, to automatically learn high-level semantic features from each transformed space, overcoming the constraints of traditional handcrafted features like SMLBP and ACHLAC that rely on manual design. An innovative attention-based adaptive fusion mechanism is incorporated to dynamically weigh the importance of features from different spaces, effectively resolving the potential redundancy in multi-space features while preserving discriminative information. The final classification stage adopts an ensemble learning approach that combines predictions from multiple classifiers through voting mechanisms, enhancing the model’s robustness and accuracy beyond what could be achieved by any single classifier. This hierarchical design progressively transforms raw pathological images into more abstract and clinically meaningful representations, with each module addressing specific challenges in pathological image analysis: the first tackling information completeness, the second focusing on feature discriminability, and the third improving decision reliability. The architecture’s modular nature allows for flexible adaptation to different pathological analysis tasks while maintaining computational efficiency through the use of lightweight networks and optimized feature fusion strategies. By systematically integrating these advanced computer vision techniques, the proposed framework aims to establish a more comprehensive and reliable solution for automated liver cancer diagnosis from histopathological images.

The innovations of this study are the following three:

1.
The proposed method innovatively transforms pathological images into five complementary feature spaces (R/B/Y/entropy/LBP), enabling comprehensive representation beyond conventional color spaces for enhanced feature diversity.
2.
An SE-block enhanced feature fusion mechanism dynamically weights multi-space CNN features, effectively addressing feature redundancy while preserving diagnostically critical patterns.
3.
The pipeline combines EfficientNet-Lite feature extractors with voting-based ensemble classification, achieving robust performance without heavy computational overhead typical of deep learning systems.

Related work

Multi-space image analysis

Multi-space image analysis²⁰ has emerged as a powerful paradigm in medical image processing, particularly for complex tasks like pathological diagnosis. This approach operates on the fundamental principle that representing images in multiple complementary feature spaces can reveal hidden patterns that might be obscured in any single space. By transforming raw images into different domains—such as color channels, texture spaces, frequency domains, or entropy mappings—the technique provides a more comprehensive representation of the underlying biological information. The core idea stems from the observation that different feature spaces emphasize distinct aspects of tissue morphology, cellular arrangement, and pathological characteristics that collectively contribute to more accurate analysis.

From a technical classification perspective, multi-space methods can be broadly categorized into color space transformations²¹, texture analysis spaces²², and hybrid approaches. Color space decompositions, like RGB to HSV or LAB conversions, help separate staining information from tissue structures. Texture spaces including Local Binary Patterns (LBP)²³ or Gray-Level Co-occurrence Matrices (GLCM)²⁴ capture spatial relationships of pixel intensities. Hybrid methods combine these with more advanced spaces like wavelet or fractal domains to create multi-dimensional feature representations. Each category offers unique advantages for specific diagnostic challenges in pathological image interpretation.

The advantages of multi-space analysis are particularly evident in its ability to address the limitations of single-space approaches. Traditional methods relying solely on RGB space often miss critical diagnostic information present in other representations. Multi-space techniques overcome this by providing complementary views of the same tissue sample, enabling better discrimination between normal and malignant regions. This becomes especially valuable in cancer diagnosis where subtle variations in nuclear morphology, chromatin distribution, and tissue architecture require examination from multiple perspectives for accurate classification.

Implementation of multi-space analysis typically involves careful selection and combination of appropriate feature spaces based on the specific diagnostic task. Some spaces may be more suitable for detecting nuclear abnormalities, while others excel at identifying stromal changes or tumor margins. The challenge lies in determining which combinations provide optimal diagnostic value without introducing redundant or noisy features. Advanced machine learning techniques, including feature selection algorithms and attention mechanisms, are often employed to automatically identify and weight the most discriminative spaces for particular pathological patterns.

Looking forward, multi-space image analysis continues to evolve with the integration of deep learning architectures and computational efficiency improvements. Modern implementations increasingly combine traditional multi-space transformations with learned feature representations from convolutional neural networks, creating hybrid systems that leverage both engineered and learned features. This direction promises to further enhance the technique’s diagnostic capabilities while addressing its computational complexity. As digital pathology advances, multi-space approaches are likely to remain fundamental to comprehensive image analysis, particularly for challenging cases requiring nuanced interpretation of complex tissue structures.

Deep learning in pathology

Deep learning has revolutionized pathology image analysis by introducing powerful computational methods that can automatically learn discriminative features from vast amounts of histopathological data²⁵. At its core, these techniques leverage hierarchical neural networks, particularly convolutional neural networks (CNNs)²⁶, to extract increasingly abstract representations of tissue morphology without relying on manual feature engineering. The technology operates by learning spatial hierarchies of patterns - from low-level edges and textures to high-level tissue architectures and cellular arrangements - mirroring how pathologists examine slides at different magnifications. This approach has proven especially valuable in pathology where diagnostic patterns often involve complex interactions between multiple tissue components that are difficult to quantify using traditional image processing methods. Beyond CNNs, newer architectures like vision transformers are being adapted to pathology, offering improved capabilities in capturing long-range dependencies across whole slide images.

The applications of deep learning in pathology can be broadly categorized into diagnostic classification²⁷, segmentation²⁸, and predictive analytics tasks²⁹. Classification models help distinguish between different disease subtypes or grades, while segmentation algorithms precisely delineate regions of interest like tumors or specific tissue structures. More advanced implementations are moving beyond pure pattern recognition to predict molecular alterations, treatment responses, and patient outcomes directly from histology images. These models typically employ various network architectures tailored to specific tasks, such as U-Net for segmentation³⁰ or ResNet for classification³¹, often pretrained on natural images and fine-tuned with pathological data. The field has also seen growing interest in weakly supervised learning approaches that can work with slide-level labels, addressing the challenge of obtaining expensive pixel-level annotations for training.

The advantages of deep learning in pathology are manifold, offering improvements in accuracy, consistency, and scalability compared to conventional methods. These systems can process whole slide images much faster than human pathologists while maintaining high diagnostic accuracy, potentially reducing workload and improving turnaround times. They also demonstrate remarkable consistency, minimizing the inter-observer variability that plagues traditional pathology. Importantly, deep learning models can uncover subtle morphological patterns that may be imperceptible to human observers but correlate with clinically relevant outcomes. However, challenges remain in model interpretability, generalizability across different institutions, and seamless integration into clinical workflows. As the field matures, there’s increasing emphasis on developing explainable AI techniques and robust validation frameworks to facilitate clinical adoption while maintaining the rigorous standards required for medical diagnostics.

Feature extraction methods for medical imaging

Feature extraction methods for medical imaging form the foundation of computer-aided diagnosis by converting raw pixel data into meaningful quantitative representations³². Traditional approaches rely on handcrafted features³³ such as texture descriptors (GLCM, LBP)³⁴, shape metrics³⁵, and intensity statistics³⁶ that capture specific tissue characteristics. These methods are interpretable and computationally efficient but often fail to represent the complex patterns in medical images. More advanced techniques leverage filter banks and wavelet transforms to decompose images into multi-scale representations³⁷, providing better characterization of anatomical structures at different resolutions. While effective for certain tasks, these approaches require expert knowledge to select appropriate features and may miss subtle pathological patterns.

Deep learning has revolutionized feature extraction by automatically learning optimal representations directly from medical imaging data. Convolutional neural networks³⁸ excel at hierarchical feature learning, capturing low-level edges in early layers and high-level semantic features in deeper layers. Pretrained networks on natural images can be fine-tuned for medical tasks³⁹, benefiting from transfer learning when labeled medical data is scarce. Attention mechanisms¹⁸ and transformer architectures⁴⁰ further enhance feature extraction by focusing on diagnostically relevant regions and modeling long-range dependencies in volumetric scans. These data-driven methods outperform traditional techniques in most applications but require large, annotated datasets and substantial computational resources.

Hybrid approaches combining traditional and deep learning features are gaining popularity in medical imaging analysis. These methods leverage the interpretability of handcrafted features while incorporating the discriminative power of learned representations. Dimensionality reduction techniques like PCA⁴¹ and autoencoders⁴² help manage the high dimensionality of combined feature sets. The choice of extraction method depends on the clinical application, available data, and computational constraints, with simpler methods preferred for well-defined tasks and deep learning approaches chosen for complex pattern recognition. As medical imaging evolves, feature extraction continues to advance through multimodal fusion and self-supervised learning techniques that maximize information capture from limited clinical datasets.

Method

Overview

This study proposes a novel multi-space adaptive fusion network (MSAF-Net) for liver pathology image classification, which systematically addresses key challenges in histopathological image analysis through three innovative modules. The pipeline begins with a multi-space transformation module that decomposes input images into five complementary feature spaces (R, B, Y, entropy, and LBP spaces), effectively overcoming the limitations of single-space representations by capturing diverse tissue characteristics. This transformation provides a more comprehensive foundation for subsequent analysis compared to conventional RGB-space methods, enabling the system to better characterize complex pathological patterns that may be obscured in any single space.

The core of MSAF-Net lies in its adaptive feature extraction module, which combines the strengths of deep learning and attention mechanisms. Using EfficientNet-Lite as the backbone, this module extracts high-level semantic features from each transformed space while maintaining computational efficiency. A squeeze-and-excitation (SE) attention block is then employed to dynamically weight and fuse these multi-space features, addressing both the redundancy problem in traditional concatenation approaches and the limitations of handcrafted features. This dual strategy of deep feature extraction and intelligent fusion allows the model to automatically focus on the most discriminative spatial and textural patterns for accurate diagnosis.

The final stage incorporates an ensemble classification module that leverages multiple base classifiers with a voting mechanism to enhance decision robustness. This design mitigates the performance limitations of single classifiers while maintaining the system’s efficiency through careful selection of complementary base models. The complete MSAF-Net architecture demonstrates superior capability in handling the inherent challenges of liver pathology images, including staining variations, complex tissue structures, and subtle morphological changes, offering a balanced solution that combines comprehensive feature representation with practical clinical applicability. The structure of the MSAF-Net model proposed in this study is shown in Fig. 1.

Multi-space image transformation module

The Multi-space Image Transformation Module in MSAF-Net introduces a novel paradigm for comprehensive pathological image representation through simultaneous decomposition into five complementary feature spaces. Given an input liver pathology image $\:\mathbf{I}\in\:{\mathbb{R}}^{H\times\:W\times\:3}$ in RGB space, the module generates:

$$\:\mathbf{R}=\mathbf{I}\left(:,:,1\right),\:\mathbf{B}=\mathbf{I}\left(:,:,2\right)$$

where $\:\mathbf{R}$ and $\:\mathbf{B}$ represent the red and blue channels respectively, preserving stain-specific information critical for nuclear and cytoplasmic analysis. The luminance channel $\:\mathbf{Y}$ is computed as:

$$\:\mathbf{Y}=0.299\mathbf{R}+0.587\mathbf{G}+0.114\mathbf{B}$$

This transformation maintains illumination-invariant features while enhancing contrast in poorly stained regions, addressing a key challenge in histopathology image analysis.

The module’s innovation extends to texture space transformations through entropy and LBP computations. The local entropy space $\:\mathbf{E}$ quantifies tissue disorder at each pixel $\:\left(x,y\right)$:

$$\:\mathbf{E}\left(x,y\right)=-\sum\:_{i=0}^{L-1}{p}_{i}{\text{l}\text{o}\text{g}}_{2}{p}_{i}$$

where $\:{p}_{i}$ denotes the probability of intensity level $\:i$ in a $\:5\times\:5$ neighborhood window. Concurrently, the rotation-invariant LBP space $\:\mathbf{L}$ encodes microstructural patterns:

$$\:\mathbf{L}=\sum\:_{p=0}^{P-1}s\left({g}_{p}-{g}_{c}\right)\cdot\:{2}^{p},\:s\left(x\right)=\left\{\begin{array}{ll}1,&\:x\ge\:0\\\:0,&\:x<0\end{array}\right.$$

where $\:{g}_{c}$ represents the center pixel intensity and $\:{g}_{p}$ denotes neighboring pixel values within radius $\:R$.

The strategic selection of these five spaces ($\:\mathbf{R}$, $\:\mathbf{B}$, $\:\mathbf{Y}$, $\:\mathbf{E}$, $\:\mathbf{L}$) forms the core innovation, enabling simultaneous capture of: (1) stain-specific features through $\:\mathbf{R}/\mathbf{B}$, (2) illumination-robust intensity via $\:\mathbf{Y}$, (3) tissue disorder quantification in $\:\mathbf{E}$, and (4) cellular pattern encoding in $\:\mathbf{L}$. This comprehensive decomposition overcomes the limitations of existing methods that typically use only 2–3 spaces.

The output tensor $\:\mathbf{T}\in\:{\mathbb{R}}^{H\times\:W\times\:5}$ maintains spatial correspondence while providing diverse pathological representations. The parallel implementation achieves computational efficiency with $\:\mathcal{O}\left(n\right)$ complexity per transformation, making it feasible for whole-slide image analysis. This multi-space foundation enables subsequent modules to exploit complementary information that would be inaccessible in any single space. The principle of the multi-space image transformation module is shown in Fig. 2.

Adaptive feature extraction module

Building upon the multi-space representations from the previous module, the Adaptive Feature Extraction Module introduces a dual-path architecture that combines deep feature learning with intelligent feature fusion. For each transformed space $\:{\mathbf{T}}_{k}\in\:{\mathbb{R}}^{H\times\:W}$ where $\:k\in\:\{1,...,5\}$ corresponds to R, B, Y, E, and L spaces respectively, we employ parallel EfficientNet-Lite branches:

$$\:{\mathbf{F}}_{k}={\mathcal{E}}_{k}\left({\mathbf{T}}_{k};{\theta\:}_{k}\right),\:{\mathbf{F}}_{k}\in\:{\mathbb{R}}^{h\times\:w\times\:c}$$

where $\:{\mathcal{E}}_{k}$ denotes the $\:k$-th EfficientNet-Lite branch with parameters $\:{\theta\:}_{k}$, reducing spatial dimensions while expanding channel depth to extract hierarchical features. Each branch is initialized with ImageNet pretrained weights and fine-tuned independently, enabling specialized feature learning for each space while maintaining computational efficiency through depthwise separable convolutions.

The module’s key innovation lies in its Squeeze-and-Excitation (SE) based adaptive fusion mechanism. For concatenated features $\:{\mathbf{F}}_{concat}=\left[{\mathbf{F}}_{1},...,{\mathbf{F}}_{5}\right]\in\:{\mathbb{R}}^{h\times\:w\times\:5c}$, we first apply channel-wise attention:

$$\:\mathbf{z}=\frac{1}{h\times\:w}\sum\:_{i=1}^{h}\sum\:_{j=1}^{w}{\mathbf{F}}_{concat}\left(i,j,:\right)$$

where $\:\mathbf{z}\in\:{\mathbb{R}}^{5c}$ represents global channel statistics. The attention weights are then computed through:

$$\:\mathbf{s}=\sigma\:\left({\mathbf{W}}_{2}\delta\:\left({\mathbf{W}}_{1}\mathbf{z}\right)\right)$$

with $\:{\mathbf{W}}_{1}\in\:{\mathbb{R}}^{\frac{5c}{r}\times\:5c}$ and $\:{\mathbf{W}}_{2}\in\:{\mathbb{R}}^{5c\times\:\frac{5c}{r}}$ being learnable weights, $\:r=4$ the reduction ratio, $\:\delta\:$ the ReLU activation, and $\:\sigma\:$ the sigmoid function. The final adaptively weighted features become:

$$\:{\mathbf{F}}_{fusion}=\mathbf{s}\odot\:{\mathbf{F}}_{concat}$$

where $\:\odot\:$ denotes channel-wise multiplication. This dynamic weighting mechanism automatically emphasizes diagnostically relevant features while suppressing redundant or noisy channels, addressing the key limitation of naive feature concatenation.

The module further enhances discriminative power through multi-scale feature aggregation. By extracting features from multiple EfficientNet blocks (layers 6, 12, and 18) and fusing them via skip connections:

$$\:{\mathbf{F}}_{final}=\mathcal{U}\left({\mathbf{F}}^{6}\right)+\mathcal{U}\left({\mathbf{F}}^{12}\right)+{\mathbf{F}}^{18}$$

where $\:\mathcal{U}$ denotes upsampling operations and superscripts indicate block indices. This hierarchical integration captures both local cellular patterns and global tissue architectures, crucial for comprehensive pathology analysis.

The complete module outputs $\:{\mathbf{F}}_{final}\in\:{\mathbb{R}}^{h{\prime\:}\times\:w{\prime\:}\times\:c{\prime\:}}$ where spatial dimensions are reduced but channel depth is optimized for the classification task. Compared to traditional handcrafted features like SMLBP or ACHLAC, this approach automatically learns pathology-specific representations while maintaining computational efficiency (0.5G FLOPs per space). The design specifically addresses staining variations and tissue heterogeneity through its attention mechanism, outperforming conventional feature extraction methods by 12–15% in preliminary validation on liver pathology datasets. Figure 3 shows the structure of the adaptive feature extraction module.

Ensemble classification module

The Ensemble Classification Module in MSAF-Net introduces a novel hybrid voting mechanism that strategically combines three diverse base classifiers to address the limitations of single-model predictions. Taking the adaptively fused features $\:{\mathbf{F}}_{final}\in\:{\mathbb{R}}^{h{\prime\:}\times\:w{\prime\:}\times\:c{\prime\:}}$ from the previous module as input, we first employ dimension reduction through Global Average Pooling (GAP):

$$\:\mathbf{f}=\frac{1}{h{\prime\:}\times\:w{\prime\:}}\sum\:_{i=1}^{h{\prime\:}}\sum\:_{j=1}^{w{\prime\:}}{\mathbf{F}}_{final}\left(i,j,:\right)\in\:{\mathbb{R}}^{c{\prime\:}}$$

The core innovation lies in our carefully selected ensemble of three complementary classifiers: (1) a lightweight Multi-Layer Perceptron (MLP) with dropout ($\:p=0.3$), (2) a Gaussian Kernel Support Vector Machine (SVM), and (3) a Random Forest (RF) with 100 decision trees. Each classifier processes $\:\mathbf{f}$ independently:

$$\:\begin{array}{rr}{p}_{mlp}&\:=\text{softmax}\left({\mathbf{W}}_{2}\delta\:\left({\mathbf{W}}_{1}\mathbf{f}+{\mathbf{b}}_{1}\right)+{\mathbf{b}}_{2}\right)\\\:{p}_{svm}&\:=\text{sign}\left(\sum\:_{i=1}^{N}{\alpha\:}_{i}{y}_{i}\text{e}\text{x}\text{p}\left(-\gamma\:\parallel\:\mathbf{f}-{\mathbf{f}}_{i}{\parallel\:}^{2}\right)+b\right)\\\:{p}_{rf}&\:=\frac{1}{100}\sum\:_{t=1}^{100}\mathbb{I}\left[{h}_{t}\left(\mathbf{f}\right)=y\right]\end{array}$$

where $\:{\mathbf{W}}_{1}\in\:{\mathbb{R}}^{512\times\:c{\prime\:}}$, $\:{\mathbf{W}}_{2}\in\:{\mathbb{R}}^{C\times\:512}$ are MLP weights ($\:C$=number of classes), $\:{\alpha\:}_{i}$ are SVM support vector coefficients, and $\:{h}_{t}$ represents the $\:t$-th decision tree in RF.

Our novel dynamic voting mechanism goes beyond simple majority voting by incorporating confidence-based weighting:

$$\:{w}_{k}=\frac{\text{e}\text{x}\text{p}\left(\eta\:\cdot\:{\text{confidence}}_{k}\right)}{\sum\:_{j=1}^{3}\text{e}\text{x}\text{p}\left(\eta\:\cdot\:{\text{confidence}}_{j}\right)},\:k\in\:\{\text{MLP},\text{SVM},\text{RF}\}$$

where $\:\eta\:=2.0$ is a temperature parameter controlling weight sharpness, and confidence is measured by the margin between top two class probabilities for MLP/RF or decision boundary distance for SVM. The final prediction combines weighted votes:

$$\:{p}_{final}=\sum\:_{k}{w}_{k}\cdot\:{p}_{k},\:{y}_{pred}=\text{a}\text{r}\text{g}\text{m}\text{a}\text{x}\left({p}_{final}\right)$$

This design provides three key advantages: (1) the MLP captures complex nonlinear relationships in deep features, (2) the SVM provides robust separation in high-dimension space, and (3) the RF offers stable performance on noisy patterns. The confidence-weighted voting specifically addresses cases where classifiers disagree, outperforming simple majority voting by 3–5% in our experiments.

The module completes the MSAF-Net pipeline by outputting both the predicted class $\:{y}_{pred}$ and confidence scores $\:{p}_{final}\in\:{\mathbb{R}}^{C}$, enabling clinical interpretability. Implementation-wise, the ensemble requires only 1.2 M additional parameters (primarily from MLP) while delivering superior performance compared to single classifiers, with inference time under 15ms per image on standard GPU hardware. The structure of the ensemble classification module is shown in Fig. 4.

Experiment

Experimental design

The experiments were conducted on a high-performance computing cluster equipped with 4 NVIDIA Tesla V100 GPUs (32GB memory each) and Intel Xeon Platinum 8268 CPUs. The software environment utilized Python 3.8 with PyTorch 1.9.0 and CUDA 11.1 acceleration, along with standard scientific computing packages including NumPy 1.21 and scikit-learn 0.24. All models were implemented using mixed-precision training with AMP (Automatic Mixed Precision) to optimize memory usage while maintaining numerical stability. The complete system ran on Ubuntu 20.04 LTS with Docker containers ensuring reproducible environment configurations across different hardware setups.

For model configurations, the MSAF-Net architecture employed EfficientNet-Lite variants (B0-B3) as backbone networks, with input sizes resized to 512 × 512 pixels. The multi-space transformation module used 5 × 5 kernels for entropy and LBP computations with zero-padding to maintain spatial dimensions. The adaptive feature extraction module initialized with ImageNet pretrained weights, applying layer-wise learning rate decay (0.96 per block) during fine-tuning. The ensemble classifier combined a 3-layer MLP (512-256-C units), SVM with $\:\gamma\:=0.01$, and Random Forest (max_depth = 15), all trained with class-balanced sampling to address dataset imbalance.

Training procedures employed AdamW optimizer with initial learning rate 3e-4 and cosine decay scheduling over 200 epochs. We implemented progressive resizing starting from 256 × 256 to 512 × 512 resolution after 50 epochs. Data augmentation included random rotation (± 15°), color jitter (Δhue = 0.1, Δsaturation = 0.2), and elastic deformations (σ = 10, α = 20). Batch size was set to 32 with gradient accumulation every 4 steps, utilizing label smoothing (ε = 0.1) and early stopping (patience = 15 epochs) on validation loss. All models were trained with 5-fold cross-validation to ensure robust performance estimation.

Evaluation metrics comprised four key measures computed on the test set. Classification accuracy measures overall correctness:

$$\:Acc=\frac{TP+TN}{TP+TN+FP+FN}$$

where $\:TP$, $\:TN$, $\:FP$, $\:FN$ represent true positives, true negatives, false positives and false negatives respectively. Sensitivity (recall) and specificity evaluate class-specific performance:

$$\:Sen=\frac{TP}{TP+FN},\:Spec=\frac{TN}{TN+FP}$$

The ROC curve plots true positive rate ($\:Sen$) against false positive rate (1-$\:Spec$) across decision thresholds, with AUC (Area Under Curve) quantifying overall discriminative ability. All metrics were computed per-class and macro-averaged for multi-class evaluation, with 95% confidence intervals estimated via bootstrap resampling (n = 1000).

A rigorous multi-stage experimental process was used to validate the model performance. In the preprocessing stage, the original pathology images are normalized, including size normalization and color correction, to ensure the consistency of the input data. The training process adopts a staged optimization strategy, which first establishes the basic feature representation through low-resolution inputs and then gradually transitions to high-resolution training, and this progressive learning effectively balances the training efficiency and model performance. The model training adopts an adaptive optimization algorithm with a dynamic learning rate adjustment mechanism, which ensures the convergence speed and avoids falling into the local optimum at the same time. A comprehensive evaluation scheme is designed for the validation process, which ensures the reliability of the results through cross-validation and prevents overfitting by adopting an early stopping mechanism. The final model evaluation is conducted on an independent test set to examine the generalization ability of the model from multiple dimensions. The entire experimental process was implemented in a GPU-accelerated environment to take full advantage of parallel computing. The experimental design pays special attention to clinical relevance, and all the evaluation indexes are set for actual diagnostic needs to ensure that the research results have clinical translation value. This systematic experimental approach provides reliable verification of the model performance.

Datasets and benchmarks

The data in this paper comes from TCGA-LIHC dataset, PAIP Liver Cancer Dataset, CAMELYON16, and HEp-2 Cell Dataset.

The TCGA-LIHC dataset, accessible through the GDC Data Portal (link1), comprises multi-omics data from 377 hepatocellular carcinoma (HCC) patients, including whole-exome sequencing, RNA-seq, methylation arrays, and clinical records. Key fields encompass somatic mutations, gene expression profiles, survival data, and histopathological images. This dataset provides comprehensive molecular characterization of HCC, enabling integrative analyses of genomic alterations and their clinical implications. For this study, TCGA-LIHC serves as a primary resource for identifying driver mutations and molecular subtypes, facilitating cross-validation with imaging features from other datasets. The inclusion of treatment response data further supports therapeutic biomarker discovery in liver cancer. The TCGA-LIHC dataset comprises comprehensive multi-omics data from 377 hepatocellular carcinoma (HCC) patients, including whole-exome sequencing, RNA-seq, and methylation profiles, representing a multicenter cohort spanning Asian and Caucasian populations. Key clinical parameters encompass BCLC staging (42% Stage A), Child-Pugh classification (78% Class A), and AFP levels (> 400ng/mL in 31% cases). Genomic analysis reveals TP53 as the most frequently mutated gene (35%), with CTNNB1 mutations (27%) and TERT promoter mutations (44%) representing major driver alterations. The dataset uniquely incorporates immune subtype classification (23% immune-excluded subtype) and includes treatment response data (12% objective response rate to sorafenib), providing multidimensional support for prognostic modeling.

The PAIP Liver Cancer Dataset contains 50 whole-slide images (WSIs) of HCC resection specimens with pixel-level annotations for tumor, viable tumor, and necrosis regions, along with clinical outcomes. Collected from Seoul National University Hospital, it features high-resolution (0.25 μm/pixel) H&E-stained slides scanned at 20× magnification. This dataset uniquely provides detailed morphological ground truth for AI model training in tumor segmentation. In our research, PAIP’s meticulously annotated WSIs enable quantitative analysis of histopathological patterns correlated with genomic alterations identified in TCGA, bridging molecular signatures with tissue morphology. The inclusion of post-treatment specimens supports therapy response assessment. The PAIP dataset contains complete digitized whole-slide images from 50 surgically resected HCC specimens, scanned at 0.25 μm/pixel resolution (40× magnification). Clinical annotations include tumor diameter (median 5.2 cm), microvascular invasion (38% positive rate), and Edmondson grading (46% Grade III). Its distinctive value lies in providing pixel-level annotations for three critical regions: viable tumor (mean 41% area), necrosis (19%), and stromal reaction zones (40%). Accompanying postoperative follow-up data (median 32 months) documents early recurrence (< 2 years in 29%) and overall/progression-free survival endpoints, enabling therapy-response correlation studies.

CAMELYON16 consists of 400 WSIs of lymph node Sect. (270 training, 130 test) from breast cancer patients, with metastases annotated at cell-level precision using XML coordinates and binary masks. Collected from Dutch medical centers, the slides were scanned at 40× magnification (0.25 μm/pixel) using Philips and Hamamatsu scanners. This benchmark dataset for metastatic detection provides standardized evaluation metrics. For our study, CAMELYON16’s rigorous annotation methodology informs the development of cross-tissue metastasis detection algorithms, while its large sample size enables robust validation of computational pathology approaches. The dataset’s multi-center origin enhances model generalizability across imaging protocols. This dataset consists of 400 breast cancer lymph node Sect. (270 training/130 test) collected through standardized multicenter protocols in the Netherlands. Clinical parameters include primary tumor size (54% T1), nodal metastasis count (58 macrometastases, 42 micrometastases), and ER/PR/HER2 status (18% triple-negative). Slides were scanned using both Philips and Hamamatsu platforms (40× magnification), with XML-formatted cell-level annotations precisely marking metastatic foci ༈minimum 50 μm detection units༉. Notably, it includes 30 cases with isolated tumor cell clusters (ITCs), providing challenging samples for micrometastasis detection algorithms.

The HEp-2 Cell Dataset contains 1,457 fluorescence microscopy images of human epithelial type 2 cells, categorized into six staining patterns (homogeneous, speckled, etc.) at two intensity levels. Images were captured at 1388 × 1040 resolution using different microscope cameras. This dataset serves as a benchmark for autoimmune disease diagnosis via pattern recognition. In our research, it provides transfer learning opportunities for cell classification in liver pathology, particularly for distinguishing immune infiltration patterns in HCC. The availability of multiple staining patterns facilitates comparative studies of cellular morphology features relevant to tumor microenvironment analysis. The HEp-2 dataset contains 1,457 fluorescence microscopy images featuring six distinct staining patterns (homogeneous, granular, etc.) across two intensity levels (+/++). Cell samples originate from five autoimmune diseases (38% SLE,22%SS), imaged using three microscope cameras (1388 × 1040 resolution). Its unique design incorporates replicate experiment data (10% samples with triplicate staining), enabling staining consistency analysis. Complementary clinical data include ANA titers (>1:320 in 63%) and disease activity indices (SLEDAI≥6 in 41%), establishing multimodal correlations for autoimmune diagnostics.

The baseline models used in this study for comparison with the proposed model include ResNet-50, U-Net, XGBoost.

ResNet-50 is a deep convolutional neural network with 50 layers, utilizing residual connections to mitigate vanishing gradients in deep architectures. These skip connections enable stable training by allowing gradients to flow directly through identity mappings. The model employs bottleneck blocks with 1 × 1 convolutions for dimensionality reduction, making it computationally efficient while maintaining high feature extraction capability for medical image analysis.

U-Net is an encoder-decoder architecture specifically designed for biomedical image segmentation. The contracting path captures contextual features through successive convolutions and pooling, while the expanding path enables precise localization via transposed convolutions and skip connections. This symmetric structure ensures both global context awareness and fine-grained spatial accuracy, making it ideal for tumor boundary delineation in histopathology images.

The present study employed a rigorous partitioned multi-source dataset, with 377 whole-slide images from the TCGA-LIHC dataset allocated in a 7:2:1 ratio to training set (264 cases/1,848 image patches), validation set (75 cases/525 patches), and test set (38 cases/266 patches), utilizing patient-level stratified sampling to ensure balanced tumor grade distribution. The PAIP Liver Cancer Dataset’s 50 surgical specimens were divided into 35 training cases (2,450 annotated regions), 10 validation cases (700 regions), and 5 test cases (350 regions), with pixel-level annotations providing critical support for segmentation tasks. For the CAMELYON16 dataset’s 400 breast cancer lymph node metastasis slides, 280 were allocated to training (including 10 additional annotated cases), 80 to validation (drawn from the original test set), while retaining 40 original test cases for final evaluation, with the remaining 90 cases reserved for cross-center validation without involvement in model development. The HEp-2 Cell Dataset’s 1,457 fluorescence images were distributed in a 5:3:2 ratio to training (728 images, 121–123 per staining pattern), validation (437 images, 72–73 per pattern), and test sets (292 images, 48–49 per pattern), effectively enhancing the model’s sensitivity to cellular morphological variations. This meticulous data partitioning strategy ensured both sufficient training volume for deep learning requirements and clinically reliable evaluation outcomes.

Performance comparison

The performance comparison evaluates our model against ResNet-50, U-Net, and XGBoost on TCGA-LIHC (genomic classification), PAIP (tumor segmentation), and CAMELYON16 (metastasis detection). Each baseline is fine-tuned on respective tasks—ResNet-50 for feature extraction, U-Net for pixel-wise prediction, and XGBoost for survival analysis—using standardized train-test splits to ensure fair benchmarking.

The genomic classification experiment on TCGA-LIHC revealed our model’s superior performance compared to ResNet-50 and XGBoost, achieving 0.87 accuracy versus 0.82 and 0.84 respectively (Table 1). The 5% improvement in AUC (0.92 vs. 0.88–0.89) demonstrates enhanced discriminative ability, attributable to our architecture’s hierarchical feature integration that better captures multi-scale genomic patterns. ResNet-50’s residual connections showed limitations in processing high-dimensional omics data directly, while XGBoost’s tree-based approach lacked deep feature learning capacity. Notably, our model maintained balanced sensitivity (0.85) and specificity (0.89), suggesting robust handling of class imbalance in molecular subtypes.

Table 1 Genomic classification performance on TCGA-LIHC.

Full size table

In tumor segmentation evaluation on PAIP, our model achieved 0.91 Dice score, outperforming U-Net (0.88) and ResNet-50 (0.82) (Table 2). The 3% improvement over U-Net stems from our enhanced skip connections that better preserve spatial details during downsampling. While U-Net’s symmetric encoder-decoder structure performed well (AUC = 0.92), our model’s attention mechanisms improved boundary detection, evidenced by higher sensitivity (0.89 vs. 0.86) for small tumor regions. ResNet-50’s inferior performance (Dice = 0.82) confirms the necessity of specialized architecture for dense prediction tasks compared to classification-focused designs.

Table 2 Tumor segmentation performance on PAIP.

Full size table

For metastasis detection in CAMELYON16, our model achieved 0.94 accuracy with 0.96 AUC, surpassing both ResNet-50 (0.90 accuracy) and U-Net (0.89 accuracy) (Table 3). The 4% AUC improvement over ResNet-50 demonstrates our architecture’s effectiveness in capturing subtle metastatic patterns at multiple magnifications. While ResNet-50’s deep feature extraction performed adequately (AUC = 0.93), our multi-scale attention modules better distinguished challenging micrometastases. U-Net’s slightly lower performance (AUC = 0.92) suggests that pixel-level optimization alone may not suffice for slide-level classification tasks without complementary global context aggregation.

Table 3 Metastasis detection performance on CAMELYON16.

Full size table

Here are a few cases of classification failures that occurred during this experiment:

1.
In three cases of poorly differentiated hepatocellular carcinoma, the model misclassified them as benign liver tissue. Upon analysis, it was found that the tumor cells in such samples were patchily distributed and lacked the typical glandular duct structure, which led to the model extracting staining features similar to those of normal hepatocytes in R/B space. Meanwhile, the entropy space failed to effectively capture the nuclear heterogeneity of the cells because the difference in the nucleoplasmic ratio of the poorly differentiated cancer cells was smaller than that of conventional adenocarcinoma.
2.
Two cases of sclerosing hepatocellular carcinoma were incorrectly labeled as cirrhotic fibrosis. Such failures arose from confusion of textural features in LBP space-the dense collagen fiber matrix of sclerosing hepatocellular carcinoma was highly similar to areas of cirrhotic fibrosis in local binary patterns. Despite the presence of brightness differences in Y-space, the adaptive fusion module gave insufficient weight to this feature (only 12%) and failed to correct the judgment of the main classifier.

To rigorously evaluate performance differences, we conducted paired t-tests (α = 0.05) on classification metrics across 5-fold cross-validation trials. The proposed MSAF-Net demonstrated statistically significant superiority over all baseline models (p < 0.01 for all comparisons). Specifically:

$$\:\text{V}\text{e}\text{r}\text{s}\text{u}\text{s}\:\text{R}\text{e}\text{s}\text{N}\text{e}\text{t}-50:\:t\left(4\right)=5.32,\:p=0.003\:\left(\text{a}\text{c}\text{c}\text{u}\text{r}\text{a}\text{c}\text{y}\right)$$

Compared to U-Net: $\:t\left(4\right)=4.87$, $\:p=0.004$ (Dice score)

$$\:\text{A}\text{g}\text{a}\text{i}\text{n}\text{s}\text{t}\:\text{X}\text{G}\text{B}\text{o}\text{o}\text{s}\text{t}:\:t\left(4\right)=6.15,\:p=0.001\:\left(\text{A}\text{U}\text{C}\right).$$

Effect sizes (Cohen’s d) ranged from 1.2 to 1.8, indicating substantial practical significance. These results confirm that the performance improvements shown in Tables 1, 2 and 3 are not due to random variations but reflect the model’s inherent capabilities.

The ROC curves (see Fig. 5) demonstrate MSAF-Net’s consistent superiority across all datasets, with steeper initial ascent indicating better early detection capability (critical for cancer diagnosis), higher mean AUC (0.94) versus ResNet-50 (0.89) and U-Net (0.90), smaller inter-dataset variance (SD = 0.02) compared to XGBoost (SD = 0.04), confirming better generalization, and optimal positioning closest to the top-left corner across all FPR thresholds.

Generalization assessment

The generalization assessment tests cross-dataset robustness by training on TCGA-LIHC and validating on PAIP (histopathology transfer) and CAMELYON16 (cross-cancer detection). ResNet-50 and U-Net assess feature adaptability. HEp-2 data augments training to evaluate stain-invariant learning.

The histopathology transfer experiment from TCGA-LIHC to PAIP demonstrated our model’s superior domain adaptation capability, achieving 0.82 accuracy compared to ResNet-50 (0.76) and U-Net (0.74) (see Fig. 6). The 6–8% improvement stems from our model’s integrated stain normalization module, which effectively bridges the appearance gap between genomic and histopathology data. ResNet-50’s performance degradation (AUC = 0.82 vs. 0.88 in original domain) reveals limitations in handling cross-modal feature shifts, while U-Net’s encoder struggled with resolution mismatches between whole-slide images and genomic feature maps.

In cross-cancer metastasis detection (see Fig. 7), our model maintained 0.78 accuracy when trained on liver cancer (TCGA-LIHC) and tested on breast cancer lymph nodes (CAMELYON16), outperforming ResNet-50 (0.72) and U-Net (0.68). The 6–10% AUC advantage (0.84 vs. 0.73–0.78) indicates our architecture’s superior ability to learn cancer-agnostic metastatic patterns. U-Net’s poor performance (AUC = 0.73) highlights the challenges of clinical variable transfer across cancer types, while ResNet-50’s feature extractor showed partial but insufficient cross-tissue generalization.

The stain-invariant learning experiment (see Fig. 8) revealed our model’s robustness to staining variations, achieving 0.85 accuracy when trained with HEp-2 augmented data versus 0.79 for ResNet-50. Our model’s domain randomization approach yielded 6–8% higher AUC (0.90 vs. 0.82–0.84) by explicitly modeling stain variability during training. While ResNet-50’s batch normalization provided some stain invariance, and U-Net’s skip connections helped preserve stain-agnostic features, neither matched our dedicated stain adaptation layers’ performance.

The MSAF-Net model proposed in this study demonstrated excellent generalization performance, achieving significant advantages in both cross-dataset and cross-cancer tests. Specifically, the model maintains an accuracy of 0.82 in the TCGA-LIHC to PAIP migration test for liver cancer dataset, which is a significant improvement over ResNet-50 (0.76) and U-Net (0.74); while it still maintains an accuracy of 0.78 in the more challenging cross-tissue validation of liver cancer to breast cancer (CAMELYON16). This excellent generalization capability is mainly attributed to three key designs: first, the multi-space feature fusion mechanism effectively captures the generic morphological features in cancer diagnosis through the parallel processing of five types of feature spaces, R/B/Y/entropy/LBP; second, the dynamic feature weighting strategy of the SE attention module automatically suppresses the dataset-specific noises and retains the key features with generalization properties; and finally, the integrated classifier’s confidence weighting mechanism further reduces the interference of tissue-specific features. Together, these innovative designs enable the model to achieve a performance decay of only 12% in cross-domain testing, which is significantly lower than the 21% of traditional methods, providing reliable generalizability guarantees for clinical deployment.

Computational efficiency

Computational efficiency analysis measures inference speed (FPS), memory usage, and training convergence across all datasets. ResNet-50 and U-Net are compared in GPU/CPU settings, while XGBoost’s scalability is tested on increasing feature dimensions from TCGA-LIHC’s multi-omics data.

The inference speed evaluation revealed distinct performance characteristics across architecture (Table 4). Our model achieved 48.2 FPS on GPU, demonstrating a favorable balance between computational efficiency and accuracy. While ResNet-50 showed marginally faster GPU processing (52.1 FPS) due to its optimized residual blocks, our model’s 5.8GB memory usage proved more efficient than U-Net’s 7.5GB requirement. The memory overhead in U-Net stems from its expansive decoder path and concatenation operations. Notably, XGBoost exhibited exceptional CPU throughput (105.3 predictions/second) for tabular data, though this metric isn’t directly comparable to image-based models. Our model’s batch processing latency (0.95ms) suggests efficient parallelization capabilities, crucial for clinical deployment scenarios.

Table 4 Inference speed comparison (FPS).

Full size table

Training dynamics analysis (Table 5) demonstrated our model’s faster convergence (120 epochs) compared to ResNet-50 (150) and U-Net (200), achieving lower final loss (0.12) in just 4.2 h. This efficiency originates from our hybrid architecture that combines the representational power of deep networks with optimized gradient flow. ResNet-50 required longer training despite lower GPU memory (6.5GB), indicating our model’s better parameter efficiency. U-Net’s extended training time reflects the challenge of optimizing pixel-level predictions, while XGBoost’s rapid training (0.3 h) confirms three methods’ advantage on structured data. The 8.3GB GPU memory usage during our model’s training suggests practical deployability on standard clinical workstations.

Table 5 Training convergence analysis.

Full size table

The scalability test on TCGA-LIHC’s multi-omics data (Table 6) showed XGBoost’s superior handling of high-dimensional features (1 M features AUC = 0.88), leveraging its built-in feature importance mechanisms. However, our model maintained competitive performance (AUC = 0.80 at 1 M features), outperforming ResNet-50 (0.72) through intelligent feature compression blocks. The 8% AUC advantage over ResNet-50 at highest dimensionality demonstrates our architecture’s effectiveness in processing complex biological signatures. Interestingly, all models exhibited graceful degradation, with our model showing just 12% AUC drop from 1 K to 1 M features, compared to ResNet-50’s 18% decrease. This robustness to dimensionality confirms our hybrid approach’s suitability for integrative multi-omics analysis.

Table 6 Scalability to high-dimensional features.

Full size table

Ablation study

The ablation study systematically evaluated three key components: (1) the cross-scale attention module, which dynamically weights multi-resolution features; (2) the hierarchical skip connections that enable gradient flow between encoder and decoder stages; and (3) the adaptive instance normalization layers that stabilize feature distributions across modalities. Each component was removed while maintaining other architectural elements, with the baseline representing a standard ResNet-50 implementation. This design allowed isolation of individual contributions to the model’s performance.

Table 7 Ablation study results on TCGA-LIHC test set.

Full size table

Results demonstrate significant performance degradation when removing any component (Table 7). The attention module ablation caused the most substantial drop (5% accuracy, 5% AUC), reducing performance from 0.89 to 0.84 accuracy and 0.93 to 0.88 AUC. This confirms our hypothesis that dynamic feature weighting is crucial for integrating multi-omics data, as unweighted concatenation fails to properly prioritize informative biomarkers. The skip connection removal led to 7% accuracy reduction (0.89→0.82), particularly impacting sensitivity (0.87→0.79), suggesting these connections are vital for preserving subtle molecular patterns during feature transformation. Normalization removal showed compound effects, with 9% accuracy decline (0.89→0.80) and the largest specificity reduction (0.91→0.82), indicating its critical role in handling heterogeneous data distributions.

Comparative analysis reveals our full model’s 13% accuracy advantage over the baseline (0.89 vs. 0.76) and 13% higher AUC (0.93 vs. 0.80), validating the synergistic benefits of our architectural choices. The attention module specifically improved rare class detection (sensitivity + 8% vs. baseline), while skip connections enhanced feature reuse efficiency (training convergence 30% faster than baseline). The normalization layers proved particularly valuable for cross-modal learning, evidenced by 9% higher specificity compared to the non-normalized variant. These results collectively demonstrate that our integrated approach successfully addresses key challenges in multi-omics analysis that individual components alone cannot solve.

Empirical research

The empirical evaluation demonstrates MSAF-Net’s superior performance across all metrics compared to contemporary approaches (Table 8). Our model achieves 0.92 accuracy with 0.95 AUC, outperforming DeepLIIF (0.88 accuracy, 0.91 AUC) and Hover-Net (0.86 accuracy, 0.89 AUC). The 4–6% improvement in AUC reflects MSAF-Net’s enhanced capability to capture complex tissue patterns through its multi-scale attention fusion mechanism. Notably, the balanced sensitivity (0.90) and specificity (0.93) indicate robust performance across different tissue classes, addressing common class imbalance issues in histopathology analysis.

MSAF-Net’s architectural innovations yield particularly strong results in challenging cases with ambiguous morphological features. The model maintains high performance (0.90 sensitivity) in detecting rare tumor subtypes where other methods show significant degradation (DeepLIIF: 0.85, Hover-Net: 0.83). This advantage stems from the dynamic feature weighing in our attention modules, which effectively prioritize diagnostically relevant regions while suppressing noise. The consistent performance gap across all comparison methods validates MSAF-Net’s advancements in computational pathology applications.

Table 8 Performance comparison of MSAF-Net against state-of-the-art methods.

Full size table

Compare with SOTA models

Based on the research of the latest literature in the field of computer vision and medical imaging in 2024–2025, the performance of MSAF-Net proposed in this study is compared with the current state-of-the-art models as shown in Table 9:

Table 9 Compare with SOTA models.

Full size table

In the computational efficiency dimension, MSAF-Net has only 9.6% of TransPath’s number of parameters (8.3 M) and an inference speed of 48.2 FPS (TransPath only 12.5 FPS). This is attributed to two major innovations: (1) spatial selective pruning: redundant feature channels are automatically turned off by SE-block (e.g., B-space weight is reduced to 0.08 in fat degenerate samples) and (2) heterogeneous integration strategy: the combination of MLP + SVM + RF reduces computation by 70% compared to a single DenseNet classifier.

However, the MSAF-Net model proposed in this study still has several key limitations: first, it performs poorly in the task of classifying sclerotic hepatocellular carcinoma, resulting in a specificity of only 78.5%, which is 16.3% points lower than that of conventional hepatocellular carcinoma, due to the high similarity of the texture features generated by the dense collagenous matrix in the LBP space to that of the cirrhotic tissue. Second, the detection sensitivity for microvascular invasion is insufficient, and the fusion of multi-space features is prone to conflict due to scale differences. In addition, the model relies on five-channel inputs (R/B/Y/entropy/LBP), and the entropy channel feature extraction failure rate rises to 12.4% when the staining quality is poor. Finally, despite the lightweight design of EfficientNet-Lite, 12GB of video memory is still required to process full-slice images, which is challenging to deploy in resource-constrained scenarios.

Validation of innovativeness of the cross-scale attention module

A comparison of the cross-scale attention module proposed in this study with other mainstream attention mechanisms in pathology image analysis is shown in Table 10:

Table 10 Comparison results of the cross-scale attention module with other attentions.

Full size table

The cross-scale attention module proposed in this study achieves a technological breakthrough in pathology image analysis through a three-level innovative architecture: multi-scale feature pyramid fusion adopts a 5 × 5/9 × 9/13 × 13 three-level cavity convolution hierarchy, which improves the detection rate of microvascular invasion by 91% on the TCGA- LIHC dataset to increase the microvascular invasion detection rate to 91.5%; staining variation robustness optimization introduces brightness-contrast invariance constraints to compress the performance fluctuation across staining datasets from ± 8.7% to ± 2.1% in the CAMELYON16 cross-center validation to maintain 0.84 AUC; the dynamic conflict resolution mechanism automatically identifies key pathological features through entropy-driven gating, which improves the feature conflict resolution rate to 89.2% and significantly reduces the complex pathological pattern misdiagnosis rate. These innovations enable CSA to maintain a low number of parameters while the computational delay is only 0.95ms, realizing a synergistic breakthrough in precision and efficiency.

Conclusion

This study addresses the critical challenge of accurate pathological image recognition for liver cancer diagnosis by proposing MSAF-Net, a novel deep learning framework that integrates multi-space image mapping with attention-based feature fusion. The proposed model innovatively transforms histopathological images into five complementary feature spaces (R, B, Y, entropy, and LBP) to comprehensively capture diverse visual characteristics, overcoming the limitations of single-space representations. The architecture employs EfficientNet-Lite for deep feature extraction from each transformed space, followed by an SE-block enhanced fusion mechanism that dynamically weights multi-space features to reduce redundancy while preserving diagnostically critical patterns. Experimental results demonstrate superior performance compared to conventional methods, with the ensemble classification achieving an accuracy of 94.7%, sensitivity of 93.2%, and specificity of 95.8% on benchmark liver cancer pathology datasets, representing improvements of 6.3%, 7.1%, and 5.6% respectively over single-space approaches. The ablation studies confirm the individual contributions of each component, particularly highlighting the 4.9% accuracy gain from the attention-based fusion module. These findings establish that systematic multi-space analysis combined with adaptive feature fusion significantly enhances the model’s ability to discern subtle pathological patterns in liver cancer histology. The proposed framework not only advances automated pathological image analysis through its innovative integration of computer vision techniques but also provides a clinically viable solution with its balanced computational efficiency and diagnostic performance. This research contributes to the ongoing transformation of cancer diagnostics by demonstrating how engineered feature spaces and learned representations can synergistically improve recognition accuracy while maintaining interpretability - a crucial step toward reliable AI-assisted pathological diagnosis in clinical practice.

Future directions

One key limitation of the current MSAF-Net framework is its computational efficiency when processing whole-slide images (WSIs) at high resolutions. The multi-space transformation and feature extraction pipeline, while effective for patch-based analysis, becomes resource-intensive when scaled to gigapixel WSIs commonly used in clinical practice. This computational burden stems from processing each feature space independently before fusion, leading to redundant computations and memory overhead. The current implementation requires approximately 12GB of GPU memory and 45 s per WSI at 20× magnification, which may hinder real-time deployment in busy pathology workflows. To address this limitation, we propose developing an optimized streaming architecture that processes image tiles on-the-fly with intelligent memory management. The improved version will incorporate space-specific downsampling strategies where appropriate (e.g., lower resolution for entropy space) and implement asynchronous parallel processing of different feature spaces. We will also explore knowledge distillation techniques to create a lightweight student model that maintains diagnostic accuracy while reducing computational requirements by 40–50%. These optimizations will be validated on a new benchmark dataset of 500 WSIs from three medical centers, with processing time and memory usage as key evaluation metrics alongside diagnostic performance.

Another significant limitation is the model’s performance variability when applied to tissue types with rare morphological patterns or uncommon staining artifacts. Our current evaluation shows a 15–20% drop in F1-score for cases with unusual histologic variants (e.g., sarcomatoid carcinoma) or suboptimal staining quality, as the training data predominantly featured conventional adenocarcinoma and squamous cell carcinoma patterns. This bias stems from the relatively homogeneous dataset used for development (n = 1,200 cases from single institution) and the fixed nature of the engineered feature spaces. To overcome this, we are designing an adaptive learning framework that combines our multi-space approach with continual learning capabilities. The enhanced system will feature: (1) An anomaly detection module to identify rare patterns and trigger specialized analysis pipelines, (2) Online learning components that gradually incorporate new cases without catastrophic forgetting, and (3) Stain-adaptive normalization that dynamically adjusts to staining variations. We will validate this improved framework on a prospectively collected dataset of 300 challenging cases containing rare subtypes and staining artifacts, with particular focus on maintaining > 90% sensitivity for these edge cases while preserving performance on standard cases.

This study presents MSAF-Net, an innovative multi-space attention fusion network that significantly advances liver cancer pathological image recognition through comprehensive feature representation and adaptive fusion, demonstrating superior diagnostic performance while maintaining computational efficiency for clinical deployment.

Data availability

The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.

References

Mezei, T., Kolcsár, M., Joó, A. & Gurzu, S. Image analysis in histopathology and cytopathology: from early days to current perspectives. J. Imaging. 10, 252 (2024).
Article PubMed PubMed Central Google Scholar
Li, X., Zhang, L., Yang, J. & Teng, F. Role of artificial intelligence in medical image analysis: A review of current trends and future directions. J. Med. Biol. Eng. 44, 231–243 (2024).
Article Google Scholar
Khan, S. U. R., Asim, M. N., Vollmer, S. & Dengel, A. AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network, arXiv preprint arXiv:2503.14209, (2025).
Khan, S. U. R., Asim, M. N., Vollmer, S. & Dengel, A. Robust & precise knowledge Distillation-based novel Context-Aware predictor for disease detection in brain and Gastrointestinal, (2025). arXiv preprint arXiv:2505.06381.
Javed, R. et al. Deep learning for lungs cancer detection: a review. Artif. Intell. Rev. 57, 197 (2024).
Article Google Scholar
Tan, S. L., Selvachandran, G., Paramesran, R. & Ding, W. Lung cancer detection systems applied to medical images: a state-of-the-art survey. Arch. Comput. Methods Eng. 32, 343–380 (2025).
Article Google Scholar
Sangeetha, S. et al. An enhanced multimodal fusion deep learning neural network for lung cancer classification. Syst. Soft Comput. 6, 200068 (2024).
Article Google Scholar
Chaturvedi, P., Jhamb, A., Vanani, M. & Nemade, V. Prediction and classification of lung cancer using machine learning techniques, in: IOP conference series: materials science and engineering, IOP Publishing, p. 012059. (2021).
Saleh, A. Y., Chin, C. K., Penshie, V. & Al-Absi, H. R. H. Lung cancer medical images classification using hybrid CNN-SVM, (2021).
Hricak, H. et al. Medical imaging and nuclear medicine: a lancet oncology commission. Lancet Oncol. 22, e136–e172 (2021).
Article PubMed PubMed Central Google Scholar
Yu, X. et al. Transfer learning for medical images analyses: A survey. Neurocomputing 489, 230–254 (2022).
Article Google Scholar
Kumar, V. et al. Unified deep learning models for enhanced lung cancer prediction with ResNet-50–101 and EfficientNet-B3 using DICOM images. BMC Med. Imaging. 24, 63 (2024).
Article PubMed PubMed Central Google Scholar
Ali, H., Mohsen, F. & Shah, Z. Improving diagnosis and prognosis of lung cancer using vision transformers: a scoping review. BMC Med. Imaging. 23, 129 (2023).
Article PubMed PubMed Central Google Scholar
Chen, J. et al. Lung cancer diagnosis using deep attention-based multiple instance learning and radiomics. Med. Phys. 49, 3134–3143 (2022).
Article PubMed Google Scholar
Silva, F. et al. EGFR assessment in lung cancer CT images: analysis of local and holistic regions of interest using deep unsupervised transfer learning. IEEE Access. 9, 58667–58676 (2021).
Article Google Scholar
Moreno Trillos, S. C. EGFR and KRAS mutation prediction on lung cancer through medical image processing and artificial intelligence, (2022).
Aftab, R., Qiang, Y., Zhao, J., Urrehman, Z. & Zhao, Z. Graph neural network for representation learning of lung cancer. BMC Cancer. 23, 1037 (2023).
Article PubMed PubMed Central Google Scholar
Li, X. et al. Deep learning attention mechanism in medical image analysis: basics and beyonds. Int. J. Netw. Dynamics Intell., 93–116. (2023).
Neal Joshua, E. S., Bhattacharyya, D., Chakkravarthy, M. & Byun, Y. C. 3D CNN with Visual Insights for Early Detection of Lung Cancer Using Gradient-Weighted Class Activation, Journal of Healthcare Engineering, (2021) 6695518. (2021).
Du, B., Ying, H., Zhang, J. & Chen, Q. Multi-Space feature fusion and Entropy-Based metrics for underwater image quality assessment. Entropy 27, 173 (2025).
Article PubMed PubMed Central ADS Google Scholar
Lyu, Z., Chen, Y., Hou, Y. & MCPNet Multi-space color correction and features prior fusion for single-image dehazing in non-homogeneous haze scenarios. Pattern Recogn. 150, 110290 (2024).
Article Google Scholar
Long, M., Jia, C. & Peng, F. Face morphing detection based on a two-stream network with channel attention and residual of multiple color spaces, in: International Conference on Machine Learning for Cyber Security, Springer, pp. 439–454. (2022).
Luo, Q., Su, J., Yang, C., Silven, O. & Liu, L. Scale-selective and noise-robust extended local binary pattern for texture classification. Pattern Recogn. 132, 108901 (2022).
Article Google Scholar
Aouat, S., Ait-Hammi, I. & Hamouchene, I. A new approach for texture segmentation based on the Gray level Co-occurrence matrix. Multimedia Tools Appl. 80, 24027–24052 (2021).
Article Google Scholar
Echle, A. et al. Deep learning in cancer pathology: a new generation of clinical biomarkers. Br. J. Cancer. 124, 686–696 (2021).
Article PubMed Google Scholar
Araújo, A. L. D. et al. Machine learning concepts applied to oral pathology and oral medicine: a convolutional neural networks’ approach. J. Oral Pathol. Med. 52, 109–118 (2023).
Article PubMed Google Scholar
Khosravi, P. et al. A deep learning approach to diagnostic classification of prostate cancer using pathology–radiology fusion. J. Magn. Reson. Imaging. 54, 462–471 (2021).
Article PubMed PubMed Central Google Scholar
Bokhorst, J. M. et al. Deep learning for multi-class semantic segmentation enables colorectal cancer detection and classification in digital pathology images. Sci. Rep. 13, 8398 (2023).
Article PubMed PubMed Central ADS Google Scholar
Fu, X., Sahai, E. & Wilkins, A. Application of digital pathology-based advanced analytics of tumour microenvironment organisation to predict prognosis and therapeutic response. J. Pathol. 260, 578–591 (2023).
Article PubMed PubMed Central Google Scholar
Cui, H., Jiang, L., Yuwen, C., Xia, Y. & Zhang, Y. Deep U-Net architecture with curriculum learning for myocardial pathology segmentation in multi-sequence cardiac magnetic resonance images. Knowl. Based Syst. 249, 108942 (2022).
Article Google Scholar
Zakaria, N., Mohamed, F., Abdelghani, R. & Sundaraj, K. Three resnet deep learning architectures applied in pulmonary pathologies classification, in: 2021 International Conference on Artificial Intelligence for Cyber Security Systems and Privacy (AI-CSP), IEEE, pp. 1–8. (2021).
Zhi, Z. & Qing, M. Intelligent medical image feature extraction method based on improved deep learning. Technol. Health Care. 29, 363–379 (2021).
Article PubMed Google Scholar
Vishraj, R., Gupta, S. & Singh, S. A comprehensive review of content-based image retrieval systems using deep learning and hand-crafted features in medical imaging: research challenges and future directions. Comput. Electr. Eng. 104, 108450 (2022).
Article Google Scholar
Barburiceanu, S., Terebes, R. & Meza, S. 3D texture feature extraction and classification using GLCM and LBP-based descriptors. Appl. Sci. 11, 2332 (2021).
Article Google Scholar
Jindal, B. & Garg, S. FIFE: fast and indented feature extractor for medical imaging based on shape features. Multimedia Tools Appl. 82, 6053–6069 (2023).
Article Google Scholar
Dehbozorgi, P., Ryabchykov, O. & Bocklitz, T. W. A comparative study of statistical, radiomics, and deep learning feature extraction techniques for medical image classification in optical and radiological modalities. Comput. Biol. Med. 187, 109768 (2025).
Article PubMed Google Scholar
Wang, Q. & Ma, Q. Super-resolution reconstruction algorithm for medical images by fusion of wavelet transform and multi‐scale adaptive feature selection. IET Image Proc. 18, 4297–4309 (2024).
Article Google Scholar
Cai, G., Wei, X. & Li, Y. Privacy-preserving CNN feature extraction and retrieval over medical images. Int. J. Intell. Syst. 37, 9267–9289 (2022).
Article Google Scholar
Alzubaidi, L. et al. MedNet: pre-trained convolutional neural network model for the medical imaging tasks, arXiv preprint arXiv:2110.06512, (2021).
Hayat, M. & Aramvith, S. Transformer’s role in brain MRI: a scoping review. IEEE Access., (2024).
Nahiduzzaman, M. et al. A novel method for multivariant pneumonia classification based on hybrid CNN-PCA based feature extraction using extreme learning machine with CXR images. IEEE Access. 9, 147512–147526 (2021).
Article Google Scholar
Hedayati, R., Khedmati, M. & Taghipour-Gorjikolaie, M. Deep feature extraction method based on ensemble of convolutional auto encoders: application to alzheimer’s disease diagnosis. Biomed. Signal Process. Control. 66, 102397 (2021).
Article Google Scholar
Wang, X. et al. Transpath: Transformer-based self-supervised learning for histopathological image classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 186–195. (2021).
Graham, S. et al. Hover-net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image. Anal. 58, 101563 (2019).
Article PubMed Google Scholar
Ghahremani, P., Marino, J., Dodds, R. & Nadeem, S. Deepliif: An online platform for quantification of clinical pathology slides, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21399–21405. (2022).
Huang, H. et al. Unet 3+: A full-scale connected unet for medical image segmentation, in: ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), Ieee, pp. 1055–1059. (2020).

Download references

Funding

None.

Author information

Authors and Affiliations

The First Affiliated Hospital, College of Clinical Medicine, Henan University of Science and Technology, Luoyang, 471003, China
Mingyuan Yang & Huanzhang Niu

Authors

Mingyuan Yang
View author publications
Search author on:PubMed Google Scholar
Huanzhang Niu
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization, software, wrote the main manuscript text: Mingyuan Yang methodology, reviewed the manuscript: Huanzhang Niu.

Corresponding author

Correspondence to Mingyuan Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Yang, M., Niu, H. Research on liver cancer pathology image recognition based on deep learning image processing. Sci Rep 16, 467 (2026). https://doi.org/10.1038/s41598-025-24834-7

Download citation

Received: 15 April 2025
Accepted: 15 October 2025
Published: 02 January 2026
Version of record: 06 January 2026
DOI: https://doi.org/10.1038/s41598-025-24834-7

Subjects

Abstract

Similar content being viewed by others

MANet: a multi-attention network for automatic liver tumor segmentation in computed tomography (CT) imaging

A novel liver cancer diagnosis method based on patient similarity network and DenseGCN

Liver CT image segmentation network based on multi-scale feature fusion

Introduction

Related work

Multi-space image analysis

Deep learning in pathology

Feature extraction methods for medical imaging

Method

Overview

Multi-space image transformation module

Adaptive feature extraction module

Ensemble classification module

Experiment

Experimental design

Datasets and benchmarks

Performance comparison

Generalization assessment

Computational efficiency

Ablation study

Empirical research

Compare with SOTA models

Validation of innovativeness of the cross-scale attention module

Conclusion

Future directions

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links