Introduction

Colorectal cancer (CRC) remains a leading cause of cancer-related mortality worldwide1. Since most CRC cases originate from adenomatous polyps, early detection and removal via colonoscopy are critical for prevention. However, manual segmentation of polyps is labor-intensive and prone to human error, with missed detection rates ranging from 17% to 28%2,3. Consequently, there is an urgent demand for automated, high-precision computer vision systems to assist clinicians in polyp segmentation.

In recent years, deep learning has revolutionized this field. CNN-based methods, such as U-Net4 and its variants5,6, have demonstrated strong capabilities in modeling local features but often struggle to capture long-range dependencies due to limited receptive fields. Conversely, Transformer-based models7 excel at global context modeling but may lose fine-grained local details and suffer from high computational complexity. To mitigate these issues, recent hybrid approaches (e.g., TransNetR8) attempt to combine convolutional layers with self-attention mechanisms. Despite their progress, existing hybrid models still face significant limitations: 1) the intrinsic discriminative capacity of convolutional features is not fully exploited within the hybrid feed-forward paths; and 2) they often exhibit diffuse attention distribution, leading to insufficient boundary delineation and an inability to effectively balance global coherence with local precision (Fig. 1).

Fig. 1

Illustration of the typical colorectal polyps.

To address these specific limitations, we propose the Hierarchical Contextual Information Aggregation Network (HCIA). Unlike existing methods, HCIA is explicitly designed to establish stable long-range dependencies while preserving local structural details through two novel components. First, the Interconnected Attention Module (IAM) captures global dependencies across all hierarchical levels using a shared memory mechanism with linear complexity, ensuring comprehensive supervision. Second, the Hierarchical Aggregation Module (HAM) facilitates the dynamic integration of multi-scale features from adjacent layers, effectively suppressing background noise and sharpening polyp boundaries. Extensive experiments demonstrate that HCIA achieves superior accuracy and generalization compared to state-of-the-art methods.

The main contributions of this paper are summarized as follows:

  • We propose the HCIA network, a novel architecture that efficiently aggregates hierarchical contextual information for precise polyp segmentation.

  • We introduce the Hierarchical Aggregation Module (HAM) to integrate multi-scale features from adjacent branches, enhancing boundary discrimination and robustness against variable polyp sizes.

  • We design the Interconnected Attention Module (IAM) to establish global dependencies across hierarchical levels with linear complexity, enabling effective global supervision.

  • Comprehensive evaluations on multiple benchmarks demonstrate that HCIA achieves state-of-the-art performance and exhibits strong generalization capabilities.

Related work

CNN-based segmentation methods

Polyp segmentation classifies pixels from colonoscopy images into polyp tissue categories, generating an accurate tissue mask for subsequent clinical analysis. Initially, CNNs dominated this field for their proficiency in modeling local contextual information, thereby becoming the go-to architecture for such tasks for years. Pioneering this domain, Brandao et al.9 were the first to leverage fully convolutional networks specifically for this purpose. The introduction of the UNet model by Ronneberger et al.4, with its innovative encoder-decoder and skip connection strategy, marked a significant leap, facilitating dense and high-resolution predictions. Building on this foundational work, a plethora of UNet-inspired architectures emerged, with notable improvements such as the nested dense skip connections found in Zhou et al.’s UNet++5, and the pyramid-style feature coding for multi-scale representations in Jha et al.’s ResUNet++6. Departing from traditional U-shaped blueprints, alternate designs like Fan et al.’s PraNet10 capitalized on reverse attention to enhance edge delineation of polyp tissues, and Tomar et al.’s DDANet11 employed a dual-decoder mechanism to furnish additional attention-guided maps. Zhao et al.12 introduced the TACT network, which employs the FAPS module for precise segmentation of polyp boundaries and integrates high-level and low-level features through the MSFA module. Huang et al.13 proposed MGF-Net, which refines polyp edge details with enhanced accuracy via multi-channel grouping fusion. Liu et al.14 developed the multi-cascade network MCA-Net, focusing on issues related to variations in polyp shape, size, and texture. Du et al.15 presented UM-Net, which mitigates the impact of polyp tissue color using a color transfer operation. Zhu et al.16 introduced Polyp-Mamba, which employs discrete cosine transform to analyze features from multiple spectral perspectives. 
Nevertheless, despite this progression, the limited ability of CNN-derived approaches to model long-range dependencies has emerged as an intrinsic bottleneck, capping further gains in segmentation proficiency.

Fig. 2

Illustration of the proposed Hierarchical Contextual Information Aggregation Network (HCIA), which consists of four hierarchical branches. The output local prediction maps \(p_{1}^{\prime }, p_{2}^{\prime }, p_{3}^{\prime }\), and \(p_{4}^{\prime }\) are aggregated as the final prediction p. 'HAM' denotes the Hierarchical Aggregation Module (HAM). 'Inter Connected Attention' denotes the Interconnected Attention Module (IAM). 'Pred Head' represents the Prediction Head. 'UpConv' is an upsampling convolutional layer.

Transformer-based segmentation methods

The transformative prowess of transformers in capturing long-range dependencies has seen them gain traction in various fields, including polyp segmentation. Recent contributions, such as by Wang et al.17, devised the SSFormer employing a pyramid transformer backbone for elevated segmentation accuracy. Meanwhile, Duc et al.18 constructed a nimble model dubbed ColonFormer that serves as a segmentation baseline. Despite these innovations, transformer-based techniques are somewhat hamstrung by their native self-attention mechanisms, which do not adequately account for context-rich dependencies.

In an attempt to bridge this gap, some researchers have crafted hybrid networks that integrate convolutional computations into the transformer framework or merge self-attention and convolutional layers. Such models strive to capture both long-range and close-knit dependencies concurrently. Zhang et al.19 put forward TransFuse, a model which synchronously harnesses parallel CNN and transformer encoders to learn both global and local relationships. Similarly, TransUNet, as showcased by Chen et al.20, chains transformers to CNNs for multi-scale feature synthesis, and Dong et al.21 introduced Polyp-PVT, which employs the Pyramid Vision Transformer to reach into hierarchically structured features, enhanced by multiple decoders for pixel-wise segmentation. He et al.22 proposed CTHP, which utilizes unidirectional attention to comprehensively model both global and local information. Liu et al.23 introduced CAFE-Net, designed to maximize the utilization of fine-grained information by reconstructing missing data while preserving low-level features. Xiao et al.24 presented CTNet, which enhances segmentation accuracy by leveraging multi-scale information and high-resolution features. Wang et al.25 developed WBANet, focusing on modeling multi-scale edge information by extracting the slope of the polyp tissue edges. Notably, these hybrid designs have demonstrated superior performance; however, two principal challenges persist: 1) the intrinsic discriminative capacity of convolutional features integrated within the feed-forward layers is not fully realized, and 2) hybrid models have a propensity to overly conform to convolutional features, potentially resulting in diffuse attention that undermines the overall efficacy of the model.

In contrast to these methods, HCIA does not establish long-range dependencies and local context solely through Transformer or convolution operations; instead, it achieves this through the interplay of IAM and HAM. The IAM refines the feature representation of the current layer by calculating attention over the hierarchical features at each level, facilitating the exchange of information across different scales via a shared attention memory. Meanwhile, the HAM is responsible for connecting adjacent layers and integrating their features, thereby obtaining a multi-scale fused perspective. This design enables HCIA to establish stable long-range dependencies and local information while effectively mitigating the drawbacks associated with hybrid models.

Attention module in polyp segmentation

Due to the need for fine-grained features in polyp segmentation, numerous methods have employed various attention modules to enhance segmentation performance. Fan et al.10 utilized reverse attention to further optimize the extraction of details at the edges of polyp tissues. Zhang et al.26 adopted cross-semantic attention to calibrate low-dimensional semantic information within the encoder. Liu et al.27 implemented convolution-based attention to focus on locally significant information. He et al.22 employed height-direction and width-direction attention to model local contextual features. Liu et al.23 utilized a cross-attention decoder to protect low-level features while recovering fine-grained characteristics. Wang et al.25 designed a hierarchical attention fusion mechanism to guide the model's focus towards critical regions.

However, most existing attention modules typically prioritize local subtle features or suffer from high computational costs. To explicitly clarify the novelty of our method, we summarize the key distinctions between the proposed IAM/HAM and closely related attention mechanisms as follows:

  • Global Supervision vs. Local Refinement: Unlike Reverse Attention10 or Convolution-based Attention27, which are limited to refining local edges or specific regions, our IAM leverages a globally shared memory. This allows for information exchange across all network branches and layers, forming a coherent global supervision framework rather than isolated local enhancements.

  • Linear vs. Quadratic Complexity: While Cross-Semantic Attention26 and standard self-attention mechanisms generally exhibit quadratic complexity \(O(N^2)\), imposing a heavy computational burden, IAM is designed with linear complexity. This significantly improves computational efficiency without sacrificing the ability to model global dependencies.

  • Inter-layer Interaction vs. Single-layer Focus: In contrast to methods that apply attention independently within specific layers (e.g., CTHP22), our HAM explicitly connects adjacent layers. This integration fuses multi-scale features dynamically, ensuring that both high-level semantic guidance and low-level structural details are preserved effectively.
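The linear-versus-quadratic distinction above can be made concrete with a back-of-envelope count of multiply-accumulate operations. This is a sketch under stated assumptions: a feature map flattened to \(N\) tokens of width \(d\), and an external memory of size \(S\); the specific values of \(d\) and \(S\) below are illustrative, not taken from the paper.

```python
# Rough multiply-accumulate counts for the two attention styles.
def self_attn_macs(n, d):
    # QK^T costs n*n*d, then (attention @ V) costs another n*n*d.
    return 2 * n * n * d

def external_attn_macs(n, d, s):
    # F @ M_k costs n*d*s, then A @ M_v costs another n*s*d.
    return 2 * n * d * s

n = 88 * 88          # tokens in an 88x88 feature map
d, s = 64, 64        # assumed feature width and memory size
ratio = self_attn_macs(n, d) / external_attn_macs(n, d, s)
print(ratio)         # ratio = N / S
```

For these assumed sizes the quadratic scheme is N/S = 121 times more expensive, and the gap grows linearly with the spatial resolution of the feature map.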

Proposed method

Encoder

To effectively capture multi-scale hierarchical features conducive to polyp segmentation, our framework adopts PVTv228, pretrained on ImageNet29, as the backbone encoder, as shown in Fig. 2 (a). PVTv2 distinguishes itself from traditional vision transformers by pairing self-attention blocks with strided convolutions. This strategic combination allows for the formation of long-range spatial dependencies across a descending cascade of feature resolutions—a design tuned for dense prediction tasks that simultaneously curtails computational expense. The outcome is a pyramidal architecture that yields a multi-tier suite of features spanning various scales. More specifically, the encoder provides four hierarchical feature levels, designated as \(f_{i} \in \mathbb {R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_{i}}\) for \(i \in \{1,2,3,4\}\). Each subsequent level \(f_{i}\) not only diminishes in spatial dimensions, enabling a larger receptive field, but also exhibits increased feature dimensionality, reflecting the greater abstraction of deeper levels. For decoding purposes, these features undergo channel-wise refinement via \(1\times 1\) convolutions, preparing a tailored input for the decoder branch of our architecture.
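The channel-wise refinement step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input channel widths (64, 128, 320, 512) correspond to the PVTv2-B2 variant, and the common output width of 64 is an assumption.

```python
import torch
import torch.nn as nn

class ChannelRefine(nn.Module):
    """Reduce the four PVTv2 feature maps to a common channel width
    with 1x1 convolutions before they enter the decoder branches."""
    def __init__(self, in_channels=(64, 128, 320, 512), out_channels=64):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):
        # feats[i-1] has shape (B, C_i, H / 2^(i+1), W / 2^(i+1)), i = 1..4
        return [conv(f) for conv, f in zip(self.reduce, feats)]

refine = ChannelRefine()
H = W = 352  # the input resolution used in the paper
feats = [torch.randn(1, c, H // s, W // s)
         for c, s in zip((64, 128, 320, 512), (4, 8, 16, 32))]
outs = refine(feats)
print([tuple(o.shape) for o in outs])
```

With a \(352 \times 352\) input, the four refined maps keep the pyramid's spatial sizes (88, 44, 22, 11) while sharing one channel width.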

Hierarchical aggregation module

The hierarchical aggregation module (HAM) plays a pivotal role in assimilating multi-scale hierarchical features across adjacent branches, an essential process for the effective attenuation of background interference and the enhancement of salient foreground regions. This functionality is crucial when dealing with polyps across a spectrum of sizes and for the discernment of clear and reliable boundaries—both vital attributes for bolstering the generalizability of the segmentation model. Drawing from the attention mechanisms presented by Oktay et al.30, our HAM implementation utilizes a grid-attention framework.

As depicted in Fig. 2, the operations within the HAM at the \(i\)-th (\(i\in \left\{ 1,2,3 \right\}\)) hierarchical level involve harmonizing intermediate features \(f_{i+1}^{\prime }\) from the antecedent lower-level branch with the current level's features \(f_{i}\). This integration enables our model to concurrently optimize for local feature refinement and global context, which is instrumental for achieving high-fidelity polyp segmentation. This process can be formulated as:

$$\begin{aligned} & f_{i+1}^{\prime } = \textrm{UpConv} (f_{i+1}^{\prime }), \end{aligned}$$
(1)
$$\begin{aligned} & f_{i} = \textrm{GN}( \mathrm {Conv_{1\times 1}} (f_{i})),\quad f_{i+1}^{\prime } = \textrm{GN}( \mathrm {Conv_{1\times 1}} (f_{i+1}^{\prime })), \end{aligned}$$
(2)
$$\begin{aligned} & A_{i,i+1}^{HAM} = \textrm{GN} (\mathrm {Conv_{1\times 1}} (\sigma (f_{i}+f_{i+1}^{\prime }))), \end{aligned}$$
(3)
$$\begin{aligned} & f_{i}^{\prime \prime }=\left[ \tau (A_{i,i+1}^{HAM})\odot f_{i},f_{i+1}^{\prime } \right] , \end{aligned}$$
(4)

where \(\textrm{UpConv}(\cdot ) = \sigma (\textrm{GN} (\mathrm {Conv_{3\times 3}} (\textrm{Up}_{2\times }(\cdot ))))\) and \(\textrm{Up} _{2\times }\) denotes a \(2\times\) upsampling operation. \(\textrm{Conv} _{3\times 3}\) is a convolution layer with a \(3 \times 3\) kernel while \(\textrm{Conv} _{1\times 1}\) is one with a \(1 \times 1\) kernel. \(\textrm{GN}\) denotes group normalization. \(A_{i,i+1}^{HAM}\) denotes an affinity matrix. \({\sigma }\) is the SiLU function. \(\tau\) is the Sigmoid function. \(\odot\) denotes element-wise multiplication and \([\cdot ,\cdot ]\) represents the concatenation operation.
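Eqs. (1)-(4) can be sketched as a PyTorch module. This is a hedged reconstruction, not the released code: the channel width (64), the GroupNorm group count (8), and the single-channel affinity map (as in Oktay et al.'s grid attention) are assumptions, and the sigmoid gating in Eq. (4) is read as element-wise.

```python
import torch
import torch.nn as nn

class HAM(nn.Module):
    """Minimal sketch of the Hierarchical Aggregation Module, Eqs. (1)-(4)."""
    def __init__(self, channels=64, groups=8):
        super().__init__()
        self.up_conv = nn.Sequential(      # Eq. (1): UpConv = SiLU(GN(Conv3x3(Up2x)))
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(groups, channels),
            nn.SiLU(),
        )
        self.proj_cur = nn.Sequential(     # Eq. (2), current-level projection
            nn.Conv2d(channels, channels, 1), nn.GroupNorm(groups, channels))
        self.proj_low = nn.Sequential(     # Eq. (2), lower-level projection
            nn.Conv2d(channels, channels, 1), nn.GroupNorm(groups, channels))
        self.affinity = nn.Sequential(     # Eq. (3): GN(Conv1x1(SiLU(sum)))
            nn.SiLU(), nn.Conv2d(channels, 1, 1), nn.GroupNorm(1, 1))

    def forward(self, f_cur, f_low):
        f_low = self.up_conv(f_low)                 # Eq. (1)
        f_cur = self.proj_cur(f_cur)                # Eq. (2)
        f_low = self.proj_low(f_low)
        a = self.affinity(f_cur + f_low)            # Eq. (3)
        gated = torch.sigmoid(a) * f_cur            # Eq. (4): gate, then concat
        return torch.cat([gated, f_low], dim=1)

ham = HAM()
f_i  = torch.randn(1, 64, 44, 44)   # current-level features f_i
f_i1 = torch.randn(1, 64, 22, 22)   # lower-level intermediate features f'_{i+1}
out = ham(f_i, f_i1)
print(tuple(out.shape))             # (1, 128, 44, 44)
```

Note how the lower-level map is first brought to the current level's resolution, so the concatenation in Eq. (4) doubles the channel count at a single spatial scale.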

Fig. 3

Illustration of the structure of Hierarchical Aggregation Module (HAM).

Fig. 4

Illustration of the mechanism of Interconnected Attention Module (IAM).

Interconnected attention module

The Interconnected attention module (IAM) is a cornerstone component of the proposed HCIA that fulfills a dual purpose. First, the IAM forges local contextual dependencies among the varying levels of hierarchical branches, effectively enriching the capacity of these features to distinguish relevant patterns within the data. Second, it leverages globally shared memories, designated as \(M_{k}\) and \(M_{v}\), that are intricately woven into the attention computations of IAMs across all four hierarchical branches. This incorporation of global memories imparts a comprehensive supervisory element, orchestrating an integrated approach to information consolidation across the multiple scales of hierarchical features (Figs. 3, 4 and 5).

In our construction of IAMs, we employ an external attention mechanism as formulated by Guo et al.31, which allows for an expansive attention reach beyond the confines of local associations typically seen in convolutional networks, as shown in Fig. 2 (c). The use of external attention not only provides a means to capture higher-order dependencies within the feature maps but also ensures that the attention mechanism benefits from consistency in the oversight across all levels of the feature hierarchy. This unified attentional guidance is intended to significantly bolster the accuracy in segmenting polyps of various scales and complexities, delivering superior model performance. This process can be formulated as:

$$\begin{aligned} & A_{i}^{IAM } = (W^{IAM }_{i})^{\top }f_{i}^{\prime \prime }\odot M_{k}, \end{aligned}$$
(5)
$$\begin{aligned} & f_{i}^{\prime } = f_{i}^{\prime \prime } + \textrm{GN} (A_{i}^{IAM }) \odot M_{v}, \end{aligned}$$
(6)

where \(W^{IAM}_{i}\) is a weight matrix of a learnable linear layer in the \(i\)-th hierarchical branch. \(A_{i}^{IAM}\) denotes an affinity matrix. \(\odot\) represents matrix multiplication. \(M_{k}\) and \(M_{v}\) are globally shared memories which are randomly initialized.

In the IAM design, substituting the standard keys and values, typically computed from the input features \(f_{i}^{\prime \prime }\), with the pre-established globally shared memories \(M_{k}\) and \(M_{v}\) is a strategic choice. This replacement trades the computation-intensive quadratic complexity of traditional self-attention for a far more efficient linear-complexity framework. By circumventing the pairwise feature comparisons intrinsic to quadratic attention, the IAM can perform attention operations over extensive feature maps without incurring prohibitive computational costs. This efficiency gain makes HCIA more practical than counterparts anchored to the conventional quadratic paradigm, particularly for large input dimensionality or extensive datasets—commonplace in medical image analysis.
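Eqs. (5)-(6) can be sketched as follows. This is an illustrative reconstruction of an external-attention IAM under assumed sizes (feature width 64, memory size 64); the key point it shows is that \(M_{k}\) and \(M_{v}\) are the same parameters in every branch, so the cost scales with \(N \cdot S\) rather than \(N^2\).

```python
import torch
import torch.nn as nn

class IAM(nn.Module):
    """Sketch of the Interconnected Attention Module, Eqs. (5)-(6)."""
    def __init__(self, dim=64, mem_size=64, m_k=None, m_v=None):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # W_i in Eq. (5)
        # Globally shared, randomly initialised memories; callers pass the
        # same parameters to every branch's IAM so they are truly shared.
        self.m_k = m_k if m_k is not None else nn.Parameter(torch.randn(dim, mem_size))
        self.m_v = m_v if m_v is not None else nn.Parameter(torch.randn(mem_size, dim))
        self.norm = nn.GroupNorm(1, mem_size)

    def forward(self, f):
        b, c, h, w = f.shape
        x = f.flatten(2).transpose(1, 2)    # (B, N, C) with N = H*W
        a = self.linear(x) @ self.m_k       # Eq. (5): affinity, (B, N, S)
        a = self.norm(a.transpose(1, 2)).transpose(1, 2)
        out = x + a @ self.m_v              # Eq. (6): residual update
        return out.transpose(1, 2).reshape(b, c, h, w)

shared_k = nn.Parameter(torch.randn(64, 64))
shared_v = nn.Parameter(torch.randn(64, 64))
branch_iams = [IAM(m_k=shared_k, m_v=shared_v) for _ in range(4)]
y = branch_iams[0](torch.randn(1, 64, 44, 44))
print(tuple(y.shape))   # (1, 64, 44, 44)
```

Because the memory size \(S\) is fixed, doubling the feature-map resolution only doubles the attention cost, in contrast to the fourfold growth of standard self-attention.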

Fig. 5

Illustration of the structure of Prediction Head (PH).

Prediction head

The prediction head is responsible for generating local prediction maps from the outputs of each hierarchical branch in the architecture, as shown in Fig. 2 (e). For the \(i\)-th hierarchical layer, the produced local prediction map is represented as \(p_{i}^{\prime }\). To enhance the robustness and comprehensiveness of the feature discrimination, we aggregate the local prediction maps \(p_{1}^{\prime }, p_{2}^{\prime }, p_{3}^{\prime }\), and \(p_{4}^{\prime }\) emanating from all four hierarchical branches. This summation harnesses the combined strengths of each level's feature representation to yield a more holistic and refined prediction. The collective map is then passed through a Sigmoid activation function. This process can be formulated as:

$$\begin{aligned} & p_{i}^{\prime } = \textrm{Conv} _{1 \times 1}(\textrm{PRE} _{2}(\textrm{PRE} _{1}(f_{i}^{\prime }))), \end{aligned}$$
(7)
$$\begin{aligned} & p = {\tau }(\sum _{i=1}^{4} p_{i}^{\prime }), \end{aligned}$$
(8)

where \(\textrm{PRE} (\cdot ) = {\sigma }(\textrm{GN} (\textrm{Conv} _{3\times 3}(\cdot )))\) is a prediction module, which consists of a convolution layer with a kernel of \(3 \times 3\), a group normalization layer, and a SiLU activation function. \(\tau\) is the Sigmoid function. p denotes the overall prediction map.
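Eqs. (7)-(8) can be sketched as follows. This is a hedged sketch: the channel width (64), group count (8), and the assumption that the four branch maps are already at one shared resolution before summation are ours, not stated in the text.

```python
import torch
import torch.nn as nn

def pre(channels, groups=8):
    """PRE block: Conv3x3 -> GroupNorm -> SiLU, as defined after Eq. (8)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.GroupNorm(groups, channels),
        nn.SiLU(),
    )

class PredHead(nn.Module):
    """Eq. (7): p'_i = Conv1x1(PRE_2(PRE_1(f'_i)))."""
    def __init__(self, channels=64):
        super().__init__()
        self.pre1 = pre(channels)
        self.pre2 = pre(channels)
        self.out = nn.Conv2d(channels, 1, 1)

    def forward(self, f):
        return self.out(self.pre2(self.pre1(f)))

heads = nn.ModuleList(PredHead() for _ in range(4))
# Assumed: branch features resampled to a common 88x88 grid beforehand.
feats = [torch.randn(1, 64, 88, 88) for _ in range(4)]
local = [h(f) for h, f in zip(heads, feats)]
p = torch.sigmoid(sum(local))          # Eq. (8): sum then Sigmoid
print(tuple(p.shape))                  # (1, 1, 88, 88)
```

Summing the logits before the Sigmoid, rather than averaging probabilities, lets confident branches dominate uncertain ones in the final map.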

Experiments

Datasets and metrics

Kvasir-SEG. The Kvasir-SEG dataset32 is a publicly available collection featuring 1000 polyp images, complete with expertly annotated segmentation masks verified by seasoned gastroenterologists. These images pose a substantial challenge due to their diversity in size—ranging from dimensions of \(332 \times 487\) to \(1920 \times 1072\)—and the variety in the appearance of the polyps themselves, which differ markedly in size, shape, and texture.

CVC-ClinicDB. The CVC-ClinicDB dataset33 is a widely accessible set of 612 colonoscopy images derived from 31 video sequences, standardized to a resolution of \(384 \times 288\). Accompanying each image is a meticulously annotated, pixel-accurate segmentation mask, validated by medical experts to ensure reliability.

CVC-ColonDB. The CVC-ColonDB dataset34 comprises a total of 380 annotated images extracted from 15 video sequences, with each image having a resolution of \(574 \times 500\). Each image has been validated by medical experts to exclude similar images, ensuring that the dataset represents content from different perspectives.

Evaluation metrics. Our model’s efficacy is gauged using four established metrics10,35, each serving as a standard for performance evaluation in polyp segmentation: mean Dice coefficient (mDice)36, mean Intersection over Union (mIoU), Structure-measure (\(S_{\alpha }\))37, and Enhanced-alignment measure (\(mE_{\xi }\))38. The mDice and mIoU offer quantifiable insights into the region-centric similarity between predicted and ground truth masks. \(S_{\alpha }\) evaluates the structural integrity of the predicted segmentation by aligning it with the object-attuned similarity. Finally, \(mE_{\xi }\) delivers a nuanced index of the segmentation’s fidelity by assessing the model’s predictive competency at both the global image level and the detailed pixel level. We conduct five independent tests on the model, and the average of the evaluation metrics from these five tests is taken as the final result to mitigate any potential variability or bias.
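The two region-centric metrics can be computed per image as below; the "mean" variants (mDice, mIoU) average these scores over the test set. This is a standard definition, shown here for concreteness with a small hand-checkable example.

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-8):
    """Dice and IoU for one pair of binary masks (eps guards empty masks)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
d, i = dice_iou(pred, gt)
print(round(d, 3), round(i, 3))  # 0.667 0.5
```

Here the intersection is 2 pixels against 3 predicted and 3 ground-truth pixels, giving Dice = 4/6 and IoU = 2/4, and illustrating that Dice always weights agreement more generously than IoU.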

Table 1 Single-domain performance evaluation of HCIA compared to other SOTA models. Optimal results are highlighted in bold.

Implementation details

Our architecture is instantiated and investigated within PyTorch version 1.11.0. For the training and assessment phases, we leverage an NVIDIA RTX 3090 GPU outfitted with 24GB of VRAM. To standardize the input data, we resize all polyp images to a uniform dimension of \(352 \times 352\) pixels. We deploy a suite of image augmentation strategies to enhance the generalizability of the model, including Gaussian blur, color jittering, and horizontal and vertical flips. Additionally, affine transformations are applied to simulate common variations in positioning, such as translation, rotation, scaling, and shearing. The model optimization process harnesses the Adam optimizer53. We initiate training with a learning rate of \(1e-4\). To facilitate a streamlined training operation capable of accommodating sizable datasets, we adopt a batch size of 16 and employ mixed-precision training through NVIDIA's Apex library, which accelerates computation while reducing memory requirements.

Evaluation

The proposed HCIA, in the comparison presented, exhibits exceptional effectiveness and generalizability against a selection of contemporary state-of-the-art (SOTA) models on the public datasets Kvasir-SEG, CVC-ClinicDB and CVC-ColonDB. For a comprehensive assessment, HCIA is benchmarked against two distinct groups of SOTA methodologies: 12 CNN-based models, including U-Net4, UNet++5, PraNet10, DCRNet40, MMFIL-Net42, EFA-Net45, SRaNet47, BUNet48, MADGNet41, Polyp-Mamba16, UM-Net15 and APCNet44, as well as 10 Transformer-derived models, namely TransFuse19, Polyp-PVT46, Polyp-LVT43, SSFormer49, FCBFormer52, CAFE-Net23, CTNet24, DSHNet50, CTHP22 and MGCBFormer51. To ensure a fair comparison, all open-source methods, including ours, are retrained with the authors' publicly available source code on a unified hardware setup, maintaining their default configurations. The assessment encompasses two dimensions: efficacy, as demonstrated on individual dataset testing subsets, and broader applicability through cross-dataset evaluations. Models are trained on specified splits from each dataset and then tested within their corresponding domains. For cross-dataset generalizability, models trained on the entirety of Kvasir-SEG are evaluated on CVC-ClinicDB (Kvasir \(\rightarrow\) CVC), and vice versa.

Reflecting on the results noted in Table 1, HCIA presents a compelling performance profile, consistently outshining other methods on most metrics across the board for the datasets under study. Specifically, HCIA leads on Kvasir-SEG, achieving 94.2% on mDice, 89.4% on mIoU, 94.9% on \(S_{\alpha }\) and 96.8% on \(mE_{\xi }\), outmatching CNN-based method Polyp-Mamba16 (2025) in terms of mDice and mIoU by 2.3% and 2.7%, while surpassing Transformer-based method DSHNet50 in terms of mDice and mIoU by 1.7% and 1.3%. On CVC-ClinicDB and CVC-ColonDB, similarly, HCIA’s performance expressed in nearly all metrics surpasses that of competing CNN-based methods and Transformer-based solutions, solidifying the contribution of the hierarchical and attentive components HAM and IAM to the overall performance.

Table 2 Cross-domain performance evaluation of HCIA compared to other SOTA models. Optimal results are highlighted in bold.

Table 2 lays out the generalization results, which underscore HCIA's capacity to maintain stable performance across disparate datasets. In the Kvasir \(\rightarrow\) CVC setting, HCIA shows a striking advantage over FCBFormer-L in mDice and mIoU by 4% and 4.4%, while for CVC \(\rightarrow\) Kvasir, HCIA surpasses FCBFormer-L in mDice and mIoU by 5.8% and 9%. This consistency across diverse training and testing scenarios underlines the robustness of HCIA's design in tackling the intricate task of polyp segmentation.

Table 3 The ablation studies of HCIA evaluated in single domain on Kvasir-SEG and CVC-ClinicDB. Base denotes the adopted baseline. H denotes HAM while I denotes IAM. Params denotes model parameters. FLOPs denotes floating point operations.
Table 4 The comparative ablation studies of HCIA evaluated in single domain on Kvasir-SEG and CVC-ClinicDB. Base denotes the adopted baseline. H denotes HAM while I denotes IAM.

Ablation studies

The meticulous ablation study conducted on HCIA sheds light on the potency and pivotal roles of the constituent modules, primarily the Hierarchical Aggregation Module (HAM) and the Interconnected Attention Module (IAM), across Kvasir-SEG and CVC-ClinicDB. As delineated in Table 3 and Table 4, these experiments are bifurcated into incremental and comparative sets.

As shown in Table 3, the incremental experiments progressively enrich the baseline model with core components—the HAM and the IAM—examining their individual and combined influence on model performance. On the other hand, as shown in Table 4, the comparative experiments evaluate different attention mechanisms within the IAM framework by testing a series of derivative variants. Variant \(I^{\nabla }\) substitutes the external attention with spatial attention (SA), \(I^{\triangle }\) with channel attention (CA), \(I^{\Box }\) harnesses both SA and CA simultaneously, and \(I^{\Diamond }\) utilizes dedicated \(M_{k}\) and \(M_{v}\) memories for each IAM, shedding the globally shared property.

Effectiveness of each element. The edge that HAM brings to the table is evident when examining the leap in performance observed from the baseline to the baseline augmented with HAM ('\(Base+H\)' in Table 3). This manifests as gains in mDice of 1.8% and 1.3% on Kvasir-SEG and CVC-ClinicDB, respectively. The HAM module evidently excels at mitigating background distractions and accentuating foreground entities, enabling more precise segmentation demarcations. Conversely, IAM's contribution is highlighted by the notable upticks in segmentation accuracy upon its integration into the baseline model ('\(Base+I\)' in Table 3), validating its role in fostering coherent context dependencies and centralized global supervision. This integration yields improvements of 2.3% and 2.1% in the mDice metric across the two datasets. It can be observed that, owing to the linear-complexity design of the IAM, the increase in both Params and FLOPs introduced by IAM is minimal. This characteristic endows HCIA with a significant advantage over other methods that utilize attention mechanisms. When both HAM and IAM are harmonized within a single framework, HCIA unleashes its full potential, culminating in substantial improvements of 3.7% and 3.3% in mDice when benchmarked against the baseline model on both datasets ('\(Base+H+I\)' in Table 3). This composite effect underscores HCIA's proficiency in managing features across multiple scales, seamlessly establishing discriminative context dependencies and integrated global-range interdependencies.

Effectiveness of Attention Mechanism Design in IAM. The distinctive architecture of the IAM in the proposed HCIA is scrutinized through comparative ablation experiments to validate the impact of its design elements, particularly the use of external attention and the global sharing of memories (\(M_{k}\) and \(M_{v}\)). Traditional attention mechanisms like spatial attention (SA) and channel attention (CA) are generally employed to augment the capture of localized details within images. Yet, the segmentation of complex structures, such as polyps, demands more than just local awareness—it necessitates an ability to also grasp the broader, long-range semantic context that SA and CA alone may inadequately provide due to the limits of their design. When examining the performance across the variants detailed in Table 4—\(I^{\nabla }\) with SA, \(I^{\triangle }\) with CA, and \(I^{\Box }\) with both—it becomes apparent that the IAM configuration employing external attention (\(Base+H+I\)) surpasses each of these. The metrics demonstrate that \(Base+H+I\) outperforms the best of these individual configurations, \(Base+H+I^{\Box }\), by margins of 0.7% and 0.7% in \(mDice\) on the two datasets. This reinforces the notion that external attention is critical in capturing the complexities of polyp segmentation. Furthermore, HCIA's design introduces IAMs across four hierarchical branches, wherein each module refines the features at its respective scale to fortify the local context connections. The collective insights and oversight among the IAMs are facilitated by global memories, which strengthen the coordination and global supervision of the entire system. In the \(I^{\Diamond }\) variant, where \(M_{k}\) and \(M_{v}\) are not shared globally but kept exclusive to each IAM, a comparative decline in performance becomes evident when measured against the IAM using globally shared memories (\(Base+H+I\)). This discrepancy highlights the benefits of a shared global memory framework, confirming that the global interworking of information across IAMs significantly contributes to HCIA's state-of-the-art performance in polyp segmentation tasks.

Fig. 6

Qualitative analysis on Kvasir-SEG. GT denotes the ground truth.

Fig. 7

Qualitative analysis on CVC-ClinicDB. GT denotes the ground truth.

Qualitative analysis

Visual representations are compelling when it comes to demonstrating the performance and strengths of a segmentation model. By bringing forth a selection of representative methods, including the CNN-based U-Net, PraNet and UM-Net, as well as the Transformer-based TransFuse, Polyp-LVT and CAFE-Net, for a side-by-side visual comparison with the proposed HCIA, we gain clear insights into the practical capabilities of the methodologies. From the snapshots provided in Fig. 6, the effectiveness of HCIA is evident across a range of challenging polyp segmentation cases on Kvasir-SEG. HCIA adapts adeptly to polyps with diverse sizes, shapes, and textures—even when boundary delineation is made difficult by varying lighting conditions that obscure the difference between the polyp tissues and surrounding normal tissues. HCIA's proficiency in these scenarios underscores the necessity for models to possess robust local information extraction tools coupled with a comprehensive global perspective—qualities endowed by the integration of HAM and IAM within HCIA. In the scenarios exhibited in Fig. 7, presented with CVC-ClinicDB dataset images, where the foreground-background contrast is accentuated by improved lighting conditions, it is apparent that while most of the compared methods achieve relatively high segmentation accuracy, HCIA consistently excels at drawing detailed and precise segmentation contours. This is particularly noteworthy in cases where polyps display intricate edges, as seen in row (3) of Fig. 7. HCIA's capability to delineate complex polyp edges with a high degree of accuracy speaks not only to its performance but also to its versatility in adapting to different imaging conditions and the variable nature of polyp structures. Through the visual comparison of segmentation results under differing conditions, HCIA's efficiency in processing polyps of various scales, shapes, and complex contours is affirmed.

Table 5 Efficiency evaluation of HCIA compared to other SOTA models. Optimal results are highlighted in bold.

Efficiency evaluation

To demonstrate the superior efficiency of the proposed HCIA, we conducted a comprehensive comparative analysis of model efficiency and complexity on the Kvasir-SEG dataset, as illustrated in Table 5. We utilized floating point operations (FLOPs), network parameters (Params) and frames per second (FPS) as evaluation criteria. It is evident that HCIA exhibits a balanced count of Params and FLOPs while achieving improved segmentation performance. Polyp-PVT demonstrates optimal model efficiency, with 25.1M Params and 11.2G FLOPs while our HCIA only incurs an increase of 0.7M Params and 2.5G FLOPs, yielding enhancements of 2.5% in mDice and 2.7% in mIoU. The comparative results validate that HCIA achieves an effective balance between efficiency and performance. Additionally, HCIA’s inference speed reaches 51 FPS, meeting the requirements for real-time predictions.

Limitations

Despite the promising performance of HCIA, there are several limitations worth noting. First, although the IAM is designed with linear complexity to improve efficiency, the multi-scale feature aggregation in HAM and the hierarchical architecture still involve a considerable number of parameters. This may pose challenges for deploying the full model on strictly resource-constrained edge devices (e.g., embedded endoscope processors) for real-time inference without further optimization like model quantization or pruning. Second, the current framework operates on a frame-by-frame basis (2D) and does not yet leverage the temporal consistency available in colonoscopy video sequences, which could potentially further enhance detection stability.

Conclusion

In this paper, we proposed a novel Hierarchical Contextual Information Aggregation Network (HCIA) for accurate polyp segmentation. To address the challenge of balancing global context and local details, we introduced two key components: Interconnected Attention Module (IAM) and Hierarchical Aggregation Module (HAM). Specifically, IAM captures long-range dependencies across all layers via a globally shared memory mechanism with linear complexity, while HAM effectively integrates features from adjacent levels to enhance multi-scale representation. Comprehensive experiments on multiple benchmarks demonstrate that HCIA achieves state-of-the-art performance, exhibiting superior accuracy and generalization capabilities compared to existing methods.