Abstract
In medical image segmentation, traditional CNN-based models excel at extracting local features but have limitations in capturing global features. Conversely, Mamba, a novel network framework, effectively captures long-range feature dependencies and excels in processing linearly arranged image inputs, albeit at the cost of overlooking fine spatial relationships and local pixel interactions. This limitation highlights the need for hybrid approaches that combine the strengths of both architectures. To address this challenge, we propose CNN-Fusion-Mamba-based U-Net (CFM-UNet). The model integrates CNN-based Bottle2neck blocks for local feature extraction and Mamba-based visual state space blocks for global feature extraction. These parallel frameworks perform feature fusion through our designed SEF block, achieving complementary advantages. Experimental results demonstrate that CFM-UNet outperforms other advanced methods in segmenting medical image datasets, including liver organs, liver tumors, spine, and colon polyps, with notable generalization ability in liver organ segmentation. Our code is available at https://github.com/Jiacheng-Han/CFM-UNet.
Introduction
Medical image segmentation aims to separate structures or tissues in medical images through computer algorithms. These segmentation results provide valuable information about anatomical regions required for detailed analysis, greatly helping physicians to characterize injuries, monitor disease progression, and specify further treatment options. As the demand for intelligent medical image analysis continues to grow, accurate and robust segmentation methods are becoming increasingly important1,2.
CNN-based medical image segmentation models3,4,5,6,7,8, represented by U-Net9, have achieved remarkable success in a variety of medical applications through accurate segmentation of MRI, CT, PET/CT, and other medical imaging modalities, becoming the mainstream approach. However, these methods have limitations in capturing long-range relationships and global context, especially for objects with large inter-patient variations in texture, scale, and shape. Various strategies have emerged, such as dilated convolutions3,4, image pyramids5,6, prior-guided methods7,10, multi-scale fusion8,11, and self-attention mechanisms12,13, all attempting to address these limitations in medical image segmentation. However, they still have weaknesses in extracting global contextual features, as shown in Fig. 1.
Recently, a new architecture based on the State Space Model (SSM) called Mamba14 has also been applied in medical image segmentation. VMamba15 constructs a visual state space (VSS) block based on this architecture, capable of capturing global contextual information of images with linear computational complexity, demonstrating powerful capabilities in establishing distant dependencies and understanding global context. However, existing Mamba-based visual modules15 capture global context by processing images as linear sequences along specific directions. While effective for capturing long-range dependencies, this approach inherently limits the ability to accurately capture local features. This limitation arises from the neglect of fine spatial relationships and local pixel interactions, which are crucial for detailed segmentation tasks. The insufficient capability to capture local features becomes particularly evident in scenarios with noisy or complex backgrounds. As shown in Fig. 1, the Mamba-based Swin-UMamba16 model is highly sensitive to such background complexities, often resulting in blurred boundaries and noticeable segmentation errors.
For liver organs with relatively fixed positions, CNN-based M2SNet outperforms Swin-UMamba. Conversely, for colon polyps with varying positions and shapes, Mamba-based Swin-UMamba performs better than M2SNet. This highlights the strengths of CNN-based methods in local detail and edge texture segmentation, while Mamba-based methods excel in capturing global features and contextual understanding.
Mamba-based medical image segmentation methods are still in their nascent stage. Models emerging at this stage, such as VM-UNet17 and U-Mamba18, extract and preserve local features only by introducing the VSS block into a U-shaped framework with classical downsampling and skip connections, which is insufficient to improve the model's own understanding of local details. This limitation highlights the challenges faced by Mamba-based models in real-world applications, particularly in tasks requiring fine-grained segmentation. Therefore, hybrid approaches are needed that combine the strengths of Mamba and CNN architectures, enabling a more balanced extraction of both global and local features to improve segmentation performance. Inspired by the parallel network structure proposed by Peng et al.19, we construct a parallel CNN-Mamba network and apply it to medical image segmentation. Considering the performance characteristics and shortcomings of these two network structures, and noting the versatility and adaptive feature weighting capability of the SE (Squeeze-and-Excitation) network20, which enhances feature expression while remaining simple and lightweight, we propose the CFM-UNet model. This model aims to improve the segmentation accuracy of medical images by leveraging the strengths of both CNN and Mamba without compromising their independent segmentation frameworks. It introduces the SEF (SE-based Feature Fusion) block to fuse features learned from both frameworks during downsampling and to feed the result back to each branch. This approach enables complementary advantages between the two networks, thereby improving segmentation precision. Our contributions can be summarized as follows:
-
The encoder of CFM-UNet combines the strengths of CNN-based Bottle2neck21 and Mamba-based VSS15 frameworks, efficiently learning and capturing both local spatial contextual features and global features.
-
The SEF feature fusion block does not require a very deep network structure; it efficiently couples the two frameworks while avoiding gradient vanishing and feature information loss.
-
The proposed CFM-UNet demonstrates excellent performance on the LiTS22 liver and liver tumor dataset, SPIDER23 spine dataset, and Kvasir-SEG24 colon polyp dataset, and also shows good generalization ability on the ATLAS25 liver dataset.
Related work
CNN-based medical image segmentation models
The fully convolutional network (FCN)26 was the first to introduce deep learning methods into image segmentation. The emergence of the U-Net9 further facilitated the rapid development of deep learning-based segmentation in medical imaging, establishing the dominance of the U-shape architecture in this field. Subsequently, various U-Net variants have been developed to improve segmentation accuracy. For instance, Attention-UNet27 incorporates the attention mechanism in the decoder, Res-UNet28 integrates residual learning into the network blocks, and U-Net++6 introduces a nested U-Net structure with a deep supervision mechanism. Despite the significant progress in CNN-based medical image segmentation, the inherent locality of convolutional operations and complex data access patterns continue to present challenges that require ongoing research.
ViT-based medical image segmentation models
The Vision Transformer (ViT)29 adapts the Transformer, originally developed for natural language processing, to vision tasks and has emerged as a promising model in this area. Purely Transformer-based models such as Swin-Unet12 represent notable advancements in this domain. To address the limitations of traditional CNNs, ViT is often integrated into existing frameworks or fused with CNN modules. For example, TransUNet13 adopts a hybrid CNN-Transformer encoder, TransFuse30 fuses Transformer and CNN branches in parallel, Polyp-PVT31 introduces a pyramid vision transformer backbone, MedT32 incorporates additional control mechanisms to enhance self-attention modules, and TransAttUnet33 combines multi-level guided attention with multi-scale skip connections. However, these models still require improvements in inference speed and local feature extraction.
Mamba-based medical image segmentation models
Mamba, characterized by its linear computational complexity and ability to handle longer sequences, presents a challenge to Transformers in the field of medical image segmentation14. The VSS15 module demonstrates potential in addressing complex structures and patterns within medical images. Consequently, Mamba-based models have emerged as a viable alternative alongside CNN and ViT-based approaches. The VSS module performs “four-directional” 2D selective scans (top-left to bottom-right, top-right to bottom-left, bottom-left to top-right, and bottom-right to top-left) on the input image. It serializes the scanned results and applies the state space model for selective scanning. Subsequently, it restores and merges the results to obtain a comprehensive global view of the image. At this stage, exemplified by VM-UNet17, Mamba-based segmentation models are still in their early stages of development, but their inherent advantages are already apparent.
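To make this scanning scheme concrete, the sketch below serializes a feature map into four directional sequences and merges the results back, following the cross-scan and cross-merge pattern described above. It is a simplified PyTorch illustration: the selective state space model that the real SS2D module applies to each sequence is omitted here.

```python
import torch

def cross_scan(x):
    """Serialize a (B, C, H, W) feature map into four 1D sequences,
    one per scanning direction (row-major, column-major, and their reverses)."""
    B, C, H, W = x.shape
    row_major = x.flatten(2)                      # scan row by row
    col_major = x.transpose(2, 3).flatten(2)      # scan column by column
    return torch.stack([row_major, col_major,
                        row_major.flip(-1), col_major.flip(-1)], dim=1)  # (B, 4, C, H*W)

def cross_merge(seqs, H, W):
    """Invert the four scans and sum them back into a (B, C, H, W) map."""
    B, _, C, _ = seqs.shape
    rows = (seqs[:, 0] + seqs[:, 2].flip(-1)).view(B, C, H, W)
    cols = (seqs[:, 1] + seqs[:, 3].flip(-1)).view(B, C, W, H).transpose(2, 3)
    return rows + cols

# In the actual VSS block, each of the four sequences would first pass through
# the selective scan (state space model) before cross_merge recombines them.
feat = torch.randn(1, 8, 16, 16)
merged = cross_merge(cross_scan(feat), 16, 16)    # same spatial shape as feat
```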
To leverage the potential of Mamba in medical image segmentation and address CNN’s limitations in capturing global features, we propose the CFM-UNet architecture. CFM-UNet adopts a parallel network branching approach for feature fusion, ensuring consistency and richness in 2D medical image segmentation. By harnessing the complementary strengths of both networks, CFM-UNet aims to further enhance segmentation accuracy.
Methods
In this section, we first introduce the structure of CFM-UNet, depicted in Fig. 2. We then elaborate on the encoder and decoder components, along with the employed loss function in the training process.
Encoder
In the encoder, we designed a parallel framework comprising Local and Global feature branches, leveraging the strengths of the Bottle2neck21 and VSS15 network structures, as shown in the blue and green components in Fig. 2. To exploit the adaptive feature weighting capability of the SE network20, which introduces channel attention with minimal computational overhead, we developed the SEF block for feature fusion, as shown in the yellow component in Fig. 2. This module integrates features from both frameworks and feeds them back into the networks. Its feedback mechanism is also integrated into the decoder, overcoming many of the shortcomings of other fusion schemes (such as additive or multiplicative fusion) in handling feature importance and computational efficiency, and significantly enhancing adaptive feature fusion.
Local branch
In the Local branch, the Bottle2neck module, built on the residual unit structure, serves as the primary method for local feature extraction. As shown in the blue module in Fig. 3, Bottle2neck divides the input into two independent information streams. One stream undergoes a \(1\times 1\) convolution and is then partitioned into several groups along the channel dimension. Each group \(X_i\) is combined with the output of the preceding group's \(3\times 3\) convolution and then passes through its own \(3\times 3\) convolution. This hierarchical process produces outputs with a range of receptive field sizes, thereby enhancing the capability to extract local features. Finally, all outputs are concatenated, merged through a \(1\times 1\) convolution layer, and added to the output of the other information stream.
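The sketch below illustrates such a block in PyTorch, following the Res2Net-style formulation; the scale of 4, the placement of normalization, and the absence of channel expansion are simplifying assumptions for illustration rather than the exact configuration used in CFM-UNet.

```python
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Minimal Res2Net-style Bottle2neck sketch for the Local branch."""
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        width = channels // scale
        self.scale = scale
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        # one 3x3 convolution per group except the first, which is passed through
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width))
            for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                 # second information stream
        out = self.relu(self.bn1(self.conv1(x)))
        xs = torch.chunk(out, self.scale, dim=1)     # split along channels
        ys = [xs[0]]                                 # first group is kept as-is
        for i, conv in enumerate(self.convs, start=1):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]   # add previous conv output
            ys.append(self.relu(conv(inp)))
        out = torch.cat(ys, dim=1)                   # merge multi-receptive-field outputs
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)             # add back the residual stream
```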
Global branch
In the Global branch, the VSS block, derived from VMamba, serves as the primary module for global feature extraction. As shown in the green module in Fig. 3, the input to the VSS block first passes through a linear embedding layer and is then split into two separate information streams. One stream passes through a \(3\times 3\) depthwise convolutional layer followed by SiLU activation and enters the 2D-Selective-Scan (SS2D) module. The normalized output of SS2D is then combined with the output of the other information stream, which also undergoes SiLU activation. The merged features finally pass through patch merging, which increases the channel number, reduces the resolution, and applies normalization to produce the block's output.
SEF block for feature fusion
Although the sampling processes of these two feature extraction frameworks do not interfere with each other, they are not entirely independent. The SEF block integrates the downsampling outcomes from these two frameworks and feeds them back into their respective networks, thereby enhancing the model’s capability to extract global and local features simultaneously. In Fig. 2, the blue and green arrows represent information from the two frameworks entering the SEF block, while the yellow arrow indicates the result of feature fusion being fed back into these two networks.
The input image \(x \in {\mathbb {R}}^{H \times W \times C}\) is split into two information flows. One flows into the Local framework, where it undergoes a \(3\times 3\) deep convolutional layer followed by ReLU activation and max pooling, extracting shallow features to produce \(x_1 \in {\mathbb {R}}^{\frac{H}{4} \times \frac{W}{4} \times C'}\) (typically \(C' = 64\)). The other flow enters the Global framework, passing through a patch merging layer to obtain shallow features \(x_1' \in {\mathbb {R}}^{\frac{H}{4} \times \frac{W}{4} \times C'}\).
The results \(x_1\) and \(x_1'\) from these different feature extraction paths are concatenated and then fed through a \(3\times 3\) convolutional layer to restore the number of channels before entering the SEF block.
After batch normalization, the SE module evaluates the importance of each channel using global average pooling and two fully connected layers followed by a sigmoid activation, dynamically adjusting the weight of each channel to enhance important features and suppress irrelevant ones. In the standard SE form, this can be written as:
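\[
x_1'' = x_c \cdot \sigma\!\left(W_2\,\delta\!\left(W_1\,\mathrm{GAP}\!\left(\mathrm{BN}(x_c)\right)\right)\right)
\]
where \(x_c\) denotes the concatenated features after the \(3\times 3\) convolution, \(\mathrm{GAP}\) is global average pooling, and \(\delta\) is the ReLU activation between the two fully connected layers (this notation is introduced here for illustration).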
Here, BN denotes batch normalization, \(W_1\) and \(W_2\) are the weights of the first and second fully connected layers respectively, and \(\sigma\) represents the Sigmoid activation function. Finally, the fused feature image \(x_1''\) is obtained. By adding the tensors \(x_1\) and \(x_1'\) to \(x_1''\), the subsequent deep feature extraction operations are continued.
The two frameworks each use their own feature extraction blocks, continuously increasing the number of channels according to \([2C', 4C', 8C', 16C']\) while reducing the spatial dimensions. At each stage, the obtained \(x_i\) and \(x_i'\) enter the SEF block for feature fusion to obtain \(x_i''\), which is then fed back to the two frameworks to continue downsampling. This multi-stage deep feature extraction and fusion allows global and local features to be maximally preserved and complemented in both frameworks.
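The sketch below shows this fusion-and-feedback step in PyTorch; the reduction ratio of 16 and the exact layer ordering are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEFBlock(nn.Module):
    """SE-based feature fusion between the Local and Global branches (sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # concatenated branch features -> restore the channel count
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        # squeeze-and-excitation: global average pooling + two fully connected layers
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x_local, x_global):
        fused = self.fuse(torch.cat([x_local, x_global], dim=1))
        fused = fused * self.se(fused)            # channel-wise reweighting
        # feed the fused features back into both branches and to the decoder skip path
        return x_local + fused, x_global + fused, fused
```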
Decoder
The CFM-UNet decoder employs \(2\times 2\) deconvolutions to resize the image iteratively, as indicated by the gray arrow in Fig. 2. Each deconvolution output is integrated with \(x_i''\) via skip connections, combining deep and shallow features to preserve fine details. Dual \(3\times 3\) convolutions with ReLU activation then merge the features, progressively reducing the channel number in the sequence \([8C', 4C', 2C', C']\). Finally, a \(1\times 1\) convolution reduces the channel count to 1, and interpolation restores the original image dimensions to produce the final segmentation output.
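One such decoder stage can be sketched as follows in PyTorch; it assumes the fused skip feature \(x_i''\) has the same channel count as the upsampled feature, which is an illustrative simplification.

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder stage: 2x2 transposed convolution, skip connection with the
    fused encoder feature, then two 3x3 convolutions with ReLU (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                               # upsample deep features
        return self.conv(torch.cat([x, skip], dim=1))  # merge with skip connection
```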
Loss function
To improve segmentation accuracy with respect to the ground truth, our loss function combines Dice loss and binary cross-entropy (BCE) with weighting factors. In its standard weighted form, the loss is given by:
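\[
\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{Dice}} + \lambda_2\,\mathcal{L}_{\mathrm{BCE}}
\]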
where:
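\[
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[y_i \log \hat{y}_i + (1 - y_i)\log\bigl(1 - \hat{y}_i\bigr)\Bigr]
\]
(both terms are written here in their standard forms, with \(N\) the number of pixels),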
and \(\lambda _1\) and \(\lambda _2\) are the weights of the loss functions, with default values of 0.5 each. \(|X|\) and \(|Y|\) denote the sizes of the ground truth and prediction sets respectively, and \(y_i\) and \({\hat{y}}_i\) represent the true labels and predictions. This balanced strategy ensures a trade-off between maximizing overlap and enhancing classification accuracy, providing a robust starting point adaptable to various segmentation tasks. Such a configuration often yields satisfactory initial results and can be fine-tuned further based on task-specific performance needs.
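For reference, a minimal PyTorch sketch of this combined loss; the smoothing constant and the use of logits as input are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceBCELoss(nn.Module):
    """Weighted Dice + BCE loss (sketch)."""
    def __init__(self, lambda1=0.5, lambda2=0.5, smooth=1e-6):
        super().__init__()
        self.lambda1, self.lambda2, self.smooth = lambda1, lambda2, smooth

    def forward(self, logits, target):
        prob = torch.sigmoid(logits)                       # (B, 1, H, W) predictions
        inter = (prob * target).sum(dim=(1, 2, 3))
        dice = 1 - (2 * inter + self.smooth) / (
            prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + self.smooth)
        bce = F.binary_cross_entropy_with_logits(logits, target, reduction='mean')
        return self.lambda1 * dice.mean() + self.lambda2 * bce
```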
Experiment and results
To validate our model’s segmentation performance on medical images, we selected datasets covering organs, tumors, bones, and tissues.
Data source
For human organs and tumor lesions, we used the LiTS22 dataset, which consists of 131 annotated CT scans for liver and liver tumor segmentation, presenting challenges such as class imbalance, varying tumor shapes, and scan variability. For human skeleton segmentation, we utilized the SPIDER23 dataset, which includes 3D MRI and CT scans annotated for vertebrae, intervertebral discs, and the spinal canal, designed for spinal anatomy segmentation but affected by variability in anatomy and scan quality. For human tissues, we chose the Kvasir-SEG24 colon polyp dataset, annotated for semantic segmentation of anatomical structures like polyps, mucosa, and lumen, with challenges such as image noise and class imbalance. To assess the model’s generalization ability, we conducted transfer testing on the ATLAS25 liver dataset, which features significant variability in tumor shape, size, and contrast, complicating accurate segmentation.
The relatively small dataset scale we selected simulates realistic constraints, including the high cost of acquiring medical image data, ethical restrictions involving patient privacy, and the difficulty of obtaining high-quality annotations. Meanwhile, we selected non-consecutive slices from each patient when processing the CT datasets to improve data processing efficiency, reduce redundancy caused by similar consecutive slices, and focus on key regions. We therefore randomly selected 1030 liver slices and 670 liver tumor slices from LiTS, 1000 complete spine slices from SPIDER, and all 1000 colon polyp endoscopic images from Kvasir-SEG. To further validate the model's generalization ability, we also selected 800 liver slices from ATLAS for model transfer testing. Each dataset was randomly divided into training and test sets at an 8:2 ratio. We evaluate model performance using the metrics recommended for each dataset, including the Dice similarity coefficient (DSC), intersection-over-union (IoU), true positive rate (TPR), volumetric overlap error (VOE), relative volumetric difference (RVD), and average symmetric surface distance (ASSD). For ease of comparison, we report absolute values.
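As a point of reference, the region-overlap metrics for a pair of binary masks can be computed as sketched below; ASSD, which requires surface extraction, is omitted, and RVD is returned as an absolute value in line with the reporting above.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Region-overlap metrics for binary masks: DSC, IoU, TPR, VOE, |RVD| (sketch)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dsc = 2 * inter / (pred.sum() + gt.sum() + 1e-8)
    iou = inter / (union + 1e-8)
    tpr = inter / (gt.sum() + 1e-8)                           # sensitivity / recall
    voe = 1 - iou
    rvd = abs(pred.sum() - gt.sum()) / (gt.sum() + 1e-8)      # reported as absolute value
    return dict(DSC=dsc, IoU=iou, TPR=tpr, VOE=voe, RVD=rvd)
```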
Experimental environment
The experiments in this paper utilize the Mamba and PyTorch frameworks. The GPU employed is an RTX 4090, and Python version 3.10 is used. The model training epoch is set to 100, with a batch size of 4. The optimizer selected for training is Adam.
Experimental comparison
We compare our proposed CFM-UNet with 12 advanced 2D segmentation models, which can be roughly categorized into four groups: (1) general-purpose segmentation models6,9,27,28: U-Net (Ronneberger et al., 2015), ATT-UNet (Oktay et al., 2018), Res-UNet (Xiao et al., 2018), and U-Net++ (Zhou et al., 2018); (2) CNN-based medical image segmentation models8,34: U-ResNet (Estienne et al., 2019) and M2SNet (Zhao et al., 2023); (3) Transformer-based medical image segmentation models13,35: TransUNet (Chen et al., 2021) and META-Unet (Wu et al., 2023); (4) Mamba-based medical image segmentation models16,36,37: Swin-UMamba (Liu et al., 2024), VM-UNet-V2 (Zhang et al., 2024), and UltraLight VM-UNet (Wu et al., 2024). For a fair comparison, none of the models used pre-trained parameters. Each model trained on the training set was tested five times on the test set, and the average value of each metric was calculated.
We selected two segmentation results from each dataset to represent the segmentation effect for visual analysis. The overall segmentation effect graph is depicted in Fig. 4. Subsequently, we will proceed with a detailed analysis of the model performance.
Liver and liver tumor segmentation
Liver segmentation
In sample a, aside from CFM-UNet, which accurately learns liver segmentation, the other models misclassify other organs as liver, as shown in Fig. 4. This highlights CFM-UNet’s effective utilization of the Mamba network, demonstrating a superior understanding of liver anatomy. In sample b, concerning detail handling, CFM-UNet and U-ResNet effectively utilize CNN’s capacity to capture local details, demonstrating precise localization in the lower right corner of the liver. In contrast, other models exhibit errors in this area.
Liver tumor segmentation
As demonstrated in Fig. 4, CFM-UNet, UNet++, U-ResNet, META-UNet, and UltraLight-VM-UNet, which employ more complex network structures, accurately segment liver tumors even in relatively small areas. In contrast, all other models exhibit noticeable noise, underscoring the effectiveness of deeper network architectures in comprehensively capturing liver tumor characteristics to a significant extent.
We preprocessed the LiTS dataset by slicing it into \(512 \times 512\) 2D images. Table 1 presents the quantization results of each model for segmenting this dataset.
The U-shape framework demonstrates outstanding performance in segmenting specific organs, underscoring the importance of strong local detail control for improved accuracy. Our model uses the Local Branch to maximize local detail extraction, while the Global Branch leverages the feature fusion module to enhance learning of liver organ features. Given that liver tumors are typically small and scattered, demanding strong local detail extraction, our approach harnesses the advantages of residual structures both in the convolutions of the Local Branch and in the feature fusion module. This strategy significantly enhances the model's capacity to learn intricate local features of both the liver and liver tumors.
In comparison with other models, ours demonstrates robust segmentation capability across all three evaluation metrics, notably excelling in the core DICE metric. Although our model does not lead in every aspect of TPR and ASSD, the gap with the top model is minimal. Overall, therefore, our model exhibits excellent performance and remains highly competitive in segmenting both liver organs and liver tumors.
Spine segmentation
All models successfully segment the entire spine in Fig. 4. In sample g, CFM-UNet and M2SNet exhibit relatively little noise, while in sample h, CFM-UNet, META-UNet, and VM-UNet-V2 also show reduced noise. This suggests that CFM-UNet effectively leverages Mamba's characteristics to comprehensively understand each component of the spine, while also utilizing CNN capabilities to capture its intricate details.
We preserved the entire spine structure in the SPIDER dataset and preprocessed it by slicing into \(512 \times 512\) 2D images. Table 2 presents the quantization results of each model for segmenting this dataset.
Retaining all spine components in the SPIDER dataset means that vertebrae, intervertebral discs, and the spinal canal may not appear simultaneously in a single slice, requiring the model to comprehend and learn the entire spine comprehensively. In our comparative experiments, our model achieved first place in four performance metrics, showcasing superior segmentation accuracy and robust feature learning capabilities.
Colon polyps segmentation
As demonstrated in Fig. 4, CFM-UNet, U-ResNet, M2SNet, META-UNet, VM-UNet-V2, and Swin-UMamba models achieve correct segmentation in sample e, while other models display noticeable segmentation errors. This highlights the specialized effectiveness of medical image segmentation methods over generalized approaches. The Mamba-based segmentation methods demonstrate superior capability in understanding the overall structure of the image. In sample f, only CFM-UNet correctly segments the colon polyps, showcasing its proficiency in leveraging the advantages of the Mamba network for accurate segmentation.
The Kvasir-SEG dataset comprises 2D images with varying pixel sizes, which we uniformly resize to \(512 \times 512\) dimensions as part of our preprocessing. Table 3 displays the quantization outcomes for each model in segmenting this dataset.
The segmentation of polyps in the Kvasir-SEG dataset is challenging due to their varying sizes and the unpredictable distribution caused by viewing angles. This variability imposes stringent demands on the model's feature learning capability. A single network structure may provide incomplete polyp feature extraction, which motivates our model's integration of two network structures. This gives our model an inherent advantage, leading to superior segmentation accuracy across all metrics compared with the other models.
Significance analysis
To further demonstrate the superiority of our approach on the aforementioned metrics, we employed the t-test to conduct significance testing between the top five models for each segmentation task and CFM-UNet across the five repeated experiments on the test set. We visualized the calculated p-values using a heatmap to intuitively assess the significance of the differences between CFM-UNet and the other models on each metric.
If the p-value calculated using the t-test is less than 0.05, the null hypothesis can be rejected, indicating a significant difference between the two models on that metric. Specifically, as shown in Fig. 5, the lighter the blue of a block, the more pronounced the difference between CFM-UNet and the corresponding model on that metric. From Fig. 5, we observe that for the metrics in which CFM-UNet shows a clear advantage in the "Experimental comparison" section, the blocks are light in color, with p-values much smaller than 0.05, indicating significant superiority. Although CFM-UNet does not rank first on the TPR metric for the spine segmentation task, the corresponding block is darker, suggesting that the difference with the leading models is not significant. Moreover, since the ASSD values of the leading models in the spine segmentation task do not differ significantly and the corresponding blocks are darker, the boundary smoothness of spine segmentation appears similar across these models.
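For reference, such a test between two models' five repeated scores can be computed with SciPy; the values below are illustrative only (not taken from the paper), and an unpaired two-sample t-test is assumed.

```python
from scipy import stats

# DSC scores from five repeated test runs (illustrative values, not from the paper)
cfm_unet = [0.962, 0.960, 0.963, 0.961, 0.962]
baseline = [0.951, 0.953, 0.950, 0.952, 0.949]

t_stat, p_value = stats.ttest_ind(cfm_unet, baseline)
print(f"p = {p_value:.4f}")   # p < 0.05 -> significant difference on this metric
```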
Ablation experiments
Number of parameters
Our model consists of two branches: one based on CNN and the other on Mamba. Determining the key hyperparameter, namely the number of modules used at each downsampling stage in the two branches, is therefore crucial. For the Bottle2neck blocks, we drew on the module counts of three classic deep feature extraction networks, including Res2Net21, and tried various combinations. Initially, we used the same module counts as some common configurations, including the [2,2,2,2] structure from VM-UNet-V236.
We tested these parameter combinations on the LiTS liver dataset and evaluated segmentation performance using the DICE metric. The quantitative results are presented in Table 4. Accordingly, we selected Bottle2neck blocks configured as [3, 4, 6, 3] for the Local Branch and VSS blocks configured as [3, 3, 3, 3] for the Global Branch as the downsampling modules of our model.
Structure ablation
To assess the effectiveness of our parallel network structure, we conducted experiments by individually removing the Local Branch and Global Branch. Additionally, to evaluate the effectiveness of our SEF block, we conducted experiments where we removed the SEF block and instead employed additive fusion and multiplicative fusion methods.
All experiments were conducted on the LiTS liver dataset, and detailed metrics can be found in Table 5. The experiments demonstrate that removing any module from our model reduces accuracy. Furthermore, our SEF block, with its feedback mechanism, effectively integrates the strengths of the global and local networks, leading to comprehensive improvements in segmentation accuracy. In addition, selecting other CNN-based modules (dual convolutions and Bottleneck blocks) as the core component of the Local Branch in CFM-UNet fails to surpass the Bottle2neck blocks in fully leveraging CNN's capability for local feature extraction.
Generalization ability
To verify the generalization ability of the model, we conducted generalization experiments through data augmentation and data migration without changing the training parameters of the model on the LiTS liver dataset.
Image rotation
The test image was rotated by \(0^\circ\), \(60^\circ\), \(120^\circ\), \(180^\circ\), \(240^\circ\), and \(300^\circ\), then resized to a pixel size of \(512 \times 512\) before reassessing segmentation accuracy.
Pixel resizing
The test image was resized to pixel dimensions of \(256 \times 256\), \(512 \times 512\), and \(1024 \times 1024\), followed by a reassessment of segmentation accuracy.
As shown in Table 6, the two key segmentation performance metrics, DICE and TPR, do not exhibit significant fluctuations overall. This indicates that data augmentation through image rotation and pixel adjustment has a minimal impact on the overall accuracy of the CFM-UNet.
Data migration
For the model migration test, we utilized the open-source liver dataset ATLAS. From this dataset, 800 images were randomly selected and uniformly resized to \(512\times 512\) pixels, and the segmentation accuracy was then re-evaluated.
The visual representation in Fig. 6 shows that the DICE coefficients of CFM-UNet on ATLAS mostly fluctuate between 0.8 and 0.9, demonstrating robust and favorable quantitative performance.
As presented in Table 7, the DICE and TPR metrics remained at approximately 84%. In conclusion, CFM-UNet demonstrates robust generalization capability in both the data augmentation and data migration experiments.
Conclusion and future work
We introduced the CFM-UNet model, which integrates two network architectures, CNN and Mamba, for image segmentation. In the Local Branch, we employed the CNN-based Bottle2neck module to extract local features, while in the Global Branch, the Mamba-based VSS module was utilized to capture global features. The SEF block, designed for feature fusion and feedback, enables leveraging the complementary strengths of both networks. Experimental results demonstrate that CFM-UNet achieves excellent performance on the LiTS, SPIDER, and Kvasir-SEG medical image datasets, and exhibits strong generalization capabilities on the ATLAS dataset.
In future experiments, we plan to further optimize CFM-UNet and explore its application in semi-supervised medical image tasks. This approach aims to effectively utilize unlabeled medical image data, thereby enhancing the model’s generalization ability and practical applicability.
Data availability
The datasets used in the study are publicly available at the following links: LiTS - https://competitions.codalab.org/competitions/17094. ATLAS - https://atlas-challenge.u-bourgogne.fr/dataset. Kvasir-SEG - https://datasets.simula.no/downloads/kvasir-seg.zip. SPIDER - https://spider.grand-challenge.org/data/.
References
Qureshi, I. et al. Medical image segmentation using deep semantic-based methods: A review of techniques, applications and emerging trends. Inf. Fusion 90, 316–352. https://doi.org/10.1016/j.inffus.2022.09.031 (2023).
Ramesh, K., Kumar, G. K., Swapna, K., Datta, D. & Rajest, S. S. A review of medical image segmentation algorithms. EAI Endors. Trans. Pervasive Health Technol. 7, e6–e6 (2021).
He, F. et al. Brain tumor image segmentation network based on dual attention mechanism. In International Conference on Intelligent Computing, 125–136 (Springer, 2023).
He, X., Qiao, Z., Huang, Y. & Hao, Q. Dilated-residual u-net for optical coherence tomography noise reduction and resolution improvement. In Optics in Health Care and Biomedical Optics XIII, Vol. 12770, 246–254 (SPIE, 2023).
Yuan, Y. et al. Dsca-pspnet: Dynamic spatial-channel attention pyramid scene parsing network for sugarcane field segmentation in satellite imagery. Front. Plant Sci. 14, 1324491 (2024).
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11 (Springer, 2018).
Zhao, X. et al. Prior attention network for multi-lesion segmentation in medical images. IEEE Trans. Med. Imaging 41, 3812–3823 (2022).
Zhao, X. et al. M2snet: Multi-scale in multi-scale subtraction network for medical image segmentation. arXiv preprint arXiv:2303.10894 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, 234–241 (Springer, 2015).
Guo, C. et al. Sa-unet: Spatial attention u-net for retinal vessel segmentation. In 2020 25th International Conference on Pattern Recognition (ICPR), 1236–1242 (IEEE, 2021).
Wang, Z., Zhang, H., Huang, Z., Lin, Z. & Wu, H. Multi-scale dense and attention mechanism for image semantic segmentation based on improved deeplabv3+. J. Electron. Imaging 31, 053006–053006 (2022).
Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision, 205–218 (Springer, 2022).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).
Liu, Y. et al. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166 (2024).
Liu, J. et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. arXiv preprint arXiv:2402.03302 (2024).
Ruan, J. & Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491 (2024).
Ma, J., Li, F. & Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024).
Peng, Z. et al. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 367–376 (2021).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
Gao, S.-H. et al. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 43, 652–662 (2019).
Bilic, P. et al. The liver tumor segmentation benchmark (lits). Med. Image Anal. 84, 102680 (2023).
van der Graaf, J. W. et al. Lumbar spine segmentation in MR images: A dataset and a public benchmark. Sci. Data 11, 264 (2024).
Jha, D. et al. Kvasir-seg: A segmented polyp dataset. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26, 451–462 (Springer, 2020).
Quinton, F. et al. A tumour and liver automatic segmentation (atlas) dataset on contrast-enhanced magnetic resonance imaging for hepatocellular carcinoma. Data 8, 79 (2023).
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440 (2015).
Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
Diakogiannis, F. I., Waldner, F., Caccetta, P. & Wu, C. Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote. Sens. 162, 94–114 (2020).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Zhang, Y., Liu, H. & Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, 14–24 (Springer, 2021).
Dong, B. et al. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932 (2021).
Qi, Q., Lin, L., Zhang, R. & Xue, C. Medt: Using multimodal encoding-decoding network as in transformer for multimodal sentiment analysis. IEEE Access 10, 28750–28759 (2022).
Chen, B., Liu, Y., Zhang, Z., Lu, G. & Kong, A. W. K. Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 8(1), 55–68. https://doi.org/10.1109/TETCI.2023.3309626 (2023).
Estienne, T. et al. U-resnet: Ultimate coupling of registration and segmentation with deep nets. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, 310–319 (Springer, 2019).
Wu, H., Zhao, Z. & Wang, Z. Meta-unet: Multi-scale efficient transformer attention unet for fast and high-accuracy polyp segmentation. IEEE Trans. Autom. Sci. Eng. 21(3), 4117–4128. https://doi.org/10.1109/TASE.2023.3292373 (2023).
Zhang, M., Yu, Y., Gu, L., Lin, T. & Tao, X. Vm-unet-v2 rethinking vision mamba unet for medical image segmentation. arXiv preprint arXiv:2403.09157 (2024).
Wu, R., Liu, Y., Liang, P. & Chang, Q. Ultralight vm-unet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation. arXiv preprint arXiv:2403.20035 (2024).
Acknowledgements
Financial support for this project was provided by the Promoting the Classification and Development of Colleges and Universities-Student Innovation and Entrepreneurship Training Programme Project-School of Computer (5112410852).
Author information
Contributions
K.N.: Investigation, Conceptualization, validation, Resources, supervision. J.H.: Methodology, Software, Writing – original draft. J.C.: Formal analysis, Writing – original draft.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Niu, K., Han, J. & Cai, J. CFM-UNet: coupling local and global feature extraction networks for medical image segmentation. Sci Rep 15, 22236 (2025). https://doi.org/10.1038/s41598-025-92010-y