Introduction

Accurate segmentation of medical images is one of the critical technologies to assist doctors in diagnosing and developing personalized treatment plans1. In medical image analysis, the segmentation task is crucial. It is used to identify and extract essential regions in the image and provides in-depth image feature information to help doctors make more accurate diagnostic judgments and treatment decisions. Accurate medical image segmentation is a fundamental part of disease detection, diagnosis, and treatment planning and one of the core challenges in medical imaging applications2,3. High-quality segmentation results can provide an essential basis for disease assessment, surgical planning, and image-guided surgical operations. However, traditional manual segmentation is labour-intensive and time-consuming, which makes automated segmentation methods particularly important. Fast and accurate automated segmentation can significantly reduce costs and improve the efficiency and accuracy of diagnosis and treatment4.

In image processing tasks, U-net5 is widely used as a classical encoder-decoder network architecture. The encoder generates low-resolution features, and the decoder up-samples these features and gradually recovers the spatial resolution. The skip connections in U-net help to recover, in the decoding stage, the spatial information lost in the pooling layers, so that spatial details are well preserved. However, the skip connections in U-net are direct and only partially exploit multilayer feature fusion. To overcome this limitation, UNet++6 introduces a nested U-net architecture that achieves deeper feature fusion through multi-level skip connections, aiming to improve segmentation accuracy and the retention of fine-grained information by comprehensively utilizing multi-scale features. The combination of attention mechanisms with U-net has also attracted wide interest. For example, Attention U-Net7 uses an Attention Gate to suppress irrelevant regions of the image and highlight the salient features of local regions. However, adding an attention mechanism increases the number of parameters and significantly raises the computational cost, so research has turned to accomplishing the task efficiently. For example, the lightweight network EGE-UNet8 maintains segmentation quality while reducing the computational cost.

In summary, the methods above aim to compensate for the loss of spatial information caused by encoder downsampling in U-shaped networks and to improve the focus on localized regions. Although these methods demonstrate strong performance in medical image segmentation tasks, they still have room for improvement in efficiently extracting multi-scale image features globally. Specifically, U-net and UNet++ use a linear connectivity mechanism, while Attention U-Net introduces a nonlinear attention-gating mechanism. These methods mitigate the semantic gap between high- and low-level features to some extent, but they remain limited in capturing multi-scale spatial information. This raises the question of how to efficiently extract the spatial and channel information of high-level and low-level features in the encoder and decoder, in order to better bridge the semantic gap between the encoder and the decoder and to compensate for the loss of information due to downsampling. To this end, we revisit the skip-connection mechanism of U-shaped networks and propose a novel deep-learning network that combines the MSL, AC, and Conv-Attention modules9. The new design aims to further enhance the feature extraction capability of skip connections so as to capture multi-scale and multi-morphology spatial and channel information more efficiently, thus providing a more accurate and efficient solution for medical image segmentation tasks. Our contributions are summarized as follows:

  • In order to improve the segmentation accuracy of the model in complex structured images, the AC module designed in this paper dynamically adjusts the parameters of the convolution kernel in a data-driven way through the combination of the adaptive convolutional layer, batch normalization layer and ReLU activation layer. The adaptive feature of the AC module enhances the flexibility and adaptability of the features so that it can capture the subtle changes in the image while maintaining the original structural information.

  • To balance context awareness and detail capture, the Conv-Attention module reduces the computational cost of the attention mechanism by learning by proxy from the downsampling results of the last layer. It retains the essential contextual information and accurately recognizes critical regions in the image.

  • To better handle the variation in tissue structure and size in medical images and to improve segmentation accuracy, this paper introduces the MSL module. The MSL module adopts a step-by-step up-sampling strategy and captures both fine-grained features and high-level semantic information from the encoding stage, thus modelling complex structures effectively. The MPE layer in the MSL module up-samples the feature maps layer by layer, while the MSV layer further fuses the multi-scale features to realize a joint expression of multi-level information.

The improved U-net model proposed in this paper utilizes the global context capture capability of the Conv-Attention module, the multi-scale feature enhancement capability of the MSL module, and the dynamic adaptive capability of the AC module so that the model exhibits higher robustness and accuracy in medical image segmentation tasks. The experimental results show that compared with the traditional U-net and other segmentation models, the network proposed in this paper provides an effective segmentation scheme for medical image processing with significant improvement in segmentation accuracy, edge detail capture, and a significant reduction in computation.

Related work

CNN for medical image segmentation

U-net adopts a convolution-based encoder–decoder design: the encoder extracts features by gradually compressing the spatial dimensions of the input image, and the decoder then recovers the spatial information step by step to generate a high-resolution output. The core advantage of U-net lies in the efficient extraction of cross-layer features: skip connections directly transfer features between the corresponding layers of the encoder and decoder, fusing information from different levels. This design is particularly suitable for medical image segmentation tasks, as it can capture subtle information in the image while maintaining high accuracy10,11. Due to its powerful segmentation performance, U-net has been widely used as an indispensable tool in the field of medical image processing and has achieved remarkable success in numerous 2D semantic segmentation tasks12,13,14.

As research advances, improved methods based on U-net continue to emerge. For example, UNet++ effectively integrates the four layers of features of U-net by introducing a multi-level feature connection mechanism, enabling the network to autonomously learn the weights of features of different depths and adapt to complex tasks more flexibly. In addition, some models extend U-net to process 3D image data, such as CRANet15, 3D U-Net16 and V-net17. These models provide substantial advantages in processing 3D medical images (e.g., CT and MRI data) by encoding and decoding the volumetric data in 3D space.

The application of U-net is also gradually expanding to combinations with other network architectures to achieve more efficient performance. For example, Unext18 combines U-net and a multilayer perceptron (MLP) to reduce the number of parameters and increase the computational speed while maintaining high performance; WRANet19 uses the discrete wavelet transform to replace the standard downsampling module of the CNN in U-net, which efficiently removes high-frequency noise and enhances the network’s noise immunity; DualNet20 further improves image segmentation performance by using two U-net models. These U-net-derived networks perform well in segmentation tasks on multiple datasets, such as ISIC201721 and CVC-ClinicDB22. In addition, novel networks combining U-net with other architectures are emerging. For example, TransUNet combines U-net with Transformer23 to balance global feature modelling and local feature capture; SwinUnet adopts a pure Transformer design inspired by the U-net structure to further enhance segmentation performance; Cascaded U-Net24 cascades multiple U-net models to enhance the capture of multi-scale features; and Uctransnet25 uses a channel cross-attention mechanism to achieve effective fusion of multi-scale encoder features. These improved methods achieve significant segmentation results on datasets such as Glas26 and MoNuSeg27, demonstrating their strong adaptability and performance advantages in different scenarios.

Transformer for medical image segmentation

In deep learning image segmentation and recognition tasks, the attention mechanism has been widely used as an effective means to enhance the performance of models28,29,30. The core idea of the attention mechanism is to give the model the ability to focus on key regions so that it can automatically ignore irrelevant parts when processing complex data, thus enhancing the efficiency and accuracy of feature extraction. Introducing the attention mechanism into the model structure can effectively improve the recognition of detailed regions in the image and optimize the feature expression, which is especially important in medical image processing that requires high-precision segmentation.

With the application of the attention mechanism, many new attention-based models have been proposed. For example, Attention U-Net introduces the Attention Gate module into the traditional U-net architecture, which enables the network to focus on the target region more effectively, thus significantly improving the segmentation accuracy and robustness. In addition, Vision Transformer (ViT)31 introduces the Transformer architecture to image classification tasks by applying the global self-attention mechanism to the whole image. It achieves excellent classification results on large datasets such as ImageNet. ViT’s global feature capture capability has enabled it to perform well in natural image processing, but its application is constrained by computational resources and the size of the training data.

In the field of medical images, to cope with the problems of data scarcity and heterogeneity, Valanarasu et al. proposed the gated axial attention model MedT32. This model introduces a specific gating mechanism that better captures feature information in both vertical and horizontal axes to optimize the segmentation accuracy of medical images further. SeTformer33 significantly improves the computational efficiency while enhancing the performance of the model by replacing the traditional dot-product self-attention (DPSA) with self-optimizing transmission (SeT). At the same time, MASFormer34 is an innovative variant based on Transformer, which introduces a hybrid attention-spanning mechanism to handle both long-range and short-range dependencies more efficiently, thus further optimizing the feature extraction capability of the model. However, while these approaches enhance the feature extraction capability of the model by adding an attention-spanning mechanism, they do not directly focus on optimizing the U-net structure and tend to increase the complexity and computational cost of the network. In particular, little attention is paid to the feature maps at different scales, often resulting in poor results and bringing structural redundancy and computational complexity.

Methods

Overall architecture of MSCA-UNet

The overall architecture of the proposed MSCA-UNet is shown in Fig. 1. It adopts a U-shaped encoder–decoder architecture in which adaptive convolution replaces the traditional convolution of the U-net model, which allows for better integration of local contextual information while reducing the model parameters to make training more efficient. The encoder has four stages, each with our proposed AC module. Specifically, the AC module passes the input \(X\in R^{ H\times W\times 3}\) through an adaptive convolutional layer for feature extraction, then through a batch normalization layer, and finally through a ReLU activation. As in the traditional U-net, max-pooling then downsizes the feature map, again reducing the number of parameters while extracting significant features. The input enters the decoder after four stages of feature extraction by the encoder. The decoder is divided into three stages. Unlike U-net, we propose an MSL module to upsample the feature map and capture the fine-grained multi-scale information and high-level semantic information from the encoding stage: the MPE layer upsamples the features, and the MSV layer further fuses the combined features. To better extract meaningful features, we propose a convolutional attention mechanism to weight the feature channels, which has fewer parameters than the self-attention mechanism.

Fig. 1

Our proposed MSCA-UNet is divided into three parts overall. The upper part has four AC modules, and the rectangles between the AC modules represent the feature maps; the middle part corresponds to the Conv-Attention module; the lower part is the MSL module.

Adaptive convolution block

In the inference process of ordinary convolution, the parameters of the convolution kernel remain fixed, which limits the adaptive ability to handle diverse input features and makes it difficult to effectively process organs or lesion regions of different morphologies and scales. So this paper proposes a module based on adaptive convolution called AC Module (Adaptive Convolution Module). The module consists of an adaptive convolution layer, a batch normalization layer, and a ReLU activation layer, which can dynamically adjust the parameters of the convolution kernel according to different input features, thus enhancing the network’s ability to express features. As shown in Fig. 1, the input features first pass through the adaptive convolutional layer for efficient feature extraction, followed by the batch normalization and activation operations. The process can be described by the following equation.

$$\begin{aligned} X_{A}= & AdaConv(X), \end{aligned}$$
(1)
$$\begin{aligned} X_{B}= & BatchNorm(X_{A}), \end{aligned}$$
(2)
$$\begin{aligned} X_{i}= & ReLU(X_{B}), \end{aligned}$$
(3)

where AdaConv denotes the adaptive convolution operation, BatchNorm denotes batch normalization, and ReLU is the activation function.

In traditional convolutional operations, the kernel parameters remain fixed throughout the inference process, limiting the model’s adaptive ability to handle diverse input features. In the AC module, however, each input X prompts the adaptive convolutional layer to adjust dynamically, enabling the model to flexibly capture feature patterns at different scales, angles and viewpoints. The AC module significantly improves the model’s feature adaptive ability and extraction efficiency through the synergy of adaptive convolution, batch normalization, and ReLU activation. Meanwhile, optimizing the number of parameters also improves the model’s convergence speed and overall stability.
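The text does not specify how the adaptive convolutional layer derives its kernel from the input, so the following is only a toy sketch of the idea behind Eqs. (1)-(3). It assumes one common form of dynamic convolution, in which the kernel is a softmax-weighted mixture of a few candidate kernels with gates computed from the globally pooled input. The 1D signals, `CANDIDATES`, `ROUTING`, and all function names are hypothetical and chosen for illustration; they are not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mix_kernel(x, candidate_kernels, routing_weights):
    """Build an input-dependent kernel (assumed mixture-of-kernels form)."""
    pooled = sum(x) / len(x)  # global average pooling of the input
    gates = softmax([w * pooled for w in routing_weights])
    size = len(candidate_kernels[0])
    return [sum(g * k[j] for g, k in zip(gates, candidate_kernels))
            for j in range(size)]


def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding; output keeps len(x)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]


def batch_norm(batch, eps=1e-5):
    """Normalize a single-channel batch to zero mean and unit variance."""
    flat = [v for sample in batch for v in sample]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return [[(v - mean) / math.sqrt(var + eps) for v in sample]
            for sample in batch]


def relu(batch):
    return [[max(0.0, v) for v in sample] for sample in batch]


def ac_module(batch, candidate_kernels, routing_weights):
    # Eq. (1): each sample is convolved with its own input-dependent kernel.
    x_a = [conv1d_same(x, mix_kernel(x, candidate_kernels, routing_weights))
           for x in batch]
    x_b = batch_norm(x_a)  # Eq. (2)
    return relu(x_b)       # Eq. (3)


# Hypothetical candidates: an identity kernel and a simple edge detector.
CANDIDATES = [[0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
ROUTING = [1.0, -1.0]
batch = [[0.0, 1.0, 2.0, 3.0], [3.0, 1.0, 0.0, -4.0]]
out = ac_module(batch, CANDIDATES, ROUTING)
```

In this sketch the two samples are filtered by different kernels because their pooled statistics differ, which is the property the AC module relies on; a fixed convolution would apply the same kernel to both.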

Convolution–attention

To efficiently capture the global context information of feature maps, we propose the Conv-Attention module, a convolution-like attention mechanism in which a low-dimensional representation acts as a proxy for high-dimensional features; it is shown in Fig. 2. Compared with the standard self-attention mechanism, the Conv-Attention module significantly reduces the computational complexity while maintaining a global receptive field, and it is suitable for processing high-resolution feature maps. Specifically, after the multilayer AC modules process the input feature X, its dimension is reduced to 1/32 of the original, and this low-dimensional representation serves as a proxy for the high-dimensional features. The Conv-Attention module first performs a \(1\times 1\) convolution on this low-dimensional feature map to obtain the feature \(X'\), adjusting its number of channels to match that of the encoder’s output feature map at the ith layer. Next, \(X'\) and \(X_{i}\) are projected using weight matrices \(W_{q}\), \(W_{k}\), and \(W_{v}\) such that \(X'\) serves as the Query, while the downsampled representation of \(X_{i}\) serves as the Key and Value:

$$\begin{aligned} Q'= & X'W_{q}, \end{aligned}$$
(4)
$$\begin{aligned} K= & X_{i}W_{k},\end{aligned}$$
(5)
$$\begin{aligned} V= & X_{i}W_{v}. \end{aligned}$$
(6)

Then, the Softmax attention mechanism is applied to compute the attention matrix of \(X'\) as an agent of \(X_{i}\):

$$\begin{aligned} Attention(Q,K,V)=Softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V. \end{aligned}$$
(7)

This result then serves as the Value in a second attention step, in which the Query is computed from \(X_{i}\) and the Key from \(X'\), yielding the final feature matrix with global contextual information:

$$\begin{aligned} Q= & X_{i}W_{q}, \end{aligned}$$
(8)
$$\begin{aligned} K'= & X'W_{k},\end{aligned}$$
(9)
$$\begin{aligned} Out= & Attention(Q,K',Attention(Q',K,V)). \end{aligned}$$
(10)
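The two attention steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes standard scaled dot-product attention (softmax scores multiplied by the Value), assumes the inner attention uses the proxy Query \(Q'\) from Eq. (4) as the surrounding text describes, treats features as token-by-channel matrices, and omits the downsampling of \(X_{i}\). The function names and the toy inputs are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise, numerically stable softmax."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


def conv_attention(x_i, x_proxy, w_q, w_k, w_v):
    """Two-step proxy attention: n tokens in x_i, m << n tokens in x_proxy."""
    q_proxy = matmul(x_proxy, w_q)            # Eq. (4), shape m x d
    k = matmul(x_i, w_k)                      # Eq. (5), shape n x d
    v = matmul(x_i, w_v)                      # Eq. (6), shape n x d
    proxy_context = attention(q_proxy, k, v)  # proxy gathers context, m x d
    q = matmul(x_i, w_q)                      # Eq. (8), shape n x d
    k_proxy = matmul(x_proxy, w_k)            # Eq. (9), shape m x d
    return attention(q, k_proxy, proxy_context)  # Eq. (10), shape n x d


# Toy example: 4 encoder tokens, 2 proxy tokens, 2 channels, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
x_proxy = [[0.5, 0.5], [1.0, 0.0]]
out = conv_attention(x_i, x_proxy, identity, identity, identity)
```

Both score matrices in this sketch have size n × m rather than n × n, which is where the saving over full self-attention comes from when the proxy has far fewer tokens than the encoder feature map.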

The Conv-Attention module introduces a downsampling strategy to reduce the spatial resolution of the input feature maps before generating the Query, Key, or Value matrices. This strategy downsamples the feature maps to 1/s of the original resolution, where s is the downsampling rate. Thus, the attention computation is simplified to the original \(1/s^{2}\) and resource consumption is significantly reduced. Compared with the traditional self-attention module, the Conv-Attention module focuses more on retaining high-level abstract features while capturing long-range dependencies. Assigning weights to the high-level features after multiple rounds of feature extraction with i-layer feature channels makes the model focus more on the discriminative feature channels. On the one hand, this optimizes the expression of multi-scale features and the use of computational resources and improves the recognition ability of the model; on the other hand, it effectively reduces the number of parameters and significantly speeds up the computation. Compared with the standard self-attention mechanism, the Conv-Attention module shows stronger robustness and adaptability in complex scenes.
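The \(1/s^{2}\) figure can be checked with a short count of attention score entries. The sketch below assumes that only the Key and Value are downsampled by a factor s along each spatial side while the Query keeps full resolution; the text leaves open which matrices are downsampled, so this is one reading. The function name and the 64 × 64 feature map size are hypothetical.

```python
def attention_score_entries(height, width, s=1):
    """Entries in the Query-Key score matrix when Key/Value are
    downsampled by a factor s along each spatial side (assumption)."""
    n_query = height * width
    n_key = (height // s) * (width // s)
    return n_query * n_key


full = attention_score_entries(64, 64)          # no downsampling
reduced = attention_score_entries(64, 64, s=4)  # Key/Value at 1/4 resolution
ratio = full // reduced                          # equals s ** 2
```

With s = 4, the score matrix shrinks by a factor of 16, matching the \(1/s^{2}\) reduction stated above under this assumption.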

Fig. 2

Details of our proposed Conv-Attention module: different scale feature maps \(X_{i}\), last layer downsampled \(X_{4}\), Conv as a convolution operation.

Fig. 3

Details of our proposed MSL module; SS2D, taken from VMamba, performs the multi-scale extraction.

Multi-scale learning block

The MSL module proposed in this paper, shown in Fig. 3, aims to up-sample the feature map, combining fine-grained local features with high-level semantic information to fully capture the multi-scale information from the encoding stage. Through multi-scale feature fusion, the MSL module can effectively enhance the model’s ability to understand complex scenes, especially in extracting and integrating information at different scales and layers. The MSL module consists of two core layers: the MPE and MSV layers. In the encoder–decoder architecture, the encoding phase gradually compresses the feature maps to extract high-level semantic information, but some fine-grained spatial details may be lost. The primary role of the MPE layer is to up-sample the feature maps to recover the spatial resolution and then to combine feature maps of different resolutions so that high-level semantic information is fused with low-level detailed features. Through multiple up-sampling operations, the MPE layer generates a set of multi-scale feature maps containing rich spatial details, which are further extracted and fused by depthwise separable convolution. At the same time, by recovering layer by layer, the MPE layer can maintain the integrity of the detailed information and combine it with the multi-scale information from the encoding stage to realize a fine-grained, high-resolution representation. The MSV layer further performs multi-scale feature fusion based on the up-sampled feature maps from the MPE layer. SS2D35 provides a new approach for this layer to enhance the representation capability of features by effectively integrating features from different scales, giving a deeper understanding of the details and semantic information in multi-scale and complex scenes. The MSV layer employs multi-scale convolutional operations to process the up-sampled features, fully fusing the feature information from different scales and generating high-resolution joint features with stronger robustness. By combining the MPE and MSV layers, the MSL module efficiently captures and reconstructs multi-scale features from the encoding stage and integrates fine-grained details and high-level semantic information. This property is particularly suitable for tasks such as medical image segmentation that require precise localization and recognition. Compared with traditional up-sampling methods, the MSL module focuses more on combining multi-scale features, enabling the model to cope with scale variations and detail loss in complex scenes. Through multi-scale learning and feature fusion, the MSL module effectively improves the feature map’s spatial resolution and representation ability and enhances the model’s ability to express fine-grained and global semantics.

Hybrid loss function

The commonly used loss functions in medical image segmentation tasks mainly include classification loss and segmentation loss36. To improve the accuracy of medical image segmentation, this paper uses a combination of cross-entropy loss and Dice loss as the overall loss function. Cross-entropy loss quantifies the difference between the predicted values and the true values, while Dice loss evaluates the overlap between the predicted results and the true annotations.

$$\begin{aligned} L_{CE}= & -\sum _{i}\ T_{i}\times \log (P_{i}), \end{aligned}$$
(11)
$$\begin{aligned} L_{DC}= & 1-\frac{2\times \sum _{i}\ P_{i}\times T_{i}}{\sum _{i}\ T_{i}+\sum _{i}\ P_{i}}, \end{aligned}$$
(12)

where \(T_{i}\) is the true label in one-hot encoding and \(P_{i}\) is the predicted probability. Although Dice loss is well adapted to class-imbalanced data, it is still challenging to achieve excellent segmentation performance when training the network with Dice loss alone. To segment the tumour region more accurately, this paper combines the advantages of the two loss functions, which effectively balances classification performance and segmentation accuracy. The hybrid loss is defined as follows:

$$\begin{aligned} Loss=\alpha L_{CE}+\beta L_{DC}, \end{aligned}$$
(13)

where \(\alpha\) and \(\beta\) are hyperparameters of the hybrid loss function. In this paper, \(\alpha =0.4\) and \(\beta =0.6\) are selected. This weight configuration enhances the model’s ability to learn small-target boundary features by reducing the background weight (\(\beta =0.6\)). Meanwhile, the 0.4:0.6 weight ratio, while maintaining the symmetry of the Dice coefficient, adjusts the loss contributions of the foreground and background, further strengthening the model’s learning of foreground details. Additionally, this strategy suppresses background noise interference, improves the model’s adaptability to cross-modal differences in data distribution, and ensures the segmentation stability of the model under different imaging conditions. Therefore, it provides an interpretable parameter optimization solution for automated segmentation tasks in complex clinical scenarios.
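As a minimal sketch of Eqs. (11)-(13), the hybrid loss can be written in plain Python over flattened lists of predicted probabilities and one-hot labels. The Dice term uses the standard form with \(\sum T_{i}+\sum P_{i}\) in the denominator. The small `eps` smoothing term and the function names are assumptions added here for numerical safety and illustration; they do not appear in the equations.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Eq. (11): -sum_i T_i * log(P_i); eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, targets))


def dice_loss(probs, targets, eps=1e-7):
    """Eq. (12): 1 - 2 * sum(P*T) / (sum(T) + sum(P)), with eps smoothing."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(targets) + sum(probs)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def hybrid_loss(probs, targets, alpha=0.4, beta=0.6):
    """Eq. (13) with the weights alpha = 0.4 and beta = 0.6 from the text."""
    return (alpha * cross_entropy(probs, targets)
            + beta * dice_loss(probs, targets))


# Two pixels, two classes, flattened: one-hot labels and soft predictions.
targets = [1.0, 0.0, 0.0, 1.0]
uncertain = [0.6, 0.4, 0.4, 0.6]
loss_uncertain = hybrid_loss(uncertain, targets)
loss_perfect = hybrid_loss(targets, targets)
```

A perfect prediction drives both terms to approximately zero, while the uncertain prediction is penalized by both the cross-entropy and the Dice term.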

Experiments and results

In this section, we first introduce the three public datasets, the experimental details, and the evaluation metrics. We then validate the segmentation performance of MSCA-UNet on ISIC201721, MICCAI 2023 Tooth37, and CVC-ClinicDB22 by comparing it with U-net5, Attention U-Net7, DualNet20, MedT32, TransUNet23, Unext18, EGE-UNet8, WRANet19, and SeTformer33. Finally, ablation experiments on our model are outlined.

Datasets

CVC-ClinicDB

This publicly available dataset contains 612 colonoscopy images from real colonoscopy videos, with a resolution of 384 \(\times\) 288 pixels.

MICCAI 2023 tooth

This dataset contains high-resolution 2D X-ray images acquired from a dental imaging device, with accurate pixel-level segmentation annotations of the main structures of the teeth, such as crowns and roots37. The training set contains 2000 images, and the test set contains 500.

ISIC2017

This dataset contains 2000 dermoscopic images with detailed pixel-level segmentation annotations21. The training, validation, and test sets are 1279, 150, and 600 images, respectively. Each image has a high resolution and is suitable for nuanced feature analysis.

To standardize the input size across different datasets, all images were resized to 256 \(\times\) 256 before training. Data augmentation techniques, including random rotation, flipping, scaling, and cropping, were also employed to increase the diversity of the data and enhance the robustness of the model.

In the following two subsections, we elaborate on the details of the experiment and the evaluation metrics and conclude with a discussion of the results.

Experimental details

In this work, we apply the proposed model to the CVC-ClinicDB, MICCAI 2023 Tooth and ISIC2017 datasets and compare it with the standard U-net and SOTA models. The specific experimental setup is as follows:

Model training

All experiments are conducted using the same computational resources and environment. The Adam optimizer is used with an initial learning rate of 0.001, and the learning rate is gradually reduced using a cosine annealing strategy to accelerate convergence. The number of training epochs is set to 100, and an early stopping strategy is used to avoid overfitting. Each dataset is divided into a training set and a test set. The model is trained on the training set, and the optimal model is selected and evaluated on the test set to ensure the reliability of the experimental results and the model’s generalization ability.
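The learning-rate schedule can be illustrated with the usual cosine annealing formula. The sketch below assumes the rate decays from the initial value of 0.001 to a minimum of zero over the 100 training epochs with no warm restarts; the minimum value and the function name are assumptions, since the text does not state them.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=100, lr_init=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing.

    lr_min = 0.0 and the absence of warm restarts are assumptions.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)


# One learning rate per epoch boundary, from epoch 0 to epoch 100.
schedule = [cosine_annealing_lr(e) for e in range(101)]
```

The rate starts at 0.001, passes through half that value at epoch 50, and reaches the assumed minimum at epoch 100; the slow change near both ends is what distinguishes this schedule from a linear decay.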

Experimental platforms

All experiments were conducted on a computer equipped with an NVIDIA RTX 4090 GPU, and the deep learning framework was PyTorch.

Evaluation metrics

In this study, to comprehensively evaluate the segmentation performance of the model, we use the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) as the primary evaluation metrics. Both measure the degree of overlap between the predicted segmentation and the ground-truth segmentation, and they are defined as follows.

DSC

The Dice coefficient is used to measure the overlap between the predicted segmentation region and the real region and is defined as:

$$\begin{aligned} DSC=\frac{2\times \left| P\cap T\right| }{\left| P\right| +\left| T\right| }, \end{aligned}$$
(14)

where P denotes the segmentation region predicted by the model, and T denotes the ground-truth segmentation region. The DSC lies between 0 and 1, with higher values indicating a better match between the predicted segmentation and the ground truth.

IoU

IoU measures the ratio of the intersection of the predicted region and the ground-truth region to their union, and is defined as:

$$\begin{aligned} IoU=\frac{ \left| P\cap T\right| }{\left| P\cup T\right| }. \end{aligned}$$
(15)

Similarly, IoU values range from 0 to 1, with higher values indicating a better match between the segmented region predicted by the model and the actual segmented region.
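For binary masks, both metrics reduce to simple counts of overlapping pixels. The following sketch computes them on flattened 0/1 masks, taking IoU as intersection over union; the function names are hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than something the text specifies.

```python
def dice_coefficient(pred, target):
    """Eq. (14): 2|P ∩ T| / (|P| + |T|) for flattened 0/1 masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2.0 * intersection / total if total else 1.0


def iou(pred, target):
    """Eq. (15): |P ∩ T| / |P ∪ T| for flattened 0/1 masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return intersection / union if union else 1.0


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dsc_value = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
iou_value = iou(pred, target)               # 2 / 4
```

On the same pair of masks the two metrics are related by DSC = 2·IoU / (1 + IoU), so DSC is never lower than IoU.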

FLOPs

The computational cost of the model is usually measured by the number of floating point operations (FLOPs), which indicates how many floating point operations the model needs to perform during inference. The higher the FLOPs, the higher the computational demand of the model.
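As a rough illustration of how FLOPs are counted, the sketch below estimates the cost of a single standard 2D convolution layer. It assumes the common convention of counting one multiply-accumulate as two floating point operations and ignores the bias term; profiling tools differ on both points, so reported figures are only comparable within one convention. The function name and the example layer are hypothetical.

```python
def conv2d_flops(c_in, c_out, kernel_size, out_h, out_w, count_mac_as_two=True):
    """Approximate FLOPs of one standard 2D convolution (bias ignored)."""
    macs = c_in * c_out * kernel_size * kernel_size * out_h * out_w
    return 2 * macs if count_mac_as_two else macs


# A 3x3 convolution from 3 to 64 channels on a 256 x 256 output map.
first_layer = conv2d_flops(3, 64, 3, 256, 256)
# The same layer applied after one 2x2 max-pooling step (128 x 128 map).
after_pooling = conv2d_flops(3, 64, 3, 128, 128)
```

Halving the spatial resolution cuts the cost of the layer to one quarter, which is why downsampling stages dominate the savings in encoder-decoder networks.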

Results and discussion

To comprehensively validate the effectiveness of our proposed method, we conducted experiments on three representative public datasets: the CVC-ClinicDB, the MICCAI 2023 Tooth, and the ISIC2017. Seven state-of-the-art methods are employed to validate the performance of MSCA-UNet. The experimental results show that our method performs well in coping with image segmentation tasks of different complexity levels, demonstrating the proposed module’s significant advantages in enhancing feature extraction and segmentation performance.

Comparison on CVC-ClinicDB dataset

To verify the effectiveness of our proposed modules in dealing with blurred boundaries, we test on the CVC-ClinicDB intestinal polyp dataset. The eight models were each tested three times, and the final results are reported as the mean plus or minus the variance. The experimental results are shown in Table 1. Our method outperforms the other models on the CVC-ClinicDB dataset, achieving average DSC and IoU scores of 89.77% and 82.87%. In particular, it leads MedT by nearly 22% and 27% in DSC and IoU, respectively, which is a significant improvement.

Table 1 Comparisons with state-of-the-art models on CVC-ClinicDB dataset.

Comparison on MICCAI 2023 tooth dataset

To evaluate performance on structurally complex images, we test on the MICCAI 2023 Tooth dataset. The experimental results are shown in Table 2. Our method leads in both evaluation metrics, achieving average DSC and IoU scores of 92.37% and 88.24%, reflecting its superior performance on complex images.

Table 2 Comparisons with state-of-the-art models on MICCAI 2023 tooth dataset.

Comparison on the ISIC2017 dataset

We also ran comparisons on the ISIC2017 dermoscopy dataset, whose images are diverse and have more blurred boundaries. The experimental results are shown in Table 3, where our method achieved average DSC and IoU scores of 89.42% and 82.27%, demonstrating robust segmentation in the presence of complex, blurred boundaries.

The experimental results benefit from the synergistic effect of our proposed modules: The AC module enhances the ability to capture boundary detail by dynamically adjusting the convolution kernel. The Conv-Attention module effectively directs the model to focus on the key regions and suppresses the background noise. The MSL module integrates the multi-scale information to improve the model’s comprehension of the global structure and local details. Together, these modules form an efficient feature extraction framework, which enables the model to exhibit excellent segmentation performance in different complex scenarios.

Table 3 Comparisons with state-of-the-art models on ISIC2017 dataset.
Table 4 Complexity comparison with state-of-the-art models.

Visualization of segmentation results

Figures 4, 5 and 6 show the segmentation visualization results for the CVC-ClinicDB, MICCAI 2023 Tooth and ISIC2017 datasets, respectively. The first column is the original image, the second column is the ground truth, and the remaining columns are U-net, Attention U-Net, DualNet, MedT, TransUNet, UNeXt, EGE-UNet, WRANet, SeTformer and our method, in that order. Our method produces visibly more accurate segmentations: owing to the proposed modules, it can accurately segment regions with blurred boundaries while also locating targets with complex structures. For example, in the CVC-ClinicDB dataset, our method captures the edges of intestinal polyps more clearly, whereas the other models produce more missegmented regions. In the MICCAI 2023 Tooth dataset, our method handles structurally complex images well, accurately distinguishing the boundaries between teeth.

Moreover, in the ISIC2017 dataset, which contains dermatologic images with diverse appearance and blurred boundaries, our method captures more detailed features and generates more accurate segmentation results. In contrast, models such as MedT and UNeXt are slightly less capable of capturing details. These results indicate that our modules are effective and that our method can learn detailed features in the data, which is a clear advantage over the other SOTA models.

Fig. 4

Comparison chart of experimental results at CVC-ClinicDB.

Fig. 5

Comparison chart of experimental results at MICCAI 2023 Tooth.

Fig. 6

Comparison chart of experimental results at ISIC2017.

Ablation experiments

Our method is designed with three key modules: the AC module, the Conv-Attention module, and the MSL module. Comprehensive ablation experiments are performed on the three datasets to validate the contribution of each module, and the resulting segmentations are visualized. The experimental results are shown in Tables 5, 6 and 7, and the visualization results are shown in Fig. 7. “Baseline + AC + Conv-Attention + MSL” exhibits the best performance on all the test datasets, which demonstrates the effectiveness of combining these modules.

First, when the AC, Conv-Attention, or MSL module is introduced alone, the model’s segmentation accuracy on the CVC-ClinicDB and ISIC2017 datasets improves significantly, and the effect is especially prominent in regions with blurred boundaries. On the MICCAI 2023 Tooth dataset, the model also performs well in segmenting regions with complex structures. In addition, the modules demonstrate distinct advantages in mitigating information loss, enhancing contextual modelling capabilities, and refining boundary features.

Upon further testing the combined effect of the modules, we find that pairwise combinations, namely “AC + Conv-Attention”, “Conv-Attention + MSL”, and “AC + MSL”, further improve the segmentation results on the three datasets. Finally, the model achieves the best segmentation performance when all three modules, AC, Conv-Attention and MSL, are used simultaneously. Our method performs excellently in all segmentation tasks on the CVC-ClinicDB and MICCAI 2023 Tooth datasets.

The experimental results show that our proposed modular design can effectively improve the overall performance of the segmentation tasks. This further emphasizes the importance of enhancing feature extraction capabilities and multi-scale information modelling within the encoder-decoder framework, which provides an efficient and robust solution to the complex image segmentation problem.
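The seven ablation settings discussed above (each module alone, each pair, and all three) correspond to the non-empty subsets of the three modules. As a minimal sketch, they can be enumerated as follows; the helper names ablation_configs and label are hypothetical and serve only to illustrate the experimental grid, not our training code.

```python
from itertools import combinations

MODULES = ("AC", "Conv-Attention", "MSL")


def ablation_configs(modules=MODULES):
    """Enumerate every non-empty subset of modules as on/off flags."""
    configs = []
    for r in range(1, len(modules) + 1):
        for combo in combinations(modules, r):
            configs.append({m: (m in combo) for m in modules})
    return configs


def label(config):
    """Human-readable name, e.g. 'Baseline + AC + MSL'."""
    enabled = [m for m, on in config.items() if on]
    return "Baseline + " + " + ".join(enabled)


configs = ablation_configs()
for cfg in configs:
    print(label(cfg))
```

The enumeration order (single modules, then pairs, then the full combination) is a choice made here for readability; each configuration would be trained and evaluated separately on every dataset.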

Fig. 7

Visualization of ablation experiments on each of the three datasets: (a) +AC, (b) +Conv-Attention, (c) +MSL, (d) +AC+Conv-Attention, (e) +AC+MSL, (f) +Conv-Attention+MSL, (g) +AC+Conv-Attention+MSL.

Table 5 Ablation experiments on CVC-ClinicDB dataset.
Table 6 Ablation experiments on MICCAI 2023 tooth dataset.
Table 7 Ablation experiments on ISIC2017 dataset.

Complexity analysis

We compared our proposed method with existing SOTA models in terms of the number of parameters, FLOPs, and inference time, and the results are shown in Table 4. Although our method introduces several modules to enhance segmentation performance, it segments intestinal polyps, teeth and skin lesions more accurately while using significantly fewer parameters and less inference time than models such as U-net, MedT and WRANet. Although the UNeXt model has a shorter inference time than our model, our model achieves better segmentation accuracy while keeping the inference time short, and therefore offers better overall performance. In addition, thanks to the efficient AC module and Conv-Attention module, our method achieves fast inference and strong segmentation performance while keeping the computational complexity low. This makes it a more attractive choice for practical applications.
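To make the three complexity measures concrete, the sketch below counts the parameters and multiply-accumulate operations (MACs) of standard 2D convolution layers analytically and times an arbitrary callable. It is only an illustration: the two-layer stack, the 256×256 feature-map size, and all function names are hypothetical and do not describe our architecture or our measurement protocol. FLOPs are often reported as roughly twice the MAC count, although conventions differ between tools.

```python
import time


def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates needed to produce an h_out x w_out output map."""
    return k * k * c_in * c_out * h_out * w_out


def mean_inference_time(fn, warmup=2, runs=10):
    """Average wall-clock seconds per call of fn, after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Hypothetical stack: (c_in, c_out, kernel size), output kept at 256 x 256.
layers = [(3, 32, 3), (32, 64, 3)]
total_params = sum(conv2d_params(ci, co, k) for ci, co, k in layers)
total_macs = sum(conv2d_macs(ci, co, k, 256, 256) for ci, co, k in layers)

# Stand-in workload; a real measurement would call the model's forward pass.
seconds = mean_inference_time(lambda: sum(range(1000)))
```

For this toy stack, total_params is 19,392 and total_macs is 1,264,582,656. Warm-up calls are excluded from timing because the first few forward passes of a real model typically include one-off initialization costs.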

Conclusion

This study proposes an improved model based on the U-net backbone network that combines the AC module, the MSL module, and the Conv-Attention mechanism to cope with boundary ambiguity, complex structure, and missing information in medical image processing. Through experimental validation on several public datasets (CVC-ClinicDB, MICCAI 2023 Tooth, and ISIC2017), our approach achieves significant improvements in accuracy and robustness, outperforming the seven network models tested in both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) metrics. The experimental results show that our model exhibits significantly better segmentation performance than comparable methods on multi-modal video datasets, X-ray datasets in the field of medical imaging, and MRI datasets, verifying its generalization ability across modalities and domains. Notably, the model maintains stable performance on low-contrast X-ray images and high-noise video frames, further demonstrating its robustness and generalization ability and providing a reliable solution for automated segmentation tasks in complex clinical scenarios.

Compared with the traditional U-net, our model reduces the number of parameters and the computational complexity while improving segmentation accuracy. The introduced AC and MSL modules effectively enhance the feature extraction capability, and the Conv-Attention mechanism further improves the model’s focus on critical features. Overall, the method is suitable for accurate segmentation of medical images and provides new ideas and feasible solutions for image processing tasks in other complex scenes.

In future work, we plan to further optimize the computational efficiency of the model and explore its applicability in more medical domains. We expect to validate its generalization performance and practical application value on a wider range of datasets.