Introduction

Magnetic resonance imaging (MRI) and computed tomography (CT) have become the main tools clinicians use to analyze disease. Medical image segmentation aims to delineate lesion regions or affected organs in different types of medical images, providing an important basis for the accurate assessment, diagnosis and treatment of disease. For example, segmenting ischemic stroke lesions on MRI images helps clinicians quickly locate and quantify the lesion and determine the ischemic penumbra as early as possible, which enables timely clinical intervention, reduces medical expenses, limits patient injury, and improves patients' quality of life and survival rate. Likewise, segmenting lesions in CT images of Coronavirus Disease 2019 (COVID-19) helps clinicians locate and quantify lesions, avoid missed and false detections, and diagnose patients in a timely manner.

At present, classic segmentation models include U-Net1, DeepLab2 and PSPNet3, which have achieved good results in lesion segmentation of medical images4,5,6. U-Net1 adopts an encoder-decoder structure in which the encoder extracts feature information and the decoder restores it. In addition, high-resolution information and deep semantic information are fused through skip connections, which enriches the context information and makes U-Net successful on small-scale datasets. However, U-Net is limited in explicitly modeling long-range dependencies, and its convolution and pooling operations may reduce resolution and lose information. DeepLab2 first proposed atrous convolution, which expands the receptive field without adding parameters, and used a fully connected conditional random field to improve boundary localization. This approach has developed rapidly in both natural and medical image segmentation. However, DeepLab has a poor semantic understanding of objects of different sizes and is not well suited to segmenting lesions with large size variations. PSPNet3 proposed a pyramid pooling module based on the encoder-decoder structure, which fuses multi-scale information and helps the model obtain richer global context. Recently, some studies7,8 have used Transformers to model long-range dependencies, but they require large amounts of training data and significant computational resources because of the large number of query operations, which limits their practical application in clinical settings. Zhu et al.9 proposed a lightweight 3D segmentation network, SV-net, which replaces the convolutional blocks in V-Net with lightweight convolutional blocks, significantly speeding up training.
To reduce the dependence of deep learning networks on labels, Liu et al.10 proposed a weakly supervised biomarker localization and segmentation network that only requires image-level annotations. This network uses a novel pretraining strategy based on supervised contrastive learning, effectively reducing the annotation burden on ophthalmologists in OCT images.

In lesion segmentation, medical data with ground-truth labels are scarce, so among the above networks U-Net, which works well on small-scale datasets, is the most widely used and shows great potential. Zhou et al.4 redesigned the skip connections to bridge the semantic gap between encoder and decoder feature maps, and used model pruning to reduce model complexity and balance inference time against performance. Gu et al.11 added a context extractor module to the encoder-decoder structure to address the loss of spatial information caused by the consecutive convolution and pooling operations in the original U-Net. Oktay et al.12 proposed Attention U-Net, which embeds the attention gate (AG) module into the U-Net skip connections for the first time. The AG module suppresses feature activations in irrelevant regions and thereby improves segmentation accuracy. Cai et al.13 used Attention U-Net as the baseline network and connected the features fused at different scales with parallel spatial attention (SA) and channel attention (CA) mechanisms, effectively capturing long-range dependencies. Mu et al.14 proposed an attention-based residual U-Net framework for intracranial aneurysm segmentation, which introduces differential preprocessing and geometric post-processing and can automatically segment intracranial aneurysms with high accuracy. U-Net and its variants have thus succeeded in a range of medical image segmentation tasks. However, the conventional convolution operations and simple skip connections in U-Net limit model performance: they handle lesions with large differences in shape, size or location poorly and are sensitive to blurred boundaries.
This is a common problem in lesion segmentation across different diseases and imaging modalities, and building a universal model for multi-type lesion segmentation is extremely challenging. To address these problems, we propose DTA-UNet, which can effectively segment different types of lesions and copes better with blurred boundaries. The main contributions are as follows:

1. Dynamic convolution decomposition (DCD) is introduced to replace all the regular convolutions in the encoding-decoding process of Attention U-Net12, which significantly improves the feature extraction capability of the CNN with only a small increase in parameters.

2. We apply a combination of triplet attention (TA) and AG to the skip connections to highlight the lesion region and refine the features extracted by dynamic convolution.

3. The resulting U-Net based on DCD and TA, called DTA-UNet, is a universal segmentation model that handles both lesion regions and boundaries well and produces accurate segmentation results on three types of datasets.

The rest of the paper is organized as follows: “Related work” provides a brief background on dynamic convolution methods and attention mechanisms. “Universal lesion segmentation framework DTA-UNet” describes the proposed DTA-UNet method in detail, including its architecture diagram and the TA structure diagram. “Experiments” describes the datasets, compares DTA-UNet with various state-of-the-art deep learning models, and discusses the experimental results. Finally, “Conclusion” summarizes the work and outlines directions for future work.

Related work

Dynamic convolution

Before the advent of dynamic convolution, conventional convolution used the same kernel parameters for all input images, which limited the feature expression ability of the network. The expressive ability of a network is usually enhanced by increasing its depth or width, but this greatly increases the computational cost. Therefore, Yang et al.15 proposed Conditionally Parameterized Convolutions (CondConv), which applies an attention mechanism to several parallel convolution kernels so that the kernel parameters are no longer shared across all inputs. CondConv shows excellent performance on detection and classification tasks and provides a new direction for improving model expressiveness. Chen et al.16 proposed DYnamic Convolution (DY-Conv) on the basis of CondConv. By constraining the attention and introducing a temperature parameter into the Softmax, it accelerates training and reduces the number of parameters. Li et al.17 analyzed, from the perspective of matrix decomposition, two problems of dynamic convolution, namely the large number of parameters and the difficulty of jointly optimizing the dynamic attention and the conventional convolution kernels, and proposed DCD with a dynamic channel fusion mechanism. Compared with CondConv and DY-Conv, DCD has fewer parameters and better performance. Therefore, in this study, all the conventional convolutions in the encoder-decoder part of Attention U-Net are replaced by DCD, trading a small number of additional parameters for a large performance improvement.

Attention mechanism

In computer vision, an attention mechanism computes different weights for the input feature map, so that the model pays more attention to key information and ignores irrelevant information such as the background. Hu et al.18 proposed the Squeeze-and-Excitation (SE) block to realize information interaction in the channel dimension. However, the excitation step is implemented mainly by reducing and then restoring the channel dimension, and the dimensionality reduction hinders the network from learning channel relations effectively. Wang et al.19 proposed Efficient CA (ECA), which removes the dimensionality reduction and realizes information interaction between some channels through one-dimensional convolution, reducing the number of parameters and improving performance. Yang et al.20 proposed the Simple parameter-free Attention Module (SimAM), whose core idea is to use an energy function to estimate the importance of each neuron: the lower the energy, the higher the importance. The method performs well on both detection and segmentation tasks. Liu et al.21 proposed the Normalization-based Attention Module (NAM), which represents the importance of channels and spatial positions through the variance in Batch Normalization (BN): the larger the variance, the richer the channel or spatial information and the higher the importance. Woo et al.22 proposed the Convolutional Block Attention Module (CBAM), which combines the SA and CA mechanisms. CBAM can adaptively refine features and be seamlessly integrated into CNN architectures. However, SA and CA in CBAM are independent of each other and lack the cross-dimension information that benefits performance. Therefore, Misra et al.23 proposed a nearly parameter-free TA, which uses residual connections and rotation of the tensor to realize information interaction across the three dimensions of channel, height and width.
At the same time, it avoids the adverse effects of the dimensionality reduction operation in the CBAM module. Compared with the above attention mechanisms, TA realizes not only spatial and channel attention but also cross-dimension information interaction. Therefore, we selected TA to refine the features. In addition, in medical image segmentation the background accounts for a large proportion of the image, and the segmentation targets themselves vary greatly in shape and size. For this reason, this study combines TA with AG, which not only focuses on lesions of different shapes and sizes but also eliminates the adverse effects of irrelevant regions and backgrounds in the skip connections.

Universal lesion segmentation framework DTA-UNet

The proposed universal lesion segmentation framework DTA-UNet is shown in Fig. 1, with the Attention U-Net containing the AG structure as the basic network. First, in order to enhance the feature expression ability of the model, we use DCD to replace all the conventional convolutions in the encoding-decoding process. Secondly, we use the combination of TA and AG to correct the feature maps extracted by DCD, and eliminate the ambiguity caused by irrelevant regions in spatial and channel dimensions. Finally, the output is obtained by bilinear upsampling.

Figure 1

The architecture of DTA-UNet. Each rectangular box represents a feature map, and the number directly above the rectangular box represents the number of channels of the feature map. In the encoding stage, the number in the lower left part of the rectangular box represents the size of the feature map. In the decoding stage, the number in the lower right part of the rectangular box represents the size of the feature map.

DTA-UNet adopts a conventional encoder-decoder structure. The encoder consists of dual DCD, TA and max pooling, which are used to extract features, optimize features, reduce image resolution and increase the number of channels. The decoder consists of bilinear upsampling, dual DCD, skip connection, TA and \(1\times 1\) convolution, which is used to restore the encoded abstract feature map to its original size.

The conventional CNN is limited by computation cost, so its depth and width cannot be greatly increased, which limits its expressive ability. We replace the conventional convolutions in the encoding-decoding process with DCD, which helps the network obtain finer semantic information and effectively balances computational cost against expressive ability. DCD replaces dynamic attention with dynamic channel fusion. Its basic principle is derived below, starting from the dynamic convolution proposed by CondConv and DY-Conv, which is shown in Eq. (1).

$$\begin{aligned} {\textrm{W}({\textbf{x}}) = \sum \limits _{k = 1}^K {{\pi _k}} ({\textbf{x}}){{\textrm{W}} _k}} \end{aligned}$$
(1)

Where \(\textbf{x}\) represents the input feature map, \({\textrm{W}}({\textbf{x}})\) represents the dynamic convolution kernel, \({{\textrm{W}} _k}\) represents the k-th conventional convolution kernel, \(K\) is the number of convolution kernels, and the index \(k\) ranges over \([1,K]\). \({\pi _k}({\textbf{x}})\) is the attention weight coefficient, which takes values in \([0,1]\) and is subject to the constraint \(\sum _{k = 1}^K {\pi _k}({\textbf{x}}) = 1\).
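The aggregation in Eq. (1) can be illustrated with a minimal NumPy sketch. The attention logits below are random stand-ins for the output of an input-dependent attention branch (an assumption, since that branch is not specified here), and the function names and toy sizes are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with a temperature, so the weights lie in [0, 1] and sum to 1."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def aggregate_kernels(kernels, logits, temperature=1.0):
    """Eq. (1): W(x) = sum_k pi_k(x) * W_k for a kernel stack of shape (K, ...)."""
    pi = softmax(logits, temperature)
    # Contract the K axis of the kernel stack against the K attention weights.
    return np.tensordot(pi, kernels, axes=1), pi

rng = np.random.default_rng(0)
K, c_out, c_in, k = 4, 8, 8, 3                      # toy sizes (hypothetical)
kernels = rng.standard_normal((K, c_out, c_in, k, k))
logits = rng.standard_normal(K)                     # stand-in for the attention branch
W_x, pi = aggregate_kernels(kernels, logits)
print(W_x.shape, pi.sum())
```

The aggregated kernel has the shape of a single static kernel, so it can be used in one ordinary convolution even though K kernels are stored.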

First, by representing each conventional convolution with the idea of residuals, we can get Eq. (2):

$$\begin{aligned} {{{\textrm{W}} _k} = {{\textrm{W}} _0} + \Delta {{\textrm{W}} _k}} \end{aligned}$$
(2)

Where \({{\textrm{W}} _k}\) is the k-th regular convolution kernel, as defined in Eq. (1), and \({{\textrm{W}} _0} = \frac{1}{K}\sum _{k = 1}^K {{{\textrm{W}} _k}}\) denotes the mean convolution kernel.

Secondly, the singular value decomposition of \(\Delta {{\textrm{W}} _k}\) is performed as shown in Eq. (3):

$$\begin{aligned} {\Delta {{\textrm{W}} _k} = {{\textrm{W}} _k} - {{\textrm{W}} _0} = {{\textrm{U}} _k}{{\textrm{S}} _k}{{\textrm{V}} _k}^T} \end{aligned}$$
(3)

Where \(\Delta {{\textrm{W}} _k}\), \({{\textrm{W}} _k}\) and \({{\textrm{W}} _0}\) have the same meaning as above, \({{\textrm{U}} _k}\) represents the left singular matrix of \(\Delta {{\textrm{W}} _k}\), \({{\textrm{S}} _k}\) represents the diagonal matrix of singular values, and \({{\textrm{V}} _k}\) represents the right singular matrix.

Then, by substituting Eqs. (2) and (3) into Eq. (1) and using the constraint \(\sum _{k = 1}^K {\pi _k}({\textbf{x}}) = 1\), we obtain the entire dynamic convolution, as shown in Eq. (4):

$$\begin{aligned} {{\textrm{W}}({\textbf{x}}) = \sum \limits _{k = 1}^K {{\pi _k}} ({\textbf{x}}){{\textrm{W}} _0} + \sum \limits _{k = 1}^K {{\pi _k}} ({\textbf{x}}){{\textrm{U}} _k}{{\textrm{S}} _k}{{\textrm{V}} _k}^T = {{\textrm{W}} _0} + {\textrm{U}}{} \mathbf{\pi }({\textbf{x}}){{\textrm{SV}} ^T}} \end{aligned}$$
(4)

Where \({{\textrm{W}} _0}\) is defined as above, \({\textrm{U}} = [{{\textrm{U}} _1},...,{{\textrm{U}} _K}]\) concatenates the left singular matrices, \({\textrm{V}} = [{{\textrm{V}} _1},...,{{\textrm{V}} _K}]\) concatenates the right singular matrices, \({\textrm{S}} = \textrm{diag}({{\textrm{S}} _1},...,{{\textrm{S}} _K})\) is the block-diagonal matrix of singular values, and \(\mathbf{\pi }({\textbf{x}})\) is the diagonal matrix that stacks the attention weight coefficients \({\pi _k}({\textbf{x}})\) obtained by the attention mechanism.
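The step from Eq. (1) to Eq. (4) can be checked numerically. The sketch below is an illustration under stated assumptions: \(W_0\) is taken as the mean kernel, each kernel is flattened to a small matrix, the attention weights are random values normalized to sum to 1, and the stacked matrices follow the block reading of Eq. (4) in which each \(\pi_k\) multiplies the singular values of the k-th residual. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, C, k2 = 3, 4, 9                       # toy sizes (hypothetical)
W = rng.standard_normal((K, C, k2))      # K static kernels as C x k^2 matrices
pi = rng.random(K)
pi /= pi.sum()                           # enforce sum_k pi_k = 1

W0 = W.mean(axis=0)                      # mean kernel
delta = W - W0                           # residuals, Eq. (2)

# Left-hand side: the aggregated kernel of Eq. (1).
lhs = np.tensordot(pi, W, axes=1)

# Reduced SVD of each residual, Eq. (3).
svds = [np.linalg.svd(d, full_matrices=False) for d in delta]
r = svds[0][1].size                      # number of singular values per residual

# Block form of Eq. (4): W0 + U Pi S V^T.
U = np.hstack([u for u, _, _ in svds])                 # C x (K*r)
S = np.diag(np.concatenate([s for _, s, _ in svds]))   # (K*r) x (K*r)
Vt = np.vstack([vt for _, _, vt in svds])              # (K*r) x k^2
Pi = np.diag(np.repeat(pi, r))                         # each pi_k repeated r times
rhs = W0 + U @ Pi @ S @ Vt

print(np.allclose(lhs, rhs))
```

The residuals sum to zero by construction, which is why the \(W_0\) term carries no attention weight once the weights sum to 1.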

Thirdly, the constraint \(\sum _{k = 1}^K {\pi _k}({\textbf{x}}) = 1\) is relaxed and the dynamic convolution is expressed along the channel dimension. If the number of channels of a single regular convolution is \(C\), then the number of channels of \(K\) regular convolutions is \(KC\), as shown in Eq. (5):

$$\begin{aligned} {\textrm{W}}({\textbf{x}}) = \Lambda ({\textbf{x}}){{\textrm{W}} _0} + \sum \limits _{i = 1}^{KC} {\pi _{\lceil i/C \rceil }}({\textbf{x}})\,{{\textrm{u}} _i}\,{{\textrm{s}} _{i,i}}\,{{\textrm{v}} _i}^{\textrm{T}} \end{aligned}$$
(5)

Where \({{\textrm{u}} _i}\) and \({{\textrm{v}} _i}\) are the i-th column vectors of \({\textrm{U}}\) and \({\textrm{V}}\), and \({{\textrm{s}} _{i,i}}\) is the i-th diagonal element of \({\textrm{S}}\).

Finally, the dimension of the hidden space is manually set to \(L\), with \(L \ll C\), and the summation term in Eq. (5) is replaced by the product of \(P\), \(\Phi (x)\) and \(R^T\). The expression of the whole DCD is shown in Eq. (6):

$$\begin{aligned} { W (x)=\Lambda (x) W_0 + P \Phi (x) R^T } \end{aligned}$$
(6)

Where x represents the input feature map, and W(x) and \(W_{0}\) are both \(C\times k^2\) matrices, in which C is the number of channels and \(k\times k\) is the convolution kernel size. W(x) is the DCD kernel, \(\Lambda (x)\) is a \(C\times C\) diagonal matrix generated by attention, and \(W_{0}\) is the mean convolution kernel. R is a \(k^2\times L\) matrix that compresses the number of kernel elements from \(k^2\) to \(L\). \(\Phi (x)\) is an \(L\times L\) matrix that dynamically fuses the \(L\) elements, which significantly reduces the dimension of the hidden space. \(P\) is a \(C\times L\) matrix used to restore the dimension. We set \(L =\lfloor k^2/2 \rfloor\) to reduce the number of parameters. In this paper, \(k =3\) and \(L =4\).
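The shapes in Eq. (6) can be made concrete with a short NumPy sketch. It only assembles the kernel matrix; the values of \(\Lambda (x)\) and \(\Phi (x)\) are random stand-ins for the outputs of an input-dependent branch, and the channel count, function name and variable names are hypothetical.

```python
import numpy as np

def dcd_kernel(W0, P, R, lam, phi):
    """Eq. (6): W(x) = Lambda(x) W0 + P Phi(x) R^T.

    W0: (C, k*k), P: (C, L), R: (k*k, L),
    lam: (C,) diagonal of Lambda(x), phi: (L, L).
    """
    # Scaling row c of W0 by lam[c] equals multiplying by diag(lam) on the left.
    return lam[:, None] * W0 + P @ phi @ R.T

C, k = 16, 3                     # C is a toy value; k = 3 as in the text
L = (k * k) // 2                 # L = floor(k^2 / 2) = 4
rng = np.random.default_rng(2)
W0 = rng.standard_normal((C, k * k))
P = rng.standard_normal((C, L))
R = rng.standard_normal((k * k, L))

# Stand-ins for the input-dependent branch outputs (hypothetical values).
lam = rng.random(C)
phi = rng.standard_normal((L, L))

W_x = dcd_kernel(W0, P, R, lam, phi)
static_params = W0.size + P.size + R.size   # 144 + 64 + 36 static weights
print(W_x.shape, L, static_params)
```

With \(\Lambda (x)\) equal to the identity and \(\Phi (x)\) equal to zero, the sketch returns the static kernel \(W_0\), so the dynamic part acts as a low-rank, input-dependent correction.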

Furthermore, to further refine the features, we add TA to the encoding-decoding process so that the network focuses more on segmentation targets. Let f denote the feature extracted by DCD, expressed as \(f=x \odot W(x)\), where \(\odot\) represents element-wise multiplication. The shape of f is \(C\times H\times W\), where C is the number of channels, H is the height, and W is the width. The structure of TA is shown in Fig. 2.

Figure 2

The architecture of TA.

TA has a three-branch structure. The first and second branches are similar: each uses rotation, Z-Pool, convolution and Sigmoid to realize cross-dimension interaction, between C and H in one branch and between C and W in the other. The third branch computes spatial attention weights. Finally, the outputs of the three branches are averaged. The resulting feature map not only has the advantages of CA and SA but also carries cross-dimensional interaction information, which refines the features. The Z-Pool23 layer reduces dimension 0 of the tensor to two channels by concatenating the max-pooled and average-pooled features along that dimension. This allows the layer to retain a rich representation of the original tensor while reducing its depth, making subsequent computation less intensive. The expression of Z-Pool is shown in Eq. (7):

$$\begin{aligned} \textrm{Z-Pool}(f)=[\textrm{Maxpool}_{0d}(f),\textrm{Avgpool}_{0d}(f)] \end{aligned}$$
(7)

where \(\textrm{Maxpool}_{0d}(f)\) and \(\textrm{Avgpool}_{0d}(f)\) denote max pooling and average pooling along dimension 0, respectively.
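A minimal NumPy sketch of Eq. (7) follows. The function name and the toy tensor are hypothetical, and the axis permutation in the last lines is only one plausible way to express the rotation step; the convolution and Sigmoid that follow Z-Pool in each branch are omitted.

```python
import numpy as np

def z_pool(f):
    """Eq. (7): stack max pooling and average pooling over dimension 0.

    f has shape (D, A, B); the result has shape (2, A, B).
    """
    return np.stack([f.max(axis=0), f.mean(axis=0)], axis=0)

f = np.arange(24, dtype=float).reshape(3, 2, 4)   # toy tensor with C=3, H=2, W=4
z = z_pool(f)                                     # pools over C -> (2, H, W)

# Rotating the tensor moves another axis to dimension 0 before pooling.
z_h = z_pool(np.transpose(f, (1, 0, 2)))          # pools over H -> (2, C, W)
print(z.shape, z_h.shape)
```

Whatever the size of the pooled dimension, the output always has two channels, which is what keeps the subsequent convolution cheap.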

Algorithm 1 shows the pseudocode of the model training process proposed in this paper.

Algorithm 1

Train DTA-UNet.

Experiments

Datasets and evaluation metrics

Dataset description

In order to evaluate the generalization ability and robustness of the DTA-UNet, this paper will conduct experiments on three different types of datasets. The three datasets have the following things in common: (1) The size, shape and location of the lesion are quite different, (2) The boundary of the lesion is blurred. The three datasets are described in detail as follows:

AIS Dataset This dataset consists of MRI images of stroke patients provided by the cooperative hospital and is referred to as the AIS dataset. It contains MRI images of 132 ischemic stroke patients. About 20 axial slices with a resolution of \(160\times 160\) were obtained for each patient, for a total of 2644 axial slices. The labels were annotated by neurosurgeons from the cooperative hospital using ITK-SNAP 3.8.0. In this paper, 771 slices containing lesions were selected from the 2644 slices and resized to \(256\times 256\), and the training and testing sets were split at a ratio of 8:2.

COVID-SemiSeg Dataset This is a public dataset for COVID-19 segmentation provided by Fan et al.24, which contains 1698 two-dimensional CT images with different resolutions. Among them, 98 CT images have ground truth (GT) annotated by doctors, and 1600 CT images have pseudo-labels generated by semi-supervised learning. The dataset is available at: https://drive.google.com/file/d/1bbKAqUuk7Y1q3xsDSwP07oOXN_GL3SQM/view. In this paper, all 1698 CT images are resized to \(256\times 256\). We use 50 CT images with GT and the 1600 CT images with pseudo-labels as the training set, and the remaining 48 images with GT as the testing set.

ISIC 2018 Dataset25 This is a public dataset released by the International Skin Imaging Collaboration (ISIC) in 2018. It contains a total of 2598 labeled two-dimensional dermoscopic images whose resolutions vary, for example \(767\times 576\) and \(6688\times 4459\). The dataset is available at: https://challenge.isic-archive.com/data/#2018. In this paper, we resize all images to \(256\times 256\) and split the training and testing sets at a ratio of 8:2.
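An 8:2 split of the kind described above could be sketched as follows. This is only an illustration: the text does not state whether the split was random, which seed was used, or whether slices from one patient were kept together, so the shuffling, the seed and the identifiers below are all assumptions.

```python
import random

def split_8_2(items, seed=0):
    """Shuffle reproducibly and split into 80% training / 20% testing."""
    items = list(items)
    random.Random(seed).shuffle(items)      # seeded so the split is repeatable
    n_train = (len(items) * 8) // 10        # integer arithmetic avoids rounding surprises
    return items[:n_train], items[n_train:]

slice_ids = [f"slice_{i:04d}" for i in range(771)]   # hypothetical identifiers
train_ids, test_ids = split_8_2(slice_ids)
print(len(train_ids), len(test_ids))
```

For datasets with several slices per patient, splitting by patient rather than by slice avoids placing near-identical slices in both sets; whether that was done here is not stated.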

Evaluation indicators

We selected Intersection over Union (IoU), Dice Similarity Coefficient (DSC), Hausdorff Distance (HD) and Precision (Pre) as evaluation indicators for lesion segmentation. IoU measures the degree of overlap between the predicted region and the real region. It objectively evaluates how accurately the model localizes the segmentation target, accounts for both the location and the size of the target, and is versatile and intuitive. DSC also measures the overlap between the predicted region and the real region, and it is more sensitive to the segmentation results of small targets. HD measures the farthest distance between two point sets and is used to evaluate the distance between the segmentation result and the ground truth. It reflects the difference between the predicted boundary and the real boundary, so it is important in scenarios where boundary accuracy matters. Pre measures the predictive accuracy of the model for the target region. In addition, the number of parameters (Params) measures the space complexity of a model. In this paper, lesions differ in shape, size and location across stroke segmentation, COVID-19 lesion segmentation and skin cancer lesion segmentation, and clinical applications limit the available computational resources while still requiring high accuracy, so the above evaluation indicators are selected. The values of IoU and DSC lie in [0,1]; the larger the value, the greater the overlap, that is, the closer the prediction is to the true label. IoU and DSC are computed from the numbers of True Positives (TP), False Positives (FP) and False Negatives (FN), as shown in Eqs. (8) and (9):

$$\begin{aligned} IoU= & {} \frac{TP}{TP+FP+FN} \end{aligned}$$
(8)
$$\begin{aligned} DSC= & {} \frac{2TP}{2TP+FP+FN} \end{aligned}$$
(9)

Pre reflects the proportion of TP among all predicted positive samples, where TP and FP are defined as above, as shown in Eq. (10):

$$\begin{aligned} Pre=\frac{TP}{TP+FP} \end{aligned}$$
(10)
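Eqs. (8), (9) and (10) can be computed directly from two binary masks, as in the following sketch. The function names and the toy masks are hypothetical, and returning 0.0 for an empty denominator is an assumption, since conventions for that case differ.

```python
import numpy as np

def confusion_counts(pred, target):
    """Count TP, FP and FN for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))
    return tp, fp, fn

def _ratio(num, den):
    # Assumption: return 0.0 for an empty denominator; conventions differ.
    return num / den if den else 0.0

def overlap_metrics(pred, target):
    """IoU (Eq. 8), DSC (Eq. 9) and Pre (Eq. 10)."""
    tp, fp, fn = confusion_counts(pred, target)
    return {
        "IoU": _ratio(tp, tp + fp + fn),
        "DSC": _ratio(2 * tp, 2 * tp + fp + fn),
        "Pre": _ratio(tp, tp + fp),
    }

pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
m = overlap_metrics(pred, target)   # TP = 2, FP = 1, FN = 1
print(m)
```

For this toy pair, IoU = 2/4 = 0.5 and DSC = 4/6, which agrees with the identity DSC = 2 IoU / (1 + IoU) that follows from Eqs. (8) and (9).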

HD is a measure of the maximum mismatch between set A and set B, as defined in Eqs. (11), (12) and (13):

$$\begin{aligned} HD(A,B)= & {} \max \{h(A,B),h(B,A)\} \end{aligned}$$
(11)
$$\begin{aligned} h(A,B)= & {} \mathop {\max }\limits _{a\in A} \{\mathop {\min }\limits _{b\in B} {\parallel a-b\parallel }\} \end{aligned}$$
(12)
$$\begin{aligned} h(B,A)= & {} \mathop {\max }\limits _{b\in B} \{\mathop {\min }\limits _{a\in A} {\parallel b-a\parallel }\} \end{aligned}$$
(13)

where \(\parallel \cdot \parallel\) denotes the Euclidean (2-) norm, and h(A, B) is the maximum, over all points a in set A, of the distance from a to its closest point b in set B. Similarly, h(B, A) is the maximum, over all points b in set B, of the distance from b to its closest point a in set A.
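A brute-force NumPy sketch of Eqs. (11), (12) and (13) on small point sets is given below. It assumes both sets are non-empty and uses hypothetical function names; how the point sets are extracted from the masks (for example, boundary pixels only) is not specified in the text and is left out.

```python
import numpy as np

def directed_hausdorff(A, B):
    """h(A, B): largest distance from a point in A to its nearest point in B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise Euclidean distances, shape (len(A), len(B)).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

def hausdorff(A, B):
    """HD(A, B) = max{h(A, B), h(B, A)}, Eq. (11)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0, 0), (1, 0)]
B = [(0, 0), (4, 0)]
print(directed_hausdorff(A, B), directed_hausdorff(B, A), hausdorff(A, B))
```

In this example h(A, B) = 1 and h(B, A) = 3, so the two directed distances differ and HD takes the larger one, 3. The pairwise distance matrix grows with the product of the set sizes, so this form suits only small point sets.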

Experimental details

The AIS (stroke segmentation) dataset, the COVID-SemiSeg dataset and the ISIC 2018 dataset were trained for 300, 300 and 200 epochs, respectively. For all datasets, the batch size is 12, the optimizer is Adam, and the learning rate is 0.001. The software environment is Python 3.7 with PyTorch 1.8 and CUDA 11.1, and the main packages include NumPy 1.21, TensorBoard 2.5 and SimpleITK 2.0. The hardware environment is a 12900KF CPU and a single Nvidia GeForce RTX 3090 GPU, with 24 GB of memory. The loss function is the Binary Cross Entropy (BCE) loss, shown in Eq. (14):

$$\begin{aligned} L_{BCE} = {-\frac{1}{ N }} \sum _{ i =1}^ N \left( {y_{i}}\cdot {\log ( p_{i} )} + (\textrm{1} - y_{i} ) \cdot \log (\textrm{1} - p_{i} ) \right) \end{aligned}$$
(14)

Where \(N\) is the number of pixels, \(y_{i}\) is the label of pixel \(i\), and \(p_{i}\) is the probability of being predicted as a positive class.
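Eq. (14) can be written out in a few lines of plain Python. The clamping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of Eq. (14), and the function name is hypothetical.

```python
import math

def bce_loss(probs, labels, eps=1e-12):
    """Eq. (14): mean binary cross entropy over N pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

confident = bce_loss([0.9, 0.1], [1, 0])   # both pixels predicted well
uncertain = bce_loss([0.5, 0.5], [1, 0])   # maximally uncertain predictions
print(confident, uncertain)
```

In the first call both pixels contribute \(-\log 0.9\), while in the second both contribute \(\log 2\), which shows how the loss grows as predictions move away from the labels.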

Ablation study

To fully explore the impact of DCD and TA on model performance, we conducted an ablation study on the stroke segmentation dataset. In this experiment, Attention U-Net is used as the baseline network, and DCD and TA are added to the encoder and decoder to examine model performance. The experimental results are shown in Table 1. “\(\downarrow\)” means the smaller the better, and “\(\uparrow\)” means the larger the better. Bold font indicates the best results and italic font indicates the second-best results. The numbers in brackets are the differences from the baseline network results, with “+” and “−” indicating an increase or decrease of the corresponding index, respectively. “\(\surd\)” indicates that DCD or TA is used in the encoding-decoding process, and “\(\times\)” indicates that it is not.

Table 1 DTA-UNet ablation experiment table.

As shown in Table 1, compared with using DCD in only the encoding or only the decoding process, using DCD in both increases the number of Params slightly but significantly improves the four segmentation indicators IoU, DSC, HD and Pre. When the TA module is used in both the encoding and decoding processes, the number of Params hardly increases and all segmentation indicators improve encouragingly; the improvement is most obvious for Pre, which reaches 0.8891. When both DCD and TA are used in the encoding-decoding process, the number of Params increases by only 0.7628M and all four segmentation indicators are optimal, which improves the performance of Attention U-Net most effectively. The ablation results show that the combination of TA and AG helps the network pay more attention to the lesion, and that the introduction of DCD enables the network to learn more complex features, which is more conducive to accurate lesion segmentation. Therefore, this study uses both DCD and TA in the encoding-decoding process.

Comparison of experimental results

AIS experimental results

To evaluate DTA-UNet against other methods, we compared it with eight state-of-the-art or classic methods: U-Net1, UNet++4, CE-Net11, Attention U-Net12, Inf-Net24, SmaAt-UNet26, ELU-Net27 and CMUNeXt28. As shown in Table 2, DTA-UNet achieves the best results on all four segmentation indicators. IoU, DSC and Pre increased by 0.0386, 0.0431 and 0.0345, respectively, which shows that DTA-UNet is very effective for stroke region segmentation. HD decreased by 0.0928, which shows that DTA-UNet obtains more accurate boundaries than the other methods. At the same time, the number of Params of DTA-UNet increased by 4.5979M, which shows that it trades higher model complexity for better segmentation performance. Furthermore, although ELU-Net and SmaAt-UNet do not perform well in stroke segmentation, their lightweight design, which stems from depthwise separable convolution, requires minimal computational overhead.

Table 2 Performance comparison of different methods on the stroke segmentation dataset.

The segmentation results on the stroke segmentation dataset are shown in Fig. 3, where each row corresponds to the segmentation results for one selected patient. The first column is the original image, the second column is the GT, the third column is the segmentation result of DTA-UNet, and the fourth to eleventh columns are the segmentation results of the above eight representative methods.

Figure 3

Visual comparison of stroke segmentation results. P* stands for patient *. The red box is the magnified image of the selected region, and the green box marks mis-segmentation.

The segmentation results of five randomly selected patients are shown in Fig. 3. For patient 1, DTA-UNet showed the best performance in small-target segmentation, CMUNeXt and Inf-Net showed under-segmentation, and the other methods showed mis-segmentation, which may lead to missed diagnosis or misdiagnosis in clinical practice. For medium and large lesions, such as those of patients 2–5, the segmentation results of DTA-UNet and Inf-Net are the closest to GT, and the other methods showed different degrees of under-segmentation or mis-segmentation. In addition, as shown in the red boxes, DTA-UNet captures details better than the other methods. This shows that the introduction of DCD and TA makes the model focus more on stroke regions and boundaries, thus improving the accuracy of stroke segmentation.

COVID-SemiSeg segmentation experiment results

The segmentation results on the COVID-SemiSeg dataset are shown in Table 3. Compared with the other methods, DTA-UNet and Inf-Net performed very well, with all indicators reaching the best or second-best values. This shows that the introduction of DCD and TA is also competitive for segmenting COVID-19 CT images. DTA-UNet performed best on IoU and DSC, which increased by 0.0715 and 0.0672, respectively, and ranked second on HD; HD decreased by 0.3947 and Pre improved by 0.0593. These results show that DTA-UNet segments COVID-19 infection regions well, and that Inf-Net, which integrates multiple edge attention modules, is well suited to boundary extraction.

Table 3 Performance comparison of different methods on the COVID-SemiSeg dataset.

The visualization results on the COVID-SemiSeg dataset are shown in Fig. 4. The segmentation results of DTA-UNet and Inf-Net are closer to GT, while other methods show limitations. In addition, as shown in patient 1, DTA-UNet has a slight advantage over Inf-Net for the segmentation of infected regions. This is attributed to the excellent feature extraction ability of DTA-UNet and higher attention to infected regions. As shown in patient 2, Inf-Net is slightly better at boundary segmentation, which further verifies the results in Table 3.

Figure 4

Visual comparison of segmentation results on the COVID-SemiSeg dataset.

ISIC 2018 segmentation experiment results

The segmentation results on the ISIC 2018 dataset are shown in Table 4. DTA-UNet achieved the best results on three indicators and the second-best result on Pre. This further shows that the DTA-UNet model has good robustness and generalization ability.

Table 4 Performance comparison of different methods on the ISIC 2018 dataset.

The visual segmentation results on the ISIC 2018 dataset are shown in Fig. 5. For all dermoscopic images, DTA-UNet effectively segments the skin lesions, which, together with the segmentation indicators, further verifies the advantages of the DTA-UNet model. In contrast, for patients 1, 2 and 4, Inf-Net is slightly inferior in the precise segmentation of boundaries. In addition, as shown for patient 3, although the baseline Attention U-Net recognized the lesion, it under-segmented it. This shows that DCD and TA are necessary improvements to Attention U-Net.

Figure 5

Visual comparison of segmentation results on the ISIC 2018 dataset.

In summary, DTA-UNet shows good performance on three different medical image segmentation datasets, reflecting good universality. This is attributed to the excellent feature extraction ability of DCD and the focus of TA on lesions in DTA-UNet.

Conclusion

In this paper, we propose DTA-UNet, a universal deep learning model for lesion segmentation. First, we replace the conventional convolutions in Attention U-Net with DCD, which enhances the expressive ability of the model. Then, we use the combination of TA and AG to focus more on the target, effectively remove redundant information in the channel and spatial dimensions, and enhance segmentation performance. Compared with current mainstream methods, DTA-UNet shows excellent performance on three datasets covering different image types and disease types. However, it also adds some computational overhead, so making it more lightweight will be explored in future work. In addition, transfer learning, which has theoretical support, can be used to extend the model to datasets with different distributions and reduce training costs. Therefore, combining the proposed model with transfer learning to solve new tasks is also an important direction for future research.