Introduction

The pathogenesis of bladder cancer is closely associated with factors such as advanced age, male sex, smoking, and a family history of bladder cancer1. Since the mid-2000s, bladder cancer has become an important public health issue, as the long-standing decline in its incidence has slowed2. Its rising incidence and high medical costs place heavy burdens on society and patients3.

To achieve early detection and effective management of bladder cancer, cystoscopy and artificial intelligence are widely used in bladder cancer diagnosis4,5,6,7. However, bladder tumors, especially early-stage ones, are often small (≤ 5 mm) or hidden in bladder folds, diverticula, or near the ureteral orifice, and the class differences between lesion regions and normal regions are small8. Segmentation models must accurately capture these subtle lesions to meet clinical requirements, yet obtaining pixel-level annotations for cystoscopic images is both costly and time-consuming9. Recent studies10 have shown that semi-supervised learning still falls short on the bladder cancer segmentation task. As shown in Fig. 1, tiny tumors occupy an extremely small fraction of the pixels in cystoscopic images and have inconspicuous features, as in (a) and (b); other images, such as (c) and (d), have blurred boundaries and unclear target regions. Such cases currently depend on the judgment of experienced clinicians.

Fig. 1

Some samples of the bladder tumor dataset.

This study aims to make full use of tiny unlabeled images, improve the accuracy of bladder tumor segmentation, and obtain more reliable results that better match the ground truth. The main contributions of this study are as follows:

(1) We propose a semi-supervised segmentation method for bladder tumors, UDS-MT, which integrates the Mean Teacher model with a guided branch, effectively mitigates the problem of limited labeled data, and significantly improves bladder tumor segmentation performance.

(2) The guided branch uses uncertainty estimation to generate masks, which helps to obtain high-quality pseudo-labels and suppresses overfitting.

(3) We propose a defend loss term that computes the loss only for pixels where the model's prediction confidence is high, encouraging the model to learn the true appearance of the target area and further improving segmentation accuracy.

(4) Comparative experiments on a clinical bladder tumor image dataset demonstrate excellent segmentation results, and comparative experiments on public colonoscopy datasets verify the generalization ability and robustness of our method. Finally, ablation studies verify the role of each module.

Related work

Semi-supervised learning

Self-training architectures15 (such as local contrast loss16 and uncertainty-guided ensemble self-training17,18) and co-training frameworks19,20 (such as deep co-training21, attention-generated adversarial networks22, and methods integrating adversarial training and co-consistent learning23) are used to generate high-quality pseudo-labels for segmentation tasks, especially lesion segmentation in breast ultrasound images. These methods improve the reliability of pseudo-labels and segmentation performance through information exchange between models and uncertainty-aware mechanisms24. Consistency regularization25,26 (such as the π model27, the temporal ensembling model27, and the mean teacher model28) improves the robustness and consistency of the model through data augmentation and temporal prediction. Methods that combine pseudo-labels and consistency regularization (such as the cross-teaching consistency method29, the MC-Net + network26, and cross-consistency training30) further improve segmentation performance, but synchronizing prediction results remains a challenge. Generative adversarial networks (GANs)31 improve segmentation performance by alternately updating the discriminator and generator, but do not fully consider the preservation of multi-scale features and boundary details.

Semi-supervised learning has developed rapidly in recent years10,11,12,13,14, with methods such as mask learning (CGML)32, contrast-agent pseudo-supervision tasks33, a mean teacher model guided by triple uncertainty and contrastive learning34, shape-aware networks35, and Bayesian pseudo-labels36, all aiming to improve pseudo-label quality and segmentation performance. The Mismatch37 method enhances the learning of unreliable regions by introducing perturbations and consistency regularization, incorporating differentiated morphological feature perturbations at the feature level and thereby improving segmentation when labeled data is limited. Another method introduces a weak-to-strong perturbation strategy38 with a corresponding feature-perturbation consistency loss to efficiently utilize unlabeled data and guide the model toward reliable regions.

Existing methods have made progress in generating high-quality pseudo-labels and improving segmentation performance, but challenges remain in synchronizing model predictions, preserving boundary and detail information, and fusing multi-scale features.

Mean teacher model

Tarvainen et al.28 first proposed the mean teacher model in 2017, and it has since attracted widespread attention as an effective semi-supervised learning method. The teacher model's parameters are an exponential moving average (EMA) of the student's, and consistency regularization constrains the student model so that its predictions on labeled and unlabeled data remain consistent with the teacher's.

Wang et al.39 introduced the mean teacher model into medical image segmentation. To address the common problems of class imbalance and insufficient labeled data in medical images, the model generates high-quality soft labels from little labeled data and a large amount of unlabeled data; experiments on medical segmentation tasks such as lung and liver verified its effectiveness and advantages in this field. Building on the original mean teacher model, Liu et al.40 improved the structures of the teacher and student models and the calculation of the consistency loss, improving segmentation accuracy especially for complex lesion areas and small targets. Li et al.41 proposed a multi-scale mean teacher model for extracting multi-scale image features, which performs well in segmenting lesions of different sizes and shapes. Zhu et al.42 introduced a 2D/3D hybrid dual-teacher model with uncertainty-weighted regularization to achieve near fully supervised segmentation accuracy on very few labeled MRI data. Xiao et al.43 integrated local CNN details with global Transformer features in a mean teacher framework, significantly improving semi-supervised cardiac segmentation through cross pseudo-labeling and uncertainty screening.

These studies demonstrate the potential of the mean teacher model in medical image segmentation. Our approach extends the Mean Teacher framework, introduces a supervised branch, and integrates uncertainty estimation and consistency regularization to further improve the model's performance.

Uncertainty estimation

Uncertainty estimation plays an increasingly important role in semi-supervised medical image segmentation. Hua et al.44 proposed an uncertainty-driven method in which uncertainty estimation effectively guides the model to focus on hard-to-classify areas, significantly improving segmentation performance when labeled data is scarce. Zhu et al.42 incorporated uncertainty awareness into the mean teacher model and found that this mechanism enables the model to better utilize unlabeled data. Wei et al.45 proposed an uncertainty-based multi-scale feature learning method that learns and filters multi-scale features through uncertainty estimation, better capturing lesion areas of different sizes and shapes. Wang et al.46 constructed a framework based on uncertainty-guided adversarial training, in which uncertainty estimation guides the training of a generative adversarial network and the generator's images are adversarially compared with real images based on uncertainty features.

Uncertainty estimation shows strong potential in semi-supervised medical image segmentation. By combining it with semi-supervised learning methods, it can effectively utilize unlabeled data and improve the performance of segmentation models.

Method

Overview

We propose a semi-supervised learning method, as shown in Fig. 2, which uses the guided branch to guide the generation of masks and jointly trains the supervised branch and the Mean Teacher model to fully utilize unlabeled data. To enhance the reliability of pseudo-labels, we calculate the loss by combining the masks with the pseudo-labels of the student model and the teacher model, respectively.

Fig. 2

We propose the semi-supervised segmentation method: (1) for labeled data, the pseudo-labels generated by Model-G and Model-S mutually constrain each other for consistency, and both models are trained under the supervision of real labels to obtain the initial parameters of the model; (2) for unlabeled data, the pseudo-labels generated by Model-G produce masks under uncertainty estimation, and the masks are used as supervisory signals to measure the reliability of the pseudo-labels of Model-S and Model-T.

We describe the method as follows. Let \(\:D={D}_{l}+{D}_{u}\) represent the entire dataset, where \(\:{D}_{l}\) and \(\:{D}_{u}\) denote the labeled and unlabeled datasets, respectively. We use \(\:\left({x}_{i},{y}_{i}\right)\in\:{D}_{l}\) to denote a labeled image pair, where \(\:{y}_{i}\) is the ground truth, and \(\:{u}_{j}\in\:{D}_{u}\) to denote an unlabeled image. In our method, Model-T performs the segmentation task, and Model-G participates in supervising the predicted probabilities of labeled and unlabeled images in each batch. The encoders and decoders of Model-G, Model-S, and Model-T have the same structure but independent parameters. The Pyramid Vision Transformer (PVT) encoder combines the Transformer's global modeling ability with multi-scale features akin to a CNN's, giving it efficient feature extraction capabilities. As shown in Fig. 3, the model uses the PVT encoder to extract four layers of feature information and uses the CFM to cross the features of the last three layers, reducing the differences between features and avoiding redundant information. This prevents fuzzy feature information from being filtered out, which matters especially for tiny segmentation targets or the boundaries of blurry images. We concatenate the first-layer features with the crossed features to obtain the model's prediction probability.

Fig. 3

Network backbone workflow: we extract four layers of feature information through the PVT encoder and use the Cross Feature Module (CFM) to cross the features of the last three layers, reducing the differences between features and avoiding redundant information. To prevent fuzzy feature information from being filtered out, we concatenate the first-layer features to obtain the model's prediction probability.

Let \(\:\theta\:\) and \(\:{\theta\:}^{{\prime\:}}\) denote the weight parameters of Model-S and Model-T, respectively. Model-T's parameters are updated through the EMA, i.e., by combining the previous teacher parameters with the current student parameters \(\:\theta\:\) under the smoothing coefficient α, where t denotes the training step.

$$\:{\theta\:}_{t}^{{\prime\:}}=\alpha\:{\theta\:}_{t-1}^{{\prime\:}}+(1-\alpha\:){\theta\:}_{t}$$
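The EMA update above can be sketched in plain Python; the flat parameter lists and the `alpha` value are illustrative assumptions, not the released implementation:

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """One EMA step: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t,
    applied element-wise to every weight of the teacher (Model-T)."""
    return [alpha * tp + (1.0 - alpha) * sp
            for tp, sp in zip(teacher_params, student_params)]
```

In a PyTorch training loop this would run after each optimizer step, iterating over `teacher.parameters()` and `student.parameters()` under `torch.no_grad()` so the teacher never receives gradients.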

When labeled data is used, it is processed by Model-G and Model-S to extract target features, and the outputs are constrained by the consistency loss \(\:{l}_{sc}\) and the supervision losses \(\:{l}_{s1}\) and \(\:{l}_{s2}\). When an unlabeled image is input, Model-G, Model-S, and Model-T process it simultaneously to generate prediction results, which mutually constrain each other through consistency and cross-entropy losses, denoted \(\:{l}_{c1}\), \(\:{l}_{c2}\), and \(\:{l}_{c3}\). To reduce the coupling between models, we introduce uncertainty estimation in Model-G, generate pseudo-labels and masks from its predictions, and calculate the loss of the segmentation results of Model-S and Model-T under the constraint of the mask.

Progressive increase strategy of consistency weight

Consistency regularization is a method that calculates the loss by measuring the consistency between the prediction results of the student model and the teacher model in semi-supervised learning, thereby constraining the prediction behavior of the student model on labeled data and unlabeled data. During the training process, the consistency weight will gradually increase, so that the model mainly relies on labeled data for learning in the early stage of training, and gradually uses more information from unlabeled data in the later stage of training. When the labeled data is limited, this method can help the model fully explore the valuable information in unlabeled data and improve the generalization ability of the model.
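One common realization of such a progressively increasing weight, consistent with the Gaussian ramp-up used later for the dynamic threshold, is sketched below; `w_max` is an assumed maximum weight, not a value from the paper:

```python
import math

def consistency_weight(t, total, w_max=1.0):
    """Gaussian ramp-up: w(t) = w_max * exp(-5 * (1 - t/total)^2).
    Near zero early in training, approaching w_max as t -> total,
    so labeled data dominates early and unlabeled data later."""
    if total == 0:
        return w_max
    phase = 1.0 - min(t, total) / total
    return w_max * math.exp(-5.0 * phase * phase)
```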

Uncertainty estimation

To improve the quality of the segmented target shapes, the method introduces an uncertainty estimation mechanism. The uncertainty of the supervised branch's predictions on unlabeled data is estimated to guide training and generate a mask, which then imposes pixel-level constraints on the predictions of the other branches. This strategy fully mines reliable information in unlabeled data, further improving model performance.

Uncertainty estimation based on information entropy not only directly quantifies ambiguity in probabilistic model outputs but also captures both aleatoric and epistemic uncertainty, which suits the fuzzy information caused by noise in cystoscopic images. It is computationally efficient and easy to integrate into training workflows.

Loss function

We design a loss function that covers both supervised and unsupervised learning scenarios, enhancing semi-supervised segmentation performance by combining consistency regularization, cross-entropy loss, and Dice loss. We denote the size of the input image by H×W and the position of a pixel in the image by \(\:\left(h,w\right)\). The overall loss function has two parts: one for labeled data and one for unlabeled data.

The loss function for labeled data \(\:{\varvec{L}}_{1}\)

The loss function for labeled data includes the supervision loss \(\:{l}_{s}\) between the prediction results of Model-G and Model-S and the true label, and the consistency loss \(\:{l}_{sc}\) between the prediction results of Model-G and the student model.

$$\:{L}_{1}={l}_{s}\left({P}_{g,T}\left({x}_{i}\right),{y}_{i}\right)+\gamma\:{l}_{sc}\left({P}_{g}\left({x}_{i}\right),{P}_{T}\left({x}_{i}\right)\right)$$

Dice loss is calculated under the constraint of the true label to measure the accuracy between the prediction result and the true label, while the consistency loss is evaluated by cross-entropy loss to measure the probability distribution difference between the two prediction results.

$$\:{l}_{s}={L}_{Dice}\left({P}_{g}\left({x}_{i}\right),{y}_{i}\right)+{L}_{Dice}\left({P}_{T}\left({x}_{i}\right),{y}_{i}\right)$$
$$\:{l}_{sc}={L}_{CE}\left({P}_{g}\left({x}_{i}\right),{P}_{T}\left({x}_{i}\right)\right)$$
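A minimal NumPy sketch of this labeled-data loss for the binary case; the probability maps `p_g`, `p_t`, the binary ground truth `y`, and the function names are illustrative assumptions:

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss between a predicted probability map and a binary mask."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def ce_consistency(p, q, eps=1e-7):
    """Binary cross-entropy of prediction q against prediction p as soft target."""
    q = np.clip(q, eps, 1.0 - eps)
    return float(-(p * np.log(q) + (1 - p) * np.log(1 - q)).mean())

def labeled_loss(p_g, p_t, y, gamma):
    """L1 = l_s + gamma * l_sc, following the formulas above."""
    l_s = dice_loss(p_g, y) + dice_loss(p_t, y)
    l_sc = ce_consistency(p_g, p_t)
    return l_s + gamma * l_sc
```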

Unlabeled data loss function \(\:{\varvec{L}}_{2}\)

Since unlabeled data lacks manually annotated true labels, the predictions of Model-G, after the softmax function, provide pseudo-label supervision signals for the predictions of Model-S and Model-T. The unlabeled data loss includes the consistency loss \(\:{l}_{c}\) among the predictions of Model-G, Model-S, and Model-T, and the supervised information loss \(\:{l}_{p}\).

\(\:{l}_{c}\) applies consistency constraints so that the models' predictions under different conditions agree as much as possible.

$$\:{l}_{c}={L}_{CE}\left({P}_{g}\left({u}_{j}\right),{P}_{T}\left({u}_{j}\right)\right)+{L}_{CE}\left({P}_{g}\left({u}_{j}\right),{P}_{S}\left({u}_{j}\right)\right)+{L}_{CE}\left({P}_{S}\left({u}_{j}\right),{P}_{T}\left({u}_{j}\right)\right)$$

We use the defend loss \(\:{l}_{p}\), combined with uncertainty estimation, to extend the cross-entropy loss.

Pseudo-label supervision

Pseudo-labels \(\:{Y}_{g}\left({u}_{j}\right)\) are generated from the output of Model-G, \(\:{P}_{g}\left({u}_{j}\right)\), to guide the training of Model-S and Model-T on unlabeled images:

$$\:{Y}_{g}\left({u}_{j}\right)=\text{argmax}\left(\text{softmax}\left({P}_{g}\left({u}_{j}\right)\right)\right)$$
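As a sketch, for logits with shape `(classes, H, W)` (the axis convention is an assumption):

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_labels(logits):
    """Y_g = argmax(softmax(P_g)) over the class axis, one label per pixel."""
    return softmax(logits, axis=0).argmax(axis=0)
```

Note that argmax of a softmax equals argmax of the raw logits; the softmax matters when the probabilities themselves feed the entropy-based uncertainty estimate below.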

Uncertainty estimation

Since the pseudo-labels generated by the guide branch lack additional constraint information, their reliability may be low. In order to improve the reliability of pseudo-labels, we introduced an uncertainty estimation method based on information entropy. Specifically, when the probability distribution of the prediction results is more uniform, the information entropy is larger, indicating higher uncertainty; when the probability distribution is more concentrated, the information entropy is smaller, indicating lower uncertainty. We quantify uncertainty by calculating the information entropy of the pseudo-labels generated by the guide branch. The higher the entropy value, the more uncertain the model’s prediction for that area.

To select areas where the pseudo-labels are highly certain, we set a dynamic threshold (\(\:{T}_{threshold}\)) that flexibly retains reliably predicted areas according to the training stage.

$$\:{T}_{threshold}=0.75+0.25\:\gamma\:\left(t,T\right)$$
$$\:\gamma\:\left(t,T\right)=\left\{\begin{array}{ll}1.0&if\:T=0\\\:\text{exp}\left(-5.0\times\:{\left(1-\frac{t}{T}\right)}^{2}\right)&if\:T\ne\:0\end{array}\right.$$

where \(\:T\) denotes the total number of training iterations and \(\:t\) denotes the current iteration.
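Putting the pieces together, a sketch of the entropy map, the dynamic threshold, and the resulting mask follows; normalizing the entropy by log(C) to obtain a certainty in [0, 1] before comparing against the threshold is our assumption, since the paper does not spell out the exact mapping:

```python
import numpy as np

def entropy_map(probs, eps=1e-12):
    """Per-pixel information entropy of a (classes, H, W) probability map."""
    return -(probs * np.log(probs + eps)).sum(axis=0)

def dynamic_threshold(t, total):
    """T_threshold = 0.75 + 0.25 * gamma(t, total), as defined above."""
    gamma = 1.0 if total == 0 else np.exp(-5.0 * (1.0 - t / total) ** 2)
    return 0.75 + 0.25 * gamma

def reliable_mask(probs, t, total):
    """Keep pixels whose certainty (1 - normalized entropy) exceeds the threshold."""
    h = entropy_map(probs)
    h_max = np.log(probs.shape[0])  # entropy of a uniform class distribution
    certainty = 1.0 - h / h_max
    return certainty >= dynamic_threshold(t, total)
```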

Areas with higher uncertainty are considered unreliable, and we retain the more reliable areas as a mask based on the threshold. \(\:M\) denotes the number of elements in the set \(\:\phi\:\) of pixels selected by the mask. Using the pseudo-labels as targets and the mask as the supervision signal, the supervised information loss of the predictions of Model-S and Model-T is measured:

$$\:{l}_{p}=\:\frac{1}{M}{\sum\:}_{h,w\in\:\phi\:}\left({P}_{S}log\left({Y}_{g}\right)+{P}_{T}log\left({Y}_{g}\right)\right)$$
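Reading this equation as a per-pixel cross-entropy against the binary pseudo-label, restricted to the M masked pixels, gives the following sketch (the sign convention and the binary reading are our assumptions):

```python
import numpy as np

def defend_loss(p_s, p_t, y_g, mask, eps=1e-7):
    """Masked pseudo-label loss l_p: cross-entropy of the student (P_S) and
    teacher (P_T) probability maps against the pseudo-label Y_g, averaged
    over the M pixels selected by the reliability mask phi."""
    m = mask.sum()
    if m == 0:
        return 0.0  # no reliable pixels this step
    p_s = np.clip(p_s, eps, 1.0 - eps)
    p_t = np.clip(p_t, eps, 1.0 - eps)
    ce = -(y_g * np.log(p_s) + (1 - y_g) * np.log(1 - p_s)
           + y_g * np.log(p_t) + (1 - y_g) * np.log(1 - p_t))
    return float((ce * mask).sum() / m)
```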

The weight of the consistency loss \(\:{\upgamma\:}\) gradually increases during the training process, which makes the model mainly use the labeled data for learning in the early stage of training, and gradually use more information from the unlabeled data in the later stage of training. This design can effectively play the role of unlabeled data, thereby improving the generalization ability and segmentation accuracy of the network.

The total loss function combines the loss function of the labeled data and the unlabeled data. By dynamically adjusting the consistency weight, the role of labeled data and unlabeled data is balanced.

Dataset

The experiment used the cystoscopy dataset provided by the Affiliated Hospital of Anhui Medical University and the open access polyp datasets Kvasir-SEG, CVC-ColonDB, and ETIS-LaribPolypDB for comparative experiments to verify the stability of the network. These datasets are authentic and validated by experienced experts, and sample images are shown in Fig. 4.

Fig. 4

Sample images from each dataset.

(1) Bladder tumor dataset: This dataset includes 1948 images of different bladder areas with a resolution of 1920 × 1072. Cystoscopy images are pivotal for diagnosing bladder cancer, which has a high recurrence rate. Tumor margins often lack distinct boundaries, and blood/mucus interference degrades image quality, while rare tumors require careful sampling to avoid bias.

(2) Kvasir-SEG47: The dataset is a critical dataset for colorectal polyp segmentation and contains resolutions ranging from 332 × 487 to 1920 × 1072. It has 1000 images with different polyp regions. Polyps vary in size, shape, and texture, with flat lesions blending into the mucosal background. The dataset was expanded with a subset of 196 challenging flat polyps, enhancing its utility for real-world clinical scenarios.

(3) CVC-ColonDB dataset48: The dataset has a resolution of 500 × 573 and contains 300 images extracted from 13 polyp video sequences. It is foundational for polyp detection research.

(4) ETIS-LaribPolypDB dataset49: The dataset has a resolution of 1225 × 966 and contains 196 images. Mucosal folds and vascular patterns often obscure polyp boundaries, requiring models to learn subtle texture differences. With fewer images than Kvasir-SEG, overfitting is a risk.

For each dataset, we split the images into training, validation, and test sets at a 6:2:2 ratio. We used the same evaluation metrics on the comparison datasets.

Evaluation metrics

We evaluate with six metrics: Dice coefficient (Dice), mean Intersection over Union (mIoU), structure measure (Sm), enhanced-alignment measure (Em), Mean Absolute Error (MAE), and Accuracy (Acc).
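For binary masks, the overlap metrics reduce to simple pixel counts; a sketch (mIoU averages per-class or per-image IoU, a detail we assume here):

```python
import numpy as np

def dice_coeff(pred, gt, eps=1e-6):
    """Dice = 2|P ∩ G| / (|P| + |G|) for boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou(pred, gt, eps=1e-6):
    """IoU = |P ∩ G| / |P ∪ G|; mIoU averages this over classes or images."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def mae(prob, gt):
    """Mean Absolute Error between a probability map and the binary mask."""
    return float(np.abs(prob - gt).mean())
```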

Implementation details

We implement our method with the PyTorch library, using the SGD optimizer with an initial learning rate of 0.001 decayed by the polynomial factor \(\:{\left(1-\frac{epoch}{{n}_{epoch}}\right)}^{0.9}\) over 500 training epochs. Each training step processes two labeled and two unlabeled images. All experiments are run on a cloud platform.
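The printed decay factor is ambiguous; we read it as the standard "poly" schedule, sketched below (if the authors meant a different factor this would need adjusting):

```python
def poly_lr(base_lr, epoch, n_epoch, power=0.9):
    """lr = base_lr * (1 - epoch / n_epoch) ** power, the 'poly' schedule
    commonly paired with SGD in segmentation networks."""
    return base_lr * (1.0 - epoch / n_epoch) ** power
```

In PyTorch this could be wired up with `torch.optim.lr_scheduler.LambdaLR`, stepping once per epoch.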

Experiments and results

Our method is compared and evaluated against several state-of-the-art methods: the Mean Teacher framework (MT)28, the Uncertainty-Aware Mean Teacher (UAMT)17, the Dual Task Consistency strategy (DTC)50 that cleverly combines two complementary tasks, Polyp-Aware Mixture and Dual-Level Consistency Regularization (PolypMix)51, the Twin Student Mean Teacher (DSST)52, weak-to-strong perturbation consistency (W2sPC)38, and expectation-maximization pseudo-labels (EMSSL)53. These comparisons are crucial for evaluating the performance of UDS-MT. In these experiments, 15% of the training data is randomly selected as labeled data.

We implemented these methods using published code and following the same settings. For data augmentation, we apply the same technique, performing simple resizing, random flipping, and rotation. Similarly, we use the same decoder and PVT encoder for all method backbones in the experiments.

Bladder tumor dataset segmentation results

Our method is also compared with the fully supervised method, which is important for evaluating the performance gap between UDS-MT and full supervision. We train fully supervised networks with the same backbone on 187 (15%), 374 (30%), and 1246 (100%) labeled images, denoted as "supervised", to demonstrate the effectiveness of the method.

Table 1 Comparison results of the bladder tumor dataset on the fully supervised method and different semi-supervised methods on MT, UAMT, DTC, PolypMix, DSST, W2Spc, EMSSL, and UDS-MT. The indicator results are selected as the test mean percentage results on the test set.

As shown in Table 1, when the labeled data accounts for 30%, the results of fully supervised methods are similar to those of our method under the 15% limit. When the labeled data is only 15% of the bladder tumor training set, compared with other semi-supervised methods, Dice is improved by at least 2.81%, mIoU is improved by at least 3.48%, sm and em are also improved to varying degrees, MAE is reduced to 5.18%, and accuracy is also improved.

There are two main reasons for the improvement. First, deep learning methods54,55 rely heavily on large, high-quality training sets, so a limited labeled dataset restricts fully supervised performance. Our method exploits a large amount of unlabeled images; through pseudo-label generation and consistency regularization, the model learns richer feature representations from limited labeled data. This strategy alleviates the shortage of labeled data and improves the efficiency with which unlabeled data is used, yielding significant gains in segmentation accuracy. Second, overfitting may occur when labeled data is scarce. By introducing uncertainty estimation and consistency loss, we improve not only segmentation accuracy but also the model's robustness to noise and outliers, enabling it to output high-quality segmentation results more stably on complex bladder tumor images and further improving key indicators such as Dice and mIoU. Meanwhile, the improvements in Sm and Em show that the model also performs well in structure preservation and edge detection, and the reduction in MAE shows improved prediction accuracy.

As shown in Fig. 5, we provide visualization results of different methods under the 15% limited dataset. The results for (a) and (b) indicate that our method is effective for the early diagnosis of tiny bladder tumors. The visualizations show that our method provides relatively accurate results under limited labeled data, with good segmentation whether the input is a small tumor, a blurry low-quality image, or an image with highlights.

Fig. 5

Test results of different methods under 15% limited bladder tumor dataset training.

Comparative dataset segmentation results

In order to verify the effectiveness of our method in the bladder tumor segmentation task, we conducted extensive comparative experiments on the publicly accessible polyp dataset. We followed the same settings as the bladder tumor dataset and used 15% of the training set for experiments.

The Kvasir-SEG dataset shows a variety of polyp morphologies and imaging conditions, making it a very challenging test platform. According to the data partition ratio, we randomly selected 90 (15%) and 180 (30%) images as labeled data. We simulated the actual situation with limited labeled data to evaluate the effectiveness of our method. The results are shown in Table 2.

Table 2 Contains 600 polyp images. Kvasir-SEG test percentage results under different methods.

The ETIS_LaribPolypDB dataset contains 196 images. According to the same training set division as the bladder tumor dataset, the data involved in the training consists of 118 images, of which 18 are used as labeled data, and the other 100 images are used as test sets for testing. Test results are shown in Table 3.

Table 3 Contains 118 polyp images. ETIS_LaribPolypDB test percentage results under different methods.

The CVC-ColonDB dataset contains 300 images. Following the same training set division as the bladder tumor dataset, 180 images are involved in training, of which 27 are used as labeled data, and the other 120 images are used as the test set. The test results are shown in Table 4.

Table 4 CVC-ColonDB test percentage results under different methods.

Tables 2, 3 and 4 show the test results on the comparison datasets. Although our method does not achieve the best value on every indicator, the Dice coefficient measuring overlap between the predicted results and the real labels is improved, which shows that our method generalizes well and is robust.

We visualize the Dice results of each dataset under different methods in Fig. 6, visually demonstrating the advantages of our method.

Fig. 6

Visualization of Dice results with different datasets using different methods on a limited dataset of 15%.

Combined with Figs. 5 and 6, we can see that the method is effective. Uncertainty estimation based on information entropy makes the pseudo-labels generated by the segmentation model clearer in the target area and filters out some redundant information. During iteration, accurate feature information is extracted and fed into the next round of training, further improving segmentation accuracy. Reliable feature information plays an important role in segmenting blurry images or small tumors, and the consistency constraints between pseudo-labels suppress overfitting to a certain extent, making the method robust.

To show the segmentation effect more intuitively, we select two images from each comparison dataset and display their segmentation results in Fig. 7.

Fig. 7

We display six segmentation results, two from each of Kvasir-SEG, CVC-ColonDB, and ETIS-LaribPolypDB.

Test results of 30% limited dataset

To further verify the effectiveness of our method, we tested different semi-supervised methods on the bladder tumor and Kvasir-SEG datasets with 30% limited annotations. The percentage results are shown in Tables 5 and 6: with 30% limited data, our method still segments better than the other methods. The visualization of each method on the 30% limited dataset is shown in Fig. 8, which confirms the advantage of our method on limited datasets.

Fig. 8

Visualization of Dice results on bladder dataset and Kvasir-SEG dataset using various methods on a limited dataset of 30%.

Table 5 Bladder tumor results under 30% limited label dataset.
Table 6 Kvasir-SEG test results on 30% limited labeled dataset.

With 30% limited annotations, the test results of our method on the bladder and Kvasir-SEG datasets retain clear advantages: its ability to segment the target area is the best among the compared methods, as Fig. 8 shows intuitively.

Computational costs

In this section, we study the computational cost of UDS-MT in comparison with other semi-supervised methods. We use two key metrics to evaluate the computational cost: the single image time (ITime) for the test set to pass the network for image segmentation and the total training time (TTime) required to train the network on the training set.

Table 7 shows the comparison of computational cost. Because our method introduces additional inference on top of the mean teacher, it takes more time to train, but its inference cost is similar, indicating room for improvement through module lightweighting. The inference cost is below the clinically required 30 ms, so our method is a viable candidate for clinical applications.

Table 7 Comparison of computational cost with existing methods based on the bladder tumor dataset. “TTime” represents the total training time for 500 cycles, measured in hours (h). “ITime” refers to the inference time for a single image when performing image segmentation on the test set, measured in seconds per frame (ms/F).

Ablation analysis

To explore the impact of individual modules on overall network performance, we trained the model while removing or modifying some parts. The experiments followed the settings described above and used the bladder tumor dataset for ablation studies. The network used the mean teacher framework as a baseline, where Model-S and Model-T use the same backbone structure but do not share parameters. As shown in Table 8, adding only Model-G to the mean teacher base network improves Dice by 2.56% and mIoU by 2.35%; introducing only uncertainty estimation (as shown in Fig. 9) improves Dice by 0.28% and slightly reduces mIoU. Combining uncertainty estimation with consistency regularization and the introduction of pseudo-labels improves Dice and mIoU by 3.27% and 3.28%, respectively.

Fig. 9

The uncertainty estimation generates a mask to supervise the Mean Teacher's segmentation predictions: Model-G, Model-T, and Model-S generate prediction probabilities. The prediction probabilities of multiple forward passes through Model-G are used to compute uncertainty, and high-confidence areas are selected as the mask. The pixel-wise loss of Model-T's and Model-S's pseudo-labels is calculated under the supervision of the mask.

Table 8 In our method, the ablation rates of different modules were evaluated and the data are presented as percentages.

Limits and future work

While UDS-MT achieved relatively good results for bladder tumor segmentation, it has certain limits. As shown in Fig. 4, the segmentation results of our method still differ from the truth labels at the edges. The pixel-level annotation itself carries systematic errors of 1–2 pixels, and the aggressive down-sampling in UDS-MT may shrink the thinnest edge features below the average thickness of the bladder wall, causing structural defects. The additional inference also lengthens training, suggesting improvements through module lightweighting: Ghost modules, which replace standard convolution by first reducing and then linearly expanding the channels, could speed up inference; multi-scale deep supervision could let the decoding layers output edge maps synchronously without extra up-sampling; and prior knowledge could further accelerate reasoning. Focusing on fine edge information will be the main direction of future work.

Conclusion

We designed a method for semi-supervised segmentation, and experimental verification on the bladder tumor, Kvasir-SEG, ETIS_LaribPolypDB, and CVC-ColonDB datasets shows that it outperforms other methods. Ablation experiments further verify the effectiveness of each module and highlight the method's superiority, though further efforts are still needed on inference time and segmentation accuracy. By rationally utilizing unlabeled data, segmentation performance improves significantly on the bladder tumor dataset, which is expected to improve the accuracy and efficiency of bladder cancer diagnosis and benefit patients.