Introduction

Automatic visual understanding in laparoscopic and robotic surgical videos is crucial for enabling autonomous surgery and providing advanced surgical support to clinical teams. Within this field, instrument segmentation, as shown in Fig. 1, is a fundamental building block. Example use cases include automatic surgical skill assessment, placing informative overlays on the display, performing augmented reality without occluding instruments, intra-operative guidance systems, surgical workflow analysis, visual servoing, and surgical task automation1. Surgical instrument segmentation has advanced rapidly from traditional methods2,3, to modern deep learning-based methods4,5,6,7. Given the significance of this task, open datasets have been released, in particular, through the organisation of computational challenges such as Robust-MIS 20191.

Fig. 1
figure 1

Representative sample images from Robust-MIS 2019 of laparoscopic surgery (left) and state-of-the-art instrument segmentation results (right). True positive (yellow), true negative (black), false positive (purple), and false negative (red).

Most studies in this area exploit fully-supervised learning, the performance of which scales with the amount of labelled training data. However, given the required expertise to provide accurate manual segmentation, the collection of annotations is costly and time-consuming. It is thus unsurprising that in comparison to industry standards for natural images, no large-scale annotated datasets for surgical tool segmentation currently exist. This leads to significant barriers in establishing the robustness and precision needed to deploy surgical instrument segmentation in the clinic. To tackle this challenge, a number of weak supervision approaches8,9,10,11,12 have been proposed which take advantage of unlabelled data or image-level labels as these are easier to capture. While interesting conceptually, weak supervision for surgical instrument segmentation has not yet been demonstrated in practice to generalize to the largest open-access surgical datasets and achieve the required accuracy for deployment in the operating theatre.

In this work, we propose a new semi-supervised surgical instrument segmentation framework, termed as SegMatch, building upon the widely adopted semi-supervised image classification pipeline, FixMatch13. During training, FixMatch processes unlabelled input images through two concurrent paths which implement weak (e.g. image flip and rotation) and strong (e.g. significant photometric changes) augmentations respectively. Augmented images from both paths are then fed to a shared backbone prediction network. For regions of high confidence, the prediction from the weakly augmented image serves as a pseudo-ground-truth against which the strongly augmented image is compared.

In order to adapt FixMatch to the segmentation task, we make a seemingly small but critical first contribution by changing the augmentation paths in SegMatch. For classification tasks, networks are expected to be invariant with respect to all types of augmentation within a certain range tailored to the specific requirements of the task. In contrast, for segmentation tasks, networks are expected to be invariant with respect to photometric transformations (e.g. contrast, brightness, hue changes) but equivariant with respect to spatial transformations (e.g. rotations, translations, flips). In SegMatch, spatial transformations that are used as augmentations are inverted after the network prediction.

Our second and main contribution to SegMatch lies in the inclusion of a learned strong augmentation strategy. Promoting prediction consistency between the weakly augmented and strongly augmented branches is what helps FixMatch and SegMatch learn from unlabelled data and generalise better. Yet, there is no guarantee that the fixed, hand-crafted set of strong augmentation types suggested in previous work13 is optimal. In fact, once the network learns to be sufficiently invariant/equivariant with respect to the fixed set of augmentation, the information learned from the unlabelled data would have saturated. In order to guide the model to learn continuously as the training progresses, we introduce an adversarial augmentation scheme to generate strongly augmented data. We rely on the established iterative fast gradient sign method (I-FGSM)14 and integrate it into our training loop to dynamically adapt the strong augmentation as we learn from the data.

We conduct extensive experiments on the Robust-MIS 20191 and EndoVis 201715 datasets for binary segmentation tasks, as well as on the CholecInstanceSeg16 dataset for the multi-class segmentation task. Our study demonstrates that using a ratio of 1:9 or 3:7 of labelled to unlabelled data within a dataset allows our model to achieve a considerably higher mean Dice score than state-of-the-art semi-supervised methods. Our method also shows a significant improvement in surgical instrument segmentation compared to existing fully-supervised methods by utilizing a set of 17K unlabelled images available in the Robust-MIS 2019 dataset in addition to the annotated subset which most competing methods have exploited.

The key contributions of this work are:

  • We adapt the representative semi-supervised classification pipeline with segmentation-specific transformations for weak and strong augmentations and consistency regularization.

  • We propose a trainable adversarial augmentation strategy to improve the relevance of strong augmentations, ensuring the model learns robustly by leveraging both handcrafted and dynamic augmentations.

  • Extensive experiments on the EndoVis 2017 and CholecInstanceSeg datasets demonstrate superior performance by our SegMatch compared to existing semi-supervised models for both binary and multi-class segmentation tasks.

Related work

Semi-supervised learning

Pseudo-labelling17,18 is a representative approach for semi-supervised learning. A model trained on labelled data is utilized to predict pseudo-labels for unlabelled data. This in turn provides an extended dataset of labelled and pseudo-labelled data for training. Consistency regularization19,20 is also a widespread technique in semi-supervised learning. There, an auxiliary objective function is used during training to promote consistency between several model predictions, where model variability arises from techniques such as weight smoothing or noise injection.

Berthelot et al.21 introduced MixMatch to incorporate both consistency regularization and the Entropy Minimization strategy of Grandvalet et al.22 into a unified loss function for semi-supervised image classification. Aiming at providing a simple yet strong baseline, Sohn et al.13 introduced FixMatch to combine consistency regularization and pseudo-labelling and achieve state-of-the-art performance on various semi-supervised learning benchmarks.

Adapting semi-supervised learning for image segmentation tasks requires dedicated strategies that account for the fact that labels are provided at the pixel level. Previous works explored the adaptation to semantic segmentation of classical semi-supervised learning including pseudo-labelling23,24, and consistency regularization25,26. Ouali et al.27 proposed cross-consistency training (CCT) to force the consistency between segmentation predictions of unlabelled data obtained from a main decoder and those from auxiliary decoders. Similarly, Chen et al.28 exploited a novel consistency regularization approach called cross pseudo-supervision such that segmentation results of variously initialised models with the same input image are required to have high similarity. Previous work also investigated the use of generative models to broaden the set of unlabelled data from which the segmentation model can learn29. More recently, Wang et al.30 proposed a contrastive learning approach for semi-supervised dense prediction, introducing adversarial negatives and an auxiliary classifier designed to track and exclude potential false negatives.

Despite such advances, current semi-supervised semantic segmentation models derived from classification models have not yet demonstrated the performance gains observed with FixMatch for classification. We hypothesise that an underpinning reason is that they do not adequately address the issue of transformation equivariance and invariance and do not exploit modern augmentation strategies as efficiently as FixMatch.

Surgical instrument segmentation

The majority of surgical instrument segmentation works are supervised methods6,31,32,33,34. Numerous studies have explored different methods to improve the accuracy of surgical instrument segmentation. For instance, Islam et al.6 took advantage of task-aware saliency maps and the scan path of instruments in their multitask learning model for robotic instrument segmentation. By using optical flow, Jin et al.33 derived the temporal prior and incorporated it into an attention pyramid network to improve the segmentation accuracy. Gonzalez et al.31 proposed an instance-based surgical instrument segmentation network (ISINet) with a temporal consistency module that takes into account the instance’s predictions across the frames in a sequence. Seenivasan et al.34 exploited global relational reasoning for multi-task surgical scene understanding which enables instrument segmentation and tool-tissue interaction detection. TraSeTR35 utilizes a Track-to-Segment Transformer that intelligently exploits tracking cues to enhance surgical instrument segmentation. Recently, large models have started to be used. Wei et al.36 for example introduced an adapter network that integrates pre-trained knowledge from foundation models into a lightweight convolutional model, enhancing both robustness and performance in surgical instrument segmentation. Yet, the use of unlabelled task-specific data for surgical tool segmentation remains relatively untapped.

A relatively small number of works exploited surgical instrument segmentation with limited supervision. Jin et al.33 transferred predictions of unlabelled frames to that of their adjacent frames in a temporal prior propagation-based model. Liu et al.10 proposed an unsupervised approach that relies on handcrafted cues including colour, object-ness, and location to generate pseudo-labels for background tissues and surgical instruments respectively. Liu et al.37 introduced a graph-based unsupervised method for adapting a surgical instrument segmentation model to a new domain with only unlabelled data. We note that most of the existing works with limited supervision focus on exploring domain adaptation or generating different types of pseudo-labels for surgical tool segmentation models and do not exploit a FixMatch-style semi-supervised learning.

Adversarial learning for improved generalisation

Deep neural networks (DNNs) have been found to be vulnerable to some well-designed input samples, known as adversarial samples. Adversarial perturbations are hard to perceive for human observers, but they can easily fool DNNs into making wrong predictions. The study of adversarial learning concerns two aspects: (1) how to generate effective adversarial examples to attack the model38; (2) how to develop efficient defence techniques to protect the model against adversarial examples39. A model which is robust to adversarial attacks is also more likely to generalise better40. As such, we hypothesise that adversarial methods may be of relevance for semi-supervised learning.

For adversarial attacks, the earliest methods38,41 rely on the gradient of the loss with respect to the input image to generate adversarial perturbations. For instance, the fast gradient sign method (FGSM)41 perturbs the input along the gradient direction of the loss function to generate adversarial examples. Tramer et al.42 improved FGSM by adding a randomization step to escape the non-smooth vicinity of the data point and obtain stronger attacks. The basic iterative method (BIM)14, also referred to as iterative FGSM (I-FGSM), improves FGSM by performing multiple smaller steps to increase the attack success rate. Carlini et al.43 introduced a method now referred to as C&W which solves a constrained optimisation problem minimising the size of the perturbation while ensuring a wrong prediction after perturbation. More recently, Rony et al.44 proposed a more efficient approach to generate gradient-based attacks with low L2 norm by decoupling the direction and norm (DDN) of the adversarial perturbation. For adversarial defence, adversarial training45 is a seminal method which generates adversarial examples on the fly and trains the model against them to improve the model's robustness. Other defence methods include relying on generative models46 and leveraging induced randomness to mitigate the effects of adversarial perturbations in the input domain47.

Although adversarial attacks have the potential to enhance the performance and robustness of deep learning models, they have not yet been applied to semi-supervised learning methods for semantic segmentation. As we show later in this work, adversarial attacks can indeed be used to effectively augment unlabelled images used for consistency optimization.

Method

Proposed SegMatch algorithm: overview

Our proposed SegMatch algorithm adapts the state-of-the-art semi-supervised image classification framework, FixMatch13, to semi-supervised semantic segmentation. Our application primarily targets the segmentation of surgical instruments but also has the potential to be utilized for various other semantic segmentation tasks. We follow the basic architecture of FixMatch as illustrated in Fig. 2. During training, SegMatch uses a supervised pathway and an unsupervised pathway which share the same model parameters \(\theta\) for segmentation prediction. For the supervised pathway, the prediction from the labelled image is classically optimized against its ground truth using a standard supervised segmentation loss. For the unsupervised pathway, given an unlabelled image, we follow FixMatch and process the input image through two concurrent paths, which implement weak and strong augmentations respectively. The augmented images are then fed to the model to obtain two segmentation proposals. The output from the weak augmentation branch is designed to act as the pseudo-ground-truth for that from the strong augmentation branch.

As detailed in Sect. Weak augmentations, equivariance, pseudo-labels, segmentation models are typically expected to be invariant with respect to bounded photometric transformations and equivariant with respect to spatial transformations of the input images. We use simple spatial transformations for our weak augmentations and propose to apply the inverse spatial transformation after the network prediction to handle spatial transformation equivariance. We use photometric transformations for our strong augmentations and exploit a pixel-wise loss promoting consistency between the segmentation outputs of the two branches.

Given the complexity of determining the suitable range of parameters for hand-crafted strong augmentation, we propose a solution that involves a learning-based approach. As detailed in Sect. Trainable strong augmentations, in order to gradually increase the difficulty of the prediction from the strongly augmented branch, we introduce a learnable adversarial augmentation scheme in the strong augmentation branch.

Fig. 2
figure 2

SegMatch training process structure. The top row is the fully-supervised pathway, which follows the traditional segmentation model training process. The two bottom rows form the unsupervised learning pathway, where one branch feeds a weakly augmented image into the model to compute predictions, and the second branch obtains the model prediction from a strongly augmented version of the same image. The model parameters are shared across the two pathways. The hand-crafted photometric augmentation methods are used to initialize the strongly augmented image, which is further perturbed by an adversarial attack (I-FGSM) for K iterations.

Weak augmentations, equivariance, pseudo-labels

Fig. 3
figure 3

Equivariance (left) and invariance (right) properties for an image augmented under different types of augmentations: spatial (left) or photometric (right).

Equivariance and invariance in SegMatch

We start by introducing the notion of equivariance and invariance illustrated in Fig. 3. Let us consider a function \(f_\theta\) (standing for the neural network, \(\theta\) being the parameters of the network) with an input x (standing for the input image), and a class of transformation functions \(\mathcal {G}\) (standing for a class of augmentations). If applying \(g\in \mathcal {G}\) (standing for a specific augmentation instance) to the input x also reflects on the output of \(f_\theta\), that is if \(f_\theta (g(x))=g(f_\theta (x))\), then the function \(f_\theta\) is said to be equivariant with respect to transformations in \(\mathcal {G}\) (see Fig. 3-Left). Conversely, if applying \(g\in \mathcal {G}\) to the input x does not affect the output of \(f_\theta\), that is \(f_\theta (g(x))=f_\theta (x)\), then the function \(f_\theta\) is said to be invariant with respect to transformations in \(\mathcal {G}\) (see Fig. 3-Right).

For the classification task, the model is expected to be invariant to all the augmentation strategies. In contrast, given a segmentation model, spatial transformations on the input image should reflect on the output segmentation map while photometric transformations should not. Segmentation models should thus be equivariant with respect to spatial transformations and invariant with respect to photometric transformations. In FixMatch, weak augmentations only comprise simple spatial transformations such as rotation and flip, which preserve the underlying structure of the image. Meanwhile, strong augmentations only comprise photometric transformations such as posterizing and sharpness changes as provided by the RandAugment48 algorithm, which shift the intensity distribution of the original image.

Inverting transformations from the weak augmentations

Similar to FixMatch, given an input unlabelled image, we randomly select one simple spatial transformation, i.e. rotation, flip, crop or resize, to apply to it in the weak augmentation branch. In order for our SegMatch to take advantage of a consistency loss between the weak augmentation branch where spatial transformations are used (with expected equivariance of the segmentation) and the strong augmentation branch where photometric transformations are used (with expected invariance of the segmentation), we deploy an inverse spatial transformation on the output of the segmentation model in the weak augmentation branch.

Given an unlabelled image \(x^u\), we denote its weak augmentation as \(\omega _{e}(x^u)\). It is fed to the network \(f_\theta\) to obtain the segmentation output \(f_\theta (\omega _{e}(x^u))\). We apply the inverse transformation \(\omega _{e}^{-1}\) to \(f_\theta (\omega _{e}(x^u))\), i.e. \(\omega _{e}^{-1}(f_\theta (\omega _{e}(x^u)))\). This addresses the equivariance expectation and allows for the output of the weak augmentation branch to be easily compared with the segmentation output from the strongly augmented image.
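As a concrete illustration, a minimal PyTorch sketch of this equivariance handling is given below; it assumes image tensors of shape (B, C, H, W) and restricts itself to flips and a 90-degree rotation for brevity (crop and resize would need their own inverses). The function names are illustrative, not the authors' exact implementation.

```python
import torch

def weak_augment_with_inverse(x_u):
    """Randomly pick an invertible spatial transform; return the augmented image
    together with a function that undoes the transform on a prediction map of
    the same spatial layout (flips and a 90-degree rotation only, for brevity)."""
    choice = torch.randint(0, 3, (1,)).item()
    if choice == 0:                                     # horizontal flip
        aug = torch.flip(x_u, dims=[-1])
        inverse = lambda p: torch.flip(p, dims=[-1])
    elif choice == 1:                                   # vertical flip
        aug = torch.flip(x_u, dims=[-2])
        inverse = lambda p: torch.flip(p, dims=[-2])
    else:                                               # 90-degree rotation
        aug = torch.rot90(x_u, k=1, dims=[-2, -1])
        inverse = lambda p: torch.rot90(p, k=-1, dims=[-2, -1])
    return aug, inverse

# usage: aug, inverse = weak_augment_with_inverse(x_u)
#        p_w = inverse(model(aug))   # logits aligned with the original image x_u
```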

Soft pseudo-label generation

Following the inverse transformation, we obtain a segmentation prediction in logit space \(p_{w}=\omega _{e}^{-1}(f_\theta (\omega _{e}(x^u)))\). For the i-th pixel in \(p_{w}\), i.e. \(p_{{w}_{i}}\), a pseudo-label is extracted by applying a sharpened softmax:

$$\begin{aligned} \widetilde{y_i}=\text {Sharpen}(\text {Softmax}(p_{w_{i}}), T) \end{aligned}$$
(1)

where the distribution sharpening operation of21 is defined as

$$\begin{aligned} \text {Sharpen}(d,T)_i:=\frac{d_i^{\frac{1}{T}}}{\sum _{j=1}^{L}d_{j}^{\frac{1}{T}}} \end{aligned}$$
(2)

with T being a temperature hyper-parameter. The sharpening operation allows us to control the level of confidence in the resulting probabilities.
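A minimal sketch of Eqs. (1)-(2), assuming per-pixel class logits of shape (B, C, H, W); the temperature value shown is a placeholder, not a value reported in this paper.

```python
import torch
import torch.nn.functional as F

def sharpen(d, T):
    """Temperature sharpening of a per-pixel class distribution d of shape (B, C, H, W) (Eq. 2)."""
    d = d.pow(1.0 / T)
    return d / d.sum(dim=1, keepdim=True)

def soft_pseudo_labels(p_w, T=0.5):
    """Softmax over classes followed by sharpening (Eq. 1); T=0.5 is illustrative."""
    return sharpen(F.softmax(p_w, dim=1), T)
```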

Trainable strong augmentations

To tackle the generalization problem typically faced by convolutional neural networks, previous work has employed strong augmentation techniques13,21,49. However, these augmentations are hand-crafted, and designing realistic strong augmentations to bridge large and complex domain gaps is challenging50. This challenge is further exacerbated in segmentation tasks, which are highly sensitive to pixel-level perturbations due to their pixel-level prediction nature51. For these reasons, despite utilizing powerful hand-crafted augmentations during training, existing methods still demonstrate limited generalization capabilities. In this section, we propose to tackle these key limitations by learning to perform data augmentation using adversarial perturbations during model training.

Strong augmentation initialization

Rather than completely replacing the strong augmentation approach in FixMatch, we build on it to serve as initialization which will be further perturbed. We chose the strong augmentations from the operations in RandAugment48 based on two criteria. First, we focus on photometric transformations only as these satisfy the invariance expectation and do not require the use of an inverse transformation as discussed in Sect. Weak augmentations, equivariance, pseudo-labels. Second, we select rather basic transformations that provide visually plausible augmentations, thereby leaving the more complex changes to the trainable refinement. More specifically, our initial augmentation for the strong augmentation branch is a composition of three transformations randomly chosen from a collection of handcrafted photometric augmentation strategies. These include adjustments to contrast, brightness, colour, and sharpness, as well as the addition of random noise, posterization, and solarization. The strength of the individual transformations is chosen according to predefined hyper-parameters.
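For illustration, such an initialization could be written with torchvision's functional photometric operations; the magnitude ranges below are placeholder values, not the paper's tuned hyper-parameters.

```python
import random
import torch
import torchvision.transforms.functional as TF

def init_strong_augmentation(x):
    """Compose three randomly chosen photometric transforms (no spatial change).
    x is a float image tensor in [0, 1] of shape (C, H, W) or (B, C, H, W)."""
    ops = [
        lambda im: TF.adjust_contrast(im, random.uniform(0.5, 1.5)),
        lambda im: TF.adjust_brightness(im, random.uniform(0.5, 1.5)),
        lambda im: TF.adjust_saturation(im, random.uniform(0.5, 1.5)),   # colour
        lambda im: TF.adjust_sharpness(im, random.uniform(0.5, 2.0)),
        lambda im: (im + 0.02 * torch.randn_like(im)).clamp(0, 1),       # random noise
        lambda im: TF.posterize((im * 255).to(torch.uint8), bits=4).float() / 255.0,
        lambda im: TF.solarize(im, threshold=random.uniform(0.5, 1.0)),
    ]
    for op in random.sample(ops, k=3):
        x = op(x)
    return x.clamp(0, 1)
```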

Adversarial augmentation approach

As a simple yet powerful adversarial method, we decide to use the iterative fast gradient sign method (I-FGSM)14, which applies multiple gradient updates by iterative small steps.

I-FGSM is based on FGSM which provides an adversarial perturbation to the input image in a single gradient-based operation. In FGSM, the direction of the perturbation is computed from the gradient of the loss with respect to the input image. The magnitude of the gradient is discarded by keeping only the sign along each dimension. A scaling factor is applied to keep the perturbation small. To compute a more refined perturbation, I-FGSM applies FGSM multiple times in smaller steps. The perturbations are clipped to make sure they are in the \(\epsilon\)-neighbourhood to the original image. The I-FGSM equation for iteration \(k+1\) out of K is as follows:

$$\begin{aligned} x^s_{k+1}=\text {Clip}_{x_0^s,\epsilon }\left\{ x^s_{k}+ \frac{\epsilon }{K} \cdot \text {Sign}\left( \nabla _{x^s_{k}} L_u\big (f_\theta (x^s_{k}),\widetilde{y}\big )\right) \right\} \end{aligned}$$
(3)

where \(x_0^s\) represents the initial strongly-augmented image; \(\widetilde{y}\) is the pseudo-label obtained from the weak augmentation branch; \(\text {Clip}\{\cdot \}\) is the clipping function which applies to every pixel in the perturbation image to limit the difference between \(x^s_{{K}}\) and \(x^s_{0}\) and keep it within an \(\epsilon\)-neighbourhood; and \(L_u\) is the unsupervised loss function defined in Eq. (5). The magnitude of the perturbation \(\epsilon\) and the number of I-FGSM steps K are hyperparameters to adjust the quality of the adversarial approach.
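A minimal PyTorch sketch of the update in Eq. (3); `model` and the unsupervised loss \(L_u\) (Eq. 5) are assumed to be available, and the final clamp to a valid image range is our addition for illustration.

```python
import torch

def ifgsm_augment(model, unsup_loss, x_s0, y_tilde, eps=0.08, K=25):
    """Iteratively perturb the strongly augmented image x_s0 following Eq. (3)."""
    x_s = x_s0.clone().detach()
    for _ in range(K):
        x_s.requires_grad_(True)
        loss = unsup_loss(model(x_s), y_tilde)
        grad = torch.autograd.grad(loss, x_s)[0]
        with torch.no_grad():
            x_s = x_s + (eps / K) * grad.sign()
            # Clip: keep every pixel within the eps-neighbourhood of the initial image
            x_s = torch.min(torch.max(x_s, x_s0 - eps), x_s0 + eps)
            x_s = x_s.clamp(0, 1)  # assumed valid image range
    return x_s.detach()
```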

Loss functions in SegMatch

In this work, the training objective for the supervised pathway is the standard pixel-wise cross-entropy loss (\(l_{CE}\)) combined with Dice loss (\(l_{DSC}=1-DSC\)), where DSC represents the soft Dice coefficient:

$$\begin{aligned} L_s =\frac{1}{|\mathcal {D}^l|} \sum _{x^l\in \mathcal {D}^l} \Big ( l_{DSC}\big (y^l, f_\theta (x^l)\big ) + \frac{1}{N} \sum _{i=0}^{N-1} l_{CE}\big (y^l_i, f_\theta (x^l)_i\big ) \Big ) \end{aligned}$$
(4)

where \(x^l\) is a labelled input from the labelled set \(\mathcal {D}^l\); \(y^l\) is the corresponding ground-truth label; \(x^l_i\) and \(y^l_i\) denote the \(i{\text {th}}\) pixel of \(x^l\) and \(y^l\), respectively; and N is the number of pixels in \(x^l\).
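A hedged sketch of Eq. (4), assuming logits of shape (B, C, H, W) and integer label maps of shape (B, H, W); the soft-Dice formulation below is one common variant rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, target, smooth=1e-6):
    """Soft Dice loss plus pixel-wise cross-entropy (Eq. 4)."""
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice_loss = 1.0 - ((2 * intersection + smooth) / (union + smooth)).mean()
    ce_loss = F.cross_entropy(logits, target)
    return dice_loss + ce_loss
```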

The training objective for the unsupervised pathway is a cross-entropy loss calculated between a subset of confident pixel-level pseudo-labels \(\widetilde{y_i}\) stemming from the weak augmentation branch and the output probability \(p_{{i}}\) from the strongly augmented image:

$$\begin{aligned} L_u = \frac{1}{\left| \mathcal {D}^u \right| } \sum _{x^u\in \mathcal {D}^u} \frac{1}{\left| N_v^{x^u} \right| } \sum _{i\in N_v^{x^u}} l_{CE}(\widetilde{y_i}, p_{i}) \end{aligned}$$
(5)

where \(x^u\) is an unlabelled input from the unlabelled set \(\mathcal {D}^u\); c denotes a specific class; and \(N_v^{x^u}\) is the set of pixel indices with confidence score of the most confident class \(\max _c(p^c_{w_{i}})\) higher than or equal to a hyper-parameter threshold t, i.e. \(N_v^{x^u} = \{ i \, | \, \max _c(p^c_{w_{i}})\ge t\}\).
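A minimal sketch of Eq. (5), assuming `p_w` contains the (sharpened) per-pixel class probabilities from the weak branch and `p_s_logits` the logits from the strong branch; the threshold shown is illustrative.

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(p_w, p_s_logits, t=0.9):
    """Confidence-masked cross-entropy between soft pseudo-labels and the strong-branch output."""
    conf, _ = p_w.max(dim=1)                                     # (B, H, W) most-confident class score
    mask = (conf >= t).float()                                   # pixels in N_v
    ce = -(p_w * F.log_softmax(p_s_logits, dim=1)).sum(dim=1)    # per-pixel cross-entropy
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)
```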

The final loss is given by:

$$\begin{aligned} L=L_s+w(t)L_u \end{aligned}$$
(6)

where, following Laine et al.52, w(t) is an epoch-dependent weighting function which starts from zero and ramps up along a Gaussian curve, so that the supervised loss dominates the total loss at the beginning of training and the unsupervised loss increases its contribution in subsequent training epochs.
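A common Gaussian ramp-up in the spirit of Laine et al.52, shown as a sketch; the ramp-up length and maximum weight are assumed hyper-parameters, not values reported in this paper.

```python
import numpy as np

def consistency_weight(epoch, ramp_up_epochs=80, w_max=1.0):
    """Gaussian ramp-up of the unsupervised weight w(t) in Eq. (6)."""
    if epoch >= ramp_up_epochs:
        return w_max
    phase = 1.0 - epoch / ramp_up_epochs
    return w_max * float(np.exp(-5.0 * phase ** 2))
```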

Data flow

To provide an overall workflow for both the weak and strong augmentation branches in SegMatch, we present detailed pseudo-code illustrations for each component in Algorithms 1 and 2. Algorithm 1 outlines the workflow for the weak augmentation branch, while Algorithm 2 details the strong augmentation branch, which integrates hand-crafted strong augmentation initialization and adversarial augmentation.

Algorithm 1
figure a

Workflow for weak augmentation branch

Algorithm 2
figure b

Workflow for strong augmentation branch
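Tying the above together, a condensed sketch of one SegMatch training iteration is shown below; it reuses the illustrative helper functions from the previous sketches and is a schematic summary of Algorithms 1 and 2 rather than a literal transcription.

```python
import torch

def segmatch_training_step(model, x_l, y_l, x_u, epoch, T=0.5, t=0.9, eps=0.08, K=25):
    """One training iteration combining Eqs. (4)-(6) for a mixed labelled/unlabelled batch."""
    # supervised pathway (Eq. 4)
    loss_s = supervised_loss(model(x_l), y_l)

    # weak branch: augment, predict, invert the spatial transform, build pseudo-labels
    x_w, inverse = weak_augment_with_inverse(x_u)
    with torch.no_grad():
        p_w = soft_pseudo_labels(inverse(model(x_w)), T=T)

    # strong branch: hand-crafted initialization refined by I-FGSM (Eq. 3)
    x_s0 = init_strong_augmentation(x_u)
    x_s = ifgsm_augment(model, lambda logits, y: unsupervised_loss(y, logits, t=t),
                        x_s0, p_w, eps=eps, K=K)

    # consistency loss (Eq. 5) and total loss (Eq. 6)
    loss_u = unsupervised_loss(p_w, model(x_s), t=t)
    return loss_s + consistency_weight(epoch) * loss_u
```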

Experimental setup

Dataset

Robust-MIS 2019

Robust-MIS 2019 is a laparoscopic instrument dataset covering rectal resection, proctocolectomy, and sigmoid resection procedures, released to support the detection, segmentation, and tracking of medical instruments in endoscopic video images1. The training data consists of 10-second video snippets in the form of 250 consecutive endoscopic image frames, with a reference annotation provided only for the last frame. In total, 10,040 annotated images are available from 30 surgical procedures spanning three different types of surgery.

As per the original challenge, the samples used for training were exclusively taken from the proctocolectomy and rectal resection procedures. These samples comprise a total of 5983 clips, with each clip having only one annotated frame while the remaining 249 frames are unannotated.

The testing set is divided into three stages as per the original challenge, with no patient overlap between the training and test datasets. Stage 1: 325 images from the proctocolectomy procedure and 338 images from the rectal resection procedure. Stage 2: 225 images from the proctocolectomy procedure and 289 from the rectal resection procedure. Stage 3: 2880 annotated images from sigmoid resection, a procedure type that appears only in the testing stage and not during training.

EndoVis 2017

EndoVis 2017 is a robotic instrument image dataset captured from robotic-assisted minimally invasive surgery15, which comprises a collection of 10 recorded sequences capturing abdominal porcine procedures. For the training phase, the first 225 frames of 8 sequences were captured at a rate of 2Hz and manually annotated with information on tool parts and types. The testing set consists of the last 75 frames from the 8 sequences used in the training data videos, along with 2 full-length sequences of 300 frames each, which have no overlap with the training phase. To prevent overlap between the training and test sets from the same surgical sequence, we followed the same split as described in ISINet31. This involved exclusively assigning the 2 full-length sequences for testing while keeping the training set intact with 225 frames from the remaining 8 sequences. Note that there are no additional unannotated images for EndoVis 2017.

CholecInstanceSeg

The CholecInstanceSeg dataset16 is a comprehensive, instance-labelled collection featuring 41,933 frames from 85 unique video sequences. This dataset includes seven distinct tool categories: Grasper, Bipolar, Hook, Clipper, Scissors, Irrigator, and Snare. Each frame, extracted from 85 laparoscopic cholecystectomy procedures in the Cholec8053 and CholecT5054 datasets, has been meticulously annotated. The dataset is divided into a training set with 55 sequences (26,830 frames), a testing set with 23 sequences (11,299 frames), and a validation set with 17 sequences (3,804 frames). In this work, we utilize the dataset solely for multi-class semantic segmentation and thus discard the instance-specific information.

Dataset usage for semi-supervised learning evaluation

Since the above challenges were designed for fully-supervised benchmarking, we make some adaptations to evaluate SegMatch with respect to competitive semi-supervised (and fully-supervised) methods.

The Robust-MIS 2019 dataset was split into training and testing sets based on the original challenge splits. The three original challenge testing stages were merged to form a single combined testing set, comprising 4057 images. For training, we started with the full set of 5983 labelled original challenge training images and the corresponding 17,617 unlabelled images. In our first experiments, we use only the 5983 labelled original challenge training images and keep only 10% or 30% of the annotations. This allows for a comparison with supervised methods having access to all 5983 labelled images. To further compare with the state-of-the-art supervised methods, we also conducted experiments using the whole 5983 images of the training set as a labelled set, and used 17,617 additional unlabelled frames from the original videos.

For EndoVis 2017, as no additional unlabelled data is available, the original training set, which has 1800 images in total, is split into labelled and unlabelled subsets with ratios 1:9 or 3:7.

As for the multi-class segmentation evaluation using CholecInstanceSeg, we use the same data splitting ratios as in our experiments with the Robust-MIS 2019 dataset. In our ablation studies, we utilized only the original 26.8k labelled training images, retaining 30% of the annotations. To further evaluate multi-class segmentation performance and compare it against the baseline model, we also used the full set of 26.8k labelled training images and the 3804 frames in the validation set as the labelled dataset. Additionally, we incorporated a random selection of 66.1k unlabelled frames to assess the impact of including unlabelled data. For testing, we utilized the 11,299 frames from the original test split.

Implementation details

During training, for each batch, the same number of images is sampled from the labelled dataset \(D_l\) and the unlabelled dataset \(D_u\). Within each batch, unlabelled samples are initialized by random but hand-crafted strong augmentations and then adversarially updated by adding the I-FGSM perturbations.

All our experiments were trained on two NVIDIA V100 (32 GB) GPUs. The model was trained using an SGD optimizer with a momentum of 0.95, which we found provides a smooth convergence trajectory (unreported manual tuning involved experiments with momentum values from 0.9 to 0.99). The learning rate was initialized at \(1.0 \times 10^{-2}\) and decayed with the policy \(lr_{ini}\times (1 - epoch/epoch_{max}) \times \eta\), where \(epoch_{max}\) is the total number of epochs and \(\eta\) is set to 0.7. The total number of training epochs was 1000 and the batch size was set to 64, considering memory constraints and training efficiency. The selection of these parameters was initially based on the baseline model OR-Unet7 and further adjusted through multiple experimental runs and prior experience with similar setups.
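For reference, a minimal sketch of this optimizer and decay policy in PyTorch, with `model` standing for the OR-Unet backbone; purely illustrative.

```python
import torch

def build_optimizer(model, lr_ini=1e-2, epoch_max=1000, eta=0.7):
    """SGD (momentum 0.95) with the decay policy lr_ini * (1 - epoch/epoch_max) * eta."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_ini, momentum=0.95)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: (1.0 - epoch / epoch_max) * eta)
    return optimizer, scheduler  # call scheduler.step() once per epoch
```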

For the hyper-parameters in adversarial augmentation, the magnitude of adversarial augmentations \(\epsilon\) was set to 0.08 based on the experiments shown in Fig. 6. This value represents the best trade-off: increasing \(\epsilon\) beyond 0.08 led to performance degradation. As shown in Table 7 and Fig. 7 in the manuscript, increasing the number of I-FGSM iterations improved segmentation performance slightly. However, considering the trade-off between performance and computational efficiency, we selected a step number of \(K = 25\) in our final model. Additionally, two types of initial strong augmentation techniques were utilized with pre-defined minimum and maximum magnitude values, taken specifically from the photometric transformations of the RandAugment48 method.

For the segmentation model backbone, we employed the representative OR-Unet7. Our OR-Unet contains 6 downsampling and 6 upsampling blocks, where residual blocks are employed in the encoder and sequences of Conv-BN-ReLU layers with kernel size \(3\times 3\) are used in the decoder.

Evaluation metrics

We evaluated our model based on the criteria proposed in the Robust-MIS 2019 MICCAI challenge1 for the binary segmentation task, which include:

  • Dice Similarity Coefficient, a widely used overlap metric in segmentation challenges;

  • Normalized Surface Dice (NSD)1, a distance-based measure which assesses the overlap of two surfaces (i.e. mask borders). In adherence to the challenge guidelines1, we set the tolerance value for NSD to 13 pixels, which takes into account inter-rater variability. Note that the tolerance value was determined by comparing annotations from five annotators on 100 training images, as described in the challenge.

For the multi-class segmentation evaluation, we adopted the evaluation method from ISINet31 and used three IoU-based metrics: Ch_IoU, ISI_IoU, and mc_IoU (a minimal illustrative sketch of the first two is given after the list below).

  • Ch_IoU calculates the mean IoU for each category present in the ground truth of an image, then averages these values across all images.

  • ISI_IoU extends Ch_IoU by computing the mean IoU for all predicted categories, regardless of whether they are present in the ground truth of the image. Typically, Ch_IoU is greater than or equal to ISI_IoU.

  • mc_IoU calculates the average IoU across different instrument classes and addresses category imbalance by altering the averaging order used in ISI_IoU.
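One plausible reading of the first two definitions is sketched below for integer label maps; this illustrates the averaging logic only and is not the official ISINet evaluation code.

```python
import numpy as np

def image_class_ious(pred, gt, num_classes):
    """Per-class IoU for one image; classes absent from both pred and gt are skipped."""
    ious = {}
    for c in range(1, num_classes):            # class 0 assumed to be background
        p, g = (pred == c), (gt == c)
        if not p.any() and not g.any():
            continue
        ious[c] = (p & g).sum() / (p | g).sum()
    return ious

def ch_iou_and_isi_iou(preds, gts, num_classes):
    """Ch_IoU averages over classes present in the GT of each image;
    ISI_IoU also includes classes that are only predicted (IoU = 0 for them)."""
    ch_scores, isi_scores = [], []
    for pred, gt in zip(preds, gts):
        ious = image_class_ious(pred, gt, num_classes)
        gt_classes = [c for c in ious if (gt == c).any()]
        if gt_classes:
            ch_scores.append(np.mean([ious[c] for c in gt_classes]))
        if ious:
            isi_scores.append(np.mean(list(ious.values())))
    return float(np.mean(ch_scores)), float(np.mean(isi_scores))
```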

We used “percentage points” (pp) to denote the absolute change in Mean Dice score, NSD, Ch_IoU, ISI_IoU, and mc_IoU, as it represents the difference between two percentage values. Additionally, as shown in Sect. Comparison with strong baselines, we report both the mean performance and the inter-run standard deviation (std) across folds for our proposed model, SegMatch, to capture performance consistency across multiple runs. For other models, we provide the single-run sample-based std, which reflects consistency within a single run. In Sect. Ablation and parameter sensitivity study, we report the single-run sample-based std for both the original SegMatch model and the other ablation experiments.

Results

Comparison with strong baselines

We compare our results to the state-of-the-art on Robust-MIS 2019 and EndoVis 2017 datasets. We categorize comparisons into two groups. First, a head-to-head comparison is made with other semi-supervised methods. Second, we measure the added value of incorporating unlabelled data in addition to using the complete labelled training data in Robust-MIS 2019.

Comparison with semi-supervised baselines

For the first group, we adapted the representative semi-supervised classification method Mean-Teacher19 for the segmentation task using the same backbone network and experimental setting as ours. We also conducted experiments with two established semi-supervised semantic segmentation models: WSSL55, a representative benchmark in this field, and CCT27, which emphasizes feature consistency across various contexts through perturbed feature alignment. Additionally, we evaluated ClassMix56, a novel data augmentation technique designed for semantic segmentation, Min-Max Similarity57, a semi-supervised surgical instrument segmentation network based on contrastive learning, and PseudoSeg58, a FixMatch-based semi-supervised semantic segmentation model with a self-attention module. Illustrative segmentation results for the selected methods on Robust-MIS 2019 are presented in Fig. 4, with key areas highlighted for clarity.

Table 1 State-of-the-art semi-supervised model comparisons for Robust-MIS 2019 dataset (left) and EndoVis 2017 dataset (right) under differently labelled data to unlabelled data ratio.

Table 1 shows that, within the dataset of Robust-MIS 2019, for the two labelled to unlabelled data ratios we tested, our SegMatch outperforms other methods with statistical significance (p<0.05). Comparing SegMatch to the second-best method, PseudoSeg, we observed notable performance improvements. Specifically, when using 10% and 30% labelled data in the Robust-MIS 2019 dataset, SegMatch achieved mean Dice score improvements of 2.9 percentage points (pp) and 4.4 pp, respectively. Similar observations were made on the EndoVis 2017 dataset, where SegMatch outperformed PseudoSeg by 1.3 pp and 1.4 pp when utilizing 10% and 30% labelled data, respectively.

Fig. 4
figure 4

Segmentation results on exemplar images from three different procedures in the testing set. Here, SegMatch, CCT27, and WSSL55 were trained using the whole labelled training set of Robust-MIS 2019 as a labelled set, and 17K additional unlabelled frames from the original videos. The fully supervised learning models (OR-Unet7 and ISINet31) were trained using the whole labelled training set of Robust-MIS 2019 as a labelled set. The first column shows the ground truth mask overlaid on the original image, and the remaining columns show the segmentation results of SegMatch ablation models and state-of-the-art models. The three rows, from top to bottom, show testing image samples from the proctocolectomy procedure, sigmoid resection procedure (unseen type), and rectal resection procedure, respectively. The yellow stars highlight the key areas for better visualization.

These results demonstrate the superior performance of SegMatch on both datasets and both labelled-to-unlabelled data ratios, highlighting the effectiveness of our proposed method in scenarios with limited labelled data. Note that the training dataset only consists of the proctocolectomy and rectal resection procedures, so the sigmoid resection procedure constitutes a new type of data for the trained model. Qualitatively, our proposed SegMatch is able to recognize the boundaries between different tools and segment more complete shapes for each individual tool, especially in areas with high reflection.

Comparison with fully-supervised method

We compare OR-Unet7 with our SegMatch across a wide range of labelled-to-unlabelled ratios of the Robust-MIS 2019. We train the fully supervised OR-Unet7 and our semi-supervised SegMatch using the same amount of labelled data while utilising the remaining data as unlabelled data only for SegMatch.

The results shown in Table 2 demonstrate that SegMatch outperforms OR-Unet consistently by effectively utilizing additional unlabelled data. As the amount of labelled data increases, the benefit of SegMatch diminishes, leading to the convergence of the performance of the two methods. Nonetheless, minor variations in the 100% labelled case arose due to differences in codebase implementation and experimental randomness.

Table 2 Performance comparison for OR-Unet7 (fully supervised methods) vs SegMatch (our semi-supervised model) across different labelled data ratios on Robust-MIS 2019 dataset.

Added value of unlabelled data

For the second group, we used the whole labelled training set of Robust-MIS 2019 as the labelled set and took advantage of the unlabelled video footage available from the training set of Robust-MIS 2019 to evaluate the impact of adding unlabelled data, as mentioned in Sect. Dataset.

Table 3 shows the comparison between supervised approaches trained on the labelled set and SegMatch trained on the combination of the labelled and unlabelled sets. Compared to the existing model OR-Unet7, the inclusion of additional unlabelled data in our semi-supervised pipeline SegMatch led to a 5.7 pp higher Dice score. Our model also demonstrates a noteworthy enhancement of 4.8 pp in comparison to the more recent ISINet31, which is now commonly employed for surgical instrument segmentation. Additionally, it maintains a 0.8 pp advantage over DINO-Adapter36, a recent approach that leverages pre-trained knowledge from foundation models via adapter networks. When evaluating the stage 3 testing data, which corresponds to a surgical procedure that was not seen during the training phase, our SegMatch model demonstrated superior performance compared to the official Robust-MIS 2019 challenge winner for stage 3 (haoyun team) by a margin of 3.9 pp.

It is noteworthy that the performance improvement from SegMatch over fully-supervised baselines is more substantial on stage 3 testing data (unseen procedure types) than on stage 1 and 2 data (procedure types represented in the training data). This indicates that our model exhibits better generalizability. This enhanced generalizability can enable the model to handle diverse surgical scenarios more effectively.

Table 3 Comparison on the Robust-MIS 2019 dataset between fully-supervised models and SegMatch with additional unlabelled data (* indicates only labelled data was used).

Multi-class segmentation

To evaluate the multi-class segmentation capabilities of our model, we conducted experiments using the CholecInstanceSeg dataset. We utilized the entire labelled training set and leveraged the available unlabelled video footage to analyze the impact of incorporating unlabelled data, as discussed in Sect. Dataset.

Table 4 compares our method with the baseline OR-Unet approach7. By incorporating an additional 66.1k unlabelled frames, our model achieved significant improvements, surpassing OR-Unet by 24.2 pp in Ch_IoU, 21.89 pp in ISI_IoU, and 25.06 pp in mc_IoU. Additionally, our approach outperforms OR-Unet across multiple instrument categories, highlighting the effectiveness of utilizing unlabelled data in enhancing multi-class segmentation compared to the fully-supervised OR-Unet.

Table 4 Comparison of our method with state-of-the-art methods on the CholecInstanceSeg dataset for multi-class segmentation.

Ablation and parameter sensitivity study

To evaluate the contribution of the different components of our pipeline, we conducted ablation and parameter sensitivity studies on different strong augmentation strategies and adversarial strong augmentation methods.

Table 5 Ablation study results for different components, evaluated on Robust-MIS 2019 with different amounts of labelled and unlabelled data, on binary segmentation.
Table 6 Ablation study results for different components, evaluated on CholecInstanceSeg with different amounts of labelled and unlabelled data, on multi-class segmentation.

Analysis of our semi-supervised method

We evaluated the contribution of the semi-supervised learning in SegMatch by training only its fully-supervised branch, essentially turning it into OR-UNet. As discussed previously and shown in Table 3, disabling the unlabelled pathway leads to a drop in terms of Dice and NSD scores thereby confirming the benefits of semi-supervised learning.

We also studied the effectiveness of varying the confidence threshold when generating the pseudo-labels as shown in Fig. 5. We observe that when the confidence threshold approaches 1.0, the model returns the worst segmentation performance. When the threshold value is changed within the range of [0.7, 0.9], the confidence threshold does not affect the model’s performance significantly. However, it should be noted that further reducing the threshold value leads to a rapid decrease in the mean Dice score, which indicates that the quality of contributing pixels to the unsupervised loss may be more important than the quantity. This evaluation was conducted on the binary segmentation task using the Robust-MIS 2019 dataset, but a similar trend was observed in the multi-class segmentation task as well.

Augmentation strategy

We also conducted an ablation study on the weak and strong augmentations in SegMatch, as tabulated in Table 5. First, we removed the adversarial augmentation, thereby only keeping handcrafted strong augmentations. Notably, we observed consistent results across different labelled data ratios. Specifically, when utilizing 17K additional unlabelled data, we found that the Dice score decreased by 3.1 pp compared to the full proposed SegMatch model. This suggests that applying adversarial augmentations can prevent the learning saturation caused by hand-crafted strong augmentations on unlabelled data, thereby enabling the model to learn continuously. When both the weak augmentation and adversarial augmentation are removed, the results drop by an additional 1.9 pp compared to only removing the adversarial augmentation, indicating that applying the weak augmentation function to the input image, which generates new and diverse examples for the training dataset, can enhance the segmentation performance.

To evaluate the effectiveness of the overall strong augmentation strategies, we replaced the strongly augmented images with the original images. The resulting Dice score dropped by 5.1 pp compared to only removing the adversarial augmentation, and by 3.2 pp compared to removing both the weak and adversarial augmentation. The evidence that removing strong augmentation results in a greater decrease in performance than removing weak augmentation suggests that stronger perturbations are beneficial for learning with consistency regularization. Additionally, when strong augmentation is removed, the model is presented with the same input image from different views, which reduces the benefits of consistency regularization. In this scenario, pseudo-labelling becomes the primary technique for achieving better segmentation performance. Therefore, our results suggest that consistency regularization is more crucial than pseudo-labelling in our pipeline for improving segmentation performance.

We conducted an ablation study on augmentation strategies for multi-class segmentation using the CholecInstanceSeg dataset, as shown in Table 6. Even when trained on only 70% of the data, our model outperformed the baseline OR-Unet7, which was trained on the full dataset as shown in Table 4, by 6.5 pp in Ch_IoU and 6.84 pp in ISI_IoU. This demonstrates the potential of our model in scenarios with limited annotated data. Consistent with results from the binary segmentation task, removing adversarial augmentation led to a noticeable drop in IoU scores, highlighting its role in preventing learning saturation from over-reliance on hand-crafted strong augmentations in unlabelled data. The results show that removing strong augmentation hurts performance more than removing weak augmentation, suggesting that, similar to binary segmentation, multi-class segmentation benefits from a greater input difference between branches for effective consistency regularization.

Adversarial augmentation analysis

We further evaluated the sensitivity of our results to changes in the maximum amplitude value \(\epsilon\) of the adversarial perturbation as shown in Fig. 6 and qualitatively illustrated in Fig. 7. When increasing from \(\epsilon\) = 0.0 (i.e. no perturbation), we observed a consistent pattern across different ratios of labelled data for both FGSM and I-FGSM. Initially, as \(\epsilon\) increased, the segmentation performance improved, reaching its peak at approximately \(\epsilon\) = 0.08. However, beyond this optimal point, the performance started to decline, indicating that stronger perturbations can enhance the model’s performance only within a certain range.

In this work, by precisely defining the acceptable range of perturbations, we restrict the perturbations within the \(\epsilon\)-neighbourhood, which ensures that the integrity of the instrument pixels is preserved while introducing subtle variations that aid in improved generalization of the model. Comparing FGSM and I-FGSM, we found that I-FGSM showed superior performance to FGSM before reaching the optimal \(\epsilon\) value. However, after the optimal point, I-FGSM exhibited a more significant decrease in the model’s performance compared to FGSM, which suggests that I-FGSM has a higher attack success rate than FGSM but becomes more harmful to the model when the attacking amplitude is large.

We also varied the number of I-FGSM iterations as shown in Table 7 and qualitatively illustrated in Fig. 7. Increasing the number of iterations in the I-FGSM attack yields a minor improvement, consistent with the expectation that more iterations increase the attack success rate. However, it is important to consider that this small improvement in segmentation performance through increased iterations comes with a trade-off in computational efficiency.

Fig. 5
figure 5

Mean Dice score produced by varying the confidence threshold for pseudo-labels.

Additionally, we conducted experiments using various adversarial learning methods43,44 as shown in Table 7. Our findings indicate that the performance achieved with one-step attack methods is consistently lower than that of our iterative strategy, which suggests that, when manipulating image samples to create adversarial examples, breaking the attack down into smaller steps can improve the overall success rate, as noted in previous research14.

The same ablation studies for adversarial attack methods were conducted on the multi-class segmentation task using the CholecInstanceSeg dataset, as summarized in Table 8. The results consistently align with those observed in binary segmentation, demonstrating that one-step attack approaches perform worse than our iterative adversarial augmentation. Notably, when the maximum perturbation amplitude \(\epsilon\) is fixed, the number of iterations has a greater impact on the performance of multi-class segmentation compared to binary segmentation.

Table 7 Segmentation performance of SegMatch on Robust-MIS 2019 when applying different adversarial attack methods, evaluating on binary segmentation task.
Table 8 Segmentation performance of SegMatch on CholecInstanceSeg when applying different adversarial attack methods, evaluating on multi-class segmentation task.
Fig. 6
figure 6

Optimal \(\epsilon\) value enhances segmentation performance (as indicated by the peak in mean Dice score).

Failure cases analysis

We present failure cases in Fig. 8, highlighting challenges where SegMatch struggles to accurately segment surgical instruments. While adversarial and trainable augmentations allow flexible transformations to handle complex scenarios like overlapping tools or instruments obscured by tissue, the model faces difficulties with reflective surfaces (e.g., gauze in the 3rd example) and instruments resembling the tissue background (4th example), leading to false positives and imprecise pseudo-labels.

This implies that noise in unlabelled data, such as poor image quality or artefacts, can reduce SegMatch’s performance, since consistency regularization may amplify errors. Additionally, out-of-distribution scenarios, such as the instrument-dominated image in the 2nd example, further challenge SegMatch, demonstrating its limitations in adapting to highly variable or ambiguous inputs despite leveraging unlabelled data effectively.

Fig. 7
figure 7

Examples showcasing the impact of strong augmentation transform functions and adversarial augmentation on an original unlabelled image input to the model. The first column features the original image covered by its segmentation mask output from the model, as well as the strongly augmented image obtained via the initial strong augmentation and its output segmentation mask. Columns 2-6 showcase adversarial images produced by I-FGSM with varying values of \(\epsilon\) and K (which becomes FGSM when \(K=0\)) to replace the original strongly augmented images for model parameter updating. The upper rows in columns 2-6 display the adversarial images, while the bottom rows show the corresponding segmentation results produced by the model.

Fig. 8
figure 8

Failure cases of SegMatch’s output. The first row shows the original image with the ground truth mask, while the second row overlays SegMatch predictions on the original image. Columns 1 to 4 illustrate the 1st, 2nd, 3rd, and 4th examples, respectively.

Discussion and conclusion

In this paper, we introduced SegMatch, a semi-supervised learning algorithm for surgical tool segmentation that achieves state-of-the-art results across two of the most commonly used datasets in this field. SegMatch was adapted from a simple semi-supervised classification algorithm, FixMatch, which combines consistency regularization and pseudo-labelling. During training, SegMatch makes use of a standard labelled image pathway and an unlabelled image pathway, with training batches mixing labelled and unlabelled images. The unlabelled image pathway is composed of two concurrent branches. A weak augmentation branch is used to generate pseudo-labels against which the output of a strong augmentation branch is compared. Considering the limitations of fixed handcrafted strong augmentation techniques, we introduced adversarial augmentations to increase the effectiveness of the strongly augmented images. We also highlighted the importance of considering equivariance and invariance properties in the augmentation functions used for segmentation.

Putting our work into its application context, automatic visual understanding is critical for advanced surgical assistance but labelled data to train supporting algorithms are expensive and difficult to obtain.

SegMatch’s ability to leverage unlabelled data makes it highly applicable to clinical environments, where appropriately annotated data is severely restricted due to time and budget constraints. Moreover, SegMatch’s adaptability to varying levels of supervision aligns well with clinical workflows, where relying on partial annotations or weak labels is more feasible than the resource-intensive process of creating fully annotated datasets.

In robotic-assisted surgeries, our approach enables precise segmentation of surgical instruments during intraoperative procedures, facilitating real-time image analysis and delivering enhanced visual guidance to support surgical decision-making. Key applications include augmented reality for precise overlay placement, haptic feedback simulation to improve maneuver safety, and tissue mosaicking to broaden the surgical visual field. Furthermore, these advancements can contribute to surgical workflow optimization, objective skills assessment, camera calibration, and visual servoing, paving the way for improved accuracy, efficiency, and automation in complex surgical interventions.

The principles underlying SegMatch are not confined to surgical instrument segmentation. Our trainable adversarial augmentation strategy could be equally beneficial in domains requiring precise segmentation of complex structures where annotations are limited and expensive to obtain.

We believe that simple but strong-performance semi-supervised segmentation learning algorithms, such as our proposed SegMatch, will not only accelerate the deployment of surgical instrument segmentation in the operating theatre but also be applicable to other domains lacking annotated data.