Introduction

Oropharyngeal cancer (OPC) is a subtype of head and neck squamous cell carcinoma that predominantly affects the tonsils and the base of the tongue and poses substantial challenges in medical imaging and treatment. Early detection and effective treatment of OPC are critical for improving patient outcomes, both in terms of quality of life and survival1. Magnetic Resonance Imaging (MRI), Computed Tomography (CT), and Positron Emission Tomography (PET) are the primary modalities used for the initial staging, planning of radiation therapy (RT), and follow-up of OPC2. RT is a pivotal treatment modality for OPC, but it relies on laborious and error-prone manual or semi-automatic segmentation of the primary gross tumor volume (GTVt)2. Accurate segmentation of the GTVt in the oropharynx region is particularly challenging due to significant interobserver variability3,4,5. This challenge not only compromises the efficacy of treatment, but also increases both the duration and cost of care2. Consequently, there is a need for precise, fast, and cost-efficient automatic segmentation techniques for the OPC GTVt to enhance treatment outcomes and operational efficiency.

Automated segmentation of the OPC GTVt using deep learning (DL) methods has shown promise in reducing variability and enhancing the precision and reliability of radiotherapy planning6,7,8. However, the segmentation can fall short of the required performance and necessitate further manual refinement or complete rework by clinicians. In such cases, interactive deep learning presents a compelling approach by providing an efficient interface for segmentation refinement9.

A common approach for interactive DL segmentation is click-based interaction, where the user provides feedback by clicking on coordinates requiring correction9,10. To our knowledge, the only work considering interactive DL for OPC tumor segmentation is11, which used a slice-based method where users manually segment an entire slice of the tumor volume. However, this approach requires time-consuming retraining of the DL model after each interaction, limiting its practical use in clinical settings. Consequently, research on practical interactive deep learning methods for improving GTVt segmentation remains limited, despite their potential benefits9.

Among related works, DeepGrow12 and DeepEdit13, both integrated within the Medical Open Network for Artificial Intelligence (MONAI)14, are two widely known click-based interactive segmentation methods. DeepGrow incorporates interaction events in each iteration during training, which leads to lower initial segmentation performance compared to the baseline; with interactions, however, segmentation accuracy improves rapidly12,13. DeepEdit addresses this issue by introducing a hyperparameter called “click-free iterations,” which controls the fraction of non-interactive training iterations. While this improves initial segmentation, it negatively affects performance during interactions13. Evaluations on multiple 3-dimensional (3D) medical image datasets have shown that DeepEdit is inferior to traditional DL segmentation methods in non-interactive mode, and inferior to DeepGrow in interactive mode13. Hence, both methods exhibit a trade-off between non-interactive and interactive segmentation performance.

In this study, we introduce a novel two-stage Interactive Click Refinement (2S-ICR) framework to enhance user-driven segmentation refinement while preserving state-of-the-art initial segmentation accuracy. In addition, we demonstrate the effectiveness of interactive OPC GTVt segmentation from volumetric PET-CT scans and highlight its potential in clinical applications. Moreover, we conduct a comprehensive comparison of existing click-based state-of-the-art interactive deep learning methods against our proposed segmentation framework, which establishes a new benchmark for future research on OPC GTVt segmentation.

Our findings reveal that the trade-off between the non-interactive and interactive segmentation performance can be addressed by dividing the task into two stages and training specialized deep learning models for each stage. We refer to these models as the initial and refinement networks. Additionally, we find that the sigmoid probability volume can be used efficiently as a memory mechanism, not only between the non-interactive and interactive deep learning models but also across interaction events. Furthermore, we demonstrate that an ensemble approach can be seamlessly integrated into interactive semantic segmentation.

Results

Experimental setup

We trained and validated interactive deep learning models using five-fold cross-validation on the 2021 HECKTOR dataset2 and simulated user interactions as in12,13. To reduce variability due to the simulated probabilistic interactions, validation was repeated three times with different seeds. Testing on the MDA dataset employed an ensemble of the five models trained in the HECKTOR 2021 cross-validation. DeepEdit was trained with 0%, 25%, and 50% click-free proportions, referred to as DeepGrow, DeepEdit-25, and DeepEdit-50, respectively.

Segmentation performance

Initial segmentation, or 0-click segmentation, represents the model output before any user interactions. The 2S-ICR framework demonstrated superior Dice Similarity Coefficient (DSC) performance on the HECKTOR 2021 and MDA datasets. On the MDA dataset, 2S-ICR achieved a DSC of 0.722, surpassing DeepGrow with 0.642 and the DeepEdit variants with 0.642 and 0.721 for the 25% and 50% click-free proportions, respectively. On the HECKTOR 2021 dataset, 2S-ICR achieved the highest DSC of 0.752, exceeding DeepGrow with 0.663 and the DeepEdit models with 0.729 and 0.738. For HD95, the 2S-ICR achieved the best result on the HECKTOR 2021 dataset with a value of 3.000 mm. On the MDA dataset, the 2S-ICR achieved a value of 5.385 mm, second only to DeepEdit-50 with 5.099 mm. Full results for the MDA and HECKTOR 2021 datasets are shown in Tables 1 and 2, respectively.

Table 1 Quantitative results on the MDA dataset (\(N = 67\)) for 0, 1, 5, and 10 clicks, as well as the overall average. Bolded values indicate the best performance in each column.
Table 2 Quantitative results on the HECKTOR 2021 dataset over 5-fold validation for 0, 1, 5, and 10 clicks, as well as the overall average.

On the MDA dataset, the 2S-ICR showed steady improvements across all interaction levels: the DSC increased from 0.722 at 0 clicks to 0.858 at 10 clicks, with an average of 0.820 over the whole interaction range. In terms of HD95, the metric improved markedly from 5.385 mm at 0 clicks to 2.236 mm at 10 clicks, indicating enhanced boundary precision through user interactions.

On the HECKTOR 2021 dataset, the 2S-ICR achieved the highest average DSC of 0.836, with a peak of 0.870 at 10 clicks. The HD95 results paralleled the improvements in the DSC, decreasing from 3.000 mm to 1.732 mm by the tenth interaction. These results reflect a consistent enhancement in segmentation accuracy with increasing user input.

Compared to DeepGrow and the DeepEdit variants, the 2S-ICR consistently delivered the best results. On the MDA dataset, it maintained the highest average DSC of 0.820, compared to DeepGrow’s 0.794 and DeepEdit-25’s 0.795. HD95 results further emphasized its superiority, showing the most substantial improvements across all interaction levels. Statistical analysis in Fig. 1 revealed that the 2S-ICR is statistically significantly better than DeepEdit-25 and DeepEdit-50 after 5 and 10 clicks.

Fig. 1
figure 1

Change in the segmentation performance through click interactions evaluated on the MDA dataset. Performance is evaluated using (a) Dice similarity coefficient (DSC) and (b) Hausdorff Distance at the 95th percentile (HD95). The statistical significance tests between the models are based on the two-sided Wilcoxon signed rank test with Benjamini–Hochberg procedure to correct for multiple testing, in which p < 0.05 is considered significant.

Number of clicks required to achieve specific thresholds

The metric of number of clicks (NoC) highlights the efficiency of the 2S-ICR framework in achieving specific segmentation thresholds. On the MDA dataset, the 2S-ICR consistently required fewer clicks than the DeepGrow and DeepEdit variants. For example, it achieved a DSC of 0.75 with just 1.81 clicks on average, and a DSC of 0.85 with 5.97 clicks. Additionally, 2S-ICR showed superior performance for the HD95 thresholds, requiring fewer interactions to reach the 5.0 mm and 2.5 mm thresholds.

Similarly, on the HECKTOR 2021 dataset, the 2S-ICR mostly outperformed the competing methods across all metrics (Table 3). It reached a DSC of 0.75 with an average of 1.34 clicks and 0.85 with 3.96 clicks. For the HD95 threshold of 5.0 mm, 2S-ICR required only 0.07 clicks on average. Moreover, the percentage of failures (PoF) was the lowest for 2S-ICR in nearly all cases, except for the 0.75 DSC threshold on HECKTOR 2021, where DeepGrow showed a marginally lower failure rate of 0.45% compared to the 2S-ICR’s 1.64%.

Table 3 Number of Clicks (NoC) required to achieve specific thresholds for Dice Similarity Coefficient (DSC) and Hausdorff Distance at the 95th percentile (HD95) on the MDA and HECKTOR 2021 datasets.

Segmentation refinement using the 2S-ICR framework is visually demonstrated with a scan from the MDA test set in Fig. 2. Initially, the network erroneously segmented two regions adjacent to the throat, highlighted in yellow. These regions were connected to a tumor located in the lower horizontal region of the neck, resulting in an overly extensive segmentation mask. Through user interactions, the segmentation surface was iteratively adjusted to align more closely with the ground truth delineation. Specifically, the segmentation on the left side was refined with a single click, while the right side required two additional clicks for optimal correction. This example illustrates the capability of the method to efficiently enhance segmentation accuracy by guiding the segmentation surface closer to the ground truth tumor boundaries through user interaction.

Fig. 2
figure 2

The progressive refinement of the 2S-ICR segmentation shown for the first three interactions, overlaid on CT (top row) and PET (bottom row) slices. False positives are marked in yellow and clicks with white arrows.

Runtime and memory analysis

Table 4 Inference efficiency of interactive segmentation methods on the HECKTOR 2021 dataset, reporting peak VRAM usage and mean inference times (with standard deviation) on an NVIDIA RTX 3080 GPU and Intel i5-12600K CPU.

We evaluated the inference efficiency of our proposed 2S-ICR method against baseline interactive segmentation approaches (DeepGrow12, the DeepEdit variants DeepEdit-25 and DeepEdit-5013, and a non-interactive U-Net baseline15) on 3D volumes from the HECKTOR 2021 dataset. Tests were conducted on an NVIDIA RTX 3080 GPU (10 GB VRAM) and an Intel i5-12600K CPU, with results summarized in Table 4.

As shown in Table 4, 2S-ICR achieves a GPU inference time of 0.08 s, matching the efficiency of DeepGrow, the DeepEdit variants, and U-Net, with a peak VRAM usage of 2.06 GB. This VRAM requirement, while slightly higher than that of DeepGrow/DeepEdit (1.86 GB) and U-Net (1.88 GB), remains well within the capacity of consumer-grade GPUs, ensuring practical deployment. On the CPU, 2S-ICR delivers a competitive inference time of 1.62 ± 0.05 s, marginally faster than the 1.63 ± 0.05 s of the baselines. These results demonstrate that 2S-ICR balances low latency and modest memory demands, making it well-suited for real-time interactive segmentation in clinical workflows.

Impact of mask dropout on interactive segmentation performance

The incorporation of mask dropout during training of the 2S-ICR framework significantly influenced interactive segmentation performance. As presented in Table 5, the DSC increased from \(0.827 \pm 0.134\) to \(0.845 \pm 0.109\) when \(p_{\text {drop}}\) increased from 0 to 0.2. For higher dropout probabilities, the DSC values stabilized around 0.845, showing minimal sensitivity to further increases in \(p_{\text {drop}}\).

Increasing \(p_{\text {drop}}\) from 0.0 to 0.2 had a significant impact on the number of voxels affected per interaction event. As can be seen in Table 5, for \(p_{\text {drop}} = 0.0\), the mean number of voxels adjusted was \(731 \pm 719\), significantly lower than the \(941 \pm 1477\) for \(p_{\text {drop}} = 0.2\). This increase suggests that introducing mask dropout facilitates larger updates in response to user interactions. Beyond \(p_{\text {drop}} = 0.2\), the mean number of changed voxels varied only slightly, stabilizing around \(900\).

Table 5 Effect of varying mask dropout probability (\(p_{\text {drop}}\)) on the interactive segmentation performance of the 2S-ICR framework on the HECKTOR 2021 dataset.

Discussion

Here we have introduced the two-stage Interactive Click Refinement (2S-ICR) framework, a novel interactive deep learning method that redefines the standard for segmenting the primary gross tumor volume in oropharyngeal cancer. Our framework’s core innovation lies in the deployment of two specialized models: an initial segmentation model and a refinement model. This dual-model approach strategically eliminates the trade-off between non-interactive and interactive performance observed in previous methodologies.

Volumetric medical image segmentation offers significant potential but faces unique challenges. These include heterogeneous data from various imaging devices, imaging artifacts, patient-specific variations, disparate image acquisition and quality across centers, and the presence of lymph nodes with high metabolic responses in PET images2. These complexities can occasionally lead to failures in AI-driven segmentation, thus highlighting the indispensable need for human expertise to interactively guide and refine the segmentation process with AI models. Given the current limitations of technology, this collaboration between human expertise and AI models is essential to achieve precise and reliable results in medical imaging.

Although interactive segmentation has a strong basis in 2D, particularly in non-medical domains16,17,18, previous interactive segmentation research in the OPC GTVt domain has mainly focused on 2D slice-based methods11 or on reducing annotation effort19. However, the state-of-the-art non-interactive segmentation methods for OPC GTVt operate on volumetric PET-CT scans in 3D and have been found to improve performance over 2D-based methods by exploiting global context2,6,7,8,20. Our present work addresses this limitation by performing interactive OPC GTVt segmentation directly in the volumetric space.

Although some 2D interactive segmentation methods employ two models to reduce computational costs21, our motivation for the two-model architecture of 2S-ICR is distinct: we prioritize avoiding the trade-off between non-interactive and interactive performance that often arises in single-model approaches13. By leveraging previous outputs as input, a common practice in 2D shown to stabilize predictions18, we not only enhance 3D performance but also seamlessly chain the non-interactive and interactive models. This enables a synergistic workflow in which each model is optimized for its specific task.

DeepEdit was one of the first interactive models implemented for 3D medical segmentation tasks13, where both the pre- and post-interaction performances were measured. It turned out that the quality of interactive DL segmentation without interactions was worse than that of non-interactive DL methods. DeepEdit addressed this issue, to some extent, with the approach of “click-free” (i.e., non-interactive) training iterations. However, this approach introduced a trade-off: more click-free iterations improved non-interactive performance at the expense of interactive performance. In contrast, the 2S-ICR’s two distinct models ensure optimal initial segmentation and effective refinement with interactions.

In the evaluation of our framework on the MDA dataset, 2S-ICR consistently outperformed established methods such as DeepGrow and the various DeepEdit configurations at all interaction levels, with the only exception being HD95 at 0 clicks. However, as HD95 was not used during training, this also illustrates the discrepancy between DSC and HD95 results. Specifically, the Dice similarity coefficients for the 2S-ICR ranged from 0.722 without any clicks to 0.858 with ten clicks, averaging 0.820 across all interaction levels, which exceeds the performance of competing models (Table 1). These results underscore the efficacy of our dual-model architecture in harnessing user interactions to progressively refine segmentation accuracy without compromising baseline performance.

Moreover, the 2S-ICR showed superior handling of segmentation challenges, as evidenced by the HD95 results. For example, HD95 metrics improved from 5.385 mm at 0 clicks to 2.236 mm at 10 clicks, with a lower interquartile range than the other methods, reflecting a stable and substantial improvement in segmentation quality as user involvement increased (Table 1). These results not only highlight the robustness of 2S-ICR in different operational scenarios, but also show its potential to deliver precise and clinically relevant segmentation in interactive settings.

The analysis of the MDA and HECKTOR 2021 datasets reveals significant variance in the image-level segmentation results, as can be seen in Tables 1 and 2, respectively, confirming the challenges noted in the existing literature on accurate GTVt segmentation2,3,5,20. This variability shows that while some segmentations meet clinical standards, others do not, emphasizing the need for an interactive method to efficiently enhance suboptimal segmentations.

Beyond segmentation accuracy, the inference efficiency of 2S-ICR underscores its potential for clinical adoption. On an NVIDIA RTX 3080 GPU, 2S-ICR achieves an inference time of 0.08 s and matches the performance of established interactive methods like DeepGrow and DeepEdit, while using 2.06 GB of VRAM. Although this VRAM usage is slightly higher than that of the baselines (1.86–1.88 GB), it remains well within the capacity of consumer-grade GPUs. On an Intel i5-12600K CPU, 2S-ICR delivers a low-latency result of 1.62 ± 0.05 s, enabling real-time segmentation on standard clinical workstations without specialized hardware. These attributes highlight 2S-ICR’s suitability for seamless integration into clinical workflows.

Our study has several limitations. First, the 2S-ICR framework was developed and evaluated for a single binary segmentation task, whereas clinical applications often require multi-class segmentation, such as distinguishing primary tumours from lymph nodes20. While 2S-ICR is theoretically extendable to multi-class interactive segmentation, for example by incorporating class-specific positive and negative click maps (i.e., 3D volumes encoding user interactions for each class) and modifying the output layer accordingly, this extension is beyond the scope of the present study and is left for future work.

Second, the evaluation was limited to primary gross tumour volumes using the HECKTOR 20212 and MDA datasets, which, although derived from real-world clinical settings, do not include complex cases such as metal artifacts or post-surgical anatomy. Prior work has shown that evaluation outcomes are sensitive to dataset composition22; therefore, the generalizability of our results to other clinical scenarios is uncertain. For approaches to reducing the effects of metal artifacts, we refer the reader to the literature23,24. In addition, 2S-ICR could potentially be integrated into active learning pipelines25,26 to support efficient annotation of prioritized samples.

Third, we used simulated interaction events, following prior work12,13. Although simulations provide a controlled and scalable environment, they may not fully capture how clinicians interact in practice. The simulator identifies error regions by comparing model predictions to the ground truth and samples interaction points using a distance-weighted probability distribution. While effective for benchmarking, this approach assumes idealized user behavior by favoring areas with large errors and never producing incorrect interaction events. Furthermore, our preliminary results indicated that the location of the interaction had a considerable effect on model performance. This highlights the need to understand clinician behavior during interactive segmentation for improved applicability. Developing better-suited interaction simulators will therefore require analysing human interaction patterns. Because interactive segmentation outcomes depend on the locations and types of interactions, these factors may affect the results; however, such a study was beyond the present scope and is planned for future work.

Despite these limitations, 2S-ICR remains a flexible framework that may generalize to broader clinical applications, support active learning, and be adapted to various segmentation architectures beyond the one evaluated in this study.

The benefits of the proposed framework extend beyond improved accuracy. By eliminating the long-standing trade-off between non-interactive and interactive performance, 2S-ICR unlocks more of the potential of interactive segmentation. This paves the way for wider adoption of interactive segmentation in various clinical applications. By enabling clinicians to easily and quickly improve segmentation results, it promises more accurate treatment planning and improved patient outcomes.

In this study we have introduced 2S-ICR, a new interactive click-based framework for segmentation of the primary gross tumor volume in oropharyngeal cancer. The results show that our framework achieves performance comparable or superior to state-of-the-art interactive deep learning methods, both with and without user interactions. These results highlight the potential of this approach to improve GTVt segmentation, enabling clinicians to quickly improve segmentation results based on just a few interactions. The more accurate segmentation enabled by our approach could lead to more precise OPC treatment planning and improved patient outcomes.

Methods

2S-ICR framework

The 2S-ICR dual-model segmentation framework, depicted in Fig. 3 and formalized in Algorithm 1, integrates two deep learning models: a standard segmentation model with a 2-channel input and an interactive refinement model with a 5-channel input. If no interactions are given, the 2S-ICR segments the PET-CT image using the standard model. When the user first interacts with the model, the given error coordinate, the PET-CT image, and the output of the standard model are passed to the interactive model. For subsequent interactions, the output of the standard model is replaced with the latest output of the 2S-ICR, and the new interaction coordinate is given alongside the previous ones so that the 2S-ICR can further refine its output. We train the standard model and the interactive model separately, so as to closely follow the scenario where the 2S-ICR is applied on top of a pre-trained GTVt segmentation network.
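For concreteness, the following is a minimal Python sketch of this inference flow. The callables `standard_model`, `refinement_model`, and `clicks_to_maps` are illustrative placeholders rather than the actual implementation, standing in for the 2-channel initial network, the 5-channel refinement network, and the click-encoding step described later in this section.

```python
import torch

def run_2s_icr(petct, standard_model, refinement_model, clicks_to_maps, user_clicks):
    """Illustrative 2S-ICR inference loop (hypothetical helper callables).

    petct:          tensor of shape (1, 2, H, W, D) holding the PET and CT channels
    user_clicks:    iterable of (coordinate, is_positive) pairs, one per interaction
    clicks_to_maps: rasterizes the accumulated clicks into two (1, 1, H, W, D) volumes
    """
    with torch.no_grad():
        # Stage 1: non-interactive segmentation from the 2-channel standard model.
        prob = torch.sigmoid(standard_model(petct))

        clicks = []
        for coord, is_positive in user_clicks:
            clicks.append((coord, is_positive))
            pos_map, neg_map = clicks_to_maps(clicks, petct.shape[2:])

            # Stage 2: 5-channel refinement (PET, CT, previous probability volume,
            # positive click map, negative click map). The previous sigmoid volume
            # serves as the memory between interaction events.
            x = torch.cat([petct, prob, pos_map, neg_map], dim=1)
            prob = torch.sigmoid(refinement_model(x))

    return (prob > 0.5).float()                        # final binary segmentation
```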

Fig. 3
figure 3

Visualisation of the 2S-ICR framework. The initial segmentation (\(t=0\)) is provided by a standard model, shown in the green box on the left. The segmentation refinement loop (\(t\ge 1\)) using a refinement model is visualised in the yellow box on the right. Spatial dimensions (H\(\times\)W\(\times\)D), thresholding (>), negative (Neg), and positive (Pos) feature maps.

A key feature of 2S-ICR is its use of sigmoid-activated segmentation volumes as a memory mechanism. These volumes, with continuous values in the range 0 to 1, initially bridge the standard and interactive models by preserving the spatial information of the initial segmentation. During iterative interactions, they maintain continuity between interaction events, stabilizing refinements at each user input. Unlike prior methods such as DeepGrow12 and DeepEdit13, which lack memory mechanisms and thus cannot leverage prior segmentation states, 2S-ICR’s continuous segmentation maps enable it to examine and utilise the prior segmentation state when interpreting interaction inputs, enhancing performance.

Algorithm 1
figure a

2S-ICR interactive segmentation algorithm

We utilized the MONAI implementation of the 3D U-Net architecture27 across all models: the initial segmentation model of the 2S-ICR framework, the segmentation refinement model of the framework, and the models for DeepGrow and DeepEdit. To ensure a fair comparison between these methods, the only difference was the number of input channels. The networks consisted of channel sizes [16, 32, 64, 128, 256], strides [1, 2, 2, 2], and two residual units. The choice of a stride of 1 for the first layer was pivotal for enhancing the impact of the interaction event. All interactive methods use the click encoding scheme proposed by Maninis et al.28 for 2D images, adapted by DeepGrow12 for 3D volumes, and adopted by DeepEdit13 and 2S-ICR. User clicks are encoded as Gaussian-smoothed balls in two 3D volumes: one for positive clicks that mark foreground (e.g., tumors in PET-CT) and one for negative clicks that mark background, guiding precise segmentation refinement.
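As an illustration, this encoding could be implemented as follows; the Gaussian width `sigma` is an assumed value, since the exact smoothing parameter is implementation-specific.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def encode_clicks(clicks, shape, sigma=2.0):
    """Encode clicks as Gaussian-smoothed balls in positive/negative 3D volumes.

    clicks: list of ((z, y, x), is_positive) tuples
    shape:  spatial shape of the image volume, e.g. (144, 144, 144)
    sigma:  width of the Gaussian ball in voxels (illustrative value)
    """
    pos = np.zeros(shape, dtype=np.float32)
    neg = np.zeros(shape, dtype=np.float32)
    for coord, is_positive in clicks:
        (pos if is_positive else neg)[tuple(coord)] = 1.0
    pos, neg = gaussian_filter(pos, sigma), gaussian_filter(neg, sigma)
    for m in (pos, neg):                  # rescale each map to a maximum of 1
        if m.max() > 0:
            m /= m.max()
    return pos, neg
```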

Ensemble of interactive segmentation models

The ensemble experiments use a five-member ensemble based on the 5-fold cross-validation, for both the standard and interaction models. The ensemble probability is the average of the member probabilities. Ensembling was integrated into the interaction loop by selecting each interaction coordinate based on the ensemble prediction, i.e., each member of the interaction-model ensemble receives the same interaction coordinates in each iteration.
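A minimal sketch of this ensemble step is shown below, assuming a list of trained member networks with identical input formats.

```python
import torch

def ensemble_probability(models, x):
    """Average the sigmoid outputs of the five cross-validation members.

    models: list of trained member networks sharing the same input format
    x:      input tensor (PET-CT, plus probability and click channels for refinement)
    """
    with torch.no_grad():
        probs = [torch.sigmoid(m(x)) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)
```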

Simulated interactions

As large-scale training and validation of an interactive DL model are not feasible with human interactions, we chose to simulate interactions in these phases. We used the user click interaction simulator proposed in12,13. The click simulator compares the model output to the ground truth segmentation in order to select optimal interaction coordinates. Specifically, the simulator first extracts erroneous regions by examining where in the volume the model output and the ground truth differ. Then, for each erroneous voxel, the distance to the border of the erroneous region is computed. The distances are then normalized by the sum of all distances, so that all voxel values lie in the range [0, 1] and sum to 1. Finally, we treat these values as the probabilities of a multinomial distribution and sample an interaction coordinate accordingly.
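The following sketch illustrates one way such a simulator could be implemented with SciPy; the assignment of positive versus negative clicks from the ground truth label at the sampled voxel is our assumption of the usual convention, not a detail stated above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def simulate_click(prediction, ground_truth, rng=None):
    """Sample one interaction coordinate from the erroneous regions.

    prediction, ground_truth: binary 3D arrays of equal shape.
    Returns the voxel coordinate and whether the click is positive
    (tumor voxel missed by the model) or negative (false positive).
    """
    rng = rng or np.random.default_rng()
    error = prediction != ground_truth
    if not error.any():
        return None, None
    # Distance of each erroneous voxel to the border of its error region.
    dist = distance_transform_edt(error)
    probs = dist / dist.sum()                        # values in [0, 1], summing to 1
    idx = rng.choice(error.size, p=probs.ravel())    # multinomial sampling of one voxel
    coord = np.unravel_index(idx, error.shape)
    is_positive = bool(ground_truth[coord])          # convention: positive click on missed tumor
    return coord, is_positive
```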

Training procedure

The 2S-ICR framework comprises an initial segmentation network and a refinement network, designed to iteratively enhance segmentation accuracy based on user feedback. A refinement network that becomes overly reliant on the initial segmentation can fail to effectively incorporate interaction feedback; to prevent this, we introduce a novel regularization strategy during training. Our approach randomly omits the initial segmentation with a probability of \(p_{\text {drop}}\), replacing it with a neutral volume filled with the value 0.5. Since the initial segmentation is a post-sigmoid output where each voxel represents the probability of belonging to the foreground class, the value 0.5 signifies uncertainty. This regularization encourages the network to rely more on the original input and user interactions, improving performance and responsiveness to feedback.
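In code, the mask dropout step can be expressed as a small helper; this is an illustrative sketch rather than the exact implementation.

```python
import torch

def maybe_drop_initial_mask(initial_prob, p_drop=0.2):
    """With probability p_drop, replace the initial segmentation by a neutral
    volume of 0.5, i.e., maximal uncertainty for a sigmoid output."""
    if torch.rand(1).item() < p_drop:
        return torch.full_like(initial_prob, 0.5)
    return initial_prob
```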

We evaluated the impact of varying \(p_{\text {drop}}\) on segmentation performance and the number of modified voxels per interaction (Table 5). The results indicate that \(p_{\text {drop}} > 0\) consistently yields better outcomes than \(p_{\text {drop}} = 0\). The average DSC improved with non-zero values of \(p_{\text {drop}}\), and the number and standard deviation of changed voxels per interaction increased, reflecting greater adaptability in the refinement process. Based on these findings, we chose \(p_{\text {drop}} = 0.2\) for our experiments.

Unlike previous interactive volumetric segmentation methods13,29 that execute the backward pass only after all interaction events have been accumulated, which makes the process computationally costly, we optimized the refinement network of the 2S-ICR framework at every interaction event during training. This approach significantly accelerated the training process while still allowing the model to achieve state-of-the-art performance.

The initial network of the 2S-ICR is trained without any user interactions, and its output segmentation serves as the starting point for subsequent refinements by the refinement network. For the training of the refinement network, we follow the procedure proposed in prior research13. Specifically, we randomly determine the number of simulated interactions in each training iteration, employing a uniform distribution ranging from 1 to 15 interactions. This approach ensures that the refinement network learns to effectively incorporate user interactions across a diverse range of scenarios. During training, the best checkpoint was chosen based on the highest mean DSC over interactions ranging from 0 to 10, reflecting our primary evaluation metrics as presented in Tables 1 and 2.
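The sketch below combines the mask dropout above with the per-event optimization and the uniformly sampled number of interactions just described; `simulate_click` and `clicks_to_maps` are the hypothetical helpers sketched earlier, and tensor shapes follow the conventions of the inference sketch.

```python
import random
import torch

def refinement_training_step(refinement_model, optimizer, loss_fn, petct, gt,
                             initial_prob, simulate_click, clicks_to_maps,
                             p_drop=0.2):
    """One training iteration of the refinement network (illustrative sketch;
    simulate_click and clicks_to_maps are hypothetical helper callables)."""
    # Mask dropout: occasionally replace the initial segmentation with 0.5.
    prob = initial_prob.detach()
    if random.random() < p_drop:
        prob = torch.full_like(initial_prob, 0.5)

    clicks = []
    for _ in range(random.randint(1, 15)):            # 1 to 15 simulated interactions
        coord, is_positive = simulate_click((prob > 0.5).cpu().numpy()[0, 0],
                                            gt.cpu().numpy()[0, 0])
        if coord is None:                             # no remaining errors to correct
            break
        clicks.append((coord, is_positive))
        pos_map, neg_map = clicks_to_maps(clicks, petct.shape[2:])

        logits = refinement_model(torch.cat([petct, prob, pos_map, neg_map], dim=1))
        loss = loss_fn(logits, gt)

        optimizer.zero_grad()
        loss.backward()                               # backward pass at every interaction event
        optimizer.step()

        prob = torch.sigmoid(logits).detach()         # memory carried to the next event
    return prob
```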

All models were trained using a composite loss function that integrates the Dice Loss with the Binary Cross-Entropy (BCE) Loss, similar to previous approaches for OPC GTVt segmentation7,8. The composite loss function is formulated as:

$$\begin{aligned} \mathcal {L}_{\text {DiceBCE}} = \mathcal {L}_{\text {Dice}} + \mathcal {L}_{\text {BCE}}, \end{aligned}$$
(1)

where \(\mathcal {L}_{\text {Dice}}\) denotes the Dice Loss and \(\mathcal {L}_{\text {BCE}}\) the Binary Cross-Entropy Loss. While the composite loss can also be computed as a weighted sum of its components, we chose to use uniform weights.

The Dice Loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {Dice}} = 1 - \frac{2 \sum _{i} p_{i} g_{i} + \epsilon }{\sum _{i} p_{i} + \sum _{i} g_{i} + \epsilon }, \end{aligned}$$
(2)

where \(p\) and \(g\) represent the model output and ground truth segmentation, respectively, and \(\epsilon = 1 \times 10^{-5}\) is a smoothing factor to prevent division by zero. The Binary Cross-Entropy Loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {BCE}} = -\sum _{i} \left[ g_{i} \log (p_{i}) + (1 - g_{i}) \log (1 - p_{i}) \right] . \end{aligned}$$
(3)

In both loss functions, the summation is over the voxels of the segmentation volume. The composite loss allows for a more comprehensive optimization by addressing both the overlap of the imbalanced foreground class and the per-voxel classification accuracy30.
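A direct PyTorch translation of Eqs. (1)-(3) could look as follows; note that, as written, the BCE term is summed over voxels, whereas a mean reduction is a common practical alternative.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, eps=1e-5):
    """Composite loss of Eqs. (1)-(3): soft Dice loss plus binary cross-entropy.

    logits: raw network output; target: float ground truth volume with 0/1 values.
    """
    prob = torch.sigmoid(logits)
    p, g = prob.flatten(), target.flatten()

    dice = 1.0 - (2.0 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)
    # Eq. (3) sums the BCE over voxels; reduction="mean" is a common alternative.
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
    return dice + bce
```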

To enhance the models’ robustness to variations in imaging conditions and reduce overfitting, we applied a data augmentation pipeline during model training, adhering to the procedures presented in8. All models were trained under the same settings for fair comparison. Specifically, we applied random affine transformations that include rotations over all axes up to 45 degrees, scaling and shearing within ranges of [-0.1, 0.1], and translation within a range of [-32, 32] voxels, each with a probability of 0.5. In addition, the augmentations included mirroring over all axes, applied with the same probability of 0.5. For the CT modality, additional intensity augmentations were applied, each with a probability of 0.25: random contrast adjustment with a gamma range of [0.5, 1.5], intensity shifting with an offset of 0.1, random Gaussian noise with a standard deviation of 0.1, and Gaussian smoothing.
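For illustration, such a pipeline could be assembled from MONAI dictionary transforms roughly as follows; the dictionary keys and the per-key interpolation modes are our assumptions, not the exact configuration used here.

```python
import numpy as np
from monai.transforms import (Compose, RandAffined, RandFlipd, RandAdjustContrastd,
                              RandShiftIntensityd, RandGaussianNoised,
                              RandGaussianSmoothd)

# Keys "pet", "ct", and "label" are illustrative names for the dictionary entries.
train_transforms = Compose([
    # Spatial augmentations applied jointly to both modalities and the label.
    RandAffined(keys=["pet", "ct", "label"], prob=0.5,
                mode=("bilinear", "bilinear", "nearest"),
                rotate_range=(np.pi / 4,) * 3,        # up to 45 degrees around each axis
                scale_range=(0.1, 0.1, 0.1),
                shear_range=(0.1, 0.1, 0.1),
                translate_range=(32, 32, 32)),
    RandFlipd(keys=["pet", "ct", "label"], prob=0.5, spatial_axis=0),
    RandFlipd(keys=["pet", "ct", "label"], prob=0.5, spatial_axis=1),
    RandFlipd(keys=["pet", "ct", "label"], prob=0.5, spatial_axis=2),
    # Intensity augmentations applied to the CT channel only.
    RandAdjustContrastd(keys=["ct"], prob=0.25, gamma=(0.5, 1.5)),
    RandShiftIntensityd(keys=["ct"], offsets=0.1, prob=0.25),
    RandGaussianNoised(keys=["ct"], prob=0.25, std=0.1),
    RandGaussianSmoothd(keys=["ct"], prob=0.25),
])
```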

Each model was trained for a maximum of 300 epochs, employing early stopping with a patience of 50 epochs. We utilized the AdamW31 optimization algorithm. The initial learning rate was set to \(1 \times 10^{-4}\) and gradually decayed to zero by the final epoch using the cosine annealing scheduler. The AdamW weight decay coefficient was set to \(1 \times 10^{-5}\). The best checkpoint of each model was determined based on the highest mean DSC over 0 to 10 interactions, evaluated after each epoch on the validation fold during 5-fold cross-validation. Each model was trained using a mini-batch size of 1 on an NVIDIA A100 GPU with 80 GB of memory. Additional validation runs were performed on an NVIDIA RTX 3080 GPU with 10 GB of memory.
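An illustrative setup matching the stated architecture and optimization hyperparameters is sketched below; the training loop body is a placeholder, and the channel counts simply distinguish the refinement network (5 input channels) from the initial network (2 input channels).

```python
import torch
from monai.networks.nets import UNet

# Refinement network (5 input channels); the initial network would use in_channels=2.
model = UNet(spatial_dims=3, in_channels=5, out_channels=1,
             channels=(16, 32, 64, 128, 256), strides=(1, 2, 2, 2), num_res_units=2)

max_epochs = 300
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs, eta_min=0.0)

for epoch in range(max_epochs):
    # ... run one training epoch (mini-batch size 1), validate with 0-10 clicks,
    #     track the best mean-DSC checkpoint, and apply early stopping ...
    scheduler.step()
```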

Evaluation measures

We assess the performance of the algorithms using two widely adopted metrics for automated GTVt segmentation: the Dice Similarity Coefficient and the Hausdorff Distance at the 95th percentile2. Given the isotropic 1 mm voxel resolution of both the HECKTOR and MDA datasets, Euclidean distances between voxels correspond directly to distances in millimeters.

Before computing the evaluation metrics, the predicted segmentation probabilities are binarized by thresholding at 0.5 to produce the binary segmentation \(S\):

$$\begin{aligned} S = \{ i \mid p_i > 0.5 \}, \end{aligned}$$
(4)

where \(p_i\) is the predicted probability at voxel \(i\). This binarization is applied consistently for all evaluation metrics to ensure a fair comparison between the predicted and ground truth segmentations.

The DSC measures the volumetric overlap between the ground truth segmentation and the predicted segmentation. Let \(G\) be the set of voxels belonging to the ground truth segmentation:

$$\begin{aligned} G = \{ i \mid g_i = 1 \}, \end{aligned}$$
(5)

where \(g_i\) is the ground truth label at voxel \(i\), with \(g_i = 1\) indicating the foreground class.

The DSC is defined as:

$$\begin{aligned} \text {DSC}(G, S) = \frac{2 |G \cap S|}{|G| + |S|}, \end{aligned}$$
(6)

where \(| \cdot |\) denotes the cardinality of a set, and \(G \cap S\) represents the intersection of the two sets. The DSC ranges from 0 (no overlap) to 1 (perfect overlap), providing a measure of how closely the predicted segmentation matches the ground truth.
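A minimal NumPy sketch of the thresholding and DSC computation in Eqs. (4)-(6) is given below; the convention of returning 1 when both masks are empty is our choice, not specified above.

```python
import numpy as np

def dice_similarity(pred_prob, gt, threshold=0.5):
    """DSC between the thresholded prediction and the ground truth (Eqs. 4-6)."""
    s = pred_prob > threshold
    g = gt.astype(bool)
    denom = s.sum() + g.sum()
    # Convention: if both masks are empty, count the prediction as perfect.
    return 2.0 * np.logical_and(s, g).sum() / denom if denom > 0 else 1.0
```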

The HD95 metric quantifies the spatial discrepancy between the surfaces of the ground truth and predicted segmentations on a per-volume basis. It is a robust version of the Hausdorff Distance, focusing on the 95th percentile of the distances to reduce the impact of outliers.

For each volume \(v\), we first extract the surfaces (boundary voxels) of the ground truth segmentation \(\partial G_v\) and the predicted segmentation \(\partial S_v\) from their respective binary segmentations \(G_v\) and \(S_v\).

We define the minimum distance from a point to a surface as:

$$\begin{aligned} d(a, \partial B) = \min _{b \in \partial B} \Vert a - b \Vert , \end{aligned}$$
(7)

where \(a\) is a point on one surface, \(\partial B\) is the set of points on the other surface, and \(\Vert a - b \Vert\) is the Euclidean distance between points \(a\) and \(b\).

The sets of distances are then:

$$\begin{aligned} D_{G_v \rightarrow S_v}&= \{ d(x, \partial S_v) \mid x \in \partial G_v \}, \end{aligned}$$
(8)
$$\begin{aligned} D_{S_v \rightarrow G_v}&= \{ d(y, \partial G_v) \mid y \in \partial S_v \}. \end{aligned}$$
(9)

We combine these distances into a single set for each volume:

$$\begin{aligned} D_v = D_{G_v \rightarrow S_v} \cup D_{S_v \rightarrow G_v}. \end{aligned}$$
(10)

The HD95 metric for volume \(v\) is then defined as the 95th percentile of the distances in \(D_v\):

$$\begin{aligned} \text {HD95}_v = \operatorname {Percentile}_{95}(D_v), \end{aligned}$$
(11)

where \(\operatorname {Percentile}_{95}(D_v)\) denotes the value below which 95% of the distances in \(D_v\) fall. Since the voxel spacing is isotropic 1 mm, these distances are measured in millimeters.
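A compact SciPy-based sketch of the per-volume HD95 computation follows; surface extraction via binary erosion is one common choice, and returning infinity for empty surfaces is an assumption rather than a stated rule.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def hd95(gt, pred):
    """95th-percentile Hausdorff distance; voxel units equal mm at isotropic 1 mm."""
    def surface(mask):
        return mask & ~binary_erosion(mask)           # boundary voxels of the mask

    g_surf, s_surf = surface(gt.astype(bool)), surface(pred.astype(bool))
    if not g_surf.any() or not s_surf.any():
        return np.inf                                 # undefined when a surface is empty

    # For every voxel, distance to the nearest surface voxel of the other mask.
    dist_to_s = distance_transform_edt(~s_surf)
    dist_to_g = distance_transform_edt(~g_surf)
    distances = np.concatenate([dist_to_s[g_surf], dist_to_g[s_surf]])
    return np.percentile(distances, 95)
```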

After computing the HD95 metric for each volume, we aggregate the results by reporting the median and interquartile range (IQR) of the image-level HD95 values:

$$\begin{aligned} \text {Median HD95}&= \operatorname {Median} \left( \{ \text {HD95}_v \} \right) , \end{aligned}$$
(12)
$$\begin{aligned} \text {IQR HD95}&= Q_3 - Q_1, \end{aligned}$$
(13)

where \(Q_1\) and \(Q_3\) are the 25th and 75th percentiles of the set \(\{ \text {HD95}_v \}\), respectively.

To evaluate the interaction efficacy of the algorithms, we report the Number of Clicks (NoC) required to achieve predefined performance thresholds for both the DSC and HD95 metrics. The NoC quantifies the average number of user interactions needed for each sample to exceed these thresholds.

For the DSC metric, we report the NoC required to reach DSC thresholds of 0.75 and 0.85, which indicate the number of clicks needed to achieve a Dice Similarity Coefficient of 0.75 and 0.85, respectively. Similarly, for the HD95 metric, we report the NoC required to reduce the HD95 below thresholds of 5.0 mm and 2.5 mm, corresponding to the number of clicks needed to bring the Hausdorff Distance at the 95th percentile below 5.0 mm and 2.5 mm, respectively.

The maximum number of allowed clicks is set to 20. If the model fails to reach the target threshold within this limit, the sample is considered a failure. We report the Percentage of Failures (PoF) as the percentage of failed samples for each threshold, calculated as:

$$\begin{aligned} \text {PoF} = \left( \frac{\text {Number of Failed Samples}}{\text {Total Number of Samples}} \right) \times 100\%. \end{aligned}$$
(14)

Specifically, the PoF at DSC thresholds of 0.75 and 0.85 denotes the percentage of samples that did not achieve the respective DSC thresholds within 20 clicks. Likewise, the PoF at HD95 thresholds of 5.0 mm and 2.5 mm represents the percentage of samples that did not reduce the HD95 below the respective thresholds within 20 clicks.
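The NoC and PoF metrics can be computed from per-click DSC trajectories roughly as follows; how failed samples contribute to the NoC average (here counted at the 20-click cap) is our assumption.

```python
import numpy as np

def noc_and_pof(dsc_per_click, threshold=0.75, max_clicks=20):
    """Number of Clicks (NoC) and Percentage of Failures (PoF).

    dsc_per_click: per-sample DSC trajectories, where entry k of a trajectory
    is the DSC after k clicks (k = 0 .. max_clicks).
    """
    nocs, failures = [], 0
    for trajectory in dsc_per_click:
        trajectory = np.asarray(trajectory)[: max_clicks + 1]
        reached = np.nonzero(trajectory >= threshold)[0]
        if reached.size:
            nocs.append(reached[0])                   # first click count reaching the threshold
        else:
            nocs.append(max_clicks)                   # failed sample, counted at the click cap
            failures += 1
    return float(np.mean(nocs)), 100.0 * failures / len(dsc_per_click)
```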

Datasets

This study does not involve human subjects as it relies on retrospective and registry-based data; therefore, it is not subject to IRB approval. Our external validation dataset was retrospectively collected under a HIPAA-compliant protocol approved by the MD Anderson Institutional Review Board (RCR03-0800), which includes a waiver of informed consent.

The 2021 HEad and neCK TumOR dataset (HECKTOR), introduced in2, consists of co-registered PET-CT images from 224 patients. The dataset was gathered from five centers located in Canada, Switzerland, and France, with ground truth GTVt segmentations obtained by agreement among multiple annotators. The external MD Anderson Cancer Center dataset (MDA) consists of co-registered PET-CT images from 67 human papillomavirus-positive patients, with the GTVt segmentations from a single annotator. The images are cropped to contain only the head and neck region, centered on the GTVt, and resampled to \(144^3\) volumes with isotropic 1 mm resolution, i.e., equal pixel spacing and slice thickness.

We adhered to the data normalization procedure established in the previous work7. The CT scans were windowed to [-200, 200] Hounsfield units and subsequently normalized to the range of [-1, 1]. The PET scans were standardized using z-score normalization. This normalization procedure ensured consistency across the datasets and enabled the use of the same models without retraining.
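A minimal sketch of this normalization, assuming NumPy volumes:

```python
import numpy as np

def normalize_ct(ct):
    """Window the CT to [-200, 200] HU and rescale to [-1, 1]."""
    return np.clip(ct, -200.0, 200.0) / 200.0

def normalize_pet(pet):
    """Z-score normalization of the PET volume."""
    return (pet - pet.mean()) / (pet.std() + 1e-8)
```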