Introduction

The transformative roles and applications of artificial intelligence (AI) in healthcare have led to rapid development of the technology and exploration of its potential to advance patient care1,2. In radiotherapy, AI is expected to play an important role in almost every step of the workflow, from CT simulation to treatment delivery3. Prior to treatment planning, clinicians have traditionally contoured organs at risk (OARs) and target volumes manually on CT scans. This time-consuming, yet indispensable, procedure ensures the accurate assessment of radiation dose to normal tissues and the target volume3,4. From atlas-based contouring to machine learning and now deep learning (DL)-based contouring5,6, clinical adoption of auto-segmentation has increased, driven by its accurate and efficient output and the availability of numerous clinically ready commercial DL-based auto-segmentation models7,8. This was demonstrated in a recent study by E. Gibbons et al., which found that DL auto-segmentation outperformed the atlas-based method in the contouring accuracy of OARs in the head-and-neck, thoracic and pelvic regions9.

Specifically for head-and-neck cases, several studies10,11,12 have highlighted the resource-intensive process of manual delineation due to the complexity and extensive number of OARs and lymph node (LN) levels involved in this anatomical region. In addition, heterogeneous and inconsistent contouring among expert physicians has been widely reported13,14,15. Hence, interest in the development and use of AI-based annotation of medical images has increased significantly over the last decade to address these logistical and inter-observer variability challenges. Several studies have reported numerous clinical benefits of auto-segmentation algorithms. X. Ye et al. and J. J. Lucido et al. reported significant time savings from auto-segmentation of OARs in head-and-neck cancer, with time reductions of more than 76%8,16. L. J. Stapleford et al. demonstrated that atlas-based LN segmentation reduced inter-observer contouring variation17, and similar findings have been observed in other treatment sites such as breast cancer18.

There are several studies on DL-based auto-segmentation in head-and-neck cancers, but most are limited to algorithm development and performance quantification, and are largely single-institution studies19,20,21,22. At the time of writing, only one study has reported a non-randomised multi-centre study using a stratified OAR segmentation system (SOARS) to automatically delineate 42 head-and-neck OARs8. They evaluated SOARS in five external institutions and found up to 90% clinical workload reduction and smaller inter-observer variation in contouring. Despite the reported clinical benefits, two potential confounders should be acknowledged: the limited number of human experts may have experienced observer fatigue from reviewing OAR contours from many patients, and not all 42 regions of interest (ROIs) were evaluated at all five institutions owing to varying institutional practices in OAR contouring. Lastly, all five institutions are located within a single country, so the findings may not be representative of the clinical benefits in a global setting. Our group previously investigated the accuracy and efficiency of DL-based CT auto-segmentation tools for prostate cancer cases and reported improved consistency and time savings from the tool23. The present HARMONY (HeAd neck Rapid deep-learning auto-segmentation tool – a Multi-clinic evaluatiON studY) study adopts a multi-institutional approach with randomization of scans to investigate the clinical benefits of a commercial auto-segmentation algorithm on head-and-neck CT scans.

The proposed study involves seven institutions located across different continents and is structured into two phases. The first phase assesses the time saving and segmentation accuracy of the auto-segmentation algorithm within each institution using its own internal CT scans. The second phase investigates the time saving and the reduction in inter-observer contouring variability among the seven institutions by pooling scans from different institutions. Through this study design and the inclusion of institutions from distinct geographical locations, we aimed to obtain a more representative understanding of the clinical benefits of auto-segmentation software in the global setting, and to identify heterogeneity in these benefits across institutions. Additionally, this study evaluates the clinical benefits of auto-segmenting the various LN levels, an aspect that has not previously been reported in a multi-center setting.

Results

Data characteristics and study design

The schematic of this study is illustrated in Fig. 1. The aim of phase one was to assess the auto-segmentation quality of all 18 ROIs in the 35 CT scans (5 CT scans per clinic × 7 clinics). The assessment was quantified via two approaches. The first objective was to determine the absolute values of the contour similarity metrics between the auto-segmentations and manual contours. The second objective focused on comparing the differences in similarity metrics between auto-segmented versus manual contours and edited versus manual contours. If no statistically significant differences were detected between the metrics, it would imply that the auto-segmentation closely approximates clinically acceptable standards, requiring minimal editing. The aim of phase two was to assess the reduction in inter-observer contouring variability between the manual and the edited contours (when users were presented with the auto-segmentations). This was evaluated among the seven clinics on seven unique CT scans for all 18 ROIs. The time saving clinical benefits were evaluated in both phases of the study. This analysis aimed to investigate whether the time-saving benefits observed when clinicians contoured internal CT scans from their own institution also held for external CT scans from other institutions.

Fig. 1: Schematics of the study.
figure 1

This figure shows the design of the two-phase study. The first phase compares the manual and AI-generated contours within each institution, while the second phase compares them across different institutions. The pooling and exchange of scans in phase two reduces institutional CT scan-specific bias when evaluating the time reduction and inter-observer contouring variation. AI artificial intelligence, CT computed tomography.

Contouring time saving in phase one and two

The contouring time saving results for phases one and two are shown in the left and right panels of Fig. 2. Figure 2a, b show the box plots of the manual contouring and editing times for all the ROIs in phases one and two, respectively. Using the Wilcoxon signed-rank test, statistically significant time reductions were observed for the majority of the ROIs in both phases, except for LN IA, IB_R, IB_L, III_R, IVA_L, IVA_R, IVB_L and IVB_R (indicated by ** in the figures). The percentage time reductions after the use of the auto-segmentation software are shown in Fig. 2c, d. A positive value indicates a shorter time for editing compared to manually contouring the ROI, i.e., that the ROI benefitted from the use of the auto-segmentation software. In terms of average time savings, the parotid glands and brachial plexus benefitted the most, with more than 50% reduction in contouring time. Conversely, editing the LN IVB contours took longer than contouring them manually from scratch. This observation was consistent in both phases one and two. Lastly, the total time reduction in contouring for all the ROIs in each individual clinic is shown in Fig. 2e, f. The time saving was non-uniform across the clinics. The “lion” clinic benefitted the most in terms of absolute time saving, while the “duck” clinic benefitted the least. The total segmentation time in phase one decreased from 90.7 ± 52.2 min to 52.2 ± 22.7 min, and in phase two from 74.4 ± 56.2 min to 37.8 ± 28.0 min. Overall, the segmentation time decreased across both phases from 81.9 ± 54.7 min to 44.4 ± 26.5 min, which amounts to a time saving of about 46%.

Fig. 2: Comparison of the contouring time for phase one and two of the studies.
figure 2

a, b The time taken to contour each ROI for phases one and two, respectively. c, d The time reduction for each ROI between manual and edit-auto. e, f The total time taken to finish contouring all 18 ROIs for the CT scans from each clinic in phases one and two. Each boxplot in (b) represents the results from the five CT scans contoured by a single clinician in the clinic, while each boxplot in (d) shows the contouring time of six different clinicians on a single CT scan from a different clinic. The blue and orange boxplots represent the time taken to manually contour the ROI from scratch and the editing time of an AI-generated contour, respectively. (**) in (a) and (b) denotes no statistically significant difference in the contouring time between the manual and edit-auto ROI. ROI region-of-interest, CT computed tomography, AI artificial intelligence, LN lymph node, L left, R right.

Auto-segmentation performance in phase one

Figure 3a, b show the box plots of the Dice score and 95th percentile Hausdorff distance (HD) of auto-segmentation versus manual contours (blue) and edited versus manual contours (orange). The corresponding relative differences between the Dice and HD are shown in Fig. 3c, d. Using Wilcoxon signed-rank tests, all the improvements in the Dice score after editing the auto-segmentation were found to be statistically significant. The improvements in HD were less conclusive, with the parotid (left and right), brachial plexus (left and right) and LN IVB (left and right) showing no statistically significant improvement in contour similarity after editing the auto-segmentation. It is interesting to note that the Dice score between the edited and manual contours falls below 0.80 for most ROIs, with the brachial plexus having the lowest average Dice score, below 0.60.

Fig. 3: Comparison of contour similarity of the manual contour versus auto-segmentation and manual contour versus edited ROI in phase one.
figure 3

a, b The Dice and HD of the comparisons, respectively. The manual versus auto-segmentation and manual versus edited ROI results are shown as blue and orange box plots, respectively. The difference between the Dice of the edited ROI versus manual and the auto-segmentation versus manual is shown in (c). Similarly, the corresponding difference in HD is shown in (d). (**) in (b) denotes no statistically significant difference between the HD of manual versus auto-segmentation and manual versus edit-auto ROI. The red lines in (c) and (d) represent zero editing.

Inter-observer variation in phase two

The inter-observer contouring variation among the manual and edited contours is shown in Fig. 4. Figure 4a, c show the Dice score and HD for the comparison between the manual contours and their consensus (blue) and between the edited contours and their consensus (orange). Using Mann–Whitney U tests, the Dice scores (HDs) for all ROIs show a statistically significant increase (decrease) for the edited contours. This shows that the inter-observer variation in contouring decreases when clinicians edit the auto-segmentation results, compared to manually contouring from scratch. The results using the pairwise comparison approach are shown in Supplementary Table 1, where a similar reduction in inter-observer variation is observed for all ROIs. The left and right brachial plexus show the greatest decrease in inter-observer variation in the edited contours. The manual and edited contours for the brachial plexus, LN IVA, IVB and parotid ROIs from all the clinicians are compiled and displayed in Fig. 5. Additional visual comparisons of all the ROIs are shown in Supplementary Fig. 16. The edited contours are clearly much more consistent among the clinicians, as shown in the right panel of the figure. The large improvement in the consistency of brachial plexus contouring is also evident in the figure. The results are also stratified according to the clinics, and the similarity metrics of all the ROIs are shown in Fig. 4b, d. The “lion” clinic benefitted the most in terms of increased agreement with the consensus contour after editing the auto-segmentation.

Fig. 4: Comparison of the contour similarity of manual versus consensus and edited ROI versus the consensus contours.
figure 4

a, c The Dice and HD results between the manual and edited ROI and the consensus contour, respectively, for each ROI. b, d Similar results for all the ROIs in each of the seven unique CT scans. The blue and orange boxplots represent similarity metrics of manual versus consensus and edited ROI versus consensus contours respectively. HD 95th percentile Hausdorff distance, ROI region-of-interest, CT computed tomography, LN lymph node, L left, R right.

Fig. 5: Visual comparison of the manual and edited contours in phase two.
figure 5

This figure shows a comparison of the six manual and edited contours on each CT scan. The red contours represent the manual and edited delineations made by each clinician, while the blue contours represent the auto-segmentation. The yellow contours represent the consensus contour. In the right panel, the blue and yellow contours are less visible due to substantial overlap with the red edited contours. CT computed tomography, LN lymph node.

Heterogeneity in performance across clinic and ROI

Figure 6 shows the details of the time saving and contour similarity from Figs. 2 to 4 as a function of the clinic and the ROI. The color bar represents the percentage reduction in time or improvement in similarity metrics, with red representing the benefits of edited contours and blue indicating a lack of improvement. The size of the marker corresponds to the contouring time and similarity metrics evaluated on the manual contours. The heterogeneity in the results across the clinics and ROIs is evident in this figure. The “lion” clinic benefitted the most in terms of contouring time (in both phases one and two) and agreement with the consensus contours, as indicated by the intense red markers. The edited contours from the “bovid” clinic benefitted the least in terms of inter-observer variation, with LN II, LN III and the submandibular glands showing poorer agreement with the consensus contour. Both the “bovid” and “duck” clinics benefitted the least in terms of time saving, with multiple ROIs in both phases showing an increased editing time compared to manual contouring time (shown as blue markers). In both phases, all clinics except “lion” took longer to edit than to manually contour the LN IVA and IVB ROIs (indicated by blue markers).

Fig. 6: Detailed information on the contouring time, Dice and HD improvements across different ROI and clinics.
figure 6

a, b The percentage reduction in the average contouring time (indicated by the color bar) for phases one and two, respectively. c, d The absolute percentage increase in the Dice score and the decrease in HD between the manual and edit-auto ROI (from Fig. 4) for phase two. The size of the markers in (a) and (b) represents the average manual contouring time, while the size of the markers in (c) and (d) represents the average Dice and HD between the manually contoured ROI and its consensus ROI. In all the panels, red indicates an improvement of edit-auto over the manual contouring process. HD Hausdorff distance, ROI region-of-interest, LN lymph node, L left, R right.

Discussion

In this study, we have shown for the first time the heterogeneity in clinical benefits of using a head-and-neck auto-segmentation software in a multi-institutional and inter-continental setting. The time saving, inter-observer contouring variation and auto-segmentation accuracy with respect to manual contours in each local institution were quantified through a two-phase study design. The results in Fig. 2e, f show that all clinics benefitted from a contouring time reduction using the auto-segmentation software. This observation is consistent across both phases, regardless of whether the clinicians contoured scans from within or outside their institution. The total time saving in both phases is 46%, which is in close agreement with other single-institution auto-segmentation studies24,25,26. However, a closer analysis of the time saving stratified by ROIs and clinics in Figs. 2, 6a, b reveals a very heterogeneous performance, which sheds light on the actual clinical benefits in a global setting. As depicted in Fig. 2a, b, the parotid glands and brachial plexus experienced the most significant time savings, exceeding 50%. However, not all ROIs benefitted to the same extent, since LN IA, IB, III, IVA and IVB showed no statistically significant differences in contouring time between manual contouring and editing the auto-segmentation. Figure 6a, b elucidate the reason behind this finding, revealing that the time savings for these ROIs vary considerably between the institutions, as evidenced by the mixture of blue and red markers in the horizontal rows. The results show that in the majority of the clinics, manually contouring these ROIs is faster than editing the auto-segmentations. This is especially true for LN IVA and IVB in both phases. The only distinct exception is the “lion” clinic, which shows time savings for all the ROIs with the use of auto-segmentation in both phases (except for LN IA in phase one). In fact, time savings of more than 75% can be observed for multiple ROIs, such as the brachial plexus, constrictor muscle and the various LN levels, in the “lion” clinic. In all, the differential time saving benefits across the ROIs and clinics can be attributed to the intricate interplay between structure complexity, the experience of the clinician and the contouring software in the institution. Interestingly, these results also show that the use of auto-segmentation in a clinic can be optimized by omitting the auto-segmentation of certain ROIs that take longer to edit than to contour manually. Ideally, this optimization should take place during commissioning of the software, and a proper assessment needs to be conducted since it is impossible to predict a priori which ROIs will benefit.

The reduction in inter-observer contouring variability in the edited contours is evident in the results of the phase two study, as shown in Fig. 4. All the ROIs in Fig. 4a, c show a statistically significant improvement in the Dice and HD between the edited contours and their consensus compared to the manual counterparts. The brachial plexus shows the most significant reduction in inter-observer variability, as illustrated in Fig. 5, where the edited contours have greater consistency compared to the manual contours. This consistency is crucial for accurate OAR dose calculations, especially in the treatment of lower neck nodes and gross tumour volumes (GTV). Additionally, all the LN contours show enhanced consistency, with LN IVB showing the highest average improvement. We also evaluated the inter-observer variation in combined LN groups, which are used in clinical practice. The results for three LN groups (I + II + III, II + III, II + III + IV) are shown in Supplementary Tables 2 and 3, where an average Dice score of close to 0.90 was achieved in the pairwise comparison. Even though Ye et al. showed a reduced inter-observer contouring variability between multiple institutions for head-and-neck OARs8, this is the first study to report similar findings in the contouring of the LN clinical target volumes. Figure 4b, d show that the “lion” clinic benefitted the most from the improved contouring consistency. A closer look at Fig. 6c, d shows that the brachial plexus, LN IA and IB improved the most in this clinic. Visually, apart from the “lion” clinic, the rest of the clinics have reasonably similar improvements in contouring consistency. The vast improvement for “lion” compared to the rest of the clinics could be due to differences in training, experience and cancer type prevalence within the clinic and country. Auto-segmentation, therefore, could play an important role in harmonizing ROI contouring between different institutions and lowering the learning curve for institutions with limited experience.

The auto-segmentation accuracy with respect to the manual contours in each institution is shown by the blue box plots in Fig. 3a. Following the editing process, all the ROIs show a statistically significant increase in the Dice score when compared to the manual contours. This indicates that editing is still required for all contours to achieve clinically acceptable quality in each clinic. However, using a Dice threshold of 0.80 (from TG13227) for acceptable contour agreement, half of the ROIs do not agree with the manual contours even after editing (orange box plots). The only contours with an average Dice score exceeding 0.80 after editing are the parotid glands, submandibular glands and LNs IB, II and III. This implies that the institutional practices for contouring the other ROIs generally diverge from the DAHANCA guidelines used in the training dataset, or it may indicate that the auto-segmentation did not perform optimally. These results reflect the realistic clinical situation, in which purchased auto-segmentation software is often trained on data from other centers. Interestingly, however, the results show that the edited contours end up closer to the auto-segmentations than to the original manual contours. This again highlights the value of auto-segmentation in harmonizing contouring practice globally, which plays an important role when performing pooled data analysis in a multi-institutional clinical trial28.

There are two limitations to our study. Firstly, several ROIs were omitted, including the optic apparatus, retro-pharyngeal constrictors, brainstem, spinal cord, esophagus, larynx and oral cavity. The consensus guideline recommends contouring at least 40 OARs in head-and-neck cancer patients29, and evaluating more OARs would certainly improve the clinical relevance of the study. However, the resources required from each clinic (contouring, timing and institutional approval) are important considerations. Consequently, this study primarily focused on LN level contouring due to the lack of multi-institutional auto-segmentation research in this specific area. Secondly, the auto-segmentation performance in the presence of dental filling or obturator artifacts was not assessed, and the results of this study apply only to pre-operative head-and-neck CT scans, as post-operative CT scans were excluded.

Methods

Data characteristics

This was a two-phase, multi-center, retrospective, randomized study involving seven institutions to evaluate the accuracy, efficiency and clinical benefit of a DL-based auto-segmentation tool in comparison to the conventional manual contouring process. This study was conducted retrospectively and did not involve human subjects. A total of seven clinics participated: 1) National Cancer Centre Singapore, Singapore, 2) Docrates Cancer Center, Finland, 3) Erasmus MC Cancer Institute, Rotterdam, Netherlands, 4) Kuopio University Hospital, Finland, 5) North Estonia Medical Centre, Estonia, 6) Oulu University Hospital, Finland, 7) Turku University Hospital, Finland. The identities of the clinics were anonymized in the results of this study; they were labelled as bovid, bug, cat, duck, koi, lion and shrimp. Ethics clearance was obtained at each institution based on the regulatory guidelines at the respective participating clinical site. Informed consent was waived as this study used only anonymized CT scans without associated clinical information. The inclusion criteria consisted of pre-operative head-and-neck cancer patients who underwent CT simulation scans, with no restriction on tumour (T) or nodal (N) classification. Post-operative cases were excluded from the study.

Auto-segmentation software and algorithm

The segmentation model was developed using a 3D encoder-decoder U-Net architecture, similar to the approach described in our previous CT segmentation work30. The model was trained on a dataset of approximately 500 CT scans from head-and-neck cancer patients, with manual segmentations performed by experienced radiation oncologists. OARs were delineated following multiple guideline sources, including DAHANCA, Brouwer et al. and Scoccianti et al.29,31,32, while LN levels were contoured according to the 2013 update of the international consensus guidelines by DAHANCA, EORTC, HKNPCSG, NCIC CTG, NCRI, RTOG and TROG33. Data augmentation techniques, including random rotations, translations and intensity transformations, were applied to improve model generalization. The encoder utilized a ResNet-type backbone pre-trained on ImageNet (Stanford Vision Lab, Stanford University, Stanford, CA, USA), while the decoder comprised multiple DenseNet blocks. The model was optimized using a combination of Dice loss and weighted cross-entropy loss. Training was performed for roughly 300 epochs using the Adam optimizer and a ReduceLROnPlateau scheduler. All preprocessing and training were implemented in Python (Python Software Foundation, Wilmington, DE, USA) using the PyTorch framework. The final model achieved median Dice similarity coefficients ranging from 0.81 to 0.91 for the different substructures on an independent test set. This segmentation model was implemented in the Contour+ auto-segmentation product by MVision AI, and the head-and-neck models had been deemed clinically acceptable in two independent single-institution studies7,34.
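
The model internals are proprietary, so as an illustration only, the following is a minimal PyTorch sketch of a combined Dice and weighted cross-entropy loss of the kind described above; the equal weighting of the two terms and all variable names are our assumptions, not the vendor's implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, class_weights=None, eps=1e-6):
    """Soft Dice loss plus weighted cross-entropy (illustrative sketch).

    logits: (N, C, D, H, W) raw network outputs.
    target: (N, D, H, W) integer class labels.
    class_weights: optional (C,) tensor weighting the cross-entropy term.
    """
    # Weighted cross-entropy averaged over all voxels.
    ce = F.cross_entropy(logits, target, weight=class_weights)

    # Soft Dice on softmax probabilities, averaged over classes.
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()  # -> (N, C, D, H, W)
    dims = (0, 2, 3, 4)                               # batch and spatial axes
    intersection = (probs * one_hot).sum(dims)
    denom = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)

    # Equal weighting of the two terms is an assumption, not from the paper.
    return ce + (1.0 - dice.mean())
```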

Multicentre auto-segmentation study design

In phase one of the study, each clinic first selected five anonymized head-and-neck planning CT scans according to the inclusion and exclusion criteria. A radiation oncologist from each clinic then manually contoured the 18 ROIs on all five scans. Sixteen of the 18 ROIs were the bilateral parotid glands, submandibular glands, brachial plexus and the LN levels IB, II, III, IVA and IVB. The remaining two ROIs were the constrictor muscle and LN IA. While clinical practice typically involves contouring LN levels without sub-level differentiation, the granularity employed in this study was important for two primary reasons: to enable sub-level-specific targeting35 and to identify specific LN regions requiring further development in auto-segmentation accuracy36. Other important head-and-neck OARs, such as the brainstem and optic apparatus, were not included in this study. The 18 ROIs were selected by the clinicians and MVision, who agreed that these regions continue to pose a significant challenge for accurate contouring by most available auto-segmentation software. The clinicians in each clinic were instructed to contour the 18 ROIs based on local institutional contouring practices. The scans were sent to MVision AI Oy (Helsinki, Finland) to generate the DL auto-segmentations. After a minimum of two weeks, these auto-segmentations were returned to the respective clinic for the same clinician to review and edit the contours, if required, to achieve clinical quality. A minimum time delay was necessary to remove any bias associated with memory of the manual contours while editing the auto-segmentations. The aim of phase one was to study the auto-segmentation quality and the local contouring time saving with auto-segmentation in each institution.

In phase two of the study, two scans were selected randomly from each clinic: one for manual contouring and the other for editing of auto-segmented contours. The pooled seven CT scans intended for manual contouring were then circulated among the seven clinics. The pooling and exchange of the scans was designed to reduce any institutional scan-specific bias when evaluating the clinical benefit of the auto-segmentation. Each clinic then manually contoured the 18 ROIs on the six CT scans from the other clinics. In addition, auto-segmentations were generated by MVision for the other set of seven scans, and the same clinician from each clinic edited the auto-segmentations on all six external CT scans to achieve clinical quality as defined in their own clinic. The random selection and exchange of scans were facilitated by an independent party (J. Niemelä and G. Bolard from MVision AI), who was not part of the participating hospitals, to ensure that no clinic received its own scan for manual contouring or editing. To further minimise the risk of potential recall bias, the auto-segmented CT scans presented to each clinic were different from those in the manually contoured dataset. While this decision may diminish the statistical power to discern differences in contouring time or inter-observer contouring variability during phase two, it preempts the potential confounding effects of recall bias. In summary, the rationale for a two-phase study design was to distinguish the clinical benefits assessed in a local context (phase one) from those in a global context (phase two). In particular, through a series of scan exchanges (illustrated in the sketch below), we aimed to quantify both the reduction in contouring time and in inter-observer variability when using auto-segmentation, allowing for generalizable results.
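
To make the circulation scheme concrete, the sketch below implements a hypothetical helper (phase_two_assignment, our naming) that mirrors the design described above, assuming each clinic's candidate scans are available as a list of identifiers: two scans are drawn at random per clinic, one for each pool, and every clinic receives all pooled scans except its own.

```python
import random

def phase_two_assignment(scans_by_clinic, seed=0):
    """Hypothetical sketch of the phase-two scan circulation: draw two
    scans per clinic (one per pool) and assign each clinic every pooled
    scan except its own."""
    rng = random.Random(seed)
    manual_pool, auto_pool = {}, {}
    for clinic, scans in scans_by_clinic.items():
        # Two distinct scans per clinic: one for each pool.
        manual_scan, auto_scan = rng.sample(scans, 2)
        manual_pool[clinic] = manual_scan
        auto_pool[clinic] = auto_scan
    return {
        clinic: {
            "manual": [s for c, s in manual_pool.items() if c != clinic],
            "edit": [s for c, s in auto_pool.items() if c != clinic],
        }
        for clinic in scans_by_clinic
    }

# Example: seven clinics with five scans each, labelled "<clinic>_<index>".
clinics = ["bovid", "bug", "cat", "duck", "koi", "lion", "shrimp"]
scans = {c: [f"{c}_{i}" for i in range(5)] for c in clinics}
assignment = phase_two_assignment(scans)
assert all(not s.startswith(c) for c, v in assignment.items()
           for s in v["manual"] + v["edit"])  # no clinic gets its own scan
```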

Contouring time saving in phase one and two

The time required to manually contour and to edit the auto-segmentation of each of the 18 ROIs was recorded in all the clinics in phases one and two. The time saving benefit of the auto-segmentation software was quantified as the difference between the time to edit the auto-segmentation and the time to segment the ROI manually from scratch. The time reduction of each ROI in all 35 CT scans (five CT scans per clinic) in phase one was tested for statistical significance using the Wilcoxon signed-rank test. The same statistical test was used to test for significant time reductions in the ROI contouring in phase two (seven CT scans circulated among seven clinics).
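
For reference, the paired test and the percentage time reduction can be computed as in the short sketch below; the paper specifies only the statistical test, so the use of scipy and the timing values shown are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-scan contouring times for one ROI (illustrative values, minutes).
manual_time = np.array([12.5, 9.8, 14.1, 11.0, 10.3])
edit_time = np.array([6.2, 7.5, 5.9, 8.1, 6.6])

# Percentage time reduction: positive values mean editing was faster.
pct_reduction = 100 * (manual_time - edit_time) / manual_time

# Two-sided Wilcoxon signed-rank test on the paired samples.
stat, p_value = wilcoxon(manual_time, edit_time)
print(f"mean reduction {pct_reduction.mean():.1f}%, p = {p_value:.4f}")
```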

Auto-segmentation performance in phase one

The contour similarity of the auto-segmentations versus manual contours and of the edited versus manual contours was compared for each ROI in phase one. The aim was to quantify the performance of the auto-segmentation software with respect to the manual contour in each clinic and to quantify the difference between the manual and edited contours. The metrics for quantifying contour similarity were the Dice score and the 95th percentile Hausdorff distance (referred to as HD in this study). The Dice score is a volumetric overlap metric ranging from zero to one, where zero and one represent no overlap and perfect overlap, respectively37,38. The 95th percentile HD39,40 is a spatial distance-based metric that is sensitive to boundary error39,41; it is therefore not correlated with the Dice score and plays a complementary role in this study37,42. It ranges from zero to infinity, with zero indicating perfect overlap of the contours. These two metrics were chosen to give a holistic assessment of contour agreement in terms of volumetric overlap and boundary error, and have traditionally been used to quantify the quality of semantic segmentation38,43,44. It is worth noting that there are other metrics, such as the added path length (APL) and surface Dice score, which were recently shown to be indicative of contouring time saving37,45. However, since the contouring time was explicitly measured in this study, we decided to focus on the two most commonly used metrics in segmentation tasks. The Wilcoxon signed-rank test was used to test for significant differences in the metrics between the auto-segmentations and the edited contours for each ROI.
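
Both metrics can be computed directly from binary masks, as in the sketch below. Note that this is a generic implementation rather than the one used in the study, and that HD95 conventions differ: here the two directed surface-distance sets are pooled before taking the percentile, whereas some implementations take the maximum of the two directed 95th percentiles.

```python
import numpy as np
from scipy import ndimage

def dice_score(a, b):
    """Volumetric overlap of two boolean masks (0 = none, 1 = perfect)."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th percentile Hausdorff distance (mm) between two boolean masks,
    pooling both directed surface-to-surface distance sets."""
    # Surface voxels: mask minus its erosion.
    a_surf = a ^ ndimage.binary_erosion(a)
    b_surf = b ^ ndimage.binary_erosion(b)
    # Distance from each voxel to the nearest surface voxel of the other
    # mask, scaled by the physical voxel spacing.
    dist_to_b = ndimage.distance_transform_edt(~b_surf, sampling=spacing)
    dist_to_a = ndimage.distance_transform_edt(~a_surf, sampling=spacing)
    distances = np.concatenate([dist_to_b[a_surf], dist_to_a[b_surf]])
    return np.percentile(distances, 95)
```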

Inter-observer variation in phase two

The inter-observer variation of an ROI measures the agreement between the contours delineated by different observers (in this case, clinicians) on a common CT scan. In phase two, the inter-observer variation between the seven clinics was compared for the manual and the edited contours of each ROI. It was quantified by two complementary approaches. The first involved calculating the contour similarity metrics (Dice and HD, as defined above) between each of the seven manual contours and their consensus contour for every ROI. The consensus contour was generated using the STAPLE (Simultaneous Truth And Performance Level Estimation) algorithm46,47. The same calculation was also performed for the seven edited contours and their consensus contour. The second approach was the calculation of pairwise contour similarity metrics among the seven manual contours, yielding a total of 21 comparisons. Similar calculations were also performed among the edited contours. Large inter-observer variation was indicated by poor contour similarity in both approaches. A threshold Dice score of 0.8027 was regarded as indicating good contour agreement in medical auto-segmentation tasks. There is currently no established HD threshold for acceptable contour agreement, so the HD served as a relative comparison in this study and informs the reader of the spatial deviation between the contours. The Mann–Whitney U test was used to test for significant differences in inter-observer variation between the manual and edited contours for each ROI.
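
The paper does not state which STAPLE implementation was used; SimpleITK ships one, and the sketch below shows how a consensus mask could be generated with it, assuming each observer's contour is available as a binary mask file. The 0.5 probability threshold is an illustrative choice.

```python
import SimpleITK as sitk

def staple_consensus(mask_paths, prob_threshold=0.5):
    """Consensus binary mask from several observers' masks via STAPLE."""
    masks = [sitk.ReadImage(p, sitk.sitkUInt8) for p in mask_paths]
    staple = sitk.STAPLEImageFilter()
    staple.SetForegroundValue(1)
    prob_map = staple.Execute(masks)  # voxel-wise consensus probability
    return sitk.BinaryThreshold(prob_map, lowerThreshold=prob_threshold,
                                upperThreshold=1.0, insideValue=1,
                                outsideValue=0)

# The per-observer Dice/HD values versus the consensus from the two arms can
# then be compared with scipy.stats.mannwhitneyu (two-sided), as in the text.
```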

Statistical analysis

All the statistical tests were conducted using the statsmodels v0.14.3 library in Python. In this study, a two-sided P-value < 0.05 was considered significant and Bonferroni correction was applied for multiple hypothesis testing.
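
As an illustration of the correction step, a Bonferroni adjustment over a family of P-values can be applied with statsmodels as follows; the P-values and their grouping into a single family are illustrative assumptions.

```python
from statsmodels.stats.multitest import multipletests

# Raw two-sided P-values from a family of per-ROI tests (illustrative).
p_values = [0.001, 0.020, 0.004, 0.300, 0.045]

# Bonferroni correction: reject the null where the adjusted P-value < 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                         method="bonferroni")
print(list(zip(p_adjusted.round(3), reject)))
```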