Background & Summary

Nasopharyngeal carcinoma (NPC) is the 23rd most common cancer, and radiotherapy (RT) is its primary treatment1,2. Delineating lymph node (LN) clinical target volumes (CTVs) on computed tomography (CT) is a crucial step in RT administration. RT planning requires accurate delineation of the neck lymphatic drainage regions and calculation of their radiation dose, because this maximizes the chance of a cure while minimizing radiation-related toxic effects. In the past, clinicians delineated LN CTVs manually, slice by slice, on CT images, a process that is laborious and prone to intra- and interobserver variability3,4. Additionally, online adaptive RT often requires manual delineation on repeated planning CT images, which increases the burden on radiation oncologists and prolongs patient waiting time5,6.

Recently, deep learning (DL)-based segmentation has shown the potential to alleviate the manual delineation burden7,8,9,10,11. Furthermore, the application of DL for automated delineation has been extended to LN levels in head and neck cancer (HNC). Van der Veen et al.12 examined the performance of a DL model in segmenting LN levels in HNC patients and showed that it improved time efficiency and reduced interobserver variability. Additionally, Cardenas et al.13 developed a DL model to auto-delineate commonly used LN level combinations for HNC patients robustly and consistently. Those studies investigated the accuracy of DL-based LN level segmentation in HNC, and the training of their models relied on LN levels delineated according to the 2013 updated consensus guideline14. However, those models and the contours they generate may not directly apply to NPC, whose nodal metastatic characteristics differ from those of other HNC15,16. Lin et al. proposed an NPC-specific LN CTV delineation map based on the distribution of 10,651 LNs17. They observed that certain boundaries of levels Ib, II, IVa and V required adjustments. For instance, when delineating level Ib, the submandibular gland should be spared. Moreover, two works proposed additional modifications for levels IIb and V18,19. Hence, to facilitate the automated RT workflow in NPC, it is necessary to develop a DL model specifically tailored for LN CTV delineation in this particular subtype, following the most recent clinical guidelines.

In addition, specific combinations of LN levels selected according to LN metastatic risk are needed in clinical practice20,21,22. For instance, in NPC patients with N0, it is recommended to delineate a combination of bilateral levels II, III and Va. In patients with N1, the recommended delineation includes an ipsilateral combination of levels II-V and a contralateral combination of levels II, III and Va23. However, most previous DL models for NPC could not automatically output selective target volumes based on LN involvement24, which does not align with the clinical requirements.

Although existing studies reported promising performance for automatic HNC organ-at-risk (OAR) and tumor segmentation12,13,25, several challenges need to be addressed before clinical applications can be developed. First, all of these studies developed their models and reported performance on privately collected datasets12,13, so it remains unclear which solution is most promising for this task, given the lack of broad and reproducible research. This differs markedly from many organ26,27 or tumor28 segmentation tasks, for which many public datasets are available for model development and evaluation. By contrast, few public HNC LN CTV datasets have been accessible for research until now29,30,31. Second, these public datasets mainly focus on survival and recurrence analysis using tumor-based radiomics, so no study based on them has investigated CTV delineation in radiation therapy. Third, the robustness of automatic segmentation models for patients at different stages or under different treatment strategies has not been well evaluated. As treatment proceeds, significant changes may occur in the location, size, and appearance of the tumor and LNs, such as shrinkage or disappearance23. It is therefore desirable to develop a detailed LN CTV delineation dataset to investigate its potential influence in radiation therapy.

In this work, we aim to build a multicenter NPC LN CTV segmentation benchmark for model development and evaluation. We retrospectively collected 440 CT images of 262 patients from four hospitals. To develop a specialized DL model for delineating LN CTVs in NPC, we modified the delineation of levels Ib, IIb, III, and the supraclavicular region based on previous studies17,18,19 and the 2013 consensus14. We invited three clinical expert boards to manually delineate a comprehensive set of six basic LN CTVs as the ground truth; an example from this dataset is shown in Fig. 1. To comprehensively investigate and evaluate the performance of recent works, the dataset includes patients with different stages, treatment strategies, imaging modalities, and raters. We split the dataset into five cohorts according to their sources: one internal cohort for training and validation, and four external cohorts used only for evaluation. We then trained and evaluated several state-of-the-art (SOTA) segmentation methods on the internal training and testing sets. Moreover, we investigated the robustness and generalization ability of these SOTA methods on the four external testing sets. These experiments provide benchmark results and potential directions for future research.

Fig. 1

An example of six annotated LN CTVs in a contrast-enhanced CT scan. The bottom table lists the annotated LN CTV categories. (a), (b), (c) and (d) show the axial, coronal, and sagittal views and the annotation rendering, respectively.

Methods

Data ethics

This retrospective study was approved by the Ethics Committee on Biomedical Research of the participating hospitals (approval number SCCHEC-02-2023-005): Sichuan Cancer Hospital (SCH), Sichuan Provincial People’s Hospital (SPH), Anhui Provincial Hospital (APH), and Southern Medical University (SMU). Because this was a retrospective study in which face regions were blurred and all clinical and personal metadata were removed, informed consent was waived. The committee approved the open publication of this dataset.

Data collection

A total of 262 NPC patients were retrospectively collected between January 2019 and June 2022. The internal cohort was collected from SCH and contained 300 CT images from 150 patients (Table 1). Each patient underwent an unenhanced CT (ueCT) scan followed by a contrast-enhanced CT (ceCT) scan in the same body position within several minutes. Regarding the imaging protocol, all NPC patients underwent RT and were immobilized in the supine position using a thermoplastic head, neck, and shoulder mask. CT examinations were then performed, covering a scanning range from 3 cm above the suprasellar cistern to 2 cm below the sternal end of the clavicle. The CT images from SCH were acquired using a Brilliance CT Big Bore system (Philips Healthcare, Best, the Netherlands) with the following scanning conditions: tube voltage of 120 kV, current ranging from 275 to 375 mA, slice thickness of 3.0 mm, and an image matrix of 512 × 512. The contrast agent iohexol was injected for the ceCT examination. Similarly, CT images from SPH, APH and SMU were acquired using a Somatom Definition AS 40 system (Siemens Healthcare, Forchheim, Germany) with the following conditions: tube voltage of 120-140 kV, current of 280-380 mAs, slice thickness of 3.0 mm, and an image matrix of 512 × 512. Iohexol was again used as the contrast agent. In addition, to enable future dosimetric studies, the conversion between electron density (ED) and Hounsfield units (HU) for the CT scanners of the four hospitals is presented in Fig. 2.

Table 1 Clinical and image characteristics of the LNCTVSeg dataset.
Fig. 2

Curves of electron density (ED) to Hounsfield unit (HU) conversion for the CT images of patients from the four hospitals. SCH, SPH, APH and SMU denote Sichuan Cancer Hospital, Sichuan Provincial People’s Hospital, Anhui Provincial Hospital and Southern Medical University, respectively.

LN CTVs delineation

During RT, different LN CTVs need to be delineated for different patients according to their clinical stages, so that treatment can be individualized. To obtain individual LN CTVs for each NPC patient, we designed and manually delineated six basic LN CTVs based on the clinical anatomical structures: left (L)_Ib, L_II+III+Va, L_IV+Vb+Vc, right (R)_Ib, R_II+III+Va and R_IV+Vb+Vc. The remaining seven LN CTVs can be generated by anatomically guided adaptive combination. For instance, combining L_II+III+Va with L_IV+Vb+Vc gives L_II-V. Hence, in addition to the six basic CTVs, the DL model can produce L_Ib-V, L_II-V, R_Ib-V, R_II-V, bilateral (B)_II+III+Va, B_Ib-V, and B_II-V. In total, 13 LN CTVs can be produced automatically by the DL model. Table 2 shows the clinical and anatomical significance of the 13 LN CTVs and provides the anatomical guidelines for adaptive combination. Based on the above definitions, all radiation oncologists used ITK-SNAP32 to delineate the CTVs manually.
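The adaptive combination described above can be sketched as a union of basic label masks. This is a minimal illustration, not the authors' implementation: the integer label indices are hypothetical (the actual labelling index is given in Table 2), and the composition of each combined CTV other than L_II-V is inferred from its name.

```python
import numpy as np

# Hypothetical label indices for the six basic LN CTVs; the dataset's
# actual indices are defined in Table 2 and may differ.
BASIC_LABELS = {
    "L_Ib": 1,
    "L_II+III+Va": 2,
    "L_IV+Vb+Vc": 3,
    "R_Ib": 4,
    "R_II+III+Va": 5,
    "R_IV+Vb+Vc": 6,
}

# Each combined CTV is treated as the union of basic CTVs. Only the
# L_II-V rule is stated in the text; the rest are inferred from names.
COMPOSITES = {
    "L_II-V": ["L_II+III+Va", "L_IV+Vb+Vc"],
    "L_Ib-V": ["L_Ib", "L_II+III+Va", "L_IV+Vb+Vc"],
    "R_II-V": ["R_II+III+Va", "R_IV+Vb+Vc"],
    "R_Ib-V": ["R_Ib", "R_II+III+Va", "R_IV+Vb+Vc"],
    "B_II+III+Va": ["L_II+III+Va", "R_II+III+Va"],
    "B_II-V": ["L_II+III+Va", "L_IV+Vb+Vc", "R_II+III+Va", "R_IV+Vb+Vc"],
    "B_Ib-V": list(BASIC_LABELS),
}


def ctv_mask(label_map, name):
    """Return a boolean mask for a basic or combined LN CTV."""
    parts = COMPOSITES.get(name, [name])
    indices = [BASIC_LABELS[part] for part in parts]
    return np.isin(label_map, indices)


# Toy one-dimensional "label map" containing background and all six labels.
toy_labels = np.array([0, 1, 2, 3, 4, 5, 6])
left_ii_v = ctv_mask(toy_labels, "L_II-V")
```

Because the combined volumes are derived from the six basic masks, a model only needs to predict the basic labels; the other seven CTVs follow from post-processing.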

Table 2 Clinical significance of the 13 lymph node clinical target volumes.

Clinical delineation guidelines

For level IIb, we adapted the ranges according to previous work19. Specifically, the cranial boundary of level IIb is set at the cranial edge of C1 (the first cervical vertebra); the medial boundary is defined by the lateral edges of the rectus capitis lateralis, obliquus capitis superior, obliquus capitis inferior, and levator scapulae muscles; and the posterior range extends to the C1 and C2 levels, while the gap between the sternocleidomastoid and splenius capitis muscles can be removed (Fig. 3A, red arrow). The caudal, lateral, and anterior boundaries remain consistent with the 2013 consensus guidelines. Regarding level Ib, we made modifications based on previous studies17,18. First, we spared the submandibular gland (SMG) during delineation (Fig. 3D, yellow arrow). Second, based on the distance between all positive LNs and anatomical landmarks in level Ib, the Medial Mandibular sub-level was omitted (Fig. 3D, green arrow). For level III, we primarily modified the anterior boundary following previous work17. The anterior edge of level III is defined as the region close to the anterior edge of the carotid sheath (Fig. 3E, purple arrow), and we spared the space between the anterior of the sternocleidomastoid muscle and the anterior neck ribbon muscles (Fig. 3E, dark blue arrow). Regarding level V, we used the cervical fascia anatomy-oriented nodal delineation. The lateral border of level V is defined by the scapular hyoid muscle (Fig. 3G, light blue arrow), and the medial border is close to the anterior border of the carotid sheath. To ensure precise contours, S.C. Zhang (MD, with more than twenty years of experience in head and neck radiation therapy), Y. Zhao (MD, with more than ten years of experience in head and neck radiation therapy) and W.J. Liao (MD, with more than ten years of experience in head and neck radiation therapy), who were the main authors of the three clinical guideline papers17,18,19 on LN level delineation in NPC, participated in the ground truth delineation of LN CTVs in this study. These experts first manually delineated several typical clinical cases to provide annotation references for the expert boards and thereby minimize variations in annotation style. These modifications and references were applied to enhance the accuracy and clinical relevance of the DL model.

Fig. 3

Examples of the experts’ annotations at some key anatomical levels. The red arrow in A indicates the tight muscle gap between the sternocleidomastoid and splenius capitis muscles, which can be excluded when delineating IIb. The yellow arrow and green arrow in D indicate the submandibular gland (SMG) and the Medial Mandibular sub-level, respectively. These two areas can be removed when delineating Ib. The purple arrow in E indicates the carotid sheath. The dark blue arrow in E indicates the space between the anterior of the sternocleidomastoid muscle and the anterior neck ribbon muscles, which can be spared when delineating III. The light blue arrow in G indicates the scapular hyoid muscle, which is the lateral border of level V.

Data split

We randomly split the internal cohort into 120 patients for training and 30 patients for internal testing (validation cohort). The external cohorts with different modalities comprised two external testing cohorts: 1) Validation cohort 1, collected from SPH, comprised 60 ceCT images; 2) Validation cohort 2, collected from APH, included 32 ueCT images. The external cohorts with different treatment strategies consisted of two additional external testing cohorts (Validation cohorts 3 and 4), which included CT images obtained before and after treatment: 1) Validation cohort 3, collected from SMU, comprised 12 patients, each with induction chemotherapy (IC)-naïve and post-IC ceCT images (referred to as CTpreIC and CTpostIC, respectively); 2) Validation cohort 4, collected from SCH, included eight patients who underwent three repeated ceCT scans during adaptive RT. These scans were performed after 0, 15-20, and 25-30 fractions and labeled as CTRA, CTRB, and CTRC, respectively (the detailed scan intervals are presented in Table 3). Table 1 presents the detailed clinical and imaging characteristics of each cohort in the LNCTVSeg dataset. The inclusion of CT images from various centers and manufacturers enhances the heterogeneity and representativeness of our data. Meanwhile, the clinical and image characteristics of these cohorts are distributed in a similar space. To protect the patients’ private information, we blurred the face region and removed all clinical and personal metadata33.
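A patient-level random split such as the 120/30 split of the internal cohort can be sketched as follows. This is only an illustration under stated assumptions: the patient identifiers, the helper name `split_patients`, and the seed are hypothetical, and the split actually released with the dataset is the one encoded in its folder structure. Splitting by patient rather than by image keeps the ueCT and ceCT scans of one patient in the same subset.

```python
import random


def split_patients(patient_ids, n_train=120, seed=0):
    """Randomly split patient IDs into training and testing subsets."""
    ids = sorted(patient_ids)          # fixed starting order for reproducibility
    rng = random.Random(seed)          # local generator; seed is an assumption
    rng.shuffle(ids)
    return sorted(ids[:n_train]), sorted(ids[n_train:])


# Hypothetical identifiers for the 150 internal-cohort patients.
patient_ids = [f"{i:04d}" for i in range(1, 151)]
train_ids, test_ids = split_patients(patient_ids)
```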

Table 3 The scan interval of different CT images.

Data Records

The dataset, with private information removed, is available on Figshare34. All files are stored in the NIfTI-1 format (32-bit floating point). Each cohort is stored as a folder named “Internal_Cohort” or “Testing_Cohort_X” (where “X” ranges from 1 to 4). Raw CT images in the training or testing sets are stored in the “imagesTr” or “imagesTs” folder of each cohort, respectively, with the name “lnctvseg_cx_xxxx_yyyy.nii.gz”, where “cx” denotes the center number (x is a digit from 1 to 4), “xxxx” is the index of each patient (ranging from 1 to the total number of patients in “cx”), and “yyyy” is set to “0000” (non-contrast CT) or “0001” (contrast-enhanced CT). The corresponding labels are stored in the “labelsTr” and “labelsTs” folders, respectively, with the name “lnctvseg_cx_xxxx.nii.gz”. The clinical baseline and the CTV labelling index are presented in Tables 1 and 2.
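The naming convention above can be parsed with a short helper, sketched here for convenience. It assumes that “cx” appears literally as the letter c followed by one digit and that the patient index is a run of digits; the example file name and the helper names are hypothetical and should be checked against the files actually downloaded.

```python
import re
from typing import NamedTuple, Optional

# Pattern for image files: lnctvseg_cx_xxxx_yyyy.nii.gz
IMAGE_PATTERN = re.compile(
    r"^lnctvseg_c(?P<center>[1-4])_(?P<patient>\d+)_(?P<modality>0000|0001)\.nii\.gz$"
)

# Modality suffixes as described in the text.
MODALITIES = {"0000": "ueCT", "0001": "ceCT"}


class ImageInfo(NamedTuple):
    center: int
    patient: int
    modality: str


def parse_image_name(filename: str) -> Optional[ImageInfo]:
    """Parse an image file name; return None if it does not match."""
    match = IMAGE_PATTERN.match(filename)
    if match is None:
        return None
    return ImageInfo(
        center=int(match["center"]),
        patient=int(match["patient"]),
        modality=MODALITIES[match["modality"]],
    )


def label_name_for(filename: str) -> Optional[str]:
    """Derive the label file name by dropping the modality suffix."""
    match = IMAGE_PATTERN.match(filename)
    if match is None:
        return None
    return f"lnctvseg_c{match['center']}_{match['patient']}.nii.gz"


# Hypothetical example file name following the stated convention.
info = parse_image_name("lnctvseg_c1_0007_0001.nii.gz")
```

Under this convention, the unenhanced and contrast-enhanced images of one patient map to the same label file name.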

Technical Validation

Experiment settings

To benchmark the LN CTV segmentation task, we used nnUNet as a baseline and also investigated several recent transformer-based architectures: 1) 3D UXNET35, a model that combines large-kernel convolutions with a hierarchical transformer design for medical image segmentation, which enlarges the global receptive field, reduces the number of model parameters, and performs very well on several medical datasets; 2) REPUXNET36, a pure 3D CNN model that applies element-wise scaling to large kernel weights to improve learning convergence and effectively exploit a large receptive field for volumetric segmentation, which achieved SOTA performance on abdominal organ and lesion segmentation; 3) SwinUNETR37, a pure transformer model that merges the merits of Swin Transformers and UNet for medical image segmentation and shows encouraging performance on several medical segmentation tasks. For a fair comparison, we used their official implementations with the provided pre-trained models and trained them with the same settings. Considering the large variations in CT imaging protocols across centers, including different vendors, acquisition parameters, and the presence or absence of contrast enhancement, we extended the baseline nnUNet with mixup-based data augmentation38,39 to improve its generalization ability on heterogeneous images.
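For readers unfamiliar with mixup, the general idea can be sketched as a convex combination of two training samples and their one-hot labels. This is a generic illustration rather than the configuration used in our experiments: the Beta parameter `alpha`, the helper names, and the array shapes are assumptions, and the way mixup is wired into the nnUNet pipeline is not shown.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Convert an integer label volume (D, H, W) to one-hot (C, D, H, W)."""
    return np.moveaxis(np.eye(num_classes)[labels], -1, 0)


def mixup_pair(image_a, target_a, image_b, target_b, alpha=0.4, rng=None):
    """Mix two samples with a weight drawn from Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    image = lam * image_a + (1.0 - lam) * image_b
    target = lam * target_a + (1.0 - lam) * target_b  # soft labels
    return image, target, lam


# Toy example: two random 4x4x4 volumes with 7 classes
# (background plus six basic LN CTVs).
rng = np.random.default_rng(0)
num_classes = 7
image_a = rng.normal(size=(1, 4, 4, 4))
image_b = rng.normal(size=(1, 4, 4, 4))
target_a = one_hot(rng.integers(0, num_classes, size=(4, 4, 4)), num_classes)
target_b = one_hot(rng.integers(0, num_classes, size=(4, 4, 4)), num_classes)
mixed_image, mixed_target, lam = mixup_pair(
    image_a, target_a, image_b, target_b, rng=rng
)
```

The mixed target is a soft label map whose class weights still sum to one at every voxel, so it can be used with losses that accept probabilistic targets.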

Evaluation metrics

Three metrics were used to evaluate the segmentations: the Dice similarity coefficient (DSC), the 95% Hausdorff distance (HD95), and the normalized surface Dice (NSD) with a tolerance distance of 2 mm40. The DSC measures the volumetric overlap between two segmentations, and the HD95 (mm) measures the boundary distance between two segmentations. Generally, a superior model should yield a higher DSC (maximum of 1) and a lower HD95 (minimum of 0). The NSD quantifies the overlap of the surfaces of two segmentations, with a perfect match scoring 1. If a target is missed, the DSC, HD95 and NSD are set to 0, 100 and 0, respectively. We employed the DSC, HD95 and NSD implementations of MedPy.
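The DSC and the missed-target convention can be illustrated with a small NumPy sketch. It is not the MedPy implementation used for the reported numbers. Two points are assumptions made here for illustration: a target counts as "missed" when the prediction is empty while the reference is not, and the DSC of two empty masks is reported as NaN.

```python
import numpy as np

# Fallback values for a missed target, as stated in the text.
MISSED = {"DSC": 0.0, "HD95": 100.0, "NSD": 0.0}


def dice_score(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    total = int(pred.sum()) + int(ref.sum())
    if total == 0:
        return float("nan")  # both masks empty: undefined (assumption)
    return 2.0 * int(np.logical_and(pred, ref).sum()) / total


def is_missed(pred, ref):
    """Assumed definition: reference present but nothing predicted."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    return bool(ref.any()) and not bool(pred.any())


def score_case(pred, ref):
    """Return fallback scores for a missed target, else the DSC only.

    HD95 and NSD need a surface-distance implementation and are
    deliberately left out of this sketch.
    """
    if is_missed(pred, ref):
        return dict(MISSED)
    return {"DSC": dice_score(pred, ref)}


pred = np.array([[1, 1], [0, 0]])
ref = np.array([[1, 0], [0, 0]])
example = score_case(pred, ref)            # overlap 1 voxel, sizes 2 and 1
missed = score_case(np.zeros((2, 2)), ref)  # empty prediction
```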

Intraobserver variability

In this study, twelve CT images were randomly selected from SCH and re-contoured after two months by the same expert who had taken part in ground truth generation. The DSC was calculated to assess intraobserver variability between the original and re-contoured delineations for the various LN CTVs. We also measured the DSC between the original and the DL-generated contours (nnUNet with mixup). The results are listed in Table 4. The DSC between the original and the model-predicted contours was higher than that between the original and re-contoured LN CTVs, although the difference was not statistically significant. Moreover, the variance in the observed DSC for LN CTVs was lower for DL auto-segmentation than for re-contouring by the expert, except for bilateral levels Ib.

Table 4 Quality of the DL-generated segmentation compared to intraobserver (from SCH board).

Interobserver variability

We randomly selected 30 patients from SCH and invited two expert groups (SPH and APH) to independently segment the LN CTVs for these patients. Together with the initial segmentations by the SCH group, we obtained LN CTVs delineated by three expert groups (SCH, SPH, and APH). We calculated the DSC to assess interobserver variability, as shown in Table 5. For example, the DSC among the three groups ranged from approximately 80.0% to 89% for the four basic LN CTVs other than the bilateral levels Ib. This indicates that, although multiple expert groups delineated the target areas, the consistency among groups was relatively good when all experts strictly followed our delineation principles and requirements.

Table 5 Interobserver variability across three boards (SCH, SPH, APH) in terms of DSC.

Experimental results

Table 6 shows the quantitative segmentation results of these methods in both the internal and external cohorts. The baseline nnUNet and the modified nnUNet perform better than the other three transformer-based models in terms of mean DSC (mDSC) on all testing cohorts. The mixup modification also improved the baseline performance by a slight margin. The NSD results show a similar trend to the DSC results: the CNN-based segmentation models consistently outperform the transformer-based models. Fig. 4 shows some segmentation results.

Table 6 Segmentation results across several SOTA methods in the internal and external testing cohorts.
Fig. 4

Visual examples of the nnUNet (mixup) predictions in the internal testing cohort (A), external testing cohort 1 (B), and external testing cohort 2 (C). For each of these cohorts, we ranked all patients by DSC and selected the patient with the median score. Yellow lines denote the expert-delineated contours, and green lines denote the model-predicted ones.

Limitation

Unlike previous studies that included level Ia segmentation13,25, our study did not, which might limit the broader applicability of the dataset and the model. This is because all patients in this dataset have NPC, one of the main subtypes of HNC. In NPC, LN metastases are rarely observed in level Ia, and thus this area is not commonly included in NPC RT.

Usage Notes

The dataset is publicly available on Figshare under the Creative Commons Attribution 4.0 license (CC BY 4.0). All NIfTI files can be visualized with ITK-SNAP32 and other packages.