Introduction

Thyroid cartilage invasion serves as a crucial determinant in TNM staging of laryngeal squamous cell carcinoma (LSCC), with medial cortex invasion classified as stage T3 and extralaryngeal tissue penetration as stage T4. Previous studies have reported thyroid cartilage invasion in 12%–43% of laryngeal carcinoma patients, with missed diagnoses possibly resulting in staging errors in 40%–50% of cases1,2. Clinically, thyroid cartilage invasion significantly impacts the choice of treatment strategies. For cases with thyroid cartilage invasion, radiotherapy alone is often insufficient, making concurrent chemoradiotherapy a common option for organ-preserving treatment. However, when the tumor extensively involves the cartilage or invades extralaryngeal tissues, total laryngectomy is necessary to ensure complete resection3. Additionally, thyroid cartilage invasion closely correlates with radiotherapy response and tumor recurrence4,5. Hence, it is of great significance to accurately evaluate thyroid cartilage invasion for making treatment decisions and assessing prognosis in patients with LSCC.

CT is frequently used to predict thyroid cartilage invasion, with a sensitivity of approximately 80%6. However, its accuracy and specificity are unsatisfactory, which may be attributed to the asymmetric ossification of normal thyroid cartilage and to the similar CT values of tumor tissue and non-ossified cartilage7,8. Although MRI has a higher sensitivity for diagnosing thyroid cartilage invasion, cartilage edema or inflammatory changes may resemble tumor infiltration9. Dual-energy CT, particularly iodine overlay (IO) images, can help differentiate iodine-enhanced tumors from non-ossified cartilage10,11. However, IO images have limitations in identifying lesions in previously ossified cartilage and may overestimate iodine distribution12. Currently, robust and precise methods for identifying thyroid cartilage invasion in LSCC are lacking.

Recently, artificial intelligence, particularly machine learning and deep learning (DL), has seen extensive use in medical imaging. Radiomics, a key application of machine learning, extracts high-dimensional features and offers quantitative data for tumor analysis. DL, on the other hand, automatically captures and learns discriminative features in an end-to-end manner, which can improve prediction accuracy. Radiomics and DL have been reported to be valuable in predicting tumor stage13, early recurrence14, treatment response15, and prognosis16 in laryngeal carcinoma. To our knowledge, few studies have investigated radiomics and DL techniques in predicting thyroid cartilage invasion. Given the complexity of laryngeal tumors, we also developed a 3D DL model, although such models usually require an extensive training dataset to achieve high precision17,18.

This study aimed to construct a radiomics model, a 2D DL model, and a 3D DL model using venous-phase CT images to evaluate thyroid cartilage invasion, and to compare these models with two radiologists. A nomogram incorporating the optimal predictive model and clinical risk factors was also developed. We further explored the prognostic value of the optimal predictive model for thyroid cartilage invasion.

Materials and methods

Patients

This retrospective multicenter study analyzed data from 418 patients with pathologically confirmed LSCC. The study utilized data from two institutions: The First Affiliated Hospital of Chongqing Medical University (center 1), which contributed 357 patients between March 1, 2011, and January 1, 2021, and Tianjin First Central Hospital, School of Medicine, Nankai University (center 2), which provided data from 61 patients between March 1, 2017, and January 1, 2021. Patients were categorized into a training cohort and an internal validation cohort at a 7:3 ratio for center 1, while data from center 2 was used for external validation (Fig. 1). All patients were enrolled via the following inclusion criteria: (1) pathologically confirmed diagnosis of LSCC, (2) comprehensive clinical and imaging data, and (3) preoperative contrast-enhanced CT (CECT) performed within 2 weeks before surgery. The exclusion criteria included: (1) tumor recurrence or history of other malignant tumors, (2) poor quality of CT images, and (3) preoperative anti-tumor therapy.

Fig. 1

Flowchart of patient recruitment. LSCC, laryngeal squamous cell carcinoma; center 1, The First Affiliated Hospital of Chongqing Medical University; center 2, Tianjin First Central Hospital, School of Medicine, Nankai University.

Baseline clinical data included age, sex, smoking status, alcohol consumption, tumor location, CT-reported anterior commissure (AC) invasion, histological grade, clinical T stage, clinical N stage, and overall clinical stage. All patients in this study underwent surgical resection, and the postoperative pathological findings were considered the gold standard for determining the presence of thyroid cartilage invasion. Representative diagnostic examples are shown in Fig. 2.

Fig. 2

Thyroid cartilage invasion was evaluated on full sections of LSCC stained with H&E. Representative samples classified as with thyroid cartilage invasion (A) and without thyroid cartilage invasion (B). The red dashed line indicates the margin of the thyroid cartilage.

Endpoints and follow-up

A comprehensive postoperative follow-up plan was implemented to track patient outcomes. Follow-up visits were scheduled every 3 months in the first year, every 6 months from the second to the fifth year, and annually thereafter. Follow-up continued for at least 3 years or until death. For patients lost to follow-up, survival time was calculated from the date of surgery to the last recorded follow-up. The primary endpoint was disease-free survival (DFS), defined as the time from surgery to disease progression or death from any cause; patients without an event were censored at the last follow-up.

CT imaging acquisition and tumor segmentation

Each patient underwent CECT scanning using the multislice spiral CT scanners from two centers. After the plain scan, the contrast agent was administered intravenously at a flow rate of 2.5–4.0 mL/s to obtain contrast-enhanced images. The detailed CT acquisition parameters for the two centers are provided in Table S1.

The venous-phase CT images, most commonly used in laryngeal cancer research19,20, were chosen for further analysis. Tumor regions of interest (ROIs) were manually delineated slice-by-slice on each transverse section by radiologist A (C.X.W.). A 3D volume of interest (VOI) was formed by merging the corresponding ROIs. To assess the inter- and intraobserver reproducibility of radiomics features, 30 patients were randomly selected and re-segmented by both radiologist A and radiologist B (J.H.) after one month. Figure 3 shows the overall workflow of this study.

Fig. 3

The pipeline of this retrospective study. An SVM classifier was applied to construct the radiomics model. The 2D DL model was built with ResNet-50 using cropped 2D ROIs, and Grad-CAM was used for model visualization. 3D-ShuffleNet V1 was adopted as the backbone for the 3D DL model, and its input was the bounding cube of the 2D ROIs. The performance of two readers was also evaluated for comparison with that of the predictive models. Finally, the optimal predictive model was identified and used to assess patient prognosis.

Radiomics feature extraction/selection and radiomics model building

Features were extracted from the VOIs using PyRadiomics (version 2.2.0, https://github.com/Radiomics/pyradiomics). Image processing and feature extraction were conducted in accordance with the Image Biomarker Standardization Initiative (IBSI) standards. Prior to extraction, the data were preprocessed to ensure consistency across the dataset: VOIs were resampled to a voxel spacing of 1 × 1 × 1 mm³, and voxel intensities were discretized using a bin width of 25 Hounsfield units (HU). A total of 1,106 radiomics features were initially extracted, encompassing four categories: first-order statistics, shape-based features, texture features, and wavelet-transformed features. The reliability of these features was assessed through intraclass correlation coefficient (ICC) analysis. To refine the feature set, an initial screening was conducted using the Spearman correlation coefficient to identify and remove highly correlated features. The remaining features were then normalized by z-score and subjected to a more precise selection using the least absolute shrinkage and selection operator (LASSO) method, which retained the features with non-zero coefficients for modeling. Finally, a radiomics model was constructed using a support vector machine (SVM) classifier, chosen for its effectiveness in binary classification tasks.
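The correlation screening and z-score normalization steps described above can be illustrated with a minimal sketch in plain Python. The correlation threshold of 0.9, the greedy keep-first strategy, the feature names, and the toy values are all assumptions made for illustration; the text does not specify them. LASSO selection and SVM fitting are omitted and would normally be delegated to a statistics library.

```python
from statistics import mean, pstdev

def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |rho| with every already-kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(spearman(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def zscore(column):
    """Standardize a feature to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical toy feature table: f2 is a monotone function of f1, so it is redundant.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],
    "f3": [5, 1, 4, 2, 3],
}
kept = drop_correlated(features)
normalized = {name: zscore(features[name]) for name in kept}
print(kept)  # ['f1', 'f3']
```

In practice the normalization parameters would be estimated on the training cohort only and then applied unchanged to the validation cohorts, so that no information leaks from validation data into feature selection.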

2D DL model construction and visualization

In this study, ResNet-50, pretrained on ImageNet, was used for the backbone network of the 2D DL model (see Supplementary Material). The 2D ROIs were semi-automatically cropped by a rectangular box in the largest tumor cross-section. Before model training, all cropped ROIs were resized to a standardized resolution of 224 × 224 pixels using bilinear interpolation. Data augmentation strategies, including random horizontal and vertical flipping and random cropping, were applied to increase training robustness. The 2D ROIs were then input into ResNet-50 for a binary classification task, with the output aimed at predicting the risk of thyroid cartilage invasion. Model training encompassed forward computation and backpropagation, and categorical cross-entropy served as the loss function. In terms of training parameters, we incorporated a stochastic gradient descent optimizer with a base learning rate of 0.01, a weight decay of 0.01, a batch size of 64, and 50 training epochs. L2 regularization and an early stopping strategy were applied to prevent overfitting. In addition, we utilized gradient information from the final convolutional layer to generate gradient-weighted class activation mapping (Grad-CAM), which provides visual explanations for the 2D DL model.
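For reference, the Grad-CAM map mentioned above follows the standard formulation of the method, written here in LaTeX. The symbols are notation introduced for illustration and do not appear in the original text: $A^{k}$ is the $k$-th feature map of the final convolutional layer, $y^{c}$ is the pre-softmax score for class $c$ (here, thyroid cartilage invasion), and $Z$ is the number of spatial positions in a feature map.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\text{Grad-CAM}}^{c} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The weights $\alpha_k^{c}$ average the gradients over each feature map, and the ReLU keeps only the regions that contribute positively to the predicted class, which is why the resulting heatmap highlights image areas supporting a prediction of invasion.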

3D DL model development and validation

Considering that 2D ROIs may not fully capture the structural intricacies of laryngeal tumors, we developed a 3D DL model for a more exhaustive analysis (see Supplementary Material). For the 3D DL model, a rectangular bounding box was used to mark the bounding cube of the 2D ROIs, and linear interpolation was used to standardize all cubic ROIs to dimensions of 96 × 96 × 96. To improve data diversity and model generalizability, data augmentation strategies such as random flipping along the X, Y, and Z axes were applied. The preprocessed cubic tumor images were subsequently fed into the 3D-ShuffleNet network, which applies 3D convolutional layers, batch normalization, ReLU activation, and channel shuffling. The 3D DL model output the probability of thyroid cartilage invasion for each patient. The 3D network was trained with an adaptive moment estimation (Adam) optimizer, with a learning rate of 0.02, a batch size of 16, and 100 training epochs.
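As a rough illustration of the bounding-cube step, the sketch below computes the smallest axis-aligned box enclosing a set of ROI voxels and the per-axis scale factors needed to reach the 96 × 96 × 96 input size. The function names, the (z, y, x) coordinate convention, and the toy coordinates are assumptions for illustration only; the actual cropping and interpolation code is not described in the text.

```python
TARGET = 96  # edge length of the standardized cube reported in the text

def bounding_cube(voxels):
    """Smallest axis-aligned box enclosing all (z, y, x) ROI voxel coordinates.

    Returns ((z_min, z_max), (y_min, y_max), (x_min, x_max)), bounds inclusive.
    """
    zs, ys, xs = zip(*voxels)
    return tuple((min(axis), max(axis)) for axis in (zs, ys, xs))

def zoom_factors(box, target=TARGET):
    """Per-axis scale factors that map the cropped box onto target^3 voxels."""
    return tuple(target / (hi - lo + 1) for lo, hi in box)

# Hypothetical ROI voxels taken from several slices of one tumor.
roi = [(10, 40, 52), (10, 45, 60), (11, 38, 55), (13, 47, 75)]
box = bounding_cube(roi)
print(box)                # ((10, 13), (38, 47), (52, 75))
print(zoom_factors(box))  # (24.0, 9.6, 4.0)
```

The unequal factors show that resizing an anisotropic crop to a cube distorts its aspect ratio, which is one design trade-off of standardizing all inputs to a fixed cubic size.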

Radiologists’ visual evaluations

Two experienced radiologists (P.J. and L.Q.J.) independently evaluated patients from the two validation cohorts. They reviewed the multiphase CT images of each patient and rated thyroid cartilage invasion on a five-level scale: 1, definitely negative; 2, probably negative; 3, erosion; 4, lysis; and 5, extralaryngeal spread through the cartilage12,21. More detailed information is provided in Supplementary Material. Throughout this procedure, both radiologists were completely blinded to patients’ clinical data. Agreement between the two radiologists’ ratings was estimated using Cohen’s kappa. Details of the radiologists’ ratings are provided in the Supplementary Excel File.
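The agreement statistic named above can be sketched as follows. This is the unweighted form of Cohen's kappa, which is an assumption here since the text does not say whether a weighted variant was used; the ratings are hypothetical values on the five-level scale.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(ratings_a)
    # Observed agreement: fraction of cases with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings from two readers for ten patients.
reader_1 = [1, 1, 2, 3, 4, 5, 2, 1, 3, 4]
reader_2 = [1, 2, 2, 3, 4, 5, 2, 1, 4, 4]
kappa = cohens_kappa(reader_1, reader_2)
print(round(kappa, 3))  # 0.747
```

Here the readers agree on 8 of 10 cases (observed agreement 0.80) while chance agreement is 0.21, giving kappa = 0.59 / 0.79.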

Nomogram for thyroid cartilage invasion prediction

In the training cohort, the candidate clinical factors were tested via univariate analysis to identify key variables (p < 0.05). Meanwhile, the performances of the radiomics model, 2D DL model, and 3D DL model were compared to determine the optimal predictive model. The optimal predictive model was then combined with the key clinical variables to develop the nomogram. Furthermore, the performance of the nomogram was independently tested in the two validation cohorts.

Statistical analysis

The data were analyzed using SPSS software (version 26.0) and R software (version 4.1.0). Differences in clinical features were analyzed by the Mann–Whitney U-test (continuous variables) and the chi-square test (categorical variables). A p value < 0.05 was considered statistically significant. The area under the receiver operating characteristic (ROC) curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. The calibration curve and the Hosmer-Lemeshow (HL) test were used to assess model calibration. Decision curve analysis (DCA) was used to appraise the model’s clinical utility.
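The performance measures listed above can be computed from binary labels and predicted scores as in the following sketch. The data and the 0.5 decision threshold are hypothetical, and the AUC is obtained through its rank-based (Mann–Whitney) interpretation rather than by any particular software routine used in the study.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and accuracy (1 = invasion, 0 = no invasion)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }

def auc(y_true, scores):
    """AUC as the probability that a positive case scores higher than a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model outputs for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = diagnostic_metrics(y_true, y_pred)
print(metrics["sensitivity"], metrics["specificity"])  # 0.75 0.75
print(auc(y_true, scores))  # 0.9375
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold, which is why it serves as the main basis for comparing models and readers.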

Results

Patient characteristics

Table 1 presents comprehensive clinical details pertaining to LSCC patients. The rates of thyroid cartilage invasion were 36.0% in the training cohort, 39.1% in the internal validation cohort, and 37.7% in the external validation cohort. No significant differences were found in age, sex, smoking status, histological grade, alcohol consumption, tumor location, or clinical N stage between patients with and without thyroid cartilage invasion in the training cohort, while the differences in CT-reported AC invasion, clinical T stage, and overall clinical stage between the two groups were statistically significant. In addition, Cohen’s kappa value of the ratings for thyroid cartilage invasion was 0.784, indicating good consistency between the two radiologists (Table S2).

Table 1 Baseline clinical characteristics of the patients in the datasets.

Performance analysis of the radiomics model, the 2D DL model, the 3D DL model, and two readers

Thirteen radiomics features were screened to develop the radiomics model (Table S3 and Figure S1), which yielded an AUC of 0.867 (95% confidence interval [CI]: 0.819–0.916) in the training cohort. Additionally, the 2D DL model achieved an AUC of 0.846 (95% CI: 0.797–0.895), and the 3D DL model achieved the highest AUC of 0.959 (95% CI: 0.934–0.984), as shown in Table 2 and Figure S2.

Table 2 The performance of predictive models and readers.

In the two validation cohorts, the 2D DL model outperformed the other models and both readers (P.J. as reader 1 and L.Q.J. as reader 2) (Table 2; Fig. 4A, B). Specifically, the 2D DL model achieved AUCs of 0.835 (95% CI: 0.758–0.911) in the internal validation cohort and 0.804 (95% CI: 0.696–0.913) in the external validation cohort. In comparison, the 3D DL model yielded AUCs of 0.732 (95% CI: 0.638–0.827) and 0.698 (95% CI: 0.569–0.836), the radiomics model achieved AUCs of 0.727 (95% CI: 0.621–0.823) and 0.705 (95% CI: 0.567–0.843), reader 1 had AUCs of 0.742 (95% CI: 0.644–0.841) and 0.726 (95% CI: 0.598–0.854), and reader 2 had AUCs of 0.727 (95% CI: 0.630–0.824) and 0.715 (95% CI: 0.790–0.944), respectively, all of which were lower than those of the 2D DL model. Detailed comparisons of the AUCs are provided in Figure S3.

Fig. 4

The ROC curves of different models and readers in the internal validation cohort (A) and the external validation cohort (B), respectively. The calibration curves of different models and readers in the internal validation cohort (C) and external validation cohort (D), respectively. The DCA curves of different models and readers in the internal validation cohort (E) and external validation cohort (F), respectively.

The calibration curves of the 2D DL model in two validation cohorts had a good fit (p > 0.05 for both, HL test) (Fig. 4C, D). The DCA curves indicated that the 2D DL model provided greater net benefit (Fig. 4E, F). In addition, the Grad-CAM heatmaps confirmed the validity of the 2D DL model (Figure S4).
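The net benefit plotted in a decision curve is conventionally defined as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability. The sketch below evaluates it for a model and for the "treat all" reference strategy; the data are hypothetical and the function name is introduced only for illustration.

```python
def net_benefit(y_true, scores, threshold):
    """Net benefit at a threshold probability: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Hypothetical labels (1 = invasion) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

model_nb = net_benefit(y_true, scores, 0.5)
# "Treat all" is equivalent to a model that assigns probability 1 to everyone.
treat_all_nb = net_benefit(y_true, [1.0] * len(y_true), 0.5)
print(model_nb, treat_all_nb)  # 0.25 0.0
```

A model is considered clinically useful over the range of thresholds where its curve lies above both the "treat all" line and the "treat none" line, whose net benefit is zero.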

Development and performance of the nomogram

In this study, three key clinical risk factors for predicting thyroid cartilage invasion were identified via univariate analysis: CT-reported AC invasion, clinical T stage, and overall clinical stage (Table 1). As shown in Fig. 5, we constructed a nomogram that integrates the 2D DL signature (the optimal model) and the three factors, which achieved further improvement and delivered the best predictive performance. In the internal validation cohort, the nomogram achieved an AUC of 0.867 (95% CI: 0.799–0.936), which was significantly higher than that of the 3D DL model, the radiomics model, and both radiologists (p < 0.05 for all, DeLong test), with no statistically significant difference from the 2D DL model. In the external validation cohort, the nomogram achieved an AUC of 0.823 (95% CI: 0.714–0.931), with no statistically significant differences compared with the other models (p > 0.05 for all, DeLong test) (Figure S3). The calibration curves and DCA curves demonstrated that the nomogram had good consistency (p > 0.05, HL test) and excellent clinical utility (Fig. 4C–F).

Fig. 5

The nomogram for identifying thyroid cartilage invasion in this study. The comprehensive nomogram was built from 2D DL signature and clinical risk factors, including CT-reported AC invasion, clinical T stage and overall clinical stage.

Survival prediction

The median follow-up time of the whole dataset (418 patients) was 38.6 months, ranging from 1.6 to 82.0 months. More detailed information is shown in Table S4. Before performing Cox regression analysis, the proportional hazards assumption was assessed for all variables, as shown in Table S5. Ultimately, the 2D DL signature (hazard ratio [HR] = 4.666) and clinical N stage (HR = 2.191) were found to be independent prognostic factors associated with DFS (Table 3; Fig. 6). Using the ‘survminer’ R package, we calculated the optimal cut-off point (0.51) for the 2D DL signature, which stratified LSCC patients into high-risk and low-risk groups (Figure S5). Kaplan-Meier survival analysis revealed that the low-risk group with a lower 2D DL signature had significantly better DFS than the high-risk group with a higher 2D DL signature (p < 0.05, log-rank test). Similar results were observed for clinical N stage (p < 0.05, log-rank test). Finally, we developed a prognostic nomogram integrating the 2D DL signature and clinical N stage to predict the DFS of LSCC patients, yielding a 3-year AUC of 0.620, a 4-year AUC of 0.650, and a 5-year AUC of 0.653 (Fig. 6E).
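The risk stratification and Kaplan-Meier estimation described above can be sketched in a few lines. Only the cut-off of 0.51 comes from the text; the patient data are hypothetical, and assigning signatures strictly greater than the cut-off to the high-risk group is an assumption. The study itself used R, so this Python version is an illustration of the estimator rather than the authors' code.

```python
CUTOFF = 0.51  # optimal cut-off for the 2D DL signature reported in the text

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival)] at each time with at least one event.

    events: 1 = progression or death observed, 0 = censored.
    """
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
    return curve

def stratify(signatures, times, events, cutoff=CUTOFF):
    """Split patients into low- and high-risk groups by signature value."""
    groups = {"low": ([], []), "high": ([], [])}
    for s, t, e in zip(signatures, times, events):
        key = "high" if s > cutoff else "low"  # assumed convention at the boundary
        groups[key][0].append(t)
        groups[key][1].append(e)
    return groups

# Hypothetical signatures, DFS times in months, and event indicators.
signatures = [0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
times = [30, 40, 50, 5, 12, 20]
events = [0, 1, 0, 1, 1, 0]

groups = stratify(signatures, times, events)
low_curve = kaplan_meier(*groups["low"])
high_curve = kaplan_meier(*groups["high"])
print(low_curve)   # [(40, 0.5)]
print(high_curve)  # survival drops to 2/3 at month 5 and 1/3 at month 12
```

Comparing the two curves formally would require a log-rank test, which is not implemented in this sketch.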

Table 3 Univariate and multivariate analyses of predictors of DFS.

Fig. 6

The relationship between the 2D DL signature and patient prognosis in the whole cohort. (A) The 2D DL signature and clinical N stage were found to be independent prognostic factors for LSCC patients, and a corresponding nomogram was developed. Survival analysis stratified by the optimal cut-off value of the 2D DL signature (B) and clinical N stage (C) in the whole cohort. Patients with a high 2D DL signature or clinical N stage positive (cN+) status had poorer DFS (p < 0.05, log-rank test). (D) Forest plot illustrating multivariable Cox regression analyses for prognosis in the whole cohort. (E) Time-dependent ROC curves of the nomogram in the whole cohort.

Discussion

In this study, the 2D DL model achieved better performance than the 3D DL model, the radiomics model, and two readers in evaluating thyroid cartilage invasion. In addition, the proposed nomogram incorporating 2D DL signature and clinical risk factors (CT-reported AC invasion, clinical T stage, and overall clinical stage) demonstrated the best classification performance. Furthermore, the 2D DL signature significantly correlated with DFS, offering important information for clinical decision-making and prognosis evaluation for LSCC patients.

Accurate assessment of thyroid cartilage invasion can assist clinicians in individualizing treatment. However, relying on the macroscopic appearance of the tumor on CT/MRI to identify thyroid cartilage invasion yields inconsistent accuracy. Our findings indicated that the AUCs of the 2D DL model surpassed those of the two readers in both validation cohorts. This suggests that a considerable proportion of patients were misdiagnosed on the basis of radiologists’ visual interpretations13,19, which may arise from inherent limitations of human visual perception. In addition, our study revealed that patients with advanced T stage or overall stage, as well as those with AC involvement, were more susceptible to thyroid cartilage invasion, in line with other studies22,23.

Recently, radiomics and DL have shown great potential in tumor staging, molecular typing, and prognosis evaluation13,15,24. Guo et al.19 established a radiomics model to predict thyroid cartilage invasion in laryngeal and hypopharyngeal carcinoma. By applying a synthetic minority oversampling algorithm to correct data imbalance, they improved the AUC of the model in the training set to 0.905. However, that study did not include an additional validation set. Many studies on laryngeal cancer remain single-center, leaving the models’ robustness unconfirmed. Moreover, few studies have utilized 3D DL technology in laryngeal carcinoma. This study compared the performance of the radiomics model, 2D DL model, 3D DL model, and two readers in predicting thyroid cartilage invasion, and demonstrated that the 2D DL model and the corresponding nomogram displayed outstanding predictive performance. However, the radiomics model performed unsatisfactorily; in particular, its sensitivity in the external validation set was only 0.391, which may be partially attributed to the differences in CT devices between the two centers25,26. Nevertheless, our study had a relatively large sample size with independent external validation, enhancing the credibility of the findings. Consistent with previous radiomics studies on laryngeal cancer13,19,20, our study employed venous-phase CT images for model construction. This decision was based on the fact that delayed enhancement allows better visualization of tumor neovasculature and stromal components, thereby enabling more comprehensive and informative feature extraction13,20. Although fusing radiomics features from multiple phases may improve model performance, it can also introduce feature redundancy and instability. In certain scenarios, multiphase features may fail to provide significant performance gains while increasing model complexity27,28. Given these concerns, we chose to construct our model based solely on venous-phase data.

Considering the complex structure of the larynx, this study implemented a 3D DL model to capture the intricate spatial features of laryngeal tumors. The results revealed that the 3D DL model experienced significant overfitting, which may be because 3D CNNs require a larger dataset to train effectively29. Additionally, 3D CNNs necessitate more parameter adjustments and greater storage space30,31. Conversely, the 2D DL model showed excellent discriminative ability in predicting thyroid cartilage invasion, surpassing the other models and both readers. Previous studies have also indicated that 2D DL models often achieve better results in clinical research32,33. Two possible reasons may account for this advantage. Firstly, DL can automatically learn more complex and discriminative features from convolutional layers, making it more effective and precise than manually designed radiomics features34. Secondly, the proportion of useful diagnostic information representing lesions in the input data of the 2D DL model may be higher than that in 3D data32,35, rendering it more suitable for clinical diagnostic tasks. Although the rates of thyroid cartilage invasion were comparable across the training, internal, and external cohorts, there were still inherent discrepancies in patient distribution and imaging acquisition protocols between institutions. In addition, the external validation cohort had a relatively limited sample size. These factors likely contributed to the observed performance heterogeneity and, in particular, to the decreased AUC of the 3D DL model on external validation, given the greater sensitivity of 3D networks to sample size and protocol variability. The relatively stable performance of the 2D DL model across cohorts further supported its greater generalizability in the current multi-center setting.

Importantly, the nomogram that fused the 2D DL signature with clinical risk factors (CT-reported AC invasion, clinical T stage and overall clinical stage) achieved the best predictive performance in both validation cohorts; this superiority was statistically significant only in the internal validation cohort, but not in the external validation cohort, which may be attributed to the smaller size of the external validation cohort. From a clinical perspective, such a nomogram could serve as a practical, noninvasive decision-support tool: preoperatively it may help identify patients at high risk of cartilage invasion and thus inform surgical planning, guide multidisciplinary decisions regarding the need for more extensive resection or adjuvant therapy in borderline cases, and enable tailoring of postoperative surveillance intensity according to individualized risk. Moreover, this study observed a close negative correlation between 2D DL signature and DFS, indicating that a lower 2D DL signature was associated with longer DFS. More specifically, patients without thyroid cartilage invasion had significant survival benefits. Our study also identified clinical N stage as another independent prognostic factor, consistent with the findings of previous studies36. Based on these findings, the prognostic nomogram offers a practical tool for clinicians to customize effective therapeutic strategies and evaluate prognosis for LSCC patients.

Some limitations of this study should be noted. Firstly, this was a retrospective analysis. Although this study included an independent external validation set, the amount of data was limited due to stringent inclusion criteria. Thus, a larger multicenter prospective study is warranted to validate the model presented here. Secondly, precise manual delineation of tumor regions required specialized radiologists, and the results were influenced by subjective experience. Therefore, future studies will focus on automatic segmentation algorithms (e.g., U-Net) to improve consistency and reduce human bias. Thirdly, model development in this study was based only on venous-phase CT images. Incorporating more modalities may improve the model’s performance. Fourthly, there is still no clear guidance on designing network architectures and adjusting training parameters. Therefore, future work should focus on optimizing network architectures and tuning training parameters to enhance model robustness.

Conclusion

In conclusion, we demonstrated that the 2D DL model had notable advantages over the 3D DL model, the radiomics model, and two readers. The integrated nomogram, incorporating 2D DL signature and clinical factors, achieved satisfactory predictive performance for thyroid cartilage invasion and patient prognosis in LSCC. Further investigation is needed to fully generalize the clinical applicability of the proposed nomogram.