Introduction

Laryngeal squamous cell carcinoma (LSCC) ranks among the most prevalent head and neck malignancies, with an estimated 5-year survival rate of approximately 63% [1]. Despite advancements in treatment, up to 40% of patients experience recurrence within three years of radical treatment, significantly impacting long-term survival [2]. Currently, the tumor-node-metastasis (TNM) staging system is employed extensively to predict patient prognosis [3]. However, prognostic outcomes can differ even among patients within the same stage due to the intrinsic heterogeneity of tumors [4]. Therefore, exploring novel and effective prognostic evaluation methods is highly important for LSCC patients to improve risk stratification and guide individualized therapy.

Machine learning (ML) has emerged as a powerful tool for analyzing medical imaging, offering considerable potential in tumor diagnosis [5], treatment efficacy assessment [6], and prognostic prediction [7,8]. Radiomics, by extracting and analyzing numerous quantitative features from medical images, captures intricate biological characteristics within the tumor, thereby providing a non-invasive and efficient method for prognostic prediction in LSCC [9,10]. Chen et al. developed a radiomics model based on CT images from 136 LSCC patients after surgical resection to predict overall survival (OS). The model achieved a concordance index (C-index) of 0.752, outperforming the TNM staging system alone (C-index: 0.699) [11]. However, the relatively small sample size and lack of external validation limit its clinical applicability. Additionally, because features are manually selected and defined, traditional radiomics methods may fail to capture all the valuable information contained within the images [12].

Deep learning (DL), an important branch of ML, is increasingly applied in medical imaging. Compared to radiomics, it can automatically extract features, enabling end-to-end predictions and capturing complex, multi-level data representations [13,14]. Although DL has shown promising performance in the diagnosis and classification of LSCC [15,16], few studies have specifically explored its value for survival prediction in LSCC. A preliminary study showed that a DL model built on diffusion-weighted imaging acquired during radiotherapy in 70 patients was significantly associated with 2-year progression-free survival (PFS) in laryngeal and hypopharyngeal cancer patients [17]. Another study used a DL algorithm on CT imaging to predict outcomes in 291 LSCC patients treated with chemoradiotherapy, with the best model achieving a C-index of 0.31 [18]. However, these DL-based studies were constrained by relatively small sample sizes and focused solely on patients treated with radiotherapy or chemotherapy, without evaluating the prognosis of those treated surgically. Recently, a novel combination of radiomics with DL has been reported, showing improved performance in tumor prognostic prediction [6,19]. Meng et al. constructed a nomogram using clinical, radiomics, and DL signatures to predict OS in patients with gallbladder cancer after surgery, demonstrating that the combined model had a higher C-index than the individual models [20]. To our knowledge, studies focused on this combined approach for predicting the postoperative prognosis of LSCC patients are currently lacking.

We hypothesized that the combined model would offer more comprehensive and accurate prognostic evaluation. In this study, we aimed to develop and validate a CT-based combined model integrating clinical risk factors, radiomics, and DL to predict recurrence-free survival (RFS) in LSCC patients who underwent surgery, thereby enabling more effective risk stratification and guiding personalized postoperative management.

Materials and methods

Ethics

This study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board of the First Affiliated Hospital of Chongqing Medical University (approval ID: 2020080). Because of the retrospective nature of the research, the board waived the requirement for informed consent. All methods adhered to the relevant guidelines and regulations.

Patients

Data were obtained from Center 1 (January 2014 to December 2020) and Center 2 (January 2018 to December 2020). The inclusion criteria were as follows: (I) contrast-enhanced CT performed within 2 weeks before surgery; (II) complete clinical and imaging data; (III) LSCC confirmed by pathological examination; and (IV) complete follow-up information. The exclusion criteria were: (I) distant metastasis or other cancers; (II) poor CT image quality; (III) chemotherapy or radiotherapy before surgery; and (IV) tumors with a maximum dimension of less than 5 mm. No data imputation was performed, and only patients with fully available data were included. Ultimately, 349 patients from the two centers were enrolled. The 271 patients from Center 1 were randomly assigned to a training set (n = 189) or an internal testing set (n = 82) at a 7:3 ratio. The 78 patients from Center 2 constituted an external testing set. The process of patient recruitment is depicted in Fig. 1.
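The 7:3 random assignment described above can be illustrated with a minimal sketch. The seed, the patient identifiers, and the use of simple (non-stratified) shuffling are assumptions made for illustration; the source does not report how the randomization was implemented.

```python
import random

def split_train_test(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient IDs into training and internal testing sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)  # floor: 271 patients -> 189
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers for the 271 Center 1 patients
center1_ids = [f"C1-{i:03d}" for i in range(271)]
train_ids, internal_test_ids = split_train_test(center1_ids)
print(len(train_ids), len(internal_test_ids))  # 189 82
```

Flooring 271 × 0.7 reproduces the reported set sizes of 189 and 82.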

Fig. 1 Flowchart of patient recruitment. LSCC, laryngeal squamous cell carcinoma; Center 1, the First Affiliated Hospital of Chongqing Medical University; Center 2, Tianjin First Central Hospital; CT, computed tomography.

Endpoint and survival follow-up protocol

The principal endpoint was RFS, defined as the time from surgery to local recurrence or death. The follow-up deadline for this study was December 2023. Follow-up evaluations were conducted every 1 to 3 months during the first year, every 3 to 6 months during the second year, every 4 to 8 months from the third to the fifth year, and yearly thereafter. Patients alive without recurrence were censored at their last follow-up.
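As an illustration of this endpoint definition, the sketch below derives an RFS time and an event indicator for a single patient. The function name, the example dates, and the conversion factor of 30.44 days per month are assumptions; the source does not state how months were computed.

```python
from datetime import date

def rfs_months(surgery, event_date, last_follow_up):
    """Return (time_in_months, event_flag) for one patient.

    event_date is the date of local recurrence or death, or None if
    neither occurred; such patients are censored at last follow-up.
    """
    end = event_date if event_date is not None else last_follow_up
    months = (end - surgery).days / 30.44  # average month length (assumption)
    return round(months, 1), int(event_date is not None)

# Hypothetical patient with recurrence 549 days after surgery
print(rfs_months(date(2018, 3, 1), date(2019, 9, 1), date(2023, 12, 1)))  # (18.0, 1)

# Hypothetical patient alive without recurrence: censored (event flag 0)
censored_time, censored_flag = rfs_months(date(2019, 1, 15), None, date(2023, 12, 15))
```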

Imaging acquisition

All patients underwent preoperative CT examinations within 2 weeks before surgery. Table S1 summarizes the CT scanners and acquisition protocols used at the two centers. Scans covered the region from the base of the skull to the aortic arch. Arterial and venous phase images were obtained at 35 s and 60 s after injecting the contrast agent, respectively. Venous-phase CT images were used for subsequent analyses in this study [21,22].

Clinical model construction

Clinical characteristics, including age, sex, smoking status, alcohol consumption, tumor location, differentiation, clinical T stage, clinical N stage, clinical TNM stage, laryngectomy, and adjuvant therapy, were obtained from the patients’ medical records. The tumor stage was determined according to the 8th edition of the American Joint Committee on Cancer staging system. All clinical features were analyzed via univariate and multivariate Cox regression, and only those with a corresponding p value < 0.05 were selected as independent clinical risk factors for constructing the clinical model.

Radiomics model construction

Image preprocessing and segmentation

The tumor segmentation procedure was carried out by two radiologists who were unaware of the clinicopathological information. Radiologist A, with five years of experience, delineated the regions of interest (ROIs) corresponding to the tumors using ITK-SNAP (version 3.8.0). Then, three-dimensional (3D) volumes of interest (VOIs), which encompassed the entire tumor, were obtained by stacking the ROIs. To assess feature stability, Radiologist A resegmented images from 30 randomly selected cases after one month. Additionally, Radiologist B, who had ten years of experience, segmented the images from the same 30 cases. Any disagreements were resolved through discussion between the two radiologists until consensus was achieved. Intraclass correlation coefficients (ICCs) were employed to evaluate intra- and inter-observer reproducibility.

Radiomics feature extraction and selection

CT-based radiomics feature extraction was performed using the PyRadiomics package (version 3.0.1), in accordance with the Image Biomarker Standardisation Initiative (IBSI) guidelines [23]. All the images were resampled to isotropic voxels with dimensions of 1 × 1 × 1 mm³ and discretized with a fixed bin width of 25 Hounsfield units (HU) before feature extraction. To standardize the distribution of feature intensities, Z-score normalization was conducted. Features were selected and the signature was developed in the following steps. First, radiomics features with ICCs < 0.75 were removed. Second, Pearson’s correlation coefficient was calculated between the remaining reproducible features, and to prevent redundancy, only one feature from each pair with a coefficient above 0.9 was retained. Third, univariable Cox regression analysis was used to reduce the extensive set of radiomics features. Fourth, the least absolute shrinkage and selection operator (LASSO) Cox regression was employed to further refine the features. The optimal tuning parameter (λ) was selected using 5-fold cross-validation to enhance model generalizability and stability. Finally, the radiomics score (RS) was derived as the linear combination of the chosen features weighted by their respective coefficients. The detailed parameter settings for radiomic feature extraction are provided in Table S2.
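Two of the steps above, the correlation-based redundancy filter and the final weighted sum, can be sketched in plain Python. This is an illustrative sketch rather than the pipeline used in the study: the feature names and values are hypothetical, the absolute value of the coefficient is compared with 0.9, and the first feature encountered in a correlated pair is the one kept. The source specifies neither choice.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.9):
    """Keep a feature only if |r| <= threshold against every kept feature.

    features maps a feature name to its values across patients.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def radiomics_score(values, coefficients):
    """Linear combination of selected feature values and their weights."""
    return sum(coefficients[name] * values[name] for name in coefficients)

# Hypothetical, already Z-scored toy features for five patients
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly collinear with f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_redundant(toy)
print(selected)  # ['f1', 'f3']

# Hypothetical coefficients standing in for LASSO Cox weights
score = radiomics_score({"f1": 1.0, "f3": 2.0}, {"f1": 0.5, "f3": -0.2})
```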

DL model construction and visualization

The convolutional neural network (CNN) model ResNet-34, pretrained on ImageNet, served as the backbone network for this study. For each patient, we extracted a two-dimensional (2D) ROI from the slice showing the maximum dimension of the tumor using a rectangular bounding box, with the CT window set to a width of 300 HU and a level of 50 HU. Each cropped image was subsequently resized to 224 × 224 pixels and Z-score normalized before being input into the model. To improve model performance and mitigate overfitting, random data augmentation techniques, such as flipping and cropping, were applied to the training set. The prediction probability output by the CNN model served as the deep learning score (DLS), which was subsequently used to predict the RFS of LSCC patients.
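The intensity preprocessing described above can be sketched as follows. A window width of 300 HU at a level of 50 HU corresponds to the range from −100 to 200 HU. Rescaling the clipped values to [0, 1] and computing the Z-score per image are assumptions, since the source does not specify either detail.

```python
from statistics import mean, pstdev

def apply_window(hu_values, level=50.0, width=300.0):
    """Clip HU values to the window and rescale to [0, 1] (assumed scaling)."""
    lower = level - width / 2.0  # -100 HU
    upper = level + width / 2.0  # 200 HU
    return [(min(max(v, lower), upper) - lower) / width for v in hu_values]

def z_score(values):
    """Z-score normalization using the values' own mean and SD (assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

pixels_hu = [-1000, -100, 50, 200, 1000]  # hypothetical pixel intensities
windowed = apply_window(pixels_hu)
print(windowed)  # [0.0, 0.0, 0.5, 1.0, 1.0]
normalized = z_score(windowed)
```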

The model for the prediction task was trained using the cross-entropy loss function, and the weights of the entire network of the fine-tuned model were iteratively updated via backpropagation with stochastic gradient descent (SGD). Throughout the training process, an initial learning rate of 0.01 and a batch size of 64 were maintained, and 50 epochs of iterations were carried out. Furthermore, to visualize and understand the model’s ability to identify crucial lesion areas, gradient-weighted class activation mapping (Grad-CAM) was employed to generate heatmaps highlighting suspicious regions. The model was trained on an NVIDIA GeForce RTX 4070 GPU, with its network framework built using PyTorch.

Combined model construction

Multivariate Cox regression analyses were performed using clinical characteristics, RS, and DLS to identify independent prognostic indicators. These indicators were then used to construct a combined model, and a nomogram was developed to visualize their contribution to individual prognostic risk in LSCC. Additionally, using the median nomogram score of the training set as the threshold, a risk stratification system was established to divide patients into low- and high-risk groups. Kaplan-Meier (KM) curves and log-rank tests were then used to compare RFS between the two groups.
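The stratification and survival-curve steps can be illustrated with a small, self-contained sketch. The scores and follow-up data are hypothetical, and assigning scores strictly above the cutoff to the high-risk group is an assumption, because the source does not say how ties at the threshold were handled. The Kaplan-Meier function implements the standard product-limit estimator.

```python
from statistics import median

def stratify(scores, cutoff):
    """Label each score as high or low risk (strictly above cutoff = high)."""
    return ["high" if s > cutoff else "low" for s in scores]

def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)
        n_events = sum(1 for tt, e in data if tt == t and e == 1)
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t  # events and censored cases leave the risk set
        i += n_at_t
    return curve

# Hypothetical training-set nomogram scores define the cutoff
cutoff = median([0.05, 0.10, 0.20, 0.40])
groups = stratify([0.10, 0.30], cutoff)
print(groups)  # ['low', 'high']

# Hypothetical follow-up in months (1 = recurrence or death, 0 = censored)
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

In this toy example the estimated survival falls to 0.8 at 5 months and to about 0.27 at 12 months.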

Statistical analysis

Statistical analyses were conducted using Python (version 3.7.12) and R (version 4.2.1). Continuous variables were compared by either one-way ANOVA or the Kruskal-Wallis test, as appropriate. Categorical variables were compared via the chi-square test. The C-index and the area under the curve (AUC) of the time-dependent receiver operating characteristic (ROC) curve were calculated in all datasets to evaluate the prognostic performance of the proposed models. Their 95% confidence intervals (CIs) were derived using the “survival” and “timeROC” packages. AUCs were compared using DeLong’s test, and C-indices were compared using the “survcomp” package. Calibration curves and the Hosmer–Lemeshow test were used to assess calibration. Decision curve analysis (DCA) was used to evaluate the model’s clinical usefulness. Subgroup analyses were performed according to baseline characteristics. All models were developed via Cox proportional hazards regression. A two-sided p < 0.05 was considered statistically significant.
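For readers unfamiliar with the C-index, the following sketch computes Harrell's concordance index directly from its definition. It is only an illustration of the metric, not the R implementation used in the study. The toy data are hypothetical, and pairs with tied event times are ignored for simplicity.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    before patient j's event or censoring time. Ties in risk count 0.5.
    Raises ZeroDivisionError if there are no comparable pairs.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical follow-up (months), event flags, and predicted risk scores
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.1, 0.3]))  # 1.0
print(concordance_index(times, events, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the C-indices reported in Table 2.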

Results

Patient characteristics

This study enrolled 349 patients (mean age ± standard deviation, 61.7 ± 8.0 years; 332 men, 17 women). Table 1 outlines their baseline clinicopathological characteristics. The median RFS was 54.0 months (interquartile range [IQR], 23.0–66.0) in the training set, 47.5 months (IQR, 25.0–62.8) in the internal testing set, and 40.5 months (IQR, 28.5–51.8) in the external testing set. There were no statistically significant differences in the baseline characteristics across all sets (p > 0.05). A flow diagram for this study is presented in Fig. 2.

Table 1 Baseline clinical characteristics of patients.

Fig. 2 A flow diagram of this study. VOI, volume of interest; ROI, region of interest; Grad-CAM, gradient-weighted class activation mapping.

Performance of the clinical model, RS, and DLS

Among the clinical characteristics, univariate and multivariate Cox regression analyses revealed that tumor location and N stage were significantly associated with RFS (p < 0.05), and these factors were subsequently used to construct the clinical model (Table S3). Table S4 presents the p-values for performance comparisons between models. The clinical model had a C-index of 0.683 (95% CI: 0.623–0.743), 0.634 (95% CI: 0.528–0.740), and 0.602 (95% CI: 0.493–0.711) in the training, internal testing, and external testing sets, respectively (Table 2).

Table 2 Prediction performance of different models.

A total of 1106 features were initially extracted from each VOI. Among these, 971 radiomics features demonstrated strong inter- and intra-observer reliability (ICC > 0.75) and were retained for further analysis. Following Pearson’s correlation analysis, 176 features remained, and univariable Cox regression was then applied to retain 20 features. Ultimately, eleven features were selected through LASSO Cox regression to construct the RS. Table S5 lists the selected features along with their corresponding coefficients. Compared with the clinical model, the RS achieved a higher C-index of 0.711 (95% CI: 0.648–0.775) in the training set, 0.679 (95% CI: 0.571–0.787) in the internal testing set, and 0.617 (95% CI: 0.509–0.725) in the external testing set (Table 2). Figure S1 shows the time-dependent ROC curves for the RS.

We trained a CNN model using preoperative CT images, with its output serving as the DLS to predict RFS in patients with LSCC. The DLS showed excellent prognostic performance, with a C-index of 0.742 (95% CI, 0.675–0.809), 0.727 (95% CI, 0.624–0.832), and 0.729 (95% CI, 0.623–0.835) in the training, internal testing, and external testing sets, respectively, outperforming both the RS and the clinical model (Table 2). Figure S2 shows the time-dependent ROC curves for the DLS. To improve the interpretability of the DL model, we used the Grad-CAM method. Figure S3 presents Grad-CAM heatmaps depicting two types of patients with divergent outcomes: (A) recurrence or (B) recurrence-free. The red areas represent regions that significantly contribute to the prediction, while the blue areas indicate regions with less influence, thereby highlighting the key parts of the input image that impact the model’s decision-making process.

Performance and risk stratification of the combined model

The univariate and multivariate Cox regression analyses revealed that the tumor location, N stage, RS, and DLS were independent prognostic predictors associated with RFS (all p < 0.05) (Table S3). Table S6 presents the standardized beta coefficients (β) for these factors. Based on these predictors, we developed a combined model and constructed a nomogram to provide a visualization and calculate a risk score for each patient, thereby facilitating risk stratification (Fig. 3A). Further subgroup analysis of these predictors in the combined model, as shown in Figures S4 and S5, revealed that DLS accurately identifies high-risk and low-risk patients in most subgroups, while RS effectively distinguishes patients in relatively fewer subgroups.

Fig. 3 Development and performance of the nomogram. Nomogram integrating the tumor location, N stage, RS, and DLS (A). Time-dependent ROC curves of the nomogram at 1-, 2-, and 3-year RFS for the training set (B), internal testing set (C), and external testing set (D). Calibration curves of the nomogram at 1-, 2-, and 3-year RFS for the training set (E), internal testing set (F), and external testing set (G). RS, radiomics score; DLS, deep learning score; RFS, recurrence-free survival; ROC, receiver operating characteristic; AUC, area under the curve.

Compared to standalone clinical, radiomic, and DL models, the combined model achieved a superior C-index of 0.826 (95% CI: 0.779–0.873), 0.810 (95% CI: 0.736–0.883), and 0.742 (95% CI: 0.649–0.834) in the training, internal testing, and external testing sets, respectively (Table 2). Time-dependent ROC curves for the combined model are presented in Fig. 3B–D. In the external testing set, the combined model demonstrated reliable predictions for 1-, 2-, and 3-year RFS, with AUC values of 0.707 (95% CI: 0.521–0.893), 0.812 (95% CI: 0.683–0.942), and 0.778 (95% CI: 0.651–0.904), respectively. Calibration curves and the Hosmer–Lemeshow test demonstrated that the combined model was well calibrated (Fig. 3E–G and Table S7). According to the DCA, the combined model outperformed the other models across most threshold probabilities, indicating a better net benefit (Fig. 4).

Fig. 4 Decision curves of different models for 1-year RFS (A), 2-year RFS (B), and 3-year RFS (C) in all patients. RFS, recurrence-free survival; RS, radiomics score; DLS, deep learning score.

The cutoff value (0.153) of the nomogram was used to classify patients into low- and high-risk groups. KM survival analysis demonstrated that patients in the low-risk group had better RFS than those in the high-risk group in the training set (hazard ratio [HR] = 0.163, 95% CI: 0.087–0.307, p < 0.001), internal testing set (HR = 0.157, 95% CI: 0.063–0.392, p < 0.001), and external testing set (HR = 0.312, 95% CI: 0.137–0.711, p = 0.003) (Fig. 5). These results illustrate the nomogram’s capacity to distinguish between different survival risks.

Fig. 5 KM survival analyses for RFS of the nomogram. The KM curves for patients in the high-risk and low-risk groups in the training set (A), internal testing set (B), and external testing set (C). KM, Kaplan-Meier; RFS, recurrence-free survival.

Discussion

In this study, we developed and validated a CT-based combined model that integrates tumor location, N stage, RS, and DLS to non-invasively predict postoperative RFS in LSCC patients. The results demonstrated that the combined model outperformed the individual clinical, radiomics, and DL models (internal testing set: C-index, 0.810 vs. 0.634, 0.679, and 0.727; external testing set: C-index, 0.742 vs. 0.602, 0.617, and 0.729, respectively). These findings suggested that the combined model may provide more comprehensive tumor heterogeneity information, thereby enabling more effective prognostic prediction for LSCC patients. Furthermore, the combined model effectively stratified high- and low-risk patients, demonstrating its potential utility in clinical practice.

Previous studies have reported several clinical factors significantly associated with the prognosis of LSCC patients [24,25]. One study found that patients with supraglottic tumors tend to have worse prognoses compared to those with glottic or subglottic tumors [26]. Another study suggested that LSCC patients with lymph node positivity generally experience worse outcomes and a higher likelihood of recurrence after surgery [27]. Our research confirmed that tumor location and N stage were significant predictors of survival outcomes for LSCC patients (both p < 0.05), and both were incorporated into the clinical model. However, the predictive performance of the clinical model remained relatively low, with a C-index of 0.602 in the external testing set. This indicates that clinical indicators have inherent limitations, as they rely on observable characteristics that may not fully capture the tumor’s underlying biological complexity.

As artificial intelligence technology progresses, radiomics and DL have demonstrated promising results in tumor prognosis assessment [28,29,30]. Woolen et al. extracted perfusion and radiomic features from CT images of 36 laryngeal and hypopharyngeal cancer patients to predict 1-year disease-free survival, achieving a C-index of 0.69 [28]. Li et al. utilized radiomic features extracted from dual-energy CT to predict 3-year PFS in early glottic cancer, obtaining an AUC of 0.671 [31]. Our study revealed that the radiomics model was capable of predicting the RFS of LSCC patients, with predictive performance comparable to that reported in previous studies. Eleven radiomic features were included in the RS, seven of which were wavelet-based. Wavelet-based features have been widely adopted in many radiomics studies and have demonstrated predictive value in various cancer types [32,33,34]. They are effective in capturing additional textural information at different scales and resolutions, which may provide insights into tumor heterogeneity. Among the included features, wavelet_HHL_glcm_ClusterShade exhibited the highest positive coefficient, indicating that textural heterogeneity is a critical factor in predicting poor prognosis. In contrast, the feature wavelet_LLH_ngtdm_Strength, which had the largest negative coefficient, reflects more homogeneous tumor patterns, which may be associated with a more favorable prognosis. Additionally, several other first-order intensity and texture features provide complementary insights into tumor density distribution and microstructural uniformity. However, the performance of the radiomics model was inferior to that of the DL model, indicating that relying solely on handcrafted radiomic features may not fully capture the complex patterns and heterogeneity inherent in LSCC. In contrast, DL algorithms can automatically extract complex, high-dimensional, and nonlinear features, providing a more comprehensive representation of tumor characteristics [35].

In this study, the DLS achieved a C-index of 0.729 for predicting the survival outcome of LSCC patients in the external testing set, surpassing both the RS and the clinical model. Furthermore, multivariate Cox regression results indicated that DLS had the highest HR and β, emphasizing its dominant role in predicting recurrence risk. Subgroup analysis also demonstrated that DLS can effectively identify high-risk and low-risk patients in most clinical subgroups. This demonstrates that DL algorithms can derive prognostic information from CT images, complementing clinical prognostic features. Despite the impressive performance of the DL model, its “black-box” nature presents significant challenges to interpretability. To address this issue, Grad-CAM heatmaps were generated to provide a visual representation of the focal areas during the prediction process, revealing key discriminative features within the tumor region and enhancing the model’s transparency. In patients with recurrence, heatmaps primarily highlighted regions within the tumor that corresponded to areas of low-density or irregular texture, likely indicating necrosis or aggressive tumor features. In contrast, for patients who remained recurrence-free, heatmaps showed more peripheral activations, suggesting the model focuses on tumor margins or surrounding tissues. In a small subset of cases, minor activation was observed outside the patient’s body; however, these were not dominant areas of attention. This approach provides insights into the DL model’s decision-making process and further supports its clinical applicability.

Some previous studies have demonstrated that integrating DL with radiomics can further increase the predictive accuracy of prognostic models [36,37]. Gu et al. developed a DL-based radiomic nomogram for prognostic prediction in nasopharyngeal carcinoma, achieving significantly higher predictive accuracy compared to individual radiomics and DL models, with the C-index increasing by 0.075 and 0.062 in the external validation cohort [19]. In the study by Wei et al., the combined model integrating RS and DLS significantly improved survival prediction over the use of a radiomic or DL model alone in the external validation set (C-index: 0.685 vs. 0.658 vs. 0.601) [38]. These studies highlighted the complementary strengths of DL and radiomics. In the field of prognostic modeling, some studies have employed feature-level fusion strategies [39,40,41]. However, this method may not be optimal due to the potential instability caused by correlations between handcrafted and DL features, which could ultimately adversely affect the performance of integrated models [22,42]. Therefore, instead of mixing the two categories of features, we employed a decision-level fusion approach to establish a combined model. To the best of our knowledge, our study is the first to explore the predictive value of combining DL with radiomics for survival outcomes in LSCC. In this two-center study, the combined model achieved the highest C-index among all the models. Notably, the inclusion of RS and DLS significantly improved the model’s predictive performance compared to the clinical model alone (internal testing set: C-index, 0.810 vs. 0.634; external testing set: C-index, 0.742 vs. 0.602). Additionally, we observed notable differences in RFS between the high- and low-risk groups classified by the combined model in all sets (all p < 0.05). High-risk patients may benefit from more aggressive treatment and closer follow-up, while low-risk patients could follow standard protocols to optimize resource allocation. This further supports the effectiveness and clinical utility of the combined model.

Nevertheless, this study has certain limitations. First, all data were retrospectively collected from two centers, which might have introduced selection bias and confounding factors. Therefore, large-scale multicenter prospective studies are still required for more robust validation. Second, manually segmenting tumor images is time-consuming and susceptible to inter-observer variability, potentially compromising the reproducibility of results. Implementing automated or AI-assisted segmentation techniques could reduce human error and increase consistency in tumor delineation. Third, the tumor delineation for the DL model was based on the largest 2D slice, which may not adequately represent the entire tumor’s spatial characteristics. Future studies should explore 3D tumor analysis to capture comprehensive spatial features and improve model performance. Fourth, using an ImageNet-pretrained encoder may introduce a domain gap, as the features learned from natural images may not be fully optimal for medical grayscale images. Models pretrained on medical data, such as BioMedCLIP, may offer better domain-specific feature extraction. Finally, the study did not incorporate other potentially valuable prognostic factors, such as genomic data and biological tumor markers, which might further refine predictive models.

Conclusions

We developed and validated a combined model integrating the RS, DLS, and independent clinical risk factors, which demonstrated promising results in predicting RFS in LSCC patients after surgery and outperformed standalone models. This combined model holds potential as a non-invasive tool to assist clinicians in stratifying patient risk, thereby supporting personalized treatment and follow-up strategies.