Introduction

Esophageal cancer ranks as the seventh most prevalent malignancy and the sixth leading cause of cancer-related mortality worldwide, representing a major health challenge1. Around half of the cases are found in China, predominantly as esophageal squamous cell carcinoma (ESCC)2. Due to its aggressive nature and lack of early symptoms, ESCC is often diagnosed at a locally advanced stage.

Recent clinical trials indicated that combining neoadjuvant immunotherapy with chemotherapy is effective and safe for patients with locally advanced esophageal squamous cell carcinoma (LA-ESCC), showing promising outcomes3,4,5,6. However, the benefits of the neoadjuvant chemo-immunotherapy (NACI) regimen are limited to a subset of patients7,8. Various biomarkers, such as tumor mutation burden (TMB), programmed death receptor-1 (PD-1), programmed death ligand-1 (PD-L1) expression, circulating tumor DNA, microsatellite instability (MSI), and gut microbiome composition have been investigated to predict responses to NACI9. Unfortunately, these biomarkers have shown inconsistent reliability for ESCC. Consequently, there is a critical demand for reliable, non-invasive biomarkers to predict treatment outcomes.

Radiomics is an emerging non-invasive technique that extracts quantitative features from medical images, offering valuable insights into tumor biology10. Often termed a digital biopsy, radiomics assesses the tumor microenvironment by providing data on tumor heterogeneity and molecular and genetic variations11,12. Previous studies have validated the predictive value of imaging biomarkers across various cancers, particularly for chemo-immunotherapy responses13,14,15. Radiomics has the potential to surpass traditional markers such as TMB, MSI, and PD-1/PD-L1 expression due to its ability to characterize the tumor microenvironment non-invasively. Recent research indicates that tumors comprise diverse sub-regions or habitats, each with distinct structural, metabolic, and functional properties16,17. Advanced computed tomography (CT) technology allows for the visualization and quantification of these sub-regions, enhancing tumor characterization18. The robustness, convenience, and reliability of radiomics make it a promising tool for predicting NACI outcomes in LA-ESCC.

CT imaging is inherently three-dimensional (3D), and analyzing these images using two-dimensional (2D) methods can lead to information loss. Although 3D deep convolutional neural networks (DCNNs) are more appropriate for this type of data, they demand larger training datasets than 2D networks, which poses a challenge in medical research because available data are limited19. Moreover, deep learning (DL) features can be sensitive to transformations like translation, rotation, and scaling, whereas radiomics features exhibit greater robustness against such variations20. Studies have shown that multi-domain models integrating radiomics with DL outperform single-method approaches21,22,23.

This study aims to construct and validate a non-invasive imaging biomarker for predicting NACI response in LA-ESCC using fusion models combining sub-regional radiomics with multichannel 2D or 3D DL. Additionally, the performance of the two fusion models was comprehensively compared to identify the most effective approach.

Results

Patient characteristics

The study included 271 patients with LA-ESCC, divided into a training cohort of 161 patients, an internal validation cohort of 69 patients, and an external validation cohort of 41 patients. Table 1 provides a summary of the baseline clinical characteristics for these cohorts. There were no significant differences among the cohorts in key variables, including age, gender, body mass index, smoking and drinking habits, clinical tumor, lymph node and metastasis stages, tumor location, tumor length, cycles of chemo-immunotherapy, immune checkpoint inhibitors, and squamous cell carcinoma antigen levels (all P > 0.05). The pathological complete response (pCR) rates were 28.6% in the training cohort, 28.9% in the internal validation cohort, and 29.3% in the external validation cohort, with no significant differences (P > 0.05).

Table 1 Baseline clinical characteristics of the training cohort and the validation cohorts

Diagnostic performance of different tumor regions in CT

The process of selecting the optimal number of clusters is detailed in Table S1. In the training cohort, the Calinski-Harabasz index was highest when four clusters were used, dividing the tumor region into four distinct sub-regions. Sub-region 1 exhibited the highest diagnostic performance in the ROC curve analysis, achieving an AUC of 0.823 (95% CI: 0.753–0.881), with sensitivity at 78.2%, specificity at 89.1%, positive predictive value (PPV) at 66.9%, negative predictive value (NPV) at 91.5%, and an overall accuracy of 82.7%. Sub-region 2 had a lower AUC of 0.757, with sensitivity and specificity values of 73.8% and 69.1%, respectively. Sub-regions 3 and 4 were significantly less effective than sub-region 1 in distinguishing pCR from non-pCR. The combination of sub-region 1 and sub-region 2 did not improve diagnostic performance compared to sub-region 1 alone (P = 0.671). Therefore, sub-region 1 was selected for further analysis. Table 2 summarizes the predictive performances of radiomics features derived from the different sub-regions for pCR prediction in LA-ESCC. Key features selected from these sub-regions are detailed in Table S2.

Table 2 Predictive performances of radiomics features from different sub-regions for predicting pCR in LA-ESCC

Comprehensive analysis of radiomics and DL features

A total of 1834 radiomics features were initially extracted from sub-region 1. After applying Spearman correlation coefficient matrix filtering to remove highly correlated features (|Spearman correlation coefficient| ≥ 0.9), 745 features remained. Subsequent intraclass correlation coefficient (ICC) analysis excluded features with poor reproducibility (ICC < 0.75), resulting in 312 robust features. From these, 12 key radiomics features were ultimately selected via least absolute shrinkage and selection operator (LASSO) regression (Fig. 1A, left panel). For deep learning–based features, 2048 multichannel 2D features and 2048 3D features were initially extracted. After Spearman correlation analysis, 1124 2D features and 1010 3D features were retained. Subsequent ICC filtering further reduced the feature sets to 856 2D features and 693 3D features, respectively. Finally, LASSO regression identified 15 multichannel 2D features (Fig. 1A, middle panel) and 6 3D features (Fig. 1A, right panel) as the most informative predictors. ICCs of the selected features in the three models are summarized in Table S3. Shapley additive explanations (SHAP) analysis provided quantitative insights into the importance of individual features. Among radiomics features, Rad_1_H1 was the most critical for pCR classification (Fig. 1B, C, left panel), while 2D_DL_1749 (Fig. 1B, C, middle panel) and 3D_DL_909 (Fig. 1B, C, right panel) were the most important features in their respective categories. Although multichannel 2D DL features attained the highest AUC of 0.906, their discriminative ability for predicting pCR was not significantly different from that of 3D DL and radiomics (Fig. 1D, Table 3). Spearman correlation analysis revealed minimal redundancy among the selected features, as illustrated in Fig. S1.

Fig. 1: Feature selection, SHAP analysis, and model assessment.

A Feature selection with LASSO regression in the sub-region 1 radiomics model (left panel), multichannel 2D DL model (middle panel), and 3D DL model (right panel). B, C Feature importance based on the XGBoost algorithm with SHAP analysis in the sub-region 1 model (left panel), multichannel 2D DL model (middle panel), and 3D DL model (right panel). D The ROC curves of the DLRad1, DLRad2, 2D DL, 3D DL, and sub-region 1 radiomics models in the training set (left panel), internal validation set (middle panel), and external validation set (right panel). Note: 2D two-dimensional, 3D three-dimensional, DL deep learning, LASSO least absolute shrinkage and selection operator, ROC receiver operating characteristic, SHAP Shapley Additive Explanations, XGBoost eXtreme Gradient Boosting.

Table 3 Performances of the predictive models in the training and validation cohorts

Performance analysis of radiomics, 2D/3D DL models, and fusion models

In the DLRad1 model, 14 key features were selected, comprising 5 radiomics features and 9 multichannel 2D DL features (Fig. S2A–C, left panels). The DLRad2 model incorporated 10 key features, including 5 radiomics features and 5 3D DL features (Fig. S2A–C, right panels). Diagnostic performance metrics of the predictive models across study sets are summarized in Table 3.

The DLRad1 model demonstrated the highest AUC (0.793–0.910) across all study sets. The AUCs for the DLRad2, multichannel 2D DL, 3D DL, and sub-region 1 radiomics models, compared to DLRad1, were 0.718–0.834 (P = 0.043–0.494), 0.769–0.906 (P = 0.810–0.894), 0.701–0.831 (P = 0.042–0.471), and 0.721–0.823 (P = 0.036–0.501), respectively. Calibration curves for the DLRad1 model exhibited strong consistency across probabilities (Fig. S3A–C), and decision curve analysis (DCA) indicated a higher net benefit for the DLRad1 model (Fig. S3D–F) in all study sets. The predictive models’ optimal cutoff values, determined by the maximum Youden index in the training set, were 0.411 for DLRad1, 0.431 for DLRad2, 0.401 for multichannel 2D DL, 0.409 for 3D DL, and 0.406 for radiomics. Based on these cutoffs, the DLRad1 model achieved the best sensitivity (80.0–83.3%) and specificity (79.3–81.6%) across test sets (Table 3). Multivariate logistic regression analyses confirmed that DLRad1 was an independent predictor of pCR in both training and validation sets (Table S4). SHAP waterfall diagrams further visualized the contributions of individual features in predicting pCR (Fig. 2A, B) and non-pCR (Fig. 2C, D).
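The cutoff selection described above can be illustrated with a minimal sketch. It assumes the usual definition of the Youden index, J = sensitivity + specificity - 1, evaluated at every observed predicted probability, with a case called positive when its probability is at or above the cutoff. The function name and the toy data are hypothetical and are not taken from the study.

```python
def youden_cutoff(y_true, y_prob):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(y_prob)):
        # A case is predicted positive when its probability is >= cutoff.
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical toy data: 1 = pCR, 0 = non-pCR.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90]
cutoff, j = youden_cutoff(labels, probs)
```

In this sketch the cutoff is derived from one set of labels and probabilities only, mirroring the use of the training set to fix a threshold that is then applied unchanged to the validation sets.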

Fig. 2: CT image changes and SHAP waterfall plots for the DLRad1 model in two patients.

A CT image changes before and after NACI in a patient who achieved pCR. B Waterfall plot of SHAP explanations in a patient who achieved pCR after NACI. C CT image changes before and after NACI in a patient with non-pCR. D Waterfall plot of SHAP explanations in a patient with non-pCR after NACI.

Clinical model development and model comparison

To investigate whether the integration of radiomics features enhances predictive performance beyond conventional clinical parameters, we first conducted univariate and multivariate logistic regression analyses to identify clinical factors independently associated with pCR (Table S4). In the training cohort, tumor length emerged as the only significant clinical predictor (odds ratio [OR] = 0.549, 95% confidence interval [CI]: 0.180–0.915, P = 0.045). Based on this, a Clinical model was constructed using tumor length as the sole input variable. We then developed a Combined model incorporating both the Clinical model and the DLRad1 signature to evaluate the added predictive value of radiomics features. The discriminative performance of the Clinical model, the DLRad1 model, and the Combined model was assessed across the training, internal validation, and external validation cohorts using AUCs, as well as Integrated Discrimination Improvement (IDI) and Net Reclassification Improvement (NRI) metrics (Table S5). In the training cohort, the Combined model achieved the highest AUC (0.919, 95% CI: 0.885–0.962), outperforming the Clinical model (AUC = 0.617, 95% CI: 0.536–0.698) and the DLRad1 model (AUC = 0.910, 95% CI: 0.867–0.954). Both IDI and NRI analyses demonstrated statistically significant improvements in discrimination and classification when adding DLRad1 to the Clinical model (IDI = 0.095 [0.038–0.151], P < 0.001; NRI = 0.317 [0.156–0.472], P < 0.001). Similar trends were observed in the internal validation (AUC = 0.841 for Combined vs. 0.669 for Clinical) and external validation cohorts (AUC = 0.773 for Combined vs. 0.561 for Clinical), with consistent IDI/NRI improvements.
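For readers unfamiliar with the IDI, the following sketch shows one commonly used definition: the gain in mean predicted probability among events minus the gain among non-events when moving from an old model to a new one. The study does not state which IDI or NRI variant was computed, so this definition, the function names, and the toy probabilities are assumptions for illustration only.

```python
def mean(values):
    return sum(values) / len(values)


def idi(y_true, p_old, p_new):
    """Integrated Discrimination Improvement of p_new over p_old."""
    events = [i for i, y in enumerate(y_true) if y == 1]
    nonevents = [i for i, y in enumerate(y_true) if y == 0]
    # Change in mean predicted probability among events (should rise).
    gain_events = mean([p_new[i] for i in events]) - mean([p_old[i] for i in events])
    # Change in mean predicted probability among non-events (should fall).
    gain_nonevents = mean([p_new[i] for i in nonevents]) - mean([p_old[i] for i in nonevents])
    return gain_events - gain_nonevents


# Hypothetical toy data: an uninformative old model versus a better new model.
y = [1, 1, 0, 0]
old_probs = [0.5, 0.5, 0.5, 0.5]
new_probs = [0.8, 0.6, 0.3, 0.1]
improvement = idi(y, old_probs, new_probs)
```

A positive value indicates that the new model separates events from non-events more widely on average than the old one.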

Discussion

NACI is widely recognized as a safe and effective first-line treatment for LA-ESCC, significantly enhancing pCR rates. Despite its efficacy, no optimal method currently exists for predicting NACI response in LA-ESCC. In this study, we developed sub-region-based radiomics with multichannel 2D and 3D DL models using primary tumor CT images to predict pCR outcomes. First, habitat analysis was performed to extract and evaluate the predictive capabilities of imaging features across different primary tumor regions. Tumor regions were clustered into four distinct groups based on the Calinski-Harabasz Index, with sub-region 1 demonstrating the strongest predictive performance. Subsequently, two multidomain fusion models were constructed: DLRad1, integrating sub-region-based radiomics with multichannel 2D DL, and DLRad2, combining sub-region-based radiomics with 3D DL. The DLRad1 model exhibited superior performance, achieving an AUC value of 0.910 in the training set, 0.829 in the internal validation set, and 0.793 in the external validation set. Feature analysis revealed multichannel 2D DL features had greater discriminative ability than sub-region-based radiomics and 3D DL features. While partial correlations were observed among radiomics features from sub-region 1, multichannel 2D, and 3D DL, there was no significant redundancy among the selected features.

Accurately predicting NACI response prior to treatment is critical for devising personalized therapeutic strategies, minimizing unnecessary harm, and reducing side effects associated with preoperative treatments. Radiomics, owing to its high availability and non-invasive nature, is increasingly gaining attention. Several studies have investigated 2D radiomic models for predicting pCR in ESCC patients undergoing neoadjuvant immunotherapy. Wang et al. developed a traditional 2D model incorporating 10 radiomic features before treatment, achieving AUCs of 0.87 in the training cohort and 0.75 in the validation cohort24. Shi et al. proposed a single-slice CT-based 2D model with LASSO feature selection combined with eXtreme Gradient Boosting (XGBoost), yielding AUCs of 0.89 and 0.80 in the training and validation cohorts, respectively25. Zhu et al. constructed a similar model using ComBat harmonization, reporting AUCs of 0.818 and 0.787 in the respective datasets26. These studies demonstrated favorable performance in identifying pCR to NACI in resectable ESCC patients. However, these studies were limited by small sample sizes or lacked external validation, which restricts the generalizability of their findings. A recent study by Zhang et al. highlighted the predictive value of post-treatment CT-based radiomic modeling in assessing pCR after neoadjuvant chemoradiotherapy in ESCC. The post-treatment model, built using XGBoost, demonstrated superior predictive performance compared to the pre-treatment model, with AUCs of 0.875 in the training set, 0.835 in the internal validation set, and 0.786 in the external validation set27. This study underscores the predictive value of post-treatment imaging and the advantage of multimodal feature integration in enhancing individualized response assessment. 
Given this, two recent studies have demonstrated that incorporating both pre- and post-treatment imaging data into radiomics models could improve the prediction of pCR in patients with ESCC undergoing NACI. Zheng et al. developed a CT-based delta-radiomics model by quantifying changes in radiomic features before and after treatment; it outperformed the pre-treatment model, achieving AUCs of 0.827 in the training set and 0.790 in the testing set28. Ruan et al. proposed an ESCC preoperative imaging score by integrating radiological features from pre- and intra-treatment phases, achieving an AUC of 0.91829. These findings highlight the added value of dynamic radiomic changes in capturing therapy-induced tumor alterations and enhancing pCR prediction in ESCC.

Although previous studies have provided valuable insights, many rely on single-slice 2D radiomics features or incorporate post-treatment imaging data, limiting both the comprehensiveness and clinical utility of their models in the pre-treatment setting. Single-slice 2D radiomics overlooks the multidimensional nature of tumor morphology, while post-treatment-based models fail to support pre-therapeutic decision-making. In contrast, our study offers a distinct contribution by focusing exclusively on pretreatment CT images while still achieving robust predictive performance. We utilized multichannel 2D radiomics features to overcome the limitations of single-slice approaches, and, considering the heterogeneity of esophageal tumors and segmentation challenges, we divided the tumor into imaging-defined subregions. Radiomic features were extracted from each subregion and evaluated for their predictive power. The most informative features from the top-performing subregion were then fused with either multichannel 2D or 3D deep learning radiomics features to build two distinct fusion models. Among them, the DLRad1 model, which integrates subregional radiomics and 2D deep learning features, demonstrated strong predictive performance across training, internal validation, and external validation cohorts, comparable to or exceeding the performance of previously reported models including those based on post-treatment or delta-radiomics strategies. Importantly, our approach enriches the feature space and captures intratumoral heterogeneity without requiring additional imaging time points.

Recent studies have compared the performance of single-slice 2D and whole-volume 3D radiomic models in predicting pCR following neoadjuvant immunotherapy-based strategies in ESCC. Shi et al. found that a 2D model using XGBoost achieved superior predictive performance (AUC = 0.89 in the training and 0.80 in the test cohort) compared to 3D and clinical models25. Similarly, Zhu et al. reported that a ComBat-harmonized 2D model outperformed its 3D counterparts, with AUCs of 0.818 and 0.787 in the training and validation cohorts, respectively, and this superiority of 2D over 3D models was statistically significant across all cohorts (P < 0.001)26. These findings underscore the practical advantages of 2D radiomics, including reduced annotation workload, improved reproducibility, and lower sensitivity to inter-slice variability. While single-slice 2D radiomics is computationally efficient, it may overlook critical spatial information necessary for fully characterizing tumor heterogeneity. In contrast, 3D radiomics accounts for full volumetric data but often introduces greater noise and requires more computational resources. Multichannel 2D radiomics offers a balanced alternative, improving spatial characterization by analyzing multiple slices while maintaining computational efficiency. This method typically includes the largest cross-sectional slice and its adjacent multi-layer slices, effectively capturing key tumor characteristics such as size, shape, and intensity. By integrating information from multiple slices, multichannel 2D radiomics approximates overall 3D imaging features to some extent, offering a richer spatial context without the computational complexity of 3D analysis. In our study, the multichannel 2D radiomics model outperformed the 3D radiomics model, consistent with previous studies that also found 2D radiomic features to be more effective than 3D features in certain predictive tasks for ESCC25,26. 
These results suggest that while 3D modeling offers theoretical comprehensiveness, 2D or multichannel 2D radiomics may capture essential tumor features more effectively in practice, especially when coupled with appropriate feature extraction and learning frameworks.

Habitat analysis, an advanced imaging strategy, enables the identification of spatially distinct tumor regions by capturing subtle differences in radiomic phenotypes, thereby illustrating the intratumoral heterogeneity that underlies treatment resistance and response variability30. Gatenby et al. conceptualized tumors as a mosaic of evolutionary habitats rather than a single self-organizing entity, with each sub-region shaped by unique microenvironmental pressures and selective forces31. Building upon this theory, Xie et al. clustered esophageal tumors into four spatial subregions and demonstrated that sub-regional radiomics could effectively predict survival outcomes following chemoradiotherapy32. Similarly, a separate study confirmed that radiomics features exhibit significant heterogeneity across different tumor habitats, reinforcing the biological relevance of spatial dissection33. Consistent with these prior observations, our subregional analysis revealed that features extracted from sub-region 1 yielded the highest predictive performance for identifying patients likely to achieve pCR to NACI. We further explored the biological interpretability of these computed CT-defined habitats. Despite the limited sample size, the distinct radiomic patterns observed across subregions were consistent with tumor heterogeneity. These findings suggest that CT-derived habitats may reflect underlying pathologic tissue compartments and could serve as noninvasive surrogates of tumor biology and aggressiveness. Our results underscore the potential of habitat-guided radiomics as a clinically valuable approach for stratifying patients based on their biological responsiveness to therapy.

This study proposed a non-invasive preoperative approach for predicting pCR in LA-ESCC patients undergoing NACI, thereby supporting informed treatment decisions and personalized therapeutic planning. In the external validation cohort, the DLRad1 model achieved a sensitivity of 83.3% and specificity of 79.3% at the optimal threshold, correctly identifying 10 of 12 pCR cases and 23 of 29 non-pCR cases. This corresponded to an under-treatment rate of 16.7% and an apparent overtreatment rate of 20.7%. While the under-treatment rate reflects a potential risk of withholding de-escalation in true pCR responders, the interpretation of “overtreatment” requires caution. In clinical practice, patients who do not achieve pCR may still experience meaningful therapeutic benefit from NACI, such as tumor downstaging or partial response, which are associated with improved R0 resection rates and long-term survival. Thus, failure to achieve pCR does not necessarily equate to overtreatment. As NACI continues to emerge as a promising standard of care for LA-ESCC, the proposed model is intended to complement, rather than replace, guideline-based recommendations by enhancing risk stratification and supporting individualized decision-making. Active surveillance has been proposed as a feasible strategy to improve survival in patients who achieve pCR following NACI. However, for those who are unresponsive to NACI or exhibit insufficient tumor regression, additional therapeutic interventions, such as radiotherapy, may be required to achieve adequate local control and optimize clinical outcomes7,34. Given the high cost of NACI, radical chemoradiotherapy represents a viable alternative for these patients. This approach may help reduce financial burden and minimize treatment-related toxicities, particularly in resource-limited settings.
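The rates quoted above for the external validation cohort follow directly from the reported counts (10 of 12 pCR cases and 23 of 29 non-pCR cases correctly identified). The short calculation below reproduces them; the variable names are illustrative.

```python
# Counts reported for the external validation cohort.
tp, fn = 10, 2   # pCR cases: correctly identified / missed (10 of 12)
tn, fp = 23, 6   # non-pCR cases: correctly identified / misclassified (23 of 29)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
missed_pcr_rate = fn / (tp + fn)       # reported as the under-treatment rate
false_positive_rate = fp / (tn + fp)   # reported as the apparent overtreatment rate

summary = {
    "sensitivity_pct": round(100 * sensitivity, 1),
    "specificity_pct": round(100 * specificity, 1),
    "missed_pcr_pct": round(100 * missed_pcr_rate, 1),
    "false_positive_pct": round(100 * false_positive_rate, 1),
}
```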

Nonetheless, this study has several limitations. First, it is a retrospective analysis with relatively small validation cohorts, which may affect the generalizability of the findings. Prospective, multicenter studies with larger and more diverse patient populations are warranted to confirm the results and further validate the proposed radiomics model. Second, most patients enrolled were male, potentially limiting the applicability of the model across genders. Third, the limited spatial resolution of CT imaging complicates the precise delineation of tumor boundaries from adjacent normal esophageal tissue during region of interest (ROI) segmentation, which may introduce bias. Fourth, due to the retrospective and multicenter nature of this study, PD-L1 expression data were not consistently available and therefore could not be incorporated into the modeling process. Lastly, because of the limited follow-up duration in a significant portion of patients, survival outcomes such as disease-free survival or progression-free survival could not be reliably assessed.

Although the proposed DLRad1 model demonstrated robust performance across training, internal, and external validation cohorts, further refinement is needed to enable clinical translation. One promising direction is the application of delta radiomics, in which features extracted from multi-timepoint CT scans (pre-, mid-, and post-treatment) can dynamically reflect treatment-induced changes and improve predictive accuracy. Additionally, integrating radiomic features with clinical and hematological biomarkers, such as neutrophil-to-lymphocyte ratio, platelet-to-lymphocyte ratio, and lactate dehydrogenase, may provide complementary biological information and enhance model robustness. Future research may also explore diverse machine learning algorithms (LightGBM, support vector machines, ensemble models), as well as hybrid architectures combining handcrafted and deep learning features. Improving probability calibration and incorporating uncertainty estimation may further support clinical decision-making. To improve generalizability, large-scale, prospective multicenter studies are essential. The application of harmonization techniques (ComBat or domain adaptation) could reduce variability introduced by different imaging protocols. Furthermore, integrating radiomic features with tumor immune phenotypes or multi-omics data (genomic, transcriptomic) may enhance the biological interpretability and immunotherapy relevance of the model. Ultimately, the development of interpretable, clinically applicable decision support tools is critical for promoting individualized treatment planning and broader clinical adoption.

In conclusion, based on pretreatment CT images, the proposed multidomain fusion model, DLRad1, effectively predicts pCR in patients with LA-ESCC. While additional studies are needed to validate its clinical applicability, our findings provide a foundation for informed treatment decisions and the development of personalized therapeutic strategies for LA-ESCC.

Methods

Study participants

A retrospective analysis was conducted on 271 patients with LA-ESCC who received neoadjuvant chemotherapy and immune checkpoint inhibitor treatment across three hospitals in China (Fig. 3). Among them, 230 patients were consecutively treated at Harbin Medical University Cancer Hospital between December 1, 2019, and October 31, 2024. These patients were randomly divided into a training cohort (n = 161) and an internal validation cohort (n = 69) at a 7:3 ratio. An external validation cohort comprised 41 patients treated consecutively at the Affiliated Suzhou Hospital of Nanjing Medical University and the Affiliated Hospital of Jiangnan University between September 1, 2022, and August 31, 2024. Inclusion criteria: (1) histopathologically confirmed ESCC treated with neoadjuvant chemotherapy and immune checkpoint inhibitors; (2) availability of contrast-enhanced CT images acquired within one week prior to the initiation of neoadjuvant treatment and before any antitumor therapy, representing pretreatment baseline imaging; (3) diagnosis of locally advanced ESCC (cT1b-2 N+ M0 or cT3-4a any N M0) according to the American Joint Committee on Cancer TNM Staging Manual (8th Edition); and (4) completion of radical surgery after neoadjuvant treatment with comprehensive postoperative pathological data. Exclusion criteria: (1) incomplete clinical records or missing CT descriptor information; (2) CT images with significant artifacts; (3) any prior anticancer therapy administered before the baseline CT imaging; and (4) concurrent diagnosis of synchronous multiple primary ESCC or other primary malignant tumors. The overall workflow of the study is illustrated in Fig. 4.
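As an illustration of the 7:3 random division, the sketch below shuffles 230 placeholder patient identifiers with a fixed seed and assigns the first 70% to training. Whether the original split was stratified by pCR status, and which random seed was used, are not reported, so the seed, function name, and identifiers here are hypothetical.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=0):
    """Shuffle identifiers reproducibly and split them by the given fraction."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumed seed)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Placeholder identifiers 0..229 stand in for the 230 patients from Center 1.
train_ids, validation_ids = split_train_validation(range(230))
```

With 230 patients, a 0.7 fraction yields cohort sizes of 161 and 69, matching the sizes reported above.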

Fig. 3: Flowchart diagram shows the patient selection process from three medical centers.

Note: ESCC esophageal squamous cell carcinoma, AJCC American Joint Committee on Cancer, CT computed tomography, Center 1, Harbin Medical University Cancer Hospital; Center 2, the Affiliated Suzhou Hospital of Nanjing Medical University; Center 3, the Affiliated Hospital of Jiangnan University.

Fig. 4: Workflow diagram for the construction of the predictive models.

Tumor segmentation and region of interest (ROI) delineation are performed by experienced radiologists. The K-means algorithm is employed to cluster the tumor into sub-regions. A sub-regional radiomics model is developed using PyRadiomics. For the multichannel two-dimensional deep learning (2D DL) model, a pre-trained ImageNet ResNet101 network is fine-tuned with the training data. The input to ResNet101 consists of the tumor’s maximal ROI cross-section along with five adjacent computed tomography (CT) slices above and below it. For the three-dimensional deep learning (3D DL) model, a pre-trained 3D ResNet101 backbone is utilized and fine-tuned with the training data to optimize the model parameters. In the fusion model, features extracted from both the sub-regional radiomics model and either the multichannel 2D DL model or the 3D DL model are combined.

Ethics statement

The study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the respective ethics committees of Harbin Medical University Cancer Hospital, the Affiliated Suzhou Hospital of Nanjing Medical University, and the Affiliated Hospital of Jiangnan University. The requirement for informed consent was waived by the institutional review boards due to the retrospective nature of the study and the use of de-identified data.

NACI regimen and clinicopathological data collection

All patients received 2–4 cycles of NACI, followed by radical esophagectomy with lymphadenectomy upon completion of the neoadjuvant therapy. Tumor response was assessed after the second cycle of treatment and again prior to surgery, according to the Response Evaluation Criteria in Solid Tumors (RECIST) version 1.1. Further details on the treatment regimen can be found in a previous study3. Baseline clinical characteristics were documented before the initiation of treatment.

CT examination and image preprocessing

Preoperative CT scans were performed for all patients, with imaging parameters standardized to a window width of 300 HU and a window level of 40 HU to enhance visualization of esophageal tissues. Details of the CT scan parameters are summarized in Table S6. Images were resampled to a 1 × 1 × 1 mm voxel size using bi-cubic spline interpolation. Two experienced radiologists independently and blindly delineated tumor regions using ITK-Snap software (version 3.6.0). ROIs were reassessed 2 months later to evaluate the consistency of annotations. ICCs were calculated to ensure reliability.
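A window width of 300 and a window level of 40 correspond to an intensity range of 40 ± 150, that is, -110 to 190 Hounsfield units. The sketch below shows one common way to apply such a window, by clipping intensities to this range and rescaling them to [0, 1]. Whether the study clipped and rescaled voxel values in this way, or used the window for display only, is not stated, so this is an assumption, and the function name is hypothetical.

```python
def apply_window(hu_values, level=40.0, width=300.0):
    """Clip intensities to the window and rescale them linearly to [0, 1]."""
    lower = level - width / 2.0   # -110 HU for level 40, width 300
    upper = level + width / 2.0   # 190 HU for level 40, width 300
    return [(min(max(v, lower), upper) - lower) / width for v in hu_values]


# Air, lower bound, window center, upper bound, and dense material.
windowed = apply_window([-1000, -110, 40, 190, 500])
```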

Sub-region clustering and radiomics feature extraction

This study excluded air regions within the ROI using a zero Hounsfield unit threshold. Local features for each voxel were calculated using a moving window of size 3 × 3 × 3, generating 19 feature vectors per voxel. The features were classified into five categories: first-order features, Gray Level Co-occurrence Matrix (GLCM), Gray Level Run Length Matrix (GLRLM), Gray Level Size Zone Matrix (GLSZM), and Neighborhood Gray Tone Difference Matrix (NGTDM) features (Fig. 5). The first-order features, including entropy, mean absolute deviation, and median, describe the overall intensity distribution. GLCM features, including difference average, difference entropy, difference variance, joint energy, joint entropy, inverse variance, sum entropy, Imc1, and Imc2, capture pixel pair spatial relationships to elucidate texture patterns. GLRLM features, including run entropy, run variance, and long run emphasis, focus on the distribution of run lengths of pixels with similar intensities. The GLSZM features, Size Zone Non-Uniformity Normalized and Small Area High Gray Level Emphasis, were utilized to characterize the size and distribution of homogeneous intensity zones. Lastly, NGTDM features, such as contrast and strength, were evaluated to assess local intensity variations. The K-means algorithm was utilized for sub-region clustering, exploring cluster numbers from 2 to 9 based on these features. The optimal number of clusters was determined using the Calinski–Harabasz score.
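The cluster-number search can be sketched as follows: run K-means for each candidate number of clusters from 2 to 9 and keep the number with the highest Calinski-Harabasz score, defined as the between-cluster dispersion divided by (k - 1) over the within-cluster dispersion divided by (n - k). The dependency-free implementation below is a simplified illustration on two-dimensional toy points rather than the 19 voxel-wise features, and the initialization, seed, and iteration count are assumptions. Established libraries such as scikit-learn provide equivalent routines.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, seed=0, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = centroid(points)
    between = within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: three noisy blobs standing in for voxel feature vectors.
rng = random.Random(1)
toy_points = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
              for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]

# Score every candidate cluster number from 2 to 9 and keep the best one.
scores = {k: calinski_harabasz(toy_points, kmeans(toy_points, k)) for k in range(2, 10)}
best_k = max(scores, key=scores.get)
```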

Fig. 5: Radiomic feature maps and sub-region generation.

The 19 characteristics for sub-region clustering and the generated sub-regions in computed tomography images.

Radiomics features for each sub-region were extracted using PyRadiomics35. ROIs were normalized to 64 gray levels to account for variability between CT scanners. Radiomics features quantifying volume, shape, intensity, and texture were calculated for each sub-region, yielding 1834 features derived from eight filters: Laplacian of Gaussian, Wavelet, 3D Local Binary Patterns, Exponential, Square, SquareRoot, Logarithm, and Gradient. Comprehensive parameters for radiomics features can be found on the PyRadiomics website (https://pyradiomics.readthedocs.io/en/latest/).

Radiomics feature selection and model construction

Radiomics features were standardized using Z-scores, giving each feature zero mean and unit variance. A greedy recursive feature elimination approach was employed to remove redundant features, defined as those with a Spearman correlation coefficient of 0.9 or higher with another feature. Only features with strong stability (intraobserver and interobserver ICC ≥ 0.75) were retained.
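The standardization and redundancy filter can be sketched as follows, under several assumptions: the greedy rule keeps the first feature of each highly correlated group, the absolute value of the correlation is compared with the threshold, and ranks are computed without tie correction. The function names and the toy data are illustrative only.

```python
import numpy as np

def zscore(X):
    """Standardize each column to zero mean and unit variance (columns must not be constant)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def spearman_matrix(X):
    """Spearman correlation as the Pearson correlation of column-wise ranks (no tie handling)."""
    ranks = X.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def drop_correlated(X, threshold=0.9):
    """Greedily keep a feature only if |rho| < threshold against every feature kept so far."""
    rho = np.abs(spearman_matrix(X))
    keep = []
    for j in range(X.shape[1]):
        if all(rho[j, i] < threshold for i in keep):
            keep.append(j)
    return keep

# Toy data: feature 1 is a monotone transform of feature 0; feature 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
X = zscore(np.column_stack([a, a ** 3, c]))
kept = drop_correlated(X)
```

Because Spearman correlation depends only on ranks, the monotone transform in column 1 is flagged as redundant even though its linear correlation with column 0 is below 1.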

Feature selection was subsequently conducted using LASSO regression. Predictive models were developed using the XGBoost algorithm, optimized through five-fold cross-validation. Each model was trained on feature sets labeled for pCR prediction. Probabilities for pCR were then generated for patients in the validation cohorts.

Multichannel 2D DL model development and feature extraction

In the multichannel 2D DL framework, a bounding box was defined as the smallest rectangle enclosing the largest cross-section of the tumor. This bounding box was used to extract the primary tumor ROI along with five adjacent slices above and below the tumor, resulting in a dataset for each patient that included the main tumor ROI and up to 5 neighboring patches. Subsequently, the images were cropped and resized to a uniform dimension of 256 × 256 pixels through linear interpolation. Data augmentation methods, such as random horizontal and vertical flips and random cropping to 224 × 224 pixels, were employed to enlarge the dataset and improve model robustness. The final input image size was standardized to 224 × 224 pixels. A ResNet101 network pre-trained on the ImageNet dataset, which contains millions of labeled images across thousands of categories, was employed. Transfer learning was applied to the training set, leveraging the feature representations learned on ImageNet, an approach commonly used in medical imaging research21,36,37. A global fine-tuning approach was employed to adjust network parameters, customizing ResNet101 for pCR prediction.
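The patch extraction can be sketched as follows. The sketch assumes a volume and a binary tumor mask stored as (slices, rows, columns) arrays. The parameter n_each_side is hypothetical: the text does not fully specify how the adjacent slices are counted, so it is left as an explicit argument. Resizing and augmentation are omitted.

```python
import numpy as np

def largest_slice_bbox(mask):
    """Find the slice with the largest tumor area and its tight bounding box.

    Returns (slice index, (row_start, row_stop, col_start, col_stop)) with exclusive stops.
    """
    areas = mask.reshape(mask.shape[0], -1).sum(axis=1)
    z = int(areas.argmax())
    rows = np.where(mask[z].any(axis=1))[0]
    cols = np.where(mask[z].any(axis=0))[0]
    return z, (int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1)

def extract_patches(volume, mask, n_each_side):
    """Crop the same bounding box from the main slice and its neighbors within the volume."""
    z, (r0, r1, c0, c1) = largest_slice_bbox(mask)
    lo = max(z - n_each_side, 0)
    hi = min(z + n_each_side, volume.shape[0] - 1)
    return [volume[i, r0:r1, c0:c1] for i in range(lo, hi + 1)]

# Toy volume: 7 slices of 10 x 10 pixels, with the largest tumor section on slice 3.
volume = np.arange(7 * 10 * 10, dtype=np.float32).reshape(7, 10, 10)
mask = np.zeros((7, 10, 10), dtype=bool)
mask[2, 3:5, 4:6] = True
mask[3, 2:6, 3:8] = True
z, bbox = largest_slice_bbox(mask)
patches = extract_patches(volume, mask, n_each_side=2)
```

Slices near the top or bottom of the volume yield fewer neighbors, which is one way a patient can end up with fewer patches than the maximum.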

Multiple patches for each patient were consistently labeled to reflect the presence or absence of pCR. These labeled tumor images served as input to the 2D DCNN. The network parameters were iteratively updated using backpropagation, guided by the cross-entropy loss function during training. The model used the Adam optimizer with a learning rate of 1 × 10⁻⁴ and a batch size of 64, and incorporated L2 regularization and early stopping to prevent overfitting. The 2D DCNN model produced a pCR probability score for each CT patch. Patient-level probabilities were calculated by averaging the probabilities across all patches for a given patient. After training, the penultimate average pooling layer of ResNet101 was also used to extract 2048 DL features from each patch.
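The aggregation from patch-level to patient-level probabilities amounts to a grouped mean, as in the minimal sketch below. The patient identifiers and probability values are made up for illustration.

```python
from collections import defaultdict

def patient_level_probabilities(patient_ids, patch_probs):
    """Average patch-level pCR probabilities over all patches of the same patient."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, prob in zip(patient_ids, patch_probs):
        sums[pid] += prob
        counts[pid] += 1
    return {pid: sums[pid] / counts[pid] for pid in sums}

# Hypothetical patch predictions for two patients.
ids = ["P1", "P1", "P1", "P2", "P2"]
probs = [0.2, 0.4, 0.6, 0.9, 0.7]
patient_probs = patient_level_probabilities(ids, probs)
```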

3D DL model development and feature extraction

We employed a 3D ResNet architecture to address the limitations of 2D DCNNs in comprehensively capturing 3D anatomical structures. This model enhances traditional ResNet’s capabilities by preserving stereoscopic information and leveraging automatic parameter learning for contextual feature extraction21. The Med3D dataset, a comprehensive resource for multi-modal and multi-organ medical imaging, was employed due to the limited availability of pre-trained models for medical 3D imaging38. The 3D ResNet101 model, pre-trained on Med3D, was the backbone architecture. Decoder layers were replaced with fully connected layers to customize the network for the specific task of pCR prediction39. The 3D ResNet101 backbone was initialized with pre-trained weights and fine-tuned on our training dataset.

For 3D DL processing, the tumor ROI was enclosed within a bounding cube. The ROI cubes were linearly interpolated to a size of 96 × 96 × 96 voxels. Data augmentation methods, such as inversion across the X, Y, and Z axes, were used to enhance dataset diversity. The 3D DCNN was provided with the generated 3D images and their corresponding labels. The model was trained for 200 epochs using the Adam optimizer with a learning rate of 0.03. After training, the 3D DCNN predicted pCR for each patient. For each patient, 2048 DL features were extracted from the penultimate average pooling layer of the 3D ResNet101 model.
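The axis-wise augmentation can be illustrated with a minimal sketch, assuming that "inversion" means mirroring the cube along one axis at a time. Whether flips were applied randomly or combined is not specified, so the sketch simply returns one flipped copy per axis. A 3 × 3 × 3 toy cube stands in for the 96 × 96 × 96 input.

```python
import numpy as np

def flip_augmentations(cube):
    """Return three copies of the cube, each mirrored along one of its axes."""
    return [np.flip(cube, axis=a) for a in range(3)]

cube = np.arange(27, dtype=np.float32).reshape(3, 3, 3)
flipped = flip_augmentations(cube)
```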

Construction of the fusion models

Feature-level fusion was achieved by concatenating features from multiple modalities into a unified feature vector. Radiomics features of the primary tumor were extracted using PyRadiomics, while 2D and 3D DL features were derived from the respective DCNNs, as described earlier. Z-score normalization was applied to standardize all features. Feature relevance was assessed using Spearman correlation, ICCs, and LASSO analyses. The XGBoost algorithm was subsequently used to construct feature-based fusion models. Two fusion models were developed: DLRad1, which combined radiomics features based on sub-regions with multichannel 2D DL features, and DLRad2, which integrated radiomics features based on sub-regions with 3D DL features.

Clinical model development and model comparison

Univariate and multivariate logistic regression analyses were conducted to identify clinical and radiomics indicators independently associated with pCR. Based on the multivariate regression results, we constructed a Clinical model by including significant clinical predictors (P < 0.05) identified in the training cohort. Subsequently, we developed a combined multivariable model incorporating both the DLRad1 model and the selected clinical variables to evaluate the added value of radiomics beyond clinical factors. To compare the discriminative performance of the Clinical model, DLRad1 model, and Combined model (Clinical + DLRad1), we calculated the area under the ROC curve (AUC) for each. Furthermore, we assessed the incremental predictive value of the radiomics-based model using the IDI and NRI metrics.
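The two incremental-value metrics can be sketched as follows. The IDI is the change in mean predicted probability among patients with pCR minus the corresponding change among patients without pCR. For the NRI, the category-free (continuous) form is assumed here, since the variant used is not stated; it counts upward and downward movements in predicted probability. The toy values are illustrative only.

```python
import numpy as np

def idi(y, p_old, p_new):
    """Integrated discrimination improvement of the new model over the old one."""
    y = np.asarray(y, dtype=bool)
    diff = np.asarray(p_new, dtype=float) - np.asarray(p_old, dtype=float)
    return diff[y].mean() - diff[~y].mean()

def continuous_nri(y, p_old, p_new):
    """Category-free NRI: net upward movement in events plus net downward movement in non-events."""
    y = np.asarray(y, dtype=bool)
    diff = np.asarray(p_new, dtype=float) - np.asarray(p_old, dtype=float)
    up, down = diff > 0, diff < 0
    nri_events = up[y].mean() - down[y].mean()
    nri_nonevents = down[~y].mean() - up[~y].mean()
    return nri_events + nri_nonevents

# Toy example: 1 = pCR, 0 = non-pCR.
y = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.7, 0.6, 0.4, 0.6]
idi_value = idi(y, p_old, p_new)
nri_value = continuous_nri(y, p_old, p_new)
```

In this toy case both patients with pCR move upward, while the two patients without pCR move in opposite directions and cancel out, so the entire improvement comes from the event group.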

Model evaluation

The models’ predictive performance was evaluated using metrics such as ROC curves, AUC, sensitivity, specificity, and overall accuracy. The DeLong test was utilized to assess differences in AUC values between models. DCA was used to evaluate the clinical applicability of the models. Calibration curves were plotted to evaluate the concordance between predicted and observed outcomes. Model interpretability was assessed using the SHAP method to clarify the impact of individual features on predictions40.
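The discrimination metrics can be illustrated with a minimal sketch. The AUC is computed through its rank interpretation, the probability that a randomly chosen patient with pCR receives a higher score than a randomly chosen patient without pCR, with ties counted as one half. The 0.5 cut-off for sensitivity, specificity, and accuracy is an assumption, as the operating threshold is not stated here.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def threshold_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a fixed probability cut-off."""
    y = np.asarray(y_true, dtype=bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(pred & y)
    tn = np.sum(~pred & ~y)
    fp = np.sum(pred & ~y)
    fn = np.sum(~pred & y)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y),
    }

# Toy example: 1 = pCR, 0 = non-pCR.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
metrics = threshold_metrics(y_true, scores)
```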

Statistical analysis

Categorical variables were assessed with the Chi-square or Fisher’s exact tests, and continuous variables were analyzed using the Mann–Whitney U test or independent t-test. A two-sided P-value of less than 0.05 was considered statistically significant. Statistical analyses were conducted using R (v4.2.1) and Python (v3.7.0). Table S7 provides a list of the packages utilized in this study.