Introduction

Anatomical pneumonectomy is the primary treatment for early to mid-stage non-small cell lung cancer (NSCLC)1. Despite progress in endoscopic techniques and minimally invasive surgical treatments, recurrence rates remain elevated2, with approximately 15–40% of NSCLC patients experiencing bone metastasis (BM) during treatment3,4. The high risk of BM, which can induce skeletal-related events (SREs) such as pain, pathologic fracture, hypercalcemia, and spinal cord compression, has emerged as an important prognostic factor for NSCLC patients5,6,7,8.

For patients with suspected BM, emission computed tomography (ECT) or positron emission computed tomography (PET-CT) are routinely recommended9,10,11. However, the latest version of the National Comprehensive Cancer Network (NCCN) guidelines state that follow-up of stage I––IIIA asymptomatic NSCLC patients after localized treatment, routinely without ECT or whole-body PET-CT, is not timely in assessing the actual bone health status of patients with insidious BM. This limitation emphasizes the need for risk stratification of BM before performing postoperative follow-up10. Therefore, early screening of NSCLC with BM potential is important for guiding postoperative follow-up strategies, improving survival, and enhancing quality of life.

Deep learning (DL) techniques learn representative information from raw image automatically, and extract features in an incremental manner without human intervention. These characteristics enable decoding of relevant radiological tumor phenotypes, and provide additional information for planning subsequent treatment strategies12,13,14,15,16. Encouragingly, through the development of appropriate models with fine-grained features, DL has demonstrated promising potential across different applications to NSCLC such as lesion detection, lymph node metastasis prediction, and treatment response evaluation17,18,19,20. However, the training datasets used in previous studies did not consider important prognostic confounders, such as clinical T-stage, carcinoembryonic antigen (CEA) status and postoperative treatment21,22. As a consequence, the utility of the prognostic information extracted from CT image-based primary tumors remains uncertain. DL results must be integrated with known clinical prognostic factors to identify independent risk factors.

To address the above limitations, we established a DL-based multiparametric CT signature for predicting bone metastasis-free survival (BMFS) in NSCLC patients receiving surgery. For improved prediction, we also developed a deep learning clinical-imaging signature (DLCS) that combines imaging features with clinical variables, and we explored the utility of DLCS for assessing risk stratification of BM in NSCLC patients at different clinical stages.

Results

Baseline characteristics

Baseline patient characteristics are summarized in Table 1. Of 2832 resected NSCLC patients, we excluded 412 patients with distant metastasis at initial diagnosis, 341 patients with neoadjuvant radiochemotherapy, 354 patients without survival data, 52 patients with concurrent or heterochronous neoplasms, and 26 patients with lesions that were not measurable on CT images. We included 1547 NSCLC patients with clinical stage I–IIIA in this retrospective study. Sample sizes were 798, 470, and 279 for training, internal validation and external validation cohorts, respectively. The median age was 60 years (interquartile range (IQR): 54–67 years) for the training cohort, 61 years (IQR: 55–67 years) for the internal validation cohort, and 62 years (IQR: 55–68 years) for the external validation cohort. Of the 1547 patients, 296 (19%) experienced BM during an average follow-up period of 32 months (IQR: 25–46 months). Median BMFS times for the three different cohorts were 13, 16, and 12 months, respectively.

Table 1 Patient characteristics

DL signature predicts BM in the validation cohorts

The training process of the deep CNN model based on multiparametric CT is shown in Fig. 1c, d, which shows that loss of the training cohort decreases and gradually converges with an increasing number of iterations. The DL-prob was generated by combining non-enhanced and enhanced CT images using the end-to-end DL method. The trained DL-prob successfully determines whether BM occurred within 3 years of surgery in NSCLC patients (Supplementary Fig. 1). DL-prob correlated closely with NSCLC patient prognosis, which demonstrates the feasibility of constructing reliable prognostic markers. Supplementary Fig. 2 shows the distribution of DL-prob predictions.

Fig. 1: Flowchart of the study.
figure 1

Flowchart for developing the DL signature and DLCS for BM and prognosis prediction in NSCLC patients (n = 1547).

The DL signature developed using DL-prob demonstrated good prognostic predictive performance. When applied to the validation cohorts, the DL signature produced iAUC values of 0.825 and 0.848 for 3-year BMFS (Supplementary Fig. 3), and C-index values of 0.799 and 0.818 for BMFS (Table 2). The DL signature was found to provide valuable information for BM prediction, as shown by the attention maps (Supplementary Fig. 4). This analysis reveals areas associated with a high risk of suffering from BM through visualization of pixel weight distributions using different colors.

Table 2 The performance comparison of different models in validation cohorts

DLCS construction and prognostic performance evaluation

We performed univariate and multivariable Cox analyses to determine the association between important clinical-imaging variables and BMFS. These analyses revealed that Clinical T category, Clinical N category, CEA, postoperative treatment, pleural attachment, and spiculation were prominently associated with BMFS, resulting in the development of the clinical model (Table 3). We combined the DL signature with six important clinical-imaging variables to establish the DLCS. SHAP analysis elucidated that each variable effectively contributed to the model’s decision-making process (Supplementary Fig. 5).

Table 3 Univariable and multivariable analyses of bone metastasis-free survival in patients with resected non-small cell lung cancer

Compared with other competing models, DLCS exhibited significantly better performance (Table 2). DLCS produced the highest C-index for both internal and external validation cohorts with values of 0.806 (95% CI: 0.766–0.846) and 0.834 (95% CI: 0.790–0.878), respectively. When using internal and external validation cohorts, the risk scores generated by DLCS accurately predicted BMFS at 1, 2, and 3 years (Fig. 2). Based on DLCS outputs, patients were divided into high- and low-risk groups for BM. The optimal critical value of the DLCS for the training cohort was determined to be 1.39. Figure 3 illustrates the compositional feature distribution based on high and low DLCS scores. Calibration curves for three models showed good agreement between predictions and observations across validation cohorts (Fig. 4a, b). Using net benefits as the evaluation indicator for the decision curve, DLCS presents excellent benefits within the relevant thresholds (Fig. 4c, d).

Fig. 2: Predictive efficacy of DLCS on BMFS.
figure 2

(a) and (b) represent time-dependent ROC curves for the internal (n = 470) and the external (n = 279) validation cohorts.

Fig. 3: DLCS risk score for BMFS.
figure 3

(a) and (b) are risk stratification and distribution of clinical-imaging characteristics of high- and low-risk patients with BM among internal and external validation cohorts, respectively.

Fig. 4: Evaluation of DLCS in multicentre cohorts.
figure 4

Figure (a, b) and (c, d) shows the calibration and decision curves for the three models in the internal (n = 470) and external (n = 279) validation cohorts, respectively.

BM risk stratification based on DLCS

We performed subgroup BM risk analysis based on DLCS by classifying patients into high- and low-risk groups. Figure 5a–d demonstrates that BMFS values of patients at stage I and II–IIIA within both internal and external validation cohorts were significantly stratified by risk score (p < 0.05, log-rank test).

Fig. 5: DLCS-based KM curves of patients at stage I and II-IIIA.
figure 5

Figure (a, b) and (c, d) shows the KM BMFS curves for the subgroups of stage I and stage II-IIIA patients in the internal and external validation cohorts, respectively. At the bottom of the graph is the distribution of patients at different risk for each time point.

Discussion

Accurate prognostic prediction is useful for selecting therapeutic strategies and stratifying patient risk. In this study, we established and validated a DL-based multiparametric CT signature for predicting BMFS in NSCLC patients undergoing surgical treatment. The DL signature proved to be reliable for BM prediction and risk stratification, demonstrating its value in prognostic prediction. The DLCS with multidimensional information integrating the DL signature and BM-associated clinical imaging features performed better in predicting BMFS.

Patients with NSCLC accompanied by BM usually present an unfavorable prognosis, significantly reducing their quality of life. Follow-up strategies in NSCLC patients are routinely based on clinical practice guidelines, including guidelines from the NCCN10 and the European Society for Medical Oncology (ESMO)23. For patients with surgically available clinical stage I–IIIA NSCLC, the same follow-up management protocol was recommended. If all patients were to undergo skeletal examination only after SREs appeared, BMFS would not be guaranteed in patients with unfavorable prognosis, while disease progression in patients with asymptomatic BM would be overlooked. In this study, the established DLCS can be used as an effective prognostic tool for classifying patients as high and low risk for BM regardless of disease stage, which may contribute to more useful personalized follow-up strategies. The percentages of patients within the stage I cohort considered at high risk for BM according to the DLCS were 16.1% (79/490, training cohort), 17.5% (50/285, internal validation cohort), and 13.0% (29/146, external validation cohort). In the stage II–IIIA cohort, the DLCS classified 28.9% (89/308) of patients as high-risk for BM in the training cohort, 28.6% (53/185) in the internal validation cohort, and 23.3% (31/133) in the external validation cohort. The above results indicate that tumor load and individualized BM risk may vary significantly among patients, even when corresponding to the same clinical stage. Postoperative skeletal examination should not be omitted in high-risk patients, as early detection of BM lesions is crucial for aggressive salvage therapy. In future clinical applications, if the DLCS defines patients at clinical stages I–IIIA as being at high risk of BM, bone-related screening should be performed carefully in the follow-up strategy, rather than relying on patient requests for examination following manifestation of symptoms.

The “watch-and-wait” strategy is currently safe and effective in postoperative NSCLC patients23. Because advances in therapeutic techniques have improved the overall survival of patients with lung cancer, the importance of bone invasion has become increasingly significant. Proactive prevention of BM in high-risk individuals is associated with significantly better outcomes than reactive treatment approaches24. This observation emphasizes the importance of early identification of those at risk, so that appropriate interventions can be implemented promptly. Bone-targeting agents that complement cancer-specific therapies, such as denosumab and bisphosphonates, have been proven to reduce the risk of SREs by improving bone structure and quality11,25. The risk scores calculated by our DLCS model can help identify high-risk patients who may benefit from more aggressive treatment interventions, even at an advanced clinical stage. This valuable contribution can be used to develop strategies for improving the decision-making process associated with the watch-and-wait approach, by providing objective data-driven insights into individual patient risks.

This study extracted information from multiparametric CT scans using a DL method. Findings from previous studies have demonstrated that machine-learning techniques can extract useful characteristics from CT scans to forecast distant metastasis in NSCLC patients. These techniques can be used to investigate the correlation between clinical-semantic features and distant metastases, aiding in assessing prognostic outcomes26,27,28. Radiomic techniques successfully capture peri-tumoral and intra-tumoral features associated with image voxels, revealing tumor heterogeneity phenotypes and providing prognostic information. These methods present some limitations as they rely on predetermined representations set by domain experts, potentially leading to the omission of image features that are relevant to a particular task. The process of cancer distant metastasis follows a recognizable pattern and occurs selectively rather than randomly29,30. Metastastic processes selective for specific organs are collectively known as “organotropic metastasis” in the literature31,32. Neoplasms are genetically and histopathologically heterogeneous, manifesting as spatially variable within the tumor. Tumors with a high degree of heterogeneity display different metastatic propensities as a consequence of their intrinsically aggressive biology. Based on the above considerations, we proposed the end-to-end DL-prob for BM assessment in completely surgically resected NSCLC patients. This approach integrates non-enhanced and enhanced CT images to extract more comprehensive tumor information from multicentric datasets for classification purposes. More specifically, we employed an intermediate fusion strategy to select interpretable enhancement methods for each modality, and to address feature imbalance, lost modalities, and coordinated representation learning while capturing the underlying biological relationships between modalities33. Different network types are applied to each modality to narrow the heterogeneity gap and enable effective imaging fusion34. We then established a DL signature based on the time to postoperative BM, which proved to be highly correlated with prognosis. The DL signature carries great methodological potential for capturing multi-parameter imaging information and for making prognostic predictions, as it captures relevant information specific to the task. In order to determine whether the model may produce a positive or a negative impact on the field of image-based CNN, we generated activation maps to capture high-weighted image regions associated with network prediction. These maps make it possible to visualize activation regions with high-risk trends. We hypothesized that the areas highlighted by the activation maps may be associated with tumor progression, thus providing additional confidence in the predictive ability of the DL signature. This approach enables objective identification of high-risk areas for BM in tumor, which is invaluable for clinical risk assessment and precise clinical intensification of treatment regimens.

There is consensus that prognosis in cancer patient is closely connected with certain clinical features. TN staging and postoperative adjuvant therapy correlate with prognosis in NCSLC patients, in agreement with our findings35,36. The cell adhesion-associated glycoprotein CEA is widely regarded as a marker for tumor invasiveness, with increased CEA levels potentially promoting tumor cell infiltration and metastasis. Multifactorial Cox regression analysis showed that serum CEA levels (>5 µg/L) were directly related to the occurrence of BM. We also considered potential CT semantic features. Previous research indicates that pleural attachment and burr signs serve as imaging markers for tumor aggressiveness37,38. Within clinical models, pleural attachment and burr signs are independent risk predictors of BM, presumably owing to interstitial changes in tumor tissue growth and exudation along vascular, bronchial, or lobular septa raising the risk of BM occurrence. Several studies have shown that the integration of clinical-imaging information into machine learning models can enhance their prognostic prediction39,40. We confirmed these prior findings by combining several prognostically relevant clinical imaging factors with multiparametric deep CT features to develop a multidimensional prognostic model with superior performance.

Our study presents several limitations. First, we focused on the entire population of NSCLC patients. Different pathological types of NSCLC present various radiological characteristics and tumor burden, which may lead to heterogeneity in BM risk and prognosis. As a result, the validity of our model for BM assessment in different histological subtypes needs to be further confirmed by future studies using larger cohorts. Second, the DL signature is not fully automated, requiring manual marking of the tumor on CT images. Third, we collected external validation from healthcare organizations with a potentially similar distribution of patient characteristics, making it necessary to perform further international validation on a larger scale before applying the model to other races. Finally, the main limitation of DL for medical image analysis is its lack of interpretability that inevitably arises from its black box nature. Although we are able to visualize DL features extracted from tumors, their explicit biological meaning remains unknown and will require further clarification. Future studies will need to unravel the opacity of DL networks.

In conclusion, we proposed a DL-based multiparametric CT model to predict BM in surgically resectable NSCLC patients, and we constructed a DLCS by combining deep multiparametric CT information with clinical imaging factors to further improve their clinical application. Both DL signature and DLCS showed good performance in predicting BM. The output generated by DLCS can serve as a useful tool for BM risk stratification of different clinical stages of NSCLC, potentially assisting clinical decisions in precision medicine.

Methods

Ethics

This study complied with the Declaration of Helsinki and was approved by the Medical Ethics Review Board of Affiliated Hospital of Qingdao University. Because this study involved retrospective data analysed anonymously, written informed consent was waived. Figure 1 summarizes the design of our study.

Patients

We retrospectively collected CT images and clinical information from patients with clinical stage I-IIIA NSCLC who underwent complete surgical resection between January 2018 and December 2019. We used 798 patients from the Affiliated Hospital of Qingdao University to construct the training cohort, 470 patients from the Affiliated Hospital of Qingdao University to construct the internal validation cohort, and 279 patients from Qilu Hospital of Shandong University to construct the external validation cohort. We included patients with NSCLC receiving complete surgical resection. We adopted the following exclusion criteria: preoperative neoadjuvant radiochemotherapy; more than 2 weeks between preoperative CT examination and operation; initial diagnosis of distant metastasis; concurrent or heterochronous tumor; non-measurable lesions on CT images; absence of clinical or survival data.

Data collection

Initial clinicopathological information for each patient was extracted from the medical record system, including age, sex, history of smoking, histological subtype, CEA status, CYFRA21-1 and postoperative treatment. We determined clinical T-stage and N-stage in accordance with the American Joint Committee on Cancer (AJCC) 8th Edition Lung Cancer Staging System. Two radiologists with 6 and 3 years of clinical experience, who were blinded to clinicopathological information, independently evaluated the following CT semantic features: location, nodule type, margin, spiculation, lobulation, air bronchogram, and pleural attachment. In the event of disagreement, discussions were carried out to reach consensus. The endpoint event was the presence or absence of BM and documentation of BMFS, defining the interval between surgery and the date of first confirmation of BM (confirmed by imaging and histological evidence). Follow-up endpoints were obtained from outpatient medical records and telephone interviews.

CT imaging acquisition

The patients in the training and internal validation cohorts underwent preoperative chest CT scans using five different scanners from four vendors [Brilliance 128, Philips Healthcare, Andover, MA; Revolution 256, and Optima 670, GE Medical Systems, Milwaukee, WI; Definition 129, Siemens Healthcare, Erlangen, Germany; Aquilion One TSX301A, Toshiba Medical Systems, Otawara, Japan]. The CT images in the external validation cohort were taken with three scanners from two manufacturers [Optima 620 and Optima 660, GE Medical Systems, Milwaukee, WI; Perspective 64, Siemens Healthcare, Erlangen, Germany]. Heterogeneity in the imaging acquisition protocols was inevitable as data were obtained retrospectively with different scanners.

All patients received CT scans from the apex to the bottom of the lungs while suspended for maximum inhalation. Scans were performed at 120 kVp with mAs ranging at 20–200 mAs with or without automatic exposure control according to the capability of each scanner. CT scan reconstruction of the slice thickness is less than or equal to 5 mm. Slice increments were equal to or less than slice thickness. All CT scans included axial reconstruction. All patients received enhanced CT scans after contrast material injection with a delay of 60-70 s.

Data preprocessing

For both non-enhanced and enhanced CT images, we identified the region containing the largest cross-sectional area of the entire tumor, followed by image normalization and standardization. To focus on the tumor while exploring the surrounding tissue, we set the pixel values outside the region of interest (ROI) to 0 and used image erosion techniques to expand the ROI for cropping. This strategic approach allows us to preserve essential pixels around the tumor, ensuring comprehensive consideration of lesion information. The cropped image was converted to a greyscale single-channel image. Non-enhanced images were sharpened (sharpening intensity = 5) using a Laplace operator for better extraction of edge features. The risk probability of BM was generated by combining non-enhanced CT images, enhanced CT images, and sharpened images into three-channel images, which were used as input to the DL model. For image size uniformity, we resized all input images to 120 × 120 pixels using linear interpolation.

DL signature development

The DL signature was developed using a 3D convolutional neural network (CNN) operating on the training cohort. The network contains a total of four convolutional layers (kernel sizes of 3 × 3 × 12, 3 × 6 × 5, 6 × 16 × 5, 16 × 120 × 5). The first three convolutional layers are followed by the application of a BN layer, a LeakyRelu activation function, and a pooling layer. Two fully connected layers are also included, and the BN layer is inserted between the two fully connected layers. Different enrolment groups were formed depending on whether BM occurred during the 3 postoperative years, and the DL model in the input graph bounding box was trained to obtain the probability of each patient developing BM within 3 years of surgery. This probability denotes the metastatic probability and the non-metastatic probability (summing to 1), and is defined as the DL-prob. The DL-prob was then used to construct the DL signature using the Cox proportional hazard model.

Throughout the training process, the Adam algorithm was used to train the neural network at an initial learning rate of 1 e-5. We applied a binary cross-entropy function to stabilize loss at 0.1. Code implementation relied on PyTorch. The algorithm was trained on a computer with an Intel i7-13700HX 16-core CPU and an RTX 4080 12GB GPU. We performed combined prediction of multiple trained models using the method of ensemble learning with hard voting. To reduce generalization error and increase model diversity, we randomly selected 80% of the training cohort data each time when training a model. After training multiple models, calculating the average predicted class of these models helps prevent severe impact from individual model prediction errors on the entire ensemble model, thereby achieving stronger robustness. To visualize image regions that are important for prediction, we generated attention maps using gradient-weighted class activation mapping to detect identified and monitored lesion regions.

Individualized DLCS construction and model performance evaluation

Combining clinical-imaging features with multiparametric CT deep features can further enhance model performance. We applied Cox risk regression analysis to the training cohort to screen for independent predictors and construct the clinical model. We then constructed the DLCS by combining the DL signature with the clinical model, and converted the DLCS into an individualized BM prediction nomogram.

DLCS-based BM risk assessment in patients at different clinical stages

Different clinical stages call for different treatment strategies 10. NSCLC patients were categorized into two subgroups according to AJCC staging: stage I and stage II–IIIA. We investigated whether DLCS could stratify patients associated with different BM risk levels in these cohorts using subgroup analysis. Specifically, we used the DLCS threshold of the Youden index to stratify patients into BM high-risk and low-risk. We then assessed postoperative BMFS for patients in different risk groups using Kaplan-Meier (KM) survival analysis.

Statistical analysis

All statistical analyses were implemented using R software (version 4.0) and SPSS (IBM, version 22.0). We used Student’s t test, ANOVA, or Kruskal-Wallis tests to analyze continuous variables. We used chi-square tests or Fisher’s exact tests to compare categorical variables. We implemented receiver operating characteristic (ROC) analysis for the BM classification model using continuous probability scores ranging between 0 and 1. We calculated the integrated area under the time-varying ROC curve (iAUC) for survival models, and evaluated discriminatory performance using Harrell’s concordance index (C-index). We evaluated the clinical practicality of the BM prediction model using decision curve analysis, which quantifies the net benefit at different threshold probabilities. We relied on calibration curves to evaluate consistency of the predicted probabilities with actual observations. We used Cox regression to assess prognostic elements and analyze the multivariable-adjusted hazard ratio. Additionally, we performed KM survival curve analyses based on BMFS to validate the discrimination ability of the model using the log-rank test. We defined statistical significance as p < 0.05.