Abstract
With the rapid progress in artificial intelligence (AI) and digital pathology, prognosis prediction for non-small cell lung cancer (NSCLC) patients has become a critical component of personalized medicine. In this study, we developed a multimodal AI model that integrated whole-slide images and dense clinical data to predict disease-free survival (DFS) and overall survival (OS) with high accuracy for NSCLC patients undergoing surgery. Utilizing data from 618 patients at Beijing Chest Hospital, the model achieved areas under the curve (AUC) of 0.8084 for predicting progression and 0.8021 for predicting death in the test set. Importantly, the model attained balanced accuracies of 0.7047 for predicting progression and 0.6884 for predicting death. By categorizing patients into high-risk and low-risk groups, the model identified significant differences in survival outcomes, with hazard ratios of 4.85 for progression and 4.57 for death, both with p values below 0.0001. Additionally, it uncovered novel digital biomarkers associated with poor prognosis, offering further insights into NSCLC treatment. This model has the potential to revolutionize postoperative decision-making by providing clinicians with a precise tool for predicting DFS and OS, thereby improving patient outcomes.
Introduction
Lung cancer is the leading cause of cancer-related death and the second most commonly diagnosed cancer, accounting for approximately one in five (18.0%) cancer deaths and one in ten (11.4%) cancer diagnoses1. Non-small cell lung cancer (NSCLC) represents 85% of all lung cancer cases2. In clinical practice, NSCLC treatment is primarily guided by TNM staging3. Early-stage NSCLC patients (stage 0, stage IA, stage IB and stage IIA without high-risk factors) generally do not require postoperative interventions, whereas patients with stage IB or IIA with high-risk factors, stage IIB, and stage III typically require postoperative treatment. However, patients within the same stage often exhibit different clinical outcomes, posing challenges in determining the need for postoperative interventions based solely on TNM staging.
In early-stage NSCLC patients, the risk of disease progression or cancer-related death is not entirely eliminated. For stage IA NSCLC patients undergoing surgery alone, the 5-year disease-free survival (DFS) rate is 84.5%, and the 5-year overall survival (OS) rate is 96.8%4. Conversely, within the group of NSCLC patients who require postoperative interventions but do not receive them, some remain free from disease progression or cancer-related death. For stage IIB/IIIA NSCLC patients who undergo surgery alone, the 3-year progression-free survival rate is 36.1%5, and the 5-year OS rate for stage IIIA patients is 26%6. Prognostic prediction is crucial for determining whether postoperative interventions are necessary. Accurate tools that predict DFS and OS for NSCLC patients are essential for personalized treatment and improved disease management.
Artificial intelligence (AI)-based pathology has advanced significantly in its application to NSCLC, particularly in areas such as pathological diagnosis7,8, molecular phenotype prediction9,10, gene mutation prediction7, and prognostic prediction11,12,13,14,15,16,17,18,19,20,21,22,23,24,25. Among these applications, prognostic prediction holds the greatest clinical importance for NSCLC patients. Previous studies on prognostic prediction have either excluded clinical data or incorporated only minimal clinical information. Although these studies successfully distinguished different prognostic groups, they did not show a strong correlation between predicted and actual survival outcomes. Additionally, they did not effectively predict DFS and OS, which are critical in NSCLC prognosis.
Several digital biomarkers have emerged from these studies. For example, the density of tumor-infiltrating lymphocytes (TILs) has been identified as a biomarker associated with worse prognosis14, and the growth pattern of adenocarcinoma has also been linked to prognosis15. Moreover, a recent study developed and validated four digital biomarkers based on tertiary lymphoid structures and necrosis19. However, the digital biomarkers associated with prognosis have not been fully elucidated.
In this work, we have developed a multimodal AI model for prognostic prediction in NSCLC patients undergoing surgery, referred to as AIM-LCpro. Our model uses a patient-level weakly supervised learning approach that integrates whole-slide images (WSIs) with dense clinical data (Fig. 1). It not only categorizes patients into high-risk and low-risk groups but also predicts DFS and OS for each patient. Through model visualizations, we have identified several novel digital biomarkers associated with poor prognosis in NSCLC patients. This model has the potential to guide decisions regarding the need for postoperative interventions and improve overall prognosis in NSCLC patients.
AIM-LCpro uses a patient-level weakly supervised learning approach to predict DFS and OS. WSIs and dense clinical data are integrated.
Results
Baseline characteristics of the study cohort
In the study cohort, 173 patients (27.99%) experienced disease progression within 5 years, and 121 patients (19.58%) died of NSCLC within the same period. Of the total cohort, 353 patients (57.12%) did not require postoperative interventions, while 265 patients (42.88%) did. The median follow-up period of the study cohort was 73 months (Supplementary Table 1). The baseline characteristics of the study cohort are presented in Table 1.
Performance of AIM-LCpro in predicting prognosis of NSCLC patients
In the training, validation, and test sets, the areas under the ROC curves (AUCs) for predicting progression within 5 years were 0.9925, 0.8801, and 0.8084, respectively (Fig. 2a–c). Similarly, the AUCs for predicting death within 5 years were 0.9826, 0.8477, and 0.8021, respectively (Fig. 2d–f). These results suggest that our model can distinguish patients who will experience progression or death within 5 years from those who will not.
ROC curves for predicting: a progression in 5 years in the training set; b progression in 5 years in the validation set; c progression in 5 years in the test set; d death in 5 years in the training set; e death in 5 years in the validation set; f death in 5 years in the test set.
We found that the performance of unimodal models was inferior to that of the multimodal model. When predicting progression and death in the test set, the AUCs of the unimodal model using clinical data were 0.7597 and 0.6733, respectively (Supplementary Fig. 1a, c). The AUCs of the unimodal model using pathological images were 0.6562 and 0.7082, respectively (Supplementary Fig. 1b, d).
We applied the selected thresholds to predict outcomes in the training, validation, and test sets, achieving strong performance (Supplementary Tables 2–10). Specifically, when predicting whether patients’ disease would progress within 5 years or whether they would die within 5 years in the test set, our model demonstrated high balanced accuracies (0.7047 and 0.6884, respectively). In the test set, the model exhibited a sensitivity of 0.5556 for predicting progression within 5 years and 0.5385 for predicting death within the same period. Additionally, the model’s specificity for predicting progression and death within 5 years was 0.8539 and 0.8384, respectively.
Harrell’s C-index was also used to evaluate the performance of AIM-LCpro. In the test set, the Harrell’s C-index for predicting progression and death within 5 years was 0.7748 and 0.7775, respectively (Supplementary Table 11).
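Harrell's C-index can be illustrated with a short pure-Python sketch. This is not the authors' computation (they report using the Hmisc package in R); the function name `harrell_c_index`, the convention that a higher risk score means a worse prognosis, and the toy data are assumptions made for illustration.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    times  : observed follow-up time per patient (e.g. months)
    events : 1 if the event (progression or death) was observed, 0 if censored
    risks  : predicted risk score; higher is assumed to mean worse prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # earlier event, higher risk: correct
                elif risks[i] == risks[j]:
                    concordant += 0.5      # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): two events, two patients censored at later times.
toy_c = harrell_c_index(
    times=[10, 20, 30, 60],
    events=[1, 1, 0, 0],
    risks=[0.9, 0.4, 0.5, 0.1],
)
print(round(toy_c, 4))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the test-set values of 0.7748 and 0.7775 reported above.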
High-risk and low-risk groups
The AIM-LCpro model was able to categorize patients into high-risk or low-risk groups based on two criteria: predicting progression within 5 years and death within 5 years (Fig. 3). For instance, if the model predicted that a patient’s disease would progress within 5 years, the patient was categorized as high-risk; otherwise, they were categorized as low-risk. In the test set, there was a statistically significant difference between high-risk and low-risk groups for all patients, with p values less than 0.0001 and Hazard Ratios (HR) of 4.85 for progression and 4.57 for death (Fig. 3a, d).
Progression in 5 years: a for the test set (log-rank test p value < 0.0001, HR = 4.85); b among patients who do not require postoperative interventions (log-rank test p value = 0.0030, HR = 5.01); c among patients who require postoperative interventions (log-rank test p value < 0.0001, HR = 4.34). Death in 5 years: d for the test set (log-rank test p value < 0.0001, HR = 4.57); e among patients who do not require postoperative interventions (log-rank test p value = 0.0443, HR = 4.10); f among patients who require postoperative interventions (log-rank test p value = 0.0036, HR = 3.51). HR hazard ratio (log-rank).
Among patients who did not require postoperative interventions, the difference between high-risk and low-risk groups remained significant, with a p value of 0.0030 and HR of 5.01 for progression, and a p value of 0.0443 and HR of 4.10 for death (Fig. 3b, e). Similarly, for patients who required postoperative interventions, the high-risk group demonstrated a significant difference compared to the low-risk group, with p values less than 0.0001 and HR of 4.34 for progression, and a p value of 0.0036 and HR of 3.51 for death (Fig. 3c, f).
Consistency between predicted and actual K-M curves
The AIM-LCpro model’s predictive outcomes for both 5-year progression and death aligned with actual survival data, with no statistically significant discrepancies observed (for progression: p = 0.5029, HR = 0.85; for death: p = 0.2321, HR = 1.10), as shown in Fig. 4a, d.
Progression in 5 years: a for the test set (log-rank test p value = 0.5029, HR = 0.85); b among patients who do not require postoperative interventions (Rényi test p value = 0.4636, HR = 1.48); c among patients who require postoperative interventions (log-rank test p value = 0.0580, HR = 0.56). Death in 5 years: d for the test set (Rényi test p value = 0.2321, HR = 1.10); e among patients who do not require postoperative interventions (Rényi test p value = 0.3091, HR = 1.76); f among patients who require postoperative interventions (log-rank test p value = 0.5253, HR = 0.81). HR hazard ratio (log-rank).
For patients who did not require postoperative interventions, the model’s survival predictions were also consistent with actual outcomes, with no statistically significant differences (p = 0.4636, HR = 1.48 for progression; p = 0.3091, HR = 1.76 for death), as illustrated in Fig. 4b, e.
Similarly, for patients requiring postoperative interventions, the model maintained its accuracy, showing no significant variance between predicted and actual survival (p = 0.0580, HR = 0.56 for progression; p = 0.5253, HR = 0.81 for death), as depicted in Fig. 4c, f.
We also conducted the patient-level evaluation. As shown in Supplementary Fig. 2, the errors of DFS and OS between the predicted values and the actual values in the test set were 11.34 ± 17.23 months and 7.76 ± 13.82 months, respectively.
Feature importance analysis of the clinical data modality
In the clinical data modality, the top three features with the highest weights for predicting progression were the proportion of lepidic adenocarcinoma, the number of metastatic lymph nodes in the 2nd and 4th groups, and tumor location (Supplementary Table 12). Correspondingly, for predicting death, the top three features with the highest weights were pleural effusion, family history, and lymph node dissection (Supplementary Table 13).
Investigation of prognostic digital biomarkers through AIM-LCpro
To intuitively display the pathological features associated with prognosis, we mapped the prognostic-related features extracted by the AIM-LCpro model onto WSIs in the form of heatmaps. As shown in Fig. 5a, when comparing the heatmaps of progression and death in the test set, the number of hotspots for progression was greater than for death, both at the patient level and the slide level. Moreover, the hotspots for patients who died within 5 years were largely contained within the progression hotspots, particularly in patients or slides with more 5-year progression hotspots than 5-year death hotspots. Given that fewer patients died within 5 years compared to those with progression, it was possible that the model learned fewer features for death prediction. The consistency in the distribution of risk hotspots for progression and death highlighted the model’s predictive capabilities. By analyzing these heatmaps, we can better understand the model’s predictions, identify areas that contribute to these predictions, and potentially uncover new digital biomarkers.
a The comparison of the number of hotspots for 5-year progression and 5-year death at patient and slide levels. P > D: the number of hotspots for 5-year progression exceeded that for 5-year death (patients: 101/125 (80.8%); slides: 369/536 (68.8%)); P = D: the number of hotspots for 5-year progression was comparable to that for 5-year death (patients: 17/125 (13.6%); slides: 114/536 (21.3%)); P < D: the number of hotspots for 5-year progression was fewer than that for 5-year death (patients: 7/125 (5.6%); slides: 53/536 (9.9%)). b The composition of histological types of NSCLC patients in the test set. c The distribution of hotspots on WSIs for patients with SCC, along with the enlargement of example regions. Red arrow: mitotic figures; green arrow: enlarged and bizarre nuclei. d The actual and predicted results in NMA subtypes. Reality: the number of patients with the NMA subtype; prediction: the number of patients with risk hotspots within the subtype. The distribution of hotspots on WSIs for patients with MPA (e), SPA (f), LPA (g), along with the enlargement of example regions. h The distribution of hotspots on WSIs for patients with APA. The enlarged regions display the features of small glands (upper) and large glands (lower). i The stromal regions identified as risk hotspots. j The distribution of hotspots on WSIs for patients with PPA, along with the enlargement of example regions. SCC squamous cell carcinoma, NMA non-mucinous adenocarcinoma, SPA solid adenocarcinoma, MPA micropapillary adenocarcinoma, LPA lepidic adenocarcinoma, APA acinar adenocarcinoma, PPA papillary adenocarcinoma.
The test set included 84 patients with non-mucinous adenocarcinoma (NMA), 36 patients with squamous cell carcinoma (SCC), and 5 patients with other NSCLC types (Fig. 5b). In SCC, risk hotspots were predominantly concentrated in the tumor regions, where tumor cells were disorderly arranged with enlarged and bizarre nuclei, and frequent mitotic figures were observed (Fig. 5c). Similar to SCC, in NMA, risk hotspots also tended to localize within the tumor areas (Fig. 5e–j). We further analyzed these regions covered by hotspots.
Of the 84 patients with NMA, risk hotspots were found to be distributed in micropapillary adenocarcinoma (MPA) and solid adenocarcinoma (SPA). As shown in Fig. 5d, e, MPA was present in 11 patients, of which 5/11 and 3/11 patients had risk hotspots in the 5-year progression and 5-year death heatmaps, respectively. Surprisingly, 30 patients had SPA, and all of them had risk hotspots in their SPA areas in both the 5-year progression and 5-year death heatmaps, although the instance-level hotspots did not cover all SPA regions (Fig. 5d, f). These two histological subtypes are coincidentally classified as high-grade patterns in the 5th edition of the WHO classification of thoracic tumors. The most common histological type, lepidic adenocarcinoma (LPA), exhibited only minimal coverage by risk hotspots (Fig. 5d, g). Interestingly, LPA was considered a low-grade histology in the 5th edition of the WHO classification of thoracic tumors.
Regarding the other two NMA histological subtypes, acinar adenocarcinoma (APA) and papillary adenocarcinoma (PPA), the distribution of risk hotspots was uneven. For APA, we identified two types of glands more likely to be covered by risk hotspots. As shown in Fig. 5h, the first type consisted of small, irregular glands made up of pleomorphic cells, surrounded by desmoplastic stroma, which was often hypovascular and composed of collagen fibers interspersed with fibroblasts and lymphocytes. The second type consisted of large, irregular glands with multilayered cells, characterized by significant cellular and nuclear pleomorphism. These cells were crowded, and some protruded into the glandular lumen, forming structures similar to a “papillary” pattern without a central axis (Fig. 5h). The stroma in these areas was loose and rich in neomicrovessels, consistent with the pure stromal regions identified as risk hotspots, as demonstrated in Fig. 5i. Fewer areas of PPA were identified as risk hotspots, with the model appearing to recognize regions with crowded cell arrangements as high-risk (Fig. 5j). The pathological features related to prognosis identified by our model may serve as digital biomarkers and warrant further validation in future studies.
Discussion
We demonstrated that a multimodal model combining dense clinical data with WSIs can successfully predict the prognosis of NSCLC patients undergoing surgery. AIM-LCpro effectively screens for and utilizes prognostic information, achieving high balanced accuracies. The features of the clinical data modality enhanced the performance of the model. To our knowledge, no other prognostic prediction models for surgical NSCLC patients have yet entered clinical application. Our model’s ability to predict which patients do or do not require postoperative treatment aligns closely with clinical application scenarios.
Previous studies relied heavily on manual annotation or predefined image features11,12,13,14,15,16,17,18,19. In contrast, our model does not require manual WSI annotation, significantly reducing the manpower involved. Additionally, it does not rely on predefined image features. Instead, it uses CAMEL2 to automatically screen and extract regions associated with prognosis26. By avoiding predefined features, the model is free to search for prognostic regions across the entire WSI without limitations. It can categorize patients into high-risk and low-risk groups while predicting 5-year DFS and OS. Moreover, there is no statistically significant difference between the predicted and actual survival outcomes, which strengthens the validity of stratifying patients into high-risk and low-risk groups.
In clinical practice, physicians need tools to predict patient outcomes. If the model predicts NSCLC patients are at risk of progression or death, they can be recommended for postoperative interventions. Conversely, if patients are predicted to remain free from progression or death, chemotherapy can be avoided, aiding in more personalized treatment plans.
NSCLC exhibits significant tumor heterogeneity27,28. This heterogeneity applies not only to tumor epithelial cells but also to the various microenvironments interacting with tumor cells29. The digital biomarkers identified by our model from WSIs may reflect this heterogeneity and aid in personalizing treatment for NSCLC patients.
Deep learning (DL)-based computational pathology enables automated and high-throughput extraction of features from histopathological images, with high sensitivity to subtle characteristics30,31. We developed AIM-LCpro to assist prognostic prediction in NSCLC patients undergoing surgery. The model presents the prognosis-related features it extracts visually and objectively as heatmaps, allowing pathologists to review and analyze them to mine digital biomarkers. Similar to traditional biomarkers, digital biomarkers serve as indicators for diagnosis, prognosis, and therapeutic responses and should demonstrate clinical validity32,33. The clinical utility of new biomarkers can be evaluated by their association with existing biomarkers or by directly proving their usefulness34. In SCC, our model identified areas with a high mitotic index as risk hotspots, consistent with previous findings that associated a high mitotic index with poor prognosis35. Additionally, our model placed varying degrees of emphasis on different histologic subtypes within NMA. MPA and SPA were identified as risk areas, in line with the high-grade growth patterns defined by the latest WHO classification of thoracic tumors. Furthermore, LPA, known for having the best prognosis, was not recognized as a risk area. APA, associated with intermediate prognosis, was widely distributed across the slides. Through heatmap analysis, we identified histological characteristics in APA that indicated poor prognosis. This finding might lead to the discovery of novel biomarkers, thereby improving classification systems. Further evidence and additional data are needed to verify this.
In addition to the conventional histological features, our model also identified some stromal areas as risk hotspots, including microvessels, fibroblasts, and extracellular matrix (ECM). These are primary components of the tumor microenvironment and are considered to exert important effects on tumor progression, metastasis, and prognosis36. Angiogenesis is a complex process and a key hallmark of cancer, and many studies have confirmed that it is crucial for the growth and metastasis of lung cancers37. Human NSCLC commonly exhibits desmoplasia, characterized by cancer-associated fibroblasts (CAFs). CAFs also influence cancer cell proliferation, invasion, and drug resistance38. For example, CAFs have been associated with T-cell exclusion in human lung tumors, contributing to immune suppression and tumor growth39. The ECM plays key roles in establishing and maintaining tumor tissue architecture. It has been reported that Tenascin-C can mediate lung adenocarcinoma metastasis40,41. Therefore, some of the risk hotspots identified by our model may help us better understand the role of the tumor microenvironment in the occurrence and development of tumors, providing new support and ideas for future research.
Predicting the prognosis of NSCLC surgery patients raises ethical concerns. For example, knowing a poor prognosis in advance may affect patients’ quality of life. Additionally, the question of who bears responsibility for harm caused by incorrect predictions remains unanswered.
Our study has several limitations. First, to avoid potential information leakage, we did not include information about subsequent treatments after progression, which could have compromised the model’s credibility. Second, the clinical benefits of altering postoperative intervention strategies based on the model’s predictions have not yet been validated. It remains to be seen how much patients would benefit from such an approach. Finally, our model is based on a relatively small cohort. Further studies with larger sample sizes are needed to enhance the model’s ability to predict NSCLC prognosis.
Methods
Study population and inclusion/exclusion criteria
We enrolled 641 NSCLC patients who underwent lung surgery at Beijing Chest Hospital between January 2016 and November 2017. After excluding 23 patients, 618 patients (BCH study cohort) were ultimately included in the study.
The inclusion criteria were: (1) NSCLC patients who underwent radical surgery, or NSCLC patients who underwent pulmonary surgery but did not receive lymph node dissection due to poor pulmonary function; (2) NSCLC patients who agreed to follow-up. Patients with stages 0 to IIIB were all potentially included. The exclusion criteria were: (1) patients with other incurable malignant tumors; (2) patients who died from other diseases before progression within 5 years after surgery; (3) cases where all primary tumor tissues were frozen prior to being fixed in formalin.
The study was conducted in accordance with the principles of the Declaration of Helsinki and approved by the Ethics Committee of Beijing Chest Hospital, Capital Medical University (YJS-2023-16).
WSI acquisition
All hematoxylin and eosin (H&E)-stained slides of primary tumor tissues were selected, excluding frozen slides and frozen paraffin slides. All WSIs in this study were therefore formalin-fixed paraffin-embedded (FFPE) H&E-stained whole-slide images of primary tumor tissues. A total of 2629 WSIs were acquired for the BCH study cohort, scanned using the KFBio KF-PRO-400 scanner, and saved at magnifications of ×400, ×200, ×100, and ×50.
Clinical data acquisition
Clinical variables were collected from inpatient medical records and included age, gender, smoking history, family history, TNM stages, lymph node dissection and metastasis, tumor size, CT data, postoperative treatment, and risk factors (Supplementary Table 14). Risk factors included poorly differentiated tumors, vascular invasion, wedge resection, visceral pleural involvement, and unknown lymph node status. All patients were followed up through telephone and outpatient services, with a postoperative follow-up period of over 5 years for all patients.
TNM staging was based on postoperative pathology reports and performed according to the 8th UICC/AJCC TNM edition for non-small cell lung cancer staging.
Dataset division for training, validation, and testing
The BCH study cohort was divided into training (428 patients), validation (62 patients), and test sets (125 patients) for predicting 5-year progression (Supplementary Table 15), and into training (426 patients), validation (62 patients), and test sets (125 patients) for predicting 5-year death (Supplementary Table 16). The training, validation, and test sets were comparable (Supplementary Tables 17 and 18). In the cohort, there were 3 patients with unknown progression status and 8 patients with unknown vital status. These patients were included in the training set and participated in a portion of the training process. Analysis was performed only when evaluating the specific segments of the training process in which they were involved.
Image segmentation and feature extraction
Glass regions were filtered out using RGB channel pixel variance calculations, and tissue regions were extracted and cut into 2048 × 2048-pixel image patches at ×20 magnification. Each patch was then divided into 64 instances, each measuring 256 × 256 pixels.
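The tiling arithmetic above (a 2048 × 2048 patch split into 64 instances of 256 × 256 pixels) and the variance-based glass filter can be sketched in plain Python. The text does not specify how the RGB variance was computed or which cutoff was used, so `mean_channel_variance`, `is_glass`, and the threshold of 50.0 are hypothetical choices made only for illustration.

```python
def split_into_instances(patch, instance_size):
    """Cut a patch (a list of pixel rows) into non-overlapping square tiles."""
    height, width = len(patch), len(patch[0])
    if height % instance_size or width % instance_size:
        raise ValueError("patch size must be a multiple of instance_size")
    instances = []
    for top in range(0, height, instance_size):
        for left in range(0, width, instance_size):
            rows = patch[top:top + instance_size]
            instances.append([row[left:left + instance_size] for row in rows])
    return instances


def mean_channel_variance(pixels):
    """Average, over pixels, of the variance across each pixel's R, G, B values."""
    total = 0.0
    for r, g, b in pixels:
        mean = (r + g + b) / 3.0
        total += ((r - mean) ** 2 + (g - mean) ** 2 + (b - mean) ** 2) / 3.0
    return total / len(pixels)


def is_glass(pixels, variance_threshold=50.0):
    """Assumed rule: near-gray regions (R ~ G ~ B) are treated as bare glass.

    The threshold is hypothetical and not taken from the paper.
    """
    return mean_channel_variance(pixels) < variance_threshold


# Toy example: an 8 x 8 grid cut into 2 x 2 tiles stands in for a
# 2048 x 2048 patch cut into 256 x 256 instances.
toy_patch = [[(row, col) for col in range(8)] for row in range(8)]
toy_instances = split_into_instances(toy_patch, 2)
instances_per_patch = (2048 // 256) ** 2  # 8 tiles per side, 64 in total
```

Tiles are emitted in row-major order, so an instance index can be mapped back to its position in the patch when instance-level probabilities are later drawn as a heatmap.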
We pre-trained CAMEL2 twice, independently: once on whether the patient progressed within 5 years and once on whether the patient died within 5 years. This yielded two pre-trained CAMEL2 models.
For image features, we used the pre-trained CAMEL2 weakly supervised framework to extract features from patches in the training set, extracting intermediate features from CAMEL2 as the patch’s image feature representation. The core of this enhanced framework was an instance-level classifier, which served as the foundation for model interpretability and visualization. Each patch had two image feature representations: one for progression and one for death. To obtain patient-level image feature representations, we sorted all patch image feature representations in descending order based on the prediction probability output by CAMEL2, then averaged the top 10% of patch-level features to generate the patient-level image feature representation.
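The top-10% pooling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `patient_level_feature`, the rounding rule for the number of patches kept, and the toy inputs are assumptions.

```python
def patient_level_feature(patch_features, patch_probs, top_fraction=0.10):
    """Average the feature vectors of the highest-probability patches.

    patch_features : one feature vector (list of floats) per patch
    patch_probs    : one predicted probability per patch
    top_fraction   : share of patches kept after sorting (10% in the text)
    """
    if len(patch_features) != len(patch_probs) or not patch_probs:
        raise ValueError("need one probability per patch and at least one patch")
    # Sort patch indices by probability, highest first.
    order = sorted(range(len(patch_probs)),
                   key=lambda i: patch_probs[i], reverse=True)
    # Keep at least one patch; rounding to the nearest integer is an assumption.
    k = max(1, round(top_fraction * len(order)))
    top = [patch_features[i] for i in order[:k]]
    dim = len(top[0])
    return [sum(vec[d] for vec in top) / k for d in range(dim)]


# Toy example with 20 patches and 2-dimensional features (hypothetical values).
toy_probs = [i / 20 for i in range(20)]
toy_features = [[float(i), 2.0 * i] for i in range(20)]
toy_patient_vector = patient_level_feature(toy_features, toy_probs)
```

Because each patch has one representation for progression and one for death, the same pooling would be run once per task.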
Clinical data standardization and normalization
Clinical data contained discrete and categorical variables. Discrete variables were normalized by scaling values between 0 and 1. Categorical variables, such as gender and disease type, were one-hot encoded using positional coding, where a one-dimensional vector represented two-dimensional information.
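A minimal sketch of the two preprocessing steps named above, min-max scaling to the range 0 to 1 and one-hot encoding, is shown below. The helper names, the category lists, and the example values are hypothetical; the paper's exact "positional coding" scheme is not specified, so a plain one-hot vector is used here as an assumption.

```python
def min_max_scale(values):
    """Scale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Return a vector with a 1.0 at the position of `value` in `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


# Toy example (hypothetical values): a numeric column and a categorical variable.
scaled_sizes = min_max_scale([40, 60, 80])
encoded_gender = one_hot("female", ["male", "female"])
```

In practice the minimum and maximum would normally be taken from the training set only and then reused for the validation and test sets, so that no test information leaks into preprocessing.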
Architecture of the multimodal AI model
The workflow is shown in Supplementary Fig. 3. We employed a two-stage training strategy: first, classification training for prognostic metrics (progression and death), followed by regression training for time prediction (progression time and death time) based on the classification model weights.
Training procedure and algorithm selection
Preprocessed clinical features from the training set patients were passed through a clinical feature network to obtain clinical feature representations. This network consisted of linear layers, Batch Normalization layers, and ReLU layers. The patient-level image and clinical feature representations were concatenated and fed into the classification head network, which output the probability of progression or death. The classification head network comprised linear layers, Batch Normalization layers, and ReLU layers, with two independent linear layers in the final stage. The network was trained using cross-entropy loss.
For regression training, the clinical feature network weights were frozen, and two separate time prediction head networks were trained to output progression and death times. The time prediction head consisted of linear layers, Batch Normalization layers, ReLU layers, and a final Sigmoid layer. The output was multiplied by 60 to obtain the specific progression or death month. The network was trained using the L1 loss function. During inference, the network simultaneously output classification results and time predictions. For patients classified as negative samples, the corresponding time was set to 60 months; otherwise, the network’s original predicted output was retained.
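The time-decoding rule described above (a Sigmoid output multiplied by 60, with classifier-negative patients assigned 60 months) can be sketched in a few lines. The function names and the convention that a probability at or above the threshold counts as positive are assumptions; the threshold value in the example is one of those reported later in the Methods.

```python
import math

HORIZON_MONTHS = 60.0  # 5-year horizon used throughout the study


def sigmoid(x):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_time_months(time_logit, event_probability, threshold):
    """Combine the classification and regression heads at inference time.

    Patients classified as negative (no event within 5 years) get 60 months;
    otherwise the regression output is rescaled from (0, 1) to months.
    """
    if event_probability < threshold:
        return HORIZON_MONTHS
    return HORIZON_MONTHS * sigmoid(time_logit)


def l1_loss(predicted, actual):
    """Mean absolute error, the loss named for regression training."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy example (hypothetical logits and probabilities).
positive_case = predict_time_months(0.0, event_probability=0.9, threshold=0.3123)
negative_case = predict_time_months(0.0, event_probability=0.1, threshold=0.3123)
toy_loss = l1_loss([positive_case, negative_case], [24.0, 60.0])
```

Bounding the regression output with a Sigmoid keeps every predicted time inside the 0 to 60 month window.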
Linear layers are one of the most fundamental layers in neural networks, also known as fully connected layers or dense layers. The role of multiple linear layers is to capture complex relationships in the data through linear transformations. Batch Normalization normalizes the input data by extracting the mean and variance along the batch dimension, reducing internal covariate shift, thereby accelerating the training process and enhancing model stability. The ReLU activation function introduces nonlinearity, enabling neural networks to learn and represent more complex functional relationships. By stacking multiple ReLU activation functions, neural networks can construct highly nonlinear mappings, thus better fitting complex data distributions. The sigmoid layer is a commonly used activation function layer in neural networks, which maps any real number to the interval (0,1).
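The four building blocks described above can be written out in plain Python to make their arithmetic concrete. This is a didactic sketch, not the network used in the study: layer sizes and weights are hypothetical, and `batch_norm` uses batch statistics only, omitting the learnable scale and shift that deep learning frameworks usually add.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one weighted sum per output unit, plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the batch."""
    n, dim = len(batch), len(batch[0])
    means = [sum(sample[d] for sample in batch) / n for d in range(dim)]
    variances = [sum((sample[d] - means[d]) ** 2 for sample in batch) / n
                 for d in range(dim)]
    return [[(sample[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dim)]
            for sample in batch]


def relu(x):
    """Keep positive values and replace negative values with zero."""
    return [max(0.0, v) for v in x]


def sigmoid(v):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


# Toy forward pass with hypothetical weights: linear layer followed by ReLU.
hidden = relu(linear([1.0, 2.0], weights=[[1.0, 1.0], [0.0, -1.0]], bias=[0.0, 0.5]))
normalized = batch_norm([[1.0], [3.0]])
```

The `eps` term only guards against division by zero when a feature has no variance within the batch.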
The model assigned one probability of progression or death within 5 years to each patient. In the training and validation sets, thresholds were selected based on sensitivity, specificity, accuracy, and the Youden index. For patients who did not require postoperative treatment, thresholds of 0.1461 and 0.2092 were selected for progression and death within 5 years, respectively. For patients who required postoperative treatment, thresholds of 0.3123 and 0.3391 were used. These thresholds were then applied to the test set to evaluate the performance of the model.
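One of the criteria named above, the Youden index (sensitivity + specificity − 1), can be used to pick a threshold as sketched below. The study combined it with sensitivity, specificity, and accuracy and selected thresholds separately for the two patient subgroups; this sketch uses the Youden index alone, and the function name and toy data are hypothetical.

```python
def youden_threshold(labels, probs):
    """Return the candidate threshold that maximizes sensitivity + specificity - 1.

    labels : 1 for patients with the event, 0 otherwise (both classes required)
    probs  : predicted event probability per patient
    A patient is called positive when the probability is >= the threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = None, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < t)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Toy example (hypothetical labels and probabilities).
toy_threshold, toy_j = youden_threshold(
    labels=[0, 0, 0, 1, 1],
    probs=[0.1, 0.2, 0.4, 0.35, 0.8],
)
```

Selecting the threshold on the training and validation sets and only then applying it to the test set, as described above, keeps the test metrics free of threshold tuning.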
To ensure the reliability, reproducibility, and fairness of model evaluation during training, testing, and validation, we utilized standardized metrics. A key focus was analyzing performance using the receiver operating characteristic (ROC) curve and its associated metrics.
Area under the ROC curve (AUC): Quantifies model performance across all classification thresholds. AUC ranges from 0 to 1, with values closer to 1 indicating superior discriminative power. This metric is particularly robust in imbalanced datasets, as it remains unaffected by class distribution skew.
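The AUC has an equivalent pairwise reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name and toy data are hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case scored higher: correct ordering
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical labels and scores).
toy_auc = roc_auc(labels=[0, 0, 0, 1, 1], scores=[0.1, 0.2, 0.4, 0.35, 0.8])
print(round(toy_auc, 4))
```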
The following metrics were derived under a certain threshold chosen from the ROC curve.
Sensitivity (true positive rate): measures the proportion of actual positive samples correctly identified:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
where true positive (TP) denotes the number of correctly classified positive samples, and false negative (FN) represents the number of positive samples that were incorrectly classified as negative.
Specificity (true negative rate): measures the proportion of actual negative samples correctly identified:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
where true negative (TN) is the count of correctly classified negative samples, and false positive (FP) refers to the number of negative samples incorrectly classified as positive.
Accuracy: evaluates overall prediction correctness:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
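For concreteness, the three threshold-dependent metrics can be computed from the confusion counts as in the following sketch, where sensitivity is TP/(TP + FN), specificity is TN/(TN + FP), and accuracy is the share of all samples classified correctly. The predictions and labels are hypothetical.

```python
def confusion_counts(preds, labels):
    """Count TP, FN, TN, FP for binary predictions (1 = positive)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp, fn, tn, fp

def threshold_metrics(preds, labels):
    """Return sensitivity, specificity, and accuracy at a fixed threshold."""
    tp, fn, tn, fp = confusion_counts(preds, labels)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical outcomes and thresholded predictions for 10 patients.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = threshold_metrics(preds, labels)
print(m)  # TP=3, FN=1, TN=4, FP=2
```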
Heatmap visualization and evaluation
To identify biomarkers precisely, the heatmap visualization in this study mapped the prognosis prediction probability of each instance to heat intensity. In addition, a sliding window-based inference strategy with finer-grained steps was employed to enhance representation accuracy beyond the instance level, enabling more nuanced feature detection. Instances with probabilities lower than 0.3 were not displayed on the heatmap. Pathologists then read and evaluated these heatmaps and interpreted the digital pathological biomarkers.
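The general idea of sliding-window heatmap inference can be sketched as below. This is an illustrative assumption rather than the study's implementation: overlapping windows are combined here by simple averaging, `toy_score` is a hypothetical stand-in for the model, and the grid, window, and stride sizes are arbitrary. Only the 0.3 display cutoff comes from the text.

```python
def sliding_window_heatmap(height, width, window, stride, score_fn, min_prob=0.3):
    """Score overlapping windows and average the scores covering each cell.
    Cells whose averaged probability is below `min_prob` are set to None
    (i.e. not displayed). Averaging of overlaps is an assumption of this sketch."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            p = score_fn(top, left, window)
            for r in range(top, top + window):
                for c in range(left, left + window):
                    total[r][c] += p
                    count[r][c] += 1
    heat = []
    for r in range(height):
        row = []
        for c in range(width):
            mean = total[r][c] / count[r][c] if count[r][c] else 0.0
            row.append(mean if mean >= min_prob else None)
        heat.append(row)
    return heat

def toy_score(top, left, window):
    """Hypothetical model stand-in: probability increases toward the right."""
    return left / 8.0

# A stride smaller than the window gives finer-than-instance resolution.
heat = sliding_window_heatmap(height=4, width=8, window=4, stride=2, score_fn=toy_score)
print(heat[0])
```

With a stride of half the window size, each interior cell is covered by two windows, so the displayed value varies at a finer scale than a single instance.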
Statistical analysis
Categorical data were evaluated using Pearson’s chi-squared test or Fisher’s exact test. Measurement data were expressed as mean ± standard deviation and analyzed using the independent samples t-test or analysis of variance. Survival curves were generated using the Kaplan–Meier method. When survival curves did not intersect, they were compared using the log-rank test. When survival curves intersected, the Rényi test was utilized to make comparisons. Harrell’s C-index was computed in R using the Hmisc package. All tests were two-tailed, and a p value less than 0.05 was considered statistically significant. Statistical analysis was performed using SPSS software 26.0 or GraphPad Prism 10.
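To make the concordance measure concrete, the following sketch computes a basic form of Harrell's C-index: among comparable patient pairs, the fraction in which the patient with the earlier observed event was assigned the higher risk. It is written in Python purely for illustration and does not reproduce the Hmisc routine used in this study; in particular, pairs with tied follow-up times are simply skipped, and the follow-up data are hypothetical.

```python
def harrell_c_index(times, events, risks):
    """Basic Harrell's C-index.
    A pair (i, j) is comparable when patient i has an observed event
    (events[i] == 1) strictly before time j. It is concordant when the
    predicted risk of i exceeds that of j; tied risks count 0.5."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical follow-up (months), event indicator (1 = event, 0 = censored), risk score.
times = [12, 30, 45, 60, 60]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = harrell_c_index(times, events, risks)
print(c_index)  # 6 of 7 comparable pairs are concordant
```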
Data availability
Data are available upon reasonable request.
Code availability
The code can be accessed online: https://github.com/ThoroughFuture.
References
Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
Reck, M. & Rabe, K. F. Precision diagnosis and treatment for advanced non-small-cell lung cancer. N. Engl. J. Med. 377, 849–861 (2017).
Ettinger, D. S. et al. NCCN Guidelines® Insights: non-small cell lung cancer, version 2.2023. J. Natl Compr. Cancer Netw. 21, 340–350 (2023).
Jiang, Y. et al. The impact of adjuvant EGFR-TKIs and 14-gene molecular assay on stage I non-small cell lung cancer with sensitive EGFR mutations. EClinicalMedicine 64, 102205 (2023).
Scagliotti, G. V. et al. Randomized phase III study of surgery alone or surgery plus preoperative cisplatin and gemcitabine in stages IB to IIIA non-small-cell lung cancer. J. Clin. Oncol. 30, 172–178 (2012).
Douillard, J. Y. et al. Adjuvant vinorelbine plus cisplatin versus observation in patients with completely resected stage IB-IIIA non-small-cell lung cancer (Adjuvant Navelbine International Trialist Association [ANITA]): a randomised controlled trial. Lancet Oncol. 7, 719–727 (2006).
Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Chen, C. L. et al. An annotation-free whole-slide training approach to pathological classification of lung cancer types using deep learning. Nat. Commun. 12, 1193 (2021).
Diao, J. A. et al. Human-interpretable image features derived from densely mapped cancer pathology slides predict diverse molecular phenotypes. Nat. Commun. 12, 1613 (2021).
Wu, J. et al. Artificial intelligence-assisted system for precision diagnosis of PD-L1 expression in non-small cell lung cancer. Mod. Pathol. 35, 403–411 (2022).
Yu, K. H. et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat. Commun. 7, 12474 (2016).
Luo, X. et al. Comprehensive computational pathological image analysis predicts lung cancer prognosis. J. Thorac. Oncol. 12, 501–509 (2017).
Wang, Y. et al. Multi-scale pathology image texture signature is a prognostic factor for resectable lung adenocarcinoma: a multi-center, retrospective study. J. Transl. Med. 20, 595 (2022).
Pan, X. et al. Computerized tumor-infiltrating lymphocytes density score predicts survival of patients with resectable lung adenocarcinoma. iScience 25, 105605 (2022).
Alsubaie, N., Raza, S. E. A., Snead, D. & Rajpoot, N. M. Growth pattern fingerprinting for automatic analysis of lung adenocarcinoma overall survival. IEEE Access 11, 23335–23346 (2023).
Wang, H., Xing, F., Su, H., Stromberg, A. & Yang, L. Novel image markers for non-small cell lung cancer classification and survival prediction. BMC Bioinforma. 15, 310 (2014).
Wang, X. et al. Prediction of recurrence in early stage non-small cell lung cancer using computer extracted nuclear features from digital H&E images. Sci. Rep. 7, 13543 (2017).
Alsubaie, N. M., Snead, D. & Rajpoot, N. M. Tumour nuclear morphometrics predict survival in lung adenocarcinoma. IEEE Access 9, 12322–12331 (2021).
Kludt, C. et al. Next-generation lung cancer pathology: development and validation of diagnostic and prognostic algorithms. Cell Rep. Med. 5, 101697 (2024).
Zhao, L. et al. CoADS: cross attention based dual-space graph network for survival prediction of lung cancer using whole slide images. Comput. Methods Prog. Biomed. 236, 107559 (2023).
Diao, S. et al. Automated cellular-level dual global fusion of whole-slide imaging for lung adenocarcinoma prognosis. Cancers 15, 4824 (2023).
Shim, W. S. et al. DeepRePath: identifying the prognostic features of early-stage lung adenocarcinoma using multi-scale pathology images and deep convolutional neural networks. Cancers 13, 3308 (2021).
Zheng, Y. et al. Graph attention-based fusion of pathology images and gene expression for prediction of cancer survival. IEEE Trans. Med. Imaging 43, 3085–3097 (2024).
Hattori, H., Sakashita, S., Tsuboi, M., Ishii, G. & Tanaka, T. Tumor-identification method for predicting recurrence of early-stage lung adenocarcinoma using digital pathology images by machine learning. J. Pathol. Inform. 14, 100175 (2023).
Kim, P. J. et al. A new model using deep learning to predict recurrence after surgical resection of lung adenocarcinoma. Sci. Rep. 14, 6366 (2024).
Xu, G. et al. CAMEL2: enhancing weakly supervised learning for histopathology images by incorporating the significance ratio. Adv. Intell. Syst. 6, 12 (2024).
Gridelli, C. et al. Non-small-cell lung cancer. Nat. Rev. Dis. Prim. 1, 15009 (2015).
Chen, Z., Fillmore, C. M., Hammerman, P. S., Kim, C. F. & Wong, K. K. Non-small-cell lung cancers: a heterogeneous set of diseases. Nat. Rev. Cancer 14, 535–546 (2014).
Quail, D. F. & Joyce, J. A. Microenvironmental regulation of tumor progression and metastasis. Nat. Med. 19, 1423–1437 (2013).
Ramesh, S. et al. Artificial intelligence-based morphologic classification and molecular characterization of neuroblastic tumors from digital histopathology. NPJ Precision Oncol. 8, 255 (2024).
Liang, J. et al. Deep learning supported discovery of biomarkers for clinical prognosis of liver cancer. Nat. Mach. Intell. 5, 408–420 (2023).
Arya, S. S., Dias, S. B., Jelinek, H. F., Hadjileontiadis, L. J. & Pappa, A. M. The convergence of traditional and digital biomarkers through AI-assisted biosensing: a new era in translational diagnostics? Biosens. Bioelectron. 235, 115387 (2023).
Montag, C., Elhai, J. D. & Dagum, P. On blurry boundaries when defining digital biomarkers: how much biology needs to be in a digital biomarker? Front. Psychiatry 12, 740292 (2021).
Song, Y., Kang, K., Kim, I. & Kim, T. J. Pathological digital biomarkers: validation and application. Appl. Sci. 12, 13 (2022).
Gürel, D. et al. The prognostic value of morphologic findings for lung squamous cell carcinoma patients. Pathol. Res. Pract. 212, 1–9 (2016).
Altorki, N. K. et al. The lung microenvironment: an important regulator of tumour growth and metastasis. Nat. Rev. Cancer 19, 9–31 (2019).
Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).
Chhabra, Y. & Weeraratna, A. T. Fibroblasts in cancer: unity in heterogeneity. Cell 186, 1580–1609 (2023).
Grout, J. A. et al. Spatial positioning and matrix programs of cancer-associated fibroblasts promote T-cell exclusion in human lung tumors. Cancer Discov. 12, 2606–2625 (2022).
Paolillo, M. & Schinelli, S. Extracellular matrix alterations in metastatic processes. Int. J. Mol. Sci. 20, 4947 (2019).
Gocheva, V. et al. Quantitative proteomics identify Tenascin-C as a promoter of lung cancer progression and contributor to a signature prognostic of patient survival. Proc. Natl Acad. Sci. USA 114, E5625–E5634 (2017).
Acknowledgements
This work was supported by Beijing AI+Health Cultivation Innovation Project (No. Z241100007724001), Beijing Municipal Public Welfare Development and Reform Pilot Project for Medical Research Institutes (No. JYY2023-15), Beijing Nova Program, and 2023 Science and Technology Projects of Qinghai Province, China (Basic Research Program, No. 2023-ZJ-732).
Author information
Contributions
N.C. and S.W. conceived and designed the study. Y.L. collected clinical data, conducted the analyses and wrote the manuscript. X.C. participated in the interpretation of digital biomarkers and wrote the manuscript. M.Y., J.X., J.Z. and Y.C. participated in the establishment of the model. G.X. and W.W. provided assistance in establishing the model. H.L. provided assistance in the interpretation of digital biomarkers.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Y., Chai, X., Yang, M. et al. Accurate prediction of disease-free and overall survival in non-small cell lung cancer using patient-level multimodal weakly supervised learning. npj Precis. Onc. 9, 197 (2025). https://doi.org/10.1038/s41698-025-00981-y