Introduction

Nasopharyngeal carcinoma (NPC), a malignancy arising from the epithelial lining of the nasopharynx, accounts for a significant portion of head and neck cancers, particularly in Southeast Asia and Southern China1,2. Despite its relatively rare occurrence in other parts of the world, NPC is characterized by its distinct epidemiology, aggressive nature, and sensitivity to chemoradiotherapy (CRT)3,4,5. However, because the nasopharynx is anatomically concealed and early symptoms are insidious, over 70% of patients with NPC are diagnosed at an advanced stage, leading to a poor prognosis6,7. Notably, survival rates can vary even among NPC patients with the same TNM stage who receive similar or identical treatment regimens. Approximately 30–40% of patients eventually develop distant metastasis following curative treatment8,9.

Mounting evidence has shown that age is an important prognostic factor in patients with NPC, influencing both treatment response and survival outcomes10,11,12. Older patients often face additional challenges, including a higher incidence of comorbidities, reduced immune response, and diminished capacity for recovery, which can affect the overall treatment strategy and its effectiveness13. Younger patients typically demonstrate better overall survival (OS) rates than their older counterparts, potentially due to better treatment tolerance and fewer age-related complications14. Therefore, understanding the relationship between age and the causes of death (CODs) in NPC patients is crucial for optimizing personalized treatment strategies and improving patient outcomes. However, large-scale studies comprehensively investigating the impact of age on the prognosis and CODs in NPC patients receiving CRT are scarce.

With the increasing use of machine learning techniques, predictive models such as least absolute shrinkage and selection operator (LASSO), random forest (RF), Cox regression, support vector machine (SVM), eXtreme Gradient Boosting (XGBoost), decision tree (DT), and gradient-boosting machine (GBM) are increasingly used to identify complex patterns that conventional statistical methods may miss15,16. These models can integrate a wide range of patient-specific data to better understand the effects of age on mortality in NPC patients, which may support the development of more personalized treatment approaches17,18. Consequently, the current study sought to leverage data from the Surveillance, Epidemiology, and End Results (SEER) database to explore the influence of age on CODs in patients with advanced NPC receiving CRT and to develop predictive models using machine learning to evaluate patient prognosis.

Materials and methods

Patient selection

Data of 2841 patients with advanced NPC undergoing CRT between 2000 and 2020 were extracted from the SEER-17 database. Patients were classified into two groups based on age: older group (≥ 60 years, n = 1037) and younger group (< 60 years, n = 1804). Inclusion criteria: (a) patients diagnosed with squamous cell carcinoma; (b) those who received CRT; (c) those with stage III/IV NPC; (d) those with clear CODs. Exclusion criteria: (a) patients with stage I/II or unknown NPC; (b) those aged < 18 years; (c) those with incomplete data. Since SEER data are publicly available for research, approval from the local ethics committee was waived.
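The selection and grouping rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout and field names (`histology`, `received_crt`, `stage`, `age`, `cod`) are hypothetical and do not correspond to actual SEER variable names, and the 60-year cutoff assumes the younger group is the one below 60 years, as the Results imply.

```python
# Hypothetical record fields; real SEER exports use different variable names.
REQUIRED_FIELDS = ("histology", "received_crt", "stage", "age", "cod")


def age_group(age, cutoff=60):
    """Assign the study's two age groups (assumed: younger < 60, older >= 60)."""
    return "younger" if age < cutoff else "older"


def is_eligible(patient):
    """Apply the inclusion and exclusion criteria to one patient record."""
    # Exclusion (c): incomplete data (any required field missing or None).
    if any(patient.get(field) is None for field in REQUIRED_FIELDS):
        return False
    return (
        patient["histology"] == "squamous cell carcinoma"  # inclusion (a)
        and patient["received_crt"]                         # inclusion (b)
        and patient["stage"] in {"III", "IV"}               # inclusion (c)
        and patient["age"] >= 18                            # exclusion (b)
    )


def select_cohort(patients):
    """Keep eligible patients and split them into the two age groups."""
    groups = {"younger": [], "older": []}
    for patient in patients:
        if is_eligible(patient):
            groups[age_group(patient["age"])].append(patient)
    return groups
```

Keeping the eligibility check separate from the grouping step makes it easy to count how many records each criterion removes.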

Definition of CODs

CODs were grouped into four categories: (1) NPC-related deaths (NRDs), including those resulting from NPC-related conditions; (2) secondary malignant neoplasms (SMNs), covering fatalities caused by secondary cancers; (3) cardiovascular diseases (CVDs), encompassing deaths from heart disease, atherosclerosis, aortic aneurysm and dissection, cerebrovascular conditions, or other arterial issues; and (4) other causes.
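As a minimal sketch of this four-way grouping, the function below maps a cause-of-death label to one of the study's categories. The label strings and the `is_cancer` flag are illustrative assumptions, not the exact SEER cause-of-death recode values used by the authors.

```python
# Illustrative labels only; the actual SEER cause-of-death recodes may differ.
CVD_LABELS = {
    "heart disease",
    "atherosclerosis",
    "aortic aneurysm and dissection",
    "cerebrovascular disease",
    "other arterial disease",
}


def categorize_cod(label, is_cancer):
    """Map one cause-of-death label to NRD, SMN, CVD, or Other.

    label     -- free-text cause of death (hypothetical vocabulary)
    is_cancer -- True when the cause of death is a malignancy
    """
    normalized = label.strip().lower()
    if normalized == "nasopharynx":
        return "NRD"    # (1) NPC-related death
    if is_cancer:
        return "SMN"    # (2) death from a secondary malignant neoplasm
    if normalized in CVD_LABELS:
        return "CVD"    # (3) cardiovascular death
    return "Other"      # (4) all remaining causes
```

The order of the checks matters: NPC deaths are tested first so that they are never counted as secondary cancers.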

Machine learning algorithms

In our study, the following 11 variables were included in the LASSO regression: sex, race, marital status, year of diagnosis, household income, grade, histology, T, N, M, and stage. After LASSO regression, several key variables influencing OS were identified and used to develop machine learning models. Patients were divided into a training cohort and a validation cohort in a 7:3 ratio. Six models, including Cox regression, SVM (with Radial Basis Function kernel), GBM, XGBoost, RF, and DT, were developed to predict OS in NPC patients. The predictive performance of these models was compared using the concordance index (C-index), Brier score, and receiver operating characteristic (ROC) curves. Decision curve analysis (DCA) and calibration curves were further applied to identify the optimal model.
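To make the 7:3 split and the C-index comparison concrete, here is a small pure-Python sketch. It uses Harrell's definition of the concordance index, which is an assumption because the text does not state which C-index variant was computed. The function names and the random seed are illustrative.

```python
import random


def train_validation_split(items, train_fraction=0.7, seed=0):
    """Shuffle reproducibly and split into training and validation parts (7:3 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    times       -- follow-up time per patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is anchored on the patient with the observed, earlier death
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death had the higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.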

SurvSHAP, utilizing a global explanation approach, was applied to calculate the mean SHapley Additive exPlanations (SHAP) value for each feature across the entire dataset. Time-dependent importance plots were generated to highlight the most influential features for predicting 3-, 5-, and 10-year OS. Additionally, partial dependence plots (PDPs) were employed to illustrate the impact of feature value changes on predicted outcomes.
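The idea behind a global SHAP ranking can be illustrated with an exact Shapley computation on a toy model. This is not SurvSHAP itself, which explains time-dependent survival predictions; it is a simplified static sketch in which features "absent" from a coalition are replaced by a baseline value. The toy risk function, its coefficients, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features outside a coalition take their baseline value; this is one common
    convention and is assumed here for illustration.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [k for k in range(n) if k != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[k] if k in present else baseline[k] for k in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def mean_abs_shap(predict, rows, baseline):
    """Global importance: mean absolute Shapley value of each feature over all rows."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for k, value in enumerate(shapley_values(predict, x, baseline)):
            totals[k] += abs(value)
    return [total / len(rows) for total in totals]


def toy_risk(features):
    """Hypothetical linear risk score over (age, metastasis, stage IV)."""
    age, metastasis, stage_iv = features
    return 0.04 * age + 1.5 * metastasis + 0.5 * stage_iv
```

For a linear model such as `toy_risk`, each Shapley value reduces to the coefficient times the feature's distance from the baseline, which gives a convenient sanity check. The exact enumeration grows exponentially with the number of features, so it is only practical for small feature sets like this one.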

Statistical analysis

The chi-square test was employed to compare categorical variables between the younger and older groups. Inverse probability of treatment weighting (IPTW) was applied to balance covariate distributions between the two groups. The variables included sex, race, marital status, year of diagnosis, household income, grade, histology, T, N, M, and stage. Cumulative incidences of each COD were estimated using competing risk analysis and compared between groups using Gray’s test. Kaplan-Meier curves were plotted to analyze OS outcomes, with comparisons made through the log-rank test. All statistical analyses were performed using R software version 3.3.2, with a significance threshold set at P < 0.05.
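The weighting and competing-risk steps can be sketched as follows. The analysis itself was run in R, so this Python version is an illustration only. It assumes unstabilized weights, propensity scores strictly between 0 and 1 that have already been estimated elsewhere, and a weighted Aalen-Johansen type estimator for the cumulative incidence. Gray's test is not implemented here.

```python
def iptw_weights(in_older_group, propensity):
    """Unstabilized IPTW weights.

    in_older_group -- True if the patient belongs to the older group
    propensity     -- estimated probability of being in the older group (0 < p < 1)
    """
    return [
        1.0 / p if older else 1.0 / (1.0 - p)
        for older, p in zip(in_older_group, propensity)
    ]


def cumulative_incidence(times, causes, cause_of_interest, weights=None):
    """Weighted cumulative incidence of one cause in the presence of competing causes.

    causes[i] == 0 means censored; any other value is a cause-of-death code.
    Returns [(time, cumulative incidence)] at each distinct time with a death.
    """
    n = len(times)
    if weights is None:
        weights = [1.0] * n
    order = sorted(range(n), key=lambda idx: times[idx])
    at_risk = sum(weights)   # weighted number still under observation
    surv = 1.0               # all-cause survival just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < n:
        t = times[order[i]]
        d_interest = d_any = leaving = 0.0
        while i < n and times[order[i]] == t:
            j = order[i]
            leaving += weights[j]
            if causes[j] != 0:
                d_any += weights[j]
                if causes[j] == cause_of_interest:
                    d_interest += weights[j]
            i += 1
        if d_any > 0:
            cif += surv * d_interest / at_risk   # probability of dying of this cause at t
            surv *= 1.0 - d_any / at_risk        # update all-cause survival
            curve.append((t, cif))
        at_risk -= leaving
    return curve
```

Unlike one minus a Kaplan-Meier estimate, this estimator does not treat deaths from competing causes as censoring, so the cause-specific incidences add up to the all-cause cumulative mortality.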

Results

Patient characteristics

The younger and older groups showed significant differences in terms of race (P < 0.001), marital status (P = 0.002), year of diagnosis (P = 0.001), tumor grade (P < 0.001), histology (P < 0.001), T stage (P = 0.037), and N stage (P < 0.001) before IPTW (Table 1). However, no statistically significant differences in baseline characteristics were found between the two groups after IPTW (Supplementary Table 1, Supplementary Fig. 1).

Table 1 Clinical characteristics of NPC patients before IPTW.

OS and CODs

The younger group exhibited a significantly longer median OS than the older group before IPTW adjustment (140 vs. 50 months, P < 0.001, Fig. 1A) and after IPTW adjustment (137 vs. 53 months, P < 0.001, Fig. 1B).
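Median OS values of this kind are read from a Kaplan-Meier curve as the first time at which estimated survival falls to 50% or below. The sketch below shows the unweighted estimator on made-up data; it is not the authors' R code, and the IPTW-adjusted medians would additionally require patient weights.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate; returns [(time, survival)] at each distinct death time.

    events[i] is 1 for an observed death and 0 for censoring.
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which survival is <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

When fewer than half of the patients have died by the end of follow-up, the median is undefined, which `median_survival` reports as `None`.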

Fig. 1

Overall survival (OS) by age group before (A) and after (B) inverse probability of treatment weighting (IPTW).

Before IPTW, the older group had higher 5-, 10-, and 15-year cumulative incidences of NRD (31, 35, and 37% vs. 21, 27, and 30%; P < 0.001, Fig. 2A), SMNs (11, 14, and 15% vs. 8.4, 10, and 11%; P = 0.006, Fig. 2B), CVDs (4.2, 7, and 8.7% vs. 0.5, 1.5, and 2.7%; P < 0.001, Fig. 2C), and other causes (8.7, 19, and 25% vs. 3.9, 8.1, and 11%; P < 0.001, Fig. 2D) than the younger group.

Fig. 2

Comparison of cumulative incidences of NRDs (A), SMNs (B), CVDs (C), and other causes (D) between the older and younger groups at 5, 10, and 15 years before IPTW. IPTW, inverse probability of treatment weighting; NPC, nasopharyngeal carcinoma; NRD, NPC-related deaths; SMN, secondary malignant neoplasm; CVD, cardiovascular disease.

After IPTW, the older group still had higher 5-, 10-, and 15-year cumulative incidences of NRD (30, 34, and 38% vs. 21, 27, and 30%; P < 0.001, Fig. 3A), CVDs (4.1, 7.2, and 8.8% vs. 0.5, 1.8, and 3.0%; P < 0.001, Fig. 3C), and other causes (8.3, 17, and 24% vs. 4.1, 8.7, and 12%; P < 0.001, Fig. 3D) than the younger group. However, cumulative incidences of SMNs were comparable between the two groups (P = 0.100, Fig. 3B).

Fig. 3

Comparison of cumulative incidences of NRDs (A), SMNs (B), CVDs (C), and other causes (D) between the older and younger groups at 5, 10, and 15 years after IPTW. IPTW, inverse probability of treatment weighting; NPC, nasopharyngeal carcinoma; NRD, NPC-related deaths; SMN, secondary malignant neoplasm; CVD, cardiovascular disease.

Patients based on age groups and spline curve analysis

All patients were categorized into seven age groups: 18–30, 30–40, 40–50, 50–60, 60–70, 70–80, and 80+ years old (Supplementary Table 2). Comparisons of OS among these groups showed significant differences, with the 18–30 age group demonstrating the longest OS and the 80+ age group the shortest (Fig. 4). Additionally, we performed spline regression analysis for the entire cohort and for each COD category, including NRD, SMN, CVD, and other causes. The risk of death from all causes (Supplementary Fig. 2), NRDs (Supplementary Fig. 3A), SMNs (Supplementary Fig. 3B), and other causes (Supplementary Fig. 3D) increased continuously with age. However, CVD-related mortality (Supplementary Fig. 3C) showed a distinct pattern: the risk remained similar up to the age of 55, after which it increased significantly.

Fig. 4

Survival analysis based on different age subgroups.

Supplementary Table 3 presents the distribution of SMNs in the cohort, categorized by age groups. Multivariable Cox regression analysis shows that sex, age, marital status, year of diagnosis, household income, grade, histology, and M stage are associated with SMNs (Supplementary Table 4).

Evaluation of the machine learning models

Age, metastasis, stage, marital status, histology, year of diagnosis, and household income were identified as factors affecting OS through LASSO analysis (Supplementary Fig. 4). The RF model demonstrated the highest C-index among all models, reaching 0.701 (Fig. 5). The six models were evaluated in the validation set by comparing their Brier scores and areas under the ROC curves (AUCs, Table 2; Fig. 6). Furthermore, the DCA (Supplementary Fig. 5A-C) and calibration curves (Supplementary Fig. 5D) for the RF model showed strong predictive accuracy in estimating 3-, 5-, and 10-year survival rates.

Fig. 5

Concordance index ranking graph.

Table 2 Performance comparison of machine learning models.

Fig. 6

Receiver operating characteristic curves for predicting 3-, 5-, and 10-year survival based on six different models: Cox regression (A), DT (B), RF (C), GBM (D), SVM (E), and XGBoost (F). ROC, receiver operating characteristic; DT, decision tree; RF, random forest; GBM, gradient-boosting machine; SVM, support vector machine; XGBoost, eXtreme Gradient Boosting.

Model interpretation

The time-dependent variable importance bar plots revealed that age had the greatest influence on 3-, 5-, and 10-year survival, followed by metastasis and tumor stage (Fig. 7). PDPs further indicated that advanced stages, increased age, and the presence of distant metastasis were associated with worse survival (Supplementary Fig. 6). Likewise, SHAP value-based box plots revealed a strong association between older age and reduced survival rates (Supplementary Fig. 7).

Fig. 7

Time-dependent variable importance bar plots display the ranking of significant features affecting the 3-, 5-, and 10-year survival.

Discussion

The present study comprehensively explored the influence of age on the prognosis and CODs in NPC patients undergoing CRT. Although previous studies have highlighted age as a critical factor in NPC prognosis, only a handful of reports have examined its impact using detailed and large-scale analysis19,20. By leveraging data from the SEER database and applying advanced machine learning models, the current study provides new insights into the role of age in determining survival outcomes and CODs in NPC patients21. This study significantly adds to the existing knowledge by offering a nuanced understanding of age-related prognostic factors and how they can be integrated into more personalized treatment approaches for NPC patients.

Our findings showed that younger patients had longer OS than their older counterparts both before and after IPTW adjustment. This difference in survival may be attributed to several age-related factors, such as a reduced capacity to tolerate aggressive treatment, the presence of multiple comorbidities, and a diminished immune response in older patients. The impact of age was further supported by time-dependent variable importance plots, which identified age as the most influential predictor of survival, followed by distant metastasis and tumor stage. This underscores the need to consider age as a key factor when evaluating the prognosis and management strategies for patients with NPC22,23.

Moreover, older patients exhibited higher cumulative incidences of NRDs and deaths from other causes, highlighting the complexity of managing NPC in an aging population. Further analysis of CODs revealed that older NPC patients also had a higher incidence of death from CVDs than their younger counterparts, whereas the higher incidence of SMN-related deaths observed before IPTW was no longer significant after adjustment. This finding is consistent with the notion that older patients are more vulnerable to non-cancer-related causes of death, which may be exacerbated by the strain of cancer treatments. These findings underscore the need for a multidisciplinary approach that addresses not only the cancer itself but also the overall health and comorbid conditions of older patients.

The use of machine learning models in this study allowed for a more precise identification of factors influencing survival outcomes in NPC patients24. LASSO regression identified age, metastasis, stage, marital status, histology, year of diagnosis, and household income as significant predictors of OS. The RF model showed the highest predictive accuracy among all models, with a C-index of 0.701. The superior performance of the RF model is likely due to its ability to handle large datasets with complex, nonlinear relationships between variables, making it an ideal tool for predicting survival outcomes in NPC patients25,26,27,28. DCA and calibration curves further confirmed the strong predictive performance of the RF model, particularly in estimating 3-, 5-, and 10-year survival.

PDPs and SHAP value-based analyses provided additional insights into the impact of age and other factors on survival. Advanced tumor stages, older age, and the presence of metastasis were all associated with poorer survival outcomes. These findings highlight the importance of integrating machine learning models into clinical decision-making to improve risk stratification and tailor treatment strategies for NPC patients29,30.

Nonetheless, this study has several limitations. First, the retrospective nature of the study and the use of the SEER database may introduce potential biases related to data accuracy and completeness. Second, while machine learning models provide valuable insights, they require further validation in broader and independent populations to ensure their generalizability. The lack of an external validation cohort in this study limits the robustness of the predictive models and their applicability to diverse clinical settings. Finally, other factors not included in this analysis, such as treatment modality, lifestyle, and environmental influences, may also influence NPC prognosis and thus warrant further investigation.

Conclusions

In summary, this study has demonstrated that age plays a critical role in the prognosis of patients with NPC undergoing CRT, with younger patients demonstrating significantly better survival outcomes than older patients. Machine learning models, particularly the RF model, proved to be powerful tools for predicting survival and identifying key prognostic factors. These findings emphasize the importance of age-specific management strategies and personalized treatment approaches in NPC patients, especially for older individuals who may benefit from enhanced cardiovascular monitoring and closer follow-up care. Future research should further validate these models and explore additional factors that may influence the survival of patients with NPC.