Stable Cox regression for survival analysis under distribution shifts

Fan, Shaohua; Xu, Renzhe; Dong, Qian; He, Yue; Chang, Cheng; Cui, Peng

doi:10.1038/s42256-024-00932-5


Article
Open access
Published: 13 December 2024


Nature Machine Intelligence volume 6, pages 1525–1541 (2024)


Abstract

Survival analysis aims to estimate the impact of covariates on the expected time until an event occurs, which is broadly utilized in disciplines such as life sciences and healthcare, substantially influencing decision-making and improving survival outcomes. Existing methods, usually assuming similar training and testing distributions, nevertheless face challenges with real-world varying data sources, creating unpredictable shifts that undermine their reliability. This urgently necessitates that survival analysis methods should utilize stable features across diverse cohorts for predictions, rather than relying on spurious correlations. To this end, we propose a stable Cox model with theoretical guarantees to identify stable variables, which jointly optimizes an independence-driven sample reweighting module and a weighted Cox regression model. Through extensive evaluation on simulated and real-world omics and clinical data, stable Cox not only shows strong generalization ability across diverse independent test sets but also stratifies the subtype of patients significantly with the identified biomarker panels.


Main

Survival analysis is a subfield of statistics that assesses the impact of covariates on the time until an event of interest occurs, widely used across various key fields such as life science to inform decision-making and predict outcomes. Among popular survival analysis methods^1,2,3,4, the Cox proportional hazards (PH) model⁵ is the most prominent historically owing to its flexibility in handling censored data, accommodating a wide range of covariates without requiring the specification of the underlying survival distribution. While existing survival analysis methods show promising results under the assumption that training and test data share similar distributions, challenges arise when this assumption does not hold. Distribution shifts are inevitable in healthcare scenarios owing to training and test data that may be collected from different centres, standards of quantification method and subpopulation heterogeneity⁶. For instance, in the healthcare area, the survival data for certain diseases are plentiful in hospitals in developed areas but scarce in less developed regions. Commonly, a survival model is developed using the abundant data collected from developed areas and then applied to regions where data are lacking⁷. Distribution shifts inevitably occur between the two types of area owing to the high heterogeneity among populations and the differing treatment plans prevalent in these regions. More specifically, in oncology, prognostic markers play a key role in patient management and decision-making, and the identification of them is one of the major objectives in clinical research. However, we often observe that the same biomarker has been reported to show different prognostic values in different studies. For example, in studies focusing on Chinese patients with hepatocellular carcinoma (HCC), the prognostic value of epithelial cell adhesion molecule (EPCAM) expression in tumour tissues, as determined by immunohistochemistry, has shown variable outcomes⁸. 
One study identified EPCAM expression as a predictor of good prognosis⁹, whereas another study found that high levels of EPCAM expression were linked to poor prognosis¹⁰. We also conducted univariate Cox regression analysis on two HCC transcriptome cohorts^11,12. As shown in Fig. 1a (left), consistent with the literature, there was limited overlap in the genes identified with the same prognostic value by both cohorts. Moreover, a few genes even showed completely opposing prognostic predictive values. From the data perspective, the inconsistent relationship between these biomarkers with prognosis could be caused by the distribution shifts of covariates or the real functional relationship between genes and prognosis. We generally assume that the true functional relationship a gene has with the prognosis of patients with a specific cancer type is stable and would not change across cohorts^13,14,15,16. In contrast, covariate distribution could easily change owing to the expression level of some kinds of genes in different populations being different¹⁷. We further visualize the distribution of covariates for these two cohorts in Fig. 1a (right). It is evident that there is notable variance in the covariate distributions of the two cohorts, indicating the presence of distribution shifts. Distributional shifts pose serious challenges to survival analysis, potentially leading to serious declines in performance when high-risk factors are not accurately identified. The principal challenge in combating distribution shifts lies in identifying stable variables that maintain a consistent relationship with the outcome across different cohorts. It is a highly non-trivial and long-standing unsolved problem to discover such stable variables owing to the complicated time-to-event nature of survival data and correlation-driven mechanism of existing survival analysis methods^18,19. 
Consequently, current methods might blindly learn the misleading patterns from spurious correlations present in the training set. However, such correlations are unstable and easily changed in the test set, posing a considerable risk when applying the trained model to new cohorts.

**Fig. 1: Illustrations of the distribution shifts problem, particularly covariate shifts, in survival analysis.**

For instance, as shown in Fig. 1b, the distributions of the covariates differ among cohorts or subpopulations, where the ‘batch effect’ is one of the major causes of heterogeneity among cohorts²⁰. We could assume that the shifts between cohorts or subpopulations are mainly caused by a subset of the covariates (that is, unstable covariates, detailed in Assumption 3 in Methods), where these unstable covariates may have spurious correlations with other stable covariates owing to selection bias. If we train a model on a particular cohort, the correlation-driven nature of current survival analysis methods substantially increases the likelihood that it will capture the spurious correlations that are specific to that cohort. Therefore, applying the identified high-risk factors as biomarkers to an unknown population carries a considerable risk of serious consequences such as wrong treatment assignment. The unacceptable risk in such high-stakes applications highlights the critical need for robust survival analysis models. These models must be capable of identifying stable features that can adapt to shifts in distribution.

The most common practice to enhance the ability of Cox PH to identify the most relevant features related to the outcome variable involves incorporating sparsity norms, including lasso²¹, ridge regression²², elastic net²³ and the smoothly clipped absolute deviation penalty²⁴. While these methods have achieved success by promoting sparsity in the coefficients, they handle only scenarios without model misspecification²⁵ and cannot distinguish stable from unstable variables when the model is misspecified. Stable learning^26,27 is a branch of machine learning methods that brings causality into learning methods, aiming to bridge the gap between the tradition of precise modelling in causal inference and black-box approaches from machine learning. Benefiting from the theoretical guarantees provided by causal-inference methods, stable learning aims to identify stable causal relationships rather than easily changeable correlations when modelling the relationships between covariates and outcomes. As a result, these kinds of methods^28,29,30,31,32 show promising results on generalization, interpretability and fairness. Nevertheless, stable learning methods cannot yet be applied to complex time-to-event data.

In this Article, we propose a stable Cox regression model designed to identify stable variables for prediction, thereby ensuring strong generalization performance under distribution shifts based on these selected variables. Our approach aims to eliminate spurious correlations among covariates and focus on using stable variables for predictions. The model operates in two stages: independence-driven sample reweighting and weighted Cox regression. During the independence-driven sample reweighting stage, we employ a module to learn subject weights that render the covariates mutually independent. In the subsequent weighted Cox regression stage, subjects are reweighted using these learned weights, leading to a weighted partial log-likelihood loss. This loss effectively isolates the effect of each variable during optimization. Theoretically, we prove that under some mild assumptions, and even with model misspecification, our stable Cox model exclusively relies on stable variables for predictions. This means that coefficients for unstable variables will be zero, provided that the learned sample weights maintain strict mutual independence among all covariates. We validate the effectiveness of the proposed method on both simulated data and two kinds of critical real-world applications: patient prognosis prediction based on omics data or clinical features. The extensive results demonstrate the generalization ability of the proposed method on unseen test cohorts or subpopulations. Notably, the coefficients derived from the method show remarkable stability and interpretability in downstream tasks. The learned coefficients can be used to discover potential biomarker panels and stratify subgroups (subtypes) with significantly different survival risks. Such applications are fundamental in guiding treatment decisions and the development of targeted drugs, which must guarantee stability in the face of population heterogeneity.

Results

General framework of stable Cox regression model

Let $X=({X}_{1},{X}_{2},\ldots ,{X}_{p})\in {{\mathbb{R}}}^{p}$ be the p-dimensional subjects’ features, T ∈ [0, ∞) be the possibly censored failure times, and δ ∈ {0, 1} be the indicators for censoring. Suppose we observe n independent and identically distributed (iid) survival samples ${\left\{{x}^{(i)},{t}^{\,(i)},{\delta }^{(i)}\right\}}_{i = 1}^{n}$ drawn from a training distribution P^tr on the random variables X, T and δ, where x⁽ⁱ⁾ and t⁽ⁱ⁾ denote the features (covariates) and the possibly censored failure time of subject i, respectively. Let P^te denote the unknown test distribution.

In survival analysis problems involving multiple covariates, typically only a small subset notably influences the survival outcomes, while the remaining covariates may represent noise or show spurious correlations with the outcomes that are unstable across unseen testing distributions. For example, in omics data, some tumour genes play a causal role: high expression of such a gene leads to aggressive forms of certain types of cancer, as in ERBB2 (Erb-B2 receptor tyrosine kinase 2, also known as HER2)-positive breast cancers, which tend to be more aggressive³³. However, the expression of some genes (for example, genes for lactase persistence) may highly correlate with the location where the person lives¹⁷, and the development level of the hospital in their city may determine their prognosis. The relationship of these genes with their prognosis is unstable across cities. In this way, the two kinds of genes would be spuriously correlated owing to the location. To formalize this scenario, we make a structural assumption on the covariates by splitting them into stable variables S and unstable variables V, where the failure time T depends only on the stable variables S. Stable variables are real predictors of the outcome, while unstable variables are associated with the outcome through their correlation with stable variables, which can vary across different study populations. Such an assumption can be guaranteed by T ⊥ V|S (detailed discussion in Assumption 3 in Methods).

In covariate-shift scenarios, we usually suppose that the probability P(T|X) remains unchanged while P(X) may change between training and test sets. For example, in survival analysis, some genes consistently show a stable trend associated with unfavourable prognosis across different cohorts, while others do not show such a stable trend. As shown in Fig. 1c, owing to selection bias, there may exist spurious correlations between stable covariates S and unstable covariates V, resulting in changes in P(X). This unexpected correlation would mislead the model to learn the spurious correlation between V and T. This correlation P(T|V) is unstable across possible testing distributions, which would result in the degradation of generalization performance. To get rid of the unstable correlation and capture the stable relationships between S and T, we propose to learn a group of sample weights to remove the correlations among covariates in observational data, and then optimize the Cox model in the weighted distribution. It is theoretically guaranteed that the stable Cox model utilizes only stable variables for prediction (detailed in ‘Theoretical results’ in Methods).

Our stable Cox regression model consists of two stages. In the first stage, we propose to utilize a sample reweighting module to learn sample weights so that X are statistically independent in the weighted distribution (Fig. 2a). In the implementation, we utilize the typical independence-driven algorithm, namely, Sample Reweighted Decorrelation Operator (SRDO)³¹. A previous study³¹ proposed to learn weighting function w(X) by estimating the density ratio of the training distribution P^tr and a specific weighted distribution $\tilde{P}$. They define $\tilde{P}$ through a process of random resampling across each feature, resulting in $\tilde{P}({X}_{1},{X}_{2},\ldots ,{X}_{p})=\mathop{\prod }\nolimits_{j = 1}^{p}{P}^{{\mathrm{tr}}}({X}_{j})$. Consequently, the weighting function w(X) is given by

$$w(X\,)=\frac{\tilde{P}(X\,)}{{P}^{\,{\mathrm{tr}}}(X\,)}=\frac{\mathop{\prod }\nolimits_{j = 1}^{p}{P}^{{\mathrm{tr}}}({X}_{j})}{{P}^{{\mathrm{tr}}}({X}_{1},{X}_{2},\ldots ,{X}_{p})}.$$

(1)

The density ratio in equation (1) can be effectively addressed through class probability estimation problems³⁴. As a result, SRDO can guarantee statistical independence between covariates X if the density ratio is estimated accurately.
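The idea behind equation (1) can be illustrated with a minimal sketch. It is not the SRDO implementation: the logistic classifier, the quadratic feature map, the gradient-descent settings and every function name below are assumptions made for illustration. Permuting each column independently gives a sample from $\tilde{P}$. A classifier is then trained to separate original rows from permuted rows, and with balanced classes its odds p/(1 − p) estimate $\tilde{P}(X)/{P}^{{\mathrm{tr}}}(X)$.

```python
import numpy as np

def resample_columns(X, rng):
    """Permute each column independently: marginals are kept, dependence is broken."""
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

def quadratic_features(X):
    """Bias, linear and pairwise-product terms (an assumed feature map)."""
    n, p = X.shape
    cols = [np.ones(n)] + [X[:, j] for j in range(p)]
    cols += [X[:, j] * X[:, k] for j in range(p) for k in range(j, p)]
    return np.column_stack(cols)

def fit_logistic(Z, y, lr=0.1, steps=2000):
    """Plain gradient descent on the mean logistic loss."""
    theta = np.zeros(Z.shape[1])
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        theta -= lr * Z.T @ (prob - y) / len(y)
    return theta

def decorrelation_weights(X, seed=0):
    """Estimate w(X) = P_tilde(X) / P_tr(X) by class probability estimation."""
    rng = np.random.default_rng(seed)
    X_tilde = resample_columns(X, rng)
    Z = quadratic_features(np.vstack([X, X_tilde]))
    y = np.concatenate([np.zeros(len(X)), np.ones(len(X_tilde))])  # 1 = P_tilde
    theta = fit_logistic(Z, y)
    w = np.exp(quadratic_features(X) @ theta)  # odds p / (1 - p), balanced classes
    return w / w.mean()                        # normalise to mean 1

def weighted_corr(a, b, w):
    """Pearson correlation under sample weights w."""
    w = w / w.sum()
    ma, mb = np.sum(w * a), np.sum(w * b)
    cov = np.sum(w * (a - ma) * (b - mb))
    return cov / np.sqrt(np.sum(w * (a - ma) ** 2) * np.sum(w * (b - mb) ** 2))

# Toy data: two correlated Gaussian covariates (correlation about 0.4).
rng = np.random.default_rng(1)
x1 = rng.normal(size=2000)
x2 = 0.4 * x1 + np.sqrt(1 - 0.4 ** 2) * rng.normal(size=2000)
X = np.column_stack([x1, x2])

w = decorrelation_weights(X)
rho_before = np.corrcoef(x1, x2)[0, 1]
rho_after = weighted_corr(x1, x2, w)
print(f"correlation before: {rho_before:.3f}, after reweighting: {rho_after:.3f}")
```

The toy correlation is kept moderate on purpose: for strongly dependent covariates the true density ratio becomes heavy-tailed, and a few subjects can dominate the weighted sample.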

**Fig. 2: The framework of our proposed method.**

In the second stage, as shown in Fig. 2b, we propose to reweight the partial log-likelihood loss of the Cox PH model by the learned weights as follows:

$$\begin{array}{rlr}{{\mathcal{L}}}_{w}(\beta)&=\mathop{\sum}\limits_{i=1}^{n}{\delta }^{(i)}\log {L}_{w}^{(i)}(\,\beta )=\mathop{\sum }\limits_{i=1}^{n}{\delta }^{(i)}w\left({x}^{(i)}\right)\log \frac{\lambda \left({t}^{(i)};{x}^{(i)}\right)}{\sum _{j:{t}^{(j)}\ge {t}^{(i)}}w\left({x}^{(j)}\right)\lambda \left({t}^{(j)};{x}^{(j)}\right)}&\\ &=\mathop{\sum }\limits_{i=1}^{n}{\delta }^{(i)}w\left({x}^{(i)}\right)\log \frac{{\lambda }_{0}\left({t}^{(i)}\right)\exp \left({\beta }^{T}{x}^{(i)}\right)}{\sum _{j:{t}^{(j)}\ge {t}^{(i)}}w\left({x}^{(j)}\right){\lambda }_{0}\left({t}^{(j)}\right)\exp \left({\beta }^{T}{x}^{(j)}\right)}\\ &=\mathop{\sum }\limits_{i=1}^{n}{\delta }^{(i)}w\left({x}^{(i)}\right)\log \frac{\exp \left({\beta }^{T}{x}^{(i)}\right)}{\sum _{j:{t}^{(j)}\ge {t}^{(i)}}w\left({x}^{(j)}\right)\exp \left({\beta }^{T}{x}^{(j)}\right)}\end{array}$$

(2)

where β is the vector of covariate coefficients to be learned and λ₀(t⁽ⁱ⁾) is the value of the baseline hazard function at time t⁽ⁱ⁾. ${L}_{w = 1}^{(i)}(\,\beta )$ denotes the unweighted likelihood that the event is observed for subject i at time t⁽ⁱ⁾. The denominator sums over all subjects j for whom the event has not yet occurred by time t⁽ⁱ⁾ (including subject i itself). In the process of likelihood maximization, the model’s predictive probability that the event occurs for subject i before any subject j is optimized. ${L}_{w}^{(i)}(\,\beta )$ is the weighted version of the likelihood of each subject, in which each subject effectively occurs w(X) times. Under mild assumptions, we can prove that the estimated coefficients of unstable covariates ${\hat{\beta }}_{w}(V)$ approach zero with high probability (detailed in ‘Theoretical results’ in Methods).
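As a minimal sketch, the last line of equation (2) can be evaluated directly. The function name and the toy data are hypothetical, and the optimizer that maximizes this objective over β is not shown. The risk set of subject i contains every subject j with t⁽ʲ⁾ ≥ t⁽ⁱ⁾, so tied times are all included.

```python
import numpy as np

def weighted_partial_log_likelihood(beta, X, t, delta, w):
    """Last line of equation (2):
    sum_i delta_i * w_i * log( exp(beta'x_i) / sum_{j: t_j >= t_i} w_j exp(beta'x_j) )."""
    eta = X @ beta                      # linear predictors beta^T x
    total = 0.0
    for i in range(len(t)):
        if delta[i] == 1:               # only observed events contribute
            at_risk = t >= t[i]         # risk set, includes subject i itself
            log_denominator = np.log(np.sum(w[at_risk] * np.exp(eta[at_risk])))
            total += w[i] * (eta[i] - log_denominator)
    return total

# Toy check: with beta = 0 and unit weights, each event contributes
# -log(size of its risk set), here -log(3) - log(2) - log(1) = -log(6).
X = np.array([[0.5], [-1.0], [2.0]])
t = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 1, 1])
w = np.ones(3)
value = weighted_partial_log_likelihood(np.zeros(1), X, t, delta, w)
print(value)
```

Setting all weights to one recovers the ordinary Cox partial log-likelihood, which makes the unweighted model a convenient sanity check for any implementation of the weighted loss.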

Evaluation on the simulated survival data

Experimental set-up

To evaluate stable Cox in a controllable manner, we generate three kinds of survival data with different hazard functions that usually occur in real-world applications, following the generation process of refs. ^25,35. In addition, to simulate the spurious correlation between stable and unstable variables, we design the sample selection process to introduce such correlation. In the simulation study, we are aware of which variables are stable or unstable, allowing us to rigorously assess whether our method successfully identifies stable variables. The detailed data generation process and experimental set-up are in ‘Simulated survival data’ in Methods. Baselines are introduced in ‘Baseline approaches’ in Methods.

Results

The results of the full setting are depicted in Fig. 3a. In general, all the baselines demonstrate satisfactory performance when the test bias rate r_test ∈ (1, 3]; however, their performances drop dramatically when r_test ∈ [−3, −1). This is because the correlations between V_b and T are similar between training data (training bias rate r_train = 1.7) and test data when r_test > 1 and that correlation can be exploited in prediction. In such cases, V is useful to proxy for the misspecification form and omitted function of S. However, when r_test < −1, the correlation between V_b and T reverses compared with the training set, leading to excessive instability when V_b is used for predictions. Figure 3c depicts the box plots of each method on the Cox-exp dataset. Our stable Cox method shows low variance across testing sets. Importantly, compared with the Cox PH model, our method remarkably improves the average concordance index (C-index), the worst-case performance across testing environments and the variance across environments. As our model is built on the Cox PH model, we could safely attribute the notable improvement to the seamless integration of the proposed framework with the Cox PH model.
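For readers unfamiliar with the C-index used above, a minimal sketch follows. It assumes Harrell's definition, which the text does not state explicitly, and the function name is hypothetical. A pair of subjects is comparable when the one with the shorter time had an observed event. The pair is concordant when that subject also has the higher predicted risk, and ties in risk count as one half.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index: fraction of comparable pairs ordered correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # a censored subject cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:       # subject i failed before subject j
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
reversed_order = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
print(perfect, reversed_order)
```

A value of 0.5 corresponds to random ordering, so a model whose spurious correlation reverses at test time can fall below 0.5.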

Furthermore, we demonstrate the significance of the coefficients derived from our model, highlighting the use of P values for each coefficient as a tool for feature selection. Figure 3d summarizes the −log₂P value of the coefficients of S and V for stable Cox and Cox PH. The P value is derived from a two-sided Wald test, with the null hypothesis being that the coefficient (β) is zero. The larger −log₂P values for stable Cox (S) compared with stable Cox (V) suggest that coefficients for stable variables S are more statistically significant than those for unstable variables V. However, the −log₂P value of V_b of the Cox PH model is substantially higher than those for stable variables S, demonstrating that Cox PH regards V_b as a highly important feature in predictions. On the basis of this, we can leverage the P-value rankings to select the top-N significant features, an approach particularly beneficial in applications requiring minimal feature usage for predictions. For example, measuring key genes to predict patient survival is more cost-effective than assessing an entire genome. Figure 3b presents the outcomes of this feature selection strategy, where we use the top-five features selected by each method to train a new corresponding prediction model. Note that the new prediction model for stable Cox is Cox PH. It yields results akin to the full model setting (as shown in Fig. 3a). Moreover, as shown in Fig. 3e, when the number of selected features exceeds five, the stability of the model drops significantly. These phenomena demonstrate that our model could learn distinct P values between S and V, enabling a reduction in the necessary number of covariates.
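The P-value ranking described here can be sketched as follows. The sketch assumes that a coefficient estimate and its standard error are already available for each feature. How the standard errors are obtained under sample reweighting is not specified in this section, and the function names and toy numbers are hypothetical.

```python
import math

def wald_neg_log2_p(beta_hat, std_err):
    """Two-sided Wald test of H0: beta = 0, returned as -log2(P)."""
    z = beta_hat / std_err
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided standard normal tail
    return -math.log2(p)

def top_n_features(names, betas, std_errs, n):
    """Rank features by -log2(P), largest first, and keep the top n."""
    scores = {name: wald_neg_log2_p(b, s) for name, b, s in zip(names, betas, std_errs)}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical coefficients: S1 and S2 are strongly supported, V1 is not.
names = ["S1", "S2", "V1"]
betas = [0.80, -0.60, 0.05]
std_errs = [0.10, 0.15, 0.12]
selected = top_n_features(names, betas, std_errs, n=2)
print(selected)
```

Working on the −log₂P scale only changes the units of the ranking: a larger value still means a smaller P value.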

In addition to prediction performance, we illustrate the effectiveness of our method in weakening spurious correlation in Fig. 3f. We calculate $| | \hat{\beta }(V)| {| }_{1}$ to quantify the residual correlation between unstable variables V and the outcome T. Across varying sample sizes, the stable Cox model consistently shows lower residual correlations. With a larger sample size, $| | \hat{\beta }(V)| {| }_{1}$ will be reduced (Fig. 3f, left), and the gap between our method and Cox PH will be larger (Fig. 3f, right). This finding confirms that with a larger number of samples, the independence module more effectively reduces statistical dependence, thereby enhancing model stability. More experiments on the sensitivity of the independence assumption among covariates after reweighting can be found in Supplementary Information, section C.2.

Evaluation on multiple cancer transcriptome survival data

Experimental set-up

Transcriptomics is a crucial and rapidly developed field in biology, providing extensive information on disease states, biomarker discovery and new drug development through the analysis of gene expression³⁶. Scientists can identify several key genes associated with the prognostic outcome for a specific disease, which are referred to as prognostic biomarkers. These biomarkers are instrumental in predicting a patient’s prognosis, enabling a more tailored and effective treatment approach. By focusing on these disease-specific genes, targeted therapies can be developed, offering a more personalized and potentially more effective treatment strategy. To comprehensively evaluate our method, we have constructed the following transcriptome survival datasets: HCC transcriptome dataset, breast cancer transcriptome dataset and melanoma transcriptome dataset, where each dataset has one training cohort and three independent testing cohorts. The data details and experimental set-up are in ‘Baseline approaches’ in Methods.

Results

The performance of all the methods on each testing cohort and average performance across testing cohorts of each dataset are reported in Fig. 4. First, we observe that the Cox PH model outperforms the corresponding univariate Cox PH model across three datasets in terms of overall performance. The major difference between them is that the Cox PH model considers the relationship between genes to select the final biomarker panel, whereas the univariate Cox PH considers only the importance of each gene separately. This indicates that it is necessary to consider the relationship between genes to discover the biomarker panel. Second, for the HCC transcriptome dataset, parametric methods show competitive results on the Fujimoto et al. cohort³⁷ and the Roessler et al. cohort¹², but they fail on the Hoshida et al. cohort³⁸ and show high standard error. The reason is that the Fujimoto et al. cohort and the Roessler et al. cohort may share a similar data distribution with the training set and the parametric assumption may align well with this distribution. However, the Hoshida et al. cohort may have larger distribution shifts with the training set. Nevertheless, our stable Cox model consistently shows a high average C-index and low standard error on the three independent testing cohorts, indicating robustness and reliability in various unseen testing conditions. Third, our model outperforms Cox PH across three datasets by a large margin (from 10.4% to 13.9% overall improvements in the top-10 panel), indicating that making covariates independent could help the model get rid of spurious correlation and focus on relevant variables. Fourth, in the melanoma transcriptome dataset, it is shown that the performance deteriorates when the number of selected genes exceeds ten. This decline may occur because the optimal number of biomarkers varies across different diseases³⁹. 
Incorporating too many genes into the biomarker panel can introduce noise or unstable features, adversely affecting performance. This phenomenon is consistent with the results in our simulation experiments (Fig. 3e), that is, involving more irrelevant features in the model will damage the stability of the model.

**Fig. 4: Performance evaluation for prognostic biomarker discovery task on transcriptome data.**

Accurately inferring survival subgroups among patients with cancer can significantly aid in clinical decision-making and enhance the patient’s survival outcomes. We use the median value of ${\hat{\beta }}^{T}X$ in each cohort to divide the patients into the high-risk group (above median) and the low-risk group (below median). The difference between the survival curves of the two subgroups is measured by the P value of a two-sided log-rank test. The hazard ratio (HR) is from the univariate Cox regression of subgroup separation results. The Kaplan–Meier plots of subgroups stratified by Cox PH and stable Cox on three testing cohorts of the breast cancer transcriptome dataset are shown in Fig. 5a. The P value of stable Cox is lower than that of Cox PH on the corresponding cohort and lower than the significance level 0.05, and the HR of stable Cox is significantly larger than that of Cox PH. This demonstrates that the genes screened by our model can assist in stratifying patients into respective subgroups and offer precise prognostic stratification and treatment strategies. The Kaplan–Meier plots of the HCC transcriptome and melanoma transcriptome datasets are shown in Supplementary Fig. C5. Moreover, we conduct univariate Cox regression analyses on the subgroups divided by the key clinical indicators (that is, age, estrogen receptor (ER) status encoded by the ESR1 and ESR2 genes, HER2 status and progesterone receptor (PR) status encoded by the PGR gene) and report their HR value and the corresponding 95% confidence interval (CI) in Fig. 5b. A higher HR value means the method could identify high-risk and low-risk subgroups well under this clinical variable subgroup. Overall, compared with the Cox PH model, the stable Cox model demonstrates higher HR values in most clinical variable subgroups. This suggests that the stable Cox model offers superior prognostic prediction performance in subgroup analyses. 
It is worth noting that in the HER2 status positive and PR status negative subgroup, stable Cox can also effectively identify those with a worse prognosis among these patients commonly regarded as having a high malignant grade and unfavourable prognosis in clinical practice⁴⁰. This contributes to improving the precision of patient stratification, facilitating enhanced patient management and treatment.
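The stratification procedure above (a median split on the risk score followed by a two-sided log-rank test) can be sketched in a few lines. This is an illustrative implementation of the standard two-group log-rank statistic with a chi-squared reference distribution on one degree of freedom. The function names and toy cohorts are hypothetical, and the HR estimation step is not shown.

```python
import math
from statistics import median

def median_split(scores):
    """1 = high-risk (score above the cohort median), 0 = low-risk."""
    m = median(scores)
    return [1 if s > m else 0 for s in scores]

def logrank_p_value(times, events, groups):
    """Two-sided log-rank test comparing group 1 with group 0."""
    obs_minus_exp, variance = 0.0, 0.0
    for tk in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for t in times if t >= tk)                           # at risk, all
        n1 = sum(1 for t, g in zip(times, groups) if t >= tk and g == 1)
        d = sum(1 for t, e in zip(times, events) if e and t == tk)     # events, all
        d1 = sum(1 for t, e, g in zip(times, events, groups)
                 if e and t == tk and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2.0))   # chi-squared tail, 1 degree of freedom

# Toy cohort A: the high-risk group fails first, so the curves separate clearly.
times = [float(k) for k in range(1, 21)]
events = [1] * 20
p_separated = logrank_p_value(times, events, [1] * 10 + [0] * 10)

# Toy cohort B: the groups alternate, so the curves are nearly identical.
p_interleaved = logrank_p_value(times, events, [1, 0] * 10)
print(p_separated, p_interleaved)
```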

**Fig. 5: The analysis of the top-ten genes identified by Cox PH and stable Cox on the breast cancer transcriptome dataset.**

Furthermore, we compare the favourable/unfavourable consistency of the correlation of individual proteins of the top-ten important genes across different cohorts. The univariate Cox regression analysis was used to calculate the predictive ability for the survival of the identified genes in the training and three test cohorts, respectively. Genes that meet the following criteria are considered as prognosis-related genes: (1) HR larger than 1 and log-rank P value less than 0.05 for unfavourable; (2) HR smaller than 1 and log-rank P value less than 0.05 for favourable; (3) otherwise, no trend. The results are shown in Fig. 5c. As we can see, no genes screened by stable Cox show both favourable and unfavourable relationships with the survival outcome. In addition, 2 genes (WEE2-AS1 and RPGRIP1) screened by stable Cox show a high-frequency (greater than or equal to 75%) favourable relationship with survival outcome, and 4 genes (CLEC3A, LSG1, SRGAP2 and F2RL1) have a high-frequency unfavourable relationship across 4 cohorts. However, Cox PH has 3 genes showing both favourable and unfavourable relationships with the survival outcome, and 4 genes (MCEMP1, RPGRIP1, F2RL1 and CXCL2) show a 75% frequency favourable/unfavourable relationship with survival outcome. This phenomenon indicates that genes screened by our method also show more stable prognostic relationships at the individual gene level. Moreover, we noticed that three of the top-ten genes identified by both the Cox PH and stable Cox methods are the same (OR5M11, RPGRIP1 and F2RL1), while the others differ. The gene panel identified by our method demonstrates superior predictive performance (Fig. 4b), more significant stratification of subgroups (Fig. 5a,b), and greater consistency in distinguishing favourable and unfavourable genes across both training and test cohorts (Fig. 5c). 
These encouraging findings substantiate our method’s capability to identify promising candidate biomarkers for prognosis, which is crucial for guiding treatment decisions and developing targeted therapies.
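The three labelling criteria and the 75% frequency rule used in this comparison can be written down directly. In the sketch below the function names are hypothetical, and the per-cohort HR and P values are made-up inputs for illustration, not results from the study.

```python
def prognostic_trend(hr, p_value, alpha=0.05):
    """Label one gene in one cohort by its univariate HR and log-rank P value."""
    if p_value < alpha and hr > 1:
        return "unfavourable"
    if p_value < alpha and hr < 1:
        return "favourable"
    return "no trend"

def high_frequency_trend(trends, threshold=0.75):
    """Return the trend seen in at least `threshold` of the cohorts, else None."""
    for label in ("favourable", "unfavourable"):
        if trends.count(label) / len(trends) >= threshold:
            return label
    return None

def is_contradictory(trends):
    """True if a gene is favourable in one cohort and unfavourable in another."""
    return "favourable" in trends and "unfavourable" in trends

# Made-up (HR, P) pairs for one gene across a training cohort and three test cohorts.
cohort_stats = [(0.62, 0.010), (0.70, 0.030), (0.81, 0.200), (0.58, 0.004)]
trends = [prognostic_trend(hr, p) for hr, p in cohort_stats]
print(trends, high_frequency_trend(trends), is_contradictory(trends))
```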

Evaluation on lung and breast cancer clinical survival data

Experimental set-up

Clinical data provide abundant information to characterize patients, including patient demographics, disease stage, treatment methods and so on. This information has a strong correlation with the patient’s survival outcome. In this section, we conduct experiments on lung cancer data and breast cancer data. The data details and experimental set-up are in ‘Clinical survival data’ in Methods.

Results

Figure 6a shows the performance of various methods across six subpopulations for overall survival (OS) and disease-free survival (DFS). The results clearly show that the stable Cox model significantly surpasses the baseline methods in each subpopulation for both OS and DFS. Notably, stable Cox achieves a performance improvement of 4.5% and 17.7% over the Cox PH model in OS and DFS tasks, respectively. This demonstrates that the coefficients of our model can ensure stability across diverse subpopulations and tasks, markedly reducing the risk and bias when applied in real-world scenarios. Furthermore, the results of OS and recurrence-free survival (RFS) prediction of breast cancer data are shown in Fig. 6b. As we can see, stable Cox demonstrates superiority over Cox PH on both test cohort 1 and cohort 2 and shows good stability overall, regardless of whether it is for OS (6.58% improvement) or RFS (6.5% improvement) tasks. We find that the parametric methods achieve better results than Cox PH, demonstrating that their assumptions on survival time are closer to the underlying distribution. Nevertheless, our model could still outperform them while taking Cox PH as a base model. Following the procedure of the omics study, Kaplan–Meier plots of high-risk and low-risk subgroups separated by the Cox PH model or the stable Cox model for the breast cancer dataset are shown in Fig. 6c. The differences between the two subgroups generated by stable Cox on all test scenarios are quite significant (lower than the significance level of 0.05). Nevertheless, the variance of the P values or HR value of Cox PH is quite large. Notably, Cox PH is very significant on test cohort 1 of OS prediction, but the difference between the survival curves of the two subgroups is not significant. The phenomenon demonstrates that the coefficients learned by our model can be used as a reliable index to guide decision-making and improve the patient’s postoperative prognosis. 
The Kaplan–Meier plots of the lung cancer clinical dataset for OS and DFS outcomes are shown in Supplementary Figs. C8 and C9, respectively.

Furthermore, we present the top-ten clinical variables with the most significant coefficients of Cox PH and stable Cox in Fig. 6d. Oestrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) status are the most common biomarkers for breast cancer prognosis. Their joint status can stratify patients into subgroups with significantly different prognoses⁴⁰. As observed, these clinical indices are among the top-ten clinical variables identified by the stable Cox model, whereas the top-ten variables of the Cox PH model include only the PR status. This suggests that our model has a superior capability in discovering clinically significant biomarker panels.

Discussion

Stable Cox is a theoretically guaranteed model for dealing with distribution shifts and discovering biomarker panels in survival analysis. Compared with traditional survival analysis methods, stable Cox offers three advantages for survival time prediction and important variable identification. (1) Feature selection under covariate shift. Feature selection aims to construct a diagnostic or predictive model for a given regression or classification task by selecting a minimal-size subset of variables that yields the best performance⁴¹. Selecting significant features and their combinations for survival analysis could help discover biomarker panels, which is important for both decision-making and drug development. Stable Cox can be viewed as an embedded feature selection method under distribution shifts, which seeks to minimize the size of the selected feature subset while maximizing the prediction performance. (2) Causal implication. Recent studies have shown that data-driven approaches are often mistakenly used to draw causal conclusions, with neither their parameters nor their predictions inherently offering a causal interpretation^42,43,44,45. Therefore, the premise that data-driven predictive models yield reliable decisions for precision medicine is questionable. A promising and practical pathway is to build a predictive model that is located at the common ground between machine learning and causal inference²⁶. The key idea of our approach is to eliminate the spurious correlation among covariates. Viewed from the causal-inference perspective, the method treats each input variable in turn as the treatment and all remaining input variables as its corresponding confounders. The learned sample weights thus realize confounder balancing⁴⁶ globally, for whichever input variable acts as the treatment, through the independence between the treatment and its corresponding confounders. 
Therefore, the correlations learned by our model would approximately represent the causal effects of the covariates on survival probabilities, which are invariant across different domains. From this perspective, our model has inherent causal implications, representing a significant stride towards reliable survival analysis. (3) Easy to implement. The independence module supplies the weighted Cox model with a collection of sample weights. The additional time complexity of our method, compared with the standard Cox model, originates from the independence-driven sample reweighting module. This module can be optimized independently and has a low time complexity of ${\mathcal{O}}(tn(ph+hg+2g))$, where n is the sample size, p is the number of covariates, (h, g) are the dimensions of the hidden layers of the multilayer perceptron and t is the number of training iterations.

Methods

Preliminaries

In this subsection, we present notations, weighting function and the base model employed in our approach.

Notations

Let ${\mathcal{X}}$ denote the support of the feature set X. For a vector $x\in {\mathcal{X}}$, we define the following tensor powers. x^⊗0 represents the scalar value 1, x^⊗1 denotes the original vector x, and x^⊗2 denotes the matrix xx^T.

In the context of probabilistic expectations, ${{\mathbb{E}}}_{Q}[\cdot ]$ and ${{\mathbb{E}}}_{Q}[\cdot | \cdot ]$ are used to represent the expectation and conditional expectation under a distribution Q, respectively. To simplify notation, when referring to expectations under the training distribution P^tr, we omit the subscript and use ${\mathbb{E}}[\cdot ]$ and ${\mathbb{E}}[\cdot | \cdot ]$.

Weighting function

Consider the set ${\mathcal{W}}$, defined as the collection of weighting functions that satisfy the condition:

$${\mathcal{W}}=\{w:{\mathcal{X}}\to {{\mathbb{R}}}^{+}| {\mathbb{E}}[w(X)]=1\}.$$

(3)

For every $w\in {\mathcal{W}}$, the associated weighted distribution ${\tilde{P}}_{w}$ is uniquely determined by its probability density function:

$${\tilde{P}}_{w}(X\,)=w(X\,){P}^{{\mathrm{tr}}}(X\,).$$

(4)

To simplify notation, ${{\mathbb{E}}}_{w}[\cdot ]$ denotes the expectations under the distribution ${\tilde{P}}_{w}$. Furthermore, for any measurable function f(X), it holds that ${\mathbb{E}}[w(X\,)f(X\,)]={{\mathbb{E}}}_{w}[\;f(X\,)]$.

In addition, independence-driven sample reweighting algorithms focus on learning a subset ${{\mathcal{W}}}_{\perp }\subseteq {\mathcal{W}}$, rather than the entire set ${\mathcal{W}}$. A weighting function in ${{\mathcal{W}}}_{\perp }$ ensures that the features X are mutually independent in the corresponding weighted distribution ${\tilde{P}}_{w}$, that is

$${{\mathcal{W}}}_{\perp }\mathop{=}\limits^{\bigtriangleup }\left\{w\in {\mathcal{W}}| X\,\text{are statistically independent in}\,{\tilde{P}}_{w}\right\}.$$

(5)

Cox PH model

In survival analysis^5,47, the key variables T and δ are generated through the following process:

$$T=\min \left\{{T}^{\,{\rm{failure}}},{T}^{\,{\rm{censored}}}\right\}\quad \,\text{and}\,\quad \delta ={\mathbb{I}}\left[{T}^{\,{\rm{failure}}}\le {T}^{\,{\rm{censored}}}\right].$$

(6)

Here, T^failure denotes the time of failure (event occurrence), and T^censored represents the censoring time. Consequently, the observed variable T corresponds to the failure time T^failure if it occurs before the censoring time T^censored, and it equals the censoring time otherwise.

In the Cox PH model, also known as Cox regression⁵, the hazard function of the failure time T^failure, denoted as λ(u; X), measures the instantaneous rate at which events occur at time u, conditional on no occurrence before time u and on the covariates X. Mathematically, it is defined as: $\lambda (u;X)=\mathop{\lim }\nolimits_{h\to {0}^{+}}{P}^{{\mathrm{tr}}}({T}^{\,{\rm{failure}}}\le u+h| {T}^{\,{\rm{failure}}}\ge u,X)/h$. For an individual characterized by covariates X, the standard Cox model posits that the individual’s hazard function is of the form:

$$\lambda (u;X\,)={\lambda }_{0}(u)\exp \left({\beta }^{T}X\right).$$

(7)

Here, λ₀(u) represents the baseline hazard function, reflecting the time-dependent risk of an event when covariates are at their baseline levels. The vector β embodies the effect parameters, indicating how the hazard varies in response to the explanatory covariates. According to the PH assumption⁴⁸, there exists a multiplicative relationship between the covariates and the hazard. For example, in a scenario with constant coefficients, a drug treatment might consistently reduce the subject’s hazard by a factor of two at any time t, irrespective of variations in the baseline hazard.

Cox⁵ proposed estimating the parameter β by maximizing the log partial likelihood, which focuses solely on the impact of the covariates. Equivalently, writing L(β) for the negative log partial likelihood, the estimator minimizes L(β):

$$\begin{array}{rlr}L(\beta )&=-\mathop{\sum }\limits_{i=1}^{n}{\delta }^{(i)}\left({\beta }^{T}{x}^{(i)}-\log \left(\mathop{\sum}\limits_{{t}^{(j)}\ge {t}^{(i)}}\exp \left({\beta }^{T}{x}^{(j)}\right)\right)\right),&\\ \hat{\beta }&=\arg \mathop{\min }\limits_{\,\beta }L(\beta ).\end{array}$$

(8)

It is noteworthy that Breslow’s method provides an approach to estimate the baseline hazard function, enabling the full hazard function to be deduced as the product of the baseline hazard and the exponential term.
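
As a minimal sketch, the negative log partial likelihood of the Cox model (compare equation (8)) can be evaluated with NumPy as shown here. The function name and the toy data are illustrative assumptions rather than the authors' implementation, and tied event times receive no special treatment.

```python
import numpy as np

def neg_log_partial_likelihood(beta, x, t, delta):
    """Negative Cox log partial likelihood.

    x: (n, p) covariates, t: (n,) observed times, delta: (n,) event indicators.
    """
    eta = x @ beta                                   # linear predictors beta^T x_i
    # at_risk[i, j] is True when subject j is still at risk at time t_i
    at_risk = t[None, :] >= t[:, None]
    # log of the risk-set sum of exp(eta_j), one value per subject i
    log_risk = np.log((at_risk * np.exp(eta)[None, :]).sum(axis=1))
    # only uncensored subjects (delta = 1) contribute a term
    return -np.sum(delta * (eta - log_risk))

# Toy data (illustrative values only)
x = np.array([[0.5], [-1.0], [1.5], [0.0]])
t = np.array([2.0, 5.0, 1.0, 3.0])
delta = np.array([1, 0, 1, 1])

# With beta = 0 every term reduces to the log of the risk-set size
loss_at_zero = neg_log_partial_likelihood(np.zeros(1), x, t, delta)
```

With β = 0 the three uncensored subjects have risk sets of sizes 3, 4 and 2, so the loss equals log 3 + log 4 + log 2 = log 24, which gives a quick sanity check of the risk-set construction.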

Stable Cox regression

A previous study⁴⁹ has shown that if the assumption in equation (7) accurately represents the true hazard function, the estimator $\hat{\beta }$ from equation (8) will converge to the actual β in equation (7) as the sample size approaches infinity. However, model misspecification, where the parameterization in equation (7) fails to capture the true model, frequently challenges the Cox model, as documented in refs. ^25,50,51. Moreover, in practical scenarios, the test data may come from different centres or populations than the training data, resulting in distribution shifts. This is exemplified by the biological function analyses in Fig. 5, where the standard Cox model tends to rely on features with spurious correlations to the outcome, leading to suboptimal performance on test data.

In response to these challenges, we propose the stable Cox regression method. This approach focuses on leveraging features that show greater consistency across different environments for prediction, rather than depending on fragile, spurious correlations found in the training data. Specifically, the stable Cox regression method integrates a sample reweighting module with a weighted Cox regression module for robust performance across varied datasets.

Sample reweighting module

The primary objective of this module is to minimize the dependency of covariates. Initially, we take the original covariate matrix M and generate a column-decorrelated version $\tilde{M}$ through random, column-wise resampling. This procedure disrupts the joint distribution of variables in X, leading to p independent marginal distributions. Thus, while the original feature distribution is P^tr(X), the resampled distribution becomes $\tilde{P}(X)={P}^{{\mathrm{tr}}}({X}_{1}){P}^{{\mathrm{tr}}}({X}_{2})\cdots {P}^{{\mathrm{tr}}}({X}_{p})$.
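
The column-wise resampling step can be sketched as follows. This is an illustration under the assumption that resampling is done by independently permuting each column, which keeps every marginal distribution intact while breaking the dependence between columns; the function name and toy matrix are hypothetical.

```python
import numpy as np

def column_decorrelate(m, rng):
    """Return a copy of m in which every column is shuffled independently."""
    m_tilde = np.empty_like(m)
    for j in range(m.shape[1]):
        m_tilde[:, j] = rng.permutation(m[:, j])
    return m_tilde

rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 1))
# Two strongly correlated columns (toy covariate matrix M)
m = np.hstack([z, z + 0.1 * rng.normal(size=(2000, 1))])
m_tilde = column_decorrelate(m, rng)

corr_before = np.corrcoef(m, rowvar=False)[0, 1]
corr_after = np.corrcoef(m_tilde, rowvar=False)[0, 1]
```

Each column of `m_tilde` contains exactly the same values as the matching column of `m`, so the marginals are preserved, whereas the correlation between the two columns drops to approximately zero.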

Next, we employ density ratio estimation techniques as per ref. ³⁴. We construct a joint distribution ${P}^{{\prime} }(X,Z)$ over $X\in {\mathcal{X}}$ and Z ∈ {0, 1}, with ${P}^{{\prime} }(Z=0)={P}^{{\prime} }(Z=1)=1/2$, ${P}^{{\prime} }(X| Z=1)=\tilde{P}(X)$ and ${P}^{{\prime} }(X| Z=0)={P}^{{\mathrm{tr}}}(X\,)$. By applying Bayes’ theorem, the density ratio is formulated as:

$$\begin{array}{r}w(X)=\frac{\tilde{P}(X)}{{P}^{{\mathrm{tr}}}(X)}=\frac{{P}^{{\prime} }(X| Z=1)}{{P}^{{\prime} }(X| Z=0)}=\frac{{P}^{{\prime} }(X,Z=1)}{{P}^{{\prime} }(X,Z=0)}=\frac{{P}^{{\prime} }(Z=1| X)}{{P}^{{\prime} }(Z=0| X)}.\end{array}$$

(9)

The probabilities ${P}^{{\prime} }(Z=1| X)$ and ${P}^{{\prime} }(Z=0| X)$ are estimated by training a multilayer perceptron to differentiate between the samples from the original matrix M and the column-decorrelated version $\tilde{M}$. To normalize the sample weights to have a unit mean, w(x⁽ⁱ⁾) is scaled by dividing it by the average $\frac{1}{n}\mathop{\sum }\nolimits_{i = 1}^{n}w\left({x}^{(i)}\right)$.
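
The classifier-based density ratio of equation (9) can be illustrated with a small NumPy sketch. To keep the example self-contained, a logistic regression on quadratic features, fitted by plain gradient descent, is swapped in for the multilayer perceptron used in the paper; the feature map, learning rate, step count and toy data are all assumptions made for illustration.

```python
import numpy as np

def features(x):
    """Intercept, linear and pairwise-product terms (stand-in for the MLP)."""
    n, p = x.shape
    cross = [x[:, i] * x[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([np.ones(n), x] + cross)

def fit_logistic(phi, z, steps=2000, lr=0.1):
    """Gradient descent on the mean logistic loss."""
    theta = np.zeros(phi.shape[1])
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-phi @ theta))
        theta -= lr * phi.T @ (prob - z) / len(z)
    return theta

def weighted_corr(x, w):
    """Correlation between the first two columns of x under sample weights w."""
    xc = x - np.average(x, axis=0, weights=w)
    cov = (w[:, None] * xc).T @ xc / w.sum()
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.4 * x1 + np.sqrt(1 - 0.4 ** 2) * rng.normal(size=n)
m = np.column_stack([x1, x2])                        # original, correlated covariates
m_tilde = np.column_stack([rng.permutation(m[:, j]) for j in range(2)])

# Z = 0 labels the original rows, Z = 1 the column-decorrelated rows
phi = features(np.vstack([m, m_tilde]))
labels = np.concatenate([np.zeros(n), np.ones(n)])
theta = fit_logistic(phi, labels)

# Fitted log-odds log[P'(Z=1|x) / P'(Z=0|x)] evaluated on the original rows
w = np.exp(features(m) @ theta)
w = w / w.mean()                                     # normalize to unit mean

corr_raw = weighted_corr(m, np.ones(n))
corr_weighted = weighted_corr(m, w)
```

Because the two classes are balanced, the fitted odds directly estimate the density ratio, and reweighting the original samples by `w` should shrink the correlation between the two covariates towards zero.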

Weighted Cox regression module

In this module, the sample weights obtained earlier are applied to reweight the contribution of each subject in the Cox PH loss function. Specifically, the weighted negative log partial likelihood L_w(β) and the estimator ${\hat{\beta }}_{w}$ are defined as follows:

$$\begin{array}{rlr}{L}_{w}(\,\beta )&=-\mathop{\sum }\limits_{i=1}^{n}{\delta }^{(i)}w\left({x}^{(i)}\right)\left({\beta }^{T}{x}^{(i)}-\log \left(\mathop{\sum}\limits_{{t}^{(j)}\ge {t}^{(i)}}w\left({x}^{(j)}\right)\exp \left({\beta }^{T}{x}^{(j)}\right)\right)\right),&\\ {\hat{\beta }}_{w}&=\arg \mathop{\min }\limits_{\beta }{L}_{w}(\beta ).\end{array}$$

(10)
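
A minimal sketch of the weighted loss follows, written as a quantity to be minimized (that is, with a leading minus sign on the weighted log partial likelihood). The function name and toy data are hypothetical. The check at the end uses the fact that an integer weight of 2 acts like duplicating that subject when ties are handled by including all tied subjects in the risk set.

```python
import numpy as np

def weighted_cox_loss(beta, x, t, delta, w):
    """Weighted negative log partial likelihood for sample weights w."""
    eta = x @ beta
    at_risk = t[None, :] >= t[:, None]               # at_risk[i, j]: t_j >= t_i
    # weighted risk-set sums: sum_j w_j * exp(eta_j) over subjects at risk at t_i
    log_risk = np.log((at_risk * (w * np.exp(eta))[None, :]).sum(axis=1))
    return -np.sum(delta * w * (eta - log_risk))

beta = np.array([0.3])

# Three subjects, the first one carrying weight 2
x = np.array([[0.5], [-1.0], [1.5]])
t = np.array([2.0, 5.0, 1.0])
delta = np.array([1, 0, 1])
w = np.array([2.0, 1.0, 1.0])
loss_weighted = weighted_cox_loss(beta, x, t, delta, w)

# Same data with the first subject duplicated and unit weights
x_dup = np.array([[0.5], [0.5], [-1.0], [1.5]])
t_dup = np.array([2.0, 2.0, 5.0, 1.0])
delta_dup = np.array([1, 1, 0, 1])
loss_duplicated = weighted_cox_loss(beta, x_dup, t_dup, delta_dup, np.ones(4))
```

The two losses coincide, which illustrates how the sample weights change the effective composition of the training population seen by the Cox model.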

Characterizing weighted Cox regression with counting processes

Following refs. ^25,49, we employ counting processes to model events for different individuals. For individual i, let N⁽ⁱ⁾(u) (u ≥ 0) denote the counting process, with its intensity function given by:

$${\lambda }^{(i)}(u)={Y}^{\,(i)}(u)\lambda \left(u;{x}^{(i)}\right),$$

(11)

where ${Y}^{\,(i)}(u)={\mathbb{I}}\left[{t}^{(i)}\ge u\right]$ is a predictable process, taking values in {0, 1} and indicating whether individual i is still under observation at time u, consistent with the risk set ${t}^{(j)}\ge {t}^{(i)}$ in equation (10). $\lambda \left(u;{x}^{(i)}\right)$ is the true hazard function for individual i with features x⁽ⁱ⁾ at time u. The loss function, that is, the weighted negative log partial likelihood, can then be expressed as:

$$\begin{array}{l}{L}_{w}(\,\beta )\\=-\mathop{\sum }\limits_{i=1}^{n}w\left({x}^{(i)}\right)\displaystyle\mathop{\int}\nolimits_{0}^{\infty }\left({\beta }^{T}{x}^{(i)}-\log \left(\,\mathop{\sum }\limits_{j=1}^{n}w\left({x}^{(j)}\right){Y}^{\,(j)}(u)\exp \left({\beta }^{T}{x}^{(j)}\right)\right)\right){\rm{d}}{N}^{(i)}(u).\end{array}$$

(12)

Equations (10) and (12) represent the empirical losses with respect to a finite sample size n. To extrapolate to the population level, we take the expectation of these empirical terms, replacing the counting process N⁽ⁱ⁾(u) with the hazard functions. The population-level loss function, written as a quantity to be minimized, and the corresponding population-level solution are given by:

$$\begin{array}{rlr}{\tilde{L}}_{w}(\beta )&=-\mathop{\int}\nolimits_{0}^{\infty }{\beta }^{T}{q}_{w}^{(1)}(u){\rm{d}}u+\mathop{\int}\nolimits_{0}^{\infty }{q}_{w}^{(0)}(u)\log \left({q}_{w}^{(0)}(\beta ,u)\right){\rm{d}}u,&\\ {\beta }_{w}&=\arg \mathop{\min }\limits_{\beta }{\tilde{L}}_{w}(\beta ).\end{array}$$

(13)

Here, for any r ∈ {0, 1, 2}

$${q}_{w}^{(r)}(u)={\mathbb{E}}\left[w(X)Y(u)\lambda (u;X){X}^{\otimes r}\right],\quad {q}_{w}^{(r)}(\,\beta ,u)={\mathbb{E}}\left[w(X\,)Y(u)\exp \left({\beta }^{T}X\right){X}^{\otimes r}\right].$$

(14)

As established in Theorem 1, under certain assumptions, the empirical solution ${\hat{\beta }}_{w}$ converges to the population-level solution β_w in probability as the sample size n approaches infinity.

Theoretical analysis

In this subsection, we present a theoretical analysis of our proposed stable Cox regression method.

Assumptions

To derive our theoretical results, we first establish several necessary assumptions.

Regularity assumptions

Consider w(X), the learned weighting function from the first stage of our method. We posit several regularity assumptions on the problem setting.

Assumption 1 (bounded parameter assumptions)

The following bounded parameter assumptions are made:

There exists a constant τ such that T ≤ τ almost surely.
There exists a constant C such that ∥X∥₂ ≤ C almost surely.
There exists a constant B > 0 such that w(X) ≤ B almost surely.
For any u ≥ 0, ${\mathbb{E}}[\lambda (u;X)] < \infty$.

Remark 1

These assumptions are standard in the Cox regression and weighted regression literature. Specifically, the first assumption, used in ref. ⁴⁹, holds when all individuals have a finite censoring time. The second assumption, similar to ref. ²⁵, is reasonable as individual features are finite. The third assumption, in line with ref. ²⁷, ensures a plausible weighting function for theoretical analysis. Lastly, the fourth assumption ensures that the expected hazard rate over all individuals at any specific time u is finite, a condition easily met when the first two points are valid and the hazard function is continuous.

We also make an assumption regarding the population-level loss function, as defined in equation (13).

Assumption 2 (existence and uniqueness of the population-level solution)

There exists a unique solution to $\arg \mathop{\min }\limits_{\beta }{\tilde{L}}_{w}(\,\beta )$.

Remark 2

This assumption, also adopted in previous studies^25,49, ensures the existence and uniqueness of the population-level solution. It is a feasible condition as ${\tilde{L}}_{w}(\,\beta )$ is a convex function, a claim substantiated in Supplementary Information, section B.1.

S–V structure assumption

In line with previous studies^27,29,30, we posit that the targets (specifically, the failure time T^failure and censoring time T^censored) are influenced only by a subset of variables, denoted as S. This is formalized in the following assumption.

Assumption 3

Assume that the feature set X can be partitioned into two disjoint subsets S and V, such that S ∪ V = X and $S\cap V={{\emptyset}}$. Within the training distribution P^tr, it holds that:

$$\left({T}^{\,{\rm{failure}}},{T}^{\,{\rm{censored}}}\right)\perp V| S.$$

(15)

Remark 3

It is important to note that for the failure time T^failure, the condition T^failure ⊥ V|S is satisfied if and only if there exists a function ${\lambda }^{{\prime} }(u;S)$ such that for any time u and features X, ${\lambda }^{{\prime} }(u;S)=\lambda (u;X)$. The substantiation of this claim is detailed in ‘Proposition B.1’ in Supplementary Information, section B.2, highlighting the relevance of S in characterizing the hazard functions.

Moreover, in scenarios involving covariate shifts—where P^te(T^failure, T^censored|X) = P^tr(T^failure, T^censored|X), yet P^te(X) ≠ P^tr(X)—the conditional independence relationship (T^failure, T^censored) ⊥ V|S continues to hold within the test distribution. In addition, it follows that P^te(T^failure, T^censored|S) = P^tr(T^failure, T^censored|S). This crucial aspect is further elaborated in ‘Proposition B.2’ presented in Supplementary Information, section B.2. This proposition demonstrates that the set S can provide a stable and robust estimator for Cox regression, even in the presence of covariate shifts.

Theoretical results

Building on the previously stated assumptions, we now present a theoretical analysis of our stable Cox regression model.

Characterizing the solution of the weighted Cox regression

We first establish that the solution ${\hat{\beta }}_{w}$ to the empirical loss, as defined in equation (10), converges to the solution β_w of the population-level loss, as specified in equation (13).

Theorem 1

Given a weighting function $w\in {\mathcal{W}}$, and under Assumptions 1 and 2, it holds that

$${\hat{\beta }}_{w}\mathop{\to }\limits^{\,{P}\,}{\beta }_{w}\quad \,\text{as}\,\quad n\to \infty ,$$

(16)

where $\mathop{\to }\limits^{\,{P}\,}$ denotes convergence in probability.

Remark 4

The detailed proof of this theorem is provided in Supplementary Information, section B.3. While previous research^25,49 primarily focused on the asymptotic properties of standard Cox regression, Theorem 1 extends this analysis to weighted Cox regression, which presents a novel area of interest.

Eliminating irrelevant variables via stable Cox regression

Building on Theorem 1, it can be demonstrated that stable Cox regression effectively eliminates irrelevant variables V, as per Assumption 3.

Theorem 2

Consider a weighting function $w\in {{\mathcal{W}}}_{\perp }$ that ensures mutual independence of X in the weighted distribution. Under Assumptions 1, 2 and 3, the following holds:

$$\mathop{\lim }\limits_{n\to \infty }{\mathbb{P}}\left(\left\Vert {\hat{\beta }}_{w}(V\,)\right\Vert > \epsilon \right)=0,\quad \forall \epsilon > 0,$$

(17)

where ${\hat{\beta }}_{w}(V\,)$ represents the coefficients of ${\hat{\beta }}_{w}$ corresponding to V.

Remark 5

The proof is available in Supplementary Information, section B.4. As noted in Remark 3, S can serve as an invariant predictor under covariate shift, whereas V may show spurious correlations with the outcome T, as evidenced by the biological function analyses in Fig. 5. Consequently, our stable Cox regression method effectively eliminates the influence of irrelevant variables V, ensuring stable prediction in data with covariate shift.

Experimental set-up details

Simulated survival data

We first generate all the covariates X = (S, V). In our experiments, the dimension of X is fixed to p = 10, and the dimensions of S and V are specified as p_s = p_v = 0.5 × p = 5. We generate covariate X by the following process:

$$\begin{array}{rlr}&{Z}_{1},{Z}_{2},\ldots ,{Z}_{{p}_{s}+1} \sim N(0,1),\quad {V}_{1},{V}_{2},\ldots ,{V}_{{p}_{v}} \sim N(0,1),&\\ &{S}_{i}=0.8{Z}_{i}+0.2{Z}_{i+1},\quad i=1,2,\ldots ,{p}_{s},\end{array}$$

(18)

where N(0, 1) represents the standard Gaussian distribution. The survival time T is generated by the survival time function in Supplementary Table A2. The corresponding transformation between the survival time and the cumulative baseline hazard H₀ is $T={H}_{0}^{-1}[-\log (U\,)\exp (-{\beta }^{T}S)]$ (ref. ²⁵), where U is uniformly distributed on (0, 1); this follows from equation (7), under which the survival function at time u is $\exp (-{H}_{0}(u)\exp ({\beta }^{T}S))$. The generation models Cox-exp and Cox–Weibull concern the omission of the nonlinear term g(S) from Cox models, where g(S) = S₁ × S₂ × S₃ is a model misspecification term used to simulate nonlinear generation processes in the real world. Poly concerns the misspecification of the regression form, and Log T concerns non-PH models; both also omit the nonlinear term.
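
The covariate and survival-time generation can be sketched as follows, assuming a Weibull baseline with cumulative hazard H₀(u) = λu^ν, for which the inverse transform has a closed form. The coefficient values, the Weibull parameters `lam` and `nu`, and the variable names are hypothetical placeholders; the settings actually used are those of Supplementary Table A2, and the nonlinear term g(S) is left out here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_s, p_v = 1000, 5, 5

# Equation (18): neighbouring stable covariates share a Z term, so S is correlated
z = rng.normal(size=(n, p_s + 1))
s = 0.8 * z[:, :-1] + 0.2 * z[:, 1:]
v = rng.normal(size=(n, p_v))                        # unstable covariates

beta = np.full(p_s, 0.5)                             # hypothetical effect sizes
lam, nu = 0.1, 1.5                                   # hypothetical Weibull scale and shape

# Inverse transform: H0(T) = -log(U) * exp(-beta^T S), with H0(u) = lam * u**nu
u = rng.uniform(size=n)
t_failure = (-np.log(u) * np.exp(-s @ beta) / lam) ** (1.0 / nu)

# Censor exactly 10% of subjects, with censoring time uniform on (0, T_failure)
censored = np.zeros(n, dtype=bool)
censored[rng.choice(n, size=n // 10, replace=False)] = True
t_obs = np.where(censored, rng.uniform(size=n) * t_failure, t_failure)
delta = (~censored).astype(int)                      # 1 = event observed
```

Setting `nu = 1` recovers an exponential baseline, so the same lines cover a Cox-exponential style generator as a special case.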

Afterwards, we generate various environments by constructing spurious correlations between V and S, which in turn changes P(T|V). Among all the unstable variables V, we simulate the unstable correlation P(V_b|S) on a subset V_b ⊆ V. We vary P(V_b|S) through different strengths of selection bias with a bias rate r ∈ [−3, −1) ∪ (1, 3] to generate multiple testing environments, and set r_train to 2.5 for the training set. Each sample is selected with probability ${\mathrm{Pr}}={\prod }_{{V}_{i}\in {V}_{{\mathrm{b}}}}| r{| }^{-5{D}_{i}}$, where D_i = |f(S) − sign(r)V_i| and sign(⋅) denotes the sign function. In our experiments, we set ${p}_{{v}_{{\mathrm{b}}}}=0.1\times p$. Moreover, we randomly select 10% of subjects as censored, and their censoring time is generated by T^censored ∼ U(0, T). We train our models on data from one single environment generated with a bias rate r_train and test on data from multiple environments with bias rates r_test ranging in [−3, −1) ∪ (1, 3]. By adjusting the parameter r_test, we can modulate the strength of spurious correlations, thereby enabling a clear evaluation of the underlying mechanisms and performance of our method across different selection bias scenarios. Each model is trained ten times independently with different training datasets from the same bias rate r_train. Likewise, for each r_test, we generate ten test datasets with different random seeds. We utilize the C-index, a widely used metric in survival analysis that evaluates the model’s ability to correctly provide a relative ranking of the predicted risk or time to event⁵². The reported metrics are the means over these ten runs.
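
The selection-bias mechanism can be sketched as a rejection-sampling step. The function f(S) is not specified in this section, so the sketch assumes, purely for illustration, that f(S) is the sum of the stable covariates; the function name and sample sizes are likewise hypothetical.

```python
import numpy as np

def selection_probability(s, v_b, r, f=lambda s: s.sum(axis=1)):
    """Pr = prod_i |r| ** (-5 * D_i), with D_i = |f(S) - sign(r) * V_i|.

    f is a hypothetical choice; s is (n, p_s) and v_b is (n, number of biased variables).
    """
    d = np.abs(f(s)[:, None] - np.sign(r) * v_b)
    return np.prod(np.abs(r) ** (-5.0 * d), axis=1)

rng = np.random.default_rng(0)
s = rng.normal(size=(20000, 5))
v_b = rng.normal(size=(20000, 1))                    # one biased unstable variable

prob = selection_probability(s, v_b, r=2.5)
keep = rng.uniform(size=len(prob)) < prob            # keep each sample with probability Pr

# Among the retained samples, V_b tracks f(S) although it was drawn independently
corr = np.corrcoef(s.sum(axis=1)[keep], v_b[keep, 0])[0, 1]
```

A larger |r| makes the selection sharper and the induced correlation stronger, while a negative r flips its sign, which is how different test environments can be produced from the same generator.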

Transcriptome survival data

To comprehensively evaluate our method, we select datasets according to the following criteria. First, omics data are typically generated from various cohorts around the world. Discovering omics biomarkers is a complex and costly process that involves identifying candidate biomarkers, qualification, verification and clinical assay development⁵³. Consequently, it is impractical to develop a new biomarker panel for each cohort. A practical path is to discover biomarkers that generalize stably to unseen testing cohorts. It is therefore critical to assess the generalization ability of discovered biomarkers across multiple cohorts, and the dataset should encompass multiple cohorts for the same disease. Second, it is necessary that the dataset includes survival information, which is often scarce in high-quality omics literature. Third, the presence of clinical information is preferable to enable detailed studies based on clinical subgroups. On the basis of these criteria, we construct three datasets from multiple cohorts to evaluate survival models under distribution shifts. In each cohort, comprehensive transcriptome data and the corresponding OS information of individual patients are available. The distributions of covariates are assumed to differ naturally across cohorts. The first dataset is the HCC transcriptome dataset, which is constructed from five gene-expression datasets in HCCDB (a database of hepatocellular carcinoma expression atlas)⁵⁴. In particular, we utilize The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) cohort (n = 351)¹¹ as the training set, the Grinchuk et al. cohort (n = 115)⁵⁵ as the validation set, and the Fujimoto et al. cohort (n = 203)³⁷, the Roessler et al. cohort (n = 209)¹² and the Hoshida et al. cohort (n = 80)³⁸ as three independent testing cohorts. The second dataset is a breast cancer transcriptome dataset, which is constructed from ref. ⁵⁶. 
This study⁵⁶ has the transcriptome and OS information of 1,980 patients from 5 cohorts. We use 1 cohort (n = 763) as the training set, 1 cohort (n = 170) as the validation set, and the other 3 as the testing cohorts, termed as the Curtis et al. cohort 1 (n = 521), cohort 2 (n = 288) and cohort 3 (n = 238). Furthermore, the third dataset is the melanoma transcriptome dataset. Melanoma is a serious form of skin cancer that arises from pigment-producing cells known as melanocytes, often due to excessive exposure to ultraviolet radiation. We treat the data from the Liu et al. cohort (n = 120)⁵⁷ as the training set and the Hugo et al. cohort (n = 26)⁵⁸ as the validation set. Moreover, the Gide et al. cohort (n = 91)⁵⁹, the Riaz et al. cohort (n = 54)⁶⁰ and the Van et al. cohort (n = 41)⁶¹ are treated as the test sets. First, to reduce the number of candidate genes, we utilize a univariate Cox model to calculate the HR value of each gene and select the top 100 as the candidate genes. On the basis of these candidate genes, to discover as few genes as possible for constructing a survival analysis model, we select the top-5, -10, -15 and -20 significant genes learned by each model as the biomarker panel. Then, based on the selected biomarker panel by each method, we construct a new corresponding model for predictions. Moreover, in this section, we also compare with a Cox PH model, termed as univariate Cox PH, which directly trains a Cox PH model on the top-N genes ranked by the previous univariate Cox PH model. The detailed data preprocessing process and the pipeline are in Supplementary Information, section A.1.

Clinical survival data

The lung cancer clinical data⁶² have the patient characteristics and clinical outcomes for 382 patients. We utilize clinical information as covariates and follow-up data including OS and DFS as targets for two tasks. For these data, we aim to simulate the scenario that the training data were collected randomly and the test data have several patient subgroups with different characteristics or clinical values. We hope that the trained model can achieve stable performance on different subgroups. In particular, we randomly select 40% of the patients (n = 240) as the training set and the remaining patients are divided into the following subgroups according to their covariates, age ≤60 (n = 43) or age > 60 (n = 113), female (n = 4) or male (n = 152), tumour location classified as ‘central’ (n = 47) or ‘peripheral’ (n = 107), and obstructive pneumonitis/atelectasis is ‘present’ (n = 80) or ‘absent’ (n = 76). Female (n = 4) and male (n = 152) subgroups are used as the validation set and the remaining subgroups are used as testing subgroups. The breast cancer clinical dataset was collected from https://www.kaggle.com/datasets/gunesevitan/breast-cancer-metabric/data, which has the clinical and follow-up data for patients from several cohorts⁵⁶. After filtering low-quality samples, we utilize 1 cohort as the training set (n = 394) and 1 cohort (n = 14) as the validation set, and the other 2 cohorts as the test sets, termed as test cohort 1 (n = 273) and test cohort 2 (n = 177). The survival outcomes of this dataset are OS and RFS. These datasets enable us to validate the effectiveness of our model across two generalization scenarios frequently encountered in real-world settings, namely, generalization on subpopulations and cohorts. The detailed data preprocessing process and the pipeline are in Supplementary Information, section A.1.

Baseline approaches

We compare our model with two mainstream kinds of methods in survival analysis: a semi-parametric model (the Cox PH model⁵) and parametric models (the Weibull accelerated failure time (AFT) model, the log-logistic AFT model and the log-normal AFT model⁶³). For all the methods, we add an l₂-norm penalty to reduce the multicollinearity problem. Here we introduce the three parametric survival models compared in the experiments. In survival analysis, a parametric model assumes that the underlying survival times follow a known probability distribution, such as the Weibull, log-normal or log-logistic distributions. AFT models⁶³ are the most frequently used parametric models. Unlike the Cox PH model, which models the hazard rate (the risk of the event occurring at a particular time), AFT models directly model the survival time itself. The general form of an AFT model is:

$$\log (T\,)={\beta }_{0}+{\beta }_{1}{X}_{1}+\cdots +{\beta }_{p}{X}_{p}+\epsilon ,$$

(19)

where ϵ is a random error term. Different distributions of ϵ imply different distributions of the survival time.

The Weibull AFT model

The Weibull AFT model assumes that the survival time follows a Weibull distribution (equivalently, that ϵ follows an extreme-value distribution), and it has the following cumulative hazard:

$$H(u;X\,)={\left(\frac{u}{\lambda (X\,)}\right)}^{\rho },$$

(20)

where ρ controls the shape of the distribution, and $\lambda (X)=\exp ({\beta }_{0}+{\beta }_{1}{X}_{1}+\cdots +{\beta }_{p}{X}_{p})$ represents the acceleration or deceleration of hazards by the covariates.

The log-logistic AFT model

A log-logistic AFT model assumes that the survival time follows a log-logistic distribution (equivalently, that ϵ follows a logistic distribution), and it has the following cumulative hazard:

$$H(u;X\,)=\log \left(1+{\left(\frac{u}{\lambda (X)}\right)}^{\rho }\right).$$

(21)

The log-normal AFT model

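A log-normal AFT model assumes that the survival time follows a log-normal distribution (equivalently, that ϵ follows a normal distribution). With Φ denoting the standard normal cumulative distribution function and $\log \lambda (X)={\beta }_{0}+{\beta }_{1}{X}_{1}+\cdots +{\beta }_{p}{X}_{p}$ acting as the location parameter, it has the following cumulative hazard:

$$H(u;X)=-\log \left(1-\varPhi \left(\frac{\log (u)-\log \lambda (X)}{\rho }\right)\right).$$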

(22)

The AFT model directly interprets the effects on survival time and is suitable when the exact survival time predictions are needed, but it relies on the correct specification of the underlying distribution, which can be a limitation. The Cox PH model is robust and widely used, not requiring the assumption of a specific survival time distribution and can handle time-varying covariates, but it assumes PH, and it is less straightforward for predicting absolute survival times.
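
For illustration, the three cumulative hazards can be written with the Python standard library as follows. The function names are hypothetical, `lam` stands for λ(X) and `rho` for ρ, and the log-normal version takes the location `mu` on the log-time scale, that is, the linear predictor β₀ + β₁X₁ + ⋯ + β_pX_p. The survival function follows in each case as exp(−H).

```python
import math

def weibull_cum_hazard(u, lam, rho):
    """H(u) = (u / lam) ** rho."""
    return (u / lam) ** rho

def loglogistic_cum_hazard(u, lam, rho):
    """H(u) = log(1 + (u / lam) ** rho)."""
    return math.log1p((u / lam) ** rho)

def lognormal_cum_hazard(u, mu, rho):
    """H(u) = -log(1 - Phi((log u - mu) / rho)), Phi = standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf((math.log(u) - mu) / (rho * math.sqrt(2.0))))
    return -math.log(1.0 - phi)

# Sanity checks at characteristic time points (toy parameter values)
h_weibull = weibull_cum_hazard(2.0, lam=2.0, rho=1.5)                        # equals 1 at u = lam
s_loglogistic = math.exp(-loglogistic_cum_hazard(2.0, lam=2.0, rho=1.5))     # median at u = lam
s_lognormal = math.exp(-lognormal_cum_hazard(math.exp(0.7), mu=0.7, rho=1.2))  # median at u = exp(mu)
```

In the log-logistic and log-normal models the survival probability is one half at u = λ(X) and u = exp(μ), respectively, so the covariates shift the median survival time directly.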

Evaluation metrics

The metrics used closely reflect the accuracy of survival prediction and the difference of the subgroups identified. Three sets of evaluation metrics were used.

Concordance index

The C-index quantifies the ability of the survival model to correctly rank patient outcomes. It is calculated as the fraction of all comparable pairs of individuals whose predicted risks are correctly ordered⁶⁴:

$$C=\frac{1}{N}\sum _{{t}^{(i)}\,\text{uncensored}\,}\sum _{{t}^{(i)} < {t}^{(j)}}{{\bf{1}}}_{\hat{H}(i)\ > \ \hat{H}(j)},$$

(23)

where N is the total number of comparable pairs of individuals, t⁽ⁱ⁾ and t^(j) are the observed times of events for individuals i and j, respectively, and $\hat{H}(i)$ and $\hat{H}(j)$ are the corresponding predicted risk scores (hazards). The C-index score takes values between 0 and 1, and 0.5 means a random guess. A larger value indicates that the predicted survival ranking is closer to the ground-truth survival ranking.
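
A direct, quadratic-time sketch of equation (23) is given here. The function name and toy data are illustrative assumptions; pairs with equal predicted risks are counted as discordant, exactly as the indicator in equation (23) implies, whereas some libraries give such ties half credit.

```python
import numpy as np

def concordance_index(t, delta, risk):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant, comparable = 0, 0
    for i in range(len(t)):
        if delta[i] == 0:
            continue                                  # pair must start from an observed event
        for j in range(len(t)):
            if t[i] < t[j]:                           # subject j outlived subject i
                comparable += 1
                concordant += int(risk[i] > risk[j])
    return concordant / comparable

# Toy data: the third subject is censored
t = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 1, 0, 1])

c_perfect = concordance_index(t, delta, [4, 3, 2, 1])
c_reversed = concordance_index(t, delta, [1, 2, 3, 4])
c_partial = concordance_index(t, delta, [4, 1, 2, 3])
```

The toy data contain five comparable pairs; a perfectly ordered risk score gives 1, a fully reversed one gives 0, and the third score ranks three of the five pairs correctly, giving 0.6.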

The log-rank P value

We plot the Kaplan–Meier survival curves of the two risk groups and calculate the log-rank P value of the survival difference between them according to ref. ⁶⁵. A log-rank P value lower than 0.05 means the survival curves of the two subgroups are significantly different.
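
The two-group log-rank test can be sketched as follows. This is the standard textbook form of the statistic and is not necessarily the exact variant of ref. ⁶⁵; the function name and toy data are hypothetical. The P value uses the fact that the upper tail of a chi-squared distribution with one degree of freedom at x equals erfc(√(x/2)).

```python
import math
import numpy as np

def logrank_p_value(t, delta, group):
    """Two-sided log-rank P value for two groups coded 0 and 1."""
    t, delta, group = map(np.asarray, (t, delta, group))
    obs_minus_exp, var = 0.0, 0.0
    for tau in np.unique(t[delta == 1]):              # distinct event times
        at_risk = t >= tau
        n = at_risk.sum()                             # total number at risk
        n1 = (at_risk & (group == 1)).sum()           # at risk in group 1
        events = (t == tau) & (delta == 1)
        d = events.sum()                              # events at tau, both groups
        d1 = (events & (group == 1)).sum()            # events at tau in group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:                                     # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2.0))

all_events = np.ones(12, dtype=int)

# Clearly separated groups: group 1 fails early, group 0 fails late
t_sep = np.array([1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15], dtype=float)
g_sep = np.array([1] * 6 + [0] * 6)
p_separated = logrank_p_value(t_sep, all_events, g_sep)

# Interleaved groups: no systematic survival difference
t_mix = np.arange(1, 13, dtype=float)
g_mix = np.array([0, 1] * 6)
p_interleaved = logrank_p_value(t_mix, all_events, g_mix)
```

The separated groups give a P value well below 0.05, whereas the interleaved groups do not, matching the interpretation given in the text.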

Hazard ratio

The HR is the ratio of the hazard rates corresponding to the conditions characterized by two distinct levels of a treatment variable of interest. For Cox regression, the HR value of the ith covariate can be directly calculated by:

$${\text{HR}}_{i}=\exp \left({\hat{\beta }}_{i}\right).$$

(24)

HR_i > 1 means that an increase in covariate i is associated with a higher hazard, HR_i < 1 suggests a reduced hazard, and HR_i = 1 implies no difference in risk. To calculate the HR value for two risk subgroups, we first create a group variable g. In this variable, patients in the high-risk subgroup are assigned a value of 1, while all others are assigned a value of 0. Next, we use g as the sole covariate in a Cox PH model to regress the survival outcome. The HR value for the subgroup separation is then given by exp(β_g), where β_g is the coefficient of the group variable in the Cox PH model.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All benchmark datasets used in this paper are publicly available. For the processed RNA-sequencing data and corresponding survival outcomes of the HCC transcriptome dataset, the TCGA-LIHC cohort data were downloaded from https://portal.gdc.cancer.gov/projects/TCGA-LIHC. The Grinchuk et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76427. The Fujimoto et al. cohort data were downloaded from https://docs.icgc-argo.org/docs/data-access/icgc-25k-data. The Roessler et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse14520. The Hoshida et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10143. For the breast cancer dataset, the RNA-sequencing, survival outcome, cohort and clinical information can be downloaded from https://www.cbioportal.org/study/summary?id=brca_metabric. For the melanoma transcriptome dataset, the Liu et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000452.v3.p1. The Hugo et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse78220. The Gide et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB23709. The Riaz et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE91061. The Van Allen et al. cohort data were downloaded from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000452.v2.p1. For the lung cancer clinical dataset, the clinical and survival outcome data can be downloaded from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6777828/. All data sources used in this paper are listed in Supplementary Table A1. The simulated data were generated when running the source code. The preprocessed real-world data are available on GitHub at https://github.com/googlebaba/StableCox and on Zenodo at https://doi.org/10.5281/zenodo.13852489 (ref. ⁶⁶).

Code availability

The implementation code is available on GitHub at https://github.com/googlebaba/StableCox and on Zenodo at https://doi.org/10.5281/zenodo.13852489 (ref. ⁶⁶).

References

Anderson, K. M. A nonproportional hazards Weibull accelerated failure time regression model. Biometrics 47, 281–288 (1991).
Friedman, M. Piecewise exponential models for survival data with covariates. Ann. Stat. 10, 101–113 (1982).
Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. Ann. Appl. Stat. 2, 841–860 (2008).
Wang, P., Li, Y. & Reddy, C. K. Machine learning for survival analysis: a survey. ACM Comput. Surv. 51, 110 (2019).
Cox, D. R. Regression models and life-tables. J. R. Stat. Soc. Ser. B 34, 187–202 (1972).
Guo, L. L. et al. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci. Rep. 12, 2726 (2022).
Chaudhary, K., Poirion, O. B., Lu, L. & Garmire, L. X. Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 24, 1248–1259 (2018).
Zhou, L. & Zhu, Y. The EpCAM overexpression is associated with clinicopathological significance and prognosis in hepatocellular carcinoma patients: a systematic review and meta-analysis. Int. J. Surg. 56, 274–280 (2018).
Liang, J. et al. Expression pattern of tumour-associated antigens in hepatocellular carcinoma: association with immune infiltration and disease progression. Br. J. Cancer 109, 1031–1039 (2013).
Xu, M. et al. Expression of epithelial cell adhesion molecule associated with elevated ductular reactions in hepatocellular carcinoma. Clin. Res. Hepatol. Gastroenterol. 38, 699–705 (2014).
Zhu, Y., Qiu, P. & Ji, Y. TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat. Methods 11, 599–600 (2014).
Roessler, S. et al. A unique metastasis gene signature enables prediction of tumor relapse in early-stage hepatocellular carcinoma patients. Cancer Res. 70, 10202–10212 (2010).
Thorgeirsson, S. S., Lee, J.-S. & Grisham, J. W. Molecular prognostication of liver cancer: end of the beginning. J. Hepatol. 44, 798–805 (2006).
Jiang, G. et al. CD146 promotes metastasis and predicts poor prognosis of hepatocellular carcinoma. J. Exp. Clin. Cancer Res. 35, 38 (2016).
Jiang, Y. et al. Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma. Nature 567, 257–261 (2019).
Liu, F., Liu, Y. & Chen, Z. Tim-3 expression and its role in hepatocellular carcinoma. J. Hematol. Oncol. 11, 126 (2018).
Tishkoff, S. A. et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39, 31–40 (2007).
Curth, A. & Schaar, M. Understanding the impact of competing events on heterogeneous treatment effect estimation from time-to-event data. In International Conference on Artificial Intelligence and Statistics 7961–7980 (PMLR, 2023).
Curth, A., Lee, C. & Schaar, M. SurvITE: learning heterogeneous treatment effects from time-to-event data. Adv. Neural Inf. Process. Syst. 34, 26740–26753 (2021).
Goh, W. W. B., Wang, W. & Wong, L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 35, 498–507 (2017).
Tibshirani, R. The lasso method for variable selection in the Cox model. Stat. Med. 16, 385–395 (1997).
Verweij, P. J. & Van Houwelingen, H. C. Penalized likelihood in Cox regression. Stat. Med. 13, 2427–2436 (1994).
Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13 (2011).
Fan, J. & Li, R. Variable selection for Cox’s proportional hazards model and frailty model. Ann. Stat. 30, 74–99 (2002).
Lin, D. Y. & Wei, L.-J. The robust inference for the Cox proportional hazards model. J. Am. Stat. Assoc. 84, 1074–1078 (1989).
Cui, P. & Athey, S. Stable learning establishes some common ground between causal inference and machine learning. Nat. Mach. Intell. 4, 110–115 (2022).
Xu, R., Zhang, X., Shen, Z., Zhang, T. & Cui, P. A theoretical analysis on independence-driven importance weighting for covariate-shift generalization. In International Conference on Machine Learning 24803–24829 (PMLR, 2022).
Kuang, K., Cui, P., Athey, S., Xiong, R. & Li, B. Stable prediction across unknown environments. In Proc. 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 1617–1626 (ACM, 2018).
Kuang, K., Xiong, R., Cui, P., Athey, S. & Li, B. Stable prediction with model misspecification and agnostic distribution shift. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 4485–4492 (AAAI Press, 2020).
Shen, Z., Cui, P., Kuang, K., Li, B. & Chen, P. Causally regularized learning with agnostic data selection bias. In Proc. 26th ACM International Conference on Multimedia 411–419 (ACM, 2018).
Shen, Z., Cui, P., Zhang, T. & Kuang, K. Stable learning via sample reweighting. In Proc. AAAI Conference on Artificial Intelligence Vol. 34, 5692–5699 (AAAI Press, 2020).
Fan, S., Wang, X., Shi, C., Cui, P. & Wang, B. Generalizing graph neural networks on out-of-distribution graphs. IEEE Trans. Pattern Anal. Mach. Intell. 46, 322–337 (2024).
Hsu, J. L. & Hung, M.-C. The role of HER2, EGFR, and other receptor tyrosine kinases in breast cancer. Cancer Metastasis Rev. 35, 575–588 (2016).
Sugiyama, M., Suzuki, T. & Kanamori, T. Density Ratio Estimation in Machine Learning (Cambridge Univ. Press, 2012).
Bender, R., Augustin, T. & Blettner, M. Generating survival times to simulate Cox proportional hazards models. Stat. Med. 24, 1713–1723 (2005).
Mertins, P. et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016).
Fujimoto, A. et al. Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509 (2016).
Hoshida, Y. et al. Gene expression in fixed tissues and outcome in hepatocellular carcinoma. N. Engl. J. Med. 359, 1995–2004 (2008).
Van't Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
Onitilo, A. A., Engel, J. M., Greenlee, R. T. & Mukesh, B. N. Breast cancer subtypes based on ER/PR and HER2 expression: comparison of clinicopathologic features and survival. Clin. Med. & Res. 7, 4–13 (2009).
Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).
Prosperi, M. et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat. Mach. Intell. 2, 369–375 (2020).
Zhang, K., Schölkopf, B., Muandet, K. & Wang, Z. Domain adaptation under target and conditional shift. In International Conference on Machine Learning 819–827 (PMLR, 2013).
Zhao, H., Des Combes, R. T., Zhang, K. & Gordon, G. On learning invariant representations for domain adaptation. In International Conference on Machine Learning 7523–7532 (PMLR, 2019).
Ahuja, K., Shanmugam, K., Varshney, K. & Dhurandhar, A. Invariant risk minimization games. In International Conference on Machine Learning 145–155 (PMLR, 2020).
Hainmueller, J. Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Polit. Anal. 20, 25–46 (2012).
Kalbfleisch, J. D. & Prentice, R. L. The Statistical Analysis of Failure Time Data (Wiley, 2011).
Breslow, N. E. Analysis of survival data under the proportional hazards model. Int. Stat. Rev. 43, 45–57 (1975).
Andersen, P. K. & Gill, R. D. Cox’s regression model for counting processes: a large sample study. Ann. Stat. 10, 1100–1120 (1982).
Gail, M. H., Wieand, S. & Piantadosi, S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71, 431–444 (1984).
Lagakos, S. The loss in efficiency from misspecifying covariates in proportional hazards regression models. Biometrika 75, 156–160 (1988).
Harrell Jr, F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15, 361–387 (1996).
Rifai, N., Gillette, M. A. & Carr, S. A. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 24, 971–983 (2006).
Lian, Q. et al. HCCDB: a database of hepatocellular carcinoma expression atlas. Genomics Proteomics Bioinformatics 16, 269–275 (2018).
Grinchuk, O. V. et al. Tumor-adjacent tissue co-expression profile analysis reveals pro-oncogenic ribosomal gene signature for prognosis of resectable hepatocellular carcinoma. Mol. Oncol. 12, 89–113 (2018).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Liu, D. et al. Integrative molecular and clinical modeling of clinical outcomes to PD1 blockade in patients with metastatic melanoma. Nat. Med. 25, 1916–1927 (2019).
Hugo, W. et al. Genomic and transcriptomic features of response to anti-PD-1 therapy in metastatic melanoma. Cell 165, 35–44 (2016).
Gide, T. N. et al. Distinct immune cell populations define response to anti-PD-1 monotherapy and anti-PD-1/anti-CTLA-4 combined therapy. Cancer Cell 35, 238–255 (2019).
Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171, 934–949 (2017).
Van Allen, E. M. et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 350, 207–211 (2015).
Gu, K. et al. Integrated evaluation of clinical, pathological and radiological prognostic factors in squamous cell carcinoma of the lung. PLoS ONE 14, e0223298 (2019).
Wei, L.-J. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat. Med. 11, 1871–1879 (1992).
Steck, H., Krishnapuram, B., Dehing-Oberije, C., Lambin, P. & Raykar, V. C. On ranking in survival analysis: bounds on the concordance index. Adv. Neural Inf. Process. Syst. 20, 1209–1216 (2007).
Bland, J. M. & Altman, D. G. The logrank test. BMJ 328, 1073 (2004).
Fan, S. et al. Stable Cox regression for survival analysis under distribution shifts. Zenodo https://doi.org/10.5281/zenodo.13852489 (2024).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (numbers 62425206 and 62141607 to P.C., 32088101 to C.C. and 62402263 to S.F.) and the National Key Research and Development Program of China (2021YFA1301603 to C.C.). S.F. was supported by China Postdoctoral Science Foundation (numbers 2023M741946, 2024T170494 and GZB20230345). Y.H. was supported by China National Postdoctoral Program for Innovative Talents (number BX20230195).

Author information

These authors contributed equally: Shaohua Fan, Renzhe Xu, Qian Dong.

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China
Shaohua Fan, Renzhe Xu, Yue He & Peng Cui
State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
Qian Dong & Cheng Chang

Contributions

S.F., C.C. and P.C. conceived of the project. S.F. designed the proposed model. R.X. performed the theoretical analysis and discussed with S.F. S.F. mainly performed experiments and analysed the results, assisted by Q.D. S.F., R.X., Q.D. and Y.H. wrote the paper. P.C. and C.C. revised the paper and supervised the project.

Corresponding authors

Correspondence to Cheng Chang or Peng Cui.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Deepti Gurdasani and Jiguang Wang for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. A1 and C2–C9, Discussion, omitted proofs and Tables A1–A4.

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Fan, S., Xu, R., Dong, Q. et al. Stable Cox regression for survival analysis under distribution shifts. Nat Mach Intell 6, 1525–1541 (2024). https://doi.org/10.1038/s42256-024-00932-5

Received: 07 March 2024
Accepted: 21 October 2024
Published: 13 December 2024
Version of record: 13 December 2024
Issue date: December 2024
DOI: https://doi.org/10.1038/s42256-024-00932-5
