Introduction

Time-to-event analysis, also known as survival analysis, is a set of statistical methods used to model the time to the occurrence of an event of interest1,2. These methods are extensively used in the health field when the objective is to estimate the incidence of an event or to predict the risk of developing the event within a clinically meaningful period of time. Examples of applications include predicting the risk of death, a cardiovascular disease (CVD) event, or cancer relapse3,4,5,6. One important characteristic of time-to-event data is right-censoring1, which occurs when the time to event is only partially observed due to study discontinuation or loss to follow-up. More specifically, for right-censored subjects, only a lower bound on the time-to-event is observed. A widely used statistical model (SM) for analyzing right-censored data in clinical research is the Cox proportional hazards (Cox PH) model7, valued for its semi-parametric flexibility and interpretability.

In recent years, machine learning (ML) algorithms have been increasingly applied to develop diagnostic and prognostic prediction models and have been recognized as a transformative innovation in healthcare3,8,9. However, many of the ML algorithms applied in these healthcare applications do not take into account the fact that observations might be right-censored10,11,12. According to a comprehensive survey conducted by Wang et al., a number of ML algorithms have now been adapted and developed to address censoring in survival analysis13. Random survival forest (RSF), an extension of Breiman’s random forest (RF)14, is one of the most frequently employed ML techniques that accommodate right-censored time-to-event data15. Recently, the Oblique Random Survival Forest (ORSF), an ensemble of supervised learning methods for right-censored data that extends Ishwaran’s RSF, was introduced16,17. ORSF showed superior prediction accuracy; nevertheless, evaluating several linear combinations of predictors incurred significant computational time16.

It is critical to assess and compare novel and existing methods in different scenarios to reveal their strengths and weaknesses. This systematic evaluation is frequently implemented using simulation studies18. A particular advantage of a simulation lies in its ability to allow the estimation of the “true” performance of the method being used, since the true data-generating mechanism is known18. Smith et al.19, in their scoping review, highlighted the advantages of simulation studies in comparing ML to traditional SM for risk prediction with time-to-event data. Moreover, the authors of the aforementioned review concluded that few studies have compared statistical and ML methods using simulation studies. Furthermore, although several simulation studies have compared RSF with either the Cox PH or its penalized extension11,20,21,22,23, no independent simulation study has compared the performance of the novel ORSF with RSF, Cox PH, or Penalized Cox PH24. Therefore, the significance of this research lies primarily in its systematic, independent, and comprehensive evaluation of advanced ML models against traditional SM for time-to-event data, specifically addressing a notable gap in the existing literature and assessing the impact of various data characteristics on predictive measures.

The aim of this study is to assess and compare, through extensive simulations, the predictive ability of several ML algorithms and SMs, including ORSF, standard RSF, Cox PH, and Penalized Cox PH models. Models are compared under various scenarios that differ in sample size, censoring rate, and the presence of interaction or non-linear effects. We also aim to investigate how sensitive the predictive accuracy of ORSF is to different specifications of the linear combination criteria. Additionally, we compare computational time between ORSF configurations and standard RSF.

This paper is structured as follows. In Methods, the data-generating process, the selected algorithms, and additional methodological details are outlined. The Results section then details the outcomes of the simulation. In the Discussion, the findings are interpreted, and the study’s strengths and limitations, directions for future work, and implications are presented. Finally, the paper concludes with a summary.

Methods

In this section, we describe a comprehensive simulation study comparing the predictive performance of SMs (Cox PH and Penalized Cox PH) and two ML algorithms (RSF and ORSF) in predicting the risk of developing the event of interest (CVD) when fitted to time-to-event data under varying conditions. We outline the primary procedures involved in executing our simulation, encompassing the algorithms for comparison, simulation parameters, performance indicators, and implementation specifics.

Description of the models

The Cox proportional hazards model

The Cox PH model is among the most widely used models in survival analysis within medical research25. It provides a semi-parametric specification of the hazard function, allowing for the estimation of covariate effects without requiring specification of the baseline hazard. In this model, the hazard of developing an event at time t is defined as follows:

$$h(t|x) = h_{0} \left( t \right) \exp \left( {\beta^{T} x} \right)$$
(1)

where \(h_{0}(t)\) is a non-parametric baseline hazard, and \(\exp(\beta^{T} x)\) is the relative risk function. For each risk factor in the vector \(x\), the association with the incidence of the event is characterized by the hazard ratio \(\exp(\beta)\). The Cox model is typically fitted in two steps. First, the parametric component is estimated by maximizing the partial likelihood, which is independent of the baseline hazard. Subsequently, the non-parametric baseline hazard is estimated based on the fitted covariate effects.
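As a concrete illustration, this two-step fit can be reproduced with the survival package in R; the lung data set and the covariates below are placeholders for illustration, not the variables used in our simulation.

```r
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex")])

# Step 1: estimate beta by maximizing the partial likelihood,
# which does not involve the baseline hazard
fit <- coxph(Surv(time, status) ~ age + sex, data = d)
exp(coef(fit))  # hazard ratios exp(beta)

# Step 2: estimate the baseline cumulative hazard given the fitted beta
H0 <- basehaz(fit, centered = FALSE)
head(H0)
```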

Two variations of Cox PH models are used in this paper: (i) the traditional model including all covariates (Cox PH), and (ii) the Penalized Cox PH model, which incorporates variable selection through regularization techniques26,27. Penalization shrinks the estimates of the regression coefficients towards zero relative to maximum likelihood estimates. This shrinkage helps to prevent overfitting caused by collinearity of covariates or high dimensionality of the data. The L1 (lasso) penalty applies an absolute value constraint to the coefficients, while the L2 (ridge) penalty applies a quadratic constraint28,29. Here, we used a combination of both L1 and L2 penalties (elastic net) to set fewer coefficients to zero than a pure L1 penalty would, while shrinking the remaining coefficients more strongly.
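A minimal sketch of such an elastic-net Cox fit with the glmnet package is shown below; the simulated data and the mixing value alpha = 0.5 are illustrative placeholders, not the tuned settings of this study.

```r
library(glmnet)

set.seed(1)
n <- 300; p <- 8
X <- matrix(rnorm(n * p), n, p)
lp <- 0.8 * X[, 1] - 0.5 * X[, 2]
time   <- rexp(n, rate = exp(lp))
status <- rbinom(n, 1, 0.7)
y <- cbind(time = time, status = status)  # two-column survival response

# alpha in (0, 1) mixes the L1 and L2 penalties (elastic net);
# the penalty strength lambda is chosen by cross-validation
cvfit <- cv.glmnet(X, y, family = "cox", alpha = 0.5)
coef(cvfit, s = "lambda.min")  # shrunken coefficients, some set to zero
```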

The random survival forest

The Random Survival Forest (RSF) implementation follows concepts similar to RF. The process involves drawing B bootstrap samples from the data. For each sample, a survival tree is constructed. During tree construction, a random subset of mtry predictor variables is selected at each node. Among these candidates, the best node split is chosen based on a splitting criterion that uses the log-rank test and hence accounts for right-censoring. This process is applied recursively to daughter nodes until a stopping criterion, such as the minimum number of unique cases in a terminal node, is satisfied. Finally, the cumulative hazard function (CHF) is calculated for each tree, and these are averaged over all trees to define the ensemble CHF15.
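For illustration, a minimal RSF fit with the randomForestSRC package might look as follows; the data set and hyperparameter values are placeholders rather than the tuned configurations reported later.

```r
library(randomForestSRC)
library(survival)

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
d$status <- d$status - 1  # recode 1/2 -> 0/1 event indicator

# B = ntree bootstrap samples; mtry candidates per node;
# nodesize controls the terminal-node stopping criterion
rsf <- rfsrc(Surv(time, status) ~ ., data = d,
             ntree = 500, mtry = 2, nodesize = 15,
             splitrule = "logrank")

# Ensemble cumulative hazard function for the first few subjects
rsf$chf[1:3, 1:5]
```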

The oblique random survival forest

The Oblique Random Survival Forest (ORSF) is an ensemble method for right-censored survival data first proposed by Jaeger et al. in 201917. The main difference between Jaeger’s ORSF and standard RSF lies in the splitting strategy: ORSF uses linear combinations of multiple predictors to recursively partition the training data, while standard RSF relies on univariate splits using a single predictor at each node.

The ORSF fits Cox PH or similar models in the non-terminal nodes of its survival trees. These models generate Linear Combinations of Input Variables (LCIVs) using their estimated coefficients, and these LCIVs are then employed as the splitting variable at non-leaf nodes. Evaluating candidate solutions for the coefficients involves computing LCIVs for each observation in the current node and selecting random candidate cut-points from the unique LCIV values. For each chosen cut-point, a log-rank statistic is computed to compare the survival curves between observations in the potential child nodes resulting from the split. A node terminates early if the maximum log-rank statistic does not exceed a predetermined threshold; otherwise, the cut-point and the candidate solution that optimize the log-rank statistic are used to partition the node.
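The following schematic sketch (not the actual aorsf implementation) illustrates the split search described above for a single node, using a Cox fit to form the LCIV and the log-rank statistic from the survival package to score candidate cut-points.

```r
library(survival)

split_node <- function(time, status, X, n_cutpoints = 5) {
  # 1. Fit a Cox model in the node to obtain the LCIV coefficients
  beta <- coef(coxph(Surv(time, status) ~ X))
  lciv <- drop(X %*% beta)                 # one linear combination per observation

  # 2. Draw random candidate cut-points from the unique LCIV values
  u    <- unique(lciv)
  cuts <- sample(u, min(n_cutpoints, length(u)))

  # 3. Score each candidate cut-point with the log-rank statistic
  stats <- sapply(cuts, function(cp) {
    grp <- lciv <= cp
    if (length(unique(grp)) < 2) return(0)
    survdiff(Surv(time, status) ~ grp)$chisq
  })

  # 4. Keep the best split (a real implementation would also stop early
  #    if max(stats) fell below a predetermined threshold)
  list(cutpoint = cuts[which.max(stats)], statistic = max(stats))
}
```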

Three criteria (fast, cph, and net) are used in the current ORSF implementation to construct LCIVs. In the fast criterion, a single iteration of Newton–Raphson scoring on the Cox partial likelihood is used to fit the LCIVs; this is the default method in ORSF. In the cph criterion, the coefficients obtained from fitting a Cox PH regression are used to determine the linear combinations of predictors. The last variant (net) uses Penalized Cox PH regression at each node to construct the LCIVs24.
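Assuming the control constructors documented in the aorsf package (orsf_control_fast, orsf_control_cph, and orsf_control_net, which may be superseded in newer releases), the three criteria can be selected as sketched below; the pbc_orsf data bundled with aorsf are used purely for illustration.

```r
library(aorsf)
library(survival)

f <- Surv(time, status) ~ . - id  # pbc_orsf ships with aorsf

fit_fast <- orsf(data = pbc_orsf, formula = f,
                 control = orsf_control_fast())  # one Newton-Raphson step
fit_cph  <- orsf(data = pbc_orsf, formula = f,
                 control = orsf_control_cph())   # full Cox regression per node
fit_net  <- orsf(data = pbc_orsf, formula = f,
                 control = orsf_control_net())   # penalized Cox per node
```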

Simulation settings

In this section, we describe in detail the settings for the simulations that we carried out. These include the data-generating process, the number of simulations, the different scenarios, and the performance metrics used.

The data-generating mechanism

Data were generated to mimic a popular real-world cohort study in CVD, namely the Multi-Ethnic Study of Atherosclerosis (MESA)30, as it is essential for simulated data to conform to real-world data31. MESA is a large prospective cohort study based in the United States. The study was initiated in July 2000 to investigate the burden, associated factors, and progression of subclinical CVD in a large sample of adults aged 45 to 84 years30. The study provides extensive longitudinal data, including clinical, imaging, biomarker, and lifestyle information, which have been extensively used for developing and validating CVD risk models32. Owing to the multi-ethnic composition of the MESA cohort and the highly standardized data collection protocols employed, the MESA data are well suited for developing CVD prediction models that generalize to different populations30.

Continuous potential predictors were simulated from a truncated normal distribution, and binary predictors from a Bernoulli distribution. To ensure a reasonable correlation structure among the features (predictors), the means and probabilities used in the simulation procedure were obtained from models fitted to the MESA data. Table 1 presents a detailed overview of the distribution parameters used for the continuous (\(X_{1}, X_{4}\)–\(X_{7}\)) and binary (\(X_{2}\), \(X_{3}\), and \(X_{8}\)) predictors. \(X_{6}\), \(X_{7}\), and \(X_{8}\) were included as non-informative variables in this simulation.

Table 1 Feature characteristics.
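A minimal sketch of the predictor-generating step described above is shown below; all means, standard deviations, bounds, and probabilities are placeholders, not the MESA-derived values reported in Table 1.

```r
library(truncnorm)

set.seed(2024)
n <- 1000

# Continuous predictors from truncated normal distributions (placeholder parameters)
X1 <- rtruncnorm(n, a = 45, b = 84, mean = 62, sd = 10)   # e.g., age in years
X4 <- rtruncnorm(n, a = 0, b = Inf, mean = 120, sd = 15)

# Binary predictors from Bernoulli distributions (placeholder probabilities)
X2 <- rbinom(n, size = 1, prob = 0.5)
X3 <- rbinom(n, size = 1, prob = 0.3)
X8 <- rbinom(n, size = 1, prob = 0.4)   # non-informative noise variable
```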

Survival times were generated from a parametric Cox PH model in which the baseline hazard follows a Weibull distribution:

$$T = \left( \frac{-\log(U)\,\lambda}{\exp(\beta X)} \right)^{\nu}$$

where \(U\) follows a uniform distribution, \(U \sim \mathrm{Uniform}(0,1)\), and \(\lambda\) and \(\nu\) represent the scale and shape parameters of the Weibull distribution. Non-informative right-censoring times were simulated from a Weibull distribution whose \(\lambda\) and \(\nu\) were tuned manually to achieve approximately the desired censoring rate (50%, 70%, and 90%). These censoring rates were selected to evaluate the performance of the algorithms under low, medium, and high censoring scenarios, as encountered in CVD research. The parameters of the Weibull distributions for the survival and censoring times are presented in Table S1.

We examined two scenarios to define the underlying true relationships between the informative predictors (\(X_{1},X_{2},X_{3},X_{4},X_{5}\)) and the hazard function. Table S1 outlines the specific mathematical forms of the linear predictors (\(\beta^{T}X\)) used in each scenario. In Scenario I, which favors SMs, predictors contribute in a linear and additive manner to the hazard function; that is, each predictor (\(X_{1}\) through \(X_{5}\)) has a constant, independent effect on the log-hazard. In contrast, Scenario II, which favors ML, introduces non-linear and interaction effects, reflecting greater complexity in the associations between covariates and the hazard function. Key features of this scenario include non-linear transformations of predictors, such as \(X_{4}^{2}\), as well as interaction terms wherein the effect of one predictor is conditional on the value of another (e.g., \(X_{2} \times X_{3}\)).
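The sketch below implements the inverse-transform formula above for one replicate under both scenarios; all coefficients and Weibull parameters are placeholders (the values actually used are given in Table S1).

```r
set.seed(2024)
n <- 1000
X1 <- rnorm(n); X2 <- rbinom(n, 1, 0.5); X3 <- rbinom(n, 1, 0.3)
X4 <- rnorm(n); X5 <- rnorm(n)

# Scenario I: linear and additive linear predictor (placeholder betas)
lp1 <- 0.3 * X1 + 0.5 * X2 + 0.4 * X3 + 0.2 * X4 + 0.3 * X5
# Scenario II: non-linear and interaction effects (placeholder betas)
lp2 <- 0.3 * X1 + 0.5 * X2 + 0.4 * X3 + 0.2 * X4^2 + 0.3 * X5 + 0.6 * X2 * X3

# Inverse-transform sampling using the Weibull inversion formula above
gen_time <- function(lp, lambda, nu) {
  U <- runif(length(lp))
  (-log(U) * lambda / exp(lp))^nu
}

T_event  <- gen_time(lp1, lambda = 0.10, nu = 1.5)          # event times
T_censor <- gen_time(rep(0, n), lambda = 0.05, nu = 1.5)    # non-informative censoring

time   <- pmin(T_event, T_censor)          # observed follow-up time
status <- as.integer(T_event <= T_censor)  # 1 = event, 0 = censored
mean(status == 0)                          # empirical censoring rate
```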

We generated various sample sizes of N = 500, 1000, and 5000. The small dataset (N = 500) was chosen because SMs usually outperform ML algorithms in small samples, owing to their smaller number of hyperparameters, provided the data meet the model’s assumptions. Conversely, larger sample sizes were selected to determine whether ML performance improves as the dataset size increases.

Eighteen scenarios were investigated, assuming different associations between the predictors and the hazard function (linear and additive / non-linear with interactions), sample sizes (500/1000/5000), and censoring rates (50%/70%/90%). Nine scenarios assumed linear and additive effects, while nine assumed non-linear and interaction effects between predictors and the outcome.

The number of replications in this simulation study was set to 40. This was based on the formula proposed by Burton et al.33, where the C-index and standard error were set to 0.66 and 0.02, respectively22. The number of replications specified takes into account the potential model failures during training and/or validation.

The performance metrics

The risks estimated in the validation dataset, along with the actual event status, were used to assess the predictive ability of each method using measures of discrimination and calibration. “Discrimination” refers to how well the predictive model can discriminate between individuals who developed an event and those who did not, whereas “calibration” refers to the agreement between observed and predicted risks34. Discrimination was assessed via Harrell’s concordance index (C-index)35, while model calibration was assessed using the integrated Brier score (IBS)36.
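For illustration, Harrell’s C-index can be computed naively as the proportion of concordant comparable pairs, as in the sketch below (in practice, optimized implementations such as survival::concordance are used); a pair is comparable only when the shorter observed time corresponds to an event.

```r
harrell_c <- function(time, status, risk) {
  conc <- comp <- 0
  n <- length(time)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # order the pair so that 'a' has the shorter observed time
      a <- if (time[i] <= time[j]) i else j
      b <- if (time[i] <= time[j]) j else i
      if (status[a] == 1 && time[a] < time[b]) {  # comparable pair
        comp <- comp + 1
        if (risk[a] >  risk[b]) conc <- conc + 1    # higher risk fails first
        if (risk[a] == risk[b]) conc <- conc + 0.5  # tied risks count half
      }
    }
  }
  conc / comp
}

# Tiny usage example on simulated data
set.seed(4)
t <- rexp(50); s <- rbinom(50, 1, 0.7); r <- -t + rnorm(50)
harrell_c(t, s, r)
```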

Implementation details

During data preprocessing, only continuous variables were scaled. Fig S1 illustrates the nested cross-validation design: three-fold outer resampling for unbiased estimates of generalization performance, and a hold-out split for inner resampling for hyperparameter tuning. Each dataset was randomly split, using a fixed seed, into three cross-validation folds stratified by the outcome event to maintain the original distribution of incident events, particularly under high censoring (e.g., 90%). Optimal hyperparameter values that maximized performance on the training set were then found through random search with 30 repetitions over a 70%-30% hold-out split.

For the Penalized Cox PH model, the tuned hyperparameters included the L1 and L2 regularization terms. For the RSF and ORSF models, the tuned hyperparameters included the number of trees, the number of features considered for splitting (RSF) or for constructing the linear combination (ORSF), and the minimum number of samples required at a leaf node. The splitting criteria were a gradient-based score (global non-quantile) for RSF and the C-index for ORSF. In addition, the three LCIV construction methods (fast, cph, and net) were evaluated as part of the model comparison. Table S2 shows the search space of hyperparameters optimized using random search.
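The following sketch outlines this nested resampling set-up in the mlr3 ecosystem; the learner key (surv.rfsrc), the toy data, and the tuning ranges are assumptions for illustration and do not reproduce the exact search space of Table S2.

```r
library(mlr3verse)          # loads mlr3, mlr3tuning, paradox
library(mlr3proba)
library(mlr3extralearners)  # provides survival learners such as lrn("surv.rfsrc")

set.seed(1)
d <- data.frame(time = rexp(300), status = rbinom(300, 1, 0.5),
                x1 = rnorm(300), x2 = rnorm(300), x3 = rbinom(300, 1, 0.5))
task <- as_task_surv(d, time = "time", event = "status")
task$set_col_roles("status", add_to = "stratum")  # stratify folds by event status

learner <- lrn("surv.rfsrc",
               ntree    = to_tune(200, 1000),
               mtry     = to_tune(1, 3),
               nodesize = to_tune(5, 50))

at <- auto_tuner(
  tuner      = tnr("random_search"),
  learner    = learner,
  resampling = rsmp("holdout", ratio = 0.7),  # inner 70%-30% hold-out
  measure    = msr("surv.cindex"),
  term_evals = 30                             # 30 random configurations
)

rr <- resample(task, at, rsmp("cv", folds = 3))    # outer 3-fold CV
rr$aggregate(msrs(c("surv.cindex", "surv.graf")))  # C-index and IBS
```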

Our simulation experiments were conducted between August 2024 and January 2025 on the UAE University High Performance Computing Cluster (HPC), using a computing node with a maximum wall-time of 480 h (20 days). The node has two physical Intel(R) Xeon(R) Gold 6248 CPUs @ 2.50 GHz with 20 cores per CPU and 384 GB RAM, running Red Hat Enterprise Linux. The study was implemented using R version 4.2.137, utilizing the mlr3 ecosystem38, along with the mlr3proba39 and mlr3extralearners40 packages. The R code used in this study, covering the statistical and ML methods, the training and validation programs, and the generated tabulated results and figures, is publicly available in the first author’s GitHub repository (https://github.com/AbubakerSuliman/simulation_study_compare_predictive_performance/tree/main)41.

Statistical analysis

Descriptive analysis of Harrell’s C-index, IBS, and training time estimates was conducted for each scenario using boxplot visualization. The distribution of training time is summarized as the median (IQR). For inferential statistics, we present the mean (95% confidence interval) of the Harrell’s C-index and IBS estimates. Additionally, we employed one-way repeated measures ANOVA to compare the Harrell’s C-index and IBS estimates across the examined methods in each scenario. Post-hoc analyses with Benjamini & Hochberg adjustment were then carried out for all pairwise differences between the investigated methods. All statistical tests were two-sided; p-values < 0.05 were considered statistically significant. This analysis utilized R version 4.3.1, the tidyverse R package version 2.0.042, and the rstatix R package version 0.7.243.
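A minimal sketch of this per-scenario analysis with the rstatix package is given below; the results data frame (one C-index per replicate and method) is simulated purely for illustration.

```r
library(rstatix)

# Hypothetical results: one C-index per replicate x method combination
res <- expand.grid(replicate = factor(1:40),
                   method = c("CoxPH", "PenCoxPH", "RSF",
                              "ORSF-fast", "ORSF-cph", "ORSF-net"))
set.seed(3)
res$cindex <- 0.66 + rnorm(nrow(res), sd = 0.02)

# One-way repeated measures ANOVA across methods
anova_test(res, dv = cindex, wid = replicate, within = method)

# Pairwise post-hoc comparisons with Benjamini-Hochberg adjustment
pairwise_t_test(res, cindex ~ method, paired = TRUE,
                p.adjust.method = "BH")
```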

Results

Scenario I: comparison under proportional hazards with linear and additive effects

Harrell’s C-index comparison

Figure 1 provides box plots of the C-index estimates for the Cox PH, Penalized Cox PH, RSF, ORSF-fast, ORSF-cph, and ORSF-net models trained on 40 datasets simulated under the first scenario. All the models exhibit satisfactory discrimination performance on the simulated data. However, the C-indices for the RSF model throughout the nine settings were lower than those of the statistical and ORSF models. This trend is more evident across all sample sizes when a 50% censoring rate is used.

Fig. 1

Models’ discrimination performance using Harrell’s C index, averaged over 40 simulated datasets where linear and additive effects are assumed.

The box plots of the C-index values for the three ORSF linear-combination variants are nearly identical, indicating that the predictive ability of the three ORSF linear combinations is equivalent on the simulated time-to-event data.

The means (and 95% CIs) of the Harrell’s C-indices for Cox PH, Penalized Cox PH, RSF, ORSF-fast, ORSF-cph, and ORSF-net, averaged from 40 simulations over nine scenarios, are provided in Table 2. The impact of a growing censoring rate on the statistical models is more pronounced in small (N = 500) than in medium (N = 1000) and large (N = 5000) sample sizes. The average Harrell’s C-indices for Cox PH and Penalized Cox PH diminished from 0.675 to 0.657 and from 0.676 to 0.644, respectively, when censoring increased from 50% to 90%. On the other hand, differences in censoring rates across medium and large sample sizes did not lead to substantial changes in the C-indices of the two statistical models.

Table 2 Mean (95% CI) Harrell’s C index for competing learning methods, averaged over 40 simulations in nine scenarios where linear and additive effects are assumed.

Among the ML algorithms, an increase in censoring had a minimal impact on predictive performance when the sample size remained constant. As the sample size increased from 500 to 5000 participants, the mean Harrell’s C-index improved by about 2% across all machine learning models.

The C-indices for SMs were significantly higher than those for RSF models (mean difference [MD] = 0.01 to 0.02, p < 0.001) across all scenarios, except when the sample size was 500 and the censoring rate was 90%. SMs had statistically significantly higher C-indices than the ORSF-fast and ORSF-cph variants (MD = 0.01, p < 0.05), except when the sample size was 5,000 and the censoring rates were 50% and 70%, and when the sample size was 500 at a censoring rate of 90%, where the differences were either negligible or not statistically significant. Similarly, ORSF-net was outperformed by the Cox PH models across all scenarios of sample size and censoring rate, except when the sample size was 500 and the censoring rate was 90%, where the difference in C-index was not statistically significant (p > 0.05) (Table S3).

Across all scenarios of sample size and censoring rate, RSF models had statistically significantly lower C-indices than the ORSF variants (MD = –0.01 to –0.02, p < 0.001), except when the sample size was 500 and the censoring rate was 90%, where the differences were not statistically significant. The differences in C-indices between the various ORSF variants were generally negligible or not statistically significant (Table S3).

Integrated Brier score comparison

The means (and 95% CIs) of the integrated Brier scores for the Cox PH, Penalized Cox PH, RSF, ORSF-fast, ORSF-cph, and ORSF-net models, averaged from 40 simulations over nine scenarios, are provided in Table 3. All the models exhibit satisfactory calibration performance on the simulated data, with IBS ranging from 0.085 to 0.185.

Table 3 Mean (95% CI) integrated Brier scores for competing learning methods, averaged over 40 simulations in nine scenarios where linear and additive effects are assumed.

Across the different scenarios of sample size and censoring rate, the Cox PH and Penalized Cox PH had the best calibration, as indicated by the lowest IBS. On the other hand, the RSF recorded the worst calibration, as indicated by the highest IBS, except when the sample size was 500 and the censoring rate was 90%, in which case the worst-performing model was ORSF-net. This performance gap became increasingly evident with larger sample sizes and higher censoring rates. For instance, at a sample size of 5000 and a censoring rate of 90%, the IBS values for Cox PH and RSF were 0.141 and 0.176, respectively.

Moreover, across all models, calibration generally deteriorated with increasing censoring rates, regardless of sample size, with RSF and ORSF appearing to be the most sensitive to censoring.

Additionally, the box plots of the IBS values for the three ORSF linear-combination variants are nearly identical (Fig. 2), indicating that the predictive ability of the three ORSF linear combinations is equivalent on the simulated time-to-event data.

Fig. 2

Models’ calibration using the Integrated Brier score, averaged over 40 simulated datasets where linear and additive effects are assumed.

RSF models had significantly higher integrated Brier scores (IBS) than SMs across nearly all scenarios (MD = –0.01 to –0.04, p < 0.05), except when the sample size was 500 and the censoring rates were 70% or 90%, where the difference between Cox PH and RSF was not statistically significant (p > 0.05). Additionally, larger differences were observed when the sample size was between 1,000 and 5,000 and the censoring rate was 90%. The ORSF variants yielded statistically significantly higher IBS than the SMs in most cases, particularly when the sample sizes were large (1,000–5,000) and the censoring rate was high (90%). However, these differences were generally negligible when the censoring rate was 50%, regardless of sample size (Table S4).

RSF and ORSF were comparable in terms of IBS in most scenarios, with differences being either negligible or not statistically significant. The only exceptions were when the sample size was 5,000 at the 70% and 90% censoring rates, where the IBS values for RSF were slightly but significantly higher than for the ORSF variants (MD = 0.01, p < 0.001). The pairwise differences in IBS among the ORSF variants were generally not statistically significant and/or negligible (Table S4).

Training time

The training times for RSF, ORSF-fast, and ORSF-cph were consistently minimal across all scenarios, with ORSF-fast and ORSF-cph generally taking less time to train than RSF. On the other hand, ORSF-net was the most computationally expensive, requiring significantly longer training times across all scenarios of censoring and sample size (Fig S2, Table S5).

Scenario II: comparison of prediction models in proportional hazards with non-linearity and interaction

In the nine simulated settings involving non-linearity with interaction effects, all the models demonstrated a notable improvement in predictive ability compared to the linear and additive scenario.

Harrell’s C-index comparison

All examined models attained a C-index of about 90% when the censoring rate and the sample size were set at 90% and 5000, respectively (Fig. 3). Furthermore, the variability in C-indices was greater for small and medium sample sizes.

Fig. 3

Models’ discrimination performance using Harrell’s C index, averaged over 40 simulated datasets where interaction and non-linear effects are assumed.

Table 4 reports the mean (95% CI) Harrell’s C-indices under non-linearity with interaction for the competing learning techniques, averaged across 40 simulations over nine scenarios. The C-indices for both SMs and ML models improved with increased censoring across all sample sizes, except for the ORSF algorithms. While the SMs showed only minor variations as the sample size grew at a fixed censoring rate, the ML models exhibited comparable patterns at the 50% and 70% censoring rates but not at 90%. ORSF-fast exhibited a 1% rise in the C-index as the sample size increased from 500 to 5000 at the 50% and 70% censoring rates, but a 5% increase at the 90% censoring rate for the same sample size expansion.

Table 4 Mean (95% CI) Harrell’s C-index for competing learning methods, averaged over 40 simulations in nine scenarios where interaction and non-linear effects are assumed.

SMs consistently achieved higher C-indices than RSF models across all scenarios (MD = 0.01 to 0.03, p < 0.001), especially at smaller sample sizes (N = 500) and lower censoring rates (50%). The only exception was when the sample size was 1,000 and the censoring rate was 90%, in which case the C-indices for the Penalized Cox PH and Penalized Cox PH (NLI) models and the RSF model were not significantly different. However, with a sample size of 5,000 and censoring rates of 50% and 70%, the differences in C-index were minimal or non-significant. The differences in C-indices between SMs and ORSF models were mostly minimal, except at a sample size of 500 and a censoring rate of 90%, where the C-indices for SMs were significantly higher (MD = 0.03 to 0.04, p < 0.001) (Table S6).

At sample sizes of 500–1,000 and censoring rates of 50–70%, ORSF models yielded significantly higher C-indices than the RSF models (MD = –0.03 to –0.01, p < 0.001). Conversely, the RSF models showed significantly higher C-indices than the ORSF models at a sample size of 500 and a censoring rate of 90% (MD = 0.02, p < 0.01). The differences in C-indices among the ORSF variants were largely minimal or not statistically significant, except at sample sizes of 500–1,000 and a censoring rate of 90%, where ORSF-cph significantly outperformed ORSF-fast (MD = 0.01, p < 0.01) (Table S6).

Integrated Brier score comparison

Under Scenario II, involving interaction and non-linearity, all models demonstrated higher integrated Brier scores when the sample size was limited to 500 observations, indicating a decline in calibration. Furthermore, increasing the censoring rate from 70% to 90% at medium-to-large sample sizes (1000 to 5000) led to a slightly larger increase in the IBS for the ML models than for the SMs (Fig. 4, Table 5).

Fig. 4

Models’ calibration using the Integrated Brier score, averaged over 40 simulated datasets where interaction and non-linear effects are assumed.

Table 5 Mean (95% CI) integrated Brier scores for competing learning methods, averaged over 40 simulations in nine scenarios where interaction and non-linear effects are assumed.

The models showed varying calibration performance across conditions. Penalized Cox PH (NLI) had the best calibration at a censoring rate of 50% across all sample sizes, and at a censoring rate of 70% when the sample size was 500 or 1000. At a high censoring rate (90%), Cox PH had the best calibration when the sample size was 500 or 1000. RSF and ORSF-fast were most frequently the worst-performing models in terms of calibration under the scenarios involving interaction and non-linearity.

While the SMs were mostly similar to the RSF model in terms of IBS, they showed significantly lower IBS at a high censoring rate of 90% (MD = –0.02 to –0.01, p < 0.05). Although the IBS values for the SMs were generally similar to those of the ORSF models, significantly lower IBS values (up to a difference of –0.05, p < 0.05) were observed for SMs compared to the ORSF models at a high censoring rate (90%) (Table S7).

RSF models showed significantly higher IBS scores than the ORSF models (MD = 0.01, p < 0.05) in most of the scenarios with censoring rates of 50–70%. A notable switch was observed at a 90% censoring rate with sample sizes of 500 and 1,000, where the IBS scores were significantly lower for the RSF models than the ORSF models (MD = –0.01 to –0.02, p < 0.05). For most cases, there were no significant differences in IBS score between the ORSF variants. However, at a 90% censoring rate, ORSF-fast had statistically significantly higher IBS scores than ORSF-cph (at N = 1,000) and ORSF-net (at N = 1,000 and 5,000) (Table S7).

Training time

RSF, ORSF-fast, and ORSF-cph recorded significantly shorter training times than ORSF-net, as ORSF-net demonstrated a markedly extended duration, up to 53 min, with a larger sample size (N = 5,000). Moreover, while RSF, ORSF-fast, and ORSF-cph exhibited comparable training times at a smaller sample size (N = 500) and across all censoring scenarios, RSF models recorded significantly longer training times than ORSF-fast and ORSF-cph at larger sample sizes (N = 1,000 – 5,000) across all censoring rates (Fig S3, Table S8).

Discussion

In this study, we designed and implemented an innovative simulation framework to model the time-to-event of individuals with CVDs. This framework was subsequently used to evaluate the predominant predictive algorithms for time-to-event data affected by right-censoring. The comparison included classical SMs (Cox PH and Penalized Cox PH) as well as ML models (RSF and the novel ORSF). We conducted comparisons across many scenarios: varying sample sizes, varying censoring rates, diverse non-linear transformations of continuous variables, and risk heterogeneity (interaction) between two variables.

In all scenarios where linear and additive effects were assumed, RSF achieved the lowest performance among all models considered. Omurlu et al.20 demonstrated that Cox PH performed slightly better than RSF, although they investigated only scenarios with sample sizes of at most 500, while Baralou et al.11 reported that Cox PH outperformed all three splitting criteria of RSF under the linear assumption. Moreover, Billichová et al.22 reported that when the data follow an exponential distribution, the Cox PH model outperforms RSF in terms of predictive accuracy. Our findings are consistent with these studies, confirming that RSF is inferior to Cox PH. Moreover, the average relative performance difference between the two models increased with sample size at a 90% censoring rate. On the other hand, all three ORSF linear combinations slightly outperformed RSF and performed similarly to Cox PH when censoring rates were 50% and 70% with a sample size of 5,000. As expected, the Cox PH and its penalized variant performed slightly better than the ML algorithms in this scenario, given that the model assumptions are satisfied. Additionally, the Cox models require little to no hyperparameter tuning, unlike the ML models, which involve numerous tuning parameters and are therefore more data-intensive22.

When the simulated data exhibited heterogeneity with a non-linear risk relationship, performance varied notably with sample size. In small samples, the Cox PH model surpassed both RSF and ORSF; in medium and large samples, however, the Cox PH model exhibited accuracy similar to that achieved by RSF and ORSF. These findings are similar to those of Billichová et al.22. It is worth mentioning that the effect of increasing the sample size was more apparent for the ML algorithms than for the Cox PH models, especially at the high censoring rate (90%), where the average difference between the 500 and 5000 sample sizes ranged from 2.8% for RSF to 4.6% for ORSF-cph. Several prominent ML publications examined non-linear risks (e.g., quadratic effects) but fitted a linear Cox PH model, which predictably resulted in decreased accuracy while the ML models maintained theirs19,22. In this study, we found that a Penalized Cox PH model that accounts for interaction and non-linearity performs better than RSF and ORSF in small samples and is comparable to both in large samples.

We investigated the predictive performance and computational efficiency of ORSF and its variants, as well as RSF, in a neutral comparison (i.e., conducted by authors not affiliated with those who proposed the algorithms), which, to the best of our knowledge, had never been carried out. There was no difference in C-index or IBS between ORSF-fast and ORSF-cph on the one hand and ORSF-net on the other across all investigated sample sizes and event rates in both scenarios. This is similar to the findings of the ORSF authors’ study16. While the performance of ORSF was close to that of RSF in the linear and additive scenarios, the ORSF algorithms significantly outperformed RSF in the presence of interaction and non-linearity. These findings are similar to those of the original ORSF simulation study17.

The main strength of our paper lies in the tuning of the RSF and ORSF hyperparameters. Several existing studies in the literature used default, or even arbitrary, values for the ML and deep learning methods11,22. In this study, we used random search with 30 repetitions in the inner cross-validation to find the optimal hyperparameters that maximized the C-index. Existing data-generating simulations often lack plausibility in that they simulate feature values independently, which is unrealistic for biological data33. In this study, we generated correlated features based on the relationships estimated from the MESA study. Furthermore, we calculated the number of simulations required to achieve a specified accuracy, which was rarely done in previous simulation studies19.

A limitation of our study is the data-generating mechanism, which assumes no missing values or outliers. In real medical data, atypical observations are common. Consequently, future research should simulate 5% outliers and independent variables with missing values and examine the models’ performance in the presence of such anomalies. In our simulated data, we assumed proportional hazards. Although this assumption is often satisfied in real-life applications44, it may boost the performance of traditional methods and diminish the performance of ML models. It is, therefore, important for a future study to simulate data where the PH assumption is violated. Finally, the adaptation of neural networks to time-to-event data was introduced to ML at an early stage. To allow for a reasonable computation time, we did not investigate the performance of neural networks for time-to-event data, such as the Cox PH deep neural network (DeepSurv)45, in the current study. However, we plan such a comparison in our future studies.

Our findings have important implications for healthcare professionals, researchers, and policymakers, enabling them to make more informed decisions when selecting models for time-to-event CVD prediction, particularly in situations with different sample sizes, censoring rates, and effects of predictors on the outcome. The study confirms that a single model cannot fit all scenarios: no one model consistently guarantees the best discrimination and calibration across all situations. For instance, the sample size and censoring rate play a substantial role in a model’s performance. Overall, SMs are credible alternatives to ML algorithms, with ORSF-fast and ORSF-cph emerging as prominent candidates in prognostic research. Finally, the ORSF-net model should be avoided, as it offers no performance advantage and is less computationally efficient due to its extended training durations.

Conclusion

This simulation study directly addresses the need for systematic evaluation, demonstrating the relative strengths and weaknesses of popular ML and SM approaches for time-to-event data. We showed that RSF had lower discrimination performance than Cox PH, Penalized Cox PH, and ORSF. In scenarios with low to moderate censoring rates, ORSF performed similarly to SMs, particularly in the presence of non-linearity and interactions. Therefore, ORSF would be advantageous for individual risk prediction when the numbers of instances and variables increase substantially and the relationships among covariates grow more complex. Finally, we observed no significant difference among the ORSF variants in either discrimination or calibration.