Introduction

Diffuse large B-cell lymphoma (DLBCL) is the most common lymphoid malignancy among the elderly, and about 25% [1] of new diagnoses concern patients aged 80 years or older (≥80 y.o.). Besides obvious differences in treatment tolerance and management compared with the younger population, growing knowledge, for example in molecular cell-of-origin subclassification, indicates that DLBCL of the elderly may have specific features [2, 3]. Yet these patients are underrepresented in clinical trials, and few trials focus on the very elderly [1]. The recommended first-line treatment, as established by two phase II clinical trials, LNH03-7B [4] and LNH09-7B [5], is the combination of rituximab and reduced-dose CHOP (cyclophosphamide, doxorubicin, vincristine, prednisone) (R-miniCHOP), with a 2-year overall survival (OS) approaching 60%. Alternatively, regimens replacing or omitting anthracyclines can be proposed for frail patients or those with impaired cardiac function [6,7,8]. Pre-phase treatment, combining oral prednisone with or without vincristine and cyclophosphamide, is recommended for the elderly and should be considered in patients over 80 years old, as it has been shown to improve performance status (ECOG), allowing better tolerance of subsequent R-miniCHOP treatment [5, 9, 10].

More than a decade after LNH09-7B, improving first-line treatment outcomes remains challenging in this population, despite the overall advances brought by targeted therapies and immunomodulatory agents. Indeed, although the combination of lenalidomide and R-miniCHOP (R²miniCHOP) had an appealing rationale, as the proportion of the activated B-cell DLBCL subtype increases with age [2], SENIOR [11], the first (and, to date, the only published) randomized phase III clinical trial (RCT) focused on ≥80 y.o. patients, failed to show an improvement in OS, the primary endpoint, which can be partly explained by toxicity issues. Recruiting ≥80 y.o. patients into clinical trials can be more challenging than recruiting younger patients, which partly explains the lack of dedicated RCTs. New combinations are being evaluated in ongoing trials [12], including RCTs [13, 14], which will hopefully bring new treatment options and potential approvals.

At a lower level of evidence, phase II clinical trials can bring new opportunities in first-line management. However, comparing results across different phase II trials is not recommended, as variations in patient populations, study design, or endpoints can lead to misleading conclusions. The use of synthetic control arms (SCAs) in clinical trials with innovative designs could improve statistical confidence by replacing the internal control arm either entirely (analysis of a phase II trial) or partially (phase III RCT with a mixed control arm composed of randomized and historical patients) [15,16,17]. Even though DLBCL in ≥80 y.o. patients cannot be considered a rare disease, the lack of representation and the recruitment issues in clinical trials can justify the use of an SCA, provided that rigorous statistical methods are applied to control covariate balance between arms and to reduce the biases related to the lack of randomization [18,19,20,21,22]. General rules defined by regulatory agencies have also been published [23, 24].

In this study, we aimed to build an SCA from mixed real-world and clinical trial data for ≥80 y.o. DLBCL patients at first line of treatment, and to validate its clinical and statistical relevance by applying it to the SENIOR trial, the only published RCT in this population.

Material and methods

Patients and study design

Patient-level data from the REALYSA [25] and LNH09-7B [5] databases were retrieved to source the SCAs, and data from the SENIOR trial [11] were used for validation. REALYSA is a French real-world multicenter observational cohort that has been recruiting newly diagnosed lymphoma patients since November 2018. LNH09-7B was a phase II clinical trial assessing the efficacy of ofatumumab and miniCHOP preceded by a pre-phase treatment (oral vincristine and oral prednisone) as first-line treatment of DLBCL patients ≥80 y.o., which resulted in a 2-year OS of 64.7% [95% confidence interval (CI): 55.3–72.7]. The SENIOR trial was a phase III RCT assessing R-miniCHOP versus R-lenalidomide-miniCHOP (R²miniCHOP), with a pre-phase treatment, as first-line treatment of DLBCL patients ≥80 y.o., with inclusion criteria similar to those of LNH09-7B. This trial failed to show an improvement in OS with the addition of lenalidomide (2-year OS of 66% for the R-miniCHOP arm versus 65.7% for the R²miniCHOP arm). Patients from LNH09-7B started treatment between June 2010 and November 2011, and those from SENIOR between October 2014 and September 2017.

From the REALYSA cohort, we included patients aged ≥80 y.o. who received the R-miniCHOP combination as first-line treatment and were enrolled in the cohort from December 2018 to 31 December 2021. To match the SENIOR inclusion criteria, patients with a performance status (ECOG) of 3 or 4 or with Ann Arbor stage I disease were not included. Data were exported from the registry on 25 June 2024.

Patients with non-compliant diagnoses (other than DLBCL or high-grade B-cell lymphoma (HGBL)), according to the centralized pathology review performed in each clinical trial or to the local diagnosis for REALYSA, were excluded.

The study was designed as follows. The first step was to build a mixed SCA (“Mixed SCA 1”) from real-world data (REALYSA) and clinical trial data (LNH09-7B), adjusted on the SENIOR control arm, and to evaluate whether it could mimic the internal control arm of SENIOR. The second step was to build a mixed SCA (“Mixed SCA 2”) adjusted on the experimental arm of SENIOR, to evaluate whether the efficacy results of the SENIOR study could be replicated by replacing the internal control arm with Mixed SCA 2. For each step, populations were balanced as described below.

The primary and only endpoint for the comparison of arms after weighting was OS, measured from the date of inclusion (or the date of diagnosis in REALYSA) to the date of death. As a high number of REALYSA patients had limited follow-up, we chose to censor all patients at 24 months in all cohorts.
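As an illustration of the 24-month administrative censoring described above, the following Python sketch converts a start date and an end date into a follow-up time and an event flag. It is not the code used in the study (analyses were run in SAS); the function name, the argument names, and the average month length of 365.25/12 days are assumptions made for the example.

```python
from datetime import date


def censor_at_horizon(start, end, died, horizon_months=24.0):
    """Return (follow-up in months, event flag) with administrative censoring.

    start: date of inclusion (or date of diagnosis for a registry patient)
    end:   date of death, or date of last contact if the patient is alive
    died:  True if `end` is a date of death
    """
    days_per_month = 365.25 / 12  # assumed average month length
    months = (end - start).days / days_per_month
    if months > horizon_months:
        # Anything observed after the horizon is ignored:
        # the patient is counted as alive and censored at 24 months.
        return horizon_months, False
    return months, died


# A death at 3 years is censored at 24 months; a death at 6 months is kept.
late_death = censor_at_horizon(date(2019, 1, 1), date(2022, 1, 1), True)
early_death = censor_at_horizon(date(2019, 1, 1), date(2019, 7, 1), True)
```

Applying the same horizon to every cohort keeps the comparison from being driven by the longer follow-up available in the older trial datasets.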

Weighting procedures

Propensity scores (PS) were estimated for each included patient using logistic regression with the following covariates: sex, age (spline), Ann Arbor stage (I–II versus III–IV), performance status (ECOG), number of extra-nodal sites involved (<2 versus ≥2), international prognostic index (IPI) score (0–2 versus 3–5), B symptoms, lactate dehydrogenase (LDH) level (normal versus above the upper limit of normal), bulky mass >10 cm, and albumin level in grams per liter (g/L) (spline).

For the comparison between Mixed SCA 1 and the SENIOR internal control arm, the PS estimates the probability for a patient to receive “R-miniCHOP via SENIOR.” For the comparison between Mixed SCA 2 and the SENIOR experimental arm, the PS estimates the probability for a patient to receive R²miniCHOP.

Patients were then weighted to balance covariates between arms using the stabilized inverse probability of treatment weighting (sIPTW) approach, the probability of treatment being given by the PS [26].
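To make the weighting step concrete, here is a minimal Python sketch of stabilized weights computed from already-estimated propensity scores. It uses the standard textbook definition (marginal probability of the arm divided by the patient's probability of being in that arm); the exact SAS implementation used in the study is not described, so the function name and the 0/1 coding of the arms are assumptions.

```python
def stabilized_weights(treated, ps):
    """Stabilized inverse probability of treatment weights.

    treated: list of 0/1 indicators (1 = trial arm, 0 = external SCA source)
    ps:      estimated probability of belonging to the trial arm (same order)
    """
    p_treated = sum(treated) / len(treated)  # marginal probability of arm 1
    weights = []
    for t, p in zip(treated, ps):
        if t == 1:
            weights.append(p_treated / p)
        else:
            weights.append((1 - p_treated) / (1 - p))
    return weights


# Toy example with four patients (illustrative values, not study data).
example_weights = stabilized_weights([1, 1, 0, 0], [0.5, 0.25, 0.5, 0.2])
```

A trial patient with a low PS (0.25 here) receives a larger weight because similar patients are under-represented in the trial arm, and an external patient with a low PS receives a smaller one.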

To avoid positivity violations, patients with extreme PS (below 0.1 or above 0.9) were excluded [27]. After weighting, standardized mean differences (SMD) between arms were estimated for each covariate included in the PS model to check the balance of covariate distributions. With this approach, sIPTW allowed estimation of the average treatment effect.

Missing data management

Due to the high number of missing data for some covariates in patients from REALYSA, the “across” multiple imputation method was used (15 imputations per patient) [28, 29], which generated 15 complete datasets. PS were calculated for each dataset and, for each patient, the median of the 15 PS was used for weighting. The SMD presented in this manuscript were calculated on real data, i.e., before imputation. With the “across” method, outcome analyses are performed on a single dataset.
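The “across” step, in which one PS per patient is obtained by taking the median over the imputed datasets, can be sketched as below. The input layout (one list of PS per imputed dataset, with patients in the same order in each) is an assumption made for the example, and three imputations are used instead of 15 for brevity.

```python
from statistics import median


def median_ps_across_imputations(ps_by_imputation):
    """Combine propensity scores across imputed datasets.

    ps_by_imputation: list of lists, one list of PS per imputed dataset,
    with patients in the same order in every dataset.
    Returns one PS per patient (the median over imputations).
    """
    n_patients = len(ps_by_imputation[0])
    return [
        median(dataset[i] for dataset in ps_by_imputation)
        for i in range(n_patients)
    ]


# Two patients, three imputed datasets (illustrative values only).
combined_ps = median_ps_across_imputations(
    [[0.20, 0.60], [0.30, 0.50], [0.25, 0.70]]
)
```

Because the combination happens at the PS level, a single set of weights is produced and the outcome model is fitted once.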

Statistics for outcome analyses

As the first objective was to assess whether Mixed SCA 1 could mimic the internal control arm of SENIOR, the 95% CIs of the hazard ratios (HR) were expected to include 1. The second objective was to reproduce the SENIOR results by replacing the internal control arm with Mixed SCA 2; the HR were expected to overlap with the result obtained in the SENIOR trial (HR [95% CI] of 0.996 [0.66–1.51]), with a 95% CI including 1. No power calculation was performed; the aim was to build the largest possible SCA from the chosen sources (LNH09-7B and REALYSA) with patients comparable at baseline.

OS curves were generated using the Kaplan–Meier method and compared with the log-rank test. HR and 95% CIs were calculated using the Cox proportional hazards model. A one-sided p value below 0.05 was considered significant. Analyses were performed using SAS software 9.4.
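For readers unfamiliar with how weights enter a survival curve, the sketch below gives a weighted Kaplan–Meier estimator in plain Python. It is an illustration only, not a reproduction of the SAS procedures used in the study: each patient contributes their weight, rather than 1, to the number of deaths and to the number at risk, and patients censored at a given time are treated as still at risk at that time (a common convention, assumed here).

```python
def weighted_kaplan_meier(times, events, weights=None):
    """Return [(time, survival)] at each distinct time with at least one death.

    times:   follow-up times
    events:  True for death, False for censoring
    weights: optional per-patient weights (defaults to 1 for everyone)
    """
    if weights is None:
        weights = [1.0] * len(times)
    data = sorted(zip(times, events, weights))
    at_risk = sum(weights)  # weighted number still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0.0
        leaving = 0.0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            _, event, w = data[i]
            if event:
                deaths += w
            leaving += w
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


km_curve = weighted_kaplan_meier(
    [6, 12, 12, 24], [True, False, True, False]
)
```

With all weights equal, the estimator reduces to the ordinary Kaplan–Meier curve, which is a convenient sanity check.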

Sensitivity analyses

Sensitivity analyses were performed using data from the REALYSA cohort only, to evaluate whether a well-balanced SCA could be built from real-world data alone (REALYSA-SCA), with the same methodology as for the mixed SCAs. Additionally, we performed sensitivity analyses using different missing data management methods for the Mixed SCA and REALYSA-SCA analyses. The “complete cases” method excludes patients with missing data. For the “within” method, analyses are performed on each imputed dataset (15 analyses), and results are combined using Rubin’s rules [28, 29].
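The pooling step of the “within” method can be illustrated with Rubin's rules applied to log hazard ratios. The sketch below assumes that each imputed dataset yields a log HR and its variance; the function names are hypothetical, and the confidence interval uses a normal approximation (z = 1.96) rather than the t-distribution with Rubin's degrees of freedom, which is a simplification.

```python
import math


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates (e.g. log HR) with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                       # pooled estimate
    within = sum(variances) / m                      # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def hr_with_ci(log_hr, variance, z=1.96):
    """Back-transform a pooled log HR to an HR with an approximate 95% CI."""
    se = math.sqrt(variance)
    return (
        math.exp(log_hr),
        math.exp(log_hr - z * se),
        math.exp(log_hr + z * se),
    )


# Three imputations with illustrative values (not study results).
pooled_log_hr, pooled_var = pool_rubin([0.0, 0.2, 0.4], [0.01, 0.01, 0.01])
pooled_hr = hr_with_ci(pooled_log_hr, pooled_var)
```

The between-imputation term widens the interval when the imputed datasets disagree, which is what distinguishes this pooling from simply averaging the 15 results.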

Results

Patients’ characteristics

Overall, we included 73 patients from LNH09-7B and 97 patients from REALYSA to source the mixed SCAs, and 98 and 104 patients from the SENIOR standard and experimental arms, respectively, to assess the relevance of the mixed SCAs (Fig. 1). This population is referred to as the confirmed diagnosis set. All included patients received an anti-CD20 monoclonal antibody (mAb) + miniCHOP-based regimen as first-line treatment.

Fig. 1: Study flow-chart.

tFL transformed follicular lymphoma, PS propensity score, mAb monoclonal antibody. *Effective sample size of the pseudo-population after weighting, which is close to the real samples.

Table 1 shows the characteristics of patients from each cohort included in the confirmed diagnosis set. Most patients included in this study had intermediate-high or high-risk DLBCL (77.1% of all patients had an IPI score of 3 or higher).

Table 1 Patients’ characteristics from the confirmed diagnosis set.

PS were calculated for all patients before each sIPTW weighting procedure. Five patients were excluded because of extreme values (above 0.9 or below 0.1): based on the Mixed SCA 1 PS, 2 from the SENIOR standard arm; based on the Mixed SCA 2 PS, 1 from the SENIOR experimental arm and 2 from Mixed SCA 2 (Fig. 1). The final population constituted the “PS set” (Fig. 1).

In the PS set, some covariates were unbalanced before sIPTW, with an absolute SMD above 0.1 for 4 covariates (sex, performance status (ECOG) 0 and 2, mass >10 cm, and Ann Arbor stage) between the SENIOR standard arm and Mixed SCA 1, and for 5 covariates (sex, mass >10 cm, LDH, IPI, B symptoms) between the SENIOR experimental arm and Mixed SCA 2 (Fig. 2). SMD were also calculated on a pool of the 15 imputed datasets and were similar.

Fig. 2: Absolute Standardized Mean Differences (SMD) using Mixed SCAs.

Balance of covariates included in the propensity score between (A) SENIOR standard arm or (B) SENIOR experimental arm and Mixed SCAs before and after sIPTW weighting.

Weighting procedures

Final populations in the PS set were weighted with the sIPTW method: SENIOR Standard arm with Mixed SCA 1 and SENIOR Experimental arm with Mixed SCA 2. Table 2 shows patients’ characteristics distribution in the 4 arms before and after weighting.

Table 2 Patients’ characteristics from the SENIOR standard or SENIOR experimental arm and Mixed SCA 1 or 2 from the PS set before and after sIPTW weighting.

Using the “across” method to manage missing data, the sIPTW weighting procedures efficiently balanced covariates between arms, as all SMD for covariates included in the PS were below 0.10 (Fig. 2).

Fig. 3: Outcome analysis using Mixed SCAs.

OS comparison between Mixed SCA 1 and SENIOR control arm before (A) and after (B) weighting. OS comparison between Mixed SCA 2 and SENIOR experimental arm before (C) and after (D) weighting.

To explore if the missing data management method used in this study could impact weighting efficiency, we also analyzed the balance of covariates using the “complete case” or “within” method. SMD for all covariates was <0.1, whatever the method used for missing data management. Moreover, in the “within” method, at different imputations (from 1 to 15), SMD for all covariates were <0.1, and variation of SMD values across the different imputations was very low (Supplementary Fig. 1).

Outcome analysis

The first step was to assess if Mixed SCA 1 could reproduce the SENIOR control arm’s OS. OS was not significantly different between the SENIOR control arm and Mixed SCA 1, with an HR [95% CI] of 0.86 [0.57–1.32] (p = 0.493) before weighting, and an HR of 0.79 [0.52–1.20] (p = 0.277) after weighting with sIPTW (Fig. 3A, B).

The second step was to assess the reproducibility of SENIOR trial results by switching the SENIOR control arm to Mixed SCA 2. In our study, OS was not significantly different between the SENIOR experimental arm and Mixed SCA 2, with an HR of 0.83 [0.55–1.25] (p = 0.3631) before weighting, and an HR of 0.74 [0.49–1.12] (p = 0.165) after weighting with sIPTW (Fig. 3C, D).

There was no statistically significant difference in OS between SENIOR arms and Mixed SCAs using different missing data management methods (Supplementary Table 1).

REALYSA-SCA

We also assessed the feasibility and viability of SCAs (REALYSA-SCAs) built with real-world data from the REALYSA cohort only, with the same methods as the Mixed SCAs.

Table 3 shows patients’ characteristics from the SENIOR control arm or experimental arm compared to patients from REALYSA-SCAs before and after sIPTW. More covariates were imbalanced before weighting compared to analyses conducted with Mixed SCAs, but after weighting, SMDs of all covariates were <0.1 (Fig. 4). Weighting procedures were efficient, irrespective of the weighting method or the missing data management method used (Supplementary Fig. 2).

Fig. 4: Absolute Standardized Mean Differences (SMD) using REALYSA SCAs.

Balance of covariates included in the propensity score between (A) SENIOR standard arm or (B) SENIOR experimental arm and REALYSA-SCAs before and after sIPTW weighting.

Table 3 Patients’ characteristics from the SENIOR standard or SENIOR experimental arm and REALYSA-SCA from the PS set, before and after sIPTW weighting.

OS was not significantly different between the SENIOR control arm and REALYSA-SCA-1, with an HR [95% CI] of 0.89 [0.56–1.44] (p = 0.640) before weighting and an HR of 0.90 [0.56–1.43] (p = 0.478) after sIPTW. Similarly, OS was not significantly different between the SENIOR experimental arm and REALYSA-SCA-2 with an HR of 0.84 [0.53–1.35] (p = 0.474) before weighting and an HR of 0.88 [0.55–1.41] (p = 0.612) after weighting with sIPTW (Fig. 5A–D).

Fig. 5: Outcome analysis using REALYSA SCAs.

OS comparison between REALYSA-SCA 1 and SENIOR control arm before (A) and after (B) weighting. OS comparison between REALYSA-SCA 2 and SENIOR experimental arm before (C) and after (D) weighting.

OS comparison results between REALYSA-SCAs and SENIOR standard or experimental arm obtained in sensitivity analyses were close to those obtained with the main methodology, and no statistically significant difference was observed (Supplementary Table 1).

Discussion

In this study, we showed that a well-balanced SCA designed for ≥80 y.o. DLBCL patients in the first-line setting, built with data from historical clinical trials and real-world settings, can mimic the internal control arm of an RCT. Replacing the internal control arm with an SCA reproduced the efficacy results based on OS comparison, with an HR [95% CI] of 0.743 [0.494–1.118] for the mixed SCA built from clinical trial and real-world data and an HR of 0.88 [0.55–1.41] (p = 0.612) for the SCA based only on real-world data, compared with an HR of 0.996 [0.66–1.51] in the SENIOR trial. This research suggests that we should open the field to newly designed trials to improve clinical trial accrual and overall clinical benefit, even in the very elderly setting.

Our selected balancing procedure with the sIPTW method was highly efficient, with all SMD < 0.1 for the covariates included in the PS calculation, along with a low drop rate of “extreme patients” (0.7–1.4% (Fig. 1)), supporting the robustness of the results. We used mixed sources of real-world data and clinical trial data to build this mixed SCA, to strengthen the reliability of our results from the perspective of upcoming “real-life” phase III clinical trials. Surprisingly, only a limited number of real-world-sourced patients were excluded by applying the inclusion criteria (performance status (ECOG) < 3 and Ann Arbor stage >I). Regarding patients’ characteristics, these results showed that ≥80 y.o. patients with DLBCL in the first-line setting are quite similar despite very different inclusion periods between LNH09-7B, SENIOR, and REALYSA. Indeed, most patients’ characteristics were already balanced before weighting, as only a few covariates had a high SMD despite the absence of randomization and the use of real-world data. This suggests that (1) by applying the SENIOR trial inclusion criteria, we selected a substantially homogeneous population of elderly DLBCL patients to build the SCAs, and (2) the SENIOR trial enrolled an unbiased population of elderly DLBCL patients treated with immunochemotherapy intent. Of note, the analysis based on an SCA built only with real-world data (REALYSA-SCA) showed that patients could also be well balanced with SENIOR patients from either the internal control arm or the experimental arm with the sIPTW method, with an HR [95% CI] of 0.895 [0.560–1.429] and 0.877 [0.547–1.407], respectively.

Some limitations of our study should be taken into account. We did not plan a statistical analysis to compare the HRs and their 95% CIs between the SENIOR trial and this study because, to our knowledge, no statistical test with a defined threshold can properly establish whether such results amount to a reproduction of efficacy. CIs mainly overlapped with the SENIOR results. The fact that the OS of Mixed SCA 1 and that of the SENIOR internal control arm are close supports the study conclusion. The period of data accrual differs between LNH09-7B (2010–2011) and REALYSA (2018–2021), and we cannot rule out that supportive care and geriatric assessment and management improved during this 10-year time frame. It is possible that these differences led to improved long-term OS with REALYSA-SCAs compared to Mixed SCAs. However, we did not plan a comparison between the sources of patients. On the other hand, management of relapse did not change dramatically between 2010–2011 and 2018–2021, so differences in OS related to improved post-relapse treatment are unlikely. Furthermore, we could not incorporate prognostic geriatric assessments (e.g., the Instrumental Activities of Daily Living (IADL) scale) into the PS calculation, because this information was not available in all datasets. In our dataset, data were prospectively collected, with a low rate of missing data (only mass > 10 cm and albumin level had >5% of missing data among REALYSA patients) [30]. Moreover, we showed that missing data could be reliably imputed, allowing us to keep a steady sample size with robust outcome analysis results. Indeed, sensitivity analyses for missing data management with the “complete cases” or “within” multiple imputation methods led to the same conclusions as the main (“across”) method.

From a statistical point of view, we chose the sIPTW weighting method because it resulted in a pseudo-population with stable and comparable sample sizes, as opposed to IPTW, which would have resulted in a larger pseudo-population (that may result in an increased rate of type 1 error). As illustrated in sensitivity analyses, CIs were narrower with IPTW compared to sIPTW weighting, but we still obtained consistent results with IPTW.
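The difference in pseudo-population size between IPTW and sIPTW can be checked numerically. The Python sketch below uses made-up propensity scores and the standard definitions of both weights; the function names are hypothetical, and the size of the pseudo-population is taken as the sum of the weights.

```python
def iptw(treated, ps):
    """Unstabilized weights: 1/PS in arm 1, 1/(1 - PS) in arm 0."""
    return [1 / p if t == 1 else 1 / (1 - p) for t, p in zip(treated, ps)]


def siptw(treated, ps):
    """Stabilized weights: numerator is the marginal probability of the arm."""
    p1 = sum(treated) / len(treated)
    return [
        p1 / p if t == 1 else (1 - p1) / (1 - p)
        for t, p in zip(treated, ps)
    ]


def pseudo_population_size(weights):
    """Size of the weighted pseudo-population (sum of the weights)."""
    return sum(weights)


# Six patients, three per arm (illustrative values, not study data).
treated = [1, 1, 1, 0, 0, 0]
ps = [0.6, 0.5, 0.4, 0.6, 0.5, 0.4]
size_iptw = pseudo_population_size(iptw(treated, ps))
size_siptw = pseudo_population_size(siptw(treated, ps))
```

In this toy example the unstabilized weights roughly double the apparent sample (about 12.3 pseudo-patients for 6 real ones), whereas the stabilized weights keep it close to the real size (about 6.2), which is why variance estimates based on the IPTW pseudo-population tend to be too optimistic.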

Other examples of SCA building and validation emerged recently in the hematology field using data at an individual level from historical clinical trials or real-world data with PS-based methods to control for confounding variables [31, 32]. For example, another study also used REALYSA data to build an SCA with newly diagnosed patients with advanced Hodgkin lymphoma and reproduced efficacy results of the phase III RCT AHL2011 [33].

We believe our study demonstrates that in a difficult-to-study population, SCA could be used as a bona fide strategy to improve very elderly DLBCL patients’ outcomes by allowing trials with “in silico” control arms along with more promising experimental arms. Indeed, SENIOR is currently the only published RCT in this setting and recruited 249 patients in 3 years within 100 centers in France and Belgium, illustrating the challenges to enroll these patients.

The mixed SCAs we built could help assess new combinations with targeted therapies that might improve OS compared to the R-miniCHOP-based regimen, to which the addition of another drug appears difficult due to toxicity issues in this population. In our opinion, OS is the most relevant primary outcome to compare an SCA to another arm, especially in the elderly population, as first progression is soon followed by death in this low-reserve population. The use of progression-free or event-free survival would require the same follow-up schedule for response evaluation, which is unlikely with multicenter real-world data. Furthermore, health authorities are now giving weight to these indirect, statistically sound strategies to allow market access to new treatments (if the control arm is the right one), and this weight could be increased by new trial designs. For instance, comparative trials with control arms composed of 50% external patients and 50% randomized patients (to keep toxicity comparisons available) could shorten accrual time by reducing the number of patients to include or, with a steady number of patients, allow more patients to be included in the experimental arm (2:1 randomization). To conclude, we built an SCA from mixed clinical trial and real-world data for ≥80 y.o. DLBCL patients in the first-line setting that could be used to design comparative trials with an innovative design, which could help explore the ability of new treatment combinations to improve survival in this population that is otherwise poorly represented in clinical trials.