Introduction

Prognostication is key in patients with multiple myeloma (MM), who demonstrate highly heterogeneous clinical outcomes from the time of diagnosis. Since 2015, the revised international staging system (R-ISS) has been recommended as the gold standard for prognostication in patients with newly diagnosed MM (NDMM) [1]. On top of the former international staging system (decreased albumin and increased beta-2-microglobulin [B2M] levels), the R-ISS included lactate dehydrogenase (LDH) levels and adverse cytogenetic aberrations (CA) defined as t(4;14), t(14;16) and del(17p) detected by fluorescence in situ hybridization (FISH). However, only 3–5% of patients with NDMM harbor t(14;16), and its independent prognostic value has been questioned [2, 3]. Larger studies of patients with t(14;16) have been conducted without questioning the prognostic relevance of t(14;16) [4, 5]. Further, the distribution of patients into R-ISS I, II and III is uneven, which has led to inconclusive subgroup analyses due to small numbers [6, 7].

Based on these observations, the second R-ISS (R2-ISS) was recently developed to improve prediction of overall survival (OS) [8]. Developed as a prognostic index from a multivariable analysis in patients from randomized clinical trials, the R2-ISS score is calculated based on ISS III (1.5 points), ISS II (1 point), elevated LDH (1 point), presence of del(17p) (1 point), t(4;14) (1 point) and +1q (0.5 points), and patients are subsequently grouped into low (0 points), low-intermediate (0.5 to 1 points), intermediate-high (1.5 to 2.5 points) and high risk (≥3 points). Both the training and validation cohort consisted exclusively of patients enrolled in randomized clinical trials [8]. However, the differences between patients from clinical trials and those in real-world populations are significant [9, 10]. In general, real-world populations tend to be older and enriched with high-risk features, and clinical outcomes are worse than in clinical trial populations, especially for transplant-ineligible patients [9, 10]. Importantly, most patients in real-world populations do not fulfill inclusion criteria for clinical trials of NDMM, mainly due to kidney failure, comorbidity and poor Eastern Cooperative Oncology Group (ECOG) performance status (PS) [10]. At the same time, the prognostic role of +1q aberrations remains conflicting, likely because of the importance of copy number, which was not available in the R2-ISS dataset [8, 11].
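The R2-ISS point system described above can be sketched in a few lines of Python. This is a minimal illustration using only the point values and cutoffs quoted in the preceding paragraph; the function and argument names are hypothetical and not taken from reference [8].

```python
# Point values for ISS stage as quoted in the text (ISS I contributes no points).
ISS_POINTS = {1: 0.0, 2: 1.0, 3: 1.5}


def r2_iss_score(iss_stage: int, ldh_elevated: bool, del17p: bool,
                 t4_14: bool, gain_1q: bool) -> float:
    """Sum the R2-ISS points for one patient (names are illustrative)."""
    score = ISS_POINTS[iss_stage]
    score += 1.0 if ldh_elevated else 0.0   # elevated LDH: 1 point
    score += 1.0 if del17p else 0.0         # del(17p): 1 point
    score += 1.0 if t4_14 else 0.0          # t(4;14): 1 point
    score += 0.5 if gain_1q else 0.0        # +1q: 0.5 points
    return score


def r2_iss_group(score: float) -> str:
    """Map a cumulative score to the four R2-ISS risk groups."""
    if score == 0:
        return "low"
    if score <= 1:
        return "low-intermediate"
    if score <= 2.5:
        return "intermediate-high"
    return "high"
```

For example, a patient with ISS II and +1q but no other risk factor scores 1.5 points and falls into the intermediate-high group.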

In this study, we aimed to validate R2-ISS in a population-based nationwide cohort in subgroups of transplant-eligible and transplant-ineligible MM patients, and to improve prognostication in real-world populations.

Methods

We included all Danish patients registered with MM in the Danish Multiple Myeloma Register from 2005 through 2019 [12]. We excluded patients without an indication for treatment (according to CRAB criteria) and patients who had not started anti-myeloma therapy three months after diagnosis: these patients were defined as having smoldering MM. To calculate R-ISS and R2-ISS, we retrieved information on ISS, LDH levels, and adverse CA by FISH (i.e., t[4;14], t[14;16], del[17p], and amp/gain[1q]). LDH levels above 205 U/L were considered elevated regardless of age and geographical region. CAs were registered as either present or absent despite varying lower detection levels at different laboratories: a cutoff of 10% is recommended for FISH analyses in Denmark. We calculated overall survival (OS) from time of diagnosis until death with censoring at end of follow-up. Patients were grouped according to R-ISS and R2-ISS risk to compare median OS and Harrell’s C-index in order to assess discrimination capabilities. Subgroup analyses of OS stratified on R2-ISS and R-ISS were performed in three populations: (1) patients undergoing high-dose therapy (HDT), (2) younger transplant ineligible patients up to 70 years of age, and (3) transplant ineligible patients older than 70 years.
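As an illustration of the discrimination measure used here, Harrell's C-index can be computed by counting concordant pairs among all comparable patient pairs. The sketch below is a plain-Python version written for clarity rather than speed; it is not the software used in this study, it ignores pairs with tied survival times, and all names are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored survival data.

    times       -- follow-up time per patient
    events      -- 1/True if death was observed, 0/False if censored
    risk_scores -- higher value means higher predicted risk (e.g. stage number)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death had the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. Because staging systems assign only a few distinct risk levels, many pairs have tied risk scores, which pulls the index toward 0.5.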

To develop a real-world ISS (RW-ISS), we replicated the methods as published by D’Agostino and colleagues [8]. In brief, we split patients with NDMM into a 75% training set and a 25% test set, and performed a multivariable Cox regression analysis for OS on the training set including the variables used to develop R2-ISS, i.e., age, ECOG PS, ISS, IgA isotype, kidney failure, elevated LDH, del(17p), t(4;14), t(14;16), and amp(1q) [8]. In contrast to R2-ISS, age was dichotomized (≤70 years vs >70 years) and ECOG PS was grouped according to the tertile distribution (0 vs 1 vs >1) (Table 1). Kidney failure was defined as a creatinine level above 2 mg/dL (>177 µM) regardless of age, sex and baseline creatinine [13]. Next, we selected the top 6 features with the highest significant hazard ratios (HR) from the multivariable model to develop RW-ISS. Features with HR above 2 were assigned 2 points, HR between 1.5 and 2.0 were assigned 1 point, and HR below 1.5 were assigned 0.5 points. Cumulative RW-ISS scores were calculated in order to assign patients to RW-ISS I, II, III, and IV based on the quartile distribution of scores in the training set. The test set and the Norwegian external cohort were used to internally and externally validate RW-ISS, respectively. We requested external validation of the RW-ISS from all collaborators within the Nordic Myeloma Study Group; however, only one center could participate and retrieve data on all RW-ISS variables. The data for external validation were thus acquired from The Myeloma Registry of Central Norway (MRCN), which includes myeloma patients from hospitals in Central Norway [14]. All living patients included in the MRCN have signed an informed consent for the use of their clinical data in medical research. We received an exemption from informed consent for patients who were dead at the time of inclusion in the MRCN.
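The feature selection and weighting rule described above can be sketched as follows. This is an illustrative outline only: the input format, the function names and the significance threshold of 0.05 are assumptions, since the text states only that significant features were ranked by hazard ratio.

```python
from typing import Dict, List, Tuple


def select_top_features(results: Dict[str, Tuple[float, float]],
                        k: int = 6, alpha: float = 0.05) -> List[str]:
    """Return the k significant features with the highest hazard ratios.

    results maps a feature name to (hazard_ratio, p_value) from the
    multivariable Cox model; alpha = 0.05 is an assumed threshold.
    """
    significant = [name for name, (_, p) in results.items() if p < alpha]
    significant.sort(key=lambda name: results[name][0], reverse=True)
    return significant[:k]


def points_from_hazard_ratio(hr: float) -> float:
    """Translate a hazard ratio into score points using the stated cutoffs."""
    if hr > 2.0:
        return 2.0   # HR above 2
    if hr >= 1.5:
        return 1.0   # HR between 1.5 and 2.0
    return 0.5       # HR below 1.5
```

A patient's cumulative score is then the sum of the points for every selected feature that is present.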

Table 1 Baseline characteristics.

The study was approved by the Danish National Ethics Committee (1804410) and Data Protection Agency (P-2020-561). The use of Norwegian data was approved by the Regional Committee for Medical and Health Research Ethics (714399) and the scientific committee of the MRCN. All methods were performed in accordance with the relevant guidelines and regulations.

Results

Among 5492 patients with NDMM, FISH was registered in 2929 patients (53.3%) for whom R2-ISS could be calculated. Baseline characteristics are summarized in Table 1. Adverse CAs were identified in 188 (6.4%) patients with t(4;14), 238 (8.1%) patients with del(17p), and 546 (18.6%) patients with +1q, while t(14;16) was identified in only 77 (2.6%) patients. Overall, adverse CAs included in R2-ISS were detected in 800 (27.3%) patients as compared with 452 (15.4%) for R-ISS related adverse CAs. Single, double, and triple hit adverse cytogenetics was identified in 635 (21.7%), 187 (6.4%), and 17 (0.6%) patients, respectively [15]. R2-ISS was low, low-intermediate, intermediate-high, and high-risk in 397 (13.6%), 830 (26.7%), 1484 (50.7%), and 264 (9.0%) patients, respectively.

With a median follow-up of 5.2 years (interquartile range [IQR], 3.1–7.8), R2-ISS clearly stratified patients, demonstrating a median OS for low, low-intermediate, intermediate-high, and high-risk of 8.4, 6.2, 4.1, and 2.6 years, respectively (Fig. 1A; P < 0.0001; C-index 0.604). We next analyzed OS stratified on R2-ISS in three different subgroups: patients undergoing HDT, and younger and older transplant-ineligible patients. First, R2-ISS barely stratified OS in 875 patients from time of HDT (Fig. 1B; P = 0.0078), and we were unable to demonstrate a pairwise OS difference between R2-ISS low, low-intermediate, intermediate-high, and high (P ≤ 0.28; pairwise log-rank). Second, among the 2054 patients (70.1%) who could not undergo HDT, 618 (30.1%) patients were younger and 1436 (69.9%) patients were older than 70 years of age. Among these transplant-ineligible patients, R2-ISS could not clearly stratify OS in younger patients (Fig. 1C; P ≤ 0.38; pairwise log-rank), whereas OS in elderly patients stratified well based on R2-ISS (Fig. 1D; P ≤ 0.0042; pairwise log-rank). Furthermore, patients with single, double, and triple hit demonstrated an incrementally shorter OS with a hazard ratio of 1.18 (95% confidence interval 1.09 to 1.28; P < 0.0001) per additional high-risk CA hit.

Fig. 1: Overall survival stratified by R2-ISS in newly diagnosed patients.

A OS for the entire cohort of 2929 patients with available data, B transplant-eligible patients from time of high-dose therapy (HDT), C younger transplant-ineligible patients, and D older transplant-ineligible patients. Median overall survival and C-index are indicated.

To compare R2-ISS and R-ISS, we repeated the OS analyses for the same subgroups stratified on R-ISS. R-ISS was I, II and III in 442 (15.1%), 1883 (64.3%) and 604 (20.6%) patients, respectively, with a median OS of 8.5, 5.1, and 2.8 years, respectively (Supplemental Fig. S1A; P < 0.0001; C-index 0.595). R-ISS clearly stratified all subgroups, although C-indices were generally lower than with R2-ISS (Fig. 1 and Supplemental Fig. S1).

As R2-ISS could not be fully validated in this Danish real-world population, we examined how it was developed. We noticed that age was not selected as a feature in R2-ISS because the multivariable model from which it was developed was adjusted for age using 1-year intervals. Further, ECOG PS was presumably markedly lower in the clinical trial populations used to develop R2-ISS than in this real-world population (median [IQR]: not published vs 1 [1;2]) [10].

To create a real-world international staging system (RW-ISS), we thus performed multivariable analysis, feature selection and weighted scoring of hazard ratios similar to the methods applied by D’Agostino and colleagues (see Methods) [8]. However, we divided patients into categorical age groups based on a cutoff of 70 years (≤70 vs. >70 years) and ECOG PS into three groups based on the tertile distribution (0 vs 1 vs >1; Table 1). We randomly split patients with available R2-ISS (n = 2929) into a 75% training set and a 25% test set with an even distribution of baseline characteristics (Table 1). Multivariable analysis on the training set demonstrated an independent association with shorter OS for age, ECOG PS, t(14;16), ISS, LDH, del(17p), IgA isotype, renal insufficiency, and sex, whereas t(4;14) and +1q were not significantly associated with shorter OS (Fig. 2). Selecting the top 6 significant features, we assigned 2 points to patients above 70 years and with an ECOG PS > 1 (HR > 2), 1 point to patients with an ECOG PS 1, t(14;16) and ISS III (HR 1.5–1.9), and 0.5 points to patients with ISS II, elevated LDH and del(17p) (Table 2). In the test set, 278 (38.2%), 144 (19.8%), 197 (27.1%), and 109 (15.0%) patients were RW-ISS I (0 to 2 points), II (2.5 to 3 points), III (3.5 to 4.5 points), and IV (5 to 8 points), respectively, and the median OS was 9.5, 5.5, 3.4 and 1.1 years, respectively (Fig. 3A; P < 0.0001; C-index 0.708). In comparison, C-indices for R2-ISS and R-ISS were only 0.604 and 0.595, respectively (Fig. 1A and Supplemental Fig. S1A). These results were partly validated externally in Norwegian data (Fig. 3B), as the NDMM patients with RW-ISS stage II (n = 28) seemed to fare worse than those with stage III (n = 25; P = 0.37; pairwise log-rank). A full validation could be demonstrated when combining RW-ISS stage II and III (Supplemental Fig. S2; P = 0.0004; C-index = 0.679).
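The scoring and staging step can be illustrated with a short Python sketch. The point values below are those quoted in the paragraph above and the stage cutoffs are the reported score ranges; Table 2 should be treated as the authoritative source for the weights. The function and argument names are hypothetical, and this is not the code behind the online calculator.

```python
def rw_iss_score(age_years: float, ecog_ps: int, iss_stage: int,
                 t14_16: bool, ldh_elevated: bool, del17p: bool) -> float:
    """Cumulative RW-ISS score, using the weights quoted in the text above."""
    score = 0.0
    if age_years > 70:
        score += 2.0            # age above 70 years
    if ecog_ps > 1:
        score += 2.0            # ECOG PS > 1
    elif ecog_ps == 1:
        score += 1.0            # ECOG PS 1
    if iss_stage == 3:
        score += 1.0            # ISS III
    elif iss_stage == 2:
        score += 0.5            # ISS II
    if t14_16:
        score += 1.0            # t(14;16)
    if ldh_elevated:
        score += 0.5            # elevated LDH
    if del17p:
        score += 0.5            # del(17p)
    return score


def rw_iss_stage(score: float) -> str:
    """Map a cumulative score to RW-ISS stage I to IV."""
    if score <= 2.0:
        return "I"
    if score <= 3.0:
        return "II"
    if score <= 4.5:
        return "III"
    return "IV"
```

For instance, a 72-year-old patient with ECOG PS 1 and ISS II, and no other risk factor, scores 3.5 points under these weights and is assigned to RW-ISS III.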

Fig. 2: Cox proportional hazards model of overall survival for patients in the training set.

Variables were the same as those used to develop R2-ISS from clinical trial data, except that age was used as a categorical variable and performance status (PS) was grouped based on the tertile distribution. Variables are ranked by highest hazard ratio.

Fig. 3: Validation.

Overall survival stratified by RW-ISS in newly diagnosed patients in A the internal test set and B Norwegian external cohort.

Table 2 RW-ISS scoring system.

Discussion

To our knowledge, this is the first validation of the R2-ISS in a population-based, nationwide cohort of patients with NDMM. However, R2-ISS could only be fully validated in elderly transplant-ineligible patients, as stratification according to R2-ISS in younger patients was unclear regardless of transplant eligibility. As a result, we here developed the real-world ISS (RW-ISS) based on selection of high-risk features from a multivariable model (age, ECOG PS, t[14;16], ISS, high LDH and del[17p]). We further demonstrate stratification and discriminatory capabilities in a test set and in part externally validate RW-ISS in an independent real-world population.

Overall, we demonstrated that the R2-ISS had superior discriminatory capabilities benchmarked against R-ISS. Similar to the findings by D’Agostino and colleagues [8], R2-ISS primarily refined the prognosis for the nearly two-thirds of patients previously classified as R-ISS II, while the prognosis for patients with R-ISS I and III was largely similar to those with R2-ISS low and high risk, respectively. Although a more even distribution of patients according to R2-ISS may be an advantage in a clinical trial setting, where statistical analyses become better balanced, we are concerned that R2-ISS could not clearly stratify younger patients, whether transplant-eligible or transplant-ineligible, in a real-world setting. Another limitation of the R2-ISS is the scoring system, which is certainly feasible to handle, but more laborious than calculating R-ISS. Because R2-ISS was only partly validated and has no clear implications for the clinical management of patients, we believe that R2-ISS may only be considered in elderly patients and in clinical trials as validated by others [16]. As R-ISS on the other hand clearly stratified all subgroups, we will thus continue to recommend the use of R-ISS for routine clinical management of patients with NDMM outside clinical trials.

In this study, we used a smaller yet large real-world cohort (5492 vs 10843 patients, respectively) with a higher proportion of evaluable patients with complete FISH data (2929 [53.3%] vs 3440 [31.7%], respectively) as compared to the original study published by D’Agostino and colleagues [8]. Replicating the methods applied in their original study [8], we here selected the top 6 features with the highest hazard ratios from the multivariable model to develop RW-ISS. Age >70 years (2 points), ECOG PS 1 and >1 (1 point and 2 points), ISS II and III (1 point and 2 points), t(14;16) (1 point), high LDH (0.5 points), and del(17p) (0.5 points) were used to calculate the RW-ISS score and assign patients to RW-ISS I (0–2 points), II (2.5–3 points), III (3.5–4.5 points), and IV (5–8 points). In this study, t(14;16) was indeed rare and identified in only 2.6% of patients, while +1q was the most common CA, identified in 18.6% of patients; information on the number of 1q aberrations was not available in Danish data. Further, t(14;16) was a strong prognostic marker of OS [8], whereas +1q was not at all prognostic of OS in our statistically well-powered multivariable analysis. While only one large study has not been able to confirm the prognostic value of t(14;16) [2], others have demonstrated longer OS for patients with sole t(14;16) as compared with additional cytogenetic aberrations and a high occurrence of renal failure at time of diagnosis in pure t(14;16) populations [4, 5]. Thus, the argument for R2-ISS to replace a rare biomarker with a common one seems invalid in the setting of real-world populations. To overcome this obstacle, future studies considering single, double and triple hit MM are warranted [13]. It is important to stress that neither RW-ISS nor R2-ISS evaluated multi-hit high-risk cytogenetics. Considering the missing effect of +1q in our population-based cohort, R2-ISS seems to favor addition of FISH analyses over obvious confounders such as age and comorbidity.

Although D’Agostino and colleagues adjusted for age and sex, using age as a continuous variable, it seems that age was deliberately excluded as only features with the highest hazard ratios were selected for R2-ISS. Even so, age is prognostic of OS in almost every aspect of general medicine, and most international prognostic indices in lymphoid cancers include age [17,18,19,20,21,22]. The rationale for using age as a part of the prognostication is further confirmed by results from the Myeloma XI cohort demonstrating that the relative contribution of molecular risk to survival varied by age group, with a larger effect on OS in the younger patients [23]. Further, rather than depending on randomization, the selection of patients for duplet, triplet or quadruplet therapy outside clinical trials largely relies on myeloma frailty [24, 25]. The simplest model of myeloma frailty is probably the Mayo Clinic frailty score, which includes dichotomized age, ECOG PS and high N-terminal natriuretic peptide type B (NT-ProBNP), underscoring the importance of age, fitness and cardiac comorbidity [26, 27]. Here, we demonstrate that age (≤70 vs >70 years) and ECOG PS (0 vs 1 vs >1) were better prognosticators of OS in NDMM than any other factor included in the R-ISS and R2-ISS (Fig. 2), probably reflecting that elderly, frail patients are likely to receive limited therapy in the real-world setting. As the sole purpose of prognostic tools in NDMM is to inform on prognosis (rather than treatment selection), we firmly believe that it is reasonable to include both age and ECOG PS in RW-ISS. This still leaves room for frailty scores such as the International Myeloma Working Group (IMWG) frailty score, which includes activities of daily living (ADL) and the Charlson comorbidity index (CCI), to help evaluate the expected tolerable therapy intensity [24, 27].

We further notice a fairly even distribution of patients in the test set with RW-ISS I (38.2%), II (19.8%), III (27.1%) and IV (15.0%), which may be argued as an advantage when comparing survival differences in smaller cohorts [8]. Lastly, we externally validated the RW-ISS in part from Norwegian real-world data. We believe that the small external cohort likely explains why RW-ISS was only partly validated.

Limitations to this study include dichotomization of age, which likely holds more prognostic value in smaller age intervals as exemplified in diffuse large B-cell lymphoma [28]. Further, FISH data were retrieved from a large nationwide register rather than from laboratories directly. Thus, in some cases, amp(1q) may in fact represent gain(1q) rather than multiple 1q copies, which to some extent may explain the absent effect on OS of this cytogenetic aberration. We also recognize that the six variables used in RW-ISS may limit its daily practical use. However, we underscore that both age and PS are freely available for practicing hematologists in all countries, while deployment of electronic clinical support tools with automated calculations may overcome this obstacle [29]. Lastly, we underscore that we were only able to retrieve a validation cohort with available baseline variables from a single center, despite requesting such data from collaborators in all Nordic countries. The small number of patients in the external validation cohort likely explains why RW-ISS could only be externally validated in part. The RW-ISS thus needs further external validation in larger real-world cohorts.

In conclusion, RW-ISS may refine the prognosis in real-world, routine clinical care of patients with NDMM, as it considers the clinical factors age and performance status in addition to R-ISS risk factors. RW-ISS may be calculated online (https://rwiss.shinyapps.io/RWISS/). Although R2-ISS could be validated in Danish real-world data, we believe that R2-ISS should be reserved for patients enrolled in clinical trials.