Fig. 2: Event prediction and vitals forecasting performance evaluation.

a Event Prediction (Survival): We evaluated our model on two datasets: MM2, comprising NDMM patients, and MM1, comprising RRMM patients. We report the concordance index based on inverse probability of censoring weights (C-index IPCW), averaged across three time quantiles (25th, 50th, and 75th), at different observation windows (1 month, 6 months, and 12 months). Across both MM2 and MM1, SCOPE performed largely on par with the RSF and CPH models and significantly better than DDH and CPH-ISS (p < 0.001, Bonferroni corrected). We note the added benefit that SCOPE needs to be trained only once, whereas the two other model architectures required a separate model for each observation window and each event outcome.

b Event Prediction (AEs): We report the average concordance index for multiple adverse events (restricted to ≥Grade 2 non-hematologic events and ≥Grade 3 hematologic events) at the 6-month observation window (see the Supplementary Material for results at other time points, Supplementary Fig. 2 and Supplementary Tables 1-11). The adverse events were mapped to shortened names as follows: ae-0: Acute Renal Failure, ae-1: Cardiac Arrhythmias, ae-2: Diarrhea, ae-3: Heart Failure, ae-4: Hypotension, ae-5: Liver Impairment, ae-6: Nausea, ae-7: Neutropenia, ae-8: Peripheral Neuropathies, ae-9: Rash, ae-10: Thrombocytopenia, and ae-11: Vomiting. For the adverse events that were predictable from the data (i.e., hypotension, acute renal failure, neutropenia, and thrombocytopenia), SCOPE was competitive with highly tuned, task-specific CPH and RSF models trained separately on each adverse event.

c Forecasting: We plotted the mean squared error (MSE) of each model when forecasting different sets of variables (chemistry labs, serum immunoglobulins, and all lab values) over two forecasting horizons, 6 months and 12 months. Evaluation was performed after observing all of a patient's data up to one of three conditioning time points (t_cond): 1 month, 6 months, or 12 months. We found that SCOPE outperformed the other methods in all cases (p < 0.001, Bonferroni corrected). All error bars correspond to the standard deviation computed over the five model predictions.
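The caption does not specify how the quantile-averaged C-index IPCW (panels a and b) was implemented. Below is a minimal sketch of one way to compute it with scikit-survival, assuming a single risk score per patient and truncation times set at the 25th/50th/75th quantiles of observed event times; the names y_train, y_test, and risk_scores are placeholders, and the toy data are for illustration only, not the study cohorts.

```python
import numpy as np
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

def mean_cindex_ipcw(y_train, y_test, risk_scores, quantiles=(0.25, 0.5, 0.75)):
    """Average C-index IPCW over truncation times placed at event-time quantiles."""
    # Truncation times: quantiles of the observed (uncensored) event times in the test set.
    event_times = y_test["time"][y_test["event"]]
    taus = np.quantile(event_times, quantiles)
    scores = []
    for tau in taus:
        # concordance_index_ipcw returns (cindex, concordant, discordant, tied_risk, tied_time)
        cindex = concordance_index_ipcw(y_train, y_test, risk_scores, tau=tau)[0]
        scores.append(cindex)
    return float(np.mean(scores))

# Toy example (synthetic data, illustration only):
rng = np.random.default_rng(0)
y_train = Surv.from_arrays(event=rng.random(200) < 0.7, time=rng.uniform(1, 36, 200))
y_test = Surv.from_arrays(event=rng.random(100) < 0.7, time=rng.uniform(1, 30, 100))
risk = rng.normal(size=100)  # higher score = higher predicted risk
print(round(mean_cindex_ipcw(y_train, y_test, risk), 3))
```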
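For panel c, the per-horizon MSE and its error bars could be aggregated as sketched below, assuming predictions are available from five independently trained model runs (as stated in the caption); the array names and shapes are placeholders, not taken from the paper.

```python
import numpy as np

def horizon_mse(y_true, y_pred_runs):
    """y_true: (patients, horizon_steps, labs); y_pred_runs: (runs, patients, horizon_steps, labs).
    Returns the mean MSE across runs and its standard deviation (the error bar)."""
    per_run_mse = ((y_pred_runs - y_true[None]) ** 2).mean(axis=(1, 2, 3))  # one MSE per run
    return per_run_mse.mean(), per_run_mse.std()

# Toy example (synthetic data, illustration only):
rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 12, 8))                               # 50 patients, 12 monthly steps, 8 labs
preds = truth[None] + rng.normal(scale=0.3, size=(5, 50, 12, 8))   # five model runs
mse_mean, mse_std = horizon_mse(truth, preds)
print(f"MSE = {mse_mean:.3f} ± {mse_std:.3f}")
```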