Introduction

Multiple sclerosis (MS) is a chronic immune-mediated disease of the central nervous system. Globally, there are more than 2.8 million people with MS (pwMS), with a higher prevalence in regions such as North America and Europe compared to Africa and East Asia1. Despite significant advancements in disease-modifying therapies over the last two decades, pwMS exhibit considerable heterogeneity in their disease progression and response to treatment2,3. While some pwMS develop severe disability despite high-efficacy disease-modifying therapies, others remain stable under similar regimens4. This variability underscores the importance of stratifying pwMS to identify distinct disease trajectories and improve treatment precision.

Disease trajectories can be defined by disease progression or processing speed decline , which have been described to be predictive of each other5,6, with processing speed potentially reflecting a unique dieasese dimension7,8. A clinically meaningful definition of disease trajectories requires a multimodal approach that integrates clinical, imaging, and molecular markers9.

An advantage of a multivariate approach is the possibility of highlighting multiple variables associated with MS, both those assessed in routine clinical assessments and beyond10. For example, various brain imaging markers such as lesion count11, lesion volume12,13, and brain age14 have been linked to disability. Brain age is an age estimate based on a comparison of a person’s brain MRI with a large dataset from a healthy population15. Furthermore, several serum markers have been explored as prognostic markers in MS, including serum neurofilament light chain (NfL)16, chitinase-3-like protein1 (CHI3L1)17,18, vitamin D19, and potentially other vitamins20,21, and HLA-DRB1 carriership19. General health risk markers such as smoking22 and high body mass index (BMI)23 also contribute to MS-related outcomes. Clinical variables such as the relapse rate24, Expanded Disability Status Scale (EDSS)25, and cognitive performance, for example measured by the Paced Auditory Serial Addition Test (PASAT)26, further enhance the understanding of disease-related impairment. Additionally, patient-reported outcome measures (PROMs)27 can offer valuable insights for a more comprehensive assessment of disease impact.

We have established a unique cohort, containing longitudinal recordings for all the above variables. While the richness of such data provides insights from many different perspectives, it also presents analytical challenges, particularly the potential for bias introduced by researchers’ subjective decisions in data (pre)processing, modelling, and interpretation. Study designs often highlight one or few predictors of MS outcomes, not accounting for multiple other predictors and their possible combinations. However, due to statistical dependencies, the selection of the variables included in such models influences the observed results, in addition to many other decision researchers make when processing and analysing data28.

To enhance reproducibility and minimize bias, robust and systematic analytical frameworks are essential28. Multiverse analyses, which explore multiple plausible analytical paths and evaluate their impact on results, offer a promising approach to address analysis bias29.

Relevant not only to MS research, variable selection remains an unsolved statistical issue30 for which multiverse analyses provide an opportunity to decrease analysis-specific bias when presenting which variables are important for disease trajectories. In our study, we combine commonly used data from clinical follow-up and experimental biomarkers of MS derived from MRI, blood samples, clinical assessments and quality of life (QoL). We stratify pwMS into groups of cognitive and functional impairment and showcase variables predicting future trajectories already at baseline.

Materials and methods

Sample

The sample included 88 pwMS with relapsing–remitting MS (RRMS) participating in the omega-3 fatty acid in MS (OFAMS) multicentre clinical trial31. The trial entailed data collection every 6 months over 2 years, followed by a single follow-up visit 10 years after the original trial concluded. A total of 85 (97%) of the original clinical trial participants were included and considered in the present analyses. In addition to age, sex, BMI, smoking status, and ω-3 supplementation (original clinical study intervention), the collected data used in this study entailed assessments of clinical (EDSS)25 and processing speed (PASAT)26, together with QoL, evaluated by the short form quality of life inventory (SF-36)32, T1- and T2-weighted MRI and blood samples (see Supplemental Table 1 for overview over variables; Supplemental Fig. 1 for variability of raw measures). All participants were treated with inferon beta-1a starting at month 6 after inclusion in the study, but were treatment naïve before that. As new medications were introduced, the participants’ treatment plans changed unsystematically at various points with ultimately 62 of 85 of the included pwMS using disease modifying treatment.

Fig. 1
figure 1

Sample stratification into functional and processing speed decline groups. Marginal interaction effects between group and age in random intercept models at the level of the individual indicate that risk groups (blue) show a faster functional and cognitive loss than functionally and cognitively stable or improving groups (red). Note that the stratifications were done independent from each other, resulting in nPDG = 19 and nPSDG = 12, with only n = 12 participants being allocated in both groups.

We also used a reference MRI sample (see Supplemental Table 2) to establish brain age models (ntraining = 58,317, nvalidation = 6,608, age span = 5–90), non-clinical repeatedly sampled data of three individuals across one year (minimum of 25 scans each, Bergen Breakfast Scanning Club [BBSC])33, and a cross-sectional MS sample (n = 748, mean age = 38.63 ± 9.46 years) to validate the brain age model in a relevant clinical cohort, matched with healthy controls (n = 751, mean age = 38.73 ± 9.61).

Demographic, clinical, self-reported, multi-omics, and neuroimaging assessments

Age, sex, BMI, and smoking status (smoker/non-smoker assessed by both serum cotinine levels and self-reports) were general variables available for all participants. Clinical and cognitive scores included EDSS25 and PASAT26, and the number of relapses 12 months prior to study start. SF-36 indicated QoL in eight categories [physical functioning, role limitations due to physical health, role limitations due to emotional problems, energy/fatigue, emotional well-being, social functioning, pain, and general health] based on the National Multiple Sclerosis Society’s scoring recommendations34,35 with scores ranging from 0–100 (worst to best outcome) except for general health (1–5). Serum analyses for the current sample16,36, and MRI acquisition and post-processing (cortical reconstruction and lesion segmentation) have been described previously22. MRI scanners and sequences can be found in Supplemental Table 3.

MRI markers of interest included baseline brain age gap (BAG) in years (see next section) from T1-weighted MRI data, as well as lesion count and volume (mm3) from T2-weighted MRI data. Serum markers included CHI3L1 (mg/ml), NfL (pg/ml), and vitamin A and E levels (in µmol/L) and vitamin D levels (in nmol/L), as well as HLA-DRB1 carriership (yes/no). Intra-class correlation coefficients were good (ICC > 0.82), apart from CHI3L1 (ICC ≈ 0), vitamin D levels (ICC = 0.56), and zero-variance variables (see Statistical Analyses). As a consequence of the instability of CHI3L1, we excluded the marker from the analyses.

Sample stratification

EDSS-based stratification

We first stratified pwMS into a progressive disability group (PDG) and the rest of the participants into a stable or improving disability group (SIDG). As previously discussed37, PDG (n = 37, 42%) was defined by progressive disability consisting of at least 1.5-point EDSS increase, when baseline EDSS was 0; 1.0-point increase when baseline EDSS was between 1.0 and 5.5; and 0.5 increase when baseline EDSS was equal or more than 6.0, with a follow-up time of 128–156, M = 141.93 ± 5.83 months.

PASAT-based stratification

In a second, independent stratification, we defined a processing speed decline group (PSDG) based on a ≥ 20% decrease of Paced Auditory Serial Addition Test score from previous timepoints, which was informed by previous literature38 either between baseline and 144-month follow-up or 24-month session and 144-month follow-up, and with the scores decreasing over time (nPSDG = 23). In contrast, a stable processing speed group (SPSG) was defined by smaller processing speed decline than in the PSDG, within expected effects of aging, or even stable or improving PASAT scores.

Brain age

We developed a machine-learning based generalized additive model (allowing polynomials up to the 4th order) trained on 58,317 healthy controls’ cortical thickness, volume and surface area averages across Desikan-Killany atlas39 parcels to predict age. Such models allow sufficient flexibility to approximate non-linear age-dependencies, while still being simple enough to be interpretable.

Bias corrections for brain age prediction models are important as brain ages are otherwise systematically overestimated for lowest ages and underestimated for highest ages, respectively. While this is a know problem, previous studies do usually only apply post-hoc corrections, which do however not remove all of the bias. Hence, we used a combined approach: The original training sample (n = 58,317) was up-sampled (to n = 150,517) using SMOGN40 to approximate a unimodal distribution to minimize training sample age-distribution bias. To account for remaining age-distribution bias, we applied a post-hoc correction using the training sample’s linear age and brain age (BA) association’s intercept \(\left(a\right)\) and slopes (\(b)\): \(Corrected BA=BA+(Age-\left(a+b*BA\right))\). The correction procedure improved model performance (Supplemental Table 4). Finally, the brain age gap (BAG) was calculated as the difference between corrected BA and chronological age: \(BAG=Corrected BA-Age\). Our model predicted lower corrected BAGs in healthy controls compared to pwMS in the utilized cross-sectional sample (t(1462) = -19.40, p < 2.2*10–16, Cohen’s d = -1.00, [-1.11,-0.89]). Moreover, we validated our models in BBSC data, showing within-subject correlations of Pearson’s rsub1 = 0.25, rsub2 = 0.23, rsub3 = 0.53 between age and brain age, yet with only the last correlation being significant (psub3 = 0.006). Note that intra-class correlation coefficients could not be computed for these N = 3 subjects due to the n-to-p ratio (small sample size, large number of observations). Additional information on the model validation procedure can be found at https://github.com/MaxKorbmacher/OFAMS_Brain_Age.

Statistical analysis

Descriptives

We first examined the validity of the stratification using linear mixed effects regression models with random intercept at the level of the individual to assess interaction effects between age and risk grouping, while controlling for age and sex. Specifically, age-group interaction effects on EDSS were used to evaluate the disability progression stratification, and age-group interaction effects on PASAT for processing speed decline stratification (Fig. 1).

To describe apparent differences at baseline, we analyzed univariate baseline differences between the respective risk vs stable groups using simple linear regression models, correcting for age and sex (Tables 1, 2); and for differences in frequencies using Fisher’s exact tests (Supplemental Tables 5, 6).

Table 1 Baseline differences between disability progression and stable or improving disability groups. Continuous models were adjusted for age and sex, except for predictions of age, where only sex was used as covariate.
Table 2 Baseline differences between processing speed decline and cognitively stable or improving groups. Continuous models were adjusted for age and sex, except for predictions of age, where only sex was used as covariate.

Multiverse analyses

We used a multiverse approach to identify which baseline variables were most predictive of risk-membership using binary logistic regression (Fig. 2). Multiverse analyses have the aim to test the robustness of effects using different analytical choices. Multiple models are run, leading to distributions of coefficients, instead of a single estimate. Hence, instead of presenting single coefficients, we present the central tendency of coefficents (median) and variability (median absolute deviation).

Fig. 2
figure 2

Multiverse estimates of predictors of disability progression. Note that disease modifying treatment at 12-year follow-up presented a strong effect of mOR = 7.4 ± 4.07 and was for readability not included.

Variables were excluded when the variance inflation factor (VIF) indicated an unacceptably high collinearity of VIF ≥ 5 in a fully saturated model. These variables included SF-36 based general health and social functioning.

We included models with all possible combinations of variables of interest up to the 8th order with age and a binary variable indicating whether disease modifying treatment was used at the 12-year follow-up as a covariates included in all models. All models were run with and without sex as a covariates, leading to a total of 213,522 models. The QoL "role limitations due to emotional problems" and "role limitations due to physical problems" were constant (mean = 0 ± 0) across participants and therefore excluded from the analyses.

McFadden’s41 pseudo R2 (pR2), the area under the curve (AUC) and Brier scores were estimated for each regression model as a model fit proxy. Wald-based power/sensitivity analysis42 was used to estimate the power of each model at α = 0.05 by testing the assumption of a greater probability of disability progression or processing speed decline group-membership when predictors ≠ 0. For the EDSS-based PDG, model fit was excellent (pR2 = 28.59% ± 8.00%; note that pseudo variance explained is generally lower than conventional variance explained41) when considering the 213,522 converged models, with 99.59% of the models having pR2 > 10%. The median power for these models was 20.82 ± 30.89%. For the PASAT-based CDG the median model fit was excellent: pR2 = 40.00% ± 12.00% (100.00% > pR2 > 7.94%) considering the same converged models, presenting a medium power of nearly 0%. Considering the low median power, in 100 iterations producing each 10,000 datapoints, we simulated N = 1,000,000 datasets to allow for well-powered tests. For the procedure, we used classification and regression trees with the synthpop R package43. Note, that the data simulations entail the assumptions that the true population parameters can be found in the presented original data. These data were then used as a quality control set, running first a fully saturated model, then a model excluding SF-36 items, a model excluding MRI derivatives, and then a model only including significant variables or variables close to a median p-value = 0.05 identified during the analysis of non-synthetic data, to assess whether potential redundancies of variable domains, comparing the models based on AIC, BIC, Brier score, area under the curve (AUC), and likelihood ratio tests.

Multiverse analyses ought to inform about all models, whether well specified or not. This is useful when assessing the literature while assuming analytic flexibility and the possibility of mis-specified models, both of which cannot be excluded. Hence, we reported on all models in the main text but reported only models selected for high power and/or excellent model fit in the supplement. We report raw/uncorrected p-values and 95% Confidence Intervals (CI) in square brackets when describing group differences for the stratifications. For multiverse analyses, median and median absolute deviation (MAD) are reported for the estimated ORs and p-values of each variable of interest across the specified models in the multiverse to a) belong to PDG vs SIDG or b) PSDG vs SPSG. Additionally, the proportion of the OR’s directionality (PORSD), such as the number of models including EDSS, with EDSS OR > 1 for risk group membership, as share of all models was reported. When effects presented consistent directionality in 75% or more of the models they were used in, they were presented in the main text, in addition to clinically meaningful effect sizes. Clinically meaningful was defined here as plausible and possible changes based on the assessed scales and distributions of values and their changes, as well as the magnitude of the effect. For example, a 0.001% increase in the odds of belonging to one of the defined risk-groups per year of age at baseline might be a significant effect but accumulates to less than 1% higher odds even when being older than 100 years of age at baseline. Hence, in this example, the effect of age would not be counted as clinically meaningful. Finally, due to the relatively limited sample size and yet limited missingness, to avoid biasing the results, we did not conduct imputations of missing values. For baseline missingness of all utilized variables see Supplemental Fig. 2, being relevant for the main analyses. EDSS and PASAT score missingness was relevant for stratification (see Supplemental Tables 9, 10).

Results

Study participants

The 85 pwMS with a 12-year follow-up were aged 38.9 ± 8.3 (range: 19–58) years at baseline and 49.6 ± 8.6 at the final follow-up, with 65.9% being females. Median baseline EDSS ± MAD was 2.00 ± 0.74, baseline PASAT was 47.9 ± 9.4, and time since diagnosis was 1.9 ± 3.2 years, and since the first symptom 5.5 ± 5.5 years. Missingness was minimal (n ≤ 6) for the baseline measures, which informed the main analyses (Supplemental Fig. 2). Additional demographics and an overview of all variables can be found in the Supplemental Material. As an overview of the overlap between the different stratifications done in the following sections, note that the overlap between groups was small (n = 12). 38 pwMS had favourable trajectories based on both stratifications, and 39 based on processing speed but not disability progression. 15 pwMS were identified as having a progressive disability but no cognitive decline, and 12 pwMS had unfavourable trajectories based on both stratifications.

Disability progression grouping

We show that, PDG presented a stronger increase in EDSS than SIDG (Fig. 1), shown by a significant interaction effect between group membership and age on EDSS (p = 7.12*10–10), indicating a 0.10 ± 0.02 unit faster annual increase in EDSS in the PDG group.

At the same time, the groups were overall similar at baseline, with slightly better baseline disability, lower vitamin A levels, and higher age among the progressive disability group members (Table 1). There were no other significant differences (p > 0.05, Table 1), including differences in group sizes for sex, smoking, supplement/clinical trial treatment, risk genotype, or receiving disease modifying treatment at 12-year follow-up (Supplemental Table 5).

Multiverse analysis of disability progression

Models predicting a favourable/stable disability level presented a median area under the curve (mAUC) of 0.83 ± 0.04, a median Brier score = 0.16 ± 0.02, and model fit of median pR2 = 0.29 ± 0.08, overall indicating good model perfermonce. Median effects which were on average significant (median p < 0.05) were of relatively large size as well as unaffected by modelling choice in terms of their directionality (Fig. 2, Supplemental Table 7). Varibles significantly predicting stable disability included receiving disease modifying treatment at 12-year follow-up (mORPDG = 7.44 ± 4.07, pmedian = 0.013, PORSD = 100%), lower baseline EDSS (mORPDG = 0.25 ± 0.11, pmedian = 0.013, PORSD = 100%), and counter-intuitively higher age (mORPDG = 1.12 ± 0.04, pmedian = 0.020, PORSD = 100%), and lower vitamin A (mORPDG = 0.10 ± 0.05, pmedian = 0.016, PORSD = 99.7%) and D levels (mORPDG = 0.95 ± 0.02, pmedian = 0.025, PORSD = 100%).

Processing speed decline grouping

The PSDG presented a stronger decrease in PASAT than SPSG (Fig. 1), shown by a significant (p = 0.0001) interaction effect between group membership and age on PASAT, indicating a 0.52-unit faster annual decrease in PASAT score in PSDG compared to SPSG. Moreover, at baseline, PSDG had higher age, lesion volume and PASAT score (Table 2). There were no other significant differences in continuous variables (p > 0.05, Table 2), including differences in group sizes for sex, smoking, treatment, and risk genotype (Supplemental Table 6). However, the number of pwMS who received disease modifying treatment at the 12-year follow-up was lower among the PSDG (p = 0.002; Supplemental Table 6).

Processing speed decline multiverse analysis

Models predicting future processing speed decline presented strong predictive accuracy mAUC = 0.89 ± 0.05, Brier score = 0.10 ± 0.03, and excellent fit: median pR2 = 0.40 ± 0.12, but only receiving disease modifying treatment at follow-up (mORPSDG = 0.10 ± 0.08, pmedian = 0.013, PORSD = 100%) and higher baseline PASAT score (mORPSDG = 0.86 ± 0.03, pmedian = 0.005, PORSD = 99.72%) were significant predictors of future processing speed decline (Fig. 3, Supplemental Table 8). Smoking status had median pmedian = 0.054 (mORPSDG = 7.00 ± 4.63, PORSD = 98.80%).

Fig. 3
figure 3

Multiverse estimates of predictors of processing speed decline.

Simulation-based results

As a robustness check, we included analyses of N = 1,000,000 synthetic datasets, comparing fully specified models with models missing blocks of variables of MRI derivatives and SF-36 indices, and ultimately only including significant varibles for disability progression classifications and near-significant values for processing speed decline. For predictions of disability progression, the fully saturated models performed best (Table 3). Removing variable QoL/SF-36 and MRI domains or all but significant variables decreased the model performance. Differently, when predicting processing speed decline, removing QoL variables did only marginally change model performance. Removing MRI variables and removing all variables except for the ultimate disease modifying treatment and baseline PASAT score presented the largest effect on model evaluation metrics.

Table 3 Model evaluation metrics from selected models using simulated data.

Assessing the predictors in the fully saturated models reflected the predictor structure presented in the multiverse analysis. Predictors which were on average close to the common alpha-level of 0.05, as identified in the original data (disability progression: age, EDSS, disease modifying treatment, social functioning, vitality, vitamin D levels; processing speed decline: disease modifying treatment, PASAT), were also significant (p < 0.001) and of the same direction in the fully saturated models based on the simulated data, with the exception of disease modifying treatment at follow-up indicating to increase the odds of favourable processing speed development, which contrasts the multiverse analysis, but is in line with the disability progression findings. The odds for stable disability group membership were 19% higher when receiving disease modifying treatmet at the 12-year follow-up (OR = 1.19, 95% CI[1.17; 1.21]), by being older with 8% for every year (OR = 1.08, [1.08; 1.08]). These odds were decreased with 53% for every 1-unit EDSS increase (OR = 0.47, [0.46; 0.47]), by 17% for every umol/l increase in vitamin A (OR = 0.83, [0.81; 0.84]) and by 1% for every nmol/l increase in vitamin D serum concentrations (OR = 0.99, [0.99; 0.99]). The odds for stability in processing speed were increased by only 4% when receiving disease modifying treatmet at the 12-year follow-up (OR = 1.04, [1.02; 1.25]), and decreased by 0.7% for every 1-point increase in baseline PASAT scores (OR = 1.04, [1.02; 1.25]).

Discussion

This study provides a comprehensive analysis of baseline variables associated with the risk and outcome of MS to assess their combined predictive ability for stratification of progressive disability (EDSS) and processing speed decline (PASAT). Our results are presented in the light of multiple possible analytical pathways, highlighting the influence of the selection of variables on the predictive value of each variable of interest in thousands of specified analytical paths.

Our analyses suggest a positive impact of transitioning towards disease modifying therapy, being the result of new drug discovery, for favourable disability progression. Higher baseline EDSS levels were however predictive of decreased odds of favourable outcomes. Counterintuitively, a higher baseline age as well as higher vitamin A and D serum concentrations indicated decreased odds for favourable disability progression. Finally, transitioning towards disease modifying therapy decreased the odds of favourable processing speed development, as well as higher baseline PASAT scores.

Our analyses are based on the identification of favourable compared to unfavourable disease trajectories. Such trajectories can be evaluated in different ways, often consulting EDSS scores which focus on physical function. However, previous evidence showcases that pwMS with functional disability progression do not necessarily correspond with those with comparably accelerated processing speed decline44. The small overlap between the identified risk groups in our study confirmed this assumption. Moreover, quality controls ensured that the presented stratifications were robust to temporal fluctuations such as new relapses, new lesions or lesion activity, and we found clear differences in the trajectories when examining the variables which were stratified for.

Multiple studies have examined single or a small selection of MS markers17,18,19,20,21,25,27. Yet only few studies have focussed on multivariate assessments and particularly on longitudinal data45,46,47,48. Multimodal approaches have the advantage of enhancing predictions, for example as previously shown by combining electronic health records, clinical notes and neuroimaging data to predict MS severity49, conversion from clinically isolated syndrome to MS50, or, similar to this study, to predict EDSS development51. However, just as conventional statistical approaches, machine and deep learning techniques are influenced by the input variable selection and hyperparameter tuning suffers from generally small samples in the MS field.

The utilized input variables in our multivariate approaches have previously, often univariately, been associated with different MS outcomes. For example, higher EDSS scores25, HLA-DRB1 carriership19, lower vitamin levels19,20,21, higher lesion count11, cigarette smoking22, a lower PASAT test score26, several PROMs27, and a higher number of baseline relapses24 have previously been associated with progressive disability in MS, similar to elevated NfL levels16,17,18. CHI3L1 serum measures on the other hand were unstable in our sample (ICC≈0, Supplemental Fig. 1). Hence, a meaningful interpretation of findings of higher CHI3L1 levels at baseline predicting less disability-progression at the 12-year follow-up not involving large uncertainty was not possible and we excluded the variable from our analyses. Cerebrospinal fluid-derived CHI3L117,18 might be a more reliable measure. Additionally, our sample was relatively small and difficult to compare to more recent studies as patients were not treated during the first 6 months of the clinical trial. Hence, before drawing further inference, the influence of the characteristics of the examined limited sample and individual differences in disease trajectories need to be ruled out first by replicating of the findings reported here. Moreover, the statistical significance of the mentioned effects might depend on the selection of covariates. For example, the negative effect of cigarette smoking on health outcomes is generally known and has previously been documented in MS22. However, to contribute to an acceleration of the disease or, for instance, functional or processing speed decline , other cumulative effects might need to be considered.

The efficacy of disease modifying treatment has been previously evidenced, with more recent therapies providing more efficient disease management via reduced relapse rate and disease progression52. However, effects on cognitive impairment are rather unknown53. Our findings are in line with these meta-analytic findings, where disease modifying treatments reduce increase the odds for a favourable disability progression. Their effects on cognitive impairment, measured here by processing speed, remain somewhat unclear, with our findings rather suggesting a negative effect of disease modifying treatments. While potential negative cognitive effects of switching medications as a response to new pharmacological developments, side effects, lack of effectiveness or necessary adjustments to the treatment plan cannot be excluded, the observed positive effects for the disability progression somewhat oppose this finding. Additional, due to the mentioned changes in treatment options and resulting policies, there is large variablility in treatment plans and administered drugs across the 12-year study.

We found a contraindicative effect of higher vitamin levels decreasing the odds of a positive disability trajectory. While this might be the case, there are multiple confounders relevant to this effect vitamin levels, such as immune function and nutritional health. Moreover, the examined vitamin A and D levels were within the recommended/reference range and far above what can be classified as deficiency54,55.

At the same time, as vitamins A and D present a broad influence on the immune system, baseline vitamin levels might indirectly reflect immune-related processes, where for example a lower serum level might indicate a larger metabolic need and hence an activated immune system56,57. While vitamin D deficiency has been linked with a higher risk of developing MS19,36, for example, vitamin D supplementation does not seem to significantly affect clinical outcomes in pwMS58. Such associations are not established for other vitamin serum concentrations. Overall, the observed effects need further validation and replication in independent samples before being able to reach conclusion about their validity and meaning.

The main strengths of the study were the unique dataset, with near complete follow-up (95%) for over 12 years, in addition to a large brain age training set of more than 60,000 healthy subjects’ MRI data, and multiple validation sets, including hundreds of cross-sectionally scanned individuals. Combining these data allowed for more comprehensive analyses than previously conducted. Moreover, our analyses represent a good example for the potential of data-reusage. We were able to apply a new multiverse approach to the data, showcasing the influence of analysis pipelines on research outcomes.

Our study faces some limitations. First, several data-related issues limit the generalizability of the results. The examined historic data from a relatively homogenous Norwegian RRMS sample might bear little value for generalizations to other progressive MS or cohorts outside Northern Europe. At the same time, there was considerable variability of scanners and MRI protocols, which can influence the computation of metrics. This constitutes a major limitation. We decided against harmonisation procedures to avoid confounding effects, which are likely enhanced when the data size is limited59.

The size of the risk groups was relatively small, and treatment adjusted to the individual, including medication changes over time reduce the comparability of pwMS exacerbating the observation of true population estimates. Moreover, as common with automatic segmentation tools of MRI data, the segmentations were imperfect. MRI data processing challenges beyond the data processing software are scanner hardware and software changes over the acquisition period which could not be controlled for in the present national multi-centre study.

Machine learning implementations such as the utilized brain age model can “learn” and account for variability from unwanted covariates. Hence, the brain age predictions, which were trained on diverse, multi-site, software and field strength data, were meaningfully related to the examined developmental trajectories. The utilized brain age model is freely available together with a detailed user guide to enhance dissemination (https://github.com/MaxKorbmacher/OFAMS_Brain_Age).

Model misspecifications might have resulted from the thousands of probed modeling choices, including, for instance, not accounting for multicollinearity and non-linear relationships. On the other hand, the overall model performance was relatively strong and the directionality of significant effects consistent across analytic paths. Additionally, perfect model specifications were not the goal. Instead, we aimed to account for researcher bias, such as executing multiple but only reporting the significant analyses, while also testing multiple variables of interest in the same sample. This approach allows to closer approximate the reality of the literature than only models showing perfect fit.

We analysed a broad range of established and experimental MS markers. However, the selection of variables of interest in this study is not exhaustive and there might be more recently established markers which are more suitable for risk group stratifications, such as novel risk loci as DNM3–PIGC or DYSF–ZNF63860, giving additional inside into the heritability of MS.

The utilized brain age model focusses on cortical grey matter. While such approach improves the interpretability of the model, other approaches including subcortical grey matter15 or information on white matter microstructure61 might be more sensitive to MS-specific degeneration.

Overall, we show that pwMS can be stratified into disability progression and processing speed decline risk groups. The average predictive performance of the models was good. Predictors for such group stratifications, are dependent on analytic choices. Disability progression-relevant variables were the type of treatment at the 12-year follow-up (more recent disease modifying), EDSS, age, and potentially vitamin A and D levels, with further validations needed for these effects. Variables for the prediction of cognitive decline, measured by PASAT, were non-significant. Our findings offer not only additional insight into MS disease trajectories, but also a new perspective on analysing multiple MS markers. We encourage the MS research community to adopt the proposed multiverse approach to mitigate the influence of analysis pipelines on research outcomes and underline the potential for multivariate analyses. Additionally, we underscore the importance of incorporating disease stratifications, particularly using longitudinal data, as critical steps for advancing future studies.