Introduction

Research on the course and prognosis of cognitive function following traumatic brain injury (TBI), which refers to structural and/or physiological disruption of brain function due to an external force1, is rapidly developing but remains characterised by extensive inter-patient and intra-patient heterogeneity. Adverse outcomes are attributed to greater injury severity and advanced age, highlighting the challenges in care and poor prognosis for these patients2,3. Recently, accumulating evidence has shifted the view of mild TBI from that of a singular event towards a condition with long-term consequences for cognition4,5. Furthermore, it has been proposed that TBI should not be seen as a random event and accident, but rather as a health status transition on a continuum, where the magnitude of social adversities and comorbidities increases the risk of injury severity, adverse course, and expedited2,6,7,8,9 cognitive decline, across injury severities10,11. With this in mind, recent research has increasingly posed social equity hypotheses and investigated the role of social parameters in recovery course and long-term outcomes. However, the challenge remains that social parameters are intertwined with injury-related and environmental risks, age-related parameters, and broader social adversities, forming a vastly complex network of associations11,12. Capturing and disentangling these associations on a time continuum presents a substantial challenge for traditional hypothesis-driven research, which typically assumes linear relationships and examines a limited number of variables13,14. As a result, important sources of variability in TBI prognosis remain insufficiently characterised in scientific research, creating a knowledge gap in research, policy, and practice14.

Emerging advances in computational and machine learning (ML) approaches offer a timely opportunity to address this gap by enabling more systematic evaluation of variable importance and of the complex interplay of social and demographic parameters in shaping cognitive trajectories after TBI, increasing confidence in the prognosis of the cognitive course after TBI. Here, we present a new analysis of data from 30 published longitudinal cohort studies characterising change in cognitive test performance, with 2,364 participants with TBI (72% male, mean age 32 years; 55% mild, 45% moderate/severe TBI)15. We leveraged advancements in computational and ML approaches with three aims: (1) to develop and validate predictive models of longitudinal cognitive change after TBI using harmonised social, demographic, and injury-related data from published cohort studies; (2) to compare the predictive performance of multiple ML approaches (random forest, gradient boosting, and extreme gradient boosting) in modelling standardised rates of cognitive change following TBI, stratified by injury severity; and (3) to identify and interpret the relative importance of social and demographic parameters in predicting cognitive trajectories after TBI utilising the PROGRESS-Plus framework16. PROGRESS stands for the social parameters Place of residence, Race/ethnicity/culture/language, Occupation, Gender/sex, Religion, Education, Socioeconomic status, and Social capital, and Plus stands for contextual characteristics that may influence outcomes (e.g., age, injury severity). Through this approach, we sought to move beyond traditional injury-related predictors and elucidate how the social profiles of study samples shape the longitudinal cognitive course after TBI.
This approach to characterising heterogeneity in cognitive outcomes will equip researchers and policymakers with actionable evidence for understanding and interpreting that heterogeneity, and for informing more equitable, data-driven prediction.

Results

We included 30 published studies17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46 comprising 43 cohorts (i.e., Mandleberg et al. provided eight cohorts31, Dikmen et al. provided three cohorts20, and Kontos et al.25, Macciochi et al.29, and Covassin et al.19 provided two cohorts each), which together provided 218 cognitive outcome data points collected at two or more time points after injury. The data set comprised 2,364 adults (72% male) with TBI, of which 55% of data points corresponded to participants with mild TBI, with the remainder representing moderate (4.6%), moderate-severe (24%), and severe TBI (16%). Sample sizes of cohorts varied, ranging from 1027 to 40430 overall. For mild TBI, sample size ranged from 1229 to 15820, and for moderate-severe TBI it ranged from 1027 to 40430.

To mitigate the imbalance across injury severity categories in our data while preserving clinically meaningful distinctions, moderate, moderate-severe, and severe injuries were combined into a single ‘moderate-severe’ category, and models were trained separately for mild and moderate-severe TBI.
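As an illustration, this recoding and stratification step can be sketched as follows; the column names, labels, and toy values are hypothetical stand-ins for the harmonised dataset, not the study data.

```python
import pandas as pd

# Hypothetical sketch: collapse moderate, moderate-severe, and severe injuries
# into a single 'moderate-severe' stratum, then split the data so that models
# can be trained separately per stratum.
severity_map = {
    "mild": "mild",
    "moderate": "moderate-severe",
    "moderate-severe": "moderate-severe",
    "severe": "moderate-severe",
}

df = pd.DataFrame({
    "severity": ["mild", "moderate", "severe", "moderate-severe"],
    "rate_of_change": [0.25, 0.10, 0.08, 0.12],  # toy outcome values
})
df["stratum"] = df["severity"].map(severity_map)

# One training set per stratum:
strata = {name: group for name, group in df.groupby("stratum")}
```

Keeping the mapping explicit (rather than recoding in place) preserves the original severity labels for sensitivity checks.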

Data extraction process

After combining moderate, moderate-severe, and severe injuries into the single moderate-severe category, the final dataset comprised 121 data points for mild and 97 data points for moderate-severe TBI (Supplement Table 1).

Table 1 PROGRESS-Plus and other variables of the dataset.

Published data spanned five decades, with the majority (87%) published after the year 2000. Data points represented nine countries, all high- or middle-income, across global regions. Most cohorts originated from the USA (53%), followed by New Zealand and the UK (9.6% each), China (6.9%), Norway (5.0%), South Korea (4.6%), Brazil (4.1%), and Canada and Australia (3.7% each) (Table 1). To strengthen the use of country of study origin as a proxy for place of residence, we modelled country of English language dominance instead of country of study origin, categorising it under place of residence. Of the nine countries represented by data points, five were classified as English language dominant (i.e., Australia, Canada, New Zealand, UK, USA). Reporting of study sample characteristics varied across cohorts, regardless of country of study origin. Gender/sex composition of study samples was reported in all cohorts, and after harmonisation the proportion of males in samples ranged from 0.4443 to 117,25 (Table 1).

Education was reported in a variety of formats across cohorts. After harmonisation to years of formal education, mean education levels ranged from 7.118,38 to 14.743 years across cohorts; 1126,27 to 14.743 years for mild, and 7.118,38 to 13.1721 years for moderate-severe TBI cohorts. Education SDs (within study cohorts) ranged from 1.1136 to 418,38 years overall; 1.529 to 418,38 years for mild, and 1.1136 to 3.340 years for moderate-severe TBI cohorts.

Other PROGRESS-Plus variables (i.e., race/ethnicity, occupation, social capital, and socioeconomic status) were represented by a limited number of data points and were not used in data analysis (Supplement Table 1).

Of the Plus parameters, age and age SD were available for all data points. The pooled mean age across studies ranged from 18.817 to 43.6841 years overall; 18.817 to 43.6841 years for mild TBI and 25.237 to 4134 years for moderate-severe TBI cohorts.

The nature of our research questions raises the issue of time zero bias. In longitudinal studies, testing starts at a defined point, called time zero47,48. The designated time zero (i.e., baseline or first assessment) varied, ranging from ~ 12 h17 to ~ 10 years post-injury23. The last follow-up was defined as the final assessment time point at which cognitive test scores were reported within each cohort. Last follow-up assessments ranged from five days19 to ~ 20 years23 post-TBI (Supplement Table 1), with the mean number of days from injury to baseline assessment being 11.2 for mild and 348 for moderate-severe TBI cohorts. Time from baseline to last follow-up assessment, referred to as the time interval, ranged from four19 to 485123 days, with a mean of 446 ± 824 days.

To express the rate of change per month as the outcome, we converted time-related variables to the common metric of months by dividing by 30.4, the average number of days in a month (Supplement Table 1).
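In code, the conversion amounts to the following; the rate-of-change function is a hypothetical sketch of the per-month standardisation described in the Methods, not the study's exact formula.

```python
DAYS_PER_MONTH = 30.4  # average number of days in a month, as used in the synthesis

def days_to_months(days: float) -> float:
    """Convert a time-related variable from days to months."""
    return days / DAYS_PER_MONTH

def rate_of_change_per_month(score_baseline: float, score_followup: float,
                             interval_days: float) -> float:
    """Hypothetical sketch: standardised score change per month of follow-up."""
    return (score_followup - score_baseline) / days_to_months(interval_days)

# e.g. the mean time interval of 446 days reported above is about 14.7 months
interval_months = days_to_months(446)
```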

The distribution of data points across cognitive test domains, overall and by injury severity, was as follows, from highest to lowest: executive function (118, 55% mild), learning and memory (105, 49% mild), perceptual-motor (87, 64% mild), information processing speed (85, 72% mild), complex attention (84, 60% mild), language (27, 41% mild), and social cognition (6, 100% mild). We refer the reader to Supplement Table 1 for data used in this synthesis.

The cognitive outcome data for ML was a continuous variable, standardised as the rate of change per month across domains of cognition (see Methods section). The rate of change was also calculated for domains of cognition with sufficient data: executive function, and learning and memory (Fig. 1).

Fig. 1

Study design and analytical workflow. Schematic representation of the analytical pipeline, illustrating the flow from data acquisition to model interpretation. The seven steps include data collection, extraction, harmonisation, preprocessing, machine-learning model development, model evaluation, and sensitivity analyses, with intermediate and final results displayed at each stage to support overall and model-specific explanations.

The outcome data points were collected at baseline and last follow-up assessments, and formed the foundation for the predictive modelling of cognitive trajectories presented in this study (Supplement Table 1).

From baseline to last follow-up assessment, the median standardised rate of change per month was 0.233 for mild and 0.112 for moderate-severe TBI (Table 1). We refer the reader to Supplement Fig. 1a-i for the standardised rate of change for mild and moderate-severe TBI for each of the PROGRESS-Plus parameters with available data: gender/sex, age, and education, overall and by domains of cognition (i.e., learning and memory, executive function).

Data preprocessing

We created heatmaps with correlation coefficient matrices for mild and moderate-severe TBI to identify associations among PROGRESS-Plus variables (Fig. 2 and Supplement Fig. 2 for country-specific associations). We treated all PROGRESS-Plus variables with complete data coverage as primary predictors and included them in the analyses irrespective of the statistical significance of their association with the rate of change. Although country of study origin was available for each cohort, to use it more informatively (i.e., accounting for historical change within countries) we used the Gender Inequality Index (GII) and Human Development Index (HDI) as contextual measures and effect modifiers in the modelling process. These country-level structural indicators contributed to measurable heterogeneity across mild and moderate-severe TBI cohorts in several key predictors (Fig. 2). Assigned HDI and GII values ranged from 0.68926 to 0.92825,33,35,43 and from 0.00942 to 0.26939, respectively. Temporal trends indicated gradual change in values among cohorts coming from the same country.
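The correlation matrix underlying a heatmap of this kind can be computed as below; the variables and their distributions are synthetic stand-ins for the harmonised predictors, and the outcome association is simulated for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 121  # roughly the number of mild-TBI data points

# Synthetic stand-ins for harmonised PROGRESS-Plus and contextual variables.
df = pd.DataFrame({
    "age": rng.normal(32, 8, n),
    "education_years": rng.normal(12, 2, n),
    "prop_male": rng.uniform(0.44, 1.0, n),
    "gii": rng.uniform(0.009, 0.269, n),
})
# Toy outcome with a built-in negative age association, for illustration only.
df["rate_of_change"] = 0.3 - 0.005 * df["age"] + rng.normal(0, 0.05, n)

corr = df.corr()  # Pearson correlation coefficient matrix behind the heatmap
```

A heatmap such as Fig. 2 can then be drawn from `corr` with any plotting library, mapping positive coefficients to red and negative coefficients to blue.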

Fig. 2

Heatmaps of the correlation coefficient matrix, by injury severity (a, mild TBI; b, moderate-severe TBI). Heatmaps of the correlation coefficient matrix for each of the parameters’ association with the outcome, rate of change (bivariate associations). Red signifies a positive correlation, while blue represents a negative correlation. The intensity of the color reflects the magnitude of the correlation coefficient, with more vibrant shades indicating stronger correlations. Specifically, shades tending towards red represent coefficients approaching 1, while those leaning towards blue represent coefficients approaching − 1.

Along with PROGRESS-Plus parameters, we also evaluated whether time-related variables (i.e., time from injury to baseline assessment and time between baseline and last follow-up assessment) were associated with the rate of change across injury severity cohorts (Fig. 3 and Supplement Fig. 3). Time from injury to baseline assessment and time between baseline assessment and last follow-up varied (Table 2) and were negatively and positively associated with the rate of change in mild and moderate-severe TBI, respectively. We retained these variables as covariates in the models.

Fig. 3

Feature contributions across machine-learning models (a, mild TBI; b, moderate-severe TBI). Ranking of feature contributions for prediction of the rate of change in cognitive outcomes in Gradient Boosting, Random Forest, and XGBoost models. The colour scale reflects the strength of positive associations, with darker red indicating the strongest positive correlations and darker blue indicating the weakest positive correlations. Intermediate shades represent gradations in effect size, with colour intensity corresponding to the magnitude of the correlation coefficient. 

Table 2 Performance of ML models for predicting rate of change in mild and moderate-severe TBI.

Machine learning modelling

Results of the three supervised ML algorithms, Random Forest (RF), Gradient Boosting (GB), and extreme gradient boosting (XGBoost), covering both the overall cognitive rate of change and domain-specific outcomes, are shown in Figs. 3 and 4 for mild and moderate-severe TBI.

Fig. 4

SHAP summary of feature impact on predictive models’ output (a-c mild TBI; d-f moderate-severe TBI). SHAP summary plots illustrating the influence of input variables on model predictions for Gradient Boosting (a, d), Random Forest (b, e), and XGBoost (c, f) models. The colour of each point indicates the value of the feature, with red corresponding to higher values and blue to lower values. The position along the x-axis reflects whether the feature increases or decreases the prediction in the model output. Features are ranked on the y-axis based on their overall importance, with those contributing most to the model’s predictions appearing at the top.

Feature importance heatmaps for each of the models are presented in Fig. 3a and b, for mild and moderate-severe TBI, respectively. For mild TBI, across the three models, the top PROGRESS-Plus feature was age, followed by education SD and the contextual parameter of GII. For moderate-severe TBI cohorts, the top PROGRESS-Plus feature was age SD, followed by time-related variables.

Model evaluation and interpretation

Model performance metrics are summarised in Table 2. Across all algorithms, predictive accuracy was comparable, with XGBoost slightly better in mean absolute error (MAE), and RF slightly better in root mean squared error (RMSE) in mild and moderate-severe TBI cohorts (Table 2).

The pattern of tuned hyperparameters (Table 3) across algorithms suggested different levels of structure in the data for both mild and moderate-severe TBI. The RF models for both injury severity strata selected very shallow trees (i.e., maximum depth = 2), with the complexity of predictive signals explained by low-order interactions. The GB model favoured deeper individual trees (i.e., maximum depth of 9 and 6 for mild and moderate-severe TBI, respectively), where each successive tree focused on residual errors, allowing more complex interactions within a single tree compared to RF. The mild TBI model used a very low learning rate (i.e., 0.01) with deeper trees and more iterations, while the moderate-severe model used a higher learning rate (i.e., 0.1) with fewer trees and shallower depths. The XGBoost models, across both mild and moderate-severe TBI strata, selected deeper trees, relying on additional regularisation to tolerate the greater depth without overfitting. The learning rate remained small, and overall complexity was determined by the interplay between the learning rate and the number of boosting rounds (Table 3). To support interpretation beyond hyperparameters, we examined variable influence and marginal effects (e.g., partial dependence and Shapley Additive Explanations) to understand how each algorithm captured non-linear relationships and interactions in the data (Fig. 4).
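A tuning step of this kind can be sketched with scikit-learn's GridSearchCV; the synthetic data, the grid, and the scoring choice here are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(121, 6))                                 # ~121 mild-TBI data points
y = 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.1, 121)   # toy outcome

# Illustrative grid; the depth candidates mirror the values discussed above
# (shallow RF trees at depth 2, deeper boosted trees at depth 6 or 9).
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 6, 9], "n_estimators": [100, 300]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)
best = search.best_params_  # tuned depth and number of trees
```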

Table 3 Hyperparameter evaluation for ML models.

For prediction, the Shapley Additive Explanations (SHAP) ranking showed that, across models, time interval, GII, education (picked by RF), and age emerged as the strongest predictors of the standardised rate of change for mild TBI. For moderate-severe TBI cohorts, time from injury to baseline assessment, time interval, age, and GII (picked by RF) emerged as the strongest predictors of the standardised rate of change across models (Fig. 4).
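The Shapley logic behind these rankings can be illustrated exactly on a toy linear model, marginalising absent features to their dataset mean (the quantity SHAP's tree explainers approximate for tree ensembles); the features and coefficients below are hypothetical.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # toy stand-ins: e.g. age, education SD, time interval
coef = np.array([0.5, -0.3, 0.2])

def model(z):
    """Stand-in for a fitted regressor (linear, so Shapley values are exact)."""
    return z @ coef

def shapley_values(model, X, x):
    """Exact Shapley values for one prediction, replacing features outside
    the coalition with their dataset mean."""
    n = len(x)
    base = X.mean(axis=0)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                z = base.copy()
                z[list(S)] = x[list(S)]
                v_without = model(z)   # coalition S only
                z[i] = x[i]
                v_with = model(z)      # coalition S plus feature i
                phi[i] += w * (v_with - v_without)
    return phi

x = X[0]
phi = shapley_values(model, X, x)
```

For a linear model each value reduces to coef_i * (x_i - mean_i), and the values sum to the difference between the prediction and the average prediction, which is the additivity property that SHAP summary plots rely on.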

Internal validation and sensitivity analyses

To ensure the robustness of the results, we ran several sensitivity analyses. First, we performed 5-fold cross-validation to verify generalisability across folds (Table 4). For both injury severities, XGBoost showed slightly superior performance for the standardised rate of change across data points. The effects of all primary predictors, covariates, and effect modifiers remained unchanged (Supplement Figs. 4 and 5).
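The cross-validation step can be sketched as follows; the synthetic data and the GB regressor stand in for the tuned models, under the assumption that MAE and RMSE are scored per fold as in Table 4.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(97, 5))                   # ~97 moderate-severe data points
y = 0.15 * X[:, 0] + rng.normal(0, 0.1, 97)    # toy outcome

cv = KFold(n_splits=5, shuffle=True, random_state=1)
mae = -cross_val_score(GradientBoostingRegressor(random_state=1), X, y,
                       scoring="neg_mean_absolute_error", cv=cv)
rmse = -cross_val_score(GradientBoostingRegressor(random_state=1), X, y,
                        scoring="neg_root_mean_squared_error", cv=cv)
# mae and rmse each hold one value per fold; their means summarise generalisability
```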

Table 4 Performance of 5-fold ML models for predicting rate of change in mild and moderate-severe TBI.

Second, we re-ran analyses using a subset of data on the executive function and learning and memory domains of cognition, which had relatively balanced data between mild and moderate-severe TBI cohorts; this analysis confirmed our hypothesis (i.e., that PROGRESS-Plus parameters would exert a more pronounced effect on domain-specific cognitive outcomes than on overall cognition, more strongly for mild than for moderate-severe TBI). The values of the relative contributions of PROGRESS-Plus and time variables varied across models but remained largely the same as in the main analysis (Supplement Figs. 6 and 10 for executive function and learning and memory, respectively). The top three features for prediction of the rate of change in the executive function domain in mild TBI cohorts identified by the SHAP analyses were age, time interval, and education SD. These results were similar to the main analysis, with the exception of GII, whose predictive value was reduced compared with the main analysis. In moderate-severe TBI cohorts, the top three predictive features remained time interval, time from injury to baseline assessment, and age SD (Supplement Fig. 7).

The top three consistent features for prediction of the rate of change in the learning and memory domain in mild TBI cohorts identified by the SHAP analyses were education SD, time interval, and GII. These results were similar to the main analysis, with the exception of age, whose predictive value was reduced compared with the main analysis. In moderate-severe TBI cohorts, the top three predictive features remained time interval, time from injury to baseline assessment, and age SD, similar to the main analysis (Supplement Fig. 11). Model performance for both domains remained similar to that of the main analysis (Supplement Table 4, Table 4).

The results of sensitivity analyses using 5-fold cross-validation as an internal validation approach to evaluate the robustness and generalisability of model estimates for specific domains of cognition are presented in Supplement Figs. 8, 9, 12, and 13.

Third, analyses with targeted elimination of the multicollinear age and education SD parameters provided further validation of the impact of time interval, age, and GII in mild TBI cohorts. In moderate-severe TBI, the time-related variables were further validated (Supplement Figs. 14 and 15).
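An elimination step of this kind can be sketched as a correlation-threshold filter; the variables, the induced collinearity, and the 0.8 cutoff are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 121
age = rng.normal(32, 8, n)
age_sd = 0.4 * age + rng.normal(0, 1.0, n)   # deliberately collinear with age
df = pd.DataFrame({"age": age, "age_sd": age_sd,
                   "time_interval": rng.uniform(0.1, 160.0, n)})

# Flag one member of each highly correlated pair (|r| > 0.8) for removal,
# then refit the models on the reduced predictor set.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
df_reduced = df.drop(columns=to_drop)
```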

Finally, by adding the sample size of cohorts to the models, we observed that all important parameters remained the same as in the main analysis for mild TBI cohorts. However, in moderate-severe TBI cohorts, sample size emerged as an important feature across models, as well as a predictor of the rate of change in the SHAP analysis, slightly reducing the influence of GII and age SD on the model output (Supplement Figs. 16 and 17).

Discussion

In this study, we report innovative explainable ML research using published longitudinal data on the course of cognition after TBI, investigating PROGRESS-Plus characteristics of study samples as predictors of the rate of change in cognition after injury. The results support the importance of considering social parameters in post-injury outcomes, and the ability of ML to explain heterogeneity49 in the course of cognition, overall and by cognitive domain, following TBI of equal severity. Our study has research, clinical, and policy implications.

We found that among the most important predictors of cognitive change after TBI in published research, across domains and injury severities, were age, time-related parameters (i.e., time from injury to baseline assessment and the time interval from baseline to last follow-up assessment), and country-level structural indicators. In the sensitivity analyses, the additional parameter of the number of participants in the cohort emerged as an important predictor of the rate of change. These parameters have long been discussed in longitudinal research50 and have been cited as key parameters that, if not strongly endorsed in analytical models, make research findings less likely to be true51. The ML models were able to capture these variables, reinforcing the view that supervised ML52,53 can contribute significantly to human knowledge and expertise, and to discussions of scientific validity.

Given the pronounced variation in age and age SD across cohorts, the discussion brings attention to how researchers deal with age in their analyses of longitudinal studies. This is true for both mild and moderate-severe TBI. Age-related deterioration affects all people and is reflected across domains of cognition, as reported for both mild and moderate-severe TBI cohorts1,54,55. The mechanisms that mediate the age-related processes and changes reflected in the test performance of the cohorts with TBI included in this study are not entirely clear; however, these processes are likely to be influenced by injury severity (Supplement Fig. 1b, e, f). We found a higher rate of change in cognitive outcomes in mild TBI (0.23) than in moderate-severe TBI (0.11) (Table 1), highlighting the value of machine learning in capturing temporal effects when predicting outcomes. Our results also align with the random damage theories56,57, which centre on the balance between ongoing damage and repair that occurs in the process of natural ageing and is reflected in cognitive test performance. When this balance is disturbed by brain injury, it becomes further dysregulated and was therefore picked up by ML as a key predictor of the rate of change, to a greater extent in moderate-severe TBI than in mild (Supplement Fig. 1b, e, f). The predictive importance of participants' age, and of the cohort's age SD, is therefore expected, as age affects both natural ageing and recovery after injury, which is unlikely to be uniform across participants of various ages. Age variability within cohorts, reflected by larger age standard deviations, indicates broader age dispersion among participants. Such dispersion may correspond to heterogeneity in age-related multimorbidity, which is known to increase with age, particularly among older participants in the cohort.
Nonetheless, comorbidity was not consistently reported in the studies included in our dataset, limiting our ability to examine its potential importance. Future research should consider this limitation and systematically report the health status of research participants within the Plus parameters, as these factors are implicated in injury severity and the external cause of injury10. This is important because several longitudinal studies have shown that the age-specific strain caused by the onset of TBI in the presence of comorbidity is associated with cognitive outcomes, regardless of TBI severity1,6,9. Future research should investigate age and age-related effects precisely to understand their predictive role in prognosis.

Time-related variables (i.e., time from injury to baseline assessment and the time interval between baseline and last follow-up assessment; Supplement Table 2) emerged as important predictors of the rate of change, emphasising the critical role of time in prognostic research. We observed that the rate of cognitive change after TBI differed between mild and moderate-severe cohorts (Table 1). This finding aligns with prior research and with current evidence defining severe TBI as a risk factor for long-term cognitive decline58,59. This study provides evidence that time affects the rate of change across the executive function and learning and memory domains of cognition, but we did not have sufficient data to test the effect on other cognitive domains (i.e., language, perceptual-motor, complex attention, information processing speed, social cognition). Because our prior systematic review1,60 and observational studies35,61 brought attention to the fact that the course of cognition and recovery were not uniform and were dependent on the baseline assessment, future research should consider the implications of time after injury and of baseline assessment in the prognosis of cognitive domain-specific risks following injury.

Our results on the relevance of baseline assessments (i.e., time zero) in prognosis were especially pronounced in moderate-severe TBI cohorts, with the mean baseline assessment conducted at around one year post injury and great heterogeneity between cohorts (Table 1). In mild TBI, the mean baseline assessment was at less than two weeks, with more homogeneity among samples (Table 1). At the point of baseline assessment, many participants in the mild TBI cohorts may have reached a recovery plateau, which may not be the case for moderate-severe TBI; therefore, the time effects were not as pronounced in ML prediction for mild as for moderate-severe cohorts. Our sensitivity analyses confirmed the robustness of these time effects, which underscores a critical need for coordinated research and policy efforts that explicitly integrate discussion of the timing of research concerning prognosis by injury severity, as this would impact scientific evidence with policy and practice implications. This is especially important as, in our prior hypothesis-driven approaches, we faced challenges in characterising these temporal effects. Given that recovery occurs in parallel with the processes of ageing, these processes may reinforce one another at differing speeds depending on the time elapsed since injury. Our results underscore that time is not merely a methodological detail, but a determinant of predictive accuracy and scientific interpretation. Future research should consider the benefits of ML for predicting outcomes after TBI with greater precision with respect to time.

Country-level indicators also warrant discussion. We systematically integrated heterogeneous datasets across multiple published cohorts, covering all published longitudinal research concerning the course of cognition after TBI. We found that ML captured GII as a predictor of the rate of change across several models in both injury severities, in both bivariate and multivariate ML models. The positive correlation between GII and the rate of change is consistent with previous reports of the important role of gender equality in health outcomes62. These results are also consistent with recent observations showing that trust in relationships is associated with outcomes in advanced clinical conditions, including stroke63 and dementia64. In countries with social constraints on education and development, captured via the GII and reflected in economic imbalances and restricted decision making, the macro-level relational environment may impose strains on both family and community relationships, impacting trust. While this interpretation may seem far-fetched, the results bring attention to the critical role of macro-level social parameters within existing data hierarchies and raise fundamental questions about their relative importance compared with person-level predictors. Greater emphasis on social pathways in future research will be essential to elucidate the mechanisms through which these parameters exert their effects on cognitive recovery.

The results of our correlation matrices indicate that PROGRESS-Plus parameters characterising study samples, including whether the country of study origin is predominantly English speaking, gender/sex, education, and age, are associated with the rate of change. However, when these variables were used in ML, in combination with other parameters (i.e., time effects and country-level structural indicators), their predictive values were not as strong as those of other parameters, with the exception of education and age. We observed implications of the country-level structural indicators expressed by the GII in the results of our past systematic review8, where, when we positioned results along the GII continuum, differences between males and females across outcomes and countries of publication started to dissipate. The GII is a composite, time-varying indicator of a country's gender inequality level, incorporating measures of educational attainment, labour force participation, maternal mortality, adolescent fertility, and parliamentary representation65. Therefore, when the index was used in ML models, the predictive effects of cohorts' education and education SD were diminished, and several of our models featured the GII as a salient predictor of the rate of cognitive change after TBI, in both mild and moderate-severe samples. While it has been previously suggested that cognitive outcomes after TBI differ between people of different gender/sex and education level, our results are the first to highlight that the importance of these parameters is affected by broader social and structural contexts. This has important research, clinical, and policy implications, particularly in ongoing debates regarding the long-term cognitive consequences of milder forms of TBI, and the cumulative contributions of biological sex, gender, education, and injury severity to TBI outcomes.
Our ML results also bring attention to the intersectionality reflected in the GII, which may also have implications for those who participated in the research included in the dataset, and which was therefore captured by ML as a prognostic parameter after TBI. Evidence increasingly shows that the characteristics of research participants in TBI studies are not truly reflective of TBI populations: racialised, less educated, and more disadvantaged communities are less likely to participate in research but are disproportionately affected by TBI and its adverse consequences15,66,67. Future prognostic research and policy should diversify research samples and strive for systematic inclusion of diverse groups of people in TBI research. Overcoming barriers to research remains a priority66,67.

Our sensitivity analyses identified cohort sample size as an important variable in predicting the rate of change. It has long been recognised that statistical power is related to effect size, and that findings are more likely to be true in scientific fields characterised by larger effects51. We observed that in moderate-severe TBI, the rate of change was half that observed in mild TBI cohorts (Table 1). In longitudinal studies where effects are large, such as in moderate-severe TBI, results may be more susceptible to spurious associations arising from limited sample sizes. To elaborate, if the majority of the true PROGRESS-Plus characteristics influencing (i.e., associated with) the rate of change are also associated with cohort sample size, these variables may be downgraded in importance among the most important predictive parameters when sample size is added to ML models. Based on these results, we consider ML to be a valuable and informative tool and screening mechanism for meaningful associations in published TBI research concerning the course of cognition. Its application is particularly valuable given the availability of extensive data and metadata, which can be accessed at relatively low cost.

The purpose of our research was to delineate whether social parameters were associated with the rate of change of cognition using data from published TBI cohorts, and to provide information on whether the characteristics of longitudinal study samples play a role in prognosis. We applied three ML approaches to build prognostic models of the rate of cognitive change in both mild and moderate-severe injury severity samples. All three ML approaches converge on the conclusion that social and structural characteristics of published cohorts are prognostically meaningful for cognitive change after TBI. Models for mild TBI strata were more heterogeneous and prone to social influences compared to moderate-severe TBI strata, as highlighted by the complexity reflected in the hyperparameters of boosting models (GB and XGBoost). Nonetheless, all models indicated that sample composition determines prognostic patterns and therefore should be treated with greater care in future research. Results of ML models not only delineated four PROGRESS-Plus characteristics of cohorts (i.e., country of English language dominance as place of residence, gender/sex, education, and age) as important determinants of prognosis, but also exposed critical gaps in the current prognostic evidence base by showing who contributes to longitudinal research and which social variables are measured in cohort studies on prognosis and which are not (i.e., religion/spirituality, socioeconomic status, occupation, social capital), constraining what can be known about the value of these parameters to the course of cognition and prognosis after TBI.
Therefore, ML approaches as a methodological tool are of great value not only for improving prognostic modelling and meta-research, but also for bringing attention to how methodological choices made by researchers about sampling68, anchored time scale (i.e., time zero and elapsed time)69, and cognitive measures1,60 actively shape the current best state of knowledge available to clinicians and policymakers.

Our results should be interpreted in light of several methodological limitations. First, we restricted our analysis to the cohorts’ observational window between first and last available assessments. Several included cohorts (Supplementary Table 1) reported data from multiple assessment points; however, the timing of these assessments was highly inconsistent with regard to time zero and time since injury70. When harmonised across studies, there was insufficient overlapping data to support ML modelling at different discrete time intervals (Supplementary Table 1). While we adjusted results for time from injury to the first assessment (i.e., time zero) and for the interval between assessments, we acknowledge that our approach approximates change as linear between time zero and the final assessment. However, recovery following injury does not follow a linear course, as early changes are situated within natural recovery processes after injury71. As such, our estimates should be interpreted as average rates of change across cohorts with different observational windows from injury time. As additional longitudinal datasets become available, future work will be better positioned to apply ML approaches that explicitly model nonlinear patterns and time-varying effects across a wide range of recovery phases. Second, because of heterogeneous and inconsistent reporting of study participants’ social profiles in the dataset of published studies, we needed to develop and implement rigorous data harmonisation processes to maximise the ability to examine all available PROGRESS-Plus characteristics in the dataset prior to application of ML-based prognostic modelling. Despite our data harmonisation efforts, the dataset exhibited substantial missingness across several PROGRESS-Plus parameters, including race/ethnicity, socioeconomic status, occupation, religion/spirituality, and social capital.
Although data augmentation using synthetic data generation is sometimes employed to mitigate this issue in ML, implementation of this approach was not possible in the present study, as these PROGRESS-Plus parameters were sparsely reported: after data harmonisation, the available data were less than 22% for occupation and less than 5% for all other parameters (Table 1). At such low levels of representation, synthetic data generation would have relied on insufficient underlying distributions, increasing the risk of amplifying noise, introducing artificial patterns, and reinforcing existing biases. For these reasons, we restricted model training to four parameters with adequate representation: country of study origin, operationalized as country of English language dominance, gender/sex, education, and age. Future research should consider this limitation and commit to standardised reporting of PROGRESS-Plus characteristics to enable equitable and robust ML-based prognostic modelling using published data. Further investigations are also needed to refine the role of PROGRESS-Plus parameters as predictors of outcomes when considering structural-level parameters. Finally, we used the binary variable English-language dominance as a controlling variable in the ML modelling. This was because all measures used to assess cognition were developed in English. Even when translated or culturally adapted, these measures may not achieve full construct and linguistic equivalence when applied within non-English-speaking countries, potentially introducing systematic differences in test performance. Because our dataset included samples from multiple countries, including those in which English is not the primary language, we sought to account for potential language-related measurement bias. Although country-level stratification would have been preferable, the number of data points per country was insufficient to support adequately powered analyses.
Therefore, to retain statistical power while still addressing potential linguistic effects, we operationalized a binary English-dominance variable (English-speaking vs. non-English-speaking study context). We acknowledge that this binary classification simplifies substantial cultural and linguistic heterogeneity; however, it represents a pragmatic approach to mitigating potential measurement bias given available data.

Our research and results provide a foundation for advancing prediction models that systematically evaluate how study sample representativeness affects TBI prognosis. We employed three ML models: RF, GB, and XGBoost. Employing SHAP in the present study provided additional insights into model performance and the predictive value of each parameter in the presence of others, in both mild and moderate-severe injury severity study samples. The usability of the models in predicting outcomes using published data depends on data quality. We recommend that future researchers who apply our process and work with published data dedicate significant time to data extraction, pre-processing, and standardization prior to machine learning model input. In addition, users are expected to have at least some machine learning expertise and be familiar with the variables in the dataset, as well as with the principles of modelling. While this may limit direct use by non-technical personnel, it ensures reliable and accurate predictions for research and policy applications. Future enhancements in prognosis research that can be used in machine learning require the systematic collection, reporting, and integration of social parameters in research to enable more equitable clinical and policy decision-making.

Methods

We conducted and reported our research following the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis – Artificial Intelligence (TRIPOD-AI) guidelines for prediction model development and validation72. We shared all research-related files in an open data repository on the Open Science Framework (OSF)73, in alignment with the FAIR Guiding Principles74. The complete dataset is available upon reasonable request from the corresponding author.

Data sources and description

The dataset used in this work was developed by our research team using data from our prior systematic reviews1,60 on the course and predictors of cognitive outcomes after TBI. We updated this dataset with new evidence that emerged since the original searches16. The flow diagram of the study selection is depicted in Fig. 1. In brief, the data cover evidence from five databases: Embase, Medline, Scopus, Cochrane Central Register of Controlled Trials, and PsycINFO. We searched the databases from inception until April 8, 2024 for peer-reviewed English-language longitudinal studies reporting raw cognitive test scores at two or more time points in adults diagnosed with TBI. Only studies that provided data on the course of cognition over at least two data points after TBI were included. The protocol, PROSPERO registry, and the systematic reviews that provided specifics on study methodology and the dataset are published15,70,75.

Our research concerned data on TBI diagnosis, injury severity and cause of injury data, and PROGRESS-Plus parameters of each study sample (country of recruitment, race/language/ethnicity, occupation, gender/sex, religion/spirituality, education, socioeconomic status, social capital, age, and other contextual parameters), and cognitive test scores at baseline and last follow-up assessments75. Each cognitive test in a given cohort was treated as a unit of analysis.

Dataset preparation and machine learning workflow

Figure 1 represents dataset preparation and machine learning workflow, outlining seven steps: (i) data collection; (ii) data extraction; (iii) data harmonisation; (iv) data preprocessing; (v) machine learning algorithms; (vi) model evaluation; and (vii) sensitivity analyses, which allowed generation of overall and model-specific explanations.

Data collection

Study selection followed predefined inclusion and exclusion criteria. Eligible studies included those that reported longitudinal data on standardised cognitive tests and provided sufficient injury-related data on study sample characteristics to support data harmonisation and modelling. For more information, please refer to published studies1.

Data extraction

Two independent researchers extracted data using standardised, previously published extraction forms70. Extracted variables included demographic characteristics (age, sex/gender, race/ethnicity), social parameters (place of residence, occupation, education, and social capital), and contextual brain injury-related parameters (injury severity, mechanism of injury, comorbidities, and time from injury to baseline or first assessment, also known as time zero in prognosis studies47). We also extracted data on time from injury to baseline assessment, and calculated the time interval between baseline and last follow-up cognitive assessments. Data were extracted from the main text and supplementary materials of each published study. Extraction accuracy and reliability were ensured through double extraction, cross-checking, and resolution of discrepancies by consensus.

Data harmonisation process

We considered using ML-based solutions for data harmonisation of several variables76,77,78. However, we faced significant challenges in finding scientific guidance on the application of standardised ML data harmonisation in practice. We found that each harmonisation case was a unique process and necessitated a series of context-dependent decisions through iterative team discussions where we evaluated several harmonisation options at once (i.e., flexible harmonisation)79,80.

We first created a library of terms based on all extracted data (Supplement Table 2) that defined variable names, operational definitions, categories, and units to ensure consistency across included cohorts. The core team discussed each variable in the table and evaluated different variables under the same section header, jointly deciding how to incorporate these updates into the harmonised protocol template. We then harmonised data to reconcile different operationalisations of similar concepts (i.e., gender/sex composition of study samples, reported in a variety of data formats, known as syntax) as well as semantics (i.e., intended meanings of words such as young adults, primary schooling, etc.), conceptual schema (i.e., structured and unstructured data extracted from tables and raw text, respectively), and measurement differences (e.g. when the same concept was reported with different measurements, including injury severity, cognitive domain, etc.). Finally, we transformed data to numeric values in order to prepare the data for ML (Supplement Table 3). We structured extracted social and contextual elements using the PROGRESS-Plus framework. The categorical parameters of study samples were transformed to continuous variables where feasible: gender/sex composition was transformed to proportion men/males, and education level was transformed to education in years. Additionally, we transformed age-related parameters to continuous age and age standard deviation of study samples. Country of study origin (where recruitment took place) was transformed to country of English language dominance as a binary variable (0, 1).
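As an illustration, the numeric transformations described above could be sketched as follows. This is a minimal sketch: the category labels, education-to-years mapping, and country list are hypothetical examples, not the study's actual codebook (Supplement Table 3).

```python
# Illustrative harmonisation helpers; mappings are hypothetical, not the
# study's codebook.

EDUCATION_YEARS = {            # semantic mapping: reported level -> years
    "primary schooling": 6,
    "secondary schooling": 12,
    "tertiary education": 16,
}

ENGLISH_DOMINANT = {"usa", "uk", "australia", "canada", "new zealand", "ireland"}

def proportion_male(n_male: int, n_total: int) -> float:
    """Gender/sex composition transformed to a continuous proportion men/males."""
    return n_male / n_total

def education_to_years(level: str) -> int:
    """Categorical education level transformed to years of education."""
    return EDUCATION_YEARS[level.lower()]

def english_dominance(country: str) -> int:
    """Country of study origin as a binary English-dominance indicator (0, 1)."""
    return int(country.lower() in ENGLISH_DOMINANT)
```

Keeping these transforms as small named functions makes each harmonisation decision documentable, in line with the FAIR-compliant process described below.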

Cognitive outcome scores of standardised tests at baseline and last follow-up were represented as means and standard deviations, and the cognitive domains they reflect were represented as binary values.

To ensure quality, data harmonisation and transformation processes were conducted by two independent researchers. We documented the process and decisions (Supplement Table 3) made for future data users in compliance with the FAIR standards74.

Data preprocessing

Data preprocessing included scaling of continuous variables, systematic assessment of outliers, and correction for imbalances across injury severity groups, informed by visual plots and correlation matrices between predictors and the outcome.

When time was reported categorically or as ranges, we used midpoint values. For cohorts reporting multiple follow-up time points1, the final available follow-up assessment was used for data harmonisation.
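The midpoint rule can be expressed compactly; a sketch, assuming ranges are reported in a simple "low-high" string format (the actual reported formats varied by study):

```python
def midpoint_months(reported_range: str) -> float:
    """Midpoint of a time value reported as a range, e.g. '6-12' months.

    Hypothetical helper; real reported range formats may differ per study.
    """
    low, high = (float(v) for v in reported_range.split("-"))
    return (low + high) / 2
```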

To preserve the variance structure across included cohorts, we calculated the time SD for each data point, taking into account cohort sample size, reported outcome scores, and score SDs. We then calculated the rate of change per month as the outcome for longitudinal harmonisation of data points. For each data point, the rate of change was computed by dividing the difference between outcome values at last follow-up and baseline by the elapsed time between assessments, using the following formula:

$$\text{Rate of cognitive change (per month)}=\frac{Y_{FU}-Y_{BL}}{T/30}$$

where YFU denotes the cognitive test score at the last follow-up assessment, YBL denotes the cognitive test score at baseline, and T denotes the mean follow-up interval measured in days. The follow-up interval was converted to months by dividing by 30 to harmonise time scales across studies.
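The formula translates directly into code; a minimal sketch:

```python
def rate_of_change_per_month(y_fu: float, y_bl: float, t_days: float) -> float:
    """Rate of cognitive change per month: (Y_FU - Y_BL) / (T / 30),
    where T is the mean follow-up interval in days."""
    return (y_fu - y_bl) / (t_days / 30)
```

For example, a 5-point improvement over a 90-day follow-up interval yields a rate of 5/3 points per month.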

Class imbalance

We observed that imbalance in our data occurred within injury severity categories. To mitigate the influence of this imbalance on model performance and to preserve clinically meaningful distinctions, we stratified analyses by injury severity rather than applying resampling-based class-imbalance methods. Moderate, moderate–severe, and severe injuries were combined into a single ‘moderate–severe’ category, and models were trained separately for mild and moderate–severe TBI.
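The severity-based stratification, used in place of resampling-based class-imbalance methods, can be sketched as follows; cohort records and severity labels are illustrative:

```python
# Collapse moderate, moderate-severe, and severe into one stratum and split
# cohorts into the two analysis groups (illustrative labels).

SEVERITY_MAP = {
    "mild": "mild",
    "moderate": "moderate-severe",
    "moderate-severe": "moderate-severe",
    "severe": "moderate-severe",
}

def stratify_by_severity(cohorts):
    """Split cohort records into the two strata trained separately."""
    strata = {"mild": [], "moderate-severe": []}
    for cohort in cohorts:
        strata[SEVERITY_MAP[cohort["severity"]]].append(cohort)
    return strata
```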

Missing data

We made an a priori decision to restrict ML analyses to complete cases, such that only data points with complete information on the variables required for ML modelling were included in the final analytic sample81,82,83. This decision was made because many variables exhibited structurally missing data (Table 2) rather than data missing at random.
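Complete-case restriction amounts to filtering on the modelling variables; a sketch using pandas, with hypothetical column names standing in for the harmonised variables:

```python
import pandas as pd

# Hypothetical modelling variables; the study's analytic set comprised
# gender/sex, education, age, English-language dominance, timing variables,
# and the outcome.
REQUIRED = ["prop_male", "education_years", "age_mean",
            "english_dominant", "rate_of_change"]

def complete_cases(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only data points with complete information on modelling variables."""
    return df.dropna(subset=REQUIRED).reset_index(drop=True)
```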

Machine learning modelling

We applied three supervised ML algorithms84 to mild and moderate-severe injury severity cohorts: random forest (RF)85,86, gradient boosting (GB)87,88, and extreme gradient boosting (XGBoost)89, to model cognitive change over time. These models were trained to predict both overall cognitive trajectories (standardised rate of change per month) and domain-specific standardised rate of change using harmonised PROGRESS-Plus variables. To ensure comparability across algorithms and mitigate bias introduced by differences in cohorts’ age, time between injury and baseline assessment, and follow-up times, all models were trained on harmonised rates of change per unit of time, and controlled for time between TBI and baseline assessment and time between assessments.
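The training setup can be sketched with scikit-learn; the data here are synthetic stand-ins for the harmonised cohort features (column meanings are hypothetical), and XGBoost would be added analogously via the xgboost package:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic stand-in for the harmonised features (hypothetical columns:
# prop_male, education_years, age_mean, english_dominant,
# time_zero_months, follow_up_months) and the rate-of-change outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 0.5 * X[:, 2] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=60)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
    # XGBoost is analogous: xgboost.XGBRegressor(random_state=0)
}
fitted = {name: model.fit(X, y) for name, model in models.items()}
```

Including the timing variables as features is one way to control for time between TBI and baseline assessment and time between assessments, as described above.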

We considered time from injury to baseline assessment, time between baseline assessment and last follow-up, and country-level structural indicators as potential modifiers of the relationship between primary predictors and cognitive outcomes. This decision was made in order to capture the historical evolution of populations’ social parameters by country, including the GII and HDI.

Model evaluation and interpretation

We evaluated model performance using mean absolute error (MAE) and root mean square error (RMSE)90. We then examined feature importance using feature importance heatmaps and Shapley Additive Explanations (SHAP)91. SHAP values allowed us to evaluate the predictive importance of each feature and its contribution to the difference between an actual prediction and the mean prediction, capturing non-linear relationships between PROGRESS-Plus variables and the rate of change.
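A sketch of the evaluation step on synthetic data, where feature 0 drives the outcome; SHAP attributions would be obtained with the shap package (noted in a comment, not executed here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic data in which feature 0 drives the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = X[:, 0] + 0.2 * rng.normal(size=80)

model = RandomForestRegressor(random_state=0).fit(X, y)
pred = model.predict(X)

mae = mean_absolute_error(y, pred)
rmse = float(np.sqrt(mean_squared_error(y, pred)))  # RMSE from MSE
importance = model.feature_importances_             # input for importance heatmaps
# Per-prediction SHAP attributions would come from the shap package, e.g.
# shap.TreeExplainer(model).shap_values(X).
```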

Internal validation and sensitivity analyses

We performed multiple sensitivity analyses. First, we used 5-fold cross-validation as an internal validation approach to evaluate the robustness and generalizability of model estimates. This procedure assessed the stability of the effects of input parameters on cognitive outcomes across injury severity strata.
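The 5-fold internal validation can be sketched as follows, again on synthetic stand-in data; the scoring metric shown (MAE) matches the evaluation metrics above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X[:, 0] + 0.1 * rng.normal(size=100)

# 5-fold cross-validation; negate scikit-learn's negated-MAE convention
# to recover per-fold MAE values.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = -cross_val_score(RandomForestRegressor(random_state=0), X, y,
                            cv=cv, scoring="neg_mean_absolute_error")
```

The spread of per-fold errors gives a direct read on the stability of model estimates across resamples.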

Second, we repeated analyses to assess the potential effects of PROGRESS-Plus parameters on specific cognitive domains that provided sufficient data and had a relatively even distribution between mild and moderate-severe TBI cohorts. We hypothesized a priori that PROGRESS-Plus parameters would exert a more pronounced effect on domain-specific cognitive outcomes than on overall cognition, stronger for mild than moderate-severe TBI, given previously observed variability in rates of change by injury severity within the cohorts1. This allowed us to verify the relative contributions of the important PROGRESS-Plus parameters under a more constrained, cognitive domain-specific dataset. We re-evaluated feature importance using feature importance heatmaps and SHAP, estimating each predictor’s contribution to the standardised rate of cognitive change in domain-specific cohorts for the executive function and learning and memory domains of cognition.

Third, we investigated the model performance by removing features with high correlation coefficients to evaluate the impact of multicollinearity on the model’s performance. By selectively excluding highly correlated features (education and age SD), we aimed to improve the robustness of the model.
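Correlation-based feature removal of this kind can be sketched as follows; the 0.8 threshold is an illustrative default, not necessarily the cut-off used in the study:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation
    exceeds the threshold (the later column of the pair is removed).

    Illustrative helper; the threshold is a hypothetical default."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```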

Finally, we added cohort sample size to the models used in the main analysis. This was because the correlation matrix (Supplement Fig. 2) highlighted strong positive and negative associations with HDI and GII, respectively. Since these structural indicators emerged among the most important parameters for predicting the rate of change, we added sample size to the main analysis and evaluated the internal consistency of predictors.

All analyses, including generation of figures, were performed using Python (version 3)52,53. Figures were created using Python libraries (matplotlib (v3.x), seaborn (v0.x), and wordcloud (v1.x)).