Introduction

Recent technological advancements driven by improvements in computing power have allowed the collection of large volumes of data1,2. In the health field, these innovations present an opportunity to increase the accuracy of outcome predictions3,4, such as all-cause mortality, a complex outcome in public health, especially due to its multifactorial nature5,6,7. Predictive models for all-cause mortality could help provide better care for individuals and populations, mainly by offering approaches to prevent premature mortality and events associated with poor disease management, which could lead to avoidable deaths. Machine learning models can offer new perspectives for predicting all-cause mortality8,9,10,11 given their ability to deal with complex relationships among variables. However, the performance of these models can be influenced by the context and the representativeness of the available data.

Furthermore, imbalanced classification is a challenging issue in machine learning12, occurring when the class distribution deviates substantially from 50%, a common situation in epidemiological studies assessing mortality. In particular, predicting mortality in general population samples, which can include both “healthy” individuals and those with diseases, tends to present additional challenges. The variability in results and the complexity of developing machine learning models for all-cause mortality highlight the need for a comprehensive analysis to better understand the state of the art of these models. Two previous systematic reviews have demonstrated that machine learning has potential for predicting chronic diseases and obesity13,14, despite the limitations observed.

Systematic reviews and meta-analyses can provide valuable insights into the effectiveness of predictive models from a global health equity perspective. Identifying factors that influence predictive performance across different populations and economic contexts may be particularly valuable for improving the global applicability of these models, as suggested by recent ethical frameworks for AI in health15. In this study, we aimed to review the literature on the performance of machine learning models to predict all-cause mortality and to synthesize these results through a meta-analysis.

Methods

We carried out a systematic review and meta-analysis, registered in the PROSPERO repository (CRD42023476567) and conducted following the recommendations of the PRISMA 2020 statement16.

Search strategy

We searched the following databases on October 24, 2023: PubMed, LILACS, Web of Science, and Scopus. No restrictions were imposed on the year of publication, country, or language of the included studies. For studies identified in languages other than Portuguese, English, or Spanish, we used Google Translate for translation.

Two groups of keywords were used and combined with the Boolean operators ‘OR’ and ‘AND’, respecting the specificities of each database. Whenever possible, we used Medical Subject Headings (MeSH) Major Topics or, when not available in a database, searched the articles’ titles. We used filters for original articles when available in the databases.

The following keywords were used in the searches: Machine Learning [MeSH] OR Supervised Machine Learning [MeSH] OR Prediction models OR Prediction OR Predictive OR Predict OR Classification OR ML OR Artificial Intelligence [MeSH] OR Natural Language Processing [MeSH] OR Neural Networks, Computer [MeSH] OR Support Vector Machine [MeSH] OR Naive Bayes OR Bayesian learning OR Logistic Models [MeSH] OR Neural network OR Neural networks OR Natural language processing OR Support vector* OR Random forest* OR Boosting OR XGBoost OR Deep learning [MeSH] AND Death [MeSH] OR Mortality [MeSH] OR All-cause mortality. The terms and strategy were adapted according to the specifics of each database, and the complete search strategy is available in Supplementary Table 1.

Inclusion criteria

We included studies meeting three criteria: they predicted all-cause mortality as a binary outcome using any machine learning model; involved adults or older adults; and reported results as the Area Under the Curve (AUC) or as true positives, true negatives, false positives, and false negatives (to allow calculation of sensitivity and specificity).

Exclusion criteria

We excluded studies that predicted cause-specific mortality (e.g., death from cardiovascular diseases or accidents), studies involving children and adolescents, animal studies, and studies lacking the information required for inclusion in the meta-analysis.

Literature screening and data extraction

Study selection was conducted independently by two reviewers (KAM and AAV), and disagreements were resolved by a third reviewer (FMD). The process began with reading the titles and abstracts of the articles using the Rayyan platform. The second stage involved reading the articles in full and then reviewing the references of the included articles to identify additional studies. From each included study, we extracted information on the year of publication, location where the study was carried out, sample characteristics, predictor variables, outcome, models used, AUC results, and the best-performing model.

For analytical purposes, studies were classified into two categories based on their sampling approach: general population cohorts (studies recruiting participants from community-based settings regardless of disease status, including population registries, health surveys, or community screening programs) and disease-specific cohorts (studies recruiting participants based on specific medical conditions or clinical settings, e.g., patients with heart failure, hospitalized patients, or disease-specific registries).

Quality assessment

To evaluate individual risk of bias, we used an adapted version of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD + AI) checklist17. This updated checklist consists of 27 items covering various aspects of study reporting. Each item is scored, yielding a total score of up to 52 points based on the inclusion of these essential reporting elements. The TRIPOD + AI checklist addresses topics related to the title, abstract, introduction, methods, open science, patient and public involvement, results, and discussion. Four reviewers independently applied the TRIPOD + AI checklist.

Diagnostic criteria

The outcome was all-cause mortality, which was considered when the study evaluated overall mortality rather than specific causes (e.g., cardiovascular mortality). The outcome was chosen because of the need to understand whether machine learning can predict it well, considering that all-cause mortality can occur for different reasons, and to identify which characteristics can influence its occurrence.

Statistical analysis

We conducted a meta-analysis of the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). We chose the AUC as our metric because it comprehensively evaluates how well models discriminate between positive outcomes (mortality) and negative outcomes (survival)18. AUC values range from 0 to 1: a value of 1 indicates a model that perfectly differentiates between the two classes, while a value of 0.5 indicates performance equivalent to random guessing18. AUC values above 0.7 are generally considered to indicate good predictive performance.
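As a minimal sketch of the metric itself, the AUC can be computed with the rank-based (Mann-Whitney) formulation using NumPy alone; the function name and the toy labels and scores below are illustrative and not taken from any included study.

```python
import numpy as np

def auc_mann_whitney(y_true, y_score):
    """Rank-based (Mann-Whitney) estimate of the ROC AUC.

    y_true: binary labels (1 = death, 0 = survival); y_score: predicted risk.
    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative call: 0.5 corresponds to random guessing, 1.0 to a perfect model
print(auc_mann_whitney([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```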

We collected each study’s AUC value, confidence interval, and standard error (SE), based on the model that performed best in the test set or external validation. For studies that reported confidence intervals instead of the SE, we calculated the SE using the following formula: SE = (upper limit − lower limit)/3.9219. For studies lacking confidence intervals, we estimated them using the method of Hanley and McNeil20, which calculates confidence intervals from AUC values and sample sizes under the assumption of a binomial distribution. We assumed a 95% confidence level and used the relationship between the AUC, sample size, and standard error. Sensitivity analyses compared meta-analytic results with and without imputed confidence intervals to assess the impact of this assumption on pooled estimates and heterogeneity measures20.
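The two reconstruction steps described above can be expressed compactly in Python. The sketch below assumes the numbers of deaths and survivors can be derived from the reported sample size and outcome prevalence; the function names and example values are illustrative.

```python
import numpy as np

Z_95 = 1.96  # two-sided 95% critical value; CI width = 2 * 1.96 * SE = 3.92 * SE

def se_from_ci(lower, upper, z=Z_95):
    """Back-calculate the standard error from a reported 95% confidence interval."""
    return (upper - lower) / (2 * z)  # equivalent to (upper - lower) / 3.92

def se_hanley_mcneil(auc, n_pos, n_neg):
    """Hanley & McNeil standard error of the AUC.

    n_pos: number of events (deaths); n_neg: number of non-events (survivors).
    """
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return np.sqrt(var)

# Illustrative example: AUC 0.83 in a cohort of 8,367 participants with 10% mortality
auc, n = 0.83, 8367
n_pos = round(n * 0.10)
se = se_hanley_mcneil(auc, n_pos, n - n_pos)
imputed_ci = (auc - Z_95 * se, auc + Z_95 * se)
```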

In addition to the overall meta-analysis, we carried out subgroup analyses: (1) general population vs. people with specific diseases or conditions; (2) high-income vs. low- and middle-income countries, based on the World Bank classification; (3) sample size: fewer than 2000 vs. 2000 or more participants; (4) TRIPOD + AI score: less than 35 points vs. 35 or more; (5) models: Tree-based and Tree Ensemble vs. Neural Networks vs. Linear/Statistical vs. Ensemble/Hybrid vs. other models. Model categories were defined as follows: (1) Tree-based models: individual decision trees and random forests when used as single algorithms; (2) Neural Networks: all artificial neural network architectures, including deep learning, convolutional networks, and multilayer perceptrons; (3) Linear/Statistical: logistic regression, Cox regression, and linear discriminant analysis; (4) Ensemble/Hybrid models: combinations of multiple algorithms, including XGBoost, LightGBM, and stacked approaches; (5) Other models: single-use algorithms, including Disease Severity Models (DSM), Deep Learning System with Multi-head Self-attention Mechanism (DLS-MSM), ICD-based Injury Severity Score (ICISS), Support Vector Machines, and Bayesian Networks. The “Other” category comprised algorithms used by only one study each, limiting statistical power for meaningful comparisons.
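One simple way to operationalize these subgroup definitions is a lookup table mapping each study's reported best-performing algorithm to one of the five categories. The sketch below is illustrative only; the algorithm labels are examples rather than the exact strings extracted from the included studies.

```python
# Illustrative mapping from reported algorithm names to subgroup categories.
MODEL_CATEGORIES = {
    "decision tree": "Tree-based",
    "random forest": "Tree-based",
    "neural network": "Neural Networks",
    "deep learning": "Neural Networks",
    "multilayer perceptron": "Neural Networks",
    "logistic regression": "Linear/Statistical",
    "cox regression": "Linear/Statistical",
    "linear discriminant analysis": "Linear/Statistical",
    "xgboost": "Ensemble/Hybrid",
    "lightgbm": "Ensemble/Hybrid",
    "stacked model": "Ensemble/Hybrid",
    "support vector machine": "Other",
    "bayesian network": "Other",
}

def categorize(model_name: str) -> str:
    """Assign a study's best-performing model to a subgroup category."""
    return MODEL_CATEGORIES.get(model_name.strip().lower(), "Other")
```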

The results are presented as AUC values pooled with a random-effects model, with 95% confidence intervals (95% CI). Heterogeneity was assessed using the I2 statistic, with values above 75% considered high heterogeneity21,22, indicating that the studies differ substantially from one another and that the combined results should be interpreted with caution.
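For reference, a minimal NumPy sketch of inverse-variance random-effects pooling (with a DerSimonian-Laird estimate of the between-study variance) and the I2 statistic is shown below; the function name is illustrative, and the published analysis code in the repository linked below remains the authoritative implementation.

```python
import numpy as np

def random_effects_pool(auc, se):
    """DerSimonian-Laird random-effects pooling of study-level AUCs.

    auc, se: arrays of per-study AUC estimates and their standard errors.
    Returns the pooled AUC, its 95% CI, and the I^2 heterogeneity statistic (%).
    """
    auc, se = np.asarray(auc, float), np.asarray(se, float)
    w = 1.0 / se ** 2                        # fixed-effect (inverse-variance) weights
    k = len(auc)
    fixed = np.sum(w * auc) / np.sum(w)
    q = np.sum(w * (auc - fixed) ** 2)       # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)       # between-study variance
    w_star = 1.0 / (se ** 2 + tau2)          # random-effects weights
    pooled = np.sum(w_star * auc) / np.sum(w_star)
    se_pooled = np.sqrt(1.0 / np.sum(w_star))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return pooled, ci, i2
```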

The meta-analyses were conducted in Python, via Google Colab, using the NumPy, Pandas, Statsmodels, Matplotlib, and Seaborn libraries. When the same study evaluated different all-cause mortality follow-up periods, we considered the longest period for the meta-analysis. Whenever the studies provided such information, we collected the results from the test set or the external validation dataset. The code created is available at: https://github.com/fmdsocial/reviewsci/blob/main/Revis%C3%A3o_PDJ_Corre%C3%A7%C3%B5es_28_04_2025.ipynb.

To explore sources of heterogeneity, we conducted univariate meta-regression analyses using weighted least squares to examine the relationship between study characteristics and AUC performance. The moderator variables examined were country income level (high-income vs. low/middle-income), population type (general vs. disease-specific), sample size (≥ 2000 vs. < 2000 participants), TRIPOD + AI quality score (≥ 35 vs. < 35 points), confidence interval imputation (non-imputed vs. imputed), outcome prevalence (0–19%, 20–39%, and 40% or more), and machine learning model type (using tree-based models as the reference category). Model categories were defined as follows: Tree-based models included individual decision trees and random forests when used as single algorithms; Ensemble models included combinations of multiple algorithms (Random Forest, XGBoost, LightGBM, and hybrid approaches); Neural Networks included all artificial neural network architectures; Linear/Statistical included logistic regression and similar approaches; and Other models included single-use algorithms (DSM, DLS-MSM, ICISS, Support Vector Machine, Bayesian Network).
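A univariate weighted least squares meta-regression of this kind can be fitted with Statsmodels, weighting each study by the inverse of its squared standard error so that more precise studies contribute more. The data frame, column names, and values below are illustrative assumptions rather than extracted study data.

```python
import pandas as pd
import statsmodels.api as sm

# Illustrative study-level data; columns and values are assumed for this sketch.
df = pd.DataFrame({
    "auc":              [0.83, 0.75, 0.91, 0.68],
    "se":               [0.02, 0.04, 0.01, 0.05],
    "disease_specific": [1, 1, 0, 0],   # 1 = disease-specific cohort
    "high_income":      [1, 0, 1, 0],   # another moderator that could be substituted
})

# Univariate WLS meta-regression with one moderator at a time
X = sm.add_constant(df["disease_specific"])
model = sm.WLS(df["auc"], X, weights=1.0 / df["se"] ** 2).fit()
print(model.params, model.pvalues)
```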

Results

Study selection

From the four databases searched, 33,550 studies were identified, and 17,604 remained after removal of duplicates. Of these, 926 were selected based on inclusion and exclusion criteria by the reviewers. The final step, involving full-text data extraction, resulted in 88 studies being included in the present review (Fig. 1). The main reasons for excluding studies were outcomes other than all-cause mortality, conference abstracts, and articles not reporting AUC metrics.

Fig. 1. PRISMA flow diagram of study selection process. Flow diagram showing the identification, screening, eligibility assessment, and inclusion of studies in the systematic review and meta-analysis, following PRISMA 2020 guidelines.

Study characteristics

The global distribution of studies is shown in Fig. 2. The United States led with 25 articles, followed by China (n = 20), Sweden (n = 7), and Taiwan (n = 6). The United Kingdom, Italy, Japan, the Netherlands, and South Korea had between three and five articles, while Finland and Spain had two publications each. Some studies combined databases from different countries (n = 7). In total, 51 studies (58%) were carried out in high-income countries.

Fig. 2. Global distribution of included studies. World map showing the geographical distribution of the 88 studies included in the systematic review, with the United States (n = 25) and China (n = 20) contributing the most studies.

The studies included in the systematic review showed considerable variation in sample size, ranging from 148 to 1,264,000 participants (Table 1). Supplementary Tables 2 and 3 detail the included studies according to the type of outcome. The study with the smallest number of participants was conducted in Sweden23, with 148 patients discharged from an emergency department. The study with the largest number of participants was conducted in the United States, involving 1,264,000 participants from a synthetic dataset24. The average number of participants was approximately 55,840, while the median was 8,367. Of the 88 studies, 17 (19%) included samples from the general population, comprising participants with and without diseases, while the remainder used databases of participants with specific diseases or health conditions.

Table 1 Characteristics and main results of included studies (n = 88).

Qualitative synthesis of predictor variables

Across the 88 studies included in this systematic review, a diverse range of predictor variables was evaluated to predict all-cause mortality, categorized into demographic, clinical, laboratory, imaging, and socioeconomic/behavioral groups. Demographic variables, such as age and gender/sex, appeared in 97.73% (86/88) of studies. Clinical characteristics, including comorbidities (e.g., diabetes, hypertension, cardiovascular disease), vital signs, and medical history, were also highly prevalent, used in 88.64% (78/88) of studies. These variables were prevalent in disease-specific cohorts, where they contributed to high predictive performance, as seen in Díez-Sanmartín et al.25, which achieved an AUC of 0.99 using XGBoost for kidney transplant patients.

Laboratory biomarkers, such as glucose, cholesterol, troponin, and NT-proBNP, were included in 47.73% (42/88) of studies, with a higher prevalence in studies of chronic diseases (e.g., Takahama et al., 2023, AUC 0.87 using LightGBM for heart failure patients8). Imaging data, including ECGs, echocardiograms, and chest radiographs, were less common, appearing in 20.45% (18/88) of studies, predominantly in high-income countries (HICs) with advanced diagnostic infrastructure (e.g., Siegersma et al.26, AUC 0.96 using deep neural networks with ECGs). Socioeconomic and behavioral factors, such as education, smoking, alcohol consumption, and social support, were the least frequently used, appearing in 27.27% (24/88) of studies. Studies achieving higher AUCs (> 0.90) often integrated multiple variables, particularly clinical and laboratory data.

Algorithms

The studies included in the systematic review used various machine learning algorithms to predict mortality and other clinical outcomes. Logistic regression, random forest, and artificial neural networks (ANN) were the most frequently used algorithms. Less frequently used algorithms included Naïve Bayes, Support Vector Machines (SVM), and K-nearest neighbors (KNN). Other algorithms, such as Gradient Boosting Machine (GBM), XGBoost, and LightGBM, also appeared frequently among the studies.

Model predictions

The AUC values ranged from 0.512 to 0.99; five studies (6%) reported an AUC < 0.70. Among the studies with higher AUCs, the study by Díez-Sanmartín et al. (2023), which used an XGBoost model to predict mortality in patients on the waiting list for kidney transplantation, reported an AUC of 0.9925. Another study, which used logistic regression to predict mortality among patients hospitalized with diabetes and hypertension, obtained an AUC of 0.9727. Some studies showed lower AUCs among patients with diseases. For example, a study of patients with heart failure used a deep learning system based on a multi-head self-attention mechanism and obtained an AUC of 0.75 for predicting 365-day mortality28.

Studies that used datasets from the general population generally showed lower AUC values. A study involving a sample of more than 1 million participants showed an AUC of 0.74724, based on a variety of clinical characteristics as predictors. Another study, conducted with 2,291 healthy older adults aged ≥ 70, achieved an AUC of 0.51229, indicating poor capability to predict all-cause mortality.

Meta-analysis

Figures 3 and 4 show the meta-analysis results. The pooled AUC for all-cause mortality prediction was 0.831 (95% CI 0.797 to 0.865), with a heterogeneity of 100%. In the analysis of general population studies, regardless of disease status, the AUC was 0.824 (95% CI 0.729 to 0.920), with a heterogeneity of 100% (Figs. 5 and 6). Figures 7 and 8 show that the AUC was 0.833 (95% CI 0.813 to 0.854), with 99.7% heterogeneity, among studies carried out with participants with diseases or other health conditions. Results were also similar between high-income countries (0.831, 95% CI 0.788 to 0.874, I2 = 100%) and low- and middle-income countries (0.830, 95% CI 0.797 to 0.864, I2 = 98.0%) (Fig. 9). Figure 10 shows the pooled AUC values for all-cause mortality prediction (Table 2).

Fig. 3. Forest plot of individual study AUC values and pooled meta-analysis results. Forest plot displaying AUC values for each included study with 95% confidence intervals and the overall pooled estimate using a random-effects model (AUC 0.831, 95% CI 0.797–0.865, I2: 100%).

Fig. 4. Distribution of AUC values across all included studies. Violin plot showing the distribution of AUC values from individual studies with the overall meta-analysis result indicated by the central line.

Fig. 5. Forest plot of AUC values for general population studies. Forest plot showing AUC values and 95% confidence intervals for studies conducted in general population cohorts (n = 17) with pooled estimate (AUC 0.824, 95% CI 0.729–0.920, I2: 100%).

Fig. 6. Distribution of AUC values in general population studies. Violin plot displaying the distribution of AUC values specifically for studies conducted in general population samples with the pooled estimate indicated.

Fig. 7. Forest plot of AUC values for disease-specific population studies. Forest plot showing AUC values and 95% confidence intervals for studies conducted in disease-specific populations (n = 71) with pooled estimate (AUC 0.833, 95% CI 0.813–0.854, I2: 99.7%).

Fig. 8. Distribution of AUC values in disease-specific population studies. Violin plot displaying the distribution of AUC values for studies conducted in disease-specific populations with the pooled estimate indicated.

Fig. 9. Forest plot comparing AUC values by country income level. Forest plot comparing pooled AUC values between high-income countries (AUC 0.831, 95% CI 0.788–0.874) and low- and middle-income countries (AUC 0.830, 95% CI 0.797–0.864).

Fig. 10. Summary of pooled AUC values for all-cause mortality prediction. Bar chart showing pooled AUC values with 95% confidence intervals for the overall analysis and subgroup comparisons (general vs. disease-specific populations, and high-income vs. low/middle-income countries).

Table 2 Item-by-item quality assessment using TRIPOD + AI criteria (n = 88).

Subgroup analysis

Studies with smaller sample sizes (< 2000) showed a slightly higher pooled AUC (0.835, 95% CI 0.799–0.871) than larger studies (≥ 2000; 0.830, 95% CI 0.789–0.870), though both demonstrated extreme inter-study variability (I2 of 96.1% and 100%, respectively; Table 3). Regarding TRIPOD + AI scores, models from studies with lower methodological quality (< 35 points) performed marginally better (AUC 0.838; 95% CI 0.817–0.858) than those with higher quality (≥ 35 points; AUC 0.814; 95% CI 0.704–0.924), both with significant heterogeneity. Among model types, Linear/Statistical models demonstrated the highest performance (AUC 0.853; 95% CI 0.759–0.947), followed by Ensemble/Hybrid models (AUC 0.829; 95% CI 0.781–0.878), while Tree-based models (AUC 0.832; 95% CI 0.774–0.890), Neural Networks (AUC 0.823; 95% CI 0.793–0.854), and Other models (AUC 0.821; 95% CI 0.734–0.907) also showed strong but slightly lower discriminative ability. Regarding confidence interval imputation, studies with imputed confidence intervals showed higher performance (AUC 0.856; 95% CI 0.823–0.890) than non-imputed studies (AUC 0.815; 95% CI 0.768–0.862). All model categories and imputation groups showed extreme heterogeneity (I2 > 98%).

Table 3 Subgroup analysis results by study characteristics and model types.

Meta-regression

Meta-regression analysis identified study-level covariates significantly associated with AUC performance (Table 4). Studies with disease-specific populations demonstrated significantly higher AUC performance than general population studies (reference group) (β = − 0.192, p < 0.001). Studies with lower methodological quality scores (< 35 TRIPOD + AI points) showed higher AUC values (β = − 0.134, p < 0.001). Neural networks outperformed tree-based models (β = 0.136, p < 0.001). Studies with imputed confidence intervals showed significantly higher AUC values than those with non-imputed confidence intervals (β = 0.109, p = 0.009). Regarding mortality prevalence, neither studies with 20–39% mortality rates (β = 0.082, p = 0.058) nor those with ≥ 40% mortality rates (β = − 0.025, p = 0.732) showed significant differences in AUC performance compared to studies with 0–19% mortality rates. Country income level, sample size, linear/statistical models, ensemble/hybrid models, and other model types were not significantly associated with model performance.

Table 4 Meta-regression analysis of factors associated with AUC performance.

Risk of bias

TRIPOD + AI scores ranged from 23 to 45. Most studies clustered in the 31–33 range, as indicated by the histogram’s peak; scores of 31 and 33 were the most prevalent, with 18 and 14 studies, respectively, whereas other ranges, such as 23–24, 25–27, and 45, were less common, with only 1–3 studies each (Fig. 11 and Table 2).

Fig. 11. Distribution of TRIPOD + AI quality assessment scores. Histogram showing the distribution of TRIPOD + AI quality assessment scores across all 88 included studies, with scores ranging from 23 to 45 points and most studies clustering around 31–33 points.

Item-level analysis highlighted both strengths and limitations. Studies consistently reported items such as Title, Abstract, Background, Objectives, Data, Participants, Outcome, Predictors, Sample Size, Analytical Methods, Ethical Approval, Funding, Conflict of Interest, Model Development, and Interpretation, with 90–100% conformity. However, significant gaps were identified in several areas: only 6.5% of studies addressed Class Imbalance (8c), 9.8% reported Patient and Public Involvement (12f), 13% discussed Model Updating (18c), 16.3% provided Code Sharing (12e), and 18.5% addressed Data Sharing (12d). Additionally, Fairness (9a, 21.7%), Model Output (9b, 23.9%), Training vs. Evaluation (9c, 26.1%), Protocol (12c, 29.3%), and Usability in Current Care (18f, 30.4%) were frequently underreported.

Equity assessment

Our systematic analysis of equity-related reporting across the 88 included studies revealed substantial gaps in algorithmic fairness considerations (Table 5). The vast majority of studies (89.8%, n = 79) included no social determinants of health variables, with only 10.2% (n = 9) incorporating socioeconomic status, 3.4% (n = 3) including race/ethnicity, and 3.4% (n = 3) reporting education levels. Demographic diversity reporting was limited, with 86.4% (n = 76) of studies providing only basic age and sex data. No studies conducted stratified performance analysis by race/ethnicity or socioeconomic status, and only 2.3% (n = 2) performed sex-stratified analysis. External validation in diverse populations was rare, with 92.0% (n = 81) of studies relying solely on internal validation methods.

Table 5 Assessment of equity-related reporting across included studies.

Discussion

This systematic review found critical equity gaps that may limit the potential of machine learning models to predict all-cause mortality, especially for public health evaluation and deployment, given the risk that these models perpetuate social disparities. Most studies (89.8%) excluded social determinants such as race, education, and income, and none conducted racial/ethnic or socioeconomic subgroup performance analyses. Machine learning models demonstrated high overall performance across diverse populations and economic contexts. However, extreme heterogeneity indicates highly context-dependent results requiring local validation and effectiveness assessment before implementation. Performance was comparable between general population and disease-specific studies. Nonetheless, this finding has limited generalizability given that only 19% of studies included general population samples.

Our systematic equity assessment showed a lack of algorithmic fairness considerations across the included studies, which represents a fundamental limitation for clinical and public health implementation. The underrepresentation of social determinants variables may reflect multiple barriers, including limited data availability in electronic health records, prioritization of clinical over social variables in model development pipelines, and potential implicit bias in variable selection processes. The absence of race/ethnicity, education, and socioeconomic status as predictors is particularly concerning given their well-established associations with mortality risk and health care accessibility. To address these limitations, we recommend establishing minimum reporting standards for equity variables in future machine learning studies, including mandatory documentation of: (1) the availability and inclusion of social determinants variables, (2) the demographic representativeness of training datasets, and (3) stratified model performance across relevant demographic and socioeconomic subgroups.

Machine learning models achieved comparable performance in disease-specific and general populations, suggesting broad applicability through different mechanisms. While disease-specific populations offer standardized clinical trajectories and stronger biomarkers (e.g., NT-proBNP and troponin in heart failure patients30), general populations present multiple mortality pathways. Our findings indicate that well-designed algorithms can effectively handle this heterogeneity, contradicting our initial hypothesis that disease-specific contexts would yield superior predictive performance. This suggests machine learning’s robustness across diverse population contexts when models are properly developed.

The comparable performance across different population types represents an important methodological insight for developing universal prediction models. While Shah et al. (2019) noted the challenges of heterogeneity in clinical prediction modeling31, our results suggest that these challenges can be overcome with robust modeling strategies such as ensemble methods that can handle diverse risk patterns, comprehensive feature engineering that captures population-specific risk factors, and stratified validation approaches that ensure consistent performance across different demographic groups.

The lack of consistency in predictor variables significantly affects the generalizability and equity of machine learning models for all-cause mortality prediction. Demographic (97.73%) and clinical (88.64%) variables were most commonly used, reflecting their accessibility and strong mortality associations. Laboratory (47.73%) and imaging (20.45%) variables were less frequent, with imaging predominantly used in high-income countries due to advanced infrastructure (e.g., Siegersma et al., 2022, achieved AUC 0.96 with ECGs26). Socioeconomic and behavioral factors were underutilized (27.27% of studies) despite their relevance to health equity. This predictor variability contributes to the observed heterogeneity and raises concerns about model applicability in different settings. Future studies should adopt standardized minimum reporting requirements, including sociodemographic variables and standardized morbidity indices. Establishing consensus-based predictor sets for different population types (general vs. disease-specific) would improve comparability and facilitate model validation across settings.

Country-income-stratified analysis revealed similar performance of mortality prediction models across economic contexts. Models from high-income countries achieved virtually equal AUC values to those from low- and middle-income countries, both above 0.8. While high-income countries more commonly conduct large-scale studies with extended follow-up (e.g., UK Biobank32), our results show that studies from low- and middle-income countries achieve comparable performance. The specific factors contributing to this similarity require further investigation, given the heterogeneity found among studies32.

To maintain and further strengthen the observed equity in model performance across different economic contexts, we propose additional strategies and techniques associated with equity-focused analysis. First, establishing data-sharing frameworks between institutions in high-, middle-, and low-income countries could improve the representativeness and size of datasets across all contexts. Recent initiatives have demonstrated the feasibility of standardized global data collection32, but there is room to improve equitable global participation and representativeness. Second, model transfer techniques could adapt models developed in one context to another by adjusting for local epidemiological characteristics. Wiens et al.33 demonstrated that transfer learning approaches can successfully adapt clinical prediction models across hospitals with varying resources and patient populations. Additionally, federated machine learning can enable collaborative model development without direct sharing of sensitive data between institutions34. Chen et al.35 proposed a framework for evaluating algorithmic fairness in clinical prediction models that could be adapted for all-cause mortality prediction. Such models should be evaluated not only for overall performance but also for equitable performance across population subgroups36.

The substantial heterogeneity observed across all analyses represents our most significant finding, suggesting considerable variation in model development and study methodology despite similar overall performance metrics. This heterogeneity may be due to differences in predictor variables, modeling approaches, local health system contexts, and mortality follow-up periods. While subgroup analysis showed overlapping confidence intervals between disease-specific (AUC 0.833) and general populations (AUC 0.824), meta-regression showed a statistically significant advantage for disease-specific populations, suggesting that although the mean difference is modest and descriptively similar, it is statistically consistent across studies.

The high degree of heterogeneity suggests that model performance is heavily influenced by context. Health professionals and stakeholders cannot rely on models remaining accurate across different populations or healthcare environments. While high heterogeneity is commonly observed in meta-analyses including large numbers of studies with diverse methodologies and populations37,38,39,40, this variation suggests that local factors substantially influence model performance. The finding that lower-quality studies reported higher performance suggests potential overfitting and publication bias. For clinical and public health implementation, local validation is required regardless of reported performance.

Studies with lower TRIPOD + AI scores achieved higher AUC values, a counterintuitive finding that may reflect publication bias and overfitting in studies with less rigorous validation strategies. This pattern suggests that reported high performance may partly result from lower methodological rigor. With only 8% of studies employing external validation, studies of higher methodological quality may report more conservative but more realistic performance estimates through proper validation processes. Model performance from poorly reported studies should therefore be interpreted carefully, as inadequate validation and potential overfitting may lead to overestimation of true clinical performance. This finding highlights that rigorous study design and validation strategies should be prioritized from inception rather than evaluated post hoc, and suggests the need for mandatory external validation and transparent reporting in future ML mortality prediction studies. The quality-performance paradox observed emphasizes that stringent external validation is essential before any implementation.

Neural networks presented higher performance than tree-based models according to the meta-regression analysis. Deep learning models may better capture complex nonlinear relationships to predict mortality. The superior performance of neural networks may reflect their ability to handle complex interactions between multiple mortality risk factors simultaneously, including unexpected data patterns that traditional algorithms might miss. However, higher performance must be balanced against interpretability concerns in clinical settings. Considering the slight (albeit statistically significant) difference, simpler models are the better choice in clinical and health system settings. For implementation, we recommend prioritizing interpretable models (tree-based, logistic regression) when the performance difference is modest (< 0.03 AUC)41, especially in resource-constrained settings where model transparency is relevant for health professionals’ acceptance and regulatory approval. Practical implementation in resource-limited settings faces challenges. However, our findings show comparable performance between LMICs and high-income countries (AUC 0.830 vs. 0.831), suggesting that implementation is feasible. Resource-constrained settings should prioritize simpler, interpretable models requiring minimal computational power and enabling local validation by clinical staff.

In general, our results indicate good performance across different algorithms. The quality of data, pre-processing techniques, and representativeness of the target population seem to be more important than the algorithm itself42,43. In this scenario, tree-based models, while showing slightly lower discriminative performance in our analysis, offer greater interpretability through their decision process and feature importance. Thus, greater algorithmic transparency, especially regarding how models make decisions, increases the relevance of simpler models (such as tree-based approaches) for mortality prediction and effectiveness evaluation. Healthcare regulatory frameworks should emphasize model interpretability requirements for mortality prediction systems, particularly given that tree-based models offer substantial clinical advantages through transparent decision pathways while maintaining comparable performance to more complex approaches, as demonstrated by the modest performance differences observed in our analysis and supported by interpretability research in high-stakes medical applications44.

An important consideration in the implementation of machine learning models is the trade-off between interpretability and predictive performance. Although neural networks achieved statistically higher performance than tree-based models, the observed difference was marginal, with a mean gain of less than 0.02 AUC points. Tree-based models offer substantial advantages in clinical settings through their inherent interpretability, allowing clinicians to understand decision pathways and feature importance rankings44,45. This transparency is relevant for model adoption, particularly in decisions where model errors can have severe consequences44. Given the comparable performance across model types and the critical importance of interpretability in public health applications46, simpler, more transparent models may be preferable for routine implementation despite marginally lower discriminative performance. Regulatory approval and clinical adoption may favor transparent models that enable audit trails and staff understanding.

Although populations and economic settings differ substantially, our analysis indicates that machine learning models for mortality prediction reach comparable levels of discriminative accuracy, but often rely on distinct mechanisms to do so. This observation of “multiple routes to success” highlights the need for closer investigation. Future research could explore which specific pre-processing steps, algorithms, or predictor variable selections contribute to model performance in different contexts.

The substantial number of publications examined and the adherence to methodological guidelines such as PROSPERO, PRISMA, and TRIPOD + AI are two of this study’s strengths. However, certain limitations exist. The significant heterogeneity found suggests a great deal of variation in the populations and methods examined, which may affect how broadly applicable our findings are.

An important methodological limitation of our review is the insufficient external validation among the included studies. Only a minority of the 88 analyzed studies employed independent external datasets to evaluate the generalizability of their models. Most studies relied on internal validation (such as cross-validation or train-test splits within the same population), which may overestimate model performance and limit clinical applicability. This absence of robust external validation represents a significant gap in the field, as models that demonstrate high performance in internal validation frequently show substantial performance degradation when applied to different populations or clinical contexts47.

Only 19% of studies included general population samples, which can be considered a generalizability limitation. Disease-specific models cannot be extrapolated to populations with diverse risk profiles and heterogeneous clinical trajectories. Models developed in patients with specific diseases can rely on well-defined clinical pathways and established biomarkers; in contrast, general population prediction must identify risk among predominantly healthy individuals presenting different and interconnected risk factors. Disease-specific datasets combined with limited external validation (8% of studies) restrict transferability to population-level applications. This sampling bias limits applicability to public health surveillance and national screening programs.

Precision estimates may have been biased by our use of literature-based estimation techniques to handle missing confidence intervals, which may affect heterogeneity assessments and pooled estimates. By inflating apparent consistency or exaggerating heterogeneity, imputed confidence intervals can produce artificially narrow or wide precision estimates, compromising the validity of our meta-analytic findings.

Our analysis was limited by insufficient data on social and behavioral variables, including education, race/ethnicity, and housing conditions. Future studies should prioritize including these social determinants to improve model accuracy and equity. Equity-focused reporting should be mandatory, with fairness assessments required before clinical implementation. We recommend alignment with established AI ethics guidelines to prevent perpetuating healthcare disparities through algorithmic bias. Funding agencies, academic journals, and regulatory bodies should integrate these equity-focused reporting requirements into research funding criteria, manuscript submission guidelines, and clinical approval processes to ensure systemic implementation across the field. Researchers should conduct stratified performance analysis across demographic subgroups, healthcare systems must require local validation before model deployment, and all stakeholders should establish monitoring systems to track equity outcomes post-implementation.

Thus, our analysis also highlights the potential for biases in training data to perpetuate historical and systemic inequities through the use of these models in health settings, particularly when marginalized populations are underrepresented in training datasets. The similar performance across different economic contexts does not necessarily represent equitable models for all demographic groups within these contexts. Most of the analyzed studies did not report model performance stratified by race/ethnicity, gender, or socioeconomic status, leaving potential disparities undetected. Algorithmic fairness requires intentional design choices and comprehensive evaluation across diverse subpopulations36,48, which should be prioritized in future studies before clinical implementation and deployment.

In conclusion, this meta-analysis showed the potential of machine learning models to predict all-cause mortality across diverse populations and economic contexts. However, the findings derive predominantly from disease-specific populations (81% of studies), with limited evidence for general population applicability. The substantial heterogeneity across all analyses indicates a high degree of variation among studies. Future research should prioritize the development of models specifically for general populations, standardized reporting, inclusive data collection that incorporates social determinants of health such as race/ethnicity and socioeconomic status, and rigorous external validation across diverse populations. Finally, the low percentage of studies with external validation requires caution regarding the generalizability and implementation of these results.