Abstract
We aimed to review the literature on the performance of machine learning models to predict all-cause mortality. The systematic review was registered in PROSPERO (CRD42023476567) and followed PRISMA guidelines. Searches were conducted in the PubMed, LILACS, Web of Science, and Scopus databases. Studies predicting all-cause mortality using machine learning were analyzed with random-effects models, with heterogeneity assessed using the I² statistic and quality evaluated using TRIPOD + AI. The meta-analysis included 88 studies, most of them from the United States (n = 25) and China (n = 20). The overall pooled AUC was 0.831 (95% CI 0.797–0.865), with extreme heterogeneity (I² = 100%). Most studies (89.8%) included no social variables in their models. Subgroup analysis showed similar performance between general population studies and disease-specific populations, and models from high-income countries performed similarly to those from low- and middle-income countries. Meta-regression identified covariates that affected the results: algorithm, population type, study quality score, and confidence interval imputation. Equity-oriented subgroup analyses (< 10% of studies) and external validation in other datasets (8.0%) were scarce. Overall, machine learning models showed high performance in predicting all-cause mortality, but the review also highlighted equity gaps. These limitations reduce the potential for public health evaluation and deployment, given the risk of perpetuating social disparities. The extreme heterogeneity indicates highly context-dependent performance requiring local validation before implementation.
Introduction
Recent technological advancements due to improvements in computing power have allowed the collection of large volumes of data1,2. In health, these innovations present an opportunity to increase the accuracy of outcome predictions3,4, such as all-cause mortality, a complex outcome in public health, especially given its multifactorial nature5,6,7. Predictive models for all-cause mortality could help provide better care for individuals and populations, chiefly by supporting approaches to prevent premature mortality and events associated with poor disease management, which can lead to avoidable deaths. Machine learning models offer new perspectives for predicting all-cause mortality8,9,10,11, given their ability to handle complex relationships among variables. However, the performance of these models can be influenced by the context and the representativeness of the available data.
Furthermore, unbalanced classification is a challenging issue in machine learning12; it occurs when the class distribution departs substantially from 50%, a common situation in epidemiological studies assessing mortality. In particular, predicting mortality in general population samples, which can include both “healthy” individuals and those with diseases, tends to present additional challenges. The variability in results and the complexity of developing machine learning models for all-cause mortality highlight the need for a comprehensive analysis to better understand the state of the art of these models. Two previous systematic reviews demonstrated that machine learning has potential for predicting chronic diseases and obesity13,14, despite the limitations observed.
Systematic reviews and meta-analyses can provide valuable insights into the effectiveness of predictive models from a global health equity perspective. Identifying factors that influence predictive performance across different populations and economic contexts may be particularly valuable for improving the global applicability of these models, as suggested by recent ethical frameworks for AI in health15. In this study, we aimed to review the literature on the performance of machine learning models to predict all-cause mortality and to synthesize these results through a meta-analysis.
Methods
We carried out a systematic review and meta-analysis, which was registered in the PROSPERO repository (CRD42023476567), and was conducted following the recommendations of the PRISMA 2020 statement16.
Search strategy
We searched the following databases on October 24, 2023: PubMed, LILACS, Web of Science, and Scopus. No restrictions were imposed on year of publication, country, or language. For studies identified in languages other than Portuguese, English, or Spanish, we used Google Translate for translation.
Two groups of keywords were combined with the Boolean operators ‘OR’ and ‘AND’, respecting the specificities of each database. Whenever possible, we used the Medical Subject Headings (MeSH) Major Topic; when a database did not support this, the searches covered article titles. We used filters for original articles when available.
The following keywords were used in the searches: Machine Learning [MeSH] OR Supervised Machine Learning [MeSH] OR Prediction models OR Prediction OR Predictive OR Predict OR Classification OR ML OR Artificial Intelligence [MeSH] OR Natural Language Processing [MeSH] OR Neural Networks, Computer [MeSH] OR Support Vector Machine [MeSH] OR Naive Bayes OR Bayesian learning OR Logistic Models [MeSH] OR Neural network OR Neural networks OR Natural language processing OR Support vector* OR Random forest* OR Boosting OR XGBoost OR Deep learning [MeSH] AND Death [MeSH] OR Mortality [MeSH] OR All-cause mortality. The terms and strategy were adapted to the specifics of each database, and the complete search strategy is available in Supplementary Table 1.
Inclusion criteria
We included studies that met three criteria: they predicted all-cause mortality as a binary outcome using any machine learning model; involved adults or older adults; and reported results as the Area Under the Curve (AUC) or as true positives, true negatives, false positives, and false negatives (from which sensitivity and specificity can be calculated).
Exclusion criteria
We excluded studies that predicted specific mortality (e.g., death by cardiovascular diseases or accidents), studies with children and adolescents, studies with animals, and studies without information to be included in the meta-analysis.
Literature screening and data extraction
The selection of studies was conducted independently by two reviewers (KAM and AAV), and disagreements were resolved by a third reviewer (FMD). The process began with reading the titles and abstracts of the articles using the Rayyan platform. The second stage involved reading the articles in full and then reviewing the references of the included articles to identify any additional studies. From each included study, we extracted the year of publication, location where the study was carried out, sample characteristics, predictor variables, outcome, models used, AUC results, and the best-performing model.
For analytical purposes, studies were classified into two categories based on their sampling approach: (1) general population cohorts, i.e., studies recruiting participants from community-based settings regardless of disease status (including population registries, health surveys, or community screening programs); and (2) disease-specific cohorts, i.e., studies recruiting participants based on specific medical conditions or clinical settings (e.g., patients with heart failure, hospitalized patients, or disease-specific registries).
Quality assessment
To evaluate individual risk of bias, we utilized an adapted version of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD + AI)17. This updated checklist consists of 27 items covering various aspects of study reporting, spanning title, abstract, introduction, methods, open science, patient and public involvement, results, and discussion. Each item is scored, and the scores are summed to a total that can reach 52 points, based on the inclusion of these essential reporting elements. Four reviewers independently completed the TRIPOD + AI checklist.
Diagnostic criteria
The outcome was all-cause mortality, considered when a study evaluated overall mortality rather than specific causes (e.g., cardiovascular mortality). This outcome was chosen because of the need to understand whether machine learning can predict it well, considering that all-cause mortality can occur for many different reasons, and to identify which characteristics influence its occurrence.
Statistical analysis
We conducted a meta-analysis of the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). We chose the AUC because it comprehensively evaluates how well models discriminate between positive outcomes (mortality) and negative outcomes (survival)18. AUC values range from 0 to 1: a value of 1 signifies a model that perfectly differentiates the two classes, while 0.5 indicates performance equivalent to random guessing18. AUC values above 0.7 are generally considered to indicate good predictive performance.
We collected each study’s AUC value, confidence interval, and standard error (SE), based on the model that performed best in the test set or external validation. For studies that reported confidence intervals instead of SEs, we calculated the SE as SE = (upper limit − lower limit)/3.92, following the Cochrane Handbook19. For studies lacking confidence intervals, we estimated them using the method of Hanley and McNeil20, which derives the SE from the AUC value and the sample sizes under the assumption of a binomial distribution. We assumed a 95% confidence level and used the relationship between AUC, sample size, and standard error. Sensitivity analyses compared meta-analytic results with and without imputed confidence intervals to assess the impact of this assumption on pooled estimates and heterogeneity measures20.
In addition to the overall meta-analysis, we carried out subgroup analyses: (1) general population vs. people with specific diseases or conditions; (2) high-income vs. low- and middle-income countries, based on the World Bank classification; (3) sample size: fewer than 2000 vs. 2000 or more; (4) TRIPOD + AI: fewer than 35 points vs. 35 or more; (5) models: Tree-based and Tree Ensemble vs. Neural Networks vs. Linear/Statistical vs. Ensemble/Hybrid Models vs. other models. Model categories were defined as follows: (1) Tree-based models: individual decision trees and random forests when used as single algorithms; (2) Neural Networks: all artificial neural network architectures, including deep learning, convolutional networks, and multilayer perceptrons; (3) Linear/Statistical: logistic regression, Cox regression, and linear discriminant analysis; (4) Ensemble/Hybrid models: combinations of multiple algorithms, including XGBoost, LightGBM, and stacked approaches; (5) Other models: single-use algorithms, including Disease Severity Models (DSM), a Deep Learning System with Multi-head Self-attention Mechanism (DLS-MSM), the ICD-based Injury Severity Score (ICISS), Support Vector Machines, and Bayesian Networks. The “Other” category comprised algorithms used by only one study each, limiting statistical power for meaningful comparisons.
Results are presented as AUCs from a random-effects model with 95% confidence intervals (95% CI). Heterogeneity was assessed using the I² statistic, with values above 75% considered high heterogeneity21,22, meaning the studies differ substantially from one another and the combined results should be interpreted with caution.
The meta-analyses were conducted in Python, via Google Colab, using the Numpy, Pandas, Statsmodels, Matplotlib, and Seaborn libraries. When the same study evaluated different all-cause mortality follow-up periods, we considered the longest period for the meta-analysis. Whenever studies provided such information, we collected the results from the test set or the external validation dataset. The code is available at: https://github.com/fmdsocial/reviewsci/blob/main/Revis%C3%A3o_PDJ_Corre%C3%A7%C3%B5es_28_04_2025.ipynb.
To explore sources of heterogeneity, we conducted univariate meta-regression analyses using weighted least squares to examine the relationship between study characteristics and AUC performance. The moderator variables examined were country income level (high-income vs. low/middle-income), population type (general vs. disease-specific), sample size (≥ 2000 vs. < 2000 participants), TRIPOD + AI quality score (≥ 35 vs. < 35 points), confidence interval imputation (non-imputed vs. imputed), outcome prevalence (0–19%, 20–39%, and ≥ 40%), and machine learning model type (with tree-based models as the reference category), using the model categories defined above.
Results
Study selection
From the four databases, 33,550 studies were identified, and 17,604 remained after removal of duplicates. Of these, 926 were selected by the reviewers based on the inclusion and exclusion criteria. The final step, full-text data extraction, resulted in 88 studies being included in the present review (Fig. 1). The main reasons for exclusion were outcomes other than all-cause mortality, conference abstracts, and articles not reporting AUC metrics.
Study characteristics
The global distribution of studies is shown in Fig. 2. The United States led with 25 articles, followed by China (n = 20), Sweden (n = 7), and Taiwan (n = 6). The United Kingdom, Italy, Japan, the Netherlands, and South Korea had between three and five articles each, while Finland and Spain had two each. Some studies combined databases from different countries (n = 7). In total, 51 studies (58%) were carried out in high-income countries.
The studies included in the systematic review showed considerable variation in sample size, ranging from 148 to 1,264,000 participants (Table 1). Supplementary Tables 2 and 3 detail the included studies according to the type of outcome. The study with the smallest sample was conducted in Sweden23, with 148 patients discharged from an emergency department. The study with the largest sample was conducted in the United States, involving 1,264,000 participants from a synthetic dataset24. The mean number of participants was approximately 55,840, while the median was 8,367. Of the 88 studies, 17 (19%) included samples from the general population, with participants with and without disease; the rest used databases of participants with specific diseases or health conditions.
Qualitative synthesis of predictor variables
Across the 88 studies included in this systematic review, a diverse range of predictor variables was evaluated to predict all-cause mortality, categorized into demographic, clinical, laboratory, imaging, and socioeconomic/behavioral groups. Demographic variables, such as age and gender/sex, appeared in 97.73% (86/88) of studies. Clinical characteristics, including comorbidities (e.g., diabetes, hypertension, cardiovascular disease), vital signs, and medical history, were also highly prevalent, used in 88.64% (78/88) of studies. These variables were prevalent in disease-specific cohorts, where they contributed to high predictive performance, as seen in Díez-Sanmartín et al.25, which achieved an AUC of 0.99 using XGBoost for kidney transplant patients.
Laboratory biomarkers, such as glucose, cholesterol, troponin, and NT-proBNP, were included in 47.73% (42/88) of studies, with a higher prevalence in studies of chronic diseases (e.g., Takahama et al., 2023, AUC 0.87 using LightGBM for heart failure patients8). Imaging data, including ECGs, echocardiograms, and chest radiographs, were less common, appearing in 20.45% (18/88) of studies, predominantly in high-income countries (HICs) with advanced diagnostic infrastructure (e.g., Siegersma et al.26, AUC 0.96 using deep neural networks with ECGs). Socioeconomic and behavioral factors, such as education, smoking, alcohol consumption, and social support, were the least frequently used, appearing in 27.27% (24/88) of studies. Studies achieving higher AUCs (> 0.90) often integrated multiple variables, particularly clinical and laboratory data.
Algorithms
The included studies utilized various machine learning algorithms to predict mortality and other clinical outcomes. Logistic regression, random forest, and artificial neural networks (ANN) were the most frequently used algorithms. Less frequently used algorithms included Naïve Bayes, Support Vector Machine (SVM), and K-nearest neighbors (KNN). Gradient boosting algorithms, such as Gradient Boosting Machine (GBM), XGBoost, and LightGBM, also appeared frequently among the studies.
Model predictions
AUC values ranged from 0.512 to 0.99; five studies (6%) reported an AUC below 0.70. Among the studies with the highest AUCs, Díez-Sanmartín et al.25 used an XGBoost model to predict mortality in patients on the kidney transplant waiting list and reported an AUC of 0.99. Another study27, which used logistic regression to predict mortality among patients hospitalized with diabetes and hypertension, obtained an AUC of 0.97. Some studies showed lower AUCs among patients with diseases; for example, a study of patients with heart failure used a deep learning system based on a multi-head self-attention mechanism and obtained an AUC of 0.75 for predicting 365-day mortality28.
Studies that used datasets from the general population generally showed lower AUC values. A study involving a sample of more than 1 million participants24 reported an AUC of 0.747, based on a variety of clinical characteristics as predictors. Another study, conducted with 2,291 healthy older adults aged ≥ 70 years29, achieved an AUC of 0.512, indicating poor capability to predict all-cause mortality.
Meta-analysis
Figures 3 and 4 show the meta-analysis results. The pooled AUC for all-cause mortality prediction was 0.831 (95% CI 0.797 to 0.865), with 100% heterogeneity. In the analysis of general population studies, regardless of disease status, the AUC was 0.824 (95% CI 0.729 to 0.920), with 100% heterogeneity (Figs. 5 and 6). Figures 7 and 8 show that the AUC was 0.833 (95% CI 0.813 to 0.854), with 99.7% heterogeneity, among studies of participants with diseases or other health conditions. Results were also similar between high-income countries (0.831, 95% CI 0.788 to 0.874, I² = 100%) and low- and middle-income countries (0.830, 95% CI 0.797 to 0.864, I² = 98.0%) (Fig. 9). Figure 10 shows pooled AUC values for all-cause mortality prediction (Table 2).
Subgroup analysis
Studies with smaller sample sizes (< 2000) showed a slightly higher pooled AUC (0.835, 95% CI 0.799–0.871) than larger studies (≥ 2000; 0.830, 95% CI 0.789–0.870), though both demonstrated extreme inter-study variability (I² of 96.1% and 100%, respectively; Table 3). Regarding TRIPOD + AI scores, models with lower methodological quality (< 35 points) performed marginally better (AUC 0.838; 95% CI 0.817–0.858) than those with higher quality (≥ 35 points; AUC 0.814, 95% CI 0.704–0.924), both with significant heterogeneity. Among model types, Linear/Statistical models demonstrated the highest performance (AUC 0.853; 95% CI 0.759–0.947), followed by Ensemble/Hybrid models (AUC 0.829; 95% CI 0.781–0.878), while Tree-based models (AUC 0.832; 95% CI 0.774–0.890), Neural Networks (AUC 0.823; 95% CI 0.793–0.854), and Other models (AUC 0.821; 95% CI 0.734–0.907) also showed strong but slightly lower discriminative ability. Studies with imputed confidence intervals showed higher performance (AUC 0.856; 95% CI 0.823–0.890) than non-imputed studies (AUC 0.815; 95% CI 0.768–0.862). All model categories and imputation groups showed extreme heterogeneity (I² > 98%).
Meta-regression
Meta-regression analysis identified significant study-level covariates associated with AUC performance (Table 4). Studies with disease-specific populations demonstrated significantly higher AUC performance compared to general population studies (reference group) (β = − 0.192, p < 0.001). Studies with lower methodological quality scores (< 35 TRIPOD + AI points) showed higher AUC values (β = − 0.134, p < 0.001). Neural networks outperformed tree-based models (β = 0.136, p < 0.001). Studies with imputed confidence intervals showed significantly higher AUC values than those with non-imputed confidence intervals (β = 0.109, p = 0.009). Regarding mortality prevalence, neither studies with 20–39% mortality rates (β = 0.082, p = 0.058) nor those with ≥ 40% mortality rates (β = − 0.025, p = 0.732) differed significantly in AUC performance from studies with 0–19% mortality rates. Country income level, sample size, linear/statistical models, ensemble/hybrid models, and other model types were not significantly associated with model performance.
Risk of bias
The TRIPOD + AI scores ranged from 23 to 45. The most frequent scores fell in the 31–33 range, where a significant number of studies clustered, as indicated by the histogram’s peak. Specifically, scores of 31 and 33 were notably prevalent, with 18 and 14 studies, respectively, whereas other ranges, such as 23–24, 25–27, and 45, were less common, with only 1–3 studies each (Fig. 11 and Table 2).
Item-level analysis highlighted both strengths and limitations. Studies consistently reported items such as Title, Abstract, Background, Objectives, Data, Participants, Outcome, Predictors, Sample Size, Analytical Methods, Ethical Approval, Funding, Conflict of Interest, Model Development, and Interpretation, with 90–100% conformity. However, significant gaps were identified in several areas: only 6.5% of studies addressed Class Imbalance (8c), 9.8% reported Patient and Public Involvement (12f), 13% discussed Model Updating (18c), 16.3% provided Code Sharing (12e), and 18.5% addressed Data Sharing (12d). Additionally, Fairness (9a, 21.7%), Model Output (9b, 23.9%), Training vs. Evaluation (9c, 26.1%), Protocol (12c, 29.3%), and Usability in Current Care (18f, 30.4%) were frequently underreported.
Equity assessment
Our systematic analysis of equity-related reporting across the 88 included studies revealed substantial gaps in algorithmic fairness considerations (Table 5). The vast majority of studies (89.8%, n = 79) included no social determinants of health variables, with only 10.2% (n = 9) incorporating socioeconomic status, 3.4% (n = 3) including race/ethnicity, and 3.4% (n = 3) reporting education levels. Demographic diversity reporting was limited, with 86.4% (n = 76) of studies providing only basic age and sex data. No studies conducted stratified performance analysis by race/ethnicity or socioeconomic status, and only 2.3% (n = 2) performed sex-stratified analysis. External validation in diverse populations was rare, with 92.0% (n = 81) of studies relying solely on internal validation methods.
Discussion
This systematic review found critical equity gaps that reduce the potential of machine learning models to predict all-cause mortality, especially for public health evaluation and deployment, given the risk that these models perpetuate social disparities. Most studies (89.8%) excluded social determinants such as race, education, and income, and none conducted racial/ethnic or socioeconomic subgroup performance analyses. Machine learning models demonstrated high overall performance across diverse populations and economic contexts. However, extreme heterogeneity indicates highly context-dependent results requiring local validation and effectiveness assessment before implementation. Performance was comparable between general population and disease-specific studies; nonetheless, this finding has limited generalizability given that only 19% of studies included general population samples.
Our systematic equity assessment showed a lack of algorithmic fairness considerations across the included studies, which represents a fundamental limitation for clinical and public health implementation. The underrepresentation of social determinants variables may reflect multiple barriers, including limited data availability in electronic health records, prioritization of clinical over social variables in model development pipelines, and potential implicit bias in variable selection processes. The absence of race/ethnicity, education, and socioeconomic status as predictors is particularly concerning given their well-established associations with mortality risk and healthcare accessibility. To address these limitations, we recommend establishing minimum reporting standards for equity variables in future machine learning studies, including mandatory documentation of (1) the availability and inclusion of social determinants variables, (2) the demographic representativeness of training datasets, and (3) stratified model performance across relevant demographic and socioeconomic subgroups.
Machine learning models achieved comparable performance in disease-specific and general populations, suggesting broad applicability through different mechanisms. While disease-specific populations offer standardized clinical trajectories and stronger biomarkers (e.g., NT-proBNP and troponin in heart failure patients30), general populations present multiple mortality pathways. Our findings indicate that well-designed algorithms can effectively handle this heterogeneity, contradicting our initial hypothesis that disease-specific contexts would yield superior predictive performance. This suggests machine learning’s robustness across diverse population contexts when models are properly developed.
The comparable performance across different population types represents an important methodological insight for developing universal prediction models. While Shah et al. (2019) noted the challenges of heterogeneity in clinical prediction modeling31, our results suggest that these challenges can be overcome with robust modeling strategies such as ensemble methods that can handle diverse risk patterns, comprehensive feature engineering that captures population-specific risk factors, and stratified validation approaches that ensure consistent performance across different demographic groups.
The lack of consistency in predictor variables significantly affects the generalizability and equity of machine learning models for all-cause mortality prediction. Demographic (97.73%) and clinical (88.64%) variables were most commonly used, reflecting their accessibility and strong mortality associations. Laboratory (47.73%) and imaging (20.45%) variables were less frequent, with imaging predominantly used in high-income countries due to advanced infrastructure (e.g., Siegersma et al., 2022, achieved AUC 0.96 with ECGs26). Socioeconomic and behavioral factors were underutilized (27.27% of studies) despite their relevance to health equity. This predictor variability contributes to the observed heterogeneity and raises concerns about model applicability in different settings. Future studies should adopt standardized minimum reporting requirements, including sociodemographic variables and standardized morbidity indices. Establishing consensus-based predictor sets for different population types (general vs. disease-specific) would improve comparability and facilitate model validation across settings.
Country-income-stratified analysis revealed similar performance of mortality prediction models across economic contexts. Models from high-income countries achieved virtually identical AUC values to those from low- and middle-income countries, both above 0.8. While high-income countries more commonly conduct large-scale studies with extended follow-up (e.g., UK Biobank32), our results show that studies from low- and middle-income countries achieve comparable performance. The specific factors contributing to this similarity require further investigation, given the heterogeneity observed among studies32.
To maintain and further strengthen the observed equity in model performance across different economic contexts, we propose additional strategies and techniques associated with equity-focused analysis. First, establishing data-sharing frameworks between institutions in high-, middle-, and low-income countries could improve the representativeness and size of datasets across all contexts. Recent initiatives have demonstrated the feasibility of standardized global data collection32, but there is room to improve equitable global participation and representativeness. Second, model transfer techniques could adapt models developed in one context to another by adjusting for local epidemiological characteristics. Wiens et al.33 demonstrated that transfer learning approaches can successfully adapt clinical prediction models across hospitals with varying resources and patient populations. Additionally, federated machine learning can enable collaborative model development without direct sharing of sensitive data between institutions34. Chen et al.35 proposed a framework for evaluating algorithmic fairness in clinical prediction models that could be adapted for all-cause mortality prediction. Such models should be evaluated not only for overall performance but also for equitable performance across population subgroups36.
The substantial heterogeneity observed across all analyses represents our most significant finding, suggesting considerable variation in model development and study methodology despite similar overall performance metrics. This heterogeneity can be due to differences in predictor variables, modeling approaches, local health systems contexts, and mortality follow-up periods. While subgroup analysis showed overlapping confidence intervals between disease-specific (AUC 0.833) and general populations (AUC 0.824), meta-regression showed a statistically significant advantage for disease-specific populations, suggesting that although the mean difference is modest and descriptively similar, it is statistically consistent across studies.
The high degree of heterogeneity suggests that model performance is heavily context-dependent: health professionals and stakeholders cannot assume that models will remain accurate across different populations or healthcare environments. While high heterogeneity is commonly observed in meta-analyses that include large numbers of studies with diverse methodologies and populations37,38,39,40, this variation suggests that local factors significantly influence model performance. The finding that lower-quality studies reported higher performance suggests potential overfitting and publication bias. For clinical and public health implementation, local validation is required regardless of reported performance.
Studies with lower TRIPOD + AI scores achieved higher AUC values, a counterintuitive finding that may reflect publication bias and overfitting in studies with less rigorous validation strategies. This pattern suggests that high performance metrics may stem from lower methodological rigor. With only 8% of studies employing external validation, higher-quality studies may report more conservative but more realistic performance estimates through proper validation processes. Model performance reported by poorly documented studies should therefore be interpreted cautiously, as inadequate validation and overfitting can overestimate true clinical performance. This finding highlights that rigorous study design and validation strategies should be prioritized from inception rather than evaluated post hoc, and suggests the need for mandatory external validation and transparent reporting in future ML mortality prediction studies. The observed quality-performance paradox emphasizes that stringent external validation is essential before any implementation.
Neural networks showed higher performance than tree-based models in the meta-regression analysis. Deep learning models may better capture the complex nonlinear relationships involved in mortality prediction; their superior performance may reflect an ability to handle complex interactions among multiple mortality risk factors simultaneously, including unexpected data patterns that traditional algorithms might miss. However, higher performance must be balanced against interpretability concerns in clinical settings. Considering the slight (albeit statistically significant) difference, simpler models are often the better choice in clinical and health system settings. For implementation, we recommend prioritizing interpretable models (tree-based, logistic regression) when the performance difference is modest (< 0.03 AUC)41, especially in resource-constrained settings where model transparency is relevant for health professionals’ acceptance and regulatory approval. Practical implementation in resource-limited settings faces challenges; however, our findings show comparable performance between LMICs and high-income countries (AUC 0.830 vs. 0.831), suggesting that implementation is feasible. Resource-constrained settings should prioritize simpler, interpretable models that require minimal computational power and enable local validation by clinical staff.
In general, our results indicate good performance across different algorithms. The quality of the data, pre-processing techniques, and the representativeness of the target population seem to matter more than the algorithm itself42,43. Tree-based models, for example, while showing slightly lower discriminative performance in our analysis, offer greater interpretability through their decision process and feature importance measures. This higher algorithmic transparency, especially regarding how models make decisions, increases the relevance of simpler models (such as tree-based approaches) for mortality prediction and effectiveness evaluation. Healthcare regulatory frameworks should emphasize interpretability requirements for mortality prediction systems, particularly given that tree-based models offer substantial clinical advantages through transparent decision pathways while maintaining comparable performance to more complex approaches, as demonstrated by the modest performance differences observed in our analysis and supported by interpretability research in high-stakes medical applications44.
An important consideration in the implementation of machine learning models is the trade-off between interpretability and predictive performance. Although neural networks achieved statistically higher performance than tree-based models, the observed difference was marginal, with a mean gain of less than 0.02 AUC points. Tree-based models offer substantial advantages in clinical settings through their inherent interpretability, allowing clinicians to understand decision pathways and feature importance rankings44,45. This transparency is relevant for model adoption, particularly in decisions where model errors can have severe consequences44. Given the comparable performance across model types and the critical importance of interpretability in public health applications46, simpler, more transparent models may be preferable for routine implementation despite marginally lower discriminative performance. Regulatory approval and clinical adoption may favor transparent models that enable audit trails and staff understanding.
Although populations and economic settings differ substantially, our analysis indicates that machine learning models for mortality prediction reach comparable levels of discriminative accuracy, but often rely on distinct mechanisms to do so. This observation of “multiple routes to success” warrants closer investigation. Future research could explore which specific pre-processing steps, algorithms, or predictor variable selections contribute to model performance in different contexts.
The substantial number of publications examined and the adherence to methodological guidelines (PROSPERO, PRISMA, and TRIPOD + AI) are two of this study’s strengths. There are limitations, however. The significant heterogeneity suggests a great deal of variation in the populations and methods examined, which may affect how broadly applicable our findings are.
An important methodological limitation of our review is the insufficient external validation among the included studies. Only a minority of the 88 analyzed studies employed independent external datasets to evaluate the generalizability of their models. Most studies relied on internal validation (such as cross-validation or train-test splits within the same population), which may overestimate model performance and limit clinical applicability. This absence of robust external validation represents a significant gap in the field, as models that demonstrate high performance in internal validation frequently show a substantial shift when applied to different populations or clinical contexts47.
Only 19% of studies included general population samples, which can be considered a generalizability limitation. Disease-specific models cannot be extrapolated to populations with diverse risk profiles and heterogeneous clinical trajectories: models developed in patients with specific diseases can rely on well-defined clinical pathways and established biomarkers, whereas general population prediction must identify risk among predominantly healthy individuals presenting different, interconnected risk factors. Disease-specific datasets combined with limited external validation (8% of studies) restrict transferability to population-level applications. This sampling bias limits applicability to public health surveillance and national screening programs.
Precision estimates may have been biased by our use of literature-based estimation techniques to handle missing confidence intervals, which may affect both heterogeneity assessments and pooled estimates. Imputed confidence intervals can produce artificially narrow or wide precision estimates, inflating apparent consistency or exaggerating heterogeneity, which weakens the validity of our meta-analytic findings.
Our analysis was limited by insufficient data on social and behavioral variables, including education, race/ethnicity, and housing conditions. Future studies should prioritize including these social determinants to improve model accuracy and equity. Equity-focused reporting should be mandatory, with fairness assessments required before clinical implementation. We recommend alignment with established AI ethics guidelines to prevent perpetuating healthcare disparities through algorithmic bias. Funding agencies, academic journals, and regulatory bodies should integrate these equity-focused reporting requirements into research funding criteria, manuscript submission guidelines, and clinical approval processes to ensure systemic implementation across the field. Researchers should conduct stratified performance analysis across demographic subgroups, healthcare systems must require local validation before model deployment, and all stakeholders should establish monitoring systems to track equity outcomes post-implementation.
Thus, our analysis also highlights the potential for biases in training data to perpetuate historical and systemic inequities through the use of these models in health settings, particularly when marginalized populations are underrepresented in training datasets. The similar performance across different economic contexts does not necessarily represent equitable models for all demographic groups within these contexts. Most of the analyzed studies did not report model performance stratified by race/ethnicity, gender, or socioeconomic status, leaving potential disparities undetected. Algorithmic fairness requires intentional design choices and comprehensive evaluation across diverse subpopulations36,48, which should be prioritized in future studies before clinical implementation and deployment.
In conclusion, this meta-analysis showed the potential of machine learning models to predict all-cause mortality across diverse populations and economic contexts. However, findings derive predominantly from disease-specific populations (81% of studies), with limited evidence for general population applicability. The substantial heterogeneity across all analyses indicates a high degree of variation among studies. Future research should prioritize the development of models specifically for general populations, standardized reporting, inclusive data collection that incorporates social determinants of health such as race/ethnicity and socioeconomic status, and rigorous external validation across diverse populations. Finally, the low percentage of external validation requires caution in the generalizability of results and implementation.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Avinash, B. S., Srisupattarawanit, T. & Ostermeyer, H. Numerical methods for information tracking of noisy and non-smooth data in large-scale statistics. J. Eng. Res. Rep. https://doi.org/10.9734/jerr/2019/v6i416957 (2019).
Zhang, J. et al. Guest editorial learning from noisy multimedia data. IEEE Trans. Multimed https://doi.org/10.1109/TMM.2022.3159014 (2022).
Arain, Z., Iliodromiti, S., Slabaugh, G., David, A. L. & Chowdhury, T. T. Machine learning and disease prediction in obstetrics. Curr. Res. Physiol. https://doi.org/10.1016/j.crphys.2023.100099 (2023).
Veena, S., Sumanth Reddy, D., Lakshmi Kara, C. & Uday Kiran, K. A. Clinical outcome future prediction with decision tree & naive bayes models, in Advances in Science and Technology, Vol. 124. AST (2023).
Li, Y., Fan, X., Wei, L., Yang, K. & Jiao, M. The impact of high-risk lifestyle factors on all-cause mortality in the US non-communicable disease population. BMC Public Health 23, 422 (2023).
Taneri, P. E. et al. Association between ultra-processed food intake and all-cause mortality: A systematic review and meta-analysis. Am. J. Epidemiol. 191, 1323–1335 (2022).
Feng, X., Sarma, H., Seubsman, S. A., Sleigh, A. & Kelly, M. The impact of multimorbidity on all-cause mortality: A longitudinal study of 87,151 thai adults. Int. J. Public Health 68, 1606137 (2023).
Takahama, H. et al. Clinical application of artificial intelligence algorithm for prediction of one-year mortality in heart failure patients. Heart Vessels 38, 785–792 (2023).
Arostegui, I. et al. Combining statistical techniques to predict postsurgical risk of 1-year mortality for patients with colon cancer. Clin. Epidemiol. 10, 235–251 (2018).
Xiong, J. et al. A novel machine learning-based programmed cell death-related clinical diagnostic and prognostic model associated with immune infiltration in endometrial cancer. Front. Oncol. 13, 1224071 (2023).
Wang, G. et al. Machine learning-based models for predicting mortality and acute kidney injury in critical pulmonary embolism. BMC Cardiovasc. Disord. 23, 385 (2023).
Bacevicius, M. & Paulauskaite-Taraseviciene, A. Machine learning algorithms for raw and unbalanced intrusion detection data in a multi-class classification problem. Appl. Sci. (Switzerland) 13, 7328 (2023).
Delpino, F. M. et al. Machine learning for predicting chronic diseases: A systematic review. Public Health 205, 14–25 (2022).
Delpino, F. M. et al. Does machine learning have a high performance to predict obesity among adults and older adults? A systematic review and meta-analysis. Nutr. Metab. Cardiovasc. Dis. 34(9), 2034–2045 (2024).
Norori, N., Hu, Q., Aellen, F. M., Faraci, F. D. & Tzovara, A. Addressing bias in big data and AI for health care: A call for open science. Patterns https://doi.org/10.1016/j.patter.2021.100347 (2021).
Page, M. J. et al. PRISMA 2020 statement: Updated guidelines for reporting systematic reviews and meta-analyses. 26th Cochrane Colloquium, Santiago, Chile (2019).
Collins, G. S. et al. TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
Meurer, W. J. & Tolles, J. Logistic regression diagnostics understanding how well a model predicts outcomes. JAMA J. Am. Med. Assoc. https://doi.org/10.1001/jama.2016.20441 (2017).
Cochrane Handbook for Systematic Reviews of Interventions (version 5.1), Section 7.7.7.2: Obtaining standard errors from confidence intervals and P values: difference measures. https://handbook-5-1.cochrane.org/chapter_7/7_7_7_2_obtaining_standard_errors_from_confidence_intervals_and.htm.
Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods 1, 97–111 (2010).
Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R. Introduction to meta-analysis. Introd. Meta-Anal. https://doi.org/10.1002/9780470743386 (2009).
Heyman, E. T. et al. Improving machine learning 30-day mortality prediction by discounting surprising deaths. J. Emerg. Med. 61, 763–773 (2021).
Bergquist, T. et al. Evaluation of crowdsourced mortality prediction models as a framework for assessing artificial intelligence in medicine. J. Am. Med. Inform. Assoc. 31, 35–44 (2024).
Díez-Sanmartín, C., Cabezuelo, A. S. & Belmonte, A. A. A new approach to predicting mortality in dialysis patients using sociodemographic features based on artificial intelligence. Artif. Intell. Med. 136, 102478 (2023).
Siegersma, K. R. et al. Deep neural networks reveal novel sex-specific electrocardiographic features relevant for mortality risk. Eur. Heart J. Digit. Health 3, 245–254 (2022).
Barsasella, D. et al. A machine learning model to predict length of stay and mortality among diabetes and hypertension inpatients. Medicina (Kaunas) 58, 1568 (2022).
Li, D., Fu, J., Zhao, J., Qin, J. & Zhang, L. A deep learning system for heart failure mortality prediction. PLoS ONE 18, e0276835 (2023).
Tedesco, S. et al. Comparison of machine learning techniques for mortality prediction in a prospective cohort of older adults. Int. J. Environ. Res. Public Health 18, 12806 (2021).
Tang, W. H. W. et al. Prognostic value of baseline and changes in circulating soluble ST2 levels and the effects of nesiritide in acute decompensated heart failure. JACC Heart Fail. 4, 68–77 (2016).
Shah, N. D., Steyerberg, E. W. & Kent, D. M. Big data and predictive analytics: Recalibrating expectations. JAMA J. Am. Med. Assoc. https://doi.org/10.1001/jama.2018.56024 (2018).
Zerillo, J. A. et al. An international collaborative standardizing a comprehensive patient-centered outcomes measurement set for colorectal cancer. JAMA Oncol https://doi.org/10.1001/jamaoncol.2017.0417 (2017).
Wiens, J., Guttag, J. & Horvitz, E. Patient risk stratification with time-varying parameters: A multitask learning approach. J. Mach. Learn. Res. 17, 1–23 (2016).
Rieke, N. et al. The future of digital health with federated learning. NPJ Digit. Med. 3, 119 (2020).
Chen, I. Y., Szolovits, P. & Ghassemi, M. Can AI help reduce disparities in general medical and mental health care?. AMA J. Ethics 21, 167–179 (2019).
Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G. & Chin, M. H. Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 169, 866–872 (2018).
Inthout, J., Ioannidis, J. P. A., Borm, G. F. & Goeman, J. J. Small studies are more heterogeneous than large ones: A meta-meta-analysis. J. Clin. Epidemiol. 68, 860–869 (2015).
Resche-Rigon, M., White, I. R., Bartlett, J. W., Peters, S. A. E. & Thompson, S. G. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Stat. Med. 32, 4890 (2013).
Chen, X. et al. Serological evidence of human infection with SARS-CoV-2: a systematic review and meta-analysis. Lancet Glob. Health 9, e598 (2021).
Xie, Z., Ding, J., Jiao, J., Tang, S. & Huang, C. Screening instruments for early identification of unmet palliative care needs: a systematic review and meta-analysis. BMJ Support Palliat. Care 14, 256 (2024).
Issitt, R. W. et al. Classification performance of neural networks versus logistic regression models: Evidence from healthcare practice. Cureus 14, e22443 (2022).
Sculley, D. et al. Hidden technical debt in machine learning systems, in Advances in Neural Information Processing Systems, Vol. 2015-January (2015).
Mohammed, S. et al. The effects of data quality on machine learning performance on tabular data. Inf. Syst. 132, 102549 (2025).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. https://doi.org/10.1038/s42256-019-0048-x (2019).
Rudin, C. et al. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 16, 1–85 (2022).
Amann, J., Blasimme, A., Vayena, E., Frey, D. & Madai, V. I. Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Med. Inform. Decis. Mak. 20, 310 (2020).
Wichmann, R. M. et al. Improving the performance of machine learning algorithms for health outcomes predictions in multicentric cohorts. Sci. Rep. 13, 1022 (2023).
Ueda, D. et al. Fairness of artificial intelligence in healthcare: Review and recommendations. Jpn. J. Radiol. https://doi.org/10.1007/s11604-023-01474-3 (2024).
Banerjee, S., Lio, P., Jones, P. B. & Cardinal, R. N. A class-contrastive human-interpretable machine learning approach to predict mortality in severe mental illness. NPJ Schizophr. 7, 60 (2021).
Meredith, J. W. et al. A comparison of the abilities of nine scoring algorithms in predicting mortality. J. Trauma 53, 621 (2002).
Wang, Y. et al. A maintenance hemodialysis mortality prediction model based on anomaly detection using longitudinal hemodialysis data. J. Biomed. Inform. 123, 103930 (2021).
Lin, S. Y. et al. Artificial intelligence prediction model for the cost and mortality of renal replacement therapy in aged and super-aged populations in Taiwan. J. Clin. Med. 8, 995 (2019).
Liu, C. M. et al. Artificial intelligence-enabled model for early detection of left ventricular hypertrophy and mortality prediction in young to middle-aged adults. Circ. Cardiovasc. Qual. Outcomes 15, e008360 (2022).
Shi, H. Y. et al. Artificial neural network model for predicting 5-year mortality after surgery for hepatocellular carcinoma: A nationwide study. J. Gastrointest. Surg. 16, 2126 (2012).
Shi, L., Wang, X. C. & Wang, Y. S. Artificial neural network models for predicting 1-year mortality in elderly patients with intertrochanteric fractures in China. Braz. J. Med. Biol. Res. 46, 993 (2013).
Puddu, P. E. & Menotti, A. Artificial neural networks versus proportional hazards Cox models to predict 45-year all-cause mortality in the Italian Rural Areas of the Seven Countries Study. BMC Med. Res. Methodol. 12, 100 (2012).
Harris, A. H. S. et al. Can machine learning methods produce accurate and easy-to-use prediction models of 30-day complications and mortality after knee or hip arthroplasty?. Clin. Orthop. Relat. Res. 477, 452 (2019).
Jing, B. et al. Comparing machine learning to regression methods for mortality prediction using veterans affairs electronic health record clinical data. Med. Care 60, 470 (2022).
Sakr, S. et al. Comparison of machine learning techniques to predict all-cause mortality using fitness data: The Henry Ford exercIse testing (FIT) project. BMC Med. Inform. Decis. Mak. 17, 174 (2017).
Jones, B. E. et al. Computerized mortality prediction for community-acquired pneumonia at 117 veterans affairs medical centers. Ann. Am. Thorac. Soc. 18, 1175 (2021).
Singh, A. et al. Deep learning for explainable estimation of mortality risk from myocardial positron emission tomography images. Circ. Cardiovasc. Imaging 15, e014526 (2022).
Lu, M. T. et al. Deep learning to assess long-term mortality from chest radiographs. JAMA Netw .Open 2, e197416 (2019).
Ulloa Cerna, A. E. et al. Deep-learning-assisted analysis of echocardiographic videos improves predictions of all-cause mortality. Nat. Biomed. Eng. 5, 546 (2021).
Wang, L. et al. Development and validation of a deep learning algorithm for mortality prediction in selecting patients with dementia for earlier palliative care interventions. JAMA Netw. Open 2, e196972 (2019).
Mohammad, M. A. et al. Development and validation of an artificial neural network algorithm to predict mortality and admission to hospital for heart failure after myocardial infarction: A nationwide population-based study. Lancet Digit. Health 4, e37 (2022).
Valsaraj, A. et al. Development and validation of echocardiography-based machine-learning models to predict mortality. EBioMedicine 90, 104479 (2023).
Li, Z. et al. Development and validation of questionnaire-based machine learning models for predicting all-cause mortality in a representative population of China. Front. Public Health 11, 1033070 (2023).
Zhou, J. et al. Development of an electronic frailty index for predicting mortality and complications analysis in pulmonary hypertension using random survival forest model. Front. Cardiovasc. Med. 9, 735906 (2022).
Giang, K. W., Helgadottir, S., Dellborg, M., Volpe, G. & Mandalenakis, Z. Enhanced prediction of atrial fibrillation and mortality among patients with congenital heart disease using nationwide register-based medical hospital data and neural networks. Eur. Heart J. Digit. Health 2, 568 (2021).
Castela Forte, J. et al. Ensemble machine learning prediction and variable importance analysis of 5-year mortality after cardiac valve and CABG operations. Sci. Rep. 11, 3467 (2021).
Hernesniemi, J. A. et al. Extensive phenotype data and machine learning in prediction of mortality in acute coronary syndrome–the MADDEC study. Ann. Med. 51, 156 (2019).
Qiu, W. et al. Interpretable machine learning prediction of all-cause mortality. Commun. Med. 2, 125 (2022).
Niedziela, J. T. et al. Is neural network better than logistic regression in death prediction in patients after ST-segment elevation myocardial infarction?. Kardiol. Pol. 79, 1353 (2021).
Mostafaei, S. et al. Machine learning algorithms for identifying predictive variables of mortality risk following dementia diagnosis: A longitudinal cohort study. Sci. Rep. 13, 9480 (2023).
Cui, Y. et al. Machine learning approaches for prediction of early death among lung cancer patients with bone metastases using routine clinical characteristics: An analysis of 19,887 patients. Front. Public Health 10, 1019168 (2022).
Parikh, R. B. et al. Machine learning approaches to predict 6-month mortality among patients with cancer. JAMA Netw. Open 2, 1019168 (2019).
Tong, J. et al. Machine learning can predict total death after radiofrequency ablation in liver cancer patients. Clin. Med. Insights Oncol. 15, 11795549211000016 (2021).
de Capretz, P. O. et al. Machine learning for early prediction of acute myocardial infarction or death in acute chest pain patients using electrocardiogram and blood tests at presentation. BMC Med. Inform. Decis. Mak. 23, 25 (2023).
Tian, P. et al. Machine learning for mortality prediction in patients with heart failure with mildly reduced ejection fraction. J. Am. Heart Assoc. 12, e029124 (2023).
Mamprin, M. et al. Machine learning for predicting mortality in transcatheter aortic valve implantation: An inter-center cross validation study. J. Cardiovasc. Dev. Dis. 8, 65 (2021).
Motwani, M. et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: A 5-year multicentre prospective registry analysis. Eur. Heart J. 38, 500–507 (2017).
dos Santos, H. G., do Nascimento, C. F., Izbicki, R., Duarte, Y. A. O. & Chiavegatto Filho, A. D. P. Machine learning for predictive analyses in health: An example of an application to predict death in the elderly in São Paulo, Brazil. Cad. Saude Publica 35, e00050818 (2019).
Feng, X. et al. Machine learning improves mortality prediction in three-vessel disease. Atherosclerosis 367, 1–7 (2023).
Yu, Y. et al. Machine learning methods for predicting long-term mortality in patients after cardiac surgery. Front. Cardiovasc. Med. 9, 831390 (2022).
Tamminen, J., Kallonen, A., Hoppu, S. & Kalliomäki, J. Machine learning model predicts short-term mortality among prehospital patients: A prospective development study from Finland. Resusc. Plus 5, 100089 (2021).
Xu, C., Subbiah, I. M., Lu, S. C., Pfob, A. & Sidey-Gibbons, C. Machine learning models for 180-day mortality prediction of patients with advanced cancer using patient-reported symptom data. Qual. Life Res. 32, 713 (2023).
Katsiferis, A. et al. Machine learning models of healthcare expenditures predicting mortality: A cohort study of spousal bereaved Danish individuals. PLoS ONE 18, e0289632 (2023).
Kanda, E. et al. Machine learning models predicting cardiovascular and renal outcomes and mortality in patients with hyperkalemia. Nutrients 14, 4614 (2022).
Lu, J. et al. Machine learning risk prediction model for acute coronary syndrome and death from use of non-steroidal anti-inflammatory drugs in administrative data. Sci. Rep. 11, 18314 (2021).
Scrutinio, D. et al. Machine learning to predict mortality after rehabilitation among patients with severe stroke. Sci. Rep. 10, 20127 (2020).
Li, Y. M. et al. Machine learning to predict the 1-year mortality rate after acute anterior myocardial infarction in Chinese patients. Ther. Clin. Risk Manag. 16, 1–16 (2020).
Guo, R. et al. Machine learning-based approaches for prediction of patients’ functional outcome and mortality after spontaneous intracerebral hemorrhage. J. Pers. Med. 12, 112 (2022).
Li, Y. et al. Machine learning-based models to predict one-year mortality among Chinese older patients with coronary artery disease combined with impaired glucose tolerance or diabetes mellitus. Cardiovasc. Diabetol. 22, 139 (2023).
Asrian, G., Suri, A. & Rajapakse, C. Machine learning-based mortality prediction in hip fracture patients using biomarkers. J. Orthop. Res. 42, 395 (2024).
Ivanics, T. et al. Machine learning–based mortality prediction models using national liver transplantation registries are feasible but have limited utility across countries. Am. J. Transpl. 23, 64 (2023).
Behnoush, A. H. et al. Machine learning-based prediction of 1-year mortality in hypertensive patients undergoing coronary revascularization surgery. Clin. Cardiol. 46, 269 (2023).
Kampaktsis, P. N. et al. Machine learning-based prediction of mortality after heart transplantation in adults with congenital heart disease: A UNOS database analysis. Clin. Transpl. 37, e14845 (2023).
Lin, Y. J. et al. Machine-learning monitoring system for predicting mortality among patients with noncancer end-stage liver disease: Retrospective study. JMIR Med. Inform. 8, e24305 (2020).
Lin, F. Y. et al. Mortality impact of low CAC density predominantly occurs in early atherosclerosis: Explainable ML in the CAC consortium. J. Cardiovasc. Comput. Tomogr. 17, 28 (2023).
Liu, Y. et al. Nomogram and machine learning models predict 1-year mortality risk in patients with sepsis-induced cardiorenal syndrome. Front. Med. (Lausanne) 9, 792238 (2022).
Forssten, M. P., Bass, G. A., Ismail, A. M., Mohseni, S. & Cao, Y. Predicting 1-year mortality after hip fracture surgery: An evaluation of multiple machine learning approaches. J. Pers. Med. 11, 727 (2021).
Alimbayev, A. et al. Predicting 1-year mortality of patients with diabetes mellitus in Kazakhstan based on administrative health data using machine learning. Sci. Rep. 13, 8412 (2023).
El-Bouri, W. K., Sanders, A. & Lip, G. Y. H. Predicting acute and long-term mortality in a cohort of pulmonary embolism patients using machine learning. Eur. J. Intern. Med. 118, 42 (2023).
Park, J. et al. Predicting long-term mortality in patients with acute heart failure by using machine learning. J. Card Fail. 28, 1078 (2022).
Penso, M. et al. Predicting long-term mortality in TAVI patients using machine learning techniques. J. Cardiovasc. Dev. Dis. 8, 44 (2021).
Guo, A., Mazumder, N. R., Ladner, D. P. & Foraker, R. E. Predicting mortality among patients with liver cirrhosis in electronic health records with machine learning. PLoS ONE 16, e0256428 (2021).
Abedi, V. et al. Predicting short and long-term mortality after acute ischemic stroke using EHR. J. Neurol. Sci. 427, 117560 (2021).
Zhou, J. et al. Predicting stroke and mortality in mitral regurgitation: A machine learning approach. Curr. Probl. Cardiol. https://doi.org/10.1016/j.cpcardiol.2022.101464 (2023).
Rauf, A. et al. Predicting stroke and mortality in mitral stenosis with atrial flutter: A machine learning approach. Ann. Noninvasive Electrocardiol. 28, e13078 (2023).
Shi, N. et al. Predicting the need for therapeutic intervention and mortality in acute pancreatitis: A two-center international study using machine learning. J. Pers. Med. 12, 616 (2022).
Zhou, Y. et al. Prediction of 1-year mortality after heart transplantation using machine learning approaches: A single-center study from China. Int. J. Cardiol. 339, 21 (2021).
Lee, H. C. et al. Prediction of 1-year mortality from acute myocardial infarction using machine learning. Am. J. Cardiol. 133, 23 (2020).
Tran, N. T. D. et al. Prediction of all-cause mortality for chronic kidney disease patients using four models of machine learning. Nephrol. Dial. Transpl. 38, 1691 (2023).
Raghunath, S. et al. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nat. Med. 26, 886 (2020).
Kawano, K. et al. Prediction of mortality risk of health checkup participants using machine learning-based models: The J-SHC study. Sci. Rep. 12, 141546 (2022).
Weng, S. F., Vaz, L., Qureshi, N. & Kai, J. Prediction of premature all-cause mortality: A prospective general population cohort study comparing machine-learning and standard epidemiological approaches. PLoS ONE 14, e0214365 (2019).
Zhou, Q. et al. Prediction of premature all-cause mortality in patients receiving peritoneal dialysis using modified artificial neural networks. Aging 13, 14170 (2021).
Huang, S. H., Loh, J. K., Tsai, J. T., Hou, M. F. & Shi, H. Y. Predictive model for 5-year mortality after breast cancer surgery in Taiwan residents. Chin. J. Cancer 36, 1–9 (2017).
Sheng, K. et al. Prognostic machine learning models for first-year mortality in incident hemodialysis patients: Development and validation study. JMIR Med. Inform. 8, e20578 (2020).
Zachariah, F. J., Rossi, L. A., Roberts, L. M. & Bosserman, L. D. Prospective comparison of medical oncologists and a machine learning model to predict 3-month mortality in patients with metastatic solid tumors. JAMA Netw. Open https://doi.org/10.1001/jamanetworkopen.2022.14514 (2022).
Unterhuber, M. et al. Proteomics-enabled deep learning machine algorithms can enhance prediction of mortality. J. Am. Coll. Cardiol. 78, 1621 (2021).
Wu, X. D. et al. Risk factors prediction of 6-month mortality after noncardiac surgery of older patients in China: A multicentre retrospective cohort study. Int. J. Surg. 110, 219 (2024).
Hwangbo, L. et al. Stacking ensemble learning model to predict 6-month mortality in ischemic stroke patients. Sci. Rep. 12, 17389 (2022).
Ross, E. G. et al. The use of machine learning for the identification of peripheral artery disease and future mortality risk. J. Vasc. Surg. 64, 1515 (2016).
Wang, H. et al. Using machine learning to integrate socio-behavioral factors in predicting cardiovascular-related mortality risk. Stud. Health Technol. Inform. 264 (2019).
Acknowledgements
Delpino FM received a post-doctoral fellowship from the National Council for Scientific and Technological Development (CNPq), Call 07/2022, in conjunction with the Foundation for Research Support of the State of Rio Grande do Sul (FAPERGS), during the writing of the manuscript. Nunes BP and Chiavegatto Filho AD received research productivity grants from CNPq.
Funding
This work was supported by the National Council for Scientific and Technological Development (CNPq), Call 07/2022, in conjunction with the Foundation for Research Support of the State of Rio Grande do Sul (FAPERGS).
Author information
Contributions
Delpino FM participated in the design, writing, and leadership of the article. Nunes BP and Chiavegatto Filho AD contributed to the conception and development of the idea, design, and critical review of the paper. Pimenta LP, Gonzales DF, Victor A, Araújo C, and Moura KA participated in the selection of studies and data extraction. Miranda JJ and Batista SRR contributed to the interpretation of the data and critical review of the paper. All authors reviewed and contributed significantly to the writing of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Delpino, F.M., Pimenta, L.P., Gonzalez, D.F. et al. Global performance of machine learning models to predict all-cause mortality: systematic review and meta-analysis. Sci Rep 15, 42707 (2025). https://doi.org/10.1038/s41598-025-26714-6