Introduction

Metabolic dysfunction-associated steatotic liver disease (MASLD) is strongly associated with obesity, type 2 diabetes mellitus (T2D), and atherosclerotic cardiovascular disease (ASCVD), and is linked to an increased risk of both all-cause and liver-related mortality1,2. The progression from MASLD to metabolic dysfunction-associated steatohepatitis (MASH) is associated with a two-fold increase in all-cause mortality and more than a tenfold increase in the risks of hepatocellular carcinoma and liver-related mortality3. Thus, the identification of MASH is crucial for predicting individual clinical outcomes and informing personalized treatment strategies4.

Liver biopsy has long been regarded as the gold standard for diagnosing MASH5. However, its routine implementation in clinical settings is constrained by its invasive nature, poor patient acceptance, high costs, and substantial inter-observer variability. Moreover, the procedure involves risks, including post-procedural pain in nearly 50% of patients, and life-threatening hemorrhage in 0.6% to 1% of cases6. To enhance the diagnosis of MASH at the population level, it is crucial to develop non-invasive tests (NITs) that exhibit high diagnostic performance. This approach would not only decrease reliance on liver biopsies but also aid in stratifying patients who need follow-up treatments. Over the last 2 years, numerous international validation studies have investigated non-invasive methods for detecting MASH accompanied by fibrosis. These studies utilized a combination of imaging techniques and circulating biomarkers, such as the FAST (FibroScan-aspartate aminotransferase) score7 and NIS48. However, crucial components for these evaluations, including vibration-controlled transient elastography, miR-34a-5p, YKL-40, and alpha-2 macroglobulin, are frequently inaccessible within primary care contexts. In light of these limitations, machine learning-based integration of routine clinical laboratory parameters and physical examination measurements represents a promising alternative strategy9. Machine learning algorithms excel at capturing complex, non-linear patterns and multi-dimensional interactions among readily available clinical variables, enabling superior predictive performance compared to traditional scoring systems. This approach facilitates the development of practical, cost-effective screening tools for population-level MASH detection in primary care settings10.

Consequently, there remains an unmet need to develop accurate tools using readily available laboratory and clinical variables to identify individuals at high risk of developing MASH and its complications, including liver-related events and mortality. Such tools would enable early detection and personalized monitoring of high-risk individuals and patients with MASH in primary care and other non-specialized healthcare settings, before the development of cirrhosis or its complications.

Results

Study population characteristics and cohort composition

The workflow of this study is illustrated in Fig. 1. The study included a total of 218 participants in the training cohort, 93 in the validation cohort, 96 in the test cohort 1, 26,256 in the test cohort 2, and 390,574 in the prognosis cohort. Table 1 presents the baseline characteristics of the participants. The mean age of participants ranged from 31 to 57 years, with the proportion of males varying from 31 to 71%. The training, validation, and test cohort 1 were obtained from hospitals in China, where the MASH positivity rate was between 44 and 48%. In contrast, the test cohort 2 was derived from the general population in the United Kingdom, which exhibited a MASH positivity rate of 4.4%. The robustness of our dataset is evidenced by the diverse demographic characteristics of the participants, which theoretically enhances the development of a broadly applicable MASH prediction model.

Fig. 1: Study design.

An illustration of the general workflow adopted in this study.

Table 1: Baseline characteristics

Development of the MASHRisk score

We calculated the correlation coefficients for the selected features in the training cohort, noting that the coefficients for aspartate aminotransferase (AST) and alanine aminotransferase (ALT), as well as those for low-density lipoprotein cholesterol (LDL-C) and total cholesterol, exceeded 0.7 (Fig. 2a). Employing the principle of maximizing the area under the curve (AUC) for feature selection, we ultimately identified the following features: AST, LDL-C, sex, age, body mass index (BMI), gamma-glutamyl transferase (GGT), albumin, total bilirubin (TB), uric acid, hemoglobin A1c (HbA1c), triglycerides (TG), and high-density lipoprotein cholesterol (HDL-C) (Fig. 2b, c). These features were subsequently utilized in the Boruta algorithm11, which identified age, BMI, AST, HbA1c, and TG as the final modeling features (Supplementary Fig. 1a). Following this, we developed models using ten distinct machine learning algorithms based on the training set and assessed their performance on the validation set. Our results revealed that the model generated using the Partial Least Squares (PLS) algorithm achieved the highest AUC for predicting MASH in the validation set (Supplementary Fig. 1b, Supplementary Table 1). Consequently, we selected the PLS-derived model as the MASHRisk score. In the training, validation, test cohort 1, and test cohort 2, the MASHRisk score predicted MASH with AUC values of 0.791 (0.732–0.851), 0.793 (0.698–0.887), 0.806 (0.717–0.895), and 0.796 (0.785–0.808), respectively (Fig. 2d–g). The MASHRisk scores for participants diagnosed with MASH were significantly higher than those for non-MASH participants (all p < 0.001) (Fig. 2d–g). To promote broader utilization among researchers, we implemented a freely accessible online tool for MASHRisk score calculation, which can be accessed at https://njph.shinyapps.io/mashrisk/.
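As an illustration of the AUC values reported above, the following minimal sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen MASH case receives a higher score than a randomly chosen non-MASH case. It is not the authors' code; the function name `auc` and the toy scores and labels are hypothetical and serve only to show the arithmetic.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Returns the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the study cohorts).
toy_scores = [0.91, 0.62, 0.55, 0.48, 0.40, 0.30, 0.22, 0.15]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(f"toy AUC = {toy_auc:.4f}")
```

The pairwise loop is quadratic in cohort size, which is acceptable for a sketch; for a cohort the size of test cohort 2, a sort-based implementation would be preferable.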

Fig. 2: Development and validation of the MASHRisk score.

a Correlation coefficients of the included features. b Receiver operating characteristic (ROC) curve of aspartate aminotransferase (AST) (left) and alanine aminotransferase (ALT) (right) in predicting metabolic dysfunction-associated steatohepatitis (MASH) in the training cohort. c ROC curve of low-density lipoprotein cholesterol (LDL-C) (left) and total cholesterol (TC) (right) in predicting MASH in the training cohort. ROC curve of MASHRisk score in predicting MASH across training cohort (d, left), validation cohort (e, left), testing cohort 1 (f, left), and testing cohort 2 (g, left). Comparative distribution of MASHRisk scores demonstrating clear differentiation between MASH and non-MASH across training cohort (d, right), validation cohort (e, right), testing cohort 1 (f, right), and testing cohort 2 (g, right).

Assessment of the MASHRisk score

The multivariable logistic regression analyses conducted on the MASHRisk score across four distinct cohorts, while controlling for sex, age, and BMI as covariates, revealed that the MASHRisk score remained a statistically significant independent predictor of MASH (all p < 0.001) (Fig. 3a–d). Using receiver operating characteristic (ROC) curve analysis in the training cohort, we identified an optimal MASHRisk cutoff value of 0.457, which was then used to generate confusion matrices for each cohort. In the training cohort, the model demonstrated a sensitivity of 0.729 and a specificity of 0.738 (Fig. 3e, Supplementary Table 2). In the validation cohort, sensitivity was lower at 0.689, whereas specificity was higher at 0.833 (Fig. 3f, Supplementary Table 2). For test cohort 1, sensitivity was 0.605 and specificity was higher still at 0.906 (Fig. 3g, Supplementary Table 2). In test cohort 2, sensitivity was 0.779, while specificity was 0.689 (Fig. 3h, Supplementary Table 2). It is noteworthy that test cohort 2 reflects a general population with a MASH positivity rate of only 4.4%, and the MASHRisk score showed a negative predictive value of 99.0% in this cohort (Fig. 3h, Supplementary Table 2). Overall, these results suggest that the MASHRisk score is a valuable tool for predicting MASH and demonstrates consistent performance across different populations.
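The dependence of predictive values on prevalence can be made concrete with Bayes' rule. The sketch below is an illustration rather than the authors' analysis: it takes the rounded sensitivity (0.779), specificity (0.689), and MASH positivity rate (4.4%) reported for test cohort 2 and derives the positive and negative predictive values. The function name `predictive_values` and the comparison prevalence of 0.46 (chosen from within the 44 to 48% range of the hospital cohorts) are assumptions made for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence.

    Each term is the expected fraction of the whole population that falls
    into the corresponding cell of the confusion matrix.
    """
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Rounded values reported for test cohort 2 (general population).
ppv_general, npv_general = predictive_values(0.779, 0.689, 0.044)

# Same test characteristics at a hospital-like prevalence (assumed 0.46).
ppv_hospital, npv_hospital = predictive_values(0.779, 0.689, 0.46)

print(f"general population: PPV={ppv_general:.3f}, NPV={npv_general:.3f}")
print(f"hospital-like:      PPV={ppv_hospital:.3f}, NPV={npv_hospital:.3f}")
```

With the general-population inputs the NPV exceeds 0.98, in the same range as the value reported for test cohort 2; exact agreement is not expected because the inputs are rounded. At the higher prevalence the NPV falls markedly, which is why the score's rule-out value is most relevant in low-prevalence settings.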

Fig. 3: Assessment of the MASHRisk score.

Multivariable logistic regression in the training cohort (a), validation cohort (b), testing cohort 1 (c), and testing cohort 2 (d). Confusion matrices in the training cohort (e, left), validation cohort (f, left), testing cohort 1 (g, left), and testing cohort 2 (h, left). Assessment of the confusion matrices in the training cohort (e, right), validation cohort (f, right), testing cohort 1 (g, right), and testing cohort 2 (h, right). Spec specificity, Sens sensitivity, PPV positive predictive value, NPV negative predictive value, F1 F1 score, Prec precision, Rec recall, Prev prevalence, DP diagnostic probability, DR detection rate, BA balanced accuracy.

Comparison of the MASHRisk score with other NITs

We compared the predictive performance of the MASHRisk score against the Fibrosis-4 Index (FIB-4), AST to Platelet Ratio Index (APRI), AST to ALT Ratio (AAR), and Non-alcoholic Fatty Liver Disease Fibrosis Score (NFS) using ROC curves, calibration curves, and decision curve analysis (DCA). It is important to note that, due to the absence of platelet (PLT) data in test cohort 1, only the AAR could be calculated in this cohort. Across the four cohorts, the ROC curves indicated that the MASHRisk score exhibited the highest AUC for predicting MASH (Fig. 4a). Calibration curves demonstrated that the MASHRisk score’s predicted probabilities were closest to the observed probabilities, as reflected by the lowest Brier score (Fig. 4b). Additionally, DCA revealed that the MASHRisk score provided greater clinical net benefits compared to other NITs (Fig. 4c). In summary, the MASHRisk score offers superior predictive performance for MASH compared to FIB-4, APRI, AAR, and NFS, supporting its evaluation for use in clinical practice.
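To show why the comparator indices depend on a platelet count, the sketch below writes out the standard published formulas for AAR, FIB-4, and APRI. It is illustrative only: the function names, the example patient, and the AST upper limit of normal of 40 U/L are assumptions, and laboratories may use a different reference limit. AST and ALT are in U/L, age in years, and PLT in 10^9/L.

```python
import math


def aar(ast, alt):
    """AST to ALT ratio; needs no platelet count."""
    return ast / alt


def fib4(age, ast, alt, plt):
    """Fibrosis-4 index: (age x AST) / (PLT x sqrt(ALT))."""
    return (age * ast) / (plt * math.sqrt(alt))


def apri(ast, plt, ast_uln=40.0):
    """AST to Platelet Ratio Index: (AST / ULN) / PLT x 100.

    ast_uln is the laboratory upper limit of normal for AST; 40 U/L is an
    assumed illustrative value.
    """
    return (ast / ast_uln) / plt * 100.0


# Hypothetical patient, not drawn from any study cohort.
patient = {"age": 50, "ast": 60.0, "alt": 64.0, "plt": 200.0}

patient_aar = aar(patient["ast"], patient["alt"])
patient_fib4 = fib4(patient["age"], patient["ast"], patient["alt"], patient["plt"])
patient_apri = apri(patient["ast"], patient["plt"])
print(patient_aar, patient_fib4, patient_apri)
```

Only `aar` can be evaluated without `plt`, which matches the restriction to AAR in test cohort 1.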

Fig. 4: Comparison of the MASHRisk score with other non-invasive tests (NITs).

a The receiver operating characteristic (ROC) curves of different NITs. From left to right: training cohort, validation cohort, testing cohort 1, and testing cohort 2. b The calibration curves of different NITs. From left to right: training cohort, validation cohort, testing cohort 1, and testing cohort 2. c The decision curve analysis (DCA) curves of different NITs. From left to right: training cohort, validation cohort, testing cohort 1, and testing cohort 2.

Model interpretation

We utilized the SHapley Additive exPlanations (SHAP) framework to interpret the MASHRisk score. The mean absolute SHAP values illustrated the impact of various risk factors, revealing that AST was the most significant variable (Supplementary Fig. 2a, b). To further elucidate the interpretability of the MASHRisk score, we highlighted two representative cases. SHAP values were employed to assess the influence of each feature on the predictions made by the MASHRisk score. In this study, diminished SHAP values correlated with a reduced likelihood of MASH, whereas elevated SHAP values indicated an increased probability. For example, patient 1, diagnosed with MASH, presented with an elevated SHAP value and a MASHRisk score of 0.884, suggestive of a greater MASH likelihood (Supplementary Fig. 2c). Conversely, patient 2, without MASH, exhibited a lower SHAP value and a prediction score of 0.353, indicating a decreased probability of MASH (Supplementary Fig. 2d).

Prognostic value of the MASHRisk score

In the prognostic cohort, participants were stratified into two categories: low-risk (n = 378,978; 97%) and high-risk (n = 11,596; 3%). Liver-related mortality was observed in 0.1% (n = 417) of the low-risk group, compared to 1.3% (n = 152) of the high-risk group. The high-risk group demonstrated a significantly elevated hazard ratio (HR) for liver-related mortality compared to the low-risk group (HR = 12.296, 95% CI: 10.212–14.806, p < 0.001) (Fig. 5a). Time-dependent ROC analysis revealed strong prognostic performance, with AUC values of 0.83 at 3 years, 0.85 at 5 years, and 0.80 at 10 years (Fig. 5a). The incidence of MASH events was 0.4% in the low-risk group and 1.2% (n = 141) in the high-risk group, with the latter exhibiting a significantly increased hazard ratio (HR = 12.829, 95% CI: 10.560–15.585, p < 0.001) compared to the low-risk group (Fig. 5b). Time-dependent ROC analysis for MASH events yielded AUCs of 0.74 at 3 years, 0.79 at 5 years, and 0.81 at 10 years (Fig. 5b). For cirrhosis events, the incidence was 0.4% (n = 1676) in the low-risk group and 3.8% (n = 446) in the high-risk group. The high-risk group showed a significantly increased risk of cirrhosis compared to the low-risk group (HR = 8.863, 95% CI: 7.985–9.839, p < 0.001) (Fig. 5c). Time-dependent ROC analyses for cirrhosis yielded an AUC of 0.75 at 3, 5, and 10 years (Fig. 5c). Regarding liver cancer, the incidence was 0.07% (n = 254) in the low-risk group and 0.6% (n = 72) in the high-risk group. The high-risk group exhibited a significantly elevated risk of developing liver cancer compared to the low-risk group (HR = 9.278, 95% CI: 7.142–12.059, p < 0.001) (Fig. 5d). Time-dependent ROC analysis for liver cancer demonstrated AUC values of 0.77 at 3 years, 0.75 at 5 years, and 0.74 at 10 years (Fig. 5d). Finally, the high-risk group showed a significantly increased risk for ASCVD and all-cause mortality compared to the low-risk group, with hazard ratios of 2.303 (95% CI: 2.047–2.592, p < 0.001) and 1.744 (95% CI: 1.661–1.831, p < 0.001), respectively (Supplementary Fig. 3a, b).
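The reported incidence percentages can be checked directly from the event counts. The sketch below is a plausibility check, not a reproduction of the Cox models: it uses the group sizes that reproduce the reported percentages (378,978 low-risk and 11,596 high-risk participants) and computes crude incidence proportions and crude risk ratios. A crude risk ratio ignores follow-up time and censoring, so it only approximates a hazard ratio; the names `crude_risk_ratio` and `summary` are hypothetical.

```python
# Group sizes consistent with the reported incidence percentages.
LOW_N, HIGH_N = 378_978, 11_596

# Event counts (low-risk, high-risk) as reported in the text.
events = {
    "liver-related mortality": (417, 152),
    "cirrhosis": (1676, 446),
    "liver cancer": (254, 72),
}


def crude_risk_ratio(low_events, high_events, low_n=LOW_N, high_n=HIGH_N):
    """Return (low-risk incidence, high-risk incidence, crude risk ratio)."""
    low_risk = low_events / low_n
    high_risk = high_events / high_n
    return low_risk, high_risk, high_risk / low_risk


summary = {name: crude_risk_ratio(lo, hi) for name, (lo, hi) in events.items()}
for name, (lo, hi, rr) in summary.items():
    print(f"{name}: low {lo:.2%}, high {hi:.2%}, crude risk ratio {rr:.1f}")
```

The crude ratios fall close to the reported hazard ratios for the same outcomes, which is what one would expect when events are rare and follow-up is similar in both groups.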

Fig. 5: Prognostic value of MASHRisk score in the prognostic cohort.

a Cumulative hazard plot comparing high-risk and low-risk patient groups over time (in days), with liver-related death as the outcome (left). Time-dependent receiver operating characteristic curves of the MASHRisk score for predicting liver-related mortality (right). b Cumulative hazard plot comparing high-risk and low-risk patient groups over time (in days), with metabolic dysfunction-associated steatohepatitis (MASH) as the outcome (left). Time-dependent receiver operating characteristic curves of the MASHRisk score for predicting MASH (right). c Cumulative hazard plot comparing high-risk and low-risk patient groups over time (in days), with cirrhosis as the outcome (left). Time-dependent receiver operating characteristic curves of the MASHRisk score for predicting cirrhosis (right). d Cumulative hazard plot comparing high-risk and low-risk patient groups over time (in days), with liver cancer as the outcome (left). Time-dependent receiver operating characteristic curves of the MASHRisk score for predicting liver cancer (right).

Enrichment analysis of the MASHRisk score

To explore the biological mechanisms of the MASHRisk score, we performed pathway enrichment analysis. Both over-representation analysis (ORA) and gene set enrichment analysis (GSEA) revealed that the high-risk group was significantly enriched in metabolic and immune-related pathways, such as the carboxylic acid metabolic process, carbon metabolism, and chemokine signaling pathway (Fig. 6a, b). This finding is in line with the known pathophysiology of MASH, as metabolic dysregulation and chronic inflammation are central to its development12,13.

Fig. 6: Enrichment analysis of the MASHRisk score.

a Over-representation analysis of MASHRisk score. b Gene set enrichment analysis of MASHRisk score.

Discussion

We developed a biochemical examination-based MASH prediction score in a Chinese liver biopsy cohort using 14 candidate clinical parameters. By employing a machine learning pipeline, we identified that a combination of five essential parameters, namely age, BMI, AST, HbA1c, and TG, effectively predicts MASH. The MASHRisk score was also externally validated across various ethnic groups within a cohort from the British population, achieving an AUC of 0.796. In addition, the score demonstrates a clinically significant improvement in detecting MASH compared to the widely used FIB-4 and other NITs. A strong association with liver-related events underscores the clinical utility of this referral tool.

Each element of the MASHRisk score has previously shown a robust individual correlation with MASH. Obesity is a recognized risk factor for fatty liver disease, and patients frequently exhibit elevated liver enzyme levels, thereby justifying the inclusion of BMI and AST as predictive factors14,15,16. Age is a primary risk factor for chronic liver diseases, including MASLD and MASH17. Elevated TG levels are often observed in patients with metabolic syndrome, which is a significant risk factor for the development and progression of MASH17. High TG levels can contribute to hepatic steatosis and inflammation, thereby increasing the risk of MASH17. HbA1c, a marker of long-term blood glucose control, is closely related to T2D, which is one of the strongest risk factors for MASH18. The MASHRisk score therefore captures elevated liver enzymes, diabetes, obesity, and hyperlipidemia, which may be one of the reasons for its strong performance.

MASLD has become the primary contributor to liver-related health issues and deaths globally19. Recognizing patients with MASLD who are at an increased risk for liver disease-related mortality is crucial in clinical settings. Previous research indicates that significant hepatic steatosis, assessed through ultrasonography or by using surrogate markers such as liver enzyme levels and various steatosis scores (including the ultrasound liver fat index, fatty liver index, NAFLD liver fat score, and hepatic steatosis index), is linked to a higher risk of liver-related mortality20,21. Nevertheless, these parameters alone are not highly effective; for example, the mortality risk only rises by four to five times among individuals exhibiting the highest liver enzyme levels or intermediate to high steatosis indices. Simple fibrosis scores, including AAR, FIB-4, APRI, and NFS, are commonly employed in clinical practice to estimate advanced liver fibrosis and may contribute to minimizing the need for liver biopsies22,23. We found that the performance of the MASHRisk score was superior to several established NITs, including AAR, APRI, FIB-4, and NFS scores, in accurately diagnosing MASH. It is important to note that the MASHRisk score is based entirely on biochemical examinations, whereas FIB-4, NFS, and APRI also incorporate complete blood count parameters such as PLT. Requiring PLT increases the complexity of clinical implementation and testing costs, as it necessitates an additional blood sample. Incorporating the MASHRisk score as a targeted screening tool in primary care settings could alleviate unnecessary patient anxiety. Its user-friendly design encourages adoption across various healthcare settings, including resource-limited settings where advanced diagnostic tools may be unavailable. By integrating this score into routine check-ups, healthcare providers can prioritize patients for further diagnostic evaluation, lifestyle modifications, or treatment, ultimately enhancing patient outcomes. From a public health perspective, the widespread adoption of the MASHRisk score can help mitigate the burden of MASH by facilitating early detection and intervention. This, in turn, may lower healthcare costs related to advanced liver disease and improve the quality of life for affected individuals.

While the MASHRisk score demonstrates strong predictive performance, successful implementation requires addressing key challenges: integration with electronic health records, laboratory standardization across facilities, and clinician training on score interpretation and patient management. We propose practical solutions, including the development of digital calculators and mobile applications, standardized laboratory protocols, and clear implementation guidelines with defined action thresholds for referral and intervention. The score’s reliance on routinely available parameters makes it suitable for resource-limited settings where advanced diagnostics are inaccessible, maximizing its real-world impact in reducing the global burden of MASH.

This study has several limitations. First, the use of proton density fat fraction (PDFF) as a surrogate marker for MASH in test cohort 2 may be open to discussion. Second, the potential for longitudinal monitoring of liver health using the MASHRisk score has yet to be verified. Long-term prospective follow-up studies are essential to validate its accuracy and determine its clinical utility in tracking disease progression or regression in response to therapeutic interventions.

In summary, the MASHRisk score is a validated, user-friendly, non-invasive tool applicable to diverse populations. It exhibits robust predictive ability across various cohorts and effectively assesses the long-term risk of liver-related outcomes. As healthcare systems transition towards managing chronic diseases on a population level, the MASHRisk score presents a scalable approach for early detection and risk stratification. With its adaptability to different healthcare environments24,25, MASHRisk shows potential for practical implementation and for alleviating the global burden of MASH.

Methods

Study design

The primary objective of the first phase was to develop and validate a blood-based NIT, the MASHRisk score, which integrates routine clinical and biochemical examinations to diagnose MASH. The second phase aimed to evaluate whether the MASHRisk score could predict subsequent liver-related events in patients with MASLD. The candidate predictors and their distributions are detailed in Table 1. The workflow of this study is illustrated in Fig. 1, which was created using the BioGDP platform (https://biogdp.com/)26.

Participants

The study involved participants from the First Affiliated Hospital of Nanjing Medical University in Nanjing, China, who were randomly divided into a training cohort and a validation cohort in a 7:3 ratio. These MASLD patients were treated within the departments of Infectious Disease or General Surgery and Bariatric and Metabolic Surgery, with data collected from August 2023 to July 2024. Inclusion criteria mandated that patients had undergone a liver biopsy and histologic examination. Participants were excluded if they met any of the following criteria: (1) viral hepatitis; (2) previous cirrhosis; (3) previous liver cancer or other malignancies; (4) incomplete baseline information. Ethical approval was granted by the Ethics Committee of the First Affiliated Hospital of Nanjing Medical University (2023-SR-398). The test cohort 1 was derived from a previous study, and the specific inclusion and exclusion criteria can be referenced in that study27. Test cohort 2 and the prognostic cohort were derived from the UK Biobank, a large prospective cohort study that recruited approximately 500,000 participants aged 40–69 years from across the United Kingdom between 2006 and 2010. The inclusion criteria for test cohort 2 required participants to have undergone PDFF evaluation. The inclusion criteria for the prognostic cohort specified that participants should have complete follow-up information. The exclusion criteria for these cohorts included: (1) viral hepatitis; (2) previous cirrhosis; (3) previous liver cancer or other malignancies; and (4) incomplete baseline information.
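A random 7:3 partition of the biopsy cohort can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the seed value, the function name `split_7_3`, and the use of simple (unstratified) shuffling are all hypothetical, since the text does not say whether the split was stratified by MASH status. The cohort size of 311 is the sum of the reported training (218) and validation (93) cohorts.

```python
import random


def split_7_3(ids, train_fraction=0.7, seed=2024):
    """Shuffle participant IDs reproducibly and split them into two cohorts.

    The seed is an arbitrary illustrative value; a fixed seed only makes the
    partition repeatable.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 311 placeholder IDs standing in for the biopsy-proven participants.
participant_ids = list(range(311))
train_ids, valid_ids = split_7_3(participant_ids)
print(len(train_ids), len(valid_ids))
```

Rounding 0.7 x 311 gives 218 training participants and leaves 93 for validation, which matches the reported cohort sizes.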

Liver histology

Liver sections embedded in paraffin were subjected to a standard staining protocol, utilizing hematoxylin and eosin as well as Masson trichrome. The evaluation of pathological changes in each sample was conducted by two experienced pathologists who were blinded to the experimental groups. This assessment was carried out according to the scoring system established by the NASH Clinical Research Network28.

Outcome definitions and other definitions

In the training cohort, validation cohort, and test cohort 1, a NAFLD activity score (NAS) of 5 or higher was used to diagnose MASH27. In test cohort 2, MASH was defined as a PDFF exceeding 5.5% and an iron-corrected T1 mapping (cT1) value greater than 800 ms. To evaluate prognostic accuracy in the prognostic cohort, key outcomes included liver-related events (the development of MASH, cirrhosis, hepatocellular carcinoma, and liver-related mortality), as well as ASCVD and all-cause mortality. A comprehensive definition of liver-related events is provided in Supplementary Table 3.
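The two case definitions above can be expressed as simple labeling rules. The sketch below is illustrative; the function and argument names are hypothetical, and the thresholds are those stated in the text (a score of 5 or higher for histology; PDFF above 5.5% together with cT1 above 800 ms for imaging).

```python
def mash_by_histology(nas):
    """Biopsy cohorts: MASH if the histological activity score is 5 or higher."""
    return nas >= 5


def mash_by_mri(pdff_percent, ct1_ms):
    """Test cohort 2: MASH requires both PDFF > 5.5% and cT1 > 800 ms."""
    return pdff_percent > 5.5 and ct1_ms > 800


# Hypothetical participants for illustration.
histology_labels = [mash_by_histology(score) for score in (3, 5, 7)]
mri_labels = [
    mash_by_mri(6.0, 850),   # both criteria met
    mash_by_mri(6.0, 780),   # steatosis without raised cT1
    mash_by_mri(4.0, 900),   # raised cT1 without steatosis
]
print(histology_labels, mri_labels)
```

Because the imaging rule is a conjunction, a participant with steatosis alone or raised cT1 alone is labeled non-MASH, which contributes to the low positivity rate in the general-population cohort.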

Modeling strategy

Within the training cohort, the Boruta algorithm, implemented via the Boruta R package11, was employed to identify the variables most relevant to MASH. Subsequently, ten machine learning models were developed using the caret R package29, based on the following algorithms: K-Nearest Neighbors, Gradient Boosting Machine, Recursive Partitioning and Regression Trees, Neural Network, Support Vector Machine, AdaBoost, Random Forest, Extreme Gradient Boosting, Generalized Linear Model with Elastic Net Regularization (GLMNET), and PLS. The PLS algorithm exhibited the best predictive performance in the validation cohort and was selected to construct the MASHRisk score, with grid search identifying optimal performance at ncomp = 1. To aid interpretation of the selected model’s predictions, SHAP values were computed and visualized with the shapviz R package30.

Proteomics assays

High-throughput proteomics measurements were conducted using Olink®31. Details of the Olink proteomics assay, data processing, and quality control measures have been documented elsewhere. In summary, quantification values for 2941 proteins were obtained as Normalized Protein eXpression (NPX) values, presented on Olink’s arbitrary log₂ scale31. Differential protein analysis was conducted using the limma package.

Enrichment analysis

Protein pathway enrichment analysis was conducted using two strategies. First, we employed the ORA method using Metascape32. Additionally, we performed GSEA using the clusterProfiler package33. Differentially expressed proteins were identified using the limma package in R. Proteins with adjusted P value (adj.P.Val) < 0.05 and log2 fold change (log2FC) > 0.5 were selected for ORA.
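The selection rule and the idea behind ORA can be sketched as follows. This is an illustration, not the Metascape implementation: the record layout, protein names, and function names are hypothetical, and scoring over-representation with a one-sided hypergeometric test is a common convention that is assumed here rather than taken from the text. The background size of 2941 corresponds to the number of proteins quantified.

```python
from math import comb


def select_for_ora(results, p_cut=0.05, lfc_cut=0.5):
    """Keep proteins with adjusted P < p_cut and log2 fold change > lfc_cut."""
    return [r["protein"] for r in results
            if r["adj_p"] < p_cut and r["log2fc"] > lfc_cut]


def hypergeom_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """One-sided P(X >= n_overlap) under the hypergeometric distribution.

    X counts pathway members among n_selected proteins drawn without
    replacement from n_background proteins, n_in_pathway of which belong
    to the pathway.
    """
    upper = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical differential-expression results.
toy_results = [
    {"protein": "P1", "adj_p": 0.001, "log2fc": 0.9},
    {"protein": "P2", "adj_p": 0.200, "log2fc": 1.2},   # not significant
    {"protein": "P3", "adj_p": 0.010, "log2fc": 0.3},   # effect too small
    {"protein": "P4", "adj_p": 0.030, "log2fc": 0.6},
]
selected = select_for_ora(toy_results)

# Hypothetical pathway: 50 of 2941 background proteins, 8 among 100 selected.
p_enrich = hypergeom_enrichment_p(2941, 50, 100, 8)
print(selected, p_enrich)
```

Under the null hypothesis about 1.7 of the 100 selected proteins would be expected in a 50-protein pathway, so an overlap of 8 yields a small P value in this toy example.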

Statistical analysis

We used z-score standardization during model development, with scaling parameters estimated on the training set and applied to validation and test cohorts. Missing observations were removed prior to scaling, and categorical variables were not scaled. Continuous variables are reported as means accompanied by standard deviations, while categorical variables are shown as counts and percentages. To evaluate relationships between the variables, Pearson or Spearman correlation methods were employed. The diagnostic efficacy of the MASHRisk score and other indices was determined through ROC curves. The optimal cutoff was determined on the training set by maximizing Youden’s index on the ROC. This threshold was then fixed and applied unchanged to the validation and test cohorts. Calibration curves were created to assess the alignment between predicted probabilities and actual outcomes. Furthermore, DCA was performed to investigate the clinical relevance of the models by calculating the net benefit of applying these scores at varying threshold probabilities. The follow-up period was established using the Kaplan–Meier approach. A p value of less than 0.05 was considered statistically significant. All statistical analyses were carried out using R version 4.4.2.
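The leakage-free workflow described above (scaling parameters and cutoff estimated on the training set, then applied unchanged elsewhere) can be sketched as follows. This is a minimal illustration, not the authors' R code: the function names and toy values are hypothetical, the sample standard deviation is assumed for scaling, and classifying a score equal to the cutoff as positive is an assumed convention.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Estimate z-score parameters (mean, sample SD) on the training data only."""
    return mean(train_values), stdev(train_values)


def apply_scaler(values, mu, sd):
    """Standardize any cohort with parameters learned on the training set."""
    return [(v - mu) / sd for v in values]


def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    Every observed score is tried as a threshold; scores >= threshold are
    called positive. Ties in J keep the lowest threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy values (hypothetical). Scaling: fit on training, apply to test.
mu, sd = fit_scaler([40.0, 50.0, 60.0])
test_age_z = apply_scaler([55.0], mu, sd)

# Cutoff: chosen on training scores, then fixed for the test cohort.
train_scores = [0.2, 0.3, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9]
train_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j_stat = youden_cutoff(train_scores, train_labels)
test_calls = [s >= cutoff for s in [0.45, 0.55]]
print(test_age_z, cutoff, j_stat, test_calls)
```

Keeping `fit_scaler` and `youden_cutoff` restricted to training data is the design point: re-estimating either on a validation or test cohort would make the reported sensitivity and specificity optimistic.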