Introduction

Global burden of liver disease

Chronic liver disease (CLD) is an escalating global health crisis, causing ~2 million deaths annually and rising disproportionately in working-age populations, with far-reaching socio-economic impacts1. Increasing prevalence is largely driven by steatotic liver diseases (metabolic dysfunction-associated steatotic liver disease, MASLD; alcohol-related liver disease, ALD; and metabolic dysfunction and alcohol-related liver disease, MetALD). Chronic hepatitis B and C virus (HBV/HCV) infections remain major global contributors, while autoimmune and cholestatic disorders (such as autoimmune hepatitis, AIH; primary biliary cholangitis, PBC; and primary sclerosing cholangitis, PSC), although less common, cause significant chronic liver injury. Across these diverse aetiologies, disease progression converges on major adverse liver outcomes such as compensated/decompensated cirrhosis, primary liver cancers (hepatocellular carcinoma, HCC; intrahepatic cholangiocarcinoma, iCCA), and liver-related death.

Most patients first present with advanced CLD in emergency settings2, reflecting the silent nature of early disease, lack of systematic community screening, and inequalities in access to timely care. Late presentation undermines opportunities for prevention and contributes to rising healthcare expenditure3. Addressing these challenges requires more effective risk stratification to target surveillance and treatment resources, individualise care pathways, and develop therapies for advanced disease. Artificial intelligence (AI), by harnessing multidimensional data to predict risk and optimise clinical decision-making, may prove transformative and usher in a new era in hepatology.

AI taxonomy

Large-scale patient data and AI are catalysing advances in translational liver research. AI is an umbrella term referring to computational methods performing complex tasks, supporting or enhancing human perception, reasoning, learning, and decision-making. Machine learning (ML) is a subset of AI that recognises patterns in complex data through supervised (input-output mapping) or unsupervised (discovering hidden structures) learning. Deep Learning (DL), using neural networks (NNs), detects features in images and videos via computer vision (CV) or speech and text via natural language processing (NLP)4,5 [Fig. 1]. Table 1 summarises key algorithms most frequently referenced in this review. Overall, the development of AI/ML models relies on extensive data preparation and processing for training and robust evaluation, before potential clinical adoption [Fig. 2]. AI promises a paradigm shift toward proactive, personalised, and equitable management.

Fig. 1: Simplified framework of artificial intelligence relevant to healthcare.
figure 1

Conceptual overview illustrating the relationships between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Neural Networks (NNs). AI refers to a wide range of algorithms that simulate human-like reasoning and perform complex data-handling tasks, with ML as one of its core methodologies. The framework highlights the main ML learning strategies (e.g., supervised, unsupervised, semi-supervised, reinforcement, self-supervised, transfer) and representative algorithms. DL is a prominent subset of ML that uses NNs for a variety of tasks, such as computer vision (CV) and natural language processing (NLP). Generative AI and Agentic AI are advanced NLP- and/or CV-based large AI models that support user interaction across text, image, audio, and even video. NOTE: Some algorithms may span multiple categories depending on the nature of the data, task formulation, or implementation context (e.g., supervised or unsupervised). The hierarchical ordering and clustering shown in this figure are illustrative rather than prescriptive. The example algorithms listed are non-exhaustive and reflect a rapidly evolving field in which models are continuously emerging, refined, or replaced over time.

Fig. 2: End-to-end artificial intelligence workflow in healthcare.
figure 2

Data preparation begins with privacy safeguards (anonymisation/pseudonymisation), systematic cleaning, standardisation and cross-site harmonisation to mitigate batch effects. These steps are challenging due to heterogeneous data sources of variable quality, which can introduce bias or limit generalisability. Metadata tagging enables auditability, while imputation and imaging-specific pre-processing (e.g., resizing, patching) reduce bias and variance, though improper handling may distort signals. Data processing includes image segmentation and ROI selection (radiology/pathology), data augmentation, dimensionality reduction, and feature ranking (‘omics) to enhance model learning. Biological knowledge is incorporated via label definitions, feature engineering, and model constraints to ensure biological plausibility and clinical relevance. Model evaluation involves appropriate algorithm selection, robust internal/external validation (ideally multi-centre), interpretability, and bias/fairness analyses across sex, ethnicity, age, or sociodemographics. Evaluation may be limited by small or unrepresentative datasets, risking hidden bias. Performance metrics include AUC/AUROC (discrimination), C-index (survival prediction), F1-score (class imbalance), and Dice coefficient (imaging accuracy). Clinical adoption requires more than accuracy: transparency, end-user training, usability, cost management, post-deployment monitoring (e.g., model drift and recalibration), and regulatory compliance are essential. Poor monitoring or usability can impede clinical adoption despite strong performance. Together, these stages define an evidence-based pathway from raw data to clinically dependable AI tools, aligned with emerging best-practice guidelines and expert consensus.
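To make the evaluation metrics listed above concrete, the short sketch below computes AUROC, F1-score, and a Dice coefficient on simulated predictions; all data, variable names, and thresholds are illustrative rather than drawn from any cited study.

```python
# Illustrative sketch of common evaluation metrics (simulated data, not from any cited study).
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical binary labels and model-predicted probabilities for a classification task.
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=200), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

print("AUROC (discrimination):", round(roc_auc_score(y_true, y_prob), 3))
print("F1-score (class imbalance):", round(f1_score(y_true, y_pred), 3))

# Dice coefficient for a toy segmentation task: overlap between predicted and reference masks.
ref_mask = rng.integers(0, 2, size=(64, 64)).astype(bool)
pred_mask = ref_mask.copy()
pred_mask[:8, :] = ~pred_mask[:8, :]          # perturb part of the prediction
intersection = np.logical_and(ref_mask, pred_mask).sum()
dice = 2 * intersection / (ref_mask.sum() + pred_mask.sum())
print("Dice coefficient (segmentation overlap):", round(dice, 3))
```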

Table 1 Overview of common AI/ML algorithms

This narrative review focuses on recent advances (2023–2025), highlighting emerging diagnostic, prognostic, and therapeutic applications for AI in hepatology and examining challenges that must be addressed for implementation in clinical practice.

Data sources

AI depends on large, diverse “Big Data” to generate clinically meaningful insights, although each data type presents unique challenges.

Health record systems

Electronic health records (EHRs) contain longitudinal patient information, including sociodemographic details, diagnostic and procedural codes, laboratory results, imaging reports, medications, and administrative data. Structured data (e.g., laboratory results, codes) are generally more standardised, whereas unstructured data (e.g., free-text clinical notes) exhibit greater variability and pose additional challenges for AI integration. EHRs are now near-universal in the US and EU, enabling large-scale studies. However, missing information, data entry errors, and inconsistencies between records are common. Patient-generated health data from smartphones, wearables, or applications offer the potential to integrate granular lifestyle insights with EHRs, but limited standardisation and concerns over accuracy are barriers to immediate utilisation.

Imaging data

Expert evaluation of liver biopsies remains the reference standard for assessing histopathological features. However, biopsies are invasive, limited by sampling error, and susceptible to the inter-/intra-observer variability inherent in subjective assessment. In contrast, non-invasive imaging modalities such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) allow quantitative whole-liver assessment. Digitised histology whole-slide images (WSIs) also produce structured, high-resolution datasets for AI analyses. ML enables objective and reproducible scoring of features while reducing interpretative variability. Nevertheless, heterogeneity in acquisition protocols and image reconstruction parameters limits standardisation. Additional variability can also arise from differences between instrument manufacturers, although contemporary AI models often incorporate normalisation or domain adaptation strategies to mitigate such effects.

Multiomics

As ‘omics data proliferate (Supplementary Table 1), their integration is delivering system-level insights to identify candidate biomarkers and therapeutic targets. AI/ML is essential to manage these complex datasets, although high heterogeneity, dimensionality, and processing variability challenge reproducibility and clinical translation.

Opportunities of AI

As the volume of multimodal data expands, so does the potential for identifying novel diagnostic, prognostic, and therapeutic tools. Large-scale data commons (e.g., UK Biobank, NHANES) and focused liver-specific initiatives (e.g., SteatoSITE6) support both conventional hypothesis-driven and data-driven, hypothesis-free analyses to uncover patterns beyond conventional clinical paradigms.

Diagnostic opportunities

Current diagnostic pipelines combine patient history with isolated serological, radiological, and histological assessments. Applied to non-invasive tests (NITs), AI/ML approaches could uncover more subtle, multimodal signatures preceding symptoms, enabling earlier diagnosis, scalable screening, and more informed clinical decision-making. A comprehensive overview of diagnostic applications of AI in hepatology is provided in Supplementary Table 2.

Image-based feature detection

AI/ML is being widely applied to radiological assessments of liver health. For example, Convolutional Neural Network (CNN) pipelines applied to CT images accurately segmented whole livers7 and detected malignancies8, offering a potential tool for rapid triage. Similar approaches applied to ultrasound9 and MRI10 delivered accurate fibrosis staging. It has also proved possible to assign histological features of disease activity, such as steatosis grades from CT images11 or hepatocyte ballooning scores from ultrasound12. Other CNN-based models have characterised features such as vasculature13, ascites14, and body fat15.
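As a purely illustrative sketch of the kind of architecture involved, a minimal CNN patch classifier in PyTorch is shown below; the layer sizes, input shapes, and labels are hypothetical and do not reproduce any of the cited pipelines.

```python
# Minimal CNN patch classifier sketch (PyTorch); architecture and shapes are illustrative only.
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Tiny convolutional classifier for single-channel imaging patches (e.g., 64x64 crops)."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1),                                                   # global pooling
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = PatchCNN(n_classes=2)
dummy_batch = torch.randn(4, 1, 64, 64)       # 4 hypothetical grey-scale patches
logits = model(dummy_batch)
print(logits.shape)                           # torch.Size([4, 2])
```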

Histology remains the gold standard for the assessment of several liver diseases. Multiple AI/ML computational histopathology pipelines have been developed to provide reproducible, granular, and interpretable feature quantification from biopsies. Ercan et al.16 developed a CNN-based tool for AIH diagnosis using Haematoxylin and Eosin (H&E)- and Sirius Red-stained WSIs, successfully classifying biopsies with 88.2% accuracy. Similar models were able to detect other features, such as portal tracts17 and microvascular invasion (MVI)18. Digital histopathology for MASLD is extensively reviewed elsewhere19.

Disease signatures and stratification

AI/ML approaches allow identification of latent disease-associated patterns within EHR datasets. Addressing diagnostic delays presented by chronic HCV’s asymptomatic onset, Sharma et al.20 stacked ML models to detect HCV infection from standard biochemistry laboratory tests, suggesting a path toward scalable, low-cost screening. In MASLD, a 17-variable Random Forest (RF) classification model outperformed standard NITs for biopsy-defined staging across four US centres21. Other DL models identified increased steatosis risk from unstructured data sources using NLP22, showcasing the potential of text mining for case identification at population scale.
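The stacking pattern mentioned above can be sketched in a few lines of scikit-learn; the example below uses synthetic data and hypothetical feature names, and is not the model of Sharma et al. or any other cited study.

```python
# Sketch of a stacked classifier on routine laboratory variables (synthetic data; feature names hypothetical).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Simulated "laboratory panel": five numeric features loosely standing in for ALT, AST, GGT, platelets, albumin.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=42)
X = pd.DataFrame(X, columns=["alt", "ast", "ggt", "platelets", "albumin"])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("Hold-out AUROC:", round(roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]), 3))
```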

Some diseases may benefit from nuanced spatial and systemic molecular assessment for earlier diagnosis and finer stratification. Oh et al.23 analysed MASLD biopsy-anchored multiomic data via Support Vector Machine-based feature selection and used a generalised linear regression model to derive a six-gene signature which generalised across independent cohorts. The model distinguished healthy from MASLD, and simple steatosis from metabolic dysfunction-associated steatohepatitis (MASH), and identified a corresponding cell-free RNA signal in blood, suggesting potential for non-invasive translation. Other studies implicated cell death24, oxidative stress25, inflammation26, and metabolic27 gene signatures as potential biomarkers for MASLD. Tavaglione et al.28 applied a Feedforward NN to data from ~218,000 participants, finding that individuals with hypertriglyceridemia exhibited a 3-to-4-fold increased prevalence of MASLD and MASH, whereas hypercholesterolemia conferred only marginal risk, underscoring lipid profiling as a robust clinical signal to prompt targeted screening. AI/ML approaches applied to urinary proteomics29, MRI-based fat content30, and circulatory extracellular vesicle (EV)31-based biomarkers have also shown promise.
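A simplified sketch of this general pattern (SVM-driven feature selection followed by a simple linear classifier on the retained features) is shown below, using a simulated expression matrix; it is not the pipeline of Oh et al., and all settings are placeholders.

```python
# Sketch of SVM-based feature selection followed by a linear classifier (simulated expression data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# 200 samples x 500 "genes"; only a handful are informative, mimicking a sparse transcriptomic signature.
X, y = make_classification(n_samples=200, n_features=500, n_informative=6, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalised linear SVM keeps only features with non-zero weights.
    ("select", SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000))),
    # Logistic regression on the retained features acts as the final signature model.
    ("glm", LogisticRegression(max_iter=1000)),
])

auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("Cross-validated AUROC per fold:", auc.round(3))
```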

Although HCC diagnosis remains radiological, AI/ML-driven transcriptomic32, cell-free DNA methylation33, serum metabolomics34, and oral/gut microbiome assays35 are emerging as credible molecular complements. Notably, Li et al.36 isolated fucosylated EVs from serum and trained a Logistic Regression (LR) model on five EV-miRNAs for HCC detection, rescuing >80% of previously misclassified cases.

Differential diagnosis

Liver diseases often present with non-specific features, making diagnosis and accurate management challenging and, therefore, amenable to AI/ML assistance. Huang et al.37 developed a gut-microbiome-based strategy to distinguish simple steatosis from MASH, mapping pathway shifts in glucose metabolism and flavonoid biosynthesis. A similar approach differentiated ALD from MASLD metagenomically38, suggesting stool-based signatures as non-invasive diagnostic options. Using routine laboratory parameters, Wang et al.39 validated a Gradient-Boosted Decision Tree to differentiate idiosyncratic drug-induced liver injury (DILI) from AIH. Similarly, AI/ML supported the differentiation of PBC and AIH from saliva proteomics40 and histology41.

For patients with combined HCC-iCCA, Calderaro et al.42 developed a self-supervised CNN to re-classify tumours as HCC-like or iCCA-like, with attention maps showing that iCCA-like areas drove discrimination. Similar work used multiparametric MRI radiomics to classify HCC-iCCA43 and inflammatory pseudotumours44 pre-operatively. Wei et al.45 created LilNet, an automated detection system for hepatic lesions from multiphased-enhanced CT, successfully distinguishing focal nodular hyperplasia, haemangiomas, and cysts with 88.6% accuracy and highlighting the potential of AI/ML as a clinically deployable tool in settings with limited radiology resources.

Prognostic opportunities

Prognostication in MASLD largely depends on fibrosis severity. Existing non-invasive tests (such as Fibrosis-4 index (FIB-4), Enhanced Liver Fibrosis test, and vibration-controlled transient elastography (VCTE)) assess fibrosis, but their performance in population-level screening remains suboptimal46. Improved risk stratification may be achieved through earlier recognition of anthropometric, genetic, and metabolic risk factors, as opportunities for intervention diminish once significant fibrosis is established. A comprehensive overview of prognostic applications of AI in hepatology is provided in Supplementary Table 3.

Risk prediction

Using routine EHR data, AI/ML models can deliver high-throughput, individualised risk estimates for liver disease across the general population. In MASLD, multiple large cohort studies have identified optimal predictors of CLD incidence and progression47,48,49. Yu et al.50 constructed a model using an RF with recursive feature elimination on ten routine clinical variables, outperforming traditional risk indicators, with body mass index, waist-to-hip ratio, triglycerides, and fasting glucose among the top predictors, all potentially actionable via weight loss and glycaemic control. Njei et al.51 developed an Extreme Gradient Boosting (XGBoost) classifier to identify individuals with MASLD at high risk of MASH based on alanine aminotransferase (ALT), gamma-glutamyl transferase (GGT), platelets, waist circumference, and age, surpassing NITs and demonstrating an approach to triage without FibroScan®. Complementing phenome-derived findings, transcriptomics52, metabolomics53, and proteomics54 studies support ML-based risk prediction and stratification of MASLD, often demonstrating lipid-centred signatures as dominant risk signals.

HCC risk stratification, prognosis, and recurrence

The annual incidence of HCC in patients with cirrhosis is ~2–3%55, but risk is heterogeneous. Guo et al.56 developed a metabolomic risk model of end-stage cirrhosis (including HCC) from UK Biobank participants. Based on eight serum metabolites, the model outperformed polygenic risk scores and, when integrated with routine clinical variables, accurately predicted 10-year outcomes. In a different cohort, CNN modelling predicted HCC occurrence from tumour-free baseline WSIs with ~82% accuracy in validation, with saliency maps highlighting nuclear atypia, a high hepatocellular nucleus-to-cytoplasm ratio, immune cell infiltrates, and a lack of large fat droplets as predictive histopathological signals beyond fibrosis57. Further AI/ML prognostic studies identified liver fibrosis58, angiogenesis59, and glycosylation mechanisms60 as important features for risk stratification, but mainly in already diagnosed HCC patients. AI/ML has also enhanced HCC surveillance in viral hepatitis. In chronic HBV, Wu et al.61 trained an Artificial NN that accurately estimated 10-year risk in antiviral therapy (AVT)-treated patients, while in cured HCV, Nakahara et al.62 applied Random Survival Forests to routine laboratory tests to define four 5-year risk strata. Strikingly, many events fell outside guideline cut-offs, underscoring AI’s value in calibrating surveillance.

Studies of HCC recurrence after transplant63, ablation64, or immunotherapy65 have also supported the use of AI/ML-derived risk scores to guide treatment decision-making. For example, single-cell mapping of primary and early-relapse HCC revealed rewired tumour-immune crosstalk, dominated by MIF-CD74/CXCR4 signalling between malignant cells and CD8⁺ T cells, yielding a LASSO/Cox-derived 7-gene relapse score that outperformed clinical covariates and identified high-risk tumours66. MVI67, elevated alpha-fetoprotein63, and peritumoural radiomic64 and pathomic68 features were additional predictors of recurrence.
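The idea of deriving a compact prognostic signature with an L1-penalised (LASSO-type) Cox model can be sketched as follows, using the lifelines library on simulated survival data; the feature names, penalty settings, and retained coefficients are illustrative and unrelated to the cited 7-gene score.

```python
# Sketch of an L1-penalised Cox model used to derive a compact risk score (simulated data; lifelines assumed installed).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 300
genes = [f"gene_{i}" for i in range(20)]               # hypothetical expression features
X = pd.DataFrame(rng.normal(size=(n, 20)), columns=genes)

# Simulate survival times influenced by a few genes, plus independent right-censoring.
risk = 0.8 * X["gene_0"] + 0.6 * X["gene_1"] - 0.5 * X["gene_2"]
time = rng.exponential(scale=np.exp(-risk.to_numpy()))
censor = rng.exponential(scale=1.5, size=n)
df = X.assign(duration=np.minimum(time, censor), event=(time <= censor).astype(int))

# l1_ratio=1.0 gives a LASSO-type penalty, shrinking most coefficients to (near) zero.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="duration", event_col="event")

selected = cph.params_[cph.params_.abs() > 1e-3]
print("Features retained in the risk score:\n", selected.round(2))
print("Training concordance index:", round(cph.concordance_index_, 3))
```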

Other complications

Accurate prognostication is also important in predicting decompensation and liver failure. In PSC, Singh et al. trained CNNs on portal-venous phase CTs to predict decompensation. Half-volume experiments69 and body composition quantification70 suggested diffuse signal contribution, supporting whole-organ phenotyping as a biomarker of deterioration. In MASLD, ML models showed that combinations of routine laboratory tests and imaging modalities dominated decompensation prediction71,72. In HBV-related cirrhosis, an RF combining GP73 and α1-microglobulin with age, aspartate aminotransferase, ALT, and platelets best predicted decompensation. Interaction analyses showed that non-linear ML models captured transition risk better than linear indices like FIB-473. In surgical settings, multimodal DL models accurately predicted post-hepatectomy liver failure pre-operatively74, while peri-operative EHR-based monitoring enabled early post-operative detection75, collectively supporting AI/ML’s value across the surgical timeline. Models predicting non-liver outcomes have also been developed. Veldhuizen et al.76 used a self-supervised Transformer NN to predict major cardiovascular events from liver MRI. Saliency maps implicated hepatic veins, inferior vena cava, and abdominal aorta health as key predictive features. MASLD also predisposes to renal complications. Sun et al.77 used ML-driven qFibrosis® digital histopathology quantification to track collagen remodelling around pericentral/central veins, which predicted estimated glomerular filtration rate and outperformed conventional histology.

Liver transplantation

AI/ML may help improve outcomes following liver transplantation (LT) by predicting graft survival and guiding clinical management. Sharma et al.78 developed GraftIQ, a clinician-informed multi-class NN integrating clinicopathological data from the 30 days pre-biopsy to accurately classify graft injury aetiologies. Using t-SNE unsupervised clustering, Chichelnitskiy et al.79 profiled soluble immune mediators from a prospective paediatric cohort, identifying a high CD56bright NK-cell plasma signature, detectable two weeks post-LT, that was associated with higher rejection-free survival, suggesting actionable, non-invasive markers to guide immunosuppression. Further immune80 and metabolome81-based AI/ML approaches have assessed drivers of LT dysfunction/rejection and their potential prognostic value.

Mortality

Mortality risk estimation in CLD has long relied on the Model for End-stage Liver Disease (MELD) score, but AI/ML may allow greater discrimination. To predict HCC-related mortality, multiple studies integrated CNN auto-segmentation and regression-driven feature selection of pre-treatment CT scans with clinical variables, and most models outperformed traditional prognostic risk scores, emphasising image-derived features as powerful predictors of overall survival (OS)82,83. Complementing radiomics, Sun et al.84 derived a 3-gene epithelial-mesenchymal transition immune risk score that stratified patients by 5-year OS. Generalisable HCC prognostic modelling from clinical registries has also been shown to accurately predict OS85,86.
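For reference, the MELD score mentioned above is itself a simple log-linear formula; a minimal sketch of the commonly used (UNOS-style) implementation is shown below, noting that local implementations vary and newer variants (MELD-Na, MELD 3.0) add further terms.

```python
# Sketch of the classic MELD calculation (UNOS-style bounds); newer variants and local rules differ.
import math

def meld(bilirubin_mg_dl: float, inr: float, creatinine_mg_dl: float, on_dialysis: bool = False) -> int:
    """Classic MELD = 3.78*ln(bilirubin) + 11.2*ln(INR) + 9.57*ln(creatinine) + 6.43."""
    # Values below 1.0 are floored at 1.0; creatinine is capped at 4.0 (or set to 4.0 if on dialysis).
    bili = max(bilirubin_mg_dl, 1.0)
    inr = max(inr, 1.0)
    creat = 4.0 if on_dialysis else min(max(creatinine_mg_dl, 1.0), 4.0)
    score = 3.78 * math.log(bili) + 11.2 * math.log(inr) + 9.57 * math.log(creat) + 6.43
    return min(round(score), 40)                      # scores are conventionally capped at 40

print(meld(bilirubin_mg_dl=2.5, inr=1.8, creatinine_mg_dl=1.2))
```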

In MASLD, Drozdov et al.87 used a Transformer NN to predict all-cause mortality at 12–36 months, with age, type-2 diabetes, and prolonged prior hospitalisation among key predictive factors. Huang et al.53 built a metabolome-derived score that accurately identified patients with biopsy-proven MASH and predicted liver-related mortality more accurately than clinical covariates.

AI/ML also improved prognosis prediction in acute settings. In ALD, interpretable ML outperformed legacy scores for short-term mortality, from intensive care unit parameter-based models in alcoholic cirrhosis88 to the global ALCHAIN ensemble in alcohol-associated hepatitis89, providing explainable risk factors and a bedside web tool that can inform steroid triage.

Therapeutic opportunities

Clinical outcomes in hepatology remain unpredictable with current management, with some patients progressing to end-stage liver disease despite removal of the underlying cause and others showing heterogeneous treatment responses90. A comprehensive overview of therapeutic and other applications of AI in hepatology is provided in Supplementary Table 4.

Drug discovery and repurposing

AI/ML is enabling the discovery of therapeutic targets across the CLD spectrum. Combining Cox Regression with a Gradient Boosted Machine, Wen et al.91 generated a multiomic Consensus AI-derived Prognostic Signature (CAIPS) from HCCs. When integrated with pharmacological databases, the model recommended irinotecan and the PLK1 inhibitor BI-2536 for high-CAIPS profiles, subsequently validated in vitro. In MASLD-related HCC, Sun et al.92 derived metabolic dysfunction scores from public genomic databases and identified CACNB1 as a putative druggable target, with molecular docking analysis highlighting calcium-channel agents as testable inhibitors. Similarly, Venhorst et al.93 fused phenotypic and transcriptomic profiling to propose an EP300/CBP bromodomain inhibitor, inobrodib, as an anti-fibrotic strategy in MASH. Finally, through proteomic profiling of serum samples associated with PSC progression to cirrhosis, Snir et al.94 described a CCL24-defined druggable chemokine axis; a clinical trial for anti-CCL24/CM-101 immunotherapy is ongoing, with positive Phase 2 signals (NCT04595825)95.

AI tools are also becoming embedded within therapeutic pipelines. Ren et al.96 integrated an AI-based platform for therapeutic target prioritisation (PandaOmics) with generative chemistry (Chemistry42) to investigate CDK20 inhibition in HCC, discovering a nanomolar hit in just 30 days using an AI-driven protein-structure prediction system (AlphaFold), later confirmed in vivo.

Treatment response

AI/ML supports a response-adaptive therapeutic approach, guiding drug choice for optimised personalised care whilst preventing toxicity. Using images of human liver organoids, Tan et al.97 created a spatiotemporal DL model which performed ternary DILI grading, identifying toxic compounds that standard spheroid assays miss. Other models were developed to anticipate hepatotoxicity98, differentiate animal-only toxicities99, prioritise synergistic drug combinations100, or assess treatment efficacy in population subsets in silico. For example, clinical benefits of atezolizumab plus bevacizumab (AB) immune checkpoint inhibitors have been observed in patients with unresectable HCC101. Zeng et al.102 used H&E pathomics to derive an immune AB response signature and identify patients with longer progression-free survival, while Vithayathil et al.83 externally validated a model incorporating pre-treatment CT radiomics and clinical features predicting 12-month mortality risk on AB, successfully stratifying response rates and outperforming traditional risk scores.

Beyond cancer, Fan et al.103 developed a novel tool for AVT assessment by turning longitudinal serum quantitative HBV surface antigen trajectories into individualised antigen-loss probabilities, identifying ~8–10% of patients with a high probability of viral clearance. External validation in clinical trials confirmed that “favourable” patients had markedly higher treatment response104. Finally, Yang et al.105 constructed a multiomic model predicting suboptimal biochemical response in PBC/AIH variant syndrome, highlighting dysregulated lipid metabolism and immune (e.g., IL-4/IL-22) pathways as key pathogenic factors, enabling timely escalation in likely non-responders.

Lifestyle interventions

AI/ML has the potential to identify effective and actionable lifestyle changes. While an independent audit of ChatGPT meal plans for MASLD found plausible weight loss advice but frequent mistakes and guideline omissions106, Joshi et al.107 showed in a 1-year randomised trial that an AI “digital twin” delivering personalised nutrition, physical activity, and sleep schedule recommendations improved MASLD liver-fat and fibrosis scores more than standard care.

Other opportunities

AI is increasingly integrated into routine healthcare workflows, often functioning as a support tool or “co-pilot” for clinicians or patients (Fig. 3), thereby maintaining human oversight.

Fig. 3: Use of artificial intelligence across the clinician-patient journey.
figure 3

Conceptual flowchart illustrating where patients and clinicians interface with Artificial Intelligence (AI) across the care continuum, from pre-visit planning and triage, through admission, consults and interventions, to at-home support, applied biomedical research and clinical trials, and finally communication. Throughout the journey, clinician-directed co-intelligence intakes all or selected inputs, executes routine or targeted tasks, and returns outputs to care teams or patients under clinical oversight, thus supporting, not supplanting, clinical judgement. SDoH social determinants of health, EHRs electronic health records, Q&A question and answer.

Clinical co-pilots

LiVersa, a liver-specific large language model (LLM) built from ~30 AASLD guidance documents, correctly answered a trainee HBV/HCC knowledge set, generating more specific outputs than ChatGPT108. Relatedly, LiVersa also accurately extracted structured elements from HCC imaging reports in a head-to-head comparison with manual reviewers109. A more generalist domain-specific vision language model for pathology, PathChat, was also recently introduced110. Beyond text extraction, Xu et al.111 demonstrated that LLMs (GPT-4, Gemini) achieved near-expert accuracy in predicting immunotherapy response for unresectable HCC. Parallel work developed a radiomics-DL-LLM agent for personalised HCC treatment planning112. In the operating room, LiverColor, a smartphone app using CNN architecture for colour and texture analysis, could classify steatosis in liver grafts in <5 s, outperforming surgeons for >15% steatosis, although performance at >30% remained limited by sample size113.

Clinical trials

AI/ML is also helping to reshape clinical trials. NASHmap, an EHR-based XGBoost model using 14 routine variables, accurately predicted biopsy-confirmed MASH and, when applied to ~2.9 million at-risk adults, identified 31% as probable MASH, representing a pragmatic pre-screening recruitment tool114. Within digital histopathology, AIM-MASH automated eligibility and endpoint scoring with agreement comparable to expert consensus, also detecting a greater proportion of treatment responders than central readers115. In February 2025, the European Medicines Agency issued a Qualification Opinion allowing AIM-MASH as an aid to single central pathologists for Phase 2/3 enrolment and histology-based endpoint evaluation116. Across ~1400 biopsies from four trials, AI-assisted pathologists outperformed independent manual readers for key histological components while remaining non-inferior for steatosis and fibrosis117.

Social determinants of health

AI/ML is increasingly able to capture social determinants of health (SDoH) from clinical notes, helping identify access gaps, support model fairness, and build more diverse cohorts. In MASLD, factors such as education, food insecurity, and marital status are linked to higher disease burden48,118,119,120, underscoring the need for equity-aware study design. For example, Wang et al.121 showed that Black-White performance gaps in 1-year mortality prediction across chronic diseases (including CLD) disappeared once SDoH were balanced. In LT, Robitschek et al.122 used an LLM to extract 23 psychological/SDoH factors from evaluation notes, improving prediction of listing outcomes and elucidating drivers of transplant decisions.

Challenges of AI

Despite promising results for the use of AI/ML to improve care in hepatology, there are limitations to address before real-world clinical adoption.

Technical challenges

Most AI models in hepatology are built on retrospective, single-centre cohorts or public registries with narrow demographics and limited follow-up, making them prone to overfitting, with poor generalisability and limited transparency123.

Data quality remains a major bottleneck, as label noise (e.g., biopsy sampling errors, inter-observer variability) and inconsistent preprocessing pipelines (e.g., imaging protocols, EHR completeness) undermine reliability and standardisation124. Furthermore, the high heterogeneity of real-world hepatology populations (i.e., variable aetiologies, disease prevalence, demographics), evolving clinical practice guidelines, and limited adherence to evaluation and reporting standards (e.g., TRIPOD+AI125, CONSORT-AI126, DECIDE-AI127) all complicate model reproducibility. Additional challenges are posed by dataset and concept shifts, where differences between training datasets and real-world populations may degrade model performance128. Multi-centre validation and continuous post-deployment monitoring for calibration drift (the gradual loss of accuracy in measurements or predictions over time) are therefore essential to maintain long-lasting clinical reliability.
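A minimal illustration of such post-deployment monitoring is sketched below: a frozen model’s discrimination (AUROC) and calibration (Brier score) are recomputed on successive time windows of simulated data, with an artificial covariate shift in the second window. Metrics, window sizes, and alert thresholds would need local definition; nothing here reflects a deployed system.

```python
# Illustrative post-deployment monitoring check: compare a frozen model's performance across time windows (simulated data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

X_train, y_train = make_classification(n_samples=1000, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

rng = np.random.default_rng(0)
for window in ["quarter_1", "quarter_2"]:
    X_new, y_new = make_classification(n_samples=400, n_features=8, random_state=7)
    if window == "quarter_2":
        X_new = X_new + rng.normal(0, 1.0, size=X_new.shape)   # simulate dataset shift
    prob = model.predict_proba(X_new)[:, 1]
    print(window,
          "AUROC:", round(roc_auc_score(y_new, prob), 3),
          "Brier score:", round(brier_score_loss(y_new, prob), 3))
```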

Clinical credibility of AI depends on rigorous evaluation and transparency. Where possible, prospective or “AI-in-the-loop” randomised trials comparing AI-assisted and standard care are essential to determine true clinical benefit. Such studies have been piloted in liver imaging but remain uncommon. Assessing model interpretability and explainability helps reveal which features drive predictions and whether models rely on spurious or biased patterns. For example, outputs may be influenced more by fibrosis stage or demographic factors than by disease biology itself. Techniques such as SHapley Additive exPlanations (SHAP) or Local Interpretable Model-agnostic Explanations (LIME) provide post hoc insight into model reasoning, while attention or saliency maps visually highlight image regions most influencing a prediction129.
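As an illustration, SHAP values can be obtained for a tree ensemble in a few lines; the sketch below uses simulated tabular data with hypothetical feature names and assumes the shap package is installed.

```python
# Sketch of post hoc explanation with SHAP on a tree ensemble (simulated data; requires the shap package).
import numpy as np
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical tabular features standing in for routine clinical variables.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=["age", "bmi", "alt", "platelets", "fib4", "glucose"])

model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)            # one contribution per sample per feature (log-odds scale)

# Rank features by mean absolute contribution; SHAP also offers summary/force plots for visual inspection.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns).sort_values(ascending=False)
print(importance.round(3))
```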

Overall, when data are limited, simpler models may outperform deep networks, which often sacrifice transparency for marginal accuracy gains5. Beyond architecture, reliable AI deployment requires quantifying uncertainty and the ability to “abstain” in low-confidence cases, a critical safeguard for clinical integration130. No single modelling approach is universally superior. Robust feature selection, transfer or self-supervised learning, and systematic sensitivity analyses are essential for producing interpretable and reproducible biomarkers131. Equally, LLMs introduce additional risks, including hallucinations, prompt sensitivity, and over-confident errors132, underscoring the need for task-specific evaluation, transparent data sourcing, and human oversight.
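A simple form of abstention is a reject option on predicted probabilities, deferring low-confidence cases to human review; the sketch below is generic, uses an arbitrary threshold rather than a validated clinical rule, and is not drawn from any cited system.

```python
# Sketch of a simple reject option ("abstain" when the prediction is uncertain); threshold is arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]

confidence = np.maximum(prob, 1 - prob)           # distance from the 0.5 decision boundary
accept = confidence >= 0.8                        # defer uncertain cases to a human reviewer
accuracy_accepted = ((prob[accept] >= 0.5).astype(int) == y_test[accept]).mean()

print("Cases referred for human review:", int((~accept).sum()), "of", len(y_test))
print("Accuracy on confidently predicted cases:", round(accuracy_accepted, 3))
```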

Regulatory complexity

Regulation of medical AI is progressing but remains uneven across regions. In the EU, the AI Act introduces rules in stages, with bans on unacceptable uses and AI literacy measures from 2024, general-purpose AI rules from 2025, and “high-risk” medical device standards phased in through 2026–2027133. In the US, the FDA’s Predetermined Change Control Plans (PCCPs) allow pre-authorised post-market model updates for AI-enabled software, supporting safer iteration and adaptation134. In the UK, MHRA’s “Software and AI as a Medical Device” framework defines expectations for development and post-market monitoring135.

Data privacy remains a major challenge for multi-centre research. Under the EU GDPR, secondary use of health data requires clear legal grounds. The European Health Data Space aims to streamline data sharing136, although coordination remains complex. Privacy-preserving approaches, such as federated learning, may help, particularly for rare and paediatric liver diseases where cohorts are small137. Despite these efforts, uncertainty persists about when AI/ML tools qualify as medical devices and how best to assess their safety, effectiveness, and fairness. Divergent regulations slow adoption and deter investment138.

Finally, data privacy and patient safety are closely linked. US and EU regulators now emphasise “secure-by-design” AI systems to counter growing cyber risks139,140. Demonstrations that manipulated medical images can mislead both clinicians and algorithms141 underscore the need for verification tools and secure data pipelines.

Ethical limitations

Without safeguards, AI/ML can amplify existing inequities in hepatology. Minority ethnic groups, women, and under-represented cohorts have been systematically disadvantaged by MELD-based LT prioritisation142, HCC risk modelling143, and other predictive tools. Bias mitigation requires diverse training across ethnicity, sex, and socioeconomic strata, subgroup calibration, transparent equity reporting, and post-deployment audits144.

Importantly, accountability for AI-driven decisions is still unclear. The EU AI Act assigns responsibilities to developers and users of high-risk medical AI145, but US PCCPs leave liability unresolved134, often defaulting to clinicians. Clearer rules on who is responsible are needed in governance and public communication.

AI also has an environmental impact. Data centres already account for ~1.5% of global electricity use, projected to more than double by 2030146. Training GPT-3 alone was estimated to consume ~700,000 litres of cooling water147. With health systems like the NHS beginning to mandate disclosure of environmental costs148, sustainability must become integral to medical AI deployment.

Clinical integration

Two main barriers hinder clinical integration: an evidence gap (few large, prospective, multi-centre trials) and a deployment gap (limited integration of AI into workflows). Without transparency, clinicians often revert to familiar statistical tools or user-friendly chatbots valued for convenience over accuracy. A recent EASL consensus identified key requirements for adoption, including demonstrated clinical benefit, rigorous prospective validation, and benchmarking against best statistical baselines. Despite enthusiasm, only ~4% of EASL 2024 abstracts used AI/ML, reflecting early adoption in hepatology149. Ongoing concerns include clinician distrust, regulatory uncertainty, and poor system interoperability. Facilitators include interdisciplinary collaboration, shared data resources, sustainable funding, and improved AI literacy. Positioning AI as “assistive” rather than “autonomous” may also reduce workforce anxiety. However, lessons from EHR adoption warn that poorly integrated systems can add to clinician workload150.

Beyond these structural challenges lies a more subtle concern: preserving clinical expertise amid growing algorithmic support. Senior clinicians increasingly question how future specialists will develop skills if AI shortens traditional learning pathways. In a recent multi-centre study, continuous exposure to AI-assisted polyp detection led to reduced performance during subsequent unassisted colonoscopies151, suggesting early signs of deskilling. Mitigation requires deliberate integration strategies, embedding AI as a co-pilot rather than a replacement, maintaining unassisted practice, and ensuring ongoing skill calibration. Targeted education and hybrid training (with/without AI support) are essential to preserve sound clinical judgment.

Health economics

Cost-saving claims for AI/ML tools in hepatology remain largely speculative. Existing evaluations are scarce, methodologically inconsistent, and rarely patient-centred152. The potential system-level value lies in earlier detection, workflow automation (e.g., radiology/histopathology quantification), and risk-based triage using EHR data. Of note, while commercial models (e.g., ChatGPT, Claude) may incur licensing costs, the main financial burden of medical AI stems from infrastructure, integration, validation, and governance rather than model access itself.

Experts emphasise early involvement of health economists to design robust cost-effectiveness studies that capture true implementation costs, effects on clinician time and workflow, and downstream resource reallocation, while avoiding costs from misclassification. Such evidence is essential to establish both financial and clinical viability, particularly in resource-limited settings153.

Conclusion

Centred on the patient, the AI/ML lifecycle (spanning purpose, population, data, model development, validation, and deployment) offers a pragmatic framework for hepatology. Applied responsibly, multimodal data integration and assistive algorithms can enable earlier diagnosis, more accurate prognosis, and personalised therapy. Successful clinical translation will depend on generalisability, transparency, and longitudinal performance monitoring to detect drift, alongside robust privacy, security and equity safeguards, clear demonstration of health economic value, and workflow-embedded human oversight, to shift liver care from reactive to proactive.