Introduction

Hepatocellular carcinoma (HCC) is the sixth most frequently diagnosed cancer and the third leading cause of cancer-related deaths1,2. Liver transplantation (LT) and partial hepatectomy are the primary treatment options; however, recurrence after LT remains a challenge, leading to complications that impact both short- and medium-term outcomes. Recurrence rates range from 10 to 58%, depending on the stage at the time of diagnosis3,4,5. Additionally, 75% of recurrences occur within 2 years post-transplantation6, and median survival after recurrence is only between 8.7 and 17 months7,8. Consequently, accurate prediction of recurrence is important, yet remains challenging, for both LT patients and the clinicians responsible for their treatment.

The Milan criteria9 have been the gold standard for almost 30 years; however, researchers have suggested that they may be overly restrictive, as they focus primarily on tumor morphology rather than incorporating demographics and other disease-specific variables7. Improving patient selection could help to allocate resources more effectively to specific HCC patients, and several studies have identified factors that provide insights into tumour biology, which could be used to refine selection criteria3,10,11,12. Furthermore, while recurrence models are the preferred decision-support tool for post-LT patients, there is also a need to improve the identification of those at high risk of recurrence13,14. A deeper understanding of recurrence risk would allow physicians to personalise HCC treatments more effectively, enhancing post-LT care for each patient.

Deep learning (DL) is enhancing case identification, treatment approaches, and overall healthcare15. DL also has the potential to detect both linear and non-linear relationships, as well as potential associations, within complex multi-factorial datasets. In HCC, DL research has the potential to enhance early detection, prognosis prediction, and treatment planning. However, HCC is highly heterogeneous and often diagnosed at an advanced stage, and traditional methods have limited accuracy and consistency16, which necessitates new models. By integrating diverse data sources, DL has the potential to advance personalised medicine while optimising resource allocation and improving HCC patient outcomes. The objective of this study was to develop and validate DL-based pre- and post-LT prediction models for HCC recurrence to drive the individualization of care.

Methods

Patients and data

This study adhered to the principles of the Declaration of Helsinki (revised in 2013) and the Declaration of Istanbul. It was approved by the Institutional Review Boards (IRBs) of Beijing Chaoyang Hospital (2020-S-303), the Second Xiang-ya Hospital (2019-S-NO.154), and Beijing Friendship Hospital (2024-P2-363-01). The need for written informed consent was waived by all three IRBs because this was a retrospective study and data were anonymized a priori. No organs or tissues were obtained from incarcerated individuals.

Data from a total of 501 HCC patients were collated from three centers: 278 underwent LT at Beijing Chaoyang Hospital (hereafter center 1) between January 2015 and December 2021, 154 at the Second Xiang-ya Hospital (center 2) between November 2016 and October 2020, and 69 at Beijing Friendship Hospital (center 3) between January 2019 and November 2021. At all three centers, most patients underwent whole liver transplants from deceased donors, except for 12 patients in center 3 who received grafts from living relatives. Inclusion criteria were as follows: (1) histopathologic diagnosis of only HCC on explanted livers; and (2) age older than 18 years. Note that, unlike regulations in the United States for HCC patients, there is no mandatory 6-month waiting period in the People’s Republic of China. Patients who had received preoperative systemic chemotherapy and/or locoregional therapies were eligible for inclusion. Of the 278 patients assessed at center 1, 243 met the inclusion criteria, while 35 were excluded due to incomplete data (n = 24), death from non-tumour-related complications within 90 days of transplantation (n = 10, including infection and bleeding), fibrolamellar carcinoma (n = 0), or retransplantation (n = 1).

Patient characteristics encompassed demographics, clinical data, pre-transplant therapies, laboratory results, and tumor imaging. Imaging technologies included computed tomography (CT) and magnetic resonance imaging (MRI), performed after liver-directed treatment and near the time of transplantation. Additionally, explant pathology data were recorded, including the maximum tumour diameter and the presence of necrotic areas.

Explant pathology data were independently reviewed by two pathologists. The histologic grade of tumor cell differentiation was based on Steiner grading17, and the poorest grade of differentiation was recorded when explants exhibited heterogeneous differentiation.

Follow-up and recurrence definition

Patients were monitored for HCC recurrence using alpha-fetoprotein levels, thoracic CT, and either abdominal CT or MRI. Follow-ups were conducted every 3 months during the first 2 years after transplantation, or as clinically indicated, and biannually thereafter. Additional imaging, namely positron emission tomography (PET) and bone scans, was performed if recurrence was suspected. The primary outcome for the binary classification models was HCC recurrence within 2 years post-LT. For survival analysis, the primary outcome was HCC recurrence within the predefined follow-up period.

Recurrence-free survival (RFS) was calculated from the date of LT until tumor recurrence. HCC recurrence was determined by either (1) imaging evidence (CT, MRI, or other modalities) of recurrent or metastatic tumor, or (2) pathological diagnosis. Once HCC recurrence was identified, patients underwent further investigations and received appropriate treatments.
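
As an illustration of how the two outcomes defined above could be derived from follow-up dates, the following minimal Python sketch computes an RFS time with an event indicator, and the binary 2-year recurrence label. The function names, the days-per-month constant, and the treatment of patients censored before 24 months as non-recurrent are illustrative assumptions, not details reported in this study.

```python
from datetime import date

DAYS_PER_MONTH = 30.4375  # assumption: average month length (365.25 / 12)


def rfs_months(lt_date, recurrence_date, last_follow_up):
    """Return (months, event): time from LT to recurrence or censoring.

    event is 1 if a recurrence was observed, 0 if the patient was
    censored at the last follow-up date.
    """
    end = recurrence_date if recurrence_date is not None else last_follow_up
    months = (end - lt_date).days / DAYS_PER_MONTH
    return months, int(recurrence_date is not None)


def recurred_within_2y(months, event):
    """Binary outcome for the classifiers: recurrence within 24 months.

    Assumption: patients censored before 24 months are labelled 0 here;
    a real analysis would need to decide how to handle them.
    """
    return int(event == 1 and months <= 24.0)


# Hypothetical patients: one early recurrence, one censored at study end.
m1, e1 = rfs_months(date(2019, 1, 1), date(2019, 7, 1), date(2021, 12, 31))
m2, e2 = rfs_months(date(2019, 1, 1), None, date(2021, 12, 31))
```

The survival models would consume the (months, event) pairs directly, whereas the binary classifiers would use only the 2-year label.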

Data analysis process

A study process flowchart has been provided in Fig. 1.

Fig. 1

Research flowchart including variable screening, model construction, and external validation with patients from two additional centers. LR logistic regression, RF random forest, SVM support vector machine.

(1) Initially, variables were selected using three methods: support vector machine (SVM) based on Shapley values, logistic regression (LR) based on coefficients, and random forest (RF) based on Gini scores. Variables identified by at least two of these methods were included in the model.

(2) We then developed two types of models: survival analysis models and binary classifiers. For survival analysis, we applied two methods, Cox proportional hazards regression and DeepSurv18, using recurrence status during the post-surgery follow-up period as the primary outcome. The binary classifiers comprised LR, artificial neural networks (ANN), RF, and SVM. Additionally, we developed a stacking ensemble model, which used LR, RF, and SVM as base learners and LR as the meta-learner. For these classifiers, recurrence status within 2 years post-surgery was set as the primary outcome.

(3) We evaluated and compared model performance, and estimated each patient’s recurrence risk with the best-performing model. Details of each model and the indices used to evaluate them are given in Supplementary Method 1 in the supplementary materials.

(4) Using the Youden index, we then classified patients into high- and low-risk recurrence groups. Kaplan-Meier survival analysis and log-rank tests were performed to assess the predictive accuracy of our pre- and post-DeepSurv models (DSMs) compared to established criteria, including the Milan criteria, UCSF criteria, up-to-7 criteria, Fudan-Shanghai criteria, RETREAT model, AFP-French model, Hangzhou criteria, and Metroticket 2.0 7,19,20,21.
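
The Youden-index step in (4) can be illustrated with a short Python sketch. It scans every observed risk score as a candidate cut-off and keeps the one that maximises J = sensitivity + specificity - 1. The function name, the toy data, and the convention that a score at or above the cut-off counts as high-risk are assumptions made for illustration; the study's own analyses were performed in R.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1.

    labels: 1 = recurrence within 2 years, 0 = no recurrence.
    A patient is called high-risk when score >= cutoff (assumed convention).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one patient in each outcome class")
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # ties keep the lowest cut-off
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data (hypothetical, not from the study cohort).
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.80, 0.15]
toy_labels = [0, 0, 0, 1, 1, 1, 0, 0]
cutoff, j = youden_cutoff(toy_scores, toy_labels)
```

On this toy input the selected cut-off is 0.35, where all three recurrences are captured and four of the five non-recurrent patients fall below the threshold.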

The Akaike information criterion (AIC) was calculated to evaluate the potential risk of overfitting. A likelihood-based method was applied to the type I censoring design. All statistical analyses were performed with R software (version 4.4.0).

Deep learning model: DeepSurv

DeepSurv is a deep learning-based survival analysis model that predicts duration of survival. It is particularly effective at handling nonlinear and high-dimensional data, outperforming traditional models in these scenarios22,23. DeepSurv is based on features taken from the artificial neural network (ANN) and Cox’s proportional hazard (CoxPH) regression models. ANN was used as a preprocessing model to filter sample features, while CoxPH was applied to estimate the risk function by integrating the ANN model with a neural network regression framework.

The ANN consisted of one output layer and two hidden layers, each containing 16 nodes. The hidden layers used a hyperbolic tangent activation function, and each comprised a fully connected layer together with a dropout layer to enhance generalization. The Adam optimizer was used with the negative log partial likelihood as the loss function, incorporating batch normalization, weight decay regularization, and learning rate scheduling. Training was performed with a learning rate of 0.001 and no learning rate decay.
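
To make the loss function concrete, the sketch below evaluates the negative log partial likelihood for a set of network outputs in plain Python. It is a minimal illustration under stated assumptions: ties are handled Breslow-style, the loss is averaged over observed events, and the function and argument names are hypothetical. It is not the training code used in this study.

```python
import math


def neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative Cox log partial likelihood.

    risk_scores: network outputs (log-risk) per patient.
    times:       follow-up time per patient.
    events:      1 if recurrence was observed, 0 if censored.
    """
    total, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored patients only contribute through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set_sum = sum(math.exp(q) for q, t in zip(risk_scores, times) if t >= t_i)
        total += risk_scores[i] - math.log(risk_set_sum)
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events


# Toy check: scores that rank early failures highest give a lower loss.
toy_times, toy_events = [1.0, 2.0, 3.0], [1, 1, 1]
loss_good = neg_log_partial_likelihood([2.0, 1.0, 0.0], toy_times, toy_events)
loss_bad = neg_log_partial_likelihood([0.0, 1.0, 2.0], toy_times, toy_events)
```

An optimizer such as Adam would minimise this quantity with respect to the network weights that produce the risk scores.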

The final ANN output served as the covariate in the Cox model, combining all covariate features from the original sample into a single feature, denoted q(x). The baseline risk in the Cox model was obtained with a Nelson-Aalen model, a linear univariate risk model that takes event status and time as input to give the baseline function b_0(t). A DeepSurv model was then established based on clinicopathologic features, with the risk score for each patient as the outcome of interest. Univariate logistic regression was further applied to calibrate the final recurrence score.
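
For orientation, the relationship between q(x) and b_0(t) described above can be written in the standard Cox form used by DeepSurv-type models. The block below is a sketch under that assumption rather than a formula reproduced from this study: d_i denotes the number of recurrences at event time t_i, R(t_i) the set of patients still at risk at t_i, and \hat{B}_0(t) the cumulative baseline hazard, all of which are notation introduced here for illustration.

```latex
% Assumed standard Cox / DeepSurv formulation; q(x) and b_0(t) as in the text.
\begin{align}
  h(t \mid x)  &= b_0(t)\,\exp\bigl(q(x)\bigr), \\
  \hat{B}_0(t) &= \sum_{t_i \le t} \frac{d_i}{\sum_{j \in R(t_i)} \exp\bigl(q(x_j)\bigr)}, \\
  S(t \mid x)  &= \exp\Bigl(-\hat{B}_0(t)\,\exp\bigl(q(x)\bigr)\Bigr).
\end{align}
```

When q(x) is zero for every patient, the second line reduces to the Nelson-Aalen estimator, the sum of d_i divided by the number at risk at each event time.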

Results

Patient characteristics

Baseline characteristics for the 466 patients from centers 1 (n = 243), 2 (n = 154), and 3 (n = 69) are presented in Table 1 (see also Table S2 for additional details). The 243 patients from center 1 formed the training and testing cohorts. The 154 patients from center 2 and the 69 patients from center 3 formed external validation cohorts 1 and 2, respectively. The mean age across the three centers combined was 52.7 ± 8.8 years. Hepatitis B was the most prevalent underlying liver disease. Other factors were categorized into preoperative and postoperative data. The median follow-up was 51.0 months (95% confidence interval [CI] 47.8–54.2).

Table 1 Characteristics of patients, tumors, and explants of patients.

Screening clinicopathologic variables

Factors were categorized as preoperative or postoperative and are provided in Tables 1 and S2. The top 16 (Fig. 2A) and 17 (Fig. 2B) factors were selected from 150 potential factors according to their level of significance in association with recurrence, as assessed using machine learning, and in accordance with professional medical opinion.

Fig. 2

Variable importance plots according to support vector machine, random forest, and logistic regression for the pre-model (A) and post-model (B) settings. AFP alpha-fetoprotein, tumor diameter sum of tumor diameters, NEUT neutrophil count, HGB hemoglobin concentration, ALB albumin, NLR neutrophil-to-lymphocyte ratio, PLT platelet count, T-Bil total bilirubin, CREA creatinine, LYM lymphocyte count, BMI body mass index, LYM% lymphocyte percentage, INR international normalized ratio, MVI microvascular invasion.
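
The screening rule described in the Methods, under which a variable is retained when at least two of the three methods identify it, amounts to a simple vote count. The Python sketch below shows one way this could be expressed. The function name is hypothetical, and the per-method shortlists are invented for illustration using abbreviations from Fig. 2; they do not reproduce the rankings obtained in this study.

```python
from collections import Counter


def vote_select(selections, min_votes=2):
    """Keep variables chosen by at least `min_votes` selection methods.

    selections: dict mapping method name -> list of variables it selected.
    """
    votes = Counter(var for chosen in selections.values() for var in set(chosen))
    return sorted(var for var, n in votes.items() if n >= min_votes)


# Hypothetical shortlists from the three methods (illustrative only).
toy_selections = {
    "SVM (Shapley values)": ["AFP", "NLR", "ALB"],
    "LR (coefficients)": ["AFP", "MVI"],
    "RF (Gini scores)": ["NLR", "AFP", "BMI"],
}
kept = vote_select(toy_selections)
```

With these toy shortlists only AFP (three votes) and NLR (two votes) would be retained.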

Construction and evaluation of predictive recurrence models

After comparing model performance, DSMs were selected because of their adaptability and evidence from previous research indicating their superiority over traditional models. In the training cohort, testing cohort, and external validation cohorts 1 and 2, the pre-DSM C-index values within the follow-up period were 0.790 (± 0.003), 0.775 (± 0.037), 0.765 (± 0.001) and 0.819 (± 0.002), respectively. The corresponding post-DSM C-index values were 0.835 (± 0.008), 0.812 (± 0.082), 0.839 (± 0.001) and 0.831 (± 0.002), respectively. The pre- and post-DSM enabled calculation of individual risk, with higher scores corresponding to higher recurrence risk. See Tables S3–S6 and Fig. S1 for calibration curves, sensitivity analysis and further details.
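
For readers less familiar with the C-index, the following Python sketch computes Harrell's concordance index for right-censored data: among all comparable patient pairs, it measures the fraction in which the patient who recurred earlier was assigned the higher risk score. The function name and toy data are hypothetical, and tied event times are simply skipped, which is a simplification relative to production implementations.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i had an observed event and
    patient j was still event-free at a strictly later time. Tied risk
    scores count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): months to recurrence/censoring, event flag, risk score.
c_index = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.5, 0.1],
)
```

In this toy example four of the five comparable pairs are ordered correctly, giving a C-index of 0.8; a value of 0.5 would correspond to random ranking.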

Pre-DSM risk stratification plus comparisons with common criteria

Recurrence risk scores were calculated for all patients using the pre-DSM. The cut-off value was set at a risk score of 0.31429, where Youden’s index reached its maximum (0.505), which made it possible to categorize patients into two prognostically distinct groups (high or low risk). The median scores in the low-risk and high-risk groups were 0.11964 and 0.65409, respectively (p < 0.001). Similar results were observed in all three centers, with 0.11867 and 0.66837 in center 1 (p < 0.001), 0.11867 and 0.66837 in center 2 (p < 0.001), and 0.12621 and 0.68003 in center 3 (p < 0.001). Please see Table S8 for further details.

Using the pre-DSM risk score cut-off of 0.31429 (maximum Youden’s index, 0.505), patients were categorized into two prognostically distinct groups (p < 0.001, Fig. 3B): a low-risk group (2-year RFS: 86.2%, 95% CI 82.1–90.3%) and a high-risk group (2-year RFS: 40.3%, 95% CI 33.4–47.2%). The pre-DSM outperformed the Milan criteria and all other models (Figs. 3A,B, S2A–D, S3 and Table S7). Owing to its superior area under the curve (AUC), the number of transplant candidates who could be safely selected using the pre-DSM increased by 8.7% (n = 22) compared to the Milan criteria (Fig. 3B).
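
The 2-year RFS figures quoted for each risk group are Kaplan-Meier estimates. As a minimal sketch of how such an estimate is obtained, the Python function below multiplies the conditional survival probabilities at each observed recurrence time up to a chosen horizon. The function name and the toy follow-up data are hypothetical, and confidence intervals are omitted.

```python
def kaplan_meier(times, events, horizon):
    """Kaplan-Meier estimate of recurrence-free survival at `horizon`.

    times:  follow-up in months; events: 1 = recurrence, 0 = censored.
    """
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1 and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - failed / at_risk
    return survival


# Toy group (hypothetical): two recurrences before 24 months, one censored.
rfs_2y = kaplan_meier(
    times=[6, 12, 18, 30, 36],
    events=[1, 0, 1, 0, 1],
    horizon=24,
)
```

Here the estimate is (1 - 1/5) × (1 - 1/3) = 8/15, or about 53%; the patient censored at 12 months leaves the risk set without counting as a recurrence.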

Fig. 3

DSM performance compared with common criteria. (A) 2-year ROC of pre-DSM versus the other seven criteria; (B) Kaplan–Meier analysis of pre-DSM and Milan criteria; (C) Deviation ability of pre-DSM and Milan criteria; (D) 2-year ROC of post-DSM versus the other seven criteria; (E) Kaplan–Meier analysis of post-DSM and Milan criteria; (F) Deviation ability of post-DSM and Milan criteria.

Additionally, we stratified the entire cohort into four groups (Fig. 3C). Among patients beyond the Milan criteria, 28.2% (n = 60) were classified as low-risk according to the pre-DSM, with a 16.7% (n = 10) recurrence rate in the first 2 years. Among patients within the Milan criteria, 15.0% (n = 38) were identified as high-risk, with a 42.1% (n = 16) recurrence rate in the first 2 years. The pre-DSM provided more accurate recurrence predictions than the Milan criteria (p < 0.001, Fig. 3C). Similar findings were observed with other commonly used models (Fig. S4).
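
The four-group stratification can be thought of as a cross-tabulation of Milan status against DSM risk group, with the 2-year recurrence rate computed in each cell. The Python sketch below illustrates this with a handful of invented patients. The field names, the toy records, and the convention that a score at or above the cut-off is high-risk are assumptions; only the cut-off value of 0.31429 is taken from the text.

```python
from collections import defaultdict

PRE_DSM_CUTOFF = 0.31429  # pre-DSM cut-off reported in the text


def stratify(patients, cutoff):
    """Cross-tabulate Milan status against DSM risk group.

    Returns {(milan_status, risk_group): {"n": ..., "recurrence_rate": ...}}.
    """
    cells = defaultdict(lambda: [0, 0])  # [patients, recurrences within 2 years]
    for p in patients:
        milan = "within Milan" if p["within_milan"] else "beyond Milan"
        risk = "high-risk" if p["score"] >= cutoff else "low-risk"
        cells[(milan, risk)][0] += 1
        cells[(milan, risk)][1] += p["recurred_2y"]
    return {k: {"n": n, "recurrence_rate": r / n} for k, (n, r) in cells.items()}


# Hypothetical patients (illustrative only, not study data).
toy_patients = [
    {"within_milan": True, "score": 0.10, "recurred_2y": 0},
    {"within_milan": True, "score": 0.70, "recurred_2y": 1},
    {"within_milan": True, "score": 0.50, "recurred_2y": 0},
    {"within_milan": False, "score": 0.20, "recurred_2y": 0},
    {"within_milan": False, "score": 0.25, "recurred_2y": 1},
    {"within_milan": False, "score": 0.80, "recurred_2y": 1},
]
table = stratify(toy_patients, PRE_DSM_CUTOFF)
```

The two discordant cells, beyond Milan but low-risk and within Milan but high-risk, are the groups in which the two approaches disagree and where their recurrence rates can be compared.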

Post-DSM risk stratification plus comparisons with common criteria

We calculated recurrence risk scores for all patients using the post-DSM, with a cut-off value of 0.32606, at which Youden’s index reached its maximum (0.650). The median scores in the high- and low-risk groups were 0.75621 and 0.07906, respectively (p < 0.001). Similar results were observed across all three centers, with 0.77256 and 0.07759 in center 1 (p < 0.001), 0.73305 and 0.07669 in center 2 (p < 0.001), and 0.73571 and 0.09124 in center 3 (p < 0.001). See Table S6 in the supplementary materials for further details.

Utilizing post-DSM scores and employing a cut-off value of 0.32606, determined using the maximum value in Youden’s index (0.650), patients were categorized into two prognostically distinct groups (p < 0.001, Fig. 3E): a low-risk group (2-year RFS: 90.1%, 95% CI 86.8–93.4%) and a high-risk group (2-year RFS: 28.9%, 95% CI 22.2–35.6%). Post-DSM exhibited superior predictive accuracy and identified patients at high risk of HCC recurrence more precisely than the Milan criteria and other models (Figs. 3D,E, S2E–H, S5 and Table S7). We also conducted a multi-class comparison of the post-DSM and RETREAT model, revealing similar outcomes (Fig. S6).

To further assess these dynamics, we stratified the entire cohort into four groups (Fig. 3F). Among patients beyond the Milan criteria, 33.0% (n = 73) were categorized as low-risk according to the post-DSM, with a 13.6% (n = 10) recurrence rate in the first 2 years. Among patients within the Milan criteria, 10.2% (n = 25) were identified as high-risk, with a 60.0% (n = 15) recurrence rate in the first 2 years. The post-DSM provided more accurate recurrence predictions independent of the Milan criteria (p < 0.001, Fig. 3F). Similar findings were observed using the alternative models (Fig. S6).

Discussion

In this multi-center study, we constructed and validated prognostic DeepSurv models for both pre- and post-LT patients. The pre-DSM, which relied upon readily available and easily accessible preoperative data, would effectively increase the number of potential liver transplant recipients by 8.7% compared to the Milan criteria. This increase would have made LT a viable option for a larger number of HCC patients, while also preserving valuable liver donor resources. Furthermore, the post-DSM tool provided a reliable and more accurate method for assessing individual recurrence risk. This would be especially beneficial for clinicians during discussions about treatment plans, through shared decision-making processes, and for scheduling necessary follow-up appointments.

These DSMs provide a new perspective for understanding recurrence. Some variables of tumor morphology and biology have been utilized elsewhere7. However, tumor capsule, necrosis, Glisson capsule invasion, and nerve invasion are seldom considered despite being associated with aggressive tumor biology and poor prognoses24,25,26,27,28. It may also be possible to further individualize care by incorporating knowledge about lifestyles and patient conditions. For example, peripheral blood and other physiological factors influence tumor recurrence because they are associated with the tumor microenvironment, inflammatory responses and nutritional status29. Our study suggests that incorporating measures of individual conditions into a comprehensive model will further improve specificity. This, in turn, will increase the likelihood of benefiting a greater number of patients and reduce the risk of recurrence after liver transplantation.

Machine learning models, being data-driven, can perform differently depending on the training data, with no single model consistently outperforming others across all datasets. To address this, this study incorporated the best-performing models from previous research into the modeling process. After validation of both the preoperative and postoperative models, DeepSurv was found to perform consistently better during both training and validation. Owing to its capacity for learning and its generalizability, DeepSurv was selected as the final predictive model. DSMs, rather than expanded Milan criteria, demonstrated the ability to independently assess prognosis. Our model can identify those at risk of poorer prognoses and those with relatively good prognoses beyond the standard Milan criteria. The creation of the Milan criteria (as well as other models) marked a significant improvement in patient selection, and these criteria have become key to predicting outcomes after LT for HCC patients. The Hangzhou criteria, AFP-French model, and other models have also been associated with superior outcomes compared to the Milan criteria by incorporating tumor biology-based factors rather than focusing solely on tumor morphology7,19,20,21. However, these criteria ignore the important role of patient condition in tumor recurrence. Our pre- and post-DSMs systematically incorporated tumor morphology, tumor biology and patient condition, considering them as a collective entity; this is the next logical step and may become the foundation for AI-based clinician-assisted decision support. The performance of our models was highly consistent with real-world outcomes during external validation. While expanding the pool of eligible recipients, the pre-DSM maintained prognostic outcomes comparable to the Milan criteria. This suggests that the pre-DSM provides a safe and reliable method for identifying a broader range of candidates.

The DSMs constructed in this study not only improve upon conventional statistical methods but could also become the foundation for artificial intelligence models. DeepSurv is a robust DL network which has been used to develop and validate a number of predictive models18,23. This is a step forward from conventional statistical approaches, which are based on linear proportional hazards and have difficulty modelling more complex biological data because of non-linear influences on clinical outcomes15,30. DL approaches can provide more powerful models, but their opacity, often referred to as the ‘black box’ problem, makes it difficult to understand which variables contribute to model predictions. Efforts to develop interpretable machine learning techniques should therefore aim to address this issue by providing insights into the decision processes involved in DL models. A user-friendly version of our predictive nomogram has been created for transparency and to encourage further investigation (visit https://www.deepliver.site/), as well as to enable researchers to test this tool using retrospective data from around the world.

The post-DSM designed here also creates opportunities to individualize care according to different levels of risk. HCC recurrence after LT is the most intractable complication, and surgeons must introduce individual surveillance strategies in tandem with subsequent treatment information. Numerous studies have found that, despite the high lethality of HCC recurrence, patients in the early stages of tumor recurrence after LT can achieve favorable prognoses through interventions such as surgery31, adjuvant treatments32, targeted therapy14, and immunotherapy33. Early identification and prompt intervention play pivotal roles in securing a positive prognosis for patients experiencing recurrence. However, in current clinical practice, the lack of specific and individual post-LT surveillance guidance hinders early diagnosis and treatment of recurrence, leading to significant prognostic heterogeneity across patient groups34. Therefore, surgeons need a tool which can provide individualized surveillance strategies. Since 75% of HCC recurrences occur within the first 2 years after LT6,35, we suggest that patients in the post-DSM high-risk group undergo HCC surveillance at least once every 2 months during the first 2 years, followed by at least once every 3 months in the third year, and then every 6 months thereafter until year five. It is generally accepted that patients at high risk of HCC recurrence could benefit and achieve longer survival by being administered post-LT prophylactic adjuvant therapies32. The post-DSM developed in this study also creates opportunities to personalise treatment for high-risk patients and may influence post-LT immunosuppression strategies. However, further long-term research incorporating various factors, such as adjuvant interventions and timing, is needed to refine and improve our models.

This cohort study represents regional experience in China, where more than 80% of HCC patients have chronic hepatitis B infection2. Consequently, the generalizability of our findings may be limited, given that most LTs outside of Asia are performed for HCC patients with heterogeneous underlying causes. Further studies are needed to validate DSMs internationally, particularly in countries where chronic hepatitis C or alcoholic liver cirrhosis is the primary etiology. Due to the nature of our imaging data collection, which was based on the most recent transplant, fully evaluating the impact of varying LT wait times and liver-directed therapies across different centers remains challenging. Additionally, this study did not include PIVKA-II due to incomplete data. Given the limitations of the retrospective study design, a multicenter study is planned to prospectively evaluate the application of DSMs in both pre- and post-LT HCC surveillance and to validate their prognostic power. While DSMs were not superior in all cases, there is potential to further refine these models and test them on larger datasets before clinical implementation.

In conclusion, both the pre- and post-DSM developed here serve as robust systems for predicting HCC recurrence after liver transplantation, outperforming the more established models. The DeepSurv method significantly improves the prediction of tumor recurrence. Pre- and post-DSMs can support individualized surveillance strategies for LT patients with HCC.