Introduction

Hepatitis C virus (HCV) is a major public health problem with an estimated 50 million people infected worldwide1. Treatment with direct-acting antivirals (DAAs) has revolutionized the management of chronic HCV, achieving over 95% efficacy in eradicating all HCV genotypes2,3. DAAs reduce, but do not eliminate, the risk of hepatocellular carcinoma (HCC) in those with underlying cirrhosis4,5,6. Current guidelines recommend biannual surveillance in patients who had advanced liver fibrosis or cirrhosis, consisting of serial ultrasound imaging ± alpha-fetoprotein (AFP) measurements7,8. Such an approach encourages earlier identification of dysplastic lesions, broader eligibility for curative therapies and overall improved survival9; however, a ‘one-size-fits-all’ strategy places a huge burden on healthcare systems. Given the number of HCV-infected individuals who have achieved a sustained virological response (SVR), a more personalized and cost-effective approach to screening is urgently required.

A subsequent policy statement on liver cancer screening from the European Association for the Study of the Liver (EASL) endorses a risk-based approach and suggests that approximately 20% of patients who are at low risk could potentially be exempted from regular surveillance. Additionally, it recommends that about 5%–10% of patients at high risk should receive more intensive surveillance, using magnetic resonance imaging (MRI) as the primary surveillance method. Those with an intermediate risk could adhere to the existing guidelines10. This strategy has emerged as the more advantageous one: it is not only more cost-effective but may also spare patients the physical harms of unnecessary surveillance, including multiple CT/MRI examinations. Despite this endorsement, the absence of a defined risk score for patient selection remains a gap in the literature.

Egypt launched a national hepatitis C virus treatment program, through which about 5 million patients received treatment. Approximately 20% of these treated individuals had liver cirrhosis (F4), and another 16% presented with advanced liver fibrosis (F3) resulting in a substantial population requiring HCC surveillance11. This increased surveillance demand significantly burdens the healthcare system. Therefore, individualizing HCC risk assessment through established risk prediction scores could effectively reduce this burden. Several validated HCC risk prediction scores, including ALBI12, aMAP13, GES14, and THRI15, have been developed using readily available clinical variables. These scores are practical for routine clinical application as they do not rely on expensive molecular or genetic markers or complex computational methods. The objective of our study is to evaluate the predictive performance and clinical utility of these HCC risk prediction scores among patients with cured HCV infection and compensated advanced chronic liver disease (cACLD) in a large, multicenter national cohort, aiming to identify the most accurate and clinically useful risk stratification tool.

Patients and methods

Cohort

Out of 33,797 patients with chronic hepatitis C (CHC) and liver cirrhosis (F4) or advanced liver fibrosis (F3) who had a sustained virologic response (SVR) after receiving DAAs, 8419 patients who completed follow-up and had complete data available for the scoring parameters were included in this observational study. Between January 2016 and December 2022, patients were recruited from the 52 centers belonging to the National Committee for Control of Viral Hepatitis (NCCVH) throughout Egypt. This study was conducted in accordance with the protocol and the principles of the Declaration of Helsinki [CIOMS/WHO, 1993] and its amendments in 200816. The protocol was approved by the Ethics Committee of the National Committee for the Control of Viral Hepatitis (NCCVH), Ministry of Health and Population, Egypt. The need to obtain informed consent from the participants was waived by the IRB due to the retrospective nature of the study.

Patients with decompensated liver disease, evidence of current HCC or focal hepatic lesions, a history of HCC or other malignancies, or chronic debilitating diseases such as chronic kidney disease or chronic cardiac failure were excluded from the study. Baseline data were recorded for all patients in the cohort, together with data from each follow-up visit up to the last follow-up. HCC incidence was also recorded. For score comparison, we used pre-treatment data (obtained immediately before the start of DAAs).

Diagnosis of fibrosis and HCC

Patients were diagnosed with advanced liver fibrosis (F3) or cirrhosis (F4) based on the FIB-4 index in addition to clinical criteria, as follows:

  • F3 (advanced fibrosis): defined by a FIB-4 score > 3.25 in the absence of clinical or radiological evidence of cirrhosis.

  • F4 (cirrhosis): defined by the same FIB-4 cut-off (> 3.25) plus at least one of the following: platelet count < 150 × 10⁹/L, splenomegaly, or evidence of portal hypertension/varices on imaging or endoscopy or evidence of liver cirrhosis on imaging.
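The staging rule above can be sketched in a few lines of Python. The FIB-4 formula used here is the standard published one (age × AST / (platelets × √ALT)). The function and parameter names are illustrative, and reading "F3" as "FIB-4 above the cut-off with none of the F4 criteria present" is an interpretation of the two bullets, not the authors' code.

```python
import math
from typing import Optional


def fib4(age_years: float, ast_u_l: float, alt_u_l: float,
         platelets_10e9_l: float) -> float:
    """Standard FIB-4 index: age x AST / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


def fibrosis_stage(
    fib4_value: float,
    platelets_10e9_l: float,
    splenomegaly: bool = False,
    portal_hypertension_or_varices: bool = False,
    cirrhosis_on_imaging: bool = False,
) -> Optional[str]:
    """Return 'F4', 'F3', or None according to the criteria listed above."""
    if fib4_value <= 3.25:
        return None  # at or below the cut-off: neither F3 nor F4 by this rule
    # Any one of these findings upgrades the patient from F3 to F4.
    cirrhosis_evidence = (
        platelets_10e9_l < 150
        or splenomegaly
        or portal_hypertension_or_varices
        or cirrhosis_on_imaging
    )
    return "F4" if cirrhosis_evidence else "F3"
```

For example, a hypothetical 60-year-old with AST 100 U/L, ALT 25 U/L and platelets of 160 × 10⁹/L has a FIB-4 of 7.5 and, with no other findings, would be staged F3 under this rule.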

Multiphase CT or MRI was performed in patients with focal hepatic lesions detected on abdominal ultrasound and/or an AFP level > 20 ng/mL, to assess for hepatocellular carcinoma based on characteristic arterial phase enhancement and washout in the delayed phase, in accordance with EASL guidelines17.

Calculation of HCC risk scores

We employed conventional HCC prediction scores including THRI, aMAP, ALBI, GES, and FIB-4. The calculation methods and criteria for stratifying patients into low-, intermediate-, or high-risk categories were adopted from the original peer-reviewed publications that introduced and validated these scores: GES score14, FIB-418, ALBI12, aMAP score13 and THRI score15 (Table 1 and Supplementary material 1).
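As an illustration of how two of these scores are computed from routine laboratory values, the sketch below implements the ALBI and aMAP formulas as they are generally cited from their original publications. The cut-offs shown (ALBI grade boundaries at −2.60 and −1.39, aMAP categories at 50 and 60) are those commonly quoted for these scores and are assumed here rather than taken from this study's supplementary material. The point-based GES and THRI scores are omitted because their scoring tables are not reproduced in the text. Function names are illustrative.

```python
import math


def albi_score(bilirubin_umol_l: float, albumin_g_l: float) -> float:
    """ALBI = 0.66 x log10(bilirubin, umol/L) - 0.085 x albumin (g/L)."""
    return 0.66 * math.log10(bilirubin_umol_l) - 0.085 * albumin_g_l


def albi_grade(score: float) -> int:
    """Grade 1 (best) to 3 (worst) using the commonly cited boundaries."""
    if score <= -2.60:
        return 1
    if score <= -1.39:
        return 2
    return 3


def amap_score(age_years: float, male: bool, bilirubin_umol_l: float,
               albumin_g_l: float, platelets_10e9_l: float) -> float:
    """aMAP on a 0-100 scale; it reuses the ALBI term for albumin and bilirubin."""
    linear = (
        0.06 * age_years
        + 0.89 * (1 if male else 0)
        + 0.48 * albi_score(bilirubin_umol_l, albumin_g_l)
        - 0.01 * platelets_10e9_l
    )
    return (linear + 7.4) / 14.77 * 100


def amap_category(score: float) -> str:
    """Assumed categories: < 50 low, 50-60 intermediate, > 60 high."""
    if score < 50:
        return "low"
    if score <= 60:
        return "intermediate"
    return "high"
```

With hypothetical inputs (a 60-year-old man, bilirubin 10 µmol/L, albumin 40 g/L, platelets 150 × 10⁹/L) the ALBI score is about −2.74 (grade 1) and the aMAP score is about 61.4, which falls in the high-risk category under the assumed cut-offs.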

Table 1 Baseline characteristics of the cohort.

Statistical analysis

To evaluate the prognostic performance of the three HCC risk scores (GES, aMAP, and THRI) in predicting hepatocellular carcinoma (HCC), we conducted a multi-step statistical analysis. Follow-up duration was defined as the time from the end of treatment to the date of last follow-up or HCC occurrence, whichever came first. Time-to-event and cumulative incidence analyses were performed using the Kaplan–Meier method. Incidence curves were then compared across the risk groups defined by each score using the log-rank (Mantel–Cox) test. Discrimination was evaluated using the area under the receiver operating characteristic curve (AUC) and Harrell’s C-index, both of which quantify the ability of each score to distinguish between patients who developed HCC and those who did not19,20. Calibration was assessed using the Brier score, calibration plots, the calibration slope and the Hosmer–Lemeshow test to evaluate the accuracy of predicted probabilities21. Decision curve analysis (DCA) was performed to examine the net clinical benefit across a range of threshold probabilities, comparing each score to “treat all” and “treat none” strategies22. Negative predictive value (NPV) was calculated as the proportion of patients classified as low-risk who did not develop HCC during follow-up (true negatives / [true negatives + false negatives]), to evaluate each score’s ability to reliably exclude future HCC occurrence. Net reclassification improvement (NRI) was calculated to assess the added value of THRI and aMAP over GES using clinically relevant risk categories23. Analyses were conducted using SPSS v26, R v4.3.2, and Python v3.11. A p-value < 0.05 was considered statistically significant.
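To make the cumulative incidence analysis concrete, the following is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimator, returning the cumulative incidence 1 − S(t) at each observed event time. It is not the authors' code; in practice a survival package in R or Python would be used, and the function name and input layout are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def km_cumulative_incidence(
    times: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, 1 - S(time)) at each distinct time with at least one event.

    times  : follow-up in any consistent unit (e.g. months from end of treatment)
    events : True if HCC occurred at that time, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += int(data[i][1])
            n_leaving += 1
            i += 1
        if n_events:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, 1.0 - survival))
        # Events and censored subjects both leave the risk set after t.
        at_risk -= n_leaving
    return curve
```

Running the function separately on the low-, intermediate- and high-risk groups of a score gives the curves that a log-rank test would then compare.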

Results

The analysis included 8419 patients, whose characteristics are presented in Table 1. Of these, 3672 (43.6%) were male and 4747 (56.4%) were female. Median age was 60.0 years (53.0–67.0). Patients were followed up for a mean duration of 25.9 ± 12.9 months (range 12–92 months). During follow-up, 52 patients developed HCC, corresponding to an incidence of 0.29/100 person-years (py) (95% CI = 0.22–0.37). Performance of the HCC prediction scores is presented in Tables 2 and 3, Fig. 1 and Supplementary Fig. 1.
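The incidence figure can be approximately reconstructed from the numbers given. The sketch below assumes total person-years equal the cohort size times the mean follow-up (8419 × 25.9/12, about 18,171 py), since the exact person-time is not reported, and uses a log-transformed Wald interval. The authors' interval method is not stated, so the bounds need not match the published ones exactly.

```python
import math
from typing import Tuple


def incidence_per_100_py(
    events: int, person_years: float, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (rate, lower, upper) per 100 person-years.

    Uses the approximation SE(log rate) = 1 / sqrt(events), so the
    interval is rate / f to rate * f with f = exp(z / sqrt(events)).
    """
    if events <= 0 or person_years <= 0:
        raise ValueError("events and person_years must be positive")
    rate = 100.0 * events / person_years
    factor = math.exp(z / math.sqrt(events))
    return rate, rate / factor, rate * factor


# Assumed person-time: cohort size x mean follow-up in years.
approx_person_years = 8419 * 25.9 / 12
rate, lower, upper = incidence_per_100_py(52, approx_person_years)
```

Under these assumptions the rate rounds to 0.29 per 100 py, in line with the reported value.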

Table 2 Performance of different risk scores.
Table 3 Accuracy for prediction of hepatocellular carcinoma development in CHC patients using different scores.
Fig. 1

Cumulative incidence of hepatocellular carcinoma according to risk stratification by prediction scores.

ALBI grade

Patients were classified by ALBI into a low-risk group (3658 patients, 43.4%), an intermediate-risk group (4362 patients, 51.9%) and a high-risk group (399 patients, 4.7%). HCC developed in 17 patients in the low-risk group, an incidence of 0.22/100 py (95% CI = 0.13–0.35); in 32 patients in the intermediate-risk group, an incidence of 0.33/100 py (95% CI = 0.23–0.46); and in 3 patients in the high-risk group, an incidence of 0.37/100 py (95% CI = 0.09–0.99). The log-rank test for comparison of incidence curves was not statistically significant (p = 0.393), and Harrell’s C statistic was low (0.556). The NPV for ruling out HCC was 99.1% (95% CI = 98.7–99.3).

aMAP score

Patients were classified by aMAP into a low-risk group (1266 patients, 15.0%), an intermediate-risk group (3763 patients, 44.7%) and a high-risk group (3390 patients, 40.3%). HCC developed in 5 patients in the low-risk group, an incidence of 0.19/100 py (95% CI = 0.07–0.42); in 17 patients in the intermediate-risk group, an incidence of 0.21/100 py (95% CI = 0.13–0.33); and in 30 patients in the high-risk group, an incidence of 0.41/100 py (95% CI = 0.28–0.57). The log-rank test for comparison of incidence curves was statistically significant (p = 0.041), and Harrell’s C statistic was low (0.595). The NPV for ruling out HCC was 96.4% (95% CI = 95.3–97.3).

FIB-4 index

Patients were classified by FIB-4 into a low-risk group (1282 patients, 15.2%), an intermediate-risk group (3549 patients, 42.2%) and a high-risk group (3588 patients, 42.6%). HCC developed in 13 patients in the low-risk group, an incidence of 0.50/100 py (95% CI = 0.28–0.84); in 10 patients in the intermediate-risk group, an incidence of 0.13/100 py (95% CI = 0.13–0.24); and in 29 patients in the high-risk group, an incidence of 0.36/100 py (95% CI = 0.24–0.51). The log-rank test for comparison of incidence curves was highly statistically significant (p = 0.003), and Harrell’s C statistic was fair (0.623). The NPV for ruling out HCC was 99.4% (95% CI = 99.1–99.6).

GES score

Patients were classified by the GES score into a low-risk group (6234 patients, 74.0%), an intermediate-risk group (982 patients, 11.7%) and a high-risk group (1203 patients, 14.3%). HCC developed in 30 patients in the low-risk group, an incidence of 0.22/100 py (95% CI = 0.15–0.32); in 6 patients in the intermediate-risk group, an incidence of 0.28/100 py (95% CI = 0.13–0.58); and in 16 patients in the high-risk group, an incidence of 0.60/100 py (95% CI = 0.36–0.95). The log-rank test for comparison of incidence curves was highly statistically significant (p = 0.004), and Harrell’s C statistic was fair (0.681). The NPV for ruling out HCC was 99.7% (95% CI = 99.5–99.8).

THRI score

Patients were classified by THRI into a low-risk group (2453 patients, 29.1%), an intermediate-risk group (5137 patients, 61.1%) and a high-risk group (829 patients, 9.8%). HCC developed in 6 patients in the low-risk group, an incidence of 0.11/100 py (95% CI = 0.05–0.23); in 38 patients in the intermediate-risk group, an incidence of 0.34/100 py (95% CI = 0.25–0.47); and in 8 patients in the high-risk group, an incidence of 0.44/100 py (95% CI = 0.20–0.84). The log-rank test for comparison of incidence curves was statistically significant (p = 0.015), and Harrell’s C statistic was fair (0.605). The NPV for ruling out HCC was 98.2% (95% CI = 97.5–98.6).

Comparison of HCC prediction scores (Table 3; Fig. 1)

Except for ALBI, all HCC risk scores investigated had adequate statistical performance, with a significant log-rank (Mantel–Cox) test for comparison of incidence curves (p ≤ 0.05).

GES demonstrated the highest area under the receiver operating characteristic curve (AUC = 0.632), followed by THRI (AUC = 0.600) and aMAP (AUC = 0.591) (Supplementary Fig. 1). Kaplan–Meier analysis (Fig. 1) demonstrated significant separation in HCC-free survival among patients stratified into low-risk versus intermediate/high-risk groups by each score. The log-rank (Mantel–Cox) test confirmed that these differences were statistically significant (p < 0.05), supporting the utility of all three scores in stratifying risk over time.

In terms of discrimination, Harrell’s C-index values for time-to-event data were similar across the models, with each score achieving a C-index of approximately 0.63, indicating modest but consistent ability to distinguish between individuals who did and did not develop HCC during follow-up.
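For readers unfamiliar with the statistic, Harrell’s C-index is the proportion of comparable patient pairs in which the patient with the higher risk score developed HCC earlier. The following is a simplified pure-Python sketch (quadratic time; pairs with identical follow-up times are skipped and tied scores count as one half). It is meant to illustrate the definition, not to reproduce the values reported here.

```python
from typing import Sequence


def harrell_c(
    times: Sequence[float], events: Sequence[bool], scores: Sequence[float]
) -> float:
    """Harrell's C for right-censored data; a higher score means higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # each pair is anchored on a subject with an observed event
        for j in range(n):
            # Subject j must be known to be event-free beyond subject i's event time.
            if times[j] > times[i]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. Because three-level risk categories produce many tied scores, a categorical score will tend to sit closer to 0.5 than the underlying continuous score.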

In terms of calibration, GES demonstrated the most accurate probability estimates, with the lowest Brier score (0.197), compared to THRI (0.325) and aMAP (0.487). The calibration slope for GES was 0.90 with an intercept of −5.00, indicating slight overestimation of risk but overall good calibration. In contrast, THRI and aMAP exhibited substantial miscalibration, with slopes of −0.008 and 0.26, and intercepts of −6.49 and −5.32, respectively. Visual inspection of calibration plots confirmed that GES provided predictions closest to the ideal 45-degree line, particularly in the mid-risk range. Although none of the scores achieved perfect calibration, GES showed the least deviation from observed outcomes (Fig. 2).
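The Brier score referred to above is the mean squared difference between each predicted probability and the observed binary outcome, so lower is better and 0 is perfect. A minimal sketch follows; how each score was converted to a predicted probability in this study is not described in the text, so the inputs here are simply assumed to be probabilities in [0, 1].

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    if not predicted:
        raise ValueError("at least one observation is required")
    return sum((p - y) ** 2 for p, y in zip(predicted, outcomes)) / len(predicted)
```

For example, hypothetical predictions of 0.1, 0.9, 0.8 and 0.3 against outcomes 0, 1, 1 and 0 give a Brier score of 0.0375.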

Decision curve analysis (DCA) demonstrated that the GES score provided the highest net benefit across the clinically relevant threshold range (0.1–0.5), consistently outperforming both the “treat-all” and “treat-none” strategies. In contrast, THRI showed moderate net benefit, while aMAP offered limited utility, particularly at the lower end of the threshold range. Within the lower thresholds (0.1–0.3), where early intervention is most impactful, GES maintained a consistently superior net benefit, supporting its clinical utility in early risk stratification and guiding individualized HCC surveillance decisions (Figs. 3 and 4).
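The net benefit plotted in a decision curve has a simple closed form: at threshold probability pt, net benefit = TP/n − (FP/n) × pt/(1 − pt), where patients with a predicted risk at or above pt are "treated" (here, selected for surveillance). The sketch below implements this standard definition together with the "treat-all" reference; "treat-none" is zero by definition. Function names and the example data are illustrative only.

```python
from typing import Sequence


def net_benefit(
    predicted: Sequence[float], outcomes: Sequence[int], threshold: float
) -> float:
    """Net benefit of acting on everyone with predicted risk >= threshold."""
    n = len(outcomes)
    true_pos = sum(1 for p, y in zip(predicted, outcomes) if p >= threshold and y)
    false_pos = sum(1 for p, y in zip(predicted, outcomes) if p >= threshold and not y)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes: Sequence[int], threshold: float) -> float:
    """Reference strategy: everyone is treated regardless of predicted risk."""
    prevalence = sum(1 for y in outcomes if y) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Evaluating both functions over a grid of thresholds (for instance 0.1 to 0.5) and plotting the results against the zero line yields a decision curve of the kind summarized above.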

Net reclassification improvement (NRI) analysis revealed that replacing GES with aMAP resulted in poorer classification performance, with a negative NRI of −0.16, primarily due to inferior classification of non-events. In contrast, THRI offered only a marginal improvement over GES, with a modestly positive NRI of +0.05, insufficient to warrant preference over GES in clinical practice.
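The categorical NRI sums two components: among patients who developed HCC, the proportion moved to a higher risk category minus the proportion moved lower; and among those who did not, the proportion moved lower minus the proportion moved higher. A minimal sketch follows, assuming risk categories are coded as ordered integers (0 = low, 1 = intermediate, 2 = high); this coding and the function name are assumptions for illustration.

```python
from typing import Sequence, Tuple


def categorical_nri(
    old_category: Sequence[int],
    new_category: Sequence[int],
    outcomes: Sequence[int],
) -> Tuple[float, float, float]:
    """Return (total NRI, event component, non-event component)."""
    up_event = down_event = n_event = 0
    up_nonevent = down_nonevent = n_nonevent = 0
    for old, new, y in zip(old_category, new_category, outcomes):
        if y:
            n_event += 1
            up_event += new > old      # correct direction for an event
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old  # correct direction for a non-event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")
    event_nri = (up_event - down_event) / n_event
    nonevent_nri = (down_nonevent - up_nonevent) / n_nonevent
    return event_nri + nonevent_nri, event_nri, nonevent_nri
```

Reporting the two components separately shows whether a change in NRI is driven by events or by non-events, which is the distinction drawn above for aMAP.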

Discussion

Our study showed that all three scores (aMAP, GES and THRI) were able to stratify our patients into low-, intermediate- and high-risk groups. Notably, the low-risk group exhibited a remarkably low incidence rate: 0.22/100 person-years (py) for GES, 0.19/100 py for aMAP, and 0.11/100 py for THRI. Among these, the GES score identified the largest proportion of patients (74.0%) as low risk, substantially higher than aMAP (15.0%) and THRI (29.1%). This finding has important clinical and economic implications, as the application of GES may allow a substantial proportion of patients to undergo extended HCC surveillance intervals, thereby improving the cost-effectiveness of HCC monitoring programs.

It should be noted that aMAP classified a markedly higher proportion of patients (40.3%) as high risk, compared to 14.3% with GES and 9.8% with THRI. While aMAP’s broader high-risk categorization may enhance sensitivity, it could also pose significant challenges in terms of healthcare resource utilization, potentially limiting its feasibility in routine clinical practice.

The aMAP score was validated in 2085 genotype-4 HCV-cured patients with advanced liver disease and exhibited strong statistical performance. However, the authors noted that aMAP categorized approximately 85% of their cohort into the high-risk group, which could potentially reduce the cost-effectiveness of the surveillance program and pose significant physical risks to the patients24. Recently, Minami et al. compared the aMAP score with their newly developed SMART model, which showed superior discriminative performance, with a higher c-index than aMAP in both the derivation cohort (0.936 vs. 0.762) and the validation cohort (0.839 vs. 0.830)25.

The Toronto Hepatocellular Carcinoma Risk Index (THRI) has been validated within a single Asian cohort comprising individuals with cirrhosis of varying etiologies. The authors of this study concluded that the score effectively stratified patients into three distinct risk groups, with only 5.2% of patients falling into the low-risk category26. In a Swedish context, Astrom et al. concluded that the THRI could successfully distinguish between individuals at low and high risk of developing HCC. However, the low-risk group was relatively small, accounting for just 5.3% of the total cohort, which may limit the clinical utility of the THRI27.

When evaluating the individual predictive performance of the GES, aMAP, and THRI scores, notable differences emerged in both discrimination and calibration. The GES demonstrated the highest discriminatory ability (AUC = 0.632), followed by THRI (AUC = 0.600) and aMAP (AUC = 0.591). These values indicate that all three models have modest capacity to distinguish between patients who will develop hepatocellular carcinoma (HCC) and those who will not. However, when calibration was assessed using the Brier score to evaluate the accuracy of predicted probabilities, GES outperformed the other scores. It achieved the lowest Brier score (0.197), reflecting more accurate and reliable risk estimates compared to THRI (0.325) and aMAP (0.487).

GES also delivered the highest net clinical benefit across relevant decision thresholds and exhibited greater consistency in reclassification performance. These results support the use of GES as the most clinically reliable and actionable tool for individualized HCC risk estimation in this patient population.

The discriminative capability of the GES score was validated in multicentre international cohorts, encompassing diverse populations from Europe, Japan, India, and the United States. The authors asserted that the GES score effectively categorizes HCV patients into three risk groups for HCC. Furthermore, it exhibited significant predictive efficacy for HCC development across all participants (p < 0.0001), with a Harrell’s C-index ranging from 0.55 to 0.76 across all cohorts, even after adjustments for HCV genotypes and patient ethnicities28. The dynamics of the GES were also externally validated in a Japanese cohort with genotypes 1 and 2. In this study, Abe et al. concluded that the GES demonstrated excellent performance in HCC risk stratification for DAA-treated HCV patients with other genotypes, yielding a fair Harrell’s C-index29. Similar findings were reported by Muzica et al., who compared several HCC risk scores among 992 Romanian CHC patients and concluded that the GES score has very good predictive power for the risk of HCC post-SVR and could be recommended in clinical practice in Romania30. In line with these findings, Qureshi et al. supported the utility of GES as a simple and practical tool for optimizing HCC surveillance strategies in high-risk, post-SVR Pakistani patients with HCV genotype 331.

Moreover, we recently reported the outcomes of the first prospective study evaluating individualized risk stratification using GES in CHC patients who achieved sustained virologic response (SVR) at Tanta University Hospitals. This prospective cohort study included 492 patients with a mean follow-up duration of 2 years and demonstrated that the implementation of individualized risk stratification using GES significantly enhanced early detection of hepatocellular carcinoma (HCC). Notably, 80% of detected HCC cases were identified at early, potentially curable stages (BCLC stages 0 and A), a substantial improvement compared to only 52% in the comparative cohort following regular surveillance. These findings provide compelling evidence supporting the use of individualized risk stratification with GES to optimize surveillance strategies and increase opportunities for curative treatment32.

The superior performance of the GES score may be attributed to its incorporation of biologically and clinically relevant variables, such as AFP and fibrosis stage, which directly reflect tumor risk and liver disease progression. In contrast, aMAP and THRI lack AFP and rely on indirect or non-specific markers. Moreover, GES includes albumin but not platelet count, prioritizing hepatic synthetic function over surrogate markers of portal hypertension. Recent studies33,34 have demonstrated that monitoring longitudinal changes in AFP, rather than relying on a single absolute cut-off value, can substantially enhance the sensitivity of HCC detection. In line with these findings and our own results, we suggest that an effective HCC risk prediction score should incorporate AFP dynamics during follow-up. This is further supported by the observation that existing scores such as aMAP and THRI, which do not include AFP, were not clinically useful in our patient cohort. Notably, to address this limitation, Fan et al. recently incorporated AFP into the aMAP model to develop aMAP-2, a novel score that integrates longitudinal AFP and aMAP trends, thereby improving HCC risk prediction and overcoming the static nature of the original aMAP score35.

Our study offers several strengths that enhance its validity and clinical relevance. It represents the first large-scale, multicentre national study directly comparing the predictive performance of multiple HCC risk stratification scores—including aMAP, GES, and THRI—in a well-defined cohort of patients with advanced fibrosis or cirrhosis who achieved sustained virologic response (SVR) after DAA therapy. The inclusion of over 8,000 patients and a long follow-up period provides robust statistical power and allows for meaningful incidence estimation of HCC in this population. Additionally, the real-world nature of the cohort, drawn from 52 centres across Egypt, enhances generalizability within similar populations.

Our study also had limitations. First, the retrospective design carries inherent risks of selection and confounding bias, despite the large sample size and systematic data collection. Second, all patients were recruited from the Egyptian National Liver Disease Program, which may limit generalizability to other healthcare settings such as primary care or to populations outside Egypt. The predominance of HCV genotype 4 in our cohort, together with local environmental and lifestyle factors, may influence the risk profile for HCC and differ from Asian or Western populations where other genotypes are more common. Third, this study does not address recent accurate prediction scores based on molecular and genetic risk factors, such as the hepatic fat genetic risk score or the TLL1 variant. However, these scores are costly and not practical for implementation in a national program. Moreover, our study relied on conventional clinical indicators in risk scores, without incorporating emerging biomarkers such as des-γ-carboxy prothrombin (DCP) or glypican-3. Although AFP is included in the GES score, longitudinal AFP dynamics and integration with novel biomarkers may further enhance predictive accuracy. Future studies should evaluate models that combine these markers with clinical variables to optimize HCC risk stratification. A further limitation is that detailed information on non-HCC causes of death was not available in our dataset, which precluded the use of competing risks methods such as the Fine–Gray model. As a result, our Kaplan–Meier estimates may slightly overestimate the cumulative incidence of HCC. However, prior evidence suggests that risk score stratification remains robust when competing risks are considered. An additional limitation is that variability in surveillance practices and suboptimal adherence to recommended intervals, an acknowledged global challenge, may have influenced early HCC detection. The relatively limited follow-up time, some patient losses, and occasional delays or restricted access to definitive imaging could also have affected incidence estimates. These factors should be considered when interpreting our results.

In conclusion, the implementation of the GES score within this national hepatitis C virus (HCV) elimination program effectively stratifies patients according to hepatocellular carcinoma (HCC) risk. Notably, the GES score categorized the largest proportion of patients (74.0%) as low-risk, allowing longer follow-up intervals. Adopting GES in routine clinical practice may offer more favorable harm-benefit tradeoffs, helping clinicians better identify high-risk individuals for HCC surveillance or early intervention while minimizing unnecessary procedures in low-risk patients.

Fig. 2

Calibration plots.

Fig. 3

Decision curve analysis.

Fig. 4

Patient risk distribution by prediction scores compared to EASL policy statement and current guidelines.