Introduction

Hepatic encephalopathy (HE) is a neuropsychiatric disorder linked to both acute and chronic liver failure and/or portosystemic shunting, caused by the accumulation of neurotoxic substances, such as ammonia, which the liver is unable to adequately remove1. HE manifests as a spectrum of cognitive and motor disturbances, ranging from clinically undetectable to severely apparent. Diagnosing more advanced stages is generally straightforward, as these patients exhibit clear neurological symptoms such as disorientation, asterixis, or even coma. However, diagnosing subtler forms of HE, which are collectively referred to as “covert HE” (minimal HE and West Haven Grade 1 HE), is typically more intricate. While Grade 1 HE involves mild, yet detectable changes often noticed only by caregivers or physicians well-acquainted with the patient, minimal HE lacks obvious clinical signs and is identified only through specialized psychometric testing, such as the Psychometric Hepatic Encephalopathy Score (PHES), Stroop or Critical Flicker Frequency testing2.

Covert HE is associated with a diminished quality of life, progression to overt HE, and thus worse overall survival3,4,5,6. A notable European study by Gairing et al. found that over one-third of patients with cirrhosis experience minimal HE, with a prevalence rate of 34.8% among the general cirrhotic population. However, the prevalence increases with the severity of liver failure, reaching 52.1% in patients classified as Child-Pugh Class C7.

Despite the high prevalence of covert HE, it remains largely under-screened8, hampering early intervention and the improvement of outcomes9. This lack of screening is primarily due to the time-consuming or technically complex nature of the existing diagnostic methods. More importantly, there is no consensus on a recommended standard diagnostic approach. The European Association for the Study of the Liver (EASL) and the International Society for Hepatic Encephalopathy and Nitrogen Metabolism (ISHEN) offer conflicting recommendations for the diagnosis of minimal HE. While ISHEN discourages the use of multiple tests10, arguing that single methods are sufficient, EASL supports the use of more than one diagnostic test but acknowledges the limited data on how to effectively combine and interpret the results of these different tests2.

Inspired by the success of machine learning in classifying patients with and without Parkinson’s disease based on speech patterns11, we explored the adaptation of a similar approach for detecting covert HE, given that patients with overt HE often exhibit noticeable speech alterations. In this study, we present a proof-of-concept for a novel, prospective diagnostic method for covert HE based on sustained-vowel voice analysis, complemented by an explainable machine learning approach that elucidates the key factors driving the model’s predictions and addresses the traditional ‘black box’ nature of AI.

Results

Patients’ characteristics

Thirty-six cirrhotic patients had covert HE (20 patients had minimal HE diagnosed by a pathological result on EncephalApp_Stroop and 16 patients had Grade 1 HE) while eight cirrhotic patients did not present signs of covert HE. None of the patients had any history of TIPS placement.

In the covert HE cohort (n = 36), there were 21 male patients (58%) and the median age was 58 years (IQR 49–70). Alcohol-related liver disease was the most common underlying disease. The median Child-Pugh score was 7 (IQR 5–10), and the median MELD score was 12 (IQR 7–19). The clinical and laboratory characteristics are presented in Table 1.

Table 1 Overall characteristics of the two cohorts (with and without covert HE)

The control group (n = 57) comprised individuals without evidence of HE. This group included patients with liver cirrhosis but no signs of HE and a normal EncephalApp_Stroop result (n = 8/57, 14%), as well as individuals without advanced liver disease or healthy volunteers (n = 49/57, 86%). The overall distribution of patients is presented in Fig. 1. There was no difference in age, gender, or education among the cohorts (p = 0.5, p = 0.3, and p = 0.9, respectively).

Fig. 1: Distribution of all enrolled participants into the study cohorts.

Flowchart of the study population showing the distribution of participants across diagnostic subgroups.

Model training and evaluation

We trained and evaluated models for the diagnosis of covert HE across 13 distinct tasks, each involving different combinations of “a:” and “e:” at low and high volume. Overall, the best discriminatory performance was achieved by XGBoost using a combination of recordings of both vowels “a:” (low volume + high volume) and “e:” (high volume) (Table 2).

Table 2 Performance comparison of machine learning classifiers

XGBoost performance

The XGBoost model achieved an AUROC of 81.20 (95% CI 73.03–89.73, SD 4.20) and demonstrated both a plateaued learning curve and reasonable calibration across the full range of predicted probabilities (p = 0.66; Table 2 and Fig. 2).

Fig. 2: Performance evaluation of the final XGBoost model.

a Receiver operating characteristic (ROC) curve demonstrating discriminative ability compared with the random classifier reference line. b Calibration curve with 80% and 95% confidence bands demonstrating agreement between predicted and observed probabilities. c Learning curve illustrating training and validation AUCs across increasing training set sizes.

The SHAP summary plot identified the features with the greatest impact on the output of the model, as indicated by the highest mean SHAP values: (1) zero-crossing rate, (2) magnitude spectrum Q2, (3) magnitude spectrum p1, and (4) magnitude spectrum Q1 (Fig. 3).

Fig. 3: Summary plot of the final XGBoost model.

Summary plot showing the most influential acoustic features contributing to the model’s predictions.

The dependence plots (Fig. 4) show that a high zero crossing rate (harsh sound) indicates an increased likelihood of covert HE. Conversely, a mellow sound with a low zero crossing rate suggests a decreased likelihood of covert HE. The other most important features were the first percentile and first and second quartiles of the magnitude spectrum. The high magnitude in the first percentile suggests that patients with covert HE tend to have a higher volume of very quiet sounds than controls. This means that they struggle to reduce their voice energy when needed—indicating an inability to speak quietly when appropriate. On the other hand, the low magnitude in the first and second quartiles suggests that patients with covert HE might also have trouble increasing the volume adequately when speaking louder is necessary. In other words, people with covert HE tend to struggle to naturally vary how loud they speak, keeping their voice at a more constant volume regardless of the situation. Although other features were present, they did not significantly influence model predictions (low mean SHAP values).

Fig. 4: Dependence plot of the final XGBoost model.

Dependence plot illustrating how variations in a selected acoustic feature influence the model’s predicted probability of covert hepatic encephalopathy.

Only 8 cases (8.6%) reported having a B2 or higher level of proficiency in a foreign language (three in the covert HE group and five in the no HE group). One of the eight “multilingual” cases was misclassified, leading to an accuracy of 87.5%. The misclassified case belonged to the “no HE cohort”.

Sensitivity analyses

Finally, we performed sensitivity analyses, calculating separate AUROC curves for two other scenarios: (1) cirrhotic patients only (cirrhosis with covert HE vs cirrhosis without HE), and (2) covert HE vs non-cirrhotic controls, to illustrate how the model’s discriminatory performance varies by control group composition. No significant departure from the primary analysis was observed, as the AUROCs for the cirrhotics-only and non-cirrhotic-only control group comparisons were 87.04 (95% CI 78.19–94.12) and 79.75 (95% CI 71.38–88.00), respectively.

Discussion

Covert HE is a common yet often overlooked complication of liver cirrhosis, with a prevalence that increases as liver failure progresses. Despite the significant burden it imposes, covert HE remains underdiagnosed and undertreated, largely owing to the limitations of current diagnostic methods8. This underdiagnosis not only prevents timely pharmacological intervention9 but also leaves patients vulnerable to the risks associated with impaired decision-making in high-risk situations. Furthermore, the need for improved diagnostic tools and a more proactive screening approach is now more critical than ever, particularly because of the rising proportion of patients receiving transjugular intrahepatic portosystemic shunts (TIPS). Such a model could therefore also be used in the selection process for TIPS and for monitoring patients after the procedure.

In this study, we present a detailed description of the development of a proof-of-concept machine learning model that identifies patients with covert HE based on speech analysis12.

Our study was conducted in a specific population of patients with liver cirrhosis, characterized by a relatively high prevalence of covert hepatic encephalopathy. As noted earlier, part of the study took place during the COVID-19 restrictions, when hospital visits in Slovakia were limited to severely ill patients only. Moreover, we used the EncephalApp_Stroop test to diagnose minimal HE, which has been reported to have the highest sensitivity among established diagnostic tools13. Both factors may have contributed to the higher observed prevalence, which, however, remained at the upper end of the range reported in the literature14. We believe that this should not have a significant impact on the overall conclusions of the study. In addition, we included individuals without advanced liver disease. The inclusion of noncirrhotic controls aligns with foundational research that led to the development of currently accepted diagnostic tools such as the PHES and the EncephalApp_Stroop test8,15.

The XGBoost model demonstrated the best discriminatory characteristics (AUROC 81.20 ± 4.2 (95% CI 73.03–89.73)) while maintaining good calibration across the whole spectrum of predicted probabilities (p = 0.66). The proposed model is potentially easy to use, requiring only a smartphone/computer application and no more than 15 s of speech recording per patient. This is in stark contrast to the currently employed methods, which are both technically demanding and time-consuming16.

An interesting comparison can be made with a recent study by Bloom et al. who examined running speech features in a similarly large cohort of minimal HE patients (n = 40)17. This study observed that a slower speech rate is commonly associated with HE and used this feature alongside acoustic variables different from ours (such as mel-frequency cepstrum coefficients skewness 22 and jitter local absolute). The model by Bloom et al. achieved an AUROC of 0.76 (95% CI 0.67–0.84), although calibration across the full probability spectrum (0%–100%) was not reported, limiting insights into the model’s reliability at extreme probabilities. Notably, the authors enhanced their model by incorporating age and MELD-Na, which raised the AUROC to 0.82 (95% CI 0.74–0.90), close to our model’s performance.

To address the skepticism toward artificial intelligence that is still largely present among clinicians, we present dedicated research explaining how the model arrives at its outcomes. We believe that removing the black box around the model’s decisions will improve the acceptance of the model in future research and clinical practice. Indeed, we only confirmed the changes that an experienced clinician would expect in the speech of patients with covert HE. The most important group of features behind the machine learning model’s decisions indicated that people with covert HE tend to struggle to naturally vary how loud they speak, keeping their voice at a more constant volume and a harsher quality regardless of the situation. We also acknowledge that the best-performing model was based on the combination of “a:” (low and high volume) and “e:” (high volume). This could be attributed to two key factors. First, the vowel “a” lies at the edge of the Hellwag triangle, unlike “e”, and its articulation requires a more complex positioning of the vocal tract18. This increased articulatory demand may expose subtle neuromotor deficits, making it more informative for the detection of covert HE. Second, the model appears to capture volume-related variability sufficiently through the soft and loud recordings of “a:”, rendering the additional soft “e:” redundant and less informative.

Our study has a few limitations. First, speech characteristics can vary significantly across languages. To mitigate this, we focused on basic speech elements, such as sustained vowels, which tend to remain relatively stable across different linguistic contexts. Second, the sample size was small, particularly for cirrhotic patients without covert HE, and true external validation was lacking. To mitigate the impact of the small sample size, we employed outcome-stratified nested cross-validation with out-of-fold predictions—one of the most robust validation strategies available—to reduce the risk of overfitting and provide a nearly unbiased estimate of model performance. Nevertheless, the first two limitations reduce the generalizability of our findings; accordingly, the results should be viewed as hypothesis-generating, highlighting the need for validation in larger and multilingual cohorts. Third, the diagnosis of minimal HE relied solely on the EncephalApp_Stroop test. Although this tool has been validated, it may not detect all cases of minimal HE, potentially leading to misclassification that might have propagated into the presented model.

In conclusion, this study presents the potential of speech analysis—specifically sustained vowel phonation—combined with an XGBoost classification model as a non-invasive and easily obtainable digital biomarker for the detection of covert hepatic encephalopathy. However, further evaluation in larger and multilingual cohorts is required before any real-world clinical adoption.

Methods

Design

This prospective cross-sectional observational case-control study was conducted at a tertiary liver care center (2nd Department of Internal Medicine, Pavol Jozef Safarik University and L Pasteur University Hospital in Kosice, Slovakia). The local ethics committee approved the study protocol on 14th November 2018 (ethics committee no. 117/EK/18). All patients provided informed consent to participate in the study. This study adheres to the TRIPOD guidelines19.

Definitions

The diagnosis of cirrhosis was established prior to the study inclusion and was based on a combination of laboratory and imaging findings—including ultrasonography, elastography, and endoscopy—which were consistent with liver cirrhosis. The diagnosis of covert HE was established based on the presence of either Minimal or Grade 1 HE. Minimal HE was diagnosed following EASL and ISHEN recommendations2,10, using a pathological result on a validated EncephalApp_Stroop test as the defining criterion (total off-time ≥ 125.84 s as reported by Bajaj et al.)20. Grade 1 HE was diagnosed when one or more of the following signs were present: trivial lack of awareness, euphoria or anxiety, shortened attention span, impairment of addition or subtraction, or altered sleep–wake rhythm2. The assessment was jointly performed by a junior specialist (JG) and a consultant (MJ), with 8 and 16 years of clinical experience, respectively, both specializing in liver diseases. A mutual agreement was reached in all patients.

Patients

We included patients with known liver cirrhosis scheduled for examination at the outpatient liver clinic. Furthermore, we enrolled individuals without advanced liver disease and without hepatic encephalopathy. The latter included healthy volunteers—primarily hospital employees and their family members or friends—as well as individuals diagnosed with metabolic dysfunction-associated steatotic liver disease (MASLD) without signs of advanced fibrosis (as evidenced by transient elastography/Fib-4), or with benign liver hemangioma. This approach is conceptually aligned with prior standardization studies of diagnostic tools for covert HE—instruments such as PHES21 and the EncephalApp Stroop test20 similarly incorporated healthy subjects to establish normative performance boundaries before evaluating diagnostic validity within cirrhotic cohorts.

Patients were enrolled between June 2019 and February 2025, with intermittent pauses during the COVID-19 lockdowns. Individuals younger than 18 years or older than 80 years, as well as those unwilling to participate, were excluded. To minimize potential confounding factors affecting voice characteristics, additional predefined exclusion criteria were applied: treatment with any psychiatric or psychoactive medication, the presence of active infections or intoxications; any neurological or psychiatric disorder (including overt hepatic encephalopathy, Wernicke encephalopathy, or Korsakoff syndrome); metabolic conditions such as diabetes with acute or chronic complications, acute kidney injury, or chronic kidney disease; respiratory insufficiency; and malignant diseases.

Voice recordings

All voice recordings were performed in Slovak and collected under standardized acoustic conditions to minimize environmental variability. Recordings took place in the same quiet room for all participants, ensuring consistent background noise levels. Audio acquisition was carried out using a Blue Yeti Aurora external microphone, set to 16 kHz sampling rate and 16-bit PCM encoding, connected to a desktop computer running Audacity software (version 2.3, Muse Group). The microphone gain was adjusted before each session to avoid clipping, and no hardware-level noise reduction or real-time denoising was applied.

Participants were seated ~25 cm from the microphone and instructed to begin sustained vowel phonation immediately after an optical signal indicated the start of recording, continuing for 5 s until a second signal marked its termination. This protocol ensured consistent capture of the onset, steady-state, and offset phases of phonation (fade-in and fade-out). The procedure was repeated for four tasks—sustained /aː/ (as in “father”) and /eː/ (as in “excellent”), each produced at low and high volume. If any recording was interrupted, incomplete, or exhibited irregular phonation, it was repeated to maintain consistency and data quality.

Sustained vowel phonation was selected because it offers the linguistically most neutral and standardized speech unit, decreasing the cross-speaker and cross-language variability that typically affects running speech22. This methodological choice aligns with prior voice-based digital biomarker studies in Parkinson’s disease23, amyotrophic lateral sclerosis24, and other neurological and speech-related conditions25,26. For these reasons—and to facilitate future external validation in multilingual cohorts—we restricted the present analysis to sustained vowel phonation.

Audio preprocessing

All recordings underwent a standardized preprocessing pipeline prior to feature extraction. No additional denoising or filtering was applied in order to preserve the original acoustic characteristics of the signal.

Features extraction

Following preprocessing, we extracted 28 acoustic features commonly used in pathological speech and neurological voice analysis11,27. These included measures of phonatory stability, amplitude variation, spectral shape, and harmonic structure. Feature computation was performed using the Librosa (v0.10.1) and Surfboard (v0.2.0) Python libraries.

The following features were extracted: Log Mel Spectrogram, Morlet Continuous Wavelet Transform, Fundamental Frequency Statistics, Loudness, Root Mean Square Energy, Fundamental Frequency Contour, Formants, Linear Prediction Coefficients, Detrended Fluctuation Analysis, Harmonics-to-Noise Ratio, Shimmer (Amplitude Perturbation), Jitter (Frequency Perturbation), Pitch Period Entropy, Logarithmic Energy, Intensity, Zero Crossing Rate, Mel-Frequency Cepstral Coefficients, Chroma Short-Time Fourier Transform, Chroma Constant-Q Transform, Chroma Energy Normalized Statistics, Formants Sliding Window Analysis, Magnitude Spectrum, Bark Scale Spectrogram, Spectral Slope, Spectral Flux, Spectral Centroid, Spectral Spread, Linear Spectral Frequencies. Zero Crossing Rate is a scalar value and can be directly used as an input to a classifier. However, features such as the Magnitude Spectrum, Log Mel Spectrogram, and others that are extracted as vectors require additional processing. For these vector-based features, we calculated statistical functionals, such as the mean, max, min, standard deviation, first quartile (Q1), second quartile (median, Q2), third quartile (Q3), first percentile (P1), 99th percentile (P99), and interquartile ranges (Q2–Q1, Q3–Q1, Q3–Q2).
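The scalar-versus-vector distinction above can be illustrated with a short sketch. This is a simplified NumPy illustration on a synthetic tone, not the study's actual pipeline (which used Librosa and Surfboard); the signal parameters and functional names are assumptions for demonstration only.

```python
import numpy as np

def zero_crossing_rate(x):
    # fraction of adjacent sample pairs whose sign differs (a scalar feature)
    s = np.sign(x)
    return float(np.mean(s[:-1] != s[1:]))

def spectral_functionals(x):
    # collapse the vector-valued magnitude spectrum into fixed statistics
    mag = np.abs(np.fft.rfft(x))
    p1, q1, q2, q3, p99 = np.percentile(mag, [1, 25, 50, 75, 99])
    return {"mean": mag.mean(), "std": mag.std(),
            "min": mag.min(), "max": mag.max(),
            "P1": p1, "Q1": q1, "Q2": q2, "Q3": q3, "P99": p99,
            "Q2-Q1": q2 - q1, "Q3-Q1": q3 - q1, "Q3-Q2": q3 - q2}

# synthetic 5-second "sustained vowel": a 220 Hz tone with mild noise, 16 kHz
sr = 16000
t = np.linspace(0, 5, 5 * sr, endpoint=False)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(t.size)

features = {"zcr": zero_crossing_rate(signal), **spectral_functionals(signal)}
```

Collapsing each variable-length vector feature into a fixed set of functionals yields a constant-size feature vector per recording, which is what a tabular classifier such as XGBoost requires.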

Model training

To evaluate classification performance across algorithmic families, we trained four machine-learning models: Elastic net-regularized logistic regression, support vector machine, random forest, and XGBoost.

Hyperparameter tuning was performed using an outcome-stratified nested cross-validation with out-of-fold predictions to ensure an unbiased performance estimate. This method consists of two nested loops: (1) an outer loop and (2) an inner loop. The outer loop first partitions the entire dataset into 10 folds. In each iteration, one of these 10 folds is designated as an independent, untouched outer validation set. The remaining nine folds serve as the training set for hyperparameter tuning. This process guarantees that every data sample is used for the final validation exactly once and remains completely separate from the model training process in that specific iteration. Within the training set of each outer loop, an inner loop was used to determine the optimal hyperparameters. This loop performs a 5-fold cross-validation, systematically evaluating different hyperparameter combinations. The combination that achieved the best performance on the inner validation folds was selected as the optimal configuration. This optimal hyperparameter configuration was then used to train the final model on the entire training set of the outer loop. The performance of the trained model was then evaluated on the untouched outer validation set. The entire process was repeated 10 times, with each of the original 10 folds serving as the outer validation set (i.e. multiple external-like validations). The final performance metrics were calculated by averaging the results from all ten outer validation folds. This nested approach provides a robust and less optimistic estimate of the true generalization error of the model, mimicking the evaluation on a completely new, unseen dataset28.

The following grid search strategy over specified hyperparameters was applied in the inner 5-fold CV: (1) for XGBoost—subsample [0.5, 0.75, 1], num_parallel_tree [100], max_depth [3, 6, 9], learning_rate [0.05, 0.1, 0.2], min_child_weight [1, 10, 100]; (2) for logistic regression—solver [saga], penalty [elasticnet], l1_ratio [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9], C [0.05, 0.1, 0.15, 0.25, 0.5], class_weight [none, balanced], tol [1e–5]; (3) for SVM—kernel [rbf], C [0.01, 0.1, 1, 10, 100], gamma [scale, 1, 0.1, 0.01, 0.001, 0.0001]; (4) for random forest—n_estimators [100, 200, 300], max_depth [10, 15, 20], min_samples_split [10, 20, 50], min_samples_leaf [2, 5, 10].
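The nested procedure described above can be sketched in scikit-learn as follows. This is a minimal illustration on synthetic data, using a random forest with a deliberately reduced grid as a stand-in; the study tuned four model families, including XGBoost, over the full grids listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# synthetic stand-in for the acoustic dataset (28 features, imbalanced classes)
X, y = make_classification(n_samples=100, n_features=28, weights=[0.6, 0.4],
                           random_state=42)

# reduced random-forest grid for illustration (full grids are listed above)
param_grid = {"n_estimators": [100], "max_depth": [10, 15],
              "min_samples_leaf": [2, 5]}

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
oof_pred = np.zeros(len(y))  # out-of-fold predicted probabilities

for train_idx, val_idx in outer.split(X, y):
    # inner loop: stratified 5-fold grid search on the nine training folds
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=inner, scoring="roc_auc")
    search.fit(X[train_idx], y[train_idx])
    # the best configuration is refit on the full outer-training set, then
    # evaluated exactly once on the untouched outer validation fold
    oof_pred[val_idx] = search.best_estimator_.predict_proba(X[val_idx])[:, 1]

auroc = roc_auc_score(y, oof_pred)  # estimate over all out-of-fold predictions
```

Because every sample receives its prediction from a model that never saw it during tuning or training, the resulting AUROC approximates evaluation on an unseen dataset.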

Model evaluation

Model discrimination is represented by the area under the receiver operating characteristic curve (AUROC) with a standard deviation (SD) and 95% confidence interval (95% CI), estimated using bootstrap resampling on aggregated CV estimates with 200 replicates. Model calibration is shown with calibration belts (givitiR R package), and miscalibration significance was evaluated using the Hosmer–Lemeshow test.
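A bootstrap CI on aggregated out-of-fold predictions can be sketched as follows. The labels and probabilities below are synthetic stand-ins, not study data; the 200 replicates match the number stated above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
# synthetic stand-ins for aggregated out-of-fold labels and probabilities
y_true = rng.integers(0, 2, 93)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.25, 93), 0.0, 1.0)

point_auroc = roc_auc_score(y_true, y_prob)

boot = []
for _ in range(200):  # 200 bootstrap replicates
    idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
    if np.unique(y_true[idx]).size < 2:
        continue  # skip degenerate resamples containing a single class
    boot.append(roc_auc_score(y_true[idx], y_prob[idx]))

boot = np.asarray(boot)
sd = boot.std(ddof=1)                               # SD of the AUROC estimate
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])  # percentile 95% CI
```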

Explainability

To interpret the predictions of the model with the best discriminatory and calibration characteristics, SHAP (SHapley Additive exPlanations) values were used, quantifying the contribution of each feature to individual predictions (higher SHAP values mean a greater impact)29. A summary and dependence plots present the distribution of SHAP values for each feature across all samples, emphasizing the features with the most significant influence. Machine learning was carried out in Python 3.9.18, using scikit-learn package 1.3.2 and XGBoost package 2.0.2.

Descriptive statistical analyses

Descriptive statistical analyses were performed using RStudio software (version 1.2.1335). Continuous variables are presented as median and interquartile range (IQR). Categorical variables are presented as absolute counts and percentages (%). There were no missing values in the analyzed variables. The significance of distribution differences in continuous and categorical variables was examined using the Kruskal–Wallis and Chi-square tests, respectively. All tests were performed at a significance level of p ≤ 0.05.
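The two tests can be reproduced with SciPy as follows (the analyses themselves were run in R). The cohort values below are synthetic placeholders, not study data.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal

rng = np.random.default_rng(1)
# hypothetical age values for the two cohorts (continuous variable)
age_covert = rng.normal(58, 10, 36)
age_control = rng.normal(57, 11, 57)
stat, p_age = kruskal(age_covert, age_control)  # Kruskal-Wallis test

# hypothetical male/female counts per cohort (categorical variable)
counts = np.array([[21, 15],
                   [30, 27]])
chi2, p_gender, dof, expected = chi2_contingency(counts)
```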

Weights availability

XGBoost weights are available on GitHub: https://github.com/matthires/covert_hepatic_encephalopathy.