Introduction

Steatotic liver disease (SLD), formerly known as fatty liver disease, is a common disease defined by the presence of steatosis in more than 5% of the hepatocytes1. A new nomenclature was recently published, classifying SLD into 5 groups: Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD), Alcohol-Associated Liver Disease (ALD), a combination of MASLD and ALD (MetALD), SLD secondary to another specific etiology, and cryptogenic SLD2. The term MASLD replaces the former terms Non-Alcoholic Fatty Liver Disease (NAFLD), and Metabolically-Associated Fatty Liver Disease (MAFLD). A great concordance has been shown among these three definitions, and previous NAFLD studies are considered to be valid under the new MASLD definition3.

MASLD is the most common cause of chronic liver disease worldwide1 with a continuously growing global prevalence associated with the increasing prevalence of diabetes, obesity, and metabolic syndrome4,5,6. Its prevalence is estimated to be at 30–38% in adults with a 50.4% increase in the last 3 decades4,7,8. A meta-analysis estimated a high prevalence of MASLD in South America (30.4%)9. MASLD prevalence is higher in metabolic risk groups, affecting > 70% of patients with type 2 diabetes mellitus (T2DM), and 90% of patients with severe and morbid obesity undergoing bariatric surgery6. Likewise, ALD is one of the leading causes of chronic liver disease10 but contrary to MASLD, ALD prevalence has remained stable during the last decades, with an estimated global prevalence of 8%6,11. ALD may coexist with other liver diseases, such as viral hepatitis and MASLD (MetALD), contributing to the progression of liver disease12,13.

SLD is associated with liver and non-liver adverse outcomes. It can progress to steatohepatitis, fibrosis, cirrhosis, end-stage liver disease, and hepatocellular carcinoma (HCC), and is associated with an increase in all-cause mortality1,6,14. Both MASLD and ALD are considered leading indications for liver transplantation1,6,13,14,15. ALD was the most common underlying reported chronic liver disease in patients with acute-on-chronic liver failure16. Furthermore, MASLD is associated with a significant number of comorbidities that lead to a higher risk of non-liver malignancies such as colorectal cancer, lung diseases, chronic kidney disease, cognitive impairment, and complications of T2DM6,17,18. Also, MASLD shares a complex bidirectional relationship with cardiovascular disease, which is the main cause of death in these patients8,18.

Consequently, SLD is associated with a high economic burden, which has been increasing in recent decades19,20. MASLD significantly increases health-care costs compared to those of non-MASLD patients. The average rate of overall outpatient visits at 5 years following diagnosis was 40% higher among patients with MASLD compared with controls19,21. Furthermore, MASLD is associated with a reduction in health-related quality of life compared to patients with no liver disease and patients with liver disease due to other causes22,23. MASLD-related deaths due to cirrhosis and liver cancer have increased by 76.7% and 95.1%, respectively between 1990 and 201924.

The diagnosis of SLD requires evidence of hepatic steatosis by either imaging or histology2,5,15,25. Liver biopsy is considered the gold standard for diagnosis, but it is associated with low, though not negligible, complication rates26 and is reserved for specific scenarios such as diagnostic doubt or patients at increased risk for advanced fibrosis25. Consequently, in routine clinical practice, most diagnoses of SLD are made radiologically. Abdominal ultrasound is the most commonly used method because it is relatively inexpensive, accessible, and innocuous, with a sensitivity and specificity of approximately 85% and 94%, respectively1,26,27, and its use is recommended by international guidelines as a first-line diagnostic test28,29,30.

Hepatic steatosis is characterized by a bright liver echotexture and blurring of the hepatic vasculature. Ultrasound reliability is operator dependent, is limited in patients with central obesity, and has limited sensitivity in mild steatosis31. Alternative imaging techniques, including vibration-controlled transient elastography (VCTE), magnetic resonance spectroscopy (MRS), and magnetic resonance proton density fat fraction (MRI-PDFF), are associated with higher costs18. MRS has a good correlation with MRI-PDFF, which has a sensitivity of 93% and a specificity of 94%32. Despite its better accuracy for detecting steatosis, cost and limited availability restrict its use in clinical practice26,30.

Early SLD diagnosis is important, since prompt initiation of treatment can stop disease progression, lead to a reduction in adverse outcomes, and reduce the economic burden associated with the disease33,34. A recent study showed that a screening strategy for MASLD followed by intensive lifestyle interventions, or pioglitazone in persons with T2DM, is cost-effective35. Life-style interventions in patients with MASLD have proven regression in MRI-PDFF36 and improvement in liver histology37. In patients with ALD, alcohol abstinence reduces the risk of disease progression14. However, early diagnosis is difficult, since SLD patients are often asymptomatic and have no laboratory alterations, especially in early stages26. The use of screening techniques can help disease detection in asymptomatic patients. Although abdominal ultrasound is a relatively low-cost test and has good performance as a first-line diagnostic test, in larger screening studies the cost and availability of imaging impact feasibility, especially in primary care centers30.

Currently in Latin America, there is no consensus on recommending SLD screening in the general population, due to the low cost-effectiveness of this practice, and the associated risks of invasive tests. However, MASLD screening is recommended in patients with repeatedly altered liver enzymes, features of metabolic syndrome, or obesity. In these cases, abdominal ultrasound is the recommended initial screening method38. However, this method is not widely available in primary health care.

From this perspective, there have been attempts to create prediction models to diagnose SLD without the use of imaging, or biopsy, that can be applied to the general population39,40. More recently, with the development of machine learning, new models have been published41,42,43,44,45,46,47,48,49,50,51,52,53. Some of these models use simple variables, such as age, body mass index (BMI), alanine aminotransferase (ALT), aspartate aminotransferase (AST), gamma-glutamyl transferase (GGT), fasting plasma glucose (FPG), and triglyceride.

ML models have also been used previously in the prediction of different diseases. Several models were developed to predict the risk of gestational diabetes mellitus (GDM)54. ML was used to diagnose acute gastrointestinal (GI) bleed55.

State of the art in SLD prediction with ML models

Several models have been developed to predict SLD, NAFLD and MAFLD with various approaches39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,56,57,58,59,60,61,62,63,64,65. Most of them use data from medical visits and blood tests. Liver ultrasound combined with diffusion models has been used to improve the classification of NAFLD64. The best performance was reached with a combination of Triglycerides, Glycemia, and Waist Circumference (WC). For the prediction of NAFLD, 35 clinical and biochemical variables were used with an extreme gradient boosting (XGB) model42, and 12 variables were used as input for an XGB model46. A random forest (RF) model for the prediction of SLD was built using age, gender, systolic blood pressure (SBP), diastolic blood pressure (DBP), abdominal girth and triglycerides43. Thirty variables were used in a random under-sampling (RUS) boosted tree48. A custom logistic regression (LR) model was compared with the Fatty Liver Index (FLI) and the Hepatic Steatosis Index (HSI), using patient data from two different regions of Europe. On average, the custom model reached better prediction results than FLI and HSI. The custom model required 8 variables: age, serum aspartate aminotransferase (AST), AST-to-alanine aminotransferase (ALT) ratio, waist circumference (cm), ferritin, body mass index (BMI), serum triglycerides and gout49. Eleven variables were used for the prediction of SLD in a Chinese cohort using an XGB model50; the variables included BMI, albumin, ALT, globulin, fasting blood glucose (FBG), high-density lipoprotein cholesterol (HDL-c), low-density lipoprotein cholesterol (LDL-c) and triglyceride. A combination of 28 variables with an RF was used to predict NAFLD; however, the most important variables were WC, chest circumference, trunk fat and BMI51. After variable selection, just 8 variables were used for an XGB model to predict NAFLD52. BMI, uric acid, triglyceride, HDL, height, hemoglobin, LDL, carcinoembryonic antigen (CEA), AST, age, glucose, and alpha-fetoprotein (AFP) were the inputs of an XGB model53.
A variety of clinical data is used to diagnose MAFLD and NAFLD, with the best performance obtained by using the Fatty Liver Index (FLI)56. Another approach employed genetic biomarkers to predict NAFLD, using Machine Learning (ML) methods to select the biomarkers and then performing the prediction with a nomogram based on them57. A logistic regression (LR) model is used for MAFLD and NAFLD prediction using blood tests including Triglycerides, Glycemia, HOMA-IR, and data measured in clinical visits58. An XGB is also used with 27 variables59. Age, gender, BMI, Cholesterol, HDL-C, LDL-C, Glucose, GOT-AST, and GPT-ALT were used for the prediction of SLD with an LR60. BMI, Waist Circumference, HDL, Triglycerides, ALT, Tuber Consumption, Fried Food Consumption, Diabetes and Hyperuricemia were used for the prediction of NAFLD with a stepwise LR model61. Sex, Age, Gamma-Glutamyl Transferase (GGT), Glucose, and Abdominal Volume Index were used as inputs to a neural network to predict NAFLD62. Age, gender, BMI, waist-to-hip ratio (WHR), ALT, LDL, HDL, uric acid (UA), and smoking were used with an LR model to predict MAFLD in young adults (18–44 years old)63. Age, sex, waist circumference, BMI, ALT and triglyceride glucose index were used to predict NAFLD with an LR model65.

FLI was proposed by Bedogni et al.39 for the prediction of SLD. The model requires just 4 variables: gamma-glutamyltransferase (GGT), BMI, triglyceride, and WC. HSI was proposed by Lee et al.40 to predict NAFLD, and requires ALT/AST ratio, BMI, diabetes mellitus (DM) and gender. In both cases, the models employed were LR39,40. In other studies, models use more complex variables; e.g., data from multiple medical visits is used to predict SLD47.
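To make the two reference indices concrete, the following is a minimal sketch of how FLI and HSI are commonly computed. The coefficients are quoted from the formulas as they are usually reported for the original publications39,40 and should be verified against them before any use; the function names, argument names, and the units (triglycerides in mg/dL, GGT in U/L, WC in cm) are assumptions of this sketch.

```python
import math


def fatty_liver_index(tg_mg_dl, bmi, ggt_u_l, wc_cm):
    """FLI on a 0-100 scale: logistic transform of a linear score.

    Coefficients as commonly reported for Bedogni et al.; verify before use.
    """
    z = (0.953 * math.log(tg_mg_dl)
         + 0.139 * bmi
         + 0.718 * math.log(ggt_u_l)
         + 0.053 * wc_cm
         - 15.745)
    return 100.0 / (1.0 + math.exp(-z))


def hepatic_steatosis_index(alt, ast, bmi, has_dm, is_female):
    """HSI: 8 x ALT/AST ratio + BMI, plus 2 for diabetes and 2 for female sex.

    Formula as commonly reported for Lee et al.; verify before use.
    """
    score = 8.0 * alt / ast + bmi
    if has_dm:
        score += 2.0
    if is_female:
        score += 2.0
    return score


# Hypothetical patient values, for illustration only.
fli = fatty_liver_index(tg_mg_dl=150, bmi=30, ggt_u_l=40, wc_cm=100)
hsi = hepatic_steatosis_index(alt=40, ast=20, bmi=30, has_dm=True, is_female=True)
print(f"FLI = {fli:.1f}, HSI = {hsi:.1f}")
```

Both indices need only a handful of routinely measured variables, which is why they serve as natural baselines for the models discussed here.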

Deep learning capabilities in pattern recognition

Deep learning (DL), which has its roots in conventional neural networks, significantly outperformed its predecessors in pattern recognition tasks66. Deep Learning models include a layered architecture of data representation, in which the high-level features can be extracted from the last layers of the networks while the low-level features are extracted from the lower layers66,67. These architectures were originally inspired by Artificial Intelligence (AI) simulating processes of key sensory areas, such as vision in the human brain68. One of the main advantages of DL is mimicking how the human brain works. With great success in many fields, deep learning has reached excellent performance in tasks that require pattern recognition, image classification, object detection, video processing, natural language processing, and speech processing, among others68,69,70. Patient variables, obtained through exams or physical measurements and coded as tabular data, are still a challenge for DL models71. Some approaches consist of transforming tabular data into an image. DeepInsight converts non-image data into an image, formed by arranging pixel positions with similar features together72. Bazgir et al. present a feature representation termed REFINED to create a 2D image based on the pairwise relationships between features73. An image generator for tabular data (IGDT) also positions features based on how close they are to each other74. Sharma et al. proposed methods to represent a 1-D vector as a 2-D graphical image, using a bar graph, a normalized distance matrix, and a combination of both75. These methods use convolutional neural networks (CNNs) for classification and report good results in comparison with traditional models.

A multimodal model was used to predict severe hemorrhage in placenta previa using MRI and tabular data76.

Our contributions

Several studies have shown the tremendous success of DL methods in medical image analysis77,78,79,80. In particular, with the introduction of CNNs, many pattern recognition problems in images have been solved successfully. Part of the success could be attributed to CNN architectures that are based on the visual system architecture with convolutional layers that extract various features from images. Our proposed models consider the spatial representation of each variable, including width and height, in the image, and the importance of the spatial position of each variable since these could match the filters of the convolutional layers of the CNNs spatially to improve classification results.

We developed a model using DL for disease prediction by transforming the input variables from tabular data to images, with the goal of using the power of CNN models to recognize complex patterns in images. Our hypothesis is that DL models can outperform traditional machine learning models in disease prediction. This is the main problem we address in our study, proposing a novel method to treat tabular data as images. The main contributions of our method are summarized as follows:

  1. In this work we report the development of a novel DL method for the prediction of SLD, which consists of transforming the input variables from tabular data into images, with the goal of matching the variable representation sizes in the image with those of the filters of the convolutional layers of the CNN models to reach the best classification performance. Additionally, based on our literature review, existing approaches require a larger number of variables than those available in the SLD prediction problem and have not explored the variable representation size and position. Also, previously published methods have not been applied to SLD prediction.

  2. We applied our method to an important illness, SLD, which has a prevalence between 30 and 38% in adults, using a database of 2,999 patients. Detection of SLD is important, since prompt initiation of treatment can stop disease progression, lead to a reduction in adverse outcomes, and reduce the economic burden associated with the disease.

  3. Our proposed model is compared with the twelve different traditional ML models we have also developed for SLD prediction.

  4. A search was performed for the optimization of the hyperparameters of all DL and ML models. We also included the application of a variable selection process, reducing the redundancy of data to improve the performance of the model. Additionally, our proposed method could be extended to other illnesses.

  5. Our results show that by using DL on the transformed patients’ data, we obtained significantly better performance than that based on traditional machine learning models, and those based on the Hepatic Steatosis Index (HSI). For the same level of sensitivity, we obtained an average reduction in false positives (FP) of 9.85%, which is very significant.

Materials and methods

Database

The dataset used in this study was obtained from patients attending the Preventive Medicine Unit of the University of the Andes Clinic, in Santiago, Chile. The dataset includes data from 2,999 patients, obtained between February 2022 and October 2023. The only inclusion criteria were that patients had to be at least 18 years old and had to have all variables assessed. AUDITc was computed using the brief test proposed in 1998 by Bush et al.81, in which the range of values is between 0 and 12. The data for each input to the model was normalized by subtracting the mean and dividing by the standard deviation. The dataset was randomly divided into three partitions: training set (70%), validation set (10%), and testing set (20%). Our study is a retrospective study on de-identified data. The institutional review board (IRB) of Clinica Universidad de los Andes, Santiago, Chile, approved the data usage and waived the need for informed consent. All methods were performed in accordance with relevant guidelines and regulations.
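The normalization and partitioning described above can be sketched as follows. This is a minimal illustration with NumPy, not the implementation used in the study: the function name, the random seed, and the choice to compute the mean and standard deviation on the training partition only are assumptions, since the text does not state which statistics were used.

```python
import numpy as np


def split_and_standardize(X, y, seed=0):
    """Random 70/10/20 split followed by z-score normalization.

    Assumption: mean and standard deviation are estimated on the training
    partition and reused for validation and test, to avoid information leakage.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = round(0.70 * len(X))
    n_val = round(0.10 * len(X))
    train, val, test = np.split(order, [n_train, n_train + n_val])

    mean = X[train].mean(axis=0)
    std = X[train].std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns

    def scale(A):
        return (A - mean) / std

    return ((scale(X[train]), y[train]),
            (scale(X[val]), y[val]),
            (scale(X[test]), y[test]))


# Synthetic stand-in for the 2,999 x 22 table (the real data are not public here).
rng = np.random.default_rng(1)
X_demo = rng.normal(5.0, 2.0, size=(2999, 22))
y_demo = rng.integers(0, 2, size=2999)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_and_standardize(X_demo, y_demo)
print(train_X.shape, val_X.shape, test_X.shape)
```

With 2,999 rows, rounding 70% and 10% gives partitions of 2,099, 300 and 600 patients.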

Data augmentation

Data Augmentation (DA) is a method commonly used in ML and DL to improve the performance of the models54,82. In this study, a DA method is proposed that is only used on the training set. It consists of the creation of new patients whose data must lie within a range of values established by a medical specialist in Hepatology. The objective of using these ranges is that the new, artificially created patients have variables with values that are validated by the medical experts. Table S1 of the Supplementary Material presents the range for each of the variables proposed by the medical specialists, including the limitations for some variables, such as Glycemia and AUDITc, for which the new patient must keep the same category as the original patient. BMI adapts to the changes of weight and height, and it is recomputed; however, the new patient must be within the same category of the BMI classification recommended by the WHO83,54. A similar DA was proposed and used by us54. We named this proposed DA Controlled Noise (CN), since for the new patients the input variables take random values within the assigned ranges. We used three options. In the first one, we created i patients from each original patient. In the second option, we created i patients from each original positive patient, and in the third option we created i patients from each original positive patient, and j patients from each original negative patient, with i > j. Using the last two options, the training set may be balanced for the positive and negative cases.
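The Controlled Noise idea can be sketched as below. This is an illustrative approximation only: the variable names and the noise half-widths in `NOISE_HALF_WIDTH` are hypothetical placeholders (the actual ranges are those of Table S1), uniform sampling is an assumption, and only the BMI category constraint is shown.

```python
import numpy as np

# Hypothetical noise half-widths per variable; the real ranges are in Table S1.
NOISE_HALF_WIDTH = {"weight_kg": 1.0, "height_m": 0.01, "alt_u_l": 2.0}


def bmi_category(bmi):
    """WHO classes: 0 underweight, 1 normal, 2 overweight, 3 obese."""
    return int(np.digitize(bmi, [18.5, 25.0, 30.0]))


def controlled_noise(patient, n_new, rng, max_tries=100):
    """Create up to n_new synthetic patients that keep the original BMI category."""
    original_cat = bmi_category(patient["weight_kg"] / patient["height_m"] ** 2)
    clones = []
    for _ in range(n_new):
        for _ in range(max_tries):
            new = dict(patient)
            for name, half_width in NOISE_HALF_WIDTH.items():
                new[name] = patient[name] + rng.uniform(-half_width, half_width)
            # BMI is recomputed from the perturbed weight and height.
            new["bmi"] = new["weight_kg"] / new["height_m"] ** 2
            if bmi_category(new["bmi"]) == original_cat:
                clones.append(new)
                break  # accepted; otherwise resample (may give up after max_tries)
    return clones


def augment(patients, labels, i, j, rng):
    """Third option in the text: i clones per positive, j per negative (i > j)."""
    new_rows, new_labels = [], []
    for patient, label in zip(patients, labels):
        clones = controlled_noise(patient, i if label == 1 else j, rng)
        new_rows.extend(clones)
        new_labels.extend([label] * len(clones))
    return new_rows, new_labels


rng = np.random.default_rng(0)
example = {"weight_kg": 80.0, "height_m": 1.75, "alt_u_l": 30.0}  # hypothetical patient
clones = controlled_noise(example, 5, rng)
rows, labs = augment([example, example], [1, 0], i=3, j=1, rng=rng)
```

Setting j = 0 reproduces the second option (positives only), and using the same count for both classes reproduces the first.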

Development of a new DL method for the prediction of SLD

In this study we developed a model using DL for the prediction of SLD transforming the input variables from tabular data to images, with the goal of using the power of CNN models to recognize complex patterns in images. The method has three stages: the data transformation stage, the CNN stage, and the classification stage.

Data transformation stage

To use the power of CNN models to recognize complex patterns in images, the patient’s data must first be transformed into an image. Each patient’s data can be represented as a vector of n variables/features, and this vector can be transformed into a matrix. CNNs normally use an image of 224 × 224 as input67, whereas in our case the number of available variables for each patient is 22. Therefore, we replicate the data m times, creating a matrix of (m × n). To increase the number of columns in the matrix, we replicate each column k times, which can be interpreted as increasing the width of each column. This process results in a matrix of (m × nk). The matrix is used for the three channels of the CNN67,84,85. Data Augmentation is applied to the whole image, changing the values randomly as described in the Data augmentation section of the Materials and methods. Figure 1 shows an example of the replication process for a patient with 9 variables (Fig. 1a). The resulting matrix (m × n) = (30 × 9) after replication of the rows is shown in Fig. 1b, and Fig. 1c shows the resulting matrix (m × nk) = (30 × 9*3) after column replication. Figure 2 shows the three options of DA for the patient with 9 variables of Fig. 1. Figure 2a shows the image without DA, Fig. 2b shows the image with DA modifying a percentage of each row, Fig. 2c shows the image with DA modifying a percentage of each column, and Fig. 2d shows the image with DA modifying a percentage of both rows and columns.

Fig. 1

(A) Example of data from a patient with 9 variables. (B) Resulting matrix (m × n) = (30 × 9) after row replications. (C) Resulting matrix (m × nk) = (30 × 9*3) after column replication.
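The row and column replication illustrated in Fig. 1 can be written compactly with NumPy. The sketch below is only an illustration of the described transformation; the function name `to_image` and the channel-first output layout are assumptions, chosen because that layout is what PyTorch CNNs expect.

```python
import numpy as np


def to_image(x, m, k):
    """Turn a vector of n variables into a 3-channel (3, m, n*k) array.

    Rows: the vector is replicated m times.
    Columns: each variable is repeated k times side by side.
    """
    x = np.asarray(x, dtype=np.float32)
    rows = np.tile(x, (m, 1))            # shape (m, n)
    wide = np.repeat(rows, k, axis=1)    # shape (m, n*k)
    return np.stack([wide, wide, wide])  # same matrix in all three channels


# The Fig. 1 setting: 9 variables, m = 30 rows, k = 3 columns per variable.
img = to_image(np.arange(9), m=30, k=3)
print(img.shape)  # (3, 30, 27)
```

`np.repeat` keeps the k copies of each variable adjacent, so every variable occupies a vertical band of width k, which is the "column width" that the method tunes.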

Fig. 2

Example of DA options in the patient with 9 variables of Fig. 1. (A) Image without DA. (B) Image with DA modifying a percentage of each row. (C) Image with DA modifying a percentage of each column. (D) Image with DA modifying a percentage of both rows and columns.

Binary/Categorical variables are not used to create the input image for the CNN in this transformation because they just take two values. These binary variables are considered as inputs to the classification stage as shown in Fig. 3. Considering this new patient data representation, we added another type of DA usually used with images67,82,84 consisting of small random rotations of ± 5° and vertical flips with a probability of 50%, both applied only to the training set.

Fig. 3

Block diagram of the proposed method including the data transformation stage, the CNN stage, and the classification stage for SLD prediction.

CNN stage

As described in the previous section, after the data transformation stage, each patient’s data is represented by an image that becomes the input to a CNN. The selected CNN is ResNet-50, using the 1.5 version implemented in PyTorch86. This CNN is used to extract features from the image of each patient. The weights of the CNN model were pretrained using the ImageNet 2012 dataset87. The classification layer was removed and replaced by the classification stage.

Classification stage

The classification stage is a Multi-Layer Perceptron (MLP), consisting of 3 hidden layers of 1000, 500, and 100 neurons. The input to this MLP is the output of the CNN and the binary/categorical variables. The output of this stage is the SLD prediction. The entire process including the three stages, i.e., the data transformation stage, the CNN stage, and the classification stage, can be observed in Fig. 3.

Traditional ML models

To compare the performance of our proposed method, we used twelve traditional ML models developed for SLD prediction using tabular data. The models are the following: Gaussian Naïve Bayes (GNB)88, Bernoulli Naïve Bayes (BNB)88, Decision Trees (DT)88, Support Vector Machines (SVMs)88, Multi-Layer Perceptron (MLP)88, K-Nearest Neighbors (KNN)88, Logistic Regression (LR)88, Random Forest (RF)88, Extra Trees (ET)88, Balanced Random Forest (BRF)89, and Gradient Boosting Machines (GB), in two popular implementations, Extreme Gradient Boosting Machines (XGB)90 and Light Gradient Boosting Machines (LGBM)91. These ML models have been used in the prediction of various illnesses with tabular data as inputs, e.g., Gestational Diabetes54. DA Controlled Noise is also applied to these ML models.

DL and ML model implementations and hyperparameters

The models were implemented in Python 3.10.11, using the libraries PyTorch 2.0.1, Scikit-Learn 1.2.2, Imbalanced-Learn 0.10.1, XGBoost 1.7.3, and LightGBM 3.3.5. The hyperparameters for our proposed DL models are related to the image generated m and k, that adjust the size of the image. The values studied for m are 50, 150, 180, 210, 250, and 300. The values analyzed for k are 1, 2, 3, 4, 10, and 36. The layer selected to be the output of the CNN Stage is also a hyperparameter. Two layers of the ResNet were analyzed, Average Pooling layer (Avgpool) and the last convolutional layer. The alternative of selecting some of the CNN intermediate layers has been studied with good results67. The spatial position of the variables in the image is also a hyperparameter. Several random positions were first analyzed, and then we performed permutations of variable positions around those that yielded good results in the first analysis.

The hyperparameters used for the traditional ML models are shown in Table S2 of the Supplementary Material. The hyperparameter selection was performed using a grid search evaluated in a 5-Fold Cross Validation (CV)92. Variable selection was part of the grid search, and the best variables were used for our proposed model. Variable selection was performed using 4 methods/metrics to select the optimal number of variables required for the best model performance, whilst reducing redundancy: F-test of ANOVA (Analysis of Variance), Chi-Square Test, Mutual Information, using the implementation of Scikit-Learn88 and Balanced Random Forest89.
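A grid search that treats variable selection as part of the search, evaluated with 5-fold cross validation and AUC, can be sketched with Scikit-Learn as follows. This is a hedged illustration, not the study's configuration: the synthetic data, the logistic regression classifier, and the grid values are assumptions (the actual grids are in Table S2), and the Chi-Square test is omitted because it requires non-negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 22 variables, roughly 27% positives.
X, y = make_classification(n_samples=300, n_features=22, n_informative=6,
                           weights=[0.73, 0.27], random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest()),               # variable selection step
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid: selection metric, number of variables, and one model hyperparameter.
param_grid = {
    "select__score_func": [f_classif, mutual_info_classif],
    "select__k": [5, 11, 22],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_["select__k"], round(search.best_score_, 3))
```

Placing the selector inside the pipeline means the variables are re-selected within each training fold, so the cross-validated AUC is not inflated by selection on held-out data.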

The top 15% of the models with the highest area under the curve (AUC) were selected and assessed on the validation set.

Model evaluation

The validation set was used to select the best models and the decision threshold of our models. The test set was not used in training, model selection, or in decision threshold selection. Trained models were tested using the test set. With the decision thresholds, model results of accuracy, sensitivity, specificity, recall macro, area under the ROC curve (AUCROC), False positives (FP) and False Negatives (FN) are available. A high sensitivity is a priority for gastroenterologists because the model is intended to be used for SLD screening purposes. Therefore, we explored sensitivities above 0.80 (80%) with special attention.
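The threshold selection and the reported metrics can be sketched as follows. This is a minimal illustration under assumptions: the rule "highest threshold that reaches the target sensitivity on the validation set" and the function names are not taken from the study, and Recall Macro is computed as the mean of sensitivity and specificity, which is its definition for a binary problem.

```python
import numpy as np


def threshold_for_sensitivity(y_true, scores, target):
    """Highest threshold whose sensitivity is at least `target` (0 < target <= 1)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    positives = np.sort(scores[y_true == 1])[::-1]          # descending scores
    needed = int(np.ceil(target * len(positives) - 1e-9))   # true positives required
    return positives[needed - 1]                            # predict 1 if score >= threshold


def confusion_metrics(y_true, scores, threshold):
    """Confusion counts and derived metrics at a fixed decision threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(y_true),
        "recall_macro": (sensitivity + specificity) / 2,
    }


# Toy validation scores (hypothetical), five positives followed by five negatives.
y_val = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
s_val = [0.9, 0.8, 0.7, 0.4, 0.2, 0.6, 0.5, 0.3, 0.1, 0.05]
thr = threshold_for_sensitivity(y_val, s_val, target=0.80)
metrics = confusion_metrics(y_val, s_val, thr)
```

In practice the threshold would be fixed on the validation set and then applied unchanged to the test set, so the sensitivity observed on the test set can differ slightly from the target.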

Results

Population characteristics

A total of 2999 patients was included in this study. The dataset was partitioned into a training set of 2099 patients (70%), a validation set of 300 patients (10%), and a test set with 600 patients (20%). The prevalence of SLD in the dataset was 26.64% (799/2999). The test set had 159 patients positive for SLD. Table 1 presents the variables collected in the dataset.
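As a quick arithmetic check of these figures, the short sketch below recomputes the partition sizes and prevalence, and shows that, with 159 positive patients in the test set, each sensitivity level reported later in the Results corresponds to a whole number of true positives. The helper name is an assumption of this sketch.

```python
n_total, n_positive = 2999, 799
partitions = {"train": 2099, "validation": 300, "test": 600}
test_positives = 159

# Partition sizes add up to the full dataset.
total_from_parts = sum(partitions.values())

# Prevalence in percent, rounded as in the text.
prevalence = round(100 * n_positive / n_total, 2)


def true_positives(sensitivity, n_pos=test_positives):
    """Number of detected positives implied by a reported sensitivity."""
    return round(sensitivity * n_pos)


reported = [0.9497, 0.8994, 0.8805, 0.8365, 0.8113, 0.8050]
tp_counts = {s: true_positives(s) for s in reported}
for s, tp in tp_counts.items():
    # Each reported value equals TP / 159 rounded to four decimals.
    print(f"sensitivity {s:.4f} -> TP = {tp}, FN = {test_positives - tp}")
```

For example, a sensitivity of 0.8050 corresponds to 128 of the 159 positive patients being detected, leaving 31 false negatives.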

Variable selection

Twenty-two variables were available for the prediction of SLD. Variable selection improves the performance of the models by reducing the use of irrelevant or redundant data. The best 11 variables for each method appear in Table S3 of the Supplementary Material.

Table 1 Clinical variables of the patients in the dataset. IQR, interquartile range.

Model performance

Table 2 shows the performance of our proposed DL model (OursDLM) compared with traditional ML models. Table 2 includes the number of variables used, the image size for our proposed model, and the DA used. In the traditional ML models, only noise in artificial patients could be applied, while in our DL proposed model, the noise could be applied in rows and/or columns. In both cases, the number of patients created with noise appears next to them on the table. This means that both ML and DL models could use DA Controlled Noise. Noise is a small value, with ranges provided by a medical specialist. Only the training set is altered by this DA. Our DL proposed model also has the possibility of DA with Vertical Flip and Random Rotations. Table 2 also shows the following metrics: Accuracy, Sensitivity, Specificity, Recall Macro, AUCROC, a confusion matrix inline, and the total number of errors for each model. Table 2 also shows the best DL model and the best traditional ML model for each level of sensitivity from 1 to 0.8000. Additional results of OursDLM and OursMLM for more sensitivity levels are included on Table S4 of the Supplementary Material, and Table S4 includes results presented on Table 2. We also included Table S5 in the Supplementary Material, with information about the output layer used and spatial position of the variables. Models are tested using a test set with data that was not used in any of the methods employed to find the best models.

Our DL models show excellent SLD prediction capability with the use of simple-to-obtain clinical and laboratory variables. Table 2 presents the following models that obtained excellent results and that the medical specialists can choose depending on the desired balance between sensitivity (FN) and specificity (FP).

The DL models reached better results compared to the traditional ML models for the same levels of sensitivity.

Table 2 The top two models (our DL model and our traditional ML model) for different sensitivity levels, with sensitivity ≥ 0.8052, for model numbers 1 (T1) to 32 (T32) and up to 14 variables.

A comparison of the results, after changing the parameters of replication, was performed for three models (23, 27, 31) at the same threshold of sensitivity as that in Table 3. Replication parameters affect the input image size to the CNN. It can be seen on Table 3 that an image with a shorter height (50) has a worse performance in comparison with our selected value (180), with an average increase in error of 30 FP patients. With a height of 100, there is an increase of error of 19 FP on average. An increase in height, to 210, also has a larger number of errors, up to an average of 8 FP patients. Conversely, decreasing the width of the columns to 5 increases the errors to 86 FP patients on average, in comparison with our selected value (36). With a width of 12, errors increase on average to 57 FP patients in comparison with our selected value. Reducing the width of the column to 12, the error of FP is 36 patients more than in the case of our selected width. If we set the column width to 1, i.e., no replication of the column, the error increases on average to 114 FP patients. Finally, if we increase the width of each column to 40, the error also increases on average to 32 FP patients.

Table 3 Comparison of three models (23, 27, 31) with the same model but different parameters of replication, which vary image size.

Table S6 of the Supplementary Material shows the results of the Hepatic Steatosis Index (HSI) applied to our database for the various levels of sensitivity from 1 to 0.8050. The comparison of these results to those in Table 2 shows that the results of all the DL models are significantly better than those of the HSI model. In this study, a screening was performed with the general population to detect patients at high risk of SLD. Then, a second confirmatory diagnostic test was performed only in those patients at high risk of SLD. In the first diagnostic test, which is widely available in the health care system, we use the biochemical profile that includes the variables needed to determine HSI.

Table S7 in the Supplementary Material shows the metrics for the traditional models without optimization (default hyperparameters), with and without variable selection, and without Data Augmentation. A comparison of the results between models with DA and without DA is shown in Table S8. Thirty of the thirty-two models achieved better results with DA (all models except models 23 and 31). Without DA the errors increased by 4.65% on average, with the greatest difference in Model 1, with 44 more errors when DA is not used.

Table S9 in Supplementary Material shows the thresholds used to calculate the metrics for the models.

A comparison between our proposed model and TabNet is presented in Table S4. In general, TabNet models achieved better, or similar, results compared to those of the traditional ML models. TabNet achieved improved results, e.g., a smaller number of errors, compared to the traditional ML models, in models 1–6, 16–19, 26–28, 30 and 31. TabNet achieved the same results as the traditional ML models in models 15, 21 and 29. At other sensitivity thresholds (0.9623 to 0.9182, 0.8805, 0.8679 to 0.8491, 0.8050), traditional ML models achieved better results than TabNet (models 7–15, 20–25, 29 and 32). In general, TabNet models yielded lower results compared to our proposed models for high sensitivity ranges (0.8050 to 1). For example, our model 20 correctly predicts 14 more negative patients compared to TabNet 20 (328 vs. 314), with the same sensitivity value of 0.8805. Another example is our model 5 and TabNet 5, where the difference is 33 patients not detected by TabNet.

Figure 4 shows the ROC curves for DL models 9, 17, 27, 31, and 32, with sensitivities 0.9497, 0.8994, 0.8365, 0.8113, and 0.8050, respectively. These DL models reach higher AUCROC values than the corresponding traditional ML models for all levels of sensitivity in the range 1–0.8050. They also reach much higher AUCROC values than those obtained with HSI.

Figure 5 offers another way to compare model results: it shows the total number of errors (FP + FN) as a function of the True Positives (TP) in the test set. Our DL models (OursDLM) are shown in blue. Our traditional ML models, with optimization and variable selection, are shown in red. The traditional ML models without optimization but with variable selection are shown in orange, and those without optimization and without variable selection are shown in purple. The results of the HSI models in our database are shown in green.

Fig. 4

ROC curves of our DL models, OursDLM, with sensitivities 0.9497, 0.8994, 0.8365, 0.8113 and 0.8050.

Fig. 5

Total number of errors (FP + FN) as a function of the true positives (TP) in the test set. Our DL models (OursDLM) are shown in blue. In red are our traditional ML models (with optimization and variable selection). The traditional ML models without optimization and with variable selection are in orange. In purple are the traditional ML models without optimization and with no variable selection. The results of the HSI models in our database are in green.

Discussion

Our study presents a new method that enables applying the power of DL models for the prediction of SLD. We also developed twelve different traditional ML models, optimizing their hyperparameters, and compared their results to those of the DL models in the prediction of SLD. In general, the application of the traditional ML models has emerged as a tool to help identify diseases and make decisions in real time43. In our study, we used a dataset that includes data from 2999 patients attending a preventive medicine unit. SLD prevalence in our population was 26.6%.

Our results show that DL models outperform the traditional ML models in a high sensitivity range (1–0.8113). For lower sensitivities (< 0.8052), the traditional ML models reach the best results. Figure 5 shows the total number of errors (FP + FN) as a function of the True Positives (TP) in the test set. Our DL models (OursDLM, blue curve) achieve the lowest total number of errors, below that of our traditional ML models (red curve). However, both OursDLM and OursMLM yield better results than the ML models without optimization (orange and purple curves). It is important to note that variable selection and parameter optimization in the traditional ML models improve results significantly (compare the orange, purple, and red curves).

Our DL models show excellent SLD prediction capability using simple-to-obtain clinical and laboratory variables. Table 2 presents models that obtained excellent results, from which medical specialists can choose depending on the desired balance between sensitivity (FN) and specificity (FP). For example, model 9 reached a sensitivity of 0.9497 (8 FN), a specificity of 0.6417 (158 FP), and an AUCROC of 0.8662. Model 17 reached a sensitivity of 0.8994 (16 FN), a specificity of 0.7211 (123 FP), and an AUCROC of 0.8660. For cases where lower sensitivity and higher specificity are acceptable, model 27 reached a sensitivity of 0.8365 (26 FN), a specificity of 0.7868 (94 FP), and an AUCROC of 0.8630. Additionally, model 31 achieved a sensitivity of 0.8113 (30 FN) and a specificity of 0.8004 (88 FP), with an AUCROC of 0.8562. These findings are of particular importance given the increasing prevalence of SLD and its associated adverse effects. The high prevalence of this disease makes it difficult to implement universal screening programs. Our DL models could help identify patients at increased risk for SLD who would benefit from confirmatory testing. They also identify patients with a very low risk of presenting the disease, for whom further studies can be omitted, thus reducing costs. Our best models showed that the most effective predictive variables were age, weight, BMI, waist perimeter, AST, ALT, triglycerides, and HDL cholesterol. These clinical and laboratory variables are easy to obtain and inexpensive, which facilitates the implementation of this model in primary care.

The use of a two-step screening program, in which a formula is applied to select high-risk patients who benefit from abdominal ultrasonography, has shown a reduction in ultrasonography requests, with a low false-negative rate93. In a two-step screening program using one of our models (model 17, Table 2), we could avoid 55.7% of abdominal ultrasounds, with a false negative rate of 2.7%.
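The percentages above can be reproduced from a confusion matrix. The following is a minimal sketch, not the authors' code. The FN and FP counts are those reported for model 17, whereas the TP and TN counts (143 and 318) are not stated in the text and are back-calculated here from the reported sensitivity and specificity, so they should be treated as an assumption. Under that assumption, the 2.7% figure corresponds to false negatives as a share of all screened patients rather than to FN / (FN + TP).

```python
def two_step_screening_summary(tp, fn, fp, tn):
    """Summarise a two-step screening scheme from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Patients predicted negative skip the confirmatory ultrasound.
        "ultrasounds_avoided": (tn + fn) / total,
        # Diseased patients missed, as a share of everyone screened.
        "missed_share_of_screened": fn / total,
    }


# FN = 16 and FP = 123 are reported for model 17. TP = 143 and TN = 318 are
# assumed values, back-calculated from sensitivity 0.8994 and specificity 0.7211.
summary = two_step_screening_summary(tp=143, fn=16, fp=123, tn=318)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The same function can be applied to any other row of Table 2 to compare how many confirmatory ultrasounds each operating point would avoid.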

For the same level of sensitivity, OursDLMs reduce FP relative to OursTMLs by 15.625 on average (9.85% reduction) over the sensitivity range analyzed, with a minimum of 3 (3.37% reduction at a sensitivity of 0.8050) and a maximum of 43 (20.09% reduction at a sensitivity of 0.9748). This improved performance is due to the transformation of the data into images and the subsequent application of CNNs for pattern recognition, including the use of our proposed DA. Compared with traditional models without optimization but with variable selection, OursDLMs reduce FP by 23.59 on average (14.91% reduction) for the same level of sensitivity, with a minimum of 12 (12.24% reduction at a sensitivity of 0.8050) and a maximum of 50 (22.62% reduction at a sensitivity of 0.9748). Both optimization and variable selection matter for the traditional models, since without them the difference is even greater. For example, compared with our traditional models without optimization or variable selection, OursDLMs reduce FP by 43.19 on average (23.83% reduction) for the same level of sensitivity, with a minimum of 29 (14.80% reduction at a sensitivity of 0.9623) and a maximum of 56 (31.64% reduction at a sensitivity of 0.8931). Against HSI, the difference is larger still: OursDLMs reduce FP by 88.28 on average (38.52% reduction) at the same level of sensitivity, with a minimum of 60 (30.61% reduction at a sensitivity of 0.9120) and a maximum of 183 (47.04% reduction at a sensitivity of 0.9874).

Variable selection impact

Table S10 in the Supplementary Material compares the traditional models with variable selection against the same models using all 22 available variables. Across the 32 models, using all 22 variables increases the number of errors by 1.93% on average. The greatest increase in error was for model 24, with an increase of 13 FP. For models 1, 30, and 31, using 22 variables yielded the same results. Models 2, 5, 6, 10, 12, 23, 26, and 27 improve performance by up to 7 patients. However, in these cases, 22 variables are more than double the number required by the models with variable selection (e.g., model T2 requires 8 variables, and model V22-2 requires 22 variables). Using 22 variables would also require more time to enter the data when the models are used in clinical practice.

Weight-related variables (BMI, Waist Perimeter, and/or Weight) are ranked in the top 5 by all the variable selection methods. The same holds for GPT ALT and Triglycerides. Cholesterol HDL is ranked 6th by all the methods except Chi-Square, which ranks it 4th.

Using SHAP, a plot of global feature importance is shown in Fig. 6. The plot shows that GPT ALT is the most influential variable, followed by Triglycerides, BMI, and Age. Cholesterol HDL, and GOT AST have a lesser influence in model prediction.

Fig. 6

Global feature importance for SLD prediction, based on SHAP values.
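Global SHAP importance is conventionally computed as the mean absolute SHAP value of each feature over all samples. A minimal sketch of that aggregation follows, assuming per-sample SHAP values are already available. The function name, feature names, and numbers are hypothetical placeholders, not values from this study.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across samples."""
    n_samples = len(shap_rows)
    means = [
        sum(abs(row[j]) for row in shap_rows) / n_samples
        for j in range(len(feature_names))
    ]
    # Largest mean |SHAP| first, i.e. most influential feature first.
    return sorted(zip(feature_names, means), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per patient, one column per feature.
toy_shap = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.15, 0.00],
    [0.40, -0.05, -0.10],
]
ranking = global_importance(toy_shap, ["feature_a", "feature_b", "feature_c"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as influential.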

Datasets from other published studies have limited public availability, which makes a direct comparison of model performance impossible. Patient inclusion criteria also differ between studies. Since our patients were selected from a screening study, the prevalence of SLD is similar to that of the population. In previous publications51,52,60, however, patients were selected from a group seeking care for a disease, and therefore the SLD prevalence is much higher. Thus, models may not be directly comparable. As a reference, models from previous studies for SLD prediction are shown in Table 4.

Table 4 Results of models published in previous studies. Used as reference.

The input variables used by OursDLMs, OursTMLs, and the state-of-the-art models are shown in Table S11 of the Supplementary Material, including the variable selection method used. Seven variables are used by all of our models: Age, Weight, BMI, Waist Perimeter, GPT ALT, Triglycerides, and Cholesterol HDL. Our traditional models use 10.85 variables on average, compared with 8.28 for OursDLMs. Notably, the OursDLM with 8 variables uses variables selected by the Chi-Square Test method, while the OursDLM with 11 variables uses variables selected by BRF.

The purpose of our study is to develop a screening method for SLD. In patients at high risk of SLD, a second test would then be performed to confirm the diagnosis. In this context, the priority is to correctly predict positive patients, i.e., to achieve high sensitivity and a low FN rate. A good level of specificity also reduces the rate of false positives (FP), so that the second test is performed in a smaller group of patients at high risk of SLD. We therefore consider at least these two metrics together when judging model performance. A high sensitivity requires a large number of true positives (TP), i.e., most patients with the disease are detected, with a low number of false negatives (FNs are patients with the disease who are not detected). A high specificity requires a small number of FPs. In a screening task, good performance is needed first in sensitivity and second in specificity, which amounts to reducing both False Negatives and False Positives.

AUC ROC provides an overall assessment of the diagnostic performance of a model at different values of sensitivity and specificity. Nevertheless, the ROC curves of different models may intersect and perform differently over various ranges of FP. Therefore, for screening (high sensitivity and high specificity), there may be models with better performance than those with the largest total AUC ROC94,95,96. For example, as shown in Table S4, traditional models 1 to 12 have a higher AUC ROC but worse performance in FP values for the same level of sensitivity (sensitivities from 1 to 0.802). Another example in Table S4 is model 13, in which our proposed model has a slightly larger AUC ROC (0.8681 versus 0.8678), yet the number of FPs is reduced significantly, from 157 to 142, with our model. Many other examples can be observed in Table S4. Table S4 also shows that, for all levels of sensitivity from 1 to 0.8050, the best specificity (i.e., the best combination of low FNs and FPs) is reached with our proposed models.

The partial area under the ROC curve (pAUC) is a metric that can be used to compare models in different regions of the ROC curve. Table S12 shows the pAUC for true positive rate (TPR) computed in steps of 0.2. In each range, one of our models achieved the best results. For example, model 31 achieved the best results in the range 0–0.2, and in 0.2–0.4. In the range 0.8–1, our models 27 and 32 yielded the best results.
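One way to compute a pAUC restricted to a TPR band is to integrate (1 − FPR) over that band, which makes the band values sum to the full AUC. The sketch below implements this for a piecewise-linear ROC curve. Whether Table S12 uses this exact definition is not stated, so it is an assumption, and the function name and the toy ROC points are hypothetical.

```python
def partial_auc_tpr_band(roc_points, tpr_lo, tpr_hi):
    """Integrate (1 - FPR) over TPR in [tpr_lo, tpr_hi].

    roc_points: list of (fpr, tpr) pairs ordered from (0, 0) to (1, 1).
    """
    area = 0.0
    for (f0, t0), (f1, t1) in zip(roc_points, roc_points[1:]):
        if t1 <= t0:
            continue  # horizontal segment: no extent along the TPR axis
        lo = max(t0, tpr_lo)
        hi = min(t1, tpr_hi)
        if hi <= lo:
            continue  # segment lies outside the requested band
        # Linearly interpolate FPR at the clipped TPR limits.
        slope = (f1 - f0) / (t1 - t0)
        f_lo = f0 + slope * (lo - t0)
        f_hi = f0 + slope * (hi - t0)
        # Trapezoid rule on (1 - FPR) over the clipped TPR interval.
        area += (hi - lo) * (1.0 - (f_lo + f_hi) / 2.0)
    return area


# Hypothetical ROC curve, used only to exercise the function.
toy_roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
bands = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
band_values = [partial_auc_tpr_band(toy_roc, lo, hi) for lo, hi in bands]
print(band_values, sum(band_values))
```

With this definition, a model can have the larger total AUC and still lose in the 0.8–1 band, which is the region that matters for high-sensitivity screening.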

The same conclusion can be reached by using the precision-recall (PR) curve. In Figure S1, our models show greater precision in the region of interest (recall 0.8–1). For example, model 9 has a larger area between recall values of 0.9 and 0.98. Model 17 achieves the best results between recall values of 0.86 and 0.9. Models 17, 27, 31, and 32 are better in the recall region 0.8–0.86. BRF has lower precision, despite achieving a larger AUC PR in the region between a recall of 0.925 and 1. MLP has good precision in the region of interest (recall 0.8–1), but our proposed model achieves better results.

Conclusion

SLD, formerly named fatty liver disease, has a prevalence estimated at 30–38% in adults. Detection of SLD is important, since prompt initiation of treatment can stop disease progression, reduce adverse outcomes, and reduce the economic burden associated with the disease. In this study, we reported the development of a novel DL method for the prediction of SLD, which consists of transforming the input variables from tabular data into images, with the goal of using the pattern recognition power of DL models to achieve the best classification performance. For that purpose, the data of each patient, originally represented as a vector of n variables, were converted into an image by replicating each variable m times in one dimension and k times in the other, creating a matrix of size m × kn. Variable selection and a data augmentation method were used during training to improve prediction results. Twelve traditional machine learning (ML) models were implemented for comparison with our proposed DL models.
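The tabular-to-image conversion described above can be sketched as follows. The text fixes only the output size (m × kn). The layout assumed here, in which each variable fills k adjacent columns and the resulting row is repeated m times, is one possible reading, and the function name and example values are hypothetical.

```python
def vector_to_image(values, m, k):
    """Turn a vector of n variables into an m x (k*n) matrix.

    Assumed layout: each variable is repeated k times along the columns,
    and the resulting row is repeated m times along the rows.
    """
    row = [v for v in values for _ in range(k)]
    return [list(row) for _ in range(m)]


# Hypothetical normalised values for a patient with n = 3 variables.
patient = [0.2, 0.7, 0.5]
image = vector_to_image(patient, m=4, k=2)
for image_row in image:
    print(image_row)
```

Each variable then occupies an m × k block of identical pixels, giving a convolutional network a spatial pattern to work with instead of a flat vector.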

All our proposed DL models reached better results than the traditional ML models at all levels of sensitivity in the range 1–0.8050. This sensitivity range was selected by the hepatology specialist as appropriate for SLD screening purposes. For example, a sensitivity of 0.9497, a specificity of 0.6417, and an AUCROC of 0.8662 were reached with one of our DL models. Another model reached a sensitivity of 0.8113 and a specificity of 0.8004, with an AUCROC of 0.8562. These models require only 8 variables that are widely available in clinical practice. We also reached significantly better results than those obtained with the Hepatic Steatosis Index (HSI). All our DL models reached higher AUCROC values than the traditional ML models in the sensitivity range 1–0.8050, and much higher AUCROC values than those obtained with HSI. Our proposed method converts tabular data into images, which enables the pattern recognition power of DL models to be applied to the prediction of SLD. The combination of our proposed DL model with variable selection, hyperparameter optimization, and data augmentation allows us to set a new state-of-the-art level for SLD prediction, reaching better performance than traditional ML models and HSI. The proposed method may be applied to the prediction of other illnesses by converting tabular data into images.