Introduction

Diabetes is a chronic metabolic disorder, sometimes with autoimmune origins, whose rising prevalence is strongly shaped by environmental and genetic factors1. In 2010, approximately 285 million people worldwide were affected, a number projected to rise to 552 million by 2030, representing 6.4% of the global adult population2. Diabetes is broadly classified into three types: Type 1, Type 2, and gestational diabetes. Type 1 diabetes, an autoimmune disorder in which the pancreas fails to produce insulin, typically develops in children and young adults under 30 years of age, and insulin therapy remains the primary treatment. Type 2 diabetes, in contrast, is associated with insulin resistance, obesity, and ageing, occurring more frequently in individuals over 65 years. Unlike Type 1, Type 2 can often be predicted, prevented, or diagnosed based on family history, gender, and lifestyle factors. Gestational diabetes develops during pregnancy and may result in maternal hyperglycemia3. Early diagnosis of diabetes is critical, and technological tools play an important role in identifying clinical indicators, reducing human error, and streamlining assessments4. However, clinical datasets are often underutilized due to heterogeneity, high dimensionality, and sparsity. Sparse data is a well-recognised challenge not only in healthcare but also in recommender systems, genomics, and natural language processing (NLP). In recommender systems, matrix factorization and collaborative filtering techniques are used to address sparsity in user–item interactions5. In genomics, dimensionality reduction and regularization methods help manage high-dimensional yet sparse gene expression data6. Similarly, in NLP, word embeddings and transformer-based models mitigate sparsity in text representations7. These strategies suggest possible solutions that could be adapted to healthcare, where datasets are often both sparse and high-dimensional. 
Given these challenges, Machine Learning (ML), an expanding subfield of artificial intelligence, offers significant promise, as it can capture nonlinear relationships in complex datasets and automatically learn patterns without explicit programming8,9.

Chang et al.10 studied a sparse dataset of health indicators for early diabetes prediction and diagnosis, utilising ML approaches, notably Decision Trees (DT), Random Forests (RF), K-Nearest Neighbours (KNN), Logistic Regression (LR), and Naive Bayes (NB). These common ML algorithms often struggle to extract significant features from sparse datasets, resulting in lower accuracy. Furthermore, driven by advances in computational power, Deep Learning (DL), a frontier approach in ML, has recently achieved remarkable success, often surpassing state-of-the-art methods across various healthcare domains11,12,13.

Despite extensive literature reviews on the use of artificial intelligence (AI) for diabetes, which include some traditional ML methods and statistical models, systematic research focusing on DL applications for diabetes remains lacking14,15,16. For instance, Pathak et al.17 examined the prediction gap for type 2 diabetes using rigorous learning methods. However, they noted that the available data are often less precise. The majority of research publications have predominantly employed ML techniques for diabetes prediction. Although DL algorithms have been extensively applied to the prediction of various diseases, their application to type 2 diabetes remains relatively limited. Zhang et al.18 provided a non-invasive approach that uses the stacked sparse autoencoder methodology to identify diabetes mellitus in a dataset of facial images. Additionally, Kannadasan et al.19 developed a Deep Neural Network (DNN) for diabetic data classification and applied it to the Pima Indians’ diabetes dataset. The DNN employs stacked autoencoders for feature extraction and SoftMax for classification. However, the Pima Indians’ diabetes dataset has 12% sparse data, and the DNN fails to handle it. Their findings demonstrate that sparse data issues in diabetes prediction necessitate robust techniques. Furthermore, Alex et al.20 developed a Deep Convolutional Neural Network (CNN) for diabetes prediction, utilising data from Electronic Health Records (EHRs) for training. However, this model relies on massive, labelled datasets, making it computationally intensive and difficult to interpret. Similarly, Vivekanandan et al.21 proposed a Stacked Autoencoder for feature learning, extraction, and classification with shallow and deep neural networks, utilising the Pima Indians’ diabetes dataset. However, as the data is sparse, this model struggles to extract the relevant features, resulting in lower classification accuracy.
Further, García et al.22 proposed a DL approach that combines a variational autoencoder for data augmentation, a stacked autoencoder for feature extraction, and a CNN for the prediction of type 2 diabetes, also using the Pima Indians’ diabetes dataset. However, the limited dataset size constrained the model’s performance.

Thaiyalnayaki et al.23 used the Pima Indians’ diabetes dataset to predict type 2 diabetes with Multilayer Perceptrons (MLPs) and support vector machines. The hyperparameters of the MLP classifier are fine-tuned to minimise the loss function and optimise the model, but the model has difficulty extracting features and cannot provide adequate accuracy. This limitation is largely attributed to the small dataset, which restricts the model’s ability to generalize effectively. Also, Miotto et al.24 proposed a Stacked Denoising Autoencoder (SDA) for predicting type 2 diabetes using an EHR dataset, which enables the capture of complex patterns in EHRs without the need for manual feature engineering. However, it relies solely on the frequency of the laboratory test without considering the actual test result, resulting in low predictive power for the disease. Moreover, Chetoui et al.25 developed a federated learning framework that employs Vision Transformers for diabetic retinopathy detection, enhancing accuracy while preserving data privacy. However, the model demonstrated limitations when applied to larger imbalanced datasets, such as EyePACS, and faced challenges associated with high-dimensional sparse data, including overfitting, difficulties in feature selection, and increased computational complexity. Similarly, Lan et al.26 introduced a Higher-Dimensional Transformer (HDformer) for diabetes detection using photoplethysmography (PPG) signals, achieving high accuracy and enhanced computational efficiency through the Time Square Attention (TSA) module. Nevertheless, the model faced challenges with real-world noise in PPG signals, limited scalability for longer signal periods, and poor generalization due to small or imbalanced datasets.
Beyond these task-specific architectures, transformer-based frameworks such as Medical Bidirectional Encoder Representations from Transformers (Med-BERT), Bidirectional Encoder Representations from Transformers for Electronic Health Records (BEHRT), and Tabular Neural Attention Network (TabNet) have also been applied in broader healthcare contexts. Med-BERT and BEHRT utilize self-attention mechanisms for EHR modelling, thereby improving sequence learning and disease prediction27,28, while TabNet, which is designed for structured tabular data, has proven effective in managing high-dimensional clinical datasets29.

These models highlight the potential for healthcare tasks; however, their performance may still be constrained by sparsity, high dimensionality, and domain adaptation challenges, encouraging further investigation into this area. Prior studies in healthcare, particularly in disease prediction using DL algorithms, have shown that these algorithms often focus on dense datasets where most values are non-zero and struggle to extract functional characteristics from sparse data collections effectively. Table 1 provides an overview of various machine learning and deep learning models, highlighting their advantages and limitations.

Table 1 Limitations of medical AI models on sparse & high-dimensional data.

Healthcare datasets often contain numerous sparse values; however, removing these values can substantially reduce the dataset size, ultimately compromising predictive accuracy. Careful feature selection is therefore essential to minimize the risk of false predictions. Accordingly, the primary objective of this study is to accurately predict Type 2 diabetes on two sparse diabetes datasets using the proposed deep learning-based algorithm, HSSAE. The main contributions of this study are outlined below.

  I. Develop an HSSAE algorithm as an advanced autoencoder-based deep learning framework for identifying key features associated with high-risk factors in diabetes prediction.

  II. Propose a hybrid loss function that combines binary cross-entropy loss with \({\text{L}}_{1}\) and \({\text{L}}_{2}\) regularization, balanced by a single parameter \(\alpha\), to improve feature selection efficiency in sparse datasets.

  III. Gauge the performance and reliability of the proposed HSSAE algorithm using Precision, Recall, Accuracy, F1-score, AUC, and Hamming Loss as key assessment metrics.

  IV. Evaluate the HSSAE algorithm against traditional classifiers, including DT, RF, KNN, and NB, and against DL models such as CNN, LSTM, and SSAE, demonstrating its superior accuracy on both Type 2 diabetic datasets.

The rest of the paper is organized as follows: Section “Materials and methods” describes the datasets, preprocessing techniques, and the architecture of the proposed HSSAE model. Section “Experiments and analysis” details the experimental setup, evaluation metrics, and comparative analysis with traditional ML and DL models. Section “Discussion” provides an in-depth interpretation of the results, emphasizing the strengths of the proposed HSSAE algorithm and its areas for improvement. Finally, Section “Conclusion” summarizes the key findings, discusses the implications, and outlines potential future research directions.

Materials and methods

In the prediction of diabetes using the HSSAE algorithm, the research was conducted in three stages, as indicated in Fig. 1. Initially, the dataset was pre-processed using the Synthetic Minority Over-sampling Technique (SMOTE) in conjunction with min–max normalization. Following the pre-processing stage, the statistical analysis was performed to determine the nature of the data. Subsequently, the HSSAE algorithm was developed and trained with the pre-processed dataset. The HSSAE algorithm was employed to identify the complex patterns within the dataset, which were subsequently utilized for the prediction tasks. Finally, the performance of the algorithm, HSSAE, was evaluated using several metrics, including accuracy, precision, recall, F1 score, AUC, and Hamming loss. These metrics provided a comprehensive overview of the algorithm’s performance in diabetes prediction.

Fig. 1 Framework of the study from data collection to evaluation.

Data acquisition, statistical analysis, and preprocessing

This study utilized the Diabetes Health Indicators Dataset (DHID), collected by the Centers for Disease Control and Prevention (CDC) via a telephone survey and available on Kaggle at “https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset”. The Health Indicator Dataset comprises 253,680 rows and 21 columns, with a sparsity of 43%, indicating that a substantial portion of the data contains zero values. Additionally, the EHRs Diabetes Prediction Dataset, accessible at “https://www.kaggle.com/code/ahmadkaif/diabetes-prediction”, was employed for diabetes prediction tasks. This dataset contains diverse patient information, including medical history, demographic data, and diabetes status, representing typical features of real-world clinical datasets. Initially, the dataset’s sparsity rate was 31%. To assess the robustness of the proposed HSSAE model under higher sparsity, we systematically increased the sparsity of selected numerical features (e.g., age, BMI, HbA1c level, and blood glucose level) by randomly setting 70% of their entries to zero, using a fixed random seed to ensure reproducibility. After this procedure, the sparsity of the selected features reached approximately 73%. The sparsity of each dataset was calculated using Eq. (1).

$$\text{Sparsity} = \frac{\text{number of zeros}}{\text{total number of values}} \times 100\%$$
(1)
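As a rough illustration of Eq. (1) and of the sparsity-injection step described above, the following sketch computes the percentage of zero entries in a small table and then zeroes a fixed fraction of the entries in selected columns. The function names `sparsity_percent` and `inject_sparsity`, the toy values, and the seed of 42 are assumptions made for this example; the source does not describe how this step was implemented or which seed it used.

```python
import random

def sparsity_percent(rows):
    """Percentage of entries that are exactly zero, as in Eq. (1)."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return 100.0 * zeros / total

def inject_sparsity(rows, columns, fraction, seed=42):
    """Return a copy of rows in which `fraction` of the entries of each
    selected column are set to zero, chosen at random with a fixed seed."""
    rng = random.Random(seed)          # fixed seed (assumed value) for reproducibility
    result = [list(row) for row in rows]
    n_zero = round(fraction * len(result))
    for col in columns:
        for i in rng.sample(range(len(result)), n_zero):
            result[i][col] = 0
    return result

# Toy table with made-up values: columns are (age, BMI, hypertension flag).
data = [[54, 27.3, 0], [61, 31.0, 1], [38, 22.5, 0], [47, 29.8, 0],
        [70, 33.1, 1], [29, 24.0, 0], [52, 26.4, 0], [66, 30.2, 1],
        [43, 25.7, 0], [58, 28.9, 0]]

before = sparsity_percent(data)
sparse = inject_sparsity(data, columns=[0, 1], fraction=0.7)
after = sparsity_percent(sparse)
print(f"sparsity before: {before:.1f}%, after: {after:.1f}%")
```

Because entries that are already zero can be selected again, the resulting sparsity of a real column may differ slightly from the nominal fraction, which is consistent with the roughly 73% reported above for a 70% zeroing rate.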

Statistical analysis of the sparse health indicators diabetes dataset

The dataset comprises two categories of variables: numerical and categorical. Numerical variables are quantitative measures such as Body Mass Index (BMI), general health, mental health, physical health, age, education level, and income, as presented in Table 2. BMI, with a mean of 28.38, high skewness (2.12), and kurtosis (10.99), highlights the presence of extreme obesity cases that are strongly linked to diabetes onset. General health, with a mean of 2.51, exhibits low skewness (0.42) and kurtosis (−0.38), suggesting that most participants reported average health. Mental health (mean 3.18) and physical health (mean 4.24) show high variances (54.95 and 76.00) and strong positive skewness (2.72 and 2.20), indicating that while most individuals experienced minimal issues, a minority with severe conditions substantially affected the distribution. Age, with a mean of 8.03 and low skewness (−0.35), reflects a well-balanced spread across groups, while education (mean 5.05) and income (mean 6.05) show negative skewness (−0.77 and −0.89), indicating that higher levels are more common; such levels are often associated with reduced diabetes risk. These patterns highlight the impact of socioeconomic and lifestyle factors on health outcomes.

Table 2 Statistical analysis summary of health indicator dataset.

On the other hand, categorical variables provide qualitative insights that can be analysed through graphical representations, as shown in Fig. 2. The findings reveal clear associations between diabetes status and several health-related and lifestyle factors. Individuals with high blood pressure and high cholesterol are more likely to have diabetes. At the same time, those who regularly undergo cholesterol checks also show a higher prevalence, possibly reflecting underlying health concerns. Smoking, stroke history, and heart disease or heart attack are strongly associated with diabetes, highlighting their role as significant comorbidities. Conversely, engagement in physical activity and higher consumption of fruits and vegetables are associated with lower diabetes prevalence, suggesting a protective effect of healthy lifestyle behaviours. Heavy alcohol consumption shows a modest positive association with diabetes, whereas healthcare access and affordability (AnyHealthcare and NoDocbcCost) indicate that diabetes remains prevalent regardless of these factors. Furthermore, difficulty walking is highly correlated with diabetes, reflecting mobility challenges among individuals affected by the condition. Lastly, sex-based differences are observed, although diabetes is prevalent across both groups.

Fig. 2 Categorical attributes of health indicator dataset.

Statistical analysis of the sparse EHRs diabetes prediction dataset

This dataset also comprises two categories of variables. The numerical features are Age, Hypertension, Heart disease, BMI, HbA1c_level, and Blood_glucose_level; Table 3 summarizes their statistical properties. Age shows wide variability (mean 12.62, standard deviation 22.85) with positive skewness (1.62) and kurtosis (1.24), indicating a concentration of younger participants alongside fewer older individuals at higher risk. Hypertension (mean 0.07) and heart disease (mean 0.34) are rare, as confirmed by extreme skewness (3.23, 4.73) and high kurtosis (8.44, 20.40), which highlights the imbalance between affected and unaffected cases, a factor that can bias predictive models if not addressed. BMI (mean 8.20) shows mild skewness (1.17) and near-zero kurtosis (−0.07), reflecting a relatively uniform spread. The HbA1c level (mean 41.18, standard deviation 2.60) demonstrates stable central tendencies with mild skewness (1.28), making it a reliable marker for diabetes. By contrast, the blood glucose level displays extreme variability (mean 0.86, variance 4501.92), with skewness (1.28) and kurtosis (0.29) indicating the presence of influential outliers. This heterogeneity reflects real-world metabolic dynamics, and while it challenges modelling, it provides critical diagnostic value.

Table 3 Statistical analysis summary of EHRs diabetes prediction dataset.

On the other hand, categorical variables offer qualitative insights that can be analysed descriptively using graphical representations of the relationship between variables such as gender, smoking history, and diabetes status, as shown in Fig. 3. As observed, diabetes cases occur almost entirely among males and females, with the “Other” gender category contributing very few instances, which may be due to its underrepresentation in the data. For smoking history, individuals with no recorded smoking history or who never smoked form the highest proportion in both diabetic and non-diabetic groups. However, a substantial number of diabetes cases also occur among former and current smokers, indicating an association between smoking status and the risk of diabetes. The “ever” and “not current” categories have relatively low case counts, indicating lower prevalence or reduced reporting.

Fig. 3 Categorical attributes of the EHRs diabetes prediction dataset.

Data pre-processing

Before applying the proposed algorithm, the datasets were divided into training and testing sets, with a 70:30 split to ensure adequate model training while preserving data for validation43. To guarantee reproducibility, this split was performed using a fixed random seed (random_state = 42). Normalization was applied using the MinMax Scaler to ensure that all features contribute equally. The min–max normalization approach provided in Eq. (2) was used to scale the feature values into the range [0, 1]44.

$${X}{\prime}=\frac{X-\text{min}(X)}{\text{max}\left(X\right)-\text{min}(X)}$$
(2)

where \({X}{\prime}\) is the normalized value, \(X\) is the original value, \(\text{min}(X)\) is the minimum value of \(X\), and \(\text{max}(X)\) is the maximum value of \(X\). It was observed that the datasets contained only a limited number of positive samples, while the majority were negative, resulting in a biased class distribution. To address this class imbalance, the Synthetic Minority Oversampling Technique (SMOTE)45 was applied with a fixed random state (random_state = 42) to ensure reproducibility. SMOTE generates new minority-class instances by interpolating in feature space between a minority-class sample and its nearest minority-class neighbours, creating synthetic examples while preserving the characteristics of the original dataset46. After applying SMOTE, the classes were balanced, as shown in Table 4, resulting in more reliable predictions and reduced bias when training ML and DL models.
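The two pre-processing steps can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: `min_max_scale` follows Eq. (2) for a single feature, and `smote_like` is a hypothetical, stripped-down version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours). The neighbour count `k`, the toy values, and both function names are assumptions.

```python
import random

def min_max_scale(column):
    """Scale a list of numbers into [0, 1] following Eq. (2)."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, ordered by squared Euclidean distance to base.
        others = sorted((p for p in minority if p is not base),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()             # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Made-up BMI values scaled into [0, 1].
bmi = [22.5, 27.3, 31.0, 40.1]
scaled = min_max_scale(bmi)

# Made-up, already-normalized minority-class samples with two features each.
minority = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.25, 0.15]]
new_points = smote_like(minority, n_new=3)
print(scaled)
print(new_points)
```

A common precaution is to compute the scaling limits on the training split only and to oversample only the training split, so that no information from the test split leaks into training; the source does not state the order in which these steps were applied.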

Table 4 Distribution before and after applying SMOTE.

Methodology

An autoencoder (AE) is a type of unsupervised neural network with three layers47: an input layer, a hidden layer, and an output (reconstruction) layer, as shown in Fig. 4; Fig. 5 illustrates its model representation. The autoencoder can gradually convert artificial feature vectors into conceptual feature vectors, effectively performing a nonlinear transformation from a high-dimensional space to a low-dimensional space. The autoencoder’s operation consists of two main stages: encoding and decoding.

Fig. 4 Basic Autoencoder48.

Fig. 5 Autoencoder model representation48.

The proposed HSSAE model

In this study, the HSSAE algorithm was developed, as shown in Fig. 6, which presents every processing step up to the point where the classification results are obtained.

Fig. 6 Flowchart of proposed algorithm HSSAE.

The sequential breakdown of the flow chart in Fig. 6:

figure a

Unified architecture

The HSSAE algorithm is based on the SSAE principle, as shown in Fig. 7. However, SSAE focuses on reconstruction-based features: it expects the reconstructed output \(\widehat{X}\) to restore the input \(X\) as closely as possible, i.e., \(\widehat{X}\approx X\). Suppose the input data \(X=\{{x}_{1}, {x}_{2}, {x}_{3},\dots ,{x}_{m}\}\) consist of \(m\) training samples, where each sample has \(N\) observations, \({X}_{i}= {\{ {x}_{i,1}, {x}_{i,2}, {x}_{i,3},\dots ,{x}_{i,N}\}}^{T}\) for \(i=1, 2, 3, \dots ,m\), so that \(X\in {\mathbb{R}}^{N\times m}\). The loss function of the stacked sparse autoencoder is then given by Eq. (3).

$$j\left(w,b\right)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}{\Vert \widehat{{X}_{i}}-{X}_{i}\Vert }^{2}+ \frac{\lambda }{2}\sum_{l=1}^{L}{\Vert {w}^{(l)}\Vert }_{2}^{2}+ \beta \sum_{h=1}^{L}KL(\rho \parallel {{\rho }{\prime}}_{h})$$
(3)

where the first term of Eq. (3), \(\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}{\Vert \widehat{{X}_{i}}-{X}_{i}\Vert }^{2}\), is the Mean Squared Error (MSE), which measures how accurately the network reconstructs the input data. The second term is the \({L}_{2}\) penalty, which penalizes large weight values. This term helps the model avoid overfitting by encouraging smaller, more generalized weights, and the parameter \(\lambda\) controls the strength of this penalty. The third term, \(KL\left(\rho \parallel {{\rho }{\prime}}_{h}\right)\), is the Kullback–Leibler (KL) divergence, which enforces sparsity in the hidden layers of the network. Sparsity ensures that only a small subset of neurons is activated at a time. The KL divergence quantifies the difference between the desired average activation level \((\rho )\) and the actual average activation \({{\rho }{\prime}}_{h}\) of the hidden layer neuron, and the parameter \(\beta\) controls the strength of the sparsity constraint. The SSAE extracts features from the data and then applies a separate ML model for classification. However, SSAE faces two challenges. First, it neglects the learning of discriminative features for sparse data prediction49. Second, passing the latent space to another algorithm, i.e., an ML model for prediction, increases the computational complexity50.
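The three terms of Eq. (3) can be evaluated on toy numbers as in the sketch below. It assumes the usual sparse-autoencoder form of the KL term, the divergence between two Bernoulli distributions with means \(\rho\) and \({{\rho }{\prime}}_{h}\), which the text above does not spell out. The function names, the hyperparameter values (`lam`, `beta`, `rho`), and all data are illustrative assumptions.

```python
import math

def kl_sparsity(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def ssae_loss(inputs, reconstructions, weights, mean_activations,
              lam=1e-3, beta=0.1, rho=0.05):
    """Reconstruction error + L2 weight penalty + KL sparsity penalty (Eq. 3)."""
    m = len(inputs)
    # First term: mean over samples of half the squared reconstruction error.
    mse = sum(0.5 * sum((r - x) ** 2 for r, x in zip(rec, inp))
              for rec, inp in zip(reconstructions, inputs)) / m
    # Second term: (lambda / 2) times the sum of squared weights over all layers.
    l2 = 0.5 * lam * sum(w ** 2 for layer in weights for row in layer for w in row)
    # Third term: beta times the summed KL divergence of the average activations.
    kl = beta * sum(kl_sparsity(rho, rho_hat) for rho_hat in mean_activations)
    return mse + l2 + kl

# Made-up inputs, reconstructions, one 2x2 weight matrix, and average activations.
inputs = [[1.0, 0.0], [0.0, 1.0]]
reconstructions = [[0.9, 0.1], [0.2, 0.8]]
weights = [[[0.5, -0.5], [0.25, 0.0]]]
mean_activations = [0.05, 0.2]

loss = ssae_loss(inputs, reconstructions, weights, mean_activations)
print(f"SSAE loss on the toy data: {loss:.5f}")
```

An average activation equal to the target \(\rho\) contributes nothing to the penalty, while activations further from the target are penalized increasingly, which is how the term pushes most hidden units towards inactivity.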

Fig. 7 SSAE structure.

To address the limitations of the SSAE, the HSSAE algorithm uses a custom hybrid loss function and integrates a supervised classification layer, i.e., a sigmoid layer, into the latent space of the encoder, bypassing the decoder part. The components of the HSSAE algorithm that are crucial for the learning process are described below.

Encoder layer

The encoder layers of the HSSAE algorithm map the input data to a lower-dimensional latent space representation. Many nonlinear transformations within these layers learn to extract the most prominent features and patterns. Let \({X}_{input layer}={h}^{0}\) represent the input layer, then the encoded layer can be defined by Eq. (4).

$$h^{(l)} = \sigma \left( w^{(l)} h^{(l-1)} + b^{(l)} \right), \quad l = 1, 2, \ldots , L-1$$
(4)

where, \({h}^{l}\) is the output of the \({l}^{th}\) layer, \({w}^{(l)}\) is the weight matrix of \({l}^{th}\) layer, \({h}^{(l-1)}\) is the output of the previous layer \({(l-1)}^{th}\), \({b}^{l}\) is the bias vector of the \({l}^{th}\) layer and \(\sigma\) is the activation function, i.e., ReLU.
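A minimal sketch of the layer-wise mapping in Eq. (4), written without external libraries. The layer sizes (4 → 3 → 2), the hand-picked weights, and the function names are hypothetical; they only illustrate how each encoder layer applies an affine transformation followed by ReLU.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(weights, bias, inputs):
    """Compute W h + b for one layer; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def encode(x, layers):
    """Apply Eq. (4) layer by layer: h^(l) = ReLU(W^(l) h^(l-1) + b^(l))."""
    h = x                               # h^(0) is the input layer
    for weights, bias in layers:
        h = relu(dense(weights, bias, h))
    return h

# Hypothetical 4 -> 3 -> 2 encoder with made-up weights and biases.
layers = [
    ([[0.5, -0.2, 0.0, 0.1],
      [0.0, 0.3, -0.4, 0.0],
      [-0.6, 0.0, 0.2, 0.5]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5],
      [0.0, 0.5, 0.5]], [0.0, 0.0]),
]

latent = encode([1.0, 0.0, 0.5, 0.0], layers)
print(latent)
```

Note that a sparse input (here two of the four entries are zero) simply removes the corresponding columns of the first weight matrix from the sum, so only the non-zero features influence the first hidden layer.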

Latent space layer

The latent space layer, as shown in Fig. 8, contains the most important and instructive features from the input data, removing noise and unnecessary information. Typically, the autoencoder follows the path of an encoder, a latent space, and a decoder on the input data to make predictions.

Fig. 8 Proposed algorithm HSSAE structure.

Binary prediction operates on the extracted features, and any classification model or layer can be applied to them. The latent space layer is represented mathematically in Eq. (5).

$${Z}_{Latent }={h}^{(L)}$$
(5)

where \({Z}_{Latent}\) represents the latent representation of the input data and \({h}^{(L)}\) is the output of the last layer.

HSSAE classification layer

The HSSAE algorithm utilizes the features extracted in the bottleneck layer directly for classification by applying a sigmoid layer, rather than decoding the latent representations, as illustrated in Fig. 9. This approach significantly reduces computational complexity while optimizing the prediction task. Unlike SSAE, the HSSAE algorithm employs the learned latent representations for classification without requiring an additional ML classifier. The output of the last encoder layer \({h}^{(L)}\), is passed through a sigmoid activation function \(\varphi\), to obtain the predicted probabilities \(\widehat{y}\) for binary classification, as defined in Eq. (6) and (7):

Fig. 9 Proposed algorithm HSSAE with classification layer.

$$\widehat{y}=\varphi ({Z}_{Latent })$$
(6)
$$\widehat{y}=\frac{1}{1+{e}^{-{h}^{(L)}}}$$
(7)

The HSSAE algorithm generates probabilities in the range [0, 1] for each instance, which can serve as a simple measure of predictive confidence: values close to 0 or 1 indicate high confidence. In contrast, values near 0.5 indicate higher uncertainty.
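The classification step of Eqs. (6) and (7) and the confidence reading described above can be sketched as follows. To obtain a single probability from a latent vector, the sketch assumes a linear combination of the latent features before the sigmoid; the output weights, the bias, the 0.5 decision threshold, and the `confidence` score are assumptions introduced for illustration and are not taken from the source.

```python
import math

def sigmoid(z):
    """Logistic function of Eq. (7), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def classify(latent, weights, bias, threshold=0.5):
    """Map a latent vector to (probability, label, confidence)."""
    z = sum(w * h for w, h in zip(weights, latent)) + bias
    p = sigmoid(z)
    label = 1 if p >= threshold else 0
    confidence = 2.0 * abs(p - 0.5)     # 0 = maximally uncertain, 1 = certain
    return p, label, confidence

# Hypothetical latent vector and output-layer parameters.
p, label, confidence = classify([2.0, 1.0], weights=[1.5, 0.5], bias=-0.5)
print(f"probability={p:.3f}, label={label}, confidence={confidence:.3f}")
```

In a screening setting, predictions whose probability lies close to 0.5 could be flagged for manual review instead of being reported as a firm positive or negative.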

In a binary classification problem, \(\widehat{y}\) is the probability predicted by the model that the input belongs to the positive class. Because of the sigmoid activation function \(\varphi\), the output is constrained between 0 and 1, which provides a probabilistic interpretation of how confident the model is in its prediction. The HSSAE algorithm combines supervised classification and unsupervised feature learning in a single framework, using the latent representations directly through the integration of the encoder and classification layers. This integration simplifies the prediction of target variables and strengthens the model’s capacity to extract relevant features.

\({\mathbf{L}}_{1}\) regularization

\({\text{L}}_{1}\) regularization, also referred to as Lasso regularization, modifies the loss function by adding the sum of the absolute values of the model’s coefficients. This method performs feature selection while promoting sparsity by driving some coefficients to exactly zero. As a result, the model can ignore features that are less important or irrelevant. \({\text{L}}_{1}\) regularization is useful for high-dimensional datasets where feature selection is crucial. The \({\text{L}}_{1}\) regularization term is stated mathematically in Eq. (8).

$${\text{L}}_{1}{\text{ regularization}}=\alpha \sum_{l=1}^{L}{\Vert {w}^{(l)}\Vert }_{1}$$
(8)

where \(\alpha\) is the regularization parameter, \({w}^{l}\) denotes the weight matrix of the \({l}^{th}\) layer, and \({\Vert {w}^{l}\Vert }_{1}=\sum_{ij}\left|{w}_{ij}^{(l)}\right|\), represents the \({l}_{1}\)-norm, calculated as the sum of the absolute values of all coefficients in the weight matrix.

\({\mathbf{L}}_{2}\) regularization

\({\text{L}}_{2}\) Regularization, also known as Ridge regularization, adds the squared values of the model’s coefficients to the loss function. Unlike \({\text{L}}_{1}\) regularization, \({\text{L}}_{2}\) regularization favours small coefficients rather than forcing them to be exactly zero. This reduces overfitting by spreading the effect of a single feature across numerous features. \({\text{L}}_{2}\) regularization is very beneficial when input characteristics are correlated. The \({\text{L}}_{2}\) regularization term is stated mathematically as given in Eq. (9).

$${\text{L}}_{2}{\text{ regularization}}=(1-\alpha )\sum_{l=1}^{L}{\Vert {w}^{(l)}\Vert }_{2}^{2}$$
(9)

where \(\alpha\) is the regularization parameter, \({w}^{l}\) denotes the weight matrix of the \({l}^{th}\) layer, and \({\Vert {w}^{(l)}\Vert }_{2}^{2}=\sum_{ij}{\left({w}_{ij}^{(l)}\right)}^{2}\) represents the squared \({l}_{2}\)-norm, calculated as the sum of the squares of all coefficients in the weight matrix.

As with \({\text{L}}_{1}\) regularization, the penalty is summed over the coefficients of every layer. The same parameter \(\alpha\) appears in both terms: it weights the \({\text{L}}_{1}\) penalty in Eq. (8), while its complement \((1-\alpha )\) weights the \({\text{L}}_{2}\) penalty in Eq. (9).
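The two penalty terms of Eqs. (8) and (9), and the way a single \(\alpha\) trades one against the other, can be sketched as below. The nested-list weight layout, the function names, and the example weights are assumptions made for this illustration.

```python
def l1_penalty(weights):
    """Sum of absolute values of all weights over all layers."""
    return sum(abs(w) for layer in weights for row in layer for w in row)

def l2_penalty(weights):
    """Sum of squared weights over all layers."""
    return sum(w * w for layer in weights for row in layer for w in row)

def combined_penalty(weights, alpha):
    """alpha * L1 + (1 - alpha) * L2, with alpha restricted to [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l1_penalty(weights) + (1.0 - alpha) * l2_penalty(weights)

# One hypothetical 2x2 weight matrix with made-up values.
weights = [[[0.5, -0.5], [0.0, 2.0]]]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, combined_penalty(weights, alpha))
```

For these weights the \({\text{L}}_{1}\) term equals 3.0 and the \({\text{L}}_{2}\) term equals 4.5, so the combined penalty moves from 4.5 at \(\alpha =0\) to 3.0 at \(\alpha =1\). The large weight (2.0) dominates the squared term, whereas the absolute-value term treats all non-zero weights in proportion to their magnitude.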

Binary cross-entropy (BCE)

The objective of binary classification tasks, such as predicting diabetes, is to learn the probability that a given sample belongs to one of two classes. The model is trained with the BCE loss function, shown in Eq. (10), which measures the discrepancy between the class labels and the predicted probabilities. BCE is well-suited for binary classification since it complements the sigmoid activation function, whose outputs range from 0 to 1. BCE is also differentiable and can therefore be used with gradient-based optimizers, such as Adam, which supports effective model training. Because the model outputs a probability, its confidence can inform decision-making in medical diagnosis and many other applications.

$${E}_{B.C}=-\left[y\log (\widehat{y})+(1-y)\log \left(1-\widehat{y}\right)\right]$$
(10)

where \(y\in \{\text{0,1}\}\) is the actual binary label and \(\widehat{y}\in [\text{0,1}]\) is the predicted probability.
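Eq. (10) gives the loss for a single sample. The sketch below averages it over a small batch and clips the probabilities away from 0 and 1 so that the logarithm stays finite; both the averaging and the clipping constant `eps` are assumptions of this example, not details given in the source.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of Eq. (10) over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and predictions.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, uncertain, confident_wrong)
```

The loss grows as the predicted probability moves away from the true label, so a confidently wrong prediction is penalized much more heavily than an uncertain one.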

Custom hybrid loss

The objective of the HSSAE algorithm is not only to preserve the reconstruction capabilities but also to optimize the model for task-specific predictions, making it particularly effective for sparse data scenarios. However, the MSE reconstruction-based optimization in SSAE fails to extract essential features for downstream predictive tasks. Moreover, the \(KL(\rho \parallel {{\rho }{\prime}}_{h})\) term does not adapt well to datasets with uneven sparsity patterns, where certain features or data dimensions may dominate others. To address these limitations, this study develops a custom hybrid loss function that combines the BCE loss with a tunable balance between the sparsity-inducing \({\text{L}}_{1}\) norm and the stability-enhancing \({\text{L}}_{2}\) norm. This formulation, \(\alpha {\text{L}}_{1}+(1-\alpha ){\text{L}}_{2}\) with \(0\le \alpha \le 1\), tailors the model to the specific challenges posed by sparse, high-dimensional datasets. By assigning a weight of \(\alpha\) to the \({\text{L}}_{1}\) norm, the hybrid loss function encourages sparsity, driving less relevant coefficients to zero and enabling effective feature selection. Simultaneously, the complementary weight \((1-\alpha )\) allocated to the \({\text{L}}_{2}\) norm promotes stability by shrinking large coefficients, distributing influence more evenly across features, and enhancing the model’s generalization ability. This interaction provides flexibility: a higher \(\alpha\) sharpens the focus on essential features by prioritizing sparsity, while a lower \(\alpha\) stabilizes the learning process and reduces sensitivity to noise by emphasizing smooth optimization. The HSSAE algorithm optimizes the hybrid loss function shown in Eq. (11) to accomplish these two goals.

$${E}_{HSSAE }= {E}_{B.C}+\alpha \left({L}_{1}\right)+(1-\alpha ){L}_{2}$$
(11)
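To make Eq. (11) concrete, the sketch below evaluates the hybrid loss for a single sigmoid unit and performs one plain gradient-descent step on it. It is a simplified stand-in for the full network: the single-layer model, the learning rate, the value \(\alpha =0.5\), and all numbers are assumptions. The penalty gradients used are the standard ones, \(\alpha \cdot sign(w)\) for the \({L}_{1}\) term and \(2(1-\alpha )w\) for the \({L}_{2}\) term, and the derivative of BCE with respect to the pre-activation of a sigmoid output is \(\widehat{y}-y\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(x):
    """Sign function: -1, 0, or 1."""
    return (x > 0) - (x < 0)

def hybrid_loss(w, b, x, y, alpha):
    """BCE + alpha * ||w||_1 + (1 - alpha) * ||w||_2^2 for one sample."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    l1 = sum(abs(wi) for wi in w)
    l2 = sum(wi * wi for wi in w)
    return bce + alpha * l1 + (1 - alpha) * l2

def gradient_step(w, b, x, y, alpha, lr):
    """One gradient-descent update of the weights and bias on the hybrid loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = p - y                         # d(BCE)/dz for a sigmoid output
    new_w = [wi - lr * (err * xi + alpha * sign(wi) + 2 * (1 - alpha) * wi)
             for wi, xi in zip(w, x)]
    new_b = b - lr * err                # the penalties do not involve the bias
    return new_w, new_b

# Made-up sample, parameters, and hyperparameters.
w, b = [0.5, -0.3], 0.0
x, y = [1.0, 2.0], 1
alpha, lr = 0.5, 0.1

loss_before = hybrid_loss(w, b, x, y, alpha)
new_w, new_b = gradient_step(w, b, x, y, alpha, lr)
loss_after = hybrid_loss(new_w, new_b, x, y, alpha)
print(loss_before, loss_after)
```

On this toy sample the loss decreases after the update. Setting `alpha` closer to 1 makes the constant-magnitude \({L}_{1}\) pull dominate, which pushes small weights towards zero, while values closer to 0 emphasize the proportional \({L}_{2}\) shrinkage.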

Substituting the expressions for \({E}_{B.C}\), \({L}_{1}\), and \({L}_{2}\) into Eq. (11) gives Eq. (12).

$${E}_{HSSAE}=-y\log (\widehat{y})-\left(1-y\right)\log \left(1-\widehat{y}\right)+\alpha \sum_{l=1}^{L}{\Vert {w}^{\left(l\right)}\Vert }_{1}+(1-\alpha )\sum_{l=1}^{L}{\Vert {w}^{\left(l\right)}\Vert }_{2}^{2}$$
(12)

Here, \({E}_{HSSAE}\) denotes the hybrid loss function, \(\alpha \in [\text{0,1}]\) controls the balance between sparsity \({(l}_{1})\) and weight shrinkage \({(l}_{2})\), \({w}^{\left(l\right)}\) are the weight matrices of the \({l}^{th}\) layer, and \(\text{log}\) denotes the natural logarithm. The ideal weights and biases are obtained through greedy layer-wise pretraining, followed by fine-tuning with backpropagation, which minimizes the hybrid loss to align the network’s outputs with the target predictions. The gradient of the hybrid loss with respect to the weight matrix \({w}^{\left(l\right)}\) is expressed in Eq. (13)

$$\frac{\partial ({E}_{HSSAE})}{\partial {w}^{(l)}}=\frac{\partial }{\partial {w}^{(l)}}\left[-y\,\text{log}\,\widehat{y}-\left(1-y\right)\text{log}\left(1-\widehat{y}\right)\right]+\alpha \cdot sign\left({w}^{(l)}\right)+2(1-\alpha ){w}^{(l)}$$
(13)

where \(sign(\cdot )\) denotes the sign function, defined as \(sign\left(x\right)=-1\) if \(x<0\), \(sign\left(x\right)=0\) if \(x=0\), and \(sign\left(x\right)=1\) if \(x>0\). Consequently, Eqs. (14) and (15) represent the weight and bias update processes.

$${w}^{(l)}\leftarrow {w}^{(l)}-\mu \frac{\partial ({E}_{HSSAE})}{\partial {w}^{(l)}}$$
(14)
$${\text{b}}^{(l)}\leftarrow {\text{b}}^{(l)}-\mu \frac{\partial ({E}_{HSSAE})}{\partial {\text{b}}^{(l)}}$$
(15)

where \({w}^{(l)}\) and \({\text{b}}^{(l)}\) are the weights and biases of layer \(l\), and \(\mu\) represents the learning rate. Traditional gradient descent methods, such as Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, apply a uniform learning rate across all network parameters. This can be limiting for sparse datasets like those handled by the proposed HSSAE algorithm, which often contain rare or infrequent features that require different update dynamics. A uniform learning rate increases the likelihood of suboptimal convergence, including the risk of settling into a local minimum, because these methods cannot adapt the learning rate to the needs of individual parameters51. To address these limitations, the Adam (Adaptive Moment Estimation) optimization algorithm52 is employed to train the HSSAE algorithm. Adam adjusts the learning rate for each parameter by computing first-order (mean) and second-order (uncentered variance) moment estimates of the gradients. This enables the model to converge more quickly and effectively, even on challenging, sparse datasets. Adam performs parameter updates as follows. The gradient of the loss function \({E}_{HSSAE}\) with respect to the parameters at time step \(t\), denoted \({g}_{t}\), is calculated as given in Eq. (16).

$${g}_{t}\leftarrow {\nabla }_{\vartheta }{E}_{HSSAE}({\vartheta }_{t-1})$$
(16)

Further, the first-order and second-order moment estimates \({m}_{t}\) and \({v}_{t}\), are computed iteratively as given in Eqs. (17) and (18).

$${m}_{t}={\upbeta }_{1}{m}_{t-1}+\left(1-{\upbeta }_{1}\right)\cdot {g}_{t}$$
(17)
$${v}_{t}={\upbeta }_{2}{v}_{t-1}+\left(1-{\upbeta }_{2}\right)\cdot {{g}_{t}}^{2}$$
(18)

where \({\upbeta }_{1}, {\upbeta }_{2}\in [0,1)\) are the exponential decay rates for the first and second moments, respectively. To correct for initialization bias in the moment estimates, Adam computes bias-corrected values as given in Eqs. (19) and (20).

$${{m}_{t}}{\prime}=\frac{{m}_{t}}{1-{{\upbeta }_{1}}^{t}}$$
(19)
$${{v}_{t}}{\prime}=\frac{{v}_{t}}{1-{{\upbeta }_{2}}^{t}}$$
(20)

Using the corrected moments, the parameters \({\vartheta }_{t}\) are updated as given in Eq. (21).

$${\vartheta }_{t}={\vartheta }_{t-1}-\frac{\tau }{\sqrt{{{v}_{t}}{\prime}}+\epsilon }\cdot {{m}_{t}}{\prime}$$
(21)

The update step size (learning rate) is denoted by \(\tau\), and \(\epsilon\) is a small constant that prevents division by zero. The Adam optimizer combines the advantages of RMSProp (scaling learning rates with second-order moments) and momentum-based optimization (smoothing updates with first-order moments). This adaptive mechanism offers the following advantages:

  a. Faster convergence compared to traditional methods.

  b. Improved handling of sparse and imbalanced datasets.

  c. Robustness to noisy gradients.

By leveraging Adam, the HSSAE model achieves effective parameter tuning and optimized performance, particularly in datasets with diverse feature distributions and sparse data challenges.
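The update rules in Eqs. (13) and (16) to (21) can be illustrated with a minimal, framework-free sketch. It is not the authors’ implementation: the function names (`hybrid_penalty_grad`, `adam_step`), the single scalar weight, and the value of \(\alpha\) are assumptions made for illustration, and the BCE part of the gradient is omitted so that only the regularization terms of Eq. (13) drive the update. The constant \(\epsilon\) is added outside the square root, as in the standard Adam formulation.

```python
import math


def sign(x):
    """Sign function used in Eq. (13): returns -1, 0 or 1."""
    return (x > 0) - (x < 0)


def hybrid_penalty_grad(w, alpha):
    """Gradient of alpha*|w| + (1 - alpha)*w**2 with respect to w.

    Only the regularization part of Eq. (13); in a real network the BCE
    term would be supplied by backpropagation.
    """
    return alpha * sign(w) + 2.0 * (1.0 - alpha) * w


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (Eqs. 17 to 21)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment, Eq. (17)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment, Eq. (18)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction, Eq. (19)
    v_hat = v / (1.0 - beta2 ** t)              # bias correction, Eq. (20)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # Eq. (21)
    return theta, m, v


# Toy run: shrink one hypothetical weight under the hybrid penalty alone.
w, m, v = 0.5, 0.0, 0.0
alpha = 0.5  # illustrative value, not a tuned setting from the paper
history = []
for t in range(1, 201):
    g = hybrid_penalty_grad(w, alpha)           # gradient at step t, Eq. (16)
    w, m, v = adam_step(w, g, m, v, t)
    history.append(w)
```

The default arguments match the optimizer settings reported in the experiments (learning rate 0.001, \({\upbeta }_{1}=0.9\), \({\upbeta }_{2}=0.999\), \(\epsilon ={10}^{-8}\)). On the first step the bias-corrected ratio is close to ±1, so the weight moves by approximately the learning rate regardless of the gradient's magnitude.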

Visualizing hybrid loss function

The 3D visualization, as shown in Fig. 10, shows how the proposed hybrid loss function operates. On the graph, the \(x\)-axis represents the weight values, the \(y\)-axis represents the parameter \(\alpha\) that controls the balance between \({\text{L}}_{1}\) and \({\text{L}}_{2}\) regularization, and the \(z\)-axis shows the overall loss value. The colour gradient, which transitions from purple (low loss) to yellow (high loss), illustrates how loss changes under different settings. The U-shaped valley in the plot highlights the region where the loss is minimized, indicating that the model achieves its optimal balance between the two forms of regularization. When \(\alpha\) is closer to 1, the \({\text{L}}_{1}\) penalty has more influence, which pushes the model to select only the most important features (sparsity). When \(\alpha\) is closer to 0, the \({\text{L}}_{2}\) penalty becomes stronger, which smooths and stabilizes the weight values. This figure therefore demonstrates that by properly adjusting \(\alpha\), the model can achieve an effective trade-off between sparsity and stability, leading to better feature extraction from complex, high-dimensional data and stronger generalization performance.

Fig. 10

Spatial representation of hybrid loss function.
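As a rough numerical companion to Eq. (12) and Fig. 10, the sketch below evaluates the hybrid loss for a single sample at several values of \(\alpha\). The helper names, the example weights, and the representation of each layer as a flat list of weights are assumptions for illustration; the clipping constant inside `bce` is an added safeguard against log(0) and is not stated in the paper.

```python
import math


def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample; y_hat is clipped to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)


def hybrid_loss(y, y_hat, layers, alpha):
    """Eq. (12): BCE + alpha * (sum of L1 norms) + (1 - alpha) * (sum of squared L2 norms)."""
    l1 = sum(abs(w) for layer in layers for w in layer)
    l2 = sum(w * w for layer in layers for w in layer)
    return bce(y, y_hat) + alpha * l1 + (1.0 - alpha) * l2


# Hypothetical weights for a two-layer network, flattened per layer.
layers = [[0.5, -0.25], [1.0]]

# Sweep alpha to see how the penalty shifts between L2 and L1 behaviour.
sweep = {a: hybrid_loss(1, 0.9, layers, a) for a in (0.0, 0.5, 1.0)}
```

For a weight with magnitude below 1, \(|w|\) exceeds \({w}^{2}\), so its penalty grows as \(\alpha\) approaches 1; for larger magnitudes the opposite holds.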

Experiments and analysis

The experiments were conducted on Google Colab, whose free GPUs reduced the training and testing time of the DL processes. The Hybrid Stacked Sparse Autoencoder (HSSAE) model was designed and trained in Python using the TensorFlow and Keras libraries. Data preprocessing and hyperparameter optimization were carried out using Pandas, NumPy, and scikit-learn.

Evaluation metrics

The performance of the HSSAE algorithm was evaluated using various metrics, including accuracy, precision, recall, Hamming loss, F1 score, and AUC.
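Apart from AUC, these metrics can all be derived from the four cells of a binary confusion matrix. The sketch below uses a hypothetical function name and hypothetical counts; for single-label binary classification, the Hamming loss reduces to the fraction of misclassified samples, that is, one minus accuracy.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    hamming_loss = (fp + fn) / total  # share of wrongly predicted labels
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hamming_loss": hamming_loss,
    }


# Hypothetical counts, for illustration only (not taken from the paper).
scores = binary_metrics(tp=90, fp=15, fn=10, tn=85)
```

This relationship is consistent with the results reported later, where an accuracy of 89% corresponds to a Hamming loss of 0.11 and an accuracy of 93% to a Hamming loss of 0.07.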

The proposed algorithm HSSAE parameters

The experiments were conducted on two datasets with different sparsity levels to verify the effectiveness and generality of the HSSAE algorithm. The HSSAE algorithm was tuned using a Bayesian optimization technique to achieve improved performance on each dataset, as shown in Table 5. Both datasets employed a two-layer encoder, with 512 and 256 neurons for the Health Indicators dataset and 256 and 128 neurons for the EHRs Diabetes Prediction dataset. Latent space dimensions were set to 18 for the Health Indicators dataset and 5 for the EHRs Diabetes Prediction dataset to reflect the complexity of each dataset. ReLU activations were used in the encoder layers and a Sigmoid activation in the output layer to accommodate the binary classification task. Batch normalization was applied after each encoder layer to stabilize training and improve performance. For regularization, both \({\text{L}}_{1}\) and \({\text{L}}_{2}\) norms were employed, with the regularization strengths determined by the hyperparameter α, which controls the balance between sparsity (\({\text{L}}_{1}\)) and robustness (\({\text{L}}_{2}\)). For the EHRs Diabetes Prediction dataset, α was set to 0.02, giving a weight of 0.02 on the \({\text{L}}_{1}\) term and (1−α) = 0.98 on the \({\text{L}}_{2}\) term. A dropout rate of 0.1 was applied to all layers to prevent overfitting. The number of epochs was determined based on the convergence pattern of each dataset: 1200 epochs for the larger and more complex Health Indicators dataset, and 700 epochs for the EHRs Diabetes Prediction dataset.

Table 5 Hyperparameter summary of HSSAE.
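The settings described above can be gathered into a small configuration object. This is a sketch for orientation only: the class, field, and constant names are hypothetical, the values come from the text, and settings the text does not state (for example, α for the Health Indicators dataset and the batch sizes) are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HSSAEConfig:
    """Hypothetical container for the hyperparameters described in the text."""
    encoder_units: Tuple[int, int]
    latent_dim: int
    epochs: int
    alpha: Optional[float] = None      # None where the text gives no value
    dropout: float = 0.1
    hidden_activation: str = "relu"
    output_activation: str = "sigmoid"
    batch_norm: bool = True

    def penalty_weights(self):
        """Return (L1 weight, L2 weight) = (alpha, 1 - alpha), or None if alpha is unset."""
        if self.alpha is None:
            return None
        return self.alpha, 1.0 - self.alpha


HEALTH_INDICATORS = HSSAEConfig(encoder_units=(512, 256), latent_dim=18, epochs=1200)
EHR_DIABETES = HSSAEConfig(encoder_units=(256, 128), latent_dim=5, epochs=700, alpha=0.02)
```

Deriving both penalty weights from a single α keeps them tied to the constraint in Eq. (11), where the two weights always sum to 1.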

The HSSAE algorithm was trained using the hybrid loss function defined in Eq. 12, which combines BCE with \({L}_{1}\) and \({L}_{2}\) regularization to balance sparsity and stability. Model parameters, including weights and biases, were updated using the Adam optimizer with a learning rate of 0.001, \({\upbeta }_{1}=0.9\), \({\upbeta }_{2}=0.999\), and \(\epsilon ={10}^{-8}\). During training, data were processed in mini-batches according to the batch size of each dataset. For each batch, a forward pass computed the activations of the encoder, latent, and output layers, and the HSSAE loss was evaluated. Gradients were then calculated using backpropagation, and the parameters were updated iteratively with the Adam optimizer until convergence was achieved. After training, the model generated predictions on the test set, and performance was evaluated using Accuracy, Precision, Recall, F1-score, and Hamming Loss.

Empirical study on health indicator dataset

The performance of the HSSAE algorithm was evaluated using classification metrics derived from the confusion matrix, as presented in Table 6. The model achieved an overall accuracy of 89%, demonstrating its effectiveness in distinguishing between diabetic and non-diabetic cases. For the negative class (non-diabetic), the model recorded a precision of 91%, a recall of 86%, and an F1-score of 88%. These metrics indicate that the model correctly identifies most non-diabetic instances, while a moderate share of them is misclassified as diabetic. For the positive class (diabetic), the HSSAE model achieved a precision of 86%, a recall of 92%, and an F1-score of 89%, reflecting a good balance between precision and recall in detecting diabetic cases. The macro and weighted averages for precision, recall, and F1-score were all 89%, highlighting the model’s consistent performance across both classes.

Table 6 Confusion matrix outcomes for the proposed HSSAE model.

Performance comparison with baseline models on the health indicator dataset

The HSSAE algorithm was comparatively evaluated against both machine learning and deep learning models on the Health Indicator dataset, with each model’s performance assessed across key metrics, including precision, recall, F1-score, accuracy, AUC, and Hamming loss. Table 7 compares the performance of the HSSAE model with traditional machine learning models, including Decision Trees (DT), Random Forest (RF), K-Nearest Neighbours (KNN), and Naive Bayes (NB). The HSSAE model outperforms all other classifiers, achieving an accuracy of 89%, a precision of 86%, and an AUC of 0.95. In comparison, models such as DT, RF, KNN, and NB show lower performance, with reduced F1-scores and accuracy, as well as higher Hamming losses. This highlights the HSSAE algorithm’s superior ability to predict health outcomes from sparse data.

Table 7 Comparison with machine learning models on health data using metrics.

Table 8 provides a comparison with deep learning models, namely CNN, LSTM, and SSAE. The HSSAE algorithm consistently outperformed these models in terms of precision, recall, F1-score, accuracy, and AUC. The HSSAE model achieved the highest F1-score (89%), precision (86%), and AUC (0.95), making it the most effective model for health data prediction when compared with the lower performance of CNN, LSTM, and SSAE.

Table 8 Deep learning models comparison on health indicators dataset.

Comparison with SSAE + Machine learning model on health indicator dataset

Table 9 shows the hybrid approach of SSAE combined with traditional machine learning models. Again, the HSSAE algorithm outperforms the others, achieving a high accuracy of 89%, a precision of 86%, and an AUC of 0.95, considerably higher than the combinations of SSAE and machine learning classifiers.

Table 9 SSAE with machine learning models on health indicators dataset.

Figure 11 presents the ROC-AUC and Precision-Recall curves for the proposed HSSAE algorithm. The model achieved an AUC score of 0.95 and a Precision-Recall curve score of 0.91, demonstrating its strong ability to distinguish between positive and negative cases. These curves illustrate the model’s outstanding performance in diabetes detection, highlighting high sensitivity and specificity.

Fig. 11

ROC-AUC & precision-recall curve for health indicator dataset.

Empirical study on EHRs diabetes prediction dataset

The HSSAE algorithm exhibited strong performance on the EHRs Diabetes Prediction Dataset, as detailed in Table 10. The model achieved an accuracy of 93%, indicating its effectiveness in distinguishing between diabetic and non-diabetic cases. For the negative class (non-diabetes), the model attained a precision of 95%, a recall of 92%, and an F1-score of 93%. These metrics demonstrate the model’s proficiency in correctly identifying non-diabetic instances while keeping their misclassification rate low. For the positive class (diabetes), the model achieved a precision of 92%, a recall of 95%, and an F1-score of 94%, reflecting its capability to detect diabetic cases with a good balance between precision and recall. The macro and weighted averages for precision, recall, and F1-score were all 93%, further emphasizing the model’s consistent performance across both classes.

Table 10 Confusion matrix outcomes for the proposed HSSAE model.

Comparison with baseline models on EHRs diabetes prediction dataset

Table 11 compares the performance of the HSSAE algorithm with traditional machine learning classifiers, including DT, RF, KNN, and NB, on the EHRs Diabetes Prediction Dataset. The HSSAE model consistently outperformed all other classifiers, achieving the highest precision (92%), recall (95%), and F1-score (94%). Additionally, the HSSAE model achieved an impressive AUC of 0.99, indicating its strong ability to differentiate between positive and negative classes. In contrast, the traditional machine learning models, such as DT, RF, and KNN, showed lower performance with reduced F1-scores and accuracy, along with higher Hamming losses. This highlights the HSSAE model’s superior accuracy and effectiveness in diabetes prediction.

Table 11 Comparison with machine learning models on the EHRs diabetes prediction dataset.

Table 12 further compares the HSSAE algorithm with top-performing deep learning models, including CNN, LSTM, and SSAE. The HSSAE model outshines these models across all performance metrics, achieving the highest F1-score (94%), precision (92%), and AUC (0.99). The HSSAE model also demonstrated superior recall and accuracy, establishing it as a more robust and efficient model for predicting diabetes, particularly in healthcare datasets with sparse data. In comparison, models such as CNN and LSTM reported lower precision, recall, and F1-scores, demonstrating the HSSAE model’s effectiveness in handling complex healthcare data.

Table 12 Comparison with deep learning models on the EHRs diabetes prediction dataset.

Comparison with SSAE + ML models on EHRs diabetes prediction dataset

The HSSAE algorithm is compared with hybrid SSAE + machine learning models, including SSAE + DT, SSAE + RF, SSAE + KNN, and SSAE + NB, as presented in Table 13. The HSSAE algorithm outperformed these models on all evaluation metrics, including precision, recall, F1-score, and accuracy. Furthermore, the Hamming loss for HSSAE is notably lower (0.07), indicating fewer misclassifications and enhancing the model’s reliability in real-world applications. The hybrid SSAE + machine learning models performed well but were outperformed by the HSSAE model.

Table 13 SSAE with machine learning models on EHRs diabetes prediction dataset.

In Fig. 12, the ROC-AUC and Precision-Recall curves for the HSSAE algorithm are shown. The model achieved an AUC score of 0.99 and a Precision-Recall curve score of 0.95, indicating an exceptional ability to differentiate between positive and negative cases. These curves highlight the model’s outstanding performance in diabetes detection, with excellent sensitivity and specificity.

Fig. 12

ROC-AUC & Precision-Recall Curve for EHRs Diabetes prediction dataset.

Impact of alpha (\(\boldsymbol{\alpha }\)) in the loss function

The impact of the \(\alpha\) parameter in the loss function was evaluated by testing three values for each dataset: \(\alpha =0\) (pure \({\text{L}}_{2}\) regularization), the optimized \(\alpha\) (best performing), and \(\alpha =1\) (pure \({\text{L}}_{1}\) regularization). The results are presented in Tables 14 and 15.

Table 14 Impact of \(\alpha\) on health indicator dataset.
Table 15 Impact of \(\alpha\) on EHRs diabetes prediction dataset.

The results, as summarized in Tables 14 and 15, highlight the critical role of \(\alpha\) in determining the model’s classification performance across different datasets. Relying solely on \({\text{L}}_{1}\) regularization (\(\alpha =1\)) introduces excessive sparsity, leading to a decline in overall model generalization. Conversely, pure \({\text{L}}_{2}\) regularization (\(\alpha =0\)) promotes stability but lacks the feature selection capability needed for optimal classification. The best performance is consistently observed at optimized \(\alpha\) values, confirming that an appropriate balance between \({\text{L}}_{1}\) and \({\text{L}}_{2}\) regularization enhances feature selection, model robustness, and overall classification accuracy.

Uncertainty quantification and error analysis

The reliability and robustness of the HSSAE algorithm were evaluated through uncertainty quantification and detailed misclassification analysis using confusion matrices for both datasets. These analyses provide a comprehensive understanding of model performance, variability, and potential weaknesses.

Confidence intervals

The 95% confidence interval for classification accuracy was calculated using the standard formula for proportions as presented in Eq. (22).

$$CI= \widehat{p}\pm 1.96\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$$
(22)

where, \(\widehat{p}\) is the observed accuracy, \(n\) is the number of test samples, and 1.96 corresponds to the 95% confidence level.


Health Indicator Dataset:

  • Total instances: 131,001

  • Accuracy = 0.8873 (95% CI: [0.8856, 0.8890])

EHRs Diabetes Prediction Dataset:

  • Total instances: 54,900

  • Accuracy = 0.9338 (95% CI: [0.9318, 0.9359])

The narrow confidence intervals indicate low variability, demonstrating that the model’s performance is stable and unlikely to be attributable to random chance, which is critical for healthcare applications.
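Eq. (22) can be checked with a few lines of code. The sketch below is a direct transcription of the normal-approximation interval for a proportion; the function name `accuracy_ci` is hypothetical, and the inputs are the Health Indicator values listed above.

```python
import math


def accuracy_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion, Eq. (22)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Health Indicator dataset: observed accuracy and number of test instances.
low, high = accuracy_ci(0.8873, 131001)
```

Rounded to four decimals, this reproduces the reported interval [0.8856, 0.8890]. Because the half-width scales with \(1/\sqrt{n}\), the narrowness of the intervals is largely a consequence of the large test sets.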

Misclassification analysis

To better understand the model’s misclassifications, we present the confusion matrix for each dataset, highlighting the distribution of true positives, true negatives, false positives, and false negatives. This provides insights into the model’s performance and areas where errors are more likely to occur.

Health Indicator Dataset:

  • Misclassified 14.4% of negative cases (class 0) as positive.

  • Misclassified 8.1% of positive cases (class 1) as negative.

EHRs Diabetes Prediction Dataset:

  • Misclassified 8.3% of negative cases (class 0) as positive.

  • Misclassified 4.9% of positive cases (class 1) as negative.

This analysis highlights weaknesses specific to certain classes, showing that errors are somewhat higher in negative cases. Recognizing these patterns can guide targeted feature engineering or class-specific regularization in future model improvements.
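The class-specific rates above are simple ratios of confusion-matrix cells, and each is the complement of the corresponding per-class recall. The sketch below illustrates this with a hypothetical function name and hypothetical counts (1,000 cases per class, chosen only so that the rates match the Health Indicator percentages; they are not the actual confusion matrix).

```python
def class_error_rates(tp, fp, fn, tn):
    """Per-class misclassification rates and the recalls they imply."""
    neg_as_pos = fp / (fp + tn)   # share of class 0 predicted as class 1
    pos_as_neg = fn / (fn + tp)   # share of class 1 predicted as class 0
    return {
        "negatives_as_positive": neg_as_pos,
        "positives_as_negative": pos_as_neg,
        "recall_negative": 1.0 - neg_as_pos,
        "recall_positive": 1.0 - pos_as_neg,
    }


# Hypothetical counts mirroring the 14.4% and 8.1% rates reported above.
rates = class_error_rates(tp=919, fp=144, fn=81, tn=856)
```

With these counts the implied recalls are 0.856 for the negative class and 0.919 for the positive class, in line with the 86% and 92% recalls reported for the Health Indicator dataset.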

Visualization of misclassifications

To further illustrate misclassification patterns, confusion matrix heatmaps were generated for both datasets (Fig. 13). The heatmaps confirm that misclassifications are more frequent in negative cases (class 0), complementing the numerical analysis.

Fig. 13

Confusion matrix heatmaps of true and predicted labels.

The uncertainty quantification and misclassification analysis performed in this study are especially important in healthcare settings, where predictive errors can have significant clinical consequences. By providing confidence intervals and outlining class-specific misclassification patterns, the HSSAE algorithm shows both strong overall performance and transparency in its predictions. These analyses enable healthcare practitioners to make a more informed assessment of model reliability, thereby enhancing trust, interpretability, and safety in clinical decision-making.

Discussion

The HSSAE algorithm demonstrated significant effectiveness in distinguishing between diabetic and non-diabetic cases. On the Health Indicator Dataset, the proposed algorithm achieved an overall accuracy of 89%, with a precision of 91%, a recall of 86%, and an F1-score of 88% for the non-diabetic class. For the diabetic class, it attained a precision of 86%, a recall of 92%, and an F1-score of 89%, as detailed in Table 6. The consistent macro and weighted averages of 89% underscore the algorithm’s reliability in classification tasks.

The advantage of the HSSAE algorithm becomes evident when it is compared with traditional machine learning models in terms of performance and reliability. As presented in Table 7, HSSAE achieved an AUC of 0.95, surpassing Decision Trees (DT) with an AUC of 0.78, Random Forest (RF) with 0.84, K-Nearest Neighbours (KNN) with 0.84, and Naive Bayes (NB) with 0.71. Additionally, HSSAE reported a lower Hamming loss of 0.11, indicating fewer misclassification errors than competing models. This suggests that the proposed HSSAE algorithm enhances predictive accuracy and reduces the likelihood of mistakes. In comparison with other deep learning architectures, such as CNN, Long Short-Term Memory networks (LSTM), and Stacked Sparse Autoencoders (SSAE), HSSAE consistently outperformed its counterparts. Table 8 shows that HSSAE achieved higher precision (86%), recall (92%), and F1-score (89%) compared to CNN’s precision of 73%, recall of 78%, and F1-score of 75%, as well as LSTM’s precision of 72%, recall of 84%, and F1-score of 78%. Further, HSSAE’s AUC of 0.95 exceeded CNN’s 0.73 and LSTM’s 0.84, underscoring its robust capability in diabetes prediction. The analysis also covered hybrid models combining SSAE with traditional machine learning classifiers, which demonstrated commendable performance but failed to surpass the HSSAE algorithm. For instance, SSAE combined with RF achieved an AUC of 0.91, still lower than HSSAE’s 0.95, as presented in Table 9.

Extending the evaluation to the EHRs Diabetes Prediction Dataset, the HSSAE algorithm continued to demonstrate strong performance. The algorithm achieved an accuracy of 93%, with the non-diabetic class recording a precision of 95%, a recall of 92%, and an F1-score of 93%. In the diabetic class, the algorithm achieved a precision of 92%, a recall of 95%, and an F1-score of 94%. Both the macro and weighted averages reached 93%, reflecting the proposed algorithm’s balanced and dependable performance across both classes, as shown in Table 10. A comparative analysis against traditional machine learning models on this second dataset again favours HSSAE. The HSSAE algorithm achieved the highest precision (92%), recall (95%), F1-score (94%), and AUC (0.99), and reported a notably low Hamming loss of 0.07, reflecting its accuracy and reliability in diabetes prediction, as shown in Table 11. When evaluated against other deep learning models on the EHRs dataset, HSSAE maintained its superior performance. Notably, HSSAE’s F1-score of 94% was considerably higher than those of CNN (65%) and LSTM (72%), emphasizing its effectiveness in balancing precision and recall, as indicated in Table 12. Hybrid models combining SSAE with traditional machine learning classifiers were also assessed on the EHRs dataset. While these models performed well, the HSSAE algorithm consistently outperformed them, reinforcing its robustness and applicability in real-world scenarios, as presented in Table 13.

The impact of the regularization parameter \(\alpha\) in the loss function was examined by evaluating three different values: \(\alpha =0\) (pure \({\text{L}}_{2}\) regularization), the optimized \(\alpha\) (best performing), and \(\alpha =1\) (pure \({\text{L}}_{1}\) regularization). The results, summarized in Tables 14 and 15, highlight the critical role of \(\alpha\) in determining the proposed algorithm’s classification performance across the two datasets. Using only \({\text{L}}_{1}\) regularization (\(\alpha =1\)) resulted in excessive sparsity, which negatively affected the model’s generalization performance. Conversely, pure \({\text{L}}_{2}\) regularization (\(\alpha =0\)) promoted stability but lacked the feature selection capability needed for optimal classification. The best performance was consistently observed at optimized \(\alpha\) values, confirming that an appropriate balance between \({\text{L}}_{1}\) and \({\text{L}}_{2}\) regularization enhances feature selection, model robustness, and overall classification accuracy.

The empirical evaluations confirm that the HSSAE algorithm excels in distinguishing diabetic from non-diabetic cases, consistently outperforming traditional ML and DL baselines across multiple metrics. Its capacity to manage sparse and high-dimensional data, reinforced by robust regularization strategies, highlights its potential as a reliable tool for accurate and timely diabetes prediction in clinical practice. Although developed as a static model, the HSSAE algorithm can be adapted for longitudinal applications to achieve functionality comparable to that of temporal architectures, such as LSTM. For temporal datasets, recurrent models such as LSTM or GRU, as well as temporal self-attention mechanisms, can be embedded within the HSSAE algorithm to capture sequential dependencies in longitudinal patient records. Furthermore, for multi-modal learning, modality-specific encoders (structured EHRs, medical imaging, clinical text) can be integrated to extract domain-relevant representations that are subsequently fused in a shared latent space.

In a clinical context, the HSSAE algorithm could be deployed in several real-world scenarios. In primary care, it may support physicians by flagging high-risk patients using routinely collected EHR data, enabling early intervention. In hospital settings, integration into electronic health record systems could allow continuous monitoring of patient trajectories and prediction of adverse outcomes. In telemedicine and remote monitoring platforms, the HSSAE algorithm could analyze longitudinal data to inform personalized care plans for diabetic patients. These scenarios highlight the practical applicability of the HSSAE algorithm in providing enhanced decision support across various healthcare settings.

Several challenges must be addressed before clinical deployment can occur. Ethical concerns regarding bias and fairness remain critical, as predictive performance may vary across demographic subgroups. Interpretability is a further limitation, since deep learning models are often perceived as “black boxes,” highlighting the need for explainable AI techniques to foster clinician trust. Additionally, the handling of sensitive health records necessitates strict adherence to privacy and security standards. Finally, external validation across diverse populations is essential to ensure generalizability, as performance may differ when applied beyond the datasets used in this study.

Conclusion

In the healthcare field, where every prediction can have significant implications for patient outcomes, the accuracy and reliability of predictive models are of utmost importance. The proposed HSSAE algorithm demonstrated superior performance in predicting Type 2 diabetes, outperforming traditional machine learning and deep learning models in key metrics, including precision, recall, F1-score, accuracy, and AUC. The results indicate that HSSAE is highly effective for identifying diabetes in sparse and high-dimensional datasets while maintaining low misclassification rates, which highlights its potential as a practical and dependable tool for clinical decision support. Despite these promising results, the HSSAE algorithm’s performance is sensitive to the choice of the α parameter in the hybrid loss function, which necessitates careful tuning to find the optimal balance between \({L}_{1}\) and \({L}_{2}\) regularization. Additionally, the model’s effectiveness can vary depending on the sparsity and dimensionality of the dataset, underscoring the need for further optimization. The performance may also be influenced by the dataset type, requiring more rigorous evaluation on diverse, large-scale datasets. Furthermore, while the model performs well on the datasets tested, its generalization to other healthcare problems and data types remains to be established.

Future research will prioritize the external validation of HSSAE using large-scale, non-Kaggle datasets such as MIMIC-III or NHANES to evaluate generalizability and clinical relevance across diverse populations. Model explainability will also be addressed through integration of SHAP, LIME, or attention-based interpretability methods to clarify feature importance and enhance transparency for clinical adoption. In addition, a more rigorous evaluation framework will be implemented, including the computation of p-values and k-fold cross-validation, to establish the statistical significance and robustness of performance gains. These directions will strengthen the reliability, interpretability, and applicability of HSSAE in real-world healthcare scenarios.