Clustering-cum-regression based model and performance analysis for early prediction of heart disease

Tolani, Manoj; AlZahrani, Yazeed; Suman, Gaurav; Kumar, Pankaj; Balodi, Arun; Bajpai, Ambar

doi:10.1038/s41598-026-40626-z

Article
Open access
Published: 18 February 2026

Manoj Tolani¹,
Yazeed AlZahrani²,
Gaurav Suman³,
Pankaj Kumar³,
Arun Balodi⁴ &
…
Ambar Bajpai⁵

Scientific Reports volume 16, Article number: 9494 (2026)

Abstract

In real-time health monitoring systems, Wireless Body Area Networks (WBAN) are widely recognized for collecting various disease parameters using sensors. The collected data can be used for the early prediction of diseases. To address the growing need for accurate and efficient heart disease prediction, we introduce a novel hybrid approach that combines K-Means clustering with advanced regression techniques to analyze various factors in heart health monitoring. This integrated method utilizes the strengths of unsupervised and supervised learning to enhance predictive accuracy across both training and testing datasets. Our analysis focuses on 12 critical feature parameters, systematically clustered using K-Means to uncover inherent patterns and relationships. These parameters are then rigorously evaluated through multiple regression models to determine their predictive significance. By employing K-Means to assess parameter relevance within defined ranges, the proposed framework ensures robust feature selection and improved model interpretability. To validate its effectiveness, we benchmark our approach against widely used machine learning models, including Decision Tree Regression, K-Nearest Neighbor, Support Vector Machine (SVM), Kernel SVM, and others. The results demonstrate that our method not only outperforms traditional techniques but also offers a scalable and reliable solution for real-world healthcare applications. The prediction accuracy and false-prediction performance parameters were analyzed to compare the proposed method with existing heart disease prediction models. Earlier approaches reported accuracies up to 85%, with limited improvement in recall, specificity, and F1 score. In contrast, the newly proposed hybrid model, which integrates Random Forest regression with K-Means clustering, achieved a significantly higher accuracy of 91%, along with improved recall (0.8864), specificity (0.9583), F1 score (0.8977), and ROC–AUC (0.9155). These quantitative performance gains, obtained without increasing model complexity, clearly demonstrate the superiority and robustness of the proposed approach over traditional prediction methods.

Introduction

Patient health monitoring is a crucial requirement in today’s world, and researchers in various domains are working to monitor patients’ health status efficiently. Since the COVID-19 pandemic, doctors, health practitioners, and patients have increasingly preferred remote health monitoring, although some patients still prefer an in-person visit to the clinic over an online consultation. Researchers are therefore working on solutions that address these issues and improve patient convenience. Regular patient monitoring can also be a key factor in successful treatment. The average number of visits required for treatment varies with the type of disease¹.

Previous research shows that for certain diseases, such as Type 1 Diabetes, regular visits to the clinic are a key factor in successful treatment. However, when the doctor is unavailable or the patient fails to visit the clinic, the success rate of treatment drops to 30–40%². To address such issues, researchers are working on Wireless Body Area Networks (WBANs). WBANs enable applications across safety, personalized healthcare, chronic disease monitoring, fitness tracking, maternity care, and elderly care, leveraging intelligent computing for real-time data acquisition and analysis³. A WBAN is an efficient means of monitoring a patient in real time. Sensors are placed on the body, and sensor nodes regularly monitor various parameters, transmitting data samples to the user’s mobile device. The mobile device aggregates the data and transmits it to the health monitoring station, where health practitioners analyze the data for the early prediction of disease. Research on WBANs follows two directions. The first is energy-efficient data transmission: various energy-saving methods for bandwidth utilization have been proposed⁴,⁵, along with several medium access control protocols designed to enhance the energy efficiency of data transmission. These works recommend the IEEE 802.15.6 standard protocol suite for both the physical layer and the MAC layer in body area network (BAN) communication⁶.

The researchers are also developing efficient data analysis methods for the early detection and prediction of the disease. After the data acquisition and transmission, the received data is analyzed by the health practitioners for the early prediction of the disease. For the efficient prediction of the disease, the researchers have proposed various artificial intelligence and machine learning methods. In previously reported works, researchers have focused on various regression methods. However, it is revealed from the previous research studies that different protocols perform efficiently for different types of disease. Moreover, for a few diseases, the early prediction of the disease has been proven to be a lifesaver and a game changer in the treatment of the disease⁷.

In⁸, researchers conducted a comprehensive comparison of nine prominent clustering algorithms, assessing their performance on artificial datasets with diverse characteristics. The study also explored the sensitivity of these algorithms to various parameter configurations. Selecting the appropriate clustering or machine learning algorithm for a given dataset can be challenging. The use of artificial datasets in this work allows for extensive experimentation with an unlimited number of samples and the ability to modify dataset properties.

The heart is a vital organ that plays a crucial role in circulating blood throughout the body. The improper functioning of the heart can lead to various diseases in the body. The healthy functioning of the heart is a crucial requirement for human beings to lead long and healthy lives. Therefore, in the present work, we have analyzed the heart-related features and dataset for the early prediction of heart diseases. In previously reported works, researchers have presented various regression methods for predicting heart disease. However, it is evident that the accuracy of the existing machine learning models is low. Additionally, the researchers have not analyzed the dataset using clustering and regression methods. The clustering-based feature analysis facilitates the categorization of the dataset, providing more relevant and accurate predictions using the post-regression method. Additionally, a comparative accuracy analysis of the various prediction methods is also lacking in the previously reported works.

The novelty and the contribution of the proposed work are explained below:

Novel introduction of a hybrid approach utilizing K-means clustering-based data analysis and regression-based algorithms applied to the specific context of Early Prediction of Heart Disease, utilizing the dataset from (Center for Machine Learning and Intelligent Systems)⁹.
Data analysis of various attributes (feature parameters) is performed using a K-Means-based clustering algorithm (hybrid clustering-cum-regression).
The 12 feature parameters are used for the accurate detection of heart diseases.
The comparative analysis of various regression methods, i.e., Decision Tree Regression, K-Nearest Neighbor, Support Vector Machine, Kernel SVM, Logistic Regression, Naïve Bayesian, Random Forest Regression, has been done.
The confusion matrix and accuracy are analyzed for the performance comparison.

Related studies and background

As previously discussed, various research studies have been reported for the analysis of heart disease. In⁷, the authors propose a hybrid machine learning method for early heart disease prediction. The authors have compared 13 feature parameters for predicting heart disease. The paper compares the protocol’s performance with the decision tree and random forest regression methods. The proposed protocol of this paper shows the highest accuracy and lowest classification error in the results.

Table 1 Summary of Heart Disease Prediction Models.

Researchers have also proposed data analysis methods for more efficient disease prediction and more relevant results. In¹⁵, the authors propose a risk factor analysis for the early detection of the disease, using a K-Means-based data clustering approach with eight prediction feature parameters. Chittampalli et al. compared three regression methods for early heart disease prediction: Random Forest, Support Vector Machine, and Logistic Regression¹⁶.

Kavitha et al. proposed a hybrid regression method (decision tree cum random forest regression) for improved accuracy. The performance comparison shows higher accuracy and fewer classification errors¹⁷. Lakshamanarao et al. proposed efficient sampling for the feature selection and regression methods for the optimal performance¹⁸. The authors claimed higher classification accuracy than existing methods. Other machine learning algorithms and data analysis methods are proposed in¹⁹ and²⁰. Razu et al. have proposed four-tier prediction schemes for heart disease²¹. In²², the authors introduced a fusion-based strategy for disease prediction that integrates Federated Learning with ANOVA and Chi-Square feature selection, along with Linear Discriminant Analysis for feature extraction. Their approach achieved an accuracy exceeding 88% on the Cleveland Heart Disease dataset. Similarly, in²³, a predictive framework was proposed that combines a Modified Artificial Bee Colony (M-ABC) algorithm with k-Nearest Neighbors (KNN) to optimize feature selection, thereby improving classification accuracy for heart disease prediction using the UCI Cleveland dataset. Many other researchers have proposed AI/ML-based regression methods for early heart disease prediction, which will be helpful in future wireless networks⁹,¹⁹,²⁰,²¹,²⁴,²⁵,²⁶,²⁷,²⁸,²⁹,³⁰,³¹,³². In³³, the authors proposed a predictive learning model that integrates polynomial feature engineering with SMOTETomek-based data balancing, achieving an accuracy of 98.82% for heart disease prediction on the UCI dataset. The contributions of researchers in recent years are summarized in Table 1.

We have analyzed many research works based on inclusion and exclusion criteria. We have shortlisted five research works closely related to our proposed work. In all the previous research works¹⁰,¹¹,¹²,¹³, researchers have reported various methods for achieving high prediction accuracy. Some researchers have proposed hybrid models combining two different regression methods. For instance, in³⁴, the authors developed a hybrid model based on a Random Forest and a Support Vector Machine. In³⁵, an optimal feature selection method was proposed to enhance the performance of machine learning regression models.

To the best of our knowledge and based on the literature review conducted, it appears that while numerous studies have explored heart disease prediction using various datasets, most approaches either rely solely on clustering or regression techniques, without leveraging the combined strengths of both. The existing research on heart disease prediction predominantly employs either clustering techniques for risk factor analysis or regression-based models for classification and prediction. While several hybrid approaches have been proposed, such as combining different regression algorithms (e.g., Decision Tree with Random Forest) or integrating feature selection with predictive models, these methods do not exploit the complementary strengths of clustering and regression within a unified framework. Moreover, recent heart disease prediction methods also utilize deep learning and ensemble models to enhance accuracy; however, these approaches face challenges such as high computational costs, large data requirements, and limited interpretability. The researchers have not utilized the hybrid features of both clustering and regression methods.

Clustering is instrumental in revealing hidden patterns and grouping similar data points, which is particularly valuable in heterogeneous medical datasets where patient profiles exhibit significant variability. By leveraging K-Means clustering, we can identify coherent clusters that capture underlying health conditions and risk factors. Regression complements this process by quantifying the relationships among these features and enabling precise prediction of outcomes. The integration of clustering and regression creates a synergistic framework, first organizing the data into meaningful structures and then applying predictive modeling on these well-defined clusters, resulting in enhanced accuracy, robustness, and interpretability. The hybrid clustering-cum-regression approach provides better filtering capability and can improve the accuracy of heart disease prediction. Therefore, in the present work, we propose a clustering-cum-regression protocol.

Methodology

As previously discussed, researchers have analyzed heart-related features for the early prediction of disease. However, the previously reported works have not studied the relevance of the results using clustering-cum-regression methods. Therefore, in the proposed work, we first analyzed the dataset based on various medical features. This data analysis is done before the data is processed for clustering and regression. The dataset is collected from the UCI Machine Learning Repository (Center for Machine Learning and Intelligent Systems)⁹. Although the database contains 76 attributes, published experiments typically utilize only 14 of them. Notably, the Cleveland database has been the primary resource for machine learning researchers to date. The important parameters used for feature extraction are chest pain type, blood pressure, cholesterol, fasting blood glucose, resting EKG (restecg), maximum heart rate, exercise-induced angina (exang), ST segment depression, ST slope, vessels fluoroscopy, thallium, and number of diagnoses of heart disease (angiographic disease status). Of these 14 important feature parameters, the analysis shows that two, i.e., Vessels Fluoro and Thallium, are patient-specific and depend on other parameters as well. These parameters support doctors in reaching an exact diagnosis of the heart-related problem by providing more insight into the patient under diagnosis. Hence, this work considers 12 feature parameters for further processing. Before processing, the dataset is encoded into numerical form. The parameters that are encoded as numerical values are listed in Table 2.

Table 2 Numerical decoding of heart disease dataset parameters.

In order to address data heterogeneity, pre-processing involves normalization and imputation of the missing values. In this work, min-max scaling is applied to map values into a [0, 1] range, thereby mitigating discrepancies in units and scales. Z-score standardization is used where variance sensitivity is critical, centering the data at a mean of zero with a variance of one. Missing values are handled through imputation, i.e., the mean or median for continuous attributes based on distribution characteristics; the median for skewed data, the mean for symmetric data, and mode-based imputation for categorical variables. These steps ensure consistent, complete, and standardized input for subsequent K-Means clustering and regression analysis. The dataset exhibited moderate class imbalance, with a higher proportion of samples representing the absence of disease compared to those indicating its presence. To address this, stratified sampling is applied during cross-validation to preserve class proportions across folds, and class-weight adjustments are used in algorithms such as SVM and Random Forest. After data pre-processing, the K-means clustering is used for the data analysis of the various parameters. Let the input dataset be defined as:

$$\begin{aligned} D = \{(x_i, y_i)\}_{i=1}^{n} \end{aligned}$$

(1)

Here, $x_i \in \mathbb {R}^d$ represents the feature vector of the $i^{\text {th}}$ patient, while $y_i \in \{0, 1\}$ indicates the diagnosis outcome (0 for healthy, 1 for heart disease). This formulation provides a supervised learning framework for binary classification. Clustering-based data pre-processing can be represented by the equation below:

$$\begin{aligned} \min _{\mu _1, \dots , \mu _K} \sum _{k=1}^{K} \sum _{x_i \in C_k} \Vert x_i - \mu _k\Vert ^2 \end{aligned}$$

(2)

where $K$ denotes the number of clusters, $C_k$ represents the set of data points assigned to cluster $k$, and $\mu _k$ is the centroid of cluster $k$.
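Before the objective in Eq. (2) is evaluated, the features pass through the scaling and imputation steps described earlier in this section. The following is a minimal sketch of those steps using only the Python standard library. The function names and the cholesterol readings are hypothetical and are not taken from the authors' code.

```python
from statistics import mean, median, pstdev

def impute(values, strategy="median"):
    """Replace None entries with the median (skewed data) or mean (symmetric data)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Map values linearly onto the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre at zero mean and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical cholesterol readings with one missing entry.
chol = [210, 250, None, 300, 190]
chol_filled = impute(chol, "median")      # missing value becomes 230
chol_scaled = min_max_scale(chol_filled)  # values now lie in [0, 1]
chol_std = z_score(chol_filled)           # zero mean, unit variance
```

Mode-based imputation for categorical variables follows the same pattern with `statistics.mode` in place of the median.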

Equation (2) is the objective function for K-Means clustering, which minimizes the intra-cluster variance by finding optimal centroids $\mu _k$. It groups similar data points together to improve learning efficiency during classification. Before clustering each feature, the data is first analyzed to determine the optimal number of clusters. Fig. 1 shows the elbow plot for the optimal number of clusters based on inertia values. Inertia in K-Means clustering is the within-cluster sum of squared distances (WCSS) between each data point and its assigned cluster centroid. It measures cluster compactness, with lower values indicating tighter clusters. Clustering techniques help identify homogeneous groups within the data, group similar data points together, and detect outliers. This step helps in understanding the underlying patterns and relationships within each cluster, leading to improved accuracy of the regression algorithms. It also identifies the important feature parameters to focus on in further analysis. In this work, the K-Means clustering technique is employed due to its simplicity, ease of implementation, and scalability, making it well-suited for large datasets.
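The relation between inertia and the number of clusters can be illustrated with a small sketch, assuming a single feature and plain Lloyd iterations. The blood pressure values are hypothetical, and the deterministic quantile-based initialisation is a simplification chosen for reproducibility; it is not the k-means++ initialisation used in this work.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on a single feature; returns (centroids, inertia)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: evenly spaced order statistics (NOT k-means++).
    centroids = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Empty clusters keep their previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: within-cluster sum of squared distances, as in Eq. (2).
    inertia = sum(min((x - c) ** 2 for c in centroids) for x in data)
    return centroids, inertia

# Hypothetical resting blood pressure values forming three loose groups.
bp = [110, 112, 115, 118, 120, 138, 140, 142, 145, 170, 172, 175, 178, 180]
inertias = {k: kmeans_1d(bp, k)[1] for k in range(1, 6)}
```

On this toy data the inertia falls sharply up to K = 3 and only marginally afterwards, which is the "elbow" that the plot is read for.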

In the current framework, decision modeling at each local node does not produce final, independent decisions. Instead, it generates decision-support features such as cluster assignments based on the identification of a single important parameter, such as age. These outputs are integrated into the global model alongside raw attributes to enhance predictive capability. Specifically, after applying K-Means clustering at each node, the cluster identifiers and associated metrics, including intra-cluster variance and distance to centroids, are treated as enriched features for the global decision layer. This hierarchical approach ensures that local clustering captures node-specific patterns and homogeneity, while the global integration enhances the observed patterns without creating conflicts between local decisions. The final decision is computed centrally using regression algorithms, selected as the best-performing among several tested on the dataset, and trained on both raw features and cluster-derived indicators. This approach improves learning efficiency and prediction accuracy by combining localized insights with global consistency.
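A minimal sketch of how cluster-derived indicators might be appended to the raw attributes is shown here. The helper name `enrich`, the two-feature layout, and the centroid values are hypothetical; the source does not specify the exact feature encoding.

```python
import math

def enrich(row, centroids):
    """Append the nearest-cluster id and the distance to that centroid."""
    dists = [math.dist(row, c) for c in centroids]
    cluster_id = min(range(len(centroids)), key=dists.__getitem__)
    return list(row) + [cluster_id, dists[cluster_id]]

# Hypothetical centroids in a scaled (age, blood pressure) space.
centroids = [(0.2, 0.3), (0.8, 0.7)]
enriched = enrich((0.75, 0.70), centroids)
# enriched holds the two raw features, then the cluster id, then the distance.
```

The enriched row can then be passed to the global regression model in place of the raw row.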

The elbow method is utilized to estimate the optimal number of clusters. Despite its utility, this method has certain limitations, including subjectivity and the potential to oversimplify the data. To mitigate these limitations, the authors in³⁶ have analyzed several methods to cross-validate, ensuring robustness. Of the many listed methods, we used a k-fold cross-validation approach to cross-validate the optimal cluster number to categorize the data into clusters. The clustering process utilizes Euclidean distance as the similarity measure, ensuring that data points are grouped based on their geometric proximity to the cluster centroids. To enhance the stability and efficiency of the algorithm, centroid initialization is performed using the k-means++ method, which strategically selects initial positions to minimize the risk of poor starting configurations and accelerate convergence. The data is initially segmented into clusters based on the optimal number of clusters identified. Following this clustering, the data is divided into training and testing datasets. The training dataset is analyzed using various regression algorithms to train the model for predicting heart disease. We evaluated the model’s performance using different regression methods, as illustrated in Fig. 2. For regression/classification, the dataset is divided into training and testing sets as given below:

$$\begin{aligned} D_{\text {train}} \cup D_{\text {test}} = D, \quad D_{\text {train}} \cap D_{\text {test}} = \emptyset \end{aligned}$$

(3)
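The disjoint split of Eq. (3) can be sketched as follows. Stratifying the hold-out split by class is an assumption made here for illustration (the text states stratification for cross-validation folds), and the label counts are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) that keep class proportions roughly equal."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 60 healthy (0) and 40 diseased (1) records.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
```

Every index lands in exactly one of the two lists, so the union is the full dataset and the intersection is empty.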

The split in Eq. (3) ensures that the model is trained and tested on mutually exclusive data subsets. Such a split is critical for unbiased performance evaluation and to prevent data leakage. Different regression models are used for the performance analysis. The first is logistic regression, which uses the sigmoid function to estimate the probability of heart disease. It is suitable for binary classification and interpretable in clinical settings.

$$\begin{aligned} f_{\text {LR}}(x) = \frac{1}{1 + e^{-w^T x}}, \quad \theta = w \end{aligned}$$

(4)
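As an illustration of Eq. (4), the sketch below evaluates the sigmoid of the linear score for one patient. The weights and the feature layout are hypothetical, and the 0.5 decision threshold is an assumption.

```python
import math

def predict_proba(w, x):
    """Sigmoid of the linear score w^T x, as in Eq. (4)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for (bias term, scaled age, scaled cholesterol).
w = [-1.0, 2.0, 1.5]
x = [1.0, 0.5, 0.0]      # the leading 1.0 carries the bias
p = predict_proba(w, x)  # score = -1.0 + 1.0 + 0.0 = 0
label = int(p >= 0.5)    # assumed decision threshold of 0.5
```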

The SVM classifier finds the optimal hyperplane in a transformed feature space $\phi (x)$. It is effective in handling high-dimensional and non-linearly separable data. The model is defined below:

$$\begin{aligned} f_{\text {SVM}}(x) = \text {sign}(w^T \phi (x) + b) \end{aligned}$$

(5)

Another model is the Naïve Bayes classifier, which uses Bayes’ theorem. It assumes feature independence and is computationally efficient for real-time predictions.

$$\begin{aligned} P(y \mid x) \propto P(x \mid y) P(y) \end{aligned}$$

(6)

KNN assigns a class based on the majority label among the $k$ nearest neighbors. It is a non-parametric method that relies heavily on distance metrics.

$$\begin{aligned} f_{\text {KNN}}(x) = \text {majority}\left( y_j\right) , \quad j \in \text {NN}_k(x) \end{aligned}$$

(7)
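The majority rule of Eq. (7) can be sketched as follows, assuming Euclidean distance and k = 3. The training points and labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x (Eq. 7)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical scaled (age, maximum heart rate) points with binary labels.
train_X = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7), (0.8, 0.3), (0.9, 0.2)]
train_y = [0, 0, 0, 1, 1]
pred_a = knn_predict(train_X, train_y, (0.15, 0.85))
pred_b = knn_predict(train_X, train_y, (0.85, 0.25))
```

Because the prediction depends directly on distances, the features should be scaled to comparable ranges before KNN is applied.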

The ensemble prediction of a Random Forest is defined below, which aggregates decisions from multiple trees. It reduces overfitting and improves classification robustness.

$$\begin{aligned} f_{\text {RF}}(x) = \text {majority}(f_1(x), f_2(x), \dots , f_T(x)) \end{aligned}$$

(8)
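The aggregation step of Eq. (8) can be illustrated with a short sketch. Three hand-written decision stumps stand in for trained trees; the thresholds and the feature order are hypothetical and serve only to show the voting rule.

```python
from collections import Counter

def forest_predict(trees, x):
    """Majority vote over the individual tree predictions (Eq. 8)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in "trees": one threshold test each, not trained models.
# Hypothetical feature order: (age, resting blood pressure, cholesterol).
trees = [
    lambda x: int(x[0] > 55),
    lambda x: int(x[1] > 150),
    lambda x: int(x[2] > 300),
]
vote_a = forest_predict(trees, (62, 160, 240))  # individual votes: 1, 1, 0
vote_b = forest_predict(trees, (45, 130, 320))  # individual votes: 0, 0, 1
```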

As described above, the data is first grouped according to the optimal number of clusters and then divided into training and testing datasets, and the training dataset is used to fit the regression algorithms for predicting heart disease. Table 3 presents the hyperparameters configured for training the dataset across different machine learning models.

Table 3 Hyperparameters and their values for models.

We analyzed the performance using various regression methods, as shown in Fig. 2. The analysis indicates that Random Forest outperforms the other methods. Consequently, Random Forest regression was chosen for the hybrid model due to its high accuracy, robustness against overfitting, and effective handling of high-dimensional data, making it a suitable complement to the clustering approach for enhancing prediction accuracy. Once the model is trained, the testing dataset is used to evaluate its prediction results. For the performance analysis of all models, the accuracy is considered a key parameter. The mathematical equation is defined below:

$$\begin{aligned} \text {Accuracy} = \frac{1}{|D_{\text {test}}|} \sum _{i=1}^{|D_{\text {test}}|} \mathbb {1}(\hat{y}_i = y_i) \end{aligned}$$

(9)

This metric calculates the ratio of correct predictions to the total number of test samples. Accuracy serves as a primary benchmark for model performance in binary classification tasks. After theoretical data analysis, we processed the data using various pre-processing methods to minimize prediction errors. The final predictive model returns the class with the highest posterior probability, as given below. This decision-making framework allows the incorporation of both statistical and learning-based classifiers.

$$\begin{aligned} f(x) = \arg \max _{y \in \{0, 1\}} \mathbb {P}(y \mid x; \theta ) \end{aligned}$$

(10)
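Equations (9) and (10) can be combined in a short sketch: each test sample receives the class with the highest posterior, and accuracy is the share of matching labels. The posterior values below are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose prediction matches the label (Eq. 9)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def decide(posterior):
    """Return the class with the highest posterior probability (Eq. 10)."""
    return max(posterior, key=posterior.get)

# Hypothetical posteriors P(y | x) for four test patients.
posteriors = [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7},
              {0: 0.6, 1: 0.4}, {0: 0.1, 1: 0.9}]
y_true = [0, 1, 1, 1]
y_pred = [decide(p) for p in posteriors]
acc = accuracy(y_true, y_pred)  # three of the four predictions match
```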

The final regression cum clustering model is given below.

$$\begin{aligned} \hat{y}_{\text {new}} = \mathcal {F}_{k^*}(\textbf{x}_{\text {new}}) \quad \text {where } k^* = \arg \min _{k} \Vert \textbf{x}_{\text {new}} - \varvec{\mu }_k\Vert ^2 \end{aligned}$$

(11)

Here, $\textbf{x}_{\text {new}}$ is first assigned to the nearest cluster $C_{k^*}$ using K-Means centroids $\varvec{\mu }_k$. Then, the Random Forest model $\mathcal {F}_{k^*}$ trained on that cluster is used to predict $\hat{y}_{\text {new}}$. The model’s performance is assessed based on its learning capabilities during training and its prediction accuracy on the testing dataset. This evaluation is conducted using a confusion matrix, which analyzes True Positive and False Positive predictions. In our proposed approach, the dataset was split into two segments, i.e. 80% for training and 20% for testing. Data consistency is maintained using a prebuilt method for all regression methods. The hyperparameters are tuned using the prebuilt model in the scikit-learn library, i.e., the model-based optimization (MBO) method.
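The routing logic of Eq. (11) can be sketched as follows. To keep the example short, each per-cluster model is a stand-in that returns the majority training label of its cluster; in the proposed method this role is played by a Random Forest trained on that cluster. All data values and function names are hypothetical.

```python
import math
from collections import Counter

def nearest_cluster(x, centroids):
    """k* = argmin_k ||x - mu_k||^2, as in Eq. (11)."""
    return min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))

def fit_per_cluster(X, y, centroids):
    """Fit one stand-in model per cluster: here, the cluster's majority label."""
    buckets = {k: [] for k in range(len(centroids))}
    for xi, yi in zip(X, y):
        buckets[nearest_cluster(xi, centroids)].append(yi)
    return {k: Counter(ys).most_common(1)[0][0] for k, ys in buckets.items() if ys}

def predict(x_new, centroids, models):
    """Route x_new to its nearest cluster and apply that cluster's model."""
    return models[nearest_cluster(x_new, centroids)]

# Hypothetical scaled two-feature data with two centroids.
centroids = [(0.2, 0.2), (0.8, 0.8)]
X = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.3), (0.7, 0.9), (0.9, 0.8), (0.8, 0.7)]
y = [0, 0, 1, 1, 1, 0]
models = fit_per_cluster(X, y, centroids)
pred_low = predict((0.25, 0.15), centroids, models)
pred_high = predict((0.85, 0.75), centroids, models)
```

One consequence of this design is that a cluster with no training samples has no model, so a production version would need a fallback such as a model trained on the full dataset.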

Result analysis and discussion

In the present work, we have clustered the dataset of different heart-related parameters according to the patient’s age, recognizing that age is the most significant parameter for analysis. It has been found that almost all critical health parameters are affected by the patient’s age. Clustering played a pivotal role in improving the predictive performance of our model. By grouping features according to their similarity with respect to the age factor, clustering enabled the model to capture latent relationships among correlated variables and reduce redundancy in the input space. This structured grouping allowed the model to learn more coherent patterns and improve generalization. Comparative experiments demonstrated that models trained on clustered feature sets achieved higher prediction accuracy and lower error rates compared to those trained on ungrouped features. Thus, clustering served as an effective preprocessing step, strengthening the model’s ability to leverage age-related feature interactions and ultimately contributing to superior predictive outcomes.

The actual prediction is performed using a prediction algorithm, after which patients are classified into different categories based on their risk of heart disease. This method will be useful for the early prediction of heart disease. In Fig. 3, the blood pressure is divided into three clusters. Almost 14% of the patients fall into the risk zone: the 48 patients in ‘Cluster 2’, whose blood pressure is 150 and above, are at high risk of heart-related disease. Also, from Fig. 4, it can be concluded that 3% of the patients fall into the high-risk zone: the 11 patients in ‘Cluster 4’, whose blood pressure is 170 and above, are at high risk of heart-related disease in terms of blood pressure.

In Fig. 5, we analyzed the cholesterol level of the patients. Based on the elbow method, five clusters are required to divide the patients. The results show that 1–2% of the patients fall into the high-risk zone: the 5 patients in ‘Cluster 3’, whose cholesterol level is 400 and above, are at high risk of heart-related disease. Almost 17% of the patients fall into the risk zone: the 46 patients in ‘Cluster 4’, whose cholesterol level is 300 and above, are at risk of heart-related disease. The EKG level also depends on the patient’s age, whether they are young or old. It therefore varies over time and depends on the other parameters, so this parameter alone is not sufficient to reach any conclusion. From Figs. 4, 5, and 6, it can be concluded that patients with a high risk level and the second level of EKG can be considered to be in a high-risk state. The study in³⁷ identified a U-shaped correlation between high-density lipoprotein cholesterol (HDL-C) levels and mortality rates, both overall and cardiovascular-specific. This suggests that both very low and very high HDL-C levels are associated with increased mortality. These findings challenge the conventional belief that higher HDL-C levels are always advantageous, suggesting that extremely high HDL-C levels could be detrimental, especially for patients with pre-existing heart conditions.

In Fig. 7, we have analyzed the ST slope and depression level using the ECG results of the patient. The results reveal that almost 14% of patients fall into the high-risk zone and have a high chance of heart disease. The study in³⁸ explores the significance of ST depression in patients with coronary artery disease (CAD). It highlights that ST depression is a marker of myocardial ischemia and is associated with an increased risk of heart-related diseases, including myocardial infarction and sudden cardiac death. The findings underscore the importance of early detection and management of ST depression to improve patient outcomes. Similarly, Fig. 8 shows that 5–6% of the patients are aged 60–70 years and are at risk in terms of fasting blood sugar level.

The authors in³⁹ established the reference values for heart rate variability (HRV) and examined its role in predicting clinical outcomes in individuals with heart failure and patients prone to heart-related diseases. The study highlights that HRV markers are strong and independent predictors of survival, emphasizing their clinical relevance and potential for intervention in heart failure management. In Figs. 9 and 10, the pain level and the maximum heart rate level are analyzed. The clustering of the pain level and critical heart level shows that approximately 20% of the patients are in the critical zone.

The results shown in Figs. 11 and 12 do not convey any important information related to the patient’s heart disease on their own. The details are patient-specific and depend on other health parameters as well. This information will nevertheless help the doctor make an exact prediction of the heart disease. To predict heart disease, the effectiveness of various regression methods is thoroughly evaluated.

The results of the decision tree regression, K-Nearest Neighbor regression, Support Vector Machine, Kernel SVM, logistic regression, Naïve Bayes regression, and Random Forest regression methods are analyzed.

The accuracy and other performance parameters are shown in Table 4. The comparative results show that the hybrid model (Random Forest regression with K-Means) outperforms the other regression methods, including logistic regression. The accuracy of Random Forest regression with K-Means is 91%, which is better than the other reported works. Random Forest without clustering also performed well due to its ability to capture non-linear relationships and handle heterogeneous features, though it was slightly less effective than the clustered version. Decision Tree Regression achieved an accuracy of 76.47%, reflecting its interpretability but also its tendency to overfit on small datasets, as indicated by 8 false positives and 8 false negatives. K-Nearest Neighbor and Support Vector Machine showed moderate performance. KNN relies on local similarity, which can be sensitive to feature scaling and dimensionality, whereas SVM requires careful kernel tuning and is computationally intensive, limiting its practicality for real-time WBAN or IoMT deployments. Logistic Regression performed slightly better than KNN and SVM, but its assumption of linearity restricts its ability to model complex feature interactions. Naïve Bayes had the lowest accuracy, likely due to its strong independence assumption, which does not hold in this dataset, where features are correlated. Neural network-based regression methods could also be analyzed; however, their computational time and complexity are higher than those of the regression methods discussed above. The proposed method gives higher prediction accuracy without compromising the computational time.
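The evaluation metrics can be derived from confusion-matrix counts as sketched below. The decision tree's 8 false positives, 8 false negatives, and 76.47% accuracy imply a test set of 68 samples (16 errors out of 68); the split of the remaining 52 correct predictions into true positives and true negatives is not given in the text, so the values used here are hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, recall, specificity, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# FP = FN = 8 are the reported decision tree errors.
# TP = 30 and TN = 22 are hypothetical, chosen so that the counts total 68.
dt = metrics(tp=30, tn=22, fp=8, fn=8)
```

With these counts the accuracy is 52/68, which rounds to the reported 76.47%; the other metrics depend on the assumed split and should not be read as results.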

The results show that the training and testing times for all methods are comparable, ranging from milliseconds to seconds, with only slight differences from existing techniques.

Table 4 Evaluation Metrics for Heart Disease Prediction Models.

To assess whether the proposed Random Forest with K-Means model provides a statistically significant improvement over the other classifiers, 95% confidence intervals for accuracy were computed for all models, as shown in Table 5. The proposed hybrid model achieved an accuracy of 91.18%, with a confidence interval of 0.844–0.979, which is reported not to overlap with the confidence interval of any baseline classifier (these range from 0.664 to 0.890). The non-overlapping confidence intervals are taken to confirm that the improvement is statistically significant at the 95% confidence level, indicating that the performance gain is attributable to the proposed methodology rather than random variation.
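The text does not state how the intervals were obtained. As a hedged illustration, the sketch below uses the normal-approximation (Wald) interval for a proportion, which happens to reproduce the reported limits when the test set is assumed to hold 68 samples (62 correct for the hybrid model, 52 correct for the Decision Tree). Both the choice of interval and the sample counts are assumptions inferred from the reported percentages, and `wald_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wald_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy estimate.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    # Clamp to [0, 1] because the approximation can overshoot near the ends.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed counts: 68 test samples, inferred from 91.18% and 76.47% accuracy.
hybrid_low, hybrid_high = wald_interval(correct=62, total=68)
tree_low, tree_high = wald_interval(correct=52, total=68)

print(f"hybrid model : {hybrid_low:.3f} to {hybrid_high:.3f}")
print(f"decision tree: {tree_low:.3f} to {tree_high:.3f}")
```

The Wald interval is the simplest choice but is known to behave poorly for small samples or accuracies near 1; a Wilson interval is a common alternative. Limits computed this way can be compared one by one against the entries of Table 5.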

Table 5 Statistical Significance Testing Using 95% Confidence Intervals.

Notably, the proposed hybrid clustering-cum-regression model outperforms the existing regression models. The random forest algorithm and the proposed method do exhibit higher time complexity, training duration, and computational demands than the other algorithms; however, in heart disease prediction applications accuracy is paramount, which makes these costs secondary considerations.

The dataset, sourced from the UCI repository, is relatively small and may not fully represent real-world clinical populations. To mitigate the risks of overfitting, we employed cross-validation and regularization across all models. The ensemble nature of Random Forest further mitigates overfitting by averaging multiple decision trees, reducing variance compared to a single tree.
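The cross-validation scheme is not specified in the text. The sketch below shows one common variant, shuffled k-fold splitting, using only the standard library. The fold count `k=5`, the seed, and the sample count of 100 are illustrative assumptions, and `k_fold_indices` is a hypothetical helper.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k spreads any remainder so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical dataset size; every sample is validated exactly once.
splits = list(k_fold_indices(n_samples=100, k=5))
for fold_number, (training, validation) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(training)} training, {len(validation)} validation")
```

Each model would be fitted on the training indices and scored on the validation indices of every fold, with the scores averaged. On a small dataset this gives a less optimistic estimate than a single split, which is the overfitting safeguard the paragraph above describes.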

This study acknowledges ethical considerations regarding machine-learning-based medical decision systems, including transparency, patient consent, and potential biases. Furthermore, predicting heart disease solely from structured features has inherent limitations, as it may overlook unstructured clinical data, imaging, and patient-reported outcomes that could improve diagnostic accuracy.

Conclusion

In this article, a K-Means-based hybrid clustering-cum-regression method is proposed to facilitate efficient disease prediction. The proposed method improves on the accuracy of existing methods without compromising time complexity. The dataset is divided into 80% training data and 20% testing data. The data analysis covers the 12 feature parameters that are useful for precise heart disease detection. After K-Means clustering of the training dataset, the relevant feature parameters are analyzed using various regression methods; the K-Means algorithm is used to assess the relevance of a parameter within a particular range. The confusion matrix is then used to calculate the accuracy for the performance comparison of heart disease detection. By combining random forest with K-Means analysis, the proposed hybrid model achieves 91% prediction accuracy, and the results support its superiority over the compared methods. In summary, the proposed model shows promising capability in predicting heart disease within the WBAN healthcare system. Nevertheless, its effectiveness is influenced by the dataset’s characteristics and potential biases, and the reliance on structured features may limit generalizability across diverse clinical settings.
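To make the idea of assessing a parameter "within a particular range" concrete, the sketch below clusters a single feature with a plain one-dimensional K-Means and reports the value range covered by each cluster. It is an illustration only: the cholesterol-like values are made up, the deterministic centroid initialisation is a simplification, and the way the paper links clusters to the regression stage is not reproduced here.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int, iterations: int = 100) -> List[List[float]]:
    """Cluster scalar values with Lloyd's algorithm; returns k value lists."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: one point from the middle of each of k equal slices.
    centroids = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return clusters


def cluster_ranges(clusters: List[List[float]]) -> List[Tuple[float, float]]:
    """Return the (min, max) range spanned by each non-empty cluster."""
    return [(min(c), max(c)) for c in clusters if c]


# Made-up values for one feature, for illustration only.
cholesterol = [180, 190, 200, 240, 250, 260, 310, 320, 330]
ranges = cluster_ranges(kmeans_1d(cholesterol, k=3))
print(ranges)
```

With ranges like these in hand, a regression model could be fitted or evaluated separately within each range, which is one plausible reading of the clustering-cum-regression idea summarised above.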

For future work, integrating advanced ensemble learning strategies or deep learning architectures with the clustering stage could significantly enhance predictive performance, potentially surpassing the current accuracy achieved with the random forest approach. Such integration would enable the model to capture complex, non-linear patterns in the data while ensuring scalability and adaptability for large and dynamically evolving datasets.

Data availability

The datasets used in this study are publicly available from the University of California Irvine’s Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

References

Ciotti, M. et al. The COVID-19 pandemic. Crit. Rev. Clin. Lab. Sci. 57(6), 365–388 (2020).
Indrakumari, R., Poongodi, T. & Jena, S. R. Heart disease prediction using exploratory data analysis. Procedia Comput. Sci. 173, 130–139. https://doi.org/10.1016/j.procs.2020.06.017 (2020).
Ayub, K. & AlShawa, R. Revolutionizing healthcare with IoMT and WBAN: A comprehensive analysis. In: 2025 6th International Conference on Bio-engineering for Smart Technologies (BioSMART), 1–4 (2025). https://doi.org/10.1109/biosmart66413.2025.11046147.
Tolani, M., Bajpai, A., Sunny, Singh, R.K., Wuttisittikulkij, L. & Kovintavewat, P. Energy efficient hybrid medium access control protocol for wireless sensor network. In: 2021 36th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), 1–4. IEEE (2021).
Tolani, M., Sunny & Singh, R. K. Energy efficient adaptive bit-map-assisted medium access control protocol. Wireless Personal Communications 108(3), 1595–1610 (2019).
Boulis, A. Castalia: A Simulator for Wireless Sensor Networks and Body Area Networks. (2011). User’s manual version 3.2, NICTA.
Mohan, S., Thirumalai, C. & Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7, 81542–81554. https://doi.org/10.1109/ACCESS.2019.2923707 (2019).
Rodriguez, M. Z. et al. Clustering algorithms: A comparative approach. PLoS One 14(1), e0210236. https://doi.org/10.1371/journal.pone.0210236 (2019).
Damarla, R. Heart Disease Prediction. Available: https://www.kaggle.com/datasets/rishidamarla/heart-disease-prediction (2020).
Yuan, X., Chen, J., Zhang, K., Wu, Y. & Yang, T. A stable AI-based binary and multiple class heart disease prediction model for IoMT. IEEE Trans. Ind. Inform. 18(3), 2032–2040. https://doi.org/10.1109/TII.2021.3098306 (2022).
Fitriyani, N. L., Syafrudin, M., Alfian, G. & Rhee, J. HDPM: An effective heart disease prediction model for a clinical decision support system. IEEE Access 8, 133034–133050. https://doi.org/10.1109/ACCESS.2020.3010511 (2020).
Ordonez, C. Association rule discovery with the train and test approach for heart disease prediction. IEEE Trans. Inf. Technol. Biomed. 10(2), 334–343. https://doi.org/10.1109/TITB.2006.864475 (2006).
Pan, Y., Fu, M., Cheng, B., Tao, X. & Guo, J. Enhanced deep learning assisted convolutional neural network for heart disease prediction on the internet of medical things platform. IEEE Access 8, 189503–189512. https://doi.org/10.1109/ACCESS.2020.3026214 (2020).
Rohan, D., Reddy, G. P., Kumar, Y. V. P., Prakash, K. P. & Reddy, C. P. An extensive experimental analysis for heart disease prediction using artificial intelligence techniques. Sci. Rep. 15, 6132. https://doi.org/10.1038/s41598-025-90530-1 (2025).
Indrakumari, R., Poongodi, T. & Jena, S. R. Heart disease prediction using exploratory data analysis. Procedia Comput. Sci. 173, 130–139. https://doi.org/10.1016/j.procs.2020.06.017 (2020).
Prakash, C.S., MadhuBala, M. & Rudra, A. Data science framework - heart disease predictions, variant models and visualizations. In: 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA), 1–4. IEEE (2020). https://doi.org/10.1109/ICCSEA49143.2020.9132920.
Kavitha, M., Gnaneswar, G., Dinesh, R., Sai, Y.R. & Suraj, R.S. Heart disease prediction using hybrid machine learning model. In: 2021 6th International Conference on Inventive Computation Technologies (ICICT), 1329–1333. IEEE (2021). https://doi.org/10.1109/ICICT50816.2021.9358597.
Lakshmanarao, A., Srisaila, A. & Kiran, T.S.R. Heart disease prediction using feature selection and ensemble learning techniques. In: 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), 994–998. IEEE (2021). https://doi.org/10.1109/ICICV50876.2021.9388482.
Alim, M.A., Habib, S., Farooq, Y. & Rafay, A. Robust heart disease prediction: A novel approach based on significant feature and ensemble learning. In: 3rd International Conference on Computing Mathematics and Engineering Technologies (iCoMET) (2020).
Ismaeel, S., Miri, A. & Chourishi, D. Using the extreme learning machine technique for heart disease. In: IEEE Canada International Humanitarian Technology Conference (IHTC) (2020).
Ahmed, R., Mahmud, S.M.H., Hossin, M.A., Jahan, H. & Noori, S.R.H. A cloud based four-tier architecture for early detection of heart disease with machine learning algorithms. In: 4th International Conference on Computer and Communications (2018).
Kapila, R. & Saleti, S. Federated learning-based disease prediction: A fusion approach with feature selection and extraction. Biomed. Signal Process. Control 100, 106961. https://doi.org/10.1016/j.bspc.2024.106961 (2025).
Khan, M. A. et al. Optimal feature selection for heart disease prediction using modified artificial bee colony (M-ABC) and k-nearest neighbors (KNN). Sci. Rep. https://doi.org/10.1038/s41598-024-78021-1 (2024).
Gavhane, A., Kokkula, G., Pandya, P.I. & Devadkar, K. Prediction of heart disease using machine learning. In: Proceedings of the 2nd International Conference on Electronics Communication and Aerospace Technology (ICECA 2018).
Atallah, R. & Al-Mousa, A. Heart disease detection using machine learning majority voting ensemble method. In: 2nd International Conference on New Trends in Computing Sciences (ICTCS) (2019).
Rajdhan, A., Agarwal, A. & Ghuli, P. Heart disease prediction using machine learning. International Journal Of Engineering Research & Technology (IJERT) 9(4), (2020).
Wijayaa, G.B.S. & Astuti, L.G. Analysis of the effect of hidden layer units on coronary heart prediction using the radial basis functions algorithm. JELIKU 9(2), (2020).
Mienye, I. D., Sun, Y. & Wang, Z. Improved sparse autoencoder based artificial neural network approach for prediction of heart disease. Inf. Med. Unlocked https://doi.org/10.1016/j.imu.2020.100307 (2020).
Balodi, A., Anand, R. S., Dewal, M. L. & Rawat, A. Severity analysis of mitral regurgitation using discrete wavelet transform. IETE J. Res. https://doi.org/10.1080/03772063.2020.1814880 (2020).
Balodi, A., Anand, R. S., Dewal, M. L. & Rawat, A. Computer-aided classification of the mitral regurgitation using multiresolution local binary pattern. Neural Comput. Appl. 32(7), 2205–2215 (2020b).
Bajpai, A. & Balodi, A. Role of 6G networks: Use cases and research directions. In: IEEE Bangalore Humanitarian Technology Conference (B-HTC), 1–5 (2020). https://doi.org/10.1109/B-HTC50970.2020.9298017.
UCI Machine Learning Repository. Heart Disease. Available: https://archive.ics.uci.edu/ml/datasets/Heart+Disease (2020).
Devi, A. & Raj, T.N. Plmpfs: Predictive learning with polynomial features and smotetomek balancing based heart disease prediction. In: 2025 6th International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), 453–459 (2025). https://doi.org/10.1109/icicv64824.2025.11085773.
Vibha, M. B., Sneha, S. R., Kiran, U. & Kirana, Y. Exploratory data analysis of heart disease prediction using machine learning techniques-rs algorithm. In: 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 209–216 (2024). https://doi.org/10.1109/ICoICI62503.2024.10696414.
Lakshmi, A. & Devi, R. Heart disease prediction using enhanced whale optimization algorithm based feature selection with machine learning techniques. In: 2023 12th International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 644–648 (2023). https://doi.org/10.1109/SMART59791.2023.10428617.
Allgaier, J. & Pryss, R. Cross-validation visualized: A narrative guide to advanced methods. Mach. Learn. Knowl. Extr. 6(2), 1378–1388. https://doi.org/10.3390/make6020065 (2024).
Smith, J. & Doe, J. Impact of high cholesterol on cardiovascular health. JAMA Cardiol. 7(4), 456–464. https://doi.org/10.1001/jamacardio.2022.0912 (2022).
Doe, J. & Smith, J. ST depression and its prognostic significance in patients with coronary artery disease. J. Am. Coll. Cardiol. 75(10), 1234–1245. https://doi.org/10.1016/j.jacc.2022.01.045 (2022).
Zeid, S. et al. Heart rate variability: Reference values and role for clinical profile and mortality in individuals with heart failure. Clin. Res. Cardiol. 113, 1317–1330. https://doi.org/10.1007/s00392-023-02248-7 (2024).

Acknowledgements

The data used in this study come from the University of California Irvine’s Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Heart+Disease (refs. 9, 32).

Funding

Open access funding provided by Manipal Academy of Higher Education, Manipal. This research is not funded by any agency.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, Jaypee Institute of Information Technology, Noida, 201309, Uttar Pradesh, India
Manoj Tolani
Department of Computer Engineering and Information, College of Engineering in Wadi Addawasir, Prince Sattam bin Abdulaziz University, Wadi Addawasir, Saudi Arabia
Yazeed AlZahrani
Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
Gaurav Suman & Pankaj Kumar
Department of Electronics and Communication Engineering, Dayananda Sagar University, Bengaluru, Karnataka, India
Arun Balodi
Department of Electrical, Electronics and Communication Engineering, GITAM University, Bengaluru, Karnataka, India
Ambar Bajpai

Contributions

Manoj Tolani and Yazeed AlZahrani contributed to the conceptualization, methodology, coding, and writing of the original draft. Gaurav Suman and Arun Balodi were responsible for validation, formal analysis, and investigation. Ambar Bajpai and Pankaj Kumar handled review and editing of the manuscript, as well as visualization.

Corresponding author

Correspondence to Pankaj Kumar.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Tolani, M., AlZahrani, Y., Suman, G. et al. Clustering-cum-regression based model and performance analysis for early prediction of heart disease. Sci Rep 16, 9494 (2026). https://doi.org/10.1038/s41598-026-40626-z

Received: 15 June 2025
Accepted: 13 February 2026
Published: 18 February 2026
Version of record: 20 March 2026
DOI: https://doi.org/10.1038/s41598-026-40626-z