Abstract
Machine learning methods, especially the K_Means clustering method, have demonstrated potential in analyzing medical data by facilitating pattern detection. However, the classic K_Means algorithm suffers from two major limitations: (1) its reliance on a single, often suboptimal distance metric (typically Euclidean), and (2) the lack of a mechanism to refine clusters post-assignment, which can lead to poor cohesion and misgrouping. To address these challenges, this paper proposes a novel enhanced K-Means clustering framework with two key innovations: (i) a hybrid distance approach that combines cosine and cityblock (Manhattan) metrics in a tunable weighted manner to better capture the structure of medical data and (ii) an efficient cluster refinement mechanism based on Z-score outlier detection to reassign distant samples and improve cluster quality. First, we evaluate K_Means using five distance metrics—Euclidean, cosine, cityblock, Chebyshev, and Minkowski—on two public medical datasets: Breast Cancer Wisconsin (BCW) and Heart Disease. Then, we introduce the hybrid distance strategy, systematically varying the weight between cosine and cityblock to identify the optimal combination. Following initial clustering, our refinement step identifies data points far from their cluster centroids (using Z-score) and reassigns them to more suitable clusters, significantly enhancing cluster homogeneity and separation. The proposed method is evaluated using multiple metrics: accuracy, precision, recall, F1-score, Adjusted Rand Index (ARI), homogeneity, and execution time. Results show substantial improvements over traditional approaches and advanced clustering methods (deep clustering and spectral clustering methods). For the BCW and Heart Disease datasets, the proposed method achieves accuracies of 0.9825 and 0.9000, outperforming Euclidean K-Means (0.8752, 0.8316) and cosine-based K-Means (0.9350, 0.8418). 
Homogeneity scores also improve significantly, from 0.7721 to 0.8676 on the BCW dataset and from 0.4335 to 0.5352 on the Heart Disease dataset, demonstrating the effectiveness of the refinement step. This work presents an original, practical enhancement to K_Means clustering for healthcare applications, offering improved accuracy, interpretability, and robustness through a hybrid distance strategy and a novel refinement mechanism. The results provide deeper insights into unsupervised learning for medical data analysis and support its potential in real-world clinical decision-making.
Introduction
Healthcare encompasses procedures and services aimed at preventing, treating, and managing illnesses to preserve and improve individuals’ health1. It plays a vital role in promoting well-being, protecting lives, and enabling people to reach their full potential. Without proper healthcare, individual growth and the progress of society would be seriously hindered2.
It is imperative that medical professionals develop a deeper understanding of the methodologies and thought processes involved in clinical decision-making. This enables physicians and healthcare providers to detect illnesses early, including cancer and heart disease3. The ability to extract valuable insights from data facilitates early diagnosis, treatment planning, and outcome prediction4. Because critical illnesses must be diagnosed early, technology-based software solutions are needed to help medical professionals identify patients more accurately and efficiently5.
Machine learning (ML), a core subfield of artificial intelligence (AI), has emerged as a transformative force in healthcare by enabling systems to learn from data, identify complex patterns, and support predictive analytics6. ML models have been successfully applied to various medical tasks, including disease classification, risk prediction, and patient stratification7. Among these, unsupervised learning techniques—particularly clustering—are highly valuable when labeled data is scarce or costly to obtain8. Clustering organizes unlabeled data into meaningful groups based on similarity, facilitating exploratory data analysis, anomaly detection, and subgroup identification in patient populations9,10.
One of the most widely used clustering algorithms is K_Means, prized for its simplicity, scalability, and effectiveness in partitioning data into \(k\) coherent clusters by minimizing within-cluster variance11. The choice of distance metric significantly influences clustering results, as different metrics emphasize distinct geometric properties of the data space12, especially in heterogeneous medical datasets. For instance, Euclidean distance assumes spherical clusters and is sensitive to outliers, while cosine similarity focuses on orientation rather than magnitude, which is useful in sparse or normalized data. Similarly, cityblock (Manhattan) and other Minkowski-family metrics offer alternative ways to model dissimilarity, especially in non-uniform feature spaces. Yet, most implementations rely solely on Euclidean distance without comparative evaluation or adaptive selection, leading to potentially misleading results13. Additionally, several efforts have been made to optimize cluster initialization (such as K_Means++) and to determine the optimal number of clusters (such as the elbow or silhouette methods)14. Some researchers have also proposed ensemble clustering or hybrid models combining multiple algorithms to enhance robustness13.
Despite these advances, two critical limitations remain underexplored: (i) Over-reliance on a single distance metric: Most existing works evaluate distance metrics in isolation rather than exploring synergistic combinations. There is limited work on adaptive or hybrid distance measures that leverage the strengths of multiple metrics to better capture complex data structures in medical datasets. (ii) Lack of post-clustering refinement: Traditional K-Means lacks a built-in mechanism to identify and correct misassigned or borderline samples after the initial clustering phase. This absence often results in clusters with poor cohesion.
These gaps reduce the reliability and accuracy of clustering results, limiting their practical utility in clinical decision-making, where accuracy and interpretability are paramount.
To address these challenges, this paper proposes a novel enhanced K_Means clustering framework specifically tailored for medical data analysis. Our approach introduces two key innovations:
1. A hybrid distance strategy that combines cosine and cityblock (Manhattan) distances in a weighted manner, allowing for greater flexibility in capturing diverse data structures. We systematically evaluate multiple distance metrics (Euclidean, cosine, cityblock, Chebyshev, and Minkowski) and then explore optimal mixing ratios between cosine and cityblock to maximize clustering accuracy.
2. An efficient cluster refinement mechanism based on the Z-score that identifies data points that are statistically distant from their assigned cluster centroids. These outliers are reassigned to the most suitable neighboring cluster, thereby improving intra-cluster homogeneity and inter-cluster separation.
The proposed method is rigorously evaluated on two benchmark medical datasets: the Breast Cancer Wisconsin (BCW) and Heart Disease datasets. Performance is evaluated using multiple metrics, including accuracy, precision, recall, F1_score, Adjusted Rand Index (ARI), homogeneity, and execution time. Experimental results demonstrate substantial improvements over conventional K_Means with Euclidean or single-metric approaches, both before and after refinement.
This paper makes the following original contributions:
- Classic K_Means clustering method: Instead of using only the default Euclidean distance, we apply the K_Means method with five distance metrics: Euclidean, cosine, cityblock (Manhattan), Chebyshev, and Minkowski. The accuracy of each metric is computed.
- Hybrid distance method: The accuracy is computed for a weighted mix of the cosine and cityblock distance metrics. Specifically, we start with the cosine weight set to 0.0 and the cityblock weight set to 1.0, and then increase the former to 1.0 while decreasing the latter to 0.0. The accuracy is computed for every weight pair, and the mixing ratio that yields the highest accuracy is identified.
- Efficient cluster refinement method: After the initial clusters are constructed, all data samples that are distant from their assigned cluster centroid are identified using the Z-score method. These data samples are then reassigned to the nearest cluster that provides a better fit. This step effectively enhances cluster cohesion and accuracy.
- The proposed method is tested on two medical datasets: the Breast Cancer Wisconsin (BCW) and Heart Disease datasets. Different metrics are used to evaluate the results, including accuracy, precision, recall, F1_score, Adjusted Rand Index (ARI), and execution time. Additionally, the homogeneity metric is applied to the clusters before and after refinement to demonstrate the efficiency of the proposed method. The proposed method also outperforms the classic distance metrics and the advanced clustering methods (deep clustering and spectral clustering). This provides deeper understanding and insights into its effectiveness in processing medical datasets.
By addressing the major limitations of K_Means through principled enhancements in distance computation and cluster refinement, this study advances the applicability and reliability of unsupervised learning in medical diagnostics. The proposed method not only improves clustering performance but also supports more trustworthy and interpretable results—critical for adoption in clinical settings.
Related works
Many studies have explored the implementation of K_Means clustering method in medical data analysis, particularly for disease classification such as Breast cancer and Heart Disease. However, most existing studies suffer from methodological limitations that affect their accuracy, robustness, and interpretability. This section discusses different previous works that use the same medical datasets (Breast Cancer Wisconsin (BCW) and Heart Disease) as a benchmark to evaluate their model’s performance.
A common practice in the literature is the use of Euclidean distance as the default metric in K-Means. The authors of15 proposed a weighted sample-based initialization technique to improve centroid selection, achieving 96.2% accuracy on a medical dataset—still below the performance of standard K-Means (69.1%), which raises concerns about experimental validity and reproducibility. Similarly, the work presented by16 applied K-Means with Euclidean distance after preprocessing steps and reported 85.71% accuracy for heart disease prediction, but with a relatively high execution time (0.9 seconds), indicating inefficiency in scalability.
Some researchers have attempted to move beyond Euclidean distance. The authors of17 proposed a mixed distance metric combining correlation, cosine, and Euclidean distances, coupled with an adaptive outlier removal strategy using Tukey’s rule. While innovative, their results show that the mixed metric (96.49%) underperformed compared to the standard Euclidean approach (96.58%) on the cancer dataset, suggesting that simply combining metrics without optimization does not guarantee improvement.
Other works integrate K-Means with supervised models to enhance classification. The work in18 combined a modified K-Means with SVM on the Breast Cancer Wisconsin (BCW) dataset, reporting 96.99% accuracy. However, the method relies on a threshold-based distance measure and fails to address class imbalance, a common issue in medical datasets. Moreover, no comparison was made with other K-Means variants, limiting the generalizability of the conclusions.
In broader comparative studies, the authors of19 evaluated seven unsupervised algorithms, including K-Means, on both BCW and heart disease datasets. Their results revealed no consistent outperformer across datasets, and critically, accuracy was not reported, weakening the clinical relevance of their findings. Homogeneity and ARI scores were low (ARI of –0.0084 to 0.0135), indicating poor cluster quality. In contrast, the work of20 compared multiple clustering algorithms for heart disease and found that standard K-Means with Euclidean distance achieved the highest accuracy (84.78%). Yet, the number of clusters was not justified, and class imbalance was ignored.
Furthermore, the framework of21 employed fuzzy logic and clustering for rule-based classification, testing various supervised models (Support Vector Machine (SVM), K-nearest neighbor (KNN), Naive Bayes (NB), Random Forest (RF), and Artificial Neural Network (ANN)). Their best-reported accuracy was only 83.17%, highlighting the challenge of achieving high performance without robust feature representation or clustering refinement.
In recent years, deep learning-based clustering methods have emerged as powerful alternatives that learn latent representations and clustering structures simultaneously22,23. Deep Embedded Clustering (DEC)24 and its successor, Improved Deep Embedded Clustering (IDEC)25, use autoencoders to learn compressed feature representations and then refine clusters using a soft assignment loss. Deep Clustering with Convolutional Autoencoders (DCEC) enhances clustering by preserving local data structure through a reconstruction loss26.
Recently, contrastive clustering frameworks such as SwAV, SimCLR, and CC27 have shown promise in learning robust representations without labels. These methods have been adapted to healthcare for patient sub-typing and disease stratification28.
To balance performance and interpretability, several studies have proposed hybrid or ensemble clustering approaches. These combine multiple algorithms or integrate clustering with optimization techniques. The authors of29 introduced a deep hybrid model where an autoencoder first learns low-dimensional features, K-Means clusters the embeddings, and PSO optimizes the number of clusters (k) and centroid positions iteratively. However, the model is a black box and converges slowly.
The work of30 presented a practical two-stage strategy for graph clustering that balances efficiency and accuracy. It applied the K_Means and DBSCAN methods and then refined the outcomes with spectral clustering to better preserve and exploit topological features. The approach suffers from reliance on handcrafted global features and sensitivity to the initial k in the K_Means method.
Methods and materials
This section explains the materials and methods of our proposed K_Means method. Firstly, we explain the datasets in Sect. “Datasets”. Then, K_Means clustering method is presented in Sect. "K Means clustering method", followed by an explanation of the distance metric types in Section “Distance metrics”. The cluster refinement is illustrated in Sect. “Cluster refinement”. Section "Advanced clustering methods for medical data analysis" clarifies the advanced clustering methods for medical data. Finally, our methodology is depicted in Sect. “Methodology”.
Datasets
The UCI Machine Learning Repository is a comprehensive dataset repository that supports the machine learning community in the empirical analysis of machine learning algorithms. It is therefore widely used across different fields. To evaluate the proposed method, two public medical datasets from this repository are used.
Download datasets
As mentioned earlier, two public datasets are used to test our proposed method. To download these datasets, the following URLs are used:
1. Breast Cancer Wisconsin (BCW) dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data).
2. Heart disease dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).
Dataset statistics
Table 1 summarizes the statistics of these datasets.
Class distribution of datasets
Class distribution is a critical factor in machine learning model performance. Models may perform poorly on the minority class due to a bias toward the majority class, which is especially problematic in medical diagnosis. Figure 1 shows the class distribution of the two medical datasets used in this study.
In the Breast Cancer Wisconsin (BCW) dataset, there are 357 benign (B) samples and 212 malignant (M) samples, resulting in a moderate imbalance ratio of approximately 1.68:1. The Heart Disease dataset exhibits a more severe imbalance, with 164 instances of class 0 (no disease) and smaller counts across classes 1–4 (ranging from 19 to 73).
Figure 1. Class distribution of the Breast Cancer Wisconsin (BCW) and Heart Disease datasets. Green bars represent the majority class, orange bars represent minority classes.
Synthetic Minority Over-sampling Technique (SMOTE)
As shown in Figure 1, both datasets suffer from class imbalance, which may cause clustering algorithms (such as K_Means) to favor the majority class and misclassify minority samples. To overcome this issue, we use the Synthetic Minority Over-sampling Technique (SMOTE)31 to generate synthetic samples for the underrepresented classes, ensuring balanced representation during clustering32. This preprocessing step enhances the model fairness and improves the detection accuracy for minority classes, particularly important in medical applications33.
The SMOTE parameters are selected as follows:
- sampling_strategy=’auto’: Determines the desired ratio of the number of minority-class samples to the number of majority-class samples after resampling.
- random_state=42: Ensures that the same synthetic data samples are generated in each run.
- k_neighbors=5: The number of nearest neighbors used when generating synthetic samples.
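The interpolation idea behind SMOTE can be illustrated with a minimal, dependency-free sketch. This is a simplified illustration under stated assumptions, not the implementation used in this study: the function name `smote_sketch`, the toy data, and the choice to balance the classes by generating exactly the missing number of samples are all hypothetical. Each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors.

```python
import math
import random


def smote_sketch(minority, n_new, k_neighbors=5, random_state=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(random_state)
    # Cannot use more neighbours than there are other minority samples.
    k = min(k_neighbors, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): six majority samples, three minority samples.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.4, 0.2]]
minority = [[2.0, 2.0], [2.2, 2.1], [2.1, 2.4]]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by the minority data, and fixing `random_state` makes the generated samples repeatable.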
We used SMOTE rather than alternative methods (e.g., ADASYN or undersampling) for the following reasons:
(1) SMOTE is widely adopted and validated in medical data analysis, where preserving all original majority samples is critical to avoid losing rare but clinically important patterns. Undersampling, by contrast, risks discarding potentially informative majority instances, which is undesirable in healthcare environments where each patient record may carry diagnostic value.
(2) Although ADASYN is adaptive, it tends to generate more synthetic samples in regions where the minority class is hard to learn. Thus, it can easily lead to overfitting on outliers, which is a concern in medical datasets that usually contain measurement variability or borderline cases. In line with the paper’s goal of improving cluster cohesion, we prioritized stable and uniform oversampling over adaptive density-based generation.
(3) The main objective was to enhance clustering performance by mitigating class imbalance. Therefore, we selected SMOTE as a well-established, effective, and interpretable baseline that aligns with best practices of machine learning in healthcare.
As a result, in this study, applying the SMOTE alone was sufficient to achieve balanced representation and improve clustering accuracy, as evidenced by the enhanced homogeneity, separation in t-SNE visualizations, and reduced outlier counts after refinement.
Feature selection
The process of selecting the most pertinent features from a dataset for model training is known as feature selection34. It is employed to boost model performance by lowering computing costs, improving interpretability, and minimizing overfitting. Moreover, by concentrating on the most instructive and valuable features, feature selection makes models both easier to understand and more effective35.
There are various methods of feature selection applied across different fields, including healthcare36. The chi-squared feature selection (Chi²) is a statistical technique used in machine learning to identify the most pertinent features from the given dataset37. It tests the null hypothesis that a feature and the target are independent, and thereby measures the dependence between these two variables. We retain the top k features with the highest chi-squared scores, since a greater chi-squared value indicates that the feature is more likely to be associated with the target variable. The expected frequency (\(E_{ij}\)) is computed by Eq. 1:
$$\begin{aligned} E_{ij} = \frac{R_i \, C_j}{N} \end{aligned}$$(1)
where \(R_i\) is the total of row i, \(C_j\) is the total of column j, and \(N\) is the total number of observations in the contingency table. The chi-squared (\(\chi ^2\)) statistic is given by Eq. 2:
$$\begin{aligned} \chi ^2 = \sum _{i} \sum _{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{aligned}$$(2)
where \(O_{ij}\) is the observed frequency for cell (i, j).
A large \(\chi ^2\) value suggests that the feature and target are not independent—i.e., the feature is useful for predicting the target.
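As a worked illustration, the chi-squared statistic can be computed from a contingency table of observed frequencies: each expected frequency is the product of its row and column totals divided by the grand total, and the statistic sums the squared, normalized deviations. The sketch below uses two hypothetical 2x2 tables (rows: a binarized feature; columns: the two classes); the function name `chi_squared` is an assumption for illustration only.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency under independence of feature and target.
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic


# Hypothetical tables: rows = feature low/high, columns = class 0/class 1.
dependent = [[30, 10], [10, 30]]    # feature value tracks the class
independent = [[20, 20], [20, 20]]  # feature carries no class information

print(chi_squared(dependent))    # every expected count is 20, so 4 * (10**2 / 20)
print(chi_squared(independent))  # observed equals expected everywhere
```

In the first table every expected count is 20 and every deviation is 10, giving a statistic of 20; in the second table the observed and expected counts coincide, giving 0. Ranking features by this score keeps the ones that behave like the first table.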
To explore the robustness and reliability of the proposed method, a sensitivity analysis was conducted to evaluate the impact of the number of features on clustering performance. In medical datasets, not all features contribute equally to disease classification, and including irrelevant or redundant features degrades cluster quality and performance.
By applying Algorithm 1, all dataset features are ranked according to their importance. In practice, determining the optimal number of top-ranked features requires empirical validation. To select this number, we systematically evaluated the performance of the proposed clustering method using different numbers of features (5, 10,..., up to the full set), while keeping all other parameters constant (such as SMOTE, scaling, distance weights, and refinement steps).
For every configuration, the metrics are computed to evaluate the cluster quality. This step ensures an appropriate balance between model expressiveness and simplicity.
According to the results, we concluded that 25 and 10 features are the optimal numbers of features for the Breast Cancer Wisconsin (BCW) and Heart Disease datasets, respectively. The choice of 25 and 10 features is therefore not arbitrary; rather, it is supported by the best results. This sensitivity analysis helps ensure that the final model performs efficiently without overfitting or needless complexity, and it also reinforces the validity of the proposed methodology.
Algorithm 1 lists the dataset preprocessing steps.
Data preprocessing
K_Means clustering method
In this subsection, we discuss the main aspects of the K_Means clustering method.
Method goal
Dataset samples are divided into ’k’ distinct clusters according to their similarities using the iterative, unsupervised clustering algorithm K_Means. Each data sample is assigned to the nearest centroid after ’k’ centroids are randomly initialized. The centroids are then recalculated using the assigned samples’ mean, and the assignment process is repeated until the centroids stabilize. The main goal of this method is to reduce the sum of squared distances between data samples and the centroids of respective clusters. The K_Means method is widely used for exploratory data analysis due to its simplicity, speed, and efficiency38,39.
Implementation mechanisms
The following steps represent the main mechanisms to apply the K_Means clustering method39,40:
1. Select the number of clusters (k): Choose the number of clusters, ’k’. Usually, the user makes this decision directly or applies a selection technique.
2. Initialize centroids: Randomly select k data samples from the dataset as initial centroids:
$$\mu _1^{(0)}, \mu _2^{(0)}, \ldots , \mu _k^{(0)} \in \mathbb {R}^d$$
where d is the number of features in the dataset.
3. Assign data samples to the nearest centroid: For every data sample, \(x_i\), compute the Euclidean distance to each cluster centroid and assign the sample to the closest one according to Eq. 3:
$$\begin{aligned} c_i = \arg \min _{j} \Vert x_i - \mu _j\Vert ^2 \end{aligned}$$(3)
where:
- \(c_i\) is the cluster assigned to the \(i^{\text {th}}\) data sample.
- \(\Vert \cdot \Vert\) refers to the Euclidean norm.
4. Recompute centroids: Update each centroid by calculating the mean of all data samples assigned to its cluster using Eq. 4:
$$\begin{aligned} \mu _j^{(t+1)} = \frac{1}{|C_j|} \sum _{x_i \in C_j} x_i \end{aligned}$$(4)
where:
- \(C_j\) is the set of data points in cluster j at iteration t.
- \(|C_j|\) is the number of points in cluster j.
5. Repeat steps 3 and 4 until convergence, that is, until one of the following holds:
- The centroids no longer change significantly between iterations.
- The maximum number of iterations has been reached.
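The five steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name `kmeans`, the tolerance `tol`, the `seed` argument, the handling of empty clusters, and the toy data are all hypothetical choices.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means with Euclidean distance, following steps 1-5."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data samples as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):  # Step 5: cap on the number of iterations
        # Step 3: assign every sample to its nearest centroid (Eq. 3).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: recompute each centroid as the mean of its members (Eq. 4).
        new_centroids = []
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        # Step 5: stop once the centroids no longer move significantly.
        if shift < tol:
            break
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three samples.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself, not the label values, is meaningful.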
Distance metrics
As explained earlier, the Euclidean distance (Eq. 3) is the default metric type used in K_Means clustering method. The performance of the K-Means clustering is highly sensitive to the choice of distance metric, as it defines the geometry of the feature space and influences cluster shape, cohesion, and separation. In medical datasets, features often exhibit heterogeneous scales, sparsity, and directional trends, requiring a diverse set of distance measures or a mixed method to capture different data properties22.
In this paper, five well-known distance metrics are used: cosine, cityblock (Manhattan), Euclidean, Chebyshev, and Minkowski. They were selected for their complementary geometric interpretations and proven utility in medical data clustering. More precisely, these metrics can capture various geometric and structural characteristics of patient data, which is important in healthcare applications where directional patterns, correlations, and feature scales vary greatly. This subsection describes these metrics as follows41,42,43,44,45,46:
Cosine distance metric
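The cosine distance metric measures the angle between two non-zero vectors \(\textbf{p} = (p_1, p_2, \dots , p_n)\) and \(\textbf{q} = (q_1, q_2, \dots , q_n)\), ignoring magnitude. This is valuable in medical data where direction matters more than absolute values. It is derived from the cosine similarity and is computed by Eq. 5:
$$\begin{aligned} d_{\cos }(\textbf{p}, \textbf{q}) = 1 - \cos (\theta ) \end{aligned}$$(5)
where \(\cos (\theta )\) is the cosine similarity between the two vectors, computed by Eq. 6:
$$\begin{aligned} \cos (\theta ) = \frac{\textbf{p} \cdot \textbf{q}}{\Vert \textbf{p}\Vert \, \Vert \textbf{q}\Vert } \end{aligned}$$(6)
Substituting Eq. 6 into Eq. 5, the cosine distance can be rewritten as Eq. 7:
$$\begin{aligned} d_{\cos }(\textbf{p}, \textbf{q}) = 1 - \frac{\sum _{i=1}^{n} p_i q_i}{\sqrt{\sum _{i=1}^{n} p_i^2} \, \sqrt{\sum _{i=1}^{n} q_i^2}} \end{aligned}$$(7)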
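A short sketch may make the behavior of the cosine distance concrete: it is one minus the cosine similarity, so it depends only on the angle between two vectors and not on their lengths. The function name and the example vectors below are illustrative assumptions.

```python
import math


def cosine_distance(p, q):
    """Cosine distance: 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])                # right angle
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])                 # opposite directions
print(same_direction, orthogonal, opposite)
```

Parallel vectors give a distance of (numerically) zero even though one is twice as long as the other, orthogonal vectors give 1, and opposite vectors give 2, which is the maximum.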
Cityblock distance metric
It is also known as the Manhattan, rectilinear, taxicab, or \(L_1\) distance. This metric computes the distance between two data samples as the sum of the absolute differences of their Cartesian coordinates. The cityblock distance metric works better when dataset features are sparse or have non-uniform distributions, which is a prevalent characteristic in clinical data with binary indicators (e.g., presence/absence of symptoms). It is also more resilient to outliers. Mathematically, the cityblock distance between two vectors \(\textbf{p} = (p_1, p_2, \dots , p_n)\) and \(\textbf{q} = (q_1, q_2, \dots , q_n)\) is defined by Eq. 8:
$$\begin{aligned} d_{\text {cityblock}}(\textbf{p}, \textbf{q}) = \sum _{i=1}^{n} |p_i - q_i| \end{aligned}$$(8)
Euclidean distance metric
It is also known as the \(L_2\) norm. Euclidean distance computes the straight-line separation between two samples. It is popular and easy to use, but it is susceptible to scale discrepancies and outliers, which are frequent in medical datasets with mixed units. It corresponds to the norm \(\Vert \cdot \Vert\) used in Eq. 3.
Chebyshev distance metric
This metric is also known as the maximum distance or the \(L_\infty\) norm. Chebyshev distance uses the greatest difference along any single feature. It can be used to detect extreme circumstances in which the largest deviation dominates classification (for example, when a single abnormal lab value indicates disease). The Chebyshev distance between two vectors \(\textbf{p} = (p_1, p_2, \dots , p_n)\) and \(\textbf{q} = (q_1, q_2, \dots , q_n)\) is computed by Eq. 9:
$$\begin{aligned} d_{\text {Chebyshev}}(\textbf{p}, \textbf{q}) = \max _{1 \le i \le n} |p_i - q_i| \end{aligned}$$(9)
Minkowski distance metric
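It is also known as the generalized distance metric. The Minkowski distance generalizes the Euclidean and cityblock distances, which are the special cases p=2 and p=1, respectively. By adjusting p, it provides flexibility in balancing sensitivity to large deviations against sensitivity to consistent small deviations, which is helpful for adapting to the behavior of a particular medical dataset. The Minkowski distance between two points \(\textbf{p} = (p_1, p_2, \dots , p_n)\) and \(\textbf{q} = (q_1, q_2, \dots , q_n)\) is calculated by Eq. 10:
$$\begin{aligned} d_{\text {Minkowski}}(\textbf{p}, \textbf{q}) = \left( \sum _{i=1}^{n} |p_i - q_i|^{p} \right) ^{1/p} \end{aligned}$$(10)
where the scalar exponent \(p \ge 1\) is the order of the metric (not to be confused with the vector \(\textbf{p}\)).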
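The relationship between the Minkowski family members can be checked numerically. The sketch below, with illustrative function names and example vectors, shows that order 1 reproduces the cityblock distance, order 2 reproduces the Euclidean distance, and a large order approaches the Chebyshev distance.

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order (order >= 1)."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)


def cityblock(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]  # coordinate differences: 3, 4, 0

d1 = minkowski(p, q, 1)       # equals the cityblock distance: 3 + 4 + 0
d2 = minkowski(p, q, 2)       # equals the Euclidean distance: sqrt(9 + 16)
d_large = minkowski(p, q, 50)  # approaches the Chebyshev distance
print(d1, d2, d_large, cityblock(p, q), chebyshev(p, q))
```

As the order grows, the largest coordinate difference dominates the sum, which is why the Chebyshev distance is the limiting case of the family.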
Applying these five distance metrics, which cover a wide range of geometric assumptions, enables a thorough assessment of the impact of distance choice on clustering in medical contexts.
Although other distance metrics exist (such as Mahalanobis and kernel-based distances), we prioritize well-understood, parameter-light measures that are simple to apply, because of our emphasis on simplicity, interpretability, and reproducibility.
The Mahalanobis distance requires estimating the covariance matrix, and this estimate becomes unstable for small or medium-sized datasets (BCW, for example, has only 569 samples). Additionally, it assumes multivariate normality, which is frequently violated in real medical data. Kernel-based distances, on the other hand, are effective in deep and kernel clustering frameworks and may capture nonlinear correlations. Nevertheless, they conflict with the objective of transparent, deployable models in clinical contexts by introducing more hyperparameters (such as kernel width), raising processing costs, and decreasing interpretability.
Algorithm 2 outlines the steps used to compute the accuracy for each of the five distance metrics.
K_means clustering method with distance methods
In our proposed K_Means method, we use a hybrid distance method that combines the cosine and cityblock distance metrics. The choice of these two metrics does not imply that they are the two individually most accurate metrics. Rather, the selection is based on the final accuracy achieved by the combination.
More specifically, we compute the accuracy over the whole range of mixing weights and then determine the best-performing combination. We start with the cosine weight at 0.0 and the cityblock weight at 1.0, and then increase the former to 1.0 while decreasing the latter to 0.0. This process ensures that the accuracy is computed for every weight pair, allowing us to identify the combination that yields the highest accuracy.
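The weight sweep described above can be sketched as follows. This is a hedged illustration, not Algorithm 3 itself: it assumes the hybrid distance is the convex combination w_cos * cosine + (1 - w_cos) * cityblock, that the weight grid uses a step of 0.1, and that features are already scaled. The names `hybrid_distance`, `sweep_weights`, and `toy_accuracy`, as well as the fixed centroids and labeled samples, are hypothetical stand-ins for the full clustering and scoring pipeline.

```python
import math


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms


def cityblock_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def hybrid_distance(p, q, w_cos):
    """Weighted mix: w_cos on cosine, (1 - w_cos) on cityblock (assumed form)."""
    return w_cos * cosine_distance(p, q) + (1.0 - w_cos) * cityblock_distance(p, q)


def sweep_weights(evaluate, step=0.1):
    """Evaluate every cosine weight from 0.0 to 1.0 and return the best one."""
    n_steps = round(1.0 / step)
    results = {}
    for i in range(n_steps + 1):
        w_cos = round(i * step, 10)  # avoid floating-point drift in the keys
        results[w_cos] = evaluate(w_cos)
    best = max(results, key=results.get)
    return best, results


# Hypothetical stand-in for "cluster, then score": fixed centroids and labels.
centroids = [[1.0, 0.2], [0.2, 1.0]]
labelled = [([3.0, 0.5], 0), ([0.4, 2.0], 1), ([0.9, 0.8], 0), ([0.3, 0.4], 1)]


def toy_accuracy(w_cos):
    hits = 0
    for x, label in labelled:
        pred = min(range(len(centroids)),
                   key=lambda j: hybrid_distance(x, centroids[j], w_cos))
        hits += int(pred == label)
    return hits / len(labelled)


best_w, accuracies = sweep_weights(toy_accuracy)
print(best_w, accuracies[best_w])
```

In a full pipeline, `evaluate` would run K_Means with the hybrid distance for the given weight and return the resulting accuracy; when several weights tie, this sketch keeps the smallest one.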
Algorithm 3 lists the proposed hybrid distance method.
Proposed hybrid distance method
Cluster refinement
Cluster refinement refers to the iterative procedures used to enhance the quality, accuracy, and interpretability of the final clusters following an initial data clustering47. These procedures are particularly crucial in real-world applications, where noise, inaccurate clustering assumptions, or improper distance measurements might result in unsatisfactory initial clusters48.
Objectives of cluster refinement
Cluster refinement serves several objectives, such as48,49:
A. Improve the quality of clusters.
B. Recheck and determine the optimal number of clusters.
C. Interpretability.
D. Adaptation to changes in data.
E. Adaptation to evolving data.
Methods of cluster refinement
In general, the structure and performance of any clustering method, such as the K_Means method, are influenced by the cluster refinement step. Cluster refinement can be implemented using various methods; the following are the main ones50,51,52:
A. Reassign the data samples.
B. Merging and splitting clusters.
C. Removing noise.
D. Parameter tuning.
E. Centroid update smoothing.
F. Iterative re-clustering.
In our proposed method, we introduce a new cluster refinement step. This step aims to improve cluster quality by detecting distant data samples and then reassigning them.
Theoretical basis for Z-score–based cluster refinement
After constructing the initial clusters, some data samples may be far from their designated cluster centroid, potentially due to noise, measurement error, or complex boundary regions. Cluster homogeneity and classification accuracy may be weakened by these outliers. To overcome this issue, we introduce a statistically grounded refinement mechanism using the Z-score method.
Practically, the Z_score is a statistical metric that indicates how a value relates to the mean of a set of values, expressed in units of standard deviation53. It is employed to find data samples that deviate significantly from the norm. In other words, we apply this statistical metric to identify the data samples that are not similar to their current cluster.
The Z_score for a data sample is computed by Eq. 11 (ref. 54):
$$\begin{aligned} Z = \frac{x - \mu }{\sigma } \end{aligned}$$(11)
where \(x\) is the observed value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation of the feature across the dataset.
The mechanisms of cluster refinement via detection of the distant data samples and reassignment of these samples to the accurate clusters are detailed in Algorithm 4.
Cluster refinement via distant detection and reassignment
According to this algorithm, once the distant samples (outliers) are identified, they are reassigned to the nearest cluster. This step improves both cluster accuracy, by correcting misassignments, and cluster cohesion. The refinement step is deterministic and reproducible, which makes it well suited to clinical applications where transparency and stability are essential.
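The detection and reassignment idea can be sketched as follows. This is a hedged illustration, not a transcription of Algorithm 4: it assumes the Z-score is computed over the distances of all samples to their own centroid (pooled over clusters, using the population standard deviation), that centroids stay fixed during refinement, and that Euclidean distance is used for the reassignment. The function name `refine_clusters` and its parameters are hypothetical.

```python
import math
import statistics


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def refine_clusters(points, labels, centroids, threshold=3.0, dist=euclidean):
    """Flag samples whose distance to their own centroid has a Z-score above
    `threshold`, then move each flagged sample to its nearest centroid.

    Returns (new_labels, flagged_indices).
    """
    # Distance of every sample to the centroid of its current cluster.
    own_dist = [dist(p, centroids[l]) for p, l in zip(points, labels)]
    mu = statistics.mean(own_dist)
    sigma = statistics.pstdev(own_dist)  # population std (an assumption)
    if sigma == 0.0:
        return list(labels), []  # all samples equally far: nothing to flag

    new_labels = list(labels)
    flagged = []
    for i, (p, d) in enumerate(zip(points, own_dist)):
        if (d - mu) / sigma > threshold:
            flagged.append(i)
            new_labels[i] = min(range(len(centroids)),
                                key=lambda j: dist(p, centroids[j]))
    return new_labels, flagged
```

With a population standard deviation, a single sample can only exceed a Z-score of 3.0 when there are at least eleven samples, so the threshold is meaningful only on reasonably sized clusters.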
Advanced clustering methods for medical data analysis
To evaluate the effectiveness and robustness of the proposed K_Means method, we conducted a comprehensive comparison against two state-of-the-art clustering approaches: deep clustering and spectral clustering. This section describes these methods.
Deep clustering method
The deep clustering method is an advanced unsupervised learning framework that combines deep neural networks with clustering objectives to simultaneously learn meaningful feature representations and group the data samples into coherent clusters. It learns a low-dimensional latent space in which the data structure is more suitable for clustering. Deep clustering is a powerful method for medical datasets, where the raw features may fail to capture non-linear patterns or hidden patient subtypes. Moreover, it can improve cluster separation through the joint optimization of representation and clustering. The deep clustering method consists of the following steps:
(1) Autoencoder framework: An autoencoder consists of an encoder and a decoder network. Given an input \(\textbf{x}_i \in \mathbb {R}^d\), the encoder maps it to a latent representation \(\textbf{z}_i \in \mathbb {R}^p\) as in Eq. 12:
$$\begin{aligned} \textbf{z}_i = f_{\theta }(\textbf{x}_i) \end{aligned}$$(12)
The decoder reconstructs the input from the latent code by Eq. 13:
$$\begin{aligned} \hat{\textbf{x}}_i = g_{\phi }(\textbf{z}_i) \end{aligned}$$(13)
After pre-training, only the encoder is retained to extract latent features \(\textbf{Z} = \{\textbf{z}_1, \dots , \textbf{z}_N\}\).
(2) Soft cluster assignment: To compute the soft assignments \(q_{ij}\), that is, the probability that sample i belongs to cluster j, we use Eq. 14:
$$\begin{aligned} q_{ij} = \frac{(1 + \Vert \textbf{z}_i - \varvec{\mu }_j\Vert ^2)^{-1}}{\sum _{j'} (1 + \Vert \textbf{z}_i - \varvec{\mu }_{j'}\Vert ^2)^{-1}} \end{aligned}$$(14)
where:
- \(\textbf{z}_i\): latent embedding of sample i,
- \(\varvec{\mu }_j\): centroid of cluster j, initialized via K-means,
- \(q_{ij} \ge 0\) and \(\sum _j q_{ij} = 1\).
(3) Target distribution (refinement): To refine cluster assignments, we define a target distribution (\(p_{ij}\)) that sharpens high-confidence predictions by Eq. 15:
$$\begin{aligned} p_{ij} = \frac{q_{ij}^2 / \sum _i q_{ij}}{\sum _{j'} (q_{ij'}^2 / \sum _i q_{ij'})} \end{aligned}$$(15)
This emphasizes confident assignments and encourages cluster cohesion.
(4) Joint optimization (Fine-tuning): Instead of updating cluster centroids, the deep clustering model fine-tunes the encoder by minimizing the Kullback-Leibler (KL) divergence between the target distribution \(\textbf{P}\) and the soft assignment \(\textbf{Q}\) as in Eq. 16:
$$\begin{aligned} \mathcal {L}_{\text {cluster}} = \text {KL}(\textbf{P} \Vert \textbf{Q}) = \sum _{i=1}^{N} \sum _{j=1}^{k} p_{ij} \log \frac{p_{ij}}{q_{ij}} \end{aligned}$$(16)
This step refines the latent representation (not the centroids) to improve clustering quality.
(5) Final cluster assignment: For inference (e.g., on test data), the final cluster labels are obtained by Eq. 17:
$$\begin{aligned} \hat{y}_i = \arg \max _j q_{ij} \end{aligned}$$(17)
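Steps (2) to (5) can be illustrated with a small sketch. It covers only the assignment arithmetic of Eqs. 14 to 17 on already-computed latent vectors; the autoencoder and the gradient-based fine-tuning are omitted, and the function names are hypothetical.

```python
import math


def soft_assign(Z, centroids):
    """Eq. 14: Student's t kernel similarities, normalized per sample."""
    Q = []
    for z in Z:
        sims = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)))
                for mu in centroids]
        total = sum(sims)
        Q.append([s / total for s in sims])
    return Q


def target_distribution(Q):
    """Eq. 15: square q_ij, divide by the soft cluster frequency, renormalize."""
    k = len(Q[0])
    freq = [sum(row[j] for row in Q) for j in range(k)]
    P = []
    for row in Q:
        weights = [row[j] ** 2 / freq[j] for j in range(k)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def kl_divergence(P, Q):
    """Eq. 16: sum over samples and clusters of p_ij * log(p_ij / q_ij)."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)


def hard_labels(Q):
    """Eq. 17: index of the most probable cluster for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]
```

Because the target distribution squares the soft assignments, a sample that already leans toward one cluster receives an even larger target probability for it, which is the sharpening effect described in step (3).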
Spectral clustering method
Spectral clustering transforms the clustering problem into a graph partitioning task: it leverages the eigenstructure of a similarity matrix to embed the data into a lower-dimensional space where traditional clustering methods (like K-means) work better. It can detect irregularly shaped clusters by modeling the data samples as nodes in a graph connected by similarity-weighted edges, which makes it suitable for disease subtyping. Because medical data often exhibits complex, non-linear patterns, spectral clustering is widely used for medical datasets.
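Let \(\mathcal {X} = \{\textbf{x}_1, \dots , \textbf{x}_n\}\) be the input data. A similarity matrix \(\textbf{A} \in \mathbb {R}^{n \times n}\) is constructed using a kernel function, such as the RBF (Gaussian) kernel, as in Eq. 18:
$$\begin{aligned} A_{ij} = \exp \left( -\frac{\Vert \textbf{x}_i - \textbf{x}_j\Vert ^2}{2\sigma ^2}\right) \end{aligned}$$(18)
where \(\sigma\) is the kernel bandwidth. The graph Laplacian \(\textbf{L}\) is then computed by Eq. 19:
$$\begin{aligned} \textbf{L} = \textbf{D} - \textbf{A} \end{aligned}$$(19)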
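where \(\textbf{D}\) is the degree matrix, a diagonal matrix with entries:
$$\begin{aligned} D_{ii} = \sum _{j=1}^{n} A_{ij} \end{aligned}$$(20)
Next, the k smallest (or first k non-trivial) eigenvectors of \(\textbf{L}\) are computed. Let \(\textbf{U} \in \mathbb {R}^{n \times k}\) be the matrix whose columns are these eigenvectors:
$$\begin{aligned} \textbf{U} = [\textbf{u}_1, \textbf{u}_2, \dots , \textbf{u}_k] \end{aligned}$$(21)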
Lastly, the K_Means clustering method is applied to the rows of \(\textbf{U}\) to obtain the final cluster assignments.
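The pipeline above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the configuration used in the experiments: it assumes the unnormalized Laplacian (degree matrix minus affinity matrix), an RBF affinity parameterized by `gamma` (equivalent to \(1/(2\sigma^2)\) for a bandwidth \(\sigma\)), no self-loops, and a tiny deterministic K-Means with farthest-point initialization. The names `spectral_embedding` and `kmeans_rows` are hypothetical.

```python
import numpy as np


def spectral_embedding(X, k, gamma=1.0):
    """Return the n x k matrix U whose columns are the eigenvectors of the
    unnormalized Laplacian L = D - A for its k smallest eigenvalues."""
    X = np.asarray(X, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-gamma * sq_dists)      # RBF affinity (gamma is an assumption)
    np.fill_diagonal(A, 0.0)           # no self-loops (assumed convention)
    D = np.diag(A.sum(axis=1))         # degree matrix
    L = D - A
    _, eigvecs = np.linalg.eigh(L)     # eigenvalues come back in ascending order
    return eigvecs[:, :k]


def kmeans_rows(U, k, n_iter=20):
    """Tiny deterministic K-Means on the rows of U (farthest-point init)."""
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.sum((U[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

For example, `kmeans_rows(spectral_embedding(X, 2), 2)` returns one cluster index per row of `X`. Library implementations typically use a normalized Laplacian and a sparse neighborhood graph instead of the dense matrix built here.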
Methodology
The methodology of the proposed K_Means method is depicted in Fig. 2 and comprises the following phases:
The methodology of proposed K_Means method.
1. Medical dataset: First, the medical datasets are loaded into our proposed method. These datasets are explained in Sect. “Datasets”.
2. Data preprocessing: In this phase, the dataset is preprocessed as follows:
I. Preprocessing steps: The ID column of the given dataset is dropped. For the BCW dataset, we map the ’Benign’ class to 0, while the ’Malignant’ class is mapped to 1. For the Heart disease dataset, ’No Heart Disease’ is mapped to 0, whereas ’Heart Disease’ is mapped to 1.
II. Feature scaling: In this step, the dataset features are standardized to zero mean and unit variance using the Standard Scaler technique, as represented by Eq. 22:
$$\begin{aligned} z = \frac{x - \mu }{\sigma } \end{aligned}$$(22)
where:
- x: original feature value
- \(\mu\): mean of the feature (computed on the training data)
- \(\sigma\): standard deviation of the feature
- z: standardized (scaled) feature value
III. Feature selection: In the current paper, the Chi-Square (Chi²) feature selection method is used. It is computed by Eq. 2. For the BCW dataset, we select 25 features, while for the Heart disease dataset, 10 features are selected.
IV. SMOTE technique: The next preprocessing step is the SMOTE technique. It is applied to overcome the issue of imbalanced classes in the datasets.
V. Splitting dataset: The given medical dataset is partitioned into training data (80%) and testing data (20%).
3. Classic K-Means: In this phase, the classic K_Means clustering method is applied. The distance methods used in this phase are the types explained in section (3.3). The final step is computing the accuracy for every distance method separately.
4. The proposed method: This phase includes the steps of the proposed method. First, the K_Means method is applied and the hybrid distance method is constructed; finally, the cluster refinement step is performed. More details of these two steps are as follows:
A. Construct hybrid distance method: In the first step, the proposed hybrid method is constructed by mixing two distance methods, cosine and cityblock. The accuracy is computed for every weighting of the two methods, with weights ranging from 0.00 to 1.00. This step ensures that all the weights are explored to determine the best accuracy.
B. Cluster refinement method: This is the second step of our proposed method. It aims to improve the cluster quality and consists of two sub-steps:
- Identifying all the data samples that are far from their assigned clusters by applying the Z_score method through Eq. 11. We compute the standard deviation (std) of the data sample distances and then set a threshold to identify these data samples. These data samples are mostly poorly aligned with their own cluster.
- Thereafter, we reassign these samples to the nearest better-fitting clusters.
5. Results: Once the proposed K_Means clustering method is applied, the corresponding results are produced.
6. Evaluation: This is the last phase of our proposed method. The confusion matrix is a table used to summarize the performance of a classification model. It consists of False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN). Accuracy, precision, recall, F1_score, Adjusted Rand Index (ARI), homogeneity, and execution time are all used in this paper to evaluate the results of the current method. These metrics are computed by Eqs. 23–28 (refs. 55,56):
$$\begin{aligned} \text {Accuracy}= & \frac{\text {TP} + \text {TN}}{\text {TP} + \text {TN} + \text {FP} + \text {FN}} \end{aligned}$$(23)
$$\begin{aligned} \text {Precision}= & \frac{\text {TP}}{\text {TP} + \text {FP}} \end{aligned}$$(24)
$$\begin{aligned} \text {Recall}= & \frac{\text {TP}}{\text {TP} + \text {FN}} \end{aligned}$$(25)
$$\begin{aligned} F1\_Score= & 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$(26)
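Eqs. 23 to 26 can be computed directly from the confusion counts, as in the sketch below. Because cluster indices are arbitrary, cluster outputs must first be mapped to class labels; the helper `align_binary_clusters` shows one plausible way to do this for two clusters (keep whichever mapping agrees more with the ground truth), which is an assumption of this sketch and not necessarily the mapping used in the paper. All names and the example counts are hypothetical.

```python
def align_binary_clusters(y_true, clusters):
    """Map cluster ids {0, 1} to class labels by keeping the mapping
    (identity or flipped) that agrees with more ground-truth labels."""
    flipped = [1 - c for c in clusters]
    agree_same = sum(t == c for t, c in zip(y_true, clusters))
    agree_flip = sum(t == c for t, c in zip(y_true, flipped))
    return list(clusters) if agree_same >= agree_flip else flipped


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with class 1 treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Eqs. 23-26; ratios with an empty denominator are reported as 0.0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, with the illustrative counts TP = 40, TN = 50, FP = 5, and FN = 5, the accuracy is 90/100 = 0.9, and precision, recall, and F1 all equal 40/45.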
Moreover, we compute the Adjusted Rand Index (ARI), an indicator of how similar two clusterings of the same data are to one another. The value of the ARI metric usually varies from 1 to −1 as follows: if ARI = 1, there is a perfect match between the two clusterings; if ARI = 0, the agreement is what random labeling would produce; and if ARI = −1, it reflects the worst case, where the agreement is worse than random. As a result, a larger ARI value indicates that the clustering results are more consistent with the true labels. The ARI metric is computed by Eq. 27:
$$\begin{aligned} \text {ARI} = \frac{\sum _{ij} \binom{n_{ij}}{2} - \left[ \sum _i \binom{a_i}{2} \sum _j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[ \sum _i \binom{a_i}{2} + \sum _j \binom{b_j}{2}\right] - \left[ \sum _i \binom{a_i}{2} \sum _j \binom{b_j}{2}\right] / \binom{n}{2}} \end{aligned}$$(27)
where:
- \(n_{ij}\): number of data samples that fall in both cluster i of the first clustering and cluster j of the second clustering.
- \(a_i = \sum _j n_{ij}\) (number of data samples in cluster i from the first clustering).
- \(b_j = \sum _i n_{ij}\) (number of data samples in cluster j from the second clustering).
- n: total number of data samples.
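The ARI can be computed from the contingency counts defined above. The following is a minimal sketch using only the standard library; the function name is hypothetical, and the degenerate cases (fewer than two samples, or a zero denominator) are handled by returning 1.0, which is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    n_ij = Counter(zip(labels_a, labels_b))   # contingency table entries
    a = Counter(labels_a)                     # row sums a_i
    b = Counter(labels_b)                     # column sums b_j

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())

    expected = sum_a * sum_b / comb(n, 2)     # index expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0                            # degenerate: e.g. one cluster each
    return (sum_ij - expected) / (max_index - expected)
```

As a worked example, comparing the labelings [0, 0, 1, 1] and [0, 0, 1, 2] gives a pair-count sum of 1, an expected index of 2·1/6 = 1/3, and a maximum index of 1.5, so ARI = (1 − 1/3)/(1.5 − 1/3) = 4/7 ≈ 0.571.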
The results are compared to the ground truth using a clustering metric called Homogeneity. A cluster is considered homogeneous if all of its data samples belong to the same class19. Eq. 28 explains how to compute this metric:
$$\begin{aligned} h = 1 - \frac{H(C|K)}{H(C)} \end{aligned}$$(28)
where:
- H(C|K) is the conditional entropy of classes given clusters,
- H(C) is the entropy of the class distribution.
Similar to ARI, a higher score is better: a homogeneity of 1 reflects perfect homogeneity, with each cluster consisting solely of members of a single class.
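The homogeneity score can be sketched from its two entropy terms as follows. The function name is hypothetical, and the case of a single class (where the class entropy is zero) is defined as 1.0 by convention.

```python
from collections import Counter
from math import log


def homogeneity(classes, clusters):
    """h = 1 - H(C|K) / H(C), with h = 1.0 when H(C) = 0."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint = Counter(zip(classes, clusters))   # n_ck: class c inside cluster k

    # Entropy of the class distribution.
    h_c = -sum((c / n) * log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0

    # Conditional entropy of classes given the cluster assignment.
    h_c_given_k = -sum((n_ck / n) * log(n_ck / cluster_counts[k])
                       for (_, k), n_ck in joint.items())
    return 1.0 - h_c_given_k / h_c
```

Splitting a pure cluster into two does not lower the score (each part is still pure), whereas merging all samples into one cluster drives it to 0 when the classes are balanced.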
Finally, we measure the execution time that the proposed method requires to run to completion.
Experimental results
This section explains in detail the results of applying the proposed K_Means clustering method, discussion, and comparisons.
Results
Sensitivity analysis of feature numbers
As we explained earlier, we conducted a sensitivity analysis of the number of selected features for both datasets. Specifically, the number of features ranges over (5, 10, ..., up to the full feature set), and for each setting we compute the most important metrics (accuracy, F1_score, homogeneity, and ARI).
Table 2 reports the results for both datasets with different numbers of features.
From Table 2, we conclude that for the Breast Cancer Wisconsin (BCW) dataset, the performance of the clustering method peaks at 25 features (accuracy of 0.9825 and homogeneity of 0.8676), while for the Heart Disease dataset, the proposed clustering method achieves the best results at 10 features (accuracy of 0.90 and homogeneity of 0.5352).
Breast Cancer Wisconsin (BCW) dataset
The first dataset used to evaluate our method is the Breast Cancer Wisconsin (BCW) dataset. In the preprocessing step, the Chi² feature selection method is applied with the feature number being 25. Table 3 displays the selected features.
Then, we apply the K_Means clustering method to this dataset and compute the accuracy for all distances. Table 4 reports these accuracy values.
Fig. 3a shows the accuracy for all possible mixing percentages (0.0–1.0). Thereafter, the hybrid distance method is constructed. The cosine distance is the base metric, while the second method for the BCW dataset is the Cityblock. Since we build a hybrid distance method, we analyze the accuracy as a function of the weights of the two methods that form the final method. For the BCW dataset, the best accuracy of mixing these two distance methods is 0.9825, as shown in Fig. 3b. In practice, it can be noticed (from Fig. 3b) that the mixing accuracy partitions into three ranges: (1) Cosine: 0.00–0.47 with accuracy of 0.9825, (2) Cosine: 0.48–0.97 with accuracy of 0.9737, and (3) Cosine: 0.98–1.00 with accuracy of 0.9649.
After we obtain the accuracy of the proposed method, we compute the other measurements of precision, recall, and F1_score, as defined in Eqs. 24, 25, and 26. The scores for these metrics are 0.9762, 0.9535, and 0.9647, respectively. Figure 3c illustrates the values of these measurements. At this point, the clusters have been constructed using the proposed K_Means method with the hybrid distance method. Figure 3d visualizes the two constructed clusters, with data samples shown in red and black.
Results of Breast Cancer Wisconsin (BCW) dataset.
When the Z_score threshold is set to 3.0, 74 data samples that are distant from their assigned cluster are detected. The positions of these 74 data samples in the BCW dataset are listed in Table 5.
Figure 4a highlights the extreme behavior of distant data samples compared to non-distant data samples. Red lines (representing the distant samples) show large deviations across features, whereas the blue lines (non-distant samples) remain relatively consistent. Figure 4b shows the data samples in the constructed clusters, where distant data samples stand out with high contrast and therefore call for corrective treatment. Fig. 4c shows the reassignment of distant data samples after they have been identified.
Results of Breast Cancer Wisconsin (BCW) dataset.
Finally, for the BCW dataset, the ARI metric achieved by the proposed method is 0.9303, with an execution time of 0.0025 seconds, while the homogeneity score increased from 0.7721 before refinement to 0.8676 after the refinement step.
Heart disease dataset
The second dataset used to evaluate our proposed method is the Heart disease dataset. This dataset includes 10 selected features, which are explained in Table 6.
After that, the K_Means method is applied to the Heart disease dataset. The accuracy for all distance methods is illustrated in Table 7.
In the same context, Fig. 5a shows the accuracy of all these methods, while Fig. 5b displays the accuracy of the proposed hybrid distance, which reaches 0.90. Similar to the BCW dataset, the accuracy of the hybrid distance method for the Heart disease dataset also splits into three ranges (as shown in Fig. 5b). These ranges are (1) Cosine: 0.00–0.74 with an accuracy of 0.8667; (2) Cosine (with two intervals): 0.75–0.88 and 0.96–0.98 with an accuracy of 0.8833; and (3) Cosine (also with two intervals): 0.89–0.95 and 1.00 with an accuracy of 0.90.
The other measurements derived from the confusion matrix are shown in Fig. 5c. By applying the above K_Means method, the initial clusters are built. Figure 5d depicts the cluster visualization.
Results of Heart disease dataset.
The Z_score threshold is set to 3.0. Consequently, our method detects 9 data samples that are distant from their assigned cluster. Table 8 shows the positions of these data samples.
Accordingly, Fig. 6a distinguishes distant and non-distant data samples using red and blue lines across the features. Figure 6b further shows the data samples based on their features, while Fig. 6c shows the reassignment of the distant data samples.
Results of Heart Disease dataset.
Lastly, for the Heart Disease dataset, the ARI metric is 0.6334, and the execution time is 0.0140 seconds. The homogeneity increases from 0.4335 to 0.5352 after the refinement step.
Advanced clustering methods
The results of the proposed K_Means clustering method are compared with the results of the advanced clustering methods (deep clustering and spectral clustering methods) that are explained in Sect. "Advanced clustering methods for medical data analysis". Figure 7 depicts the comparison results.
More specifically, Fig. 7a illustrates the comparison on Breast Cancer Wisconsin (BCW) dataset, while Fig. 7b presents the comparison on the Heart Disease dataset.
Comparison of the proposed clustering method with deep clustering and spectral clustering methods for both datasets.
Discussion
As we explained earlier, the proposed K_Means clustering method is evaluated on two medical datasets. This section discusses the results on these datasets as follows:
Breast Cancer Wisconsin (BCW) dataset:
With an accuracy of 0.9825 (Fig. 3a), the hybrid distance method (Cosine + Cityblock) outperforms the Euclidean, Chebyshev, and Minkowski metrics for all mixing percentages. The high precision, recall, and F1_score (> 0.95) indicate that the model performs well with few false positives or negatives, as shown in Fig. 3c. Even though the cosine accuracy alone is consistently high, the hybrid distance method offers more robustness, underscoring the advantage of combining different distance measurements. The red lines in Fig. 4a, which represent distant samples, frequently go beyond the range of the blue lines, which represent the non-distant samples. This illustrates how distant samples can distort feature distributions and affect model performance. With a large number of distant data samples, Fig. 4b demonstrates notable variability across features, suggesting the existence of remote or extreme data samples that may affect model performance. Figure 4c displays less variability following the reassignment of distant data samples, suggesting higher data quality and increased model stability.
Thus, the comparison between Fig. 4b and c shows a lower number of distant samples and narrower interquartile ranges, which indicates that the feature distributions have become more consistent after the reassignment of these samples.
Heart disease dataset:
For all mixing percentages (Fig. 5a), the hybrid distance method (Cosine + Cityblock) outperforms the other metrics (Euclidean, Chebyshev, and Minkowski), achieving the highest accuracy of 0.9000. This implies that performance can be enhanced by mixing different distance measurements. The analysis of the confusion matrix metrics (Fig. 5c) indicates that the majority of predicted positives are correct, as evidenced by the high precision of 0.9091. In Fig. 6a, features such as 0, 2, 6, and 8 exhibit clear distinctions between distant and non-distant samples, indicating that they have a greater influence on the distribution of the data. Before the reassignment step, Fig. 6b shows a large number of distant samples and substantial variability in several features (e.g., 2, 7, 9), which implies that the distant samples skew the distributions. After the reassignment of distant samples, Fig. 6c displays less variability, suggesting enhanced data quality and feature consistency. As a result, the reassignment step yields a more robust method, as evidenced by the decrease in distant-sample counts and the tightening of interquartile ranges. Finally, the cluster visualizations in Figs. 3d and 5d confirm that the Z-score threshold of > 3.0 successfully identifies data samples with unusual patterns in the data.
Overall Performance: The findings show strong classification performance with high accuracy, clear cluster separability, and balanced precision, recall, and F1_scores, which makes the hybrid distance method and the cluster refinement strategy efficient options for these two medical datasets. Moreover, the execution time is very low for both datasets.
Scalability: The proposed K_Means method runs efficiently on the small to medium-sized medical datasets considered here (BCW and Heart disease). Future optimizations such as approximate nearest neighbor techniques or parallel processing could enhance scalability to larger datasets.
Flexibility: By enabling the use of various distance combinations suited to particular datasets, the approach provides more flexibility than ordinary K_means. By increasing robustness to distant data samples, the cluster refinement step further improves adaptability. However, the current implementation requires manual selection of parameters, including the number of clusters (k) and the Z_score threshold, which may affect performance in other domains.
Assumptions and limitations
While the proposed K_Means method demonstrates significant improvements in clustering accuracy and homogeneity for medical datasets, it is based on several assumptions and subject to practical limitations that must be acknowledged.
Assumptions: The assumptions of our work are as follows:
\(\square\) The cluster number k is known: The proposed method assumes that the true number of classes (benign/malignant in BCW, presence/absence in Heart Disease) is known and used as k in the K_Means method, which simplifies the evaluation.
\(\square\) Medical data can be meaningfully grouped: Like traditional K_Means, our proposed method assumes that clusters are relatively compact and separable in feature space. This works well for many medical datasets but may fail if natural groupings are highly irregular.
\(\square\) Feature scaling and preprocessing are properly implemented: The performance of distance-based methods (especially cityblock and cosine) depends heavily on normalized or standardized data. Our approach assumes that all features are scaled appropriately before clustering.
\(\square\) The hybrid distance metric is additive: We assume that combining the cosine and cityblock distances via a weighted sum is sufficient to capture the directional and geometric similarities.
\(\square\) Outliers are defined as data samples far from centroids: The Z-score refinement step assumes that distant samples are likely misclassified. This in turn assumes that clusters are roughly symmetric and centered, which is valid in most cases but not all.
Limitations: The current paper has the following limitations:
\(\square\) Scalability to high-dimensional or very large medical datasets is untested: While efficient on the BCW and Heart Disease datasets, which are classified as small to medium-sized, the method has not been evaluated on larger medical datasets.
\(\square\) Evaluation step: We use the ground truth labels to compute the metrics for clustering validation. In truly unlabeled settings, other metrics (such as the silhouette score) would be needed.
\(\square\) Hybrid distance optimization is dataset-specific: The optimal weight ratio between cosine and cityblock distances must be tuned empirically for each dataset. While effective, this adds an extra hyperparameter tuning step.
\(\square\) The refinement step: The Z-score method assumes a normal-like distribution within clusters, which may not hold for all medical datasets.
\(\square\) The homogeneity on the Heart disease dataset is relatively low (0.5352) and needs to be improved.
Future works:
While this K_Means method achieves significant improvements in both clustering accuracy and homogeneity on the medical datasets, many promising directions remain for future research, such as:
\(\square\) Currently, the optimal mixing ratio between cosine and cityblock distances is determined empirically through grid search. Future work can integrate optimization algorithms like Particle Swarm Optimization (PSO), Genetic Algorithms (GA), or Bayesian optimization to automatically learn the best weight combination based on clustering performance metrics (such as the silhouette score), reducing human intervention and improving adaptability.
\(\square\) Integration with advanced initialization methods: The performance of K_Means is sensitive to initial centroid placement. The proposed method could be enhanced by combining it with K_Means++ or density-based seeding to improve convergence speed and final cluster quality.
\(\square\) Extension to high-dimensional and multi-modal medical datasets: The current evaluation focuses on two datasets (BCW and Heart Disease). Future work should test the method on high-dimensional data such as gene expression profiles, medical imaging features, or EHR-derived embeddings.
\(\square\) Adaptive distance metric selection: Instead of manually selecting the cosine and cityblock distances, future work can develop a meta-learning or decision-based system that dynamically chooses or combines distance metrics based on dataset characteristics (e.g., sparsity, dimensionality, feature correlation). This would make the method more generalizable to other domains.
Contribution of the technical improvements to clinical impact:
Although the proposed K_Means method with the hybrid distance strategy and cluster refinement is a methodological innovation, its significance extends beyond algorithmic performance. In early disease detection, even modest improvements in clustering accuracy and cluster cohesion can have profound clinical implications.
In breast cancer diagnosis, mammographic features are often subtle and high-dimensional. The traditional K_Means method (applying the Euclidean distance) may cluster benign and malignant tumors incorrectly if they are close in raw feature space but differ in direction or pattern (e.g., spiculated vs. smooth margins). By integrating the cosine distance, our hybrid method better captures directional differences in feature vectors (akin to how radiologists assess shape and orientation), while the cityblock distance handles feature-wise deviations more robustly than Euclidean distance in normalized data. This yields a more accurate separation of benign and malignant cases, which can reduce false negatives and enable earlier intervention.
On the other hand, in the Heart Disease dataset, patient data often has mixed signals—such as chest pain type, exercise-induced angina, and ST-segment changes—that may not follow Euclidean geometry. A patient with atypical symptoms may be misclustered if only straight-line distance is considered. Our hybrid method enhances the sensitivity to such non-spherical patterns, ensuring that high-risk individuals are not grouped with low-risk ones due to metric bias. Moreover, the refinement step (by reassigning the distant samples) helps identify borderline or atypical cases (like asymptomatic patients) that might otherwise be misclassified. By determining and re-evaluating these samples, this method acts as a safety net, improving early detection of patients who do not fit typical profiles but are still at risk.
Thus, these enhancements do not just improve numerical metrics; they enhance the reliability of unsupervised patient stratification, support earlier identification of high-risk individuals, and reduce the likelihood of missed diagnoses in screening pipelines. In real-world settings, such a system could be integrated into automated triage tools, electronic health record (EHR) analytics, or point-of-care diagnostic assistants, where timely and accurate clustering of patient data can trigger further testing, specialist referral, or preventive care—potentially saving lives.
Comparisons
The accuracy of the proposed K_Means clustering method is higher than that of many previous methods that are also based on the K_Means algorithm. Table 9 presents these comparisons.
Conclusion
Artificial intelligence and machine learning are revolutionizing healthcare by enabling faster, more accurate disease detection and supporting clinical decision-making. In this context, clustering methods like K_Means play a vital role in uncovering hidden patterns in unlabeled medical data, facilitating early diagnosis and patient stratification. This paper presents an enhanced K_Means clustering framework specifically designed to overcome two major limitations of the classical algorithm: (1) over-reliance on the Euclidean distance metric and (2) lack of a mechanism to refine clusters post-assignment. The proposed method introduces two key innovations: (i) a hybrid distance strategy that combines Cosine and Cityblock (Manhattan) metrics in a tunable manner, allowing for more flexible and accurate similarity measurement in complex, high-dimensional medical datasets; and (ii) a Z-score-based cluster refinement process that identifies and reassigns distant (outlier) samples to improve cluster cohesion and separation. These enhancements enable the algorithm to better capture both directional and geometric relationships in the data, resulting in more meaningful and interpretable clusters. Evaluated on two benchmark medical datasets, Breast Cancer Wisconsin (BCW) and Heart Disease, the proposed method achieves accuracy of 0.9825 and 0.90, precision of 0.9762 and 0.9091, recall of 0.9535 and 0.8333, F1_score of 0.9647 and 0.8696, ARI of 0.9303 and 0.6334, and homogeneity score of 0.8676 and 0.5352 for these datasets, respectively. Thus, it outperformed other advanced clustering methods (deep clustering and spectral clustering) and many previous methods. Moreover, the visual validation using t-SNE plots illustrates a clearer class separation after refinement, while box plots demonstrate reduced variability and fewer outliers, further supporting the effectiveness of the refinement step.
Feature importance analysis using the Chi² test highlights clinically relevant attributes, reinforcing the biological plausibility of the clusters. The strengths of this work are as follows: (1) the hybrid distance method adapts to data properties, outperforming single-metric methods; (2) the Z-score refinement step is simple, efficient, and effective, improving the cluster quality; (3) the method keeps its results interpretable, making it more suitable for clinical environments; and (4) the comprehensive evaluation through multiple metrics and visualization tools strengthens the validity of the findings. The main limitations are the untested scalability to high-dimensional datasets and the Z-score refinement, which assumes a normal-like distribution within clusters. By moving beyond the rigid use of Euclidean distance and introducing a lightweight refinement mechanism, our method bridges the gap between simplicity and performance. By enhancing the accuracy and improving the homogeneity, our method can support early disease detection, patient subgroup identification, and anomaly detection in clinical settings. This method has direct applications in automated medical screening systems (e.g., mammography analysis, ECG interpretation) and data preprocessing pipelines where clean, well-separated clusters improve the final classification. We recommend this method for tabular medical datasets with numeric features, particularly when interpretability and efficiency are prioritized over learning complexity.
Data availability
The datasets used in this paper are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data (Breast Cancer Wisconsin) and https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data (Heart Disease).
Acknowledgements
The authors would like to express their deepest gratitude to Universiti Kuala Lumpur (UniKL), Al-Mustaqbal University, Universiti Kebangsaan Malaysia (UKM), and Al-Bayan University for their invaluable support toward the completion and publication of this paper.
Funding
This research was funded by Universiti Kebangsaan Malaysia (Grant code: FRGS/1/2024/ICT06/UKM/02/3).
Author information
Contributions
CRediT author statement: The following contributions were made by each author to this work. Hussein A.A. Al-Khamees (H.A.A.A.K.): Conceptualization, Methodology, Supervision, Resources. Mudatheer M. Al-Slivani (M.M.A.): Project administration, Formal analysis. Mayameen S. Kadhim (M.S.K.): Investigation, Visualization. Ahmed Dheyaa Radhi (A.D.R.): Data curation, Formal analysis, Writing. Nor Samsiah Sani (N.S.S.): Funding acquisition, Validation, Writing – original draft. Rusul Mansoor Al-Amri (R.M.A.): Writing – review & editing, Project administration. Fazidah Wahit (F.W.): Supervision, Conceptualization. Mohd Aliff Afira Sani (M.A.A.S.): Software, Formal analysis. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
Not applicable since the proposed method is evaluated using two open-access medical datasets (Breast Cancer Wisconsin (BCW) and Heart Disease).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Al-Khamees, H.A.A., Al-Slivani, M.M., Kadhim, M.S. et al. Enhancing classification accuracy in medical datasets using a hybrid distance and cluster refinement-based K-means clustering method. Sci Rep 16, 3490 (2026). https://doi.org/10.1038/s41598-025-30176-1
DOI: https://doi.org/10.1038/s41598-025-30176-1