Introduction

Premature ventricular contraction (PVC) is characterized by early depolarization of the myocardium that originates from the Purkinje fibers rather than the sinoatrial node. Previous studies have shown that PVC may be related to heart diseases1. However, PVC may also be triggered by non-cardiac causes, such as excessive alcohol consumption2, hyperthyroidism3, and sarcoidosis4. The results of previous studies on the clinical importance of PVCs have been controversial; nevertheless, suppressing PVC is considered clinically beneficial in patients with cardiac diseases5,6,7. PVC is seen in 1–4% of the general population8,9 and can lead to symptoms such as palpitations, chest pain, heart failure, and syncope10.

The complete blood count (CBC) is commonly requested in clinical practice. Its components (hematological factors) include white blood cell (WBC) count, red blood cell (RBC) count, hemoglobin (Hb), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin (MCH), red cell distribution width (RDW), platelet distribution width (PDW), mean platelet volume (MPV), mixed cell (MXD) count, and platelet count. Previous studies have shown that CBC components are associated with certain cardiac diseases11, such as coronary heart disease12, left ventricular hypertrophy13, and endothelial dysfunction14,15. In addition, it has been reported that PVC is associated with ventricular dysfunction16,17. CBC components have also been shown to be prognostic in patients with ventricular dysfunction18,19 and to predict ventricular dysfunction18,19,20,21.

Given these findings, it is proposed that some hematological factors may also be associated with PVC. Two meta-analyses showed that hematological factors could predict the occurrence and recurrence of atrial fibrillation22,23. Moreover, hematocrit was found to be associated with ventricular arrhythmias in patients with chronic renal failure on dialysis24. Bindra et al. investigated hemoglobin levels in patients with acute coronary syndrome (ACS)25. Their findings indicated that individuals with reduced Hb concentrations were predisposed to ventricular arrhythmias. Another study indicated that ventricular arrhythmia is significantly more prevalent among patients with higher Hb levels26. Recently, Arienzo et al. proposed a framework for detecting PVC based on explainable AI (XAI). They used the Gradient-weighted Class Activation Mapping (Grad-CAM) algorithm and achieved high accuracy (96.21%)27. Shen et al. proposed a model to detect PVC based on network dynamic features. In this method, each heartbeat was divided into several segments, and their statistical properties were calculated to construct sequences. Subsequently, network topology-related features were extracted from the constructed sequences and input into a decision tree-based GentleBoost classifier. The F1-measure of their method was about 97%28. However, limited research has been undertaken on the relationship between arrhythmia and CBC components, and no large-scale study has explored the potential association between hematological factors and PVC. Thus, the purpose of this research is to investigate the potential association between hematological factors and PVC using a data mining method.

Materials and methods

Participants and blood sampling

The 9,035 subjects in this study were recruited as part of the MASHAD study. A stratified cluster random sampling technique was used to select participants aged between 35 and 65 years. Participants who chose not to remain in the study and those who did not return for testing were excluded29. Individuals with incomplete CBC components or an unreadable electrocardiogram (ECG) were also excluded from this study. Participants were divided into two datasets based on their gender. After an overnight fast, 20 milliliters of blood was collected from each participant. The collection was performed by antecubital vein venipuncture following standardized protocols, with subjects in a seated position, into EDTA-coated tubes. Hematological parameters were assessed in whole blood samples using the Sysmex KX-21N auto analyzer system. Complete blood count (CBC) components, including white blood cell (WBC), red blood cell (RBC), hemoglobin (Hb), hematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), platelet (PLT), lymphocyte count (LYM), mixed cell count (MXD), neutrophil count (NEUT), red cell distribution width (RDW), platelet distribution width (PDW), and mean platelet volume (MPV), were measured. The research protocol was approved by the School of Medicine, Mashhad University of Medical Sciences, Biomedical Research Ethics Committee (IR.MUMS.MEDICAL.REC.1399.783).

Statistical analysis and learning models

Quantitative and qualitative variables were reported as mean ± standard deviation (SD) and frequency (%), respectively. Participants were randomly assigned to training and test datasets in proportions of 80% and 20%, respectively. Because the data were unbalanced, a mixed sampling approach was used, combining under-sampling with over-sampling that generates synthetic minority class samples to balance the data30. To detect the relationship between hematological factors and PVC, three machine learning (ML) models were used: logistic regression (LR), C5.0 decision tree (DT), and boosting DT. Area under the curve (AUC), accuracy, precision, F1-measure, and specificity were used to evaluate the performance of each model on the test dataset.
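
As an illustration of how the threshold-based evaluation measures relate to a confusion matrix, the following minimal Python sketch computes accuracy, sensitivity, specificity, precision, and F1-measure from the four cell counts. The function name and the example counts are hypothetical and are not taken from the study data; the analyses themselves were run in SPSS and SPSS Modeler.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common evaluation measures from confusion-matrix counts.

    tp/fn: PVC-positive cases predicted positive/negative.
    tn/fp: PVC-negative cases predicted negative/positive.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall of positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall of negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, for illustration only (not study results).
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity side by side matters for unbalanced outcomes such as PVC, because a model can reach high accuracy or sensitivity while still misclassifying most cases of one class.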

All analyses were performed using SPSS version 26 and SPSS Modeler 10. A P value < 0.05 was considered statistically significant.

Logistic regression

Logistic regression (LR) is a model applied to data with a binary outcome and multiple predictor variables31. In the LR model, Yi, X, and β denote the response variable (zero or one), the vector of covariates, and the vector of regression coefficients, respectively.
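
For reference, the standard form of the binary LR model can be written as follows. This is a sketch of the usual textbook formulation; the intercept β0 and the subscript i on X are assumptions, since the text names only Yi, X, and β.

$$\ln \left( \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} \right) = \beta_0 + \beta^{\top} X_i$$

Under this form, exp(βj) is the odds ratio (OR) associated with a one-unit increase in the j-th covariate, which is how the ORs of an LR model are usually read.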

Decision tree model and boosting

In big data analytics, where the underlying structures and interrelationships are complex, ML algorithms have emerged as effective methods for uncovering latent associations between variables and outcomes32. Classification is a primary application of ML algorithms; it aims to discern novel and intricate connections between predictors and outcomes and has proved notably efficient in this regard33. The DT is one of the main classification algorithms; it classifies the data based on the values of the variables. It comprises three kinds of elements: (1) a root node, which splits the data into the main subsets; (2) internal nodes, which represent intermediate decision points in the hierarchy, linked to the root node at the top and to the leaf nodes at the bottom; and (3) leaf nodes, which are the end points of the tree’s hierarchical structure and give the final outcomes of partitioning records into distinct target groups34. The variable used at each split is chosen by information gain (IG), which measures the reduction in entropy achieved by splitting on that variable:

$$\mathrm{IG}(T, a) = H(T) - H(T \mid a),$$

where H(T) is the entropy of the target T and H(T|a) is the conditional entropy of T given the value of attribute a.
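
The information gain formula can be made concrete with a short Python sketch. It assumes a categorical attribute and Shannon entropy in bits; the function names and the toy labels are hypothetical, and continuous hematological values would first need to be thresholded into groups.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(T), in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """IG(T, a) = H(T) - H(T|a) for a categorical attribute a."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # H(T|a): entropy within each attribute value, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): 1 = PVC present, 0 = PVC absent.
labels = [1, 1, 0, 0]
separating = ["high", "high", "low", "low"]   # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]          # tells us nothing about the class

print(information_gain(labels, separating))
print(information_gain(labels, uninformative))
```

An attribute that separates the classes perfectly recovers the full entropy of the target, while an attribute unrelated to the target yields a gain of zero, which is why the tree places high-gain variables near the root.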

Boosting operates through the sequential construction of multiple models. Initially, a base model is developed conventionally. Then, subsequent models are constructed, emphasizing instances misclassified by the preceding model. This iterative process continues, with each successive model addressing the errors of its predecessor. Ultimately, cases are classified by applying the collective ensemble of models, employing a weighted voting mechanism to incorporate individual predictions into a unified outcome. Although boosting can markedly enhance the accuracy of a C5.0 DT model, it necessitates extended training durations35.
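
The sequential reweighting and weighted voting described above can be sketched in Python. This is an AdaBoost-style illustration with one-split trees (decision stumps), swapped in for readability; it is not the C5.0 boosting implementation used in SPSS Modeler, and all function names and the toy data are hypothetical. Labels are coded as +1 and -1.

```python
import math


def fit_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign


def fit_boosted(X, y, rounds=3):
    """Build stumps sequentially, up-weighting misclassified cases."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # vote weight of this stump
        ensemble.append((alpha, stump))
        # Misclassified cases (t * prediction = -1) gain weight.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict_boosted(ensemble, row):
    """Classify by the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy one-feature data (hypothetical) that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
single = fit_boosted(X, y, rounds=1)
boosted = fit_boosted(X, y, rounds=3)
print([predict_boosted(single, row) for row in X])
print([predict_boosted(boosted, row) for row in X])
```

On this toy example a single stump necessarily misclassifies some cases, whereas the weighted vote of three stumps classifies all six correctly, which mirrors the improvement boosting is expected to give over a single tree at the cost of additional training time.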

Results

Baseline characteristics

Of the participants recruited into this study, 3,615 were male and 5,420 were female. A total of 42 males and 66 females were found to have PVC. The baseline characteristics of males and females are shown in Table 1. As expected, WBC, RBC, Hb, HCT, and MXD were slightly higher among males than among females. Since these baseline differences might affect the models, the models were constructed separately for each gender group.

To investigate the possible relationship between PVC and hematological factors, three machine learning methods were used: LR, DT (C5.0), and boosting DT. For this purpose, the dataset was randomly split into training (80%) and test (20%) datasets. Models were trained on the training dataset and subsequently evaluated on the test dataset. The evaluation was repeated 10 times, and the average scores are presented.
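
The repeated evaluation procedure can be sketched as follows in Python. The sketch assumes that each of the 10 repetitions draws a fresh random 80/20 split, which the text does not state explicitly; the function names, the baseline scorer, and the toy records are hypothetical.

```python
import random
from statistics import mean


def repeated_holdout(records, train_and_score, repeats=10,
                     test_fraction=0.2, seed=0):
    """Average a score over repeated random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = records[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(train_and_score(train, test))
    return mean(scores)


def majority_baseline(train, test):
    """Hypothetical scorer: accuracy of always predicting the majority class."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)


# Toy records (hypothetical): (feature, label) pairs with 10% positives.
records = [(i, 1 if i % 10 == 0 else 0) for i in range(100)]
print(repeated_holdout(records, majority_baseline))
```

Any model-fitting routine that takes a training set and a test set and returns a score could be passed in place of the baseline scorer; averaging over repetitions reduces the dependence of the reported score on one particular split.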

LR model

According to the multivariate LR model, the variables that were significantly associated with PVC (P value < 0.001) among males and females are shown in Table 2. WBC, MCHC, RDW, and Hb had the highest odds ratios (ORs) among the variables in the LR model for males, whereas Hb and LYM were the most notable variables in the LR model for females. To evaluate their performance, the LR models were tested on the test dataset. The LR model for males achieved an accuracy, specificity, and sensitivity of 62.91%, 28.47%, and 93.09%, respectively (Table 3). The LR model for females achieved an accuracy, specificity, and sensitivity of 61.5%, 17.04%, and 93.61%, respectively (Table 4). Moreover, according to the ROC curves, the LR models had the lowest area under the curve (AUC) of all the models (Fig. 1).

C5.0 DT model and boosting

Various hematological factors were evaluated using the DT algorithm, which produced a tree with 10 layers. The variable in the first layer is generally the most significant contributor to data categorization, with subsequent variables arranged in descending order of importance. Boosting was also used to improve the accuracy of the results; it proceeds by correcting the errors of the previous step at each iteration. The results of C5.0 were better than those of the LR model for both males and females (Tables 3 and 4). Based on the final tree for males, WBC, PLT, RDW, PDW, and HCT had the greatest importance for PVC presence (Fig. 2 and Supplement 1). Among females, the DT model showed that PVC presence was mostly influenced by RBC, PLT, RDW, MCV, and MXD (Fig. 3 and Supplement 2). According to the confusion matrices, the boosting model for males achieved an accuracy, specificity, and sensitivity of 98.13%, 94.41%, and 100%, respectively (Table 3). For females, this model achieved an accuracy of 96.92%, a specificity of 91.34%, and a sensitivity of 99.46% (Table 4). The DT (C5.0) with boosting performed best of all the models for both males and females. Additionally, the AUC of the boosting DT models was higher than that of the LR and C5.0 models (Fig. 1). Some of the rules and the upper layers of the DT model are shown in Figs. 2 and 3, and the complete final trees are shown in Supplements 1 and 2.

Discussion

It has been shown previously that PVC may be associated with cardiac dysfunction, and it has been reported that frequent PVC could be a result of congestive heart failure36. Previous studies have also suggested that PVC could be associated with a higher risk of sudden cardiac death37,38,39. Hence, it is essential to identify factors that may correlate with PVC. Accordingly, this study employed three ML algorithms to explore the potential relationship between hematological factors, which are readily accessible in clinical settings, and the occurrence of PVC in a cohort of 9,035 participants.

Comparison among the models indicated that the boosting DT algorithm performed best, whereas the LR algorithm had lower AUC, sensitivity, specificity, and accuracy. The C5.0 model performed well, and boosting improved its results further. Based on the final DT, the most important hematological factors associated with PVC were WBC, PLT, RDW, PDW, and HCT for males and RBC, PLT, RDW, MCV, and MXD for females.

One of the hematological factors associated with the presence of PVC was RDW. Although there is no evident direct relationship between RDW and PVC, previous studies suggest that some medical conditions might cause higher RDW and PVC to co-occur. It has been shown that PVC could be associated with heart diseases such as coronary artery disease (CAD)16, structural heart disease40, myocarditis41, and arrhythmias42. In addition, RDW could be associated with an increased risk of different heart diseases. For instance, Tonelli et al. found that elevated RDW is related to the risk of death and cardiovascular disease (CVD) events among patients with a history of myocardial infarction43, which might trigger PVC. Another study revealed that higher RDW is also related to a higher probability of heart failure, which is one of the underlying causes of PVC44,45. Moreover, one of the major causes of higher RDW is iron deficiency, which has been found to be associated with altered electrophysiological activity46.

Another hematological factor found to be associated with PVC was platelet count (PLT). Higher PLT may be associated with medical conditions that potentially contribute to PVC. It has been found that myocardial inflammation can lead to PVC47, and PLT is known to be elevated during inflammation48. Beyond these findings, previous studies have demonstrated a direct relationship between PLT elevation and CVDs, which, as mentioned earlier, are the main risk factors for frequent PVC49,50,51,52. Moreover, it has been demonstrated that PLT is directly associated with cardiac arrhythmia through the release of platelet products, which can potentially affect the activity of cardiac cells53.

According to the models, higher WBC was also associated with a higher risk of PVC. Previous studies have shown that higher WBC and neutrophil counts are associated with hypertrophy among patients with hypertension54. This could be one plausible explanation for the observation, since ventricular hypertrophy is related to PVC occurrence55,56. Moreover, elevated WBC has been shown to be strongly related to a higher incidence of coronary heart disease and CVD, both of which are related to PVC16,57,58,59. Studies have also shown that higher WBC is an independent risk factor for coronary heart disease, which is consistent with the observation in this study12,60.

This study revealed a significant association between MCV, commonly regarded as a surrogate marker for anemia, and PVC. Anemia has been implicated in the pathogenesis of ventricular hypertrophy, suggesting a potential long-term impact on cardiovascular health61. In line with this finding, a previous study reported that vitamin B6 deficiency, which is associated with lower MCV values62, is significantly related to cardiac changes in rats, including PVC63. Like the previously mentioned CBC parameters, MCV is related to CVD, a connection that extends to PVC64,65,66.

In addition to the previous hematological factors, PDW may be associated with heart failure and cardiac events, such as PVC. A previous study has shown that PDW could be correlated with adverse prognostic events among heart failure patients67, and previous studies have shown that PVC could be associated with heart failure6,68. HCT was another factor found to be associated with PVC. Previous studies have shown that heart failure and left ventricular dysfunction are also associated with lower HCT69. Coronary heart disease (CHD), one of the underlying causes of PVC occurrence70, has been found to be related to abnormal HCT levels71. Similar scenarios could plausibly explain the relationship between PVC and other hematological factors, such as RBC and MXD. It has been shown that PVCs could be related to high blood pressure. In this regard, the RBC count can potentially influence blood viscosity and other hemodynamic factors, which might ultimately lead to high blood pressure72. Similarly, higher MXD is associated with higher cardiovascular risk73,74.

This study has several strengths. First, it was conducted using a large dataset, which improves the reliability of the models. Second, it includes several hematological factors, providing a nuanced and comprehensive understanding of the possible relationship between hematological factors and PVC. Third, the performance evaluation of the DT (C5.0) with boosting model indicated favorable AUC, accuracy, specificity, precision, and F1-measure values, which suggests possible clinical applications of the models.

In conclusion, this study investigated whether PVC could be predicted from routinely measured hematological factors using machine learning algorithms. The most important hematological factors associated with PVC in both males and females were RDW and PLT. WBC, PDW, and HCT were also important in males, and RBC, MCV, and MXD in females. The results are consistent with prior research findings that underscore the physiological relevance of these factors. However, as this is, to our knowledge, the first study to evaluate the association between hematological factors and PVC, more research is warranted to apply and validate these findings in other population groups.

Table 1 Baseline characteristics of males and females.
Table 2 Variables associated with PVC in males and females.
Table 3 Evaluation parameters for males.
Table 4 Evaluation parameters for females.
Fig. 1 ROC curve of models.

Fig. 2 Upper layers of DT and rules for positive and negative PVC for males.

Fig. 3 Upper layers of DT and rules for positive and negative PVC for females.