Introduction

Hypertension, a chronic disease prevalent throughout the world, is the most common risk factor for cardiovascular disease, and it commonly has both environmental and inherited causes [1,2,3]. A large number of genome-wide association studies have identified multiple single nucleotide polymorphisms (SNPs) of susceptibility genes associated with hypertension and/or elevated blood pressure levels [4,5,6]. Indeed, several studies have discussed whether the combination of genetic and traditional factors can improve hypertension prediction efficacy. An essential hypertension risk model including both traditional factors and 4 SNPs was established and indicated the critical roles of genetic factors in hypertension [7]. However, because a single SNP has relatively low predictive ability, prediction may improve when SNPs, including individually nonsignificant ones, are combined into genetic risk scores (GRSs). A study conducted in Sweden confirmed that a GRS was independently related to incident hypertension but did not contribute to the predictive ability of the model [8]. In contrast, two studies from Korea and China concluded that the addition of a GRS provided limited but positive improvements to traditional models; however, no relevant studies have been conducted in resource-limited areas [9, 10].

In recent years, a fair number of studies have applied machine learning methods to medical data and successfully solved various medical problems, especially disease prediction [11, 12]. Machine learning methods have advanced applications for which classical methods are not well suited because of their high data utilization requirements. Complicated diseases can be described and predicted more easily and more accurately by machine learning methods [13]. More importantly, artificial neural network (ANN), random forest (RF), and gradient boosting machine (GBM) methods have been shown to perform better than other machine learning methods in rural Chinese populations [12, 14]. Considering the uncertainty about the effect of the GRS on the prediction of hypertension, this study aims to explore whether genetic information is an important factor in hypertension prediction using machine learning methods in resource-limited districts, based on a prospective cohort study conducted in rural China.

Methods

Subjects and measurements

The Henan Rural Cohort study is a prospective study based on a large-scale rural population in a resource-limited area in China that aims to explore the possible risk factors for noncommunicable diseases (NCDs) and the prevention of NCDs in the Chinese rural population. The baseline data of the study were collected between July 2015 and September 2017. Subsequent follow-up surveys were conducted every 3 years starting in 2018 [15]. The current analysis used the population dedicated to the examination of genetic factors, including 8268 subjects who had undergone complete SNP testing. The following subjects were excluded: (1) those who were missing genetic information, (2) those who had prevalent hypertension at baseline, and (3) those who had missing data on the outcome. Therefore, a total of 4592 (aged 18–81 years) subjects were enrolled in the present analysis. All the subjects signed written informed consent forms. This study was approved by the Zhengzhou University Life Science Ethics Committee.

Demographic characteristics, lifestyle factors, personal history of disease, family history of disease, etc. were assessed through a questionnaire interview conducted by well-trained investigators. Anthropometric measurements including height, weight, waist circumference, hip circumference, blood pressure, etc. were obtained for each subject. Venous blood samples were collected after 8 h of overnight fasting for routine blood examination. Low fruit and vegetable intake was defined as consuming less than 500 g of fruit and vegetables per day. Body mass index (BMI) was calculated as weight divided by height squared (kg/m²). Physical activity was classified into 3 levels (low, moderate, high) according to the International Physical Activity Questionnaire recommended by WHO. Pulse pressure was calculated as systolic blood pressure (SBP) minus diastolic blood pressure (DBP).
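The derived variables defined above can be illustrated with a minimal sketch. The function names are hypothetical and chosen only for illustration; the formulas and the 500 g cut-off are those stated in the text.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def pulse_pressure(sbp_mmhg: float, dbp_mmhg: float) -> float:
    """Pulse pressure = SBP minus DBP, in mmHg."""
    return sbp_mmhg - dbp_mmhg


def low_fruit_vegetable_intake(grams_per_day: float) -> bool:
    """True when daily fruit and vegetable intake is below 500 g."""
    return grams_per_day < 500
```

For example, a subject weighing 70 kg with a height of 1.75 m has a BMI of about 22.9 kg/m², and a reading of 120/80 mmHg gives a pulse pressure of 40 mmHg.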

Hypertension

The blood pressure of subjects was measured after at least 5 min of rest, with no consumption of tea, alcohol, or cigarettes and no excessive physical activity beforehand. Each subject had their blood pressure measured at least three times, and the average of those measurements was taken as the final blood pressure. According to the China Hypertension Prevention Guide (2010 Revision) [16], hypertension was defined as blood pressure greater than or equal to 140/90 mmHg, or a physician diagnosis of hypertension and use of medication in the last two weeks.
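The outcome definition can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, and "140/90 mmHg" is read here as SBP ≥ 140 mmHg or DBP ≥ 90 mmHg, which is the usual interpretation but is not spelled out in the text.

```python
from statistics import mean


def mean_blood_pressure(readings):
    """Average a list of (SBP, DBP) readings in mmHg; at least three are expected."""
    if len(readings) < 3:
        raise ValueError("at least three readings are expected")
    return mean(s for s, _ in readings), mean(d for _, d in readings)


def is_hypertensive(readings, physician_diagnosis=False, medicated_last_2_weeks=False):
    """Apply the hypertension definition to one subject."""
    sbp, dbp = mean_blood_pressure(readings)
    # Assumption: the threshold is met when either component is elevated.
    elevated = sbp >= 140 or dbp >= 90
    # A prior diagnosis counts only together with medication use in the last two weeks.
    treated = physician_diagnosis and medicated_last_2_weeks
    return elevated or treated
```

A subject whose three readings average 125/82 mmHg but who reports a physician diagnosis and recent medication use would still be classified as hypertensive under this sketch.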

Polygenetic risk score

A total of 13 SNPs (rs11191548, rs1275988, rs16849225, rs7136259, rs17249754, rs2107595, rs9810888, rs10745332, rs1378942, rs16998073, rs1902859, rs2021783, rs7577262) reported to be related to hypertension were integrated into a polygenic genetic risk score (PGGRS). Considering the target population of this study, the 13 SNPs were selected from the literature in conjunction with reports from large GWAS in East Asia [17,18,19,20,21]. These 13 SNPs were subsequently replicated in the present population. SNPs were detected by a custom SNPscan™ kit (Genesky Biotechnologies Inc., Shanghai, China). The PGGRS was calculated based on a genetic codominance model, which codes genotypes in the form of dummy variables (e.g., AA is coded as 00, Aa is coded as 10, and aa is coded as 01), calculates the risk coefficient for each genotype relative to the reference genotype (shown in Supplementary Table 1), and finally sums all the risk coefficients for each individual to obtain the PGGRS.
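The scoring procedure described above can be sketched as follows. The coefficient values in the usage note are placeholders for illustration only; the actual risk coefficients are those in Supplementary Table 1, and the function and variable names are hypothetical.

```python
# Dummy coding under the codominant model: (heterozygote flag, variant-homozygote flag).
DUMMY_CODES = {"AA": (0, 0), "Aa": (1, 0), "aa": (0, 1)}


def pggrs(genotypes, coefficients):
    """Sum per-genotype risk coefficients over all SNPs for one individual.

    genotypes:    {snp_id: "AA" | "Aa" | "aa"}
    coefficients: {snp_id: (coef_Aa, coef_aa)}, each relative to the
                  reference genotype AA, whose coefficient is 0.
    """
    score = 0.0
    for snp, (coef_het, coef_hom) in coefficients.items():
        het, hom = DUMMY_CODES[genotypes[snp]]
        score += het * coef_het + hom * coef_hom
    return score
```

With placeholder coefficients `{"rs11191548": (0.10, 0.25), "rs1378942": (0.05, 0.30)}`, an individual who is Aa at the first SNP and aa at the second would receive a score of 0.10 + 0.30 = 0.40.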

Model construction and evaluation

All models were constructed and fitted using the train dataset, involving a randomly selected group of 70% of subjects (n = 3214). The predictive performance of the models was assessed using the test dataset, involving the remaining 30% of subjects (n = 1378).
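A random 70/30 partition of this kind can be sketched with the standard library alone. The seed and the function name are assumptions made for reproducibility of the illustration; the study does not report how the random split was generated.

```python
import random


def split_subjects(subject_ids, train_fraction=0.7, seed=0):
    """Randomly partition subject IDs into train and test lists."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle, leaves the input untouched
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_subjects(range(4592))
```

With 4592 subjects, this yields 3214 train subjects and 1378 test subjects, matching the sample sizes reported above.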

The models were constructed using the train dataset to predict the probability of incident hypertension over 3 years, and the process was as follows. First, risk factors significantly associated with the development of hypertension were screened using univariate Cox regression (results are shown in Supplementary Table 2). Second, among these associated risk factors, stepwise multivariate Cox regression (P < 0.05 for inclusion, P > 0.10 for exclusion) was used to screen the predictors of incident hypertension (results are shown in Table 2). The screened predictors composed the traditional model. These traditional predictors together with the PGGRS constituted the traditional+PGGRS model. Third, the models were constructed using Cox regression (performed in SPSS 21.0) and machine learning methods, including ANN, RF, and GBM (performed in Python 3.8 with the sklearn package, version 0.21.3) [22,23,24,25,26]. The parameters of the machine learning models were selected using 10-fold cross-validation and grid search, and the parameter selection process was repeated 100 times to obtain robust results. The parameters yielding the best area under the receiver operating characteristic curve (AUC) were used to construct the prediction model.
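The parameter selection step can be sketched in plain Python as below. This is not the sklearn code used in the study. It assumes a caller-supplied, hypothetical function `fit_and_score(params, train_idx, valid_idx)` that fits a model and returns its validation AUC, and it assumes that the repeated runs are combined by averaging, which the text does not specify.

```python
import itertools
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for one shuffled k-fold partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, fit_and_score, n_samples, k=10, repeats=100):
    """Return the parameter combination with the highest mean validation AUC."""
    names = sorted(param_grid)
    best_params, best_auc = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Assumption: each repeat reshuffles the folds and scores are averaged.
        scores = [
            fit_and_score(params, train, valid)
            for r in range(repeats)
            for train, valid in k_fold_indices(n_samples, k, seed=r)
        ]
        avg = mean(scores)
        if avg > best_auc:
            best_params, best_auc = params, avg
    return best_params, best_auc
```

A call such as `grid_search({"max_depth": [1, 3, 5]}, fit_and_score, n_samples=3214)` would evaluate each candidate depth over 100 reshuffled 10-fold partitions of the train dataset; the parameter name here is only an example.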

In the test dataset, the improvement in discrimination was examined by comparing the AUC with or without the GRS in traditional models [27]. The integrated discrimination improvement index (IDI) and net reclassification improvement index (NRI) were also calculated [28, 29]. The decision curve was used to examine the clinical benefits of the predictive model [30]. A flowchart of model construction and evaluation is provided in the supplementary material (Supplementary Fig. 1) to present the whole process.
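As an illustration of the discrimination measure, the AUC can be computed as the probability that a randomly chosen subject who developed hypertension receives a higher predicted risk than a randomly chosen subject who did not, with ties counted as one half. The sketch below treats the 3-year outcome as binary and does not implement the statistical test used to compare AUCs in the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def delta_auc(labels, scores_without_grs, scores_with_grs):
    """Change in AUC after adding the GRS to the traditional model."""
    return auc(labels, scores_with_grs) - auc(labels, scores_without_grs)
```

The pairwise loop is quadratic in sample size, which is acceptable for a test dataset of this size; a rank-based implementation would be preferable for much larger data.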

Statistical analysis

The mean and standard deviation were used to describe continuous variables, and differences were tested by the t-test. Categorical variables were described by frequency and proportion, and differences were tested by the chi-square test.

In addition, another two types of GRSs (SGRS and DLGRS) were calculated and analyzed as sensitivity analyses. The predictive ability of SGRS and DLGRS in the development of hypertension was analyzed and is provided in the supplementary material.

All statistical tests were two-sided, with P < 0.05 indicating statistical significance. Data were analyzed using SPSS version 21.0, R version 4.0.0, and Python 3.8.

Results

Baseline characteristics

The baseline demographic characteristics of the subjects in the total population, train dataset, and test dataset are summarized in Table 1. Among 4592 subjects, the 3-year incidence of hypertension was 18.90%, with 868 participants developing hypertension. The average age was 49.04 ± 11.52 years, and 63% of participants were women. Concerning baseline characteristics, no significant difference was observed between the train dataset and test dataset (all P > 0.05).

Table 1 Baseline characteristics of the train dataset and test dataset

Performance of models with and without the PGGRS

Table 2 shows the predictors in the traditional model, which were age, low fruit and vegetable intake, family history of hypertension, physical activity, BMI, baseline DBP, and pulse pressure. All the variables were significantly related to incident hypertension. As shown in Table 3, the AUCs were 0.785 (0.763, 0.807), 0.790 (0.768, 0.811), 0.838 (0.817, 0.857) and 0.854 (0.835, 0.873) for traditional models with the Cox, ANN, RF, and GBM methods, respectively. The receiver operating characteristic (ROC) curves of the traditional models are shown in Supplementary Fig. 3. Among the 4 traditional models, the best discrimination ability was shown by the model constructed by the GBM method. After adding the PGGRS, the AUCs of the models increased by 0.001, 0.008, 0.023, and 0.017 for the Cox, ANN, RF, and GBM methods, respectively. Except for the Cox method, the increases in AUC were statistically significant, indicating that the discrimination of the traditional+PGGRS models was acceptable, especially with the GBM method. The comparison of ROC curves between the traditional model and the traditional+PGGRS model is shown in Fig. 1.

Table 2 Cox multivariate analysis of predictors and incident hypertension
Table 3 Improvement of traditional models after adding PGGRS
Fig. 1 ROC curves of the traditional model and the traditional+PGGRS model with different classifiers. ROC receiver operating characteristic, Cox Cox regression, ANN artificial neural network, RF random forest, GBM gradient boosting machine

The IDI and continuous NRI, provided in Table 3, were used to examine whether adding the PGGRS to the traditional model could improve the reclassification ability. The PGGRS increased the IDI by 1.39% (0.60%, 2.26%), 2.86% (0.72%, 5.33%), 4.73% (2.99%, 6.35%), and 4.68% (2.03%, 7.81%) and the NRI by 25.05% (14.87%, 36.00%), 13.01% (−16.90%, 30.99%), 44.87% (32.04%, 53.39%), and 22.94% (8.22%, 37.13%) for the Cox, ANN, RF, and GBM methods, respectively. According to the IDI, the reclassification of incident hypertension risk was significantly improved by the PGGRS for all four methods; the NRI interval for the ANN method, however, included zero. Overall, these results indicate that adding the PGGRS improved reclassification for predicting incident hypertension.
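For readers unfamiliar with these indices, the following sketch shows one common way to compute the IDI and the continuous (category-free) NRI from predicted probabilities. It assumes a binary 3-year outcome without censoring and is not taken from the study's code; interval estimation is omitted.

```python
from statistics import mean


def idi(labels, p_old, p_new):
    """Integrated discrimination improvement.

    (mean change in predicted risk among events)
    minus (mean change in predicted risk among non-events).
    """
    ev = [n - o for y, o, n in zip(labels, p_old, p_new) if y == 1]
    ne = [n - o for y, o, n in zip(labels, p_old, p_new) if y == 0]
    return mean(ev) - mean(ne)


def continuous_nri(labels, p_old, p_new):
    """Continuous NRI: net upward movement in events plus net downward in non-events."""
    def net_up(diffs):
        up = sum(1 for d in diffs if d > 0)
        down = sum(1 for d in diffs if d < 0)
        return (up - down) / len(diffs)

    ev = [n - o for y, o, n in zip(labels, p_old, p_new) if y == 1]
    ne = [n - o for y, o, n in zip(labels, p_old, p_new) if y == 0]
    return net_up(ev) - net_up(ne)
```

Here `p_old` would hold the predicted probabilities from the traditional model and `p_new` those from the traditional+PGGRS model. The continuous NRI ranges from −2 to 2, so values such as 25.05% correspond to 0.2505 on this scale.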

The decision curve, shown in Fig. 2, was plotted to measure the impact of using the PGGRS for predicting incident hypertension. Compared with the reference strategies of intervening in all participants or in none, the model with the PGGRS had a higher net benefit for the RF and GBM methods when the threshold probability was lower than 80%; for the ANN method, the traditional+PGGRS model had a higher net benefit when the threshold probability was less than 45%. These curves suggested that the traditional models provided greater benefit when combined with the PGGRS.
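The net benefit plotted on a decision curve is conventionally computed at each threshold probability p_t as TP/n − (FP/n) × p_t/(1 − p_t). The sketch below follows that standard definition under the assumption of a binary outcome; the function names are hypothetical and the study's own implementation is not shown in the text.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of intervening on subjects with predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_all(labels, threshold):
    """Reference strategy in which every subject receives the intervention."""
    return net_benefit(labels, [1.0] * len(labels), threshold)


def net_benefit_none():
    """Reference strategy in which no subject receives the intervention."""
    return 0.0
```

Evaluating `net_benefit` over a grid of thresholds for the traditional and traditional+PGGRS predictions, together with the two reference strategies, gives the curves compared in a decision curve analysis.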

Fig. 2 Decision curve of models. The “None” line means that none of the participants had hypertension or were undergoing any intervention; the “All” line represents that all participants had hypertension and all received the intervention. Cox Cox regression, ANN artificial neural network, RF random forest, GBM gradient boosting machine

Sensitivity analysis

Whether adding the SGRS and DLGRS could improve the predictive ability of the traditional models was also analyzed, although the SGRS and DLGRS were not significantly related to incident hypertension after adjusting for baseline blood pressure (Supplementary Fig. 2). The results are shown in Supplementary Table 3 and Supplementary Figs. 4 and 5. In summary, adding the SGRS partially improved the discrimination and reclassification of the traditional model, especially using the RF method. The addition of the DLGRS resulted in significant AUC improvement for the ANN, RF, and GBM methods, and the reclassification of incident hypertension risk improved with the Cox and RF methods. Similar results were observed for the SGRS, PGGRS, and DLGRS, which supports the role of genetic factors in the prediction of incident hypertension.

Discussion

Based on a rural cohort population, this study validated the predictive performance of the PGGRS, which was associated with incident hypertension irrespective of baseline blood pressure. The AUC, NRI, and IDI results showed that the discrimination of the traditional model significantly improved and that the ability to predict the risk of incident hypertension improved when the PGGRS was added. Models with genetic factors exhibited a higher net benefit than those without genetic factors. Moreover, predicting incident hypertension using machine learning was more efficient. This study provides evidence and choices for the clinical application of genetic information for the prediction of incident hypertension using machine learning methods.

The results are consistent with previous studies indicating that age, vegetable and fruit intake, family history of hypertension, physical activity, and BMI are risk factors for hypertension [31,32,33]. In addition, smoking, alcohol consumption, and lipid levels have been reported as risk factors for hypertension, but the variable screening process in this study excluded them from the model. This may be due to the effect of these variables on hypertension becoming less significant over time or to racial differences between studies. In addition, the results of the present study showed that high levels of physical activity increased the risk of developing hypertension. This finding may be explained by the physical activity characteristics of rural people. A recent article suggested a physical activity paradox in which leisure time physical activity is a protective factor, whereas physical activity at work is a risk factor for adverse cardiovascular events [34]. Our study focused on a rural population whose physical activity occurs primarily at work. Therefore, our findings suggest that physical activity in rural populations is a risk factor for incident hypertension, which is consistent with previous studies and provides evidence for future studies related to physical activity.

This study calculated three GRSs based on 13 hypertension-related SNPs, and the relationship between GRSs and incident hypertension was determined (Supplementary Fig. 2). The results showed that the PGGRS was strongly associated with incident hypertension after adjustment for baseline blood pressure; for every 1-unit increase in PGGRS, the risk of incident hypertension increased by 4.6%. Several previous studies have also reported the prominent association of GRSs with hypertension/blood pressure [35,36,37]. These similar results further confirmed that GRSs integrating multiple SNPs are related to hypertension.

A previous study conducted in Sweden stated that a GRS was independently associated with elevated blood pressure but was not useful for hypertension prediction when judged by whether the AUC increased [8]. A subsequent study from Korea revealed similar but not identical results: the AUC increased by 0.001 (P = 0.1057) when the GRS was added, but the reclassification of the model improved due to the addition of the GRS [9]. To clarify whether the addition of the GRS can improve the performance of the predictive model, especially in resource-limited settings, this study established a traditional model and then added the PGGRS to examine the effect of genetic factors in predicting incident hypertension in the rural population with 3-year follow-up data. With the addition of the PGGRS, the increases in AUC were significant for all three machine learning methods, and the NRI and IDI were significantly improved, suggesting considerable improvement in terms of discrimination and reclassification, respectively. A large-scale population study reported a comparable improvement in prediction model performance [10]. The results for the SGRS and DLGRS also partially supported the same conclusion. Models incorporating the GRS could better predict incident hypertension, indicating the feasibility of using genetic information to predict hypertension over time.

The decision curves of the traditional model and the traditional+PGGRS model showed that models with the PGGRS had better net benefits than those that contained only traditional factors. A previous study argued that genetic information could be incorporated into a prediction model for the assessment of cardiovascular disease risk, thus lowering the level of LDL-C in patients [38]. Our decision curve results support a similar inference, suggesting that genetic information plays a potential role in clinical disease prevention and decision-making.

Notably, the traditional statistical method (Cox regression) and three machine learning algorithms (ANN, RF, GBM) were adopted to construct the prediction model. According to our results, the better method to predict the risk of hypertensive events using genetic factors was machine learning, and the AUCs of the ANN, RF, and GBM methods with the addition of the PGGRS were 0.798, 0.861, and 0.871, respectively. This result demonstrates that using machine learning methods to utilize genetic factors for hypertension prediction is effective and can provide more options for clinical hypertension prediction. Moreover, the GBM exhibited outstanding predictive ability for the risk identification of incident hypertension. Other studies also indicated the critical role of boosting algorithms in diverse diseases [12, 39,40,41]. These findings may be explained by the fact that machine learning can better manage the complicated and indivisible relationships among risk factors.

This research, conducted in a rural Chinese cohort, can better demonstrate the causal relationship between variables and outcomes, thus better demonstrating the effect of genetic factors on incident hypertension prediction in resource-limited areas. In addition, machine learning methods were used to build the model to maximize the utilization of data. Nonetheless, several limitations should be noted. First, this study only verified the results of the analysis in the same population that was randomly divided into a train dataset and a test dataset, which may limit the extrapolation of the conclusion. Second, the 3-year follow-up may be insufficient to account for the long-term impact of genetic factors on the development of hypertension, and the results should be confirmed in subsequent studies in our cohort or in other cohort studies with longer follow-up durations. Third, only 13 hypertension-related SNPs were detected and included in the GRS, and the possible interaction between genetic variants and environmental factors was not considered; consequently, the effect of genetic information in predicting hypertension may not have been fully captured.

Conclusion

To what extent genetic elements affect incident hypertension was previously undetermined, particularly in areas with limited resources. Based on a rural prospective study conducted in China, the present study found a significant relationship between genetic elements and incident hypertension. In addition, the addition of the PGGRS resulted in significant improvements to the traditional model in terms of both discrimination and reclassification, while the use of machine learning methods could result in even greater improvements. The results suggest the potential clinical use of genetic elements in predicting incident hypertension with machine learning techniques.