Introduction

Modern humans originated in Africa and then migrated around the world. East Asia is a key region for studying the migration and evolution of modern humans, and Y-chromosome haplotype studies of various East Asian populations suggest a south-to-north migration that led to the settlement of mainland East Asia1,2. As humans migrated northwards, population expansion and environmental and dietary selection pressures changed the genome structure of East Asian populations3. Previous studies of the Han Chinese in East Asia have provided a wealth of genetic evidence for the complex migratory history and disease susceptibility of these populations4,5,6. However, key areas, particularly the Central Plains of China, have received too little attention.

The Central Plains, usually taken to mean Henan Province in China, lie in the middle and lower reaches of the Yellow River basin and are one of the largest agricultural regions in East Asia and even globally7,8. Historically, more than 20 Chinese dynasties built their capitals here (including the famous Xia, Shang and Song Dynasties), and the region is considered the birthplace of Chinese civilization. Studying the population structure of the Central Plains populations and their subsistence patterns is crucial to understanding the formation of modern humans in East Asia. In addition, structural variation (SV), which comprises insertions, deletions and other changes involving sequences longer than 50 bp, is another major form of variation in the genome. Previous studies have shown that SVs play a crucial role in human phenotypes, adaptation and disease9. To date, however, investigations of SVs in East Asian populations have been limited. Therefore, in-depth studies of the Han Chinese in the Central Plains using SNPs and SVs may provide additional genetic evidence for the population history and phenotypic adaptations of East Asian populations, particularly for adaptation to crop domestication, which has changed the diet of East Asian populations.

In the current study, we collected samples from 492 Central Plains Han Chinese (CPHC) individuals and performed whole-genome sequencing. Population structure analysis indicates that the CPHC has a more concentrated set of genetic components and a higher level of genetic diversity than the other East Asian populations in the 1000 Genomes Project (1KGP) (CHB, Han Chinese in Beijing, China; CHS, Southern Han Chinese; JPT, Japanese in Tokyo, Japan). We identified many candidate variants (13,035 SNPs and 26 SVs) with higher allele frequencies in the CPHC. In particular, glycolipid metabolic genes show strong selection signals in the CPHC, for example the LONP2, FADS2, CCDC171, SULT1C4, SIAH1 and ASXL2 genes and the high-frequency SVs INS_27791 (FGF21) and INS_03097 (SLC19A2). In addition, the East Asia-specific structural variants DEL_21699 (LINC01749) and DEL_38406 (FAM102A) may be associated with esophageal squamous carcinoma and primary angle-closure glaucoma (PACG), respectively. In conclusion, this study demonstrates the unique genetic structure and dietary adaptations of the CPHC, and the results provide important clues for the study of population structure and phenotypic adaptation in East Asian populations.

Results

Genome sequencing of 492 CPHC individuals and variant annotation

We sampled a total of 492 Han Chinese individuals from 16 cities in Henan Province, China (Fig. 1a). The individuals were aged 30–50 years, with a female-to-male ratio of 1.56:1 (300 female and 192 male). Samples were quantified on the Bioanalyzer 2100, and whole-genome sequencing was performed on the Illumina sequencing platform with a paired-end read length of 2 × 150 bp. The mean NGS coverage was 25.54×, with a range of ~20.2–42.2× (Fig. 1b). A total of 22.65 million SNPs and 4.26 million INDELs were identified using the standard GATK process and a rigorous filtering process (“Methods”). The transition-to-transversion ratio (Ts/Tv) of these variants is 2.12, supporting the accuracy of our SNP calling10. A total of 2,417,290 single nucleotide variants (~10.67% of the current study, compared to dbSNP Build 156) were novel variants (Fig. 1c); 151,763 SNPs (0.67%) were located in coding and splice regions (including 71,359 missense mutations); 14,847,845 SNPs (65.55%) were located in intronic regions and 144,514 SNPs (6.38%) in intergenic regions (Fig. 1d; Supplementary Fig. 1). In addition, 17,441,403 SNPs (~77% of the current study) were rare variants (AF < 0.01). The total number of these variants is much higher than in other Han Chinese datasets, e.g. ~1.9 times the total number in CHB and ~2.2 times the total number in CHS in 1KGP (Supplementary Fig. 2).
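
The Ts/Tv quality check mentioned above can be illustrated with a minimal sketch. It assumes the SNPs are available as (reference, alternate) base pairs; the function name `ts_tv_ratio` and the toy input are hypothetical and are not taken from the study's pipeline.

```python
# Purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T) changes are transitions;
# every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ts_tv_ratio(snps):
    """Return the Ts/Tv ratio for an iterable of (ref, alt) single-base pairs."""
    ts = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases and non-variant records
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Toy input (hypothetical, not data from the study):
# two transitions, one transversion and one indel that is skipped.
example = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
print(ts_tv_ratio(example))
```

In practice the pairs would be read from the REF and ALT columns of the filtered VCF, restricted to biallelic SNPs.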

Fig. 1: Sample distribution and WGS data statistics in the CPHC.

a Distribution of samples in the CPHC. The red highlight shows the geographical location of Henan Province in China. Red circles represent the native places of the samples. The map was sourced from an approved standard map service (http://bzdt.ch.mnr.gov.cn) endorsed by the Ministry of Natural Resources of the People’s Republic of China (GS(2019)1653, GS(2019)3266 and GS(2019)3333). b The distribution of WGS genomic coverage. The red line refers to the mean coverage of the samples. c Numbers of variants detected in different bins of allele frequencies. Variants were classified as known or novel based on public resources (1KGP and dbSNP). d–g Functional annotation of SNPs and SVs in the CPHC. Each category is filled with a different color.

More importantly, we used Paragraph (a graph-based SV genotyping tool)11 to genotype SVs in the CPHC genomes and obtained a total of 19,042 DELs and 22,917 INSs, with SV lengths ranging from 50 to 4000 bp. Next, we used the VEP and RepeatMasker tools to annotate all variants12,13. Of these, 18,546 SVs (44.2%) were located in genomic regulatory regions, 340 SVs (0.81%) in coding regions and 17,916 SVs (42.7%) in intronic regions (Fig. 1e). We also identified SINE-associated SVs in various genomic functional regions (Fig. 1f and g). For deletions, 2457 (42.4%) were located in intronic regions; 442 (7.63%) in 5’/3’-UTR regions; 386 (6.66%) in non-coding RNA regions and 93 (1.61%) in up/downstream regions. For insertions, 2326 (48.68%) were located in intronic regions; 605 (12.66%) in 5’/3’-UTR regions; 416 (8.7%) in non-coding RNA regions and 65 (1.36%) in up/downstream regions. In addition, allele frequency calculations for all variants showed that 80% of the SVs were rare variants (AF < 0.01), which may be specific to the CPHC.

These results indicate that many SNPs and SVs are present in the CPHC genome, suggesting a rich genetic diversity that is a great complement to the variant resources of East Asian populations. Further analysis of these variants will be important for understanding the genetic structure, phenotypic adaptation, and disease risk of the CPHC.

The CPHC have a higher level of genetic diversity than other populations in East Asia

To investigate the genetic relationship between the CPHC and other populations, we performed principal component analysis (PCA) using SNPs and SVs generated from the CPHC and public databases (1KGP and HGDP), respectively. In the PCA of global populations, East Asian, European and African populations showed significant genetic differences, and the CPHC clustered closely with East Asian populations (Fig. 2a, b). It is worth noting that the SV analyses suggest that the Native Americans in the HGDP (Colombian, Maya, Karitiana, Pima and Surui) are genetically closer to East Asian populations than the modern American populations in the 1KGP (PUR, Puerto Ricans from Puerto Rico; CLM, Colombians from Medellin, Colombia; PEL, Peruvians from Lima, Peru; MXL, Mexican Ancestry from Los Angeles, USA) (Fig. 2a, b). This may indicate a closer genetic relationship between East Asian populations and Native Americans. We further investigated the genetic relationships among East Asian populations. The SNP-based PC1 showed that populations in southern East Asia (KHV, Kinh in Ho Chi Minh City, Vietnam; CDX, Chinese Dai in Xishuangbanna, China) were genetically closer to each other, whereas CHB and CHS showed mixed Han Chinese population characteristics and the CPHC individuals showed a more concentrated genetic component. PC2 separated the Japanese (JPT) from most of the other East Asian populations (Fig. 2c). The SV-based analysis likewise showed a separate genetic component for the CPHC, and the inclusion of the CPHC makes the genetic position of the Han Chinese population more prominent (Fig. 2d).

Fig. 2: Principal component analysis (PCA) of SNPs and SVs in the CPHC and other populations.

a The CPHC and all populations in the 1KGP. b The CPHC and other East Asian populations in the 1KGP. c The CPHC and all populations in the HGDP. d The CPHC and other East Asian populations in the HGDP. The population abbreviations are given in Supplementary Data 4.

We next inferred the rooted maximum likelihood tree (Fig. 3a) between the CPHC and the 26 world populations in the 1KGP with TreeMix and MEGA14,15, using YRI as the outgroup. The CDX and KHV cluster in the same branch, while the northern East Asian populations (CHS, CHB, CPHC and JPT) cluster in another branch. The CPHC is closest to CHB and, together with CHB and CHS, forms the Han population, distinct from JPT. We further calculated the FST16 between the populations to measure the degree of genetic differentiation between the northern, central and southern Han populations and the YRI. The FST between the CPHC and YRI was greater than that between CHB or CHS and YRI (FST (CPHC-YRI) = 0.179; FST (CHB-YRI) = 0.170; FST (CHS-YRI) = 0.169), implying a greater genetic difference between the CPHC and YRI. Meanwhile, we found that the CPHC is less genetically differentiated from CHB than from CHS (FST (CPHC-CHS) = 0.00234; FST (CPHC-CHB) = 0.00068), suggesting that the CPHC is genetically closer to the northern Chinese population (Supplementary Fig. 3).
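
To make the pairwise FST comparison concrete, the sketch below computes FST from per-SNP allele frequencies. The text above does not state which estimator was used, so this is an assumption: it uses Hudson's estimator in the ratio-of-averages form described by Bhatia et al. (2013). The function name `hudson_fst` and the toy inputs are hypothetical.

```python
def hudson_fst(p1, n1, p2, n2):
    """Hudson's FST between two populations, combined over SNPs as a ratio of averages.

    p1, p2: per-SNP allele frequencies in population 1 and 2 (same SNP order).
    n1, n2: number of sampled chromosomes in each population
            (e.g. 2 x 492 = 984 for 492 diploid individuals).
    """
    if len(p1) != len(p2):
        raise ValueError("frequency lists must have the same length")
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        denominator += a * (1 - b) + b * (1 - a)
    if denominator == 0:
        raise ValueError("all SNPs are fixed for the same allele in both populations")
    return numerator / denominator


# Toy frequencies (hypothetical, not values from the study).
print(hudson_fst([0.10, 0.80, 0.50], 984, [0.12, 0.75, 0.55], 206))
```

A fixed difference gives FST = 1, while identical frequencies give a value slightly below zero because of the sample-size correction.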

Fig. 3: Population structure and genetic diversity analysis.

a The maximum likelihood tree of the CPHC with other populations in the 1KGP. b ADMIXTURE analysis of the CPHC with other populations in 1KGP. Different colors represent different ancestry components (Yellow: Northern Asian population; Purple: Southern Asian population; Orange: JPT; Grey: CEU; Brown: YRI). c The pattern of LD decay between the CPHC and the other populations in the 1KGP. d Dynamics of effective population sizes of the CPHC and other world populations in 1KGP. The population abbreviations are given in Supplementary Data 4.

To further investigate the genetic composition of the major populations in East Asia, we performed admixture analysis on the CPHC and other populations from the 1KGP. Each population is assumed to be descended from an arbitrary number of ancestral components, denoted by K17. After rigorous screening for relatedness and linkage disequilibrium (Methods), the admixture results showed that from K = 4 onwards, East Asia is dominated by the genetic components of the northern (yellow) and southern (purple) populations (Supplementary Fig. 4). The cross-validation results show that K = 5 best fits the genetic structure of the East Asian populations (Supplementary Fig. 5). The CPHC carries predominantly the main genetic components of the northern East Asian populations, whereas the southern CDX and KHV populations carry predominantly the genetic components of the southern East Asian populations, and the CHB and CHS share the genetic components of the northern and southern East Asian populations. JPT contributes a new genetic component (Fig. 3b), in agreement with the PCA of the East Asian populations described above.

Finally, we investigated the genetic diversity and the effective population size of the CPHC. LD decay analysis was performed for the CPHC and other East Asian populations in the 1KGP, using CEU and YRI as control populations. In general, faster LD decay indicates greater genetic diversity18. The LD decay rate of the CPHC is significantly higher than that of CHB, CHS and JPT (Fig. 3c), implying richer genetic diversity in the CPHC. To further investigate the effective population size of the East Asians, we performed MSMC analysis on the above populations19. The results indicated that non-African populations experienced a dramatic population decline around 100,000 years ago, corresponding to the temperature drop during the Last Glacial Maximum20. On the other hand, there was a rapid increase in the effective population size of the CPHC around 10,000 years ago (Fig. 3d). This period coincides with the emergence of an early agricultural civilization in China, which later developed gradually in the Central Plains near the Yellow River basin with the northern dry farming system represented by wheat and millet21,22.
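
The LD decay comparison above rests on how quickly pairwise r² falls with physical distance. The following is a minimal sketch of that calculation, assuming phased haplotypes coded as 0/1 alleles and polymorphic sites only; the function names `r_squared` and `ld_decay`, the bin size and the toy data are hypothetical and do not reproduce the software used in the study.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs over the same haplotypes."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def ld_decay(positions, haplotypes, bin_size, max_dist):
    """Mean r^2 per distance bin; haplotypes[i] holds the 0/1 alleles of SNP i."""
    sums = {}
    counts = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = abs(positions[j] - positions[i])
            if dist > max_dist:
                continue
            b = dist // bin_size
            sums[b] = sums.get(b, 0.0) + r_squared(haplotypes[i], haplotypes[j])
            counts[b] = counts.get(b, 0) + 1
    # Key each bin by its starting distance in bp.
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}


# Toy data (hypothetical): SNPs 0 and 1 are perfectly linked, SNP 2 is independent.
positions = [100, 200, 5000]
haplotypes = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
print(ld_decay(positions, haplotypes, bin_size=1000, max_dist=10000))
```

A population whose mean r² drops to a low value in shorter distance bins has faster LD decay in the sense used above.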

Collectively, this evidence suggests that the CPHC has rich genetic diversity and some genetic differences from other populations in East Asia, and may therefore provide additional genetic information on the Han population of East Asia.

Genome-wide detection of selective signals in the CPHC

The above analyses demonstrate the unique genetic components of the CPHC, which may be related to the genetic adaptation of modern humans following their migration to East Asia. To further investigate the genetic components of the CPHC, we performed FST analysis between the CPHC (agricultural group) and 81 hunter-gatherer groups (JPZ) in Southeast Asia23 and identified a total of 380 genes (13,035 SNPs) showing significant genetic differentiation (Fig. 4a). These SNPs were also tested using XP-EHH, a long-range haplotype statistic that identifies alleles segregating in a population that has undergone selection24. We observed 12,535 highly differentiated SNPs (FST (CPHC-JPZ) > 0.1) in 361 genes that were selected only in the CPHC (Supplementary Data 1). We took the top 0.1% regions, which contained a total of 9 SNPs with high derived allele frequencies (FST (CPHC-JPZ) > 0.51). The 9 SNPs are located in the LONP2 (4 SNPs), SIAH1 (1 SNP), TDRD10 (2 SNPs) and SHE (2 SNPs) gene regions.
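
The two filters described above (a fixed cutoff of FST > 0.1, followed by an empirical top 0.1% cutoff) can be sketched as follows. This is an illustration only: it ranks individual SNPs, whereas the unit of the "top 0.1% regions" is not specified in the text, and the function names `differentiated` and `top_fraction` and the toy scores are hypothetical.

```python
import math


def differentiated(fst_by_snp, min_fst=0.1):
    """Keep SNPs whose FST exceeds a fixed threshold (0.1 in the text above)."""
    return {snp: fst for snp, fst in fst_by_snp.items() if fst > min_fst}


def top_fraction(fst_by_snp, fraction=0.001):
    """Return the SNP ids in the top `fraction` of the empirical FST distribution."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always keep at least one SNP so that small inputs still return a result.
    k = max(1, math.ceil(fraction * len(fst_by_snp)))
    ranked = sorted(fst_by_snp, key=fst_by_snp.get, reverse=True)
    return ranked[:k]


# Toy scores (hypothetical ids and values, not results from the study).
toy = {"snp_a": 0.60, "snp_b": 0.05, "snp_c": 0.30, "snp_d": 0.20}
print(differentiated(toy))
print(top_fraction(toy, fraction=0.5))
```

An empirical cutoff of this kind flags outliers relative to the genome-wide distribution; it does not by itself assign a p-value.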

Fig. 4: Genome-wide selection signatures of the CPHC.

a Manhattan plot of the signatures of natural selection from the WGS data of the CPHC. The y-axis represents the FST values. The significance threshold (FST > 0.1) is indicated by the horizontal red line. b, c LocusZoom plots of rs6500380 and rs174618, which are represented by the purple circles. Linkage disequilibrium r2 values were calculated using the East Asian population from the 1KGP.

LONP2 is located on chromosome 16 and encodes an ATP-dependent peroxidase whose overexpression promotes the synthesis of polyunsaturated fatty acids (PUFA), which are essential fatty acids and play an important role in human lipid metabolism25,26. Four SNPs (rs12446160, rs6500380, rs8052293 and rs111915588) are found mainly in East Asia and are most common in the CPHC (DAF > 0.97) (Supplementary Data 1). First, rs12446160 (FST (CPHC-JPZ) = 0.579; XP-EHH (CPHC-JPZ) = 3.620) is a cis-eQTL for LONP2 that can promote its expression27. In addition, LocusZoom indicates that the variant rs6500380 (FST (CPHC-JPZ) = 0.634; XP-EHH (CPHC-JPZ) = 3.254) is highly linked with rs8052293 (R2 = 1), rs111915588 (R2 = 0.929) and an adjacent intronic variant in SIAH1, rs1982782 (FST (CPHC-JPZ) = 0.638; XP-EHH (CPHC-JPZ) = 2.964, R2 = 0.916; Fig. 4b), and acts as a cis-eQTL for another lipid metabolism gene, SIAH1. It can promote SIAH1 expression to enhance ubiquitination of Scp2 and facilitate intracellular transport of cholesterol and fatty acids, reducing the incidence of various metabolic diseases such as atherosclerosis, diabetes and obesity27,28. This suggests that some specific lipid metabolism genes are under natural selection in the CPHC. In addition, two SNPs in TDRD10 (which encodes a protein with Tudor domains), rs12128408 (FST (CPHC-JPZ) = 0.526; XP-EHH (CPHC-JPZ) = -1.732) and rs12136771 (FST (CPHC-JPZ) = 0.514; XP-EHH (CPHC-JPZ) = -0.413), and two SNPs in SHE (which encodes a protein with a Src homology 2 domain), rs6684921 (FST (CPHC-JPZ) = 0.515; XP-EHH (CPHC-JPZ) = -0.206) and rs12118018 (FST (CPHC-JPZ) = 0.514; XP-EHH (CPHC-JPZ) = -1.905), were also present at high frequency (DAF > 0.8) in the CPHC (Supplementary Data 1). A previous study found that these four variants are all cis-eQTLs for SHE27. Meanwhile, TDRD10 and SHE have been reported to be associated with immune components, e.g. serum levels of total protein, C-reactive protein and IL6 receptor29,30,31. This suggests that the immune response is another important adaptive phenotype in the CPHC. However, the XP-EHH selection signals for these genes were not as significant as those for the lipid metabolism genes.

In addition to the variants described above that are specific to the CPHC, 19 genes were also found in two other Han population studies (Westlake BioBank and the NyuWa Genome)4,32 (Supplementary Data 2). Remarkably, 8 of these genes (FADS2, FADS3, CCDC171, SULT1C4, ASXL2, ADH1B, ADH7 and ADH1A) are associated with human glycolipid metabolism, while the other 11 genes (RAB3IL1, RAPGEF2, CCDC138, EDAR, RANBP2, GCC2, KIF3C, LIMS1, SULT1C3, SULT1C2 and C4orf45) have a variety of functions, including roles in the human immune system, angiogenesis, the development of sweat glands and the development of the nervous system33,34,35,36,37,38,39.

Among these 8 glycolipid metabolic genes, the FADS genes (FADS2 and FADS3) encode desaturases that catalyze the biosynthesis of polyunsaturated fatty acids (PUFA)40. These two genes lie next to each other, 6,172 bp apart. We identified a total of 14 SNPs in FADS2 and 3 SNPs in FADS3 (Supplementary Data 2). In particular, the regulatory SNP rs174618 in FADS2 (associated with PUFA concentrations; p < 0.05; FST (CPHC-JPZ) = 0.506; XP-EHH (CPHC-JPZ) = 5.609) and the two regulatory SNPs rs174450 (FST (CPHC-JPZ) = 0.312; XP-EHH (CPHC-JPZ) = 4.578) and rs174455 (FST (CPHC-JPZ) = 0.294; XP-EHH (CPHC-JPZ) = 4.118) in FADS3 are all cis-eQTLs for FADS2, suggesting a possible synergistic role for these variants in modifying polyunsaturated fatty acid concentrations in human serum27,41,42 (Fig. 4c; Supplementary Data 2). Next, SULT1C4 and ASXL2 are involved in insulin secretion. Nine SNPs were found in SULT1C4, which is associated with flavanone metabolism and may promote insulin secretion. Among them, the missense mutation rs1402467 (FST (CPHC-JPZ) = 0.269; XP-EHH (CPHC-JPZ) = 4.706) can regulate insulin secretion and thereby influence glucose metabolism in humans43. Fourteen SNPs were found in ASXL2, an ETP family gene that activates adipogenic genes, regulates insulin secretion and plays an important role in adipogenesis and glycolysis in humans44. In addition, 31 SNPs showed selective signals in CCDC171, a gene involved in fat storage and metabolic regulation. The missense mutation rs1539172 (FST (CPHC-JPZ) = 0.101; XP-EHH (CPHC-JPZ) = 2.457) is a cis-eQTL for CCDC171 and is strongly associated with body mass index and obesity-related diseases (such as hypertension and diabetes)27,31. Finally, ADH1B (8 SNPs), ADH1A (4 SNPs) and ADH7 (6 SNPs) are all alcohol metabolism genes, the best known being the alcohol dehydrogenase gene ADH1B. The mutation rs1229984 (Arg47His, FST (CPHC-JPZ) = 0.313; XP-EHH (CPHC-JPZ) = 2.905) results in higher alcohol dehydrogenase activity and increased alcohol metabolism capacity45. The above analyses showed that glycolipid metabolism is one of the most important phenotypic characteristics of the CPHC, suggesting that it is a key adaptive trait in East Asian populations.

In addition, several genes previously shown to be under selection in East Asia were also observed. For example, OCA2 (42 SNPs; FST (CPHC-JPZ) > 0.1) and PAH (21 SNPs; FST (CPHC-JPZ) > 0.1) are both associated with lighter skin color46,47,48,49. EDAR (40 SNPs; FST (CPHC-JPZ) > 0.1) is associated with hair straightness39. SLC35F3 (126 SNPs; FST (CPHC-JPZ) > 0.1) and SLC35F4 (21 SNPs; FST (CPHC-JPZ) > 0.1) are associated with vitamin B1 metabolism50 (Supplementary Data 3).

Taken together, the analyses of the selective signals of the SNPs suggest that the CPHC carries a number of specific and high-frequency mutations, especially SNPs related to glycolipid metabolism, and may have acquired adaptive traits as a result of environmental and dietary changes during migration.

Glycolipid metabolism-associated mutations may occur as a result of dietary adaptations due to the expansion of agriculture

The Central Plains have been an important source of crops in East Asia and even globally, and several of the important glycolipid metabolism-associated genes mentioned above show strong selection signals in the CPHC. Therefore, we further investigated whether the occurrence of these specific mutations could be related to agricultural domestication. We analyzed 21 typical mutations in these genes (LONP2, FADS2, FADS3, CCDC171, SULT1C4, SIAH1, ASXL2, ADH1B, ADH1A and ADH7) in 14 ancient DNA samples from the Yellow River basin51,52. We inferred when each candidate SNP may have emerged from the age of the ancient DNA samples and the number of samples in which its derived allele was genotyped. First, these SNPs were found to be widespread in the ancient human population of the Yellow River basin (Table 1). The derived alleles of some SNPs can be dated to 6,000 BP (the Xiaowu site in Lingbao City, Henan Province), such as the derived allele A of rs12446160 (LONP2; totalCov ≥ 2 in 9/14 samples) and the derived allele T of rs174618 (FADS2; totalCov ≥ 2 in 12/14 samples). This time coincides with the cultivation of crops (such as millet) in the Central Plains51. Other typical mutations in genes associated with glycolipid metabolism (CCDC171, SULT1C4, SIAH1 and ASXL2) showed similar results, except for two SNPs in FADS3 (Table 1). However, it is also worth noting that derived alleles of alcohol metabolism variants associated with the domestication of rice were not widespread in ancient humans from the Yellow River basin. The well-known missense variant rs1229984 in ADH1B, which is associated with alcohol metabolism, is predominantly the ancestral C (totalCov ≥ 2) in 7/14 ancient DNA samples (Xiaowu, Lingbao; Dahecun, Zhengzhou; Pingliangtai, Huaiyang; Haojiatai, Luohe; and Niecun, Jiaozuo), whereas the derived T (totalCov ≥ 2) is found in only 3/14 ancient DNA samples (Wadian, Yuzhou; Niecun, Jiaozuo). Previous studies have shown that the rs1229984_T allele appeared around 7,000 years ago, coinciding with the beginning of the domestication of rice in the Yangtze River basin45. Two other alcohol metabolism genes, ADH1A (rs1813977, rs1039151 and rs3819197) and ADH7 (rs284784 and rs1827567), showed similar results to ADH1B (Table 1). Therefore, the derived mutations of the ADH genes did not reach high frequencies in the Central Plains during the Late Neolithic.

Table 1 The allele types of some typical mutations associated with dietary metabolism in 14 ancient DNA samples from the Yellow River Basin

In summary, the ancient DNA analysis suggests that the emergence of these mutations in LONP2, FADS2, CCDC171, SULT1C4, SIAH1 and ASXL2 in the Central Plains may be related to agricultural domestication and dietary adaptation in the human populations of the Yellow River basin. It also suggests that the genetic basis of dietary adaptation may differ between the major crop regions of the Yellow River basin (wheat and millet dominated) and the Yangtze River basin (rice dominated).

Specific structural variants contribute to the genetic diversity of the populations of East Asia

We used Paragraph to genotype the SVs in the CPHC11 and identified and manually checked a total of 26 SVs (14 deletions and 12 insertions) with large frequency differences between the CPHC and other populations worldwide. Of these, 9 SVs (4 deletions and 5 insertions) are specific to the CPHC (AF: 0.203–0.516). The other 17 SVs (10 deletions and 7 insertions) occur at high frequency in the CPHC (AF: 0.528–0.924), at least 0.4 higher than the average frequency in other populations worldwide (Fig. 5a, b and Table 2).
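
The frequency-based grouping of SVs described above can be sketched as a simple rule. The 0.4 difference from the mean frequency in other populations comes from the text; the cutoffs used here for calling an SV "specific" (`specific_max_af` and `min_cphc_af`) are assumptions chosen for illustration, since the text does not give a numeric definition. The function name `classify_sv` is hypothetical.

```python
def classify_sv(cphc_af, other_afs, min_diff=0.4, specific_max_af=0.05, min_cphc_af=0.2):
    """Classify an SV as 'specific', 'high-frequency' or 'other' from allele frequencies.

    cphc_af:   allele frequency in the CPHC.
    other_afs: allele frequencies in the comparison populations.
    min_diff:  required excess over the mean of other_afs (0.4 in the text).
    specific_max_af, min_cphc_af: assumed cutoffs, not taken from the study.
    """
    if not other_afs:
        raise ValueError("at least one comparison population is required")
    mean_other = sum(other_afs) / len(other_afs)
    # 'Specific': common in the CPHC but absent or nearly absent everywhere else.
    if cphc_af >= min_cphc_af and max(other_afs) <= specific_max_af:
        return "specific"
    # 'High-frequency': present elsewhere but much more common in the CPHC.
    if cphc_af - mean_other >= min_diff:
        return "high-frequency"
    return "other"


# Frequencies quoted in the text, used here as illustrative inputs:
# DEL_21699 (CPHC 0.222; African 0 and European 0.018 in HGDP).
print(classify_sv(0.222, [0.0, 0.018]))
# INS_27791 (CPHC 0.924; end points of the non-East Asian range 0.441-0.508).
print(classify_sv(0.924, [0.441, 0.508]))
```

Changing the assumed cutoffs changes which SVs count as specific, so they would need to be matched to the criteria in the Methods before any real use.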

Fig. 5: Specific and high-frequency SVs in the CPHC.

Manhattan plot of Deletions (a) and Insertions (b) in the CPHC. The frequency distribution map of DEL_21699 (c) and DEL_38406 (d). The population abbreviations are given in Supplementary Data 4.

Table 2 Specific and high-frequency structural variants in the CPHC

Among the 9 East Asian population-specific SVs, DEL_21699 is a 170 bp deletion located in LINC01749, with a frequency of 0.222 in the CPHC (the African frequency is 0 and the European frequency is 0.018 in HGDP) (Fig. 5c). LINC01749 encodes a non-protein-coding RNA transcript longer than 200 nucleotides that has an important role in esophageal squamous cell carcinoma53. We performed linkage disequilibrium (LD) analysis of the 10 kb regions upstream and downstream of DEL_21699 and observed a highly linked SNP, rs6010839 (R2 = 0.96), which is an eQTL for the esophageal squamous cell carcinoma risk gene TCFL527,54. The Central Plains of China is known to be a region with a high incidence of esophageal squamous cell carcinoma55,56. We also observed DEL_21699 in other groups with a high incidence of esophageal squamous cell carcinoma (e.g., Hazara in northeastern Iran, Russians in Eastern Europe and Japanese) (Fig. 5c)57. This suggests that DEL_21699 may be involved in the development of esophageal squamous cell carcinoma. DEL_38406 is a 92 bp deletion located in intron 10 of FAM102A, with a frequency of 0.223 in the CPHC (the African and European frequencies are 0 in HGDP) (Fig. 5d). A previous study reported that FAM102A plays an important role in primary angle-closure glaucoma (PACG), a predominant type of glaucoma in East Asian populations58,59. We found a highly linked variant, rs2417064 (R2 = 0.93), which is an eQTL for FAM102A27. This suggests that DEL_38406 may have a potential function in PACG.

Next, we also identified 17 high-frequency SVs (10 deletions and 7 insertions) in the CPHC. INS_27791 is a 639 bp insertion (CPHC = 0.924; HGDP.Han = 0.875; HGDP non-East Asian populations = 0.441–0.508) located upstream of FGF21. FGF21 encodes a metabolic hormone that is synthesized by several organs, regulates glucose sensitivity and stimulates glucose uptake in adipose tissue60. LD analysis showed that INS_27791 is highly linked with some intergenic variants, including the FGF21 eQTL SNP rs4801773 (R2 = 1)27,31. Therefore, INS_27791 may affect FGF21 expression to alleviate diet-induced insulin resistance and regulate hepatic gluconeogenesis. INS_03097 is a 337 bp insertion (CPHC = 0.751; HGDP.Han = 0.579; HGDP non-East Asian populations = 0.018–0.159) located in intron 4 of SLC19A2, which encodes a thiamine transporter. Deficiency of this transporter impairs intracellular thiamine transport, which affects insulin secretion and glucose metabolism and in turn increases the risk of diabetes in people on a high-starch diet61. Therefore, some of the high-frequency SVs found in the CPHC may also have contributed to dietary adaptation in East Asian populations.

In addition to the typical examples listed above, the remaining 22 SVs (7 specific to the CPHC and 15 at high frequency in the CPHC) are located in intergenic regions, e.g. INS_18720 (119 bp), INS_42051 (56 bp) and INS_05228 (77 bp), or in the intronic regions of genes, e.g. DEL_29932 in PPP2R2B (2,592 bp), DEL_24082 in ENPP7P2 (72 bp), INS_21702 in DPEP1 (61 bp) and INS_46471 in OSMR-DT (96 bp) (Table 2). However, very little is known about the phenotypes associated with these structural variants and genes.

In general, the results suggest the presence of some high-frequency or specific SVs in the CPHC, which may play an important role in the development of East Asia-specific diseases and in dietary adaptation.

Discussion

The Central Plains of China, which lie in the middle and lower reaches of the Yellow River basin, are one of the core areas of agricultural domestication in East Asia. Genomic studies of populations in the Central Plains are critical to understanding the genetic structure of East Asian populations and the genetic adaptations resulting from agricultural expansion. In the current study, whole-genome sequencing data revealed the unique genetic structure of the CPHC and its rich genetic diversity. The newly discovered high-frequency mutations and structural variants are an important genetic resource for East Asian populations.

The PCA results suggest that the CPHC has a more concentrated set of genetic components. Compared with CHS, the CPHC has a closer genetic affinity with CHB, suggesting that it carries more of the genetic characteristics of northern populations. The SV-based PCA also showed that Native Americans are closer to East Asian populations (Fig. 2b), which is consistent with previous SNP-based analyses62,63,64,65. However, in the SV-based PCA the Han Chinese in the HGDP are mixed with Yi, Hui and other East Asian populations, which is inconsistent with the SNP-based PCA results (Fig. 2c, d) and may be due to the small sample sizes of certain populations in the HGDP and the insufficient number of SVs. Therefore, this study also adds to previous databases of Han Chinese data.

The ADMIXTURE and LD decay analyses indicated higher genetic diversity in the CPHC (Fig. 3b, c). K = 5 best represents the number of ancestral components present in the analyzed populations, with three ancestral components observed in East Asia (Fig. 3b; Supplementary Fig. 5). These three ancestral components correspond to populations from southern East Asia (CDX and KHV), Han Chinese populations (CPHC, CHB and CHS) and the Japanese population (JPT). Compared to K = 4, a new ancestral component emerges in the Japanese population, separating it from other East Asian populations. These results may be related to the relatively isolated island geography of Japan. Long-term genetic drift and relatively independent populations may have led to the development of distinct genetic characteristics in the Japanese population.

The effective population size analysis shows a rapid increase in the population of the CPHC around 10,000 years ago, probably as a result of the beginning of crop cultivation and the availability of a stable source of food66. This implies that crop domestication may have facilitated dietary adaptation in East Asian populations and contributed to the population increase in East Asia. Early human diets were predominantly those of hunter-gatherers, rich in animal protein and fiber. However, with the expansion of agriculture, the human diet gradually shifted to a starch-rich plant-based diet, which may have exerted significant selective pressure on insulin secretion, the synthesis of long-chain polyunsaturated fatty acids and other metabolic processes3,40. In the current study, we found that several dietary metabolism-associated genes (LONP2, FADS2, FADS3, SIAH1, SULT1C4, CCDC171, ASXL2, ADH1A, ADH1B and ADH7) show strong selection signals in the CPHC. These genes may be important for the dietary evolutionary history of East Asian populations. For example, we observed positive selection signals for LONP2 and FADS2, which have been implicated in promoting polyunsaturated fatty acid synthesis in humans. The increased capacity for endogenous polyunsaturated fatty acid synthesis is also a consequence of the shift from a carnivorous diet to a diet rich in starch crops25,40. Evidence from ancient DNA suggests that signals of metabolic selection in the CPHC may have coincided with the expansion of agriculture in the region. For example, the derived alleles of 10 SNPs in 6 candidate genes (LONP2, FADS2, SIAH1, SULT1C4, CCDC171 and ASXL2) date back to approximately 6,000 BP, which coincides with the cultivation of millet in the Central Plains22,51. This suggests that the high frequency of metabolism-related mutations may be the result of dietary adaptation in the CPHC and may play an important role in glycolipid metabolism in East Asian populations.

The Yangtze River basin is an important area for rice domestication, unlike the Yellow River basin, where millet and wheat are the main crops67. Selection signals for the ADH genes (e.g. ADH1A, ADH1B, ADH4, ADH5, ADH6 and ADH7) associated with the rice diet have been reported in the local populations4,45. In the current study, ADH1B, ADH1A and ADH7 also showed significant selection signals in the CPHC. However, the ancient DNA samples from the Central Plains mainly carried the ancestral alleles of the mutations in these genes (Table 1). One possible explanation is that the ADH mutations in the modern populations of the Central Plains are the result of more recent population admixture. The alcohol metabolism mutations in the Yangtze River basin (e.g. the rs1229984_T allele, which appeared around 7000 years ago) differ in timing from the glycolipid metabolism mutations in the Yellow River basin (e.g. around 6000 years ago, Table 1), suggesting that there are differences in crop preference and timing of domestication between northern and southern East Asian populations. Populations in northern and southern China may therefore have undergone different adaptive processes due to differences in crop domestication during the history of agricultural development.

The SVs found in this study will greatly expand the resources of structural variants in East Asia. These SVs may be widespread in East Asia, and their presence may have contributed to the adaptive evolution of phenotypes in East Asian populations. Interestingly, we also found some high-frequency SVs associated with human glycolipid metabolism in the CPHC, and these were the major alleles (AF > 0.5) in East Asia. For example, INS_27791 (FGF21) and INS_03097 (SLC19A2) showed high frequencies and may be associated with adipose tissue production and glucose secretion60,61. We therefore speculate that the CPHC has a higher capacity for glycolipid metabolism, which may also reflect adaptation to the diet brought about by agricultural civilization. Some other East Asian-specific structural variants, such as DEL_21699 (TCFL5) and DEL_38406 (FAM102A), may be associated with esophageal squamous cell carcinoma and primary angle-closure glaucoma, which are highly prevalent diseases in East Asia57,58. More interestingly, the diseases associated with these two SVs may also be related to glycolipid metabolic adaptations. For example, it is well known that most wheat-based, high-starch foods in the Central Plains are eaten hot. TCFL5 and FAM102A can promote human M2 macrophage polarization and osteoclast differentiation, which increases glycolipid metabolism and reduces the energy burden of hot foods68,69. However, prolonged exposure to hot food can damage the epithelial cells of the esophageal lining and cause esophageal squamous cell carcinoma57. Meanwhile, hot foods induce an increase in oxidative stress, which promotes the development of primary angle-closure glaucoma70,71,72.

In conclusion, we comprehensively characterized the genomic variants of the CPHC, demonstrating that the CPHC has a unique genetic component and higher genetic diversity compared with other East Asian populations. The CPHC genome contains variants related to human glycolipid metabolism that may be the result of adaptation to agricultural civilization, providing important clues for further studies on the adaptive evolution of diet in agricultural populations.

Methods

Sample collection and whole-genome sequencing

We collected blood samples from 492 Han Chinese individuals from 16 cities in Henan Province, China. The volunteers were mainly of Han ancestry within three generations, and all consented to blood collection. Genomic DNA was extracted from blood using the Serum/Plasma Circulating DNA Kit (Tiangen, Beijing, China) according to the manufacturer’s instructions. The DNA samples were then subjected to library construction and quality control according to standard Illumina library preparation protocols. Finally, all DNA libraries were sequenced using the HiSeq 2500 sequencing platform, and whole-genome sequencing data with an average depth of 25.54× were obtained. Written informed consent was obtained from each individual before their inclusion in the study. All protocols of this study were approved by the Institutional Review Board of Zhengzhou University (ZZUIRB2009-002 and ZZUIRB2022-14) and the Institutional Review Board of The Second Affiliated Hospital of Zhengzhou University (2020013).

Variant calling, SV genotyping and annotation

Variant calling followed the standard Genome Analysis Toolkit (GATK version 4.2.6) pipeline73. We downloaded the human reference genome assembly 38 from the resource bundle on the GATK website (ftp://gsapubftpanonymous@ftp.broadinstitute.org/bundle/). In brief, FastQC version 0.11.3 was used to check the quality of the sequencing data and to remove adapter sequences and low-quality data. Next, the sequencing reads were aligned to the human reference genome assembly 38 using the BWA-MEM version 0.7.1574, and the results were converted into BAM files by SAMtools version 1.975. We further sorted and de-duplicated the BAM files using SAMtools version 1.9 and performed base quality score recalibration (BQSR and ApplyBQSR) on the BAM files to improve the reliability and accuracy of variant calling. Variants were then called from each sample’s BAM file with GATK HaplotypeCaller to generate gVCF files, which were merged using GenotypeGVCFs to generate a population VCF file containing all 492 individuals. Finally, the merged VCF file was recalibrated for variant quality scores (VQSR and ApplyVQSR), and all SNPs and INDELs from the training set were recalibrated with 99% truth sensitivity to generate the final variant dataset.
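As a rough illustration of the per-sample part of this workflow, the sketch below assembles GATK command lines for base recalibration and gVCF generation. The sample name, file paths and the known-sites resource are hypothetical placeholders, and the exact options used in this study are not stated in the text, so this is one plausible arrangement rather than the pipeline that was actually run.

```python
from pathlib import Path


def gatk_per_sample_commands(sample, bam, reference, known_sites, out_dir):
    """Return GATK command lines (as argument lists) for one sample.

    All paths and the known-sites resource are illustrative assumptions.
    The commands are only built here, not executed.
    """
    out = Path(out_dir)
    recal_table = out / f"{sample}.recal.table"
    recal_bam = out / f"{sample}.recal.bam"
    gvcf = out / f"{sample}.g.vcf.gz"
    return [
        # Step 1: model systematic base-quality errors at known variant sites.
        ["gatk", "BaseRecalibrator", "-R", str(reference), "-I", str(bam),
         "--known-sites", str(known_sites), "-O", str(recal_table)],
        # Step 2: write a BAM file with recalibrated base qualities.
        ["gatk", "ApplyBQSR", "-R", str(reference), "-I", str(bam),
         "--bqsr-recal-file", str(recal_table), "-O", str(recal_bam)],
        # Step 3: call variants in gVCF mode for later joint genotyping.
        ["gatk", "HaplotypeCaller", "-R", str(reference), "-I", str(recal_bam),
         "-O", str(gvcf), "-ERC", "GVCF"],
    ]
```

Each returned list could be passed to `subprocess.run` once the inputs exist; joint genotyping and variant quality score recalibration would follow as separate steps.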

For SV genotyping, we used Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations11. The CPHC and 54 world populations in the HGDP were included in our study. For genotyping, we selected a set of 107,590 high-confidence SVs (insertions and deletions) based on fully phased genome assembly data from 35 individuals, and we used RepeatMasker to annotate the SVs in the CPHC13,76. Finally, we used Samplot and IGV to manually check the breakpoints of these structural variants in the CPHC77,78.

Before annotating the variants, we applied the following steps to filter out low-quality SNPs. First, SNPs with a genotype quality of less than 10 in more than 50% of the 492 CPHC individuals were removed. Subsequently, we calculated Hardy-Weinberg equilibrium p values and removed variants with p < 10^−6 using PLINK version 1.0979. Finally, we removed all multi-allelic variants using BCFtools version 1.1280. Approximately 5.63 million SNPs and 540,000 INDELs were removed by these filtering steps. The remaining SNPs and all SVs were annotated with genes, variant types and other information using the Variant Effect Predictor (VEP)12.
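The genotype-quality and multi-allelic criteria above can be expressed as a simple per-site rule. The following is a minimal sketch, not the authors' code: the function name is hypothetical, missing GQ values are assumed to count as low quality, and the Hardy-Weinberg step (performed with PLINK in the study) is left out.

```python
def keep_site(gq_values, is_biallelic, min_gq=10, max_low_fraction=0.5):
    """Decide whether a SNP passes the quality filters described in the text.

    gq_values: per-individual genotype qualities (None = missing, which this
    sketch treats as low quality; that choice is an assumption).
    """
    if not is_biallelic or not gq_values:
        return False
    low = sum(1 for gq in gq_values if gq is None or gq < min_gq)
    # Remove the site only if MORE than half of the individuals have low GQ.
    return low / len(gq_values) <= max_low_fraction
```

For example, a biallelic site where one of four individuals has GQ below 10 is kept, whereas a site where three of four fall below 10 is dropped.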

Principal component analysis (PCA)

For the PCA of SNPs, we downloaded the SNP dataset from the 1000 Genomes Project (1KGP; http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/) and performed a Hardy-Weinberg equilibrium (HWE) test. Subsequently, we removed variants with p < 10^−6, site missing rate > 0.05, or MAF < 0.05. We then extracted the SNPs that overlapped between the CPHC and the 1KGP and merged them into a single VCF file. For the merged VCF file, we used PLINK version 1.09 --indep-pairwise 50 10 0.5 to exclude linked variants (R^2 > 0.5), retaining approximately 0.17 million variants for PCA. For the PCA of SVs, because of the smaller number of variants, we only removed variants with p < 10^−4 or site missing rate > 0.2, and individuals with a missing rate > 0.5. Next, we combined the SVs that were present in both the CPHC and the Human Genome Diversity Project (HGDP) into a single VCF file. Finally, the PCA of SNPs and SVs in the CPHC was conducted with the EIGENSOFT tool81 and visualized using the ggplot2 package in R.
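To make the pruning parameters concrete (a window of 50 variants, a step of 10 variants and an R^2 threshold of 0.5), here is a simplified greedy sketch of window-based LD pruning on genotype dosages. It is an illustration only: PLINK's actual procedure differs in detail (for instance in which variant of a correlated pair is dropped), and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return cov * cov / (var_x * var_y)


def prune(genotypes, window=50, step=10, threshold=0.5):
    """Return indices of variants retained after greedy pairwise pruning.

    genotypes: list of per-variant dosage lists, in genomic order.
    Within each sliding window, the later variant of any pair whose
    r^2 exceeds the threshold is removed (a simplifying assumption).
    """
    removed = set()
    for start in range(0, len(genotypes), step):
        end = min(start + window, len(genotypes))
        idx = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(genotypes[i], genotypes[j]) > threshold:
                    removed.add(j)
    return [i for i in range(len(genotypes)) if i not in removed]
```

With three variants where the second duplicates the first and the third is only weakly correlated with it, the sketch keeps the first and third.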

Admixture analysis and LD decay

The kinship coefficient between individuals was calculated using the KING tool82, and the data of 20 individuals with kinship > 0.0442 were excluded. After filtering individuals from the SNP dataset, we used PLINK version 1.09 --indep-pairwise 50 10 0.5 to exclude highly linked variants (R^2 > 0.5), and about 0.13 million SNPs were retained. We then conducted admixture analysis on the CPHC as well as the CDX, KHV, CHS, CHB, JPT, CEU and YRI populations from the 1KGP for K = 2–7 with the ADMIXTURE tool, repeating the analysis 10 times for each K value with different seeds17. K is the number of ancestral populations assumed by ADMIXTURE for population structure analyses. We also calculated R^2 within a maximum distance of 300 kb for the above populations using the PopLDdecay tool to assess the extent of linkage disequilibrium83.
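An LD decay curve of the kind described here summarizes pairwise R^2 values by physical distance. The sketch below averages precomputed (distance, R^2) pairs into distance bins up to 300 kb; the 10 kb bin width and the function name are assumptions made for illustration, and PopLDdecay's own binning may differ.

```python
from collections import defaultdict


def ld_decay(pairs, max_dist=300_000, bin_size=10_000):
    """Average r^2 per distance bin.

    pairs: iterable of (distance_bp, r2) for SNP pairs.
    Returns {bin_start_bp: mean_r2}, with bins ordered by distance.
    Pairs farther apart than max_dist are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dist, r2 in pairs:
        if dist < 0 or dist > max_dist:
            continue
        # Clamp so that a pair at exactly max_dist falls in the last bin.
        bin_start = (min(dist, max_dist - 1) // bin_size) * bin_size
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```

Plotting the returned means against bin start for each population gives curves that can be compared across populations.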

Effective population size

To explore the timing of divergence between the CPHC and other populations, we analyzed the effective population size of the CPHC and of the CDX, KHV, CHS, CHB, JPT, CEU and YRI populations in the 1KGP. We phased the VCF file of each population using SHAPEIT version 4.2.284. The effective population size was then estimated using MSMC2 with a generation time of 30 years, keeping other parameters at their default values19.
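MSMC2 reports time boundaries and coalescence rates in scaled units, which are converted to years and effective population size using a mutation rate and a generation time. The sketch below applies the usual conversion; the generation time of 30 years is taken from the text, whereas the mutation rate of 1.25e-8 per site per generation is an assumption, since the value used in this study is not stated here.

```python
def scale_msmc2(rows, mu=1.25e-8, generation_years=30):
    """Convert scaled MSMC2 output to years and effective population size.

    rows: iterable of (left_time_boundary, lambda) pairs in scaled units.
    mu: assumed per-site, per-generation mutation rate (not from the text).
    Returns a list of (years_ago, Ne) tuples.
    """
    scaled = []
    for left_boundary, lam in rows:
        years = left_boundary / mu * generation_years  # scaled time -> years
        ne = 1.0 / (2.0 * mu * lam)                    # coalescence rate -> Ne
        scaled.append((years, ne))
    return scaled
```

Because both axes depend on the mutation rate, a different choice of `mu` rescales the estimated dates and population sizes without changing the shape of the curve.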

Detection of positive selection in the CPHC

We used VCFtools version 0.1.1379 to compute the whole-genome FST with a sliding window of 50 kb between the CPHC and the Cambodian population (JPZ)16,85. The dataset of the Cambodian population was obtained from previous studies23 and merged with the variants in the CPHC. We obtained a total of 5400 genomic regions, from which we identified the top 5% of regions with the highest FST values as potential adaptation signals. The YRI and CEU populations in the 1KGP were selected as outgroups, and 13,035 SNPs with FST > 0.1 and higher allele frequencies in the CPHC were selected as candidate adaptation signals. We also used the phased VCF files of the CPHC and JPZ to calculate XP-EHH values, and the top 5% (XP-EHH > 1.697) were treated as the selected regions24.
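Selecting the top 5% of windows amounts to ranking windows by their statistic and taking an empirical cutoff. The following sketch shows this for windowed FST values that have already been computed (for example by VCFtools); the function name and the input layout are hypothetical, and the same logic would apply to XP-EHH scores.

```python
def top_fraction_windows(window_fst, fraction=0.05):
    """Return (threshold, windows) for the highest-scoring fraction of windows.

    window_fst: dict mapping (chrom, start_bp) -> windowed FST value.
    The threshold is the lowest FST among the selected windows.
    """
    if not window_fst:
        raise ValueError("no windows supplied")
    ranked = sorted(window_fst.items(), key=lambda kv: kv[1], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    selected = ranked[:n_top]
    threshold = selected[-1][1]
    return threshold, [window for window, _ in selected]
```

With 5400 windows and a fraction of 0.05, this selects 270 windows.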

To investigate SVs specific to East Asian populations, we compared them with non-East Asian populations (African, European, American, Middle Eastern, Central South Asian and Oceanian populations) from the HGDP. The Han Chinese population from the HGDP was also compared with the CPHC to ensure consistency across sequencing platforms. Finally, we chose SVs with frequencies greater than 0.2 in the CPHC and less than 0.02 in the non-East Asian populations of the HGDP as East Asian-specific variants. SVs with a frequency difference of more than 0.4 between the CPHC and the non-East Asian populations of the HGDP were selected as high-frequency variants in East Asia.
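The two frequency criteria can be written as a small classification rule. This is a sketch under stated assumptions: the text does not say whether the thresholds are applied per population group or to a pooled frequency, so the version below requires them to hold for every non-East Asian group, and it treats the difference as CPHC frequency minus the other group's frequency. The labels and function name are hypothetical.

```python
def classify_sv(af_cphc, af_non_east_asian):
    """Label an SV using the frequency thresholds described in the text.

    af_cphc: allele frequency in the CPHC.
    af_non_east_asian: dict mapping group name -> allele frequency.
    Assumption: each criterion must hold against every non-East Asian group.
    """
    labels = set()
    others = list(af_non_east_asian.values())
    if not others:
        return labels
    # East Asian-specific: > 0.2 in the CPHC and < 0.02 in every other group.
    if af_cphc > 0.2 and all(af < 0.02 for af in others):
        labels.add("east_asian_specific")
    # High-frequency in East Asia: CPHC exceeds every other group by > 0.4.
    if all(af_cphc - af > 0.4 for af in others):
        labels.add("high_frequency_in_east_asia")
    return labels
```

An SV can satisfy both criteria at once, for instance with a CPHC frequency of 0.6 and frequencies below 0.02 elsewhere.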

Ancient DNA analysis

We analyzed 21 representative mutations related to dietary metabolism using 14 ancient DNA samples from the Central Plains region (Lingbao city, Xiaowu site; Xingyang county, Wanggou village; Zhengzhou, Dahecun; Yuzhou city, Wadian site; Huaiyang, Pingliangtai site; Luohe city, Haojiatai site; Jiaozuo city, Niecun site) (Supplementary Fig. 6). Genotype calling was performed according to previously published protocols51,86.

Statistics and reproducibility

The software used for data analysis in this study is as follows: BCFtools (https://samtools.github.io/bcftools/bcftools.html), FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), BWA, SAMtools (https://github.com/samtools/), GATK, VCFtools (http://vcftools.sourceforge.net/), PLINK, RepeatMasker (https://www.repeatmasker.org/), VEP, Samplot (https://github.com/ryanlayer/samplot), IGV, R, EIGENSOFT, ADMIXTURE, PopLDdecay (https://github.com/BGI-shenzhen/PopLDdecay), TreeMix (https://github.com/carolindahms/TreeMix) and MSMC2. The databases used in the current study are described in the Methods.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.