Introduction

Maritime travel has greatly impacted human migration history, enabling long-distance movement that not only introduced humans to many islands and remote continents, but also increased biological and cultural interactions between different human groups. In southern East Asia, multiple waves of population admixture and turnover as well as the spread of cultures and languages have been observed across oceans1,2, and the migration of proto-Austronesian humans from coastal southern East Asia to islands of Southeast Asia and the Southwest Pacific has been well characterized3,4,5. In northern East Asia, human movement and interaction along the coast and across Pacific islands such as those of the Japanese archipelago greatly impacted the region, as the eastern coastline of East Asia was an important route for the spread of crops and trade goods (e.g. rice) from mainland East Asia to the Japanese archipelago6,7. Previous ancient genomic studies have shown gene flow from ancient hunter-gatherers from Japan (e.g. Jōmon) into prehistoric populations from Far East Siberia (e.g. Boisman_MN8) and the West Liao River basin6. Efforts to investigate the genetic connections and history between mainland East Asians and populations from the Japanese archipelago have shown that populations from the Kofun period (~1750–1400 BP) of Japan and historical Nagabaka populations from the Ryukyu islands show partial ancestry related to northern East Asians6,7,9 and can be described by a three-ancestry model, in which they carry ancestry related to Jōmon hunter-gatherers and to two mainland East Asian sources. One mainland East Asian source, which appeared during the Yayoi period (~2300–1750 BP) of Japan, has been associated with ancient northern East Asians. The other, which arrived in the Japanese archipelago after the Yayoi period, has not been clearly identified, and only present-day Han populations have been used as a proxy7,9.
Even with more than 3000 deeply sequenced genomes from present-day Japanese populations, a suitable East Asian ancestral source population has yet to be found10. In previously published studies, differences in genetic structure were observed between populations in the Japanese archipelago, with different genetic patterns between populations from the main islands (Hondo) and the Ryukyu islands, complicating the population history of prehistoric Japan further10,11,12,13. Thus, increased sampling of ancient humans from coastal regions of mainland East Asia, where sampling has been limited to a few localities and time periods, is vital for determining the East Asian ancestry that made substantial contributions to the genetic composition of humans from the Japanese archipelago.

Previous sampling of ancient ShanDong populations from the Early Neolithic period4 (~9500–7700 BP) shows a shared ancestry that falls within the diversity of ancient northern East Asian ancestries spanning from the Upper Yellow River Basin to the Amur River Basin. However, younger populations from the ShanDong region have yet to be sampled, despite a rich archaeological context. The ShanDong region was home to one of the longest and most influential Neolithic cultures in East Asian prehistory, the DaWenKou culture (6000–4600 BP)14,15,16,17,18,19. The DaWenKou culture spanned the Middle and Late Neolithic and was primarily located in ShanDong province, and it co-existed with the YangShao culture that was distributed along the Yellow River14,15,17,18. Interactions between these two cultures were highly dynamic, where influences from the YangShao culture can be observed in sites associated with DaWenKou culture20. Ultimately, by 4600 BP, populations in the Yellow River Basin and the ShanDong region both showed cultural remains associated with the LongShan culture (4600–4000 BP)20,21. However, the population movement and interaction associated with the transition to the LongShan culture in both the Yellow River and ShanDong regions is still unknown.

For the early dynastic period spanning the Xia to the Jin Dynasties (~4000–1500 BP), the historical and archaeological record emphasizes the prominent role of trading in the coastal regions of East Asia, leading to increased communication and conflict as early as the Shang Dynasty22. In ShanDong, the predominant culture was the Dongyi culture, which was culturally influenced by populations from the Yellow River region during the Shang Dynasty through salt trading23. Archaeological and historical studies point to frequent conflicts in coastal regions since the Shang Dynasty that ultimately led to the incorporation of the ShanDong region under the rule of the Western Zhou Dynasty24. The effect of increased interaction through trade and conflict across the Yellow River and ShanDong regions25 on the population makeup of coastal northern East Asians is unclear, due to a lack of aDNA evidence in the coastal region in East Asia from this time period.

To study changes in coastal populations from the ShanDong region during the dynamic period spanning from the early Neolithic period to the Jin Dynasty and examine the impact of ShanDong populations on nearby populations in the Yellow River region and the Japanese archipelago, we collected 85 individuals from 11 sites dating from 6000 to 1500 BP from ShanDong, which spans a sixth of the coastline of China. By retrieving aDNA evidence from these individuals, we reconstructed the history of ShanDong populations and resolved the population dynamics of mainland, coastal, and archipelago East Asians from the DaWenKou cultural period to the early dynastic period.

Results

We generated genome-wide data from 85 ancient individuals sampled from 11 sites from the ShanDong region. Radiocarbon dating indicates that these individuals span from ~6000 to 1500 BP, covering the DaWenKou cultural period to the Jin Dynasty (Fig. 1A, B). Coverage across the genome-wide data for these individuals ranged from 0.028 to 2.696×. Specifically, 78 individuals with at least 40,000 SNPs were retained for downstream population genetic analyses, and the seven remaining low-coverage individuals (labeled with the suffix “_low”) were only included in limited downstream analyses (Supplementary Table S1). We estimated the contamination level using X chromosomes (males) and mitochondrial genomes (males and females)26,27. For 77 sampled individuals with at least 40,000 SNPs and two low-coverage individuals, we estimated contamination levels lower than 5.0%. For the three individuals (TL4773_d_k, BQ4625_d_low_k, YX4790_d_low) with high contamination (>5.0%), we restricted our analyses to fragments showing characteristic aDNA deamination when performing genotype calling for downstream analyses28,29 (Supplementary Table S1).

Fig. 1: Spatial, temporal, and genetic structure associated with ancient individuals from the ShanDong region.

A Geographic distribution of the archaeological sites in ShanDong and other published archaeological sites from which human nuclear genomic data were sampled. Blue denotes sites associated with the DaWenKou cultural period (DWK, 6000–4600 BP), orange denotes sites associated with the LongShan cultural period (LS, 4600–4000 BP), and green denotes sites associated with the early Chinese dynastic period, specifically the Shang dynasty to the Han dynasty (CD, 3500–1500 BP). Different shapes are used to distinguish different sites. B Chronological information for sampled individuals from different sites, determined using corrected radiocarbon dating of at least one skeletal sample from each site. Color and shape of markers correspond to A. Note that the FuJia site was assigned to the DaWenKou cultural period because it is described in the archaeological excavation report as having the typical characteristics of the middle and late DaWenKou culture. The FuJia site’s upper dating limit was determined to be 4500 BP, ~100 years younger than the lower dating limit for the DaWenKou cultural period. C A principal component analysis (PCA), where ancient East Asians were projected onto the PC1-PC2 plane constructed from present-day southern, northern, and Tibetan East Asian populations. Color and shape of markers for the newly published ancient ShanDong populations of this study are the same as in A and B. Gray points indicate present-day East Asians, and their labeling can be observed in the PCA shown in Supplementary Fig. S1. D Enlarged PCA showing the region within the dashed lines in panel C.

We explored the genetic relationship these coastal populations from ShanDong shared with nearby mainland East Asians, as well as island populations of the Japanese archipelago. We then used the observed genetic connections to examine how these coastal populations influenced and were influenced by inland and island neighbors. Newly sampled ShanDong populations were associated with one of three periods: the DaWenKou cultural period spanning the Neolithic period dating to 6000–4600 BP (ShanDong_DWK), the LongShan cultural period spanning the Neolithic period dating to 4600–4000 BP (ShanDong_LS), and the early Chinese dynastic period spanning 3500–1500 BP (ShanDong_CD, Fig. 1A, B).

North-south interactions influenced ShanDong populations since at least 7700 BP

We first examined the genetic relationship between the newly sampled individuals from the ShanDong region and previously sampled ancient and present-day East Asians. Using cluster analyses (PCA30, Umap31, t-sne32), we found that the ShanDong individuals are located close to ancient and present-day northern East Asians, including previously sampled Early Neolithic ShanDong populations (9500–7700 BP4) (Fig. 1C, D, Supplementary Fig. S2A, B). Outgroup f3-statistics33 also demonstrated that the ShanDong populations fall within northern East Asian genetic diversity (Fig. 2A). They formed clusters distinct from that observed for ancient humans from the Japanese archipelago, West Liao River, and Amur River regions in the PCA (Fig. 1C, D). Outside of ShanDong, populations from the Yellow River region are closest to the newly sampled individuals in the PCA. Among ancient ShanDong populations, there are three main clusters: one composed of Early Neolithic populations (9500–7700 BP), one composed of early ShanDong individuals dating to the DaWenKou cultural period (6000–4600 BP, ShanDong_DWK), and another composed of later ShanDong populations from the LongShan and early dynastic cultural periods (4600–1500 BP, ShanDong_LS and ShanDong_CD, Fig. 1C). Notably, compared with the DaWenKou and Early Neolithic ShanDong populations, younger ShanDong populations are shifted towards ancient Yellow River populations, suggesting influence from inland East Asian populations outside of the ShanDong region. In a Treemix analysis34, we observed that all ShanDong populations group with ancient northern East Asian populations, with those from the ShanDong region sharing the closest relationship to each other (Fig. 2B). Outside of ShanDong, Yellow River populations show the closest genetic relationship to ShanDong populations (supported by 59.2% of 1000 bootstrap trees, Fig. 2B, Supplementary Table S3). 
These patterns support the conclusion that ShanDong populations fall within the genetic diversity of northern East Asians. In an ADMIXTURE35 analysis, we observed the same components across ShanDong populations, but in proportions that vary over time (Fig. 2C).

Fig. 2: Genetic relationships of ShanDong populations to other ancient East Asians.

A Pairwise outgroup f3-statistics for ancient East Asians. A lighter color (yellow) indicates that more alleles are shared between the two populations, which is calculated as f3(X, Y; Outgroup), where X and Y denote ancient East Asian populations, and the outgroup is chosen to be Mbuti, a present-day population from central Africa. B A maximum likelihood phylogeny allowing three migration events using the Treemix software. The branches of the tree convey the proximity of ancient East Asians to each other and arrows indicate putative gene flow events. Red numbers indicate the bootstrap support for each branch, calculated from 1000 bootstrap trees. We computed the maximum likelihood phylogeny allowing for 0–6 migration events; see Supplementary Fig. S5. C ADMIXTURE analysis showing stratified components of ancient East Asians for K = 5.
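
As a rough illustration of the outgroup f3-statistic named in panel A, the minimal sketch below averages (o − x)(o − y) over SNPs, where x, y, and o are allele frequencies in populations X, Y, and the outgroup. The function name and the toy frequencies are hypothetical, and the sketch leaves out the finite-sample corrections and standard errors that dedicated software reports; it is not the pipeline used in this study.

```python
from typing import Sequence


def outgroup_f3(x: Sequence[float], y: Sequence[float],
                outgroup: Sequence[float]) -> float:
    """Mean over SNPs of (o - x) * (o - y), from allele frequencies.

    Larger values mean X and Y share more drift since splitting
    from the outgroup. No sample-size bias correction is applied.
    """
    if not (len(x) == len(y) == len(outgroup)) or len(x) == 0:
        raise ValueError("frequency vectors must be non-empty and equal in length")
    total = sum((o - a) * (o - b) for a, b, o in zip(x, y, outgroup))
    return total / len(x)


# Toy allele frequencies (hypothetical, for illustration only).
mbuti = [0.5, 0.5, 0.5]
pop_x = [0.9, 0.1, 0.9]
pop_near = [0.8, 0.2, 0.8]   # drifts in the same direction as pop_x
pop_far = [0.4, 0.6, 0.4]    # drifts in the opposite direction

f3_near = outgroup_f3(pop_x, pop_near, mbuti)
f3_far = outgroup_f3(pop_x, pop_far, mbuti)
```

In this toy case `f3_near` exceeds `f3_far`, which corresponds to the lighter cells in a heat map such as panel A.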

The Holocene in mainland East Asia was a time of marked change in human societies, with the rapid rise of farming and complex societies36,37,38,39. We first examined the effect of these societal changes in ShanDong populations from the Early Neolithic (9500–7700 BP) and the DaWenKou cultural period (6000–4600 BP). We found different trends in the Early Neolithic, where individuals from the Xiaojingshan site (7700 BP) show more genetic connections with populations outside of the ShanDong area, such as northern East Asians from the Amur River region and Far East Siberia (aAR/FE8,40,41), as well as southern East Asians from Fujian, the Taiwan Strait, and Guangxi regions (aSC3,4). In an f4-analysis assessing whether any Early Neolithic ShanDong individuals share excess ancestry with other ancient East Asians, we observed that the 7700 BP Xiaojingshan individuals share additional alleles with these northern East Asians (aAR/FE), i.e. most f4 (Bianbian/Boshan/Xiaogao/SD9K/aYR/aLR/aSC, Xiaojingshan; aAR/FE, Mbuti) < 0 (−16.1 < Z < 0.4, Supplementary Table S4a), and southern East Asians (aSC), i.e. most f4 (Bianbian/Boshan/Xiaogao/SD9K/aYR/aLR, Xiaojingshan; aSC, Mbuti) < 0 (−7.9 < Z < 1.6, Supplementary Table S4b) relative to older ShanDong individuals dating to ~9000 BP and other northern East Asians (except for AR9.2K_o, who shares some genetic affinity with ancient ShanDong populations42).
Using a rotational qpAdm strategy to further parse Xiaojingshan’s connection to these northern East Asians, we found that Xiaojingshan can only be modeled by a 3-way model with 74.2% Early Neolithic ShanDong ancestry related to Bianbian, Xiaogao, and Boshan (SD9K); 9.8% ancestry related to Early Neolithic populations from Fujian (Fujian_EN); and 16.0% ancestry related to Amur River populations younger than 14,000 years ago (ARpost14K, Supplementary Table S5), confirming that Xiaojingshan shows additional genetic influences from northern and southern East Asian ancestries outside of the ShanDong region. This suggests that as early as the Early Neolithic, there was already some interaction with northern and southern East Asian populations from other regions of mainland East Asia.
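
To make the f4 notation above concrete, the following sketch computes f4(A, B; C, D) as the mean over SNPs of (a − b)(c − d) and derives a Z-score from a simple delete-one block jackknife. It assumes equal-sized contiguous blocks of SNPs and unweighted leave-one-out estimates, which is a simplification; the function names, block count, and toy frequencies are all hypothetical and do not reproduce the software settings used in this study.

```python
import math
from typing import List, Sequence, Tuple


def f4_per_snp(a: Sequence[float], b: Sequence[float],
               c: Sequence[float], d: Sequence[float]) -> List[float]:
    """Per-SNP terms (a - b) * (c - d) from population allele frequencies."""
    if not (len(a) == len(b) == len(c) == len(d)):
        raise ValueError("all frequency vectors must have the same length")
    return [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]


def f4_with_z(a: Sequence[float], b: Sequence[float],
              c: Sequence[float], d: Sequence[float],
              n_blocks: int = 5) -> Tuple[float, float]:
    """Return (f4 estimate, Z-score) using a delete-one block jackknife."""
    terms = f4_per_snp(a, b, c, d)
    n = len(terms)
    if n_blocks < 2 or n < n_blocks:
        raise ValueError("need at least 2 blocks and one SNP per block")
    total = sum(terms)
    estimate = total / n
    size = n // n_blocks
    leave_one_out = []
    for i in range(n_blocks):
        start = i * size
        # The last block absorbs any remainder SNPs.
        end = n if i == n_blocks - 1 else start + size
        removed = terms[start:end]
        leave_one_out.append((total - sum(removed)) / (n - len(removed)))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (v - mean_loo) ** 2 for v in leave_one_out)
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, z


# Toy frequencies (hypothetical). A negative f4(A, B; C, D) would mean
# that B shares more alleles with C than A does.
pop_a = [0.5, 0.6, 0.7, 0.4, 0.3, 0.8]
pop_b = [0.4, 0.4, 0.6, 0.5, 0.3, 0.5]
pop_c = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
pop_d = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
est, z = f4_with_z(pop_a, pop_b, pop_c, pop_d, n_blocks=3)
```

Swapping the first two populations flips the sign of both the estimate and the Z-score, which is why the order of populations in each f4 expression matters when reading the inequalities in the text.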

Two pulses of gene flow from inland to coastal populations in northern East Asia

We next investigated genetic relationships during the Middle and Late Neolithic between coastal ShanDong populations (6000–4600 BP) associated with the DaWenKou culture and inland Yellow River populations (7000–5000 BP) associated with the YangShao culture. In the PCA (Fig. 1C, D), we observed that the three DaWenKou ShanDong populations, from the more inland GangShang site (GSGroup) to the more coastal BeiQian (BQGroup) and FuJia (FJGroup) sites, are shifted away from the Early Neolithic ShanDong populations (ShanDong_EN), and toward YangShao-related Yellow River (YR) populations. It can be observed that the three populations from the DaWenKou period distributed along this axis (from ShanDong_EN to YR) show different affinities to YR populations. Specifically, the relatively inland GSGroup clusters with YR populations, whereas the coastal BQGroup and FJGroup fall between the YR and Early Neolithic ShanDong populations (Fig. 1C, D).

To further determine whether DaWenKou populations show additional YR ancestry relative to Early Neolithic ShanDong populations, we employed a rotational qpAdm strategy to estimate ancestral components found in the DaWenKou populations. A total of 11 representative East Asian populations (e.g. ARpost14K, aFujian_EN) were rotated as potential source ancestries for the three DaWenKou groups (Supplementary Table S5). Using this strategy, only those populations that fit the mixture model are considered as ancestral source populations (i.e. Tail_prob > 0.05, each ancestral mixture proportion > standard error, pnest < 0.05, Supplementary Table S5). With the rotational qpAdm strategy, we found that the DaWenKou populations (BQGroup and GSGroup) can be modeled as a mixture of ancestry related to Early Neolithic ShanDong populations (ShanDong_EN, ~29–87%) and YR populations (YR, ~13–71%), while the FJGroup is best described by a single source ancestry related to ShanDong_EN (“1-way” model, Tail_prob = 0.08) or the BQGroup (Tail_prob = 0.47, Fig. 3A, Supplementary Table S5).
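
The acceptance criteria listed above can be expressed as a small filter over candidate models. This is only a sketch of the stated thresholds: the `QpAdmModel` record, its field names, and the example numbers are hypothetical, and treating the nested-model p-value as optional for single-source models is an assumption made here rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QpAdmModel:
    """Summary of one candidate model (field names are hypothetical)."""
    sources: Sequence[str]
    proportions: Sequence[float]
    std_errors: Sequence[float]
    tail_prob: float
    p_nested: Optional[float] = None  # assumed absent for 1-way models


def passes_criteria(model: QpAdmModel, alpha: float = 0.05) -> bool:
    """Apply the three stated criteria to one candidate model."""
    # Criterion 1: the model is not rejected (Tail_prob > 0.05).
    if model.tail_prob <= alpha:
        return False
    # Criterion 2: every mixture proportion exceeds its standard error.
    if any(p <= se for p, se in zip(model.proportions, model.std_errors)):
        return False
    # Criterion 3: if a nested comparison exists, pnest < 0.05.
    if model.p_nested is not None and model.p_nested >= alpha:
        return False
    return True


# Hypothetical candidates; only the 0.08 tail probability echoes the text.
candidates = [
    QpAdmModel(("ShanDong_EN", "YR"), (0.87, 0.13), (0.05, 0.05), 0.30, 0.01),
    QpAdmModel(("ShanDong_EN", "YR"), (0.95, 0.05), (0.06, 0.06), 0.30, 0.01),
    QpAdmModel(("ShanDong_EN",), (1.0,), (0.0,), 0.08),
]
accepted = [m for m in candidates if passes_criteria(m)]
```

Here the second candidate is dropped because its minor proportion does not exceed its standard error, while the other two are retained.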

Fig. 3: Gene flow related to the ShanDong populations.

A qpAdm analysis showing the proportions of each ancestral source component calculated for ancient ShanDong populations from the DaWenKou (DWK, BQ=BQGroup, GS=GSGroup, FJ=FJGroup), LongShan (LS, CZY=CZYGroup, YJC=YJCGroup), and Chinese dynastic (CD, HL=HLGroup, LJZ=LJZGroup, XC=XCGroup, TL=TLGroup, XZ=XZGroup, YX=YXGroup) cultural periods. The measures of center (junction of two rectangles) are the average values, and the error bars represent 1× standard error. When calculating ancestry components, we used a rotational strategy informed by chronological data from the ancient East Asian populations, where younger groups could not be used as potential sources for modeling older groups. Colors represent different ancestral sources (ShanDong_EN (SD9K/Xiaojingshan), BQGroup, GSGroup, CZYGroup, and YR (YR_MN/YR_LN)). Numbers indicate the larger ancestry proportion in a two-source qpAdm analysis. Exact values of the qpAdm analysis are shown in Supplementary Table S5a. B Schematic summarizing the major findings of this study on gene flow associated with the ShanDong region. Blue and red arrows indicate gene flow into ShanDong populations as early as 7700 BP from northern East Asia (NEA) and southern East Asia (SEA). Orange and yellow arrows indicate gene flow from the Yellow River region into ShanDong populations, which occurred at least twice, during the DaWenKou (6000–4600 BP) and early Chinese dynastic (younger than 3500 BP) cultural periods. No evidence for gene flow back into Yellow River populations was observed. The green arrow (dashed line) indicates gene flow from the ShanDong region into the Nagabaka population in the Ryukyu Islands after 2800 BP.

Consistent with the previous observation that the three sampled DaWenKou populations show different affinities to the YR populations, the mixture proportion calculated by qpAdm for YR ancestry in 6000–4600 BP ShanDong populations varies, with the GSGroup showing the highest levels of YR ancestry (~50–71%, Fig. 3A, Supplementary Table S5), followed by the BQGroup (~13%, Fig. 3A, Supplementary Table S5) and the FJGroup (~0–13%, Fig. 3A, Supplementary Table S5). In an ADMIXTURE analysis, the GSGroup shows a higher proportion of a YR-related component (orange) compared to other DaWenKou populations (Fig. 2C). Based on our finding of greater YR-related ancestry in the inland GSGroup relative to the coastal BQGroup and FJGroup, we suspect that YR-related ancestry had decreasing impact with proximity to the coast. With only three DWK sites represented, however, finer sampling of DWK sites in ShanDong is needed to confirm this hypothesis. Collectively, these patterns suggest that ancestry related to YR populations impacted populations in the ShanDong region during the DaWenKou cultural period (Fig. 3B).

Between 4600–4000 BP, the archaeological record shows high cultural assimilation in YR and ShanDong populations, resulting in a shared culture across these inland and coastal regions denoted the LongShan culture21. To explore population dynamics during this cultural transition, we investigated shifts in genetic ancestry of the coastal ShanDong population during this time period, particularly in populations from the YinJiaCheng (YJCGroup) and ChengZiYa (CZYGroup) sites. We first found that the genetic influence of YR populations persisted in ShanDong populations of the Late Neolithic who are associated with the LongShan cultural period (YJCGroup and CZYGroup). In a PCA, similar to the DaWenKou GSGroup, both YinJiaCheng and ChengZiYa individuals clustered with Yellow River individuals (Fig. 1C, D). In an ADMIXTURE analysis, individuals from the YJCGroup and CZYGroup show a component related to inland YR populations, and the proportion of a YR-related component in these two groups is within a range that overlaps with the proportion observed in the three DaWenKou populations (GSGroup, BQGroup, and FJGroup, Fig. 2C). We then explored whether this YR-related component was introduced from additional admixture from YR populations using an f4-analysis. We found that f4 (ShanDong_DWK, ShanDong_LS; YR, Mbuti)~0 (−2.3 < Z < 3.3, with only one “Z-value” >3 when the ShanDong_DWK was GSGroup, Supplementary Table S6), which suggests that LongShan populations did not share more genetic connections with YR populations than DaWenKou populations.

Using a rotational qpAdm analysis (Fig. 3A, Supplementary Table S5), we found that both LongShan populations can only be modeled as a mixture of ancestry related to the GSGroup (50–71% YR) – the DaWenKou population with elevated YR ancestry – and another DaWenKou population (BQGroup) or an Early Neolithic ShanDong population (21–32%, Fig. 3A, Supplementary Table S5). Interestingly, our result suggests different patterns of admixture in LongShan and DaWenKou populations. The LongShan populations cannot be modeled using a mixture of “ShanDong-related ancestors” and a “YR population”, but only as a mixture of two ShanDong populations. This suggests that rather than continued admixture from YR-related populations, there was genetic continuity between the LongShan and older ShanDong populations. Overall, our results support LongShan populations as a mixture of ancestry related to the older DaWenKou populations from both the more coastal and inland regions of ShanDong, with no additional influence from YR-related populations.

We next examined genetic changes in the ShanDong region during the early dynastic period of China, starting around 3500 BP with the establishment of the Shang Dynasty. Using a rotational qpAdm strategy (Fig. 3A, Supplementary Table S5) for early dynastic ShanDong populations, we found that they can be modeled by three different admixture patterns: (1) The HouLi (HLGroup), LiuJiaZhuang (LJZGroup), and XiChen (XCGroup) populations can be modeled as a single ancestry related to the LongShan “CZYGroup”, with no additional connections to YR-related ancestry beyond that observed in LongShan populations. (2) The TongLin (TLGroup) and XinZhi (XZGroup) populations can only be modeled by a single source ancestry related to the “GSGroup”, the DaWenKou population who showed an elevated YR-related ancestry (50–71%). This suggests that the two populations may share more YR-related ancestry, more similar to DaWenKou populations than LongShan populations. However, it is important to note that, apart from the level of YR-related ancestry, the ancient ShanDong populations are overall highly similar to each other. Two possible scenarios could have given rise to the observed qpAdm model for the TLGroup and XZGroup: (a) genetic continuity between the GSGroup and these two early dynastic groups, or (b) additional YR-related admixture leading to genetic similarity between the GSGroup and these two groups. The major difference between these two scenarios is the timing of when the YR-related ancestry was introduced into the population. (3) Lastly, we found that the YiXi population (YXGroup) is best modeled as a mixture of YR-related ancestry (~75–92%) and ancestry related to another ShanDong population (e.g., ~25% CZYGroup, or ~9% SD9K, Supplementary Table S5).

We further note that the TLGroup, XZGroup, and YXGroup (all younger than 3000 BP) mixture model patterns all differ compared with the LongShan mixture model pattern. To explore what happened to these three early dynastic ShanDong populations (XZGroup, YXGroup and TLGroup), we estimated the timing of admixture using DATES, and found that these three populations could be modeled as a mixture of ancestry related to ShanDong populations older than 3500 BP and ancestry related to YR populations (YR_MN/YR_LN), with an estimated date of admixture around 4.6–18.9 generations prior to the dynastic period, or ~2880–2030 BP (Supplementary Table S7). This finding suggests a second wave of admixture, where these three populations may have been genetically influenced by a second YR-related population after the LongShan cultural period, unique from the admixture event associated with the DaWenKou populations (at least 6000–4600 BP). In an ADMIXTURE analysis, we observed that individuals from the TLGroup, XZGroup, and YXGroup show a higher proportion of a component found in Yellow River populations (orange) compared with other ShanDong populations (Fig. 2C). Unlike the previous wave during the DaWenKou cultural period that affected all sampled ShanDong populations, the second wave may have only influenced a subset of the early dynastic ShanDong populations.

As population interactions can be bidirectional, we also tested whether Yellow River populations were influenced by ShanDong populations. Using an f4-analysis, we found that ancient Yellow River populations are similarly related to ShanDong populations, i.e. f4 (aYR, aYR; 6000–1500 BP ShanDong, Mbuti) ~ 0 (−3.0 < Z < 3.0, Supplementary Table S8), suggesting that Yellow River populations were not differentially influenced by ancestry related to ShanDong populations (Fig. 3B).

Introduction of ancestry related to ShanDong coastal East Asians in Ryukyu Islanders at least after 2800 BP

To examine the relationship of ShanDong populations to those who lived in the Japanese archipelago, we next compared our newly sampled individuals to previously published ancient Japanese populations. In the PCA, ancient Japanese populations form two main clusters, where younger populations dating to the post-Yayoi period are intermediate between the BQGroup, GSGroup, and FJGroup ShanDong populations from the DaWenKou cultural period, and older Jōmon hunter-gatherer populations (Fig. 1C, Supplementary Fig. S2A, B). In an ADMIXTURE analysis, younger Japanese populations, especially the Yayoi, Kofun, and historical Nagabaka populations, possess genetic components that are widely distributed among ShanDong populations younger than 6000 BP, confirming a strong connection between ShanDong populations associated with the DaWenKou cultural period and more recent populations from Japan (Fig. 2C).

This connection was further confirmed by f4 statistics43, where most f4(>3000 BP aJapanese, <3000 BP aJapanese; 6000–1500 BP ShanDong populations, Mbuti) < 0 (−21.7 < Z < 3.1, Supplementary Table S9), showing more shared alleles between the <3000 BP post-Yayoi Japanese populations and 6000–1500 BP ShanDong populations compared with the older Jōmon hunter-gatherers (Supplementary Table S9).

To determine whether this connection is specific to 6000–1500 BP ShanDong populations instead of other northern East Asians, we next tested whether the ShanDong populations could be described as a necessary ancestral source population for different recent Japanese populations (Nagabaka_2800BP, Kofun, Nagabaka_historic) in a rotational qpAdm analysis. We found three major patterns: (1) We found that the Nagabaka population dating to 2800 BP can be modeled as solely ancestry related to the Jōmon (Supplementary Table S5). (2) Then, we found that the historical Nagabaka population could only be modeled as a mixture of two ancestries, related to the ~4600 to 3500 BP CZYGroup and YJCGroup (75.0–75.2%) from the LongShan cultural period in ShanDong and ~3900 to 3700 BP individuals from the Late Jōmon (24.8–25.0%, Supplementary Table S5). We also found that the YR_LN, the LongShan ShanDong (YJCGroup/CZYGroup), and the Jōmon populations were included in the best three-ancestry model (but when comparing with the two-ancestry model, pnest = 0.04, Supplementary Table S5). The three-ancestry and two-ancestry models do not contradict each other, because these ShanDong populations (CZYGroup and YJCGroup) already carry northern inland East Asian (YR-related) and coastal East Asian components, where the coastal East Asian component is specific to ShanDong populations and was not identified in previously sampled ancient East Asian populations. Previous analysis of the historical Nagabaka population showed that they were best described by a three-ancestry model composed of Jōmon ancestry, northern East Asian ancestry, and an ambiguous ancestry related to present-day Han populations. 
Here we found that a coastal East Asian ancestry related to the ShanDong population better represents the ancestry previously proxied by the Han: the historical Nagabaka population is best described by a three-ancestry model as a mixture of Jōmon ancestry, northern inland East Asian ancestry related to YR populations, and additional northern coastal East Asian ancestry related to ShanDong populations. (3) The genetic influence of ancient ShanDong populations is limited to the Ryukyu islands and is not observed in the Kofun population in Hondo. That is, in a qpAdm analysis, the Kofun could not be modeled as carrying any ShanDong-related ancestry, and a working three-ancestry model shows Jōmon ancestry, northern inland East Asian ancestry related to YR populations, and an ambiguous northern East Asian-related ancestry (Supplementary Table S5).

We also observed differences in genetic affinity between the Nagabaka and Kofun populations and ShanDong populations in an f4-test. We observed that most ShanDong populations, particularly the 6000 BP BQGroup, tend to share more alleles with the historical Nagabaka group than other northern East Asians do, i.e. many f4(WLR/YR/AR, 6000–1500 BP ShanDong populations; Nagabaka_historic, Mbuti) < 0 (Z < −2.5, Fig. 4, Supplementary Table S10), while most f4 (WLR/YR, 6000–1500 BP ShanDong populations; Japan_Kofun, Mbuti)~0 (|Z|<3, Fig. 4, Supplementary Table S10).

Fig. 4: F4-statistics depicting the relationship of the BQGroup from ShanDong to ancient mainland East Asians and post-Yayoi populations from the Japanese archipelago.

Results of f4(Y, BQGroup; Japanese_after_2800BP, Mbuti), where Y represents ancient northern East Asians from the Amur River (blue), Yellow River (orange), and West Liao River regions (purple). This f4 was used to test whether alleles shared between the DaWenKou ShanDong populations (i.e. BQGroup) and post-Yayoi island populations (Nagabaka, Kofun) contain genetic components that are not present in these ancient northern East Asians. The measures of center (circles) are the average values of f4, and the error bars represent 2.5× standard error. A solid circle indicates that the f4 value is significant, while a hollow circle indicates that the f4 value is not significant.

To further analyze the northern coastal ancestry in the historical Nagabaka population and connect it to ShanDong populations, we designed a simulation test to use with the f4-analysis44,45. We found two major patterns: (1) First, the simulation analysis further confirms that there is an additional genetic connection between the historical Nagabaka and ShanDong populations. We tested f4 (X, Nagabaka_historic; SDEN/Xiaojingshan/BQGroup, Mbuti), where X was a simulated population ((1-x)% Jōmon+x% LongShan ShanDong populations (YJCGroup, CZYGroup), Supplementary Fig. S8). In both sets of simulation tests where the ShanDong population was SDEN or Xiaojingshan, when x% is within the range of the proportion of the LongShan ShanDong population (~75%) calculated using the rotational qpAdm strategy, the value of the f4-tests approximates zero (between the blue lines, Supplementary Fig. S8). This supports the rotational qpAdm mixture model for the historical Nagabaka population containing a northern coastal population component related to ShanDong populations. In the set of simulation tests where the ShanDong population was the BQGroup, when x% is within the range of ~87% of the proportion of the LongShan ShanDong population calculated by qpAdm (~0.87 * 75%, between the yellow lines), the value of f4 is approximately zero. This is because the BQGroup population is a mixture of ~87% ShanDong-related ancestry and ~13% YR-related ancestry. These patterns support that there were additional genetic connections between ShanDong populations and the historical Nagabaka population beyond general northern East Asian connections. (2) In order to test for the additional contribution of a northern coastal component related to ShanDong populations in the historical Nagabaka population compared to the Kofun population, we tested f4 (X, Nagabaka_historic; Kofun, Mbuti), where X was a simulated population ((1-x)% Kofun+x% ShanDong populations, sample size = 30, Supplementary Fig. S9).
In all four sets of simulation tests, f4 values gradually decreased as the proportion of the ShanDong component in population X increased, suggesting that the historical Nagabaka population shares additional genetic connections with ShanDong populations compared to the Kofun population. In addition, because the Nagabaka population could not be modeled with the Kofun population as an ancestral source in the rotational qpAdm analysis, f4 values approximately equal to 0 were not observed.

Here, since there is no component related to ShanDong populations in the Nagabaka population of 2800 BP (in both ADMIXTURE and f4 results, Fig. 2C and Fig. 4), we inferred that admixture related to northern coastal ancestry into the Nagabaka population from the Ryukyu Islands happened after 2800 BP. We next estimated the time of the admixture that introduced ShanDong ancestry into the historical Nagabaka population using DATES46. A consistent admixture time of ~102–43 generations ago was obtained using the ShanDong populations (CZYGroup, LJZGroup, XCGroup, HLGroup) to represent northern coastal populations and the West Liao River populations (WLR_LN, WLR_BA) to represent northern inland populations as mainland East Asian sources, and the Jōmon as another source (Supplementary Table S7). The most likely timing of the admixture is estimated to be 1600–1400 BP, assuming one generation is 28 years (LJZGroup is the best fit for the ancestral source, with the smallest nrmsd = 0.180, mean = 43.3, Supplementary Table S7). This can potentially be linked to population interactions between the Sui Dynasty (around 1400 BP) and the Ryukyu Islands populations that are known to have occurred according to historical documents (e.g. “Sui Shu”, “Chuzan-sefu” and “Chuzan-sekan”)47; the specific historical events involved need further support from archaeological study.

Discussion

Through sampling of ancient individuals from the northern coastal region of ShanDong in East Asia, we reconstructed fine-scale population dynamics from the ShanDong region over the past 9000 years, allowing us to answer several long-standing questions on not only population interaction and change during formative cultural periods in northern East Asia, but also the source of mainland East Asian ancestry into the Japanese archipelago.

First, we reconstructed the population history of mainland East Asians, focusing particularly on interactions across major cultural periods associated with the Neolithic. We found that before the emergence of the coastal DaWenKou culture, by at least 7700 BP, some ShanDong populations were influenced by populations from further north and south, about 3000 years earlier than that estimated in previous studies48. Later, with the establishment of two major Neolithic cultures in East Asia, the YangShao and DaWenKou cultures, we observed gene flow related to inland YangShao populations from the Yellow River region into the coastal DaWenKou populations from the ShanDong region, a pattern consistent with cultural interactions observed in the archaeological record16,17,18,20,49,50. We observed different interaction patterns during three major cultural periods since 6000 BP. First, we observed admixture from Yellow River-related populations to ShanDong populations during the DaWenKou cultural period, likely associated with the expansion of the YangShao culture during 6000–4600 BP17,20,21. Second, we observed little to no gene flow from external regions into the ShanDong region from 4600–4000 BP, when both the Yellow River and ShanDong regions experienced similar cultural changes that led to the LongShan cultural period20,21,51. This pattern suggests that during this time period, within-region population continuity was predominant in the ShanDong region. Finally, in the early dynastic period after 3500 BP, we observed a second wave of gene flow from Yellow River-related populations to some ShanDong populations, potentially associated with increased trade and conflict between the Shang Dynasty and Dongyi populations, which was shown in the historical record to have been driven by demand for sea salt22,23,25,52. 
During the dynastic period, the establishment of socioeconomic structure may have contributed to a second wave of Yellow River-related ancestry into the ShanDong region53,54. The different patterns of gene flow that occurred during the DaWenKou, LongShan, and early dynastic cultural periods show the history of how the genetic structure of the ShanDong populations was formed between 6000 and 1500 BP.

Previous studies have shown that post-Yayoi populations from the Japanese archipelago (e.g. Nagabaka_2800 BP, Kofun, and Nagabaka_historic) derive ancestry from at least three sources: Jōmon hunter-gatherers, a northern East Asian ancestry likely associated with Yayoi migrants, and a mainland East Asian ancestry that entered Japan after the Yayoi period associated with the present-day Han6,7,9,10. However, the provenance of the mainland East Asian ancestry and the timing of the related admixture were not known. Here, we identified the previously unknown East Asian ancestry associated with the Han as a coastal East Asian ancestry that was also found in ancient ShanDong populations (e.g. CZYGroup, YJCGroup, and LJZGroup), and we estimated that this ancestry was introduced through admixture ~1600 to 1400 BP using a DATES analysis. Interestingly, this model can only explain the unknown genetic component in Ryukyu islanders, and the mainland East Asian ancestry found in Hondo Japanese populations remains unclear. Therefore, while the genetics of recent populations from Japan fits a model of three ancestries related to the Jōmon, northern inland East Asians (analogous to the northern East Asian ancestry associated with the Yayoi previously proposed), and northern coastal East Asian ancestry (analogous to mainland East Asian ancestry associated with the post-Yayoi previously proposed), northern coastal East Asian ancestry can be further differentiated within different populations of the Japanese archipelago. This observation also fits the population structure previously observed in present-day Japanese10, and highlights the complex population history within different regions of Japan.

Methods

Ethics and inclusion statement

Permission to test for ancient DNA in the human specimens from this study was obtained through discussions with local archaeologists who excavated them, with final approval granted by the institutes in Shandong where they are managed and cared for, the Shandong Provincial Institute of Cultural Relics and Archaeology and Shandong University. Additional oversight and approval were obtained from the Institutional Review Board at the Institute of Vertebrate Paleontology and Paleoanthropology of the Chinese Academy of Sciences to sample the genomes of the ancient humans included in this study (202310250014). Protocols used to sample the genomes follow the highest standards used in archaeogenomic research. The work was done in collaboration with several local archaeologists, who were included as co-authors for their contributions to collation of archaeological material, dating of specimens, and/or discussions that contributed to the connections made to archaeological research cited in this study.

Ancient DNA extraction, sequencing, and data processing

For ancient DNA extraction, we primarily selected temporal bone fragments and teeth from human skeletal remains from ancient sites in the ShanDong region and drilled for bone powder. For each specimen, about 100 mg of bone or tooth powder was extracted. In order to avoid inter-sample contamination, use of a disposable drill bit for each specimen was strictly followed during the sampling process. For temporal bone samples, two drilling methods were employed: when we could isolate the temporal bone, we drilled a small hole on the inner side of the temporal bone to obtain bone powder55. When we could not isolate the temporal bone from the intact cranial bone, we drilled from the bottom of the cranial bone56, in order to protect the recognizable morphological features on the surface of the cranial bone.

For DNA library construction, single-stranded DNA libraries (SS) were constructed57,58 for samples from the GangShang and XiChen sites, and these libraries were not subjected to uracil-DNA glycosylase treatment (non-UDG) (Supplementary Data 1). For samples from the other nine sites, double-stranded DNA libraries (DS) were constructed58,59, and partial uracil-DNA glycosylase treatment was used (half-UDG60). DNA libraries were amplified with the AccuPrime Pfx DNA polymerase in a polymerase chain reaction (PCR) for 35 cycles to ensure that enough ancient DNA was available for capture, followed by the addition of P5 and P7 primers to specific libraries. A NanoDrop 2000 spectrophotometer was used to measure the amount of DNA extracted from each sample4.

Sequencing and reads alignment

Oligonucleotide probes designed for ancient nuclear genome-wide SNP capture were used, targeting ~1,240,000 SNPs (1240K SNP array61,62,63) (Supplementary Data 1). The enriched DNA fragments were sequenced on Illumina HiSeq 2500 and HiSeq X platforms, generating paired-end reads of 2 × 100 bp and 2 × 150 bp. Primer fragments were removed from the original sequences using the leeHom software64, and forward and reverse sequences with at least 11 base pairs of overlap were merged into a single sequence. The merged sequences were aligned to the hg19 human reference genome using BWA65 with the parameters “-n 0.01 -l 16500”. Fragments with a mapping quality below 30 were removed. Duplicated sequence fragments, i.e. fragments with the same orientation and the same start and end positions, were collapsed, and the fragment with the highest quality was retained for further processing.
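The filtering and duplicate-removal rules described above can be sketched as follows. This is a minimal illustration and not the pipeline used in this study: the `Read` record and its fields are hypothetical, and "highest quality" is assumed here to mean the highest mean base quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    # Hypothetical minimal record for one merged, aligned fragment.
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mapq: int
    mean_baseq: float


def filter_and_dedup(reads, min_mapq=30):
    """Drop fragments with mapping quality < min_mapq, then keep one fragment
    per (chromosome, start, end, orientation), preferring the highest quality."""
    best = {}
    for r in reads:
        if r.mapq < min_mapq:
            continue
        key = (r.chrom, r.start, r.end, r.strand)
        # Assumption: "highest quality" is taken as highest mean base quality.
        if key not in best or r.mean_baseq > best[key].mean_baseq:
            best[key] = r
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))


example_reads = [
    Read("r1", "chr1", 100, 150, "+", 37, 30.0),
    Read("r2", "chr1", 100, 150, "+", 37, 35.0),  # duplicate of r1, better quality
    Read("r3", "chr1", 100, 150, "-", 37, 20.0),  # other orientation: not a duplicate
    Read("r4", "chr2", 5, 60, "+", 12, 40.0),     # fails the mapping-quality filter
]
kept = filter_and_dedup(example_reads)
```

In this toy input, `r1` and `r2` share orientation and coordinates, so only one of them survives, while `r3` is kept because its orientation differs.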

Test for contamination and genotyping

For each individual, the C-to-T substitution rate at the terminal nucleotides was calculated. A relatively high C-to-T substitution rate at the terminal nucleotide is characteristic of ancient DNA66, suggesting that the sequence read represents genetic material from the ancient human sampled. The mitochondrial contamination rate of each individual was assessed by comparing the sequenced fragments with the mitochondrial genomes of 311 present-day humans from around the world using ContamMix software26. For one male individual where mtDNA was not captured, we used an X chromosome contamination test27. Libraries with estimated mitochondrial contamination levels greater than 5.0% were reprocessed to retain only damaged fragments containing patterns typical of ancient DNA, i.e. damage patterns not found in modern DNA28. Damaged fragments were obtained by retaining only fragments containing at least one C-to-T substitution in the first three positions of the 5’ end and in the last three positions of the 3’ end using pmdtools v0.60 with the “--customterminus” parameter29, and the individuals corresponding to these damage-restricted libraries were labeled with “_d” for subsequent analyses (Supplementary Table S1). For SNP loci covered by at least one read in an individual, a random read was selected to determine the allele for that individual61, leading to pseudohaploid genome-wide data for downstream population genetic analyses.
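The random-read genotyping step can be illustrated with a short sketch. The pileup layout (a dictionary from SNP identifier to the list of alleles observed in overlapping reads) and the function names are hypothetical and chosen only for illustration.

```python
import random


def pseudohaploid_call(alleles_at_site, rng):
    """Return the allele of one randomly chosen read, or None if no read covers the site."""
    if not alleles_at_site:
        return None
    return rng.choice(alleles_at_site)


def call_individual(pileup, seed=0):
    """Build a pseudohaploid genotype: one randomly sampled allele per covered SNP."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {snp: pseudohaploid_call(alleles, rng) for snp, alleles in pileup.items()}


# Hypothetical pileup: rs1 is covered by three reads, rs2 by one, rs3 by none.
example_pileup = {"rs1": ["A", "A", "G"], "rs2": ["C"], "rs3": []}
calls = call_individual(example_pileup)
```

Because a single read is sampled per site, every individual carries exactly one allele per covered SNP regardless of depth, which avoids a bias toward heterozygous calls in higher-coverage samples.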

Principal components analysis, Umap dimensionality reduction, t-sne dimensionality reduction

Principal Component Analysis (PCA) was performed using the smartpca program from the EIGENSOFT package30, in which we used published present-day humans (34 present-day populations from the HO project43, and 17 Tibetan and Han populations differentiated according to region in their published studies67) to determine the principal components (PC1 = 5.5%, PC2 = 3.6%, Supplementary Fig. S1). We then projected ancient ShanDong individuals sampled in this study, as well as previously published ancient individuals3,4,6,7,8,41,42 (Fig. 1C).

We then assessed PC1 through PC10, collapsing the data through new eigenvalues onto a two-dimensional plane using Umap31 and t-sne32. Compared to PCA, Umap and t-sne can visualize all 10 PCs on a two-dimensional plane, where Umap (Supplementary Fig. S2) focuses more on global structure, and t-sne (Supplementary Fig. S3) focuses more on local structure.

F3- and f4-analyses

To determine genetic relationships among East Asian populations, the outgroup-f3 and f4 analyses found in the software package AdmixTools were used68. Raghavan et al. first proposed the outgroup f3-analysis33, which uses an f-statistic of the form f3(Outgroup; X, Y), where the Outgroup is an outgroup population to X and Y. We used the modern Central African population Mbuti as the Outgroup, and ancient East Asians from this and previously published studies as X and Y populations. In practice, we used the qp3Pop software from the AdmixTools43 package and plotted heatmaps using the matplotlib package for Python 3.7 (Fig. 2A). A high f3 value for a pair of X and Y populations indicates high genetic similarity between them. Similarity due to shared ancestry versus admixture can be further differentiated using the f4 analysis.
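As a rough illustration of what the outgroup-f3 statistic measures, the sketch below computes f3(Outgroup; X, Y) as the average over SNPs of (pO − pX)(pO − pY) from population allele frequencies. It is a simplification under stated assumptions: it omits the finite-sample corrections and standard errors that qp3Pop reports, and all frequencies are invented toy values.

```python
def outgroup_f3(p_out, p_x, p_y):
    """Naive outgroup-f3: mean over SNPs of (pO - pX) * (pO - pY).

    Larger values mean X and Y share more drift since splitting from the outgroup.
    """
    terms = [(o - x) * (o - y) for o, x, y in zip(p_out, p_x, p_y)]
    return sum(terms) / len(terms)


# Toy allele frequencies at four SNPs (illustrative only).
p_out = [0.0, 1.0, 0.0, 1.0]
p_x = [0.8, 0.2, 0.6, 0.4]
p_y = [0.7, 0.3, 0.6, 0.5]  # similar to X
p_z = [0.2, 0.9, 0.1, 0.8]  # closer to the outgroup than to X

f3_xy = outgroup_f3(p_out, p_x, p_y)
f3_xz = outgroup_f3(p_out, p_x, p_z)
```

Here the pair (X, Y) yields the larger value, matching the reading that a higher outgroup-f3 reflects greater shared genetic drift.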

We used the qpDstat software in the AdmixTools43 package to perform f4 analyses and evaluate the relative degree of allele sharing between ancient individuals in East Asia. The f4 statistic takes the form f4(P1, P2; P3, P4), where P4 is generally fixed as an outgroup to P1, P2, and P3. We used the Central African population Mbuti as P4, which is an outgroup to East Asian populations. In an f4 analysis, f4 > 0 (Z > 2.5 or more strictly Z > 3) indicates that the number of alleles shared between the P1 and P3 populations is greater than the number of alleles shared between the P2 and P3 populations. f4 < 0 (Z < −2.5 or more strictly Z < −3) indicates that the number of alleles shared between the P1 and P3 populations is less than the number of alleles shared between the P2 and P3 populations, and f4 ~ 0 (|Z| < 2.5 or, under the stricter threshold, |Z| < 3) indicates that the number of alleles shared by P3 with P1 and P2 is approximately equal.
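The following sketch shows how an f4 value and its Z-score can be obtained and interpreted from allele frequencies. It is a simplification: qpDstat uses a weighted block jackknife over genomic blocks, whereas this example uses an unweighted delete-one-block jackknife over consecutive SNPs, and the input frequencies are invented.

```python
import math


def f4_with_jackknife(p1, p2, p3, p4, block_size):
    """Return (f4, standard error, Z) using a delete-one-block jackknife."""
    per_snp = [(a - b) * (c - d) for a, b, c, d in zip(p1, p2, p3, p4)]
    n = len(per_snp)
    if n <= block_size:
        raise ValueError("need at least two blocks for a jackknife")
    total = sum(per_snp)
    estimate = total / n
    leave_one_out = []
    for start in range(0, n, block_size):
        block = per_snp[start:start + block_size]
        leave_one_out.append((total - sum(block)) / (n - len(block)))
    g = len(leave_one_out)
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z


def interpret(z, cutoff=2.5):
    """Map a Z-score onto the three outcomes described in the text."""
    if z > cutoff:
        return "P3 shares more alleles with P1"
    if z < -cutoff:
        return "P3 shares more alleles with P2"
    return "no significant difference"


# Toy data: P1 matches P3 exactly, P2 is uninformative, P4 is a fixed outgroup.
p3 = [0.9, 0.8, 0.7, 0.6, 1.0] * 8
p1 = list(p3)
p2 = [0.5] * len(p3)
p4 = [0.0] * len(p3)
est, se, z = f4_with_jackknife(p1, p2, p3, p4, block_size=4)
est_swapped, _, z_swapped = f4_with_jackknife(p2, p1, p3, p4, block_size=4)
```

Swapping P1 and P2 flips the sign of both f4 and Z, which is why the direction of excess allele sharing can be read directly from the sign.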

Kinship analyses

In order to exclude the influence of kinship on population genetics analysis, READ (Relationship Estimation from Ancient DNA) software was used to analyze the kinship between human individuals from ancient sites in ShanDong69. READ was specially developed for use with ancient DNA, as the low content of endogenous DNA, fragmentation, terminal damage and other characteristics of aDNA can make estimating kinship difficult. The principle is (1) to divide the genome into non-overlapping windows of 1 Mbp; (2) to calculate the proportion of mismatched alleles (P0) in each window for each pair of individuals; (3) to normalize P0 by the value expected for a pair of unrelated individuals from the same population; and finally, (4) to classify the kinship between samples according to threshold values. After processing through the READ analysis, each pair of individuals may be categorized into one of the following four types of kinship: (1) identical individuals/identical twins; (2) first-generation kinship: parents and children, siblings; (3) second-generation kinship: grandparents and grandchildren, aunts/uncles and nieces/nephews, half-siblings; and (4) unrelated individuals: the distance of kinship is greater than the second-generation range.
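The four steps above can be sketched in a few lines. This is not the READ implementation: the window layout and function names are hypothetical, and the cut-offs (0.625, 0.8125, 0.90625 on the normalized P0) are assumptions used for illustration that should be checked against the READ documentation.

```python
def window_p0(alleles_a, alleles_b):
    """Proportion of mismatching pseudohaploid alleles in one window.

    Sites where either individual is missing (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(alleles_a, alleles_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a != b for a, b in pairs) / len(pairs)


def mean_p0(windows_a, windows_b):
    """Average P0 across windows that have at least one shared site."""
    values = [window_p0(a, b) for a, b in zip(windows_a, windows_b)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


def classify_pair(p0, unrelated_p0, cutoffs=(0.625, 0.8125, 0.90625)):
    """Normalize P0 by the unrelated expectation and assign a kinship class."""
    normalized = p0 / unrelated_p0
    identical_max, first_max, second_max = cutoffs  # assumed thresholds
    if normalized < identical_max:
        return "identical"
    if normalized < first_max:
        return "first"
    if normalized < second_max:
        return "second"
    return "unrelated"


# Two toy windows for a pair of individuals (alleles coded as letters).
win_a = [["A", "C", None, "G"], ["T", "T"]]
win_b = [["A", "T", "G", "G"], ["T", "C"]]
pair_p0 = mean_p0(win_a, win_b)
```

The logic behind the normalization is that related individuals mismatch less often than unrelated ones, so a normalized P0 well below 1 points to closer kinship.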

Because individuals within two generations of kinship share similar genetic characteristics that can bias population genetics analyses that assume independence of data, we filtered the newly sampled individuals to exclude related individuals in population genetic analyses. That is, for any kinship groups, we retained the individual with the highest data quality. We ultimately excluded 15 individuals (BeiQian = 13, TongLin = 1, GangShang = 1) from downstream population genetics analyses and marked the excluded individuals with the suffix “_k” (Supplementary Table S2).

Grouping analysis based on Pairwise D method

To group individuals within sites, we focused on comparing differences between individuals within the same site using the Pairwise D method. Pairwise D entails using the functionality of the f4 analysis in the AdmixTools software package43, with the formula D(ind1, ind2; Pop, Mbuti), where ind1 and ind2 are two different individuals from the same site, and Pop is a published ancient individual or present-day population. A higher number of D-statistics where |Z| > 3 indicates genetic differences between ind1 and ind2 that suggest ind1 and ind2 may not share enough genetic similarities to be grouped together. In this study, 107 representative ancient40,42,70,71,72,73 and present-day43,67 populations were rotated into the Pop position, and the number of D-statistics for each (ind1, ind2) pairing where |Z| > 3 was determined. If there were greater than five D-statistics where |Z| > 3, then the ancient individuals were divided into subgroups or outliers as appropriate (Supplementary Fig. S3).
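The counting rule used for grouping can be expressed compactly. The sketch assumes the Z-scores of D(ind1, ind2; Pop, Mbuti) have already been computed for every population rotated into the Pop position; the Z-scores shown are invented.

```python
def count_significant(z_scores, z_cutoff=3.0):
    """Number of D-statistics with |Z| above the cutoff for one (ind1, ind2) pair."""
    return sum(abs(z) > z_cutoff for z in z_scores)


def should_split(z_scores, max_significant=5, z_cutoff=3.0):
    """True when more than `max_significant` D-statistics are significant,
    i.e. the pair should not be grouped together."""
    return count_significant(z_scores, z_cutoff) > max_significant


# Hypothetical Z-scores for two pairs across a handful of reference populations.
similar_pair = [0.4, -1.2, 2.1, -0.3, 3.2, 0.9, -2.8, 1.5]
divergent_pair = [4.1, -3.6, 5.2, 3.3, -4.8, 3.9, 0.2, -3.1]
```

With these values, `similar_pair` has a single significant statistic and stays grouped, whereas `divergent_pair` exceeds the threshold of five and would be treated as a subgroup or outlier.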

From the grouping analysis using Pairwise D, BQ4628 and BQ4610 were classified as outliers at the BeiQian site, YJC4658 was classified as an outlier at the YinJiaCheng site, XZ3470 was classified as an outlier at the XinZhi site, and HL4788 was classified as an outlier at the HouLi site. Outliers were labeled with the suffix “_o”, and the remaining individuals from that site were grouped together for downstream population genetic analysis (BQGroup of BeiQian site, GSGroup of GangShang site, FJGroup of FuJia site, YJCGroup of YinJiaCheng site, CZYGroup of ChengZiYa site, HLGroup of HouLi site, LJZGroup of LiuJiaZhuang site, XCGroup of XiChen site, XZGroup of XinZhi site, YXGroup of YiXi site, TLGroup of TongLin site).

Phylogeny modeling with Treemix

Treemix v1.1334 was used to determine the phylogenetic relationships of various ancient East Asians3,4,6,7,41,42,63, allowing for admixture events. We rooted the tree using the Central African Mbuti (with the option “-root Mbuti”) and used blocks of 500 SNPs at a time (with the option “-k 500”). We ran 1000 replicates for each tree, adding the options “-bootstrap -q”. The 1000 bootstrap trees were assessed in Phylip v3.695 using the “consense” program. With that, we could assess the robustness of each clade in the tree (Supplementary Table S3). Results for m = 0 to m = 6, and a heatmap of the residuals were determined (Supplementary Figs. S4 and S5), and the tree for m = 3 is visualized in Fig. 2B.

Admixture analysis

We applied the program ADMIXTURE35 to compute stratified components in different East Asian populations based on its likelihood model with a block relaxation algorithm to estimate individual ancestry and cross-validate the estimated population structure. We used PLINK v1.90b3.4074 to prune the dataset to minimize linkage disequilibrium, with the parameter “--indep-pairwise 200 25 0.4”. We included present-day and ancient populations used in the PCA analysis. Twenty replicates for each of K = 2 to K = 9 were performed, using different random seeds. The lowest CV was for K = 2 (Supplementary Fig. S6), with similarly low CVs for K = 3 to K = 5 (0.4452–0.4471). By comparing the results from K = 2 to K = 5 (Supplementary Fig. S7), we found that the first separation of components is between North and South East Asians (K = 2), the second separation distinguishes continental and island populations amongst northern East Asians (K = 3), the third separation distinguishes the Amur River populations (and populations further north) from Yellow River populations (K = 4), and the final separation distinguishes Yellow River (inland) and ShanDong (coastal) populations (K = 5). These separations across K = 2 to K = 5 mirror the population relationships observed in the Treemix analysis (Fig. 2B). We visualized the results for K = 5 in Fig. 2C.

Admixture modeling with qpAdm

To model ancestry proportions for any target population, we used qpAdm62 in AdmixTools with the parameter “allsnps: default”. We utilized python scripts to implement a rotational strategy to examine the potential ancestral origins of the target population in one-, two-, and three-way mixing scenarios. In the rotational strategy, a standard outgroup “Yamnaya_Samara”75 was added to the “right population”. The possible ancestral source populations are categorized into two groups, “rotating” and “non-rotating” (Supplementary Table S4): (1) Populations in the “rotating” group who are not used as a source in the “left population” are incorporated into the “right population”. The identification of a set of best-fitting mixture models is a major advantage of a rotational qpAdm analysis. Compared to qpAdm without rotation, in a rotational qpAdm analysis, each potential source population is sequentially included in the “left population”, and all unused potential sources in that qpAdm analysis are included in the “right population”. Finding successful mixture models for one or a few potential sources highlights the optimal sources relative to the other potential sources (where Tail_prob >0.05, each ancestral mixture proportion >standard error, pnest <0.05). Using this exhaustive strategy, combined with a smaller number of source populations, tends to reveal optimal combinations of sources, because the source populations can only be identified when they outperform the rest of the tested potential source populations. Therefore, by this method, we can find the most suitable combination of source populations among all combinations of source populations. (2) Populations in the “non-rotating” group who are not used as a source for the “left population” are not included in the “right population” to avoid situations where the “right population” contains groups that are younger in age than the target population76. See Supplementary Table S4 for details.
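A minimal sketch of how the rotating left and right population lists might be enumerated is given below. It is not the script used in this study: the function names are hypothetical, and the assignment of populations to the “rotating” and “non-rotating” groups in the example is invented (the actual assignment is in Supplementary Table S4). The acceptance check mirrors the three criteria stated above.

```python
from itertools import combinations


def rotating_models(target, rotating, non_rotating, fixed_right, max_sources=3):
    """Yield (left, right) population lists for 1- to max_sources-way models.

    Unused rotating populations move to the right list; unused non-rotating
    populations are left out of the model entirely.
    """
    candidates = list(rotating) + list(non_rotating)
    for k in range(1, max_sources + 1):
        for sources in combinations(candidates, k):
            left = [target, *sources]
            right = list(fixed_right) + [p for p in rotating if p not in sources]
            yield left, right


def accept_model(tail_prob, proportions, std_errors, p_nested):
    """Criteria from the text: Tail_prob > 0.05, every proportion larger than
    its standard error, and pnest < 0.05."""
    return (tail_prob > 0.05
            and all(p > se for p, se in zip(proportions, std_errors))
            and p_nested < 0.05)


# Hypothetical grouping, for illustration only.
models = list(rotating_models(
    target="Nagabaka_historic",
    rotating=["Jomon", "YJCGroup", "WLR_LN"],
    non_rotating=["Kofun"],
    fixed_right=["Yamnaya_Samara"],
))
first_left, first_right = models[0]
```

With four candidate sources, this enumerates 4 + 6 + 4 = 14 models; in the first one, only Jomon is a source, so YJCGroup and WLR_LN join Yamnaya_Samara on the right while Kofun is omitted.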

The admixture proportions calculated using qpAdm were stratified using the age of the sources, i.e. ShanDong_EN was used as an ancestral source for ShanDong_DWK, ShanDong_DWK was used as an ancestral source for ShanDong_LS, and ShanDong_LS was used as an ancestral source for ShanDong_CD. Thus, to calculate the proportion of the YR component in ShanDong_LS populations, we weighted the ShanDong_EN and YR components based on the ShanDong_DWK proportion observed in each ShanDong_LS population. We used the same method for the ShanDong_CD populations, using the proportions for the ShanDong_LS populations. The results of this re-estimation of proportions were used to visualize the changes in the proportion of YR components in ShanDong populations over the past 6000 years (Fig. 3B).
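The weighting can be illustrated with a short worked example. It assumes, for simplicity, that each population is modeled as a two-way mixture of the preceding ShanDong stratum and a YR source; the ~13% YR proportion for a DaWenKou population follows the BQGroup estimate quoted in the Results, while the LongShan-period proportions are hypothetical.

```python
def cumulative_yr(prev_stratum_prop, direct_yr_prop, yr_in_prev_stratum):
    """Total YR-related ancestry in a population modeled as
    prev_stratum_prop * (previous ShanDong stratum) + direct_yr_prop * YR.

    The YR ancestry already carried by the previous stratum is weighted by the
    proportion that stratum contributes, then added to the direct YR input.
    """
    return direct_yr_prop + prev_stratum_prop * yr_in_prev_stratum


# DaWenKou stratum: ~87% ShanDong_EN-related + ~13% YR-related (BQGroup, see Results).
yr_in_dwk = 0.13

# Hypothetical LongShan population: 90% ShanDong_DWK + 10% direct YR input.
yr_in_ls = cumulative_yr(prev_stratum_prop=0.90, direct_yr_prop=0.10,
                         yr_in_prev_stratum=yr_in_dwk)
```

Under these assumed inputs, the LongShan population carries 0.10 + 0.90 × 0.13 = 0.217 YR-related ancestry in total, and the same function can be applied again to move from ShanDong_LS to ShanDong_CD.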

Estimating admixture time of ancestral source components with DATES

The timing of admixture events among populations of interest in East Asia was estimated using DATES v401046 (https://github.com/MoorjaniLab/DATES_v4010). The minimum genetic distance was set to 0.45 cM using “lovalfit: 0.45”, and the maximum genetic distance was set to 1 Morgan using “maxdis: 1” to ensure that it was larger than the confounding LD block. The recommended bin size of 0.001 Morgans was used (“binsize: 0.001”). Standard errors were estimated by a weighted block jackknife method with the parameter “jackknife: Yes”. We considered all results with NRMSD < 0.7, Z > 2, and generations < 200, and assumed that each “generation” corresponds to 28 years77 in order to convert generations to years (Supplementary Table S5).
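The filtering criteria and the generation-to-year conversion can be written out as a small sketch. The function names are hypothetical; the LJZGroup values (mean = 43.3 generations, nrmsd = 0.180) are those reported in the Results, while the Z-score passed to the filter is an invented placeholder. Converting to a calendar date additionally requires the age of the sampled individuals, which is left as a parameter.

```python
def passes_filters(nrmsd, z, generations):
    """Acceptance criteria from the text: NRMSD < 0.7, Z > 2, generations < 200."""
    return nrmsd < 0.7 and z > 2 and generations < 200


def generations_to_years(generations, years_per_generation=28):
    """Time elapsed between admixture and the lifetime of the sampled individuals."""
    return generations * years_per_generation


def admixture_date_bp(generations, sample_age_bp, years_per_generation=28):
    """Admixture date in years BP, given the age of the samples in years BP."""
    return sample_age_bp + generations_to_years(generations, years_per_generation)


# LJZGroup as source: 43.3 generations (reported); Z = 3.0 is a placeholder.
ljz_ok = passes_filters(nrmsd=0.180, z=3.0, generations=43.3)
ljz_years = generations_to_years(43.3)
```

For the LJZGroup estimate, 43.3 generations correspond to roughly 1212 years before the sampled individuals lived.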

F4-test based on the simulation method

We included a simulation to test the possible connections between the tested population (Population A) and the possible ancestral components in related populations (Population C), which leverages the linear relation between the proportion of arbitrary components in the simulated population (Population B) and the value of f4. The line where a series of f4 values are located will pass through the zero point when the ratio of the two components is consistent with that of the population being tested44,45.

Specifically, in f4 (A, B; C, O), (1) Population A contains two population components, i% x and j% y, where i and j are fixed constants with i + j = 1; (2) Population B is a series of populations generated by the simulation method and consists of a% x + b% y with a + b = 1; (3) Population C is a population with 100% x component; and (4) O is an outgroup.

We know f4 (A, B; C, O) = (pA − pB) × (pC − pO), where pX denotes the frequency of a given allele in population X. So, pA = i × px + j × py; pB = a × px + b × py; pC = px; pO = 0.

Then, f4(A, B; C, O) = (i − a) × px² + (a − i) × px × py is a linear equation with respect to the variable a.

Finally, f4(A, B; C, O) = (px × (i − a) + py × (j − b)) × px = 0, only if all 3 of the following conditions are met at the same time:

$${{{\rm{i}}}}+{{{\rm{j}}}}=1$$
(1)
$${{{\rm{a}}}}+{{{\rm{b}}}}=1$$
(2)
$${{{\rm{a}}}}={{{\rm{i}}}}\quad \left({\rm{since}}\ {{{\rm{j}}}}-{{{\rm{b}}}}=\left(1-{{{\rm{i}}}}\right)-\left(1-{{{\rm{a}}}}\right)={{{\rm{a}}}}-{{{\rm{i}}}}\right)$$
(3)

The conditions that need to be satisfied simultaneously in more complex populations (A = i%x + j%y + k%z + …) can be further derived from the following equations:

$${{{\rm{i}}}}+{{{\rm{j}}}}+{{{\rm{k}}}}+\ldots=1$$
(4)
$${{{\rm{a}}}}+{{{\rm{b}}}}+{{{\rm{c}}}}+\ldots=1$$
(5)
$${{{\rm{a}}}}={{{\rm{i}}}},{{{\rm{b}}}}={{{\rm{j}}}},{{{\rm{c}}}}={{{\rm{k}}}},\ldots$$
(6)

The simulation can be used to demonstrate whether the population components in the tested population A contain only the population components calculated by qpAdm, and to verify if the corresponding component proportions are the same as the proportions of each component calculated by qpAdm when the value of the linear distribution for f4 crosses the zero point.
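The linear relation derived above can be checked numerically with a short sketch. It follows the single-locus setup of the derivation (pC = px, pO = 0); the allele frequencies px and py are invented, and i = 0.75 is used only because it matches the ~75% proportion discussed in the Results.

```python
def f4_simulated(a, i, px, py):
    """f4(A, B; C, O) for A = i*x + (1-i)*y, B = a*x + (1-a)*y, C = x, pO = 0."""
    p_a = i * px + (1 - i) * py
    p_b = a * px + (1 - a) * py
    p_c, p_o = px, 0.0
    return (p_a - p_b) * (p_c - p_o)


def zero_crossing(i, px, py):
    """Mixture proportion a at which f4 = 0, from the line through a = 0 and a = 1."""
    f_at_0 = f4_simulated(0.0, i, px, py)
    f_at_1 = f4_simulated(1.0, i, px, py)
    return f_at_0 / (f_at_0 - f_at_1)


i_true, px, py = 0.75, 0.6, 0.2  # illustrative values
grid = [k / 10 for k in range(11)]
f4_values = [f4_simulated(a, i_true, px, py) for a in grid]
a_zero = zero_crossing(i_true, px, py)
```

The f4 values fall on a straight line in a with slope −px × (px − py), and the line crosses zero at a = i, which is the property the simulation test relies on.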

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.