Abstract
Creating a catalogue of early-diverged genome variation is critical for determining the true extent of human diversity and its associated medical impact. Generating deep whole-genome data for 150 Khoe-San (12 groups, 1 unclassified) and 40 regionally comparative Southern Africans (3 groups), we identify ~30 million small-to-large variants, including over 1.3 million previously unknown single nucleotide variants. Although San and Damara share traditionally forager lifestyles and click languages, we identify them as separate phylogenetic lineages, contributing two admixture waves to Nama. While San represent modern humans’ deep divergence (~115 thousand years ago), Damara divergence is recent, with both showing high effective population sizes between 45 and 150 thousand years ago. Developing an assembly-based test, we report 1,376 genes under positive selection (dN/dS = 19.46), of which 479 are significantly associated with forager peoples and have therefore maintained ancestral alleles that differ from the derived genetic variation observed in non-African biomedical resources.
Data availability
Raw sequencing data, alignments, germline variant calls (small variants, short tandem repeats and mobile element insertions) and derived datasets are available for general research use, for browsing and download, through the European Genome-phenome Archive (EGA) [https://ega-archive.org] under KSGP accession number EGAS50000001408, subject to the policy and approval of the KSGP Data Access Committee (DAC) EGAC50000000798 [https://dac.ega-archive.org/EGAC50000000798/requests]. Genomic data for South African participants have previously been deposited at the EGA under accession number EGAD00001009067. The public release of SGDP data is available through the EBI European Nucleotide Archive under accession numbers PRJEB9586 and ERP010710. The Altai Neanderthal genome can be downloaded online [http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/]. Source data are provided with this paper. Access to KSGP and SAPCS sequencing data may be requested via the KSGP or SAPCS Data Access Committees (DACs), respectively, and will be made available to researchers with appropriate feasibility and corresponding ethics approvals to ensure the safeguarding of patient genomic information (contact V.M.H. directly). Both DACs include community representation, with all studies directly communicated with community representative partners.
Restrictions include (i) no transfer to third parties allowed, (ii) inability of the researchers to adequately articulate their research question at application or the question is deemed culturally inappropriate, (iii) a report of the results of the research to be provided to the respective DACs prior to publication (or when requested), (iv) written DAC approval for publication of final draft, (v) acknowledgment of the KSGP or SAPCS community leaders in publications/presentations, (vi) researchers cannot utilise the data for commercial purposes or any other purposes not approved by the DAC, and (vii) approval will not be given that excludes other researchers from accessing data. Data currently being used for capacity building in under-resourced studies across Sub-Saharan Africa will be given priority and at times may be granted time-limited exclusive rights for no more than a two-year period.
Code availability
The core computational pipelines used in this study for read alignment, quality control and variant calling are described in Supplementary Information. Analysis code for assembly-based genome analysis of positive selection is available on GitHub (https://github.com/wjaratlerdsiri/aGATK) or in ref. 96.
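For orientation only, the kind of dN/dS ratio quoted in the abstract can be sketched as below. This is a generic counting-style estimate (in the spirit of Nei and Gojobori) with a Jukes-Cantor correction, and it is not the assembly-based test developed in this study, for which the repository above is the authoritative source. The function names and the input counts are hypothetical and chosen purely for illustration.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must lie in [0, 0.75)")
    if p == 0.0:
        return 0.0
    return -0.75 * math.log(1.0 - (4.0 * p) / 3.0)


def dn_ds(nonsyn_diffs: float, nonsyn_sites: float,
          syn_diffs: float, syn_sites: float) -> float:
    """Ratio of nonsynonymous to synonymous substitution rates.

    Inputs are hypothetical pre-computed counts: the number of observed
    nonsynonymous/synonymous differences and the number of
    nonsynonymous/synonymous sites in the aligned coding sequence.
    """
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


if __name__ == "__main__":
    # Illustrative counts only, not data from this study.
    ratio = dn_ds(nonsyn_diffs=20, nonsyn_sites=100, syn_diffs=5, syn_sites=100)
    print(f"dN/dS = {ratio:.2f}")
```

Under this convention a ratio above 1 is usually read as a signal of positive selection, a ratio near 1 as neutrality, and a ratio below 1 as purifying selection.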
References
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
1000-Genomes-Project-Consortium. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020).
Jaratlerdsiri, W. et al. African-specific molecular taxonomy of prostate cancer. Nature 609, 552–559 (2022).
Soh, P. X. Y. S. & Hayes, V. M. Common genetic variants associated with prostate cancer risk: the need for African Inclusion. Eur. Urol. 84, 22–24 (2023).
Sengupta, D. et al. Genetic substructure and complex demographic history of South African Bantu speakers. Nat. Commun. 12, 2080 (2021).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Fatumo, S. et al. Promoting the genomic revolution in Africa through the Nigerian 100K Genome Project. Nat. Genet. 54, 531–536 (2022).
Schuster, S. C. et al. Complete Khoisan and Bantu genomes from Southern Africa. Nature 463, 943–947 (2010).
Schlebusch, C. M. et al. Khoe-San genomes reveal unique variation and confirm the deepest population divergence in Homo sapiens. Mol. Biol. Evol. 37, 2944–2954 (2020).
Güldemann, T. & Fehn, A.-M. Beyond ‘Khoisan’: Historical Relations in the Kalahari Basin, (John Benjamins Publishing Company, Amsterdam, 2014).
Haacke, W. H. G. Khoekhoegowab (Nama/Damara). in The Social and Political History of Southern Africa’s Languages (eds. Kamusella, T. & Ndhlovu, F.) 133–158 (Palgrave Macmillan, London, 2018).
Fehn, A. M., Amorim, B. & Rocha, J. The linguistic and genetic landscape of southern Africa. J. Anthropol. Sci. 100, 243–265 (2022).
Smith, A., Malherbe, C., Guenther, M. & Berens, P. The Bushmen of Southern Africa: A Foraging Society in Transition. (David Philip Publishers, South Africa, 2004).
Barnard, A. B. Anthropology and the Bushman, (Routledge, New York, 2007).
Koot, S. & van Beek, W. Ju|’hoansi lodging in a Namibian conservancy: CBNRM, tourism and increasing domination. Conserv. Soc. 15, 136–146 (2017).
Hayes, V. M. Indigenous genomics. Science 332, 639 (2011).
Haacke, W. H. G. Khoekhoegowab (Nama/Damara). in The Social and Political History of Southern Africa’s Languages (eds. Kamusella, T. & Ndhlovu, F.) (Palgrave Macmillan, London, 2018).
Sullivan, S. & Ganuses, W. S. Understanding Damara / ‡Nūkhoen and ||Ubun indigeneity and marginalisation in Namibia, (Land, environment and development project, Legal Assistance Centre, Windhoek, Republic of Namibia, 2020).
Kinahan, J. The rock art of ǀUi-ǁAis (Twyfelfontein) Namibia’s first World Heritage Site. (Namib Desert Archaeological Survey, Windhoek, Namibia, 2007).
Lander, F. & Russell, T. The archaeological evidence for the appearance of pastoralism and farming in southern Africa. PLoS ONE 13, e0198941 (2018).
Ragsdale, A. P. et al. A weakly structured stem for human origins in Africa. Nature 617, 755–763 (2023).
Fan, S. et al. Whole-genome sequencing reveals a complex African population demographic history and signatures of local adaptation. Cell 186, 923–939 (2023).
Soodyall, H. & Jenkins, T. Mitochondrial DNA polymorphisms in Negroid populations from Namibia: new light on the origins of the Dama, Herero and Ambo. Ann. Hum. Biol. 20, 477–485 (1993).
Güldemann, T. & Stoneking, M. A historical appraisal of clicks: A linguistic and genetic population perspective. Annu. Rev. Anthropol. 37, 93–109 (2008).
Barbieri, C. et al. Migration and interaction in a contact zone: mtDNA variation among Bantu-speakers in Southern Africa. PLoS ONE 9, e99117 (2014).
Oliveira, S. et al. Matriclans shape populations: Insights from the Angolan Namib Desert into the maternal genetic history of southern Africa. Am. J. Phys. Anthropol. 165, 518–535 (2018).
Grollemund, R. et al. Bantu expansion shows that habitat alters the route and pace of human dispersals. Proc. Natl. Acad. Sci. USA 112, 13296–13301 (2015).
Hammond-Tooke, W. D. Southern Bantu origins: light from kinship terminology. South Afr. Humanit. 16, 71–78 (2004).
Koile, E., Greenhill, S. J., Blasi, D. E., Bouckaert, R. & Gray, R. D. Phylogeographic analysis of the Bantu language expansion supports a rainforest route. Proc. Natl. Acad. Sci. USA 119, e2112853119 (2022).
Choudhury, A., Sengupta, D., Ramsay, M. & Schlebusch, C. Bantu-speaker migration and admixture in southern Africa. Hum. Mol. Genet. 30, R56–R63 (2021).
Skoglund, P. et al. Reconstructing prehistoric African population structure. Cell 171, 59–71.e21 (2017).
Patin, E. et al. Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America. Science 356, 543–546 (2017).
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 22, 1154–1162 (2012).
Gardner, E. J. et al. The mobile element locator tool (MELT): population-scale mobile element discovery and biology. Genome Res. 27, 1916–1929 (2017).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Ceballos, F. C., Joshi, P. K., Clark, D. W., Ramsay, M. & Wilson, J. F. Runs of homozygosity: windows into population history and trait architecture. Nat. Rev. Genet. 19, 220–234 (2018).
Bennett, E. A. et al. Active Alu retrotransposons in the human genome. Genome Res. 18, 1875–1883 (2008).
Fan, S. et al. African evolutionary history inferred from whole genome sequence data of 44 indigenous African populations. Genome Biol. 20, 204 (2019).
Chan, E. K. F. et al. Human origins in a southern African palaeo-wetland and first migrations. Nature 575, 185–189 (2019).
Raj, A., Stephens, M. & Pritchard, J. K. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197, 573–589 (2014).
Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440 (2022).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Schlebusch, C. M. et al. Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago. Science 358, 652–655 (2017).
Lipson, M. et al. Ancient West African foragers in the context of African population history. Nature 577, 665–670 (2020).
Schlebusch, C. M. et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science 338, 374–379 (2012).
Schlebusch, C. M., Prins, F., Lombard, M., Jakobsson, M. & Soodyall, H. The disappearing San of southeastern Africa and their genetic affinities. Hum. Genet. 135, 1365–1373 (2016).
May, A. et al. Genetic diversity in black South Africans from Soweto. BMC Genom. 14, 644 (2013).
Aberer, A. J., Krompass, D. & Stamatakis, A. Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst. Biol. 62, 162–166 (2013).
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Giliomee, H. B. & Mbenga, B. K. Nuwe geskiedenis van Suid-Afrika. (Tafelberg, 2007).
Choin, J. et al. Genomic insights into population history and biological adaptation in Oceania. Nature 592, 583–589 (2021).
Malaspinas, A. S. et al. A genomic history of Aboriginal Australia. Nature 538, 207–214 (2016).
Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).
Browning, S. R. et al. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 14, e1007385 (2018).
Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
Hughes, A. L. & Nei, M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335, 167–170 (1988).
Wilson, D. J. & McVean, G. Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172, 1411–1425 (2006).
Kim, U. K. et al. Positional cloning of the human quantitative trait locus underlying taste sensitivity to phenylthiocarbamide. Science 299, 1221–1225 (2003).
Petersen, D. C. et al. Complex patterns of genomic admixture within southern Africa. PLoS Genet. 9, e1003309 (2013).
Sabbagh, A., Darlu, P., Crouau-Roy, B. & Poloni, E. S. Arylamine N-acetyltransferase 2 (NAT2) genetic diversity and traditional subsistence: a worldwide population survey. PLoS ONE 6, e18507 (2011).
Lamason, R. L. et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786 (2005).
Engelken, J. et al. Extreme population differences in the human zinc transporter ZIP4 (SLC39A4) are explained by positive selection in Sub-Saharan Africa. PLoS Genet. 10, e1004128 (2014).
Campbell, M. C. & Tishkoff, S. A. African genetic diversity: implications for human demographic history, modern human origins, and complex disease mapping. Annu. Rev. Genomics Hum. Genet. 9, 403–433 (2008).
Yi, X. et al. Sequencing of fifty human exomes reveals adaptation to high altitude. Science 329, 75–78 (2010).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Lacaze, P. et al. The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals. Rationale and cohort design. Eur. J. Hum. Genet. 27, 308–316 (2019).
Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
Hauser, A. S. et al. Pharmacogenomics of GPCR Drug Targets. Cell 172, 41–54 (2018).
Whirl-Carrillo, M. et al. Pharmacogenomics knowledge for personalized medicine. Clin. Pharmacol. Ther. 92, 414–417 (2012).
Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).
Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).
Weis, W. I. & Kobilka, B. K. The molecular basis of G protein-coupled receptor activation. Annu. Rev. Biochem. 87, 897–919 (2018).
Conti, D. V. et al. Trans-ancestry genome-wide association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic risk prediction. Nat. Genet. 53, 65–75 (2021).
Barrett, R. D. & Hoekstra, H. E. Molecular spandrels: tests of adaptation at the genetic level. Nat. Rev. Genet. 12, 767–780 (2011).
Akey, J. M. Constructing genomic maps of positive selection in humans: where do we go from here? Genome Res. 19, 711–722 (2009).
Szpak, M., Xue, Y., Ayub, Q. & Tyler-Smith, C. How well do we understand the basis of classic selective sweeps in humans? FEBS Lett. 593, 1431–1448 (2019).
Vitti, J. J., Grossman, S. R. & Sabeti, P. C. Detecting natural selection in genomic data. Annu. Rev. Genet. 47, 97–120 (2013).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25, 1754–1760 (2009).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinform. 11, 11.10.1–33 (2013).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P. & Ramachandran, S. pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32, 2817–2823 (2016).
Diaz-Papkovich, A., Anderson-Trocmé, L., Ben-Eghan, C. & Gravel, S. UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet. 15, e1008432 (2019).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Maier, R. et al. On the limits of fitting complex models of population history to f-statistics. Elife 12, e85492 (2023).
Wangkumhang, P., Greenfield, M. & Hellenthal, G. An efficient method to identify, date, and describe admixture events using haplotype information. Genome Res. 32, 1553–1564 (2022).
Delaneau, O., Marchini, J. & 1000-Genomes-Project-Consortium. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat. Commun. 5, 3934 (2014).
Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).
Kamm, J., Terhorst, J., Durbin, R. & Song, Y. S. Efficiently inferring the demographic history of many populations with allele count data. J. Am. Stat. Assoc. 115, 1472–1487 (2020).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Jaratlerdsiri, W. A catalogue of early diverged contemporary human genome variation reveals distinct Khoe-San populations, Code Ocean https://doi.org/10.24433/CO.6181495.v1 (2025).
Acknowledgements
The work presented was supported by a donation provided by the University of Limpopo in South Africa (to V.M.H. and J.M.), the Garvan Institute of Medical Research Foundation and Medical Genome Reference Bank (MGRB) in Australia (to D.M.T.), and by an Australian Medical Research Future Fund (MRFF) Genomics Health Futures Mission Grant (2025/MRF2045394 to V.M.H., W.J., D.M.T. and P.X.Y.S.), while partially supported through the U.S.A. Congressionally Directed Medical Research Programmes (CDMRP) Prostate Cancer Research Programme (PCRP) Health Equity Research and Outcomes Improvement Consortium (HEROIC) Award (PC210168 and PC230673, HEROIC Prostate Cancer Precision Health (PCaPH) Africa1K to V.M.H., M.S.R.B., Peter Ngugi from the University of Nairobi, Kenya and Gail Prins from the University of Illinois at Chicago, U.S.A.). J.J. is supported by a U.S.A. Prostate Cancer Foundation (PCF) PhD Scholarship as part of a Challenge Award (23CHAL18, to V.M.H.), and V.M.H. is supported by the Petre Foundation through the University of Sydney Foundation. We acknowledge the use of the National Computational Infrastructure (NCI), which is supported by the Australian Government and accessed through the National Computational Merit Allocation Scheme (V.M.H., E.K.F.C. and W.J.), the Intersect Computational Merit Allocation Scheme (V.M.H.), Intersect Australia Limited and the Sydney Informatics Hub, Core Research Facility, and we acknowledge the staff at the Garvan Institute of Medical Research’s Kinghorn Centre for Clinical Genomics (KCCG) core facility for genome sequencing. We thank the study participants and their representative communities who contributed to this study; without their contribution and continued engagement, this research would not be possible. We are indebted to the many local Namibians who have aided during community engagement, providing critical logistical, historical, cultural and linguistic insights, specifically E. Adams, A.A. Collins, R. Friederich, B. G/aq’o, N. /kun, J. /kunta, H. Mische, F. Naque, D. Naque, H. Oosthuizen, E. Oosthuizen, A. Oosthuysen, E. Oosthuysen, D. Roux, J. Sinvula, C. Swau, T. Tauros, T. Tsebe and R. Wilkinson, and we are grateful to C.P. Bennett from Evolving Picture in Sydney (https://evolvingpicture.com/) for providing community recording. We further acknowledge and fondly remember the late Archbishop Emeritus Desmond Tutu (South Africa), who remained an advocate and key participant of the Ubuntu Project, and the late Chief Seth M. Kooitjie (Namibia), past Chairperson of the Nama Traditional Leaders Association, for his blessing and critical support, and we acknowledge Professors Philip A. Venter (University of Limpopo, South Africa) and Christopher F. Heyns (University of Stellenbosch, South Africa) for their foundational work in establishing ethical frameworks within South Africa and Namibia, respectively. More recently, we are grateful to Professor Lamech Mwapagha, Namibia University of Science & Technology (NUST), for taking on the responsibility of KSGP DAC Chair.
Author information
Contributions
V.M.H. designed the experiments. Community engagement, recruitment, and government and ethics approvals were performed by V.M.H., Z.S., E.H., K.T., H.A.E.F., M.S.R.B. and J.M., with V.M.H. performing all remote community recruitment personally in the border of Namibia (see Supplementary Data). Z.S., D.C.P., E.K.F.C., K.T. and V.M.H. performed initial genetic screening for participant inclusion, W.H.G.H. provided Khoe-San linguistic expertise, and D.M.T. provided access to and interpretation of the Medical Genome Reference Bank (MGRB). W.J. performed all the bioinformatic analyses and designed the positive selection workflow and code, with additional support from T.G. and J.J., while P.X.Y.S. performed population substructure analyses. W.J. developed the pipelines and performed high-performance computational variant calling, with further complex variant annotation supported by T.G. and J.J. Both W.J. and V.M.H. performed data interpretation and wrote the manuscript, with further critical culturally relevant interpretation provided by Z.S., D.C.P., E.H., H.A.E.F., M.S.R.B. and J.M.; V.M.H., J.M., M.S.R.B. and D.M.T. acquired the funding. W.J. and P.X.Y.S. generated the figures, while all authors contributed to the final editing and approval of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jaratlerdsiri, W., Soh, P.X.Y., Gong, T. et al. A catalogue of early diverged contemporary human genome variation reveals distinct Khoe-San populations. Nat Commun (2026). https://doi.org/10.1038/s41467-026-69269-4
DOI: https://doi.org/10.1038/s41467-026-69269-4