Genome diversity and signatures of natural selection in mainland Southeast Asia

Abstract

Mainland Southeast Asia (MSEA) has rich ethnic and cultural diversity with a population of nearly 300 million1,2. However, people from MSEA are underrepresented in the current human genomic databases. Here we present the SEA3K genome dataset (phase I), generated by deep short-read whole-genome sequencing of 3,023 individuals from 30 MSEA populations, and long-read whole-genome sequencing of 37 representative individuals. We identified 79.59 million small variants and 96,384 structural variants, among which 22.83 million small variants and 24,622 structural variants are unique to this dataset. We observed a high genetic heterogeneity across MSEA populations, reflected by the varied combinations of genetic components. We identified 44 genomic regions with strong signatures of Darwinian positive selection, covering 89 genes involved in varied physiological systems such as physical traits and immune response. Furthermore, we observed varied patterns of archaic Denisovan introgression in MSEA populations, supporting the proposal of at least two distinct instances of Denisovan admixture into modern humans in Asia3. We also detected genomic regions that suggest adaptive archaic introgressions in MSEA populations. The large number of novel genomic variants in MSEA populations highlights the necessity of studying regional populations that can help answer key questions related to prehistory, genetic adaptation and complex diseases.

Fig. 1: Statistics of the SEA3K genomic variants.
Fig. 2: SV discovery based on the long-read genome data of 37 MSEA individuals.
Fig. 3: Genetic structure and population history of MSEA populations.
Fig. 4: Genomic signals of positive selection in MSEA populations.
Fig. 5: The landscape of archaic introgression in MSEA populations.

Data availability

Datasets generated in this study were deposited in public repositories. WGS data are archived at the Genome Sequence Archive under the accession HRA007135. Genome assemblies are archived at Genome Warehouse (GWH) under the accession PRJCA028104. Variant data are archived at Genome Variation Map under the accession number GVM000730. To protect participant confidentiality, the raw sequencing data are available to the scientific community for general research through a controlled access process. Access can be requested by submitting an application that includes a detailed research proposal and an IRB approval from the applicant’s home institute to the Data Access Committee of Kunming Institute of Zoology, Chinese Academy of Sciences (KIZ, CAS). All other data are open access. Datasets obtained from publicly available sources include: human reference genome GRCh38 (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa), human reference genome T2T-CHM13 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz), 1KGP phase 3 dataset (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), high-coverage 1KGP dataset (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/), HGDP dataset (https://rosenberglab.stanford.edu/data/conradEtAl2006/data1_1.tar.gz), SGDP dataset (https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/vcfs.variants.public_samples.279samples.tar), HGSVC3 (freeze4) (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC3/working/20240415_Freeze4/), HPRC (v1.1) (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz), genomes of the Altai Neanderthal and Denisovan (https://www.eva.mpg.de/genetics/genome-projects/), RefSeq genes (https://www.cog-genomics.org/static/bin/plink/glist-hg38), GWAS summary data (https://www.ebi.ac.uk/gwas/api/search/downloads/full), cCREs from ENCODE (https://downloads.wenglab.org/V3/GRCh38-cCREs.bed), ClinVar (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/), OMIM database (https://www.ncbi.nlm.nih.gov/omim/) and eQTLs from GTEx (https://storage.googleapis.com/adult-gtex/bulk-qtl/v8/single-tissue-cis-qtl/GTEx_Analysis_v8_eQTL.tar).

References

  1. Jin, L., Seielstad, M. & Xiao, C. Genetic, Linguistic and Archaeological Perspectives on Human Diversity in Southeast Asia (World Scientific, 2001).

  2. Glover, I. & Bellwood, P. S. Southeast Asia: From Prehistory to History (Routledge Curzon, 2004).

  3. Browning, S. R., Browning, B. L., Zhou, Y., Tucci, S. & Akey, J. M. Analysis of human sequence data reveals two pulses of archaic Denisovan admixture. Cell 173, 53–61.e59 (2018).

  4. Su, B. et al. Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65, 1718–1724 (1999).

  5. Hallast, P., Agdzhoyan, A., Balanovsky, O., Xue, Y. & Tyler-Smith, C. A Southeast Asian origin for present-day non-African human Y chromosomes. Hum. Genet. 140, 299–307 (2021).

  6. Kutanan, W. et al. Reconstructing the human genetic history of mainland Southeast Asia: insights from genome-wide data from Thailand and Laos. Mol. Biol. Evol. 38, 3459–3477 (2021).

  7. Duong, N. T. et al. Complete human mtDNA genome sequences from Vietnam and the phylogeography of mainland Southeast Asia. Sci. Rep. 8, 11651 (2018).

  8. Li, Y. C. et al. Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene. Sci. Rep. 5, 9473 (2015).

  9. Zhang, X. et al. Analysis of mitochondrial genome diversity identifies new and ancient maternal lineages in Cambodian aborigines. Nat. Commun. 4, 2599 (2013).

  10. Abdulla, M. A. et al. Mapping human genetic diversity in Asia. Science 326, 1541–1545 (2009).

  11. Peng, M. S. et al. Tracing the Austronesian footprint in mainland Southeast Asia: a perspective from mitochondrial DNA. Mol. Biol. Evol. 27, 2417–2430 (2010).

  12. Deng, L. et al. Genetic connections and convergent evolution of tropical Indigenous peoples in Asia. Mol. Biol. Evol. 39, msab361 (2022).

  13. Tucci, S. et al. Evolutionary history and adaptation of a human pygmy population of Flores Island, Indonesia. Science 361, 511–516 (2018).

  14. Zhang, X. et al. The distinct morphological phenotypes of Southeast Asian aborigines are shaped by novel mechanisms for adaptation to tropical rainforests. Natl Sci. Rev. 9, nwab072 (2022).

  15. Dhir, R. K., Cattaneo, U., Ormaza, M. V. C., Coronado, H. & Oelz, M. Implementing the ILO Indigenous and Tribal Peoples Convention No. 169: Towards an Inclusive, Sustainable and Just Future (International Labour Organization, 2020).

  16. Wong, L. P. et al. Deep whole-genome sequencing of 100 southeast Asian Malays. Am. J. Hum. Genet. 92, 52–66 (2013).

  17. Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440.e3419 (2022).

  18. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  19. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

  20. GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).

  21. Wu, D. et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179, 736–749.e715 (2019).

  22. McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).

  23. Logsdon, G. A. et al. Complex genetic variation in nearly complete human genomes. Preprint at bioRxiv https://doi.org/10.1101/2024.09.24.614721 (2024).

  24. Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  25. Wang, C., Zöllner, S. & Rosenberg, N. A. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genet. 8, e1002886 (2012).

  26. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).

  27. Lipson, M. et al. Ancient genomes document multiple waves of migration in Southeast Asian prehistory. Science 361, 92–95 (2018).

  28. McColl, H. et al. The prehistoric peopling of Southeast Asia. Science 361, 88–92 (2018).

  29. Liu, D. et al. Extensive ethnolinguistic diversity in Vietnam reflects multiple sources of genetic diversity. Mol. Biol. Evol. 37, 2503–2519 (2020).

  30. Lawson, D. J., van Dorp, L. & Falush, D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat. Commun. 9, 3258 (2018).

  31. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).

  32. Schiffels, S. & Wang, K. MSMC and MSMC2: the multiple sequentially Markovian coalescent. Methods Mol. Biol. 2090, 147–166 (2020).

  33. Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303–309 (2017).

  34. Grossman, S. R. et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 327, 883–886 (2010).

  35. Luo, H. et al. Recent positive selection signatures reveal phenotypic evolution in the Han Chinese population. Sci. Bull. 68, 2391–2404 (2023).

  36. Zheng, W. et al. Large-scale genome sequencing redefines the genetic footprints of high-altitude adaptation in Tibetans. Genome Biol. 24, 73 (2023).

  37. Liu, X. et al. Decoding triancestral origins, archaic introgression, and natural selection in the Japanese population by whole-genome sequencing. Sci. Adv. 10, eadi8419 (2024).

  38. Lo, Y. H. et al. Detecting genetic ancestry and adaptation in the Taiwanese Han people. Mol. Biol. Evol. 38, 4149–4165 (2021).

  39. Chen, L., Wolf, A. B., Fu, W., Li, L. & Akey, J. M. Identifying and interpreting apparent Neanderthal ancestry in African individuals. Cell 180, 677–687.e616 (2020).

  40. Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012).

  41. Springelkamp, H. et al. ARHGEF12 influences the risk of glaucoma by increasing intraocular pressure. Hum. Mol. Genet. 24, 2689–2699 (2015).

  42. Kichaev, G. et al. Leveraging polygenic functional enrichment to improve GWAS power. Am. J. Hum. Genet. 104, 65–75 (2019).

  43. Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020).

  44. Efremov, G. D. et al. Hb Icaria–Hb H disease: identification of the Hb Icaria mutation through analysis of amplified DNA. Br. J. Haematol. 75, 250–253 (1990).

  45. Vlok, M. et al. Forager and farmer evolutionary adaptations to malaria evidenced by 7000 years of thalassemia in Southeast Asia. Sci. Rep. 11, 5677 (2021).

  46. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

  47. Larena, M. et al. Multiple migrations to the Philippines during the last 50,000 years. Proc. Natl Acad. Sci. USA 118, e2026132118 (2021).

  48. Karmin, M. et al. Episodes of diversification and isolation in island Southeast Asian and near Oceanian male lineages. Mol. Biol. Evol. 39, msac045 (2022).

  49. Fan, S., Hansen, M. E., Lo, Y. & Tishkoff, S. A. Going global by adapting local: a review of recent human adaptation. Science 354, 54–59 (2016).

  50. Sankararaman, S., Mallick, S., Patterson, N. & Reich, D. The combined landscape of Denisovan and Neanderthal ancestry in present-day humans. Curr. Biol. 26, 1241–1247 (2016).

  51. Barnes, R. H., Gray, A. & Kingsbury, B. Indigenous Peoples of Asia (Association for Asian Studies, 1995).

  52. Taylor, P. M. in World Bank Inspection Panel. Investigation Report (March 30, 2006): Cambodia: Forest Concession Management and Control Pilot Project 128–141 (World Bank, 2006).

  53. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

  54. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).

  55. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  56. Cingolani, P. et al. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front. Genet. 3, 35 (2012).

  57. Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).

  58. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  59. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  60. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

  61. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).

  62. Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).

  63. Mondal, M. et al. Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat. Genet. 48, 1066–1070 (2016).

  64. Lu, D. et al. Ancestral origins and genetic history of Tibetan highlanders. Am. J. Hum. Genet. 99, 580–594 (2016).

  65. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).

  66. Wang, C. et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9, 13 (2010).

  67. Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).

  68. Bhatia, G., Patterson, N., Sankararaman, S. & Price, A. L. Estimating and interpreting FST: the impact of rare variants. Genome Res. 23, 1514–1521 (2013).

  69. Tamura, K., Stecher, G. & Kumar, S. MEGA11: Molecular Evolutionary Genetics Analysis version 11. Mol. Biol. Evol. 38, 3022–3027 (2021).

  70. Yu, G. Using ggtree to visualize data on tree-like structures. Curr. Protoc. Bioinformatics 69, e96 (2020).

  71. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).

  72. Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8, 289–317 (2016).

  73. Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786–1788 (2019).

  74. Wang, J., Raskin, L., Samuels, D. C., Shyr, Y. & Guo, Y. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics 31, 318–323 (2015).

  75. Weissensteiner, H. et al. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res. 44, W58–W63 (2016).

  76. van Oven, M. & Kayser, M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 30, E386–E394 (2009).

  77. Chen, H., Lu, Y., Lu, D. & Xu, S. Y-LineageTracker: a high-throughput analysis framework for Y-chromosomal next-generation sequencing data. BMC Bioinformatics 22, 114 (2021).

  78. Y Chromosome Consortium. A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res. 12, 339–348 (2002).

  79. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).

  80. Sellinger, T. P. P., Abu-Awad, D. & Tellier, A. Limits and convergence properties of the sequentially Markovian coalescent. Mol. Ecol. Resour. 21, 2231–2248 (2021).

  81. Patton, A. H. et al. Contemporary demographic reconstruction methods are robust to genome assembly quality: a case study in Tasmanian devils. Mol. Biol. Evol. 36, 2906–2921 (2019).

  82. Hu, W. et al. Genomic inference of a severe human bottleneck during the Early to Middle Pleistocene transition. Science 381, 979–984 (2023).

  83. Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78 (2010).

  84. Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).

  85. Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).

  86. Szpiech, Z. A. & Hernandez, R. D. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol. Biol. Evol. 31, 2824–2827 (2014).

  87. Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl regulatory build. Genome Biol. 16, 56 (2015).

  88. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  89. Romanoski, C. E., Glass, C. K., Stunnenberg, H. G., Wilson, L. & Almouzni, G. Epigenomics: roadmap for regulation. Nature 518, 314–316 (2015).

  90. Adams, D. et al. BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30, 224–226 (2012).

  91. Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012).

  92. Racimo, F., Berg, J. J. & Pickrell, J. K. Detecting polygenic adaptation in admixture graphs. Genetics 208, 1565–1584 (2018).

  93. Berg, J. J. & Coop, G. A population genetic signal of polygenic adaptation. PLoS Genet. 10, e1004412 (2014).

  94. Yengo, L. et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (2022).

  95. Chen, M. et al. Evidence of polygenic adaptation in Sardinia at height-associated loci ascertained from the Biobank Japan. Am. J. Hum. Genet. 107, 60–71 (2020).

  96. Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263–265 (2005).

  97. Hofmeister, R. J., Ribeiro, D. M., Rubinacci, S. & Delaneau, O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat. Genet. 55, 1243–1249 (2023).

  98. Rozas, J. et al. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol. Biol. Evol. 34, 3299–3302 (2017).

  99. Excoffier, L. & Lischer, H. E. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10, 564–567 (2010).

  100. Leigh, J. W. & Bryant, D. POPART: full-feature software for haplotype network construction. Methods Ecol. Evol. 6, 1110–1116 (2015).

  101. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  102. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

  103. Hu, J. et al. NextPolish2: A repeat-aware polishing tool for genomes assembled using HiFi long reads. Genomics Proteomics Bioinformatics 22, qzad009 (2024).

  104. Chen, Y., Zhang, Y., Wang, A. Y., Gao, M. & Chong, Z. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 22, 312 (2021).

  105. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  106. Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. 42, 1571–1580 (2024).

  107. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).

  108. Wang, S. et al. De novo and somatic structural variant discovery with SVision-pro. Nat. Biotechnol. 43, 181–185 (2025).

  109. Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).

  110. Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).

  111. Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).

  112. Prüfer, K. et al. A high-coverage Neandertal genome from Vindija Cave in Croatia. Science 358, 655–658 (2017).

  113. Delaneau, O., Zagury, J. F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).

  114. Loh, P. R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).

  115. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).

  116. Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 6, 8111 (2015).

Acknowledgements

The authors thank all participants in the project, and S.-F. Wu for assistance in sample collection. This study was supported by National Natural Science Foundation of China (32288101 to B.S.; 32170632 to Y.H.; T2222030 and U23A20161 to X.Z.; 32170633 to Y.-C.L.), Major Scientific Project of Yunnan Province (202305AH340007 to B.S.), Yunnan Revitalization Talent Support Program Science and Technology Champion Project (202005AB160004 to B.S.), Yunnan Revitalization Talent Support Program Innovation Team (202405AS350008), Yunnan Scientist Workshops (to B.S.), Science and Technology General Program of Yunnan Province (202301AW070010 to Y.H.), High-level Talent Promotion and Training Projects of Kunming (2022SCP001 to M.-S.P., 2022SCP001 to Y.-P.Z. and 2020SCP001 to Q.-P.K.), Animal Branch of the Germplasm Bank of Wild Species, Chinese Academy of Sciences (the Large Research Infrastructure Funding), the National Key R&D Program of China (2022YFC3302004 to Y.-C.L.), Yunling Scholar of the Yunnan Province (Q.-P.K.) and the Yunnan Ten Thousand Talents Plan Young and Elite Talents Project (Y.-C.L.).

Author information

Contributions

B.S., Y.-P.Z. and Q.-P.K. conceived and designed the study. B.S. and Y.H. coordinated and supervised the project. T.S., L.B. and X.Z. collected samples from Cambodia. J.K., M.S., W.K. and J.W. collected samples from Thailand. H.Q.H., K.D.P., S.D. and M.-S.P. collected samples from Vietnam. S. Singthong, S. Sochampa, C.L., Z.G., L.-Q.Y. and Y.-C.L. collected samples from Laos. U.W.K. and M.-S.P. collected samples from Myanmar. X.Z. collected samples from China. A.I., W.P., C.M. and K.R. were responsible for the ethical approval work, personnel coordination and volunteer organization for sample collection. Y.H., X.Z., M.-S.P., Y.-C.L., Y.Z., K.L., Y. Ma and T.Y. prepared the samples and processed them for sequencing. Y.H. and K.L. contributed to data processing, QC, variant analysis and genome assembly. Y.H., Y. Ma, W. Zheng, Y. Liao, L.M., J.G., J.L., R.H., K.L., Y. Lu and Y.W. contributed to the population genetics analysis. Y.Z., T.Y. and Y.H. contributed to construction of the imputation reference panel. W. Zhang, X.C. and B.T. contributed to construction of the SEA3K Imputation Server. Y.G., Y.H., X.Y., K.Y., S.G., S.W., B.Z., Y. Mao and X.W. contributed to structural variants analysis. L.C., S.-A.L., Y.Z. and Y.H. contributed to archaic introgression analysis. Y.H. and X.Z. were responsible for organizing the CASEAC. Y.H., M.-S.P. and Y.-C.L. were responsible for ethical, legal and social implications. Y.H. and B.S. wrote the manuscript. Y.-P.Z., Q.-P.K., X.Z., M.-S.P., Y.-C.L., L.C. and Y. Mao edited the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Qing-Peng Kong, Ya-Ping Zhang or Bing Su.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Stephen Acabado and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Statistics of small variant calls.

The sample-level counts of small variants (SNVs and indels) (panel-a) and the calculated heterozygosity ratios of small variants (panel-b) across populations, stratified by super-populations. A total of 29 MSEA populations (CMKL was excluded due to the small sample size of two individuals) in the SEA3K dataset and 26 populations (belonging to five super-populations) from the 1KGP dataset were included in this analysis. Panel-a shows a higher average number of variants per individual genome in MSEA populations than in the Eurasian populations of 1KGP, and panel-b shows that the heterozygosity ratio in MSEA populations is comparable to that in East Asian populations. The genome-wide heterozygosity was calculated as the ratio of the number of heterozygous SNVs to the number of non-reference homozygous SNVs. For each boxplot, the box extends from the first quartile to the third quartile, and a horizontal line across the box indicates the median. The whiskers extend from each quartile to the minimum or the maximum. Abbreviations in 1KGP: EAS-East Asians, AMR-Americans, SAS-South Asians, EUR-Europeans, and AFR-Africans.
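
The heterozygosity ratio and the boxplot elements described in this legend can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `het_hom_ratio` and `box_summary`, the genotype encoding (pairs of integer allele indices with 0 as the reference allele and `None` for a missing allele) and the quartile interpolation method are all assumptions made for illustration.

```python
from statistics import quantiles


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to non-reference homozygous SNV calls.

    `genotypes` holds one (allele_a, allele_b) pair per SNV for one sample.
    Assumed encoding: 0 = reference allele, >0 = alternate allele,
    None = missing allele.
    """
    het = 0
    hom_alt = 0
    for a, b in genotypes:
        if a is None or b is None:
            continue  # skip missing calls
        if a != b:
            het += 1  # two different alleles: heterozygous
        elif a != 0:
            hom_alt += 1  # two identical non-reference alleles
    if hom_alt == 0:
        raise ValueError("no non-reference homozygous calls")
    return het / hom_alt


def box_summary(values):
    """Minimum, first quartile, median, third quartile and maximum.

    These are the five values drawn in each boxplot of the legend
    (whiskers reach the minimum and maximum). The "inclusive"
    quartile method is an assumption; plotting tools may differ.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)
```

In practice the per-sample ratios of one population would be passed to `box_summary` to obtain the values plotted for that population.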

Extended Data Fig. 2 Schematic diagram of SV merging and SV novelty evaluation.

a, The SV calls from Sniffles2 and PAV of each individual were merged using Jasmine, and then we merged the 37 individual calls to obtain the population calls (120,063 SVs in total). b, The counts of individual SVs in the Sniffles2 calls, the PAV calls and the merged calls. The mean SV counts per individual at each step are labeled and denoted using dashed red lines. c, Repeat annotation of MSEA SVs. d, The HGSVC3 and HPRC SV callsets are considered as the reference panel. Details of the evaluation procedure are provided in Methods.

Extended Data Fig. 3 Principal component analysis (PCA) of the SEA3K and worldwide population samples.

The plots were constructed using 4,587,143 biallelic SNVs among 5,938 global individuals, including 2,183 samples drawn from SEA3K, 2,504 samples from 1KGP, 828 samples from HGDP, 279 samples from SGDP, and other representative populations from the published data: Malays (n = 96), Andamanese (n = 10) and Tibetans (n = 38) (see Methods). The upper-panel shows the global pattern, and the bottom-panel shows the regional pattern including only East Asians, South Asians and SEA3K. The SEA3K dataset was categorized by language families.

Extended Data Fig. 4 Genetic relatedness of MSEA populations with other global populations.

The Neighbor-Joining (NJ) trees and Maximum-Likelihood (ML) trees were constructed using the merged SNV data from SEA3K (marked in red font) and 1KGP (or HGDP). The NJ trees show the genetic relationships determined using genetic distances, and the unit of branch length for genetic distance is shown at the bottom of the tree. In the ML trees, the standard errors and 100 bootstrap replicates were used to evaluate the confidence in the inferred tree topology.

Extended Data Fig. 5 Pairwise FST between MSEA and East Asian populations.

Heatmaps showing the pairwise FST differences between East Asian populations from SGDP, and MSEA populations in SEA3K. Color scales for both heatmaps are the same. Genetic differences between MSEA populations are greater than those between populations from East Asia distributed over a comparable geographic range. For example, FST between MYNA and CMKR (0.051) is significantly higher than the highest FST between East Asian groups (0.037, Yakut vs. She), even though the latter two populations are separated by three to four times the geographical distance between the MSEA populations.

Extended Data Fig. 6 Historical population dynamics of MSEA populations.

The plots show the results of the estimated effective population size (Ne) of the ancestral populations in MSEA populations (stratified by countries) using SMC++ (dotted lines) and MSMC2 (solid lines) methods. The generation time and the mutation rate per generation per site (μ) were set to 29 years and 1.25 × 10⁻⁸, respectively, in both MSMC2 and SMC++.

Extended Data Fig. 7 GWAS annotation and simulation of PS-SNVs in MSEA populations.

a, Manhattan plot of the CMS scores of the genome-wide SNVs in MSEA populations. The reported GWAS hits are highlighted by red dots and labeled with GWAS information (variants, associated traits and mapped genes). There are six positive-selection regions harboring the GWAS hits of the top PS-SNVs (with the top 0.01% CMS scores), including two regions covering genes (GDF5, UQCC1 and ZRANB3) related to body height, one region covering genes (SLC24A5 and MYEF2) related to skin pigmentation, one region covering PNPT1 related to hip circumference, one region covering WDPCP related to diet measurement, and the MHC region related to diverse phenotypes. The dot size denotes the P-value of the CMS score. Statistical significance was assessed using a one-sided chi-squared (χ²) test. b, Permutation test of height-associated SNVs. Among the 5,505 PS-SNVs in the 44 regions, we found a significant excess of variants related to body height, with 21 height-associated SNVs falling in the PS-SNV set (P-value = 0.005 based on 1,000 one-sided permutation tests) (see details in Methods).
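
The logic of a one-sided permutation test such as the one in panel b can be sketched as follows. This is an illustrative assumption about the general technique, not the procedure given in Methods: the function name `permutation_p_value`, the uniform random resampling from a background SNV list, and the add-one P-value estimator are all hypothetical choices.

```python
import random
from typing import Hashable, Sequence, Set, Tuple


def permutation_p_value(
    background: Sequence[Hashable],
    selected: Set[Hashable],
    trait_hits: Set[Hashable],
    n_perm: int = 1000,
    seed: int = 0,
) -> Tuple[int, float]:
    """One-sided permutation test for an excess of trait-associated SNVs.

    background: all SNVs eligible for resampling (assumed null universe)
    selected:   the SNV set of interest (for example, PS-SNVs)
    trait_hits: SNVs associated with the trait (for example, body height)
    Returns the observed overlap and an empirical P-value.
    """
    rng = random.Random(seed)
    observed = len(selected & trait_hits)
    k = len(selected)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Draw a random SNV set of the same size as the selected set.
        draw = rng.sample(background, k)
        overlap = sum(1 for snv in draw if snv in trait_hits)
        if overlap >= observed:
            at_least_as_extreme += 1
    # Add-one estimator keeps the empirical P-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data: 1,000 background SNVs, 20 of them trait-associated,
# and a selected set of 30 SNVs that contains all 20 trait hits.
toy_background = list(range(1000))
toy_trait_hits = set(range(20))
toy_selected = set(range(30))
toy_observed, toy_p = permutation_p_value(
    toy_background, toy_selected, toy_trait_hits
)
```

A real analysis would usually match the resampled sets on properties such as allele frequency or local LD rather than sampling uniformly; the uniform draw here only keeps the sketch short.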

Extended Data Fig. 8 Signature of natural selection in the FLG gene region.

a, The regional plot of CMS scores and recombination rates in the FLG region, in which the peaks indicate the selective signals. The peak SNVs are marked with colors. The bottom panel shows the LD blocks of the 262 PS-SNVs with MSEA-enriched alleles. The dot size denotes the P-value of the CMS score, and statistical significance was assessed using a chi-squared (χ²) test. The calculated r² values indicate the estimated degree of linkage disequilibrium (LD) between the peak SNV and the other SNVs and are color-coded. b, The TCS network of the FLG region showing an MSEA-specific haplogroup. Each node represents a haplotype, and the size is proportional to its frequency. The MSEA-specific haplogroup is highlighted.

Extended Data Fig. 9 Identification of archaic introgression in MSEA populations.

a, Comparison of the archaic-introgression sequences (from Neanderthal and Denisovan) between MSEA and global populations in 1KGP. EAS-East Asians (CDX and KHV were excluded since they belong to MSEA), AMR-Americans, SAS-South Asians and EUR-Europeans. The y-axis indicates the mean detected Neanderthal sequences (or Denisovan sequences) per individual from different populations, stratified by super populations. We compared MSEA with EAS and SAS, and evaluated significance using a two-sided unpaired t-test. For each boxplot, we drew a box from the first quartile to the third quartile. A horizontal line across the box indicates the median. The whiskers go from each quartile to the minimum or the maximum. b, Mean amounts of the detected introgressed sequences per individual in the 1KGP populations, categorized by affinity to the Altai Neanderthal and Altai Denisovan genomes. c, Intersection of the Neanderthal introgression callsets between Sprime and IBDmix. The callsets were merged for all identified introgression segments in MSEA individuals. d, Violin plots of the Neanderthal sequences per individual in MSEA populations identified by Sprime and IBDmix. For each boxplot, we drew a box from the first quartile to the third quartile. A horizontal line across the box indicates the median. The whiskers go from each quartile to the minimum or the maximum.

Extended Data Fig. 10 Hierarchical clustering of the haplotypes spanning the adaptive introgression region on Chr1.

The rows illustrate the individual haplotypes. A total of 2,116 individuals from 23 populations in SEA3K (MSEA population) were included, and 2,504 individuals from 26 geographically diverse populations in 1KGP (including SAS, EAS, EUR and AMR populations) were used as the control groups. The Denisovan-derived PS-SNVs are denoted by red rhombuses. Gray and black represent the ancestral and the derived alleles, respectively. DEN, Denisovan. The introgressed Denisovan-like haplotypes are marked in the plot.

Extended Data Fig. 11 Detection of disease-associated variants and protein-truncating variants in MSEA populations.

a, Frequency distribution of the ClinVar pathogenic variants in SEA3K. The classifications of autosomal dominant (AD), autosomal recessive (AR) and unknown were based on the OMIM database. b, Number of pathogenic variants carried by each MSEA individual. c, Ten pathogenic variants specifically enriched in MSEA populations. The mapped gene, variants and risk alleles, and frequencies of risk alleles in SEA3K and other datasets are indicated. The clinical significance is indicated by the exclamation marks (pathogenic level) and stars (number of times classified by previous submitters). d, Frequency distribution of an alpha-thalassemia variant in HBA2 (chromosome 16: 173598; c.427T>C) in world populations and MSEA populations. NA, not available. e, The proportion of genes with at least one high-confidence PTV (pie on the left), and the proportions of novel, known, heterozygous and homozygous PTVs (pie on the right) in the SEA3K dataset. f, The counts of the identified novel homozygous PTVs per individual across MSEA populations.

Supplementary information

Supplementary Information

Supplementary Figs 1–18 and a list of members of the Consortium of Anthropological Research in Southeast Asia and Southwest China (CASEAC).

Reporting Summary

Supplementary Tables

Supplementary Tables 1–20.

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, Y., Zhang, X., Peng, MS. et al. Genome diversity and signatures of natural selection in mainland Southeast Asia. Nature 643, 417–426 (2025). https://doi.org/10.1038/s41586-025-08998-w

  • DOI: https://doi.org/10.1038/s41586-025-08998-w
