Abstract
Mainland Southeast Asia (MSEA) has rich ethnic and cultural diversity with a population of nearly 300 million (refs. 1,2). However, people from MSEA are underrepresented in current human genomic databases. Here we present the SEA3K genome dataset (phase I), generated by deep short-read whole-genome sequencing of 3,023 individuals from 30 MSEA populations, and long-read whole-genome sequencing of 37 representative individuals. We identified 79.59 million small variants and 96,384 structural variants, among which 22.83 million small variants and 24,622 structural variants are unique to this dataset. We observed high genetic heterogeneity across MSEA populations, reflected by the varied combinations of genetic components. We identified 44 genomic regions with strong signatures of Darwinian positive selection, covering 89 genes involved in varied physiological systems such as physical traits and immune response. Furthermore, we observed varied patterns of archaic Denisovan introgression in MSEA populations, supporting the proposal of at least two distinct instances of Denisovan admixture into modern humans in Asia (ref. 3). We also detected genomic regions that suggest adaptive archaic introgressions in MSEA populations. The large number of novel genomic variants in MSEA populations highlights the necessity of studying regional populations that can help answer key questions related to prehistory, genetic adaptation and complex diseases.
Data availability
Datasets generated in this study were deposited in public repositories. WGS data are archived at the Genome Sequence Archive under the accession HRA007135. Genome assemblies are archived at Genome Warehouse (GWH) under the accession PRJCA028104. Variant data are archived at Genome Variation Map under the accession number GVM000730. To protect participant confidentiality, the raw sequencing data are available to the scientific community for general research through a controlled access process. Access can be requested by submitting an application that includes a detailed research proposal and an IRB approval from the applicant’s home institute to the Data Access Committee of Kunming Institute of Zoology, Chinese Academy of Sciences (KIZ, CAS). All other data are open access. Datasets obtained from publicly available sources include: human reference genome GRCh38 (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa), human reference genome T2T-CHM13 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz), 1KGP phase 3 dataset (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), high-coverage 1KGP dataset (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/), HGDP dataset (https://rosenberglab.stanford.edu/data/conradEtAl2006/data1_1.tar.gz), SGDP dataset (https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/vcfs.variants.public_samples.279samples.tar), HGSVC3 (freeze4) (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGSVC3/working/20240415_Freeze4/), HPRC (v1.1) (https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz), genomes of the Altai Neanderthal and Denisovan (https://www.eva.mpg.de/genetics/genome-projects/), RefSeq genes (https://www.cog-genomics.org/static/bin/plink/glist-hg38), GWAS summary data (https://www.ebi.ac.uk/gwas/api/search/downloads/full), cCREs from ENCODE (https://downloads.wenglab.org/V3/GRCh38-cCREs.bed), ClinVar (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/), OMIM database (https://www.ncbi.nlm.nih.gov/omim/) and eQTLs from GTEx (https://storage.googleapis.com/adult-gtex/bulk-qtl/v8/single-tissue-cis-qtl/GTEx_Analysis_v8_eQTL.tar).
References
Jin, L., Seielstad, M. & Xiao, C. Genetic, Linguistic and Archaeological Perspectives on Human Diversity in Southeast Asia (World Scientific, 2001).
Glover, I. & Bellwood, P. S. Southeast Asia: From Prehistory to History (Routledge Curzon, 2004).
Browning, S. R., Browning, B. L., Zhou, Y., Tucci, S. & Akey, J. M. Analysis of human sequence data reveals two pulses of archaic Denisovan admixture. Cell 173, 53–61.e59 (2018).
Su, B. et al. Y-chromosome evidence for a northward migration of modern humans into Eastern Asia during the last Ice Age. Am. J. Hum. Genet. 65, 1718–1724 (1999).
Hallast, P., Agdzhoyan, A., Balanovsky, O., Xue, Y. & Tyler-Smith, C. A Southeast Asian origin for present-day non-African human Y chromosomes. Hum. Genet. 140, 299–307 (2021).
Kutanan, W. et al. Reconstructing the human genetic history of mainland Southeast Asia: insights from genome-wide data from Thailand and Laos. Mol. Biol. Evol. 38, 3459–3477 (2021).
Duong, N. T. et al. Complete human mtDNA genome sequences from Vietnam and the phylogeography of mainland Southeast Asia. Sci. Rep. 8, 11651 (2018).
Li, Y. C. et al. Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene. Sci. Rep. 5, 9473 (2015).
Zhang, X. et al. Analysis of mitochondrial genome diversity identifies new and ancient maternal lineages in Cambodian aborigines. Nat. Commun. 4, 2599 (2013).
Abdulla, M. A. et al. Mapping human genetic diversity in Asia. Science 326, 1541–1545 (2009).
Peng, M. S. et al. Tracing the Austronesian footprint in mainland Southeast Asia: a perspective from mitochondrial DNA. Mol. Biol. Evol. 27, 2417–2430 (2010).
Deng, L. et al. Genetic connections and convergent evolution of tropical Indigenous peoples in Asia. Mol. Biol. Evol. 39, msab361 (2022).
Tucci, S. et al. Evolutionary history and adaptation of a human pygmy population of Flores Island, Indonesia. Science 361, 511–516 (2018).
Zhang, X. et al. The distinct morphological phenotypes of Southeast Asian aborigines are shaped by novel mechanisms for adaptation to tropical rainforests. Natl Sci. Rev. 9, nwab072 (2022).
Dhir, R. K., Cattaneo, U., Ormaza, M. V. C., Coronado, H. & Oelz, M. Implementing the ILO Indigenous and Tribal Peoples Convention No. 169: Towards an Inclusive, Sustainable and Just Future (International Labour Organization, 2020).
Wong, L. P. et al. Deep whole-genome sequencing of 100 southeast Asian Malays. Am. J. Hum. Genet. 92, 52–66 (2013).
Byrska-Bishop, M. et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell 185, 3426–3440.e3419 (2022).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Wu, D. et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179, 736–749.e715 (2019).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Logsdon, G. A. et al. Complex genetic variation in nearly complete human genomes. Preprint at bioRxiv https://doi.org/10.1101/2024.09.24.614721 (2024).
Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Wang, C., Zollner, S. & Rosenberg, N. A. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genet. 8, e1002886 (2012).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Lipson, M. et al. Ancient genomes document multiple waves of migration in Southeast Asian prehistory. Science 361, 92–95 (2018).
McColl, H. et al. The prehistoric peopling of Southeast Asia. Science 361, 88–92 (2018).
Liu, D. et al. Extensive ethnolinguistic diversity in Vietnam reflects multiple sources of genetic diversity. Mol. Biol. Evol. 37, 2503–2519 (2020).
Lawson, D. J., van Dorp, L. & Falush, D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat. Commun. 9, 3258 (2018).
Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
Schiffels, S. & Wang, K. MSMC and MSMC2: the multiple sequentially Markovian coalescent. Methods Mol. Biol. 2090, 147–166 (2020).
Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303–309 (2017).
Grossman, S. R. et al. A composite of multiple signals distinguishes causal variants in regions of positive selection. Science 327, 883–886 (2010).
Luo, H. et al. Recent positive selection signatures reveal phenotypic evolution in the Han Chinese population. Sci. Bull. 68, 2391–2404 (2023).
Zheng, W. et al. Large-scale genome sequencing redefines the genetic footprints of high-altitude adaptation in Tibetans. Genome Biol. 24, 73 (2023).
Liu, X. et al. Decoding triancestral origins, archaic introgression, and natural selection in the Japanese population by whole-genome sequencing. Sci. Adv. 10, eadi8419 (2024).
Lo, Y. H. et al. Detecting genetic ancestry and adaptation in the Taiwanese Han people. Mol. Biol. Evol. 38, 4149–4165 (2021).
Chen, L., Wolf, A. B., Fu, W., Li, L. & Akey, J. M. Identifying and interpreting apparent Neanderthal ancestry in African individuals. Cell 180, 677–687.e616 (2020).
Meyer, M. et al. A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222–226 (2012).
Springelkamp, H. et al. ARHGEF12 influences the risk of glaucoma by increasing intraocular pressure. Hum. Mol. Genet. 24, 2689–2699 (2015).
Kichaev, G. et al. Leveraging polygenic functional enrichment to improve GWAS power. Am. J. Hum. Genet. 104, 65–75 (2019).
Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020).
Efremov, G. D. et al. Hb Icaria–Hb H disease: identification of the Hb Icaria mutation through analysis of amplified DNA. Br. J. Haematol. 75, 250–253 (1990).
Vlok, M. et al. Forager and farmer evolutionary adaptations to malaria evidenced by 7000 years of thalassemia in Southeast Asia. Sci. Rep. 11, 5677 (2021).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Larena, M. et al. Multiple migrations to the Philippines during the last 50,000 years. Proc. Natl Acad. Sci. USA 118, e2026132118 (2021).
Karmin, M. et al. Episodes of diversification and isolation in island Southeast Asian and near Oceanian male lineages. Mol. Biol. Evol. 39, msac045 (2022).
Fan, S., Hansen, M. E., Lo, Y. & Tishkoff, S. A. Going global by adapting local: a review of recent human adaptation. Science 354, 54–59 (2016).
Sankararaman, S., Mallick, S., Patterson, N. & Reich, D. The combined landscape of Denisovan and Neanderthal ancestry in present-day humans. Curr. Biol. 26, 1241–1247 (2016).
Barnes, R. H., Gray, A. & Kingsbury, B. Indigenous Peoples of Asia (Association for Asian Studies, 1995).
Taylor, P. M. in World Bank Inspection Panel. Investigation Report (March 30, 2006): Cambodia: Forest Concession Management and Control Pilot Project 128–141 (World Bank, 2006).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Cingolani, P. et al. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front. Genet. 3, 35 (2012).
Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Mondal, M. et al. Genomic analysis of Andamanese provides insights into ancient human migration into Asia and adaptation. Nat. Genet. 48, 1066–1070 (2016).
Lu, D. et al. Ancestral origins and genetic history of Tibetan highlanders. Am. J. Hum. Genet. 99, 580–594 (2016).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Wang, C. et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9, 13 (2010).
Pickrell, J. K. & Pritchard, J. K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8, e1002967 (2012).
Bhatia, G., Patterson, N., Sankararaman, S. & Price, A. L. Estimating and interpreting FST: the impact of rare variants. Genome Res. 23, 1514–1521 (2013).
Tamura, K., Stecher, G. & Kumar, S. MEGA11: Molecular Evolutionary Genetics Analysis version 11. Mol. Biol. Evol. 38, 3022–3027 (2021).
Yu, G. Using ggtree to visualize data on tree-like structures. Curr. Protoc. Bioinformatics 69, e96 (2020).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8, 289–317 (2016).
Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786–1788 (2019).
Wang, J., Raskin, L., Samuels, D. C., Shyr, Y. & Guo, Y. Genome measures used for quality control are dependent on gene function and ancestry. Bioinformatics 31, 318–323 (2015).
Weissensteiner, H. et al. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res. 44, W58–W63 (2016).
van Oven, M. & Kayser, M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 30, E386–E394 (2009).
Chen, H., Lu, Y., Lu, D. & Xu, S. Y-LineageTracker: a high-throughput analysis framework for Y-chromosomal next-generation sequencing data. BMC Bioinformatics 22, 114 (2021).
Y Chromosome Consortium. A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res. 12, 339–348 (2002).
Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).
Sellinger, T. P. P., Abu-Awad, D. & Tellier, A. Limits and convergence properties of the sequentially Markovian coalescent. Mol. Ecol. Resour. 21, 2231–2248 (2021).
Patton, A. H. et al. Contemporary demographic reconstruction methods are robust to genome assembly quality: a case study in Tasmanian devils. Mol. Biol. Evol. 36, 2906–2921 (2019).
Hu, W. et al. Genomic inference of a severe human bottleneck during the Early to Middle Pleistocene transition. Science 381, 979–984 (2023).
Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75–78 (2010).
Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
Voight, B. F., Kudaravalli, S., Wen, X. & Pritchard, J. K. A map of recent positive selection in the human genome. PLoS Biol. 4, e72 (2006).
Szpiech, Z. A. & Hernandez, R. D. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol. Biol. Evol. 31, 2824–2827 (2014).
Zerbino, D. R., Wilder, S. P., Johnson, N., Juettemann, T. & Flicek, P. R. The Ensembl regulatory build. Genome Biol. 16, 56 (2015).
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Romanoski, C. E., Glass, C. K., Stunnenberg, H. G., Wilson, L. & Almouzni, G. Epigenomics: roadmap for regulation. Nature 518, 314–316 (2015).
Adams, D. et al. BLUEPRINT to decode the epigenetic signature written in blood. Nat. Biotechnol. 30, 224–226 (2012).
Turchin, M. C. et al. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 44, 1015–1019 (2012).
Racimo, F., Berg, J. J. & Pickrell, J. K. Detecting polygenic adaptation in admixture graphs. Genetics 208, 1565–1584 (2018).
Berg, J. J. & Coop, G. A population genetic signal of polygenic adaptation. PLoS Genet. 10, e1004412 (2014).
Yengo, L. et al. A saturated map of common genetic variants associated with human height. Nature 610, 704–712 (2022).
Chen, M. et al. Evidence of polygenic adaptation in Sardinia at height-associated loci ascertained from the Biobank Japan. Am. J. Hum. Genet. 107, 60–71 (2020).
Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21, 263–265 (2005).
Hofmeister, R. J., Ribeiro, D. M., Rubinacci, S. & Delaneau, O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat. Genet. 55, 1243–1249 (2023).
Rozas, J. et al. DnaSP 6: DNA sequence polymorphism analysis of large data sets. Mol. Biol. Evol. 34, 3299–3302 (2017).
Excoffier, L. & Lischer, H. E. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10, 564–567 (2010).
Leigh, J. W. & Bryant, D. POPART: full-feature software for haplotype network construction. Methods Ecol. Evol. 6, 1110–1116 (2015).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Hu, J. et al. NextPolish2: A repeat-aware polishing tool for genomes assembled using HiFi long reads. Genomics Proteomics Bioinformatics 22, qzad009 (2024).
Chen, Y., Zhang, Y., Wang, A. Y., Gao, M. & Chong, Z. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 22, 312 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. 42, 1571–1580 (2024).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Wang, S. et al. De novo and somatic structural variant discovery with SVision-pro. Nat. Biotechnol. 43, 181–185 (2025).
Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).
Chen, S. et al. Paragraph: a graph-based structural variant genotyper for short-read sequence data. Genome Biol. 20, 291 (2019).
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).
Prüfer, K. et al. A high-coverage Neandertal genome from Vindija Cave in Croatia. Science 358, 655–658 (2017).
Delaneau, O., Zagury, J. F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).
Loh, P. R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448 (2016).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 6, 8111 (2015).
Acknowledgements
The authors thank all participants in the project, and S.-F. Wu for assistance in sample collection. This study was supported by National Natural Science Foundation of China (32288101 to B.S.; 32170632 to Y.H.; T2222030 and U23A20161 to X.Z.; 32170633 to Y.-C.L.), Major Scientific Project of Yunnan Province (202305AH340007 to B.S.), Yunnan Revitalization Talent Support Program Science and Technology Champion Project (202005AB160004 to B.S.), Yunnan Revitalization Talent Support Program Innovation Team (202405AS350008), Yunnan Scientist Workshops (to B.S.), Science and Technology General Program of Yunnan Province (202301AW070010 to Y.H.), High-level Talent Promotion and Training Projects of Kunming (2022SCP001 to M.-S.P., 2022SCP001 to Y.-P.Z. and 2020SCP001 to Q.-P.K.), Animal Branch of the Germplasm Bank of Wild Species, Chinese Academy of Sciences (the Large Research Infrastructure Funding), the National Key R&D Program of China (2022YFC3302004 to Y.-C.L.), Yunling Scholar of the Yunnan Province (Q.-P.K.) and the Yunnan Ten Thousand Talents Plan Young and Elite Talents Project (Y.-C.L.).
Author information
Contributions
B.S., Y.-P.Z. and Q.-P.K. conceived and designed the study. B.S. and Y.H. coordinated and supervised the project. T.S., L.B. and X.Z. collected samples from Cambodia. J.K., M.S., W.K. and J.W. collected samples from Thailand. H.Q.H., K.D.P., S.D. and M.-S.P. collected samples from Vietnam. S. Singthong, S. Sochampa, C.L., Z.G., L.-Q.Y. and Y.-C.L. collected samples from Laos. U.W.K. and M.-S.P. collected samples from Myanmar. X.Z. collected samples from China. A.I., W.P., C.M. and K.R. were responsible for the ethical approval work, personnel coordination and volunteer organization for sample collection. Y.H., X.Z., M.-S.P., Y.-C.L., Y.Z., K.L., Y. Ma and T.Y. prepared the samples and processed them for sequencing. Y.H. and K.L. contributed to data processing, QC, variant analysis and genome assembly. Y.H., Y. Ma, W. Zheng, Y. Liao, L.M., J.G., J.L., R.H., K.L., Y. Lu and Y.W. contributed to the population genetics analysis. Y.Z., T.Y. and Y.H. contributed to construction of the imputation reference panel. W. Zhang, X.C. and B.T. contributed to construction of the SEA3K Imputation Server. Y.G., Y.H., X.Y., K.Y., S.G., S.W., B.Z., Y. Mao and X.W. contributed to structural variants analysis. L.C., S.-A.L., Y.Z. and Y.H. contributed to archaic introgression analysis. Y.H. and X.Z. were responsible for organizing the CASEAC. Y.H., M.-S.P. and Y.-C.L. were responsible for ethical, legal and social implications. Y.H. and B.S. wrote the manuscript. Y.-P.Z., Q.-P.K., X.Z., M.-S.P., Y.-C.L., L.C. and Y. Mao edited the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks Stephen Acabado and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Statistics of small variant calls.
The sample-level counts of small variants (SNVs and indels) (panel a) and the heterozygosity ratios of small variants (panel b) across populations, stratified by super-population. A total of 29 MSEA populations in the SEA3K dataset (CMKL was excluded because its sample size was only two individuals) and 26 populations (belonging to five super-populations) from the 1KGP dataset were included in this analysis. Panel a shows a higher average number of variants per individual genome in MSEA populations than in Eurasian populations of 1KGP, and panel b shows that the heterozygosity ratio in MSEA populations is comparable to that in East Asian populations. The genome-wide heterozygosity ratio was calculated as the number of heterozygous SNVs divided by the number of non-reference homozygous SNVs. For each boxplot, we drew a box from the first quartile to the third quartile. A horizontal line across the box indicates the median. The whiskers go from each quartile to the minimum or the maximum. Abbreviations in 1KGP: EAS, East Asians; AMR, Americans; SAS, South Asians; EUR, Europeans; AFR, Africans.
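The heterozygosity ratio described in this caption can be illustrated with a minimal Python sketch. It assumes VCF-style genotype strings for a single sample; the function name and the toy genotypes are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def het_hom_ratio(genotypes):
    """Return (# heterozygous calls) / (# non-reference homozygous calls).

    `genotypes` is an iterable of VCF-style GT strings such as "0/1" or "1|1".
    Missing calls (".") and reference-homozygous calls are ignored.
    """
    counts = Counter()
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            counts["het"] += 1
        elif alleles[0] != "0":
            counts["hom_alt"] += 1
    if counts["hom_alt"] == 0:
        raise ValueError("no non-reference homozygous calls in this sample")
    return counts["het"] / counts["hom_alt"]


# Toy genotypes (illustrative only): 3 heterozygous, 2 non-reference homozygous.
toy_sample = ["0/1", "0|1", "1/1", "0/0", "./.", "1|0", "1|1"]
ratio = het_hom_ratio(toy_sample)
```

In practice the genotype strings would be read per sample from a VCF restricted to biallelic SNVs; that I/O step is omitted here.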
Extended Data Fig. 2 Schematic diagram of SV merging and SV novelty evaluation.
a, The SV calls from Sniffles2 and PAV of each individual were merged using Jasmine, and then we merged the 37 individual calls to obtain the population calls (120,063 SVs in total). b, The counts of individual SVs in the Sniffles2 calls, the PAV calls and the merged calls. The mean SV counts per individual at each step are labeled and denoted using dashed red lines. c, Repeat annotation of MSEA SVs. d, The HGSVC3 and HPRC SV callsets were used as the reference panel. Details of the evaluation procedure are provided in Methods.
Extended Data Fig. 3 Principal component analysis (PCA) of the SEA3K and worldwide population samples.
The plots were constructed using 4,587,143 biallelic SNVs among 5,938 global individuals, including 2,183 samples drawn from SEA3K, 2,504 samples from 1KGP, 828 samples from HGDP, 279 samples from SGDP, and other representative populations from published data: Malays (n = 96), Andamanese (n = 10) and Tibetans (n = 38) (see Methods). The upper panel shows the global pattern, and the bottom panel shows the regional pattern including only East Asians, South Asians and SEA3K. The SEA3K dataset was categorized by language family.
Extended Data Fig. 4 Genetic relatedness of MSEA populations with other global populations.
The Neighbor-Joining (NJ) trees and Maximum-Likelihood (ML) trees were constructed using the merged SNV data from SEA3K (marked in red font) and 1KGP (or HGDP). The NJ trees show the genetic relationships determined using genetic distances, and the unit of branch length for genetic distance is shown at the bottom of each tree. In the ML trees, the standard errors and 100 bootstrap replicates were used to evaluate the confidence in the inferred tree topology.
Extended Data Fig. 5 Pairwise FST between MSEA and East Asian populations.
Heatmaps showing the pairwise FST between East Asian populations from SGDP, and between MSEA populations in SEA3K. The color scales of the two heatmaps are the same. Genetic differences between MSEA populations are greater than those between East Asian populations distributed over a comparable geographic range. For example, FST between MYNA and CMKR (0.051) is significantly higher than the highest FST between East Asian groups (0.037, Yakut vs. She), even though the latter two populations are separated by three to four times the geographical distance between the two MSEA populations.
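For readers unfamiliar with pairwise FST, the following Python sketch shows one common way to compute it from allele frequencies. The caption does not state which estimator was used; this sketch assumes the Hudson estimator combined across SNVs as a ratio of averages, as described by Bhatia et al. in the reference list. The function name and input values are hypothetical.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson FST between two populations, combined as a ratio of averages.

    freqs1, freqs2: per-SNV allele frequencies in population 1 and 2.
    n1, n2: number of sampled chromosomes in each population (assumed
    constant across SNVs for simplicity).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        # Probability that two alleles drawn from different populations differ.
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("no SNV is polymorphic across the two populations")
    return numerator / denominator


# A fixed difference between the two populations gives FST = 1.
fst_fixed = hudson_fst([1.0], [0.0], n1=100, n2=100)
```

Summing numerators and denominators before dividing, rather than averaging per-SNV ratios, is the choice that keeps the estimate stable when many rare variants are included.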
Extended Data Fig. 6 Historical population dynamics of MSEA populations.
The plots show the estimated effective population size (Ne) of the ancestral populations of MSEA populations (stratified by country) using the SMC++ (dotted lines) and MSMC2 (solid lines) methods. The generation time and the mutation rate per generation per site (μ) were set to 29 years and 1.25 × 10⁻⁸, respectively, in both the MSMC2 and SMC++ analyses.
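The role of the generation time and mutation rate in such plots can be sketched in Python. The sketch assumes the usual MSMC2 output convention, in which time boundaries are scaled by the per-generation mutation rate and `lambda` is a scaled coalescence rate; the input values below are illustrative and are not results from this study.

```python
MU = 1.25e-8      # mutation rate per site per generation (from the caption)
GEN_YEARS = 29    # generation time in years (from the caption)


def scale_msmc2(left_time_boundary, coalescence_rate):
    """Convert one scaled MSMC2 time segment to years and Ne.

    left_time_boundary: mutation-scaled time at the start of the segment.
    coalescence_rate:   scaled coalescence rate (the `lambda` column).
    """
    generations = left_time_boundary / MU
    years = generations * GEN_YEARS
    ne = (1.0 / coalescence_rate) / (2.0 * MU)
    return years, ne


# Illustrative segment: scaled time 1e-4 and scaled coalescence rate 4000.
years, ne = scale_msmc2(1e-4, 4000.0)
```

With these illustrative inputs the segment starts about 232,000 years ago with Ne of about 10,000. Both axes scale directly with the assumed μ, which is why the caption reports it explicitly.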
Extended Data Fig. 7 GWAS annotation and simulation of PS-SNVs in MSEA populations.
a, Manhattan plot of the CMS scores of the genome-wide SNVs in MSEA populations. The reported GWAS hits are highlighted by red dots and labeled with GWAS information (variants, associated traits and mapped genes). There are six positive-selection regions harboring the GWAS hits of the top PS-SNVs (with the top 0.01% CMS scores), including two regions covering genes (GDF5, UQCC1 and ZRANB3) related to body height, one region covering genes (SLC24A5 and MYEF2) related to skin pigmentation, one region covering PNPT1 related to hip circumference, one region covering WDPCP related to diet measurement, and the MHC region related to diverse phenotypes. The dot size denotes the P-value of the CMS score. Statistical significance was assessed using a one-sided chi-squared (χ²) test. b, Permutation test of height-associated SNVs. Among the 5,505 PS-SNVs in the 44 regions, we found a significant excess of variants related to body height, with 21 height-associated SNVs falling in the PS-SNV set (P-value = 0.005 based on 1,000 permutations, one-sided) (see details in Methods).
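The logic of a one-sided permutation test for enrichment, as in panel b, can be sketched in Python. This is a generic illustration and not the procedure from Methods: it assumes random SNV sets of the same size are drawn uniformly from a background pool, whereas the study may have matched on additional properties. All names and toy numbers are hypothetical.

```python
import random


def permutation_pvalue(observed, background_ids, trait_ids, set_size,
                       n_perm=1000, seed=0):
    """One-sided empirical P-value for an excess of trait-associated SNVs.

    observed:       number of trait-associated SNVs in the real candidate set.
    background_ids: SNV identifiers that random sets are drawn from.
    trait_ids:      SNV identifiers associated with the trait.
    set_size:       size of the real candidate set.
    """
    rng = random.Random(seed)
    trait = set(trait_ids)
    background = list(background_ids)
    as_extreme = 0
    for _ in range(n_perm):
        sample = rng.sample(background, set_size)
        overlap = sum(1 for snv in sample if snv in trait)
        if overlap >= observed:
            as_extreme += 1
    # Add-one correction so the empirical P-value is never exactly zero.
    return (as_extreme + 1) / (n_perm + 1)


# Toy data: 1,000 background SNVs, 10 of them trait-associated, sets of 50.
p_toy = permutation_pvalue(observed=5, background_ids=range(1000),
                           trait_ids=range(10), set_size=50)
```

The add-one correction is a design choice of this sketch; a P-value reported as a plain fraction of permutations would omit it.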
Extended Data Fig. 8 Signature of natural selection in the FLG gene region.
a, The regional plot of CMS scores and recombination rates in the FLG region, in which the peaks indicate the selective signals. The peak SNVs are marked with colors. The bottom panel shows the LD blocks of the 262 PS-SNVs with MSEA-enriched alleles. The dot size denotes the P-value of the CMS score, and statistical significance was assessed using a chi-squared (χ²) test. The r² values indicate the estimated degree of linkage disequilibrium (LD) between the peak SNV and the other SNVs and are coded in colors. b, The TCS network of the FLG region showing an MSEA-specific haplogroup. Each node represents a haplotype, and the size is proportional to its frequency. The MSEA-specific haplogroup is highlighted.
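The r² measure of LD between a peak SNV and another SNV can be sketched in Python as follows. The sketch assumes phased haplotypes coded as 0 (one allele) and 1 (the other allele); the function name and toy haplotypes are hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between alleles at two biallelic SNVs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per phased
    haplotype, in the same haplotype order at both sites.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal in length")
    p_a = sum(hap_a) / n                      # frequency of allele 1 at site A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denom


# Toy haplotypes: identical allele patterns are in perfect LD.
r2_perfect = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
# Alleles that are statistically independent give r^2 = 0.
r2_independent = ld_r2([0, 0, 1, 1], [0, 1, 0, 1])
```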
Extended Data Fig. 9 Identification of archaic introgression in MSEA populations.
a, Comparison of the archaic-introgression sequences (from Neanderthal and Denisovan) between MSEA and global populations in 1KGP. EAS, East Asians (CDX and KHV were excluded because they belong to MSEA); AMR, Americans; SAS, South Asians; EUR, Europeans. The y-axis indicates the mean detected Neanderthal sequences (or Denisovan sequences) per individual from different populations, stratified by super-population. We compared MSEA with EAS and SAS, and evaluated significance using a two-sided unpaired t-test. For each boxplot, we drew a box from the first quartile to the third quartile. A horizontal line across the box indicates the median. The whiskers go from each quartile to the minimum or the maximum. b, Mean amounts of the detected introgressed sequences per individual in the 1KGP populations, categorized by affinity to the Altai Neanderthal and Altai Denisovan genomes. c, Intersection of the Neanderthal introgression callsets between Sprime and IBDmix. The callsets were merged for all identified introgression segments in MSEA individuals. d, Violin plots of the Neanderthal sequences per individual in MSEA populations identified by Sprime and IBDmix. For each boxplot, we drew a box from the first quartile to the third quartile. A horizontal line across the box indicates the median. The whiskers go from each quartile to the minimum or the maximum.
Extended Data Fig. 10 Hierarchical clustering of the haplotypes spanning the adaptive introgression region on Chr1.
The rows illustrate the individual haplotypes. A total of 2,116 individuals from 23 populations in SEA3K (MSEA populations) were included, and 2,504 individuals from 26 geographically diverse populations in 1KGP (including SAS, EAS, EUR and AMR populations) were used as the control groups. The Denisovan-derived PS-SNVs are denoted by red rhombuses. Gray and black represent the ancestral and the derived alleles, respectively. DEN, Denisovan. The introgressed Denisovan-like haplotypes are marked in the plot.
Extended Data Fig. 11 Detection of disease-associated variants and protein-truncating variants in MSEA populations.
a, Frequency distribution of the ClinVar pathogenic variants in SEA3K. The classifications of autosomal dominant (AD), autosomal recessive (AR) and unknown were based on the OMIM database. b, Number of pathogenic variants carried by each MSEA individual. c, Ten pathogenic variants specifically enriched in MSEA populations. The mapped gene, variants and risk alleles, and frequencies of risk alleles in SEA3K and other datasets are indicated. The clinical significance is indicated by exclamation marks (pathogenic level) and stars (number of times the variant was classified by previous submitters). d, Frequency distribution of an alpha-thalassemia variant in HBA2 (chromosome 16: 173598; c.427T>C) in world populations and MSEA populations. NA, not available. e, The proportion of genes with at least one high-confidence PTV (pie on the left), and the proportions of novel, known, heterozygous and homozygous PTVs (pie on the right) in the SEA3K dataset. f, The counts of the identified novel homozygous PTVs per individual across MSEA populations.
Supplementary information
Supplementary Information
Supplementary Figs 1–18 and a list of members of the Consortium of Anthropological Research in Southeast Asia and Southwest China (CASEAC).
Supplementary Tables
Supplementary Tables 1–20.
About this article
Cite this article
He, Y., Zhang, X., Peng, MS. et al. Genome diversity and signatures of natural selection in mainland Southeast Asia. Nature 643, 417–426 (2025). https://doi.org/10.1038/s41586-025-08998-w
DOI: https://doi.org/10.1038/s41586-025-08998-w