Abstract
The performance of metagenomic profiling is constrained by the diversity of taxa present in the reference taxonomic marker database (MarkerDB) used. However, continually updating MarkerDB to include newly determined taxa using existing approaches faces increasing difficulties and will soon become impractical. Here we introduce MetaKSSD, which redefines MarkerDB construction and metagenomic profiling using sketch operations, enhancing MarkerDB scalability and profiling performance. MetaKSSD encompasses 85,202 species in its MarkerDB using just 0.17 GB of storage and profiles 10 GB of data within seconds. Leveraging its comprehensive MarkerDB, MetaKSSD substantially improves profiling results. In a microbiome–phenotype association study, MetaKSSD identified more effective associations than MetaPhlAn4. We profiled 382,016 metagenomic runs using MetaKSSD, conducted extensive sample clustering analyses and suggested potential yet-to-be-discovered niches. MetaKSSD offers functionality for instantaneous searching of similar profiles. It enables the swift transmission of metagenome sketches over the network and real-time online metagenomic analysis, facilitating use by non-expert users.
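The abstract does not spell out what a "sketch operation" is. As a rough illustration only, and not MetaKSSD's actual algorithm, the Python sketch below treats a sketch as a subsampled set of hashed k-mers and uses plain set union, subtraction and intersection to derive species-specific markers and to score a sample against them. The k-mer length, the hash-based subsampling rule and the function names `sketch`, `species_markers` and `profile` are all assumptions made for this example.

```python
import hashlib


def sketch(seq: str, k: int = 4, keep_one_in: int = 2) -> set[int]:
    """Toy sketch (assumed scheme): hash every k-mer, keep hashes divisible by keep_one_in."""
    kept = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % keep_one_in == 0:
            kept.add(h)
    return kept


def species_markers(sketches: dict[str, set[int]]) -> dict[str, set[int]]:
    """Keep, per species, the sketch elements absent from all other species."""
    markers = {}
    for name, sk in sketches.items():
        # Union of every other species' sketch, then set subtraction.
        others = set().union(*(s for n, s in sketches.items() if n != name))
        markers[name] = sk - others
    return markers


def profile(sample: set[int], markers: dict[str, set[int]]) -> dict[str, float]:
    """Fraction of each species' markers present in the sample (set intersection)."""
    return {
        name: (len(sample & m) / len(m) if m else 0.0)
        for name, m in markers.items()
    }


# Hypothetical toy genomes; keep_one_in=1 keeps every k-mer so the result is easy to follow.
refs = {
    "A": sketch("ACGTACGT", k=4, keep_one_in=1),
    "B": sketch("TTTTACGT", k=4, keep_one_in=1),
}
marker_db = species_markers(refs)
result = profile(refs["A"], marker_db)
print(result)  # {'A': 1.0, 'B': 0.0}
```

Because every step is a set operation on small integer sets, adding a new species in this toy model only requires sketching it and re-running the subtraction, which is the kind of scalability argument the abstract makes for a sketch-based MarkerDB.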
Data availability
The GTDB database is available at https://data.gtdb.ecogenomic.org/releases/release214/214.1/. The MarkerDB of MetaKSSD (L3K11) is available via Zenodo at https://zenodo.org/records/11437234/files/markerdb.L3K11_gtdb_r214.tar.gz, and the abundance vector database of MetaKSSD (L3K11) is available via Zenodo at https://zenodo.org/records/11437234/files/markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz (ref. 72). All 382,016 metagenome sketches (L3K11) are split into four batches (batches 1 to 4) because together they exceed the size limit. Batches 1, 2, 3 and 4 are available via Zenodo at https://doi.org/10.5281/zenodo.10609030 (ref. 73), https://doi.org/10.5281/zenodo.10614425 (ref. 74), https://doi.org/10.5281/zenodo.10676887 (ref. 75) and https://doi.org/10.5281/zenodo.10614597 (ref. 76), respectively. The MetaKSSD profiles for all sketched runs are available via Zenodo at https://doi.org/10.5281/zenodo.11345411 (ref. 77). The four CAMI2 benchmark datasets are available as follows: the ‘Mouse_gut’ dataset at https://frl.publisso.de/data/frl:6421672/dataset/ and the ‘Rhizosphere’, ‘Marine’ and ‘Strain_madness’ datasets at https://frl.publisso.de/data/frl:6425521/. Previous CAMI2 results are available via GitHub at https://cami-challenge.github.io/OPAL/cami_ii_mg/ and https://github.com/CAMI-challenge/second_challenge_evaluation/tree/master/profiling. The OPAL results on the five datasets for all profilers benchmarked in this study are available via GitHub at https://yhg926.github.io/KSSD2/OPAL/. The stool microbiome WGS data from the 368 Chinese individuals of the BGInature2012 cohort are available under NCBI accession SRA045646. The related metadata are available in the supplementary tables of ref. 41. Source data are provided with this paper.
Code availability
MetaKSSD Standalone (Linux) is available via Zenodo at https://doi.org/10.5281/zenodo.15613720 (ref. 78) or via GitHub at https://github.com/yhg926/MetaKSSD. The MetaKSSD client for macOS (see tutorial video) is available via Zenodo at https://zenodo.org/records/11437234/files/MetaKSSD_Mac.zip (ref. 72). The MetaKSSD client for Windows (see tutorial video) is available via Zenodo at https://zenodo.org/records/11437234/files/MetaKSSD_Windows.exe (ref. 72). The code is licensed under the Apache License, version 2.0.
References
Gacesa, R. et al. Environmental factors shaping the gut microbiome in a Dutch population. Nature 604, 732–739 (2022).
Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol. https://doi.org/10.1038/nrmicro.2016.83 (2016).
Kurilshikov, A. et al. Large-scale association analyses identify host factors influencing human gut microbiome composition. Nat. Genet. 53, 156–165 (2021).
Gilbert, J. A. et al. Microbiome-wide association studies link dynamic microbial consortia to disease. Nature https://doi.org/10.1038/nature18850 (2016).
Kishikawa, T. et al. Metagenome-wide association study of gut microbiome revealed novel aetiology of rheumatoid arthritis in the Japanese population. Ann. Rheum. Dis. 79, 103–111 (2020).
Manghi, P. et al. MetaPhlAn 4 profiling of unknown species-level genome bins improves the characterization of diet-associated microbiome changes in mice. Cell Rep. 42, 112464 (2023).
Zhu, J. et al. Statistical modeling of gut microbiota for personalized health status monitoring. Microbiome 11, 184 (2023).
Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).
Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 41, 1633–1644 (2023).
Ghosh, T. S., Shanahan, F. & O’Toole, P. W. The gut microbiome as a modulator of healthy ageing. Nat. Rev. Gastroenterol. Hepatol. https://doi.org/10.1038/s41575-022-00605-x (2022).
Faust, K. et al. Microbial co-occurrence relationships in the Human Microbiome. PLoS Comput. Biol. 8, e1002606 (2012).
Ma, B. et al. Earth microbial co-occurrence network reveals interconnection pattern across microbiomes. Microbiome 8, 82 (2020).
Chen, L. et al. Gut microbial co-abundance networks show specificity in inflammatory bowel disease and obesity. Nat. Commun. 11, 4018 (2020).
Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell https://doi.org/10.1016/j.cell.2019.07.010 (2019).
Sun, Z. et al. Challenges in benchmarking metagenomic profilers. Nat. Methods 18, 618–626 (2021).
Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 10, e65088 (2021).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
Zeng, S. et al. A compendium of 32,277 metagenome-assembled genomes and over 80 million genes from the early-life human gut microbiome. Nat. Commun. 13, 5139 (2022).
Sánchez-Navarro, R. et al. Long-read metagenome-assembled genomes improve identification of novel complete biosynthetic gene clusters in a complex microbial activated sludge ecosystem. mSystems 7, e0063222 (2022).
Hauptfeld, E. et al. Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes. Nat. Commun. 15, 3373 (2024).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database https://doi.org/10.1093/database/baaa062 (2020).
Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
Rinke, C. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 6, 946–959 (2021).
Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Parks, D. H. et al. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
RefSeq growth statistics. National Center for Biotechnology Information https://www.ncbi.nlm.nih.gov/refseq/statistics/ (2024).
Ruscheweyh, H. J. et al. Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments. Microbiome 10, 212 (2022).
Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).
Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).
One Codex. One Codex https://www.onecodex.com/platform/ (2025).
Yi, H., Lin, Y., Lin, C. & Jin, W. KSSD: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis. Genome Biol. 22, 84 (2021).
Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial ‘pan-genome’. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005).
Meyer, F. et al. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. 20, 51 (2019).
Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).
Piro, V. C., Lindner, M. S. & Renard, B. Y. DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics 32, 2272–2280 (2016).
Wang, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Zhernakova, A. et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352, 565–569 (2016).
Huang, R. Y. et al. Metagenome-wide association study of the alterations in the intestinal microbiome composition of ankylosing spondylitis patients and the effect of traditional and herbal treatment. J. Med. Microbiol. 69, 797–805 (2020).
Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
Al-Jameel, S. S. Association of diabetes and microbiota: an update. Saudi J. Biol. Sci. https://doi.org/10.1016/j.sjbs.2021.04.041 (2021).
Qiao, S. et al. Gut Parabacteroides merdae protects against cardiovascular damage by enhancing branched-chain amino acid catabolism. Nat. Metab. 4, 1271–1286 (2022).
Bahram, M. et al. Metagenomic assessment of the global diversity and distribution of bacteria and fungi. Environ. Microbiol. 23, 316–326 (2021).
Whitman, W. B., Coleman, D. C. & Wiebe, W. J. Prokaryotes: the unseen majority. Proc. Natl Acad. Sci. USA 95, 6578–6583 (1998).
Mise, K. & Iwasaki, W. Environmental atlas of prokaryotes enables powerful and intuitive habitat-based analysis of community structures. iScience 23, 101624 (2020).
Schnorr, S. L. et al. Gut microbiome of the Hadza hunter-gatherers. Nat. Commun. 5, 3654 (2014).
Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2018).
Pavlopoulos, G. A. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594–602 (2023).
Shaw, J. & Yu, Y. W. Rapid species-level metagenome profiling and containment estimation with sylph. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02412-y (2024).
Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods https://doi.org/10.1038/s41592-024-02305-7 (2024).
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure & genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
Costea, P. I. et al. Subspecies in the global human gut microbiome. Mol. Syst. Biol. 13, 960 (2017).
Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61–66 (2017).
Marx, V. Method of the year: long-read sequencing. Nat. Methods https://doi.org/10.1038/s41592-022-01730-w (2023).
Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 41, e75 (2013).
Fritz, A. et al. CAMISIM: simulating metagenomes and microbial communities. Microbiome 7, 17 (2019).
Zhou, B. F. Predictive values of body mass index and waist circumference for risk factors of certain related diseases in Chinese adults—study on optimal cut-off points of body mass index and waist circumference in Chinese adults. Biomed. Environ. Sci. 15, 83–95 (2002).
Pan, X. F., Wang, L. & Pan, A. Epidemiology and determinants of obesity in China. Lancet Diabet. Endocrinol. https://doi.org/10.1016/S2213-8587(21)00045-0 (2021).
Wilkinson, G. N. & Rogers, C. E. Symbolic description of factorial models for analysis of variance. J. Appl. Stat. 22, 392–399 (1973).
Becker, R. A., Chambers, J. M. & Wilks, A. R. The new S language. Biometrics 45, 2 (1989).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
Kendall, M. G. A new measure of rank correlation. Biometrika 30, 81–93 (1938).
Van Der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Wilkinson, L. ggplot2: elegant graphics for data analysis by WICKHAM, H. Biometrics 67, 678–679 (2011).
Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964).
Yi, H. MetaKSSD database resource and clients. Zenodo https://doi.org/10.5281/zenodo.11437234 (2024).
Yi, H. Metagenome data sketch file batch 1. Zenodo https://doi.org/10.5281/zenodo.10609030 (2024).
Yi, H. Metagenomic data sketch file batch 2. Zenodo https://doi.org/10.5281/zenodo.10614425 (2024).
Yi, H. Metagenomic data sketch file batch 4. Zenodo https://doi.org/10.5281/zenodo.10676887 (2024).
Yi, H. Metagenomic data sketch file batch 3. Zenodo https://doi.org/10.5281/zenodo.10614597 (2024).
Yi, H. All metagenomic profile data by MetaKSSD. Zenodo https://doi.org/10.5281/zenodo.11345411 (2024).
Yi, H. MetaKSSD-2.22. Zenodo https://doi.org/10.5281/zenodo.15613720 (2025).
Acknowledgements
We thank J. Ruan from AGIS for the suggestion to name this software ‘MetaKSSD’. This work was supported by 2023 Start-up Funds for Talent at Basic Research Institutions to H.Y.’s team (grant no. JCKY2023-30); 2023 Basic Research Institution Task 4 to H.Y. (grant no. JCKY-ZDKY202304-10); Outbound Postdoctoral Research Funding in Shenzhen (grant no. SZS21001); Outbound Postdoctoral Research Funding in Dapeng New District (grant no. SDP21029); and Provincial Laboratory Special Start-up Funds to H.Y.’s team (grant no. SSZXQD006).
Author information
Contributions
H.Y. invented the method, developed the software, performed the analyses, wrote the paper and supervised this study. X.L. and Q.C. tested the software and assisted with the analyses.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Notes 1 and 2, Tables 1 and 2, Figs. 1–11 and command procedures.
Supplementary Data 1
Source genomes of ‘New_released’ dataset.
Supplementary Data 2
Associations identified by MetaKSSD and MetaPhlAn4 in the BGInature2012 cohort and their supporting literature.
Supplementary Data 3
Metadata of all sketched metagenomic runs.
Supplementary Data 4
Commonly studied metagenomic environments.
Supplementary Data 5
Environment-specific species.
Supplementary Data 6
Metadata of the runs used for t-SNE analysis.
Supplementary Data 7
Lifestyle-related runs from MetaPhlAn4 paper.
Supplementary Data 8
Runs used for abundance vector clustering test.
Supplementary Data 9
GTDBr214 species to NCBI species mapping scheme.
Source data
Source Data Fig. 2
Statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yi, H., Lu, X. & Chang, Q. MetaKSSD: boosting the scalability of the reference taxonomic marker database and the performance of metagenomic profiling using sketch operations. Nat Comput Sci 5, 884–897 (2025). https://doi.org/10.1038/s43588-025-00855-0
DOI: https://doi.org/10.1038/s43588-025-00855-0


