MetaKSSD: boosting the scalability of the reference taxonomic marker database and the performance of metagenomic profiling using sketch operations

A preprint version of the article is available at bioRxiv.

Abstract

The performance of metagenomic profiling is constrained by the diversity of taxa present in the reference taxonomic marker database (MarkerDB) used. However, continually updating MarkerDB to include newly determined taxa using existing approaches faces increasing difficulties and will soon become impractical. Here we introduce MetaKSSD, which redefines MarkerDB construction and metagenomic profiling using sketch operations, enhancing MarkerDB scalability and profiling performance. MetaKSSD encompasses 85,202 species in its MarkerDB using just 0.17 GB of storage and profiles 10 GB of data within seconds. Leveraging its comprehensive MarkerDB, MetaKSSD substantially improves profiling results. In a microbiome–phenotype association study, MetaKSSD identified more effective associations than MetaPhlAn4. We profiled 382,016 metagenomic runs using MetaKSSD, conducted extensive sample clustering analyses and suggested potential yet-to-be-discovered niches. MetaKSSD offers functionality for instantaneous searching of similar profiles. It enables the swift transmission of metagenome sketches over the network and real-time online metagenomic analysis, facilitating use by non-expert users.
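The abstract does not spell out the sketch operations themselves, so the following is only a toy illustration under stated assumptions. Sketches are treated as sets of hashed k-mers subsampled with a simple hash-modulo rule (this is not KSSD's k-mer substring space sampling), and species markers are assumed to come from set union within a species followed by subtraction of everything seen in other species. All function names and parameter values are hypothetical and are not MetaKSSD's API.

```python
from __future__ import annotations

import hashlib

K = 11          # k-mer length, illustrative value only
KEEP_1_IN = 4   # keep roughly one in four k-mers, illustrative value only


def sketch(sequence: str, k: int = K, keep_1_in: int = KEEP_1_IN) -> set[int]:
    """Return a subsampled set of k-mer hashes for a DNA sequence (toy scheme)."""
    kept = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Stable 64-bit hash so sketches are comparable across runs and machines.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % keep_1_in == 0:
            kept.add(h)
    return kept


def species_markers(species_sketches: dict[str, list[set[int]]]) -> dict[str, set[int]]:
    """Assumed marker construction: union within a species, minus all other species."""
    # Union of all genome sketches belonging to the same species.
    pan = {name: set().union(*sketches) for name, sketches in species_sketches.items()}
    markers = {}
    for name, own in pan.items():
        others = set().union(*(s for other, s in pan.items() if other != name))
        markers[name] = own - others  # hashes unique to this species
    return markers


def marker_hits(sample_sketch: set[int], markers: dict[str, set[int]]) -> dict[str, int]:
    """Count how many markers of each species occur in a sample sketch."""
    return {name: len(m & sample_sketch) for name, m in markers.items()}
```

Because every step is a plain set operation on small integer sets, adding a species in this toy model only requires sketching its genomes and repeating the union and subtraction, which is the kind of scalability argument the abstract makes.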

Fig. 1: MetaKSSD algorithm overview.
Fig. 2: Comparisons of MarkerDB scalability and computational efficiency.
Fig. 3: Performance comparison of MetaKSSD and other profilers.
Fig. 4: Comparison of MetaKSSD and MetaPhlAn4 for microbiome–phenotype association study.
Fig. 5: Large-scale metagenomic profiling.

Data availability

The GTDB database is available at https://data.gtdb.ecogenomic.org/releases/release214/214.1/. The MarkerDB of MetaKSSD (L3K11) is available via Zenodo at https://zenodo.org/records/11437234/files/markerdb.L3K11_gtdb_r214.tar.gz, and the abundance vector database of MetaKSSD (L3K11) is available via Zenodo at https://zenodo.org/records/11437234/files/markerdb.abvdb231227.L3K11_gtdb_r214.tar.gz (ref. 72). All 382,016 metagenome sketches (L3K11) are split into four batches (batches 1 to 4) because they exceed the size limit. Batches 1, 2, 3 and 4 are available via Zenodo at https://doi.org/10.5281/zenodo.10609030 (ref. 73), https://doi.org/10.5281/zenodo.10614425 (ref. 74), https://doi.org/10.5281/zenodo.10676887 (ref. 75) and https://doi.org/10.5281/zenodo.10614597 (ref. 76), respectively. The MetaKSSD profiles for all sketched runs are available via Zenodo at https://doi.org/10.5281/zenodo.11345411 (ref. 77). The four CAMI2 benchmark datasets are available as follows: the ‘Mouse_gut’ dataset at https://frl.publisso.de/data/frl:6421672/dataset/ and the ‘Rhizosphere’, ‘Marine’ and ‘Strain_madness’ datasets at https://frl.publisso.de/data/frl:6425521/. Previous CAMI2 results are available via GitHub at https://cami-challenge.github.io/OPAL/cami_ii_mg/ and https://github.com/CAMI-challenge/second_challenge_evaluation/tree/master/profiling. The OPAL results on the five datasets for all profilers benchmarked in this study are available via GitHub at https://yhg926.github.io/KSSD2/OPAL/. The stool microbiome WGS data from the 368 Chinese individuals of the BGInature2012 cohort are available under NCBI accession SRA045646. The related metadata are available in the supplementary tables of ref. 41. Source data are provided with this paper.
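As a convenience, the MarkerDB archive listed above could be fetched and inspected with the Python standard library alone. This is a minimal sketch, assuming the Zenodo file URL resolves to a gzip-compressed tar archive as its name suggests; the helper names are hypothetical, and nothing here is part of the MetaKSSD software.

```python
from __future__ import annotations

import tarfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# URL copied from the Data availability statement.
MARKERDB_URL = (
    "https://zenodo.org/records/11437234/files/markerdb.L3K11_gtdb_r214.tar.gz"
)


def local_name(url: str) -> str:
    """Derive the local file name from the last path component of a URL."""
    return Path(urlparse(url).path).name


def download(url: str, dest_dir: str = ".") -> Path:
    """Download a file unless a copy with the same name already exists locally."""
    dest = Path(dest_dir) / local_name(url)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def list_archive(path: str | Path) -> list[str]:
    """List member names of a .tar.gz archive without extracting it."""
    with tarfile.open(str(path), "r:gz") as tar:
        return tar.getnames()
```

Calling `list_archive(download(MARKERDB_URL))` would download the archive into the current directory and return its member names. Listing before extracting is a deliberate choice here, since it lets the contents be checked first.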

Code availability

MetaKSSD Standalone (Linux) is available via Zenodo at https://doi.org/10.5281/zenodo.15613720 (ref. 78) or via GitHub at https://github.com/yhg926/MetaKSSD. MetaKSSD Clients (Mac OS, see tutorial video) are available via Zenodo at https://zenodo.org/records/11437234/files/MetaKSSD_Mac.zip (ref. 72). MetaKSSD Clients (Windows OS, see tutorial video) are available via Zenodo at https://zenodo.org/records/11437234/files/MetaKSSD_Windows.exe (ref. 72). The code is licensed under the Apache License, version 2.0.

References

  1. Gacesa, R. et al. Environmental factors shaping the gut microbiome in a Dutch population. Nature 604, 732–739 (2022).

  2. Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol. https://doi.org/10.1038/nrmicro.2016.83 (2016).

  3. Kurilshikov, A. et al. Large-scale association analyses identify host factors influencing human gut microbiome composition. Nat. Genet. 53, 156–165 (2021).

  4. Gilbert, J. A. et al. Microbiome-wide association studies link dynamic microbial consortia to disease. Nature https://doi.org/10.1038/nature18850 (2016).

  5. Kishikawa, T. et al. Metagenome-wide association study of gut microbiome revealed novel aetiology of rheumatoid arthritis in the Japanese population. Ann. Rheum. Dis. 79, 103–111 (2020).

  6. Manghi, P. et al. MetaPhlAn 4 profiling of unknown species-level genome bins improves the characterization of diet-associated microbiome changes in mice. Cell Rep. 42, 112464 (2023).

  7. Zhu, J. et al. Statistical modeling of gut microbiota for personalized health status monitoring. Microbiome 11, 184 (2023).

  8. Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).

  9. Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 41, 1633–1644 (2023).

  10. Ghosh, T. S., Shanahan, F. & O’Toole, P. W. The gut microbiome as a modulator of healthy ageing. Nat. Rev. Gastroenterol. Hepatol. https://doi.org/10.1038/s41575-022-00605-x (2022).

  11. Faust, K. et al. Microbial co-occurrence relationships in the Human Microbiome. PLoS Comput. Biol. 8, e1002606 (2012).

  12. Ma, B. et al. Earth microbial co-occurrence network reveals interconnection pattern across microbiomes. Microbiome 8, 82 (2020).

  13. Chen, L. et al. Gut microbial co-abundance networks show specificity in inflammatory bowel disease and obesity. Nat. Commun. 11, 4018 (2020).

  14. Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell https://doi.org/10.1016/j.cell.2019.07.010 (2019).

  15. Sun, Z. et al. Challenges in benchmarking metagenomic profilers. Nat. Methods 18, 618–626 (2021).

  16. Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with biobakery 3. eLife 10, e65088 (2021).

  17. Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).

  18. Zeng, S. et al. A compendium of 32,277 metagenome-assembled genomes and over 80 million genes from the early-life human gut microbiome. Nat. Commun. 13, 5139 (2022).

  19. Sánchez-Navarro, R. et al. Long-read metagenome-assembled genomes improve identification of novel complete biosynthetic gene clusters in a complex microbial activated sludge ecosystem. mSystems 7, e0063222 (2022).

  20. Hauptfeld, E. et al. Integrating taxonomic signals from MAGs and contigs improves read annotation and taxonomic profiling of metagenomes. Nat. Commun. 15, 3373 (2024).

  21. Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database https://doi.org/10.1093/database/baaa062 (2020).

  22. Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).

  23. Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).

  24. Rinke, C. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 6, 946–959 (2021).

  25. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).

  26. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).

  27. Parks, D. H. et al. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).

  28. RefSeq growth statistics. National Center for Biotechnology Information https://www.ncbi.nlm.nih.gov/refseq/statistics/ (2024).

  29. Ruscheweyh, H. J. et al. Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments. Microbiome 10, 212 (2022).

  30. Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).

  31. Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).

  32. Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).

  33. Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).

  34. Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46, W537–W544 (2018).

  35. One Codex. One Codex https://www.onecodex.com/platform/ (2025).

  36. Yi, H., Lin, Y., Lin, C. & Jin, W. KSSD: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis. Genome Biol. 22, 84 (2021).

  37. Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial ‘pan-genome’. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005).

  38. Meyer, F. et al. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. 20, 51 (2019).

  39. Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).

  40. Piro, V. C., Lindner, M. S. & Renard, B. Y. DUDes: a top-down taxonomic profiler for metagenomics. Bioinformatics 32, 2272–2280 (2016).

  41. Wang, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).

  42. Zhernakova, A. et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352, 565–569 (2016).

  43. Huang, R. Y. et al. Metagenome-wide association study of the alterations in the intestinal microbiome composition of ankylosing spondylitis patients and the effect of traditional and herbal treatment. J. Med. Microbiol. 69, 797–805 (2020).

  44. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).

  45. Al-Jameel, S. S. Association of diabetes and microbiota: an update. Saudi J. Biol. Sci. https://doi.org/10.1016/j.sjbs.2021.04.041 (2021).

  46. Qiao, S. et al. Gut Parabacteroides merdae protects against cardiovascular damage by enhancing branched-chain amino acid catabolism. Nat. Metab. 4, 1271–1286 (2022).

  47. Bahram, M. et al. Metagenomic assessment of the global diversity and distribution of bacteria and fungi. Environ. Microbiol. 23, 316–326 (2021).

  48. Whitman, W. B., Coleman, D. C. & Wiebe, W. J. Prokaryotes: the unseen majority. Proc. Natl Acad. Sci. USA 95, 6578–6583 (1998).

  49. Mise, K. & Iwasaki, W. Environmental atlas of prokaryotes enables powerful and intuitive habitat-based analysis of community structures. iScience 23, 101624 (2020).

  50. Schnorr, S. L. et al. Gut microbiome of the Hadza hunter-gatherers. Nat. Commun. 5, 3654 (2014).

  51. Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2018).

  52. Pavlopoulos, G. A. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594–602 (2023).

  53. Shaw, J. & Yu, Y. W. Rapid species-level metagenome profiling and containment estimation with sylph. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02412-y (2024).

  54. Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods https://doi.org/10.1038/s41592-024-02305-7 (2024).

  55. Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure & genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).

  56. Costea, P. I. et al. Subspecies in the global human gut microbiome. Mol. Syst. Biol. 13, 960 (2017).

  57. Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).

  58. Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61–66 (2017).

  59. Marx, V. Method of the year: long-read sequencing. Nat. Methods https://doi.org/10.1038/s41592-022-01730-w (2023).

  60. Yi, H. & Jin, L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 41, e75 (2013).

  61. Fritz, A. et al. CAMISIM: simulating metagenomes and microbial communities. Microbiome 7, 17 (2019).

  62. Zhou, B. F. Predictive values of body mass index and waist circumference for risk factors of certain related diseases in Chinese adults—study on optimal cut-off points of body mass index and waist circumference in Chinese adults. Biomed. Environ. Sci. 15, 83–95 (2002).

  63. Pan, X. F., Wang, L. & Pan, A. Epidemiology and determinants of obesity in China. Lancet Diabet. Endocrinol. https://doi.org/10.1016/S2213-8587(21)00045-0 (2021).

  64. Wilkinson, G. N. & Rogers, C. E. Symbolic description of factorial models for analysis of variance. J. Appl. Stat. 22, 392–399 (1973).

  65. Becker, R. A., Chambers, J. M. & Wilks, A. R. The new S language. Biometrics 45, 2 (1989).

  66. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).

  67. Kendall, M. G. A new measure of rank correlation. Biometrika 30, 81–93 (1938).

  68. Van Der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

  69. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  70. Wilkinson, L. ggplot2: elegant graphics for data analysis by WICKHAM, H. Biometrics 67, 678–679 (2011).

  71. Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964).

  72. Yi, H. MetaKSSD database resource and clients. Zenodo https://doi.org/10.5281/zenodo.11437234 (2024).

  73. Yi, H. Metagenome data sketch file batch 1. Zenodo https://doi.org/10.5281/zenodo.10609030 (2024).

  74. Yi, H. Metagenomic data sketch file batch 2. Zenodo https://doi.org/10.5281/zenodo.10614425 (2024).

  75. Yi, H. Metagenomic data sketch file batch 4. Zenodo https://doi.org/10.5281/zenodo.10676887 (2024).

  76. Yi, H. Metagenomic data sketch file batch 3. Zenodo https://doi.org/10.5281/zenodo.10614597 (2024).

  77. Yi, H. All metagenomic profile data by MetaKSSD. Zenodo https://doi.org/10.5281/zenodo.11345411 (2024).

  78. Yi, H. MetaKSSD-2.22. Zenodo https://doi.org/10.5281/zenodo.15613720 (2025).

Acknowledgements

We thank J. Ruan from AGIS for the suggestion to name this software ‘MetaKSSD’. This work was supported by 2023 Start-up Funds for Talent at Basic Research Institutions to H.Y.’s team (grant no. JCKY2023-30); 2023 Basic Research Institution Task 4 to H.Y. (grant no. JCKY-ZDKY202304-10); Outbound Postdoctoral Research Funding in Shenzhen (grant no. SZS21001); Outbound Postdoctoral Research Funding in Dapeng New District (grant no. SDP21029); and Provincial Laboratory Special Start-up Funds to H.Y.’s team (grant no. SSZXQD006).

Author information

Contributions

H.Y. invented the method, developed the software, performed the analyses, wrote the paper and supervised this study. X.L. and Q.C. tested the software and assisted with the analyses.

Corresponding author

Correspondence to Huiguang Yi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2, Tables 1 and 2, Figs. 1–11 and command procedures.

Reporting Summary

Peer Review File

Supplementary Data 1

Source genomes of ‘New_released’ dataset.

Supplementary Data 2

Associations identified by MetaKSSD and MetaPhlAn4 in the BGInature2012 cohort and their supporting literature.

Supplementary Data 3

Metadata of all sketched metagenomic runs.

Supplementary Data 4

Commonly studied metagenomic environments.

Supplementary Data 5

Environment-specific species.

Supplementary Data 6

Metadata of the runs used for t-SNE analysis.

Supplementary Data 7

Lifestyle-related runs from MetaPhlAn4 paper.

Supplementary Data 8

Runs used for abundance vector clustering test.

Supplementary Data 9

GTDBr214 species to NCBI species mapping scheme.

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yi, H., Lu, X. & Chang, Q. MetaKSSD: boosting the scalability of the reference taxonomic marker database and the performance of metagenomic profiling using sketch operations. Nat Comput Sci 5, 884–897 (2025). https://doi.org/10.1038/s43588-025-00855-0

  • DOI: https://doi.org/10.1038/s43588-025-00855-0
