
Efficient and robust search of microbial genomes via phylogenetic compression

Abstract

Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.


Fig. 1: Overview of phylogenetic compression and its applications to different data types.
Fig. 2: Results of phylogenetic compression.

Data availability

The Zenodo depositions for the five phylogenetically compressed test collections are provided in the following table.

| Dataset | Compressed form | Zenodo accession/URL |
|---|---|---|
| GISP | Assemblies (XZ) | https://doi.org/10.5281/zenodo.10070404 |
| SC2 | Assemblies (XZ) | Available upon request (GISAID license). |
| NCTC3k | Assemblies (XZ) | https://doi.org/10.5281/zenodo.5533354 |
| BIGSIdata | De Bruijn graphs (simplitigs after k-mer propagation; XZ) | https://doi.org/10.5281/zenodo.5555253 |
| 661k | Assemblies (XZ) | https://doi.org/10.5281/zenodo.4602622 |
| 661k | Assemblies (MBGC) | https://doi.org/10.5281/zenodo.6347064 |
| 661k | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.7313926; https://doi.org/10.5281/zenodo.7313942; https://doi.org/10.5281/zenodo.7315499 |
| 661k-HQ | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.6845083; https://doi.org/10.5281/zenodo.6849657 |

Code availability

The GitHub repositories and Zenodo depositions for the developed/modified software are provided in the following table.

| Software | Description | GitHub repository | Zenodo accession |
|---|---|---|---|
| Phylign (v0.2.0) | Snakemake pipeline | https://github.com/karel-brinda/phylign/ | https://doi.org/10.5281/zenodo.10828249 |
| MiniPhy (v0.4.0) | Snakemake pipeline | https://github.com/karel-brinda/miniphy/ | https://doi.org/10.5281/zenodo.10798914 |
| MiniPhy-COBS (v.0.0.1) | Snakemake pipeline | https://github.com/leoisl/miniphy-cobs/ | https://doi.org/10.5281/zenodo.14212997 |
| ProPhyle (modified, v0.3.3) | ProPhyle metagenomic classifier | https://github.com/prophyle/prophyle/ | https://doi.org/10.5281/zenodo.11004671 |
| COBS (modified, v0.3) | COBS k-mer indexer | https://github.com/iqbal-lab-org/cobs/ | https://doi.org/10.5281/zenodo.14212977 |
| Attotree (v0.1.6) | An efficient re-implementation of the Mashtree algorithm | https://github.com/karel-brinda/attotree/ | https://doi.org/10.5281/zenodo.10945896 |


Acknowledgements

This work was supported by the NIGMS of the National Institutes of Health (R35GM133700 to M.B.), the David and Lucile Packard Foundation (to M.B.), the Pew Charitable Trusts (to M.B.), the Alfred P. Sloan Foundation (to M.B.), the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 872539, 956229 and 101047160 to R.C.) and the ANR Transipedia, SeqDigger, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001, respectively; to R.C.). Portions of this research were conducted on the O2 high-performance compute cluster, supported by the Research Computing Group at Harvard Medical School, and on the GenOuest bioinformatics core facility (https://www.genouest.org/).

Author information

Contributions

K.B., Z.I. and M.B. designed and conceptualized the method and algorithms and wrote the paper. K.B. wrote the initial draft of the manuscript. K.B. and L.L. wrote the software. K.B. performed the analyses for the study. N.Q.-O., R.C. and G.K. contributed to the conception and design of the work. S.P. and K.S. contributed to the software development. All authors reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Karel Břinda or Michael Baym.

Ethics declarations

Competing interests

S.P. is currently employed by Eligo Bioscience. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks David Koslicki, Rob Patro and Harihara Subrahmaniam Muralidharan for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Batching strategies for the 661k (a) and BIGSIdata (b) collections.

Genomes are clustered by species, and clusters that are too small are placed into a common pseudo-cluster called a dustbin. The resulting clusters and the dustbin are then divided into size- and diversity-balanced batches. For more information on batching, see Methods and Supplementary Note 5.
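
The batching steps described in this caption can be sketched as follows. This is a minimal illustration and not the MiniPhy implementation: the function name and the thresholds `min_cluster_size` and `max_batch_size` are hypothetical, and batches are balanced by genome count only, whereas the caption describes batches balanced by both size and diversity.

```python
from collections import defaultdict


def make_batches(genome_to_species, min_cluster_size=3, max_batch_size=4):
    """Cluster genomes by species, pool small clusters, and split into batches.

    Both thresholds are hypothetical values chosen for illustration only.
    """
    # Step 1: cluster genomes by their species label.
    clusters = defaultdict(list)
    for genome, species in sorted(genome_to_species.items()):
        clusters[species].append(genome)

    # Step 2: clusters that are too small go into a common "dustbin".
    kept = {}
    dustbin = []
    for species, genomes in clusters.items():
        if len(genomes) < min_cluster_size:
            dustbin.extend(genomes)
        else:
            kept[species] = genomes
    if dustbin:
        kept["dustbin"] = dustbin

    # Step 3: split every cluster (and the dustbin) into near-equal batches.
    batches = {}
    for name, genomes in kept.items():
        n_batches = -(-len(genomes) // max_batch_size)  # ceiling division
        base, extra = divmod(len(genomes), n_batches)
        start = 0
        for i in range(n_batches):
            size = base + (1 if i < extra else 0)
            batches[f"{name}__{i + 1:02d}"] = genomes[start:start + size]
            start += size
    return batches
```

With five genomes of one species and two singleton species, the sketch yields two batches for the large cluster and a single dustbin batch holding the two singletons.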

Extended Data Fig. 2 Quantification of phylogeny-explained data redundancy in the five test collections.

The plot depicts the percentage of data redundancy that can be explained by the compressive phylogenies in each of the five test collections. Explained redundancy is measured by bottom-up k-mer propagation along the phylogenies performed by ProPhyle and calculated as the proportion of duplicate k-mers removed by the propagation (k = 31, canonical; see Methods for the formula). A k-mer distribution perfectly explained by the associated compressive phylogeny (that is, all k-mers associated with complete subtrees) would result in 100% phylogeny-explained redundancy. The plot shows that for single-species batches (modeled by the GISP and SC2 collections), the majority of the signal can be explained by their compressive phylogenies, indicative of their extremely high phylogenetic compressibility (cf. Extended Data Fig. 4a, b). In contrast, high-diversity batches (modeled by the NCTC3k collection) have more irregularly distributed k-mer content due to horizontal gene transfer combined with sparse sampling, indicative of their lower compressibility (cf. Extended Data Fig. 4c). Large and diverse collections, such as 661k and BIGSIdata, thus exhibit a medium level of phylogenetically explained redundancies, with the level depending on the amount of noise (higher for BIGSIdata and lower for 661k, as also visible in Extended Data Fig. 7).
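
To make the measure in this caption concrete, the sketch below propagates k-mers bottom-up along a small tree and reports the fraction of duplicate k-mers that the propagation removes. It is not ProPhyle's code, and the exact formula is given in the Methods, which are not reproduced here. The definition used below (duplicates counted as total k-mer occurrences minus distinct k-mers) is an assumption, chosen so that a perfectly tree-consistent distribution scores 100%. The tree encoding (`children` as a dict of node to child list) and all function names are hypothetical.

```python
def canonical_kmers(seq, k):
    """Return the set of canonical k-mers of a DNA string over ACGT."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(comp)[::-1]  # reverse complement
        kmers.add(min(kmer, rc))
    return kmers


def propagate(node, children, kmers):
    """Move k-mers shared by all children of a node up to that node."""
    kids = children.get(node, [])
    if not kids:
        return  # leaf: nothing to propagate
    for kid in kids:
        propagate(kid, children, kmers)
    shared = set.intersection(*(kmers[kid] for kid in kids))
    for kid in kids:
        kmers[kid] -= shared
    kmers[node] = shared


def explained_redundancy(root, children, leaf_kmers):
    """Fraction of duplicate k-mers removed by propagation (assumed definition)."""
    kmers = {leaf: set(s) for leaf, s in leaf_kmers.items()}
    before = sum(len(s) for s in kmers.values())
    distinct = len(set().union(*kmers.values()))
    propagate(root, children, kmers)
    after = sum(len(s) for s in kmers.values())
    duplicates = before - distinct
    return (before - after) / duplicates if duplicates else 0.0
```

In this sketch, a k-mer shared by two leaves that do not form a complete subtree stays duplicated after propagation, which lowers the score. That corresponds to the irregular k-mer distributions the caption attributes to horizontal gene transfer and sparse sampling.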

Extended Data Fig. 3 Calibration of XZ as a low-level tool for phylogenetic compression of assemblies.

The comparison was performed using the assemblies from the GISP collection, with genomes sorted left-to-right according to the Mashtree phylogeny. In both plots, an asterisk denotes the mode selected for phylogenetic compression in MiniPhy. a) The plot shows the compression performance of XZ, GZip, and BZip2 in bits per bp as a function of compression presets (-1, -2, etc.) with single-line FASTA. Given the specific sizes of dictionaries and windows used in the individual algorithms and their presets, only XZ with a level ≥ 4 was capable of compressing bacterial genomes beyond the statistical entropy baseline (that is, approximately 2 bits per bp). M and MM denote additional, manually tuned compression modes of XZ with increased dictionary sizes (Methods), which slightly improved compression performance but substantially increased memory and CPU time and were thus not used in MiniPhy. b) The plot shows the impact of FASTA line length on compression performance. With single-line FASTA (denoted by Inf), the compressed size is reduced to 12% compared to the 40-bp-per-line version. The plot highlights the importance of pre-formatting FASTA data before using general compressors such as XZ.
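
The effect of FASTA line length described in panel b can be explored with Python's standard `lzma` module, which writes the XZ format. The sketch below is an assumption-laden illustration rather than the calibration used in the paper: the synthetic genomes, mutation counts and helper names are hypothetical, and `preset=6` is simply the library default, not necessarily the mode selected in MiniPhy.

```python
import lzma
import random


def wrap_fasta(name, seq, width=None):
    """Format one FASTA record; width=None keeps the sequence on a single line."""
    if width is None:
        body = seq
    else:
        body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"


def bits_per_bp(fasta_text, n_bp, preset=6):
    """XZ-compressed size in bits divided by the number of bases."""
    compressed = lzma.compress(fasta_text.encode("ascii"), preset=preset)
    return 8 * len(compressed) / n_bp


def synthetic_genomes(n, length, seed=0, n_snps=20, n_dels=5):
    """Hypothetical test data: one random genome plus n-1 mutated copies."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(length))
    genomes = [base]
    for _ in range(n - 1):
        s = list(base)
        for _ in range(n_snps):
            s[rng.randrange(len(s))] = rng.choice("ACGT")
        for _ in range(n_dels):
            del s[rng.randrange(len(s))]  # deletions shift later line breaks
        genomes.append("".join(s))
    return genomes
```

Calling `bits_per_bp` on the same genomes formatted with `wrap_fasta(..., width=40)` and with `width=None` gives a rough feel for the gap. Deletions shift the line breaks of otherwise identical regions, which breaks up the long matches the compressor depends on.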

Extended Data Fig. 4 Comparison of three contrasting compression scaling modes of microbial collections.

The plots compare the scaling behavior of the XZ, GZip, BZip2, and Re-Pair compressors on the SC2 (a), GISP (b), and NCTC3k (c) collections, depicting the space per genome as a function of the number of jointly compressed genomes, progressively increased on logarithmic scales. The results highlight several key findings. First, XZ consistently outperforms the other compressors. Second, for viral genomes all four compressors are able to overcome the 2-bits-per-bp baseline thanks to their short genome length, but only XZ is able to compress beyond this limit for bacterial genomes (consistent with Extended Data Fig. 3a; the Re-Pair implementation used could not compress bacterial genomes due to their size). Third, Re-Pair compression can be nearly as effective as XZ for viruses, but its applicability to large datasets is limited by its scalability. Fourth, the compressibility of divergent bacteria is substantially limited even with the best compressors, with only a 4× improvement in per-genome compression for NCTC3k (while the highly compressible SC2 and GISP collections show 171× and 105× improvements for the same number of genomes).
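
The quantity plotted here, space per genome as a function of the number of jointly compressed genomes, can be measured with a few lines of Python. The sketch uses the standard `lzma` module with its default preset as a stand-in for XZ. The synthetic, closely related genomes and both function names are hypothetical and only mimic a low-diversity collection.

```python
import lzma
import random


def space_per_genome(genomes, counts):
    """Compressed bytes per genome when the first n genomes are compressed jointly."""
    result = {}
    for n in counts:
        text = "".join(f">g{i}\n{g}\n" for i, g in enumerate(genomes[:n]))
        result[n] = len(lzma.compress(text.encode("ascii"))) / n
    return result


def related_genomes(n, length, n_snps, seed=0):
    """Hypothetical test data: a random genome and n-1 copies with a few SNPs."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(length))
    genomes = [base]
    for _ in range(n - 1):
        s = list(base)
        for _ in range(n_snps):
            s[rng.randrange(length)] = rng.choice("ACGT")
        genomes.append("".join(s))
    return genomes
```

Evaluating `space_per_genome` at counts such as 1, 2, 4, 8 and 16 reproduces the logarithmic x-axis of the plots. For highly similar genomes the per-genome cost falls steeply with each doubling, whereas a divergent collection would flatten out much sooner.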

Extended Data Fig. 5 Impact of within-batch genome order on the compressibility of microbial collections.

While a substantial part of the benefits of phylogenetic compression comes from organizing genomes into batches of phylogenetically related genomes, proper genome reordering within individual batches is also crucial for maximizing data compressibility. The plots demonstrate that the impact of within-batch reordering grows with the amount of diversity included (GISP vs. NCTC3k) and with the number of genomes (GISP vs. SC2). Accurate phylogenies inferred using RAxML provided a small compression benefit for assemblies over trees computed using Mashtree (GISP).
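
Within-batch reordering amounts to writing genomes in the left-to-right leaf order of a tree before compression. The sketch below extracts that order from a Newick string with a regular expression. It is a simplified illustration under stated assumptions, not the procedure used by MiniPhy or Mashtree: it assumes unquoted leaf names and a tree with at least two leaves, and both function names are hypothetical.

```python
import re


def newick_leaf_order(newick):
    """Return leaf names in left-to-right order from a Newick string.

    Leaves follow "(" or ","; internal labels follow ")" and are skipped.
    Assumes unquoted names and at least two leaves.
    """
    return [m.strip() for m in re.findall(r"[(,]\s*([^\s(),:;][^(),:;]*)", newick)]


def reorder_genomes(genomes, newick):
    """Order a {name: sequence} mapping by the leaf order of the tree."""
    return [(name, genomes[name]) for name in newick_leaf_order(newick)]
```

The reordered list can then be written as single-line FASTA and passed to a general-purpose compressor, so that closely related genomes sit next to each other in the input stream.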

Extended Data Fig. 6 Compression trade-offs for the five test collections and for individual batches of the 661k collection.

The plot illustrates the trade-off between the per-genome size after compression and the number of bits per distinct k-mer (k = 31, canonical). The larger points represent individual genome collections and correspond to values from Supplementary Table 3. The smaller points represent individual batches within the 661k collection, with colors indicating the number of genomes in each batch. Overall, the plot reveals the influence of genomic diversity on the resulting compression characteristics. The trade-off follows an L-shaped pattern, where compressing genome groups with high diversity leads to smaller space per k-mer but larger space per genome, and vice versa for genome groups with low diversity.

Extended Data Fig. 7 Distribution of the number of distinct k-mers in the top 20 species in (a) the 661k and (b) BIGSIdata collections.

For the 661k collection, colors represent the quality of the assemblies (LQ: low-quality, HQ: high-quality), as determined as part of the quality control in the original publication. For BIGSIdata, no quality control information is available. The numbers below the species name indicate the number of samples within each category. The plots were created for canonical 31-mers.

Extended Data Fig. 8 Proportions of top 10 species (their corresponding batches) in the 661k collection before and after phylogenetic compression.

The plot depicts the proportions of the top 10 species, the Dustbin pseudo-cluster, and the remaining species grouped as Others, while comparing the following four quantitative characteristics: the number of genomes, their cumulative length, the size of the phylogenetically compressed assemblies, and the size of the phylogenetically compressed COBS indexes (for k = 31). Transitioning from the number of genomes to their cumulative length has only a minor impact on the proportions (corresponding to different mean genome lengths of individual species). However, the divergent genomes occupy a substantially higher proportion of the collection after compression. Moreover, despite genome assemblies and k-mer COBS indexes being fundamentally different genome representations (horizontal vs. vertical, respectively), the observed post-compression proportions in them were nearly identical.

Extended Data Fig. 9 Time required for decompressing the Phylign 661k-HQ database.

The wall-clock and total CPU time required to decompress the Phylign 661k-HQ database, both from disk and in memory, were measured on an iMac desktop computer with 4 physical (8 logical) cores. The in-memory decompression process, which is implemented in Phylign, was completed in under 30 min. This duration represents only a fraction of the typical time required for search experiments (see Supplementary Table 6).

Supplementary information

Supplementary Information

Supplementary Notes 1–6, Supplementary Tables 1–7 and additional materials.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Břinda, K., Lima, L., Pignotti, S. et al. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 22, 692–697 (2025). https://doi.org/10.1038/s41592-025-02625-2

  • DOI: https://doi.org/10.1038/s41592-025-02625-2
