Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate that it can align genes, plasmids or entire sequencing experiments against all bacteria sequenced up to 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
Data availability
The Zenodo depositions for the five phylogenetically compressed test collections are provided in the following table.
| Dataset | Compressed form | Zenodo accession/URL |
| --- | --- | --- |
| GISP | Assemblies (XZ) | |
| SC2 | Assemblies (XZ) | Available upon request (GISAID license). |
| NCTC3k | Assemblies (XZ) | |
| BIGSIdata | De Bruijn graphs (simplitigs after k-mer propagation; XZ) | |
| 661k | Assemblies (XZ) | |
| 661k | Assemblies (MBGC) | |
| 661k | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.7313926 |
| 661k-HQ | k-mer index (COBS; XZ) | |
Code availability
The GitHub repositories and Zenodo depositions for the developed/modified software are provided in the following table.
| Software | Description | GitHub repository | Zenodo accession |
| --- | --- | --- | --- |
| Phylign (v0.2.0) | Snakemake pipeline | | |
| MiniPhy (v0.4.0) | Snakemake pipeline | | |
| MiniPhy-COBS (v0.0.1) | Snakemake pipeline | | |
| ProPhyle (modified, v0.3.3) | ProPhyle metagenomic classifier | | |
| COBS (modified, v0.3) | COBS k-mer indexer | | |
| Attotree (v0.1.6) | An efficient re-implementation of the Mashtree algorithm | | |
References
Stephens, Z. D. et al. Big data: astronomical or genomical? PLoS Biol. 13, e1002195 (2015).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Navarro, G. & Mäkinen, V. Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007).
Loh, P.-R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).
Yu, Y. W., Daniels, N. M., Danko, D. C. & Berger, B. Entropy-scaling search of massive biological data. Cell Syst. 1, 130–140 (2015).
Giancarlo, R., Scaturro, D. & Utro, F. Textual data compression in computational biology: a synopsis. Bioinformatics 25, 1575–1586 (2009).
Salomon, D. & Motta, G. in Handbook of Data Compression, 329–441 (Springer, 2010).
Daniels, N. M. et al. Compressive genomics for protein databases. Bioinformatics 29, i283–i290 (2013).
Deorowicz, S. & Grabowski, S. Data compression for sequencing data. Algorithms Mol. Biol. 8, 25 (2013).
Giancarlo, R., Rombo, S. E. & Utro, F. Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief. Bioinform. https://doi.org/10.1093/bib/bbt088 (2013).
Zhu, Z., Zhang, Y., Ji, Z., He, S. & Yang, X. High-throughput DNA sequence data compression. Brief. Bioinform. 16, 1–15 (2015).
Hosseini, M., Pratas, D. & Pinho, A. J. A survey on data compression methods for biological sequences. Information 7, 56 (2016).
Jayasankar, U., Thirumal, V. & Ponnurangam, D. A survey on data compression techniques: from the perspective of data quality, coding schemes, data type and applications. J. King Saud Univ. Comput. Inf. Sci. 33, 119–140 (2021).
Navarro, G. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv. 54, 1–31 (2021).
Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
Blackwell, G. A. et al. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. 19, e3001421 (2021).
Bradley, P., den Bakker, H. C., Rocha, E. P. C., McVean, G. & Iqbal, Z. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 37, 152–159 (2019).
Grabowski, S. & Kowalski, T. M. MBGC: multiple bacteria genome compressor. Gigascience 11, giab099 (2022).
Deorowicz, S., Danek, A. & Li, H. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics 39, btad097 (2023).
Zielezinski, A., Vinga, S., Almeida, J. & Karlowski, W. M. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 18, 186 (2017).
Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm. SRC Research Report 124, Digital Equipment Corporation, 1–24 (Digital Equipment Corporation Press, 1994).
Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).
Patro, R. & Kingsford, C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 31, 2770–2777 (2015).
Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2015).
Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2018).
Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Břinda, K. Novel Computational Techniques for Mapping and Classification of Next-generation Sequencing Data. PhD thesis, Univ. Paris-Est (2016).
Břinda, K., Salikhov, K., Pignotti, S. & Kucherov, G. ProPhyle: an accurate, resource-frugal and deterministic DNA sequence classifier. Zenodo https://doi.org/10.5281/zenodo.1045429 (2017).
Ge, H., Sun, L. & Yu, J. Fast batch searching for protein homology based on compression and clustering. BMC Bioinform. 18, 508 (2017).
Reiter, T. Clustering the NCBI nr database to reduce database size and enable faster BLAST searches. Arcadia Science https://doi.org/10.57844/ARCADIA-W8XT-PC81 (2023).
Collin, L. & Pavlov, I. XZ Utils. Available from https://tukaani.org/xz/ (2009).
Katz, L. et al. Mashtree: a rapid comparison of whole genome sequence files. J. Open Source Softw. 4, 1762 (2019).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2019).
Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a compact bit-sliced signature index. In String Processing and Information Retrieval 285–303 (Springer International Publishing, 2019).
Karasikov, M. et al. MetaGraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).
Rahman, A., Chikhi, R. & Medvedev, P. Disk compression of k-mer sets. Algorithms Mol. Biol. 16, 10 (2021).
Turner, I., Garimella, K. V., Iqbal, Z. & McVean, G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics https://doi.org/10.1093/bioinformatics/bty191 (2018).
Gupta, S. K. et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58, 212–220 (2014).
Ferragina, P. & Manzini, G. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science 390–398 https://doi.org/10.1109/SFCS.2000.892127 (IEEE Computer Society, 2000).
Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 1–54 (2020).
Zakeri, M., Brown, N. K., Ahmed, O. Y., Gagie, T. & Langmead, B. Movi: a fast and cache-efficient full-text pangenome index. iScience https://doi.org/10.1016/j.isci.2024.111464 (2024).
Ames, S. K. et al. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29, 2253–2260 (2013).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Molloy, E. K. & Warnow, T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol. Biol. 14, 14 (2019).
Goig, G. A., Blanco, S., Garcia-Basteiro, A. L. & Comas, I. Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability. BMC Biol. 18, 24 (2020).
Mäklin, T. et al. Bacterial genomic epidemiology with mixed samples. Microb. Genom. 7, 000691 (2021).
Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330–1338 (2019).
Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Rev. Genet. https://doi.org/10.1038/nrg.2017.88 (2017).
Břinda, K. et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nat. Microbiol. 5, 455–464 (2020).
Břinda, K., Baym, M. & Kucherov, G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 22, 96 (2021).
Rahman, A. & Medvedev, P. Representation of k-mer sets using spectrum-preserving string sets. J. Comput. Biol. 28, 381–394 (2021).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Broder, A. Z. On the resemblance and containment of documents. In Proc. International Conference on Compression and Complexity of Sequences 21–29 https://doi.org/10.1109/sequen.1997.666900 (IEEE, 1997).
Fan, H., Ives, A. R., Surget-Groba, Y. & Cannon, C. H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522 (2015).
Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).
Howe, K., Bateman, A. & Durbin, R. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18, 1546–1547 (2002).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Köster, J. & Rahmann, S. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
Li, H. Seqtk: toolkit for processing sequences in FASTA/Q formats. GitHub https://github.com/lh3/seqtk (2016).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).
Grad, Y. H. et al. Genomic epidemiology of gonococcal resistance to extended-spectrum cephalosporins, macrolides, and fluoroquinolones in the United States, 2000–2013. J. Infect. Dis. 214, 1579–1587 (2016).
Tange, O. GNU Parallel: the command-line power tool. The USENIX Magazine 36, 42–47 (2011).
Larsson, N. J. & Moffat, A. Off-line dictionary-based compression. Proc. IEEE 88, 1722–1732 (2000).
Wan, R. Browsing and Searching Compressed Documents. PhD thesis, Univ. Melbourne (2003).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Chikhi, R., Limasset, A. & Medvedev, P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, i201–i208 (2016).
Břinda, K., Sykulski, M. & Kucherov, G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31, 3584–3592 (2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Acknowledgements
This work was supported by the NIGMS of the National Institutes of Health (R35GM133700 to M.B.), the David and Lucile Packard Foundation (to M.B.), the Pew Charitable Trusts (to M.B.), the Alfred P. Sloan Foundation (to M.B.), the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 872539, 956229 and 101047160 to R.C.) and the ANR Transipedia, SeqDigger, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001, respectively; to R.C.). Portions of this research were conducted on the O2 high-performance compute cluster, supported by the Research Computing Group at Harvard Medical School, and on the GenOuest bioinformatics core facility (https://www.genouest.org/).
Author information
Authors and Affiliations
Contributions
K.B., Z.I. and M.B. designed and conceptualized the method and algorithms and wrote the paper. K.B. wrote the initial draft of the manuscript. K.B. and L.L. wrote the software. K.B. performed the analyses for the study. N.Q.-O., R.C. and G.K. contributed to the conception and design of the work. S.P. and K.S. contributed to the software development. All authors reviewed and approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
S.P. is currently employed by Eligo Bioscience. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks David Koslicki, Rob Patro and Harihara Subrahmaniam Muralidharan for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Batching strategies for the 661k (a) and BIGSIdata (b) collections.
Genomes are clustered by species, and clusters that are too small are placed into a common pseudo-cluster called a dustbin. The resulting clusters and the dustbin are then divided into size- and diversity-balanced batches. For more information on batching, see Methods and Supplementary Note 5.
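The batching strategy described in this caption can be sketched in a few lines of Python. This is a hedged, minimal re-illustration of the idea only (cluster by species, pool undersized clusters into a dustbin pseudo-cluster, cap cumulative batch size), not the MiniPhy implementation; the function name `make_batches` and both thresholds are assumptions for illustration.

```python
# Sketch of the batching idea: genomes are grouped by species, clusters
# that are too small are pooled into a "dustbin" pseudo-cluster, and each
# cluster is split into batches of bounded cumulative size. Thresholds
# are arbitrary illustrative values, not MiniPhy's defaults.

from collections import defaultdict


def make_batches(genomes, min_cluster=3, max_batch_bp=12_000):
    """genomes: list of (name, species, length_bp) tuples."""
    clusters = defaultdict(list)
    for name, species, length in genomes:
        clusters[species].append((name, length))

    # Pool clusters that are too small into a common dustbin.
    dustbin = []
    for species in [s for s, g in clusters.items() if len(g) < min_cluster]:
        dustbin.extend(clusters.pop(species))
    if dustbin:
        clusters["dustbin"] = dustbin

    # Split each cluster into batches with a bounded cumulative length.
    batches = []
    for species, members in clusters.items():
        batch, size = [], 0
        for name, length in members:
            if batch and size + length > max_batch_bp:
                batches.append((species, batch))
                batch, size = [], 0
            batch.append(name)
            size += length
        if batch:
            batches.append((species, batch))
    return batches
```

In a real pipeline the within-batch order would additionally follow the compressive phylogeny, which this sketch omits.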
Extended Data Fig. 2 Quantification of phylogeny-explained data redundancy in the five test collections.
The plot depicts the percentage of data redundancy that can be explained by the compressive phylogenies in each of the five test collections. Explained redundancy is measured by bottom-up k-mer propagation along the phylogenies performed by ProPhyle and calculated as the proportion of duplicate k-mers removed by the propagation (k = 31, canonical; see Methods for the formula). A k-mer distribution perfectly explained by the associated compressive phylogeny (that is, all k-mers associated with complete subtrees) would result in 100% phylogeny-explained redundancy. The plot shows that for single-species batches (modeled by the GISP and SC2 collections), the majority of the signal can be explained by their compressive phylogenies, indicative of their extremely high phylogenetic compressibility (cf. Extended Data Fig. 4a, b). In contrast, high-diversity batches (modeled by the NCTC3k collection) have more irregularly distributed k-mer content due to horizontal gene transfer combined with sparse sampling, indicative of their lower compressibility (cf. Extended Data Fig. 4c). Large and diverse collections, such as 661k and BIGSIdata, thus exhibit an intermediate level of phylogeny-explained redundancy, with the level depending on the amount of noise (higher for BIGSIdata and lower for 661k, as also visible in Extended Data Fig. 7).
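The bottom-up propagation measured in this caption can be illustrated with a toy sketch: a k-mer present in every child of a node moves up to that node, so each phylogeny-explained duplicate is stored once rather than once per leaf. This is a hedged re-illustration of the principle, not ProPhyle's implementation; the function `propagate` and its data layout are assumptions.

```python
# Toy bottom-up k-mer propagation along a rooted tree: k-mers shared by
# all children of a node are removed from the children and stored once at
# the node. The fraction of duplicate k-mer occurrences removed this way
# corresponds to the "phylogeny-explained redundancy" in the caption.


def propagate(tree, leaf_kmers, root):
    """tree: dict node -> list of children (leaves absent from the dict).
    leaf_kmers: dict leaf -> set of k-mers.
    Returns a dict node -> set of k-mers stored at that node."""
    kmers = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            kmers[node] = set(leaf_kmers[node])
            return
        for child in children:
            visit(child)
        # k-mers present in every child move up to this node
        shared = set.intersection(*(kmers[c] for c in children))
        for child in children:
            kmers[child] -= shared
        kmers[node] = shared

    visit(root)
    return kmers
```

On two leaves sharing one of their two k-mers, the shared k-mer ends up at the root and the total number of stored k-mer occurrences drops from four to three.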
Extended Data Fig. 3 Calibration of XZ as a low-level tool for phylogenetic compression of assemblies.
The comparison was performed using the assemblies from the GISP collection, with genomes sorted left-to-right according to the Mashtree phylogeny. In both plots, an asterisk denotes the mode selected for phylogenetic compression in MiniPhy. a) The plot shows the compression performance of XZ, GZip, and BZip2 in bits per bp as a function of compression presets (-1, -2, etc.) with single-line FASTA. Given the specific sizes of dictionaries and windows used in the individual algorithms and their presets, only XZ with a level ≥ 4 was capable of compressing bacterial genomes beyond the statistical entropy baseline (that is, approximately 2 bits per bp). M and MM denote additional, manually tuned compression modes of XZ with increased dictionary sizes (Methods), which slightly improved compression performance but substantially increased memory and CPU time and were thus not used in MiniPhy. b) The plot shows the impact of FASTA line length on compression performance. With single-line FASTA (denoted by Inf), the compressed size is reduced to 12% compared to the 40-bp-per-line version. The plot highlights the importance of pre-formatting FASTA data before using general compressors such as XZ.
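The line-length effect in panel b can be demonstrated with Python's `lzma` module (the library behind XZ). This sketch is illustrative only and is not the MiniPhy calibration: hard-wrapped FASTA lines hurt compression of similar genomes because an indel shifts all subsequent line breaks, fragmenting the long matches that LZMA would otherwise exploit across single-line records.

```python
# Compress two near-identical pseudo-genomes as 40-bp-wrapped versus
# single-line FASTA and compare XZ (LZMA) output sizes. A single deletion
# in the second genome shifts every later line break in the wrapped
# version, so long cross-genome matches are broken roughly every 40 bp.

import lzma
import random
import textwrap

random.seed(42)
genome = "".join(random.choice("ACGT") for _ in range(20_000))
variant = genome[:100] + genome[101:]  # one deletion shifts the wrap phase


def fasta(records, width=None):
    """Serialize (name, seq) records; width=None yields single-line FASTA."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.append(seq if width is None else "\n".join(textwrap.wrap(seq, width)))
    return "\n".join(out).encode()


records = [("g1", genome), ("g2", variant)]
wrapped = len(lzma.compress(fasta(records, width=40)))
single = len(lzma.compress(fasta(records)))
# single-line FASTA compresses substantially better than the wrapped form
```

The magnitude of the gap depends on genome length and divergence, but the direction matches the caption's finding that pre-formatting FASTA to single-line records is essential before applying general compressors such as XZ.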
Extended Data Fig. 4 Comparison of three contrasting compression scaling modes of microbial collections.
The plots compare the scaling behavior of the XZ, GZip, BZip2, and Re-Pair compressors on the SC2 (a), GISP (b), and NCTC3k (c) collections, depicting the space per genome as a function of the number of jointly compressed genomes, progressively increased on logarithmic scales. The results highlight several key findings. First, XZ consistently outperforms the other compressors. Second, for viral genomes all four compressors are able to overcome the 2-bits-per-bp baseline thanks to their short genome length, but only XZ is able to compress beyond this limit for bacterial genomes (consistent with Extended Data Fig. 3a; the Re-Pair implementation used could not compress bacterial genomes due to their size). Third, Re-Pair compression can be nearly as effective as XZ for viruses, but its applicability to large datasets is limited by its scalability. Fourth, the compressibility of divergent bacteria is substantially limited even with the best compressors, with only a 4× improvement in per-genome compression for NCTC3k (while the highly compressible SC2 and GISP collections show 171× and 105× improvements for the same number of genomes).
Extended Data Fig. 5 Impact of within-batch genome order on the compressibility of microbial collections.
While a substantial part of the benefits of phylogenetic compression comes from organizing genomes into batches of phylogenetically related genomes, proper genome reordering within individual batches is also crucial for maximizing data compressibility. The plots demonstrate that the impact of within-batch reordering grows with the amount of diversity included (GISP vs. NCTC3k) and with the number of genomes (GISP vs. SC2). Accurate phylogenies inferred using RAxML provided a small compression benefit for assemblies over trees computed using Mashtree (GISP).
Extended Data Fig. 6 Compression trade-offs for the five test collections and for individual batches of the 661k collection.
The plot illustrates the trade-off between the per-genome size after compression and the number of bits per distinct k-mer (k = 31, canonical). The larger points represent individual genome collections and correspond to values from Supplementary Table 3. The smaller points represent individual batches within the 661k collection, with colors indicating the number of genomes in each batch. Overall, the plot reveals the influence of genomic diversity on the resulting compression characteristics. The trade-off follows an L-shaped pattern, where compressing genome groups with high diversity leads to smaller space per k-mer but larger space per genome, and vice versa for genome groups with low diversity.
Extended Data Fig. 7 Distribution of the number of distinct k-mers in the top 20 species in (a) the 661k and (b) BIGSIdata collections.
For the 661k collection, colors represent the quality of the assemblies (LQ: low-quality, HQ: high-quality), as determined as part of the quality control in the original publication. For BIGSIdata, no quality control information is available. The numbers below the species name indicate the number of samples within each category. The plots were created for canonical 31-mers.
Extended Data Fig. 8 Proportions of top 10 species (their corresponding batches) in the 661k collection before and after phylogenetic compression.
The plot depicts the proportions of the top 10 species, the Dustbin pseudo-cluster, and the remaining species grouped as Others, while comparing the following four quantitative characteristics: the number of genomes, their cumulative length, the size of the phylogenetically compressed assemblies, and the size of the phylogenetically compressed COBS indexes (for k = 31). Transitioning from the number of genomes to their cumulative length has only a minor impact on the proportions (corresponding to different mean genome lengths of individual species). However, the divergent genomes occupy a substantially higher proportion of the collection after compression. Moreover, despite genome assemblies and k-mer COBS indexes being fundamentally different genome representations (horizontal vs. vertical, respectively), the observed post-compression proportions in them were nearly identical.
Extended Data Fig. 9 Time required for decompressing the Phylign 661k-HQ database.
The wall clock and total CPU time required to decompress the Phylign 661k-HQ database, both from disk and in memory, were measured on an iMac desktop computer with 4 physical (8 logical) cores. The in-memory decompression process, which is implemented in Phylign, completed in under 30 min. This duration represents only a fraction of the typical time required for search experiments (see Supplementary Table 6).
Supplementary information
Supplementary Information
Supplementary Notes 1–6, Supplementary Tables 1–7 and additional materials.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Břinda, K., Lima, L., Pignotti, S. et al. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 22, 692–697 (2025). https://doi.org/10.1038/s41592-025-02625-2
DOI: https://doi.org/10.1038/s41592-025-02625-2
This article is cited by
- Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology (2025).