Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate that it can align genes, plasmids or entire sequencing experiments against all bacteria sequenced up to 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
Data availability
The Zenodo depositions for the five phylogenetically compressed test collections are provided in the following table.
| Dataset | Compressed form | Zenodo accession/URL |
| --- | --- | --- |
| GISP | Assemblies (XZ) | |
| SC2 | Assemblies (XZ) | Available upon request (GISAID license). |
| NCTC3k | Assemblies (XZ) | |
| BIGSIdata | De Bruijn graphs (simplitigs after k-mer propagation; XZ) | |
| 661k | Assemblies (XZ) | |
| 661k | Assemblies (MBGC) | |
| 661k | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.7313926 |
| 661k-HQ | k-mer index (COBS; XZ) | |
Code availability
The GitHub repositories and Zenodo depositions for the developed/modified software are provided in the following table.
| Software | Description | GitHub repository | Zenodo accession |
| --- | --- | --- | --- |
| Phylign (v0.2.0) | Snakemake pipeline | | |
| MiniPhy (v0.4.0) | Snakemake pipeline | | |
| MiniPhy-COBS (v0.0.1) | Snakemake pipeline | | |
| ProPhyle (modified, v0.3.3) | ProPhyle metagenomic classifier | | |
| COBS (modified, v0.3) | COBS k-mer indexer | | |
| Attotree (v0.1.6) | An efficient re-implementation of the Mashtree algorithm | | |
References
Stephens, Z. D. et al. Big data: astronomical or genomical? PLoS Biol. 13, e1002195 (2015).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Navarro, G. & Mäkinen, V. Compressed full-text indexes. ACM Comput. Surv. 39, 2 (2007).
Loh, P.-R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).
Yu, Y. W., Daniels, N. M., Danko, D. C. & Berger, B. Entropy-scaling search of massive biological data. Cell Syst. 1, 130–140 (2015).
Giancarlo, R., Scaturro, D. & Utro, F. Textual data compression in computational biology: a synopsis. Bioinformatics 25, 1575–1586 (2009).
Salomon, D. & Motta, G. in Handbook of Data Compression, 329–441 (Springer, 2010).
Daniels, N. M. et al. Compressive genomics for protein databases. Bioinformatics 29, i283–i290 (2013).
Deorowicz, S. & Grabowski, S. Data compression for sequencing data. Algorithms Mol. Biol. 8, 25 (2013).
Giancarlo, R., Rombo, S. E. & Utro, F. Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief. Bioinform. https://doi.org/10.1093/bib/bbt088 (2013).
Zhu, Z., Zhang, Y., Ji, Z., He, S. & Yang, X. High-throughput DNA sequence data compression. Brief. Bioinform. 16, 1–15 (2015).
Hosseini, M., Pratas, D. & Pinho, A. J. A survey on data compression methods for biological sequences. Information 7, 56 (2016).
Jayasankar, U., Thirumal, V. & Ponnurangam, D. A survey on data compression techniques: from the perspective of data quality, coding schemes, data type and applications. J. King Saud Univ. Comput. Inf. Sci. 33, 119–140 (2021).
Navarro, G. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv. 54, 1–31 (2021).
Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
Blackwell, G. A. et al. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. 19, e3001421 (2021).
Bradley, P., den Bakker, H. C., Rocha, E. P. C., McVean, G. & Iqbal, Z. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 37, 152–159 (2019).
Grabowski, S. & Kowalski, T. M. MBGC: multiple bacteria genome compressor. Gigascience 11, giab099 (2022).
Deorowicz, S., Danek, A. & Li, H. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics 39, btad097 (2023).
Zielezinski, A., Vinga, S., Almeida, J. & Karlowski, W. M. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 18, 186 (2017).
Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm. SRC Research Report 124, Digital Equipment Corporation, 1–24 (Digital Equipment Corporation Press, 1994).
Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).
Patro, R. & Kingsford, C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 31, 2770–2777 (2015).
Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2015).
Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2018).
Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Břinda, K. Novel Computational Techniques for Mapping and Classification of Next-generation Sequencing Data. PhD thesis, Univ. Paris-Est (2016).
Břinda, K., Salikhov, K., Pignotti, S. & Kucherov, G. ProPhyle: an accurate, resource-frugal and deterministic DNA sequence classifier. Zenodo https://doi.org/10.5281/zenodo.1045429 (2017).
Ge, H., Sun, L. & Yu, J. Fast batch searching for protein homology based on compression and clustering. BMC Bioinform. 18, 508 (2017).
Reiter, T. Clustering the NCBI nr database to reduce database size and enable faster BLAST searches. Arcadia Science https://doi.org/10.57844/ARCADIA-W8XT-PC81 (2023).
Collin, L. & Pavlov, I. XZ Utils. Available from https://tukaani.org/xz/ (2009).
Katz, L. et al. Mashtree: a rapid comparison of whole genome sequence files. J. Open Source Softw. 4, 1762 (2019).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Breitwieser, F. P., Lu, J. & Salzberg, S. L. A review of methods and databases for metagenomic classification and assembly. Brief. Bioinform. 20, 1125–1136 (2019).
Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a compact bit-sliced signature index. In String Processing and Information Retrieval 285–303 (Springer International Publishing, 2019).
Karasikov, M. et al. MetaGraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).
Rahman, A., Chikhi, R. & Medvedev, P. Disk compression of k-mer sets. Algorithms Mol. Biol. 16, 10 (2021).
Turner, I., Garimella, K. V., Iqbal, Z. & McVean, G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics https://doi.org/10.1093/bioinformatics/bty191 (2018).
Gupta, S. K. et al. ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58, 212–220 (2014).
Ferragina, P. & Manzini, G. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science 390–398 https://doi.org/10.1109/SFCS.2000.892127 (IEEE Computer Society, 2000).
Gagie, T., Navarro, G. & Prezza, N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67, 1–54 (2020).
Zakeri, M., Brown, N. K., Ahmed, O. Y., Gagie, T. & Langmead, B. Movi: a fast and cache-efficient full-text pangenome index. iScience https://doi.org/10.1016/j.isci.2024.111464 (2024).
Ames, S. K. et al. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics 29, 2253–2260 (2013).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Molloy, E. K. & Warnow, T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol. Biol. 14, 14 (2019).
Goig, G. A., Blanco, S., Garcia-Basteiro, A. L. & Comas, I. Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability. BMC Biol. 18, 24 (2020).
Mäklin, T. et al. Bacterial genomic epidemiology with mixed samples. Microb. Genom. 7, 000691 (2021).
Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330–1338 (2019).
Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Rev. Genet. https://doi.org/10.1038/nrg.2017.88 (2017).
Břinda, K. et al. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nat. Microbiol. 5, 455–464 (2020).
Břinda, K., Baym, M. & Kucherov, G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 22, 96 (2021).
Rahman, A. & Medvedev, P. Representation of k-mer sets using spectrum-preserving string sets. J. Comput. Biol. 28, 381–394 (2021).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Broder, A. Z. On the resemblance and containment of documents. In Proc. International Conference on Compression and Complexity of Sequences 21–29 https://doi.org/10.1109/sequen.1997.666900 (IEEE, 1997).
Fan, H., Ives, A. R., Surget-Groba, Y. & Cannon, C. H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522 (2015).
Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).
Howe, K., Bateman, A. & Durbin, R. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18, 1546–1547 (2002).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Köster, J. & Rahmann, S. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).
Li, H. Seqtk: toolkit for processing sequences in FASTA/Q formats. GitHub https://github.com/lh3/seqtk (2016).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Grüning, B. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat. Methods 15, 475–476 (2018).
Grad, Y. H. et al. Genomic epidemiology of gonococcal resistance to extended-spectrum cephalosporins, macrolides, and fluoroquinolones in the United States, 2000–2013. J. Infect. Dis. 214, 1579–1587 (2016).
Tange, O. GNU Parallel: the command-line power tool. The USENIX Magazine 36, 42–47 (2011).
Larsson, N. J. & Moffat, A. Off-line dictionary-based compression. Proc. IEEE 88, 1722–1732 (2000).
Wan, R. Browsing and Searching Compressed Documents. PhD thesis, Univ. Melbourne (2003).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Chikhi, R., Limasset, A. & Medvedev, P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32, i201–i208 (2016).
Břinda, K., Sykulski, M. & Kucherov, G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 31, 3584–3592 (2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Acknowledgements
This work was supported by the NIGMS of the National Institutes of Health (R35GM133700 to M.B.), the David and Lucile Packard Foundation (to M.B.), the Pew Charitable Trusts (to M.B.), the Alfred P. Sloan Foundation (to M.B.), the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 872539, 956229 and 101047160 to R.C.) and the ANR Transipedia, SeqDigger, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001, respectively; to R.C.). Portions of this research were conducted on the O2 high-performance compute cluster, supported by the Research Computing Group at Harvard Medical School, and on the GenOuest bioinformatics core facility (https://www.genouest.org/).
Author information
Authors and Affiliations
Contributions
K.B., Z.I. and M.B. designed and conceptualized the method and algorithms and wrote the paper. K.B. wrote the initial draft of the manuscript. K.B. and L.L. wrote the software. K.B. performed the analyses for the study. N.Q.-O., R.C. and G.K. contributed to the conception and design of the work. S.P. and K.S. contributed to the software development. All authors reviewed and approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
S.P. is currently employed by Eligo Bioscience. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks David Koslicki, Rob Patro and Harihara Subrahmaniam Muralidharan for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Batching strategies for the 661k (a) and BIGSIdata (b) collections.
Genomes are clustered by species, and clusters that are too small are placed into a common pseudo-cluster called a dustbin. The resulting clusters and the dustbin are then divided into size- and diversity-balanced batches. For more information on batching, see Methods and Supplementary Note 5.
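The batching strategy described in this caption can be sketched in a few lines of Python. This is a hedged, minimal re-illustration of the idea only (cluster by species, pool undersized clusters into a dustbin pseudo-cluster, cap cumulative batch size), not the MiniPhy implementation; the function name `make_batches` and both thresholds are assumptions for illustration.

```python
# Sketch of the batching idea: genomes are grouped by species, clusters
# that are too small are pooled into a "dustbin" pseudo-cluster, and each
# cluster is split into batches of bounded cumulative size. Thresholds
# are arbitrary illustrative values, not MiniPhy's defaults.

from collections import defaultdict


def make_batches(genomes, min_cluster=3, max_batch_bp=12_000):
    """genomes: list of (name, species, length_bp) tuples."""
    clusters = defaultdict(list)
    for name, species, length in genomes:
        clusters[species].append((name, length))

    # Pool clusters that are too small into a common dustbin.
    dustbin = []
    for species in [s for s, g in clusters.items() if len(g) < min_cluster]:
        dustbin.extend(clusters.pop(species))
    if dustbin:
        clusters["dustbin"] = dustbin

    # Split each cluster into batches with a bounded cumulative length.
    batches = []
    for species, members in clusters.items():
        batch, size = [], 0
        for name, length in members:
            if batch and size + length > max_batch_bp:
                batches.append((species, batch))
                batch, size = [], 0
            batch.append(name)
            size += length
        if batch:
            batches.append((species, batch))
    return batches
```

In a real pipeline the within-batch order would additionally follow the compressive phylogeny, which this sketch omits.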
Extended Data Fig. 2 Quantification of phylogeny-explained data redundancy in the five test collections.
The plot depicts the percentage of data redundancy that can be explained by the compressive phylogenies in each of the five test collections. Explained redundancy is measured by bottom-up k-mer propagation along the phylogenies performed by ProPhyle and calculated as the proportion of duplicate k-mers removed by the propagation (k = 31, canonical; see Methods for the formula). A k-mer distribution perfectly explained by the associated compressive phylogeny (that is, all k-mers associated with complete subtrees) would result in 100% phylogeny-explained redundancy. The plot shows that for single-species batches (modeled by the GISP and SC2 collections), the majority of the signal can be explained by their compressive phylogenies, indicative of their extremely high phylogenetic compressibility (cf. Extended Data Fig. 4a, b). In contrast, high-diversity batches (modeled by the NCTC3k collection) have more irregularly distributed k-mer content due to horizontal gene transfer combined with sparse sampling, indicative of their lower compressibility (cf. Extended Data Fig. 4c). Large and diverse collections, such as 661k and BIGSIdata, thus exhibit an intermediate level of phylogeny-explained redundancy, with the level depending on the amount of noise (higher for BIGSIdata and lower for 661k, as also visible in Extended Data Fig. 7).
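The bottom-up propagation measured in this caption can be illustrated with a toy sketch: a k-mer present in every child of a node moves up to that node, so each phylogeny-explained duplicate is stored once rather than once per leaf. This is a hedged re-illustration of the principle, not ProPhyle's implementation; the function `propagate` and its data layout are assumptions.

```python
# Toy bottom-up k-mer propagation along a rooted tree: k-mers shared by
# all children of a node are removed from the children and stored once at
# the node. The fraction of duplicate k-mer occurrences removed this way
# corresponds to the "phylogeny-explained redundancy" in the caption.


def propagate(tree, leaf_kmers, root):
    """tree: dict node -> list of children (leaves absent from the dict).
    leaf_kmers: dict leaf -> set of k-mers.
    Returns a dict node -> set of k-mers stored at that node."""
    kmers = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            kmers[node] = set(leaf_kmers[node])
            return
        for child in children:
            visit(child)
        # k-mers present in every child move up to this node
        shared = set.intersection(*(kmers[c] for c in children))
        for child in children:
            kmers[child] -= shared
        kmers[node] = shared

    visit(root)
    return kmers
```

On two leaves sharing one of their two k-mers, the shared k-mer ends up at the root and the total number of stored k-mer occurrences drops from four to three.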
Extended Data Fig. 3 Calibration of XZ as a low-level tool for phylogenetic compression of assemblies.
The comparison was performed using the assemblies from the GISP collection, with genomes sorted left-to-right according to the Mashtree phylogeny. In both plots, an asterisk denotes the mode selected for phylogenetic compression in MiniPhy. a) The plot shows the compression performance of XZ, GZip, and BZip2 in bits per bp as a function of compression presets (-1, -2, etc.) with single-line FASTA. Given the specific sizes of dictionaries and windows used in the individual algorithms and their presets, only XZ with a level ≥ 4 was capable of compressing bacterial genomes beyond the statistical entropy baseline (that is, approximately 2 bits per bp). M and MM denote additional, manually tuned compression modes of XZ with increased dictionary sizes (Methods), which slightly improved compression performance but substantially increased memory and CPU time and were thus not used in MiniPhy. b) The plot shows the impact of FASTA line length on compression performance. With single-line FASTA (denoted by Inf), the compressed size is reduced to 12% compared to the 40-bp-per-line version. The plot highlights the importance of pre-formatting FASTA data before using general compressors such as XZ.
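The line-length effect in panel b can be demonstrated with Python's `lzma` module (the library behind XZ). This sketch is illustrative only and is not the MiniPhy calibration: hard-wrapped FASTA lines hurt compression of similar genomes because an indel shifts all subsequent line breaks, fragmenting the long matches that LZMA would otherwise exploit across single-line records.

```python
# Compress two near-identical pseudo-genomes as 40-bp-wrapped versus
# single-line FASTA and compare XZ (LZMA) output sizes. A single deletion
# in the second genome shifts every later line break in the wrapped
# version, so long cross-genome matches are broken roughly every 40 bp.

import lzma
import random
import textwrap

random.seed(42)
genome = "".join(random.choice("ACGT") for _ in range(20_000))
variant = genome[:100] + genome[101:]  # one deletion shifts the wrap phase


def fasta(records, width=None):
    """Serialize (name, seq) records; width=None yields single-line FASTA."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.append(seq if width is None else "\n".join(textwrap.wrap(seq, width)))
    return "\n".join(out).encode()


records = [("g1", genome), ("g2", variant)]
wrapped = len(lzma.compress(fasta(records, width=40)))
single = len(lzma.compress(fasta(records)))
# single-line FASTA compresses substantially better than the wrapped form
```

The magnitude of the gap depends on genome length and divergence, but the direction matches the caption's finding that pre-formatting FASTA to single-line records is essential before applying general compressors such as XZ.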
Extended Data Fig. 4 Comparison of three contrasting compression scaling modes of microbial collections.
The plots compare the scaling behavior of the XZ, GZip, BZip2, and Re-Pair compressors on the SC2 (a), GISP (b), and NCTC3k (c) collections, depicting the space per genome as a function of the number of jointly compressed genomes, progressively increased on logarithmic scales. The results highlight several key findings. First, XZ consistently outperforms the other compressors. Second, for viral genomes all four compressors are able to overcome the 2-bits-per-bp baseline thanks to their short genome length, but only XZ is able to compress beyond this limit for bacterial genomes (consistent with Extended Data Fig. 3a; the Re-Pair implementation used could not compress bacterial genomes due to their size). Third, Re-Pair compression can be nearly as effective as XZ for viruses, but its applicability to large datasets is limited by its scalability. Fourth, the compressibility of divergent bacteria is substantially limited even with the best compressors, with only a 4× improvement in per-genome compression for NCTC3k (while the highly compressible SC2 and GISP collections show 171× and 105× improvements for the same number of genomes).
Extended Data Fig. 5 Impact of within-batch genome order on the compressibility of microbial collections.
While a substantial part of the benefits of phylogenetic compression comes from organizing genomes into batches of phylogenetically related genomes, proper genome reordering within individual batches is also crucial for maximizing data compressibility. The plots demonstrate that the impact of within-batch reordering grows with the amount of diversity included (GISP vs. NCTC3k) and with the number of genomes (GISP vs. SC2). Accurate phylogenies inferred using RAxML provided a small compression benefit for assemblies over trees computed using Mashtree (GISP).
Extended Data Fig. 6 Compression trade-offs for the five test collections and for individual batches of the 661k collection.
The plot illustrates the trade-off between the per-genome size after compression and the number of bits per distinct k-mer (k = 31, canonical). The larger points represent individual genome collections and correspond to values from Supplementary Table 3. The smaller points represent individual batches within the 661k collection, with colors indicating the number of genomes in each batch. Overall, the plot reveals the influence of genomic diversity on the resulting compression characteristics. The trade-off follows an L-shaped pattern, where compressing genome groups with high diversity leads to smaller space per k-mer but larger space per genome, and vice versa for genome groups with low diversity.
Extended Data Fig. 7 Distribution of the number of distinct k-mers in the top 20 species in (a) the 661k and (b) BIGSIdata collections.
For the 661k collection, colors represent the quality of the assemblies (LQ: low-quality, HQ: high-quality), as determined as part of the quality control in the original publication. For BIGSIdata, no quality control information is available. The numbers below the species name indicate the number of samples within each category. The plots were created for canonical 31-mers.
Extended Data Fig. 8 Proportions of top 10 species (their corresponding batches) in the 661k collection before and after phylogenetic compression.
The plot depicts the proportions of the top 10 species, the Dustbin pseudo-cluster, and the remaining species grouped as Others, while comparing the following four quantitative characteristics: the number of genomes, their cumulative length, the size of the phylogenetically compressed assemblies, and the size of the phylogenetically compressed COBS indexes (for k = 31). Transitioning from the number of genomes to their cumulative length has only a minor impact on the proportions (corresponding to different mean genome lengths of individual species). However, the divergent genomes occupy a substantially higher proportion of the collection after compression. Moreover, despite genome assemblies and k-mer COBS indexes being fundamentally different genome representations (horizontal vs. vertical, respectively), the observed post-compression proportions in them were nearly identical.
Extended Data Fig. 9 Time required for decompressing the Phylign 661k-HQ database.
The wall clock and total CPU time required to decompress the Phylign 661k-HQ database, both from disk and in memory, were measured on an iMac desktop computer with 4 physical (8 logical) cores. The in-memory decompression process, which is implemented in Phylign, completed in under 30 min. This duration represents only a fraction of the typical time required for search experiments (see Supplementary Table 6).
Supplementary information
Supplementary Information
Supplementary Notes 1–6, Supplementary Tables 1–7 and additional materials.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Břinda, K., Lima, L., Pignotti, S. et al. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 22, 692–697 (2025). https://doi.org/10.1038/s41592-025-02625-2
DOI: https://doi.org/10.1038/s41592-025-02625-2
This article is cited by
- Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology (2025).