Main

Metagenomics and viromics are identifying new viruses at an unprecedented rate, but recognizing which sequences were seen before remains challenging1,2. Calculating average nucleotide identity (ANI), essential for classification, is limited by the scalability of alignment tools like anicalc3, commonly used to cluster viruses into virus operational taxonomic units (vOTUs), or VIRIDIC4, recommended by the International Committee on Taxonomy of Viruses (ICTV) to delineate bacteriophage species and genera. Large-scale sequence comparisons rely on efficient, albeit less accurate, k-mer approaches such as sketching (FastANI5) or sparse approximate alignments (skani6). Moreover, most tools lack clustering functionality or do not scale to large metagenomic datasets (Extended Data Table 1).

Vclust is a fast alignment-based method that calculates ANI measures for complete and fragmented viral genomes and clusters them according to ICTV and Minimum Information about an Uncultivated Virus Genome (MIUViG) standards1,4 (Extended Data Table 1). It introduces three components (Fig. 1a). First, Kmer-db 2, a successor of Kmer-db7, rapidly determines related genomes using either all k-mers or a predefined fraction. Second, LZ-ANI, a Lempel–Ziv parsing-based algorithm (Fig. 1b and Methods), identifies local alignments within related genome pairs and calculates overall ANI from these aligned regions with high sensitivity and accuracy. Third, Clusty efficiently implements six clustering algorithms suited for sparse distance matrices with millions of genomes (Fig. 1c).

Fig. 1: Vclust algorithm and features.

a, Vclust’s workflow: (1) prefilter similar genome sequence pairs with sufficient k-mer-based identity estimated using Kmer-db 2; (2) align similar genome pairs and calculate ANI using LZ-ANI; and (3) cluster genomes based on defined cutoffs using Clusty. b, Sequence alignment using Lempel–Ziv parsing (Methods). c, Vclust’s clustering algorithms. Vertex size represents genome sequence length, and edge lengths indicate the distance (1 − ANI) between genomes. A more detailed depiction of the clustering algorithms is shown in Extended Data Fig. 1. d, Illustration of the calculation of Vclust’s sequence similarity measures.

We first tested the accuracy of Vclust’s total average nucleotide identity (tANI) estimation (Fig. 1d) among 10,000 pairs of phage genomes containing simulated mutations, including substitutions, deletions, insertions, inversions, duplications and translocations (Methods and Supplementary Table 1). Vclust and VIRIDIC, both alignment-based tools, provided tANI values close to the expected ones, with mean absolute error (MAE) values of 0.3% and 0.7%, respectively, outperforming FastANI (6.8%) and skani (21.2%; Fig. 2a). Vclust predictions consistently approached expected values as tANI increased, while VIRIDIC underestimated tANI (Fig. 2a). Among genome pairs above the ICTV’s species threshold (tANI ≥ 95% (ref. 4); n = 1,188), Vclust reported only 22 pairs below the threshold, whereas VIRIDIC placed nearly 10× more pairs below it (n = 210; Supplementary Table 2).

Fig. 2: Comparison of Vclust with other tools on various datasets.

a, Difference between predicted and expected tANI values for 10,000 bacteriophage genome pairs with simulated mutation events. b, Correlations with VIRIDIC tANI values for 22,606 complete bacteriophage genome pairs. c, Wall time and peak memory usage for processing 4,244 bacteriophage genomes (32 threads). Vclust and VIRIDIC include clustering, while FastANI and skani only calculate ANI. d, Venn diagrams comparing numbers of contig pairs meeting MIUViG thresholds (ANI ≥ 95% and AF ≥ 85%) predicted by BLASTn (purple) and other tools (red). The boxen plot shows the error distribution of predicted ANI and AF values relative to corresponding BLASTn-based reference values for 4,361,743 contig pairs meeting MIUViG thresholds. The center line denotes the median, while each box level from the median contains half of the remaining observations. e, Wall time and peak memory usage for calculating ANI and AF among 15,677,623 IMG/VR contigs (64 threads). BLASTn values were estimated from a random sample of 1,000 query contigs. Vclust was tested in its default setting and with a 0.2 fraction of k-mers used at the ‘prefilter’ step. f, Wall time and peak memory usage of Vclust’s clustering algorithms for grouping IMG/VR contigs into vOTUs.


Next, we determined tANI using VIRIDIC in an all-to-all comparison of 4,244 bacteriophage genomes. Vclust had a higher correlation with VIRIDIC tANI (Pearson’s r = 0.983) than skani (r = 0.902) and FastANI (r = 0.671) across the entire tANI range ≥ 70% (22,606 genome pairs; Fig. 2b) and outperformed both tools within their reliability range of ≥ 80% (refs. 5,6).

Then, we compared the consistency of the bacteriophage species groupings (tANI ≥ 95%) with the official ICTV taxonomy (Methods). Vclust and VIRIDIC showed moderate agreement with ICTV (73% and 69%, respectively), followed by FastANI (40%) and skani (27%). Upon examining genome pairs where both Vclust and VIRIDIC diverged from the ICTV’s classification, we found inconsistencies in 50 ICTV taxonomic proposals (Supplementary Tables 3 and 4). Excluding these cases improved the agreement of both tools with ICTV taxonomy, with Vclust retaining superiority (95%) over VIRIDIC (90%) and the other tools (Supplementary Table 5). For genus groupings (tANI ≥ 70%), Vclust achieved 92% agreement with ICTV taxonomy, comparable to VIRIDIC’s 93%, despite the inconsistent application of this threshold that we found across ICTV genera (Supplementary Tables 6 and 7 and Extended Data Fig. 2). Given Vclust’s high agreement with ICTV taxonomy, accurate tANI determination and processing speed >40,000× faster than VIRIDIC (Fig. 2c and Supplementary Table 8), it emerges as the prime tool for bacteriophage classification.

We then assessed Vclust’s accuracy in matching contig pairs that satisfy MIUViG thresholds (ANI ≥ 95% and aligned fraction (AF) ≥ 85%; Fig. 1d). We subsampled over 90,000 metagenomic contigs from the IMG/VR database and used BLASTn (ref. 8) + anicalc (ref. 3), the most accurate alignment-based method, to identify over 4 million sequence pairs that met MIUViG thresholds. Vclust recovered the highest number of pairs (99%), followed by MegaBLAST + anicalc (97%), skani (96%, or 86% in the fastest mode), FastANI (96%) and MMseqs2 (ref. 9) (70%; Fig. 2d and Supplementary Table 9). Both Vclust and MegaBLAST produced ANI and AF estimates consistent with the BLASTn values (Pearson’s r > 0.96), outperforming the other tools (r = 0.2–0.8). On average, ANI and AF values obtained by Vclust and MegaBLAST showed minimal deviation from the expected values (MAE < 1%; Supplementary Table 9), with Vclust having the narrowest error range among all the tools (Fig. 2d). This trend was consistent across varying contig sizes, from smallest (<5 kb) to largest (>100 kb; Supplementary Table 10).

The scalability of the tools was tested using the entire IMG/VR database of 15,677,623 virus contigs. Vclust performed sequence identity estimations for ~123 trillion contig pairs and alignments for ~800 million pairs, resulting in 5–8 million vOTUs depending on the clustering algorithm (Supplementary Table 11 and Supplementary Fig. 1). These vOTUs are generally consistent with those identified by MegaBLAST, with Vclust clustering approximately 75,000 more contigs on average, indicating higher sensitivity (Supplementary Table 11). Vclust was >115× faster than MegaBLAST, >6× faster than skani or FastANI, and ~1.5× faster than MMseqs2 (Fig. 2e,f, Extended Data Fig. 3 and Supplementary Table 12). Although skani in its fastest mode was 7× faster than Vclust (Supplementary Table 12), it was substantially less accurate (Supplementary Table 9). In addition, Vclust’s runtime and memory usage can be further reduced by ~40% and ~60%, respectively, by analyzing 20% of the k-mers in each genome during prefiltering (Fig. 2e), with negligible impact on sensitivity and specificity (Extended Data Fig. 4).

In conclusion, Vclust surpasses the current state-of-the-art methods in viral genome comparison in both accuracy and speed, remaining effective in datasets of millions of sequences. It provides a complete solution for calculating intergenomic similarities and clustering complete, partial and circularly permuted (Extended Data Fig. 5) virus genomes using various ANI measures and clustering algorithms. Given the astonishing diversity of viruses in metagenomic data, we believe that Vclust will be essential for large-scale dereplication and taxonomic classification of viral sequences. It is freely available on GitHub, with a web service option for smaller projects (https://www.vclust.org/), and its core components—Kmer-db, LZ-ANI and Clusty—are available as stand-alone tools for broader applications in sequence comparison and general clustering tasks. Similar to other tools6, Vclust’s performance may decrease with large datasets of highly similar genomes owing to the high number of sequence pairs requiring alignment and clustering after prefiltering (Methods). Future work will focus on improving scalability for large homogeneous datasets, including bacterial genomes, and implementing amino acid-based computations (for example, average amino acid identity).

Methods

Overview

Vclust is a workflow that introduces and integrates three tools:

  1. Kmer-db 2: performs the initial k-mer-based estimation of sequence identity of all genome pairs (‘Sequence identity estimation: Kmer-db 2’).

  2. LZ-ANI: aligns sequence pairs with nucleotide identity exceeding a specified threshold and calculates ANI and AF measures (‘Sequence alignment: LZ-ANI’ and ‘Calculating ANI and AF’).

  3. Clusty: clusters sequences based on ANI and/or AF criteria (‘Clustering sequences: Clusty’).

We implemented Kmer-db 2, LZ-ANI and Clusty in C++20 as stand-alone tools, adaptable for various sequence comparison and clustering tasks (‘Code availability’). Vclust, a Python script, integrates these tools to calculate and cluster viral genomic sequences (‘Vclust implementation’).

Sequence identity estimation: Kmer-db 2

Kmer-db 2 is an updated tool for k-mer-based estimation of pairwise similarities among nucleotide sequences, using either all or a selected fraction of k-mers. Unlike fixed-sized sketching (used, for example, by Mash10), Kmer-db 2 retains a proportional fraction of k-mers per genome, preserving the relationship between sequence lengths.
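The contrast with fixed-size sketching can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that the fraction is selected by keeping k-mers whose hash falls below a cutoff; the text above does not state how Kmer-db 2 selects k-mers, and the function names here are hypothetical.

```python
import hashlib

def kmers(seq: str, k: int) -> set:
    """Return the set of all k-mers of length k in seq (single strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (an illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sample_fraction(kmer_set: set, fraction: float) -> set:
    """Keep the k-mers whose hash lies in the lowest `fraction` of the hash space.

    The expected number of retained k-mers is proportional to the genome's
    k-mer count, unlike a fixed-size sketch, which keeps the same number of
    k-mers for every genome regardless of its length.
    """
    cutoff = int(fraction * 2**64)
    return {km for km in kmer_set if kmer_hash(km) < cutoff}
```

Because the decision depends only on the k-mer itself, a k-mer shared by two genomes is either kept in both or dropped from both, so shared k-mers remain comparable after subsampling.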

Kmer-db 2 introduces several improvements enabling the processing of tens of millions of sequences. First, unlike its predecessor, which stored similarity values in RAM as a dense matrix7, Kmer-db 2 uses sparse matrices that retain only nonzero elements in all-to-all pairwise genome comparison mode (‘all2all-sp’), allowing it to handle large and diverse genome sets. Second, Kmer-db 2 supports genome datasets partitioned into multiple input files, each generating a separate Kmer-db database. A new mode, ‘all2all-parts’, calculates shared k-mers within and across databases, optimizing memory by loading one or two databases into RAM sequentially, although at the expense of additional computational time from repeated database loading. Third, Kmer-db 2 further minimizes RAM usage by storing only genome pairs that meet a minimum threshold of shared k-mers and sequence identity. Finally, all modes in Kmer-db 2 support multithreading, except for the distance calculation step, which is sufficiently fast without parallelization. Supplementary Fig. 2 shows the computational performance improvements of Kmer-db 2 over Kmer-db 1, with runtime reductions of 3× to 100× across modes and substantially lower RAM requirements.

Sequence alignment: LZ-ANI

The LZ-ANI algorithm uses Lempel–Ziv parsing11 to align two sequences (the query and the reference).

First, the algorithm constructs two indices (dictionaries): for anchors and seeds. The anchor index maps all a-mers (substrings of length a) from both strands of the reference sequence to their positions, while the seed index performs the same mapping for shorter s-mers (Fig. 1b, step 1).
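As a rough illustration of this indexing step, the following Python sketch builds a dictionary from fixed-length substrings on both strands of a reference to their positions and then scans a query for exact hits. It is a simplified sketch, not the LZ-ANI implementation: the function names are hypothetical, minus-strand positions are reported in reverse-complement coordinates, and only the characters A, C, G and T are handled.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_index(reference, length):
    """Map every substring of `length` on both strands to (strand, position) pairs."""
    index = defaultdict(list)
    for strand, seq in (("+", reference), ("-", reverse_complement(reference))):
        for pos in range(len(seq) - length + 1):
            index[seq[pos:pos + length]].append((strand, pos))
    return index

def find_hits(query, index, length):
    """Slide a window over the query one nucleotide at a time and yield exact matches."""
    for qpos in range(len(query) - length + 1):
        hits = index.get(query[qpos:qpos + length])
        if hits:
            yield qpos, hits
```

The same `build_index` function would serve for both dictionaries by calling it once with the anchor length a and once with the shorter seed length s.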

Next, the query is read from left to right using a sliding window of a nucleotides, moving one nucleotide at a time. The parsed a-mers are used to search the anchor index for matches in the reference. Upon finding an exact match, the algorithm extends it in both directions (Fig. 1b, step 2). In each direction, a window of size aw slides until it encounters more than a certain number of mismatches (am) at a time. Then, the extensions of terminal windows are trimmed to remove poorly aligned ends until they have at least ar exactly matched nucleotides. This extended anchor initiates the first ‘region’, which corresponds to a local alignment, and is constructed as described below.

The algorithm then moves to the next nucleotide after the extended anchor and looks for a-mers (anywhere in the reference) and s-mers (within r nucleotides from the end of the extended match in the reference) in the dictionaries. Four scenarios may arise:

  1. No anchor or seed is found: shift by one position in the query and repeat the process of finding a new anchor or a seed match. However, if the distance in the query between the current position and the end of the previous match exceeds q nucleotides, the seed search is discontinued.

  2. Only a seed match is found: extend the seed similarly to the initial anchor match, append it to the region, and continue the search for a new anchor or seed match (Fig. 1b, step 3).

  3. Only an anchor match is found: close the current region and extend the anchor match to initiate a new region (Fig. 1b, step 5).

  4. Both anchor and seed matches are found: select the match less likely to occur by chance, based on their lengths, seed proximity (r nucleotides) and the reference sequence length, leading to either scenario 2 or 3 (Fig. 1b, step 4).

Upon closing a region, the algorithm realigns the nucleotide stretches between all the extended matches within the region (Fig. 1b, step 6). This realignment aims to maximize the number of matching nucleotides between neighboring extended matches by allowing a single multi-symbol insertion in the reference or query sequence. As a result, the region represents a local alignment containing both matched and mismatched nucleotides, along with approximated indel fragments. To remove spurious alignments, regions shorter than g nucleotides are excluded from further analysis.

The LZ-ANI tool reads input sequences and stores them in RAM in a compact format with three nucleotides per byte. The tool processes sequences in parallel, with each thread comparing a reference sequence to all other sequences. By default, the tool performs all-versus-all pairwise alignments, but it can also accept a filter specifying sequence pairs to align, such as a file generated by Kmer-db (used by Vclust by default).

Alignment parameters

LZ-ANI parameters are adjustable and were optimized for virus genome sequences (Extended Data Table 2). The default anchor length was set to 11 nucleotides, matching the BLASTn default word size, which provides greater sensitivity than MegaBLAST’s 28-nucleotide word size. The remaining LZ-ANI parameters were optimized using Bayesian optimization with Gaussian process minimization. This optimization involved 100 evaluations on a dataset of 10,000 pairs of complete genomes with simulated mutations (that is, substitutions, insertions, deletions, duplications, inversions and translocations) and known expected ANI values of ≥70% (Supplementary Table 13). The default Vclust parameters were selected based on the lowest MAE between the predicted and reference tANI values. Supplementary Fig. 3 compares the length, number and identity of alignments generated by Vclust (using default parameters), BLASTn and MegaBLAST.

Calculating ANI and AF

Similarly to BLAST-based ANI methods3,4, LZ-ANI alignment between query (A) and reference (B) encompasses ‘regions’, analogous to BLAST’s high-scoring segment pairs. This alignment allows direct calculation of:

  • L(A, B)—the total length (sum) of all regions when aligning query A to reference B, in nucleotides

  • M(A, B)—the total number of matching nucleotides in all regions

These values are used to compute seven sequence similarity measures as follows:

  1. ANI for A and B: \(\frac{M(A,B)}{L(A,B)}\)

  2. ANI for B and A: \(\frac{M(B,A)}{L(B,A)}\)

  3. AF of query A to reference B: \(\frac{L(A,B)}{|A|}\)

  4. AF of query B to reference A: \(\frac{L(B,A)}{|B|}\)

  5. Global ANI for A and B: \(\frac{M(A,B)}{|A|}\)

  6. Global ANI for B and A: \(\frac{M(B,A)}{|B|}\)

  7. Total ANI: \(\frac{M(A,B)+M(B,A)}{|A|+|B|}\)
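The seven measures follow directly from the four alignment totals and the two sequence lengths. The Python sketch below is only an illustration of these formulas; the class, function and key names are hypothetical and do not correspond to LZ-ANI output columns.

```python
from dataclasses import dataclass

@dataclass
class AlignmentTotals:
    matches: int      # M: matching nucleotides summed over all regions
    aligned_len: int  # L: total length of all regions, in nucleotides

def similarity_measures(ab, ba, len_a, len_b):
    """Compute the seven measures from A->B and B->A totals and sequence lengths."""
    def ratio(numerator, denominator):
        # Pairs without any aligned region get 0 instead of a division error.
        return numerator / denominator if denominator else 0.0

    return {
        "ani_ab": ratio(ab.matches, ab.aligned_len),
        "ani_ba": ratio(ba.matches, ba.aligned_len),
        "af_ab": ratio(ab.aligned_len, len_a),
        "af_ba": ratio(ba.aligned_len, len_b),
        "gani_ab": ratio(ab.matches, len_a),
        "gani_ba": ratio(ba.matches, len_b),
        "tani": ratio(ab.matches + ba.matches, len_a + len_b),
    }

# Made-up numbers: a 1,000-nt query A and a 2,000-nt reference B.
measures = similarity_measures(
    AlignmentTotals(matches=760, aligned_len=800),
    AlignmentTotals(matches=770, aligned_len=820),
    len_a=1000,
    len_b=2000,
)
```

In this made-up case ANI for A and B is 760/800 = 0.95 while tANI is (760 + 770)/3,000 = 0.51, which shows how a high local identity can coexist with a low total identity when only part of the genomes aligns.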

Clustering sequences: Clusty

Clusty is a versatile package facilitating rapid clustering across diverse data types, using six algorithms: single linkage, complete linkage, UCLUST12, greedy set cover9, CD-HIT13 and Leiden14. Our implementations of these algorithms were optimized for sparse distance matrices. Memory usage grows linearly with the number of stored distances, which allows the clustering of tens of millions of objects, provided the matrix remains sufficiently sparse.

Clusty uses threshold-based clustering, assigning an object to a cluster if its distance from the cluster does not exceed a user-defined threshold. Depending on the algorithm, this distance can refer to the closest, furthest or centroid member. While UCLUST, greedy set cover and CD-HIT are inherently threshold-based algorithms, single and complete linkage algorithms construct dendrograms that can be pruned at customizable distance thresholds. Clusty’s sparse data representation assumes that all input values meet the distance or similarity threshold. However, the tool allows clustering data at more stringent thresholds through additional filtering of any combination of distance/similarity values (for example, tANI, ANI and AF) and/or other measure values (for example, minimum/maximum number of alignments, minimum/maximum number of matched nucleotides). Consequently, the matrix provided to Clusty does not need to be sparse; the tool can handle dense matrices and apply filtering at the loading stage.

Clusty interprets input data as a graph, where vertices represent objects and edges represent connections. Extended Data Fig. 1 shows details of the clustering algorithms and their time complexities.
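For the simplest of the six algorithms, the graph view has a direct consequence: cutting a single-linkage dendrogram at a distance threshold gives the connected components of the graph whose edges are the pairs at or below that threshold. The Python sketch below illustrates this with a union-find structure over a sparse edge list. It is not Clusty’s implementation, and the function name and input layout are assumptions made for the example.

```python
def single_linkage(n_objects, edges, max_distance):
    """Cluster objects 0..n_objects-1 from sparse (i, j, distance) edges.

    Objects joined by a chain of edges with distance <= max_distance share
    a cluster; objects without such edges remain singletons.
    """
    parent = list(range(n_objects))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, distance in edges:
        if distance <= max_distance:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    # Relabel roots as consecutive cluster identifiers in order of appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_objects)]

# Distances are 1 - ANI; a 0.05 threshold corresponds to ANI >= 95%.
clusters = single_linkage(5, [(0, 1, 0.02), (1, 2, 0.04), (3, 4, 0.10)], 0.05)
```

Only the listed edges are stored, so memory scales with the number of retained pairs rather than with the square of the number of objects.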

Vclust implementation

Vclust is a Python tool integrating Kmer-db 2, LZ-ANI and Clusty for streamlined computation of intergenomic sequence similarities and clustering of viral genomes. Vclust provides three commands: ‘prefilter’, ‘align’ and ‘cluster’ (Fig. 1a). ‘prefilter’ and ‘align’ accept a single FASTA file containing viral genomic sequences or a directory of FASTA files (one genome per file), with support for gzipped inputs and outputs.

The ‘prefilter’ command uses Kmer-db 2 to screen out dissimilar genome pairs before alignment, reducing the number of genome pairs to only those with sufficient k-mer-based sequence similarity (that is, the minimum number of common k-mers and/or the minimum sequence identity). Sequence identity in Kmer-db 2 is calculated similarly to ANI in Mash (1 − Mash distance) but uses the overlap coefficient15 instead of the Jaccard index. The overlap coefficient measures the intersection size of two k-mer sets (representing two genomic sequences) relative to the smaller set size, rather than the union of both sets. As a result, sequence identity values in the prefiltering step are generally higher than ANI from the alignment step. This allows users to set the minimum sequence identity in prefiltering close to the final ANI threshold without risking the exclusion of relevant genome pairs; for example, if targeting an ANI threshold of 95% or higher, the minimum sequence identity can be set to approximately 0.95 (Supplementary Fig. 4).
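The difference between the two set measures is easiest to see for a fragment contained in a longer genome. The following Python sketch uses a toy sequence and a small k chosen only for illustration; it shows the two coefficients themselves and does not reproduce how Kmer-db 2 converts them into a sequence identity.

```python
def kmers(seq, k):
    """Return the set of all k-mers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Shared k-mers relative to the union of both sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def overlap_coefficient(a, b):
    """Shared k-mers relative to the smaller of the two sets."""
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0

genome = "ACGTTGCAAGGCTTAACCGGATCG"
contig = genome[4:16]  # a fragment fully contained in the genome

genome_kmers = kmers(genome, 4)
contig_kmers = kmers(contig, 4)

print(overlap_coefficient(genome_kmers, contig_kmers))
print(jaccard(genome_kmers, contig_kmers))
```

Every k-mer of the fragment also occurs in the genome, so the overlap coefficient is 1, whereas the Jaccard index is reduced by the genome’s unshared k-mers. This is why the overlap coefficient is less sensitive to length differences between a fragmented contig and a complete genome.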

The ‘align’ command uses LZ-ANI to perform pairwise sequence alignments and compute ANI and AF measures between genome pairs identified by the pre-alignment filter. If the filter is not provided, Vclust aligns all possible genome pairs. The output includes two TSV files that are used for clustering: one containing ANI measures for genome pairs and the other listing genome identifiers sorted by decreasing sequence length. Optionally, Vclust can output detailed alignment results in a TSV format similar to BLASTn/MegaBLAST, with coordinates, strand orientation, matched/mismatched nucleotides and sequence identity for each local alignment.

The ‘cluster’ command uses Clusty for genome clustering, allowing users to specify a similarity measure (for example, tANI, ANI) and its threshold for clustering genomes, with optional additional filtering thresholds for other similarity measures, including AF. Output includes a TSV file listing genome identifiers and numerical cluster identifiers (including identifiers for singleton genomes). Alternatively, Vclust can output representative genomes instead of numerical cluster identifiers, which is particularly useful for dereplication tasks.

Optimizing performance for highly redundant genome datasets

Vclust is designed for dereplication and clustering of viral sequences across a range of identity values. Computational performance may decline with datasets of highly redundant genome sequences (for example, tens of thousands of sequences from the same species; Supplementary Fig. 5). In all-versus-all pairwise genome comparisons in the ‘prefilter’ step, the high frequency of similar sequences expands the similarity matrix, increasing memory consumption and the number of pairs to align, which in turn raises computational demands for alignment and clustering. Vclust has three additional techniques to optimize performance and mitigate excessive resource consumption. First, it partitions a dataset into smaller, equally sized batches of genome sequences using the built-in multi-fasta-split C++ tool. This option considerably reduces memory requirements of the ‘prefilter’ step without altering results, although it may slightly increase runtime (Extended Data Fig. 4a). Second, Vclust can limit the number of k-mers analyzed from each genome sequence, reducing memory usage and runtime with minimal impact on sensitivity (Extended Data Fig. 4b). Third, similarly to MMseqs2 and BLAST-based methods, Vclust’s ‘prefilter’ can restrict the number of sequences reported per query genome by selecting those with the highest sequence identity, reducing the overall number of genome pairs passing initial similarity assessment.
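The third technique, keeping only the best-scoring hits per query, can be sketched in a few lines of Python. This is an illustration of the idea rather than the Kmer-db 2 or Vclust code; the function name and the (query, target, identity) tuple layout are assumptions made for the example.

```python
import heapq
from collections import defaultdict

def top_hits_per_query(pairs, max_hits):
    """Keep, for each query, at most `max_hits` targets with the highest identity."""
    by_query = defaultdict(list)
    for query, target, identity in pairs:
        by_query[query].append((identity, target))

    kept = []
    for query, hits in by_query.items():
        # nlargest returns the hits sorted by decreasing identity.
        for identity, target in heapq.nlargest(max_hits, hits):
            kept.append((query, target, identity))
    return kept

pairs = [
    ("q1", "a", 0.99),
    ("q1", "b", 0.80),
    ("q1", "c", 0.95),
    ("q2", "a", 0.70),
]
filtered = top_hits_per_query(pairs, max_hits=2)
```

With `max_hits=2`, the weakest hit of `q1` is dropped, so fewer pairs reach the alignment step. The trade-off is that genuine but lower-ranked relatives of a query in a highly redundant dataset are no longer aligned.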

Benchmarking

Running time

All runtimes were benchmarked on a workstation equipped with an AMD Epyc 9554 CPU (64 cores clocked at 3.1 GHz) and 1,152 GiB (approximately 1,237 GB) RAM. Unless otherwise specified, all tools were run using 64 threads. The exact commands are shown in Supplementary Tables 8, 9 and 12.

Evaluating tANI accuracy

The tANI accuracy of Vclust (v1.2.8), FastANI (v1.33)5, skani (v0.2.1)6 and VIRIDIC (v1.1)4 was assessed using two reference sets. In both reference datasets, VIRIDIC was run with default parameters (--word_size 7, --reward 2, --penalty 3, --gapopen 5, --gapextend 2) for highly sensitive BLASTn alignments. Similarly, skani was run in its most accurate mode optimized for small sequences (--slow, --s 0, --m 200). FastANI and Vclust were run with default parameters. The first reference dataset comprised 22,606 tANI values ranging from 70% to 100%, as determined by VIRIDIC across 4,244 complete genomes of bacteriophages affiliated with the ICTV using ICTV’s Virus Metadata Resource (VMR v38.3). Since FastANI and skani do not directly report tANI, their values were calculated from ANI, AF and genome lengths: \(\mathrm{tANI}=\frac{\mathrm{ANI}_{1}\times \mathrm{AF}_{1}\times \mathrm{LEN}_{1}+\mathrm{ANI}_{2}\times \mathrm{AF}_{2}\times \mathrm{LEN}_{2}}{\mathrm{LEN}_{1}+\mathrm{LEN}_{2}}\). The second reference set contained expected (true) tANI values in the 70–100% range, derived from 10,000 pairs of bacteriophage genomes subjected to simulated mutations, including different levels of substitution, insertion, deletion, duplication, inversion and translocation events. Specifically, we randomly selected 100 genomes from the bacteriophage dataset and generated 100 copies of each genome. For each genome copy, we introduced mutations using Mutation-Simulator (v3.0.2)16 by randomly selecting a combination of mutation events and their corresponding frequencies (Supplementary Table 1). The expected (true) tANI value between each copy and reference genome was determined based on the variant call format produced by Mutation-Simulator, describing the exact locations of introduced mutations and the number of altered nucleotides.
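The conversion used for FastANI and skani works because ANI × AF × LEN equals (M/L) × (L/LEN) × LEN = M, the number of matched nucleotides in one direction, so the numerator recovers M(A, B) + M(B, A). A minimal Python sketch follows; the function and argument names are hypothetical, and ANI and AF are assumed to be given as fractions rather than percentages.

```python
def tani_from_ani_af(ani1, af1, len1, ani2, af2, len2):
    """Derive tANI from directional ANI and AF values (fractions) and genome lengths."""
    matched1 = ani1 * af1 * len1  # estimated matches, genome 1 as query
    matched2 = ani2 * af2 * len2  # estimated matches, genome 2 as query
    return (matched1 + matched2) / (len1 + len2)

# Made-up example: 96% identity over half of a 1,000-nt genome and
# over a quarter of a 2,000-nt genome.
tani = tani_from_ani_af(0.96, 0.5, 1000, 0.96, 0.25, 2000)
```

Here each direction contributes 480 matched nucleotides, so tANI is 960/3,000 = 0.32.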

Evaluating ANI and AF accuracy

The ANI and AF values predicted by Vclust, FastANI, skani, MegaBLAST (v2.13.0+) and MMseqs2 (v2fad714b525f1975b62c2d2b5aff28274ad57466; ref. 9) were compared to reference ANI and AF values determined by BLASTn (v2.13.0+)8. Since running BLASTn on the entire IMG/VR v4.1 database was not feasible, we subsampled 94,225 viral contigs and performed an all-to-all BLASTn search to identify 4,361,743 contig pairs meeting the MIUViG thresholds (ANI ≥ 95% and AF ≥ 85%). MegaBLAST, MMseqs2 and BLASTn outputs were used by the anicalc script from CheckV (v1.0.3)3 to compute ANI and AF values. Pearson correlation and MAE between the predicted and expected ANI and AF values were calculated based on these 4,361,743 BLASTn-determined contig pairs. Given the high level of sequence identity of the reference contig pairs, if a tool did not return a result for a given contig pair, the ANI and AF values were set to zero for that pair.
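The effect of the zero-fill rule on the two statistics can be illustrated with a short pure-Python sketch. The dictionary layout (contig pair mapped to a value) and the function names are assumptions made for this example; they do not describe the scripts used in the benchmark.

```python
import math

def align_to_reference(reference, predicted):
    """Pair each reference value with the tool's value, using 0.0 for missing pairs."""
    keys = list(reference)
    return [reference[k] for k in keys], [predicted.get(k, 0.0) for k in keys]

def mean_absolute_error(expected, observed):
    """Average absolute difference between paired values."""
    return sum(abs(e - o) for e, o in zip(expected, observed)) / len(expected)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

reference = {("c1", "c2"): 0.98, ("c1", "c3"): 0.96, ("c2", "c3"): 0.97}
predicted = {("c1", "c2"): 0.97, ("c1", "c3"): 0.96}  # ("c2", "c3") was not reported

expected, observed = align_to_reference(reference, predicted)
mae = mean_absolute_error(expected, observed)
```

In this toy case the single unreported pair contributes an error of 0.97 and dominates the MAE, which is the intended penalty: a tool that misses a pair known to exceed the thresholds is scored as if it had reported no similarity.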

Evaluating clusterings

The agreement between clustering results from different tools and the reference clustering was assessed using the adjusted Rand index (ARI). ARI assesses clustering similarity by comparing the number of correct clustering overlaps and disagreements17 against those expected by chance. An ARI of 0 indicates random assignment, while a score of 1 indicates a perfect match. We used the scikit-learn (v1.3.2)18 implementation of the ARI.
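To make the chance correction concrete, the following pure-Python sketch computes the ARI from pair counts in the contingency table of two labelings. The benchmark itself used scikit-learn, where the corresponding function is `sklearn.metrics.adjusted_rand_score`; the version below is only an illustration and assumes at least two objects.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two clusterings given as equally long label lists."""
    n = len(labels_true)
    # Object pairs grouped together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Object pairs grouped together in each clustering separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)  # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, for example all singletons in both clusterings
    return (together_both - expected) / (maximum - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call returns 1.0 because cluster identifiers are arbitrary and only co-membership matters. The second returns a negative value (−0.5), showing that agreement worse than the chance expectation falls below 0.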

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.