Machine learning enables scalable and systematic hierarchical virus taxonomy

Abstract

Although virus ecogenomics has expanded access to and understanding of the virosphere, existing classification tools lack taxonomic resolution and are unable to scale to modern discovery-based datasets or classify previously unknown sequence space. Here we develop vConTACT3—a machine learning-based tool that improves scalability and accuracy of virus taxonomy. By optimizing gene-sharing thresholds and leveraging adaptive, realm-specific cut-offs, vConTACT3 expands classification to both eukaryote and prokaryote viruses for four of the six officially recognized realms, and establishes accurate hierarchical taxonomy from genus to order. Specifically, vConTACT3 achieves >95% agreement with official taxonomy for 35,545 and 13,524 public prokaryotic and eukaryotic virus genomes, respectively, to surpass vConTACT2 across most realms, while still uniquely classifying previously uncharacterized taxa, and doing so even faster. vConTACT3 application provides taxonomy assignments for tens of thousands of unclassified taxa rapidly, automatically and systematically; evaluates virus sequence space to reveal support for fewer taxonomic ranks than currently available and identifies taxonomically challenging areas across the virosphere.

Fig. 1: vConTACT3 new features overview.
Fig. 2: Benchmarking vConTACT3 with 60 million combinations.
Fig. 3: Comparing vConTACT3 versus ICTV taxonomic assignments.
Fig. 4: Assessment of the impact of genome fragments on taxonomic predictions.

Data availability

Data used for benchmarks (parameter optimizations), construction of databases and fine-tuning the pipeline are available from NCBI Virus RefSeq (v.218). Data used to test scalability, assess fragmentation and evaluate labeling stability are available from IMG/VR v.4.1 (December 2022 release). Databases used by vConTACT3 (as well as source files) are available via Zenodo at https://doi.org/10.5281/zenodo.10035619 and https://doi.org/10.5281/zenodo.10935513 (refs. 67,68).

Code availability

vConTACT3 is available via Bitbucket at https://bitbucket.org/MAVERICLab/vcontact3 (ref. 69) as an installable Python package, as well as through the Python package managers Anaconda (https://anaconda.org/) and Mamba (https://mamba.readthedocs.io/). Instructions for building an Apptainer container of vConTACT3 are available on Bitbucket, along with a definition file. Comprehensive documentation is available at https://vcontact3.readthedocs.io. Optimization benchmarks were performed using v.3.0.0b36 (‘beta’ v.36), fragmentation analyses using v.3.0.0b63 and label stability analyses using v.3.1.4. All other results (including tool comparisons) should be assumed to use v.3.0.0b36. Python 3.10 was used for all analyses. Data processing and analyses were conducted using numpy v.1.23.5, pandas v.2.1.1 and scipy v.1.10.1. Taxonomic parsing was handled by the ETE3 Toolkit v.3.1.3. Statistical analyses were performed using scikit-learn v.1.2.2 and scikit-bio v.0.5.8. Data visualizations were created using matplotlib v.3.7.1, seaborn v.0.12.1 and UpSetPlot v.0.7.0. Networks were rendered through a combination of Cytoscape v.3.10.1, networkx v.3.1 and python-igraph v.0.10.4. Gene predictions were done through pyrodigal v.2.3.0 and pyrodigal-gv v.0.3.1, and all sequence processing through biopython v.1.81. Protein clustering was done using MMseqs2 v.14-7e284.

References

  1. Roux, S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537, 689–693 (2016).

  2. Guidi, L. et al. Plankton networks driving carbon export in the oligotrophic ocean. Nature 532, 465–470 (2016).

  3. Zimmerman, A. E. et al. Metabolic and biogeochemical consequences of viral infection in aquatic ecosystems. Nat. Rev. Microbiol. https://doi.org/10.1038/s41579-019-0270-x (2020).

  4. Emerson, J. B. et al. Host-linked soil viral ecology along a permafrost thaw gradient. Nat. Microbiol. 3, 870–880 (2018).

  5. Jansson, J. K. & Wu, R. Soil viral diversity, ecology and climate change. Nat. Rev. Microbiol. 21, 296–311 (2023).

  6. Koskella, B. & Taylor, T. B. Multifaceted impacts of bacteriophages in the plant microbiome. Annu. Rev. Phytopathol. 56, 361–380 (2018).

  7. Yan, M. et al. Interrogating the viral dark matter of the rumen ecosystem with a global virome database. Nat. Commun. 14, 5254 (2023).

  8. Yan, M. & Yu, Z. Viruses contribute to microbial diversification in the rumen ecosystem and are associated with certain animal production traits. Microbiome 12, 82 (2024).

  9. Shkoporov, A. N. & Hill, C. Bacteriophages of the human gut: the “known unknown” of the microbiome. Cell Host Microbe 25, 195–209 (2019).

  10. Shkoporov, A. N., Turkington, C. J. & Hill, C. Mutualistic interplay between bacteriophages and bacteria in the human gut. Nat. Rev. Microbiol. 20, 737–749 (2022).

  11. Walker, P. J. et al. Changes to virus taxonomy and the Statutes ratified by the International Committee on Taxonomy of Viruses (2020). Arch. Virol. 165, 2737–2748 (2020).

  12. Walker, P. J. et al. Recent changes to virus taxonomy ratified by the International Committee on Taxonomy of Viruses (2022). Arch. Virol. 167, 2429–2440 (2022).

  13. Zerbini, F. M. et al. Changes to virus taxonomy and the ICTV Statutes ratified by the International Committee on Taxonomy of Viruses (2023). Arch. Virol. 168, 175 (2023).

  14. Gorbalenya, A. E. et al. The new scope of virus taxonomy: partitioning the virosphere into 15 hierarchical ranks. Nat. Microbiol. 5, 668–674 (2020).

  15. Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).

  16. Roux, S. et al. Minimum Information about an Uncultivated Virus Genome (MIUViG): a community consensus on standards and best practices for describing genome sequences from uncultivated viruses. Nat. Biotechnol. 37, 29–37 (2018).

  17. Simmonds, P. et al. Consensus statement: virus taxonomy in the age of metagenomics. Nat. Rev. Microbiol. 15, 161–168 (2017).

  18. Simmonds, P. et al. Four principles to establish a universal virus taxonomy. PLoS Biol. 21, e3001922 (2023).

  19. Dutilh, B. E. et al. Perspective on taxonomic classification of uncultivated viruses. Curr. Opin. Virol. 51, 207–215 (2021).

  20. Koonin, E. V., Senkevich, T. G. & Dolja, V. V. The ancient Virus World and evolution of cells. Biol. Direct 1, 29 (2006).

  21. Holmes, E. C. What does virus evolution tell us about virus origins? J. Virol. 85, 5247–5251 (2011).

  22. Koonin, E. V. & Dolja, V. V. Virus World as an evolutionary network of viruses and capsidless selfish elements. Microbiol. Mol. Biol. Rev. 78, 278–303 (2014).

  23. Moraru, C. VirClust—a tool for hierarchical clustering, core protein detection and annotation of (prokaryotic) viruses. Viruses 15, 1007 (2023).

  24. Aiewsakun, P. & Simmonds, P. The genomic underpinnings of eukaryotic virus taxonomy: creating a sequence-based framework for family-level virus classification. Microbiome 6, 38 (2018).

  25. Pons, J. C. et al. VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics https://doi.org/10.1093/bioinformatics/btab026 (2021).

  26. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).

  27. Moraru, C., Varsani, A. & Kropinski, A. M. VIRIDIC—a novel tool to calculate the intergenomic similarities of prokaryote-infecting viruses. Viruses 12, 1268 (2020).

  28. Bao, Y., Chetvernin, V. & Tatusova, T. Improvements to pairwise sequence comparison (PASC): a genome-based web tool for virus classification. Arch. Virol. 159, 3293–3304 (2014).

  29. Tisza, M. J., Belford, A. K., Domínguez-Huerta, G., Bolduc, B. & Buck, C. B. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 7, veaa100 (2021).

  30. Lima-Mendez, G., Van Helden, J., Toussaint, A. & Leplae, R. Reticulate representation of evolutionary and functional relationships between phage genomes. Mol. Biol. Evol. 25, 762–777 (2008).

  31. Bolduc, B. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ 5, e3243 (2017).

  32. Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).

  33. Barylski, J. et al. Analysis of Spounaviruses as a case study for the overdue reclassification of tailed phages. Syst. Biol. 69, 110–123 (2020).

  34. Turner, D. et al. Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee. Arch. Virol. 168, 74 (2023).

  35. Van Dongen, S. Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 30, 121–141 (2008).

  36. Gorbalenya, A. E. & Lauber, C. Bioinformatics of virus taxonomy: foundations and tools for developing sequence-based hierarchical classification. Curr. Opin. Virol. 52, 48–56 (2022).

  37. Wertheim, J. O., Steel, M. & Sanderson, M. J. Accuracy in near-perfect virus phylogenies. Syst. Biol. 71, 426–438 (2022).

  38. Meier-Kolthoff, J. P. & Göker, M. VICTOR: genome-based phylogeny and classification of prokaryotic viruses. Bioinformatics 33, 3396–3404 (2017).

  39. Gregory, A. C. et al. Genomic differentiation among wild cyanophages despite widespread horizontal gene transfer. BMC Genomics 17, 930 (2016).

  40. Bobay, L. & Ochman, H. Biological species in the viral world. Proc. Natl Acad. Sci. USA 115, 6040–6045 (2018).

  41. Ndovie, W. et al. Exploration of the genetic landscape of bacterial dsDNA viruses reveals an ANI gap amid extensive mosaicism. mSystems https://doi.org/10.1128/msystems.01661-24 (2025).

  42. Cook, R. et al. INfrastructure for a PHAge REference Database: identification of large-scale biases in the current collection of cultured phage genomes. PHAGE 2, 214–223 (2021).

  43. Nelson, D. Phage taxonomy: we agree to disagree. J. Bacteriol. 186, 7029–7031 (2004).

  44. Krupovic, M., Quemin, E. R. J., Bamford, D. H., Forterre, P. & Prangishvili, D. Unification of the globally distributed spindle-shaped viruses of the Archaea. J. Virol. 88, 2354–2358 (2014).

  45. Rokyta, D. R., Burch, C. L., Caudle, S. B. & Wichman, H. A. Horizontal gene transfer and the evolution of microvirid coliphage genomes. J. Bacteriol. 188, 1134–1142 (2006).

  46. Dominguez-Huerta, G. et al. Diversity and ecological footprint of Global Ocean RNA viruses. Science 376, 1202–1208 (2022).

  47. Gregory, A. C. et al. Marine DNA viral macro- and microdiversity from Pole to Pole. Cell 177, 1109–1123 (2019).

  48. Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740 (2020).

  49. Graham, E. B. et al. A global atlas of soil viruses reveals unexplored biodiversity and potential biogeochemical impacts. Nat. Microbiol. 9, 1873–1883 (2024).

  50. Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).

  51. Shi, M. et al. Redefining the invertebrate RNA virosphere. Nature 540, 539–543 (2016).

  52. Schulz, F. et al. Giant virus diversity and host interactions through global metagenomics. Nature 578, 432–436 (2020).

  53. Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109 (2021).

  54. Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).

  55. Strehl, A. & Ghosh, J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2003).

  56. Larralde, M. Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. J. Open Source Softw. 7, 4296 (2022).

  57. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 11, 119 (2010).

  58. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  59. Staudt, C. L., Sazonovs, A. & Meyerhenke, H. NetworKit: a tool suite for large-scale complex network analysis. Netw. Sci. 4, 508–530 (2016).

  60. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).

  61. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

  62. Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).

  63. Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).

  64. Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).

  65. Letunic, I. & Bork, P. Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Res. 52, W78–W82 (2024).

  66. Millard, A. et al. taxmyPHAGE: Automated taxonomy of dsDNA phage genomes at the genus and species level. Phage (New Rochelle) 6, 5–11 (2025).

  67. Bolduc, B. vConTACT3 database v.220. Zenodo https://doi.org/10.5281/zenodo.10035618 (2023).

  68. Bolduc, B. vConTACT3 database v.223. Zenodo https://doi.org/10.5281/zenodo.10935512 (2024).

  69. Bolduc, B. vConTACT3 database v.223 (software repository). Bitbucket https://bitbucket.org/MAVERICLab/vcontact3/src/master/ (2025).

Acknowledgements

This work was supported by the National Science Foundation under Grants No. DBI-2149505 (iVirus2) and DBI-2022070 (BII-Implementation: the EMERGE Institute). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, under Award Number DE-SC0023307. High-performance computing was provided by the Ohio Supercomputer Center. Additional support was provided by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (EXC 2051) Project-ID 390713860, the European Research Council (ERC) Consolidator grant 865694: DiversiPHI, the Alexander von Humboldt Foundation in the context of an Alexander von Humboldt-Professorship funded by the German Federal Ministry of Education and Research, and the European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF). E.M.A. gratefully acknowledges the support of the Biotechnology and Biological Sciences Research Council (BBSRC); this research was funded by the BBSRC Institute Strategic Programme Food Microbiome and Health BB/X011054/1 and its constituent projects BBS/E/QU/230001B and BBS/E/QU/230001D, as well as the BBSRC Institute Strategic Programme Microbes and Food Safety BB/X011011/1 and its constituent projects BBS/E/QU/230002A, BBS/E/QU/230002B and BBS/E/QU/230002C.

Author information

Contributions

B.B. and M.B.S. designed the study. B.B., O.Z. and M.B.S. wrote the manuscript with substantial contributions from all authors. D.T., H.B.J. and B.B. performed the phylogenetic analyses and B.E.D., D.T., H.B.J. and B.B. performed the statistical and network analyses. J.G., E.M.A. and B.B. evaluated the distance metrics. B.B. developed the code with contributions from J.G.

Corresponding authors

Correspondence to Benjamin Bolduc or Matthew B. Sullivan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Alexander Gorbalenya, Arthur Gruber and Guanxiang Liang for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Detailed vConTACT3 workflow and outputs.

User genomes are provided to vConTACT3. Open reading frames (ORFs) are predicted with Prodigal and then clustered with MMseqs2 at 5 identity thresholds (30%, 40%, 50%, 60% and 70%). Each clustering identity yields its own protein cluster (PC) profile, giving 5 PC profiles in total. These 5 PC profiles are sent to the “Resolver”, which constructs a distance matrix (per profile) based on the selected distance metric (default “SqRoot”; see Methods). Each distance matrix is subsequently converted into a network, which is annotated with any available genome information, such as realm-unique genes and proximity to reference sequences. The network is then filtered by the minimum number of shared genes allowed between genomes (see main text for details on recommended values), and dropped edges are then “repaired” if the genomes they connect are within the same connected component. Additionally, users can select a high-accuracy repair, which reviews dropped edges more carefully at the expense of computation time. The final stage of the Resolver predicts the virus realm associated with each genome using network edge-connected references and/or the presence of PCs exclusively identified and co-shared with references. Entirely novel genomes are assigned a default, user-selectable realm. The 5 filtered and repaired networks are then sent to the guilt-by-genome-association (GBGA) assigner. The GBGA assigner aggregates realm predictions per genome and per network component (combining both genome and network information) and assigns a final realm to each genome. This realm prediction is used to select the optimal distance cutoff for each virus rank (genus, subfamily, family and order), as determined by benchmarks (see Methods), and to hierarchically cluster all genomes of that realm at that cutoff. These clusters are then matched against references (if co-clustered and/or available) or used as novel ranks to assign each virus rank from order downward. (The upper ranks of phylum and class inherit reference-based assignments within the predicted realm.) After assignment, GBGA output is compared with reference sequences (if available) and performance metrics are calculated. Finally, PC profiles, performance metrics and GBGA outputs are integrated into the exports/results component, which provides user-controlled outputs in Cytoscape format, a d3.js interactive HTML network, UpSet plots, profiles and Newick-formatted dendrograms.
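The distance-then-cluster step described in this caption can be illustrated with a minimal Python sketch. It is not the vConTACT3 implementation: it uses the Jaccard distance (one of the benchmarked metrics) rather than the default “SqRoot” metric, which is defined only in Methods, and it assumes single-linkage clustering because the linkage method is not stated here. The genome names, PC identifiers and the cutoffs 0.5 and 0.9 are hypothetical; the actual cutoffs are realm-specific and benchmark-derived.

```python
from itertools import combinations


def jaccard_distance(pcs_a, pcs_b):
    """Return 1 - |A & B| / |A | B| for two sets of protein-cluster (PC) IDs."""
    union = pcs_a | pcs_b
    if not union:
        return 0.0
    return 1.0 - len(pcs_a & pcs_b) / len(union)


def cluster_at_cutoff(profiles, cutoff):
    """Single-linkage clusters (an assumption): merge genomes with distance <= cutoff."""
    parent = {genome: genome for genome in profiles}

    def find(genome):
        # Union-find root lookup with path halving.
        while parent[genome] != genome:
            parent[genome] = parent[parent[genome]]
            genome = parent[genome]
        return genome

    for a, b in combinations(profiles, 2):
        if jaccard_distance(profiles[a], profiles[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for genome in profiles:
        clusters.setdefault(find(genome), set()).add(genome)
    return sorted(clusters.values(), key=lambda members: sorted(members))


# Hypothetical PC profiles: genome -> set of PC identifiers.
profiles = {
    "genome_A": {"PC1", "PC2", "PC3", "PC4"},
    "genome_B": {"PC1", "PC2", "PC3", "PC5"},
    "genome_C": {"PC1", "PC6", "PC7", "PC8"},
    "genome_D": {"PC9", "PC10"},
}

# Hypothetical cutoffs: a tight one for a lower rank, a loose one for a higher rank.
lower_rank_clusters = cluster_at_cutoff(profiles, cutoff=0.5)
higher_rank_clusters = cluster_at_cutoff(profiles, cutoff=0.9)
print(lower_rank_clusters)
print(higher_rank_clusters)
```

Because a looser cutoff can only merge clusters formed at a tighter one, the clusters in this sketch nest, which is the property a genus-to-order hierarchy relies on.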

Extended Data Fig. 2 Prokaryotic-infecting viruses of the Duplodnaviria realm.

A 5×7 grid of line plots representing accuracies of Duplodnaviria viruses infecting prokaryotes (Bacteria and Archaea). From left to right, plots increase in the minimum number of shared genes required to establish an edge (relationship) between genomes, that is, for the genomes to be considered related. The top left plot requires a minimum of 1 shared gene and the top right plot a minimum of 5, with each column to the right increasing the minimum by 1. From top to bottom, plots increase in the minimum clustering identity used during MMseqs2 clustering to establish protein clusters (PCs). Because PCs are used to determine the number of shared genes between genomes, increasing the clustering identity increases the stringency required for two genes from two separate genomes to be considered shared and, thus, related. The top left plot uses 30% clustering identity and the bottom left plot 90%, with each row downward increasing clustering identity by 10%. Within each plot, accuracy (Y-axis) is a measure of agreement between NCBI taxonomy and vConTACT3 predictions. Pairwise distance cutoff (X-axis) represents the cutoff threshold used during hierarchical clustering to define clusters. The cutoff ranges from 0 to 0.99, with 0 representing completely identical PC profiles and 0.99 representing nearly no shared genes. Line colors represent taxonomic rank, and dashed line styles represent the distance metric (Jaccard, “SqRoot”, “VirClust” and “Shorter”) employed. See Methods for a description of each distance metric.
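The interaction between the two grid axes (minimum shared genes and clustering identity) can be sketched as follows. This is an illustrative toy, not vConTACT3 code: the function name `shared_pc_edges`, the genomes and the PC identifiers are hypothetical, and the "strict" profiles simply assume that a higher clustering identity splits divergent homologs into separate PCs.

```python
from itertools import combinations


def shared_pc_edges(profiles, min_shared):
    """Map each genome pair sharing at least min_shared PCs to its shared-PC count."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        n_shared = len(profiles[a] & profiles[b])
        if n_shared >= min_shared:
            edges[(a, b)] = n_shared
    return edges


# Hypothetical PC profiles at a loose clustering identity.
profiles_loose = {
    "genome_A": {"PC1", "PC2", "PC3"},
    "genome_B": {"PC1", "PC2", "PC3"},
    "genome_C": {"PC1", "PC4"},
}

# The same genomes at a stricter identity: PC1 and PC3 are assumed to split.
profiles_strict = {
    "genome_A": {"PC1a", "PC2", "PC3a"},
    "genome_B": {"PC1a", "PC2", "PC3b"},
    "genome_C": {"PC1b", "PC4"},
}

# Number of surviving edges for each minimum-shared-genes threshold (1 to 5).
edge_counts = {}
for min_shared in range(1, 6):
    n_loose = len(shared_pc_edges(profiles_loose, min_shared))
    n_strict = len(shared_pc_edges(profiles_strict, min_shared))
    edge_counts[min_shared] = (n_loose, n_strict)
    print(min_shared, n_loose, n_strict)
```

In this toy example, raising either the minimum number of shared genes or the clustering identity removes edges, so both axes of the grid act as stringency controls on which genomes are considered related.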

Extended Data Fig. 3 Eukaryotic-infecting viruses of the Duplodnaviria realm.

A 5×7 grid of line plots representing accuracies of Duplodnaviria viruses infecting Eukaryota. Details are as Extended Data Fig. 2.

Extended Data Fig. 4 Prokaryotic-infecting viruses of the Adnaviria and Varidnaviria realms.

A 5×7 grid of line plots representing accuracies of Adnaviria and Varidnaviria viruses infecting prokaryotes (Bacteria and Archaea). Details are as Extended Data Fig. 2.

Extended Data Fig. 5 Eukaryotic-infecting viruses of the Adnaviria and Varidnaviria realms.

A 5×7 grid of line plots representing accuracies of Adnaviria and Varidnaviria viruses infecting Eukaryota. Details are as Extended Data Fig. 2.

Extended Data Fig. 6 Comparison of distance cutoffs for Duplodnaviria and Adnaviria & Varidnaviria between domains.

Accuracy plots showing the similarity in optimal cutoffs between realms, with prokaryote-infecting viruses appearing downshifted relative to eukaryote-infecting viruses.

Extended Data Fig. 7 Prokaryotic-infecting viruses of the Monodnaviria realm.

A 5×5 grid of line plots representing accuracies of Monodnaviria viruses infecting prokaryotes (Bacteria and Archaea). Details are as Extended Data Fig. 2.

Extended Data Fig. 8 Eukaryotic-infecting viruses of the Monodnaviria realm.

A 5×5 grid of line plots representing accuracies of Monodnaviria viruses infecting Eukaryota. Details are as Extended Data Fig. 2.

Extended Data Fig. 9 Prediction labeling stability.

Line plots of the adjusted Rand index (ARI) and normalized mutual information (NMI) for a set of prediction labels. Colors indicate rank; dashed lines exclude singleton groups from the analysis, whereas solid lines include singletons. In “successive” plots, ARI/NMI is calculated between consecutive fractions; in “cumulative” plots, ARI/NMI is calculated against the final (100%) fraction. High ARI and NMI indicate high agreement between genome predictions (that is, labels) as more data are added, with an ARI and NMI of 1.00 indicating perfect agreement of labels between datasets.
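The difference between the “successive” and “cumulative” comparisons can be sketched with a small pure-Python ARI calculation. This is a hedged illustration rather than the authors' analysis code: the fractions, genome names and labels are hypothetical, and comparing only the genomes present in both fractions is an assumption. The article reports using scikit-learn for statistical analyses; its `adjusted_rand_score` computes the same index. NMI is omitted for brevity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster
    return (sum_pairs - expected) / (max_index - expected)


def ari_on_shared(pred_a, pred_b):
    """ARI restricted to genomes present in both prediction sets (an assumption)."""
    shared = sorted(pred_a.keys() & pred_b.keys())
    return adjusted_rand_index(
        [pred_a[g] for g in shared], [pred_b[g] for g in shared]
    )


# Hypothetical genus labels predicted as progressively more genomes are added.
fractions = [
    {"g1": "A", "g2": "A", "g3": "B", "g4": "B"},
    {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C", "g6": "C"},
    {"g1": "A", "g2": "A", "g3": "A", "g4": "B",
     "g5": "C", "g6": "C", "g7": "D", "g8": "D"},
]

# "Successive": each fraction against the next one.
successive = [
    ari_on_shared(fractions[i], fractions[i + 1]) for i in range(len(fractions) - 1)
]
# "Cumulative": each fraction against the final (100%) fraction.
cumulative = [ari_on_shared(f, fractions[-1]) for f in fractions]
print(successive)
print(cumulative)
```

In this toy data, one genome (`g3`) changes label only in the final fraction, so the first successive comparison shows perfect agreement while the cumulative comparison of the first fraction against the final one does not.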

Extended Data Table 1 vConTACT3 accuracies of Eukaryote-infecting viruses

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bolduc, B., Zablocki, O., Turner, D. et al. Machine learning enables scalable and systematic hierarchical virus taxonomy. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02946-9

  • DOI: https://doi.org/10.1038/s41587-025-02946-9
