arising from R. Li et al. Nature Communications https://doi.org/10.1038/s41467-022-35735-y (2022)
Inclusion of curated archaeal virus genomes in public databases is a critical step towards uncovering the distribution and evolution of archaeal viruses in the microbiome1. In a recent study, Li et al.2 created the Human Gut Archaeal Virome Database (HGAVD), which is claimed to comprise genomes of 1279 species of archaeal viruses, representing a >13-fold increase in archaeal virus diversity compared to previous studies3,4,5. However, our re-analysis of the HGAVD revealed extensive contamination from Bacteria and Archaea, with 72–83% of sequences classified as non-viral by six different viral prediction tools. An improved reference database of archaeal genomes is needed to avoid propagation of errors in future studies and to accurately characterize the role of archaeal viruses in the microbiome.
Intrigued by the large expansion in archaeal viral diversity relative to recent studies, we performed a post-hoc analysis of the HGAVD using six state-of-the-art computational tools with default parameters: CheckV v1.0.1 (ref. 6), geNomad v1.5.0 (ref. 7), VIBRANT v1.2.1 (ref. 8), ViralVerify v1.1 (ref. 9), VirSorter v1.0.6 (ref. 10), and VirSorter2 v2.2.4 (ref. 11). This analysis showed that the HGAVD is composed primarily of non-viral sequences (Fig. 1A, B and Supplementary Data 1). Of the 1279 sequences in the HGAVD, only 30.88% were predicted as a virus or provirus by any of the six tools and only 14.46% by all six. While archaeal viruses may be more challenging to detect in microbiome samples1, nearly all non-viral HGAVD sequences (985 of 987) were confidently assigned as either Archaea or Bacteria by geNomad, as opposed to other mobile genetic elements. Viral classification can be challenging for very short sequences, but even long HGAVD sequences (>10 kbp) were found to contain tens to hundreds of host-specific genes while lacking any virus-specific gene (Fig. 1C). Together, the computational tools we used sensitively classified 91 of 92 archaeal viruses from NCBI RefSeq as viral, indicating that our result is not a byproduct of false negatives (Supplementary Data 2).
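The "any tool" and "all tools" percentages above are simple set operations over per-tool predictions. The following is a minimal sketch of that calculation, not the authors' pipeline: the function name, tool labels, and contig identifiers are hypothetical, and the toy data are made up for illustration.

```python
from typing import Dict, Set, Tuple


def consensus_fractions(calls: Dict[str, Set[str]], all_ids: Set[str]) -> Tuple[float, float]:
    """Return (fraction called viral by ANY tool, fraction called viral by ALL tools).

    `calls` maps a tool name to the set of sequence IDs that tool predicted
    as virus or provirus; `all_ids` is the full set of sequences analysed.
    """
    # Union: a sequence counts if at least one tool called it viral.
    any_tool = set().union(*calls.values()) & all_ids
    # Intersection: a sequence counts only if every tool called it viral.
    all_tools = set.intersection(*calls.values()) & all_ids
    n = len(all_ids)
    return len(any_tool) / n, len(all_tools) / n


# Toy input (hypothetical): five contigs, three tools.
all_ids = {"c1", "c2", "c3", "c4", "c5"}
calls = {
    "toolA": {"c1", "c2"},
    "toolB": {"c1"},
    "toolC": {"c1", "c3"},
}
any_frac, all_frac = consensus_fractions(calls, all_ids)
print(f"any tool: {any_frac:.2%}, all tools: {all_frac:.2%}")
```

In practice the per-tool sets would be parsed from each tool's own output files, whose formats differ and are not reproduced here.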
Fig. 1. A Prediction results of six different viral classifiers for the HGAVD. Predicted proviruses were counted as viruses. B UpSet plot representing the number of shared/unique viral predictions across the six tools. C Counts of virus-specific and host-specific proteins identified on 989 HGAVD contigs longer than 10 kbp. Sequences are sorted from longest to shortest. The longest HGAVD sequences have numerous host-specific genes and few viral genes. D geNomad viral predictions for HGAVD contigs based on data source. Most false positives originate from bulk metagenomes not included in previously published viral genome catalogs.
Next, we searched for the source of the prediction error. To identify viruses, Li et al. used a combination of sequence matches to putative viral signature genes and sequence matches to archaeal CRISPR spacers. Most of the Li et al. signature genes matched two other viral databases (VOGDB, http://vogdb.org/, and VPF12), confirming their viral origin, and most HGAVD sequences contained matches to Li et al. signature genes. However, only 27.36% of HGAVD sequences contained matches to virus-specific genes from three curated databases (CheckV, geNomad, and VirSorter2), suggesting that many of the putative signature genes from Li et al. are not specific to viruses. We also confirmed that nearly all HGAVD sequences contained matches to archaeal CRISPR spacers (see Supplementary Information). It is known that CRISPR spacers sometimes target chromosomal genes that are involved in plasmid conjugation or replication13 and that viruses often exchange genes with their hosts14. Thus, neither signal is sufficient on its own for accurate virus classification. To remove non-viral sequences, Li et al. relied on alignment to genomes of gut-isolated archaea (n = 35) and bacteria (n = 10,613). However, when we aligned the HGAVD to a larger collection of 1825 archaeal genomes from RefSeq, 59.5% of HGAVD sequences contained a match with >90% identity over >90% of the sequence length. Consistent with this result, we found that most of the non-viral sequences in the HGAVD were identified from bulk metagenomes (containing a mixture of sequences from viruses and cellular organisms) as opposed to previously published databases of curated viral genomes (Fig. 1D).
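The ">90% identity over >90% of the sequence length" criterion can be illustrated with a short sketch. This is one possible reading, in which overlapping high-identity alignments to a host genome are merged before coverage is computed; the exact procedure used in the analysis is not described here. The `Hit` tuple layout, function names, and example coordinates are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical hit record: (query_start, query_end, percent_identity),
# with 1-based inclusive coordinates on the query (HGAVD) sequence.
Hit = Tuple[int, int, float]


def covered_fraction(hits: List[Hit], query_len: int, min_identity: float = 90.0) -> float:
    """Fraction of the query covered by hits whose identity exceeds min_identity."""
    # Keep only high-identity hits and normalise reverse-strand coordinates.
    spans = sorted((min(s, e), max(s, e)) for s, e, pid in hits if pid > min_identity)
    covered = 0
    last_end = 0  # last query position already counted
    for start, end in spans:
        start = max(start, last_end + 1)  # skip positions counted by earlier spans
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_len


def looks_cellular(hits: List[Hit], query_len: int) -> bool:
    """Flag a sequence as a likely host match: >90% identity over >90% of its length."""
    return covered_fraction(hits, query_len) > 0.90


# Toy example: two overlapping high-identity hits and one low-identity hit.
example_hits = [(1, 500, 99.0), (400, 950, 97.5), (900, 1000, 80.0)]
print(covered_fraction(example_hits, 1000), looks_cellular(example_hits, 1000))
```

Merging intervals avoids double-counting positions where alignments overlap. A stricter variant would require a single alignment to satisfy both thresholds.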
As an illustrative example, the largest sequence in the HGAVD was 560,083 bp, which would make it larger than both the largest virus genome previously discovered from the human gut microbiome (553,716 bp; ref. 4) and the largest sequenced genome of any archaeal virus (216,805 bp; ref. 15). However, alignment against NCBI RefSeq revealed a robust match to the archaeal type strain Methanobrevibacter smithii ATCC 35061 (99% identity over 93% of the sequence length), and visual inspection16 revealed numerous genes for host metabolism and cellular processes, including 16S rRNA (Fig. S1). While the sequence did contain CRISPR spacer matches, no prophages could be identified using geNomad or VIBRANT, and no virus-specific genes were identified by either geNomad or VirSorter2.
Together, our analyses clearly demonstrate that the sequences reported by Li et al. are highly contaminated by cellular organisms and should not be utilized as a reference database for viral analyses. A more careful and systematic analysis is needed to accurately characterize the diversity of archaeal viruses in the human gastrointestinal tract and establish a high-quality reference collection. While novel approaches for viral detection can yield new discoveries, they should be carefully benchmarked in terms of sensitivity and specificity. In the absence of such benchmarking, we recommend using well-established virus detection tools, like geNomad or VirSorter2, which can distinguish sequences of viruses from cellular organisms and other mobile genetic elements12.
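Benchmarking a viral detection approach for sensitivity and specificity amounts to comparing its calls against sequences of known origin. The sketch below is a generic illustration under stated assumptions: the function name is hypothetical, the 92 positives mirror the RefSeq archaeal virus check reported above (91 of 92 detected), and the 100 non-viral negatives with 2 false positives are made-up numbers included only so that specificity can be computed.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(truth: Iterable[bool], predicted: Iterable[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for paired truth labels and predictions.

    True means "virus"; False means "non-viral" (for example, a cellular genome).
    """
    tp = fn = tn = fp = 0
    for is_virus, called_virus in zip(truth, predicted):
        if is_virus and called_virus:
            tp += 1
        elif is_virus:
            fn += 1
        elif called_virus:
            fp += 1
        else:
            tn += 1
    # Sensitivity: share of true viruses detected.
    # Specificity: share of non-viral sequences correctly rejected.
    return tp / (tp + fn), tn / (tn + fp)


# 92 known viruses (91 detected), plus 100 hypothetical non-viral sequences
# (98 correctly rejected, 2 wrongly called viral).
truth = [True] * 92 + [False] * 100
predicted = [True] * 91 + [False] + [False] * 98 + [True] * 2
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

A realistic benchmark would draw its negative set from cellular genomes and other mobile genetic elements, since those are the sequences a viral classifier is most likely to confuse with viruses.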
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Datasets generated and/or analyzed during the current study are available as supplementary data.
Code availability
Code to reproduce the analyses in this paper is available at: https://github.com/CynthiaChibani/HGUT_arch_viruses.
References
1. Moissl-Eichinger, C. et al. Archaea are interactive components of complex microbiomes. Trends Microbiol. 26, 70–85 (2018).
2. Li, R., Wang, Y., Hu, H., Tan, Y. & Ma, Y. Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut. Nat. Commun. 13, 7978 (2022).
3. Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e8 (2020).
4. Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
5. Chibani, C. M. et al. A catalogue of 1,167 genomes from the human gut archaeome. Nat. Microbiol. 7, 48–61 (2022).
6. Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
7. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01953-y (2023).
8. Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 1–23 (2020).
9. Antipov, D., Raiko, M., Lapidus, A. & Pevzner, P. A. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics 36, 4126–4129 (2020).
10. Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: mining viral signal from microbial genomic data. PeerJ 3, e985 (2015).
11. Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 1–13 (2021).
12. Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).
13. Shmakov, S. A. et al. The CRISPR spacer space is dominated by sequences from species-specific mobilomes. mBio 8, e01397–17 (2017).
14. Wagner, A. et al. Mechanisms of gene flow in archaea. Nat. Rev. Microbiol. 15, 492–501 (2017).
15. Atanasova, N. S., Roine, E., Oren, A., Bamford, D. H. & Oksanen, H. M. Global network of specific virus–host interactions in hypersaline environments. Environ. Microbiol. 14, 426–440 (2012).
16. Conant, G. C. & Wolfe, K. H. GenomeVx: simple web-based creation of editable circular chromosome maps. Bioinformatics 24, 861–862 (2008).
Acknowledgements
R.A.S. and C.M.C. thank the Deutsche Forschungsgemeinschaft (DFG) for their financial support (SPP2330 and SCHM1052/26-1). In addition, S.A.S. is a recipient of a Novo Nordisk Foundation project grant in basic bioscience (NNF18OC0052965).
Author information
Contributions
S.N. and C.M.C. conducted experiments, analyzed data, and drafted the manuscript. S.S. and R.A.S. provided feedback and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chibani, C.M., Shah, S.A., Schmitz, R.A. et al. Inaccurate viral prediction leads to overestimated diversity of the archaeal virome in the human gut. Nat Commun 15, 5976 (2024). https://doi.org/10.1038/s41467-024-49902-w
DOI: https://doi.org/10.1038/s41467-024-49902-w
