Fig. 1: Workflow for constructing VFDB 2.0 and the MetaVF toolkit.

A The expanded virulence factor gene (VFG) database (VFDB 2.0) was built as follows. First, the species-specific average nucleotide identity (ssANI) was calculated from complete genomes in the NCBI RefSeq database. Second, the VFGs were expanded by BLAST searching 18,521 complete genomes against the curated VFDB (top panel), and the hits were filtered to obtain a redundant dataset of 467,428 alleles and orthologues. This redundant dataset was used to generate an annotation dataset, carrying host species and mobility information, for VFG annotation in the MetaVF toolkit. Removing redundant VFG sequences from the redundant dataset yielded the expanded alignment dataset, which is used for sequence alignment in the MetaVF toolkit. Third, the VFG alleles from pathogen strains were collected into the pathogenic alignment dataset. Finally, the annotation dataset and alignment dataset were integrated into VFDB 2.0. B The MetaVF toolkit consists of two pipelines. Pipeline 1 handles short-read metagenomic data: trimmed reads are mapped to the expanded alignment dataset, and the mappings are filtered at a 90% identity threshold. Pipeline 2 handles HiFi long-read metagenomic data or draft genomes, which are searched with BLASTN against the pathogenic alignment dataset. Finally, the relative abundance, coverage, host species, mobility, and VF categories of the detected VFGs are summarized based on VFDB 2.0.
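
The identity filtering and summary steps in panel B can be illustrated with a minimal Python sketch. It is not the MetaVF implementation. It assumes hits are available in the standard BLAST tabular layout (`-outfmt 6`), applies the 90% cutoff inclusively (the caption states that cutoff for Pipeline 1 only, so using it on BLAST-style hits is an assumption), and defines relative abundance as a simple fraction of retained hits per VFG, which is also an assumption. The function names and the sample identifiers are hypothetical.

```python
from collections import defaultdict

# Column order of standard BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def parse_blast6(lines):
    """Yield one dict per data line, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(BLAST6_FIELDS, line.split("\t")))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        yield row


def filter_by_identity(rows, min_identity=90.0):
    """Keep hits whose percent identity is at least min_identity.

    Assumption: the threshold is inclusive (>=); the caption does not say.
    """
    return [row for row in rows if row["pident"] >= min_identity]


def relative_abundance(rows):
    """Return the fraction of retained hits assigned to each subject (VFG).

    Assumption: plain hit-count normalization, with no correction for
    gene length or sequencing depth.
    """
    counts = defaultdict(int)
    for row in rows:
        counts[row["sseqid"]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vfg: n / total for vfg, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical hits; the identifiers are placeholders, not VFDB entries.
    sample = [
        "read1\tVFG_A\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t180",
        "read2\tVFG_A\t89.9\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
        "read3\tVFG_B\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
    ]
    kept = filter_by_identity(parse_blast6(sample))
    print(relative_abundance(kept))
```

A production pipeline would normally also normalize by gene length and sequencing depth and attach host species and mobility annotations from the annotation dataset; those steps are omitted here because the caption does not specify them.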