An expanded database and analytical toolkit for identifying bacterial virulence factors and their associations with chronic diseases

Dong, Wanting; Fan, Xinyue; Guo, Yaqiong; Wang, Siyi; Jia, Shulei; Lv, Na; Yuan, Tao; Pan, Yuanlong; Xue, Yong; Chen, Xi; Xiong, Qian; Yang, Ruifu; Zhao, Weigang; Zhu, Baoli

doi:10.1038/s41467-024-51864-y

Article
Open access
Published: 15 September 2024

Wanting Dong^1,2,
Xinyue Fan¹,
Yaqiong Guo ORCID: orcid.org/0000-0002-3901-2549³,
Siyi Wang^1,2,
Shulei Jia⁴,
Na Lv¹,
Tao Yuan⁵,
Yuanlong Pan¹,
Yong Xue⁶,
Xi Chen⁷,
Qian Xiong¹,
Ruifu Yang ORCID: orcid.org/0000-0003-3219-7269⁸,
Weigang Zhao⁵ &
Baoli Zhu ORCID: orcid.org/0000-0001-5326-9503^1,2,9,10,11

Nature Communications volume 15, Article number: 8084 (2024)

Abstract

Virulence factor genes (VFGs) play pivotal roles in bacterial infections and have been identified within the human gut microbiota. However, their involvement in chronic diseases remains poorly understood. Here, we establish an expanded VFG database (VFDB 2.0) consisting of 62,332 nonredundant orthologues and alleles of VFGs using species-specific average nucleotide identity (https://github.com/Wanting-Dong/MetaVF_toolkit/tree/main/databases). We further develop the MetaVF toolkit, facilitating the precise identification of pathobiont-carried VFGs at the species level. A thorough characterization of VFGs for 5452 commensal isolates from healthy individuals reveals that only 11 of 301 species harbour these factors. Further analyses of VFGs within the gut microbiomes of nine chronic diseases reveal both common and disease-specific VFG features. Notably, in type 2 diabetes patients, long HiFi sequencing confirms that shared VF features are carried by pathobiont strains of Escherichia coli and Klebsiella pneumoniae. These findings underscore the critical importance of identifying and understanding VFGs in microbiome-associated diseases.

Introduction

Over the past decade, extensive research has focused on elucidating the intricate contributions of the human gut microbiota to overall health^1,2. Studies have focused primarily on delineating commensal bacterial compositions and metabolic pathways potentially implicated in disease pathogenesis. Among these investigations, the presence of virulence factor genes (VFGs) within the gut microbiome has been noted. These genes, which are crucial for the infectivity of pathogenic bacteria, have raised questions regarding their carriers in the gut ecosystem. It remains uncertain whether these VFGs are borne by specific bacterial pathogens or by pathobionts, a term denoting members of the gut microbiota with latent pathogenic potential³. How pathogenic and nonpathogenic strains of opportunistic species evolve is also poorly understood. Furthermore, it is unclear whether nonpathogenic strains can acquire VFGs and thereby transform into pathobionts within the gut microbiota⁴. Several studies have suggested the involvement of VFGs carried by gut pathobionts in the onset of chronic gastrointestinal diseases, mainly colorectal cancer (CRC). For example, the commensal bacterium E. coli can carry genes for colibactin, a genotoxin that can induce double-strand DNA breaks^5,6,7 and promote tumorigenesis in CRC⁸. Moreover, some strains of Fusobacterium nucleatum carrying the adhesin FadA can bind E-cadherin and induce CRC cell growth⁹. In addition, enterotoxigenic Bacteroides fragilis (ETBF) can produce a zinc-dependent metalloprotease toxin¹⁰, which triggers tumorigenesis via epithelial IL17 and Stat3 signalling^11,12. These toxins are also enriched in patients with inflammatory bowel disease (IBD), but their causal role is not fully understood^13,14. Furthermore, associations of VFGs carried by gut pathobionts with other human chronic diseases have not been widely reported.
Recently, the Dutch Microbiome Project, encompassing a cohort of 8208 individuals, shed light on the apparent connections between VFGs and chronic diseases¹⁵. In that study, bacterial adherence and iron uptake genes were significantly associated with conditions such as type 2 diabetes (T2D) and various gastrointestinal disorders¹⁵. However, the lack of a comprehensive and precise VFG database and of advanced bioinformatics tools may have hindered a full understanding of the role played by VFGs within the gut microbiota.

Presently, the widely utilized Virulence Factor Database (VFDB) stands as the primary resource for bacterial pathogenicity analysis¹⁶, compiling experimentally verified VFGs, which were also adapted for metagenome-association studies¹⁷. However, this database does not provide VFG orthologues and alleles or detailed information regarding bacterial hosts or VFG mobility. Analytical tools such as ShortBRED and PathoFact, built upon VFDB, have been employed for VFG analysis of metagenomic data^17,18. While ShortBRED aids in determining VFG abundance, PathoFact predicts VFG presence and mobility but with compromised accuracy.

To address these critical gaps, we leveraged a verified dataset comprising 3581 VFGs from VFDB to construct the expanded virulence factor gene database (VFDB 2.0). This comprehensive repository encompasses 62,332 VFG orthologues and alleles spanning 135 bacterial species derived from 18,521 complete bacterial genomes representing 3559 distinct species. Alongside VFDB 2.0, we introduced the MetaVF toolkit—a pipeline tailored to profile VFGs from metagenomic sequencing data utilizing VFDB 2.0. MetaVF excels in reporting VFG diversity, abundance, and coverage and predicts mobile VFGs and their respective bacterial hosts. Notably, MetaVF exhibits superior sensitivity and accuracy compared with existing VFG analytical tools.

The application of the MetaVF toolkit to publicly available short-read metagenomic data from cohorts comprising both healthy individuals and those affected by nine different diseases revealed common features of VFGs across diverse diseases—previously overlooked in conventional studies. Moreover, our investigations pinpointed specific strains of E. coli and K. pneumoniae carrying adherence and iron uptake genes implicated in association studies of cardiovascular disease among patients with type 2 diabetes. These findings were further validated via long HiFi read sequencing data, confirming the importance of VFG identification in understanding microbiome-associated diseases.

Results

The expanded virulence factor gene database (VFDB 2.0)

The VFDB core dataset is a collection of representative gene sequences, one for each of the 3581 verified virulence factor genes (VFGs). It is normally used for bacterial pathogen analysis and does not include orthologues and alleles from different bacterial species. Here, we defined VFG orthologues as homologous VFGs from different species and VFG alleles as VFGs with single nucleotide polymorphisms (SNPs) in different genomes of any bacterial species or duplicated VFGs in the same bacterial genome. We used the VFDB core dataset of verified VFGs as seeds to find VFG orthologues and alleles in 18,521 complete genomes from the RefSeq database (see Methods) via species-specific average nucleotide identity (ssANI), because evolutionary rates vary between species. In total, 37,690 orthologues in 75 newly identified species and 429,738 alleles in 60 original species were obtained; together they constitute the redundant dataset.

According to the redundant dataset, approximately 70% of the VFGs were species-specific, and 94% were genus-specific (Fig. 1A, Supplementary Data 3). VFGs that are neither species-specific nor genus-specific can be shared by different species and genera, and most of them are from the Enterobacteriaceae family (Supplementary Fig. 1A). Moreover, 3.3% are mobile VFGs associated with plasmids, prophages, or integrative and conjugative elements (ICEs), which are involved in intercellular transmission. Among them, 479 are carried by plasmids, 304 by prophages, and 178 by ICEs (Supplementary Fig. 1B, Supplementary Data 3). Among the 479 plasmid-associated VFGs, 224 are carried exclusively by plasmids (“plasmid-borne only”), and 255 can be carried either by plasmids or on chromosomes of bacterial species (“alternate”) (Fig. 1A). Annotation of the redundant dataset yielded 850 single VFGs and 248 multi-gene VFG clusters, which were further classified into 7 VF categories. The bacterial host taxonomy, mobility, and VF categories of each VFG were collected into the annotation dataset.
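The host-specificity and plasmid-location labels described above can be illustrated with a minimal Python sketch. It assumes each VFG comes with the (genus, species) pairs of the genomes carrying it and the replicon type of each copy; the function names and input shapes are hypothetical and are not taken from the MetaVF code. The labels are treated as mutually exclusive here, which is also an assumption.

```python
from typing import Iterable, Tuple


def host_specificity(hosts: Iterable[Tuple[str, str]]) -> str:
    """Label a VFG from the (genus, species) pairs of genomes that carry it."""
    host_set = set(hosts)
    if not host_set:
        raise ValueError("a VFG needs at least one host genome")
    genera = {genus for genus, _species in host_set}
    if len(host_set) == 1:
        return "species-specific"      # found in a single species
    if len(genera) == 1:
        return "genus-specific"        # several species, one genus
    return "non-genus-specific"        # shared across genera


def plasmid_location(replicons: Iterable[str]) -> str:
    """Label a VFG from the replicon type ('plasmid' or 'chromosome') of each copy."""
    kinds = set(replicons)
    if kinds == {"plasmid"}:
        return "plasmid-borne only"
    if kinds == {"chromosome"}:
        return "chromosome-borne only"
    if kinds == {"plasmid", "chromosome"}:
        return "alternate"             # seen on both replicon types
    raise ValueError(f"unexpected replicon types: {sorted(kinds)}")
```

For example, a gene seen in both Escherichia coli and Shigella sonnei genomes would be labelled non-genus-specific under these assumptions.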

**Fig. 1: Workflow for constructing VFDB 2.0 and the MetaVF toolkit.**

After removing redundancy, a total of 62,332 VFG sequences of 135 species corresponding to 3527 types of VFGs were included in the expanded alignment dataset (Fig. 1A, Supplementary Data 1). Among them, 15,943 VFG alleles of 2,741 VFGs from 59 pathogenic species were confirmed via the NCBI BioSample database and were collected into the pathogenic alignment dataset (Fig. 1A, Supplementary Data 2). Finally, the annotation dataset and alignment dataset were integrated into VFDB 2.0.

MetaVF toolkit for profiling virulence factor genes in metagenomes based on VFDB 2.0

To produce accurately annotated VFGs from gut metagenomes, we introduce the MetaVF toolkit based on VFDB 2.0, which works in 3 steps: alignment, filtering, and annotation. Step 1: For input metagenomic sequence data, clean reads are mapped against the expanded alignment dataset to obtain VFG mapped reads (VFMappedreads); for long HiFi reads or metagenome-assembled genomes (MAGs), nucleotide BLAST is performed against the pathogenic alignment dataset to annotate VFGs. Step 2: VFMappedreads are filtered with a tested sequence identity (TSI) threshold, which was calibrated on artificial metagenomic datasets (AMSD1) generated by combining CAMI datasets with in silico mutated VFGs of defined abundance (27 different combinations of metagenome complexity, VFG abundance, and mutation rates). The TSI was evaluated with in silico mutated VFGs at 1%, 3% and 5% mutation rates, and a TSI of 90% achieved the most stable performance, with a true discovery rate (TDR) > 97% and a false discovery rate (FDR) < 4.001 × 10⁻⁵% (Supplementary Fig. 1C). Neither the complexity of the metagenomes nor the relative abundance of VFGs affected the performance of the MetaVF toolkit. For long HiFi reads or MAGs, the best BLAST hit for each gene is selected and filtered by identity and coverage. Step 3: The filtered VFMappedreads are counted and normalized by gene length and sequencing depth, and abundance is expressed as transcripts per million (TPM). The coverage of VFG clusters, mobility, bacterial host taxonomy, and VF categories are annotated according to the annotation dataset in VFDB 2.0 (Fig. 1B).
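Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the toolkit's implementation: the hit tuples, the function names, and the choice to normalize over only the genes supplied are hypothetical, and the section does not state the exact denominator MetaVF uses for sequencing depth. The TPM formula itself is the standard one (length-normalized rate divided by the sum of rates, times one million).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (read_id, vfg_id, percent_identity)


def filter_hits(hits: Iterable[Hit], min_identity: float = 90.0) -> List[Tuple[str, str]]:
    """Keep read-to-VFG hits whose identity reaches the threshold (90% TSI in the text)."""
    return [(read, vfg) for read, vfg, identity in hits if identity >= min_identity]


def count_reads(kept: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count filtered reads per VFG."""
    return dict(Counter(vfg for _read, vfg in kept))


def tpm(read_counts: Dict[str, int], gene_lengths: Dict[str, int]) -> Dict[str, float]:
    """Standard TPM: reads per base, rescaled so the supplied genes sum to 1e6."""
    rates = {gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}
```

With equal read counts, a gene half as long as another receives twice the TPM, which is the length correction the normalization is meant to provide.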

Benchmarking of the MetaVF toolkit, which is more sensitive and precise

To evaluate the performance of the MetaVF toolkit in determining the presence and abundance of VFGs, we performed benchmarking analysis by using Artificial Metagenomic Datasets 2 (AMSD2) and Real Metagenomic Sequencing Data (RMSD) with currently available tools, including PathoFact, ShortBRED, and VFDB direct mapping.

We first used AMSD2 to test the performance of the MetaVF toolkit, PathoFact, ShortBRED, and VFDB direct mapping in VFG identification. The MetaVF toolkit was robust on all 18 combinations of subsets from AMSD2 and outperformed VFDB direct mapping, PathoFact, and ShortBRED in both sensitivity and precision. All four tools performed better on artificial datasets with a 1% VFG mutation rate than on those with 3% and 5% mutation rates (Supplementary Fig. 2, Supplementary Fig. 3, and Supplementary Data 4). Moreover, all the tools performed better on the AMSD2_high datasets than on the AMSD2_low datasets. For the AMSD2_low datasets with a 5% mutation rate, the specificity, sensitivity, precision, accuracy, and F1 score of the MetaVF toolkit were 99.99%, 94.33%, 95.13%, 99.99%, and 94.72%, respectively, whereas VFDB direct mapping, ShortBRED, and PathoFact achieved precisions of 12.53%, 15.66% and 0.06%, respectively, and sensitivities of 94.60%, 16.80% and 49.86%, respectively (Supplementary Fig. 2 and Supplementary Data 4). When the AMSD2_high dataset with a 1% mutation rate was used, the precision of ShortBRED rose to 49.19%, whereas that of PathoFact remained negligible (0.05%), and their sensitivities increased to 87.67% and 71.79%, respectively; however, their precision was still lower than that of the MetaVF toolkit (precision of 94.06% and sensitivity of 99.83%) (Supplementary Fig. 3 and Supplementary Data 4). The performance of the MetaVF toolkit for quantifying VFGs was tested in comparison with that of ShortBRED and VFDB direct mapping, as PathoFact does not report the abundance of VFGs.
Spearman’s correlation between the predicted and expected relative abundances was the highest in the MetaVF toolkit (all p < 0.001, R > 0.94), followed by VFDB direct mapping (p < 0.001, 0.82 < R < 0.84 for the AMSD2_high datasets; p < 0.001, 0.39 < R < 0.46 for the AMSD2_low datasets) and ShortBRED (−0.042 < R < 0.49 for the AMSD2_high datasets; −0.091 < R < 0.32 for the AMSD2_low datasets; p < 0.05 only under the 1% and 3% mutation rates) in both the AMSD2_high datasets and the AMSD2_low datasets (Supplementary Figs. 4A, B).
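The benchmarking metrics quoted above follow the usual confusion-matrix definitions. The short Python sketch below shows how they relate; the function names are illustrative, and the counts in any call are placeholders rather than values from the benchmark. It assumes every denominator is non-zero.

```python
from typing import Dict


def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive the five reported metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)          # fraction of real VFGs that were found
    specificity = tn / (tn + fp)          # fraction of absent VFGs correctly not reported
    precision = tp / (tp + fp)            # fraction of reported VFGs that are real
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1_score(precision, sensitivity),
    }
```

Applying `f1_score` to the reported precision (95.13%) and sensitivity (94.33%) of the MetaVF toolkit gives roughly 94.7%, in line with the F1 score stated in the text. Because true negatives dominate when most database genes are absent from a sample, specificity and accuracy can stay close to 100% even for a tool with low precision.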

We further evaluated the performance of the MetaVF toolkit in parallel with ShortBRED, PathoFact, and VFDB direct mapping via both short- and long-read metagenome sequencing data generated from 8 human gut microbiota samples. The long (HiFi) reads were used as a reference to verify the results because they produce accurate metagenome assemblies at the species/strain level. Specifically, an average of 21.0 Gb of long HiFi reads was generated per sample, with an average read length of 8.8 kb, and an average of 3307.9 contigs were assembled per sample, with a mean N50 of approximately 702.4 kb (see Supplementary Data 5). After binning, 490 MAGs were obtained (61.2 MAGs per sample), including 365 high-quality MAGs (45.6 per sample) (see Supplementary Data 5).
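For readers unfamiliar with the N50 statistic used above, a minimal Python sketch of its standard definition follows: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name is illustrative and the assembler used in the study may report it slightly differently.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig lengths in bp."""
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:   # half of the assembly is now covered
            return length
    return ordered[-1]              # not reached for positive lengths
```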

The results of comparative analyses of the MetaVF toolkit with ShortBRED, PathoFact, and VFDB direct mapping are presented in Fig. 2A. Of the VFGs identified by the MetaVF toolkit, 71.6% were confirmed by long HiFi reads, compared with 48.3% of those identified by VFDB direct mapping and 33.9% of those identified by PathoFact. ShortBRED, on the other hand, yielded fewer false-positive results but missed many of the VFGs detected by long HiFi reads (407 undetected). The Spearman correlation coefficient between the predicted and expected relative abundances was the highest in the MetaVF toolkit (0.16 < R < 0.87, all p < 0.001, except sample D5214, with p = 0.22), followed by VFDB direct mapping (0.27 < R < 0.80, all p < 0.01) and ShortBRED (−0.34 < R < 0.53, p < 0.05, except for sample D1217, with p = 0.91) (see Fig. 2B).
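The Spearman coefficient used to compare predicted and expected abundances is the Pearson correlation of the ranks of the two series. The dependency-free Python sketch below illustrates this; it assigns average ranks to ties, which is a common convention but an assumption here, and it does not compute p values. The function names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```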

**Fig. 2: Benchmark of the MetaVF toolkit using real sequencing data.**

MetaVF toolkit for annotation of VFG mobility and host species specificity in metagenomic data

Since ShortBRED, PathoFact, and VFDB direct mapping are unable to determine the mobility of VFGs or their host species, we next analysed VFGs with the MetaVF toolkit and verified them by long HiFi read sequences, using the 8 metagenomic sequenced samples as the test dataset. A total of 695 VFGs were detected by MetaVF and verified by long HiFi reads. For mobility determination, 392 VFGs were identified as “chromosome-borne only” by MetaVF, and 390 of them (99.5%) were confirmed by long HiFi reads. The remaining 303 VFGs were identified as “alternate” by MetaVF, and 47 (15.5%) were confirmed to be on plasmids by long HiFi reads. No “plasmid-borne only” VFGs were identified in the 8 metagenomic samples (Supplementary Fig. 5A). For host species determination, 488 VFGs (of 695 VFGs) on long contigs assembled from long HiFi reads were used, of which 107 VFGs were determined to be species-specific and 136 VFGs genus-specific by MetaVF; 100% of these were confirmed by long HiFi reads. The remaining 245 VFGs were determined to be non-genus-specific by MetaVF and were confirmed by long HiFi reads to be carried by different host species. For example, VFGs attributed to Yersinia pestis and Shigella sonnei in VFDB were carried by E. coli in the 8 metagenomic samples (Supplementary Fig. 5B). In summary, the MetaVF toolkit can predict “chromosome-borne only” VFGs with an accuracy of 99.5% but cannot determine the location of “alternate” VFGs. In terms of host specificity, the MetaVF toolkit can predict host taxa at the species and genus levels with an accuracy of 100% but can provide only the range of possible host taxa for non-genus-specific VFGs in metagenomic samples.

VFGs are present in bacterial isolates of the healthy gut microbiome

We first analysed the VFGs carried by cultured isolates from healthy gut microbiota from the HBC¹⁹, BIO-ML²⁰ and CGR²¹ datasets, with a total of 5452 bacterial isolates of 301 species, where only 512 isolates (genomes) of 12 species (4.0% of species) carried VFGs (Fig. 3A). Among these 12 species, all the isolates of 7 species carried VFGs, including the large numbers of E. coli and K. pneumoniae isolates, whereas only some isolates of the other 5 species carried VFGs, e.g., only 2.7% of the B. fragilis isolates carried the B. fragilis toxin gene. With respect to the VFG types, only 11.9% (61/512) of the VFG-carrying healthy isolates carried true VFGs (61 isolates of 8 species), which indicated the pathogenic potential of these isolates (Fig. 3A).

**Fig. 3: VFGs in healthy gut microbiomes.**

Considering the bias in culturing bacterial isolates from the gut microbiota, we further analysed VFGs in metagenomes published by the Human Microbiome Project (HMP). For the 350 HMP metagenomes, a total of 651 VFGs were detected and assigned to 30 species, including the 9 species identified in the gut bacterial isolates. The relative abundance and prevalence of VFGs of K. pneumoniae and E. coli were the highest in the healthy gut microbiota, with maximum prevalences of 33.4% and 39.1% and maximum log10 relative abundances of −1.60 and −1.38, respectively (as shown in Fig. 3B). The most prevalent VFGs were colonization and housekeeping VFGs (siderophores), whereas some true VFGs, such as colibactin and T3SS effectors of E. coli and K. pneumoniae, were identified in more than 10% of individuals (10.3% to 17.1% for colibactin and 17.7% to 31.7% for T3SS effectors). In summary, the VFGs in healthy gut microbiota were carried by opportunistic species, and colonization and housekeeping VFGs, carried by E. coli and K. pneumoniae, were the most prevalent.

Bacterial isolates from healthy gut microbiota contain pathogen-associated VFGs

To further explore the potential pathogenicity of isolates of E. coli and K. pneumoniae, we analysed VFGs in these gut commensals together with their pathogenic counterparts. Compared with hypervirulent strains (hvKp) of K. pneumoniae, healthy gut isolates harboured fewer pathogen-associated VFGs (PAVGs), such as rmpA, aerobactin, salmochelin, and yersiniabactin²². However, we observed that 2 isolates from healthy guts carried whole sets of PAVGs, which may indicate pathogenic potential. Unlike hvKp strains, isolates from healthy gut microbiota did not carry T6SS effectors (Fig. 3C). For E. coli, we found that only 31.8% (21/66) of the commensal isolates harboured PAVGs, whereas 70.4% (19/27) of the pathogenic isolates harboured PAVGs (p = 0.001, Fisher’s exact test). For example, 5 of the 66 commensal isolates carried ExPEC-specific VFGs, such as P fimbriae, S fimbriae, and alpha-haemolysin²³. In addition, 4 isolates carried InPEC-specific VFGs, such as those of the T3SS and Shiga toxin²³, indicating potential infectivity (Fig. 3D).
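The comparison of PAVG carriage between commensal and pathogenic E. coli isolates is a 2×2 contingency test. A minimal, dependency-free Python sketch of a two-sided Fisher's exact test is given below; it uses the common convention of summing the probabilities of all tables no more likely than the observed one, which may differ in detail from the software used in the study. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is as likely or less likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Counts from the text: 21 of 66 commensal and 19 of 27 pathogenic isolates carry PAVGs.
p_value = fisher_exact_two_sided(21, 66 - 21, 19, 27 - 19)
```

The resulting `p_value` can be compared with the p = 0.001 reported in the text.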

The presence of opportunistic bacterial species carrying different types of VFGs led us to regard those species as pathobionts, defined as gut microbes with pathogenic potential under dysbiosis²⁴. According to the types of VFGs carried by pathobiont strains, we propose classifying pathobionts into 5 types: pathobiont type I (PBT-I), type II (PBT-II), type III (PBT-III), type IV (PBT-IV) and type V (PBT-V). PBT-I refers to bacterial strains or isolates that contain PAVGs but do not cause infections owing to their low abundance; PBT-II refers to members of the same species as PBT-I that carry VFGs other than PAVGs, e.g., some isolates of E. coli from the gut microbiota; PBT-III refers to specialized gut commensals that can carry any VFGs and may lead to chronic disease, e.g., enterotoxigenic B. fragilis (ETBF); PBT-IV refers to commensals that carry both VFGs and detrimental metabolite genes, e.g., the E. coli cutC/D gene cluster that produces trimethylamine (TMA); and PBT-V refers to commensals that do not carry VFGs but contain genes that produce detrimental metabolites causing chronic disease, e.g., the Clostridium sporogenes porA gene that produces phenylacetylglutamine (PAGln)²⁵.
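The five proposed pathobiont types can be read as a decision rule over a few yes/no properties of a strain. The Python sketch below is one possible reading, not a rule given by the authors: the flag names are hypothetical, and the order of the checks (PAVGs first, then VFGs combined with metabolite genes) is an assumption needed to make the categories mutually exclusive.

```python
from typing import Optional


def pathobiont_type(
    has_pavg: bool,
    has_other_vfg: bool,
    has_metabolite_genes: bool,
    specialised_commensal: bool = False,
) -> Optional[str]:
    """Assign a pathobiont type from gene content (assumed precedence, see lead-in)."""
    if has_pavg:
        return "PBT-I"      # carries pathogen-associated VFGs
    if has_other_vfg and has_metabolite_genes:
        return "PBT-IV"     # VFGs plus detrimental metabolite genes
    if has_other_vfg:
        # PBT-III covers specialised commensals such as ETBF; PBT-II covers
        # non-PAVG strains of species that also include PBT-I strains.
        return "PBT-III" if specialised_commensal else "PBT-II"
    if has_metabolite_genes:
        return "PBT-V"      # no VFGs, only detrimental metabolite genes
    return None             # not a pathobiont under this scheme
```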

Chronic diseases are characterized by disease-common features of VFGs

To characterize the VFGs and pathobionts associated with different chronic diseases, we chose gut metagenomic datasets of 9 diseases, namely colorectal cancer (CRC), atherosclerotic cardiovascular disease (ACVD), inflammatory bowel disease (IBD), obesity, hypertension, Parkinson’s disease (PD), gastric cancer (GC), liver cirrhosis (LC), and type 2 diabetes (T2D), for VFG analysis. We found that the diversity and abundance of VFGs were greater in most patient groups, with PD as an exception, which suggests that the abundance of VFGs in the gut microbiota may reflect health status (two-sided Wilcoxon test, p < 0.05 for ACVD, CRC, GC and LC; see Fig. 4A, B).

**Fig. 4: VFG analysis of nine public disease datasets.**

A total of 65 virulence factors were enriched in different disease groups and carried by different pathobionts, such as C. perfringens, E. coli, F. tularensis, K. pneumoniae, S. dysenteriae, S. flexneri, S. mutans, S. pneumoniae, S. sonnei, F. nucleatum, C. difficile, and H. influenzae (Fig. 4C). In addition, disease-specific virulence factors were identified in ACVD, CRC, and LC; the ACVD patients had 8 specific virulence factors, including T6SS and T6SS effectors, enterobactin, yersiniabactin, ShET2, EAST1, and type 1 fimbriae. The alpha and theta toxins and the siderophore salmochelin were enriched exclusively in LC patients, whereas the pathogenic E. coli-associated virulence factors CNF-1, alpha-haemolysin, and P fimbriae were enriched in CRC patients. The majority of these disease-specific virulence factors are pathogen-associated VFGs (PAVGs), which are presumably carried by PBT-I pathobionts (Fig. 4C).

The other 46 virulence factors were identified as common features shared by different types of disease; they came mainly from E. coli, S. pneumoniae, K. pneumoniae, and C. perfringens (82.6%), and some were plasmid-borne. For example, T3SS effectors were enriched in ACVD, CRC, LC, IBD, hypertension, and T2D patients, and the inflammation-associated virulence factor LPS was enriched in ACVD and LC patients. In addition, FadA and colibactin, which are causative agents of CRC, were enriched in the CRC, LC, and ACVD patient groups^26,27 (Fig. 4C).

The common disease features of the virulence factors OmpA, Enterobactin, and ECP are enriched in T2D patients

To further study the VFGs in the gut microbiota associated with chronic disease, we used 150 metagenomic sequencing datasets from the PUMCH cohort, comprising 50 healthy individuals, 50 T2D patients, and 50 T2D patients with cardiovascular disease (T2D-CVD) (see Methods). As expected, we found that the abundance and diversity of VFGs were significantly greater in the T2D and T2D-CVD patient groups than in the healthy control group (HC) (two-sided Kruskal-Wallis test, FDR < 0.05; see Fig. 5A, B).

**Fig. 5: VFG analysis of the PUMCH dataset.**

Further analysis revealed that virulence factors (VFs) from E. coli, such as enterobactin, ECP (E. coli common pilus), OmpA, and T3SS effectors, were significantly enriched in T2D patients (see Fig. 5C). Moreover, the LPS, T6SS, and effector genes from K. pneumoniae were significantly associated with both the T2D and T2D-CVD groups, whereas the abundances of the T6SS, salmochelin and enterobactin genes were significantly greater in T2D-CVD patients than in T2D patients (two-sided Kruskal-Wallis test, FDR < 0.05; see Fig. 5C). In correlation analysis with T2D clinical indices, we found that the abundances of 26 virulence factors were specifically correlated with fasting blood glucose (FBG) and glycated haemoglobin (HbA1c) (Spearman correlation, all R > 0.4, p < 0.05); T3SS effectors, enterobactin, ECP, and OmpA showed the strongest correlations, and those VFs were predicted by the MetaVF toolkit to be carried by E. coli and K. pneumoniae (see Fig. 5D). Moreover, some pathogen-associated VFs of K. pneumoniae and E. coli, such as rmpA, yersiniabactin, P fimbriae, and alpha-haemolysin, were enriched in T2D and T2D-CVD patients, suggesting the presence of hypervirulent K. pneumoniae and uropathogenic E. coli (UPEC) in the gut microbiota of T2D and T2D-CVD patients. In summary, these results showed that the VFs carried by different pathobiont types of E. coli and K. pneumoniae are associated with T2D.
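When many virulence factors are tested at once, raw p values are usually converted to FDR-adjusted values before applying a cut-off such as FDR < 0.05. This section does not name the correction used, so the Python sketch below shows the widely used Benjamini-Hochberg procedure purely as an assumed example; the function name is illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A virulence factor would then be called significant when its adjusted value falls below the chosen FDR threshold.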

Specific pathobiont types of Klebsiella pneumoniae were identified in T2D-CVD patients

As shown in the previous sections, short-read sequencing analysis revealed that different types of VFGs are associated with different chronic diseases. However, the pathobiont types cannot be defined from short-read sequences because the reads are too short. Long HiFi read sequencing of the gut microbiome can generate strain-level assemblies for functional gene analysis²⁸. We chose 24 samples from the HC, T2D, and T2D-CVD patient groups for long-read sequencing (SMRT), which included 9 samples with deep sequencing (average of 21.0 Gb per sample; one sample failed to yield valid sequencing data) and 15 samples with low sequencing depth (average of 3.2 Gb per sample), to determine the corresponding pathobiont types (see Methods). The contig N50, the number of MAGs, and the number of annotated species were greater with deep sequencing than with low-depth sequencing (see Supplementary Data 5). In total, we assembled 873 MAGs corresponding to 270 species from 23 samples, where 22 species had more than 10 MAGs and 103 species had one MAG. Among the assembled MAGs, those of E. coli, K. pneumoniae, K. oxytoca, C. perfringens, H. parainfluenzae, and H. influenzae were found to carry VFGs (Fig. 6A).

**Fig. 6: Strain-level analysis of VFGs via long HiFi reads.**

With emphasis on E. coli and K. pneumoniae, which carried the majority of VFGs, the high-quality MAGs of E. coli and K. pneumoniae from each of the samples were assessed by MAGPhase for haplotype structure analysis, where the average haplotype numbers were 7 for E. coli and 14.5 for K. pneumoniae. Furthermore, one dominant haplotype, which represents a quasi-strain/lineage providing sufficient data for subsequent determination of VFG clusters and types of pathobionts at the strain level, was present in every MAG analysed.

Among the 11 MAGs of E. coli from 9 samples, two MAGs contained the VFGs P fimbriae, S fimbriae, and alpha-haemolysin, which are associated with true pathogenic strains (ExPEC) and were assigned to pathobiont type I (PBT-I). Furthermore, the other 9 MAGs were assigned to pathobiont type IV because they carried the detrimental metabolite-producing genes cntA and cntB (trimethylamine, TMA) and VFGs (see Fig. 6B, C). Interestingly, D3103 had two MAGs belonging to pathobionts PBT-I and PBT-IV, indicating a mixture of the 2 pathobiont types from 2 quasi-lineages.

Among the 6 MAGs of K. pneumoniae from 6 samples, 4 MAGs in T2D-CVD patients belonged to PBT-I, containing siderophore genes that are associated with hypervirulent strains (see Supplementary Figs. 6B, C). There were two MAGs of K. pneumoniae in T2D-CVD patients (D3029 and D3120) that carried another set of TMA-producing genes, cutC and cutD, assigned as PBT-IV pathobionts.

Discussion

Previously, metagenome-association studies identified several VFGs that were associated with a few chronic diseases^15,29,30. However, current tools for detecting VFGs in metagenomic data tend to generate false-positive results, which impedes the discovery of the role of VFGs in chronic diseases in humans²⁹. The MetaVF toolkit, which we benchmarked with artificial and real sequencing data, uses VFDB 2.0 to analyse VFGs accurately in bacterial isolates or the gut microbiota at the species level. For the gut bacterial isolates, the different types of VFGs carried by different strains were determined and used to define pathobiont types, which were subsequently used for disease association studies. For the gut microbiota, the MetaVF toolkit enabled the identification of disease-specific VFGs in the ACVD, CRC, and LC patient groups and of common-feature VFGs shared by ACVD, CRC, GC, and LC individuals, which may reflect the health status of the gut microbiota and could be used for gut health index determination^15,31. When we applied the MetaVF toolkit to T2D gut microbiota analysis, we were able to identify VFGs carried by different types of pathobionts and found that the enriched VFGs were carried mainly by E. coli and K. pneumoniae. These results were further verified at the strain level for the two bacterial species by long (HiFi) read sequencing.

Many VFGs exhibit genetic polymorphisms, and their orthologues and alleles were systematically explored using 18,521 complete genomes, yielding VFDB 2.0. VFDB 2.0 covers an additional 75 species and contains approximately 20-fold more VFG orthologues and alleles, belonging to pathogens and nosocomial opportunists (Supplementary Data 1).

The use of VFDB 2.0 is essential for accurately identifying the host species, mobility, and structure (VF clusters) of VFGs in the gut microbiome. With respect to the species specificity of VFGs, the MetaVF toolkit cannot determine the host species of non-species-specific VFGs but can provide possible host species at the genus level when long HiFi reads are not available. Moreover, horizontal gene transfer (HGT) is believed to occur at high frequency in the gut microbiota, and VFGs carried by mobile genetic elements (MGEs) could be mobilized between bacterial cells at the high cell densities found in the gut³². In the PUMCH dataset, 18.3% of the VFGs in the T2D-CVD group were predicted to be carried by mobile genetic elements, indicating that VFGs are likely to be mobile in the human gut. For the “chromosome-borne only” VFGs, we validated the accuracy of the MetaVF toolkit, which was 99.5% using real sequencing data; no “plasmid-borne only” VFGs were detected in those samples. Our toolkit cannot determine whether the “alternate” VFGs are located on a plasmid or on the chromosome in real samples, but it can indicate their potential for mobility, which can be validated by long HiFi reads. The MetaVF toolkit is therefore useful for predicting plasmid-borne VFGs, which is difficult with metagenomic data when long HiFi reads are not available³³.

In fact, most VFGs are components of VFG clusters, and the incompleteness of VFG clusters may lead to dysfunction. For example, the T6SS gene cluster of Shigella flexneri, which has lost several VFGs in comparison with that of Shigella sonnei, has been demonstrated to be dysfunctional³⁴. The annotations of the VFG structure can help users determine the completeness of the VFG cluster in metagenomic samples, which can potentially be used for VFG functional assessment. For the T2D patient group analysis, approximately 40% of the VFG clusters in the disease groups and 10% in the healthy group were calculated to be complete, indicating that the healthy individuals carried fewer functional virulence factors.

Using the MetaVF toolkit, we were able to characterize VFGs in the gut microbiota with disease-specific features and common disease features. The common features of VFGs are characterized mainly by iron uptake and adherence genes, which are carried by a few pathobiont species, such as E. coli and K. pneumoniae. These species have been described as common features, together with other species, by different studies^31,35 and could be incorporated into health status prediction. The common features of diseased gut microbiomes may be explained by the use of different types of medicines, such as proton pump inhibitors (PPIs) and antibiotics¹⁵, while the common features of VFGs in our study (iron uptake and adherence genes) may be explained by the fact that these VFGs possibly help pathobionts gain survival advantages over species without VFGs in the inflamed gut environment³⁶. On the other hand, the disease-specific features of VFGs in the CRC, LC, and ACVD patient groups were characterized by PAVGs, which are carried by PBT-I pathobionts. In the case of CRC, the enriched alpha-haemolysin and CNF-1 carried by uropathogenic E. coli (UPEC) strains are both cytotoxic and possibly associated with CRC^37,38, which could be used for diagnostic purposes.

In the PUMCH studies, we demonstrated that the VFGs enriched in T2D patients were carried mainly by E. coli and K. pneumoniae. The association of these VFGs with T2D has also been reported in previous gut microbiome studies without specific host species³⁹. For example, a study involving a large cohort of 8208 Dutch individuals revealed enrichment of siderophores, ECP, and OmpA in T2D patients but without host information¹⁵. Notably, the enrichment of VFGs of hypervirulent K. pneumoniae in T2D-CVD patients was identified via the MetaVF toolkit and confirmed via long HiFi reads, a finding that has not been reported in other metagenomic analyses.

Strain-level associations between gut microbes and disease have been advocated in recent years^40,41, and SMRT sequencing has been demonstrated to be efficient in bacterial lineage-resolved assemblies²⁸, which is essential for pathobiont type identification. In 23 long-HiFi-read PUMCH samples, we identified two different pathobiont types for 11 strains of E. coli and three different pathobiont types for 6 strains of K. pneumoniae, which could not be identified via short-read sequencing. For the two pathobiont types in E. coli, PBT-IV was found in both the T2D and T2D-CVD patient groups, and the two strains of PBT-IV from K. pneumoniae were found exclusively in two T2D-CVD patients, carrying both VFGs and TMA-producing genes. To date, haplotype analysis has been used to evaluate the quality of long HiFi read-generated MAGs, which still consist of multiple lineages (or strains)²⁸. The MAGs assembled for E. coli and K. pneumoniae via long HiFi reads generated from PUMCH samples consisted of multiple haplotypes that may represent different lineages. Even though the linkages of SCG haplotypes cannot be determined by long HiFi reads, we detected dominant haplotypes (strains) that correspond to the dominant pathobiont types in each of the samples, as predicted by other studies using short-read sequencing^42,43. The quality of assembled MAGs can be improved by increasing the sequencing depth of long HiFi reads, and MAGs with fewer haplotypes can further improve disease association studies at the strain level. The determination of the entirety of VFG clusters and the mobility of VFGs within the gut microbiota can also be improved by using long HiFi reads, which is especially important for disease association studies. Owing to the limited length of the reads, even long HiFi reads were unable to determine the entirety of VFG clusters that consisted of more than 12 genes at the single-molecule level, such as colibactin and yersiniabactin (Fig. 6C).
For the MGEs in the gut microbiota, the majority of mobile VFGs are those carried by plasmids (see Supplementary Fig. 6C), while the host species cannot be determined by shotgun sequencing or SMRT sequencing. This problem can be solved only by using a high‐throughput chromosome conformation capture technique (Hi-C sequencing)⁴⁴.

In conclusion, VFDB 2.0 is a comprehensive database that systematically collects VFG orthologues and alleles from different species, demonstrates that the vast majority of VFGs are genus specific, and serves as the basis of the MetaVF toolkit for accurately identifying VFGs in gut metagenomic samples. By applying the MetaVF toolkit to several human gut metagenomic datasets, we were able to identify the disease-common features and disease-specific features of VFGs that have not been defined in previous studies, revealing potential biomarkers for health status evaluation in clinical diagnosis. By combining MetaVF and long HiFi read sequence analysis, the colonization of hypervirulent K. pneumoniae in T2D patients can be determined. VFDB 2.0 contains alleles and orthologues of 3,527 VFGs, allowing us to determine the mobility of VFGs within the gut microbiota of individual samples. In summary, the MetaVF toolkit may increase the efficiency of VFG analysis in disease-association studies, and in the future, the combined use of VFDB 2.0 and long HiFi reads may help identify VFGs that are causative agents of gut microbiota-associated diseases.

Methods

Study cohort and sample collection

Sample collection was conducted among 486 adults at Peking Union Medical College Hospital from January to September 2018. A total of 150 participants were enrolled in the final study, including 50 healthy adults, 50 T2D patients (type 2 diabetes mellitus patients without cardiovascular disease), and 50 T2D-CVD patients (type 2 diabetes mellitus patients with cardiovascular disease). Fresh stool samples were collected from all participants and stored at −80 °C. Written informed consent was obtained from the participants before any study procedures were performed, and the experimental protocol was approved by the Institutional Review Board of the Institute of Microbiology, Chinese Academy of Sciences. All the participants were compensated for travelling. All individuals completed a structured questionnaire that included demographic and lifestyle aspects such as nationality, gender, age, household income, education, smoking habits, drinking habits, duration of T2D, family history of the disease, and use of hypoglycaemic drugs. The self-reported CVD diagnosis in T2D patients, including myocardial infarction, stroke, congestive heart failure, and other ischaemic heart diseases, was confirmed through medical records. Participants who used antibiotics, had an invasive medical intervention within the previous 90 days, had a history of any cancer or inflammatory disease of the intestine, or had a moderate or severe illness at the time of enrolment were not enrolled. Participants of any sex and/or gender were enrolled.

DNA extraction, short-read sequencing, and long-read sequencing

The total genomic DNA in the faecal samples was extracted using a QIAamp PowerFecal DNA Kit following the user manual. Paired-end metagenomic sequencing was performed on the Illumina HiSeq X platform. The demographic characteristics of the sample are summarized in Supplementary Data 6. Twenty-four samples (8 from each group) were selected for long-read sequencing on the PacBio HiFi platform. Eight of the 24 samples were sequenced with a total of ~3 GB per sample, whereas the others were sequenced with a total of ~20 GB per sample.

Curation of VFDB

We established an expanded VFG catalogue based on the VFDB core database⁴⁵ (http://www.mgc.ac.cn/VFs/), which reports the DNA sequences of 3581 VFGs that were experimentally verified (as of 27 June 2020). First, we manually curated the database to find the redundant VFGs and improperly labelled genes in the database. The set1B (in E. coli and Shigella flexneri) and stxB (in E. coli and Shigella dysenteriae) genes were found to be identical. Second, we revised the VF classification in the VFDB on the basis of the VF classification scale proposed by Wassenaar⁴⁶ and the VF descriptions provided by VFDB⁴⁵. We classified 3581 VFGs into seven VF categories, including toxins and effectors (type 1 VF), colonization VFGs (type 2 VF), defence system evasion VFGs (type 3 VF), processing VFGs (type 4 VF), secretory VFGs (type 5 VF), housekeeping VFGs (type 6 VF) and regulatory VFGs (type 7 VF).

Calculation of the species-specific ANI

First, we downloaded 20,946 NCBI RefSeq complete bacterial genomes, extracted the taxonomy annotation of the genomes, and removed genomes with unclear taxonomy assignments, such as “sp.” and “candidatus”. A total of 18,521 complete genomes were retained for the following analyses. Next, ANI was computed for genome pairs of species with more than 10 genomes via fastANI⁴⁷ (version 1.32) with the default parameters. The mean ANI between genome pairs of the same species was defined as the species-specific ANI (ssANI). In total, the ssANI of 1089 species was calculated. In the cases in which a species has only one complete genome in RefSeq, 99% was used as the ssANI of that species, whereas for those species with more than 100 genomes, 100 genomes were randomly selected for ssANI computing.
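The ssANI definition above can be illustrated with a minimal Python sketch. It assumes that pairwise ANI values have already been computed (for example with fastANI) and loaded into a dictionary; the dictionary layout, the function names, and the fixed random seed are illustrative assumptions rather than part of the published pipeline. The sketch does not model the "more than 10 genomes" criterion, because the text does not state how species with between 2 and 10 genomes were handled.

```python
import random
from itertools import combinations
from statistics import mean

DEFAULT_SSANI = 99.0  # value the text assigns to species with a single complete genome
MAX_GENOMES = 100     # species with more genomes are randomly down-sampled


def select_genomes(genomes, max_genomes=MAX_GENOMES, seed=0):
    """Randomly keep at most `max_genomes` genomes of one species."""
    genomes = sorted(genomes)
    if len(genomes) <= max_genomes:
        return genomes
    # The seed is an assumption made here for reproducibility only.
    return sorted(random.Random(seed).sample(genomes, max_genomes))


def species_specific_ani(genomes, pairwise_ani):
    """Mean ANI (percent) over all genome pairs of one species.

    `pairwise_ani` maps frozenset({genome_a, genome_b}) to an ANI value.
    This is a hypothetical in-memory structure, not a fastANI output format.
    """
    selected = select_genomes(genomes)
    if len(selected) < 2:
        return DEFAULT_SSANI
    values = [pairwise_ani[frozenset(pair)] for pair in combinations(selected, 2)]
    return mean(values)


# Toy example with three genomes of one species (ANI values are made up).
toy_ani = {
    frozenset({"g1", "g2"}): 98.0,
    frozenset({"g1", "g3"}): 97.0,
    frozenset({"g2", "g3"}): 99.0,
}
toy_ssani = species_specific_ani(["g1", "g2", "g3"], toy_ani)
single_genome_ssani = species_specific_ani(["only_genome"], {})
```

In this toy case the three pairwise values average to 98.0, and the single-genome species falls back to the 99% default.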

Construction of VFDB 2.0

The 18,521 complete bacterial genomes of 3559 species were aligned to VFDB via local nucleotide BLAST (version 2.5.0). First, the results of the BLASTN were filtered under the thresholds of identity >ANI of the subject species (ssANI) and coverage = 100% (redundant dataset). After removing redundant VFG sequences with 100% sequence identity within the redundant dataset, the nonredundant VFG sequences were collected into an expanded alignment dataset. The VFG sequences originating from pathogenic strains in the expanded alignment dataset were collected into the pathogenic alignment dataset on the basis of information in the NCBI BioSample database.
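The two filtering steps described above (identity above the ssANI of the subject species with 100% coverage, followed by removal of identical sequences) can be sketched as follows. The record layout, the field names, and the ssANI values in the example are hypothetical, and exact string equality is used here as a stand-in for "100% sequence identity".

```python
def build_datasets(hits, ssani_by_species):
    """Return (redundant, nonredundant) hit lists from parsed BLASTN results.

    Each hit is a dict with the keys 'vfg', 'species', 'identity',
    'coverage' and 'sequence' (an assumed layout for illustration).
    """
    # Step 1: keep hits above the ssANI of the subject species at full coverage.
    redundant = [
        hit for hit in hits
        if hit["identity"] > ssani_by_species[hit["species"]]
        and hit["coverage"] == 100.0
    ]
    # Step 2: drop sequences that are identical to one already kept.
    seen = set()
    nonredundant = []
    for hit in redundant:
        if hit["sequence"] not in seen:
            seen.add(hit["sequence"])
            nonredundant.append(hit)
    return redundant, nonredundant


# Made-up ssANI values and hits, for illustration only.
toy_ssani = {"Escherichia coli": 98.5, "Klebsiella pneumoniae": 99.0}
toy_hits = [
    {"vfg": "VFG_A", "species": "Escherichia coli", "identity": 99.2,
     "coverage": 100.0, "sequence": "ATGAAA"},
    {"vfg": "VFG_A", "species": "Escherichia coli", "identity": 99.2,
     "coverage": 100.0, "sequence": "ATGAAA"},   # identical duplicate
    {"vfg": "VFG_A", "species": "Escherichia coli", "identity": 96.0,
     "coverage": 100.0, "sequence": "ATGAAC"},   # below ssANI
    {"vfg": "VFG_B", "species": "Klebsiella pneumoniae", "identity": 99.5,
     "coverage": 92.0, "sequence": "ATGCCC"},    # incomplete coverage
]
toy_redundant, toy_nonredundant = build_datasets(toy_hits, toy_ssani)
```

Only the first two toy hits pass the thresholds, and they collapse to a single entry in the nonredundant set.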

To explore mobile VFGs, three types of mobile genetic elements in 18,521 complete genomes, including ICEs, prophages, and plasmids, were predicted. ICEs were detected on the basis of similarity alignment (>99% identity) against ICEberg⁴⁸ (https://bioinfo-mml.sjtu.edu.cn/ICEfinder/index.php), PhiSpy⁴⁹ (version 4.2.19) was used to find prophage sequences (>99% identity), and plasmid sequences were extracted from fasta files with “plasmid” in the sequence name. The VFGs from the redundant dataset carried by mobile elements were determined via a Python script. If alleles or orthologues of a VFG are located on chromosomes of host species, the VFG is defined as a “chromosome-borne only” VFG. If alleles or orthologues of a VFG are located on plasmids of host species, the VFG is defined as a “plasmid-borne only” VFG. If a VFG is located on either chromosomes or plasmids of a host species, the VFG is defined as an “alternate” VFG. To define the host species for each of the VFGs, the host taxonomic information of redundant VFGs from the redundant dataset was used. If alleles or orthologues of a VFG belong to only one bacterial species, the VFG is defined as “species-specific”. If alleles or orthologues of a VFG belong to different species of the same genus, the VFG is defined as “genus-specific”. If alleles or orthologues of a VFG belong to species of different genera, the VFG is defined as “non-genus-specific”. Finally, the annotations of mobile VFGs and host taxonomic information for each of the VFGs (annotation dataset) and the alignment dataset were integrated into VFDB 2.0.
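The mobility and host-range definitions above amount to two small decision rules, sketched below in Python. The function names and the input conventions (replicon labels "chromosome" and "plasmid", and binomial species names whose first word is the genus) are assumptions made for illustration and do not reproduce the authors' script.

```python
def classify_mobility(locations):
    """Label a VFG from the replicon types on which its alleles or orthologues occur."""
    observed = set(locations)
    if observed == {"chromosome"}:
        return "chromosome-borne only"
    if observed == {"plasmid"}:
        return "plasmid-borne only"
    if observed == {"chromosome", "plasmid"}:
        return "alternate"
    raise ValueError(f"unexpected replicon types: {sorted(observed)}")


def classify_host_range(species_names):
    """Label a VFG from the host species of its alleles or orthologues."""
    species = set(species_names)
    if not species:
        raise ValueError("at least one host species is required")
    # Assumes 'Genus species' names, so the first word is the genus.
    genera = {name.split()[0] for name in species}
    if len(species) == 1:
        return "species-specific"
    if len(genera) == 1:
        return "genus-specific"
    return "non-genus-specific"


mobility_example = classify_mobility(["chromosome", "plasmid", "chromosome"])
host_example = classify_host_range(["Klebsiella pneumoniae", "Klebsiella oxytoca"])
```

A VFG observed on both replicon types is labelled "alternate", and one found in two Klebsiella species is labelled "genus-specific".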

Overview of the MetaVF toolkit

The MetaVF toolkit is a command-line tool for Linux-based systems that integrates two distinct workflows for the prediction of VFGs in metagenomic data or draft genome data.

(1) Alignment

The MetaVF toolkit allows VFG analysis for metagenomic sequencing data (-PE), assembled contigs, draft genomes, or long reads (-draft). For short-read sequencing data, clean reads are mapped to the expanded alignment dataset via bbmap (version 38.91) (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) (with parameters rpkm, idtag=t, ambiguous=random). The output includes the ‘.sam’ file, which records the mapping details of each read, and the ‘.rpkm’ file, which calculates the total number of reads of each sample. For long reads, contigs, MAGs, draft or complete genomes, nucleotide BLAST (version 2.5.0) is performed against the pathogenic alignment dataset (with parameters -max_hsps 1 -outfmt “6 std gaps qcovs qcovhsp sstrand sseq”).

(2) Filtering alignment

For short-read sequencing data, mapped reads are sorted and filtered via SAMtools⁵⁰ (version 1.15, parameter: -F 4), and the shell script is used to filter out hits with less than 90% identity according to the tag “YI:f:”. The raw read count, RPK, and TPM of each gene are calculated. For long reads, contigs, MAGs, draft, or complete genomes, the best BLAST hit of each gene is selected and filtered on the basis of identity and coverage (default: identity >90%, coverage >80%).
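The RPK and TPM values mentioned above follow standard definitions, which can be written out as a short Python sketch. It assumes that TPM is scaled over the set of genes supplied to the function; whether the toolkit uses exactly this denominator is not stated in the text, and the gene names are placeholders.

```python
def rpk_and_tpm(read_counts, gene_lengths_bp):
    """Compute reads per kilobase (RPK) and transcripts per million (TPM).

    read_counts and gene_lengths_bp are dicts keyed by gene name.
    """
    # RPK: read count divided by gene length in kilobases.
    rpk = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000)
        for gene in read_counts
    }
    # TPM: RPK rescaled so that the values sum to one million.
    total_rpk = sum(rpk.values())
    tpm = {
        gene: (value / total_rpk * 1e6 if total_rpk else 0.0)
        for gene, value in rpk.items()
    }
    return rpk, tpm


# Two placeholder genes with equal coverage per kilobase.
toy_rpk, toy_tpm = rpk_and_tpm(
    {"vfg_a": 100, "vfg_b": 300},
    {"vfg_a": 1000, "vfg_b": 3000},
)
```

Because both placeholder genes have the same read density, they receive the same RPK and each accounts for half of the total TPM.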

(3) Integrating VF annotations

The VF category, bacterial host species, and mobility of each VFG are further annotated on the basis of the annotation dataset of VFDB 2.0. The final outputs include two files that are calculated via VFGs and VFs. The relative abundance of each virulence factor is represented by the median abundance of VFGs in each VF.
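The aggregation from gene level to virulence-factor level described above can be sketched in a few lines of Python. The mapping from VFGs to VFs and all names in the example are hypothetical placeholders; only the use of the median comes from the text.

```python
from collections import defaultdict
from statistics import median


def vf_abundance(vfg_abundance, vfg_to_vf):
    """Summarize each VF by the median abundance of its member VFGs."""
    grouped = defaultdict(list)
    for vfg, abundance in vfg_abundance.items():
        grouped[vfg_to_vf[vfg]].append(abundance)
    return {vf: median(values) for vf, values in grouped.items()}


# Placeholder abundances (for example TPM) and a placeholder VFG-to-VF map.
toy_abundance = {"gene1": 10.0, "gene2": 30.0, "gene3": 20.0, "gene4": 5.0}
toy_map = {"gene1": "VF_X", "gene2": "VF_X", "gene3": "VF_X", "gene4": "VF_Y"}
toy_vf = vf_abundance(toy_abundance, toy_map)
```

Using the median rather than the sum means that a single highly abundant gene does not dominate the estimate for a multi-gene virulence factor.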

Generation of artificial datasets

To determine the best threshold and evaluate the performance of the MetaVF toolkit, two sets of artificial data were simulated. First, the raw fasta sequences of low (cami_low), medium (cami_medium), and high (cami_high) complexity datasets in CAMI were downloaded⁵¹. We used nucleotide BLAST (version 2.5.0) to exclude bacterial genomes with any naturally occurring VFGs, defined as a sequence matching a gene from the VFDB with >80% identity and 70% coverage, which ensures that the VFG sequences in the artificial data were those that had been artificially spiked in. The final genomes used for generating the simulation data were 965 VF-free CAMI high, 215 VF-free CAMI medium, and 53 VF-free CAMI low genomes. The sequences of 200 VFGs were randomly sampled from the VFDB core dataset with replacement each time via a shell script and mutated at 1%, 3%, and 5% via snp-mutator (version 1.2.0, parameters: snpmutator -r 1 -n 3 -s 20 -i 0 -d 0 -o summary.tsv -v variants.vcf -m -M metrics -R seq.fasta -F VF_nutation_fasta VF1.fna) (https://github.com/CFSAN-Biostatistics/snp-mutator). Next, we generated two sets of artificial data, including artificial dataset 1 (AMSD1), which was used for evaluating the best threshold for the MetaVF toolkit in filtering alignments for short-read sequencing data (~5 M reads per sample), and artificial dataset 2 (AMSD2), which was used for benchmarking (~100 M reads per sample). We used InSilicoSeq⁵² (version 1.5.4, iss generate --draft --model Hiseq --n_reads) to generate simulated metagenome sequencing data mimicking Illumina HiSeq paired-end reads. Each bacterial genome was assigned an abundance value drawn from a log-normal distribution with a unit mean and standard deviation. For AMSD1, the simulated fastq data for mutated VFG sequences were generated and spiked into bacterial reads (CAMI high, medium, and low) at ratios of 1:50000, 1:5000, and 1:500, respectively, generating 27 artificial datasets.
For AMSD2, the sampling of VFGs was independently performed 3 times with replacement to avoid biases caused by specific artificial VFGs. The AMSD2 combined high-complexity bacterial data (CAMI high) with 3 random sets of VFGs at different mutation rates (1%, 3%, and 5%) with proportions of 1:5000 (AMSD2_low) and 1:50000 (AMSD2_high), including 18 artificial datasets. The number of reads assigned to each VFG in every simulated dataset was calculated.
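As a consistency check on the dataset counts, the design can be enumerated in Python. The sketch assumes a full cross of the stated factors, which reproduces the reported totals of 27 and 18 datasets but is not explicitly confirmed by the text; the helper `spiked_vfg_reads` and its rounding are likewise illustrative assumptions.

```python
from itertools import product

complexities = ["cami_low", "cami_medium", "cami_high"]
mutation_rates = [0.01, 0.03, 0.05]
amsd1_ratios = [50000, 5000, 500]   # bacterial reads per VFG read
amsd2_ratios = [5000, 50000]
vfg_sets = [1, 2, 3]                # three independent VFG samplings

# AMSD1: complexity x mutation rate x spike-in ratio.
amsd1 = list(product(complexities, mutation_rates, amsd1_ratios))
# AMSD2: high complexity only, three VFG sets x mutation rate x ratio.
amsd2 = list(product(["cami_high"], vfg_sets, mutation_rates, amsd2_ratios))


def spiked_vfg_reads(bacterial_reads, ratio):
    """Approximate number of VFG reads for a 1:ratio spike-in (rounded down)."""
    return bacterial_reads // ratio


amsd1_count = len(amsd1)
amsd2_count = len(amsd2)
example_reads = spiked_vfg_reads(5_000_000, 5000)
```

Under these assumptions, a sample of about 5 M bacterial reads spiked at 1:5000 would contain roughly 1000 VFG-derived reads.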

Estimation of the threshold of the MetaVF toolkit

We performed a comparative analysis of different thresholds for filtering low-quality hits (99%, 97%, 95%, 93%, 90%, 85%, 80%, 75%, 70%, 65%, and 60%) via artificial dataset 1 (AMSD1) and calculated the true positive rate ( = TP / (TP + FN)) and false-positive rate ( = FP / (FP + TN)). (TP: true positive, i.e., a read is correctly predicted to be a virulence factor; FN: false-negative, i.e., a read is incorrectly predicted not to be a virulence factor or an incorrect virulence factor; TN: true negative, i.e., a read is correctly predicted not to be a virulence factor; FP: false-positive, i.e., a read is incorrectly predicted to be a virulence factor.)
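The threshold comparison above can be sketched as a simple sweep in Python. The sketch simplifies the definitions in one respect that should be kept in mind: a read assigned to the wrong virulence factor, which the text counts as a false negative, is not modelled here. The input format (identity of the best hit plus a ground-truth flag per read) and the toy values are assumptions.

```python
THRESHOLDS = [99, 97, 95, 93, 90, 85, 80, 75, 70, 65, 60]


def tpr_fpr(reads, threshold):
    """Return (TPR, FPR) when reads with identity >= threshold are called VFG reads.

    reads: list of (identity_percent, is_true_vfg_read) tuples; unmapped
    reads can be given an identity of 0.0.
    """
    tp = fp = fn = tn = 0
    for identity, is_vfg in reads:
        predicted = identity >= threshold
        if predicted and is_vfg:
            tp += 1
        elif predicted and not is_vfg:
            fp += 1
        elif is_vfg:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up reads: three true VFG reads and four background reads.
toy_reads = [
    (99.5, True), (96.0, True), (88.0, True),
    (91.0, False), (70.0, False), (62.0, False), (0.0, False),
]
sweep = {t: tpr_fpr(toy_reads, t) for t in THRESHOLDS}
```

Plotting the resulting (FPR, TPR) pairs across thresholds gives the usual trade-off curve from which a working threshold can be chosen.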

Benchmark of the MetaVF toolkit using artificial datasets

To evaluate the performance of the MetaVF toolkit, we performed VFG analysis via artificial dataset 2 (AMSD2) and compared it with the following tools: ShortBRED (v0.9.5)¹⁸, PathoFact (v1.0)¹⁷, and VFDB direct mapping. First, ShortBRED reduces target protein families to short, highly representative peptide sequences (markers) and then maps reads against only those markers to obtain higher speed and specificity. The best threshold for identifying VFGs has already been tested for ShortBRED and set as its default, so we adopted the default threshold for benchmarking. To be more specific, we used VFDB core dataset B as a candidate gene set and UniRef 90⁵³ (downloaded in April 2023) as a reference gene set to identify marker VFG sequences via the “shortbred_identify” program (clustering the proteins of interest at 85% identity) and calculated the abundance of each VFG via the “shortbred_quantify” program using default parameters (length >=30 amino acids and >95% identity). PathoFact is an integrated pipeline for predicting virulence factors, antimicrobial resistance genes, and toxins in metagenomic data. PathoFact accepts assembled metagenomic sequencing data (contigs), predicts ORFs via Prodigal software, and determines VFGs in each ORF via the Hidden Markov Model (HMM) and random forest model. We inputted the original CAMI draft genomes to avoid errors in the process of assembly and ran a “virulence” pipeline to predict VFGs in each dataset with the default parameters. Finally, we also performed VF analysis by mapping reads directly to VFDB core dataset A without filtering low-quality hits.

The prediction quality of the presence of VFGs was evaluated by sensitivity, specificity, precision, accuracy and F1 score using the formulas below.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)

F1 score = 2 * precision * sensitivity / (precision + sensitivity)

(TP: true positive, i.e., a read/gene is correctly predicted to be a virulence factor; FN: false-negative, i.e., a read/gene is incorrectly predicted not to be a virulence factor or an incorrect virulence factor; TN: true negative, i.e., a read/gene is correctly predicted not to be a virulence factor; FP: false-positive, i.e., a read/gene is incorrectly predicted to be a virulence factor).
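The formulas above translate directly into code. The following Python sketch computes them from the four confusion-matrix counts; specificity, which is named in the text but not written out, is taken here as TN / (TN + FP), the standard definition and the complement of the false-positive rate used for threshold estimation. The counts in the example are made up.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, specificity and F1 from raw counts."""
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),  # standard definition, assumed here
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Made-up counts for illustration.
toy_metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
```

With these toy counts, precision and sensitivity are both 0.8, so the F1 score is also 0.8, while accuracy (0.96) is dominated by the large number of true negatives.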

The prediction quality of the abundance of VFGs was evaluated by Spearman correlation between the expected abundance and the predicted abundance of VFGs, which is represented by the estimated read count in the MetaVF toolkit, VFDB direct mapping, and ShortBRED. False-positive detections are defined as those with 0 expected abundance.

Benchmark of the MetaVF toolkit using real metagenomic sequencing data

Eight samples with both short-read sequencing and long-read sequencing were used to evaluate the performance of MetaVF. The VFGs in long HiFi data were detected via nucleotide BLAST (>90% identity and >80% coverage). The results generated via short-read sequencing methods via the MetaVF toolkit, ShortBRED, PathoFact, and VFDB direct mapping were compared with those generated via long HiFi reads. The parameters used for short-read sequencing methods were the same as those in artificial dataset 2, except that the input of PathoFact was contigs assembled by SPAdes⁵⁴ (v3.15.5, --meta). The “pathogenic” ORFs were further annotated via protein BLAST against VFDB core set B. The best alignment was selected as the final annotation of the pathogenic ORFs. The abundance of VFGs in long-read sequencing data is represented by read coverage depth.

Comparative analysis of VFGs in healthy isolates and clinical isolates

A total of 5592 draft genomes or raw sequencing data for healthy gut isolates were downloaded from three public databases^19,20,21. The raw sequencing data were assembled into draft genomes via SPAdes⁵⁴ (v3.15.5, --isolate) with the default parameters. The quality of all draft genomes was measured via CheckM⁵⁵ (v1.0.7) (with parameters lineage_wf --tab_table -x fna Prokka_annotations/). A total of 5452 high-quality draft genomes with greater than 90% completeness and less than 5% contamination were used for further analysis. All summary and quality statistics can be found in Supplementary Data 7. The VFGs in the draft genomes were identified via the -draft workflow in the MetaVF toolkit according to the previous description (100% query coverage and a minimum similarity of ssANI).

Publicly available pathogenic E. coli genomes and hypervirulent K. pneumoniae sequences were downloaded (Supplementary Data 8, 9). Hypervirulent K. pneumoniae genomes were selected by searching the keyword “hypervirulent” in the NCBI BioSample database and were confirmed by the associated articles provided by the NCBI BioSample database. Snippy (version 4.6.0) (https://github.com/tseemann/snippy) was used to call variants (SNPs and INDELs), and E. coli (GCA_000005845.2_ASM584v2_genomic.fna) and K. pneumoniae (GCA_000240185.2_ASM24018v2_genomic.fna) were used as references (default parameters). VCFtools⁵⁶ (version 0.1.16) was used to compress (bgzip), index (tabix) and merge (vcf-merge) the vcf files of each sample. vcf2phylip (https://github.com/edgardomortiz/vcf2phylip) was used to transform the vcf files to fasta files for phylogenetic tree construction. Phylogenetic trees of the whole-genome SNP sequence were generated via FastTree⁵⁷ (v.2.1.10, parameters: -gtr -nt) and visualized via the ggtree package in R.

Public dataset download

Ten public metagenomic datasets were downloaded for VF analysis, including the colorectal carcinoma (CRC), atherosclerotic cardiovascular disease (ACVD), inflammatory bowel disease (IBD), obesity, hypertension, Parkinson’s disease (PD), gastric cancer (GC), liver cirrhosis (LC), type 2 diabetes (T2D) and HMP metagenomic datasets (for details, see Supplementary Data 10).

Short-read sequencing data analysis

First, KneadData (v0.10.0) (https://bitbucket.org/biobakery/kneaddata) was used to create clean paired reads (with the parameters PE-phred33 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:50). The clean reads were taxonomically classified with MetaPhlAn3⁵⁸ (version 3.0.14) using the default parameters. Next, the abundance of VFGs in each sample was determined via the -PE workflow in the MetaVF toolkit as described above. VFGs that were present in at least 10% of the participants were selected for differential analysis, and the log10(TPM) was used to normalize skewed distributions of the abundance of VFGs. Spearman correlations between the abundance of VFGs and clinical indices were calculated.
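The prevalence filter and log transformation described above can be sketched as follows. The data layout (a dictionary of per-sample TPM dictionaries) and the function names are assumptions. The text does not say how zero TPM values were handled before taking log10, so the sketch simply returns None for them rather than guessing a pseudocount.

```python
import math


def prevalent_vfgs(tpm_by_sample, min_prevalence=0.10):
    """Return VFGs detected (TPM > 0) in at least `min_prevalence` of samples."""
    n_samples = len(tpm_by_sample)
    genes = {gene for profile in tpm_by_sample.values() for gene in profile}
    kept = set()
    for gene in genes:
        present = sum(1 for profile in tpm_by_sample.values()
                      if profile.get(gene, 0.0) > 0)
        if n_samples and present / n_samples >= min_prevalence:
            kept.add(gene)
    return kept


def log10_tpm(value):
    """log10 of a TPM value; None for zero, since zero handling is unspecified."""
    return math.log10(value) if value > 0 else None


# Twenty made-up samples: vfg_a in 2 of 20 (10%), vfg_b in 1 of 20 (5%).
toy_samples = {
    f"s{i}": {"vfg_a": 5.0 if i < 2 else 0.0, "vfg_b": 1.0 if i == 0 else 0.0}
    for i in range(20)
}
toy_kept = prevalent_vfgs(toy_samples)
```

In the toy cohort only vfg_a reaches the 10% prevalence cut-off and would enter the differential analysis.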

Long-read assembly, binning, and species annotation

Raw base-called data from the PacBio sequencing instrument were imported into SMRTLink (https://www.pacb.com/support/software-downloads) to generate HiFi reads via the CCS algorithm (version 6.0.0), which processed the raw data and generated HiFi fastq files (with the following settings: minimum pass 3, minimum predicted accuracy 0.99). HiFi reads were assembled into contigs via the metaFlye genome assembler⁵⁹ (version 2.9), and the ‘--pacbio-hifi’ flag was used. The total length, contig number, largest contig length, N50, and L50 were calculated to evaluate the assembly efficiency with Quast⁶⁰ (v.5.0.034). The functions binning (with parameters --metabat2, --concoct, and --maxbin2) and binning refinement (>70% completeness and <10% contamination) in MetaWRAP⁶¹ (v1.3.2) were used to generate MAGs. MAGs with more than 90% completeness and less than 5% contamination were classified as ‘high quality’. GTDB-Tk⁶² (version 2.24.31) was used to assign candidate taxonomic affiliations to all MAGs (‘classify_wf’ workflow). For MAGs annotated as Escherichia flexneri, if lacY existed and ipaH did not exist in the genome, the genome was defined as E. coli; otherwise, it was assigned to S. flexneri⁶³.
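The MAG quality tiers and the lacY/ipaH rule above can be expressed as two small functions. This is a sketch under stated assumptions: the label "retained" for bins that pass refinement without reaching the high-quality tier, the label "discarded", and the representation of a genome as a set of detected gene names are all illustrative choices.

```python
def mag_quality(completeness, contamination):
    """Classify a MAG by completeness and contamination (both in percent)."""
    if completeness > 90 and contamination < 5:
        return "high quality"
    if completeness > 70 and contamination < 10:
        return "retained"      # passes bin refinement only (label assumed)
    return "discarded"         # label assumed


def resolve_escherichia_flexneri(detected_genes):
    """Reassign a MAG annotated as Escherichia flexneri by GTDB-Tk.

    lacY present and ipaH absent -> E. coli; otherwise S. flexneri.
    """
    genes = set(detected_genes)
    if "lacY" in genes and "ipaH" not in genes:
        return "E. coli"
    return "S. flexneri"


quality_example = mag_quality(completeness=95.0, contamination=2.0)
species_example = resolve_escherichia_flexneri({"lacY", "fimH"})
```

Any genome carrying ipaH, or lacking lacY, is assigned to S. flexneri under this rule.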

Annotation of TMA genes and VFGs via long-read sequencing

To annotate the TMA genes, long-read assemblies were aligned to the localized TMA dataset via nucleotide BLAST (version 2.5.0; BLASTN, megablast), and alignments with at least 90% identity and 80% coverage were retained. To annotate VFGs, the draft workflow in the MetaVF toolkit was used with parameters of at least 90% identity and 80% coverage.

Detection of mobile VFGs via long-read sequencing

The ICE contigs were determined on the basis of similarity alignment (>99%) against ICEberg⁴⁸ (https://bioinfo-mml.sjtu.edu.cn/ICEfinder/index.php). The prophage-carrying contigs were discovered via VirSorter2⁶⁴ (version 2.2.3) and CheckV⁶⁵ (version 0.8.1). First, VirSorter2 was used to select potential prophage sequences (score >0.5), and then CheckV was used to filter host sequences. The filtered contigs were sent to VirSorter2 again to identify prophage sequences (score >0.9). ViralVerify (version 1.1) (https://github.com/ablab/viralVerify/) and PlasFlow⁶⁶ (version 1.1.0) were applied to predict plasmid contigs. The plasmid contigs predicted by both software programs were retained. The bacterial hosts of plasmid contigs with VFGs were predicted by aligning the contigs to plasmid sequences in the expanded alignment dataset via minimap2 (-ax map-hifi -H -N 1 --secondary=no). The best alignment was retained only if the predicted bacterial host species of the plasmid also existed in the sample.

Phylogenetic analysis of MAGs

Phylogenetic analyses of E. coli and K. pneumoniae MAGs were performed as described above, with GCA_000005845.2_ASM584v2_genomic.fna and GCA_000240185.2_ASM24018v2_genomic.fna used as reference genomes, respectively.

Haplotype analysis of E. coli and K. pneumoniae MAGs

CheckM⁵⁵ (v1.0.7) was first used to annotate the single-copy genes (SCGs) in each MAG. Long HiFi reads were mapped to E. coli and K. pneumoniae MAGs via minimap2⁶⁷ (version 2.23, parameters: -x asm20). MagPhase²⁸ (version 1.0.0, default parameters) was used to identify the haplotypes of each SCG in the MAG, and the maximum number of haplotypes of each SCG in the MAG was used to represent the variation of the haplotypes in each MAG.

Quantification and statistical analyses

Two-sided Wilcoxon tests and Kruskal‒Wallis tests were used for differential analysis between two groups and three groups, respectively. Two-sided Spearman correlation was used for correlation analysis. When multiple hypotheses were investigated, P values were corrected for multiple hypothesis testing via the Benjamini‒Hochberg method (FDR). P values < 0.05 were considered ‘significant’.
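For readers unfamiliar with the Benjamini‒Hochberg correction, the following is a minimal pure-Python sketch of the standard step-up procedure. It is not the authors' code (the text does not state which implementation was used), and the P values in the example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.20]
toy_adjusted = benjamini_hochberg(toy_p)
toy_significant = [q < 0.05 for q in toy_adjusted]
```

In this toy example only the smallest P value remains below 0.05 after adjustment, although three of the four raw values were below that cut-off.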

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The PacBio SMRT raw sequencing data and assemblies generated in this study have been deposited in the NCBI under accession code PRJNA1052403. The Illumina sequencing data have been deposited in the NCBI under accession code PRJNA1053635. The public metagenomic data used in this study can be accessed under PRJEB12449⁶⁸, ERP008729³⁰, PRJEB27928⁶⁹, and PRJEB10878⁷⁰ for colorectal carcinoma (CRC); ERP023788²⁹ for atherosclerotic cardiovascular disease (ACVD); EGAS00001001704⁷¹, ERP002061⁷², and PRJNA400072⁷³ for inflammatory bowel disease (IBD); ERP014480⁷⁴ for obesity; PRJEB13870⁷⁵ for hypertension; ERP019674⁷⁶ for Parkinson’s disease (PD); DRA007281, DRA008243, DRA006684 and DRA008156⁷⁷ for gastric cancer (GC); ERP005860⁷⁸ (10.1038/nature13568) for liver cirrhosis (LC); PRJNA422434⁷⁹ for type 2 diabetes (T2D); and PRJNA43017⁸⁰ for HMP. The VFDB 2.0 is available at https://github.com/Wanting-Dong/MetaVF_toolkit/tree/main/databases. Source data are provided with this paper.

Code availability

The MetaVF toolkit is available at https://github.com/Wanting-Dong/MetaVF_toolkit. The scripts used in this study are available at https://github.com/Wanting-Dong/VF_analysis_pipeline.

References

Sepich-Poore, G. D. et al. The microbiome and human cancer. Science 371, 1331 (2021).
Fan, Y. & Pedersen, O. Gut microbiota in human metabolic health and disease. Nat. Rev. Microbiol. 19, 55–71 (2021).
Jochum, L. & Stecher, B. Label or Concept - What is a Pathobiont? Trends Microbiol. 28, 789–792 (2020).
Castillo, A., Eguiarte, L. E. & Souza, V. A genomic population genetics analysis of the pathogenic enterocyte effacement island in Escherichia coli: The search for the unit of selection. Proc. Natl. Acad. Sci. 102, 1542–1547 (2005).
Nougayrède, J. P. et al. Escherichia coli induces DNA double-strand breaks in eukaryotic cells. Science 313, 848–851 (2006).
Wilson, M. R. et al. The human gut bacterial genotoxin colibactin alkylates DNA. Science 363, eaar7785 (2019).
Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic pks(+) E. coli. Nature 580, 269–273 (2020).
Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).
Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/beta-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).
Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592 (2018).
Thiele Orberg, E. et al. The myeloid immune signature of enterotoxigenic Bacteroides fragilis-induced murine colon tumorigenesis. Mucosal Immunol. 10, 421–433 (2017).
Chung, L. et al. Bacteroides fragilis Toxin Coordinates a Pro-carcinogenic Inflammatory Cascade via Targeting of Colonic Epithelial Cells. Cell Host Microbe 23, 203 (2018).
Dubinsky, V., Dotan, I. & Gophna, U. Carriage of Colibactin-producing Bacteria and Colorectal Cancer Risk. Trends Microbiol 28, 874–876 (2020).
Cao, Y. et al. Enterotoxigenic Bacteroides fragilis Promotes Intestinal Inflammation and Malignancy by Inhibiting Exosome-Packaged miR-149-3p. Gastroenterology 161, 1552–1566.e1512 (2021).
Gacesa, R. et al. Environmental factors shaping the gut microbiome in a Dutch population. Nature 604, 732 (2022).
Article ADS CAS PubMed Google Scholar
Chen, L. H. et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 33, D325–D328 (2005).
Article CAS PubMed Google Scholar
de Nies, L. et al. PathoFact: a pipeline for the prediction of virulence factors and antimicrobial resistance genes in metagenomic data. Microbiome 9, 49 (2021).
Article PubMed PubMed Central Google Scholar
Kaminski, J. et al. High-Specificity Targeted Functional Profiling in Microbial Communities with ShortBRED. Plos Comput Biol. 11, e1004557 (2015).
Article MathSciNet PubMed PubMed Central Google Scholar
Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186 (2019).
Article CAS PubMed PubMed Central Google Scholar
Poyet, M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 25, 1442 (2019).
Article CAS PubMed Google Scholar
Zou, Y. Q. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179 (2019).
Article CAS PubMed PubMed Central Google Scholar
Wyres, K. L., Lam, M. M. C. & Holt, K. E. Population genomics of. Nat. Rev. Microbiol. 18, 344–359 (2020).
Article CAS PubMed Google Scholar
Croxen, M. A. Molecular mechanisms of Escherichia coli pathogenicity (vol 8, p 26, 2011). Nat. Rev. Microbiol. 11, 141–141 (2013).
Article Google Scholar
Mazmanian, S. K., Round, J. L. & Kasper, D. L. A microbial symbiosis factor prevents intestinal inflammatory disease. Nature 453, 620–625 (2008).
Article ADS CAS PubMed Google Scholar
Nemet, I. et al. A Cardiovascular Disease-Linked Gut Microbial Metabolite Acts via Adrenergic Receptors. Cell 180, 862 (2020).
Article CAS PubMed PubMed Central Google Scholar
Rubinstein, M. R. et al. Promotes Colorectal Carcinogenesis by Modulating E-Cadherin/β-Catenin Signaling via its FadA Adhesin. Cell Host Microbe 14, 195–206 (2013).
Article CAS PubMed PubMed Central Google Scholar
Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic. Nature 580, 269 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Bickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40, 711 (2022).
Article CAS PubMed Google Scholar
Jie, Z. Y. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
Article ADS MathSciNet PubMed PubMed Central Google Scholar
Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Article ADS CAS PubMed Google Scholar
Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053 (2021).
Article CAS PubMed Google Scholar
Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937 (2019).
Article CAS PubMed Google Scholar
Anderson, M. C., Vonaesch, P., Saffarian, A., Marteyn, B. S. & Sansonetti, P. J. Encodes a Functional T6SS Used for Interbacterial Competition and Niche Occupancy. Cell Host Microbe 21, 769 (2017).
Article CAS PubMed Google Scholar
Dai, D. et al. GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Res. 50, D777–D784 (2022).
Article CAS PubMed Google Scholar
Gerós, A. S., Simmons, A., Drakesmith, H., Aulicino, A. & Frost, J. N. The battle for iron in enteric infections. Immunology 161, 186–199 (2020).
Article Google Scholar
Doye, A. et al. CNF1 exploits the ubiquitin-proteasome machinery to restrict Rho GTPase activation for bacterial host cell invasion. Cell 111, 553–564 (2002).
Article CAS PubMed Google Scholar
Bielaszewska, M., Aldick, T., Bauwens, A. & Karch, H. Hemolysin of enterohemorrhagic: Structure, transport, biological activity and putative role in virulence. Int J. Med. Microbiol. 304, 521–529 (2014).
Article CAS PubMed Google Scholar
Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262 (2015).
Article CAS PubMed PubMed Central Google Scholar
Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
Article CAS PubMed PubMed Central Google Scholar
De Filippis, F. et al. Specific gut microbiome signatures and the associated pro-inflamatory functions are linked to pediatric allergy and acquisition of immune tolerance. Nat. Commun. 12, 5958 (2021).
Article ADS PubMed PubMed Central Google Scholar
Zhao, C. Y., Dimitrov, B., Goldman, M., Nayfach, S. & Pollard, K. S. MIDAS2: Metagenomic Intra-species Diversity Analysis System. Bioinformatics 39, btac713 (2023).
Article CAS PubMed Google Scholar
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
Article CAS PubMed PubMed Central Google Scholar
Yaffe, E. & Relman, D. A. Tracking microbial evolution in the human gut using Hi-C reveals extensive horizontal gene transfer, persistence and adaptation. Nat. Microbiol. 5, 343 (2020).
Article CAS PubMed Google Scholar
Liu, B. et al. 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 50, D912–D917 (2022).
Article CAS PubMed Google Scholar
Wassenaar, T. M. & Gaastra, W. Bacterial virulence: can we draw the line? Fems Microbiol. Lett. 201, 1–7 (2001).
Article CAS PubMed Google Scholar
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Article ADS PubMed PubMed Central Google Scholar
Liu, M. et al. ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Res. 47, D660–D665 (2019).
Article CAS PubMed Google Scholar
Akhter, S., Aziz, R. K. & Edwards, R. A. a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 40, e126 (2012).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat. Methods 14, 1063 (2017).
Article CAS PubMed PubMed Central Google Scholar
Gourlé, H., Karlsson-Lindsjö, O., Hayer, J. & Bongcam-Rudloff, E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics 35, 521–522 (2019).
Article PubMed Google Scholar
Suzek, B. E., Huang, H. Z., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Article CAS PubMed Google Scholar
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Article CAS PubMed PubMed Central Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Article CAS PubMed PubMed Central Google Scholar
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Article CAS PubMed PubMed Central Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2-Approximately Maximum-Likelihood Trees for Large Alignments. Plos One 5, e9490 (2010).
Article ADS PubMed PubMed Central Google Scholar
Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife 10, e65088 (2021).
Article CAS PubMed PubMed Central Google Scholar
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103 (2020).
Article CAS PubMed PubMed Central Google Scholar
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Article CAS PubMed PubMed Central Google Scholar
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Article PubMed PubMed Central Google Scholar
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. G. T. D. B.- Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Article CAS Google Scholar
van den Beld, M. J. C. & Reubsaet, F. A. G. Differentiation between, enteroinvasive (EIEC) and noninvasive. Eur. J. Clin. Microbiol. 31, 899–904 (2012).
Article Google Scholar
Guo, J. R. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
Article PubMed PubMed Central Google Scholar
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578 (2021).
Article CAS PubMed Google Scholar
Krawczyk, P. S., Lipinski, L. & Dziembowski, A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res. 46, e35 (2018).
Article PubMed PubMed Central Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Vogtmann, E. et al. Colorectal Cancer and the Human Gut Microbiome: Reproducibility with Whole-Genome Shotgun Sequencing. PLOS ONE 11, e0155362 (2016).
Article PubMed PubMed Central Google Scholar
Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).
Article CAS PubMed PubMed Central Google Scholar
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 66, 70–78 (2017).
Article CAS PubMed Google Scholar
Vich Vila, A. et al. Gut microbiota composition and functional changes in inflammatory bowel disease and irritable bowel syndrome. Sci. Transl. Med. 10, eaap8914 (2018).
Article PubMed Google Scholar
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Article CAS PubMed Google Scholar
Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 4, 293–305 (2019).
Article CAS PubMed Google Scholar
Palleja, A. et al. Roux-en-Y gastric bypass surgery of morbidly obese patients induces swift and persistent changes of the individual gut microbiota. Genome Med. 8, 67 (2016).
Article PubMed PubMed Central Google Scholar
Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 5, 14 (2017).
Article PubMed PubMed Central Google Scholar
Bedarf, J. R. et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 9, 39 (2017).
Article CAS PubMed PubMed Central Google Scholar
Erawijantari, P. P. et al. Influence of gastrectomy for gastric cancer treatment on faecal microbiome and metabolome profiles. Gut. 69, 1404 (2020).
Article CAS PubMed Google Scholar
Qin, N. et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 513, 59–64 (2014).
Article ADS CAS PubMed Google Scholar
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Article ADS CAS PubMed Google Scholar
Peterson, J. et al. The NIH Human Microbiome Project. Genome Res. 19, 2317–2323 (2009).
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The authors are grateful to Guowei Yang and Jingren Zhang for their assistance. We would like to thank Khi Pin Chua and Zuwei Qian of the PacBio APAC team for their valuable technical assistance in experimental execution and data analysis related to HiFi sequencing. This research was funded by grants from the National Key R&D Program of China (grant numbers 2021YFA1301000 and 2021YFC2301003) and the National Natural Science Foundation of China (grant numbers 32170068 and 81991534).

Author information

Authors and Affiliations

CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China
Wanting Dong, Xinyue Fan, Siyi Wang, Na Lv, Yuanlong Pan, Qian Xiong & Baoli Zhu
University of Chinese Academy of Sciences, Beijing, 100049, China
Wanting Dong, Siyi Wang & Baoli Zhu
Institute of Biotechnology and Health, Beijing Academy of Science and Technology, Beijing, 100089, China
Yaqiong Guo
School of Basic Medical Sciences, Tianjin Medical University, Tianjin, 300070, China
Shulei Jia
Department of Endocrinology, Key Laboratory of Endocrinology of Ministry of Health, Peking Union Medical College Hospital, Chinese Academy of Medical Science and Peking Union Medical College, Beijing, 100730, China
Tao Yuan & Weigang Zhao
National Engineering and Technology Research Center for Fruits and Vegetables, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing, 100083, China
Yong Xue
College of Veterinary Medicine, Nanjing Agricultural University, Nanjing, 210095, China
Xi Chen
Beijing Institute of Microbiology and Epidemiology, Beijing, 100071, China
Ruifu Yang
Department of Pathogenic Biology, School of Basic Medical Sciences, Southwest Medical University, Luzhou, 646000, China
Baoli Zhu
Jinan Microecological Biomedicine Shandong Laboratory, Jinan, 250117, China
Baoli Zhu
Beijing Key Laboratory of Antimicrobial Resistance and Pathogen Genomics, Beijing, 100101, China
Baoli Zhu

Contributions

Conception and design of the study: W.D., B.Z. and W.Z. Acquisition of data: N.L., T.Y., X.C. Analysis or interpretation of data: W.D., X.F., Y.G., S.W., S.J., Y.P., Y.X., Q.X. Writing and/or revisions of the manuscript: W.D., B.Z., R.Y. All authors have approved the submitted version of the manuscript and agree to be personally accountable for their own contribution.

Corresponding authors

Correspondence to Ruifu Yang, Weigang Zhao or Baoli Zhu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Tanel Tenson, and the other, anonymous, reviewer for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary datasets 1 to 10

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Dong, W., Fan, X., Guo, Y. et al. An expanded database and analytical toolkit for identifying bacterial virulence factors and their associations with chronic diseases. Nat Commun 15, 8084 (2024). https://doi.org/10.1038/s41467-024-51864-y

Received: 23 January 2024
Accepted: 16 August 2024
Published: 15 September 2024
DOI: https://doi.org/10.1038/s41467-024-51864-y
