Abstract
Virulence factor genes (VFGs) play pivotal roles in bacterial infections and have been identified within the human gut microbiota. However, their involvement in chronic diseases remains poorly understood. Here, we establish an expanded VFG database (VFDB 2.0) consisting of 62,332 nonredundant orthologues and alleles of VFGs using species-specific average nucleotide identity (https://github.com/Wanting-Dong/MetaVF_toolkit/tree/main/databases). We further develop the MetaVF toolkit, facilitating the precise identification of pathobiont-carried VFGs at the species level. A thorough characterization of VFGs for 5452 commensal isolates from healthy individuals reveals that only 11 of 301 species harbour these factors. Further analyses of VFGs within the gut microbiomes of nine chronic diseases reveal both common and disease-specific VFG features. Notably, in type 2 diabetes patients, long HiFi sequencing confirms that shared VF features are carried by pathobiont strains of Escherichia coli and Klebsiella pneumoniae. These findings underscore the critical importance of identifying and understanding VFGs in microbiome-associated diseases.
Introduction
Over the past decade, extensive research has focused on elucidating the intricate contributions of the human gut microbiota to overall health1,2. Studies have focused primarily on delineating commensal bacterial compositions and metabolic pathways potentially implicated in disease pathogenesis. Among these investigations, the presence of virulence factor genes (VFGs) within the gut microbiome has been noted. These genes, which are crucial for the infectivity of pathogenic bacteria, have raised questions regarding their carriers in the gut ecosystem. Uncertainty remains over whether these VFGs are borne by specific bacterial pathogens or by pathobionts, a term denoting members of the gut microbiota with latent pathogenic potential3. The evolution of pathogenic and nonpathogenic strains among opportunistic species remains poorly understood. Furthermore, it is unclear whether nonpathogenic strains have the capacity to harbour VFGs, potentially transforming into pathobionts within the gut microbiota4. Several studies have suggested the involvement of VFGs carried by gut pathobionts in the onset of chronic gastrointestinal diseases, mainly colorectal cancer (CRC). For example, the commensal bacterium E. coli can carry colibactin, a genotoxin that can induce double-strand DNA breaks5,6,7 and promote tumorigenesis in CRC8. Moreover, some strains of Fusobacterium nucleatum carrying the adhesin FadA can bind E-cadherin and induce CRC cell growth9. In addition, enterotoxigenic Bacteroides fragilis (ETBF) can produce a zinc-dependent metalloprotease toxin10, which triggers tumorigenesis via epithelial IL17 and Stat3 signalling11,12. These toxins are also enriched in patients with inflammatory bowel disease (IBD), but the cause of this enrichment is not fully understood13,14. Furthermore, associations of VFGs carried by gut pathobionts with other human chronic diseases have not been widely reported.
Recently, the Dutch Microbiome Project, encompassing a vast cohort of 8208 individuals, shed light on the apparent connections between VFGs and chronic diseases15. In this study, bacterial adherence and iron uptake genes had significant impacts on conditions such as type 2 diabetes (T2D) and various gastrointestinal disorders15. However, the limitations posed by the lack of a comprehensive and precise VFG database, alongside advanced bioinformatics tools, might have hindered a comprehensive understanding of the role played by VFGs within the gut microbiota.
Presently, the widely utilized Virulence Factor Database (VFDB) stands as the primary resource for bacterial pathogenicity analysis16, compiling experimentally verified VFGs, which were also adapted for metagenome-association studies17. However, this database does not provide VFG orthologues and alleles or detailed information regarding bacterial hosts or VFG mobility. Analytical tools such as ShortBRED and PathoFact, built upon VFDB, have been employed for VFG analysis of metagenomic data17,18. While ShortBRED aids in determining VFG abundance, PathoFact predicts VFG presence and mobility but with compromised accuracy.
To address these critical gaps, we leveraged a verified dataset comprising 3581 VFGs from VFDB to construct the expanded virulence factor gene database (VFDB 2.0). This comprehensive repository encompasses 62,332 VFG orthologues and alleles spanning 135 bacterial species derived from 18,521 complete bacterial genomes representing 3559 distinct species. Alongside VFDB 2.0, we introduced the MetaVF toolkit—a pipeline tailored to profile VFGs from metagenomic sequencing data utilizing VFDB 2.0. MetaVF excels in reporting VFG diversity, abundance, and coverage and predicts mobile VFGs and their respective bacterial hosts. Notably, MetaVF exhibits superior sensitivity and accuracy compared with existing VFG analytical tools.
The application of the MetaVF toolkit to publicly available short-read metagenomic data from cohorts comprising both healthy individuals and those affected by nine different diseases revealed common features of VFGs across diverse diseases—previously overlooked in conventional studies. Moreover, our investigations pinpointed specific strains of E. coli and K. pneumoniae carrying adherence and iron uptake genes implicated in association studies of cardiovascular disease among patients with type 2 diabetes. These findings were further validated via long HiFi read sequencing data, confirming the importance of VFG identification in understanding microbiome-associated diseases.
Results
The expanded virulence factor gene database (VFDB 2.0)
The VFDB core dataset is a collection of representative gene sequences from each of the 3581 verified virulence factor genes (VFGs), which are normally used for bacterial pathogen analysis and do not include orthologues and alleles from different bacterial species. Here, we defined VFG orthologues as homologous VFGs from different species and VFG alleles as VFGs with single nucleotide polymorphisms (SNPs) in different genomes of any bacterial species or duplicated VFGs in the same bacterial genome. We used the VFDB core dataset of verified VFGs as seeds to find VFG orthologues and alleles on the basis of 18,521 complete genomes of the RefSeq database (see Methods) via species-specific average nucleotide identity (ssANI), given that the evolutionary rates of each species vary. In total, a set of 37,690 orthologues in 75 newly identified species and 429,738 alleles in 60 original species were obtained from the redundant dataset.
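As an illustration of the definitions above, the following minimal Python sketch labels a sequence hit as an allele or an orthologue. It assumes that a hit is accepted when its identity exceeds the ssANI value of the species of the genome in which it was found and when it covers the full seed gene. The `Hit` record, the `label_hit` function, the coverage cut-off, and the ssANI values in the usage example are all hypothetical; the criteria actually applied are those described in the Methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    seed_gene: str       # VFG from the core dataset used as the query
    seed_species: str    # species in which the seed VFG was described
    genome_species: str  # species of the complete genome that was hit
    identity: float      # percent nucleotide identity of the hit
    coverage: float      # percent of the seed gene covered by the hit


def label_hit(hit, ssani, min_coverage=100.0):
    """Return 'allele', 'orthologue', or None for a rejected hit.

    Assumption: the identity cut-off is the ssANI of the genome's species.
    """
    threshold = ssani.get(hit.genome_species)
    if threshold is None:
        return None  # no species-specific threshold available
    if hit.identity <= threshold or hit.coverage < min_coverage:
        return None
    # Same species as the seed: allele; different species: orthologue.
    if hit.genome_species == hit.seed_species:
        return "allele"
    return "orthologue"


# Hypothetical thresholds and hits, for illustration only.
example_ssani = {"Escherichia coli": 96.5, "Shigella sonnei": 97.0}
same_species = Hit("geneA", "Escherichia coli", "Escherichia coli", 99.0, 100.0)
other_species = Hit("geneA", "Escherichia coli", "Shigella sonnei", 98.0, 100.0)
too_divergent = Hit("geneA", "Escherichia coli", "Shigella sonnei", 90.0, 100.0)
```

Under these assumptions, `same_species` is labelled an allele, `other_species` an orthologue, and `too_divergent` is rejected.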
According to the redundant dataset, approximately 70% of the VFGs were species-specific, and 94% were genus-specific (Fig. 1A, Supplementary Data 3). Those VFGs that are not species-specific or genus-specific can be shared by different species and genera, and most of them are from the Enterobacteriaceae family (Supplementary Fig. 1A). Moreover, 3.3% are mobile VFGs associated with plasmids, prophages, and integrative and conjugative elements (ICEs) that are involved in intercellular transmission. Among them, 479 are carried by plasmids, 304 by prophages, and 178 by ICEs (Supplementary Fig. 1B, Supplementary Data 3). Among the 479 plasmid-associated VFGs, 224 are exclusively carried by plasmids (“plasmid-borne only”), and 255 can be carried either by plasmids or on chromosomes of bacterial species (“alternate”) (Fig. 1A). The annotation of the redundant dataset comprised 850 single VFGs and 248 multiple-VFG clusters, which were further classified into 7 VF categories. The bacterial host taxonomy, mobility, and VF categories of each VFG were collected into the annotation dataset.
A The expanded VFG database (VFDB 2.0) was built as follows. First, the species-specific ANI (ssANI) was calculated via complete genomes in the NCBI RefSeq database to obtain ssANI data. Next, the VFGs were expanded by Blast searching 18,521 complete genomes against curated VFDB (top panel) and filtered to obtain the redundant dataset with 467,428 alleles and orthologues. The redundant dataset was then used to generate an annotation dataset with host species and mobility information for VFG annotation in the MetaVF toolkit. The expanded alignment dataset was obtained by removing redundant VFG sequences from the redundant dataset for sequence alignment in the MetaVF toolkit. Third, the VFG alleles from the pathogen strains were further collected into the pathogenic alignment dataset. Finally, the annotation dataset and alignment dataset were integrated into VFDB 2.0. B The MetaVF toolkit consists of two pipelines. Pipeline 1 is for short-read metagenomic data, where trimmed reads are mapped to the expanded alignment dataset and then filtered by 90% identity. Pipeline 2 is for long HiFi reads of metagenomic data or draft genomes via BLASTN against the pathogenic alignment dataset. Finally, the relative abundance, coverage, host species, mobility, and VF categories of VFGs are summarized based on VFDB 2.0.
After removing redundancy, a total of 62,332 VFG sequences of 135 species corresponding to 3527 types of VFGs were included in the expanded alignment dataset (Fig. 1A, Supplementary Data 1). Among them, 15,943 VFG alleles of 2,741 VFGs from 59 pathogenic species were confirmed via the NCBI BioSample database and were collected into the pathogenic alignment dataset (Fig. 1A, Supplementary Data 2). Finally, the annotation dataset and alignment dataset were integrated into VFDB 2.0.
MetaVF toolkit for profiling virulence factor genes in metagenomes based on VFDB 2.0
To produce accurately annotated VFGs from gut metagenomes, we introduce the MetaVF toolkit based on VFDB 2.0, which can be outlined in three steps: alignment, filtering, and annotation. Step 1: For input metagenomic sequence data, clean reads were mapped against the expanded alignment dataset to obtain VFG mapped reads (VFMappedreads); for long HiFi reads or metagenome-assembled genomes (MAGs), nucleotide BLAST was performed against the pathogenic alignment dataset to annotate VFGs. Step 2: VFMappedreads were filtered with tested sequence identity (TSI) obtained via artificial metagenomic datasets (AMSD1), which were generated by combining CAMI datasets with in silico mutated VFGs of defined abundance (27 different combinations of metagenome complexity, VFG abundance, and mutation rates). The TSI was determined under in silico mutated VFGs with 1%, 3% and 5% mutation rates, and the 90% TSI achieved the most stable performance, with a true discovery rate (TDR) > 97% and a false discovery rate (FDR) < 4.000767e-05% (Supplementary Fig. 1C). The complexity of metagenomes and the relative abundance of VFGs do not impact the performance of the MetaVF toolkit. The best hit for the BLAST search of each gene was selected and filtered by identity and coverage. Step 3: The filtered VFMappedreads were counted and normalized by gene length and sequencing depth, which is represented by transcripts per million (TPM). The coverage of VFG clusters, mobility, bacterial host taxonomy, and VF categories are annotated according to the annotation dataset in VFDB 2.0 (Fig. 1B).
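The filtering and normalization steps can be sketched as follows. This is a minimal Python illustration, not the MetaVF implementation: the text names the unit TPM but does not spell out the formula, so the sketch uses a reads-per-kilobase-per-million-reads form as a stand-in for normalization by gene length and sequencing depth. The function name, the input layout, and the choice of an inclusive 90% cut-off are assumptions.

```python
from collections import Counter


def vfg_abundance(alignments, gene_lengths, total_reads, min_identity=90.0):
    """Length- and depth-normalized abundance per VFG.

    alignments   : iterable of (read_id, gene_id, percent_identity)
    gene_lengths : dict mapping gene_id to gene length in bp
    total_reads  : number of clean reads in the sample (sequencing depth)
    """
    # Step 2 (filtering): keep reads at or above the identity cut-off.
    counts = Counter(
        gene for _read, gene, identity in alignments if identity >= min_identity
    )
    # Step 3 (normalization): divide by gene length in kb and by depth in
    # millions of reads. This formula is an assumption made for illustration.
    per_million = total_reads / 1e6
    return {
        gene: (n / (gene_lengths[gene] / 1000.0)) / per_million
        for gene, n in counts.items()
    }


# Hypothetical toy input: one read falls below the 90% identity cut-off.
toy_alignments = [("r1", "g1", 95.0), ("r2", "g1", 89.0), ("r3", "g2", 100.0)]
toy_lengths = {"g1": 1000, "g2": 2000}
toy_result = vfg_abundance(toy_alignments, toy_lengths, total_reads=1_000_000)
```

In the toy input, the read aligned at 89% identity is discarded, so each gene retains one read and the longer gene receives the lower normalized abundance.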
Benchmarking of the MetaVF toolkit, which is more sensitive and precise
To evaluate the performance of the MetaVF toolkit in determining the presence and abundance of VFGs, we performed benchmarking analysis by using Artificial Metagenomic Datasets 2 (AMSD2) and Real Metagenomic Sequencing Data (RMSD) with currently available tools, including PathoFact, ShortBRED, and VFDB direct mapping.
We first used AMSD2 to test the performance of the MetaVF toolkit, PathoFact, ShortBRED, and VFDB direct mapping in VFG identification. The MetaVF toolkit was robust on all 18 combinations of subsets from AMSD2, showing advantages over VFDB direct mapping, PathoFact, and ShortBRED by improving sensitivity and precision. The performances of the MetaVF toolkit, PathoFact, ShortBRED, and VFDB direct mapping were better for artificial datasets with a 1% VFG mutation rate than for those with 3% and 5% mutation rates (Supplementary Fig. 2, Supplementary Fig. 3, and Supplementary Data 4). Moreover, all the tools performed better on the AMSD2_high datasets than on the AMSD2_low datasets. For the AMSD2_low datasets with a 5% mutation rate, the specificity, sensitivity, precision, accuracy, and F1 score of the MetaVF toolkit were 99.99%, 94.33%, 95.13%, 99.99%, and 94.72%, respectively, whereas VFDB direct mapping, ShortBRED, and PathoFact achieved precisions of 12.53%, 15.66% and 0.06%, respectively, and sensitivities of 94.60%, 16.80% and 49.86%, respectively (Supplementary Fig. 2 and Supplementary Data 4). The precision of ShortBRED and PathoFact increased to 49.19% and 0.05%, respectively, and the sensitivity increased to 87.67% and 71.79%, respectively, when the AMSD2_high dataset with a 1% mutation rate was used; however, the precision was still lower than that of the MetaVF toolkit (precision of 94.06% and sensitivity of 99.83%) (Supplementary Fig. 3 and Supplementary Data 4). The performance of the MetaVF toolkit for quantifying VFGs was tested in comparison with that of ShortBRED and VFDB direct mapping, as PathoFact does not report the abundance of VFGs. 
Spearman’s correlation between the predicted and expected relative abundances was the highest in the MetaVF toolkit (all p < 0.001, R > 0.94), followed by VFDB direct mapping (p < 0.001, 0.82 < R < 0.84 for the AMSD2_high datasets; p < 0.001, 0.39 < R < 0.46 for the AMSD2_low datasets) and ShortBRED (−0.042 < R < 0.49 for the AMSD2_high datasets; −0.091 < R < 0.32 for the AMSD2_low datasets; p < 0.05 only under the 1% and 3% mutation rates) in both the AMSD2_high datasets and the AMSD2_low datasets (Supplementary Figs. 4A, B).
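The specificity, sensitivity, precision, accuracy, and F1 score reported for the AMSD2 benchmark follow from the counts of true and false positives and negatives. The short Python sketch below shows the standard definitions; the function name and the counts in the usage example are hypothetical and are not taken from Supplementary Data 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall: detected / truly present
    specificity = ratio(tn, tn + fp)  # rejected / truly absent
    precision = ratio(tp, tp + fp)    # correct / all reported
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
example_metrics = classification_metrics(tp=90, fp=10, tn=890, fn=10)
```

Because true negatives vastly outnumber true positives when few VFGs are present among many candidate genes, specificity and accuracy can stay close to 100% even when precision is low, which is why precision and sensitivity are the more informative columns in the comparison above.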
We further evaluated the performance of the MetaVF toolkit in parallel with ShortBRED, PathoFact, and VFDB direct mapping via both short- and long-read metagenome sequencing data generated from 8 human gut microbiota samples. The long (HiFi) reads were used as a reference to verify the results because they produce accurate metagenome assemblies at the species/strain level. Specifically, an average of 21.0 Gb of long HiFi reads was generated with an average read length of 8.8 kb, and an average of 3307.9 contigs were assembled with a mean N50 of 702,400.625 bp for each sample (see Supplementary Data 5). After binning, 490 MAGs were obtained (61.2 MAGs per sample), including 365 high-quality MAGs (45.6 per sample) (see Supplementary Data 5).
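For reference, the contig N50 quoted above is the length of the shortest contig in the smallest set of longest contigs that together span at least half of the assembly. A minimal Python sketch of that calculation is given below; the function name and the toy contig lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    return 0


# Toy assembly: total length 32, so the two longest contigs reach half.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```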
The results of comparative analyses of the MetaVF toolkit with ShortBRED, PathoFact, and VFDB direct mapping are presented in Fig. 2A. Among all the VFGs identified by the MetaVF toolkit, 71.6% were confirmed by long HiFi reads, compared with 48.3% of those identified by VFDB direct mapping and 33.9% of those identified by PathoFact. ShortBRED, by contrast, yielded fewer false-positive results but missed many of the VFGs detected by long HiFi reads (407 undetected). The Spearman correlation coefficient between the predicted and expected relative abundances was the highest in the MetaVF toolkit (0.16 < R < 0.87, all p < 0.001, except sample D5214, with p = 0.22), followed by VFDB direct mapping (0.27 < R < 0.8, all p < 0.01) and ShortBRED (−0.34 < R < 0.53, p < 0.05, except for sample D1217, with p = 0.91) (see Fig. 2B).
A Comparison of the VFGs detected via the MetaVF toolkit, ShortBRED, PathoFact, and VFDB direct mapping. The left panel represents the total VFGs detected by each method, and the right panel represents the number of overlapping VFGs detected by different methods. B Spearman correlations between the abundance predicted by long HiFi reads and the abundance predicted by the MetaVF toolkit, ShortBRED, and VFDB direct mapping were calculated in deep sequencing samples (two-sided). The sample ID that starts with D1 belongs to the healthy group, D5 belongs to the T2D group, and D3 belongs to the T2D-CVD group. The source data are provided as a source data file.
MetaVF toolkit for annotation of VFG mobility and host species specificity in metagenomic data
Since ShortBRED, PathoFact, and VFDB direct mapping are unable to determine the mobility of VFGs or host species, we next analysed VFGs with the MetaVF toolkit and verified them by long HiFi read sequences using the 8 metagenomic sequenced samples as the test dataset. A total of 695 VFGs were detected by MetaVF and verified by long HiFi reads. For mobility determination, 392 VFGs were identified to be “chromosome-borne only” by MetaVF, and 390 of them (99.5%) were confirmed by long HiFi reads. The remaining 303 VFGs were identified as “alternate” by MetaVF, and 47 (15.5%) were confirmed to be in plasmids by long HiFi reads. None of the “plasmid-borne only” VFGs were identified in the 8 metagenomic samples (Supplementary Fig. 5A). For host species determination, 488 VFGs (of 695 VFGs) on long contigs assembled from long HiFi reads were used, where 107 VFGs were determined to be species-specific and 136 VFGs were determined to be genus-specific by MetaVF, 100% of which were confirmed by long HiFi reads. The remaining 245 VFGs were determined to be non-genus-specific by MetaVF and were confirmed by long HiFi reads to be carried by different host species. For example, the VFGs of Yersinia pestis and Shigella sonnei in VFDB were carried by E. coli in 8 metagenomic samples (Supplementary Fig. 5B). In summary, the MetaVF toolkit can predict “chromosome-borne only” VFGs with an accuracy of 99.5% but cannot determine the “alternate” VFGs. In terms of host specificity, the MetaVF toolkit can predict host taxa at the species and genus levels with an accuracy of 100% but can provide only the range of host taxa for non-genus-specific VFGs in metagenomic samples.
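The three mobility labels used here can be derived from the replicon types on which a VFG has been observed across reference genomes. The Python sketch below illustrates this rule under the assumption that each observation is recorded simply as "plasmid" or "chromosome"; the function name and the input representation are hypothetical, and prophage- or ICE-associated VFGs are not handled.

```python
def mobility_label(replicon_types):
    """Map the replicon types observed for one VFG to a mobility label."""
    kinds = set(replicon_types)
    if not kinds:
        raise ValueError("at least one observation is required")
    unknown = kinds - {"plasmid", "chromosome"}
    if unknown:
        raise ValueError(f"unexpected replicon type(s): {sorted(unknown)}")
    if kinds == {"plasmid"}:
        return "plasmid-borne only"
    if kinds == {"chromosome"}:
        return "chromosome-borne only"
    # Seen on both replicon types, so the location in a given sample
    # cannot be inferred from the annotation alone.
    return "alternate"
```

This also makes the limitation noted above explicit: for an "alternate" VFG, short reads mapped to the gene carry no information about which replicon it sits on in the sample, so only long reads or assemblies can resolve it.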
VFGs are present in bacterial isolates of the healthy gut microbiome
We first analysed the VFGs carried by cultured isolates from healthy gut microbiota from the HBC19, BIO-ML20 and CGR21 datasets, with a total of 5452 bacterial isolates of 301 species, of which only 12 species (4.0%), represented by 512 isolates (genomes), carried VFGs (Fig. 3A). Among these 12 species, all the isolates of 7 species carried VFGs, notably the large numbers of E. coli and K. pneumoniae isolates, whereas only some isolates of the other 5 species carried VFGs, e.g., only 2.7% of the B. fragilis isolates carried colibactin. With respect to the VFG types, only 11.9% (61/512) of the VFG-carrying healthy isolates carried true VFGs (61 isolates of 8 species), which indicated the pathogenic potential of these isolates (Fig. 3A).
A VFGs in 5452 healthy gut isolates. The VF host species panel represents the host species of VFGs with the percentage of strains carrying VFGs. The VF category panel represents categories of VFGs, and the VF type panel represents the types of VFGs. The VFG hits with identity >ssANI and coverage =100% are shown. B The abundance and prevalence of VFGs in 350 HMP samples. The size of the circle represents the prevalence of each VFG, and the colour of the circle represents the VF category. The x-axis represents the mean abundance (log10(TPM)) of each VFG in the HMP cohort. VFGs with >10% prevalence are shown in the form of “original species in VFDB” | gene name. C Comparison of VFGs between clinical and healthy gut isolates of K. pneumoniae. The boxes are coloured grey if the VFG was present in the isolate (with identity > ssANI and coverage = 100%). The VFs that are commonly found in clinical isolates were defined as pathogen-associated VFGs (PAVGs) and are coloured red, whereas the other VFs are coloured blue. hvKp, hypervirulent Klebsiella pneumoniae. D Comparison of VFGs between clinical, environmental, and healthy gut isolates of E. coli. The boxes are coloured grey if the VFG was present in the isolate (with identity > ssANI and coverage = 100%). ExPEC, extraintestinal pathogenic Escherichia coli (red); InPEC, intestinal pathogenic E. coli (yellow); ExPEC-InPEC, isolates that can cause both extraintestinal and intestinal infection (orange). Source data are provided as a source data file.
Considering the bias in culturing bacterial isolates from the gut microbiota, we further analysed VFGs in metagenomes published by the Human Microbiome Project (HMP). For the 350 HMP metagenomes, a total of 651 VFGs were detected and assigned to 30 species, including the 9 species identified in the gut bacterial isolates. The relative abundance and prevalence of VFGs of K. pneumoniae and E. coli were the highest in the healthy gut microbiota, with maximum prevalences of 33.4% and 39.1% and maximum abundances (log10(TPM)) of −1.598115 and −1.377554, respectively (as shown in Fig. 3B). The most prevalent VFGs were colonization and housekeeping VFGs (siderophores), whereas some true VFGs, such as colibactin and T3SS effectors of E. coli and K. pneumoniae, were identified in more than 10% of individuals (10.3% to 17.1% for colibactin and 17.7% to 31.7% for T3SS effectors). In summary, the VFGs in healthy gut microbiota were carried by opportunistic species, and colonization and housekeeping VFGs were the most prevalent and were carried by E. coli and K. pneumoniae.
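Prevalence and mean abundance, as plotted for the HMP cohort, can be computed per VFG from its abundance in each sample. The Python sketch below is illustrative: it assumes that prevalence is the fraction of samples with non-zero abundance and that the plotted value is the log10 of the cohort mean, and it borrows the zero replacement value of 0.0001 from the convention stated in a later figure legend. The function name is hypothetical.

```python
import math


def prevalence_and_log_mean(abundances, pseudo=1e-4):
    """Return (prevalence, log10 of mean abundance) for one VFG.

    abundances : per-sample abundance of the VFG, 0 where undetected
    pseudo     : value substituted for a zero mean before taking the log
    """
    if not abundances:
        raise ValueError("at least one sample is required")
    prevalence = sum(1 for a in abundances if a > 0) / len(abundances)
    mean = sum(abundances) / len(abundances)
    return prevalence, math.log10(mean if mean > 0 else pseudo)


# Hypothetical abundances in four samples: detected in two of them.
toy_prevalence, toy_log_mean = prevalence_and_log_mean([0.0, 100.0, 0.0, 300.0])
```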
Bacterial isolates from healthy gut microbiota contain pathogen-associated VFGs
To further explore the potential pathogenicity of isolates from E. coli and K. pneumoniae, we analysed VFGs in these gut commensals together with their pathogenic counterparts. Compared with hypervirulent strains (hvKp) of K. pneumoniae, healthy gut isolates harboured fewer pathogen-associated VFGs (PAVGs), such as the VFGs rmpA, aerobactin, salmochelin, and yersiniabactin22. However, we observed that 2 isolates from healthy guts carried whole sets of PAVGs, which may represent potential pathogenicity. Unlike hvKp strains, isolates from healthy gut microbiota do not carry T6SS effectors (Fig. 3C). For E. coli, we found that only 31.8% (21/66) of the commensal isolates harboured PAVGs, whereas 70.4% (19/27) of the pathogenic isolates harboured PAVGs (p = 0.001, Fisher’s exact test). For example, 5 of the 66 commensal isolates carried pathogenic ExPEC-specific VFGs, such as the VFGs P fimbriae, S fimbriae, and alpha-haemolysin23. In addition, 4 isolates carried InPEC-specific VFGs, such as VFGs of the T3SS and Shiga toxin23, indicating potential infectivity (Fig. 3D).
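The comparison of PAVG carriage between commensal and pathogenic E. coli isolates is a 2 × 2 contingency problem. The following Python sketch implements a two-sided Fisher's exact test with the standard library only, summing the probabilities of all tables that are no more likely than the observed one. It illustrates the test named above and is not the authors' analysis code; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties with the observed table count.
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Counts from the text: 21 of 66 commensal isolates and 19 of 27
# pathogenic isolates harboured PAVGs.
p_value = fisher_exact_two_sided(21, 66 - 21, 19, 27 - 19)
```

With the counts taken from the text, the resulting p value falls well below 0.05, in line with the reported significance.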
The observation that opportunistic bacterial species carry different types of VFGs led us to regard those bacterial species as pathobionts, defined as gut microbes with pathogenic potential under dysbiosis24. According to the types of VFGs carried by pathobiont strains, we propose classifying pathobionts into 5 different pathobiont types: pathobiont type I (PBT-I), type II (PBT-II), type III (PBT-III), type IV (PBT-IV) and type V (PBT-V). PBT-I refers to those bacterial strains or isolates that contain PAVGs but do not cause infections owing to their low abundance; PBT-II refers to members of the same species from PBT-I that carry VFGs rather than PAVGs, e.g., some isolates of E. coli from the gut microbiota; PBT-III refers to those specialized gut commensals that can carry any VFGs, e.g., enterotoxigenic B. fragilis (ETBF), that may lead to chronic disease; PBT-IV refers to those commensals that carry both VFGs and detrimental metabolite genes, e.g., the E. coli cutC/D gene cluster that produces trimethylamine (TMA); and PBT-V refers to those commensals that do not carry VFGs but contain genes that produce detrimental metabolites causing chronic disease, e.g., the Clostridium sporogenes porA gene that produces phenylacetylglutamine (PAGln)25.
Chronic diseases are characterized by disease-common features of VFGs
To characterize the VFGs and pathobionts associated with different chronic diseases, we chose gut metagenomic datasets of 9 diseases, including colorectal carcinoma (CRC), atherosclerotic cardiovascular disease (ACVD), inflammatory bowel disease (IBD), obesity, hypertension, Parkinson’s disease (PD), gastric cancer (GC), liver cirrhosis (LC), and type 2 diabetes (T2D), for VFG analysis. We found that the diversity and abundance of VFGs were greater in most patient groups except for PD, which indicates that the abundance of VFGs in the gut microbiota may represent health status (two-sided Wilcoxon test, p < 0.05 for ACVD, CRC, GC and LC; see Fig. 4A, B).
The boxplot shows the total abundance (A) and diversity (B) of VFGs in each group (centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers). The sample size of each group is shown within parentheses. Statistical significance between disease groups (DGs, coloured in red) and healthy controls (HCs, coloured in blue) was determined via a two-sided Wilcoxon test (p = 3.7e-11, 0.47, 9.3e-5, and 0.00011 for ACVD, CRC, GC and LC in abundance comparisons; p = 4.0e-12, 1.2e-5, 0.0028, and 5.4e-7 for ACVD, CRC, GC and LC in diversity comparisons; ***p < 0.001; **p < 0.01; *p < 0.05; ns, p > 0.05). VFGs with more than 5 mapped reads were identified. The abundance of VFGs in the sample is represented by log10(TPM). C The VFs significantly enriched in each disease cohort are shown (>10% prevalence and >50% coverage, p < 0.05). Statistical significance was determined by a two-sided Wilcoxon test, and correction for multiple comparisons was performed via Benjamini‒Hochberg adjustment. The colour of the bubble represents the VF category. The solid circles represent common features among diseases (enriched in at least 2 types of disease), whereas the solid circles with squares represent disease-specific VF features. The y-axis labels that are coloured in black are species-specific VFs, whereas those coloured in grey are non-species-specific VFs. * represents “plasmid-borne only” VFs; ** represents “alternate” VFs. ACVD atherosclerotic cardiovascular disease, CRC colorectal carcinoma, GC gastric cancer, IBD inflammatory bowel disease, LC liver cirrhosis, PD Parkinson’s disease, T2D type 2 diabetes. Source data are provided as a source data file.
A total of 65 virulence factors were enriched in different disease groups and carried by different pathobionts, such as C. perfringens, E. coli, F. tularensis, K. pneumoniae, S. dysenteriae, S. flexneri, S. mutans, S. pneumoniae, S. sonnei, F. nucleatum, C. difficile, and H. influenzae (Fig. 4C). In addition, disease-specific virulence factors were identified in ACVD, CRC, and LC; the ACVD patients had 8 specific virulence factors, such as T6SS and T6SS effectors, enterobactin, yersiniabactin, ShET2, EAST1, and type 1 fimbriae. The alpha and theta toxins and the siderophore salmochelin were enriched exclusively in LC patients, whereas the pathogenic E. coli-associated virulence factors CNF-1, alpha-haemolysin, and P fimbriae were enriched in CRC patients. The majority of these disease-specific virulence factors are pathogen-associated VFGs (PAVGs), which are supposedly carried by PBT-I pathobionts (Fig. 4C).
The other 46 virulence factors were identified as common features shared by different types of disease; they were mainly carried by E. coli, S. pneumoniae, K. pneumoniae, and C. perfringens (82.6%), and some were plasmid-borne virulence factors. For example, T3SS effectors were enriched in ACVD, CRC, LC, IBD, hypertension, and T2D patients, and the inflammation-associated virulence factor LPS was enriched in ACVD and LC patients. In addition, FadA and colibactin, which are causative agents of CRC, were enriched in the CRC, LC, and ACVD patient groups26,27 (Fig. 4C).
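The enrichment analysis in this section relies on Benjamini‒Hochberg adjustment of the per-VF p values. A minimal Python sketch of the standard step-up procedure is shown below; the function name and the p values in the usage example are illustrative and are not taken from the source data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values for four virulence factors.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A virulence factor is then called enriched when its adjusted value falls below the chosen threshold (0.05 in the analyses above).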
The common disease features of the virulence factors OmpA, Enterobactin, and ECP are enriched in T2D patients
To further study the VFGs in the gut microbiota associated with chronic disease, we used 150 metagenomic sequencing datasets of T2D patients from the PUMCH dataset, including 50 healthy individuals, 50 T2D patients, and 50 T2D patients with cardiovascular disease (T2D-CVD) (see methods). As expected, we found that the abundance and diversity of VFGs were significantly greater in the T2D and T2D-CVD patient groups than in the healthy group (HC) (two-sided Kruskal‒Wallis test, FDR < 0.05; see Fig. 5A, B).
The violin plot shows the total abundance (A) and diversity (B) of VFGs in HMP (Human Microbiome project, n = 350), HC (Healthy control, n = 50), T2D (Type 2 diabetes, n = 50), and T2D-CVD (T2D patients with cardiovascular disease, n = 50) (centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers). Statistical significance between HMP, HC, T2D, and T2D-CVD was determined via a two-sided Kruskal‒Wallis test, and correction for multiple comparisons was performed via Benjamini‒Hochberg adjustment (***p < 0.001, **p < 0.01, *p < 0.05; ns, p > 0.05). VFGs with more than 5 mapped reads were identified. The abundance of VFGs in the sample is represented by log10(TPM). 0 was replaced with 0.0001 for log calculation. C VFs that were significantly enriched between HCs and T2D patients, HCs and T2D-CVD patients, and T2D patients and T2D-CVD patients (>10% prevalence and >50% coverage, p < 0.05). Statistical significance was determined by a two-sided Kruskal‒Wallis test, and correction for multiple comparisons was performed via Benjamini‒Hochberg adjustment. The colour of the bubble represents the category of VFs. The solid circles represent common features among diseases (enriched in at least 2 types of disease), whereas the solid circles with squares represent disease-specific VF features. The y-axis labels coloured in black are species-specific VFs, and those coloured in grey are non-species-specific VFs. * represents “plasmid-borne only” VFs; ** represents “alternate” VFs. D Spearman correlation between the abundance of VFs and clinical indices (two-sided). Correction for multiple comparisons was performed via Benjamini‒Hochberg adjustment. The results with R > 0.35 and p < 0.05 are shown. FBG fasting blood glucose, HbA1c haemoglobin A1c, HOMA_IR homeostatic model assessment for insulin resistance. Source data are provided as a source data file.
Further analysis revealed that virulence factors (VFs) from E. coli, such as enterobactin, ECP (E. coli common pilus), OmpA, and T3SS effectors, were significantly enriched in T2D patients (see Fig. 5C). Moreover, the LPS, T6SS, and effector genes from K. pneumoniae were significantly associated with both the T2D and T2D-CVD groups, whereas the abundances of the T6SS, salmochelin and enterobactin genes appeared to be significantly greater in T2D-CVD patients than in T2D patients (two-sided Kruskal‒Wallis test, FDR < 0.05; see Fig. 5C). In correlation analysis with T2D clinical indices, we found that the abundance of 26 virulence factors was specifically correlated with FBG and HbA1c (Spearman correlation, all R > 0.4, p < 0.05); T3SS effectors, enterobactin, ECP, and OmpA showed the strongest correlations, and those VFs were predicted to be carried by E. coli and K. pneumoniae via the MetaVF toolkit (see Fig. 5D). Moreover, some pathogen-associated VFs of K. pneumoniae and E. coli, such as rmpA, yersiniabactin, P fimbriae, and alpha-haemolysin, were enriched in T2D and T2D-CVD patients, indicating the existence of hypervirulent K. pneumoniae and uropathogenic E. coli (UPEC) in the gut microbiota of T2D and T2D-CVD patients. In summary, these results showed that the VFs carried by different pathobiont types of E. coli and K. pneumoniae are associated with T2D.
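The Spearman correlation used for the clinical indices is the Pearson correlation computed on ranks, with tied values sharing their mean rank. The Python sketch below shows the coefficient only (no p value) using the standard library; the function names are hypothetical, and the toy values are not patient data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(rank_x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (var_x * var_y) ** 0.5


# Toy example: VF abundance rising monotonically with a clinical index.
toy_r = spearman([0.1, 0.5, 2.0, 9.0], [5.2, 6.1, 7.4, 9.8])
```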
Specific pathobiont types of Klebsiella pneumoniae were identified in T2D-CVD patients
As shown in the previous sections, short-read sequencing analysis revealed that different types of VFGs are associated with different chronic diseases. However, the pathobiont types cannot be defined by short-read sequences because of short read lengths. Long HiFi read sequencing of the gut microbiome can generate strain-level assemblies for functional gene analysis28. We chose 24 samples from the HC, T2D, and T2D-CVD patient groups for long-read sequencing (SMRT), which included 9 samples with deep sequencing (average of 21.0 Gb per sample; one sample failed to render valid sequencing data) and 15 samples with low sequencing depth (average of 3.2 Gb per sample), to determine the corresponding pathobiont types (see Methods). The length of contig N50, the number of MAGs, and the number of annotated species were greater in deep sequencing than in low-depth sequencing (see Supplementary Data 5). In total, we assembled 873 MAGs corresponding to 270 species from 23 samples, where 22 species had more than 10 MAGs and 103 species had one MAG. Among the assembled MAGs, those of E. coli, K. pneumoniae, K. oxytoca, C. perfringens, H. parainfluenzae, and H. influenzae were found to carry VFGs (Fig. 6A).
A Sankey chart showing the VFGs detected in long-read sequencing samples. The sample panel represents the sample ID from which the VFGs were identified with the percentage of non-species-specific VFGs identified by long HiFi reads. The sample panel was coloured on the basis of the group to which the sample belonged. The host species panel represents the host species of VFGs identified by long HiFi reads, and the VFG panel represents the VFGs with host species given by VFDB (labelled “host species in VFDB” | VF name). B Comparison of VFGs among 11 high-quality E. coli MAGs from 9 individuals. Pathogen-associated VFGs (PAVGs) are coloured red, and other VFGs are coloured blue. The phylogenetic tree was constructed on the basis of whole-genome SNPs and rooted on D1217_bin.48. C Genomic structures of the colibactin and enterobactin gene clusters in E. coli MAGs. Source data are provided as a Source Data file.
Focusing on E. coli and K. pneumoniae, which carried the majority of VFGs, we assessed the high-quality MAGs of these two species from each sample with MagPhase for haplotype structure analysis; the average haplotype numbers were 7 for E. coli and 14.5 for K. pneumoniae. Furthermore, every MAG analysed contained one dominant haplotype, which represents a quasi-strain/lineage and provides sufficient data for the subsequent determination of VFG clusters and pathobiont types at the strain level.
Among the 11 MAGs of E. coli from 9 samples, two MAGs contained the VFGs P fimbriae, S fimbriae, and alpha-haemolysin, which are associated with true pathogenic strains (ExPEC) and were assigned to pathobiont type I (PBT-I). Furthermore, the other 9 MAGs were assigned to pathobiont type IV because they carried the detrimental metabolite-producing genes cntA and cntB (trimethylamine, TMA) and VFGs (see Fig. 6B, C). Interestingly, D3103 had two MAGs belonging to pathobionts PBT-I and PBT-IV, indicating a mixture of the 2 pathobiont types from 2 quasi-lineages.
Among the 6 MAGs of K. pneumoniae from 6 samples, 4 MAGs in T2D-CVD patients belonged to PBT-I, containing siderophore genes that are associated with hypervirulent strains (see Supplementary Figs. 6B, C). There were two MAGs of K. pneumoniae in T2D-CVD patients (D3029 and D3120) that carried another set of TMA-producing genes, cutC and cutD, assigned as PBT-IV pathobionts.
Discussion
Previously, metagenome-association studies identified several VFGs that were associated with a few chronic diseases15,29,30. However, current tools for detecting VFGs in metagenomic data tend to generate false-positive results, which impedes the discovery of the role of VFGs in chronic diseases in humans29. The MetaVF toolkit uses VFDB 2.0 to analyse VFGs accurately in bacterial isolates or the gut microbiota at the species level, which was benchmarked by using artificial and real sequencing data. For the gut bacteria isolates, different types of VFGs carried by different strains were determined and used to define pathobiont types and subsequently used for disease association studies. For the gut microbiota, the use of the MetaVF toolkit enabled the identification of disease-specific VFGs in the ACVD, CRC, and LC patient groups and the common feature VFGs shared by ACVD, CRC, GC, and LC individuals, which may reflect the health status of the gut microbiota and could be used for gut health index determination15,31. When we applied the MetaVF toolkit for T2D gut microbiota analysis, we were able to identify VFGs carried by different types of pathobionts and discovered that the VFGs were enriched mainly in E. coli and K. pneumoniae. These results were further verified by using long (HiFi) read sequencing at the strain level for the two bacterial species.
Many VFGs present genetic polymorphisms, and their orthologues and alleles were systematically explored using 18,521 complete genomes, yielding VFDB 2.0. VFDB 2.0 covers an additional 75 species and contains approximately 20-fold more VF orthologues and alleles belonging to pathogens and nosocomial opportunists (Supplementary Data 1).
The use of VFDB 2.0 is essential for accurately identifying VFGs associated with host species, VFG mobility, and the VFG structure (VF clusters) in the gut microbiome. With respect to the species specificity of VFGs, the MetaVF toolkit cannot determine the host species of non-species-specific VFGs but can provide possible host species at the genus level when long HiFi reads are not available. Moreover, horizontal gene transfer (HGT) is believed to occur with a high frequency in the gut microbiota, and VFGs carried by mobile genetic elements (MGE) inside the gut could be mobilized between bacterial cells with high density32. In PUMCH, 18.3% of the VFGs in the T2D-CVD group were predicted to be carried by mobile genetic elements, indicating that VFGs are likely to be mobile in the human gut. For the “chromosome-borne only” or “plasmid-borne only” VF genes, we validated the accuracy of the MetaVF toolkit, which was 99.5% using real sequencing data. Our toolkit cannot determine whether the “alternate” VFGs are located in the plasmid or chromosome in real samples, but we are able to indicate their potential for mobility, which can be validated by long HiFi reads. The MetaVF toolkit is useful in predicting plasmid-borne VFGs, which is difficult to do using metagenomic data when long HiFi reads are not available33.
In fact, most VFGs are components of VFG clusters, and an incomplete VFG cluster may be dysfunctional. For example, the T6SS gene cluster of Shigella flexneri, which has lost several VFGs in comparison with that of Shigella sonnei, has been demonstrated to be dysfunctional34. The annotations of the VFG structure can help users determine the completeness of the VFG cluster in metagenomic samples, which can potentially be used for VFG functional assessment. For the T2D patient group analysis, approximately 40% of the VFG clusters in the disease groups and 10% in the healthy group were calculated to be complete, indicating that the healthy individuals carried fewer functional virulence factors.
Using the MetaVF toolkit, we were able to characterize VFGs in the gut microbiota with disease-specific features and common disease features. The common features of VFGs are characterized mainly by iron uptake and adherence genes, which are carried by a few pathobiont species, such as E. coli and K. pneumoniae, which have been described as common features together with other species by different studies31,35 and could be incorporated into healthy status prediction. The common features of diseased gut microbiomes may be explained by the use of different types of medicines, such as proton pump inhibitors (PPIs) and antibiotics15, while the common features of VFGs in our study (iron uptake and adherence genes) may be explained by the fact that these VFGs possibly help pathobionts gain survival advantages over species without VFGs in the inflamed gut environment36. On the other hand, the findings of disease-specific features of VFGs in the CRC, LC, and ACVD patient groups were characterized by PAVGs, which are carried by PBT-I pathobionts. In the case of CRC, the enriched alpha-haemolysin and CNF-1 carried by uropathogenic E. coli (UPEC) strains are both cytotoxic and possibly associated with CRC37,38, which could be used for diagnostic purposes.
In the PUMCH studies, we demonstrated that the VFGs enriched in T2D patients were carried mainly by E. coli and K. pneumoniae. The association of these VFGs with T2D has also been reported in previous gut microbiome studies without specific host species39. For example, a study involving a large cohort of 8208 Dutch individuals revealed enrichment of siderophores, ECP, and OmpA in T2D patients but without host information15. Notably, the enrichment of VFGs of hypervirulent K. pneumoniae in T2D-CVD patients was identified via the MetaVF toolkit and confirmed via long HiFi reads, which have not been reported in other metagenomic analyses.
Strain-level associations between gut microbes and disease have been advocated in recent years40,41, and SMRT sequencing has been demonstrated to be efficient in bacterial lineage-resolved assemblies28, which is essential for pathobiont type identification. In 23 long-HiFi-read PUMCH samples, we identified two different pathobiont types for 11 strains of E. coli and three different pathobiont types for 6 strains of K. pneumoniae, which could not be identified via short-read sequencing. For the two pathobiont types in E. coli, PBT-IV was found in both the T2D and T2D-CVD patient groups, and the two strains of PBT-IV from K. pneumoniae were found exclusively in two T2D-CVD patients, carrying both VFGs and TMA-producing genes. To date, haplotype analysis has been used to evaluate the quality of long HiFi read-generated MAGs, which still consist of multiple lineages (or strains)28. The MAGs assembled for E. coli and K. pneumoniae via long HiFi reads generated from PUMCH samples consisted of multiple haplotypes that may represent different lineages. Even though the linkages of SCG haplotypes cannot be determined by long HiFi reads, we detected dominant haplotypes (strains) that correspond to the dominant pathobiont types in each of the samples, which was predicted by other studies using short-read sequencing42,43. The quality of assembled MAGs can be improved by increasing the sequencing depth of long HiFi reads, and MAGs with fewer haplotypes can further improve disease association studies at the strain level. The determination of the entirety of VFG clusters and the mobility of VFGs within the gut microbiota can also be improved by using long HiFi reads, which are specifically important for disease association studies. Owing to the limited length of the reads, even long HiFi reads were unable to determine the entirety of VFG clusters that consisted of more than 12 genes at the single-molecule level, such as colibactin and yersiniabactin (Fig. 6C). 
For the MGEs in the gut microbiota, the majority of mobile VFGs are those carried by plasmids (see Supplementary Fig. 6C), while the host species cannot be determined by shotgun sequencing or SMRT sequencing. This problem can be solved only by using a high‐throughput chromosome conformation capture technique (Hi-C sequencing)44.
In conclusion, VFDB 2.0 is a comprehensive database that systematically collects VFG orthologues and alleles from different species, demonstrating that the vast majority of VFGs are genus specific, and it serves as the basis of the MetaVF toolkit for accurately identifying VFGs in gut metagenomic samples. By applying the MetaVF toolkit to several human gut metagenomic datasets, we were able to identify the disease-common features and disease-specific features of VFGs that have not been defined in previous studies, revealing potential biomarkers for health status evaluation in clinical diagnosis. By combining MetaVF and long HiFi read sequence analysis, the colonization of hypervirulent K. pneumoniae in T2D patients can be determined. VFDB 2.0 contains alleles and orthologues of 3,527 VFGs, allowing us to determine the mobility of VFGs within the gut microbiota of individual samples. In summary, the MetaVF toolkit may increase the efficiency of VFG analysis in disease-association studies, and in the future, the combined use of VFDB 2.0 and long HiFi reads may help identify VFGs that are causative agents of gut microbiota-associated diseases.
Methods
Study cohort and sample collection
Sample collection was conducted among 486 adults at Peking Union Medical College Hospital from January to September 2018. A total of 150 participants were enrolled in the final study, including 50 healthy adults, 50 T2D patients (type 2 diabetes mellitus patients without cardiovascular disease), and 50 T2D-CVD patients (type 2 diabetes mellitus patients with cardiovascular disease). Fresh stool samples were collected from all participants and stored at −80 °C. Written informed consent was obtained from the participants before any study procedures were performed, and the experimental protocol was approved by the Institutional Review Board of the Institute of Microbiology, Chinese Academy of Sciences. All the participants were compensated for travelling. All individuals completed a structured questionnaire that included demographic and lifestyle aspects such as nationality, gender, age, household income, education, smoking habits, drinking habits, duration of T2D, family history of the disease, and use of hypoglycaemic drugs. The self-reported CVD diagnosis in T2D patients, including myocardial infarction, stroke, congestive heart failure, and other ischaemic heart diseases, was confirmed through medical records. Participants who used antibiotics, had an invasive medical intervention within the previous 90 days, had a history of any cancer or inflammatory disease of the intestine, or had a moderate or severe illness at the time of enrolment were not enrolled. Participants of any sex and/or gender were enrolled.
DNA extraction, short-read sequencing, and long-read sequencing
The total genomic DNA in the faecal samples was extracted using a QIAamp PowerFecal DNA Kit following the user manual. Paired-end metagenomic sequencing was performed on the Illumina HiSeq X platform. The demographic characteristics of the sample are summarized in Supplementary Data 6. Twenty-four samples (8 from each group) were selected for long-read sequencing on the PacBio HiFi platform. Eight of the 24 samples were sequenced with a total of ~3 GB per sample, whereas the others were sequenced with a total of ~20 GB per sample.
Curation of VFDB
We established an expanded VFG catalogue based on the VFDB core database45 (http://www.mgc.ac.cn/VFs/), which reports the DNA sequences of 3581 experimentally verified VFGs (as of 27 June 2020). First, we manually curated the database to find redundant VFGs and improperly labelled genes. The set1B (in E. coli and Shigella flexneri) and stxB (in E. coli and Shigella dysenteriae) genes were found to be identical. Second, we revised the VF classification in the VFDB on the basis of the VF classification scale proposed by Wassenaar46 and the VF descriptions provided by VFDB45. We classified 3581 VFGs into seven VF categories, including toxins and effectors (type 1 VF), colonization VFGs (type 2 VF), defence system evasion VFGs (type 3 VF), processing VFGs (type 4 VF), secretory VFGs (type 5 VF), housekeeping VFGs (type 6 VF) and regulatory VFGs (type 7 VF).
Calculation of the species-specific ANI
First, we downloaded 20,946 NCBI RefSeq complete bacterial genomes, extracted the taxonomy annotation of the genomes, and removed genomes with unclear taxonomy assignments, such as “sp.” and “candidatus”. A total of 18,521 complete genomes were retained for the following analyses. Next, ANI was computed for genome pairs of species with more than 10 genomes via fastANI47 (version 1.32) with the default parameters. The mean ANI between genome pairs of the same species was defined as the species-specific ANI (ssANI). In total, the ssANI of 1089 species was calculated. In the cases in which a species has only one complete genome in RefSeq, 99% was used as the ssANI of that species, whereas for those species with more than 100 genomes, 100 genomes were randomly selected for ssANI computing.
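The ssANI rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name, the `pairwise_ani` lookup (for example, a wrapper around parsed fastANI output) and the fixed random seed are hypothetical, and the sketch computes a mean for any species with at least two genomes, whereas the text reports ANI computation for species with more than 10 genomes.

```python
import random
from statistics import mean

DEFAULT_SSANI = 99.0  # value used when a species has a single complete genome
MAX_GENOMES = 100     # species with more genomes are randomly subsampled


def species_specific_ani(genomes, pairwise_ani, seed=0):
    """Return the mean pairwise ANI (in percent) for one species.

    genomes: list of genome identifiers belonging to the species.
    pairwise_ani: callable (genome_a, genome_b) -> ANI in percent
        (hypothetical helper, e.g. a lookup into fastANI results).
    """
    if len(genomes) == 1:
        return DEFAULT_SSANI
    if len(genomes) > MAX_GENOMES:
        # Random subsample of 100 genomes; the seed is an assumption for reproducibility.
        genomes = random.Random(seed).sample(genomes, MAX_GENOMES)
    # Every unordered genome pair is compared exactly once.
    values = [
        pairwise_ani(a, b)
        for i, a in enumerate(genomes)
        for b in genomes[i + 1:]
    ]
    return mean(values)
```

The resulting value is then used as the species-level identity cut-off when VFG orthologues and alleles are collected.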
Construction of VFDB 2.0
The 18,521 complete bacterial genomes of 3559 species were aligned to VFDB via local nucleotide BLAST (version 2.5.0). First, the results of the BLASTN were filtered under the thresholds of identity >ANI of the subject species (ssANI) and coverage = 100% (redundant dataset). After removing redundant VFG sequences with 100% sequence identity within the redundant dataset, the nonredundant VFG sequences were collected into an expanded alignment dataset. The VFG sequences originating from pathogenic strains in the expanded alignment dataset were collected into the pathogenic alignment dataset on the basis of information in the NCBI BioSample database.
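As an illustration of the filtering step, the following sketch keeps BLASTN hits whose identity exceeds the ssANI of the subject species at 100% coverage and then removes identical sequences. The record layout (dictionary keys) and the function name are assumptions made for the example, and deduplication is done on exact sequence matches across the whole dataset, which is one plausible reading of the procedure.

```python
def filter_hits(hits, ssani):
    """Build the redundant and nonredundant hit sets.

    hits: iterable of dicts with keys 'vfg', 'species', 'identity' (percent),
        'coverage' (percent) and 'sequence' (hypothetical record layout).
    ssani: dict mapping species name -> ssANI in percent.
    """
    # Redundant dataset: identity above the species-specific ANI, full coverage.
    redundant = [
        h for h in hits
        if h["coverage"] == 100 and h["identity"] > ssani[h["species"]]
    ]
    # Nonredundant dataset: keep the first occurrence of each distinct sequence.
    seen = set()
    nonredundant = []
    for h in redundant:
        if h["sequence"] not in seen:
            seen.add(h["sequence"])
            nonredundant.append(h)
    return redundant, nonredundant
```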
To explore mobile VFGs, three types of mobile genetic elements in 18,521 complete genomes, including ICEs, prophages, and plasmids, were predicted. ICEs were detected on the basis of similarity alignment (>99% identity) against ICEberg48 (https://bioinfo-mml.sjtu.edu.cn/ICEfinder/index.php); PhiSpy49 (version 4.2.19) was used to find prophage sequences (>99% identity); and plasmid sequences were extracted from fasta files with “plasmid” in the sequence name. The VFGs from the redundant dataset carried by mobile elements were determined via a Python script. If alleles or orthologues of a VFG are located on chromosomes of host species, the VFG is defined as a “chromosome-borne only” VFG. If alleles or orthologues of a VFG are located on plasmids of host species, the VFG is defined as a “plasmid-borne only” VFG. If a VFG is located on either chromosomes or plasmids of a host species, the VFG is defined as an “alternate” VFG. To define the host species for each of the VFGs, the host taxonomic information of redundant VFGs from the redundant dataset was used. If alleles or orthologues of a VFG belong to only one bacterial species, the VFG is defined as “species-specific”. If alleles or orthologues of a VFG belong to different species of the same genus, the VFG is defined as “genus-specific”. If alleles or orthologues of a VFG belong to species of different genera, the VFG is defined as “non-genus-specific”. Finally, the annotations of mobile VFGs and host taxonomic information for each of the VFGs (annotation dataset) and the alignment dataset were integrated into VFDB 2.0.
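The two labelling schemes defined above (mobility and host range) reduce to simple set tests. The sketch below is illustrative only; the function names and the input representation (sets of replicon types and of genus/species pairs collected over all alleles and orthologues of a VFG) are assumptions.

```python
def classify_mobility(locations):
    """locations: set of replicon types ('chromosome', 'plasmid') seen for a VFG."""
    if not locations:
        raise ValueError("no location recorded for this VFG")
    if locations == {"chromosome"}:
        return "chromosome-borne only"
    if locations == {"plasmid"}:
        return "plasmid-borne only"
    return "alternate"  # found on both chromosomes and plasmids


def classify_host_range(hosts):
    """hosts: set of (genus, species) tuples seen for a VFG."""
    if not hosts:
        raise ValueError("no host recorded for this VFG")
    if len(hosts) == 1:
        return "species-specific"
    genera = {genus for genus, _ in hosts}
    if len(genera) == 1:
        return "genus-specific"
    return "non-genus-specific"
```

For example, a VFG found only in K. pneumoniae and K. oxytoca would be labelled genus-specific, whereas one shared by Escherichia and Shigella would be labelled non-genus-specific.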
Overview of the MetaVF toolkit
The MetaVF toolkit is a command-line tool for Linux-based systems that integrates two distinct workflows for the prediction of VFGs in metagenomic data or draft genome data.
(1) Alignment
The MetaVF toolkit allows VFG analysis for metagenomic sequencing data (-PE), assembled contigs, draft genomes, or long reads (-draft). For short-read sequencing data, clean reads are mapped to the expanded alignment dataset via bbmap (version 38.91) (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) (with parameters rpkm, idtag=t, and ambiguous=random). The output includes the ‘.sam’ file, which records the mapping details of each read, and the ‘.rpkm’ file, which calculates the total number of reads of each sample. For long reads, contigs, MAGs, draft or complete genomes, nucleotide BLAST (version 2.5.0) is performed against the pathogenic alignment dataset (with parameters -max_hsps 1 -outfmt “6 std gaps qcovs qcovhsp sstrand sseq”).
(2) Filtering alignment
For short-read sequencing data, mapped reads are sorted and filtered via SAMtools50 (version 1.15, parameter: -F 4), and the shell script is used to filter out hits with less than 90% identity according to the tag “YI:f:”. The raw read count, RPK, and TPM of each gene are calculated. For long reads, contigs, MAGs, draft, or complete genomes, the best BLAST hit of each gene is selected and filtered on the basis of identity and coverage (default: identity >90%, coverage >80%).
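The short-read filtering and normalization can be sketched as follows. This is not the toolkit's shell script: the function names are hypothetical, and the TPM calculation uses the standard definition (reads per kilobase, rescaled so that the values in the table sum to one million), which the text does not spell out.

```python
def read_identity(sam_line):
    """Return the value of the 'YI:f:' tag (percent identity), or None if absent."""
    # Optional SAM fields start at the 12th tab-separated column.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("YI:f:"):
            return float(field[len("YI:f:"):])
    return None


def passes_identity(sam_line, min_identity=90.0):
    """Keep alignments with at least min_identity percent identity."""
    identity = read_identity(sam_line)
    return identity is not None and identity >= min_identity


def tpm_from_counts(counts, lengths_bp):
    """counts: gene -> read count; lengths_bp: gene -> gene length in bp."""
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rpk.values())
    return {gene: (value / total * 1e6 if total else 0.0) for gene, value in rpk.items()}
```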
(3) Integrating VF annotations
The VF category, bacterial host species, and mobility of each VFG are further annotated on the basis of the annotation dataset of VFDB 2.0. The final outputs include two files that are calculated via VFGs and VFs. The relative abundance of each virulence factor is represented by the median abundance of VFGs in each VF.
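A minimal sketch of the VF-level summary is shown below, assuming per-gene abundances and a gene-to-VF mapping are available as dictionaries (both names are hypothetical). Whether undetected member genes enter the median as zeros is not stated in the text; here only the genes present in the input are used.

```python
from collections import defaultdict
from statistics import median


def vf_abundance(vfg_abundance, vfg_to_vf):
    """Summarize each virulence factor as the median abundance of its VFGs.

    vfg_abundance: VFG name -> abundance (e.g. TPM).
    vfg_to_vf: VFG name -> virulence factor name.
    """
    grouped = defaultdict(list)
    for vfg, value in vfg_abundance.items():
        grouped[vfg_to_vf[vfg]].append(value)
    return {vf: median(values) for vf, values in grouped.items()}
```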
Generation of artificial datasets
To determine the best threshold and evaluate the performance of the MetaVF toolkit, two sets of artificial data were simulated. First, the raw fasta sequences of low (cami_low), medium (cami_medium), and high (cami_high) complexity datasets in CAMI were downloaded51. We used nucleotide BLAST (version 2.5.0) to exclude bacterial genomes with any naturally occurring VFGs, defined as a sequence matching a gene from the VFDB with >80% identity and 70% coverage, which ensures that the VFG sequences in the artificial data were those that had been artificially spiked in. The final genomes used for generating the simulation data were 965 VF-free CAMI high, 215 VF-free CAMI medium, and 53 VF-free CAMI low genomes. The sequences of 200 VFGs were randomly sampled from the VFDB core dataset with replacement each time via shell script and mutated at 1%, 3%, and 5% via snp-mutator (version 1.2.0, parameters: snpmutator -r 1 -n 3 -s 20 -i 0 -d 0 -o summary.tsv -v variants.vcf -m -M metrics -R seq.fasta -F VF_nutation_fasta VF1.fna) (https://github.com/CFSAN-Biostatistics/snp-mutator). Next, we generated two sets of artificial data, including artificial dataset 1 (AMSD1), which was used for evaluating the best threshold for the MetaVF toolkit in filtering alignments for short-read sequencing data (~5 M reads per sample), and artificial dataset 2 (AMSD2), which was used for benchmarking (~100 M reads per sample). We used InSilicoSeq52 (version 1.5.4, iss generate --draft --model Hiseq --n_reads) to generate simulated metagenome sequencing data mimicking Illumina HiSeq paired-end reads. Each bacterial genome was assigned an abundance value drawn from a log-normal distribution with a unit mean and standard deviation. For AMSD1, the simulated fastq data for mutated VFG sequences were generated and spiked into bacterial reads (CAMI high, medium, and low) at ratios of 1:50000, 1:5000, and 1:500, generating 27 artificial datasets (3 community complexities × 3 mutation rates × 3 spike-in ratios).
For AMSD2, the sampling of VFGs was independently performed 3 times with replacement to avoid biases caused by specific artificial VFGs. The AMSD2 combined high-complexity bacterial data (CAMI high) with 3 random sets of VFGs at different mutation rates (1%, 3%, and 5%) with proportions of 1:5000 (AMSD2_low) and 1:50000 (AMSD2_high), including 18 artificial datasets. The number of reads assigned to each VFG in every simulated dataset was calculated.
Estimation of the threshold of the MetaVF toolkit
We performed a comparative analysis of different thresholds for filtering low-quality hits (99%, 97%, 95%, 93%, 90%, 85%, 80%, 75%, 70%, 65%, and 60%) via artificial dataset 1 (AMSD1) and calculated the true positive rate (TPR = TP / (TP + FN)) and false-positive rate (FPR = FP / (FP + TN)). (TP: true positive, i.e., a read is correctly predicted to be a virulence factor; FN: false-negative, i.e., a read is incorrectly predicted not to be a virulence factor or an incorrect virulence factor; TN: true negative, i.e., a read is correctly predicted not to be a virulence factor; FP: false-positive, i.e., a read is incorrectly predicted to be a virulence factor.)
Benchmark of the MetaVF toolkit using artificial datasets
To evaluate the performance of the MetaVF toolkit, we performed VFG analysis via artificial dataset 2 (AMSD2) and compared it with the following tools: ShortBRED (v0.9.5)18, PathoFact (v1.0)17, and VFDB direct mapping. First, ShortBRED reduces target protein families to short, highly representative peptide sequences (markers) and then maps reads against only those markers to obtain higher speed and specificity. The ShortBRED developers tested thresholds for identifying VFGs and set the best-performing one as the default, which we adopted for benchmarking. More specifically, we used VFDB core dataset B as a candidate gene set and UniRef90 (ref. 53; downloaded in April 2023) as a reference gene set to identify marker VFG sequences via the “shortbred_identify” program (clustering the proteins of interest at 85% identity) and calculated the abundance of each VFG via the “shortbred_quantify” program using default parameters (length >=30 amino acids and >95% identity). PathoFact is an integrated pipeline for predicting virulence factors, antimicrobial resistance genes, and toxins in metagenomic data. PathoFact accepts assembled metagenomic sequencing data (contigs), predicts ORFs via Prodigal software, and determines VFGs in each ORF via a Hidden Markov Model (HMM) and a random forest model. We used the original CAMI draft genomes as input to avoid errors introduced during assembly and ran the “virulence” pipeline to predict VFGs in each dataset with the default parameters. Finally, we also performed VF analysis by mapping reads directly to VFDB core dataset A without filtering low-quality hits.
The prediction quality of the presence of VFGs was evaluated by sensitivity, specificity, precision, accuracy and F1 score using the formulas below.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 score = 2 * precision * sensitivity / (precision + sensitivity)
(TP: true positive, i.e., a read/gene is correctly predicted to be a virulence factor; FN: false-negative, i.e., a read/gene is incorrectly predicted not to be a virulence factor or an incorrect virulence factor; TN: true negative, i.e., a read/gene is correctly predicted not to be a virulence factor; FP: false-positive, i.e., a read/gene is incorrectly predicted to be a virulence factor).
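For reference, the evaluation metrics above can be computed from the four confusion-matrix counts as in the following sketch. The function name is hypothetical, specificity is given by its standard definition (TN / (TN + FP), that is, 1 minus the false-positive rate), and no guard against empty denominators is included.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return presence/absence metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # identical to the true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```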
The prediction quality of the abundance of VFGs was evaluated by Spearman correlation between the expected abundance and the predicted abundance of VFGs, which is represented by the estimated read count in the MetaVF toolkit, VFDB direct mapping, and ShortBRED. False-positive detections are defined as those with 0 expected abundance.
Benchmark of the MetaVF toolkit using real metagenomic sequencing data
Eight samples with both short-read sequencing and long-read sequencing were used to evaluate the performance of MetaVF. The VFGs in long HiFi data were detected via nucleotide BLAST (>90% identity and >80% coverage). The results generated via short-read sequencing methods via the MetaVF toolkit, ShortBRED, PathoFact, and VFDB direct mapping were compared with those generated via long HiFi reads. The parameters used for short-read sequencing methods were the same as those in artificial dataset 2, except that the input of PathoFact was contigs assembled by SPAdes54 (v3.15.5, --meta). The “pathogenic” ORFs were further annotated via protein BLAST against VFDB core set B. The best alignment was selected as the final annotation of the pathogenic ORFs. The abundance of VFGs in long-read sequencing data is represented by read coverage depth.
Comparative analysis of VFGs in healthy isolates and clinical isolates
A total of 5592 draft genomes or raw sequencing data for healthy gut isolates were downloaded from three public databases19,20,21. The raw sequencing data were assembled into draft genomes via SPAdes54 (v3.15.5, --isolate) with the default parameters. The quality of all draft genomes was measured via CheckM55 (v1.0.7) (with parameters lineage_wf --tab_table -x fna Prokka_annotations/). A total of 5452 high-quality draft genomes with greater than 90% completeness and less than 5% contamination were used for further analysis. All summary and quality statistics can be found in Supplementary Data 7. The VFGs in the draft genomes were identified via the -draft workflow in the MetaVF toolkit according to the previous description (100% query coverage and a minimum similarity of ssANI).
Publicly available pathogenic E. coli genomes and hypervirulent K. pneumoniae sequences were downloaded (Supplementary Data 8, 9). Hypervirulent K. pneumoniae genomes were selected by searching the keyword “hypervirulent” in the NCBI BioSample database and were confirmed by the associated articles provided by the NCBI BioSample database. Snippy (version 4.6.0) (https://github.com/tseemann/snippy) was used to call variants (SNPs and INDELs), and E. coli (GCA_000005845.2_ASM584v2_genomic.fna) and K. pneumoniae (GCA_000240185.2_ASM24018v2_genomic.fna) were used as references (default parameters). VCFtools56 (version 0.1.16) was used to compress (bgzip), index (tabix) and merge (vcf-merge) the vcf files of each sample. vcf2phylip (https://github.com/edgardomortiz/vcf2phylip) was used to transform the vcf files to fasta files for phylogenetic tree construction. Phylogenetic trees of the whole-genome SNP sequence were generated via FastTree57 (v.2.1.10, parameters: -gtr -nt) and visualized via the ggtree package in R.
Public dataset download
Ten public metagenomic datasets were downloaded for VF analysis, including the colorectal carcinoma (CRC), atherosclerotic cardiovascular disease (ACVD), inflammatory bowel disease (IBD), obesity, hypertension, Parkinson’s disease (PD), gastric cancer (GC), liver cirrhosis (LC), type 2 diabetes (T2D) and HMP metagenomic datasets (for details, see Supplementary Data 10).
Short-read sequencing data analysis
First, KneadData (v0.10.0) (https://bitbucket.org/biobakery/kneaddata) was used to create clean paired reads (with the parameters PE -phred33 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:50). The clean reads were taxonomically classified with MetaPhlAn358 (version 3.0.14) using the default parameters. Next, the abundance of VFGs in each sample was determined via the -PE workflow in the MetaVF toolkit as described above. VFGs that were present in at least 10% of the participants were selected for differential analysis, and the log10(TPM) was used to normalize skewed distributions of the abundance of VFGs. Spearman correlations between the abundance of VFGs and clinical indices were calculated.
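The prevalence filter and log transformation described above might look like the following sketch. The function names are hypothetical, "present" is taken to mean TPM > 0, and the pseudocount of 1 added before taking log10 is an assumption, since the handling of zero abundances is not specified in the text.

```python
import math


def prevalent_vfgs(tpm_by_sample, min_prevalence=0.10):
    """Return VFGs detected (TPM > 0) in at least min_prevalence of samples.

    tpm_by_sample: sample ID -> {VFG name: TPM}.
    """
    n_samples = len(tpm_by_sample)
    detected = {}
    for profile in tpm_by_sample.values():
        for vfg, tpm in profile.items():
            if tpm > 0:
                detected[vfg] = detected.get(vfg, 0) + 1
    return {vfg for vfg, count in detected.items() if count / n_samples >= min_prevalence}


def log10_tpm(tpm, pseudocount=1.0):
    """log10-transform a TPM value; the pseudocount (an assumption) avoids log10(0)."""
    return math.log10(tpm + pseudocount)
```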
Long-read assembly, binning, and species annotation
Raw base-called data from the PacBio sequencing instrument were imported into SMRTLink (https://www.pacb.com/support/software-downloads) to generate HiFi reads via the CCS algorithm (version 6.0.0), which processed the raw data and generated HiFi fastq files (with the following settings: minimum pass 3, minimum predicted accuracy 0.99). HiFi reads were assembled into contigs via the metaFlye genome assembler59 (version 2.9), and the ‘--pacbio-hifi’ flag was used. The total length, contig number, largest contig length, N50, and L50 were calculated to evaluate the assembly efficiency with Quast60 (v.5.0.034). The functions binning (with parameters --metabat2, --concoct, and --maxbin2) and binning refinement (>70% completeness and <10% contamination) in MetaWRAP61 (v1.3.2) were used to generate MAGs. MAGs with more than 90% completeness and less than 5% contamination were classified as ‘high quality’. GTDB-Tk62 (version 2.24.31) was used to assign candidate taxonomic affiliations to all MAGs (‘classify_wf’ workflow). For MAGs annotated as Escherichia flexneri, if lacY existed and ipaH did not exist in the genome, the genome was defined as E. coli; otherwise, it was assigned to S. flexneri63.
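The MAG quality tiers and the lacY/ipaH rule for resolving MAGs annotated as Escherichia flexneri can be expressed compactly, as in this illustrative sketch. The function names, the tier labels other than "high quality", and the representation of a MAG's annotation as a set of gene names are assumptions.

```python
def mag_quality(completeness, contamination):
    """Classify a MAG from its completeness and contamination (both in percent)."""
    if completeness > 90 and contamination < 5:
        return "high quality"
    if completeness > 70 and contamination < 10:
        return "retained"   # passes bin refinement but is not high quality
    return "discarded"


def resolve_escherichia_flexneri(genes):
    """genes: set of gene names found in a MAG annotated as Escherichia flexneri."""
    if "lacY" in genes and "ipaH" not in genes:
        return "E. coli"
    return "S. flexneri"
```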
Annotation of TMA genes and VFGs via long-read sequencing
To annotate the TMA genes, long-read assemblies were aligned to the localized TMA dataset via nucleotide BLAST (version 2.5.0; BLASTN-megablast), and alignments with at least 90% identity and 80% coverage were retained. To annotate VFGs, the -draft workflow in the MetaVF toolkit was used with parameters of at least 90% identity and 80% coverage.
Detection of mobile VFGs via long-read sequencing
The ICE contigs were determined on the basis of similarity alignment (>99%) against ICEberg48 (https://bioinfo-mml.sjtu.edu.cn/ICEfinder/index.php). The prophage-carrying contigs were discovered via VirSorter264 (version 2.2.3) and CheckV65 (version 0.8.1). First, VirSorter2 was used to select potential prophage sequences (score >0.5), and then CheckV was used to filter host sequences. The filtered contigs were sent to VirSorter2 again to identify prophage sequences (score >0.9). ViralVerify (version 1.1) (https://github.com/ablab/viralVerify/) and PlasFlow66 (version 1.1.0) were applied to predict plasmid contigs. The plasmid contigs predicted by both software programs were retained. The bacterial hosts of plasmid contigs with VFGs were predicted by aligning the contigs to plasmid sequences in the expanded alignment dataset via minimap2 (-ax map-hifi -H -N 1 --secondary=no). The best alignment was retained only if the predicted bacterial host species of the plasmid also existed in the sample.
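The plasmid consensus and host-assignment rules described above amount to a set intersection followed by a membership check. The sketch below illustrates only that logic; the function names and input types are hypothetical, and the tool outputs are assumed to have already been parsed into contig identifiers and species names.

```python
def consensus_plasmid_contigs(viralverify_calls, plasflow_calls):
    """Keep only contigs predicted to be plasmids by both programs."""
    return set(viralverify_calls) & set(plasflow_calls)


def assign_plasmid_host(best_hit_species, species_in_sample):
    """Accept the best-alignment host only if that species is present in the sample."""
    if best_hit_species in species_in_sample:
        return best_hit_species
    return None  # host left unassigned
```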
Phylogenetic analysis of MAGs
Phylogenetic analyses of E. coli and K. pneumoniae MAGs were performed as described above, with GCA_000005845.2_ASM584v2_genomic.fna and GCA_000240185.2_ASM24018v2_genomic.fna used as reference genomes, respectively.
Haplotype analysis of E. coli and K. pneumoniae MAGs
CheckM55 (v1.0.7) was first used to annotate the single-copy genes (SCGs) in each MAG. Long HiFi reads were mapped to E. coli and K. pneumoniae MAGs via minimap267 (version 2.23, parameters: -x asm20). MagPhase28 (version 1.0.0, default parameters) was used to identify the haplotypes of each SCG in the MAG, and the maximum number of haplotypes observed for any SCG was used to represent the haplotype variation of that MAG.
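The per-MAG summary amounts to taking a maximum over SCGs, as in this minimal sketch. The input, a mapping from SCG identifier to the number of haplotypes reported for it, is a hypothetical representation and not the MagPhase output format.

```python
def mag_haplotype_variation(haplotypes_per_scg: dict) -> int:
    """Return the largest haplotype count across all SCGs of one MAG (0 if none)."""
    return max(haplotypes_per_scg.values(), default=0)
```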
Quantification and statistical analyses
Two-sided Wilcoxon tests and Kruskal‒Wallis tests were used for differential analysis between two groups and three groups, respectively. Two-sided Spearman correlation was used for correlation analysis. When multiple hypotheses were investigated, P values were corrected for multiple hypothesis testing via the Benjamini‒Hochberg method (FDR). P values < 0.05 were considered ‘significant’.
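For reference, the Benjamini‒Hochberg adjustment can be written in a few lines of plain Python. This is an illustrative sketch of the standard procedure, not the authors' code, and the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original input order."""
    n = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

An adjusted value below 0.05 would then correspond to a ‘significant’ result under the threshold stated above.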
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The PacBio SMRT raw sequencing data and assemblies generated in this study have been deposited in the NCBI under accession code PRJNA1052403. The Illumina sequencing data have been deposited in the NCBI under accession code PRJNA1053635. The public metagenomic data used in this study can be accessed under PRJEB12449 (ref. 68), ERP008729 (ref. 30), PRJEB27928 (ref. 69), and PRJEB10878 (ref. 70) for colorectal carcinoma (CRC); ERP023788 (ref. 29) for atherosclerotic cardiovascular disease (ACVD); EGAS00001001704 (ref. 71), ERP002061 (ref. 72), and PRJNA400072 (ref. 73) for inflammatory bowel disease (IBD); ERP014480 (ref. 74) for obesity; PRJEB13870 (ref. 75) for hypertension; ERP019674 (ref. 76) for Parkinson’s disease (PD); DRA007281, DRA008243, DRA006684 and DRA008156 (ref. 77) for gastric cancer (GC); ERP005860 (ref. 78; 10.1038/nature13568) for liver cirrhosis (LC); PRJNA422434 (ref. 79) for type 2 diabetes (T2D); and PRJNA43017 (ref. 80) for HMP. The VFDB 2.0 is available at https://github.com/Wanting-Dong/MetaVF_toolkit/tree/main/databases. Source data are provided with this paper.
Code availability
The MetaVF toolkit is available at https://github.com/Wanting-Dong/MetaVF_toolkit. The scripts used in this study are available at https://github.com/Wanting-Dong/VF_analysis_pipeline.
References
Sepich-Poore, G. D. et al. The microbiome and human cancer. Science 371, 1331 (2021).
Fan, Y. & Pedersen, O. Gut microbiota in human metabolic health and disease. Nat. Rev. Microbiol. 19, 55–71 (2021).
Jochum, L. & Stecher, B. Label or Concept - What is a Pathobiont? Trends Microbiol. 28, 789–792 (2020).
Castillo, A., Eguiarte, L. E. & Souza, V. A genomic population genetics analysis of the pathogenic enterocyte effacement island in Escherichia coli: The search for the unit of selection. Proc. Natl. Acad. Sci. 102, 1542–1547 (2005).
Nougayrède, J. P. et al. Escherichia coli induces DNA double-strand breaks in eukaryotic cells. Science 313, 848–851 (2006).
Wilson, M. R. et al. The human gut bacterial genotoxin colibactin alkylates DNA. Science 363, eaar7785 (2019).
Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic pks(+) E. coli. Nature 580, 269–273 (2020).
Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).
Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/beta-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).
Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592 (2018).
Thiele Orberg, E. et al. The myeloid immune signature of enterotoxigenic Bacteroides fragilis-induced murine colon tumorigenesis. Mucosal Immunol. 10, 421–433 (2017).
Chung, L. et al. Bacteroides fragilis Toxin Coordinates a Pro-carcinogenic Inflammatory Cascade via Targeting of Colonic Epithelial Cells. Cell Host Microbe 23, 203 (2018).
Dubinsky, V., Dotan, I. & Gophna, U. Carriage of Colibactin-producing Bacteria and Colorectal Cancer Risk. Trends Microbiol 28, 874–876 (2020).
Cao, Y. et al. Enterotoxigenic Bacteroides fragilis Promotes Intestinal Inflammation and Malignancy by Inhibiting Exosome-Packaged miR-149-3p. Gastroenterology 161, 1552–1566.e1512 (2021).
Gacesa, R. et al. Environmental factors shaping the gut microbiome in a Dutch population. Nature 604, 732 (2022).
Chen, L. H. et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 33, D325–D328 (2005).
de Nies, L. et al. PathoFact: a pipeline for the prediction of virulence factors and antimicrobial resistance genes in metagenomic data. Microbiome 9, 49 (2021).
Kaminski, J. et al. High-Specificity Targeted Functional Profiling in Microbial Communities with ShortBRED. Plos Comput Biol. 11, e1004557 (2015).
Forster, S. C. et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186 (2019).
Poyet, M. et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 25, 1442 (2019).
Zou, Y. Q. et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179 (2019).
Wyres, K. L., Lam, M. M. C. & Holt, K. E. Population genomics of Klebsiella pneumoniae. Nat. Rev. Microbiol. 18, 344–359 (2020).
Croxen, M. A. Molecular mechanisms of Escherichia coli pathogenicity (vol 8, p 26, 2011). Nat. Rev. Microbiol. 11, 141–141 (2013).
Mazmanian, S. K., Round, J. L. & Kasper, D. L. A microbial symbiosis factor prevents intestinal inflammatory disease. Nature 453, 620–625 (2008).
Nemet, I. et al. A Cardiovascular Disease-Linked Gut Microbial Metabolite Acts via Adrenergic Receptors. Cell 180, 862 (2020).
Rubinstein, M. R. et al. Fusobacterium nucleatum Promotes Colorectal Carcinogenesis by Modulating E-Cadherin/β-Catenin Signaling via its FadA Adhesin. Cell Host Microbe 14, 195–206 (2013).
Pleguezuelos-Manzano, C. et al. Mutational signature in colorectal cancer caused by genotoxic pks(+) E. coli. Nature 580, 269 (2020).
Bickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40, 711 (2022).
Jie, Z. Y. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).
Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053 (2021).
Bertrand, D. et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 37, 937 (2019).
Anderson, M. C., Vonaesch, P., Saffarian, A., Marteyn, B. S. & Sansonetti, P. J. Shigella sonnei Encodes a Functional T6SS Used for Interbacterial Competition and Niche Occupancy. Cell Host Microbe 21, 769 (2017).
Dai, D. et al. GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Res. 50, D777–D784 (2022).
Gerós, A. S., Simmons, A., Drakesmith, H., Aulicino, A. & Frost, J. N. The battle for iron in enteric infections. Immunology 161, 186–199 (2020).
Doye, A. et al. CNF1 exploits the ubiquitin-proteasome machinery to restrict Rho GTPase activation for bacterial host cell invasion. Cell 111, 553–564 (2002).
Bielaszewska, M., Aldick, T., Bauwens, A. & Karch, H. Hemolysin of enterohemorrhagic Escherichia coli: Structure, transport, biological activity and putative role in virulence. Int J. Med. Microbiol. 304, 521–529 (2014).
Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262 (2015).
Olm, M. R. et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
De Filippis, F. et al. Specific gut microbiome signatures and the associated pro-inflamatory functions are linked to pediatric allergy and acquisition of immune tolerance. Nat. Commun. 12, 5958 (2021).
Zhao, C. Y., Dimitrov, B., Goldman, M., Nayfach, S. & Pollard, K. S. MIDAS2: Metagenomic Intra-species Diversity Analysis System. Bioinformatics 39, btac713 (2023).
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
Yaffe, E. & Relman, D. A. Tracking microbial evolution in the human gut using Hi-C reveals extensive horizontal gene transfer, persistence and adaptation. Nat. Microbiol. 5, 343 (2020).
Liu, B. et al. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 50, D912–D917 (2022).
Wassenaar, T. M. & Gaastra, W. Bacterial virulence: can we draw the line? Fems Microbiol. Lett. 201, 1–7 (2001).
Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Liu, M. et al. ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Res. 47, D660–D665 (2019).
Akhter, S., Aziz, R. K. & Edwards, R. A. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 40, e126 (2012).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation-a benchmark of metagenomics software. Nat. Methods 14, 1063 (2017).
Gourlé, H., Karlsson-Lindsjö, O., Hayer, J. & Bongcam-Rudloff, E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics 35, 521–522 (2019).
Suzek, B. E., Huang, H. Z., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2-Approximately Maximum-Likelihood Trees for Large Alignments. Plos One 5, e9490 (2010).
Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife 10, e65088 (2021).
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103 (2020).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
van den Beld, M. J. C. & Reubsaet, F. A. G. Differentiation between Shigella, enteroinvasive Escherichia coli (EIEC) and noninvasive Escherichia coli. Eur. J. Clin. Microbiol. Infect. Dis. 31, 899–904 (2012).
Guo, J. R. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578 (2021).
Krawczyk, P. S., Lipinski, L. & Dziembowski, A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res. 46, e35 (2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Vogtmann, E. et al. Colorectal Cancer and the Human Gut Microbiome: Reproducibility with Whole-Genome Shotgun Sequencing. PLOS ONE 11, e0155362 (2016).
Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 66, 70–78 (2017).
Vich Vila, A. et al. Gut microbiota composition and functional changes in inflammatory bowel disease and irritable bowel syndrome. Sci. Transl. Med. 10, eaap8914 (2018).
Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 4, 293–305 (2019).
Palleja, A. et al. Roux-en-Y gastric bypass surgery of morbidly obese patients induces swift and persistent changes of the individual gut microbiota. Genome Med. 8, 67 (2016).
Li, J. et al. Gut microbiota dysbiosis contributes to the development of hypertension. Microbiome 5, 14 (2017).
Bedarf, J. R. et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 9, 39 (2017).
Erawijantari, P. P. et al. Influence of gastrectomy for gastric cancer treatment on faecal microbiome and metabolome profiles. Gut. 69, 1404 (2020).
Qin, N. et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 513, 59–64 (2014).
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Peterson, J. et al. The NIH Human Microbiome Project. Genome Res. 19, 2317–2323 (2009).
Acknowledgements
The authors are grateful to Guowei Yang and Jingren Zhang for their assistance. We would like to thank Khi Pin Chua and Zuwei Qian of the PacBio APAC team for their valuable technical assistance in experimental execution and data analysis related to HiFi sequencing. This research was funded by grants from the National Key R&D Program of China (grant numbers 2021YFA1301000 and 2021YFC2301003) and the National Natural Science Foundation of China (grant numbers 32170068 and 81991534).
Author information
Contributions
Conception and design of the study: W.D., B.Z. and W.Z. Acquisition of data: N.L., T.Y., X.C. Analysis or interpretation of data: W.D., X.F., Y.G., S.W., S.J., Y.P., Y.X., Q.X. Writing and/or revisions of the manuscript: W.D., B.Z., R.Y. All authors have approved the submitted version of the manuscript and agree to be personally accountable for their own contribution.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Tanel Tenson, and the other, anonymous, reviewer for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dong, W., Fan, X., Guo, Y. et al. An expanded database and analytical toolkit for identifying bacterial virulence factors and their associations with chronic diseases. Nat Commun 15, 8084 (2024). https://doi.org/10.1038/s41467-024-51864-y