Introduction

The human gut microbiome is a topic of intense research interest and many bacterial species have been associated with specific diseases1. One such species is Ruminococcus gnavus, for which associations with human health have been reported in the context of various ailments2,3,4,5,6,7. Officially, its taxonomic status has been revised and R. gnavus is now a member of the genus Mediterraneibacter, but it has also been termed Faecalicatena gnavus8. Here, we will designate the species as Ruminococcus gnavus. R. gnavus is a non-spore-forming Gram-positive member of the bacterial phylum Bacillota (formerly Firmicutes) and was first described in 19769. It is considered a prevalent member of the human gut microbiome (present in > 90% of healthy European and North-American adults), but can also be found in the gastrointestinal tract of a variety of animal species10,11. Its median relative abundance in humans is reported to be ~0.1%–0.3%, although it should be noted that these estimates were based on small and geographically restricted studies12,13.

In microbiome association studies, increases in R. gnavus relative abundance have consistently been linked to diseases including metabolic syndrome, type 2 diabetes mellitus and Crohn’s disease (CD, a form of inflammatory bowel disease (IBD))2,3,14. Furthermore, its relative abundance increased concomitantly with symptomatic flares in CD, where it reached up to 69.5% of the gut microbiome2. While it remains unknown if R. gnavus causally contributes to disease development or whether the increased abundance is a result of the changing intestinal environment, several molecular mediators have been identified that potentially contribute to disease. For instance, the cell-surface exposed polysaccharide glucorhamnan has been described as pro-inflammatory, with a strain-dependent effect, depending on whether the R. gnavus isolate carried a capsular polysaccharide that promoted a more tolerogenic response15,16. However, these observations are limited by the fact that they were made using one or few isolates and strain variation remains underexplored in many gut microbes, including R. gnavus.

Not only mechanistic, but also genomic studies of R. gnavus have suffered from a limited scope. One study divided R. gnavus into two clades based on genome sequences and noted that one was enriched in IBD patients2. However, this study was limited by a low number of draft isolate genomes (N = 11) and a scarcity of knowledge on experimentally verified virulence factors of R. gnavus at the time15,16,17,18,19,20. A more recent study based on 152 draft genomes identified three major lineages, but genomes of different host organisms were mixed and this study did not investigate associations of genetic features with metadata11. Therefore, an important outstanding question remains whether proposed R. gnavus virulence factors are enriched in IBD-derived isolates, or whether different genes and functions could separate IBD-derived R. gnavus isolates from controls.

In this work, we surveyed global R. gnavus prevalence and abundance across thousands of gut metagenomes to provide a more nuanced picture across the human lifespan, different lifestyles, and disease, thereby revealing striking differences. Next, through extensive culturing efforts we established a resource of 45 R. gnavus isolates and applied PacBio circular consensus sequencing (CCS) to generate complete genomes. This collection of isolates and their complete genomes provides ample scope for targeted experimental follow-up work and will be available as a resource for the scientific community. We complemented this unique collection with publicly available (short-read draft) genomes, which allowed us to perform large-scale comparative genomics at the level of both phylogeny and predicted gene functions.

Results

Intestinal colonization with R. gnavus is associated with age, health, geography and lifestyle

In order to provide a nuanced view of R. gnavus prevalence and abundance across health and disease, geography, and lifestyle, we screened 12,791 publicly available metagenomes from all over the world with manually curated metadata (Fig. 1, Supplementary Data 1; full per-sample metadata are available through https://waldronlab.io/curatedMetagenomicData/)21. We observed R. gnavus in 50.58% of all included subjects and the prevalence in 9,126 healthy individuals was 43.09% (Fig. 1a). As R. gnavus has been robustly associated with disease, especially with metabolic disease and IBD2,3, we compared R. gnavus prevalence and abundance between patients with these diseases and healthy subjects (or asymptomatic control subjects) in a meta-analysis. Compared to healthy subjects, R. gnavus was ~1.6 times more prevalent in patients with IBD (70.2%; logistic regression, p < 2.2 × 10−16, odds ratio (OR [95% confidence interval]) = 3.1 [2.6–3.7]), 1.3 times more prevalent in hypertension (58.0%; p = 0.00127, OR = 1.8 [1.3–2.5]), 1.5 times more prevalent in type-2 diabetes (T2D; 62.9%; p = 1.52 × 10−9, OR = 2.2 [1.8–2.8]), and 2.2 times more prevalent in atherosclerotic cardiovascular diseases (ACVD; 96.2%; p < 2.2 × 10−16, OR = 33.4 [17.6–74.1]). Furthermore, the relative abundance of R. gnavus was also higher in these conditions than in healthy subjects (Fig. 1b; healthy: median [1st–3rd quartile] = 0% [0–0.08%]; IBD: median = 0.11% [0–1.04%], linear model, p < 2.2 × 10−16; T2D: median = 0.027% [0.0–0.22%], p = 1.9 × 10−10; ACVD: median = 0.78% [0.09–3.14%], p < 2.2 × 10−16), except in hypertension (median = 0.01% [0–0.07%], p = 0.399). Together, we thus recapitulated that R. gnavus occurs more frequently in the gut microbiome of patients suffering from IBD, hypertension and T2D, and in higher abundance in IBD and T2D. Additionally, our analysis uncovered a striking novel enrichment in ACVD, which had the highest prevalence and abundance of any disease group.
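The paragraph above reports both a fold difference in prevalence and an odds ratio for each disease group, and the two measures can differ considerably. The following minimal sketch shows how they relate for a single binary comparison. It assumes an unadjusted analysis, in which the odds ratio from a logistic regression with one binary predictor equals the cross-product ratio of the 2×2 table. The counts are hypothetical round numbers chosen only to resemble the reported percentages, the function names are illustrative, and the Wald interval is a common approximation that is not necessarily the interval method used in the study.

```python
import math


def prevalence(carriers, non_carriers):
    """Fraction of subjects in whom the species was detected (abundance > 0)."""
    return carriers / (carriers + non_carriers)


def odds_ratio_wald(case_pos, case_neg, control_pos, control_neg, z=1.96):
    """Cross-product odds ratio with a Wald confidence interval (z = 1.96 gives ~95%)."""
    odds_ratio = (case_pos * control_neg) / (case_neg * control_pos)
    # Standard error of log(OR) from the four cell counts of the 2x2 table.
    se_log_or = math.sqrt(1 / case_pos + 1 / case_neg + 1 / control_pos + 1 / control_neg)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)


# Hypothetical counts (NOT the study data): 70 of 100 patients and
# 43 of 100 healthy controls carry the species.
patient_pos, patient_neg = 70, 30
control_pos, control_neg = 43, 57

prevalence_ratio = prevalence(patient_pos, patient_neg) / prevalence(control_pos, control_neg)
odds_ratio, ci_low, ci_high = odds_ratio_wald(patient_pos, patient_neg, control_pos, control_neg)

print(f"prevalence ratio = {prevalence_ratio:.2f}")
print(f"odds ratio = {odds_ratio:.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```

With these illustrative counts the prevalence ratio is about 1.6 while the odds ratio is about 3.1, which shows why a modest fold difference in prevalence can correspond to a much larger odds ratio when the species is common in both groups.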

Fig. 1: Intestinal colonization with R. gnavus is associated with age, health, geography, and lifestyle.

a We queried the public resource curatedMetagenomicData for relative abundances of R. gnavus in human stools to conduct a meta-analysis of global prevalence and abundance. Prevalence is shown as fraction of subjects with R. gnavus abundance > 0, grouped by selected health conditions. IBD: inflammatory bowel diseases, T2D: type-2 diabetes, ACVD: atherosclerotic cardiovascular diseases. Each disease group is compared to healthy using logistic regression. IBD: p < 2.2 × 10−16; hypertension: p = 0.00127; T2D: p = 1.52 × 10−9; ACVD: p < 2.2 × 10−16. b Relative abundance of R. gnavus in the same groups as (a) shown as quantile plots, using quantiles ranging from 0 to 100% in increments of 10 with the median shown as a thick black line and quantiles closer to the median shown as darker shades of the same color (see “Methods”). Each disease is compared to healthy using linear regression. IBD: p < 2.2 × 10−16; hypertension: p = 0.399; T2D: p = 1.9 × 10−10; ACVD: p < 2.2 × 10−16. c Comparison of R. gnavus abundance between healthy people from Westernized and non-Westernized societies as quantile plot. P < 2.2 × 10−16, calculated using linear regression. d Prevalence of R. gnavus grouped per country and colored by Westernization, only showing results from countries from which at least 50 samples were collected. (Countries are abbreviated by ISO 3166-1 alpha-3 codes.) e Sequencing depth control per country (same as d). Each diamond represents a study that collected samples from the corresponding country. Sequencing depth is shown as median number of reads generated per country in the study. f Relative abundance of R. gnavus in different age categories (newborn: < 1 year, child: 1–11 years, schoolage: 12–18 years, adult: 19–65 years, senior: 65+ years) shown as quantile plots. Age categories are listed in g. Each age category is compared to adult using linear regression. Newborn: p < 2.2 × 10−16; child: p < 2.2 × 10−16; schoolage: p = 0.0164; senior: p = 1.37 × 10−6. g Prevalence of R. gnavus among different age categories. Each category is compared to adult using logistic regression. Newborn: p = 1.14 × 10−6; child: p < 2.2 × 10−16; schoolage: p = 0.0797; senior: p = 2.92 × 10−4. *** p < 0.001, ** p < 0.01, * p < 0.05, n.s. not significant. In (b, c, and f) a pseudocount of 1.3 × 10−5 is added to all abundances to enable visualization on a logarithmic scale. Source data are provided as a Source Data file.
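The quantile plots described in this legend combine two steps: adding a pseudocount so that zero abundances can be shown on a logarithmic scale, and summarizing each group by its quantiles from 0 to 100% in increments of 10. A minimal sketch of these two steps follows. The pseudocount value is taken from the legend, but the function name, the example abundances, and the choice of the inclusive quantile method are assumptions made for illustration and may differ from the procedure described in the Methods.

```python
import math
import statistics

PSEUDOCOUNT = 1.3e-5  # value stated in the figure legend


def log10_deciles(abundances, pseudocount=PSEUDOCOUNT):
    """Return the 0%, 10%, ..., 100% quantiles of log10(abundance + pseudocount)."""
    shifted = [math.log10(a + pseudocount) for a in abundances]
    # statistics.quantiles returns the nine inner cut points for n=10;
    # the minimum and maximum are added as the 0% and 100% quantiles.
    inner = statistics.quantiles(shifted, n=10, method="inclusive")
    return [min(shifted)] + inner + [max(shifted)]


# Hypothetical relative abundances (NOT the study data), including zeros,
# which would be undefined on a log scale without the pseudocount.
example = [0.0, 0.0, 0.0, 0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 20.0, 80.0]

deciles = log10_deciles(example)
median_log10 = deciles[5]  # the 50% quantile, drawn as the thick line in the plots

for pct, value in zip(range(0, 101, 10), deciles):
    print(f"{pct:3d}%  log10 = {value:.3f}")
```

Samples with zero abundance all map to log10 of the pseudocount, so they form a visible floor in the plot instead of being dropped.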

Subsequently, we investigated prevalence and relative abundance of R. gnavus across countries (Fig. 1c, d and Supplementary Fig. 1a). We show only healthy individuals to exclude possible confounding by diseases such as IBD and metabolic disease. We observed large differences in prevalence, which ranged between 10–90% across countries (overall median: 41%) and mean relative abundance per country ranged between 0.0078-4.05% (overall mean = 0.67% ±3.20 standard deviation; Supplementary Fig. 1a). This variation could be partly explained by Westernization status; this binary classification of Westernized / non-Westernized lifestyles is based on, among others, access to medical care and pharmaceuticals, livestock exposure and diet22. Westernized individuals had higher prevalence and abundance of R. gnavus compared to non-Westernized individuals (Fig. 1c-e; prevalence: logistic regression, p < 2.2 × 10−16; abundance: linear model, p < 2.2 × 10−16). As these data were generated in multiple studies, we cannot exclude effects of technical differences (e.g., DNA extraction method). To partially check for this, we investigated sequencing depth and found that higher prevalence and abundance were not the result of higher sequencing depth in Westernized countries as non-Western samples were sequenced deeper (Supplementary Fig. 1b; t-test p = 6.9 × 10−20). These differences hold true for any 10% quantile of sequencing depth (Supplementary Fig. 1c, Methods). We also checked for possible correlations between sequencing depth and R. gnavus abundance and found a weakly negative correlation in both Westernized and non-Westernized metagenomes (Supplementary Fig. 1d). In conclusion, R. gnavus colonization is vastly different between countries, and Westernization (lifestyle) may be a major factor contributing to these differences.

We noted extremely high R. gnavus abundance values in healthy people, up to a relative abundance of 83%. Metagenomes with the highest abundances were often samples collected from newborns and children up to age 2, most of whom were recorded not to have received antibiotics. This motivated a further analysis of age-related patterns of R. gnavus colonization (Fig. 1f)21. R. gnavus abundances were higher in newborns (linear model, p < 2.2 × 10−16), children up to 11 years old (p < 2.2 × 10−16), and adolescents between 12 and 18 years old (‘schoolage’; p = 0.0164) as compared to adults. Abundances were also higher in seniors (65–92 years old) than in adults (p = 1.37 × 10−6). We observed similar patterns regarding R. gnavus prevalence (Fig. 1g), where newborns (logistic regression, p = 1.14 × 10−6), children aged 1–11 (p < 2.2 × 10−16) and seniors (p = 2.92 × 10−4) were more likely to carry R. gnavus than adults. Prevalence of R. gnavus did not differ significantly between adolescents and adults (p = 0.0797).

The high abundances of R. gnavus in infants instigated a closer inspection of abundance over age and in correlation to breastfeeding, as breastfeeding was recently reported to have a strong impact on R. gnavus colonization23. Looking at R. gnavus abundance in the first ten years of life (Supplementary Fig. 2a), we see a rapid increase after the first half year, followed by a decline and rebound around 8 years. We found the shift in the first half year to strongly correlate with feeding practice (Chi-square, p = 1.49 × 10−15). Specifically, infants that were breastfed had lower R. gnavus abundance than infants that received no breastfeeding (linear model; exclusive breastfeeding, p = 6.49 × 10−11; mixed feeding, p = 4.69 × 10−4; Supplementary Fig. 2b). To exclude possible confounding and identify other associated factors, we also tested for associations of R. gnavus abundance with feeding practice (n = 184), mode of delivery (n = 170) and antibiotics use (n = 94) in infants of age up to two years, for whom feeding practice data had been recorded, using multivariable linear modeling. This indicated that only feeding practice was significantly associated with R. gnavus abundance in infants (Chi-square of total variable effect, p = 8.93 × 10−6). In summary, we find evidence that breastfeeding delays R. gnavus colonization in infants, corresponding with previous reports23.

Together, colonization with R. gnavus appears to be dynamic across the lifespan in healthy individuals, with the highest abundances observed in newborns. While these metagenomic analyses provide important insight into the global distribution of R. gnavus, in-depth genomic analyses are required to investigate whether genomic content differs across described factors such as disease and geography.

Newly generated complete genomes have superior assembly characteristics and cover phylogenetic diversity

For our large-scale genomic analysis of R. gnavus, we first established an isolate collection through extensive culturing efforts and by collecting available isolates, from which we sequenced the genome of 45 isolates using PacBio circular consensus sequencing (CCS) to yield complete, circular genomes and potential extrachromosomal elements (Fig. 2, Methods; Supplementary Data 2). We next complemented these with 208 available metagenome-assembled genomes (MAGs) for which sufficient metadata could be retrieved and short-read genome data of an additional 79 isolates (Methods). To obtain assemblies of optimal quality, we tested five long-read de novo assemblers and selected the result with the longest contig (Methods, Supplementary Fig. 3, Supplementary Data 3). We also comprehensively analyzed methylation patterns for the sequenced isolates (Supplementary Methods, Supplementary Fig. 4, Supplementary Data 4). Comparing the quality of these genomes, we observed that MAGs were worse in every aspect of genome assembly when compared to isolate assemblies (Fig. 2a). While total length and number of genes were lower for MAGs as expected, GC content clearly differed between MAGs and isolate genomes, suggesting that current MAG binning techniques may fail to capture AT-rich regions. We further observed that isolates that underwent PacBio CCS were often assembled into single circular contigs, in contrast to a mean of 107 (± 58.4 standard deviation) contigs per short-read isolate genome. Additionally, we found four circular extrachromosomal elements predicted to be plasmids with 99.9% confidence (Fig. 2a,b, Supplementary Fig. 5), demonstrating the added value of PacBio CCS. These four putative plasmids comprise two different large sequences of 191 kb and 164 kb, which derived from two distinct isolates from healthy individuals (i.e., QRD006, QRD009 and QRD010 contain one plasmid, QRD011 the other), and have not been described in R. gnavus to date. The plasmids are modular and highly related, that is, they are identical except for one gene cluster that is missing from the shorter 164 kb plasmid (Supplementary Fig. 5a). They do not contain evident predicted antibiotic resistance or virulence genes (Methods). The plasmids are likely conjugative or mobilizable based on identified putative transposase genes, which is consistent with their geographically distinct origins (USA and Japan). The plasmids contain a putative ParABS segregation system, annotated as ‘Soj’ (ParA) and ‘ParB domain containing protein’ (ParB). A key feature is a (hypothetical) non-ribosomal peptide synthetase (NRPS) cluster with no known homologs (Supplementary Fig. 5b). However, upstream of it we identified with moderate confidence a transcription factor binding site for CatR, an H2O2-responsive repressor.

Fig. 2: Newly generated complete genomes have superior assembly characteristics and cover phylogenetic diversity.

a We collected both publicly available short-read-based genomes from isolates and metagenome-assembled genomes (MAG), as well as long-read genomes generated from isolates in this study using PacBio HiFi sequencing and compared them to the one reference genome from NCBI GenBank (accession number GCF_009831375.1). Assembly statistics of each group of genomes are compared to the reference genome, shown as dashed line. Thick lines indicate medians, boxes represent first and third quartile and whiskers indicate the rest of the data excluding outliers; outliers are shown as separate dots. Color legend is shared with (c). b Length and circularity of de novo assembled contigs from PacBio HiFi reads. c Maximum likelihood phylogenetic tree based on concatenated core genes. Each genome is annotated with its corresponding genome source and continent of origin. Stars mark genomes sequenced with PacBio newly added in this work. The gray shaded area marks the infant-associated clade that contains 8/10 MAGs with flagellum genes. Source data are provided as a Source Data file.

Leveraging our large genome collection, we then investigated the phylogenetic diversity of R. gnavus (Fig. 2c). This revealed no continent or genome source-specific clustering, but importantly, demonstrated that our R. gnavus isolate collection captures the full breadth of phylogenetic diversity across the tree (Fig. 2c).

R. gnavus motility possibly restricted to infant-derived strains

In order to characterize the functional capacity of R. gnavus, we annotated our genomes with functional orthologs, modules and pathways (from KEGG24) and used linear modeling to identify associations between microbial functions and metadata. Using this methodology, we observed flagellum biosynthesis exclusively in newborns and infants up to 1 year of age, and this association was also statistically significant (p = 0.008). We further investigated flagellum biosynthesis together with chemotaxis, as these are functionally closely related, and found both pathways in ten out of 333 genomes. These ten genomes are all MAGs originating from newborns and infants up to 1 year of age (Supplementary Fig. 6a) and contained (almost) complete operons (Supplementary Fig. 6b). To ensure this finding was not a technical assembly artifact, we traced the origin of these genomes, which revealed that these MAGs derive from infants sampled in three studies and five geographically separated locations (Estonia, Finland, Italy, Russia, and Sweden). Eight out of ten genomes with flagellum genes belong to a phylogenetic clade that is associated with newborns and infants (17/19 genomes in that clade derive from infants of 1 year old or younger; Fig. 2c, clade highlighted in gray), suggesting that motility might be associated with a specific infant-associated clade of R. gnavus. The absence of isolates in this clade precludes experimental verification of flagellum functionality, but strain differences in flagella and motility have been described9.

We also screened all genomes for antibiotic resistance genes and found that resistance against tetracycline is the most common among R. gnavus (75/125 isolates; Supplementary Fig. 7). A minority of genomes contains resistance genes against aminoglycosides (n = 19), chloramphenicol (n = 8), trimethoprim (n = 11), lincosamide/macrolide (n = 24), and one and two genomes contain genes related to beta-lactamase and streptothricin resistance, respectively. For selective culturing of R. gnavus we therefore deem tetracycline the most helpful and in vitro validation confirmed that at least isolates containing the tet(O) and/or tet(40) genes, which account for the majority of the observed tetracycline resistance determinants, indeed have increased minimum inhibitory concentrations compared to isolates without tet gene (Supplementary Data 5).

Genomic differences between isolates from healthy and Crohn’s indicate a Crohn’s-specific subspecies

To evaluate whether CD-derived R. gnavus isolates genomically differ from healthy-derived isolates, we first placed our genomes into a core genome-based phylogenetic tree (Fig. 3a). As this tree contains practically identical isolates derived from the same person, we also constructed a tree of deduplicated genomes to facilitate statistical testing (Supplementary Fig. 8). This revealed three main clades with a strong enrichment of Crohn’s-derived isolates in the two more basal clades (Fisher’s exact test, p = 3.7 × 10−4, OR = 12.1 [2.5 – 69.8]). As our phylogenetic tree was reconstructed from only the core genome, we next performed whole-genome ANI analysis and accessory genome comparisons to also assess differences in the other genomic loci, which resulted in a highly similar clustering (Fig. 3a,b). As all R. gnavus genomes included here share at least 95% similarity with one another, which is often considered the species boundary25,26, we consider that these clades represent subspecies. Together, these results demonstrate that R. gnavus isolates from CD patients are often, but not always, genomically distinct from isolates from healthy controls based both on their core and accessory genome. The phylogeny indicates that most healthy-derived isolates form a monophyletic subspecies clade, while the CD isolates appear polyphyletic and may be categorized into multiple groups.
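The clade enrichment reported above rests on Fisher’s exact test applied to a 2×2 table (host phenotype by clade membership). As a minimal sketch of what this test computes, the example below sums the hypergeometric probabilities of all tables that are at most as likely as the observed one, which is the usual two-sided definition. The function name and the counts are hypothetical and are not the study data, and the odds ratio with confidence interval reported in the text comes from the authors’ own analysis and is not reproduced here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of observing x in the top-left cell
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    p_value = sum(
        table_probability(x)
        for x in range(smallest, largest + 1)
        if table_probability(x) <= observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


# Hypothetical deduplicated isolate counts (NOT the study data):
# rows = CD-derived / healthy-derived, columns = basal clades / remaining clade.
cd_basal, cd_other = 12, 3
healthy_basal, healthy_other = 4, 12

p_enrichment = fisher_exact_two_sided(cd_basal, cd_other, healthy_basal, healthy_other)
print(f"two-sided p = {p_enrichment:.4g}")
```

An exact test is a natural choice here because the number of deduplicated isolates per cell is small, so large-sample approximations such as the chi-square test may be unreliable.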

Fig. 3: Genomic differences between isolates from healthy and Crohn’s indicate a Crohn’s-specific subspecies.

a Using our newly generated PacBio genomes, we compared genomes of isolates from healthy people to isolates from CD patients. Maximum likelihood phylogenetic tree of PacBio isolate genomes using concatenated core genes, with annotation of disease status and genes and gene clusters described previously in literature. Asterisks indicate gene clusters from genomes that are highlighted in Supplementary Fig. 9. Below are heatmaps of pairwise average nucleotide identity (ANI) and accessory genome similarity (calculated as 1 / binary distance). SA: superantigen (2 genes), IP: inflammatory polysaccharide (23 genes, ‘partial’ = 20 or 21 genes), cps: capsular polysaccharide (20 genes), nan: sialic acid metabolic cluster (11 genes, ‘partial’ = 6 genes), TD: tryptophan decarboxylase (1 gene), sd-XHD: selenium-dependent xanthine dehydrogenase (1 gene), bilR: bilirubin reductase (1 gene). b Comparison of three genome similarity metrics (core genome phylogenetic distance, average nucleotide identity and accessory genome binary distance), tested with Spearman correlations. P < 2.2 × 10−16. c Comparison of core and accessory genome size between deduplicated isolate genomes with a CD or healthy phenotype, derived from short-read or long-read sequencing. Box plots represent median values with first and third quartile, whiskers indicate the rest of the data excluding outliers, and overlaid dots (jitter) show individual values. P-values were calculated using two-sided Wilcoxon rank-sum test. Core genome: p = 0.4, accessory genome: p = 0.42. d We compared accessory genomes of isolates from healthy people and CD patients using a bacterial GWAS to identify genes associated with disease phenotype. Results are expressed as false discovery rate-adjusted p-value (using the Benjamini-Hochberg correction) and epsilon, which is a measure of association strength between phenotype and genotype based on the (maximum likelihood) phylogenetic tree. The gray dashed line indicates a p-value of 0.05; anything above the line is considered statistically significant. Positive values of epsilon correspond to an enrichment in CD and negative epsilon values are associated with a healthy host phenotype. P- and epsilon-values are adapted from the synchronous GWAS model as implemented in Hogwash. Source data are provided as a Source Data file.

Host phenotypes cannot be explained by previously identified putative virulence factors in R. gnavus

A previous study established that R. gnavus can secrete a glucorhamnan polysaccharide with pro-inflammatory properties15. However, another study found that the putative gene cluster encoding the production machinery for this polysaccharide varied between strains, but direct comparison was not possible with short-read sequencing data27. Such insights into genomic variations may be crucial to understand immunogenicity of different isolates, motivating a more detailed analysis of this gene cluster and other genes with similar putative functions. We therefore tested whether previously suggested R. gnavus virulence factors could explain the association with CD (Supplementary Fig. 8). First, we observed that four genes or gene clusters (superantigens, tryptophan decarboxylase, bilirubin reductase and selenium-dependent xanthine dehydrogenase) were present in all complete R. gnavus genomes and are therefore part of the core genome (Fig. 3a). While we saw variation in several other gene clusters (glucorhamnan-producing gene cluster, Fisher’s exact test, p = 0.19; and the nan gene cluster, p = 0.35), only one, namely the capsular polysaccharide (cps) gene cluster, was associated with host phenotype and was detected exclusively in isolates from the healthy-associated clade (p = 8 × 10−4). In conclusion, only the cps cluster, which leads to a more tolerogenic immune response16, could partially distinguish host phenotype groups.

Genomic architecture of gene cluster producing the proinflammatory polysaccharide glucorhamnan reveals genomic variations

Previous studies have highlighted the relevance and genomic architecture of the gene cluster producing inflammatory glucorhamnan based on complete, intermediate, or limited short-read coverage27. Here, we re-examined in our diverse collection of complete genomes whether these clusters derive from the same genomic locus and are likely to be homologous (Supplementary Fig. 9). Compared to the isolate in which the gene cluster was experimentally verified (QRD039 = RJX1121)15, we saw variations in multiple genes, including several glycosyltransferases (Supplementary Fig. 9a). We observed 13 out of 45 long-read genomes to have the complete original cluster as identified in RJX1121, while 30 genomes had 20/23 genes as annotated in NZ_AAYG02000032.1 and two had 21/23 genes (those with 20 or 21 hits are subsequently called ‘partially complete’)15,27. These genomes lacked the same genes: a glycosyltransferase (RUMGNA_03519; present in the two genomes with 21 genes found), a transporter (RUMGNA_03522) and a polyphosphoglycerol synthesis gene (RUMGNA_03523). These partially complete cluster variants lack the genes in positions that were reported to have low coverage, and we therefore consider them to be the same as those described in Sorbara et al., 2020 as ‘intermediate coverage’. To elucidate whether these genomes contain a truly different gene cluster at a different genomic location, the flanking genes were determined to map the genomic neighborhood. All investigated genomes had the same neighboring genes, thereby revealing a conserved genomic locus (the 3’- and 5’-flanking genes are annotated as ‘HPr family phosphocarrier protein’ and ‘glutamine-fructose-6-phosphate transaminase’). On closer inspection of the genomic loci, we found that the operon lacking RUMGNA_03519, RUMGNA_03522 and RUMGNA_03523 had other genes inserted instead (Supplementary Fig. 9a). Moreover, the variability at protein level compared to the reference gene (30–70% identity) suggests that this whole locus may be subject to positive selection or adaptation pressure. Nevertheless, based on similarity in genomic architecture we expect that all these strains still produce polysaccharides, although it remains to be established whether all of them induce pro-inflammatory effects.

A similar comparative genomics analysis for the nan gene cluster, responsible for releasing 2,7-anhydro-Neu5Ac from mucin20, showed some genomes with nan-like genes in a different locus (Supplementary Fig. 9b). All these alternative nan-like clusters had the same genomic architecture, which importantly lacked the nanH (intramolecular trans sialidase) gene, suggesting that this partial cluster does not confer the same function. Together, these data show that strain differences across functionally relevant gene clusters are common, indicating that statements regarding virulence of R. gnavus based on single isolates should be interpreted with caution. Our collection of well-characterized isolates allows researchers to assess the relevance of strain differences in future experiments.

GWAS reveals genes related to healthy or Crohn’s-associated phenotype

In order to find genes that could explain differences in the genomic repertoire of Crohn’s-derived versus healthy-derived isolates, we conducted a bacterial GWAS using Hogwash, which incorporates genomic relatedness information (Methods). On a technical note, we confirmed high correlation between core and accessory genomes (Fig. 3b), and high pangenome size similarity between the Crohn’s-associated and healthy-associated groups (Fig. 3c). We deemed including MAGs for this analysis to be inappropriate, as both the core and accessory genome of MAGs are substantially smaller than that of isolates (Supplementary Fig. 10, p <  2 × 10−16). Thus, their inclusion may increase false negatives or otherwise lead to spurious results.

Our bacterial GWAS analysis revealed 163 genes that were robustly associated with Crohn’s isolates (FDR < 0.05, stricter synchronous model) through a high epsilon value, which quantifies the correlation between genotype and phenotype (Fig. 3d)28. We visualized and counted the presence of these genes in all R. gnavus genomes to better understand their possible correlation with host phenotype (Supplementary Fig. 11,12). Among the genes enriched in Crohn’s-derived isolates we found nineteen genes related to mobile genetic elements (transposases and excisionases), a predicted fucosidase which might be involved in cleaving off the terminal fucose residue on mucin, a response regulator that Bakta annotated as ‘spo0A’, and a holin gene (Supplementary Data 6). We screened the consensus sequence of this putative fucosidase gene for CAZyme domains (Methods) to gain more functional insight and indeed found a GH29 domain encoding a fucosidase. We also compared fucosidase domains between Crohn’s and healthy isolates using CAZyme annotations for GH29 and GH95 (CAZymes with known fucose-cleaving functionality of mucin molecules), but found no significant differences (Wilcoxon rank sum test, p = 0.098 and p = 0.39, respectively; Supplementary Fig. 13). On the other hand, healthy-derived isolates were especially enriched for galactosidases and other genes involved in sugar metabolism (Fig. 3d, Supplementary Data 5). Taken together, we find novel gene-phenotype associations and provide a set of candidate genes for follow-up research on the role of R. gnavus in CD.
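Gene-phenotype associations in the GWAS are reported at a false discovery rate below 0.05, using the Benjamini-Hochberg correction named in the legend of Fig. 3. As a minimal sketch of that correction in general (not of the Hogwash implementation, which is not reproduced here), the example below converts raw p-values into adjusted values. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity:
    # adjusted_(rank) = min over ranks k >= rank of p_(k) * m / k, capped at 1.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes (NOT the study data).
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)

# Genes passing an FDR threshold of 0.05, reported by position in raw_p.
significant = [i for i, q in enumerate(adjusted_p) if q < 0.05]
print(adjusted_p, significant)
```

Controlling the false discovery rate rather than the family-wise error rate is a common choice when many genes are tested at once, because it retains more power while bounding the expected fraction of false positives among the reported hits.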

Discussion

Host phenotype-microbe association studies are often restricted to single diseases, age groups and geographic regions, which has also been the case for R. gnavus12,13. In this work we provide a detailed global image of both the relative abundance and prevalence of R. gnavus, while we also investigate genomic variation within R. gnavus isolates in depth. In both aspects, this is to our knowledge the largest investigation to date. Key findings are the remarkably high relative abundance in newborns and young infants (Fig. 1f), which is inversely associated with breastfeeding (Supplementary Fig. 2), and the increased prevalence and abundance of R. gnavus in Westernized populations (Fig. 1c,d). Given the robust associations of increased relative abundance of R. gnavus with several inflammatory diseases and allergies, many of which have high incidence in high-income countries and have their incidences rapidly increasing in newly industrialized countries29,30,31,32, this raises the question of whether R. gnavus can have detrimental immunogenic effects on the host and whether this is strain-dependent. We show extensive genetic variation between strains in immunomodulating gene clusters, and our genetically well-characterized isolate resource can be used for experimental validation of differences in immunogenicity. The high prevalence of R. gnavus across both healthy and diseased individuals suggests that the consequences of being colonized with R. gnavus per se are unlikely to be exclusively negative, prompting the question of whether disease associations become apparent when distinguishing R. gnavus strains. This hypothesis is in line with what we observed in clustering of our isolate genomes (Fig. 3a), where we see that isolates deriving from healthy individuals generally cluster apart from those isolated from Crohn’s patients. Indeed, there have also been examples in literature of a positive health influence of R. gnavus, for example with healthy weight gain in undernourished children33. It would therefore be crucial that future intervention studies using R. gnavus determine whether the isolates used belong to a healthy-associated or disease-associated clade.

In the past decade MAGs have been increasingly used in large-scale gut bacterial genomics studies34,35,36,37,38, especially because culturing of specific gut bacteria can be highly laborious and challenging. While these MAGs have led to important biological advances, we show here that even high-quality MAGs (as defined by international standards39) remain of substantially worse quality than isolate genomes in multiple aspects (lower genome size and missing genes, higher GC content, amongst others, Fig. 2a)40. In the case of bacterial GWAS analyses, which aim to associate bacterial genes or genomic features with a phenotype of interest, including MAGs may therefore lead to biases and spurious associations caused by (non-)randomly missing genes due to binning and assembly artifacts. Extrachromosomal elements such as plasmids are generally not represented in MAGs, as they cannot be confidently binned, although these may be the most relevant in connection to disease and treatment options41,42.

Through bacterial culture combined with PacBio CCS, we have generated high-quality genome data that lead to novel insights into R. gnavus biology. Two aspects that highlight this are the identification of large plasmids and a conserved methylated sequence motif. To date, only one 7 kb plasmid of Ruminococcus gnavus is described in GenBank (accession number NZ_CP084015.1)43. The two related novel plasmids we identified in the present study are much larger (164 kb and 191 kb; Supplementary Fig. 5) and likely conjugative, indicating a diversity of plasmids in R. gnavus that is as yet underexplored. The methylated DNA motifs that are identified here are different from those known so far (http://rebase.neb.com/cgi-bin/pacbioget?10929; Supplementary Fig. 4)44, in line with the high variability in motifs we found per genome. Nevertheless, we find a single m4C-methylated motif that is almost universally conserved across R. gnavus genomes (VNNVNCTGVNCAN). These results are reminiscent of those described for Clostridioides difficile45.

We demonstrated that R. gnavus is a polyphyletic species, divided into multiple (genotypically and phenotypically distinct) subspecies clades. Notably, Crohn’s-derived isolates were overrepresented in specific phylogenetic groups, while previously suggested virulence factors could not explain this separation. This suggests that these virulence factors may not play a significant role in CD symptomatology. Instead, by bacterial GWAS we identified 163 genes that could be targets for experimental validation of their role in CD development (Fig. 3, Supplementary Data 5). Among these genes are 56 that we find overrepresented in CD. However, we advise further validation of these genes in larger numbers of Crohn’s-derived R. gnavus genomes before conducting laborious in vitro or in vivo experiments. Validations with the currently available data indicate that some presumably Crohn’s-associated genes are also common among R. gnavus derived from healthy people. We listed the more noticeable candidates for which functions could be predicted. The most striking candidate is a putative fucosidase gene, as this could be directly involved in relevant cellular processes such as cell adhesion and immune system regulation46. Secondly, we hypothesize that genomic rearrangements and horizontal gene transfer may play an important role in the evolution of CD-associated R. gnavus, given the enrichment of predicted transposase and excisionase genes. Thirdly, we find a predicted holin gene which, although highly speculative, might play a role in suppressing competing bacteria47. A previous study identified 199 IBD-specific genes2, based on a pangenome of 17 draft genomes. Those draft genomes include multiple IBD-related strains and genomes from the type strain, which we find to be phylogenetically distant in our core genome phylogeny based on a pangenome of 333 genomes. 
This increase in genome number in the current work particularly expands the accessory genome, where the largest differences in functionality are expected. Both the previous report and our results indicate predicted functional differences in, for example, mobile elements such as transposases and (putative) mucus utilization genes, underscoring the robustness of the results and narrowing down the set of target genes for IBD-specific research2. Furthermore, IBD research on R. gnavus could benefit from considering the host and possible complex host-microbe interplay for the proposed virulence factors. For example, in antibiotic-treated mice the genetic background determined whether R. gnavus would ameliorate or exacerbate colitis48.

In conclusion, we present one of the largest collections of complete genomes and associated extrachromosomal elements of any gut microbe not usually causing acute infection49, and provide important novel biological insight into the global epidemiology and genomic variation of R. gnavus. R. gnavus has an ambiguous relationship with human health50, and different strains may exert different effects on their host. Our resource of complete genomes and isolates opens promising avenues for experimental validation and further bioinformatic scrutiny, and we expect this to be valuable to the broad gut microbiome research community.

Methods

Assessing prevalence and abundance of R. gnavus across human populations

We used the publicly available ‘curatedMetagenomicData’ (version 3.6.2) resource to screen 21,030 fecal metagenomes from 86 studies on all habitable continents for the prevalence and abundance of R. gnavus21. We used R (version 4.0.2; https://www.R-project.org/) to interrogate this dataset and calculate statistical parameters. We focused our analyses on metagenomes with a sequencing depth of at least five million reads and retained only the first sample per subject ID, after which 12,791 samples remained. We used the accompanying curated metadata to assess prevalence and abundance among healthy individuals across age, geography, lifestyle, and health states (Supplementary Data 1). Prevalence of R. gnavus was compared using logistic regression. Relative abundances were compared after adding a pseudocount of 1.3 × 10−5, followed by log-transformation and multivariable linear modeling. To identify suitable variables for logistic and linear models, we assessed collinearity between variables with Variance Inflation Factors (VIF), calculated using the ‘vif()’ function from the ‘car’ package. VIF values above 2 were eliminated by removing age (in years) and country from the models, leaving disease, age category, gender and westernization as informative variables. Rows with missing values were discarded when building the models. For the final models, the association of each variable with R. gnavus prevalence or abundance was tested with Chi-square using the ‘drop1()’ R function. For infants, linear models were built using the same approach, including the variables feeding_practice, born_method and antibiotics_current_use. Correlation between feeding practice and age under or over half a year was tested using Chi-square. Sequencing depth (number of reads) was also log-transformed and compared using a parametric t-test. P-values ≤ 0.05 were considered significant. To compare differences in R. gnavus prevalence in relation to sequencing depth, we divided all Westernized and non-Westernized metagenomes into ten equal groups (quantiles) based on sequencing depth (number of reads). Relative abundances of R. gnavus are shown as quantiles, as adapted from previous publications51,52.
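The transformation and depth-based grouping described above can be illustrated with a minimal sketch. It is not the R code used in this study: the base-10 logarithm, the definition of prevalence as a relative abundance above zero, and all function names are assumptions made for illustration only.

```python
import math

PSEUDOCOUNT = 1.3e-5  # pseudocount stated in the text


def log_abundance(rel_abundance, pseudocount=PSEUDOCOUNT):
    """Return log10(relative abundance + pseudocount).

    The logarithm base is an assumption; the text only says 'log-transformation'.
    """
    return math.log10(rel_abundance + pseudocount)


def depth_groups(samples, n_groups=10):
    """Split samples into n_groups equally sized groups ordered by read depth.

    `samples` is a list of (read_count, rel_abundance) tuples (hypothetical layout).
    """
    ordered = sorted(samples, key=lambda s: s[0])
    groups = [[] for _ in range(n_groups)]
    for rank, sample in enumerate(ordered):
        groups[rank * n_groups // len(ordered)].append(sample)
    return groups


def prevalence(group):
    """Fraction of samples with a relative abundance above zero (assumed definition)."""
    return sum(1 for _, abundance in group if abundance > 0) / len(group)
```

Prevalence can then be computed per depth group, separately for Westernized and non-Westernized metagenomes, to check whether detection depends on sequencing depth.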

Mapping the distribution of R. gnavus across environments

To map the spread of R. gnavus across different environments, we searched publications and online resources that link the presence of R. gnavus to an environment or biome. R. gnavus has been described to reside in the intestinal tract of different animals: cats and dogs10, chickens53, lambs54, rodents and pigs11, and cattle55. Furthermore, we have downloaded and screened the dataset related to the 2022 Microbiome publication by Ruscheweyh and colleagues to visualize prevalence and abundance of R. gnavus (Supplementary Fig. 14)56.

Collection and curation of publicly available genome datasets

To compose a collection of R. gnavus metagenome-assembled genomes (MAGs) and isolate genomes, we queried a large, recent collection of gut MAGs34. Here, we specifically selected high-quality (HQ) MAGs annotated as Ruminococcus gnavus or its synonym Faecalicatena gnavus (with completeness > 90% and contamination < 5%)39. As the metadata from Almeida et al. does not contain curated information on disease status of the individual and this is of prime interest to our study34, we matched identifiers to those present in the curatedMetagenomicData package. HQ-MAGs were only included if at least both disease status and geographic origin of the original sample could be traced back. This led to a collection of 201 HQ R. gnavus MAGs with associated metadata.
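As an illustration of the inclusion criteria above, the following sketch filters a list of MAG records on the stated quality thresholds and on the availability of metadata. The record layout and key names ('completeness', 'contamination', 'disease', 'country') are hypothetical and do not reflect the actual column names in either resource.

```python
def is_high_quality(completeness, contamination):
    """HQ definition used in the text: completeness > 90% and contamination < 5%."""
    return completeness > 90.0 and contamination < 5.0


def select_mags(mags):
    """Keep HQ MAGs for which both disease status and geographic origin are known.

    `mags` is a list of dicts with hypothetical keys
    'id', 'completeness', 'contamination', 'disease' and 'country'.
    """
    kept = []
    for mag in mags:
        if not is_high_quality(mag["completeness"], mag["contamination"]):
            continue
        if mag.get("disease") is None or mag.get("country") is None:
            continue
        kept.append(mag)
    return kept
```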

In order to obtain additional isolate genomes to complement the MAG collection, we queried the NCBI database in December 2021 and retrieved the associated metadata, requiring, as for the HQ-MAGs, at least information on disease status and geographic origin of the isolate. This yielded an additional 65 R. gnavus isolate genomes, which all originated from China or the USA. Furthermore, we included the type strain as reference genome (ATCC 29149, accession number GCA_009831375.1)2,27,57.

Metagenome-assembled genome generation from fecal metagenomes derived from multiple recurrent Clostridioides difficile-infected patients

We used an in-house metagenomic dataset of multiple recurrent Clostridioides difficile-infected patients to generate seven additional HQ R. gnavus MAGs. The metagenomic data are available in the European Nucleotide Archive under project number PRJEB44737 (ref. 58). To produce high-quality metagenome-assembled genomes (MAGs), we adapted a previously published protocol59.

The workflow is available as Snakemake60 on Zenodo (https://doi.org/10.5281/zenodo.14628195) and works as follows. Raw metagenomics sequencing reads, from which human reads had already been removed, were preprocessed using fastp (version 0.20.1, parameters: ‘--cut_right --cut_window_size 4 --cut_mean_quality 20 -l 75 --detect_adapter_for_pe -y’) to trim low-quality ends, remove reads shorter than 75 bases, remove adapter sequences and remove low-complexity reads61. (Note: preprocessing is not part of the workflow as described on Zenodo.) Remaining, high-quality reads were assembled into scaffolds using metaSPAdes (version 3.15.4, parameters: ‘--only-assembler’)62. Scaffolds were binned with metaWRAP63 (version 1.3.2) using three binning tools: MaxBin264 (version 2.2.6), MetaBAT265 (version 2.12.1) and CONCOCT66 (version 1.0.0) using a minimum contig length of 2500 bp (‘-l’ option). Bins were then refined using metaWRAP’s ‘bin_refinement’ function, which uses CheckM67 (version 1.0.12) to assess bin quality, setting completeness and contamination cut-offs of 75% and 10%, respectively (‘-c’ and ‘-x’ options). After refinement, bins were reassembled using metaWRAP’s ‘reassemble_bins’ function with assemblers MEGAHIT68 (version 1.1.3) and metaSPAdes (version 3.13.0), again setting the minimum completeness to 75% and contamination to 10%, and the minimum length to 2000 (‘-l’ option). The resulting refined and reassembled bins were classified with the Genome Taxonomy Database toolkit (GTDB-Tk; version 2.1.0)69. Bins classified as Ruminococcus gnavus with > 90% completeness and < 5% contamination were included for further analyses.
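Because read preprocessing is not part of the published workflow, the sketch below shows one way the fastp call could be assembled from the parameters listed above. Only the quoted option string comes from the text; the input and output file names are hypothetical, and the paired-end input and output flags (‘-i’, ‘-I’, ‘-o’, ‘-O’) are standard fastp options added here as an assumption about how the reads were supplied.

```python
import shlex

# Option string as quoted in the text for fastp version 0.20.1.
FASTP_OPTIONS = (
    "--cut_right --cut_window_size 4 --cut_mean_quality 20 "
    "-l 75 --detect_adapter_for_pe -y"
)


def fastp_command(in1, in2, out1, out2):
    """Build the argument list for a paired-end fastp run (file names are hypothetical)."""
    return ["fastp", "-i", in1, "-I", in2, "-o", out1, "-O", out2] + shlex.split(FASTP_OPTIONS)


if __name__ == "__main__":
    cmd = fastp_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                        "clean_R1.fastq.gz", "clean_R2.fastq.gz")
    # Print the command rather than executing it; pass `cmd` to subprocess.run to run fastp.
    print(" ".join(cmd))
```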

Culturing of R. gnavus from feces of healthy donors and patient material

We ordered R. gnavus strain H2_28 (DSM number 108212) from the German Collection of Microorganisms and Cell Cultures (DSMZ, Braunschweig, Germany), resuspended it in Brain Heart Infusion broth (bioMérieux, Marcy-l'Étoile, France) and streaked it on Tryptic Soy agar +5% Sheep blood (TSS; bioMérieux) to isolate pure cultures. Two unique cultures (QRD001-QRD002) were isolated from feces by streaking on Columbia Nalidixic acid Agar (bioMérieux; Supplementary Data 2). These were all cultured in an anaerobic cabinet (Whitley A35, Don Whitley Scientific Limited, UK) with an anaerobic gas mixture (10% H2, 10% CO2, 80% N2) at 37 °C. These samples were cultured from two different sample collections. First, healthy-derived isolates were obtained from fecal samples of Netherlands Donor Feces Bank donors; written informed consent was obtained for the use of these samples and clinical data, and this use was approved by the Medical Ethics Committee at Leiden University Medical Center (P15.145). Second, CD-derived isolates from LUMC were obtained from fecal samples of patients aged above 18 years with a planned fistula surgery at LUMC; material was collected between July 2019 and June 2021. The study was approved by the Central Committee on Research involving Human Subjects and the local Medical Ethical Committee of the Leiden University Medical Center (study number P18.069). All patients gave written informed consent.

To further expand our R. gnavus genome collection, we cultured fourteen R. gnavus isolates from fecal samples of healthy feces donors that were available at Vedanta Biosciences (Supplementary Data 2). Human donor samples were obtained from both university hospitals and commercial sources. In all instances, informed consent language was reviewed and approved by the local ethics and regulatory authorities. Consent for the use of the sample was obtained from each subject. These were isolated and identified as follows: R. gnavus strains were isolated from various healthy donor stools by generating spore and non-spore fractions. Briefly, the non-spore fraction was generated by resuspending 1 g of fecal material in 10 mL sterile, pre-reduced PBS. The spore fraction was generated by adding 100% ethanol to the PBS fecal suspension to achieve a 50% (v/v) ethanol concentration. The fecal ethanol suspension was incubated at 25 °C for 1 hr while shaking. Following incubation, the fecal ethanol suspension was centrifuged at 3400 × g for 20 minutes and the cell pellet resuspended in 1 mL of sterile, reduced PBS. Serial dilutions of the spore and non-spore fraction were plated on either Eggerth-Gagnon + 5% horse blood agar, Brucella Blood Agar (Anaerobe Systems, Inc., Morgan Hill, California, USA), MSAT (Anaerobe Systems), or chocolate agar and incubated at 37 °C anaerobically for 72 hr. Isolated colonies were identified by Sanger sequencing of the 16S amplicon using 8F and 1492R primers and Illumina shotgun sequencing. Isolated colonies were inoculated into 1.2 mL of Peptone Yeast Extract Broth with Glucose (PYG; Anaerobe Systems) in a 96-deep well plate and incubated at 37 °C anaerobically for 48 hr. After incubation, colony identity was determined by performing PCR from 200 µL of the culture using universal 16S primers 8F and 1492R. Selected isolates were then sub-cultured from the 96-deep well plate onto the appropriate agar medium and incubated at 37 °C anaerobically for 72 hr. An isolated colony from this plate was inoculated into 5 mL of PYG and incubated at 37 °C anaerobically for 24 hr. 1 mL of the culture was pelleted by centrifuging at 10000 × g for 5 minutes. DNA was extracted from the pellet using the DNeasy blood and tissue kit (Qiagen, Hilden, Germany) following the manufacturer’s instructions. Colony identity was determined again by Sanger sequencing of the 16S gene amplicon using 8F and 1492R primers and Illumina shotgun sequencing.

Furthermore, fourteen isolates were cultured and collected at the University Medical Center Groningen as follows. Brucella blood agar medium (Mediaproducts BV, Groningen, The Netherlands) was used to cultivate the R. gnavus strains QRD024, QRD025 and QRD028 from human clinical specimens (Supplementary Data 2). QRD024, QRD025 and QRD028 were obtained from clinical samples and isolated bacteria were used for research purposes as no objections were raised by patients and no patient data was used. The plates were transferred to an anaerobic workstation (Whitley A45) after inoculation and incubated for one to three days at 37 °C. The anaerobic medium YCFA supplemented with either apple pectin or porcine mucin type III (4.5 g/l) was used for the isolation of QRD026, QRD027, and QRD029-QRD031 as described earlier70. Fecal samples of healthy volunteers were used for inoculation on pre-reduced medium and the plates were incubated at 37 °C in an anaerobic chamber (Whitley A35 Workstation) with an anaerobic gas mixture (10% H2, 10% CO2, 80% N2). The strains QRD032-QRD037 were isolated from fecal samples of IBD patients on either phenylethyl alcohol agar (Mediaproducts BV, Groningen, The Netherlands), brain heart infusion agar (Oxoid Limited, Cheshire, UK) supplemented with yeast (2.5 g/l), hemin (0.001% w/v) and cysteine (1 g/l) or YCFA medium supplemented with glucose (4.5 g/l). Ethical approval for collecting and using biological material was obtained as previously described for QRD026, QRD027 and QRD029-QRD037 (local ethics committee of the University Medical Center Groningen METc2014.236 and METc2014.291, respectively)71. Additional details on logistics and sample collection can be found in Plomp et al. for QRD026, QRD027 and QRD029-QRD03172, and in von Martels et al. (study was registered on ClinicalTrials.gov under NCT02538354) for QRD032-QRD03771.

Moreover, isolates as cultured in their respective publications were obtained from the Broad Institute15, and Sanger Institute73. All cultures from outside the Leiden University Medical Center (LUMC) were sent to the LUMC as frozen glycerol stocks and anaerobically cultured on TSS. After obtaining pure colonies, all isolates were independently confirmed to be R. gnavus in our laboratory using matrix-assisted laser desorption/ionization coupled to a time-of-flight mass spectrometer (MALDI-TOF; Bruker Daltonics GmbH, Bremen, Germany). All isolates were able to grow on TSS, CNA and Chocolate agar PolyViteX (bioMérieux) and the colony morphology appeared on plates as round, glassy white colonies with a bright white center. Sometimes colonies displayed concentric circles, reminiscent of checker game pieces.

Data processing of Illumina-sequenced R. gnavus isolates

The fourteen isolates cultured at Vedanta Biosciences were sequenced on the Illumina NextSeq platform using 150 bp paired-end reads. These data were included with the isolate short-read-based genomes, increasing the number to 79 short-read isolates. Raw Illumina sequence data were cleaned and trimmed using fastp (v0.23.2) and sequence quality was inspected using FastQC (v0.11.9; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC74 (v1.8). Cleaned reads were assembled by first using SKESA75 (v2.4.0) and subsequently SPAdes (v3.15.3) with “--untrusted-contigs” and “--isolate” parameters.

Quality control and annotation of short-read-based genome collection

We have collected a total of 287 short-read-based genomes of R. gnavus, consisting of 79 assembled whole-genome sequences from cultured isolates and 208 metagenome-assembled genomes (MAGs). We also added the one available reference sequence in our analyses (NCBI GenBank accession number GCF_009831375.1). We filtered out contigs shorter than 1000 bp using BBtools’ reformat.sh (version 37.62; https://sourceforge.net/projects/bbmap/). We estimated completeness and contamination of all genomes using CheckM (version 1.0.13) and verified that all genomes taxonomically classify as R. gnavus using GTDB-Tk (version 2.1.0). Assembly length statistics were determined using QUAST76 (version 5.0.2). Finally, genomes were annotated using Bakta77 (version 1.6.1), which also provides the number of open reading frames, or predicted genes, per genome.
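The contig length filter can be illustrated with a short stand-in for the reformat.sh step. This sketch is not the BBtools implementation; it only reproduces the stated criterion (contigs shorter than 1000 bp are removed), and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Keep contigs of at least min_length bp, i.e. drop those shorter than 1000 bp."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]
```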

DNA isolation of R. gnavus isolates and generation of complete genomes using PacBio circular consensus sequencing

To generate complete genomes, 45 isolates were subjected to long read sequencing on the Pacific Biosciences (PacBio, Menlo Park, California, USA) Sequel IIe platform at the Leiden Genome Technology Center. To prepare high molecular weight total DNA, isolates were cultured anaerobically overnight in 10 mL BHI at 37 °C. Cells from 5 mL of culture were pelleted and processed using the Qiagen Genomic-tip 100/G, according to the manufacturer’s instructions. SMRTbell® libraries were generated as follows. Genomic DNA was sheared with the Megaruptor 3 system (Diagenode LLC, Denville, New Jersey, USA) using 35 cycles. Libraries were generated according to the following manufacturer’s procedure and checklist: Preparing whole genome and metagenome libraries using SMRTbell® prep kit 3.0 (PN 102-166-600 REV02 MAR2023), thereby using barcoded adapters. Size-selection was performed on library sub-pools using either diluted AMPure PB beads (PacBio, 35% beads, 3.1x v/v ratio) or Blue Pippin (Sage Science, Beverly, Massachusetts, USA), depending on the insert-size of the libraries. The libraries were sequenced on a PacBio Sequel IIe platform with a 30 hour movie time using Sequel II Binding Kit 3.2 and Sequel II sequencing kit 2.0.

Long-read assembler mini-benchmark

Given the relative infancy of assembly algorithms for PacBio CCS data of microbial genomes, we performed a mini-benchmark of five long-read de novo assemblers: Canu78,79 (version 2.2), Flye80 (version 2.9.2), Raven81 (version 1.8.1), Hifiasm82 (version 0.19.6-r595) and IPA (version 1.8.0; https://github.com/PacificBiosciences/pbipa). In this benchmark, each assembler was provided 8 processor threads on the Shark high-performance computing cluster of the Leiden University Medical Center. Shark runs on Rocky Linux 8.7, with SLURM version 23.02.7. The available processors include Intel Xeon E5-2697, E5-2690 and E5-4650. Each assembler was provided as much memory as it needed to complete the assembly. The tools exhibited clear differences in number of contigs generated, processing time and memory use (Supplementary Fig. 3). Note that sample QRD034 was sequenced much deeper than the rest and subsampled to 30% of reads (= 277× coverage) to facilitate assembly. Contigs were taxonomically classified using the Contig Annotation Tool (CAT version 5.2.3)83 to verify if they derived from R. gnavus. Canu, Flye, Hifiasm and IPA report if assembled contigs are linear or circular. From the different assemblies, we selected the assembly that yielded the longest contig and the longest total assembly length (all exceeding 3 Mb), giving Flye precedence as it provides the most extensive statistics (Supplementary Data 3). This resulted in 38 assemblies from Flye, 3 from Hifiasm, and 2 each from IPA and Raven. All contigs from selected assemblies were reoriented using dnaapler84 (version 0.3.0) to start at the dnaA, repA or terL gene for chromosomes, plasmids and bacteriophages, respectively. Raven and Hifiasm produce assembly graphs, which were viewed to assess if contigs were linear or circular. Assemblies with a smaller secondary circular contig were analyzed with geNomad85 (version 1.7.4) to predict the probability of it being a plasmid, using the built-in score calibration module with aggregated results from both the marker-based and neural net-based classifications.
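The assembly selection rule can be sketched as follows. This is one possible reading of the criterion, in which assemblies are ranked by longest contig and then by total length, and Flye is chosen whenever it ties for the best rank; the actual selection was made per isolate from the statistics in Supplementary Data 3, and the data layout used here is hypothetical.

```python
def pick_assembly(assemblies, preferred="Flye"):
    """Pick one assembler per isolate.

    `assemblies` maps assembler name -> list of contig lengths in bp (hypothetical layout).
    Assemblies are ranked by (longest contig, total length); ties go to `preferred`.
    """
    def score(name):
        lengths = assemblies[name]
        return (max(lengths), sum(lengths))

    best = max(score(name) for name in assemblies)
    candidates = sorted(name for name in assemblies if score(name) == best)
    return preferred if preferred in candidates else candidates[0]
```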

We included two isolates derived from the strain DSMZ 108212, of which one we obtained directly from the DSMZ (QRD005) and the other was cultured at the Sanger Institute (QRD022). Assembly with Hifiasm yielded a 3.3 Mb contig and a 28 kb contig for QRD022, while QRD005 could not be resolved to less than three contigs, with the longest being 2.4 Mb. These two assemblies were not completely identical and we decided to use a reference-based assembly of the unresolved one against the 3.3 Mb contig using minimap286 (version 2.29) and samtools87 consensus (version 1.19; parameters: ‘--min-MQ 5 --min-depth 10’) to generate an improved assembly of QRD005. This resulted in two contigs of 3.3 Mb and 178 bp. We manually removed the 178 bp fragment and used the single 3.3 Mb contig assembly as representative of the ‘DSMZ-108212’ = QRD005 isolate (Supplementary Data 3).

Final genome assemblies were annotated with DNA methylation information from the PacBio SMRT Link Microbial Genome Analysis platform.

Antibiotic resistance screening of isolate genomes

To assess the genotypic antibiotic resistances in isolate genomes, we screened 79 short-read genome sequences of isolates, the 45 newly generated long-read genomes, and the one reference genome for the presence of antibiotic resistance genes using ABRicate (version 0.8.13; https://www.github.com/tseemann/abricate) with NCBI’s AMRFinderPlus database (downloaded 11 November 2022, containing 5735 sequences)88. Genes were assumed present if at least 95% of the gene matched with at least 95% identity to the gene in the database. For in vitro validation, ten isolates—five with tet tetracycline resistance genes and five without—were assessed for tetracycline minimum inhibitory concentrations (MIC at 48 h) using an ETEST (bioMérieux) on TSS medium at 37 °C in a Whitley A35 anaerobic cabinet. However, since we managed to isolate R. gnavus without the use of antibiotic selection and tetracycline resistance is also common among other human gut commensals, we did not pursue this further.

Search for previously described inflammatory factors of R. gnavus

Several R. gnavus genes have previously been associated with intestinal inflammation. We screened our collection of genomes for the presence of two superantigen genes (accession numbers WP_105084811.1 and WP_105084812.1)17, 23 genes encoding the machinery to produce a proinflammatory (glucorhamnan) polysaccharide (NZ_AAYG02000032.1)15, one tryptophan decarboxylase gene (RUMGNA_01526 from UniProt)18, and 20 genes encoding a capsule polysaccharide (RUMGNA_02411 – RUMGNA_02392 from UniProt)16. We used protein BLAST89 (blastp; version 2.13.0) to screen the genomes for the presence of each of these genes. Only hits that covered at least half of the gene of interest (‘-qcov_hsp_perc 50’) with an E-value of 1 × 10−20 or smaller (‘-evalue 1e-20’) were considered for further analysis. Gene clusters were considered present when all the genes were detected.
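The presence call for a gene cluster can be made explicit with a small sketch. It assumes that BLAST hits have already been parsed into (query gene, query coverage per HSP, E-value) tuples; this layout and the function names are hypothetical, while the thresholds are those stated above.

```python
def detected_genes(hits, min_qcov=50.0, max_evalue=1e-20):
    """Return the set of query genes with at least one hit passing both thresholds.

    `hits` is an iterable of (query_gene, qcov_hsp_perc, evalue) tuples (hypothetical layout).
    """
    return {gene for gene, qcov, evalue in hits
            if qcov >= min_qcov and evalue <= max_evalue}


def cluster_present(cluster_genes, hits):
    """A cluster counts as present only when every one of its genes is detected."""
    return set(cluster_genes) <= detected_genes(hits)
```

For a single-gene query, such as the tryptophan decarboxylase gene, the same function applies with a one-element gene list.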

Using the same method, we also screened genomes for the presence of the bilirubin reductase gene (bilR, WP_009244284.1)90, selenium-dependent xanthine dehydrogenase (sd-XDH, QHB24869.1)19, and the nan cluster for sialic acid metabolism (RUMGNA_02691 through RUMGNA_02701 from UniProt)20. Gene operons were visualized using clinker91.

Annotation of functional pathway genes

We annotated carbohydrate-active enzymes (CAZymes) by comparing the genomes to dbCAN92 (version 10) using HMMER93 (version 3.3.2). Within the CAZyme families, we focused on two glycosyl hydrolase families that include fucosidases, GH29 and GH95, which have been described as important for mucus utilization94, a main feature of R. gnavus. Genomes were also annotated using KEGG-Decoder95. Pathways for chemotaxis and flagellum biosynthesis were annotated using the KOALA definitions available online24. Moreover, genomes were screened for the presence of annotated biosynthetic gene clusters (BGC) using antiSMASH96 (version 6.1.1).

Comparison of whole genomes to find clusters of genomic variants

Whole genomes were compared to one another using average nucleotide identity (ANI) with fastANI (version 1.33)26. Furthermore, genomes were subjected to a pangenome analysis using Panaroo (version 1.3.0; parameters ‘--clean-mode strict -a core --aligner mafft --core_threshold 0.95’)97. For the pangenome, we considered genes that occur in at least 95% of genomes to be core genes, as recommended when including MAGs98. The core genes were concatenated and a core genome multiple sequence alignment was generated using MAFFT99 (version 7.505), which was automatically trimmed using trimAl100 (version 1.4.1). A maximum likelihood phylogeny was inferred from the trimmed multiple alignment using IQ-tree101 (version 2.2.0.3), including ModelFinder Plus102 to automatically select the best fitting evolutionary model and ultrafast bootstrap (1000 replicates) to calculate branch support103. The selected models were: short-read genomes GTR + F + I + I + R9; long-read genomes GTR + F + R7; all genomes GTR + F + R10. Trees were visualized in iTOL104.
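The core gene definition can be illustrated on a gene presence and absence table. This sketch is not Panaroo's implementation; it only applies the 95% threshold to a hypothetical mapping of gene names to the genomes that carry them.

```python
def split_pangenome(presence, n_genomes, core_threshold=0.95):
    """Split genes into core and accessory sets.

    `presence` maps gene name -> set of genome identifiers carrying the gene
    (hypothetical layout). A gene is core when it occurs in at least
    `core_threshold` of all genomes.
    """
    core, accessory = [], []
    for gene, genomes in sorted(presence.items()):
        if len(genomes) / n_genomes >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory
```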

Bacterial genome-wide association study (GWAS)

To identify genes that are putatively associated with CD, we subjected genomes of R. gnavus isolates to a bacterial genome-wide association study using Hogwash (version 1.2.6; parameters: ‘fdr = 0.05, bootstrap = 0.875, grouping_method = “post-ar” ’)28. Hogwash implements a more stringent version of the homoplasy-based PhyC method introduced in 2013105. Hogwash reconstructs the evolutionary history of the genomes of interest using a phylogenetic tree and predicts where genotype and phenotype transitions occurred to assess where they coincide. We made use of the high correlation between core and accessory genome to use these two as input, together with phenotype of either CD or healthy. Genomes were assigned healthy or CD phenotype based on available metadata on health status from the person from whom the R. gnavus isolate was cultured. We included short-read sequencing isolate draft genomes as well as our in-house generated PacBio complete genomes. If multiple sequences of the same isolate existed, we deduplicated based on ANI > 99.9%. Of these duplicates, we picked the first based on alphabetic order as representative, and we preferentially selected long-read-based genomes when available. This resulted in fourteen R. gnavus isolate genomes derived from CD patients and 41 from healthy people (total N = 55). We used a matrix of (accessory) gene presence and absence generated by Panaroo as input for Hogwash. As phylogenetic tree, we pruned the tree of all R. gnavus genomes inferred by IQ-tree to include only this set of 55 deduplicated genomes and midpoint rooted the tree. Associations between genotype and phenotype are evaluated both by p-value indicating statistical significance, and epsilon value, which calculates the strength of genotype-phenotype association on a 0-1 scale (Supplementary Data 6).
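The deduplication step can be sketched as follows. Grouping genomes by single linkage on pairwise ANI is an assumption, as the text does not state how groups of more than two near-identical genomes were formed, and the input layout and names are hypothetical. Within each group, long-read genomes are preferred and the remaining ties are broken alphabetically.

```python
def deduplicate(genomes, ani_pairs, long_read, threshold=99.9):
    """Return one representative per group of near-identical genomes.

    genomes   : list of genome names
    ani_pairs : dict mapping (name_a, name_b) -> ANI in percent
    long_read : set of names that are long-read-based genomes
    Genomes with ANI > threshold are grouped by single linkage (assumption).
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the group root, compressing the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), ani in ani_pairs.items():
        if ani > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)

    representatives = []
    for members in groups.values():
        # Long-read genomes sort first, then alphabetical order decides.
        members.sort(key=lambda g: (g not in long_read, g))
        representatives.append(members[0])
    return sorted(representatives)
```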

To further validate the genes found to be significantly associated with either a healthy or Crohn’s host phenotype, we counted the prevalence of each group of genes in both healthy-derived (n = 123) and IBD-derived MAGs (Crohn’s n = 8; ulcerative colitis n = 1; Supplementary Fig. 11). Furthermore, we visualized the prevalence of these genes among genomes, annotated by their host disease phenotype, as a heatmap to visually inspect the predicted gene associations (Supplementary Fig. 12).

Statistical analyses

All tools were run with default parameters unless stated otherwise. Statistical analyses and visualization were done in R (version 4.0.2) using RStudio (https://posit.co/). A p-value of 0.05 or smaller was considered significant. Data were visualized using the R package ggplot2 (version 3.5.0)106, with the publication theme from ggembl (version 0.1.2; https://git.embl.de/grp-zeller/ggembl). Figures were polished manually using Inkscape (version 0.92.5; https://inkscape.org/).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.