Fig. 3: Genomic differences between isolates from healthy people and Crohn’s disease patients indicate a Crohn’s-specific subspecies.

a Using our newly generated PacBio genomes, we compared genomes of isolates from healthy people to isolates from CD patients. Maximum likelihood phylogenetic tree of PacBio isolate genomes built from concatenated core genes, annotated with disease status and with genes and gene clusters previously described in the literature. Asterisks indicate gene clusters from genomes that are highlighted in Supplementary Fig. 9. Below the tree are heatmaps of pairwise average nucleotide identity (ANI) and accessory genome similarity (calculated as 1 / binary distance). SA: superantigen (2 genes), IP: inflammatory polysaccharide (23 genes, ‘partial’ = 20 or 21 genes), cps: capsular polysaccharide (20 genes), nan: sialic acid metabolic cluster (11 genes, ‘partial’ = 6 genes), TD: tryptophan decarboxylase (1 gene), sd-XHD: selenium-dependent xanthine dehydrogenase (1 gene), bilR: bilirubin reductase (1 gene). b Comparison of the genome comparison metrics (core genome phylogenetic distance, average nucleotide identity and accessory genome binary distance), tested with Spearman correlations. P < 2.2 × 10⁻¹⁶. c Comparison of core and accessory genome size between deduplicated isolate genomes with a CD or healthy phenotype, derived from short-read or long-read sequencing. Box plots show the median with the first and third quartiles, whiskers indicate the rest of the data excluding outliers, and overlaid dots (jittered) show individual values. P-values were calculated using a two-sided Wilcoxon rank-sum test. Core genome: p = 0.4; accessory genome: p = 0.42. d We compared accessory genomes of isolates from healthy people and CD patients using a bacterial GWAS to identify genes associated with disease phenotype. Results are expressed as the false discovery rate-adjusted p-value (using the Benjamini-Hochberg correction) and epsilon, a measure of the strength of association between phenotype and genotype based on the (maximum likelihood) phylogenetic tree.
The gray dashed line indicates a p-value of 0.05; points above the line are considered statistically significant. Positive values of epsilon correspond to an enrichment in CD, and negative values of epsilon are associated with a healthy host phenotype. P- and epsilon-values are adapted from the synchronous GWAS model as implemented in Hogwash. Source data are provided as a Source Data file.
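For readers unfamiliar with the false discovery rate adjustment mentioned for panel d, the following is a minimal sketch of the Benjamini-Hochberg procedure in plain Python. It is an illustration only, not the Hogwash implementation; the function name `benjamini_hochberg` and the raw p-values are hypothetical and are not taken from the study.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that each
    # adjusted value never exceeds the one ranked directly above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes (illustrative, not study data).
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)

# Genes whose adjusted p-value falls below the 0.05 threshold.
significant = [q < 0.05 for q in adj]
```

In this toy example the third and fourth genes have raw p-values below 0.05 but are no longer significant after adjustment, which is the reason the threshold in panel d is applied to the adjusted values.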