Introduction

One of the key biomedical discoveries of recent years has been the critical role of the microbiome in host function and disease. The microbiome is implicated in a huge range of functions in health, such as gut barrier and immune function, and disease, including obesity, autoimmune disease and allergy1. While associations of the microbiome with human health and disease multiply, questions of causality become increasingly important. Unpicking that causality requires careful experiments, which are typically not feasible in humans. Therefore the science of the microbiome has been, and is being, built on animal research, primarily in mice, with an exponential rise in such publications.

Unfortunately, microbiome studies can be inaccurate or biased due to both experimental and analytical factors2. Notably, experimental design issues, such as the sampling site of the microbiota3 and the environment of the host4, can have a large impact. For instance, gut microbiota research has tended to focus on stool samples, where changes in the stool microbiota have been associated with several diseases, most notably inflammatory bowel disease (IBD)5,6. However, stool samples alone do not fully reflect the total gut microbiota. Bacteria inhabit various niches along the length of the gut, particularly the mucus layer overlaying the intestinal epithelial cells3,7 (Fig. 1a). We have also shown that these bacterial niche populations can be impacted independently of effects within the stools7. Furthermore, mice8, like humans9, share gut microbes among co-housed individuals which, in mice, is reinforced by coprophagy. Thus, individual mice in the same cage have similar microbial communities and any differences among the microbial communities of different cages can dominate analysis10. Surprisingly, few studies report on how mice are caged and whether they are littermates11,12. Intergenerational differences in the murine microbiota has also been reported13. Not reporting these critical factors leads to questions about the reproducibility of the research, as discussed by Stappenbeck and Virgin10. Indeed, there is the possibility that any apparent changes in the microbiome that arise between mouse groups are mistakenly assigned as treatment effects. It is therefore important to define more clearly the contribution of host environment in animal microbiome studies and develop robust tools that interrogate this.

Figure 1
figure 1

Distribution of phyla across the phylogenetic tree. Colonic tissue sections from a male wildtype (WT) mouse was stained with a fluorescent DNA probe specific for the 16S rRNA gene to identify bacteria (red), a Muc2 antibody (green) to identify mucus and counterstained with DAPI (blue) (a). A phylogenetic tree of 16S rRNA sequences derived from the gut microbiota of FVB wildtype (WT) mice and mdr1a−/− mice (b). The distribution of major gut phyla are highlighted on the tree: Firmicutes (grey), Bacteroidetes (pink), Proteobacteria (olive), Actinobacteria (red), and Deferribacteres (gold). The same tree is shown, coloured by other criteria, in Supplementary Figure S1. Figure (b) was produced in R 3.6.0 for Windows.

Some of the most common methods of microbiome analysis also have the potential to obscure findings. Bacteria interact in complex communities. Thus, the power of individual statistics, either ones that focus on large scale shifts among phyla (e.g. the Firmicutes to Bacteroidetes ratio14) or that single out particular species differences (e.g. Lactobacillus reuteri enrichment in obesity15), potentially miss functional impacts. Such narrow foci may lead to bias and over-interpretation of the relative importance of single species changes. To date, even though microbiome studies frequently repeat analyses at multiple taxonomic levels and tools have been developed capable of highlighting specific important taxa within a tree (e.g.16), it is rare to examine all taxa across a phylogenetic tree in a single analytical framework, enabling the relative importance of community differences at different phylogenetic scales to be assessed.

Here we ask whether some of the experimental features that may confound analyses—cage and maternal effects, have similar or different effects on microbial communities to experimental variables of interest (niche within the murine gut, mouse age and genotype of the host). We use 16S rDNA data from a carefully controlled experiment, with wild type and colitis-prone mdr1a−/− mice (where some differences have been reported in the bacterial community of the gut mucus17), in co-housed, mixed genotype cages. To minimise issues of potential inaccuracy and bias, we do not define operational taxonomic unit (OTUs). Instead, we use the sequences themselves to estimate the phylogenetic relationships amongst organisms and thereby abundances at different phylogenetic scales. Only after analysis do we draw on a wider understanding of microbial taxonomy to interpret the findings. We find striking cage and maternal effects on the microbial communities, which are clearly distinguished among niches and host age, highlighting the importance of factoring housing into experimental design and analysis.

Materials and methods

Animal maintenance

The breeding and sampling design is detailed in Supplementary Figure S1. Mdr1a−/− mice (FVB.129P2-Abcb1atm1Bor N7)18 were bred with control FVB mice purchased from Taconic Biosciences (Albany, NY), to produce the F2 generation. Heterozygous parents were then used to give rise to experimental mice, allowing litters to contain a mix of genotypes. Experimental male mice from each litter were co-housed in individually vented cages, in the same room and on the same rack in the animal facility. Thus, WT and mdr1a−/− mice from the same litters were used for all subsequent experiments, in mixed genotype cages. Male mice at 6 and 18 weeks of age were used for experiments. Different mice were sampled at each time point. All mice received the same food (Beekay Rat and Mouse Diet No1 pellets; B&K Universal, UK) and irradiated water which were available ad libitum, prior to and during the experiment. Ambient temperature was maintained at 21 (± 2 °C) and the relative humidity was 55 (± 10%) with a 12 h light/dark cycle. All animals were kept under specific, pathogen-free (SPF) conditions at the University of Manchester, where cage-cross contamination was prevented at every stage via strict hygiene procedures. Experiments were performed under a project license approved by the institutional Animal Welfare and Ethical Review Body (GRENCIS 70/8127) and according to the regulations issued by the Home Office under amended ASPA, 2012.

Isolation of genomic DNA

Sample collection and processing was performed as described by Glymenaki et al.17. In brief, samples were harvested from mice at two time points, 6 and 18 weeks of age. Stool samples were collected from mice in individual autoclaved cages into sterile tubes and snap frozen on dry ice. Mice were sacrificed via CO2 inhalation, the proximal colon was cut open and the colonic mucus scraped using cell scrapers and Inhibitex buffer (QIAGEN, Manchester, UK) and snap frozen until use. Genomic DNA was extracted using QIAamp Fast Stool Mini-Kits according to the manufacturer’s instructions (QIAGEN).

Histology

Snips of the proximal colon were fixed in Carnoy’s solution (60% methanol, 30% chloroform, 10% glacial acetic acid), incubated in two changes of dry methanol (Sigma-Aldrich, Dorset, UK) for 30 min each, followed by absolute ethanol (ThermoFisher Scientific, Paisley, UK) for two incubations at 30 min each. Finally, tissue cassettes were processed in a Micro-spin Tissue Processor STP120 (ThermoFisher Scientific) and immersed in paraffin. Colon snips were embedded in paraffin blocks using a Leica Biosystems embedding station (Leica Biosystems, Milton Keynes, UK), with the luminal surface of the colon exposed for tissue sectioning. 5 µm tissue sections were cut using a Leica Biosystems microtome and adhered to uncoated microscope slides (ThermoFisher Scientific). Slides were dried for 48 h at 50 °C before use. Histological analysis was used to determine that all five of the 18 week-old mdr1a−/− mice had indications of moderate or mild colitis, with a loss of healthy gut architecture17.

Fluoresence in situ hybridisation (FISH)

FISH was performed as described previously17. In brief, FISH staining was performed using the universal bacterial probe-EUB338 (5′-Cy3-GCTGCCTCCCGTAGGAGT-3′), followed by immunostaining with a rabbit polyclonal MUC2 antibody and goat anti-rabbit Alexa-Fluor 488 antibody (Life Technologies, Paisley, UK). Slides were imaged using a BX51 upright microscope and a Coolsnap EZ camera (OLYMPUS, Tokyo, Japan) and images were processed using Image J19.

16S rRNA gene sequencing processing

16S amplicon sequencing targeting the V3 and V4 variable regions of the 16S rRNA (341F: 5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3′ and 805R: 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3′) was performed on the Illumina MiSeq platform (Illumina, California, USA) according to manufacturer’s guidelines and generated paired-end reads of 300 bp in each direction. DNA from all samples was extracted using the same extraction kit. However, they were sequenced across different runs, with technical replicate samples sequenced multiple times as an internal control between each run. Illumina reads were demultiplexed to remove adapter sequences and trim primers. Illumina paired-end reads were merged together using SeqPrep20 and submitted to MG-RAST’s metagenomics pipeline21. Reads were pre-processed to remove low-quality and uninformative reads using SolexQA22. The quality-filtering process included removal of reads with low quality ends (i.e. ambiguous leading/trailing bases) and the removal of reads with a read length two standard deviations below the mean. Artificial duplicate reads were then removed based on MG-RAST’s pipeline.

The resulting FASTQ files for every sample were merged into a single file of 590,822 sequences to simplify processing, manually adding 3 known Archaeal 16S rRNA sequences from Acidilobus saccharovorans, Sulfolobus tokodaii and Methanobrevibacter smithii. Sequences were aligned using a specialist 16S RNA aligner using the Infernal algorithm23, via a web-based interface provided by the Ribosomal Database Project24. This file was then manually curated in R25. In brief, we determined the first and last position of each base for every sequence. 92% of all sequences started around position 710. 87% of sequences had their last base around position 2800. We therefore trimmed all positions before 710 and after 2800 to remove unnecessary spaces introduced by the aligner. The number of aligned bases in each sequence was then recorded and the distribution of continuously aligned bases was examined. The large majority of sequences (~ 84%) had > 437 bases and so any sequence that had less than 437 continuously aligned bases was discarded. The remaining 496,550 sequences were taken forward for analysis. All sequences were identified using BLAST+ and the top hit for each sequence was recorded26. The ‘classification’ function in the ‘taxize’ R package27 was then used to assign full taxonomic information to each identified taxon where possible. Unless otherwise stated, all analyses were performed using custom scripts in R.

Phylogenetic tree

A phylogenetic tree of all sequences was generated using FastTree 2.128, using the general time reversible (GTR) + CAT model and default parameters. The tree was rooted using the archaeal sequences as an outgroup. Phylogenetic clades were obtained using the ‘Ancestor’ function in the ‘phangorn’ R package29. A relative abundance matrix, with abundance based on how many times sequences belonging to a phylogenetic clade appeared in a sample, was calculated.

Ordination

Bray–Curtis dissimilarity and Jaccard Index values were calculated among all samples (based on the relative abundance matrix) and used for non-metric multidimensional scaling (NMDS) via the ‘MASS’30 and ‘ecodist’ R packages31, checking to ensure convergence in all cases.

Machine learning

Random forest (RF) models were run using the ‘randomForest’ package32 in R. Specifically, the clade relative abundance matrix was used as an input for the RF, using a forest of 100,000 trees and the mtry value was left at default settings (the square root of the number of clades). Separate forests were run to predict whether a sample was 6 or 18 weeks old, whether a sample was stool or mucus, whether it was a WT or an mdr1a−/− sample, what cage the sample was taken from and the mother of each respective offspring. Each forest was controlled for all other treatments (i.e. a random forest predicting age included genotype and microbial niche as explanatory variables, in addition to the generated clades). The ‘MeanDecreaseAccuracy’ (MDA) value was used as a measure of how important each clade (or treatment) was at predicting treatment information and the out-of-bag (OOB) error rate was used to determine the predictive accuracy of the model. Nodes were ranked based on MDA value, taking the five most important nodes, determining the descendant tips and confirming the identity of the tip sequences via the BLAST+ results26. Additionally, the depth of each node was determined using the ‘distances’ function in the igraph R package33. A phylogenetic tree annotated with the resulting information was plotted using the ‘plot.phylo’ function in the ‘ape’ package34.

Model validation

In order to validate each model, we included a ‘randomised’ negative control RF where relative abundances of each node were permuted with respect to each sample and the predictive accuracy was assessed. In addition, we took the relative abundances of an important node for age and redistributed the abundance to only WT samples. The RF was repeated to investigate whether this node would appear as important for genotype. We also ran RF’s with an increasing number of trees, using three different random seeds and performed Spearman’s Rank correlation on the MDA values obtained among each set of three RFs of the same size. The Monod/Michaelis–Menten model was fitted, to determine how an increasing number of trees affected correlation of the MDA values. Finally, we included technical replicates of one stool sample that was used as an internal control between sequencing runs, in our forest models. We examined the MDA values for all the clades in each of these replicates to see how tightly correlated they were. Additionally, we compared the predictive accuracy of the RF model when using these different replicates.

Statistical analysis

Analysis of the real vs null RF models predictive accuracy was performed using 2-way ANOVA, with a Sidak’s post hoc test in GraphPad Prism 8 (GraphPad Software, La Jolla, USA). Permutational multivariate analysis of variance (PERMANOVA) was used to determine interactions between taxa and treatment groups, using the ‘adonis’ function in the vegan R package35. Effects of age, niche, genotype and all possible interactions were considered. All significant effects (P < 0.05) are reported.

Results

Phylogenetic tree of 16S rDNA data derived from the gut microbiota

Microbiota samples from the stools and colonic mucus (Fig. 1a) of 20 male mice were collected from two genotypes (WT vs colitis prone-mdr1a−/−), at two different time points (6 vs 18 weeks of age, with different mice sampled at each time point). By the later time point, mdr1a−/− animals had developed a mild or moderate colitic phenotype (see methods). Littermates of the different genotypes were co-housed in 8, mixed genotype cages (Supplementary Figure S1). On average, 10,442 16S sequences (range 1892–25,681) were obtained per sample. All sequences were used to create a phylogenetic tree, comprising 496,550 tips and 428,234 internal nodes, which separated the major phyla (Fig. 1b). Sequences derived from stool and mucus were distributed across the tree (Supplementary Figure S2a), as were sequences associated with other criteria (age, genotype and cage) (Supplementary Figure S2b–d).

Strong separation of the gut microbiota by microbial niche, age and cage but not host genotype

To avoid bias by taxonomic level, we constructed a data matrix comprising the relative abundance [(number of tips in clade in sample)/(total number of tips in sample)] of clades corresponding to all 428,234 internal nodes of the phylogenetic tree in each of our samples. This avoided assigning OTUs, or using a reference database. To visualise the major differences in the microbial communities in an unsupervised fashion, Bray–Curtis dissimilarity values were calculated from our data matrix between all samples and used as an input for a 2-dimensional non-metric multidimensional scaling (NMDS) ordination (Fig. 2). A stress plot for this ordination (Supplementary Figure S3a), shows a stress value of 0.18, within the 0.20 acceptability threshold36. There was clear separation of samples by niche (PERMANOVA, p = 0.0001) (Fig. 2a). There was less visual separation by mouse age (6 vs 18 weeks, PERMANOVA, p = 0.0001) (Fig. 2b) which is in concordance with our previous work17. Samples from the same cage localised closely in the ordination. Cages containing different litters from the same mother (see Supplementary Figure S1) were adjacent or overlapping in the plot, suggesting maternal effects influencing, but not fully explaining, cage-specific microbiomes (Fig. 2c,d). Little separation was found when comparing genotypes or colitic phenotype with no significant effect on its own (PERMANOVA, p = 0.36) or in interaction with age (p = 0.07) or niche (p = 0.1) (Fig. 2e). A similar analysis using only the presence/absence of taxa (Jaccard Index) gives similar conclusions for niche (PERMANOVA, p = 0.01) and age (PERMANOVA, p = 0.001), but visually separates treatments less clearly (Supplementary Figure S4). As with the Bray Curtis matrix, genotype had no significant effect on its own (PERMANOVA, p = 0.38) or in interaction with age (p = 0.38) or niche (p = 0.97). The stress value was 0.15 (Supplementary Figure S3b).

Figure 2
figure 2

Figure produced in GraphPad Prism 8.

Separation of microbiota via NMDS for microbial niche, age, cage and mother. Two dimensional non-metric multidimensional scaling (NMDS) was performed using a Bray Curtis dissimilarity matrix based on the relative abundance of all clades in the phylogenetic tree shown in Fig. 1B. Plots highlighting stool and mucus samples (a), 6 and 18 week old samples (b), different cages (c, C1–C8 represent cages 1–8), mothers (d, M1–5 represent mothers 1–5) and WT (wildtype) and KO (mdr1a−/−) samples (e) are illustrated. Each point corresponds to a stool or a mucus sample. These samples were taken from n = 10 mice per genotype.

Specific microbiota are strongly associated with age, microbial niche, cage and mother but not host genotype

To determine the taxa driving the observed differences in community structure, the relative abundance matrix was used to construct machine learning models (random forests, RFs). Separate RF models were created to identify age, genotype, niche, cage and mother based on the relative abundance of the clades (as defined by the phylogenetic tree, Fig. 1b) in each sample. These models were compared against a null (negative control) model where relative abundances were permuted among taxa within samples to remove true associations, while keeping the characteristics of the individual samples (e.g. those arising from any differences in coverage). Further testing to validate the reproducibility of this approach is in the Supplementary Information (comprising Supplementary Figures S5S7). Niche was determined from the microbiota with 92% accuracy, age and mother with ~ 98% accuracy and cage with 80% accuracy (averaged across six technical replicates), in all instances substantially higher than the negative control model (Fig. 3a) (Two Way ANOVA-Sidak’s post hoc test: P < 0.0001). Genotype could not be determined from the microbiota using our RF models any better than in the negative control (Fig. 3a). Models considering genotype were therefore not considered further.

Figure 3
figure 3

Random forest model identifies strong associations between the microbiota, niche, age and cage. The predictive accuracy of the random forest model at taking a sample and discriminating between the different treatment groups is shown (a). The five most important nodes associated with microbial niche (b), host age (c), social group (cage) (d) and mother (e), are named based on the finest scale taxon containing the closest BLAST hits of all sequences in the clade. Taxa are prefixed with their taxonomic level: order (o_), family (f_) and genus (g_). Bars represent means and standard errors. Asterisks represent significance determined using Two Way ANOVA-Sidak’s post hoc: P < 0.00001 (****). Similar plots for further validation control analyses are shown in Supplementary Figure S3 and S4. n = 6 technical replicates. Figure produced in GraphPad Prism 8.

The RFs give an importance value for each clade (the tree’s internal nodes) in discriminating between groups. To identify which bacteria the clades encompassed, we used BLAST+ on all sequences, recording the taxonomic identity of the top hit (hits that had a percentage coverage < 100% were discarded). The finest-scale taxonomic grouping containing all sequences descending from the five most important clades is shown in Fig. 3b–e for niche, age, cage and mother RFs respectively. The named clades do not represent all bacteria within that taxon, rather they represent specific bacteria, all of which fall within the taxon. For microbial niche, the most important distinguishing clades were all Gram negative and mostly comprised Proteobacteria: the order Pseudomonadales and three clades in the families Burkholderiaceae and Deferribacteraceae and all these clades were more abundant in the mucus samples (Supplementary Figure S8a). The Deferribacteraceae containing clade mostly comprised Mucispirillum, a known mucus-associated bacteria37. The genus Porphyromonas was more abundant in stool samples. The most important clades separating ages were the families Erysipelotrichaceae and Lachnospiraceae within the Firmicutes phylum (which have each been specifically associated with young mouse microbiomes before38) plus three genera: Natranaerovirga, Desulfovibrio, and Vampirovibrio in the Firmicutes, Proteobacteria and Cyanobacteria phyla respectively. With the exception of Natranaerovirga, all these bacteria were prevalent in the 18 week old mice (Supplementary Figure S8b). The most important clades separating cages and mothers were Natranaerovirga (a different clade from that separating ages) plus four clades within the order Bacteroidales—three comprising the genera Bacteroides and one comprising Barnesiella (Supplementary Figure S8c,d).

Abundant, low-level taxa distinguish cage and maternal microbiomes but not age or niche

Having identified taxa at different phylogenetic levels as particularly important for separating microbiomes, we looked systematically at the phylogenetic scales that are important for separating different microbiomes. Clade importance was analysed as a function of the number of nodes between the clade and the root of the phylogenetic tree (Fig. 4) or the distance from each clade to the root (Supplementary Figure S9). These measures distinguish clades close to the root (high-level taxa with fewer nodes and shorter branch length from the base of the clade to the root) corresponding, e.g. to phyla, and clades far from the root (low-level taxa with more nodes and longer branch lengths from the base of the clade to the root) corresponding e.g. to genera. For both age and niche, neither the lowest nor the highest level clades were consistently important but the most clearly important clades were of intermediate taxonomic levels (Fig. 4a,b). For age, this separation between true and null models differs between the metrics: while the intermediate level taxa are important across both metrics, the highest level clades are the most important with the distance metric (Supplementary Figure S9b vs. Fig. 4b). However, for differences among cages and mothers, while intermediate level clades were important, many of the most important groups were at the extreme of low level taxa, i.e. differences in sub-specific groupings (Fig. 4c,d, Supplementary Figure S9c,d).

Figure 4
figure 4

Abundant, low-level taxa distinguish cage microbiomes but not age or niche. The phylogenetic scale of each clade was measured as the number of nodes in the phylogenetic tree between the clade and root (small values associated with large-scale taxa such as phyla). Phylogenetic scale was compared against the ‘mean decrease in accuracy’ (MDA) value when running a random forest that distinguished the niche (a), age (b), cage (c) and mother (d). Clade abundance was measured as the number of sequences (tips) descending from each clade. Each clade’s abundance was compared against its MDA value, when running a forest that distinguished the niche (d), age (e), cage (f) and mother (g). The smoothed mean for the ‘real’ random forest model is illustrated in blue and for a null (negative control) random forest model in red. The grey areas refer to confidence intervals. Each small vertical ‘rug’ line above the horizontal axis indicates the location of a single taxon. Note the logarithmic scales on all axes. Similar plots using a different measure of phylogenetic scale (distance of clades from the root of the tree) are in Supplementary Figure S5. Figure produced in R 3.6.0 for Windows.

The number of sequences within a clade of the tree that were present in a particular set of samples is an estimate of its abundance in that microbiome. We therefore asked how abundance of bacterial taxa correlated with its importance in distinguishing microbiomes. We found that, for separating niche, age, cage or maternal microbiomes, moderately abundant taxa were important, whereas the rarest taxa were never important (Fig. 4e–h). The most abundant taxa were important for distinguishing cage and maternal microbiomes (Fig. 4g,h) but much less so for distinguishing different ages or niches (Fig. 4e,f) where the true and null models converge on the right).

Finally, we asked how the importance of particular taxa related across models distinguishing the different criteria. The distributions of importance values are wide for all RFs, including null models (Supplementary Figure S10). Nonetheless, we expect a positive association between the importance of particular taxa in distinguishing cages and in distinguishing mothers, since individual litters were housed in separate cages (Fig. 2). These are indeed associated (rank correlation of 0.291, cf. 0.05 for the null model, Supplementary Figure S10). However, the importance of particular taxa was uncorrelated across all other forests that distinguished different criteria (age versus niche, age versus cage, age vs mother, niche versus cage and niche vs mother had rank correlations − 0.002, 0.005, 0.002, − 0.001 and − 0.0007 respectively, each less than their respective null models). Thus, while we can identify broad trends (Fig. 4) and a few key taxa associated with particular gut features (Fig. 3), we did not find widely applicable ‘indicator’ taxa that were individually sensitive to multiple effects on the gut microbiome.

Discussion

We were able to discriminate clearly between the microbiomes of 6 and 18 week old mice, mucus and stool samples, different groups of mixed genotype co-housed mice and mice with different mothers. This confirms the robustness of our models as it is consistent with others’ work, for instance showing microbiome changes with age in both humans and mice39,40, work identifying microbial niche as the strongest factor for separation of the microbiota17,41 and the impact of environment on the microbiome4,11,42. We associated the microbiota with cage with approximately 80% accuracy, suggesting that each cage has a distinct microbial signature. Mice were housed according to their litter and the mother could be distinguished with 98% accuracy, so the maternal microbiome is likely to be critical in determining this signature. Differences in environment can impact the microbiome and have functional impacts on the host, including altered permeability of the mucus barrier4. However, the risk of confounding phenotypes of interest with environmental variation is much broader—in humans, the microbiota is also both vertically transmissible (from mother to offspring) and horizontally transmissible (between household members)43. Nonetheless, it is predominantly animal studies that are used to assign causality of changes in the microbiome to phenotypic effects. Therefore, it is imperative that environmental factors, including both mothers and housing, are both reported and controlled for. Currently, this is by no means universal12.

Our mice are still relatively young, with initial samples taken only ~ 3 weeks after weaning and at 18 weeks old. Therefore the strong differences seen by age may be due to the microbiota still adjusting at 6 weeks due to changed diet from milk to solid food. Solid food may itself contain plant and bacterial sequences (including plant sequences that may be mistaken by taxonomy databases for cyanobacteria44). It is also true that, even in our experimental design, because it is only possible to obtain a single mucus sample from a mouse, we have confounded the age effect with the cage and maternal effects, meaning that we cannot exclude the possibility that the true effect of age is in fact less distinct. Nonetheless, dietary-derived microbial changes in mice can happen within a much shorter timeframe45,46, and the nature of the age effect identified (Figs. 3c and 4b,f) is clearly distinct from the cage and maternal effects, consistent with it being a distinct age effect. The fact that we saw these differences clearly validates our approach to modelling them. In addition, the microbes identified as key to these changes include expected taxa such as the family Erysipelotrichaceae. This family distinguished our older mice (Fig. 3c) and has been associated with the development of IBD (both positively and negatively)47,48 and colorectal cancer49.

We found no consistent differences between the gut microbiomes of wildtype and colitis-prone (mdr1a−/−) genotypes co-housed together in mixed genotype cages. Differences in the microbiota of WT and mdr1a−/− mice have been reported17,50 (with and without littermate controls respectively) so the fact that we do not see them here (Figs. 2e, 3a) is unexpected. Discrepancies in sample size between treatment groups can be a problem for RFs applied to such data51 and machine learning typically uses much larger sample sizes. However, here sample sizes are well balanced (10 wildtype and 10 mdr1a−/− mice with 2 samples from each, albeit individual mothers gave rise to different numbers of offspring, Supplementary Figure S1) and sufficient for distinguishing other criteria. The older mdr1a−/− mice were starting to develop colitis. Therefore, changes in the microbiome with colitis may have obscured any consistent differences among genotypes. Alternatively, previous analyses may have been misled by large cage effects (Figs. 2c, 3a) into erroneously attributing some of that variation to differences among genotypes. For instance, studies finding differences between the stool microbiota of eosinophil-deficient mice compared to wildtype mice (e.g.52) do not report controls for cage effects, e.g. via littermate controls. Thus, they cannot rule out the possibility that differences in the microbiota are due to environmental effects.

Changes in the gut microbiome, at any taxonomic level have been attributed as leading to functional impacts on the host. For instance, a reduction in the abundance and diversity of Firmicutes is associated with IBD in human patients53,54,55 and Bacteroidetes has been shown to be both increased56 and decreased53 with respect to inflammation. However, our data did not find such high-level taxa to show consistent differences in any of our microbiome comparisons (Fig. 4a–d, Supplementary Figure S9). This could be because our phylogenetic tree does not fully capture the relationships among the highest level taxa (Fig. 1b), because there is limited phylogenetic information in amplicons from the subset of the 16S rRNA used. Even trees using the complete 16S rRNA sequence from carefully chosen bacteria do not fully capture their evolutionary history57 and partial 16S rRNA trees can only be made to agree with accepted evolutionary relationships by incorporating many constraints (as done by Louca et al.58). Here, we did not want to create the biases that such constraints would impose. Even taking this approach, phyla themselves are largely resolved (Fig. 1b), so inadequacies in the tree seem unlikely to account for the lack of consistent differences in these taxa among criteria (Fig. 4a–d, Supplementary Figure S9). These high-level taxa are also abundant taxa. While there has been a focus on the importance of rare taxa (e.g.59), a priori, it might have been reasonable to expect that the more abundant taxa would have the most important functional consequences for the host (as posited in other microbiomes e.g.60) and therefore be the most likely to differ between different circumstances. However, the only microbiome comparison in which we find the most abundant taxa to be important was in distinguishing among cages and mothers. In those cases, it was low-level taxonomic groupings (e.g. clades within the abundant genus Bacteroides), not phyla, that distinguished cage-specific microbiomes. This findings suggests the importance of relatively rare microbial species.

Rare bacterial species are thought to play a large role in a range of ecosystems, including host and environmental microbiomes61,62. Specifically, rare taxa have been associated with inflammation63. Here however, we did not find the rarest taxa to be important in discriminating between microbiomes (Fig. 4e–h). This could be an artefact of the fact that, almost by definition in a complex microbiome, rare taxa are likely to be missed from at least a subset of samples through random sampling. Therefore, rare taxa would not show consistent differences among the factors considered (niche, age, cage or mother), as we find (Fig. 4e–h).

The presence or absence of certain taxa will allow other bacterial families/species to flourish or be inhibited, which in turn will alter host/microbial homeostasis, emphasising the need to consider communities and not bacteria in isolation. Furthermore, changes in one bacterium may not be significant functionally if the clade as a whole is unaffected. RF models can account for such interactions among taxa, and the ‘importance’ assigned to a taxon (Fig. 3b–e) takes these into account64. RF approaches have previously proved effective where ratios of Firmicutes to Bacteroidetes could not51 and could discriminate between patients with active Crohn’s disease and those in remission with ~ 70% accuracy65. Here we go one step further, by using the full range of clades in a phylogenetic tree as explanatory variables in the RF model. This avoids over-stretching the data by assigning a sequence read to one taxon rather than another, when it is in fact similarly close or distant to both. It also ensures that we do not lose power that is in the data e.g. clear phylogenetic structure among sequences that are closer than a given threshold (typically 97% identity used for OTUs66). The development of ‘de-noising’ approaches such as DADA267 and DEBLUR68 to generate amplicon sequence variants (ASVs) also goes some way to avoiding the problems of using universal similarity thresholds to define OTUs. However, different de-noisers can lead to different results69. We avoid such issues by using all sequence variants, whether true ASVs or sequencing errors. Given our well-controlled experiment, we do not expect different sequencing errors in different treatments. This expectation is consistent with the fact that we find the very rarest variants, which will be highly enriched for sequencing errors, are no better than random at distinguishing any of our treatments (Fig. 4e–h). This approach and our focus on differences among treatments comes at a cost—we do not even attempt to estimate the ‘true’ community composition of any particular sample or how well it’s captured by the data. Despite this, we are able to identify clear compositional differences and phylogenetic patterns in communities across the treatments studied.

In conclusion, taking a carefully designed factorial experiment involving co-housing of different genotypes of littermate mice, we have been able to identify major changes in the gut microbiome with age, niches, cages and mothers, but not genotype (Figs. 2, 3a). In particular, we highlight a clear impact on the gut microbial communities associated with these experimental factors, particularly cage and maternal effect with phylogenetic patterns that are in stark contrast to niche and age. Together, this work reveals the subtlety of the balance between homeostasis and difference in the gut microbiome, and emphasises the need to carefully account for host environment when performing future studies.