Introduction

The concept of enterotypes has provided a foundational framework for classifying gut microbiome configurations in human populations. Typically, these enterotypes are dominated by one of three bacterial genera: Bacteroides, Prevotella or Ruminococcus, with Bacteroides and Prevotella often exhibiting mutual exclusion and competitive dynamics. In our earlier cluster analysis of the IHAT-Gut cohort, we identified two dominant taxa, Prevotella stercorea and Faecalibacterium prausnitzii, as key components of the gut microbiota of Gambian children1. While Prevotella predominates in many pre-industrial societies, Bacteroides is characteristically dominant in industrialised Western populations2,3. This geographic divergence reflects broader differences in diet, lifestyle and environmental microbial exposures.

Gut microbiome development in early childhood typically follows a well-characterised trajectory: an initial period of instability during infancy, followed by a gradual maturation toward an adult-like configuration by ~3 years of age4. However, much of this understanding derives from studies conducted in low-infection-burden, industrialised contexts. Consequently, there remains a critical knowledge gap regarding microbiome maturation and immune function in high-infection-burden regions such as sub-Saharan Africa. Recent studies in Nigeria, Malawi, Burkina Faso and The Gambia have begun to address this gap, consistently reporting Prevotella-dominant microbiomes as the regional norm1,5,6,7. Yet, few have examined how these microbial configurations influence infection susceptibility and health outcomes during early life.

While microbiome maturation is shaped by age, diet and environmental exposure, it is also increasingly linked to urbanisation, lifestyle transitions and long-term disease risk8,9. In settings where infectious diseases remain the leading cause of child mortality, understanding how gut microbial ecology relates to infection risk and immune development is both a scientific imperative and a public health priority.

In this follow-up analysis, we return to the IHAT-Gut dataset of 633 children aged 7–37 months from a randomised iron supplementation trial in rural Gambia10,11. We hypothesise that specific gut microbiota compositions confer protection against infection and predict future adverse events. Building on prior analyses describing a Prevotella-based trophic network1, we provide new insights into how this network may underpin immune resilience and modulate infection risk during a critical window of microbiome and immune system co-development. Here, we show that a Prevotella-dominated gut microbiome, with higher Prevotella stercorea and lower Escherichia coli, is associated with reduced infection frequency and duration, forms temporally stable co-occurring networks and has the strongest impact in 1–2-year-olds, revealing a microbiome configuration that may underpin immune resilience and guide global child health interventions.

Results

Sampling framework and cohort description

In our previous microbiome analysis of the IHAT-Gut study1, we examined 1546 stool samples collected from 633 children aged 7–37 months who were enrolled in the IHAT-Gut double-blind, randomised trial (NCT02941081) in The Gambia. After quality control, 1389 samples were retained for analysis. Stool samples were collected at three timepoints: Day 1 (D1), Day 15 (D15) and Day 85 (D85) (Supplementary Fig. 1). That earlier study focused primarily on describing gut microbiome composition and developmental trajectories and showed that iron supplementation had no confounding effect on microbial structure, allowing us to omit treatment arm from downstream analysis.

However, the relationship between microbiome composition and adverse events (AEs) recorded during the 113-day follow-up period was not explored. These AEs were predominantly infectious in nature (693 of 707 events; 98.02%; Supplementary Table 1), with fever (n = 468), acute respiratory infections (n = 383) and diarrhoea (n = 169) being the most common.

Table 1 Baseline characteristics of children by sex, age group, illness status and geographic location

In this follow-up study, we stratified children into two cohorts based on illness status: Ill (≥1 infectious AE; n = 418) and Not-Ill (no recorded AEs; n = 210) (Table 1; Supplementary Table 2). We then examined how gut microbiome composition relates to infection risk, through analyses of alpha and beta diversity and by identifying taxa that differ between Ill and Not-Ill children.

Table 2 Summary of diversity and taxa differences (FDR and age-adjusted)

Participant demographics and gut microbiome stability across age groups and illness status (D1 vs D85)

Because our central hypothesis posits that microbiome composition predicts illness, while the microbiome itself is strongly age-dependent, we first examined participant demographics and within-individual short-term microbiome stability. Demographic analyses revealed that Ill children were slightly younger than Not-Ill children (mean age: 22.6 vs 24.2 months), with age group distributions differing between illness statuses (Mann–Whitney U test, p = 0.0005) (Table 1; Supplementary Fig. 2; Supplementary Tables 3, 4). Sex distributions remained balanced across groups (48% female overall), arguing against sex-related confounding.

Within-individual Bray–Curtis similarity analysis (Supplementary Fig. 3) confirmed expected age-dependent maturation patterns: the youngest children (7–12 months) showed the greatest instability (similarity = 0.460 ± 0.185), intermediate-aged children (13–24 months) displayed moderate stability (0.528 ± 0.182), while the oldest children (>24 months) exhibited the highest stability (0.609 ± 0.173). These gradients reflect developmental consolidation of the gut microbiome over the study period.
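As a minimal sketch of the within-individual similarity metric described above, the following computes Bray–Curtis similarity (1 minus the dissimilarity) between two stool profiles from the same child. The function name and the abundance values are hypothetical illustrations, not study data, and the study's own pipeline is not specified in this section.

```python
def bray_curtis_similarity(u, v):
    """Return 1 - Bray-Curtis dissimilarity for two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("profiles must list the same taxa in the same order")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both profiles are empty")
    # Shared abundance: for each taxon, the smaller of the two values.
    shared = sum(min(a, b) for a, b in zip(u, v))
    return 2.0 * shared / total


# Hypothetical relative abundances for one child at D1 and D85 (same taxon order).
d1_profile = [0.50, 0.30, 0.20, 0.00]
d85_profile = [0.40, 0.30, 0.10, 0.20]
similarity = bray_curtis_similarity(d1_profile, d85_profile)
```

A value of 1 means the two profiles are identical and 0 means they share no taxa, so higher D1 to D85 similarity corresponds to a more stable microbiome over the study period.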

Crucially, when stratified by age group, within-individual stability did not differ between Ill and Not-Ill children (FDR q = 0.561–0.654; Supplementary Table 5). This indicates that illness status does not disrupt fundamental patterns of short-term microbiome maturation. Instead, the predictive signal of illness must be sought in between-individual differences in microbiome composition. Accordingly, all subsequent analyses incorporated age-adjusted or stratified models to ensure that demographic factors, especially age, were properly accounted for in microbiome–illness associations.

Iron treatment does not confound microbiome–illness associations

Because iron supplementation could itself alter the gut microbiome and infection risk, it was essential to evaluate its independent impact before modelling illness. We therefore tested for associations between treatment arm (iron vs placebo) and (i) microbial taxonomic composition (via MaAsLin2), (ii) alpha diversity (Shannon, Richness, Fisher’s alpha, Simpson, Pielou), (iii) beta diversity (Bray–Curtis distances) and (iv) AE occurrence.

Iron treatment showed no significant overlap with illness-associated taxa after false discovery rate (FDR) correction. Alpha and beta diversity metrics also showed no consistent differences by treatment group, and AE incidence did not differ between iron and placebo arms (Supplementary Fig. 4). These findings replicate our earlier results1 and support excluding treatment group as a covariate in downstream models focused on illness-related microbiome differences.

Alpha diversity differences suggest disrupted microbiome maturation in ill children

Age-adjusted generalised linear models (Benjamini–Hochberg FDR) revealed that at each timepoint, Ill children exhibited significantly lower Richness and Fisher’s α compared to Not-Ill peers, whereas Shannon, Simpson and Pielou’s evenness were generally non-significant (Table 2; Supplementary Fig. 5A; Supplementary Table 6). Specifically, same-timepoint comparisons showed deficits in Ill children at D1 (Richness: F(1, 512) = 6.05, p = 0.016, η² p = 0.012, 95% CI [−7.66, 1.00]; Fisher’s α: F(1, 512) = 5.80, p = 0.016, η² p = 0.011, 95% CI [−1.19, 0.19]), D15 (Richness: F(1, 404) = 7.49, p = 0.006, η² p = 0.018, 95% CI [−8.02, 1.58]; Fisher’s α: F(1, 404) = 7.55, p = 0.006, η² p = 0.018, 95% CI [−1.25, 0.26]) and D85 (Richness: F(1, 448) = 15.30, p = 1.3 × 10⁻⁴, η² p = 0.033, 95% CI [−10.57, −1.24]; Fisher’s α: F(1, 448) = 14.83, p = 1.3 × 10⁻⁴, η² p = 0.032, 95% CI [−1.65, −0.16]) (Fig. 1A; Table 2; Supplementary Fig. 5B).
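The Benjamini–Hochberg adjustment applied to these p values can be sketched as follows. This is an illustrative stand-alone implementation under the standard definition of the procedure; the raw p values are hypothetical and the authors' actual software is not named here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p values from five comparisons.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
q_values = benjamini_hochberg(raw_p)
```

Each q value estimates the smallest false discovery rate at which that comparison would be declared significant, which is why several tests with similar raw p values can share the same adjusted value.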

Fig. 1: Age-adjusted and age-stratified gut microbiome alpha diversity in Ill and Not-Ill children across developmental timepoints.

Species richness was assessed at days 1 (D1), 15 (D15) and 85 (D85) using genus-level data, with models adjusted or stratified by age group. Individual stool samples per child represent biological replicates (n = 940 total). Boxplots show medians, interquartile ranges (IQR), minima and maxima, with whiskers extending to 1.5 × IQR; points represent individual samples. Data from generalised linear and mixed-effects models are presented as mean ± s.e.m. A Ill versus Not-Ill children across D1, D15 and D85. Age-adjusted GLMs comparing Ill children at D1 (n = 340), D15 (n = 285) and D85 (n = 315) with Not-Ill peers (D1: n = 176; D15: n = 123; D85: n = 137) revealed significantly lower richness in Ill children at all timepoints (two-sided, FDR-adjusted p = 0.016, 0.006 and 1.3 × 10⁻⁴, respectively). Stratified analyses showed no differences in the 7–12 month or >2 year groups but significant richness deficits in 1–2 year-olds at D15 (FDR-adjusted p = 0.032) and D85 (FDR-adjusted p = 0.037). B Early Ill stools (D1, D15) versus D85 Not-Ill reference stools. Age-adjusted and stratified mixed-effects models comparing Ill (D1, D15) with D85 Not-Ill stools confirmed persistent deficits (two-sided, FDR-adjusted p < 0.001; D1 richness: F(1, 473) = 26.30, p = 4.26 × 10⁻⁷, η² p = 0.053; D15 richness: F(1, 418) = 28.35, p = 1.66 × 10⁻⁷, η² p = 0.064). The 1–2 year group showed the greatest richness loss (F(1, 213) = 13.33, p = 3.28 × 10⁻⁴; F(1, 180) = 15.18, p = 1.38 × 10⁻⁴).

Using D85 Not-Ill samples as a reference for healthy microbiome maturation (minimising pre-study illness effects), early Ill samples at D1 and D15 exhibited even more pronounced deficits. At D1, Richness was significantly reduced in Ill versus D85 Not-Ill children (F(1, 473) = 26.30, p = 4.26 × 10⁻⁷, η² p = 0.053, 95% CI [−12.70, −3.11]; n = 477), as was Fisher’s α (F(1, 473) = 25.95, p = 5.07 × 10⁻⁷, η² p = 0.052, 95% CI [−1.98, −0.46]; n = 477). Similar reductions were observed at D15 for both Richness (F(1, 418) = 28.35, p = 1.66 × 10⁻⁷, η² p = 0.064, 95% CI [−12.07, −2.72]; n = 422) and Fisher’s α (F(1, 418) = 28.38, p = 1.63 × 10⁻⁷, η² p = 0.064, 95% CI [−1.89, −0.41]; n = 422), providing a robust baseline comparison for disrupted microbiome development (Fig. 1B; Table 2; Supplementary Fig. 5B; Supplementary Table 6).
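For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity η²p = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the two Richness comparisons above; the function name is ours, and the identity is assumed to match how the reported values were derived.

```python
def partial_eta_squared(f_stat, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    explained = f_stat * df_effect
    return explained / (explained + df_error)


# F statistics and degrees of freedom as reported in the text.
d1_richness = partial_eta_squared(26.30, 1, 473)
d15_richness = partial_eta_squared(28.35, 1, 418)
```

Rounded to three decimals these give 0.053 and 0.064, in agreement with the values reported for D1 and D15 Richness.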

Within-group temporal trajectories indicated that α-diversity did not significantly change over time within Ill or Not-Ill children (Ill: D1 vs D15 vs D85, all p ≥ 0.061; Not-Ill: D1 vs D15 vs D85, all p ≥ 0.201) (Table 2; Supplementary Fig. 5A; Supplementary Table 6). This persistence highlights that deficits in Ill children are stable across the study period, rather than reflecting transient fluctuations.

Age-stratified analyses further revealed that the 1–2 year age group was most affected. In this group, Ill children had significantly lower diversity indices at D15 and D85 compared to age-matched Not-Ill peers (D15: Richness, F(1, 178) = 4.67, p = 0.032, η² p = 0.026, 95% CI [−15.92, −0.72]; Fisher’s α, F(1, 178) = 4.87, p = 0.032, η² p = 0.027, 95% CI [−2.53, −0.14]; D85: Richness, F(1, 194) = 4.41, p = 0.037, η² p = 0.022, 95% CI [−15.56, −0.49]; Fisher’s α, F(1, 194) = 4.61, p = 0.037, η² p = 0.023, 95% CI [−2.48, −0.11]).

Early Ill samples (D1, D15) also showed markedly lower diversity compared to D85 Not-Ill (D1: Richness, F(1, 213) = 13.33, p = 3.28 × 10⁻⁴, η² p = 0.059, 95% CI [−21.11, −6.31]; Fisher’s α, F(1, 213) = 13.79, p = 2.61 × 10⁻⁴, η² p = 0.061, 95% CI [−3.39, −1.04]; D15: Richness, F(1, 180) = 15.18, p = 1.38 × 10⁻⁴, η² p = 0.078, 95% CI [−21.96, −7.19]; Fisher’s α, F(1, 180) = 16.12, p = 8.72 × 10⁻⁵, η² p = 0.082, 95% CI [−3.53, −1.20]) (Fig. 1A-B; Supplementary Table 7; Supplementary Fig. 5B). Children in the 7–12 month and >2 year groups showed less pronounced differences, indicating that microbiome disruption in Ill children is both persistent and developmentally window-specific.

The absence of significant differences in other alpha diversity metrics (Shannon, Simpson and Pielou’s evenness) suggests that while the number of taxa and the abundance-weighted richness (Fisher’s alpha) are reduced in Ill children, the overall distribution of abundances among the remaining taxa is relatively preserved. In other words, Ill children’s microbiomes show a loss of rare or less abundant taxa without major disruption to community evenness or dominance, highlighting that microbiome maturation is specifically impaired in terms of richness rather than global diversity structure.
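The distinction drawn above between richness-type and evenness-weighted indices can be illustrated with a toy example. The sketch below uses standard textbook definitions (Gini–Simpson for Simpson, and Fisher's alpha solved numerically from S = α ln(1 + N/α)); the read counts are hypothetical and the function names are ours, not the study's pipeline.

```python
import math


def richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index using natural logarithms."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index, 1 - sum(p_i^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def pielou(counts):
    """Pielou's evenness: Shannon divided by log(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def fisher_alpha(counts, iterations=200):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection (needs read counts)."""
    s, n = richness(counts), sum(counts)
    if not 0 < s < n:
        raise ValueError("Fisher's alpha needs 0 < richness < total reads")
    low, high = 1e-9, 1e9
    for _ in range(iterations):
        mid = (low + high) / 2.0
        # The left-hand side increases with alpha, so move towards the root.
        if mid * math.log1p(n / mid) < s:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical read counts: the second community has lost only its four rarest taxa.
full = [400, 300, 200, 50, 30, 10, 5, 3, 1, 1]
depleted = [400, 300, 200, 50, 30, 10, 0, 0, 0, 0]
shannon_shift = abs(shannon(full) - shannon(depleted))
simpson_shift = abs(simpson(full) - simpson(depleted))
```

In this toy case richness falls from 10 to 6 and Fisher's alpha drops with it, while Shannon and Simpson change only slightly because the dominant taxa are untouched. Pielou's evenness is normalised by log richness, so it can shift more than Shannon in such an example.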

Overall, these analyses demonstrate that Ill children exhibit sustained reductions in Richness and Fisher’s alpha from the earliest sampling point, with the 1–2 year group most strongly affected. This pattern is consistent with disrupted microbiome maturation in early life, highlighting both baseline deficits and age-dependent vulnerabilities.

Beta diversity reveals early and persistent differences in microbiome composition

Beta diversity analysis revealed significant and persistent differences in gut microbiome composition between children who developed infections (‘Ill’) and those who remained healthy (‘Not-Ill’), even after adjusting for age (Table 2; Fig. 2; Supplementary Table 8). PERMANOVA and dispersion models were FDR-adjusted using the Benjamini–Hochberg procedure. PERMANOVA demonstrated clear compositional separation between Ill and Not-Ill groups at all timepoints: Day 1 (D1): F(3, 512) = 20.46, p = 0.002, R² = 0.107, 95% CI [0.087, 0.141]; Day 15 (D15): F(3, 404) = 19.27, p = 0.002, R² = 0.125, 95% CI [0.101, 0.169]; and Day 85 (D85): F(3, 448) = 13.03, p = 0.002, R² = 0.080, 95% CI [0.063, 0.114]. While R² values were modest, they nonetheless reflect biologically meaningful shifts in community structure. These findings demonstrate consistent and statistically robust differences in overall microbial composition between Ill and Not-Ill children across all sampling points.
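To make the PERMANOVA statistics concrete, the following is a simplified one-way sketch (no age adjustment, unlike the models reported here) that computes the pseudo-F, R² and a permutation p value from a Bray–Curtis distance matrix. All profiles are hypothetical, and the helper names are ours rather than those of any particular package.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    return 1.0 - 2.0 * sum(min(a, b) for a, b in zip(u, v)) / (sum(u) + sum(v))


def within_group_ss(dist, labels):
    """Sum over groups of squared within-group pairwise distances / group size."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    total = 0.0
    for members in groups.values():
        pairs = sum(dist[i][j] ** 2
                    for pos, i in enumerate(members) for j in members[pos + 1:])
        total += pairs / len(members)
    return total


def permanova(dist, labels, n_perm=999, seed=0):
    """One-way PERMANOVA returning (pseudo-F, R^2, permutation p value)."""
    n, k = len(labels), len(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n

    def pseudo_f(current):
        ss_within = within_group_ss(dist, current)
        return ((ss_total - ss_within) / (k - 1)) / (ss_within / (n - k))

    f_obs = pseudo_f(labels)
    r2 = 1.0 - within_group_ss(dist, labels) / ss_total
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and profiles
        if pseudo_f(shuffled) >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)


# Hypothetical profiles: columns are P. stercorea, E. coli, all other taxa.
profiles = [[0.10, 0.60, 0.30], [0.15, 0.55, 0.30], [0.05, 0.70, 0.25], [0.20, 0.50, 0.30],
            [0.60, 0.10, 0.30], [0.55, 0.15, 0.30], [0.70, 0.05, 0.25], [0.50, 0.20, 0.30]]
labels = ["Ill"] * 4 + ["Not-Ill"] * 4
dist = [[bray_curtis(a, b) for b in profiles] for a in profiles]
f_stat, r2, p_value = permanova(dist, labels)
```

In this one-way setting R² equals F × df1 / (F × df1 + df2). Applying the same relation to the reported F statistics and degrees of freedom reproduces the reported R² values to within rounding, for example 20.46 × 3 / (20.46 × 3 + 512) ≈ 0.107 at D1.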

Fig. 2: Age-adjusted and age-stratified gut microbiome beta diversity in Ill and Not-Ill children across developmental timepoints.

Beta diversity was assessed at days 1 (D1), 15 (D15) and 85 (D85) using genus-level data, with models adjusted or stratified by age group. Individual stool samples per child represent biological replicates (n = 940 total). Principal coordinates analysis (PCoA) of Bray–Curtis dissimilarities revealed significant compositional differences between Ill children at D1 (n = 340), D15 (n = 285) and D85 (n = 315) and Not-Ill peers (D1: n = 176; D15: n = 123; D85: n = 137) at each timepoint (age-adjusted two-sided PERMANOVA: D1, R² = 0.1071, FDR p = 0.002; D15, R² = 0.1252, FDR p = 0.002; D85, R² = 0.0802, FDR p = 0.002). Age stratification showed that this was primarily in the 1–2 year age group. Community dispersion, assessed by Kruskal–Wallis tests, was greater among Ill children at D85 (FDR p < 0.001) but not at earlier timepoints (D1: FDR p = 0.851; D15: FDR p = 0.583). Boxplots show medians, interquartile ranges (IQR), minima and maxima, with whiskers extending to 1.5 × IQR; points represent individual samples.

The nature of these differences, however, shifted over time. At D1 and D15, the separation reflected true compositional changes, as within-group dispersion did not differ significantly between Ill and Not-Ill children (H(1) = 0.08, p = 0.851 at D1; H(1) = 1.22, p = 0.583 at D15). By contrast, at D85, a significant dispersion effect emerged (H(1) = 18.52, p < 0.001), indicating that infection status at this later stage was associated with increased inter-individual variability in the Ill group relative to the more uniform, reconstituted microbiomes of D85 Not-Ill children.
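The dispersion comparison can be sketched in two steps: compute each sample's distance to its group centroid from the pairwise distance matrix, then compare those distances between groups with a Kruskal–Wallis test. The code below is an illustrative simplification with hypothetical data. It assumes two groups (so the p value uses a chi-square distribution with 1 degree of freedom) and clamps negative squared distances to zero, which can arise because Bray–Curtis is not strictly Euclidean.

```python
import math


def distances_to_centroid(dist):
    """Distance of each sample to its group centroid, from pairwise distances only."""
    n = len(dist)
    pair_term = sum(dist[j][k] ** 2 for j in range(n) for k in range(j + 1, n)) / n ** 2
    result = []
    for i in range(n):
        squared = sum(dist[i][j] ** 2 for j in range(n)) / n - pair_term
        result.append(math.sqrt(max(squared, 0.0)))  # guard against tiny negatives
    return result


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected Kruskal-Wallis H and p value for two groups (1 degree of freedom)."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    sum_a, sum_b = sum(ranks[:len(a)]), sum(ranks[len(a):])
    h = 12.0 / (n * (n + 1)) * (sum_a ** 2 / len(a) + sum_b ** 2 / len(b)) - 3.0 * (n + 1)
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    h = max(h / correction, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    return h, math.erfc(math.sqrt(h / 2.0))


def line_distances(points):
    """Pairwise distances for hypothetical samples placed along one axis."""
    return [[abs(p - q) for q in points] for p in points]


# Hypothetical groups: one tightly clustered, one widely dispersed.
tight = distances_to_centroid(line_distances([0.0, 0.1, 0.25, 0.3]))
spread = distances_to_centroid(line_distances([0.0, 0.4, 1.0, 1.5]))
h_stat, p_value = kruskal_wallis_two_groups(tight, spread)
```

A significant H here indicates that the groups differ in spread rather than in location, which is why a non-significant dispersion test at D1 and D15 supports reading the PERMANOVA separation at those timepoints as a genuine compositional shift.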

Longitudinal pairwise comparisons revealed distinct microbiome trajectories within each illness group (Supplementary Fig. 6; Supplementary Table 8). In Not-Ill children, significant compositional changes between D15 and D85 (p = 0.012) and between D1 and D85 (p = 0.002) were accompanied by decreased dispersion (p = 0.015 and p = 0.001, respectively), reflecting progressive microbiome maturation and stabilisation. Conversely, Ill children showed compositional shifts over time (D1 vs. D15: p = 0.007; D1 vs. D85: p = 0.014) without changes in dispersion (all p > 0.85), indicating persistent inter-individual variability. Cross-group comparisons between early illness timepoints and D85 Not-Ill samples further emphasised these divergent trajectories, with significant compositional and dispersion differences (all p ≤ 0.002).

Consistent with alpha diversity findings, these differences were most pronounced in the 1–2-year-old group, highlighting this developmental window as one of heightened microbiome vulnerability. Age-stratified PERMANOVA within the same timepoint comparisons confirmed this pattern: in 1–2-year-olds, Ill and Not-Ill children showed a trend toward divergence at D15 (F(1, 178) = 2.32, p = 0.084, R² = 0.013, 95% CI [0.008, 0.030]) and significant separation at D85 (F(1, 194) = 2.64, p = 0.049, R² = 0.013, 95% CI [0.008, 0.031]), whereas no significant separation was observed in the 7–12-month or >2-year groups at any timepoint (Fig. 2; Supplementary Table 9). Together, these results suggest that susceptibility to infection is driven more by differences in gut microbiome composition and reduced microbial richness than by overall evenness, and they pinpoint the second year of life as a critical period when the gut microbiome may play a pivotal role in shaping infection risk.

Microbial network analysis reveals antagonistic clusters centred on P. stercorea and E. coli

To elucidate ecological relationships within the gut microbiota, we conducted bacterial co-occurrence network analysis using the 50 most abundant taxa (≥0.2% relative abundance). Taxa grouped into ten clusters, six of which formed two antagonistic and mutually exclusive modules: a health-associated network centred on Prevotella stercorea, and a dysbiosis-associated network anchored by Escherichia coli (Fig. 3A, B).

Fig. 3: Network heatmap reveals trophic microbial community structure and immune-protective clusters.

A Network heatmap of the top 50 taxa (≥0.2% abundance) across all timepoints identified ten microbial clusters via gap statistics. Two antagonistic superclusters emerged: a Prevotella stercorea–dominated group (including P. copri and Faecalibacterium prausnitzii) and an Escherichia coli–dominated group (including Bacteroides, Bifidobacterium and several pathogens such as Klebsiella pneumoniae and Streptococcus equinus). Colour intensity indicates strength and direction of correlation (red = positive, blue = negative). Bar heights represent relative abundance (range 0–0.6) for visual comparison. Ill samples aligned with the E. coli cluster (upper right), while Not-Ill samples aligned with the P. stercorea cluster (upper left). B Schematic representation of the correlation structure extracted from (A) (based on Spearman’s correlation coefficients) showing the relationship between the Prevotella stercorea group and the Escherichia coli group. The layout displays four major interaction zones distinguished by colour intensity, with red indicating strong positive correlations and blue indicating strong negative correlations. Bidirectional arrows denote prominent antagonistic relationships between major taxonomic groups.

The P. stercorea cluster comprised Prevotella copri, Faecalibacterium prausnitzii, Succinivibrio dextrinosolvens and other obligate anaerobes known for short-chain fatty acid production and immune modulation. In contrast, the E. coli cluster included Bacteroides, Bifidobacterium and several pathobionts such as Klebsiella pneumoniae and Streptococcus equinus. Correlation heatmaps revealed strong positive associations within each cluster (red), and strong negative correlations (blue) between the two clusters, consistent with competitive exclusion and antagonistic niche occupation (Fig. 3A).
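The pairwise Spearman correlations underlying such a heatmap can be sketched as below. This is a minimal illustration with hypothetical abundances for three taxa across six samples; the clustering step (gap statistic) is not shown, and the function names are ours.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relative abundances across six samples.
p_stercorea = [0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
f_prausnitzii = [0.12, 0.10, 0.11, 0.06, 0.05, 0.02]
e_coli = [0.02, 0.03, 0.08, 0.15, 0.20, 0.30]

rho_within = spearman(p_stercorea, f_prausnitzii)  # same module: positive
rho_between = spearman(p_stercorea, e_coli)        # opposing modules: negative
```

Positive coefficients correspond to the red within-cluster blocks of the heatmap and negative coefficients to the blue between-cluster blocks.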

To enhance interpretability, we generated a cluster relationship map (Fig. 3B), illustrating the negative bidirectional correlations between the P. stercorea and E. coli networks. The Ill cohort predominantly aligned with the E. coli cluster, whereas the Not-Ill cohort was enriched for P. stercorea and its co-associated taxa. These data suggest the existence of two competing microbial ecologies: one potentially protective and the other permissive of inflammation and pathogen expansion.

Prevotella stercorea and Escherichia coli differentiate Ill from Not-Ill cohorts

Significant differences in taxa were observed between the Ill and Not-Ill cohorts following age-adjusted ANOVA across the three timepoints. Among the ten most abundant taxa (>1% total reads), Escherichia coli and Prevotella stercorea consistently differed between groups (FDR-corrected p < 0.05; Table 2; Supplementary Table 10, Fig. 4A, B). Specifically, E. coli abundance was significantly higher in the Ill cohort at all timepoints, while P. stercorea was significantly reduced. Both taxa exhibited parallel temporal trends with no significant interaction effects (interaction p = 0.871 for E. coli, p = 0.349 for P. stercorea), indicating stable group differences independent of sampling time.

Fig. 4: Temporal gut microbiome dynamics in children by illness status.

A Relative abundance of Prevotella stercorea (left) and Escherichia coli (right) at D1, D15 and D85. Red = Ill (n = 940); blue = Not-Ill (n = 436). Boxes show medians, interquartile ranges (IQR), minima and maxima, with whiskers extending to 1.5 × IQR. Age-adjusted two-way ANOVA (two-sided) based on biological replicates showed significant group effects (E. coli, p = 0.0249; P. stercorea, p = 0.0228), a significant time effect for P. stercorea (p = 0.0115), a marginal time effect for E. coli (p = 0.0703), and no group × time interaction (E. coli, p = 0.871; P. stercorea, p = 0.349). B Mean ± s.e.m. relative abundances of the 50 most abundant taxa across timepoints and illness groups. Each point represents a biological replicate (stool sample per child). Sample sizes per group and timepoint are as in (A).

These findings imply that susceptibility to infection may reflect intrinsic baseline compositional differences rather than dynamic microbiome shifts following illness onset. P. stercorea showed significant time (F(2, 512) = 5.78, p = 0.0115, η² p = 0.005, 95% CI [0.001, 1]) and group (F(1, 512) = 7.33, p = 0.0228, η² p = 0.005, 95% CI [0.001, 1]) effects, indicating sustained lower levels in the Ill cohort. This interpretation is further supported by independent MaAsLin2 analysis comparing Day 85 samples, which revealed that P. stercorea remained significantly depleted in Ill children at the final timepoint (coef = −0.026, 95% CI [−0.046, −0.005], q = 0.025, Table 2 Comparison 9), reinforcing the persistence of group-level differences.

In contrast, E. coli demonstrated a significant group effect (F(1, 512) = 7.03, p = 0.0249, η² p = 0.005, 95% CI [0.001, 1]) and a marginal time effect (F(2, 512) = 3.52, p = 0.0703), suggesting consistently elevated levels in the Ill group with modest temporal change.

Extending this analysis, the trajectories and Day 85 relative abundances of the 50 most abundant bacterial taxa across the three timepoints provide further insight into potentially protective microbes (Fig. 4B). Within the next 40 most abundant taxa, Eubacterium coprostanoligenes (F(1, 512) = 2.32, p = 0.044, η² p = 0.005, 95% CI [0.000, 0.010]) was significantly enriched in the Not-Ill group. Notably, all Prevotella species consistently aligned with the Not-Ill group, reinforcing their potential protective role, whereas pathogenic taxa generally clustered with the Ill group, underscoring candidates implicated in microbiome-mediated susceptibility.

Other dominant taxa displayed varying temporal and group patterns. Bifidobacterium showed a marginal time effect (F(2, 512) = 3.80, p = 0.060) without group or interaction differences, suggesting possible age-related shifts independent of illness. Faecalibacterium prausnitzii exhibited a significant temporal change (F(2, 512) = 4.72, p = 0.0259, η² p = 0.001, 95% CI [0, 1]) but no group-level effects. Prevotella copri, Bacteroides, Succinivibrio dextrinosolvens, Paraprevotella xylaniphila, Sutterella wadsworthensis and Streptococcus salivarius did not differ significantly between cohorts. Only P. copri (F(2, 512) = 3.41, p = 0.074) and S. salivarius (F(2, 512) = 3.33, p = 0.0758) exhibited marginal time effects, consistent with shared developmental dynamics (Supplementary Table 10).

Together, these results identify E. coli and P. stercorea as the most robust microbial markers of future infection. Their consistent divergence between Ill and Not-Ill children, supported by both longitudinal ANOVA and targeted MaAsLin2 comparisons, points to persistent compositional differences preceding clinical outcomes rather than illness-driven microbial change.

Baseline microbiome composition predicts future infection risk

Given the relative stability of microbial diversity and taxonomic composition at D85, and the reduced likelihood of residual confounding from recent pre-enrolment illnesses, D85 Not-Ill samples were designated as the optimal ‘healthy control’ group for all downstream analyses. To evaluate the predictive capacity of early microbiome features, D1 samples were stratified by time-to-illness onset: ≤35 days (‘Soon-Ill’, n = 189), 36–70 days (‘Later-Ill’, n = 100), and >70 days (‘Much-Later-Ill’, n = 51) and compared to D85 Not-Ill controls (n = 137).
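The stratification by time-to-illness onset can be expressed as a simple binning rule. The sketch below assumes onset is measured in days from D1 and that children with no infectious AE carry no onset value; both are labelling assumptions for illustration, and the example values are hypothetical.

```python
from collections import Counter


def onset_group(days_to_first_infection):
    """Assign an illness-timing stratum using the cut-offs given in the text."""
    if days_to_first_infection is None:  # assumption: no infectious AE recorded
        return "Not-Ill"
    if days_to_first_infection <= 35:
        return "Soon-Ill"
    if days_to_first_infection <= 70:
        return "Later-Ill"
    return "Much-Later-Ill"


# Hypothetical onset times (days from D1); None marks a child with no infectious AE.
example_days = [3, 35, 36, 70, 71, 110, None]
groups = [onset_group(d) for d in example_days]
tally = Counter(groups)
```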

PERMANOVA and dispersion models were FDR-adjusted using the Benjamini–Hochberg procedure. Age-adjusted PERMANOVA analyses (Fig. 5; Supplementary Table 8) revealed significant microbiome compositional differences across all illness-timing groups: Soon-Ill (F(3, 322) = 11.33, p = 0.001, R² = 0.096, 95% CI [0.073, 0.144]); Later-Ill (F(3, 233) = 8.18, p = 0.001, R² = 0.095, 95% CI [0.076, 0.149]); and Much-Later-Ill (F(3, 184) = 6.25, p = 0.001, R² = 0.092, 95% CI [0.072, 0.154]), indicating modest but consistent shifts in baseline microbiome structure associated with future infection risk.

Fig. 5: Early gut microbiome composition distinguishes children by illness onset timing.

Beta Diversity: Predictive Value of D1 Microbiome Composition. Principal coordinates analysis (PCoA) of Bray–Curtis dissimilarities from baseline (D1) gut microbiome profiles comparing children who developed infection-related illness within 35 days (Soon-Ill, n = 189), 36–70 days (Later-Ill, n = 100), or >70 days (Much-Later-Ill, n = 51) with D85 Not-Ill controls (n = 137). Individual stool samples per child represent biological replicates. Age-adjusted two-sided PERMANOVA indicated significant differences across groups (FDR p = 0.001), with age stratification showing that this was primarily in the 1–2 year age group. Community dispersion (distance to group centroids) was highest in Soon-Ill (FDR p = 1.23 × 10⁻⁴), followed by Much-Later-Ill (FDR p = 0.004) and Later-Ill (FDR p = 0.005). Boxplots show medians, interquartile ranges (IQR), minima and maxima, with whiskers extending to 1.5 × IQR.

Community dispersion, assessed with Kruskal–Wallis tests, showed that the Soon-Ill group exhibited the largest increase in between-sample dispersion relative to D85 Not-Ill controls (H(1) = 16.83, p = 1.23 × 10⁻⁴), followed by smaller but significant increases in Much-Later-Ill (H(1) = 9.26, p = 0.004) and Later-Ill (H(1) = 7.87, p = 0.005). These differences indicate that the compositional separation observed in PERMANOVA reflects a combination of true location shifts (changes in group centroid) and increased heterogeneity (dispersion) among children who develop infection, particularly in the Soon-Ill group. This pattern suggests a disrupted or developmentally delayed microbial state that may predispose to infection.

MaAsLin2 analyses (Fig. 6A; Supplementary Table 11) identified several D1 taxa with predictive associations. Prevotella stercorea was significantly depleted in all illness groups: Soon-Ill (coef = –0.029, 95% CI [–0.050, –0.009], q = 0.015), Later-Ill (coef = –0.040, 95% CI [–0.064, –0.016], q = 0.003) and Much-Later-Ill (coef = –0.044, 95% CI [–0.073, –0.014], q = 0.011), reinforcing its role as a core protective taxon. The pathogen Escherichia coli was significantly elevated in Soon-Ill (coef = 0.028, 95% CI [0.007, 0.049], q = 0.026), while Bacteroides was increased in both Soon-Ill (coef = 0.027, 95% CI [0.000, 0.054], q = 0.095) and Much-Later-Ill (coef = 0.044, 95% CI [0.005, 0.082], q = 0.055). Group-specific enrichments included Bifidobacterium in Later-Ill (coef = 0.024, 95% CI [0.000, 0.050], q = 0.099), Sutterella wadsworthensis in Much-Later-Ill (coef = 0.013, 95% CI [0.000, 0.027], q = 0.047) and Streptococcus salivarius in Soon-Ill (coef = 0.012, 95% CI [0.000, 0.026], q = 0.084).

Fig. 6: Early gut microbiome taxa distinguish children by illness onset timing.

Differential abundance analysis (MaAsLin2, FDR and age-adjusted) identified microbial signatures associated with illness onset timing. Individual stool samples per child represent biological replicates. A Species-Level Differences: Predictive Signals at D1 and D15. At D1, Prevotella stercorea was reduced in all Ill groups. Escherichia coli was enriched in the total Ill group (β = 0.024; q = 0.021), particularly in Soon-Ill (β = 0.028; q = 0.026). Bacteroides and Streptococcus salivarius were increased in Soon-Ill, while Bifidobacterium and Sutterella wadsworthensis were increased in Later-Ill and Much-Later-Ill, respectively. At D15 (Soon-Ill, n = 143; Later-Ill, n = 68; Much-Later-Ill, n = 19; controls, n = 137), P. stercorea remained depleted in Soon-Ill (β = –0.039; q = 0.002) and Much-Later-Ill (β = –0.051; q = 0.068). E. coli was again enriched in Soon-Ill (β = 0.043; q = 0.009), while Succinivibrio dextrinosolvens was depleted (β = –0.024; q = 0.095). B Age-specific microbiome associations with illness. FDR- and age-adjusted MaAsLin2 analysis showed associations concentrated in 1–2 year-olds (n = 169 [D1 Ill], 136 [D15 Ill], 46 [D85 Not-Ill]). At D1, Prevotella stercorea was depleted (β = −0.048; q = 0.012) and Paraprevotella xylaniphila was depleted (β = −0.019; q = 0.036). At D15, P. stercorea (β = −0.037; q = 0.047), P. xylaniphila (β = −0.021; q = 0.023), Escherichia coli (β = 0.058; q = 0.047) and Bacteroides (β = 0.049; q = 0.030) were significantly altered. Marked depletion of Succinivibrio dextrinosolvens was also observed (β = −0.037; q = 0.001). In 7–12-month-olds (n = 46 [D1 Ill], 43 [D15 Ill], 14 [D85 Not-Ill]), P. stercorea was similarly depleted (β = −0.052; q = 0.077). Children > 2 years (n = 125 [D1 Ill], 106 [D15 Ill], 77 [D85 Not-Ill]) showed no significant associations. Error bars represent medians ± IQR.

At D15, Prevotella stercorea remained significantly depleted in Soon-Ill (coef = −0.039, 95% CI [−0.062, −0.017], q = 0.002) and depleted in Much-Later-Ill (coef = −0.051, 95% CI [−0.096, −0.005], q = 0.068), while Escherichia coli was again significantly enriched in Soon-Ill (coef = 0.043, 95% CI [0.014, 0.071], q = 0.009). Given the short interval from D1, these results align with earlier observations. Nevertheless, the recurrence of these associations at D15 provides internal validation of the predictive signals first observed at D1 and underscores the consistency of microbial alterations linked to future illness risk.

Importantly, age-stratified analyses for both beta diversity (Fig. 5; Supplementary Table 9) and taxa (Fig. 6B; Supplementary Table 12) revealed that these associations were again concentrated in children aged 1–2 years. For beta diversity, PERMANOVA demonstrated significant compositional separation between Ill and Not-Ill groups across all illness-timing strata in this age group: Soon-Ill (F(1, 146) = 2.66, p = 0.024, R² = 0.018, 95% CI [0.011, 0.042]); Later-Ill (F(1, 89) = 3.29, p = 0.004, R² = 0.036, 95% CI [0.020, 0.080]); and Much-Later-Ill (F(1, 66) = 4.20, p = 0.004, R² = 0.060, 95% CI [0.035, 0.119]). At the taxon level, consistent depletion of Prevotella stercorea (D1 Ill 1–2 yr: coef = −0.048, 95% CI [−0.076, −0.019], q = 0.012; D15 Ill 1–2 yr: coef = −0.037, 95% CI [−0.068, −0.005], q = 0.047; D15 Ill 7–12 months: coef = −0.052, 95% CI [−0.089, −0.015], q = 0.077) and Paraprevotella xylaniphila (D1 Ill 1–2 yr: coef = −0.019, 95% CI [−0.033, −0.005], q = 0.036; D15 Ill 1–2 yr: coef = −0.021, 95% CI [−0.036, −0.007], q = 0.023) was observed at D1 and D15, alongside enrichment of Escherichia coli (D15 Ill 1–2 yr: coef = 0.058, 95% CI [0.009, 0.107], q = 0.047) and Bacteroides (D1 Ill 1–2 yr: coef = 0.044, 95% CI [0.007, 0.081], q = 0.064; D15 Ill 1–2 yr: coef = 0.049, 95% CI [0.013, 0.085], q = 0.030) and marked depletion of Succinivibrio dextrinosolvens (D15 Ill 1–2 yr: coef = −0.037, 95% CI [−0.055, −0.018], q = 0.001). In contrast, children >2 years showed no significant associations, underscoring the heightened vulnerability of the 1–2 year age group.
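
As a reading aid for the PERMANOVA statistics reported above, the following minimal Python sketch recovers R² from the pseudo-F statistic and its degrees of freedom, using the standard identity R² = (F × df_model) / (F × df_model + df_residual). The function name is illustrative and is not taken from the authors' analysis pipeline; the input values are those quoted in the text.

```python
def r2_from_pseudo_f(f_stat: float, df_model: int, df_resid: int) -> float:
    """Share of variance explained implied by a (pseudo-)F statistic.

    Follows from F = (SS_model / df_model) / (SS_resid / df_resid)
    and R^2 = SS_model / (SS_model + SS_resid).
    """
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


# (F, df_model, df_resid, reported R^2) for the 1-2 year age group, as quoted above.
reported = {
    "Soon-Ill": (2.66, 1, 146, 0.018),
    "Later-Ill": (3.29, 1, 89, 0.036),
    "Much-Later-Ill": (4.20, 1, 66, 0.060),
}

for group, (f_stat, df1, df2, r2_reported) in reported.items():
    r2 = r2_from_pseudo_f(f_stat, df1, df2)
    print(f"{group}: computed R2 = {r2:.3f}, reported R2 = {r2_reported:.3f}")
```

The computed values agree with the reported R² to three decimal places, which illustrates why the effect sizes are described as modest even where the permutation p-values are small.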

Although effect sizes were modest, the consistency of directional changes across timepoints and illness groups strengthens the biological plausibility of these associations. Notably, the depletion of Prevotella stercorea emerged as the most robust and reproducible marker of susceptibility. Its predictive relevance at D1 and D15, highlighted in both MaAsLin2 (Fig. 6A, Supplementary Fig. 7) and ANOVA-based analyses (Fig. 4A), is supported by diversity metrics (Figs. 1, 2; Fig. 5) and visualised compositional shifts (Fig. 4B), all pointing to early microbial divergence in children who subsequently develop infection.

Collectively, these findings underscore the potential of early-life microbiome profiling to forecast infection risk weeks to months in advance. Perturbations in gut microbial composition, particularly involving Prevotella stercorea, may precede and even contribute to the emergence of pathogenic illness in susceptible children.

Residual microbial alterations persist after illness

To determine whether illness leaves lasting microbiome signatures, we examined D85 stool samples from children who had experienced illness at different intervals. The ‘Recent-Ill’ group (n = 85) had recovered from an adverse event (AE) 10 days prior, while the ‘Early-Ill’ group (n = 83) had been ill shortly after D1 but remained healthy for at least 50 days before D85. Compared to Not-Ill controls (n = 137), Recent-Ill children exhibited significant microbiome compositional differences (PERMANOVA: F(3, 218) = 7.27, p = 0.001, R²  = 0.091, 95% CI [0.070, 0.149]; Kruskal–Wallis: H(1) = 5.25, p = 0.022), whereas the Early-Ill group showed weaker but still significant divergence (PERMANOVA: F(3, 216) = 5.82, p = 0.001, R²  = 0.075, 95% CI [0.065, 0.123]; Kruskal–Wallis: H(1) = 18.52, p = 3.36 × 10⁻⁵) (Supplementary Fig. 8; Supplementary Table 8). These findings indicate that while the gut microbiome gradually recovers following illness, residual compositional alterations can persist, especially shortly after recovery.

Consistent with this temporal recovery, Prevotella stercorea remained significantly depleted in the Recent-Ill group (coef = –0.039, 95% CI [–0.066, –0.012], q = 0.016), whereas its abundance in Early-Ill children was statistically indistinguishable from that of Not-Ill controls (q = 0.657) (Supplementary Table 11). This trajectory supports the hypothesis that P. stercorea is disrupted during illness but gradually restored over time.

In contrast, the pathogen Escherichia coli did not differ significantly between groups at either timepoint (q = 0.136 for Recent-Ill; q = 0.450 for Early-Ill), aligning with its expansion (in Soon-Ill) prior to illness onset. While overall community composition remained distinguishable in Early-Ill children, the absence of significant taxon-level differences points to a broader recovery process not driven by dominant taxa alone.

Prevotella stercorea abundance correlates with infectious burden and duration

Linear regression of Prevotella stercorea abundance at Day 85 (D85) revealed a significant inverse association with total infectious AEs (n = 1215). At the group level, higher P. stercorea abundance predicted fewer total events (slope = –0.37, 95% CI [–0.55, –0.20], R²  = 0.90, p = 0.004) and shorter cumulative duration (slope = –0.14, 95% CI [–0.25, –0.03], R²  = 0.24, p = 0.014). At the individual level, associations were weaker but still significant for frequency (slope = –0.31, 95% CI [–0.57, –0.04], R²  = 0.01, p = 0.025) and duration (slope = –0.07, 95% CI [–0.13, –0.02], R²  = 0.01, p = 0.014) (Fig. 7; Supplementary Table 13).

Fig. 7: Prevotella stercorea abundance inversely correlates with infectious burden.

Linear regression of log-transformed P. stercorea abundance at day 85 (D85) against total infectious adverse events (AEs) (n = 1215) and AE-specific outcomes (diarrhoeal, n = 169; respiratory, n = 383; febrile, n = 468), including both frequency and cumulative duration per child. Solid black lines indicate group-level reductions (Infectious = 31%, Diarrhoea = 80%), while dashed red lines indicate corresponding individual-level reductions (ARI = 44%, Fever = 31%), suggestive of potential herd-level protective effects. Percentage reductions were calculated as (1 − e^slope) × 100% for each unit increase in log P. stercorea. Data are shown as mean ± 95% confidence interval (CI). Full results are reported in Supplementary Table 13.
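
The percentage-reduction formula in the caption can be illustrated with a short Python sketch. It applies the stated conversion to the group-level slope for total infectious AEs reported in the text (−0.37); the function name is an illustrative assumption rather than part of the authors' code.

```python
import math


def percent_reduction(slope: float) -> float:
    """Percentage reduction per unit increase in log abundance: (1 - e^slope) * 100."""
    return (1.0 - math.exp(slope)) * 100.0


# Group-level slope for total infectious AEs, as reported in the Results.
group_slope = -0.37
print(f"{percent_reduction(group_slope):.0f}%")  # rounds to 31%
```

A slope of −0.37 therefore corresponds to the 31% group-level reduction shown for total infectious AEs.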

Parallel analyses of diarrhoeal (n = 169), ARI (n = 383) and febrile (n = 468) AEs showed consistent inverse trends, with group-level significance reflecting population-wide protection and individual-level significance indicating that children with higher P. stercorea abundance experienced fewer and shorter episodes. These results underscore the capacity of P. stercorea to modulate early-life infectious burden across multiple infection types, even at this secondary, sub-optimal timepoint and suggest potentially stronger effects in prospective studies specifically designed to capture microbiome–infection dynamics.

Age-dependent increase in Prevotella stercorea and peak illness susceptibility in toddlers

Prevotella stercorea abundance (Fig. 8A) increased significantly with age (7–12 months, 1–2 years, >2 years; one-way ANOVA F(2, 1373) = 49.41, p = 1.90 × 10⁻²¹, η² = 0.067). Post-hoc Tukey HSD comparisons confirmed that abundance was significantly higher in 1–2-year-olds versus 7–12-month-olds (diff = 1.63, 95% CI [1.03, 2.23], p = 6.94 × 10⁻¹⁰), in >2-year-olds versus 7–12-month-olds (diff = 2.53, 95% CI [1.93, 3.14], p < 1 × 10⁻²⁰) and in >2-year-olds versus 1–2-year-olds (diff = 0.90, 95% CI [0.48, 1.32], p = 1.84 × 10⁻⁶).

Fig. 8: Prevotella stercorea abundance increases with age while illness susceptibility peaks in toddlers.

A P. stercorea abundance across age groups. Boxplot of P. stercorea abundance at D85 across age groups (7–12 months, n = 88; 1–2 years, n = 282; >2 years, n = 263). Boxes show medians, interquartile ranges (IQR), minima and maxima, with whiskers extending to 1.5 × IQR; points represent biological replicates. Two-sided ANOVA showed a significant age-associated increase (F(2, 1373) = 49.41, p = 1.90 × 10⁻²¹). Post hoc Tukey HSD comparisons: 1–2 versus 7–12 months (difference = 1.63, p = 6.94 × 10⁻¹⁰); >2 versus 7–12 months (difference = 2.53, p < 1 × 10⁻²⁰); >2 versus 1–2 years (difference = 0.90, p = 1.84 × 10⁻⁶). Logistic regression testing for an interaction between P. stercorea abundance and age in predicting illness showed no significant interaction (p = 0.757). B Illness prevalence peaks in toddlers. Bar plot shows illness (AE) status peaking in the 1–2 year age group: 7–12 months (Ill, n = 59; Not-Ill, n = 29); 1–2 years (Ill, n = 201; Not-Ill, n = 80); >2 years (Ill, n = 158; Not-Ill, n = 101); χ² = 22.69, d.f. = 2, p = 1.19 × 10⁻⁵.

Logistic regression examining the association between P. stercorea abundance and illness revealed no significant interaction terms between P. stercorea and age groups (logP × 1–2 years: OR = 0.98, 95% CI [0.875, 1.097], p = 0.732; logP × >2 years: OR = 1.018, 95% CI [0.908, 1.14], p = 0.757), indicating that the relationship between P. stercorea abundance and illness was consistent across ages.

Illness patterns were age-dependent (χ²  = 22.69, df = 2, p = 1.19 × 10⁻⁵), peaking in 1–2-year-olds and lowest in children >2 years (Fig. 8B). This middle age group exhibited the most pronounced illness-associated differences in microbial diversity and taxa composition across all analyses.

Discussion

Our findings indicate that the human gut microbiome, particularly the dominance of Prevotella stercorea, may play a protective role against infections in early childhood, especially in high-burden settings such as rural sub-Saharan Africa. To interpret these associations, it is essential to consider the evolutionary context of host–microbe interactions. Enteric symbioses emerged early in vertebrate evolution12,13 and persist today across multiple mucosal sites, including the gut, oral cavity and vagina.

In both non-industrialised human populations and our closest primate relatives (Pan troglodytes), Prevotella species are dominant constituents of the gut microbiota. These taxa also characterise the intestinal microbiomes of ancient humans, including Ötzi the Iceman and pre-Columbian groups, underscoring a deep evolutionary relationship14,15. Unlike Bacteroides, which primarily resides in the colon, Prevotella occupies diverse mucosal niches, suggesting broad ecological versatility and potential immunological relevance14,16.

Our findings echo this evolutionary pattern: P. stercorea forms the hub of a microbial network inversely associated with pathobionts such as Escherichia coli and Klebsiella pneumoniae, which are linked to diarrhoeal and respiratory illnesses, respectively. This inverse relationship suggests a protective ecological configuration.

Evidence from other global cohorts indicates that our observations are not unique to The Gambia. Earlier studies in Nigeria and Burkina Faso reported genus-level enrichment of Prevotella in rural communities compared with urban or European populations, although species-level resolution was limited5,7. More recent shotgun metagenomic surveys have consistently detected P. stercorea in non-Western settings, including Latin America, South Asia and sub-Saharan Africa17,18.

Region-specific studies in the Brazilian Amazon and rural India have also identified P. stercorea within Prevotella-rich enterotype clusters, frequently co-occurring with P. copri in the context of high-fibre, subsistence diets19,20,21,22. Although P. stercorea is rarely the dominant species, its recurrent detection across rural, high-infection-incidence populations suggests that its association with infection protection in our cohort reflects a broader ecological pattern rather than an isolated finding. Importantly, whereas prior work has largely interpreted P. stercorea prevalence in dietary terms, our study is the first to examine its role primarily from an infection-centred perspective.

Recent evidence from a UK birth cohort reinforces the broader principle that early-life microbial ecology influences infection risk: infants with higher gut microbiota alpha diversity in the first week of life and Bifidobacterium longum-dominated gut profiles had lower rates of hospitalised viral lower respiratory tract infection (vLRTI) by age two, compared to those with B. breve-dominated or mixed microbial communities23. While derived from a high-income context, these results are conceptually aligned with our identification of P. stercorea as a putative protective taxon in older infants in The Gambia.

A related experimental study adds an intriguing ecological contrast. Lindell et al.24 demonstrated that gut bacteria such as Bacteroides and Odoribacter can bioaccumulate per- and polyfluoroalkyl substances (PFAS), suggesting that in sanitised, industrialised environments, microbial communities are shaped more by chemical exposures than infection risk. This highlights how dominant environmental pressures, whether pathogenic or toxic, select for distinct microbial traits, underscoring the context-specific nature of microbiome function.

These patterns suggest an evolutionary logic: in high-infection settings, natural selection favours microbiomes that enhance host survival through immune protection. Conversely, in high-income environments where infection burdens have declined, factors such as diet, sanitation, antibiotic use, and synthetic chemical exposures now dominate, driving distinct microbial architectures25,26,27. The prevalence of Prevotella in rural, pathogen-rich settings likely reflects an adaptive microbial configuration evolved to buffer against environmental pathogens.

Prevotella stercorea abundance increased with age, reflecting the transition toward adult diets and broader environmental exposure. Children aged 1–2 years were particularly vulnerable, representing a developmental window during which the gut microbiome and immune system are still co-maturing. Delays or disruptions in microbiome maturation may heighten susceptibility to infection, as microbial protective capacity is not yet fully established. This pattern aligns with observations across high-infection-burden LMICs, where children aged 12–23 months consistently exhibit the highest prevalence of diarrhoea, acute respiratory infection (ARI) and fever28,29 (Supplementary Table 14). These findings underscore that microbiome maturation, especially the establishment of a Prevotella-rich enterotype, is shaped not only by diet but also by ecological pressures, including exposure to environmental pathogens.

Supporting this, immigrant studies show that individuals moving from Prevotella-dominant to industrialised regions rapidly develop Bacteroides-dominated profiles, even without major dietary changes30. Despite substantial interest, no dietary intervention has consistently shifted individuals from a Bacteroides- to a Prevotella-dominant enterotype. Systematic reviews of diet–microbiota trials affirm the stability of core enterotype structures in the face of short-term dietary modulation31,32. These findings suggest that sustained environmental exposures, particularly to pathogens and microbial diversity, are likely critical drivers of Prevotella enrichment.

Microbial networks and immune protection

The protective role of P. stercorea likely arises through both ecological and immunomodulatory mechanisms. Emerging evidence suggests that Prevotella species can stimulate mucosal pattern recognition receptors, including epithelial Toll-like receptors, promote IL-10–producing regulatory T cells and activate Th17 pathways33,34,35. These effects are context-dependent and modulated by surrounding microbiota and host immune status.

P. stercorea may also enhance mucosal barrier integrity and inhibit pathogen colonisation via the principle of competitive exclusion, in which beneficial microbes suppress pathogen expansion by occupying shared spatial and metabolic niches36.

We propose the concept of ‘Microbiome Immunity’: the combined ecological and immunological functions of gut consortia that shape and sustain host immune protection, especially in early life. Within this framework, P. stercorea-centred communities are not passive commensals, but active agents that educate the immune system, maintain mucosal homoeostasis and enhance resilience to infection.

Systemic immune phenotypes appear to correlate with enterotype. Individuals with Prevotella-rich microbiomes exhibit distinct T cell distributions, elevated mucosal IgA responses and differentially expressed immune-regulatory cytokines34,35,37. These signatures may influence susceptibility to infection, responsiveness to vaccines38 and even risk of autoimmune and inflammatory disease39.

Our microbial network analysis supports this interpretation. P. stercorea co-occurred with beneficial taxa such as Faecalibacterium prausnitzii, a butyrate producer known to suppress NF-κB–mediated inflammation and support epithelial integrity40,41. Together, these organisms form a synergistic microbial consortium that appears to antagonise pathobiont colonisation through competitive exclusion36,42.

Collectively, our findings suggest that microbiome immunity, centred on P. stercorea–dominated consortia, constitutes a functional layer of host defence in early life. These microbial communities may promote immune homoeostasis, prime immune development and enhance resilience to infection. Future studies incorporating high-resolution longitudinal sampling and mechanistic models will be essential to evaluate causality and to assess P. stercorea as a candidate biomarker or therapeutic target.

Public health relevance

Our findings have significant implications for child health in low-resource settings. If the Prevotella stercorea–associated protection against infection is confirmed in other populations, particularly in South Asia and Latin America, it may represent a potent and modifiable microbial target for preventive or therapeutic strategies. In settings where antibiotic access is constrained and vaccine hesitancy is increasing, microbiota-based interventions could provide a complementary, sustainable and cost-effective means of infection control.

By reducing the incidence and severity of infectious illnesses through microbiome modulation, such strategies could alleviate pressure on overstretched health systems, lower healthcare costs and improve health outcomes in the most vulnerable populations. Approaches may include early-life microbial seeding, maternal microbiome optimisation, or prebiotic/probiotic formulations tailored to support P. stercorea and its beneficial partners. Furthermore, P. stercorea–like profiles could serve as biomarkers of infection resilience, helping to stratify risk in clinical trials and guide public health targeting.

Importantly, these strategies must be grounded in local ecological and immunological realities. The microbiome architecture observed in our Gambian cohort reflects adaptations to high microbial exposure, which may not be replicable, or even appropriate, in low-exposure environments. Interventions should aim to preserve or restore ecological functionality, rather than import microbiomes optimised for other contexts.

Implications for global microbiome science

Our study challenges the dominant Western-centric paradigm of microbiome science. While much of the field focuses on diet, obesity and metabolic disorders, our results highlight infection resistance and ecological resilience as core microbial functions in early life. This underscores the need for a more global and inclusive framework for microbiome research, as exemplified by recent studies in Malawi, Nigeria and elsewhere1,5,6,7.

The concept of microbiome immunity emerging from our data reframes the gut microbiota as an active immune organ, shaped by co-evolutionary pressures. Our findings underscore the urgency of these efforts and suggest that microbiome–immune interactions likely differ across ecological contexts, with important implications for health outcomes and intervention strategies.

Ultimately, a unified global microbiome framework must account for both the microbial adaptations to high-infection environments and the consequences of their disruption in industrialised settings.

Our data support two testable postulates:

  1. The early-life gut microbiome is shaped primarily by infection exposure and contributes to immune protection.

  2. In the absence of such pressures, other forces, such as diet, sanitation, antibiotic use and exposure to synthetic environmental chemicals, become dominant, generating context-specific microbial architectures.

These postulates require validation across diverse geographic and ecological settings but provide a valuable conceptual framework for interpreting global microbiome variation.

Study limitations and broader implications

While our findings provide strong associative evidence, the observational design limits causal inference in this non-optimised secondary study. Although rigorous controls were implemented, residual confounding cannot be fully excluded, and unmeasured variables, such as subclinical infections, detailed dietary intake, current breastfeeding status, socioeconomic conditions, prior illness history (e.g. intestinal protozoan infections) or domestic animal exposure, may contribute to the observed associations. Moreover, the use of 16S rRNA sequencing constrains both taxonomic and functional resolution, and the generalisability of these results to populations with different environmental exposures or healthcare systems may be limited.

Nevertheless, several measures were implemented to minimise confounding. Children with severe malnutrition were excluded at screening, and samples collected after antibiotic exposure were removed. Importantly, the trial was conducted in a relatively homogeneous rural population, and previous work by Prentice and Paul43 has demonstrated the uniformity of diet and socioeconomic conditions in rural Gambian communities, helping to further mitigate confounding.

Despite these caveats, the consistent patterns observed suggest a robust ecological and immunological role for Prevotella stercorea–dominated communities in promoting infection resilience during early life. The implications are substantial: advancing our understanding of microbial contributions to immune maturation could inform scalable interventions to reduce childhood infections, particularly in resource-limited settings. The gut microbiome may serve not merely as a passive marker of health, but as an ancient and dynamic ally in host defence, especially during the critical window of early development.

Looking further ahead, these insights may also hold relevance for extreme environments, such as long-duration spaceflight, where restricted microbial exposures could pose unforeseen risks to immune development. As we envision the future of medicine, we must not overlook our oldest allies: the microbes that have co-evolved with us, shaped our immune systems and safeguarded our health for millennia.

Methods

Participants, study design, children cohort and data collection

The stool samples used in this study were taken from the IHAT-GUT (Iron Hydroxide Adipate Tartrate) trial (NCT02941081), a three-arm, parallel, randomised, placebo-controlled, double-blind study of iron supplementation in young children with mild to moderate iron-deficiency anaemia11,44. The trial was conducted from November 2017 to November 2018 in The Gambia. The study population comprised children under the age of 3 years living in rural communities on the north bank of the Upper River Region (URR) of The Gambia in West Africa. Informed consent for a child to participate was provided by their parent before enrolment. The study area included 45 villages in the Wuli and Sandu districts, situated ~400 km east of the capital Banjul, on the north bank of the Gambia River. All villages had access to borehole tap water at central places and are typical of rural sub-Saharan Africa. A detailed description of the study design, child cohort, recruitment, screening, intervention and ethics statement is provided in the Study Protocol at Gates Open Research11. Stool samples were collected at baseline (day 1), day 15 and day 85 in OMNIgene GUT tubes, which contain a DNA-stabilising agent that allows samples to be kept at ambient temperature for 60 days. Total stool DNA was extracted from these samples using the Mo Bio PowerLyzer PowerSoil DNA Isolation Kit (Qiagen) within 6 weeks of sample collection. The trial was conducted in accordance with the ethical principles that have their origin in the Declaration of Helsinki and that are consistent with the International Conference on Harmonisation (ICH) requirements for Good Clinical Practice (GCP) and the applicable regulatory requirements. The study sponsor was the London School of Hygiene and Tropical Medicine (LSHTM), and the study was conducted at the Medical Research Council (MRC) Unit The Gambia at LSHTM (MRCG).
The URR from which the cohort was recruited has an approximate population of 200,000 and only one major town, Basse; it is otherwise typical of rural sub-Saharan Africa. The URR has the highest mortality rate of children under 5 years in the country (92 deaths per 1,000 live births), the highest percentage of severely malnourished children (7–11%), and the highest prevalence of malaria and anaemia in children under 5 years (4.5% and 82.5%, respectively)45. The 45 villages in the study area had a population of ~2800 children aged 6–35 months. Adverse event (AE) data were collected on paper case report forms and subsequently entered into Microsoft Excel (version 2016; Microsoft Corporation, Redmond, WA, USA) on Windows 10. AE data and biological samples were collected at one of five study clinical facilities: Yorrobawol Health Center, Darsilami Community Health Post, Konkuba Community Health Post, Taibatu Health Post and Chamoi Health Center. Samples were transported to the study laboratory in Basse for initial processing and analysis, and subsequently to reference laboratories in Fajara, Banjul, for further processing and advanced analyses. For this large clinical trial, the Kato-Katz method was used to collect information about helminth parasite egg count and faecal calprotectin. Very few children had helminths because there is a national programme of anti-helminthic metronidazole every 6 months in The Gambia. Children were not given anti-helminthics at the start of the study. The main trial findings showed that IHAT was non-inferior to ferrous sulphate in correcting iron deficiency anaemia (treatment response: 28.2% IHAT vs 22.1% FeSO₄; OR 1.39, 80% CI 1.01–1.91, per-protocol population) and superior in gastrointestinal tolerability, with a lower incidence density of moderate-to-severe diarrhoea (2.66 vs 3.42; RR 0.76, 80% CI 0.59–0.99).
Overall diarrhoea prevalence was similar across groups (21.2% IHAT, 23.7% FeSO₄, 20.5% placebo), and adverse event rates were comparable (67.8% IHAT, 68.9% FeSO₄, 66.8% placebo). IHAT therefore demonstrated adequate efficacy for iron repletion with improved safety relative to conventional ferrous sulphate supplementation. Detailed results are presented in the main trial publication10.

Sample sizes

The sample size was chosen so that the IHAT-GUT study was adequately powered for the first primary objective: determining whether IHAT was non-inferior to FeSO4 on the day 85 response outcome. It was assumed based on previous evidence that the proportion of children who were responders with FeSO4 at day 85 would be 0.3. The non-inferiority margin was an odds ratio of 0.583 (equivalent to a 0.1 absolute difference in response probability). Because any significant result would be tested in a subsequent pivotal (phase III) study, a 10% one-sided type I error rate was used. A sample size of 200 per arm provides 89% power to demonstrate non-inferiority when the two arms have the same response probability. The sample size of 200 per arm also provides: (1) 90% power (10% one-sided type I error rate) for testing the superiority of IHAT over FeSO4 for the prevalence of diarrhoea when prevalence is 0.15 in the IHAT arm and 0.25 in the FeSO4 arm; (2) 93% power (10% one-sided type I error rate) for testing the non-inferiority of IHAT versus placebo for diarrhoea prevalence when it is 0.15 in the IHAT and placebo arms with a 0.1 absolute non-inferiority margin; (3) 90% power (10% one-sided type I error rate) to find a reduction in the incidence density of diarrhoea in IHAT versus FeSO4 assuming 1.28 episodes per child over the 85 days in the FeSO4 arm and rate ratio of 0.8. For the secondary outcomes, the trial (n = 200 per arm) would have over 85% power to detect significant differences between all the arms in terms of enterobacteria, non-transferrin-bound iron and calprotectin. To account for an anticipated 15% non-completion rate, based on previous studies in The Gambia, the target sample size was set to 705. Because this was a phase II trial aiming to determine whether a phase III trial was warranted, no adjustment for multiple testing was made.
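
The stated equivalence between the odds-ratio margin (0.583) and a 0.1 absolute difference in response probability can be checked with a brief Python sketch, assuming the reference response proportion of 0.3 given above. The helper names are illustrative and not drawn from the trial's statistical code.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def noninferiority_odds_ratio(p_ref: float, abs_margin: float) -> float:
    """Odds ratio equivalent to an absolute reduction of `abs_margin` from `p_ref`."""
    return odds(p_ref - abs_margin) / odds(p_ref)


# Assumed FeSO4 response proportion (0.3) and absolute margin (0.1), as stated in the text.
margin_or = noninferiority_odds_ratio(0.3, 0.1)
print(round(margin_or, 3))  # 0.583
```

In other words, a response probability of 0.2 under IHAT against 0.3 under FeSO4 corresponds to odds of 0.25 versus about 0.429, giving the quoted margin.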

Randomisation

Randomisation was performed using a stratified block design to achieve group balance in terms of age (6–11 months, 12–23 months and 24–37 months) and baseline haemoglobin concentration (above and below median, calculated for each cohort separately) at pre-enrolment (day 0). Within each of the six resulting strata, children were randomly assigned to one of the three study treatment arms (1:1:1 ratio) using a computer programme written by the trial statistician, with a block randomisation approach and a fixed block size of six.
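
The allocation scheme described above can be sketched in Python as follows. This is a minimal illustration of stratified permuted-block randomisation under the stated design (three arms, 1:1:1 ratio, fixed block size of six, six strata); it is not the trial statistician's programme, and the class name, stratum labels and seed are assumptions made for the example.

```python
import random
from itertools import product

ARMS = ["IHAT", "FeSO4", "placebo"]
BLOCK_SIZE = 6  # fixed block size stated in the text
AGE_STRATA = ["6-11 months", "12-23 months", "24-37 months"]
HB_STRATA = ["below median", "above median"]


class StratifiedBlockRandomiser:
    """Assigns arms within each age x haemoglobin stratum using permuted blocks."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)  # hypothetical seed, for reproducibility only
        # One queue of not-yet-used allocations per stratum (3 x 2 = 6 strata).
        self.pending = {s: [] for s in product(AGE_STRATA, HB_STRATA)}

    def _new_block(self) -> list:
        # Each block contains every arm equally often (1:1:1), in random order.
        block = ARMS * (BLOCK_SIZE // len(ARMS))
        self.rng.shuffle(block)
        return block

    def assign(self, age_stratum: str, hb_stratum: str) -> str:
        queue = self.pending[(age_stratum, hb_stratum)]
        if not queue:
            queue.extend(self._new_block())
        return queue.pop()


randomiser = StratifiedBlockRandomiser(seed=1)
arms = [randomiser.assign("12-23 months", "below median") for _ in range(12)]
print({arm: arms.count(arm) for arm in ARMS})
```

Because allocations are drawn from complete blocks, every arm appears exactly twice per six children within a stratum, which keeps arm sizes balanced on age and baseline haemoglobin throughout enrolment.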

Sex and demographic considerations

Biological sex and age were recorded for all 633 participants at enrolment and explicitly incorporated into the clinical trial design. Female counts and percentages, as well as mean ages ( ± s.d.), are reported disaggregated by age group, study timepoint and illness status in Supplementary Table 3. Geographic location of residence was also documented, with participants distributed across five study sites: Chamoi, Darsilami, Kuwonkuba, Taibatu and Yorrobawol (Table 1). Participants of both sexes were included in all analyses, and no analyses were restricted to a single sex. Preliminary assessments revealed no significant sex differences in microbiome α- or β-diversity or in the relative abundance of major taxa. Subgroup sample sizes were insufficient to support adequately powered sex-stratified analyses; therefore, formal sex-based analyses were not conducted. Age, a critical developmental factor, was adjusted for or stratified in all statistical models. Gender-related data were not collected, and no additional demographic variables (for example, breastfeeding or socioeconomic status) were captured. Informed consent was obtained from parents or legal guardians, and participants received routine healthcare monitoring. Individual-level data are disaggregated by sex in the source files, and all analyses comply with the SAGER guidelines and Nature’s reporting standards.

Nutritional and diet information

The Gambia is a low-income country in West Africa, where food availability and nutritional status in rural areas are poor and strongly influenced by season; a chronically marginal diet is exacerbated by a ‘hungry season’ (July to September), when food stocks from the previous harvest season are depleted. Infants in rural Gambia are breastfed to 2 years of age, with fewer than half of infants being exclusively breastfed to 6 months of age as recommended by the WHO46. The first foods introduced from 3 months of age are thin gruels made from only cereal, water (occasionally cow’s milk is added), salt and sugar, which are low in energy and fat content. A thicker porridge made from rice and pounded groundnuts is sometimes given. Cow’s milk alone is given infrequently to infants <1 year of age; only 57% of infants receive it more than once a week, although it is provided often to children in the second year of life. From 6 months, infants start to share the family food bowl, the most common meals consisting of boiled rice and a sauce made from groundnuts or leaves. Dried fish may be added to sauces in very small quantities, but fresh fish is not given to infants before 9 months43.

Participant exclusion

Children with severe malnutrition were excluded from the trial (n = 88, 6% of children who were screened; z-scores for length/height-for-age, weight-for-age or weight-for-length/height of −3 s.d. or less). Mean z-scores for the included children were around −1. Samples that failed quality control in the bioinformatics pipeline were also excluded, that is, samples with low DNA yield from which no more than 1000 reads were obtained. This excluded 61 of 1466 samples. A further 16 samples from 15 children who received antibiotics were removed (15 day 85 samples and one day 15 sample), leaving 1389 samples from 633 children for detailed analysis. Therefore, not all children in the IHAT-GUT trial were included in this study.

Bacterial 16S RNA extraction

Nucleic acid extraction

Extraction of total genomic DNA was conducted on stool samples collected on visit days 1, 15 and 85, using the MO BIO Laboratories (now Qiagen) DNeasy PowerLyzer PowerSoil Kit (catalogue number: 12855-100). Each extraction was done with 24 samples (23 study samples and one reagent blank). About 250 μl of the OMNIGENE (OMNIgene•GUT | OM-200; DNA Genotek) sample mix (from a total of 2 ml of sample plus stabilising liquid mix) was aliquoted into a labelled PowerLyzer glass bead tube (0.1 mm; catalogue number: 13118-100-GBT) and then mixed gently with 750 μl of PowerSoil Bead Solution (catalogue number: 12855-100-BS). About 60 μl of solution C1 (catalogue number: 12888-100-1) was then added and the tube vortexed briefly. The samples were then homogenised for 45 s at 3000 r.p.m. using a Mo Bio PowerLyzer24 bead beater (catalogue number: 13155). About 400–500 μl of supernatant was transferred to a clean 2-ml collection tube (catalogue number: 12888-100-T) following centrifugation of the bead tubes at 10,000 × g for 30 s at room temperature. The supernatant was then subjected to several purification steps. To precipitate any non-DNA material, 250 μl of solution C2 (catalogue number: 12888-100-2) was added, the supernatant vortexed and then incubated at 4 °C for 5 min. The samples were centrifuged at room temperature for 1 min at 10,000 × g and up to 600 μl of supernatant was transferred into another clean 2-ml collection tube. About 200 μl of solution C3 (catalogue number: 12888-100-3) was then added and the sample vortexed briefly and incubated at 4 °C for 5 min. About 750 μl of supernatant was then collected into a clean 2-ml collection tube following centrifugation of the sample and solution C3 mix at room temperature for 1 min at 10,000 × g. This was followed by the addition of 1200 μl of solution C4 (catalogue number: 12888-100-4) to the supernatant, which was then vortexed for 5 s.
Using PowerSoil spin filter units in 2-ml tubes (catalogue number: 12888-100-SF), 675 μl of supernatant was loaded and filtered at 10,000 × g for 1 min at room temperature. The flow-through was discarded and the step was repeated two more times. About 500 μl of ethanol-based solution C5 (catalogue number: 12888-100-5) was aliquoted into the spin filter and centrifuged at room temperature for 30 s at 10,000 × g and the flow-through discarded. To remove any residual solution C5, the spin filter was again centrifuged at room temperature for 1 min at 10,000 × g. The spin filter was then carefully transferred into a clean 2-ml collection tube while avoiding splashing solution C5 onto the spin filter. Finally, about 110 μl of solution C6 was added to the centre of the white filter membrane before centrifugation at room temperature for 30 s at 10,000 × g. The spin filter was then discarded and the DNA solution aliquoted into two clean 2-ml collection tubes and stored at −80 °C for downstream processing. The DNA concentration was occasionally measured on random samples to assess sample concentration and purity using a NanoDrop ND-1000 UV-Vis spectrophotometer.

Bacterial 16S rRNA gene library preparation and Illumina MiSeq sequencing

The bacterial 16S rRNA V1V2 variable region of extracted DNA was amplified with Illumina adaptor and indexed PCR primers using a dual-index sequencing strategy to target the bacterial 16S rRNA gene38. Each PCR reaction was done in triplicate in a total reaction volume of 25 μl containing 200 μM deoxynucleotide triphosphates (dNTPs), 0.5 μM V1 forward primer (7f 5′-AATGATACGGCGACCACCGAGATCTACAC-XXXXXXXX-acactctttccctacacgacgctcttccgatct-NNNN-AGMGTTYGATYMTGGCTCAG-3′), 0.5 μM V2 reverse primer (r356 5′-CAAGCAGAAGACGGCATACGAGAT-XXXXXXXX-gtgactggagttcagacgtgtgctcttccgatct-NNNN-GCTGCCTCCCGTAGGAGT-3′) and 0.25 μl of Q5 Taq enzyme. Each primer consists of the Illumina adaptor, an 8-bp index sequence (8 Xs), the binding site for the Illumina sequencing primer (lower-case letters), four maximally degenerate bases (NNNN) to maximise diversity during the first four bases of the run and a PCR target sequence. Cycling conditions were as follows: denaturation at 98 °C for 2 min, followed by 30 cycles of amplification (denaturation 98 °C for 30 s, annealing 50 °C for 30 s, extension 72 °C for 90 s) and a final extension at 72 °C for 5 min. All primers were purchased from Metabion International AL AG. Triplicate PCR reactions were pooled and purified with 75 μl of Agencourt AMPure XP (catalogue number: A63881) according to Illumina’s 16S metagenomic sequencing library preparation protocol, pages 8–9 (part number: 15044223 Rev. B; https://support.illumina.com). DNA concentrations were quantified using the Invitrogen Qubit 3.0 fluorometer (catalogue number: Q33216) and Qubit double-stranded DNA HS assay kit (catalogue number: Q32854). Samples were pooled in equimolar concentrations and gel purified using the Wizard SV Gel and PCR Clean-Up System (Promega). The library size was confirmed on a QIAxcel Advanced (Qiagen) and the library was then sequenced on a MiSeq using the 600-cycle MiSeq reagent kit V3, which enables 300-bp paired-end sequencing.
The library was sequenced at the Wellcome Sanger Institute (Cambridge, UK). A total of 1546 samples, including negative (n = 45) and positive (n = 34) controls, were sequenced in 18 MiSeq libraries.

Bioinformatics and statistics

Bacterial 16S rRNA marker gene analysis

The forward and reverse fastq files of each sample were processed according to the MOTHUR MiSeq SOP with some modifications (MOTHUR wiki at http://www.mothur.org/wiki/MiSeq_SOP). The ‘make.contigs’ command was used with no extra parameters39. The assembled contigs were exported from the MOTHUR pipeline and the four degenerate bases (NNNN) present in the adaptor/primer sequences were removed using the ‘-trim_left 4’ and ‘-trim_right 4’ parameters in the PRINSEQ program40. The PRINSEQ-trimmed sequences were passed to the first ‘screen.seqs’ command to remove ambiguous sequences (maxambig = 0) and sequences containing homopolymers longer than 8 bp (maxhomop = 8). The quality-screened sequences were aligned against the Silva bacterial database ‘silva.nr_v123.align’ with the flip parameter set to true. Any sequences outside the expected alignment coordinates were removed using a further ‘screen.seqs’ command, with the alignment coordinates set by ‘optimize = start-end, criteria = 90’. In addition, any sequences longer than 400 bp were removed with ‘maxlength = 400’. The correctly aligned sequences were filtered using the ‘filter.seqs’ command with ‘vertical = T’ and ‘trump = .’. The filtered sequences were de-noised by allowing three mismatches in the ‘pre.cluster’ step, and chimeras were removed using Uchime with the dereplicate option set to ‘true’. The chimera-free sequences were classified using the Silva reference database ‘silva.nr_v123.align’ and the Silva taxonomy database ‘silva.nr_v123.tax’ with a cut-off value of 80%. Chloroplast, Mitochondria, unknown, Archaea and Eukaryota sequences were removed. The high-quality, chimera-free and correctly classified sequences were normalised using the ‘sub.sample’ command. Each sample was normalised to 20,000 reads. This removed 94 samples with fewer than 20,000 reads per sample from a total of 1546 samples. Thirty-eight of these 94 samples, with between 1000 and 20,000 reads per sample, were added back to the dataset.
One mislabelled sample (negative control outlier) was also removed from the dataset, leaving 1489 samples available for oligotyping.
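The depth-normalisation step can be illustrated outside MOTHUR. The base-R sketch below is not the MOTHUR ‘sub.sample’ implementation; it only shows the general idea of drawing a fixed number of reads without replacement from each sample and dropping samples below the target depth. The toy matrix, the depth of 500 and the function name subsample_counts are hypothetical.

```r
# Toy taxon-by-sample count matrix (hypothetical values, not study data).
# Columns are filled in order: S1 = 1000 reads, S2 = 100 reads, S3 = 1000 reads.
set.seed(1)
counts <- matrix(c(500, 300, 200,
                   50,  30,  20,
                   900, 50,  50),
                 nrow = 3,
                 dimnames = list(c("taxonA", "taxonB", "taxonC"),
                                 c("S1", "S2", "S3")))

# Subsample every sample to `depth` reads without replacement;
# samples with fewer than `depth` reads are dropped.
subsample_counts <- function(counts, depth) {
  keep <- colSums(counts) >= depth
  out <- sapply(colnames(counts)[keep], function(s) {
    reads <- rep(rownames(counts), times = counts[, s])  # one entry per read
    drawn <- sample(reads, size = depth, replace = FALSE)
    as.vector(table(factor(drawn, levels = rownames(counts))))
  })
  rownames(out) <- rownames(counts)
  out
}

rarefied <- subsample_counts(counts, depth = 500)
colSums(rarefied)  # each retained sample now holds exactly `depth` reads
```

In this toy case sample S2 is dropped because it has fewer reads than the target depth, which mirrors how low-depth samples fall out of a fixed-depth normalisation.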

Oligotyping and taxa identification

Oligotyping was used for clustering the high-quality filtered fasta sequences from the MOTHUR pipeline. Oligotyping is a computational method for resolving closely related but distinct bacterial organisms within the final operational taxonomic units that canonical approaches identify from 16S rRNA gene data in environmental data sets. For oligotyping we used the ‘Minimum Entropy Decomposition’ (MED) option for sensitive partitioning of high-throughput marker gene sequences from the oligotyping pipeline41. The normalised high-quality fasta and name files from MOTHUR were renamed by appending the group name to the sequence name, using the ‘rename.seqs’ command. A redundant renamed fasta file was then generated using the ‘deunique.seqs’ command, which creates a redundant fasta file from a fasta and name file. The redundant fasta file was subsequently used for oligotyping using the unsupervised MED. The command line was ‘decompose <fasta.file> -g -t - -M 100 -V 2’. The -t parameter was set to a dash (‘-’), the character used in the MOTHUR ‘rename.seqs’ command to separate the sample name from the unique information in the defline of each sequence. The -M integer defines the minimum substantive abundance of an oligotype and the -V integer defines the maximum variation allowed in each node. These MED settings generated 10,152 oligotypes. The node representative sequence of each oligotype was used for species profiling using the ARB programme (v.5.5-org-9167)42. For ARB analysis we used a customised version of the SILVA SSU Ref database (NR99, release 123) that was generated by removing environmental and uncultured taxa. ARB-generated short species abbreviations were then correlated with the full taxonomic path from species to phyla.
The 10,152 redundant ARB species were then consolidated to 524 non-redundant species, which were present in the 1410 samples with a minimum substantive abundance of an oligotype per node of 100 (-M setting from above). Consolidation was performed using the ‘Consolidate’ option in Excel (Excel for Mac v.16.16.14; Microsoft). In cases in which a species could not be classified, we reported the genus name and, in a few cases, the family name. For some obviously beneficial or pathogenic genera, we combined all species within the same genus; for example, for the purposes of ecological functionality, pattern recognition and visualisation of associations, we combined all Bifidobacterium species together, and the same was done for all Bacteroides species.
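The consolidation itself was done in Excel, but the operation amounts to summing the counts of all oligotypes that share a taxonomic label. A minimal base-R sketch of that idea, using hypothetical oligotype names, labels and counts, might look as follows.

```r
# Hypothetical oligotype-by-sample counts (not study data)
oligo_counts <- matrix(c(10, 0, 5,
                          3, 7, 0,
                          1, 1, 1,
                          0, 4, 6),
                       ncol = 3, byrow = TRUE,
                       dimnames = list(paste0("oligo", 1:4),
                                       c("S1", "S2", "S3")))

# Species label assigned to each oligotype (illustrative labels only)
species <- c("Prevotella stercorea", "Prevotella stercorea",
             "Faecalibacterium prausnitzii", "Bifidobacterium longum")

# Sum the counts of all oligotypes that share a species label
species_counts <- rowsum(oligo_counts, group = species)

# Optional further collapse to genus level (first word of the label)
genus <- sub(" .*", "", rownames(species_counts))
genus_counts <- rowsum(species_counts, group = genus)
```

Because rowsum() only adds rows together, the total read count per sample is unchanged by consolidation, which is a convenient sanity check.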

Statistical analysis used in this study

Study data and preprocessing

Microbiome data and clinical metadata were derived from the IHAT-GUT clinical trial stool samples collected at Days 1 (D1), 15 (D15) and 85 (D85). The primary datasets used in the current analysis were Merged_Illness_Cohorts.csv, containing microbiome taxonomic counts and subject metadata, and Adverse Events.csv, providing detailed illness event records. Microbiome count data (columns 13–500) were joined with illness metadata using unique child identifiers. Taxa with a minimum relative abundance <0.2% across all samples were excluded to reduce noise. Unless otherwise specified, count data were log-transformed (log₁₀(x + 1)) or variance-stabilised as required. Model residuals were checked for normality and heteroscedasticity to validate statistical assumptions.
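The preprocessing steps can be sketched in base R on a toy matrix. This is an illustration rather than the study script: the counts are hypothetical, and the abundance threshold is assumed here to apply to the mean relative abundance across samples, which is one possible reading of the filter.

```r
# Toy sample-by-taxon count matrix (hypothetical values; each row sums to 1000)
counts <- matrix(c(984,  15, 1,
                   700, 299, 1,
                   500, 499, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S3"),
                                 c("taxonA", "taxonB", "taxonC")))

# Total sum scaling: divide each sample (row) by its total count
rel_abund <- counts / rowSums(counts)

# Assumed filter: keep taxa with mean relative abundance of at least 0.2%
keep <- colMeans(rel_abund) >= 0.002
filtered <- counts[, keep, drop = FALSE]

# log10(x + 1) transform of the retained counts
log_counts <- log10(filtered + 1)
```

Here taxonC has a mean relative abundance of 0.1% and is therefore removed before transformation.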

Iron treatment and alpha diversity

Alpha diversity metrics (Observed richness, Shannon, Simpson, Fisher’s alpha and Pielou’s evenness) were calculated using the phyloseq and vegan R packages. Pielou’s evenness was computed as Shannon diversity divided by the natural logarithm of observed richness. Linear mixed-effects models (LMMs) were fitted using the lmer function (lme4 package) to assess the effects of illness status, iron treatment, their interaction and age group on alpha diversity, with participant ID included as a random effect to account for repeated measures:

DiversityMetric ~ Ill_Status * Iron_Treatment + Age_Group + (1 | Randomisation_No)

Non-significant interaction terms (adjusted p > 0.6) justified exclusion of iron treatment from further alpha diversity models.

model <- lmer(DiversityMetric ~ Ill_Status * Iron_Treatment + Age_Group + (1 | Randomisation_No), data = alpha_div_data)
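The diversity metrics entering these models were obtained from phyloseq and vegan; for orientation, several of them can be written out directly from their definitions. The base-R sketch below uses a single hypothetical sample and covers observed richness, Shannon, Simpson (in its 1 − Σp² form) and Pielou's evenness as defined above; Fisher's alpha is not shown.

```r
# Hypothetical counts for one sample (not study data)
x <- c(taxonA = 50, taxonB = 30, taxonC = 20, taxonD = 0)

p <- x[x > 0] / sum(x)               # relative abundances of observed taxa
observed <- length(p)                # observed richness
shannon  <- -sum(p * log(p))         # Shannon index (natural logarithm)
simpson  <- 1 - sum(p^2)             # Simpson index, 1 - sum(p^2)
pielou   <- shannon / log(observed)  # Pielou's evenness = Shannon / ln(richness)
```

Taxa with zero counts are dropped before taking logarithms, so they contribute neither to richness nor to the Shannon sum.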

Iron treatment and beta diversity

Bray–Curtis dissimilarities were calculated from OTU counts using the vegdist function (vegan package). Community composition differences were tested using PERMANOVA (adonis2), with illness status, iron treatment, their interaction and age group as predictors:

bray_dist ~ Ill_Status * Iron_Treatment + Age_Group

The interaction term was non-significant (p = 0.92), indicating no joint effect of iron treatment and illness status on beta diversity. Principal Coordinates Analysis (PCoA) was used to visualise sample clustering, adjusted for age.

adonis2(bray_dist ~ Ill_Status * Iron_Treatment + Age_Group, data = analysis_data, permutations = 999)
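For readers unfamiliar with the dissimilarity underlying this test, Bray–Curtis between two samples is the sum of absolute count differences divided by the sum of all counts in both samples. The following base-R sketch computes it for a hypothetical three-sample matrix; it is meant as a definition-level illustration and is not a replacement for vegdist.

```r
# Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Hypothetical sample-by-taxon counts (not study data)
counts <- rbind(S1 = c(10, 0, 5),
                S2 = c(5, 5, 5),
                S3 = c(0, 10, 5))

# Fill the full pairwise matrix, then convert to a 'dist' object
n <- nrow(counts)
bc <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray_curtis(counts[i, ], counts[j, ])
  }
}
bray_dist <- as.dist(bc)
```

In this toy example S1 and S2 differ by 10 counts out of 30 in total, giving a dissimilarity of one third, whereas S1 and S3 differ by 20 out of 30.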

Iron treatment and differential abundance analysis (MaAsLin2)

Differential abundance testing was conducted using the MaAsLin2 package (v1.10.0) to evaluate associations between microbiome taxa abundances and clinical variables, including iron treatment group (Iron: treatment vs placebo), illness status (Ill_Status: Ill vs Not-Ill), and age group (X3agegroups.based.on.age.at.sampling: 7–12 months, 1–2 years, >2 years).

The multivariable linear models in MaAsLin2 explicitly adjusted for potential confounders by specifying a fixed effects formula that included the variables of interest and age group as a covariate:

library(Maaslin2)
# Argument names use the spellings expected by MaAsLin2 (normalization, standardize)
fit <- Maaslin2(
  input_data = taxa_for_maaslin,
  input_metadata = meta_for_maaslin,
  output = "maaslin2_output",
  fixed_effects = c("Iron", "Ill_Status", "X3agegroups.based.on.age.at.sampling", "timepoints"),
  normalization = "TSS",
  transform = "LOG",
  analysis_method = "LM",
  standardize = TRUE,
  correction = "BH")

Here, taxa_for_maaslin is the relative abundance matrix of the top 50 taxa, and meta_for_maaslin contains metadata variables for each sample. This approach ensured robust identification of taxa whose abundances were significantly associated with treatment and illness status while controlling for age and sampling timepoint. Statistical significance was assessed by Benjamini-Hochberg false discovery rate (FDR) adjusted p values, with significance threshold set at q < 0.25 following MaAsLin2 defaults.
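The effect of the Benjamini–Hochberg adjustment and of the chosen threshold can be seen on a small set of hypothetical p values using base R's p.adjust(); the values below are invented for illustration only.

```r
# Hypothetical raw p values for five taxa (not study results)
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjusted values (q values)
q <- p.adjust(p_raw, method = "BH")

# Number of taxa passing the exploratory and the stricter threshold
n_q25 <- sum(q < 0.25)
n_q05 <- sum(q < 0.05)
```

With these inputs the adjusted values are 0.005, 0.02, 0.05125, 0.05125 and 0.6, so four taxa pass q < 0.25 but only two pass the stricter q < 0.05 used elsewhere in these Methods.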

Iron treatment and illness risk

A generalised linear model (GLM) with a binomial distribution and logit link was implemented (glm(family = binomial(link = "logit"))) to assess the association between iron treatment (iron vs placebo) and illness status (any infection during the 113-day follow-up). Odds ratios were computed using exp(coef(model)) and 95% confidence intervals via confint(). Model fit was evaluated using residual deviance and Akaike Information Criterion (AIC).
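A minimal sketch of this model on simulated data is given below. The data frame sim, its column names and the simulated illness probability are hypothetical; the confidence limits from confint() are exponentiated here so that they sit on the odds-ratio scale alongside exp(coef(model)).

```r
# Simulated data (hypothetical; not study data)
set.seed(42)
n <- 200
sim <- data.frame(
  Iron = factor(sample(c("placebo", "iron"), n, replace = TRUE),
                levels = c("placebo", "iron")),  # placebo is the reference level
  Ill  = rbinom(n, size = 1, prob = 0.4)         # 1 = any infection during follow-up
)

# Logistic regression of illness status on treatment arm
model <- glm(Ill ~ Iron, data = sim, family = binomial(link = "logit"))

# Odds ratios and 95% confidence intervals on the odds-ratio scale
odds_ratios <- exp(coef(model))
ci <- exp(suppressMessages(confint(model)))

# Fit summaries mentioned in the text
fit_stats <- c(residual_deviance = deviance(model), AIC = AIC(model))
```

The second coefficient is the odds ratio for iron relative to placebo; because illness is simulated independently of treatment here, its interval would be expected to include 1 in most runs.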

Differential abundance analysis (MaAsLin2): age-adjusted pairwise, predictive and residual effects, and age-stratified analyses

Differential abundance analyses were conducted using microbial and clinical metadata from Merged_Illness_Cohorts.csv and Adverse_Events.csv, linking taxonomic counts (columns 13–500) by child ID. Taxa with ≥1% relative abundance across all samples were retained. Counts were normalised using total sum scaling (TSS) to adjust for sequencing depth.

To preserve biological interpretability, we omitted MaAsLin2’s default log transformation. Although log transformation improved normality, it attenuated robust associations observed in prior ANOVA models. TSS without log transformation maintained meaningful fold changes and aligned with independent statistical evidence, an approach which is supported by prior work and recommendations from the MaAsLin2 developers47,48.

Analyses were performed in MaAsLin2 (v1.10.0, R v4.5.0), including illness status (Ill/Not-Ill or illness timing) and age group as fixed effects:

# Argument names use the spellings expected by MaAsLin2 (normalization)
Maaslin2(
  input_data = counts,
  input_metadata = meta,
  output = "output_directory",
  fixed_effects = c("group", "age_group"),
  normalization = "TSS",
  transform = "NONE",
  analysis_method = "LM",
  correction = "BH",
  max_significance = 0.05)

Analyses were structured as follows:

  • Pairwise comparisons: differences in taxonomic abundance were tested between:

    • Timepoints (D1 vs D15, D15 vs D85, D1 vs D85) within each illness group (Ill, Not-Ill);

    • Ill vs Not-Ill groups at each timepoint (D1, D15, D85);

    • Cross-timepoint illness contrasts (e.g. D1 Ill vs D85 Not-Ill, D15 Ill vs D85 Not-Ill).

  • Forward prediction analyses: D1 and D15 samples were used to identify taxa predictive of illness onset. Children were grouped by time to first infection: ≤35 days (‘Soon-Ill’), 36–70 days (‘Later-Ill’), or >70 days (‘Much-Later-Ill’) post-sampling, and compared to Not-Ill controls.

  • Age-stratified analyses: the D1 Ill vs D85 Not-Ill and D15 Ill vs D85 Not-Ill comparisons were repeated within each age group (7–12 months, 1–2 years, >2 years) using only SampleGroup as a fixed effect to assess developmental window–specific patterns.

  • Residual effects analysis: D85 samples were analysed to detect taxa differences among children who had previously experienced infection ≤35 days (‘Early-Ill’) or 40–75 days (‘Recent-Ill’) after D1 sampling (excluding those with later illness), relative to D85 Not-Ill controls.
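The grouping used in the forward prediction analyses can be expressed compactly with cut(). The sketch below assumes that time to first infection is recorded in whole days after sampling and that children with no recorded infection are coded as NA; the vector days_to_first_infection is hypothetical.

```r
# Hypothetical days from sampling to first infection; NA = no infection recorded
days_to_first_infection <- c(10, 35, 36, 70, 71, 100, NA)

# Right-closed intervals: (-Inf, 35], (35, 70], (70, Inf)
group <- cut(days_to_first_infection,
             breaks = c(-Inf, 35, 70, Inf),
             labels = c("Soon-Ill", "Later-Ill", "Much-Later-Ill"),
             right = TRUE)

# Children without any infection become the Not-Ill reference group
group <- as.character(group)
group[is.na(days_to_first_infection)] <- "Not-Ill"
group <- factor(group,
                levels = c("Not-Ill", "Soon-Ill", "Later-Ill", "Much-Later-Ill"))
```

Placing Not-Ill first in the factor levels makes it the reference category in any downstream linear model.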

Significance was determined at a false discovery rate (FDR) of q < 0.05. Significant taxa were visualised using ggplot2 (v3.4.2) bar plots, coloured by fold-change direction and saved as vector graphics using ggsave(…, device = cairo_pdf).

Pairwise age-adjusted and age-stratified alpha diversity analyses (GLM and mixed-effects models, FDR-adjusted)

Alpha diversity was calculated at the genus level using unrarefied count data from Merged_Illness_Cohorts.csv (phyloseq v1.44.0; R v4.5.0). Metrics included Observed richness, Shannon index, Simpson index, Pielou’s evenness and Fisher’s alpha. Calculations were performed without rarefaction to avoid biases from count normalisation, following best-practice recommendations49.

Pairwise comparisons:

Differences in alpha diversity were evaluated across multiple comparisons:

  • Timepoints within each illness group (D1 vs D15, D15 vs D85, D1 vs D85);

  • Ill vs Not-Ill groups at each timepoint (D1, D15, D85);

  • Cross-timepoint illness comparisons (e.g. D1 Ill vs D85 Not-Ill, D15 Ill vs D85 Not-Ill).

For same-timepoint comparisons, generalised linear models (GLMs) were fitted for each alpha diversity metric with Ill_Status as the main predictor and age group as a covariate:

glm(metric ~ Ill_Status + X3agegroups.based.on.age.at.sampling,
    data = comparison_data,
    family = gaussian())

For cross-timepoint comparisons, mixed-effects models were applied using SubjectID as a random effect to account for repeated measures, with the comparison group as the main predictor and age group as a covariate:

lmer(metric ~ ComparisonGroup + X3agegroups.based.on.age.at.sampling + (1|SubjectID),
     data = comparison_data, REML = FALSE)

Significance was evaluated using FDR-adjusted p values (Benjamini–Hochberg), with FDR < 0.05 considered significant. Alpha diversity distributions by group and timepoint were visualised as boxplots using ggplot2 (v3.4.2), annotated with FDR-adjusted p values, and exported as PDFs using gridExtra and ggsave().

Age-stratified analyses:

Selected comparisons (D1 Ill vs D1 Not-Ill, D15 Ill vs D15 Not-Ill, D85 Ill vs D85 Not-Ill, D1 Ill vs D85 Not-Ill, D15 Ill vs D85 Not-Ill) were repeated within each age group (7–12 months, 1–2 years, >2 years). For same-timepoint age-stratified comparisons, GLMs included only the comparison group as the predictor. For cross-timepoint age-stratified comparisons, mixed-effects models with SubjectID as a random effect were used. FDR correction was applied separately within each age-stratified set. Boxplots were faceted by age group to illustrate developmental window–specific alpha diversity patterns.

Beta diversity analysis: pairwise, predictive and residual effects (age-adjusted, with age-stratified sub-analyses)

Beta diversity was assessed using microbiome and clinical metadata from Merged_Illness_Cohorts.csv and Adverse Events.csv, with taxonomic counts (columns 13–500) joined by unique child identifiers. Counts were normalised to relative abundances (transform_sample_counts, phyloseq) to ensure Bray–Curtis dissimilarity reflected compositional differences rather than sequencing depth. Community composition differences were tested using PERMANOVA (adonis2, vegan v2.6-4, R v4.5.0), including the primary grouping variable (e.g. Ill/Not-Ill status, timepoint, or illness timing) and age group as a covariate. For overall (age-adjusted) PERMANOVA, paired cross-timepoint analyses accounted for repeated measures by restricting permutations within participant ID via the strata option.

95% confidence intervals for R² effect sizes were estimated via non-parametric bootstrap (199 replicates), resampling sample labels with replacement and recalculating PERMANOVA R². Age-stratified analyses used the same approach to detect age-dependent patterns. Within-group variability was assessed as the distance to each group centroid (betadisper), and differences were tested by Kruskal–Wallis or permutation tests.

To account for multiple comparisons, p values from PERMANOVA, age-stratified PERMANOVA and Kruskal–Wallis tests were adjusted using the Benjamini–Hochberg procedure to control the false discovery rate (FDR); significance was defined as FDR-adjusted p < 0.05.

Analyses included pairwise comparisons, predictive analyses (D1/D15 microbiota predicting infection onset: ≤35 days ‘Soon-Ill’, 36–70 days ‘Later-Ill’, >70 days ‘Much-Later-Ill’) and residual effects (D85 microbiota after early illness: ≤35 days ‘Early-Ill’ or 40–75 days ‘Recent-Ill’). Principal coordinates analysis (PCoA) plots and centroid boxplots were generated in ggplot2 (v3.4.2) and saved as publication-quality vector graphics (cairo_pdf).

Microbial network analysis and visualisation

Raw count data from stool samples were transformed into relative abundances by dividing each taxon count by the total counts per sample (TSS). Taxa with mean relative abundance ≥0.2% across all samples were selected, yielding the top 50 most abundant taxa for analysis. Log-transformed (log1p) relative abundances were used to compute pairwise Spearman correlation matrices using the cor function in R.

To identify clusters of co-varying taxa, a dissimilarity matrix was constructed as 1 − Spearman correlation, followed by Partitioning Around Medoids (PAM) clustering. The optimal number of clusters was determined using the gap statistic (clusGap, cluster v2.1.6). Cluster assignments were visualised with consistent colour-coding.
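The clustering step can be sketched on simulated data as follows. The matrix abund and the taxon names are hypothetical, and the number of clusters is fixed at two purely for illustration, whereas the study selected it with the gap statistic.

```r
library(cluster)  # provides pam(); one of R's recommended packages

# Simulated sample-by-taxon values: taxa 1-2 share one driver, taxa 3-4 another
set.seed(7)
driver1 <- rnorm(30)
driver2 <- rnorm(30)
abund <- cbind(taxon1 = driver1 + rnorm(30, sd = 0.1),
               taxon2 = driver1 + rnorm(30, sd = 0.1),
               taxon3 = driver2 + rnorm(30, sd = 0.1),
               taxon4 = driver2 + rnorm(30, sd = 0.1))

# Pairwise Spearman correlations between taxa, converted to a dissimilarity
rho  <- cor(abund, method = "spearman")
diss <- as.dist(1 - rho)

# Partitioning Around Medoids on the dissimilarity (k = 2 assumed here)
fit <- pam(diss, k = 2)
fit$clustering  # cluster label per taxon
```

Strongly co-varying taxa have a dissimilarity near 0 and anti-correlated taxa a dissimilarity near 2, so PAM groups taxa whose abundances rise and fall together across samples.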

For visualisation, a heatmap of the Spearman correlation matrix was generated using the ComplexHeatmap R package (v2.14), with a blue-to-red colour gradient representing negative to positive correlations. Taxon names were shortened to genus-species format, with key taxa of interest (e.g. Prevotella stercorea, Escherichia coli) annotated with asterisks.

Relative abundances of the top 50 taxa were calculated separately for Ill and Not-Ill groups. These group-specific means were visualised as barplot annotations along the heatmap axes, scaled by log1p of mean abundance and colour-coded by cluster assignment.

To highlight key microbial groups, rectangular annotations were overlaid on the heatmap, indicating clusters centred around Prevotella stercorea and Escherichia coli. Cluster colours were derived from a palette of unique hues for optimal interpretability.

All analyses and visualisations were conducted in R (version 4.5.0) using the following packages: readr, dplyr, ComplexHeatmap, circlize, cluster, factoextra and grid.

Taxa-by-group ANOVA models with age adjustment

To investigate temporal and illness-related differences in the most prevalent taxa, we analysed the top 50 most abundant taxa (≥1% mean relative abundance across all samples), pre-selected and ordered in columns 13–62, with abundance percentages indicated in the taxa names. Counts were normalised to relative abundances using total-sum scaling (TSS).

For each taxon, a linear model was fit using the formula:

relative_abundance ~ Ill_Status * timepoints + age_group

where age_group corresponds to three categories based on age at sampling (7–12 months, 1–2 years, and greater than 2 years), timepoints includes study days (D1, D15, D85), and Ill_Status indicates illness status (Ill vs Not-Ill).

P values for main effects and interaction terms were extracted using the broom package (v1.0.5). Multiple testing correction was performed across all taxa and model terms using the Benjamini–Hochberg false discovery rate method, with significance defined as adjusted p < 0.05. Partial eta-squared (ηp²) for the Ill_Status effect was computed from ANOVA sums of squares, and 95% confidence intervals were estimated from F-statistics using non-parametric bootstrapping. Age-adjusted slopes with 95% confidence intervals and R² were also reported.
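Partial eta-squared for a term is its sum of squares divided by the sum of that term's and the residual sums of squares. The sketch below illustrates the calculation on simulated data with a simplified model (no timepoint interaction) and assumes the sequential sums of squares returned by anova(); the data frame d and its values are hypothetical.

```r
# Simulated data for one taxon (hypothetical; not study data)
set.seed(3)
d <- data.frame(
  Ill_Status = factor(rep(c("Ill", "Not-Ill"), each = 30)),
  age_group  = factor(rep(c("7-12m", "1-2y", ">2y"), times = 20))
)
d$relative_abundance <- 0.2 + 0.05 * (d$Ill_Status == "Ill") +
  rnorm(60, sd = 0.05)

# Simplified age-adjusted linear model
fit <- lm(relative_abundance ~ Ill_Status + age_group, data = d)
tab <- anova(fit)  # sequential (Type I) sums of squares

# Partial eta-squared for Ill_Status from the sums of squares
ss_effect <- tab["Ill_Status", "Sum Sq"]
ss_resid  <- tab["Residuals", "Sum Sq"]
eta_p2 <- ss_effect / (ss_effect + ss_resid)

# Equivalent form based on the F statistic and degrees of freedom
f   <- tab["Ill_Status", "F value"]
df1 <- tab["Ill_Status", "Df"]
df2 <- tab["Residuals", "Df"]
eta_p2_from_f <- (f * df1) / (f * df1 + df2)
```

The two expressions are algebraically identical, which is why confidence intervals for partial eta-squared can be derived from the F statistic.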

Boxplots were created with ggplot2 (v3.5.2), displaying medians, interquartile ranges (IQR) and whiskers extending to 1.5 × IQR. Extreme values were omitted from visualisations for clarity but retained in all statistical analyses.

Longitudinal taxon abundance visualisation

Taxa count data were normalised using Total Sum Scaling (TSS). The top 50 most abundant taxa across all samples were selected for visualisation. Data were reshaped to long format using reshape2::melt(), and mean ± SEM relative abundance values were calculated for each taxon at each timepoint and Ill_Status. Faceted line plots were generated in ggplot2 with points, connecting lines and error bars, and arranged using cowplot. Taxon names were formatted for italicised labels. Figures were exported as high-resolution TIFF, vector-format PDF and EMF files using ggsave() and the Cairo graphics device.

Correlation with adverse event (AE) metrics

Linear regression was used to assess associations between mean D85 log-transformed Prevotella stercorea abundance and both (a) frequency and (b) duration of adverse events (AEs) across multiple clinical categories, including infectious AEs, acute respiratory infections (ARI), diarrhoea and fever. AE frequency was defined as the number of episodes per child, extracted from the corresponding AE indicator columns in Adverse Events.csv (‘AE Infection’, ‘AE ARI’, ‘AE Diarrhoea’, ‘AE Fever’), while AE duration was calculated as the cumulative number of days per child, derived from differences between ‘Start date’ and ‘End date’ for each episode (Supplementary Table 12). Analyses included all available AE durations without truncation, with sensitivity checks using truncated ranges (e.g. ≤30 days) to explore extreme values.

Regression models were fit separately for individual-level data (each child’s log-transformed P. stercorea abundance versus their AE metric) and for group-level means (mean log-transformed abundance per AE frequency or duration category). Models were fit using lm() in R (v4.5.0), with robust alternatives (rlm() from MASS v7.3-60) evaluated in sensitivity analyses. Regression coefficients, p values, and 95% confidence intervals (CIs) were extracted using broom::tidy().

Visualisations were generated using ggplot2 (v3.4.4), with scatterplots displaying fitted linear trends for both individual-level and group-level analyses, including shaded confidence intervals. Annotations on the plots summarise the slope, p value, and 95% CI for each regression. Frequency and duration plots for each AE type were arranged side by side using patchwork to facilitate direct comparison.

Age-group comparisons

To evaluate age-related differences in Prevotella stercorea abundance, counts were log-transformed [log(count + 1)] and compared across age categories (7–12 months, 1–2 years, >2 years) using one-way ANOVA. Post-hoc pairwise differences between age groups were assessed using Tukey’s HSD test and annotated. Potential interactions between P. stercorea abundance and age group in relation to illness status were evaluated with logistic regression including a logP × age_group interaction term. Stratified illness counts were tested using Pearson’s Chi-squared test (3 × 2 age × illness contingency table; vcd v1.4-11). Figures display mean log-transformed abundance with age-stratified pairwise comparisons annotated with FDR-adjusted p values, and illness distributions by age group.

Within-individual microbiome stability

For children with both Day 1 (D1) and Day 85 (D85) stool samples, microbiome counts (columns 13–500) were normalised to relative abundances using phyloseq. Bray–Curtis distances were calculated with the vegan package, and similarity was defined as 1 − distance. For each subject, D1–D85 similarity was extracted and summarised by age group and illness status. Differences between Ill and Not-Ill children were tested using Wilcoxon rank-sum tests with FDR-adjusted p values. Boxplots displaying similarity distributions and FDR q values were generated using ggplot2.
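When both profiles are relative abundances summing to 1, the Bray–Curtis similarity (1 − distance) reduces to the sum of the taxon-wise minima, that is, the fraction of the community shared between the two timepoints. The base-R sketch below checks this on two hypothetical children; the row names and values are invented for illustration.

```r
# Hypothetical relative-abundance profiles (each row sums to 1)
rel <- rbind(child1_D1  = c(0.6, 0.3, 0.1),
             child1_D85 = c(0.5, 0.4, 0.1),
             child2_D1  = c(0.8, 0.1, 0.1),
             child2_D85 = c(0.2, 0.7, 0.1))

children <- c("child1", "child2")

# Similarity as 1 - Bray-Curtis distance
similarity <- sapply(children, function(id) {
  a <- rel[paste0(id, "_D1"), ]
  b <- rel[paste0(id, "_D85"), ]
  1 - sum(abs(a - b)) / sum(a + b)
})

# Same quantity as the shared fraction of the two communities
shared <- sapply(children, function(id) {
  sum(pmin(rel[paste0(id, "_D1"), ], rel[paste0(id, "_D85"), ]))
})
```

In this toy example the first child retains 90% of its community between D1 and D85 and the second only 40%, so lower values indicate a less stable microbiome.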

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from analyses. The analyses were not randomised, and investigators were not blinded to allocation during data analysis. Statistical analyses were conducted using parametric and nonparametric models as appropriate. Effect sizes, confidence intervals and exact p values are reported throughout. Robustness checks were performed where relevant, including sensitivity analyses comparing ordinary least squares (lm) and robust regression (rlm, MASS v7.3-60) models for associations between Prevotella stercorea abundance and adverse event metrics. Analyses were performed on datasets prepared in Microsoft Excel (version 2016; Microsoft Corporation, Redmond, WA, USA) using R (version 4.5.0) within RStudio (version 2025.09.1, ‘Cucumberleaf Sunflower’; Posit Software, PBC) with Quarto (version 1.7.32) on Windows 10 for reproducible reporting and figure generation, and final panel assembly and formatting were conducted using Inkscape (version 1.4.2).

All analysis scripts and workflows are publicly available (see “Code availability”).

Ethics and consent

This study complied with all relevant ethical regulations for human research. The study protocol was reviewed and approved by The Gambia Government/MRC Joint Ethics Committee (reference SCC1489), with Clinical Trials Authorisation granted by the Medicines Control Agency, The Gambia (HP373/347/16/MJK(80)). The trial was conducted in accordance with the principles of the Declaration of Helsinki and International Conference on Harmonisation (ICH) guidelines for Good Clinical Practice. Written informed consent was obtained from all parents or legal guardians for their child’s participation, including consent for use of data and biological samples for the present analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.