Main

Bifidobacteria are Gram-positive, saccharolytic microorganisms that predominantly inhabit animal gastrointestinal tracts1. Multiple Bifidobacterium species colonize the human gut throughout life, with dietary carbohydrate intake playing a key role in shaping this process2. Breastfeeding fosters the dominance of specific Bifidobacterium taxa within the neonatal gut microbiota owing to their evolutionary adaptation to metabolize human milk oligosaccharides (HMOs)3,4,5. Weaning drives a gradual succession of bifidobacterial taxa from those tuned for HMO utilization to those more adapted to foraging dietary glycans (oligo- and polysaccharides) of plant origin6,7,8.

Geographic and cultural dietary differences also profoundly influence the bifidobacterial composition of neonatal microbiotas. For example, Bifidobacterium longum subsp. infantis (Bl. infantis), a specialist HMO utilizer, constitutes up to 90% of the gut microbial composition of healthy breastfed infants from non-Westernized populations9,10,11. Infants from Westernized populations often lack Bl. infantis and instead harbour less proficient HMO utilizers such as Bifidobacterium longum subsp. longum (Bl. longum), Bifidobacterium breve and Bifidobacterium pseudocatenulatum; this appears to be due to a combination of lifestyle and cultural factors9,12,13.

Despite these differences, the predominance of bifidobacteria in gut communities is associated with multiple health benefits, particularly in infancy. Bifidobacterial fermentation products, lactate and acetate, can inhibit pathogen colonization14,15 and serve as substrates for cross-feeding among microbial community members16,17. In addition, multiple Bifidobacterium species produce aromatic lactic acids modulating the immune system18,19. These beneficial traits underpin the widespread use of bifidobacterial strains as probiotics20,21,22,23, often supplemented with complementary prebiotics to facilitate engraftment23,24.

Further development is needed to rationally select probiotic strains and prebiotic glycans tailored for different populations. For instance, a probiotic Bl. infantis strain did not durably engraft in the microbiota of malnourished Bangladeshi infants whose diets were low in breast milk compared with complementary foods23. Moreover, strains isolated from children in this population may harbour distinctive genomic adaptations for metabolizing glycans common in weaning diets23. Thus, strain-level insights into bifidobacterial carbohydrate metabolism, especially in understudied populations, may be instrumental in developing locally adapted pro- and synbiotics.

Genomics-based approaches, including the analysis of carbohydrate-active enzyme (CAZyme) repertoires7,25,26,27, genotype-to-phenotype matching28,29,30 and genome-scale metabolic models31,32,33, have been used to predict the carbohydrate utilization capabilities of bifidobacteria. However, functional gene annotations, particularly for glycan transporters, remain insufficiently curated and integrated into metabolic reconstructions34. In addition, current genomic resources may not fully capture the strain-level diversity within the Bifidobacterium genus.

We used a combination of curated metabolic reconstruction and machine learning to map 68 carbohydrate utilization pathways across 263 reference and 2,820 additional Bifidobacterium genomes by analysing the representation of 589 distinct functional roles (catabolic enzymes, CAZymes, transporters and transcriptional regulators). Thirty-eight predicted glycan utilization phenotypes were validated in vitro in 30 diverse strains. Our findings reveal remarkable inter- and intraspecies variability, including a distinct Bifidobacterium longum clade that metabolizes α-glucans and Bangladeshi strains with unique gene clusters for plant hemicellulose and HMO catabolism. These insights establish a genomic framework for predicting glycan utilization networks across Bifidobacterium lineages and may guide the development of targeted probiotics and synbiotics.

Results

Curated metabolic reconstruction from reference genomes

The curated reconstruction encompassed a non-redundant reference set of 263 Bifidobacterium genomes from cultured isolates, including 19 genomes from Bangladeshi strains sequenced in this study (Supplementary Tables 1 and 2). To refine taxonomic assignments, we first constructed a maximum-likelihood phylogenetic tree based on 487 core genes identified through a pangenome analysis (Supplementary Fig. 1). Additional pairwise average nucleotide identity (ANI) comparisons for 39 genomes allowed us to delineate the infraspecific structure of Bifidobacterium longum and Bifidobacterium catenulatum species (Extended Data Fig. 1 and Supplementary Note 1). Overall, the reference set spanned 19 taxa (Supplementary Table 1).

We leveraged a subsystem-based comparative genomics framework35 to reconstruct carbohydrate utilization pathways and predict associated metabolic phenotypes (Fig. 1a). We first mined 222 publications to identify 433 orthologous groups of carbohydrate utilization genes, whose encoded proteins were either experimentally characterized in bifidobacteria (210 metabolic functional roles), shared substantial sequence similarity with proteins characterized in other microbial taxa (80 roles) or were previously assigned putative functions (143 roles). Genomic context analysis, including in silico transcriptional regulon reconstruction (Supplementary Notes 2 and 3), enabled tentative functional predictions for 156 additional gene groups involved in glycan metabolism. The resulting curated set of 589 roles—comprising 235 components of glycan-specific transporters, 197 catabolic CAZymes, 72 downstream catabolic enzymes and 85 transcription factors—was used to functionally annotate 39,589 of 541,418 protein-coding genes across 263 reference genomes (Supplementary Tables 4 and 8). Manual curation improved 76.6% and 69% of annotations over Prokka and EggNOG-mapper, respectively, including more than 90% of annotations for transporters and transcriptional regulators (Fig. 1b). The metabolic reconstruction also captured 82.2% of catabolic CAZymes identified by dbCAN (Fig. 1c).

Fig. 1: In silico reconstruction of carbohydrate metabolism in bifidobacteria.
Fig. 1: In silico reconstruction of carbohydrate metabolism in bifidobacteria.
Full size image

a, An overview of the computational framework. We analysed the distribution of 589 metabolic functional roles (glycan transporters, catabolic enzymes and transcription factors) across 263 reference Bifidobacterium genomes. Manual reconstruction of 68 carbohydrate utilization pathways enabled the assignment of binary pathway variants corresponding to predicted glycan utilization phenotypes based on specific genomic signatures (Extended Data Fig. 2). This curated dataset was used to train an automated pathway prediction pipeline (glycobif), which outputs (1) the distribution of functional roles across additional 2,820 Bifidobacterium genomes and (2) a BPM reflecting the inferred presence or absence of pathways in each genome. Predicted phenotypes for 20 reference and 10 additional strains were compared with in vitro growth profiles to evaluate prediction accuracy. b, A comparison of functional gene annotations obtained via manual curation versus automated tools in the 263 reference Bifidobacterium genomes. Stacked bar plots show the distribution of 39,589 manually curated annotations across four categories: ‘new’, annotations that update non-specific predictions (for example, hypothetical protein); ‘corrected’, annotations that replace specific but incorrect predictions; ‘refined’, annotations that add functional precision (for example, substrate and linkage specificity for GHs); ‘same’, annotations that are essentially identical to automated output (Supplementary Table 13). c, The percentage of catabolic CAZymes (GHs, CEs and polysaccharide lyases) captured in the reconstructed metabolic pathways across the 263 reference Bifidobacterium genomes. Each point represents a genome.

The genome-wide distribution of genes assigned to 589 functional roles was used to reconstruct 68 catabolic pathways: 18 for monosaccharides and their derivatives (sugar alcohols and acids), 39 for di- and oligosaccharides and 11 for polysaccharides (Supplementary Table 5). To link gene conservation patterns with metabolic phenotypes, we first established rules that define genomic signatures distinguishing metabolic pathway variants (Methods, Extended Data Fig. 2 and Supplementary Table 6). Each pathway variant was then converted into a binary phenotype, classifying strains as predicted utilizers (‘1’) or non-utilizers (‘0’) of specific carbohydrates. These assignments formed a binary phenotype matrix (BPM), summarizing the predicted utilization profiles of 263 reference Bifidobacterium strains spanning 68 glycans (Supplementary Table 7). As a preliminary validation of the reconstruction, we compared predicted phenotypes with in vitro growth data for 33 strains from previous studies8,28,36, yielding 94% accuracy (Supplementary Table 17a–d).

Automated pathway prediction across large genomic datasets

We utilized the metabolic reconstruction for 263 reference Bifidobacterium genomes to analyse the representation of glycan utilization pathways in an additional set of 2,820 non-redundant genomes (Supplementary Table 3). This set included 364 isolate genomes and 2,456 high-quality metagenome-assembled genomes (MAGs) with completeness ≥97% and contamination ≤3%. Twelve genomes represented four taxa absent from the reference collection, and nine were from Bangladeshi and Malawian strains isolated in previous studies37,38,39 and sequenced in this work (Supplementary Table 1). We used annotated protein sequences of reference strains to assign functions to a subset of 419,055 protein-coding genes in the 2,820 genomes. These functional annotations, combined with the reference BPM, were used to train a random forest model that predicted the presence of 68 reconstructed carbohydrate utilization pathways (Methods, Fig. 1a and Supplementary Tables 9 and 10).

Interspecies diversity in glycan utilization

The BPMs for 263 reference and 2,820 additional genomes were merged to assess the distribution of glycan utilization pathways across the Bifidobacterium genus. Non-metric multidimensional scaling (NMDS) of the Hamming distance matrix derived from the BPM of 3,083 genomes revealed grouping by species and subspecies, with taxonomy explaining 91% of the variation (permutational multivariate analysis of variance (PERMANOVA) R2 = 0.91, P = 0.001; Fig. 2a). A significant dispersion effect (F = 81.06, P = 0.001) indicated that strain-level variability in pathway representation differed across taxa, probably also influencing the separation.

Fig. 2: Representation of carbohydrate utilization pathways across 3,083 Bifidobacterium genomes.
Fig. 2: Representation of carbohydrate utilization pathways across 3,083 Bifidobacterium genomes.
Full size image

a, NMDS of a Hamming distance matrix derived from the presence–absence patterns of 68 predicted carbohydrate utilization pathways across 627 isolate genomes plus 2,456 MAGs. Points represent genomes; spider lines connect genomes to their group (taxon) centroid. Colours and shapes of centroids indicate taxonomic assignments. b, Predicted phenotypic richness (the total number of carbohydrate utilization pathways) at species and strain levels. Each point represents a genome. Box plots show the median (centre line), interquartile range (IQR; box bounds) and full data range excluding outliers (whiskers, defined as 1.5× IQR). Statistical comparisons were performed using a two-sided GLM with a Poisson distribution, followed by post-hoc pairwise comparisons with Bonferroni correction. The sample size (n) per group corresponds to the number of genomes analysed. c, A heat map of the proportion of genomes within each taxon encoding the 68 predicted carbohydrate utilization pathways. The colour intensity indicates the percentage of genomes that encode each pathway. Annotation rows at the bottom indicate pathway and phenotype classifications. Full names are provided in Supplementary Table 5.

We observed significant differences in predicted phenotypic richness (total number of predicted carbohydrate utilization pathways) between taxa, including phylogenetically close subspecies sharing over 95% ANI (Poisson generalized linear model (GLM), P < 2.2 × 10−16; Fig. 2b). For example, Bifidobacterium catenulatum subsp. catenulatum (Bc. catenulatum) had significantly lower phenotypic richness than subspecies kashiwanohense (Bc. kashiwanohense) and kashiwanohense_A (Bc. kashiwanohense_A) due to the absence of utilization pathways for fucosylated HMOs (for example, 2′-fucosyllactose (2′FL) and 3-fucosyllactose (3FL)) and certain plant oligosaccharides (for example, β-mannose oligosaccharides, bMnOS) in >95% of genomes (Extended Data Fig. 3 and Supplementary Note 4).

Pathways for glucose (Glc), galactose (Gal), fructose (Fru), lactose (Lac) and galactooligosaccharide (GOS) utilization were identified in over 98% of analysed genomes, defining the core catabolic potential of human-colonizing bifidobacteria (Fig. 2c). Pathways for sucrose (Scr), maltose (Mal), maltooligosaccharide (MOS), isomaltooligosaccharide (IMO), melibiose (Mel), raffinose-family oligosaccharide (RFO) and short-chain fructooligosaccharide (scFOS) metabolism were encoded in over 84% of genomes, indicating broad conservation across most species, except Bifidobacterium bifidum. Other pathways showed more sporadic distribution patterns, reflecting species-level adaptations for metabolizing dietary glycans of different origins and structures. For example, B. bifidum exhibited a notable genomic specialization towards the metabolism of host mucin O-glycans, HMOs and their degradation products lacto-N-biose (LNB), galacto-N-biose (GNB) and N-acetyl-d-glucosamine (GlcNAc)—consistent with prior reports40 (Fig. 2c). At the same time, most B. bifidum genomes lacked complete pathways for utilizing plant-derived mono-, di-, oligo- and polysaccharides. Other bifidobacteria, including the specialist HMO utilizer Bl. infantis, were more versatile in their predicted glycan utilization profiles, although complete pathways for plant polysaccharide degradation were less common than those for catabolizing the corresponding mono- and oligosaccharide components (Fig. 2c).

Hierarchical clustering of the BPM for 263 reference genomes showed a moderate correlation with core-gene phylogeny (cophenetic correlation 0.58, permutation test, P < 0.001; Supplementary Fig. 2), indicating incomplete concordance between predicted glycan utilization capabilities and phylogenetic relatedness. For example, the predicted phenotypic profiles of Bl. infantis and Bl. longum, two phylogenetically related subspecies within the B. longum species, were markedly different. The representation of glycan utilization pathways in Bl. infantis more closely resembled that of B. breve—a more distantly related species inhabiting the neonatal human gut. Given the importance of B. longum in infant microbiota development, we next conducted a more focused analysis of pathway variability within this heterogeneous species.

Diversity of glycan metabolism within the B. longum species

The B. longum species comprises multiple subspecies distinguished by phylogeny and specific phenotypic traits26,41,42,43. Our phylogenomic and ANI analyses clustered B. longum genomes into three clades matching previously described subspecies—infantis (Bl. infantis), longum (Bl. longum) and suis (Bl. suis)—and a distinct clade hereafter referred to as Bl. nov. (Extended Data Fig. 1a, Supplementary Fig. 1 and Supplementary Note 1). Bl. nov. exhibited significantly lower predicted phenotypic richness than other subspecies (Fig. 2b), lacking pathways for LNB, GNB, N-glycan, HMO and T-antigen (Tan) metabolism (Fig. 2c and Supplementary Note 5). Conversely, only Bl. nov. genomes encoded extracellular amylopullulanase ApuB (GH13_14_32), a bifunctional glycoside hydrolase (GH) cleaving both α-1,4 and α-1,6-glycosidic bonds in soluble starch (ST) and pullulan (PUL)44, and a pathway for difructose dianhydride (DFA) metabolism45. These findings suggest that Bl. nov. has a reduced capacity to metabolize host-derived glycans but can degrade α-glucans of plant and fungal origin.

Comparative analysis revealed stark differences in the repertoire of carbohydrate utilization pathways between Bl. infantis and Bl. longum (Fig. 2c). Consistent with previous studies, Bl. infantis genomes were distinguished by the presence of H146 and FL1/2 (ref. 4) gene clusters, which enable the utilization of lacto-N-neotetraose (LNnT), 2′FL, 3FL, lactodifucotetraose (LDFT), lacto-N-fucopentaose I (LNFP I) and sialylated HMOs (SHMOs; Supplementary Tables 710). Predicted HMO utilization potential of Bl. longum was more limited: 35% of genomes encoded extracellular lacto-N-biosidase LnbX (GH136) that cleaves lacto-N-tetraose (LNT) and LNFP I47, and only 2.3% carried a gene cluster driving intracellular utilization of 2′FL, 3FL, LDFT and LNFP I48,49. Beyond HMO metabolism, Bl. infantis exclusively encoded catabolic pathways for glucuronate (GlcA), inositol (Ino) and gluconate (Gco) in 63%, 48% and 23% of genomes, respectively (Extended Data Fig. 4a,b,d,e and Supplementary Note 3). Bl. longum genomes, by contrast, commonly encoded pathways for l-arabinose (Ara), α/β-arabinooligosaccharides (aAOS/bAOS), type II arabinogalactan (AGII), arabinan (AR), arabinoxylan (AX) and host- or plant-derived O-glycans (Tan and HRGP; Fig. 2c). These findings illustrate an ecological divergence between Bl. infantis and Bl. longum shaped by their respective adaptations to thrive on milk glycans during breastfeeding versus plant-derived carbohydrates after weaning.

Predicted glycan utilization profiles within the Bl. suis group were highly heterogeneous. Most genomes encoded a set of metabolic pathways similar to that of Bl. longum but more frequently included pathways for the utilization of N-acetylneuraminic acid (Neu5Ac; 44% versus 2%), l-fucose (Fuc; 61% versus 1.7%) and fucosylated HMOs (2′FL, 3FL, LDFT and LNFP I; 56% versus 2.3%), while lacking genes encoding extracellular α-l-arabinofuranosidases required for AX degradation50,51 (Fig. 2c and Supplementary Tables 8 and 10). By contrast, the Bangladeshi isolate Bg131.S11_17.F6 shared multiple genomic features with Bl. infantis, including the presence of the H1 gene cluster and the absence of araBDA genes (Supplementary Table 8). Consequently, this strain was predicted to metabolize more HMO structures (for example, LNnT and SHMOs) than other Bangladeshi Bl. suis strains, while lacking the capacity to utilize arabinose-containing glycans of plant origin (for example, Ara, aAOS and AGII; Supplementary Fig. 3). Genomic analysis of ten additional animal-derived Bl. suis strains confirmed the uniqueness of the Bg131.S11_17.F6 isolate and further underscored the phenotypic heterogeneity within this group (Extended Data Fig. 5 and Supplementary Note 5). Collectively, these findings reveal pronounced differences in carbohydrate utilization across B. longum subspecies and underscore pervasive strain-level heterogeneity within each clade.

Strain-level heterogeneity of glycan metabolism

Beyond interspecies differences, we observed extensive strain-level variability in predicted carbohydrate utilization capabilities. Among the 68 reconstructed pathways, 66 exhibited variability within at least one taxonomic group (Fig. 2c and Supplementary Tables 7 and 9). The genomic differences driving this heterogeneity ranged from individual genes encoding extracellular GHs enabling polysaccharide degradation to multigene clusters comprising up to 20 genes encoding complete metabolic pathways (Supplementary Table 8). By contrast, biosynthetic pathways for essential metabolites, such as amino acids and B vitamins, were largely conserved. A few exceptions included riboflavin (B2) biosynthesis, which varied across Bl. suis and Bc. kashiwanohense_A, and thiamine (B1) and niacin (B3) biosynthesis in Bifidobacterium adolescentis (Extended Data Fig. 6 and Supplementary Table 11).

The observed heterogeneity reflects that, while most Bifidobacterium taxa follow general ecological strategies centred on the utilization of specific core glycans, individual strains can exhibit substantial metabolic tuning. For instance, Bl. infantis Bg064.S07_13.C6 harboured pathways for xylooligosaccharide (XOS) and long-chain fructooligosaccharide (lcFOS) utilization, suggesting an enhanced capacity to metabolize dietary plant glycans compared with most other Bl. infantis strains (Supplementary Fig. 3). Conversely, several B. adolescentis isolates carried pathways for the utilization of LNB, GNB, N-glycans (strains AF96-10M2bTA and UN03-88) and fucosylated HMOs (strain M56B_1C3), highlighting the presence of traits characteristic of infant-adapted taxa in a species commonly associated with the adult gut (Supplementary Tables 9 and 10). Additional examples of notable strain-specific variability of genomic features related to carbohydrate utilization were found in Bangladeshi isolates and are detailed below.

Unique glycan utilization features of Bangladeshi strains

Our previous study described distinctive genomic features of Bangladeshi Bl. infantis strains related to N-glycan and β-glucoside catabolism23. Here, we identified a distinct gene cluster (xgl) in Bc. kashiwanohense Bg42221_1E1 and Bc. kashiwanohense_A Bg42221_1D3, two isolates from a Bangladeshi infant (Fig. 3a). This cluster encoded multiple GHs (families 3, 5_4, 29, 31, 42 and 43_12), an unclassified carbohydrate esterase (CE), a β-glucoside kinase and an ABC transport system, which together may catabolize plant hemicelluloses such as xyloglucans (XGLs). The reconstructed XGL pathway involved the hydrolysis of the XGL backbone to oligosaccharides by extracellular endo-β-1,4-glucanase Xgl5A (GH5_4), which shares catalytic and glycan-binding residues with xyloglucanase PpXG5 from Paenibacillus pabuli XG552 (Extended Data Fig. 7a). Released oligosaccharides would be imported into the cell and degraded sequentially to individual monosaccharides and cellobiose by the coordinated action of GHs and CEs via a mechanism similar to that described for Ruminiclostridium cellulolyticum53 (Fig. 3b). Regulon reconstruction of XglT, a putative TetR-family transcription factor, suggested potential co-regulation of the xgl cluster and cbpA, which encodes a GH94-family cellobiose phosphorylase (Fig. 3a–c and Supplementary Table 19). The xgl cluster was identified in only 3 of 110 studied B. catenulatum genomes but was conserved in Bifidobacterium dentium and Bifidobacterium tsurumirense (Fig. 2c).

Fig. 3: Integrated genomic and transcriptomic analysis of XGL metabolism in Bc. kashiwanohense Bg42221_1E1.
Fig. 3: Integrated genomic and transcriptomic analysis of XGL metabolism in Bc. kashiwanohense Bg42221_1E1.
Full size image

a, Gene clusters potentially driving XGL degradation in Bc. kashiwanohense Bg42221_1E1. b, Reconstructed XGL degradation pathway in Bc. kashiwanohense Bg42221_1E1: (1) XGL chains are potentially cleaved by extracellular endo-β-1,4-glucanase Xgl5A, and (2) released oligosaccharides are imported and metabolized inside the cell by a coordinated action of GHs, CEs and downstream catabolic enzymes. c, The predicted DNA-binding motif of the XglT transcriptional regulator potentially controlling XGL metabolism genes. d, Growth curves of Bifidobacterium strains in MRS-AC supplemented with 0.5% tamarind XGL. Data represent the mean ± s.d. of three biological replicates. e, A volcano plot showing differential gene expression (log2 fold change (FC) versus −log10-adjusted P value) in Bc. kashiwanohense Bg42221_1E1 grown in MRS-AC with tamarind XGL versus MRS-AC with Lac. Differential expression was assessed using moderated two-sided t-tests with empirical Bayes variance moderation. P values were adjusted for multiple comparisons using the Benjamini–Hochberg procedure. Genes were considered differentially expressed at Padj < 0.01 and |log2FC| > 2. Genes belonging to the reconstructed XglT, XylR and XosR regulons, as well as xyn cluster genes, are highlighted. Exact log2FC values, test statistics and adjusted P values are provided in Supplementary Table 21a.

Another gene cluster, unique to Bc. kashiwanohense Bg42221_1E1, contained orthologues of H1 cluster HMO utilization genes from Bl. infantis and the ‘outlier’ Bangladeshi Bl. suis Bg131.S11_17.F6 strain (Fig. 4a). The H1 variant identified in Bg42221_1E1 encoded two ABC transporters, one of which (HmoABC) is probably involved in the uptake of LNnT and other type II HMO structures54. In addition, this cluster encoded orthologues of characterized β-N-acetylglucosaminidase (GH20), two α-fucosidases (GH29 and GH95) and α-sialidase (GH33), which mediate intracellular HMO hydrolysis55,56,57, along with downstream catabolic pathways for GlcNAc and Fuc. Regulon reconstruction suggested the potential transcriptional control of H1 cluster genes in Bc. kashiwanohense Bg42221_1E1 by NagR, a GlcNAc-responsive ROK-family repressor58 (Fig. 4a and Supplementary Table 19).

Fig. 4: Comparative genomic and functional profiling of HMO utilization capabilities in Bifidobacterium strains.
Fig. 4: Comparative genomic and functional profiling of HMO utilization capabilities in Bifidobacterium strains.
Full size image

a, A schematic representation of HMO utilization genes in selected Bangladeshi Bifidobacterium strains and their homology to H1 cluster genes from Bl. infantis ATCC 15697. The H1 cluster encodes multiple paralogous substrate-binding proteins (SBPs) of putative HMO transporters (Hmo), whereas FHMO encodes a characterized, non-orthologous transporter (FL2_ABC) specific for fucosylated HMOs. b, Growth curves of Bifidobacterium strains in MRS-AC supplemented with pooled HMOs. Data represent the mean ± s.d. of three biological replicates. Curves are colour-coded based on the presence of specific HMO utilization gene clusters in the respective strains. c, HPLC-based quantification of HMO depletion from culture supernatants after 24 h. Data represent the percentage of utilized HMOs (mean of three biological replicates) relative to the medium control. Total HMO, total HMO utilized; total FHMO, total fucosylated HMO utilized; total SHMO, total sialylated HMO utilized. Concentrations of individual HMOs (nmol ml−1) are provided in Supplementary Table 18b. d, PCA of combined growth metrics (maximum OD600, area under the curve and maximum growth rate) and per cent utilization of individual HMOs. Each point represents a strain, colour-coded by the presence of specific HMO utilization gene clusters. Axes represent principal components (PCs). In b and d, the notations H1 cluster+ and FHMO cluster+ indicate that a strain carries the complete H1 or FHMO gene cluster, respectively. e, A volcano plot showing differential gene expression (log2FC versus −log10-adjusted P value) in Bc. kashiwanohense Bg42221_1E1 grown in MRS-AC with LNT versus MRS-AC with Lac. Differential expression was assessed using moderated two-sided t-tests with empirical Bayes variance moderation. P values were adjusted for multiple comparisons using the Benjamini–Hochberg procedure. Genes were considered differentially expressed at Padj < 0.01 and |log2FC| > 2. Genes belonging to the reconstructed NagR regulon are highlighted. Exact log2FC values, test statistics and adjusted P values are provided in Supplementary Table 21b.

Other notable features of Bangladeshi Bifidobacterium isolates included a putative d-galactonate utilization pathway found exclusively in B. breve Bg41721_1C11 (Extended Data Fig. 4d,e and Supplementary Note 3). This strain, along with B. breve Bg131.S11_D6, also lacked the nan gene cluster, which encodes a well-characterized Neu5Ac catabolic pathway59 (Extended Data Fig. 4f,g). This absence of the nan cluster in these two strains was unexpected, given its previously reported broad conservation in B. breve genomes29,59 and its established role in utilizing Neu5Ac via cross-feeding on SHMOs and mucin O-glycans degraded by B. bifidum59,60.

These results demonstrate that Bangladeshi Bifidobacterium strains carry unique adaptations for metabolizing dietary plant polysaccharides and HMO, while lacking pathways conserved in well-characterized strains. These traits may reflect strain-level adaptation to the diet and lifestyle of Bangladeshi children. We next investigated whether broader patterns of bifidobacterial carbohydrate utilization are associated with host age and lifestyle across diverse human populations.

Associations between pathway profiles and lifestyle

We examined the enrichment of glycan utilization pathways in Bifidobacterium genomes from different populations based on: (1) host age and stage of gut microbiota maturation (<3 years: infant or transitional; ≥3 years: adult-like), and (2) host lifestyle (‘Westernized’ versus ‘non-Westernized’ as defined by Pasolli et al.61). Host glycan (for example, HMO) utilization pathways were enriched in genomes from the ‘age <3’ group, whereas plant glycan utilization pathways were more prevalent in adult-associated (≥3 years) genomes across both lifestyle groups (Fisher’s exact test, Benjamini–Hochberg adjusted P ≤ 0.01; Extended Data Fig. 8a,b). Within the ‘age <3’ group, 14 pathways (11 for plant glycans) were enriched in the Westernized group and 24, including 11 for HMOs and their constituent blocks LNB, Neu5Ac and Fuc, in the non-Westernized group (Extended Data Fig. 8c). These differences probably stem from the uneven distribution of taxa across Westernized and non-Westernized microbiotas. For example, pathways enriched in the non-Westernized group were associated with Bl. infantis, a taxon more prevalent in that group (odds ratio 4.98, 95% confidence interval 3.24–7.51, Fisher’s exact test, P = 1.82 × 10−11).

Within taxa, pathways for sorbitol (Stl), mannitol (Mtl), lcFOS and pectic galactan (PG) metabolism were enriched in Westernized Bl. infantis genomes (Extended Data Fig. 8d). The Stl and lcFOS catabolic pathways were also more prevalent in Westernized B. adolescentis genomes, and the PG pathway was overrepresented in Westernized Bl. longum genomes. Conversely, non-Westernized B. breve genomes were enriched for melezitose (Mlz) and 1,2-β-oligoglucan (BGL12) utilization pathways but more frequently lacked the Neu5Ac pathway. These findings underscore how lifestyle-driven ecological pressures shape glycan utilization strategies among bifidobacteria. We next tested carbohydrate utilization phenotypes in representative isolates to validate predictions from the reconstruction framework.

Growth-based validation of predicted glycan utilization

To experimentally validate in silico phenotypic predictions, we conducted in vitro growth assays on 30 geographically diverse Bifidobacterium strains (15 Bangladeshi, 8 Malawian, 5 US and two European). Growth was tested on 43 substrates corresponding to 38 predicted glycan utilization phenotypes: 13 monosaccharides and derivatives, 18 di- or oligosaccharides and 7 polysaccharides (Fig. 5a). Strains were cultured in a sugar-free De Man–Rogosa–Sharpe (MRS-AC) medium4,58 supplemented with the test substrate (5–10 mg ml−1), and growth was defined using strain-specific optical density at 600 nm (OD600) thresholds (Methods and Supplementary Fig. 4). Growth outcomes were compared with predictions derived from manual curation (20 strains) or the automated pipeline (10 strains; Fig. 5b).

Fig. 5: Comparison of predicted carbohydrate utilization phenotypes with in vitro growth data.
Fig. 5: Comparison of predicted carbohydrate utilization phenotypes with in vitro growth data.
Full size image

a, A summary of in vitro growth profiles of 30 Bifidobacterium strains on 43 substrates. Growth, weak growth and no growth were categorized on the basis of strain-specific OD600 thresholds (Methods). The annotation column on the right shows the geographical origin of each strain. The details about substrates are given in Supplementary Table 15; all growth curves are shown in Supplementary Fig. 4; raw OD600 data are provided in Supplementary Table 16. b, A comparison of 38 predicted carbohydrate utilization phenotypes with corresponding in vitro growth profiles for the same 30 strains. Predicted phenotypes were tested using a singular substrate, except IMO (panose and isomaltotriose), BglOS (cellobiose, gentiobiose, laminaritriose and sophorose) and SHMO (3′SL and 6′SL). Prediction outcomes are colour-coded to indicate agreement between predicted and observed phenotypes. The annotation column on the right shows whether genomes belong to the reference or extended dataset. The annotation row at the bottom represents phenotype classification (glycan type). Summary data are provided in Supplementary Table 17f; full names of abbreviations are provided in Supplementary Table 5.

Prediction accuracy was similar for manual and automated approaches: 95% and 94%, respectively, with Matthews correlation coefficients of 0.9 and 0.89 (Supplementary Table 17e,f). False-negative predictions (growth despite predicted non-utilization) probably stemmed from incomplete knowledge about monosaccharide and HMO transporters. For example, Bl. suis Bg131.S11_17.F6 grew on 2′FL despite lacking genes encoding known transporters for this substrate4,5 (Supplementary Table 8). Some false-positive predictions (predicted utilization but no growth) may have resulted from gene disruptions, such as a premature stop codon in the gltA gene in B. pseudocatenulatum LFYP29, which probably impaired the LNT transport function.

As predicted, Bl. nov. LFYP82 was the only B. longum strain to grow on ST and PUL (Fig. 5a). We also validated strain-specific utilization of scFOS, lcFOS, XOS and GlcA in Bl. infantis (Fig. 5 and Extended Data Fig. 4c). Similarly, while two B. breve strains grew on mannotriose (bMnOS), none grew on konjac glucomannan (bMAN), in contrast to B. dentium LFYP24 and Bifidobacterium scardovii JCM 12489, which harboured extracellular endo-β-1,4-mannanases (Extended Data Fig. 7e–g and Supplementary Note 2).

We further validated several unusual glycan utilization traits predicted in Bangladeshi and Malawian isolates. Bc. kashiwanohense Bg42221_1E1 and Bc. kashiwanohense_A Bg42221_1D3, both carrying the xgl cluster, grew in the medium supplemented with tamarind XGL (Fig. 3d). Bangladeshi B. breve isolates lacking the Neu5Ac catabolic pathway (Bg41721_1C11 and Bg131.S11_D6) failed to grow on Neu5Ac, unlike other tested B. breve strains (Extended Data Fig. 4h). Finally, consistent with our prediction, B. adolescentis M56B_1C3 grew on 2′FL, highlighting the previously unrecognized capacity of this species (Fig. 5).

Glycoprofiling of HMO utilization

The H1 cluster in Bl. infantis is believed to enable the metabolism of multiple HMOs46, although the precise range of structures cannot be confidently predicted due to limited understanding of the functions of individual transporter genes. Given the presence of homologous H1 clusters in Bl. suis Bg131.S11_17.F6 and Bc. kashiwanohense Bg42221_1E1, we compared their HMO utilization with that of (1) two Bl. infantis strains, (2) phylogenetically related strains carrying a fucosylated HMO utilization gene cluster (FHMO) instead of H1 (Bl. suis Bg41121_2E1 and Bc. kashiwanohense_A Bg42221_1D3; Fig. 4a), and (3) distantly related strains predicted to have weak (Bl. longum and B. breve) or minimal (Bl. nov. and B. pseudocatenulatum) HMO utilization capacity.

Strains with the H1 cluster reached the highest optical densities when grown in MRS-AC supplemented with an HMO mixture isolated from pooled human milk (Fig. 4b). High-performance liquid chromatography (HPLC)-based glycoprofiling of culture supernatants revealed that these strains consumed 72–86% of total HMOs by 24 h, including multiple fucosylated and sialylated structures (Fig. 4c and Extended Data Fig. 9). However, Bl. suis Bg131.S11_17.F6 and Bc. kashiwanohense Bg42221_1E1 did not efficiently utilize 2′FL, the most abundant HMO species in the mixture, in line with the absence of characterized 2′FL transporters. By contrast, strains carrying the FHMO cluster preferentially metabolized fucosylated HMOs, completely depleting 2′FL while exhibiting limited utilization of sialylated structures. Among strains negative for both H1 and FHMO clusters, Bl. longum Bg115.S08_3A11 depleted LNT, LNnT, LNFP I, LNFP III and lacto-N-hexaose (LNH), probably via partial extracellular degradation by β-galactosidase and lacto-N-biosidase, whereas B. breve Bg155.S08_4F7 metabolized LNT and LNnT intracellularly (Supplementary Fig. 5). Bl. nov. LFYP82 failed to efficiently utilize any tested HMOs, consistent with lineage-specific loss of relevant genes and pathways. Principal component analysis (PCA) of growth and HMO utilization data separated strains with H1 and FHMO clusters from each other and all others (Fig. 4d). These findings provide the experimental evidence that Bc. kashiwanohense and Bl. suis strains carrying the H1 cluster exhibit HMO utilization patterns comparable to Bl. infantis, and highlight how variation in gene repertoires drives metabolic divergence among closely related Bifidobacterium strains.

Transcriptional profiles of glycan utilization in Bc. kashiwanohense

We used RNA sequencing (RNA-seq) to test regulon predictions and pathway assignments in Bc. kashiwanohense Bg42221_1E1, which harbours both xgl and H1 gene clusters. Transcriptomic comparison of cultures grown in MRS-AC-XGL versus MRS-AC-Lac revealed strong induction (100–550 fold) of most xgl genes and cbpA, supporting their co-regulation by XglT and a shared role in XGL metabolism (Fig. 3a,e and Supplementary Table 21a). Genes involved in xylose, (arabino)xylooligosaccharide and (arabino)xylan metabolism showed moderate upregulation (4–30 fold), probably in response to intracellular xylose release, which may serve as a transcriptional effector of predicted transcriptional repressors XylR (ROK family) and XosR62 (LacI family; Extended Data Fig. 7b,c, Supplementary Table 19 and Supplementary Note 2). Next, we compared the transcriptomes of Bc. kashiwanohense Bg42221_1E1 grown in MRS-AC-LNT and MRS-AC-Lac. All H1 cluster genes except nanA2 and nanH2 were significantly upregulated in the presence of LNT (Fig. 4e and Supplementary Table 21b), consistent with the proposed regulation by NagR, a GlcNAc-responsive repressor implicated in the control of HMO utilization in bifidobacteria58. Overall, the observed transcriptomes suggest that the regulatory networks of this strain are adapted for foraging on mixtures of HMOs and plant oligo- and polysaccharides.

Our large-scale genomic and experimental analyses pinpoint ecological differences between glycan foraging strategies within Bifidobacterium that reflect species-level evolutionary adaptation to different habitats (for example, infant gut versus adult gut) and dietary carbohydrate composition. The results also underscore considerable strain-level variability, probably shaped by host lifestyle and local dietary exposures.

Discussion

The ability to metabolize dietary glycans is central to bifidobacterial fitness in the human gut, with 8–12% of their gene content dedicated to carbohydrate metabolism63. Understanding this process has fundamental and translational relevance, including for the development of probiotics and synbiotics. While interspecies differences in glycan preferences within bifidobacteria are well documented25,64,65, within-species variability, particularly in populations from low- and middle-income countries, remains understudied. The growing availability of Bifidobacterium genomes, driven by culturomics66,67,68,69 and metagenomics70,71,72,73,74,75, provides an opportunity to address this gap. However, accurately predicting carbohydrate utilization phenotypes from genomic data remains challenging due to imprecise functional annotations (particularly of glycan transporters) generated by widely used pipelines, as well as incomplete representation of metabolic pathways in public databases34,76,77. To address this gap, we reconstructed 68 carbohydrate utilization pathways encoded in 3,083 Bifidobacterium genomes and MAGs and validated 38 predicted phenotypes in vitro, with an overall accuracy exceeding 94%. Several false-negative predictions for specific mono- and oligosaccharides, including HMOs, suggest the existence of yet uncharacterized glycan transport mechanisms in bifidobacteria.

Our analysis revealed taxon-specific distribution of glycan utilization pathways, reflecting the adaptation of Bifidobacterium species and subspecies to distinct ecological niches, such as the infant and adult gut, and expanding on earlier observations25,30,64,65. For instance, we identified a distinct clade within the B. longum species (Bl. nov.) specialized in metabolizing plant and fungal α-glucans but incapable of utilizing host glycans such as LNB, GNB and HMOs. As most Bl. nov. genomes originated from infants in Westernized populations, the unique glycan preferences of this clade might reflect an adaptation to differences in early-life diet and feeding practices. However, certain pathways, such as RFO utilization, were widely conserved. A recent study demonstrated that RFO metabolism supports B. breve colonization and persistence in vivo19, suggesting that conservation of this pathway across bifidobacterial lineages may promote their persistence and transmission beyond infancy.

Another important observation was extensive strain-level variability driven by accessory genome differences ranging from single genes to gene cassettes encoding entire carbohydrate utilization pathways. Particularly notable was the heterogeneity of Bl. suis isolates of Bangladeshi origin. Similar strains were recently predicted to utilize HMOs and plant fibres based on the analysis of CAZyme repertoires and reclassified as B. longum subsp. iuvenis7,43. Our analysis revealed nuanced metabolic differences within this clade, suggesting the presence of two distinct ‘ecotypes’. The first, exemplified by strain Bg131.S11_17.F6, metabolizes a wide array of HMOs—including LNT, LNnT, long-chain fucosylated and sialylated species—but not arabinose and plant polysaccharides, mimicking the glycan preferences of Bl. infantis. The second ecotype, represented by strain Bg41121_2E1, preferentially utilizes fucosylated HMO structures, including the abundant 2′FL, but can metabolize arabinose-containing oligo- and polysaccharides, phenotypically resembling previously characterized Bl. longum strains APC1478, SC596 and MC1000728,48,49. These distinct and complementary strategies suggest ecological specialization within the Bl. suis group, potentially linked to diet and weaning status.

Further within-species variability is illustrated by Bangladeshi isolates carrying unique gene clusters. One notable example is the xgl cluster identified in Bc. kashiwanohense Bg42221_1E1 and Bc. kashiwanohense_A Bg42221_1D3 isolated from the same infant. This cluster enables XGL degradation, a capability previously suggested only for Bifidobacterium species inhabiting captive marmosets78. Another unexpected discovery was the presence of an H1 cluster variant in Bc. kashiwanohense Bg42221_1E1, formerly considered to be exclusive for Bl. infantis46 and rare Bangladeshi Bl. suis group strains7,23. Although the precise substrate specificities of transporters encoded by H1 cluster remain to be characterized, our data confirm that its presence enables the utilization of major HMOs found in human milk. Bc. kashiwanohense strains have been shown to metabolize fucosylated HMOs (2′FL and LNFP I) and plant hemicelluloses (xylan and AX)8. Our findings expand the catabolic potential of Bc. kashiwanohense to XGL and a broader range of HMOs (LNT, LNnT and SHMOs). These results indicate an evolutionary adaptation of Bangladeshi strains to thrive in the microbiota of weaning children, whose diets may combine breast milk with complementary foods rich in plant polysaccharides.

Although we did not explicitly reconstruct cross-feeding interactions, the lower prevalence of complete catabolic pathways for polysaccharides, relative to those targeting their mono- and oligomeric breakdown products, suggests that many Bifidobacterium strains may depend on syntrophic relationships, both within the genus and with keystone degraders such as Bacteroides and Segatella (formerly Prevotella)79,80,81. Recent evidence for widespread HMO-degrading capacity across gut microbes82 further supports the potential for cross-feeding involving bifidobacteria and other community members.

The genomic dataset analysed in this study is skewed towards samples from high-income countries, highlighting the need for initiatives to expand collections of Bifidobacterium isolates and genomes from underrepresented populations in an ethical and culturally sensitive manner to capture global metabolic diversity83. Some observed within-species variability may reflect artefacts introduced by incomplete or contaminated MAGs and different assembly methods, underscoring the importance of expanding collections of high-quality MAGs and isolate genomes.

In summary, our study reveals how glycan foraging strategies vary across and within Bifidobacterium species, shaped by ecological factors, including host age, diet and lifestyle. The comprehensive metabolic reconstruction and an automated pathway prediction pipeline provide a scalable framework for accurate functional annotation of bifidobacterial genomes and MAGs. This genomic compendium may inform the development of pro- and synbiotic formulations, including multistrain consortia, whose members might efficiently colonize the gut across diverse lifestyle and dietary contexts.

Methods

Collection of Bifidobacterium genomes

Reference genomes for the manual in silico metabolic reconstruction were retrieved from Bacterial and Viral Bioinformatics Resource Center (BV-BRC)84 and Integrated Microbial Genomes (IMG)85 databases as of October 2020. We selected 335 genomes on the basis of the following criteria: (1) human or probiotic product-derived Bifidobacterium spp. isolates, (2) number of contigs ≤200 and (3) CheckM86 completeness ≥97% and contamination ≤3%. In addition, we sequenced 31 genomes of cultured isolates from faecal samples obtained from Bangladeshi children37,38. The resulting reference set of 366 genomes was clustered using dRep (v3.4.2)87 at a 99.95% ANI threshold (‘dRep dereplicate -pa 0.9 -sa 0.9995 -nc 0.35 --S_algorithm ANImf’), yielding 263 non-redundant genomes (Supplementary Table 2).

Additional human gut-derived Bifidobacterium genomes were collected from prior studies. These included isolate genomes from Human Gastrointestinal Bacteria Culture Collection (HBC)66, Broad Institute-OpenBiome Microbiome Library (BIO-ML)67, Culturable Genome Reference (CGR)68 and Human intestinal Bacteria Collection (HiBC)69, and from Unified Human Gastrointestinal Genome (UHGG)70, Human Reference Gut Microbiome (HRGM)71, Early-Life Gut Genomes (ELGG)72, Singapore Platinum Metagenomes Project (SPMP)73 and Inner Mongolian Gut Genome (IMGG)74 collections and other datasets75. We used the same selection criteria for both isolate genomes and MAGs: (1) Bifidobacterium spp. based on Genome Taxonomy Database (GTDB) taxonomy88 excluding Bifidobacterium leopoldii and Bifidobacterium vaginale, (2) number of contigs ≤200 and (3) CheckM completeness ≥97% and contamination ≤3%. The resulting 4,944 genomes were clustered via dRep using the command ‘dRep dereplicate -pa 0.9 -sa 0.999 -nc 0.35 --S_algorithm fastANI’, yielding 2,811 non-redundant genomes (Supplementary Table 3). We also sequenced nine genomes of strains isolated from Bangladeshi and Malawian infants previously37,38,39, resulting in 2,820 additional genomes.

Isolation and sequencing of Bifidobacterium strains from Bangladeshi and Malawian donors

Faecal samples used for culturing Bangladeshi bifidobacterial strains were collected during studies conducted by the International Centre for Diarrhoeal Disease Research (icddr,b). These were (1) the MAL-ED birth cohort study of children aged 0–24 months (Interactions of Enteric Infections and Malnutrition and the Consequences for Child Health and Development; ClinicalTrials.gov identifier NCT02441426) and (2) a cohort of healthy 12–24-month-old Bangladeshi children enrolled in parallel with children with acute malnutrition in a study of microbiota-directed complementary food prototypes (ClinicalTrials.gov identifier NCT03084731)37,38. Both studies were approved by the Ethical Review Committee of the icddr,b. Bifidobacterial strains were also isolated from faecal samples collected in a previously reported study of Malawian twins discordant for acute malnutrition39. The protocol was approved by the College of Medicine Research Ethics Committee of the University of Malawi and by the Human Research Protection Office of Washington University in St. Louis. Written informed consent, including provisions for future use of materials, was provided by the parents or guardians of participating children before enrolment. The details of genome sequencing and assembly are provided in Supplementary Methods.

Taxonomy inference

Initial taxonomic assignments for the 263 reference Bifidobacterium genomes were retrieved from the National Center for Biotechnology Information and BV-BRC for public genomes or inferred from 16S rRNA gene sequencing for Bangladeshi isolates. To refine these assignments, we built a pangenome via Panaroo (v1.3.2)89. Concatenated nucleotide sequences of 487 identified core genes (Supplementary Table 12) were aligned using MAFFT (v7.515)90. A maximum-likelihood phylogenetic tree was built in IQ-TREE (v2.2.0.3)91 and visualized via iTOL (v5)92. Pairwise ANI values (ANIb) were calculated using pyani (v0.2.12)93. The exact commands and parameters are provided in Supplementary Code File 1. The resulting tree topology was manually inspected to verify and correct taxonomic assignments of genomes based on their co-clustering with branches corresponding to the type or well-characterized strains of Bifidobacterium species and subspecies. ANI matrices were used to delineate the within-species structure for Bifidobacterium longum and Bifidobacterium catenulatum species. GTDB-based taxonomies of 2,820 additional genomes were retrieved from original publications and refined via ANI comparisons with 263 reference genomes. Species assignments were made if ANI to the closest reference genome was >95%, and subspecies assignments if >97%.

Gene prediction and functional annotation

Protein-coding sequences were predicted and annotated with Prokka (v1.14.6)94 using default settings. The 263 reference genomes were additionally annotated via RASTtk (v1.073)95 and EggNOG-mapper (v2.1.12)96 with default settings (Supplementary Table 13). The representation of CAZymes in 263 genomes was analysed using dbCAN (v4.0.0)97. HMMER, dbCAN-sub and DIAMOND-based searches against the CAZy database were used for CAZyme identification. Only GHs, CEs and polysaccharide lyases predicted by two or more methods were retained.

Subsystems-based annotation and in silico metabolic reconstruction of glycan utilization pathways

We used a subsystem-based approach implemented in the SEED platform35,95 to reconstruct carbohydrate utilization pathways in reference 263 Bifidobacterium strains. RASTtk-annotated genomes were uploaded to microbial community SEED (mcSEED), a clone of the SEED annotation environment. We created 25 subsystems that captured the representation of genes that implement functional roles (glycan transporters, CAZymes, downstream catabolic enzymes and transcriptional regulators) involved in carbohydrate utilization (Supplementary Table 4). The list of functional roles was compiled via extensive literature search and using the information from Transporter Classification (TCDB)98, Carbohydrate Active Enzyme (CAZy)99, Kyoto Encyclopedia of Genes and Genomes (KEGG)100 and RegPrecise101 databases. This knowledge was used to manually curate the automated functional annotations of protein-coding genes in a subset of reference Bifidobacterium genomes (for example, type strains) where the respective functional roles had been experimentally characterized. Additional functional gene annotation to fill gaps in metabolic pathways was based on three genome context techniques: (1) clustering of genes on the chromosome (operons), (2) co-regulation of genes by a common transcription factor and (3) co-occurrence of genes across related genomes. The approach used to reconstruct the regulons of transcription factors potentially regulating carbohydrate metabolism is outlined in Supplementary Methods.

The propagation of curated annotations (corresponding to 589 distinct functional roles) across all 263 reference genomes was performed using homology-based methods implemented in mcSEED. Orthologues were detected automatically using predefined protein family classifications such as PGFams (cross-genus protein families) and PLFams (genus-specific protein families)84. These assignments were manually refined by examining the gene neighbourhood for each functional role. Genes with conserved gene neighbourhoods were classified as orthologous, while paralogues were assigned distinct functional annotations. Overall, 39,589 out of 541,418 protein-coding gene annotations in 263 reference Bifidobacterium genomes were curated.

Next, we reconstructed 72 catabolic pathways spanning 25 subsystems (Supplementary Table 5). Many pathways included alternative biochemical modules (routes) driven by different sets of catabolic enzymes and diverse glycan transporters. For each pathway, we defined genomic signatures—sets of genes encoding functional roles that together represent the minimal gene complement required to form a complete pathway for the utilization or degradation of a specific carbohydrate (Supplementary Table 6). ‘Utilization’ was defined as a process in which a glycan molecule (mono- or oligosaccharide) is transported into the cell and then catabolized using a combination of CAZymes and downstream catabolic enzymes. ‘Degradation’ referred to the partial hydrolysis of a glycan (for example, a polysaccharide) by extracellular CAZymes, with the resulting mono-, di- and oligosaccharides subsequently transported into the cell and metabolized.

We used the presence or absence of genes matching each genomic signature to assign a detailed pathway variant to each genome: (1) transporter + catabolic pathway (‘U’), (2) catabolic pathway without transporter (‘P’) and (3) no catabolic pathway (‘N’) (Supplementary Tables 6 and 8). For the purpose of automated phenotype profiling, these assignments were simplified into binary phenotypes: a complete pathway (‘U’) indicated predicted utilization or degradation and was assigned phenotype ‘1’, while incomplete (‘P’) or missing (‘N’) pathway variants indicated no utilization and were assigned phenotype ‘0’. The resulting set of 72 predicted carbohydrate utilization phenotypes across 263 strains comprised the reference BPM (Supplementary Table 7). Four pathways (GalNAc, ManAc, Man and GalA), for which all strains in the reference set were assigned variants ‘P’ or ‘N’ and phenotype ‘0’, respectively, were retained in the BPM but excluded from the pathway prediction pipeline.

Pathway prediction pipeline (glycobif)

The workflow used to predict the carbohydrate utilization pathways encoded in additional 2,820 Bifidobacterium genomes is schematically illustrated in Fig. 1a. We first constructed a reference database containing (1) 39,589 functionally annotated protein sequences from 25 curated metabolic subsystems across 263 reference genomes and (2) an additional set of 52,990 outgroup proteins (not captured in the subsystems) clustered at 95% amino acid identity and 95% coverage using MMSeqs2 (v14.7e284)102. Proteins encoded in the 2,820 query genomes were annotated by mapping their sequences to the reference database using DIAMOND (v2.1.4)103. To handle multidomain proteins, we first selected the top 50 hits for each query based on bitscore and clustered the alignment coordinates using DBSCAN (scikit-learn v1.2.1)104. Cluster centres were treated as potential domain boundaries, which were used to split query proteins into discrete domains, with database hits attributed to each individual domain. For each resulting domain with ≥35 amino acids, we applied the Gaussian kernel density modelling (KernelDensity function from sklearn.neighbors) to the distribution of sequence identity values and used the highest local minimum (argrelextrema function from sklearn.signal) to filter out low-confidence hits. Annotations were assigned by majority vote from high-scoring, domain-specific reference hits. High-identity hits to outgroup proteins from the reference database were used as criteria to vote against applying annotation to a given query. This pipeline yielded 419,055 annotated protein sequences.

These annotations, together with the BPM for 263 reference genomes, were integrated into a machine learning-based pipeline to predict the presence (‘1’) or absence (‘0’) of carbohydrate utilization pathways in additional 2,820 genomes. We evaluated over 30 machine learning methods implemented in the Caret package (v6.0.86)105 using a leave-one-out cross-validation approach: for each reference genome, a model was trained on the remaining 262 genomes and then used to predict the binary variant for the held-out genome. This process was repeated for every pathway. Random forest was identified as the best-performing model based on prediction accuracy. Pathway-specific random forest models were then trained using the full reference set, excluding four pathways (GalNAc, ManAc, Man and GalA) for which all genomes had predicted binary phenotype ‘0’. The list of functional roles used as model predictors for each remaining pathway was manually curated to match the genomic signatures delineating pathway variants (Supplementary Table 6). Model parameters were optimized using grid search, and mock genomes with custom genomic signatures were added to the training set to ensure that rare combinations from the reference collection were adequately learned and to improve model performance on incomplete MAGs. The resulting models were used to predict the presence or absence of 68 carbohydrate utilization pathways in 2,820 genomes (Supplementary Table 9).

The representation of 29 additional metabolic pathways (for example, the biosynthesis of B vitamins and amino acids) in both 263 reference and 2,820 additional genomes was inferred via a similar annotation pipeline as recently published106 (Supplementary Table 11).

Visualization and statistical analysis

NMDS was used for the ordination of the Hamming distance matrix calculated from the merged BPM of 3,083 genomes. PERMANOVA and a test for the homogeneity of multivariate dispersions were performed using the adonis2 and betadisper functions in the vegan package (v2.6-8)107. Differences in predicted phenotypic richness were assessed via a GLM assuming Poisson distribution; post-hoc comparisons with Bonferroni correction were conducted using the emmeans package (v1.10.6)108. Fisher’s exact test was used for the enrichment analysis of pathway representation. For each of the 68 carbohydrate utilization pathways, 2 × 2 contingency tables were constructed. Rows denoted the predicted absence or presence of pathways (‘0’ or ‘1’), whereas columns represented metadata categories: age group (age <3 years versus age ≥3 years) or host lifestyle (Westernized versus non-Westernized)61. P values were corrected for multiple testing using the Benjamini–Hochberg procedure, with adjusted P ≤ 0.01 considered significant (Supplementary Table 14). Odds ratios were calculated using the fisher.test function. PCA was used for the ordination of growth and HMO consumption data. Data visualization was performed using ggplot2 (v3.5.1)109, ComplexHeatmap (v2.18.0)110 and clinker (v0.0.27)111. Code for all analyses is provided in Supplementary Code File 1.

In vitro growth of Bifidobacterium strains on selected carbohydrates

All strains except B. scardovii JCM 12489 were routinely grown in dextrose-free lactobacilli De Man–Rogosa–Sharpe broth (Alpha Biosciences) supplemented with 0.34% (wt/vol) sodium ascorbate, 0.029% (wt/vol) l-cysteine–HCl monohydrate and 10 mg ml−1 Lac (MRS-AC-Lac) in a chamber maintained with a gas mix of 10% H2, 10% CO2 and 80% N2 (Coy Laboratory Products). The MRS-C-Lac medium without sodium ascorbate was used for B. scardovii JCM 12489. The growth of 30 Bifidobacterium strains on 47 substrates (Supplementary Table 15) was measured using a custom carbohydrate array constructed in flat-bottom half-area 96-well plates (Costar). Wells were filled with 55 µl of a sterilized carbohydrate stock at 2× concentration (10–20 mg ml−1) and transferred to the chamber 48 h before the experiment. Each substrate was tested in triplicate (or duplicate in a few cases). Sterile water and Lac served as negative and positive controls.

Cultures for assay inoculations were grown overnight at 37 °C in MRS-AC-Lac, diluted 1:100 into fresh medium, and incubated for 8 h. Cultures were then adjusted to OD600 of 0.6, and 100-μl aliquots were collected at 5,000g for 5 min and resuspended in 6 ml of 2× MRS-AC (2× MRS-C for B. scardovii) without added carbohydrate. Each well in the carbohydrate array was loaded with 55 µl of inoculated 2× medium to make individual 110-µl cultures. Growth was monitored by measuring OD600 every 15 or 30 min for 40 h under constant linear shaking using a Synergy H1 Plate Reader (BioTek Instruments). Raw OD600 data were exported using Gen5 software (v2.05.5) and analysed in R (v4.3.2). For all strains except B. pseudocatenulatum M26A_2F1 growth on each glycan was classified as follows: (1) no growth (‘−’) if the mean OD600 never exceeded 10% of the strain’s highest OD600 within a panel of carbon sources, (2) weak growth (‘w’) if the mean OD600 did not exceed 25% of maximum OD600, and (3) growth (‘+’) if the mean OD600 was above 25% of maximum OD600 (Supplementary Fig. 4 and Supplementary Table 16). For B. pseudocatenulatum M26A_2F1, thresholds were increased to 22% and 55%, respectively, owing to some outgrowth in basal MRS-AC lacking an added substrate.

Comparison of predicted phenotypes with growth data

Predicted carbohydrate utilization phenotypes were compared with growth data from this work or prior literature8,28,36. A predicted binary phenotype ‘1’ was classified as a true positive (TP) if it matched growth phenotypes ‘+/w’ and as a false positive (FP) if it matched ‘−’. A predicted binary phenotype ‘0’ was considered a true negative (TN) if it matched growth phenotype ‘−’ and a false negative (FN) if it matched ‘+/w’. Most predicted utilization phenotypes were tested using individual substrates. Exceptions included phenotypes IMO, for which the growth was tested on panose and isomaltotriose, BglOS (cellobiose, gentiobiose, laminaritriose and sophorose) and SHMO (3′-sialyllactose (3′SL) and 6′-sialyllactose (6′SL)). For these cases, growth (‘+/w’) on at least one substrate was considered as a TP for a predicted binary phenotype ‘1’, and the absence of growth on all tested substrates was considered a TN for ‘0’. Standard binary classification metrics were calculated from TP, TN, FP and FN counts (Supplementary Table 17).

Glycoprofiling of HMO consumption

Overnight cultures of Bifidobacterium strains grown in MRS-AC-Lac were inoculated (0.5–1% (vol/vol)) into fresh medium and cultured to OD600 of 0.6. Cells were collected at 5,000g for 5 min, washed with sugar-free MRS-AC, and used to inoculate 200 µl of MRS-AC supplemented with an HMO mixture (10 mg ml−1, isolated from pooled human milk from ≥25 different donors) in 96-well plates (Costar) at OD600 of 0.005. Triplicate 100-µl samples of culture supernatant were collected at 8-h or 24-h timepoints in separate experiments, filtered through Spin-X Centrifuge Tube Filter (0.22 µm, Costar), snap-frozen in liquid nitrogen and stored at −80 °C. HMO concentrations in supernatants were analysed by HPLC with fluorescence detection112. Media samples (10 µl) were spiked with maltose as an internal standard, lyophilized and labelled with the fluorophore 2-aminobenzamide (2AB). 2AB-labelled HMOs were separated on a TSKgel Amide-80 column (15 cm length, 2 mm inner diameter, 3 μm particle size; Tosoh Bioscience) and detected at 360-nm excitation and 425-nm emission on Dionex Ultimate 3000 (Thermo Fisher Scientific). Nineteen HMO structures were annotated on the basis of standard retention times and quantified relative to the internal standard: 2′FL, 3FL, LDFT, 3′SL, 6′SL, LNT, LNnT, LNFP I, II and III, sialyllacto-N-tetraose (LST) b and c, difucosyllacto-N-tetraose (DFLNT), LNH, disialyllacto-N-tetraose (DSLNT), fucosyllacto-N-hexaose (FLNH), difucosyllacto-N-hexaose (DFLNH), fucodisialyllacto-N-hexaose (FDSLNH) and disialyllacto-N-hexaose (DSLNH). Total HMO concentration was calculated as the sum of individual HMO concentrations (Supplementary Table 18). HMO utilization at 8 or 24 h was calculated relative to HMO concentrations in the medium controls.

Transcriptional profiling (RNA-seq)

An overnight culture of Bc. kashiwanohense Bg42221_1E1 grown in MRS-AC-Glc was collected at 5,000g for 5 min, washed in sugar-free MRS-AC and used to inoculate MRS-AC supplemented with either 10 mg ml−1 of Glc, Lac, LNT or 5 mg ml−1 of tamarind XGL at OD600 of 0.01. Samples (2 ml, biological triplicates) were collected at the early exponential phase (OD600 of 0.35) and pelleted in a prechilled centrifuge at 4,800g for 5 min. Cell pellets were snap-frozen in liquid nitrogen and stored at −80 °C until further use. The detailed RNA extraction protocol is described in Supplementary Methods. Ribosomal RNA was depleted with the NEBNext rRNA depletion kit for bacteria (New England Biolabs) and a set of 20 pooled sequence-specific probes for Bc. kashiwanohense Bg42221_1E1 designed using the NEBNext Custom RNA Depletion Design Tool v1.0 (Supplementary Table 20). Barcoded libraries were made with the NEBNext Ultra II directional RNA library prep kit for Illumina (New England Biolabs). Libraries were pooled and sequenced (single-end 75-bp reads) on Illumina NextSeq 500 using the High Output V2 kit (Illumina). The details of sequencing data analysis are provided in Supplementary Methods.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.