Fig. 3: MASS is applicable to larger datasets and recapitulates biological understanding.

a Fermentation phenotypes of 637 bacterial species (rows) grown on 46 different carbon sources (columns), downloaded from BacDive. Fermentation of a given carbon source is indicated in yellow, and absence of fermentation in purple. Marginal bar charts summarize the phenotype frequency for each species and each carbon source, respectively. b Matrix showing the MASS result: whether each carbon source was used as a predictor (gray) or as a response (black), as a function of the total number of predictors allowed (parameter p). c Shannon entropy of each carbon source. b, c Media are arranged in descending order of how frequently they were used as predictors. d Average Matthews correlation coefficient (MCC) of random forest classifiers for each number of predictors, p. The classifiers were trained using either the MASS selection of predictors (blue), predictor sets selected based on maximum Shannon entropy (red), or 300 random draws of conditions used as predictors (green). Each point represents the mean MCC obtained via fivefold cross-validation; the thick lines show the mean of these per-point values at each p. e Increasing the number of fermented carbon sources selected as predictors with MASS (focusing on up to p = 8 predictors) reveals a hierarchy of the most descriptive carbohydrate monomers (f). g These monomers enter central carbon metabolism via different routes. Only the relevant parts of glycolysis are shown, and reactions are differentiated into direct, single-step reactions (black arrows) and indirect, multi-step reactions (gray arrow). Source Data for Fig. 3c, d is available in Supplementary Data 4.
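The two quantities named in panels c and d can be illustrated with a minimal sketch. It assumes binary phenotypes coded as 1 (fermentation) and 0 (no fermentation) and entropy measured in bits; the caption does not state the logarithm base, so that choice is an assumption. The function names `shannon_entropy` and `mcc` and the toy vectors are hypothetical and are not taken from the MASS implementation or from BacDive.

```python
import math


def shannon_entropy(column):
    """Shannon entropy in bits of one binary phenotype column.

    Assumption: entries are 1 (fermentation) or 0 (no fermentation).
    """
    n = len(column)
    p_positive = sum(column) / n
    entropy = 0.0
    for p in (p_positive, 1.0 - p_positive):
        if p > 0:  # a zero-probability outcome contributes nothing
            entropy -= p * math.log2(p)
    return entropy


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data, not drawn from the study.
balanced_source = [1, 0, 1, 0]   # fermented by half of the species
uniform_source = [1, 1, 1, 1]    # fermented by every species

print(shannon_entropy(balanced_source))  # maximal for a binary phenotype
print(shannon_entropy(uniform_source))   # carries no information
print(mcc([1, 1, 0, 0], [1, 0, 1, 0]))   # prediction no better than chance
```

A carbon source fermented by about half of the species has the highest entropy, which is the rationale behind the entropy-based baseline in panel d. The MCC ranges from -1 (every prediction wrong) through 0 (chance level) to 1 (every prediction correct).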