Extended Data Fig. 1: Comprehensiveness of our cultured bacterial strain library and algorithm Strainer.

a, Proportion of bacterial reads in the metagenomics sample that are explained by the genome sequences of the cultured strain library for that sample (n = 20 biologically independent samples). Each point in the plot corresponds to a separate sample. The lower and upper bounds of the box correspond to the 25th and 75th percentiles, respectively, with the median shown as the centre line. The upper whisker extends to the maximum, while the lower whisker extends to 1.5 times the interquartile range. Points beyond this lower limit are also plotted. b, Proportion of bacterial reads explained by the cultured strain library for a donor after gavaging germ-free mice (n = 3 independent replicates) with stool from the corresponding human donors (n = 3) and performing metagenomics on the mouse faecal samples. Each point corresponds to a separate sample. Data for the mouse replicates of each donor sample are presented as mean values ± SEM. c, Percentage similarity between different isolates (n = 96) of the species Bacteroides ovatus and the reference strain AAXF00000000.2. Similarity is calculated by comparing sequence k-mers of length 31 between genomes. Each point in the boxplot corresponds to a separate sample. The lower and upper bounds of the box correspond to the 25th and 75th percentiles, respectively, with the median shown as the centre line. The upper whisker extends to the maximum and the lower whisker extends to the minimum. d, Proportion of bacterial reads in the metagenomics sample that are explained by the genome sequences of the cultured strain library for that sample. Each point in the boxplot corresponds to a separate sample. e, Overview of our algorithm Strainer. The algorithm has three modules. Module-1 finds the unique and likely informative sequence k-mers for each strain by removing those shared extensively with unrelated sequenced strains in NCBI, unrelated metagenomics samples, and strains cultured and sequenced in this study.
Next, we decompose each sequencing read in the metagenomics sample of interest into its k-mers, and identify reads that contain k-mers belonging to multiple strains or that have <95% informative k-mers for a single strain. We then remove these non-informative k-mers from the previous set. In Module-2, we assign sequencing reads from the metagenomics sample of interest that have a majority (>95%) of informative k-mers to the corresponding strain. Next, we map these reads to the genome of the corresponding strain and retain only the non-overlapping ones. This step normalizes for sequencing depth across samples and checks for evenness of read distribution across the bacterial genome. Finally, in Module-3, we compare the read enrichment in a sample to that in unrelated samples or negative controls and present summary statistics for the presence or absence of a strain in a sample.
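
The k-mer filtering and read-assignment steps of panel e can be illustrated with a minimal Python sketch. It is not the Strainer implementation: the function names (`kmers`, `informative_kmers`, `assign_read`) and the `background` set are hypothetical, and k = 31 is borrowed from panel c because the legend does not state the value used by Strainer. The sketch also makes three simplifying assumptions: any shared k-mer is removed (the legend says only "shared extensively"), the 95% threshold is taken as the fraction of a read's k-mers that are informative for one strain, and a read at exactly 95% is left unassigned.

```python
from typing import Dict, Optional, Set

K = 31  # assumed k-mer length (the legend states 31 only for panel c)


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all k-mers in a sequence (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def informative_kmers(genomes: Dict[str, str],
                      background: Set[str],
                      k: int = K) -> Dict[str, Set[str]]:
    """Module-1-like step: keep k-mers unique to each strain.

    Simplification: a k-mer is dropped if it occurs in any other strain
    or in the background set, rather than if it is 'shared extensively'.
    """
    per_strain = {name: kmers(seq, k) for name, seq in genomes.items()}
    informative = {}
    for name, own in per_strain.items():
        others = set().union(*(v for n, v in per_strain.items() if n != name))
        informative[name] = own - others - background
    return informative


def assign_read(read: str,
                informative: Dict[str, Set[str]],
                k: int = K,
                min_fraction: float = 0.95) -> Optional[str]:
    """Module-2-like step: assign a read to at most one strain.

    Returns None if the read hits informative k-mers of several strains,
    or if the informative fraction does not exceed min_fraction.
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None
    hits = {s: len(read_kmers & ks) for s, ks in informative.items()}
    hit_strains = [s for s, n in hits.items() if n > 0]
    if len(hit_strains) != 1:
        return None  # no strain, or k-mers belonging to multiple strains
    strain = hit_strains[0]
    if hits[strain] / len(read_kmers) > min_fraction:
        return strain
    return None
```

With toy genomes and a short k, for example `informative_kmers({"A": "ACGTACGT", "B": "TTTTGGGG"}, set(), k=3)`, a read drawn entirely from one genome is assigned to that strain, while a read spanning k-mers of both strains is left unassigned. The mapping, depth normalization and enrichment statistics of Module-2 and Module-3 are not covered by this sketch.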