Fig. 1: Taxonomic composition of South African study participant microbiota.

Sequence data were taxonomically classified using Kraken 2 with a database containing all genomes in RefSeq and GenBank of “scaffold” quality or better as of January 2020. a Top 20 genera by mean relative abundance for samples from participants in Bushbuckridge and Soweto, sorted by decreasing Prevotella abundance. Prevotella, Bacteroides, and Faecalibacterium are the most prevalent genera across both study sites. b Relative abundance of VANISH genera by study site, grouped by family (n = 118 Bushbuckridge, n = 51 Soweto). A pseudocount of 1 read was added to each sample prior to relative abundance normalization in order to plot on a log scale, as the abundance of some genera in some samples is zero. Relative abundance values of most VANISH genera are higher on average in participants from Bushbuckridge than Soweto (two-sided Wilcoxon rank-sum test, significance values denoted as follows: *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, (ns) not significant). Exact p values from left to right: 3.91e−2, 3.28e−1, 1.60e−2, 4.55e−3, 6.64e−3, 1.93e−5, 9.20e−3, 7.29e−3, 6.93e−2, 6.87e−4, 1.64e−11, 7.66e−6, 1.02e−7. Box plot lower and upper hinges correspond to the first and third quartiles, the horizontal line represents the median, and the upper and lower whiskers extend to the highest and lowest values within 1.5 times the interquartile range of the respective hinge.
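The pseudocount normalization and box plot conventions described for panel b can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes the pseudocount is added to every genus count within a sample before dividing by the adjusted total, and that quartiles are computed by linear interpolation; the caption specifies neither. The function names and read counts are hypothetical.

```python
import math
import statistics


def relative_abundance_with_pseudocount(counts, pseudocount=1):
    """Add a pseudocount to each genus count, then normalize to sum to 1.

    counts: dict mapping genus name -> read count for one sample.
    Assumption: the pseudocount is applied per genus (not stated in the caption).
    """
    adjusted = {genus: n + pseudocount for genus, n in counts.items()}
    total = sum(adjusted.values())
    return {genus: n / total for genus, n in adjusted.items()}


def box_plot_stats(values):
    """Hinges, median, and whiskers as described in the caption.

    Whiskers are the most extreme values lying within 1.5 * IQR of the hinges.
    Assumption: quartiles use linear interpolation ("inclusive" method).
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical read counts for one sample; Treponema has zero reads.
sample = {"Prevotella": 700, "Bacteroides": 297, "Treponema": 0}
rel = relative_abundance_with_pseudocount(sample)

# With the pseudocount, every value is positive, so log10 is defined.
log_rel = {genus: math.log10(v) for genus, v in rel.items()}

# Hypothetical values with one outlier beyond the upper whisker.
stats = box_plot_stats([1, 2, 3, 4, 100])
```

Without the pseudocount, the zero-count genus would have a relative abundance of 0, which cannot be placed on a log axis.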