Fig. 4: Comparison of beta diversity between communities calculated by taxonomy vs. nucleotide k-mer composition.

a Percentage of reads classifiable at any taxonomic rank, by cohort, based on a reference database of all genomes of “scaffold” quality or higher in RefSeq and GenBank as of January 2020. Read classification is higher in western vs. nonwestern microbiomes (one-sided Wilcoxon rank-sum test between Soweto and Sweden, p = 2.56e−8), and higher in Soweto relative to Bushbuckridge (one-sided Wilcoxon rank-sum test, p = 2.43e−4). b Comparison of microbiome sequence data using k-mer sketches, a reference-free approach that allows comparison of nucleotide sequence composition. Briefly, a hash function generates signatures at varying k-mer lengths (k), and the resulting k-mer sketches can be compared between samples. The plot shows non-metric multidimensional scaling (NMDS) of angular distance values between each pair of samples at k = 31 (approx. species-level)61. c–e Comparison of pairwise beta diversity within communities using Bray–Curtis distance for species and angular distance for nucleotide k-mer sketches. c Species beta diversity is higher in Soweto vs. all populations (one-sided Wilcoxon rank-sum test, FDR-adjusted q < 2.7e−16 for all tests) except for the United States, where beta diversity in Soweto is lower (one-sided Wilcoxon rank-sum test, q = 4.05e−6). Nucleotide k-mer diversity is higher in Soweto vs. all populations (one-sided Wilcoxon rank-sum test, FDR-adjusted q < 2.2e−16 for all tests). d Species beta diversity is higher in Sweden compared to Bushbuckridge, but nucleotide k-mer distance is higher in Bushbuckridge (p < 2.22e−16 for both tests). e Species beta diversity is higher in the United States cohort compared to the Malagasy cohort, but nucleotide k-mer distance is higher in the Malagasy cohort (p < 2.22e−16 for species, p = 0.034 for k-mer).
For all box plots in a, c–e, lower and upper hinges correspond to the first and third quartiles, upper and lower whiskers extend to the highest and lowest values within 1.5 times the interquartile range of the hinges, and the horizontal line represents the median. Significance values for two-sided Wilcoxon rank-sum tests are denoted as follows: *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001. One sample per participant; sample sizes in a–e are: n = 22 Tanzania, n = 112 Madagascar, n = 90 Burkina Faso, n = 118 Bushbuckridge, n = 51 Soweto, n = 100 Sweden, n = 134 United States.