Main

Large population studies can identify lifestyle, genetic and environmental factors that drive gut microbiome composition. Indeed, early studies5,6 established baseline human gut microbiome measurements, and more recent studies have related the gut microbiome to disease and lifestyle factors7,8,9. However, because these studies typically focus on high-income populations with relatively homogeneous resource access and disease profiles, their results often do not translate to populations with different lifestyle practices, health challenges and access to healthcare, and varied environmental exposures. In addition, large microbiome studies have typically relied on facility-based convenience-sampling models, which generalize less well to the population level than more resource-intensive cross-sectional sampling.

Low- and middle-income countries (LMICs) account for approximately 84% of the world’s population1 but are extremely under-represented in gut microbiome research2,3,4. China, an upper-middle income country, is an outlier with several recent large population studies10,11,12. Targeted studies in LMICs have found microbiome associations with infectious disease13,14 and compositional differences between microbiomes of specific LMIC and high-income country (HIC) populations15,16,17,18, including in the context of non-communicable diseases19. Several studies have also evaluated early childhood microbiomes in LMICs, especially in relation to malnutrition20,21,22,23,24. However, comprehensive measurement of global lifestyle diversity that affects gut microbiome composition requires large, population-representative studies. It is essential to work within frameworks that support representative measurements from populations, facilitate input and leadership from local stakeholders, and clearly identify community needs25,26,27.

The Africa Wits-INDEPTH Partnership for Genomic Studies (AWI-Gen)28 provides a powerful framework for population-representative and community-engaged research (Extended Data Fig. 1). Nested within the Human Heredity and Health in Africa Consortium (H3Africa), AWI-Gen studies genomic and environmental factors affecting the changing disease burden among adults in six communities in four African countries. The study is a partnership between the University of the Witwatersrand and the International Network for the Demographic Evaluation of Populations and Their Health (INDEPTH), a network of health and demographic surveillance systems (HDSSs) in LMICs. Together, the five AWI-Gen HDSSs and the Developmental Pathways for Health Research Unit in Soweto, South Africa support random cross-sectional population sampling. By contrast, most extant microbiome and genomics data are based on non-random recruitment of self-selecting participants, which is not optimal for capturing population-level trends. Each HDSS has engaged with host communities for over a decade, conducting engagement before study approval and through study conclusion29,30 that ensures high participant retention while focusing on community needs. AWI-Gen collected blood and urine biomarkers, captured extensive participant data—demographic, health history, environment and lifestyle—and genotyped all participants on the H3Africa Custom SNP Array31. Emphasizing genomics capacity-building and equitable collaborations, AWI-Gen presents a unique opportunity for microbiome research in understudied populations and holds immense potential for associating the microbiome with rich genotype and phenotype data. The first phase of AWI-Gen ran from 2012 to 2017, during which we conducted pilot microbiome projects at two South African sites32,33.

Here we present the second phase of the AWI-Gen Microbiome Project. From 2018 to 2023, we randomly sampled 1,820 adults (1,801 women and 19 men) from well-characterized populations in six research centres in Burkina Faso, Ghana, Kenya and South Africa. These centres have widely different population densities, subsistence strategies, income levels and disease profiles. Leveraging extensive clinical and demographic data, we find that geography has the strongest effect on microbiome variation. We assemble thousands of prokaryotic and phage genomes, including hundreds for Treponema succinifaciens, a hallmark bacterial species previously described as absent in industrial populations. Finally, we find HIV-associated differences in microbiome composition that differ from those described in HIC populations. Altogether, this study demonstrates the importance of investigating the gut microbiome in undersampled populations, provides a framework for equitable microbiome research and represents the largest population-representative profile of African gut microbiomes so far.

The AWI-Gen 2 Microbiome Project enrolled participants from rural villages in Nanoro, Burkina Faso34 (n = 384), Navrongo, Ghana35 (n = 235), the Agincourt-Bushbuckridge subdistrict in South Africa36 (n = 533) and Dikgale, South Africa37, in which the HDSS is now called DIMAMO (n = 203), from the township of Soweto, South Africa (n = 226) and from the Korogocho and Viwandani urban informal settlements in Nairobi, Kenya38 (n = 239) (Fig. 1a). Participants were a cross-sectional representation of the adults in the HDSS catchment areas (Supplementary Methods). The study communities span rural, peri-urban and urban areas, and therefore have drastic differences in population density, water sanitation, access to healthcare and disease profiles (Table 1 and Supplementary Data 1). Briefly, the Nanoro and Navrongo study centres are in primarily horticultural rural regions of western Africa where subsistence farming and cattle-keeping are dominant subsistence strategies. The Agincourt and DIMAMO centres largely consist of semi-rural villages that are undergoing rapid epidemiological transition and industrialization. Soweto is a district within the city of Johannesburg, which under apartheid was designated as an area for black people to live; as a post-industrial area, employment in Soweto is often related to business, retail and industry, but unemployment among women remains above 60% (ref. 39). The Nairobi centre captures two urban informal settlements where population density is very high and residents have limited access to piped water and sanitation.

Fig. 1: Microbiome composition and diversity in the AWI-Gen 2 cohort.
figure 1

a, Sample number and location of each study site. Countries containing sites are dark grey. b, Principal coordinate analysis of all samples on the basis of Bray–Curtis distance on species-level prokaryotic profiles. Study site is colour-coded and the boxplots show the samples per site projected onto the first and second principal coordinate. c, Prokaryotic diversity (inverse Simpson’s index after rarefaction) per site (Kruskal–Wallis test, P < 2 × 10−16, n = 1,796 after quality control and removing data from male individuals). d, Heatmap showing the number of prokaryotic species with high generalized fold change between sites; sites are clustered on the basis of this number of species. e, The log10(relative abundance) of genera with the highest variance in fold change and median across sites. f, The log10 of the mean relative abundance per site is shown for all species within the genera shown in e. For Prevotella, Oribacterium, Cryptobacteroides and Treponema, all species with scientific names are highlighted; only the top abundant species with scientific names are indicated for the other genera. All panels represent data from n = 1,796 biologically independent samples. Boxplot boxes denote the interquartile range (IQR), thick black lines indicate the median, and whiskers indicate the most extreme points within 1.5-fold IQR. Supplementary Methods contain photographs and further information for each site.

Source Data

Table 1 Site-level health and demographic summary statistics

AWI-Gen 2 is a population cross-sectional study of adults aged 32–98 (99% were between 41 and 84). Pregnant women and people who had been resident for fewer than 10 years were excluded. At the point of recruitment into AWI-Gen, only one person per household was included. Most participants were women (n = 1,801), although a small number of men (n = 19) were sampled as well. The focus on women was motivated by downstream interest in combining these data with a companion menopause study. Samples from men had significantly lower alpha-diversity (P = 0.027) (Extended Data Fig. 2), and were excluded for all site comparisons presented below; however, given the poor representation of these populations in existing microbiome studies, samples from men were included in the genome catalogues presented below. Participants completed a questionnaire with guidance from a field worker, and donated blood, urine and single stool samples. Stool samples were collected in temperature-stable buffer and processed at a single time in the same location to minimize handling bias (Methods). DNA was extracted from each stool sample, followed by 2 × 150 base pair paired-end sequencing. We generated a median of 44.16 million (M) (range 27.48M to 104.79M) reads per sample, with a median of 31.20M (range 17.80M to 72.95M) reads remaining after quality control and removal of human reads (Methods, Extended Data Fig. 3 and Supplementary Data 2). For an extended description of each study centre and recruitment methodology, see the Supplementary Methods.

Taxonomic composition across sites

We first characterized the overall taxonomic composition in the study populations. We performed taxonomic classification with mOTUs3 using an updated database that incorporates new genomes found in this study, and summarized features with the GTDB taxonomy (Methods and Extended Data Fig. 3). After clustering samples by overall microbiome composition, the primary axis of variation captures a trade-off in relative abundance between Bacteroidota and Bacillota A (Fig. 1b and Extended Data Fig. 4), and correlates with the abundance of the archaeal phylum Methanobacteriota. The second principal coordinate (PCo2) captures site differences (Fig. 1b), generally ordering samples along a gradient corresponding to site population densities, subsistence strategies, environments and sociodemographic factors. The exception to this gradient is Nairobi, Kenya, which is a dense urban site yet it falls in the middle of the gradient. This second axis is correlated with the abundance of Spirochaetota and Elusimicrobiota, phyla that are described to decrease in relative abundance with industrialization40. Relative to external cohorts from western Europe and Japan, individuals in the AWI-Gen 2 cohort have higher relative abundance of Verrucomicrobiota and Spirochaetota and lower relative abundance of Actinomycetota (Extended Data Fig. 4).

Seeking to identify the geography, disease and lifestyle factors that have the greatest effect on compositional variation, we performed distance-based redundancy analysis with the available covariates, excluding highly correlated variables (Extended Data Fig. 5 and Methods). Site explains the greatest amount of compositional variation (7.92%), followed by other variables inherently related to the microbiome, such as recency of antibiotic use (0.79%), recency of diarrhoea (0.59%), use of deworming medication (0.51%) or probiotics use (0.46%). Interestingly, HIV status is the only disease-related variable explaining a sizable amount of variation (0.52%). Other disease variables included arthritis, obesity and hypertension among others (Extended Data Fig. 5).

To delve deeper into differences between sites, we investigated microbial diversity and abundance. Prokaryotic diversity differs significantly between sites (Kruskal–Wallis P ≤ 2.2 × 10−16) (Fig. 1c), mirroring the site gradient observed for the second principal coordinate, with Nairobi again falling out of sequence. Many taxa have distinct relative abundance and prevalence patterns between study sites (Supplementary Data 3). Few taxa are differentially abundant between sites with similar population densities and subsistence strategies (Agincourt and DIMAMO, 3 species; Nanoro and Navrongo, 37 species) and several are differentially abundant between sites with distinct characteristics (Soweto and Nanoro, 345 species) (Fig. 1d and Supplementary Data 4). Among the bacterial genera that varied most across study sites (Methods), some have relative abundance that is positively correlated with the study site gradient observed previously, such as Phocaeicola, Bacteroides and Bifidobacterium, and others correspond to the inverse gradient, such as Prevotella, Cryptobacteroides, Oribacterium and Treponema_D (Fig. 1e). These abundance gradients are often shared among the bacterial species within each genus (Fig. 1f), and are also reflected in overall species prevalence (Extended Data Fig. 6a), where some taxa are ubiquitous and others have increasing or decreasing prevalence across the site gradient.

The comparison between Nairobi and Soweto, which are both industrial or post-industrial urban sites, is particularly interesting. Despite very high population density in Nairobi, individuals from Nairobi often have more similar microbiome composition to individuals from Agincourt and DIMAMO than to individuals from Soweto (Fig. 1d–f and Extended Data Fig. 6a,b). The relative abundance of Phocaeicola, Bacteroides, Prevotella and Treponema_D species is somewhat similar between individuals in Nairobi and individuals from semi-rural sites. By contrast, Bifidobacterium species have high relative abundance in both Soweto and Nairobi populations. Crucially, these findings illustrate that ‘urbanization’ or ‘industrialization’, which are commonly cited variables that impact the microbiome in previous research including our own, cannot adequately capture lifestyle and environmental differences between different urban areas and between different rural areas.

Altogether, site is the dominant factor explaining overall microbiome composition. Disease and medication variables explain smaller amounts of microbiome composition, and HIV status represents one of the largest subsequent contributors to microbiome variation after site. These findings indicate that study sites represent varied subsistence strategies, industrialization levels, health-care access and overall adversity that together affect microbiome composition. These findings also suggest a model for microbiome transition dynamics. Taxa such as Treponema_D are less abundant in sites practising large-scale agriculture or industry, whereas Bacteroides and Phocaeicola species gradually expand in abundance and Prevotella species gradually decrease. Despite reports in other cohorts41, we do not observe mutual exclusion between Prevotella and Bacteroides (Extended Data Fig. 6c). We observe interesting taxonomic profiles in participants from Nairobi, many of whom probably migrated to Nairobi from rural parts of Kenya38,42 into the informal settlements: high Prevotella abundance may reflect a microbial signature retained from participants’ former rural residences, whereas the high abundance of Bifidobacterium species and low abundance of Cryptobacteroides species may reflect taxa that are strongly influenced by changing environments.

Novel prokaryotic genomes

African gut microbes are under-represented in public reference collections, and when present, are often sourced from relatively isolated populations with lifestyle practices that are not representative of the African continent. To identify previously unknown taxa in the AWI-Gen 2 sample collection, we performed metagenomic assembly and binned contigs into metagenome-assembled genomes (MAGs), yielding a total of 69,539 genomes, of which 34,215 genomes are high quality (more than 90% complete and less than 5% contaminated) and 26,660 are medium quality (more than or equal to 50% complete, less than 10% contaminated). To condense redundant genomes, we dereplicated all MAGs with a minimum genome completeness of 50% and maximum genome contamination of 5% at 95% average nucleotide identity (ANI). The resulting 2,613 MAGs span 19 bacterial phyla (Fig. 2a and Supplementary Data 5). We constructed a protein catalogue from all medium- and high-quality MAGs clustered at 95% amino acid identity, yielding 63.8M unique proteins.

Fig. 2: Prokaryotic novelty and features.
figure 2

a, Phylogenetic tree of 2,584 dereplicated bacterial MAGs. Outer ring indicates study site of origin, inner ring indicates GTDB phylum and teal branches indicate Spirochaetota. b,c, Total number of previously unknown bacterial genomes by phylum (b) and new and existing bacterial genomes in the AWI-Gen assemblies (c), relative to the UHGG collection. Only representative genomes are shown. d, Archaeal phyla and species found in the AWI-Gen 2 genome collection. e,f, Self-reported antibiotic use (e) and hip circumference in centimetres (f) of n = 617 individuals from Nanoro, Burkina Faso and Navrongo, Ghana with and without T. succinifaciens present in the gut microbiome. Differences tested with a linear model that adjusted accounting for site (e) and for site and antibiotic history (f) as random effects. Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR. g, Prevalence of antibiotic resistance genes in n = 244 T. succinifaciens MAGs, ordered by drug class. h, Prevalence of glycoside hydrolase genes among the six species of Spirochaetota with the largest number of MAGs in the AWI-Gen 2 genome collection. Only glycoside hydrolases present in at least 5% of the genomes are shown.

Source Data

We compared our prokaryotic genome and protein catalogues to the Unified Human Gastrointestinal Genome (UHGG) catalogue of 4,744 prokaryotic species representatives and the Unified Human Gastrointestinal Protein 95 (UHGP95) catalogue of 20.5M proteins. The AWI-Gen 2 dataset includes 1,005 new prokaryotic MAGs relative to UHGG (Fig. 2b,c), and 7.6M new proteins relative to UHGP95 (Extended Data Fig. 7a). Most new bacterial MAGs fall under the phyla Bacillota A, Actinomycetota and Bacillota. We also observe 29 unique MAGs from the archaeal phyla Methanobacteriota, Thermoplasmatota and Halobacteriota (Fig. 2d), and nine are not found in the UHGG, indicating that the AWI-Gen 2 population contains substantial archaeal novelty. Most individual samples yielded several previously unknown prokaryotic genomes and tens of thousands of new proteins relative to reference collections (Extended Data Fig. 7b,c), with individuals in Nanoro contributing the most new genomes per individual.

Relative to recent large gut microbiome metagenomic datasets from other populations around the globe43,44,45,46,47, 598 MAGs are unique to the AWI-Gen 2 study, a much larger number than those contributed by other studies from outside the African continent (Extended Data Fig. 7d). Further, rarefaction analysis of the prokaryotic genomes (Extended Data Fig. 7e) and proteins (Extended Data Fig. 7f) generated from samples across each of the six study sites indicates that no feature has reached saturation. Crucially, these results imply that further measurement of gut microbiomes in these communities will continue to show new microbiome diversity.

The extensive AWI-Gen 2 microbial genome catalogue enables investigation of taxa that cannot be studied using standard microbiological techniques. One example is T. succinifaciens, a commensal anaerobic bacterial species in the phylum Spirochaetota that is non-spore forming, consumes a wide range of sugars and produces short-chain fatty acids and succinate48. T. succinifaciens is thought to be present in rural hunter-gatherer, pastoralist and agriculturalist populations, and lost in urban populations17,40. The first phase of AWI-Gen identified that T. succinifaciens is indeed present in urban populations33; however, we observe that T. succinifaciens abundance is inversely correlated with population density (Fig. 1e). Despite the emerging interest in this gut commensal, T. succinifaciens dispersal and acquisition are poorly understood.

Treponema are difficult to culture, making MAG catalogues an invaluable resource for understanding their biology. One complete genome from a cultured T. succinifaciens is available48, and this type strain was isolated from the swine gut49. Only 71 high-quality T. succinifaciens MAGs exist in UHGG, predominantly from human gut samples from populations in Madagascar, Peru and Fiji. Our genome catalogue includes 244 high-quality MAGs for T. succinifaciens, primarily from sites with low population density (Extended Data Fig. 8a and Supplementary Data 6). T. succinifaciens genomic features are characteristic of host-adapted microbes: regardless of site of origin, genome lengths are relatively small (2.52 ± 0.15 megabases; Extended Data Fig. 8b), and their shared core genome of 1,589 genes constitute most of each genome (68.59% ± 3.95%; Extended Data Fig. 8c). T. succinifaciens genomes from this study and others43,50 cluster by geographic origin, demonstrating a strong phylogeographic signal (Extended Data Fig. 8d; P = 0.001), further supporting a hypothesis of limited environmental dispersal.

We sought to further explore T. succinifaciens presence and absence in the AWI-Gen 2 population. In Nanoro, Burkina Faso and Navrongo, Ghana, where T. succinifaciens prevalence is high, individuals with T. succinifaciens reported less recent antibiotic use (P = 0.0075; Fig. 2e), and had a lower hip circumference (P = 0.012; Fig. 2f). Consistent with T. succinifaciens absence among individuals with recent antibiotic use, T. succinifaciens genomes have low prevalence of antibiotic resistance genes, excepting three genes related to vancomycin resistance (Fig. 2g). As vancomycin inhibits cell-wall synthesis in Gram-positive bacteria and T. succinifaciens is a Gram-negative bacterium, these genes may not be directly related to vancomycin resistance, but instead related to cell-wall synthesis and modification. Although this analysis does not capture antibiotic resistance genes on mobile genetic elements, it suggests limited antimicrobial resistance within the core genome. The association between T. succinifaciens presence and lower hip circumference may imply a connection between diet, nutrient availability and T. succinifaciens persistence. CAZyme profiling of high-quality MAGs within the Spirochaetota phylum demonstrates a broad glycoside hydrolase repertoire across species, and that T. succinifaciens has capacity to degrade hemicellulose and starch (Fig. 2h), indicating potential for T. succinifaciens loss during shifts to low fibre diets. We also observe that 58% of T. succinifaciens glycoside hydrolases are found in 95% of the T. succinifaciens genomes, implying that these genes are part of the core genome and probably vertically inherited rather than horizontally acquired. Together, these findings connect human exposure and phenotype data to T. succinifaciens persistence, and demonstrate how MAG catalogues built from populations in understudied areas can be used to investigate bacterial biology.

Viral fraction across sites

Although most gut microbiome research focuses on prokaryotes, this sample collection also represents a source of viral diversity. We generated a viral genome catalogue from all metagenomic assemblies and clustered all genomes at 95% ANI (Supplementary Data 7). Of 44,506 viral genomes, 381 are present at the assembly level in at least 18 individuals (about 1% prevalence) (Fig. 3a), and 2,701 (4.65% of the catalogue) are observed in at least 1% of participants in at least one study site. Similar to bacterial taxa, some genomes have higher prevalence in South African sites and others have higher prevalence in Nanoro and Navrongo. We compared the viral catalogue with the Metagenomic Gut Virus (MGV) catalogue of 54,118 viral operational taxonomic units, and find 40,135 new viruses relative to MGV (Fig. 3b). On average, each individual microbiome yielded dozens of new genomes relative to MGV (Extended Data Fig. 9a) and relative to a more recent viral catalogue51 (Extended Data Fig. 9b). Rarefaction indicates that this catalogue has not saturated viral discovery among the AWI-Gen populations (Extended Data Fig. 9c). Viral richness, measured as the number of assembled genomes per sample, does not follow the same population density site gradient as prokaryotic richness (Fig. 3c), even though prokaryotic and viral richness correlate well (Extended Data Fig. 9d). These findings were independent of sequencing depth (Extended Data Fig. 3b) and richness trends are similar with reference-based phage profiling (Extended Data Fig. 9e).

Fig. 3: Viral novelty and diversity in the AWI-Gen 2 cohort.
figure 3

a, Prevalence of viral genomes found in at least 18 individuals (approximately 1% of the AWI-Gen 2 population). Prevalence is measured as the proportion of the population in a given site that yielded an assembled viral genome that shares 95% ANI with the representative viral genome. Viral genomes that are new relative to the MGV catalogue and viral genomes that fall under the Crassvirales order are highlighted. b, Total number of new and existing viral genomes relative to MGV. c, Phage richness (number of phage species clusters present in each sample) per site (Kruskal–Wallis test P < 2 × 10−16, n = 1,796). Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR. d, Prevalence of Crassvirales viruses and prototypical crAssphage by site, determined by read-level abundance. e, Genome maps of nine previously unknown jumbophages with genome annotations and length in kb, and count of notable genetic features in each genome.

Source Data

Crassvirales, an order of abundant gut dsDNA bacteriophage, is prevalent across all sites (Fig. 3d). This finding is consistent with previous descriptions of crAss-like genera having 77% global prevalence52. P-crAssphage, the first discovered representative of the clade, is more prevalent in Soweto relative to other sites, again consistent with previous findings of low p-crAssphage prevalence in populations residing outside urban and highly industrial contexts53. We also identified several jumbophages (phages with genomes larger than 200 kilobases (kb)) and identified nine previously unknown jumbophages, with stringent thresholds of maximum alignment length of less than 10% to any genome in MGV and presence at the assembly level in at least five individuals (Fig. 3e). These jumbophage genomes largely consist of genes with no predicted functional annotation, but contain several features that relate to persistence in the host, including CRISPR arrays, sporulation regulators, addiction module toxins and large suites of tRNAs (Supplementary Data 8). These phages are prevalent in the studied cohort, with all reported new jumbophages reaching a prevalence of at least 5% in at least one site (Extended Data Fig. 9f). Further, one phage (phage A) shows evidence of integration into a Clostridium sp., and both phages A and G have unidirectional gene orientation. These results indicate the existence of several highly prevalent jumbophages that evaded previous discovery due to the narrow scope of previous population studies.

Taxonomic associations with HIV status

Finally, we investigated the relationship between the gut microbiome and HIV status. HIV represents one of the biggest public health concerns in the Kenyan and South African AWI-Gen study populations, especially as a rapidly increasing number of people aged 50 and older are living with HIV because of high antiretroviral therapy (ART) uptake: HIV prevalence was 17.2% among individuals aged 50–64 years in South Africa in 201754 and 9.1% among individuals aged 45–54 years in Kenya55. Despite advances in ART that have reduced population viral load and transmission, viral suppression is not sufficient to control HIV-related mortality and morbidity56, thus motivating deeper investigation into the gut microbiome as a possible mediator. Gut microbiota and their metabolites have been implicated in HIV-related inflammation and immune activation: gut-associated lymphoid tissue serves as a main reservoir for HIV57, and gut microbial metabolites can promote HIV transcription58. In turn, HIV infection can diminish epithelial barrier integrity59, allowing for microbial translocation that promotes immune activation and chronic inflammation60. Moreover, obesity has become a notable problem for individuals on the latest generation of ART such as dolutegravir61. In HICs, microbiome dysbiosis in people living with HIV (PLWH) is characterized by an enrichment of Pseudomonadota and depletion of Bacteroidota60,62. Perhaps confounded by a Prevotella-enrichment signature often observed in men who have sex with men63, it has even been described that the microbiomes of PLWH seem more similar to seronegative individuals in agrarian populations64. Few studies have measured associations between the gut microbiome and HIV status in African populations14,65,66,67, where baseline microbiome composition and disease profiles are distinct from those observed in HICs. Instead, studies have largely focused on men in HICs and have often been confounded by sexual practice.

We compared microbiome composition between women living with HIV on ART and seronegative (HIV−) women in Agincourt (PLWH n = 60; HIV− n = 341), Soweto (PLWH n = 50; HIV− n = 164) and Nairobi (PLWH n = 19; HIV− n = 214) (Fig. 4a). HIV status was not assessed in Nanoro and Navrongo because of low population prevalence, and DIMAMO was excluded from this analysis because only six individuals were found to be living with HIV. Our dataset also included participants with positive HIV status, but not self-reporting as receiving ART (Supplementary Data 9), possibly because they learned of their HIV diagnosis in the course of participation within AWI-Gen 2. Because of the low number of ART-naive PLWH in our dataset, we focused our analysis on HIV− and ART+ PLWH participants (see Extended Data Fig. 10 for comparison between ART+ and ART− PLWH). Prevalence of tuberculosis co-infection was low among PLWH (n = 6 individuals reported being on active treatment for tuberculosis) (Supplementary Data 9). Beyond co-infection, we do not expect many demographic differences between the HIV− and PLWH populations, but observe that PLWH are younger than HIV− individuals in Agincourt and Soweto, and have lower cholesterol in Nairobi and slightly lower waist-to-hip ratio in Agincourt (Extended Data Fig. 10).

Fig. 4: Microbial composition and diversity in PLWH.
figure 4

a, Number of seronegative individuals (HIV−) and PLWH on antiretroviral treatment. b, Prokaryotic diversity (inverse Simpson’s index on rarefied counts) by site and HIV status. Points represent individual samples. Differences in alpha-diversity for each individual site were tested with two-sided analysis of variance and for all sites combined with a linear mixed effect model accounting for site as a random effect. c, Principal coordinate analysis of species-level Bray–Curtis distance. Points represent individual samples, coloured by site, and PLWH are shaded. Boxplots show samples by HIV status projected onto the principal coordinates. d, Differentially abundant species (q-value < 0.01) determined by a linear mixed effect model. Species with q-value <1 × 10−5 are annotated. Shading indicates abundance fold change between seronegative individuals and PLWH. Black bars indicate previously unknown taxa from this study. e, ROC for machine learning models trained to distinguish HIV status on samples from each site or for all samples. Shading indicates the 95% confidence interval and numbers indicate AU-ROC. f, AU-ROC for machine learning model evaluation. Models trained on each site were applied to the other sites and external predictions were evaluated by means of AU-ROC. In LOSO validation, models were trained on two sites and validated on the left-out site. g, Fraction of samples from other sites predicted to be positive calibrated at a 5% false positive rate (indicated by dashed black line). For DIMAMO, HIV status is known and false positive rate and true positive rate can be evaluated. Serostatus is unknown for individuals in Nanoro and Navrongo but is expected to be below 2%. Boxplot boxes denote the IQR, thick black lines indicate the median and whiskers indicate the most extreme points within 1.5-fold IQR.

Source Data

First, we compared the microbiome composition of PLWH and HIV− individuals using alpha- and beta-diversity metrics. Consistent with previous descriptions14, alpha-diversity is lower in PLWH overall and within each study site (Fig. 4b). In terms of the beta-diversity, HIV status significantly varies over the second principal coordinate axis (PCo1 P = 0.10, PCo2 P = 6.2 × 10−12, Wilcoxon test) but is again outweighed by site differences (PCo1 P = 1.4 × 10−9, PCo2 P = 2.3 × 10−28, Kruskal–Wallis test) (Fig. 4c).

To identify specific taxa that vary with HIV status, we performed differential abundance testing with a linear mixed effect model that accounted for confounders such as site, antibiotic treatment or recency of diarrhoea. After correcting for multiple testing with the Benjamini–Hochberg procedure reported with a q-value, 131 prokaryotic species had significant differences with HIV status (q-value < 0.01) (Fig. 4d and Supplementary Data 10), most of which were lower in PLWH, agreeing with the finding of lower prokaryotic diversity in PLWH. Overall, the effect sizes generally agree across sites, with Nairobi exhibiting smaller effect sizes because of a lower number of PLWH. Some of the most significantly associated (q-value < 1 × 10−5) (Fig. 4d) taxa have been associated with HIV status in other cohorts: Faecalibacterium prausnitzii is a known butyrate producer that has been associated with reduced inflammatory biomarkers, and has been negatively associated with HIV status previously, probably because of increased oxygen levels in the gut during HIV infection68,69. By contrast, the genus Fusobacterium has well-described pro-inflammatory associations with HIV and other diseases70, and has been associated with poor immune recovery following ART administration71 and shown potential to reactivate latent HIV58. Interestingly, other taxa that are negatively associated with HIV, including Ventricola sp., Faecousia sp. and Limivicinus sp., are better represented in metagenomic studies focused on livestock72. Other taxa that are positively associated with HIV have conflicting associations with inflammatory disorders: Dysosmobacter welbionis is a recently isolated gut bacterium that has not been characterized in the context of HIV, but has been shown to counteract diet-induced obesity and improve glucose tolerance73,74. Megamonas uniformis has been described as enriched in PLWH75, but has discordant associations with inflammation and obesity76,77. These results highlight the value of investigating microbiome and disease associations in broader cohorts, as we observe several taxa that are not present in studies conducted in HICs.

To explore which microbial features can differentiate HIV status within and across sites, we trained machine learning models for each site individually and for all data combined (Fig. 4e,f). The goal of these models is to identify features that are strongly associated with HIV status in one or several populations as targets for future study; it is not proposed as a diagnostic. The models achieved accurate distinction between HIV− and PLWH individuals in all sites, yielding area under the receiver operating characteristics (AU-ROC) of more than 0.75 in all cases. When transferred across sites, only the model trained on data from Agincourt maintains high classification accuracy (Fig. 4f), perhaps because of the larger dataset available for model training. In line with this hypothesis, when data from two sites were combined in a leave-one-site-out (LOSO) validation, samples from the left-out site are accurately classified (AU-ROC ≥ 0.85 in all cases), even when samples from Agincourt are not used for training. As another test for generalization, we calibrated the model trained on all data to an internal 5% false positive rate and applied it to samples from Nanoro, Navrongo and DIMAMO (Fig. 4g). Even though information about HIV status in Nanoro and Navrongo is not available, we can assess the fraction of samples predicted to be HIV positive, which we would not expect to exceed 5% given the population HIV prevalence in these sites and our model calibration. Indeed, the model predicts very few samples to be HIV positive (3.1% in Nanoro, 3.7% in Navrongo), highlighting its specificity, and correctly classifies two thirds of the HIV-positive samples from DIMAMO (Fig. 4g).

We also considered the viral fraction of the microbiome of PLWH. Phage richness is lower in PLWH on ART compared with seronegative individuals, and is not significantly different between PLWH who are receiving ART and who are ART-naive (Extended Data Fig. 10). We observe 89 phages that show significant differences in abundance with HIV status (q-value < 0.01) (Supplementary Data 11), but find that machine learning models trained on the viral features achieve less accurate distinction between seronegative individuals and PLWH than models trained on prokaryotic features (Extended Data Fig. 10). Future work may identify which viral features are independent of their host bacterial abundance and have independent associations with HIV status.

Overall, HIV-associated microbiome differences are consistent across study settings, despite the strong effect of study context on overall microbiome composition. Some strongly associated taxa have been described in the context of HIV previously, but we also identify taxa that have not been well-described in human gut microbiomes or in microbial associations with disease. Although single-time-point sampling cannot determine causality or whether microbiome changes precede or follow HIV infection and ART administration, these results lend insight into possible targets for gut remediation.

Discussion

In 2007, the Human Microbiome Project set the goal of measuring the human microbiome and its contribution to disease6. Subsequent studies have built upon this goal, studying the human microbiome in large cohorts in HICs8,9. Here, 17 years later, the AWI-Gen 2 Microbiome Project is a landmark collaborative research effort that extends these goals to diverse LMIC populations in Africa: rural and predominantly horticultural areas (Nanoro, Burkina Faso and Navrongo, Ghana), rapidly transitioning rural areas (Agincourt and DIMAMO, South Africa), urban industrial informal settlements (Viwandani and Korogocho settlements in Nairobi, Kenya) and an urban post-industrial settlement (Soweto, South Africa). This is the largest cross-sectional and population-based survey of gut microbiome composition in relation to human health, environment and disease in low- and middle-income settings, and will prove invaluable in future microbiome discovery research.

This study enables comparison between populations that span a range of subsistence strategies and resource access. Site has a strong effect on microbiome variation, with alpha-diversity and taxon prevalence correlating with gradients in population density and resource access. Unexpectedly, we observe differences between sites that have similar subsistence strategies and industrialization levels. For example, Bifidobacterium and Cryptobacteroides have similar abundance in Nairobi and Soweto, whereas abundances of Prevotella, Bacteroides and Phocaeicola in Nairobi are more typical of the rural and semi-rural sites. HDSS data can contextualize these findings: Nairobi informal settlements have high in- and out-migration rates, and extensive circular migration38,42. Paired phenotype and metagenomic data also provide insights into taxonomic composition. For example, we identify genomic and host phenotype features that define a diet-related nutrient niche that supports T. succinifaciens persistence, and predict sensitivity to antibiotics that may drive T. succinifaciens loss. Here and previously33, we have found Treponema in the guts of urban individuals, perhaps because of low rates of antibiotic exposure or high rates of circular migration between urban and rural areas. These findings underscore the complex interplay between subsistence, industrialization and lifestyle factors in shaping the gut microbiome.

Our shotgun sequencing approach yielded assembly-based discovery of 40,135 previously unknown viral genomes. We identify nine new and prevalent jumbophages, including a putative integrated jumbophage, and jumbophages with intriguing unidirectional gene organization. To our knowledge, unidirectional gene organization is undescribed in jumbophages, and may be indicative of phage integration into a host with leading strand-biased gene distribution; indeed, the flanking regions of putative integrated Phage A map to a Clostridium species, and Bacillota have been previously described to have strong strand-biased gene distribution78. These findings provide an exciting opportunity for deeper study of previously unknown phages and their hosts.

AWI-Gen’s population-representative enrollment paired with clinical and lifestyle information also support the capture of population-level disease associations. HIV is prevalent in South Africa and Kenya, and improved viral load management with ART is not sufficient to protect against HIV-associated comorbidities56. This study represents one of the largest microbiome studies of women with HIV so far, and provides unique insights into microbiome–HIV associations in LMICs. Several taxa are enriched in PLWH receiving ART relative to seronegative participants, including taxa that have not been well-described in the context of HIV and inflammation. We cannot conclude whether these taxa are enriched in response to HIV infection, HIV-related comorbidities or antiretroviral medication, or whether taxa pre-existed HIV infection because of lifestyle or exposure differences in at-risk individuals. These results further demonstrate that existing disease associations are probably not broadly portable across global populations, and more research is necessary to disentangle the effects of HIV infection and other confounding variables on microbiome composition.

We emphasize that the AWI-Gen 2 Microbiome Project does not exhaustively represent any country or region. There is tremendous diversity within LMIC populations, and population density alone is not a sufficient indicator of microbiome composition or population lifestyle. Rather, the microbiome field needs to improve representation of LMIC populations to maximize sampling diversity and ensure portability of study findings. Even within this study, we focus on older adult women. Although sex differences have less impact on microbiome composition than other factors79, these studies may not capture health and lifestyle differences that may exist between sexes in LMICs. We also specifically highlight key variables that explain the greatest amount of microbial variation, leaving several disease and lifestyle variables open for future investigation. For example, we do not consider diet, which varies greatly across the AWI-Gen study populations and across LMIC contexts. We also acknowledge the limitations of identifying causality in microbiome associations with disease within this study design. We anticipate that longitudinal sampling of these populations and others will improve our understanding of the timing of microbiome changes in relation to HIV infection and other diseases. Further studies can also incorporate other microbiome measurements, such as eukaryotic profiling, to explore the complex relationship between eukaryote infection and immune activation, or total microbial concentration quantification to shed light on whether the taxonomic shifts observed in this study are due to blooms or losses of specific taxa. Future work can leverage the extensive AWI-Gen 2 participant data and more quantification methods to investigate the interplay between the microbiome and host genetics, environmental exposures, health status and participant demographics.

We strove to conduct the AWI-Gen 2 Microbiome Project ethically and equitably, taking into account recommendations for ethical research partnerships26,27,80. AWI-Gen hires field workers locally through the community-embedded infrastructures, and study staff host community advisory group discussions before study onset and return results to participants upon study completion. Through community discussions, the research team can identify pressing health issues within each study centre and ensure that research questions prioritize community needs. This study represents a strong scientific partnership between Stanford University in the USA and University of the Witwatersrand in South Africa. Trainees and faculty from both groups contributed to study design and data analysis, and a trainee from each institution has participated in a one-year research exchange with mentorship from both institutions. Further, the team has led three microbiome and bioinformatic training workshops to support genomics research capacity in South Africa. Altogether, this study illustrates that equitable research and impactful science do not represent a ‘zero-sum’ trade-off, but in fact lead to more robust research with benefit-sharing among all stakeholders.

The AWI-Gen 2 Microbiome Project contributes to advancing the investigation of human gut microbiomes from diverse populations around the globe. The study provides extensive opportunities for continued exploration, including identifying microbiome and disease associations, measuring human genetic contributions to microbiome composition, and defining lifestyle factors that shape microbial community assembly. Future studies can leverage these data, along with the foundation for community-engaged and equitable research described herein, to close the gap in global representation in microbiome research. Moreover, there is every reason to anticipate that the platforms and findings that emerge will enhance disease management, health and wellbeing among communities living in a diversity of contexts.

Methods

Ethics approval

Human subjects research approval was obtained (University of the Witwatersrand Human Research Ethics (Medical) Committee Clearance Certificate No. M170880, M2210108), and ethics approvals were also obtained at each study centre. Informed consent was obtained from participants for all samples collected. Every participant was provided with an information sheet and consent documents, either in English or translated into the local language. Participants had opportunities to discuss concerns with the interviewer, and participants who could not read or write had documents read aloud with a witness31. The Stanford University Institutional Review Board deemed that the de-identified data transferred to Stanford University do not constitute human subjects research and thus did not require a further ethics approval beyond the human subjects research approval obtained at the University of the Witwatersrand.

Community engagement

Each study centre conducted pre-study engagement before recruitment during both AWI-Gen 1 and AWI-Gen 2, adapting to the local contexts to engage with community members and discuss feedback and concerns related to the study. For example, in DIMAMO, South Africa, pre-study engagement involved meeting with tribal leaders, the community advisory team and community representatives. In Navrongo, Ghana, the community engagement team visited chiefs and elders of the various study communities and informed them of the proposed study, and followed up with a community sensitization gathering before AWI-Gen 1 with a larger audience of chiefs, elders and people of the study communities. The community durbar was excluded from AWI-Gen 2, because of the continuing COVID-19 outbreak. In Nairobi, Kenya, the community engagement team held several consultative meetings with members of the community advisory committee, village elders, community health volunteers and AWI-Gen study participants before, during and after the study. The village elders and community health volunteers were crucial in mobilizing study participants who could not be reached by telephone. Questions from participants were related to how blood and stool samples would be used and why the study was focused on women. If during recruitment and sample collection there were notable health concerns28 (for example, hypertension), the participants are referred into their clinical health-care service infrastructures. These mechanisms and processes varied from country to country and for sites within a country, depending on resources and local context.

Study design and cohort selection

Inclusion criteria included previous participation in the AWI-Gen 1 study28 and continued participation in the AWI-Gen 2 study. This AWI-Gen 2 microbiome study is a companion study to an AWI-Gen 2 menopause study, and so only participants self-identifying as female were surveyed for the microbiome sub-study. A small number of men were recruited owing to a fieldwork mix-up. Given the understudied nature of these populations, we did not fully exclude samples from men in downstream analyses; rather, samples from men were excluded from site comparisons and disease associations, but included when cataloguing genomic novelty. Participants were chosen semi-randomly from the overall AWI-Gen 2 participant pool, with extra measures taken to ensure a cross-section of individuals with respect to menopause status and hypertension. See Supplementary Methods for extended recruitment details.

A harmonized approach for stool sample collection was implemented in all study sites to ensure equal temperature exposures and handling of all samples. In Soweto, Nairobi and Nanoro, participants came to central locations for interviews and biomarker collection. Participants were given stool sample collection kits that were either collected the same day or collected from their homes or at a central location in the following days. At the Navrongo, DIMAMO and Agincourt study centres, participants were visited in their homes for interviews and biomarker collection. Participant phenotype data and survey information were stored in REDCap servers based in South Africa, Burkina Faso, and Ghana (v.9 to v.13, regularly updated through the course of the study). Participants were given a stool sample collection kit to use at their home, which was collected by fieldworkers within 24 h.

Each participant self-collected a single stool samples using an OMNIGene GUT OMR-200 Collection Kit (DNA Genotek). This preservation kit maintains DNA integrity and taxonomic composition across a wide range of ambient temperatures81, including the temperatures that are experienced year-round at each of the study sites. Samples were immediately frozen at study centres and then collectively shipped frozen to a central laboratory in Johannesburg, South Africa, where they were thawed, aliquoted into cryovials and stored at −80 °C. After obtaining necessary exportation and importation permits, all samples were shipped on dry ice in a single shipment to the United States for downstream processing. Samples were thawed once more to retrieve aliquots for DNA extraction. We previously conducted analysis to ensure that storage and shipping conditions would not significantly affect measured microbial composition81. Altogether, this approach minimized any technical confounders that would have coincided with study site, and we do not anticipate any other site-level methodological variation that would affect sample composition. Participant metadata, including age, demographic information, health history and blood biomarkers were collected as part of the larger AWI-Gen 2 project, with methods similar to those used in AWI-Gen 1 (ref. 31).

DNA extraction and metagenomic sequencing

All stool samples were extracted at the same time, in the same facility to minimize batch effect. DNA was extracted from samples using the QIAamp PowerFecal Pro DNA Kit (Qiagen, catalogue no. 51804) from 300 µl of stool sample according to manufacturer’s instructions. Bead beating was performed for 10 min at 30 Hz, followed by rotation of the adapter and an extra 10 min of bead beating using a TissueLyser II (Qiagen, catalogue no. 85300) using a 2-ml Tube Holder Set (Qiagen, catalogue no. 11993), and DNA extractions were eluted with C6 Elution Buffer in a final volume of 80 µl. DNA concentration was quantified by spectrophotometer using the DropSense 96 platform (Trinean, catalogue no. 10100096). Every extraction batch of 96 samples included one water blank as a negative control and one mock community aliquot (Zymo Research, catalogue no. D6300) as a positive control.

All libraries were prepared concurrently at the same facility and sequenced at the same time across several flow cells. Samples were evaluated for concentration, integrity and purity before library preparation using the 5400 Fragment Analyzer System (Agilent, catalogue no. M5312AA). Metagenomic libraries were prepared using the NEB Ultra II kit (NEB, catalogue no. E7645L) according to the manufacturer’s instructions. Library concentration was quantified using quantitative polymerase chain reaction and fragment length distribution was analysed using a 2100 Bioanalyzer (Agilent, catalogue no. G2939BA). Libraries were pooled and 2 × 150-base-pair reads were generated using the NovaSeq 6000 platform (Illumina, catalogue no. 20012850).

Metagenomic read preprocessing and taxonomy profiling

Metagenomic reads were deduplicated using HTStream SuperDeduper v.1.3.3 with default parameters, trimmed using TrimGalore v.0.6.7 with a minimum quality score of 30 and a minimum read length of 60. Reads aligning to version hg38 of the human genome were removed using BWA v.0.7.17 (ref. 82). Metagenomic reads were taxonomically profiled using mOTUs v.3.0.3 (ref. 83) and counts were distributed to GTDB84 species using the GTDB_v207 mapping file available as part of the mOTUs database.

Given the number of previously unknown bacterial taxa observed in our assembly approach (see below), we aimed to better characterize the taxonomic composition by including our assembled bacterial genomes into the mOTUs database. To do so, we extended the mOTUs database with the scripts available under https://github.com/motu-tool/mOTUs-extender/. In brief, marker genes were identified in all high-quality assembled genomes using fetchMG v.1.2 (ref. 85). Those genes were then clustered together with the genes in the mOTUs database v.3.0.3. The resulting extended database contained 662 new genome clusters and reduced the fraction of unassigned reads for nearly all samples. Particularly in samples from Nanoro and Navrongo, the new genome clusters carried a large part of the relative abundance (Extended Data Fig. 3). For GTDB-level profiling, the GTDB-tk classification of our assembled genomes were added to the GTDB_v207 mapping file. Unless indicated otherwise, all analyses shown here are based on the extended mOTUs database.

All samples were used for metagenome assembly and new feature discovery (n = 1,820). Samples from males, one sample with a potential label mismatch, and samples with high percentages of human reads (percentage of human reads more than or equal to 70%, n = 4 samples) were excluded from classification-based analyses and site comparisons, leaving 1,796 samples for other analyses.

Participant covariate processing

Extensive participant data were collected as part of the AWI-Gen study, including demographic, ethnolinguistic, family composition, pregnancy, cognition, frailty, household amenity, substance use, general health, diet, infection history, cardiometabolic disease and physical activity information. Participants also gave blood, urine and stool samples, and underwent ultrasound, blood pressure, blood and urine testing for various metrics. Not all data were available for every participant, and some participants gave stool samples for microbiome analysis but did not complete other testing or questionnaires. At the time of analysis for the microbiome study, not all participant data had gone through quality control. In total, 59 variables were available to use as covariates in the microbiome study.

Before using covariate data in microbiome analysis, we first collapsed the covariate dataset to only those variables that we expected to be most meaningful to avoid unnecessary multiple-hypothesis testing and measuring associations between dependent variables. First, we removed variables that had overwhelmingly missing data, excluding those that had entries for 100 or fewer participants (for example, several ultrasound measurements). Second, we filtered variables with not enough unique values (such as sex, which had only one group). Lastly, we excluded variables with an entropy (calculated with the infotheo package v.1.2.0.1 (ref. 86) in R) of less than 0.2 to avoid variables that were too uniform in the participant set to power comparisons (for example, breast cancer or cervical cancer status with only 10 and 12 cases, respectively).

To calculate correlation between covariates and associations between covariates and microbiome composition, we transformed non-numerical covariates into numerical values on the basis of ordered factor levels. For example, values for the Menopause covariate were changed from Pre-menopausal to 1, from Peri-menopausal to 2 and from Post-menopausal to 3. Most covariates were binary (for example, Probiotics could contain either the value Yes or No) and were converted to 1 (for Yes) and 2 (for No) in this process. The full list of binary variables is: Arthritis, Diabetes status, Diabetes treatment, Hypertension status, Hypertension treatment, Pesticides, Vigorous work, Weekend work, HIV medication, HIV status, Cattle, Other livestock, Potable water, Poultry, Refrigerator, Toilet, Deworming treatment, Probiotics, Chew tobacco and Smokeless tobacco. The variables describing time (Deworming period, Probiotics period, Antibiotics and Diarrhoea last) were ordered according to recency with the order WithinLastWeek < WithinLastMonth < WithinLastSixMonths < WithinLastYear < WithinLastTwoYears < WithinLastThreeYears < Longer < Never. Employment was ordered as Self-Employed < FormalFull-time < FormalPart-time < Informal < Unemployed. Site density was ordered as Nanoro < Navrongo < DIMAMO < Agincourt < Soweto < Nairobi.

Microbial diversity, composition and site differences

To measure prokaryotic alpha-diversity, species counts were rarefied to 5,000 using the rrarefy function available through the vegan R package v.2.6-4 (ref. 87). Alpha-diversity was measured as inverse Simpson index after rarefaction, and prokaryotic richness was measured as number of species with relative abundance greater than or equal to 1 × 10−4 after rarefaction).

Beta-diversity was calculated on the Bray–Curtis distance using the vegdist function from vegan87 and the pco function from the labdsv R package v.2.1-0 (ref. 88). To assess the amount of variance explained by covariates, we undertook distance-based redundancy analysis with the dbrda function from vegan. In an iterative manner, the covariate explaining the highest amount of variance was added to the model formula. To reduce redundancy of highly correlated covariates, all available covariates were transformed into numerical values (using ordinal factors, whenever applicable) and the Pearson correlation between covariates was calculated. In cases of highly correlated covariates (Pearson’s r ≥ 0.8), the covariate that explained the higher amount of variance in the prokaryotic composition was chosen for the iterative model (Extended Data Fig. 5).

Prokaryotic species prevalence was defined as the fraction of individuals in a study site in which a given species is found at a relative abundance of more than or equal to 1 × 10−4. The difference between sites for individual taxa was calculated using a generalized fold change89. In short, instead of comparing the median (the 50% quantile) between distributions, the generalized fold change is the mean of the differences between two distributions at several quantiles and can therefore resolve differences also in low-prevalence taxa. Figure 3d shows the number of taxa for which the generalized fold change between sites exceeds the 90% quantile of all pairwise site comparisons across all prokaryotic species.

The number of samples for these analyses (Fig. 1 and associated supplements) was distributed across the different sites as follows: Nanoro, n = 382; Navrongo, n = 218; DIMAMO, n = 201; Agincourt, n = 532; Soweto, n = 226; Nairobi, n = 237.

Metagenome assembly and external dataset comparison

All samples (n = 1,820), including samples for male participants, were included in metagenomic assembly analyses (Nanoro, n = 384; Navrongo, n = 235; DIMAMO, n = 203; Agincourt, n = 533; Soweto, n = 226; Nairobi, n = 239). Metagenomic reads were assembled using megahit v.1.2.9 (ref. 90) and assembly quality was assessed using QUAST v.5.2.0 (ref. 91). Metagenomic assemblies were binned into draft genomes using MetaBAT v.2.5 (ref. 92), CONCOCT v.1.1.0 (ref. 93) and MaxBin v.2.2.7 (ref. 94), and subsequently dereplicated and aggregated on a per-sample basis using DAS Tool v.1.1.6 (ref. 95). Bin quality was assessed using CheckM v.1.2.2 (ref. 96). To create a dereplicated genome set, MAGs were dereplicated using dRep v.3.4.3 (ref. 97), filtering to only include genomes with a minimum CheckM completeness of 50% and maximum CheckM contamination of 5%. In dereplication, we implemented a primary clustering threshold (-pa) of 0.9 and secondary alignment threshold (-sa) of 0.95, requiring minimum overlap between genomes (-nc) of 0.3, using multiround primary clustering (--multiround_primary_clustering) and greedy secondary clustering with fastANI v.1.33 (ref. 98) (--greedy_secondary_clustering, --S_algorithm fastANI) to reduce the computational complexity of dereplicating a large genome set. For dereplication, cluster representatives were chosen using scoring criteria that included a completion weight (-comW) of 1, contamination weight (-conW) of 5, N50 weight (-N50W) of 0.5, size weight (-sizeW) of 0, and centrality weight (-centW) of 0. Genome filters and scoring were consistent with standards used in the UHGG50. The final genome set was taxonomically classified and placed in a tree with GTDB-tk v.2.3.0 (ref. 99) using the GTDB r214 catalogue and default parameters. Phylogenetic trees were visualized with iTOL v.6 (ref. 100). The dereplicated prokaryotic genome set was compared against the UHGG v.2.0.1 species representatives using dRep v.3.4.3 with the same parameters as previously stated above.

Protein-coding genes were predicted from each medium-quality and high-quality prokaryotic genome, before genome dereplication, using prodigal v.2.6.3 (ref. 101) with parameters -c -p meta to exclude partial genes. Putative proteins were clustered successively using mmseqs v.14.7e284 (ref. 102) linclust command with alignment coverage (-c) of 0.8 in target coverage mode (--cov-mode 1) and greedy secondary clustering (--cluster-mode 2) at 100% and 95% amino acid identity (--min-seq-id). The 95% identity protein set was compared against the UHGP95 v.2.0.1 proteins using mmseqs v.14.7e284, and proteins sharing 95% amino acid identity over 80% of the UHGG protein were considered to match the UHGP set.

Modelled accumulation of previously unknown prokaryotic genomes and proteins with further participant sampling was determined by randomly subsetting the full participant set or site-specific participant sets to a range of individuals (1–1,500) in 100 iterations and counting the number of prokaryotic genome cluster or protein clusters represented by the participant subset.

Comparison with external metagenomic studies (Extended Data Fig. 7) used the same pipelines for read preprocessing, assembly and binning, with the exception of Carter et al.43 who published a MAG catalogue. All genomes from the UHGG, AWI-Gen 2 and external metagenomic studies were dereplicated using the same parameters as described above.

Treponema succinifaciens core genome analysis and functional profiling

We evaluated the complete set of T. succinifaciens MAGs in our genome catalogue before dereplication. To identify T. succinifaciens genomes, we selected all genomes with completeness of more than 90% and contamination less than 5% that fell into a secondary cluster with genomes classified as Treponema_D succinifaciens by GTDB-tk in our dereplicated genome catalogue (n = 244). Coding sequences were annotated with bakta v.1.8.2 (ref. 103). Core genes, defined here as genes present in at least 80% of genomes, were identified with roary v.3.12.0 (ref. 104).

Public T. succinifaciens genomes with completeness of more than 90% and contamination less than 5% were downloaded from the UHGG50, Carter et al.43 and National Center for Biotechnology Information (NCBI). To build a global phylogenetic tree, core genes were identified and incorporated into a core gene multiple sequence alignment using roary v.3.12.0 (ref. 104) and MAFFT v.7.407 (ref. 105). The core gene multiple sequence alignment was used as input to FastTree v.2.1.11 (ref. 106), and the resulting phylogenetic trees were visualized in iTOL v.6 (ref. 100). Phylogeographic signal was statistically quantified using the same method as Hildebrand et al.107: we calculated pairwise phylogenetic distance between all genomes on the basis of branch length using DendroPy108, and implement a permuted multivariate analysis of variance test with 1,000 permutations with adonis2 (ref. 87) to evaluate whether phylogenetic distances within countries are smaller than phylogenetic differences between countries.

Associations between T. succinifaciens presence and host phenotype were performed for all participants from Nanoro, Burkina Faso and Navrongo, Ghana. Host phenotype measurements included antibiotic history, anthropometric measurements, livestock ownership, hypertension status and all biomarkers. Associations were tested using a linear model that adjusted for site and for antibiotic history, excepting the association with antibiotic history, which only adjusted for site. Correction for multiple-hypothesis testing was performed with the Benjamini–Hochberg procedure. Antimicrobial resistance profiling was performed with the Resistance Gene Identifier109, and ‘Loose’ matches were omitted. Carbohydrate-active enzyme (CAZyme) annotation was performed on all high-quality MAGs from AWI-Gen 2 using dbCAN3 v.4.1.4 (ref. 110) with the prok parameter for conservative, high-confidence annotations. Substrate annotation was performed at the CAZyme family level using the high-level substrate annotations from the dbCAN3 substrate mapping table, and substrates were grouped according to biological origin.

Viral fraction characterization

Phage genomes were annotated from metagenomic assemblies with VIBRANT v.1.2.1 (ref. 111), and genome quality was determined with checkV v.1.0.1 (ref. 112). Redundant genomes from each sample were removed by clustering medium- and high-quality genomes using a database built with BLAST 2.14.0 (ref. 113), clustering at a minimum of 95% ANI and 85% alignment fraction using checkV supporting scripts with default parameters. Phage richness was measured as the number of assembled phage genomes per sample after removal of duplicate genomes. A unified catalogue of phage genomes was built by clustering the representative phages from each individual using the same clustering parameters, and this catalogue was compared against the MGV v.1.0 (ref. 114) vOTU representative phage genomes using the same BLAST clustering approach and parameters. Alternate phage profiling using read-based classification (Extended Data Fig. 9) was performed with Phanta v.1.1.0 (ref. 115) using the combination of MGV and UHGG as the reference database. Phage richness measured with Phanta was defined as the number of phage species clusters present at greater than or equal to 10−5% relative abundance. Differences between alpha-diversity metrics across sites were tested with a linear model, using the anova function from base R to estimate the significance of the difference. Modelled accumulation of previously unknown phage genomes with further sampling was performed using the same methods as described above for modelled prokaryotic genome and protein accumulation.

Crassvirales and crAssphage prevalence was defined as the fraction of individuals with taxon relative abundance of greater than or equal to 10−5%. Previously unknown jumbophages were defined as viral genomes in the dereplicated genome catalogue with length greater than 200 kb that did not cluster with an MGV vOTU representative. We further filtered the new jumbophages to highlight only jumbophages with evidence supporting prevalence and novelty, by including only those with assembled genomes present in at least five individuals, and with alignment fractions less than 10% against any MGV vOTU representative. Jumbophage genes were annotated with bakta v.1.8.2 (ref. 103). Read-level presence of jumbophages was defined at greater than 0.1 coverage threshold as measured using CoverM v.0.7.0 (ref. 116).

Association between microbiome features and HIV status

Participants from Agincourt, South Africa, Soweto, South Africa and Nairobi, Kenya were included in this analysis. Participants from DIMAMO, South Africa were excluded because of the low number of PLWH (n = 6). Participants from Nanoro, Burkina Faso and Navrongo, Ghana were excluded because HIV status was not measured in these populations owing to a low national prevalence of HIV. A total of 848 participants were included in this analysis, capturing 129 PLWH and 719 seronegative individuals (Table 1 and Supplementary Data 9). The rest of the samples from those sites were either HIV positive, but reported not to take ART (n = 28, n = 22 in Agincourt, n = 3 in Soweto, n = 3 in Nairobi). Male individuals and individuals with missing/discrepant HIV/ART data or with low read counts were excluded. Prokaryotic alpha- and beta-diversity were calculated as described above.

We undertook differential abundance analysis using a linear mixed effect model implemented in the lmerTest R package v.3.1-3 (ref. 117), including site, exposure to antibiotics and self-reported recency of diarrhoea as random effects, because those factors had shown to be related to microbiome composition in the previous analyses. Overall effect size was estimated through the lmerTest package as well and generalized fold change within each site was calculated as described above.

For the machine learning analysis, we trained statistical models using the SIAMCAT R package v.2.5.0 (ref. 89) for both all data combined and for each site separately. In short, relative abundances were normalized using the log.std method in SIAMCAT. Samples were split for five-times repeated fivefold cross-validation (20% of samples were retained for testing and not included in model training) and for each split, an L1-regularized logistic regression model was trained on the training folds, using standard parameters. Model evaluation was performed within the cross-validation (for example, within a site) by applying each model to the respective left-out test fold. The predictions for each sample were averaged across repeats and AU-ROC was calculated with the pROC package v.1.18.2 (ref. 118). For cross-site evaluation, the external data was normalized with the recorded normalization parameters (frozen normalization), all models from the cross-validation were applied to the normalized data, and predictions were averaged again for AU-ROC analysis. For the LOSO analysis, models were trained as described on data from two sites combined (for example, Agincourt and Soweto) and were then applied on the data from the left-out site (Nairobi).

To test the fraction of positive prediction in other sites, we calibrated the model prediction to an internal 5% false positive rate; that is, recorded at which prediction threshold 5% of HIV− samples were incorrectly classified as PLWH. The model trained on all data combined was then applied to the data from Nanoro, Navrongo and DIMAMO to quantify the number of samples that resulted in a prediction above the threshold value.

Viral feature comparison between seronegative individuals, PLWH who are ART-naive, and PLWH who are ART+ was performed using phage relative abundance profiles generated by Phanta. Phage richness was calculated as the number of phage species present at greater than or equal to 10−5% abundance in Phanta profiles, as opposed to using total count of assembled phages, because Phanta abundance profiles have features with sufficient prevalence for differential abundance analysis and machine learning models. Differential feature analysis and machine learning models were performed using the same methods as the prokaryotic analysis above.

Statistical analysis

Statistical analyses were performed with R v.4.1.2 using the statistical test specified in the respective Methods section. Correction for multiple-hypothesis testing was performed with the Benjamini–Hochberg procedure119 as implemented in the p.adjust function in base R in all analyses where several tests were performed. Plots were generated in R using the packages ggplot2 v.3.4.2 (ref. 120), cowplot v.1.1.1 (ref. 121), pheatmap v.1.0.12 (ref. 122) and tidyverse v.2.0.0 (ref. 123).

Ethics and inclusion statement

All authors of this study fulfilled criteria for authorship inclusion, and researchers from each study centre are represented as authors. Researchers from all institutions were involved throughout the study process. Study centre staff facilitated community engagement sessions, which identified specific community concerns and determined that this study is locally relevant. Roles and responsibilities were agreed upon amongst collaborators before conducting the research. Authors of this study have led formal capacity-building genomics workshops for local scientists during the course of the study (Extended Data Fig. 1), along with further informal training. This study has been approved by local ethics review committees (Methods). Research pertinent to the study centres and led by local researchers has been taken into account in the citations.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.