Introduction

Antibiotic resistance (i.e., the ability of bacteria to survive and replicate in the presence of an antibiotic1) poses an increasingly urgent global public health challenge2. Many bacterial pathogens have developed resistance to major antibiotics, with some resisting multiple drugs and causing untreatable infections3,4. Owing to the broad global use of antibiotics, antibiotic-resistant bacteria (ARB) and their antibiotic resistance genes (ARGs) are emerging and spreading globally among people, food, animals, plants, and environmental compartments; i.e., soil, water, and air5,6. The environment provides an immense gene pool from which numerous ARGs could be acquired by pathogens to resist antibiotics7. Since many ARGs are found on mobile genetic elements (MGEs) and are therefore often horizontally transmitted, antibiotic use also imposes a selective pressure on the whole microbiome, not just pathogens.

In addition to studying the acquisition of antimicrobial resistance in pathogens, it is important to examine how antibiotic use and other environmental variables (such as temperature8, pH9, gross domestic product (GDP)10, population density11) affect the aggregate collection of resistance genes of commensal microbiomes; i.e., the resistome. Reliable information on the global occurrence and biotic/abiotic drivers of ARGs is urgently needed to inform public health actions and antibiotic-use decisions. Previous studies have reported global maps of resistomes for soil12, inland water13, urban mass-transit systems14, sewage15, and the human gut16, providing baseline information for understanding ARG diversity and health risks in the environment.

The sewage of ~52% of the global population is delivered to wastewater treatment plants (WWTPs)17,18, an essential infrastructure for the protection of human and ecosystem health19,20. However, WWTPs are among the most important reservoirs of ARGs and ARB because they receive wastewater from homes, hospitals, and pharmaceutical manufacturing facilities. Most WWTPs employ the activated sludge (AS) process, an open aerobic enrichment-culture system of microbial flocs or granules. Different anoxic/aerobic AS variants remove organic carbon, nitrogen, and phosphorus and can function within treatment trains to remove pathogens, micropollutants, and ARB21,22,23. The activated sludge could also be a spawning ground for resistance evolution, making it an important platform to study the rules governing the development of ARGs in the environment.

Recent studies have investigated resistome dynamics over time24,25 or across treatment compartments in one specific WWTP24,25, and the resistome diversity and distribution in several local WWTPs9,26,27. However, their findings exhibit limited concordance, possibly due to small sample sizes or non-unified protocols. For instance, co-occurrence network analysis suggested the bacterial phyla of Actinobacteria and Bacteroidetes as main hosts of ARGs in WWTPs26, but metagenome-assembled genome (MAG)-based methods revealed the most frequent hosts to be Proteobacteria28. Moreover, few studies have assessed the environmental factors driving resistomes in WWTPs9. Hence, our understanding of global ARG diversity in WWTPs and the underlying mechanisms affecting ARGs in WWTPs remains incomplete. Meta-analysis based on localized experiments is problematic due to differences in experimental systems, sampling methods, and analytical approaches29,30. To discern the global picture of ARGs in WWTPs, a survey is needed that is systematic, methodologically consistent, and globally representative.

To meet this need, a Global Water Microbiome Consortium (GWMC) was established (http://gwmc.ou.edu/) to oversee and coordinate a systematic global campaign for the collection, sequencing, and analysis of ~1,200 AS samples using identical protocols31. Among these samples, 226 metagenomes (i.e., a collection of genomes and genes from all microorganisms32) were obtained by shotgun sequencing. The resistomes (i.e., collections of ARGs)33 were analyzed to address fundamental questions: (i) What are the diversity and distributions of global AS resistomes? (ii) What are the associations among the resistomes and microbiomes? and (iii) What biotic and abiotic mechanisms control the diversity, structure, and distributions of global AS resistomes?

Results and discussion

Diversity of global AS resistomes

To determine the resistomes of AS, the community DNA of 226 samples from 114 representative WWTPs across six continents (Fig. 1a) was sequenced. A total of 2.8 terabases (Tb), with an average of 12.3 ± 3.9 Gb per sample (Supplementary Data 1), was obtained. Rarefaction analysis of the sequencing reads mapping to bacterial 16S rRNA genes (Supplementary Figs. 1a, b) and ARGs (Supplementary Figs. 1c, d) showed that the sequencing depth was sufficient to represent the diversity of AS microbiomes and resistomes.

Fig. 1: The abundance and global distribution of the AS resistomes.

a Map of the sampling locations. b Average relative ARG abundance (copy of ARGs per cell) across different continents based on resistance mechanism, drug class, and the nine most abundant ARGs. MLS: macrolide-lincosamide-streptogramin. c Richness of the ARGs. Richness index was calculated based on a rarefied matrix of resistance gene coverage, which was rounded and subsampled to the lowest sample’s level. In the boxplots, hinges show the 25th, 50th, and 75th percentiles. The upper whisker extends to the largest value no more than 1.5 * IQR from the upper hinge, where IQR is the interquartile range between the 25% and 75% quartiles; the lower whisker extends to the smallest value at most 1.5 * IQR from the lower hinge. Sample size: n = 6, 59, 14, 20, 106, and 21 samples for Africa, Asia, Australasia, Europe, North America, and South America, respectively. Significant differences (Dunn’s test with two-sided p-values adjusted by the Bonferroni method < 0.05) between continent pairs are indicated in the plot. d Principal coordinate analysis (PCoA) reveals distinct ARG composition diversity in six continents. e PCoA reveals distinct ARG diversity in different environments. Source data are provided as a Source Data file.

Overall, 36,147,212 contigs longer than 1 kb were assembled from all filtered metagenomic reads, and 34,860,381 non-redundant open reading frames (ORFs) were predicted. 37,029 (0.11%) of the ORFs were annotated as ARG sequences. A total of 179 different ARGs, relevant to 15 drug classes, were identified (Supplementary Table 1). To assess geographical distribution, ARG abundance was normalized to the ARG copy number per bacterial cell34. The core ARGs in activated sludge, meaning those present in all AS samples analyzed, encompassed 20 genes that accounted for 83.8% of the total ARG abundance (Supplementary Data 2). The three most abundant ARGs were Tetracycline_Resistance_MFS_Efflux_Pump (15.2%), ClassB (13.5%), and vanT gene in the vanG cluster (11.4%), which respectively confer Tetracycline, Beta-lactam, and Glycopeptide resistance (Supplementary Data 2).
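The core-resistome definition above (genes present in every sample, and their share of the total ARG abundance) can be illustrated with a minimal Python sketch. The sample names, gene labels, and abundance values below are toy placeholders rather than data from this study, and pooling abundances across samples to obtain the share is an assumption about how the percentage was computed.

```python
def core_args(abundance):
    """Find core ARGs and their share of the pooled abundance.

    abundance: dict mapping sample -> {ARG name: copies per cell}.
    A gene counts as "core" if its abundance is non-zero in every sample.
    """
    samples = list(abundance.values())
    core = {gene for gene, value in samples[0].items() if value > 0}
    for sample in samples[1:]:
        core &= {gene for gene, value in sample.items() if value > 0}
    # Assumption: the share is computed on abundances pooled over all samples.
    total = sum(value for sample in samples for value in sample.values())
    core_total = sum(
        value for sample in samples for gene, value in sample.items() if gene in core
    )
    return core, core_total / total


# Toy input (hypothetical values, two samples, three genes).
toy = {
    "S1": {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 0.25},
    "S2": {"gene_a": 0.5, "gene_b": 0.5, "gene_c": 0.0},
}
core, share = core_args(toy)
print(sorted(core), share)
```

In this toy case gene_c is absent from S2, so it is excluded from the core set even though it contributes to the total abundance.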

Since different ARGs might be associated with the same resistance mechanism or drug class, the relative abundances of ARGs were aggregated based on their resistance mechanisms and drug classes (Fig. 1b and Supplementary Fig. 2a). ARGs encoding antibiotic inactivation were the most abundant, accounting for about 55.7% of the total ARG abundance. The next most prevalent were ARGs for antibiotic-target alteration (25.9%) and efflux pumps (15.8%). When ARGs were aggregated by drug class, ARGs conferring resistance to Beta-lactam (46.5%), Glycopeptide (24.5%), and Tetracycline (16.2%) were the most abundant. The relative abundances of ARGs encoding major resistance mechanisms or drug classes were relatively consistent across samples.

Global distribution of AS resistomes

Global variation in ARG abundance

The total ARG abundance showed no significant difference across the six continents (Supplementary Fig. 2b; p = 0.78, Kruskal-Wallis test). However, the mean ARG richness (Fig. 1c) and Shannon’s H index (Supplementary Fig. 2c) were significantly higher in Asia than in other continents except Africa. ARG abundance varied across samples from different countries (p = 0.034, Kruskal-Wallis test): Samples from Chile (2.87 ± 0.40) and Canada (3.10 ± 0.35) were the lowest in mean ARG abundance, while samples from Switzerland (4.30 ± 0.20) and Colombia (4.26 ± 0.86) were the highest (Supplementary Fig. 3a). However, post hoc analysis indicated that total ARG abundance was not significantly different between any country pairs (p.adj > 0.05, Dunn post hoc tests).

Global variations in ARG compositions

To identify structural differences of resistomes across continents, PERMANOVA (Permutational multivariate analysis of variance) was performed at the individual gene level (Table 1). The resistomes differed significantly (p < 0.05) in all pairwise comparisons between continents. Principal coordinate analysis (PCoA) and clustering analysis at the gene level showed a strong regional separation (Fig. 1d, Supplementary Fig. 4a, and Supplementary Note 1). A weaker regional separation was observed at the drug-class level than at the gene level (Supplementary Fig. 4b and Supplementary Note 1).

Table 1 Differences of the AS resistomes between continents

ARG differences across different habitats

To determine whether the structure of AS resistomes resembled those from other habitats, we conducted a comparative analysis of resistomes across different environments (AS, human gut35, soil36, ocean37, and sewage15) according to the read-based annotations. Comparison of the results obtained from contig- and read-based approaches on our AS samples demonstrated that the major conclusions remained consistent regardless of the approach used (Supplementary Fig. 5 and Supplementary Note 2). PCoA revealed that the resistomes were distinctly different across habitats (Fig. 1e). AS resistomes were much more similar to sewage and soil resistomes than to ocean or human gut resistomes (Fig. 1e), even when aggregated by resistance mechanisms or drug classes (Supplementary Figs. 3b, c). The similar ARG compositions among AS, sewage, and soil could be due to the interconnection of these environments, as sewage is the influent of WWTPs, and soils could also be an important source of the influent’s composition, especially in combined sewer systems that collect both domestic sewage and stormwater.

Relationships between the resistomes and microbiomes

Associations of the resistomes to bacterial community structure

To understand the relationships between resistomes and bacterial community structure, we performed Procrustes analyses. The bacterial community structure was represented either by 16S rRNA genes extracted from metagenomes (Fig. 2a) or amplified 16S rRNA genes (Fig. 2b). Procrustes analysis yielded a matrix-matrix correlation coefficient of 0.74 for metagenome 16S-based bacterial community structure, and a matrix-matrix correlation coefficient of 0.70 for 16S amplicon-based bacterial community structure (protest, p < 0.001), suggesting a strong association between WWTP bacterial community structure and the resistomes. These results are consistent with previous studies on local WWTPs9,27 and soil38, demonstrating that bacterial community composition plays a pivotal role in shaping the resistomes.

Fig. 2: The linkage of the AS resistomes to microbiomes.

a Relationships detected by Procrustes analysis between the resistomes and bacterial community structure as measured by 16S genes extracted from the metagenomes. Metagenomic shotgun sequencing was performed for all activated sludge samples, and the 16S sequences were extracted and grouped at the genus level using Metaxa2. b Relationships detected by Procrustes analysis between the resistomes and bacterial community structure as measured by the 16S amplicon sequencing data. The dotted ends of lines represent the resistome position, while the undotted ends represent the bacteriome position. Vegan Procrustes test ‘protest’ with 999 permutations yielded a matrix-matrix correlation coefficient of 0.74 (protest, p = 0.001) for metagenome 16S-based bacterial community structure, and a matrix-matrix correlation coefficient of 0.70 (protest, p = 0.001) for 16S amplicon-based bacterial community structure. c The association between the ARG abundance (total ARG abundance and the top four major ARG groups) and the relative abundance of top 15 major bacterial phyla from 16S rRNA gene amplicon data (16S) or metagenomes (Shotgun). The circle-filled color corresponds to Spearman’s correlation coefficient. The asterisks ‘*’ denote significant correlations (two-sided p < 0.05 after adjustment for multiple testing). d The phylogenetic tree of metagenome-assembled genomes (MAGs) from global AS samples. The leaf colors indicate phylum groups. The bar heights outside the circle are proportional to the ARG count annotated in MAGs, and red bars represent the MAGs carrying multi-species mobile ARGs. Inner rings show the resistance gene abundances of the five major drug classes, with darker colors indicating higher abundances. e The mean count and relative abundances of ARGs encoding major resistance mechanisms or drug classes across phylogenetic groups. Error bars indicate standard deviations. Numbers on the top indicate the number of MAGs belonging to the phylogenetic groups. 
Source data are provided as a Source Data file.

To further determine whether the relationships between the resistomes and microbiomes depend on phylogenetic lineages, we determined the linkages of the total ARG abundance and the top four major ARG groups to the relative abundances of major phyla (Fig. 2c and Supplementary Note 3). Bacteroidetes, the most abundant phylum, was positively correlated with the ARG abundance based on amplicon 16S rRNA gene data (rho = 0.28, adjusted p = 0.0001). Based on metagenome-derived 16S rRNA genes, the ARG abundance was also positively correlated with Chloroflexi (rho = 0.48, adjusted p < 2.7 × 10^-13), Acidobacteria (rho = 0.28, adjusted p = 9.4 × 10^-5), Gemmatimonadetes (rho = 0.24, adjusted p = 0.001), Nitrospirae (rho = 0.20, adjusted p = 0.009), and Deltaproteobacteria (rho = 0.20, adjusted p = 0.008), suggesting that these taxa may be major carriers of ARGs. Strong correlations between ARG abundance and taxonomic groups were also observed in other environments, but with different patterns (Supplementary Fig. 6 and Supplementary Note 3). These results suggest that the resistomes in AS could be strongly tied to microbial physiology.

ARG-associated metagenome-assembled genomes

To further understand the association between ARGs and their bacterial hosts, the shotgun sequences of these global AS samples were assembled into contigs and binned into genomes (see “Methods” for details). A total of 1,112 dereplicated high-quality MAGs were recovered with 536 Bacteroidota, 272 Proteobacteria, and 43 Actinobacteria. We detected that 1,054 of them contain at least one ARG, and 28 were identified as potential human pathogens based on the taxonomic information and presence of virulence factors39,40,41 (Supplementary Note 4). As shown in the MAGs-based phylogenetic tree in Fig. 2d, the total ARG abundance and major ARG classes varied greatly among different phylogenetic groups. Chloroflexi (7.2 ± 3.0 ARG counts), Acidobacteria (6.6 ± 3.0), Deltaproteobacteria (4.5 ± 2.8), Gemmatimonadota (3.5 ± 2.1), and Bacteroidetes (3.3 ± 1.7) were the top five carriers of ARGs (Fig. 2e), which was consistent with their positive correlations with the ARG abundance. Bacteroidetes and Proteobacteria were reported to be the main hosts of ARGs in local WWTPs26,28, consistent with our synthetic analyses using both correlation- and MAG-based methods. This is likely due to their ability to disseminate resistance genes via horizontal gene transfer (HGT)42 and their adaptability to antibiotic-rich environments43. Collectively, all the above analyses indicate that the identified taxa may play significant roles in ARG persistence and dissemination in activated sludge systems.

Mobility of resistomes and MAGs

MGEs facilitate the horizontal transfer of ARGs, contributing to antibiotic resistance dissemination and evolution in microbial communities. To determine the diversity of MGEs, a total of 2,200 non-redundant ORFs were identified as 56 MGE genes (Supplementary Data 3). The three most abundant MGEs were tnpA, IS91, and tniA, and the corresponding MGE classes were transposase, insertion_element_IS91, and plasmid in AS (Fig. 3a and Supplementary Note 5). The total MGE abundance showed significant differences across the six continents (Fig. 3b; p = 1.2 × 10^-6, Kruskal-Wallis test) and between different countries (p = 5.8 × 10^-7, Kruskal-Wallis test). Linear regressions showed that the MGE richness was positively correlated with the ARG richness (R = 0.38, adjusted p = 2.8 × 10^-9). Furthermore, the total ARG abundance was positively correlated with the abundance of their nearby MGEs (R = 0.20, adjusted p = 0.003; Supplementary Note 5).

Fig. 3: Mobility of ARGs from assembly and MAG-based analyses.

a Relative MGE abundance identified from the non-redundant ORFs on gene level and group level. b Boxplots of the MGE Shannon’s H index across six continents. Hinges show the 25th, 50th, and 75th percentiles. The upper whisker extends to the largest value no further than 1.5 * IQR from the upper hinge, where IQR is the inter-quartile range between the 25% and 75% quartiles; the lower whisker extends to the smallest value at most 1.5 * IQR from the lower hinge, and dots indicate values of individual samples. Sample size: n = 6, 59, 14, 20, 106, and 21 samples for Africa, Asia, Australasia, Europe, North America, and South America, respectively. Significant differences (Dunn’s test with two-sided p-values adjusted by the Bonferroni method < 0.05) between continent pairs are indicated in the plot. c The relative abundance of mobile or immobile ARGs based on taxonomic composition, resistance mechanisms, and drug class. d Multi-phyla mobile ARGs based on gene sharing between MAGs. Nodes represent ARG sequences with labels indicating the gene/gene family name. Node colors indicate the phylogenetic groups of MAGs in which the ARG is present. Node shapes indicate different resistance mechanisms. Source data are provided as a Source Data file.

We further quantified mobility based on the ARGs shared between distinct hosts. Following the method applied to human microbiomes16,44, mobile ARGs were identified as identical or near-identical sequences present in different bacterial hosts. From these 1,112 dereplicated MAGs, 3,646 ORFs were annotated as ARG sequences, which were further clustered into 2,368 ARG clusters at 99% nucleotide identity. Subsequently, 29% of the ARG clusters (682/2,368) covering 54% of all ARG sequences (1,959/3,646) were assigned to multiple species, suggesting possible recent horizontal gene transfer across distantly related organisms. In comparison, 10% of the ARG clusters from the human microbiome MAGs were multi-species ARGs16. The proportion of potentially mobile ARGs in AS was thus nearly three times that in the human microbiome. This may be due to the high density of bacterial cells and the well-mixed nature of AS, which enhance the probability of physical contact between bacteria and thereby increase the likelihood of horizontal gene transfer. Note that the non-mobile/intrinsic ARGs still contribute to the gene pool in the environment, as they might be captured by mobile genetic elements at some stage of evolution and become mobile ARGs45.
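The multi-species criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: clustering at 99% nucleotide identity is taken as already done upstream, each ARG ORF is represented by a (cluster, host species) pair, and the cluster and species labels are hypothetical.

```python
from collections import defaultdict


def multi_species_clusters(arg_records):
    """Flag ARG clusters whose member ORFs occur in more than one species.

    arg_records: iterable of (cluster_id, host_species) pairs, one per ARG ORF.
    Returns the set of multi-species clusters, the fraction of clusters that
    are multi-species, and the fraction of ARG ORFs those clusters cover.
    """
    species_by_cluster = defaultdict(set)
    orfs_by_cluster = defaultdict(int)
    for cluster_id, species in arg_records:
        species_by_cluster[cluster_id].add(species)
        orfs_by_cluster[cluster_id] += 1
    mobile = {c for c, hosts in species_by_cluster.items() if len(hosts) > 1}
    frac_clusters = len(mobile) / len(species_by_cluster)
    frac_orfs = sum(orfs_by_cluster[c] for c in mobile) / sum(orfs_by_cluster.values())
    return mobile, frac_clusters, frac_orfs


# Toy input: cluster c1 is shared by two species; c2 and c3 are not.
records = [
    ("c1", "sp_A"), ("c1", "sp_B"), ("c1", "sp_B"),
    ("c2", "sp_A"),
    ("c3", "sp_C"), ("c3", "sp_C"),
]
mobile, frac_clusters, frac_orfs = multi_species_clusters(records)
print(mobile, frac_clusters, frac_orfs)
```

The two returned fractions correspond to the two percentages reported in the text: the share of clusters that are multi-species (682/2,368) and the share of ARG sequences they cover (1,959/3,646).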

The potential ARG mobility for MAGs varied across phylogenetic lineages (Fig. 2d). Of the 1,112 MAGs, 57.6% (641/1,112) were identified as carrying multi-species mobile ARGs. Among MAGs harboring multi-species mobile ARGs, the proportion of the Bacteroidetes phylum was higher than that with immobile ARGs (Fig. 3c), suggesting that the Bacteroidetes phylum could be more prone to horizontal gene transfer to survive in AS with antibiotics. In terms of resistance mechanisms and drug classes, the relative abundances of glycopeptide and macrolide-lincosamide-streptogramin resistance genes were also higher in mobile than immobile ARGs, suggesting that these classes could potentially be more mobile in AS (Fig. 3c). Most mobile ARG clusters can transfer across multi-species, while only 4% (26/682) of ARG clusters exhibit the ability to move across multi-phyla (Fig. 3d). Notably, 65% (17/26) of multi-phyla mobile ARG clusters are associated with antibiotic inactivation. Horizontal transfer of antibiotic inactivation resistance genes plays a crucial role in microbial survival by enhancing adaptability, accelerating the dissemination of resistance, and conferring evolutionary advantages in antibiotic-rich environments46. Horizontal transfer poses considerable challenges to public health.

Drivers of global AS resistomes

We quantitatively assessed the relative contribution of stochastic versus deterministic processes to the global AS resistome variations with the metric of normalized stochasticity ratio (NST)47. The NST estimated for resistomes was generally above 0.5 for all continents except Europe (Fig. 4a and Supplementary Note 6), suggesting that stochastic processes may play a role in the AS resistome variations. Multiple regression on matrices (MRM)-based variance partition analysis (VPA) also revealed that substantial variations (67.4%) of the resistomes remained unexplained by the measured environmental variables and geographical distance (Fig. 4b and Supplementary Note 6). While these results align with previous findings that stochastic processes are important in shaping bacterial community assembly in AS31, it is critical to note that apparent stochasticity could mask unmeasured deterministic pressures, such as environmental stresses from antibiotics43, heavy metals48, or microplastics49. Additionally, methodological limitations, including sequencing depth and database biases, might constrain our ability to resolve deterministic signals. Thus, while stochastic processes likely contribute to AS resistome variations, deterministic factors should not be overlooked.

Fig. 4: Drivers for the AS resistomes.

a Normalized stochasticity ratio (NST) quantifies the relative importance of stochasticity in governing resistomes. Sample size: n = 6, 59, 14, 20, 106, and 21 samples for Africa, Asia, Australasia, Europe, North America, and South America, respectively. b The variance partition analysis (VPA) results indicated that the relative contributions of geographic distance (Geo), environmental variables (ENV), and their interactions to the variation of the AS resistomes all reached a significant level (two-sided p < 0.05). c PLS models of the relationships among microbiome (PC1 of bacterial community structure), resistome (the total ARG abundance, PC1 of ARG composition, abundances of the top three resistance mechanisms), the abundance of MGEs located near (< 10 kb) ARGs, ARG-correlated environmental variables, and ecosystem functions (the removal rate of BOD, COD, total nitrogen, total phosphorus). Directions for all arrows are from an independent variable to a dependent variable in the forward selected PLS models (p < 0.05); only the variables with variable influence on projection > 1 are presented. The numbers near the pathway arrow indicate the proportion of variance explained for every dependent variable, with the top row representing the partial R2 index based on PLS and the bottom row representing Pearson correlation R2. The asterisks denote the significance levels with *** p < 0.01, ** p < 0.05, and * p < 0.10 (two-sided). The colors of pathways are related to the positive (blue) or negative (red) relationships. The widths of pathways are related to the partial R2 index. Source data are provided as a Source Data file.

To further discern the roles of individual deterministic factors, we examined the environmental variables having significant correlations (p < 0.05) with changes in ARG abundance by using univariate models (Supplementary Table 2). The mixed liquor suspended solids (MLSS), temperature, and city population showed positive correlations with the ARG abundance (Supplementary Figs. 7a–c and Supplementary Note 7). Conversely, the ARG abundance was negatively correlated with pH, solids retention time, and influent biochemical oxygen demand (BOD) (Supplementary Figs. 7d–f and Supplementary Note 7), which have been reported to play important roles in regulating the structure of the AS bacterial community31,36. Unlike previous observations indicating that the abundance of sewage ARGs is strongly correlated with socio-economic factors15, we found no significant correlation between ARG abundance and per capita GDP or country-level antibiotic use50 in the country where the WWTP is located (Supplementary Table 2). The lack of correlation may suggest that antibiotic concentrations in AS are insufficient to exert a significant selective pressure for ARG maintenance and propagation51. However, the resolution of the antibiotic use data (only 15 country-level observations) may be too low to reveal its impact on the ARG abundance in AS.

A more in-depth analysis using partial least squares (PLS) further revealed potential direct and indirect effects of biotic and abiotic drivers (Fig. 4c). PLS analysis indicated that the bacterial community structure, MGEs, temperature, and city population could affect the AS resistome, which further influenced the AS ecosystem functioning for pollutant removal. Temperature had a direct influence on ARG abundance (Pearson r = 0.39, partial R2 = 0.08) and indirectly affected ARG abundance through the bacterial community structure (Pearson r = 0.54, partial R2 = 0.14 of the first principal component score (PC1) representing the community structure). Because temperature is a primary driver of biological processes52, temperature likely has important effects on ARG abundance and distribution8. Although the potential mechanisms underlying the relationships between ARGs and temperature are not clear, temperature could facilitate horizontal gene transfer, population growth, biotic interactions, and community turnovers53,54,55. ARG abundance was also directly influenced by the abundance of proximal MGEs (Pearson r = 0.30, partial R2 = 0.09). Several studies have shown that MGEs can carry multiple ARGs and contribute to their spread within bacterial populations, thereby increasing the ARG abundance56,57. Another factor that had a direct positive effect on ARG abundance was the city population (Pearson r = 0.30, partial R2 = 0.05). A higher population may be associated with an increased use and sewage discharge of antibiotics, exacerbating the emergence and spread of ARGs in bacteria10. Overall, although the abiotic environmental variables had significant effects on the resistome, their impact was relatively small (partial R2 < 0.1, Fig. 4c), which is consistent with the null model-based stochasticity ratio (Fig. 4a) and MRM-VPA analysis (Fig. 4b) showing that stochastic processes may play a more important role.

Concluding remarks

Understanding the global ARG abundance, diversity, and distribution, along with their controlling mechanisms, is critical to the risk assessment and mitigation of antibiotic resistance. By analyzing the AS resistomes via well-coordinated international efforts, this study showed that ARGs are highly abundant, diverse, and widely distributed across global WWTPs; this corroborates that WWTPs are an important reservoir of environmental ARGs5,58,59,60. By offering a global-scale characterization of ARGs, this study provides inter-continental and inter-country comparisons of the resistomes in WWTPs. Our results revealed that the structures of activated sludge resistomes differed among continents and were far distant from those of the human gut and oceans, but they exhibited close similarity to those of sewage and soils. We also recovered more than a thousand dereplicated high-quality MAGs, which could enable more in-depth analyses of ARG hosts and the quantification of ARG mobility. In addition, our analyses indicate that resistome variations in activated sludge may be driven by stochastic processes, such as random gene exchanges and drift61. However, deterministic factors such as temperature and city population still played important roles in the evolution and proliferation of ARGs in global WWTPs.

Methods

Global sampling and DNA sequencing

A total of 1,186 AS samples were collected by the GWMC from 269 WWTPs across 23 countries with varying geographic locations, latitudes, and climate zones31. A unified protocol (http://gwmc.ou.edu/files/Sampling_Shipping_Protocol_General_20141103.pdf) was developed by the GWMC for sampling, sample preservation, metadata collection, DNA collection, and sequencing to minimize the effects of procedural variation. Of the 1,186 AS samples, 226 representative samples had sufficient metadata to be used for metagenomic sequencing.

Detailed information about the procedure of DNA extraction is described in Wu et al.31. In brief, the MoBio PowerSoil DNA isolation kit was used to isolate community DNA from mixed liquor samples (3 mL). We vortexed 12 bead tubes at maximum speed for 10 minutes, following the manufacturer’s protocol, to minimize variations in cell lysis efficiency between samples. Then, we constructed genomic DNA libraries by following the manufacturer’s instructions with an average insert size of 300 bp using KAPA Hyper Prep Kit (KR0961). DNA LabChip 1000 kit from Agilent was used to assess the quality of all libraries, and all qualified libraries were sequenced at the Oklahoma Medical Research Foundation (OMRF) with paired-end sequencing on Illumina HiSeq3000. The sequenced reads were deposited in the Sequence Read Archive (BioProject accession number PRJNA509305).

Metagenomic sequences processing

An internal metagenomic pipeline (ARMAP, http://zhoulab5.rccc.ou.edu/pipelines/ARMAP_web/job_submission.php) was used to process the metagenomic data. First, all sequenced reads were subjected to FastQC for quality evaluation with quality profile, duplication rates, and contamination rates. Using CD-HIT (v4.6.8)62, a 100% identity cutoff was used to remove duplicates. Quality trimming and filtering were performed using NGS QC Toolkit (v2.3.3)63. The paired-end adapter library was used to detect reads with residual adapters. Raw reads were filtered with the following constraints: (i) reads with more than one ambiguous N base were removed; (ii) 3′-ends of reads were trimmed to the first high-quality base with quality score ≥ 20; and (iii) trimmed reads with the length > 120 bp (80% of the sequence read length) were further filtered with an average quality score cutoff of 20. The paired-end reads (fasta) of each sample after quality trimming and filtering were assembled by MEGAHIT (v1.0.5)64 into contigs in a time- and cost-efficient way, using the following parameters: --min-contig-len = 1000, --k-min = 31, --k-max = 131, --k-step = 20, and --min-count = 1. All assembled contigs were imported into the NGS QC Toolkit for the calculation of the contig length profiles (N50Stat.pl).
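The three read-filtering constraints listed above can be expressed as a short Python sketch. This is not the NGS QC Toolkit implementation; the function name and parameters are hypothetical, and two points are assumptions: reads that are 120 bp or shorter after trimming are discarded, and reads whose mean quality falls below 20 are removed.

```python
def trim_and_filter(seq, quals, min_q=20, min_len=120, max_n=1):
    """Apply the three filtering rules to a single read.

    seq: read sequence; quals: per-base Phred scores (list of ints).
    Returns the trimmed (seq, quals), or None if the read is discarded.
    """
    # (i) Remove reads with more than one ambiguous N base.
    if seq.upper().count("N") > max_n:
        return None
    # (ii) Trim the 3' end back to the first base with quality >= min_q.
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # (iii) Keep only trimmed reads longer than min_len (assumption: shorter
    # reads are dropped) whose mean quality reaches the cutoff.
    if len(seq) <= min_len:
        return None
    if sum(quals) / len(quals) < min_q:
        return None
    return seq, quals


# Toy read: 130 bases, the last five of low quality.
kept = trim_and_filter("A" * 130, [30] * 125 + [10] * 5)
print(len(kept[0]))
```

Here the five low-quality bases at the 3′ end are trimmed, leaving a 125-bp read that passes the length and mean-quality checks.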

ARGs annotation for open reading frames

Open reading frames (ORFs) of protein-coding genes were predicted from the assembled contigs of each metagenome by Prodigal (v2.6.3)65 with ‘-p meta’ option. A non-redundant ORF catalog was constructed by protein clustering using MMseqs266, with a minimum identity threshold of 95% and a minimum sequence coverage of 90% (--min-seq-id 0.95 -c 0.9 --cluster-mode 2 --cov-mode 1). The coverages of the non-redundant genes in each sample were determined by CoverM (v0.6.1) (https://github.com/wwood/CoverM) using default settings. Then, non-redundant ORFs were functionally annotated against the Comprehensive Antibiotic Resistance Database (CARD)67 and the ResFams database68. Genes were first assigned as ARGs by annotating with CARD using its recommended tool, Resistance Gene Identifier (RGI) (v6.0.0), requiring a hit scoring above the family-specific threshold under the CARD homolog model, with the top hit taken when several hits qualified. The remaining unannotated genes were subsequently annotated with ResFams protein families, requiring the score to a ResFams hidden Markov model to exceed the gathering threshold for that model. The ORFs annotated to ResFams were represented as gene families. The following criteria were used to remove potential false positive ARGs: (i) genes that confer resistance via the overexpression of resistant target alleles (e.g., resistance to antifolate drugs via mutated DHPS and DHFR); (ii) global gene regulators, two-component system proteins, and signaling mediators; (iii) efflux pumps that confer resistance to multiple antibiotics; (iv) genes modifying cell wall charge (e.g., those conferring resistance to polymyxins and defensins). The raw (unnormalized) abundance of each ARG in a sample was calculated as the summed coverage depths of all ORFs annotated to that ARG in the given sample.
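The annotation precedence (CARD first, ResFams only for ORFs left unannotated) and the raw abundance calculation (summed coverage of all ORFs assigned to an ARG) can be sketched as follows. This assumes the RGI and ResFams hits have already been filtered by their respective thresholds; the function names, ORF identifiers, and labels are hypothetical.

```python
from collections import defaultdict


def merge_annotations(card_hits, resfams_hits):
    """Combine annotations so that a CARD hit takes precedence over ResFams."""
    merged = dict(resfams_hits)
    merged.update(card_hits)  # CARD overrides any ResFams label for the same ORF
    return merged


def raw_arg_abundance(orf_coverage, orf_to_arg):
    """Sum coverage depths of all ORFs annotated to each ARG in one sample."""
    totals = defaultdict(float)
    for orf, coverage in orf_coverage.items():
        arg = orf_to_arg.get(orf)
        if arg is not None:  # ORFs without an ARG annotation are skipped
            totals[arg] += coverage
    return dict(totals)


# Toy input: orf1 has both a CARD and a ResFams hit; orf3 has neither.
card_hits = {"orf1": "ARG_A"}
resfams_hits = {"orf1": "family_X", "orf2": "family_X"}
coverage = {"orf1": 2.0, "orf2": 3.5, "orf3": 1.0}

annotation = merge_annotations(card_hits, resfams_hits)
abundance = raw_arg_abundance(coverage, annotation)
print(abundance)
```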

To assess the ARG distributions in AS samples, the raw abundance of ARGs was normalized and expressed as “copy of ARG per cell” using Eq. (1).

$${Abundance}= \sum _{i=1}^{n}\frac{{{Coverage}}_{i({{\rm{ARG}}}-{{\rm{like}}}\; {{\rm{gene}}})}}{{{Coverage}}_{16{{\rm{S}}}\; {{\rm{sequence}}}}}\times {N}_{16{{\rm{S}}}\; {{\rm{copy}}}\; {{\rm{number}}}}\\ = \sum _{i=1}^{n}\frac{{N}_{i({{\rm{ARG}}}-{{\rm{like}}}\; {{\rm{sequence}}})}\times {L}_{{reads}}/{L}_{i({{\rm{ARG}}}\; {{\rm{ORF}}})}}{{N}_{16{{\rm{S}}}\; {{\rm{sequence}}}}\times {L}_{{reads}}/{L}_{16S{{\rm{sequence}}}}}\times {N}_{16{{\rm{S}}}\; {{\rm{copy}}}\; {{\rm{number}}}}$$
(1)

Where \({{Coverage}}_{i({{\rm{ARG}}}-{{\rm{like\; gene}}})}\) is the coverage of a specific ARG ORF, which is calculated from the number of reads annotated to this ORF (\({N}_{i({{\rm{ARG}}}-{{\rm{like\; sequence}}})}\)), the sequence length (bp) of the reads (\({L}_{{reads}}\)), and the length (bp) of the corresponding ARG ORF (\({L}_{i({{\rm{ARG\; ORF}}})}\)). For the coverage of 16S rRNA gene (\({{Coverage}}_{16{{\rm{S\; sequence}}}}\)) calculation, \({N}_{16{{\rm{S\; sequence}}}}\) is the number of the 16S rRNA gene sequences identified for the metagenomic data by Metaxa2 (v2.248)69, \({L}_{{reads}}\) represents the sequence length of the reads, \({L}_{16S{{\rm{sequence}}}}\) is the average length of 16S rRNA genes (1,432 bp) in Greengenes database70. \({N}_{16{{\rm{S}}}\; {{\rm{copy}}}\; {{\rm{number}}}}\) is the average copy number of 16S rRNA genes per cell in the community, and n is the number of annotated ARGs for a specific category. The average copy number in the community was calculated as the abundance-weighted mean 16S rRNA gene copy number, where the 16S rRNA gene copy number of each genus was estimated through the rrnDB database based on its closest relatives with known rRNA gene copy number71,72. It is noted that the normalized ARG abundance (gene copies per cell) depends on the algorithms for identifying ARGs and 16S rRNA genes. There could be false positives and false negatives; thus, the resultant ARG abundance may not reflect the real values in the community. However, we can still conduct relative comparisons across different samples, under the assumption that the estimations across samples are subjected to the same degree of bias. In this way, we can compare the abundance of ARGs between samples and explore the underlying mechanisms shaping the resistomes.
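As an illustration of Eq. (1), the following Python sketch computes the abundance-weighted mean 16S rRNA gene copy number and the normalized ARG abundance for one sample. The function names and input layout are hypothetical; only the formula and the 1,432 bp average 16S length come from the text. Genera without a known copy number are simply skipped here, which is an assumption.

```python
from typing import Dict, List, Tuple

MEAN_16S_LENGTH = 1432  # bp, average 16S rRNA gene length quoted in the text


def weighted_mean_copy_number(genus_abundance: Dict[str, float],
                              genus_copy_number: Dict[str, float]) -> float:
    """Abundance-weighted mean 16S copy number over genera with a known value."""
    known = {g: a for g, a in genus_abundance.items() if g in genus_copy_number}
    total = sum(known.values())
    if total == 0:
        raise ValueError("no genus with a known copy number")
    return sum(a * genus_copy_number[g] for g, a in known.items()) / total


def arg_copies_per_cell(arg_orfs: List[Tuple[int, int]],
                        n_16s_reads: int,
                        read_length: int,
                        copy_number: float) -> float:
    """Eq. (1): arg_orfs holds (reads mapped to the ORF, ORF length in bp)."""
    coverage_16s = n_16s_reads * read_length / MEAN_16S_LENGTH
    coverage_arg = sum(n * read_length / length for n, length in arg_orfs)
    return coverage_arg / coverage_16s * copy_number
```

Because the read length appears in both numerator and denominator, it cancels; it is kept in the sketch only to mirror the terms of Eq. (1).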

Mobile genetic elements (MGEs) annotation

To determine the diversity of MGEs in the AS, we annotated MGEs for the non-redundant ORFs by BLASTN (-perc_identity 0.5 -evalue 1e-10 -max_target_seqs 1) against the previously published database of MGEs73. This database consists of MGEs with 278 different genes and more than 2,000 unique sequences. The raw abundance of each MGE in a sample was calculated as the summed coverage depths of all ORFs annotated to that MGE and normalized as “copy of MGE per cell” in the same manner as for the ARGs.

To quantify the mobility potential of ARGs, we performed a co-localization analysis between ARGs and MGEs on all assembled contigs. We first annotated the ARGs and MGEs on all contigs and then identified the contigs carrying both ARGs and MGEs for calculating the minimum distance between them. ARGs with potential mobility were defined as sharing a nearby area (< 10 kb)74 with an MGE. We calculated the proportions of mobile ARGs in each sample. We also calculated the raw abundance of MGEs co-located (< 10 kb) with ARGs using the coverage of the corresponding contigs in the given sample, which was determined by CoverM (v0.6.1) using default settings. The raw abundances of MGEs were then normalized as “copy per cell” with the above method.
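The co-localization rule can be sketched as follows. The `Gene` record, the coordinate convention, and the function names are hypothetical; the only elements taken from the text are the same-contig requirement and the < 10 kb threshold. Distance is measured here as the gap between the nearest gene ends, which is an assumption.

```python
from typing import List, NamedTuple

MAX_DISTANCE = 10_000  # bp; "nearby" threshold quoted in the text


class Gene(NamedTuple):
    contig: str
    start: int  # gene start coordinate on the contig
    end: int    # gene end coordinate on the contig (end >= start)


def gap(a: Gene, b: Gene) -> int:
    """Distance in bp between two genes on one contig (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def mobile_fraction(arg_genes: List[Gene], mge_genes: List[Gene]) -> float:
    """Proportion of ARGs with an MGE closer than MAX_DISTANCE on the same contig."""
    if not arg_genes:
        return 0.0
    mobile = 0
    for arg in arg_genes:
        distances = [gap(arg, m) for m in mge_genes if m.contig == arg.contig]
        if distances and min(distances) < MAX_DISTANCE:
            mobile += 1
    return mobile / len(arg_genes)
```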

Taxonomic profiling of the metagenomic sequences

Bacterial-community profiling at the genus level was done using Metaxa2 (v2.248)69, based on the bacterial 16S rRNA reads extracted from the high-quality metagenomic reads. The bacterial profile was also represented by the OTU table based on 16S rRNA amplicon sequencing data, which was published by Wu et al. 31. The relative abundance of a taxonomic category was calculated as the sum of reads annotated to that category and normalized by the total number of taxonomic reads in each sample.

MAG recovery, taxonomic annotation, and phylogenetic tree construction

All assembled contigs longer than 1 kbp were binned with MetaBAT275, MaxBin276, and CONCOCT (v0.4.1)77 based on contig composition and coverage. Before binning, Bowtie278 was used to align short-read sequences to contigs (options: --very-fast), and SAMtools79 was used to sort and convert SAM files to BAM format. Then, DAS Tool80 was used to refine binned contigs with default parameters, with Usearch81 as the search engine. We used CheckM (v1.0.6)82 to estimate the completeness and contamination of each bin. To obtain a non-redundant set of bins, the dRep83 dereplication workflow was used with options ‘dereplicate_wf -p 16 -pa 0.9 -sa 0.95 -nc 0.3 -comp 70 -conn 10 -str 100 -strW 0’. Bin scores were given as completeness-5×contamination+0.5×log(N50), and only the highest-scoring MAGs from each cluster (> 95% average nucleotide identity) were retained in the dereplicated set. The bins with high completeness (> 90%) and low contamination (< 5%) were retained as high-quality MAGs and were used for downstream analyses.
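The bin-scoring and quality-filtering rules above amount to a small calculation, sketched below in Python. The `Bin` record and function names are hypothetical, and a base-10 logarithm is assumed for log(N50) since the text does not state the base.

```python
import math
from typing import Dict, List, NamedTuple


class Bin(NamedTuple):
    name: str
    cluster: str          # dereplication cluster (> 95% ANI)
    completeness: float   # percent
    contamination: float  # percent
    n50: int              # bp


def bin_score(b: Bin) -> float:
    """completeness - 5 x contamination + 0.5 x log(N50); log base 10 is assumed."""
    return b.completeness - 5.0 * b.contamination + 0.5 * math.log10(b.n50)


def best_per_cluster(bins: List[Bin]) -> Dict[str, Bin]:
    """Keep the highest-scoring bin from each cluster."""
    best: Dict[str, Bin] = {}
    for b in bins:
        if b.cluster not in best or bin_score(b) > bin_score(best[b.cluster]):
            best[b.cluster] = b
    return best


def is_high_quality(b: Bin) -> bool:
    """High-quality MAG: completeness > 90% and contamination < 5%."""
    return b.completeness > 90.0 and b.contamination < 5.0
```

Note how the contamination penalty dominates: a bin with 98% completeness and 4% contamination scores lower than one with 95% completeness and 2% contamination at equal N50.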

The taxonomy of the representative MAGs was assigned using GTDB-Tk v2.1.084 based on the Genome Taxonomy Database85. In addition, to identify the pathogenic genomes, we first selected the potential ones by referring to two published reference pathogen lists that consisted of 140 potentially human pathogenic genera40 and 538 potentially human pathogenic species41. Then, we searched the ORFs of taxonomically predicted potentially pathogenic genomes against the experimentally verified bacterial virulence factor database VFDB (last update: Dec. 11, 2020)39 with BLASTN. Genomes carrying virulence factors matched at a global identity > 70% were considered pathogens. The phylogenetic relationships of all MAGs were inferred by a maximum likelihood alignment-based approach with PhyloPhlAn386 (--diversity high, --fast, with configurations --db_aa diamond, --map_dna diamond, --map_aa diamond, --msa mafft, --trim trimal, --tree1 iqtree). Visualization and annotation of the tree were done using GraPhlAn87. It should be noted that it has proven difficult to assemble genomes for populations below 1% relative abundance owing to insufficient sequencing depth or difficulty in binning and assembly of individual genomes from complex metagenomes88.

ARG host and mobility annotation for MAGs

For the near-complete MAGs, ARGs on the MAGs’ contigs were also identified based on CARD67 and the ResFams database68 as above. The mobile ARGs were defined as identical or near-identical sequences present in different species16,44. Since our recovered MAGs were dereplicated at an average nucleotide identity of 95%, they represented species-level genome bins89,90. Thus, we searched for mobile ARGs as those present in two or more MAGs. To achieve this, we first clustered the nucleotide sequences of all detected ARG ORFs into ARG clusters with 99% identity, using the ‘cluster’ command of MMseqs266 with ‘--min-seq-id 0.99 -c 0.9 --cov-mode 0’. We then labeled any ARG cluster that was found in multiple MAGs as ‘multi-species’, which was considered evidence of recent horizontal gene transfer. This strategy of searching for ARG clusters across species to detect recent horizontal gene transfer is equivalent to that used in some other studies on human microbiomes16,44.
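Once the ARG ORFs have been clustered, the ‘multi-species’ labeling reduces to counting distinct MAGs per cluster. A minimal Python sketch, assuming the clustering output has already been parsed into (cluster identifier, MAG identifier) pairs; the function name and input layout are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def multi_species_clusters(assignments: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return ARG clusters found in two or more MAGs.

    assignments: one (ARG cluster id, MAG id) pair per detected ARG ORF.
    """
    mags_per_cluster: Dict[str, Set[str]] = defaultdict(set)
    for cluster_id, mag_id in assignments:
        mags_per_cluster[cluster_id].add(mag_id)
    return {c for c, mags in mags_per_cluster.items() if len(mags) >= 2}
```

Multiple copies of an ARG within a single MAG do not count, because MAG identifiers are collected in a set.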

Analyzing metagenomic samples from other environments

To compare AS resistomes with those in other environments, we selected public global metagenomic projects on the human gut35, sewage15, soil36, and oceans37 and collected samples from these public databases. The raw metagenomic sequences were downloaded from the European Bioinformatics Institute Sequence Read Archive database (sewage: PRJEB13831, soil: ERP020652, gut: ERP004605, ocean: ERP001736). To avoid bias caused by data processing, we re-processed the raw sequences with the same quality trimming and filtering parameters as in our pipeline to obtain high-quality sequences. Rather than using the contig-based approach to annotate ARGs, which requires significant time and vast computational resources for the assembly step, here we profiled the abundance of ARGs through a read-based mapping strategy. The read-based approach enabled an efficient comparison of resistomes between environments. We annotated ARGs from the high-quality metagenomic sequences by DeepARG (v2)91, which can infer ARGs from short reads, using the default options (--id 50, -e 1e-10, -k 1000 of short reads mode). The abundances of ARGs were normalized to the unit of “copy per cell”34 in a similar manner as described above, although the calculation of ARG coverage differed slightly from Eq. (1), as shown in Eq. (2).

$${Abundance}= \sum _{i=1}^{n}\frac{{{Coverage}}_{i({{\rm{ARG}}})}}{{{Coverage}}_{16{{\rm{S}}}\; {{\rm{sequence}}}}}\times {N}_{16{{\rm{S}}}\; {{\rm{copy}}}\; {{\rm{number}}}}\\ = \sum _{i=1}^{n}\frac{{N}_{i({{\rm{ARG}}}-{{\rm{like}}}\; {{\rm{sequence}}})}\times {L}_{{reads}}/{L}_{i({{\rm{ARG}}}\; {{\rm{reference}}}\; {{\rm{sequence}}})}}{{N}_{16{{\rm{S}}}\; {{\rm{sequence}}}}\times {L}_{{reads}}/{L}_{16S{{\rm{sequence}}}}}\times {N}_{16{{\rm{S}}}\; {{\rm{copy}}}\; {{\rm{number}}}}$$
(2)

Where \({N}_{i({{\rm{ARG}}}-{{\rm{like\; sequence}}})}\) is the number of ARG-like reads annotated as one specific ARG reference sequence, \({L}_{i({{\rm{ARG}}}\; {{\rm{reference}}}\; {{\rm{sequence}}})}\) is the sequence length (bp) of the corresponding ARG reference sequence.

To compare the results of the two ARG detection methods (contig-based and read-based approaches), we performed Procrustes analyses between the resultant AS ARG abundance matrices from the two methods using the function ‘procrustes’ of vegan R package92. We also examined the correlation between the total abundances from the two methods using the function ‘lm’ of R.

Statistical analyses

The global map was created using the function ‘tm_shape’ of spData R package (10.32614/CRAN.package.spData). Richness and Shannon’s H index were computed using the vegan R package92 to measure the diversity of ARGs or MGEs based on a rarefied count matrix, which was obtained by rounding the coverages and sub-sampling each sample to the lowest total count among samples. The richness and Shannon’s H diversity rarefaction curves for bacteria and ARGs were based on the reads mapping to the bacterial 16S rRNA genes and ARGs, respectively. The curves were computed using the function ‘rarefaction.individual’ of the rareNMtests93 R package and plotted using the ggplot294 R package. The Kruskal–Wallis test and the Dunn post hoc test were used to compare ARG abundance or diversity between continents or countries, using the R function ‘kruskal.test’ and the function ‘dunnTest’ of FSA R package95. To visualize the variation of resistomes across samples, principal coordinate analysis (PCoA) was performed on the resistome Bray-Curtis dissimilarity matrix based on gene relative abundances, using the function ‘pcoa’ of ape R package96. The heat map of genes was generated using the function ‘aheatmap’ of NMF R package97. PERMANOVA was applied to assess the resistome dissimilarities among continents using the function ‘adonis2’ of vegan R package. Procrustes analysis was performed to test the association between bacterial taxonomic composition and the resistome using the function ‘procrustes’ of vegan R package, in which the ordinations of the bacterial taxonomic composition and the resistome were generated from PCoA. To disentangle the relative contributions of stochastic and deterministic processes to the AS resistome, the null model-based normalized stochasticity ratio (NST) approach47 was applied to community ARG data. NST was used to quantify ecological stochasticity in communities within continents and was analyzed in R using the NST package47.
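For readers unfamiliar with the diversity measures named above, the following Python sketch shows the underlying arithmetic for rarefaction by sub-sampling, richness, Shannon’s H, and Bray-Curtis dissimilarity. It is an illustration only and not the vegan implementation; the function names are hypothetical, and the natural logarithm is used for Shannon’s H, matching vegan’s default.

```python
import math
import random
from typing import List


def rarefy(counts: List[int], depth: int, seed: int = 0) -> List[int]:
    """Sub-sample integer counts without replacement down to `depth`."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    picked = random.Random(seed).sample(pool, depth)
    out = [0] * len(counts)
    for i in picked:
        out[i] += 1
    return out


def richness(counts: List[int]) -> int:
    """Number of genes with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts: List[int]) -> float:
    """Shannon's H with natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(x: List[float], y: List[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))
```

In practice each sample’s rounded coverage vector would be passed through `rarefy` with `depth` set to the smallest sample total before computing richness and Shannon’s H.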

To estimate the relative contributions of the environmental effects versus the distance effects on the resistome dissimilarities, we performed a variation partition analysis (VPA) based on multiple regression on matrices (MRM). Briefly, we first selected a non-redundant set of environmental variables that contained missing data in less than 20% of all samples. The final set included mixed liquor temperature, air temperature, precipitation, design capacity, volume of aeration tanks, plant age, mixed liquor suspended solids (MLSS), solids retention time (SRT), dissolved oxygen (DO), pH, influent biochemical oxygen demand (BOD), effluent BOD, food to microorganism (F/M) ratio, and city GDP. The variance inflation factors (VIF) were less than 10, indicating a low level of collinearity among these variables. MRM was performed using the function ‘MRM’ of ecodist R package98. Geographic distance was log-transformed. A Euclidean distance matrix was calculated for each environmental variable. In VPA, the R2 of the selected environmental variables as independent matrices (\({{{{\rm{R}}}}^{2}}_{E}\)), geographical distance as an independent matrix (\({{{{\rm{R}}}}^{2}}_{G}\)), and all matrices (\({{{{\rm{R}}}}^{2}}_{T}\)) were used to compute the three components of variations: (i) pure environmental variation = \({{{{\rm{R}}}}^{2}}_{T}-{{{{\rm{R}}}}^{2}}_{G}\); (ii) pure geographical distance = \({{{{\rm{R}}}}^{2}}_{T}-{{{{\rm{R}}}}^{2}}_{E}\); and (iii) spatially structured environmental variation = \({{{{\rm{R}}}}^{2}}_{G}+{{{{\rm{R}}}}^{2}}_{E}-{{{{\rm{R}}}}^{2}}_{T}\). Univariate models predicting the total ARG abundance (ARG copies per cell) as a function of various environmental and site variables were fitted using the R functions ‘lm’ and ‘summary’. For each variable, we fitted a linear and a quadratic model, and the results are shown for the model with the lower Akaike information criterion (AIC) value.
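The three VPA components defined above follow directly from the three MRM R2 values. A minimal Python sketch, with a hypothetical function name and made-up input values used purely for illustration:

```python
from typing import Dict


def variation_partition(r2_env: float, r2_geo: float, r2_total: float) -> Dict[str, float]:
    """Split the total explained variation into the three VPA components."""
    return {
        "pure_environment": r2_total - r2_geo,
        "pure_geography": r2_total - r2_env,
        "spatially_structured_environment": r2_geo + r2_env - r2_total,
    }


# Illustrative values only, not results from the study.
parts = variation_partition(r2_env=0.30, r2_geo=0.10, r2_total=0.35)
```

By construction the three components sum to the total R2 of the full model, which provides a quick consistency check on the inputs.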

The partial least squares (PLS) model with a partial R2 index based on PLS99 was used to explore the relationships among the microbiome (PC1 of bacterial community structure), resistome (the total ARG abundance, PC1 of ARG composition, abundances of the top three resistance mechanisms), the abundance of MGEs located near (< 10 kb) ARGs, six environmental variables which significantly correlated (p < 0.05) with the total ARG abundance based on the univariate models, and ecosystem functions (the removal rate of BOD, COD, total nitrogen and total phosphorus). Based on predictive performance, measured by the explained variation (\({R}_{Y}^{2}\)), and model significance (P for \({R}_{Y}^{2}\) and \({Q}_{Y}^{2}\) < 0.05, where a significant \({Q}_{Y}^{2}\) helps to avoid overfitting), each optimum PLS model was forward selected from all factors that might affect the dependent variable. To visualize relevant associations, we only included the most relevant variable(s) with Variable Influence on Projection (VIP) values larger than 1. When used as independent variables in PLS, the ARG composition was represented by the PC1 from PCoA of Bray-Curtis distance. We used a partial \({R}^{2}\) index100 on the basis of PLS to represent the proportion of variance explained by each independent variable (Eq. 3). We also calculated the pairwise correlation coefficient (as well as the \({R}^{2}\)) among the factors, with significance based on Pearson correlation as a reference. The PLS-related analysis was performed using the ropls package101 and the Mantel test using the vegan package92 in R.

$${R}_{{{\rm{PLS}}}j}^{2}={R}_{Y}^{2}\times \frac{{\sum}_{f}({W}_{{jf}}^{2}\times {{SSY}}_{f})}{{{SSY}}_{{cum}}}=\frac{{\sum}_{f}({W}_{{jf}}^{2}\times {{SSY}}_{f})}{{SSY}}$$
(3)

Where \({R}_{{{\rm{PLS}}}j}^{2}\) is the partial R2 of variable \(j\) based on PLS, \({W}_{{jf}}\) is the PLS weight of variable \(j\) on component \(f\), \({{SSY}}_{f}\) is the sum of squares of \(Y\) explained by component \(f\), \({{SSY}}_{{cum}}\) is the cumulative sum of squares of \(Y\) explained by all components, \({R}_{Y}^{2}\) is the percentage of \(Y\) dispersion (i.e., sum of squares) explained by the PLS model, and \({SSY}\) is the \(Y\) dispersion, that is, sum of squares of \(Y\).
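The partial R2 index in Eq. (3) can be computed from the PLS weights and the per-component sums of squares, as in the Python sketch below. The function name and input layout are hypothetical, and the numbers are made up for illustration. The sketch assumes the weight vector of each component has unit norm, in which case the partial R2 values of all variables sum to \({R}_{Y}^{2}\).

```python
from typing import List


def partial_r2(weights: List[List[float]],
               ssy_components: List[float],
               ssy_total: float) -> List[float]:
    """Eq. (3): weights[j][f] is the PLS weight of variable j on component f,
    ssy_components[f] is the sum of squares of Y explained by component f,
    and ssy_total is the total sum of squares of Y."""
    return [
        sum(w * w * ssy for w, ssy in zip(row, ssy_components)) / ssy_total
        for row in weights
    ]


# Illustrative values only: two variables, two components with unit-norm weights.
example = partial_r2(
    weights=[[0.6, 0.8], [0.8, -0.6]],
    ssy_components=[50.0, 10.0],
    ssy_total=100.0,
)
```

In this example the model explains 60 of 100 units of Y dispersion, and the two partial R2 values add up to 0.6 accordingly.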

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.