Extended Data Fig. 3: Additional checks of MAG quality after clustering genomes into OTUs.
From: New insights from uncultivated genomes of the global human gut microbiome

a–c, MAGs and reference genomes were clustered into species-level OTUs on the basis of 95% ANI. As validation, OTUs were compared to the NCBI and GTDB taxonomies for 65,900 reference genomes with valid species names. a, Box plots of the number of genomes per species, in which the middle line denotes the median, the box denotes the IQR and the whiskers denote 1.5× IQR. b, The number of species per database. c, Similarity between OTUs and other databases, as measured using the adjusted mutual information statistic. Species-level OTUs are concordant with the NCBI and GTDB taxonomies. d, e, MAGs and reference genomes were further clustered into higher-rank OTUs on the basis of phylogenetic distance cut-offs. Rank-specific cut-offs were identified that maximized similarity to the GTDB. f, As an additional indicator of completeness, genome sizes of high-quality MAGs and reference genomes from the same OTU were compared. Each point indicates one species-level OTU (n = 625). A positive slope close to 1.0 indicates no systematic loss of gene content. g–l, As an additional check of contamination, six single-copy marker genes (alaS, rnhB, cbf5, pheS, pheT and infB) were aligned between MAGs using BLASTN. MAGs devoid of contamination should display high percentage identity to MAGs from the same OTU, and low percentage identity to MAGs from different OTUs. The six marker genes were selected on the basis of (1) their presence in >90% of high-quality MAGs and reference genomes at single copy, and (2) having species-level percentage DNA identity cut-offs <98%. Highly conserved genes may be similar between different OTUs, and were not suitable for this analysis. For between-OTU comparisons, we used 1 MAG for each of 2,962 species-level OTUs. For within-OTU comparisons, we used 2 MAGs for each of 1,616 species-level OTUs.
The histograms indicate the distribution of DNA percentage identity between MAGs from the same species-level OTU (in which the lowest common ancestor (LCA) = species) (g), and between MAGs that are more distantly related, in which the LCA = genus (h), family (i), order (j), class (k) or phylum (l). The vast majority of genes from the same species-level OTU display >98% identity, whereas those from different OTUs display <98% identity.
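The summary behind panels g–l can be illustrated with a minimal sketch: given percentage identities from marker-gene alignments, compute the fraction that exceeds the 98% cut-off for within-OTU and between-OTU pairs. The function name `fraction_above` and the identity values below are hypothetical and chosen only for illustration; they are not data from the study, and this is not the authors' pipeline.

```python
from typing import Iterable

# Species-level percentage DNA identity cut-off named in the legend.
SPECIES_IDENTITY_CUTOFF = 98.0


def fraction_above(identities: Iterable[float],
                   cutoff: float = SPECIES_IDENTITY_CUTOFF) -> float:
    """Return the fraction of alignments with percentage identity above the cut-off."""
    values = list(identities)
    if not values:
        raise ValueError("no alignments supplied")
    return sum(v > cutoff for v in values) / len(values)


# Hypothetical BLASTN percentage identities (illustrative values only).
within_otu = [99.6, 99.1, 98.4, 97.2, 99.9]   # MAG pairs from the same OTU
between_otu = [91.3, 88.0, 95.5, 98.3, 79.4]  # MAG pairs from different OTUs

within_frac = fraction_above(within_otu)
between_frac = fraction_above(between_otu)

print(f"within-OTU pairs above cut-off:  {within_frac:.2f}")
print(f"between-OTU pairs above cut-off: {between_frac:.2f}")
```

Under the expectation stated in the legend, a set of uncontaminated MAGs would give a within-OTU fraction near 1 and a between-OTU fraction near 0.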