Fig. 2: Biogeography of sequence-discrete Thaumarchaeota Group I populations.
From: Reply to: “Re-evaluating the evidence for a universal genetic boundary among microbial species”

A reference genome representing the thaumarchaeotal population at 4000-m depth in the Pacific Ocean was queried against the previously described metagenomes from six different depths of the Pacific Ocean29 and the Gulf of Mexico (our unpublished data). A Range of nucleotide identities between the metagenomic reads and the reference genome (x axis), shown as letter-value plots30 with the vertical line marking the median, plotted against the depth from which the metagenomic sequences were recovered (y axis). B Read recruitment plots for selected comparisons (the uppermost box plot in panel A represents the distribution of sequence identity values of the reads against the reference genome shown in the leftmost plot in panel B). The plots in panel B are similar to the lower left panel of Fig. 1, but the data points (representing mapped reads) have been binned into a positional, hexagonal heatmap for demonstration purposes. Note that Thaumarchaeota are genetically distinct between different depths of the same water column (A) but genetically more similar across similar depths in geographically distant locations (B), and that if representative genomes or whole populations from all depths are compared, they will show a range of ANI values between 89 and ~100% among themselves. Note also that the use of short Illumina reads tends to overestimate nucleotide identity (and thus ANIr values) compared to the longer Sanger reads or whole/partial genomes used in our previous publications, especially for moderately identical sequences (e.g., in the range of 80–95% nucleotide identity), mostly due to the inability of current read-mapping algorithms to align such short sequences. Further, the mapping of metagenomic reads to the reference genome was performed with MegaBLAST, in contrast to Blastn in Fig. 1, and MegaBLAST is even less sensitive (but much faster) than Blastn in finding reads of intermediate identity (e.g., in the range of 70–90% nucleotide identity)31. Therefore, the ANIr values shown are higher than our previous estimates for similar samples (or even Blastn-derived estimates based on the same metagenomic reads) due to this technical limitation, but the diversity patterns across depths remain similar.
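
The read-identity summary behind panel A can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that the MegaBLAST hits were saved in the standard 12-column tabular format (`-outfmt 6`), that each read is represented by its highest-bitscore hit, and that ANIr is taken as the mean percent identity of the retained reads. The function names, the `min_aln_len` and `min_identity` thresholds, and the example rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
from statistics import mean

# Columns of BLAST tabular output (-outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def best_hit_identities(blast_tab, min_aln_len=50):
    """Map each read ID to the percent identity of its highest-bitscore hit.

    `min_aln_len` is an illustrative (assumed) filter on alignment length.
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id = row[0]
        pident = float(row[2])
        aln_len = int(row[3])
        bitscore = float(row[11])
        if aln_len < min_aln_len:
            continue
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (pident, bitscore)
    return {read_id: hit[0] for read_id, hit in best.items()}

def anir(identities, min_identity=0.0):
    """Mean percent identity of reads at or above `min_identity` (None if no reads pass)."""
    kept = [x for x in identities if x >= min_identity]
    return mean(kept) if kept else None

# Made-up example rows: read1 has two hits, read3 aligns over only 40 bases.
EXAMPLE = "\n".join([
    "read1\tref\t99.0\t100\t1\t0\t1\t100\t501\t600\t1e-40\t180",
    "read1\tref\t90.0\t100\t10\t0\t1\t100\t901\t1000\t1e-20\t120",
    "read2\tref\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t150",
    "read3\tref\t88.0\t40\t5\t0\t1\t40\t200\t239\t1e-5\t50",
])

identities = best_hit_identities(io.StringIO(EXAMPLE))
print(anir(identities.values()))
```

Because reads of intermediate identity that the aligner fails to recruit never appear in the table, a mean computed this way is biased upward, which is the limitation described in the caption.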