Fig. 3: Database evaluation based on amplicon data from the Global Water Microbiome Consortium project.

Raw amplicon data from the Global Water Microbiome Consortium project² were processed to resolve ASVs of the 16S rRNA gene V4 region. Before the analyses, the ASVs of each sample were filtered by relative abundance, keeping only ASVs with ≥0.01% relative abundance. The ASVs remaining after filtering represented 88.35 ± 2.98% (mean ± SD) of the microbial community across samples. High-identity (≥99%) hits were determined by stringent mapping of ASVs to each reference database. ASVs were classified using the SINTAX classifier. The violin and box plots show the distribution of the percentage of ASVs with high-identity hits or genus/species-level classifications for each database across n = 1165 biologically independent samples. Box plots indicate the median (middle line), the 25th and 75th percentiles (box), and the minimum and maximum values after removing outliers based on 1.5× the interquartile range (whiskers). Outliers are not shown in the box plots, to ease visualisation. Different colours distinguish the databases.
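
The per-sample abundance filter and the whisker rule described in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the dictionary-of-counts input format and the quartile interpolation method are assumptions made for illustration, and plotting software may compute quartiles slightly differently.

```python
from statistics import quantiles


def filter_asvs(counts, min_rel_abundance_pct=0.01):
    """Keep ASVs whose relative abundance in one sample is >= the threshold.

    `counts` maps a (hypothetical) ASV identifier to its read count in the sample.
    Returns the kept ASVs and the percentage of the sample's reads they represent.
    """
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0
    kept = {
        asv: n
        for asv, n in counts.items()
        if 100.0 * n / total >= min_rel_abundance_pct
    }
    retained_pct = 100.0 * sum(kept.values()) / total
    return kept, retained_pct


def whisker_limits(values):
    """Return the min and max of the values lying within 1.5 x IQR of the box.

    Quartiles use the "inclusive" interpolation method, which is an assumption;
    the figure's plotting software may use another convention.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside)


# Toy sample: ASV3 makes up 0.001% of the reads and is filtered out.
sample = {"ASV1": 90000, "ASV2": 9999, "ASV3": 1}
kept, retained_pct = filter_asvs(sample)

# Toy distribution: the value 100 lies beyond the upper fence and is excluded
# from the whisker range.
low, high = whisker_limits([1, 2, 3, 4, 5, 100])
```

In this toy input, the retained percentage corresponds to the per-sample quantity that the caption summarises as mean ± SD across samples.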