Abstract
Taxonomic classification of the thousands to millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), owing to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms has been explored in detail, the influence of the training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets (generated from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, which we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, a 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and candidate phyla TM6 and TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, with the greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases.
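To make the role of the training set concrete, the following is a minimal sketch of a k-mer naïve Bayesian classifier with a bootstrap confidence estimate, in the general style of the RDP classifier named above. It is not the authors' pipeline or the RDP implementation. The word size, the smoothing form, the bootstrap subsample size, and the function names (`train`, `classify`, `trim_to_region`) are all assumptions chosen for illustration, and the trimming coordinates are hypothetical.

```python
import math
import random
from collections import defaultdict

K = 8  # assumed word size; treated here as an illustrative default


def kmers(seq, k=K):
    """Return the set of overlapping k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def trim_to_region(seq, start, end):
    """Trim a reference sequence to an amplified region.

    start/end are hypothetical 0-based coordinates of the primer-bounded
    region; real pipelines would locate them from an alignment.
    """
    return seq[start:end]


def train(references, k=K):
    """Build word counts from (taxon, sequence) training pairs."""
    word_total = defaultdict(int)   # training sequences containing each word
    taxon_size = defaultdict(int)   # training sequences per taxon
    taxon_word = defaultdict(lambda: defaultdict(int))
    for taxon, seq in references:
        taxon_size[taxon] += 1
        for w in kmers(seq, k):
            word_total[w] += 1
            taxon_word[taxon][w] += 1
    return {"n": len(references), "k": k, "word_total": word_total,
            "taxon_size": taxon_size, "taxon_word": taxon_word}


def log_score(model, taxon, words):
    """Sum of log P(word | taxon) with an assumed smoothing form."""
    n, m = model["n"], model["taxon_size"][taxon]
    total = 0.0
    for w in words:
        prior = (model["word_total"].get(w, 0) + 0.5) / (n + 1)
        cond = (model["taxon_word"][taxon].get(w, 0) + prior) / (m + 1)
        total += math.log(cond)
    return total


def classify(model, query, n_boot=100, seed=0):
    """Return (best taxon, bootstrap confidence in [0, 1])."""
    words = sorted(kmers(query, model["k"]))
    if not words:
        raise ValueError("query is shorter than the word size")
    taxa = sorted(model["taxon_size"])
    best = max(taxa, key=lambda t: log_score(model, t, words))
    rng = random.Random(seed)
    subset = max(1, len(words) // model["k"])  # assumed subsample size
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(words) for _ in range(subset)]
        if max(taxa, key=lambda t: log_score(model, t, sample)) == best:
            hits += 1
    return best, hits / n_boot
```

In this sketch, a read is reported at a given rank only if its bootstrap confidence meets a chosen threshold, so both the set of taxa that can be returned and the confidence attached to them depend entirely on the `references` passed to `train`. Passing each reference through `trim_to_region` before training restricts the word counts to the region actually sequenced.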
Acknowledgements
This study was supported by Grant UH2/UH3CA140233 from the Human Microbiome Project of the NIH Roadmap Initiative, the National Cancer Institute, NIH Common Fund contract U01-HG004866 (a Data Analysis and Coordination Center for the Human Microbiome Project), The Hartwell Foundation, the Arnold and Mabel Beckman Foundation, the David and Lucile Packard Foundation, Cornell University Agricultural Experiment Station federal formula funds NYC-123444 received from the USDA National Institute of Food and Agriculture (NIFA), and USDA NIFA Grant 2007-35504-05381.
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Supplementary Information accompanies the paper on The ISME Journal website
About this article
Cite this article
Werner, J., Koren, O., Hugenholtz, P. et al. Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. ISME J 6, 94–103 (2012). https://doi.org/10.1038/ismej.2011.82
DOI: https://doi.org/10.1038/ismej.2011.82