Abstract
Next-generation sequencing (NGS) technologies have enabled broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually assessed by classifying 16S ribosomal RNA genes, whereas metagenomic approaches target metabolic genes. However, the two approaches remain isolated as long as taxonomic and functional information cannot be interrelated. Techniques such as self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but they have not yet been applied to broad-scale NGS-based metagenomics. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public use. Evaluation with published data sets mimicking habitats of varying complexity resulted in classification specificities and sensitivities ranging from close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb in low- and medium-complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. To make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition for studying microbial ecology in a holistic fashion.
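To illustrate the general idea of SOM-based binning described above, the following is a minimal sketch and not the TaxSOM implementation. It assumes tetranucleotide frequencies as the input features, which the abstract does not specify; the map size, learning schedule, unit-labelling rule and all function names (`tetra_freqs`, `train_som`, `label_units`, `classify`, `random_fragment`) are hypothetical choices made for illustration only.

```python
import itertools
import math
import random

# All 256 tetranucleotides in a fixed order (assumed feature set).
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
INDEX = {k: i for i, k in enumerate(KMERS)}

def tetra_freqs(seq):
    """Normalized tetranucleotide frequency vector of a DNA sequence."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_unit(weights, vec):
    """Index of the map unit whose weight vector is closest to vec."""
    return min(range(len(weights)), key=lambda u: sq_dist(weights[u], vec))

def train_som(vectors, rows=3, cols=3, epochs=30, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    weights = [list(rng.choice(vectors)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays linearly from 1 towards 0
        rate = 0.5 * frac                      # learning rate
        radius = max(rows, cols) / 2.0 * frac + 0.5
        for vec in rng.sample(vectors, len(vectors)):
            br, bc = coords[best_unit(weights, vec)]
            for u, (r, c) in enumerate(coords):
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * radius ** 2))
                weights[u] = [w + rate * h * (v - w) for w, v in zip(weights[u], vec)]
    return weights

def label_units(weights, vectors, labels):
    """Label each unit by majority vote of the reference vectors mapped to it.
    Units that attract no reference vector borrow the label of the nearest one."""
    votes = {}
    for vec, lab in zip(vectors, labels):
        votes.setdefault(best_unit(weights, vec), []).append(lab)
    unit_labels = {}
    for u, w in enumerate(weights):
        if u in votes:
            labs = votes[u]
            unit_labels[u] = max(sorted(set(labs)), key=labs.count)
        else:
            nearest = min(range(len(vectors)), key=lambda i: sq_dist(w, vectors[i]))
            unit_labels[u] = labels[nearest]
    return unit_labels

def classify(weights, unit_labels, seq):
    """Assign a sequence the label of its best-matching unit."""
    return unit_labels[best_unit(weights, tetra_freqs(seq))]

def random_fragment(rng, at_fraction, length):
    """Synthetic fragment with a given A+T content (toy stand-in for real data)."""
    w = [at_fraction / 2, (1 - at_fraction) / 2, (1 - at_fraction) / 2, at_fraction / 2]
    return "".join(rng.choices("ACGT", weights=w, k=length))

# Toy usage: two artificial "taxa" that differ only in base composition.
rng = random.Random(42)
labels = ["AT-rich"] * 10 + ["GC-rich"] * 10
refs = [tetra_freqs(random_fragment(rng, 0.8 if lab == "AT-rich" else 0.2, 2500))
        for lab in labels]
som = train_som(refs)
units = label_units(som, refs, labels)
print(classify(som, units, random_fragment(rng, 0.8, 2500)))
```

In this sketch, reference fragments of known origin train and label the map, and an unknown fragment inherits the label of its best-matching unit. Real genomes differ far more subtly than the two synthetic compositions used here, so the toy data say nothing about the accuracies reported in the abstract.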
Acknowledgements
We thank Tobin J Hammer for fruitful discussions and proofreading of the paper. This study was supported by the Max Planck Society and the MIMAS project (project no. 03F0480A) funded by the German Federal Ministry of Education and Research (BMBF).
Additional information
Supplementary Information accompanies the paper on The ISME Journal website
Cite this article
Weber, M., Teeling, H., Huang, S. et al. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics. ISME J 5, 918–928 (2011). https://doi.org/10.1038/ismej.2010.180
DOI: https://doi.org/10.1038/ismej.2010.180