  • Analysis

Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing

Abstract

Distinguishing single-nucleotide variants (SNVs) from errors in whole-genome sequences remains challenging. Here we describe a set of filters, together with a freely accessible software tool, that selectively reduce error rates and thereby facilitate variant detection in data from two short-read sequencing technologies, Complete Genomics and Illumina. By sequencing the nearly identical genomes of monozygotic twins and considering shared SNVs as 'true variants' and discordant SNVs as 'errors', we optimized thresholds for 12 individual filters and assessed which of the 1,048 filter combinations were effective in terms of sensitivity and specificity. Cumulative application of all effective filters reduced the error rate 290-fold, facilitating the identification of genetic differences between monozygotic twins. We also applied an adapted, less stringent set of filters to reliably identify somatic mutations in a highly rearranged tumor and to identify variants in the NA19240 HapMap genome relative to a reference set of SNVs.
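
The twin-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software tool: the `Call` record, the `depth` annotation, the site-level filtering rule and the metric definitions (sensitivity as the fraction of shared SNVs retained, error rate as discordant SNVs over all SNVs) are assumptions made for illustration, and the toy call sets are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# A variant site: (chromosome, position, alternate allele).
Site = Tuple[str, int, str]


@dataclass(frozen=True)
class Call:
    """One SNV call in one twin (hypothetical record layout)."""
    site: Site
    depth: int  # coverage depth at the site; an assumed annotation


def split_calls(twin_a: List[Call], twin_b: List[Call]) -> Tuple[Set[Site], Set[Site]]:
    """Return (shared, discordant) sites: called in both twins vs. in exactly one."""
    sites_a = {c.site for c in twin_a}
    sites_b = {c.site for c in twin_b}
    return sites_a & sites_b, sites_a ^ sites_b


def evaluate_filter(
    twin_a: List[Call],
    twin_b: List[Call],
    passes: Callable[[Call], bool],
) -> Dict[str, float]:
    """Score one filter, treating shared SNVs as true and discordant SNVs as errors.

    Simplifying assumption: a site is dropped if any call at it fails the filter,
    so filtering never turns a shared site into a discordant one.
    """
    shared, discordant = split_calls(twin_a, twin_b)
    failed = {c.site for c in twin_a + twin_b if not passes(c)}
    shared_kept = shared - failed
    discordant_kept = discordant - failed

    def error_rate(n_shared: int, n_discordant: int) -> float:
        total = n_shared + n_discordant
        return n_discordant / total if total else 0.0

    before = error_rate(len(shared), len(discordant))
    after = error_rate(len(shared_kept), len(discordant_kept))
    return {
        "sensitivity": len(shared_kept) / len(shared) if shared else 0.0,
        "errors_removed": 1 - len(discordant_kept) / len(discordant) if discordant else 0.0,
        "fold_reduction": before / after if after else float("inf"),
    }


# Invented toy call sets: three shared sites and two discordant ones.
twin_a = [
    Call(("chr1", 100, "A"), 30),
    Call(("chr1", 200, "T"), 25),
    Call(("chr2", 50, "G"), 5),   # called only in twin A
    Call(("chr2", 75, "C"), 8),   # shared, but low depth in twin A
]
twin_b = [
    Call(("chr1", 100, "A"), 28),
    Call(("chr1", 200, "T"), 31),
    Call(("chr2", 75, "C"), 22),
    Call(("chr3", 10, "A"), 40),  # called only in twin B
]

# Illustrative depth filter with a cutoff of 10.
result = evaluate_filter(twin_a, twin_b, lambda c: c.depth >= 10)
print(result)
```

Evaluating every combination of filters in this way yields one sensitivity and error-removal pair per combination, which is the kind of data a ROC-style comparison of filter combinations would plot.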

Figure 1: Development of individual filters on monozygotic twin genomes.
Figure 2: Efficacy of the individual filters with respect to the number of shared and discordant SNVs in monozygotic twins (CG filters) and NA19240 genomes (Illumina filters).
Figure 3: ROC curves of all filter combinations.

Acknowledgements

We appreciate the assistance of M. Veugelers and S. Plaisance (VIB Technology Watch). We acknowledge G. Peuteman, T. Van Brussel, S. Cammaerts, M. Strazisar and the Genetic Service Facility (http://www.vibgeneticservicefacility.be/) for technical assistance. We highly appreciate the helpful comments from the reviewers. The research was supported by the Fund for Scientific Research Flanders (FWO-F) to J.R. and P.V.L.; by the Agency for Innovation by Science and Technology (IWT) to M.V.D.B.; and by the Stichting tegen Kanker, FWO-F and the KULeuven (KULPFV/10/016-SymBioSysII) to D.L.

Author information

Contributions

D.L. and J.D.-F. conceptualized this work. J.R. and P.D.R. wrote algorithms and analyzed data. H.Z. analyzed the Yoruban genome, and A.L. assisted with the twin analysis. J.C. and B.H. performed RTG-related analyses. P.V.L. provided the ASCAT algorithm, and D.S. performed SNP array experiments. K.C., M.V.D.B., B.S., E.D. and I.V. selected and characterized patient samples. All authors approved the manuscript.

Corresponding authors

Correspondence to Diether Lambrechts or Jurgen Del-Favero.

Ethics declarations

Competing interests

B.H. and J.C. are employees of Real Time Genomics and have financial interests in Real Time Genomics.

Supplementary information

Supplementary Text and Figures

Supplementary Notes 1–14 (PDF 5224 kb)

Supplementary Table S1

Validation experiments for the twin and tumor-normal genomes using Sanger sequencing and Sequenom MassARRAY genotyping (XLSX 390 kb)

Supplementary Table S2

Metrics and error rates calculated for each filter combination performed on the twin genome comparison using coverage depth cutoffs of 10 and 20 (XLSX 908 kb)

Supplementary Table S3

Overlap analysis of somatic variants in Tumor 1 and its replicate using three filter settings (XLSX 106 kb)

Supplementary Table S4

Sequenom validation of somatic missense SNVs in the ovarian clear cell tumor genome using three filter settings (XLSX 67 kb)

Supplementary Table S5

Prediction of the effect of the validated somatic mutations and somatic non-coding SNVs in the ovarian serous carcinoma (XLSX 18 kb)

Supplementary Table S6

Effect of filters cumulatively applied to the NA19240 genome, using stringent CG filters versus unfiltered CG data (XLSX 500 kb)

About this article

Cite this article

Reumers, J., De Rijk, P., Zhao, H. et al. Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Nat Biotechnol 30, 61–68 (2012). https://doi.org/10.1038/nbt.2053

  • DOI: https://doi.org/10.1038/nbt.2053
