Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing

Reumers, Joke; De Rijk, Peter; Zhao, Hui; Liekens, Anthony; Smeets, Dominiek; Cleary, John; Van Loo, Peter; Van Den Bossche, Maarten; Catthoor, Kirsten; Sabbe, Bernard; Despierre, Evelyn; Vergote, Ignace; Hilbush, Brian; Lambrechts, Diether; Del-Favero, Jurgen

doi:10.1038/nbt.2053

Analysis
Published: 18 December 2011

Joke Reumers^1,2,na1^,
Peter De Rijk^3,4,na1^,
Hui Zhao^1,2^,
Anthony Liekens^3,4^,
Dominiek Smeets^1,2^,
John Cleary^5^,
Peter Van Loo^6,7^,
Maarten Van Den Bossche^3,4,8,9^,
Kirsten Catthoor^10^,
Bernard Sabbe^8,9^,
Evelyn Despierre^11^,
Ignace Vergote^11^,
Brian Hilbush^5^,
Diether Lambrechts^1,2,na1^ &
…
Jurgen Del-Favero^3,4,na1^

Nature Biotechnology volume 30, pages 61–68 (2012)

Abstract

Distinguishing single-nucleotide variants (SNVs) from errors in whole-genome sequences remains challenging. Here we describe a set of filters, together with a freely accessible software tool, that selectively reduce error rates and thereby facilitate variant detection in data from two short-read sequencing technologies, Complete Genomics and Illumina. By sequencing the nearly identical genomes from monozygotic twins and considering shared SNVs as 'true variants' and discordant SNVs as 'errors', we optimized thresholds for 12 individual filters and assessed which of the 1,048 filter combinations were effective in terms of sensitivity and specificity. Cumulative application of all effective filters reduced the error rate by 290-fold, facilitating the identification of genetic differences between monozygotic twins. We also applied an adapted, less stringent set of filters to reliably identify somatic mutations in a highly rearranged tumor and to identify variants in the NA19240 HapMap genome relative to a reference set of SNVs.
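The evaluation scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' software. The filter names, the toy SNV records, and the working definitions (sensitivity as the fraction of shared SNVs retained, specificity as the fraction of discordant SNVs removed) are assumptions made purely for illustration.

```python
from itertools import combinations

# Toy SNV calls from a twin pair (hypothetical data, not from the paper).
# "shared" marks a call seen in both twins, treated here as a true variant;
# a call seen in only one twin is discordant and treated as an error.
# "fails" lists the (hypothetical) filters that would reject the call.
SNVS = [
    {"shared": True,  "fails": set()},
    {"shared": True,  "fails": set()},
    {"shared": True,  "fails": {"low_coverage"}},
    {"shared": True,  "fails": set()},
    {"shared": False, "fails": {"low_coverage"}},
    {"shared": False, "fails": {"repeat_region"}},
    {"shared": False, "fails": {"low_coverage", "snv_cluster"}},
    {"shared": False, "fails": set()},
]

# Hypothetical filter names; the paper optimizes 12 filters of its own.
FILTERS = ("low_coverage", "repeat_region", "snv_cluster")


def evaluate(active, snvs):
    """Return (sensitivity, specificity) for a set of active filters.

    Assumed definitions: sensitivity is the fraction of shared SNVs that
    pass every active filter; specificity is the fraction of discordant
    SNVs rejected by at least one active filter.
    """
    active = set(active)
    shared = [s for s in snvs if s["shared"]]
    discordant = [s for s in snvs if not s["shared"]]
    kept_shared = sum(1 for s in shared if not (s["fails"] & active))
    removed_discordant = sum(1 for s in discordant if s["fails"] & active)
    return kept_shared / len(shared), removed_discordant / len(discordant)


def all_combinations(filters):
    """Yield every non-empty combination of the given filters."""
    for size in range(1, len(filters) + 1):
        yield from combinations(filters, size)


# Score every filter combination on the toy data.
results = {combo: evaluate(combo, SNVS) for combo in all_combinations(FILTERS)}

for combo, (sens, spec) in results.items():
    print(f"{'+'.join(combo):45s} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Plotting sensitivity against one minus specificity for each entry in `results` would give one point per filter combination, the kind of comparison that a ROC plot over filter combinations summarizes.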

**Figure 1: Development of individual filters on monozygotic twin genomes.**

**Figure 2: Efficacy of the individual filters with respect to the number of shared and discordant SNVs in monozygotic twins (CG filters) and NA19240 genomes (Illumina filters).**

**Figure 3: ROC curves of all filter combinations.**

Accession codes

Accessions

GenBank/EMBL/DDBJ

Acknowledgements

We appreciate the assistance of M. Veugelers and S. Plaisance (VIB Technology Watch). We acknowledge G. Peuteman, T. Van Brussel, S. Cammaerts, M. Strazisar and the Genetic Service Facility (http://www.vibgeneticservicefacility.be/) for technical assistance. We highly appreciate the helpful comments from the reviewers. The research was supported by the Fund for Scientific Research Flanders (FWO-F) to J.R. and P.V.L., the Agency for Innovation by Science and Technology (IWT) to M.V.D.B., and the Stichting tegen Kanker, FWO-F and the KULeuven (KULPFV/10/016-SymBioSysII) to D.L.

Author information

Joke Reumers, Peter De Rijk, Diether Lambrechts and Jurgen Del-Favero: These authors contributed equally to this work.

Authors and Affiliations

Vesalius Research Center, Vlaams Instituut voor Biotechnologie (VIB), Leuven, Belgium
Joke Reumers, Hui Zhao, Dominiek Smeets & Diether Lambrechts
Vesalius Research Center, University of Leuven, Leuven, Belgium
Joke Reumers, Hui Zhao, Dominiek Smeets & Diether Lambrechts
Department of Molecular Genetics, Applied Molecular Genomics Group, VIB, Antwerp, Belgium
Peter De Rijk, Anthony Liekens, Maarten Van Den Bossche & Jurgen Del-Favero
Applied Molecular Genomics Group, University of Antwerp, Antwerp, Belgium
Peter De Rijk, Anthony Liekens, Maarten Van Den Bossche & Jurgen Del-Favero
Real Time Genomics, San Francisco, California, USA
John Cleary & Brian Hilbush
Department of Molecular and Developmental Genetics, VIB, Leuven, Belgium
Peter Van Loo
Department of Human Genetics, University of Leuven, Leuven, Belgium
Peter Van Loo
Collaborative Antwerp Psychiatric Research Institute (CAPRI), Faculty of Medicine, University of Antwerp, Antwerp, Belgium
Maarten Van Den Bossche & Bernard Sabbe
PC Sint-Norbertushuis, Duffel, Belgium
Maarten Van Den Bossche & Bernard Sabbe
ZNA Psychiatric Hospital Stuivenberg, Antwerp, Belgium
Kirsten Catthoor
Division of Gynaecologic Oncology, Department of Obstetrics and Gynaecology, University Hospital Gasthuisberg, Leuven, Belgium
Evelyn Despierre & Ignace Vergote

Contributions

D.L. and J.D.-F. conceptualized this work. J.R. and P.D.R. wrote algorithms and analyzed data. H.Z. analyzed the Yoruban genome; A.L. assisted with the twin analysis. J.C. and B.H. performed RTG-related analyses. P.V.L. provided the ASCAT algorithm; D.S. performed SNP array experiments. K.C., M.V.D.B., B.S., E.D. and I.V. selected and characterized patient samples. All authors approved the manuscript.

Corresponding authors

Correspondence to Diether Lambrechts or Jurgen Del-Favero.

Ethics declarations

Competing interests

B.H. and J.C. are employees of Real Time Genomics and have financial interests in Real Time Genomics.

Supplementary information

Supplementary Text and Figures

Supplementary Notes 1–14 (PDF 5224 kb)

Supplementary Table S1

Validation experiments for the twin and tumor-normal genomes using Sanger sequencing and Sequenom MassARRAY genotyping (XLSX 390 kb)

Supplementary Table S2

Metrics and error rates calculated for each filter combination performed on the twin genome comparison using coverage depth cutoffs of 10 and 20 (XLSX 908 kb)

Supplementary Table S3

Overlap analysis of somatic variants in Tumor 1 and its replicate using three filter settings (XLSX 106 kb)

Supplementary Table S4

Sequenom validation of somatic missense SNVs in the ovarian clear cell tumor genome using three filter settings (XLSX 67 kb)

Supplementary Table S5

Prediction of the effect of the validated somatic mutations and somatic non-coding SNVs in the ovarian serous carcinoma (XLSX 18 kb)

Supplementary Table S6

Effect of filters cumulatively applied to the NA19240 genome, using stringent CG filters versus unfiltered CG data (XLSX 500 kb)

About this article

Cite this article

Reumers, J., De Rijk, P., Zhao, H. et al. Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Nat Biotechnol 30, 61–68 (2012). https://doi.org/10.1038/nbt.2053

Received: 08 June 2011
Accepted: 28 October 2011
Published: 18 December 2011
Issue date: January 2012
DOI: https://doi.org/10.1038/nbt.2053
