Abstract
Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1–50 kb), including insertions, deletions and inversions, together with their precise breakpoints and, in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1–23 kb. Validation using computational and experimental methods suggests that we achieve an overall false-positive rate of <6% and a false-negative rate of <10% in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.
Accession codes
Accessions
GenBank/EMBL/DDBJ
Sequence Read Archive
Acknowledgements
This work was supported by a National Basic Research Program of China (973 program no. 2011CB809200), the National Natural Science Foundation of China (30725008; 30890032; 30811130531; 30221004), the Chinese 863 program (2006AA02Z177; 2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen Municipal Government of China (grants JC200903190767A; JC200903190772A; ZYC200903240076A; CXB200903110066A; ZYC200903240077A; ZYC200903240076A and ZYC200903240080A) and the Ole Rømer grant from the Danish Natural Science Research Council. This project is also funded by the Shenzhen Municipal Government and the Local Government of Yantian District of Shenzhen. The 1000 Genomes Project Consortium provided the data for population analysis. AIFB is supported by Diabetes UK, the Wellcome Trust, the Medical Research Council and the Comprehensive Biomedical Research Centre, Imperial College Healthcare NHS Trust. We thank X. Wang of the School of Biosciences & Bioengineering, SCUT, for his excellent coordination, and J. El-Sayed Moustafa for her help analyzing the experimental validation data. L. Goodman, S. Edmunds and A. Basford edited the manuscript.
Author information
Authors and Affiliations
Contributions
Jun W., Jian W. and H.Y. managed the project. Jun W., Y.L. and R. Luo designed the analyses. Y.L., R. Luo, R. Li, H. Zheng, H. Zhu, H.W., H.C., B.W., S.H., H.S., F.Z., H.M., S.F., A.J.d.S., A.I.F.B., W.Z., H.D., L.J.M.C., S.L., L.B. and K.K. performed the data analyses. G.T., J.L. and X.Z. performed the sequencing. Jun W., Y.L. and R. Luo wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–8 and Supplementary Notes (PDF 1037 kb)
Supplementary Table 1
Primers, sequences of randomly selected structural variations and Sanger capillary sequencing results for PCR validation. (XLS 111 kb)
Supplementary Table 2
Summary of Fosmid sequences validation results. (XLS 144 kb)
Supplementary Table 3
Comparison of the structural variations predicted in the YH and NA18507 genomes with sets of variants discovered by alternative approaches. (XLS 17 kb)
Supplementary Table 5
Classification of those strongly conserved (dN/dS ⩽ 0.1) genes containing SVs. (XLS 48 kb)
Supplementary Data Set 1
Source code (ZIP 5936 kb)
Supplementary Data Set 2
Supplementary array CGH results (TXT 38 kb)
Rights and permissions
About this article
Cite this article
Li, Y., Zheng, H., Luo, R. et al. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Nat Biotechnol 29, 723–730 (2011). https://doi.org/10.1038/nbt.1904
DOI: https://doi.org/10.1038/nbt.1904
This article is cited by
- Whole genome sequence analysis reveals genetic structure and X-chromosome haplotype structure in indigenous Chinese pigs. Scientific Reports (2020)
- dnAQET: a framework to compute a consolidated metric for benchmarking quality of de novo assemblies. BMC Genomics (2019)
- The wolf reference genome sequence (Canis lupus lupus) and its implications for Canis spp. population genomics. BMC Genomics (2017)
- novoBreak: local assembly for breakpoint detection in cancer genomes. Nature Methods (2017)
- Recent breeding programs enhanced genetic diversity in both desi and kabuli varieties of chickpea (Cicer arietinum L.). Scientific Reports (2016)