Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

Mc Cartney, Ann M.; Shafin, Kishwar; Alonge, Michael; Bzikadze, Andrey V.; Formenti, Giulio; Fungtammasan, Arkarachai; Howe, Kerstin; Jain, Chirag; Koren, Sergey; Logsdon, Glennis A.; Miga, Karen H.; Mikheenko, Alla; Paten, Benedict; Shumate, Alaina; Soto, Daniela C.; Sović, Ivan; Wood, Jonathan M. D.; Zook, Justin M.; Phillippy, Adam M.; Rhie, Arang

doi:10.1038/s41592-022-01440-3

Article
Published: 31 March 2022

Nature Methods volume 19, pages 687–695 (2022)

Abstract

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
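The quality value (QV) quoted above is a Phred-scaled estimate of the per-base error rate. As a rough illustration only, the sketch below follows the k-mer-based definition published for Merqury (listed in the references): k-mers found in the assembly but absent from the reads are treated as likely errors. The function names and the default k = 21 are assumptions made for this example; this is not the authors' code.

```python
import math


def qv_from_kmers(missing_kmers: int, total_kmers: int, k: int = 21) -> float:
    """Merqury-style consensus QV (hypothetical helper for illustration).

    missing_kmers: assembly k-mers never seen in the read set (likely errors)
    total_kmers:   all k-mers counted in the assembly
    k:             k-mer size (21 is an assumed default)
    """
    if total_kmers <= 0:
        raise ValueError("total_kmers must be positive")
    if missing_kmers == 0:
        return math.inf  # no evidence of error at this resolution
    # Probability that a single base is correct, assuming a k-mer is
    # error-free only if all k of its bases are correct.
    base_accuracy = (1.0 - missing_kmers / total_kmers) ** (1.0 / k)
    error_rate = 1.0 - base_accuracy
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Convert a Phred-scaled QV back to a per-base error rate."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    for qv in (70.2, 73.9):
        print(f"QV {qv}: ~{error_rate_from_qv(qv):.2e} errors per base")
```

Under this definition, a QV of 70.2 corresponds to about 9.5 × 10⁻⁸ errors per base and a QV of 73.9 to about 4.1 × 10⁻⁸.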

**Fig. 1: An overview of the evaluation and polishing strategy developed to achieve a complete human genome assembly.**

**Fig. 2: Sequencing biases in PacBio HiFi and Illumina reads.**

**Fig. 3: Errors corrected after polishing.**

**Fig. 4: Examples of the largest CHM13 regions with a copy number in the reference that differs from GRCh38 and most individuals.**

**Fig. 5: Errors made by automated polishing.**

Data availability

All data types and assemblies are available on https://github.com/marbl/CHM13/ and under NCBI BioProject PRJNA559484 with the Assembly GenBank accession GCA_009914755. Polishing edits, cataloged remaining issues and known heterozygous regions are available on https://github.com/marbl/CHM13-issues/. All the data in the two GitHub repositories are directly downloadable from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/ with no restrictions. The retrained PEPPER model used for telomere polishing is available to download at https://storage.cloud.google.com/pepper-deepvariant-public/pepper_models/PEPPER_HP_R941_ONT_V4_T2T.pkl. Source data for generating plots in this paper are available on https://github.com/arangrhie/T2T-Polish/tree/master/paper/2022_Mc_Cartney/.

Code availability

To facilitate use of our evaluation and polishing strategy, we have made up-to-date versions of the tools used in our workflows openly available at https://github.com/arangrhie/T2T-Polish/. The exact code used for polishing CHM13v0.9 and CHM13v1.0 is available at https://github.com/marbl/CHM13-issues/. Both GitHub repositories are available in the public domain and have been deposited to Zenodo^60,61. Custom scripts used for merging small variants and generating telomere edits are available at https://github.com/kishwarshafin/T2T_polishing_scripts/ and have been deposited to Zenodo^62 under an MIT license.

References

Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 https://doi.org/10.1126/science.abj6965 (2022).
Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 https://doi.org/10.1126/science.abj5089 (2022).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Hufford, M. B. et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373, abg5289 https://doi.org/10.1126/science.abg5289 (2021).
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 https://doi.org/10.1126/science.abl3533 (2022).
van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation sequencing technology. Trends Genet. 30, 418–426 (2014).
Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Baran, N., Lapidot, A. & Manor, H. Formation of DNA triplexes accounts for arrests of DNA synthesis at d(TC)n and d(GA)n tracts. Proc. Natl Acad. Sci. USA 88, 507–511 (1991).
Guiblet, W. M. et al. Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate. Genome Res. 28, 1767–1778 (2018).
Chen, Y.-C., Liu, T., Yu, C.-H., Chiang, T.-Y. & Hwang, C.-C. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE 8, e62856 (2013).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods https://doi.org/10.1038/s41592-020-01056-5 (2021).
Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792 (2017).
Simpson, J. T. et al. ABySS: a parallel assembler for short-read sequence data. Genome Res. 19, 1117–1123 (2009).
Watson, M. Mind the gaps—ignoring errors in long-read assemblies critically affects protein prediction. Nat. Biotechnol. https://doi.org/10.1038/s41587-018-0004-z (2019).
Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature https://doi.org/10.1038/s41586-021-03451-0 (2021).
Zimin, A. V. & Salzberg, S. L. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput. Biol. 16, e1007981 (2020).
Pacific Biosciences. GenomicConsensus module. https://github.com/PacificBiosciences/GenomicConsensus (2019).
Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735 (2015).
Poplin, R. et al. A universal SNP and small-INDEL variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Oxford Nanopore Technologies. medaka: sequence correction provided by ONT Research https://github.com/nanoporetech/medaka (2018).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Zhang, H., Jain, C. & Aluru, S. A comprehensive evaluation of long-read error correction methods. BMC Genomics 21, 889 (2020).
Fu, S., Wang, A. & Au, K. F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol. 20, 26 (2019).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Jain, C. et al. Weighted minimizer sampling improves long-read mapping. Bioinformatics 36, i111–i118 (2020).
Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. Long read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods (2022).
Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Fofanov, Y. et al. How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428 (2004).
Lang, D. et al. Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacific Biosciences Sequel II system and ultralong reads of Oxford Nanopore. GigaScience 9, giaa123 (2020).
Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2021).
Formenti, G. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat. Methods (2022). https://doi.org/10.1038/s41592-022-01445-y
Zarate, S. et al. Parliament2: accurate structural variant calling at scale. GigaScience 9, giaa145 (2020).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 https://doi.org/10.1126/science.abl4178 (2022).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01158-1 (2022).
Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, abi7489 https://doi.org/10.1126/science.abi7489 (2021).
Liu, J. et al. Gapless assembly of maize chromosomes using long-read technologies. Genome Biol. 21, 121 (2020).
Du, H. et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat. Commun. 8, 15324 (2017).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Tischler, G. & Leonard, S. biobambam: tools for read pair collation based algorithms on BAM files. Source Code Biol. Med. 9, 13 (2014).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Kirsche, M. et al. Jasmine: Population-scale structural variant comparison and analysis. Preprint at bioRxiv https://doi.org/10.1101/2021.05.27.445886 (2021).
Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer: high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1016 (2020).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Rhie, A., Formenti, G., Shafin, K., Fungtammasan, A. & Jain, C. arangrhie/T2T-Polish: v1.0. https://doi.org/10.5281/zenodo.5649017 (2021).
Rhie, A. & Phillippy, A. marbl/CHM13-issues: v1.1. https://doi.org/10.5281/zenodo.5648989 (2021).
Shafin, K. kishwarshafin/T2T_polishing_scripts: v0.1 release for zenodo. https://doi.org/10.5281/zenodo.6127865 (2021).

Acknowledgements

This work was supported by the Intramural Research Program of the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH 1ZIAHG200398; to A.M.M., C.J., S.K., A.M.P. and A.R.); National Science Foundation DBI-1350041 and IOS-1732253 (to M.A.); NIH/NHGRI R01HG010485, U41HG010972, U01HG010961, U24HG011853 and OT2OD026682 (to K.S. and B.P.); HHMI (to G.F.); Wellcome WT206194 (to K.H. and J.M.W.); NIGMS F32 GM134558 (to G.A.L.); NIH/NHGRI R01 1R01HG011274-01, NIH/NHGRI R21 1R21HG010548-01 and NIH/NHGRI U01 1U01HG010971 (to K.M.); St. Petersburg State University grant ID PURE73023672 (to A.M.); NIH/NHGRI R01HG006677 (to A.S.); Fulbright Fellowship (to D.C.S.); and Intramural funding at the National Institute of Standards and Technology (to J.Z.). This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov/). Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose.

Author information

These authors contributed equally: Ann M. Mc Cartney, Kishwar Shafin, Michael Alonge.

Authors and Affiliations

Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
Ann M. Mc Cartney, Chirag Jain, Sergey Koren, Adam M. Phillippy & Arang Rhie
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Kishwar Shafin, Karen H. Miga & Benedict Paten
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Michael Alonge
Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA
Andrey V. Bzikadze
Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
Giulio Formenti
DNAnexus, Mountain View, CA, USA
Arkarachai Fungtammasan
Wellcome Sanger Institute, Cambridge, UK
Kerstin Howe & Jonathan M. D. Wood
Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
Chirag Jain
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Glennis A. Logsdon
Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
Karen H. Miga
Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
Alla Mikheenko
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Alaina Shumate
Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
Daniela C. Soto
Pacific Biosciences, Menlo Park, CA, USA
Ivan Sović
Digital BioLogic d.o.o., Ivanić-Grad, Croatia
Ivan Sović
Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
Justin M. Zook

Contributions

A.R. and A.M.P. conceived and supervised the project. A.M.M., K.S., G.F., K.H., J.M.D.W. and A.R. performed the pre-polishing evaluation. K.S., M.A., A.V.B., A.F., C.J., A.M., B.P. and A.R. aligned reads and called variants. A.M.M., K.S., M.A., G.F., A.F., K.H.M., A.M., J.M.Z. and A.R. manually validated variant calls. D.C.S. and J.M.Z. performed the gene collapse and expansion analysis. K.S., M.A., A.V.B., G.A.L., K.H.M., A.M. and A.R. identified and curated heterozygous and ‘issues’ loci. K.S., M.A., S.K. and B.P. patched and polished the telomeres. A.M.M., M.A., A.S. and I.S. performed automated polishing. A.M.M., K.S., M.A. and A.R. wrote the manuscript, with assistance from all authors. All authors approved the final manuscript.

Corresponding authors

Correspondence to Adam M. Phillippy or Arang Rhie.

Ethics declarations

Competing interests

I.S. is an employee of PacBio. A.F. is an employee of DNAnexus. S.K. has received travel funds to speak at symposia organized by Oxford Nanopore. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Kai Wang and Jue Ruan for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Sequencing biases observed in missing k-mers.

a, Missing k-mers with their GA composition. b–d, v0.9 assembly and k-mer copy number spectrum from HiFi, Illumina, and hybrid k-mer sets (left) and per-chromosome missing (likely error) k-mer counts from the HiFi-derived consensus or patches (right). Most missing k-mers in HiFi overlapped sequences from patched regions. No missing k-mers were found on chromosomes indicated with red arrows.

Extended Data Fig. 2 Error detection and polishing pipeline.

A detailed overview of the polishing pipeline along with the number of errors identified and polished at each step. Additionally, the data types and polishing tools used are highlighted. Illumina, 100x PCR-free library Illumina reads; HiFi, 35x PacBio HiFi reads; ONT, 120x Oxford Nanopore reads.

Extended Data Fig. 3 Number of SV-like errors and globally unique single copy k-mers used for marker assisted alignment.

a. Number of SV-like errors called from long-read platforms. b. Range of k-mer counts defined as ‘single-copy’ markers from Illumina reads and in the assembly. The cutoffs were chosen to minimize inclusion of low-frequency erroneous k-mers and 2-copy k-mers. c. Number of markers in every 10 kb window. d. Cumulative number of bases covered by the number of markers in each 10 kb window.
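The marker definition in panels b and c can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the cutoffs `lo` and `hi` stand in for the read-count range shown in panel b (the actual values are not given here), and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def single_copy_markers(assembly, read_counts, k, lo, hi):
    """Pick k-mers seen exactly once in the assembly whose read count
    falls in [lo, hi]. The range is meant to exclude low-frequency
    erroneous k-mers (below lo) and multi-copy k-mers (above hi)."""
    asm_counts = Counter(kmers(assembly, k))
    return {
        kmer
        for kmer, n in asm_counts.items()
        if n == 1 and lo <= read_counts.get(kmer, 0) <= hi
    }


def markers_per_window(assembly, markers, k, window=10_000):
    """Count marker k-mers starting in each non-overlapping window."""
    n_windows = (len(assembly) + window - 1) // window
    counts = [0] * n_windows
    for i, kmer in enumerate(kmers(assembly, k)):
        if kmer in markers:
            counts[i // window] += 1
    return counts


if __name__ == "__main__":
    asm = "ACGTACGGTT"
    reads = {"ACG": 60, "CGT": 30, "GTA": 3, "TAC": 65,
             "CGG": 28, "GGT": 31, "GTT": 29}
    m = single_copy_markers(asm, reads, k=3, lo=20, hi=40)
    print(sorted(m))
    print(markers_per_window(asm, m, k=3, window=4))
```

In the toy input, `ACG` is dropped because it occurs twice in the assembly, `GTA` because its read count looks like an error k-mer, and `TAC` because its read count suggests two copies.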

Extended Data Fig. 4 Post-polishing evaluation.

a. Left, genotype quality and number of reads supporting the reference and alternate alleles from the combined Illumina-HiFi hybrid and ONT homozygous variant calls, with AF > 0.5. Right, balanced insertion (red) and deletion (blue) length distribution from the Illumina-HiFi hybrid DeepVariant heterozygous calls in CHM13v1.0. b. Number of errors detected in each chromosome, before and after polishing. c. Polishing inside and outside of repeats. The distribution of CHM13v0.9 polishing rates inside and outside repeats.

Extended Data Fig. 5 Three SV-like errors corrected.

HiFi and ONT marker assisted alignments, post correction of the 3 large SV-like edits visualized with IGV. HiFi coverage track is shown in data range up to 60, ONT up to 150. Clipped reads are flagged for >100 bp. INDELs smaller than 10 bp are not shown. Reads are colored by strands; positive in red and negative in blue.

Extended Data Fig. 6 Telomere polishing.

a. An illustration of Chr. 2 telomere sequence reads from the HiFi, ONT and CLR platforms. b. Distribution of maximum perfect match to the canonical k-mer observed at each position in the telomere before (CHM13v1.0) and after (CHM13v1.1) polishing the telomeres.

Extended Data Fig. 7 Mapping biases found and corrected.

On simulated HiFi reads, we found excessive clipping in highly identical satellite repeats in Minimap and Winnowmap at the time of evaluation. We have addressed this issue in Winnowmap 2.01+. Clipped (%) indicates the percentage of reads clipped in every 1,024 bp window, shown in the 0–40% range with a midline of 10%.

Extended Data Fig. 8 HiFi, CLR, ONT read coverage, alignment identity, and read length from Winnowmap2 v2.01 alignments and Bionano DLE-1 molecule coverage from Bionano Solve.

The upper panel shows a zoomed-in region of Chromosome 9, while the lower panel shows the whole-genome alignment view. HiFi, CLR, ONT, and Bionano coverage are shown up to 70x, 70x, 200x, and 250x, respectively. Median read identity in every 1,024 bp is shown in the 80–100% range. Median read length in every 1,024 bp is shown in the 0–100 kb range. Read identity was the worst in CLR, and between HiFi and ONT. Bionano molecules were lacking coverage in most of the centromeric repeats.

Extended Data Fig. 9 Collapsed simple tandem repeat.

The collapse in the intronic sequence of the gene FAM227A went undetected because of the variable insertion breakpoints and insertion lengths in the HiFi and ONT alignments. The panels above the alignments show marker density and percent microsatellites (GA / AT / TC / GC) in each 64 bp window, which indicate that this region is highly repetitive, with GA-enriched sequences that later alternate with AT-enriched sequences.
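A windowed microsatellite track of the kind described in this caption could be approximated as follows. This is a minimal sketch under assumptions: the caption does not say how the percentages were computed, so here a base counts as covered if it lies inside any occurrence of the two-base motif within its own window, and the function name is hypothetical.

```python
def dinucleotide_percent(seq, motif, window=64):
    """Percent of bases in each non-overlapping window that fall inside
    an occurrence of a 2-bp motif (e.g. "GA", "AT", "TC", "GC").

    Windows are scored independently, so a motif occurrence spanning
    a window boundary is not counted.
    """
    if len(motif) != 2:
        raise ValueError("motif must be exactly 2 bases")
    seq = seq.upper()
    motif = motif.upper()
    percents = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        covered = [False] * len(chunk)
        for i in range(len(chunk) - 1):
            if chunk[i:i + 2] == motif:
                covered[i] = covered[i + 1] = True
        percents.append(100.0 * sum(covered) / len(chunk))
    return percents


if __name__ == "__main__":
    # Toy sequence: a GA-rich block followed by an AT-rich block.
    toy = "GA" * 32 + "AT" * 32
    print(dinucleotide_percent(toy, "GA"))
    print(dinucleotide_percent(toy, "AT"))
```

On the toy sequence the GA track is high in the first window and the AT track in the second, mirroring the alternation the caption describes.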

Extended Data Fig. 10 Chimeric junction of two haplotypes.

In the regions shown above, both HiFi and ONT reads indicate that the consensus has a chimeric junction of the two haplotypes.

Supplementary information

Supplementary Information

Supplementary Tables 1–3

Reporting Summary

Peer Review File

About this article

Cite this article

Mc Cartney, A.M., Shafin, K., Alonge, M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 19, 687–695 (2022). https://doi.org/10.1038/s41592-022-01440-3

Received: 13 July 2021
Accepted: 04 March 2022
Published: 31 March 2022
Version of record: 31 March 2022
Issue date: June 2022
DOI: https://doi.org/10.1038/s41592-022-01440-3
