Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
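The quality values quoted above are Phred-scaled, so they convert directly to a per-base error rate. The following is a minimal sketch of that conversion and of a k-mer-based estimate, assuming the k-mer survival model described for Merqury (Rhie et al., 2020). The function names and the default k = 21 are illustrative assumptions, not values taken from this study.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10.0 ** (-qv / 10.0)


def kmer_qv(missing_kmers: int, total_kmers: int, k: int = 21) -> float:
    """Estimate assembly QV from k-mers (hypothetical helper).

    missing_kmers: assembly k-mers never seen in the read set (likely errors).
    total_kmers:   all k-mers in the assembly.
    Assumes a base is correct with probability P and a k-mer survives only if
    all k of its bases are correct, so P = (1 - missing/total) ** (1 / k).
    """
    if total_kmers <= 0 or not 0 <= missing_kmers <= total_kmers:
        raise ValueError("need 0 <= missing_kmers <= total_kmers and total_kmers > 0")
    if missing_kmers == 0:
        return math.inf  # no evidence of error at this k
    per_base_correct = (1.0 - missing_kmers / total_kmers) ** (1.0 / k)
    error_rate = 1.0 - per_base_correct
    return -10.0 * math.log10(error_rate)
```

Under this conversion, QV 70.2 corresponds to roughly 9.5 × 10^-8 errors per base and QV 73.9 to roughly 4.1 × 10^-8.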
Data availability
All data types and assemblies are available on https://github.com/marbl/CHM13/ and under NCBI BioProject PRJNA559484 with the Assembly GenBank accession GCA_009914755. Polishing edits, cataloged remaining issues and known heterozygous regions are available on https://github.com/marbl/CHM13-issues/. All the data in the two GitHub repositories are directly downloadable from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/ with no restrictions. The retrained PEPPER model used for telomere polishing is available to download at https://storage.cloud.google.com/pepper-deepvariant-public/pepper_models/PEPPER_HP_R941_ONT_V4_T2T.pkl. Source data for generating plots in this paper are available on https://github.com/arangrhie/T2T-Polish/tree/master/paper/2022_Mc_Cartney/.
Code availability
To facilitate use of our evaluation and polishing strategy, we have made up-to-date versions of the tools used in our workflows openly available at https://github.com/arangrhie/T2T-Polish/. The exact code used for polishing CHM13v0.9 and CHM13v1.0 is available at https://github.com/marbl/CHM13-issues/. Both GitHub repositories are available in the public domain and have been deposited to Zenodo (refs. 60,61). Custom scripts used for merging small variants and generating telomere edits are available at https://github.com/kishwarshafin/T2T_polishing_scripts/ and deposited to Zenodo (ref. 62) under an MIT license.
References
Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 https://doi.org/10.1126/science.abj6965 (2022).
Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 https://doi.org/10.1126/science.abj5089 (2022).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Hufford, M. B. et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373, abg5289 https://doi.org/10.1126/science.abg5289 (2021).
Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 https://doi.org/10.1126/science.abl3533 (2022).
van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation sequencing technology. Trends Genet. 30, 418–426 (2014).
Metzker, M. L. Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46 (2010).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Baran, N., Lapidot, A. & Manor, H. Formation of DNA triplexes accounts for arrests of DNA synthesis at d(TC)n and d(GA)n tracts. Proc. Natl Acad. Sci. USA 88, 507–511 (1991).
Guiblet, W. M. et al. Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate. Genome Res. 28, 1767–1778 (2018).
Chen, Y.-C., Liu, T., Yu, C.-H., Chiang, T.-Y. & Hwang, C.-C. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE 8, e62856 (2013).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods https://doi.org/10.1038/s41592-020-01056-5 (2021).
Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792 (2017).
Simpson, J. T. et al. ABySS: a parallel assembler for short-read sequence data. Genome Res. 19, 1117–1123 (2009).
Watson, M. Mind the gaps—ignoring errors in long-read assemblies critically affects protein prediction. Nat. Biotechnol. https://doi.org/10.1038/s41587-018-0004-z (2019).
Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature https://doi.org/10.1038/s41586-021-03451-0 (2021).
Zimin, A. V. & Salzberg, S. L. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput. Biol. 16, e1007981 (2020).
Pacific Biosciences. GenomicConsensus module. https://github.com/PacificBiosciences/GenomicConsensus (2019).
Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods 12, 733–735 (2015).
Poplin, R. et al. A universal SNP and small-INDEL variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Shafin, K. et al. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nat. Methods 18, 1322–1332 (2021).
Oxford Nanopore Technologies. medaka: sequence correction provided by ONT Research https://github.com/nanoporetech/medaka (2018).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Zhang, H., Jain, C. & Aluru, S. A comprehensive evaluation of long-read error correction methods. BMC Genomics 21, 889 (2020).
Fu, S., Wang, A. & Au, K. F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol. 20, 26 (2019).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Jain, C. et al. Weighted minimizer sampling improves long-read mapping. Bioinformatics 36, i111–i118 (2020).
Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. Long read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods (2022).
Mikheenko, A., Bzikadze, A. V., Gurevich, A., Miga, K. H. & Pevzner, P. A. TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats. Bioinformatics 36, i75–i83 (2020).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Fofanov, Y. et al. How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428 (2004).
Lang, D. et al. Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacific Biosciences Sequel II system and ultralong reads of Oxford Nanopore. GigaScience 9, giaa123 (2020).
Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2021).
Formenti, G. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat. Methods https://doi.org/10.1038/s41592-022-01445-y (2022).
Zarate, S. et al. Parliament2: accurate structural variant calling at scale. GigaScience 9, giaa145 (2020).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 https://doi.org/10.1126/science.abl4178 (2022).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01158-1 (2022).
Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, abi7489 https://doi.org/10.1126/science.abi7489 (2021).
Liu, J. et al. Gapless assembly of maize chromosomes using long-read technologies. Genome Biol. 21, 121 (2020).
Du, H. et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat. Commun. 8, 15324 (2017).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Tischler, G. & Leonard, S. biobambam: tools for read pair collation based algorithms on BAM files. Source Code Biol. Med. 9, 13 (2014).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Kirsche, M. et al. Jasmine: Population-scale structural variant comparison and analysis. Preprint at bioRxiv https://doi.org/10.1101/2021.05.27.445886 (2021).
Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer: high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics https://doi.org/10.1093/bioinformatics/btaa1016 (2020).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Rhie, A., Formenti, G., Shafin, K., Fungtammasan, A. & Jain, C. arangrhie/T2T-Polish: v1.0. https://doi.org/10.5281/zenodo.5649017 (2021).
Rhie, A. & Phillippy, A. marbl/CHM13-issues: v1.1. https://doi.org/10.5281/zenodo.5648989 (2021).
Shafin, K. kishwarshafin/T2T_polishing_scripts: v0.1 release for zenodo. https://doi.org/10.5281/zenodo.6127865 (2021).
Acknowledgements
This work was supported by the Intramural Research Program of the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH 1ZIAHG200398; to A.M.M., C.J., S.K., A.M.P. and A.R.); National Science Foundation DBI-1350041 and IOS-1732253 (to M.A.); NIH/NHGRI R01HG010485, U41HG010972, U01HG010961, U24HG011853 and OT2OD026682 (to K.S. and B.P.); HHMI (to G.F.); Wellcome WT206194 (to K.H. and J.M.W.); NIGMS F32 GM134558 (to G.A.L.); NIH/NHGRI R01 1R01HG011274-01, NIH/NHGRI R21 1R21HG010548-01 and NIH/NHGRI U01 1U01HG010971 (to K.M.); St. Petersburg State University grant ID PURE73023672 (to A.M.); NIH/NHGRI R01HG006677 (to A.S.); Fulbright Fellowship (to D.C.S.); and Intramural funding at the National Institute of Standards and Technology (to J.Z.). This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov/). Certain commercial equipment, instruments or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose.
Author information
Contributions
A.R. and A.M.P. conceived and supervised the project. A.M.M., K.S., G.F., K.H., J.M.D.W. and A.R. performed the pre-polishing evaluation. K.S., M.A., A.V.B., A.F., C.J., A.M., B.P. and A.R. aligned reads and called variants. A.M.M., K.S., M.A., G.F., A.F., K.H.M., A.M., J.M.Z. and A.R. manually validated variant calls. D.C.S. and J.M.Z. performed the gene collapse and expansion analysis. K.S., M.A., A.V.B., G.A.L., K.H.M., A.M. and A.R. identified and curated heterozygous and ‘issues’ loci. K.S., M.A., S.K. and B.P. patched and polished the telomeres. A.M.M., M.A., A.S. and I.S. performed automated polishing. A.M.M., K.S., M.A. and A.R. wrote the manuscript, with assistance from all authors. All authors approved the final manuscript.
Ethics declarations
Competing interests
I.S. is an employee of PacBio. A.F. is an employee of DNAnexus. S.K. has received travel funds to speak at symposia organized by Oxford Nanopore. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Kai Wang and Jue Ruan for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Sequencing biases observed in missing k-mers.
a, Missing k-mers with their GA composition. b-d, v0.9 assembly and k-mer copy number spectrum from HiFi, Illumina and hybrid k-mer sets (left) and per-chromosome missing (likely error) k-mer counts from the HiFi-derived consensus or patches (right). Most k-mers missing from HiFi overlapped sequences from patched regions. No missing k-mers were found on chromosomes indicated with red arrows.
Extended Data Fig. 2 Error detection and polishing pipeline.
A detailed overview of the polishing pipeline, along with the number of errors identified and polished at each step. The data types and polishing tools used are also highlighted. Illumina, 100x PCR-free library Illumina reads; HiFi, 35x PacBio HiFi reads; ONT, 120x Oxford Nanopore reads.
Extended Data Fig. 3 Number of SV-like errors and globally unique single copy k-mers used for marker assisted alignment.
a. Number of SV-like errors called from long-read platforms. b. Range of k-mer counts defined as ‘single-copy’ markers from Illumina reads and in the assembly. The cutoffs were chosen to minimize inclusion of low-frequency erroneous k-mers and 2-copy k-mers. c. Number of markers in every 10 kb window. d. Cumulative number of bases covered by the number of markers in each 10 kb window.
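The marker selection and windowed counting described in panels b and c can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `low`/`high` read-count cutoffs and the use of plain dictionaries of k-mer counts are hypothetical, and canonical (strand-collapsed) k-mers are not handled here.

```python
def select_single_copy_markers(assembly_counts, read_counts, low, high):
    """Keep k-mers seen exactly once in the assembly whose read count lies in
    [low, high]. The cutoffs are assumed inputs; they are meant to exclude
    low-frequency erroneous k-mers (below low) and 2-copy k-mers (above high).
    """
    return {
        kmer
        for kmer, n_asm in assembly_counts.items()
        if n_asm == 1 and low <= read_counts.get(kmer, 0) <= high
    }


def markers_per_window(sequence, markers, k, window=10_000):
    """Count marker k-mers by the window in which they start."""
    n_windows = (len(sequence) + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for i in range(len(sequence) - k + 1):
        if sequence[i:i + k] in markers:
            counts[i // window] += 1
    return counts
```

Windows with a count of zero in the output are the regions where marker-assisted alignment has no unique anchor to rely on.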
Extended Data Fig. 4 Post-polishing evaluation.
a. Left, genotype quality and number of reads supporting the reference and alternate alleles from the combined Illumina-HiFi hybrid and ONT homozygous variant calls, with AF > 0.5. Right, balanced insertion (red) and deletion (blue) length distribution from the Illumina-HiFi hybrid DeepVariant heterozygous calls in CHM13v1.0. b. Number of errors detected in each chromosome, before and after polishing. c. Polishing inside and outside of repeats: the distribution of CHM13v0.9 polishing rates inside and outside repeats.
Extended Data Fig. 5 Three SV-like errors corrected.
HiFi and ONT marker-assisted alignments after correction of the three large SV-like edits, visualized with IGV. The HiFi coverage track is shown with a data range up to 60, and the ONT track up to 150. Clipped reads are flagged for clips >100 bp. INDELs smaller than 10 bp are not shown. Reads are colored by strand: positive in red and negative in blue.
Extended Data Fig. 6 Telomere polishing.
a. An illustration of Chr. 2 telomere sequence reads from the HiFi, ONT and CLR platforms. b. Distribution of maximum perfect match to the canonical k-mer observed at each position in the telomere before (CHM13v1.0) and after (CHM13v1.1) polishing the telomeres.
Extended Data Fig. 7 Mapping biases found and corrected.
On simulated HiFi reads, we found excessive clipping in highly identical satellite repeats with Minimap and Winnowmap at the time of evaluation. We have addressed this issue in Winnowmap 2.01+. Clipped (%) indicates the percentage of reads clipped in every 1,024 bp window, shown in a 0-40% range with a midline at 10%.
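The Clipped (%) track can be approximated with a short windowed calculation. The sketch below is illustrative only: it assumes each alignment has already been reduced to a tuple of reference start, reference end (exclusive) and the number of clipped bases at each read end, and the function name and the `min_clip` threshold are hypothetical rather than taken from the authors' pipeline.

```python
def clipped_percent_per_window(reads, ref_length, window=1024, min_clip=1):
    """Percentage of reads overlapping each window that carry a clip.

    reads: iterable of (ref_start, ref_end, left_clip, right_clip), with
           0-based half-open reference coordinates and clip lengths in bp.
    A read counts as clipped if either end is clipped by >= min_clip bases
    (an assumed threshold). Windows with no reads report 0.0.
    """
    n_windows = (ref_length + window - 1) // window  # ceiling division
    total = [0] * n_windows
    clipped = [0] * n_windows
    for start, end, left_clip, right_clip in reads:
        if end <= start:
            continue  # skip empty or malformed intervals
        is_clipped = max(left_clip, right_clip) >= min_clip
        first = start // window
        last = min((end - 1) // window, n_windows - 1)
        for w in range(first, last + 1):
            total[w] += 1
            if is_clipped:
                clipped[w] += 1
    return [100.0 * c / t if t else 0.0 for c, t in zip(clipped, total)]
```

With real data the tuples would come from parsing alignment records; that step is left out so the sketch stays independent of any particular library.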
Extended Data Fig. 8 HiFi, CLR, ONT read coverage, alignment identity, and read length from Winnowmap2 v2.01 alignments and Bionano DLE-1 molecule coverage from Bionano Solve.
The upper panel shows a zoomed-in region of Chromosome 9, while the lower panel shows the whole-genome alignment view. HiFi, CLR, ONT and Bionano coverage are shown up to 70x, 70x, 200x and 250x, respectively. Median read identity in every 1,024 bp is shown in an 80-100% range. Median read length in every 1,024 bp is shown in a 0-100 kb range. Read identity was the worst in CLR, and between HiFi and ONT. Bionano molecules were lacking coverage in most of the centromeric repeats.
Extended Data Fig. 9 Collapsed simple tandem repeat.
The collapse in the intronic sequence of the gene FAM227A went undetected because of the variable insertion breakpoints and insertion lengths in the HiFi and ONT alignments. The panels above the alignments show marker density and percent microsatellites (GA / AT / TC / GC) in each 64 bp window, which indicate that this region is highly repetitive, with GA-enriched sequences that later alternate with AT-enriched sequences.
Extended Data Fig. 10 Chimeric junction of two haplotypes.
In the regions shown above, both HiFi and ONT reads indicate that the consensus has a chimeric junction of the two haplotypes.
Cite this article
Mc Cartney, A.M., Shafin, K., Alonge, M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 19, 687–695 (2022). https://doi.org/10.1038/s41592-022-01440-3
DOI: https://doi.org/10.1038/s41592-022-01440-3