Table 1 Proposed standards and metrics for defining genome assembly quality

From: Towards complete and error-free genome assemblies of all vertebrate species

| Quality category | Metric | Finished | VGP-2020 | VGP-2016 | B10k-2014 | This study |
| --- | --- | --- | --- | --- | --- | --- |
| Notation | x.y.P.Q.C | c.c.Pc.Q60.C100 | 7.c.P6.Q50.C95 | 6.7.P5.Q40.C90 | 4.5.Q30 | |
| Continuity | Contig NG50 (x) | = Chr. NG50 | >10 Mb | >1 Mb | >10 kb | 1–25 Mb |
| | Scaffolds NG50 (y) | = Chr. NG50 | = Chr. NG50 | >10 Mb | >100 kb | 23–480 Mb |
| | Gaps per Gb | No gaps | <200 | <1,000 | <10,000 | 75–1,500 |
| Structural accuracy | Reliable blocks | = Chr. NG50 | >10 Mb | >1 Mb | Not required | 2.3–40.2 Mb |
| | False duplications | 0% | <1% | <5% | <10% | 0.2–5.0% |
| | Curation | Conflicts resolved | Manual | Manual | Not required | Manual |
| Base accuracy | Base pair QV (Q) | >60 | >50 | >40 | >30 | 39–43 |
| | k-mer completeness | 100% complete | >95% | >90% | >80% | 87–98% |
| Haplotype phasing | Phase block NG50 (P) | = Chr. NG50 | >1 Mb | >100 kb | Not required | 1.6 Mb (a) |
| Functional completeness | Genes | >98% complete | >95% complete | >90% | >80% | 82–98% |
| | Transcript mappability | >98% | >90% | >80% | >70% | 96% |
| Chromosome status | Assigned (C) | >100% | >95% | >90% | Not required | 94.4–99.9% |
| | Sex chromosomes | Right order, no gaps | Localized homo pairs | At least one shared (for example, X or Z) | Fragmented | At least one shared |
| | Organelles (for example, MT) | One complete allele | One complete allele | Fragmented | Not required | One complete allele |

  1. The six broad quality categories in the first column are split into sub-metrics in the second column. The recommendations for draft to finished qualities (columns 3–6) are based on those achieved in past studies (refs. 16, 19, 63), this study, and what we aspire to. In the x.y.P.Q.C notation, x = log10[contig NG50]; y = log10[scaffold NG50]; P = log10[haplotype phase block NG50]; Q = Phred base accuracy QV; and C = percentage of the assembly assigned to chromosomes. c denotes ‘complete’ telomere-to-telomere continuity. The VGP assemblies (last column) satisfy the 6.7.6.Q40.C90 standard, but some come close to achieving a higher 7.c.7.Q50.C95 standard. These metrics apply to genomes about 1 Gb or bigger.
  2. (a) Phase blocks calculated for the zebra finch non-trio assembly using haplotype-specific k-mers from parental data (ref. 20); the trio assemblies had NG50 phase blocks of 17.3 Mb (maternal) and 56.6 Mb (paternal).
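
The x.y.P.Q.C notation described in the first footnote can be illustrated with a minimal Python sketch. It rests on several assumptions that the table does not state explicitly: lengths are given in base pairs, each log10 value is rounded down to an integer (consistent with ">1 Mb" mapping to 6 and ">10 Mb" to 7), and Q and C are truncated to whole numbers. The function names `log10_floor` and `quality_notation` and the input values are hypothetical, and the 'c' (complete, telomere-to-telomere) case is not handled.

```python
def log10_floor(length_bp: int) -> int:
    """Return floor(log10(length_bp)) for a positive integer length in bp.

    Counting decimal digits avoids floating-point rounding at exact
    powers of ten.
    """
    if length_bp <= 0:
        raise ValueError("length must be a positive number of base pairs")
    return len(str(int(length_bp))) - 1


def quality_notation(contig_ng50_bp: int,
                     scaffold_ng50_bp: int,
                     phase_block_ng50_bp: int,
                     qv: float,
                     pct_assigned: float) -> str:
    """Format assembly metrics as an x.y.P.Q.C string.

    Assumption: exponents are floored, and Q and C are truncated to
    integers; the source table does not specify a rounding rule.
    """
    x = log10_floor(contig_ng50_bp)        # x = log10[contig NG50]
    y = log10_floor(scaffold_ng50_bp)      # y = log10[scaffold NG50]
    p = log10_floor(phase_block_ng50_bp)   # P = log10[phase block NG50]
    return f"{x}.{y}.P{p}.Q{int(qv)}.C{int(pct_assigned)}"


# Illustrative (made-up) assembly: 12 Mb contig NG50, 80 Mb scaffold NG50,
# 1.6 Mb phase block NG50, QV 41.5, 97.2% assigned to chromosomes.
example = quality_notation(12_000_000, 80_000_000, 1_600_000, 41.5, 97.2)
print(example)  # 7.7.P6.Q41.C97
```

Under these assumptions, the example assembly clears every threshold of the 6.7.P5.Q40.C90 column but not the QV >50 required for 7.c.P6.Q50.C95.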