Table 1 Performance comparison of assembly on PacBio CLR datasets

From: De novo diploid genome assembly using long noisy reads

| Dataset | Pipeline | Size (Mb) | NG50 (Mb) | Quality (reference-based) | Quality (k-mer-based) | BUSCO (%) | Hamming error (%) | Phase block NG50 (Mb) | Intra-block switch error (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. cerevisiae SK1×Y12, 200X | Ref | 12.1/12.0 | 0.9/0.9 | –/– | 47.3/49.4 | 99.6/99.6 | 0.13/0.03 | 0.9/0.9 | 0.13/0.03 |
| | Canu + Purge_dups | 12.4/4.8 | 0.8/0.0 | 25.6/30.5 | 38.9/36.2 | 98.7/28.1 | 39.24/6.25 | 0.0/0.0 | 9.92/4.50 |
| | FALCON-Unzip | 12.1/11.1 | 0.8/0.4 | 40.5/42.2 | 44.5/45.0 | 99.4/95.8 | 21.58/0.98 | 0.5/0.4 | 0.40/0.22 |
| | PECAT | 12.3/11.8 | 0.8/0.8 | 35.9/36.0 | 39.1/39.7 | 96.0/93.8 | 1.65/0.49 | 0.8/0.8 | 0.16/0.09 |
| A. thaliana Col-0×Cvi-0, 164X | Ref | 133.3/119.7 | 26.2/23.20 | –/– | Inf/Inf | 99.3/99.2 | 0.12/0.01 | 12.1/23.2 | 0.11/0.01 |
| | Canu + Purge_dups | 129.1/122.6 | 6.7/0.1 | 28.0/30.4 | 24.8/30.7 | 98.7/81.7 | 37.52/2.65 | 0.1/0.1 | 2.81/2.45 |
| | FALCON-Unzip* | 140.0/104.9 | 8.0/4.3 | 40.0/40.0 | 30.0/34.9 | 98.9/93.6 | 15.11/0.98 | 3.1/2.4 | 0.15/0.19 |
| | PECAT | 130.6/120.4 | 14.3/7.8 | 34.6/34.7 | 25.3/33.2 | 98.3/98.2 | 2.64/0.21 | 12.6/7.8 | 0.13/0.13 |
| D. melanogaster ISO1×A4, 200X | Ref | 143.7/140.7 | 25.3/25.0 | –/– | 46.5/46.3 | 98.7/98.7 | 0.06/0.02 | 25.3/24.6 | 0.06/0.02 |
| | Canu + Purge_dups* | 143.2/129.7 | 16.1/0.3 | 33.6/34.7 | 35.4/35.3 | 98.5/87.3 | 43.50/3.97 | 0.4/0.3 | 3.92/3.01 |
| | FALCON-Unzip | 189.7/105.9 | 4.0/0.4 | 40.5/40.5 | 37.1/37.4 | 98.8/84.0 | 29.31/5.38 | 1.0/0.3 | 0.34/0.41 |
| | PECAT | 149.6/135.7 | 24.5/11.9 | 38.5/39.2 | 40.8/43.5 | 98.7/96.7 | 3.00/0.07 | 16.1/11.8 | 0.05/0.04 |
| B. taurus Angus×Brahman, 135X | Ref | 2580.8/2681.0 | 91.1/104.5 | –/– | 43.3/43.8 | 93.6/95.6 | 0.10/0.04 | 21.8/30.5 | 0.08/0.03 |
| | FALCON-Unzip* | 2713.4/2453.7 | 31.4/2.0 | 39.2/38.5 | 39.4/39.0 | 95.4/86.3 | 28.15/1.97 | 3.2/1.8 | 0.21/0.22 |
| | PECAT | 2744.7/2447.6 | 72.4/2.8 | 34.6/34.6 | 39.9/40.1 | 94.8/87.0 | 29.46/0.47 | 4.5/2.4 | 0.10/0.09 |

  1. ‘Size’ is the total number of base pairs in all contigs generated by the assembler. ‘NG50’ is the length of the shortest contig such that contigs of that length or longer cover at least 50% of the genome size. The genome sizes of S. cerevisiae, A. thaliana, D. melanogaster, and B. taurus used for evaluation are 12 Mb, 130 Mb, 140 Mb, and 2.7 Gb, respectively. ‘BUSCO’ is gene completeness as evaluated by BUSCO. ‘Quality (reference-based)’ is the ‘q50’ metric reported by Pomoxis. ‘Hamming error’ is the fraction of nondominant parent-specific k-mers in a contig. ‘Quality (k-mer-based)’, ‘Phase block NG50’, and ‘Intra-block switch error’ are evaluated by Merqury. All assemblies are in primary/alternate format, and each cell reports the primary and alternate contigs separately as primary/alternate. ‘Ref’ is the reference genome; the sources of the reference genomes are listed in Supplementary Table 17. For B. taurus, Canu did not finish the assembly within 3 weeks, so it is excluded. Asterisks mark previously published assemblies.
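The NG50 definition in the note above can be illustrated with a minimal sketch. It is not the evaluation code used for the table: the function name `ng50`, the toy contig lengths, and the convention of returning 0 when the contigs cover less than half of the genome size are all assumptions made for illustration.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Length of the shortest contig such that contigs of that length or
    longer cover at least 50% of genome_size.

    Returns 0 if the contigs together cover less than half of genome_size
    (an assumed convention, not taken from the source).
    """
    target = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Hypothetical contig lengths, in arbitrary units.
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=200))  # 50 + 40 + 30 = 120 >= 100, so 30
print(ng50(toy_contigs, genome_size=400))  # total 150 < 200, so 0
```

Unlike N50, the threshold is half of a fixed genome size rather than half of the assembly size, so the values for different pipelines on the same dataset are directly comparable.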