Fig. 2: Performance comparison of error correction.
From: De novo diploid genome assembly using long noisy reads

a Accuracy of raw and corrected reads, and accuracy of SNP alleles in raw and corrected reads, on the simulated datasets with different heterozygosity rates. b Accuracy of raw reads and of reads corrected by NECAT and PECAT in difficult-to-map regions and low-complexity regions of the HG002 reference genome. c, d Consistency and completeness of raw reads and of reads corrected by Canu, FALCON, MECAT2, NECAT, and PECAT on the seven diploid datasets. The metrics for Canu on B. taurus (PacBio CLR and ONT) and HG002 (ONT), and the metrics for FALCON on B. taurus (PacBio CLR), are excluded because correction did not finish within three weeks. Consistency is defined as \(\sum \max ({k}_{p},{k}_{m})/\sum ({k}_{p}+{k}_{m})\), in which \({k}_{p}\) and \({k}_{m}\) are the numbers of paternal and maternal haplotype-specific k-mers in each read, and the sums run over the reads. Completeness is the percentage of parent-specific k-mers (occurrences \(\ge 4\)) in the 40X longest reads. e Consistency of D. melanogaster (ISO1 × A4) raw reads and of reads corrected by the different methods. Each point corresponds to a read. Its coordinates give the proportions of parent-specific k-mers in the read, where k is 18. All 40X longest reads are shown in each sub-figure.
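The consistency and completeness definitions in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kmers`, `consistency`, `completeness`), the toy sequences, and the small k are hypothetical, and the k-mer sets are assumed to be precomputed sets of paternal-specific and maternal-specific k-mers. Completeness is computed under one reading of the caption, namely the share of parent-specific k-mers that occur at least `min_occ` times across the reads. Reverse-complement (canonical) k-mers and selection of the 40X longest reads are left out for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def consistency(reads, paternal, maternal, k):
    """sum(max(k_p, k_m)) / sum(k_p + k_m), summed over reads.

    k_p and k_m count the paternal- and maternal-specific k-mers
    found in one read.
    """
    numerator = 0
    denominator = 0
    for read in reads:
        read_kmers = kmers(read, k)
        k_p = sum(1 for x in read_kmers if x in paternal)
        k_m = sum(1 for x in read_kmers if x in maternal)
        numerator += max(k_p, k_m)
        denominator += k_p + k_m
    # No haplotype-specific k-mers at all: the ratio is undefined.
    return numerator / denominator if denominator else float("nan")


def completeness(reads, specific, k, min_occ=4):
    """Percentage of parent-specific k-mers seen >= min_occ times.

    Assumption: "occurrences >= 4" is read as a threshold on how often
    a k-mer occurs across all reads.
    """
    counts = Counter(
        x for read in reads for x in kmers(read, k) if x in specific
    )
    found = sum(1 for x in specific if counts[x] >= min_occ)
    return 100.0 * found / len(specific) if specific else float("nan")


# Toy example with k = 3 (the figure uses k = 18).
toy_paternal = {"AAA"}
toy_maternal = {"CCC"}
toy_reads = ["AAAAT", "CCCAAA"]

toy_consistency = consistency(toy_reads, toy_paternal, toy_maternal, k=3)
toy_completeness = completeness(
    toy_reads, toy_paternal | toy_maternal, k=3, min_occ=2
)
```

In the toy data, the first read contains only paternal k-mers while the second mixes one paternal and one maternal k-mer, so the consistency falls below 1. A read set in which every read carries k-mers from a single haplotype would give a consistency of exactly 1.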