Fig. 4: Benchmark of UMI deduplication tools and peak calling tools using scReadSim’s synthetic reads. | Nature Communications


From: scReadSim: a single-cell RNA-seq and ATAC-seq read simulator


Benchmark of UMI deduplication tools (a–d). a Time usage of deduplication tools on synthetic datasets with varying cell numbers (at a fixed sequencing depth) or varying sequencing depths (at a fixed cell number). The y-axis indicates the elapsed time (in seconds), and the x-axis shows the number of synthetic cells (left) or the total number of UMIs (sequencing depth, right). b Distributions of summary statistics of the UMI count matrices (ground truth, cellranger’s output, UMI-tools’ output, and STARsolo’s output) at the gene level (mean, variance, coefficient of variation (cv), and zero proportion) and the cell level (zero proportion and library size). c Cell-wise and gene-wise correlations (Pearson correlation and Kendall’s tau) between the ground-truth UMI count matrix and each deduplication tool’s output UMI count matrix. d UMAP visualizations of the ground-truth UMI count matrix and each deduplication tool’s output UMI count matrix. The mean value of the Euclidean distances of all synthetic cells (“Methods”) is displayed for each UMI deduplication tool: a smaller value indicates that the deduplicated UMI count matrix better agrees with the ground-truth UMI count matrix in UMAP visualization.

Benchmark of peak calling tools (e–g). e Distributions of RPKM values of peak and non-peak regions in the ground truth (specified in scReadSim) and each tool’s peak-calling result. The box center lines, bounds, and whiskers denote the medians, first and third quartiles, and minimum and maximum values within 1.5 × the interquartile range of the box limits, respectively. The difference in the mean RPKM of peak regions (mean diff.) is calculated between the ground truth and each method’s output. The numbers of peaks and non-peaks are as follows: ground truth (2913 peaks and 2914 non-peaks), MACS3 (3310 peaks and 3290 non-peaks), SEACR (31,350 peaks and 31,339 non-peaks), HOMER (4782 peaks and 4780 non-peaks), and HMMRATAC (2726 peaks and 2571 non-peaks). f, g True positive rate (TPR) vs. false positive rate (FPR) curves (f) and precision vs. recall curves (g) using user-designed open chromatin regions as the ground-truth peaks.
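The summary statistics named in panel b and the RPKM values in panel e can be illustrated with a minimal sketch. This is not the authors' code: the function names (`gene_summary`, `cell_summary`, `rpkm`), the cells × genes matrix orientation, the use of the sample variance (`ddof=1`), and the conventional RPKM formula are all assumptions made for illustration.

```python
import numpy as np


def gene_summary(counts):
    """Per-gene statistics for a cells x genes UMI count matrix (assumed layout)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)  # sample variance; an assumption
    # cv = standard deviation / mean; left as NaN where the mean is zero
    cv = np.full_like(mean, np.nan)
    expressed = mean > 0
    cv[expressed] = np.sqrt(var[expressed]) / mean[expressed]
    zero_prop = (counts == 0).mean(axis=0)  # fraction of cells with zero count
    return {"mean": mean, "variance": var, "cv": cv, "zero_prop": zero_prop}


def cell_summary(counts):
    """Per-cell zero proportion and library size (total UMI count)."""
    counts = np.asarray(counts, dtype=float)
    return {
        "zero_prop": (counts == 0).mean(axis=1),
        "library_size": counts.sum(axis=1),
    }


def rpkm(read_count, region_length_bp, total_reads):
    """Reads per kilobase of region per million mapped reads (conventional formula)."""
    return read_count * 1e9 / (region_length_bp * total_reads)


# Toy matrix: 3 cells (rows) x 3 genes (columns); values are made up.
toy = np.array([[0, 2, 4],
                [0, 4, 4],
                [3, 0, 4]])
genes = gene_summary(toy)
cells = cell_summary(toy)
```

Applying the same functions to the ground-truth matrix and to each tool's output matrix would give the per-gene and per-cell distributions that a panel like b compares.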
