Figure 5: Low similarity between imputed and observed data reveals low-quality datasets. | Nature Biotechnology

From: Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues

(a) Comparison of QC metrics (columns) for the ten datasets (rows) showing lowest agreement with gene and promoter annotations (Fig. 3a,b), based on H3K4me3 PromRecov (top) and H3K36me3 GeneRecov (bottom). Each entry shows the rank (out of 127) for GeneRecov/PromRecov, read depth, each QC metric (Poisson statistic, Signal Proportion of Tags (SPOT), FindPeaks, and Normalized and Relative Strand Correlation between forward and reverse strands (NSC and RSC)), and similarity between imputed and observed data (Match1 and GWcorr). Orange-shaded EIDs denote the five worst-agreement datasets from b. Datasets with the same read depth (a result of highly sequenced datasets having previously been downsampled to the same number of reads; ref. 10) are given the same rank, equal to the expected rank if ties were broken randomly. The most problematic datasets (based on lack of gene or ±2 kb TSS annotation recovery) are sometimes missed by traditional QC measures but consistently show low imputation agreement. (b) Distribution of agreement between top 1% observed signal and top 1% imputed signal locations for H3K4me3 (top) and H3K36me3 (bottom), highlighting the five worst-similarity (orange) and five highest-similarity (green) datasets. (c) Observed (blue) and imputed (red) signal tracks for worst-similarity (orange) and best-similarity (green) datasets for H3K4me3 (top) and H3K36me3 (bottom) across the entire chromosome 10 (0–135 Mb). Datasets with the lowest agreement have a relatively flat signal, suggesting that when observed and imputed datasets disagree most, it is usually the observed datasets that are of lowest quality. (d) Aggregation of observed signal for H3K4me3 surrounding the TSS (top) and H3K36me3 in gene bodies (bottom) for the five best-agreement (green) and worst-agreement (orange) datasets. Some worst-agreement datasets show unusual profiles, suggesting that they are of lower quality even though they were not flagged by traditional QC metrics.
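The similarity measures and tie handling described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the signal is assumed to be a 1-D array of per-bin values, the top-1% agreement is assumed to be the fraction of top-1% observed bins that are also top-1% imputed bins, GWcorr is assumed to be a Pearson correlation, and the rank direction (1 = smallest) is arbitrary.

```python
import numpy as np


def top_fraction_overlap(observed, imputed, fraction=0.01):
    """Fraction of the top-`fraction` observed bins that are also top imputed bins.

    Assumed reading of the caption's "agreement between top 1% observed
    signal and top 1% imputed signal locations"; the exact published
    definition may differ.
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    if observed.shape != imputed.shape:
        raise ValueError("observed and imputed must have the same shape")
    k = max(1, int(round(fraction * observed.size)))
    # Indices of the k highest-signal bins in each track.
    top_obs = np.argpartition(observed, -k)[-k:]
    top_imp = np.argpartition(imputed, -k)[-k:]
    return np.intersect1d(top_obs, top_imp).size / k


def genome_wide_correlation(observed, imputed):
    """Pearson correlation across all bins (assumed meaning of GWcorr)."""
    return float(np.corrcoef(observed, imputed)[0, 1])


def expected_ranks(values):
    """Ascending ranks (1 = smallest); tied values share their mean rank.

    The mean of the tied positions equals the expected rank if the ties
    were broken uniformly at random.
    """
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.gamma(shape=2.0, size=10_000)          # toy signal track
    imputed = observed + rng.normal(scale=0.5, size=10_000)  # noisy copy
    print(top_fraction_overlap(observed, imputed))
    print(genome_wide_correlation(observed, imputed))
    print(expected_ranks([30, 10, 30, 20]))  # tied read depths share a rank
```

A flat observed track, as in panel c, would score low on both measures because its highest bins are essentially arbitrary relative to the imputed signal.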
