Figure 3: Imputed data shows higher promoter/gene recovery, robustness and biological group recovery. | Nature Biotechnology

From: Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues

(a,b) Quantitative comparison of observed (blue) and imputed (red) data in their recovery of annotated promoters (a) and gene bodies (b), based on the area under the ROC curve up to a 5% false-positive rate (y axis) for H3K4me3 signal recovery of locations within 2 kb of TSS (a) and H3K36me3 signal recovery of gene bodies (b). Arrows indicate two fetal brain samples (E081 and E082) with very different values in the observed data, which show much higher (and more consistent) recovery for imputed data. FPR, false-positive rate. (c,d) Comparison of aggregate signal for imputed (red) and observed (blue) datasets based on −log10 P value of H3K4me3 surrounding the TSS (c) and H3K36me3 in gene bodies (d). Imputed data show a substantially more consistent profile across all datasets, and in particular for the two fetal brain samples (E081, E082), which show substantial differences in the observed data. (e) Pairwise comparison of genome-wide signal correlation for all samples using observed (top) and imputed (bottom) data for H3K4me1, H3K27me3 and DNase (additional marks shown in Supplementary Fig. 19), with samples ordered and colored as in Figure 1a (left sidebar). Imputed datasets better capture biological relationships between samples than observed datasets, with their correlation structure clearly delineating pluripotent cells, immune cells, adult brain and multiple tissue groups (Fig. 1a), whereas observed datasets are much less correlated even for highly similar samples. (f) Area under the ROC curve for classifying whether two different pairs of experiments belong to the same group when ranking the pairs based on their correlation. A value of 0.5 could be achieved by random guessing and a value of 1.0 is the maximum possible score. The 'Other' and 'ENCODE' groups were excluded from this analysis as were imputed pairs that were not present in the observed data. 
This shows quantitatively that the relative similarity of imputed datasets is more consistent with the biological groupings of the samples than that of observed datasets.
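
The pair-classification score in panel (f) can be illustrated with a minimal sketch. It assumes that similarity is measured by Pearson correlation of per-sample signal vectors and that the AUC is computed in its rank (Mann-Whitney) form. The function names, the dictionary input layout and the toy values are hypothetical and are not taken from the article.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant signal vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def same_group_auc(signals, groups):
    """AUC for predicting same-group membership by ranking sample pairs on correlation.

    signals: dict of sample id -> list of signal values (hypothetical layout)
    groups:  dict of sample id -> biological group label
    """
    same, diff = [], []
    for s, t in combinations(sorted(signals), 2):
        r = pearson(signals[s], signals[t])
        (same if groups[s] == groups[t] else diff).append(r)
    if not same or not diff:
        raise ValueError("need at least one same-group and one cross-group pair")
    # Probability that a random same-group pair has a higher correlation than
    # a random cross-group pair, counting ties as one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in same for q in diff)
    return wins / (len(same) * len(diff))


# Toy example: two groups whose members correlate within but not across groups.
toy_signals = {
    "A1": [1, 2, 3, 4],
    "A2": [2, 4, 6, 9],
    "B1": [4, 3, 2, 1],
    "B2": [9, 6, 4, 2],
}
toy_groups = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
toy_auc = same_group_auc(toy_signals, toy_groups)
```

In this toy case every same-group pair is more correlated than every cross-group pair, so the score reaches the maximum of 1.0. If all pairs were equally correlated, the score would be 0.5, matching the random-guessing baseline in the legend. Excluding groups such as 'Other' and 'ENCODE' would amount to dropping those samples from both inputs before calling the function.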