Figure 1: Application and method overview.
From: Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues

(a) Matrix of observed and imputed datasets across 127 reference epigenomes ('samples'), including 111 from the Roadmap Epigenomics project (rows 1–111), grouped and colored by cell/tissue type, and an additional 16 from ENCODE (rows 112–127), with reference epigenome identifier (EID) and short sample/tissue description. Epigenomic marks (top) are grouped by tiers 1–3 plus RNA-seq and DNA methylation (DNA methyl), based on experimental coverage and imputation strategy. Black dotted arrows at the top denote the E017 datasets shown in b (horizontal arrow) and the H3K36me3 datasets shown in c (vertical arrow), illustrating the two dimensions of correlations used in ChromImpute and shown in d. PB, peripheral blood; Mesench., Mesenchymal; cult cl, cultured cells.
(b) Correlation between epigenomic marks in the same sample, one of the two classes of features used for epigenome imputation. Datasets from sample E017 are shown, illustrating their highly correlated nature, comparing the observed signal for H3K4me1 from E017 (gray), the imputed data (red) and the observed tracks for other marks (blue), ordered by their correlation with H3K4me1. Imputation of H3K4me1 in E017 (red) does not use the observed data (gray) and instead uses the other samples to learn relationships between H3K4me1 and other marks. DNA methylation values below the horizontal line represent missing data. For the primary imputation of H3K4me1, not all marks shown were used, as only tier 1 marks are used to impute tier 1 marks.
(c) Multiple signal tracks for H3K36me3 across samples illustrate the highly correlated nature of a given mark across samples, exploited in the second class of features used for epigenome imputation. This example uses the same region as in b and compares the observed signal for H3K36me3 in E017 (gray) with H3K36me3 in several other samples (blue), which constitute the basis for highly informative features for H3K36me3 imputation in E017 (red). Observed tracks (blue) are ordered by their global correlation to the observed H3K36me3 signal in E017, although ChromImpute did not have this information when imputing H3K36me3 in E017. Instead, it determined sample similarity based on other marks, both globally and locally at each position, and then used the H3K36me3 signal in up to the ten most-proximal samples for each definition of similarity to compute individual features for each predictor of the ensemble (d, right).
(d) Ensemble strategy for signal track imputation using features that exploit correlations between marks in the same sample (left) and correlations between samples for a given mark (right). We assume that no information is available for the target mark in the target sample (gray targets). Thus, we learn relationships between marks (left side) in other samples (the column of the E1 sample is not used) and learn relationships between samples (right side) using other marks, from which we then compute same-mark features. The ensemble predictor that combines features across marks (b) and across samples (c) is learned only in other samples (top), and the marks in the target sample are used only during the actual application of the trained ensemble predictors to compute the imputed signals.
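The same-mark features described in c and d (right) can be illustrated with a minimal sketch. It is not the ChromImpute implementation: the function names, the use of Pearson correlation as the global similarity measure, the plain averaging over the nearest samples, and the toy signal values are all assumptions made for illustration, and the local (per-position) similarity is omitted. Sample similarity is judged on a different mark (here H3K4me1), so the target mark in the target sample is never consulted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def same_mark_feature(data, target_sample, target_mark, ref_mark, k=10):
    """Hypothetical helper: average the target mark over the k samples that are
    most similar to the target sample, with similarity measured on ref_mark only.

    data maps sample -> mark -> list of signal values (one per genomic position).
    Returns None if no other sample has both marks.
    """
    target_ref = data[target_sample][ref_mark]
    # Only other samples that have both the reference mark and the target mark qualify.
    candidates = [
        s for s, marks in data.items()
        if s != target_sample and ref_mark in marks and target_mark in marks
    ]
    # Rank candidates by global correlation of the reference mark (assumed similarity measure).
    ranked = sorted(
        candidates,
        key=lambda s: pearson(data[s][ref_mark], target_ref),
        reverse=True,
    )
    nearest = ranked[:k]
    if not nearest:
        return None
    n_positions = len(target_ref)
    return [
        sum(data[s][target_mark][i] for s in nearest) / len(nearest)
        for i in range(n_positions)
    ]


# Toy values only; H3K36me3 in E017 is treated as unobserved.
data = {
    "E017": {"H3K4me1": [1.0, 2.0, 3.0, 4.0]},
    "E003": {"H3K4me1": [1.0, 2.0, 3.0, 5.0], "H3K36me3": [0.0, 2.0, 4.0, 6.0]},
    "E005": {"H3K4me1": [4.0, 3.0, 2.0, 1.0], "H3K36me3": [9.0, 9.0, 9.0, 9.0]},
    "E006": {"H3K4me1": [1.0, 2.0, 4.0, 4.0], "H3K36me3": [2.0, 2.0, 4.0, 8.0]},
}

feature = same_mark_feature(data, "E017", "H3K36me3", "H3K4me1", k=2)
print(feature)  # [1.0, 2.0, 4.0, 7.0]
```

In this sketch one such feature would be computed per reference mark and per choice of k, and the resulting features, together with the same-sample features from other marks (d, left), would be passed to predictors trained only in other samples.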