Fig. 2: Comparison of imputation methods on the Roadmap reference epigenomes. | Nature Communications

From: Getting personal with epigenetics: towards individual-specific epigenomic imputation with machine learning

a Performance metrics for the imputation of the n = 203 test tracks on chromosome 21 for each model. Boxes represent the interquartile range (IQR), with the middle line representing the median; the whiskers extend to the most extreme points within 1.5 IQRs of the lower and upper quartiles, and points beyond them are displayed individually as outliers. Metrics presented include Mean Squared Error (MSE) and Pearson correlation coefficient (Corr) for the Genome-wide (GW/Global), Foreground (Fg) and Background (Bg) regions, as well as the Area Under the Precision-Recall Curve (AUPRC), Precision, and Recall for the classification of peaks detected with MACS2.

b Examples of observed epigenomic tracks alongside the signals imputed by eDICE for the assay H3K9ac in two selected tissues (E025, E052). Below the tracks, the peaks detected with MACS2 highlight how the imputations accurately capture enriched regions. The peaks were detected using a one-sided Poisson hypothesis test with Benjamini-Hochberg correction for multiple testing and a cut-off value of 0.01.

c Percentages of test tracks on which eDICE outperforms the baselines for each metric. ChromImpute shows good performance on tasks related to the height of the peaks, while eDICE outperforms PREDICTD and Avocado on all metrics.

d Learning curves that display several global performance metrics against the number of genomic positions used in training. Tensor factorization models such as Avocado need to be trained on the whole genome to make genome-wide predictions. eDICE, on the other hand, can be trained efficiently on a small subset of genomic regions and still obtain improved performance, suggesting that previous models severely overparameterized the imputation problem. Data are presented as mean ± 95% confidence interval for n = 203 test tracks.

Source data are provided as comma-separated-values (csv) files.
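The quantities summarised in panels a and c can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the boolean `fg_mask` marking foreground positions, the quartile interpolation method, and the treatment of ties as non-wins are all assumptions made for illustration; the caption does not specify how foreground regions are defined.

```python
import math
from statistics import mean, quantiles


def mse(obs, imp):
    """Mean squared error between an observed and an imputed track."""
    return mean((o - i) ** 2 for o, i in zip(obs, imp))


def pearson(obs, imp):
    """Pearson correlation coefficient between two equal-length tracks."""
    mo, mi = mean(obs), mean(imp)
    cov = sum((o - mo) * (i - mi) for o, i in zip(obs, imp))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_i = sum((i - mi) ** 2 for i in imp)
    return cov / math.sqrt(var_o * var_i)


def track_metrics(obs, imp, fg_mask):
    """MSE and Corr on genome-wide (GW), foreground (Fg) and background (Bg).

    fg_mask is a hypothetical list of booleans, True at foreground positions.
    """
    regions = {
        "GW": [True] * len(obs),
        "Fg": list(fg_mask),
        "Bg": [not m for m in fg_mask],
    }
    out = {}
    for name, mask in regions.items():
        o = [x for x, m in zip(obs, mask) if m]
        i = [x for x, m in zip(imp, mask) if m]
        out[f"MSE_{name}"] = mse(o, i)
        out[f"Corr_{name}"] = pearson(o, i)
    return out


def box_stats(values):
    """Box-plot summary: quartiles, whisker ends within 1.5 IQR, outliers."""
    # Assumption: linear ("inclusive") interpolation for the quartiles.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


def win_percentage(model_scores, baseline_scores, higher_is_better=True):
    """Percentage of tracks on which the model strictly beats the baseline."""
    wins = sum(
        (m > b) if higher_is_better else (m < b)
        for m, b in zip(model_scores, baseline_scores)
    )
    return 100.0 * wins / len(model_scores)
```

For error metrics such as MSE, `win_percentage` would be called with `higher_is_better=False`, since a lower value indicates a better imputation.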
