Extended Data Fig. 2: Integrated analysis of 223 eCLIP data sets identifies RBP clusters on the basis of binding patterns. | Nature

From: A large-scale binding and functional map of human RNA-binding proteins

a, Effect of the number of clusters on hierarchical clustering of RBPs, based on the Euclidean distance between RBPs for the fraction of peaks overlapping each of the RNA region types, as shown in Fig. 2a. For each number of clusters k between 2 and 35, the squared error between the number of peaks annotated for each region and the mean of all RBPs in that RBP’s cluster was calculated and summed across all RBPs. An inflection point was identified at k = 6 (indicated).

b, Model of the eCLIP analysis pipeline for quantification of eCLIP signal at RNA families with multiple transcript or pseudogene copies.

c, Stacked bars indicate the number of reads from replicate 1 of all 223 eCLIP experiments, separated by whether they map uniquely to the genome (red), uniquely to the genome but within a repetitive element identified by RepeatMasker (purple), or to repetitive element families (grey). Data sets are sorted by the fraction of unique genomic reads.

d–f, Each eCLIP data set is displayed as a point based on t-SNE clustering (Fig. 2b), with colour indicating whether the data set passed peak-based or family-mapping-based quality assessment (d), the relative information at coding sequence (CDS) (e), or the relative information at the 45S ribosomal RNA precursor (f).

g, Means of 100 random orderings of each data type for the number of genes that were differentially expressed for all 472 KD–RNA-seq data sets (requiring FDR < 0.05 and P < 0.05 from DESeq analysis; Methods) (green), bound in 223 eCLIP data sets (overlapped by an IDR-reproducible peak with P ≤ 10⁻³ and fold enrichment ≥ 8 in IP versus input; Methods) (blue), or both bound and differentially expressed (considering 203 pairings of eCLIP and KD–RNA-seq for an RBP in the same cell type) (orange). The set of genes considered was all 57,645 genes in GENCODE v19; see Supplementary Fig. 13a, b for analyses of expressed genes only. Grey dotted line indicates the total number of expressed genes, defined as TPM > 1 in either K562 or HepG2 cells. Shaded regions indicate 10th to 90th percentiles.

h, Means of 100 random orderings of data sets for the number of differential splicing events for all 472 RBP KD–RNA-seq experiments (including skipped exons, alternative 5′ and 3′ splice sites, retained introns, and mutually exclusive exons; requiring FDR < 0.05, P < 0.05, and |ΔΨ| > 0.05) (red), and exons both bound by an RBP and differentially spliced upon RBP knockdown (considering 203 pairings of eCLIP and KD–RNA-seq for an RBP in the same cell type) (blue), with binding defined as a peak located anywhere between the upstream intron 5′ splice site and the downstream intron 3′ splice site. Shaded regions indicate 10th to 90th percentiles.

i, Cumulative fraction of bases within peaks for 100 random orderings of the 223 eCLIP data sets, separated by transcript regions as indicated. Shaded region indicates 10th to 90th percentiles. See Supplementary Text and Supplementary Fig. 13c, d for additional analyses of all versus expressed genes only.

j, Fraction of overlapping peaks identified from our standard eCLIP processing pipeline between K562 and HepG2 cells for RBPs profiled in both cell types (blue or red), or between one RBP in K562 cells and a second in HepG2 cells (black), for sets of genes separated by their relative expression change between K562 and HepG2 cells as follows: unchanged (fold-difference ≤ 1.2), weakly (1.2 < fold-difference ≤ 2), moderately (2 < fold-difference ≤ 5) or strongly (fold-difference > 5) differential, or cell-type-specific genes (TPM < 0.1 in one cell type and TPM ≥ 1 in the other). Red line indicates mean.

k, Each point represents one eCLIP data set compared with the same RBP profiled in the second cell type (73 total). For the set of peaks from the first cell type that are not enriched (fold enrichment < 1) in the second cell type, red points indicate the fraction that occur in genes with the indicated expression difference between HepG2 and K562 cells. Blue points similarly indicate the gene distribution of peaks that were fourfold enriched in the opposite cell type. Boxes, quartiles; green line, median.
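
The error measure described for panel a can be illustrated with a minimal Python sketch. It computes, for one given cluster assignment, the squared error between each RBP's per-region profile and the mean profile of its cluster, summed across all RBPs. The function name, the RBP names and the toy values are hypothetical and are not taken from the paper; the clustering step itself (producing the assignment for each k) is assumed to have been done elsewhere.

```python
from collections import defaultdict


def within_cluster_sse(profiles, labels):
    """Sum of squared error between each profile and its cluster mean.

    profiles: dict mapping an RBP name to a list of per-region values.
    labels:   dict mapping the same RBP names to a cluster identifier.
    """
    # Group the profiles by cluster.
    members = defaultdict(list)
    for name, cluster in labels.items():
        members[cluster].append(profiles[name])

    # Mean profile of each cluster, computed region by region.
    means = {
        cluster: [sum(column) / len(rows) for column in zip(*rows)]
        for cluster, rows in members.items()
    }

    # Squared deviation of every RBP from its own cluster mean, summed.
    total = 0.0
    for name, cluster in labels.items():
        total += sum(
            (value - mean) ** 2
            for value, mean in zip(profiles[name], means[cluster])
        )
    return total


# Toy example with hypothetical values (three RBPs, three region types).
toy_profiles = {
    "RBP_A": [0.8, 0.1, 0.1],
    "RBP_B": [0.6, 0.3, 0.1],
    "RBP_C": [0.1, 0.1, 0.8],
}
toy_labels = {"RBP_A": 0, "RBP_B": 0, "RBP_C": 1}
toy_sse = within_cluster_sse(toy_profiles, toy_labels)
```

Evaluating such a function for every k from 2 to 35 and plotting the result against k would give the kind of curve on which an inflection point can be read off.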
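
The random-ordering summaries in panels g–i can be sketched in the same spirit. The example below shuffles the data sets repeatedly, tracks the cumulative number of distinct genes after each added data set, and reports the mean with 10th and 90th percentiles at each position. It is only an illustration under stated assumptions: the function name, the data set and gene identifiers, and the choice of the "inclusive" percentile method are hypothetical, and the paper's own implementation may differ.

```python
import random
import statistics


def saturation_curve(gene_sets, n_orderings=100, seed=0):
    """Mean and 10th/90th percentiles of cumulative unique gene counts.

    gene_sets: dict mapping a data set name to a set of gene identifiers.
    Returns one (mean, p10, p90) tuple per position in the ordering.
    """
    rng = random.Random(seed)
    names = list(gene_sets)

    curves = []
    for _ in range(n_orderings):
        order = names[:]
        rng.shuffle(order)
        seen = set()
        curve = []
        for name in order:
            seen |= gene_sets[name]  # add genes newly covered by this data set
            curve.append(len(seen))
        curves.append(curve)

    summary = []
    for counts in zip(*curves):  # counts at one position across all orderings
        deciles = statistics.quantiles(counts, n=10, method="inclusive")
        summary.append((statistics.mean(counts), deciles[0], deciles[-1]))
    return summary


# Toy example with hypothetical data sets and genes.
toy_gene_sets = {
    "dataset_1": {"gene_1", "gene_2"},
    "dataset_2": {"gene_2", "gene_3"},
    "dataset_3": {"gene_4"},
}
toy_summary = saturation_curve(toy_gene_sets, n_orderings=100, seed=0)
```

The last position always equals the total number of distinct genes regardless of ordering, so the spread between percentiles narrows to zero there; the earlier positions show how quickly additional data sets stop contributing new genes.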
