Figure 3: SVM-training pipeline and H3K4me3/Pol II occupancy around the TSSs of protein-coding genes. | Nature Communications

From: microTSS: accurate microRNA transcription start site identification reveals a significant number of divergent pri-miRNAs

(a) The initial set of protein-coding TSSs is divided into two subsets based on H3K4me3 or Pol II occupancy. The region surrounding each TSS is divided into bins, and each bin is assigned a score equal to the number of ChIP-Seq reads or TF footprints that overlap it. The scored bins then serve as features for three separately trained SVMs, each modelling the distribution of one transcription mark around protein-coding TSSs. (b) To train the SVM models, the annotated TSSs are used as positive instances and the flanking regions of each active transcription mark as negatives. In addition, two randomly selected intergenic positions are added as negatives, giving a 1:4 positive-to-negative ratio. The area surrounding each instance (±1,150 bp for H3K4me3 and ±950 bp for Pol II) is divided into 100-nt bins that are scored in the same way. The Pol II and DGF models share the same training set, but the region surrounding each DGF instance (±2,050 bp) is divided into 200-bp bins (not shown).
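The bin-scoring step described in the caption can be sketched as follows. This is an illustrative sketch only, not the microTSS implementation: the function name `bin_scores`, the half-open interval representation of reads, and the placement of bin boundaries (starting at the left edge of the window) are all assumptions, since the caption does not specify them.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a read is a half-open interval [start, end).
Interval = Tuple[int, int]


def bin_scores(center: int, reads: Sequence[Interval],
               flank: int, bin_size: int) -> List[int]:
    """Count the reads overlapping each bin of [center - flank, center + flank)."""
    window_start = center - flank
    n_bins = (2 * flank) // bin_size
    window_end = window_start + n_bins * bin_size
    scores = [0] * n_bins
    for start, end in reads:
        # Clip the read to the window, then find the first and last bin it touches.
        lo = max(start, window_start)
        hi = min(end, window_end)
        if lo >= hi:
            continue  # the read lies entirely outside the window
        first = (lo - window_start) // bin_size
        last = (hi - 1 - window_start) // bin_size
        for b in range(first, last + 1):
            scores[b] += 1
    return scores


# Made-up reads around a made-up TSS at position 10,000 (H3K4me3 settings).
reads = [(9_900, 9_950), (10_040, 10_110), (12_000, 12_050)]
features = bin_scores(center=10_000, reads=reads, flank=1_150, bin_size=100)
```

Under these assumptions, a ±1,150 bp window with 100-nt bins yields a 23-element feature vector and a ±950 bp window yields 19 elements. A read that spans a bin boundary is counted in every bin it overlaps.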
