Figure 2 | Scientific Reports

Figure 2

From: Similarity-Based Segmentation of Multi-Dimensional Signals

Parameter Scan & Selection: ORF Transcript Recovery. 540 segmentations were calculated with varied parameters, and segments (queries \(Q\)) from each segmentation were tested against a set of 5171 annotated ORF transcripts (targets \(T\)). (a) Graphical definition of recovery measures: ratio \(R\), Jaccard index \(J\) and the number of segments per target \(n_{hits}\) (also see Methods section). Only the best matching hits for each target were used for calculation of \(R\) and \(J\). (b) Empirical cumulative distribution function (ECDF) of query-target length ratios \(R\): mean values of the clusters (solid lines) defined in panel d, and the full spread of values in each cluster (dashed lines). A segmentation in which many query segments are longer than their matching ORF transcript target (\(R > 1\)) is interpreted as under-fragmentation of the data (e.g. the ‘too long’ red segmentations), while segments that are shorter than their target (\(R < 1\)) point to over-fragmentation (e.g. the ‘too short’ magenta segmentations). (c) ECDF of the best hit Jaccard indices, plotted as in panel b. (d) The total Jaccard index \(J_{tot}\) of best-matching pairs (x-axis) and \(\tilde{n}_{hits}\), the average number of segments per target (y-axis). Numbers indicate the example segmentations in Fig. 3; the slightly shifted ‘x’ marks indicate the same parameters but with scoring function ccor instead of icor. The coloring of segmentations is derived from a PAM clustering of the ratios \(R_{short}\) and \(R_{long}\) (vertical lines in panel b), and \(J_{tot}\) and \(\tilde{n}_{hits}\).
(e) Frequency distribution of parameters (T: time-series processing, K: number of clusters, S: scoring function, E: similarity transformation exponent \(\varepsilon\), M: length-penalty parameter \(M\), nui: nuisance cluster correlation \(\nu\)) in the PAM clusters; the gray-scale is derived from the p-values (\(-\log_2(p)\)) of Fisher’s exact tests of the overlaps and indicates enrichment. This provides an overview of parameter effects on optimization criteria. Detailed effects for each parameter are shown in Supp. Fig. S3a. (f) Frequency distributions of parameter combinations in the ‘optimal’ cluster 4; the gray-scale is derived from the shown frequencies. See Supp. Fig. S3b for other clusters.
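
The recovery measures of panel a can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that \(R\) is the query length divided by the target length, that \(J\) is the overlap of the two intervals divided by their union, that coordinates are inclusive, and that the ‘best matching hit’ is the overlapping query with the highest \(J\). The exact definitions are given in the Methods section; the function names below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a segment is an inclusive (start, end) coordinate pair.
Interval = Tuple[int, int]


def length(iv: Interval) -> int:
    """Number of positions covered by an inclusive interval."""
    return iv[1] - iv[0] + 1


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def jaccard(q: Interval, t: Interval) -> float:
    """Jaccard index J: intersection length over union length."""
    inter = overlap(q, t)
    union = length(q) + length(t) - inter
    return inter / union


def recovery(target: Interval, queries: List[Interval]) -> Dict[str, Optional[float]]:
    """Return n_hits, and R and J of the best hit, for one target.

    Assumption: the best hit is the overlapping query with the highest J.
    """
    hits = [q for q in queries if overlap(q, target) > 0]
    if not hits:
        return {"n_hits": 0, "R": None, "J": None}
    best = max(hits, key=lambda q: jaccard(q, target))
    return {
        "n_hits": len(hits),
        "R": length(best) / length(target),  # R < 1: query shorter than target
        "J": jaccard(best, target),
    }


# Hypothetical target split across two query segments; a third does not overlap.
result = recovery((100, 199), [(90, 149), (150, 219), (300, 400)])
```

In this toy case the target is hit by two segments, and the best hit is shorter than the target (\(R < 1\)), which is the pattern the caption interprets as over-fragmentation.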

Figure 3

Example Region: SRG1 vs. SER3 (chrV:317829..325452). All segmentations were calculated from the shown clustering (K = 12, clusters sorted by similarity) of selected DFT components of the raw read-count time-series, with the indicated parameters E, M and nui, and with scoring function icor (corresponding, from top to bottom, to 1–5 in Fig. 2d). The exceptions are the segmentations marked ‘S:ccor’, which were calculated with the same parameters as the segments directly above but with scoring function ccor (‘x’ in Fig. 2d). ‘ORF’ are annotated transcripts from the ORF-T test set (ref. 26); ‘gene’ and ‘ncRNA’ are annotations from the yeast genome release R64.1.1: the actual ORFs, from left to right YER078W-A, YER079W, AIM9 and SER3, and the ncRNA SRG1. This figure was produced by the demo segment_data of the segmenTier R package.
