Fig. 1: CHESSBOARD Pipeline.
From: A Bayesian model for unsupervised detection of RNA splicing based subtypes in cancers

a Input: splice junction read counts (red and blue reads) extracted from patients' RNA sequencing data. Each row in the input data matrix is an LSV (e.g., the cassette exon shown) and each cell contains the junction-spanning read counts for that LSV in a specific sample. For complex LSVs involving more than two junctions, the most variable junction is selected (Methods). b Task: CHESSBOARD’s objective is to identify latent tiles in the input matrix. A tile consists of a subset of samples and a subset of LSVs for which the Ψ distribution of each LSV in the tile's samples differs from the background distribution. Note that the matrices shown contain Ψ values for visualization purposes, but CHESSBOARD operates on the read-count matrix described in (a), and it may not be possible to embed each tile as a contiguous square in a 2D image as shown here. c CHESSBOARD Pipeline: The pipeline includes three steps. Filtering: Genes with low expression (bottom 5% by default) and LSVs observed in too few samples (default 20%) are removed; of the remaining LSVs, only those exhibiting high Ψ variability between samples and multiple modes in the Ψ value distribution are retained (Methods). MCMC: Blocked Gibbs sampling based on CHESSBOARD’s model and the input data matrix yields posterior samples of potential tile configurations. Intuitively, the algorithm iterates through a chain of solutions that tend toward higher likelihood while varying the number of tiles using the Chinese Restaurant Process (Methods). Analysis: The MCMC samples are summarized into marginal posterior distributions and point estimates for tiles. Tile analysis includes sample assignment to subgroups, LSV assignment to a signal tile, and computation of the ΔΨ and missingness rate associated with a particular LSV in a tile (Methods). Visualization and analysis are conducted using the accompanying visualization package, GAMBIT.
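To make the quantities in panels (a) and (c) concrete, the following is a minimal Python sketch, not CHESSBOARD's actual implementation. It assumes a two-junction LSV, so that Ψ is the inclusion read count divided by the total junction read count, and it reads the 20% default as a minimum fraction of samples in which an LSV must be quantified. The function names, the `min_reads` coverage cutoff, and the concentration parameter `alpha` are hypothetical. The last function shows only the standard Chinese Restaurant Process prior weights, which let a sampler open a new tile with nonzero probability; it is not the blocked Gibbs sampler itself.

```python
def psi_matrix(inc, exc, min_reads=10):
    """Psi = inc / (inc + exc) per LSV (row) and sample (column).

    Entries with fewer than `min_reads` total junction reads are set to
    None (treated as missing). `min_reads` is a hypothetical cutoff.
    """
    psi = []
    for inc_row, exc_row in zip(inc, exc):
        row = []
        for i, e in zip(inc_row, exc_row):
            total = i + e
            row.append(i / total if total >= min_reads else None)
        psi.append(row)
    return psi


def filter_lsvs(psi, min_observed_frac=0.2):
    """Indices of LSVs quantified in at least `min_observed_frac` of samples.

    Assumption: this is one reading of the caption's "observed in too few
    samples (default 20%)" filter.
    """
    kept = []
    for idx, row in enumerate(psi):
        observed = sum(v is not None for v in row)
        if observed / len(row) >= min_observed_frac:
            kept.append(idx)
    return kept


def crp_weights(tile_sizes, alpha=1.0):
    """Chinese Restaurant Process prior weights for the next item.

    Returns one weight per existing tile (proportional to its size),
    followed by the weight of opening a new tile (proportional to alpha).
    """
    n = sum(tile_sizes)
    weights = [size / (n + alpha) for size in tile_sizes]
    weights.append(alpha / (n + alpha))
    return weights


# Toy data: 2 LSVs x 4 samples of inclusion / exclusion junction reads.
inc = [[30, 5, 0, 28], [1, 0, 2, 0]]
exc = [[10, 35, 40, 12], [0, 1, 0, 3]]

psi = psi_matrix(inc, exc)
kept = filter_lsvs(psi)
weights = crp_weights([3, 1], alpha=1.0)
```

In this toy input the second LSV never reaches the coverage cutoff, so every one of its entries is missing and it is dropped by the filter. In a full model the CRP weights would be combined with the data likelihood of each candidate assignment.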