Fig. 2: Removal of batch effects from simulated RNA-Seq data with MOCCASIN.
From: MOCCASIN: a method for correcting for known and unknown confounders in RNA splicing analysis

RNA-Seq samples from mouse aorta and cerebellum were simulated using BEERS while injecting G% of the genes in half the samples with a batch effect of C% expression change of the main isoform (see main text). a Cumulative distribution of the difference (|dPSI|) from simulated ground truth after batch signal injection (G = 20%, C = 60%) either before MOCCASIN (blue) or after correction with increasing numbers of samples for each of the four batch/condition combinations: 1 × 4 (4 total, purple), 2 × 4 (8 total, brown), 3 × 4 (12 total, yellow), and 4 × 4 (16 total, orange). All plots are derived from the same representative sample (SRR1158528) to maintain a fixed base for comparison, with similar plots observed for other perturbed samples (data not shown). Total number of LSVs: 21566. b, c The number of LSVs (Y-axis) detected as differential (|E(dPSI)| > 0.2) for the batch 1 (N = 4) versus batch 2 (N = 4) signal (b, left) and the aorta (N = 4) versus cerebellum (N = 4) signal (c, right) across a range of increasingly significant p-values (X-axis, Student’s t test, −log10 scale). The number of samples used is 4 per batch/tissue combination (the same as for the orange line in a). The green points (“Ground Truth”) are from the simulated data with no batch signal injection and the blue points (“Before MOCCASIN”) are from the same data after batch signal injection (G = 20%, C = 60%). Both blue and green points serve as reference points for MOCCASIN correction of the batch signal. Orange and gray represent, respectively, the results after MOCCASIN correction when the batches are known or unknown. d, e Assessing false positive rate (FPR) for the batch signal (d, left) and false discovery rate (FDR) for the tissue signal (e, right) for a range of G values (2, 5, 20%) and C effect sizes (2, 10, 60%). The number of samples is the same as in (b, c). Here, positive events were defined as those changing by at least 20% with high confidence by MAJIQ (P(|dPSI| > 0.2) > 0.95).
Under these definitions, small effect sizes (C = 2, 10%) represent perturbations that are not expected to affect the positive event set much. f Heatmaps of E(PSI) from simulated data without batch effect (ground truth, left), with simulated batch effect (G = 20%, C = 60%) without correction (middle), and after applying MOCCASIN with 1 known confounding factor (right). Each column is a sample, and each row is an LSV (N = 4941). The colored bars above the samples denote the sample’s tissue (8 aorta samples in purple and 8 cerebellum samples in green) and batch (8 batch 1 samples in red and 8 batch 2 samples in blue).
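
As an illustration of the positive-event definition used in panels d and e, the following is a minimal sketch, not the authors' code and not the MAJIQ API. It assumes that a per-LSV posterior probability P(|dPSI| > 0.2) is already available as a plain array, and that FPR and FDR follow their standard textbook definitions. The function names and the toy values are hypothetical.

```python
import numpy as np

def call_positive(prob_change, confidence=0.95):
    """Flag LSVs whose P(|dPSI| > 0.2) exceeds the confidence cutoff.

    prob_change is assumed to hold one posterior probability per LSV.
    """
    return np.asarray(prob_change, dtype=float) > confidence

def false_discovery_rate(called, truth):
    """Fraction of called LSVs that are not true positives (FP / called)."""
    called = np.asarray(called, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    n_called = called.sum()
    if n_called == 0:
        return 0.0  # no calls, so no false discoveries
    return float((called & ~truth).sum() / n_called)

def false_positive_rate(called, truth):
    """Fraction of true-negative LSVs that were called (FP / negatives)."""
    called = np.asarray(called, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    n_negative = (~truth).sum()
    if n_negative == 0:
        return 0.0  # no negatives, so the rate is undefined; report 0
    return float((called & ~truth).sum() / n_negative)

# Toy values (hypothetical): five LSVs, two of which truly change.
prob = [0.99, 0.97, 0.50, 0.96, 0.10]
truth = [True, True, False, False, False]

called = call_positive(prob)
fdr = false_discovery_rate(called, truth)
fpr = false_positive_rate(called, truth)
print(f"called={called.tolist()} FDR={fdr:.3f} FPR={fpr:.3f}")
```

In this toy case three LSVs pass the 0.95 cutoff and one of them is not a true change, so one of three calls is a false discovery and one of three true negatives is a false positive.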