Fig. 2: Performance assessment of different normalizations on the TCGA READ RNA-seq data.
From: Removing unwanted variation from large-scale RNA sequencing data with PRPS

a, Top row: scatter plots of the first two PCs for raw counts, FPKM, FPKM.UQ and RUV-III normalized data, colored by key time intervals (2010 versus 2011–2014). Bottom row: same as the top row, colored by the CMS. The CMSs were obtained for each dataset separately. b, Top: a plot showing the R² of linear regression between library size and up to the first five PCs (taken cumulatively). Bottom: violin plots of Spearman correlation coefficients between the gene expression levels and library size for individual data. c, Top: the frequency of P < 0.05 obtained from DE analysis between samples with low and high library size. Bottom: a scatter plot showing silhouette coefficients and ARI for mixing samples from the two key time intervals. d, Top: a plot showing the vector correlation coefficient between plates and the first five PCs within each key time interval. Bottom: box plots of log₂ F-statistics obtained from ANOVA within each key time interval for gene expression with plate as a factor. e, Top: a plot showing the vector correlation coefficient between CMS subtypes and up to the first five PCs. Bottom: a scatter plot showing silhouette coefficients and ARI for measuring the separation of CMS subtypes.
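
The cumulative R² metric in panel b can be illustrated with a minimal sketch. This is not the authors' code: the function name `cumulative_pc_r2`, the synthetic data, and the choice to regress library size on the first k PC scores are all assumptions made for illustration. It computes PCA by SVD and reports R² for k = 1 to 5 PCs; values near 1 would indicate that library size still dominates the leading PCs.

```python
import numpy as np

def cumulative_pc_r2(expr, covariate, n_pcs=5):
    """R^2 of regressing `covariate` on the first k PCs, for k = 1..n_pcs.

    expr: samples x genes matrix (e.g. log-transformed expression).
    covariate: one value per sample (e.g. log library size).
    Hypothetical helper for illustration only.
    """
    centered = expr - expr.mean(axis=0)
    # PCA via SVD: columns of u * s are the sample scores on each PC.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_pcs] * s[:n_pcs]
    # Scores have zero mean, so centering y removes the need for an intercept.
    y = covariate - covariate.mean()
    total_ss = float(y @ y)
    r2 = []
    for k in range(1, n_pcs + 1):
        x = scores[:, :k]
        coef, *_ = np.linalg.lstsq(x, y, rcond=None)
        resid = y - x @ coef
        r2.append(1.0 - float(resid @ resid) / total_ss)
    return np.array(r2)

# Synthetic data (assumption): every gene carries a library-size signal plus noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
log_lib_size = rng.normal(size=n_samples)
expr = np.outer(log_lib_size, rng.normal(size=n_genes)) \
    + rng.normal(size=(n_samples, n_genes))

r2 = cumulative_pc_r2(expr, log_lib_size)
print(np.round(r2, 3))
```

Because the models are nested, the R² values can only stay level or increase as more PCs are added, which is why the caption describes the PCs as taken cumulatively.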