Fig. 7
From: A general and flexible method for signal extraction from single-cell RNA-seq data

Between-sample distances and silhouette widths on simulated data. a Boxplots of correlations between the between-sample distances computed from the true and from the estimated low-dimensional representations of the data, for simulations based on the V1 data set. b Same as a for simulations based on the S1/CA1 data set. c Boxplots of silhouette widths for true clusters for simulations based on the V1 data set. d Same as c for simulations based on the S1/CA1 data set. For a–d, all data sets were simulated from our ZINB-WaVE model with n = 1000 cells, J = 1000 genes, “harder” clustering (b_2 = 5), K = 2 unknown factors, zero fraction of about 80%, X = 1_n, cell-level intercept (V = 1_J), and genewise dispersion. Each boxplot is based on n values, one per sample, each defined as the average of the correlations (a, b) or silhouette widths (c, d) over B = 10 simulations. See Supplementary Fig. 27 for the same scenario but with n = 10,000 cells and Supplementary Fig. 28 for additional scenarios. e–g Average silhouette widths (over n samples and B = 10 simulations) for true clusters vs. zero fraction, for n ∈ {100; 1000; 10,000} cells, for data sets simulated from the Lun & Marioni [42] model, with C = 3 clusters and an equal number of cells per cluster. Although ZINB-WaVE was relatively robust to the sample size n and the zero fraction, the performance of PCA and ZIFA decreased as the zero fraction increased. Between-sample distance matrices and silhouette widths were based on W for ZINB-WaVE, the first two principal components for PCA, and the first two latent variables for ZIFA. ZINB-WaVE was applied with X = 1_n, V = 1_J, genewise dispersion, and K ∈ {1, 2, 3, 4} (only K = 2 is shown in e–g). For PCA and ZIFA, different normalization methods were used. Colors correspond to the different methods.
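
The two per-sample quantities summarized in the boxplots can be illustrated with a minimal sketch. This is not the authors' code: it assumes Euclidean distances, Pearson correlation, and one plausible reading of the caption in which each sample's value is the correlation between its distances to all other samples in the true and in the estimated representation. The function names (`pairwise_distances`, `per_sample_distance_correlation`, `silhouette_widths`) are hypothetical and introduced only for illustration.

```python
import numpy as np

def pairwise_distances(W):
    """Euclidean distance matrix for the rows of W (n samples x K factors)."""
    diff = W[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def per_sample_distance_correlation(W_true, W_est):
    """For each sample i, Pearson correlation between its distances to the
    other samples under the true and the estimated representation.
    (Assumed definition; the caption does not spell out the formula.)"""
    D_true = pairwise_distances(W_true)
    D_est = pairwise_distances(W_est)
    n = D_true.shape[0]
    out = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i  # drop the zero self-distance
        out[i] = np.corrcoef(D_true[i, others], D_est[i, others])[0, 1]
    return out

def silhouette_widths(W, labels):
    """Standard silhouette width s_i = (b_i - a_i) / max(a_i, b_i), where a_i
    is the mean distance to the sample's own cluster and b_i the mean distance
    to the nearest other cluster. Assumes at least two clusters."""
    D = pairwise_distances(W)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: width 0 by convention
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

# Toy data (hypothetical): C = 3 clusters in K = 2 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 30)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
W_true = centers[labels] + rng.normal(scale=0.5, size=(90, 2))
W_est = W_true + rng.normal(scale=0.2, size=W_true.shape)  # noisy "estimate"

corr_per_sample = per_sample_distance_correlation(W_true, W_est)
sil_per_sample = silhouette_widths(W_est, labels)
```

In the figure, such per-sample values would additionally be averaged over the B = 10 simulated data sets before being drawn as a boxplot; the sketch covers a single data set only.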