Fig. 1: Antibody-derived protein UMI count data noise source assessment.
From: Normalizing and denoising protein expression data from droplet-based single cell profiling

a 1 and 2: Experimental setup and potential noise sources in CITE-seq data. 3: Protein-specific noise: if ambient antibody encapsulated in droplets constitutes a major source of protein-specific noise, protein levels in empty droplets should be highly correlated with those in unstained control cells (top); if control cells contain information on noise not captured by empty drops, the correlation should be weak. 4: Cell-specific noise, evaluated through the correlation between the background protein population mean and isotype controls across single cells. Created with BioRender.com. b Average protein log10(count + 1) of unstained control cells spiked into the stained cell pool prior to droplet generation (y-axis) versus that of droplets without a cell (x-axis). The Pearson correlation coefficient and p value (two-sided) are shown. c Density histograms of the expression of lineage-defining proteins within major subsets in stained cells (black) and unstained controls (red), normalized together using dsb step I (ambient correction and rescaling based on levels in empty droplets). d A two-component Gaussian mixture model was fitted to the protein counts within each single cell; the distributions of the component means from all single-cell fits (blue = “negative” population; red = “positive” population) are shown, with the protein distribution from a randomly selected cell shown in the inset. e Comparison of Gaussian mixture models fitted with k = 1 to k = 6 subpopulations to dsb-normalized protein values for n = 28,229 cells from batch 1 after dsb step I (ambient correction) but prior to step II, versus the model-fit Bayesian Information Criterion (BIC, using the mclust R package definition, in which larger values correspond to a better fit) from the resulting 169,374 models. Boxplots show the median with hinges at the 25th and 75th percentiles; whiskers extend ±1.5 times the interquartile range. 
Two-component (k = 2) Gaussian mixtures have the best fit in more than 80% of cells (orange, right inset bar plot). f Pearson correlation coefficients among the isotype controls and the background component mean inferred by the Gaussian mixture model (µ1, fitted per cell as in d); all corresponding p values (two-sided) are <2e−16. g Scatter density plot of µ1, the mean of each cell’s negative subpopulation from the per-cell Gaussian mixture model (blue in d), versus the mean of the four isotype controls across single cells. The Pearson correlation coefficient is shown (two-sided p value < 2e−16). h The dsb technical component calculated using a two-component (x-axis) versus a three-component (y-axis) mixture model to define the µ1 parameter; the Pearson correlation coefficient is shown (two-sided p value < 2e−16).
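The per-cell mixture fitting described in panels d and e can be illustrated with a minimal sketch. It is not the authors' code: the caption names the mclust R package, whereas this stand-in is a plain-Python EM fit of a one-dimensional Gaussian mixture with unequal variances. The function names, the simulated cell (60 background and 25 positive protein values), and the standard-deviation floor are all assumptions made for illustration. The BIC follows the larger-is-better convention stated in the caption, 2 × log-likelihood − (number of parameters) × log(n).

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(xs, means, sds, weights):
    total = 0.0
    for x in xs:
        dens = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total


def fit_gmm_1d(values, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (means, sds, weights, loglik) at the final parameter values.
    """
    xs = sorted(values)
    n = len(xs)
    # Initialise means at evenly spaced quantiles with a shared starting sd.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in xs) / n) or 1.0
    sds = [spread] * k
    weights = [1.0 / k] * k
    loglik = log_likelihood(xs, means, sds, weights)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # assumed floor to avoid collapse
        new_loglik = log_likelihood(xs, means, sds, weights)
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    return means, sds, weights, loglik


def mclust_style_bic(loglik, k, n):
    """BIC in the larger-is-better convention: 2*loglik - npar*log(n).

    npar = k means + k variances + (k - 1) free mixing weights.
    """
    return 2.0 * loglik - (3 * k - 1) * math.log(n)


def simulate_cell(seed=0):
    """Hypothetical protein values for one cell: background plus positive proteins."""
    rng = random.Random(seed)
    background = [rng.gauss(1.0, 0.5) for _ in range(60)]
    positive = [rng.gauss(6.0, 1.0) for _ in range(25)]
    return background + positive


if __name__ == "__main__":
    cell = simulate_cell()
    for k in range(1, 7):
        means, _, _, ll = fit_gmm_1d(cell, k)
        print(k, round(mclust_style_bic(ll, k, len(cell)), 1))
    means, _, _, _ = fit_gmm_1d(cell, 2)
    print("mu1 (background mean):", round(min(means), 2))
```

In this sketch µ1 is taken as the smaller of the two fitted component means, matching the caption's description of the negative subpopulation. Repeating the fit for k = 1 to 6 across 28,229 cells is what yields the 169,374 models mentioned in panel e (28,229 × 6).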