Fig. 1: Anti-correlation algorithm premise and passage of the null-dataset problem. | Nature Communications

From: Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq

a The logical schematic behind anti-correlation-based feature selection. b In a scatter plot of marker A expression against marker B expression, cells of type A and B form an L-shaped anti-correlation pattern, while cells of type C express low levels of both markers. c This anti-correlation pattern disappears when a single population of cells is examined. d The anti-correlation pattern of marker genes appears in an example dataset (ref. 3), where high expression of AMY2A in acinar cells forms an anti-correlation pattern with SST in delta cells of the pancreas. e The anti-correlation pattern between AMY2A and SST disappears when the data are subset to delta cells only. f The anti-correlation pattern is also present in lineage-marking genes, as shown by the pattern of AMY2A and NEUROD1, which labels all endocrine cells of the pancreas. g The anti-correlation-based feature selection algorithm first calculates a null background of Spearman correlations from bootstrap-shuffled gene-gene pairs. h Next, the cutoff value that most closely matches the desired false positive rate (FPR) is determined. Displayed is a histogram of the bootstrap-shuffled null background of Spearman correlations less than zero. i Lastly, genes that show more significant negative correlations (x-axis) than expected by chance (black line), given the gene’s total number of negative correlations (y-axis), are selected, i.e., those to the right of the cutoff line. These are then used to calculate the false discovery rate (FDR) for each gene (see Methods for details). Heatmaps of selected features, and the total number of subclusters for each method of feature selection paired with AP clustering, when algorithms were allowed to subdivide iteratively for homeostatic cell line scRNAseq: (j) NIH3T3, (k) HEK293T. l Boxplots indicating the total number of clusters identified by each method of feature selection (box colors) and clustering (noted in panels). Boxplot whiskers extend to the minimum and maximum, box bounds span the 25th to 75th percentiles, and the center line denotes the median (n = 20). Source data are provided as a Source Data file.
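
The steps in panels g to i can be illustrated with a minimal Python sketch on simulated counts. This is not the authors' implementation: the function names (`spearman_matrix`, `select_anticorrelated`, `simulate`), the simulated Poisson data, the choice of shuffling each gene independently across cells to build the null, the definition of the FPR as a quantile of the negative null correlations, and the simple "more than expected by chance" rule are all assumptions made for illustration. The per-gene FDR calculation described in the Methods is omitted.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix(expr):
    """Gene-gene Spearman correlations for a cells x genes matrix."""
    ranks = rankdata(expr, axis=0)  # rank each gene across cells (ties averaged)
    return np.corrcoef(ranks, rowvar=False)


def select_anticorrelated(expr, fpr=0.001, n_shuffles=20, seed=0):
    """Hypothetical sketch of panels g-i; returns (selected gene indices, cutoff)."""
    rng = np.random.default_rng(seed)

    # (g) Null background: shuffle every gene independently across cells,
    # then collect the Spearman correlations of the shuffled gene-gene pairs.
    null = []
    for _ in range(n_shuffles):
        shuffled = rng.permuted(expr, axis=0)
        rho_null = spearman_matrix(shuffled)
        null.append(rho_null[np.triu_indices_from(rho_null, k=1)])
    null = np.concatenate(null)
    null_neg = null[null < 0]

    # (h) Cutoff: the negative correlation below which a fraction `fpr`
    # of the negative null correlations falls (assumed definition of FPR).
    cutoff = np.quantile(null_neg, fpr)

    # (i) Per gene: total negative correlations vs. those beyond the cutoff.
    rho = spearman_matrix(expr)
    np.fill_diagonal(rho, 0.0)
    total_neg = (rho < 0).sum(axis=1)
    n_sig = (rho < cutoff).sum(axis=1)
    expected_by_chance = fpr * total_neg
    selected = np.flatnonzero(n_sig > expected_by_chance)
    return selected, cutoff


def simulate(n_per_type=100, n_markers=5, n_noise=30, two_types=True, seed=1):
    """Simulated counts: genes 0-4 mark type A, genes 5-9 mark type B."""
    rng = np.random.default_rng(seed)
    n_cells = 2 * n_per_type
    n_genes = 2 * n_markers + n_noise
    expr = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)
    if two_types:
        is_a = np.arange(n_cells) < n_per_type
        lam_a = np.where(is_a, 20.0, 0.5)[:, None]  # high in A, low in B
        lam_b = np.where(is_a, 0.5, 20.0)[:, None]  # low in A, high in B
        expr[:, :n_markers] = rng.poisson(lam_a, size=(n_cells, n_markers))
        expr[:, n_markers:2 * n_markers] = rng.poisson(lam_b, size=(n_cells, n_markers))
    return expr


mixed = simulate(two_types=True)    # two cell types, as in panel b
single = simulate(two_types=False)  # one homogeneous population, as in panel c

sel_mixed, cut_mixed = select_anticorrelated(mixed)
sel_single, cut_single = select_anticorrelated(single)

print("cutoff (mixed):", round(float(cut_mixed), 3))
print("selected in mixed population:", sel_mixed.tolist())
print("selected in single population:", sel_single.tolist())
```

In this toy setting the simulated marker genes are strongly anti-correlated across the two cell types and so pass the cutoff, whereas the homogeneous population offers little for the rule to select. This mirrors the premise of panels b and c, under the simplifying assumptions stated above.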