Fig. 3: Application to large real-world datasets.
From: Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq

a A heatmap of the top 5 marker genes per cluster is shown for the primary lineages from the full senescent Tabula Muris dataset33, with the last cluster representing a mixture of cell types from the endocrine pancreas. b When subclustered with anti-correlated feature selection, mixed-cell-type droplets (x) as well as classically described leukocyte, α, δ, β, and acinar populations were discovered. Subclustering β cells discovered mixed-lineage droplets with δ and leukocyte cells as well as the rare PP-cell population, but additional subclustering of PP-cells was prevented by anti-correlation-based feature selection. c Selected features for clustering 1 million PBMCs. d Subject-level reverse Percent Maximum Difference (rPMD) shows that Type-1-Diabetes (T1D) subjects are more similar to each other, while control PBMCs are more diverse in cluster composition. e A spring embedding of a subset of cells from each cluster, color-coded by donor, with sub-plots for T1D and control subjects, showing large-scale uniformity in T1D compared to the heterogeneous control samples. Note that this embedding is for display purposes only: it was not used in analysis and does not represent cell-cell distances, but rather displays the graph used for clustering. f A heatmap of PMD standardized residuals, which indicate how significantly each subject’s relative cluster abundances differ from the null expectation of no difference between subjects. A matching bar chart shows the T-statistic for each cluster’s differential over- or under-abundance shown in the heatmap, comparing T1D to controls. Bars are color-coded by significance (P < 0.05 after Benjamini-Hochberg correction). Exact p-values are available in the Source Data file. g The spring embedding of the kNN graph is color-coded by significance of differential abundance for each cluster, and additionally color-coded by T1D/control status, then again subset to only the significant clusters.
Source data are provided as a Source Data file.
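
The significance threshold in panel f (P < 0.05 after Benjamini-Hochberg) can be illustrated with a minimal sketch of the standard Benjamini-Hochberg adjustment. This is not the authors' code: the function name `benjamini_hochberg` and the per-cluster p-values below are hypothetical and chosen only for illustration; the real values are in the Source Data file.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    scaled = p[order] * n / np.arange(1, n + 1)  # p_(i) * n / i
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and all adjusted values at higher ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Hypothetical per-cluster p-values (illustrative only, not from the paper).
cluster_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(cluster_p)
significant = adjusted < 0.05  # clusters that would be flagged in panel f
```

In this made-up example the third and fourth clusters have raw p-values below 0.05 but are no longer flagged after adjustment, which is why the caption specifies that the threshold is applied after correction.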