Extended Data Fig. 2: MetaCell identification and batch correction.

Related to Fig. 1. (a) Workflow of MetaCell identification and integration. (b) Box plots showing the distribution of gene coverage (left) and within-MetaCell variation (right) for MetaCells comprising varying numbers of cells, across five datasets. The datasets, listed from top to bottom, comprise the following numbers of patients and cells: 6 patients with 10,359 cells, 2 patients with 4,375 cells, 14 patients with 33,043 cells, 3 patients with 28,678 cells, and 2 patients with 6,035 cells. In each box plot, the bottom and top of the box represent the first quartile (Q1) and third quartile (Q3), respectively; the height of the box represents the interquartile range (IQR); the horizontal line inside the box indicates the median; and the whiskers extend to Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. (c) The same analysis as in (b), applied to NSCLC datasets. The datasets, listed from top to bottom, comprise the following numbers of patients and cells: 2 patients with 3,658 cells, 3 patients with 12,193 cells, 1 patient with 1,108 cells, 4 patients with 11,453 cells, and 5 patients with 40,218 cells. Box plot elements are as defined in (b). (d) Radar plots showing metrics for MetaCells comprising different numbers of cells, including gene coverage, within-MetaCell variation, and the Local Inverse Simpson’s Index (LISI) score, assessed for the BRCA_GSE148673 and NSCLC_GSE117570 datasets. (e) Box plots showing the distribution of LISI and entropy values calculated from 736 patients across four scenarios: single-cell and MetaCell expression profiles integrated directly, and single-cell and MetaCell expression profiles integrated using canonical correlation analysis (CCA). Significance was assessed using a two-sided Wilcoxon test and adjusted using the Benjamini-Hochberg (BH) method.
Box plot elements are as defined in (b). (f) Box plots showing the distribution of the adjusted Rand index (ARI) and average silhouette width (ASW) calculated from 736 patients across four scenarios: single-cell and MetaCell expression profiles integrated directly, and single-cell and MetaCell expression profiles integrated using CCA. Significance was assessed using a two-sided Wilcoxon test and adjusted using the BH method. Box plot elements are as defined in (b). (g) Pie charts showing the proportions of MetaCells by source (left) and treatment condition (right), with MetaCells labeled accordingly. (h) UMAP visualization of all MetaCells, colored by cancer type (left) and cell type (right).
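For readers who want to reproduce the box plot conventions and the multiple-testing adjustment described in this legend, the following is a minimal sketch and not the authors' code. The function names `box_stats` and `bh_adjust` are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption, since the legend does not state which software or quartile convention was used.

```python
from statistics import quantiles


def box_stats(values):
    """Return Q1, median, Q3, IQR and the whisker positions for one box."""
    # Assumption: linear interpolation between data points ("inclusive").
    # Plotting tools differ in their quartile convention.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": q1 - 1.5 * iqr,  # Q1 - 1.5 x IQR
        "upper_whisker": q3 + 1.5 * iqr,  # Q3 + 1.5 x IQR
    }


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9]))
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

The whisker positions here follow the legend literally (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). The p-values passed to `bh_adjust` in the usage lines are illustrative only and are not taken from the study.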