Extended Data Fig. 4: Defining global inter-GSC cluster relationships and evaluation of batch correction methods.

a, UMAP projection of 69,393 GSC cells from 29 patients reveals patient-specific clustering patterns (left panel, cells colored by patient). Unbiased clustering reveals 61 transcriptional clusters (right panel, cells colored by transcriptional cluster). GSCs derived from different regions of the same tumor are underlined with red (G945-I,J,K) and black (G946-J,K) bars. b, Transcriptional clusters from the same sample and patient are more similar to each other than to clusters from other samples. Top, dendrogram of the average gene expression profiles of the transcriptional clusters defined in Extended Data Fig. 4a, based on distance (1 - Spearman correlation). Bottom, sample composition of the transcriptional clusters; vertical bars are colored by sample. Labels at the bottom show the sample identifier and proportion of sample for up to the top three samples per cluster. c, UMAP visualizations of global GSC clustering results with Conos batch correction (top row), Liger batch correction (middle row) and fastMNN batch correction (bottom row). Cells are colored by sample ID (left column) and transcriptional cluster (right column) (n = 69,393 cells from 29 GSC cultures). d, Proportion of cells (y-axis) belonging to a given sample in each transcriptional cluster (x-axis) for the original and batch-corrected datasets. e, Number of transcriptional clusters in the original clustering pipeline versus after batch correction. f, Box plots of the number of samples with >10 cells per transcriptional cluster in the original and batch-corrected clustering results (Original = 61 clusters; Conos = 12 clusters; Liger = 78 clusters; fastMNN = 39 clusters). Box plots show the median and the first and third quartiles of the distribution; whiskers represent either 1.5 times the interquartile range or the most extreme value. Outliers are displayed as circles.
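
The two quantities behind panels b and f can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`average_ranks`, `spearman_distance`, `samples_per_cluster`) and the input layout are hypothetical, and the legend does not state the linkage method used to build the dendrogram, so only the pairwise distance is shown.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_distance(x, y):
    """Return 1 - Spearman correlation between two average expression profiles.

    Assumes both profiles cover the same genes in the same order and that
    neither profile is constant (a constant profile has no defined correlation).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return 1.0 - cov / (sx * sy)


def samples_per_cluster(cells, min_cells=10):
    """Count, per cluster, the samples contributing more than min_cells cells.

    cells is an iterable of (sample_id, cluster_id) pairs, one pair per cell.
    """
    pair_counts = Counter(cells)
    result = {}
    for (sample, cluster), n in pair_counts.items():
        result.setdefault(cluster, 0)
        if n > min_cells:
            result[cluster] += 1
    return result
```

Under these assumptions, a cluster dominated by a single sample yields a count of 1 from `samples_per_cluster`, while a well-mixed cluster yields a count close to the number of samples; the per-cluster counts are the values a box plot such as the one in panel f would summarize.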