Fig. 1: SCOOP improves cellular resolution and accuracy of COO predictions.
From: Learning the cellular origins across cancers using single-cell chromatin landscapes

a Left: Illustration of how SCOOP uses single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) data to predict the cell-of-origin (COO) (e.g., alveolar type 2, or AT2, cells) associated with a given cancer’s mutation profile (e.g., lung adenocarcinoma, or LUAD). SCOOP takes as input a binned whole-genome sequencing (WGS) profile of cancer single-nucleotide variants (SNVs) and similarly binned scATAC-seq profiles from various normal cell subsets, where each cell subset is followed by a dataset indicator for that cell subset: D1 for ref. 28, D2 for ref. 29, D5 for ref. 34, D9 for ref. 45. The SNV and scATAC-seq profiles (features) are passed into a machine learning model, XGBoost, which predicts the COO through a process of backward feature selection (Methods). Right: Box plots of the feature importance distribution (100 SCOOP runs) of the top 5 COO predictions for LUAD (n = 37; predicted COO in red) amongst lung cell subsets (Methods). Also displayed is the number of times a cell subset appeared in the top 5 features across 100 runs (n). One-sided Mann-Whitney test p-values are displayed. Tumor and cell illustrations created in BioRender. Tsankov, A. (2025) https://BioRender.com/qu5wvua.
b Test set variance explained (%) by the predicted COOs (red) for 8 cancer types studied in ref. 21 (Melanoma, n = 107; Hepatocellular carcinoma, n = 314; Colorectal adenocarcinoma, n = 52; Multiple myeloma, n = 23; Esophageal adenocarcinoma, n = 97; Glioblastoma, n = 39; Lung adenocarcinoma, n = 37; Lung squamous cell carcinoma, n = 47). Error bars show the standard error of the mean (SEM) across 100 SCOOP runs. One-sided Mann-Whitney test p-values are displayed. Each cell subset is followed by a dataset indicator for that cell subset: D1 for ref. 28, D2 for ref. 29, D3 for ref. 31, D4 for ref. 32, D5 for ref. 34.
c Box plots of the feature importance distribution (100 SCOOP runs) of the top 5 COO predictions (predicted COO in red, similar cell subsets in pink) amongst lung cell subsets for epithelioid pleural mesothelioma (PM, n = 44), lung squamous cell carcinoma (LUSC, n = 47), and small cell lung carcinoma (SCLC, n = 107). Also displayed is the number of times a cell subset appeared in the top 5 features across 100 runs (n). One-sided Mann-Whitney test p-values are displayed, where Bonferroni correction for multiple hypothesis testing was used.
d UMAP dimensionality reduction of individual lung cancer WGS sample binned mutation profiles (dots) colored by cancer type (adenocarcinoma, n = 37; epithelioid mesothelioma, n = 44; small cell lung cancer, n = 109; squamous cell carcinoma, n = 47).
e UMAP dimensionality reduction of individual SCLC WGS sample binned mutation profiles (dots) from refs. 36,48 (aSCLC from ref. 48, n = 11; aSCLC from ref. 36, n = 2; SCLC-A, n = 37; SCLC-N, n = 4; SCLC-P, n = 6; SCLC-Y, n = 1; Undefined, n = 57).
f Box plots of the feature importance distribution (100 SCOOP runs) of the top 5 COO predictions (predicted COO in red, similar cell subsets in pink) amongst lung cell subsets for atypical small cell lung cancer (aSCLC from ref. 48, n = 11; aSCLC from ref. 36, n = 2). Also displayed is the number of times a cell subset appeared in the top 5 features across 100 runs (n). One-sided Mann-Whitney test p-values are displayed, where Bonferroni correction for multiple hypothesis testing was used.
g Box plots of the feature importance distribution (100 SCOOP runs) of the top 5 COO predictions amongst lung cell subsets for SCLC-A (n = 37; predicted COO in red, similar cell subsets in pink). Also displayed is the number of times a cell subset appeared in the top 5 features across 100 runs (n). One-sided Mann-Whitney test p-values are displayed, where Bonferroni correction for multiple hypothesis testing was used.
h Percentage of cycling cells across lung epithelial cell types estimated using scRNA-seq data (Methods), where predicted COOs in our study are shown in red.
i SCOOP’s predicted COO for different lung cancers: AT2, mesothelial, and neuroendocrine cells for LUAD, epithelioid PM, and aSCLC, respectively, and basal cells for both LUSC and SCLC. Lung model created in BioRender. Tsankov, A. (2025) https://BioRender.com/2vhmu6l.
Cell type abbreviations are defined in Supplementary Data 3. Box plot vertical lines show 25th, 50th (median), and 75th percentiles, with horizontal whiskers extending to a maximum distance of 1.5 × interquartile range from the hinge. Data beyond the whisker ends are plotted individually.
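
The summary statistics named in this legend (SEM across runs, one-sided Mann-Whitney p-values with Bonferroni correction, and whiskers at 1.5 × interquartile range) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: all function names and numbers are hypothetical, the Mann-Whitney p-value uses a normal approximation without tie or continuity correction, and the quartile convention is an assumption that may differ from the plotting software used for the figure.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def whisker_bounds(values):
    """Whisker ends: most extreme data points within 1.5 x IQR of each hinge.

    Assumption: hinges are taken as inclusive-method quartiles.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = min(v for v in values if v >= q1 - 1.5 * iqr)
    high = max(v for v in values if v <= q3 + 1.5 * iqr)
    return low, high


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test (alternative: x tends to exceed y).

    Returns (U, p) with p from a normal approximation, no tie or
    continuity correction. Suitable only as an illustration.
    """
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p


def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical feature importances for two cell subsets over a few runs.
top_subset = [0.42, 0.45, 0.40, 0.47, 0.44, 0.43]
runner_up = [0.21, 0.25, 0.19, 0.23, 0.22, 0.24]

u_stat, p_raw = mann_whitney_greater(top_subset, runner_up)
p_adj = bonferroni([p_raw, 0.04, 0.5])[0]  # other p-values are placeholders
print("SEM:", sem(top_subset))
print("Whiskers:", whisker_bounds(top_subset))
print("U:", u_stat, "raw p:", p_raw, "Bonferroni p:", p_adj)
```

Points outside the returned whisker bounds correspond to the data that the legend says are plotted individually.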