Main

Tumor heterogeneity is a hallmark of isocitrate dehydrogenase (IDH) wildtype (WT) glioblastoma (GBM)1. Different GBM samples have distinct genetic and transcriptomic profiles (intertumor heterogeneity)2 and even a single GBM specimen contains diverse malignant and nonmalignant cells (intratumor heterogeneity)3,4. This diversity endows malignant GBM cells with distinct functional properties5 and is thought to enable them to evade therapies, leading to progression and relapse6. Nonmalignant cells in the tumor microenvironment (TME) establish an ecosystem with complex interactions that also plays critical roles in shaping GBM biology and treatment response5,7,8,9.

Previous studies used single-cell RNA sequencing (scRNA-seq) to uncover gene expression heterogeneity within human gliomas3,4,10,11,12,13,14,15,16,17. These studies revealed that GBM malignant cells exhibit a variety of cellular states, including those resembling neurodevelopmental cell types. The frequency of these states in a particular tumor is influenced by genomic alterations and the TME7. However, previous studies had several limitations: (1) they were based on limited number of samples and cells and could still miss rare cellular states; (2) they relied on enzymatic tumor dissociation, which may deplete certain cell types and states, as evident by the scarcity of detected astrocytes, neurons and endothelial cells; (2) they addressed either heterogeneity across tumors (that is, tumor subtypes) or within tumors (that is, cell states) but did not develop an integrated model that explains tumor heterogeneity by a combination of tumor-level and cell-level features; (4) they lacked in-depth genetic characterization of the tumors; (5) they had limited clinical metadata; and (6) they were published by different groups, and leveraged diverse pipelines, analytics and classification schemes4,10,12,15,16.

To address these challenges, we created the GBM Cellular Analysis of Resistance and Evolution (CARE) consortium and profiled 121 IDH WT GBM samples using single-nucleus RNA sequencing (snRNA-seq) and bulk DNA sequencing, thereby establishing a multi-omic dataset of unprecedented size while eliminating the confounding effects of enzymatic tumor dissociation. This dataset is richly annotated with clinical data and contains primary and recurrent matched pairs. In this article, we describe the identification of new GBM cellular states, for both the malignant and TME compartments. Furthermore, we describe new analyses that distinguish the contributions of intratumor and intertumor heterogeneity, toward a more complete understanding of the multilayered transcriptional architecture of GBM. The main conclusions of this article are independent of the timing of sampling (primary or recurrent) and can be reached by only analyzing primary GBM specimens, yet we include the recurrent samples for more robust statistical analysis and to provide a broad view of the range of GBM tumors. In the accompanying study by Spitzer et al.18, we focus on characterizing the longitudinal evolution of GBM after standard-of-care therapy.

Results

Our groups jointly generated an snRNA-seq dataset of 121 tumor samples from 59 patients spanning initial disease and recurrence (Fig. 1a; see Supplementary Table 1 for detailed clinical information). All tumors were diagnosed as ‘Glioblastoma, IDH WT’ based on the World Health Organization (WHO) 2021 classification, which was further confirmed using bulk DNA sequencing. We profiled all 121 samples with the droplet-based 10x Genomics platform, and also profiled a large subset of tumors (54 of 121) using plate-based full-length Smart-seq2 (ref. 19). Most samples (116 of 121) were also profiled using bulk whole-exome sequencing (WES) or whole-genome sequencing (WGS), enabling the integration of genetic information with single-nucleus transcriptomics (Fig. 1b and Supplementary Table 1). The frequency of common genetic alterations was consistent with previous studies of GBM20.

Fig. 1: Study workflow and dataset overview.
Fig. 1: Study workflow and dataset overview.
Full size image

a, Scheme describing the workflow of our study. b, The proportion (%) of patients (n = 46) in the dataset in which a genetic aberration was detected is shown. Arm-level refers to whole-chromosome arm amplification or deletion. CNA refers to gene-level amplification or deletion. Single-nucleotide variant (SNV) refers to single-nucleotide mutations. c, Uniform manifold approximation and projection (UMAP) for dimension reduction based on gene expression of the 429,305 cells in our cohort colored according to the inferred cell type. d, Pie chart demonstrating the proportion (%) of each cell type in this dataset. amp, amplification; del, deletion.

In the 10x snRNA-seq dataset, 429,305 of 752,458 cells passed our stringent quality control (Methods) with 2,318 genes detected per cell on average (Extended Data Fig. 1a). Cells were classified as malignant (246,408 cells) or nonmalignant (182,897 cells) by inferring chromosomal copy number aberrations (CNAs) from gene expression data, as described previously13,21, but with additional steps to improve classification accuracy (Extended Data Fig. 1b–d). Clustering of all cells highlighted the diversity of malignant cells, including many small patient-specific clusters, while nonmalignant cells were intermixed across patients without apparent batch effects (Fig. 1c and Extended Data Fig. 2a,b). Nonmalignant cells were assigned to cell types using clustering and well-established signatures (Extended Data Fig. 2c)22,23,24,25,26. Notably, we detected neurons and astrocytes (Fig. 1c,d) that were typically not captured in scRNA-seq studies with enzymatic tissue dissociation. In the Smart-seq2 dataset, 10,062 of 11,136 cells passed our quality control and were similarly classified to malignant and nonmalignant cell types (Extended Data Fig. 2d). Overall, the fractions of cell types were similar between 10x Genomics and Smart-seq2 datasets (Extended Data Fig. 2e–g).

Revisiting malignant cellular states

We previously defined four principal cellular states in GBM based on 28 samples (7,930 cells) obtained at initial diagnosis4: neural-progenitor-like cells (NPC-like); oligodendrocyte-progenitor-like cells (OPC-like); astrocyte-like cells (AC-like); and mesenchymal-like cells (MES-like). While this robust classification scheme has been broadly adopted and validated by the community5,27,28,29, we sought to extend the atlas of GBM malignant cellular states by leveraging our much larger snRNA-seq dataset. First, nonnegative matrix factorization (NMF) was used to derive gene expression programs that vary within each tumor30. After removing programs suspected of low technical quality (for example, enrichment of ribosomal proteins or pseudogenes; Methods), we found ten groups of similar programs and defined a consensus meta-program (MP) for each of them, representing the malignant cellular states in our dataset (Fig. 2a, Extended Data Fig. 3a and Supplementary Table 2).

Fig. 2: Revisiting malignant state heterogeneity.
Fig. 2: Revisiting malignant state heterogeneity.
Full size image

a, Jaccard similarity indices of the robust NMF programs (n = 707) based on their top 50 genes. Programs are ordered according to clustering and grouped into MPs (Methods). MP10 (stress1) and MP15 (stress2) were coalesced into a single stress state. MP9 (excitatory neuron (ExN)) and MP14 (neurogranin neuron) were coalesced into a single NEU-like state. The ‘cycling’ MP is separated to emphasize it as a feature that cells can have on top of their neurodevelopmental identity (and not as an independent state; Extended Data Fig. 3a). b, Gene expression scores of each cell for the different cellular states and the cell cycle. Cells are aggregated according to the state to which they were classified and are ordered within each group according to their cell cycle score. ‘Unresolved’ represents cells that could not be confidently classified to a particular state. c, Relative expression (centered log2) of the three new MPs discovered in this work. Each row represents a gene and each column a cell. d, Association between the malignant MPs (rows) and gene sets reflecting normal brain development and adult brain cell types (columns). Each cell in the heatmap represents the Pearson correlation coefficient between the gene expression scores of an MP and a normal brain gene set across the malignant cells in the dataset. e, Association between pathway-based (rows) and gene-based (columns) MPs. Each cell in the heatmap represents the Pearson correlation coefficient between the gene expression scores of the pathway-based and gene-based MPs. The colored squares represent the seven groups of PMPs and their associated MPs. f, Demonstration of the hybrid state concept. The rows represent genes, which are aggregated according to the MP to which they belong. The columns represent cells, which are aggregated according to their singular (AC, GPC) or hybrid (AC-GPC) state. The heatmap of genes’ centered log2 expression shows that hybrids express two MPs to the same extent, whereas singulars highly express a single MP. g, GBM cellular hierarchy model stemming from the high-frequency hybrid pairs. Each vertex represents a state; vertices are connected when they represent a high-frequency hybrid state. h, Two-dimensional representation of cellular states. Each quadrant reflects a cellular state and each dot represents a malignant cell; exact cell positions reflect their relative scores for the MPs, and their colors reflect the proportion of cells around them classified to the GPC state.

Overall, seven of the ten MPs largely recapitulate the patterns from ref. 4. One of these MPs represented the cell cycle and another five represented the previously defined cellular states: AC-like; OPC-like; NPC-like; MES-like; and hypoxia (previously denoted as MES2)4 (Fig. 2b). Another MP (stress) included genes associated with heat shock and unfolded protein response, which were previously included in the MES-like MP and associated with disordered DNA methylation12. Thus, high-resolution analysis divides MES-like states into three MPs that we termed MES-like, hypoxia and stress.

Although these seven MPs produce highly correlated cell scores as those from ref. 4 (Extended Data Fig. 3b), it is important to note that the exact sets of genes that define the MPs here and in ref. 4 differ (Extended Data Fig. 3c), suggesting platform-dependent effects such that distinct genes are highlighted by scRNA-seq versus snRNA-seq (Extended Data Fig. 3d). We also compared these MPs to prior GBM cell-state classifications, further corroborating their identities (Extended Data Fig. 3e)31,32.

The first new MP contained NPC-like genes but was mostly associated with markers of more differentiated neurons (NRG1, NRG3, NRXN3, CNTNAP2) and synaptic activity (SYN3, SYT1; Fig. 2c) and was similar to the NEU (neuronal) pathway-based program reported previously15. Consistent with a differentiated NEU state, cycling cells were depleted in this program relative to NPC-like cells (Extended Data Fig. 4a). Comparison to the signatures of neural cell types in development and in adult human brains showed the highest similarity to L2–L3 excitatory neurons (P = 7.2−5, hypergeometric test; Fig. 2d); thus, we named this state NEU-like. This state is depleted in single-cell relative to single-nucleus data4, suggesting that it may be poorly captured by enzymatic tumor dissociation, possibly because of complex cellular morphology.

The second new MP included the tyrosine kinase genes EGFR and ALK, the transcription factor genes MEIS1, MEOX2, ETV1, ELOVL2 and KLHDC8, which were previously linked to glioma stemness33,34,35. Comparison to normal brain signatures showed the strongest association with a recently described signature of glial progenitor cells (GPCs)36; hence, we denoted it as GPC-like. Consistent with a progenitor state, the GPC-like MP was enriched with cycling cells (Extended Data Fig. 4a). Like the NEU-like state, GPC-like cells were depleted in single-cell relative to single-nucleus data4.

The third new MP mostly contained cilia genes, including 12 of 50 DNAH and CFAP family members; therefore, it was named cilia-like. This MP was specific to a small fraction of cells (~1.6%), underscoring the utility of our large cohort for identifying rare cellular states. Notably, all of those cells also expressed the AC-like MP, indicating that a subset of AC-like cells activate a cilia program. As expected, this MP was most enriched with external cilia signatures. Therefore, all MPs are enriched with corresponding cell-type-signatures from developing or adult human brains24,26,36 (Fig. 2d), supporting our annotations and suggesting that GBM states mirror developmental trajectories32.

Collectively, our MPs are reflective of cell identity (for example, GPC-like, AC-like, NEU-like) or cell activity (for example, response to hypoxia or stress). We also scored malignant cells for signatures of functional attributes5,37,38, which suggested that AC-like and MES-like cells had greater connectivity, while OPC-like, NPC-like and NEU-like cells had greater motility and invasion scores, while cilia-like and GPC-like cells were intermediate (Extended Data Fig. 4b).

The new MPs defined from 10x data were validated in Smart-seq2; they were still detected when analyzing only primary samples and were detected independently in specimens profiled by different labs and obtained from different medical centers (Extended Data Fig. 4c–f). Overall, our analysis expands our original malignant GBM MPs to ten, some of which represent more granular subsets of the states in ref. 4.

Consistent classification by pathway-based and gene-based analysis

As an orthogonal description of cellular states, we defined pathway-based MPs (PMPs) (Fig. 2e). We previously defined four PMPs, namely NEU, glycolytic/pluri-metabolic (GPM), mitochondrial (MTC) and proliferative/progenitor (PPR)15,39. To preserve a similar computational approach to the gene-expression-based MPs, we adapted the NMF-based algorithm to consider pathways instead of genes, including 3,015 pathway gene sets from the Molecular Signatures Database Hallmark, Gene Ontology (GO) and Reactome. This resulted in seven PMPs that recapitulated and extended the previously defined PMPs (Extended Data Fig. 5a–c).

We confirmed the existence of the NEU PMP, which correlated with the OPC-like, NPC-like and NEU-like MPs (Fig. 2e). We identified two distinct PMPs related to the previous GPM, one linked to glycolysis and hypoxia, and another closely resembling the MES-like state (denoted as GPM-MES). MTC-related and PPR-related PMPs also emerged from this analysis. We additionally identified two new PMPs, one similar to the AC-like and cilia-like MPs (AC-CIL), and another that is highly correlated with the GPC-like state, denoted as GPC/radial glia-like (RGL) (GPC-RGL). Overall, the association of PMPs and MPs provides further functional characterization to each state, supporting the identification and biological interpretation of the new malignant cell states.

Hybrid states reveal putative state transitions

Most malignant cells in our dataset were classified into a single state, yet roughly 20% of the cells could be classified into multiple states, with a limited difference between the two highest state scores (for an example, see Fig. 2f). These hybrid cells did not have a high number of detected genes as would be expected from technical doublets (two nuclei occupying the same droplet; Extended Data Fig. 5d). Moreover, hybrid cells were not distributed as expected across all state pairs but rather were highly enriched for certain state pairs (Extended Data Fig. 5e). Thus, we reasoned that these may represent cells in transition between two states, and constructed a GBM model where each node represents a cellular state and each edge reflects a high-frequency hybrid state (Fig. 2g).

Although this analysis lacks evidence for directionality, the differentiated cellular states (NEU-like, AC-like, cilia-like, MES-like and hypoxia) were all at the edges, while the three progenitor-like and stem-like states (NPC-like, OPC-like and GPC-like) were all in the center of the model. Specifically, the GPC-like state was positioned at the center of three trajectories (NEU, AC-cilia, hypoxia-MES), perhaps suggesting that this population represents an early progenitor state with differentiation potential along multiple trajectories. Consistent with this possibility, cells having high GPC-like scores tended to exhibit intermediate expression levels of programs for the other cellular states (Fig. 2h).

Baseline expression patterns of intertumor heterogeneity

When assessing differences between samples, a dominant signal reflects variations in the abundance of shared cellular states. To highlight additional patterns of intertumor differences, we developed a cell-state-controlled approach in which we compared tumors only based on cells in corresponding states. First, each tumor sample was decomposed into subpopulations of cells representing seven common malignant states. For each subpopulation with a minimum of 25 cells in a particular sample, we then defined a state-specific pseudo-bulk (average) expression profile. Then, for each state, the pseudo-bulk profiles were compared across tumors, thereby eliminating state-specific signals and highlighting tumor-specific differences in that state. Principal component analysis (PCA) was applied to the pseudo-bulk profiles of each cell state, defining high and low signatures for the top three principal components (Fig. 3a).

Fig. 3: BPs of intertumor heterogeneity.
Fig. 3: BPs of intertumor heterogeneity.
Full size image

a, Schematic of state-controlled analysis to define the BPs. b, Intertumor and intratumor variability of the BPs (n = 3) and MPs (that is, cell states, n = 7). Intertumor variability reflects the average variability measured between scores of different samples (defined as the average s.d. of program scores in each state across all states). Intratumor variability reflects the average variability measured between scores of same samples (defined as the average s.d. of program scores in each sample across all samples). Intertumor variability was significantly larger in BPs relative to MPs (mean score = 1.05 versus 0.65, P = 0.0005, one-sided t-test) whereas intratumor variability was significantly higher in MPs relative to BPs (mean score = 0.76 versus 0.62, P = 0.028, one-sided t-test). P values were not corrected for multiple testing. c, Intra-sample variability in BP scores across state-specific pseudo-bulk profiles. Each dot reflects a state-specific pseudo-bulk profile, is colored according to state and aggregated horizontally according to sample. The black horizontal lines connect pseudo-bulk profiles from the same sample as a visualization aid and have no other meaning. The black dots reflect the mean BP score of each sample and are connected using a line for visualization purposes. The dashed lines mark y = −1 and y = 1 to aid in visualization. The y axis shows the difference between the ECM and NEU BPs. d, Top, two-dimensional representation of the BPs. Each dot represents a specific sample (the mean score across the pseudo-bulks for each BP) and are colored according to the maximal score. The y axis reflect the BP-glial score, the x axis reflects the difference between the NEU and ECM BPs as follows: sign(NEU-ECM) × max(NEU, ECM). Bottom, limited intratumor variability in BP scores. Cells are colored according to the maximal BP score. Cellular states that reflects each quadrant are indicated on the right.

We found highly consistent signatures of intertumor variation across PCA analyses of different states. For example, signatures of variation across AC-like pseudo-bulk profiles were highly similar to signatures of variation across NPC-like pseudo-bulk profiles. Three sets of signatures were each identified in analyses of most states, reflecting patterns of intertumor heterogeneity that commonly influences several malignant states (Extended Data Fig. 6a). We denoted their consensus signatures as state-controlled baseline profiles (BPs), that is, consistent malignant gene expression differences between tumors that are not accounted by differences in tumor cell-state distribution.

We annotated BPs based on functional enrichment as NEU, glial and extracellular matrix (ECM) (Extended Data Fig. 6b and Supplementary Table 3). While BP annotations are reminiscent of cell-state MPs (that is, BP-NEU and NEU-like MP; BP-ECM and MES-like MP), we note that BPs and cell-state MPs reflected distinct sets of genes, such that MPs preferentially varied within tumors while BPs preferentially varied between tumors (Fig. 3b,c and Extended Data Fig. 6c) and remain consistent within a tumor (P = 2.3 × 10−9, Wilcoxon rank-sum test; Extended Data Fig. 6d). Accordingly, when we assigned tumors to BPs by averaging across the respective pseudo-bulks (Fig. 3d top; Extended Data Fig. 6e), we found that this classification tended to be consistent across different cellular states in the tumor (Fig. 3d bottom). In a tumor with high ECM BP, cells existed in multiple states but most tended to express high ECM, while in a tumor with high NEU BP most cells tended to express high NEU BP, even those in a MES-like state.

To examine whether BPs are consistent across regions of the same tumor, we leveraged a GBM spatial transcriptomics dataset that includes multi-sampled tumors37. We derived de novo BPs from the spatial transcriptomics data, resulting in three BPs that were highly correlated with the snRNA-seq-derived original BPs (Extended Data Fig. 7a,b). Sample pairs from the same tumor had significantly higher similarity in BP scores than sample pairs from different tumors (Extended Data Fig. 7c), supporting the consistency of BPs within tumors. Finally, we leveraged a subset of samples with both bulk RNA-seq and snRNA-seq data to compare the newly defined BPs to The Cancer Genome Atlas (TCGA) expression subtypes17,40. We found a significant association (Fisher’s exact test, P = 0.02) with the greatest overlap between the BP-ECM and the MES TCGA subtype (Extended Data Fig. 7d). Taken together, the three BPs reflect core intertumor transcriptional variation (that is, tumor-level phenotypes) that are partially independent of the variation in cellular states within tumors.

Interplay between TME, malignant states and BPs

We next investigated how TME heterogeneity may influence malignant BPs and cell-state abundance. To accomplish this, we applied our NMF framework to derive TME MPs for each nonmalignant cell type (n = 62 total MPs; Fig. 4a). We annotated TME MPs based on pathway enrichment and by scoring cells for published signatures (Extended Data Fig. 8a–d, Supplementary Table 4 and Methods). TME MPs were linked to diverse functions, for example, reflecting cortical layers (among excitatory neurons), immune phenotypes (microglia versus bone marrow-derived macrophages) and differentiation status (immature versus mature oligodendrocytes). A subset of TME MPs had high overlap with malignant cell MPs (for example, hypoxia and MES-like), suggesting a similar effect of microenvironmental niches on multiple cell types30 (Extended Data Fig. 8e).

Fig. 4: Heterogeneity within the microenvironmental cell states.
Fig. 4: Heterogeneity within the microenvironmental cell states.
Full size image

a, Donut chart representing the number of cells per cell type analyzed to identify robust NMF MPs, labeled with the number of cells and MPs found in each cell type. The size of the donut represents the relative proportion of all TME cells. b, Selected TME NMF state composition of tumors assigned to each BPs. The statistical significance of TME composition between samples with assigned BP to ECM (n = 23), glial (n = 39) or NEU (n = 35) was computed using a two-sided Kruskal–Wallis test (n = 97). c, Interaction graph between malignant cell states and TME cell type abundance. Each node represents malignant cell-state (red) or TME cell type nodes (colored according to cell type as presented). The thickness and color of the edge indicate the strength of the Pearson correlation. d, Correlation heatmap for the Pearson correlation between the relative malignant cell-state proportions (rows) and relative TME cell-state abundance (columns), which show representative associations for the hypoxic environment. e, Ligand–receptor cross-talk graph. Each green circled node represents a malignant cell state versus nonmalignant cell type pair, while each squared node represents a ligand–receptor interaction; there is an edge if that cross-talk is present. The thickness and color of the edge indicate the percentage of tumors in which both malignant cell state and nonmalignant cell type are present and the cross-talk exist. Open ‘(’ and closed ‘)’ parentheses reflect multiple ligand or receptor interaction from the same family. Specifically, closed parentheses ‘)’ indicated multiple ligands interacting with the same receptor, and viceversa for closed ‘)’ parentheses, multiple receptors interacting with the same ligand. UPR, unfolded protein response.

We assigned all TME cells to their highest-scoring TME MP (Methods). TME MP abundances differed significantly across GBMs mapping to the three BPs, suggesting an association between TME and BPs (Fig. 4b). Similarly, TME MP abundances were associated with malignant MP abundances (Extended Data Fig. 9). Abundance of MES-like cells was associated with increased tumor-associated myeloid cells (TAMs) and monocyte-derived macrophages (MDMs)4,7,9, but also with endothelial cells (Fig. 4c). Abundance of NEU-like malignant cells was associated with neurons, suggesting that nonmalignant neurons may interact with malignant cells and shape their states. Abundance of GPC-like cells was positively correlated with TAM-microglia; abundance of OPC-like malignant cells correlated with nonmalignant OPCs.

Collectively, these results indicate that malignant state abundance is associated with diverse TME cell types. Clustering of all states revealed two major TME-malignant compartments that probably reflect tumor periphery and core regions41 (Extended Data Fig. 9). In particular, malignant hypoxia, stress and MES-like states were associated with inflammatory and hypoxic macrophages, reactive astrocytes, and inflammatory and angiogenic endothelial states (Fig. 4d). Thus, beyond shaping glioma cell identity3,4, hypoxia may cultivate an immunosuppressive ecosystem and induce astrocyte reactivity.

Finally, we used CellChat42 to quantitatively infer ligand–receptor interactions between malignant cell states and nonmalignant cell types. This resulted in two separate subnetworks, supporting the division into tumor core regions (Fig. 4e, left) versus periphery (Fig. 4e, right). Among those, we found (1) IGF1 and IGF1R cross-talk between TAM-MDM and MES-like cells, which is consistent with a proposed resistance mechanism43; (2) PTN-PTPRZ1 OPC cross-talk, which may promote tumor progression and invasion17,31; and (3) NLGN-NRXN and NRG-ERBB4 interactions between neurons and NEU-like cells, reported as potential therapeutic target in glioma44,45,46 (Fig. 4e). Taken together, GBM microenvironmental niches such as hypoxia, glial and NEU TME may serve as determinants of malignant and nonmalignant states.

Genetic associations with GBM transcriptional states

Malignant gene expression profiles may also be driven by genetic alterations. Previously, we found that the CNAs of CDK4, PDGFRA and EGFR and mutations in NF1 were associated with inferred abundance of cellular states in bulk GBM data from TCGA4. To examine these associations in our snRNA-seq dataset, we analyzed malignant state abundance in samples with matched DNA profiling and sufficient malignant cells (n = 108 samples). As expected, we found associations between CDK4 amplification and NPC-like abundance, PDGFRA amplification and OPC-like abundance, and NF1 mutations and MES-like abundance (Fig. 5a; Wilcoxon rank-sum test, P < 0.05). In contrast, we did not observe an association between EGFR amplifications and AC-like abundance.

Fig. 5: Genetic associations with transcriptional states and GBM multilayered ecosystems.
Fig. 5: Genetic associations with transcriptional states and GBM multilayered ecosystems.
Full size image

a, Association between genetic aberrations and malignant cell-state abundance. Two-sided Wilcoxon rank-sum tests are presented. P value adjustments for multiple hypothesis testing were not performed. b, Cell type composition of each of the tumors (n = 121) in the dataset, aggregated based on their composition group. c, Malignant cell-state proportions (rows) in each tumor sample (n = 111 with at least 50 malignant cells; columns). Tumors are aggregated according to the dominant malignant state (at least 30% of cells). d, Interaction graph between the different layers of GBM (composition groups, malignant state groups, BPs, gene-level CNAs and single-nucleotide mutations). Each node represents a layer; nodes are connected with an edge if a statistically significant association exists between the nodes. mut, mutation.

We next tested whether EGFR amplifications, or other genetic alterations, might be associated with the newly identified states (Fig. 1b). The strongest associations were between canonical genetic alterations of GBM (that is, chromosome 7 gain, chromosome 10p loss, EGFR amplification and TERT promoter mutations) and increased GPC-like state abundance (Fig. 5a and Extended Data Fig. 10a). These canonical genetic alterations typically co-occur in GBM, so we compared samples with at least one of these alterations to those without. Tumors with canonical alterations had, on average, 12% GPC-like malignant cells, compared with less than 1% of those without (Wilcoxon rank-sum test, P = 0.003; Extended Data Fig. 10b), suggesting that the GPC-like state is rare in GBMs lacking canonical alterations.

Other associations included TP53 mutations linked to NEU-like abundance and chromosome 7 amplifications linked to hypoxia abundance, which is consistent with a recent study28. Similar analyses of BPs found that glial BP is more closely aligned with canonical GBM genetic alterations compared to ECM or NEU BPs (Extended Data Fig. 10c). Together, these results indicate an interplay between genetic alterations and specific cellular states, although our dataset may still be underpowered to detect additional associations given the low frequencies of GBM genetic alterations.

From transcriptional layers to integrated GBM ecosystems

We defined three distinct layers of GBM transcriptional patterns: cell type composition; malignant cellular states; and BPs. To examine the associations across these layers, we first classified all samples according to each layer (Supplementary Table 5). Regarding cell type composition, tumors were divided into three main groups according to malignant fraction (high, intermediate and low); the latter group was further subdivided according to the dominant TME cell type (Fig. 5b). Regarding malignant state, most tumors were dominated by one state, thereby defining six groups (Fig. 5c). Finally, BPs were used to define three groups of tumors as described above (Fig. 3d).

Next, we investigated associations across layers (Fig. 5d). Each of the three BPs was significantly associated with classifications according to cell type composition and malignant state, as well as specific genetic events. BP-NEU was associated with the NEU-like and OPC and NPC-like states, with a glio-neural (GN)-rich (neurons, astrocytes, oligodendrocytes) TME and with TP53 and RB1 mutations. BP-ECM was associated with the MES-like and hypoxia states, with macrophage-rich TME and NF1 mutations. Finally, BP-glial was associated with the GPC-like state, with a highly malignant fraction and with EGFR and MDM2 amplifications (Fig. 5d).

These results define three multilayer ecosystems, each including correlated features among the three layers (Fig. 6a). Overall, 43% (52 of 121) of the samples could be robustly assigned to one of these three multilayer ecosystems, based on either two or three of its layers (Fig. 6b and Supplementary Table 6), which is significantly more than expected by chance (P < 0.01 for each ecosystem according to a binomial test). The remaining samples could be described according to individual layers but did not correspond to a stereotypic association between these layers (Extended Data Fig. 10d), highlighting the complexity and diversity of GBM ecosystems.

Fig. 6: Stereotype transcriptomic architectures in GBM.
Fig. 6: Stereotype transcriptomic architectures in GBM.
Full size image

a, Layer map of the 54 samples that could be classified to one of the multilayer groups. b, Model for stereotype architecture in GBM, with exemplar tumors for each ecosystem and their compositional, cell-state and BPs. Multiple intercorrelated transcriptomic layers make up the GBM ecosystems.

Discussion

After the first report of scRNA-seq for 430 cells derived from five GBM samples3, technical progress has dramatically expanded the scale of single-cell profiling. In this study, we leveraged a large-scale single-nucleus dataset (~430,000 nuclei, 121 samples) to comprehensively redefine the GBM cellular architecture. We discovered three new cellular states that were not captured in previous glioma studies, including a differentiated NEU state (NEU-like), a new progenitor state (GPC-like) and a cilia-related state. The capacity to identify these new states may be due to the high number of samples and cells profiled in this study. Yet this explanation is insufficient given that two of the new states are relatively common in our data (6.7% NEU-like and 14.2% GPC-like cells). Thus, we speculate that these states were depleted by single-cell isolation protocols that require enzymatic digestion and may poorly capture specific cell states (for example, those with complex morphology). Single-nucleus protocols isolate nuclei using tissue homogenization in cold buffer, thus capturing a wider variety of cell types23,47. Accordingly, the newly described malignant states may resemble normal cells with complex cellular processes (for example, neurons, ciliated cells).

GBM cellular states resemble the programs observed in normal glial or NEU cell types. This finding is further supported by analysis of potential cellular transitions. Transitions between GBM states was demonstrated previously4, but the rate of specific transitions is unknown. We explored this question through quantification of hybrid states, based on the assumption that cells mapping to two states often reflect a transition. Only eight of the potential 38 hybrid states accounted for 71% of hybrids, highlighting the enrichment of specific hybrids and presumably the identity of common transitions. These putative transitions also mirror normal brain development, supporting the view that many aspects of GBM biology are inherited from the cell of origin.

Previous work noted that much of the diversity between GBMs is tightly linked to diversity within GBMs, such that tumor subtypes can be partially explained by the abundance of recurring cellular states. However, not all intertumor variation may be explained in this way; to our knowledge, no previous studies attempted to define the residual signal of intertumor heterogeneity. In this study, we addressed this issue by analyzing intertumor heterogeneity only among cells of the same cellular state, thereby controlling for the diversity of states within tumors. We found that the main patterns that vary between tumors in one state also vary in other states, thus defining a state-independent pattern of intertumor variability that may be thought of as reflecting a tumor baseline profile.

Notably, the three BPs we identified—NEU, ECM and glial—are functionally reminiscent of some of the malignant states that vary within tumors (NEU-like, MES-like and GPC-like), although they include distinct sets of genes. Moreover, these features are correlated, such that tumors with one BP also tend to be enriched with the related malignant state. We speculate that certain genetic and environmental features give rise to two related effects: a BP that influences all cells in the tumor and a higher fraction of cells with a certain expression program (that is, state). Thus, tumors with an NEU BP (represented by one set of NEU-related genes) also tend to have an abundance of cells that activate another set of genes that corresponds to the NEU-like MP. Similarly, tumors with an ECM BP also have an abundance of cells that activate genes of the MES-like MP. At bulk, the two phenomena of BP and frequency of certain MPs coalesce into the signal of GBM subtypes, while single-cell analysis allows us to dissect these distinct effects.

What might be the signals that concomitantly activate BP and frequencies of MPs? Our analysis shows a correlation of both of these malignant features with the TME. NEU malignant features are associated with low-purity tumors enriched with glial and neural TME cells. Mesenchymal and ECM malignant features are associated with low-purity tumors enriched with macrophages. Finally, glial malignant features are associated with high purity. In addition to the TME, we also found genetic events associated with a combination of tumor features: NF1 mutations, TP53 and RB1 mutations, and EGFR and MDM2 amplifications are linked to the ECM, NEU and glial ecosystems, respectively. Thus, we speculate that tumor genetics along with tumor location, which influences the tumor TME, define the propensity of each GBM toward a particular BP and enriched MP, culminating in three stereotypic ecosystems, each associated with multiple correlated components—tumor genetics, nonmalignant TME and two malignant features (BP and MP frequency). Nevertheless, the coupling between these components is only partial, such that around half of the tumors cannot be assigned to a stereotypical ecosystem but rather reflect unique combinations of features, highlighting the diversity and complexity of GBM.

Methods

Participants and ethical approval

Frozen GBM specimens, diagnosed as ‘glioblastoma, IDH WT’ according to the WHO 2021 classification, were collected at seven institutions, as outlined in the sections below. Collection and genomic profiling were approved by the institutional review board (IRB) of each institution; all participants provided written informed consent accordingly. Cohorts were added to the IRB of the Dana-Farber/Harvard Cancer Center (no. 10-417). Patients were male and female. Clinical information of the cohorts is summarized in Supplementary Table 1. Additional clinical characteristics related to the longitudinal analyses are shown in the companion study by Spitzer et al.18.

MD Anderson Cancer Center cohort

For this cohort (n = 4), tumors were collected from MD Anderson Cancer Center. Collection was approved by the Institutional Review Board of MD Anderson Cancer Center under the protocol number 2012-0441.

Duke University cohort

For this cohort (n = 12), tumors were collected from the Preston Robert Tisch Brain Tumor Center Biorepository, Duke University Hospital. Collection was approved by the IRB of Duke University under protocol no. Pro0007434.

Tokyo University cohort

For this cohort (n = 30), tumors were collected from the Department of Neurosurgery, Tokyo University Hospital. Collection was approval by the IRB of Tokyo University Hospital under protocol no. G10028.

Pitié-Salpêtrière Hospital cohort

For this cohort (n = 28), tumors were collected from the Pitié-Salpêtrière Hospital and were approved by the Onconeurotek Brain Tumor bank of the hospital Pitié-Salpêtrière under certification no. 96-900.

St. Michael’s Hospital cohort

For this cohort (n = 14), tumors were collected from the Division of Neurosurgery, St. Michael’s Hospital, Unity Health Toronto. Collection was approved by the Research Ethics Board of St. Michael’s Hospital, Unity Health Toronto, under protocol no. REB 13-141; all patients provided signed informed consent accordingly.

Seoul National University cohort

For this cohort (n = 20), tumors were collected from the Department of Neurosurgery, Seoul National University Hospital, Seoul. Collection was approval by the IRB of Seoul National University Hospital (approval no. H-2004-049-1116).

NORLUX Neuro-oncology Laboratory

For this cohort (n = 13), tumors were collected from Centre Hospitalier de Luxembourg (Neurosurgical Department). Collection was approved by the National Committee for Ethics in Research Luxembourg, under protocol no. 201201/06. Cohorts were added to IRB protocol Dana-Farber/Harvard Cancer Center 10-417. Patients were male and female. Clinical information of the cohorts is summarized in Supplementary Table 1. Additional clinical characteristics related to longitudinal analyses are shown in the companion paper by Spitzer et al.18.

Nuclei isolation from frozen tissue

In protocol 1 (laboratory 1 and laboratory 2 for the cohorts from Duke University, MD Anderson Cancer Center, Tokyo University Hospital, St. Michael’s Hospital and Pitié-Salpêtrière Hospital), nuclei from frozen tumor tissue were isolated. Tissue was thawed and mechanically dissociated in ST buffer (10 mM Tris-HCl, pH 7.5, 146 mM NaCl, 1 mM CaCl2, 21 mM MgCl2) with 0.49% CHAPS (no. 28300, Merck Millipore). Single-nucleus suspensions were filtered using a 40-μm strainer, centrifuged at 500g for 5 min and resuspended in ST buffer supplemented with 0.01% BSA (cat. no. B9000S, New England Biolabs). Nuclear suspensions were inspected by microscope, counted using a hemocytometer and used for 10x Genomics workflow or fluorescence-activated cell sorting (FACS) for the Smart-seq2 workflow.

In protocol 2 (laboratory 3 for the cohorts from NORLUX and Seoul National (SNU) cohorts), tissue samples were thawed and mechanically dissociated in Nuclei EZ Lysis Buffer (NUC101, Sigma-Aldrich) using dounce homogenization. Solutions were incubated on ice for 5 min and mixed 1–2 times during incubation. Single-nucleus suspensions were filtered through a 70-μm strainer and centrifuged at 500g for 5 min at 4 °C, resuspended in Nuclei EZ Lysis Buffer and incubated on ice for 5 min. Solutions were centrifuged at 500g for 5 min at 4 °C and resuspended in 1% BSA, 0.2 U μl−1 RNase inhibitor and PBS buffer (three times). For the final resuspension, 4′,6-diamidino-2-phenylindole was added to the buffer; the solution was filtered through a 40-μm strainer, cells were counted on a Countess II Automated Cell Counter (Thermo Fisher Scientific) and nuclei were taken into the 10x Genomics workflow.

10x Genomics for single-nucleus sequencing

The Chromium Single Cell 3′ Reagent Kit Vv3 (cat. no. PN1000128, 10x Genomics) was used according to the manufacturer’s protocol. Briefly, nuclei were loaded on the Chromium Chip (cat. no. PN12000120, 10x Genomics) with a target cell recovery of 6,000–8,000 nuclei and processed in the Chromium Controller. Single nuclei were partitioned into Gel Beads-in-emulsion, followed by RNA reverse transcription with barcoding. Libraries were created by breaking the Gel Beads-in-emulsion and pooling barcoded fractions, complementary DNA amplification, fragmentation and attachment of sample index and adapter, and sequenced on the NextSeq 500 or NovaSeq (Illumina) platform, with paired-end 28–91 bp reads. For the NORLUX and SNU cohorts, nuclei were loaded on a Chromium Chip with a target cell recovery of 6,000 nuclei for single-cell multi-omic assay for transposase-accessible chromatin and gene expression according to the manufacturer’s protocol. The gene expression chemistry for the 10x multi-ome possesses the same chemistry as the 10x v3 3′ snRNA-seq.

Nucleus sorting for Smart-seq2

The nuclear suspension was stained using the Vybrant DyeCycle Ruby Stain (cat. no. V10309, Thermo Fisher Scientific) immediately before FACS. Single-nucleus sorting was performed on a FACSAria Fusion Sorter (Becton Dickinson) using a 640-nm laser (Ruby, 670/14 filter) (Extended Data Fig. 2d). After doublet discrimination, intact nuclei were selected with ruby+ and were sorted into 96-well plates containing TCL buffer (cat. no. 1031576, QIAGEN) with 1% 2-mercaptoethanol (cat. no. M6250, Sigma-Aldrich). Plates were frozen on dry ice immediately after sorting and stored at −80 °C before whole-transcriptome amplification, library preparation and sequencing.

Smart-seq2

Whole-transcriptome amplification, library construction and sequencing were performed as previously published with slight modification for the nucleus as follows4,19. Briefly, on 96-well plates, RNA derived from single nuclei was first purified with Agencourt RNAClean XP beads (cat. no. A66514, Beckman Coulter). Then, Oligo(dT)-primed reverse transcription was performed using Maxima H Minus Reverse Transcriptase (cat. no. EP0753, Thermo Fisher Scientific) and locked TSO oligonucleotide (5′-AAGCAGTGGTATCAACGCAGAGTACATrGrG+G-3′, QIAGEN). This was followed by PCR amplification (22 cycles for snRNA-seq) using the KAPA HiFi HotStart ReadyMix (cat. no. K2602, KAPA Biosystems) with subsequent Agencourt AMPure XP Bead (cat. no. A63882, Beckman Coulter) purification. Libraries were tagmented using the Nextera XT Library Prep Kit (cat. no. FC-131-1096, Illumina) with custom barcode adapters. Libraries from 768 cells with unique barcodes were pooled and sequenced on a NextSeq 500 or NovaSeq 6000 sequencer (Illumina), with paired-end 38 bp reads.

WES

WES for the samples from Duke University, the MD Anderson Cancer Center, Tokyo University Hospital and St. Michael’s Hospital was performed as follows. DNA was extracted from each frozen tumor sample and blood samples corresponding to the patients using the DNeasy Blood & Tissue Kit (cat. no. 69504, QIAGEN). Genomic DNA (100–250 ng) was acoustically sheared using an ultrasonicator (Covaris), targeting 150-bp fragments. Library preparation was performed using the KAPA HyperPrep Kit (cat. no. KK8504, KAPA Biosystems) followed by enzymatic cleanup using AMPure XP beads. Exome capture was performed using a custom exome bait (manufactured by Twist Bioscience to Broad Institute’s specifications). Captured libraries were sequenced with 150 bp paired-end sequencing on a NovaSeq 6000. For the samples from Pitié-Salpêtrière Hospital, after DNA was fragmented by a LE220 ultrasonicator (Covaris) and size-selected, library preparation and capture were performed using Twist Human Core Exome kit (Twist Bioscience) according to manufacturer’s protocol. Sequencing was performed on a NovaSeq 6000, with 200-bp paired-end sequencing.

WGS

Newly generated whole-genome DNA sequencing data were collected for a cohort of frozen samples from the NORLUX Neuro-oncology Laboratory. Briefly, DNA was extracted from each tumor sample using the AllPrep DNA/RNA Mini Kit (cat. no. 80204, QIAGEN) for samples with sufficient tumor tissue and matched normal blood when it was available. Selected tissue for bulk DNA isolation was adjacent to tissue used for single-nucleus data generation. Briefly, DNA was sheared to 400 bp using an LE220 ultrasonicator (Covaris) and size-selected using AMPure XP beads. Whole-genome libraries were prepared and sequenced with 150 bp paired-end sequencing on a NovaSeq 6000 platform. WGS data for the SNU cohort was prepared in an identical manner and was previously reported in ref. 9. Data for the SNU cohort are available on Synapse (https://www.synapse.org/glass).

Statistics and reproducibility

No statistical method was used to predetermine sample size. All available samples were provided by the aforementioned centers and no data were excluded from the analyses. For figures containing box plots, the box spans from the first to third quartiles, median values are indicated by a horizontal line and the whiskers show 1.5× the interquartile range; individual sample points are presented. The statistical analysis described in this work was done using R v.4.0.1 and above.

Somatic variant detection and copy number calling

DNA sequencing alignment, fingerprinting, somatic variant detection (Mutect2) and copy number segmentation was performed in accordance with the Genome Analysis Toolkit best practices using the Genome Analysis Toolkit v.4.0.10.1, as described previously9,48. Briefly, whole-exome and whole-genome reads were aligned to the b37 genome (human_g1k_v37_decoy) using the Burrows–Wheeler Aligner MEM v.0.7.17. DNA fingerprint analysis using ‘CrosscheckFingerprints’ (Picard) confirmed that all samples belonging to a patient came from the same individual, indicating that there were no sample mismatches in this study. Somatic mutations were detected using Mutect2 (v.4.1.0.0) for tumor samples with a matched normal blood sample. A panel of normals was constructed for each sequencing batch to account for differences in DNA library preparation (for example, whole-genome versus whole-exome) and used to filter out common artifactual and germline variants. Patient tumor samples without matched normal blood were analyzed for copy number alterations but not for SNV detection.

Single-cell and single-nucleus data processing of human glioma samples

For Smart-seq2 sequencing data, the reads were mapped to the UCSC hg19 human transcriptome using Bowtie with the following parameters:-q-phred33-quals -n -e 99999999 -l 25 -I 1 -X 2000 -a -m 15–best -S -p 6. For droplet-based sequencing data, the Cellranger v.3.1.0 pipeline provided by 10x Genomics was used for alignment (GRCh38 release 93) and to generate count matrices. Smart-seq2 gene expression levels were quantified as \({E}_{i,\,j}=\log 2(\frac{{{\mathrm{TPM}}}_{i,\,j}}{10}+1)\) where TPMi,j refers to transcripts per million for gene i in sample j, as calculated using RSEM. For droplet-based sequencing data, we quantified the gene expression levels as \({E}_{i,\,j}=\log 2(\frac{{{\mathrm{CPM}}}_{i,\,j}}{10}+1)\), where CPMi,j refers to counts per million for gene i in sample j. We divided the TPM and CPM values by ten as the size of single-cell libraries is estimated to be in the order of 100,000 transcripts; therefore, we wanted to avoid inflating the expression levels by counting each transcript approximately ten times. After these normalization steps, initial filtering of low-quality cells was done based on the low number of detected genes (fewer than 200 genes) or the high expression of mitochondrially encoded genes (more than 3%). Most low-quality cells were removed downstream in case they could not be robustly classified as malignant or nonmalignant (see the later sections on the CNA analysis). Lastly, we computed the average expression of each gene i as \(\log 2[(\frac{1}{n}{\sum }_{j=1}^{n}{{\mathrm{TPM|CPM}}}_{i,\,j})+1]\) and filtered out lowly expressed genes by limiting the analyzed genes to the top 3,000 most highly expressed genes (using the Seurat v.4.0.4 (ref. 49) function CreateSeuratObject). For the cells and genes that passed these quality control filters, we defined relative expression levels by centering the expression levels for each gene across all cells in the dataset as follows: \({{Er}}_{i,\,j}={E}_{i,\,j}-\,\frac{1}{N}{\sum }_{k=1}^{N}{E}_{i,k}\) where N is the number of cells in the dataset.

Clustering and cell type annotation

To facilitate clustering, cell type annotation and scoring of droplet-based data, we used the R package Seurat v.4.0.4 (ref. 49). Unique molecular identifier count data were normalized and scaled (using the NormalizeData and ScaleData functions) and then clustered by projecting the data to a lower dimensional space using PCA and then running the FindNeighbors and FindClusters functions. Initial classification of tumor-associated macrophages, T cells, oligodendrocytes and endothelial cells was based on the expression of known marker genes22,23,24,25,26, whereas the rest of the cells were annotated as presumed malignant. After classification of cells as malignant or nonmalignant (see the following two sections), cells classified as nonmalignant were scored (using the AddModuleScore function) for known cell-type-signatures and assigned a cell type using the method explained in the section ‘Assignment of cells to states’. Finally, heterotypic doublets were filtered our using the R package scDblFinder50.

Inferring CNAs from gene expression data

CNAs were estimated as described previously4,13,14,21 using the function infercna from the R package infercna (an implementation of the original method presented in ref. 3, which is available at https://github.com/jlaffy/infercna) with default parameters. Briefly, the algorithm sorts the analyzed genes in each sample according to their chromosomal location and applies a moving average with a sliding window of 100 genes in each chromosome to the relative expression values. The scores computed for the cells classified as nonmalignant (oligodendrocytes, macrophages and endothelial cells) define the baseline of the normal karyotype and their average CNA values are used to center the values of all cells.

Classification of cells into malignant or nonmalignant cell types

In cells classified using single-nucleus droplet-based sequencing, we generally encountered a lower CNA signal and continuous signal distributions related to (1) lower data quality relative to that observed in single-cell data, and (2) inclusion of nonmalignant cell types, such as astrocytes and neurons, in the presumed malignant group as these cell types could not be classified a priori as nonmalignant because of transcriptional similarity with malignant cells.

Therefore, we used a multi-step classification scheme for the droplet-based, single-nucleus sequencing data.

Step 1: cell classification using detection of copy number events

For each cell j, we computed the CNA signal on each chromosome separately as \({{\mathrm{CNA}}}_{j}^{C}=\frac{1}{n}{\sum }_{i=1}^{n}{{\mathrm{CNA}}}_{i,\,j}^{C}\) where n is the number of genes on chromosome C included in the analysis. We fitted a normal distribution (that is, a classifier) to the signal distributions of the reference cells for each chromosome (using the function fitdistr from the R package MASS v.7.3-56). This enabled assigning each cell with a P value for each chromosome using a two-sided z-test (function pnorm from the R package stats) reflecting whether the cell had a CNA signal that was significantly different than expected from nonmalignant cells. These P values were further corrected for multiple testing (for each chromosome) using the Benjamini–Hochberg method; cells with Benjamini–Hochberg-adjusted P < 0.05 for a particular chromosome were considered as having a copy number event on that chromosome (either amplification or deletion based on the sign of the CNA signal). Cells with an unadjusted P < 0.05 for a particular chromosome that failed to achieve statistical significance after adjustment were considered to have a ‘suspicious event’ on that chromosome. Then, inferred copy number events were cross-referenced with copy number calls from the WES and WGS data; only events detected in both methods were considered for classification and were termed ‘real’ events. For a subset of samples, this procedure was not possible because of (1) lack of WES or WGS data, or (2) inability to call CNA events from WES or WGS because of low purity or quality. For this subset of cases, we defined a panel of highly recurring CNA events (intersection of highly recurring events inferred from the snRNA-seq and WES and WGS data), which were used as a short-list of potential events. Cells classified a priori as nonmalignant were reclassified as ‘unresolved’ in case a ‘real’ or ‘suspicious’ CNA event was detected and ‘nonmalignant’ otherwise. Presumed malignant cells with at least one ‘real’ CNA event were classified as ‘malignant’; those with a ‘suspicious’ event were classified as ‘unresolved’ and those with no events at all (‘real’ or ‘suspicious’) were classified as ‘nonmalignant’.

Step 2: exclusion of questionable cells based on CNA correlation

We defined the CNA correlation as the Pearson correlation between each cell and the average CNA profile of the tumor sample the cell originated from. We limited the computation to the chromosomes on which ‘real’ events were detected and a control panel of ~1,000 genes from the other chromosomes. Based on the distribution of CNA correlations across the cells classified as malignant, we defined 0.5 as the lower threshold of malignancy. To determine the upper threshold for nonmalignancy, we performed SNV calling from the snRNA-seq data to estimate the algorithm misclassification rate. Briefly, for each sample with available WES data, we called SNVs for each cell using the Vartrix tool (https://github.com/10XGenomics/vartrix) from the FASTQ files. Misclassified cells were defined as cells harboring a malignant mutation that were classified as nonmalignant in step 1 of the algorithm. After a cost-effectiveness analysis (misclassification rate versus percentage of excluded cells), we set 0.35 as the upper threshold for nonmalignancy. Finally, cells classified as malignant with CNAcor < 0.5 or cells classified as nonmalignant with CNAcor > 0.35 were reclassified as ‘unresolved’ and excluded from the downstream analysis.

Step 3: refining classification by setting confidence levels

To refine the classification, we defined confidence levels. Confidence level 1 includes cells classified as malignant that harbored a malignant SNV, cells in which at least 50% of the expected CNA events were detected including chromosome 7 amplification or chromosome 10 deletion, and cells classified as nonmalignant with CNAcor < 0.25. Confidence level 2 includes cells classified as malignant in which at least 50% of the expected CNA events were detected that do not include chromosome 7 amplification or chromosome 10 deletion, cells classified as malignant in which less than 50% of the expected CNA events were detected, including chromosome 7 amplification or chromosome 10 deletion, and cells classified as nonmalignant with CNAcor < 0.5. The rest of the cells classified as malignant or nonmalignant were assigned confidence level 3. In the analysis presented in this work, we used only cells in confidence levels 1 and 2.

Deriving MPs from gene expression data

To capture the heterogeneity between cells from the same cell type, we leveraged the NMF51. NMF was performed on the relative expression values of each sample independently after setting negative values to zero. The NMF algorithm requires defining a priori the ‘k’ parameter, reflecting the expected number of latent features in the data. Since k varies between samples and is largely unknown, we ran the NMF algorithm on each sample using different values (3–10), thereby generating 52 programs for each sample. Each of these NMF programs was summarized using the top 50 genes based on the NMF coefficients. Derivation of the MPs from the NMF programs was described thoroughly by in ref. 30 and is described briefly here. The method first filters out NMF programs that are not robust (do not recur in the sample they originated from or across several samples) or are redundant in the sample (that is, they highly overlap other NMF programs in that sample). Then, robust NMF programs are clustered according to Jaccard similarity using a customized clustering algorithm that iteratively considers the degree of overlap between robust NMF programs and combines highly overlapping ones to form a cluster. The R 50 most-recurring genes in each cluster define an MP.

Overall, the algorithm yielded 16 MPs that were annotated based on functional enrichment analysis (using GO terms, Molecular Signatures Database Hallmark gene sets and gene sets derived from normal brain development datasets). MP16 included genes from several cell types resulting in an MP that fitted almost all cells and thereby did not reflect any heterogeneity; thus, it was excluded from the analysis. This same procedure was independently repeated for each nonmalignant cell type.

Assignment of cells to states

Malignant cells were scored for the NMF MPs using the Seurat function AddModuleScore. To facilitate cell classification, we generated 20 shuffled expression matrices by sampling each time 5,000 cells and shuffling the expression values for each gene. We then scored each shuffled matrix for the NMF MPs, thereby yielding 100,000 normally distributed scores for each expression program. These served as null distributions for cell-state classification. For each original cell, we computed a P value for each of the expression programs with a z-test (using the R pnorm function) using the statistics of the null distributions that we generated previously. We adjusted all P values for multiple testing using the Benjamini–Hochberg method. Each cell was classified into a specific state if the adjusted P value computed for that state was smaller than 0.05. Cells that achieved Padj < 0.05 for multiple states were assigned to the state for which they scored maximally. Cells that did not achieve Padj < 0.05 for any of the states were assigned an ‘unresolved’ state.

Cells were assigned a ‘cycling’ state on top of their cellular state if they achieved Padj < 0.05 for the cell cycle MP and ‘non-cycling’ otherwise.

Hybrids detection and classification

We define hybrids as cells scoring significantly (that is, with Padj < 0.05) for at least two cellular states with a difference between the top two significant scores smaller than expected by chance. To quantify the score difference expected by chance, we leveraged on the null distribution generated for the state classification algorithm and computed for each artificial cell the difference between the top two states. The 95 quantiles of this score difference distribution (equal to a difference of 0.24) was set as the threshold for determining if a state was singular or hybrid.

To estimate the expected frequency of technical hybrids, we first computed for each pair of states (A, B) the expected frequency of hybrids, defined as \({\mathrm{Exp}}\left({\mathrm{A,B}}\right)={\mathrm{Freq}}\left({\mathrm{A}}\right)\times {\mathrm{Freq}}\left({\mathrm{B}}\right)\) and the observed/expected ratio defined as \({\log }_{2}(\frac{{\mathrm{Obs}}({\mathrm{A,{B}}})}{{\mathrm{Exp}}({\mathrm{A,B}})})\). As this is an overestimation of the expected frequency attributed to technical effects, we estimated this technical factor (TF) by averaging across observed/expected ratios of hybrid pairs detected less than expected by chance (that is, observed/expected < 0). Then, we computed the expected frequency of technical hybrids for each hybrid pair by multiplying the expected frequency with the technical factor and defined the observed/expected ratio as \({\log }_{2}(\frac{{\mathrm{Obs}}({\mathrm{A,{B}}})}{{{\mathrm{TF}}}\times {\mathrm{Exp}}({\mathrm{A,B}})})\). Finally, we filtered out insignificant hybrid pairs with large observed/expected ratio (attributed to small numbers) using a one-sided Fisher’s test. The hybrid lineage model was generated using the R package igraph, with cell states as nodes and edges connecting the states that formed significant hybrid pairs. We did not include the ‘cycling’ state in this analysis because we did not view ‘cycling’ as an independent state but rather as an additional feature that cells can have on top of their neurodevelopmental identity (that is, AC-like, OPC-like, and so on).

BP derivation

Each tumor sample was decomposed to up to seven pseudo-bulk profiles (one for each of the common states—AC-like, MES-like, hypoxia, GPC-like, OPC-like, NPC-like, NEU-like—given at least 25 cells classified to that particular state) by averaging across the normalized unique molecular identifier counts and log2-transforming. Genes were included if their mean log2 expression value across all pseudo-bulks was at least 1 and if their median variance (computed separately within state) was at least 2.5. Overall, 1,005 genes passed these filters. Then, pseudo-bulks were analyzed within state to remove the state-specific signal using PCA. We then derived six gene programs from each state (top and bottom 50 genes of the first three principal components), computed the Jaccard similarity index between each pair of programs and hierarchically clustered the similarity matrix using the ConsensusClusterPlus package (v.1.56.0). Analysis of the cumulative distribution function revealed minor increases in area under the curve for k > 5 (where k is the number of clusters); therefore, k = 5 was chosen as the number of clusters. Then, we derived five consensus signatures by including genes that recurred in at least 25% of programs in each of the clusters (with a hard minimum of at least three times). Using manual inspection and GO enrichment, we annotated the consensus signatures. Two signatures were excluded because of short length or high suspicion of reflecting a technical artifact. To exclude the possibility that ambient RNA had a role in driving the BPs, we estimated the contamination by ambient RNA per sample using the R package SoupX (v.1.6.2). There was no difference in the estimated contamination level across the different BPs and no significant overlap between the top most contaminated genes per sample and the MP or BP signatures (data not shown).

Assigning tumors to composition groups

To measure the differences in tumor composition, we generated for each tumor sample a composition vector reflecting the proportion of each cell type in the tumor sample. This enabled assigning tumors to three main composition groups: high malignant fraction (percentage of malignant cells > 75%); intermediate malignant fraction (percentage of malignant cells 50–75%); and low malignant fraction (percentage of malignant cells < 50%), where the low malignant fraction group could be further subdivided according to the dominant TME cell types (percentage of TAM, oligodendrocyte, GN > 40%, or mixed in case of no dominant TME cell type). Similarly, we defined the proportion vectors for the malignant cell states and assigned tumors to six groups according to the dominant cell state (at least 25% of cells assigned to the particular state)—AC, MES/hypoxia, GPC, OPC/NPC, NEU and mixed (in case of no dominant state). Tumors were classified according to the BP with maximal score and to the mixed category if the maximal score was less than 0.25.

Multilayer group definition

The association between each pair of features (A, B) was computed using a binomial test with the number of times the pair was observed as the number of successes, the number of tumor samples as the number of experiments and the frequency of feature A times the frequency of feature B as the expected probability of success (under the assumption that the features were independent). We then generated an undirected graph by defining each feature as a node and connecting the nodes with edges when a statistically significant association (P < 0.05, unadjusted for multiple testing) between the two features was observed.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.