Introduction

Cancer diagnosis has a pivotal role in clinical oncology. Precise tumor classification is required for making accurate prognoses and adequate treatment decisions. For many tumor types, histopathological evaluation represents the standard diagnostic method. However, over the last decades, characterization of genomic alterations has enabled more refined tumor classification and molecularly stratified treatments. Today, high throughput technologies generate genetic and epigenetic tumor landscapes at unprecedented resolution. In addition, the arrival of “big data” in medicine, and particularly in cancer diagnostics, has paved the way for the employment of machine learning (ML) algorithms in clinical routines1,2. Trained on extensive reference datasets, these algorithms automate complex tasks while often surpassing human accuracy. Several recent ML applications have been successfully employed in the context of cancer diagnostics, for instance in the early detection of tumors from radiologic images3 or circulating cell-free tumor DNA4. Artificial intelligence (AI) will have a growing impact on diagnostic pathology in the future2.

We and colleagues have recently developed an ML-based approach that utilizes array-based genome-wide DNA methylation profiles to enable fast, accurate, and affordable classification of central nervous system (CNS) tumors (known as the “Heidelberg brain tumor classifier”)5. Over 100 different molecular brain tumor classes have been recognized6 and their precise classification remains difficult even for experienced neuropathologists. Our widely employed ML-based approach supports clinicians in making precise diagnoses, and it is particularly helpful for solving challenging cases7,8,9,10,11,12,13. Similar DNA methylation-based classifiers have been developed for other tumor types, including sarcoma, meningioma, sinonasal tumors, and intrahepatic bile duct cancer14,15,16,17, demonstrating generalizability and reliability of the method18.

DNA methylation is a key epigenetic mark that plays an important role in regulating the phenotypic landscape during normal development and cancer19,20,21. The etiology of CNS tumors is deeply connected to the development of the nervous system, and their high diversity mirrors the complexity of cellular phenotypes in the human brain. There is mounting evidence showing that the DNA methylation patterns of tumors reflect their respective cell of origin22, in addition to superimposed somatic epigenetic alterations that are specific to tumors23. While genome-wide DNA methylation profiles are the basis for ML-based tumor classification, it remains largely unclear which specific patterns are being used for distinguishing classes24.

Here, we present the development of an explainable artificial intelligence (XAI) framework to interpret the Heidelberg brain tumor classifier. Generally, XAI is motivated by the need for human interpretable models to better understand automated predictions, to gain insights into the underlying data, and to build up trust in AI25,26. Our XAI framework enables the identification of tumor-specific epigenetic signatures, may reveal biomarkers, and provides insights into brain tumor biology. Moreover, we provide a publicly accessible web application for easy exploration between tumor classes, methylation sites, and associated genes to facilitate a detailed understanding of the models’ decisions.

Results

Development of an interpretable AI framework

The Heidelberg brain tumor classifier and other DNA methylation-based classifiers are based on the Random Forest (RF) algorithm, which has been shown to be well suited to high-dimensional genomic datasets27,28,29. RF is an ensemble method that utilizes a multitude of binary decision trees, each trained on a subset of samples (“in-bag” samples) and features30. For each splitting node of every decision tree, the algorithm selects the feature (i.e., a DNA methylation probe) that provides the best binary split. In this way, RF naturally yields feature importances, which have been shown to be a powerful metric to identify complex biological associations31,32,33,34. We reasoned that quantifying the number of times a feature is selected during training provides us with a simple metric to create an interpretable framework that is capable of highlighting class-specific DNA methylation patterns. To this end, we retrieved the original dataset and the RF models of the Heidelberg brain tumor classifier. The dataset contains DNA methylation array profiles of 2801 samples corresponding to 82 tumor and 9 normal control classes (Fig. 1a; an overview of classes and their abbreviations is provided in Supplementary Fig. 1a and Table S1). For each sample, the DNA methylation status of 428,799 genomic sites (high-quality probes) is measured.

Fig. 1: Overview of the explainable AI workflow.

a Plot shows a t-SNE dimensionality reduction of the reference dataset consisting of 2801 DNA methylation profiles from 82 CNS tumor classes and 9 healthy control classes. Color indicates different classes. Data were generated using Infinium HumanMethylation BeadChip arrays (428,799 probes). CNS central nervous system. Figure adapted from ref. 5. b Graph illustration of the RF classifier trained on the reference dataset. Four out of 10,000 binary decision trees are shown. Magnification exemplifies four binary splitting nodes (probes) and five terminal nodes (tumor classes) of a single tree. Color of the edges indicates if the methylation value is higher (hypermethylated, red) or lower (hypomethylated, blue) than the threshold value of the preceding splitting node. c Illustration of the pairwise probe usage extracted from the RF classifier for each pair of reference samples and aggregated by sample class. Probe usage is stored in a 3D array in which the first two dimensions represent all possible class combinations and the third dimension represents all probes. d The information stored in the 3D array is used to build an interpretable framework that is accessible in an interactive web application.

In the original work, an RF classifier was trained using all 428,799 probes (“outer” classifier) to select the 10,000 most informative probes for the final (“inner”) classifier (Fig. 1b, Supplementary Fig. 1b)5. We first investigated the outer RF model. Specifically, we analyzed each pairwise combination of tumor samples that were used for training a given tree (eight in-bag samples per class) and summarized the probe usage as the number of times a probe was selected to split their respective classes (see “Methods”). We used a positive sign if the probe was hypermethylated and a negative sign if the probe was hypomethylated (higher or lower than the threshold value, respectively) in the samples at the split. Probe usage was aggregated across all trees into a three-dimensional array of all 8281 (91 by 91) possible class combinations for each of the 428,799 probes (Fig. 1c). In total, the array is composed of 3.55 × 10^9 data points. The absolute probe usage across all data points is 2.24 × 10^11, representing the total number of splits between in-bag samples over all trees. This 3D array serves as the starting point for all the analyses and for the development of an interactive web application described below (Fig. 1d).
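The array dimensions quoted above follow directly from the number of classes, probes, and in-bag samples. The short Python sketch below only reproduces this arithmetic; the variable names are illustrative assumptions and are not taken from the original code.

```python
# Illustrative arithmetic only; variable names are assumptions, not the original code.
N_CLASSES = 91          # 82 tumor classes + 9 control classes
N_PROBES = 428_799      # high-quality probes used by the outer classifier
IN_BAG_PER_CLASS = 8    # in-bag samples per class in each tree

class_pairs = N_CLASSES * N_CLASSES              # ordered class combinations (91 by 91)
array_cells = class_pairs * N_PROBES             # entries of the 3D probe usage array
in_bag_samples = N_CLASSES * IN_BAG_PER_CLASS    # in-bag samples per tree
sample_pairs = in_bag_samples ** 2               # ordered in-bag sample pairs per tree

print(class_pairs)            # 8281
print(f"{array_cells:.2e}")   # 3.55e+09
print(sample_pairs)           # 529984
```

The last value corresponds to the number of in-bag sample combinations per tree that is reported in the Methods.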

Global patterns of differential probe usage

We first investigated the global patterns of probe usage of the outer RF classifier. We found that a relatively small subset of probes contributed the majority of the total usage. The top 10,000 or 25,000 probes (2.3% or 5.8% of all probes) contributed 61.2% or 78.1% of the probe usage across all class combinations, respectively (Fig. 2a). In contrast, the 250,000 least used probes (58.3%) contributed only 1.3% of the total usage. Looking at probes that separated individual classes from all others, the top 10,000 probes contributed between 96.7% and 55.7% of the total usage (classes LIPN and LGG_PA_MID, respectively; see Table S1 for extended class annotations; Fig. 2b). There was high inequality in the probe usage for each class, as described by Gini indexes ranging from 0.99 for LIPN to 0.89 for LGG_PA/GG_ST (Supplementary Fig. 2a). These analyses indicate that the contribution of a majority of probes is negligible, while few probes are highly informative to distinguish between tumor classes. Among those, fewer probes are selected for the classification of biologically very distinct classes (e.g., LIPN or ETMR), while for more closely related classes the model makes use of a larger number of probes to discriminate between those classes (e.g., the members of the low grade glioma methylation class family; Supplementary Fig. 2b). Similar results were observed in an independent RF classifier trained on the same dataset, demonstrating a high stability of probe usage values across different models (Supplementary Fig. 2c–f). Moreover, probe usage values showed high concordance with feature importance values calculated using the SHAP approach35, highlighting the robustness of our method in assessing feature importances compared to established methods (Supplementary Fig. 3a–c).
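The two summary statistics used in this paragraph, the share of total usage contributed by the top-ranked probes and the Gini index, can be sketched as follows. This is a minimal Python illustration on toy values, assuming that usage is summarized by its absolute value per probe; the function names are hypothetical and the original analysis was performed in R.

```python
def top_k_share(usage, k):
    """Fraction of the total absolute usage contributed by the k most used probes."""
    ranked = sorted((abs(u) for u in usage), reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def gini(usage):
    """Gini index of absolute usage values (0 = equal usage, close to 1 = concentrated)."""
    values = sorted(abs(u) for u in usage)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values x_i with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy example: two of ten probes carry all of the (signed) usage.
toy_usage = [0, 0, 0, 0, 0, 0, 0, 0, 50, -50]
toy_share = top_k_share(toy_usage, 2)
toy_gini = gini(toy_usage)
even_gini = gini([5, 5, 5, 5])
```

In the toy example the two used probes account for all of the usage, and the Gini index is high, whereas perfectly even usage gives a Gini index of zero.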

Fig. 2: Class-specific probe usage across functional genomic regions.

a Plot shows the overall probe usage with probes ranked from the most to least used. Vertical lines indicate the fraction of the total usage of the first 10,000 and 25,000 probes. b Plot shows the cumulative probe usage for each tumor class. The probes are ranked by their usage. The vertical line indicates the first 10,000 probes. c Heatmap indicates the average usage of probes within CpG islands, Sea, Shelf, and Shore regions for each tumor class. Number of overlapping probes is indicated above. The color code indicates if probes are predominantly hypermethylated (red) or hypomethylated (blue). d Similar heatmap indicating the mean probe usage within annotated DHS and enhancers (overlapping CpG islands). DHS DNase I hypersensitive sites. e Similar heatmap indicating the mean probe usage within large-scale LADs and heterochromatic domains. LAD lamina associated domains. f Heatmaps show the probe usage in 250 bp windows surrounding the TSS of all annotated genes (5 kb up- and downstream). Heatmaps represent the total (left) or the average (right) probe usage. Histogram on top of the left heatmap indicates the number of microarray probes overlapping each window. TSS transcription start site. Source data are provided as a Source Data file.

We then asked if probes displayed differential usage according to their functional genomic localization and grouped probes according to the following annotations: CpG islands (regions of high CpG density), shores (regions within 2 kb from CpG island), shelves (regions 2–4 kb from CpG island), and open sea regions (the rest of the genome). Usage distributions showed major differences between classes. For instance, probes falling within CpG islands were frequently used to distinguish IDH-mutant gliomas (classes O_IDH, A_IDH, and A_IDH_HG) from other classes (52.0% of the average probe usage; Fig. 2c, Supplementary Fig. 4a). These probes were predominantly hypermethylated in IDH-mutant gliomas, in accordance with the previously reported CpG island methylator phenotype (CIMP)36,37. Interestingly, similar patterns were observed for tumor classes LYMPHO and ETMR (80.6% and 56.3%, respectively). Conversely, probes that were used to distinguish PITAD and LIPN classes were mostly located in shelf and open sea regions and were frequently hypomethylated (78.6% and 57.7%, respectively). These classes are characterized by low overall DNA methylation levels (Supplementary Fig. 4b).

Other functional genomic regions that often show differential DNA methylation between cell types and disease states are enhancer regions and large-scale heterochromatic domains. Hypomethylation within enhancer regions is generally associated with enhancer activation and transcription factor binding21. Heterochromatic domains, which frequently are positioned at the nuclear lamina, are condensed and transcriptionally silent regions in which DNA methylation is believed to be lost in a passive process in many different types of cancer38. To investigate probe usage in these genomic regions of the RF classifier, we overlapped probes with annotated enhancers (defined across ENCODE cell lines) and DNase I hypersensitive sites (DHS; Fig. 2d, Supplementary Fig. 4c, d). In addition, we overlapped probes with large-scale heterochromatic domains (H3K9me3-positive domains, defined in K562 lymphoblast cells and H1 embryonic stem cells) and lamina-associated domains (LADs, defined in Tig3 fibroblast cells; Fig. 2e, Supplementary Fig. 4e, f). We found that ETMR is often classified by hypermethylated probes located within DHS and CpG island enhancers. High usage of hypermethylated probes within enhancers that are not overlapping CpG islands was observed for different classes of ATRTs, supporting a previous study showing that H3K27ac enhancer landscapes distinguish between ATRT subgroups39. While probe usage patterns in LADs largely mirrored those observed in heterochromatic domains defined in K562 cells, we observed more pronounced patterns in H1 cells. Hypomethylated probes in those regions were frequently employed to classify LGG_MYB, MELAN, EFT_CIC, and CNS_NB_FOXR2 classes. Hypermethylated probes within LADs and heterochromatin domains were used for multiple GBM classes. These results indicate that genomic regions of different sizes (sub-kilobase to megabases) are employed to distinguish between different tumor classes, with potential links to their biology.

Finally, we focused on promoter regions, which are often described as the primary region of transcriptional regulation by DNA methylation. Promoter hypermethylation is generally associated with gene silencing. To this end, we grouped probes in 250 bp windows within 10 kb regions centered around the annotated transcription start sites (TSS) of all annotated genes. We observed the highest probe usage in close proximity to the TSS in the majority of classes (Fig. 2f). To test if this enrichment was due to the higher probe coverage of promoters on the Illumina DNA methylation array (32.5% of available probes are located within 1 kb of an annotated TSS), we also considered the average probe usage within windows. While hypermethylated probes in proximity to the TSS showed the highest usage for IDH-mutant tumors, this analysis showed that most other tumor classes relied on probes that were located further upstream or downstream of the TSS (Fig. 2f). This analysis suggests that probes located distal to promoter regions (e.g., in enhancers, gene bodies) are more informative for classification of most tumor classes.
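The distinction between total and average usage per window is what separates a genuine signal from an effect of probe density. A minimal Python sketch of the binning is given below. It assumes that each probe is described by a signed distance to its nearest TSS (negative values taken to mean upstream, which is an assumption) and a usage value; the function name and data layout are hypothetical.

```python
from collections import defaultdict

WINDOW = 250      # window size in bp, as in the text
MAX_DIST = 5000   # probes farther than this from the nearest TSS are dropped


def window_usage(probes):
    """probes: iterable of (signed_distance_to_tss_bp, usage).

    Returns {window_start: (total_usage, mean_usage, n_probes)}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for dist, usage in probes:
        if abs(dist) > MAX_DIST:
            continue
        # Floor division assigns negative distances to the correct window too.
        start = (dist // WINDOW) * WINDOW
        totals[start] += usage
        counts[start] += 1
    return {s: (totals[s], totals[s] / counts[s], counts[s]) for s in sorted(totals)}


# Toy example: three probes close to the TSS, one distal probe, one probe out of range.
toy_probes = [(-10, 4.0), (-100, 2.0), (-240, 6.0), (1300, 9.0), (7000, 50.0)]
result = window_usage(toy_probes)
```

In the toy example the promoter-proximal window has the higher total usage because it contains three probes, while the distal window has the higher average usage.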

High genomic redundancy of informative probes

For further analyses we focused on the inner RF model that is based on 10,000 probes selected from the outer RF model and is used for generating predictions with the Heidelberg brain tumor classifier5. Using this model, we extracted and aggregated pairwise probe usage values in the same way as for the outer classifier. We first asked if informative probes separated multiple classes from each other or were specific to individual classes (i.e., separated a single class from all others). We also asked to which degree there is redundancy between informative probes and if they map in close proximity to each other. To this end, we performed unsupervised clustering and t-SNE dimensionality reduction of all 10,000 probes according to their usage (using Pearson’s correlation coefficient as the distance measure). Our analysis indicated clearly defined groups of probes that were separated into a total of 88 clusters (Fig. 3a, Supplementary Fig. 5a). The number of probes per cluster was highly variable and ranged from 2 to 211 (median of 117; Supplementary Fig. 5b). Interestingly, we found that the majority of clusters were associated with a single tumor class, indicating high class specificity of most selected probes (Fig. 3b). Specifically, 71 of 88 clusters (80.7%) were associated with a specific tumor class. For instance, to classify ETMR samples, the model predominantly employed probes belonging to cluster 27 (hypermethylated, n = 136 probes) and cluster 78 (hypomethylated, n = 44 probes; Fig. 3c, d). We also identified 17 clusters (19.3%) that were associated with 2 or 3 tumor classes. Among these were clusters associated with GBM, IDH-mutant glioma, and PITAD methylation class families of closely related tumor classes (Fig. 3b). For example, probes belonging to cluster 1 (hypermethylated, n = 211) were associated with classes O_IDH, A_IDH, and A_IDH_HG (Fig. 3e, f). When looking at probes that distinguish these three classes from each other, we found that a smaller set of probes within cluster 1 was specifically used to distinguish O_IDH from the other two, while a few other probes scattered across different clusters were employed to split A_IDH and A_IDH_HG (Supplementary Fig. 5c).

Fig. 3: Unsupervised clustering of probes by their usage.

a t-SNE dimensionality reduction and unsupervised clustering of 10,000 selected probes according to their usage (Pearson’s distance). A total of 88 clusters are identified. Individual clusters of probes are numbered. b Heatmap shows the total probe usage by class and by cluster. Blue color indicates clusters in which probes are predominantly hypomethylated in a given tumor class, red color indicates clusters which are predominantly hypermethylated. Clusters (columns) and tumor classes (rows) are ordered by hierarchical clustering. Values are scaled by column. c t-SNE as in panel (a) showing the probe usage in ETMR compared to all other classes. The color code indicates hypomethylated (blue) and hypermethylated (red) probes. d Scatterplot showing the usage of probes belonging to cluster 78 (n = 44) for each tumor class. e t-SNE as in panel (a) showing the probe usage in A_IDH compared to all other classes. f Scatterplot showing the usage of probes belonging to cluster 1. g Karyotype plot showing the genomic location of the probes belonging to cluster 78. h Similar karyotype plot showing probes belonging to cluster 1. Source data are provided as a Source Data file.

We next asked if probes belonging to a given cluster mapped in close proximity to each other (e.g., multiple probes associated with the same gene). We found that probes from each cluster were distributed over many regions across the genome and did not show a specific enrichment towards a particular chromosome or region (Supplementary Fig. 5d). The 44 probes belonging to cluster 78 (ETMR-specific probes) mapped to 32 distinct genomic regions (Fig. 3g). For cluster 1 (IDH-mutant glioma-specific probes, n = 211), we identified 106 genomic regions (Fig. 3h). These results indicate a high genomic redundancy of probes employed by the classifier that may mitigate potential intra-class variability between individual patient samples (e.g., due to stochastic copy-number changes). A main characteristic of the RF algorithm is that it derives predictions across a large number of decision trees. Making use of nearly redundant probes across the genome may explain the high robustness of DNA methylation array-based tumor classification.

Interpretable AI yields insights into tumor biology

To make our interpretable framework accessible to the research community, we developed a user-friendly web application utilizing the shiny R package (“shinyMNP”, https://hovestadtlab.shinyapps.io/shinyMNP/). Our application consists of four main panels that allow the user to query and explore the dataset in different ways. In the first panel, users can explore the top probes employed by the classifier to identify a given class (Supplementary Fig. 6a). The second panel of the app allows users to identify probes that distinguish any two classes (Supplementary Fig. 6b). The third panel visualizes the total probe usage for selected genes of interest across all classes (Supplementary Fig. 6c). The interactive heatmap shows the usage of associated probes across all possible class combinations. The final panel visualizes the top genes associated with the most used probes for each class, represented as a directed network in which arrows connect genes to associated tumor classes (Fig. 4a).

Fig. 4: shinyMNP interface and illustrative class-specific genes.

a Screenshot of the shinyMNP web application shows a network illustration of the top five probe-associated genes for each tumor class. Black and colored nodes represent genes and tumor classes, respectively. Width of edges indicates the total probe usage for each gene. Some genes may be associated with multiple tumor classes. b Genome plot of a 12 kb region surrounding the SHPRH promoter highlights probes that are specifically hypermethylated in ETMR. Red indicates methylated probes (beta-value of 1), blue indicates unmethylated probes (beta-value of 0) for each tumor class. Total probe usage and exact location of probes are indicated on top. SHPRH is silenced in ETMR. c Genome plot of the PWWP3A gene highlighting pronounced hypomethylation across the entire gene body specific to HGNET_MN1. PWWP3A is highly expressed in this tumor class. d Genome plot of the TBX19 gene showing hypomethylation of the promoter specific to PITAD_ACTH. e Genome plot of the RET oncogene showing hypermethylation of a small region near the fourth exon specific to HGNET_BCOR. RET is highly expressed in this tumor class.

Using our shinyMNP web application, we investigated class-specific probes and their associated genes for potential biological discovery. We selected four illustrative examples that demonstrate the application of our interpretable framework. As a first example, we identified the promoter region of SHPRH that was specifically hypermethylated in ETMR, a tumor class that is characterized by chromosomal instability and high levels of R-loops40,41 (Fig. 4b). In accordance, gene expression analysis highlighted pronounced SHPRH downregulation specifically in ETMR tumors (Supplementary Fig. 6d). The E3-ligase SHPRH poly-ubiquitinates PCNA to prevent genomic instability from stalled replication forks that may be caused by R-loops42,43,44,45. Our observation suggests that silencing of SHPRH may contribute to the chromosomal instability and high levels of R-loops in ETMR. Second, we identified a number of probes throughout the entire gene body of PWWP3A (also known as MUM1) that were specifically hypomethylated in the recently defined tumor class HGNET_MN1 (Fig. 4c). PWWP3A is exclusively expressed in this tumor class (Supplementary Fig. 6e) and is involved in DNA damage repair and chromatin organization46. The third example highlights the hypomethylation of the broader promoter region of the TBX19 transcription factor (also known as TPIT) in PITAD_ACTH and relates to the pituitary origin of this tumor class (Fig. 4d). Immunohistochemistry for TPIT has been established as a specific marker for the identification of PITAD_ACTH in routine diagnostics47. Inactivating mutations of TBX19 are associated with isolated deficiency of adrenocorticotropic hormone (ACTH), indicating a crucial role of this transcription factor in the regulation of the pituitary corticotroph lineage48,49. As a last example, we identified pronounced hypermethylation of a single probe near the fourth exon of the proto-oncogene RET specific to the recently defined tumor class HGNET_BCOR, in which RET is highly expressed (Fig. 4e; Supplementary Fig. 6f)50. RET encodes a transmembrane receptor of the tyrosine protein kinase family and is an important gene in CNS development51,52,53,54. While it is unknown how the observed hypermethylation is associated with the high expression of RET in HGNET_BCOR, we postulate that overexpression may be due to hypermethylation of a regulatory element, as multiple distal enhancers have been identified in this genomic region. RET may represent an HGNET_BCOR-specific vulnerability, as potent targeted drugs have recently been FDA-approved for the treatment of other RET-altered tumors55.

In summary, our interpretable framework reveals associations between specific genes and tumor classes that may be employed for future studies in the context of biomarker discovery, therapeutic target identification, and cancer biology research of CNS tumors. Our publicly accessible web application of the interpretable framework enables researchers of the scientific community to perform similar analyses across all classes included in the Heidelberg brain tumor classifier.

Discussion

The ability to classify CNS tumors based on their DNA methylation profiles using artificial intelligence approaches has irrevocably changed brain tumor classification in clinical practice and in research. Despite the usefulness of the Heidelberg brain tumor classifier, a clear understanding of the classifier’s inner decision-making process has been missing. To this aim, we developed an interpretable framework to better explain its underlying working rules.

By calculating the pairwise probe usage between classes over all RF trees, we simultaneously enable both global and local interpretability. DNA methylation patterns that are globally associated with individual tumor classes are readily identified, such as CpG island hypermethylation or hypomethylation in broad heterochromatic regions. On the other hand, our framework allows us to pinpoint probes that are locally important to distinguish between pairs of select tumor classes. Results may sometimes overlap, as a probe can have both high global and local importance for the same class. For instance, for some methylation class families, such as the IDH-mutant glioma classes, a group of redundant probes are used to separate these classes from all the other classes at the global level. After this high-level separation, the algorithm selects some of the few class-specific probes, as identifiable with our pairwise analysis. In the research context, this approach can be applied to annotate DNA methylation patterns that are shared among different classes and families of brain tumors and at the same time to uncover genomic sites that are unique to a single class. In addition to their value as biomarkers for classification and diagnostic relevance, tumor-associated genes could reveal valuable targets for the development of precision therapies, such as the association between RET and HGNET_BCOR, for which targeted inhibitors have recently become available.

Importantly, our interpretable framework can be readily adapted to future versions of the Heidelberg brain tumor classifier, for example to discover relevant probes and DNA methylation patterns associated with additional tumor classes. Our approach can also be transferred to other tumor classifiers that are based on DNA methylation profiling and the RF algorithm14,15. For example, applying our approach to an RF classifier of a recently published cohort of sarcoma samples15 showed that the main principles unveiled from the CNS tumor classifier also hold true in other DNA methylation-based tumor classifiers (Supplementary Fig. 7). Furthermore, motivated by recent advances using nanopore sequencing technology56,57,58,59,60, we foresee the implementation of point-of-care assays that make use of reduced subsets of informative and non-redundant CpG sites based on our findings to allow for more affordable and rapid diagnosis. Our results also have potential applications in the context of liquid biopsies for early cancer detection, classification, and disease monitoring over time from circulating tumor DNA61. Our study provides a unique reference for incorporating DNA methylation profiling into these studies.

Overall, our interpretable framework provides a better understanding of the underlying working rules of the Heidelberg brain tumor classifier. Our resource will facilitate the discovery of disease biomarkers and therapeutic targets, and support the development of bioinformatic pipelines, machine learning models and point-of-care assays for rapid diagnostics, early detection, and disease monitoring.

Methods

Datasets

DNA methylation profiles (from Illumina Infinium HumanMethylation450K BeadChip arrays) from 2,801 samples of the reference dataset of the Heidelberg brain tumor classifier5 were downloaded as IDAT files from GEO (GSE109381 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109381]). Samples were associated with 82 molecular tumor classes and 9 control classes. Processed gene expression data (Affymetrix U133 Plus 2.0 Array) were retrieved from GEO (GSE73038 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73038]) and annotations were obtained from the original publication50. The following probes were selected: 211421_s_at (RET); 221290_s_at (MUM1/PWWP3A); 226366_at (SHPRH). All computational analyses were performed in R (v.4.1.3). The pre-processed DNA methylation array data of the sarcoma cohort15 were retrieved from GEO (GSE140686 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140686]).

Random forest models

Random forests consist of many binary decision trees, which are simple ML algorithms that resemble the human cognitive process of decision-making. The algorithm constructs individual trees by selecting features that offer the best split between remaining samples, resulting in a flowchart-like structure. Terminal nodes represent the different classes that can be predicted. Because decision trees are composed of a limited number of features, they are readily interpretable. At the same time, individual trees suffer from low accuracy in complex scenarios and are prone to overfitting. In other words, their rigid but transparent structure limits their ability to learn and generalize. To overcome these limitations, RF aggregates the predictions of multiple decision trees, each trained on a subset of samples and features.

In order to make the Heidelberg CNS tumor classifier interpretable, we used the trained RF models (“outer” and “inner” classifier) from the original publication (v.11b2)5. The inner classifier was trained using 10,000 probes that were selected from the outer classifier trained on all 428,799 high-quality probes. Probes were selected based on their per-class importance (MeanDecreaseAccuracy value). We first calculated the rank of each probe within every class (low rank = high MeanDecreaseAccuracy), and subsequently selected the 10,000 probes with the lowest rank across classes5.
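The rank-based selection can be illustrated with a small Python sketch. It assumes one plausible reading of the description above, namely that each probe is scored by its best (lowest) rank over all classes; the tie-breaking by probe identifier, the function name, and the toy values are assumptions made for illustration and are not taken from the original implementation.

```python
def select_probes(importance, n_select):
    """importance: {class_name: {probe_id: per-class importance value}}.

    Ranks probes within each class (rank 1 = most important) and keeps the
    n_select probes with the lowest best rank over all classes.
    Ties are broken by probe identifier (an arbitrary choice for this sketch).
    """
    best_rank = {}
    for per_probe in importance.values():
        ordered = sorted(per_probe, key=lambda p: (-per_probe[p], p))
        for rank, probe in enumerate(ordered, start=1):
            best_rank[probe] = min(rank, best_rank.get(probe, rank))
    return sorted(best_rank, key=lambda p: (best_rank[p], p))[:n_select]


# Toy example with two classes and four probes.
toy_importance = {
    "class_A": {"cg1": 0.9, "cg2": 0.1, "cg3": 0.2, "cg4": 0.0},
    "class_B": {"cg1": 0.0, "cg2": 0.8, "cg3": 0.3, "cg4": 0.1},
}
selected = select_probes(toy_importance, 2)
print(selected)  # ['cg1', 'cg2']
```

Under this reading, a probe that is highly informative for a single class (cg1 or cg2 in the toy example) is preferred over a probe that is moderately informative for several classes (cg3), which favors class-specific probes.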

To re-train the outer RF model (Supplementary Fig. 2c–f), we used the randomForest R package (v.4.7-1.1) and the following parameters: ntree = 10000, mtry = 654, sampsize = rep(8, 91), importance = T. To train an outer RF model for the sarcoma cohort (Supplementary Fig. 7), we used the following parameters: ntree = 10000, mtry = 654, sampsize = rep(7, 65). The inner RF sarcoma model was trained using the 10,000 most important features (overall MeanDecreaseAccuracy value from the outer classifier) and using mtry = 100. All RF models were trained with the option keep.inbag = T, which is required for the subsequent steps.
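As a side note, the mtry values listed above coincide with the square-root heuristic that is commonly used for classification forests (the number of candidate features per split is the floor of the square root of the number of features). The text does not state how the values were chosen, so the Python check below is only an observation about the numbers; the function name is hypothetical.

```python
import math


def sqrt_mtry(n_features):
    """Floor of the square root of the number of features."""
    return math.isqrt(n_features)


print(sqrt_mtry(428_799))  # 654, matches mtry of the outer classifier
print(sqrt_mtry(10_000))   # 100, matches mtry of the inner sarcoma model
```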

To calculate SHAP feature importances, we trained an independent random forest model in Python (v.3.12.5) with the scikit-learn library (v.1.5.1) using the same 10,000 selected features, achieving an out-of-bag error of 0.04. We calculated Shapley values using the shap library (v.0.45.1). To reduce computing time, Shapley values were calculated for a single sample in each of the 91 classes (similarly using one sample per class as the background).

Extraction of random forest probe usage

We assumed that the number of times a feature (i.e., a DNA methylation probe) is selected to perform a split between tumor classes reflects the importance of that particular feature in that context. Since this information is not directly accessible from the innate structure of a trained RF model object, we calculated the probe usage as follows: In an iterative process over each of the 10,000 trees, all paths from the root node to each terminal node were extracted. Next, those paths were compared in a pairwise manner, annotating the probe ID and its methylation status at the node that splits two terminal nodes (annotated as −1 for hypomethylated and 1 for hypermethylated). From these pre-computed comparisons, we retrieved the splitting nodes for all 529,984 possible combinations between in-bag samples (8 samples for each of the 91 classes). Subsequently, splitting nodes were aggregated at the tumor class level across all trees (representing the probe usage). Probe usage values were stored as a 3D array in which each layer (428,799 probes) is composed of a symmetric matrix representing the probe usage of all 91 by 91 (8281) possible class comparisons. The probe usage can be summarized as the total or average probe usage per class or across all classes. For these steps the data.table (v.1.14.2) and iterpc (v.0.4.2) R packages were used. The code for extracting probe usages from a trained RF model is accessible at https://github.com/hovestadt/shinyMNP.
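The extraction procedure can be illustrated on a toy tree. The Python sketch below is not the original R implementation (which is available in the repository above); the nested-dictionary tree layout, the function names, and the convention that the sign refers to the first class of each ordered pair are assumptions made for illustration. For two terminal nodes, the splitting node is the first position at which their root-to-leaf paths diverge.

```python
from collections import defaultdict
from itertools import permutations


def leaf_paths(node, prefix=()):
    """Yield (path, sample_classes) for each terminal node.

    A path is a tuple of (probe, direction) steps, with direction -1 for the
    branch below the threshold (hypomethylated) and +1 for the branch above.
    """
    if "samples" in node:
        yield prefix, node["samples"]
        return
    yield from leaf_paths(node["low"], prefix + ((node["probe"], -1),))
    yield from leaf_paths(node["high"], prefix + ((node["probe"], +1),))


def splitting_step(path_a, path_b):
    """Return (probe, direction taken by path_a) at the node where the paths diverge."""
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            return step_a
    raise ValueError("paths of two distinct terminal nodes must diverge")


def probe_usage(trees):
    """Aggregate signed usage as usage[(class_a, class_b)][probe] over all trees."""
    usage = defaultdict(lambda: defaultdict(int))
    for tree in trees:
        leaves = list(leaf_paths(tree))
        for (path_a, samples_a), (path_b, samples_b) in permutations(leaves, 2):
            probe, sign = splitting_step(path_a, path_b)
            for class_a in samples_a:        # one entry per in-bag sample
                for class_b in samples_b:
                    usage[(class_a, class_b)][probe] += sign
    return usage


# Toy tree: the root separates class A from B and C; a child node separates B from C.
toy_tree = {
    "probe": "cg_root",
    "low": {"samples": ["A", "A"]},
    "high": {
        "probe": "cg_child",
        "low": {"samples": ["B"]},
        "high": {"samples": ["C"]},
    },
}
usage = probe_usage([toy_tree])
print(usage[("A", "B")]["cg_root"])   # -2
print(usage[("C", "B")]["cg_child"])  # 1
```

In this sketch the absolute usage is symmetric between the two orderings of a class pair, while the sign flips with the order because it describes the methylation state of the first class relative to the second.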

Analysis of functional genomic regions

Probes were grouped by tumor classes and mean probe usage was calculated for each category of functional genomic regions. CpG island, shore, shelf, sea, DHS, and enhancer annotations were extracted from the Illumina array annotation file (HumanMethylation450 manifest file v1.2, downloaded from the Illumina website). Positions of lamina-associated domains (LADs, identified in human Tig3 fibroblasts) and heterochromatin domains (identified in H1-hESC and K562 cell lines) were downloaded from the ENCODE portal. The human genome assembly hg19 and RefSeq gene annotation were retrieved from the UCSC genome browser website. Probes were mapped to the nearest TSS using functions from the GenomicRanges Bioconductor package (v.1.46.1). Probes with a distance greater than 5000 bp were removed. The total and average probe usage was plotted as a function of the distance to the nearest TSS for each tumor class in 250 bp windows.

Dimensionality reduction and clustering

Vectorized probe usages were employed to calculate the 10,000 by 10,000 Pearson’s distance matrix. This matrix was used as input for a t-SNE dimensionality reduction calculated using the Rtsne R package (v.0.16) and the following parameters: theta = 0.1, pca = F, num_threads = 0, is_distance = T, max_iter = 5000. For unsupervised clustering, we first calculated the k-nearest neighbors and constructed the shared nearest neighbor (SNN) graph from the distance matrix. Then, clusters were identified using the Louvain algorithm to optimize the modularity function. The Seurat R package (v.4.1.1) was employed for this step. To associate each cluster of probes with tumor classes, we first calculated the fraction of probe usage over the total for each cluster and class. A tumor class was associated with a cluster if the fraction was greater than 0.1 (10%). The genomic location of probes within clusters was plotted using the karyoploteR Bioconductor package (v.1.20.3).
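Two steps of this procedure lend themselves to a short illustration: the distance measure and the rule for associating clusters with classes. The Python sketch below assumes that Pearson’s distance is defined as one minus the Pearson correlation coefficient and that the fraction is computed per cluster, relative to the total absolute usage of that cluster over all classes. Both are assumptions, as the text does not spell out these details, and the function names are hypothetical.

```python
import math


def pearson_distance(x, y):
    """Pearson distance, taken here as 1 - r (0 = perfectly correlated, 2 = anti-correlated)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1 - cov / (sd_x * sd_y)


def associated_classes(cluster_usage, cutoff=0.1):
    """cluster_usage: {class_name: absolute usage of the probes in one cluster}.

    Returns the classes that contribute more than `cutoff` of the cluster's total usage.
    """
    total = sum(cluster_usage.values())
    if total == 0:
        return []
    return sorted(c for c, u in cluster_usage.items() if u / total > cutoff)


# Toy examples with made-up usage values.
d_same = pearson_distance([1, 2, 3], [2, 4, 6])
d_opposite = pearson_distance([1, 2, 3], [3, 2, 1])
single = associated_classes({"ETMR": 90, "GBM_MID": 6, "LIPN": 4})
shared = associated_classes({"O_IDH": 40, "A_IDH": 35, "A_IDH_HG": 20, "LIPN": 5})
```

With these toy values the first cluster is associated with a single class and the second with three classes of the same family, mirroring the two situations described in the Results.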

Development of the shinyMNP web application

To create an interactive web application that allows researchers to access our interpretable framework we used the same 3D probe usage array as described above and precomputed different summaries for rapid access. The app was developed using the shiny (v.1.7.1), tidyverse (v.1.3.1), ggplot2 (v.3.3.5), shinythemes (v.1.2.0), rhdf5 (v.2.38.1), plotly (v.4.10.0), heatmaply (v.1.3.0), igraph (v.1.3.1) and visNetwork (v.2.1.0) R/Bioconductor packages.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.