Introduction

Cancer, classified as a genetic disease, is characterized by the abnormal and uncontrolled proliferation of cells, which results from somatic mutations acquired in the DNA of cancerous cells1. During cancer evolution, driver mutations undergo positive selection, conferring a growth advantage on the cells that harbor them. These mutations occur more frequently than expected from genome-wide mutation rates, exhibit a stronger functional impact, and are enriched in hallmark pathways and driver genes of cancers2,3. Accordingly, a cancer driver gene has been defined as one whose mutations promote net cellular growth under specific microenvironmental conditions within the cell in vivo4. Since the establishment of genetics, identifying driver genes across tumors has been one of the major motivations for cancer research5, and it is essential for developing targeted anti-cancer therapies and discovering genomic biomarkers of prognosis and therapy response6.

A substantial body of research has demonstrated that driver mutations can contribute to cancer progression without modifying the amino acid sequence of a protein. In fact, over 90% of somatic variants occur in non-coding regions, and increasing evidence underscores the significance of non-coding variants in cancer development7. For instance, highly recurrent mutations within the core promoter of the oncogene TERT have been observed to generate binding motifs for E-twenty-six (ETS) transcription factors (TFs), up-regulating TERT expression in melanoma, and have been identified as driver mutations across multiple tumor types, including those of the central nervous system, bladder, thyroid, and skin8,9. Additionally, Puente et al. discovered recurrent mutations in the 3′UTRs of NOTCH1, which caused abnormal splicing and enhanced NOTCH1 activity, driving a more aggressive disease in chronic lymphocytic leukemia10. Furthermore, Zeraati et al. experimentally validated mutations in the 5′UTRs of TAOK2, BCL2, and CXCL14, showing that these mutations alter mRNA translation in cancer. Mutations in non-coding regions are thus closely linked to cancer initiation, metastasis, and drug resistance, providing important avenues for new diagnostic methods and therapeutic strategies.

The availability of large-scale whole-genome sequencing (WGS) data for cancer has made it possible to explore the role of driver events in coding and non-coding regions. As computer technology advances, intelligent algorithms have gradually been recognized as effective tools for predicting cancer driver genes11. Current approaches can be broadly classified into several categories12: mutation frequency-based methods, such as DriverML13, ActiveDriver14, and OncodriveFML15; sub-network-based methods, including DriverNet16, DawnRank17, and MUFFINN18; biophysical models and integrated machine learning methods for identifying functional non-coding mutations19,20,21,22; and methods that identify driver genes by analyzing hotspot mutation regions, where recurrent somatic mutations occur in protein functional regions or disrupt active sites, as seen in tools like DriverDB v323, MSEA24, and DiffMut25. In particular, the MSEA method hypothesizes that genes with mutation hotspots may serve as candidate cancer genes. The MSEA-clust module within MSEA modifies the Kolmogorov-Smirnov test to detect mutation enrichment patterns along gene transcripts. In this study, building on the concept that driver mutations occur at a higher frequency than the overall mutation rate across the genome, and on the significance of mutations in both coding and non-coding regions in cancer development, we used MSEA-clust to identify mutation enrichment regions within coding and non-coding genomic elements, aiming to pinpoint potential candidate driver genes for cancer.

In addition, despite the availability of various computational methods for identifying potential driver genes, it remains unclear how these driver genes induce extensive transcriptional changes and contribute to malignant phenotypes26. In recent years, an increasing number of studies have demonstrated that cancer occurrence is not solely driven by single gene mutations but is also closely associated with alterations in gene sets. Cancer development is typically the outcome of multiple molecular events, involving various genes, pathways, and their complex interactions27. In general, analyzing gene sets that constitute pathways and biological processes may provide deeper insights into the molecular mechanism of carcinogenesis than examining individual genes27. Researchers have identified numerous important gene sets across various research contexts. For instance, a senescence gene set was defined to identify senescent cells and predict pathways associated with senescence28. Clarke et al. developed an mRNA biomarker set as a predictor for post-treatment Lyme disease29. A mutation-based gene set was established to forecast survival benefits following immunotherapy for multiple cancer types30. Additionally, CRSO is a method designed to identify essential and synergistic alteration modules that drive cancer in individual patients, aiming to identify a rule set that represents the unique minimal combinations capable of explaining the study population of a given tumor31. Therefore, integrated analysis of multidimensional data has significant potential to identify a core and robust gene set and enhance our understanding of the mechanisms underlying cancer development and progression. However, studies focused on identifying a core driver gene set shared across multiple cancers and investigating its functional impact on oncogenic processes in a pan-cancer context are limited.

Here, we developed geMER to identify candidate driver genes for both coding and non-coding genome elements. We assessed the performance of geMER using several key metrics and compared its results with other genome-wide tools. We investigated the potential biological roles of candidate drivers in cancer development and progression. Subsequently, we systematically identified a novel core driver gene set (CDGS) for multiple cancer types and characterized its mutation landscape. We described the potential driving mechanisms of CDGS through multi-omics analysis encompassing somatic mutations, copy number variations, transcription, DNA methylation, TFs, and histone modifications. Moreover, we demonstrated the ability of CDGS mutations to predict survival outcomes and the response to immune checkpoint inhibitor (ICI) therapy in pan-cancer patients. Finally, we integrated our geMER results into an interactive, web-based resource, freely available at http://bio-bigdata.hrbmu.edu.cn/geMER/. This platform enables users to explore candidate driver genes for 33 cancer types from TCGA or upload their own mutation data to identify candidate driver genes for a given cancer.

Results

geMER: A pipeline for genome-wide identification of potential cancer drivers

Over the years, protein-coding somatic variants have been successfully mapped by exome-based analyses. However, the detection of positive selection signals in non-coding elements also highlights driver events that contribute to tumor development32. We therefore analyzed non-synonymous somatic mutations from WGS data across 33 cancer types generated by TCGA, along with five genomic elements provided by the PCAWG33. When we aligned mutations to both coding and non-coding genomic elements, 94.3% of mutations were located within these elements, distributed as follows: CDS (41.2%), promoters (10.3%), splice sites (32.9%), 3’UTRs (11.3%), and 5’UTRs (4.3%) (Fig. 2A). Considering the functional impact of mutation enrichment within both coding and non-coding elements, we developed geMER, a method that quantifies the enrichment of mutations and detects enrichment regions across each genomic element to identify candidate driver genes (see Methods; Fig. 1). Across 33 cancer types from TCGA, we identified a total of 16,667 candidate drivers out of 22,026 eligible unique genes, based on 2.54 million somatic mutations, including 15,270 in CDS, 5705 in promoters, 13,784 in splice sites, 8217 in 3’UTRs, and 3387 in 5’UTRs (adj.p < 0.05, Fig. 2B).
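
To make the notion of a mutation enrichment region concrete, the following minimal sketch scans a single genomic element with a Kolmogorov-Smirnov-like running sum and reports the segment whose share of mutations most exceeds a uniform expectation. It is an illustration only, not the geMER or MSEA-clust implementation: the function names (`enrichment_region`, `permutation_pvalue`), the uniform background, and the permutation scheme are all assumptions introduced here.

```python
import random
from collections import Counter


def enrichment_region(positions, length):
    """Return (start, end, score) for the most mutation-enriched segment.

    positions: 0-based mutation coordinates within one element (repeats allowed).
    length:    element length in bp.
    The running sum rises by 1/n at each mutation and falls by 1/length at
    every base; the segment with the largest net rise is reported.
    """
    n = len(positions)
    if n == 0 or length <= 0:
        raise ValueError("need at least one mutation and a positive length")
    counts = Counter(positions)
    running = 0.0                 # running sum after the current base
    low, low_idx = 0.0, -1        # lowest running sum seen so far and where
    best = (0, 0, 0.0)
    for i in range(length):
        running += counts.get(i, 0) / n - 1.0 / length
        if running - low > best[2]:
            best = (low_idx + 1, i, running - low)
        if running < low:
            low, low_idx = running, i
    return best


def permutation_pvalue(positions, length, n_perm=1000, seed=0):
    """Empirical p-value of the observed score under uniformly placed mutations."""
    rng = random.Random(seed)
    observed = enrichment_region(positions, length)[2]
    n = len(positions)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(length) for _ in range(n)]
        if enrichment_region(shuffled, length)[2] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

For a toy 200-bp element in which 10 of 12 mutations fall at positions 50 to 54, `enrichment_region` returns the segment (50, 54). In a real analysis the background would need to account for sequence context and regional mutation rates, which this sketch deliberately ignores.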

Fig. 1: A schematic of geMER.

The pipeline for identifying mutation enrichment regions for genomic elements.

Fig. 2: Identification of candidate drivers for coding and non-coding elements across 33 cancer types.

A Proportion of mutations located in genomic elements across cancer types. B The number of candidate drivers called by geMER within five genomic elements for each cancer. C Comparison of CGC driver gene proportions between healthy and tumor cohorts. The bar chart in the upper panel displays the number of mutation records for healthy and tumor samples. D Comparison of the CGC enrichment −log10(P) for driver genes among geMER, OncodriveFML, ActiveDriverWGS, and DriverPower. E Comparison of F1 scores among geMER, OncodriveFML, ActiveDriverWGS, and DriverPower. F The number of CGC genes enriched in candidate drivers (left) and the precision (right) for each tool. G Overlaps among drivers for genomic elements detected by geMER and other tools.

To benchmark geMER, we detected mutation enrichment regions within CDS using geMER across five independent healthy cohorts. The mutation record counts of these healthy cohorts were comparable to or exceeded those in the corresponding cancer types from TCGA. As a gold standard for CDS drivers, we used the COSMIC CGC, a catalog of driver genes whose mutations are causally linked to the development of cancer34. By comparing the proportion of CGC driver genes between the healthy and cancer cohorts, we observed that the proportion in the healthy cohorts was consistently and significantly lower than in the corresponding cancer cohorts (Fig. 2C). This finding underscores the capability of geMER to identify potential driver genes specific to tumors. Additionally, we evaluated candidate drivers for genomic elements called by geMER against three other published genome-wide detection tools: ActiveDriverWGS3, OncodriveFML15, and DriverPower35. For comparison, we input the same mutation data and detected drivers using their default filter criteria (DriverPower, q < 0.1; OncodriveFML, q < 0.1; ActiveDriverWGS, FDR < 0.05). We evaluated two metrics: the enrichment of CGC genes among candidate drivers, as a proxy for sensitivity, and the F1 score, as a balanced indicator of precision and sensitivity. We found that geMER outperformed other methods across most cancer types, particularly in PRAD, READ, and OV (Fig. 2D, E). Additionally, the number and proportion of CGC genes among geMER results were relatively higher (Fig. 2F). Furthermore, we noticed a substantial overlap in candidate drivers for genomic elements between geMER and other methods, especially ActiveDriverWGS and DriverPower (Fig. 2G). When geMER was applied to the COSMIC dataset to identify candidate CDS drivers across 34 cancer types, it identified candidate drivers in 27 cancer types. High enrichment of known cancer genes and strong F1 scores were observed across cancer types (Fig. S1).
In conclusion, geMER effectively identifies candidate drivers for both coding and non-coding elements in various cancers.
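
The two benchmarking metrics can be illustrated with a short sketch that scores a predicted driver list against a reference catalog such as the CGC. The hypergeometric upper tail is used here as the enrichment test, which is an assumption (the text above does not state which test underlies the enrichment P-value), and the function names `hypergeom_sf` and `benchmark` are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes from `population`, of which
    `successes` are reference (e.g. CGC) genes."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def benchmark(predicted, reference, universe):
    """Precision, recall, F1, and enrichment p-value of predicted drivers."""
    universe = set(universe)
    predicted = set(predicted) & universe
    reference = set(reference) & universe
    tp = len(predicted & reference)          # reference genes recovered
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    p_value = hypergeom_sf(tp, len(universe), len(reference), len(predicted))
    return {"precision": precision, "recall": recall, "f1": f1, "p": p_value}
```

Because the F1 score is the harmonic mean of precision and recall, a tool that calls very many genes can recover most reference genes and still score poorly, which is why it complements the enrichment P-value.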

Candidate drivers were associated with dysregulated expression and patient prognosis

We conducted a deeper functional analysis of the candidate element drivers identified by geMER. Abnormalities in the regulation of mRNA and protein expression are hallmarks of many human diseases, including cancer36. Therefore, for each driver, we performed four types of comparison analysis. First, we compared expression levels between healthy and tumor samples using the GTEx and TCGA datasets (Fig. S2A). Second and third, we grouped tumor patients into mutant and wild-type groups based on their mutation status and compared both gene expression levels and overall survival (OS) between the groups. Finally, we stratified patients into high- and low-expression groups according to the median expression level and compared intergroup survival outcomes. In total, 12,996 out of 16,667 candidate drivers were significantly associated with either dysregulated expression or patient prognosis in at least one cancer type. Candidate element drivers affecting expression or survival outcomes were found in numerous cancers (Fig. S2B). For most cancer types, the largest proportion of candidate drivers showed differential expression between normal and tumor groups, underscoring their driving roles in cancer progression (Fig. 3A).

Fig. 3: Functional validation of candidate drivers for five genomic elements across 33 cancer types.

A The proportional distribution of significant candidates in four types of analysis for each element. B–D The expression boxplots and Kaplan–Meier curves of comparison groups for B PDGFRA in LGG, C PPL in OV, and D LAG3 in UCEC. E Immunohistochemistry (IHC) staining of PDGFRA in normal cerebral cortex and glioma tissues. F IHC staining of PPL in normal ovary and ovarian cancer tissues. G The clusters of somatic mutations in protein domains of LAG3 in UCEC. H Correlations between expression and methylation level of LAG3 in UCEC. A p-value < 0.05 was used to show significant results. I Visualization of the association of LAG3 expression with methylation sites within DNA sequence, somatic mutations, CNV, and overall survival in UCEC.

Since a gene can be a candidate for multiple genomic elements, we focused on unique drivers for each cancer type, regardless of the elements (Supplementary Data 1). Across the four types of analysis above, three genes were significant in all analyses within their respective cancers (Fig. S2C). Among them, PDGFRA is a well-known oncogene for glioma37, showing significantly upregulated expression in LGG. High PDGFRA expression was identified as a high-risk factor for patients with advanced tumors (Fig. 3B). Similarly, PPL was significantly upregulated in OV, and its high expression level was linked to poor prognosis, suggesting its potential as an oncogene (Fig. 3C). Consistently, overexpression of PPL has been reported to be associated with worse survival outcomes in ovarian cancer patients38. We also observed a higher PDGFRA protein expression level in glioma compared to normal tissue, as evidenced by immunohistochemical (IHC) results from The Human Protein Atlas39 (Fig. 3E). PPL also showed higher protein expression in ovarian cancer (Fig. 3F). PDGFRA mutations clustered within the functional protein domains Ig1_PDGFR-alphabeta and Ig4_PDGFR-alpha in LGG. PPL mutations were concentrated in the SPEC domain in OV (Fig. S2D). In contrast, LAG3 was significantly downregulated in UCEC. Patients with high LAG3 expression showed a better survival outcome, suggesting that LAG3 is a probable tumor suppressor gene (Fig. 3D). Consistently, the expression of LAG3 has been associated with prognosis in a range of tumor types40. Numerous missense mutations clustered in its Ig-like domain, indicating that these mutations may impact the immune system and contribute to cancer suppression (Fig. 3G).

In addition, mutations of PDGFRA, PPL, and LAG3 were significantly related to their expression levels. Mutations of PDGFRA and PPL were linked to poorer survival outcomes, while patients with LAG3 mutations showed an improved prognosis (Fig. 3B–D). MEXPRESS41 was subsequently employed to visualize the complex interplay among their expression, copy number variations, DNA methylation, and somatic mutations. The summary view results confirmed a significant correlation (p-value < 0.05) between mRNA expression and methylation levels of LAG3 in UCEC and PDGFRA in LGG (Fig. 3H and Fig. S2E). Their expression was significantly positively correlated with copy number variations and OS while showing negative correlations with numerous promoter methylation probes. Additionally, increased expression was associated with a higher number of somatic mutations (Fig. 3I and Fig. S2F). The observed correlations and dysregulation across multiple omics layers provide strong evidence that the candidate drivers are likely involved in critical mechanisms of cancers and play a pivotal role in tumorigenesis.

Identification and mutation landscape overview of a core driver gene set

To comprehensively capture the biological processes regulated by candidate element drivers, we performed an integrative pathway enrichment analysis across 33 cancer types. In total, 97 significantly enriched pathways involving 2497 unique functional drivers were identified across 25 cancer types (Supplementary Data 2). As expected, major pathways were co-supported by coding and non-coding elements, highlighting the importance of both coding and non-coding drivers in cancers (Fig. S3A, B). Fifteen pathways closely associated with tumor processes were shared across 22 cancer types (Fig. S3C). For example, widespread genomic alterations in axon guidance genes have been implicated in pancreatic ductal adenocarcinoma, joining growing evidence in other cancers42,43. The EGFR signaling pathway, known for its oncogenic role, contains mutations that predict responses to EGFR tyrosine kinase inhibitors in lung cancer patients44. FGFR signaling has been shown to promote tumorigenesis by facilitating proliferation, migration, and invasion45. In addition, more than 20 cancer-specific pathways were found to be activated in eight cancer types, potentially influencing cancer onset and progression (Fig. S3D). For instance, BMP signaling has been shown to promote primary colon cancer growth in vivo and might enhance invasion in vitro46. Notch1 signaling was found to inhibit the growth of esophageal carcinoma cells by downregulating gene expression47. These findings suggest that mutations in functional drivers may contribute to cancer by modulating key molecular pathways involved in tumor initiation, progression, and treatment.

We identified a core driver gene set (CDGS) for 25 cancer types based on functional drivers, which included 10 known cancer genes from the CGC gene set (see Methods; hypergeometric test p < 0.05) (Fig. 4A and Supplementary Data 3). We explored the mutation landscape of CDGS to investigate its genomic characteristics. The overall DNA alteration level of CDGS ranged from 9% to 38% (Fig. S3E). CDGS genes exhibited a higher proportion of somatic mutations and were more prone to amplification or deep deletion, indicating genomic instability. Relative distribution analysis revealed that somatic mutations in CDGS occurred predominantly in UCEC and COAD, whereas amplifications or deep deletions were more prevalent in OV and BRCA (Fig. 4B). Additionally, we observed cancer-specific alterations in individual CDGS genes. For instance, TTN showed a higher mutation frequency in SKCM and UCEC, BRAF amplification was particularly high in OV, and TP53 deep deletions predominantly occurred in PRAD. CDGS genes significantly co-occurred across pan-cancer and individual cancer types, revealing a tendency to activate collaborating oncogenic pathways48 (Fig. 4C and Fig. S3F). KEGG pathway enrichment analysis found that CDGS genes co-regulate key biological processes of various cancer types (Fig. 4D).

Fig. 4: Identification and mutation landscape of CDGS across 25 cancer types.

A Workflow for identifying the core driver gene set (CDGS) for cancers. B Relative distribution of mutation, amplification, and deep deletion frequencies for CDGS. C Pairwise co-occurring and mutually exclusive relationships across CDGS in 25 cancer types. The number of cancer types with significant co-occurrence or mutual exclusivity is shown in orange and green gradient colors, respectively. D Top 10 KEGG pathways enriched in CDGS. E Shared Pfams that were affected by somatic mutations of CDGS in over 20 cancer types. The dots are colored by mutation number and their size represents the number of cancers. F The clusters of the enrichment region mutations in protein domains of TTN in LUAD (bottom panel) and the structural changes of an I-set Ig domain between wild-type and mutant-type (top panel). Known cancer genes of CGC are highlighted with *.

Mutation frequency alone is not a reliable indicator of a gene being a driver, as it can be influenced by the gene’s size. For example, TTN had the highest mutation frequency among CDGS genes, but this may be attributable to its large gene length of 281,434 bp. Therefore, we identified Pfam domains affected by CDGS variants, aiming to provide a more nuanced understanding of their functional impact. Somatic mutations in CDGS impacted 87 distinct Pfam domains, of which 16 were affected in over 20 cancer types (Fig. 4E and Supplementary Data 4). Notably, numerous mutations occurred in the fibronectin-type III (FN3) domain and the Immunoglobulin I-set (I-set) domain of TTN. TTN mutations have been reported as key intra-tumor potential drivers for MIP/SOL components in lung adenocarcinoma49. In this study, TTN appeared to function as a potential tumor suppressor gene (TSG), showing lower expression in LUAD compared to normal tissues (log2FC = −0.66, adj.p = 1.99e−67). Mutations in the enrichment region of TTN were predominantly clustered within the I-set rather than the FN3 domain in LUAD (Fig. 4F). I-set domains are involved in several critical functions, including immune system regulation akin to antibody functions50. The variants p.S9785F, p.I9761T, p.D9826N, and p.V9772L in the TTN enrichment region occurred in an I-set domain whose structure could be modeled by SWISS-MODEL with 100% template protein coverage. Notably, the p.S9785F variant replaced serine with phenylalanine and the p.I9761T variant replaced isoleucine with threonine, altering the local side-chain polarity and physicochemical properties of the original amino acids. These changes were likely to affect the folding pattern of the I-set domain, potentially impairing its antigen-binding capacity in LUAD.

Genetic alterations in CDGS may affect transcription and modulate epigenetic regulation

To systematically investigate the cancer-driving mechanisms of somatic mutations in CDGS across multi-omics levels, we first compared the expression levels of CDGS in tumor versus normal tissues across 25 cancer types. This comparison revealed a predominantly dysregulated expression pattern of CDGS genes and supported their important role in driving cancer development (Fig. 5A). We observed a significant difference in the average expression levels of CDGS between mutant and wild-type groups in the pan-cancer cohort (Fig. 5B). Specifically, CDGS mutations were associated with decreased expression in BLCA, PAAD, and STAD and with increased expression in READ and THCA (Fig. S4A). In the analysis of individual CDGS genes, we evaluated the proportion of patients with both mutations and altered expression levels (Z-score > 2 for upregulation or Z-score < -2 for downregulation) by jointly analyzing genetic and transcriptomic information from the same samples. TP53, KRAS, COL6A3, and APOB showed a significant proportion of mutated samples with expression abnormalities in eight cancer types (Fig. S4B). Notably, APOB, whose role in tumors has not been well established, showed the highest proportion in LIHC. The expression level of APOB was significantly higher in LIHC compared to normal tissues (Fig. 5C), suggesting that APOB may function as a potential oncogene. Furthermore, patients with APOB mutations showed a worse prognosis than wild-type patients (Fig. 5D).
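
The Z-score rule above can be sketched as follows for a single gene in a single cohort. The reference distribution for the Z-score is not specified in this section, so the sketch assumes that expression is standardized against all samples in the cohort; the function name `altered_fraction_in_mutants` and its inputs are hypothetical.

```python
from statistics import mean, pstdev


def altered_fraction_in_mutants(expression, mutated, threshold=2.0):
    """Fraction of mutated samples whose expression is altered.

    expression: dict mapping sample ID -> expression value for one gene.
    mutated:    set of sample IDs carrying a mutation in that gene.
    A sample counts as altered when |Z| > threshold, i.e. Z > 2
    (upregulated) or Z < -2 (downregulated) with the default threshold.
    """
    values = list(expression.values())
    mu, sigma = mean(values), pstdev(values)
    carriers = [s for s in mutated if s in expression]
    if sigma == 0 or not carriers:
        return 0.0
    altered = [s for s in carriers if abs((expression[s] - mu) / sigma) > threshold]
    return len(altered) / len(carriers)
```

Standardizing against a different reference, such as normal tissue or copy-number-neutral samples, would change which samples cross the threshold, so the choice of reference should be stated alongside any proportion reported this way.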

Fig. 5: Multi-omics analysis of CDGS across 25 cancer types.

A The expression alterations of CDGS genes in 25 cancer types, * represents 0.01 < adj.p < 0.05, ** represents adj.p < 0.01. B Comparison of the average expression level of CDGS between patients with or without CDGS mutations in pan-cancer. C Comparison of the expression level of APOB between LIHC and normal samples. D Comparison of OS between patients with or without APOB mutations in LIHC. E WashU track (hg38) of the transcription, DNA methylation, histones, promoter TFs, and enrichment region TFs spanning APOB in the HepG2 cell line of LIHC. Orange and blue represent activated or suppressed signals of histones and TFs, respectively. F WashU track (hg38) of the transcription, DNA methylation, and histones spanning APOB in the Panc1 cell line of PAAD. Orange and blue represent activated or suppressed signals of histones, respectively. G Heatmap of the correlation between the expression of erTFs and APOB in the Driver&Different-Region group (left panel) and the Driver&Same-Region group (right panel), * represents 0.01 < adj.p < 0.05, ** represents adj.p < 0.01. H The average methylation alterations of CDGS genes across 25 cancer types, * represents 0.01 < adj.p < 0.05, ** represents adj.p < 0.01. Known cancer genes of CGC are highlighted with *.

To further investigate the relationship between the variants and other omics layers of APOB in LIHC, we modeled the causal regulatory relationships by integrating transcription, DNA methylation, histone modification, and TF data from ENCODE51. We categorized 25 cancer types into three groups (Supplementary Data 5): Driver & Same-Region, cancers where APOB was identified as a driver with a mutation enrichment region covering that of LIHC; Driver & Different-Region, cancers where APOB was a driver but with a mutation enrichment region distinct from that of LIHC; and Non-Driver, cancers where APOB was not a driver. Compared to G401, Caki2, SK-N-DZ, and Panc1 cells of the Non-Driver group, APOB transcription levels were significantly higher in HepG2 cells for LIHC, particularly in the mutation enrichment region (Fig. 5E and Fig. S4C), indicating that variants in the enrichment region may contribute to the upregulation of APOB expression. Histone modifications such as H3K4me3 and H3K27ac, which mark active promoters to increase gene expression52, showed stronger signals near the APOB promoter region in HepG2 cells compared to Panc1 cells (Fig. 5E and Fig. 5F). Conversely, histone markers of transcriptional repression, including H3K27me3 and H3K9me353, exhibited weak signals across the APOB gene body in HepG2 cells. Additionally, H3K36me3 and H3K4me1, which activate transcription by marking the gene body and active enhancer regions, respectively52, displayed consistently stronger signals in HepG2 than in Panc1 cells. We retrieved the TFs regulating APOB, which included FOXA1, HNF4A, NR2F1, NR2F6, PPARA, and PPARG, from the TRRUST database54. The TFs targeting the enrichment region (erTFs) of APOB in LIHC were predicted by the Cistrome Data Browser55 (Fig. S4D).
Although no significant difference was found in the association between TFs and APOB expression between the Driver&Same-Region and Non-Driver groups, we observed significantly positive correlations between the expression of APOB and that of the erTFs CEBPA and NR3C1 in a higher number of cancers within the Driver&Same-Region group than within the Driver&Different-Region group. This suggests a potential activating influence of mutations within the enrichment region of APOB on CEBPA and NR3C1 in LIHC (Fig. 5G and Fig. S4E). This enhanced activity was also reflected by their signals covering the promoter and enrichment regions of APOB in HepG2 cells. A similar signal pattern of APOB was observed in A549 cells for LUSC from the Driver&Same-Region group (Fig. S4F). These findings suggest that mutations in the APOB enrichment region may upregulate transcription, activate key TFs, and influence crucial histone modifications.

Additionally, we found significantly different methylation levels of CDGS genes when comparing tumor tissues to normal tissues (Fig. 5H). In pan-cancer analysis, the average methylation level of CDGS was significantly lower in the mutant group (Fig. S4G). Mutations of CDGS were also associated with decreased methylation levels in eight specific cancer types (Fig. S4H). Meanwhile, we observed methylation signals of APOB in HepG2 and A549 but not in Panc1 cells (Fig. 5E, F and Fig. S4F), suggesting that mutations of APOB may influence its methylation level. Notably, this apparent discrepancy in the direction of the methylation effect arises from the distinct contexts in which APOB and CDGS mutations were analyzed. The objective of the APOB methylation analysis is to investigate the potential impact of mutations within enrichment regions on cancer development. In contrast, the global analysis of CDGS mutations reflects a broader epigenetic effect of the gene set as a collective group.

Meanwhile, RYR3 was found to be downregulated in LUSC, PAAD, and LIHC compared to normal tissues, indicating its potential role as a TSG (Fig. 5A). RYR3 was identified as a candidate driver gene by geMER in LUSC and PAAD, but not in LIHC. Compared to the HepG2 cell line, the signals of transcription, DNA methylation, and histone modifications that promote transcription were weaker, while histone modifications that suppress transcription were stronger, in A549 and Panc1 cell lines (Fig. S4I). These observations suggest that mutations within the enrichment regions of RYR3 probably reduced its transcriptional levels, DNA methylation, and histone modifications that activate expression, while enhancing histone modifications that inhibit expression. These findings provide novel mechanistic insights into the role of RYR3 as a potential driver across various cancer types. RELN, recognized as a candidate driver gene across LIHC, LUSC, and PAAD, displayed similar transcriptional, DNA methylation, and histone modification patterns in all three cell lines (Fig. S4J). This suggests that mutations within the enrichment regions of RELN may act through similar regulatory mechanisms across these cancers.

Cross-talk among CDGS genes and functional consequences of CDGS mutations

We observed that the members of CDGS featured frequent interactions within protein-protein interaction (PPI) networks (Fig. 6A). Their expression levels were generally significantly positively correlated across pan-cancer and individual cancer types (Fig. 6B). These results indicated that the cross-talk among CDGS genes may contribute to the development and progression of multiple cancers.

Fig. 6: Analysis of CDGS interaction and hallmark pathways.

A The expression correlations in pan-cancer and protein-protein interactions network among CDGS genes. B The expression correlations among CDGS genes for 25 cancer types. The number of cancer types with significant positively related or negatively related is shown in orange and green gradient colors, respectively. C The correlation between expression of CDGS genes and hallmark pathway activity scores, * represents p < 0.05. D The activity comparisons of 50 hallmark pathways containing 8 classifications for the patients with or without CDGS mutations in pan-cancer. E Top 5 hallmark pathways for patients with or without CDGS mutations in GSEA analysis.

Functionally aberrant drivers synergistically or complementarily disrupt downstream pathways within biological networks, with some resulting in the activation of cancer hallmarks56. To explore common functional consequences induced by CDGS mutations across pan-cancer, we downloaded and categorized 50 hallmark pathways of cancers from MSigDB57 (Supplementary Data 6). These pathways were scored employing single-sample Gene Set Enrichment Analysis (ssGSEA) using RNA-seq data of the PANCAN cohort. We observed generally significant correlations between the expression levels of CDGS genes and pathway activity scores (Fig. 6C). We subsequently performed a differential pathway activity analysis comparing samples with CDGS mutations to wild-type samples. Activity scores for 47 out of 50 pathways were significantly different between groups, with 40 pathways demonstrating significant activation in the CDGS mutant group (Supplementary Data 7). Notably, the pathways most frequently upregulated were associated with cellular proliferation (Fig. 6D). Consistently, Gene Set Enrichment Analysis (GSEA) also indicated an upregulation of proliferation-related signatures in samples harboring CDGS mutations (Fig. 6E). These results contributed to a better understanding of the CDGS interplay and functional impact of their mutations on cancers.

CDGS mutations could be an independent predictor of prognosis for pan-cancer patients

We investigated the prognostic relevance of CDGS mutation status across 25 cancer types. Results revealed significant associations of CDGS mutations with four types of survival outcomes across pan-cancer cohorts, rather than within specific cancer types (Fig. S5A). To investigate the potential reason, we performed survival analyses by sequentially adding cancer types, starting with ACC. Consistent prognostic associations were observed when 16 or more cancer types were incorporated (Fig. S5B). To further evaluate the influence of specific cancer types, we conducted random sampling analyses across various numbers of cancer types, selecting from 2 to 25. For each list of cancers, we randomized 100 times and repeated four types of survival analysis. As the number of cancer types increased, the density of random instances linked to prognosis also rose, highlighting that the prognostic power of CDGS mutations is stronger and not influenced by specific cancer types (Fig. S5C). These results suggest that the lack of significant associations in certain cancer types is likely due to smaller sample sizes, with larger datasets or more cancer types strengthening the prognostic value of CDGS mutations.

To predict prognosis in the pan-cancer cohort based on CDGS mutations, we employed a randomized approach, dividing patients into training and testing sets in a seven-to-three ratio, repeated 100 times. For each randomization, patients were stratified into mutant and wild-type groups based on CDGS mutation status. Subsequently, we assessed differences in survival outcomes between groups. Both the training and testing cohorts consistently demonstrated significant results across four types of survival outcomes over the 100 randomizations (Supplementary Data 8). For subsequent analysis, we designated the first randomized datasets as the training and internal validation cohorts, and 11 datasets representing eight different cancers from the ICGC as an external validation cohort. In both the training and validation cohorts, patients in the mutant group exhibited notably shorter OS compared to the wild-type group (Fig. 7A). Consequently, we classified the mutant group as high-risk and the wild-type group as low-risk. The poorer survival trends among high-risk patients were specifically observed in ACC, LGG, and PAAD (Fig. S6A).

Fig. 7: Survival analysis of CDGS mutations.

A Kaplan–Meier survival curve of TCGA training cohort, TCGA internal validation cohorts, and ICGC external validation cohorts. B Univariate and Multivariate Cox regression analysis of CDGS mutations in the entire cohort of 25 cancer types. C Comparison of clinical features between high- and low-risk groups. D Different age groups (age≥65 (top) and age < 65 (bottom)), gender groups (male (top) and female (bottom)), stageN groups (stageN2-N3 (top) and stageN0-N1 (bottom)), stageM groups (stageM1 (top) and stageM0 (bottom)), and stageT groups (stageT3-T4 (top) and stageT0-T2 (bottom)) survival curves of the entire cohort.

In the training cohorts and the entire TCGA cohort, univariate Cox regression analysis demonstrated a significant correlation between CDGS mutations and OS (p-value < 0.001). After adjusting for age, gender, and cancer stages using multivariate Cox regression analysis, CDGS mutations remained an independent predictive factor for prognosis (Fig. 7B and Fig. S6B). These findings underscore the robustness of CDGS mutation status as an independent prognostic indicator for the pan-cancer cohort, further supporting its clinical relevance in prognostic assessment. Comparative analysis of clinical characteristics revealed that patients categorized as high-risk were significantly older and had a greater proportion of advanced stage categories (stage M, T, N, and TNM) compared to the low-risk group (Fig. 7C and Fig. S6C). Stratification analysis further demonstrated that CDGS mutation status effectively stratifies patients into high- and low-risk groups within the same stratum of age, gender, and various cancer stages (N, M, T). Across each stratum, patients in the high-risk group consistently showed poorer OS outcomes (Fig. 7D and Fig. S6D). These findings highlight the prognostic utility of CDGS mutations across diverse clinical contexts and their potential as a predictive biomarker of prognosis for the pan-cancer cohort.

Predictive potential of CDGS mutations in immunotherapy outcomes for pan-cancer patients

Carcinogenesis is driven by interactions between genetic mutations and the local tumor microenvironment58. Mutations in CDGS impacted immune-related domains, such as the I-set domain (Fig. 4E), and activated hallmark pathways associated with immunity or the immune response (Fig. 6D). Therefore, we compared the immune characteristics between high- and low-risk groups in the entire TCGA cohort containing 25 cancer types. We assessed immune cell infiltration with CIBERSORT and found that the abundances of 19 of 22 immune cell types were significantly different between these two groups (Fig. S7A). Patients in the low-risk group showed a predominance of various immune cells at the tumor site, including a higher degree of infiltration by M0 macrophages, naïve B cells, and activated CD4+ memory T cells, among others. In contrast, the high-risk group exhibited increased levels of immunosuppressive cells, such as M2 macrophages, activated NK cells, and neutrophils (Fig. 8A). To validate the accuracy of the CIBERSORT results, we employed the ssGSEA approach to quantify the enrichment levels of 28 immune signatures from He et al.59 (Fig. S7B). The distribution of 14 immune cell types according to immune infiltration scores generated by Danaher et al.60 was also evaluated (Fig. S7C). Consistently, patients in the high-risk group displayed a significantly higher infiltration of immunosuppressive cells, including M2 macrophages, dendritic cells, regulatory T cells, and Th1 cells. Additionally, MCPcounter analysis revealed that the high-risk group exhibited a significantly greater abundance of fibroblasts (Fig. 8B). Cancer-associated fibroblasts have recently been identified as key promoters of immune evasion61. This suggests that fibroblasts may facilitate extrinsic immune escape in patients within the high-risk group.

Fig. 8: Immune profiles of high- and low-risk in 25 cancer types from the TCGA.

A Comparison of immune cells estimated by CIBERSORT in high-risk relative to low-risk groups. B Comparison of fibroblast infiltration estimated by MCPcounter in high-risk relative to low-risk groups. C Comparison of enrichment scores of oncogenic pathways in high-risk relative to low-risk groups. Comparison of (D) LAG3; (E) PD-L1; (F) PD-1; and (G) CTLA-4 expression in high-risk relative to low-risk groups. The proportion of patients responding to ICI therapy and Kaplan–Meier survival curve of high- and low-risk groups for the (H) training cohort; (I) testing cohort; (J) Allen et al. cohort; (K) Miao et al. (PMID 29301960) cohort; and (L) Miao et al. (PMID 330150660) cohort.

We also calculated enrichment scores for 10 canonical oncogenic pathways from Sanchez-Vega et al.62 using ssGSEA and compared them between the high- and low-risk groups (Fig. 8C). The Wnt signaling and cell cycle pathways scored higher in the high-risk group, while the NRF2, PI3K, and RAS pathways scored higher in the low-risk group (Wilcoxon p < 0.05). The Wnt pathway has been linked to mechanisms of immunological exclusion63. Aberrant activity of the cell cycle and its resulting genomic instability are hallmark features of cancer64. NRF2, PI3K, and RAS pathways are implicated in critical processes such as oxidative stress response, cell growth and metabolism, and cell signaling. For instance, NRF2 is a key transcription factor in the cellular response to oxidative stress, and its activation plays a crucial role in regulating antioxidant responses and maintaining cellular homeostasis65. These findings highlight distinct immune profiles associated with CDGS mutation status, potentially influencing tumor immune microenvironment dynamics and immune responses in cancer.

Additionally, we compared the expression levels of immune checkpoint genes (ICGs) in the high-risk group relative to the low-risk group to comprehensively characterize the immune profile (Fig. S7D). Several key ICGs, including PD1, PD-L1, and LAG3, were found to be significantly upregulated in the high-risk group (Fig. 8D–F). Notably, the expression of CTLA-4 was markedly increased in the high-risk group, with a logFC of 1.02 and an adj.p of 1.95E-63 (Fig. 8G). These results suggest that the high-risk group may have a potential response to treatment with these immune checkpoint inhibitors (ICIs). Consequently, we investigated the predictive capacity of CDGS mutations regarding the response to ICIs using the training and testing cohorts from Long et al.30. We observed a significantly higher proportion of DCB in the high-risk group compared to the low-risk group (Fisher’s exact test, p < 0.05) (Fig. 8H, I). After receiving immunotherapy, the survival curves of the high-risk and low-risk groups exhibited convergence. The Miao et al. dataset from the testing cohort exhibited consistent results regarding the proportion of responders and survival (Fig. 8J). Two other datasets demonstrated even higher survival rates in the high-risk group, further supporting the potential of CDGS mutations as predictive markers for ICI response across the pan-cancer cohort (Fig. 8K, L).

Discussion

Cancer diagnosis and therapy hinge on the identification and comprehensive understanding of driver genes that play a causal role in cancer evolution66. Given that less than 2% of the human genome encodes proteins, non-coding regions, which contain important regulatory elements such as promoters, 5′UTR, 3′UTR, and splice sites, are of critical importance. Therefore, conducting effective simultaneous functional studies of mutations in both coding and non-coding regions is essential for advancing our understanding of cancer biology and development. In response, we have reported genes with mutation enrichment regions within coding and non-coding genomic elements as putative drivers and developed the geMER method to accurately identify them.

geMER leverages the large volume of somatic mutations generated by WGS technology and, when compared with other published methods for identifying driver genes across the genome, demonstrated superior efficacy in the majority of cancer types. Compared to existing tools such as DriverPower, OncodriveFML, and ActiveDriverWGS, geMER offers distinct advantages, particularly in identifying driver mutations in non-coding regions. DriverPower integrates mutational burden with functional impact (FI) information35. OncodriveFML models local mutational backgrounds to compute functional mutation bias of genomic elements15. The accuracy of both tools relies on predicted FI scores, which are well developed for coding mutations but poorly assessed in non-coding regions. ActiveDriverWGS tests mutation enrichment in predefined genomic elements based on expected background mutation models, which may limit its sensitivity to rare mutation patterns3. In contrast, geMER is hypothesis-free and does not depend on prior functional annotations or background mutation rates. The calculation of MAS and MES relies solely on the genomic positions of mutations and their counts in tumor samples. This approach allows geMER to effectively identify driver mutations with low frequency, especially in non-coding regions where functional annotations and FI scores are sparse or lacking. Therefore, geMER complements existing methods by expanding the scope of driver mutation detection, with enhanced sensitivity for rare and non-coding mutations.

Additionally, understanding how driver mutations contribute to widespread transcriptional dysregulation in cancer remains a fundamental challenge in cancer genomics. geMER advances driver gene identification by explicitly integrating the concept of mutational convergence within functional genomic elements. It incorporates MAS and MES to assess not only the statistical significance of mutational clustering but also the enrichment level of mutations in regulatory sequence modules, such as promoters, UTRs, and splice sites. By concentrating on functional genomic contexts, geMER could prioritize candidate driver mutations that are more likely to impact gene regulation and downstream transcriptional programs. Ultimately, to substantiate the role of driver genes identified by geMER in malignant phenotypes, we conducted comprehensive biological validation across multiple cancer types. Our results demonstrated that a substantial proportion of candidate driver genes exhibited significant associations with aberrant gene expression patterns and clinical prognosis, supporting their potential roles in tumorigenesis. Notably, genes such as PDGFRA in LGG, PPL in OV, and LAG3 in UCEC consistently showed significant differences in expression and survival outcomes when comparing mutant- and wild-type groups. Taken together, geMER establishes a critical link between statistical mutational enrichment and functional interpretation by pinpointing mutation hotspots within regulatory elements that have plausible biological impacts. This approach offers novel mechanistic insights into how driver genes may induce transcriptional changes, ultimately contributing to malignant phenotypes.

Our current analysis focuses on five gene-proximal regions (e.g., promoters, UTRs, CDS, splice sites), which may overlook potentially impactful mutations in distal regulatory elements such as enhancers and intergenic regions. These long-range regulatory mutations have been shown to play critical roles in cancer development, and several recent studies have developed computational frameworks to identify them through integrative approaches. For example, bpb3 implemented a Bayesian biophysical model to identify functional mutation blocks (FMBs) by integrating genomic and transcriptomic data, particularly those with high-frequency SNVs, altered TF binding affinity, and proximity to the regulatory regions of differentially expressed genes20. Similarly, the study by Fadason et al. incorporated DNA methylation, gene expression, chromatin organization, topologically associating domains (TADs), and mutations within enhancers to identify mutation block-gene pairs, thereby uncovering long-range regulatory mechanisms involved in cancer19. As geMER is designed to be hypothesis-free and annotation-independent, it provides a flexible foundation for expansion. Moving forward, geMER could be extended to incorporate distal and unannotated regulatory elements, thereby improving the identification of non-coding driver mutations located far from gene bodies.

We further identified CDGS by comprehensively integrating biology pathways, mutation frequencies, and the enrichment of the CGC gene set, thereby highlighting the most influential driver genes across multiple cancer types. To mitigate potential bias from frequently mutated genes that may not be true drivers, we employed geMER to identify candidate driver genes based on mutation enrichment rather than raw mutation frequency. This enrichment-based approach helps capture functionally relevant genes that may be infrequently mutated but still play a critical role, including those in non-coding regions. In addition, we incorporated pathway enrichment analysis to retain functional drivers with potential biological relevance. The inclusion of the CGC gene enrichment further validated the reliability of the CDGS identification. These combined strategies increase the likelihood that the genes identified as part of CDGS are genuine functional drivers.

The CDGS included 10 known cancer genes from the CGC list, while the remaining genes have been reported as related to cancers through extensive studies. For instance, RYR3 has been associated with the risk and calcification of breast cancer67. In this study, RYR3 was downregulated in BRCA compared to normal tissues, indicating its potential role as a TSG (Fig. 5A). Deep deletions of RYR3 were particularly observed in BRCA and OV (Fig. 4B). Similarly, DMD was identified as a potential TSG with its expression significantly downregulated across 24 cancer types. Amplification and deep deletions of DMD were frequently observed in OV and STAD. Genomic deletions of DMD have been characterized in a subgroup of progressive or higher-grade meningiomas with poor outcomes68. The RELN pathway has been recognized as a target for diagnosis and therapy in pancreatic cancer69. Consistently, our analysis identified RELN as a candidate driver gene in PAAD, where its expression was significantly elevated compared to normal tissues (Fig. 5A). Polymorphisms of COL6A3 have been potentially linked to lung cancer risk70. Meanwhile, we found that a high proportion of patients carried COL6A3 mutations together with dysregulated expression in multiple cancers, including LUAD (Fig. S4B). Additionally, LAMA1 was significantly upregulated in esophageal squamous cell carcinoma, correlating with the aggressive oncogenic phenotype71. This observation is consistent with our findings, as depicted in Fig. 5A. Thus, CDGS comprises both well-established and potential cancer drivers, whose mutations could significantly influence tumor occurrence and development.

In our study, we not only explored the potential driver mechanisms of CDGS at the pan-cancer level but also demonstrated its relevance to individual cancer types, such as breast cancer. Specifically, the relative distributions of somatic mutations and copy number alterations indicated a significant proportion of CDGS genes exhibit genomic instability in BRCA (Fig. 4B). Furthermore, CDGS genes were enriched in pathways associated with breast cancer (Fig. 4D). When comparing tumor tissue to normal tissue, nearly all CDGS genes exhibit significant differential expression in BRCA (Fig. 5A), with notable differences in their methylation levels (Fig. 5H). Interestingly, we identified a high proportion of COL6A3-mutated samples with expression dysregulation, indicating a potentially important role of COL6A3 in breast cancer progression (Fig. S4B). Additionally, the mutational status of CDGS was significantly associated with OS in BRCA patients (Fig. S5A). These findings underscore the utility of our pipeline in identifying novel pan-cancer driver genes, as well as its potential to offer valuable insights for clinical research and application in specific cancer types.

Finally, we confirmed that the collective mutation status of CDGS served as an independent prognostic factor across the pan-cancer cohort. These mutations were associated with distinct immune profiles, potentially influencing the tumor immune microenvironment and modulating immune responses. High-risk patients were more likely to develop an immunosuppressive microenvironment and showed potential responsiveness to ICI treatment. Looking ahead, we propose that future studies should incorporate pre-treatment survival data to directly evaluate the predictive value of CDGS mutations for immunotherapy outcomes.

In conclusion, the geMER method effectively identified potential driver genes for various cancer types by detecting mutation enrichment regions within coding and non-coding genomic elements. The CDGS shared by multiple cancers offered valuable insights into the underlying oncogenic mechanisms, and its mutations accurately predicted patient survival outcomes and informed decisions related to immunotherapy strategies.

Methods

Data resources

The multiple omics data for 33 cancer types of the Cancer Genome Atlas (TCGA) project were downloaded from the UCSC Xena Database (https://xenabrowser.net/datapages/). These comprehensive datasets encompassed somatic mutations of whole cancer genomes generated by MuTect2 via the DNA Seq Variant Calling Pipeline, RNA-sequencing (RNA-seq) data produced using the Illumina HiSeq platform and normalized through the log2(fpkm+1) transformation, copy number variation generated using the Illumina platform and computed by GISTIC2, DNA methylation data generated on the Illumina Human Methylation 450K platform, and clinicopathological information. In addition to the individual cancer types, we downloaded RNA-seq and methylation data of the PANCAN cohort and extracted samples of 25 cancer types associated with CDGS for pan-cancer analysis.

The genomic element set containing the coding sequences (CDS) (n = 20,185), splice sites (n = 18,729), 3′ untranslated regions (3′UTRs) (n = 19,369), 5′ untranslated regions (5′UTRs) (n = 19,188), and promoters (n = 20,164) was generated by PCAWG and downloaded from DriverPower (https://github.com/smshuai/DriverPower/tree/master/data)35. Their genome coordinates were converted from the GRCh37 (hg19) to the GRCh38 (hg38) genome assembly using the online UCSC Liftover tool to align them with somatic mutations from TCGA.

RNA-Seq data for normal tissues of healthy individuals was from the Genotype-Tissue Expression (GTEx) dataset (https://commonfund.nih.gov/GTEx/). For the purpose of comparing the expression levels with tumor tissues from TCGA, the GTEx expression values were converted from log2(fpkm+0.01) to log2(fpkm+1).
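The pseudocount change described above amounts to undoing the logarithm, swapping the offset, and taking the logarithm again. The following is a minimal sketch of that arithmetic; the function name and the clamping of slightly negative values to zero are illustrative assumptions, not part of the published pipeline.

```python
import math


def rescale_log2_fpkm(value, old_pseudocount=0.01, new_pseudocount=1.0):
    """Convert log2(fpkm + old_pseudocount) to log2(fpkm + new_pseudocount)."""
    # Undo the original transformation to recover the raw FPKM value.
    fpkm = 2.0 ** value - old_pseudocount
    # Assumption: clamp tiny negative values produced by floating-point rounding.
    fpkm = max(fpkm, 0.0)
    return math.log2(fpkm + new_pseudocount)
```

For example, a gene with FPKM = 3 is stored as log2(3.01) under the first convention and becomes log2(4) = 2 under the second.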

The somatic mutations of 34 cancer types were downloaded from the Catalogue Of Somatic Mutations In Cancer (COSMIC) dataset (https://cancer.sanger.ac.uk/cosmic), which is the world’s most comprehensive resource for investigating the somatic mutations’ influence on human tumors72.

Multiple-omics signal abundances of various cancer cell lines were downloaded from the Encyclopedia of DNA Elements (ENCODE) database51. Specifically, we downloaded transcriptomic data of the G401 and Caki2 cell lines derived from KIRP or KIRC, and SK-N-DZ cell lines derived from LGG. For PAAD, we downloaded transcriptomic, DNA methylation, and histone modification data from the Panc1 cell line. For LIHC and LUAD, we downloaded transcriptomic, DNA methylation, histone modification, and transcription factor binding data from the HepG2 cell and A549 cell, respectively.

Eleven datasets of eight cancer types simultaneously containing somatic mutations and prognostic information as a prognosis validation set were from the International Cancer Genome Consortium (ICGC) (https://dcc.icgc.org/). These datasets included COAD-COCA-CN, KIRC-RECA-EU, LIHC-LICA-CN, LIHC-LINC-JP, LIHC-LIRI-JP, KIRC-RECA-EU, PAAD-PACA-AU, PAAD-PACA-CA, SKCM-SKCA-BR, STAD-GACA-CN and UCS-UTCA-FR.

The training and testing cohorts, including clinical information for patients undergoing ICI immunotherapy, were from Long et al.30. Mutation data of the training cohort TMB and Immunotherapy (MSK, Nat Genet 2019)73 were downloaded from the cBioPortal database (https://www.cbioportal.org). For the testing cohort, we retained the clinical information provided by Van Allen et al.74, Liu et al.75, Miao et al.76,77, Riaz et al.78, and Hugo et al.79, for which mutation data and ICI immunotherapy response outcomes were accessible. Following Long et al.30, we categorized the immunotherapy response outcomes as durable clinical benefit (DCB), comprising complete response (CR), partial response (PR), or stable disease (SD) with progression-free survival (PFS) lasting > 6 months, and no durable benefit (NDB), comprising progressive disease (PD) or SD with PFS lasting ≤ 6 months.
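The DCB/NDB rule can be written as a small decision function. This sketch assumes that the 6-month PFS threshold applies only to stable disease, which is one reading of the definition above; the function name and the response codes passed as strings are hypothetical.

```python
def classify_response(best_response, pfs_months):
    """Label a patient as 'DCB' or 'NDB' from best response and PFS in months."""
    response = best_response.upper()
    if response in ("CR", "PR"):
        return "DCB"
    if response == "SD":
        # Assumption: the PFS threshold is applied to stable disease only.
        return "DCB" if pfs_months > 6 else "NDB"
    if response == "PD":
        return "NDB"
    raise ValueError(f"Unknown response category: {best_response!r}")
```

Under this reading, a patient with SD and exactly 6 months of PFS falls in the NDB group, because the DCB criterion is strictly greater than 6 months.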

Identification of candidate drivers for cancers across genomic elements by geMER

The mutation enrichment research of genomic elements (geMER) was established by integrating nonsynonymous somatic mutations and the sequence information of genomic elements. First, nonsynonymous somatic mutations were aligned to five genomic elements. For each genomic element, we retained genes with mutations observed in at least four tumor samples within each cancer type. Next, the mutation enrichment regions for genomic elements were detected by the mutation set enrichment analysis (MESA-clust) [20]. Specifically, geMER simulates a walker moving along the sequence of a given genomic element while recording a mutation accumulation score (MAS), which starts at 0. The change of MAS at each genomic site depends on the occurrence and frequency of mutations at that site in tumor patients. The maximum variation of MAS highlights the location and magnitude of mutation enrichment. At each genomic site, MAS increases if mutations are observed in patients and decreases otherwise. The increment (\({S}_{{inc}}\)) and decrement (\({S}_{{dec}}\)) of MAS are separately calculated by:

$${S}_{{inc}}=1/\sum Y,{and}$$
(1)
$${S}_{{dec}}=1/Z$$
(2)

where Y represents a vector that records the number of mutations at each genome site, and L denotes the length of the genomic element sequence, that is, Y = (\({y}_{1}\), …, \({y}_{L}\)). Z denotes the number of non-mutated positions. The sum of increments is equal to the sum of decrements, both amounting to 1. Thus, MAS at the \({i}^{{th}}\) position is calculated by

$${{MAS}}_{i}=\sum _{j\in {L}^{M},j\le i}{y}_{j}\times {S}_{{inc}}-\sum _{j\notin {L}^{M},j\le i}{S}_{{dec}}$$
(3)

where \({L}^{M}\) denotes a vector of positions where mutations occur, and the increment at a mutant position is calculated by \({y}_{j}\times {S}_{{inc}}\), where 1 ≤ j ≤ L and \(j\in {L}^{M}\). The decrement at positions without mutations is consistently \({S}_{{dec}}\). Therefore, the MAS profile along the genomic element sequence is expected to increase sharply over a short distance in regions with mutation enrichment. Thus, the Mutation Enrichment Score (MES) is defined as the maximum deviation of MAS, which represents the enrichment region of mutations and the abundance of mutation enrichment:

$${MES}=\max \left({MAS}\right)-\min \left({MAS}\right)$$
(4)

An empirical p-value is employed to assess the significance of MES through a randomization-based test. The randomization process is conducted 1000 times. In each randomization, the number of mutations is kept the same as the actual mutation records, while the sites where mutations occur are randomly selected across the sequence, allowing for replacement. The p-value is calculated as

$$p=\frac{\#\{{MES}(\pi )\ge {MES}\}}{1000}$$
(5)

where \({MES}(\pi )\) is a MES vector of randomly selected mutations in the given genomic element. Benjamini-Hochberg procedure is applied to adjust the p-values. Genes with an adjusted p-value (adj.p) < 0.05 are considered candidate drivers for the given element.
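A minimal sketch of Eqs. (1)–(5) and the Benjamini-Hochberg adjustment may help make the walk concrete. The function names, the choice to take the maximum and minimum over positions 1 to L only, the handling of elements with no unmutated positions, and the fixed random seed are illustrative assumptions rather than details of the geMER implementation.

```python
import random


def mutation_enrichment_score(counts):
    """Return (MES, MAS profile) for per-position mutation counts Y."""
    total = sum(counts)
    n_unmutated = sum(1 for y in counts if y == 0)
    if total == 0:
        return 0.0, [0.0] * len(counts)
    s_inc = 1.0 / total  # Eq. (1)
    # Eq. (2); assumption: no decrement when every position is mutated.
    s_dec = 1.0 / n_unmutated if n_unmutated else 0.0
    mas, profile = 0.0, []
    for y in counts:  # Eq. (3): running sum along the sequence
        mas += y * s_inc if y > 0 else -s_dec
        profile.append(mas)
    return max(profile) - min(profile), profile  # Eq. (4)


def empirical_p_value(counts, n_perm=1000, seed=0):
    """Eq. (5): fraction of randomizations with MES >= the observed MES."""
    rng = random.Random(seed)
    observed, _ = mutation_enrichment_score(counts)
    length, total = len(counts), sum(counts)
    hits = 0
    for _perm in range(n_perm):
        shuffled = [0] * length
        for _draw in range(total):
            # Sites are drawn with replacement, keeping the mutation count fixed.
            shuffled[rng.randrange(length)] += 1
        perm_mes, _ = mutation_enrichment_score(shuffled)
        if perm_mes >= observed:
            hits += 1
    return hits / n_perm


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With toy counts, four mutations packed into two adjacent sites of a six-site element, `[0, 0, 2, 2, 0, 0]`, give an MES of 1, whereas the same four mutations spread evenly over eight sites, `[1, 0, 1, 0, 1, 0, 1, 0]`, give 0.25. This illustrates how the score rewards positional clustering rather than raw mutation count.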

Integrative pathway enrichment analysis of candidate drivers using ActivePathways

ActivePathways is an integrated approach that employs statistical data fusion, rationalizes contributing evidence, and highlights relevant genes to explore significantly enriched pathways across various datasets80. It incorporates molecular pathways of the Reactome database and biological processes of Gene Ontology. To perform integrative pathway enrichment analysis, a matrix of adjusted p values for genomic elements from geMER was provided as input to ActivePathways, where the rows represent unique candidate driver genes and columns represent genomic elements. We ran ActivePathways with default parameters, corrected the enriched pathways for multiple testing, and filtered the significant pathways according to default settings (Holm Q < 0.05). A gene involved in biological pathways identified by ActivePathways was considered a functional driver.

Identification of a core driver gene set for multiple cancer types

To identify a set of the most influential driver genes across pan-cancer and explore potential common mechanisms of tumorigenesis, we identified a core driver gene set shared by multiple cancer types (Fig. 4A). We first mapped the candidate driver genes of genomic elements identified by geMER into biological pathways using ActivePathways across cancer types. The functional drivers that comprehensively regulated the biological processes and molecular pathways were recognized within corresponding cancer types. Next, we integrated somatic mutation data for the pan-cancer cohort containing cancer types corresponding to functional drivers. The mutation frequency of a functional driver was calculated as its mutation records divided by the number of mutations across all genes in the pan-cancer cohort. The functional drivers were subsequently ranked in decreasing order of mutation frequency. Finally, we estimated the enrichment between the top gene set of ranked functional drivers and the Cancer Gene Census (CGC) gene list. Specifically, n represents the number of ranked functional drivers. For the top i (i = 2, 3, …, n) functional drivers, we separately computed their overlap proportion with CGC genes. To evaluate the statistical significance of enrichment, the p-value was calculated using the hypergeometric test:

$$P(k,N,M,i)=\frac{\left(\begin{array}{c}M\\ k\end{array}\right)\left(\begin{array}{c}N-M\\ i-k\end{array}\right)}{\left(\begin{array}{c}N\\ i\end{array}\right)}$$
(6)

where N represents the total number of genes in the combined list of functional drivers and CGC genes, M denotes the number of CGC genes, i is the number of top-ranked functional drivers considered, and k refers to the number of those top i functional drivers that are CGC genes. The set of top-ranked functional drivers with the significantly highest proportion of CGC genes was considered the CDGS.
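The scan over top-i gene sets and Eq. (6) can be sketched with the standard library alone. The function names and toy gene labels are hypothetical, and N is taken here as the size of the union of the ranked drivers and the CGC list, which is one reading of the definition above. Eq. (6) gives the probability of exactly k overlaps; the upper-tail helper shows a common alternative form of the test and is an assumption, not something stated in the text.

```python
from math import comb


def hypergeom_pmf(k, N, M, i):
    """Eq. (6): probability of exactly k CGC genes among the top i drivers."""
    if k < 0 or k > min(M, i) or i - k > N - M:
        return 0.0
    return comb(M, k) * comb(N - M, i - k) / comb(N, i)


def hypergeom_upper_tail(k, N, M, i):
    """P(X >= k); assumed tail form, not given explicitly in the text."""
    return sum(hypergeom_pmf(x, N, M, i) for x in range(k, min(M, i) + 1))


def scan_top_sets(ranked_drivers, cgc_genes):
    """For each top-i set (i = 2..n) return (i, k, overlap proportion, p)."""
    cgc = set(cgc_genes)
    N = len(set(ranked_drivers) | cgc)  # assumption: union of both gene lists
    M = len(cgc)
    results = []
    for i in range(2, len(ranked_drivers) + 1):
        k = sum(1 for gene in ranked_drivers[:i] if gene in cgc)
        results.append((i, k, k / i, hypergeom_pmf(k, N, M, i)))
    return results
```

For instance, with N = 10, M = 4, i = 3, and k = 2, Eq. (6) gives 6 × 6 / 120 = 0.3, and the corresponding upper tail is 40 / 120 = 1/3.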

Analysis of Somatic mutations and copy number alteration for CDGS

Copy-number values equal to 1 and −1 were considered to be amplification and deep deletion, respectively. The relative distributions of mutation, amplification, and deep deletion frequencies for CDGS genes were each calculated as the number of occurrences for an individual gene divided by the total number of occurrences across all CDGS genes and patients, multiplied by 100%. OncoPrint plots of CDGS somatic mutations and copy number variation in the pan-cancer cohort were generated by the “ComplexHeatmap” R package81. Co-occurrence and mutual exclusivity of CDGS mutations were plotted using the R package “maftools”82. Protein family (Pfam) domains affected by the somatic mutations of CDGS were annotated by Variant Effect Predictor (VEP)83. The clusters of the enrichment region mutations within protein domains of TTN in LUAD were shown by ProteinPaint84. The three-dimensional structure of I-set Ig domains was modeled by SWISS-MODEL85 and the alterations in amino acids within the protein structure were visualized with PyMOL.
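The relative distribution described above is a simple normalization of per-gene event counts. The sketch below assumes the counts have already been tallied per gene for one event type (mutation, amplification, or deep deletion); the function name and the numbers in the usage note are made up for illustration.

```python
def relative_distribution(event_counts):
    """Return each gene's share (in percent) of all events across CDGS genes."""
    total = sum(event_counts.values())
    if total == 0:
        # Assumption: report 0% for every gene when no events were observed.
        return {gene: 0.0 for gene in event_counts}
    return {gene: 100.0 * n / total for gene, n in event_counts.items()}
```

With hypothetical counts of 30, 10, and 10 events for TTN, RYR3, and DMD, the shares would be 60%, 20%, and 20%.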

Statistical analysis

The hypergeometric test was used to evaluate the enrichment of known cancer genes from the CGC gene set. Differential expression, methylation, and pathway activity analyses between comparison groups were conducted using the R package “limma”. A gene with adj.p < 0.05 as well as fold change (FC) > 2 or < -2 was considered significantly different. CpG sites missing in more than 5% of samples were excluded from the methylation profile86. The impute.knn function from the R package “impute” was applied to impute the retained CpGs according to the K Nearest Neighbor approach. The M-value transformation was used to normalize the methylation distribution and stabilize the variance: M-value = log2(Beta/(1-Beta)) for beta values. Kaplan-Meier survival curves were used to determine the association between CDGS mutations and prognosis. The log-rank test was used to calculate the statistical significance of differences between comparison groups. Correlation analysis was performed using the Pearson correlation.
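The M-value transformation stated above is a logit on the base-2 scale. The following sketch shows the arithmetic; the function name and the small clamp `eps`, which avoids division by zero or log of zero at beta values of exactly 0 or 1, are assumptions and not described in the text.

```python
import math


def beta_to_m_value(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value."""
    # Assumption: clamp beta away from 0 and 1 so the logarithm is defined.
    beta = min(max(beta, eps), 1.0 - eps)
    return math.log2(beta / (1.0 - beta))
```

A beta value of 0.5 maps to an M-value of 0, and beta values of 0.8 and 0.2 map to approximately +2 and −2, so the transformation is symmetric around half-methylation.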