Background & Summary

Cancer arises from genetic mutations, DNA damage responses, immune dysfunction, and chronic inflammation, among other factors1,2,3,4. The detection, identification, and interpretation of tumor driver genes (TDGs) and tumor driver variants (TDVs) are fundamental for devising personalized treatment strategies, predicting disease prognosis, and assessing therapeutic efficacy5,6,7. However, traditional approaches for detecting DNA variants rely on patient tissue sampling, which can be invasive and challenging. Liquid biopsy is a minimally invasive approach for circulating tumor DNA (ctDNA) detection, which overcomes many limitations of traditional tissue biopsy, enabling the identification of minimal residual disease, early detection of cancer recurrence, and dynamic monitoring of molecular changes in tumors8,9.

Currently, numerous bioinformatics methods have been developed to identify driver mutations in tissue-based multi-omics data, broadly classified into three main categories. The first category is frequency-based approaches, such as DriverML10, ActiveDriver11, and OncodriveFML12, which identify genes that are mutated more frequently than expected by background mutation rates. The second category is sub-network-based approaches, including tools like DriverNet13 and DawnRank14, which utilize prior knowledge of biological pathways and protein-protein interaction networks to identify driver genes. The third approach focuses on the hotspot mutations region that lie within functional domains or key residues in the three-dimensional structure of proteins, such as MSEA15 and e-Driver16. While these tools have been validated and benchmarked using tissue mutation data from consortia like The Cancer Genome Atlas (TCGA) project17, their efficacy in ctDNA mutation profiles, characterized by lower sensitivity and fewer detectable mutations, remains insufficiently explored.

In addition to the identification of driver genes and mutations, it is equally important to elucidate their biological functions and clinical significance. The concerted efforts of many scientists have advanced the standardization of mutation interpretation, enabling a comprehensive understanding of mutations from multiple perspectives (Table S1). For example, databases such as ClinVar18 and ClinGen19 comprehensively encompass currently reported human cancer and hereditary disease variants. Some drug-related databases like The Gene Drug Knowledge Database20 and PharmGKB21, mine existing resources to illustrate drug-gene/variant associations such as drug sensitivity/resistance, drug response, dosing guidelines, etc, while The Cancer Genome Interpreter22 integrates algorithms into an online platform to annotate the driver potential of alterations and their possible effect on treatment response. More comprehensive databases, like CIVIC23, not only contain therapeutic-related knowledge but also employ a strategy that combines expert review and community input to depict prognostic, diagnostic, and predisposing associations of all types of genetic and somatic variants. However, aside from part resources restricting public access, the main limitation of these resources is that the vast majority of variants were curated from tissue samples, which are not entirely applicable in clinical variant interpretation for patients who cannot undergo tissue sampling or tissue biopsy carries a high risk.

Here, we introduce CTDdgv, a comprehensive resource that provides researchers with an advanced platform for identifying TDGs and TDVs in ctDNA. CTDdgv integrates clinical functional information on ctDNA from thousands of research papers. By simply uploading a ctDNA mutation profile, researchers can efficiently identify genes and variants with cancer-driving potential and obtain clinical interpretations of the genes and variants of interest. Additionally, CTDdgv offers a suite of tools for visualizing driver genes and mutations, as well as exploring their prognostic and driver effects. In summary, CTDdgv is an invaluable resource for the in-depth study of ctDNA. A schematic overview of this study design is shown in Fig. 1.

Fig. 1
Fig. 1
Full size image

Content and interface of CTDdgv. (A) Experimentally supported ctDNA variants. (B) Identified tumor driver genes and variants in ctDNA mutation profiles. (C) Searching experimentally supported clinical interpretation. (D) Genome browser for illustrating the genomic landscape of experimentally supported mutations and potential tumor driver variants (TDVs). (E) Downloading ctDNA mutation spectra and curated data. (F) A collection of tools for identifying tumor driver genes (TDGs) and TDVs for tissues or ctDNA mutation spectra, evaluating the prognostic value of driver genes, and explaining the effect of driver gene set from multiple aspects. (G) Searching potential TDGs and TDVs.

Methods

Collection and annotation of experimentally verified ctDNA variants

The associations between ctDNA variants and clinical interpretation were manually curated from the published literature by systematically screening in the PubMed database using the following keyword combinations: (i) cancer and ctDNA mutation/mutated; (ii) ctDNA mutation; (iii) tumor and ctDNA mutation; (iv) cancer and circulating tumor DNA mutation; (v) ctDNA mutation detection. In total, 4,797 pieces of literature were collected as of 8 May 2024. All articles underwent a rigorous review process by at least two researchers to ensure their relevance and accuracy. During the screening process, cfDNA variants that might result from clonal hematopoiesis, inflammatory damage, or release from blood cells were filtered out24. The following information was retrieved from the eligible studies: PubMed ID, gene symbol, variant type (e.g., missense, insertion, deletion, complex rearrangement), variant description (e.g., G1202R, G12C), disease type (e.g., breast cancer and metastatic breast cancer, with cancer subtypes specified in strict accordance with the original publications), body fluid used for ctDNA extraction (e.g., plasma, cerebrospinal fluid), sequencing method, experimental method, evidence of ctDNA variants with drug resistance, association with tumor characteristics (e.g., tumor staging, tumor diameter), prognostic value, marker for cancer metastasis, prospects of substituting tissue detection, impact on treatment (e.g., targeted therapy, immunotherapy), and other clinical interpretation. Here, experimentally supported clinical interpretation records of ctDNA driver genes and driver variants were categorized into two groups: Gene-Level and Variant-Level. The Gene-Level group provides clinical interpretation for overall gene variations while the Variant-Level group focuses on specific variations.

Data expansion of ctDNA variants

To facilitate a more comprehensive study and interpretation of driver genes and their variants, the following enhancements have been implemented: (i) Databases such as GeneCards25, HGNC26, Ensembl27, NCBI GenBank28, CIVIC23, COSMIC29, and Jax-Clinical Knowledgebase30 were linked to acquiring information from external resources, allowing users to retrieve a substantial amount of annotation and functional information. (ii) TransVar31 was used for the conversion of multi-omics mutation coordinates. Users can access detailed information about the DNA strand where the mutation occurs, along with the mutation coordinates across various omics levels, including gDNA, cDNA, and protein. This enables users to conduct a deeper investigation of the variants across multiple omics layers. (iii) CTDdgv offers links to the genome browser in UCSC32 and Ensembl databases27, allowing users to retrieve variant distribution within a 10,000 bp region upstream or downstream of the corresponding mutation site. (iv) For missense mutations, which make up the majority in the Variant-Level, AlphaFold33 and PyMOL (https://pymolwiki.org/) were utilized to predict the optimal protein structures before and after the mutation. Structure visualization and PDB format files were provided to assist users in researching protein macromolecular structure changes due to mutations, thereby aiding the development of targeted drugs and enhancing the understanding of cancer mechanisms.

Design of geMERlb pipeline for identifying TDGs and TDVs in ctDNA

We introduce geMERlb (Genomic Element Mutation Enrichment Research in Liquid Biopsy) to identify tumor driver genes (TDGs) and variants (TDVs) in ctDNA by integrating nonsynonymous somatic mutations of liquid biopsy and the sequence information of genomic elements.

The genomic element set containing the coding sequences (CDS) (n = 20,185), splice sites (n = 18,729), 3′ untranslated regions (3′ UTRs) (n = 19,369), 5′ untranslated regions (5′ UTRs) (n = 19,188), and promoters (n = 20,164) were generated by PCAWG and downloaded from DriverPower34. Their genome coordinates were converted from GENCODE v37 to GENCODE v38 by Liftover to align with somatic mutations annotated in different reference genomes.

geMERlb simulates a walker walking along the sequence of a given genomic element. Mutation Accumulation Score (MAS) represents the cumulative value recorded at each genome position, starting with 0, increasing if patients encounter mutations, and decreasing without a mutation. The increment (\({S}_{{inc}}\)) and decrement (\({S}_{{dec}}\)) of MAS are separately calculated by:

$${S}_{{inc}}=1/\left({\rm{Number\; of\; mutation\; records}}\right)=1/\sum {\rm{Y}},{\rm{and}}$$
$${S}_{{dec}}=1/\left({\rm{Number\; of\; non}}-{\rm{mutated\; positions}}\right)$$

where Y represents a vector recording the number of mutations at each genome position, and L denotes the length of the genomic element sequence, that is, Y = (y1, …, yL). The sum of MAS increments and MAS decrements both amounting to 1. Thus, MAS at the \({i}^{{th}}\) position is calculated by:

$${{MAS}}_{i}=\sum _{j\in {L}^{M},j\le i}{y}_{j}\times {S}_{{inc}}-\sum _{j\notin {L}^{M},j\le i}{S}_{{dec}}$$

where \({L}^{M}\) denotes a vector of positions where mutations occur, and the increment at a mutant position is calculated by \({y}_{j}\times {S}_{{inc}}\), where 1 ≤ j ≤ L and \(j\in {L}^{M}\). The decrease of positions without mutations is consistently \({S}_{{dec}}\). Therefore, MAS bridges of the genomic element sequence are expected to sharply increase over a short distance in a region with mutation enrichment. Mutation Enrichment Score (MES) is defined as the maximum deviation of MAS, representing the location of the mutation enrichment region (MER) and the abundance of mutation enrichment. Taking into account the positional order of the maximum MAS (maxMAS) and minimum MAS (minMAS) within the genomic sequence, we determine MER and MES as follows:

  1. 1)

    MER is identified as the region between the genomic position of minMAS and maxMAS when the genomic position coordinate of maxMAS is greater than minMAS’s. MES is calculated by:

    $${MES}={\max }{MAS}-{\min }{MAS}$$

    Since MAS begins and ends with a value of 0, both the positions of maxMAS and minMAS can’t simultaneously occur at the beginning and end of a genomic sequence. When the genomic position coordinate of maxMAS is less than that of minMAS,

  2. 2)

    If the maxMAS, with a value of 0, is located at the beginning of the genomic sequence, a novel maximum MAS (nvmaxMAS) is always present to the right of the minMAS coordinate. MER is defined as the region between the genomic position of minMAS and nvmaxMAS. MES is calculated by:

    $${MES}={nvmaxMAS}-{\min }{MAS}$$
  3. 3)

    If the minMAS, with a value of 0, is located at the end of the genomic sequence, a novel minimum MAS (nvminMAS) will consistently be found to the left of the maxMAS coordinate. MER is defined as the region between the genomic positions of nvminMAS and maxMAS. MES is calculated by:MES is equal to maxMAS minus nvminMAS.

    $${MES}={\max }{MAS}-{nvminMAS}$$
  4. 4)

    If minMAS and maxMAS are not located at the beginning or end of the genomic sequence, nvminMAS is always present to the left of the maxMAS coordinate, while nvmaxMAS is always situated to the right of the minMAS coordinate. MES is compared between the maxMAS minus nvminMAS group and the nvmaxMAS minus minMAS group. The MER is defined as the region between the genomic position of minMAS and maxMAS of the group with greater MES. MES is calculated by:

$${MES}={\max }{MAS}-{\min }{MAS}$$

An empirical p-value is employed to assess the MES significance by a randomization-based test. The randomization process is conducted 1000 times. For each randomization, the number of mutations is kept the same as the actual mutation records, while the sites where mutations occur are randomly selected across the sequence, allowing for replacement. The p-value is calculated as:

$$p=\frac{{MES}(\pi )\ge {MES}}{1000}$$

where \({MES}(\pi )\) is a MES vector of randomly selected mutations in the given genomic element. Benjamini-Hochberg procedure is applied to adjust the p-values. Genes with an adjusted p-value (adj.p) < 0.05 are considered as TDGs for the given element and the variants within MER are defined as TDVs.

Mutation dataset of ctDNA and tissue

We collected 38 ctDNA mutation profiles (5 of which included corresponding survival information) representing 21 cancer types from published literature35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67, 15 datasets containing over 500 mutations were selected to assess the performance of TDGs identification tools (Table S2). We identified 2,305 TDGs and 9,645 TDVs from 38 profiles and organized them into a Prediction-Level dataset. We reclassified and consolidated the 21 cancer types into 17 categories based on their tumor origins. For example, we merged “Non-Small Cell Lung Cancer” and “Lung Cancer” with the same tumor origin under a single Lung Cancer category, and removed “Gynecologic Cancers” due to uncertain tumor origin. In addition, we downloaded somatic mutation data for whole cancer genomes in MAF format across 33 cancer types from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/tcga) and identified TDGs based on tissue mutation profiles. To facilitate comparison with the cancer types in ctDNA datasets, these 33 cancer types were reorganized into 28 categories based on the same classification criteria.

Other datasets

For benchmarking geMERlb, the Cancer Gene Census (CGC) gene list was downloaded from the COSMIC Cancer Gene Census project68, the HiConf gene list was obtained from Erina Takai et al.69, and the experimental supported clinical interpretation (ESCI) gene list comprises all ctDNA driver genes with clinical interpretation curated by us. The non-small cell lung cancer dataset used to demonstrate the application of external data was obtained from Qian Wang et al.62 The breast cancer from Belinda Kingston et al.56 was utilized to explore the driving impact of the driver gene set. Ten hallmark cancer gene sets were obtained from LncACTdb 3.0.70. We collected high-throughput expression profiles across 33 cancer types and corresponding clinical information from the TCGA (https://www.cancer.gov/tcga) to investigate the oncogenic expression patterns and prognostic value of driver genes.

Statistical analysis

We performed gene expression differential analysis using DESeq 2 (v1.38.3). Survival analysis results were calculated using the survival package. Enrichment analysis was performed using clusterProfiler (v4.6.2).

Data Records

All the data files are stored in the Figshare71 and are available under the terms of CC-BY 4.0. The dataset contains a total of 4 datasets, covering the underlying data for the data composition and visualization tools. The Variant.csv and Gene.csv contain the experimentally supported clinical interpretation recorded including PubMed ID, gene symbol, variant type, variant description, disease type, body fluid used for ctDNA extraction, sequencing method, experimental method, evidence of ctDNA variants with drug resistance, association with tumor characteristics, prognostic value, marker for cancer metastasis, prospects of substituting tissue detection, impact on treatment, and other clinical interpretation. The Prediction.csv contains all potential TDGs and TDVs identified from 38 datasets across 17 cancers. The data mainly include disease name, PubMed ID, gene symbol, prognostic value, genomic element, start and end positions of TDV, reference nucleotide, alteration nucleotide, and mutation type. The RawData.csv detailed annotation information for all mutation datasets, including data source, cancer type, number of samples, number of mutations, and survival information.

Technical Validation

To confirm the accuracy of ctDNA TDGs and TDVs clinical interpretation supported by the experiment, data extraction was independently conducted by the authors and subsequently cross-checked. Discrepancies in the data extraction process were resolved through consensus. Through manual review of 4,797 publicly available studies, we curated a comprehensive set of 1,674 ctDNA-related records with clinically relevant interpretations, covering 238 TDGs and 577 TDVs. We categorized the dataset into 1,170 Variant-Level and 504 Gene-Level records (Table 1). The dataset encompasses 109 cancer types systematically ranked by Variant-Level record counts, as displayed on the database homepage. The top 15 most frequently reported diseases and associated genes are presented in Fig. 2A,B. At the Variant-Level, the majority of records are associated with disease prognosis (n = 439, Table 1) and drug resistance (n = 438, Table 1). At the Gene-Level, prognosis-related records are the most prevalent (n = 331, Table 1). Notably, at the Variant-Level, missense mutations predominate, accounting for 73.93% (n = 865) of all records, with the T790M mutation being the most frequently observed (Fig. 2C,D).

Table 1 Clinical interpretation records statistics in Variant-Level and Gene-Level dataset.
Fig. 2
Fig. 2
Full size image

Statistics of clinical interpretation records and schematic of geMERlb pipeline. (A) Counts of clinical interpretation records for the top 15 disease types in the Variation-Level and Gene-Level datasets. (B) Counts of clinical interpretation records for the top 15 genes in the Variation-Level and Gene-Level datasets. (C) Statistics of clinical interpretation records for different variant types in the Variation-Level dataset. (D) Statistics of clinical interpretation records for different Missense mutations. (E) Schematic of the algorithm for identifying mutation enrichment region for genomic element.

We developed a pipeline, geMERlb, specifically designed to identify tumor driver genes (TDGs) and tumor driver variants (TDVs) in ctDNA by detecting mutation enrichment regions across genomic elements (see methods, Fig. 2E). To benchmark geMERlb, we performed a comprehensive performance evaluation by comparing it to two established tools, DriverPower34 and ActiveDriverWGS11, using 15 ctDNA mutation datasets. Following the user manuals of each tool, the same datasets were standardized to the required input formats for geMERlb, DriverPower, and ActiveDriverWGS. All tools were executed with default parameters for a consistent comparison. An adjusted p-value threshold of < 0.05 (i.e., the adjusted False Discovery Rate (FDR) < 0.05) was uniformly applied to identify candidate driver genes. Since the absence of a gold standard (i.e., true cancer driver genes) for evaluating performance with different tools, we employed two benchmarking measures as performance metrics (Table S3). The first benchmark reflects the capacity of tools for capturing well-known cancer-associated genes. The Cancer Gene Census (CGC) is a list of 748 driver genes whose mutations have been causally linked to cancer development72. A higher proportion of CGC genes in predictions is generally regarded as an indicator of better performance in numerous studies10,13. The second benchmark comprises a curated set of 99 high-confidence (HiConf) cancer genes identified by Kumar et al.69 through a comprehensive literature review. The overlaps between the potential driver genes and the CGC or HiConf gene sets were utilized as the gold standard for evaluating candidate ctDNA drivers. In most ctDNA datasets, Driverpower consistently predicted a larger number of driver genes, ActiveDriverWGS was more conservative in predictions, and geMERlb predicted an intermediate number of driver genes (Table S4). For all potential TDGs across 15 datasets, geMERlb identified the highest proportions of driver genes in the CGC gene set (11.23%) and the HiConf gene set (3.22%) (Fig. 3A). Moreover, geMERlb demonstrated superior performance when evaluated across individual datasets, achieving the highest proportions of potential driver genes in the CGC gene set and the HiConf gene set ranking second (Fig. 3B). Additionally, we evaluated the predictive performance of geMERlb on tissue mutation data from TCGA spanning 33 cancer types employing identical benchmarking criteria. The geMERlb demonstrated robust performance across 33 cancer types for both benchmarking metrics (CGC = 4.29%, HiConf = 0.58%; Fig. 3C) and consistently showed the highest proportions in both benchmark sets when assessed on individual datasets (Fig. 3D). Subsequently, we utilized these 238 genes with experimentally supported clinical interpretation (ESCI) as a novel benchmark to evaluate the performance of tools (Table S3). Among the three tools, geMERlb demonstrated the highest proportion of driver genes within the ESCI gene set (5.31%, Fig. 3E). Moreover, geMERlb consistently outperformed the other tools across different datasets (Fig. 3F). Then, we identified 2,305 TDGs and 9,645 distinct TDVs from publicly available publications, encompassing 17 cancer types (Fig. 3G). Retrieval examples of the identified results are provided in the supplementary materials (Method S2). Notably, several driver genes (115/2305) and driver variants (76/9645) have been previously demonstrated in independent studies to be associated with clinical outcomes such as treatment response and prognosis. Additionally, we also identified TDGs across 28 cancer types from TCGA data (Fig. 3H). In conclusion, the comparative analysis underscores the effectiveness of geMERlb in detecting driver genes in liquid biopsy and tissue samples relative to other established methodologies.

Fig. 3
Fig. 3
Full size image

Performance evaluation of geMERlb and statistics of the Prediction-Level dataset. (A) The proportions of CGC and HiConf genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across all 15 ctDNA datasets. (B) The proportions of CGC and HiConf genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across 15 individual ctDNA datasets. (C) The proportions of CGC and HiConf genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across all 33 TCGA datasets. (D) The proportions of CGC and HiConf genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across 33 individual TCGA datasets. (E) The proportions of ESCI genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across all 15 ctDNA datasets. (F) The proportions of ESCI genes identified by the geMERlb, ActiveDriverWGS, and DriverPower tools across 15 individual ctDNA datasets. (G) Statistics of TDGs and TDVs identified from 38 ctDNA mutation profiles. (H) Statistics of TDGs identified from the TCGA dataset.

Usage Notes

To our knowledge, CTDdgv (http://bio-bigdata.hrbmu.edu.cn/CTDdgv/) is currently the most comprehensive resource on ctDNA variants. This platform offers researchers user-friendly modules to explore ctDNA TDGs and TDVs, including clinical annotations (Method S1, Fig. S1A–H, Table S5), driver potential predictions (Method S2, Fig. S1I,J), as well as online analysis (Method S3, Fig. S2, Table S6-7) and visualization tools (Method S4, Fig. S3, Table S8). In this study, our core algorithm, geMERlb, by focusing purely on the genomic positions of mutations and their counts in tumor samples, can detect rare mutations with greater sensitivity by calculating the MAS and MES, particularly in the underexplored non-coding regions of the genome. Nevertheless, the pipeline does not include a correction mechanism for undetected mutations in plasma that may fall below the sensitivity threshold. Moreover, while geMERlb has been successfully applied to identify driver genes in large-scale TCGA cohorts, most currently available plasma-derived mutation data originate from targeted sequencing, which typically captures fewer mutations. To address this limitation, we plan to incorporate larger-scale plasma mutation datasets in future research.