Abstract
Type-II topoisomerases resolve topological stress in DNA through double-strand breaks. While topoisomerases are chemotherapy targets linked to therapy-related genotoxicity, TOP2B is uniquely positioned to influence mutagenesis through its activity in non-dividing cells and sensitivity to topoisomerase poisons. To investigate this, we generated DNA-binding maps of TOP2B, CTCF, and RAD21 in human cancer samples and analyzed these for driver mutations and mutational processes across 6500 whole cancer genomes. TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites are enriched in somatic mutations and structural variants, particularly at sites with evolutionary conservation, high transcription and long-range chromatin interactions. TOP2B binds driver genes such as TP53, MYC, FOXA1, and VHL, and many frequently mutated non-coding regions. We show that one non-coding TOP2B-bound element at the non-coding RNA gene RMRP drives tumor initiation and growth in vivo. Our study highlights TOP2B as a safeguard of genome integrity and a marker of mutational processes and hotspots in cancer, underscoring implications for cancer genomics research.
Introduction
Chromatin interactions during transcription, replication, and other cellular processes impose mechanical stress on DNA, leading to entanglements and knots that cause genome instability and cell death if left unresolved. Topological constraints in mammalian genomes are resolved by type II topoisomerases TOP2A and TOP2B that catalyze reversible DNA double-strand breaks (DSBs) and strand passage of uncleaved DNA duplexes1. TOP2A is expressed in proliferating cells and ensures appropriate chromosome condensation and segregation during mitosis. In contrast, TOP2B is expressed ubiquitously across tissues and cell types2. Notably, TOP2B binds active promoters, enhancers, and boundaries of topologically associating domains, often with the insulator protein CTCF and the cohesin complex subunit RAD21 (refs. 3,4), linking TOP2B functions with tissue-specific gene regulation and chromatin architecture5,6,7,8. During its reaction cycle, TOP2 temporarily attaches to the cleaved ends of DNA to form a covalent TOP2-DNA cleavage complex9. Inhibiting the reversal of the complex in proliferating cells by the anthracycline class of chemotherapy drugs, such as doxorubicin and epirubicin, results in cytotoxic DSBs and apoptosis in solid and hematological cancers10,11. However, stabilization of TOP2 cleavage complexes can result in structural variants (SVs) due to aberrant DSB formation and repair. For example, MLL (KMT2A) translocations commonly found in therapy-induced leukemias are linked to TOP2B activity at transcriptionally active sites12,13,14. TOP2A may also contribute to the formation of translocations7. TOP2B-mediated DNA lesions and translocations can arise from endogenous cellular processes such as RNA Polymerase II (Pol II) pause release and transcriptional elongation15,16,17,18,19,20,21,22, including TOP2B-mediated translocations in prostate and breast cancers23,24. Despite this evidence, the roles of TOP2B in cancer mutagenesis remain largely unexplored.
Cancer is driven by somatic mutations such as single nucleotide variants (SNVs), insertions-deletions (indels), and larger structural variants such as copy number alterations (CNAs), inversions, and translocations. Driver mutations deregulate key cancer-related pathways by activating oncogenes or disrupting tumor suppressors. Driver mutations are positively selected in cancer genomes and are found computationally by identifying loci with significantly elevated mutation rates25,26,27. Most known driver mutations affect protein-coding sequences, whereas few non-coding drivers aside from promoter mutations of TERT have been identified to date26,27,28,29,30,31. Cancer genomes typically contain few driver mutations, whereas most mutations are considered functionally neutral passengers that result from diverse mutational processes. These mutational processes affect the genome at multiple levels, including nucleotide-level mutational signatures32, local mutational enrichments often observed in regulatory elements such as CTCF binding sites33,34, and megabase-level variations in mutation burden that correlate tissue-specifically with chromatin accessibility and DNA replication timing35,36,37,38. Mutational processes have been linked to aging, DNA repair deficiencies, carcinogen exposures, or chemotherapies39,40,41,42. However, many of these processes remain poorly understood in terms of underlying mechanisms and etiology.
Here, we study TOP2B binding in the context of mutational processes and cancer driver mutations across thousands of cancer genomes. Leveraging a comprehensive genome-wide map of TOP2B binding obtained from human tumor biopsies, we reveal a striking connection between TOP2B binding and genomic enrichment in somatic mutations and SVs, particularly at evolutionarily conserved, constitutively active regulatory elements marked by CTCF or cohesin binding. We identify hundreds of individual sites as significant mutational hotspots that often affect cancer driver genes. We validate one mutational hotspot as a non-coding cancer driver in vivo. These findings shed light on the mutational processes involving TOP2B, potentially explaining how tissue-specific and chemotherapy-mediated DNA damage can lead to mutations at critical regulatory and architectural elements in cancer genomes.
Results
Conserved DNA-binding landscape of TOP2B in human cancer cells
We generated genome-wide DNA-binding profiles of TOP2B, CTCF, and RAD21 in human liver hepatocellular carcinoma (HCC) samples using chromatin immunoprecipitation sequencing (ChIP-seq). Peak calling across the biological replicates and merging the peaks by factor binding revealed 83,749 unique binding sites of 600 bps (FDR < 0.05, FC ≥ 2 from MACS2) (Supplementary Fig. 1). Of these, 59% were bound by TOP2B, including 16,258 TOP2B-CTCF-RAD21 sites (19%), 20,731 TOP2B-RAD21 sites (25%), and 12,568 sites bound by TOP2B alone (15%) (Fig. 1a; Supplementary Data 1). We also identified 11,445 CTCF-RAD21 sites (14%) and 21,469 sites bound by RAD21 alone (25%), while only a few sites were bound by CTCF alone (1018 or 1.2%). Frequent DNA-binding interactions of TOP2B, CTCF, and RAD21 are consistent with our previous study in mouse liver3.
a Overview of ChIP-seq analysis. TOP2B, CTCF, and RAD21 binding sites were identified in clinical samples of human hepatocellular carcinomas. Venn diagram shows sites bound by the three factors. b Evolutionary conservation, pan-tissue activity, and functional characteristics of binding sites. TOP2B-CTCF-RAD21 sites and CTCF-RAD21 sites are highly conserved based on DNA-binding profiles of CTCF and TOP2B in mammalian livers, occur frequently at constitutively active CTCF binding sites measured from 70 human cell lines, and are enriched in sites with double-strand break (DSB) activity. P-values are from one-tailed hypergeometric tests. c Chromatin states of TOP2B binding sites in normal tissue types from the Roadmap Epigenomics project (X-axis). Chromatin states of TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites were compared to sites lacking TOP2B binding (i.e., CTCF-RAD21 sites and RAD21-only sites, respectively). Colors in the heatmap show log odds ratios and asterisks show false discovery rate (FDR)-adjusted P-values (q-values) from two-tailed hypergeometric tests (FDR < 0.05). Tissues and chromatin states are ordered by total significance (TSS, transcription start site). d Transcriptional associations of TOP2B binding sites in cancer. Violin plots show mean gene expression levels of genes adjacent to binding sites. P-values are from two-tailed Wilcoxon tests. Pie charts below show the fractions of sites near genes. P-values are from one-tailed hypergeometric tests. Pan-cancer transcriptome data from the Pan-cancer Analysis of Whole Genomes (PCAWG) study is shown. Boxplots span interquartile range (IQR; 25–75%), lines show median values, whiskers show values within 1.5x IQR, and site counts are shown below. e Long-range chromatin interactions involving TOP2B binding sites from 27 human cell types. P-values are from two-tailed Wilcoxon tests. Source data are provided as Source Data files.
To gain functional insights into TOP2B binding sites, we studied their evolutionary conservation and epigenetic activity using maps of CTCF binding in 70 human cell lines in ENCODE43 and evolutionary conservation of CTCF binding in mammalian livers44. TOP2B-CTCF-RAD21 sites were highly conserved and active across human cell types compared to all sites combined (Fig. 1b): 70% of these occurred at constitutively active CTCF binding sites found in at least 90% of ENCODE human cell lines and 23% showed conserved CTCF binding in all five mammalian livers. Also, 60% of TOP2B-CTCF-RAD21 sites had a CTCF DNA-binding motif, 48% showed conserved TOP2B binding in mouse liver3, and 40% occurred at DSB sites from earlier cell line experiments4, collectively underscoring the functional importance of TOP2B-CTCF-RAD21 sites. To exclude potential biases of CTCF or RAD21 binding, we performed a more stringent analysis comparing TOP2B-CTCF-RAD21 sites to CTCF-RAD21 sites as controls. TOP2B-CTCF-RAD21 sites were enriched in in vitro DSB activity (odds ratio (OR) = 1.24), conserved TOP2B binding in mice (OR = 2.54), constitutive CTCF binding in human tissues (OR = 1.37), and conserved CTCF binding in mammalian livers (OR = 1.37) (all P < 2.2 × 10⁻¹⁶, hypergeometric test), collectively emphasizing the functional associations with TOP2B binding. Interestingly, a subset of CTCF-RAD21 sites also overlapped with conserved TOP2B binding in mice (27%) or showed in vitro DSB activity (35%), potentially indicating additional TOP2B-bound sites that remain undetected in TOP2B ChIP-seq due to technical limitations or profiling of heterogeneous cancer samples. Similarly, comparison of TOP2B-RAD21 sites with RAD21-only sites revealed that 27% of TOP2B-RAD21 sites showed conserved TOP2B binding in mice (P < 2.2 × 10⁻¹⁶, OR = 3.40) while no significant overlap with DSB sites was detected.
TOP2B-RAD21 sites were less likely to overlap with conserved CTCF binding in mammalian livers (P < 2.2 × 10⁻¹⁶, OR = 0.21) or constitutive binding in human tissues (P < 2.2 × 10⁻¹⁶, OR = 0.24), which is consistent with lack of CTCF binding in our data. Thus, CTCF binding alone does not fully recapitulate TOP2B binding. CTCF-independent binding of cohesin at cis-regulatory elements is involved in tissue-specific transcription regulation45,46, and TOP2B-RAD21 sites may represent a different mode of TOP2B activity mediated by cohesin in tissue-specific loci.
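The enrichment comparisons above rest on a 2×2 table of sites (TOP2B-bound versus control, annotated versus not annotated). The following is a minimal sketch of how an odds ratio and a one-tailed hypergeometric P-value can be derived from such a table. The counts are hypothetical, and the function names and table layout are illustrative assumptions rather than the analysis code used in the study.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: bound sites with the annotation      b: bound sites without it
    c: control sites with the annotation    d: control sites without it
    """
    return (a * d) / (b * c)


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` sites without replacement from
    `population` sites, of which `successes` carry the annotation."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts, for illustration only.
a, b, c, d = 30, 70, 10, 90
oratio = odds_ratio(a, b, c, d)
pvalue = hypergeom_upper_tail(
    k=a, population=a + b + c + d, successes=a + c, draws=a + b
)
print(f"OR = {oratio:.2f}, one-tailed P = {pvalue:.2e}")
```

Exact integer arithmetic keeps the sketch dependency-free; for tables with tens of thousands of sites, a statistics library would be the more practical choice.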
We characterized the chromatin states of TOP2B binding sites in ten major human tissue types from the Roadmap Epigenomics project47, comparing TOP2B-bound sites with sites lacking TOP2B binding similarly to the analysis above. Both TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites were highly enriched in active transcription start sites (TSS) and TSS-flanking regions in all tissue types (FDR < 0.05) (Fig. 1c, Supplementary Data 2). For example, 23% of TOP2B-CTCF-RAD21 sites and 35% of TOP2B-RAD21 sites occurred at active TSSs in liver (P < 2.2 × 10⁻¹⁶, OR = 11 and OR = 6.5, respectively). In contrast, TOP2B binding sites were depleted of quiescent and repressive epigenetic states. Although our TOP2B binding maps were obtained from liver cancer samples, regulatory states of other tissue types were also enriched, indicating that TOP2B often binds regulatory elements shared across cell types.
We studied transcriptional characteristics of TOP2B binding by examining pan-cancer gene expression data from 2,851 transcriptomes of primary and metastatic cancers from Pan-cancer Analysis of Whole Genomes (PCAWG) and Hartwig Medical Foundation (HMF) projects48,49. TOP2B binding sites occurred more often near genes compared to sites lacking TOP2B binding (P < 2.2 × 10⁻¹⁶, OR = 1.73, hypergeometric test) and mean gene expression levels at the sites were significantly higher in pan-cancer data (P < 2.2 × 10⁻¹⁶, Wilcoxon test) and in individual cancer types (Fig. 1d, Supplementary Fig. 2). Similarly, expression levels at TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites were higher compared to respective control sites lacking TOP2B binding.
To investigate TOP2B binding in the context of chromatin architecture, we studied promoter-enhancer interactions of 27 normal tissue types from a promoter-capture HiC dataset50. TOP2B binding sites occurred at hotspots of regulatory chromatin interactions more often than non-bound sites (P < 2.2 × 10⁻¹⁶, Wilcoxon test), extending our prior findings in mouse liver3. For example, 50% of TOP2B-bound sites had at least one promoter-enhancer interaction and 18% of sites had ten or more interactions, more than expected from non-TOP2B sites (35% and 8.7%, respectively) (Fig. 1e). We also compared TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites to control sites lacking TOP2B binding and confirmed that TOP2B-bound sites had more frequent promoter-enhancer interactions. In summary, TOP2B binding in human cancer cells occurs at transcriptionally active sites with three-dimensional chromatin interactions that are bound by CTCF in many human tissue types, show conserved TOP2B or CTCF binding in mammalian genomes, and coincide with in vitro DSB maps attributed to topoisomerase activity, together delineating functional elements of the cancer genome.
TOP2B binding sites are enriched in small mutations in cancer genomes
To decipher the roles of TOP2B in cancer mutagenesis, we studied somatic mutations in 6495 cancer samples of 18 major cancer sites, including 2452 primary cancers from the PCAWG project28 and 4043 metastatic cancers from the HMF project49 profiled using whole-genome sequencing (WGS) and comprising 97 million SNVs and indels in total (Supplementary Data 3). First, we characterized local mutational processes of small mutations using the computational method RM2 (ref. 34), which we used to study mutational enrichments across binding sites. RM2 applies a negative binomial regression model to quantify mutational processes at binding sites relative to flanking control sequences and trinucleotide-level and megabase-level covariates (Fig. 2a).
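To make the site-versus-flank comparison concrete, the sketch below computes a crude per-base mutation rate ratio between pooled binding sites and their pooled flanking sequences. This is not the RM2 model: it omits the negative binomial regression as well as the trinucleotide-level and megabase-level covariates, and it yields no P-value. The function name and the mutation counts are hypothetical; the 600 bps site width and ±600 bps flanks follow the window sizes given in this study.

```python
def site_vs_flank_ratio(site_mutations, flank_mutations, n_sites,
                        site_width=600, flank_width=1200):
    """Ratio of per-base mutation rates in pooled sites versus pooled flanks.

    flank_width is the total flanking sequence per site (600 bps upstream
    plus 600 bps downstream). Values above 1 indicate more mutations per
    base inside the sites than in the flanks.
    """
    site_rate = site_mutations / (n_sites * site_width)
    flank_rate = flank_mutations / (n_sites * flank_width)
    if flank_rate == 0:
        raise ValueError("no mutations in flanking sequence; ratio undefined")
    return site_rate / flank_rate


# Hypothetical counts, for illustration only.
ratio = site_vs_flank_ratio(site_mutations=300, flank_mutations=400, n_sites=1000)
print(f"site/flank mutation rate ratio = {ratio:.2f}")
```

A regression model such as the one described above is needed to separate a genuine local enrichment from differences in sequence composition or regional mutation burden, which this ratio cannot distinguish.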
a Analysis overview. Mutation frequency in aligned binding sites was compared to flanking sequences using RM2 (ref. 34), which quantifies mutational enrichments in genomic elements using trinucleotide context and megabase-scale mutation burden. b Mutational enrichments in TOP2B binding sites in primary and metastatic cancers (RM2, FDR < 0.05). Enrichment scores represent directional significance such that binding sites with enriched mutations are on the right and depleted sites on the left (OR, odds ratio). c Grassy hills plots show local mutation burden in pooled binding sites (colors; 600 bps) and flanking control sequences (grey; ±600 bps). FDR-adjusted two-tailed P-values (q-values) from RM2 are shown. d Mutations in TOP2B binding sites associate with conserved TOP2B binding in mice (left) and in vitro double-strand break (DSB) activity (right). e Mutational signatures of single base substitutions (SBS) in TOP2B binding sites. Indels were also included as a separate signature. In panels d and e, TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites were compared to controls (CTCF-RAD21 and RAD21-only sites, respectively) using two-tailed hypergeometric tests (FDR < 0.05). Enriched signatures are displayed on the right and depleted signatures on the left. f Mutational processes of small mutations in TOP2B binding sites grouped by transcription or chromatin loops. Four bins of sites were analyzed (none, low, middle, high). Heatmap shows site types and cancer types having at least one bin with significant mutational enrichments (FDR < 0.05 from RM2). Positive associations are shown above (i.e., higher mutagenesis associated with more transcription or chromatin interactions) and negative associations are shown below. Color strips indicate site types, cancer types, and activity. g, h Examples of mutational processes at TOP2B-RAD21 binding sites. Grassy hills plots show mutation frequencies in sites binned by transcription or chromatin interactions.
TOP2B-RAD21 sites (dark red) were compared to flanking sequences (grey) using RM2. A control analysis using RAD21-only sites is also shown (light blue). FDR-adjusted two-tailed P-values (q-values) from RM2 are shown. Loess smoothing in panels (c, g, h) is shown with 95% standard error bands. Source data are provided as Source Data files.
Sites co-bound by TOP2B-CTCF-RAD21 or TOP2B-RAD21 were consistently enriched in mutations in most cancer types we analyzed, especially in liver, breast, lung, esophagus, and prostate cancers, and pan-cancer (FDR < 0.05) (Fig. 2b). CTCF-RAD21 sites and RAD21-only sites were also enriched in mutations in some cancer types, although at lower significance levels. Distinct patterns of localized mutagenesis were apparent: TOP2B-CTCF-RAD21 sites and CTCF-RAD21 sites involved focal mutation enrichments at peak summits, while TOP2B-RAD21 sites had a smoother and wider increase in mutation burden (Fig. 2c). TOP2B-only sites were not enriched in mutations, while too few CTCF-only sites were available for this analysis.
We examined mutational processes in the context of functional genomic annotations. TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites with conserved TOP2B binding in mouse genomes were highly enriched in small mutations in all cancer types, relative to non-conserved sites and sites lacking TOP2B binding (Fig. 2d). TOP2B-CTCF-RAD21 sites with in vitro DSB activity4 were also enriched in mutations in most cancer types, while TOP2B-RAD21 sites were not enriched or showed limited signals of depletion. We also grouped binding sites according to tissue-specific chromatin states and again observed significant mutational enrichments in TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites in multiple cancer types, while few or no enrichments were found in other types of sites (Supplementary Fig. 3). Thus, subsets of TOP2B binding sites have certain functional characteristics that associate with even higher local mutagenesis. TOP2B binding may explain the increased mutagenesis at CTCF and cohesin binding sites observed previously in many cancer types33,34.
We asked if TOP2B binding sites were differentially altered by mutational signatures of single base substitutions (SBS). We assigned SNVs to their most likely SBS signatures and selected the signatures enriched or depleted in TOP2B-CTCF-RAD21 sites or TOP2B-RAD21 sites relative to CTCF-RAD21 sites or RAD21 sites as controls, respectively (Fig. 2e). Signature SBS1, linked to methylcytosine deamination, was consistently enriched in TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites in multiple cancer types. APOBEC signatures SBS2 and SBS13 were enriched in lung, esophageal, colorectal, and breast cancers, suggesting a link between TOP2B binding, advanced cancers and therapy-associated mutagenesis by APOBEC enzymes51,52. In contrast, signatures such as SBS4 (tobacco exposure) and SBS17a/b (unknown etiology) associated negatively with TOP2B binding and were enriched in CTCF-RAD21 or RAD21 sites. Indel analysis revealed heterogeneous associations with TOP2B binding: in melanoma, indels were enriched in TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites, while in colorectal cancer, significantly fewer indels were found in TOP2B binding sites. This analysis favors the more common SBS signatures, whereas less common signatures and indels may remain underpowered. In summary, TOP2B binding sites are affected by tissue-specific mutational processes in multiple cancer types.
We studied mutational processes at TOP2B binding sites in the context of transcriptional activity and chromatin interactions. For transcription, we analyzed 2851 matched cancer transcriptomes from PCAWG and HMF and grouped the binding sites into four bins based on adjacent gene expression (i.e., none, low, medium, or high). For chromatin interactions, sites were binned based on promoter-enhancer interaction frequency across 27 human cell types50. Local mutational processes were then quantified for each of the bins and site types using RM2, and results with at least one significant bin were selected (FDR < 0.05 from RM2). This analysis revealed 49 associations of mutational processes with transcription or chromatin interactions in 11 cancer types (Fig. 2f). Most of these (44/49) involved TOP2B-RAD21 or TOP2B-CTCF-RAD21 sites, more than expected (P = 8.5 × 10⁻¹⁵, OR = 18, hypergeometric test). Positive interactions were seen for TOP2B-RAD21 binding: sites at highly expressed genes or frequent chromatin interactions were highly enriched in mutations relative to flanking sequences, while sites lacking transcription or chromatin interactions were not (Fig. 2g, h). As controls, RAD21-only sites grouped by gene expression or chromatin interactions showed no significant mutational enrichments. In TOP2B-CTCF-RAD21 sites, we also observed a few inverse associations between gene expression activity and local mutational processes, and as expected, these were less pronounced in control sites bound by CTCF-RAD21 (Supplementary Fig. 4). These analyses highlight transcriptional and chromatin architectural correlates of local mutational processes in TOP2B binding sites. In summary, TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites are highly enriched in small mutations in multiple cancer types.
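The four-bin grouping of sites described above can be sketched as follows. The text does not specify the exact cut-offs, so this example assumes that sites with zero activity form the "none" bin and that the remaining sites are split into tertiles of their ranked values. The function name and input values are hypothetical.

```python
def bin_sites(values):
    """Label each site as 'none', 'low', 'medium', or 'high'.

    Assumption: zero activity maps to 'none'; nonzero values are split
    into tertiles using cut points taken from the sorted nonzero values.
    """
    nonzero = sorted(v for v in values if v > 0)
    n = len(nonzero)
    if n == 0:
        return ["none"] * len(values)
    cut_low = nonzero[n // 3]          # lower bound of the middle tertile
    cut_high = nonzero[(2 * n) // 3]   # lower bound of the top tertile
    labels = []
    for v in values:
        if v <= 0:
            labels.append("none")
        elif v < cut_low:
            labels.append("low")
        elif v < cut_high:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Hypothetical expression values or interaction counts per site.
print(bin_sites([0, 1, 2, 3, 4, 5, 6, 0]))
```

Each bin would then be analyzed separately for mutational enrichment, so that a trend across bins can be read as an association between activity and local mutagenesis.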
Structural variant breakpoints are enriched at TOP2B binding sites
We studied TOP2B binding in the context of 2.1 million structural variant (SV) breakpoints (SVBPs) in primary cancers and metastases from PCAWG and HMF. TOP2B binding sites had significantly more pan-cancer SVBPs compared to genome-wide tiling windows (P < 2.2 × 10⁻¹⁶, Wilcoxon test). TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites also had significantly more SVBPs than control sites lacking TOP2B binding (P < 2.2 × 10⁻¹⁶) (Fig. 3a). Thus, TOP2B binding sites are enriched in SVBPs beyond what is explained by CTCF or RAD21 binding.
a Comparison of SVBP frequency in TOP2B binding sites in pooled pan-cancer data from primary cancers and metastases. TOP2B-bound sites and non-bound sites were compared (TOP2B-CTCF-RAD21 vs CTCF-RAD21, TOP2B-RAD21 vs RAD21-only). As genome-wide controls, tiled genomic windows of matching size were used. b Frequency of SVBPs in TOP2B binding sites relative to functional activity at the sites. TOP2B binding sites were grouped into four bins by pan-cancer gene expression or chromatin interaction frequency. c Frequency of SVBPs in TOP2B binding sites with functional annotations: TOP2B conservation in mice, constitutive CTCF binding in human tissues, or DSB activity. P-values from one-tailed Wilcoxon tests are shown in panels (a–c). d Analysis of SVBPs in TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites relative to small mutations. Sites were compared to control sites lacking TOP2B binding using two-tailed hypergeometric tests and filtered results are shown (FDR < 0.05). Enrichment scores represent directional significance such that sites with enriched mutations are on the right and depleted sites on the left (OR, odds ratio). e Relative distances of binding site midpoints and the closest SV breakpoints. Pooled SV breakpoints from HMF and PCAWG datasets within 1 kbps from site midpoints were included. The dotted line corresponds to the 600 bps window defining binding sites in this study (i.e., ±300 bps). Boxplots span interquartile range (IQR; 25–75%), lines show median values, whiskers show values within 1.5x IQR, and site counts are shown below. P-values are from one-tailed Wilcoxon tests. Source data are provided as Source Data files.
Focusing on TOP2B binding sites, we asked if their functional annotations were informative of SVBP burden. Sites at highly expressed genes had significantly more pan-cancer SVBPs compared to sites at lowly expressed genes or intergenic sites (P < 2.2 × 10⁻¹⁶; Fig. 3b) and these results were recapitulated in most cancer types (Supplementary Fig. 5). Similarly, TOP2B binding sites with frequent promoter-enhancer chromatin interactions were enriched in SVBPs compared to sites with fewer or no interactions (Fig. 3b). We then studied TOP2B binding sites in the context of functional annotations. Sites with conserved TOP2B binding in mice3 or sites with constitutive CTCF binding in human cell types43 were enriched in SVBPs compared to sites lacking these annotations (Fig. 3c). No SVBP enrichment was observed for TOP2B-bound sites having in vitro DSBs4.
To complement these pan-cancer analyses, we analyzed SVBPs in TOP2B binding sites in the context of small mutations in individual cancer types, which we established above as highly enriched in TOP2B binding sites. By pooling SVBPs with small mutations and using sites lacking TOP2B binding as controls, we asked if SVBPs were differentially distributed in TOP2B-bound sites relative to small mutations. Again, TOP2B-CTCF-RAD21 sites and TOP2B-RAD21 sites were enriched in SVBPs compared to control sites lacking TOP2B binding (hypergeometric test, FDR < 0.05) (Fig. 3d). This enrichment of SVBPs was apparent in ten cancer types, especially in lung, colorectal, and esophageal cancers.
We examined how SVBPs were distributed relative to midpoints of binding sites by measuring genomic distances between midpoints and closest SVBPs within 1000 bps in pooled pan-cancer samples. Midpoints of TOP2B binding sites were significantly closer to SVBPs compared to non-bound sites (P = 3.6 × 10⁻¹³, Wilcoxon test). Midpoints of TOP2B-CTCF-RAD21 and TOP2B-RAD21 sites were also closer to SVBPs compared to control sites lacking TOP2B binding (Fig. 3e). Interestingly, while 71% of sites overall had at least one SVBP located within 1000 bps of site midpoints, only 51% of these occurred inside the 600 bps peaks that we defined in our ChIP-seq analysis. Therefore, our stringent mutational analysis may miss additional SVs that occur near TOP2B binding sites but outside the genomic windows of ChIP peaks. Further breakpoints could be explained by limitations in SV calling or mutagenic mechanisms adjacent to binding sites. In summary, TOP2B binding sites appear as hotspots of genomic rearrangements, potentially due to topoisomerase activity that resolves topological constraints through DNA breakage and religation.
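The distance measurement above can be illustrated with a binary search over sorted breakpoint positions. This is a minimal sketch for a single chromosome: the function names and coordinates are hypothetical, while the 1000 bps search radius and the ±300 bps peak half-width follow the values stated in the text.

```python
from bisect import bisect_left


def nearest_breakpoint_distance(midpoint, sorted_breakpoints, max_dist=1000):
    """Distance from a site midpoint to the closest breakpoint, or None
    if no breakpoint lies within max_dist. Breakpoints must be sorted."""
    i = bisect_left(sorted_breakpoints, midpoint)
    candidates = []
    if i < len(sorted_breakpoints):        # closest breakpoint at or right of midpoint
        candidates.append(sorted_breakpoints[i] - midpoint)
    if i > 0:                              # closest breakpoint left of midpoint
        candidates.append(midpoint - sorted_breakpoints[i - 1])
    if not candidates:
        return None
    distance = min(candidates)
    return distance if distance <= max_dist else None


def inside_peak(distance, half_width=300):
    """True if the breakpoint falls inside the 600 bps site window."""
    return distance is not None and distance <= half_width


# Hypothetical coordinates on one chromosome.
breakpoints = [100, 950, 5000]
for mid in (1000, 1600, 3000):
    d = nearest_breakpoint_distance(mid, breakpoints)
    print(mid, d, inside_peak(d))
```

Separating "within 1000 bps" from "inside the peak" in this way mirrors the distinction drawn above between breakpoints near a site and breakpoints within the ChIP-seq window itself.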
Established and candidate cancer driver mutations at TOP2B binding sites
To explore potential functional consequences of mutations at TOP2B binding sites, we asked if individual sites were specifically enriched in mutations indicative of cancer drivers and positive selection. As our initial analyses revealed multiple adjacent binding sites in major cancer genes and SV breakpoints occurring outside sites, we extended binding site coordinates based on ±1 kbps flanking windows around peak midpoints and merged overlapping sites (Supplementary Data 4). We performed cancer type-specific analyses of driver mutations separately for small mutations and SVs using the ActiveDriverWGS method27.
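The site extension and merging step described above can be sketched as a standard interval merge. The example assumes midpoints on a single chromosome and treats intervals that touch or overlap as one merged element. The function name and coordinates are hypothetical and are not taken from the study's pipeline.

```python
def extend_and_merge(midpoints, flank=1000):
    """Extend each midpoint by +/- flank bps and merge overlapping windows.

    Returns a list of (start, end) tuples sorted by start coordinate.
    """
    intervals = sorted((max(0, m - flank), m + flank) for m in midpoints)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous window: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical peak midpoints on one chromosome.
print(extend_and_merge([5000, 6500, 10000]))
```

In a genome-wide setting the same procedure would be applied per chromosome, so that adjacent binding sites within one gene are tested as a single element.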
Driver analysis revealed 249 unique binding sites significantly enriched in either small mutations (SNVs, indels) or SVBPs (FDR < 0.05 from ActiveDriverWGS) (Fig. 4a, Supplementary Data 5). We refer to these as frequently mutated regulatory elements (FMREs). These included 197 FMREs with enriched small mutations, 63 FMREs with enriched SVBPs, and a few FMREs found in both analyses. FMREs were detected in 17 of the 18 cancer types we studied, and most were identified in melanomas and in cancers of the colon, breast, esophagus, liver, lymph, and prostate. FMREs were enriched in 41 known driver genes (8.7 expected; P < 2.2 × 10⁻¹⁶, hypergeometric test), including major pan-cancer drivers such as TP53, MYC, PTEN and EGFR, and tissue-specific drivers such as VHL, TMPRSS2, and FOXA1. Besides established drivers, we found 163 FMREs enriched in non-coding mutations as a tissue-specific catalogue of non-coding candidate drivers. Further, a subset of FMREs were enriched in translocation breakpoints that collectively affected 30 established cancer genes in 380 cancer samples (Fig. 4b).
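Selecting FMREs at FDR < 0.05 requires adjusting per-site P-values for multiple testing. The sketch below assumes the Benjamini-Hochberg procedure, which the text does not name explicitly. The function name and the P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each q-value
    # is the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical per-site P-values.
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(qvals) if q < 0.05]
print(qvals, significant)
```

The running minimum enforces that q-values never decrease as P-values increase, which is what allows a single threshold such as 0.05 to define the reported set of elements.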
a FMREs identified at the binding sites through significant enrichment of small mutations (SNVs, indels) or structural variant breakpoints (SVBPs) in individual cancer types in primary cancers (PCAWG) and metastases (HMF) (one-tailed tests from ActiveDriverWGS, FDR < 0.05). Known cancer genes are labelled. Color strips below the bar chart show site types, primary or metastatic cancers, and the type of mutation involved (small mutation or structural variant). b Circos plot of translocations at FMREs combined from primary cancers and metastases. Known cancer genes at FMREs are labelled. c Functional characteristics of FMREs relative to other binding sites. FMREs in the highest category (bin4) correspond to sites with highest third of gene expression or chromatin interactions from 27 human tissue types, or sites with constitutive CTCF binding in >90% of human tissues. P-values from one-tailed Wilcoxon tests (top) or one-tailed hypergeometric tests (bottom) are shown. d FMREs ranked by the number of promoter-enhancer chromatin interactions. Top genes with FMREs are labelled. e Genes with differential expression associations with alterations in adjacent FMREs (one-tailed F-test, FDR < 0.05). Bars show fold-change values of genes with respect to FMRE mutations. Symbols indicate associations identified from SNVs and indels (circles) or SVBPs (triangles). f Enrichment map of biological processes and molecular pathways enriched in non-coding FMREs (ActivePathways, FDR < 0.05). The network shows enriched pathways as nodes in which similar pathways that share many genes are connected by edges. Pathways were prioritized by mutational enrichments across primary and metastatic cancers. Colors indicate cancer types in which pathways were identified, and white corresponds to pathways only detected across multiple cancer types. Source data are provided as Source Data files.
Most FMREs were bound by TOP2B (189 of 249), significantly more than expected from binding sites in our dataset (P = 3.0 × 10⁻⁷, OR = 1.67). These included 94 TOP2B-RAD21 sites and 71 TOP2B-CTCF-RAD21 sites. Functional annotations relevant to TOP2B activity were also significantly enriched in FMREs, such as in vitro evidence of DSBs, conservation of TOP2B binding in mouse liver, and constitutive CTCF binding in human cells (Fig. 4c). FMREs were significantly associated with higher gene expression (P = 6.7 × 10⁻⁹, Wilcoxon test) and frequency of promoter-enhancer interactions (P = 3.4 × 10⁻⁵), suggesting roles of FMREs in gene regulation and chromatin architecture in cancer. Many FMREs appeared as chromatin interaction hotspots: 132/249 (55%) FMREs had at least one interaction and 33 FMREs (13%) at least 25 interactions across 27 human tissue types. These often occurred at cancer driver genes such as PTEN, TP53, BCL6 and ETV6 (Fig. 4d). Promoter-enhancer interactions of FMREs highlighted 20 additional distally-located cancer driver genes, including the cell cycle regulatory kinase CHEK2 (Supplementary Fig. 6, Supplementary Data 6), suggesting that some TOP2B-associated FMREs may regulate cancer genes distally via chromatin interactions or topologically associating domains27.
We asked if mutations in FMREs were associated with differential expression of target genes in matching cancer transcriptomes and found associations with 43 genes (FDR < 0.05, F-test). FMRE alterations were associated with increased expression of oncogenes such as MYC in lymphoma, EGFR in lung cancer, and FOXA1 in breast cancer, and with decreased expression of tumor suppressor genes such as TP53 in prostate cancer and VHL in kidney cancer (Fig. 4e, Supplementary Data 7), suggesting functional consequences of these TOP2B-associated mutations.
Then we focused on FMREs that were enriched in non-coding SNVs or indels, aiming to obtain evidence of their functionality in cancer. We performed an integrative pathway enrichment analysis using the ActivePathways method53 that considered GO biological processes and Reactome pathways of FMRE-associated genes and prioritized FMREs with mutational enrichments from multiple cancer types. This analysis revealed 202 pathways and processes significantly altered by non-coding mutations, including hallmark cancer processes such as cell adhesion, cell migration, developmental and differentiation processes, and signaling pathways Wnt, VEGF, Hippo, and Rho GTPases (family-wise error rate (FWER) < 0.05 from ActivePathways53) (Fig. 4f, Supplementary Data 8), highlighting potential roles of FMREs in cancer pathways. In summary, FMREs identified through TOP2B binding converge onto established cancer drivers, biological pathways, and non-coding regulatory elements across cancer types.
TOP2B binds mutational hotspots in established cancer genes
We studied the FMREs at 41 known cancer genes, most of which involved TOP2B binding sites (Fig. 5a). Protein-coding SNVs and indels in cancer driver genes often follow well-established mutational patterns: oncogenes are characterized by mutational hotspots that affect one or a few adjacent amino acid residues in protein sequence, while tumor suppressor genes are often affected by loss-of-function mutations such as nonsense SNVs and frameshift indels54. FMRE analysis revealed 14 genes with at least five such protein-coding alterations (Fig. 5b).
a Known cancer driver genes at frequently mutated regulatory elements (FMREs) identified from enriched small mutations or SV breakpoints (SVBPs) (one-tailed tests from ActiveDriverWGS, FDR < 0.05). Stacked barplots show cancer samples affected at each FMRE, types of mutations, and binding sites. Colors on the left indicate cancer types. b Genes with FMREs at mutational hotspots and loss-of-function mutations in known driver genes. Mutations (≥5) in affected genes are grouped by amino acid substitutions or impact (frameshift (FS), non-frameshift (NFS), or nonsense (STOP)). Colors indicate cancer types. Mutation counts are shown across primary and metastatic cancers. c–f The prominent FMRE at the TP53 promoter is a TOP2B-CTCF-RAD21 binding site that is enriched in SVBPs. c Genomic locus of the FMRE at TP53 with SVBPs in the FMRE (black) and adjacent flanking sequence (grey) (top), corresponding ChIP-seq signals from representative experiments (black) and MACS2 peaks (pink) (middle). Counts of alterations and cancer samples colored by alteration types and cancer types are shown below. FDR-adjusted P-values from one-tailed tests in ActiveDriverWGS are shown. d Circos plot of translocations at the TP53 promoter FMRE. The FMRE locus is shown as a triangle and translocations involving the FMRE are displayed as arcs (brown). Putative target genes of translocations are listed. e TP53 expression in metastatic prostate cancers in samples having SVBPs at the FMRE (n = 10) and other samples (n = 156). FDR-adjusted P-value from a one-tailed F-test is shown. Boxplots span interquartile range (IQR; 25–75%), lines show median values, and whiskers show values within 1.5 × IQR. f Long-range chromatin interactions at the TP53 FMRE from 27 human tissue types. Source data are provided as Source Data files.
We examined TOP2B-associated mutational hotspots in detail. First, IDH1 encodes isocitrate dehydrogenase 1, a metabolic enzyme whose R132 amino acid substitution hotspot defines a lower-risk subtype of glioma and drives a hypermethylator phenotype55,56. In our data, this hotspot occurred at a TOP2B-RAD21 binding site and was identified as an FMRE with enriched mutations in biliary and central nervous system cancers (Supplementary Fig. 7a). Second, VHL encodes a core tumor suppressor protein that is frequently mutated in kidney cancer. We found an FMRE bound by TOP2B-RAD21 in VHL that was enriched in frameshift indels and nonsense SNVs in 61 kidney cancer samples in both PCAWG and HMF cohorts (4.5 expected, FDR < 2.2 × 10⁻¹⁶ from ActiveDriverWGS) (Supplementary Fig. 7b). These TOP2B-associated mutations correlated with transcriptional inhibition of VHL in matching cancer transcriptomes (FDR = 1.8 × 10⁻⁴, F-test), indicating loss-of-function effects (Supplementary Fig. 7e). Third, the TOP2B-RAD21 bound FMRE in the EGFR oncogene included the oncogenic mutational hotspot T790M in 42 metastatic lung cancers (10 expected; FDR = 7.9 × 10⁻¹⁰) (Supplementary Fig. 7c). This mutational hotspot was associated with EGFR upregulation in matching cancer transcriptomes (FDR = 0.024), suggesting a gain-of-function effect of these TOP2B-associated mutations (Supplementary Fig. 7e). EGFR T790M mutations are known to cause acquired resistance to tyrosine kinase inhibitors57.
Some known cancer driver genes were enriched in non-coding mutations at TOP2B binding sites. One TOP2B-CTCF-RAD21 bound FMRE was found in the gene encoding the pioneer factor FOXA1 (ref. 58) and was enriched in coding and non-coding mutations in breast and prostate cancers (ActiveDriverWGS, FDR < 10⁻¹²). In breast cancer, FMRE mutations were associated with FOXA1 upregulation (Supplementary Fig. 7d,e). In summary, these examples indicate that the DNA breakage and repair function of TOP2B coincides with sites of positively selected driver mutations in cancer, pointing to a potential mechanism by which TOP2B activity influences cancer genome evolution.
FMREs with structural variation hotspots
We reviewed the FMREs with SV hotspots at known cancer genes. The promoter of the tumor suppressor gene TP53 harbored one of the most prominent FMREs. This TOP2B-CTCF-RAD21 site was enriched in pan-cancer SV breakpoints in 61 cancer samples in PCAWG and HMF corresponding to deletions and translocations (6.1 expected, all FDR < 10⁻⁹ from ActiveDriverWGS) (Fig. 5c, d). The FMRE was also identified tissue-specifically in 18 metastatic prostate cancer samples from HMF (1.5 expected, FDR = 5.3 × 10⁻⁹). Analysis of matching transcriptomes revealed that metastatic prostate cancers having SVBPs at the FMRE showed significant reduction in TP53 expression compared to cancers lacking SVBPs (FDR = 3.8 × 10⁻⁴, F-test) (Fig. 5e). This suggests a TOP2B-mediated recurrent rearrangement of the TP53 promoter that leads to transcriptional inhibition of TP53 in prostate cancer. The FMRE is also a hotspot of promoter-enhancer chromatin interactions (Fig. 5f). Thus, TOP2B-mediated resolution of genomic entanglements from 3D chromatin interactions could lead to DNA damage and positively selected structural variants inhibiting a core tumor suppressor gene.
Additional recurrent SV hotspots were found at TOP2B binding sites. First, the TMPRSS2-ERG locus involved two FMREs, one in TMPRSS2 bound by TOP2B-RAD21 and another in ERG bound by TOP2B-CTCF-RAD21, that affected 91 prostate cancer samples across PCAWG and HMF (6.5 expected, min FDR < 2.2 × 10⁻¹⁶) (Supplementary Fig. 8a–c). TMPRSS2-ERG gene fusions are hallmark drivers of prostate cancer59 and their formation involves TOP2B-induced DSBs24, lending confidence to our findings. As expected, these SVs were associated with a strong increase in ERG expression in matching prostate cancer transcriptomes (FDR = 6.3 × 10⁻⁴) (Supplementary Fig. 8d). Second, the MYC oncogene promoter included a TOP2B-CTCF-RAD21 bound FMRE with enriched SVs in 14 B-cell non-Hodgkin’s lymphomas in PCAWG (0.24 expected, FDR = 4.2 × 10⁻¹⁶ from ActiveDriverWGS). These SVs correspond to translocations with the immunoglobulin heavy chain locus at chr14, a known driver mechanism in lymphoma60 (Supplementary Fig. 9a, b). Accordingly, lymphomas with these SVs showed substantial MYC upregulation (FDR < 2.2 × 10⁻¹⁶, Wilcoxon test) (Supplementary Fig. 9c). The FMRE also had multiple promoter-enhancer interactions (Supplementary Fig. 9d), exemplifying the roles of TOP2B binding in genome architecture and structural variation in cancer. Lastly, the most significant SV-enriched FMRE was a translocation hotspot at NOD1 that involved SVs in 49 metastatic colorectal, esophageal and lung cancers in HMF (1.1 expected; FDR < 2.2 × 10⁻¹⁶ from ActiveDriverWGS) (Supplementary Fig. 10). NOD1 encodes an intracellular sensor of microbial components involved in innate immunity61 whose recurrent SVs highlight a putative driver role triggered by structural variation. Collectively, these data show that TOP2B binding sites often co-occur with structural rearrangements of cancer driver genes.
Non-coding mutations at RMRP drive cancer in vivo
Our systematic analyses of individual TOP2B binding sites revealed tissue-specific enrichments of putative non-coding driver mutations whose functional annotations, evolutionary conservation and molecular pathways suggest roles in cancer. To examine this hypothesis in detail, we studied one top-ranking FMRE experimentally. We selected one TOP2B-CTCF-RAD21 binding site found in the promoter of the non-coding RNA (ncRNA) gene RMRP that was identified as significant with small mutations in 40 metastatic breast cancer samples (11 expected, FDR = 3.3 × 10⁻⁸ from ActiveDriverWGS) (Fig. 6a). The FMRE was also highly mutated in stomach, skin, and ovary cancers (>5%) and showed a high pan-cancer mutation frequency in PCAWG and HMF data (Fig. 6b). Previous studies of promoter mutations of RMRP have presented in vitro functional evidence62 while recent genomics analyses have linked these to elevated localized mutagenesis26. The FMRE locus included RMRP involved in mitochondrial function, ARHGEF39 encoding a rho guanine nucleotide exchange factor, and CCDC107 encoding a little-studied membrane protein. Overall, most FMRE mutations (204/209 or 98%) were non-protein-coding while five protein-coding mutations were found in exons of CCDC107 (Fig. 6c).
a Genomic locus of the frequently mutated regulatory element (FMRE) at the non-coding RNA gene RMRP. The plot shows small mutations (SNVs, indels) in primary breast cancers (PCAWG) and metastases (HMF) (top), TOP2B-CTCF-RAD21 binding site (middle), corresponding ChIP-seq signals from representative experiments (black) and MACS2 peaks (pink) (bottom). b FMRE mutations grouped by cancer types. c Locus overview with pan-cancer mutation frequency (top), FMRE and binding sites (middle), and adjacent genes (bottom). Triangles indicate sgRNAs used in genome editing. d Experimental validation. Two sgRNAs targeting the orthologous mouse FMRE were delivered by lentivirus using ultrasound-guided in utero injection into mouse embryos. In parallel, two sgRNAs targeting the human FMRE were delivered into MCF10A human mammary epithelial cell lines (not shown). e Genome editing of mouse FMRE causes earlier tumor onset in vivo. Kaplan-Meier plot compares tumor-free survival of mice having FMRE-targeted sgRNAs (red or purple lines) and mice having control sgRNAs (grey lines). Hazard ratio (HR) and P-value were derived from Cox proportional-hazards regression with Wald test. HR is shown with 95% confidence intervals. f Three-dimensional (3D) Matrigel growth assay of MCF10A cells compares control (Scr)-treated cells (left) and FMRE-mutated cells (right). One representative experiment for sgRNA sg6-1835p is imaged on day 28. g Differential expression analysis of FMRE-mutant MCF10A cells. Scatterplot shows differential gene expression from RNA-seq data of 2D (X-axis) and 3D assays (Y-axis) of edited MCF10A cells relative to Scr-treated controls. Genes were prioritized based on either joint up-regulation or down-regulation in 2D and 3D assays and significant genes are colored (merged FDR < 0.05 from DPM, ref. 64). Directional significance scores display up-regulated genes on the top or right and down-regulated genes on the bottom or left. Cancer genes from Cancer Gene Census are labelled.
h Significantly enriched pathways visualized as an enrichment map. Significant pathways and processes are shown as nodes that are connected into subnetworks if the pathways share many genes (family-wise error rate (FWER) < 0.05 from ActivePathways53). Node color corresponds to transcriptomics evidence from the two types of growth assays. Source data are provided as Source Data files.
We studied the RMRP FMRE using functional genomics experiments in human mammary epithelial cells and a mouse model of skin cancer. We performed CRISPR/Cas9 genome editing of the locus in MCF10A cells using one sgRNA targeting the major mutational hotspot in the RMRP promoter and another sgRNA targeting the non-coding regions of CCDC107 and ARHGEF39 (Supplementary Fig. 11a). In parallel, we edited the conserved orthologous FMRE in the mouse genome using ultrasound-guided in utero injection of two independent sgRNAs into the surface ectoderm of mouse embryos expressing the sensitizing Pik3caH1047R oncogene63 (Fig. 6d). In this mouse model of skin cancer, FMRE mutations triggered significantly earlier tumor onset compared to scramble control sgRNAs (P = 0.0019, Wald test) (Fig. 6e). All tested tumors displayed squamous cell carcinoma histology and contained the expected indels in the FMRE (Supplementary Fig. 11b, c). In MCF10A cells, FMRE mutations caused enhanced proliferation phenotypes in a three-dimensional (3D) growth assay (Fig. 6f, Supplementary Fig. 11d, Supplementary Data 9).
To complement these functional assays, we performed transcriptome profiling of FMRE-edited MCF10A cells from 2D and 3D growth conditions using RNA-seq, detecting differentially expressed genes relative to scramble-treated control cells (Supplementary Data 10-11). To detect consistent expression patterns, we performed directional integration of transcriptomics data using the DPM method64 and prioritized 641 genes that were jointly up-regulated or down-regulated in the two growth conditions (FDR < 0.05 from DPM) (Fig. 6g, Supplementary Data 12). Pathway enrichment analysis of differentially expressed genes confirmed the observed phenotypes of FMRE-mutant cells: the major pathway themes involved cell migration and motility, regulation of wound healing, cell adhesion, and developmental processes (Fig. 6h, Supplementary Data 13). Interestingly, the genes RMRP, CCDC107 and ARHGEF39 at the locus were not differentially expressed in FMRE-mutant cells; however, several gene-regulatory and signaling genes with cancer functions such as NOTCH1, FAT1, ERBB3, TNFAIP3 and NFKB2 were differentially expressed, and known cancer genes were significantly enriched (P = 0.0051, OR = 1.68, hypergeometric test) (Supplementary Fig. 12). Non-coding driver mutations at this FMRE may have complex epigenetic and regulatory functions that can be deciphered by examining these transcriptome-wide deregulation patterns. Collectively, these experiments characterize the frequently mutated RMRP locus as a non-coding cancer driver and exemplify TOP2B binding sites as a rich resource for experiments.
Discussion
While TOP2B-mediated DSBs are essential in regulating chromatin architecture during transcription, development, and differentiation, these DSBs can also be a double-edged sword leading to genomic rearrangements and mutations in cancer. Here we integrated a genome-wide map of TOP2B binding in human cancer samples with a WGS dataset of thousands of cancer genomes. We show that TOP2B binding sites are subject to localized mutational processes, which give rise to small mutations and larger structural variants in untreated primary cancers and treated metastatic cancers. The dominant mutational signals at TOP2B binding sites were found from SNVs and SVs while indel-based signals were generally less detectable. The notable enrichment of SV breakpoints at TOP2B binding sites is consistent with our understanding of the role of TOP2B in generating DSBs and aligns with previous reports linking TOP2B to translocations12,13,14,23,24, which lends confidence to our TOP2B binding map. However, enrichment of SNVs at TOP2B binding sites is somewhat unexpected in the topoisomerase context and may indicate complex processes involved in DSB formation and repair. Recently, rare protein-coding mutations in TOP2A have been linked to an indel signature in cancer genomes, demonstrating the capability of topoisomerases to generate indels65.
The most striking mutational patterns are apparent at binding sites of TOP2B-CTCF-RAD21 or TOP2B-RAD21. Mutational enrichments at binding sites of CTCF and cohesin have been observed in multiple cancer types33,66,67,68. We show that TOP2B binding at these sites consistently associates with higher mutagenesis compared to the cofactors alone, offering a plausible explanation for these mutational patterns and suggesting that TOP2B-mediated DNA cleavage might be a key source of mutations. TOP2B binding also associates with transcription, active chromatin states, chromatin interactions, and certain mutational signatures. These collectively highlight context- and tissue-specific aspects of TOP2B-mediated mutagenesis for future investigation, such as transcription-coupled nucleotide excision repair, RNA Pol II pause-release18,23, or transcription factor (TF) binding69,70, as well as chemotherapy-induced mutagenesis in metastatic cancers. The high evolutionary conservation of TOP2B binding sites and their genome-wide co-occurrence with sites of DSB activity underscore their functional importance4,5,7,23,71 and support a pivotal role for TOP2B-mediated mutagenesis in cancer development, heterogeneity, and therapy resistance. Besides somatic mutations in cancer genomes, TOP2B-associated processes may exert a broader influence on the evolutionary dynamics of the germline genome.
The prominent genomic presence of TOP2B encompasses dozens of well-established driver genes and regions harboring frequent genomic rearrangements. Remarkably, SVs in TOP2B-associated FMREs drive activation of oncogenes such as MYC and TMPRSS2-ERG or disable tumor suppressors such as TP53. TOP2B also binds hotspots of protein-coding driver mutations in cancer genes such as IDH1, VHL, FOXA1, or EGFR. Intriguingly, analysis of driver mutations in TOP2B binding sites highlights hundreds of FMREs as putative non-coding drivers. FMREs associate with genes in cancer pathways, display high evolutionary conservation, and include frequent chromatin interactions, suggesting uncharacterized regulatory roles in cancer. As a proof of concept of TOP2B-mediated mutagenesis in cancer, the direct causal role of TOP2B in the generation of structural variants at the TMPRSS2 and ERG loci, resulting in de novo oncogenic fusions of prostate cancer, has been established previously24. We report TOP2B binding at a diverse set of additional cancer genes and non-coding genomic elements, suggesting that TOP2B may contribute to the genesis of driver mutations and their positive selection in cancer more commonly than documented to date. Importantly, focusing on TOP2B binding sites as a targeted search space for driver mutations allowed us to identify FMREs in individual cancer types rather than in pooled pan-cancer data. While TOP2B often binds ubiquitously active genomic elements, drivers found in various cancer types were relatively distinct. Thus, unraveling the intricacies of positive selection in cancer and fully understanding the mutagenic processes at TOP2B binding sites will require substantial further research.
We generated in vitro and in vivo functional data to validate one TOP2B-bound FMRE in a striking non-coding mutational hotspot at the ncRNA gene RMRP. We highlight a bona fide non-coding cancer driver using in vivo data of enhanced tumor initiation and in vitro data of three-dimensional cellular growth. This FMRE is highly mutated in several cancer types, lending confidence to its broad pan-cancer roles. However, the region entails complex gene-regulatory interactions that do not appear to control adjacent gene expression but instead may regulate numerous genes, some of which are involved in cancer hallmark processes and signaling pathways. More work is needed to decipher these non-coding mutations and associated molecular phenotypes.
Our study has notable limitations. First, associations of TOP2B binding with mutational processes and cancer drivers are based on systematic integration of multi-omics datasets. As such, causal evidence to link TOP2B binding with somatic mutagenesis is currently limited. While mutational processes at TOP2B binding sites are more pronounced at functionally active and conserved sites, additional confounding factors such as chromatin accessibility, transcription, RNA polymerase II stalling, chromatin interactions, DNA sequence context and secondary structures may also contribute. Future experiments involving TOP2B functional modulation with genomic readouts are required to better understand these processes; however, these are challenging due to the essential in vivo roles and molecular properties of TOP2B (refs. 20, 72). Our findings are supported by earlier work demonstrating TOP2 activity at oncogenic translocations in cancer14,24. Second, TOP2B binding sites were determined in two clinical samples from hepatocellular carcinoma, which provides only a partial view of TOP2B activity in cancer genomes. While mapping DNA-binding of TOP2B in human cells has been experimentally challenging to date, we tackled this challenge here by extending our previous robust experimental protocols in mouse liver3. While TOP2B binding sites from liver were analyzed, other cancer types also showed mutational enrichments, and the maps aligned with functional and evolutionary properties of TOP2B binding. This lends confidence to the catalogue of binding sites and indicates its pan-cancer relevance. Third, apart from TOP2B-CTCF-RAD21 sites, we observed mutational enrichments at CTCF-RAD21 sites, which also sometimes displayed evolutionary conservation for CTCF or TOP2B binding. These may include TOP2B binding sites in other cell types that were missed in our ChIP-seq experiments due to molecular heterogeneity of cancer samples.
Additional proteins co-binding DNA with TOP2B could also explain the mutational processes we observed. Fourth, we analyzed two independent cancer genomics datasets comprising untreated primary cancers and treated metastatic cancers of multiple cancer types that are confounded by data processing steps, variable sample sizes, and clinical characteristics. Notwithstanding, our findings are consistently replicated across multiple cancer types and cohorts. Despite these challenges, our TOP2B binding maps, along with annotations of mutational processes and known and putative driver mutations, represent a valuable resource for computational and experimental follow-up studies.
Mutational processes at TOP2B binding sites may be driven by many factors, such as TOP2B poisoning by chemotherapeutic drugs, dietary components, or environmental compounds. Subsequent processing of protein-trapped DSB ends can contribute to translocations and secondary malignancies, as documented in leukemia10,14. Potentially tumorigenic DNA lesions and translocations can originate from TOP2B-DNA transactions during endogenous cellular processes such as RNA Pol II pause release and transcriptional elongation, as shown in prostate and breast cancers23,24. We demonstrate the prevalence of small mutations and chromosomal rearrangements at TOP2B binding sites in both primary untreated cancers and metastatic treated cancers, which appear to be more common than previously recognized. Future mechanistic studies and integrative analyses bridging molecular cancer profiles with clinical and lifestyle data will be essential for advancing our understanding of the role of TOP2B in cancer mutagenesis and etiology, and the risks and off-target effects of widely used TOP2-targeting chemotherapies. In summary, we posit that TOP2B functions as a double-edged sword in cancer biology. While it plays a critical role in relieving topological stress and maintaining genome integrity, it may paradoxically serve as a source of somatic mutations.
Methods
Tissue collection
Three liver hepatocellular carcinoma (HCC) samples were collected from patients who underwent liver transplantation at the Ajmera Transplant Centre of Toronto General Hospital (ON, Canada) in 2016–2019. The HCC samples were from male patients with mean age 59 (age range 56–64) and of hepatitis C etiology. Tumors were of stage II (T2N0M0), moderately differentiated histologic grade, and did not exceed a maximum total tumor volume of 145 cm³ permitted by institutional guidelines. Tumors from males were studied because HCC incidence is up to four times higher in men than in women.
Chromatin immunoprecipitation (ChIP-seq) experiments
Approximately 30 mg of HCC tumor sample was used for each ChIP reaction. Frozen HCC tissues were dounce homogenized into single cell suspensions and cross-linked with 1% formaldehyde in Solution A (50 mM Hepes–KOH, 100 mM NaCl, 1 mM EDTA, 0.5 mM EGTA) for 20 min at RT. Fixation was quenched by adding glycine to a final concentration of 125 mM. After two washes with ice-cold PBS, cells were filtered through a 100 μm cell strainer to remove any connective tissue. Cell lysis and nuclei isolation steps were carried out in 10 mL of LB1 buffer (Diagenode) by rotating 20 min at 4 °C, followed by nuclear lysis in 5 mL of LB2 (Diagenode) and rotating 10 min at 4 °C. For chromatin shearing, the cell pellet was resuspended in Shearing Buffer (Diagenode) supplemented with complete proteinase inhibitors (Roche), and sonicated for 5 cycles (30 s ON, 30 s OFF) with the Bioruptor Pico sonicator (Diagenode). Chromatin was cleared by centrifugation at 13,000 rpm (21,000 × g) for 10 min at 4 °C. For ChIP, chromatin lysates were combined with 5 μg of anti-TOP2B, anti-CTCF, or anti-RAD21 antibodies and incubated overnight in IP buffer (Diagenode) rotating at 4 °C, and then with 70 μL of pre-blocked (0.5 mg/mL BSA) Dynabeads protein G (ThermoFisher) for 4 h. The following antibodies were used: anti-TOP2B (Abcam, catalog number ab72334, lot GR3259083-8), anti-CTCF (Millipore, catalog number mp07-449, lot 3059608), and anti-RAD21 (Abcam, catalog number ab992, lot GR3253930-6). Beads were then washed six times with RIPA buffer (50 mM Hepes–KOH, pH 7.5; 500 mM LiCl; 1 mM EDTA; 1% NP-40 or Igepal CA-630; 0.7% Na-deoxycholate), and one time with TBS (20 mM Tris–HCl, pH 7.6; 150 mM NaCl), and resuspended in ChIP Elution buffer (50 mM Tris–HCl, pH 8; 10 mM EDTA; 1% SDS). Cross-linking was reversed by overnight incubation at 65 °C. Cellular proteins and RNA were digested with Proteinase K (Invitrogen) and RNaseA (Ambion). ChIP DNA was purified using the phenol:chloroform:isoamyl alcohol (P:C:I) method.
ChIP-seq libraries were prepared using NEBNext Ultra II DNA Library preparation Kit following the manufacturer’s instructions (NEB, E7645). Libraries were amplified for 9 cycles using Q5 Hot Start High-Fidelity DNA Polymerase (NEB, M0493L), purified and size-selected with AMPure XP PCR purification beads (Beckman Coulter, A63881), and quantified with 2100 Bioanalyzer (Agilent). Input control DNA extracted from sonicated cell lysates of each sample was processed in parallel. We generated four libraries of binding sites for CTCF (patient 4, patient 5, two replicates for patient 6), three libraries for RAD21 (patient 4, two replicates for patient 6), and two libraries for TOP2B (patient 5, patient 6). The third TOP2B library was not included as limited tissue material was available, and the resulting library did not pass our quality controls. Libraries were sequenced with Illumina NovaSeq SP in a paired-end configuration. ChIP-seq experiments and data analysis were used to identify binding peaks of three factors (TOP2B, CTCF, RAD21), which reflect interactions of these proteins with chromatin. ChIP-seq peaks are referred to throughout the manuscript as binding sites (e.g., TOP2B binding sites or TOP2B-bound regions).
ChIP-seq data analysis to define binding sites
Quality assessment of sequencing data was performed using the FastQC method73. Sequences were aligned to the human genome hg19 using the Bowtie2 method (v2.1.0) with the parameters “-1” and “-2” for paired-end reads74. The Samtools package (v1.2) was used to sort, clean, and format the data75. Peak calling was performed in individual replicates separately using the MACS2 method76 using input control libraries. The software was run using standard parameters except for the parameter “-f BAMPE” to specify paired-end reads and the parameter “-p 1e-3” to initially establish a lenient P-value threshold. As output, narrowPeak files including the standard BED columns with four additional columns were generated, including signal values, P-values, q-values (i.e., multiple testing adjusted P-values), and relative summit positions. Multiple testing correction was performed using the Benjamini-Hochberg (BH) False Discovery Rate (FDR) method77. Binding sites were first filtered for significance and fold-change (FDR < 0.05, FC ≥ 2). Sites in chromosomes 1-22, X, and Y were selected and sites in unmappable regions of the genome were excluded, as defined in the ENCODE project for hg19 (ENCFF001TDO)43. Based on the distribution of binding site widths, site widths were initially fixed to 550 bps around summits. Within each DNA-binding factor (e.g., all CTCF libraries), we then merged the overlapping coordinates and calculated the median summit position and q-value. For CTCF and RAD21, we selected only the sites where binding was detected in at least two different libraries. Due to technical limitations, we used a more sensitive process for TOP2B binding sites and selected sites if binding was detected in at least one of the two libraries. TOP2B ChIP-seq is an experimentally sensitive procedure with specific challenges in human cells. 
One challenge involves the large size of the TOP2B protein that requires extensive optimization of sonication to preserve TOP2B-DNA complexes while still achieving suitable chromatin shearing. Another challenge is limited input material and heterogeneous tissue material, especially for clinical cancer samples used here. Therefore, we opted for this approach to extract a wider set of TOP2B binding sites. TOP2B binding sites were still filtered using the same stringent thresholds as RAD21 and CTCF sites (FDR < 0.05; FC ≥ 2). To define sites bound by combinations of TOP2B, CTCF, or RAD21, we used the multiIntersect function of BEDTools78 to find overlapping sites. Sites within 100 bps were merged and their widths were fixed to 600 bps centering around the merged midpoints. Seven types of sites were defined based on the binding of the three factors: TOP2B-CTCF-RAD21, TOP2B-RAD21, CTCF-RAD21, TOP2B-only, RAD21-only, CTCF-only, and CTCF-TOP2B sites. Relatively few CTCF-only and CTCF-TOP2B sites were found. These were excluded from downstream analyses of mutational processes involving groups of sites but were included in driver analyses focusing on individual sites.
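The filtering, resizing, and merging steps described above can be sketched as follows. This is a minimal illustration in Python and not the pipeline used in the study: the `Peak` record, the function names, and the toy coordinates are hypothetical, and q-values are assumed to have already been converted from the −log10 scale reported in MACS2 narrowPeak files.

```python
from collections import namedtuple

# Hypothetical record for one narrowPeak row; `summit` is the offset from `start`
# and `q_value` is assumed to be back-transformed from -log10(q).
Peak = namedtuple("Peak", "chrom start end summit fold_change q_value library")

def filter_peaks(peaks, max_q=0.05, min_fc=2.0):
    """Keep peaks passing the significance and fold-change thresholds."""
    return [p for p in peaks if p.q_value < max_q and p.fold_change >= min_fc]

def fix_width(peak, width=550):
    """Return a fixed-width window (chrom, start, end, library) centered on the summit."""
    center = peak.start + peak.summit
    start = max(0, center - width // 2)
    return (peak.chrom, start, start + width, peak.library)

def merge_windows(windows):
    """Merge overlapping half-open windows and record the supporting libraries."""
    merged = []
    for chrom, start, end, library in sorted(windows):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            _, m_start, m_end, libs = merged[-1]
            merged[-1] = (chrom, m_start, max(m_end, end), libs | {library})
        else:
            merged.append((chrom, start, end, {library}))
    return merged

def select_sites(merged, min_libraries):
    """Keep merged sites detected in at least `min_libraries` distinct libraries."""
    return [(c, s, e) for c, s, e, libs in merged if len(libs) >= min_libraries]

# Toy input: two overlapping CTCF peaks from different libraries and two peaks
# that fail the fold-change or q-value filter.
peaks = [
    Peak("chr1", 1000, 1400, 200, 5.0, 0.001, "libA"),
    Peak("chr1", 1100, 1500, 150, 3.0, 0.010, "libB"),
    Peak("chr1", 5000, 5300, 100, 1.5, 0.001, "libA"),
    Peak("chr2", 100, 400, 50, 4.0, 0.200, "libB"),
]
windows = [fix_width(p) for p in filter_peaks(peaks)]
sites = select_sites(merge_windows(windows), min_libraries=2)
```

Under these assumptions, calling `select_sites` with `min_libraries=2` mirrors the rule used for CTCF and RAD21, while `min_libraries=1` mirrors the more sensitive rule used for TOP2B.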
Somatic SNVs, indels, and SVs in whole cancer genomes
Somatic small mutations (SNVs and indels) in whole cancer genomes were retrieved from two WGS datasets (hg19), PCAWG (ref. 79) and HMF (ref. 49). Hypermutated samples with >90,000 mutations were excluded. For PCAWG, the consensus dataset of SNVs and indels was used. For HMF, variant calls were further filtered as follows. We excluded data from multiple tumors from the same patient by selecting the first tumor ID alphabetically and removed samples lacking sufficient informed consent. Variant calls in HMF VCF files were filtered (filter = PASS). Cancer types in PCAWG and HMF were consolidated to 18 cancer types based on organs and anatomical sites. Cancer types with at least 25 samples in both HMF and PCAWG cohorts were analyzed separately, while the pan-cancer analyses included all cancer samples per cohort. We mapped functional effects of SNVs and indels on protein-coding genes using the ANNOVAR software80 and RefSeq gene definitions. Mutational signatures of single base substitutions (SBS) were compiled separately for PCAWG and HMF. SBS signatures of PCAWG were retrieved from the consensus dataset by Alexandrov et al. (2020)39. SBS signatures of HMF were computed using the SigProfiler software with SigProfilerMatrixGenerator (v1.2) and SigProfilerExtractor (v1.1.4)81 using SBS signatures of the COSMIC database (V3)81,82. Each SNV was assigned to the most probable SBS signature given its trinucleotide sequence. All indel mutations were combined for signature analyses. Somatic structural variants (SVs) from PCAWG and HMF were analyzed separately. For PCAWG, we used the consensus SV dataset developed by Li et al. (2020)40. For HMF, SVs identified using the gridss software83 were further filtered using recommended steps to select high-quality SVBP calls (filter = PASS, qual ≥ 1000, AS > 0, RAS > 0) and high-confidence SV events having both SVBPs were selected. SVs were annotated as duplications (DUP), deletions (DEL), inversions (INV), and translocations (TRA).
We mapped SVBPs to genes using ±3 kbps flanking regions. For differential gene expression analysis, we also computed gene copy-number alterations (CNAs) using segments from PCAWG and HMF, assigning the median total copy number to each gene based on overlapping CNA segments.
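The gene-level mapping described above can be illustrated with a minimal sketch. It assumes genes and CNA segments are held as simple dictionaries with hypothetical keys (name, chrom, start, end, cn). These names are illustrative only and are not taken from the PCAWG or HMF file formats.

```python
from statistics import median

FLANK = 3000  # ±3 kbps flanking region, as stated in the text


def genes_near_breakpoint(chrom, pos, genes, flank=FLANK):
    """Return names of genes whose flank-extended span contains the breakpoint."""
    return [
        g["name"]
        for g in genes
        if g["chrom"] == chrom and g["start"] - flank <= pos <= g["end"] + flank
    ]


def gene_copy_number(gene, segments):
    """Median total copy number of CNA segments overlapping the gene.

    Returns None when no segment overlaps the gene (an assumption of this sketch).
    """
    overlapping = [
        s["cn"]
        for s in segments
        if s["chrom"] == gene["chrom"]
        and s["start"] <= gene["end"]
        and s["end"] >= gene["start"]
    ]
    return median(overlapping) if overlapping else None
```

The linear scan over genes is kept for readability. An interval index would be preferable at genome scale.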
Cancer transcriptomes
For PCAWG, gene expression data was retrieved from the consensus dataset represented as fragments per kilobase per million values (FPKM-UQ)48. For HMF, gene expression data was processed from paired fastq files, with reads trimmed using Trim Galore and Cutadapt (version 0.6.6)84, expression quantified as transcripts per million (TPM) values using the Kallisto method (v0.46.2)85, and gene-level TPM counts imported using tximport and tximportData (v1.18.0). Low-confidence transcriptomes with less than 35% pseudoalignment were excluded. We also excluded samples that lacked matching WGS data. For patients with multiple cancer transcriptomes, we selected the sample with the highest tumor purity.
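The three sample-level filters can be sketched as follows. The record fields (sample_id, patient_id, pseudoaligned_pct, purity) are hypothetical names chosen for illustration, and ties in tumor purity are resolved by keeping the first sample encountered, which is an assumption of this sketch.

```python
def select_transcriptomes(samples, wgs_ids, min_pseudoaligned=35.0):
    """Apply the transcriptome filters and return the retained sample IDs.

    samples: list of dicts with hypothetical keys sample_id, patient_id,
             pseudoaligned_pct, and purity.
    wgs_ids: set of sample IDs that have matching WGS data.
    """
    best_per_patient = {}
    for s in samples:
        if s["pseudoaligned_pct"] < min_pseudoaligned:
            continue  # low-confidence transcriptome
        if s["sample_id"] not in wgs_ids:
            continue  # no matching WGS data
        current = best_per_patient.get(s["patient_id"])
        if current is None or s["purity"] > current["purity"]:
            best_per_patient[s["patient_id"]] = s  # keep highest tumor purity
    return sorted(s["sample_id"] for s in best_per_patient.values())
```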
Long-range chromatin interactions
We annotated binding sites using long-range chromatin interactions from promoter-capture HiC experiments in 27 human tissues from50 and selected high-confidence interactions (freq > 10; -log10_score > 3). To annotate sites to genes through chromatin interactions, we assigned one anchor of an interaction to a site and the other anchor to a transcription start site (TSS) of a protein-coding gene using sequence overlaps. TSSs were retrieved from Ensembl BioMart for GRCh37. From these distal site-gene interactions, we excluded interactions already identified via site proximity to gene (i.e., ±3 kbps). Long-range chromatin interactions involving FMREs and genes were visualized in Cytoscape.
Genomic and epigenomic annotations of binding sites
Sites were annotated using genomic, epigenomic, and functional features. First, we considered 162,000 CTCF binding sites in 70 human cell lines from the ENCODE project43. Sites were classified by CTCF binding as tissue-specific (binding in <10% of cell lines), intermediate, or constitutively active (binding in >90% of cell lines). Evolutionary conservation of CTCF binding sites in five mammalian livers and conservation of TOP2B binding in mouse liver were retrieved from3. Chromatin states were obtained from the core 15-state model of the Epigenomics Roadmap project47 in ten normal human tissues (breast, colorectum, esophagus, kidney, liver, lung, ovary, pancreas, skin, stomach) representing the most common cancer types analyzed in PCAWG and HMF (see below). Experimentally mapped DSB sites from two cell lines (Mcf7, Nalm6) were retrieved from4 and merged based on overlapping coordinates. Next, we used the GENCODE database to annotate sites to protein-coding genes, lincRNA genes, antisense, miRNA, and sense-intronic non-coding genes, immunoglobulin (IG) genes and T-cell receptor (TR) genes. These were also used to filter the transcriptomics datasets of PCAWG and HMF. Sites were associated with adjacent genes using ±3 kbps flanking sequence, and sites lacking adjacent genes were considered intergenic. CTCF binding motifs in sites were identified using the FIMO method86 of MEME software (version 5.4.1) using the HOCOMOCOv11_core_HUMAN TF motif dataset and default parameter settings.
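As an illustration of the CTCF-based classification, the sketch below labels a site by the fraction of cell lines in which it is bound. The function name is hypothetical, and integer arithmetic is used so that the 10% and 90% boundaries are evaluated exactly.

```python
def classify_ctcf_site(n_bound, n_cell_lines=70):
    """Classify a CTCF site by the share of cell lines showing binding.

    <10% of cell lines: tissue-specific; >90%: constitutive; otherwise intermediate.
    """
    if n_bound * 10 < n_cell_lines:  # strictly below 10%
        return "tissue-specific"
    if n_bound * 10 > 9 * n_cell_lines:  # strictly above 90%
        return "constitutive"
    return "intermediate"
```

With 70 cell lines, a site bound in exactly 7 or exactly 63 cell lines falls in the intermediate class because both thresholds are strict.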
Grouping sites by gene expression and chromatin interactions
We grouped sites into four bins based on their functional activity separately for gene expression and long-range chromatin interactions, as follows. Binning was performed for each class of sites separately (TOP2B-CTCF-RAD21, TOP2B-RAD21, CTCF-RAD21, RAD21-only, TOP2B-only sites). Sites lacking chromatin interactions or intergenic sites were assigned to bin 1 (i.e., the no-activity bin). The three remaining bins of equal size represented sites with low, medium, and high activity based on gene expression or chromatin interaction frequency, respectively. For chromatin interactions, we counted interactions of sites across all tissues in the dataset. For gene expression, we selected the most highly expressed gene per site for a given cancer type and cohort using median expression values. This analysis was limited to cancer types for which at least five matching transcriptomes were available. In addition, sites were grouped to bins by median pan-cancer expression values in HMF and PCAWG.
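The four-bin grouping can be sketched as below. Sites without activity are passed as None, the remaining sites are ranked and split into three bins of near-equal size, and ties are broken by site identifier. The tie-breaking rule and the handling of site counts not divisible by three are assumptions of this sketch.

```python
def assign_activity_bins(activity):
    """Assign sites to bins 1-4 by activity.

    activity: dict mapping site ID to an activity value, or to None for
              intergenic sites or sites lacking chromatin interactions.
    Bin 1 is the no-activity bin; bins 2-4 are low, medium, and high activity.
    """
    bins = {site: 1 for site, value in activity.items() if value is None}
    active = sorted(
        (site for site, value in activity.items() if value is not None),
        key=lambda site: (activity[site], site),
    )
    n = len(active)
    for rank, site in enumerate(active):
        bins[site] = 2 + (3 * rank) // n  # rank-based tertiles
    return bins
```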
Statistical analysis of binding site annotations
Evolutionary conservation and constitutive activity of sites were assessed using one-tailed hypergeometric tests comparing high evolutionary conservation (5/5 species) or constitutive CTCF binding (>90% of human tissues) in one type of binding sites relative to other sites combined. Similarly, we asked if DSBs, CTCF motifs, and conserved TOP2B binding sites in mouse liver were enriched. Next, we evaluated chromatin states by comparing TOP2B-bound sites with control sites lacking TOP2B binding (TOP2B-CTCF-RAD21 vs. CTCF-RAD21 sites, TOP2B-RAD21 vs. RAD21-only sites). Using two-tailed hypergeometric tests, we evaluated 15 core chromatin states in ten human tissues, excluded states with few sites (n < 250), and reported significant results (FDR < 0.05). Next, we studied TOP2B binding in the context of pan-cancer gene expression in PCAWG and HMF and repeated the analysis for individual cancer types. Median gene expression values were log1p-transformed and converted to Z-scores. Binding sites were assigned to the adjacent gene (±3 kbps) with the highest gene expression value (Z-score). Z-scores were compared using non-parametric tests (Wilcoxon rank-sum tests) between sites bound and not bound by TOP2B, and classes of sites with and without TOP2B binding (i.e., TOP2B-CTCF-RAD21 vs. CTCF-RAD21 sites, TOP2B-RAD21 vs. RAD21-only sites). The proportions of intergenic sites (i.e., not near genes) were analyzed using hypergeometric tests. Third, we studied TOP2B binding in the context of promoter-enhancer chromatin interactions pooled across multiple normal human tissues. For each site, we counted the number of overlapping anchors of chromatin interactions and used non-parametric analysis with Wilcoxon rank-sum tests to compare sites with and without TOP2B binding, as described above.
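A one-tailed hypergeometric enrichment test of the kind described above can be written from first principles. In this sketch the argument names are hypothetical: N is the total number of sites, K the number of annotated sites (for example, sites conserved in 5/5 species), n the number of sites of the type under study, and k the number of those sites carrying the annotation.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n), i.e., a one-tailed enrichment P-value.

    N: total sites; K: annotated sites among them;
    n: sites of the tested type; k: annotated sites of the tested type.
    """
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total
```

Exact integer arithmetic avoids rounding issues in the tail sum, at the cost of speed for very large site sets.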
Mutational processes of small mutations in binding sites
We studied the mutational processes of SNVs and indels at TOP2B binding sites using the RM2 method (v2022-08-11), which compares the mutation burden in a set of genomic elements with their flanking control sequences given trinucleotide and megabase-level covariation in a negative binomial regression model34. We used a window size of 300 bps in RM2 such that sites of 600 bps around the midpoints of sites (i.e., ±300 bps) were compared with controls of 600 bps flanking sequences around the sites. This window size corresponds to peaks identified from our ChIP-seq analysis. To compare differently sized sets of binding sites, RM2 was run in down-sampling mode that repeatedly analyzed a fixed number of sites from the overall site pool. Down-sampling was run for 100 iterations and the results corresponding to the median P-value were selected. First, we analyzed different types of sites (TOP2B-CTCF-RAD21, TOP2B-RAD21, CTCF-RAD21, TOP2B-only, RAD21-only), excluding site types that had few sites detected (CTCF-only, CTCF-TOP2B sites). Down-sampling to 10,000 sites was used in RM2 in accordance with the least frequent site type. The 18 cancer types and the pan-cancer sets for PCAWG and HMF were analyzed separately. In this first analysis, we considered all types of sites and all cancer types and selected significant results after multiple testing correction (FDR < 0.05). Grassy hills plots in RM2 were used for visualization. Second, we studied four bins of sites grouped by either gene expression or chromatin interactions. Expression-based analysis of sites was conducted for cancer types and cohorts having at least five samples with transcriptomics data available. Down-sampling to 1000 sites was used in RM2 in accordance with the smallest bin. Third, to evaluate local mutational processes in chromatin context, we grouped binding sites by chromatin states from Roadmap Epigenomics in multiple normal tissue types and repeated the RM2 analysis of sites grouped by chromatin state.
A combined BH FDR correction was applied across the expression- and chromatin interaction-based analyses, significant findings were selected (FDR < 0.05) and were visualized as a heatmap ordered by directional associations of mutational processes. While reducing bias from unbalanced site sets, the more stringent down-sampling step in RM2 reduced the statistical power of the analysis by considering fewer sites. Fourth, we evaluated mutation frequencies in binding sites relative to in vitro DSB activity or conserved TOP2B binding in mice. We analyzed mutations in TOP2B-bound sites relative to control sites lacking TOP2B binding (TOP2B-CTCF-RAD21 vs. CTCF-RAD21 sites, TOP2B-RAD21 vs. RAD21-only sites) using two-tailed hypergeometric tests separately for each cancer type in PCAWG and HMF. Significant results were selected after multiple testing correction (FDR < 0.05). Fifth, we asked if SNVs of specific mutational signatures of single base substitutions (SBS) were enriched in TOP2B binding sites. SBS mutation frequencies in sites with and without TOP2B binding (TOP2B-CTCF-RAD21 vs. CTCF-RAD21 sites; TOP2B-RAD21 vs RAD21-only sites) were compared using two-tailed hypergeometric tests and significant results were reported (FDR < 0.05). Indels were included in the same analysis as SBS signatures and all indels were analyzed as a single representative signature.
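Two generic steps of this workflow, selection of the median P-value across down-sampling iterations and Benjamini-Hochberg (BH) correction, can be sketched as below. This is not the RM2 implementation: run_test is a hypothetical caller-supplied function that returns a P-value for a subset of sites, and the lower median is taken when the number of iterations is even, which is an assumption of this sketch.

```python
import random


def median_downsampled_pvalue(sites, n_sites, run_test, iterations=100, seed=0):
    """Run a test on repeated random subsets and return the median P-value.

    run_test: hypothetical callable taking a list of sites and returning a P-value.
    """
    rng = random.Random(seed)
    pvalues = sorted(run_test(rng.sample(sites, n_sites)) for _ in range(iterations))
    return pvalues[(len(pvalues) - 1) // 2]  # lower median for even counts


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Pooling P-values from several analyses into one call to benjamini_hochberg corresponds to a combined correction across those analyses.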
Mutational processes of structural variants in binding sites
We analyzed SVBPs in sites using two complementary methods. First, we counted the numbers of SVBPs per site in the different types of sites. The few sites bound by CTCF-only and CTCF-TOP2B were excluded. As additional genome-wide control sites, we used the set of non-overlapping genomic windows of 2 kbps excluding poorly defined gap regions from the UCSC database (UCSC_hg19_gap.txt.gz). We counted pan-cancer SVs in PCAWG samples, HMF samples, and combined samples from PCAWG and HMF, and compared per-site SVBP counts in different types of sites using Wilcoxon rank-sum tests. Pan-cancer counts were used for increased statistical power since SVs were relatively infrequent. To associate SVBP burden with gene expression, we compared four bins of TOP2B-bound sites grouped by pan-cancer transcription. This analysis pooled all TOP2B-bound sites (i.e., TOP2B-CTCF-RAD21, TOP2B-RAD21, TOP2B-only sites). SVBP counts in sites of the different bins were compared using Wilcoxon rank-sum tests. Sites grouped by chromatin interaction activity were analyzed similarly. To group sites by gene expression, sites were assigned to the adjacent gene (±3 kbps) with the highest median pan-cancer expression. For pan-cohort analyses, the highest expression value was used. Expression analyses were repeated in individual cancer types by grouping sites based on matching cancer transcriptomes. We also asked if SVBP burden in TOP2B-bound sites was associated with functional genomic annotations including DSBs, conservation of TOP2B binding in mice, and constitutive human CTCF binding. This analysis only included pooled TOP2B binding sites and compared SVBP counts in sites with and without functional annotations using Wilcoxon rank-sum tests. Second, we computed the distance to the closest SV breakpoint from the midpoint of each binding site, using all SVs in PCAWG and HMF. SVBPs within ±1 kbps were included. One-tailed Wilcoxon tests were used to evaluate if sites were closer to SVBPs.
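The distance-based analysis can be illustrated with a binary search over sorted breakpoint positions. The sketch assumes breakpoints are given per chromosome as a sorted list of integer positions, and returns None when the closest breakpoint lies outside the ±1 kbps window. The function name and return convention are hypothetical.

```python
from bisect import bisect_left


def nearest_breakpoint_distance(midpoint, breakpoints, max_dist=1000):
    """Distance from a site midpoint to the closest SV breakpoint.

    breakpoints: sorted positions on the same chromosome as the site.
    Returns None if there is no breakpoint within max_dist.
    """
    i = bisect_left(breakpoints, midpoint)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = breakpoints[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    distance = min(abs(b - midpoint) for b in candidates)
    return distance if distance <= max_dist else None
```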
Analysis of SVBPs in sites relative to SNV/indel burden
Having established enrichments of small mutations (SNVs, indels) in TOP2B binding sites, we analyzed SVBP burden relative to small mutations in individual cancer types in PCAWG and HMF. Small mutations and SVBPs were pooled and two-tailed hypergeometric tests were used to determine whether SVBPs in TOP2B-bound sites were differentially distributed compared to non-bound sites and small mutations. Two comparisons were included (TOP2B-CTCF-RAD21 vs. CTCF-RAD21; TOP2B-RAD21 vs. RAD21-only). The resulting P-values were adjusted for multiple testing and significant results were selected (FDR < 0.05).
Discovery of frequently mutated elements (FMREs) as cancer drivers
For driver analyses, we extended binding sites around ChIP-seq peaks using a lenient width of ±1 kbps around peak midpoints and merged these based on sequence overlaps. We performed driver analyses to find individual sites that had frequent somatic mutations (i.e., FMREs) using the ActiveDriverWGS method27 (v1.2.1). ActiveDriverWGS utilizes Poisson regression to find genomic regions with increased mutation burden relative to flanking genomic sequence and its trinucleotide-level sequence content. Small mutations (SNVs, indels) and SVBPs were analyzed separately since SVBPs were less frequent genome-wide and different parameter settings of ActiveDriverWGS were used. Cancer types in PCAWG and HMF were also analyzed separately. Binding sites with zero mutations were assigned a conservative P-value of 1.0. As a stringent approach, results from all cancer types in PCAWG and HMF were pooled for a joint multiple testing correction and significant results were selected (FDR < 0.05). For small mutations, the default background sequence length around sites was used in ActiveDriverWGS (±50 kbps). For SVBPs, two adaptations were implemented: (i) a more conservative background sequence length was used to account for the lower genome-wide SVBP frequency (±500 kbps), and (ii) additional pan-cancer analyses of SVBPs in PCAWG and HMF were performed for increased statistical power. SVBP-based FMREs were selected similarly to FMREs from small mutations. Translocations involving FMREs were visualized as circos plots using the ggbio package in R87. Known cancer genes associated with FMREs were retrieved from the COSMIC Cancer Gene Census database88 (downloaded July 31st 2021). FMREs were also visualized in the context of promoter-enhancer chromatin interactions pooled from a collection of normal human tissues50.
We compared FMREs and non-FMRE binding sites in the context of chromatin interaction counts, adjacent gene expression levels, and human tissues showing CTCF binding using Wilcoxon tests, and their location at known cancer genes, DSB sites, and conserved TOP2B binding sites using hypergeometric tests.
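The site definition used for driver analyses, extension to ±1 kbps around peak midpoints followed by merging of overlapping windows, can be sketched as follows. Peaks are assumed to be (chrom, start, end) tuples, and windows that touch at a single coordinate are treated as overlapping. Both conventions are assumptions of this sketch.

```python
def extend_and_merge(peaks, flank=1000):
    """Extend peaks to midpoint ± flank and merge overlapping windows.

    peaks: list of (chrom, start, end) tuples.
    Returns merged (chrom, start, end) tuples sorted by chromosome and position.
    """
    windows = sorted(
        (chrom, max((start + end) // 2 - flank, 0), (start + end) // 2 + flank)
        for chrom, start, end in peaks
    )
    merged = []
    for chrom, start, end in windows:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)  # extend the previous window
        else:
            merged.append([chrom, start, end])
    return [tuple(window) for window in merged]
```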
Associating mutations in FMREs with target gene expression
To evaluate functional roles of mutations in FMREs, we asked if these were associated with the expression of any target genes located within ±3 kbps of FMREs. Cancer types in PCAWG and HMF were analyzed separately, and small mutations and SVBPs were also analyzed separately. First, we selected genes for which at least five samples with FMRE mutations and matching RNA-seq data were available, excluding genes with low overall expression (mean expression <1). Gene expression values were log1p-transformed. To analyze FMREs with small mutations, we excluded samples in which high amplifications of target genes were seen (copy number (CN) > 8). To analyze differential expression, we compared gene expression values of FMRE-mutated and non-mutated samples using linear regression with the gene CN as a covariate, and derived P-values from one-tailed F-tests and ANOVA. For increased stringency, multiple testing correction with BH FDR was conducted across all tests from the cancer types in HMF and PCAWG. Significant results were selected (FDR < 0.05). Gene expression analyses of SVBPs in FMREs were performed similarly, except that gene CN values were not used to filter samples or in covariates, since SVBPs included CNAs. To analyze ERG gene expression in the context of TMPRSS2/ERG structural variants, we combined SVBPs from the FMREs in TMPRSS2 and ERG and compared ERG expression in metastatic prostate cancers with and without these SVBPs.
Pathway enrichment analysis of non-coding FMREs
To conduct pathway analyses of non-coding mutations at TOP2B binding sites, we conducted an additional driver analysis to exclude potential protein-coding biases of mutational enrichments. To this end, we re-defined all binding sites by subtracting protein-coding (CDS) exons and analyzed these binding sites again using ActiveDriverWGS. The resulting mutational enrichments of binding sites were integrated on the pathway level. To map pathways to binding sites, we first mapped sites to protein-coding genes, selecting the most significant mutation-enriched site for every protein-coding gene using a ±3 kbps flanking window. To integrate mutational enrichments across cancer types, the most significant FMRE per gene was selected for each cancer type from either PCAWG or HMF, resulting in a matrix representing top P-values of protein-coding genes for each of the 18 cancer types. We used integrative pathway enrichment analysis of the ActivePathways method (v2.0.3)53 by merging evidence across the cancer types, prioritizing genes whose non-coding sites showed mutational enrichments in multiple cancer types. ActivePathways also included sub-significant sites and therefore considered mutational enrichments beyond the statistically filtered list of FMREs. All protein-coding genes were used as the background set in the pathway analysis and gene sets of 25-500 genes were included. Significant pathway enrichments were selected using the Holm family-wise error rate (FWER < 0.05), and default parameters of ActivePathways were used otherwise. Pathway analysis considered gene sets of molecular pathways of the Reactome database89 and GO biological processes. Gene sets were retrieved from the g:Profiler90 web site (2023-10-06). Pathway results were visualized as an enrichment map using the Cytoscape software (v3.10.0). We curated the map manually to identify major functional themes91.
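Construction of the gene-by-cancer-type matrix of top P-values can be sketched as follows. The input is assumed to be a list of (gene, cancer_type, p_value) records, one per site and cancer type, and gene and cancer type combinations without any site are filled with 1.0. The record layout and the fill value are assumptions of this sketch rather than documented behavior of ActivePathways.

```python
def top_pvalue_matrix(site_results, cancer_types):
    """Build a gene-by-cancer-type matrix of the smallest P-value per gene.

    site_results: iterable of (gene, cancer_type, p_value) records.
    cancer_types: column order of the output matrix.
    Returns a dict mapping each gene to a list of P-values, one per cancer type.
    """
    best = {}
    for gene, cancer_type, p_value in site_results:
        key = (gene, cancer_type)
        best[key] = min(p_value, best.get(key, 1.0))  # keep the most significant site
    genes = sorted({gene for gene, _ in best})
    return {g: [best.get((g, c), 1.0) for c in cancer_types] for g in genes}
```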
Lentivirus production and transduction for RMRP FMRE experiments
Large-scale production and concentration of lentivirus was performed as previously described63,92. Briefly, 293T cells (Invitrogen R700-07) were seeded on poly-L-lysine-coated 15 cm plates and transfected using the PEI (polyethyleneimine) method in non-serum media with the lentiviral construct of interest (pLKO-Cre sgRNA v4; Addgene plasmid 158032) along with lentiviral packaging plasmids psPAX2 and pMD2.G (Addgene plasmids 12259, 12260). At 8 h post-transfection, media supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin antibiotic solution (w/v) was added to the plates. After a further 48 h, the viral supernatant was collected and filtered through a Stericup-HV PVDF 0.45-μm filter and then concentrated ~2000-fold by ultracentrifugation in an MLS-50 rotor (Beckman Coulter) at 24,000 rpm (103,864 x g) for 2 h.
CRISPR guides for RMRP FMRE experiments
MCF10A cell culture experiments involved two guides: ATAGGCTTTCAGAGGCATTG and CTAGAGTTCCAGATATGAAG. Mouse in utero transductions in the skin involved two guides: AAGTATCATGCCTAAAACAA and CCCGGCTACCTGTAAAATGA. The NTC sequence was GAAGGAGGCTACACCCGTTA.
In utero lentiviral transduction for RMRP FMRE experiments
Ultrasound-guided lentiviral injection and related procedures have been described63,92. Briefly, to deliver the lentiviral sgRNAs targeting the gene of interest, a non-invasive, ultrasound-guided in utero injection method was employed, which selectively transduces the single-layered surface ectoderm of living E9.5 mouse embryos in a clonal fashion. The animals used in this study were R26-LSL-Pik3caH1047R/+ [Gt(ROSA)26Sortm1(Pik3ca*H1047R)Egan in a clean FVBN background kindly provided by Dr. Sean E. Egan, The Hospital for Sick Children], R26-LSL-Cas9-GFP [Gt(ROSA)26Sortm1(CAG-xstpx-cas9,-EGFP)Fezh/J, #026175, in C57/Bl6 background from The Jackson Laboratory]. The in utero transduction experiments were performed with multiple independent surgeries with at least two independent mothers and surgeries for each of the three sgRNAs. Three independent viral batches were used, which were independently titered using LSL-tdTomato MEFs. About 500 nl of 1 × 10^8 pfu/ml were injected per mouse so that ~20% of the surface epithelium of each mouse was transduced with sgRNA lentiviral vectors, as confirmed upon birth. A total of 21 mice were monitored for tumor-free survival up to one year, including seven mice in the control group targeted with sgTIGRE, nine mice targeted with sgRNA1, and five mice targeted with sgRNA2. Mice of both sexes were used in all groups and both male and female mice developed tumors. Animal husbandry, ethical handling of mice, and all animal work were carried out according to guidelines approved by the Canadian Council on Animal Care and under protocols approved by the Centre for Phenogenomics Animal Care Committee.
2D cell culture for RNA extraction for RMRP FMRE experiments
75,000 sgRNA-infected C3 MCF10A cells (Horizon, cat. #HD 101-011)93 were plated into four 12-well plates (Sarstedt 83.3921). The cells were grown for 24 h and scraped into 100 ul of RNA Lysis buffer (Zymo Research R1051), then stored at -80 °C prior to RNA extraction. We used commercial cell lines that were mycoplasma free.
3D sphere formation assay for RMRP FMRE experiments
For three-dimensional sphere formation experiments, MCF10A cells were plated on Matrigel Matrix (Corning 354234) as described previously94. An initial count of 15,000 cells was plated on 24-well plates (Sarstedt 83.3922) precoated with 100 ul of Matrigel, and grown in MCF10A Assay Media, made with DMEM/F12 (Wisent 319-084-CL), 2% Horse Serum (Gibco 16050122), 0.5 ug/ml Hydrocortisone (Sigma-Aldrich H0888), 100 ng/ml Cholera Toxin (Sigma-Aldrich C8052), 10 ug/ml Insulin (Sigma-Aldrich I5500), 1x Pen/Strep (Wisent 450-201-EL), 5 ng/ml EGF (Wisent 511-110-EU) and 2% Matrigel. Media was changed every other day for three weeks.
Sphere dissociation from Matrigel for RNA extraction for RMRP FMRE experiments
Media was removed and plates were placed on ice for 5 min. 1 ml of cold PBS was added to each well and contents were transferred into microtubes on ice. A second wash was performed with 800 ul, and the remaining contents were added into their respective microtubes. Samples were spun for 5 min at 4 °C and 1000 x g. The supernatant was removed, and the cell pellets were washed two more times with cold PBS. After the final spin, each cell pellet was resuspended in 200 ul Zymo Lysis buffer (Zymo Research R1051).
Analysis of genome editing efficiency
Genomic DNA from formalin-fixed, paraffin-embedded mouse skin tumor tissue samples was isolated using the QIAamp DNA FFPE Tissue Kit (56404, Qiagen) following the manufacturer’s instructions. Briefly, tissue samples were cut using a microtome, deparaffinized with xylene and processed for DNA isolation. For MCF10A cells in culture, genomic DNA was extracted by harvesting cell pellets during routine passaging. Genomic DNA was isolated using the DNeasy Blood & Tissue Kit (69504, Qiagen) according to the manufacturer’s instructions. PCR was performed flanking the regions of the sgRNA target site on genomic DNA from both WT and sgRNA-transduced MCF10A cells as well as WT and tumor tissue samples. Approximately 500 ng of genomic DNA was used from each sample in a PCR reaction with 2.5 μl of 10 μM TIDE primer and 25 μl of Q5 High-Fidelity DNA Polymerase (M0544S, New England BioLabs) in a final reaction volume of 50 μl. The conditions of the PCR protocol were as follows: initial denaturation at 98 °C for 30 s, then 9 cycles of 98 °C for 10 s, 61 °C for 30 s, and 72 °C for 40 s, followed by 9 cycles of 98 °C for 10 s, 63 °C for 30 s, and 72 °C for 40 s, and lastly 4 remaining cycles of 98 °C for 10 s, 65 °C for 30 s, and 72 °C for 40 s, with a final extension at 72 °C for 2 min. PCR products were run on a 1 to 2% agarose gel depending on product size and purified using the GenepHlow Gel/PCR kit (DFH300, Geneaid) according to the manufacturer’s instructions. Purified PCR products were quantified using a NanoDrop spectrophotometer and sent for Sanger sequencing. Genome editing efficiency was estimated by the Tracking of Indels by Decomposition (TIDE) algorithm95. TIDE primers are listed in Supplementary Data 14.
Total RNA sequencing for RMRP FMRE experiments
RNA samples were extracted from cells using the Zymo Quick-RNA Microprep Kit (Zymo Research R1051). Three replicates were used for each of two targeted sgRNAs and the scramble control sgRNA. After extraction, sample integrity was checked with RNA Qubit (Invitrogen Q32852) and the Bioanalyzer. Total RNA sequencing with single-end reads at 40 million reads per sample was performed at the core facility of the Lunenfeld-Tanenbaum Research Institute in Toronto, ON, Canada.
Analysis of RNA-seq data from RMRP FMRE experiments
We uniformly processed RNA-Seq single-end reads for 18 samples using the Rsubread96 method (v2.12.0). Reads were aligned to the human genome (hg38) from the Ensembl database (downloaded 2024-01-16) using default parameters. Read counts for each gene defined in the Gencode gtf annotation file (v45, downloaded 2024-04-23) were obtained. We filtered genes with low expression values (<10 counts in 70% of samples). Differential gene expression analysis was performed on filtered genes with raw counts using the edgeR97 method (v4.0.16). Gene expression was compared separately for 2D and 3D cultures by combining the two sgRNA treatments of three replicates each and the scramble-treated control of three replicates. Differential gene expression analysis was performed using a quasi-likelihood negative binomial generalized log-linear model (glmQLFit) implemented in edgeR. P-values were adjusted for multiple testing using the FDR method and significant genes were selected for volcano plots (FDR < 0.05). Directional pathway enrichment analysis using the DPM method64 jointly considered transcriptomic changes in both invasion assays and prioritized genes and pathways with consistent directional change in both experiments, based on significance and sign of log-fold-change. DPM is available in the ActivePathways package (v2.0.3)53. Pathway enrichment analysis of directionally prioritized genes was performed using the ActivePathways method53 (v2.0.3). Genes with low expression were removed from the background gene set for pathway analysis. Gene Ontology and Reactome gene sets were downloaded from the g:Profiler90 website (downloaded 2024-06-14). Pathway enrichment analysis considered gene sets of 25–500 genes and significant results were selected using the Holm multiple-testing correction (FWER < 0.05). Results were visualized as an enrichment map using the Cytoscape software (v3.9.1) and the map was organized manually to identify major functional themes91.
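The low-expression filter applied before differential expression analysis can be sketched in plain Python. The text does not fully specify the rule, so this sketch assumes that a gene is removed when at least 70% of samples have fewer than 10 counts. The function and parameter names are hypothetical and do not correspond to edgeR functions.

```python
def keep_expressed_genes(counts, min_count=10, max_low_percent=70):
    """Remove genes with low counts in most samples.

    counts: dict mapping gene ID to a list of raw read counts, one per sample.
    A gene is dropped when at least max_low_percent of samples have
    fewer than min_count reads (an assumed reading of the filter).
    """
    kept = {}
    for gene, values in counts.items():
        n_low = sum(1 for v in values if v < min_count)
        # Integer comparison avoids floating-point issues at the threshold.
        if n_low * 100 < max_low_percent * len(values):
            kept[gene] = values
    return kept
```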
Inclusion and ethics
Whole genome sequencing (WGS) of cancer tissues of human individuals was performed by ICGC and HMF consortia members outside of this study under a series of locally approved Institutional Review Board protocols. Written informed consent was obtained from all human participants as part of previous studies. Ethical review of this cancer WGS data analysis project was granted by the University of Toronto Research Ethics Board under protocol no. 37521. Liver hepatocellular carcinoma (HCC) samples were collected from male patients who underwent liver transplantation at the Ajmera Transplant Centre of Toronto General Hospital (ON, Canada) in 2016-2019. Written informed consents were obtained from all participants prior to the procedures according to the guidelines of Multi-Organ Transplant Biobank (Toronto, ON, Canada). Ethical review of the chromatin immunoprecipitation sequencing project involving these tissues was granted by the University of Toronto Research Ethics Board under protocol no. 37952 and University Health Network Research Ethics Board under protocol no. 19-5576.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
ChIP-seq data and RNA-seq data generated in this study have been deposited in the ArrayExpress database with the accession numbers E-MTAB-13301 and E-MTAB-14611. Processed data are available as Supplementary data files. Input datasets from Hartwig Medical Foundation (HMF) representing whole-genome sequencing, transcriptomics, and metadata annotations of cancer samples of multiple cancer types are controlled-access and can be provided by HMF pending scientific review and a completed material transfer agreement. Requests for these datasets should be submitted to the Hartwig Medical Foundation. Input datasets from the PCAWG study representing whole-genome sequencing, transcriptomics, and metadata annotations of cancer samples of multiple cancer types are available from the Xena data portal (https://pcawg.xenahubs.net). Intermediate data files cannot be shared due to the use of controlled-access datasets from HMF. Source data are provided with this paper.
Code availability
Publicly available software packages and common statistical methods were used to process and analyze the data as described in the Methods section. Main algorithms used here are available freely as open-source R packages: ActiveDriverWGS (https://github.com/reimandlab/ActiveDriverWGSR; v1.2.0), ActivePathways (https://github.com/reimandlab/ActivePathways; v2.0.3), and RM2 (https://github.com/reimandlab/RM2; v2022-08-11). Relevant versions are archived in Zenodo98.
References
Pommier, Y., Sun, Y., Huang, S. N. & Nitiss, J. L. Roles of eukaryotic topoisomerases in transcription, replication and genomic stability. Nat. Rev. Mol. Cell Biol. 17, 703–721 (2016).
Nitiss, J. L. DNA topoisomerase II and its growing repertoire of biological functions. Nat. Rev. Cancer 9, 327–337 (2009).
Uuskula-Reimand, L. et al. Topoisomerase II beta interacts with cohesin and CTCF at topological domain borders. Genome Biol. 17, 182 (2016).
Canela, A. et al. Genome organization drives chromosome fragility. Cell 170, 507–521 (2017).
Canela, A. et al. Topoisomerase II-induced chromosome breakage and translocation is determined by chromosome architecture and transcriptional activity. Mol. Cell 75, 252–266 (2019).
Vian, L. et al. The energetics and physiological impact of cohesin extrusion. Cell 173, 1165–1178 (2018).
Gothe, H. J. et al. Spatial chromosome folding and active transcription drive DNA fragility and formation of oncogenic MLL translocations. Mol. Cell 75, 267–283.e212 (2019).
Austin, C. A., Cowell, I. G., Khazeem, M. M., Lok, D. & Ng, H. T. TOP2B’s contributions to transcription. Biochem Soc. Trans. 49, 2483–2493 (2021).
Champoux, J. J. DNA topoisomerases: structure, function, and mechanism. Annu Rev. Biochem 70, 369–413 (2001).
Vann, K. R., Oviatt, A. A. & Osheroff, N. Topoisomerase II poisons: converting essential enzymes into molecular scissors. Biochemistry 60, 1630–1641 (2021).
Nitiss, J. L. Targeting DNA topoisomerase II in cancer chemotherapy. Nat. Rev. Cancer 9, 338–350 (2009).
Pendleton, M., Lindsey, R. H. Jr., Felix, C. A., Grimwade, D. & Osheroff, N. Topoisomerase II and leukemia. Ann. N. Y Acad. Sci. 1310, 98–110 (2014).
Cowell, I. G. et al. Model for MLL translocations in therapy-related leukemia involving topoisomerase IIbeta-mediated DNA strand breaks and gene proximity. Proc. Natl Acad. Sci. USA 109, 8989–8994 (2012).
Mistry, A. R. et al. DNA topoisomerase II in therapy-related acute promyelocytic leukemia. N. Engl. J. Med. 352, 1529–1538 (2005).
Ju, B. G. et al. A topoisomerase IIbeta-mediated dsDNA break required for regulated transcription. Science 312, 1798–1802 (2006).
Madabhushi, R. et al. Activity-induced DNA breaks govern the expression of neuronal early-response genes. Cell 161, 1592–1605 (2015).
Bunch, H. et al. BRCA1-BARD1 regulates transcription through modulating topoisomerase IIbeta. Open Biol. 11, 210221 (2021).
Singh, S. et al. Pausing sites of RNA polymerase II on actively transcribed genes are enriched in DNA double-stranded breaks. J. Biol. Chem. 295, 3990–4000 (2020).
Bunch, H. et al. Transcriptional elongation requires DNA break-induced signalling. Nat. Commun. 6, 10191 (2015).
Uuskula-Reimand, L. & Wilson, M. D. Untangling the roles of TOP2A and TOP2B in transcription and cancer. Sci. Adv. 8, eadd4920 (2022).
Herrero-Ruiz, A. et al. Topoisomerase IIalpha represses transcription by enforcing promoter-proximal pausing. Cell Rep. 35, 108977 (2021).
Olmedo-Pelayo, J., Rubio-Contreras, D. & Gomez-Herreros, F. Canonical non-homologous end-joining promotes genome mutagenesis and translocations induced by transcription-associated DNA topoisomerase 2 activity. Nucleic Acids Res. 48, 9147–9160 (2020).
Dellino, G. I. et al. Release of paused RNA polymerase II at specific loci favors DNA double-strand-break formation and promotes cancer translocations. Nat. Genet. 51, 1011–1023 (2019).
Haffner, M. C. et al. Androgen-induced TOP2B-mediated double-strand breaks and prostate cancer gene rearrangements. Nat. Genet. 42, 668–675 (2010).
Martincorena, I. et al. Universal patterns of selection in cancer and somatic tissues. Cell 171, 1029–1041 (2017).
Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2693 cancer whole genomes. Nature 578, 102–111 (2020).
Zhu, H. et al. Candidate cancer driver mutations in distal regulatory elements and long-range chromatin interaction networks. Mol. Cell https://doi.org/10.1016/j.molcel.2019.12.027 (2020).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Bell, R. J. et al. The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer. Science 348, 1036–1039 (2015).
Corona, R. I. et al. Non-coding somatic mutations converge on the PAX8 pathway in ovarian cancer. Nat. Commun. 11, 2020 (2020).
Liu, E. M. et al. Identification of cancer drivers at CTCF insulators in 1962 whole genomes. Cell Syst. 8, 446–455 (2019).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Katainen, R. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821 (2015).
Lee, C. A., Abd-Rabbo, D. & Reimand, J. Functional and genetic determinants of mutation rate variability in regulatory elements of cancer genomes. Genome Biol. 22, 133 (2021).
Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
Schuster-Bockler, B. & Lehner, B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507 (2012).
Stamatoyannopoulos, J. A. et al. Human mutation rate associated with DNA replication timing. Nat. Genet. 41, 393–395 (2009).
Ocsenas, O. & Reimand, J. Chromatin accessibility of primary human cancers ties regional mutational processes and signatures with tissues of origin. PLoS Comput. Biol. 18, e1010393 (2022).
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
Li, Y. et al. Patterns of somatic structural variation in human cancer genomes. Nature 578, 112–121 (2020).
Kumar, S. et al. Passenger mutations in more than 2500 cancer genomes: overall molecular functional impact and consequences. Cell 180, 915–927 e916 (2020).
Pich, O. et al. The mutational footprints of cancer therapies. Nat. Genet. 51, 1732–1740 (2019).
Encode Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Schmidt, D. et al. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148, 335–348 (2012).
Faure, A. J. et al. Cohesin regulates tissue-specific expression by stabilizing highly occupied cis-regulatory modules. Genome Res. 22, 2163–2175 (2012).
Valton, A. L. et al. A cohesin traffic pattern genetically linked to gene regulation. Nat. Struct. Mol. Biol. 29, 1239–1251 (2022).
Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
PCAWG Transcriptome Core Group, Calabrese, C. et al. Genomic basis for RNA alterations in cancer. Nature 578, 129–136 (2020).
Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216 (2019).
Jung, I. et al. A compendium of promoter-centered long-range chromatin interactions in the human genome. Nat. Genet. 51, 1442–1449 (2019).
de Bruin, E. C. et al. Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science 346, 251–256 (2014).
Isozaki, H. et al. Therapy-induced APOBEC3A drives evolution of persistent cancer cells. Nature 620, 393–401 (2023).
Paczkowska, M. et al. Integrative pathway enrichment analysis of multivariate omics data. Nat. Commun. 11, 735 (2020).
Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546–1558 (2013).
Turcan, S. et al. IDH1 mutation is sufficient to establish the glioma hypermethylator phenotype. Nature 483, 479–483 (2012).
Yan, H. et al. IDH1 and IDH2 mutations in gliomas. N. Engl. J. Med. 360, 765–773 (2009).
Yu, H. A. et al. Analysis of tumor specimens at the time of acquired resistance to EGFR-TKI therapy in 155 patients with EGFR-mutant lung cancers. Clin. Cancer Res. 19, 2240–2247 (2013).
Adams, E. J. et al. FOXA1 mutations alter pioneering activity, differentiation and prostate cancer phenotypes. Nature 571, 408–412 (2019).
Tomlins, S. A. et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, 644–648 (2005).
Erikson, J., ar-Rushdi, A., Drwinga, H. L., Nowell, P. C. & Croce, C. M. Transcriptional activation of the translocated c-myc oncogene in Burkitt lymphoma. Proc. Natl. Acad. Sci. USA 80, 820–824 (1983).
Strober, W., Murray, P. J., Kitani, A. & Watanabe, T. Signalling pathways and molecular interactions of NOD1 and NOD2. Nat. Rev. Immunol. 6, 9–20 (2006).
Rheinbay, E. et al. Recurrent and functional regulatory mutations in breast cancer. Nature 547, 55–60 (2017).
Loganathan, S. K. et al. Rare driver mutations in head and neck squamous cell carcinomas converge on NOTCH signaling. Science 367, 1264–1269 (2020).
Slobodyanyuk, M. et al. Directional integration and pathway enrichment analysis for multi-omics data. Nat. Commun. 15, 5690 (2024).
Boot, A. et al. Recurrent mutations in topoisomerase IIalpha cause a previously undescribed mutator phenotype in human cancers. Proc. Natl. Acad. Sci. USA 119. https://doi.org/10.1073/pnas.2114024119 (2022).
Poulos, R. C. et al. Functional mutations form at CTCF-cohesin binding sites in melanoma due to uneven nucleotide excision repair across the motif. Cell Rep. 17, 2865–2872 (2016).
Kaiser, V. B., Taylor, M. S. & Semple, C. A. Mutational biases drive elevated rates of substitution at regulatory sites across cancer types. PLoS Genet. 12, e1006207 (2016).
Guo, Y. A. et al. Mutation hotspots at CTCF binding sites coupled to chromosomal instability in gastrointestinal cancers. Nat. Commun. 9, 1520 (2018).
Hanawalt, P. C. & Spivak, G. Transcription-coupled DNA repair: two decades of progress and surprises. Nat. Rev. Mol. Cell Biol. 9, 958–970 (2008).
Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & Lopez-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016).
Baranello, L. et al. DNA break mapping reveals topoisomerase II activity genome-wide. Int. J. Mol. Sci. 15, 13111–13122 (2014).
Yang, X., Li, W., Prescott, E. D., Burden, S. J. & Wang, J. C. DNA topoisomerase IIbeta and neural development. Science 287, 131–134 (2000).
Andrews, S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc (2010).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Met. 57, 289–300 (1995).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium Pan-cancer analysis of whole genomes. Nature 578, 82–93. https://doi.org/10.1038/s41586-020-1969-6 (2020).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., Campbell, P. J. & Stratton, M. R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3, 246–259 (2013).
Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. https://doi.org/10.14806/ej.17.1.200 (2011).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Yin, T., Cook, D. & Lawrence, M. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol. 13, R77 (2012).
Futreal, P. A. et al. A census of human cancer genes. Nat. Rev. Cancer 4, 177–183 (2004).
Fabregat, A. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
Reimand, J., Kull, M., Peterson, H., Hansen, J. & Vilo, J. g:Profiler - a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35, W193–W200 (2007).
Reimand, J. et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat. Protoc. 14, 482–517 (2019).
Loganathan, S. K., Malik, A., Langille, E., Luxenburg, C. & Schramek, D. In vivo CRISPR/Cas9 screening to simultaneously evaluate gene function in mouse skin and oral cavity. J. Vis. Exp. https://doi.org/10.3791/61693 (2020).
Langille, E. et al. Loss of epigenetic regulation disrupts lineage integrity, induces aberrant alveogenesis, and promotes breast cancer. Cancer Discov. 12, 2930–2953 (2022).
Debnath, J., Muthuswamy, S. K. & Brugge, J. S. Morphogenesis and oncogenesis of MCF-10A mammary epithelial acini grown in three-dimensional basement membrane cultures. Methods 30, 256–268 (2003).
Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168 (2014).
Liao, Y., Smyth, G. K. & Shi, W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 47, e47 (2019).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Reimand, J. Topoisomerase IIb binding delineates localized mutational processes and driver mutations in cancer genomes. Zenodo. https://doi.org/10.5281/zenodo.17154324 (2025).
Acknowledgements
This work was supported by the Canadian Institutes of Health Research (CIHR) Project Grants PJT-162410 and PJT-197925 to J.R., the Investigator Award to J.R. from the Ontario Institute for Cancer Research (OICR), and the New Investigator Award of the Terry Fox Research Institute (TFRI) to J.R. M.D.W. was supported by the Canada Research Chairs Program. L.U.R. was supported by a Next Generation of Scientists Award from the Cancer Research Society (PIN25558), the Estonian Research Council fellowship, NSERC grant (RGPIN-2019-07014) to M.D.W., as well as a Cancer Research Society/Isaiah 40:31 Memorial Fund Operating grant to M.D.W. and J.R. Funding to OICR is provided by the Government of Ontario. The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. We acknowledge the contributions of the many clinical networks of ICGC and TCGA who provided samples and data to PCAWG. We thank the patients and their families for their participation in ICGC and TCGA projects. This publication and the underlying study have been made possible partly based on the data that Hartwig Medical Foundation has made available to the study.
Author information
Contributions
C.A.L. and J.R. analyzed the data. L.U.R. and S.A.A. performed the ChIP-seq experiments. L.U.R., C.A.L., and J.R. interpreted the data and wrote the manuscript. L.U.R. and M.D.W. designed and supervised the ChIP-seq experiments. R.O., E.L., E.S., R.T., and D.S. designed and performed the functional validation experiments. N.A., Z.P.K., K.C.L.C., D.A.R., and H.H. contributed to data analysis. E.P. and M.B. contributed clinical samples. J.R., M.D.W., and L.U.R. conceived the project. J.R. supervised the project. The authors reviewed and edited the manuscript and approved the final version.
Ethics declarations
Competing interests
D.S. is a consultant for Tango Therapeutics and a founder of Viverita Therapeutics. M.B. reports grants from Natera, Eisai, AstraZeneca, Novo Nordisk, and Merck, grants and personal fees from Paladin, and nonfinancial support from CareDx outside the submitted work. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Srinivasan Yegnasubramanian and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Uusküla-Reimand, L., Lee, C.A., Oh, R.H. et al. Topoisomerase IIb binding delineates localized mutational processes and driver mutations in cancer genomes. Nat Commun 16, 10241 (2025). https://doi.org/10.1038/s41467-025-65005-6
DOI: https://doi.org/10.1038/s41467-025-65005-6








