Main

Approximately 50% of the mammalian genome consists of transposable elements (TEs), mainly retrotransposons, that once possessed the ability to move their locations in the genome1,2. TEs have actively shaped genomic functions through a multitude of mechanisms, including genome structural rearrangements, introduction of mutations and transcriptional regulation1,3. In turn, the mammalian genome has evolved to take advantage of this abundant source of genetic material to regulate essential cellular functions4,5,6. In mammals, several mechanisms have been described to tightly control the expression of TEs spatially and temporally, including noncoding RNAs, DNA methylation and histone modifications (for example, H3K9me3 and H3K27me3)4,7,8,9,10,11. Derepression or insertion of TEs has been linked to cancer, where TEs were either used as alternative promoters (altPs) for oncogene activation or as disruptors of tumor suppressors7,12,13. Furthermore, previous studies have shown that derepression of hypermethylated TEs by inhibiting DNA methylation caused a viral mimicry response in cancer14,15, where in some cases it can be evaded by cancer cells via epigenetic state switch to restore TE repression16,17. Recent studies suggest that TE-driven oncogene expression, a process known as onco-exaptation, is frequent across multiple cancer types18. TEs were also co-opted as cis-regulatory elements, for example, by lineage-specific transcription factors (TFs), to drive oncogenesis19. Among TEs, long terminal repeats (LTRs) that contain intact RNA polymerase II binding sites, TF binding sites and conserved splice donor sites have higher propensity to be co-opted as altPs than other TE subclasses, for example, long interspersed nuclear element (LINE) and short interspersed nuclear element (SINE)13,20.

Chromatin is organized into 3D structures, consisting of self-interacting topologically associating domains (TADs). TADs are established and maintained through CTCF/cohesin-mediated looping and regulate long-range interactions between enhancers and promoters and consequent transcriptional activation21,22. TADs can be further organized into a hierarchical structure of nested sub-TADs, that is, hierarchical TADs23. Compared to nonhierarchical TADs, hierarchical TADs are associated with functionally more active epigenetic states and a higher level of gene transcription23. Unlike CTCF, the cohesin complex does not recognize specific DNA sequences, and its residence on chromatin relies on the continuous loading action by the cohesin complex loading factors, Nipped-B-like protein (NIPBL) and MAU2 sister chromatid cohesion factor (MAU2)21. NIPBL is essential in the establishment and maintenance of TADs24. Complete loss of NIPBL is lethal in dividing cells24. Pathologically, heterozygous loss-of-function (LoF) NIPBL mutations underlie the majority of Cornelia de Lange syndrome (CdLS) cases. CdLS is a multisystem developmental disorder caused by genetic aberrations of the core structural or regulatory components of the cohesin complex (for example, NIPBL, SMC1A and SMC3)25. Furthermore, monoallelic LoF mutations of NIPBL are frequently observed in multiple cancers, and NIPBL is recognized as a bona fide tumor suppressor in pan-cancer analyses26. Although TEs have recently been described to modulate CTCF binding and influence TAD boundaries27,28, whether and how TADs regulate TE expression remains largely unknown.

Here we used a doxycycline-inducible system to mimic the monoallelic loss of NIPBL in disease and evaluated its impact on chromatin topology and transcriptional regulation. Haploinsufficient loss of NIPBL led to hierarchical TAD reduction and global transcriptional activation of altPs that arose primarily from LTRs that already possessed open chromatin features. Hierarchical TAD reorganization and subsequent recruitment and retargeting of proximal enhancers contributed to the activation of altPs from LTRs. These data indicate that chromatin topological hierarchy maintained by NIPBL restricts transcriptional activation from LTRs, and perturbation of the hierarchical chromatin topology in cancer can lead to the co-option of LTRs and aberrant transcriptional activation of new oncogenes and divergent transcripts.

Results

Haploinsufficient loss of NIPBL induces global altP activation

To mimic the monoallelic loss of NIPBL in cancer and CdLS, we used a doxycycline-inducible RNA interference (RNAi) system. After doxycycline treatment, shNIPBL_1 and shNIPBL_2 downregulated the NIPBL mRNA by ~34% and 56%, respectively, in 501mel melanoma cells (Extended Data Fig. 1a). NIPBL downregulation did not significantly alter the total SMC1A protein level, a core component of the cohesin complex24 (Extended Data Fig. 1b), whereas chromatin immunoprecipitation coupled with next-generation sequencing (ChIP–seq) showed a modest reduction of SMC1A protein on chromatin, indicating a partial loss of cohesin loading (Extended Data Fig. 1c). To investigate the effects of haploinsufficient NIPBL loss on transcriptome and transcription start site (TSS) usage, we performed poly-A RNA sequencing (RNA-seq) to capture most mature RNA transcripts and cap analysis of gene expression sequencing (CAGE-seq) to accurately assess TSS, including altPs29, in 501mel cells. Compared to shLuc, shNIPBL_1 and shNIPBL_2 resulted in 305/169 and 1,261/1,308 significantly upregulated/downregulated genes, respectively (Fig. 1a and Supplementary Table 1), which was consistent with short-term degradation of cohesin, as well as CdLS mouse models and patient studies24,30,31 that caused minor perturbations in gene expression. The changes in CAGE-seq-TSSs (CTSSs) and RNA-seq-detected TSSs for annotated transcripts showed a highly significant correlation (Fig. 1b,c, Extended Data Fig. 1d–f and Supplementary Table 2). At baseline, ~56% of all CTSSs were located at annotated promoters, and ~16%, 10%, 9% and 6% were localized to annotated exons, introns, distal intergenic regions and 3′ UTRs, respectively (Fig. 1d and Supplementary Table 2). Notably, shNIPBL-upregulated CTSSs were enriched in introns and distal intergenic regions (Fig. 1d,e), indicating altP usage, whereas shNIPBL-downregulated CTSSs were preferentially located at annotated promoters (Fig. 1d,e). 
The majority of the top 30% upregulated CTSSs transcribed from altPs produced mature transcripts (Supplementary Note 1). For example, ALKATI arose from an altP in intron 19 and was subsequently spliced into exons 20–29 of ALK, expressing the entire kinase domain; this transcript has been previously characterized as an oncogenic variant of ALK prevalent in melanoma and sporadically in other cancer types32 (Fig. 1a,b,f). Similarly, the ULK4 variant arose from intron 31 and LINC01387 from intron 2, both of which were spliced into existing exons to produce mature transcripts (Fig. 1a,b,f). Moreover, we observed that some RNA transcripts originated from distal intergenic regions, such as between SYN3 and LINC01640 on chromosome 22 (chr22: 33,468,501–33,471,517) and between TAS2R39 and TAS2R40 on chr7 (chr7: 142,895,998–142,896,596; Fig. 1f). The distal intergenic altP at chr7: 142,895,998–142,896,596 showed a distinctive feature: transcription initiated at this site, and the transcript was spliced across a genomic distance of ~240 kb into the coding gene KEL (Fig. 1f). These observations are reminiscent of a previous study where Nipbl knockout in nondividing mouse hepatocytes was associated with increased intergenic and antisense (exogenic) transcription24, and they demonstrate that the new CTSSs detected by CAGE-seq were associated with production of mature transcripts.
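The significance calls used throughout this section follow the cutoffs given in the Fig. 1 legend (FDR < 0.05 for genes, FDR < 0.1 for CTSSs, and log2(FC) > 1 or < −1). A minimal sketch of that classification rule is shown below. It is illustrative only and not the authors' pipeline: the function name, the record layout and the example values are hypothetical.

```python
def classify(log2_fc, fdr, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Label a feature as 'up', 'down' or 'ns' (not significant).

    Cutoffs default to the gene-level thresholds in the Fig. 1 legend;
    pass fdr_cutoff=0.1 for the CTSS-level thresholds.
    """
    if fdr >= fdr_cutoff:
        return "ns"
    if log2_fc > lfc_cutoff:
        return "up"
    if log2_fc < -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical records: (name, log2 fold change, FDR). Not data from the study.
examples = [
    ("feature_A", 2.3, 0.001),
    ("feature_B", -1.8, 0.03),
    ("feature_C", 0.4, 0.001),
    ("feature_D", 3.0, 0.08),
]

# Gene-level calls (FDR < 0.05) and CTSS-level calls (FDR < 0.1).
gene_calls = {name: classify(lfc, fdr) for name, lfc, fdr in examples}
ctss_calls = {name: classify(lfc, fdr, fdr_cutoff=0.1) for name, lfc, fdr in examples}
```

Under the looser CTSS cutoff, feature_D changes from not significant to upregulated, which illustrates why gene-level and CTSS-level counts are not directly comparable.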

Fig. 1: Haploinsufficient loss of NIPBL leads to altP activation in intron and distal intergenic regions.

a, Volcano plots of differentially regulated genes (FDR < 0.05) by whole transcriptome analysis of shRNA-mediated NIPBL perturbation in 501mel melanoma cells. Upregulated genes (red, log2(FC) > 1), downregulated genes (blue, log2(FC) < −1) and genes of interest (green). n = 2 biological replicates. b, Volcano plots of significantly altered TSSs (CTSSs, FDR < 0.1, log2(FC) > 1) by CAGE-seq under NIPBL perturbation conditions as in a. n = 2 biological replicates. c, Two-sided Spearman correlation of transcriptome changes (log2(FC)) of annotated transcripts between whole transcriptome analysis by poly-A RNA-seq and CTSS analysis by CAGE-seq under NIPBL perturbation conditions (P < 1 × 10−16 for both shNIPBL conditions). d, Genomic annotation of all CTSSs by CAGE-seq under control condition. Percentage of each genomic feature is indicated. e, Distribution of genomic features of significantly (FDR < 0.1) upregulated (log2(FC) > 1) and downregulated (log2(FC) < −1) CTSSs with shRNA-mediated NIPBL downregulation. f, Representative examples of CAGE-seq and RNA-seq profiles of significantly upregulated CTSSs in intron and distal intergenic regions with NIPBL downregulation. Normalized CAGE-seq (blue, plus strand; red, minus strand) and RNA-seq profiles from two independent experiments are shown. Sashimi plots from one representative RNA-seq experiment depicted new spliced transcripts initiated from respective CTSSs, shaded in pink. The presence of repetitive elements is indicated by the ‘repeatmasker’ track from the UCSC genome browser (hg19 (GRCh37) genomic version). Enlarged genomic regions are shown in the dashed inset. FC, fold change; FDR, false discovery rate; Up, upregulated; Down, downregulated.


Additional doxycycline-inducible hairpins, shNIPBL_4 and shNIPBL_5, were used to validate transcriptional perturbation by poly-A RNA-seq and CAGE-seq (Extended Data Fig. 2, Supplementary Tables 3 and 4 and Supplementary Note 2). Doxycycline-inducible RNAi of NIPBL was performed in two additional melanoma cell lines, COLO800 and A375, to confirm altP usage (Supplementary Fig. 1 and Supplementary Note 3). In addition, CRISPR interference (CRISPRi)33-mediated downregulation of NIPBL resulted in a ~50% decrease in NIPBL and a significant increase in ALKATI mRNA originating from the altP in ALK intron 19 (Supplementary Fig. 2).

We performed CAGE-seq after NIPBL loss in nine cell lines from additional cancer types and observed similar global altP usage from introns and distal intergenic regions (Extended Data Fig. 3, Supplementary Table 5 and Supplementary Note 4).

NIPBL loss activates altP, arising from LTR with open chromatin features

To investigate the mechanisms of altP activation by NIPBL perturbation, we focused on the melanoma context. We observed a bimodal distribution of sharp and broad promoters among all CTSSs at baseline with a 16 bp cutoff (Fig. 2a), consistent with previous studies34,35. NIPBL downregulation preferentially affected CTSSs with broad promoters (Fig. 2a and Extended Data Fig. 4a). Furthermore, we noted the presence of repetitive elements, particularly LTRs, at altPs upregulated by NIPBL downregulation (Fig. 1f). Genome-wide, fewer than 3% of all CTSSs localized to annotated promoters contained repetitive elements, whereas 45–48% of all CTSSs localized in the intronic or distal intergenic regions contained repetitive elements, with the LTR, LINE and SINE subclasses of retrotransposons each accounting for 10–17% (Fig. 2b). With NIPBL downregulation, upregulated CTSSs in intron, distal intergenic and promoter regions were markedly enriched for LTR and depleted in SINE regions (Fig. 2b). Specific endogenous retrovirus subclass L (ERVL)/ERVL–mammalian-apparent LTR retrotransposons (MaLR) families were enriched in upregulated LTRs (Extended Data Fig. 4b–f and Supplementary Note 5).

Fig. 2: Loss of NIPBL induced altP usage arising from LTR repetitive elements characterized by open chromatin characteristics.

a, Bimodal distribution of promoter width showing sharp and broad promoters at baseline (black), and the promoter width of differentially regulated CTSSs by shNIPBL perturbation in melanoma cells, demonstrating a preference for broad promoters (green). Gray dashed line depicted the cutoff at 16 bp that separated sharp and broad promoters based on the distance between CAGE tags for each clustered CTSS with TPM > 0.5. b, Distribution of significantly upregulated CTSSs by shNIPBL at repetitive versus nonrepetitive elements in intron, distal intergenic and promoter regions. c, CpG compositions of significantly upregulated, downregulated and all CTSSs by shNIPBL_1. d, Characterization of 15 chromatin states by ChromHMM (left) and enrichment of different chromatin states over genome distribution for RefSeq TSS and differentially regulated CTSS (right) under NIPBL perturbation conditions. e, Density plots of genome-wide distribution of indicated histone modification marks by CUT&RUN, centered on differentially upregulated and downregulated CTSSs by shNIPBL_1. Data from one representative biological replicate (n = 2) are shown. f, HOMER de novo motif analysis of significantly changed CTSSs at nonpromoters demonstrated significant enrichment of the MITF motif by shNIPBL in melanoma cells. g, Density plots of genome-wide distribution of MITF by ChIP–seq centered on differentially upregulated and downregulated CTSSs by shNIPBL, demonstrating an increase in MITF binding at upregulated CTSSs. Data from one representative biological replicate (n = 2) are shown. h, Box and whiskers plots with boundaries extended from 25th to 75th percentile, line showing median and whiskers showing 10th to 90th percentile of MITF binding by ChIP–seq read counts at differentially upregulated (top) and downregulated (bottom) CTSSs by shNIPBL in melanoma cells. Each dot represented the quantification of MITF binding at one differentially regulated CTSS locus. 
****P < 0.0001, matched one-way ANOVA, Dunnett’s multiple comparisons test. n = 2 biological replicates. NS, not significant. i, Quantification of MITF binding by ChIP–qPCR at representative altP in intron and distal intergenic regions under NIPBL perturbation conditions in melanoma cells. Data indicated the mean ± s.d. n = 3–4 biological replicates. *P < 0.05, **P < 0.01, ***P < 0.001 and ****P < 0.0001, ordinary one-way ANOVA, Dunnett’s multiple comparisons test. ANOVA, analysis of variance.


Because LTR retrotransposons are generally silenced epigenetically by DNA methylation and repressive chromatin modifications in the genome1,4,7,8, we examined the local chromatin characteristics of the shNIPBL-upregulated CTSSs. We observed substantially lower CpG and C/G representation in the shNIPBL_1-upregulated than in the shNIPBL_1-downregulated and baseline CTSSs (Fig. 2c and Extended Data Fig. 4g), and modestly lower CpG representation in the shNIPBL-2-upregulated CTSSs than in the shNIPBL_2-downregulated and baseline CTSSs (Extended Data Fig. 4h). To experimentally delineate the DNA methylation status at CTSSs, we performed bisulfite-free enzymatic methyl sequencing (EM-seq). We observed uniform sequencing coverage among replicates of shLuc, shNIPBL_1 and shNIPBL_2 (Extended Data Fig. 5a). The DNA methylation ratio at CTSSs exhibited a bimodal distribution, consistent with prior studies36,37 without any obvious differences in shLuc, shNIPBL_1 and shNIPBL_2, indicating the lack of global DNA methylation changes after NIPBL downregulation (Extended Data Fig. 5b). We observed that the shNIPBL_1-downregulated CTSSs and the shNIPBL_2 upregulated and downregulated CTSSs exhibited enrichment of low DNA methylation ratios, while the shNIPBL_1-upregulated CTSSs exhibited relatively uniform distribution across various DNA methylation levels (Extended Data Fig. 5c,d). These genomic regions have relatively few CpG sites (Extended Data Fig. 5e,f), consistent with the low CpG content (Fig. 2c and Extended Data Fig. 4g,h). The majority (>95%) of the differentially upregulated and downregulated CTSSs were not associated with differential methylation changes (Extended Data Fig. 5g,h). We then specifically evaluated the representative CTSSs upregulated by shNIPBL_1 and shNIPBL_2 (Fig. 1f). These regions generally contained few CpG sites and had no or low levels of CpG methylation; the CpG methylation levels were not altered by NIPBL perturbations (Extended Data Fig. 5i–m). 
Collectively, these results supported the notion that differentially activated CTSSs by NIPBL knockdown were not silenced by DNA methylation.
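The text does not state how CpG and C/G representation were computed. As a hedged illustration, the sketch below uses two common sequence-level measures: the G+C fraction and the observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G). The function names and the example sequences are hypothetical, and the authors' actual metric may differ.

```python
def gc_fraction(seq):
    """Fraction of bases that are C or G (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the sequence has no C or no G, since the
    expected count is then zero and the ratio is undefined.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Hypothetical toy sequences, not regions from the study.
cpg_rich = "CGCGCGCG"
cpg_poor = "ACTTTGGCAT"
```

A window around each CTSS would be scored this way and the distributions compared between upregulated, downregulated and baseline CTSSs; the window size is a further choice the text leaves open.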

We next examined the baseline chromatin characteristics of altered CTSSs by NIPBL loss, using cleavage under targets and release using nuclease (CUT&RUN)38 for histone modifications preferentially enriched at active enhancers (H3K4me1 and H3K27ac), promoters (H3K27ac and H3K4me3) and silenced chromatin regions (H3K9me3 and H3K27me3; Fig. 2d,e). Additional chromatin marks, for example, H3K36me3, CTCF and SMC1A, were used to generate 15 comprehensive chromatin states by ChromHMM analysis39 at baseline (Fig. 2d). As expected, the annotated RefSeq TSS sites were enriched for ChromHMM states 6 (H3K4me3/CTCF high)40, 7 (H3K4me3 high) and 8 (H3K27ac/H3K4me3 high); all are characterized by the enrichment of H3K4me3 (Fig. 2d). While the significantly downregulated CTSSs by shNIPBL_1 were enriched for the same ChromHMM states as active TSS (Fig. 2d,e), the significantly upregulated CTSSs by shNIPBL_1 were most enriched for ChromHMM state 9 characterized by enrichment for active and poised chromatin marks (for example, H3K27ac and H3K4me1) and devoid of repressive chromatin marks (for example, H3K9me3 and H3K27me3), reminiscent of active enhancers. Notably, a subset of these upregulated CTSSs were also enriched with H3K4me3 chromatin mark, reminiscent of active or poised promoters (Fig. 2e). These active chromatin marks were further enriched at the significantly upregulated CTSS heatmaps with NIPBL downregulation and are concordant with CTSS transcriptional changes (Fig. 2e and Supplementary Fig. 3a–c). Moreover, we observed similar patterns of chromatin modifications (for example, H3K4me3 and H3K27ac enrichment) at LTR-derived upregulated CTSSs with NIPBL downregulation (Supplementary Fig. 3e–g).

De novo and known motif analyses by hypergeometric optimization of motif enrichment (HOMER) revealed that shNIPBL-induced changes at nonpromoter CTSS regions were most enriched for microphthalmia-associated transcription factor (MITF), a lineage-specific master regulator in melanocyte and melanoma41 (Fig. 2f, Extended Data Fig. 6a,b and Supplementary Fig. 4). ChIP–seq confirmed that MITF bound the majority of differentially regulated CTSSs in shLuc cells (Fig. 2g,h and Supplementary Fig. 3d,h). With shNIPBL perturbation, differentially upregulated CTSSs enriched at intron and distal intergenic regions showed significantly increased binding of MITF, whereas downregulated CTSSs enriched at promoter regions showed no appreciable changes of MITF binding (Fig. 2g,h and Supplementary Fig. 3d,h). This observation was validated by quantitative ChIP–qPCR of MITF at shNIPBL-upregulated CTSSs (Fig. 2i and Extended Data Fig. 6c). We further observed enrichment of H3K4me3 at these selectively upregulated CTSSs by CUT&RUN-seq and ChIP–qPCR, corroborating the corresponding transcriptional activation (Extended Data Fig. 6c,d). Moreover, the increase in MITF binding to differentially upregulated CTSSs by NIPBL loss at TEs was preferentially at LTRs (Extended Data Fig. 6b). Collectively, these data demonstrated that loss of NIPBL led to altP usage arising from LTRs with open chromatin. They also indicate that the transcriptional activation from these LTRs is restricted despite open chromatin characteristics and may require additional chromatin structure changes that allow further recruitment of context-dependent master regulator TFs for activation, for example, MITF, in melanoma cells.

NIPBL partial loss leads to decrease in hierarchical TAD structures

Complete loss of the core cohesin family members on chromatin, either by biallelic genetic deletion of Nipbl in mouse hepatocytes24 or through auxin-inducible degron-mediated protein degradation of RAD21 (ref. 31), has demonstrated global disappearance of all TADs. To evaluate the effects of haploinsufficient loss of NIPBL on 3D chromatin organization, we performed Hi-C in biological duplicates to generate contact maps at 10 kb resolution and analyzed the genome-wide chromatin interaction frequencies in melanoma cells with doxycycline-inducible knockdown of NIPBL (shNIPBL_2) and control (shLuc). The 3D chromatin organization is heterogeneous and can be organized into hierarchies of TAD structures that correlate with more transcriptionally active chromatin regions when compared to single nonhierarchical structures21,23,42,43. We analyzed the Hi-C dataset with OnTAD, a TAD caller optimized for calling hierarchies of TAD with nested sub-TAD structures23. Hi-C analysis of the shLuc control cells revealed that most TADs contained a hierarchy of nested sub-TADs (Fig. 3a, green arrows) within a larger outer TAD structure (Fig. 3a, light blue arrows), and a minority of TADs existed as single structures (singleton TAD; Fig. 3a, navy blue arrows). Haploinsufficient loss of NIPBL resulted in significant weakening of TADs and preferential loss of the larger outer TAD structure, manifested by the loss/weakening of corner dots, representing the disappearance of cohesin-mediated interaction between the most distal boundaries of the hierarchical TADs44, with relative preservation of the inner nested sub-TADs (Fig. 3a). We defined the hierarchy of TADs based on the number of nested sub-TADs, where higher levels of hierarchical TADs correlated with more successive layers of nested sub-TADs. We then quantified the number of singleton and hierarchical TADs and identified a total of 3,791 TADs, including 967 (25.5%) singleton and 2,824 (74.5%) hierarchical TADs in control cells (Fig. 
3b). With NIPBL loss (shNIPBL_2), there was a substantial increase in the total number of TADs to 4,541, with an increase in the proportion of singleton TADs (1,723/4,541, 37.9%) and a reciprocal decrease in hierarchical TADs (2,818/4,541, 62.1%; Fig. 3b). The decrease in hierarchical TADs was more pronounced in ≥ level 3 hierarchical TADs, 38.2% (1,448/3,791, shLuc) versus 27.3% (1,239/4,541, shNIPBL_2; Fig. 3b), likely due to the dissolution of higher-level hierarchical TADs ≥ level 3 into lower-level hierarchical and singleton TADs (Fig. 3a,b). Consistently, we observed an overall shift to substantially smaller TAD sizes, with a median TAD size of 530 kb in shLuc and 340 kb in shNIPBL_2 cells (Fig. 3c). Similarly, Hi-C read pair analysis revealed that shNIPBL_2 cells contained more short-range interactions of ~10 to 100 kb (log10 values between 4 and 5), while shLuc control cells contained more mid-to-long-range interactions of ~100 kb to 3 Mbp (log10 values between 5 and 6.5; Extended Data Fig. 7a). Interestingly, we observed more mega-loop long-range interactions (>3 Mbp, log10 > 6.5) with shNIPBL_2 perturbation than with shLuc control (Extended Data Fig. 7a). Furthermore, consistent with a previous study24, partial loss of NIPBL resulted in finer segregation of compartments (Extended Data Fig. 7b). Consistent with prior studies of cohesin perturbations that led to TAD boundary defects24,31,45, we observed a substantial reduction in TAD boundary strength by insulation score46 in shNIPBL_2 compared to control cells (Fig. 3d), indicating that partial NIPBL loss can also lead to weakened TAD boundaries.
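The TAD proportions quoted above follow directly from the reported counts. The short sketch below recomputes them; the counts are taken from the text, while the dictionary layout and key names are only illustrative.

```python
# TAD counts as reported in the text (Fig. 3b).
tad_counts = {
    "shLuc": {"total": 3791, "singleton": 967, "hierarchical": 2824, "level3_plus": 1448},
    "shNIPBL_2": {"total": 4541, "singleton": 1723, "hierarchical": 2818, "level3_plus": 1239},
}


def percentages(counts):
    """Express each TAD class as a percentage of all TADs, to one decimal."""
    total = counts["total"]
    return {k: round(100 * v / total, 1) for k, v in counts.items() if k != "total"}


summary = {condition: percentages(c) for condition, c in tad_counts.items()}
for condition, pct in summary.items():
    print(condition, pct)
```

The absolute number of hierarchical TADs is nearly unchanged (2,824 versus 2,818), so the drop in their proportion is driven by the gain of singleton TADs, while the ≥ level 3 class falls in both absolute and relative terms.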

Fig. 3: NIPBL partial loss leads to a decrease in hierarchical TAD structures and preferentially affects CTSSs residing in high-level hierarchical TADs.

a, Representative hierarchical TAD structures by Hi-C contact maps (10 kb resolution) under control and NIPBL perturbation conditions in melanoma cells. TADs are organized as singletons (navy blue arrows) or hierarchical structures characterized by nested sub-TADs (green arrows) and nested inside meta-TADs (light blue arrows). NIPBL downregulation weakened TAD structures and caused preferential loss of the outermost TADs (light blue arrows) of hierarchical TADs. n = 2 biological replicates. b–d, Effects of partial loss of NIPBL on hierarchical TAD structures in melanoma cells, including the number of TADs at each hierarchical level by OnTAD (b), size distribution (c) and TAD boundary/insulation scores (d). ****P < 0.0001, two-tailed unpaired nonparametric Mann–Whitney test. e, Comparison of fraction of baseline CTSSs versus differentially regulated CTSSs by shNIPBL_1 and shNIPBL_2 in hierarchical and nonhierarchical TADs defined under control condition (≥ level 3 versus ≤ level 2 and singleton). *P < 0.05, **P < 0.01, two-tailed chi-square test. The number of CTSSs in each category is in Extended Data Fig. 7. f,g, Relative CTSS distance to nearest TAD boundaries defined in shLuc condition (f) and in shNIPBL condition (g). The relative CTSS distance to TAD boundaries was calculated as the distance of CTSS to the nearest TAD boundary divided by the size of the respective TAD defined in the shLuc and shNIPBL conditions. h, Schematics illustrating the change in location of CTSSs relative to TAD structure changes with NIPBL perturbation. diff. CTSS, differential CTSS.


We next evaluated the effect of hierarchical TAD structure changes following NIPBL knockdown on differential CTSS expression. There was a significant enrichment of differential CTSSs mediated by shNIPBL_1 and shNIPBL_2 that localized to ≥ level 3 hierarchical TADs compared to lower levels of hierarchical and singleton TADs defined in the control condition (Fig. 3e and Extended Data Fig. 7c). Similarly, significant enrichment of differential CTSSs by shNIPBL_4 and shNIPBL_5 localized to ≥ level 3 hierarchical TADs compared to lower levels of hierarchical and singleton TADs was observed (Extended Data Fig. 7d,e).

Differentially regulated CTSSs were localized at the boundary of redefined TADs in the shNIPBL_2 condition (Fig. 3f–h and Supplementary Note 6). These data indicate that the TAD hierarchical structure changes can impact transcriptional activation from altP.
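The Fig. 3 legend defines the relative CTSS distance as the distance from the CTSS to the nearest TAD boundary divided by the size of the TAD. A minimal sketch of that calculation follows. The function name and the coordinates are hypothetical, and the handling of a CTSS outside the TAD is an assumption, since the text does not describe that case.

```python
def relative_boundary_distance(ctss_pos, tad_start, tad_end):
    """Distance from a CTSS to the nearest TAD boundary, scaled by TAD size.

    Returns 0.0 for a CTSS on a boundary and 0.5 for one at the TAD center.
    Raising on a CTSS outside the TAD is an assumption of this sketch.
    """
    size = tad_end - tad_start
    if size <= 0:
        raise ValueError("TAD end must be greater than TAD start")
    if not tad_start <= ctss_pos <= tad_end:
        raise ValueError("CTSS lies outside the TAD")
    nearest = min(ctss_pos - tad_start, tad_end - ctss_pos)
    return nearest / size


# Hypothetical coordinates: the same CTSS in a large TAD and in a
# smaller TAD produced by splitting the large one.
before_split = relative_boundary_distance(500_000, 0, 1_000_000)
after_split = relative_boundary_distance(500_000, 480_000, 880_000)
```

In this toy case the CTSS sits at the center of the original TAD (0.5) but close to a boundary of the redefined TAD (0.05), which is the kind of shift summarized in Fig. 3f–h.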

Promiscuous gene activation through retargeting of enhancers to altP

We investigated the 3D chromatin topology around activated altPs, for example, ALK intron 19 (Fig. 4a,b) and ULK4 intron 31 (Fig. 4c,d), to further probe the molecular mechanisms. Both altPs (highlighted in blue) were situated inside complex hierarchical TADs (≥ level 5) that contained multiple layers of nested sub-TADs in control cells (Fig. 4a–d). With partial reduction of NIPBL, the Hi-C contact matrix revealed loss of the outer TAD structures (blue circles) with relative preservation of the inner nested sub-TADs (green circles) and splitting of hierarchical TADs into multiple smaller TADs (gray horizontal bars; Fig. 4a,c). Consistently, OnTAD analysis of the ALK intron 19 (Fig. 4b) and ULK4 intron 31 (Fig. 4d) altP regions revealed decreased hierarchical TAD structures, from ≥ level 5 to ≤ level 4. Additionally, both altPs that were initially embedded in the center of a complex network of hierarchical TAD structures in the control condition are now repositioned to the boundary of split and ‘simplified’ TAD structures (Fig. 4b,d).

Fig. 4: Examples of promiscuous gene activation (ALKATI, ULK4 from intron 31) through reorganization of hierarchical TADs and retargeting of enhancers to altP in proximity.

a,c, Hi-C map of hierarchical TADs of selective genomic regions, ALKATI (a) and ULK4 (c) in control (top, shLuc) and NIPBL perturbation (bottom, shNIPBL_2) conditions in melanoma cells. Gray bars below each Hi-C map denote computationally annotated TADs. ALKATI and ULK4 are located in a nested TAD comprising multiple sub-TADs. NIPBL loss leads to splitting (gray bars) and loss of hierarchy of outer TADs (blue circles), with relative preservation of sub-TADs (green circles). Genomic regions of ALKATI and ULK4 intron 31 altP were shaded in blue. b,d, OnTAD hierarchical domain changes of selective genomic regions, ALKATI (b) and ULK4 (d) in control (shLuc) and NIPBL perturbation (shNIPBL_2) conditions. The hierarchical domains are color-coded similarly to Fig. 3b, demonstrating the split of the top outer level of the hierarchical TADs (blue bar). Bottom illustration: B, TAD boundary; C, TAD center. e,f, Representative normalized 4C-seq profiles under control and NIPBL perturbation conditions, using ALKATI (e) and intron 31 of ULK4 (f) as viewpoints (shaded in pink), demonstrating enhanced interaction profile close to the viewpoint (arrows) and diminished long-range interaction (arrowheads) with NIPBL downregulation. Normalized average 4C-seq signal was binned with either a 25 kb window (ALKATI locus) or a 10 kb window (ULK4 intron 31 locus) with a 1 kb shift to calculate differential and log2(FC) in interaction (shNIPBL_2 signal − shLuc signal) and statistical significance by two-tailed paired t test. Respective chromatin states by ChromHMM, defined in Fig. 2d at the corresponding genomic regions, were also shown. g–j, Zoomed-in regions from normalized 4C-seq profile under control and NIPBL perturbation conditions, demonstrating proximal contacts gain surrounding ALKATI (g) and ULK4 intron 31 (i) altPs with NIPBL perturbation. 
CRISPRi of the indicated genomic regions (blue, g,i) confirmed engagement of neighboring proximal H3K27ac-enriched peaks as retargeted enhancers for the transcriptional activation of ALKATI (h) and ULK4 from intron 31 (j). Data indicated the mean ± s.d. (n = 3–4 biological replicates). *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001, ordinary one-way ANOVA, Dunnett’s multiple comparisons test.


To investigate if changes in local chromatin architecture directly affect transcription from ALK intron 19 and ULK4 intron 31, we selectively disrupted CTCF insulators located at the TAD/sub-TAD boundaries by either dCas9–KRAB-mediated blocking or Cas9-mediated deletion of the CTCF binding sites47 (Extended Data Fig. 8a,j). dCas9–KRAB that targeted cognate CTCF binding motifs resulted in substantial reduction of CTCF binding by ChIP–qPCR (Extended Data Fig. 8b–f,k–m). Specifically, we observed increased expression of alternative transcripts from ALK intron 19 and ULK4 intron 31 when CTCF insulators at TAD/sub-TAD boundaries were disrupted by dCas9–KRAB (Extended Data Fig. 8g,n). Consistently, Cas9-mediated deletion of CTCF binding sites at TAD/sub-TAD boundaries enhanced alternative transcript expression at both loci (Extended Data Fig. 8h,i,o,p), indicating that changes in hierarchical TAD structure directly impact altP usage.

TADs are thought to facilitate and restrict long-range enhancer (E)–promoter interactions within and outside of TADs, respectively21. To assess the genome-wide E-altP interactions that may contribute to the transcriptional activation of de novo CTSSs, we performed circular chromosome conformation capture sequencing (4C-seq)48, using ALK and ULK4 altPs in intron 19 and intron 31 as viewpoints, respectively (Fig. 4e,f, red shades). In the shLuc control, we observed a high frequency of interactions from genomic regions close to the viewpoints and a rapid decay of interaction frequencies with increased genomic distance. With partial NIPBL loss (shNIPBL_2), we observed a substantial decrease in interaction frequencies with the altPs, preferentially from distal genomic regions (Fig. 4e,f, arrowheads) and a marked increase in interactions from the proximal genomic regions (Fig. 4e,f, arrows), suggesting a shift of genomic interactions and possibly retargeting of enhancers from distal to proximal genomic regions to the altPs.
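According to the Fig. 4 legend, the 4C-seq comparison bins the normalized signal in a sliding window (25 kb or 10 kb, shifted by 1 kb) and reports the difference and log2(FC) between shNIPBL_2 and shLuc. The sketch below shows one way such a comparison could be computed. It assumes input already summarized at 1 kb resolution, adds a pseudocount to avoid division by zero, and omits the paired t test; the function names, the pseudocount and the toy values are all hypothetical and are not the authors' implementation.

```python
import math


def sliding_mean(signal, window, step=1):
    """Mean signal in windows of `window` bins, advanced by `step` bins."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, step)
    ]


def compare_profiles(control, treated, window, step=1, pseudocount=1.0):
    """Windowed difference (treated - control) and log2 fold change.

    The pseudocount is an assumption of this sketch, not a value
    taken from the paper.
    """
    if len(control) != len(treated):
        raise ValueError("profiles must cover the same bins")
    ctrl = sliding_mean(control, window, step)
    trt = sliding_mean(treated, window, step)
    diff = [t - c for c, t in zip(ctrl, trt)]
    log2_fc = [
        math.log2((t + pseudocount) / (c + pseudocount))
        for c, t in zip(ctrl, trt)
    ]
    return diff, log2_fc


# Toy profiles at 1 kb resolution, ordered by distance from the viewpoint.
sh_luc = [8, 6, 4, 2, 2, 2]
sh_nipbl = [10, 8, 6, 1, 1, 1]
diff, log2_fc = compare_profiles(sh_luc, sh_nipbl, window=3)
```

With these toy values, windows near the viewpoint gain signal and distal windows lose it, mirroring the proximal gain and distal loss described for the ALK and ULK4 altPs. A 25 kb window at 1 kb resolution would correspond to `window=25`.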

Consistent with the genome-wide chromatin analysis (Fig. 2d), altPs of ALK and ULK4 were devoid of heterochromatin marks, for example, H3K27me3, and were enriched for open chromatin features, for example, H3K4me1, in shLuc control cells (Fig. 4e,f). NIPBL depletion in shNIPBL_2 did not affect H3K4me1 but increased H3K4me3 and H3K27ac at the altPs and H3K36me3 at the gene bodies of the alternative transcripts, consistent with transcriptional activation (Fig. 4e,f). Multiple H3K27ac-enriched peaks flanking ALK intron 19 and ULK4 intron 31 within the hierarchical TADs suggested potential enhancers to the altPs. We performed CRISPRi to target the various potential enhancers, using guide RNAs specific for the ALK (intron 19) or ULK4 (intron 31) altP as positive and nontargeting sgGFP as negative controls (Fig. 4g–j). We also included two sets of sgRNAs (sgALK Ctrl1&2 and sgULK4 Ctrl1&2) targeting neighboring nonenhancer regions (without H3K27ac enrichment) as additional controls (Fig. 4g–j). Cas9 ChIP–seq, H3K27ac and H3K9me3 CUT&RUN were performed to validate the specificity of CRISPRi-sgRNAs (Extended Data Fig. 9 and Supplementary Note 7). CRISPRi-sgRNA targeting confirmed the recruitment of proximal enhancers to activate transcription from altPs of ALK and ULK4 loci with NIPBL downregulation (Fig. 4g–j and Supplementary Note 8).

These observations corroborated the 4C-seq data. They indicated that the distal enhancers in preserved hierarchical TADs provided basal interactions with the altP but were not sufficient to activate transcription. Collapse of hierarchical TADs upon NIPBL loss shifted E-altP interactions toward proximal enhancers, which were recruited to activate transcription from altPs arising from LTR repetitive elements.

NIPBL loss-mediated altP usage from LTR can give rise to oncogene expression

To further evaluate the functional consequences of NIPBL loss-mediated altP usage, we noted the robust induction of the ALKATI variant originating from the LTR16B2 region in intron 19 in melanoma cells (Fig. 1a,b,f). This ALKATI variant was previously characterized as a new oncogenic ALK isoform that is biallelically expressed and independent of genetic alterations at the ALK locus32. To evaluate if NIPBL perturbations contribute to the expression of ALKATI in melanoma, we selected the top 50 cases of TCGA–SKCM samples of high ALKATI expression by RNA-seq, compared to the bottom 50 cases that had no or very low ALKATI expression, and analyzed the NIPBL somatic mutations by FunSeq2 (ref. 49), a validated computational algorithm that prioritizes mutational variants with functional significance using a weighted score system (Fig. 5a and Extended Data Fig. 10a). Combining the number of NIPBL somatic mutations in each tumor sample and the FunSeq2 scores, we generated a composite value that estimated the functional inactivation of NIPBL mutations for each case (Fig. 5a and Extended Data Fig. 10a). The composite FunSeq2 score was significantly higher (P < 0.05) in high versus no-low ALKATI expression cases, indicating a higher level of NIPBL functional inactivation (Fig. 5a). To further establish the link between NIPBL perturbation and ALKATI expression, we screened available patient-derived melanoma cell lines for ALKATI expression and identified SKMEL-23 and SKMEL-1128 with ALKATI expression by RNA-seq (Fig. 5b). Compared to two well-characterized melanoma cell lines, COLO800 and A375, which did not express ALKATI, SKMEL-23 and SKMEL-1128 expressed lower levels of NIPBL mRNA transcripts (Fig. 5c). Consistently, H3K27ac ChIP–seq revealed H3K27ac enrichment at the ALKATI altP in all the melanoma cell lines examined32 (Fig. 5d), suggesting that melanoma might be primed to express ALKATI.

Fig. 5: NIPBL perturbation contributes to transcriptional activation of ALKATI from the altP through engagement of proximal H3K27ac-enriched enhancers in melanoma.

a, Evaluation of NIPBL functional perturbation by FunSeq2 of TCGA melanoma cases with high and low ALKATI expression. Fifty cases from each of the high and low ALKATI-expressing TCGA–SKCM cases were used to calculate the composite FunSeq2 score of NIPBL. Scatter dot plot with box representing the mean and whiskers representing the s.d. *P < 0.05, two-tailed unpaired t test. b, Representative RNA-seq profiles of the ALK locus, demonstrating the expression of ALKATI in SKMEL-23 and SKMEL-1128, but not in COLO800 and A375 melanoma cell lines. c, Comparison of NIPBL TPM in ALKATI-expressing and nonexpressing melanoma cell lines. d, Normalized ChIP–seq profiles of H3K27ac (top) demonstrating putative proximal enhancers (blue shade) surrounding the ALKATI altP (pink shade); H3K27ac HiChIP arcs with hotspot analysis (bottom) using the ALKATI altP in intron 19 (pink shade) as virtual viewpoint confirmed the proximal enhancer interactions (blue shades) with the ALKATI altP. e, CRISPRi of the indicated proximal H3K27ac-enriched putative enhancer regions indicated in d led to a reduction of ALKATI mRNA expression, confirming their contribution as enhancers. Data indicated the mean ± s.d. (n = 4 biological replicates). *P < 0.05, **P < 0.01, ****P < 0.0001, ordinary one-way ANOVA, Dunnett’s multiple comparisons test.

We performed H3K27ac Hi-C followed by chromatin immunoprecipitation (HiChIP) to investigate the enhancer connectome in SKMEL-23 and SKMEL-1128 and observed significant interactions between the proximal H3K27ac-enriched enhancers and ALK intron 19. Genomic regions, for example, the 5′ upstream region, introns 1 and 4 of the ALK locus and the region surrounding the PCARE locus, were the main E-altP interaction sites with the altP of ALK by virtual 4C and H3K27ac HiChIP hotspot analysis using ALK intron 19 as viewpoint (Fig. 5d, blue highlights). CRISPRi-sgRNA targeting was performed to confirm the proximal enhancer retargeting (Fig. 5e and Supplementary Note 9).

Finally, we investigated whether MITF has a role in transcriptional activation of ALKATI in melanoma cell lines, analogous to its recruitment to activate altP after NIPBL loss (Fig. 2f–i, Extended Data Fig. 6 and Supplementary Fig. 3d,h). Compared to known MITF-binding sites at TYR and DCT promoters, we observed comparable levels of MITF binding to ALKATI proximal H3K27ac-enriched enhancers by MITF ChIP–qPCR in both SKMEL-23 and SKMEL-1128 (Extended Data Fig. 10b). RNAi-mediated MITF downregulation (siMITF_1 and siMITF_2) led to a significant decrease in ALKATI expression (Extended Data Fig. 10c,d), indicating that MITF binding is critical in the active transcription of ALKATI from the ALK intron 19 altP. Together, these observations suggest that NIPBL somatic mutations and reduced expression can induce context-dependent oncogene activation through proximal enhancer retargeting and altP usage.

Discussion

Promoter usage is highly dynamic in cancer, and the resulting isoform diversity could potentially increase the fitness of cancer cells50 by virtue of differential stability, translational efficiency, protein structure and binding partners51. Differential isoform expression can be used to stratify cancer molecular subtypes and predict patient survival, highlighting the distinctive functional roles of certain transcript isoforms50,52. Cancer-specific altPs are prevalent in various cancer types, such as ovarian and stomach cancers, which also express more TE-encoded oncogenes18,50, suggesting that TE co-option might be one of the underlying mechanisms that drive altP usage in cancer. Multiple layers of interconnected epigenetic mechanisms have been described to restrict TE expression, including DNA methylation and repressive chromatin modifications4,7,8,10,11. Here we report a new mechanism of TE regulation in which the 3D chromatin hierarchy restricts the activation, as active promoters, of TEs that already possess active chromatin features, for example, H3K4me1, H3K27ac and the lack of CpG methylation.

We observed that in melanoma cells, the melanocyte/melanoma-specific master regulator MITF41 was responsible for activating altPs from LTRs, following the collapse of hierarchical TAD structures. NIPBL downregulation led to altP activation in multiple cancer types in addition to melanoma (Extended Data Fig. 3 and Supplementary Table 5), suggesting the possible involvement of other lineage/cell context-specific TFs in regulating altP usage from TEs under regulation of higher-order chromatin structures that warrants future investigation. Activated TEs by 3D chromatin hierarchical changes in melanoma are marked by active chromatin features, which resemble open chromatin accessibility and key TF bindings (for example, OCT4 and NANOG) at TEs that were observed in pluripotent stem cells53,54. It will be of interest to explore whether specific classes of TEs are more prone to be exploited by cancer cells that were once activated in the stem cell state. A previous study has hinted at the transactivation potential of TEs that are bound by TFs with the decoration of active histone modification marks (for example, H3K27ac, H3K4me1 and H3K4me3) and lower DNA methylation compared to their counterparts that are not associated with TFs55. In our study, most TE-derived altPs have low levels of MITF binding at baseline, which is further enriched by the loss of NIPBL. Consistently, these regions are also decorated with a high H3K27ac level and H3K4me1/H3K4me3 ratio, closely resembling conventional enhancer elements. Recent studies indicate that the transcriptional machinery and architecture are strikingly similar between promoter and enhancer cis-regulatory elements56,57. Whether a cis-regulatory element results in productive mRNA depends on its transcriptional strength57, which in turn is controlled by enhancer activity and contact frequency58,59. 
Consistently, our findings support that this class of enhancer-like TEs is typically restricted from transcriptional activation by hierarchical 3D chromatin organization. Upon the collapse of hierarchical TAD structures, long-range interactions between distal enhancers and promoters enabled by cohesin are lost; instead, H3K27ac-enriched regions close to TEs are recruited to enhance the transcriptional strength, resulting in productive mRNA transcription. Notably, the upregulated CTSSs have lower cytosine-guanine (CG) content and CpG methylation levels than CTSSs that are either downregulated (mostly canonical promoters) or unchanged. These observations suggest that this type of CTSS is unlikely to be repressed by DNA methylation; they also highlight one of the differentiating features between enhancers and promoters, where enhancers are usually devoid of DNA methylation and have low CpG content57.

Methods

The research conducted in this study complies with all relevant ethical and regulatory guidance, including the Memorial Sloan Kettering Cancer Center’s Institutional Review Board.

CRISPRi

Cells were transduced with pHR–SFFV–KRAB–dCas9–P2A–mCherry lentivirus (Addgene, 60954) and sorted for mCherry-positive cells. sgRNAs were designed using CRISPOR60 (Supplementary Table 6). To perturb CTCF binding, sgRNAs were designed to target the cognate CTCF binding motif (Supplementary Table 6). sgRNAs were cloned into a modified U6-based expression vector pSLQ1651-sgTelomere (F + E; Addgene, 51024) that enhanced sgRNA stability and assembly with dCas9 protein61. To remove the mCherry reporter gene from the pSLQ1651-sgTelomere (F + E) vector, PCR was done using exCherry_NheI and exCherry_puro_EcoRI primers (Supplementary Table 6), with pSLQ1651-sgTelomere (F + E) as template. The PCR fragment and pSLQ1651-sgTelomere (F + E) were digested with NheI and EcoRI (New England Biolabs (NEB)), ligated and transformed into Stbl3 competent bacteria (Thermo Fisher Scientific). sgRNA sequence was amplified using the pSLQ1651-sgTelomere (F + E) vector with FE_AS and region-specific sgRNA primer (Supplementary Table 6). A pool of two to six sgRNAs was used to target each region. PCR product was gel purified, digested with BstXI and XhoI (NEB) and inserted into pSLQ1651-sgTelomere (F + E) without mCherry. Identity of all plasmids was confirmed by Sanger sequencing. Lentivirus carrying sgRNA was then transduced into KRAB–dCas9–P2A–mCherry-expressing cells, selected by 2 µg ml−1 puromycin and collected after 6 days to test for mRNA level or CTCF binding by ChIP–qPCR.

RNA-seq

Cells were lysed with TRIzol RNA isolation reagent (Thermo Fisher Scientific). Total RNA was extracted with chloroform and isopropanol precipitation of the upper aqueous phase. Total RNA precipitate was washed once with 75% ethanol, air-dried and resuspended in RNase-free water. Absence of RNA degradation was confirmed by Agilent BioAnalyzer. In total, 500 ng of total RNA was used for poly-A selection of RNA and library construction using the TruSeq Stranded mRNA LT Kit (Illumina, RS-122-2102). Samples were barcoded and run on a HiSeq 4000 in PE50 to generate 40–50 million reads. Ribosomal reads represented 0.75–1.87% of the total reads generated and the percent of mRNA bases averaged 80%.

CAGE-seq

CAGE-seq was performed according to the protocol in ref. 62, with modifications to detect 5′ ends of capped RNAs. Five micrograms of total RNA were reverse transcribed using a primer that carried a 15-nt random region and EcoP15I recognition sequence and SuperScriptIII RNase H-reverse transcriptase (Thermo Fisher Scientific) in the presence of trehalose/sorbitol to increase the efficiency of reverse transcription. Diol groups on RNA/cDNA duplex were oxidized and labeled with biotin hydrazide (Vector Laboratories). RNase I (Promega) treatment was performed to cleave single-stranded RNA regions that were not hybridized with first-strand cDNA. tRNA-coated streptavidin beads (Thermo Fisher Scientific) were used to capture biotinylated RNA/cDNA duplex. Sample was washed three times with wash buffer A (4.5 M NaCl, 50 mM EDTA (pH 8.0) and 0.1 % Tween-20), once with wash buffer B (10 mM Tris–HCl (pH 8.5), 1 mM EDTA (pH 8.0), 0.5 M NaOAc (pH 6.1) and 0.1 % Tween-20) and once with wash buffer C (0.3 M NaCl, 1 mM EDTA (pH 8.0) and 0.1% Tween-20). Single-stranded cDNAs were released from streptavidin beads by heating at 95 °C for 5 min. Released cDNA was treated with RNase H (Thermo Fisher Scientific) and RNase I (Promega). Library was constructed with modified primers for compatible sequencing on the Illumina Nextseq 500 sequencer. Ligation of the barcoded 5′ linker was performed. Second-strand cDNA was synthesized using Deep Vent (exo-) DNA polymerase (NEB) and biotinylated second-strand primer. cDNA was treated with shrimp alkaline phosphatase (Thermo Fisher Scientific) to remove the phosphate group from the 5′ linker. EcoP15I (NEB) was used to cleave 27 nt downstream of the recognition site at the 5′ linker. The 3′ linker was ligated to the 27-nt-long tags, purified with tRNA-coated streptavidin beads and amplified with 2× Phusion high-fidelity PCR mix (NEB). 
CAGE libraries were treated with exonuclease I (NEB) to degrade single-stranded DNA before sequencing on NextSeq 500 using SR50 to obtain around 40 million reads for each library. All primers used for CAGE-seq are listed in Supplementary Table 6.

EM-seq

High-molecular-weight genomic DNA was extracted using the Puregene cell core kit (Qiagen) and sent to the Memorial Sloan Kettering Cancer Center, Epigenetics Research Innovation Lab for processing. EM-seq was performed with 100 ng of purified genomic DNA using the NEBNext Enzymatic Methyl-seq Kit (NEB, E7120) following the manufacturer’s instructions. After quantification and size distribution assessment using a Qubit Flex fluorometer (Thermo Fisher Scientific) and a TapeStation (Agilent), respectively, the purified libraries were sequenced on an Illumina NovaSeq 6000 to generate around 200 million PE100 reads per library.

CUT&RUN

CUT&RUN was performed using the CUT&RUN assay kit (Cell Signaling Technology) based on the published protocol in ref. 38. Briefly, 0.25 × 10⁶ live cells were immobilized on activated concanavalin A beads and permeabilized with digitonin. Cell-bead suspension was washed with 1× wash buffer supplemented with spermidine and protease inhibitor cocktail and incubated with 1 µg of antibody at 4 °C overnight. The following antibodies were used: anti-H3K4me1 (Thermo Fisher Scientific/Invitrogen, 710795), anti-H3K4me3 (Thermo Fisher Scientific/Invitrogen, PA5-27029), anti-H3K27ac (Abcam, ab4729), anti-H3K27me3 (Cell Signaling Technology, 9733BF), anti-H3K9me3 (Abcam, ab176916), anti-H3K36me3 (Active Motif, 61021) and anti-CTCF (Cell Signaling Technology, 3418F) antibodies. Antibody-targeted digestion of chromatin was performed by incubation of pAG-MNase at 4 °C for 1 h and activation by calcium chloride at 4 °C for 30 min. MNase cleavage was stopped by the addition of 1× stop buffer. Digested chromatin was allowed to diffuse out of cells at 37 °C for 20 min, and DNA was purified by Nucleospin gel and PCR clean-up spin column (Takara Bio). Sequencing libraries were prepared using the KAPA HTP Library Preparation Kit (KAPA Biosystems). Barcoded libraries were run on NovaSeq 6000 in a PE100 run. An average of 25 million reads were generated per sample.

Hi-C

Arima Hi-C kit (Arima Genomics) was used for all Hi-C experiments. Standard input protocol was followed with 3 × 10⁶ cells for each condition. Proximally ligated DNA was fragmented to around 400 bp by the Covaris E220 focused ultrasonicator (peak power = 140 W, duty factor = 10% and cycle per burst = 200) for 70 s. Fragmented DNA was size-selected to have a distribution between ~200 and 600 bp using AMPure XP beads (Beckman Coulter). Approximately 2 µg of size-selected DNA was used for library preparation using the KAPA Hyper Prep Kit (KAPA Biosystems) and TruSeq DNA Unique Dual Indexes (Illumina). The number of PCR cycles used for library amplification was determined by the KAPA library quantification kit for the Illumina platform (KAPA Biosystems). Each Hi-C library was sequenced to obtain 500 million reads PE100 with biological duplicates to achieve a total of 1 billion reads for each condition.

4C-seq

4C-seq was performed as per ref. 48 with modifications. In brief, 10 × 10⁶ cells were fixed with 1% formaldehyde for 10 min. Cross-linking was quenched with 200 mM glycine, and cells were lysed with NP-40 lysis buffer (10 mM Tris–HCl (pH 8.0), 10 mM NaCl, 0.2% NP-40 and 1× complete EDTA-free protease inhibitors (Roche)) for 20 min on ice. Nuclei were incubated with 0.5% SDS at 65 °C for 10 min followed by 1.25% Triton X-100 at 37 °C for 15 min. Nuclei were digested with 500 U EcoRI (ALKATI viewpoint) or NlaIII (ULK4 intron 31 viewpoint) overnight. qPCR was performed using primers that span the restriction enzyme cut site at the bait versus uncut control region to confirm the efficiency of restriction enzyme digestion to over 90%. Digested chromatin was ligated at 16 °C overnight, reverse cross-linked and purified. For efficient PCR amplification and next-generation sequencing, a second round of restriction enzyme digestion was performed using 10 µg of 3C product and 100 U of DpnII (for both ALKATI and ULK4 intron 31 viewpoints) at 37 °C for 5 h. Sample was ligated overnight at 16 °C, ethanol precipitated at −80 °C and purified with a QIAquick PCR purification column (Qiagen). Inverse PCR was performed using 4C_ALKATI_EcoRI or 4C_ULK4int31_NlaIII (reading) and 4C_ALKATI_DpnII or 4C_ULK4int31_DpnII (nonreading) primers with 2× Phusion high-fidelity PCR mix (NEB). Each 4C sample was amplified in four PCR reactions, each containing 200 ng of 4C DNA template for 18 cycles. A second round of PCR was performed to add P5 and P7 adapters as well as index sequences for multiplex sequencing. Double-sided size selection was performed using AMPure XP beads (Beckman Coulter) with 0.5× and 1× of the PCR volume. Sequencing was performed on an Illumina NovaSeq 6000 sequencer PE75. Each sample was sequenced to generate 10 million reads. All primers used for 4C-seq are listed in Supplementary Table 6.

RNA-seq analysis

RNA-seq reads were trimmed and filtered for quality using Trimmomatic (v0.38). Processed reads were then aligned to hg19 using STAR (v2.7.1a). For each GENCODE (v19) gene, reads were counted using featureCounts (v1.6.4). Gene counts were used to perform differential analysis using DESeq2 (v1.30.1). Transcripts per million (TPM) normalized bigWig tracks were generated using deepTools (v3.1.10 bamCoverage with ‘--normalizeUsing BPM’) for each replicate. BedGraph files from each replicate, converted from bigWig using bigWigToBedGraph, were combined using bedtools (v2.27.1) unionbedg, and scores were averaged for each condition. The averaged bedGraph files were converted to TPM-averaged bigWig tracks using bedGraphToBigWig (v4). TPM counts were generated with StringTie (v2.1.1) using BAM files and GENCODE (v19) gtf.

CAGE-seq analysis

CAGE-seq reads were analyzed based on the Nextflow CAGE-seq analysis pipeline63. Index sequences, EcoP15I recognition site and the first G following the EcoP15I site at the 5′-end from reads were removed using Cutadapt (v2.6). rRNA sequences were filtered out using SortMeRNA (v4.2.0), and filtered reads were aligned to the hg19 genome using STAR (v2.7.1a). At this step, two sets of BAM files were created for each sample: one with uniquely mapped reads (adding --outFilterMultimapNmax 1 to the STAR command) and one with multimapped reads (adding --outFilterMultimapNmax 10 to the STAR command). Unmapped reads from BAM files were removed. BAM files were converted to BED files with the summed up 1 bp unclustered CAGE tags using SAMtools (v1.14) and bamtobed (v2.26.0). We pooled CAGE tags across samples, filtered them by TPM threshold of 0.5 and then clustered them with paraclu64 (minValue of 30) to generate a BED file with the clustered CTSSs. Then, paraclu-cut with default settings (remove single-position clusters, clusters longer than 200 bp, clusters with maximum density/baseline density less than 2, and any cluster contained in a larger cluster) was used to filter out clusters. Promoter width was calculated by the distance between CAGE tags for each clustered CTSS with a 200 bp cutoff using CAGEr (v1.33.1). CAGEr analyzed promoter width across all samples by considering both the position and the CAGE tag signal at TSSs along the tag cluster. The width of every tag cluster is calculated based on the cumulative distribution of the CAGE signal along the cluster. CAGE tags were intersected with the clusters identified by paraclu, and a raw count table with columns for each sample and rows for each tag cluster was generated. DESeq2 (v1.30.1) was used to perform differential CAGE expression using the raw count table. CTSS clusters were annotated using R/Bioconductor packages ChIPseeker (v1.26.2) and AnnotationDbi (v1.52) using GENCODE (v19) gene annotations. 
The annotatePeak function from the ChIPseeker package was used to assign peaks to genomic annotation, including promoter (±1 kb around TSS), exon, 5′ UTR, 3′ UTR, intronic, intergenic and downstream (downstream of gene end), and calculate the distance of a peak to the nearest gene.
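The pooled-tag filtering step described above can be illustrated with a minimal sketch. It assumes that TPM here means tags per million of the pooled library and that positions at or above the 0.5 threshold are kept; the function names and the toy counts are hypothetical and are not taken from the pipeline itself.

```python
def tags_per_million(counts):
    """Convert raw CAGE tag counts per position into tags per million (TPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tags to normalize")
    return {pos: c * 1e6 / total for pos, c in counts.items()}


def filter_by_tpm(counts, threshold=0.5):
    """Keep positions whose pooled TPM is at or above the threshold (assumed >=)."""
    tpm = tags_per_million(counts)
    return {pos: counts[pos] for pos, value in tpm.items() if value >= threshold}


# Toy pooled counts keyed by (chromosome, 1-bp position, strand); values are invented.
pooled = {
    ("chr2", 100, "+"): 3,
    ("chr2", 200, "+"): 1,
    ("chr3", 5, "-"): 3_999_996,
}
kept = filter_by_tpm(pooled)
```

In this toy example the first position has a TPM of 0.75 and is retained, whereas the second has a TPM of 0.25 and is dropped before clustering.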

Correlation between RNA-seq and CAGE-seq

Spearman correlation was calculated between either log2(TPM) or log2(fold change) in RNA-seq for each gene and CAGE-seq for the corresponding promoter CTSS.
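For readers unfamiliar with the statistic, a minimal sketch of Spearman correlation is given below: both vectors are converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is taken. The function names and the paired values are illustrative assumptions; the software used for this calculation in the study is not specified here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented log2(fold change) values for four genes and their promoter CTSSs.
rna_lfc = [1.2, -0.5, 3.1, 0.0]
cage_lfc = [0.9, -0.7, 2.5, 0.1]
rho = spearman(rna_lfc, cage_lfc)
```

Because the statistic depends only on ranks, it is unchanged by monotonic transformations such as log2 applied to positive values.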

Repetitive region analysis

Multimapped CTSSs were compared to all the repeatmasker-annotated repetitive elements by the genomic coordinates downloaded from the UCSC browser (hg19). An overlap of at least 1 bp was used to transfer the names of repeat class, family and subfamily to each CTSS.
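The overlap rule can be sketched as follows, assuming BED-style half-open coordinates. The record layout, function names and coordinates are hypothetical, and the quadratic loop is for illustration only; a genome-scale analysis would normally use an interval-intersection tool.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least 1 bp."""
    return min(a_end, b_end) - max(a_start, b_start) >= 1


def annotate_ctss(ctss_list, repeats):
    """Attach (class, family, subfamily) of every overlapping repeat to each CTSS.

    ctss_list: iterable of (chrom, start, end)
    repeats:   iterable of (chrom, start, end, rep_class, rep_family, rep_subfamily)
    """
    annotated = {}
    for chrom, start, end in ctss_list:
        hits = [
            (r_class, r_family, r_sub)
            for r_chrom, r_start, r_end, r_class, r_family, r_sub in repeats
            if r_chrom == chrom and overlaps(start, end, r_start, r_end)
        ]
        annotated[(chrom, start, end)] = hits
    return annotated


# Toy coordinates; only the repeat names follow RepeatMasker-style naming.
repeats = [("chr2", 1000, 1400, "LTR", "ERVL", "LTR16B2")]
ctss = [("chr2", 1399, 1450), ("chr2", 1400, 1450)]
result = annotate_ctss(ctss, repeats)
```

The first toy CTSS shares exactly 1 bp with the repeat and inherits its annotation; the second starts where the repeat ends and is left unannotated.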

CUT&RUN analysis

All CUT&RUN data were processed as described for ChIP–seq to generate BAM files. Replicates were merged using the pysam Python module. The merged BAM files were then converted to bigWig files with deepTools (v3.1.1) bamCoverage (binSize as 10, reads per genomic content normalization and other parameters as default). Density plot was generated using deepTools as in ChIP–seq.

ChromHMM

ChromHMM (v1.24) was used to partition the human genome into 15 states based on the CUT&RUN-seq data from H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3 and H3K36me3, and ChIP–seq from CTCF and SMC1A. The two biological replicates were analyzed separately. The OverlapEnrichment function was used to determine enrichment of the upregulated and downregulated CTSSs in the 15 states.

CG content analysis

All CTSSs were expanded to 300 bp from the center and then used to count the percentages of C, G and CG in the sequences with both strands considered. Due to the short length of CTSSs, an expansion of 300 bp was made to get an accurate determination of the local CG composition. The analysis was also tested on 100-bp and 500-bp expansions and produced similar results.
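A minimal sketch of this calculation follows. It assumes that the 300-bp expansion means a 300-bp window centered on the CTSS midpoint and that the CG value is the fraction of dinucleotide positions occupied by CG; both readings, as well as the function names, are assumptions rather than the exact procedure used.

```python
def expand_from_center(start, end, width=300):
    """Return a window of the given width centered on the interval midpoint."""
    center = (start + end) // 2
    new_start = max(0, center - width // 2)
    return new_start, new_start + width


def cg_content(seq):
    """Percentages of C, G and CG dinucleotides in a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence too short")
    c_count = seq.count("C")
    g_count = seq.count("G")
    # Count CG dinucleotides by scanning every adjacent pair of bases.
    cg_count = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    return {
        "C_pct": 100.0 * c_count / n,
        "G_pct": 100.0 * g_count / n,
        "CG_dinuc_pct": 100.0 * cg_count / (n - 1),
    }


window = expand_from_center(1000, 1010)
stats = cg_content("CGCG")
```

On the complementary strand the C and G percentages simply swap, and the CG dinucleotide is its own reverse complement, so its count is identical on both strands.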

EM-seq and DNA methylation analysis

EM-sequencing reads were processed using the Bismark pipeline65. Raw reads with low quality (<20) and adapter sequences were removed by Trim Galore (v0.6.4). C(G) in the trimmed sequences was converted to T(A) and mapped to similarly converted reference human genome (hg19), lambda and pUC19 separately using default Bowtie2 settings implemented in Bismark. Duplicated reads were discarded. The remaining alignments were then used for cytosine methylation calling by the Bismark methylation extractor. Batch effects were adjusted using limma (v3.46.0). Any CpGs with coverage less than ten were removed before downstream analyses. After filtering, 969,316 CpG loci were recovered with an average coverage of 19× across all samples.
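The coverage filter can be sketched as below, assuming each CpG is summarized by its counts of methylated and unmethylated reads. The data layout, function name and example values are hypothetical and do not reproduce Bismark output formats.

```python
def filter_cpgs(calls, min_coverage=10):
    """Drop CpGs with coverage below min_coverage; return methylation fractions.

    calls: dict mapping a CpG locus to (methylated_reads, unmethylated_reads)
    """
    kept = {}
    for locus, (meth, unmeth) in calls.items():
        coverage = meth + unmeth
        if coverage >= min_coverage:
            kept[locus] = meth / coverage
    return kept


# Invented read counts for three CpG loci.
calls = {
    ("chr2", 29446000): (2, 18),   # coverage 20, kept
    ("chr2", 29446050): (4, 5),    # coverage 9, removed
    ("chr3", 41500000): (10, 0),   # coverage 10, kept
}
methylation = filter_cpgs(calls)
```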

HOMER motif analysis

HOMER motif analysis was performed to discover both de novo and known motifs using the ‘findMotifsGenome.pl’ script (v4.11) from the HOMER software suite. The analysis was configured with a size parameter of 200, using hg19 genome sequences. CTSS was used as the background.

4C-seq analysis

R1 was used for analysis. From the single-end reads, nonreading primers at the 3′ end were trimmed using Cutadapt (v3.7). A maximum error rate of 0.15 was allowed, and trimmed reads shorter than 40 bp were discarded. The 5′ end of the trimmed reads was further trimmed until the primary restriction enzyme site (RE1; EcoRI or NlaIII) using R. Only reads containing RE1 next to the viewpoint were considered for downstream analysis. Bowtie2 (v2.4.5) was used for aligning the trimmed reads to the human genome (hg19), while SAMtools (v1.14; parameter, -q 1 -bSu) was used to filter for high-quality uniquely mapped reads. Fragment maps of the hg19 genome were generated for each restriction enzyme pair. Blind fragments (which have RE1 on both ends) were excluded from the analysis. Aligned reads were assigned to overlapping fragment ends. Fragment read counts were depth normalized to per million reads on the cis chromosome where the viewpoint is located. The top three fragments with the highest read counts corresponding to the undigested and self-ligated fragments (usually the ones closest to the viewpoint) were excluded when calculating depth normalization. For smoothing, running mean across the fragment read counts was calculated using a sliding window sized 21 (averaging ten fragments left and right to the fragment of interest). For statistical analysis, mean normalized read counts were calculated for 25 kb (ALK intron 19) or 10 kb (ULK4 intron 31) bins with a 1 kb shift. With an assumption that the measured read counts follow a normal distribution for each bin, a two-tailed paired t test was done for each replicate to call significantly differential bins between conditions using a P-value cutoff of 0.05. All 4C-seq data were visualized using the Gviz (1.41.1)66 and ggplot2 (v3.4.4) packages in R.
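The normalization and smoothing steps can be illustrated in Python (the analysis itself was performed in R). This sketch assumes that the top three fragments are excluded only from the denominator of the depth normalization and that the running-mean window shrinks at the ends of the fragment list; both choices, and the function names, are assumptions.

```python
def normalize_per_million(counts, n_exclude=3):
    """Scale cis fragment counts to per-million, ignoring the n_exclude highest
    counts (assumed undigested or self-ligated) when computing the total."""
    top = sorted(counts, reverse=True)[:n_exclude]
    total = sum(counts) - sum(top)
    if total <= 0:
        raise ValueError("no reads left after excluding top fragments")
    return [c * 1e6 / total for c in counts]


def running_mean(values, window=21):
    """Centered running mean; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Invented fragment read counts on the cis chromosome, ordered by position.
raw = [1000, 900, 800, 50, 30, 20]
normalized = normalize_per_million(raw)
smoothed = running_mean([1, 2, 3, 4, 5], window=3)
```

With the default window of 21, each interior fragment is averaged with the ten fragments on either side, matching the window size stated above.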

FunSeq2 analysis

Normalized RNA-seq for TCGA–SKCM samples was downloaded from the National Cancer Institute Genomic Data Commons (NCI GDC) Data Portal. ALK expression was ranked, and the top ALK-expressing tumors were further examined (RNA-seq by expectation-maximization (RSEM) > 210). Tumors with ALK amplification were excluded (n = 3). Exon quantification data for ALK were analyzed, and high ALKATI-expressing tumors were defined as those with more than tenfold higher expression in ALK exons 20–29 than in exons 1–19 (ref. 32). The top 50 cases of high ALK-expressing tumors that met these criteria were assigned as high ALKATI-expressing tumors and compared to the bottom 50 cases with low/no ALKATI expression. FunSeq2 (ref. 49) analysis for NIPBL was run on default settings, and each mutation was assigned a FunSeq2 score of 1–5, with higher scores prioritizing somatic alterations with functional significance. A composite FunSeq2 score for NIPBL was calculated by combining the FunSeq2 scores for all NIPBL mutations in the same tumor sample. Tumor samples with no NIPBL mutation were assigned a FunSeq2 score of 0.
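The two per-sample calculations above can be sketched as follows. The sketch assumes that 'combining' means summing the per-mutation scores and that exon groups are compared by their mean expression; neither detail is stated explicitly, and all names and values are hypothetical.

```python
def composite_scores(mutation_scores, samples):
    """Sum FunSeq2 scores of all NIPBL mutations per tumor (summation is an
    assumption); tumors without a NIPBL mutation keep a score of 0."""
    scores = {sample: 0.0 for sample in samples}
    for sample, score in mutation_scores:
        scores[sample] = scores.get(sample, 0.0) + score
    return scores


def is_alkati_high(exon_expr, fold=10.0):
    """True if mean expression of ALK exons 20-29 exceeds `fold` times the mean
    of exons 1-19 (comparison of means is an assumption)."""
    upstream = [exon_expr[e] for e in range(1, 20)]
    downstream = [exon_expr[e] for e in range(20, 30)]
    mean_up = sum(upstream) / len(upstream)
    mean_down = sum(downstream) / len(downstream)
    if mean_up == 0:
        return mean_down > 0
    return mean_down / mean_up > fold


# Invented sample identifiers, scores and exon-level expression values.
samples = ["tumor_A", "tumor_B", "tumor_C"]
mutations = [("tumor_A", 2.0), ("tumor_A", 4.0), ("tumor_C", 1.0)]
scores = composite_scores(mutations, samples)

exons = {e: 1.0 for e in range(1, 20)}
exons.update({e: 50.0 for e in range(20, 30)})
high = is_alkati_high(exons)
```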

Hi-C and HiChIP analysis

Generation of valid paired-end reads

Hi-C sequencing files were processed with the HiC-Pro pipeline (v3.0.0). In silico digestion of the hg19 genome by ‘Arima’ restriction enzymes was performed with the ‘digest_genome.py’ tool from HiC-Pro, and bed files generated were used for assigning mapped reads to DNA fragments. All aligned, filtered and deduplicated read pairs from each replicate were used to calculate cis valid paired-end read distance after normalizing for sequencing depth in R.

Preprocessing of Hi-C and HiChIP data

Hi-C or HiChIP reads were aligned to the hg19 genome using the Juicer pipeline (v1.7.6) with default settings. MAPQ30-thresholded .hic files were used for the subsequent analyses.

Calling significant and differential loops

HiC-DC+67 was run on both pooled and individual replicates of Hi-C and HiChIP data to find significant interactions at 5, 10 or 25 kb resolutions. TSS interactions were defined as the ones with one anchor overlapping a promoter. HiC-DC+ differential interaction calling (function hicdcdiff) was also performed at different resolutions for pairwise comparison. HiC-DC+ normalized scores (observed over expected counts) were used to show Hi-C interactions in plots. To generate H3K27ac HiChIP hotspots, virtual 4C plots were calculated as the maximum −log10(P) of the interactions of regions with the ALKATI altP in intron 19 as anchor.
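The hotspot score described in the last sentence can be sketched as a per-region maximum of −log10(P) over all interactions that involve the anchor. The tuple layout, region labels and P values below are hypothetical and do not reflect the HiC-DC+ output format.

```python
import math


def virtual_4c(interactions, anchor):
    """Return, for every region interacting with `anchor`, the maximum
    -log10(P) across all reported interactions between the two."""
    profile = {}
    for region_a, region_b, p_value in interactions:
        if anchor not in (region_a, region_b):
            continue
        other = region_b if region_a == anchor else region_a
        # Floor the P value to avoid log10(0) for extremely small values.
        score = -math.log10(max(p_value, 1e-300))
        profile[other] = max(profile.get(other, 0.0), score)
    return profile


# Invented interactions; "altP" stands for the bin containing the viewpoint.
interactions = [
    ("altP", "enhancer_1", 1e-5),
    ("enhancer_1", "altP", 1e-8),
    ("altP", "enhancer_2", 0.01),
    ("enhancer_2", "enhancer_3", 1e-20),
]
hotspots = virtual_4c(interactions, "altP")
```

Interactions that do not involve the anchor, such as the last toy entry, contribute nothing to the profile.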

TAD annotations

TADs were annotated using OnTAD (v1.3)23 on ICE-normalized and MAPQ30 threshold Hi-C data at 10 kb resolution. A penalty score of 0.1 and a maximum TAD size of 2 Mbp were set while running OnTAD. Singleton TADs have no inner sub-TADs. TADs with nested sub-TADs were assigned a hierarchical TAD level according to the highest level of sub-TADs contained within a nested TAD.
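To make the notion of hierarchy level concrete, the sketch below computes nesting depth from a list of TAD intervals: a singleton has depth 0, and a TAD containing sub-TADs has a depth one greater than its deepest sub-TAD. This is a toy stand-in written for illustration, not the level assignment produced by OnTAD, and the coordinates are invented.

```python
def tad_hierarchy_depth(tads):
    """Map each (start, end) TAD to its nesting depth (0 = singleton)."""
    depth = {}
    # Process smaller TADs first so every sub-TAD is scored before its parent.
    for tad in sorted(tads, key=lambda t: t[1] - t[0]):
        inner = [
            depth[other]
            for other in depth
            if other != tad and tad[0] <= other[0] and other[1] <= tad[1]
        ]
        depth[tad] = 1 + max(inner) if inner else 0
    return depth


# Invented TAD coordinates on a single chromosome.
tads = [(0, 100), (0, 50), (50, 100), (10, 30), (200, 300)]
levels = tad_hierarchy_depth(tads)
```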

Each TAD was annotated with the overlapping CAGE-seq CTSSs. The relative CTSS distance to TAD boundaries was calculated by dividing the distance of the CTSS to the nearest TAD boundary by the length of the respective TAD.
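The relative distance defined above can be written as a short function; the function name and the coordinates in the example are illustrative only.

```python
def relative_boundary_distance(ctss_pos, tad_start, tad_end):
    """Distance from a CTSS to the nearest TAD boundary, divided by TAD length."""
    if not tad_start <= ctss_pos <= tad_end:
        raise ValueError("CTSS lies outside the TAD")
    nearest = min(ctss_pos - tad_start, tad_end - ctss_pos)
    return nearest / (tad_end - tad_start)


# Invented coordinates: a 100-kb TAD with CTSSs near the edge and at the center.
near_edge = relative_boundary_distance(110_000, 100_000, 200_000)
at_center = relative_boundary_distance(150_000, 100_000, 200_000)
```

By construction the value ranges from 0 for a CTSS on a boundary to 0.5 for a CTSS at the TAD center.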

The insulation score for each TAD was obtained from OnTAD outputs.

Statistics and reproducibility

Experiments were reproduced two to four times as indicated in the figure legends. No blinding, randomization or predetermined sample size calculations were used. Three tumors with ALK amplification in the TCGA–SKCM dataset were excluded from FunSeq2 analysis. Data distributions were assumed to be normal, but this was not formally tested. All statistical tests were generated using GraphPad Prism (9.2.0) and are detailed in the figure legends. Significance levels are defined as follows: not significant, P ≥ 0.05; *P < 0.05; **P < 0.01; ***P < 0.001 and ****P < 0.0001.
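The significance thresholds listed above map onto a simple lookup, sketched here for convenience; the function name and the "ns" label are illustrative choices.

```python
def significance_label(p_value):
    """Translate a P value into the asterisk notation used in the figures."""
    if p_value < 0.0001:
        return "****"
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"


labels = [significance_label(p) for p in (0.2, 0.03, 0.004, 0.0005, 0.00001)]
```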

Additional experimental details are available in Supplementary Note 10.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.