Main

Throughout evolution, hosts and transposable elements (TEs) have coevolved, engaging in an evolutionary ‘arms race’ between TE invasion within host genomes and the defensive mechanisms aimed at limiting their expansion1,2,3. Novel TE insertions can produce multiple deleterious effects on the host cell, such as promoting genomic instability through insertion and recombination, producing nucleic acids or proteins that are toxic, disrupting gene expression through insertion in genic sequence or regulatory elements and altering gene expression through the TE’s own regulatory elements2,4. Thus, hosts have developed a wide variety of mechanisms to silence TEs. Epigenetic regulation by DNA methylation is one of the most common mechanisms used by host genomes to suppress TE expression and activity. A second layer of silencing is mediated by trimethylation of histone 3 lysine 9 (H3K9me3)5, which is generally controlled by histone methyltransferase complex SETDB1 (refs. 6,7). In mammals, this complex is commonly targeted to TE loci by a family of transcription factors (TFs) known as KRAB-containing zinc-finger proteins8. While these and a few other cell-type-specific mechanisms9 limit deleterious TE activity in most organisms, there are many examples of mutualistic events where TEs have important roles in host biology. TEs are particularly active in early development, where they have been shown to help drive zygotic gene activation across multiple organisms. High expression levels in these early developmental stages are likely to promote inclusion of new TE copies in the germline and, thus, vertical inheritance3. Once inherited, TE-derived genomic sequences can be expressed as distinct transcription units or function as new noncanonical regulatory elements such as enhancers and promoters for host genes10,11,12,13,14. Through this process, referred to as co-option, exaptation or domestication15, TEs can confer a positive effect on organism fitness. Domestication of TEs also involve the possibility of a TE-derived sequence to become part of a host mRNA, generating novel chimeric host–viral transcripts16,17. This process is referred to as TE exonization. Many previous attempts have been made to characterize TE exonization12,16,18, even recently19,20,21, but such efforts had the limitations of characterizing this process in unique conditions and using primarily short-read sequencing. More importantly, the mechanism and regulatory events of how TE exonization originate TE-chimeric genes (TE-chimeras) is poorly defined. Here, we provide a global view of this process by (1) defining TE exonization events with long-read sequencing in human and rodent cells, tissues and organs; (2) characterizing the evolutionary history of exonization and its impact on human genetic variation (health and disease states); (3) discovering an RNA degradation-dependent and a splicing-dependent mechanism that prevents TE-chimera expression; and (4) showing how these two mechanisms converge in stem cells to prevent TE-induced enhancement of cellular potency.

Results

Cartography of TE-derived genes in mouse and human

We and others have benchmarked the combined usage of long-read and short-read sequencing to analyze the complexity of transcriptomes22,23. To identify TE exonization events, we performed isoform-resolved long-read single-molecule real-time sequencing (PacBio Iso-seq) in mouse embryonic stem cells (mES cells) and epiblast-like cells (EpiLCs) and analyzed the data in combination with paired short-read RNA sequencing (RNA-seq)23. Because of the complexity of the data, we used a top-down approach in which we first used our Iso-seq and RNA-seq datasets to assemble an isoform-resolved reference of the transcriptome, wherein isoform diversity is annotated with full-length reads and high-confidence splice junctions. We then scanned detected transcripts for TE exonization events, novel gene loci and valid open reading frames (ORFs).

An overview of the strategy is shown (Extended Data Fig. 1a). By contrasting our novel isoform-resolved reference transcriptome with the mm10 Ensembl transcriptome release (v102) for the mouse genome assembly, we identified 5,807 novel RNA species comprising both coding and noncoding transcripts (Fig. 1a). These include both novel isoforms of known genes and novel isoforms transcribed from novel genes antisense to known genes (novel antisense genes) or in loci that do not overlap any known gene (novel intergenic genes). By cross-referencing the isoform information with the genomic locations of integrated TEs, we then identified all instances of TE exonization in our novel transcriptome. We found that, despite the fact that short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs) both outnumber long terminal repeat-containing TEs (LTRs) in terms of genomic copies (Extended Data Fig. 1b,c), exonization events are predominantly driven by LTRs and located at the 5′ end of a given transcript compared to internal or 3′ end terminal locations (Fig. 1b,c). This suggests a unique cis-regulatory potential of TEs that can directly drive the transcription and exonization into a downstream region. Approximately half of these, hereafter called TE-chimeras, are novel and not annotated in the current mm10 Ensembl transcriptome (Extended Data Fig. 1d), indicating that many 5′-TE exonization events have been missed because of a lack of technical resolution that is now provided by applying a mixed-read sequencing.

Fig. 1: Cartography of TE-derived genes in mouse and human.
figure 1

a, Classification of novel isoforms found in WT mES cell Iso-seq dataset based on known isoforms and protein-coding predictions. Classifications were generated with SQANTI3 (ref. 44) and are based on the mm10 Ensembl transcriptome. Left, depictions of the five classes of novel transcript are shown: ‘novel not in catalog’ transcripts, novel exon combinations with at least one novel exon; ‘novel in catalog’ transcripts, novel exon combinations of known exons; ‘intergenic’ transcripts, novel transcripts that do not overlap known transcripts; ‘antisense’ transcripts, transcripts that overlap known transcripts but on the antisense strand; ‘incomplete splice match’ transcripts, novel transcripts that are fragments of known transcripts. Middle, bar graph depicting the amount of each class found, with n denoting the total number of transcripts within each category. Right, coding predictions generated with CPAT45, using the default threshold of 0.44 to indicate the presence of a valid ORF. Box plots indicate the median, upper quartile and lower quartile for each distribution. Outliers are indicated with dots (outside 1.5× the interquartile range from the upper and lower quartiles). b, Relative position within a transcript of TE sequences in WT mES cell isoforms. Only the top ten TEs in each class are listed. Bars are colored by the proportions of different locations of the exonization events: 5′, internal and 3′ exonizations. c, Relative location of TE sequences in WT mES cell isoforms within a transcript in exonized chimeras, sorted by TE class. d, Change over time of TE-chimera expression in the indicated anatomical region in mice. Bar graphs indicate the gain and loss of individual 5′ chimeras from one time point to the next. Line graphs indicate the total number of chimeras in each time point. e, Proportion of TE-chimeras in human and mouse belonging to each TE class found in organogenesis dataset. f, Distribution of number of tissues with an active TE-promoter in organogenesis dataset. g, Distribution of tissue specificity of active TE-promoters in organogenesis dataset. h, Heat map displaying the relative promoter activity of TE-chimeric promoters across all RNA-seq samples spanning multiple organs and developmental time points in human organogenesis. i, Genome browser snapshots displaying representative short-read bulk RNA-seq pileups for ovary (up) and liver (bottom) samples, alongside host and TE-chimeric promoters, their respective isoforms and the genomic location of TEs.

We then applied the same mixed-read sequencing strategy to mouse organs at different time points of development and generated a compendium of mouse TE-chimeras (Extended Data Fig. 1e,f and Supplementary Table 1). The resulting temporal analysis of TE-chimeras in aged tissue samples indicated notable differences in class of TE used (Fig. 1d and Extended Data Fig. 1f), whereby SINEs were the most co-opted in brain structures analyzed while LTRs were the top class in nonbrain samples. The temporal analysis also revealed the identity of TE-chimera across tissues and time points (Supplementary Table 2). These results suggest that different TEs drive the expression of TE-chimeras that changes between tissues and across time.

We then leveraged our combined long-transcript and short-transcript annotations and sought to determine the extent to which TE-chimera expression changes during organogenesis in human tissues using public data. To filter out any potential false positives caused by the repetitive nature of TEs, we used tissue-specific transcriptome assembly along with proActiv24, which uses spliced reads to accurately quantify promoter activity both absolutely and relatively to the other promoters driving canonical isoforms. We filtered promoters within TE regions (TE-promoters) using both metrics to determine that the TE was both actively promoting transcription and the resulting TE-chimeras made up a sizable proportion of the corresponding gene’s expression (Methods).

Our comparative (human versus mouse) analysis of seven organs (forebrain, hindbrain, heart, kidneys, liver, testis and ovary) at multiple time points during fetal development and aging25,26 uncovered 2,419 and 1,107 TE-promoters in the human and mouse datasets, respectively (Extended Data Fig. 1g). Each TE-promoter gives rise to, on average, two TE-chimeras (Extended Data Fig. 1h,i). We found that, overall, LTRs were responsible for generating the most TE-chimeras, followed by LINEs and SINEs (Fig. 1e). LTRs were overrepresented among TE-promoters expressed in only 1–3 tissues, while SINEs were the most abundant across multiple tissues (Extended Data Fig. 1j). We also found that the distributions of TE-chimeras were remarkably tissue specific between human and mouse data, as over 50% of TE elements drove robust chimeric expression in a single tissue in both mouse and human datasets (Fig. 1f and Extended Data Fig. 2a). In both species, the testis possessed the highest TE-promoter activity, most of which was not found in other tissues (Fig. 1g). Similarly, in both species, the liver possessed the second largest number (Fig. 1g). By investigating the protein-coding potential of the TE-chimeras, we found that over 80% were noncoding in both the human and the mouse transcriptome (Extended Data Fig. 2b). The noncoding nature of these TE-chimeras was broadly shared across chimeras generated from different TE classes and expressed in different tissues (Extended Data Fig. 2b–e). Interestingly, TE-chimeras expressed in the mouse brain displayed higher protein-coding probabilities than those expressed in other organs (Extended Data Fig. 2e). We identified a class of organogenesis TE-promoters that were broadly expressed in multiple tissues; thus, we investigated whether they were shared between homologous genes in human and mouse. We found little mouse–human overlap in organogenesis (Extended Data Fig. 2f), while orthologous genes in embryogenesis were more prevalent (Extended Data Fig. 2g,h).

We then investigated how TE-promoter activity varies across organogenesis and postnatal development and found that, broadly, the activity was higher in postnatal samples when compared to their fetal counterparts (Extended Data Fig. 2i). We additionally clustered TE-promoters by their relative activity across individual samples spanning multiple fetal and postnatal stages, identifying both commonly activated TE-promoters and clusters of TE-promoters displaying tissue-specific activities across stages and organisms (Fig. 1h and Extended Data Fig. 2j). Representative TE-chimeras specifically expressed in either fetal or postnatal development are shown in Fig. 1i. Overall, we found that TE-chimeras comprise both tissue-specific and common transcriptional events, with organ expression patterns that are shared between human and mouse.

TE-chimeras in health and disease

Our next goal was to evaluate the regulation of TE-chimera expression across human variation and disease. We focused these analyses on the ~900 individuals and 37 tissues available in the Genotype-Tissue Expression project (GTEx)27. We identified 739 ‘high-confidence’ TE-chimeras (Supplementary Table 3) by leveraging the combined annotations of long-read and short-read sequencing and filtering on the basis of the activity of known TE-promoters and presence in at least 80 individuals (Methods). Identification of transcripts informed by the isoform-resolved data expanded the number of TE-chimeras detected by 26% (Extended Data Fig. 3a,b). Most of the TE-chimeras found were derived from LTRs (Extended Data Fig. 3c), where highly expressed transcripts showed tissue specificity (Extended Data Fig. 3d). TE-chimera expression varied significantly across organs and the most highly expressed TE-chimeras were observed in the muscle and testis (Fig. 2a). Aging appeared to regulate TE-chimeras, where expression in whole blood and brain regions decreased in older individuals (>50 years old) but was increased in peripheral tissues (Fig. 2b).

Fig. 2: TE-chimera expression across human variation and disease.
figure 2

a, Heat map of class of TE (y axis) across all human tissues (x axis), showing the number of TE-chimeras passing detection thresholds (top; Methods) and the mean expression (TPM; middle). Bottom, bar chart showing the total number of TE-chimeras per tissue (summed across individuals). b, Individuals in GTEx were binned into age categories (over or under 50 years) and Wilcoxon signed-rank test P values based on comparisons were calculated using a two-sided test and adjusted by the FDR statistic (accounting for the number of organs). c, Expression (TPM; x axis) of either the TE-chimeric isoform ENST00000426261.6 or all other transcripts (mean TPM) corresponding to LINC02693 in aorta (lincRNA). The top ten expressors in each category are colored blue or red, where none overlap between the two. The bounds of each box indicate the 25th and 75th percentiles and the whiskers extend from the minimum (0 for both lincRNA and TE-chimera) to maximum (2.49 for lincRNA and 2.72 for TE-chimera). The centers of the boxes are 0.51 and 0.78, respectively. d, Scatter plot of expression (TPM) for the same TE-chimera ENST00000426261.6 versus LINC02693, as in c. The P value was determined using two-sided Student’s regression. e, The top pathways based on GSEA in the Gene Ontology database for either LINC02693 (right) or ENST00000426261.6 (left), where the x axis shows the distribution of the bicor correlation coefficient. f, LTR exonizations were analyzed in TCGA in terms of their differential expression in tumor versus surrounding tissue (top) and survival prediction in tumor tissue (bottom). Color scales reflect the average log2 fold change (log2FC) in tumor versus surrounding tissue (purple–green) or whether the relative expression of LTR exonizations per individual showed increased or decreased survival prediction (blue–orange). Specifically, an orange tone shows instances where individuals with a high relative expression of LTR exonizations showed decreased survival (compared to low expressors). *P < 0.05, **P < 0.01 and ***P < 0.001 corresponding to a differential expression P value (tumor versus surrounding tissue) or log-rank test P value (survival). g, Top three pathways from Gene Ontology overrepresentation tests corresponding to the LTR-exonized genes that were significantly changed (adjusted P < 0.1) in oxaliplatin resistance32. Top three pathways from Gene Ontology overrepresentation tests corresponding to the LTR-exonized genes that were significantly changed (adjusted P < 0.1) in anti-PDL1 resistance31. P values were calculated using a nonparametric permutation test to build a null distribution of the enrichment score. From the distribution of P values, FDR corrections were made using the Benjamini–Hochberg method, resulting in adjusted P values.

Next, we reasoned that examination of variation across populations could provide suggestive functional roles of TE-chimeras. For example, a TE-chimera (ENST00000426261.6) showed a significant anticorrelation with its cognate transcript (LINC02693), an aortic long intergenic noncoding RNA (lincRNA) with no known function (Fig. 2c,d). The genetically coregulated pathways correlated with each isoform, further suggesting a suppressive role of the TE-chimera (Fig. 2e).

Given the recent implications for TE-chimera in disease28,29, we analyzed variation in The Cancer Genome Atlas (TCGA)30. This analysis showed that LTR-chimeras were almost uniformly upregulated in tumors when compared to surrounding tissue and that elevated tumor expression predicted poorer survival (Fig. 2f). We note that, while many significant relationships were observed in terms of survival prediction, these statistics were subjective to sample size and variation in group measures. As these data showed a uniform pattern of exonization associated with cancer progression, we hypothesized that exonization events could be involved in cancer drug resistance mechanisms. We analyzed RNA-seq data collected from in anti-PDL therapies31 and oxaliplatin resistance32. Comparison of resistant versus responding individuals showed LTR exonization events in pathways pertinent to mechanism of drug actions (Fig. 2g and Extended Data Fig. 4a), suggesting that exonization could be involved in treatment resistance. In sum, these data highlight specific patterns of tissue-specific promoter TE-chimeras and highlight how LTR exonization changes with age, cancer progression, survival and treatment.

Transcription-coupled RNA degradation suppresses TE-chimeras

We and others have shown that RNA decay can suppress TE expression33,34. We, thus, hypothesized that RNA surveillance could also control TE-chimera expression. We performed long-read RNA-seq in wild-type (WT) mES cells and mES cells conditionally ablated for Exosc3, a subunit of the RNA exosome complex. Exosc3 is essential for RNA exosome-mediated RNA degradation35. This tamoxifen-induced conditional knockout (cKO) model (Extended Data Fig. 5a) enabled a quantitative and highly resolved characterization of the interplay between chimeric RNA anabolism and catabolism. TE-promoters identified by our mixed-sequencing analysis generate many coding and noncoding TE-chimeras predominantly from LTR (Fig. 3a and Extended Data Fig. 5b–e) that are more active in Exosc3 cKO compared to other TE and host promoters (Fig. 3b and Extended Data Fig. 5f). Among LTRs, ERVK, ERVL and ERVL-MaLR generate most of the TE-chimeras that are suppressed by RNA exosome (Fig. 3c,d and Extended Data Fig. 5g). These results suggest that RNA degradation controls LTR-chimeras at the level of transcription, which is initiated at and extends past the 3′ ends of LTR elements and can be subject to splicing, and/or at the level of RNA stability.

Fig. 3: Transcription-coupled RNA degradation suppresses TE-chimeras.
figure 3

a, Bar plots displaying number of TE-derived promoters, grouped by TE class. b, Scatter plots displaying promoter activities in WT and Exosc3 cKO, grouped by promoter type. c, Bar plots displaying number of LTR-derived promoters, grouped by LTR family. d, Box plot displaying promoter activities in WT and Exosc3 cKO, grouped by LTR family (n = 25, 142, 500 and 168 for ERV1, ERVK, ERVL and ERVL-MaLR, respectively). Gypsy was discarded because of a small number of promoters detected in c. The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (P = 1.336401 × 10−1, 3.122109 × 10−10, 4.482375 × 10−81 and 6.809815 × 10−13 for ERV1, ERVK, ERVL and ERVL-MaLR, respectively). e, Pie charts displaying proportions of LTR promoters grouped by differential RNAPII enrichment in Exosc3 cKO relative to WT. f,g, Genome browser snapshots displaying RNAPII (8WG16) ChIP-seq and short-read total input RNA-seq pileups for WT and Exosc3 cKO, alongside known and novel isoforms identified from the PacBio Iso-seq and the genomic location of TEs in the mouse mm10 genome. A novel isoform for Nelfa (f) and a novel gene (g) derived from LTRs and upregulation in Exosc3 cKO are shown. h, Strand-specific metagene plots of stable RNA (from RNA-seq) (left) and nascent RNA (from metabolic labeling) (right) across chimeric LTRs, grouped by differential RNAPII enrichment state upon Exosc3 cKO. The s.e.m. is shown as a shaded area around the mean curve.

To discriminate between these two events, we stratified host and LTR promoters by RNA polymerase II (RNAPII) level in WT and Exosc3 cKO. We used the previous RNAPII ChIP-seq data23 generated with the 8WG16 antibody that mainly targets initiating RNAPII (Fig. 3e). Although these datasets were obtained without spike-in normalization and, thus, cell–cell comparison can be suboptimal, we saw that about 80% of host promoters retained comparable levels of RNAPII in Exosc3 cKO, ~18% had increased RNAPII deposition and ~1.4% had lower levels in Exosc3 cKO. By contrast, around 52% of LTR promoters displayed a significant increase in RNAPII levels. An increase in RNAPII was associated with an increase in promoter activity (Extended Data Fig. 5h), suggesting that increased transcription initiation leads to induction of LTR-chimera expression in Exosc3 cKO.

ERVL accounts for the largest portion of the LTR promoters with more RNAPII enrichment (Extended Data Fig. 5i), as shown in the representative LTR-chimera from the Nelfa gene derived from murine endogenous retrovirus with leucine primer (MERVL) (Fig. 3f). We also identified a second class of LTR promoters, which account for ~48% of LTR promoters (Fig. 3e). This class is less dominated by ERVLs (Extended Data Fig. 5i) and had a significantly higher promoter activity in the Exosc3 cKO despite RNAPII levels being comparable (Extended Data Fig. 5h). This result suggests that increase of these LTR-chimeras in the Exosc3 cKO is a result of RNA stabilization, as shown in a representative novel gene driven by RLTR13C3 (Fig. 3g).

To further investigate the relationship between RNA degradation and nascent transcription, we reanalyzed our nascent RNA-seq (metabolic labeling) dataset23 focusing on LTRs dynamics. Our results indicate that (1) only LTR promoters with more RNAPII enrichment show an increase in nascent transcripts upon Exosc3 cKO (Fig. 3h, right) and (2) LTR promoters with equal level of RNAPII display lower level of stable RNA expression in WT compared to Exosc3 cKO (Fig. 3h, left), despite equal nascent transcription (Fig. 3h, right). These data support our hypothesis that LTR promoters with more RNAPII enrichment are regulated by transactivation and LTR promoters with equal RNAPII enrichment in WT and Exosc3 cKO are regulated by RNA degradation.

We then used Hi-C data in WT mES cells36 to map all LTR elements with respect to their topological position within the nucleus. Our data indicate that genomic bins including chimeric LTRs were more frequently positioned within the A compartment (positive PC1 values) compared to bins lacking chimeric LTRs (Extended Data Fig. 5j). Higher relative enrichment of chimeric LTRs was also found to be correlated with a stronger association within the A compartment (Extended Data Fig. 5k).

Position-dependent LTR functionalization

We reason that, because of their repetitive and multicopy nature in the genome, LTRs that generate chimeras must carry distinct features compared to LTRs that do not generate chimeras. We first examined the distinctive sequence features of promoter-proximal splice donor (SD) motif and polyadenylation site (PAS), which are closely embedded (SD-PAS) in many LTRs (Fig. 4a). By contrast, the presence of SD-PAS motifs near transcription start sites (TSSs) is a unique genomic architecture that is rarely found in host genes (Fig. 4b), as it signals for competing activities. We cloned a representative LTR, MT2_Mm (the long terminal repeat regions of MERVL elements), and analyzed whether this configuration is prone to transcription or early termination. RNA pulldown using the MT2_Mm canonical sequence (MT2 WT) indicated that this element, unlike a point mutant control (MT2 Cmt), is recognized by cleavage and polyadenylation machinery (Fig. 4c,d). PAS recognition in the promoter-proximal region is known to induce premature RNAPII termination37,38. This result is consistent with the fact that most LTR regions do not act as LTR promoters in the genome despite the high proportion of SD-PAS occurrence (Extended Data Fig. 6a) and provides evidence that sequence features alone cannot explain the biogenesis of LTR-chimeras.

Fig. 4: Position-dependent control of TE-chimera.
figure 4

a, Bar plot displaying proportions of top 16 TE elements with SD-PAS motifs within 10 bp of each other. b, Bar plot displaying proportions of TSS having SD-PAS motifs within 10 bp of each other, grouped by repeat-derived versus not. Occurrences were found between TSS and 150 bp downstream. c, Consensus MT2_Mm sequence having a 5′ SD with UGUA CFIm-binding site and poly(A) site. The poly(A) site mutation for RNA pulldown assay is indicated. d, Western blot analysis of RNA pulldown assays using the MT2_Mm canonical sequence (MT2 WT) or a point mutant control (MT2 Cmt). The experiment was performed with one biological replicate. e, Diagram displaying gene–LTR pairs: (1) genes and their overlapping intragenic LTRs and (2) genes and their nearest promoter-proximal upstream intergenic LTRs. f, Box plots displaying normalized expressions in WT for genes overlapping with intragenic LTR (n = 12,657 nonchimeric and 214 chimeric) or in proximity of promoter-proximal upstream intergenic LTRs (n = 11,801 nonchimeric and 67 chimeric), grouped by chimeric status. The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks and P values were calculated on the basis of an unpaired, two-sided Wilcoxon rank-sum test (P = 1.295968 × 10−51 for intragenic and 5.243340 × 10−25 for intergenic cases). g, Density plots displaying log2 fold changes of normalized expressions of genes overlapping with intragenic LTRs or in proximity of promoter-proximal upstream intergenic LTRs in Exosc3 cKO compared to WT, grouped by chimeric status. P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test. h, Box plot displaying average promoter-proximal antisense coverage from RNA-seq between genes and its nearest promoter-proximal upstream intergenic LTR in WT, grouped by chimeric status (n = 11,532 nonchimeric and 60 chimeric). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks and P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test (P = 8.133342 × 10−12). i, Density plot displaying log2 fold changes of average promoter-proximal antisense coverage from RNA-seq between gene and its nearest promoter-proximal upstream intergenic LTR in Exosc3 cKO compared to WT, grouped by chimeric status. P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test. j, Strand-specific gene track displaying RNA-seq pileups across splice junctions for WT and Exosc3 cKO, alongside promoter-proximal antisense transcripts (gray), known genes in the mm10 Ensembl transcriptome (black), the genomic location of chimeric TEs (red) and known and novel isoforms identified from the PacBio Iso-seq data (blue and orange, respectively). k, Box plots displaying input-normalized average coverage of H3K9me3 (left) and average DNA methylation (right) in WT and Exosc3 cKO across chimeric LTRs defined in e (n = 286). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Outliers beyond these limits were removed for H3K9me3 and no outliers were observed for DNA methylation. P values were calculated using a paired, one-sided (left-sided) Wilcoxon signed-rank test (comparing Exosc3 cKO relative to WT; P = 0.9871791 for H3K9me3 and 0.999983 for DNA methylation). l, Model displaying cis-regulatory function of intragenic chimeric LTRs (top) and promoter-proximal upstream intergenic chimeric LTRs (bottom) and its dependency on genomic location and RNA exosome activity.

Source data

We then investigated the positional information of LTRs that give rise to chimeras (chimeric LTRs) and LTRs that do not (nonchimeric LTRs). Notably, we found that ~70% of chimeric LTRs were located proximal to promoters or inside genes, while ~59% of nonchimeric LTRs were located in distal intergenic regions (Extended Data Fig. 6b). We hypothesized that LTRs give rise to chimeras because they are in a genomic context that makes them prone to being transcribed. We further hypothesized that chimeric LTRs are unsilenced by the act of transcription occurring in their proximity, specifically transcription of (1) the endogenous gene from which the TE-chimera derives or (2) promoter-proximal antisense transcripts. To test this, we paired endogenous genes with (1) their overlapping intragenic LTRs or (2) their nearest promoter-proximal upstream intergenic LTRs (Fig. 4e and Methods). We found that genes that pair with chimeric LTRs are not only more highly expressed in WT (Fig. 4f) but also more upregulated in Exosc3 cKO compared to genes that pair with nonchimeric LTRs (Fig. 4g), indicating that genes close to chimeric LTRs are more transcriptionally active and regulated by RNA exosome.

We next reasoned that promoter-proximal antisense transcripts could provide a more transcriptionally amenable environment, thereby allowing chimera generation from promoter-proximal intergenic LTRs. To answer this question, we next quantified the antisense signal between genes and their nearest promoter-proximal upstream intergenic LTRs. Consistent with a higher transcriptional activity of genes paired with chimeric LTR, we found a higher antisense signal between a given gene and its nearest chimeric LTRs in both RNA-seq (Fig. 4h) and nascent RNA-seq (metabolic labeling; Extended Data Fig. 6c) in WT, compared to genes and their nearest nonchimeric LTRs counterparts. Furthermore, Exosc3 cKO caused an increase in stable and nascent antisense RNAs (Fig. 4i and Extended Data Fig. 6d), supporting the idea that transcription and possibly a cis-effect of the stabilized nascent transcript promote LTR-chimera expression. Overall, these results correlate promoter-proximal antisense RNA and cis-regulatory activity of chimeric LTRs, as shown in a representative strand-specific gene track where we detected an increase in promoter-proximal antisense RNA spanning a chimeric LTR (ORR1A2) along with increased expression of novel LTR-derived isoforms in the sense strand in Exosc3 cKO (Fig. 4j). Lastly, reanalysis of our previous dataset23 demonstrated that Exosc3 cKO does not show a reduction in the level of H3K9me3 (with the caveat that spike-in normalization was not performed) or DNA methylation across chimeric LTR compared to WT, indicating that upregulation of chimeric LTRs expression is not caused by the loss of these epigenetic marks (Fig. 4k). Taken together, we propose that transcriptional and RNA degradative activity near LTR elements guide LTR functionalization (Fig. 4l).

Inhibition of RNA degradation and splicing promotes TE functionalization

As the RNA exosome is targeted to its RNA substrates by different cofactors39,40,41, we addressed their role in both controlling TE expression and TE functionalization in chimeric transcripts. We performed loss-of-function experiments (Wdr82 knockdown (KD); Extended Data Fig. 7a) along with analysis of previous dataset of targeted depletion of NEXT (Rbm7 and Zcchc8 KD)23, PAXT (Zfc3h1 KO)33 and Integrator (Ints11 KD)23. Notably, all perturbations caused a significant upregulation of TE, predominantly LTRs, similarly to that in Exosc3 cKO23 (Extended Data Fig. 7b,c). The cofactor TE and gene expression signatures were positively correlated with that of Exosc3 cKO (Extended Data Fig. 7d). Upregulation of LTRs but not of other classes of TEs was also accompanied by a significant activation of chimeric promoters (Fig. 5a). In clear contrast, KD of Cpsf2 showed neither upregulation of LTR nor activation of LTR promoters (Fig. 5a and Extended Data Fig. 7c).

Fig. 5: Converged function of RNA exosome and splicing in regulation of LTR-chimera biogenesis and cell potency.
figure 5

a,b, Heat map displaying median log2 fold change of TE-promoter activity in each depletion condition targeting NEXT, PAXT, Integrator, Restrictor (a) and spliceosome (b), compared to control, grouped by TE class. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test for Zfc3h1 KO and a paired, two-sided Wilcoxon signed-rank test for all the rest. c, Box plot showing log2 fold changes of LTR promoter activity for U1 AMO, Exosc3 cKO transfected with Scr AMO and double inhibition condition compared to Scr AMO-transfected WT (n = 687). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (P = 4.772289 × 10−70 for U1 AMO versus Exosc3 cKO, 9.078507 × 1074 for U1 AMO versus double inhibition condition and 2.566191 × 1027 for Exosc3 cKO versus double inhibition condition). d, PCA showing global transcriptomic differences between perturbed (red) and control (black) conditions. Projections were calculated on the basis of the top 2,000 most variable known protein-coding genes across the samples. WT and control siRNA-treated samples were considered as control and the rest (Exosc3 cKO, AMO treatment and siRNA-mediated splicing inhibition42) were considered as perturbed. As input for PCA, expression counts were first adjusted with ComBat-seq71 to correct for clustering driven by potential technical confounders, then normalized and variance-stabilized with DESeq2 (ref. 72). Both Exosc3 cKO and Exosc3 cKO transfected with Scr AMO were used to assess the effect of Exosc3 cKO. e, Heat map representing the median log2 fold change for protein-coding genes with 1–2 exons or >2 exons in each perturbation relative to the corresponding control. All expressed genes in each comparison were considered. f, Heat map presenting log2 fold change of MERVL-int and associated TFs (Dux (Duxf3), Zscan4d and Obox5) in each perturbation indicated. For e,f, U1 AMO, Exosc3 cKO treated with Scr AMO and double inhibition were compared to WT treated with Scr AMO; Snrpb and Snrpd2 KD were compared to control siRNA-treated samples. PlaB represents the differential gene expression test in the dataset generated in mES cells cultured with splicing inhibitor (PlaB) across passages42. A differential gene expression test was designed to compare the effect in the late passages (passages 4–6) over the early passages (passages 0–2). The log2 fold changes calculated by DESeq2 (ref. 72) were used and asterisks represent nominal P values from DESeq2 (ref. 72) (f). gl, RT–qPCR analysis of MERVL-int (g,j), Zscan4 (h,k) and Nelfa chimera (i,l) transcripts upon transfection of vector expressing GFP, Dux-FL and Dux-S in WT (gi) or Exosc3 cKO (jl). Data indicate the mean of two replicates, with individual values shown. m, Bar plots showing the average counts normalized by DESeq2 (ref. 72) of MERVL-int, MT2_Mm and Zscan4d across three replicates (each point), with asterisks representing nominal P values from DESeq2 (ref. 72) (for MERVL-int, P = 4.576920 × 10172 for WT and below the limits of double-precision floating-point arithmetic for Exosc3 cKO; for MT2_Mm, P = 7.582753 × 1077 for WT and 1.337273 × 10170 for Exosc3 cKO; for Zscan4d, 8.882704 × 1010 for WT and 1.888064 × 1025 for Exosc3 cKO). n, Box plot representing LTR-chimera promoter activities in WT and Exosc3 cKO with Scr and MERVL ASO treatment conditions (n = 440). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test for all the rest (P = 3.499736 × 106 for WT with Scr ASO versus WT with MERVL ASO, 1.644941 × 1049 for WT with Scr ASO versus Exosc3 cKO with Scr ASO and 5.236761 × 1017 for Exosc3 cKO with Scr ASO versus Exosc3 cKO with MERVL ASO). o, Volcano plot displaying differentially expressed genes and TEs in Exosc3 cKO treated with MERVL ASO versus Scr ASO. Color coding represents the gene set, with purple indicating 2CLC genes and gray indicating other genes. P values were generated with the DESeq2 (ref. 72) R package, applying a two-sided test and FDR correction (***P < 0.0001). p, GSEA of 2CLC genes from the differential expression signature in Exosc3 cKO treated with MERVL ASO compared to that with Scr AMO. P values were generated using GSEA (clusterProfiler73 R package), which uses a two-sided testing approach. NES, normalized enrichment score.

Source data

We then addressed whether other factors that control TE expression might also influence TE-chimera generation. We focused on splicing factors that have been shown to suppress MERVL expression42,43 and replicated our analyses focusing on TE and TE-chimera promoter activity. KD of Eftud2, Isy1, Lsm4, Snrpb and Snrpd2 (ref. 42) resulted in the upregulation of a considerable number of TEs, among which LTRs represented the largest group (Extended Data Fig. 7e). The upregulation of TEs and genes seen upon loss of defined splicing factors was positively correlated with that in Exosc3 cKO (Extended Data Fig. 7f) and caused a significant activation predominantly at LTR promoters (Fig. 5b). We also validated that KD of Snrpd2 increased the representative LTR-chimeric transcript (Nelfa chimera) (Extended Data Fig. 7g–i). Another factor we tested, Srsf7, did not phenocopy Snrpd2 (Extended Data Fig. 7j–l), a result that we cannot explain, possibly pointing to indirect effects or unique splicing factor vulnerabilities44.

Notably, the LTR promoters transactivated in Exosc3 cKO were more highly activated upon spliceosomal repression compared to the LTR promoters with comparable level of RNAPII (Extended Data Fig. 7m). These results indicate that, while splicing is required for exonization of LTR-chimeras, reducing splicing activity transactivates LTR promoters. While this conclusion seems counterintuitive, it is known that reduced level of splicing activity can have a positive effect on some genes (aside from the expected repressive effect on most genes)42. Consistently, we found a positive correlation between expression signature of Exosc3 cKO and a previous splicing inhibitor, Pladienolide B (PlaB), treatment dataset42 for all time points (Extended Data Fig. 7n). To understand the epistatic relationship linking RNA degradation, splicing and exonization, we then performed inhibition of major and minor spliceosome activity through transfection of antisense morpholino oligonucleotide (AMO) in WT and Exosc3 cKO. Exosc3 cKO in the Scr AMO-transfected condition showed a significant correlation with that in the untransfected condition (Extended Data Fig. 7o) and U1 and U6atac AMOs were specific in inhibiting the major and minor spliceosome, respectively (Extended Data Fig. 7p–q). U1 AMO treatment caused an increase in LTR promoter activity in WT cells (Extended Data Fig. 7r); importantly, this effect was enhanced in Exosc3 cKO (Fig. 5c).

We then analyzed the transcriptome of cells in which RNA degradation or splicing was perturbed alone or in combination compared to control cells. Principal component analysis (PCA) of Exosc3 cKO along with the AMO-derived splicing inhibition dataset and splicing-factor-depleted dataset42 indicated that perturbed cells segregate away from controls and share considerable (38–92%) transcriptomic changes (Fig. 5d and Extended Data Fig. 7s), despite the targeted factors acting at different steps of splicing process (Extended Data Fig. 7t). On the basis of the fact that previous work has shown a gene-size stratified effect caused by splicing inhibition42 and RNA degradation inhibition23, we separated the genome into short (protein-coding genes with 1–2 exons) and long (protein-coding genes with >2 exons). Strikingly, perturbation causing RNA degradation and splicing inhibition caused upregulation of short genes (Fig. 5e and Extended Data Fig. 7u). We further identified an upregulation of a TE (MERVL-int) and associated TFs (Dux, Zscan4d and Obox) that are representative two-cell (2C) genes, which are known to regulate TE-chimera expression and enhance cell potency45,46,47,48 (Fig. 5f). Notably, the upregulation in MERVL-int, Dux and Zscan4d was enhanced by a codepletion of RNA exosome and major splicing (Extended Data Fig. 7v).

To further demonstrate the importance of Dux, we performed overexpression analysis. Vector expressing GFP, Dux (Dux-FL) or Dux lacking the transactivation domain (amino acids 1–178; Dux-S) were transfected into mES cells (Extended Data Fig. 7w,x). Exosc3 cKO was induced by simultaneous tamoxifen treatment. In line with the previous finding45,49, Dux-FL overexpression, unlike Dux-S, caused a significant upregulation of both 2C-related transcripts expression (MERVL-int and Zscan4) and the representative LTR-chimeric transcript under transactivation (Nelfa chimera) (Fig. 5g–l).

Moreover, reanalysis of the previous Dux ChIP-seq in mES cells45 and Obox1 Stacc-seq in late 2C mouse embryos47 showed a significant enrichment of Dux and Obox1 in chimeric LTRs. In contrast, Dux and Obox1 did not show a clear enrichment across nonchimeric LTRs, suggesting that these chimeric LTRs are subject to the Dux-dependent and MERVL-int-dependent transactivation (Extended Data Fig. 7y–z).

To determine the functional importance of MERVL-int, we used antisense oligonucleotide (ASO)-mediated degradation of MERVL50. We transfected WT or Exosc3 cKO with either MERVL-degrading ASOs or scramble control (Scr). MERVL ASO treatment, compared to Scr ASO treatment, was accompanied by a significant reduction, especially for Exosc3 cKO, in MERVL-int, MT2_Mm and Zscan4d expression (Fig. 5m) and LTR promoter activities (Fig. 5n). Notably, MERVL ASO treatment in Exosc3 cKO led to a dampened activation of 2C-like cell (2CLC) genes (Fig. 5o) and erasure of the gene network associated with enhanced cell potency (Fig. 5p). Our result indicates that suppression of MERVL-int expression is sufficient to revert the changes in transcriptomics and cellular plasticity caused by Exosc3 cKO.

Overall, our data suggest a model where chemical or genetic perturbation of splicing and RNA degradation converges into upregulation of short genes like MERVL-int (which is intronless), which instigate a gene regulatory program of enhanced cell potency and TE-chimera expression.

Control of TE-chimera expression in vivo

To understand whether RNA degradation controls TE-chimeras in vivo, we compared Exosc3 cKO mES cells with a conditional ablation of an essential exosome cofactor (Mtr4 cKO) in oocyte51, along with conditional deletion of Dicer1 in both mES cells and oocytes52,53,54. Oocytes express a set of tissue-specific MTA (ERVL-MaLR) TE-chimeras and a pair of oocyte-specific MTA and MTC TE-chimeras in the Dicer1 gene was previously shown to regulate TE exonization events in mouse oocytes through RNA interference53. Mtr4 cKO oocytes overexpress LTR-chimeras in the ERVK family relative to controls but exhibit no overexpression of LTR-chimeras in the ERVL family (Fig. 6a). Meanwhile, Dicer1 cKO oocytes have the opposite effect, indicating that, in mouse oocytes, ERVL chimeras are regulated through the RNA interference pathway, whereas ERVK chimeras are regulated by the RNA surveillance pathway. This contrasts with the Dicer1 cKO and Exosc3 cKO mES cells. mES cells do not express the Dicer1 TE-chimera responsible for allowing mouse oocytes to regulate ERVL chimeras through RNA interference. Furthermore, Dicer1 cKO mES cells do not have significant changes in the expression of either ERVL or ERVK chimeras (Fig. 6b). Exosc3 cKO mES cells have increased expression of TE-chimeras for both families of LTR. Our model (Fig. 6c) indicates that, in vivo, different mechanisms operate specifically at defined TE classes to regulate cell-specific and spurious TE exonizations.

Fig. 6: Regulation of LTR-chimeras in mouse oocytes and mES cells.
figure 6

a, Expression profile of mouse oocytes in Dicer1 cKO and Mtr4 cKO. Mtr4 cKO in oocyte upregulates ERVK LTR-chimeras, whereas Dicer1 cKO in oocyte upregulates ERVL LTR-chimeras. b, Expression profile of mES cells in Dicer1 cKO and Exosc3 cKO. Exosc3 cKO upregulates both ERVL and ERVK LTR-chimeras, whereas Dicer1 cKO upregulates neither. For both a,b, one-sided (upregulated) Wilcoxon rank-sum tests were performed on change in promoter activity (average promoter activity of experiment − average promoter activity of control) between chimeric promoters and nonchimeric promoters with Bonferroni correction. c, Schematic of LTR-chimera regulation based on a,b. The RNA interference pathway regulates ERVL LTR-chimeras in oocyte through MTA/MTC Dicer1 chimeras. In the absense of these chimeras in mES cells, RNA surveillance regulates the expression of ERVL LTR-chimeras.

Evolutionary analysis of TE-chimeras

While most TEs have lost the capacity to mobilize13, our data indicate that LTRs in the genome can promote the transcription of novel isoforms in known and novel gene loci. These data suggest that neogene birth is an ongoing process, contrary to the idea that numbers of genes in genomes are fixed. To test this, we performed an in-depth evolutionary analysis on TE sequences and their exonization.

Previously, as part of the Zoonomia project, we found that more than 85% of primate-specific human cis-regulatory elements originated from TEs55. In this study, we adapted our earlier approach to investigate TE-chimeras in humans, categorizing them on the basis of their evolutionary conservation levels. We began by investigating TE-chimeras conserved among apes and then extended our analysis to those conserved among apes and monkeys, followed by those conserved among apes, monkeys and lemurs and, finally, those conserved beyond primates. This stratification enabled us to analyze the TE-chimeras in relation to the evolutionary age of each transposon family. For example, LTRs and L1 elements are most active among apes, SINEs are most active among primates and L2 and DNA elements are most active in lineages older than primates. We performed a similar analysis for TE-chimeras in mice on the basis of the evolutionary conservation levels of their TEs, categorized into murine-specific TEs, rodent-specific TEs and TEs conserved beyond the rodent lineage.

We compared the TEs that generate chimeric transcripts in the sense orientation (sense chimeric TEs) against two sets of TEs as negative controls: (1) TEs that form chimeric transcripts in the antisense orientation (antisense chimeric TEs) and (2) TEs that are upstream and within 5 kb of the TSS of known gene loci but do not generate chimeric transcripts (nonchimeric TEs). We analyzed the compositions of TE families among the three groups of TEs in humans and mice separately. For the results presented below, we focused on the comparison with the first negative set (antisense chimeric TEs).

For humans, sense chimeric TEs are most enriched in four families when contrasted with antisense chimeric TEs: LTR12 (a subset of the ERV1 family of LTRs), ERVL (a family of LTRs), ERVL-MaLR (a family of LTRs) and SVA elements (Fig. 7a). When separating TEs by their evolutionary ages, it becomes apparent that TEs in the LTR12 and SVA families are most enriched in the ape-specific subset, ERVL and ERVL-MaLR are most enriched in the primate-specific subset and ERVL-MaLR are most enriched in the subset that is conserved beyond the primate lineage (Fig. 7a). These results indicate that the four families of TEs incurred separate waves of exonization events in mammalian genomes and, generally, the process of TE exonization accompanied major evolutionary expansion events.

Fig. 7: Evolutionary analysis of human TE-chimeras.
figure 7

a, Bar plots showing the proportions of TE families for sense chimeric TEs, antisense chimeric TEs and nonchimeric TEs that are upstream and within 5 kb of the TSS of known genes, as well as all TEs in the human genome. The analysis is presented for all TEs, as well as for TEs with three evolutionary conservation levels: ape-specific, primate-specific and conserved beyond the primate lineage. b, Analysis of evolutionary ages for ERVL, ERVL-MaRL, LTR12, L1 and L2, stratified by whether they are sense chimeric, antisense chimeric or nonchimeric, as in a. Randomly chosen intergenic non-TE regions were also used as controls. Left, bar plots showing the percentage of these TEs and non-TEs across different evolutionary ages, from human-specific (young; colored red) to eutherian mammal conserved (old; colored dark blue). Middle, box plots indicating the number of mammalian genomes (among a total of 240) that these TEs or genomic regions can map to. Boxes present the median, lower quartile and upper quartile, while whiskers denote the maximum and minimum after removing outliers (more than 1.5× the upper quartile or less than 1.5× the lower quartile). P values were generated using an unpaired two-sided Wilcoxon rank-sum test. Right, bar plots representing the number of sense chimeric and antisense chimeric TEs across different evolutionary ages.

For mice, sense chimeric TEs were most enriched in two families when contrasted with antisense chimeric TEs: ERVL and ERVL-MaLR (Extended Data Fig. 8a). Most of the ERVL chimeric TEs were murine specific, while some of the ERVL-MaLR chimeric TEs were murine specific and others were rodent specific, further suggesting different time periods for their exonization events. No enrichment of sense chimeric TEs was observed over antisense chimeric TEs beyond the rodent lineage (Extended Data Fig. 8a).

Given the enrichment of the above families of TEs in sense chimeras, we proceeded to analyze the evolutionary ages of individual members in each TE family, quantified as the number of mammalian genomes that the TE can be mapped to. For both ERVL and ERVL-MaLR families, in both humans and mice, the TEs that formed sense chimera were significantly younger than the TEs in the same family that formed antisense chimera (Fig. 7b and Extended Data Fig. 8b). Human TEs in the LTR12 families did not show a significant age difference between the sense chimera and antisense chimera, likely because of the lack of statistical power in comparing their young evolutionary ages. In sharp contrast, a younger evolutionary age was not observed in LINEs (Fig. 7b) or in any other TE families for both human and mouse (Extended Data Fig. 9a). Overall, these results indicate that young LTRs are still under evolutionary pressure to drive sense chimeric exonization leading to novel genes.

Discussion

TEs are known to be ‘used’ in gene regulatory networks controlling cell identity (mainly during embryo development) but most work has focused on TE families because of the repetitiveness of these elements. Our long-term effort is to establish a framework to characterize TEs as genes. Here, we provide a mechanistic and cartographic analysis of novel genes and isoforms derived from TEs along with their positional identity that overall expands our transcriptome in a dynamic and temporal fashion. We describe TEs involved in cell-type-specific gene expression during organogenesis, aging and disease. As 95% of common disease-associated single-nucleotide polymorphisms (SNPs) reside in loci outside of coding genes, systematic analyses of human variation in TE sequences using technologies that can distinguish the singular identity of all TEs are warranted. In fact, while our study provides an initial attempt to define the molecular and positional identity of TE-chimeras, it remains limited by the lack of cell type and spatial resolution, as well as by our inability to map many extremely long multi-TE exonization events or terminal exonizations lacking a poly(A) tail. Moreover, on the basis of TE expression being cell type dependent, many mechanisms controlling TE transcription and exonization are likely controlled by cell-type-specific factors. Future work, supported by reduced sequencing costs or advances in technology, will enable broader application of the strategy used here to achieve in all cells an isoform-level resolution of genes and TEs. Lastly, TE and TE-chimera evolutionary analyses like the one performed here will be greatly enhanced by the growing availability of highly resolved, telomere-to-telomere genome assemblies from most species.

With respect to mechanism, our data are in line with the idea that many TEs are in genomic positions that avoid conventional epigenetic silencing. We show how sense and antisense transcription of a nearby unit (or the genic region giving rise to chimera) can unsilence TEs and promotes TE RNA synthesis. Our analysis indicates that perturbation of cotranscriptional and post-transcriptional events such as nuclear RNA degradation and splicing converge in the upregulation of TEs. This event is also associated with the establishment of enhanced stem cell potency, in line with previous observations23,42. In fact, in mES cells, inhibiting the spliceosome with PlaB can push cells from a pluripotent toward a totipotent state through a mechanism that appears to involve selective downregulation of pluripotency genes through inefficient splicing (they have many or large introns) while totipotency-associated genes (which often have fewer or shorter introns) remain more efficiently spliced and can be activated. On this basis and considering the reversion to totipotent-like cells caused by exosome loss, we propose that perturbation of splicing and RNA degradation cause upregulation of MERVL-int expression that sets in motion a gene regulatory network that transactivates TE-chimera. This is consistent with previous work suggesting MERVL as a master regulator of enhanced cell potency50, which functions upstream of TE-chimera expression48.

Accordingly, we propose that the many chimeras generated in somatic tissue have been selected throughout evolution because the TE from which they originate contains cell-type-specific TF-binding sites that can transactivate them. Our data also indicate that spurious transcription could cause the birth of novel TE-chimeras. We hypothesize a model in which RNA degradation safeguards the inherent threat of TE functionalization en masse from the spurious transcription of numerous LTRs present in the genome, while the transactivation of these LTRs by cell-type-specific TFs can sustain and functionalize unique TE genes in the gene regulatory network of a given cell state.

Our model differs from the mechanisms by which LINE1s are silenced56,57 and also differs from the recent evidence that SAFB proteins control LINE exonization57. We note that analytical approaches to quantify tissue expression of exonization were consistent with previous methods (Extended Data Fig. 10a); however, promoter-derived TEs and regulation by EXOSC3, based on reanalysis of a publicly available dataset58, showed a distinct pattern of organ expression (Extended Data Fig. 10b–g). Our findings help to categorize different types of silencing that target TE on the basis of genomic features. In a simplistic way, transcription-associated degradation is used for short TEs that function as regulatory elements.

Our model of how RNA degradation impacts TE transcription might have important consequences when considering genes in the context of an evolutionary timescale. The majority of the genome is transcribed59,60,61. For a considerable portion of the transcribed genome, transcription is coupled to degradation58. While regulatory mechanisms have been established to prevent transcription of undesirable transcriptional units, spurious transcription happens as an inherent byproduct of the mechanics of transcription itself62. In fact, RNAPII initiates transcription at a low rate on most nucleosome-free regions63 and mostly outside of known promoters at any given time in the cell64. TFs recognize 6–10 bp and responsive elements are created constantly under the constant mutational burden to which our genomes are subjected. Rather than eliminating the act of spurious transcription, cells have resorted to induce transcript degradation by the RNA quality control system. This is likely because spurious transcription is a primary source of genetic innovation. The road to genetic innovation is uphill for spurious transcripts originated by random DNA sequences. In fact, even if spurious transcripts are under neutral evolution65,66 and their presence is inconsequential to the cell and they are not ‘selected against’, they need to surpass a high resistance barrier before conferring an advantage, as they need to acquire many mutations that render them useful and counteract quality controls occurring at multiple levels after their transcription, such as RNA degradation, described here.

For viral-derived elements such as TEs, the energy barrier to functionalization is lower, as they already contain TF-binding sites and RNAPII promoter elements13,65,67,68,69. We surmise that TE-derived gene birth is the result of evolutionary tinkering—the opportunistic tendency of evolution where changes are based on pre-existing materials, leading to adaptations that may not be perfect but good enough to stay70.

Methods

PacBio Iso-seq library preparation and sequencing in mES cells and EpiLCs

Exosc3 Cre/lox conditional inversion (COIN) mouse pluripotent stem cells were gifted from the U. Basu lab. mES cell culture, EpiLC differentiation, Exosc3 cKO and RNA extraction were performed as previously described23. Two replicates were processed for each condition. Purified RNA was submitted to the Icahn School of Medicine Genomic Core facility for sequencing. Sequencing libraries were prepared using SMARTer PCR complementary DNA (cDNA) synthesis kit (Clontech) per manufacturer recommendations. OligodT primers were used to capture full-length polyadenylated transcripts. cDNA was then size-selected and sorted into two bins of greater or less than 4 kb. These bins were pooled together at equimolar concentrations using SMRTbell template preparation kit version 1.0. End-repaired and purified libraries were loaded onto a SMRTcell 1M, which was then sequenced on a Sequel I system with a 10-h movie.

PacBio Iso-seq error correction

Reads were imported into SMRTlink and IsoSeq3 software was used to obtain circular consensus sequences for error correction, yielding highly accurate reads. Lima was subsequently used to remove barcodes, SMART-seq primers and template-switching oligonucleotide sequences, further orienting isoforms in the correct 5′-to-3′ direction. Next, the refine command was used to remove poly(A) tails and concatemers.

WT and WT plus Exosc3 cKO mES cell and EpiLC Iso-seq analysis

Isoform-resolved transcriptomes were generated for WT and WT plus Exosc3 cKO samples using the following methodology. Refined long reads were mapped to the mouse mm10 genome (GRCm38.p6) using minimap2 using the following parameters: -ax splice -uf -secondary=no -C5 --MD. The Illumina bulk RNA short-read RNA-seq dataset was obtained from our previous study23 and reads were mapped to the mm10 genome using STAR74 with --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. StringTie75 was used to generate the novel isoform-resolved transcriptome by providing both Illumina short-read RNA-seq and PacBio Iso-seq aligned BAM files using the --mix option and the default minimum predicted isoform abundance of 1% (-f 0.01). SQANTI3 (ref. 76) was used to classify isoforms compared to the Ensembl GRCm38 versopm 102 reference transcriptome. To produce a high-quality reference transcriptome, the following isoform models were removed: novel monoexonic transcripts (which are often products of transcript degradation or sequencing artifacts77), novel transcripts classified in the fusion and genic categories by SQANTI and novel transcripts that overlapped TE sequences by over 50% of their total length. The protein-coding potential for each novel isoform was predicted with CPAT using default parameters. Novel transcripts were considered protein coding if they possessed a valid ORF that passed the default CPAT filtering parameters (≥0.44 coding probability). Novel genes were considered protein coding if a randomly selected transcript was found to be protein coding. To generate the TE-chimera annotation, transcripts whose TSSs were located within TEs in either orientation were initially marked as TE-chimeras according to the RepeatMasker GTF file for the mm10 genome. In cases where multiple TEs overlapped with a single TSS, one TE was selected with duplicated = F in R and retained in the annotation (~1% of TE-chimeras). This annotation was further used to define promoter status, as described below. Novel transcription units identified as eRNAs and PROMPTs in a previous study23 were also included in the annotation.

Human and mouse Iso-seq

The long-read RNA-seq dataset78 was scanned for chimeric transcripts by cross-comparing exon locations with annotations for repetitive elements in the genome. Transcripts whose TSS regions overlapped with repetitive elements were classified as TE-chimeras. The tissue-specific and line-specific expression level was estimated by mapping Illumina bulk RNA-seq reads78 to the respective novel transcriptome. Transcripts were counted as present in the sample if expression of the transcript exceeded one transcript per million mapped reads (TPM).

Organogenesis

For the human and mouse organogenesis RNA-seq dataset, raw short-read sequencing data were downloaded from ArrayExpress accession codes E-MTAB-6814 (human) and E-MTAB-6798 (mouse). Sequencing reads were aligned to the human hg38 and mouse mm39 reference genomes using STAR74 using a two-pass mapping approach, with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --limitSjdbInsertNsj 50000000 --outMultimapperOrder Random. Annotations of de novo assembled isoforms from the RNA-seq datasets of human and mouse were downloaded from the source publication26 and converted from hg19 to hg38 (for human) and from mm10 to mm39 (for mouse) genomic coordinates using liftOver. Next, SQANTI3 (ref. 76) was used to classify de novo assembled isoforms on the basis of their similarity to known reference transcripts in the Ensembl version 106 transcriptome assemblies for hg38 and mm39. A reference transcriptome for subsequent analyses was built by including all Ensembl version 106 isoforms, plus all de novo assembled isoforms with at least two exons and classified as either ‘novel in catalog’ or ‘novel not in catalog’ by SQANTI3. Next, proActiv24 was applied to quantify promoter activity across all RNA-seq samples against the newly built reference transcriptome. For subsequent analyses, only promoters with at least 1.5 absolute activity and 15% relative activity in at least two samples from the same tissue were included. TE-chimeras were further restricted to the cases where the TSS and first SD of the TE-chimeric isoform were fully contained within the same TE. Protein-coding probabilities for each isoform were assessed using CPAT79 with default parameters. ORFs with a coding probability ≥0.364 in human and ≥0.44 in mouse transcripts were labeled as protein coding, as indicated by the tool developers, while sequences below this threshold were classified as noncoding.

Organogenesis versus embryogenesis in human and mouse

De novo transcriptomes of human80 and mouse81 embryogenesis and expression data were downloaded from previously published research. Isoform-level expression was filtered using supplementary datasets. In human embryogenesis data, a transcript is considered expressed in a tissue if the transcript count is greater than or equal to 10 in at least two samples in the same tissue. In mouse embryogenesis data, a transcript is considered expressed in a tissue if the average TPM of the transcript is greater than or equal to 1. In human and mouse organogenesis data, TE-derived promoters were filtered using the filtering criteria described above. Only genes with an isoform with a TE-derived promoter that passed these filtering criteria were considered. Human and mouse genes were queried for homologous genes using BiomaRt82,83. All gene lists were converted to their human gene identifiers and overlaps were calculated with unique human gene identifiers.

GTEx TE-chimera filtering

TE-chimeras in the novel human transcriptome were identified by filtering for transcripts whose TSS and first SD were within the genomic range of the same TE. Comparisons of reference annotations showed that including Iso-seq-annotated transcriptomes enhanced the numbers of TE-chimeras detected by 26% (Extended Data Fig. 3a). This list was further filtered in GTEx for robustness, where chimeras must be expressed at >0.1 TPM in at least 20% of the individuals in each organ. This filter required a TE-chimera being expressed in a minimum of 110 individuals. Expression levels for TE-chimeric reads were further filtered by proActiv24 using both absolute promoter activity > 1.5 and relative promoter activity accounting for >15% of gene expression. We note that this metric was driven by interindividual variation and, as a result, was affected by the number of samples included. This was observed by modeling of the number of TE-promoters that pass this threshold in multiple subsampling analyses of GTEx RNA-seq data. In ten subsampling experiments, the number of TE-chimeras that passed our proActiv filtering thresholds never surpassed 739 (the averages for 10 experiments are shown in Extended Data Fig. 3b), even when more than several hundred samples were included. In sum, these combined analyses resulted in a ‘final set’ of 739 TE-chimeras passing both expression across individuals and proActiv24, listed in Supplementary Table 3.

Genetic correlation analyses and pathway enrichment assignment

To determine TE-chimera–gene correlations, as well as assign pathways enriched for TE-chimeras in population datasets such as GTEx, the TE-Chimera transcripts were correlated with all genes available in a given tissue using the bicorAndPvalue() function in the R package WGCNA. Relationships were then either visualized directly or pathway enrichments were assigned through gene set enrichment analysis (GSEA). This was accomplished by using the bicor coefficients as the enrichment weights for each gene and performed using gseGO() in clusterProfiler where thresholds were based on 1,000 permutations. For example, in Fig. 2e, the TE-chimera (ENST00000426261.6) was correlated with all other genes in the aorta; then, GSEA was performed using the regression coefficients and genes. Thus, plotting enrichment scores reflects the strength and direction of correlation assigned to each Gene Ontology term shown.

Human cancer analyses of TE-chimeras

TCGA data were obtained from the UCSC Xena portal (accessed February 4, 2024)84. For survival analyses, only persons in which a ‘survival event’ (for example, deceased) occurred over the course of the study were used to increase accuracy of the model. To bin individuals by expression, all LTRs listed in Supplementary Table 3 and detected in persons with cancer were used. For each Ensembl transcript corresponding to a TE-Chimera, the average expression (fragments per kilobase of TPM) was calculated across the population. Individuals were then assigned a ‘low’ or ‘high’ value on the basis of whether their expression of a given transcript was below or above the mean, respectively. Individuals were then binned into categories of ‘low expressors’ or ‘high expressors’ on the basis of the ratio of low or high transcript counts being below or above 0.5, respectively. Survival analyses were performed and visualized using R packages survival85 and survminer according to standard protocols. The P values for survival differences between groups were assigned on the basis of a log-rank test.

Human aging analysis of TE-chimeras

To compare age effects across organs among human TE-chimeras, all individuals in GTEx were binned as to whether their reported age was over or under 50 years old. Next, all LTRs identified in Supplementary Table 3 were compared in indicated organs using a Wilcoxon rank-sum test. Corresponding false discovery rates (FDRs) from t-test P values were calculated from the R package qvalue.

mES cell culture

mES cells were cultured in 2i/Lif medium as previously described23. Briefly, mES cells were cultured in N2B27 medium consisting of a 1:1 mixture of DMEM/F12 (Gibco) with HEPES and Neurobasal medium (Gibco) supplemented with 0.5× N2 (Gibco), 0.5× serum-free B27 (Gibco), 1× GlutaMAX (Gibco), 1× penicillin–streptomycin (Gibco), 0.05% bovine albumin fraction V (Gibco) and 1× 2-mercaptoethanol (Gibco). For naive mES cells, 3 µM CHIR99021 (Reprocell), 1 µM PD0325901 (Reprocell) and 20 ng ml−1 mouse recombinant leukemia inhibitory factor (R&D systems) were added to the medium to sustain stemness. Exosc3 cKO was induced by 100 nM tamoxifen (4-OHT) (EMD Millipore) treatment in Exosc3 Cre/lox COIN mES cells for 48 h. RNA was purified with TRIzol (Thermo Fisher Scientific) following the manufacturer’s recommendations.

Small interfering RNA (siRNA) transfection

For siRNA-mediated KD of Snrpd2, the WT line of mES cells, derived following the standard protocol, was transfected at a final concentration of 90 nM siRNA with Lipofectamine 3000 (Thermo Fisher Scientific) at the time of plating and collected 60 h after transfection. For siRNA-mediated KD of Wdr82 and Srsf7, COIN mES cells (without tamoxifen treatment) were transfected at a final concentration of 50 nM siRNA with Lipofectamine RNAiMAX (Thermo Fisher Scientific) at the time of plating and collected 2 days after transfection. For the KD experiments of Wdr82 and Srsf7, siRNAs were purchased from Dharmacon (Horizon Discovery). For the KD experiment of Snrpd2, an siRNA with a sequence reported in a previous study42 was custom-ordered from the same supplier. At the time of collection, total RNA was purified with TRIzol (Thermo Fisher Scientific) following the manufacturer’s recommendations. KD of Wdr82 was validated by western blotting using the following antibodies: WDR82 (D2I3B) rabbit monoclonal antibody (Cell Signaling Technology, 99715S; 1:1,000), anti-rabbit IgG, horseradish peroxidase (HRP)-linked antibody (Cell Signaling Technology, 7074S; 1:5,000) and β-actin (8H10D10) mouse monoclonal antibody (HRP conjugate) (Cell Signaling Technology, 12262S; 1:1,000). KD of Snrpd2 and Srsf7 was validated by reverse transcription (RT)–qPCR, as described below. The cytotoxicity of siRNA-transfected cells was evaluated using a lactase dehydrogenase (LDH) colorimetric assay (Promega). The release of LDH into the culture medium, indicative of cell membrane damage, was quantified by measuring absorbance at 490 nm. Data are presented as the percentage cytotoxicity relative to cells transfected with control siRNA and lysed (LDH release maximum). Two replicates were processed for each condition. For sequencing, RNA libraries were prepared using the NEBNext Ultra II directional RNA library prep kit for Illumina (New England Biolabs) following the manufacturer’s recommendations.

Dux overexpression

cDNA for the flag-tagged Dux gene (Gene ID 664783), either with or without the transactivation domain (amino acids 1–178), was PCR-amplified from the template plasmid (Addgene, 138320). The amplified Dux cDNA and GFP were cloned into an EF1a promoter plasmid. For transfection, 50,000 COIN mES cells were reverse-transfected with vectors encoding the indicated genes using Lipofectamine Stem (Thermo Fisher Scientific). Where Exosc3 cKO was required, 100 nM tamoxifen (4-OHT) was added to the at the time of plating. Control wells were treated with ethanol alone. Cells were then incubated for 2 days. Two replicates were processed for each condition. After incubation, cells were washed with PBS and resuspended in TRIzol (Thermo Fisher Scientific) reagent for RNA extraction.

RT–qPCR

Total RNA was extracted using TRIzol reagent and treated with DNase using the TURBO DNA-free kit (Invitrogen), following the manufacturer’s instructions. DNase-treated RNA was reverse-transcribed into cDNA using the high capacity cDNA RT kit (Applied Biosystems) and real-time qPCR was performed using a SYBR green master mix (Bio-Rad). To validate the transfection of FLAG–GFP and FLAG–Dux, RT–qPCR amplicons were visualized by agarose gel electrophoresis. Bands corresponding to the expected amplicon sizes were quantified using ImageJ 1.53t software86 and the values were normalized to the corresponding Actb band intensities. For other transcripts, relative expression of the transcript of interest was calculated using the 2−ΔCt method, where ΔCt = Ct_target − Ct_Actb. All primers used for qPCR are listed in Supplementary Table 6. The data were visualized as bar plots using GraphPad Prism 10.6.0 software for macOS (GraphPad software).

RNA pulldown of MT2 PAS

A 130-nt fragment of the MT2_Mm consensus sequence centered around the PAS motif (AAUAAA) was cloned into the pBlueScript-3×MS2 vector. This region includes the SD motif (UGUA) for CFIm binding, the aforementioned PAS motif and the cleavage site. A fragment with a mutated PAS motif (AACAAA) was also cloned as a negative control. RNA substrates were synthesized by runoff in vitro transcription using T7 RNA polymerase (New England Biolabs) and capped by vaccinia capping enzyme (New England Biolabs). RNA pulldown and western blotting experiments were performed as previously described38 using the following antibodies: mouse monoclonal anti-U1-70K clone 9C4.1 (Millipore, 05-1588; RRID: AB_10805959; 1:2,000), mouse monoclonal anti-U1A (Santa Cruz, sc-101149; RRID: AB_2193721; 1:2,000), rat monoclonal anti-U1C (Sigma, SAB4200188-200UL; RRID: AB_10640155; 1:2,000), rabbit polyclonal anti-CFIm68 (Bethyl, A301-358A; RRID:AB_937785; 1:2,000), rabbit polyclonal anti-CFIm59 (Bethyl, A301-360A; RRID:AB_937864; 1:2,000), rabbit polyclonal anti-NUDT21 (CFIm25) (Proteintech, 10322-1-AP; RRID:AB_2251496; 1:500), rabbit polyclonal anti-NCBP2 (CBP20) (Bethyl, A302-553A; RRID: AB_2034872; 1:2,000), rabbit polyclonal anti-CPSF30 (Bethyl, A301-585A; RRID: AB_1078868; 1:2,000), goat anti-rabbit IgG HRP conjugate (Millipore, 12-348; RRID: AB_390191; 1:2,000), goat anti-mouse IgG HRP conjugate (Millipore, 12-349; RRID: AB_390192; 1:2,000) and rabbit anti-rat IgG HRP conjugate (Invitrogen, PA1-28573; RRID: AB_10980086; 1:2,000).

KD of MERVL using gapmer ASOs

Gapmer ASOs against AMOs were synthesized by Integrated DNA Technologies. Sequences used were as follows:

MERVL-ASO1: T*G*G*T*G*G*A*T*C*A*A*C*A*A*G*C*C*A*A*T

MERVL-ASO2: C*A*T*T*T*G*T*C*T*G*T*T*T*A*C*C*A*C*G*A

MERVL-ASO3: G*A*C*C*C*C*G*A*A*A*A*G*T*C*T*G*A*T*T*A

Scr ASO: A*G*C*G*C*G*G*G*T*A*T*T*G*A*A*C*C*A*G*G

Here, asterisks indicate a phosphorothioate bond instead of a phosphodiester bond. Bold and underlined residues represent 2′-O-methoxyethyl nucleotides. All stock ASOs were resuspended in 1× siRNA resuspension buffer (Horizon Discovery) before use. MERVL-ASO1, MERVL-ASO2 and MERVL-ASO3 were mixed at equimolar concentrations to form a 100 μM stock (MERVL-ASOmix) that was used for all downstream experiments. For KD experiments, Scr ASO or MERVL-ASOmix were complexed with 2.5 μl of Lipofectamine RNAiMAX (Thermo Fisher Scientific) in 250 μl of Opti-MEM. The resulting ASO–Lipofectamine mix was then reverse-transfected into single-cell suspensions of 2 × 105 mES cells in 750 μl of mES cell medium + 2i/Lif. Final concentrations of ASOs used in these experiments was 100 nM. Where Exosc3 cKO was required, 100 nM tamoxifen (4-OHT) was added at the time of plating. Control wells were treated with ethanol alone. Then, 48 h after transfection and tamoxifen treatment, cells were collected in TRIzol (Thermo Fisher Scientific). Three replicates were processed for each condition. RNA was isolated following the manufacturer’s recommendations. For sequencing, RNA libraries were prepared using the NEBNext Ultra II directional RNA library prep kit for Illumina (New England Biolabs) following the manufacturer’s recommendations.

KD of U1/U6 RNA using AMOs

U1 and U6 AMOs were synthesized by Gene Tools. Sequences used were as follows:

Control AMO: CCTCTTACCTCAGTTACAATTTATA

U1 AMO: GGTATCTCCCCTGCCAGGTAAGTAT

U6ATAC AMO: AACCTTCTCTCCTTTCATACAACAC

To study the impact of major or minor spliceosome function in control or Exosc3-depleted cells, mES cells were cultured in 2i/Lif medium in the presence of 0.01% ethanol or 100 nM tamoxifen for 48 h. mES cell colonies were then dissociated into single-cell suspensions using Accutase, counted and resuspended in Opti-MEM (Thermo Fisher Scientific) containing 2i/Lif. For KD experiments, single-cell suspensions of 1.25 × 106 cells were prepared in 400 μl of Opti-MEM containing 2i/Lif. AMOs were added to the cell suspensions at a final concentration of 15 μM and electroporated into cells using the Bio-Rad Gene Pulser XCell electroporation system. Electroporation was carried out with a single 240-V, 500-μF pulse in a 0.4-mm cuvette. Following electroporation, cells were immediately plated into full mES cell medium. Then, 24 h after electroporation, cells were collected in TRIzol (Thermo Fisher Scientific). Two replicates were processed for each condition. RNA was isolated following the manufacturer’s recommendations. For sequencing, RNA libraries were prepared using the TruSeq stranded total RNA library prep gold (Illumina) following the manufacturer’s recommendations.

mES cells bulk short-read RNA-seq and analysis

Illumina short-read RNA-seq data in WT and Exosc3 cKO mES cells were obtained from our previous study23. Data from other previous studies were obtained from their respective references (listed in Supplementary Table 7). Reads were mapped onto the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome using two-pass STAR74 mapping with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. Gene-level and TE-level quantification was performed with TEcount using the RepeatMasker gtf file generated from the mm10 genome. Differential expression was calculated using DESeq2 (ref. 72) with default parameters independently for each comparison with or without eRNAs and PROMPTs. In the Exosc3 cKO with AMO transfection dataset, each DESeq2 model included a similar number of expressed genes. Protein-coding status and exon counts of each gene were determined using the method described above.

mES cell nascent RNA (metabolic labeling) sequencing analysis

Nascent RNA-seq data in WT and Exosc3 cKO mES cells were obtained from the previous study23. Reads were mapped onto the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome using two-pass STAR74 mapping with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100.

EU-seq total input RNA-seq analysis

Total input RNA-seq of the EU-seq dataset was obtained from a previous study23. Illumina adaptors were trimmed from reads using Trim Galore87. Reads were aligned to the GRCm38/mm10 reference genome using STAR74 with the following custom parameters: --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. To generate bigWig files, aligned BAM files were first filtered using sambamba88, removing unmapped reads and secondary alignments. Next, deepTools89 was used with RPKM (reads per kilobase of TPM) normalization to generate unstranded bigWig files that were used to produce genome browser snapshot (Fig. 3f,g).

mES cell promoter activity analysis

Promoters and their activity were estimated by proActiv24, using Illumina RNA-seq splice junction quantification files (SJ.out.tab) from a STAR two-pass alignment and the novel isoform-resolved transcriptome as input, as described above. Promoters that were estimated to be inactive across all datasets used, as well as internal promoters, were excluded from downstream analysis. The average absolute and relative promoter activity across replicates for each condition was used to analyze the promoter activity. Low-activity promoters with absolute promoter activity lower than 1.5 and relative promoter activity lower than 15% were further filtered out to avoid false positives. For the WT and Exosc3 cKO dataset, filtering was applied for Exosc3 cKO samples to avoid eliminating the unstable or lowly expressed isoforms in WT. For the remaining comparisons, the TE-promoters whose maximum absolute promoter activity and maximum relative activity across the dataset in each comparison were higher than 1.5 and 0.15, respectively, were considered. For the analyses described in this section, promoter activity refers to the average absolute promoter activity unless otherwise indicated. Promoters were classified on the basis of the genomic location of promoters and their association with TE-chimeras annotated in the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome, as described above. Specifically, we initially annotated nonchimeric (host) and TE-promoters on the basis of the overlap between promoter regions and annotated repetitive elements. When multiple TEs were overlapped with a promoter, a single TE was selected with distinct function of dplyr90 package in R. Promoters that did not overlap with any TE on the same strand were defined as host (nonchimeric) promoters. Promoters that overlapped with a TE on the same strand were defined as TE-promoters. TE-promoters were then further restricted to instances where the TSS of the TE-chimeras was contained within the same TE on the same strand. To calculate log10 or fold changes, a small constant (+0.1) was added to the values to prevent the occurrence of infinite values.

Promoter classification based on RNAPII status

RNAPII ChIP-seq (8WG16) data generated in WT and Exosc3 cKO from the previous study were obtained23 and processed as described therein. To classify promoters on the basis of RNAPII, a differential binding analysis between WT and Exosc3 cKO was performed by using (1) MACS2 (ref. 91) with the parameters --broad --qvalue 0.5 and (2) DiffBind91. Promoters that overlapped with RNAPII peaks (allowing 200 bp of max gap) were then classified in three categories: (1) more RNAPII in Exosc3 cKO, in cases where the first exon overlapped an RNAPII peak in Exosc3 cKO but not WT or both conditions had peaks but with significantly stronger signal in Exosc3 cKO (FDR < 0.05, fold > 0.5); (2) less RNAPII in Exosc3 cKO, in cases where the first exon overlapped an RNAPII peak in WT but not Exosc3 cKO or both conditions had peaks but with significantly lower signal in WT (FDR < 0.05, fold < −0.5); and (3) similar RNAPII levels, in all other cases. MT2_Mm_dup920 (positionally overlapping both RNAPII promoters) and MTC-int_dup3214 (showing the outlier signal of nascent transcription) were excluded in Fig. 3h and Extended Data Fig. 7m,y–z.

Hi-C data processing and analysis

Hi-C data were processed as previously described92. Briefly, Hi-C reads were trimmed at MboI/DpnII recognition sites (GATC) and aligned to the mouse genome (GRCm38/mm10) using STAR, keeping only read pairs that both mapped to unique genomic locations for further analysis (MAPQ > 10). All PCR duplicates were also removed. PCA of Hi-C experiments used to define chromatin compartments was performed with HOMER93. For each chromosome, a balanced and distance-normalized contact matrix was generated using a window size of 50 kb sampled every 25 kb, reporting the ratios of observed-to-expected contact frequencies for any two regions. The correlation coefficient of the interaction profiles for any two regions across the entire chromosome was then calculated to generate a correlation matrix. This matrix was then analyzed using PCA from the prcomp function in R and the eigenvector loadings for each 25-kb region along the first principal component (PC1) were assigned to each region. The PC1 values from each chromosome were scaled by their s.d. to make them more comparable across chromosomes and analysis parameters. For each chromosome, PC1 values were multiplied by −1 if negative PC1 regions were more strongly enriched for active chromatin regions defined by H3K27ac peaks to ensure that the positive PC1 values aligned with the A/permissive compartment (as opposed to the B/inert compartment). For each 25-kb region (bin), the relative enrichment of chimeric LTRs was calculated as the ratio of chimeric LTR base pairs to total LTR base pairs. For both chimeric and total LTRs, overlapping regions were merged such that each base pair was counted only once when calculating coverage. PC1 values were then used to compare (1) genomic bins containing or lacking chimeric LTRs, considering only bins that contained at least 1 bp of any LTR, and (2) genomic bins stratified by the relative enrichment of chimeric to total LTRs, considering only bins that contained at least 1 bp of chimeric LTR.

Correlation analyses

The WGCNA bicorandpvale function94 was used to generate a biweight midcorrelation coefficient (bicor) and corresponding P values between the expression signature of coactivators depletion and that of Exosc3 cKO. For the PlaB treatment dataset42 analysis, log2 fold changes of the counts normalized using DESeq2 (ref. 72) were manually quantified because of the limited number of replicates available deposited in the dataset.

Motif analysis

Motif analysis was performed using HOMER95. The entire mouse genome (mm10) was scanned with known SD and PAS motifs and pairs of SD-PAS motifs where the SD is upstream of the PAS for up to 10 bp at most were identified. The proportions of TEs or TSSs that contain this specific motif configuration were calculated by intersecting genomic motif coordinates with the reference GTF. Only presence on standard chromosomes was considered when calculating the proportions. Only TSSs that were associated with promoters that passed the filtering criteria described above were used. TSSs that did not overlap with TEs on the same strand were classified as nonrepetitive, whereas TSSs overlapping with TEs on the same strand were classified as repeat-derived. Only motifs whose strand matched that of the TEs or TSSs were considered. Occurrences were found between the TSS and 150 bp downstream.

Positional analysis of chimeric and nonchimeric LTRs

Chimeric LTRs were filtered and defined on the basis of criteria described above. All other LTRs that were either not defined as LTR promoters or did not satisfy the filtering criteria were classified as nonchimeric LTRs. All positional analyses were performed against the mm10 Ensembl transcriptome. Primary annotation of LTR and gene–LTR pairing were performed considering only genes located on standard chromosomes. Primary annotation of LTR was performed using the annotatePeak() function in ChIPseeker96, with the parameters level set to ‘gene’ and overlap set to ‘all’. Annotations were then classified as four groups: (1) promoter, where LTRs are located proximal to the TSS (−3 kb to +3 kb); (2) intragenic, where LTRs are located in the untranslated region, exon or intron; (3) downstream (of gene end) ≤ 300 bp; and (4) distal intergenic (Extended Data Fig. 6b). LTRs were then paired with genes (regardless of strand orientation): (1) gene and its overlapping intragenic LTR and (2) gene and its nearest promoter-proximal intergenic LTR (within 5 kb from the TSS). To avoid any complication, only genes present in the mm10 Ensembl transcriptome that were not overlapping with any other genes (that is, nonoverlapping genes) were considered. Intragenic LTRs positioned inside overlapping genes and promoter-proximal intergenic LTRs positioned closer to overlapping genes than nonoverlapping genes were discarded. Genes paired with chimeric LTRs were excluded from pairing with nonchimeric LTR. The gene-level RNA-seq count data generated by TEcount97 using BAM files were mapped onto the mm10 Ensembl transcriptome. The normalized counts from DESeq2 (ref. 72) were further normalized by the total exon length (in kilobases) of the longest transcript of each gene and averaged across replicates for each condition. These normalized expression values were then used to compare expression of genes paired with chimeric LTRs to those paired with nonchimeric LTRs.

The coverage of H3K9me3 was calculated using datasets from our previous study23 with the multiBigWigSummary function in deepTools89. The processed bigWig files deposited in the corresponding Gene Expression Omnibus (GEO) entry were used as input. For DNA methylation analysis, whole-genome bisulfite sequencing data were obtained from our previous study23. Reads were aligned to the mouse reference genome (mm10) using the bsseq package in R98. A BSseq object was created, filtering for CpG sites with a minimum coverage depth of six reads. The filtered BSseq object was converted to bedGraph. For visualization, bigWig files were generated from bedGraph files using the bedGraphToBigWig utility99. DNA methylation levels were quantified for each sample using the resulting bigWig files with the multiBigWigSummary function in deepTools89. For both H3K9me3 and DNA methylation analyses, NA values were replaced with 0 and the average signal was calculated across replicates within each condition. The coverage of H3K9me3 immunoprecipitation (IP) samples was further normalized by the input sample.

The average antisense coverage between genes and the nearest upstream proximal intergenic LTR was calculated from stranded, RPKM-normalized bigWig files. To generate these bigWig files, aligned BAM files from replicates were first merged and then filtered with sambamba88 to remove unmapped reads and secondary alignments. The replicate-merged, filtered BAM files were converted into stranded bigWig files using the bamCoverage function in deepTools89 with the parameters: --binSize 10 --normalizeUsing RPKM. Strand specificity was applied using the --filterRNAstrand option. The average antisense coverage was then calculated from these bigWig files using the multiBigWigSummary function of deepTools89. Gene and promoter-proximal intergenic LTR pairs where the range between gene and LTR was wholly contained within the novel transcripts, which were antisense to the gene, were excluded when quantifying antisense signals to avoid complications.

For the analyses described in this section, a small constant (+0.1) was added to all values before calculating log10 or fold changes to avoid infinite values. The ggbreak100 R package was used to introduce axis breaks.

Sashimi plot, metagene plot and heat map

A strand-specific sashimi plot of RNA-seq was created by ggsashimi101 with the following parameters: -s MATE2_SENSE. Strand-specific metagene plots of RNA-seq and nascent RNA sequencing (metabolic labeling) were generated by ngs.plot.r102 with BAM containing only the first mate reads. Replicate-merged BAM files for each condition were used as an input to ggsashimi101 and ngs.plot.r102. Metagene plots of Dux ChIP-seq in mES cells45 and Obox1 Stacc-seq in late 2C mouse embryos47 were generated by deepTools89 computeMatrix and plotHeatmap. The processed bigWig files deposited in the corresponding GEO entry by previous studies were used as input. For ChIP-seq, replicate IP bigWig files were merged using bigwigCompare (--operation mean) for visualization. For Stacc-seq, to match the reference version, the genomic coordinates of LTRs were extracted from the RepeatMasker GTF file for the mm9 genome and only elements annotated as LTRs in the mm9 assembly were considered.

2CLC gene set expression and enrichment analyses

The 2CLC gene set was obtained from a previous study (GSE75751)46, using genes and TEs that were upregulated in MuERVL+Zscan4+ double-positive samples when compared to untransfected negative control samples (FDR < 0.05 and log2 fold hange > 5). GSEA was performed using fgsea (version 1.16.0)103 ranking genes according to the differential expression statistic output by DESeq2 (ref. 72).

LTR-chimera analysis in vivo

De novo oocyte transcriptome was assembled with paired illumina short-read data from a previous study (GSE247848)51. Reads were filtered using fastp (version 0.23.2)104, mapped with STAR74 two-pass alignment to mm10 with parameters --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 and assembled into transcriptome annotation using StringTie75 with default parameters. Annotations were merged with StringTie --merge and novel transcripts with >50% TE coverage were removed. Mtr4 cKO (GSE247848)51, considering only 21-day samples, and Dicer1 cKO (GSE57514)54 samples were mapped to mm10 with fastp and STAR as described and quantified with de novo oocyte transcriptome using proActiv24, with previously described filtering criteria. For mES cells, the previously described Exosc3 cKO and WT in mES cell and EpiLC transcriptome was used. Exosc3 cKO (GSE205211)23 and Dicer1 cKO (GSE256381)105 mES cell RNA-seq samples and their respective controls were filtered with fastp, mapped to the de novo transcriptome with STAR and quantified with proActiv as described above.

Evolutionary analysis of TE-chimeras

We used the 241-way mammal alignment to investigate the age of TE-derived promoters. The TSS regions ± 100 bp were defined as promoters and aligned to each of the other 240 mammals. Promoters with more than 100 alignable base pairs in a specific mammal were considered present in the mammal. For human promoters, the age was defined as follows according to their presence in other mammal classes: (1) human-specific; (2) great apes; (3) apes; (4) old-world monkeys; (5) new-world monkeys; (6) lemurs; (7) rodents; and (8) other eutherian mammals. We did the same for mouse promoters, dividing them into the following classes: (1) mouse-specific; (2) murine; (3) Cricetidae; (4) Dipodidae; (5) rodents; and (6) other eutherian mammals. The Wilcoxon rank-sum test was used for generating P values.

Statistical significance

Unless otherwise noted, asterisks represent the following convention for statistical significance: NS (not significant), P > 0.05; *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001 and ****P ≤ 0.0001.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.