Abstract
Transposable elements (TEs) in the human genome are the heritage of ancient parasitic infections. While most of human DNA comprises TEs and TE-derived elements, their repetitive nature poses technical challenges; thus, little is known about their positional identity and regulatory roles. Here, by integrating long-read and multidimensional transcriptional analyses, we investigate when, where and how TEs become part of a gene. We characterize how TE-derived isoforms change across mouse–human variation and how they are linked to gene regulatory networks controlling cell states during differentiation, organogenesis and health (aging and pathological states). Mechanistically, we identify an RNA degradation-dependent and splicing-dependent quality control mechanism that operates independently of conventional mechanisms of TE suppression, such as DNA methylation and heterochromatinization, and prevents TE-chimera expression and TE-induced cell differentiation. Overall, our findings unveil mechanisms by which viral-derived elements enhance transcriptome plasticity.
Main
Throughout evolution, hosts and transposable elements (TEs) have coevolved, engaging in an evolutionary ‘arms race’ between TE invasion within host genomes and the defensive mechanisms aimed at limiting their expansion1,2,3. Novel TE insertions can produce multiple deleterious effects on the host cell, such as promoting genomic instability through insertion and recombination, producing nucleic acids or proteins that are toxic, disrupting gene expression through insertion in genic sequence or regulatory elements and altering gene expression through the TE’s own regulatory elements2,4. Thus, hosts have developed a wide variety of mechanisms to silence TEs. Epigenetic regulation by DNA methylation is one of the most common mechanisms used by host genomes to suppress TE expression and activity. A second layer of silencing is mediated by trimethylation of histone 3 lysine 9 (H3K9me3)5, which is generally controlled by histone methyltransferase complex SETDB1 (refs. 6,7). In mammals, this complex is commonly targeted to TE loci by a family of transcription factors (TFs) known as KRAB-containing zinc-finger proteins8. While these and a few other cell-type-specific mechanisms9 limit deleterious TE activity in most organisms, there are many examples of mutualistic events where TEs have important roles in host biology. TEs are particularly active in early development, where they have been shown to help drive zygotic gene activation across multiple organisms. High expression levels in these early developmental stages are likely to promote inclusion of new TE copies in the germline and, thus, vertical inheritance3. Once inherited, TE-derived genomic sequences can be expressed as distinct transcription units or function as new noncanonical regulatory elements such as enhancers and promoters for host genes10,11,12,13,14. Through this process, referred to as co-option, exaptation or domestication15, TEs can confer a positive effect on organism fitness. 
Domestication of TEs also involves the possibility that a TE-derived sequence becomes part of a host mRNA, generating novel chimeric host–viral transcripts16,17. This process is referred to as TE exonization. Many previous attempts have been made to characterize TE exonization12,16,18, even recently19,20,21, but such efforts were limited to characterizing this process in individual conditions and relied primarily on short-read sequencing. More importantly, the mechanisms and regulatory events by which TE exonization gives rise to TE-chimeric genes (TE-chimeras) are poorly defined. Here, we provide a global view of this process by (1) defining TE exonization events with long-read sequencing in human and rodent cells, tissues and organs; (2) characterizing the evolutionary history of exonization and its impact on human genetic variation (health and disease states); (3) discovering an RNA degradation-dependent and a splicing-dependent mechanism that prevents TE-chimera expression; and (4) showing how these two mechanisms converge in stem cells to prevent TE-induced enhancement of cellular potency.
Results
Cartography of TE-derived genes in mouse and human
We and others have benchmarked the combined usage of long-read and short-read sequencing to analyze the complexity of transcriptomes22,23. To identify TE exonization events, we performed isoform-resolved long-read single-molecule real-time sequencing (PacBio Iso-seq) in mouse embryonic stem cells (mES cells) and epiblast-like cells (EpiLCs) and analyzed the data in combination with paired short-read RNA sequencing (RNA-seq)23. Because of the complexity of the data, we used a top-down approach in which we first used our Iso-seq and RNA-seq datasets to assemble an isoform-resolved reference of the transcriptome, wherein isoform diversity is annotated with full-length reads and high-confidence splice junctions. We then scanned detected transcripts for TE exonization events, novel gene loci and valid open reading frames (ORFs).
An overview of the strategy is shown in Extended Data Fig. 1a. By contrasting our novel isoform-resolved reference transcriptome with the mm10 Ensembl transcriptome release (v102) for the mouse genome assembly, we identified 5,807 novel RNA species comprising both coding and noncoding transcripts (Fig. 1a). These include both novel isoforms of known genes and novel isoforms transcribed from novel genes antisense to known genes (novel antisense genes) or in loci that do not overlap any known gene (novel intergenic genes). By cross-referencing the isoform information with the genomic locations of integrated TEs, we then identified all instances of TE exonization in our novel transcriptome. We found that, although short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs) both outnumber long terminal repeat-containing TEs (LTRs) in terms of genomic copies (Extended Data Fig. 1b,c), exonization events are predominantly driven by LTRs and located at the 5′ end of a given transcript rather than at internal or 3′ terminal locations (Fig. 1b,c). This suggests a unique cis-regulatory potential of TEs that can directly drive the transcription and exonization into a downstream region. Approximately half of these, hereafter called TE-chimeras, are novel and not annotated in the current mm10 Ensembl transcriptome (Extended Data Fig. 1d), indicating that many 5′-TE exonization events have been missed because of a lack of technical resolution that is now provided by applying mixed-read sequencing.
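As an illustration of the cross-referencing step described above, the following minimal sketch labels where a TE falls within a transcript. It is not the pipeline used in this study; the function names, the half-open interval convention and the labels `5prime`, `internal` and `3prime` are assumptions made for the example, and a single-exon transcript is simply treated as a 5′ event.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_te_exonization(exons: List[Interval], strand: str, te: Interval) -> List[str]:
    """Label each exon that overlaps the TE by its position in the transcript.

    Exons are put in transcript order (reversed for the minus strand), so the
    first exon is the 5' end and the last exon is the 3' end.
    """
    ordered = sorted(exons, reverse=(strand == "-"))
    last = len(ordered) - 1
    labels = []
    for i, exon in enumerate(ordered):
        if not overlaps(exon, te):
            continue
        if i == 0:
            labels.append("5prime")      # hypothetical label
        elif i == last:
            labels.append("3prime")      # hypothetical label
        else:
            labels.append("internal")    # hypothetical label
    return labels
```

For a three-exon transcript on the minus strand, a TE overlapping the exon with the highest genomic coordinates is reported as a 5′ event, which is the configuration that predominates for LTRs in Fig. 1b,c.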
a, Classification of novel isoforms found in WT mES cell Iso-seq dataset based on known isoforms and protein-coding predictions. Classifications were generated with SQANTI3 (ref. 44) and are based on the mm10 Ensembl transcriptome. Left, depictions of the five classes of novel transcript are shown: ‘novel not in catalog’ transcripts, novel exon combinations with at least one novel exon; ‘novel in catalog’ transcripts, novel exon combinations of known exons; ‘intergenic’ transcripts, novel transcripts that do not overlap known transcripts; ‘antisense’ transcripts, transcripts that overlap known transcripts but on the antisense strand; ‘incomplete splice match’ transcripts, novel transcripts that are fragments of known transcripts. Middle, bar graph depicting the amount of each class found, with n denoting the total number of transcripts within each category. Right, coding predictions generated with CPAT45, using the default threshold of 0.44 to indicate the presence of a valid ORF. Box plots indicate the median, upper quartile and lower quartile for each distribution. Outliers are indicated with dots (outside 1.5× the interquartile range from the upper and lower quartiles). b, Relative position within a transcript of TE sequences in WT mES cell isoforms. Only the top ten TEs in each class are listed. Bars are colored by the proportions of different locations of the exonization events: 5′, internal and 3′ exonizations. c, Relative location of TE sequences in WT mES cell isoforms within a transcript in exonized chimeras, sorted by TE class. d, Change over time of TE-chimera expression in the indicated anatomical region in mice. Bar graphs indicate the gain and loss of individual 5′ chimeras from one time point to the next. Line graphs indicate the total number of chimeras in each time point. e, Proportion of TE-chimeras in human and mouse belonging to each TE class found in organogenesis dataset. 
f, Distribution of number of tissues with an active TE-promoter in organogenesis dataset. g, Distribution of tissue specificity of active TE-promoters in organogenesis dataset. h, Heat map displaying the relative promoter activity of TE-chimeric promoters across all RNA-seq samples spanning multiple organs and developmental time points in human organogenesis. i, Genome browser snapshots displaying representative short-read bulk RNA-seq pileups for ovary (top) and liver (bottom) samples, alongside host and TE-chimeric promoters, their respective isoforms and the genomic location of TEs.
We then applied the same mixed-read sequencing strategy to mouse organs at different time points of development and generated a compendium of mouse TE-chimeras (Extended Data Fig. 1e,f and Supplementary Table 1). The resulting temporal analysis of TE-chimeras in aged tissue samples indicated notable differences in the class of TE used (Fig. 1d and Extended Data Fig. 1f), whereby SINEs were the most co-opted class in the brain structures analyzed, while LTRs were the top class in nonbrain samples. The temporal analysis also revealed the identity of TE-chimeras across tissues and time points (Supplementary Table 2). These results suggest that different TEs drive the expression of TE-chimeras and that this expression changes between tissues and across time.
We then leveraged our combined long-transcript and short-transcript annotations and sought to determine the extent to which TE-chimera expression changes during organogenesis in human tissues using public data. To filter out any potential false positives caused by the repetitive nature of TEs, we used tissue-specific transcriptome assembly along with proActiv24, which uses spliced reads to accurately quantify promoter activity both in absolute terms and relative to the other promoters driving canonical isoforms. We filtered promoters within TE regions (TE-promoters) using both metrics to ensure that the TE was actively promoting transcription and that the resulting TE-chimeras made up a sizable proportion of the corresponding gene’s expression (Methods).
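The two-metric filter can be sketched as follows. This is a schematic illustration only and does not call proActiv; the function names and the cutoffs `min_absolute` and `min_relative` are hypothetical placeholders (the thresholds actually used are given in Methods), and relative activity is assumed here to be a promoter's share of the summed activity of all promoters of the same gene.

```python
from typing import Dict, List, Set


def relative_activity(absolute: Dict[str, float]) -> Dict[str, float]:
    """Share of each promoter in the summed activity of one gene's promoters."""
    total = sum(absolute.values())
    if total == 0:
        return {p: 0.0 for p in absolute}
    return {p: a / total for p, a in absolute.items()}


def passing_te_promoters(
    absolute: Dict[str, float],
    te_promoters: Set[str],
    min_absolute: float = 1.0,   # hypothetical cutoff, not the study's value
    min_relative: float = 0.25,  # hypothetical cutoff, not the study's value
) -> List[str]:
    """Keep TE-promoters that are active and contribute a sizable share."""
    rel = relative_activity(absolute)
    return [
        p for p in absolute
        if p in te_promoters
        and absolute[p] >= min_absolute
        and rel[p] >= min_relative
    ]
```

Requiring both metrics means that a TE-promoter with measurable but marginal activity next to a dominant host promoter is excluded, as is a TE-promoter that dominates a gene whose overall expression is negligible.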
Our comparative (human versus mouse) analysis of seven organs (forebrain, hindbrain, heart, kidneys, liver, testis and ovary) at multiple time points during fetal development and aging25,26 uncovered 2,419 and 1,107 TE-promoters in the human and mouse datasets, respectively (Extended Data Fig. 1g). Each TE-promoter gives rise to, on average, two TE-chimeras (Extended Data Fig. 1h,i). We found that, overall, LTRs were responsible for generating the most TE-chimeras, followed by LINEs and SINEs (Fig. 1e). LTRs were overrepresented among TE-promoters expressed in only 1–3 tissues, while SINEs were the most abundant across multiple tissues (Extended Data Fig. 1j). We also found that the distributions of TE-chimeras were remarkably tissue specific in both human and mouse data, as over 50% of TEs drove robust chimeric expression in a single tissue in both mouse and human datasets (Fig. 1f and Extended Data Fig. 2a). In both species, the testis possessed the highest TE-promoter activity, most of which was not found in other tissues (Fig. 1g). Similarly, in both species, the liver possessed the second largest number (Fig. 1g). By investigating the protein-coding potential of the TE-chimeras, we found that over 80% were noncoding in both the human and the mouse transcriptome (Extended Data Fig. 2b). The noncoding nature of these TE-chimeras was broadly shared across chimeras generated from different TE classes and expressed in different tissues (Extended Data Fig. 2b–e). Interestingly, TE-chimeras expressed in the mouse brain displayed higher protein-coding probabilities than those expressed in other organs (Extended Data Fig. 2e). We identified a class of organogenesis TE-promoters that were broadly expressed in multiple tissues; thus, we investigated whether they were shared between homologous genes in human and mouse. We found little mouse–human overlap in organogenesis (Extended Data Fig. 2f), while orthologous genes in embryogenesis were more prevalent (Extended Data Fig. 2g,h).
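The tissue-specificity summary reported above (number of tissues per active TE-promoter, and the fraction active in a single tissue) amounts to a simple tally. The sketch below assumes a hypothetical input mapping each TE-promoter to the set of tissues in which it passed the activity filters; the names are illustrative and not taken from the study's code.

```python
from collections import Counter
from typing import Dict, Set


def tissue_count_distribution(active: Dict[str, Set[str]]) -> Counter:
    """Count TE-promoters by the number of tissues in which they are active.

    Promoters that are not active in any tissue are ignored.
    """
    return Counter(len(tissues) for tissues in active.values() if tissues)


def single_tissue_fraction(active: Dict[str, Set[str]]) -> float:
    """Fraction of active TE-promoters found in exactly one tissue."""
    dist = tissue_count_distribution(active)
    total = sum(dist.values())
    return dist[1] / total if total else 0.0
```

A value above 0.5 from `single_tissue_fraction` would correspond to the observation that over half of TEs drive chimeric expression in a single tissue.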
We then investigated how TE-promoter activity varies across organogenesis and postnatal development and found that, broadly, the activity was higher in postnatal samples when compared to their fetal counterparts (Extended Data Fig. 2i). We additionally clustered TE-promoters by their relative activity across individual samples spanning multiple fetal and postnatal stages, identifying both commonly activated TE-promoters and clusters of TE-promoters displaying tissue-specific activities across stages and organisms (Fig. 1h and Extended Data Fig. 2j). Representative TE-chimeras specifically expressed in either fetal or postnatal development are shown in Fig. 1i. Overall, we found that TE-chimeras comprise both tissue-specific and common transcriptional events, with organ expression patterns that are shared between human and mouse.
TE-chimeras in health and disease
Our next goal was to evaluate the regulation of TE-chimera expression across human variation and disease. We focused these analyses on the ~900 individuals and 37 tissues available in the Genotype-Tissue Expression project (GTEx)27. We identified 739 ‘high-confidence’ TE-chimeras (Supplementary Table 3) by leveraging the combined annotations of long-read and short-read sequencing and filtering on the basis of the activity of known TE-promoters and presence in at least 80 individuals (Methods). Identification of transcripts informed by the isoform-resolved data expanded the number of TE-chimeras detected by 26% (Extended Data Fig. 3a,b). Most of the TE-chimeras found were derived from LTRs (Extended Data Fig. 3c), where highly expressed transcripts showed tissue specificity (Extended Data Fig. 3d). TE-chimera expression varied significantly across organs and the most highly expressed TE-chimeras were observed in the muscle and testis (Fig. 2a). Aging appeared to regulate TE-chimeras, where expression in whole blood and brain regions decreased in older individuals (>50 years old) but was increased in peripheral tissues (Fig. 2b).
a, Heat map of class of TE (y axis) across all human tissues (x axis), showing the number of TE-chimeras passing detection thresholds (top; Methods) and the mean expression (TPM; middle). Bottom, bar chart showing the total number of TE-chimeras per tissue (summed across individuals). b, Individuals in GTEx were binned into age categories (over or under 50 years) and Wilcoxon signed-rank test P values based on comparisons were calculated using a two-sided test and adjusted by the FDR statistic (accounting for the number of organs). c, Expression (TPM; x axis) of either the TE-chimeric isoform ENST00000426261.6 or all other transcripts (mean TPM) corresponding to LINC02693 in aorta (lincRNA). The top ten expressors in each category are colored blue or red, where none overlap between the two. The bounds of each box indicate the 25th and 75th percentiles and the whiskers extend from the minimum (0 for both lincRNA and TE-chimera) to maximum (2.49 for lincRNA and 2.72 for TE-chimera). The centers of the boxes are 0.51 and 0.78, respectively. d, Scatter plot of expression (TPM) for the same TE-chimera ENST00000426261.6 versus LINC02693, as in c. The P value was determined using two-sided Student’s regression. e, The top pathways based on GSEA in the Gene Ontology database for either LINC02693 (right) or ENST00000426261.6 (left), where the x axis shows the distribution of the bicor correlation coefficient. f, LTR exonizations were analyzed in TCGA in terms of their differential expression in tumor versus surrounding tissue (top) and survival prediction in tumor tissue (bottom). Color scales reflect the average log2 fold change (log2FC) in tumor versus surrounding tissue (purple–green) or whether the relative expression of LTR exonizations per individual showed increased or decreased survival prediction (blue–orange). 
Specifically, an orange tone shows instances where individuals with a high relative expression of LTR exonizations showed decreased survival (compared to low expressors). *P < 0.05, **P < 0.01 and ***P < 0.001 corresponding to a differential expression P value (tumor versus surrounding tissue) or log-rank test P value (survival). g, Top three pathways from Gene Ontology overrepresentation tests corresponding to the LTR-exonized genes that were significantly changed (adjusted P < 0.1) in oxaliplatin resistance32. Top three pathways from Gene Ontology overrepresentation tests corresponding to the LTR-exonized genes that were significantly changed (adjusted P < 0.1) in anti-PDL1 resistance31. P values were calculated using a nonparametric permutation test to build a null distribution of the enrichment score. From the distribution of P values, FDR corrections were made using the Benjamini–Hochberg method, resulting in adjusted P values.
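For readers unfamiliar with the Benjamini–Hochberg adjustment named in the legend, a minimal reference implementation is sketched below. It illustrates the standard step-up procedure only and is not the code used to generate the figure; the function name is an assumption.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downwards keeps the adjusted values
    monotone and capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An adjusted value below 0.1 corresponds to the significance threshold quoted in panel g.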
Next, we reasoned that examination of variation across populations could provide suggestive functional roles of TE-chimeras. For example, a TE-chimera (ENST00000426261.6) showed a significant anticorrelation with its cognate transcript (LINC02693), an aortic long intergenic noncoding RNA (lincRNA) with no known function (Fig. 2c,d). The genetically coregulated pathways correlated with each isoform, further suggesting a suppressive role of the TE-chimera (Fig. 2e).
Given the recent implications for TE-chimeras in disease28,29, we analyzed variation in The Cancer Genome Atlas (TCGA)30. This analysis showed that LTR-chimeras were almost uniformly upregulated in tumors when compared to surrounding tissue and that elevated tumor expression predicted poorer survival (Fig. 2f). We note that, while many significant relationships were observed in terms of survival prediction, these statistics were subject to sample size and variation in group measures. As these data showed a uniform pattern of exonization associated with cancer progression, we hypothesized that exonization events could be involved in cancer drug resistance mechanisms. We analyzed RNA-seq data collected from studies of anti-PDL1 therapy resistance31 and oxaliplatin resistance32. Comparison of resistant versus responding individuals showed LTR exonization events in pathways pertinent to the mechanisms of drug action (Fig. 2g and Extended Data Fig. 4a), suggesting that exonization could be involved in treatment resistance. In sum, these data highlight specific patterns of tissue-specific promoter TE-chimeras and show how LTR exonization changes with age, cancer progression, survival and treatment.
Transcription-coupled RNA degradation suppresses TE-chimeras
We and others have shown that RNA decay can suppress TE expression33,34. We, thus, hypothesized that RNA surveillance could also control TE-chimera expression. We performed long-read RNA-seq in wild-type (WT) mES cells and mES cells conditionally ablated for Exosc3, a subunit of the RNA exosome complex. Exosc3 is essential for RNA exosome-mediated RNA degradation35. This tamoxifen-induced conditional knockout (cKO) model (Extended Data Fig. 5a) enabled a quantitative and highly resolved characterization of the interplay between chimeric RNA anabolism and catabolism. TE-promoters identified by our mixed-sequencing analysis generate many coding and noncoding TE-chimeras, predominantly from LTRs (Fig. 3a and Extended Data Fig. 5b–e), that are more active in Exosc3 cKO compared to other TE and host promoters (Fig. 3b and Extended Data Fig. 5f). Among LTRs, ERVK, ERVL and ERVL-MaLR generate most of the TE-chimeras that are suppressed by the RNA exosome (Fig. 3c,d and Extended Data Fig. 5g). These results suggest that RNA degradation controls LTR-chimeras at the level of transcription, which is initiated at and extends past the 3′ ends of LTR elements and can be subject to splicing, and/or at the level of RNA stability.
a, Bar plots displaying number of TE-derived promoters, grouped by TE class. b, Scatter plots displaying promoter activities in WT and Exosc3 cKO, grouped by promoter type. c, Bar plots displaying number of LTR-derived promoters, grouped by LTR family. d, Box plot displaying promoter activities in WT and Exosc3 cKO, grouped by LTR family (n = 25, 142, 500 and 168 for ERV1, ERVK, ERVL and ERVL-MaLR, respectively). Gypsy was discarded because of a small number of promoters detected in c. The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (P = 1.336401 × 10−1, 3.122109 × 10−10, 4.482375 × 10−81 and 6.809815 × 10−13 for ERV1, ERVK, ERVL and ERVL-MaLR, respectively). e, Pie charts displaying proportions of LTR promoters grouped by differential RNAPII enrichment in Exosc3 cKO relative to WT. f,g, Genome browser snapshots displaying RNAPII (8WG16) ChIP-seq and short-read total input RNA-seq pileups for WT and Exosc3 cKO, alongside known and novel isoforms identified from the PacBio Iso-seq and the genomic location of TEs in the mouse mm10 genome. A novel isoform for Nelfa (f) and a novel gene (g) derived from LTRs and upregulation in Exosc3 cKO are shown. h, Strand-specific metagene plots of stable RNA (from RNA-seq) (left) and nascent RNA (from metabolic labeling) (right) across chimeric LTRs, grouped by differential RNAPII enrichment state upon Exosc3 cKO. The s.e.m. is shown as a shaded area around the mean curve.
To discriminate between these two events, we stratified host and LTR promoters by RNA polymerase II (RNAPII) level in WT and Exosc3 cKO. We used the previous RNAPII ChIP-seq data23 generated with the 8WG16 antibody that mainly targets initiating RNAPII (Fig. 3e). Although these datasets were obtained without spike-in normalization and, thus, cell–cell comparison can be suboptimal, we saw that about 80% of host promoters retained comparable levels of RNAPII in Exosc3 cKO, ~18% had increased RNAPII deposition and ~1.4% had lower levels in Exosc3 cKO. By contrast, around 52% of LTR promoters displayed a significant increase in RNAPII levels. An increase in RNAPII was associated with an increase in promoter activity (Extended Data Fig. 5h), suggesting that increased transcription initiation leads to induction of LTR-chimera expression in Exosc3 cKO.
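The stratification of promoters by RNAPII change can be pictured with the sketch below. It is a simplified stand-in: the study calls significant changes from ChIP-seq data, whereas this example uses a plain log2 fold-change cutoff, and the cutoff `min_abs_log2fc`, the pseudocount and the class labels are all assumptions introduced for illustration.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def classify_rnapii_change(
    wt: float,
    cko: float,
    min_abs_log2fc: float = 1.0,  # hypothetical cutoff
    pseudocount: float = 1.0,     # avoids division by zero; an assumption
) -> str:
    """Label a promoter by its RNAPII signal change in cKO relative to WT."""
    log2fc = math.log2((cko + pseudocount) / (wt + pseudocount))
    if log2fc >= min_abs_log2fc:
        return "increased"
    if log2fc <= -min_abs_log2fc:
        return "decreased"
    return "comparable"


def class_proportions(signals: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Proportion of promoters in each class; values are (WT, cKO) signals."""
    n = len(signals)
    if n == 0:
        return {}
    counts = Counter(classify_rnapii_change(wt, cko) for wt, cko in signals.values())
    return {k: counts[k] / n for k in ("increased", "comparable", "decreased")}
```

Applied separately to host and LTR promoters, such proportions are what the pie charts in Fig. 3e summarize.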
ERVL accounts for the largest portion of the LTR promoters with more RNAPII enrichment (Extended Data Fig. 5i), as shown in the representative LTR-chimera from the Nelfa gene derived from murine endogenous retrovirus with leucine primer (MERVL) (Fig. 3f). We also identified a second class of LTR promoters, which account for ~48% of LTR promoters (Fig. 3e). This class is less dominated by ERVLs (Extended Data Fig. 5i) and had a significantly higher promoter activity in the Exosc3 cKO despite RNAPII levels being comparable (Extended Data Fig. 5h). This result suggests that the increase in these LTR-chimeras in the Exosc3 cKO is a result of RNA stabilization, as shown in a representative novel gene driven by RLTR13C3 (Fig. 3g).
To further investigate the relationship between RNA degradation and nascent transcription, we reanalyzed our nascent RNA-seq (metabolic labeling) dataset23, focusing on LTR dynamics. Our results indicate that (1) only LTR promoters with more RNAPII enrichment show an increase in nascent transcripts upon Exosc3 cKO (Fig. 3h, right) and (2) LTR promoters with equal levels of RNAPII display lower levels of stable RNA expression in WT compared to Exosc3 cKO (Fig. 3h, left), despite equal nascent transcription (Fig. 3h, right). These data support our hypothesis that LTR promoters with more RNAPII enrichment are regulated by transactivation and LTR promoters with equal RNAPII enrichment in WT and Exosc3 cKO are regulated by RNA degradation.
We then used Hi-C data in WT mES cells36 to map all LTR elements with respect to their topological position within the nucleus. Our data indicate that genomic bins including chimeric LTRs were more frequently positioned within the A compartment (positive PC1 values) compared to bins lacking chimeric LTRs (Extended Data Fig. 5j). Higher relative enrichment of chimeric LTRs was also found to be correlated with a stronger association within the A compartment (Extended Data Fig. 5k).
Position-dependent LTR functionalization
We reason that, because of their repetitive and multicopy nature in the genome, LTRs that generate chimeras must carry distinct features compared to LTRs that do not generate chimeras. We first examined the distinctive sequence features of promoter-proximal splice donor (SD) motif and polyadenylation site (PAS), which are closely embedded (SD-PAS) in many LTRs (Fig. 4a). By contrast, the presence of SD-PAS motifs near transcription start sites (TSSs) is a unique genomic architecture that is rarely found in host genes (Fig. 4b), as it signals for competing activities. We cloned a representative LTR, MT2_Mm (the long terminal repeat regions of MERVL elements), and analyzed whether this configuration is prone to transcription or early termination. RNA pulldown using the MT2_Mm canonical sequence (MT2 WT) indicated that this element, unlike a point mutant control (MT2 Cmt), is recognized by cleavage and polyadenylation machinery (Fig. 4c,d). PAS recognition in the promoter-proximal region is known to induce premature RNAPII termination37,38. This result is consistent with the fact that most LTR regions do not act as LTR promoters in the genome despite the high proportion of SD-PAS occurrence (Extended Data Fig. 6a) and provides evidence that sequence features alone cannot explain the biogenesis of LTR-chimeras.
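A scan for closely embedded SD-PAS motifs downstream of a start site can be sketched as follows. The motif definitions are assumptions chosen for illustration (a `GTRAGT`-like 5′ splice donor and the `AATAAA`/`ATTAAA` polyadenylation hexamers), as are the function names and the way the gap between motifs is measured; the 150-bp window and 10-bp spacing follow the values given in the legend of Fig. 4a,b. The criteria actually used are described in Methods.

```python
import re
from typing import List, Tuple

# Assumed motif definitions; lookaheads allow overlapping matches to be found.
SD_PATTERN = re.compile(r"(?=(GT[AG]AGT))")
PAS_PATTERN = re.compile(r"(?=(AATAAA|ATTAAA))")


def motif_spans(pattern: "re.Pattern[str]", seq: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of every motif occurrence in seq."""
    return [(m.start(1), m.end(1)) for m in pattern.finditer(seq)]


def has_sd_pas(seq_from_tss: str, window: int = 150, max_gap: int = 10) -> bool:
    """True if an SD and a PAS lie within max_gap bp of each other.

    seq_from_tss is the sense-strand sequence starting at the TSS; only the
    first `window` bases are searched. The gap is the number of bases between
    the two motifs (0 if they touch or overlap).
    """
    region = seq_from_tss[:window].upper()
    sd_spans = motif_spans(SD_PATTERN, region)
    pas_spans = motif_spans(PAS_PATTERN, region)
    for s0, s1 in sd_spans:
        for p0, p1 in pas_spans:
            gap = max(p0 - s1, s0 - p1, 0)
            if gap <= max_gap:
                return True
    return False
```

Running such a scan over LTR-derived and host TSSs separately yields the kind of proportions compared in Fig. 4b.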
a, Bar plot displaying proportions of top 16 TE elements with SD-PAS motifs within 10 bp of each other. b, Bar plot displaying proportions of TSS having SD-PAS motifs within 10 bp of each other, grouped by repeat-derived versus not. Occurrences were found between TSS and 150 bp downstream. c, Consensus MT2_Mm sequence having a 5′ SD with UGUA CFIm-binding site and poly(A) site. The poly(A) site mutation for RNA pulldown assay is indicated. d, Western blot analysis of RNA pulldown assays using the MT2_Mm canonical sequence (MT2 WT) or a point mutant control (MT2 Cmt). The experiment was performed with one biological replicate. e, Diagram displaying gene–LTR pairs: (1) genes and their overlapping intragenic LTRs and (2) genes and their nearest promoter-proximal upstream intergenic LTRs. f, Box plots displaying normalized expressions in WT for genes overlapping with intragenic LTR (n = 12,657 nonchimeric and 214 chimeric) or in proximity of promoter-proximal upstream intergenic LTRs (n = 11,801 nonchimeric and 67 chimeric), grouped by chimeric status. The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks and P values were calculated on the basis of an unpaired, two-sided Wilcoxon rank-sum test (P = 1.295968 × 10−51 for intragenic and 5.243340 × 10−25 for intergenic cases). g, Density plots displaying log2 fold changes of normalized expressions of genes overlapping with intragenic LTRs or in proximity of promoter-proximal upstream intergenic LTRs in Exosc3 cKO compared to WT, grouped by chimeric status. P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test. 
h, Box plot displaying average promoter-proximal antisense coverage from RNA-seq between genes and its nearest promoter-proximal upstream intergenic LTR in WT, grouped by chimeric status (n = 11,532 nonchimeric and 60 chimeric). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks and P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test (P = 8.133342 × 10−12). i, Density plot displaying log2 fold changes of average promoter-proximal antisense coverage from RNA-seq between gene and its nearest promoter-proximal upstream intergenic LTR in Exosc3 cKO compared to WT, grouped by chimeric status. P values were calculated using an unpaired, two-sided Wilcoxon rank-sum test. j, Strand-specific gene track displaying RNA-seq pileups across splice junctions for WT and Exosc3 cKO, alongside promoter-proximal antisense transcripts (gray), known genes in the mm10 Ensembl transcriptome (black), the genomic location of chimeric TEs (red) and known and novel isoforms identified from the PacBio Iso-seq data (blue and orange, respectively). k, Box plots displaying input-normalized average coverage of H3K9me3 (left) and average DNA methylation (right) in WT and Exosc3 cKO across chimeric LTRs defined in e (n = 286). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Outliers beyond these limits were removed for H3K9me3 and no outliers were observed for DNA methylation. P values were calculated using a paired, one-sided (left-sided) Wilcoxon signed-rank test (comparing Exosc3 cKO relative to WT; P = 0.9871791 for H3K9me3 and 0.999983 for DNA methylation). 
l, Model displaying cis-regulatory function of intragenic chimeric LTRs (top) and promoter-proximal upstream intergenic chimeric LTRs (bottom) and its dependency on genomic location and RNA exosome activity.
We then investigated the positional information of LTRs that give rise to chimeras (chimeric LTRs) and LTRs that do not (nonchimeric LTRs). Notably, we found that ~70% of chimeric LTRs were located proximal to promoters or inside genes, while ~59% of nonchimeric LTRs were located in distal intergenic regions (Extended Data Fig. 6b). We hypothesized that LTRs give rise to chimeras because they are in a genomic context that makes them prone to being transcribed. We further hypothesized that chimeric LTRs are unsilenced by the act of transcription occurring in their proximity, specifically transcription of (1) the endogenous gene from which the TE-chimera derives or (2) promoter-proximal antisense transcripts. To test this, we paired endogenous genes with (1) their overlapping intragenic LTRs or (2) their nearest promoter-proximal upstream intergenic LTRs (Fig. 4e and Methods). We found that genes that pair with chimeric LTRs are not only more highly expressed in WT (Fig. 4f) but also more upregulated in Exosc3 cKO compared to genes that pair with nonchimeric LTRs (Fig. 4g), indicating that genes close to chimeric LTRs are more transcriptionally active and regulated by RNA exosome.
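The two kinds of gene–LTR pairs can be illustrated with the sketch below. It is a simplification under stated assumptions: the `Feature` type, the function names and the `max_distance` cutoff for "promoter-proximal" are hypothetical (the criteria used are in Methods), coordinates are half-open, and an upstream LTR is only checked for not overlapping the paired gene rather than for being intergenic with respect to every annotated gene.

```python
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int  # half-open [start, end)
    end: int
    strand: str = "+"


def intragenic_ltrs(gene: Feature, ltrs: List[Feature]) -> List[Feature]:
    """LTRs that overlap the gene body on the same chromosome."""
    return [
        l for l in ltrs
        if l.chrom == gene.chrom and l.start < gene.end and gene.start < l.end
    ]


def nearest_upstream_ltr(
    gene: Feature,
    ltrs: List[Feature],
    max_distance: int = 5000,  # hypothetical promoter-proximal cutoff
) -> Optional[Feature]:
    """Closest non-overlapping LTR upstream of the gene's 5' end, if any."""
    best, best_distance = None, None
    for l in ltrs:
        if l.chrom != gene.chrom:
            continue
        # Upstream means lower coordinates on "+" and higher coordinates on "-".
        distance = gene.start - l.end if gene.strand == "+" else l.start - gene.end
        if 0 <= distance <= max_distance and (best_distance is None or distance < best_distance):
            best, best_distance = l, distance
    return best
```

Each gene then contributes at most one upstream pair and any number of intragenic pairs, which is the grouping compared by chimeric status in Fig. 4f,g.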
We next reasoned that promoter-proximal antisense transcripts could provide a more transcriptionally amenable environment, thereby allowing chimera generation from promoter-proximal intergenic LTRs. To test this, we quantified the antisense signal between genes and their nearest promoter-proximal upstream intergenic LTRs. Consistent with a higher transcriptional activity of genes paired with chimeric LTRs, we found a higher antisense signal between a given gene and its nearest chimeric LTR in both RNA-seq (Fig. 4h) and nascent RNA-seq (metabolic labeling; Extended Data Fig. 6c) in WT, compared to genes and their nearest nonchimeric LTR counterparts. Furthermore, Exosc3 cKO caused an increase in stable and nascent antisense RNAs (Fig. 4i and Extended Data Fig. 6d), supporting the idea that transcription and possibly a cis-effect of the stabilized nascent transcript promote LTR-chimera expression. Overall, these results correlate promoter-proximal antisense RNA and cis-regulatory activity of chimeric LTRs, as shown in a representative strand-specific gene track where we detected an increase in promoter-proximal antisense RNA spanning a chimeric LTR (ORR1A2) along with increased expression of novel LTR-derived isoforms in the sense strand in Exosc3 cKO (Fig. 4j). Lastly, reanalysis of our previous dataset23 demonstrated that Exosc3 cKO does not show a reduction in the level of H3K9me3 (with the caveat that spike-in normalization was not performed) or DNA methylation across chimeric LTRs compared to WT, indicating that upregulation of chimeric LTR expression is not caused by the loss of these epigenetic marks (Fig. 4k). Taken together, we propose that transcriptional and RNA degradative activity near LTR elements guides LTR functionalization (Fig. 4l).
Inhibition of RNA degradation and splicing promotes TE functionalization
As the RNA exosome is targeted to its RNA substrates by different cofactors39,40,41, we addressed their role in controlling both TE expression and TE functionalization in chimeric transcripts. We performed loss-of-function experiments (Wdr82 knockdown (KD); Extended Data Fig. 7a) along with analysis of previous datasets of targeted depletion of NEXT (Rbm7 and Zcchc8 KD)23, PAXT (Zfc3h1 KO)33 and Integrator (Ints11 KD)23. Notably, all perturbations caused a significant upregulation of TEs, predominantly LTRs, similar to that in Exosc3 cKO23 (Extended Data Fig. 7b,c). The cofactor TE and gene expression signatures were positively correlated with that of Exosc3 cKO (Extended Data Fig. 7d). Upregulation of LTRs but not of other classes of TEs was also accompanied by a significant activation of chimeric promoters (Fig. 5a). In clear contrast, KD of Cpsf2 showed neither upregulation of LTRs nor activation of LTR promoters (Fig. 5a and Extended Data Fig. 7c).
a,b, Heat map displaying median log2 fold change of TE-promoter activity in each depletion condition targeting NEXT, PAXT, Integrator, Restrictor (a) and spliceosome (b), compared to control, grouped by TE class. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test for Zfc3h1 KO and a paired, two-sided Wilcoxon signed-rank test for all the rest. c, Box plot showing log2 fold changes of LTR promoter activity for U1 AMO, Exosc3 cKO transfected with Scr AMO and double inhibition condition compared to Scr AMO-transfected WT (n = 687). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (P = 4.772289 × 10−70 for U1 AMO versus Exosc3 cKO, 9.078507 × 10−74 for U1 AMO versus double inhibition condition and 2.566191 × 10−27 for Exosc3 cKO versus double inhibition condition). d, PCA showing global transcriptomic differences between perturbed (red) and control (black) conditions. Projections were calculated on the basis of the top 2,000 most variable known protein-coding genes across the samples. WT and control siRNA-treated samples were considered as control and the rest (Exosc3 cKO, AMO treatment and siRNA-mediated splicing inhibition42) were considered as perturbed. As input for PCA, expression counts were first adjusted with ComBat-seq71 to correct for clustering driven by potential technical confounders, then normalized and variance-stabilized with DESeq2 (ref. 72). Both Exosc3 cKO and Exosc3 cKO transfected with Scr AMO were used to assess the effect of Exosc3 cKO. 
e, Heat map representing the median log2 fold change for protein-coding genes with 1–2 exons or >2 exons in each perturbation relative to the corresponding control. All expressed genes in each comparison were considered. f, Heat map presenting log2 fold change of MERVL-int and associated TFs (Dux (Duxf3), Zscan4d and Obox5) in each perturbation indicated. For e,f, U1 AMO, Exosc3 cKO treated with Scr AMO and double inhibition were compared to WT treated with Scr AMO; Snrpb and Snrpd2 KD were compared to control siRNA-treated samples. PlaB represents the differential gene expression test in the dataset generated in mES cells cultured with splicing inhibitor (PlaB) across passages42. A differential gene expression test was designed to compare the effect in the late passages (passages 4–6) over the early passages (passages 0–2). The log2 fold changes calculated by DESeq2 (ref. 72) were used and asterisks represent nominal P values from DESeq2 (ref. 72) (f). g–l, RT–qPCR analysis of MERVL-int (g,j), Zscan4 (h,k) and Nelfa chimera (i,l) transcripts upon transfection of vector expressing GFP, Dux-FL and Dux-S in WT (g–i) or Exosc3 cKO (j–l). Data indicate the mean of two replicates, with individual values shown. m, Bar plots showing the average counts normalized by DESeq2 (ref. 72) of MERVL-int, MT2_Mm and Zscan4d across three replicates (each point), with asterisks representing nominal P values from DESeq2 (ref. 72) (for MERVL-int, P = 4.576920 × 10−172 for WT and below the limits of double-precision floating-point arithmetic for Exosc3 cKO; for MT2_Mm, P = 7.582753 × 10−77 for WT and 1.337273 × 10−170 for Exosc3 cKO; for Zscan4d, 8.882704 × 10−10 for WT and 1.888064 × 10−25 for Exosc3 cKO). n, Box plot representing LTR-chimera promoter activities in WT and Exosc3 cKO with Scr and MERVL ASO treatment conditions (n = 440). The box hinges represent the 25th and 75th percentiles and the middle line represents the median. 
Whiskers extend from the hinges to the most extreme values within 1.5× the interquartile range. Data beyond these limits are outliers. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (P = 3.499736 × 10−6 for WT with Scr ASO versus WT with MERVL ASO, 1.644941 × 10−49 for WT with Scr ASO versus Exosc3 cKO with Scr ASO and 5.236761 × 10−17 for Exosc3 cKO with Scr ASO versus Exosc3 cKO with MERVL ASO). o, Volcano plot displaying differentially expressed genes and TEs in Exosc3 cKO treated with MERVL ASO versus Scr ASO. Color coding represents the gene set, with purple indicating 2CLC genes and gray indicating other genes. P values were generated with the DESeq2 (ref. 72) R package, applying a two-sided test and FDR correction (***P < 0.0001). p, GSEA of 2CLC genes from the differential expression signature in Exosc3 cKO treated with MERVL ASO compared to that with Scr ASO. P values were generated using GSEA (clusterProfiler73 R package), which uses a two-sided testing approach. NES, normalized enrichment score.
We then addressed whether other factors that control TE expression might also influence TE-chimera generation. We focused on splicing factors that have been shown to suppress MERVL expression42,43 and replicated our analyses focusing on TE and TE-chimera promoter activity. KD of Eftud2, Isy1, Lsm4, Snrpb and Snrpd2 (ref. 42) resulted in the upregulation of a considerable number of TEs, among which LTRs represented the largest group (Extended Data Fig. 7e). The upregulation of TEs and genes seen upon loss of defined splicing factors was positively correlated with that in Exosc3 cKO (Extended Data Fig. 7f) and caused a significant activation predominantly at LTR promoters (Fig. 5b). We also validated that KD of Snrpd2 increased the representative LTR-chimeric transcript (Nelfa chimera) (Extended Data Fig. 7g–i). Another factor we tested, Srsf7, did not phenocopy Snrpd2 (Extended Data Fig. 7j–l), a result that we cannot explain, possibly pointing to indirect effects or unique splicing factor vulnerabilities44.
Notably, the LTR promoters transactivated in Exosc3 cKO were more highly activated upon spliceosomal repression compared to the LTR promoters with comparable level of RNAPII (Extended Data Fig. 7m). These results indicate that, while splicing is required for exonization of LTR-chimeras, reducing splicing activity transactivates LTR promoters. While this conclusion seems counterintuitive, it is known that reduced level of splicing activity can have a positive effect on some genes (aside from the expected repressive effect on most genes)42. Consistently, we found a positive correlation between expression signature of Exosc3 cKO and a previous splicing inhibitor, Pladienolide B (PlaB), treatment dataset42 for all time points (Extended Data Fig. 7n). To understand the epistatic relationship linking RNA degradation, splicing and exonization, we then performed inhibition of major and minor spliceosome activity through transfection of antisense morpholino oligonucleotide (AMO) in WT and Exosc3 cKO. Exosc3 cKO in the Scr AMO-transfected condition showed a significant correlation with that in the untransfected condition (Extended Data Fig. 7o) and U1 and U6atac AMOs were specific in inhibiting the major and minor spliceosome, respectively (Extended Data Fig. 7p–q). U1 AMO treatment caused an increase in LTR promoter activity in WT cells (Extended Data Fig. 7r); importantly, this effect was enhanced in Exosc3 cKO (Fig. 5c).
We then analyzed the transcriptome of cells in which RNA degradation or splicing was perturbed alone or in combination compared to control cells. Principal component analysis (PCA) of Exosc3 cKO along with the AMO-derived splicing inhibition dataset and splicing-factor-depleted dataset42 indicated that perturbed cells segregate away from controls and share considerable (38–92%) transcriptomic changes (Fig. 5d and Extended Data Fig. 7s), despite the targeted factors acting at different steps of the splicing process (Extended Data Fig. 7t). Because previous work has shown a gene-size-stratified effect caused by splicing inhibition42 and RNA degradation inhibition23, we separated protein-coding genes into short (1–2 exons) and long (>2 exons) groups. Strikingly, perturbations inhibiting RNA degradation and splicing caused upregulation of short genes (Fig. 5e and Extended Data Fig. 7u). We further identified an upregulation of a TE (MERVL-int) and associated TFs (Dux, Zscan4d and Obox) that are representative two-cell (2C) genes, which are known to regulate TE-chimera expression and enhance cell potency45,46,47,48 (Fig. 5f). Notably, the upregulation of MERVL-int, Dux and Zscan4d was enhanced by codepletion of the RNA exosome and major spliceosome activity (Extended Data Fig. 7v).
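The exon-count stratification described above can be illustrated with a minimal sketch. It assumes each gene is represented by its exon count and its log2 fold change in one perturbation; the function name and the toy values are hypothetical and are not taken from our data or code.

```python
from statistics import median

def stratified_median_lfc(genes):
    """Median log2 fold change for short (1-2 exons) and long (>2 exons) genes.

    `genes` is an iterable of (n_exons, log2_fold_change) pairs; this input
    layout is an assumption made for illustration only.
    """
    short = [lfc for n_exons, lfc in genes if 1 <= n_exons <= 2]
    long_ = [lfc for n_exons, lfc in genes if n_exons > 2]
    return {
        "short": median(short) if short else None,
        "long": median(long_) if long_ else None,
    }

# Toy values (hypothetical): short genes up, long genes roughly unchanged.
toy = [(1, 1.5), (2, 0.5), (1, 2.5), (5, -0.25), (12, 0.25), (3, -0.75)]
summary = stratified_median_lfc(toy)
```

In the heat maps, one such pair of medians would be computed per perturbation relative to its matched control.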
To further demonstrate the importance of Dux, we performed overexpression analysis. Vectors expressing GFP, Dux (Dux-FL) or Dux lacking the transactivation domain (amino acids 1–178; Dux-S) were transfected into mES cells (Extended Data Fig. 7w,x). Exosc3 cKO was induced by simultaneous tamoxifen treatment. In line with previous findings45,49, Dux-FL overexpression, unlike Dux-S, caused a significant upregulation of both 2C-related transcripts (MERVL-int and Zscan4) and the representative LTR-chimeric transcript under transactivation (Nelfa chimera) (Fig. 5g–l).
Moreover, reanalysis of the previous Dux ChIP-seq in mES cells45 and Obox1 Stacc-seq in late 2C mouse embryos47 showed a significant enrichment of Dux and Obox1 in chimeric LTRs. In contrast, Dux and Obox1 did not show a clear enrichment across nonchimeric LTRs, suggesting that these chimeric LTRs are subject to the Dux-dependent and MERVL-int-dependent transactivation (Extended Data Fig. 7y–z).
To determine the functional importance of MERVL-int, we used antisense oligonucleotide (ASO)-mediated degradation of MERVL50. We transfected WT or Exosc3 cKO with either MERVL-degrading ASOs or scramble control (Scr). MERVL ASO treatment, compared to Scr ASO treatment, was accompanied by a significant reduction, especially for Exosc3 cKO, in MERVL-int, MT2_Mm and Zscan4d expression (Fig. 5m) and LTR promoter activities (Fig. 5n). Notably, MERVL ASO treatment in Exosc3 cKO led to a dampened activation of 2C-like cell (2CLC) genes (Fig. 5o) and erasure of the gene network associated with enhanced cell potency (Fig. 5p). Our result indicates that suppression of MERVL-int expression is sufficient to revert the changes in transcriptomics and cellular plasticity caused by Exosc3 cKO.
Overall, our data suggest a model where chemical or genetic perturbation of splicing and RNA degradation converges into upregulation of short genes like MERVL-int (which is intronless), which instigate a gene regulatory program of enhanced cell potency and TE-chimera expression.
Control of TE-chimera expression in vivo
To understand whether RNA degradation controls TE-chimeras in vivo, we compared Exosc3 cKO mES cells with a conditional ablation of an essential exosome cofactor (Mtr4 cKO) in oocytes51, along with conditional deletion of Dicer1 in both mES cells and oocytes52,53,54. Oocytes express a set of tissue-specific MTA (ERVL-MaLR) TE-chimeras, and a pair of oocyte-specific MTA and MTC TE-chimeras in the Dicer1 gene was previously shown to regulate TE exonization events in mouse oocytes through RNA interference53. Mtr4 cKO oocytes overexpress LTR-chimeras in the ERVK family relative to controls but exhibit no overexpression of LTR-chimeras in the ERVL family (Fig. 6a). Meanwhile, Dicer1 cKO oocytes show the opposite pattern, indicating that, in mouse oocytes, ERVL chimeras are regulated through the RNA interference pathway, whereas ERVK chimeras are regulated by the RNA surveillance pathway. This contrasts with the Dicer1 cKO and Exosc3 cKO mES cells. mES cells do not express the Dicer1 TE-chimera responsible for allowing mouse oocytes to regulate ERVL chimeras through RNA interference. Furthermore, Dicer1 cKO mES cells do not have significant changes in the expression of either ERVL or ERVK chimeras (Fig. 6b). Exosc3 cKO mES cells have increased expression of TE-chimeras for both families of LTRs. Our model (Fig. 6c) indicates that, in vivo, different mechanisms operate specifically at defined TE classes to regulate cell-specific and spurious TE exonizations.
a, Expression profile of mouse oocytes in Dicer1 cKO and Mtr4 cKO. Mtr4 cKO in oocytes upregulates ERVK LTR-chimeras, whereas Dicer1 cKO in oocytes upregulates ERVL LTR-chimeras. b, Expression profile of mES cells in Dicer1 cKO and Exosc3 cKO. Exosc3 cKO upregulates both ERVL and ERVK LTR-chimeras, whereas Dicer1 cKO upregulates neither. For both a,b, one-sided (upregulated) Wilcoxon rank-sum tests were performed on change in promoter activity (average promoter activity of experiment − average promoter activity of control) between chimeric promoters and nonchimeric promoters with Bonferroni correction. c, Schematic of LTR-chimera regulation based on a,b. The RNA interference pathway regulates ERVL LTR-chimeras in oocytes through MTA/MTC Dicer1 chimeras. In the absence of these chimeras in mES cells, RNA surveillance regulates the expression of ERVL LTR-chimeras.
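The comparison described for panels a,b (change in promoter activity, one-sided rank-sum test, Bonferroni correction) can be sketched as follows. This is only an illustrative sketch: it uses a normal approximation with tie correction and no continuity correction, which may differ from the exact procedure of the statistical software used for the figure, and all function names are hypothetical.

```python
import math
from statistics import mean

def delta_activity(experiment, control):
    """Average promoter activity of experiment minus average of control."""
    return mean(experiment) - mean(control)

def rank_sum_greater(x, y):
    """One-sided Wilcoxon rank-sum P value for 'x tends to be larger than y'.

    Normal approximation with tie correction (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2     # Mann-Whitney U for x
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy deltas (hypothetical) for chimeric and nonchimeric promoters.
chimeric = [5.0, 6.0, 7.0, 8.0]
nonchimeric = [1.0, 2.0, 3.0, 4.0]
p_up = rank_sum_greater(chimeric, nonchimeric)
```

Here each element of `chimeric` or `nonchimeric` stands for one promoter's `delta_activity`, and one P value would be produced per TE family and condition before correction.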
Evolutionary analysis of TE-chimeras
While most TEs have lost the capacity to mobilize13, our data indicate that LTRs in the genome can promote the transcription of novel isoforms in known and novel gene loci. These data suggest that neogene birth is an ongoing process, contrary to the idea that numbers of genes in genomes are fixed. To test this, we performed an in-depth evolutionary analysis on TE sequences and their exonization.
Previously, as part of the Zoonomia project, we found that more than 85% of primate-specific human cis-regulatory elements originated from TEs55. In this study, we adapted our earlier approach to investigate TE-chimeras in humans, categorizing them on the basis of their evolutionary conservation levels. We began by investigating TE-chimeras conserved among apes and then extended our analysis to those conserved among apes and monkeys, followed by those conserved among apes, monkeys and lemurs and, finally, those conserved beyond primates. This stratification enabled us to analyze the TE-chimeras in relation to the evolutionary age of each transposon family. For example, LTRs and L1 elements are most active among apes, SINEs are most active among primates and L2 and DNA elements are most active in lineages older than primates. We performed a similar analysis for TE-chimeras in mice on the basis of the evolutionary conservation levels of their TEs, categorized into murine-specific TEs, rodent-specific TEs and TEs conserved beyond the rodent lineage.
We compared the TEs that generate chimeric transcripts in the sense orientation (sense chimeric TEs) against two sets of TEs as negative controls: (1) TEs that form chimeric transcripts in the antisense orientation (antisense chimeric TEs) and (2) TEs that are upstream and within 5 kb of the TSS of known gene loci but do not generate chimeric transcripts (nonchimeric TEs). We analyzed the compositions of TE families among the three groups of TEs in humans and mice separately. For the results presented below, we focused on the comparison with the first negative set (antisense chimeric TEs).
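As an illustration of the second negative set, the criterion 'upstream and within 5 kb of the TSS' can be written as a strand-aware check. The sketch below assumes that the TE must lie entirely upstream of the TSS and that distance is measured from the TE edge nearest to the TSS; both choices, and the function name, are assumptions for illustration rather than a description of our implementation.

```python
def is_upstream_within(te_start, te_end, tss, strand, window=5000):
    """True if a TE lies upstream of a TSS and within `window` bp of it.

    Coordinates are on the forward strand with te_start <= te_end.
    'Upstream' means lower coordinates for '+' genes and higher
    coordinates for '-' genes.
    """
    if strand == "+":
        distance = tss - te_end      # gap between TE 3' edge and TSS
    elif strand == "-":
        distance = te_start - tss    # gap between TSS and TE 5' edge
    else:
        raise ValueError("strand must be '+' or '-'")
    return 0 < distance <= window

# Toy coordinates (hypothetical).
near_plus = is_upstream_within(1000, 1200, 3000, "+")   # 1,800 bp upstream
far_plus = is_upstream_within(1000, 1200, 7000, "+")    # 5,800 bp upstream
near_minus = is_upstream_within(5000, 5200, 3000, "-")  # 2,000 bp upstream
```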
For humans, sense chimeric TEs are most enriched in four families when contrasted with antisense chimeric TEs: LTR12 (a subset of the ERV1 family of LTRs), ERVL (a family of LTRs), ERVL-MaLR (a family of LTRs) and SVA elements (Fig. 7a). When separating TEs by their evolutionary ages, it becomes apparent that TEs in the LTR12 and SVA families are most enriched in the ape-specific subset, ERVL and ERVL-MaLR are most enriched in the primate-specific subset and ERVL-MaLR are most enriched in the subset that is conserved beyond the primate lineage (Fig. 7a). These results indicate that the four families of TEs incurred separate waves of exonization events in mammalian genomes and, generally, the process of TE exonization accompanied major evolutionary expansion events.
a, Bar plots showing the proportions of TE families for sense chimeric TEs, antisense chimeric TEs and nonchimeric TEs that are upstream and within 5 kb of the TSS of known genes, as well as all TEs in the human genome. The analysis is presented for all TEs, as well as for TEs with three evolutionary conservation levels: ape-specific, primate-specific and conserved beyond the primate lineage. b, Analysis of evolutionary ages for ERVL, ERVL-MaLR, LTR12, L1 and L2, stratified by whether they are sense chimeric, antisense chimeric or nonchimeric, as in a. Randomly chosen intergenic non-TE regions were also used as controls. Left, bar plots showing the percentage of these TEs and non-TEs across different evolutionary ages, from human-specific (young; colored red) to eutherian mammal conserved (old; colored dark blue). Middle, box plots indicating the number of mammalian genomes (among a total of 240) that these TEs or genomic regions can map to. Boxes present the median, lower quartile and upper quartile, while whiskers denote the maximum and minimum after removing outliers (more than 1.5× the upper quartile or less than 1.5× the lower quartile). P values were generated using an unpaired two-sided Wilcoxon rank-sum test. Right, bar plots representing the number of sense chimeric and antisense chimeric TEs across different evolutionary ages.
For mice, sense chimeric TEs were most enriched in two families when contrasted with antisense chimeric TEs: ERVL and ERVL-MaLR (Extended Data Fig. 8a). Most of the ERVL chimeric TEs were murine specific, while some of the ERVL-MaLR chimeric TEs were murine specific and others were rodent specific, further suggesting different time periods for their exonization events. No enrichment of sense chimeric TEs was observed over antisense chimeric TEs beyond the rodent lineage (Extended Data Fig. 8a).
Given the enrichment of the above families of TEs in sense chimeras, we proceeded to analyze the evolutionary ages of individual members in each TE family, quantified as the number of mammalian genomes that the TE can be mapped to. For both ERVL and ERVL-MaLR families, in both humans and mice, the TEs that formed sense chimeras were significantly younger than the TEs in the same family that formed antisense chimeras (Fig. 7b and Extended Data Fig. 8b). Human TEs in the LTR12 family did not show a significant age difference between sense and antisense chimeras, likely because of the lack of statistical power in comparing their young evolutionary ages. In sharp contrast, a younger evolutionary age was not observed in LINEs (Fig. 7b) or in any other TE families for both human and mouse (Extended Data Fig. 9a). Overall, these results indicate that young LTRs are still under evolutionary pressure to drive sense chimeric exonization leading to novel genes.
Discussion
TEs are known to be ‘used’ in gene regulatory networks controlling cell identity (mainly during embryo development) but most work has focused on TE families because of the repetitiveness of these elements. Our long-term effort is to establish a framework to characterize TEs as genes. Here, we provide a mechanistic and cartographic analysis of novel genes and isoforms derived from TEs along with their positional identity that overall expands our transcriptome in a dynamic and temporal fashion. We describe TEs involved in cell-type-specific gene expression during organogenesis, aging and disease. As 95% of common disease-associated single-nucleotide polymorphisms (SNPs) reside in loci outside of coding genes, systematic analyses of human variation in TE sequences using technologies that can distinguish the singular identity of all TEs are warranted. In fact, while our study provides an initial attempt to define the molecular and positional identity of TE-chimeras, it remains limited by the lack of cell type and spatial resolution, as well as by our inability to map many extremely long multi-TE exonization events or terminal exonizations lacking a poly(A) tail. Moreover, on the basis of TE expression being cell type dependent, many mechanisms controlling TE transcription and exonization are likely controlled by cell-type-specific factors. Future work, supported by reduced sequencing costs or advances in technology, will enable broader application of the strategy used here to achieve in all cells an isoform-level resolution of genes and TEs. Lastly, TE and TE-chimera evolutionary analyses like the one performed here will be greatly enhanced by the growing availability of highly resolved, telomere-to-telomere genome assemblies from most species.
With respect to mechanism, our data are in line with the idea that many TEs are in genomic positions that avoid conventional epigenetic silencing. We show how sense and antisense transcription of a nearby unit (or the genic region giving rise to the chimera) can unsilence TEs and promote TE RNA synthesis. Our analysis indicates that perturbations of cotranscriptional and post-transcriptional events such as nuclear RNA degradation and splicing converge on the upregulation of TEs. This event is also associated with the establishment of enhanced stem cell potency, in line with previous observations23,42. In fact, in mES cells, inhibiting the spliceosome with PlaB can push cells from a pluripotent toward a totipotent state through a mechanism that appears to involve selective downregulation of pluripotency genes through inefficient splicing (they have many or large introns) while totipotency-associated genes (which often have fewer or shorter introns) remain more efficiently spliced and can be activated. On this basis and considering the reversion to totipotent-like cells caused by exosome loss, we propose that perturbation of splicing and RNA degradation causes upregulation of MERVL-int expression that sets in motion a gene regulatory network that transactivates TE-chimeras. This is consistent with previous work suggesting MERVL as a master regulator of enhanced cell potency50, which functions upstream of TE-chimera expression48.
Accordingly, we propose that the many chimeras generated in somatic tissue have been selected throughout evolution because the TE from which they originate contains cell-type-specific TF-binding sites that can transactivate them. Our data also indicate that spurious transcription could cause the birth of novel TE-chimeras. We hypothesize a model in which RNA degradation safeguards the inherent threat of TE functionalization en masse from the spurious transcription of numerous LTRs present in the genome, while the transactivation of these LTRs by cell-type-specific TFs can sustain and functionalize unique TE genes in the gene regulatory network of a given cell state.
Our model differs from the mechanisms by which LINE1s are silenced56,57 and also differs from the recent evidence that SAFB proteins control LINE exonization57. We note that analytical approaches to quantify tissue expression of exonization were consistent with previous methods (Extended Data Fig. 10a); however, promoter-derived TEs and regulation by EXOSC3, based on reanalysis of a publicly available dataset58, showed a distinct pattern of organ expression (Extended Data Fig. 10b–g). Our findings help to categorize the different types of silencing that target TEs on the basis of genomic features. Put simply, transcription-associated degradation is used for short TEs that function as regulatory elements.
Our model of how RNA degradation impacts TE transcription might have important consequences when considering genes in the context of an evolutionary timescale. The majority of the genome is transcribed59,60,61. For a considerable portion of the transcribed genome, transcription is coupled to degradation58. While regulatory mechanisms have been established to prevent transcription of undesirable transcriptional units, spurious transcription happens as an inherent byproduct of the mechanics of transcription itself62. In fact, RNAPII initiates transcription at a low rate on most nucleosome-free regions63 and mostly outside of known promoters at any given time in the cell64. TFs recognize 6–10 bp and responsive elements are created constantly under the constant mutational burden to which our genomes are subjected. Rather than eliminating the act of spurious transcription, cells have resorted to induce transcript degradation by the RNA quality control system. This is likely because spurious transcription is a primary source of genetic innovation. The road to genetic innovation is uphill for spurious transcripts originated by random DNA sequences. In fact, even if spurious transcripts are under neutral evolution65,66 and their presence is inconsequential to the cell and they are not ‘selected against’, they need to surpass a high resistance barrier before conferring an advantage, as they need to acquire many mutations that render them useful and counteract quality controls occurring at multiple levels after their transcription, such as RNA degradation, described here.
For viral-derived elements such as TEs, the energy barrier to functionalization is lower, as they already contain TF-binding sites and RNAPII promoter elements13,65,67,68,69. We surmise that TE-derived gene birth is the result of evolutionary tinkering—the opportunistic tendency of evolution where changes are based on pre-existing materials, leading to adaptations that may not be perfect but good enough to stay70.
Methods
PacBio Iso-seq library preparation and sequencing in mES cells and EpiLCs
Exosc3 Cre/lox conditional inversion (COIN) mouse pluripotent stem cells were a gift from the U. Basu lab. mES cell culture, EpiLC differentiation, Exosc3 cKO and RNA extraction were performed as previously described23. Two replicates were processed for each condition. Purified RNA was submitted to the Icahn School of Medicine Genomic Core facility for sequencing. Sequencing libraries were prepared using the SMARTer PCR complementary DNA (cDNA) synthesis kit (Clontech) per the manufacturer's recommendations. Oligo(dT) primers were used to capture full-length polyadenylated transcripts. cDNA was then size-selected and sorted into two bins of greater or less than 4 kb. These bins were pooled together at equimolar concentrations using the SMRTbell template preparation kit version 1.0. End-repaired and purified libraries were loaded onto a SMRTcell 1M, which was then sequenced on a Sequel I system with a 10-h movie.
PacBio Iso-seq error correction
Reads were imported into SMRTlink and IsoSeq3 software was used to obtain circular consensus sequences for error correction, yielding highly accurate reads. Lima was subsequently used to remove barcodes, SMART-seq primers and template-switching oligonucleotide sequences, further orienting isoforms in the correct 5′-to-3′ direction. Next, the refine command was used to remove poly(A) tails and concatemers.
WT and WT plus Exosc3 cKO mES cell and EpiLC Iso-seq analysis
Isoform-resolved transcriptomes were generated for WT and WT plus Exosc3 cKO samples using the following methodology. Refined long reads were mapped to the mouse mm10 genome (GRCm38.p6) using minimap2 with the following parameters: -ax splice -uf --secondary=no -C5 --MD. The Illumina bulk short-read RNA-seq dataset was obtained from our previous study23 and reads were mapped to the mm10 genome using STAR74 with --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. StringTie75 was used to generate the novel isoform-resolved transcriptome by providing both Illumina short-read RNA-seq and PacBio Iso-seq aligned BAM files using the --mix option and the default minimum predicted isoform abundance of 1% (-f 0.01). SQANTI3 (ref. 76) was used to classify isoforms compared to the Ensembl GRCm38 version 102 reference transcriptome. To produce a high-quality reference transcriptome, the following isoform models were removed: novel monoexonic transcripts (which are often products of transcript degradation or sequencing artifacts77), novel transcripts classified in the fusion and genic categories by SQANTI and novel transcripts that overlapped TE sequences by over 50% of their total length. The protein-coding potential for each novel isoform was predicted with CPAT using default parameters. Novel transcripts were considered protein coding if they possessed a valid ORF that passed the default CPAT filtering parameters (≥0.44 coding probability). Novel genes were considered protein coding if a randomly selected transcript was found to be protein coding. To generate the TE-chimera annotation, transcripts whose TSSs were located within TEs in either orientation were initially marked as TE-chimeras according to the RepeatMasker GTF file for the mm10 genome. In cases where multiple TEs overlapped with a single TSS, one TE was selected with duplicated = F in R and retained in the annotation (~1% of TE-chimeras). 
This annotation was further used to define promoter status, as described below. Novel transcription units identified as eRNAs and PROMPTs in a previous study23 were also included in the annotation.
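The TSS-in-TE rule used to mark TE-chimeras can be illustrated with a minimal sketch. It assumes simple in-memory records with 1-based inclusive coordinates and keeps the first overlapping TE in input order when several TEs cover the same TSS, which mimics a first-occurrence deduplication; the record types, function names and toy coordinates are hypothetical, and the linear scan stands in for the interval-overlap tooling that a genome-scale analysis would need.

```python
from typing import NamedTuple

class TE(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    name: str

class Transcript(NamedTuple):
    tx_id: str
    chrom: str
    start: int
    end: int
    strand: str  # '+' or '-'

def tss_position(tx):
    """TSS is the 5' end: start for '+' transcripts, end for '-' transcripts."""
    return tx.start if tx.strand == "+" else tx.end

def annotate_te_chimeras(transcripts, tes):
    """Map transcript ID -> name of the TE containing its TSS.

    TE orientation is ignored (either orientation counts). If several TEs
    overlap the same TSS, only the first one in `tes` is retained.
    """
    chimeras = {}
    for tx in transcripts:
        pos = tss_position(tx)
        hits = [te for te in tes
                if te.chrom == tx.chrom and te.start <= pos <= te.end]
        if hits:
            chimeras[tx.tx_id] = hits[0].name
    return chimeras

# Toy records (hypothetical coordinates).
toy_transcripts = [
    Transcript("tx1", "chr1", 100, 500, "+"),
    Transcript("tx2", "chr1", 100, 500, "-"),
    Transcript("tx3", "chr2", 100, 500, "+"),
]
toy_tes = [
    TE("chr1", 90, 120, "ORR1A2"),
    TE("chr1", 95, 110, "MT2_Mm"),
    TE("chr1", 480, 520, "MTA_Mm"),
]
toy_chimeras = annotate_te_chimeras(toy_transcripts, toy_tes)
```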
Human and mouse Iso-seq
The long-read RNA-seq dataset78 was scanned for chimeric transcripts by cross-comparing exon locations with annotations for repetitive elements in the genome. Transcripts whose TSS regions overlapped with repetitive elements were classified as TE-chimeras. The tissue-specific and line-specific expression level was estimated by mapping Illumina bulk RNA-seq reads78 to the respective novel transcriptome. Transcripts were counted as present in the sample if expression of the transcript exceeded one transcript per million mapped reads (TPM).
Organogenesis
For the human and mouse organogenesis RNA-seq dataset, raw short-read sequencing data were downloaded from ArrayExpress accession codes E-MTAB-6814 (human) and E-MTAB-6798 (mouse). Sequencing reads were aligned to the human hg38 and mouse mm39 reference genomes using STAR74 using a two-pass mapping approach, with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --limitSjdbInsertNsj 50000000 --outMultimapperOrder Random. Annotations of de novo assembled isoforms from the RNA-seq datasets of human and mouse were downloaded from the source publication26 and converted from hg19 to hg38 (for human) and from mm10 to mm39 (for mouse) genomic coordinates using liftOver. Next, SQANTI3 (ref. 76) was used to classify de novo assembled isoforms on the basis of their similarity to known reference transcripts in the Ensembl version 106 transcriptome assemblies for hg38 and mm39. A reference transcriptome for subsequent analyses was built by including all Ensembl version 106 isoforms, plus all de novo assembled isoforms with at least two exons and classified as either ‘novel in catalog’ or ‘novel not in catalog’ by SQANTI3. Next, proActiv24 was applied to quantify promoter activity across all RNA-seq samples against the newly built reference transcriptome. For subsequent analyses, only promoters with at least 1.5 absolute activity and 15% relative activity in at least two samples from the same tissue were included. TE-chimeras were further restricted to the cases where the TSS and first SD of the TE-chimeric isoform were fully contained within the same TE. Protein-coding probabilities for each isoform were assessed using CPAT79 with default parameters. ORFs with a coding probability ≥0.364 in human and ≥0.44 in mouse transcripts were labeled as protein coding, as indicated by the tool developers, while sequences below this threshold were classified as noncoding.
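The promoter inclusion rule above (at least 1.5 absolute activity and 15% relative activity in at least two samples from the same tissue) can be expressed as a short filter. The sketch assumes that both thresholds must be met in the same sample and that relative activity is stored as a fraction between 0 and 1; these are illustrative assumptions, and the function operates on plain tuples rather than on proActiv output objects.

```python
from collections import defaultdict

ABS_MIN = 1.5      # minimum absolute promoter activity
REL_MIN = 0.15     # minimum relative promoter activity (fraction of gene total)
MIN_SAMPLES = 2    # samples from the same tissue that must pass

def promoter_passes(samples):
    """Return True if a promoter passes the activity filter.

    `samples` is an iterable of (tissue, absolute_activity, relative_activity)
    tuples, one per RNA-seq sample.
    """
    passing_per_tissue = defaultdict(int)
    for tissue, absolute, relative in samples:
        if absolute >= ABS_MIN and relative >= REL_MIN:
            passing_per_tissue[tissue] += 1
    return any(n >= MIN_SAMPLES for n in passing_per_tissue.values())

# Toy measurements (hypothetical).
kept = promoter_passes([("liver", 2.0, 0.20), ("liver", 1.6, 0.50), ("brain", 0.1, 0.0)])
split = promoter_passes([("liver", 2.0, 0.20), ("brain", 1.6, 0.50)])
```

In the second toy case the two passing samples come from different tissues, so the promoter is not retained.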
Organogenesis versus embryogenesis in human and mouse
De novo transcriptomes of human80 and mouse81 embryogenesis and expression data were downloaded from previously published research. Isoform-level expression was filtered using supplementary datasets. In human embryogenesis data, a transcript is considered expressed in a tissue if the transcript count is greater than or equal to 10 in at least two samples in the same tissue. In mouse embryogenesis data, a transcript is considered expressed in a tissue if the average TPM of the transcript is greater than or equal to 1. In human and mouse organogenesis data, TE-derived promoters were filtered using the filtering criteria described above. Only genes with an isoform with a TE-derived promoter that passed these filtering criteria were considered. Human and mouse genes were queried for homologous genes using BiomaRt82,83. All gene lists were converted to their human gene identifiers and overlaps were calculated with unique human gene identifiers.
GTEx TE-chimera filtering
TE-chimeras in the novel human transcriptome were identified by filtering for transcripts whose TSS and first SD were within the genomic range of the same TE. Comparisons of reference annotations showed that including Iso-seq-annotated transcriptomes increased the number of TE-chimeras detected by 26% (Extended Data Fig. 3a). This list was further filtered in GTEx for robustness, where chimeras must be expressed at >0.1 TPM in at least 20% of the individuals in each organ. This filter required each TE-chimera to be expressed in a minimum of 110 individuals. Expression levels for TE-chimeric reads were further filtered by proActiv24 using both absolute promoter activity > 1.5 and relative promoter activity accounting for >15% of gene expression. We note that this metric was driven by interindividual variation and, as a result, was affected by the number of samples included. This was observed by modeling the number of TE-promoters that passed this threshold in multiple subsampling analyses of GTEx RNA-seq data. In ten subsampling experiments, the number of TE-chimeras that passed our proActiv filtering thresholds never surpassed 739 (the averages for ten experiments are shown in Extended Data Fig. 3b), even when several hundred samples were included. In sum, these combined analyses resulted in a ‘final set’ of 739 TE-chimeras passing both the expression filter across individuals and the proActiv24 filter, listed in Supplementary Table 3.
Genetic correlation analyses and pathway enrichment assignment
To determine TE-chimera–gene correlations, as well as assign pathways enriched for TE-chimeras in population datasets such as GTEx, the TE-chimera transcripts were correlated with all genes available in a given tissue using the bicorAndPvalue() function in the R package WGCNA. Relationships were then either visualized directly or pathway enrichments were assigned through gene set enrichment analysis (GSEA). This was accomplished by using the bicor coefficients as the enrichment weights for each gene and performed using gseGO() in clusterProfiler, where thresholds were based on 1,000 permutations. For example, in Fig. 2e, the TE-chimera (ENST00000426261.6) was correlated with all other genes in the aorta; then, GSEA was performed using the resulting bicor coefficients and genes. Thus, the plotted enrichment scores reflect the strength and direction of correlation assigned to each Gene Ontology term shown.
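For reference, the biweight midcorrelation (bicor) underlying these analyses can be sketched in a few lines. The following is a minimal pure-Python illustration of the standard bicor definition, not the WGCNA implementation: the function names are hypothetical, P values are not computed and the case of a zero median absolute deviation (MAD) is simply rejected here, which may differ from how WGCNA handles it.

```python
import math
from statistics import median


def _weighted_deviations(values):
    """Median-centered deviations multiplied by biweight weights."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplification for this sketch only.
        raise ValueError("MAD is zero; bicor is undefined in this sketch")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Points with |u| >= 1 are treated as outliers and get weight 0.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    return terms


def bicor(x, y):
    """Biweight midcorrelation between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _weighted_deviations(x)
    b = _weighted_deviations(y)
    norm = math.sqrt(sum(t * t for t in a)) * math.sqrt(sum(t * t for t in b))
    return sum(p * q for p, q in zip(a, b)) / norm
```

Because extreme values receive zero weight, a single outlying sample has less influence on bicor than on a Pearson correlation, which is the usual motivation for using it on expression data.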
Human cancer analyses of TE-chimeras
TCGA data were obtained from the UCSC Xena portal (accessed February 4, 2024)84. For survival analyses, only persons in which a ‘survival event’ (for example, deceased) occurred over the course of the study were used to increase accuracy of the model. To bin individuals by expression, all LTRs listed in Supplementary Table 3 and detected in persons with cancer were used. For each Ensembl transcript corresponding to a TE-chimera, the average expression (fragments per kilobase of transcript per million mapped fragments, FPKM) was calculated across the population. Individuals were then assigned a ‘low’ or ‘high’ value on the basis of whether their expression of a given transcript was below or above the mean, respectively. Individuals were then binned into categories of ‘low expressors’ or ‘high expressors’ on the basis of the ratio of low or high transcript counts being below or above 0.5, respectively. Survival analyses were performed and visualized using R packages survival85 and survminer according to standard protocols. The P values for survival differences between groups were assigned on the basis of a log-rank test.
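The two-step binning of individuals described above can be illustrated with a short sketch. It is a hedged reading of the text rather than the analysis code: the function name bin_expressors and the nested-dictionary input are hypothetical, and we assume that values exactly at the mean count as ‘low’ and that an individual is a ‘high expressor’ only when more than half of the transcripts are scored ‘high’.

```python
def bin_expressors(expression):
    """Label each individual as a 'high' or 'low' expressor.

    expression: dict mapping individual -> dict mapping transcript -> value.
    Assumption: every individual has a value for the same set of transcripts.
    """
    individuals = list(expression)
    transcripts = list(expression[individuals[0]])

    # Step 1: population mean of each TE-chimera transcript.
    means = {
        t: sum(expression[i][t] for i in individuals) / len(individuals)
        for t in transcripts
    }

    # Step 2: fraction of transcripts above the mean for each individual.
    labels = {}
    for i in individuals:
        n_high = sum(1 for t in transcripts if expression[i][t] > means[t])
        labels[i] = "high" if n_high / len(transcripts) > 0.5 else "low"
    return labels
```

The resulting labels would then serve as the grouping variable for the survival comparison.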
Human aging analysis of TE-chimeras
To compare age effects across organs among human TE-chimeras, all individuals in GTEx were binned as to whether their reported age was over or under 50 years old. Next, all LTRs identified in Supplementary Table 3 were compared in indicated organs using a Wilcoxon rank-sum test. Corresponding false discovery rates (FDRs) from t-test P values were calculated from the R package qvalue.
mES cell culture
mES cells were cultured in 2i/Lif medium as previously described23. Briefly, mES cells were cultured in N2B27 medium consisting of a 1:1 mixture of DMEM/F12 (Gibco) with HEPES and Neurobasal medium (Gibco) supplemented with 0.5× N2 (Gibco), 0.5× serum-free B27 (Gibco), 1× GlutaMAX (Gibco), 1× penicillin–streptomycin (Gibco), 0.05% bovine albumin fraction V (Gibco) and 1× 2-mercaptoethanol (Gibco). For naive mES cells, 3 µM CHIR99021 (Reprocell), 1 µM PD0325901 (Reprocell) and 20 ng ml−1 mouse recombinant leukemia inhibitory factor (R&D systems) were added to the medium to sustain stemness. Exosc3 cKO was induced by 100 nM tamoxifen (4-OHT) (EMD Millipore) treatment in Exosc3 Cre/lox COIN mES cells for 48 h. RNA was purified with TRIzol (Thermo Fisher Scientific) following the manufacturer’s recommendations.
Small interfering RNA (siRNA) transfection
For siRNA-mediated KD of Snrpd2, the WT line of mES cells, derived following the standard protocol, was transfected at a final concentration of 90 nM siRNA with Lipofectamine 3000 (Thermo Fisher Scientific) at the time of plating and collected 60 h after transfection. For siRNA-mediated KD of Wdr82 and Srsf7, COIN mES cells (without tamoxifen treatment) were transfected at a final concentration of 50 nM siRNA with Lipofectamine RNAiMAX (Thermo Fisher Scientific) at the time of plating and collected 2 days after transfection. For the KD experiments of Wdr82 and Srsf7, siRNAs were purchased from Dharmacon (Horizon Discovery). For the KD experiment of Snrpd2, an siRNA with a sequence reported in a previous study42 was custom-ordered from the same supplier. At the time of collection, total RNA was purified with TRIzol (Thermo Fisher Scientific) following the manufacturer’s recommendations. KD of Wdr82 was validated by western blotting using the following antibodies: WDR82 (D2I3B) rabbit monoclonal antibody (Cell Signaling Technology, 99715S; 1:1,000), anti-rabbit IgG, horseradish peroxidase (HRP)-linked antibody (Cell Signaling Technology, 7074S; 1:5,000) and β-actin (8H10D10) mouse monoclonal antibody (HRP conjugate) (Cell Signaling Technology, 12262S; 1:1,000). KD of Snrpd2 and Srsf7 was validated by reverse transcription (RT)–qPCR, as described below. The cytotoxicity of siRNA-transfected cells was evaluated using a lactate dehydrogenase (LDH) colorimetric assay (Promega). The release of LDH into the culture medium, indicative of cell membrane damage, was quantified by measuring absorbance at 490 nm. Data are presented as the percentage cytotoxicity relative to cells transfected with control siRNA and lysed (LDH release maximum). Two replicates were processed for each condition. For sequencing, RNA libraries were prepared using the NEBNext Ultra II directional RNA library prep kit for Illumina (New England Biolabs) following the manufacturer’s recommendations.
Dux overexpression
cDNA for the FLAG-tagged Dux gene (Gene ID 664783), either with or without the transactivation domain (amino acids 1–178), was PCR-amplified from the template plasmid (Addgene, 138320). The amplified Dux cDNA and GFP were cloned into an EF1a promoter plasmid. For transfection, 50,000 COIN mES cells were reverse-transfected with vectors encoding the indicated genes using Lipofectamine Stem (Thermo Fisher Scientific). Where Exosc3 cKO was required, 100 nM tamoxifen (4-OHT) was added at the time of plating. Control wells were treated with ethanol alone. Cells were then incubated for 2 days. Two replicates were processed for each condition. After incubation, cells were washed with PBS and resuspended in TRIzol (Thermo Fisher Scientific) reagent for RNA extraction.
RT–qPCR
Total RNA was extracted using TRIzol reagent and treated with DNase using the TURBO DNA-free kit (Invitrogen), following the manufacturer’s instructions. DNase-treated RNA was reverse-transcribed into cDNA using the high capacity cDNA RT kit (Applied Biosystems) and real-time qPCR was performed using a SYBR green master mix (Bio-Rad). To validate the transfection of FLAG–GFP and FLAG–Dux, RT–qPCR amplicons were visualized by agarose gel electrophoresis. Bands corresponding to the expected amplicon sizes were quantified using ImageJ 1.53t software86 and the values were normalized to the corresponding Actb band intensities. For other transcripts, relative expression of the transcript of interest was calculated using the 2^(−ΔCt) method, where ΔCt = Ct_target − Ct_Actb. All primers used for qPCR are listed in Supplementary Table 6. The data were visualized as bar plots using GraphPad Prism 10.6.0 software for macOS (GraphPad software).
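As a worked illustration of the relative expression calculation (2 raised to the power of −ΔCt, with ΔCt = Ct_target − Ct_Actb), a minimal sketch is given below. The function name is hypothetical and the Ct values in the usage note are invented for illustration only.

```python
def relative_expression(ct_target, ct_actb):
    """Relative expression of a target transcript normalized to Actb.

    Implements 2 ** (-delta_ct) with delta_ct = ct_target - ct_actb.
    """
    delta_ct = ct_target - ct_actb
    return 2.0 ** (-delta_ct)
```

For example, a target that crosses the threshold five cycles later than Actb (Ct of 25 versus 20) has a relative expression of 2^(−5), or about 3% of the Actb level.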
RNA pulldown of MT2 PAS
A 130-nt fragment of the MT2_Mm consensus sequence centered around the PAS motif (AAUAAA) was cloned into the pBlueScript-3×MS2 vector. This region includes the SD motif (UGUA) for CFIm binding, the aforementioned PAS motif and the cleavage site. A fragment with a mutated PAS motif (AACAAA) was also cloned as a negative control. RNA substrates were synthesized by runoff in vitro transcription using T7 RNA polymerase (New England Biolabs) and capped by vaccinia capping enzyme (New England Biolabs). RNA pulldown and western blotting experiments were performed as previously described38 using the following antibodies: mouse monoclonal anti-U1-70K clone 9C4.1 (Millipore, 05-1588; RRID: AB_10805959; 1:2,000), mouse monoclonal anti-U1A (Santa Cruz, sc-101149; RRID: AB_2193721; 1:2,000), rat monoclonal anti-U1C (Sigma, SAB4200188-200UL; RRID: AB_10640155; 1:2,000), rabbit polyclonal anti-CFIm68 (Bethyl, A301-358A; RRID:AB_937785; 1:2,000), rabbit polyclonal anti-CFIm59 (Bethyl, A301-360A; RRID:AB_937864; 1:2,000), rabbit polyclonal anti-NUDT21 (CFIm25) (Proteintech, 10322-1-AP; RRID:AB_2251496; 1:500), rabbit polyclonal anti-NCBP2 (CBP20) (Bethyl, A302-553A; RRID: AB_2034872; 1:2,000), rabbit polyclonal anti-CPSF30 (Bethyl, A301-585A; RRID: AB_1078868; 1:2,000), goat anti-rabbit IgG HRP conjugate (Millipore, 12-348; RRID: AB_390191; 1:2,000), goat anti-mouse IgG HRP conjugate (Millipore, 12-349; RRID: AB_390192; 1:2,000) and rabbit anti-rat IgG HRP conjugate (Invitrogen, PA1-28573; RRID: AB_10980086; 1:2,000).
KD of MERVL using gapmer ASOs
Gapmer ASOs against MERVL were synthesized by Integrated DNA Technologies. Sequences used were as follows:
MERVL-ASO1: T*G*G*T*G*G*A*T*C*A*A*C*A*A*G*C*C*A*A*T
MERVL-ASO2: C*A*T*T*T*G*T*C*T*G*T*T*T*A*C*C*A*C*G*A
MERVL-ASO3: G*A*C*C*C*C*G*A*A*A*A*G*T*C*T*G*A*T*T*A
Scr ASO: A*G*C*G*C*G*G*G*T*A*T*T*G*A*A*C*C*A*G*G
Here, asterisks indicate a phosphorothioate bond instead of a phosphodiester bond. Bold and underlined residues represent 2′-O-methoxyethyl nucleotides. All stock ASOs were resuspended in 1× siRNA resuspension buffer (Horizon Discovery) before use. MERVL-ASO1, MERVL-ASO2 and MERVL-ASO3 were mixed at equimolar concentrations to form a 100 μM stock (MERVL-ASOmix) that was used for all downstream experiments. For KD experiments, Scr ASO or MERVL-ASOmix was complexed with 2.5 μl of Lipofectamine RNAiMAX (Thermo Fisher Scientific) in 250 μl of Opti-MEM. The resulting ASO–Lipofectamine mix was then reverse-transfected into single-cell suspensions of 2 × 10⁵ mES cells in 750 μl of mES cell medium + 2i/Lif. The final concentration of ASOs used in these experiments was 100 nM. Where Exosc3 cKO was required, 100 nM tamoxifen (4-OHT) was added at the time of plating. Control wells were treated with ethanol alone. Then, 48 h after transfection and tamoxifen treatment, cells were collected in TRIzol (Thermo Fisher Scientific). Three replicates were processed for each condition. RNA was isolated following the manufacturer’s recommendations. For sequencing, RNA libraries were prepared using the NEBNext Ultra II directional RNA library prep kit for Illumina (New England Biolabs) following the manufacturer’s recommendations.
KD of U1/U6 RNA using AMOs
U1 and U6 AMOs were synthesized by Gene Tools. Sequences used were as follows:
Control AMO: CCTCTTACCTCAGTTACAATTTATA
U1 AMO: GGTATCTCCCCTGCCAGGTAAGTAT
U6ATAC AMO: AACCTTCTCTCCTTTCATACAACAC
To study the impact of major or minor spliceosome function in control or Exosc3-depleted cells, mES cells were cultured in 2i/Lif medium in the presence of 0.01% ethanol or 100 nM tamoxifen for 48 h. mES cell colonies were then dissociated into single-cell suspensions using Accutase, counted and resuspended in Opti-MEM (Thermo Fisher Scientific) containing 2i/Lif. For KD experiments, single-cell suspensions of 1.25 × 10⁶ cells were prepared in 400 μl of Opti-MEM containing 2i/Lif. AMOs were added to the cell suspensions at a final concentration of 15 μM and electroporated into cells using the Bio-Rad Gene Pulser XCell electroporation system. Electroporation was carried out with a single 240-V, 500-μF pulse in a 0.4-mm cuvette. Following electroporation, cells were immediately plated into full mES cell medium. Then, 24 h after electroporation, cells were collected in TRIzol (Thermo Fisher Scientific). Two replicates were processed for each condition. RNA was isolated following the manufacturer’s recommendations. For sequencing, RNA libraries were prepared using the TruSeq stranded total RNA library prep gold (Illumina) following the manufacturer’s recommendations.
mES cells bulk short-read RNA-seq and analysis
Illumina short-read RNA-seq data in WT and Exosc3 cKO mES cells were obtained from our previous study23. Data from other previous studies were obtained from their respective references (listed in Supplementary Table 7). Reads were mapped onto the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome using two-pass STAR74 mapping with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. Gene-level and TE-level quantification was performed with TEcount using the RepeatMasker gtf file generated from the mm10 genome. Differential expression was calculated using DESeq2 (ref. 72) with default parameters independently for each comparison with or without eRNAs and PROMPTs. In the Exosc3 cKO with AMO transfection dataset, each DESeq2 model included a similar number of expressed genes. Protein-coding status and exon counts of each gene were determined using the method described above.
mES cell nascent RNA (metabolic labeling) sequencing analysis
Nascent RNA-seq data in WT and Exosc3 cKO mES cells were obtained from the previous study23. Reads were mapped onto the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome using two-pass STAR74 mapping with the following parameters: --outSAMstrandField intronMotif --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100.
EU-seq total input RNA-seq analysis
Total input RNA-seq of the EU-seq dataset was obtained from a previous study23. Illumina adaptors were trimmed from reads using Trim Galore87. Reads were aligned to the GRCm38/mm10 reference genome using STAR74 with the following custom parameters: --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100. To generate bigWig files, aligned BAM files were first filtered using sambamba88, removing unmapped reads and secondary alignments. Next, deepTools89 was used with RPKM (reads per kilobase per million mapped reads) normalization to generate unstranded bigWig files that were used to produce genome browser snapshots (Fig. 3f,g).
mES cell promoter activity analysis
Promoters and their activity were estimated by proActiv24, using Illumina RNA-seq splice junction quantification files (SJ.out.tab) from a STAR two-pass alignment and the novel isoform-resolved transcriptome as input, as described above. Promoters that were estimated to be inactive across all datasets used, as well as internal promoters, were excluded from downstream analysis. The average absolute and relative promoter activity across replicates for each condition was used to analyze the promoter activity. Low-activity promoters with absolute promoter activity lower than 1.5 and relative promoter activity lower than 15% were further filtered out to avoid false positives. For the WT and Exosc3 cKO dataset, filtering was applied for Exosc3 cKO samples to avoid eliminating the unstable or lowly expressed isoforms in WT. For the remaining comparisons, the TE-promoters whose maximum absolute promoter activity and maximum relative activity across the dataset in each comparison were higher than 1.5 and 0.15, respectively, were considered. For the analyses described in this section, promoter activity refers to the average absolute promoter activity unless otherwise indicated. Promoters were classified on the basis of the genomic location of promoters and their association with TE-chimeras annotated in the WT and Exosc3 cKO mES cell and EpiLC novel transcriptome, as described above. Specifically, we initially annotated nonchimeric (host) and TE-promoters on the basis of the overlap between promoter regions and annotated repetitive elements. When multiple TEs overlapped a promoter, a single TE was selected using the distinct() function of the dplyr90 package in R. Promoters that did not overlap with any TE on the same strand were defined as host (nonchimeric) promoters. Promoters that overlapped with a TE on the same strand were defined as TE-promoters.
TE-promoters were then further restricted to instances where the TSS of the TE-chimeras was contained within the same TE on the same strand. To calculate log10 or fold changes, a small constant (+0.1) was added to the values to prevent the occurrence of infinite values.
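The effect of the +0.1 constant on log10 values and fold changes can be shown with a short sketch. The function names are hypothetical, and we assume that the constant is added to both the numerator and the denominator before a ratio is taken.

```python
import math

PSEUDOCOUNT = 0.1  # small constant stated in the text


def log10_with_pseudocount(value):
    """log10 of a non-negative activity value, finite even when value is 0."""
    return math.log10(value + PSEUDOCOUNT)


def fold_change_with_pseudocount(treated, control):
    """Ratio of two activity values, finite even when control is 0."""
    return (treated + PSEUDOCOUNT) / (control + PSEUDOCOUNT)
```

With this convention, a promoter that is silent in both conditions has a fold change of 1 rather than an undefined value, and a promoter that is silent in the control only has a large but finite fold change.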
Promoter classification based on RNAPII status
RNAPII ChIP-seq (8WG16) data generated in WT and Exosc3 cKO from the previous study were obtained23 and processed as described therein. To classify promoters on the basis of RNAPII, a differential binding analysis between WT and Exosc3 cKO was performed by using (1) MACS2 (ref. 91) with the parameters --broad --qvalue 0.5 and (2) DiffBind91. Promoters that overlapped with RNAPII peaks (allowing a maximum gap of 200 bp) were then classified into three categories: (1) more RNAPII in Exosc3 cKO, in cases where the first exon overlapped an RNAPII peak in Exosc3 cKO but not WT or both conditions had peaks but with significantly stronger signal in Exosc3 cKO (FDR < 0.05, fold > 0.5); (2) less RNAPII in Exosc3 cKO, in cases where the first exon overlapped an RNAPII peak in WT but not Exosc3 cKO or both conditions had peaks but with significantly lower signal in Exosc3 cKO (FDR < 0.05, fold < −0.5); and (3) similar RNAPII levels, in all other cases. MT2_Mm_dup920 (positionally overlapping both RNAPII promoters) and MTC-int_dup3214 (showing the outlier signal of nascent transcription) were excluded in Fig. 3h and Extended Data Fig. 7m,y–z.
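The three-way classification above reduces to a small decision rule, sketched here under stated assumptions. The function name and argument layout are hypothetical, and we assume that the differential binding fold is signed so that positive values indicate stronger signal in Exosc3 cKO than in WT.

```python
def classify_rnapii(peak_wt, peak_cko, fdr=None, fold=None):
    """Classify one promoter by RNAPII occupancy in Exosc3 cKO relative to WT.

    peak_wt, peak_cko: whether the first exon overlaps an RNAPII peak.
    fdr, fold: differential binding statistics, used only when both
    conditions have a peak. Assumption: fold > 0 means more signal in cKO.
    Returns 'more', 'less', 'similar', or None if no peak overlaps at all.
    """
    if not peak_wt and not peak_cko:
        return None  # only promoters overlapping a peak are classified
    if peak_cko and not peak_wt:
        return "more"
    if peak_wt and not peak_cko:
        return "less"
    # Both conditions have a peak: rely on the differential binding test.
    if fdr is not None and fold is not None and fdr < 0.05:
        if fold > 0.5:
            return "more"
        if fold < -0.5:
            return "less"
    return "similar"
```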
Hi-C data processing and analysis
Hi-C data were processed as previously described92. Briefly, Hi-C reads were trimmed at MboI/DpnII recognition sites (GATC) and aligned to the mouse genome (GRCm38/mm10) using STAR, keeping only read pairs that both mapped to unique genomic locations for further analysis (MAPQ > 10). All PCR duplicates were also removed. PCA of Hi-C experiments used to define chromatin compartments was performed with HOMER93. For each chromosome, a balanced and distance-normalized contact matrix was generated using a window size of 50 kb sampled every 25 kb, reporting the ratios of observed-to-expected contact frequencies for any two regions. The correlation coefficient of the interaction profiles for any two regions across the entire chromosome was then calculated to generate a correlation matrix. This matrix was then analyzed using PCA from the prcomp function in R and the eigenvector loadings for each 25-kb region along the first principal component (PC1) were assigned to each region. The PC1 values from each chromosome were scaled by their s.d. to make them more comparable across chromosomes and analysis parameters. For each chromosome, PC1 values were multiplied by −1 if negative PC1 regions were more strongly enriched for active chromatin regions defined by H3K27ac peaks to ensure that the positive PC1 values aligned with the A/permissive compartment (as opposed to the B/inert compartment). For each 25-kb region (bin), the relative enrichment of chimeric LTRs was calculated as the ratio of chimeric LTR base pairs to total LTR base pairs. For both chimeric and total LTRs, overlapping regions were merged such that each base pair was counted only once when calculating coverage. PC1 values were then used to compare (1) genomic bins containing or lacking chimeric LTRs, considering only bins that contained at least 1 bp of any LTR, and (2) genomic bins stratified by the relative enrichment of chimeric to total LTRs, considering only bins that contained at least 1 bp of chimeric LTR.
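The per-bin relative enrichment of chimeric LTRs (chimeric LTR base pairs divided by total LTR base pairs, with overlapping regions merged so that each base pair is counted once) can be sketched as follows. This is an illustrative implementation only: the function names are hypothetical, intervals are assumed to be half-open (start, end) coordinates already clipped to the bin, and chimeric LTRs are assumed to be a subset of all LTRs.

```python
def merged_length(intervals):
    """Total base pairs covered by (start, end) intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def chimeric_enrichment(chimeric_ltrs, all_ltrs):
    """Ratio of chimeric LTR bp to total LTR bp in one bin (None if no LTR)."""
    total = merged_length(all_ltrs)
    if total == 0:
        return None
    return merged_length(chimeric_ltrs) / total
```

Returning None for bins without any LTR mirrors the restriction in the text to bins containing at least 1 bp of LTR sequence.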
Correlation analyses
The WGCNA bicorAndPvalue() function94 was used to generate a biweight midcorrelation coefficient (bicor) and corresponding P values between the expression signature of coactivator depletion and that of Exosc3 cKO. For the PlaB treatment dataset42 analysis, log2 fold changes of the counts normalized using DESeq2 (ref. 72) were manually quantified because of the limited number of replicates available in the deposited dataset.
Motif analysis
Motif analysis was performed using HOMER95. The entire mouse genome (mm10) was scanned with known SD and PAS motifs, and pairs of SD–PAS motifs in which the SD lies upstream of the PAS by at most 10 bp were identified. The proportions of TEs or TSSs that contain this specific motif configuration were calculated by intersecting genomic motif coordinates with the reference GTF. Only motifs on standard chromosomes were considered when calculating the proportions. Only TSSs that were associated with promoters that passed the filtering criteria described above were used. TSSs that did not overlap with TEs on the same strand were classified as nonrepetitive, whereas TSSs overlapping with TEs on the same strand were classified as repeat-derived. Only motifs whose strand matched that of the TEs or TSSs were considered. For TSSs, occurrences were searched between the TSS and 150 bp downstream.
Positional analysis of chimeric and nonchimeric LTRs
Chimeric LTRs were filtered and defined on the basis of criteria described above. All other LTRs that were either not defined as LTR promoters or did not satisfy the filtering criteria were classified as nonchimeric LTRs. All positional analyses were performed against the mm10 Ensembl transcriptome. Primary annotation of LTR and gene–LTR pairing were performed considering only genes located on standard chromosomes. Primary annotation of LTR was performed using the annotatePeak() function in ChIPseeker96, with the parameters level set to ‘gene’ and overlap set to ‘all’. Annotations were then classified as four groups: (1) promoter, where LTRs are located proximal to the TSS (−3 kb to +3 kb); (2) intragenic, where LTRs are located in the untranslated region, exon or intron; (3) downstream (of gene end) ≤ 300 bp; and (4) distal intergenic (Extended Data Fig. 6b). LTRs were then paired with genes (regardless of strand orientation): (1) gene and its overlapping intragenic LTR and (2) gene and its nearest promoter-proximal intergenic LTR (within 5 kb from the TSS). To avoid any complication, only genes present in the mm10 Ensembl transcriptome that were not overlapping with any other genes (that is, nonoverlapping genes) were considered. Intragenic LTRs positioned inside overlapping genes and promoter-proximal intergenic LTRs positioned closer to overlapping genes than nonoverlapping genes were discarded. Genes paired with chimeric LTRs were excluded from pairing with nonchimeric LTR. The gene-level RNA-seq count data generated by TEcount97 using BAM files were mapped onto the mm10 Ensembl transcriptome. The normalized counts from DESeq2 (ref. 72) were further normalized by the total exon length (in kilobases) of the longest transcript of each gene and averaged across replicates for each condition. These normalized expression values were then used to compare expression of genes paired with chimeric LTRs to those paired with nonchimeric LTRs.
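The length normalization applied to the gene-level counts can be illustrated with a brief sketch. The function name is hypothetical, and we assume that the DESeq2-normalized counts of each replicate are divided by the exon length in kilobases before averaging.

```python
def length_normalized_mean(normalized_counts, exon_length_bp):
    """Mean across replicates of normalized counts per kilobase of exon.

    normalized_counts: DESeq2-normalized counts, one value per replicate.
    exon_length_bp: total exon length (bp) of the gene's longest transcript.
    """
    length_kb = exon_length_bp / 1000.0
    per_kb = [count / length_kb for count in normalized_counts]
    return sum(per_kb) / len(per_kb)
```

For instance, replicate counts of 100 and 300 for a gene whose longest transcript has 2 kb of exons give a value of 100.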
The coverage of H3K9me3 was calculated using datasets from our previous study23 with the multiBigWigSummary function in deepTools89. The processed bigWig files deposited in the corresponding Gene Expression Omnibus (GEO) entry were used as input. For DNA methylation analysis, whole-genome bisulfite sequencing data were obtained from our previous study23. Reads were aligned to the mouse reference genome (mm10) using the bsseq package in R98. A BSseq object was created, filtering for CpG sites with a minimum coverage depth of six reads. The filtered BSseq object was converted to bedGraph. For visualization, bigWig files were generated from bedGraph files using the bedGraphToBigWig utility99. DNA methylation levels were quantified for each sample using the resulting bigWig files with the multiBigWigSummary function in deepTools89. For both H3K9me3 and DNA methylation analyses, NA values were replaced with 0 and the average signal was calculated across replicates within each condition. The coverage of H3K9me3 immunoprecipitation (IP) samples was further normalized by the input sample.
The average antisense coverage between genes and the nearest upstream proximal intergenic LTR was calculated from stranded, RPKM-normalized bigWig files. To generate these bigWig files, aligned BAM files from replicates were first merged and then filtered with sambamba88 to remove unmapped reads and secondary alignments. The replicate-merged, filtered BAM files were converted into stranded bigWig files using the bamCoverage function in deepTools89 with the parameters: --binSize 10 --normalizeUsing RPKM. Strand specificity was applied using the --filterRNAstrand option. The average antisense coverage was then calculated from these bigWig files using the multiBigWigSummary function of deepTools89. Gene and promoter-proximal intergenic LTR pairs where the range between gene and LTR was wholly contained within the novel transcripts, which were antisense to the gene, were excluded when quantifying antisense signals to avoid complications.
For the analyses described in this section, a small constant (+0.1) was added to all values before calculating log10 or fold changes to avoid infinite values. The ggbreak100 R package was used to introduce axis breaks.
Sashimi plot, metagene plot and heat map
A strand-specific sashimi plot of RNA-seq was created by ggsashimi101 with the following parameters: -s MATE2_SENSE. Strand-specific metagene plots of RNA-seq and nascent RNA sequencing (metabolic labeling) were generated by ngs.plot.r102 with BAM containing only the first mate reads. Replicate-merged BAM files for each condition were used as an input to ggsashimi101 and ngs.plot.r102. Metagene plots of Dux ChIP-seq in mES cells45 and Obox1 Stacc-seq in late 2C mouse embryos47 were generated by deepTools89 computeMatrix and plotHeatmap. The processed bigWig files deposited in the corresponding GEO entry by previous studies were used as input. For ChIP-seq, replicate IP bigWig files were merged using bigwigCompare (--operation mean) for visualization. For Stacc-seq, to match the reference version, the genomic coordinates of LTRs were extracted from the RepeatMasker GTF file for the mm9 genome and only elements annotated as LTRs in the mm9 assembly were considered.
2CLC gene set expression and enrichment analyses
The 2CLC gene set was obtained from a previous study (GSE75751)46, using genes and TEs that were upregulated in MuERVL+Zscan4+ double-positive samples when compared to untransfected negative control samples (FDR < 0.05 and log2 fold change > 5). GSEA was performed using fgsea (version 1.16.0)103, ranking genes according to the differential expression statistic output by DESeq2 (ref. 72).
LTR-chimera analysis in vivo
A de novo oocyte transcriptome was assembled with paired-end Illumina short-read data from a previous study (GSE247848)51. Reads were filtered using fastp (version 0.23.2)104, mapped with STAR74 two-pass alignment to mm10 with parameters --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 and assembled into a transcriptome annotation using StringTie75 with default parameters. Annotations were merged with StringTie --merge and novel transcripts with >50% TE coverage were removed. Mtr4 cKO (GSE247848)51 samples, considering only 21-day samples, and Dicer1 cKO (GSE57514)54 samples were filtered with fastp and mapped to mm10 with STAR as described and quantified against the de novo oocyte transcriptome using proActiv24, with previously described filtering criteria. For mES cells, the previously described WT and Exosc3 cKO mES cell and EpiLC novel transcriptome was used. Exosc3 cKO (GSE205211)23 and Dicer1 cKO (GSE256381)105 mES cell RNA-seq samples and their respective controls were filtered with fastp, mapped to the de novo transcriptome with STAR and quantified with proActiv as described above.
Evolutionary analysis of TE-chimeras
We used the 241-way mammal alignment to investigate the age of TE-derived promoters. The TSS regions ± 100 bp were defined as promoters and aligned to each of the other 240 mammals. Promoters with more than 100 alignable base pairs in a specific mammal were considered present in the mammal. For human promoters, the age was defined as follows according to their presence in other mammal classes: (1) human-specific; (2) great apes; (3) apes; (4) old-world monkeys; (5) new-world monkeys; (6) lemurs; (7) rodents; and (8) other eutherian mammals. We did the same for mouse promoters, dividing them into the following classes: (1) mouse-specific; (2) murine; (3) Cricetidae; (4) Dipodidae; (5) rodents; and (6) other eutherian mammals. The Wilcoxon rank-sum test was used for generating P values.
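One way to read the age assignment above is sketched below. This is an interpretation rather than the procedure used in the study: the function name, the input dictionary (maximum alignable base pairs per clade) and, in particular, the rule that a promoter takes the age of the most distant clade in which it is present are all assumptions made for illustration. The clade names and the threshold of more than 100 alignable base pairs are taken from the text.

```python
HUMAN_CLADES = [
    "human-specific", "great apes", "apes", "old-world monkeys",
    "new-world monkeys", "lemurs", "rodents", "other eutherian mammals",
]
MOUSE_CLADES = [
    "mouse-specific", "murine", "Cricetidae", "Dipodidae",
    "rodents", "other eutherian mammals",
]
MIN_ALIGNABLE_BP = 100  # a promoter is present if more than this many bp align


def promoter_age(alignable_bp_by_clade, clades=HUMAN_CLADES):
    """Assign an age class to one promoter.

    alignable_bp_by_clade: dict clade -> largest number of alignable bp
    observed among the species of that clade (hypothetical summary input).
    Assumption: the age is the most distant clade in which the promoter
    is present; with no presence elsewhere it is species-specific.
    """
    age = clades[0]
    for clade in clades[1:]:  # ordered from closest to most distant
        if alignable_bp_by_clade.get(clade, 0) > MIN_ALIGNABLE_BP:
            age = clade
    return age
```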
Statistical significance
Unless otherwise noted, asterisks represent the following convention for statistical significance: NS (not significant), P > 0.05; *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001 and ****P ≤ 0.0001.
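For convenience, the asterisk convention can be written as a small lookup function. The function name is hypothetical; the thresholds are those stated above.

```python
def significance_label(p_value):
    """Map a P value to the asterisk convention used in the figures."""
    if p_value <= 0.0001:
        return "****"
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return "NS"
```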
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Raw and processed sequencing data were deposited to the GEO (GSE205211 and GSE297796). A list of publicly available datasets used in this study is available in Supplementary Table 7. Source data are provided with this paper.
References
Senft, A. D. & Macfarlan, T. S. Transposable elements shape the evolution of mammalian development. Nat. Rev. Genet. 22, 691–711 (2021).
Cosby, R. L., Chang, N. C. & Feschotte, C. Host-transposon interactions: conflict, cooperation, and cooption. Genes Dev. 33, 1098–1116 (2019).
Chuong, E. B., Elde, N. C. & Feschotte, C. Regulatory activities of transposable elements: from conflicts to benefits. Nat. Rev. Genet. 18, 71–86 (2017).
Wells, J. N. & Feschotte, C. A field guide to eukaryotic transposable elements. Annu. Rev. Genet. 54, 539–561 (2020).
Padeken, J., Methot, S. P. & Gasser, S. M. Establishment of H3K9-methylated heterochromatin and its functions in tissue differentiation and maintenance. Nat. Rev. Mol. Cell Biol. 23, 623–640 (2022).
Giménez-Orenga, K. & Oltra, E. Transposable elements shaping the epigenome. in Handbook of Epigenetics 3rd edn (ed. Tollefsbol, T. O.) Ch. 18 (Academic Press, 2023); https://doi.org/10.1016/B978-0-323-91909-8.00035-9
Almeida, M. V., Vernaz, G., Putman, A. L. K. & Miska, E. A. Taming transposable elements in vertebrates: from epigenetic silencing to domestication. Trends Genet. 38, 529–553 (2022).
Ecco, G., Imbeault, M. & Trono, D. KRAB zinc finger proteins. Development 144, 2719–2729 (2017).
Russell, S. J., Stalker, L. & LaMarre, J. PIWIs, piRNAs and retrotransposons: complex battles during reprogramming in gametes and early embryos. Reprod. Domest. Anim. 52, 28–38 (2017).
Rowe, H. M. et al. TRIM28 repression of retrotransposon-based enhancers is necessary to preserve transcriptional dynamics in embryonic stem cells. Genome Res. 23, 452–461 (2013).
Bourque, G. et al. Ten things you should know about transposable elements. Genome Biol. 19, 199 (2018).
Franke, V. et al. Long terminal repeats power evolution of genes and gene expression programs in mammalian oocytes and zygotes. Genome Res. 27, 1384–1394 (2017).
Friedli, M. & Trono, D. The developmental control of transposable elements and the evolution of higher species. Annu. Rev. Cell Dev. Biol. 31, 429–451 (2015).
Thompson, P. J., Macfarlan, T. S. & Lorincz, M. C. Long terminal repeats: from parasitic elements to building blocks of the transcriptional regulatory repertoire. Mol. Cell 62, 766–776 (2016).
Miller, W. J., McDonald, J. F. & Pinsker, W. Molecular domestication of mobile elements. Genetica 100, 261–270 (1997).
Macfarlan, T. S. et al. Embryonic stem cell potency fluctuates with endogenous retrovirus activity. Nature 487, 57–63 (2012).
Peaston, A. E. et al. Retrotransposons regulate host genes in mouse oocytes and preimplantation embryos. Dev. Cell 7, 597–606 (2004).
Hashimoto, K. et al. Embryonic LTR retrotransposons supply promoter modules to somatic tissues. Genome Res. 31, 1983–1993 (2021).
Oomen, M. E. et al. An atlas of transcription initiation reveals regulatory principles of gene and transposable element expression in early mammalian development. Cell 188, 1156–1174.e1120 (2025).
Arribas, Y. A. et al. Transposable element exonization generates a reservoir of evolving and functional protein isoforms. Cell 187, 7603–7620.e7622 (2024).
Pasquesi, G. I. M. et al. Regulation of human interferon signaling by transposon exonization. Cell 187, 7621–7636.e7619 (2024).
Babarinde, I. A. et al. Transposable element sequence fragments incorporated into coding and noncoding transcripts modulate the transcriptome of human pluripotent stem cells. Nucleic Acids Res. 49, 9132–9153 (2021).
Torre, D. et al. Nuclear RNA catabolism controls endogenous retroviruses, gene expression asymmetry, and dedifferentiation. Mol. Cell 83, 4255–4271.e4259 (2023).
Demircioğlu, D. et al. A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell 178, 1465–1477.e1417 (2019).
Brawand, D. et al. The evolution of gene expression levels in mammalian organs. Nature 478, 343–348 (2011).
Mazin, P. V., Khaitovich, P., Cardoso-Moreira, M. & Kaessmann, H. Alternative splicing during mammalian organ development. Nat. Genet. 53, 925–934 (2021).
Battle, A., Brown, C. D., Engelhardt, B. E. & Montgomery, S. B. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Anwar, S. L., Wulaningsih, W. & Lehmann, U. Transposable elements in human cancer: causes and consequences of deregulation. Int. J. Mol. Sci. https://doi.org/10.3390/ijms18050974 (2017).
Bonté, P. E. et al. Selective control of transposable element expression during T cell exhaustion and anti-PD-1 treatment. Sci. Immunol. 8, eadf8838 (2023).
Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Gide, T. N. et al. Distinct immune cell populations define response to anti-PD-1 monotherapy and anti-PD-1/anti-CTLA-4 combined therapy. Cancer Cell 35, 238–255.e236 (2019).
Shi, Y. et al. PRMT3-mediated arginine methylation of IGF2BP1 promotes oxaliplatin resistance in liver cancer. Nat. Commun. 14, 1932 (2023).
Garland, W. et al. Chromatin modifier HUSH co-operates with RNA decay factor NEXT to restrict transposable element expression. Mol. Cell 82, 1691–1707.e1698 (2022).
Wu, Y. et al. Nuclear exosome targeting complex core factor Zcchc8 regulates the degradation of LINE1 RNA in early embryos and embryonic stem cells. Cell Rep. 29, 2461–2472.e2466 (2019).
Pefanis, E. et al. Noncoding RNA transcription targets AID to divergently transcribed loci in B cells. Nature 514, 389–393 (2014).
Zhu, Y. et al. Relaxed 3D genome conformation facilitates the pluripotent to totipotent-like state transition in embryonic stem cells. Nucleic Acids Res. 49, 12167–12177 (2021).
Andersen, P. K., Lykke-Andersen, S. & Jensen, T. H. Promoter-proximal polyadenylation sites reduce transcription activity. Genes Dev. 26, 2169–2179 (2012).
Soles, L. V. et al. A nuclear RNA degradation code is recognized by PAXT for eukaryotic transcriptome surveillance. Mol. Cell 85, 1575–1588.e1579 (2025).
Lubas, M. et al. Interaction profiling identifies the human nuclear exosome targeting complex. Mol. Cell 43, 624–637 (2011).
Meola, N. et al. Identification of a nuclear exosome decay pathway for processed transcripts. Mol. Cell 64, 520–533 (2016).
Rambout, X. & Maquat, L. E. Nuclear mRNA decay: regulatory networks that control gene expression. Nat. Rev. Genet. 25, 679–697 (2024).
Shen, H. et al. Mouse totipotent stem cells captured and maintained through spliceosomal repression. Cell 184, 2843–2859.e2820 (2021).
Rodriguez-Terrones, D. et al. A molecular roadmap for the emergence of early-embryonic-like cells in culture. Nat. Genet. 50, 106–119 (2018).
Rogalska, M. E. et al. Transcriptome-wide splicing network reveals specialized regulatory functions of the core spliceosome. Science 386, 551–560 (2024).
Hendrickson, P. G. et al. Conserved roles of mouse DUX and human DUX4 in activating cleavage-stage genes and MERVL/HERVL retrotransposons. Nat. Genet. 49, 925–934 (2017).
Eckersley-Maslin, M. A. et al. MERVL/Zscan4 network activation results in transient genome-wide DNA demethylation of mESCs. Cell Rep. 17, 179–192 (2016).
Ji, S. et al. OBOX regulates mouse zygotic genome activation and early development. Nature 620, 1047–1053 (2023).
Yang, J., Cook, L. & Chen, Z. Systematic evaluation of retroviral LTRs as cis-regulatory elements in mouse embryos. Cell Rep. 43, 113775 (2024).
De Iaco, A. et al. DUX-family transcription factors regulate zygotic genome activation in placental mammals. Nat. Genet. 49, 941–945 (2017).
Sakashita, A. et al. Transcription of MERVL retrotransposons is required for preimplantation embryo development. Nat. Genet. 55, 484–495 (2023).
Wu, Y. W. et al. RNA surveillance by the RNA helicase MTR4 determines volume of mouse oocytes. Dev. Cell 60, 85–100.e104 (2025).
Flemr, M. et al. A retrotransposon-driven dicer isoform directs endogenous small interfering RNA production in mouse oocytes. Cell 155, 807–816 (2013).
Taborska, E. et al. Restricted and non-essential redundancy of RNAi and piRNA pathways in mouse oocytes. PLoS Genet. 15, e1008261 (2019).
Stein, P. et al. Essential role for endogenous siRNAs during meiosis in mouse oocytes. PLoS Genet. 11, e1005013 (2015).
Andrews, G. et al. Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science 380, eabn7930 (2023).
Seczynska, M., Bloor, S., Cuesta, S. M. & Lehner, P. J. Genome surveillance by HUSH-mediated silencing of intronless mobile elements. Nature 601, 440–445 (2022).
Ilık, İ. A. et al. Autonomous transposons tune their sequences to ensure somatic suppression. Nature 626, 1116–1124 (2024).
Lykke-Andersen, S. et al. Integrator is a genome-wide attenuator of non-productive transcription. Mol. Cell 81, 514–529.e516 (2021).
Birney, E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
Core, L. J., Waterfall, J. J. & Lis, J. T. Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 322, 1845–1848 (2008).
Kapranov, P. et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488 (2007).
Clark, M. B. et al. The reality of pervasive transcription. PLoS Biol. 9, e1000625 (2011). discussion e1001102.
Hennig, B. P. & Fischer, T. The great repression: chromatin and cryptic transcription. Transcription 4, 97–101 (2013).
Alam, T. et al. Comparative transcriptomics of primary cells in vertebrates. Genome Res. 30, 951–961 (2020).
Palazzo, A. F. & Koonin, E. V. Functional long non-coding RNAs evolve from junk transcripts. Cell 183, 1151–1161 (2020).
Schmid, M. & Jensen, T. H. Controlling nuclear RNA levels. Nat. Rev. Genet. 19, 518–529 (2018).
Fort, A. et al. Deep transcriptome profiling of mammalian stem cells supports a regulatory role for retrotransposons in pluripotency maintenance. Nat. Genet. 46, 558–566 (2014).
Kapusta, A. & Feschotte, C. Volatile evolution of long noncoding RNA repertoires: mechanisms and biological implications. Trends Genet. 30, 439–452 (2014).
Kapusta, A. et al. Transposable elements are major contributors to the origin, diversification, and regulation of vertebrate long noncoding RNAs. PLoS Genet. 9, e1003470 (2013).
Jacob, F. Complexity and tinkering. Ann. N. Y. Acad. Sci. 929, 71–73 (2001).
Zhang, Y., Parmigiani, G. & Johnson, W. E. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2, lqaa078 (2020).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: An R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730 (2022).
Pardo-Palacios, F. J. et al. SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nat. Methods 21, 793–797 (2024).
Wyman, D. et al. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Preprint at bioRxiv https://doi.org/10.1101/672931 (2019).
Reese, F. et al. The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity. Preprint at bioRxiv https://doi.org/10.1101/2023.05.15.540865 (2023).
Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013).
Torre, D. et al. Isoform-resolved transcriptome of the human preimplantation embryo. Nat. Commun. 14, 6902 (2023).
Qiao, Y. et al. High-resolution annotation of the mouse preimplantation embryo transcriptome using long-read sequencing. Nat. Commun. 11, 2653 (2020).
Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).
Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
Grossman, R. L. et al. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 375, 1109–1112 (2016).
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, 2000).
Schneider, C. A., Rasband, W. S. & Eliceiri, K. W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 9, 671–675 (2012).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.j https://doi.org/10.14806/ej.17.1.200 (2011).
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of NGS alignment formats. Bioinformatics 31, 2032–2034 (2015).
Ramírez, F. et al. deepTools2: A next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Wickham, H. et al. dplyr: A grammar of data manipulation. CRAN https://cran.r-project.org/web/packages/dplyr/index.html (2023).
Mistry, M. et al. hbctraining/Intro-to-ChIPseq-flipped: understanding chromatin biology - lessons from HCBC (2nd release). Zenodo https://doi.org/10.5281/zenodo.7723255 (2023).
Heinz, S. et al. Transcription elongation can affect genome 3D structure. Cell 174, 1522–1536.e1522 (2018).
Lin, Y. C. et al. Global changes in the nuclear positioning of genes and intra- and interdomain genomic interactions that orchestrate B cell fate. Nat. Immunol. 13, 1196–1204 (2012).
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 9, 559 (2008).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Yu, G., Wang, L. G. & He, Q. Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).
Jin, Y., Tam, O. H., Paniagua, E. & Hammell, M. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics 31, 3593–3599 (2015).
Hansen, K. D., Langmead, B. & Irizarry, R. A. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 13, R83 (2012).
Perez, G. et al. The UCSC Genome Browser database: 2025 update. Nucleic Acids Res. 53, D1243–D1249 (2025).
Xu, S. et al. Use ggbreak to effectively utilize plotting space to deal with large datasets and outliers. Front. Genet. 12, 774846 (2021).
Garrido-Martín, D., Palumbo, E., Guigó, R. & Breschi, A. ggsashimi: Sashimi plot revised for browser- and annotation-independent splicing visualization. PLoS Comput. Biol. 14, e1006360 (2018).
Shen, L., Shao, N., Liu, X. & Nestler, E. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics 15, 284 (2014).
Korotkevich, G. et al. Fast gene set enrichment analysis. Preprint at bioRxiv https://doi.org/10.1101/060012 (2021).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Lyu, X. et al. A transient transcriptional activation governs unpolarized-to-polarized morphogenesis during embryo implantation. Mol. Cell 84, 2665–2681.e2613 (2024).
Acknowledgements
We thank the I.M. lab and all the teams at the Center for Epigenetics and Metabolism. This work was supported by the University of California, Irvine (UCI) seed fund (to I.M.), Burroughs Wellcome Fund (G-1017892.01 to I.M.) and National Institutes of Health funds (5R01AI168130-05, 5U54OD039864-02 and 5R01NS123287-04, to I.M.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. This work used resources of the UCI Genomics Research and Technology Hub (GRT Hub), parts of which are supported by National Institutes of Health grants to the Comprehensive Cancer Center (P30CA-062203), the UCI Skin Biology Resource-Based Center (P30AR075047) and the GRT Hub for instrumentation (1S10OD010794-01 and 1S10OD021718-01).
Author information
Contributions
M.S. and I.M. conceptualized and designed the study. I.M. directed the study, administered the project and acquired funding. Y.C., D.T.Q., K.H., Y.S.F., J.S.Y.H., L.L., Y.Y. and L.F. performed all the experiments. Y.C., E.G.A., D.T., J.N., M.Z., T.Y., F.R., J.S.Y.H., M.B. and Y.S. performed the bioinformatic and data analysis. Z.W. directed the phylogenetic analysis. U.B., E.K., E.M.V., E.G., M.B., J.N., F.J.A., Y.L., F.R. and Y.S. provided conceptual and experimental guidance or materials.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Structural & Molecular Biology thanks Marina Lusic and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Dimitris Typas, in collaboration with the Nature Structural & Molecular Biology team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 A mixed-sequencing strategy to identify TE-host gene isoforms.
a, Outline of the methodology used to generate the novel transcriptome. RNA extracted from mESCs and EpiLCs was subjected to PacBio long-read (Iso-Seq) and Illumina short-read bulk RNA-Seq. Data from these two sequencing methods were used to generate a novel transcriptome annotation (see Methods), which was then scanned for exonization of TEs. b, The distribution of TEs across the mouse genome, split by TE class. c, The distribution of TEs across the mouse genome, split by TE class and family. d, The distribution of novel and known chimeric isoforms, split by TE class and presence in the reference transcriptome (Ensembl mm10). e, Breakdown of novel and known 5′ TE-chimeras in the mouse compendium (cell line and tissues) transcriptome by presence and location of a valid ORF, TE class, and TE family. f, Change over time of TE-chimera expression in the indicated anatomical region in mice. Bar graphs indicate the gain and loss of individual 5′ chimeras from one time point to the next. Line graphs indicate the total number of chimeras at each time point. g, Line chart displaying the number of promoters passing different combinations of filters in the dataset. The “Absolute > 1.5, Relative > 15% in at least 2 samples from the same tissue” filter was applied for subsequent analyses. h, Bar chart displaying the distribution of expressed transcripts per identified promoter. i, Bar chart displaying the proportion of individual transcripts per identified promoter. j, Proportion of TE-promoters derived from each TE family by number of tissues in which each promoter is active.
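The promoter filter quoted in g can be sketched as follows. This is an illustrative Python sketch, not the code used in the study: the input layout (a mapping from tissue to per-sample pairs of absolute and relative promoter activity) and the function name are hypothetical, relative activity is expressed as a fraction (15% = 0.15), and it is assumed that both thresholds must be met in the same sample.

```python
# Illustrative sketch only; data layout and names are assumptions.

ABS_MIN = 1.5      # absolute promoter activity must exceed this value
REL_MIN = 0.15     # relative promoter activity must exceed 15%
MIN_SAMPLES = 2    # in at least 2 samples from the same tissue


def passes_promoter_filter(activity_by_tissue):
    """Return True if any single tissue has enough passing samples.

    activity_by_tissue: dict of tissue -> list of (absolute, relative)
    tuples, one tuple per sample (hypothetical layout).
    """
    for samples in activity_by_tissue.values():
        n_pass = sum(
            1 for absolute, relative in samples
            if absolute > ABS_MIN and relative > REL_MIN
        )
        if n_pass >= MIN_SAMPLES:
            return True
    return False


if __name__ == "__main__":
    promoter = {"liver": [(2.0, 0.30), (1.8, 0.20)], "heart": [(0.1, 0.0)]}
    print(passes_promoter_filter(promoter))
```

Counting is done per tissue, so two passing samples drawn from two different tissues do not satisfy the filter.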
Extended Data Fig. 2 Filtering and distribution of TE-chimera expression across human and mouse organogenesis.
a, UpSet plot displaying the number of TE-chimeric promoters active in different combinations of organs. b, Bar chart displaying the percentage of coding and noncoding TE-chimeric isoforms (predicted using CPAT). c-d, Bar plots displaying the percentage of TE classes across protein-coding status (c), and the percentage of coding and noncoding TE-chimeric isoforms across TE classes (d). e, Bar plots displaying the percentage of coding and noncoding TE-chimeric isoforms, grouped by organ. f, Overlap of homologous genes with TE-promoters in organogenesis in human and mouse. g, Overlap of homologous genes with TE-promoters in embryogenesis in human and mouse. h, Overlap of homologous genes with TE-promoters in embryogenesis in human and mouse by developmental stage. i, Box plots displaying the average absolute promoter activity derived from distinct biological replicates corresponding to different fetal or postnatal organs from multiple donor organisms26. Only promoters with at least 1.5 absolute activity and 15% relative activity in at least 2 samples from the same tissue were included in the analysis; the numbers of fetal and postnatal promoters in human tissues were forebrain (13,952 vs 10,028), heart (14,402 vs 4,548), hindbrain (14,520 vs 11,440), kidney (11,136 vs 3,840), liver (22,015 vs 7,735), ovary (6,084 vs 0), and testis (44,010 vs 22,820), and in mouse tissues were brain (396 vs 0), forebrain (3,000 vs 2,500), heart (5,635 vs 3,220), hindbrain (2,760 vs 2,185), kidney (5,363 vs 3,460), liver (11,438 vs 6,020), ovary (4,824 vs 2,680), and testis (11,203 vs 6,590). Box limits extend from the 25th to 75th percentiles, and the middle line represents the median. Whiskers extend to the largest value no further than 1.5 times the inter-quartile range (IQR) from each box hinge. Points beyond the whiskers are outliers. P-values were calculated using an unpaired, two-sided Wilcoxon rank-sum test. 
P-values smaller than approximately 2.2 × 10⁻¹⁶ are reported as “p < 2.2e-16” due to the limits of double-precision floating-point arithmetic, while exact values are shown when numerically representable. Only organs with both fetal and postnatal samples are displayed. j, Heatmap displaying the relative promoter activity of TE-chimeric promoters across RNA-Seq samples spanning multiple organs and developmental time points in mouse organogenesis.
Extended Data Fig. 3 Filtering and distribution of TE-chimeras across GTEx dataset.
a, The number of TE-chimeras that meet the criteria for exonization at 3 filtering steps, compared for the current human genome build (GRCh38) either including (blue) or excluding (green) novel long-read-informed references derived from GENCODE V29. b, Modeling analysis of the number of TE-chimeras detected in promoters (y-axis, average of 10 iterations) as a function of the number of individuals analyzed in GTEx (x-axis). c, Pie chart showing the distribution of TE-chimeras in GTEx, separated by TE class, that pass the filtering criteria of proactive promoter usage (absolute 1.5 and 15% relative proportion) and robust detection (TPM over 0.1 in at least 20% of the individuals). d, Density graphs showing the relative number of TE-chimeras detected (y-axis) against the number of tissues in which they are detected (x-axis). Plots are separated by relative expression (TPM, left-to-right) or class of TE exonized (top-to-bottom).
Extended Data Fig. 4 TE exonization events are enriched in chemotherapy drug resistance.
a, Ratio of significantly differentially expressed genes (padj < 0.01) in each pathway shown to the total number of significantly differentially expressed genes (y-axis), compared between genes encoding TE-exonized transcripts (orange) and all non-exonized transcripts (blue). P-values were calculated using a one-sided Fisher’s exact test based on the ratios.
Extended Data Fig. 5 Detection of known and novel LTR-chimeras and Exosc3-dependency.
a, qRT-PCR validation of Exosc3 cKO in COIN mESCs in the presence of tamoxifen (4-OHT) for 48 h. Control cells were treated with ethanol (EtOH). Data indicate the mean of two replicates, with individual values shown. b, Bar plot displaying the number of LTR, SINE, and LINE chimeric isoforms detected in the novel isoform-resolved transcriptome, grouped by coding probability. c, Bar plot displaying the proportion of LTR chimeric isoforms and other isoforms detected in the novel isoform-resolved transcriptome, grouped by structural categories. d, Same as (c) but grouped by transcript novelty. For (b-d), only TE-chimeras oriented on the same strand as the corresponding TE were considered. e, Venn diagrams displaying the overlap of MT2_Mm (the long terminal repeat regions of MERVL elements) chimeric genes from this study and those identified in Macfarlan et al.16. Only known genes present on standard chromosomes were considered. f, Box plot displaying promoter activity log2 fold changes in Exosc3 cKO compared to WT, grouped by promoter type (n = 12,029, 836, 27, and 91 for host, LTR, LINE, and SINE promoters, respectively). Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 1.012285e-139, 4.534413e-02, and 1.140699e-07 for LTR vs host, LTR vs LINE, and LTR vs SINE, respectively). g, Box plot displaying promoter activity log2 fold changes in Exosc3 cKO compared to WT, grouped by LTR family (n = 25, 142, 500, and 168 for ERV1, ERVK, ERVL, and ERVL-MaLR, respectively). Gypsy was excluded because of the small number of promoters detected in Fig. 3c. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 1.133782e-05, 4.089414e-18, and 1.498875e-15 for ERVL vs ERV1, ERVL vs ERVK, and ERVL vs ERVL-MaLR, respectively). 
h, Density plots displaying promoter activity log2 fold changes in Exosc3 cKO compared to WT, grouped by LTR promoters with more or same RNAPII enrichment upon Exosc3 cKO. P-values were calculated based on an unpaired, two-sided Wilcoxon rank-sum test (p = 2.171170e-107 for more RNAPII promoter and 6.708358e-39 for same RNAPII promoter). i, Bar plots displaying percentage of LTR promoters with more or same RNAPII enrichment upon Exosc3 cKO, grouped by LTR family. j, Box plot of Hi-C PC1 compartment scores grouped by the presence or absence of chimeric LTR, considering only regions that contain any LTR (n = 772 with chimeric LTR and 101,910 without chimeric LTR). No outliers were observed. P-values were calculated based on an unpaired, two-sided Wilcoxon rank-sum test (p = 1.766434e-35). k, Box plot of Hi-C PC1 compartment scores stratified by the enrichment of chimeric LTR relative to total LTR, considering only bins that contain chimeric LTR. Q5 corresponds to the highest chimeric LTR enrichment. n = 155, 155, 154, 154, and 154 for Q1, Q2, Q3, Q4, and Q5, respectively. No outliers were observed. P-values were calculated based on an unpaired, two-sided Wilcoxon rank-sum test without adjustment for multiple comparisons. For (f, g, j, and k), the box hinges represent the 25th and 75th percentiles, and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5 times the inter-quartile range. Data beyond these limits are outliers.
Extended Data Fig. 6 RNA degradation and genomic position dependent LTR-chimera biogenesis.
a, Bar plot displaying the proportion of the top 16 TE elements with SD-PAS found in Fig. 4a that were defined as TE-promoters. b, Pie charts displaying the percentage of all chimeric and non-chimeric LTRs, grouped by genomic position relative to the mm10 Ensembl transcriptome using the ChIPseeker75,76 R package: promoter (3 kb around TSS), intragenic, downstream (≤ 300 bp from gene end), and distal intergenic. c, Box plot displaying average promoter-proximal antisense coverage from nascent RNA sequencing (metabolic labeling) between each gene and its nearest promoter-proximal upstream intergenic LTR in WT, grouped by chimeric status (n = 11,532 non-chimeric and 60 chimeric). The box hinges represent the 25th and 75th percentiles, and the middle line represents the median. Whiskers extend from the hinges to the most extreme values within 1.5 times the inter-quartile range. Data beyond these limits are outliers. Asterisks and p-values were calculated based on an unpaired, two-sided Wilcoxon rank-sum test (p = 5.034297e-07). d, Density plot displaying log2 fold changes of average promoter-proximal antisense coverage from nascent RNA sequencing between each gene and its nearest promoter-proximal upstream intergenic LTR in Exosc3 cKO compared to WT, grouped by chimeric status. P-values were calculated by an unpaired, two-sided Wilcoxon rank-sum test.
Extended Data Fig. 7 Inhibition of RNA exosome and splicing function leads to upregulation of LTR-chimeras.
a, Western blot displaying depletion of Wdr82 protein in siCTL- and siWdr82-transfected mESCs. Loading control (β-Actin) was run on a separate gel with 1:10 dilution to avoid signal saturation from its high expression. Data shown represent one replicate. b-c, Bar plots displaying the percentage of significantly upregulated TE classes (padj < 0.05) in Exosc3 cKO (b) and knockdown of NEXT, PAXT, Integrator, and Restrictor (c). P-values were calculated using a two-sided Wald test with Benjamini-Hochberg adjustment as implemented in DESeq2 (ref. 73). d, Heatmap showing the correlation between upregulated TE and gene expression signature (padj < 0.05 in both comparisons) in Exosc3 cKO compared to WT and that in each nuclear cofactor depletion compared to control. e, Bar plot showing the percentage of upregulated TE classes (padj < 0.05) in splicing factor depletion. P-values were calculated using a two-sided Wald test with Benjamini-Hochberg adjustment as implemented in DESeq2 (ref. 73). f, Heatmap representing the correlation between upregulated TE and gene expression signature (padj < 0.05 in both comparisons) in Exosc3 cKO compared to WT and that in each splicing factor depletion compared to control. g and j, Percent cytotoxicity quantified by a lactate dehydrogenase (LDH) colorimetric assay. siCTL-transfected cells treated with lysis solution served as the maximum LDH release control. h-i, qRT-PCR analysis of Snrpd2 (h) and Nelfa chimera (i) transcripts in siCTL- and siSnrpd2-transfected mESCs. k-l, qRT-PCR analysis of Srsf7 (k) and Nelfa chimera (l) transcripts in siCTL- and siSrsf7-transfected mESCs. Data indicate the mean of two replicates, with individual values shown (g-l). m, Boxplot showing log2 fold changes of LTR-promoter activity for each splicing factor depletion compared to control, grouped by LTR-promoters with more or same RNAPII enrichment in Exosc3 cKO (defined in Fig. 
3e; for Eftud2 KD, n = 129 more and 98 same; for Isy1 KD, n = 174 more and 114 same; for Lsm4 KD, n = 105 more and 82 same; for Snrpb KD, n = 170 more and 122 same; for Snrpd2 KD, n = 194 more and 128 same). Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 2.739985e-05 for Eftud2 KD, 9.458391e-07 for Isy1 KD, 1.321891e-05 for Lsm4 KD, 0.0003695538 for Snrpb KD, and 6.356563e-05 for Snrpd2 KD). n, Heatmap representing the correlation between log2 fold changes of TEs and genes in Exosc3 cKO and those in mESCs cultured with splicing inhibitor (PlaB) at different passages. P0 indicates mESCs cultured without splicing inhibitor. For the PlaB-treated dataset42, log2 fold changes were manually calculated using DESeq2-normalized counts relative to P0 after adding a small constant (+0.1) to all values. Only TEs and genes that were significantly changed in Exosc3 cKO (padj < 0.05) and showed changes in PlaB-treated cells (abs(log2FoldChange) > 0) were considered. o, Scatter plot showing the correlation between log2 fold changes in Exosc3 cKO and those with Scr AMO treatment. Differentially expressed genes (padj < 0.05) in both comparisons were considered. The Pearson correlation coefficient and corresponding two-sided p-value (below the limits of double-precision floating-point arithmetic) were calculated. The regression line for the Pearson correlation is shown in red. p, Boxplot displaying log2 fold changes for protein-coding genes with 1-2 exons or >2 exons in WT treated with U1 AMO compared to WT treated with Scr AMO. Log2 fold changes were calculated using DESeq2 (ref. 73). All expressed genes were considered (n = 2,163 PCGs with 1-2 exons and 15,423 PCGs with >2 exons). Outliers were removed. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 1.084265e-24). 
q, Boxplot representing log2 fold changes of splicing efficiency quantified by SPLICE-q78 in WT treated with U6atac AMO compared to WT treated with Scr AMO, grouped by intron classes determined by intronIC79 (n = 59,650 major introns and 204 minor introns). Outliers were removed. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 4.765693e-132). r, Boxplot showing log2 fold changes of LTR-promoter activities in either U1 or U6atac AMO-treated WT compared to Scr AMO-treated WT (n = 459). Outliers were removed. Asterisks represent statistically significant differences based on a paired, two-sided Wilcoxon signed-rank test (p = 1.563514e-09). s, Bar plots showing the percentage overlap of differentially expressed genes and TEs (padj < 0.05 for each comparison) in each perturbation condition with those in Exosc3 cKO. t, Schematic of spliceosome assembly80,81 showing the stages that can be blocked by specific factors. Snrpb and Snrpd2 are essential components shared by all five snRNPs82. Eftud2, Lsm4, and Isy1 are components of the spliceosome, mainly functioning as part of the U5 snRNP, U6 snRNP, and NTC complex, respectively83,84,85. U1 AMO inhibits the binding of U1 snRNP86, and PlaB disrupts U2 snRNP function by targeting SF3B1 (ref. 87). For clarity, only the earliest step that can be affected by each factor is indicated. u, Boxplots displaying log2 fold changes for protein-coding genes with 1-2 exons or >2 exons in each perturbation compared to the corresponding control. U1 AMO, Exosc3 cKO, and double inhibition were compared to WT treated with Scr AMO, and Snrpb and Snrpd2 KD were compared to control siRNA-treated samples. PlaB represents the differential gene expression test in the dataset generated in mESCs cultured with splicing inhibitor (PlaB) across passages. 
The differential gene expression test was designed to compare the effect in the late passages (passage 4 to 6) over the early passages (passage 0 to 2). Log2 fold changes were calculated using DESeq2 (ref. 73). All expressed genes in each comparison were considered (for Exosc3 cKO, n = 2,204 PCGs with 1-2 exons and 15,427 PCGs with >2 exons; for Exosc3 cKO with U1 AMO, n = 2,300 PCGs with 1-2 exons and 15,599 PCGs with >2 exons; for PlaB, n = 2,224 PCGs with 1-2 exons and 15,859 PCGs with >2 exons; for Snrpb KD, n = 2,134 PCGs with 1-2 exons and 15,538 PCGs with >2 exons; for Snrpd2 KD, n = 2,141 PCGs with 1-2 exons and 15,526 PCGs with >2 exons). Outliers were removed. Asterisks represent statistically significant differences based on an unpaired, two-sided Wilcoxon rank-sum test (p = 3.576957e-46 for Exosc3 cKO, 9.041203e-77 for Exosc3 cKO with U1 AMO, 7.622347e-43 for PlaB, 1.237728e-26 for Snrpb KD, and 1.964773e-24 for Snrpd2 KD). v, Bar plots representing log2 fold changes of MERVL-int, Dux (Duxf3), and Zscan4d for the U1 AMO-treated, Exosc3 cKO, and double inhibition conditions. Asterisks represent the nominal p-value calculated by DESeq2 (ref. 73). w-x, Confirmation of overexpression of GFP (w) and Dux-FL and Dux-S (x) at the transcript level in WT and Exosc3 cKO. qRT-PCR amplicons were quantified using ImageJ software88 and normalized to the corresponding Actb amplicon intensities. Data indicate the mean of two replicates, with individual values shown. y-z, Coverage profile displaying the enrichment of Dux ChIP-Seq in mESCs (y) and Obox1 Stacc-Seq in late 2-cell mouse embryos (z) across LTRs, grouped by chimeric status and RNAPII enrichment changes in Exosc3 cKO relative to WT. For (d, f and n), the WGCNA bicorAndPvalue function77 was used to generate biweight midcorrelation coefficients and corresponding two-sided p-values without adjustment for multiple comparisons. For (m, p, q, r, and u), the box hinges represent the 25th and 75th percentiles, and the middle line represents the median. 
Whiskers extend from the hinges to the most extreme values within 1.5 times the inter-quartile range. Data beyond these limits are outliers. For (s, u, and v), when comparing the Exosc3 cKO effect with the AMO dataset, the comparison between Exosc3 cKO transfected with Scr AMO and WT transfected with Scr AMO was used.
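Two of the calculations described in this legend can be illustrated with a minimal Python sketch: the pseudocount-based log2 fold change used for the PlaB-treated dataset (panel n) and the boxplot whisker rule. This is not the authors' code. The function names are hypothetical, and the quartile interpolation (Python's "inclusive" method) is an assumption, since the legend does not state which quantile definition was used.

```python
import math
from statistics import quantiles


def log2_fold_change(treated, baseline, pseudocount=0.1):
    """log2 ratio of two normalized counts after adding a small constant to both.

    The default pseudocount of 0.1 follows the legend; the function itself
    is an illustrative assumption.
    """
    return math.log2((treated + pseudocount) / (baseline + pseudocount))


def tukey_whiskers(values):
    """Return (lower_whisker, q1, median, q3, upper_whisker, outliers).

    Whiskers reach the most extreme data points lying within 1.5 x IQR of
    the hinges; points beyond those limits are reported as outliers.
    """
    data = sorted(values)
    # Quartile interpolation method is an assumption (not given in the legend).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return inside[0], q1, median, q3, inside[-1], outliers
```

With a pseudocount, a feature that has zero counts in both conditions yields a fold change of 0 rather than an undefined value, which is presumably why the constant was added before taking ratios relative to P0.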
Extended Data Fig. 8 Evolutionary analysis of mouse TE-chimeras.
a, Bar plots show the proportions of sense chimeric TEs, antisense chimeric TEs, and non-chimeric TEs in the mouse genome. All TEs, murine-specific TEs, rodent-specific TEs, and TEs conserved beyond the rodent lineage are shown from top to bottom panels. b, Analysis of evolutionary ages for ERV1, ERVK, ERVL, ERVL-MaLR, L1, L2, Alu, B2 and hAT-Charlie in the mouse genome. Randomly chosen intergenic non-TE regions are also used as controls. The left bar plots show the percentage of these TEs and non-TEs across different evolutionary ages, from mouse-specific (young; colored red) to eutherian-mammal conserved (old; colored dark blue). The middle boxplots indicate the number of mammalian genomes (among a total of 240) these TEs or genomic regions can map to. Boxes present the median, lower quartile, and upper quartile, while whiskers denote the maximum and minimum after removing outliers (more than 3/2 times the upper quartile or less than 3/2 times the lower quartile). P-values were generated using an unpaired two-sided Wilcoxon rank-sum test. The right bar plots represent the number of sense chimeric and antisense chimeric TEs across different evolutionary ages.
Extended Data Fig. 9 Evolutionary analysis of human TE-chimeras.
a, Analysis of evolutionary ages for HERVH, ERVK, Alu, MIR, hAT-Charlie, and SVA elements in the human genome. Randomly chosen intergenic non-TE regions are also used as controls. The left bar plots show the percentage of these TEs and non-TEs across different evolutionary ages, from human-specific (young; colored red) to eutherian-mammal conserved (old; colored dark blue). The middle boxplots indicate the number of mammalian genomes (among a total of 240) these TEs or genomic regions can map to. Boxes present the median, lower quartile, and upper quartile, while whiskers denote the maximum and minimum after removing outliers (more than 3/2 times the upper quartile or less than 3/2 times the lower quartile). P-values were generated using an unpaired two-sided Wilcoxon rank-sum test. The right bar plots represent the number of sense chimeric and antisense chimeric TEs across different evolutionary ages58.
Extended Data Fig. 10 EXOSC3 and SAFB regulate distinct exonized transcripts.
a, Pipelines to identify exonization events from the current study and from Ilyk et al.58 were compared in their ability to identify exonization events in GTEx v8, where most events were shared between reference transcriptomes and expression criteria. b, The number of total differentially expressed (padj < 0.1) transcripts in HeLa cells from SAFB triple KD (compared to siGFP control) or EXOSC3 siRNA (compared to scrambled control). Y-axis shows gene ratio and bars are colored by enrichment p-value. c, The same intersection as in a, focused on TE-exonized transcripts. d-e, The top 3 Gene Ontology terms from overrepresentation tests based on the top 300 SAFB triple knockdown genes (d) or exonized genes (e). f-g, The same overrepresentation analyses as d-e, but corresponding to the top 300 EXOSC3 knockdown genes (f) or exonized genes (g). P-values were calculated using a nonparametric permutation test to build a null distribution of the Enrichment Score (ES). From the distribution of p-values, false discovery rate corrections were made using the Benjamini-Hochberg method, resulting in padj values.
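The Benjamini-Hochberg correction mentioned in this legend can be sketched as follows. This is a generic illustration of the standard step-up procedure, not the enrichment software used in the study. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank taken in ascending order), then
    a running minimum from the largest rank downward enforces monotonicity.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term would then be reported as significant when its adjusted value falls below the chosen threshold, for example padj < 0.1 as in panel b.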
Supplementary information
Supplementary Tables
Supplementary Table 1. Table of all identified TE-chimeras in mouse organ data.
Supplementary Table 2. Table of TE-chimera expression in mouse organ data across time.
Supplementary Table 3. Table of high-confidence TE-chimeras in GTEx data.
Supplementary Table 4. Table of SNPs mapped to TE-chimeras in cis.
Supplementary Table 5. Table of SNPs mapped to TE-chimeras overlapped with human genome-wide association loci for heritable traits.
Supplementary Table 6. List of primers for qPCR.
Supplementary Table 7. List of publicly available datasets used in this study.
Source data
Source Data Fig. 5 and Extended Data Figs. 5–7
Numerical source data.
Source Data Fig. 4 and Extended Data Fig. 7
Uncropped blots and gels.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cheon, Y., Alvstad, E.G., Torre, D. et al. Transposable element–gene chimera cartography, origination and role in enhancing transcriptome plasticity. Nat Struct Mol Biol (2026). https://doi.org/10.1038/s41594-026-01757-z
DOI: https://doi.org/10.1038/s41594-026-01757-z






