Abstract
Transposable elements (TEs) are genomic elements present in multiple copies in mammalian genomes. TEs were thought to have little functional relevance but recent studies report roles in biological processes, including embryonic development. To investigate the expression dynamics of TEs during human early development, we generated long-read sequence data from human pluripotent stem cells (hPSCs) in vitro differentiated to endoderm, mesoderm, and ectoderm lineages to construct lineage-specific transcriptome assemblies and accurately place TE sequences. Our analysis reveals that specific TE superfamilies exhibit distinct expression patterns. Notably, we observed TE switching, where the same family of TE is expressed in multiple cell types, but originates from different transcripts. Interestingly, TE-containing transcripts exhibit distinct levels of transcript stability and subcellular localization. Moreover, TE-containing transcripts increasingly associate with chromatin in germ layer cells compared to hPSCs. This study suggests that TEs contribute to human embryonic development through dynamic chromatin interactions.
Introduction
TEs are genomic elements with multiple copies resulting from autonomous and non-autonomous duplication in genomes1,2. About half of the human genome consists of TEs1,3,4 that come in families with multiple evolutionary histories and different genomic distributions across species5. Because of their repetitive nature and their underrepresentation in protein-coding genes, TEs were considered to have little functional importance6,7. In comparative genomics studies, TE-containing genomic regions were often actively removed from analysis7,8. However, recent studies have shown that TEs are functional in diverse biological processes2,4, including transcription9,10,11, post-transcriptional expression regulation12,13, transcript processing and stability12,14,15, chromatin regulation16,17, development18,19 and disease progression20,21. While TEs have been co-opted in normal developmental processes22,23, TEs pose a risk to genomic integrity4,24, and can cause mutations that interfere with regulatory networks and form chimeric spliced coding and noncoding transcripts that alter function2,4. TEs can therefore be both beneficial and harmful6. Although TEs have been studied in pluripotent stem cells12,23 and some terminally differentiated somatic tissues20,25, their expression patterns and their roles in post-implantation human development and gastrulation have not been well explored.
The presence of multiple copies of TEs in the genome makes the investigation of their functions difficult26,27. This is especially the case when using short-read RNA-seq as TE fragments often cannot be uniquely mapped to a genomic location28, and replicate TE sequences make assembling TE-containing transcripts from short-reads fraught with difficulty even in well-annotated species28,29,30. Previous efforts have considered global TE expression without consideration for the specific TE loci20,31,32. Efforts are now being made to investigate TE expression at specific loci12,26, with an emphasis on understanding how TE sequences are spliced to form chimeric transcripts. Locus-level expression quantification can improve transcript assembly33. Although both short reads and long reads can be used for transcript assembly12,27, the quality of the transcript assembly based on long-reads is superior12,34, and benefits the study of TE transcriptomics.
We have previously shown that TE sequences are richly expressed in hPSCs, including inside coding and noncoding transcripts, and their presence is associated with changes in transcript biophysical properties12. It is generally assumed that TEs are most active in the pluripotent stage of development and decline in somatic tissues. However, this has not been explored, particularly in the context of TE sequences inside transcripts. In this study, we used human early development transcripts (hEDTs) assembled exclusively from long-read data to investigate TE expression dynamics. We differentiated hPSCs in vitro to the three germ layers (endoderm, mesoderm, and ectoderm) to mimic human gastrulation and conducted long-read RNA sequencing for each cell type. Using the assemblies, we showed that TE-containing transcripts were dynamically regulated. Our results show that TE-containing transcripts alter RNA stability and chromatin interaction in a cell type-specific pattern. Overall, this study presents the expression dynamics of TEs in early development and implicates TEs in somatic differentiation.
Results
hPSC transcriptome assembled from long-read RNA-Seq data
We have previously described an hPSC-specific transcriptome based on a combined long-read and short-read pipeline12. Here we relied exclusively on deeply sequenced long-read data to assemble transcripts. This strategy has the advantage that TEs will be accurately placed into their transcript context. To this end, we generated PacBio long-read RNA-sequence data for hPSCs. Consensus reads were generated from the PacBio subreads and then processed to produce noiseless reads using a standard PacBio read pipeline (see Supplementary Note and Supplementary Data 1). The reads were mapped to the human genome using Minimap235 and assembled into transcripts using StringTie236,37. After quality assessment, 33,375 hPSC transcripts, corresponding to 16,814 genes, were assembled.
One advantage of transcript assembly using long-reads only is the ability to sequence full-length mostly intact transcripts27,28. However, transcript assembly from long-reads is not error-free29,38,39, due to region-bias errors40, reference-induced errors41, semi-random RNA shearing, and even differences in analysis pipelines41,42,43. We, therefore, checked if our assembled transcripts were full-length and supported by long-reads, as opposed to annotation-aided assembly.
The assembled transcripts were covered entirely by multiple consensus reads (Fig. 1a), and most assembled transcripts were completely covered by at least one consensus read (Fig. 1b), suggesting that full-length transcripts were assembled. Interestingly, almost all the consensus reads were used in the assembly (Supplementary Fig. 1a). As a negative control, we generated a ‘random GTF set’ with the same number and structure as the assembled transcripts (see Supplementary Methods). This random GTF was not supported by our long-read data (Supplementary Fig. 1b). The pileup of consensus reads shows that the reads map from the transcription start sites (TSSs) to the transcription end sites (TESs) of the hPSC assembly (Fig. 1c). The pileup of short-read data indicates end-to-end transcript coverage for the hPSC assembly, but not for random GTF coordinates (Supplementary Fig. 1c).
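The full-length criterion above (a transcript completely covered by at least one consensus read) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the half-open `(start, end)` interval representation, and the toy coordinates are all assumptions.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fraction_covered(exons, read_blocks):
    """Fraction of a transcript's exonic bases covered by one read's aligned blocks."""
    exons = merge(exons)
    blocks = merge(read_blocks)
    total = sum(end - start for start, end in exons)
    covered = 0
    for exon_start, exon_end in exons:
        for block_start, block_end in blocks:
            covered += max(0, min(exon_end, block_end) - max(exon_start, block_start))
    return covered / total if total else 0.0


def best_read_coverage(exons, reads):
    """Coverage achieved by the single best read (hypothetical helper)."""
    return max((fraction_covered(exons, blocks) for blocks in reads), default=0.0)


# Toy example (assumed coordinates): a two-exon transcript and two reads.
exons = [(100, 200), (300, 400)]
full_read = [(100, 200), (300, 400)]
partial_read = [(150, 200), (300, 350)]
print(best_read_coverage(exons, [partial_read, full_read]))
```

A transcript would count as full-length under this sketch when `best_read_coverage` equals 1.0; a real analysis would read the aligned blocks from the BAM records rather than from hand-written tuples.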
a Consensus read coverage of the assembled hPSC transcripts using all consensus long reads. b Transcript coverage (percent of transcript covered entirely by a read) ranked by the longest consensus read for the assembled hPSC transcripts. CPM=counts per million. c Pileups for the consensus long reads for the assembled hPSC transcripts across the exons from the TSS to the TES. Each transcript is scaled to a uniform size. d DeepCAGE (5’ end) and polyA tail (3’ end) pileups of the assembled transcripts, from the TSS to the TES scaled to a uniform size and not including intronic regions. PolyA data were from GSE138759103 and GSE111134104 while deepCAGE data were from GSE34448139 and GSE61264102. e Density pileup of POLR2A and phosphorylated POLR2A (pPOLR2A) for the assembled hPSC transcripts. The data were from GSE32465101. f Genome view of a novel hPSC transcript. g Genome view of a variant transcript made by a skipped exon (marked with dotted line) of the LAS1L gene. h Position weight matrix nucleotide frequencies and phyloP conservation scores of the 10-bp positions around the donor and acceptor splice junctions of the assembled hPSC transcripts. bp=base pairs. i Proportion of hPSC junctions covered by indicated minimum number of reads.
To assess the completeness of the assembled transcripts, we utilized published hPSC deepCAGE44 and polyadenylated (polyA) data that mark the TSS and TES, respectively. The long-read assembled TSSs were enriched for deepCAGE tags, while the TESs were enriched for polyA (Fig. 1d). These enrichments were not found in the random GTF set (Supplementary Fig. 1d). Similarly, ChIP-Seq data for the TSS marks POLR2A, H3K27ac and H3K4me3 were enriched at the promoters of the hPSC assembly, but not in the random GTF (Fig. 1e, Supplementary Figs. 1e, f). Additionally, H3K36me3 was enriched on the transcript bodies (Supplementary Fig. 1g). As examples, POU5F1 and SLC2A3 were marked by deepCAGE, POLR2A, H3K4me3 and H3K27ac at the TSS and polyA at the TES (Supplementary Fig. 1h, i). This pattern was also seen on novel transcripts on chromosome 1 (Fig. 1f), and a variant isoform of the LAS1L gene with a skipped exon (Fig. 1g).
We next checked the splice junctions of the assembled transcripts. The evolutionary conservation scores across the junctions were higher relative to introns, and the highest conservation was in the dinucleotide at the splice junctions for both the 5’ and 3’ ends (Fig. 1h). Indeed, the nucleotide frequencies showed the typical GT dinucleotide at the donor site, and AG at the acceptor site. Analysis of short-read RNA-seq data showed that more than 99% of the splice junctions were supported by at least one short read, and 85% by 10 or more reads (Fig. 1i, Supplementary Fig. 1j). Taken together, these results, from multiple data types, demonstrate the fidelity of the termini and the splice junctions of the assembled transcripts.
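The two junction checks described here, the canonical GT/AG dinucleotides and the proportion of junctions supported by a minimum number of short reads, can be illustrated with a small sketch. The function names, the dictionary of per-junction read counts, and the example values are assumptions made for illustration; they are not taken from the study.

```python
def is_canonical(intron_seq):
    """True if an intron (5'->3' on the transcript strand) starts with GT and ends with AG."""
    seq = intron_seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def proportion_supported(junction_counts, min_reads):
    """Proportion of junctions with at least `min_reads` supporting reads."""
    if not junction_counts:
        return 0.0
    supported = sum(1 for count in junction_counts.values() if count >= min_reads)
    return supported / len(junction_counts)


# Hypothetical short-read counts per junction (keys are arbitrary labels).
counts = {"j1": 25, "j2": 10, "j3": 3, "j4": 0}
for threshold in (1, 10):
    print(threshold, proportion_supported(counts, threshold))
```

Evaluating `proportion_supported` over a range of thresholds gives a cumulative support curve of the kind summarized in Fig. 1i.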
In vitro differentiation transcriptome trajectory in germ layers
To investigate transcript expression dynamics during human embryonic development, we in vitro differentiated hPSCs to three germ layers using previously published protocols45 (Fig. 2a). Lineage-specific marker genes were up-regulated in each lineage while pluripotent markers decreased along the differentiation time course (Supplementary Fig. 2a). We examined the expression patterns of a set of lineage-specific markers46 and found a close match (Supplementary Fig. 2b). We also compared the expression profiles of our cells to published data and found that our cells agreed well with the RNA-Seq data from in vitro differentiation experiments47, gastrulating human embryos48 and human embryoid bodies49 (Supplementary Fig. 2c–e). To confirm the fidelity of our differentiation experiments, we performed Fluorescence-Activated Cell Sorting (FACS) using PAX6, the marker for ectoderm cells. Ectoderm differentiation was confirmed by the expression of PAX6, and 90% of cells were positive (Supplementary Fig. 2f). Ectoderm has four main lineages, each with a unique gene expression pattern50. To investigate the specific lineage of our ectoderm cells, we retrieved short-read RNA-Seq data. Consistent with PAX6 expression (Supplementary Fig. 2f), the correlation analysis showed that our ectoderm cells were closer to neuroectoderm (Supplementary Fig. 2g). These results suggest that our in vitro cells captured a normal developmental trajectory.
a Schematic for the in vitro differentiation of hPSC to each of the three embryonic germ layers. b Number of transcripts assembled from each cell state and the merged transcript assembly. c PCA showing the relationships between the samples based on long-read quantification. In addition to hPSC data generated in this study, hPSC (H1) and (c11) were from previously published data12. d Boxplots showing the spliced transcript lengths (left) and number of exons per assembled transcript (right). Boxplots show the median (orange line in the central bar), and second and third quartiles (central bar), and the whiskers show 1.5 times the interquartile ranges, for this and all subsequent boxplots. Statistics are from a Kruskal-Wallis H-test from 5 hPSC replicates and 2 replicates for all other cell types. kbp=kilobase pairs. e Stacked bar chart showing the number of coding and noncoding transcripts in the hEDT assemblies (left plot), and the number of long-reads assigned to each category (right plot). f Stacked bar chart showing the number of assembled transcripts based on similarity to the GENCODE (v43) assembly. A matching transcript completely matches a known transcript (including all exons and splice junctions), while a variant transcript has any base pair in common with a GENCODE exon. Novel transcripts do not overlap a GENCODE exon. g Bar chart of the number of novel coding and noncoding genes identified in each cell state. Because a gene might have more than one isoform, a gene could be both coding and noncoding. h Heatmap showing uniformly expressed and lineage-specific transcripts based on long-read quantification, normalized to the highest expression for each transcript. i Genome views of ectoderm-specific loci FEZF1 (left panel) and PAX6 (right panel). j Single-cell RNA-seq expression of in vitro differentiation of hPSCs47, showing uniformly expressed and marker transcripts specific to the indicated cell states. k Enrichment of SNPs with clinical relevance in novel transcripts compared to a matched random background of transcripts with equal lengths and exon structures. Clinically relevant SNPs were from Ensembl140. Statistics are from a Fisher Exact test.
After confirming the cell types, long-read sequencing for two biological replicates per cell type was performed to a depth of 20-45 million PacBio raw subreads (Supplementary Fig. 3a). The consensus reads generated from each replicate ranged from 0.5 to 1.2 million, and resulted in transcript assemblies containing 31,245 endoderm, 37,161 mesoderm, and 56,063 ectoderm transcripts (Fig. 2b). These corresponded to 14,989 endoderm, 17,520 mesoderm, and 24,766 ectoderm genes (Supplementary Fig. 3b). To assess the reliability of the assembled transcripts, we checked the splice sites of the assembled transcripts. Across cell states, base-level conservation levels and nucleotide frequencies of the splice junctions showed that the assembled transcripts were reliable (Supplementary Fig. 3c, d). The distributions of alternative splicing events were similar across the four cell states (Supplementary Fig. 3e). The consistency of the splicing signals and the similarity across cell states suggest that the quality of our assembled transcripts was reliable.
We investigated how exhaustive our transcript assembly was across different cell states. Saturation analyses by down-sampling the reads suggest that increasing the sequencing depth would discover more transcripts, especially novel transcripts (Supplementary Fig. 3f), consistent with a previous report51. As a further measure of the exhaustiveness of the hEDT transcripts, we investigated the proportion of short-read RNA-seq data that aligned to the assembled transcripts and found that more than 90% of the short-reads mapped to the long-read hEDT transcripts (Supplementary Fig. 3g), suggesting that our assembly captured most of the expressed transcriptome and that the expression level of the transcripts remaining to be discovered is low.
Since the number of assembled transcripts can be affected by the sequencing depth28, we asked if the higher transcript count in ectoderm cells is a genuine phenomenon or a consequence of greater sequencing depth (Supplementary Figs. 3a, b). To explore this, we subsampled 300,000 consensus reads from the replicates of each cell state. Interestingly, more transcripts were still assembled in ectoderm using only the subsampled reads (Supplementary Fig. 3h). As the sequencing depth preferentially impacts lowly expressed transcripts28, we checked if the higher transcript count in ectoderm would persist when using a higher expression threshold. Indeed, that was the case (Supplementary Fig. 3i), supporting the increased complexity of the ectoderm transcriptome.
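The logic of the down-sampling and expression-threshold checks can be sketched in a simplified form. Note the assumption: the study re-assembled transcripts from subsampled reads, whereas this sketch only counts transcripts detected among subsampled read-to-transcript assignments, which is a cheaper proxy. All names and the toy data are hypothetical.

```python
import random


def transcripts_detected(read_assignments, n_reads, min_reads=1, seed=0):
    """Subsample `n_reads` assignments and count transcripts with >= `min_reads` reads."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    n = min(n_reads, len(read_assignments))
    sample = rng.sample(read_assignments, n)
    counts = {}
    for transcript_id in sample:
        counts[transcript_id] = counts.get(transcript_id, 0) + 1
    return sum(1 for c in counts.values() if c >= min_reads)


def saturation_curve(read_assignments, depths, min_reads=1, seed=0):
    """Number of detected transcripts at each subsampling depth."""
    return [(d, transcripts_detected(read_assignments, d, min_reads, seed)) for d in depths]


# Toy data (assumption): one abundant, one moderate and one rare transcript.
reads = ["t1"] * 50 + ["t2"] * 5 + ["t3"]
print(saturation_curve(reads, [10, 30, len(reads)]))
print(transcripts_detected(reads, len(reads), min_reads=2))
```

Raising `min_reads` mimics the higher expression threshold: rare transcripts such as `t3` drop out, so any difference between cell states that survives the threshold is less likely to be a depth artifact.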
Principal component analysis (PCA) from transcript quantification using the merged transcript reference showed that replicates of the same cell state clustered together and were distinct from other cell types (Fig. 2c and Supplementary Data 2), indicating that different cell states had a specific transcriptomic landscape. We then proceeded to investigate the features of hEDTs across different cell types. Transcript lengths and exon count per transcript were heterogeneous, with ectoderm transcripts having significantly longer spliced transcripts with fewer exons per transcript (Fig. 2d). However, other parameters, including un-spliced transcript length, exon length and number of transcripts per gene were not substantially different (Supplementary Fig. 3j). We next predicted coding potential using FEELnc52, and transcript distributions based on protein coding potential showed that ectoderm tended to have more noncoding transcripts (Fig. 2e). Comparison of the assemblies of each cell state to the GENCODE (v43) transcripts53 revealed that ectoderm indeed had a lower proportion of GENCODE-known (matching) transcripts (Supplementary Fig. 3k). Novel transcripts tended to be noncoding in all cell types, with a higher proportion of novel transcripts in ectoderm (Fig. 2f). Even at the gene level, the majority of the novel genes were predicted as noncoding (Fig. 2g). These results highlight the ability of assembly to identify new transcripts.
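The matching, variant and novel categories used in the GENCODE comparison follow the definitions given in the Fig. 2 legend, and can be expressed as a short decision procedure. The sketch below is an illustration under assumptions: it treats "completely matches" as exact identity of all exon coordinates, assumes all transcripts lie on the same chromosome and strand, and uses a hypothetical function name and toy coordinates.

```python
def classify_transcript(exons, known_transcripts):
    """Classify a transcript against a reference annotation.

    exons: list of half-open (start, end) exon intervals for the assembled transcript.
    known_transcripts: list of exon lists from the reference (same chromosome and strand assumed).
    """
    query = sorted(exons)
    # Matching: identical exon structure, hence identical splice junctions.
    for known in known_transcripts:
        if sorted(known) == query:
            return "matching"
    # Variant: at least one base pair shared with any reference exon.
    for known in known_transcripts:
        for known_start, known_end in known:
            for start, end in query:
                if start < known_end and known_start < end:
                    return "variant"
    # Novel: no overlap with any reference exon.
    return "novel"


reference = [[(100, 200), (300, 400), (500, 600)]]
print(classify_transcript([(100, 200), (300, 400), (500, 600)], reference))
print(classify_transcript([(100, 200), (500, 600)], reference))
print(classify_transcript([(1000, 1200)], reference))
```

The second call corresponds to a skipped-exon isoform, which shares exons with the reference but not the full structure and is therefore a variant.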
Functional regulation of cell type-specific transcripts
To investigate the expression dynamics of the assembled transcripts, we extracted uniform and lineage-specific marker genes across the four cell states (Fig. 2h). The uniformly expressed genes included housekeeping genes such as ACTB (Supplementary Fig. 4a). As examples of lineage-specific genes, the ectoderm-specific FEZF1 and PAX6 had both long-read and short-read support (Fig. 2i). EOMES, an endoderm marker, was expressed only in the endoderm, and the mesoderm marker, IGFBP3, was restricted to the mesoderm (Supplementary Fig. 4b). To investigate the heterogeneity of transcript expression at the single cell level, we reanalyzed previously published scRNA-seq data47 of hPSCs and differentiated cells. UMAP clustering showed that cells of the same type clustered together (Supplementary Fig. 4c). The expression patterns of uniformly expressed and cell-specific transcripts (Fig. 2h) were consistent (Fig. 2j, Supplementary Figs. 4d–h). Both bulk (long and short-read) and scRNA-seq data consistently showed that the in vitro differentiation and bioinformatics procedures captured the expressed hEDT set.
Finally, we checked for potential clinical or functional relevance of the transcripts. Compared to random sequences, both novel and matching transcripts were enriched in SNPs with clinical and functional relevance (Fig. 2k, and Supplementary Fig. 4i). Although ectoderm SNP enrichment was lower, novel transcripts from all cell states had significantly higher SNP enrichment compared to matched random transcripts (Fig. 2k, and Supplementary Fig. 4i). These data suggest that the novel transcripts have potentially relevant biological functions in development and disease.
TE superfamilies in spliced transcripts during differentiation
Having established the reliability of the in vitro differentiation experiments and the assembled transcripts, we investigated TE splicing patterns across cell types. We used nhmmer to annotate TEs inside transcript sequences as previously described2,54. The proportions of TE-containing transcripts revealed that ectoderm expressed more coding and noncoding TE-containing transcripts (Supplementary Fig. 5a). As previously reported12, noncoding transcripts tended to contain more TEs than coding transcripts (Fig. 3a). This observation was true when we considered the number of TE-containing transcripts, or the number of reads assigned to each transcript class (Fig. 3a). Intriguingly, the TE-derived nucleotide composition was less than 50% for both coding and noncoding transcripts (Fig. 3b). However, TE-containing noncoding transcripts contained more TE-derived nucleotides, although most noncoding transcripts still contained less than 50% TE-derived sequences. This pattern was common to all cell types (Fig. 3b), with only small differences: although hPSCs and endoderm had fewer overall TE-containing transcripts (Fig. 3a), they tended to have higher TE sequence content in their noncoding transcripts (Fig. 3b).
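The TE-derived nucleotide composition (TE coverage) of a transcript is the percentage of its nucleotides that fall inside annotated TE hits. A minimal sketch is given below, assuming TE hits are reported as half-open intervals in transcript coordinates and may overlap one another; the function name and example hits are illustrative and do not reproduce nhmmer output.

```python
def te_coverage_percent(transcript_length, te_hits):
    """Percent of transcript nucleotides covered by TE hits.

    te_hits: iterable of half-open (start, end) intervals in transcript coordinates.
    Overlapping hits are counted once.
    """
    if transcript_length <= 0:
        return 0.0
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(te_hits):
        start = max(start, current_end, 0)
        end = min(end, transcript_length)
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / transcript_length


# Toy transcript of 1000 nt with two overlapping hits and one separate hit (assumed values).
print(te_coverage_percent(1000, [(0, 100), (50, 150), (400, 450)]))
```

Counting overlapping hits only once matters because adjacent or nested TE annotations are common; summing hit lengths directly would overstate the TE-derived fraction.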
a Stacked bar charts showing the proportion and number of TE-containing transcripts in the coding and noncoding transcripts in the hEDT assemblies. The upper plots show the number of assembled transcripts, whilst the lower plots show the number of reads assigned to each category. b Histograms showing the percent of nucleotides that are TE-derived (TE coverage) for both coding and noncoding transcripts. c Enrichments of TEs in the assembled coding and noncoding hEDTs, calculated as the proportion of the transcripts containing the TE relative to the proportion of the TEs in the human genome. Genome views of hPSC-enriched (d) and endoderm-enriched (e) LTR-containing novel transcripts. f Bar charts of the proportion of TE sequences that are expressed as parts of transcripts classified as initiating, exonized or terminating. g Chart of the number of TE-containing transcript variants of a gene that has at least one TE-free isoform. h Bar chart showing the consequences of TE sequences in TE-free coding and noncoding transcripts versus the TE-containing isoforms. Preserving is no change with or without a TE. Modifying is preserved coding potential with or without a TE, but the ORF changed. For disrupting, TE insertion causes a TE-containing coding transcript to lose coding potential. Creating means a TE insertion converts a noncoding transcript to coding. Noncoding refers to all other transcripts. i Single-cell RNA-seq expression detectability of coding and noncoding transcripts. The red and blue dotted lines represent the median values for transcripts without TEs. The black line on each violin represents the median for that class.
To investigate the TE splicing pattern, we checked which TE superfamilies were contained in hEDTs. We found that the most frequently spliced TE superfamilies included SINE, retroposon, LINE, LTR, and DNA (Supplementary Fig. 5b). It is important to note that these TE frequencies did not reflect the genomic TE distribution (Supplementary Fig. 5b) as different TE superfamilies were enriched/depleted versus the genomic background (Fig. 3c). This suggests TEs are not randomly spliced from the background. We found that the majority of SINE-containing transcripts also contained retroposons, suggesting that these two TE superfamilies co-exist in hEDTs (Supplementary Fig. 5c). Across the major TE superfamilies, TE sequences were most frequent in ectoderm, with almost half of the transcripts containing at least one SINE, compared to about a third in hPSC transcripts (Supplementary Fig. 5b). Indeed, the higher frequency of SINE and retroposon splicing is observed across cell states in both coding and noncoding transcripts. Interestingly, TE-containing transcripts were expressed in a cell state-specific manner. For example, a locus containing a cluster of LTRs on human chromosome 3 is exclusively expressed in hPSCs (Fig. 3d) while an LTR-containing locus on human chromosome 17 was exclusive to endoderm (Fig. 3e). Indeed, multiple cell state-specific TE-containing transcripts such as endoderm-specific, ectoderm-specific and mesoderm-specific transcripts were detected, highlighting cell state-specific TE expression (Supplementary Fig. 5d). TEs were found in the isoforms of genes with roles in differentiation. For example, MIXL1, GATA6 and TGFB255,56,57,58 all have TE-containing isoforms (Supplementary Fig. 5e) that were expressed in a cell type-specific manner (Supplementary Fig. 5f). Similarly, hPSC-specific ESRG59,60 has multiple annotated isoforms that derive predominantly from an LTR (Supplementary Fig. 5e). These results revealed tissue-specific activation of TEs and suggest their potential roles in cell state specification.
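The enrichment of a TE superfamily versus the genomic background is described in the Fig. 3 legend as the proportion of transcripts containing the TE relative to the proportion of that TE in the genome. A hedged sketch of that ratio follows. How the genomic proportion was measured (bases or copies) is not stated in this section, so `genome_fraction` is an assumed input, the log2 transform is an optional convenience rather than a documented step, and the numbers are invented for illustration.

```python
import math


def te_enrichment(n_transcripts_with_te, n_transcripts, genome_fraction):
    """Proportion of transcripts containing a TE type divided by its genomic proportion."""
    if n_transcripts <= 0 or genome_fraction <= 0:
        raise ValueError("n_transcripts and genome_fraction must be positive")
    transcript_fraction = n_transcripts_with_te / n_transcripts
    return transcript_fraction / genome_fraction


def log2_enrichment(n_transcripts_with_te, n_transcripts, genome_fraction):
    """Log2 of the enrichment ratio (assumed display scale): > 0 enriched, < 0 depleted."""
    return math.log2(te_enrichment(n_transcripts_with_te, n_transcripts, genome_fraction))


# Invented example: a TE type found in 500 of 1000 transcripts, with a genomic proportion of 0.25.
print(te_enrichment(500, 1000, 0.25))
print(log2_enrichment(500, 1000, 0.25))
```

A ratio above 1 (log2 above 0) indicates that the TE type appears in transcripts more often than its genomic abundance would suggest, and a ratio below 1 indicates depletion.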
Since only a small proportion of transcripts are entirely covered by TEs (Fig. 3b), we asked in which part of the transcript TEs are located. In most transcripts, TEs are found in an internal exon; a sizable number of transcripts were TE-terminating (i.e. in the TES) while a few transcripts had TEs in the TSS (Fig. 3f). Across different TE superfamilies, the prevalence of TE transcripts was consistent, but the proportions of TE-initiating and TE-terminating were mixed (Fig. 3f). For example, more transcripts tended to terminate with SINEs than other TE types. Also, LTRs were more likely to be in the TSS in hPSCs and endoderm than in mesoderm and ectoderm. These results highlight the bias in TE presence in the hEDT assemblies.
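The initiating, exonized and terminating categories can be illustrated with a simple positional rule. The exact criteria used in the study are not given in this section, so the rule below is an assumption: a TE hit that reaches the first base of the transcript is initiating, one that reaches the last base is terminating, and anything else is exonized. The `edge` tolerance, the function names and the toy hits are hypothetical.

```python
from collections import Counter


def classify_te_position(transcript_length, te_start, te_end, edge=0):
    """Classify a TE hit given in half-open transcript coordinates (5'->3').

    edge: tolerance in nucleotides for calling overlap with the TSS or TES (assumed parameter).
    A hit spanning the whole transcript is reported as initiating in this sketch.
    """
    if te_start <= edge:
        return "initiating"
    if te_end >= transcript_length - edge:
        return "terminating"
    return "exonized"


def class_proportions(hits):
    """hits: list of (transcript_length, te_start, te_end); returns class -> proportion."""
    labels = Counter(classify_te_position(*hit) for hit in hits)
    total = sum(labels.values())
    return {label: count / total for label, count in labels.items()}


hits = [(1000, 0, 120), (1000, 300, 500), (1000, 450, 600), (1000, 900, 1000)]
print(class_proportions(hits))
```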
To investigate the consequences of TE insertion into transcripts, we focused on TE-free and TE-containing transcripts. Ectoderm cells tended to contain more TE-containing transcripts from otherwise TE-free genes, and the pattern was particularly prevalent for SINE-containing transcripts (Fig. 3g). Across TE superfamilies and cell states, the major consequence of TE insertion was modifying, in which a TE chimera affected the CDS length but did not lead to the loss of coding potential (Fig. 3h). An example is the TE-containing isoform of WDR4, a gene that promotes cerebellar development61. Although TE presence does not lead to the loss of coding potential, the transcript was dysregulated across cell types (Supplementary Fig. 5h). Interestingly, we also found instances in which a TE-containing isoform led to the complete loss or gain of coding potential (Supplementary Fig. 5h). An example of a loss of coding potential is a TE-containing isoform of TXNL4A, a gene implicated in Burn-McKeown syndrome and isolated choanal atresia62 (Supplementary Fig. 5h). Conversely, TE insertion in an isoform of a pseudogene, DOC2G, creates coding potential (Supplementary Fig. 5h). The functional homolog of the gene has been implicated in neurotransmission in mice63,64,65. Overall, TE-containing isoforms altered coding potential in otherwise TE-free transcripts.
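The consequence categories (preserving, modifying, disrupting, creating, noncoding) are defined in the Fig. 3 legend by comparing a TE-free isoform with a TE-containing isoform of the same gene. The sketch below encodes those definitions, assuming each isoform is summarized by its predicted ORF (protein sequence), or `None` when it is predicted noncoding. The function name and the toy sequences are hypothetical.

```python
def te_consequence(te_free_orf, te_containing_orf):
    """Classify the effect of a TE on coding potential.

    Each argument is the predicted ORF of an isoform, or None if the isoform is noncoding.
    """
    if te_free_orf is None and te_containing_orf is None:
        return "noncoding"   # neither isoform is coding
    if te_free_orf is None:
        return "creating"    # TE-containing isoform gains coding potential
    if te_containing_orf is None:
        return "disrupting"  # TE-containing isoform loses coding potential
    if te_free_orf == te_containing_orf:
        return "preserving"  # same ORF with or without the TE
    return "modifying"       # still coding, but the ORF changed


# Toy ORFs (invented sequences).
print(te_consequence("MKTAYIAK", "MKTAYIAK"))
print(te_consequence("MKTAYIAK", "MKTAYQQLV"))
print(te_consequence("MKTAYIAK", None))
print(te_consequence(None, "MSLQ"))
```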
Because TE insertion has different consequences on coding potential, we investigated the expression dynamics of TE-containing coding and noncoding transcripts. For both coding (Supplementary Fig. 6a) and noncoding (Supplementary Fig. 6b) transcripts, SINEs and retroposons were prominent. Whereas about 80% of noncoding transcripts in ectoderm contained a SINE element, less than 10% of coding transcripts contained an LTR. The bias towards coding transcripts being TE-free was evident in the proportion of coding transcripts for various categories: 91% of all TE-free transcripts were coding, whilst for TE-containing transcripts the proportion ranged from 33% for LTR in ectoderm to 63% for SINE in endoderm (Supplementary Fig. 6c and d). As TEs could potentially create a minor transcript as an alternative isoform of a major functional transcript, we investigated the frequency of TE sequence in major and minor transcripts. For coding transcripts, major isoforms tended to be TE-free, compared to the TE-containing transcripts (Supplementary Fig. 6e). Conversely, for noncoding transcripts the major isoforms tended to contain TE sequences. This suggests that TE-containing coding transcripts tend to be minor isoforms, whilst TE-containing noncoding transcripts tend to be the major isoforms.
TEs might contribute to cell states as functional modules for cell state transitions66,67. As previously reported68, noncoding transcripts had higher expression variability across cell types (Supplementary Fig. 6f). Further, TE-containing coding and noncoding transcripts tended to have higher expression variability than TE-free transcripts. Interestingly, the Tau coefficient, which measures expression specificity, showed that both TE-containing coding and noncoding transcripts had higher cell type-specific expression than TE-free transcripts (Supplementary Fig. 6f), suggesting that TE expression is regulated in a cell type-specific manner.
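For readers unfamiliar with the Tau coefficient, it is commonly defined for a transcript with expression values x_1..x_n across n cell types as the sum of (1 - x_i / x_max) divided by (n - 1), giving 0 for uniform expression and 1 for expression restricted to a single cell type. The sketch below implements that standard definition. Whether the study applied it to raw or transformed expression values is not stated here, so the inputs are an assumption.

```python
def tau(expression):
    """Tau specificity index for one transcript across cell types.

    expression: non-negative expression values, one per cell type.
    Returns 0.0 for uniform expression and 1.0 for expression in a single cell type.
    """
    n = len(expression)
    if n < 2:
        return 0.0
    x_max = max(expression)
    if x_max <= 0:
        return 0.0  # not expressed anywhere; specificity is undefined, reported as 0 here
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Toy values for four cell states (hPSC, endoderm, mesoderm, ectoderm); invented numbers.
print(tau([10, 0, 0, 0]))
print(tau([5, 5, 5, 5]))
print(tau([8, 4, 4, 0]))
```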
We next checked the expression variability within cell types using scRNA-seq data and found that hPSC transcripts had lower expression variability than differentiated cells (Supplementary Fig. 6g). Except for mesoderm LTR-containing transcripts, noncoding transcripts tended to have higher expression variability than the coding transcripts. Notably, TE presence led to higher expression variability in coding transcripts but lower variability in noncoding transcripts. The differences in the expression variability might be influenced by the expression detectability, especially for lowly expressed transcripts; however, TE-containing transcripts had higher detectability, especially for noncoding transcripts (Fig. 3i). Taken together, these results suggest that TE expression is coordinated during differentiation.
Biased TE-transcript patterns in coding and noncoding transcripts
Having established the dynamic expression of TE-containing transcripts in different lineages, we investigated the TE frequencies of coding and noncoding transcripts in the four cell states. Analysis of TE patterns in coding transcripts showed that TE sequences are rare in coding sequences (CDS) (Fig. 4a, Supplementary Fig. 7a, b), presumably reflecting the evolutionary cost of a TE inserting into and disrupting a CDS. Across TE superfamilies and cell states, TEs were overrepresented in the 5’ and 3’ untranslated regions (UTRs) compared to the CDS, with higher frequencies in the 3’ UTRs (Supplementary Fig. 7a). However, TE subfamilies showed differences. For example, while HERVH and MSTB are both LTRs, their splicing patterns were different, with HERVH sequences being rare in the 3’ UTR, whilst MSTBs were enriched (Fig. 4a). In addition, there was a higher frequency of HERVH TE-derived sequences in endoderm cells. Interestingly, the TE pattern of X24_LINE was very similar to that of MSTB and also showed higher presence in ectoderm UTRs, revealing the heterogeneity of the splicing patterns of specific TEs. The comparison of the MER3 and LTR6A frequencies further highlighted differences between these TEs and cell states (Supplementary Fig. 7a).
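Positional profiles of this kind divide each coding transcript into 5’ UTR, CDS and 3’ UTR and scale each region to a uniform size before counting TE occurrences. A minimal sketch of that scaling is shown below. The number of bins per region, the dictionary layout and the function names are assumptions for illustration, not the study's implementation.

```python
def scaled_bin(pos, cds_start, cds_end, transcript_length, bins_per_region=10):
    """Map a transcript position to a bin over 5' UTR | CDS | 3' UTR, each scaled to equal size.

    Coordinates are 0-based and half-open; pos must satisfy 0 <= pos < transcript_length.
    """
    if pos < cds_start:
        region, offset, length = 0, pos, cds_start
    elif pos < cds_end:
        region, offset, length = 1, pos - cds_start, cds_end - cds_start
    else:
        region, offset, length = 2, pos - cds_end, transcript_length - cds_end
    within = min(bins_per_region - 1, offset * bins_per_region // length)
    return region * bins_per_region + within


def te_profile(transcripts, bins_per_region=10):
    """Fraction of transcripts with TE-derived sequence in each scaled bin."""
    counts = [0] * (3 * bins_per_region)
    for t in transcripts:
        touched = set()
        for start, end in t["te_hits"]:
            for pos in range(start, end):
                touched.add(scaled_bin(pos, t["cds_start"], t["cds_end"], t["length"], bins_per_region))
        for b in touched:
            counts[b] += 1
    n = len(transcripts)
    return [c / n for c in counts] if n else [0.0] * len(counts)


# Toy transcript (invented): 100 nt 5' UTR, 600 nt CDS, 300 nt 3' UTR with a TE in the 3' UTR.
example = [{"length": 1000, "cds_start": 100, "cds_end": 700, "te_hits": [(800, 1000)]}]
print(te_profile(example))
```

With the default of 10 bins per region, bins 0 to 9 represent the 5’ UTR, 10 to 19 the CDS and 20 to 29 the 3’ UTR, so transcripts of very different lengths contribute to the same relative positions.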
a The splicing patterns of different TEs in coding transcripts. Coding transcripts were divided into 5’ UTR, CDS and 3’ UTR, and scaled to a uniform size. UTR=untranslated region; CDS=coding sequence. b Proportion of TEs 2 kbp upstream and downstream of coding transcript TSS (transcription start site) and TES (transcription end site). c Proportion of TEs around 200 bp upstream and downstream of the donor and acceptor splice sites of coding transcripts. d The splicing patterns of different TE superfamilies in noncoding transcripts. Each transcript is scaled to a uniform size. e TE contents around the TSS and TES of noncoding transcripts. f TE content around the donor and acceptor sites in the noncoding transcripts. The TE content for each bin was then computed. g Heatmap of TEs with significant differences in coding transcript frequencies between hPSCs and any of the three cell types. h Heatmap of TEs with significant frequency differences in the noncoding transcripts of the hPSC and any of the three cell types. For g and h the statistical significance was defined as an adjusted P value < 0.05 and odds ratio ≥2 from a Fisher exact test (two-sided) with Bonferroni correction. For the heatmap, the Z score of the proportion of TE-containing coding and noncoding transcripts was computed. i Venn diagrams of the overlaps of the TE superfamilies with significant frequency differences in coding and noncoding transcripts. j TE frequencies of selected TEs in the coding and noncoding transcripts in the indicated cell states.
We next looked at the frequency of all TEs in the TSSs and TESs, and interestingly there were no substantial differences across cell states (Fig. 4b). Across the TE superfamilies, TE proportion was low at the TSS. TEs were also reduced at the TES and this reduction extended 5’ into the transcript body (Supplementary Fig. 7c). Similarly, TEs at both the donor and acceptor splice junctions were low (Fig. 4c, Supplementary Fig. 7d). Although there were subtle differences across cell lines and TE superfamilies, the general pattern was similar (Supplementary Fig. 7d). Surprisingly, TE frequencies at splice junctions suggest that suppression of TE insertion at the exon starts is stronger than suppression at the exon ends. This suggests TEs are inhibited from splicing into TSSs, TESs, or splice junctions to prevent the disruption of transcript structure.
We next studied TE splicing patterns in noncoding transcripts and found that they varied among TE families, cell states, and even positions within the transcript (Fig. 4d–f, Supplementary Fig. 7e–g). Specifically, while SINE and retroposon TEs are enriched towards the 5’ and 3’ ends of the transcripts, LINE, LTR, and DNA superfamily TEs are uniformly distributed. Additionally, the overall frequencies were different across cell states (Fig. 4d, Supplementary Fig. 7e). Ectoderm noncoding transcripts were enriched for LINE and DNA TEs, whilst LTRs were enriched in hPSCs, ectoderm, and endoderm (Fig. 4d). The dynamics of TE expression were more obvious for transcripts containing LTRs. For example, while HERVFH21 was concentrated in the center of noncoding transcripts, endoderm LTR6A had higher enrichment around 5’ and 3’ ends (Supplementary Fig. 7e), suggesting these transcripts are noncoding remnants of semi-intact ERVs. HERVH, conversely, was uniformly distributed across noncoding transcripts in hPSC and endoderm cells (Supplementary Fig. 7e), highlighting the differences among TEs.
TE frequencies in noncoding transcripts were also dynamic. SINEs, LTRs, and retroposons were rare at the TSS (Fig. 4e, and Supplementary Fig. 7f). Conversely, at the transcript ends, SINEs and LTRs were enriched, and their enrichment dropped substantially after the TES, suggesting that they marked transcript ends. TE sequences were largely absent at the TSSs of noncoding transcripts, suggesting TE sequences rarely act as noncoding promoters. TE frequencies at noncoding transcript splice junctions were also low (Fig. 4f, and Supplementary Fig. 7g), but not as low as in coding transcripts (Supplementary Fig. 7g). Overall, noncoding transcripts showed distinct patterns compared to coding transcripts at the TSS, TES and at splice junctions.
To explore cell type-specific TE activity, we quantified their frequencies in coding and noncoding transcripts. Comparison of TE-containing coding transcripts identified 218 TEs that were significantly different upon differentiation into endoderm, mesoderm, or ectoderm (Fig. 4g). The majority of the TE subfamilies that were significantly different were enriched in ectoderm cells and were LINEs, LTRs and a few DNA TEs. In contrast, some LTRs like LTR7, HERVH, HERVH48, and MSTB2 were enriched in endoderm coding transcripts. On the other hand, 331 TE subfamilies were differentially enriched in noncoding transcripts in at least one of the four cell states (Fig. 4h), and the majority of the differentially spliced TEs in noncoding transcripts were LINEs and LTRs that were more frequent in the ectoderm. Some LTRs such as LTR7B and HERVH were more frequent in hPSCs, as previously reported69,70, but many of them were also frequent in endoderm cells, suggesting these are not specific features of hPSCs. Indeed, LTR7B/Y/H and HERVH were enriched in endoderm (Fig. 4h), supporting the idea that HERVH expression extends to cell types beyond hPSCs. HERVH modulates 3-dimensional genome structure in hPSCs71 and may also perform this function in somatic cells.
Interestingly, there was a substantial overlap in the significantly enriched TE superfamilies in any cell type for both coding and noncoding transcripts (Fig. 4i). Breaking this down by TE superfamily, the overlap is high for LINEs, with just 21 TEs specific for noncoding transcripts and 134 LINE types enriched in both coding and noncoding. For example, although the frequencies were different, the overall abundance patterns for LINE L1MD_orf2 were similar for coding and noncoding transcripts (Fig. 4j). In contrast to LINEs, fewer LTRs overlapped between coding and noncoding transcripts, and only 49 LTR types were common. This was exemplified by the LTRs MSTB2 and LTR7Y which were ectoderm and hPSC/endoderm enriched, respectively (Fig. 4j). Overall, these data indicate that the TE content in different lineages is dynamic, particularly in the ectoderm, and that LTR content is cell type- and transcript type-specific in all lineages.
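The enrichment criterion stated for Fig. 4g, h (two-sided Fisher exact test, Bonferroni correction, adjusted P value < 0.05 and odds ratio ≥2) can be illustrated with a minimal sketch. The counts below are hypothetical, the Bonferroni factor is assumed to be the number of TEs tested, and the odds-ratio cutoff is assumed to apply in either direction; none of these details are specified in the text.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one
    p = sum(p_x for p_x in map(prob, range(lo, hi + 1)) if p_x <= p_obs * (1 + 1e-7))
    return min(1.0, p)

def odds_ratio(a, b, c, d):
    # Reported as infinite when the denominator b * c is zero
    return float("inf") if b * c == 0 else (a * d) / (b * c)

def differential_tes(counts, alpha=0.05, min_or=2.0):
    """counts: TE -> (with_TE_state, total_state, with_TE_hpsc, total_hpsc)."""
    n_tests = len(counts)  # assumption: Bonferroni factor = number of TEs tested
    hits = {}
    for te, (k1, n1, k2, n2) in counts.items():
        table = (k1, n1 - k1, k2, n2 - k2)
        p_adj = min(1.0, fisher_exact_two_sided(*table) * n_tests)
        oratio = odds_ratio(*table)
        # Assumption: the odds-ratio cutoff is applied in either direction
        if p_adj < alpha and (oratio >= min_or or oratio <= 1 / min_or):
            hits[te] = (oratio, p_adj)
    return hits

# Hypothetical counts of TE-containing transcripts (not data from this study)
toy_counts = {
    "TE_up": (40, 1000, 10, 1000),
    "TE_flat": (12, 1000, 10, 1000),
}
hits = differential_tes(toy_counts)
print(sorted(hits))
```

Requiring both an adjusted P value and an odds-ratio cutoff filters out TEs whose frequency difference is statistically detectable but small in magnitude.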
TE subfamily transcript switching
To explore this, we measured the relationship between TE superfamily and expression level at the individual transcript level. We used bulk short-read RNA-seq data for each lineage with the long-read based hEDT reference, as the dynamic range of expression is higher for short-read data than long-read data28. In all four cell types, the expression levels of TE-free coding transcripts were significantly higher than those of TE-containing coding transcripts (Supplementary Fig. 8a). TEs in the 5’ UTRs resulted in transcripts with the lowest expression levels while 3’ UTR transcript chimeras led to higher expression levels, but not as high as TE-free transcripts. We next investigated the aggregate expression patterns and found that the overall expression patterns for SINE, LTR, LINE, retroposon, and DNA coding transcripts were not substantially different across the four states (Fig. 5a, Supplementary Fig. 8b).
a Boxplots of the aggregate expression levels of coding transcripts containing specific TEs and TE superfamilies. The box represents the first and third quartiles, the midpoint is the median and the whiskers are 1.5 × interquartile range. The same boxplot definition is used in subsequent boxplots. b Heatmap showing the expression changes of cell state-specific coding transcripts containing the indicated TEs. Each row of the heatmap represents a transcript containing the subfamily of TE (top label). c Heatmap of aggregate changes in the expression of noncoding transcripts containing different TEs. Each row represents the mean of all transcripts containing a TE-subfamily that is significantly different from a Wilcoxon signed-rank test (two-sided) with Bonferroni correction (adjusted P value < 0.05 and fold change ≥2) between hPSC and any of the differentiated cell states. For each TE, the cell state with a significant difference, and the number of lineage(s) with significant differences are shown. d Heatmap showing the change in expression of cell state-specific differentially expressed noncoding transcripts containing the indicated TEs. Each row represents a TE-containing transcript that is differentially expressed between hPSCs and any of the three cell states. Bulk RNA-seq (left) and scRNA-seq (right) expression patterns of selected TE-containing transcripts in in vitro (e) and in vivo (f) differentiation data.
Above, we mainly considered TEs at the superfamily or family level, but with long reads we can place TEs into their specific transcript context. Although there is a substantial overlap in the subfamilies of TEs that are found in coding and noncoding transcripts, the exact genomic loci of the TE-containing transcripts are likely to be different. Potentially there are two scenarios. The first scenario is expression of the same TE subfamily from the same genomic locus, while the second is the expression of the same TE subfamily from different loci. If analyzed at the bulk level, the TE family/subfamily expression would remain unchanged in both scenarios, but in the second the underlying transcript providing the TE would be different. Differential transcript expression between hPSC and the three differentiated cell states revealed that hPSC-to-endoderm differentiation induced the fewest transcript changes while mesoderm differentiation drove the largest coding transcript expression changes (Supplementary Fig. 8c). Differentiation in all three lineages led to upregulation and downregulation of TE-containing coding transcripts, suggesting that transcript regulation is locus-specific and that TE expression is not restricted to hPSCs. We checked the overlaps of the differentiation-induced differential transcripts and found that many transcripts were regulated in a lineage-specific manner (Supplementary Fig. 8d). The aggregate expression of specific TE coding transcripts identified seven LTRs (LTR7Y, LTR7C, LTR7B, LTR7, HERVH, HERVH48 and HERVFH21) that were significantly downregulated in both ectoderm and mesoderm cells, and two LINEs (L1P4b_5end and L1HS_5end) that were substantially upregulated in ectoderm cells (Fig. 5a, and Supplementary Fig. 8e). This suggests that although the TEs are similar in all three cell types, they originate from different transcripts.
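The second scenario can be made concrete with a toy example: the aggregate expression of a TE subfamily is unchanged between two cell states while the transcript contributing most of it differs. The transcript IDs, expression values, and the 20% tolerance below are hypothetical and serve only to illustrate the idea, not to reproduce the analysis.

```python
def aggregate(expr, te_of):
    """Sum transcript expression per TE subfamily."""
    totals = {}
    for tx, value in expr.items():
        totals[te_of[tx]] = totals.get(te_of[tx], 0.0) + value
    return totals

def dominant_transcript(expr, te_of, subfamily):
    """Most highly expressed transcript carrying the given subfamily."""
    members = [tx for tx in expr if te_of[tx] == subfamily]
    return max(members, key=lambda tx: expr[tx])

def switched_subfamilies(expr_a, expr_b, te_of, rel_tol=0.2):
    """Subfamilies with similar aggregate expression but a different dominant transcript."""
    agg_a, agg_b = aggregate(expr_a, te_of), aggregate(expr_b, te_of)
    out = []
    for sub in sorted(set(agg_a) & set(agg_b)):
        top = max(agg_a[sub], agg_b[sub])
        similar = top > 0 and abs(agg_a[sub] - agg_b[sub]) <= rel_tol * top
        if similar and (dominant_transcript(expr_a, te_of, sub)
                        != dominant_transcript(expr_b, te_of, sub)):
            out.append(sub)
    return out

# Hypothetical transcript-to-subfamily map and expression values
te_of = {"tx1": "HERVH", "tx2": "HERVH", "tx3": "L2"}
hpsc = {"tx1": 10.0, "tx2": 0.5, "tx3": 4.0}
endoderm = {"tx1": 0.5, "tx2": 10.0, "tx3": 4.0}

switched = switched_subfamilies(hpsc, endoderm, te_of)
print(switched)
```

In this toy case the subfamily-level totals for HERVH are identical in both states, so a bulk, subfamily-level quantification would report no change even though the source transcript has switched.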
We then examined the expression levels of the individual TE-containing coding transcripts. Surprisingly, the majority of the HERVH-containing differentially expressed coding transcripts were highly expressed in endoderm cells but not in hPSCs (Fig. 5b). Different coding transcripts containing L2 and L1HS_5end were activated in different cell states, with activation of many L1HS_5end-containing coding transcripts in ectoderm cells. These data support the second of our scenarios that whilst similar TE types are expressed, they are derived from different loci that are expressed in cell type-specific transcripts.
As noncoding transcripts are more likely to be expressed in a cell type-specific manner28,72,73, we expected that noncoding TE-containing transcripts would also be expressed from different transcript loci. Fewer differentially expressed noncoding transcripts were induced by the differentiation process (Supplementary Fig. 8f), reflecting the reduced number of noncoding transcripts in each cell type (Fig. 2e). The differentially expressed transcripts were also regulated in a cell type-specific manner, with only a few shared among cell states (Supplementary Fig. 8g), and TE-containing transcripts tended to be expressed at lower levels (Supplementary Fig. 8h). A comparison of the aggregate expression for noncoding transcripts containing specific TEs showed that 126 TEs were differentially expressed between the hPSCs and at least one of the differentiated cell states (Fig. 5c). These TEs included LTRs and SINEs that are activated in hPSCs and endoderm cells. LTR6A and other LTR TEs were also enriched in endoderm cells. Analysis of the differentially expressed LTR7-containing transcripts showed that many were downregulated in ectoderm and mesoderm cells, whilst LTR6A-containing noncoding transcripts were upregulated in endoderm (Fig. 5d). These data support a model whereby TE superfamilies/families may appear stably expressed during differentiation, but TE types are switching transcripts and the same TE superfamilies/families are expressed from different transcripts.
Cell differentiation experiments involved the exposure of hPSC to different chemical signals that were able to induce cell state changes. To exclude the possibility that the observed TE regulation across lineages was simply due to exposure to differentiation signals, we quantified the expression of TE-containing differentiation-induced differentially expressed transcripts in hESFs (human embryonic skin fibroblasts). PCA suggested that hESFs were unresponsive to differentiation signals (Supplementary Fig. 8i), and the transcript switching of HERVH and LINE TE-transcripts from hPSCs to endoderm was not observed in hESFs treated with endoderm differentiation medium (Supplementary Fig. 8j). These results showed that the TE expression changes reflected differentiation, and not exposure to differentiation signals.
In vitro differentiation of hPSCs can produce four different types of ectodermal lineages, namely neuroectoderm, neural crest, cranial placode, and non-neural ectoderm50. PAX6 expression and overall correlation suggest that our ectoderm cells were biased towards neuroectoderm (Fig. 2i, and Supplementary Figs. 2a, f). We asked if the TE activation in our ectoderm cells was specific to neuroectoderm or common to all four ectodermal lineages. Reanalysis of published bulk RNA-seq data50 showed that LINEs exhibited consistent transcript switching in the neuroectoderm, neural crest and cranial placode, but not in non-neural ectoderm (Supplementary Fig. 8k). Interestingly, the downregulated hPSC-expressed LINE-containing transcripts were reduced in all four ectodermal lineages, demonstrating dynamic regulation of LINE-containing transcripts in ectodermal lineages.
TE expression dynamics are recapitulated in vitro and in vivo
We next investigated the expression dynamics of TE-containing coding transcripts using scRNA-seq data from in vitro differentiated germ layer cells47. The scRNA-seq data recaptured the higher expression of multiple TEs in ectoderm cells and downregulation of some LTRs such as LTR6A, LTR7, LTR7C, and HERVH in ectoderm and mesoderm cells (Supplementary Fig. 9a). The data also revealed that the expression differences were not obvious at the level of TE superfamilies but became distinct at the level of individual transcripts (Fig. 5e, and Supplementary Fig. 9b). Also, single-cell analyses of the noncoding transcripts were consistent with the observations from bulk data and revealed upregulation of multiple TEs in ectoderm cells (Supplementary Fig. 9c, d). Several LTR TE subfamilies were downregulated in ectoderm and mesoderm, while LTR6A was enriched in endoderm cells (Fig. 5e, and Supplementary Figs. 9c, d). As with bulk data, lineage-specific expression became clear when TEs were considered at the transcript level. These results demonstrated that the observations in bulk RNA-seq were recaptured in single cells.
To elaborate on the expression dynamics of TE-containing transcripts, we extended our analysis to scRNA-seq from in vivo gastrulation48 (Fig. 5f, and Supplementary Fig. 10a). As in the in vitro system, ectoderm cells were marked by TE-containing coding and noncoding transcripts (Supplementary Figs. 10a, b). Indeed, compared to other cell types, more ectoderm-expressed transcripts were found in almost all TE superfamilies in both coding and noncoding transcripts (Supplementary Fig. 10c). Also, many transcripts that were identified in the in vitro data were recaptured in the in vivo data (Fig. 5f, and Supplementary Fig. 10d), indicating that in vitro observations reflected in vivo phenomena. Taken together, these data revealed that whilst similar TE types are expressed in all cell types, they originate from multiple lineage-specific transcripts, suggesting that TEs of the same type from multiple loci are independently regulated.
Cell type-specific regulation of TE-transcripts
We next sought to explore the regulation of cell type-specific transcripts. There are two potential (not exclusive) modes: firstly, that the normal cell type-specific regulation apparatus regulates TE-transcripts, or secondly, that a separate TE-specific regulation system is in action. We measured the enrichment of transcription factor binding motifs (TFBMs) in the core promoters of TE-containing and TE-free transcripts that were differentially expressed. TFBMs were enriched in the TE-free differentially expressed transcripts in multiple cell types (Supplementary Fig. 11a). An example was the Maz motif which was found in the top 5 most enriched TFBMs in all the lineages. Comparison of the enriched TFBMs of TE-containing and TE-free differentially expressed transcripts showed surprisingly little overlap (Supplementary Fig. 11b). The only exception was in hPSC transcripts in which the overlap was significantly more than expected by chance. These results suggest that these TE-free and TE-containing transcripts are upregulated by different transcription factors during development.
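Whether an overlap between two sets of enriched motifs is larger or smaller than expected by chance can be assessed in several ways. The text does not state which test was used, so the following is only one possible sketch, assuming a hypergeometric null in which both motif sets are drawn from a common motif library of known size. The function names and the numbers in the example are hypothetical.

```python
from math import comb

def expected_overlap(n_a, n_b, n_total):
    """Expected number of shared motifs if the two sets were independent."""
    return n_a * n_b / n_total

def overlap_tail_probs(k, n_a, n_b, n_total):
    """Return (P[X <= k], P[X >= k]) for a hypergeometric overlap X."""
    denom = comb(n_total, n_b)

    def pmf(x):
        # Probability that exactly x of the n_b motifs fall in the n_a set
        return comb(n_a, x) * comb(n_total - n_a, n_b - x) / denom

    lo, hi = max(0, n_a + n_b - n_total), min(n_a, n_b)
    lower = sum(pmf(x) for x in range(lo, k + 1))
    upper = sum(pmf(x) for x in range(k, hi + 1))
    return lower, upper

# Hypothetical example: 10 motifs in the library, 5 enriched in each
# transcript class, 4 of them shared
exp = expected_overlap(5, 5, 10)
p_fewer, p_more = overlap_tail_probs(4, 5, 5, 10)
print(exp, p_fewer, p_more)
```

The lower tail addresses "less overlap than expected by chance" and the upper tail addresses "more overlap than expected by chance".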
Next, we checked the enrichment of RNA-binding protein (RBP) motifs in the differentially expressed transcripts using transcripts without differentiation-induced expression changes as the background. Motifs for RBPs such as LIN28A and PRPRC1 were enriched in differentially expressed TE-free transcripts of multiple cell types (Supplementary Fig. 11c). Comparison of enriched RBP motifs showed that the number of overlaps was less than expected by chance in all lineages (Supplementary Fig. 11d), reflecting that TE-free and TE-containing transcripts have different RBP motifs. Overall, the analysis of TFBMs and RBP motifs suggests that TE-transcripts tend to be regulated independently. It is illustrative that the RBP motif for m6A (N6-methyladenosine) binding protein IGF2BP3 was found only in the TE-free transcripts, suggesting m6A-mediated suppression of TEs is cell type-specific74,75,76.
TE-subfamily switching in terminally differentiated somatic cells
As the same TE superfamilies/families are expressed from different transcripts, we explored differentiation time course data to investigate the endoderm-specific expression of TE-containing transcripts. As LTR6A and HERVH-containing transcripts are specifically expressed in the endoderm or hPSCs (Fig. 5b, d), we focused on transcripts containing both TE subfamilies. The analysis of time-course bulk RNA-seq showed that LTR6A noncoding transcripts were upregulated at 72 h (Fig. 6a, b). HERVH-containing endoderm-specific noncoding transcripts were upregulated as early as 24 h and reached maximum expression at 72 h. Conversely, HERVH-containing hPSC-specific noncoding transcripts were downregulated in endoderm as early as 12 h and were undetectable at 72 h. The situation was similar for HERVH-containing coding transcripts (Fig. 6a, b, and Supplementary Fig. 12a). Next, we analyzed scRNA-seq data of a 0–96-h differentiation time-course of hPSCs-to-endoderm (Fig. 6c). In agreement with the bulk RNA-seq analysis, endoderm-specific LTR6A, and HERVH-containing noncoding transcripts had high expression only at 72- and 96-h of differentiation (Fig. 6c). In contrast, hPSC-specific noncoding transcripts were shut down at 12 h of differentiation (Fig. 6d). These results demonstrate how HERVH and LTR6A can appear expressed in hPSCs and endoderm, but the transcripts providing these TE sequences are switching during differentiation.
a Violin plots of bulk RNA-seq expression levels of HERVH and LTR6A-containing noncoding transcripts during differentiation to endoderm. The data were from GSE7574847. b Heatmap showing the expression dynamics of HERVH and LTR6A noncoding transcripts differentially expressed in endoderm lineage. c UMAP showing the single-cell expression data for a hPSC to endoderm time course. Hours of differentiation are shown. The three other UMAPs show example aggregate scores for transcripts containing the indicated TE families that are down-regulated or up-regulated in hPSCs. d Expression dynamics in bulk RNA-seq (left) and scRNA-seq (right) of selected transcripts. e Bar plot showing the RBPs with enriched motif in HERVH endoderm-specific transcripts. f Bar plot showing the RBPs with enriched motif in hPSC-specific HERVH transcripts. g Bar plot showing the TFs with enriched motif in endoderm-specific HERVH transcripts. h Bar plot showing the TFs with enriched motif in hPSC-specific HERVH transcripts.
If TE subfamily transcripts switch in hPSCs and endoderm, then they should be regulated differently. We found multiple RBPs with differentially enriched motifs between HERVH transcripts in endoderm and hPSC (Fig. 6e, f). The most represented RBP motifs in HERVH transcripts in endoderm were serine/arginine-rich splicing factor (SRSF) motifs (SRSF9, 1 and 2) (Fig. 6e). These splicing factors have been implicated in differentiation77. Interestingly, the RBPs with biased binding in hPSC-biased HERVH transcripts were also represented in endoderm-biased HERVH transcripts (Fig. 6f). For example, the motif of PABPC4, which has roles in mRNA stability and differentiation78, was found in 92% of hPSC-specific and 78% of endoderm-specific HERVH transcripts (Fig. 6f). Overall, though, hPSCs and endoderm had specific RBP motif enrichments. In addition to the RBP motif enrichment, TFBMs in the core promoters of HERVH transcripts showed lineage-specific enrichment. The most represented biased motif was PITX1, a TF that has been implicated in differentiation79, with motif found in 84% endoderm-biased, compared to 72% hPSC-biased HERVH transcripts, while HOXA13 motif was found in 89% hPSC-biased and 78% endoderm-biased HERVH transcripts (Fig. 6g and h). However, like the RBPs, the endoderm and hPSCs had specific TFBMs, for example, the TEAD motif, associated with genes important for differentiation80,81, was endoderm-specific (Fig. 6g), and LRF motif was hPSC-specific (Fig. 6h). These results suggest that HERVH transcripts were independently regulated in endoderm and hPSCs, supporting a transcript-switching model.
To investigate whether the expression of the endoderm-induced upregulated transcripts persists in terminally differentiated somatic cells, we checked their expression in a panel of hepatocyte-related RNA-seq data. Surprisingly, the expression of noncoding transcripts containing HERVH or LTR6A that were endoderm-specific was not sustained in terminally differentiated cells (Fig. 7a–c). Importantly, endoderm-specific HERVH-containing noncoding transcripts did not persist in endoderm-derived terminally differentiated somatic cells (Fig. 7b, c). In contrast to HERVH, the expression of ectoderm-specific noncoding transcripts containing L1P1_orf2 and LTRs was maintained in multiple ectoderm-derived somatic cells, such as neurons (Fig. 7d, e). Similarly, the expression of many ectoderm-activated coding transcripts containing L2, L1P1_orf2 and LTR remained active in ectoderm-derived terminally differentiated tissues. In summary, these results demonstrate how TE families can remain active across cell types, and how HERVH-related transcripts are transient in the endoderm, but L1 and L2 LINEs persist in the ectoderm.
a Expression dynamics of endoderm-upregulated HERVH and LTR6A-containing noncoding transcripts in hepatocyte-related cells, using transcripts containing the indicated TE families, that are defined as down-regulated in hPSCs. (EPS: Extended pluripotent stem cells; EPS1: Stage 1 EPS; EPS2: Stage 2 EPS; HPLC: Hepatic progenitor-like cells; EPS HPLCs: EPS-derived HPLCs; hiHEPs: iPSC-derived hepatocytes; EPS1 Heps: EPS1-derived hepatocytes; EPS2 Heps: EPS2-derived hepatocytes; F PH1: Fetal primary hepatocytes; F PHHs: Fresh primary human hepatocytes; hFLC: Human fetal liver cells). b, c Heatmaps (b) and boxplots (c) showing transcript expression of differentially expressed HERVH- and LTR6A-containing noncoding transcripts in hPSCs, endoderm and selected terminally differentiated endoderm-derived tissues. Each sample was from at least three biological replicates. The box represents the first and third quartiles, the midpoint is the median and the whiskers are 1.5 × interquartile range. d, e Heatmaps (d) and boxplots (e) showing the expression levels of differentially expressed noncoding and coding transcripts containing the indicated TEs in hPSC, ectoderm and ectoderm-derived somatic tissues. For the heatmaps, each row represents a differentially expressed transcripts containing the specified TEs. Transcript quantification was done using short-read data. Differential expression analyses were based on the expression in endoderm (b, c) and ectoderm (d, e) relative to hPSC. Data was from the reanalysis in Babarinde et al. 12. Each sample was from at least three biological replicates.
TE sequences influence transcript subcellular localization
Previous studies have shown that TE-containing transcripts tend to localize in the nucleus82,83. To investigate this, we generated RNA-seq data for nuclear and cytoplasmic subcellular fractions of hPSCs, mesoderm, endoderm, and ectoderm cells. Western blots for GAPDH (cytoplasm) and Histone H2B (nucleus) confirmed the relative purity of the subcellular fractions (Supplementary Fig. 13a). We computed the relative concentration index (RCI)84 to quantify the subcellular localization. This confirmed our previous observation that both coding and noncoding TE-containing transcripts preferentially localize to the nucleus in hPSCs12 (Fig. 8a). This pattern was the same for mesoderm and endoderm (Fig. 8a). Surprisingly, this was not the case for ectoderm, where the opposite pattern was seen: TE-containing transcripts were more likely to localize in the cytoplasm. Indeed, analyses of TE superfamilies revealed that they all tended to preferentially localize to the cytoplasm in ectoderm cells, and to the nucleus in all other cell types (Supplementary Fig. 13b). Greater detail was revealed at the transcript level. For example, LTR-containing noncoding transcripts tended to have different cytoplasm/nucleus enrichment when compared to other TE superfamilies in hPSC and endoderm (Supplementary Fig. 13c).
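As a rough illustration of the RCI, which is described in the legend of Fig. 8 as the log2-transformed relative expression in two subcellular compartments, the sketch below computes a cytoplasm/nucleus RCI per transcript and summarizes it by group. The pseudocount and the toy expression values are assumptions made for illustration; the text does not specify how zero values were handled.

```python
from math import log2
from statistics import median

def rci(expr_num, expr_den, pseudocount=1.0):
    """log2 ratio of expression in two compartments (e.g. cytoplasm / nucleus).

    The pseudocount is an assumption to avoid division by zero.
    """
    return log2((expr_num + pseudocount) / (expr_den + pseudocount))

def median_rci(pairs, pseudocount=1.0):
    """Median RCI over (numerator, denominator) expression pairs."""
    return median(rci(a, b, pseudocount) for a, b in pairs)

# Hypothetical (cytoplasm, nucleus) expression values per transcript
te_containing = [(1, 7), (3, 15), (3, 3)]
te_free = [(7, 1), (3, 3), (15, 3)]

print(median_rci(te_containing), median_rci(te_free))
```

With the cytoplasm as numerator, negative values indicate nuclear enrichment and positive values indicate cytoplasmic enrichment; the same function applies to a nucleoplasm/chromatin comparison by changing the inputs.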
a Cytoplasm/nucleus relative concentration index (RCI). RCI is the log2-transformed relative expression in two subcellular compartments. All data is in biological duplicate. The box represents the first and third quartiles, the midpoint is the median and the whiskers are 1.5 × interquartile range, for this and subsequent boxplots. b Nucleoplasm/chromatin RCI. c Stacked bar charts showing the counts (upper panel) and percentages (lower panel) of coding and noncoding transcripts localized in different subcellular fractions. All data is in biological duplicate. d Heatmap showing subcellular localization based on the Jaccard Index of pair-wise comparisons of hEDT transcripts localized to different subcellular compartments across cell states. e Stacked bars showing the distribution of the number of cell types in which transcripts are localized to a subcellular compartment. Significance is from a Chi-square test of independence. f Proportion of localized transcripts that were differentially expressed during the differentiation process. The left charts show the localization in hPSCs while the right charts show the localization in differentiated cell states. g Line plots showing the transcript stability of TE-containing transcripts, using time from actinomycin D treated cells. The expression levels at time 1 and 8 h, relative to the expression at 0 h are shown for each TE superfamily across different cell states. h Boxplots showing the stability of transcripts that are localized to different subcellular structures across cell states. Significance is from a Mann-Whitney U rank test (two-sided) with Bonferroni correction. All data is in biological duplicate.
Noncoding RNAs that localize to the nucleus are often recruited to chromatin85,86. Hence, we performed RNA-seq on the nucleoplasm and chromatin fractions to understand the sub-nuclear localization of TE-containing transcripts (Supplementary Fig. 13a). TE-containing hPSC transcripts were reduced in the chromatin compartment, versus the nucleoplasm (Fig. 8b). On the contrary, the TE-containing transcripts of endoderm, mesoderm and ectoderm tended to be more enriched in the chromatin fraction (Fig. 8b, and Supplementary Fig. 13d). Transcript-level analyses of nucleoplasm/chromatin enrichment revealed heterogeneity across coding potentials, cell states and TE superfamilies (Supplementary Fig. 13e). To investigate the subcellular transcript localization, we compared the expression levels of individual transcripts enriched in specific sub-cellular fractions (Supplementary Fig. 14a). The number of transcripts localized to the sub-cellular compartments varied across cell types (Fig. 8c). Importantly, compared to the differentiated cells, hPSCs had fewer transcripts localized to the nucleoplasm and chromatin, and transcript distribution revealed cell state-, coding/noncoding-, and location-specific differences. Specifically, while nucleoplasm-localized transcripts had the highest noncoding proportion in hPSC, chromatin-localized transcripts had the highest noncoding proportion in differentiated cells (Fig. 8c). This effect was also transcript-specific, as few transcripts overlapped in the distinct sub-cellular compartments between the different cell types (Supplementary Fig. 14b). Indeed, using Jaccard similarity, transcripts that localized to the nucleus or cytoplasm were more consistent in hPSC, endoderm and mesoderm but more divergent in ectoderm (Fig. 8d). Interestingly, we found that the distribution of transcripts based on the number of cell types in which they were localized varied across subcellular structures (Fig. 8e). The most unique chromatin localization was found in the mesoderm (Fig. 8d). Overall, nucleoplasm localization was the least consistent across the cell states.
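The Jaccard similarity used to compare localized transcript sets between cell states (Fig. 8d) is the size of the intersection divided by the size of the union. A minimal sketch, with hypothetical transcript IDs standing in for the transcripts localized to one compartment in each cell state:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(localized):
    """Jaccard index for every pair of cell states, keyed by (state1, state2)."""
    return {
        (s1, s2): jaccard(localized[s1], localized[s2])
        for s1, s2 in combinations(sorted(localized), 2)
    }

# Hypothetical sets of transcripts localized to one compartment per cell state
localized = {
    "hPSC": {"t1", "t2", "t3"},
    "endoderm": {"t2", "t3", "t4"},
    "ectoderm": {"t5"},
}
similarity = pairwise_jaccard(localized)
print(similarity)
```

A value near 1 means two cell states localize largely the same transcripts to that compartment, while a value near 0 indicates divergent localization.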
To further explore the relationship between transcript subcellular localization and cell state conversion, we checked the proportion of differentiation-induced differentially expressed transcripts (DETs) among the transcripts localized to various subcellular structures across cell states and found that subcellular localization significantly influenced the proportion of DETs (Fig. 8f), as the distribution of DETs was significantly different in localized transcripts compared to overall transcripts. For hPSCs, nucleoplasm-enriched transcripts had the lowest proportion of DETs. For differentiated cells, however, the lowest proportion of DETs was found in chromatin-localized transcripts. Overall, compared to the TE-free transcripts, TE-containing transcripts tended to be cell type-specific (Fig. 8e), with hPSCs having fewer chromatin-enriched TE transcripts.
An important question was how the localization of the transcript to different subcellular fractions was established, particularly in the ectoderm. We hypothesized that RBPs might contribute to the transcript localization. Consequently, we checked the RBP enrichment of sub-cellular localized transcripts. Interestingly, multiple RBP motifs were enriched in transcripts localized to subcellular compartments, and the patterns were cell type-specific (Supplementary Fig. 15a). For example, PABPC4, SART3 and PABPC1 were associated with strong cytoplasm localization in the mesoderm but there was no significant bias in hPSCs. On the other hand, RBM3 was associated with cytoplasm-localized transcripts in all cell types. Next, we checked if subcellular localization of TE-containing and TE-free transcripts was associated with the same RBPs. Interestingly, several RBPs had significantly higher enrichments in TE-containing transcripts localized to different subcellular structures (Supplementary Fig. 15b). Indeed, multiple RBPs were consistently associated with TE-containing transcripts localized to subcellular compartments in different cell states. For example, RBMS3, DAZAP1, PABPC4, SART3, and MSI1 were all found to be enriched in TE-containing transcripts localized to different subcellular compartments in the four cell types. These results suggest that RBPs may play a significant role in transcript subcellular localization and that TEs bias RBP interactions.
TE sequences influence transcript stability
As subcellular localization can influence degradation, we wondered if this would impact transcript half-life in a TE-, subcellular localization-, and cell type-dependent manner. We performed RNA-seq in differentiated cells treated with Actinomycin D to block transcription for 1 and 8 h and measured the transcript levels relative to the 0-hour time-point. Surprisingly, there were divergent patterns of transcript half-life across TE superfamilies, coding potentials, and cell states (Fig. 8g). The lowest RNA stability was found in hPSCs and ectoderm, whilst endoderm and mesoderm had more stable RNAs. Interestingly, the impact of TE presence was also dynamic: TE-free transcripts were more stable in hPSC, ectoderm, and endoderm, but not in mesoderm (Fig. 8g, Supplementary Fig. 16a, b). TE-containing ectoderm transcripts also contrasted with hPSCs and endoderm, as whilst TE-containing coding transcripts were less stable, TE-containing noncoding transcripts were more stable than TE-free transcripts (Fig. 8g, and Supplementary Fig. 16a, b). These results revealed that transcript stability varied across cell states and transcript coding ability.
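The stability measure described here, transcript level at 1 and 8 h of Actinomycin D treatment relative to 0 h, can be sketched as follows. The half-life estimate additionally assumes simple first-order (exponential) decay, which is an illustrative assumption and not a method stated in the text; the expression values are hypothetical.

```python
from math import log

def relative_remaining(levels):
    """Expression at each time point divided by expression at 0 h.

    levels: dict mapping hours after transcription block -> expression.
    """
    baseline = levels[0]
    return {t: value / baseline for t, value in levels.items()}

def half_life(levels, t):
    """Half-life in hours from the fraction remaining at time t.

    Assumes first-order decay: fraction = 0.5 ** (t / half_life).
    """
    fraction = levels[t] / levels[0]
    if not 0 < fraction < 1:
        # No measurable decay (or a non-positive level): undefined here
        return float("inf")
    return t * log(2) / -log(fraction)

# Hypothetical expression of one transcript at 0, 1 and 8 h
levels = {0: 100.0, 1: 80.0, 8: 25.0}
remaining = relative_remaining(levels)
t_half = half_life(levels, 8)
print(remaining, t_half)
```

Comparing the fraction remaining at 8 h between TE-containing and TE-free transcripts gives a relative stability contrast without requiring a decay model.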
We checked if TE presence influenced transcript stability in combination with sub-cellular localization on the basis that cytoplasmic RNAs tend to have a shorter half-life. As expected, nuclear-localized transcripts were more stable than cytoplasm-localized transcripts in the ectoderm; however, the opposite was true in the other cell types (Fig. 8h and Supplementary Fig. 16c). Conversely, chromatin-localized transcripts were more stable than nucleoplasm-localized transcripts in hPSCs, but the nucleoplasm-localized transcripts were more stable than chromatin-localized transcripts in differentiated cells.
A comparison TE-free to TE-containing transcripts identified multiple TEs associated with significant differences in stability after 8 h (Supplementary Fig. 16d). In hPSCs, TE presence mostly led to lower stability in all subcellular compartments (Supplementary Fig. 16d). However, in differentiated cells, mosaic patterns were found. For example, TE-containing transcripts localized to cytoplasm and nucleoplasm in mesoderm were more stable, along with cytoplasm-localized transcripts in ectoderm (Supplementary Fig. 16d). Overall, the data suggest that stability and subcellular localization of TE-containing transcripts are dynamic across different cell state conversions.
Discussion
TEs are difficult to place in their genomic context due to their repetitive nature. However, understanding expression dynamics would substantially benefit from robust TE-containing transcript assemblies12,33. Here, we used long-read RNA-seq technology to generate an accurate TE-containing transcript assembly27,28 in in vitro cells that mimic early human embryonic development. Using this assembly, we found widespread and dynamic TE expression in human early development. Multiple studies have shown the overexpression of TEs in hPSCs9,10,87,88. However, we find that the expression of TE sequences is not restricted to hPSCs, and multiple TEs are expressed in differentiated germ layers. Interestingly, although the same TE subfamilies are expressed, they are found in different transcripts. For example, hPSCs and endoderm both expressed HERVH, but we show the transcripts that contribute those TEs are different. There was a similar case for LINE elements in ectoderm cells.
Our long-read RNA-seq data indicates that the ectoderm has a particularly complex transcriptome with higher levels of TE-transcripts. The expression of LINEs was particularly prevalent, and, unlike the endoderm-specific TEs, many ectoderm-specific TEs were expressed in terminal somatic cells and later stages of development. This is consistent with reports of TE activity in terminally differentiated ectoderm-derived cells and tissues31,89,90,91. The exceptions to higher TE activities in ectoderm involved HERVH, LTR7, LTR6A, and several other LTRs that have previously been reported to be enriched in hPSCs10,70. In addition to expression in hPSCs, HERVH is also expressed in endoderm, and some HERVH-containing transcripts were higher in endoderm while others were higher in hPSCs. Overall, all three lineages had specific patterns of TE expression, broken down by superfamily, family and subfamily level down to individual transcripts. This reveals the complex pattern of TE expression in chimeric transcripts and shows how TE expression is not just limited to the pre-implantation stages of embryonic development but is also prominent at gastrulation.
Consistent with previous studies12,92,93, we found that TE presence leads to lower expression levels, probably due to the activities of RBPs15,93. Indeed, SINE elements are overrepresented in 3’ UTRs, and STAU1 promotes mRNA decay by targeting Alu elements15. This indicates that meta-analysis of TEs has limitations, and TE sequences should be considered in their specific transcript context. From a biological perspective, it also suggests that cell-state specific regulation of TE expression occurs at the transcriptional and post-transcriptional stages. Interestingly, we found that TEs tend to be enriched in the chromatin fraction of the differentiated cells, suggesting that TE enrichment on chromatin might be associated with chromatin changes which might induce or regulate cell fate transitions. Indeed, sub-cellular enrichment of TE-transcripts was dynamic between the different lineages, and there was no simple relationship between TE, TE-transcript, cell type and subcellular localization. One prominent feature of mesoderm is the depletion of TE-transcripts in the nuclear fraction. Similarly, TE-transcript stability was also cell type-specific: unstable in hPSCs and endoderm, but more stable in mesoderm and ectoderm. The functional impact of these changes remains to be elucidated, and whether all TE expression changes are drivers of cell differentiation or bystanders requires detailed studies.
Several studies have demonstrated roles for TE-containing transcripts in pluripotency70,94,95, cardiomyocyte development96 and neurogenesis91,97,98, and we predict that TE-transcripts, including coding TE sequence fragments, have an unappreciated role in somatic development. Taken together, this study reveals how TE-containing transcripts are a normal part of germ layer development and are highly dynamic along multiple axes.
Methods
In vitro cell differentiation into the three germ layers and RT-qPCR validation
H1 human pluripotent stem cells were maintained on Matrigel-treated plates, cultured in mTeSR1 (STEMCELL TECHNOLOGIES, 85850) medium and digested with Accutase (Sigma, A6964) every 5–7 days. In vitro germ layers were generated by changing the medium and adding inhibitors or cytokines when the hPSCs reached 60–70% confluency, as previously reported45.
For differentiation, cells were treated as previously described45.
Briefly, for endoderm differentiation, hPSCs were cultured for 120 h in RPMI medium (Life Technologies) supplemented with 100 ng/ml Activin A (PEPROTECH, 120-14E-100), 50 nM/ml WNT3A (MedChemExpress, HY-P70453A), 0.5% FBS (BIOWEST, S1580), 2X GlutaMax (Thermo Fisher, 35050061), 0.2X MEM Non-Essential Amino Acids Solution (Thermo Fisher, 11140050), and 55 µM β-mercaptoethanol (BBI, A600194).
For mesoderm differentiation, hPSCs were cultured in DMEM/F12 supplemented with 2X GlutaMax, 0.2X MEM Non-Essential Amino Acids Solution, 55 µM β-mercaptoethanol, 0.5% FBS, 100 ng/ml VEGF (PEPROTECH, 100-20-50), 100 ng/ml BMP4 (PEPROTECH, 120-05ET), 10 ng/ml bFGF (PEPROTECH, 100-18B-100), and 100 ng/ml Activin A. From 24 to 120 h of mesoderm differentiation, Activin A was removed from the culture medium.
For ectoderm differentiation, hPSCs were cultured in DMEM/F12 supplemented with 0.2X MEM Non-Essential Amino Acids Solution, 55 µM β-mercaptoethanol, 15% KOSR (Life Technologies), 2 µM A83-01 (Tocris, 2939), 2 µM PNU-74654 (Tocris, 113906), and 2 µM Dorsomorphin (Tocris, 3093).
Medium was changed daily, and the cells were collected after 5 days. The success of each differentiation experiment was confirmed by the expression of lineage specific marker genes using RT-qPCR quantification. Oligonucleotides are listed in Supplementary Data 3.
Long-read and short-read sequencing
RNA was isolated using RNAzol RT (MRC, RN190) according to the manufacturer’s protocol. Samples for short-read RNA-Seq were prepared for sequencing with the NEBNext Ultra RNA Library Prep Kit (NEB, #7530). Short-read RNA sequencing was performed on the Illumina NovaSeq 6000 platform. Samples for long-read RNA sequencing were purified with the RNeasy Mini kit. The RNA library generated with the PacBio binding kit (PacBio, 101-849-000) was sequenced on a Sequel II (PacBio).
Data generation and retrieval
PacBio long-read bulk RNA-Seq data for the four cell states (PSC, endoderm, mesoderm and ectoderm) were generated in duplicate for the purpose of transcript assembly and long-read quantification. For each cell state, short-read bulk RNA-Seq data were also generated in duplicate for the purpose of expression quantification. We also generated replicated sub-cellular RNA-Seq data for nucleus, cytoplasm, chromatin and nucleoplasm for the four cell states. Further, we generated replicated stability RNA-Seq data for the four cell states at 0, 1 and 8 h after Actinomycin D treatment.
We retrieved a collection of short-read RNA-seq data generated from multiple studies. These data were previously analyzed for a different purpose28. Short-read bulk RNA-Seq hepatocyte data used to investigate transcript expression in hepatocytes were retrieved from two studies99,100. The scRNA-seq data and endoderm differentiation bulk RNA-Seq data were retrieved from Chu et al.47. We also retrieved the mesoderm, endoderm and ectoderm scRNA-seq data from a human in vivo gastrulating embryo48. POLR2A ChIP-Seq and a portion of deepCAGE data were retrieved from the ENCODE database101 (https://www.encodeproject.org). Another set of deepCAGE data were retrieved from Poletti et al.102. Polyadenylated data from previous studies103,104 were retrieved. For the analyses of histone modification in hPSC, published data from two studies105,106 were retrieved. Nanopore long-read data of human early development46 were retrieved to confirm the identity of our cell states. Short-read RNA-Seq data were retrieved from Tchieu et al.50 to further confirm the identities of the ectoderm cells and investigate TE dysregulation in different ectodermal lineages.
Transcript assembly and assembly quality evaluation
Independent transcript assembly based on long-read RNA-Seq data was done for each cell state. Transcript assembly was done using IsoSeq V3 pipeline of the Pacific Biosciences (PacBio) (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md). For each sample, the high-fidelity (HiFi) consensus reads were first generated from the raw PacBio subreads using ccs (https://github.com/PacificBiosciences/ccs). The following parameters were used: --min-rq 0.9 --min-passes 1 --min-snr 1. The full-length sequences were then generated by primer removal and demultiplexing using lima (https://github.com/lima-vm/lima). The full-length sequences were refined by isoseq3 (https://github.com/PacificBiosciences/IsoSeq) to remove the noise. The noiseless full-length sequences were then converted to the fastq format using PacBio’s bam2fastq (https://github.com/PacificBiosciences/pbtk#bam2fastx). After the processing of the PacBio raw reads, replicates of the same cell state were merged.
For the mapping of the noiseless full-length sequences to the human genome, a Minimap235 reference was prepared using the GRCh38 primary assembly of the human genome and version 43 (GENCODEv43) of the human reference GTF obtained from the GENCODE database53. GENCODE reference GTF junctions were made using paftools in the Minimap2 package. The sequence mapping was done using Minimap2. The output alignment was then sorted using SAMtools107. The transcript assembly was done using StringTie36,37. For StringTie assembly, the following parameters were used: -s 2 -c 1 -L. The assembled transcripts were then filtered to remove transcripts with undefined strand or those shorter than 200 bp. The initial quality assessment of our assembly was based on the strand information and the transcript lengths. Further quality assessments were done using different genomic data and read coverage and conservation analyses. For expression comparison across different cell states, the assemblies from each cell state were merged into a human early development transcript (hEDT) reference using gffcompare108.
Splice junction coverage
For splice coverage computation, a Bowtie2109 reference was first built from the fasta sequences of the assembled transcripts. Paired-end short-read transcriptome sequences were then mapped to the built reference using Bowtie2. Using a custom Python script (junction_coverage.py) with the assembly GTF file and the Bowtie2 alignment as inputs, the number of reads covering each splice junction was estimated. Only coverage with an anchor length of at least 10 bp was considered.
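The anchor-length filter can be illustrated with a minimal sketch. This is not the junction_coverage.py script itself: the function name, the input layout (read alignments as half-open intervals in spliced-transcript coordinates) and the requirement of 10 bp on both sides of the junction are assumptions made for illustration.

```python
MIN_ANCHOR = 10  # assumed: minimum aligned bases required on each side of a junction


def count_junction_reads(junctions, alignments, min_anchor=MIN_ANCHOR):
    """Count reads that span each exon-exon junction of one spliced transcript.

    junctions:  offsets (transcript coordinates) of the first base after each junction.
    alignments: iterable of (start, end) half-open intervals on the same transcript.
    """
    counts = {j: 0 for j in junctions}
    for start, end in alignments:
        for j in junctions:
            # A read supports the junction only if it is anchored by at least
            # `min_anchor` bases upstream and downstream of the junction.
            if j - start >= min_anchor and end - j >= min_anchor:
                counts[j] += 1
    return counts
```

Reads that merely touch a junction with a few bases are ignored, which reduces spurious support from alignment ends.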
Splice site conservation
For splice conservation, the flanking regions, including 10 bp in the exons and 10 bp in the introns, were extracted for both the 5’ and 3’ sides of the splice junction. The 100-way phyloP conservation scores110 for version GRCh38 of the human genome were downloaded from the UCSC database111. Using a custom Python script (wig_overlap_range_full_use.py), the phyloP conservation score for each position, relative to the splice junction of each multi-exon transcript, was computed.
Splice junction nucleotide frequencies
The flanking regions, including 10 bp in the exons and 10 bp in the introns were extracted separately for both acceptor and donor splice junctions. The nucleotide sequences for each junction were then extracted using BEDtools. The nucleotide frequencies were then computed using WebLogo 3112,113.
Alternative splicing event reference count
SUPPA2114 was used to investigate the numbers of alternative splicing events. The GTF files of the hEDT assemblies were first converted to SUPPA-recognized GTF files. Then, generateEvents in the SUPPA package was used to generate individual splicing events. The number of each event type in all the assemblies was then counted.
Generation of random GTF sequences
For the generation of random GTF files, the gene coordinates of the transcripts were first extracted. The input gene coordinates were then shuffled with BEDtools115 shuffle. Using the shuffled coordinates, random transcripts were then extracted using the original isoform and exon structures. This procedure ensured that the random sequences were found on the same chromosome and had the same lengths and isoform/exon structures as the original input GTF. A custom random GTF generator Python script was written to accomplish this purpose.
Transcript assembly pileup and heatmap for POLR2A and histone modification ChIP sequencing, deepCAGE and polyA tail sequencing
For short read heatmaps, Bowtie2109 genome reference was built using the GRCh38.p13 version of human genome. The reads were then mapped to the Bowtie2 genome reference using Bowtie2. For long reads, the alignment made with Minimap235 was used. The alignment was sorted and the alignment index was prepared with SAMtools107. Alignment was done individually for each sample. Individual alignment was then converted to bigwig file using bamCoverage in deepTools2116. For samples and regions to be plotted together, computeMatrix of deepTools2 package was used to prepare the matrix. The matrix was then plotted using plotHeatmap of deepTools package.
Estimation of the assembly exhaustiveness using short-read data
To investigate the proportion of potential transcripts that might have been missed by our assembly procedure, we took advantage of the short-read RNA-seq data. For each cell state, the short-read RNA-seq data were first cleaned with fastp. The reads were then mapped to the assembled transcripts using Bowtie2 with the --very-sensitive-local option. The overall alignment rates and the alignment rates of concordant read pairs were then retrieved. The percentages of the short-read data mappable to the assemblies from the long-read data were used as indicators of the exhaustiveness of our transcript assembly.
Coding potential and coding sequence prediction
FEELnc52 was used for the prediction of coding potential of the assembled transcripts for each cell state and the merged assembly. Version 43 of the human protein coding (pc) and long noncoding RNA (lncRNA) transcript fasta sequences obtained from the GENCODE database were used for the random forest training. Using the version GRCh38 primary assembly genome, FEELnc_codpot.pl was used to estimate the coding potential of the transcripts.
For the CDS annotation of the assembled transcripts, version 42 of the amino acid sequences of human protein-coding transcripts were downloaded from the GENCODE database53. The sequences were used to make a BLAST117 database. Pfam-A.hmm database118 was also downloaded. CDS prediction started with the extraction of the longest open reading frame (orf) using TransDecoder.LongOrfs of the TransDecoder package (https://github.com/TransDecoder/TransDecoder) from the Trinity transcript assembler119,120. The homology searches for the closest amino acid sequence of the longest orf were then conducted using BLASTP117 and hmmscan121. The outputs of the homology searches were then used for the prediction of CDS in the transcripts using TransDecoder.Predict in the TransDecoder package.
TE search
TE searches were performed individually for the assembly of each cell state, and for the merged assembly. The assembled transcripts were converted to fasta format using BEDTools115. For the search of TEs in the assembled transcripts, the HMM TE sequence alignments were obtained from version Dfam_3.6 of the Dfam database122. The TE search was conducted using nhmmer54. TE splicing frequency was computed using a custom Python script. For noncoding transcripts, the body of each transcript was divided into 20 equal bins. The proportion of transcripts with the specified TE was then computed for each bin position. For coding transcripts, the transcripts were first divided into 5’ UTR, CDS and 3’ UTR. Each region was divided into 20 bins. For each region, the proportion of transcripts with the specified TE was then computed for each bin position.
For the analyses of TE presence at the TSS and TES, the TSS and TES for each transcript were first extracted. For each transcript, the regions of the TSS and TES including 2 kbp upstream and downstream were extracted to make a 4 kbp region for each transcript end. The sequences of the TSS and TES regions were then searched against the TE database. The regions were then divided into 20 bins, after which the proportion of transcripts with the specified TE was estimated for each bin.
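The 20-bin positional profile can be sketched as follows. This is a minimal illustration rather than the custom script used here; the function name and the input layout (TE hits as half-open intervals in the coordinates of the transcript or region being binned) are assumptions.

```python
def te_bin_profile(transcripts, n_bins=20):
    """Proportion of transcripts carrying a given TE in each positional bin.

    transcripts: list of (length, [(te_start, te_end), ...]) where the TE hits
                 are half-open intervals in transcript (or region) coordinates.
    """
    hits = [0] * n_bins
    for length, te_hits in transcripts:
        covered = set()
        for te_start, te_end in te_hits:
            first = int(te_start * n_bins / length)
            last = int((te_end - 1) * n_bins / length)
            covered.update(range(first, min(last, n_bins - 1) + 1))
        # Each transcript contributes at most once per bin.
        for b in covered:
            hits[b] += 1
    total = len(transcripts)
    return [h / total for h in hits] if total else [0.0] * n_bins
```

The same function applies to a transcript body, a 5’ UTR, CDS or 3’ UTR segment, or a 4 kbp TSS/TES window, as long as the TE coordinates are expressed relative to that segment.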
Analysis of the consequences of TE chimeric transcripts
We focused our analyses of the consequences of TE chimeric transcripts on genes with both TE-containing and TE-free transcripts. We compared the exon structure of each TE-containing transcript to the closest TE-free transcript. The closest TE-free transcript was defined as the transcript with the highest overlapping exon coverage. The FEELnc52 prediction and the CDS length from TransDecoder in the Trinity assembler119,120 were used to infer the consequences of TE insertion. The consequence was defined with respect to the coding potential and the CDS length. If the TE-containing and the closest TE-free transcripts were both noncoding, the consequence was defined as noncoding. When the TE-free transcript was predicted noncoding and the TE-containing transcript was predicted coding, the consequence was termed creating. For a TE-free coding transcript associated with a TE-containing noncoding transcript, the consequence was termed disrupting. When the TE-containing and TE-free transcripts were both coding and their CDS lengths were within 90% of each other, the consequence was termed preserving. When the TE-containing and TE-free transcripts were both predicted coding but their CDS lengths were not within 90% of each other, the consequence was termed modifying.
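The decision rules above can be written as a small function. This is a sketch under one stated assumption: "within 90% of the other" is interpreted as the shorter CDS being at least 90% of the length of the longer one. The function and argument names are hypothetical.

```python
def classify_consequence(te_coding, free_coding, te_cds_len=0, free_cds_len=0,
                         tolerance=0.9):
    """Classify a TE-containing transcript against its closest TE-free transcript.

    te_coding / free_coding: coding-potential calls (True = coding).
    te_cds_len / free_cds_len: predicted CDS lengths, used only if both are coding.
    """
    if not te_coding and not free_coding:
        return "noncoding"
    if te_coding and not free_coding:
        return "creating"
    if free_coding and not te_coding:
        return "disrupting"
    # Both coding: compare CDS lengths (assumed rule: shorter / longer >= 0.9).
    shorter, longer = sorted((te_cds_len, free_cds_len))
    if longer > 0 and shorter / longer >= tolerance:
        return "preserving"
    return "modifying"
```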
Differentially spliced TEs
To identify TEs that tend to be more frequently retrieved in the assembly of a specific cell state, the numbers of transcripts with or without the specified TE were compared. The comparisons were done separately for coding and noncoding transcripts. Because the differentiation process started with PSCs, the assemblies of the three germ layers were compared to the PSC assembly. Fisher's exact tests followed by Bonferroni multiple-test corrections were used to test statistical significance. An odds ratio threshold of 2 was additionally required for significance. A matrix of the proportions of transcripts containing the significant TEs was then made.
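As an illustration of this test, the sketch below implements a two-sided Fisher's exact test with the standard library, applies a Bonferroni correction and requires an odds ratio of at least 2. The function names, the 0.05 significance level and the choice to test only enrichment in the germ layer (odds ratio ≥ 2, not ≤ 0.5) are assumptions; in practice a statistics library would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):  # hypergeometric probability of k in the top-left cell
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def odds_ratio(a, b, c, d):
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def significant_tes(tables, alpha=0.05, min_odds_ratio=2.0):
    """tables: {te: (a, b, c, d)} with a/b = germ-layer transcripts with/without
    the TE and c/d = PSC transcripts with/without the TE."""
    n_tests = len(tables)
    selected = {}
    for te, (a, b, c, d) in tables.items():
        p_adj = min(1.0, fisher_exact_two_sided(a, b, c, d) * n_tests)  # Bonferroni
        ratio = odds_ratio(a, b, c, d)
        if p_adj <= alpha and ratio >= min_odds_ratio:
            selected[te] = (ratio, p_adj)
    return selected
```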
Long-read alignment and transcript quantification
Long read alignments were made with minimap235. Noiseless full-length sequences were aligned to the assembled transcripts. Transcript quantification was done using a custom Python script that quantifies the number of sequences that map to each transcript. When a sequence maps to more than one transcript, the sequence is assigned to the transcript with the highest percent identity and aligned transcript coverage.
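The assignment rule can be sketched as below. This is not the custom script itself: the input tuple layout and the tie-breaking order (percent identity first, then aligned transcript coverage) are assumptions, since the text does not state how the two criteria are combined.

```python
def assign_reads(hits):
    """Assign each long read to a single transcript and count reads per transcript.

    hits: iterable of (read_id, transcript_id, percent_identity, transcript_coverage).
    """
    best = {}
    for read_id, transcript_id, identity, coverage in hits:
        key = (identity, coverage)  # assumed ordering: identity, then coverage
        if read_id not in best or key > best[read_id][0]:
            best[read_id] = (key, transcript_id)
    counts = {}
    for _, transcript_id in best.values():
        counts[transcript_id] = counts.get(transcript_id, 0) + 1
    return counts
```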
Cell state-specific transcript with long-read quantification
For the evaluation of cell state-specific expression, previously reported MGFR123 steps were adapted. GC-content expression normalization was done using EDASeq124. For each transcript, the samples were then ranked by expression. Transcripts for which all replicates of a cell state were ranked highest were retained. Using DESeq2125 to extract differentially expressed transcripts, transcripts for which the highest-ranking cell state was not significantly different from the others were discarded. To get the marker transcripts, transcripts expressed at less than 10 in the highest-ranking cell state or more than 10 in other cell states were discarded. Additionally, for a marker transcript, every replicate of the marked cell state must have at least 10-fold higher expression than any replicate of the non-marked cell states. For the extraction of the top uniformly expressed transcripts, the transcripts were ranked by expression variation (coefficient of variation). The 20 transcripts with the lowest expression variation were presented.
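The marker thresholds can be illustrated with a short sketch. It covers only the expression cut-offs and the 10-fold replicate rule, not the ranking or DESeq2 steps. The function names are hypothetical, and applying the threshold of 10 to the mean expression of each cell state is an assumption, because the text does not say whether it applies to means or to individual replicates.

```python
from statistics import mean, pstdev


def is_marker(expr, marked_state, min_expr=10.0, fold=10.0):
    """expr: {cell_state: [normalized expression per replicate]}."""
    marked = expr[marked_state]
    others = {s: reps for s, reps in expr.items() if s != marked_state}
    if mean(marked) < min_expr:
        return False
    if any(mean(reps) > min_expr for reps in others.values()):
        return False
    # Every marked replicate must exceed every non-marked replicate by `fold`.
    highest_other = max(v for reps in others.values() for v in reps)
    return min(marked) >= fold * highest_other


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else float("inf")
```

Ranking transcripts by `coefficient_of_variation` in ascending order and keeping the first 20 would give the uniformly expressed set.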
Short-read alignment and transcript quantification
Short-read alignment was done using STAR126,127. STAR references were made from the assembled transcripts. Short-read sequences were then mapped to the STAR references. For short-read quantification, RSEM STAR references were made using the assembled transcripts. The short-read sequences were first filtered with fastp128. The filtered short-read sequences were then used for quantification using STAR and RSEM129. DESeq2 was used for the identification of differentially expressed transcripts. An adjusted P value of 0.05 was used as the threshold for statistical significance.
Human embryonic skin fibroblast (hESF) culturing and analyses
We used human embryonic skin fibroblasts, hESF (CCC-ESF-1; BNCC100546; RRID:CVCL_6884), to investigate the influence of differentiation-inducing factors on the transcriptional landscape and to establish that the expression patterns were due to actual changes in expression, and not just a reflection of the different factor treatments. The initial hESF cells were grown in 90% DMEM-H basal medium supplemented with 10% fetal bovine serum (FBS, NATOCOR). The medium was changed every day. We cultured the cell lines in endoderm and mesoderm media for five days with the normal differentiation-inducing factors, mimicking the differentiation process from the hPSCs. As a control, hESFs were also cultured in the normal hESF medium for five days. At the end of the fifth day, RNA was extracted and sequenced on short-read platforms. Transcript quantification was done as for the hPSC-derived cell lines. Cell clustering analyses were then done using DESeq2.
Functional single nucleotide polymorphism (SNP) analyses for novel and matching transcripts
Clinically associated and phenotype-associated SNPs were retrieved from version 113 of the Ensembl database130. The number of functionally relevant SNPs found per 1000 bp (kbp) or 1,000,000 bp (Mbp) of the transcripts was then computed using a custom Python script. Similar analyses were performed for the matched random transcripts (with the same length and exon structure). Functional SNP analyses were focused on the exons of the novel and matching transcripts for each cell lineage. Fisher's exact test was used to test the statistical significance of the differences.
Short read differential TE expression
To investigate TEs that were differentially expressed during the differentiation process, short-read transcript expression quantification was performed using the merged reference for all the replicates and cell states. The average TPM expression of each transcript was computed. The expression values were log2-transformed. The expression levels of all TE-containing transcripts in each of the three germ layers were compared to PSC levels. A t-test was used to test the statistical significance of the difference in average expression of transcripts containing the specified TE. Additionally, a 2-fold expression change was required for a TE to be considered significantly differentially expressed.
Genomic data visualization using Integrative Genomics Viewer (IGV)
Visualization of genome regions of interest was done using version 2.17.4 of the Integrative Genomics Viewer131. For the high-throughput data to be visualized, the sorted alignment files were converted to bigwig format using bamCoverage in deepTools2116. Counts per million (CPM) normalization with a bin size of 10 bp was used. For TE visualization, the repeat mask file of the hg38 human genome was retrieved from the UCSC database111. A BED file was then made using a custom Python script. Only LINE, SINE, LTR, Retroposon and DNA TEs were included in the TE track. Each of the five TE classes was assigned a different track color.
Single cell RNA (scRNA) transcript quantification
For scRNA-Seq data, short-read sequences were first filtered with fastp128. Transcript quantification was done with RSEM129 using the merged hEDT assembly. The scRNA-Seq quantifications were merged into a matrix. scRNA-Seq processing was done using Seurat132,133. Transcripts with an expression count of at least 2 in fewer than 200 cells were discarded. Cells with at least 25% mitochondria-derived or 40% ribosome-derived transcripts were discarded. The filtered matrix was normalized using LogNormalize. Using the vst method, the top 10,000 variable transcripts were retained. The matrix was then used to run the PCA. The optimum number of dimensions was chosen between 1 and 50 based on the Elbow analysis. Cell clustering was done using UMAP. For the expression of multiple transcripts containing the same TE, MetaFeature of Seurat was used to merge the TE-containing transcripts. FeaturePlot was used to plot the expression levels of the individual and merged transcripts.
RNA-seq from subcellular structures
RNA extraction from subcellular structures was performed according to a previously reported protocol134. The plated cells were digested and washed once with ice-cold PBS. Cytoplasmic fractions were isolated by incubating the cells in ice-cold cytoplasmic lysis buffer containing 10 mM Tris-HCl pH 7.5, 0.05% NP-40, 150 mM NaCl, 25 μM α-amanitin (Sigma, A2263), 40 U/ml SUPERase.IN (Thermo Fisher, AM2694) and 1× complete protease inhibitor mix (Sigma, 11697498001). The cell lysate was carefully layered on top of 500 μl of sucrose buffer (10 mM Tris-HCl pH 7.0, 150 mM NaCl, 25% (wt/vol) sucrose, 25 μM α-amanitin, 20 U SUPERase.IN, and 1x Protease inhibitor mix) in a new 1.5 ml tube. The sample was centrifuged for 10 min at ~2000 × g at 4 °C. The cytoplasmic fraction was collected as the supernatant, transferred to a new 1.5 ml tube and put on ice. The nuclear pellet was washed with 500 μl of ice-cold PBS. The nuclear RNA fraction was extracted from a portion of the nuclear pellet.
The nuclear pellet was resuspended in 100 μl of nuclear resuspension buffer (20 mM Tris-HCl (pH 8.0), 75 mM NaCl, 0.5 mM EDTA, 50% (vol/vol) glycerol, 0.85 mM DTT, 25 μM α-amanitin, 10 U SUPERase.IN, and 1x Protease inhibitor mix). Next, 100 μl of nuclear lysis buffer (1% (vol/vol) NP-40, 20 mM HEPES (pH 7.5), 300 mM NaCl, 1 M urea, 0.2 mM EDTA, 1 mM DTT, 25 μM α-amanitin, 10 U SUPERase.IN, and 1x Protease inhibitor mix) was added, and the sample was vortexed for 5 s. After 3 min of incubation on ice, the sample was centrifuged at approx. 26,000 × g at 4 °C for 2 min. The nucleoplasmic fraction was collected as the supernatant in a new 1.5-ml tube. The chromatin-associated fraction was collected as the pellet in ice-cold PBS.
RNA extraction for the nucleus- and chromatin-associated fractions was carried out by adding 100 µl PBS to resuspend the pellet. RNA was isolated by adding 500 µl RNAzol RT (MRC, RN190), according to the manufacturer’s protocol. RNA suspended in the supernatant fractions (including the cytoplasmic and nucleoplasmic fractions) was extracted by the TRIZOL-PLS region. In short, 3.5x sample volumes of RLT buffer were added and mixed. The initial sample was added to 2.5x volumes of ice-cold absolute ethanol. The sample was mixed before centrifuging at 16,000 × g at 4 °C for 15 min. The ethanol was removed, and the pellet was resuspended and washed twice with 75% ethanol. The dry pellet was dissolved in RNase/DNase-free water. The success of the sub-cellular RNA extraction was confirmed by western blotting for proteins known to be preferentially localized to specific sub-cellular structures: HNRNPU (1:2000, Santa Cruz; sc-32315), H2B (1:2000, Abcam; ab1790), GAPDH (1:2000, Novus; NB300-221SS) and LMNB1 (1:2000, Abcam; ab133741). Raw unprocessed western blots are available in Supplementary Data 4.
Computation of transcript relative concentration index (RCI)
For the quantification of subcellular RCI, short-read transcript quantification was first done using RSEM129 and STAR126 as highlighted above. RCIs were then computed using the average expression of each transcript as previously reported12,84. For transcript x with the average TPM expressions xA in subcellular location A and xB in subcellular location B, the RCI for subcellular location A relative to location B is given as (1):
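A minimal sketch of an RCI calculation follows. It assumes the commonly used log2-ratio form, RCI = log2(xA / xB), and adds a pseudocount to avoid division by zero. Both the exact form and the pseudocount value are assumptions for illustration and may differ from the computation used here.

```python
from math import log2


def rci(x_a, x_b, pseudocount=1.0):
    """Relative concentration index of location A relative to location B.

    x_a, x_b: average TPM of one transcript in locations A and B.
    pseudocount: hypothetical stabilizer so transcripts with zero TPM are defined.
    """
    return log2((x_a + pseudocount) / (x_b + pseudocount))
```

Positive values indicate relative enrichment in location A, negative values enrichment in location B.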
Retrieval of transcripts localized to each subcellular structure
The retrieval of differentially localized transcripts was done at two levels for each cell state. For the extraction of transcripts localized to either cytoplasm or nucleus, differential transcript expression analysis was conducted between cytoplasm and nucleus using DESeq2125. Differentially localized transcripts were defined as transcripts with an absolute log-fold change ≥ 1 and an adjusted p value ≤ 0.05 between the two subcellular structures. Transcripts were then classified into cytoplasm or nucleus transcripts. Next, we conducted the subcellular structure localization for chromatin and nucleoplasm using the same procedure. Consequently, transcripts localized to different subcellular structures in the four cell states were retrieved.
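The thresholding step can be sketched as a small function applied to the output of a differential expression test between two fractions. The function name, the argument layout and the orientation of the log-fold change (first fraction over second) are assumptions for illustration.

```python
def classify_localization(log2_fc, padj, labels=("cytoplasm", "nucleus"),
                          min_abs_lfc=1.0, alpha=0.05):
    """Return the fraction a transcript is localized to, or None if not significant.

    log2_fc: log2 fold change of labels[0] over labels[1] (assumed orientation).
    padj:    multiple-testing adjusted p value (None if the test was not run).
    """
    if padj is None or padj > alpha or abs(log2_fc) < min_abs_lfc:
        return None
    return labels[0] if log2_fc > 0 else labels[1]
```

The same call with a different `labels` pair covers the second level of the comparison.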
Subcellular structure consistency
To investigate the consistency of subcellular transcript localization across cell states, we computed Jaccard similarity scores between the transcript sets localized to the same structures across each pair of cell states. Given Ax, the set of transcripts localized to subcellular location x in cell A, and Bx, the set of transcripts localized to subcellular location x in cell B, the Jaccard similarity score is given as (2):

$$J(A_x, B_x) = \frac{|A_x \cap B_x|}{|A_x \cup B_x|} \qquad (2)$$
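The Jaccard score, the size of the intersection of two sets divided by the size of their union, can be computed for every pair of cell states as in the sketch below. The function names and the dictionary layout are hypothetical, and returning 0.0 for two empty sets is a convention chosen here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of transcript identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(localized):
    """localized: {cell_state: set of transcripts localized to one structure}."""
    return {
        (s1, s2): jaccard(localized[s1], localized[s2])
        for s1, s2 in combinations(sorted(localized), 2)
    }
```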
Transcription factor binding motif enrichment
Transcription factor binding motif (TFBM) analyses were done using homer2135. Known vertebrate motifs in the homer database (472 in number) were used for the analyses. We focused our search on core promoters (200 bp) upstream of the transcription start site. For each transcript category, we extracted the 200 bp promoter regions. Overlapping promoter regions were concatenated such that no genomic region was used more than once. Significant enrichment was defined at a homer adjusted p value threshold of 0.05. P value adjustment was done using Bonferroni correction. To investigate the TFBM enrichment during lineage differentiation, differentially expressed transcripts between PSCs and each of the three germ layers were extracted. This produced transcripts that were upregulated (up) or downregulated (down) in each germ layer. Transcripts classified as hPSC-enriched included transcripts with significantly higher expression in hPSCs than in at least one of the differentiated cell types. TFBM analyses were done independently for TE-containing and TE-free transcripts. As background, we retrieved transcripts with a DESeq2-normalized average expression of at least 10 that were not classified as differentially expressed by our definition. This procedure identifies TFBMs that might be relevant in each differentiation process. The significantly enriched TFBMs are shown in the scatter plots, with the names of the five most significant TFs labeled. To compare the TFBM enrichment for HERVH transcripts in hPSC and endoderm, we extracted the sequences of HERVH transcripts with significantly different expression in hPSC and endoderm. To get TFBM overrepresentation in endoderm-biased HERVH transcripts, hPSC-biased HERVH transcripts were used as background. Conversely, endoderm-biased HERVH transcripts were used as background to get hPSC-overrepresented TFBMs.
RNA-binding protein (RBP) enrichment analyses for differentially expressed and localized transcripts
To investigate whether differential transcript expression was driven by known RNA-binding proteins, we tested for RBP motif enrichment among differentially expressed and localized transcripts in each cell lineage using AME136 in version 5.5.7 of The MEME Suite137. Previously published human RBP motifs138 available in The MEME Suite were used for the analyses. Differentially expressed TE-containing and TE-free transcripts and control transcripts were retrieved as described under TFBM analyses. The nucleotide sequences of the spliced transcripts were then extracted. To investigate RBPs that differentially bind TE-containing transcripts in each subcellular structure, we classified all localized transcripts into TE-containing and TE-free sets. We then used TE-free transcripts as control transcripts to identify RBP motifs overrepresented in TE-containing transcripts across the different subcellular structures in each cell state. Enrichment significance was set at an adjusted AME P value of 0.05.
Computation of transcript relative stability
hPSCs and the three differentiated germ layer cell types were cultured in 6-well plates. Actinomycin B was then added to inhibit the synthesis of new RNA. RNA was extracted at 0, 1, and 8 h after the addition of Actinomycin B, and short-read sequencing was performed on the RNA from each time point. For each time point, short-read transcript quantification was first done using RSEM and STAR as described above. Relative stability was then computed as the average expression of each transcript at the specified time, relative to its expression at 0 h after actinomycin treatment. For transcript x with average TPM expression x0 at time 0 h and xT at time T h, relative stability is given as (3):
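A minimal sketch of this computation is given below, assuming that relative stability is the simple ratio xT / x0 of the replicate-averaged TPM values, as the description above suggests. The function name and the handling of transcripts with zero expression at 0 h (returning None) are assumptions made for illustration.

```python
def relative_stability(tpm_t0, tpm_t):
    """Ratio of the mean TPM at time T to the mean TPM at 0 h.

    `tpm_t0` and `tpm_t` are lists of replicate TPM values for one transcript.
    Returns None when the transcript has no expression at 0 h (assumed handling).
    """
    mean_t0 = sum(tpm_t0) / len(tpm_t0)
    mean_t = sum(tpm_t) / len(tpm_t)
    if mean_t0 == 0:
        return None
    return mean_t / mean_t0


# Example: a transcript averaging 20 TPM at 0 h and 5 TPM at 8 h
print(relative_stability([10.0, 30.0], [5.0, 5.0]))
```

Under this assumed definition, values near 1 indicate a transcript whose abundance is largely maintained after transcription is blocked, and values near 0 indicate rapid decay.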
Bin analyses of TE enrichment in ranked transcripts
TE enrichment across bins of ranked transcripts was computed for transcript stability and sub-cellular localization. For each transcript, relative stability or the relative concentration index (RCI) was first calculated as described earlier. The transcripts were then ranked in ascending order and divided into 20 bins. Given a total of N transcripts with detectable expression in the assembly, of which N_TE contain the specific TE of interest, and a bin of n transcripts, of which n_TE contain the specific TE, the TE enrichment for the bin is computed as (4):
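The binning procedure can be sketched as follows, assuming that the enrichment of a bin is the fraction of TE-containing transcripts in the bin divided by the fraction in the whole assembly, that is (n_TE / n) / (N_TE / N). This form is an assumption inferred from the definitions above, and the function name and input layout are hypothetical.

```python
def bin_te_enrichment(ranked_has_te, n_bins=20):
    """Per-bin TE enrichment for transcripts ranked in ascending order.

    `ranked_has_te` is a list of booleans, one per transcript, already sorted
    by relative stability or RCI; True marks transcripts containing the TE.
    Returns a list of n_bins values (None for an empty bin).
    """
    total = len(ranked_has_te)
    total_te = sum(ranked_has_te)
    if total == 0 or total_te == 0:
        raise ValueError("need at least one transcript containing the TE")
    background = total_te / total  # N_TE / N

    enrichments = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes as equal as possible.
        start = b * total // n_bins
        end = (b + 1) * total // n_bins
        members = ranked_has_te[start:end]
        if not members:
            enrichments.append(None)
            continue
        in_bin = sum(members) / len(members)  # n_TE / n
        enrichments.append(in_bin / background)
    return enrichments
```

With this assumed definition, a value above 1 means the TE is overrepresented in the bin relative to the assembly as a whole, and a value below 1 means it is depleted.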
Statistics & reproducibility
All statistical tests were conducted using SciPy in Python 3. Average log2-transformed values were compared with the t-test. Median values were compared with the Mann-Whitney U test. The Kruskal-Wallis test was used to check the homogeneity of median values across groups. Fisher's exact test was used for 2 × 2 contingency tables, and chi-square tests were used for distributions with more than two categories. In general, three biological replicates were generated, although long-read sequencing was performed in biological duplicate for the endoderm, mesoderm and ectoderm cell types. Some samples failed quality control and were omitted from the analysis; in such cases, only two biological replicates may be available. The number of biological replicates was chosen based on cost versus benefit and is typical for the current state of the field. No statistical method was used to predetermine sample size.
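The tests listed above map onto functions in scipy.stats as sketched below. The helper names are hypothetical, and the specific options (for example the default equal-variance t-test and two-sided alternatives) are assumptions, since the exact settings are not stated here.

```python
import numpy as np
from scipy import stats


def compare_two_groups(a, b):
    """t-test on log2 values and Mann-Whitney U test on raw values."""
    t_res = stats.ttest_ind(np.log2(a), np.log2(b))  # equal-variance default (assumed)
    u_res = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {"t_test_p": t_res.pvalue, "mann_whitney_p": u_res.pvalue}


def compare_many_groups(*groups):
    """Kruskal-Wallis test across two or more groups."""
    return stats.kruskal(*groups).pvalue


def compare_counts(table):
    """Fisher's exact test for 2 x 2 tables, chi-square test otherwise."""
    table = np.asarray(table)
    if table.shape == (2, 2):
        _, p_value = stats.fisher_exact(table)
        return "fisher_exact", p_value
    _, p_value, _, _ = stats.chi2_contingency(table)
    return "chi_square", p_value


# Illustrative values only, not data from this study
print(compare_two_groups([1, 2, 4, 8], [16, 32, 64, 128]))
print(compare_counts([[8, 2], [1, 5]]))
```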
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The sequencing data generated in this study have been deposited in the Gene Expression Omnibus under accession codes GSE269270, GSE269272, GSE269273, and GSE269274. Source data are provided with this paper.
Code availability
Python scripts used in the analyses and the Seurat R commands used for the scRNA analyses are deposited on GitHub (https://github.com/iababarinde/Transposable-Elements-Human-Gastrulation).
References
Hutchins, A. P. & Pei, D. Transposable elements at the center of the crossroads between embryogenesis, embryonic stem cells, reprogramming, and long non-coding RNAs. Sci. Bull. (Beijing) 60, 1722–1733 (2015).
Ma, G., Babarinde, I. A., Zhou, X. & Hutchins, A. P. Transposable elements in pluripotent stem cells and human disease. Front Genet 13, 902541 (2022).
Biémont, C. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics 186, 1085–1093 (2010).
Bourque, G. et al. Ten things you should know about transposable elements. Genome Biol. 19, 199 (2018).
Osmanski, A. B. et al. Insights into mammalian TE diversity through the curation of 248 genome assemblies. Science 380, eabn1430 (2023).
Naville, M. et al. Not so bad after all: retroviruses and long terminal repeat retrotransposons as a source of new genes in vertebrates. Clin. Microbiol. Infect. 22, 312–323 (2016).
Kim, Y.-J., Lee, J. & Han, K. Transposable elements: no more ‘Junk DNA’. Genomics Inf. 10, 226–233 (2012).
Babarinde, I. A. & Saitou, N. Tempo and mode of conserved noncoding sequence evolution among four mammalian orders. Genome Biol. Evol. 5, 2330–2343 (2013).
Wang, J., Huang, J. & Shi, G. Retrotransposons in pluripotent stem cells. Cell Regen. 9, 4 (2020).
Wang, J. et al. Primate-specific endogenous retrovirus-driven transcription defines naive-like stem cells. Nature 516, 405–409 (2014).
Elbarbary, R. A., Lucas, B. A. & Maquat, L. E. Retrotransposons as regulators of gene expression. Science 351, aac7247 (2016).
Babarinde, I. A. et al. Transposable element sequence fragments incorporated into coding and noncoding transcripts modulate the transcriptome of human pluripotent stem cells. Nucleic Acids Res 49, 9132–9153 (2021).
Fueyo, R., Judd, J., Feschotte, C. & Wysocka, J. Roles of transposable elements in the regulation of mammalian transcription. Nat. Rev. Mol. Cell Biol. 23, 481–497 (2022).
Farré, D., Engel, P. & Angulo, A. Novel role of 3’UTR-Embedded Alu Elements as facilitators of processed pseudogene genesis and host gene capture by viral genomes. PLoS One 11, e0169196 (2016).
Gong, C. & Maquat, L. E. lncRNAs transactivate STAU1-mediated mRNA decay by duplexing with 3’ UTRs via Alu elements. Nature 470, 284–288 (2011).
Raviram, R. et al. Analysis of 3D genomic interactions identifies candidate host genes that transposable elements potentially regulate. Genome Biol. 19, 216 (2018).
Sun, L., Fu, X., Ma, G. & Hutchins, A. P. Chromatin and epigenetic rearrangements in embryonic stem cell fate transitions. Front Cell Dev. Biol. 9, 174 (2021).
Garcia-Perez, J. L., Widmann, T. J. & Adams, I. R. The impact of transposable elements on mammalian development. Development 143, 4101–4114 (2016).
Hackett, J. A., Kobayashi, T., Dietmann, S. & Surani, M. A. Activation of lineage regulators and transposable elements across a pluripotent spectrum. Stem Cell Rep. 8, 1645–1658 (2017).
Criscione, S. W., Zhang, Y., Thompson, W., Sedivy, J. M. & Neretti, N. Transcriptional landscape of repetitive elements in normal and cancer human cells. BMC Genomics 15, 583 (2014).
LaRocca, T. J., Cavalier, A. N. & Wahl, D. Repetitive elements as a transcriptomic marker of aging: evidence in multiple datasets and models. Aging Cell 19, e13167 (2020).
Kruse, K. et al. Transposable elements drive reorganisation of 3D chromatin during early embryogenesis. Preprint at bioRxiv https://doi.org/10.1101/523712 (2019).
Nellåker, C. et al. The genomic landscape shaped by selection on transposable elements across 18 mouse strains. Genome Biol. 13, R45 (2012).
Gray, Y. H. M. It takes two transposons to tango: transposable-element-mediated chromosomal rearrangements. Trends Genet. 16, 461–468 (2000).
Chishima, T. et al. Identification of transposable elements contributing to tissue-specific expression of long non-coding RNAs. Genes (Basel) 9, 23 (2018).
Berrens, R. V. et al. Locus-specific expression of transposable elements in single cells with CELLO-seq. Nat. Biotechnol. 40, 546–554 (2022).
Babarinde, I. A., Li, Y. & Hutchins, A. P. Computational methods for mapping, assembly and quantification for coding and non-coding transcripts. Comput Struct. Biotechnol. J. 17, 628–637 (2019).
Babarinde, I. A. & Hutchins, A. P. The effects of sequencing depth on the assembly of coding and noncoding transcripts in the human genome. BMC Genomics 23, 487 (2022).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput Biol. 18, e1009730 (2022).
Lagarde, J. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat. Genet 49, 1731–1740 (2017).
He, J. et al. Identifying transposable element expression dynamics and heterogeneity during development at the single-cell level with a processing pipeline scTE. Nat. Commun. 12, 1456 (2021).
Rodríguez-Quiroz, R. & Valdebenito-Maturana, B. SoloTE for improved analysis of transposable elements in single-cell RNA-Seq data using locus-specific expression. Commun. Biol. 5, 1063 (2022).
Shao, W. & Wang, T. Transcript assembly improves expression quantification of transposable elements in single-cell RNA-seq data. Genome Res 31, 88–100 (2021).
Bayega, A., Fahiminiya, S., Oikonomopoulos, S. & Ragoussis, J. Current and future methods for mRNA analysis: a drive toward single molecule sequencing. In Methods in Mol. Biol. 1783, 209–241 (Humana Press, New York, NY, 2018).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).
Chen, Y., Zhang, Y., Wang, A. Y., Gao, M. & Chong, Z. Accurate long-read de novo assembly evaluation with inspector. Genome Biol. 22, 12 (2021).
Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
Ono, Y., Hamada, M. & Asai, K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom. Bioinform. 4, lqac092 (2022).
Nip, K. M. et al. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nat. Commun. 14, 2940 (2023).
Pardo-Palacios, F. J. et al. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat. Methods 21, 1349–1363 (2024).
Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).
Abugessaisa, I. et al. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci. Data 4, 170107 (2017).
Tsankov, A. M. et al. Transcription factor binding dynamics during human ES cell differentiation. Nature 518, 344–349 (2015).
Dobner, J., Diecke, S., Krutmann, J., Prigione, A. & Rossi, A. Reassessment of marker genes in human induced pluripotent stem cells for enhanced quality control. Nat. Commun. 15, 8547 (2024).
Chu, L. F. et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 17, 173 (2016).
Tyser, R. C. V. et al. Single-cell transcriptomic characterization of a gastrulating human embryo. Nature 600, 285–289 (2021).
Rhodes, K. et al. Human embryoid bodies as a novel system for genomic studies of functionally diverse cell types. Elife 11, e71361 (2022).
Tchieu, J. et al. A modular platform for differentiation of human PSCs into all major ectodermal lineages. Cell Stem Cell 21, 399–410.e7 (2017).
Wang, L., Wang, S. & Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics 28, 2184–2185 (2012).
Wucher, V. et al. FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res 45, gkw1306 (2017).
Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res 51, D942–D949 (2023).
Wheeler, T. J. & Eddy, S. R. Nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013).
Osteil, P. et al. MIXL1 activation in endoderm differentiation of human induced pluripotent stem cells. Stem Cell Rep. 20, 102482 (2025).
Li, J. et al. TGF-β2 and TGF-β1 differentially regulate the odontogenic and osteogenic differentiation of mesenchymal stem cells. Arch. Oral. Biol. 135, 105357 (2022).
Tang, Y. et al. TGFBR3 dependent mechanism of TGFB2 in smooth muscle cell differentiation and implications for TGFB2-related aortic aneurysm. Stem Cells Transl Med 14, (2025).
Yang, J. et al. GATA6-AS1 Regulates GATA6 Expression to Modulate Human Endoderm Differentiation. Stem Cell Rep. 15, 694–705 (2020).
Yasui, R., Matsui, A., Sekine, K., Okamoto, S. & Taniguchi, H. Highly Sensitive Detection of Human Pluripotent Stem Cells by Loop-Mediated Isothermal Amplification. Stem Cell Rev. Rep. 18, 2995–3007 (2022).
Sekine, K. et al. Robust detection of undifferentiated iPSC among differentiated cells. Sci. Rep. 10, 10293 (2020).
Wu, P. R. et al. Wdr4 promotes cerebellar development and locomotion through Arhgap17-mediated Rac1 activation. Cell Death Dis. 14, 1–16 (2023).
Goos, J. A. C. et al. Identification of causative variants in TXNL4A in Burn-McKeown syndrome and isolated choanal atresia. Eur. J. Hum. Genet. 25, 1126–1133 (2017).
Yao, J., Gaffaney, J. D., Kwon, S. E. & Chapman, E. R. Doc2 Is a Ca2+ Sensor Required for Asynchronous Neurotransmitter Release. Cell 147, 666–677 (2011).
Xue, M., Giagtzoglou, N. & Bellen, H. J. Dueling Ca2+ Sensors in Neurotransmitter Release. Cell 147, 491–493 (2011).
Pang, Z. P. et al. Doc2 Supports Spontaneous Synaptic Transmission by a Ca2+-Independent Mechanism. Neuron 70, 244–251 (2011).
Sakashita, A. et al. Transcription of MERVL retrotransposons is required for preimplantation embryo development. Nat. Genet 55, 484–495 (2023).
Modzelewski, A. J. et al. A mouse-specific retrotransposon drives a conserved Cdk2ap1 isoform essential for development. Cell 184, 5541–5558.e22 (2021).
Cabili, M. N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
Santoni, F. A., Guerra, J. & Luban, J. HERV-H RNA is abundant in human embryonic stem cells and a precise marker for pluripotency. Retrovirology 9, 111 (2012).
Lu, X. et al. The retrovirus HERVH is a long noncoding RNA required for human embryonic stem cell identity. Nat. Struct. Mol. Biol. 21, 423–425 (2014).
Zhang, Y. et al. Transcriptionally active HERV-H retrotransposons demarcate topologically associating domains in human pluripotent stem cells. Nat. Genet 51, 1380–1388 (2019).
Mattick, J. S. et al. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 24, 430–447 (2023).
Harrow, J. et al. GENCODE: The reference human genome annotation for the ENCODE project. Genome Res 22, 1760–1774 (2012).
Billon, V. & Cristofari, G. Nascent RNA m6A modification at the heart of the gene–retrotransposon conflict. Cell Res 31, 829–831 (2021).
Liu, J. et al. The RNA m6A reader YTHDC1 silences retrotransposons and guards ES cell identity. Nature 591, 322–326 (2021).
Chen, C. et al. Nuclear m6A reader YTHDC1 regulates the scaffold function of LINE1 RNA in mouse ESCs and early embryos. Protein Cell 12, 455–474 (2021).
Leclair, N. K. et al. Poison exon splicing regulates a coordinated network of SR protein expression during differentiation and tumorigenesis. Mol. Cell 80, 648–665.e9 (2020).
Kini, H. K., Kong, J. & Liebhaber, S. A. Cytoplasmic Poly(A) binding protein C4 serves a critical role in erythroid differentiation. Mol. Cell Biol. 34, 1300–1309 (2014).
Byun, J. S. et al. The transcription factor PITX1 drives astrocyte differentiation by regulating the SOX9 gene. J. Biol. Chem. 295, 13677–13690 (2020).
Joshi, S. et al. TEAD transcription factors are required for normal primary myoblast differentiation in vitro and muscle regeneration in vivo. PLoS Genet 13, e1006600 (2017).
Yuan, Y. et al. YAP1/TAZ-TEAD transcriptional networks maintain skin homeostasis by regulating cell proliferation and limiting KLF4 activity. Nat. Commun. 11, 1472 (2020).
Carlevaro-Fita, J. et al. Ancient exapted transposable elements promote nuclear enrichment of human long noncoding RNAs. Genome Res 29, 208–222 (2019).
Lubelsky, Y. & Ulitsky, I. Sequences enriched in Alu repeats drive nuclear localization of long RNAs in human cells. Nature 555, 107–111 (2018).
Mas-Ponte, D. et al. LncATLAS database for subcellular localization of long noncoding RNAs. RNA 23, 1080–1087 (2017).
Yin, Y. et al. U1 snRNP regulates chromatin retention of noncoding RNAs. Nature 580, 147–150 (2020).
Mishra, K. & Kanduri, C. Understanding Long Noncoding RNA and Chromatin Interactions: What We Know So Far. Non-Coding RNA 5, 54 (2019).
Mustafin, R. N. The role of transposable elements in the differentiation of stem cells. Mol. Genet. Microbiol. Virol. (Russian version) 34, 67–74 (2019).
Castro-Diaz, N. et al. Evolutionally dynamic L1 regulation in embryonic stem cells. Genes Dev. 28, 1397–1409 (2014).
Pereira Fernandes, D., Bitar, M., Jacobs, F. M. J. & Barry, G. Long Non-Coding RNAs in Neuronal Aging. Noncoding RNA 4, 12 (2018).
Reilly, M. T., Faulkner, G. J., Dubnau, J., Ponomarev, I. & Gage, F. H. The role of transposable elements in health and diseases of the central nervous system. J. Neurosci. 33, 17577–17586 (2013).
Lapp, H. E. & Hunter, R. G. Early life exposures, neurodevelopmental disorders, and transposable elements. Neurobiol. Stress 11, 100174 (2019).
Kelley, D. & Rinn, J. Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome Biol. 13, R107 (2012).
Kelley, D. R., Hendrickson, D. G., Tenen, D. & Rinn, J. L. Transposable elements modulate human RNA abundance and splicing via specific RNA-protein interactions. Genome Biol. 15, 537 (2014).
Sexton, C. E., Tillett, R. L. & Han, M. V. The essential but enigmatic regulatory role of HERVH in pluripotency. Trends in Genetics. 38, https://doi.org/10.1016/j.tig.2021.07.007 (2022).
Göke, J. et al. Dynamic transcription of distinct classes of endogenous retroviral elements marks specific populations of early human embryonic cells. Cell Stem Cell 16, 135–141 (2015).
Zhang, R. et al. A primate-specific endogenous retroviral envelope protein sequesters SFRP2 to regulate human cardiomyocyte development. Cell Stem Cell https://doi.org/10.1016/j.stem.2024.07.006 (2024).
Sekine, K., Onoguchi, M. & Hamada, M. Transposons contribute to the acquisition of cell type-specific cis-elements in the brain. Commun. Biol. 6, 631 (2023).
Mustafin, R. N. & Khusnutdinova, E. K. Involvement of transposable elements in neurogenesis. Vavilovskii Zhurnal Genetiki i Selektsii 24, https://doi.org/10.18699/VJ20.613 (2020).
Xie, B. et al. A two-step lineage reprogramming strategy to generate functionally competent human hepatocytes from fibroblasts. Cell Res 29, 696–710 (2019).
Wang, Q. et al. Generation of human hepatocytes from extended pluripotent stem cells. Cell Research 30, https://doi.org/10.1038/s41422-020-0293-x (2020).
Gertz, J. et al. Distinct Properties of Cell-Type-Specific and Shared Transcription Factor Binding Sites. Mol. Cell 52, 25–36 (2013).
Poletti, V. et al. Genome-wide definition of promoter and enhancer usage during neural induction of human embryonic stem cells. PLoS One 10, e0126590 (2015).
Cheng, L. C. et al. Widespread transcript shortening through alternative polyadenylation in secretory cell differentiation. Nat. Commun. 11, 3182 (2020).
Wang, R., Zheng, D., Yehia, G. & Tian, B. A compendium of conserved cleavage and polyadenylation events in mammalian genes. Genome Res 28, 1427–1441 (2018).
Dixon, G. et al. QSER1 protects DNA methylation valleys from de novo methylation. Science 372, (2021).
Ai, Z. et al. Krüppel-like factor 5 rewires NANOG regulatory network to activate human naive pluripotency specific LTR7Ys and promote naive pluripotency. Cell Rep. 40, 111240 (2022).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Pertea, G. & Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Res 9, 304 (2020).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121 (2010).
Nassar, L. R. et al. The UCSC genome browser database: 2023 update. Nucleic Acids Res 51, D1188–D1195 (2023).
Crooks, G. E., Hon, G., Chandonia, J. M. & Brenner, S. E. WebLogo: a sequence logo generator. Genome Res 14, 1188–1190 (2004).
Schneider, T. D. & Stephens, R. M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097–6100 (1990).
Trincado, J. L. et al. SUPPA2: fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions. Genome Biol. 19, 40 (2018).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 44, W160–W165 (2016).
Altschul, S. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402 (1997).
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res 49, D412–D419 (2021).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644–652 (2011).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput Biol 7, e1002195 (2011).
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 12, 2 (2021).
El Amrani, K., Alanis-Lobato, G., Mah, N., Kurtz, A. & Andrade-Navarro, M. A. Detection of condition-specific marker genes from RNA-seq data with MGFR. PeerJ 7, e6970 (2019).
Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-Content Normalization for RNA-Seq Data. BMC Bioinforma. 12, 480 (2011).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Dobin, A. & Gingeras, T. R. Mapping RNA-seq Reads with STAR. Curr. Protoc. Bioinforma. 51, 11.14.1–19 (2015).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinforma. 12, 323 (2011).
Harrison, P. W. et al. Ensembl 2024. Nucleic Acids Res 52, 495–502 (2024).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. https://doi.org/10.1038/nbt.1754 (2011).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, (2021).
Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, (2015).
Mayer, A. & Churchman, L. S. A detailed protocol for subcellular RNA sequencing (subRNA-seq). Curr. Protoc. Mol. Biol. 120, 4.29.1–4.29.18 (2017).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
McLeay, R. C. & Bailey, T. L. Motif enrichment analysis: a unified framework and an evaluation on ChIP data. BMC Bioinforma. 11, 165 (2010).
Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME suite. Nucleic Acids Res. 43, W39–W49 (2015).
Ray, D. et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 499, 172–177 (2013).
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
Dyer, S. C. et al. Ensembl 2025. Nucleic Acids Res 53, D948–D957 (2025).
Acknowledgements
We acknowledge the assistance of SUSTech Core Research Facilities. Funding was from the National Natural Science Foundation of China (32270597), the Science Technology and Innovation Commission of Shenzhen (RCBS20221008093109033), and the Guangdong Basic and Applied Basic Research Foundation (2023A1515111170). A.R.’s laboratory is funded by the Russian Science Foundation (22-65-00022).
Author information
Contributions
I.A.B. planned and designed the study, performed most of the bioinformatic analyses, drafted the paper, and coordinated the experimental work. X.F. and G.M. performed the experiments, assisted by J.X., Y.Q., and Z.X. Y.L., M.T.A., Z.Liang, Z.Lin, and X.Z. assisted with analysis. K.O. and A.R. interpreted data and revised the manuscript. A.P.H. designed the study, revised the manuscript, and supervised and funded the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Andrew Modzelewski, who co-reviewed with Youjia Guo, and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Babarinde, I.A., Fu, X., Ma, G. et al. Transposable element expression and sub-cellular dynamics during hPSC differentiation to endoderm, mesoderm, and ectoderm lineages. Nat Commun 16, 7670 (2025). https://doi.org/10.1038/s41467-025-63080-3
DOI: https://doi.org/10.1038/s41467-025-63080-3