Main

Repetitive elements (REs) are DNA sequences found in high copy number in the genome1. Transposable elements (TEs), or selfish REs, are REs that have the ability to copy themselves and move to new genomic locations, either directly as DNA (DNA transposons) or through an RNA intermediate that is reverse-transcribed (long interspersed nuclear elements (LINEs); short interspersed nuclear elements (SINEs); long terminal repeat (LTR) retrotransposons). The selfish replication of TEs has led them to occupy a large portion of genomes (around 40% in mammals). TE activity can cause cellular damage through insertional mutagenesis2, generation of double-stranded DNA breaks2, stimulation of inflammatory pathways3,4,5,6,7 and disruption of gene regulatory networks8. Because of their pathogenic potential, TEs are kept under tight control by the host with multiple regulatory layers9. TEs are recognized by transcription factors (TFs), notably KRAB-zinc fingers, and PIWI-interacting RNAs (piRNAs), leading to silencing by H3K9 trimethylation and DNA 5-cytosine methylation (5mC)10. Notably, other rarer forms of DNA methylation, such as 5-hydroxymethylcytosine (5hmC) and N6-methyladenine (6mA), are involved in transcriptional regulation and are associated with gene expression11,12. However, for the rest of this study, we will focus on 5mC. DNA methylation patterns are established in bulk during development and are then largely maintained throughout the lifespan, although de novo methylation and active demethylation still occur13. Previous studies in multiple organisms and tissues found that methylation patterns undergo a slow drift during aging, with many normally hypermethylated regions becoming less repressed14,15,16. At the same time, TEs are activated during aging in invertebrates, mice, human senescent cells and certain cancers2,7,17,18.
Thus, it is possible that age-related alterations of DNA methylation could have a role in TE activation in humans, as has been observed in other species and during development11,19,20.

Aging clocks are statistical models trained to predict age and age-related phenotypes, including time to death21. In addition to predicting the age of samples of unknown age, for example in forensics, aging clocks have been used to study health conditions, lifestyles, and genetic or pharmacological treatments that alter an organism’s biological age. Typically, age predictions are based on omic data types, including gene expression22,23, protein abundance24, chromatin accessibility25 and, most commonly, DNA methylation26,27,28,29,30,31. One common criticism of aging clocks deals with the difficulty in interpreting the biological meaning of observed changes in DNA methylation patterns. One strategy previously used to improve clock interpretability is to group clock CpGs into different modules corresponding to different biological processes32,33.

In this study, we explored the use of TE methylation as a biomarker of age and disease. First, we reanalyzed public human blood methylation data to determine the trajectory of TE methylation during aging, comparing evolutionarily young and old TEs. We then constructed age predictors for mice and humans. Lastly, we investigated associations between accelerated age prediction, and more generally loss of methylation at TEs, and disease.

Results

Data description

To investigate changes in RE methylation that occur during aging, we collected publicly available human blood methylation array data (GSE40279: n = 656; GSE64495: n = 113; GSE87648: n = 382; GSE147221: n = 679; GSE157131: n = 1,218). Later, we investigate the association between TE methylation and disease using the Women’s Health Initiative (WHI) BA23 dataset (n = 2,175). The characteristics of the datasets are summarized in Fig. 1a. All datasets were generated with the Illumina Infinium 450k array, which measures methylation at 485,578 CpGs. We annotated array CpGs based on the type of RE and genic region (exon, intron, promoter, 5′ UTR, 3′ UTR, intergenic) they lay within. Array CpGs were generally biased to genic regions, whereas complex repeats were generally found in intergenic regions or introns (Extended Data Fig. 1a). Nonetheless, 69,426 CpGs were contained within REs, mainly LINEs, SINEs, LTR retrotransposons, DNA transposons and simple repeats (Extended Data Fig. 1b). While most RE CpGs were primarily intergenic and intronic (Extended Data Fig. 1b), simple repeats and low-complexity regions were predominantly found in promoters. It should be noted that because methylation array CpGs are not a random selection of all CpGs, but are instead biased to certain regions, array data may not be generalizable to the whole genome.

Fig. 1: Transposons and particularly young L1s are biased toward losing methylation during aging.

a, Public human blood DNA methylation datasets and age distributions. b, Youthful methylation level and age-related drift of CpGs inside and outside of repetitive elements. c, Methylation drift rate of CpGs grouped according to major repeat class. Selfish (transposons) and nonselfish repeats were grouped separately. Only RE classes represented by 100 or more CpGs in the Infinium array are shown. d, Methylation drift rate of CpGs in L1s, grouped according to family and sorted according to average sequence length, a proxy of evolutionary age. e, Methylation drift rate of CpGs in LTR retrotransposons, grouped according to family and sorted according to average sequence length. d,e, Only families represented by 40 or more CpGs in the Infinium array are shown. c–e, Boxes show the median, and the 25th and 75th percentiles. The whiskers extend to the 25th and 75th percentiles ± 1.5 times the interquartile range (IQR). Points outside the whiskers are not shown.

TEs, especially young L1s, become demethylated during aging

First, we investigated the age dynamics of RE and non-RE CpGs. We used limma34 to fit linear regression models to the methylation levels of all array CpGs, including age, sex and the study of origin as independent variables (Supplementary Data 1). Patients with reported health conditions in the original studies were not included in the analysis to initially focus on RE methylation changes that are associated with aging rather than disease. RE CpGs were hypermethylated in young individuals (20 years old), but were more likely to have decreased methylation in older individuals compared to non-RE CpGs (Fig. 1b). When investigating different classes of REs individually, we found that TEs (LINEs, SINEs, LTR retrotransposons, DNA transposons, SINE-VNTR-Alu (SVA) retrotransposons) were much more prone to losing methylation than nonselfish REs (tRNA, rRNA, satellites, simple repeats, low-complexity regions; Wilcoxon rank-sum test P < 2.2 × 10−16; Fig. 1c). We initially focused on L1s because they are the only TEs known to be active and autonomous in humans35. Therefore, de-repression of L1s could be sufficient to cause cellular damage. Fortunately, most L1 copies are truncated or have mutated over evolutionary timescales and are thus inactive36,37. Conversely, competent, evolutionarily young L1 copies are closer to 6,000-bp long. We found an association between the average copy length of L1 families and their propensity to become demethylated with age (Spearman’s r = -0.28, P = 0.0012; Fig. 1d). The L1 families with the most consistent methylation loss across measured CpGs were L1HS, L1PA2, L1PA3 and L1PA4, which are the four youngest L1 families present in the human genome36. Middle-aged, partially truncated L1s, such as L1PB1, L1PA15 and L1PA16, showed intermediate rates of methylation loss. Older, highly truncated families were also generally prone to methylation loss, but to a much smaller extent. 
We wondered if this discrepancy in drift rate could be due to representation of different parts of the L1 sequence in the array. Specifically, if the regions most affected by methylation loss are found on the L1 5′ end, the truncation commonly observed in older families could explain why older L1s may have slower drift, as they would often lack 5′ ends. We aligned CpG probe sequences to the consensus of several representative L1 families and found that indeed most young L1 array CpGs were located in the 5′ UTR, whereas older L1 CpGs were more spread throughout the L1 sequence, with a 3′ bias (Extended Data Fig. 2). However, 5′ UTR CpGs did not lose methylation at a faster rate than CpGs found in the rest of the element; thus, differences in CpG localization between young and old L1 elements do not explain differences in methylation drift rate. We then investigated other TE classes. Among LTR retrotransposons, the THE1A and THE1C families showed the fastest methylation loss (Fig. 1e). While not retrotransposition-competent, de-repression of these families was found to drive expression of oncogenes8. Most SINEs, SVAs and DNA transposons were also biased toward losing methylation during aging, but the median drift rate was relatively small and no particular family stood out (Extended Data Fig. 3a–c).
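As an illustration of the per-CpG drift estimate described in this section, the following minimal sketch fits methylation ~ age + sex + study for one CpG by ordinary least squares and returns the age coefficient. It is a plain OLS stand-in and not the limma workflow used in the study (limma additionally moderates the test statistics); the function names and the toy values are hypothetical.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(design: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def age_coefficient(meth, age, is_male, study) -> float:
    """Age slope for one CpG, adjusting for sex and study of origin (dummy-coded)."""
    levels = sorted(set(study))[1:]  # the first study acts as the reference level
    design = [
        [1.0, float(a), float(s)] + [1.0 if st == lv else 0.0 for lv in levels]
        for a, s, st in zip(age, is_male, study)
    ]
    return ols(design, meth)[1]


# Toy data (invented): a CpG losing 0.2% methylation per year, with sex and study offsets.
age = [20, 30, 40, 50, 60, 70, 80, 25]
is_male = [0, 1, 0, 1, 1, 0, 0, 1]
study = ["A", "A", "B", "B", "A", "B", "A", "B"]
meth = [0.9 - 0.002 * a + 0.01 * s + (0.05 if st == "B" else 0.0)
        for a, s, st in zip(age, is_male, study)]
drift = age_coefficient(meth, age, is_male, study)
```

On this noise-free toy input the recovered drift is approximately -0.002 per year; a negative coefficient corresponds to age-related methylation loss.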

Young L1s gain accessibility but not expression during aging

We asked if the age-related changes in DNA methylation we observed in TEs and other REs were associated with coherent changes in chromatin accessibility and expression. We used matched RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin with sequencing (ATAC–seq) data from 159 human peripheral blood mononuclear cell (PBMC) samples (GSE193142, n = 155 and n = 143 after quality control for RNA-seq and ATAC–seq, respectively; Supplementary Data 3)25. Chromatin accessibility of TEs followed trends similar to those of DNA methylation: most LINE, LTR, SVA and DNA transposon families showed increased accessibility during aging (Extended Data Fig. 4b). We also observed a stronger gain in accessibility at young L1s compared to old, and age-related changes in L1 family methylation and accessibility were negatively correlated (Pearson’s r = -0.443, P < 0.001; Extended Data Fig. 4c). Interestingly, tRNAs and rRNAs showed a decline in accessibility with age, whereas they had not shown particular gains in DNA methylation. TE expression, on the other hand, did not follow the trends seen for DNA methylation or chromatin accessibility (Extended Data Fig. 4a). L1 families in general showed a mild increase in expression during aging, but the youngest L1 families did not increase in expression (Extended Data Fig. 4c). Similar findings were recently reported, where inducing large-scale demethylation using 5-aza-2′-deoxycytidine only led to expression of few L1HS loci, with refractory loci being primarily located in heterochromatic regions38. However, in our findings, we also saw increased accessibility at young L1s, which indicates heterochromatin relaxation at these elements. Perhaps, then, expression of these young L1s is prevented by RNA silencing pathways, which are crucial in TE repression.

Demethylation of young L1s outpaces passive methylation loss

The difference in demethylation rate between young and old L1s could indicate that they become de-repressed by different means: de-repression of old L1s may be a result of global age-related methylation loss, which has been previously documented and is often attributed to imperfect maintenance of methylation marks by DNA methyltransferase 1 (DNMT1)15. Conversely, young L1s may be actively de-repressed by recruiting activating TFs at their 5′ UTR39. Alternatively, this discrepancy may be explained by differences between the CpG landscape of young and old L1 families. For example, young L1s have a higher CpG density, which is gradually lost over evolutionary timescales because of C>T mutations40, and CpG density affects the rate of passive methylation loss41,42. Additionally, the initial (after development) level of CpG methylation may affect the methylation drift rate simply because highly polarized states (for example, fully methylated) can only lose methylation, while intermediate methylation states can both gain and lose methylation during aging. Thus, we modeled the average methylation drift rate of CpGs based on local CpG density, youthful methylation level and the interaction of the two (Fig. 2a,b). This model explained 24.7% of age coefficient variation and confirmed previous reports that low CpG density is associated with age-related methylation loss. Hypomethylated CpGs (<20% methylated) were more likely to gain methylation during aging; however, hypermethylated CpGs (>80% methylated) were not particularly biased toward methylation loss. We then adjusted the previously calculated age coefficients with this information (Supplementary Data 1). These adjusted age coefficients should be interpreted as ‘the age drift rate of a given CpG, compared to what would be expected from the average CpG with the same local CpG density and youthful methylation level’. 
The adjusted coefficients of most TE families of all five major classes were close to zero or even slightly positive, meaning that their age-related methylation loss could be explained by the local CpG context and youthful methylation state and is probably a passive phenomenon (Fig. 2c,d and Extended Data Fig. 3d–f). Conversely, L1HS, L1PA2, L1PA3 and L1PA4 retained a higher-than-expected rate of methylation loss, reinforcing the hypothesis that their de-repression may be, at least in part, an active process.
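The adjustment described above can be sketched as a residualization: each CpG's age coefficient is regressed on local CpG density, youthful methylation level and their interaction, and the residual is kept as the adjusted coefficient. The sketch below assumes a simple linear model fitted by least squares (implemented here with Gram-Schmidt orthogonalization); the exact model form used in the study may differ, and all names and values are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def ols_residuals(y: Sequence[float], columns: Sequence[Sequence[float]]) -> List[float]:
    """Residuals of y after least-squares projection onto linearly independent columns."""
    basis: List[List[float]] = []
    for col in columns:
        v = [float(c) for c in col]
        for q in basis:  # make v orthogonal to the predictors already processed
            scale = dot(v, q) / dot(q, q)
            v = [vi - scale * qi for vi, qi in zip(v, q)]
        basis.append(v)
    resid = [float(v) for v in y]
    for q in basis:  # remove the component of y along each orthogonal direction
        scale = dot(resid, q) / dot(q, q)
        resid = [ri - scale * qi for ri, qi in zip(resid, q)]
    return resid


def adjust_age_coefficients(age_coef, cpg_density, youthful_meth) -> List[float]:
    """Age drift relative to what density, youthful methylation and their interaction predict."""
    intercept = [1.0] * len(age_coef)
    interaction = [d * m for d, m in zip(cpg_density, youthful_meth)]
    return ols_residuals(age_coef, [intercept, cpg_density, youthful_meth, interaction])


# Toy data (invented): nine CpGs follow the expected trend exactly, the first one
# loses methylation faster than its CpG context predicts.
density = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
youth = [0.9, 0.8, 0.2, 0.5, 0.7, 0.1, 0.95, 0.3, 0.6, 0.4]
observed = [-0.003 + 0.001 * d - 0.002 * m + 0.0005 * d * m for d, m in zip(density, youth)]
observed[0] -= 0.01
adjusted = adjust_age_coefficients(observed, density, youth)
```

In this toy case the first CpG keeps a negative adjusted coefficient, which is the pattern reported here for L1HS, L1PA2, L1PA3 and L1PA4.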

Fig. 2: Age drift of TE CpGs compared to what is expected based on CpG density and youthful methylation level.

a, Trends of methylation drift based on youthful methylation levels. b, Trends of methylation drift based on local CpG density. c,d, Age coefficient of methylation at LINEs (c) and LTR retrotransposons (d) after adjustment for CpG density and youthful methylation level. Only families represented by 40 or more CpGs in the Infinium array are shown. e, TF binding motifs enriched at young L1s and associated with increased or decreased methylation drift rate in young L1s (two-sided Wilcoxon rank-sum test, P < 0.05 for both adjusted and unadjusted age drift rate coefficient). P values are only shown for comparisons in young L1s; the exact P values are, in order, 0.021, 0.039, 0.028, 0.016. n of CpGs in old L1s: 341; n of CpGs in young L1s: 621, of which 416 are flanked by an ARNTL motif, 589 by an NFKB2 motif, 375 by a FOXO1 motif and 390 by a HIC2 motif. Additional motifs are shown in Extended Data Fig. 5. c–e, Boxes show the median and the 25th and 75th percentiles. The whiskers extend to the 25th and 75th percentiles ± 1.5 times the IQR. Points outside the whiskers are not shown. * P < 0.05.

TF binding motifs associating with differences in methylation drift rate between L1s

Because age-related methylation loss at young L1s could not be fully described as a passive process, we searched for TF binding motifs whose presence or absence may be associated with faster methylation loss at young L1s compared to older L1s (after accounting for CpG density and youthful methylation). Because of the evolutionary arms race between TEs and host, we expected to find that younger L1s may have lost motifs important for repression, such as those of KRAB-zinc fingers, and potentially may have gained new activators43,44. We used simple enrichment analysis (SEA)45 to search for motifs enriched within 250 bp upstream and downstream of CpGs in L1HS, L1PA2, L1PA3 and L1PA4, compared to CpGs in older L1 families. In particular, we selected L1PB1, L1PA15, L1PA16, L1M1, L1M2 and L1MA3 as representative older L1s because while older than L1HS, L1PA2, L1PA3 and L1PA4, they showed the lowest amount of truncation among L1 families adequately represented in the Infinium array (>40 CpGs per family; Fig. 1d). Moreover, L1PA15, L1PA16, L1PB1 and L1MA3 are primate-specific families36. We took this precaution to ensure that the sequences used for motif enrichment would be as complete and as comparable as possible. This yielded 621 sequences in young L1s and 341 in older ones. Motif enrichment using SEA found 620 motifs enriched in young L1s (Supplementary Data 2), including motifs of known regulators of L1 expression (YY1 (ref. 46), SRY47, ZFP57 (ref. 48)). To narrow down this list, we further restricted our analysis to motifs found in the majority of young L1s (>50%) and a minority of old ones (<20%), yielding 103 motifs. These motifs could be either activating or repressive, as both types of motifs could be subject to mutation over evolutionary timescales and thus be absent in older L1s, leading to an enrichment in young L1s compared to old ones.
To infer the regulatory role of these motifs, we further filtered these 103 motifs to ones whose presence was associated with accelerated or attenuated methylation loss at young L1 CpGs. Thus, we considered motifs found at young L1s with decreased methylation loss as putatively repressive; conversely, we considered motifs found at young L1s with accelerated methylation loss as putatively activating. However, no motif fell in the latter category (Fig. 2e and Extended Data Fig. 5a). The putatively repressive set included 20 motifs. Of these, 13 had a similar consensus sequence (GTAACA) belonging to the forkhead box (FOX) TF family. FOXA1 was reported to activate, rather than repress, L1s37; however, given the extreme similarity between the recognition sequences of many FOX family members, we suspect that FOXA1 may not be the sole regulator of L1 activity and that other members of the family may have repressive effects. Although FOXA1 motifs were enriched at young L1 CpGs, young L1 CpGs with FOXA1 motifs did not lose methylation faster than those without. Instead, the presence of motifs for several other factors, such as FOXO1, FOXE1 and FOXL1, was associated with a mildly attenuated methylation drift rate (full results in Supplementary Data 2). The remaining motifs belonged to the TFs NFKB2, NR1I2, AR, RXRA::VDR, hypermethylated in cancer 2 (HIC2), SP5 and BMAL1 (also called ARNTL). NFKB2 is a member of the noncanonical NFKB pathway; a previous study found that NFKB activation by TRIM5α induced L1 promoter silencing49. ARNTL is a circadian regulator recently shown to be an important repressor of L1 (ref. 50). Decreased ARNTL expression leads to an accelerated aging phenotype in human and macaque cells because of stimulation of cytosolic DNA sensing pathways by L1. HIC2 and the highly homologous HIC1 were downregulated in multiple cancer types51,52,53. Interestingly, HIC2 also induced SIRT1 expression54, which itself may be involved in L1 silencing55.
Lastly, we performed motif enrichment directly comparing young L1 copies with faster-than-median methylation drift to those with slower-than-median methylation drift, regardless of enrichment compared to older families. The only motif enriched at ‘fast demethylating’ L1HS CpGs was that of BHLHE22, a factor with no known associations with TE activity (Extended Data Fig. 5c). Two studies investigated its role as a modulator of inflammation, finding an anti-inflammatory effect in prostate cancer, but a pro-inflammatory effect in endometrial cancer56,57. Future efforts may want to examine this apparent contradiction and test whether the pro-inflammatory effect is mediated by L1HS activation. Fast demethylating L1PA2 CpGs showed enrichment for the most TFs, in part because of large representation in the Infinium array (318 CpGs in total; Extended Data Fig. 5b). A notable number of TFs were members of the AP-1 complex (JUN–BATF–FOSL1), which initiates the senescence program and upregulates part of the senescence-associated secretory phenotype58,59. It has also been observed that inflammation in senescent cells is in part driven by L1 and ERV expression18,60. No motifs were enriched at fast demethylating L1PA3 CpGs, whereas motifs enriched at fast demethylating L1PA4 CpGs included RUNX3, a known L1 activator (Extended Data Fig. 5d)61.
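The prevalence filter used to narrow the enriched motifs (present in more than 50% of young L1 sequences and fewer than 20% of old ones) can be sketched as follows. The young L1 counts for ARNTL, NFKB2, FOXO1 and HIC2 and the totals of 621 and 341 sequences are taken from the Fig. 2 legend; the old L1 counts and the two extra motifs (MOTIF_X, MOTIF_Y) are invented purely for illustration, and the function name is hypothetical.

```python
from typing import Dict, List


def filter_motifs(
    young_hits: Dict[str, int],
    old_hits: Dict[str, int],
    n_young: int,
    n_old: int,
    min_young_frac: float = 0.5,
    max_old_frac: float = 0.2,
) -> List[str]:
    """Keep motifs found in a majority of young sequences and a minority of old ones."""
    kept = []
    for motif, count in young_hits.items():
        young_frac = count / n_young
        old_frac = old_hits.get(motif, 0) / n_old  # motifs never seen in old L1s count as 0
        if young_frac > min_young_frac and old_frac < max_old_frac:
            kept.append(motif)
    return sorted(kept)


# Young counts for the four named TFs come from the Fig. 2 legend; everything else is made up.
young_hits = {"ARNTL": 416, "NFKB2": 589, "FOXO1": 375, "HIC2": 390,
              "MOTIF_X": 200, "MOTIF_Y": 500}
old_hits = {"ARNTL": 30, "NFKB2": 50, "FOXO1": 20, "HIC2": 10,
            "MOTIF_X": 5, "MOTIF_Y": 200}
selected = filter_motifs(young_hits, old_hits, n_young=621, n_old=341)
```

Here MOTIF_X is dropped because it occurs in too few young sequences and MOTIF_Y because it is also common in old ones. The subsequent step in the text, testing whether motif presence is associated with drift rate, is a separate rank-sum comparison and is not covered by this sketch.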

TE methylation as an accurate and interpretable biomarker of age

Next, we investigated if the methylation state of TEs could be used to predict chronological age. Thus, we selected CpGs found in TEs (LINEs, SINEs, LTR retrotransposons, DNA transposons, n CpGs = 56,352; Fig. 3a) and trained an elastic net model on a portion of our data (n = 999), leaving out a portion of each dataset (n = 248) and the entirety of GSE64495 (n = 104) as external validation (Fig. 3b,d). The coefficients are available in Supplementary Data 4. This individual CpG TE clock was in both cases highly accurate (root mean square error (RMSE) = 5.58, median absolute error (MAE) = 2.96, r = 0.95 on GSE64495). We compared this performance with other state-of-the-art chronological age clocks and found that the individual CpG TE clock performed better than the Hannum clock and Horvath Pan-tissue clock but worse than the Horvath Skin & Blood clock. Thus, the methylation state of individual CpGs in TEs can be used to construct a remarkably accurate clock. We additionally visualized how many CpGs selected by the individual CpG TE clock were shared with the aforementioned published clocks and their importance in the model (Extended Data Fig. 6b and Supplementary Data 4). In general, there was little overlap between the clocks, probably because of redundancy in the biological information provided by CpGs, and the imposed restriction to use CpGs in TEs only.
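For reference, the three accuracy metrics reported for the clock (RMSE, median absolute error and Pearson's r) can be computed as in the minimal sketch below. The function names and the example ages are hypothetical and do not reproduce the study's data.

```python
import math
import statistics
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and chronological age."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def median_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Median of the absolute prediction errors (robust to a few large misses)."""
    return statistics.median(abs(p - a) for p, a in zip(predicted, actual))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented example: four individuals with small prediction errors.
chronological = [20, 40, 60, 80]
predicted = [22, 38, 63, 79]
metrics = (
    rmse(predicted, chronological),
    median_absolute_error(predicted, chronological),
    pearson_r(predicted, chronological),
)
```

RMSE penalizes large errors more strongly than the median absolute error, which is one reason the two can differ noticeably on the same validation set.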

Fig. 3: Construction of age biomarkers based on the methylation of individual CpGs within TEs and genome-wide TE family methylation.

a, Feature construction strategy. b, Test set performance of the individual CpG clock. c, Test set performance of the combined CpG clock. d, Benchmark of individual and combined CpG clocks against state-of-the-art methylation clocks. The benchmark was performed on GSE64495, which was not included in the training set of any of the clocks shown. e, Performance of a combined CpG clock trained on multi-tissue mouse RRBS data, tested using nested cross-validation. f, Age prediction on long-lived mouse strains compared to matching controls (two-sided Wilcoxon rank-sum test: *P = 0.026 and **P = 0.002, respectively). Boxes show the median and 25th and 75th percentiles. The whiskers extend to the 25th and 75th percentiles ± 1.5 times the IQR. Points outside whiskers (outliers) are shown individually.

While constructing a biomarker on a particular biological process such as TE de-repression can help with interpretability, further considerations should be made. Most importantly, transposons are disseminated everywhere in the genome, including near genes and very commonly in introns. Thus, while the state of methylation of a single TE CpG may be representative of the status of that TE copy, it may also be affected by the local chromatin context (for example, whether a nearby gene is transcribed or not). To further improve interpretability, we trained a new clock, this time on the average genome-wide methylation state of TE families, separating genic and intergenic TE copies. We chose not to completely discard genic TE copies because a sizable portion of TEs, including some active L1s, is found in introns. Additionally, we only kept groups of at least five CpGs to reduce the impact of the local regulatory context at each CpG and ensure that each feature could be interpreted as the global methylation of a given TE family (Fig. 3a). Validation was again performed on a portion of each dataset (n = 248) and the entirety of GSE64495 (n = 104). The coefficients are available in Supplementary Data 4. We were surprised to see that while performance of this combined CpG TE clock was worse than that of the individual CpG TE clock, it was still satisfactory (Fig. 3c,d). In particular, it still had an r of 0.90 when validated on the external dataset GSE64495.
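The combined feature construction can be sketched as follows: CpGs are grouped by TE family and by genic or intergenic location, groups with fewer than five CpGs are discarded, and the remaining groups are averaged per sample. The probe identifiers, the annotation layout and the function name below are hypothetical; only the grouping rule and the five-CpG threshold come from the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def family_features(
    betas: Dict[str, float],
    annotation: Dict[str, Tuple[str, bool]],
    min_cpgs: int = 5,
) -> Dict[str, float]:
    """Average methylation per (TE family, genic/intergenic) group for one sample.

    betas maps probe ID to methylation level; annotation maps probe ID to
    (family name, True if the TE copy is genic).
    """
    groups: Dict[Tuple[str, bool], List[float]] = defaultdict(list)
    for cpg, (family, is_genic) in annotation.items():
        if cpg in betas:  # skip probes not measured in this sample
            groups[(family, is_genic)].append(betas[cpg])
    features = {}
    for (family, is_genic), values in groups.items():
        if len(values) >= min_cpgs:  # small groups mostly reflect local context
            label = f"{family}_{'genic' if is_genic else 'intergenic'}"
            features[label] = sum(values) / len(values)
    return features


# Invented example: five genic L1HS probes and four intergenic THE1A probes.
annotation = {f"probe_l1hs_{i}": ("L1HS", True) for i in range(5)}
annotation.update({f"probe_the1a_{i}": ("THE1A", False) for i in range(4)})
betas = dict(zip(annotation, [0.8, 0.9, 0.7, 0.8, 0.8, 0.6, 0.5, 0.6, 0.7]))
features = family_features(betas, annotation)
```

In this example only the L1HS genic group survives the threshold, so the sample is described by a single feature; the resulting feature table would then be passed to the same elastic net training step as before.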

Lastly, we applied the same combined CpG training strategy on the reduced representation bisulfite sequencing (RRBS) data of multiple mouse tissues. Because of limited data availability, the predictor was trained and validated using nested cross-validation, once again only including wild-type (WT) untreated mice (n = 276). The coefficients are available in Supplementary Data 4. This again yielded an accurate predictor, with r = 0.90 (Fig. 3e). Thus, our feature construction strategy was successful on multiple sequencing platforms, tissues and organisms. We note that while the strategy was indeed successful across different species, generating a single TE-based biomarker for multiple species would be difficult because TEs evolve very rapidly. For example, mice and humans have a very different number and set of active TEs36,62.

Accelerated TE methylation age is associated with health status

Next, we investigated associations between age acceleration (the difference between predicted and chronological age) and health status. We tested our biomarkers on methylation data from the WHI, a long-term study that deeply phenotyped postmenopausal women. Specifically, we used data from substudy BA23, consisting of 2,175 women aged 50–79 years at baseline, of whom ~1,070 developed coronary heart disease (CHD) during the follow-up. We examined associations between age acceleration and time to death, diagnosis of any cancer and CHD using Cox regression, including chronological age as a covariate. Accelerated aging according to the individual CpG TE clock was significantly associated with higher risk for all three outcomes (Fig. 4a). Age acceleration according to PhenoAge29, an aging biomarker trained on clinical phenotypes rather than chronological age alone, had similar associations with risk of cancer and mortality as our individual CpG TE clock. However, increased CHD risk was most associated with age acceleration according to PhenoAge. Our combined CpG TE clock, on the other hand, showed no significant associations with cancer or CHD risk, but was still associated with risk of death. We suspect this may be because of the decreased accuracy of this predictor, which relies on genome-wide methylation features. We additionally tested our mouse RRBS clock on data from ref. 63, which included long-lived growth hormone receptor knockout (GHRKO) and Snell dwarf mice. Matching WT controls were not used to train the RRBS clock. Excitingly, both GHRKO and Snell dwarf mice were predicted as significantly younger than matching controls (Fig. 3f; Wilcoxon rank-sum test: P = 0.026 and 0.002, respectively). Thus, we conclude that both the individual CpG TE clock and the combined CpG TE clock show an association with the health status of the individual and not only their chronological age.
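To make the group comparison concrete, the sketch below computes age acceleration as predicted minus chronological age and compares two groups with a two-sided Wilcoxon rank-sum (Mann-Whitney U) test. It assumes a normal approximation without tie or continuity correction, which may differ from the exact test used for the mouse comparison, and it does not cover the Cox regression applied to the WHI data. All numbers and names are invented.

```python
import math
from typing import List, Sequence, Tuple


def age_acceleration(predicted: Sequence[float], chronological: Sequence[float]) -> List[float]:
    """Difference between predicted and chronological age (negative = predicted younger)."""
    return [p - c for p, c in zip(predicted, chronological)]


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Mann-Whitney U for group x and a two-sided P value from a normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1  # extend over a run of tied values
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Invented ages in months for five long-lived mice and five matched controls.
long_lived = age_acceleration([9.0, 9.5, 8.0, 10.5, 10.0], [12.0] * 5)
controls = age_acceleration([12.5, 13.0, 11.5, 12.0, 13.5], [12.0] * 5)
u_stat, p_value = rank_sum_test(long_lived, controls)
```

For groups this small an exact test would normally be preferable to the normal approximation; the approximation is used here only to keep the sketch short.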

Fig. 4: Association between TE clock acceleration, methylation loss and disease.

a, Association between age acceleration and risk of cancer, CHD and mortality according to the individual and combined CpG clocks in the WHI BA23 dataset. Results were benchmarked against state-of-the-art chronological age (Horvath Pan-tissue and Horvath Skin & Blood) and biological age (Levine PhenoAge) clocks. Associations between predicted age acceleration and risks were determined with a Cox proportional hazards model, accounting for age. The bars represent coefficients ± s.e. b, Age trajectory of methylation at young (first row) and old L1s with the largest coefficients in the combined CpG clock (second row). Data are from GSE40279. The orange dashed line shows a linear fit ± 95% confidence interval (CI), excluding patients older than 65 years. The teal line shows a locally estimated scatterplot smoothing fit ± 95% CI on the full age range. **P < 0.01, ***P < 0.001. c, Effect of cancer within 3 years and age on methylation of young and old L1s in the WHI data. Coefficients and P values were determined by fitting a linear model with the following formula: methylation ~ age + any cancer in 3 years. d, Performance of predictors of risk of cancer, CHD and mortality within 3 years, based on young L1 CpGs. The bars represent the mean area under the receiver operating characteristic curve (ROC a.u.c.) ± s.d. in tenfold cross-validation. Best and parsimonious models are shown.

Properties of young and old L1s as biomarkers

Finally, we investigated the TE families selected by our combined CpG clock. Among the notable TE families we identified, only L1HS (genic) was chosen as a feature by the human combined CpG clock, with methylation loss associating with increasing age. However, several older L1 families were chosen with stronger coefficients (for example, L1MEi, L1PA11, L1MA4A, L1M7). We found this puzzling, because we expected that the strong age association of younger L1s (L1HS, L1PA2, L1PA3 and L1PA4) would make them useful for age prediction. Thus, we investigated the exact trajectory of young L1 de-repression in greater detail (Fig. 4b). We were surprised to see that young L1s had negligible methylation loss under the age of 65 and then rapidly lost methylation in older patients with a nonlinear trajectory. In comparison, the older L1 families selected by our combined CpG predictor showed a more linear trajectory and began demethylating at younger ages. This led us to suspect that older, ‘passively demethylating’ TE families may be better predictors of chronological age, whereas methylation loss at younger TEs, in particular those with pathogenic potential, may be better predictors of disease risk. Thus, we modeled average methylation at young L1s (L1HS, L1PA2, L1PA3, L1PA4) and old L1s with large clock coefficients (L1MEi, L1PA11, L1MA4A, L1M7) as a function of age, this time including whether individuals would be diagnosed with any cancer within 3 years of sample collection (linear model: methylation ~ age + any cancer in 3 years; Fig. 4c). We found that cancer was significantly associated with decreased methylation of young L1s, but not older ones, although a trend was still present. Conversely, when accounting for cancer, age was associated with decreased methylation at older L1s but not at young ones. 
We found that most cancer types in the dataset showed lower than average L1 methylation, although the sample size for most types was insufficient to draw strong conclusions (Extended Data Fig. 7a). The main exception to this trend was lung cancer, which showed higher than average young L1 methylation compared to healthy patients. With this knowledge, we trained predictors of cancer, CHD and mortality within the next 3 years solely based on young L1 CpGs (n = 621) in the WHI data. These events were quite rare (cancer: n = 52; CHD: n = 140; death: n = 39; total: n = 2,175), making training challenging. Nonetheless, the resulting models had mild predictive ability (Fig. 4d). Interestingly, while the mortality and CHD predictors were rather complex, even when choosing the optimal model with parsimony (best mortality predictor: n CpGs = 136; parsimonious mortality predictor: n CpGs = 36; best CHD predictor: n CpGs = 135; parsimonious CHD predictor: n CpGs = 52; Supplementary Data 5), the cancer predictors were remarkably simple, using only a handful of CpGs. The simplest model based predictions on just two CpGs: cg07575166, found in an intergenic L1HS 5′ UTR, and cg26106149, located in a full-length L1PA3 in an intron of FBXL4, a gene with no known role in cancer initiation. The more complex model used four more CpGs but assigned the most weight to the aforementioned two. We trained an additional set of models allowing THE1A and THE1C CpGs to be selected by the model, intending to determine if the reported links between THE1 element activation and cancer could improve predictive performance; however, this only led to improvements in the performance of the CHD predictor and not of the cancer and mortality predictors (Extended Data Fig. 7b).
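The performance measure used for these risk predictors, ROC a.u.c. summarized across cross-validation folds, can be sketched as below. The a.u.c. is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counting one half). Whether the reported s.d. is the sample or population standard deviation is not stated, so the sample version is an assumption; all names and values are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise case/control comparisons."""
    cases = [s for s, label in zip(scores, labels) if label == 1]
    controls = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if case > control else 0.5 if case == control else 0.0
        for case in cases
        for control in controls
    )
    return wins / (len(cases) * len(controls))


def summarize_folds(fold_aucs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-fold a.u.c. values."""
    return statistics.mean(fold_aucs), statistics.stdev(fold_aucs)


# Invented example: two cases and two controls, one case ranked below a control.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
mean_auc, sd_auc = summarize_folds([0.6, 0.7, 0.8])
```

Because a.u.c. depends only on the ranking of cases relative to controls, it remains meaningful when events are rare, as they are here, although each fold must contain at least one case and one control for the value to be defined.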

Discussion

In summary, we studied the age dynamics of TE methylation, finding that most TEs, from evolutionarily young to ancestral ones, were likely to lose methylation during the course of aging. However, this tendency was accentuated for young L1 elements (L1HS, L1PA2, L1PA3 and L1PA4) and two LTR retrotransposon families (THE1A and THE1C). We observed similar patterns of de-repression in chromatin accessibility data, but not in gene expression data, suggesting that RNA interference pathways may be able to compensate for the looser chromatin state, as found in other studies38. Future work should investigate if RNA interference pathways or other epigenetic mechanisms are indeed responsible for this discrepancy.

Local CpG density and youthful methylation affect methylation drift rate during aging. The rate of methylation loss at most TEs was well described by those two factors, but this was not the case for young L1s. Thus, we hypothesize that most TEs have lost their regulatory sequences and therefore lose methylation passively. Conversely, young L1s probably still contain regulatory sequences that enable recruitment of activating epigenetic machinery. Therefore, we searched for TF binding motifs that were enriched at young L1s compared to older ones and that were associated with faster or slower methylation loss in young L1s. Several TF binding motifs, including FOX, BMAL1 (ARNTL), HIC2 and NFKB2, were associated with slower drift. However, we did not find motifs associated with faster methylation loss. An additional way to identify evidence of active de-repression of young L1s would be to measure methylation marks such as 5hmC or 6mA, which are associated with transcriptional activation, as opposed to 5mC11,12.

We next explored the use of TE methylation loss as a biomarker of age and disease. An age predictor based on individual CpGs found in TEs had remarkable accuracy and showed associations with cancer and mortality comparable to PhenoAge. Another study recently obtained similar results using individual CpGs within full-length L1s and human endogenous retroviruses64. We generated additional predictors based on the average methylation of TEs genome-wide, for both human blood methylation array data and multi-tissue mouse RRBS data. While less accurate than their individual CpG counterparts, these predictors were still satisfactory (r > 0.9) and showed associations with health status. Future studies may want to explore clinical uses for such biomarkers to prevent disease and inform lifestyle choices. An important question is which health conditions beyond cancer and overall mortality can be predicted by these biomarkers. Additionally, restricting feature selection to TEs provides some benefits in terms of interpretability; however, the models remain complex and more work is required to establish the precise links between TE de-repression and aging or disease, especially for older transposon families. Indeed, we were surprised to see that these predictors did not mainly rely on young L1s, despite their strong age association, prompting us to investigate the exact timing of young L1 de-repression. We found that young L1s rapidly de-repressed only after the age of 65 and were otherwise very stable beforehand. This age coincides with the age of onset of many age-related diseases. Thus, we explored associations between loss of methylation and disease, finding that methylation loss at young L1s was associated with cancer but not age, while the opposite was true for the older L1s selected by the clock. Finally, we trained predictors for cancer, CHD and mortality within 3 years of the methylation measurement, solely based on young L1 CpGs. 
The mortality and cancer predictors were mildly successful; in particular, the cancer predictor made use of only two CpGs in young L1s. Future studies may want to investigate the mechanism behind this seemingly direct relationship. An obvious question is whether young L1 de-repression is the cause or consequence of cancer. Indeed, both mechanisms are possible as mutations of epigenetic machinery are common in cancer65. However, because the loss of methylation was detected in the blood and was predictive of cancer events in other organs, it is possible that TE de-repression may promote cancer by accelerating inflammation or by promoting other pathological processes through other non-cell-autonomous mechanisms. Finally, loss of methylation at young L1s could be neither the cause nor the consequence of cancer; instead, both events could have common drivers. Clonal hematopoiesis is a probable suspect as the most common mutation in clonal hematopoiesis is loss of function of DNMT3A, a de novo methyltransferase66,67,68, which may also contribute to the loss of methylation on TEs.

Methods

Datasets

We used four public human blood array datasets (GSE64495 (ref. 69), GSE40279 (ref. 26), GSE157131 (ref. 70) and GSE147221 (ref. 71)) to determine the associations between age and TE methylation loss, and to train and validate the human age predictors. GSE87648 (ref. 72) was only included in predictor training and validation because it had an internal batch effect (determined by principal component analysis). The WHI human blood dataset BA23 (https://www.whi.org/study/BA23) and related metadata were used to investigate the relationships between TE clock age acceleration and risk of disease and mortality, and later to investigate the associations between young L1 methylation loss and disease. The mouse multi-tissue datasets GSE60012 (ref. 73), GSE93957 (ref. 74) and GSE80672 (ref. 63) were used to train and validate the mouse age predictor. All data were used as preprocessed by the original authors with the exception of GSE60012, as the needed processed files were unavailable. The matched ATAC–seq and RNA-seq dataset GSE193142 (ref. 25) from human PBMCs was used to determine if the observed alterations in RE methylation were associated with consistent changes in chromatin accessibility and expression.

Annotation of CpGs and REs

The coordinates of Infinium array CpGs were obtained from the Illumina manifest. We used RepeatMasker to annotate repeats in the GRCh37 and GRCm38 genomes. ChIPSeeker75 was used to annotate the genomic context of CpGs. Consensus sequences for TEs were obtained from Repbase76. The positions of the 5′ UTR, open reading frames 1 and 2, 3′ UTR, LTR and internal sequences were derived from the information included with the consensus sequences in the EMBL format. CpGs were then located in consensus sequences by aligning the probe target sequences with the consensus using pairwiseAlignment in the Biostrings R package (http://bioconductor.org/packages/Biostrings/).

RRBS data processing

Raw reads were downloaded from the Sequence Read Archive (SRA) and trimmed using Trim Galore! (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with the --rrbs option. We aligned the trimmed reads to the GRCm38 genome build using Bismark (https://www.bioinformatics.babraham.ac.uk/projects/bismark/) and quantified methylation using bismark_methylation_extractor and bismark2bedGraph.

ATAC–seq data processing

Raw reads were downloaded from the SRA, trimmed with Trim Galore! using default settings and aligned to GRCh38 using Bowtie 2 with the following options: --very-sensitive -X 1000 --dovetail. Multimapping reads, mitochondrial reads and PCR duplicates were removed using SAMtools77 and Picard MarkDuplicates (https://broadinstitute.github.io/picard/). Reads over REs were counted with featureCounts78 with the options -p -B --read2pos 5, over regions defined by an SAF file derived from the RepeatMasker output. All RE loci were then collapsed by summing the counts of loci of the same TE family. Age coefficients were fitted using edgeR79 by including age and sex as independent variables. Comparison with DNA methylation data was only performed for REs that were adequately represented in the Infinium array (no. of CpGs per RE class ≥ 100; no. of CpGs per RE family > 40).
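
The collapsing and representation filter described above can be illustrated with a minimal Python sketch. This is not the code used in the study (which relied on featureCounts and edgeR); the function names, locus identifiers and counts are hypothetical and serve only to show the arithmetic for a single sample.

```python
from collections import defaultdict


def collapse_to_family(locus_counts, locus_family):
    """Sum per-locus read counts into one count per TE family.

    locus_counts: dict mapping a locus ID to its read count (one sample).
    locus_family: dict mapping the same locus IDs to a TE family name.
    """
    totals = defaultdict(int)
    for locus, count in locus_counts.items():
        totals[locus_family[locus]] += count
    return dict(totals)


def adequately_represented(n_cpgs_class, n_cpgs_family):
    """Thresholds stated in the text: class >= 100 CpGs, family > 40 CpGs."""
    return n_cpgs_class >= 100 and n_cpgs_family > 40


# Toy input (hypothetical loci and counts, for illustration only).
toy_counts = {"locus_1": 5, "locus_2": 7, "locus_3": 2}
toy_family = {"locus_1": "L1HS", "locus_2": "L1HS", "locus_3": "THE1A"}
family_counts = collapse_to_family(toy_counts, toy_family)
```

Summing before model fitting means that age coefficients are estimated per family rather than per locus, which matches the family-level comparison with the methylation data.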

RNA-seq data processing

Raw reads were downloaded from the SRA, trimmed with Trim Galore! using default settings and aligned to GRCh37 using STAR with the following options: --outFilterMultimapNmax 100 --winAnchorMultimapNmax 200 --outFilterMismatchNoverLmax 0.04. Reads over REs were counted and collapsed to family using TEcount, using the RE annotations provided on the TEtranscripts website (https://www.mghlab.org/software/tetranscripts). Age coefficients were fitted using edgeR by including age and sex as independent variables, filtering out REs expressed in fewer than three samples. Comparison with DNA methylation data was only performed for REs that were adequately represented in the Infinium array (no. of CpGs per RE class ≥ 100; no. of CpGs per RE family > 40).

Motif analysis

TF binding motif enrichment was performed with SEA45 on 500-bp sequences flanking young and old L1 CpGs (250 bp upstream and downstream). We used the JASPAR (nonredundant) vertebrate DNA motif database and an E-value threshold of less than ten. The SEA output was then further filtered to focus on motifs highly enriched on young L1s (present near more than 50% of young L1 CpGs and less than 20% of old L1 CpGs) and whose presence in young L1s was associated with faster or slower methylation loss (Wilcoxon rank-sum test P < 0.05 for both adjusted and unadjusted age coefficients). We performed additional motif enrichment analysis contrasting young L1 CpGs with faster-than-median methylation loss to those with slower-than-median methylation loss. No further filtering was applied to this analysis.
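
The post-SEA filtering criteria can be summarized in a short Python sketch. It assumes that the fraction of CpGs with a motif nearby and the two Wilcoxon P values have already been computed for each motif; the function name, argument names and toy values are hypothetical and are not taken from the study's code.

```python
def keep_motif(young_frac, old_frac, p_unadjusted, p_adjusted,
               min_young=0.5, max_old=0.2, alpha=0.05):
    """Apply the two filters described in the text to one motif.

    young_frac / old_frac: fraction of young / old L1 CpGs with the motif
    within the flanking sequence.
    p_unadjusted / p_adjusted: Wilcoxon rank-sum P values for the unadjusted
    and adjusted age coefficients (motif present versus absent).
    """
    enriched = young_frac > min_young and old_frac < max_old
    associated = p_unadjusted < alpha and p_adjusted < alpha
    return enriched and associated


# Toy records (made-up numbers, for illustration only).
toy_motifs = {
    "motif_A": (0.80, 0.10, 0.01, 0.03),   # passes both filters
    "motif_B": (0.80, 0.30, 0.01, 0.03),   # too common near old L1 CpGs
    "motif_C": (0.60, 0.05, 0.01, 0.20),   # adjusted P value too high
}
selected = [name for name, stats in toy_motifs.items() if keep_motif(*stats)]
```

Requiring significance for both the adjusted and unadjusted coefficients is the stricter reading of the criterion and is the one implemented here.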

Predictor training and validation

All predictors in this study are a form of elastic net, implemented using the glmnet R package. Age predictors use the Gaussian family argument whereas the disease and mortality predictors use the binomial (logistic) family argument. Age predictions were evaluated using the RMSE, MAE and Pearson’s r. Disease and mortality predictions were evaluated using the ROC a.u.c. Before training and predicting, we transformed ages using the same age transformation used by Horvath in the Pan-tissue27 and Skin & Blood28 clocks. Briefly, ages below the age of maturity (20 years for humans, 6 weeks for mice) were log-transformed to linearize the relationship between age and methylation in developmental stages. When sufficient samples were available, we validated our predictors by leaving out a portion of all data and an entire dataset (GSE64495) for testing, and training and choosing hyperparameters on the remainder of the data using cross-validation. When the number of samples was limited, we used nested cross-validation. Hyperparameters were explored using grid search and selected to give the lowest cross-validation mean squared error or the highest ROC a.u.c., with the exception of the models we called ‘parsimonious’, for which hyperparameters were selected to give the simplest model within 1 s.d. of the best performance. Individuals with known health conditions or treatments were excluded from model training. The matching WT controls of the GHRKO and Snell dwarf strains were also excluded from clock training to have a fair comparison.
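
For readers unfamiliar with the age transformation, the following Python sketch shows the form published with the Horvath Pan-tissue clock as we understand it: logarithmic up to the age of maturity and linear afterwards, with the two pieces meeting at zero. The function and parameter names are ours, and the sketch is an illustration rather than the code used in the study; `adult_age` must be given in the same units as `age`.

```python
import math


def transform_age(age, adult_age=20.0):
    """Log-linear age transform: log below adult_age, linear above it."""
    if age <= adult_age:
        return math.log(age + 1) - math.log(adult_age + 1)
    return (age - adult_age) / (adult_age + 1)


def inverse_transform_age(value, adult_age=20.0):
    """Map a transformed value (for example, a model prediction) back to age."""
    if value <= 0:
        return (adult_age + 1) * math.exp(value) - 1
    return (adult_age + 1) * value + adult_age


# Round trip for a few chronological ages (human setting, adult_age = 20).
example_ages = [0.0, 5.0, 20.0, 41.0, 80.0]
round_trip = [inverse_transform_age(transform_age(a)) for a in example_ages]
```

A model trained on transformed ages returns predictions on the transformed scale, so the inverse function is applied before computing error metrics in years.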

Predictor benchmarking

We downloaded the clock coefficients published with the original manuscripts. Ages were transformed (and inverse-transformed) for prediction if required (Horvath Pan-tissue and Skin & Blood). All clocks were then applied to the same samples of GSE64495 and the WHI BA23 dataset. Clock features with missing values in the WHI BA23 dataset (1.5% of all values) were imputed using the makeX function in the glmnet R package.

Statistics and reproducibility

No statistical methods were used to predetermine sample sizes. Instead, we estimated the number of samples required to successfully train age predictors based on previous publications26,27,28,29 and assembled the public datasets accordingly. Individuals with known health conditions were excluded from any analyses investigating normal aging and from predictor training. Further outlier removal was performed before predictor training and benchmarking based on principal component analysis. Briefly, samples outside of the 95% confidence ellipse calculated using the stat_ellipse R function were excluded. This was performed separately for each dataset, sex and health status. Where possible, nonparametric statistical tests were used; in other cases, data distribution was not formally tested. Associations between age and Infinium array CpG methylation were determined using limma34, with the age + sex + study design. The fitted coefficients were used as methylation drift rates, whereas methylation at 20 years of age was calculated as intercept + coefficient × 20. Our fitting of expected age drift as a function of CpG density and youthful methylation level used a generalized additive model (GAM) with covariates for CpG density within 100 bp of the CpG in question, the methylation of that CpG at 20 years of age and the interaction of the two covariates (age_coef ~ s(methylation at 20 years, bs = 'cs') + s(CpG_density, bs = 'cs') + s(methylation at 20 years, bs = 'cs', by = CpG_density)). Associations between age acceleration and mortality or disease risk were tested using a Cox regression model (coxph in R) with the formula Surv(time-to-event, status) ~ acceleration + age. The boxplots show the median, lower and upper quartiles and 1.5 times the IQR, with outliers shown individually unless indicated otherwise. The barplot error bars represent the s.d.
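
The relationship between the fitted coefficients, the drift rate and the methylation level at 20 years can be illustrated with a simplified Python sketch. It fits an ordinary least-squares line of methylation against age for a single CpG and deliberately omits the sex and study covariates and the limma machinery used in the study; the function names and the toy data are hypothetical.

```python
def fit_drift(ages, betas):
    """Least-squares fit of methylation (beta value) against age for one CpG.

    Returns (intercept, slope); the slope plays the role of the drift rate.
    """
    n = len(ages)
    mean_age = sum(ages) / n
    mean_beta = sum(betas) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (b - mean_beta) for a, b in zip(ages, betas))
    slope = sxy / sxx
    intercept = mean_beta - slope * mean_age
    return intercept, slope


def methylation_at(age, intercept, slope):
    """Predicted methylation at a given age: intercept + coefficient x age."""
    return intercept + slope * age


# Toy CpG losing 0.001 beta units per year (made-up values).
toy_ages = [30, 40, 50, 60, 70]
toy_betas = [0.9 - 0.001 * a for a in toy_ages]
intercept, drift_rate = fit_drift(toy_ages, toy_betas)
youthful_methylation = methylation_at(20, intercept, drift_rate)
```

In this toy case the drift rate is about -0.001 per year and the estimated methylation at 20 years is about 0.88; these two per-CpG quantities are the kind of inputs that the additive model above relates to CpG density.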

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.