Introduction

Aging is associated with a wide range of molecular and cellular changes. Understanding the impact of these changes may provide insights into the biology of aging and help identify potential therapeutic strategies for aging and age-related diseases. DNA methylation (DNAm) alterations rank among the most well-studied and prominent age-related molecular changes; importantly, these alterations generally replicate well across many different cohorts and species1. DNAm is part of the epigenome and is involved in regulating gene expression, but the downstream consequences of most age-related DNAm changes remain incompletely understood, making their functional impact difficult to interpret. Conversely, transcriptomic changes are more functionally informative than DNAm changes because they reflect the current level of gene expression in tissues and cells. Age-related changes to the transcriptome have also been described and have facilitated the development of ‘transcriptomic clocks’2; however, these clocks generally feature lower replicability across cohorts and datasets than DNAm-based clocks3, higher noise, and susceptibility to batch effects4. These features make comparison of existing transcriptomic clocks across studies difficult.

The integration of multiple omics-based measures could assist in overcoming specific limitations of an individual readout. Previous studies have leveraged multi-omics analyses to derive DNAm-based predictors (or proxies) of features in other omics or clinical data, combining the reliability of DNAm profiling with the more functional insights derived from other molecular changes. For instance, DNAm-based EpiScores have been shown to predict plasma proteomic features5, and recently, a DNAm-based multi-omic aging clock was described6. The combination of epigenetic and transcriptomic data has been suggested to improve biological age predictions2, yet few studies to date have investigated direct links between age-related epigenetic and transcriptomic changes1,7,8,9,10,11. These studies have reported conflicting findings, with some reporting that methylome changes are indicative of alterations in gene expression7,8,9 and others noting limited association with expression of affected genes1,10,11.

A comprehensive, integrative analysis of epigenetic and transcriptomic changes across the human lifespan has not yet been conducted. Such a study could shed further light on the functional relevance of age-related DNAm changes and identify genes that could be targeted to combat aging-related declines in physiological health. Here, we perform a large-scale analysis of DNAm and gene expression data from blood samples, utilizing data from several cohorts (n = 4174 and 3461 total samples for DNAm and RNA-seq, respectively; Table 1). Our integrative multi-omic approach enabled the identification of functionally relevant DNAm changes in blood associated with gene expression alterations, hereafter referred to as multi-omic aging genes. We validate our findings in an external high-quality blood DNAm dataset, generated using the latest methylation array technology, from the Mass General Brigham Biobank (n = 500) that comprises a broad range of ages and has additional data on many aging-associated outcomes. Our findings provide a first in-depth assessment of the functional consequences of age-related DNAm changes in blood, and highlight the potential for multi-omics approaches to uncover functionally relevant genes and genetic loci. In turn, these features could be used to develop meaningful predictors for relevant aging-related outcomes and may be targeted by therapeutics to mitigate aging-related molecular changes and decline.

Table 1 Datasets used in this study

Results

Age-associated gene expression is less reproducible than DNAm

We initially aimed to identify age-associated genes from six large transcriptomic datasets, including data from the Multi-Ethnic Study of Atherosclerosis (MESA), which features two sampling time points (referred to hereafter as MESA1 and MESA2); the Parkinson’s Progression Markers Initiative (PPMI); the Gates Grand Challenge (GC6); the 500 Functional Genomics Project (500FG); and the JenAge Ageing Factor Database. These datasets cover wide age ranges and are balanced between male and female participants (Fig. 1a). For each cohort, we assessed the correlation of gene expression levels with age (see “Methods”). Among the top 10 transcripts associated with age across all datasets, we observed that the correlation with age was quite variable between cohorts (Fig. 1b). For instance, we found loss of expression of CD248 as a function of age, consistent with previous reports12,13, but the magnitude of the correlation ranged from − 0.3 to − 0.5 (Fig. 1c). For further validation of this observation, we examined the correlation between age and expression levels of aging transcripts previously identified by Peters et al. across these cohorts. In agreement with our previous result, we found generally low levels of correlation between expression levels of these genes and age across cohorts (Fig. 1d). Thus, age-related transcriptomic data appear to replicate poorly across cohorts.
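The per-cohort screen described in this paragraph can be illustrated with a minimal sketch. It assumes that expression values are already normalized and that each cohort is available as a list of ages plus a gene-to-values mapping; the function names (`pearson_r`, `age_correlations`), the cohort labels, and all toy values are hypothetical and are not the pipeline used in this study (see “Methods” for the actual procedure).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / sqrt(sxx * syy)


def age_correlations(cohorts):
    """Return {gene: {cohort: r}} for every gene in every cohort.

    `cohorts` maps a cohort name to (ages, {gene: expression values}).
    """
    out = {}
    for name, (ages, expr) in cohorts.items():
        for gene, values in expr.items():
            out.setdefault(gene, {})[name] = pearson_r(ages, values)
    return out


# Toy data, invented for illustration only (not values from any cohort).
toy = {
    "cohortA": ([30, 45, 60, 75],
                {"CD248": [9.0, 8.1, 7.2, 6.0], "GENE_X": [5.0, 5.2, 4.9, 5.1]}),
    "cohortB": ([25, 50, 65, 80],
                {"CD248": [8.5, 8.0, 6.9, 6.5], "GENE_X": [5.1, 4.8, 5.3, 5.0]}),
}
result = age_correlations(toy)

# Range of r across cohorts: one simple way to summarize between-cohort variability.
spread = {gene: max(r.values()) - min(r.values()) for gene, r in result.items()}
```

Comparing `spread` between genes gives a rough sense of how variable an age association is from cohort to cohort, which is the property this section examines.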

Fig. 1: Aging transcripts and aging CpGs in blood.

a Distribution of ages by cohort and sex in 6 RNA-seq datasets used in this analysis. Sample sizes: MESA1 n = 1016, MESA2 n = 845, PPMI n = 1111, GC6 n = 327, 500FG n = 97, JenAge n = 62. Box plots show median (center line), first and third quartiles (box boundaries), and 1.5 × interquartile range (whiskers). b Top 10 age-associated transcripts across the cohorts. Statistical comparisons between up-regulated and down-regulated transcripts using two-sided Mann-Whitney U test: MESA1 p = 0.0003, MESA2 p = 0.0003, PPMI p = 0.0003, GC6 p = 0.0003, 500FG p = 0.0003, JenAge p = 0.0003. Box plots as in (a). Significance: ***p < 0.001. c Gene expression levels of CD248 by age across the cohorts. Pearson correlation between age and gene expression: MESA1 r = − 0.34, p = 2.08e-28; MESA2 r = − 0.43, p = 1.80e-38; PPMI r = − 0.34, p = 2.34e-32; GC6 r = −0.44, p = 1.23e-16; 500FG r = − 0.54, p = 1.04e-08; JenAge r = − 0.56, p = 2.65e-06. d Correlation between age and gene expression at aging transcripts (Peters et al. 2015) across the cohorts. Each cohort contains n = 1331 measurements (MESA1, MESA2, PPMI) or n = 1324 (GC6, 500FG), n = 1321 (JenAge). Two-sided Mann-Whitney U test comparing up-regulated vs down-regulated transcripts: MESA1 p = 5.57e-175, MESA2 p = 2.97e-199, PPMI p = 1.21e-208, GC6 p = 2.00e-161, 500FG p = 8.22e-199, JenAge p = 8.58e-197. Box plots as in (a). Significance: ***p < 0.001. e Distribution of ages by cohort and sex in 6 DNAm datasets used in this analysis. Sample sizes: MESA1 n = 886, MESA2 n = 888, PPMI n = 510, RA n = 689, Grady n = 795, GENOA n = 946, MGB n = 461. Box plots as in (a). f Top 10 age-associated DNAm across the cohorts. Two-sided Mann-Whitney U test: MESA1 p = 0.0006, MESA2 p = 0.0006, PPMI p = 0.0012, RA p = 0.0002, Grady p = 0.0002, GENOA p = 0.0002. Box plots as in (a). Significance: **p < 0.01, ***p < 0.001. g Correlation between age and DNAm levels at aging CpGs (Varshavsky et al. 2023) across the cohorts. 
Pearson correlation: MESA1 r = 0.74, p = 2.59e-155; MESA2 r = 0.67, p = 7.86e-119; PPMI r = 0.80, p = 3.39e-115; RA r = 0.84, p = 1.85e-181; Grady r = 0.80, p = 2.21e-174; GENOA r = 0.84, p = 4.56e-251. h DNAm levels of ELOVL2 promoter (cg16867657) by age across the cohorts. Two-sided Mann-Whitney U test: MESA1 p = 1.44e-122, MESA2 p = 2.12e-122, PPMI p = 7.94e-132, RA p = 3.64e-155, Grady p = 6.73e-144, GENOA p = 5.44e-141. Box plots as in (a). Significance: ***p < 0.001. Gene expression data was normalized using median normalization. Statistical significance is annotated: *p < 0.05, **p < 0.01, ***p < 0.001, ns = not significant (p ≥ 0.05).

We next asked whether the same was true of DNAm data. We analyzed DNAm from six large data sources: MESA1, MESA2, and PPMI, as above; a cohort of patients with rheumatoid arthritis and healthy controls (RA); the Grady Trauma Project (Grady); and the Genetic Epidemiology Network of Arteriopathy (GENOA) study (Fig. 1e). For the top 10 CpG sites that were found to be associated with age, we found considerably higher and more consistent correlations with age across the datasets (Fig. 1f). For instance, cg16867657, a well-known aging-associated CpG site within the ELOVL2 promoter14, emerged as the top site consistently positively correlated with age (Fig. 1g). Across aging CpGs reported in 2023 by Varshavsky et al.15, we again found correlations that were larger in magnitude than those observed for aging transcripts, and more consistent across cohorts. Thus, age-associated DNAm data are considerably more replicable across cohorts than transcriptomic data. Statistical comparison of correlation strengths confirmed that age-associated DNAm data show significantly higher and more consistent correlations with age across datasets compared to aging transcripts (Supplementary Fig. 1).

Integrating omics identifies validated age-associated genes

To identify age-associated genes that are consistent across blood samples of several cohorts, we integrated DNAm and transcriptomic data from studies that produced both data modalities (MESA and PPMI) to identify genes whose age-associated DNAm changes have functional consequences at the level of gene expression. On average, across MESA time points and across cohorts, we observed surprisingly little correlation between gene expression levels and age for genes whose CpG sites had the greatest gain of methylation over aging (Fig. 2a), consistent with some previous reports1,10. Conversely, we found that genes with the greatest decrease in expression levels with aging consistently featured significantly higher DNAm correlation with age (Fig. 2b). This observation was statistically significant across the MESA1, MESA2, and PPMI cohorts (Supplementary Fig. 2). These observations were consistent at the level of individual genes: There were minimal differences in the expression of ELOVL2, KLF14, and FHL2, the genes associated with the top-ranked aging-associated CpGs, between young and old individuals (Fig. 2c, left, see Methods for details). In contrast, the top down-regulated aging transcripts, CD248, LRRN3, and NELL2 each featured increased DNAm in older compared to younger individuals (Fig. 2c, right). CD248 in particular showed strong positive correlation with age at the level of DNAm (Fig. 2d, left) and strong inverse correlation with gene expression (Fig. 2d, right); this was true at both MESA time points and in the PPMI dataset.
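The integration logic described above (an age-associated gain of DNAm paired with an age-associated loss of expression, seen consistently across cohorts) can be sketched as a simple filter. This is an illustrative sketch only: the function name `multi_omic_aging_genes`, the cut-offs `dnam_min` and `expr_max`, the cohort labels, and every correlation value below are hypothetical and are not taken from this study.

```python
def multi_omic_aging_genes(dnam_r, expr_r, dnam_min=0.2, expr_max=-0.2):
    """Return genes with DNAm gain and expression loss in every cohort.

    dnam_r / expr_r: {cohort: {gene: correlation with age}}, same cohorts in both.
    dnam_min / expr_max: hypothetical cut-offs, not values from the paper.
    """
    cohorts = list(dnam_r)
    # Only genes measured in both modalities in all cohorts are eligible.
    shared = set.intersection(*(set(dnam_r[c]) & set(expr_r[c]) for c in cohorts))
    return sorted(
        gene for gene in shared
        if all(dnam_r[c][gene] >= dnam_min and expr_r[c][gene] <= expr_max
               for c in cohorts)
    )


# Invented correlations that only mimic the qualitative pattern in the text:
# ELOVL2 gains DNAm without an expression change; CD248 changes in both layers.
toy_dnam = {
    "cohortA": {"ELOVL2": 0.75, "CD248": 0.40, "GENE_X": 0.02},
    "cohortB": {"ELOVL2": 0.80, "CD248": 0.35, "GENE_X": -0.01},
}
toy_expr = {
    "cohortA": {"ELOVL2": 0.01, "CD248": -0.35, "GENE_X": -0.30},
    "cohortB": {"ELOVL2": -0.03, "CD248": -0.45, "GENE_X": -0.25},
}
flagged = multi_omic_aging_genes(toy_dnam, toy_expr)
```

Requiring both conditions in every cohort is a deliberately strict design choice in this sketch; a real analysis would also need to map CpGs to genes and account for statistical significance.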

Fig. 2: Multi-omics biomarkers of aging identify functional aging genes.

a Correlation of gene expression levels with age at regions with the highest age-associated DNAm gain in MESA1, MESA2, and PPMI cohorts. Sample sizes: n_all_CpGs=273168 (MESA1, MESA2), 276316 (PPMI); n_DNAm_gain=100 (each cohort). Two-sided Mann-Whitney U test: MESA1 p = 0.2895, MESA2 p = 0.3435, PPMI p = 0.0053. Box plots show median (center line), first and third quartiles (box boundaries), and 1.5 × interquartile range (whiskers). Individual data points are overlaid. Significance: **p < 0.01, ns=not significant (p ≥ 0.05). b Correlation of DNAm levels with age at promoters of most age-associated down-regulated genes in MESA1, MESA2, and PPMI cohorts. n_down-regulated=100 (each cohort). Two-sided Mann-Whitney U test: MESA1 p = 3.94e-19, MESA2 p = 1.46e-05, PPMI p = 1.98e-08. Box plots as in (a). Significance: ***p < 0.001. c DNAm and expression levels of genes with the highest age-associated DNAm (left panels) and age-associated expression (right panels) for young (≤ 45 years) and old (≥ 75 years) participants in MESA1, MESA2, and PPMI cohorts. Sample sizes: n_young and n_old vary by gene and cohort (see Source Data). 
Two-sided Mann-Whitney U test comparing young vs old groups: DNAm panels (left): ELOVL2: MESA1 p = 1.20e-68, MESA2 p = 3.41e-70, PPMI p = 6.76e-48; KLF14: MESA1 p = 3.77e-38, MESA2 p = 3.63e-34, PPMI p = 1.09e-32; FHL2: MESA1 p = 3.02e-54, MESA2 p = 4.84e-53, PPMI p = 8.37e-35; CD248: MESA1 p = 1.42e-10, MESA2 p = 1.65e-10, PPMI p = 6.50e-10; LRRN3: MESA1 p = 0.0002, MESA2 p = 1.53e-09, PPMI p = 0.0028; NELL2: MESA1 p = 0.0150, MESA2 p = 5.67e-05, PPMI p = 0.0006; RNA panels (right): ELOVL2: MESA1 p = 0.3406, MESA2 p = 0.7273, PPMI p = 0.4158; KLF14: MESA1 p = 0.0020, MESA2 p = 0.7864, PPMI p = 0.3641; FHL2: MESA1 p = 0.0022, MESA2 p = 0.7713, PPMI p = 0.0154; CD248: MESA1 p = 4.24e-24, MESA2 p = 4.90e-33, PPMI p = 2.90e-26; LRRN3: MESA1 p = 2.44e-19, MESA2 p = 2.92e-33, PPMI p = 1.89e-30; NELL2: MESA1 p = 1.75e-05, MESA2 p = 2.40e-18, PPMI p = 2.44e-10. Box plots as in (a). Significance: *p < 0.05, **p < 0.01, ***p < 0.001, ns = not significant (p ≥ 0.05). d DNAm and expression levels of CD248 in MESA 2000 (left, d1) and MESA 2010 (right, d2). Scatter plots showing individual data points with linear regression lines. Pearson correlation: d1 (MESA 2000): African American n = 75, r = 0.72, p = 4.13e-13; non-African American n = 669, r = 0.36, p = 1.98e-22; d2 (MESA 2010): Highest CD248 DNAm in 2000 n = 72, r = 0.38, p = 0.0011; Others n = 781, r = 0.25, p = 3.41e-12. e Percentage of top aging transcripts (Peters et al. 2015) that replicate in MESA1, MESA2, and PPMI cohorts. Bar heights represent the percentage of transcripts validated. Error bars show 68% confidence intervals. Fisher’s exact test comparing transcripts with aging CpGs vs all transcripts: MESA1 p = 7.75e-12, MESA2 p = 2.36e-06, PPMI p = 0.0462. Significance: *p < 0.05, ***p < 0.001. Statistical significance is annotated: *p < 0.05, **p < 0.01, ***p < 0.001, ns = not significant (p ≥ 0.05).

Gold-standard reference DNAm dataset for aging blood

To demonstrate that our integrative analyses and the identified multi-omic (epigenetic-transcriptomic) aging genes are replicable, we performed an independent external validation using a cohort we specifically designed for omic analysis of aging biomarkers across the lifespan. We generated DNAm profiles for 500 individuals of diverse ages from the Mass General Brigham (MGB) Biobank using the Illumina Infinium MethylationEPIC v2.0 array, which covers over 935,000 CpG sites enriched for regulatory regions and has been shown to exhibit high reproducibility16. Our subjects were recruited from a major metropolitan academic medical center, were roughly balanced between male and female, and were generally representative of the racial/ethnic distribution of the local area (Fig. 3a).

Fig. 3: A reference epigenetic aging improves validation of transcriptomic aging.

a Distribution of sex (left) and race (right) by age in the MGB cohort. Sample sizes: female n = 3080, male n = 2095. Mean age: females 56.4 ± 15.3 years (SD), males 60.1 ± 13.3 years (SD). b Correlation between age and DNAm levels at aging CpGs (Varshavsky et al. 2023) in the MGB cohort compared to other cohorts. Sample sizes (n_CpG_sites) : MESA1 n = 705192, MESA2 n = 705192, PPMI n = 719282, RA n = 485577, Grady n = 800195, GENOA n = 445198. Two-sided Mann-Whitney U test: MESA1 p = 1.44e-122, MESA2 p = 2.12e-122, PPMI p = 7.94e-132, RA p = 3.64e-155, Grady p = 6.73e-144, GENOA p = 5.44e-141. Box plots show median (center line), first and third quartiles (box boundaries), and 1.5 × interquartile range (whiskers). Significance: ***p < 0.001. c Correlation between age and DNAm levels at promoters of genes with different levels of expression. Sample sizes and statistical tests are provided in Source Data. Scatter plots show individual data points. d Correlation between age and DNAm levels at promoters of genes with different levels of PRC2 binding (Q5 = highest 20%, see Methods for details). Sample sizes and statistical tests are provided in Source Data. e Correlation between age and DNAm levels of genes with different gene expression and PRC2 binding levels. Sample sizes and statistical tests are provided in Source Data. f Correlation of DNAm levels with age (in MGB) at promoters of most age-associated down-regulated genes in MESA1, MESA2, and PPMI cohorts. Sample sizes: n_all_CpGs = 290012 (each comparison), n_down-regulated = 100 (each cohort). Two-sided Mann-Whitney U test: MESA1 p = 4.24e-21, MESA2 p = 6.62e-08, PPMI p = 1.99e-10. Box plots as in (b). Significance: ***p < 0.001. g Percentage of aging transcripts (Peters et al. 2015) that replicate in each cohort. Sample sizes for validation: n_total = 100 per cohort. 
Binomial test (two-sided): MESA1 p = 1.0000 (n_validated = 14), MESA2 p = 1.0000 (n_validated = 31), PPMI p = 1.0000 (n_validated=18), GC6 p = 1.0000 (n_validated=16), 500FG p = 1.0000 (n_validated = 8), JenAge p = 1.0000 (n_validated = 8). Bar heights represent the percentage of transcripts validated. Error bars show 68% confidence intervals. Significance: ns = not significant (p ≥ 0.05). Statistical significance is annotated: *p < 0.05, **p < 0.01, ***p < 0.001, ns = not significant (p ≥ 0.05).

As expected, DNAm levels at aging CpGs15 exhibited consistent age-dependent changes, with slightly stronger correlations for the MGB cohort compared to other cohorts (Fig. 3b). The age-CpG associations in the MGB500 cohort were robust, showing high concordance between Pearson and Spearman correlation methods, indicating that results are not driven by outliers (Supplementary Fig. 3). Sites with the strongest DNAm correlation with age, particularly those that gain methylation, were enriched in lowly-expressed and repressed genes (Fig. 3c). Such genes were enriched for strong PRC2 binding (Fig. 3d, e), consistent with our previous analyses of genome-wide PRC2-binding and genomic regions with age-associated CpGs17. Further enrichment analyses of these PRC2-associated multi-omic aging genes highlighted pathways related to immune function, cell differentiation, and developmental processes (Supplementary Fig. 4). We also confirmed the gain of methylation at CpG sites within promoters of genes whose expression is inversely associated with age in this dataset (Fig. 3f). Finally, we again compared the concordance of the top age-associated genes identified by Peters et al. with gene expression data, with the addition of DNAm data from the MGB cohort. We were able to validate 106 genes that feature both an age-dependent gain of DNAm and loss of expression levels; we term such genes multi-omic aging genes (see Methods, Supplementary Data 1). For these 106 multi-omic aging genes, there was no strong overall association between the magnitude of age-correlation for gene expression and the age-correlation for their associated CpG methylation (Supplementary Fig. 5). The addition of DNAm data considerably increased the percentage of age-associated genes we were able to validate in both MESA time points and in PPMI, as well as in the other three cohorts with RNA-seq data (Fig. 3g). 
Interestingly, we found our multi-omic aging genes to be particularly enriched for T cell-specific genes (e.g., CD27, CD28, CD248, TCF7). KEGG pathway enrichment analysis confirmed significant enrichment for T-cell specific functions among these multi-omic aging genes, including T cell receptor signaling and differentiation pathways (Supplementary Fig. 6). In contrast, the broader set of 1497 age-associated genes identified by Peters et al. based on transcriptomics alone did not show particular enrichment for T cell-specific pathways (Supplementary Fig. 7). To further explore the regulatory landscape, transcription factor enrichment analysis using ChEA identified key upstream regulators, including PRC2 components (e.g., EZH2, SUZ12) and factors involved in immune/haematopoietic cell development (e.g., RUNX), coordinating the expression of these multi-omic aging genes (Supplementary Fig. 8). These analyses demonstrate that integration of multiple omics modalities has the ability to refine the identification and validation of true aging-associated genes, where both DNA methylation and gene expression are jointly altered, pointing towards functional effects. Moreover, the high-quality dataset we generated exhibits strong age-related DNAm changes, covers a broad age range, and features a multitude of aging-associated outcome data such as mortality. Thus, this dataset is ideal for benchmarking existing and future biomarkers of aging.

Multi-omic aging genes predict aging outcomes

To evaluate the functional relevance of the identified multi-omic aging genes, we analyzed their associations with aging-related outcomes, particularly mortality. We utilized large datasets containing both DNA methylation and mortality data, including a subset of participants from the MGB Biobank (MGB-4K) and the Generation Scotland (GS) cohort. CpG sites associated with our multi-omic aging genes were strongly associated with all-cause mortality risk in both cohorts (Fig. 4a), with hazard ratios for the top CpGs ranging from approximately 1.3 to 1.75 per standard deviation (Fig. 4b). Notably, these CpGs selected via our multi-omic approach showed stronger and more consistent associations with mortality compared to CpGs selected based solely on age correlation without transcriptomic filtering (Supplementary Fig. 9). Importantly, survival analysis based on the top mortality-risk-associated CpG site showed clear stratification of subjects (Fig. 4c). Kaplan-Meier plots for additional top CpG sites in both the MGB-4K and GS cohorts, with and without cell composition adjustment, are provided in Supplementary Fig. 10 and Supplementary Fig. 11, respectively. In addition, disease gene network over-representation analysis of our multi-omic aging genes revealed enrichment for numerous aging-associated diseases, particularly those related to the aging immune system, such as various lymphomas (Fig. 5f). Thus, multi-omic aging genes are strongly associated with clinically relevant aging-related outcomes such as disease and mortality risk.
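Two of the quantities reported here, hazard ratios per standard deviation with 95% confidence intervals and FDR-adjusted P-values, can be derived from per-CpG Cox regression output as sketched below. The sketch assumes that ln(HR) and its standard error have already been estimated by a survival library (the model fit itself is not shown) and that the interval is built on the log scale, which is the conventional Wald construction; the function names and the example numbers are hypothetical.

```python
from math import exp


def hazard_ratio_ci(ln_hr, se, z=1.96):
    """Return (HR, lower, upper), with the interval built on the log scale."""
    return exp(ln_hr), exp(ln_hr - z * se), exp(ln_hr + z * se)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented inputs for illustration (not estimates from MGB-4K or GS).
hr, lower, upper = hazard_ratio_ci(ln_hr=0.405, se=0.08)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because the interval is symmetric around ln(HR), it is asymmetric around the HR itself, which is why forest plots of hazard ratios are usually drawn on a logarithmic axis.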

Fig. 4: Association of multi-omic aging genes with mortality.

a Volcano plots showing the association between DNA methylation levels and mortality in the MGB-4K (left) and GS (right) cohorts. Points represent individual CpG sites, colored by their natural logarithm of hazard ratio [ln(HR)]. The x-axis shows ln(HR) per 1 standard deviation increase in DNA methylation, and the y-axis shows -log10(FDR-adjusted P-values). Horizontal dashed line indicates the significance threshold (FDR-adjusted P < 0.05), and vertical dashed line indicates HR = 1. The top 10 CpGs with the largest absolute effect sizes (|ln(HR)|) among significant CpGs (FDR < 0.05) are labeled by gene name. Cox proportional hazards regression models were adjusted for age and sex. P-values were corrected for multiple testing using the Benjamini-Hochberg false discovery rate (FDR) method. All statistical tests were two-sided. b Forest plots displaying hazard ratios (HR) with 95% confidence intervals (CI) for the top 10 aging-associated CpG sites in each cohort. HRs represent the risk of all-cause mortality per 1 standard deviation increase in DNA methylation. Error bars indicate 95% CIs calculated as HR ± 1.96 × SE[ln(HR)]. Cox proportional hazards regression models were adjusted for age and sex. MGB-4K cohort: n = 4243 participants with 460 events. GS cohort: n = 18,859 participants with 1516 events. Exact two-sided P-values are shown for each CpG. No adjustment for multiple comparisons was applied to the P-values shown in the forest plots (nominal P-values). CpGs are sorted by ln(HR). c Kaplan-Meier survival curves showing the association between the top CpG site in each cohort (as determined by the largest absolute effect size) and all-cause mortality. Participants were stratified by methylation level quartiles: Q1 (lower quartile, blue) versus Q4 (upper quartile, red). Shaded areas represent 95% confidence intervals. Survival differences between quartiles were assessed using the log-rank test (two-sided). 
The number at risk at each time point is shown below each plot.

Fig. 5: Aging genes, cell differentiations, and cell compositions.

a Functional enrichment of genes associated with aging at the levels of gene expression and DNA methylation in blood. Dot size indicates the number of genes in each GO Biological Process term, and dot color indicates BH-adjusted P-value. Enrichment was performed with a two-sided Fisher exact test on GO BP terms. Redundant terms were removed with clusterProfiler::simplify. b Gene–concept network of multi-omic aging biomarkers. Gray nodes are genes, and yellow nodes are enriched GO terms. Edges indicate gene to term membership. GO node size reflects the gene ratio for that term. c Cell proportions versus age across immune cell types. Points represent individual participants. The solid line shows the mean fitted by ordinary least squares, and the shaded band shows the 95 percent confidence interval of the fit. Statistical model: two-sided OLS, proportion ~ age. For each cell type, slope, standard error, t statistic, degrees of freedom, 95 percent confidence interval, and two-sided P value are provided in Source Data. d Cell abundances versus age across immune cell types. Display conventions as in panel (c). Statistical model: two-sided OLS, abundance ~ age. For each cell type, slope, standard error, t statistic, degrees of freedom, 95 percent confidence interval, and two-sided P value are provided in Source Data. e Cell-composition-corrected aging CpGs. Each point is a CpG. The x-axis is the age coefficient from a two-sided linear regression of methylation beta value on age with covariates for estimated cell composition and technical factors. The y-axis is minus log10 of the BH-adjusted P-value. Coloring indicates whether the association remains after cell-composition correction. Labels highlight CpGs in SATB1, CD248, CD27, and CD28 and selected CpGs with large minus log10 FDR. For highlighted CpGs, effect size per year, 95 percent confidence interval, test statistic, degrees of freedom, and FDR-adjusted P-value are provided in Source Data. 
f Over-representation analysis for disease–gene associations. Bars show terms with BH-adjusted P-values (color scale) and counts (bar lengths). Enrichment used two-sided Fisher exact tests with BH FDR correction. Representative enriched terms: Disease Ontology, Primary immunodeficiency disease (8 genes, FDR = 4.31 × 10−5) and Sjogren’s syndrome (4 genes, FDR = 2.81 × 10−4); DisGeNET, Herpes NOS (7 genes, FDR = 6.98 × 10−7), T-cell lymphoma (12 genes, FDR = 1.40 × 10−6), mantle cell lymphoma (11 genes, FDR = 6.21 × 10−6), and follicular lymphoma (11 genes, FDR = 7.74 × 10−6). For each term, odds ratio, 95 percent confidence interval, test statistic, and adjusted P value are provided in Source Data.

Age-related methylation in aging genes is independent of cell composition changes

Blood is a heterogeneous tissue composed of multiple cell types, the proportions of which change with age. Given these age-dependent alterations in blood cell composition18, age-related changes observed in the transcriptome and epigenome may potentially be related to changes in cell proportions over time19. Noting a clear enrichment for T cell differentiation genes in our multi-omic aging genes (Fig. 5a, b), we sought to investigate whether aging patterns of our multi-omic aging genes were truly intrinsic to individual blood cell types or simply a reflection of age-related cell composition changes. To do so, we generated methylation-based cell composition estimates20,21 for our MGB cohort samples (Fig. 3a) using predictors of proportions for the most abundant blood cell types and ordinal abundance values for less common, yet age-dependent, cell types. We observed modest age-related changes in cell composition for the most common cell types (Fig. 5c). However, for naive T cells, we found a dramatic decline associated with age that is consistent with a well-documented feature of human immune aging (Fig. 5d)22. Finally, we characterized the relationship between age and methylation level for CpGs within our identified multi-omic aging genes after adjusting for cell composition. Significant associations of methylation levels at these CpGs with age remained after the adjustment (Fig. 5e), indicating that age-related changes in these CpGs are independent of changes in cell composition. The results were consistent when adjusting for cell type compositions using 12 cell-type deconvolution (Supplementary Fig. 12). Thus, methylation levels at CpG sites within our multi-omic aging genes are robustly associated with aging.
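The adjustment described in this section can be illustrated by regressing methylation on age with and without a cell-composition covariate. The following is a minimal sketch under simplifying assumptions: a single covariate (an estimated naive T cell fraction), no technical covariates, and noise-free toy data constructed so that the true age effect is known. The helper `ols` and all numbers are hypothetical and do not reproduce the model used in this study.

```python
def ols(X, y):
    """Least-squares coefficients for y ~ X via the normal equations.

    X: list of rows, each already including a leading 1.0 for the intercept.
    """
    k = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented toy data: naive T cell fraction declines with age, and methylation
# depends on both age (+0.002 per year) and naive T cell fraction (-0.3).
ages = [25, 35, 45, 55, 65, 75]
naive_t = [0.30, 0.27, 0.20, 0.18, 0.11, 0.10]
beta = [0.2 + 0.002 * a - 0.3 * c for a, c in zip(ages, naive_t)]

unadjusted = ols([[1.0, a] for a in ages], beta)                       # beta ~ age
adjusted = ols([[1.0, a, c] for a, c in zip(ages, naive_t)], beta)     # beta ~ age + naive_t
```

In this toy setting the unadjusted age slope is inflated by the age-related loss of naive T cells, whereas the adjusted slope recovers the age effect built into the data; an age coefficient that stays clearly non-zero after adjustment is the pattern this section reports.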

Discussion

As omic profiling becomes increasingly common in cohort studies of aging, integration across multiple datasets and omics modalities has the potential to reveal molecular changes associated with aging at a level of resolution previously inaccessible. In the current study, we provide one of the most comprehensive integrative multi-omic analyses of age-related DNAm and gene expression changes to date, leveraging DNAm and RNA-seq data from over 4600 and 3500 samples, respectively, drawn from diverse cohorts. In addition, we validated our findings in an external, high-quality reference DNAm dataset for aging biomarkers (Fig. 3) derived from a representative cohort in the Mass General Brigham Biobank. We expect that this DNAm dataset, generated on the Illumina Infinium MethylationEPIC v2.0 array, will prove to be a useful reference for benchmarking DNAm-based biomarkers of aging. We acknowledge the MGB dataset’s modest sample size as a limitation, though its EPIC2 platform provided superior CpG coverage critical for comprehensive epigenetic aging analysis. Future studies should integrate EPIC2 technology with larger cohorts to validate our findings and potentially identify additional aging biomarkers that may have been missed due to power constraints. While the DNA methylation array technology applied here measures only a subset of genomic CpGs (approximately 935,000 CpG sites, representing 5%), in contrast to sequencing approaches such as whole-genome bisulfite sequencing (WGBS), it offers several advantages for epigenetic studies, including technical reproducibility, standardized protocols and established analysis pipelines that significantly reduce technical variation between laboratories, and cost-efficiency.

Our analysis revealed that CpGs most strongly correlated with aging do not necessarily predict changes in the expression of their associated genes. For instance, the ELOVL2 gene that harbors the CpG most strongly correlated with age (Fig. 1g) and included in the models of many popular biomarkers of aging does not exhibit gene expression changes with age (Fig. 2c). This observation applies to most of the top age-related DNAm loci (Fig. 2a, c). This suggests that age-dependent gain of DNAm may occur at sites where cells can afford such changes because they do not lead to any functional consequences. Further work will be needed to identify the features that characterize such sites. Importantly, these results indicate that previous bioinformatic analyses of genes harboring such CpG sites should be interpreted with caution, as downstream ramifications of these DNAm changes may be limited.

To overcome this issue, we integrated DNAm and RNA-seq data to identify functionally relevant multi-omic aging genes in blood. We chose to focus on aging genes rather than genes associated with specific age-related diseases (e.g., Alzheimer’s disease, cardiovascular disease) to capture upstream, systemic biological processes that drive disease susceptibility across multiple diseases with age and may underlie a shared vulnerability to a spectrum of age-related conditions rather than processes specific to individual diseases. Because our analysis explicitly coupled age-related changes in DNAm to changes in gene expression, the results are more biologically interpretable than those derived from DNAm data alone. Aging genes identified using our multi-omic approach are enriched for genes associated with lymphocyte and T cell differentiation and activation (e.g., TCF7, CD27, CCR9, IRF8, among others; Fig. 5a), in line with the notion that age-dependent changes in T cell biology represent a prominent feature of aging23. Interestingly, it has previously been suggested that T cell aging could play a role in whole-body deterioration, indicating that our identified genes may represent important targets for strategies to combat age-related disease. While aging also results in well-described changes in immune cell proportion24 (Fig. 5c), our analysis suggests that for the majority of our multi-omic aging genes, the association with age remains even after correction for cell type proportions (Fig. 5e). Our findings also indicate that, in contrast to age-related genes identified via RNA-seq alone, multi-omic aging genes replicate much more robustly across datasets (Fig. 2e) and thus have the potential to improve the reliability of transcriptomic clocks, which to date have suffered from poor replicability across datasets.

Our observation that genes with strong age-related decreases in expression also tend to show substantial age-associated DNAm changes is consistent with some prior findings. Peters et al. (Nat Commun 2015) suggested that DNAm mediates the majority of associations between age and gene expression, with 1248 of 1497 age-associated genes showing at least one CpG site with a significant Sobel test. However, there are notable differences between our analyses. Their finding that such mediating CpGs were more likely to reside in enhancers and less likely to reside in promoters contrasts with our own findings (Supplementary Fig. 1), in which promoter CpGs showed the strongest age correlations in our MGB500 cohort. Moreover, our analysis focused specifically on CpG sites showing robust methylation gain with age, a process known to preferentially affect promoter regions. In addition, our study sought to identify multi-omic aging genes where both expression and DNAm shift with age, rather than to test mediation. Thus, while the mechanistic focus differs, our findings are complementary and highlight the complexity of DNAm regulation across genomic contexts during aging.

While the correlation with aging was interesting, we were particularly motivated to explore the ability of our multi-omic aging genes to predict aging outcomes. A mortality risk analysis in several independent cohorts demonstrated that CpG methylation levels at sites associated with our multi-omic aging genes are highly predictive of aging outcomes (Fig. 4a, c), with hazard ratios of individual sites of over 1.3 per standard deviation (e.g., cg06175418 in TCF7, Fig. 4b). Thus, these results suggest that integration of multiple omics allows for the identification of reliable and interpretable aging genes and CpGs that associate with aging outcomes.

Most importantly, this study demonstrates that integrative multi-omic analyses can overcome limitations associated with analyses of single-omic modalities. In this sense, multi-omic analysis represents an important frontier for the aging biology field. We anticipate that our findings will catalyze future integrative analyses and allow both for the development of improved and functionally interpretable biomarkers of aging using DNAm or transcriptomic data, and for the identification of putative genetic targets for future interventions.

Methods

All research was conducted in accordance with relevant ethical regulations. The study protocol was approved by the Mass General Brigham Institutional Review Board (protocol 2021P003059) and the Tayside Committee on Medical Research Ethics, on behalf of the National Health Service, for Generation Scotland.

Study datasets

The current study leverages several previously described DNA methylation and RNA sequencing datasets, shown in Table 1. Their underlying cohorts or population characteristics have been described in detail in publications by the respective original authors. In addition, we generated a DNA methylation dataset encompassing 500 individuals of various ages from the Mass General Brigham Biobank. This study was approved by the Mass General Brigham IRB (protocol 2021P003059). Details on methylation data and RNA sequencing data processing are described below. For comparison of molecular data between young and old groups, participants 45 years old or younger are labeled as young, and those 75 years or older are labeled as old.

The Generation Scotland (GS) cohort comprises 18,859 participants (58.8% women, 41.2% men) aged 18–99 years, recruited between 2006 and 2011. Whole blood samples were collected, DNA extracted, and methylation measured on the Illumina Infinium MethylationEPIC v1.0 array. Mortality follow-up was ascertained through the National Records of Scotland up to October 2023 (median follow-up 13.9 years). Data were processed using standard quality control pipelines, including background correction, probe filtering, and normalization25.

The Mass General Brigham 4K (MGB4K) cohort included 4243 participants (65.4% women, 34.6% men). Whole blood samples were collected, DNA was extracted, and methylation profiles were generated using the Illumina Infinium MethylationEPIC v1.0 array. Mortality follow-up was ascertained through linkage with electronic health records until July 28, 2022 (10.8% deceased). Data preprocessing included background correction, probe filtering, and normalization using the R SeSAMe package6.

MGB500 samples were selected from healthy control donors of diverse ages from the MGB Biobank. The donors were recruited from a major metropolitan academic medical center, were roughly balanced between male and female, and were generally representative of the racial/ethnic distribution of the local area (Fig. 3a).

The MESA cohort is a population-based sample of an initial 6814 men and women aged 45–84 without known cardiovascular disease. Thirty-eight percent of the recruited participants were White, 28 percent African American, 22 percent Hispanic, and 12 percent Asian, predominantly of Chinese descent. Participants were recruited from six field centers across the United States: Wake Forest University, Columbia University, Johns Hopkins University, University of Minnesota, Northwestern University, and University of California, Los Angeles (mesa-nhlbi.org).

Public DNAm and RNA-seq data were selected from datasets commonly used in epigenetic or transcriptomic studies of aging in generally healthy populations, with the exception of PPMI, which was included to increase the amount of multi-omic data. Diseased participants from the GC6 cohort were excluded from the analyses.

Participant characteristics and consent

This study included human participants from multiple cohorts. No participants were recruited specifically for this study. Blood samples from the MGB Biobank were used to generate DNA methylation data. The Mass General Brigham (MGB500) cohort comprised 500 individuals (461 after quality control) between 20 and 90 years of age, with a roughly balanced distribution of males and females. The MGB4K cohort consisted of 4243 individuals aged 20–90 years, of whom 65.4% were female and 34.6% were male. Generation Scotland included 18,859 individuals aged 18–99 years, of whom 58.8% were female and 41.2% were male. The Multi-Ethnic Study of Atherosclerosis (MESA) recruited 6814 participants aged 45–84 years, including White, African American, Hispanic, and Asian individuals with a balanced sex distribution. Additional cohorts, including PPMI, GC6, 500FG, JenAge, RA, Grady, and GENOA, are described in Table 1.

Participants provided written informed consent at the time of recruitment, as required by each biobank or study protocol. Ethical approval was obtained from the Mass General Brigham Institutional Review Board (protocol 2021P003059) and from the Tayside Committee on Medical Research Ethics, on behalf of the National Health Service, for Generation Scotland. Participants were not financially compensated for contributing biospecimens to the MGB Biobank or the Generation Scotland resource, in accordance with institutional policies for biobank participation.

Sex of participants was determined by self-report at enrollment, and gender identity data were not collected. Sex information was available and analyzed in aggregate when reporting cohort demographics. The study was not specifically designed to test sex- or gender-related differences in methylation or gene expression, since the primary aim was to identify aging-associated changes across populations, and therefore analyses were not stratified by sex. Future work will be required to investigate sex-specific effects in more detail. This study complies with SAGER guidelines: sex was reported for all cohorts, and justification has been provided for not conducting sex-specific analyses.

Human tissue samples

Human peripheral blood samples for DNA methylation profiling were obtained from the Mass General Brigham (MGB) Biobank (MGB500 and MGB4K cohorts). All participants provided informed consent for the research use of their biospecimens and data. Samples were processed to generate DNA methylation profiles using the Illumina HumanMethylationEPIC arrays. Residual DNA samples have been conserved within the MGB Biobank and may be available for further analysis. Requests for access to these samples can be made to the Mass General Brigham Biobank (biobank@partners.org) and will be considered in accordance with institutional and IRB policies. Samples from Generation Scotland (GS) were obtained under approval of the Tayside Committee on Medical Research Ethics, and have been retained in the Generation Scotland resource (contact: access@genscot.ac.uk). No unique samples were destroyed for this study beyond routine laboratory consumption during DNA extraction and array processing.

Methylation data generation and processing

Previously described, processed datasets were obtained from relevant sources (e.g., GEO, dbGaP, or others), and no further processing or filtering was conducted. Methylation data from the MGB Biobank were generated using the Illumina HumanMethylationEPIC version 2 array (cat# 20087709), encompassing over 950,000 CpG sites. Raw .IDAT files were preprocessed using standard parameters in the R SeSAMe package26, version 1.22.1, using the recommended processing for EPICv2 data, with the QCDPB parameter (qualityMask, inferInfiniumIChannel, dyeBiasNL, pOOBAH, noob). Samples and CpGs with more than 10 percent missing values were excluded from the generated beta value matrix. PRC2 binding regions were identified using EZH2 ChIP-seq peaks from ENCODE (ENCFF109KCQ), with CpGs within these regions marked as PRC2-associated sites, as in Moqri et al. 2024. Quality control analysis examined the distribution of the plate number as well as Type I and Type II probe loadings on the leading principal component across all cohorts to verify the absence of systematic technical bias (Supplementary Fig. 13).
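The 10 percent missingness filter can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which was run in R with SeSAMe. The function names, the use of None to mark a missing beta value, and the order of filtering (samples first, then CpGs) are assumptions made for illustration only.

```python
from typing import Optional

# Rows are CpGs, columns are samples; None marks a missing beta value.
Matrix = list[list[Optional[float]]]


def missing_fraction(values: list[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_beta_matrix(betas: Matrix, max_missing: float = 0.10):
    """Drop samples, then CpGs, whose missing fraction exceeds max_missing.

    Returns (filtered matrix, kept CpG row indices, kept sample column indices).
    Filtering samples before CpGs is an assumption of this sketch.
    """
    n_samples = len(betas[0])
    keep_samples = [
        j for j in range(n_samples)
        if missing_fraction([row[j] for row in betas]) <= max_missing
    ]
    reduced = [[row[j] for j in keep_samples] for row in betas]
    keep_cpgs = [
        i for i, row in enumerate(reduced)
        if missing_fraction(row) <= max_missing
    ]
    return [reduced[i] for i in keep_cpgs], keep_cpgs, keep_samples
```

Because removing a poor-quality sample changes the per-CpG missing fraction, the two possible filtering orders can give slightly different matrices.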

RNA seq data and processing

We obtained previously published RNA-seq datasets as raw gene expression counts. Count data were normalized by log transformation of (raw counts + 1), with the median adjusted to account for batch effects when merging datasets. The majority of analyses were conducted within individual datasets rather than on merged datasets.
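As a rough illustration of this normalization, the following Python sketch log-transforms counts and then centers each dataset on its own median before merging. The text does not specify the log base or the exact form of the median adjustment, so the natural logarithm and the per-dataset median subtraction used here are assumptions, as are the function names.

```python
import math
import statistics


def log_normalize(counts: list[list[int]]) -> list[list[float]]:
    """Apply log(count + 1) to every entry (natural log is an assumption)."""
    return [[math.log1p(c) for c in row] for row in counts]


def median_center(values: list[list[float]]) -> list[list[float]]:
    """Shift a dataset so its overall median is zero (assumed adjustment)."""
    med = statistics.median(v for row in values for v in row)
    return [[v - med for v in row] for row in values]


# Two hypothetical datasets (genes x samples) with different sequencing depth.
dataset_a = median_center(log_normalize([[0, 3], [7, 15]]))
dataset_b = median_center(log_normalize([[10, 30], [70, 150]]))
```

After centering, both toy datasets share a median of zero, which removes a global shift between batches but not gene-specific batch effects.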

Integrating DNA Methylation and RNA-seq data to identify multi-omic aging genes

To integrate DNA methylation and RNA-seq data, we employed a 4-step process:

  1. We initially identified aging transcripts (ATs) in the cell type of interest, defined by a statistically significant Pearson correlation with chronological age (p-value < 0.05), starting from the 1497 transcripts described as the ageing transcriptome by Peters et al. (2015);

  2. We outlined aging regulatory regions (ARGs) for each AT identified in step 1. ARGs were defined as the 1500 bases flanking the transcription start site (TSS) in Gencode Hg38;

  3. Aging CpGs (ACs) within each ARG were identified by marking regions with n ≥ 3 consecutive cytosine-guanine (CpG) sites whose DNA methylation levels were significantly associated with chronological age at R² > 0.2;

  4. Finally, candidate aging genes (candidate AGs) were defined by filtering for ACs whose methylation levels exhibited a discordant directionality with their associated ATs, e.g., a gain in DNAm with an associated downregulation in expression with age.

This process resulted in 106 multi-omic aging genes (Supplementary Data 1).
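Steps 3 and 4 of this procedure can be sketched in Python as follows. The sketch is illustrative rather than the authors' implementation: it omits the significance test of step 1 and the region definition of step 2, treats CpGs as already ordered by genomic position within one regulatory region, and assumes that all CpGs in a run must change in the same direction. All function names are hypothetical.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def aging_cpg_runs(ages, cpg_betas, r2_min=0.2, min_run=3):
    """Step 3: find runs of >= min_run consecutive CpGs with R^2 > r2_min.

    cpg_betas holds one list of beta values per CpG, ordered by position.
    Returns (start, end, sign) tuples with end exclusive. Requiring a shared
    sign within a run is an assumption of this sketch.
    """
    rs = [pearson_r(ages, b) for b in cpg_betas]
    runs, start = [], None
    for i, r in enumerate(rs + [0.0]):  # trailing sentinel closes an open run
        ok = r * r > r2_min
        same_sign = start is not None and ok and (r > 0) == (rs[start] > 0)
        if start is not None and not same_sign:
            if i - start >= min_run:
                runs.append((start, i, 1 if rs[start] > 0 else -1))
            start = None
        if ok and start is None:
            start = i
    return runs


def is_candidate_gene(ages, expression, cpg_betas):
    """Step 4: keep a gene if an aging-CpG run moves opposite to expression."""
    expr_sign = 1 if pearson_r(ages, expression) > 0 else -1
    runs = aging_cpg_runs(ages, cpg_betas)
    return any(sign == -expr_sign for _, _, sign in runs)
```

For a region with three adjacent CpGs that gain methylation with age and a transcript whose expression falls with age, is_candidate_gene returns True, matching the discordant pattern described in step 4.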

Validated aging genes (validated AGs) were defined as candidate AGs if the direction of age-dependent changes was confirmed in independent datasets. Of note, the number of validated genes in each cohort varies.

Survival analysis

Linkage to records for the MGB4K cohort was described previously in Chen et al. (2023)6. GS survival data were ascertained via linkage to the National Records of Scotland. The dates of death range from the time of the baseline assessment (which took place, depending on the participant, between 2006 and 2011) to October 2023, with the latter date representing the date of censoring for living participants. The median follow-up for GS data is 13.9 years (alive participants: 14.0; dead participants: 9.4).

For both datasets, we applied a multivariate Cox proportional hazards regression model to test the association between all-cause mortality and each aging CpG site, adjusting for age and sex. Next, we ranked the aging CpG sites by p-value from smallest to largest and selected the top 10 sites, as shown in the volcano plot. For these top aging CpGs, we drew a forest plot to show the point estimates and 95% confidence intervals of the adjusted hazard ratios for all-cause mortality. In addition, we plotted the adjusted survival curves for these top 10 aging CpGs.
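For reference, a per-CpG model of this kind can be written as below. The notation is introduced here for illustration and is not taken from the original analysis: m_i denotes the methylation level of a single CpG in participant i (standardized if the hazard ratio is to be read per standard deviation), h_0(t) is the unspecified baseline hazard, and the Wald-type confidence interval is an assumption about how the interval was obtained.

```latex
h_i(t) = h_0(t)\,\exp\!\left(\beta_{\mathrm{CpG}}\, m_i
        + \beta_{\mathrm{age}}\, \mathrm{age}_i
        + \beta_{\mathrm{sex}}\, \mathrm{sex}_i\right),
\qquad
\mathrm{HR}_{\mathrm{CpG}} = \exp\!\left(\hat{\beta}_{\mathrm{CpG}}\right),
\qquad
95\%\ \mathrm{CI} = \exp\!\left(\hat{\beta}_{\mathrm{CpG}}
        \pm 1.96\,\mathrm{SE}\!\left(\hat{\beta}_{\mathrm{CpG}}\right)\right)
```

Under this formulation, a hazard ratio above 1 means that higher methylation at the site is associated with higher all-cause mortality at a given age and sex.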

Cell composition analysis

To estimate the cell composition of each sample in the MGB cohort, we applied software provided by the Clock Foundation that deconvolutes cell composition using bulk methylation profiles. This software estimates the proportion of cell types through constrained quadratic programming for common blood cell types (CD8+ T cells, CD4+ T cells, NK cells, B cells, monocytes, and granulocytes)21 and penalized regression for rarer cell types (plasmablasts, CD8+CD28-CD45RA- T cells, naive CD8+ T cells, and naive CD4+ T cells)20.

For each CpG site associated with a multi-omic aging gene, we corrected for cell composition through the following approach: first, we applied multivariate linear regression to predict the methylation beta value of that CpG site using, as inputs, age and the cell proportion of each cell type, excluding granulocytes and plasmablasts due to high collinearity (defined as a variance inflation factor over 5). Next, we determined whether the regression coefficient p-value for age was still significant (Benjamini-Hochberg-corrected p-value < 0.05) in predicting CpG methylation.
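The multiple-testing step of this correction can be sketched in Python as follows. The sketch assumes that the per-CpG p-values for the age coefficient have already been obtained from the regressions described above; only the Benjamini-Hochberg adjustment is shown, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical age-coefficient p-values for four CpG sites.
age_p_values = [0.01, 0.04, 0.03, 0.005]
still_age_associated = [q < 0.05 for q in benjamini_hochberg(age_p_values)]
```

A CpG whose adjusted p-value stays below 0.05 would be counted as retaining its association with age after accounting for cell type proportions.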

Functional enrichment analysis

Functional enrichment of genes associated with aging across modalities was performed using Fisher’s exact test and gene terms from GO Biological Process, using all available genes in the ontology as background, with the clusterProfiler package in R. Redundant functions with high semantic similarity were filtered out with the simplify function. Adjustment for multiple comparisons was performed with the default Benjamini-Hochberg approach, and gene terms with adjusted p-value < 0.05 were considered statistically significant. Enriched functions were visualized with a dot plot. A gene-concept network of multi-omic blood biomarkers of aging and representative enriched functional terms was constructed with the cnetplot function from the enrichplot package in R. Over-representation analysis for the disease gene network (Fig. 5f) was conducted using R version 4.3.3 and the enrichDGN function from the DOSE package. This function leverages DisGeNET (Piñero et al., 2015) to construct disease-associated gene networks.
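The over-representation test underlying this analysis can be illustrated with a short Python sketch of the one-sided Fisher's exact test, written as a hypergeometric upper tail. This is a simplified stand-in for what clusterProfiler computes per term, not a reproduction of the package. The function name and the toy numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe: int, n_term: int,
                       n_query: int, n_overlap: int) -> float:
    """One-sided Fisher's exact (hypergeometric upper-tail) p-value.

    Probability of seeing at least n_overlap term-annotated genes when
    n_query genes are drawn without replacement from n_universe genes,
    of which n_term are annotated to the term.
    """
    total = comb(n_universe, n_query)
    upper = min(n_term, n_query)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 4 in the term, 3 query genes, 2 overlap.
p = enrichment_p_value(10, 4, 3, 2)
```

In a full analysis, one such p-value would be computed per GO term and the resulting set adjusted with the Benjamini-Hochberg procedure before applying the 0.05 threshold.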

Statistics & reproducibility

No statistical method was used to predetermine sample size. Sample sizes were determined by the availability of participants in each cohort and are consistent with previous large-scale epigenetic and transcriptomic studies. No data were excluded from the analyses unless they failed predefined quality control criteria (e.g., > 10% missing CpGs or poor sequencing quality), as described above. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

All statistical analyses were conducted in R (version 4.3.3) using standard packages as detailed above. Associations between DNA methylation or gene expression and age were assessed using Pearson correlation and validated with Spearman correlation for robustness. Multiple testing correction was performed using the Benjamini–Hochberg method, with adjusted p-values < 0.05 considered significant. Survival analyses were conducted using multivariate Cox proportional hazards regression, adjusting for age and sex.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.