OMICmAge quantifies biological age by integrating multi-omics with electronic medical records

Chen, Qingwen; Dwaraka, Varun B.; Carreras-Gallo, Natàlia; Armstrong, Jenel F.; Sehgal, Raghav; Argentieri, M. Austin; Richmond, Anne; Aparicio, Andrea; Mendez, Kevin; Chen, Yulu; Begum, Sofina; Kachroo, Priyadarshini; Prince, Nicole; Guo, Tao; Went, Hannah; Mendez, Tavis; Lin, Aaron; Turner, Logan; Moqri, Mahdi; Chu, Su H.; Kelly, Rachel S.; Weiss, Scott T.; Rattray, Nicholas J. W.; Gladyshev, Vadim N.; Karlson, Elizabeth; Wheelock, Craig E.; Mathé, Ewy A.; Dahlin, Amber; McGeachie, Michael J.; Marioni, Riccardo E.; Higgins-Chen, Albert T.; Smith, Ryan; Lasky-Su, Jessica

doi:10.1038/s43587-026-01073-7


Technical Report
Open access
Published: 25 February 2026


Nature Aging (2026)


Abstract

Biological aging reflects complex cellular and biochemical processes that can be measured across multiple omic layers. Using routine clinical laboratory data from ~31,000 participants in the Mass General Brigham Biobank, we developed EMRAge, a biomarker of mortality risk that can be broadly recapitulated across electronic medical records. Here we show that EMRAge can be modeled using elastic net regression with DNA methylation and multi-omics to generate DNAmEMRAge and OMICmAge, respectively. Both biomarkers are strongly associated with incident and prevalent chronic diseases and mortality, performing comparably to or better than current biomarkers across discovery (Massachusetts General Brigham Aging Biobank Cohort, n = 3,451) and validation cohorts (TruDiagnostic, n = 14,213; Generation Scotland, n = 18,672). Importantly, OMICmAge leverages epigenetic biomarker proxies to integrate proteomic, metabolomic and clinical domains while remaining quantifiable from DNA methylation alone. This framework establishes an accessible, scalable measure of biological aging with potential to reveal molecular interconnections that shape healthspan and disease risk.

Main

A major goal of aging research is to define biomarkers of aging that capture interindividual differences in functional decline, chronic disease development and mortality not identified through chronological age alone¹. Molecular and clinical data quantify complementary attributes of biological aging. Multiple molecular biomarkers of aging, or ‘clocks’, have been developed as proxies for these hallmarks of aging². These biomarkers have been based on a variety of measures such as telomere length³, neuroimaging data^4,5,6,7, immune cell counts⁸ and large-scale omics including DNA methylation (DNAm)^2,9,10,11, metabolomics¹², glycomics¹³ and proteomics^14,15,16,17.

Over the last two decades, electronic medical records (EMRs) have been widely used in clinical research, in particular for precision medicine, enabling deep phenotype mining from dense, comprehensive time-dependent data¹⁸. These data track longitudinal physiological change and real-time health status. Capitalizing on EMRs provides a unique opportunity to quantify the aging process in a reproducible way across clinical settings. While healthy aging encompasses both quality of life and lifespan, metrics of biological age have traditionally focused on using either clinical data to quantify quality of life^19,20, or mortality risk to quantify lifespan²¹, resulting in biological phenotypes that are optimized to one of these attributes, while not fully reflecting the other. With the wealth of data available via EMRs, biological aging phenotypes that incorporate both dense clinical data and mortality can be created to synthesize these important attributes of aging into a single measure.

Clinical data are essential in constructing and validating age readouts; connecting them to molecular underpinnings is equally important. Here we combine EMR data with multi-omic profiling to develop a more biologically informed measure of aging. The strong molecular link between DNAm and aging has driven widespread development of DNAm clocks reflecting clinical biomarkers (for example, PhenoAge¹⁹), mortality (for example, GrimAge²¹) and the rate of aging (for example, DunedinPACE²²).

Proteomics and metabolomics directly reflect biological processes that inform the aging process. The proteome is altered by hallmarks of aging including loss of proteostasis, dysregulated nutrient sensing, altered intercellular communication and cellular senescence²³. Although the source of circulating proteins and metabolites is often unclear, individual blood-based proteins and metabolites are established biomarkers for specific organ function (for example, albumin, C-reactive protein and creatinine), while peripheral omic signatures reflect organ-specific changes with aging^17,23,24. The metabolome not only provides critical information about metabolic processes, but also captures measures of environmental exposures, including xenobiotics, that may be critically linked to the aging process^25,26. Blood metabolomics captures molecules from multiple tissues across the body, providing rich aging-related information that may not be captured in methylation and transcriptomic clocks^27,28.

Despite the important advantages of other omics, the development of epigenomic, proteomic and metabolomic clocks for biologic aging phenotypes has been limited. Initial work has demonstrated that while individual omics clocks share commonalities, each omic data type provides a distinct window on the aging process²⁹, suggesting that the best and most clinically informative approach would integrate information from multiple omic measurements to create an optimized aging biomarker. However, the integration of multiple omics into a multi-omic clock or to inform DNAm-based readouts remains an area of unfulfilled clinical potential.

To this end, we used ~31,000 participants from the Massachusetts General Brigham (MGB) Biobank to develop and validate three distinct and clinically relevant measures of biological age: (1) EMRAge, a clinically based mortality predictor that can be broadly recapitulated across EMRs; (2) DNAmEMRAge, a DNAm aging biomarker trained to predict EMRAge; and (3) OMICmAge, a multi-omic-informed aging biomarker trained to predict EMRAge, using proteomic, metabolomic and clinical data distilled into DNAm via epigenetic biomarker proxies (EBPs). Overall, these aging biomarkers show strong associations with both incident and prevalent chronic disease outcomes and mortality, while further substantiating the biological relevance and value of integrating multiple omics data into one biological aging biomarker. To validate the findings, we used three independent cohorts—All of Us, TruDiagnostic Biobank and Generation Scotland.

Results

Overview of study design

We developed and validated three aging biomarkers: EMRAge, DNAmEMRAge and OMICmAge. Participants in the MGB Biobank with available plasma and clinical data were used to develop EMRAge (n = 31,264; Extended Data Fig. 1), which was validated using the All of Us cohort (n = 10,769). A subset of individuals from the MGB cohort also had available multi-omic data and were used to develop DNAmEMRAge and OMICmAge (Massachusetts General Brigham Aging Biobank Cohort or MGB-ABC; n = 3,451). These aging biomarkers were validated using two independent cohorts: the TruDiagnostic Biobank (n = 14,213) and Generation Scotland (n = 18,672; Fig. 1). Additional clinical characteristics and demographics are in Supplementary Table 1.

Development of EMRAge

We filtered 60,370 individuals and 28 clinical phenotypes (43 clinical variables; Supplementary Table 2) from the MGB Biobank down to 31,264 individuals with complete data on 19 clinical variables that were used to develop EMRAge (Extended Data Fig. 1). We split the cohort by 70:30 into training and testing sets. A Cox proportional-hazards model was fitted in the training set to estimate the weightings of the 19 selected clinical variables (Supplementary Table 3). In a manner analogous to the GrimAge approach²¹, we converted the linear combination of estimated weights and predictor values into an ‘age’ metric. The Pearson correlation coefficient (ρ) between EMRAge and chronological age was 0.76 (P < 0.001) in the testing set and 0.75 (P < 0.001) in the training set (Extended Data Fig. 2). We validated the EMRAge predictors by retraining the algorithm at four time points in 2-year increments: 1 January 2008, 2010, 2012 and 2014. The four derived equations were then applied to participants (N = 11,673) on 1 January 2016. The Pearson correlations among these estimates were ~1, affirming robustness (Fig. 2a). We then assessed the associations of EMRAge, PhenoAge and chronological age with aging-related health outcomes, including all-cause mortality, stroke, type 2 diabetes, chronic obstructive pulmonary disease (COPD), depression, cardiovascular disease (CVD) and any type of cancer. All association tests were adjusted for age (if using EMRAge or PhenoAge), sex, race, smoking status and alcohol consumption.
The prospective association analysis in the testing set shows that EMRAge has the largest hazard ratios (HRs) for all-cause mortality (HR = 4.53, P = 4.42 × 10⁻¹²⁹), stroke (HR = 2.00, P = 1.20 × 10⁻²²), COPD (HR = 2.21, P = 4.01 × 10⁻¹⁵) and cancer (HR = 2.22, P = 2.00 × 10⁻²⁷) and demonstrates comparable HRs for type 2 diabetes (HR = 2.05, P = 2.07 × 10⁻¹⁹), depression (HR = 1.59, P = 3.89 × 10⁻¹²) and CVD (HR = 1.99, P = 7.27 × 10⁻³²) when compared to PhenoAge (HR = 2.26, P = 1.03 × 10⁻¹⁸; HR = 1.67, P = 3.33 × 10⁻¹⁰; HR = 2.00, P = 1.00 × 10⁻²⁰, respectively; Fig. 2b and Supplementary Table 4). All associations were significant with a false discovery rate (FDR) threshold of 0.05 after adjusting for multiple testing. Kaplan–Meier plots show that EMRAge provides the best divergence of survival probabilities among ‘age’ groups, compared to PhenoAge and chronological age (Fig. 2c). Additionally, a comparative analysis including both EMRAge and PhenoAge in the same model revealed that EMRAge consistently shows higher HRs for the aging-related incident outcomes (Fig. 2d and Supplementary Table 5).
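The conversion of a Cox linear predictor into an ‘age’ metric can be illustrated with a minimal sketch. It assumes that the GrimAge-style step is a linear rescaling that matches the mean and standard deviation of chronological age; the exact transformation used for EMRAge is not stated in this section, and the function name and toy values below are hypothetical.

```python
from statistics import mean, pstdev


def to_age_units(linear_predictor, chronological_age):
    """Linearly rescale a risk score so its mean and s.d. match chronological age.

    Assumption: this mirrors a GrimAge-style calibration; it is not taken
    from the EMRAge supplementary material.
    """
    lp_mean, lp_sd = mean(linear_predictor), pstdev(linear_predictor)
    age_mean, age_sd = mean(chronological_age), pstdev(chronological_age)
    if lp_sd == 0:
        raise ValueError("linear predictor has zero variance")
    return [age_mean + age_sd * (lp - lp_mean) / lp_sd for lp in linear_predictor]


# Hypothetical Cox linear predictors (sum of weight * clinical value) and ages.
lp = [-1.2, -0.4, 0.0, 0.5, 1.1]
ages = [35.0, 48.0, 55.0, 62.0, 80.0]
emr_age_like = to_age_units(lp, ages)
print([round(a, 1) for a in emr_age_like])
```

Because the rescaling is linear and increasing, it preserves the ranking of individuals by mortality risk while expressing the score in years.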

**Fig. 2: Development, robustness and comparators of EMRAge.**

Validation of EMRAge

We validated EMRAge using data from the All of Us Research Program (CDR version 8). Missing lab values were imputed with the median of values recorded within ±1 year of enrollment. Following imputation, the cohort comprised 10,769 adult participants with complete data for both EMRAge and PhenoAge calculation. Among these participants, 378 (3.5%) were recorded as deceased by 1 October 2023. Demographic characteristics are detailed in Supplementary Table 1. We then performed association analyses of EMRAge, PhenoAge or chronological age with aging-related diseases. All association tests were adjusted for the same set of covariates as previously described. Our prospective analysis (Extended Data Fig. 3a and Supplementary Table 6) revealed that EMRAge demonstrated the strongest association with all-cause mortality (HR = 3.08 (per s.d., same below), P = 4.97 × 10⁻⁷⁴). Similarly, our cross-sectional analysis (Extended Data Fig. 3a and Supplementary Table 7) indicated that EMRAge exhibited the strongest associations with multiple prevalent aging-related diseases, including stroke (odds ratio (OR) = 1.08, P = 3.14 × 10⁻⁶²), COPD (OR = 1.10, P = 3.16 × 10⁻⁹⁶), depression (OR = 1.07, P = 1.48 × 10⁻²⁷) and cancer (OR = 1.10, P = 7.32 × 10⁻⁹⁰), when compared to PhenoAge and chronological age. Furthermore, in a joint analysis including both EMRAge and PhenoAge in the same model (Extended Data Fig. 3b), EMRAge showed a stronger association with all-cause mortality in the prospective analysis (HR = 2.40, P = 9.08 × 10⁻²²; Supplementary Table 8). The cross-sectional analysis from this joint model (Supplementary Table 9) further demonstrated that EMRAge exhibited stronger associations, as illustrated in Extended Data Fig. 3b.
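Effect sizes in this section are reported per s.d. of each biomarker. As a rough illustration, assuming a Cox coefficient β estimated per one-unit (one-year) increase, the corresponding per-s.d. hazard ratio is exp(β × s.d.). The coefficient and s.d. below are made-up values, not estimates from this study.

```python
from math import exp, log


def hazard_ratio_per_sd(beta_per_unit, sd):
    """Convert a per-unit log hazard ratio into a hazard ratio per s.d."""
    return exp(beta_per_unit * sd)


# Hypothetical inputs: HR of 1.10 per year and a biomarker s.d. of 10 years.
beta = log(1.10)
hr = hazard_ratio_per_sd(beta, sd=10.0)
print(round(hr, 2))
```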

Development of DNAmEMRAge

After developing the EMRAge measure, we created a DNAm surrogate predictor of EMRAge, DNAmEMRAge, using DNAm data in an elastic net regression model (alpha = 0.1) to select the CpG sites that are most predictive of EMRAge. The model for DNAmEMRAge included 1,097 CpG sites and age as predictors. A 25-fold cross-validation selected an optimal lambda value with R² = 0.827 between the observed and predicted values, suggesting good concordance in prediction. To further assess agreement, the data were split into a training set, composed of the samples used to generate the model (N = 2,762), and a test set of samples not used in the model. Within the training data, DNAmEMRAge and EMRAge values showed high correlation (Fig. 3a; N = 2,762, R² = 0.82, P < 2.2 × 10⁻¹⁶, ρ = 0.91, P < 2.2 × 10⁻¹⁶). A test dataset showed comparable correlations (N = 689, R² = 0.83, P < 2.2 × 10⁻¹⁶, ρ = 0.91, P < 2.2 × 10⁻¹⁶; Fig. 3b). The mean absolute error between DNAmEMRAge and EMRAge was 8.33 years in the training set and 8.50 years in the testing set, and the intraclass correlation coefficient (ICC) was 0.995 (Fig. 3c).
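To make the role of alpha and lambda concrete, the following is a minimal pure-Python sketch of elastic net regression fitted by coordinate descent. It assumes the glmnet-style parameterization, in which alpha mixes the L1 and L2 penalties and lambda sets the overall penalty strength; the software actually used is not named in this section, and the toy data and function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha=0.1, n_iter=100):
    """Minimize (1/2n)*sum((y - b0 - X.b)^2) + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean square of feature j
            for i in range(n):
                pred_without_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
        # The unpenalized intercept is the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


# Toy data: y = 1 + 2*x0, and x1 is uncorrelated with y.
X = [[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]]
y = [-1.0, 1.0, 3.0]
print(elastic_net(X, y, lam=0.0))   # unpenalized fit recovers slope ~2 for x0
print(elastic_net(X, y, lam=0.5))   # penalized fit shrinks x0 and drops x1
```

In practice lambda would be chosen by cross-validation, as described above, rather than fixed by hand.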

**Fig. 3: Correlation plots to EMRAge and ICCs for DNAmEMRAge and OMICmAge.**

Development of OMICmAge

Metabolomic, proteomic and clinical EBPs

Untargeted global plasma metabolomic profiling was performed on the Metabolon platform. After preprocessing and scaling, the final metabolomic dataset consisted of 1,459 metabolites, covering a broad range of metabolic pathways (Extended Data Fig. 4) across 1,986 individuals, among whom 1,691 were matched to methylation data. Global proteomic data were generated using the Seer SP100 platform, based on liquid chromatography–mass spectrometry (LC–MS). The final processed dataset consisted of 2,098 nonunique annotated proteins and 536 unique protein groups (denoted as ‘proteins’) across 1,789 individuals, among whom 1,475 were matched with methylation data. We further considered 46 clinical variables that have potential relationships with aging and aging-related outcomes. As implemented in the development of other aging biomarkers²¹, we restricted the number of EBPs for inclusion by selecting proteins and clinical variables with a significant Pearson correlation to EMRAge greater than 0.1 and a nominal P value < 0.05 (Supplementary Table 10). To select metabolites highly correlated with EMRAge while minimizing interdependence among the selected metabolites, we used hierarchical clustering. This process grouped the metabolites into 286 clusters characterized by low intercluster correlation (that is, the 90th percentile of average intercorrelations between clusters was below 0.15) and high intra-cluster correlation (that is, the 10th percentile of average intra-correlations within clusters was above 0.5). Subsequently, we selected the metabolite exhibiting the strongest correlation with EMRAge from each cluster. Following this strategy, 286 metabolites, 110 proteins and 25 clinical variables were retained (Fig. 4). We then generated epigenetic predictors (that is, EBPs) for each selected metabolite, protein and clinical variable via an elastic net regression model.
We retained all EBPs with a nominal P value (P < 0.05) and a Pearson correlation above 0.2 with their estimated metabolite/protein/clinical value. We observed strong correlations between several of the selected EBPs and actual clinical values (for example, ρ = 0.66 and 0.63 for C-reactive protein and HbA1C EBPs, respectively). In total, 266 metabolite EBPs, 109 protein EBPs and 21 clinical EBPs were retained, totaling 396 EBPs that were included as features in the predictive model for OMICmAge (Supplementary Table 11). OMICmAge was then generated by integrating proteomic, metabolomic and clinical data EBPs into a DNAm clock.
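The per-cluster metabolite selection described above can be sketched as follows. The sketch assumes that cluster assignments are already available from hierarchical clustering and that the ‘strongest’ correlation means the largest absolute Pearson correlation; feature names, cluster labels and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pick_cluster_representatives(features, clusters, target):
    """For each cluster, keep the feature with the largest |r| to the target."""
    best = {}
    for name, values in features.items():
        r = pearson(values, target)
        cluster = clusters[name]
        if cluster not in best or abs(r) > abs(best[cluster][1]):
            best[cluster] = (name, r)
    return {cluster: name for cluster, (name, _) in best.items()}


# Hypothetical EMRAge values and four metabolites in two clusters.
emr_age = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "metab_a": [1.0, 2.0, 3.0, 4.0, 5.0],   # r = 1.0
    "metab_b": [2.0, 1.0, 4.0, 3.0, 5.0],   # r = 0.8
    "metab_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # r = -1.0
    "metab_d": [1.0, 3.0, 2.0, 5.0, 4.0],   # r = 0.8
}
clusters = {"metab_a": "A", "metab_b": "A", "metab_c": "B", "metab_d": "B"}
selected = pick_cluster_representatives(features, clusters, emr_age)
print(selected)
```

Keeping one representative per cluster limits redundancy among the features passed to the downstream elastic net model.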

**Fig. 4: Illustration of the filtration process for DNAm-based multi-omic features used in the development of OMICmAge.**

Predictive model for the OMICmAge

OMICmAge was generated via a penalized elastic net regression model of EMRAge that included methylation CpG values, relative percentages of 12 immune cell subsets, 396 EBPs, age and sex as features in the model. This model retained 990 CpGs, 40 EBPs (16 protein EBPs, 14 metabolite EBPs and 10 clinical EBPs; Fig. 4) and age as selected predictors of EMRAge with varying weightings in the final model. The model did not retain any of the immune cell subsets after penalization. We tested an independent model including them as unpenalized features, but results did not change substantially. Thus, we continued with the model where all the features were penalized. Figure 3d–f shows the correlation between EMRAge and OMICmAge in the training (N = 2,762, R² = 0.83, P < 2.2 × 10⁻¹⁶; ρ = 0.91, P < 2.2 × 10⁻¹⁶) and testing (N = 689, R² = 0.84, P < 2.2 × 10⁻¹⁶; ρ = 0.92, P < 2.2 × 10⁻¹⁶) sets, as well as the ICCs using 30 replicates (0.998). In terms of error, the mean absolute error between OMICmAge and EMRAge was 4.96 years in the training set and 4.97 years in the testing set, lower than the mean absolute error for DNAmEMRAge (8.33 and 8.50, respectively).
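The agreement metrics quoted above (mean absolute error, Pearson correlation and R²) capture different things: correlation measures how well the ranking and linear trend are preserved, whereas the mean absolute error measures how far predictions fall from the target in years. The sketch below computes all three on made-up values and assumes R² is the squared Pearson correlation, which is consistent with the reported figures (0.91² ≈ 0.83) but is not stated explicitly.

```python
from math import sqrt


def mean_absolute_error(predicted, target):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical EMRAge targets and DNAm-based predictions, in years.
emr_age = [40.0, 50.0, 60.0, 70.0]
predicted = [45.0, 52.0, 58.0, 75.0]

mae = mean_absolute_error(predicted, emr_age)
r = pearson_r(predicted, emr_age)
r_squared = r ** 2
print(mae, round(r, 3), round(r_squared, 3))
```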

Comparison of OMICmAge to previous epigenetic biomarkers of aging

We compared DNAmEMRAge and OMICmAge to previous epigenetic aging biomarkers, including DunedinPACE²² and the principal component (PC) versions of PCHorvath¹⁰, PCHannum¹¹, PCPhenoAge¹⁹ and PCGrimAge²¹ for their improved precision³⁰. We compared the CpG sites included in the predictive model, and their relationship with immune cell subsets, aging-related disease outcomes and five- and ten-year mortality. Overall, we observed consistent correlations between all epigenetic clocks and immune subsets. We observed stronger correlations with sex for both OMICmAge and DNAmEMRAge (R = 0.28, P value = 0.02, and R = 0.36, P value = 0.009, respectively) when compared to previous aging biomarkers (Extended Data Fig. 5). There was minimal overlap between the CpG sites retained in DNAmEMRAge and OMICmAge and prior aging biomarkers (Fig. 5a); DNAmEMRAge and OMICmAge had 660 and 657 unique CpG sites, respectively, with 411 CpG sites shared between these measures. While PhenoAge and Horvath biomarkers share 50 CpG sites and Horvath and Hannum share 29, the maximum number of probes shared between OMICmAge and any previous aging biomarker is 3.

**Fig. 5: Comparison of OMICmAge and DNAmEMRAge to previously established aging biomarkers.**

To estimate OMICmAge, we use the 990 CpG sites retained in the model and the 40 EBPs, which require 10,315 additional CpG sites. Remarkably, 50.8% of these 11,305 CpG sites (5,740) are available on the 450 K array. A version for 450 K is in development.

We compared the prevalence and incidence of aging-related disease associations between OMICmAge, DNAmEMRAge and other aging biomarkers in MGB-ABC (Fig. 5b and Supplementary Tables 12 and 13). For prevalent disease associations, OMICmAge had the highest ORs for four of six aging-related chronic diseases assessed, with particularly high ORs for type 2 diabetes (OR = 5.04, P = 1.37 × 10⁻¹⁵) and CVD (OR = 4.62, P = 3.56 × 10⁻¹²). The association between PCGrimAge and COPD (OR = 3.97, P = 6.90 × 10⁻⁶) was also particularly high. OMICmAge also had the highest ORs for stroke (OR = 2.21, P = 1.4 × 10⁻⁴) and depression (OR = 1.94, P = 9.77 × 10⁻⁵) and chronological age had the highest OR for cancer (OR = 2.40, P = 4.85 × 10⁻¹³). However, the differences between several of the strongest aging biomarkers were all within the CIs. All these associations were significant after adjusting for multiple testing (FDR Q value ≤ 0.05). For incident disease associations, we also observed that OMICmAge had the highest HRs for type 2 diabetes (HR = 2.68, P = 6.14 × 10⁻⁴), CVD (HR = 3.28, P = 4.85 × 10⁻⁶) and all-cause mortality (HR = 11.31, P = 2.65 × 10⁻²³), which all met FDR significance (Q value ≤ 0.05). PCGrimAge, PCPhenoAge and chronological age were also FDR significant for CVD, while chronological age was the only measure that was significant for stroke (HR = 1.85, P = 9.59 × 10⁻⁴). No aging biomarkers were significantly associated with the incidence of depression, COPD or cancer.

We conducted similar analyses in the Generation Scotland cohort (Extended Data Fig. 6b and Supplementary Tables 14 and 15). While chronological age showed the strongest associations with all-cause mortality (HR = 5.58, P < 1 × 10⁻⁹⁹), incident stroke (HR = 4.10, P = 1.64 × 10⁻⁸⁸) and cancer (HR = 2.76, P = 9.15 × 10⁻¹⁹²), OMICmAge generally ranked second after PCGrimAge (except cancer). OMICmAge also ranked among the top two aging biomarkers for incident type 2 diabetes (HR = 4.18, P = 5.75 × 10⁻²⁵), CVD (HR = 4.14, P = 2.46 × 10⁻²¹) and COPD (HR = 1.97, P = 1.90 × 10⁻¹⁰). Notably, OMICmAge exhibited the strongest association with incident depression (HR = 3.14, P = 1.52 × 10⁻⁵). For prevalent disease associations, OMICmAge was also among the top two aging biomarkers by OR estimates, following PCGrimAge. However, no aging biomarkers were significantly associated with prevalent stroke after adjusting for chronological age and other covariates.

We further evaluated the prevalent disease associations in one additional cohort, the TruDiagnostic Biobank. In this cohort, PCGrimAge, OMICmAge, DNAmEMRAge and chronological age had FDR-significant associations with type 2 diabetes, CVD, COPD and cancer. PCGrimAge had the highest OR with COPD (OR = 6.15, P = 1.59 × 10⁻⁴). Overall, chronological age had stronger associations with aging-related diseases in the TruDiagnostic cohort than in the other cohorts. Chronological age was the only measure that was associated with stroke (OR = 2.21, P = 8.51 × 10⁻¹⁵) and had the highest ORs with CVD (OR = 1.74, P = 1.93 × 10⁻¹⁷²) and cancer (OR = 2.27, P = 3.26 × 10⁻¹⁴⁶). OMICmAge was in the top two highest associations for type 2 diabetes (OR = 2.78, P = 3.61 × 10⁻¹³), depression (OR = 1.26, P = 3.53 × 10⁻³) and cancer (OR = 1.50, P = 6.28 × 10⁻⁸; Extended Data Fig. 6a and Supplementary Table 16).

We also calculated the area under the curve (AUC) for 5-year and 10-year survival using prediction classifiers for OMICmAge, DNAmEMRAge, PCGrimAge, chronological age and other aging biomarkers in both MGB and Generation Scotland cohorts (Fig. 5c and Extended Data Fig. 7). In the prediction models, we included age for those biomarkers in which age is not included as a feature (all except OMICmAge, DNAmEMRAge and PCGrimAge). In the MGB testing set, DNAmEMRAge showed the highest AUC values (5-year AUC: 0.898, OR = 10.77, P = 1.14 × 10⁻¹⁴; 10-year AUC: 0.89, OR = 7.99, P = 2.17 × 10⁻¹⁷), followed by OMICmAge with very similar values (5-year AUC: 0.892, OR = 14.83, P = 5.25 × 10⁻¹⁴; 10-year AUC: 0.873, OR = 10.42, P = 2.53 × 10⁻¹⁶). Chronological age and the other methylation clocks had AUC values lower than OMICmAge and DNAmEMRAge. In the Generation Scotland cohort, OMICmAge also ranked as the second-best aging biomarker based on AUC values for both the 5-year (AUC: 0.861; OR = 3.86, P = 1.58 × 10⁻¹²) and 10-year (AUC: 0.859; OR = 4.13, P = 3.30 × 10⁻²⁹) periods, following PCGrimAge (5-year AUC: 0.870, OR = 8.13, P = 9.03 × 10⁻¹⁵; 10-year AUC: 0.866, OR = 8.08, P = 2.30 × 10⁻³¹).
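A fixed-horizon survival AUC of this kind can be sketched in a few lines. The example below labels each participant as deceased or alive at the horizon, drops those censored before the horizon (a simplification, since their status at the horizon is unknown) and computes the AUC as the probability that a decedent has a higher biomarker value than a survivor. This is not necessarily the classifier used in the study; the records and function names are hypothetical.

```python
def label_within_horizon(time_years, event, horizon):
    """1 if death was observed by the horizon, 0 if alive at the horizon,
    None if censored before the horizon (status unknown)."""
    if event and time_years <= horizon:
        return 1
    if time_years >= horizon:
        return 0
    return None


def auc(scores, labels):
    """Rank-based AUC: P(score of a case > score of a control), ties count 0.5."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0 for c in cases for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical records: (biomarker age, follow-up in years, death observed).
records = [
    (70.0, 2.0, True),
    (65.0, 6.0, False),
    (50.0, 8.0, False),
    (60.0, 3.0, False),  # censored before 5 years, so excluded
    (55.0, 4.0, True),
    (58.0, 7.0, True),   # died after 5 years, so alive at the horizon
]
labelled = [(s, label_within_horizon(t, e, 5.0)) for s, t, e in records]
kept = [(s, l) for s, l in labelled if l is not None]
auc_5y = auc([s for s, _ in kept], [l for _, l in kept])
print(round(auc_5y, 3))
```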

Finally, we evaluated the association between OMICmAge and lifestyle factors in both MGB and TruDiagnostic Biobank cohorts with varying lifestyle information, using the FDR to identify significance after adjusting for multiple testing (FDR Q value ≤ 0.05; Fig. 6 and Supplementary Tables 17 and 18). In all the cohorts, we observed significant negative associations with female sex, education level and exercise per week. We also observed consistent significant positive associations with Black race, obesity and tobacco smoking across all cohorts. While we observed a significant positive association with being underweight in the MGB-ABC cohort, this is likely an indication of illness among individuals with low body weight that is present in the MGB Biobank and not observed in the other cohorts. This relationship has been previously reported in epidemiological studies and in the proteomics-aging clock paper developed by Oh et al.¹⁷. Finally, occasional recreational drug use was significantly associated with higher OMICmAge, while antioxidants and omega-3 fish oil intake were significantly associated with a lower biological age in the TruDiagnostic cohort.
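FDR Q values such as those used throughout the Results can be obtained from raw P values with a step-up adjustment. The sketch below implements the Benjamini–Hochberg procedure; this is an assumption, as the specific FDR method is not named in this section, and the P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted Q values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone Q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical P values for four lifestyle associations.
p = [0.001, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
significant = [qi <= 0.05 for qi in q]
print([round(qi, 4) for qi in q], significant)
```

In this toy example only the first association passes the Q ≤ 0.05 threshold, even though three of the raw P values are below 0.05.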

**Fig. 6: Forest plot for the representation of the lifestyle factors associated with OMICmAge in the MGB-ABC and TruDiagnostic Biobank.**

Discussion

The objective of this study was to develop a clinically relevant aging biomarker that can be implemented into the current electronic infrastructure and an analogous DNAm biomarker informed by multi-omic data. We did this through the generation of EMRAge, DNAmEMRAge and OMICmAge. The motivating premise is that biological aging is a multifactorial process involving complex interactions of cellular and biochemical processes that is best understood via multiple omic profiles. To date, aging biomarkers (also known as ‘clocks’) have been generated predominantly with singular omic data types^3,4,5,6,7,8,9,10,11,12,13. While this approach most often generates highly predictive aging biomarkers, without additional molecular data the overall biological understanding is limited. This leaves a gap between predictive accuracy and driving physiological mechanisms.

A major goal for aging biomarkers is to implement these measures into clinical care and improve overall health. We used common clinical laboratory measures in ~30,000 individuals from the MGB Biobank to develop EMRAge. The use of readily available EMR data suggests that EMRAge can be broadly recapitulated across multiple EMR systems. While prior aging biomarkers were developed using either clinical data or mortality prediction models^31,32,33, EMRAge reflects a hybrid aging biomarker that distills health status and mortality into a single aging biomarker. EMRAge is highly reproducible, has strong associations with incident and prevalent chronic disease outcomes, and is an accurate predictor of mortality risk that outperforms chronological age and PhenoAge in both our discovery cohort and a large independent cohort. The broad clinical relevance and ease of large-scale implementation highlight the translational potential of EMRAge. Further assessment in diverse populations and across different EMR systems will characterize its generalizability.

Because EMRAge alone does not resolve specific physiology, we used machine learning to predict EMRAge with DNAm and multiple omics, generating DNAmEMRAge and OMICmAge, respectively. Both aging biomarkers had excellent accuracy for 5-year and 10-year mortality risk, and DNAmEMRAge maintained the highest accuracy in one of the validation cohorts. We also demonstrated that both DNAmEMRAge and OMICmAge have strong associations with aging-related health outcomes, including CVD, stroke, type 2 diabetes, COPD, depression and mortality. Notably, among studied DNAm-based aging biomarkers, OMICmAge often exhibits either the strongest (for example, for depression) or second strongest (for example, for type 2 diabetes) association with prevalent or incident aging-related morbidities across cohorts. These patterns hold across cohorts with differing health profiles and ascertainment. The consistency in the association findings suggests that DNAmEMRAge and OMICmAge are reliable and broadly applicable across a diverse range of cohort characteristics. This argument for OMICmAge is further supported by previous studies^34,35,36 indicating that a significant portion of the signal captured by epigenetic aging biomarkers stems from the accumulation of stochastic variation over time. In essence, as an aging biomarker becomes more predictive (that is, higher correlation) of chronological age, it increasingly reflects a pure stochastic process, thereby demonstrating less biological relevance. Given that OMICmAge exhibited a moderate correlation with chronological age in both MGB and TruDiagnostic Biobank cohorts, its broad applicability is largely justified by its substantial deviation from purely stochastic accumulation. This suggests that OMICmAge reflects non-stochastic physiological processes related to aging.

In addition, consistent reproducibility has previously been an issue with epigenetic biomarkers, which has traditionally only been improved through the inclusion of summary features such as PCs^30,37. With OMICmAge, we observed high ICCs, demonstrating strong reproducibility. One ongoing challenge with multi-omic aging biomarkers is the complexity of integrating different omic data together and the subsequent interpretation of the findings. Moreover, multi-omic clocks are often impractical due to high costs and logistics. The approach we used to develop OMICmAge has the advantage of estimating metabolites, proteins and clinical data (via EBPs) while distilling this into a single DNAm-based aging algorithm^29,31,38. DNAm was selected as the primary metric for its stability and cost-effectiveness. In this sense, OMICmAge reflects aging processes on multiple levels of systems biology while only necessitating DNAm in its calculation.

An advantage of OMICmAge and the development with concurrent multi-omic data is the ability to further elucidate the biological mechanisms associated with this biomarker. Although the epigenome is central to aging, functions implied by specific methylation perturbations are often unclear, limiting clinical interpretability when accelerated aging arises from heterogeneous mechanisms. In contrast to the epigenome, proteins reflect a broad range of aging-related biology, including immune function and inflammatory processes that are often well understood and have clear clinical implications for treatment and/or modification. Changes in oxidative stress, hormones and lipid profiles are just a few examples of the metabolic processes reflected via the metabolome that represent specific biology relevant for aging processes³⁷. When developing OMICmAge, we included protein and metabolite EBPs that were correlated with EMRAge to improve the likelihood of retaining physiologically relevant measures. Retained EBPs included albumin³⁸ and the androgenic steroid androsterone sulfate. The algorithm for OMICmAge also retained protein and metabolite EBPs with a less well-understood relationship with aging, such as ribitol, which has been identified as a metabolite predictive of mortality but has very little mechanistic information³⁹. It is important to highlight that not all retained features are causal, nor do they necessarily have the strongest overall associations with OMICmAge. Follow-up functional work and/or causal modeling is necessary to infer any potential causal links between these proteins/metabolites and EMRAge.

There are several limitations that need to be addressed in future work. First, EMRAge was developed using EMR data and is, therefore, primarily tailored for clinical data. The major advantage of this is that this measure uses real-world data and can be recapitulated in several EMR systems. However, real-world data are not systematically collected and so missingness is universal. To assess robustness, we calculated the median values in different time increments surrounding the time point when EMRAge is being estimated and found near-perfect pairwise correlations among the reconstructed EMRAge estimates. Furthermore, while EMRAge demonstrated superior performance over PhenoAge in predicting the risk of all-cause mortality using the All of Us Research Program data, EMRAge did not consistently outperform PhenoAge for other incident aging-related diseases, likely influenced by shorter follow-up periods (median: 3 years versus 5.5 years). Future work should test EMRAge across diverse populations and EMR systems to confirm validity and generalizability. A major advantage of OMICmAge is the incorporation of proteins, metabolites and clinical EBPs; however, more work is necessary to further improve the accuracy and precision of EBPs. Targeted, quantitative protein and metabolite assays will improve EBP accuracy and reflect the actual clinical levels. There is also room to expand upon the EBPs that were included in the feature space, both with additional metabolites/proteins and with other omics. Finally, additional validation of OMICmAge across diverse populations and with more aging biomarkers will continue to highlight potential advantages and limitations in this aging biomarker.

The present study introduces several notable steps in aging biomarkers. First, EMRAge advances the field as a hybrid aging biomarker that integrates clinical health data and mortality risk into a robust measure readily scalable across EMRs. Building on this foundation, we established DNAmEMRAge and OMICmAge, epigenetic aging algorithms that extend predictive accuracy and mechanistic insight. OMICmAge in particular integrates DNAm with proteomic, metabolomic and clinical data through EBPs, yet it remains measurable from DNAm alone. This systems-biology framework unifies multiple biological levels into a single readout of aging, enabling a more comprehensive and interpretable view than prior clocks. Together, these tools establish a clinically practical and biologically grounded platform for assessing biological age. Ongoing validation across diverse populations will extend their translational reach, with the potential to transform both research and clinical practice in aging and aging-related diseases.

Methods

Discovery cohort

MGB Biobank

The MGB Biobank is a large biorepository that provides access to research data and approximately 130,000 high-quality banked samples (plasma, serum and DNA) from >100,000 consented individuals enrolled in the MGB system⁴⁰. Written informed consent was obtained from all participants upon enrollment in the biobank. Participants were linkable to EMR data spanning their MGB medical histories and to surveys on lifestyle, environment and family history. Plasma donors initially totaled 60,371 from the MGB Biobank; 124 were excluded for being <18 years old at collection. Among remaining adults, vital status for 59,213 was verified (alive/deceased) with death dates recorded as of 28 July 2022. Another 28,329 participants were excluded for missing phenotype data (Extended Data Fig. 1).

MGB-ABC

The MGB-ABC comprises 3,451 randomly selected MGB Biobank participants to yield an age-, sex- and BMI-balanced sample representative of the Biobank. For selected participants, comprehensive EMRs plus metabolomic, proteomic and epigenetic data are available. Blood samples obtained during clinical care or research draws at Brigham and Women’s Hospital or Massachusetts General Hospital were used for serum, plasma and DNA/genomic analyses. Typical draws collected 30–50 ml and were linked to EMR data; the Biobank also captured additional health information at collection. Questionnaires were administered electronically or on paper and took approximately 10–15 min to complete. Items covered family history, lifestyle and environmental factors. Confidentiality and data security were prioritized: no personally identifiable information was collected, identities were protected, and survey data were encrypted.

The Phenotype Discovery Center of MGB integrates various data sources, including the Research Patient Data Registry, health information surveys and genotype results, into the Biobank Portal. This portal combines specimen data with EMR data, creating a comprehensive SQL Server database with a user-friendly web-based application⁴⁰. Researchers can perform queries, visualize longitudinal data with timestamps, use established algorithms to define phenotypes, utilize automated natural language processing tools for analyzing EMR data using the Informatics for Integrating Biology and the Bedside (i2b2) tool kit⁴¹ and request samples from cases and controls. Biobank Portal data include narrative clinical notes; text reports (cardiology, pathology, radiology, operative, discharge summaries); codified elements (demographics, diagnoses, procedures, labs, medications); and patient-reported exposures and family history from surveys. Additional measures (for example, lung function) were extracted using an in-house natural language processing-based algorithm.

Metabolomic profiling

Untargeted global plasma metabolomic profiles were generated by Metabolon. Coefficients of variation were measured in blinded quality-control (QC) samples randomly distributed among study samples. Batch variation was controlled for in the analysis. Sample preparation and global metabolomics profiling were performed according to methods described previously⁴². Metabolomic profiling was performed using four LC–MS methods that measure complementary sets of metabolite classes⁴³: (1) amines and polar metabolites that ionize in the positive ion mode; (2) central metabolites and polar metabolites that ionize in the negative ion mode; (3) polar and nonpolar lipids; (4) free fatty acids, bile acids and metabolites of intermediate polarity. All reagents and columns for this project were purchased in bulk from a single lot, and all instruments were calibrated for mass resolution and mass accuracy daily⁴⁴.

Metabolite peaks were quantified by the AUC. Raw area counts for each metabolite in each sample were normalized to the run-day median to correct inter-day instrument tuning differences, setting each run’s median to 1.0. Metabolites were identified by automated comparison of ion features to a ~8,000-entry reference library of chemical standards that includes retention time, molecular weight (m/z), preferred adducts, in-source fragments and associated mass spectrometry spectra, with visual QC using software developed at Metabolon⁴⁴. Known chemical entities were identified by comparison to library entries of purified standards. Recurrent, structurally unnamed biochemicals generated additional spectral entries; these may be resolved upon acquisition of matching purified standards or by classical structural analysis. QC and data processing used an in-house method⁴⁵,⁴⁶,⁴⁷. Metabolite features with signal-to-noise ratio < 10 or with undetectable/missing values in >10% of samples were excluded. Remaining missing values were imputed as half the minimum peak intensity for that feature across the cohort. Features with a pooled-sample coefficient of variation > 25% were removed to ensure technical reproducibility. Analyses used LC–MS peak areas; values were subjected to log transformation (approximate normality, variance stabilization) and Pareto scaling to harmonize measurement scales. After QC, 1,459 metabolites from 1,986 samples remained for analysis.
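The per-feature preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the in-house pipeline: the function names are hypothetical, missing values are represented as `None`, the natural logarithm is assumed because the base is not stated, and the signal-to-noise filter is omitted because it requires instrument-level data.

```python
import math
from statistics import mean, median, stdev


def run_day_normalize(values, run_days):
    """Divide each raw area by its run-day median so every run-day median is 1.0."""
    by_day = {}
    for v, d in zip(values, run_days):
        if v is not None:
            by_day.setdefault(d, []).append(v)
    med = {d: median(vs) for d, vs in by_day.items()}
    return [None if v is None else v / med[d] for v, d in zip(values, run_days)]


def passes_missingness(values, max_missing=0.10):
    """Keep a feature unless more than 10% of samples are missing/undetected."""
    return sum(v is None for v in values) / len(values) <= max_missing


def passes_cv(pooled_qc_values, max_cv=0.25):
    """Keep a feature unless the pooled-sample coefficient of variation exceeds 25%."""
    return stdev(pooled_qc_values) / mean(pooled_qc_values) <= max_cv


def impute_half_min(values):
    """Replace remaining missing values with half the feature's minimum observed value."""
    fill = min(v for v in values if v is not None) / 2
    return [fill if v is None else v for v in values]


def log_pareto(values):
    """Log-transform (natural log assumed), then Pareto-scale:
    center on the mean and divide by the square root of the SD."""
    logged = [math.log(v) for v in values]
    center = mean(logged)
    scale = math.sqrt(stdev(logged))
    return [(x - center) / scale for x in logged]
```

Pareto scaling divides by the square root of the standard deviation rather than the standard deviation itself, so high-variance features are down-weighted less aggressively than under unit-variance scaling.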

Methylation profiling

DNAm data were generated with the Illumina Infinium MethylationEPIC 850K BeadChip (>850,000 sites) covering CpG islands, non-CpG and differentially methylated sites, FANTOM5 enhancers, ENCODE open chromatin and transcription-factor binding, and microRNA promoters. Biobanked samples were stored at −80 °C and shipped to TruDiagnostic for extraction and preprocessing. From whole blood, 500 ng DNA was extracted and bisulfite-converted using the Zymo Research EZ DNA Methylation kit per the manufacturer’s instructions. Converted DNA was randomly assigned to Infinium HumanMethylationEPIC chip wells. Laboratory preprocessing comprised DNA amplification, hybridization to the EPIC array, staining/washing and imaging on the Illumina iScan SQ to generate raw intensities.

Raw methylation data for the MGB Biobank were processed using the ‘minfi’ pipeline⁴⁸, and low-quality samples were identified using the qcfilter() function from the ENmix package⁴⁹ with default parameters. In total, 4,803 samples passed quality assurance/QC (P < 0.05) and were deemed high quality. We also removed low-quality probes (P < 0.05 out-of-band) identified among the samples, retaining 721,802 of 866,239 probes, which indicates that a large portion of the methylation data were of high quality. To minimize sample-to-sample variation, we applied a combined normalization consisting of the Funnorm procedure (‘minfi’ package) followed by the RCP method (ENmix package), as noted in Foox et al.⁵⁰.

Proteomic profiling

We used the Seer proteomic platform to identify proteins and peptides related to chronological and biological aging. This uses nanoparticles with different binding capabilities to isolate and extract peptides and proteins via corona covalent attachment to the surface, paired with LC–MS/MS, and thus enables detection of low-abundance peptides and proteins.

Relative protein levels were quantified in 2,000 samples (1,600 MGB-ABC; 400 process controls) using the Proteograph Product Suite (Seer) with LC–MS. Samples were incubated with five proprietary nanoparticles on the Seer SP100 Proteograph to form protein coronas enabling physicochemical capture. Proteins were digested by trypsin, and relative levels quantified by the default data-independent acquisition method in Proteograph Analysis Software. To address peptide–protein ambiguity and improve quantification, peptides were aggregated into protein groups, then preprocessed with control-based normalization and outlier detection. This yielded estimates for 28,490 peptides across blood samples (mean 15,239) and 10,265 in controls (mean 4,281). Peptides were consolidated into 3,695 protein groups in MGB-ABC samples (mean 2,587) and 1,360 in plate controls. Following the signal drift and batch effect correction via the QC-robust spline correction algorithm⁵¹, we applied log₁₀ transformation, Pareto scaling and k-nearest-neighbor imputation based on current guidelines⁴⁸. Stringent filters, including 80% protein presence, relative standard deviation of quality control (RSD-qc) < 0.20%, and the D-ratio comparing technical variability to biological variability < 0.70, were utilized to reinforce data validity and reliability⁵². The final processed dataset consists of 2,805 nonunique, or 536 unique, protein groups, across 1,789 samples, in which the majority of samples (N = 1,475) matched to methylation data.
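As a rough illustration of the three protein-group filters named above, the following Python sketch computes presence, RSD in QC samples and the D-ratio for a single protein group. The function names are hypothetical, the D-ratio is assumed to be the ratio of the QC-sample standard deviation to the study-sample standard deviation, and the RSD-qc cutoff is deliberately left as a required argument so that the caller supplies it on the scale intended in the text.

```python
from statistics import mean, stdev


def presence_fraction(values):
    """Fraction of study samples in which the protein group was detected."""
    return sum(v is not None for v in values) / len(values)


def rsd(values):
    """Relative standard deviation, expressed as a fraction (SD / mean)."""
    return stdev(values) / mean(values)


def d_ratio(qc_values, sample_values):
    """Technical variability (QC SD) relative to biological variability (sample SD).
    Assumed definition; the exact formula used in the study is not given."""
    return stdev(qc_values) / stdev(sample_values)


def keep_protein_group(sample_values, qc_values, max_rsd_qc,
                       min_presence=0.80, max_d_ratio=0.70):
    """Apply the presence, RSD-qc and D-ratio filters to one protein group."""
    observed = [v for v in sample_values if v is not None]
    if presence_fraction(sample_values) < min_presence:
        return False
    if rsd(qc_values) >= max_rsd_qc:
        return False
    return d_ratio(qc_values, observed) < max_d_ratio
```

A protein group is kept only when all three conditions hold, so a group that is well detected but noisy in the QC samples is still removed.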

Definition of aging-related diseases

We utilized International Classification of Diseases (ICD)-9/ICD-10 codes to identify aging-related diseases, including type 2 diabetes, COPD, depression, cancer, stroke and other CVDs, as detailed in Supplementary Table 19. We also utilized SNOMED codes to identify the same set of diseases for the All of Us cohort, as detailed in Supplementary Table 20.

Validation cohort

All of Us cohort

The All of Us Research Program, an initiative launched by the National Institutes of Health (NIH) in 2018, represents a national collaborative effort to aggregate genetic, lifestyle, environmental and EMR data from one million participants⁵³. All participants provided written consent at the time of enrollment. Given the diverse data sources utilized by All of Us, including EMR data from healthcare facilities, participant surveys and self-reported measurements, the program implemented the Observational Medical Outcomes Partnership Common Data Model⁵⁴ to store and standardize these heterogeneous data, which were coded using the SNOMED CT dictionary. To facilitate the efficient retrieval of diagnosis records, we mapped the curated ICD-9 and ICD-10 codes to corresponding SNOMED CT codes, as detailed in Supplementary Tables 20 and 21.

On 4 February 2025, All of Us released the latest version of its Curated Data Repository (CDR v8), encompassing participant data up to a cutoff date of 1 October 2023. Within CDR v8, 389,379 adult participants had both EMR and lifestyle survey data available. Following that, we extracted demographic information, lifestyle factors, diagnosis records and laboratory test results necessary for the calculation of EMRAge and PhenoAge. To address missing laboratory values and ensure a sufficient sample size for association analyses, we imputed missing values with the median of all measurements recorded within one year of each participant’s enrollment date. This imputation process resulted in a final cohort of 10,769 participants with complete data from the All of Us Research Program.
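The median-based handling of laboratory values can be sketched as follows. This is an illustrative Python example and not the study code: the function name is hypothetical, the one-year window is assumed to be symmetric around the enrollment date, and a participant with no measurement in the window is assumed to remain incomplete for that laboratory test.

```python
from datetime import date
from statistics import median


def median_near_enrollment(measurements, enrollment, window_days=365):
    """Median of lab values recorded within `window_days` of enrollment.

    `measurements` is a list of (date, value) pairs for one participant and
    one laboratory test. Returns None when no measurement falls in the window.
    """
    in_window = [
        value
        for measured_on, value in measurements
        if abs((measured_on - enrollment).days) <= window_days
    ]
    return median(in_window) if in_window else None


# Usage with made-up values: three measurements fall inside the window.
enrollment = date(2020, 6, 1)
albumin = [
    (date(2019, 12, 1), 4.0),
    (date(2020, 7, 1), 5.0),
    (date(2020, 9, 1), 4.4),
    (date(2017, 1, 1), 9.0),  # outside the one-year window, ignored
]
imputed = median_near_enrollment(albumin, enrollment)
```

Using the median rather than the nearest single value reduces the influence of an isolated outlying measurement taken during an acute illness.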

TruDiagnostic Biobank cohort

The TruDiagnostic Biobank included 14,698 individuals who underwent the commercial TruDiagnostic TruAge test with DNAm profiling. Participants, recruited from October 2020 to April 2023, were predominantly based in the United States and generally healthier than MGB Biobank participants, likely reflecting proactive health interest and willingness to pay for testing. Most samples were obtained under healthcare provider guidance, and <5% were via direct-to-consumer testing. This recruitment likely introduces self-selection toward preventive care and fewer comorbidities. At enrollment, participants completed surveys covering personal information, medical, social, lifestyle and family history. The study was approved by the Institute for Regenerative and Cellular Medicine Institutional Review Board, and all participants provided written informed consent.

Peripheral blood was collected via lancet/capillary, placed in lysis buffer and DNA extracted. Bisulfite conversion of 500 ng DNA was performed with the Zymo Research EZ DNA Methylation kit per the manufacturer’s instructions. Converted DNA was randomly assigned to Infinium HumanMethylationEPIC BeadChip wells, then amplified, hybridized, stained/washed and imaged on an Illumina iScan SQ to generate raw intensities.

TruDiagnostic methylation data were preprocessed using the MGB-ABC pipeline, with normalization adapted for computational constraints. In total, 14,213 individuals (96.7% of originals) passed quality assurance/QC (P < 0.05). To retain CpGs needed for clock calculation, no probes were removed. We applied normal-exponential out-of-band (Noob) normalization using minfi’s preprocessNoob function. Finally, we used a 12-cell immune deconvolution method to estimate cell-type proportions⁵⁵,⁵⁶,⁵⁷.

Generation Scotland

Generation Scotland is a Scottish, family-based cohort study with over 24,000 volunteers, aged 17–99, stemming from >5,500 families⁵⁸. The majority of volunteers provided blood samples at a baseline clinic between 2006 and 2011 in addition to completing health and lifestyle questionnaires and giving consent for data linkage to their electronic health records. All components of Generation Scotland received ethical approval from the NHS Tayside Committee on Medical Research Ethics (REC reference number: 05/S1401/89). All participants provided broad and enduring written informed consent for biomedical research. This study was performed in accordance with the Declaration of Helsinki.

DNAm data have been profiled using the Illumina EPICv1 array. QC details have been described previously⁵⁹. Briefly, samples were assessed in four sets yielding data for 18,869 individuals after QC (N-set1 = 5,087, N-set2 = 459, N-set3 = 4,450, N-set4 = 8,873). Here, after the removal of 11 individuals who subsequently withdrew consent, we had data for 18,858 volunteers. Secondary care linkage was available for 99% of Generation Scotland volunteers (N-analysis = 18,672). OMICmAge and the other epigenetic biomarkers were estimated as described for the MGB-ABC cohort.

Event status and age at event for six disease outcomes and all-cause mortality were determined via linkage to electronic health records. The secondary (ICD) and primary care (READ) codes used to define each outcome are listed in Supplementary Table 22. Secondary care linkage was available for 99% of Generation Scotland volunteers. While all volunteers provided consent for linkage to primary care records, these are currently available for only ~40% of volunteers due to consent constraints with the data holders (individual GP surgeries). The latest date of linkage for primary and secondary care records was October 2023, which was set as the censoring date for the time-to-event analyses. Mortality records up to October 2023 were obtained via linkage to the National Records of Scotland.
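To make the censoring rule concrete, the following Python sketch derives event status and age at event (or at censoring) for one volunteer. It is a simplified illustration under stated assumptions: the function name is hypothetical, the exact day of the October 2023 cutoff is not given in the text and 31 October 2023 is used only as a placeholder, and competing events such as death before diagnosis are not handled.

```python
from datetime import date


def event_status_and_age(birth, event_date, censor_date):
    """Return (event, age_years) for a time-to-event analysis on the age scale.

    event is 1 if a diagnosis was recorded on or before the censoring date;
    otherwise the volunteer is censored (event 0) at the censoring date.
    """
    if event_date is not None and event_date <= censor_date:
        end, event = event_date, 1
    else:
        end, event = censor_date, 0
    return event, (end - birth).days / 365.25


# Placeholder censoring date; only the month and year are stated in the text.
CENSOR = date(2023, 10, 31)

diagnosed = event_status_and_age(date(1960, 1, 1), date(2015, 1, 1), CENSOR)
event_free = event_status_and_age(date(1960, 1, 1), None, CENSOR)
```

A diagnosis recorded after the censoring date is treated the same as no diagnosis, which keeps follow-up comparable across volunteers whose records were linked at different times.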