Introduction

When a woman is diagnosed with breast cancer, it is generally not possible to ascertain how long the tumor has been growing. Yet knowledge about a tumor’s age at diagnosis could provide important prognostic clues: older indolent tumors that are less likely to progress may require less invasive treatment, whereas younger fast-growing tumors require more urgent and aggressive treatment. While there is a rich literature on the estimation of the mean sojourn time of breast cancer1,2,3,4,5—the average time tumors spend in a detectable but asymptomatic, pre-clinical state—there is a paucity of tools to assess the age of individual tumors at the time of detection.

Epigenetic clocks provide a promising approach to estimate individual tumor age. Originally developed to quantify the biologic aging process in humans, epigenetic clocks leverage specific patterns of DNA methylation that are strongly correlated with biologic tissue age6,7. Broadly, these clocks focus on the DNA methylation status of CpG sites across the genome that become differentially methylated with increasing tissue age. These clocks are well suited for estimating the number of cell divisions since birth of the person, making them valuable tools for studying normal tissue aging and pre-neoplastic changes8,9,10,11,12. However, they are less effective for estimating tumor mitotic age, defined as the number of cell divisions since birth of the first tumor cell.

In contrast to “one-way” clocks that measure age-related methylation changes, “two-way” epigenetic clocks leverage CpG sites that fluctuate between the unmethylated and methylated states on a relatively fast time scale (Fig. 1A). Originally introduced in the context of homeostatic intestinal stem cell dynamics13, such stochastic epigenetic clocks have also been applied to hematologic malignancies14. Here we follow similar design principles to develop a breast cancer-specific two-way epigenetic clock to measure individual tumor mitotic age at diagnosis based on average methylation levels (\(\beta\)-values) of select CpG sites included in standard methylation arrays. Unlike “one-way” methylation clocks that begin ticking at birth, our clock is reset at onset of tumor growth, allowing it to track the number of cell divisions up to the time of diagnosis.

Fig. 1: Fluctuating CpG sites (fCpG) and modulators of tumor age.

A Balanced fCpGs stochastically oscillate between the unmethylated, hemi-methylated, and methylated states. B During tumor cell division, methylation and de-methylation events occur stochastically. As illustrated for the first three generations of an exponential tumor expansion, the average methylation status (β-value) evolves over time. C In silico modeling of the fCpG dynamics, based on a simple birth-death model of tumorigenesis. The average methylation value (β) is simulated for three independent fCpGs, whose initial states in the first tumor cell are fully methylated (red), hemi-methylated (yellow), and fully unmethylated (blue), respectively. Inset: the number of tumor cells over time. Simulation parameters: cell proliferation rate \(\alpha =0.17\) divisions/day; cell death rate \(\lambda =0.15\) deaths/day; (de-)methylation rate \(\mu =0.002\) flips/division. See Methods for details. D The age of a tumor at detection is modulated by relative strengths of intrinsic proliferation and growth-suppression induced by the immune tumor microenvironment (TME); tumors will reach a detectable size faster when proliferation is greater and/or the TME is weaker.

The proposed stochastic clock measures the entropy of an ensemble of fluctuating CpG (fCpG) sites. In the tumor’s most recent common ancestor cell, each fCpG was either unmethylated (\(\beta =0\)), hemi-methylated (\(\beta =0.5\)), or methylated (\(\beta =1\)). As the tumor expands, replication errors produce a mixture of cells with different methylation states (Fig. 1B), thus progressively increasing the tumor’s epigenetic entropy15. In the special case of unbiased fCpG sites—whose methylation and de-methylation rates are in balance—the bulk-level methylation converges to \(\beta =0.5\) with increasing tumor mitotic age, regardless of the first tumor cell’s state (Fig. 1C). Thus, by measuring the distribution of unbiased fCpG sites, we can derive an estimate of the age of a given tumor cell population relative to the start of the most recent clonal expansion.
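The convergence to \(\beta =0.5\) can be illustrated with a minimal sketch. It assumes a simplified two-state model in which each allele flips independently with probability \(\mu\) per division, at equal methylation and de-methylation rates; this is not the birth-death simulation described in Methods. Under that assumption the expected methylated fraction obeys \(p_{n+1}=\mu +(1-2\mu )p_{n}\), whose fixed point is 0.5. The function name `expected_beta` is hypothetical.

```python
def expected_beta(beta0: float, mu: float, n_divisions: int) -> float:
    """Expected bulk beta-value after n divisions under symmetric flipping.

    Simplifying assumption (not taken from the paper's Methods): each
    allele flips state independently with probability mu per division,
    and methylation and de-methylation are equally likely.
    """
    # Closed form of p_{n+1} = mu + (1 - 2*mu) * p_n, fixed point 0.5.
    return 0.5 + (beta0 - 0.5) * (1.0 - 2.0 * mu) ** n_divisions


mu = 0.002  # flips per division, the value quoted in the Fig. 1 caption
for beta0 in (0.0, 0.5, 1.0):
    # Each starting state is followed over an increasing number of divisions.
    trajectory = [expected_beta(beta0, mu, n) for n in (0, 100, 500, 2000)]
    print(beta0, [round(b, 3) for b in trajectory])
```

In this simplified picture the distance from 0.5 shrinks by a factor \((1-2\mu )\) per division, so the initially unmethylated and methylated sites approach 0.5 symmetrically while hemi-methylated sites stay there in expectation.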

The combination of tumor-specific age estimates and gene expression profiles further provides a unique opportunity to characterize the evolutionary and ecological pressures that shape the temporal landscape of breast cancer. Notably, aggressive tumors that evolve in a weakly suppressive immune microenvironment are expected to reach a detectable size faster than indolent, slow-growing tumors in a strongly suppressive immune microenvironment (Fig. 1D).

The manuscript is structured as follows. Combining DNA methylation and gene expression data from several hundred breast cancer and normal breast tissue samples, we first identify a set of unbiased fCpG sites and introduce the epigenetic clock index as a proxy measure of tumor mitotic age. We then evaluate the face validity of the index by examining its relationship with established prognostic markers, and we combine methylation and gene expression data to identify tumor- and microenvironment-specific factors that modulate tumor mitotic age. Finally, we validate key properties of the clock index in independent cohorts of patients with paired primary-metastasis samples, and we derive quantitative estimates of individual breast cancers’ mitotic and calendar ages.

Results

Selection of unbiased fCpG sites

To identify a set of unbiased fCpG sites in breast cancer, we used 450K methylation array data from 634 invasive breast cancers in The Cancer Genome Atlas16 (TCGA) and 79 normal breast tissue samples17. Using a two-step selection process, we identified an ensemble of fCpG sites with balanced (de-)methylation rates as follows.

First, we identified CpG sites with an average \(\beta\)-value close to 0.5 in both normal breast tissue and breast cancers, thus excluding sites with an inherent bias toward methylation or de-methylation, and sites that are subject to systematic selection during homeostasis and/or tumorigenesis (Fig. 2A). To avoid confounding by cell type heterogeneity, we further constrained candidate sites to be balanced within luminal and basal epithelial breast tissues, respectively (Fig. 2B).

Fig. 2: Selection and validation of unbiased fluctuating CpG (fCpG) sites.

A Starting with all CpG sites (gray dots), sites with an average methylation value β between 0.4 and 0.6 in both breast cancers (N = 634, TCGA cohort) and normal samples (N = 79, Normal cohort) were labeled as Unbiased Set I (yellow shaded rectangle). B Starting with sites in Unbiased Set I (yellow dots), sites with an average methylation value β between 0.4 and 0.6 in both normal breast luminal epithelial tissue samples (N = 3, single-cell DNAm atlas) and normal breast basal epithelial tissue samples (N = 4, single-cell DNAm atlas) were labeled as Unbiased Set II (red shaded rectangle). C Sites in Unbiased Set II were ranked by their inter-tumor standard deviation, and the top 500 were included in the clock set \({{\mathscr{C}}}\). D Distribution of individual fCpG β-values across tumor and normal cohorts; each curve represents the kernel density estimate of the distribution of a single fCpG across the TCGA cohort (blue; N = 633) or the normal breast cohort (yellow; N = 79). E In a separate cohort of breast cancers (N = 146, Lund cohort), the inter-tumor standard deviation of \(\beta\) was higher for fCpG sites in the clock set \({{\mathscr{C}}}\) (median: 0.188) compared to CpGs not included in \({{\mathscr{C}}}\) (median: 0.102; \(P < {10}^{-10}\); Wilcoxon rank-sum test). F In a small cohort of patients (N = 5) with multiple primary samples (3-5 per patient), the intra-tumor standard deviation of β was higher for CpG sites in the clock set \({{\mathscr{C}}}\) (median: 0.056), compared to CpGs not included in \({{\mathscr{C}}}\) (median: 0.026; \(P < {10}^{-10}\), Wilcoxon rank-sum test).

In the second step, we ordered the set of unbiased CpG sites by between-tumor variability and included only the 500 most fluctuating sites in the final clock set of unbiased fCpGs (Fig. 2C; see Supplementary Data 1 for clock fCpGs). Importantly, this final step excludes non-informative sites that either do not fluctuate at all (i.e., imprinted hemi-methylated state) or fluctuate too fast (i.e., steady-state methylation of \(\beta \approx 0.5\) reached on time scales much shorter than the average tumor mitotic age at diagnosis). To ensure that this step did not select for CpG sites that mainly captured cell-type composition effects, we verified that individual fCpGs had low \({{\rm{\beta }}}\)-value variation across the normal samples (median standard deviation of 0.06 compared to 0.19 across tumors; \(P < {10}^{-10}\), Wilcoxon rank-sum test; Fig. 2D). Furthermore, as we expected, the fCpG sites were enriched for non-genic/regulatory CpGs, with 27.6% of fCpGs versus 17.8% of non-fCpGs being non-genic/regulatory (\(P=2\times {10}^{-8}\), \({\chi }^{2}\) test).
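The two-step selection can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a dictionary mapping each CpG identifier to its β-values across samples), the function name `select_clock_sites`, and the use of the sample standard deviation are all hypothetical choices, and the additional luminal/basal epithelial filter (Fig. 2B) is omitted for brevity.

```python
from statistics import mean, stdev


def select_clock_sites(tumor_betas, normal_betas, lo=0.4, hi=0.6, top_n=500):
    """Return up to top_n unbiased, highly variable CpG site identifiers.

    tumor_betas / normal_betas: dict mapping site id -> list of beta-values
    across tumor or normal samples (hypothetical input layout).
    """
    # Step 1: keep sites whose mean beta lies in [lo, hi] in BOTH cohorts.
    unbiased = [
        site
        for site in tumor_betas
        if site in normal_betas
        and lo <= mean(tumor_betas[site]) <= hi
        and lo <= mean(normal_betas[site]) <= hi
    ]
    # Step 2: rank by between-tumor standard deviation, most variable first.
    ranked = sorted(unbiased, key=lambda s: stdev(tumor_betas[s]), reverse=True)
    return ranked[:top_n]


# Toy data: only cg_a and cg_b pass the mean filter in both cohorts.
toy_tumor = {"cg_a": [0.1, 0.9, 0.5], "cg_b": [0.45, 0.5, 0.55], "cg_c": [0.9, 0.95, 0.85]}
toy_normal = {"cg_a": [0.5, 0.5, 0.5], "cg_b": [0.5, 0.5, 0.5], "cg_c": [0.5, 0.5, 0.5]}
print(select_clock_sites(toy_tumor, toy_normal, top_n=2))
```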

Next, we sought to validate the unbiased and fluctuating nature of the clock set in two independent cohorts. In a cohort of 146 breast cancer patients (Lund cohort)18, we found significantly higher inter-tumor variability in \(\beta\)-values among the CpG sites in the clock set, as compared to the CpG sites not included in the clock set (Fig. 2E). Similarly, in a small cohort of 5 patients with multiple samples from their primary tumors19, we found elevated intra-tumor variability in clock set vs non-clock set sites (Fig. 2F). Together, these patterns corroborate the unbiased and fluctuating nature of the clock set of CpG sites.

Interestingly, the fCpG sites in the clock set were more tightly concentrated around \(\beta =0.5\) in each normal breast sample, as compared to breast cancers (Supplementary Fig. 1). Consistent with the underlying dynamic model of the clock (Fig. 1A), this suggests that over decades of breast development and maintenance, the fCpGs had converged to the stationary methylation state of \(\beta =0.5\) (Fig. 1C).

Epigenetic clock index

At the level of individual tumors, the 500 fCpG sites in the clock set exhibited primarily unimodal or bimodal distributions of \(\beta\)-values (Fig. 3A). We explored how these tumor-specific distributions of \(\beta\)-values could be used to estimate tumor mitotic age. In the founding tumor cell, each fCpG starts in either the unmethylated (\(\beta =0\)), hemi-methylated (\(\beta =0.5\)), or methylated (\(\beta =1\)) state (Fig. 1A). Although the trajectories of individual sites are subject to stochastic fluctuations (Fig. 1C), an ensemble of sites starting in the same initial configuration collectively drifts toward the steady state of \(\beta =0.5\) (Fig. 3B).

Fig. 3: The distribution of β-values across the clock set encodes tumor age.

A The empirical β-value distributions for the clock set fCpGs (N = 500) are shown for three select tumors in the TCGA cohort. B Simulated trajectories for an ensemble of fCpG sites (N = 90), starting in the unmethylated, hemi-methylated, and methylated initial configurations, respectively (n = 30 each). Simulation parameters as detailed in Fig. 1. C Cross-sectional \(\beta\)-value distributions for the simulated clock set in panel B, shown after 0, 1, and 2 years of growth. D Standard deviation (sβ) of \(\beta\)-values and epigenetic clock index (cβ = 1 - sβ) over 2 years of growth for the simulated clock set in B. E The distribution of epigenetic clock index values (\({c}_{\beta }\)) across invasive ductal carcinomas in TCGA (N = 400); the three tumors from (A) are labeled.

By considering the histograms of \(\beta\)-value distributions at different tumor mitotic ages, we can track the evolution of the three “peaks” corresponding to the subsets of initially unmethylated, hemi-methylated, and methylated clock sites (Fig. 3C). As the tumor’s mitotic age increases, the left peak of the histogram (consisting of originally unmethylated clock sites) starts moving to the right, whereas the right peak (originally methylated sites) moves to the left; the middle peak (originally hemi-methylated sites) remains stationary. By measuring the extent to which the three peaks have converged to the stationary value of \(\beta =0.5\), we can thus estimate the mitotic age of individual tumors.

Concretely, we used the standard deviation of the \(\beta\)-values, denoted by \({s}_{\beta }\), to quantify the relationship between tumor mitotic age and the evolving clock set profile (Fig. 3C). Because \({s}_{\beta }\) is highest at time 0, when the \(\beta\)-value distribution exhibits three sharp peaks, and then monotonically decreases over time (Fig. 3D), we introduced the epigenetic clock index \({c}_{\beta }=1-{s}_{\beta }\) as a proxy measure of tumor mitotic age (Fig. 3E; see Supplementary Data 2 for clock index values).
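As a minimal sketch, the clock index for one tumor can be computed directly from the β-values of its clock sites. The function name `clock_index` is hypothetical, and the choice of the population standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import pstdev


def clock_index(betas):
    """Epigenetic clock index c_beta = 1 - s_beta for one tumor.

    betas: beta-values of the clock-set fCpG sites in a single sample.
    Assumption: s_beta is taken as the population standard deviation.
    """
    return 1.0 - pstdev(betas)


# Three sharp peaks at 0, 0.5 and 1 (a tumor at time 0) versus
# partially converged peaks (an older tumor).
young = [0.0, 0.5, 1.0] * 30
older = [0.2, 0.5, 0.8] * 30
print(clock_index(young), clock_index(older))
```

A sample whose sites have all converged to 0.5 has \({s}_{\beta }=0\) and therefore \({c}_{\beta }=1\), the upper bound of the index.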

While most breast cancers are thought to derive from a single clone of origin, recent work suggests that some breast cancers may derive from multiple founder cells20. If this is the case, the initial \(\beta\)-value distribution of fCpGs is not confined to the three discrete peaks at \(\beta\) = 0, 0.5, and 1. The initial epigenetic entropy of a multiclonal cancer is thus higher compared to a cancer of monoclonal origin, and the epigenetic clock index will overestimate the true tumor mitotic age at the time of resection. However, because it appears unlikely that a tumor derives from a large number of independent clones, we expect this bias to be limited.

We identified a significant difference in the distribution of the epigenetic clock index \({c}_{\beta }\) among invasive ductal carcinomas (IDC; median, 0.81) and invasive lobular carcinomas (ILC; median, 0.85; \(P < {10}^{-10}\), Wilcoxon rank-sum test). This is aligned with the notion of ILC as a molecularly and clinically distinct disease entity21, and consistent with the propensity of ILCs to be mammographically occult22, which may lead to older tumor mitotic ages at the time of diagnosis. Cognizant of these differences between ILC and IDC, we decided to reduce further confounding in downstream analyses by focusing on the ductal cancers.

In the next two sections, we characterize the relationship between a tumor’s mitotic age, as quantified by the epigenetic clock index \({c}_{\beta }\), and its evolutionary-ecological context as determined by its intrinsic growth potential and external pressures from the microenvironment.

Younger tumors have more aggressive phenotypes

As a breast tumor grows, its likelihood of detection on the basis of imaging or symptoms increases. Because fast growing tumors are expected to reach a detectable size sooner than slow growing ones, we hypothesized that younger tumor mitotic age would correlate with established markers of tumor aggressiveness. To test this hypothesis, we correlated the epigenetic clock index with several established features of tumor aggressiveness, including molecular subtype23,24, genomic instability25,26, grade27, proliferation28, and size29.

There was a clear relationship between tumor mitotic age and molecular subtype: luminal A tumors, which have a more favorable prognosis, were older than luminal B and basal tumors (Fig. 4A-B; Supplementary Table 1). Similarly, when using clinical instead of molecular subtyping in two additional cohorts of breast tumors (Germany cohort, n = 253; WCHS cohort, n = 445), triple-negative cancers tended to be younger compared to luminal A and luminal B cancers (Supplementary Fig. 2A, 3). Consistent with these subtype patterns, there was a strong correlation between genomic instability and younger tumor mitotic age (Fig. 4C,D). Younger tumors were of higher histopathologic grade (Fig. 4E) but not stage (Fig. 4F), and more likely to be Ki-67-positive (Supplementary Fig. 2B).

Fig. 4: Epigenetic clock index vs. clinicopathological variables.

The distribution of epigenetic clock index values (cβ) among invasive ductal carcinomas in the Lund and TCGA cohorts, by (A, B) molecular subtype as predicted by the PAM50 algorithm; (C, D) fraction of genome altered (FGA) by copy number alterations; (E) tumor grade; (F) tumor stage; (G) tumor size; and (H) T-stage. Pairwise comparisons of medians in panels A, B, E, F, and H were performed using a two-sided Wilcoxon rank-sum test (*P < 0.05, **P < 0.01, ***P < 0.001). In panels C, D, and G, regression lines and bootstrapped 95% confidence intervals are shown; Pearson correlations (R) are indicated. Boxplots in A, B, E, F, and H display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.

Another prognostic factor in breast cancer is tumor size, with larger lesions having worse outcomes. We found that smaller tumors were of older mitotic age compared to larger tumors (Fig. 4G, H), presumably because slow growing tumors spend more time at the smaller end of the detectable size range, and are, therefore, more likely to be detected at a smaller size.

Finally, the relationship between tumor mitotic age and patient age at diagnosis was inconclusive, with a weak negative correlation in TCGA (R = −0.18) and no correlation in either the Lund (R = −0.10; Supplementary Table 1) or WCHS (R = −0.03; Supplementary Table 1) cohorts. This is consistent with the notion that the fCpG clock measures the age of the tumor—starting with the most recent common ancestor cell—and not the age of the patient.

Identifying modulators of tumor mitotic age

The time it takes for a tumor to grow from a single cell to a detectable mass depends on its effective growth rate, that is, the difference between the rates of cell proliferation and cell death (Fig. 5A). Cell proliferation primarily reflects the tumor’s intrinsic growth potential and aggressiveness, whereas cell death is often the result of extrinsic selective pressures applied by the tumor microenvironment, such as immune surveillance and resource constraints due to limited vascularization30,31.
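The dependence of detection time on the effective growth rate can be made concrete with a short calculation. This sketch assumes deterministic exponential growth, \(N(t)={e}^{(\alpha -\lambda )t}\), starting from one cell, and a detection threshold of \({10}^{9}\) cells; both are illustrative assumptions rather than the stochastic birth-death model used in the study. The function name `days_to_detection` is hypothetical.

```python
import math


def days_to_detection(alpha, lam, n_detect=1e9):
    """Days for one cell to grow to n_detect cells at net rate alpha - lam.

    alpha: proliferation rate (divisions/day); lam: death rate (deaths/day).
    Assumes deterministic exponential growth; n_detect is illustrative.
    """
    net_rate = alpha - lam
    if net_rate <= 0:
        raise ValueError("tumor does not grow when death rate >= proliferation rate")
    # Solve exp(net_rate * t) = n_detect for t.
    return math.log(n_detect) / net_rate


# The two parameter sets from the Fig. 5A caption.
for lam in (0.15, 0.16):
    print(lam, round(days_to_detection(0.17, lam) / 365.0, 2), "years")
```

Under these assumptions, raising the death rate from 0.15 to 0.16 per day halves the net growth rate and doubles the time to detection, from roughly 2.8 to roughly 5.7 years.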

Fig. 5: Tumor mitotic age vs. measures of proliferation.

A Simulation of tumor size as a function of tumor age. Both tumors have the same proliferation rate (\(\alpha =0.17\) divisions/day), but different death rates: \(\lambda =0.15\) deaths/day (blue) vs. \(\lambda =0.16\) deaths/day (yellow). B Pearson’s correlation between epigenetic clock index \({c}_{\beta }\) and expression of protein-coding genes (TCGA cohort); correlation with select genes as indicated. C, D Correlation of \({c}_{\beta }\) with average expression of genes involved in M-phase and mitotic checkpoint regulation. E Correlation of \({c}_{\beta }\) with the fraction of cells in S-phase, as measured by flow cytometry (Lund cohort). Regression lines shown with bootstrapped 95% confidence intervals and Pearson correlation (R).

To explore putative modulators of effective tumor growth and tumor mitotic age at diagnosis, we performed genome-wide correlation analyses of the epigenetic clock index \({c}_{\beta }\) against gene expression. As predicted, mitotically younger tumors exhibited increased expression of proliferation-related genes such as Ki67 and MCM2 (Fig. 5B, Supplementary Table 1). The signal was further augmented when considering the average expression across a set of genes involved in M-phase and mitotic checkpoint regulation (Fig. 5C,D) and the fraction of cells in S-phase (Fig. 5E).

Next, we examined the microenvironment’s ability to decrease the effective growth rate of a tumor through increased cell death. As hypothesized, the expression of immune cell markers such as CD3, CD4, CD8 and FOXP3 was elevated in mitotically older tumors (Fig. 5B; Supplementary Table 1). This suggests that tumors which are subject to immune surveillance (e.g., through neo-antigen directed immune control by CD8+ T-cells) have a lower effective growth rate and, thus, reach a detectable size at an older mitotic age, as compared to tumors that successfully evade immune control and thus reach a detectable size at a younger mitotic age.

To perform a systematic analysis of tumor mitotic age modulation, we performed a genome-wide gene set enrichment analysis (GSEA) (Fig. 6A). Consistent with the univariate gene expression analyses, mitotically younger tumors were enriched for pathways related to proliferation and cell cycle control. Conversely, mitotically older tumors were enriched for immune pathways and immune-related signaling pathways, again supporting the notion of effective immune control in older, slower growing lesions.

Fig. 6: Pathway enrichment and immune decomposition analyses.

A A gene set enrichment analysis (GSEA) was performed for the epigenetic clock index \({c}_{\beta }\). Pathways with a positive (negative) enrichment score are enriched in mitotically older (younger) tumors. Only pathways with a false discovery rate (FDR) below 0.1 are shown; *FDR < .05; **FDR < .01; ***FDR < .001; ****FDR < .0001. B Epigenetic clock index vs. the extent of immune infiltration (absolute immune score) as estimated by CIBERSORTx. C The immune compartment of each tumor was decomposed using CIBERSORTx; the compartment fractions are shown for tumors of similar mitotic age (epigenetic clock index \({c}_{\beta }\)). Boxplots display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.

For a more in-depth analysis of the immune infiltrate, we used the CIBERSORTx algorithm32 to estimate the extent and composition of the immune compartment. As expected, the extent of the immune compartment increased with tumor mitotic age (Fig. 6B). When decomposing each tumor’s immune compartment into the major cell types, we found an increase in the fraction of T-cells in mitotically older tumors (Fig. 6C, Supplementary Table 2), again suggestive of T-cell mediated immune surveillance.

Analysis of paired tumor samples validates epigenetic clock

Multiple tumor samples from the same patient provide a unique opportunity to assess the internal validity of the epigenetic clock. Indeed, paired samples should be epigenetically more related—via their most recent common ancestor cell—than samples from different patients. In a cohort of 8 women with multi-focal breast cancer33, we found that the within-patient correlations of the clock set fCpG sites were higher (median, 0.70) than the between-patient correlations (median, 0.11; \(P=3\times {10}^{-6}\), Wilcoxon rank-sum test; Fig. 7A). The same held true for a cohort of 18 patients with paired primary tumors and lymph node metastases34 (median, 0.85 vs. 0.13, \(P < {10}^{-10}\); Fig. 7B) and a subset of 22 patients with paired primary tumors and metastases (including lymph node and distant metastases) from the AURORA US Metastasis Project35 (median, 0.66 vs. 0.13, \(P < {10}^{-10}\); Supplementary Fig. 4).
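The within- versus between-patient comparison can be sketched as below. The input layout (a list of patient identifier and β-vector pairs) and the function names are hypothetical, and the toy vectors are invented for illustration only. The significance test is omitted; the study used a Wilcoxon rank-sum test on the two sets of correlations.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length beta-value vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def within_between_medians(samples):
    """samples: list of (patient_id, clock-site beta vector) pairs.

    Returns the median correlation over same-patient sample pairs and
    over different-patient sample pairs.
    """
    within, between = [], []
    for (p1, b1), (p2, b2) in combinations(samples, 2):
        (within if p1 == p2 else between).append(pearson(b1, b2))
    return median(within), median(between)


# Toy example: two patients, two samples each, four clock sites.
toy = [
    ("A", [0.10, 0.50, 0.90, 0.20]), ("A", [0.15, 0.45, 0.85, 0.25]),
    ("B", [0.90, 0.10, 0.50, 0.80]), ("B", [0.85, 0.15, 0.55, 0.75]),
]
print(within_between_medians(toy))
```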

Fig. 7: Paired tumor samples.

A In a cohort of 8 patients with multifocal breast cancer, the \(\beta\)-values of the 500 fCpG sites of the epigenetic clock are correlated within (two foci per patient) and between patients. B In a cohort of 18 patients with paired primary tumor and lymph node metastasis samples, the \(\beta\)-values of the 500 fCpG sites are correlated within and between patients. C For the 18 patients from panel B and 22 patients from the AURORA US Metastasis Project, the epigenetic clock index of the primary tumor is plotted against the index of the lymph node metastasis. Three patients (AF, AI, and AX) from the lymph node cohort are labeled for the purposes of the following 3 panels. D–F For the 3 labeled patients from C, the \(\beta\)-value distributions of fCpGs are shown both for the primary tumor and the metastasis. G Monte Carlo simulation of an ensemble of 90 fCpG sites, 30 each starting in the unmethylated, hemi-methylated, and methylated initial configurations; every 3 months, a cell was randomly picked to represent the metastasis seeding cell, and the Pearson correlation between the \(\beta\)-value distribution of that cell and that of the entire tumor was calculated. Distributions at each time point represent the results from 30 independent simulations; boxplots display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. Simulation parameters are detailed in Fig. 1. H For the 18 patients from panel B and 22 patients from the AURORA US Metastasis Project, the epigenetic relatedness of primary and lymph node metastasis (Pearson’s R for the 500 fCpG sites in the clock set) is compared to the difference in epigenetic clock index, as a proxy for the difference in tumor mitotic age between the two samples.

The two cohorts of patients with paired primary and metastasis samples34,35 allowed us to test two additional properties of the fCpG clock. First, assuming that each metastasis is seeded by a single cell from the primary tumor, synchronous metastases should be younger than their matched primaries. Indeed, the epigenetic clock measures the mitotic age of the metastasis relative to the seeding event, which occurred after initiation of the primary tumor. Consistent with this prediction, metastases had a lower epigenetic clock index compared to their matched primaries in 33/40 patients, and only one metastasis was noticeably older than the matched primary (Fig. 7C). These findings provide direct support for interpreting the epigenetic clock index as a proxy measure for tumor mitotic age. However, it is important to acknowledge that potential differences in tumor biology and microenvironmental conditions between primary tumors and metastases may act as unmeasured confounders in this analysis.

Second, the timing of metastatic dissemination relative to the primary tumor’s age is expected to impact the epigenetic similarity of the two samples: if the metastasis is seeded early during primary tumor growth (i.e., similar \({c}_{\beta }\) values), the \(\beta\)-values of the two samples are expected to be closely related (Fig. 7D, E) because the metastasis seeding cell came from a mostly homogeneous population; conversely, if the metastasis is seeded late (i.e., different \({c}_{\beta }\) values), the \(\beta\)-values are expected to differ more substantially (Fig. 7F) because the seeding cell came from a heterogeneous population. Corroborating this hypothesis, and consistent with a corresponding simulation of metastatic seeding based on the oscillator model (Fig. 7G), we found a negative correlation between the primary-metastasis difference in mitotic age and \(\beta\)-value similarity (Fig. 7H).

Finally, we note that metastases may be seeded not by a single cell, but by a cluster of cells from the primary tumor36. In such cases—similar to the scenario of a multiclonal primary tumor (see ‘Epigenetic clock index’)—our method may overestimate the true mitotic age of the metastasis. However, as long as the estimated mitotic age of the metastasis is younger than that of the matched primary, its true mitotic age must also be younger, and thus the conclusion derived from Fig. 7C remains valid.

Quantifying tumor mitotic age

So far, we have used the epigenetic clock index \({c}_{\beta }\) as a correlate of tumor mitotic age. To derive quantitative estimates of each tumor’s mitotic and calendar ages, we proceeded as follows (see Methods for details). First, we invoked the mathematical oscillator model (Fig. 1A) to relate tumor mitotic age to the measured \(\beta\)-values of fCpG sites in the clock set. Next, we decomposed each tumor’s empirical fCpG \(\beta\)-value distribution into three groups (Fig. 8A): originally unmethylated fCpGs (left peak in the histogram), originally hemi-methylated fCpGs (middle peak), and originally methylated fCpGs (right peak). Finally, we combined the peak location in each group with the oscillator model to infer the estimated mitotic age of the tumor (Fig. 8B).
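The final step, mapping a peak location to a number of generations, can be sketched under a simplified assumption: each allele flips with probability \(\mu\) per division at balanced rates, so that the expected peak location after \(n\) divisions is \(0.5+({\beta }_{0}-0.5){(1-2\mu )}^{n}\). The actual oscillator model and fitting procedure are described in Methods and may differ; the function name `mitotic_age_from_peak` is hypothetical.

```python
import math


def mitotic_age_from_peak(peak_beta, initial_beta, mu=2e-3):
    """Invert the simplified flip model to get the number of generations.

    peak_beta: observed peak location; initial_beta: 0.0 or 1.0, the state
    of these sites in the founder cell; mu: (de-)methylation rate per
    division and allele. The symmetric-flip model is an assumption here.
    """
    if initial_beta == 0.5:
        raise ValueError("hemi-methylated sites stay at 0.5 in expectation "
                         "and carry no age signal in this model")
    # Fraction of the initial distance from 0.5 that remains.
    remaining = (peak_beta - 0.5) / (initial_beta - 0.5)
    if not 0.0 < remaining <= 1.0:
        raise ValueError("peak must lie between the initial state and 0.5")
    return math.log(remaining) / math.log(1.0 - 2.0 * mu)


# Left peak (initially unmethylated) and right peak (initially methylated).
print(mitotic_age_from_peak(0.24, 0.0), mitotic_age_from_peak(0.76, 1.0))
```

In this sketch the left and right peaks each yield an independent estimate, which could be averaged; the middle peak is uninformative because its expected location does not move.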

Fig. 8: Estimating tumor age.

A The empirical \({{\rm{\beta }}}\)-value distributions of the 500 fCpG sites in the clock set are decomposed into three peaks corresponding to initially unmethylated (blue), hemi-methylated (yellow), and fully methylated (red) sites. B Tumor mitotic age (the number of generations) is shown for the TCGA and Lund cohorts and colored by molecular subtype. These estimates are based on the peak locations (panel A) and a (de-)methylation rate of \({{\rm{\mu }}}=2\cdot {10}^{-3}\), per cell division and per allele, which yields a mean tumor age of 3 calendar years (see E). C For the Lund cohort, tumor-specific proliferation rates are estimated using the measured fraction of cells in S-phase and shown by molecular subtype. D For the TCGA cohort, proliferation rates are estimated using S-phase fractions predicted based on gene expression, using a model trained on the Lund cohort. E Calendar age of tumors in the TCGA and Lund cohorts, colored by molecular subtype. Boxplots in C and D display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.

Finally, we combined tumor-specific estimates of mitotic age (Fig. 8B) and proliferation rate (Fig. 8C, D) to derive tumor-specific estimates of calendar age (see Supplementary Data 2 for age estimates). Anchoring the median tumor age at a consensus estimate of 3 years (see Methods), the distribution of calendar ages across the TCGA and Lund cohorts ranged from 0.3 to 29.9 years, with an interquartile range of 1.6 to 5.5 years (Fig. 8E). There were notable differences in median tumor calendar ages by molecular subtype, ranging from 1.1 years in basal cancers to 6.5 years in luminal A cancers (Supplementary Table 3). Interestingly, even though there was only a small difference between Her2-positive and luminal A tumors with respect to tumor mitotic age (163 vs. 171 generations), there was a clear separation with respect to calendar age (2.5 vs. 6.5 years; Supplementary Table 3). In other words, the Her2-positive cancers accumulated a similar number of cell divisions in a much shorter time, which is consistent with a generally more aggressive clinical presentation and worse prognosis37,38.
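The conversion from mitotic to calendar age amounts to dividing the number of generations by the division rate. The sketch below assumes a constant per-lineage division rate over the tumor's lifetime; the function name `calendar_age_years` is hypothetical, and the two rates in the usage lines are illustrative values, not estimates from the study.

```python
def calendar_age_years(generations, divisions_per_day):
    """Calendar age implied by a mitotic age and a constant division rate.

    generations: estimated number of cell divisions along a lineage.
    divisions_per_day: tumor-specific proliferation rate (assumed constant).
    """
    if divisions_per_day <= 0:
        raise ValueError("division rate must be positive")
    return generations / divisions_per_day / 365.0


# Similar mitotic ages, different (illustrative) division rates.
print(calendar_age_years(163, 0.18), calendar_age_years(171, 0.07))
```

Two tumors with nearly the same number of generations can therefore differ severalfold in calendar age when their division rates differ.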

Adjusting for tumor purity

Bulk samples contain a mixture of tumor and stroma. Because the epigenetic clock index exhibited correlations with tumor purity as measured in the TCGA cohort by the consensus purity estimate39 (CPE; R = −0.70; Supplementary Fig. 5A), we had restricted our analyses to samples of high tumor purity (CPE ≥ 0.6). Nevertheless, we cannot rule out that the observed variability in \(\beta\)-value distributions among the selected fCpG sites (which are used to estimate tumor mitotic age) was at least partially driven by the methylation patterns of admixed non-epithelial cells. If this is the case, then, e.g., the immune pathway enrichment of older tumors (Fig. 6A) may be confounded by the presence of non-epithelial cells that alter the measured \(\beta\)-value distribution.

First, we checked whether the selection of clock sites was strongly influenced by the presence of non-epithelial cells. To do this, we started with all unbiased sites (mean \(\beta \in [0.4,\,0.6]\) in TCGA, normals, basal/luminal epithelial) and separately identified the 500 most variable sites in the least and most pure tumors (bottom and top CPE quartiles). These two selections showed high overlap (91% and 99%) with the original clock set of most variable CpG sites across all tumors, suggesting that the selection of fCpGs was not unduly driven by the non-epithelial tumor compartment.

Next, to adjust for possible confounding of \(\beta\)-values by tumor purity, we modeled the measured methylation as a mixture of tumor and stroma methylation (see Methods for details). This adjustment was based on the CPE measure of tumor purity39; we also employed EpiSCORE40 to estimate the fraction of epithelial content and found that the two measures of purity were well correlated (R = 0.63). The resulting purity-adjusted epigenetic clock index \({c}_{\beta }^{\alpha }\) exhibited a lower correlation with tumor purity (R = −0.25, Supplementary Fig. 5B) and was lower than the unadjusted epigenetic clock index \({c}_{\beta }\) (Supplementary Fig. 5C).
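One way to picture such a mixture adjustment is a two-component linear model, \({\beta }_{{\rm{obs}}}=\rho \,{\beta }_{{\rm{tumor}}}+(1-\rho )\,{\beta }_{{\rm{stroma}}}\), with \(\rho\) the tumor purity. The sketch below inverts this relation for a single site. It is an assumption-laden illustration: the exact model is given in Methods and may differ, the per-site stromal β-value is assumed known (for example, from normal tissue), and the function name `purity_adjusted_beta` is hypothetical.

```python
def purity_adjusted_beta(beta_obs, purity, beta_stroma):
    """Estimate the tumor-compartment beta-value from a bulk measurement.

    Assumes beta_obs = purity * beta_tumor + (1 - purity) * beta_stroma,
    a two-component linear mixture (an illustrative assumption).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    beta_tumor = (beta_obs - (1.0 - purity) * beta_stroma) / purity
    # Measurement noise can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, beta_tumor))


# A bulk value of 0.26 at 60% purity with stromal beta 0.5.
print(purity_adjusted_beta(0.26, 0.6, 0.5))
```

If stromal β-values at clock sites sit near 0.5, admixture pulls the observed values toward 0.5; removing it spreads the distribution, increases \({s}_{\beta }\), and lowers the index, which is in line with the adjusted index being lower than the unadjusted one.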

When replacing the unadjusted epigenetic clock index with the purity-adjusted version, the strength of correlations between markers of tumor aggressiveness and younger tumor mitotic age remained unaltered (Supplementary Table 1, Supplementary Fig. 6, Supplementary Fig. 7). Individual immune genes and the extent of immune infiltration remained associated with older tumor mitotic age, although the correlations were attenuated (Supplementary Table 1, Supplementary Table 2). While the immune pathways were no longer enriched in older tumors (Supplementary Fig. 7), there was still a positive correlation between the fraction of T cells and tumor mitotic age (Supplementary Table 2).

Discussion

In this study, we developed an epigenetic clock to measure the age of newly diagnosed breast cancers. Measuring epigenetic entropy among neutrally fluctuating CpG (fCpG) sites, the clock requires no knowledge about the tumor’s initial DNA methylation state. This allows it to track the number of cell divisions since birth of the tumor, a task that previous tissue clocks were not able to accomplish. Based on standard methylation arrays, the clock has potential as a novel marker of aggressiveness and prognosis in early-stage breast cancer.

Once a patient is diagnosed with breast cancer, the tumor’s mitotic age encodes valuable prognostic information. Intuitively, a slow-growing tumor that takes a long time to reach the threshold of detection is more likely to have a good prognosis compared to a fast-growing tumor that quickly expands into a detectable mass. Our analyses corroborate this hypothesis by revealing that mitotically younger tumors were enriched for features of tumor aggressiveness and predictors of poor outcome, including genomic instability, higher grade, and basal molecular subtype25,26. This property of the epigenetic clock is quite remarkable given that its constituent fCpG sites were selected only on the basis of simple statistical properties of their \({{\rm{\beta }}}\)-value distributions.

Beyond prognostication, the clock holds promise in risk-stratified screening approaches. The efficacy of breast cancer screening critically depends on the sojourn time, that is, the time window during which the tumor is asymptomatic but mammographically detectable. If the sojourn time is short, early detection is unlikely even under frequent screening; if it is long, some cancers will be overdiagnosed41. Sojourn time estimates are usually obtained by fitting natural history models to population data, yielding indirect, population-averaged estimates. Our approach, in contrast, allows for direct and individual-level characterization of tumor age, which provides an upper bound for the sojourn time. Assuming an overall median time to detection of 3 years in our cohort, the time to detection in luminal A cancers (6.5 years) was substantially longer than that in luminal B (2.4 years), Her2-positive (2.5 years), and basal (1.1 years) cancers. These estimates are consistent with the observation that interval cancers are enriched for more aggressive subtypes compared to screen-detected cancers42, and highlight opportunities for data-driven personalization of screening schedules.

The epigenetic clock also provides an opportunity to quantify the evolutionary-ecological pressures that shape the temporal landscape of breast cancers. Indeed, because most tumors are of comparable size at the time of diagnosis, tumor mitotic age is related to the effective growth rate: tumors that reach the detection threshold at a younger age have a higher effective growth rate compared to tumors that reach the threshold at a higher age. Our analyses characterized the effective growth rate of breast cancers as a competition between tumor-intrinsic growth potential (e.g., proliferation) and microenvironmental pressures (e.g., surveillance by immune cells)43. According to this model, highly proliferative tumors that successfully evade the immune system are detected at a younger age compared to less proliferative lesions subject to continuous immune control.

Our study has several limitations. First, because tumor age is not observable in practice, a direct validation of the clock is not possible. Nevertheless, we note that the clock correctly classified the age ordering of primary tumors and metastases in 33 of 40 patients. Second, the epigenetic clock index was correlated with sample purity, which suggests the latter may be a confounder in our analyses. To mitigate this risk of bias, we systematically repeated all analyses using purity-adjusted methylation values; while some of the associations were attenuated, the overall qualitative conclusions remained unchanged. Fully resolving the potential confounding of age estimates by tumor purity will require single-cell methylation data. Third, estimation of tumor mitotic age was based on a simple mathematical model of (de-)methylation dynamics. In future work, this approximation can be refined using more sophisticated simulation-based models that account for the underlying population dynamics, including cell proliferation and death, and possibly selection.

How long a newly diagnosed breast cancer has been growing is generally considered a known unknown. Here we revisited this assumption and developed a way to infer tumor age using standard methylation arrays. While developed specifically for breast cancer, the approach can be generalized to any cancer type and, as such, provides a scalable technology to characterize the temporal landscape of oncology.

Methods

TCGA cohort

Of the 1085 invasive breast cancers from female patients in The Cancer Genome Atlas (TCGA)44, 774 had available methylation array data (Infinium HumanMethylation450 BeadChip, Illumina, San Diego, CA, USA). After excluding 138 tumors of low tumor content (consensus purity estimate39 [CPE] <0.6) and one sample each from two patients with two primary samples, the remaining 634 tumors were used to select the ensemble of 500 fCpG sites. Finally, after excluding 10 tumors with ≥5% missing clock set fCpG measurements and 100 tumors with a histology code other than infiltrating duct carcinoma, the analytic cohort consisted of 400 tumors. The following variables were retrieved: patient age at diagnosis; tumor histology; T stage (subsetted to T1, T2, T3, and T4); summary stage (subsetted to stages I, II, or III). For all 400 patients with invasive ductal carcinoma, gene expression quantification (RNA-seq) and copy number segment data were available as well; when >1 measurement was available, one was selected at random. All clinical and sequencing data were retrieved from the Genomic Data Commons (GDC; https://gdc.cancer.gov) using the R package TCGAbiolinks (version 2.25.3)45.

Lund cohort

We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 181 primary breast cancers in the Southern Sweden Breast Cancer Group tissue bank at the Department of Oncology and Pathology, Skåne University Hospital (Lund, Sweden) and the Department of Pathology, Landspitali University Hospital (Reykjavik, Iceland)18. The data were obtained through the Gene Expression Omnibus (GSE75067). Because calculation of the purity metric CPE requires gene expression, somatic copy-number, and immunohistochemistry in addition to methylation data, we instead assessed tumor purity using the leukocyte unmethylation percentage (LUMP) value. A tumor’s LUMP value is calculated as the average \(\beta\)-value among 44 specific CpG sites, divided by 0.85; we found the LUMP value to be strongly correlated with CPE (R = 0.86). After excluding samples of low purity (LUMP < 0.6; n = 35), the remaining 146 samples all had <5% missing clock set fCpG measurements. After exclusion of non-ductal histology (n = 48) we ended up with an analytic cohort of n = 98. The following variables were retrieved: patient age at diagnosis; tumor grade; tumor size; molecular subtype (PAM50); fraction of genome altered (FGA); expression of a mitotic checkpoint gene module46; fraction of cells in S-phase (flow cytometry).
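The LUMP-based purity filter described above can be sketched in a few lines of Python. This is a minimal illustration that follows the definition given in the text (average \(\beta\)-value over the LUMP CpG sites, divided by 0.85); the function names and the handling of missing measurements are assumptions made for this example, not part of the published pipeline.

```python
from statistics import mean

LUMP_SCALE = 0.85  # normalisation constant stated in the text


def lump_value(beta_values):
    """Average beta-value over the LUMP CpG sites, divided by 0.85.

    `beta_values` holds one beta-value per LUMP site for a single sample.
    Missing measurements (None) are skipped; this handling is an assumption.
    """
    observed = [b for b in beta_values if b is not None]
    if not observed:
        raise ValueError("no observed beta-values for the LUMP sites")
    return mean(observed) / LUMP_SCALE


def passes_purity_filter(beta_values, threshold=0.6):
    """Keep a sample if its LUMP value reaches the threshold (LUMP >= 0.6)."""
    return lump_value(beta_values) >= threshold
```

For example, a sample whose 44 LUMP sites all have \(\beta = 0.6\) has a LUMP value of about 0.71 and is retained.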

Normal breast tissue cohort

We obtained publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 100 normal breast tissue samples in the Susan G. Komen Tissue Bank17,47 (GSE88883). We excluded samples of low purity (LUMP < 0.6), resulting in a cohort of 79 normal breast tissue samples used for identifying fCpG sites.

Germany cohort

We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 305 primary breast cancers collected within the Bavarian Breast Cancer Cases and Controls Study 2. The data were obtained through the Gene Expression Omnibus (GSE69914)48. After excluding samples of low purity (LUMP < 0.6; n = 52), the remaining 253 samples all had <5% missing clock set fCpG measurements. ER, PR, Her2, and Ki-67 status measured by IHC were retrieved. Clinical subtypes were defined as follows: Hormone receptor negative (HR-) tumors were defined as those that were ER- and PR-, while HR+ tumors had to be positive for one or both of ER and PR. Then, HR+/Her2- tumors were labeled as luminal A, HR+/Her2+ as luminal B, HR-/Her2+ as Her2 positive, and HR-/Her2- as triple-negative.
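The clinical subtype rules above amount to a small decision table, sketched here in Python. The function name and the boolean encoding of the IHC results are assumptions made for this example.

```python
def clinical_subtype(er_positive, pr_positive, her2_positive):
    """Map IHC receptor status to the clinical subtype labels used in the text.

    HR+ means ER and/or PR positive; HR- means both ER and PR negative.
    """
    hr_positive = er_positive or pr_positive
    if hr_positive:
        return "luminal B" if her2_positive else "luminal A"
    return "Her2 positive" if her2_positive else "triple-negative"
```

A tumor that is ER-, PR+, and Her2- is therefore labeled luminal A, because positivity for either hormone receptor suffices for HR+.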

WCHS cohort

We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 694 primary breast cancers collected within the Women’s Circle of Health Study (WCHS). The raw intensity data in IDAT format were obtained through the Gene Expression Omnibus (GSE226569) using the R package GEOquery (version 2.76.0)49,50. The data were subjected to standard background correction, bias correction, masking, and conversion to \(\beta\)-values using the openSesame pipeline from the R package sesame (version 1.26.0)51. After removing DCIS (stage 0) tumors (n = 56), we excluded samples of low purity (LUMP < 0.6; n = 79). Then, after noting that some clock set fCpGs had highly sparse data, we removed 51 sites that were missing measurements in more than 25% of the remaining tumors. Finally, 114 tumors were excluded for having ≥5% missing measurements across the 449 remaining fCpGs, resulting in a final cohort of 445 patients with invasive tumors. ER, PR, and Her2 status measured by IHC, patient age, and summary stage were provided by the publishing authors. Clinical subtypes were determined as described previously for the Germany cohort.

Single-cell DNAm atlas

We retrieved publicly available single-cell whole genome bisulfite sequencing data for breast luminal (N = 3 measurements) and basal (N = 4) epithelial tissue from the DNA methylation atlas of normal human cell types collected by Loyfer et al. (GSE186458)52, all of which were of very high purity (LUMP ≥ 0.99). We only retrieved data collected for the 482,422 sites that also appeared on the Infinium HumanMethylation450 BeadChip.

Multiple sample cohorts

We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from four cohorts with paired tumor samples. The first cohort consisted of 8 breast cancer patients with multiple primary samples (GSE106360)19. Only samples from the 5 patients (2 patients with 5 samples each; 3 patients with 3 samples each) who had not received neoadjuvant therapy were used. Because LUMP values were highly variable, we did not apply any purity filtering. The second cohort consisted of 10 patients diagnosed with multi-focal breast cancer (GSE39451)33. For each patient, methylation array data from 2 foci were available, and we only included the 8 patients where both samples were of sufficient purity (LUMP ≥ 0.6). The third cohort consisted of paired primary and lymph node metastasis samples from 44 patients (GSE58999)34. Only patients where both samples were of sufficient purity (LUMP ≥ 0.6) were included (n = 18). The fourth cohort, from the AURORA US Metastasis Project, consisted of primary and metastasis samples taken from 55 patients with metastatic breast cancer (GSE212370)35. Only patients with at least one primary and one metastasis sample of sufficient purity (LUMP ≥ 0.6) were included (n = 22). When more than one primary or metastasis sample was available, the one with the highest LUMP value was selected.

Selection of fluctuating CpG (fCpG) sites

First, we sought to identify CpG sites with balanced methylation and de-methylation rates, defined as having an average methylation content (β-value) between 0.4 and 0.6 in the TCGA cohort (N = 634), bulk normal breast tissue cohort (N = 79), single-cell breast luminal epithelial cohort (N = 3), and single-cell breast basal epithelial cohort (N = 4). CpG sites with ≥20 missing values in either the TCGA or bulk normal cohorts were excluded from this selection process. In the second step, we ranked all such balanced CpG sites by their \(\beta\)-value variance among tumors in the TCGA cohort and selected the 500 most variable fCpGs to define the clock set \({{\mathcal{C}}}\). Based on the clock set, each tumor was assigned an epigenetic clock index \({c}_{\beta }=1-{s}_{\beta }\) where \({s}_{\beta }\) is the standard deviation of the β values in the clock set. We compared the proportion of genic/regulatory sites in clock sites vs. non-clock sites using the \({\chi }^{2}\) test; genic/regulatory CpG sites were identified as those associated with regulatory features or genes in one or both of the official annotation files of the Infinium HumanMethylation450 and MethylationEPIC bead chip arrays (https://support.illumina.com).
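The two-step selection and the clock index defined above can be sketched as follows. This is an illustrative outline only: the dictionary-based data layout and function names are hypothetical, missing-value filtering is omitted, and the use of the sample (rather than population) standard deviation is an assumption because the text does not specify it.

```python
from statistics import pvariance, stdev


def is_balanced(cohort_means, low=0.4, high=0.6):
    """True if the site's mean beta lies in [0.4, 0.6] in every reference cohort."""
    return all(low <= m <= high for m in cohort_means)


def select_clock_set(tumor_betas, cohort_means, n_sites=500):
    """Return the n_sites most variable balanced sites.

    tumor_betas:  {site_id: [beta-value per tumor]}
    cohort_means: {site_id: [mean beta per reference cohort]}
    """
    balanced = [s for s, means in cohort_means.items() if is_balanced(means)]
    ranked = sorted(balanced, key=lambda s: pvariance(tumor_betas[s]), reverse=True)
    return ranked[:n_sites]


def clock_index(clock_set_betas):
    """Epigenetic clock index c_beta = 1 - s_beta for a single tumor.

    clock_set_betas holds the tumor's beta-values at the clock-set sites.
    """
    return 1.0 - stdev(clock_set_betas)
```

Under this definition, a tumor whose clock-set \(\beta\)-values have all drifted to 0.5 has \({s}_{\beta }=0\) and therefore the maximal index \({c}_{\beta }=1\), whereas widely separated peaks yield a lower index.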

Gene expression analyses

For tumors in TCGA, relative gene expression levels were taken as the mean-centered, log2(x + 1) transformation of the reported transcript per million (TPM) intensities. Among the 60,616 RNA transcripts recorded in TCGA, only those classified as protein-coding genes by the HUGO Gene Nomenclature Committee53 were included in subsequent analyses (n = 18,910). Expression of a mitotic checkpoint gene module46 was calculated as the average relative expression of the genes included in the module. Molecular subtyping was based on the PAM50 algorithm54 as implemented in the R package genefu55. For tumors in the Lund cohort, identical gene expression and molecular subtyping analyses had previously been reported46, thus enabling a direct comparison between tumors in the TCGA and Lund cohorts.
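As a small worked example of the expression transform described above, the following Python sketch applies log2(x + 1) to the TPM values of one gene and then mean-centers them. Centering each gene across tumors is an assumption made for this example, and the function name is hypothetical.

```python
import math
from statistics import mean


def relative_expression(tpm_values):
    """Mean-centered log2(x + 1) transform of one gene's TPM values across tumors."""
    logged = [math.log2(x + 1.0) for x in tpm_values]
    centre = mean(logged)
    return [value - centre for value in logged]
```

For a gene with TPM values 0 and 3 in two tumors, the log-transformed values are 0 and 2, giving relative expression levels of −1 and +1.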

Pathway enrichment analyses

For the TCGA cohort, we performed a gene set enrichment analysis (GSEA) using the software package GSEA56,57 to identify Hallmark gene sets that are correlated with the epigenetic clock index \({c}_{\beta }\). The analysis was performed using the Pearson correlation to rank individual genes; phenotype-permutation-based P values and false-discovery rate (FDR) Q values were computed using 1000 permutations. All other inputs were kept at their defaults.

CIBERSORTx

To assess the immune cell composition within the tumor microenvironment, we employed CIBERSORTx using the LM22 signature matrix and batch correction32. Briefly, RNA-seq data from the TCGA tumor samples were uploaded to the CIBERSORTx web portal, where gene expression profiles were deconvoluted to estimate the absolute scores for 22 distinct immune cell types. The analysis was performed with the default parameters, including 100 permutations for statistical significance assessment. For reporting of results the 22 distinct cell types were then collapsed into six mutually exclusive categories: B cells, macrophages, mast cells, myeloid cells, natural killer (NK) cells, and T cells.

Copy-number analyses

For tumors in TCGA, copy-number (CN) data consisted of specified chromosomal regions of equal CN, the \({\log }_{2}(\frac{x}{2})\)-transformed CN, and the number of probes. We converted these values to absolute copy numbers and determined each segment to have either a copy number gain (segment mean ≥ 2.5), a copy number loss (segment mean ≤ 1.5), or no change (1.5 < segment mean < 2.5). For every tumor, the fractions of the genome altered by copy number gains and by copy number losses were calculated by dividing the number of probes affected by gains and losses, respectively, by the total number of probes. The total fraction of the genome altered by copy number alterations (FGA) was then calculated as the sum of these two values. For tumors in the Lund cohort, the same approach had previously been used to compute FGA2, thus enabling direct comparison between tumors in the TCGA and Lund cohorts.
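The FGA calculation above can be illustrated with a short Python sketch. It assumes that each segment is given as a pair of the \({\log }_{2}(\frac{x}{2})\)-transformed segment mean and its probe count; the tuple layout and function names are hypothetical.

```python
def absolute_copy_number(log2_ratio):
    """Invert the log2(x / 2) transform to recover the absolute copy number x."""
    return 2.0 * 2.0 ** log2_ratio


def fraction_genome_altered(segments):
    """Return (gain fraction, loss fraction, FGA) for one tumor.

    segments: iterable of (log2_ratio, n_probes) pairs.
    Gains are segments with absolute CN >= 2.5, losses those with CN <= 1.5.
    """
    gain = loss = total = 0
    for log2_ratio, n_probes in segments:
        copy_number = absolute_copy_number(log2_ratio)
        total += n_probes
        if copy_number >= 2.5:
            gain += n_probes
        elif copy_number <= 1.5:
            loss += n_probes
    if total == 0:
        raise ValueError("no probes in the segment list")
    return gain / total, loss / total, (gain + loss) / total
```

For a toy tumor with segments (0.0, 50), (1.0, 30), and (−1.0, 20), the absolute copy numbers are 2, 4, and 1, so 30% of probes fall in gains, 20% in losses, and FGA is 0.5.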

In silico model of tumor growth and fCpG dynamics

To simulate the dynamics of fCpG sites in a growing tumor, we used a discrete-time birth-death process. Starting with a single founding tumor cell, the population is updated in time intervals of one day, at which time each cell either divides, dies, or remains unchanged with probabilities \(\alpha\), \(\lambda\), and \(1-\alpha -\lambda\), respectively. Upon cell division, each allele in each cell changes its methylation state with probability \(\mu\). We tracked an ensemble of 90 fCpG sites, assuming independent (de-)methylation dynamics. Unless otherwise specified, the following parameters were used: \(\alpha =0.17\) (the estimated mean proliferation rate in the TCGA-Lund combined cohort, see below for details), \(\lambda =0.15\) (to reach a population of \({10}^{9}\) cells in 3 years, or, \({\left(1+\alpha -\lambda \right)}^{3\cdot 365}\approx {\,10}^{9}\)), and \({{\rm{\mu }}}=0.002\) (the estimated flip rate in the combined cohort, see below for details).
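A toy version of this birth-death process is sketched below in Python. It is not the authors' simulation code: the diploid representation (two alleles per site), the random founder state, the population cap `max_cells`, and all function names are assumptions introduced for illustration, and the cap keeps the example tractable far below \({10}^{9}\) cells.

```python
import random


def expected_population(alpha, lam, days):
    """Expected size (1 + alpha - lam)^days of a tumor founded by one cell."""
    return (1.0 + alpha - lam) ** days


def flip(cell, mu, rng):
    """Copy a cell, switching each allele's methylation state with probability mu."""
    return [1 - allele if rng.random() < mu else allele for allele in cell]


def simulate_tumor(alpha=0.17, lam=0.15, mu=0.002, n_sites=90,
                   max_days=3 * 365, max_cells=2000, seed=0):
    """Grow a tumor day by day; stop at extinction, max_days, or max_cells.

    Each cell is a list of 2 * n_sites alleles (0 = unmethylated, 1 = methylated).
    """
    rng = random.Random(seed)
    founder = [rng.randint(0, 1) for _ in range(2 * n_sites)]  # assumed start state
    cells = [founder]
    for _ in range(max_days):
        next_cells = []
        for cell in cells:
            u = rng.random()
            if u < alpha:            # division: both daughters may flip alleles
                next_cells.append(flip(cell, mu, rng))
                next_cells.append(flip(cell, mu, rng))
            elif u < alpha + lam:    # death: the cell is removed
                continue
            else:                    # the cell remains unchanged
                next_cells.append(cell)
        cells = next_cells
        if not cells or len(cells) >= max_cells:
            break
    return cells


def site_betas(cells, n_sites):
    """Bulk beta-value per site: fraction of methylated alleles across all cells."""
    betas = []
    for j in range(n_sites):
        methylated = sum(cell[2 * j] + cell[2 * j + 1] for cell in cells)
        betas.append(methylated / (2 * len(cells)))
    return betas
```

Because \(\lambda\) is close to \(\alpha\) under the default parameters, many simulated lineages go extinct early and return an empty population; in practice one would rerun with a different seed until a tumor survives.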

Tumor specific proliferation rates

For tumors in the Lund cohort, tumor specific proliferation rates \({\alpha }_{i}\) were estimated based on the reported fraction \({f}_{i}\) of cells in S-phase as \({\alpha }_{i}={f}_{i}/{T}_{S}\), where \({T}_{S}\) is the average time spent in S-phase (see Supplementary Methods for details). We assumed \({T}_{S}\) to equal 12.7 hours, based on an average across five cancer cell lines58. Because \({f}_{i}\) is not reported in the TCGA cohort, we used the Lund data to develop a predictive model of S-phase fraction using an elastic net model. As candidate predictors, we included FGA, LUMP value, and average gene expression levels within each of the following gene modules46: mitotic checkpoint (see above), immune response, stroma, mitotic progression, early response, steroid response, basal, and lipid. The model was fit to the Lund cohort tumors using cross-validation for hyperparameter optimization, and then applied to TCGA tumors to predict tumor-specific S-phase fractions \({f}_{i}\) and proliferation rates \({\alpha }_{i}\).
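As a worked example of the relation \({\alpha }_{i}={f}_{i}/{T}_{S}\), the sketch below converts an S-phase fraction into a proliferation rate. Expressing the rate per day (by converting \({T}_{S}=12.7\) hours into days) is an assumption made here to match the daily time step of the in silico model; the function name is hypothetical.

```python
T_S_HOURS = 12.7  # average time spent in S-phase, as stated in the text


def proliferation_rate_per_day(s_phase_fraction, t_s_hours=T_S_HOURS):
    """alpha_i = f_i / T_S, with T_S converted from hours to days."""
    if not 0.0 <= s_phase_fraction <= 1.0:
        raise ValueError("S-phase fraction must lie in [0, 1]")
    return s_phase_fraction * 24.0 / t_s_hours
```

Under this unit convention, an S-phase fraction of 9% corresponds to a proliferation rate of roughly 0.17 per day.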

Tumor age estimation

To estimate tumor mitotic and calendar ages from the empirical \(\beta\)-value distributions, we proceeded in two steps. In the first step, we decomposed each tumor’s empirical \(\beta\)-value distribution into three groups, or “peaks”, of fCpG sites: the originally unmethylated fCpG sites (left peak), the originally hemi-methylated fCpG sites (middle peak), and the originally methylated fCpG sites (right peak). We achieved this by fitting a mixture model of three Beta distributions to the \(\beta\)-values of the 500 fCpG sites in the clock set using the R package BetaModels (version 0.5.2). To improve convergence of this method, sites with extreme \(\beta\)-values (\(\beta > 0.98\) or \(\beta < 0.02\)) were removed before fitting the mixture model (a total of 46 and 244 sites were thus removed in the TCGA and Lund cohorts, respectively). In preparation for the next step, we determined the mode of each Beta component in the mixture as the location of the corresponding peak. At this point we excluded tumors with a middle peak location outside the interval [0.4, 0.6] because this suggests a bias in the (de-)methylation rates and thus violates a basic assumption of the fCpG dynamics in the clock set (35 and 8 tumors were excluded in the TCGA and Lund cohorts, respectively). We also excluded 3 TCGA and 7 Lund tumors that were of the “normal” molecular subtype. In the second step, we used the stochastic oscillator model (Fig. 1A) to relate the empirical peak location to the approximate age of the tumor. Because this step requires knowledge about the unknown stochastic (de-)methylation rate, we constrained the overall calendar age distribution across the Lund and TCGA cohorts to have a median of 3 years, which corresponds to the mean sojourn time in breast cancer4,59. See Supplementary Methods for details.
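To illustrate how peak locations can carry age information, the following Python sketch uses a deliberately simplified stand-in for the second step. It assumes that each allele follows a symmetric two-state Markov chain with flip probability \(\mu\) per division, so that the distance of a peak from 0.5 shrinks by a factor \(1-2\mu\) with every division. This is an assumption made for illustration and not the estimator detailed in the Supplementary Methods; the function names are hypothetical.

```python
import math


def expected_peaks(n_divisions, mu):
    """Expected left and right peak locations after n_divisions divisions."""
    decay = (1.0 - 2.0 * mu) ** n_divisions
    return 0.5 * (1.0 - decay), 0.5 * (1.0 + decay)


def mitotic_age_from_peaks(left_peak, right_peak, mu):
    """Invert the peak separation (1 - 2*mu)^n to obtain the number of divisions n."""
    separation = right_peak - left_peak
    if not 0.0 < separation <= 1.0:
        raise ValueError("peak separation must lie in (0, 1]")
    if not 0.0 < mu < 0.5:
        raise ValueError("mu must lie in (0, 0.5)")
    return math.log(separation) / math.log(1.0 - 2.0 * mu)
```

In this simplified picture, peaks at 0 and 1 correspond to a mitotic age of zero, and the estimated age grows as the outer peaks converge toward 0.5.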

EpiSCORE

To provide an orthogonal measure of purity in the TCGA cohort, we used the R package EpiSCORE (version 0.9.6)40. Following the standard instructions of the tool, we converted the 450K methylation array data to gene-level DNAm data and then used the breast reference DNAm matrix provided with the package to estimate, for each tumor, the fraction of each of the following breast cell-types: basal, endothelial, fat, fibroblast, luminal, lymphocyte, and macrophage. The epithelial fraction was computed as the sum of the luminal and basal fractions.

Purity adjusted analyses

Acknowledging the correlation between the epigenetic clock index \({c}_{\beta }\) and tumor purity, we derived a purity-adjusted epigenetic clock index \({c}_{\beta }^{\alpha }\) and repeated relevant correlation analyses with \({c}_{\beta }^{\alpha }\) instead of \({c}_{\beta }\). Because the epigenetic clock index was derived from the distribution of \(\beta\)-values of fCpG sites, we performed the purity adjustment at the level of \(\beta\)-values. For this, we assumed that the measured \(\beta\)-value at site \(i\) (\({\beta }_{i}^{m}\)) could be decomposed as a weighted sum of \(\beta\)-values of the tumor (\({\beta }_{i}^{t}\)) and the stromal component (\({\beta }_{i}^{s}\)),

$${\beta }_{i}^{m}=p\,{\beta }_{i}^{t}+\left(1-p\right){\beta }_{i}^{s},\tag{1}$$

where \(p\) is the sample purity as measured by CPE. To estimate \({\beta }_{i}^{s}\) we combined the CIBERSORTx decomposition of the stroma (see section CIBERSORTx) with \({{\rm{\beta }}}\)-values of its constituent cells (\({\beta }_{k}^{c}\)) to obtain

$${\beta }_{i}^{s}=\sum_{k\in \mathrm{LM22}}{w}_{i,k}\,{\beta }_{k}^{c},$$

where \({w}_{i,{k}}\) is the fraction of cell type \(k\) (in the LM22 signature) in tumor sample \(i\). The \({\beta }_{k}^{c}\) were estimated using published cell-type specific methylation values60. Finally, the purity adjusted \(\beta\)-values were obtained by solving Eq. (1) for \({\beta }_{i}^{t}\) and truncating values below 0 and above 1 (necessary for <5.7% of the adjusted \({{\rm{\beta }}}\)-values).
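The adjustment can be illustrated for a single site in a single sample with the Python sketch below, which solves Eq. (1) for \({\beta }_{i}^{t}\) and truncates the result to [0, 1]. The dictionary layout of the cell-type fractions and reference \(\beta\)-values, as well as the function names, are assumptions made for this example.

```python
def stroma_beta(cell_fractions, cell_type_betas):
    """Weighted sum of cell-type specific beta-values at one site.

    cell_fractions:  {cell_type: fraction w of that cell type in the sample}
    cell_type_betas: {cell_type: reference beta-value at the site}
    """
    return sum(w * cell_type_betas[k] for k, w in cell_fractions.items())


def purity_adjusted_beta(beta_measured, purity, beta_stroma):
    """Solve beta_m = p * beta_t + (1 - p) * beta_s for beta_t, truncated to [0, 1]."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    beta_tumor = (beta_measured - (1.0 - purity) * beta_stroma) / purity
    return min(1.0, max(0.0, beta_tumor))
```

For example, a measured \(\beta\)-value of 0.5 at purity 0.8 with a stromal \(\beta\)-value of 0.9 yields an adjusted tumor \(\beta\)-value of 0.4; values that fall outside [0, 1] after adjustment are truncated to the nearest bound.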

Statistics and reproducibility

Correlations between two continuous variables were calculated using the Pearson correlation coefficient. The medians of continuous variables were compared using a two-sided Wilcoxon rank-sum test at a significance level of 0.05. For each variable, tumors with missing values of that variable were excluded. All analyses and visualizations were performed in Python (version 3.9.19) and R (version 4.3)61. All analyses were based on publicly available data sources and can thus be fully reproduced. To maximize statistical power, all qualifying samples were included in the respective data analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.