Abstract
Although generally unknown, the age of a newly diagnosed tumor encodes valuable etiologic and prognostic information. Here, we estimate the age of breast cancers, defined as the time from the start of growth to detection, using a measure of epigenetic entropy derived from genome-wide methylation arrays. Based on an ensemble of neutrally fluctuating CpG (fCpG) sites, this stochastic epigenetic clock differs from conventional clocks that measure age-related increases in methylation. We show that younger tumors exhibit hallmarks of aggressiveness, such as increased proliferation and genomic instability, whereas older tumors are characterized by elevated immune infiltration, indicative of enhanced immune surveillance. These findings suggest that the clock captures a tumor’s effective growth rate resulting from the evolutionary-ecological competition between intrinsic growth potential and external systemic pressures. Because of the clock’s ability to differentiate old and stable from young and aggressive tumors, it has potential applications in risk stratification of early-stage breast cancers and guiding early detection efforts.
Introduction
When a woman is diagnosed with breast cancer, it is generally not possible to ascertain how long the tumor has been growing. Yet knowledge about a tumor’s age at diagnosis could provide important prognostic clues: older indolent tumors that are less likely to progress may require less invasive treatment, whereas younger fast-growing tumors require more urgent and aggressive treatment. While there is a rich literature on the estimation of the mean sojourn time of breast cancer1,2,3,4,5—the average time tumors spend in a detectable but asymptomatic, pre-clinical state—there is a paucity of tools to assess the age of individual tumors at the time of detection.
Epigenetic clocks provide a promising approach to estimate individual tumor age. Originally developed to quantify the biologic aging process in humans, epigenetic clocks leverage specific patterns of DNA methylation that are strongly correlated with biologic tissue age6,7. Broadly, these clocks focus on the DNA methylation status of CpG sites across the genome that become differentially methylated with increasing tissue age. These clocks are well suited for estimating the number of cell divisions since birth of the person, making them valuable tools for studying normal tissue aging and pre-neoplastic changes8,9,10,11,12. However, they are less effective for estimating tumor mitotic age, defined as the number of cell divisions since birth of the first tumor cell.
In contrast to “one-way” clocks that measure age-related methylation changes, “two-way” epigenetic clocks leverage CpG sites that fluctuate between the unmethylated and methylated states on a relatively fast time scale (Fig. 1A). Originally introduced in the context of homeostatic intestinal stem cell dynamics13, such stochastic epigenetic clocks have also been applied to hematologic malignancies14. Here we follow similar design principles to develop a breast cancer-specific two-way epigenetic clock to measure individual tumor mitotic age at diagnosis based on average methylation levels (\(\beta\)-values) of select CpG sites included in standard methylation arrays. Unlike “one-way” methylation clocks that begin ticking at birth, our clock is reset at onset of tumor growth, allowing it to track the number of cell divisions up to the time of diagnosis.
A Balanced fCpGs stochastically oscillate between the unmethylated, hemi-methylated, and methylated states. B During tumor cell division, methylation and de-methylation events occur stochastically. As illustrated for the first three generations of an exponential tumor expansion, the average methylation status (β-value) evolves over time. C In silico modeling of the fCpG dynamics, based on a simple birth-death model of tumorigenesis. The average methylation value (β) is simulated for three independent fCpGs, whose initial states in the first tumor cell are fully methylated (red), hemi-methylated (yellow), and fully unmethylated (blue), respectively. Insert: the number of tumor cells over time. Simulation parameters: cell proliferation rate \(\alpha =0.17\) divisions/day; cell death rate \(\lambda =0.15\) deaths/day; (de-)methylation rate \(\mu =0.002\) flips/division. See Methods for details. D The age of a tumor at detection is modulated by relative strengths of intrinsic proliferation and growth-suppression induced by the immune tumor microenvironment (TME); tumors will reach a detectable size faster when proliferation is greater and/or the TME is weaker.
The proposed stochastic clock measures the entropy of an ensemble of fluctuating CpG (fCpG) sites. In the tumor’s most recent common ancestor cell, each fCpG was either unmethylated (\(\beta =0\)), hemi-methylated (\(\beta =0.5\)), or methylated (\(\beta =1\)). As the tumor expands, replication errors produce a mixture of cells with different methylation states (Fig. 1B), thus progressively increasing the tumor’s epigenetic entropy15. In the special case of unbiased fCpG sites—whose methylation and de-methylation rates are in balance—the bulk-level methylation converges to \(\beta =0.5\) with increasing tumor mitotic age, regardless of the first tumor cell’s state (Fig. 1C). Thus, by measuring the distribution of unbiased fCpG sites, we can derive an estimate of the age of a given tumor cell population relative to the start of the most recent clonal expansion.
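The convergence toward \(\beta =0.5\) can be illustrated with a minimal sketch of the mean dynamics. It assumes that each allele flips its methylation state with the same probability \(\mu\) per division, independently of all other alleles; under that assumption the expected offset from 0.5 shrinks by a factor \(1-2\mu\) per generation. The function name `expected_beta` is hypothetical, and the sketch ignores the stochastic fluctuations and the birth-death population dynamics of the full simulation.

```python
def expected_beta(beta0: float, generations: int, mu: float = 0.002) -> float:
    """Expected bulk methylation of a balanced fCpG after a number of divisions.

    Assumption: each allele flips state with probability mu per division,
    symmetrically, so the offset from 0.5 decays as (1 - 2*mu)**generations.
    """
    return 0.5 + (beta0 - 0.5) * (1.0 - 2.0 * mu) ** generations


if __name__ == "__main__":
    # Trajectories for the three possible initial states of the first tumor cell.
    for beta0 in (0.0, 0.5, 1.0):
        trajectory = [round(expected_beta(beta0, n), 3) for n in (0, 100, 500, 2000)]
        print(beta0, trajectory)
```

Sites that start at 0 or 1 approach 0.5 from opposite sides, whereas a site that starts at 0.5 stays there in expectation.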
The combination of tumor-specific age estimates and gene expression profiles further provides a unique opportunity to characterize the evolutionary and ecological pressures that shape the temporal landscape of breast cancer. Notably, aggressive tumors that evolve in a weakly suppressive immune microenvironment are expected to reach a detectable size faster than indolent, slow-growing tumors in a strongly suppressive immune microenvironment (Fig. 1D).
The manuscript is structured as follows. Combining DNA methylation and gene expression data from several hundred breast cancer and normal breast tissue samples, we first identify a set of unbiased fCpG sites and introduce the epigenetic clock index as a proxy measure of tumor mitotic age. We then evaluate the face validity of the index by examining its relationship with established prognostic markers, and we combine methylation and gene expression data to identify tumor- and microenvironment-specific factors that modulate tumor mitotic age. Finally, we validate key properties of the clock index in independent cohorts of patients with paired primary-metastasis samples, and we derive quantitative estimates of individual breast cancers’ mitotic and calendar ages.
Results
Selection of unbiased fCpG sites
To identify a set of unbiased fCpG sites in breast cancer, we used 450K methylation array data from 634 invasive breast cancers in The Cancer Genome Atlas16 (TCGA) and 79 normal breast tissue samples17. Using a two-step selection process, we identified an ensemble of fCpG sites with balanced (de-)methylation rates as follows.
First, we identified CpG sites with an average \(\beta\)-value close to 0.5 in both normal breast tissue and breast cancers, thus excluding sites with an inherent bias toward methylation or de-methylation, and sites that are subject to systematic selection during homeostasis and/or tumorigenesis (Fig. 2A). To avoid confounding by cell type heterogeneity, we further constrained candidate sites to be balanced within luminal and basal epithelial breast tissues, respectively (Fig. 2B).
A Starting with all CpG sites (gray dots), sites with an average methylation value β between 0.4 and 0.6 in both breast cancers (N = 634, TCGA cohort) and normal samples (N = 79, Normal cohort) were labeled as Unbiased Set I (yellow shaded rectangle). B Starting with sites in Unbiased Set I (yellow dots), sites with an average methylation value β between 0.4 and 0.6 in both normal breast luminal epithelial tissue samples (N = 3, single-cell DNAm atlas) and normal breast basal epithelial tissue samples (N = 4, single-cell DNAm atlas) were labeled as Unbiased Set II (red shaded rectangle). C Sites in Unbiased Set II were ranked by their inter-tumor standard deviation, and the top 500 were included in the clock set \({{\mathscr{C}}}\). D Distribution of individual fCpG β-values across tumor and normal cohorts; each curve represents the kernel density estimate of the distribution of a single fCpG across the TCGA cohort (blue; N = 633) or the normal breast cohort (yellow; N = 79). E In a separate cohort of breast cancers (N = 146, Lund cohort), the inter-tumor standard deviation of \(\beta\) was higher for fCpG sites in the clock set \({{\mathscr{C}}}\) (median: 0.188) compared to CpGs not included in \({{\mathscr{C}}}\) (median: 0.102; P < \(10^{-10}\); Wilcoxon rank-sum test). F In a small cohort of patients (N = 5) with multiple primary samples (3-5 per patient), the intra-tumor standard deviation of β was higher for CpG sites in the clock set \({{\mathscr{C}}}\) (median: 0.056), compared to CpGs not included in \({{\mathscr{C}}}\) (median: 0.026; P < \(10^{-10}\), Wilcoxon rank-sum test).
In the second step, we ordered the set of unbiased CpG sites by between-tumor variability and included only the 500 most fluctuating sites in the final clock set of unbiased fCpGs (Fig. 2C; see Supplementary Data 1 for clock fCpGs). Importantly, this final step excludes non-informative sites that either do not fluctuate at all (i.e., imprinted hemi-methylated state) or fluctuate too fast (i.e., steady-state methylation of \(\beta \approx 0.5\) reached on time scales much shorter than the average tumor mitotic age at diagnosis). To ensure that this step did not select for CpG sites that mainly captured cell-type composition effects, we verified that individual fCpGs had low \({{\rm{\beta }}}\)-value variation across the normal samples (median standard deviation of 0.06 compared to 0.19 across tumors; P < \(10^{-10}\), Wilcoxon rank-sum test; Fig. 2D). Furthermore, as we expected, the fCpG sites were enriched for non-genic/regulatory CpGs, with 27.6% of fCpGs versus 17.8% of non-fCpGs being non-genic/regulatory (P = \(2\times 10^{-8}\), \({\chi }^{2}\) test).
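The two-step selection can be sketched as follows. This is a simplified illustration, not the pipeline used for the analyses: the input layout (dictionaries mapping a CpG identifier to per-sample \(\beta\)-values), the function name `select_clock_sites`, and the use of the population standard deviation are assumptions, and the additional filter on luminal and basal epithelial samples is omitted for brevity.

```python
from statistics import mean, pstdev


def select_clock_sites(tumor_betas, normal_betas, lo=0.4, hi=0.6, top_k=500):
    """Return IDs of balanced sites, most variable across tumors first.

    tumor_betas / normal_betas: dict mapping CpG ID -> list of beta-values,
    one per sample (hypothetical input layout).
    """
    # Step 1: keep sites whose mean beta is balanced in both cohorts.
    balanced = [
        cpg for cpg in tumor_betas
        if cpg in normal_betas
        and lo <= mean(tumor_betas[cpg]) <= hi
        and lo <= mean(normal_betas[cpg]) <= hi
    ]
    # Step 2: rank by inter-tumor standard deviation and keep the top_k sites.
    ranked = sorted(balanced, key=lambda cpg: pstdev(tumor_betas[cpg]), reverse=True)
    return ranked[:top_k]
```

In this sketch, a site that is balanced in tumors but biased in normal tissue is discarded in the first step, regardless of how variable it is across tumors.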
Next, we sought to validate the unbiased and fluctuating nature of the clock set in two independent cohorts. In a cohort of 146 breast cancer patients (Lund cohort)18, we found significantly higher inter-tumor variability in \(\beta\)-values among the CpG sites in the clock set, as compared to the CpG sites not included in the clock set (Fig. 2E). Similarly, in a small cohort of 5 patients with multiple samples from their primary tumors19, we found elevated intra-tumor variability in clock set vs non-clock set sites (Fig. 2F). Together, these patterns corroborate the unbiased and fluctuating nature of the clock set of CpG sites.
Interestingly, the fCpG sites in the clock set were more tightly concentrated around \(\beta =0.5\) in each normal breast sample, as compared to breast cancers (Supplementary Fig. 1). Consistent with the underlying dynamic model of the clock (Fig. 1A), this suggests that over decades of breast development and maintenance, the fCpGs had converged to the stationary methylation state of \(\beta =0.5\) (Fig. 1C).
Epigenetic clock index
At the level of individual tumors, the 500 fCpG sites in the clock set exhibited primarily unimodal or bimodal distributions of \(\beta\)-values (Fig. 3A). We explored how these tumor-specific distributions of \(\beta\)-values could be used to estimate tumor mitotic age. In the founding tumor cell, each fCpG starts in either the unmethylated (\(\beta =0\)), hemi-methylated (\(\beta =0.5\)), or methylated (\(\beta =1\)) state (Fig. 1A). Although the trajectories of individual sites are subject to stochastic fluctuations (Fig. 1C), an ensemble of sites starting in the same initial configuration collectively drift toward the steady state of \(\beta =0.5\) (Fig. 3B).
A The empirical β-value distributions for the clock set fCpGs (N = 500) are shown for three select tumors in the TCGA cohort. B Simulated trajectories for an ensemble of fCpG sites (N = 90), starting in the unmethylated, hemi-methylated, and methylated initial configurations, respectively (n = 30 each). Simulation parameters as detailed in Fig. 1. C Cross-sectional \(\beta\)-value distributions for the simulated clock set in panel B, shown after 0, 1, and 2 years of growth. D Standard deviation (sβ) of \(\beta\)-values and epigenetic clock index (cβ = 1 - sβ) over 2 years of growth for the simulated clock set in B. E The distribution of epigenetic clock index values (\({c}_{\beta }\)) across invasive ductal carcinomas in TCGA (N = 400); the three tumors from (A) are labeled.
By considering the histograms of \(\beta\)-value distributions at different tumor mitotic ages, we can track the evolution of the three “peaks” corresponding to the subsets of initially unmethylated, hemi-methylated, and methylated clock sites (Fig. 3C). As the tumor’s mitotic age increases, the left peak of the histogram (consisting of originally unmethylated clock sites) starts moving to the right, whereas the right peak (originally methylated sites) moves to the left; the middle peak (originally hemi-methylated sites) remains stationary. By measuring the extent to which the three peaks have converged to the stationary value of \(\beta =0.5\), we can thus estimate the mitotic age of individual tumors.
Concretely, we used the standard deviation of the \(\beta\)-values, denoted by \({s}_{\beta }\), to quantify the relationship between tumor mitotic age and the evolving clock set profile (Fig. 3C). Because \({s}_{\beta }\) is highest at time 0, when the \(\beta\)-value distribution exhibits three sharp peaks, and then monotonically decreases over time (Fig. 3D), we introduced the epigenetic clock index \({c}_{\beta }=1-{s}_{\beta }\) as a proxy measure of tumor mitotic age (Fig. 3E; see Supplementary Data 2 for clock index values).
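As a minimal sketch, the index can be computed directly from a tumor's fCpG \(\beta\)-values. Whether the population or the sample standard deviation is used is not stated here, so the population form is an assumption. The helper `ensemble_at` is hypothetical: it generates the expected \(\beta\)-values of an idealized ensemble under symmetric flipping with rate \(\mu\), without stochastic noise, to show that the index rises as the peaks converge.

```python
from statistics import pstdev


def clock_index(betas):
    """Epigenetic clock index c_beta = 1 - s_beta for one tumor's fCpG beta-values."""
    return 1.0 - pstdev(betas)


def ensemble_at(generations, mu=0.002, n_per_state=30):
    """Expected beta-values of sites starting at 0, 0.5 and 1 (mean dynamics only)."""
    decay = (1.0 - 2.0 * mu) ** generations
    return [0.5 + (b0 - 0.5) * decay
            for b0 in (0.0, 0.5, 1.0)
            for _ in range(n_per_state)]


if __name__ == "__main__":
    for n in (0, 100, 200, 500):
        print(n, round(clock_index(ensemble_at(n)), 3))
```

With three equally sized peaks at 0, 0.5, and 1, the idealized index starts near 0.59 and approaches 1 as the ensemble converges to \(\beta =0.5\).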
While most breast cancers are thought to derive from a single clone of origin, recent work suggests that some breast cancers may derive from multiple founder cells20. If this is the case, the initial \(\beta\)-value distribution of fCpGs is not confined to the three discrete peaks at \(\beta\) = 0, 0.5, and 1. The initial epigenetic entropy of a multiclonal cancer is thus higher compared to a cancer of monoclonal origin, and the epigenetic clock index will overestimate the true tumor mitotic age at the time of resection. However, because it appears unlikely that a tumor derives from a large number of independent clones, we expect this bias to be limited.
We identified a significant difference in the distribution of the epigenetic clock index \({c}_{\beta }\) between invasive ductal carcinomas (IDC; median, 0.81) and invasive lobular carcinomas (ILC; median, 0.85; P < \(10^{-10}\), Wilcoxon rank-sum test). This is aligned with the notion of ILC as a molecularly and clinically distinct disease entity21, and consistent with the propensity of ILCs to be mammographically occult22, which may lead to older tumor mitotic ages at the time of diagnosis. Cognizant of these differences between ILC and IDC, we decided to reduce further confounding in downstream analyses by focusing on the ductal cancers.
In the next two sections, we characterize the relationship between a tumor’s mitotic age, as quantified by the epigenetic clock index \({c}_{\beta }\), and its evolutionary-ecological context as determined by its intrinsic growth potential and external pressures from the microenvironment.
Younger tumors have more aggressive phenotypes
As a breast tumor grows, its likelihood of detection on the basis of imaging or symptoms increases. Because fast growing tumors are expected to reach a detectable size sooner than slow growing ones, we hypothesized that younger tumor mitotic age would correlate with established markers of tumor aggressiveness. To test this hypothesis, we correlated the epigenetic clock index with several established features of tumor aggressiveness, including molecular subtype23,24, genomic instability25,26, grade27, proliferation28, and size29.
There was a clear relationship between tumor mitotic age and molecular subtype: luminal A tumors, which have a more favorable prognosis, were older than luminal B and basal tumors (Fig. 4A-B; Supplementary Table 1). Similarly, when using clinical instead of molecular subtyping in two additional cohorts of breast tumors (Germany cohort, n = 253; WCHS cohort, n = 445), triple-negative cancers tended to be younger compared to luminal A and luminal B cancers (Supplementary Fig. 2A, 3). Consistent with these subtype patterns, there was a strong correlation between genomic instability and younger tumor mitotic age (Fig. 4C,D). Younger tumors were of higher histopathologic grade (Fig. 4E) but not stage (Fig. 4F), and more likely to be Ki-67-positive (Supplementary Fig. 2B).
The distribution of epigenetic clock index values (cβ) among invasive ductal carcinomas in the Lund and TCGA cohorts, by (A, B) molecular subtype as predicted by the PAM50 algorithm; (C, D) fraction of genome altered (FGA) by copy number alterations; (E) tumor grade; (F) tumor stage; (G) tumor size; and (H) T-stage. Pairwise comparisons of medians in panels A, B, E, F, and H were performed using a two-sided Wilcoxon rank-sum test (*P < 0.05, **P < 0.01, ***P < 0.001). In panels C, D, and G, regression lines and bootstrapped 95% confidence intervals are shown; Pearson correlations (R) are indicated. Boxplots in A, B, E, F, H, display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.
Another prognostic factor in breast cancer is tumor size, with larger lesions having worse outcomes. We found that smaller tumors were of older mitotic age compared to larger tumors (Fig. 4G, H), presumably because slow growing tumors spend more time at the smaller end of the detectable size range, and are, therefore, more likely to be detected at a smaller size.
Finally, the relationship between tumor mitotic age and patient age at diagnosis was inconclusive, with a weak negative correlation in TCGA (R = −0.18) and no correlation in either the Lund (R = −0.10; Supplementary Table 1) or WCHS (R = −0.03; Supplementary Table 1) cohorts. This is consistent with the notion that the fCpG clock measures the age of the tumor—starting with the most recent common ancestor cell—and not the age of the patient.
Identifying modulators of tumor mitotic age
The time it takes for a tumor to grow from a single cell to a detectable mass depends on its effective growth rate, that is the difference between cell proliferation and cell death (Fig. 5A). Cell proliferation primarily reflects the tumor’s intrinsic growth potential and aggressiveness, whereas cell death is often the result of extrinsic selective pressures applied by the tumor microenvironment, such as immune surveillance and resource constraints due to limited vascularization30,31.
A Simulation of tumor size as a function of tumor age. Both tumors have the same proliferation rate (\(\alpha =0.17\) divisions/day), but different death rates: \(\lambda =0.15\) deaths/day (blue) vs. \(\lambda =0.16\) deaths/day (yellow). B Pearson’s correlation between epigenetic clock index \({c}_{\beta }\) and expression of protein-coding genes (TCGA cohort); correlation with select genes as indicated. C, D Correlation of \({c}_{\beta }\) with average expression of genes involved in M-phase and mitotic checkpoint regulation. E Correlation of \({c}_{\beta }\) with the fraction of cells in S-phase, as measured by flow cytometry (Lund cohort). Regression lines shown with bootstrapped 95% confidence intervals and Pearson correlation (R).
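The effect of the death rate on time to detection can be made concrete with a deterministic approximation of the growth curves in panel A. The sketch below assumes pure exponential growth at the net rate \(\alpha -\lambda\) and a hypothetical detection threshold of \(10^{9}\) cells; neither the threshold nor the function name `days_to_size` comes from the analyses, and the stochastic birth-death simulation will deviate from this approximation, particularly at small cell numbers.

```python
import math


def days_to_size(alpha, lam, target_cells=1e9):
    """Days for deterministic exponential growth from one cell to target_cells.

    alpha: proliferation rate (divisions/day); lam: death rate (deaths/day).
    target_cells is a hypothetical detection threshold.
    """
    net_rate = alpha - lam  # effective growth rate per day
    if net_rate <= 0:
        raise ValueError("no net growth when the death rate >= proliferation rate")
    return math.log(target_cells) / net_rate


if __name__ == "__main__":
    for lam in (0.15, 0.16):
        print(lam, round(days_to_size(0.17, lam) / 365.0, 1), "years")
```

Under these assumptions, raising the death rate from 0.15 to 0.16 per day halves the net growth rate and doubles the time to the threshold, from roughly 2.8 to roughly 5.7 years.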
To explore putative modulators of effective tumor growth and tumor mitotic age at diagnosis, we performed genome-wide correlation analyses of the epigenetic clock index \({c}_{\beta }\) against gene expression. As predicted, mitotically younger tumors exhibited increased expression of proliferation-related genes such as Ki67 and MCM2 (Fig. 5B, Supplementary Table 1). The signal was further augmented when considering the average expression across a set of genes involved in M-phase and mitotic checkpoint regulation (Fig. 5C,D) and the fraction of cells in S-phase (Fig. 5E).
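A genome-wide correlation screen of this kind can be sketched in a few lines. The input layout (a list of clock index values and a dictionary mapping each gene to its expression values in the same tumor order) and the function names are assumptions for illustration; the sketch does not handle genes with constant expression, for which the correlation is undefined, and it omits any multiple-testing correction.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_genes_by_correlation(clock_index, expression):
    """Correlate c_beta with each gene and sort from most negative to most positive.

    expression: dict mapping gene name -> expression values, one per tumor,
    in the same order as clock_index (hypothetical input layout).
    """
    results = [(gene, pearson_r(clock_index, values))
               for gene, values in expression.items()]
    return sorted(results, key=lambda item: item[1])
```

Genes at the start of the returned list are more highly expressed in mitotically younger tumors, and genes at the end in mitotically older tumors.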
Next, we examined the microenvironment’s ability to decrease the effective growth rate of a tumor through increased cell death. As hypothesized, the expression of immune cell markers such as CD3, CD4, CD8 and FOXP3 was elevated in mitotically older tumors (Fig. 5B; Supplementary Table 1). This suggests that tumors which are subject to immune surveillance (e.g., through neo-antigen directed immune control by CD8+ T-cells) have a lower effective growth rate and, thus, reach a detectable size at an older mitotic age, as compared to tumors that successfully evade immune control and thus reach a detectable size at a younger mitotic age.
To perform a systematic analysis of tumor mitotic age modulation, we performed a genome-wide gene set enrichment analysis (GSEA) (Fig. 6A). Consistent with the univariate gene expression analyses, mitotically younger tumors were enriched for pathways related to proliferation and cell cycle control. Conversely, mitotically older tumors were enriched for immune pathways and immune-related signaling pathways, again supporting the notion of effective immune control in older, slower growing lesions.
A A gene set enrichment analysis (GSEA) was performed for the epigenetic clock index \({c}_{\beta }\). Pathways with a positive (negative) enrichment score are enriched in mitotically older (younger) tumors. Only pathways with a false discovery rate (FDR) below 0.1 are shown; *FDR < .05; **FDR < .01; ***FDR < .001; ****FDR < .0001. B Epigenetic clock index vs. the extent of immune infiltration (absolute immune score) as estimated by CIBERSORTx. C The immune compartment of each tumor was decomposed using CIBERSORTx; the compartment fractions are shown for tumors of similar mitotic age (epigenetic clock index \({c}_{\beta }\)). Boxplots display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.
For a more in-depth analysis of the immune infiltrate, we used the CIBERSORTx algorithm32 to estimate the extent and composition of the immune compartment. As expected, the extent of the immune compartment increased with tumor mitotic age (Fig. 6B). When decomposing each tumor’s immune compartment into the major cell types, we found an increase in the fraction of T-cells in mitotically older tumors (Fig. 6C, Supplementary Table 2), again suggestive of T-cell mediated immune surveillance.
Analysis of paired tumor samples validates epigenetic clock
Multiple tumor samples from the same patient provide a unique opportunity to assess the internal validity of the epigenetic clock. Indeed, paired samples should be epigenetically more related, via their most recent common ancestor cell, than samples from different patients. In a cohort of 8 women with multi-focal breast cancer33, we found that the within-patient correlations of the clock set fCpG sites were higher (median, 0.70) than the between-patient correlations (median, 0.11; P = \(3\times 10^{-6}\), Wilcoxon rank-sum test; Fig. 7A). The same held true for a cohort of 18 patients with paired primary tumors and lymph node metastases34 (median, 0.85 vs. 0.13, P < \(10^{-10}\); Fig. 7B) and a subset of 22 patients with paired primary tumors and metastases (including lymph node and distant metastases) from the AURORA US Metastasis Project35 (median, 0.66 vs. 0.13, P < \(10^{-10}\); Supplementary Fig. 4).
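The within- versus between-patient comparison can be sketched as below. The input layout (a list of patient identifier and \(\beta\)-value vector pairs, with all vectors ordered by the same fCpG sites) and the function names are assumptions for illustration, and the sketch reports only the two medians rather than a rank-sum test.

```python
import math
from itertools import combinations
from statistics import median


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def within_between_correlations(samples):
    """Median fCpG correlation for same-patient and different-patient sample pairs.

    samples: list of (patient_id, beta_values) tuples (hypothetical layout).
    """
    within, between = [], []
    for (pid_a, betas_a), (pid_b, betas_b) in combinations(samples, 2):
        r = pearson_r(betas_a, betas_b)
        (within if pid_a == pid_b else between).append(r)
    return median(within), median(between)
```

Every unordered pair of samples is visited once, so a patient with more than two samples contributes several within-patient pairs.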
A In a cohort of 8 patients with multifocal breast cancer, the \(\beta\)-values of the 500 fCpG sites of the epigenetic clock are correlated within (two foci per patient) and between patients. B In a cohort of 18 patients with paired primary tumor and lymph node metastasis samples, the \(\beta\)-values of the 500 fCpG sites are correlated within and between patients. C For the 18 patients from panel B and 22 patients from the AURORA US Metastasis Project, the epigenetic clock index of the primary tumor is plotted against the index of the lymph node metastasis. Three patients (AF, AI, and AX) from the lymph node cohort are labeled for the purposes of the following 3 panels. D–F For the 3 labeled patients from C, the \(\beta\)-value distributions of fCpGs are shown both for the primary tumor and the metastasis. G Monte Carlo simulation of an ensemble of 90 fCpG sites, 30 each starting in the unmethylated, hemi-methylated, and methylated initial configurations; every 3 months, a cell was randomly picked to represent the metastasis seeding cell, and the Pearson correlation between the \(\beta\)-value distribution of that cell and that of the entire tumor was calculated. Distributions at each time point represent the results from 30 independent simulations; boxplots display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers. Simulation parameters are detailed in Fig. 1. H For the 18 patients from panel B and 22 patients from the AURORA US Metastasis Project, the epigenetic relatedness of primary and lymph node metastasis (Pearson’s R for the 500 fCpG sites in the clock set) is compared to the difference in epigenetic clock index, as a proxy for the difference in tumor mitotic age between the two samples.
The two cohorts of patients with paired primary and metastasis samples34,35 allowed us to test two additional properties of the fCpG clock. First, assuming that each metastasis is seeded by a single cell from the primary tumor, synchronous metastases should be younger than their matched primaries. Indeed, the epigenetic clock measures the mitotic age of the metastasis relative to the seeding event, which occurred after initiation of the primary tumor. Consistent with this prediction, metastases had a lower epigenetic clock index compared to their matched primaries in 33/40 patients, and only one metastasis was noticeably older than the matched primary (Fig. 7C). These findings provide direct support for interpreting the epigenetic clock index as a proxy measure for tumor mitotic age. However, it is important to acknowledge that potential differences in tumor biology and microenvironmental conditions between primary tumors and metastases may act as unmeasured confounders in this analysis.
Second, the timing of metastatic dissemination relative to the primary tumor’s age is expected to impact the epigenetic similarity of the two samples: if the metastasis is seeded early during primary tumor growth (i.e., similar \({c}_{\beta }\) values), the \(\beta\)-values of the two samples are expected to be closely related (Fig. 7D, E) because the metastasis seeding cell came from a mostly homogeneous population; conversely, if the metastasis is seeded late (i.e., different \({c}_{\beta }\) values), the \(\beta\)-values are expected to differ more substantially (Fig. 7F) because the seeding cell came from a heterogeneous population. Corroborating this hypothesis, and consistent with a corresponding simulation of metastatic seeding based on the oscillator model (Fig. 7G), we found a negative correlation between the primary-metastasis difference in mitotic age and \(\beta\)-value similarity (Fig. 7H).
Finally, we note that metastases may be seeded not by a single cell, but by a cluster of cells from the primary tumor36. In such cases—similar to the scenario of a multiclonal primary tumor (see ‘Epigenetic clock index’)—our method may overestimate the true mitotic age of the metastasis. However, as long as the estimated mitotic age of the metastasis is younger than that of the matched primary, its true mitotic age must also be younger, and thus the conclusion derived from Fig. 7C remains valid.
Quantifying tumor mitotic age
So far, we have used the epigenetic clock index \({c}_{\beta }\) as a correlate of tumor mitotic age. To derive quantitative estimates of each tumor’s mitotic and calendar ages, we proceeded as follows (see Methods for details). First, we invoked the mathematical oscillator model (Fig. 1A) to relate tumor mitotic age to the measured \(\beta\)-values of fCpG sites in the clock set. Next, we decomposed each tumor’s empirical fCpG \(\beta\)-value distribution into three groups (Fig. 8A): originally unmethylated fCpGs (left peak in the histogram), originally hemi-methylated fCpGs (middle peak), and originally methylated fCpGs (right peak). Finally, we combined the peak location in each group with the oscillator model to infer the estimated mitotic age of the tumor (Fig. 8B).
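To illustrate the final step, the sketch below inverts a simple version of the oscillator model to convert peak locations into a number of generations. It assumes symmetric per-allele flipping with rate \(\mu\) per division, so that a peak starting at \(\beta_0\) is expected at \(0.5+(\beta_0-0.5)(1-2\mu)^{n}\) after \(n\) generations. The function names and the choice to average the estimates from the left and right peaks are assumptions; the procedure described in Methods may differ.

```python
import math


def generations_from_peak(peak_beta, beta0, mu=0.002):
    """Invert beta(n) = 0.5 + (beta0 - 0.5) * (1 - 2*mu)**n for n."""
    if beta0 == 0.5:
        raise ValueError("hemi-methylated sites stay at 0.5 and carry no age signal")
    remaining = (peak_beta - 0.5) / (beta0 - 0.5)  # fraction of initial offset left
    if not 0.0 < remaining <= 1.0:
        raise ValueError("peak must lie between its initial state and 0.5")
    return math.log(remaining) / math.log(1.0 - 2.0 * mu)


def mitotic_age(left_peak, right_peak, mu=0.002):
    """Average the estimates from the initially unmethylated and methylated peaks."""
    return 0.5 * (generations_from_peak(left_peak, 0.0, mu)
                  + generations_from_peak(right_peak, 1.0, mu))
```

In this simplified model the middle peak is uninformative because it remains at 0.5, so only the outer two peaks contribute to the estimate.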
A The empirical \({{\rm{\beta }}}\)-value distributions of the 500 fCpG sites in the clock set are decomposed into three peaks corresponding to initially unmethylated (blue), hemi-methylated (yellow), and fully methylated (red) sites. B Tumor mitotic age (the number of generations) is shown for the TCGA and Lund cohorts and colored by molecular subtype. These estimates are based on the peak locations (panel A) and a (de-)methylation rate of \({{\rm{\mu }}}=2\cdot {10}^{-3}\), per cell division and per allele, which yields a mean tumor age of 3 calendar years (see E). C For the Lund cohort, tumor-specific proliferation rates are estimated using the measured fraction of cells in S-phase and shown by molecular subtype. D For the TCGA cohort, proliferation rates are estimated using S-phase fractions predicted based on gene expression, using a model trained on the Lund cohort. E Calendar age of tumors in the TCGA and Lund cohorts, colored by molecular subtype. Boxplots in C and D display: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.
Finally, we combined tumor-specific estimates of mitotic age (Fig. 8B) and proliferation rate (Fig. 8C, D) to derive tumor-specific estimates of calendar age (see Supplementary Data 2 for age estimates). Anchoring the median tumor age at a consensus estimate of 3 years (see Methods), the distribution of calendar ages across the TCGA and Lund cohorts ranged from 0.3 to 29.9 years, with an interquartile range of 1.6 to 5.5 years (Fig. 8E). There were notable differences in median tumor calendar ages by molecular subtype, ranging from 1.1 years in basal cancers to 6.5 years in luminal A cancers (Supplementary Table 3). Interestingly, even though there was only a small difference between Her2-positive and luminal A tumors with respect to tumor mitotic age (163 vs. 171 generations), there was a clear separation with respect to calendar age (2.5 vs. 6.5 years; Supplementary Table 3). In other words, the Her2-positive cancers accumulated a similar number of cell divisions in a much shorter time, which is consistent with a generally more aggressive clinical presentation and worse prognosis37,38.
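One way to picture the conversion from mitotic to calendar age is sketched below. It rests on two assumptions that are not taken from Methods: that a tumor's calendar age is proportional to its number of generations divided by its proliferation rate, and that the anchoring is a single cohort-wide rescaling that sets the median to 3 years. The function name `calendar_ages` is hypothetical.

```python
from statistics import median


def calendar_ages(generations, proliferation_rates, anchor_years=3.0):
    """Convert mitotic ages to calendar ages, anchored at a cohort median.

    Assumption: unscaled age = generations / proliferation rate, followed by
    one global rescaling so that the cohort median equals anchor_years.
    """
    raw = [g / r for g, r in zip(generations, proliferation_rates)]
    scale = anchor_years / median(raw)
    return [scale * t for t in raw]
```

Under these assumptions, two tumors with a similar number of generations can differ widely in calendar age when their proliferation rates differ, which is the pattern described for Her2-positive and luminal A tumors.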
Adjusting for tumor purity
Bulk samples contain a mixture of tumor and stroma. Because the epigenetic clock index exhibited correlations with tumor purity as measured in the TCGA cohort by the consensus purity estimate39 (CPE; R = −0.70; Supplementary Fig. 5A), we had restricted our analyses to samples of high tumor purity (CPE ≥ 0.6). Nevertheless, we cannot rule out that the observed variability in \(\beta\)-value distributions among the selected fCpG sites, which are used to estimate tumor mitotic age, was at least partially driven by the methylation patterns of admixed non-epithelial cells. If this is the case, then, e.g., the immune pathway enrichment of older tumors (Fig. 6A) may be confounded by the presence of non-epithelial cells that alter the measured \(\beta\)-value distribution.
First, we checked whether the selection of clock sites was strongly influenced by the presence of non-epithelial cells. To do this, we started with all unbiased sites (mean \(\beta \in [0.4,\,0.6]\) in TCGA, normals, basal/luminal epithelial) and separately identified the 500 most variable sites in the least and most pure tumors (bottom and top CPE quartiles). These two selections showed high overlap (91% and 99%) with the original clock set of most variable CpG sites across all tumors, suggesting that the selection of fCpGs was not unduly driven by the non-epithelial tumor compartment.
Next, to adjust for possible confounding of \(\beta\)-values by tumor purity, we modeled the measured methylation as a mixture of tumor and stroma methylation (see Methods for details). This adjustment was based on the CPE measure of tumor purity39; we also employed EpiSCORE40 to estimate the fraction of epithelial content and found that the two measures of purity were well correlated (R = 0.63). The resulting purity-adjusted epigenetic clock index \({c}_{\beta }^{\alpha }\) was more weakly correlated with tumor purity (R = −0.25, Supplementary Fig. 5B) and was lower than the unadjusted epigenetic clock index \({c}_{\beta }\) (Supplementary Fig. 5C).
When replacing the unadjusted epigenetic clock index with the purity-adjusted version, the strength of correlations between markers of tumor aggressiveness and younger tumor mitotic age remained unaltered (Supplementary Table 1, Supplementary Fig. 6, Supplementary Fig. 7). Individual immune genes and the extent of immune infiltration remained associated with older tumor mitotic age, although the correlations were attenuated (Supplementary Table 1, Supplementary Table 2). While the immune pathways were no longer enriched in older tumors (Supplementary Fig. 7), there was still a positive correlation between the fraction of T cells and tumor mitotic age (Supplementary Table 2).
Discussion
In this study, we developed an epigenetic clock to measure the age of newly diagnosed breast cancers. Measuring epigenetic entropy among neutrally fluctuating CpG (fCpG) sites, the clock requires no knowledge about the tumor’s initial DNA methylation state. This allows it to track the number of cell divisions since birth of the tumor, a task that previous tissue clocks were not able to accomplish. Based on standard methylation arrays, the clock has potential as a novel marker of aggressiveness and prognosis in early-stage breast cancer.
Once a patient is diagnosed with breast cancer, the tumor’s mitotic age encodes valuable prognostic information. Intuitively, a slow-growing tumor that takes a long time to reach the threshold of detection is more likely to have a good prognosis compared to a fast-growing tumor that quickly expands into a detectable mass. Our analyses corroborate this hypothesis by revealing that mitotically younger tumors were enriched for features of tumor aggressiveness and predictors of poor outcome, including genomic instability, higher grade, and basal molecular subtype25,26. This property of the epigenetic clock is quite remarkable given that its constituent fCpG sites were selected only on the basis of simple statistical properties of their \({{\rm{\beta }}}\)-value distributions.
Beyond prognostication, the clock holds promise in risk-stratified screening approaches. The efficacy of breast cancer screening critically depends on the sojourn time, that is the time window during which the tumor is asymptomatic but mammographically detectable. If the sojourn time is short, early detection is unlikely even under frequent screening; if it is long, some cancers will be overdiagnosed41. Sojourn time estimates are usually obtained by fitting natural history models to population data, yielding indirect, population-averaged estimates. Our approach, in contrast, allows for direct and individual-level characterization of tumor age, which provides an upper bound for the sojourn time. Assuming an overall median time to detection of 3 years in our cohort, the time to detection in luminal A cancers (6.5 years) was substantially longer compared to that in luminal B (2.4 years), Her2-positive (2.5 years), and basal (1.1 years) cancers. These estimates are consistent with the observation that interval cancers are enriched for more aggressive subtypes compared to screen-detected cancers42, and highlight opportunities for data-driven personalization of screening schedules.
The epigenetic clock also provides an opportunity to quantify the evolutionary-ecological pressures that shape the temporal landscape of breast cancers. Indeed, because most tumors are of comparable size at the time of diagnosis, tumor mitotic age is related to the effective growth rate: tumors that reach the detection threshold at a younger age have a higher effective growth rate compared to tumors that reach the threshold at a higher age. Our analyses characterized the effective growth rate of breast cancers as a competition between tumor-intrinsic growth potential (e.g., proliferation) and microenvironmental pressures (e.g., surveillance by immune cells)43. According to this model, highly proliferative tumors that successfully evade the immune system are detected at a younger age compared to less proliferative lesions subject to continuous immune control.
Our study has several limitations. First, because tumor age is not observable in practice, a direct validation of the clock is not possible. Nevertheless, we note that the clock correctly classified the age ordering of primary tumors and metastases in 33 of 40 patients. Second, the epigenetic clock index was correlated with sample purity, which suggests the latter may be a confounder in our analyses. To mitigate this risk of bias, we systematically repeated all analyses using purity-adjusted methylation values; while some of the associations were attenuated, the overall qualitative conclusions remained unchanged. Fully resolving the potential confounding of age estimates by tumor purity will require single-cell methylation data. Third, estimation of tumor mitotic age was based on a simple mathematical model of (de-)methylation dynamics. In future work, this approximation can be refined using more sophisticated simulation-based models that account for the underlying population dynamics, including cell proliferation and death, and possibly selection.
How long a newly diagnosed breast cancer has been growing is generally considered a known unknown. Here we revisited this assumption and developed a way to infer tumor age using standard methylation arrays. While developed specifically for breast cancer, the approach can be generalized to any cancer type and, as such, provides a scalable technology to characterize the temporal landscape of oncology.
Methods
TCGA cohort
Of the 1085 invasive breast cancers from female patients in The Cancer Genome Atlas (TCGA)44, 774 had available methylation array data (Infinium HumanMethylation450 BeadChip, Illumina, San Diego, CA, USA). After excluding 138 tumors of low tumor content (consensus purity estimate39 [CPE] <0.6) and one sample each from two patients with two primary samples, the remaining 634 tumors were used to select the ensemble of 500 fCpG sites. Finally, after excluding 10 tumors with ≥5% missing clock set fCpG measurements and 100 tumors with a histology code other than infiltrating duct carcinoma, the analytic cohort consisted of 400 tumors. The following variables were retrieved: patient age at diagnosis; tumor histology; T stage (subsetted to T1, T2, T3, and T4); summary stage (subsetted to stages I, II, or III). For all 400 patients with invasive ductal carcinoma, gene expression quantification (RNA-seq) and copy number segment data were available as well; when >1 measurement was available, one was selected at random. All clinical and sequencing data were retrieved from the Genomic Data Commons (GDC; https://gdc.cancer.gov) using the R package TCGAbiolinks (version 2.25.3)45.
Lund cohort
We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 181 primary breast cancers in the Southern Sweden Breast Cancer Group tissue bank at the Department of Oncology and Pathology, Skåne University Hospital (Lund, Sweden) and the Department of Pathology, Landspitali University Hospital (Reykjavik, Iceland)18. The data were obtained through the Gene Expression Omnibus (GSE75067). Because calculation of the purity metric CPE requires gene expression, somatic copy-number, and immunohistochemistry in addition to methylation data, we instead assessed tumor purity using the leukocyte unmethylation percentage (LUMP) value. A tumor’s LUMP value is calculated as the average \(\beta\)-value among 44 specific CpG sites, divided by 0.85; we found the LUMP value to be strongly correlated with CPE (R = 0.86). After excluding samples of low purity (LUMP < 0.6; n = 35), the remaining 146 samples all had <5% missing clock set fCpG measurements. After exclusion of non-ductal histology (n = 48) we ended up with an analytic cohort of n = 98. The following variables were retrieved: patient age at diagnosis; tumor grade; tumor size; molecular subtype (PAM50); fraction of genome altered (FGA); expression of a mitotic checkpoint gene module46; fraction of cells in S-phase (flow cytometry).
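The LUMP-based purity filter described above can be illustrated with a minimal sketch. It assumes the \(\beta\)-values at the 44 LUMP sites are already available as a list; the function names and the handling of missing values (skipped) are illustrative assumptions, not the pipeline used in the study.

```python
from statistics import mean

LUMP_SCALE = 0.85     # divisor stated in the text
PURITY_CUTOFF = 0.6   # samples with LUMP below this value were excluded


def lump_value(betas):
    """Average beta-value over the LUMP CpG sites, divided by 0.85.

    Missing measurements (None) are skipped; that choice is an assumption
    of this sketch.
    """
    observed = [b for b in betas if b is not None]
    if not observed:
        raise ValueError("no observed LUMP sites")
    return mean(observed) / LUMP_SCALE


def passes_purity_filter(betas, cutoff=PURITY_CUTOFF):
    """True if the sample would be retained (LUMP >= cutoff)."""
    return lump_value(betas) >= cutoff
```

For example, a sample whose 44 LUMP sites average 0.68 has a LUMP value of 0.8 and is retained, whereas an average of 0.34 gives 0.4 and the sample is excluded.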
Normal breast tissue cohort
We obtained publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 100 normal breast tissue samples in the Susan G. Komen Tissue Bank17,47 (GSE88883). We excluded samples of low purity (LUMP < 0.6), resulting in a cohort of 79 normal breast tissue samples used for identifying fCpG sites.
Germany cohort
We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 305 primary breast cancers collected within the Bavarian Breast Cancer Cases and Controls Study 2. The data were obtained through the Gene Expression Omnibus (GSE69914)48. After excluding samples of low purity (LUMP < 0.6; n = 52), the remaining 253 samples all had <5% missing clock set fCpG measurements. ER, PR, Her2, and Ki-67 status measured by IHC were retrieved. Clinical subtypes were defined as follows: Hormone receptor negative (HR-) tumors were defined as those that were ER- and PR-, while HR+ tumors had to be positive for one or both of ER and PR. Then, HR + /Her2- tumors were labeled as luminal A, HR + /Her2+ as luminal B, HR-/Her2+ as Her2 positive, and HR-/Her2- as triple-negative.
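The clinical subtype rules above translate directly into a small decision function. The sketch below assumes receptor status has already been dichotomized into booleans; the function name is hypothetical.

```python
def clinical_subtype(er_positive, pr_positive, her2_positive):
    """Map IHC receptor status to the clinical subtypes defined in the text.

    HR+ means ER and/or PR positive; HR- means both are negative.
    """
    hr_positive = er_positive or pr_positive
    if hr_positive:
        return "luminal B" if her2_positive else "luminal A"
    return "Her2 positive" if her2_positive else "triple-negative"
```

A tumor that is ER-negative, PR-positive, and Her2-negative is therefore labeled luminal A, because positivity for either hormone receptor suffices for HR+.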
WCHS cohort
We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from 694 primary breast cancers collected within the Women’s Circle of Health Study (WCHS). The raw intensity data in IDAT format were obtained through the Gene Expression Omnibus (GSE226569) using the R package GEOquery (version 2.76.0)49,50. The data were subjected to standard background correction, bias correction, masking, and conversion to \(\beta\)-values using the openSesame pipeline from the R package sesame (version 1.26.0)51. After removing DCIS (stage 0) tumors (n = 56), we excluded samples of low purity (LUMP < 0.6; n = 79). Then, after noting that some clock set fCpGs had highly sparse data, we removed 51 sites that were missing measurements in more than 25% of the remaining tumors. Finally, 114 tumors were excluded for having ≥5% missing measurements across the 449 remaining fCpGs, resulting in a final cohort of 445 patients with invasive tumors. ER, PR, and Her2 status measured by IHC, patient age, and summary stage were provided by the publishing authors. Clinical subtypes were determined as described above for the Germany cohort.
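The two-stage missingness filter (sites first, then tumors) can be sketched as follows. The data layout (a dictionary of per-tumor lists with `None` marking a missing measurement) and the function name are assumptions made for illustration only.

```python
def filter_sites_then_tumors(beta, max_site_missing=0.25, max_tumor_missing=0.05):
    """Apply the two missingness filters in the order described in the text.

    beta: dict mapping tumor id -> list of beta-values (None = missing),
          with one entry per clock-set site in a fixed order.
    Returns (kept_site_indices, kept_tumor_ids).
    """
    tumors = list(beta)
    n_sites = len(beta[tumors[0]])

    # Step 1: remove sites missing in more than 25% of tumors.
    kept_sites = [
        j for j in range(n_sites)
        if sum(beta[t][j] is None for t in tumors) / len(tumors) <= max_site_missing
    ]
    if not kept_sites:
        raise ValueError("all sites were removed by the site-level filter")

    # Step 2: remove tumors with >= 5% missing values among the remaining sites.
    kept_tumors = [
        t for t in tumors
        if sum(beta[t][j] is None for j in kept_sites) / len(kept_sites) < max_tumor_missing
    ]
    return kept_sites, kept_tumors
```

The order matters: dropping sparse sites first lowers the per-tumor missingness, so fewer tumors are lost in the second step than if the tumor filter were applied to all 500 sites.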
Single-cell DNAm atlas
We retrieved publicly available single-cell whole genome bisulfite sequencing data for breast luminal (N = 3 measurements) and basal (N = 4) epithelial tissue from the DNA methylation atlas of normal human cell types collected by Loyfer et al. (GSE186458)52, all of which were of very high purity (LUMP ≥ 0.99). We only retrieved data collected for the 482,422 sites that also appeared on the Infinium HumanMethylation450 BeadChip.
Multiple sample cohorts
We retrieved publicly available methylation array data (Infinium HumanMethylation450 BeadChip) from four cohorts with paired tumor samples. The first cohort consisted of 8 breast cancer patients with multiple primary samples (GSE106360)19. Only samples from the 5 patients (2 patients with 5 samples each; 3 patients with 3 samples each) who had not received neoadjuvant therapy were used. Because LUMP values were highly variable, we did not apply any purity filtering. The second cohort consisted of 10 patients diagnosed with multi-focal breast cancer (GSE39451)33. For each patient, methylation array data from 2 foci were available, and we only included the 8 patients where both samples were of sufficient purity (LUMP ≥ 0.6). The third cohort consisted of paired primary and lymph node metastasis samples from 44 patients (GSE58999)34. Only patients where both samples were of sufficient purity (LUMP ≥ 0.6) were included (n = 18). The fourth cohort, from the AURORA US Metastasis Project (GSE212370)35, consisted of primary and metastasis samples taken from 55 patients with metastatic breast cancer. Only patients with at least one primary and one metastasis sample of sufficient purity (LUMP ≥ 0.6) were included (n = 22). When more than one primary or metastasis sample was available, the one with the highest LUMP value was selected.
Selection of fluctuating CpG (fCpG) sites
First, we sought to identify CpG sites with balanced methylation and de-methylation rates, defined as having an average methylation content (β-value) between 0.4 and 0.6 in the TCGA cohort (N = 634), bulk normal breast tissue cohort (N = 79), single-cell breast luminal epithelial cohort (N = 3), and single-cell breast basal epithelial cohort (N = 4). CpG sites with ≥20 missing values in either the TCGA or bulk normal cohorts were excluded from this selection process. In the second step, we ranked all such balanced CpG sites by their \(\beta\)-value variance among tumors in the TCGA cohort and selected the 500 most variable fCpGs to define the clock set \({{\mathcal{C}}}\). Based on the clock set, each tumor was assigned an epigenetic clock index \({c}_{\beta }=1-{s}_{\beta }\) where \({s}_{\beta }\) is the standard deviation of the β values in the clock set. We compared the proportion of genic/regulatory sites in clock sites vs. non-clock sites using the \({\chi }^{2}\) test; genic/regulatory CpG sites were identified as those associated with regulatory features or genes in one or both of the official annotation files of the Infinium HumanMethylation450 and MethylationEPIC bead chip arrays (https://support.illumina.com).
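The selection procedure and the clock index can be summarized in a short sketch. Two points are assumptions here because the text does not specify them: the population (rather than sample) form of the variance and standard deviation, and the data layout. Function names are hypothetical.

```python
from statistics import pstdev, pvariance


def is_balanced(mean_betas, low=0.4, high=0.6):
    """True if a site's mean beta lies in [0.4, 0.6] in every reference cohort."""
    return all(low <= m <= high for m in mean_betas)


def select_clock_set(tumor_betas, balanced_sites, n_sites=500):
    """Rank balanced sites by beta-value variance across tumors, keep the top n.

    tumor_betas: dict mapping site id -> list of beta-values across tumors.
    """
    ranked = sorted(balanced_sites,
                    key=lambda s: pvariance(tumor_betas[s]),
                    reverse=True)
    return ranked[:n_sites]


def clock_index(betas):
    """c_beta = 1 - s_beta, with s_beta the standard deviation of one
    tumor's beta-values over the clock set."""
    return 1.0 - pstdev(betas)
```

Under this definition, a tumor whose clock-set \(\beta\)-values all equal 0.5 has \({c}_{\beta }=1\), while one whose values are split evenly between 0 and 1 has \({c}_{\beta }=0.5\).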
Gene expression analyses
For tumors in TCGA, relative gene expression levels were taken as the mean-centered, log2(x + 1) transformation of the reported transcript per million (TPM) intensities. Among the 60,616 RNA transcripts recorded in TCGA, only those classified as protein-coding genes by the HUGO Gene Nomenclature Committee53 were included in subsequent analyses (n = 18,910). Expression of a mitotic checkpoint gene module46 was calculated as the average relative expression of the genes included in the module. Molecular subtyping was based on the PAM50 algorithm54 as implemented in R package Genefu55. For tumors in the Lund cohort, identical gene expression and molecular subtyping analyses had previously been reported46, thus enabling a direct comparison between tumors in the TCGA and Lund cohorts.
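As a minimal illustration of the expression transform, the sketch below applies log2(x + 1) to the TPM values of one gene and then centers them. Centering per gene across tumors is an assumption of this sketch, as are the function names.

```python
from math import log2
from statistics import mean


def relative_expression(tpm_across_tumors):
    """Mean-centered log2(TPM + 1) for one gene across tumors."""
    logged = [log2(x + 1.0) for x in tpm_across_tumors]
    center = mean(logged)
    return [v - center for v in logged]


def module_score(relative_by_gene, module_genes, tumor_index):
    """Average relative expression of a gene module in one tumor.

    relative_by_gene: dict mapping gene -> list returned by relative_expression.
    """
    return mean(relative_by_gene[g][tumor_index] for g in module_genes)
```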
Pathway enrichment analyses
For the TCGA cohort, we performed a gene set enrichment analysis (GSEA) using the software package GSEA56,57 to identify Hallmark gene sets that are correlated with the epigenetic clock index \({c}_{\beta }\). The analysis was performed using the Pearson correlation to rank individual genes; phenotype-permutation-based P values and false-discovery rate (FDR) Q values were computed using 1000 permutations. All other inputs were kept at their defaults.
CIBERSORTx
To assess the immune cell composition within the tumor microenvironment, we employed CIBERSORTx using the LM22 signature matrix and batch correction32. Briefly, RNA-seq data from the TCGA tumor samples were uploaded to the CIBERSORTx web portal, where gene expression profiles were deconvoluted to estimate the absolute scores for 22 distinct immune cell types. The analysis was performed with the default parameters, including 100 permutations for statistical significance assessment. For reporting of results the 22 distinct cell types were then collapsed into six mutually exclusive categories: B cells, macrophages, mast cells, myeloid cells, natural killer (NK) cells, and T cells.
Copy-number analyses
For tumors in TCGA, copy-number (CN) data consisted of specified chromosomal regions of equal CN, the \({\log }_{2}(\frac{x}{2})\)-transformed CN, and the number of probes. We converted these values to absolute copy numbers and determined each segment to have either a copy number gain (segment mean ≥ 2.5), a copy number loss (segment mean ≤1.5), or no change (1.5 < segment mean < 2.5). The fraction of the genome altered by copy number gains and losses were each calculated for every tumor by dividing the number of probes affected by gains and losses, respectively, by the total number of probes. The total fraction of the genome altered by copy number alterations (FGA) was then calculated as the sum of these two values. For tumors in the Lund cohort, the same approach had previously been used to compute FGA2, thus enabling direct comparison between tumors in the TCGA and Lund cohorts.
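The FGA computation can be sketched as below, assuming each segment is given as a pair of its \({\log }_{2}(x/2)\) segment mean and its probe count. The thresholds are those stated in the text; the function names and input layout are illustrative.

```python
def absolute_copy_number(segment_mean):
    """Invert the log2(x / 2) transform: x = 2 * 2**segment_mean."""
    return 2.0 * 2.0 ** segment_mean


def fraction_genome_altered(segments, gain_cutoff=2.5, loss_cutoff=1.5):
    """Probe-weighted fractions of the genome affected by gains and losses.

    segments: iterable of (segment_mean_log2, n_probes) pairs.
    Returns (gain_fraction, loss_fraction, fga), where fga is their sum.
    """
    total = gained = lost = 0
    for segment_mean, n_probes in segments:
        cn = absolute_copy_number(segment_mean)
        total += n_probes
        if cn >= gain_cutoff:
            gained += n_probes
        elif cn <= loss_cutoff:
            lost += n_probes
    if total == 0:
        raise ValueError("no probes in input")
    return gained / total, lost / total, (gained + lost) / total
```

For instance, three segments with segment means 0, 1, and −1 (absolute copy numbers 2, 4, and 1) covering 50, 30, and 20 probes give a gain fraction of 0.3, a loss fraction of 0.2, and an FGA of 0.5.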
In silico model of tumor growth and fCpG dynamics
To simulate the dynamics of fCpG sites in a growing tumor, we used a discrete-time birth-death process. Starting with a single founding tumor cell, the population is updated in time intervals of one day, at which time each cell either divides, dies, or remains unchanged with probabilities \(\alpha\), \(\lambda\), and \(1-\alpha -\lambda\), respectively. Upon cell division, each allele in each cell changes its methylation state with probability \(\mu\). We tracked an ensemble of 90 fCpG sites, assuming independent (de-)methylation dynamics. Unless otherwise specified, the following parameters were used: \(\alpha =0.17\) (the estimated mean proliferation rate in the TCGA-Lund combined cohort, see below for details), \(\lambda =0.15\) (to reach a population of \({10}^{9}\) cells in 3 years, or, \({\left(1+\alpha -\lambda \right)}^{3\cdot 365}\approx {\,10}^{9}\)), and \({{\rm{\mu }}}=0.002\) (the estimated flip rate in the combined cohort, see below for details).
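A toy version of such a birth-death simulation is sketched below; it is not the authors' implementation. Several choices are assumptions made for illustration: two alleles per site, a founder whose alleles are drawn at random, flips applied independently in both daughter cells, and a hypothetical `max_cells` cap that stops the run early so it stays cheap.

```python
import random


def expected_population(alpha, lam, days):
    """Expected population size after `days` daily updates from one cell."""
    return (1.0 + alpha - lam) ** days


def simulate(days, alpha=0.17, lam=0.15, mu=0.002, n_sites=90,
             max_cells=2000, seed=0):
    """Discrete-time birth-death process with allele-level methylation flips.

    Each cell is a list of 2 * n_sites alleles (0 = unmethylated,
    1 = methylated). Returns the list of surviving cells (possibly empty).
    """
    rng = random.Random(seed)
    founder = [rng.choice((0, 1)) for _ in range(2 * n_sites)]
    cells = [founder]
    for _ in range(days):
        next_cells = []
        for cell in cells:
            u = rng.random()
            if u < alpha:
                # Division: two daughters, each allele flips with probability mu.
                for _ in range(2):
                    next_cells.append(
                        [a ^ 1 if rng.random() < mu else a for a in cell])
            elif u < alpha + lam:
                continue  # cell death
            else:
                next_cells.append(cell)  # unchanged
        cells = next_cells
        if not cells or len(cells) >= max_cells:
            break  # extinction, or cap reached
    return cells


def bulk_beta(cells, n_sites=90):
    """Bulk beta-value per site: methylated fraction over all alleles."""
    if not cells:
        raise ValueError("population is extinct")
    n_alleles = 2 * len(cells)
    return [sum(c[2 * j] + c[2 * j + 1] for c in cells) / n_alleles
            for j in range(n_sites)]
```

With the default parameters, `expected_population(0.17, 0.15, 3 * 365)` is on the order of \({10}^{9}\), in line with the calibration of \(\lambda\) described above. Note that a lineage started from a single cell frequently goes extinct when \(\lambda\) is close to \(\alpha\), so individual runs may return an empty population.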
Tumor specific proliferation rates
For tumors in the Lund cohort, tumor specific proliferation rates \({\alpha }_{i}\) were estimated based on the reported fraction \({f}_{i}\) of cells in S-phase as \({\alpha }_{i}={f}_{i}/{T}_{S}\), where \({T}_{S}\) is the average time spent in S-phase (see Supplementary Methods for details). We assumed \({T}_{S}\) to equal 12.7 hours, based on an average across five cancer cell lines58. Because \({f}_{i}\) is not reported in the TCGA cohort, we used the Lund data to develop a predictive model of S-phase fraction using an elastic net model. As candidate predictors, we included FGA, LUMP value, and average gene expression levels within each of the following gene modules46: mitotic checkpoint (see above), immune response, stroma, mitotic progression, early response, steroid response, basal, and lipid. The model was fit to the Lund cohort tumors using cross-validation for hyperparameter optimization, and then applied to TCGA tumors to predict tumor-specific S-phase fractions \({f}_{i}\) and proliferation rates \({\alpha }_{i}\).
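The conversion from S-phase fraction to proliferation rate is a one-line calculation. The sketch below expresses \({T}_{S}\) in days so that \({\alpha }_{i}\) is a rate per day; this unit choice is an assumption, made to match the daily update interval of the in silico model.

```python
HOURS_PER_DAY = 24.0
T_S_HOURS = 12.7  # average S-phase duration assumed in the text


def proliferation_rate(s_phase_fraction, t_s_hours=T_S_HOURS):
    """alpha_i = f_i / T_S, with T_S converted to days (rate per day)."""
    if not 0.0 <= s_phase_fraction <= 1.0:
        raise ValueError("S-phase fraction must lie in [0, 1]")
    return s_phase_fraction / (t_s_hours / HOURS_PER_DAY)
```

Under this convention, an S-phase fraction of 9% corresponds to a rate of about 0.17 per day, the mean proliferation rate quoted for the combined cohort.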
Tumor age estimation
To estimate tumor mitotic and calendar ages from the empirical \(\beta\)-value distributions, we proceeded in two steps. In the first step, we decomposed each tumor’s empirical \(\beta\)-value distribution into three groups, or “peaks”, of fCpG sites: the originally unmethylated fCpG sites (left peak), the originally hemi-methylated fCpG sites (middle peak), and the originally methylated fCpG sites (right peak). We achieved this by fitting a mixture model of three Beta distributions to the \(\beta\)-values of the 500 fCpG sites in the clock set using the R package BetaModels (version 0.5.2). To improve convergence of this method, sites with extreme \(\beta\)-values (\(\beta > 0.98\) or \(\beta < 0.02\)) were removed before fitting the mixture model (a total of 46 and 244 sites were thus removed in the TCGA and Lund cohorts, respectively). In preparation for the next step, we determined the mode of each Beta component in the mixture as the location of the corresponding peak. At this point we excluded tumors with a middle peak location outside the interval [0.4, 0.6] because this suggests a bias in the (de-)methylation rates and thus violates a basic assumption of the fCpG dynamics in the clock set (35 and 8 tumors were excluded in the TCGA and Lund cohorts, respectively). We also excluded 3 TCGA and 7 Lund tumors that were of the “normal” molecular subtype. In the second step, we used the stochastic oscillator model (Fig. 1A) to relate the empirical peak location to the approximate age of the tumor. Because this step requires knowledge about the unknown stochastic (de-)methylation rate, we constrained the overall calendar age distribution across the Lund and TCGA cohorts to have a median of 3 years, which corresponds to the mean sojourn time in breast cancer4,59. See Supplementary Methods for details.
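The pre- and post-processing around the mixture fit can be illustrated without the fit itself. The sketch below assumes the shape parameters \((a, b)\) of each fitted Beta component are already available (for example, from BetaModels); it uses the standard formula for the mode of a Beta distribution, and the function names are hypothetical.

```python
def drop_extreme(betas, low=0.02, high=0.98):
    """Remove sites with beta < 0.02 or beta > 0.98 before fitting."""
    return [b for b in betas if low <= b <= high]


def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution, (a - 1) / (a + b - 2).

    Only defined as an interior peak when both shape parameters exceed 1.
    """
    if a <= 1.0 or b <= 1.0:
        raise ValueError("interior mode requires a > 1 and b > 1")
    return (a - 1.0) / (a + b - 2.0)


def middle_peak_ok(a, b, low=0.4, high=0.6):
    """True if the middle peak location lies inside [0.4, 0.6]."""
    return low <= beta_mode(a, b) <= high
```

A symmetric middle component such as Beta(5, 5) has its mode at 0.5 and passes the check, whereas a skewed component such as Beta(8, 3), with a mode near 0.78, would lead to exclusion of the tumor.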
EpiSCORE
To provide an orthogonal measure of purity in the TCGA cohort, we used the R package EpiSCORE (version 0.9.6)40. Following the standard instructions of the tool, we converted the 450K methylation array data to gene-level DNAm data and then used the breast reference DNAm matrix provided with the package to estimate, for each tumor, the fraction of each of the following breast cell-types: basal, endothelial, fat, fibroblast, luminal, lymphocyte, and macrophage. The epithelial fraction was computed as the sum of the luminal and basal fractions.
Purity adjusted analyses
Acknowledging the correlation between the epigenetic clock index \({c}_{\beta }\) and tumor purity, we derived a purity-adjusted epigenetic clock index \({c}_{\beta }^{\alpha }\) and repeated relevant correlation analyses with \({c}_{\beta }^{\alpha }\) instead of \({c}_{\beta }\). Because the epigenetic clock index was derived from the distribution of \(\beta\)-values of fCpG sites, we performed the purity adjustment at the level of \(\beta\)-values. For this, we assumed that the measured \(\beta\)-value at site \(i\) (\({\beta }_{i}^{m}\)) could be decomposed as a weighted sum of \(\beta\)-values of the tumor (\({\beta }_{i}^{t}\)) and the immune component (\({\beta }_{i}^{s}\)),
\[{\beta }_{i}^{m}=p\,{\beta }_{i}^{t}+(1-p)\,{\beta }_{i}^{s},\qquad (1)\]
where \(p\) is the sample purity as measured by CPE. To estimate \({\beta }_{i}^{s}\) we combined the CIBERSORTx decomposition of the stroma (see section CIBERSORTx) with \(\beta\)-values of its constituent cells (\({\beta }_{k}^{c}\)) to obtain
\[{\beta }_{i}^{s}=\sum _{k}{w}_{i,k}\,{\beta }_{k}^{c},\qquad (2)\]
where \({w}_{i,{k}}\) is the fraction of cell type \(k\) (in the LM22 signature) in tumor sample \(i\). The \({\beta }_{k}^{c}\) were estimated using published cell-type specific methylation values60. Finally, the purity adjusted \(\beta\)-values were obtained by solving Eq. (1) for \({\beta }_{i}^{t}\) and truncating values below 0 and above 1 (necessary for <5.7% of the adjusted \(\beta\)-values).
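The purity adjustment can be sketched numerically as follows. The sketch assumes the weighted sum takes the form \({\beta }^{m}=p\,{\beta }^{t}+(1-p)\,{\beta }^{s}\) with purity \(p\) as the tumor weight, and that the cell-type fractions are renormalized to sum to 1; both are assumptions of this illustration, as are the function names.

```python
def stroma_beta(weights, cell_type_betas):
    """Weighted average of cell-type beta-values for the non-tumor component.

    weights: dict mapping cell type -> fraction in the sample.
    cell_type_betas: dict mapping cell type -> beta-value at the site.
    Weights are renormalized to sum to 1 (an assumption of this sketch).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("cell-type weights must sum to a positive value")
    return sum(w * cell_type_betas[k] for k, w in weights.items()) / total


def purity_adjusted_beta(beta_measured, beta_stroma, purity):
    """Solve beta_m = p * beta_t + (1 - p) * beta_s for beta_t, then
    truncate the result to the interval [0, 1]."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    beta_tumor = (beta_measured - (1.0 - purity) * beta_stroma) / purity
    return min(1.0, max(0.0, beta_tumor))
```

For example, a measured value of 0.5 at purity 0.8 with a stromal value of 0.2 yields an adjusted tumor value of 0.575; a measured value of 0.1 at purity 0.6 with a stromal value of 0.9 would give a negative result and is truncated to 0.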
Statistics and reproducibility
Correlations between two continuous variables were calculated using the Pearson correlation coefficient. The medians of continuous variables were compared using a two-sided Wilcoxon rank-sum test at significance level of 0.05. For each variable, tumors with missing values of that variable were excluded. All analyses and visualizations were performed in Python (3.9.19) and R (version 4.3)61. All analyses were based on publicly available data sources and can thus be fully reproduced. To maximize statistical power, all qualifying samples were included in the respective data analyses.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data used in this work are publicly available under https://www.cancer.gov/ccg/research/genome-sequencing/tcga (TCGA data), and under the following Gene Expression Omnibus accession numbers: GSE75067 (Lund cohort); GSE88883 (normal breast cohort); GSE69914 (Germany cohort); GSE226569 (WCHS cohort); GSE186458 (DNAm atlas); GSE106360, GSE39451, GSE58999, GSE212370 (multiple sample cohorts). A list of the 500 fCpG sites used to estimate tumor mitotic age is provided in Supplementary Data 1. Tumor mitotic ages for samples in the TCGA and Lund cohorts are provided in Supplementary Data 2. The numerical data underlying the figures in this manuscript are provided in Supplementary Data 3.
Code availability
All Python (version 3.9.19) and R (version 4.3) code used to produce the results in this paper is available on GitHub at https://github.com/danmonyak/EpiClockInvasiveBRCA (MIT License)61.
References
Duffy, S. W., Chen, H. H., Tabar, L. & Day, N. E. Estimation of mean sojourn time in breast cancer screening using a Markov chain model of both entry to and exit from the preclinical detectable phase. Stat. Med. 14, 1531–1543 (1995).
Michaelson, J. et al. Estimates of breast cancer growth rate and sojourn time from screening database information. J. Women’s Imaging 5, 11–19 (2003).
Shapiro, S., Goldberg, J. D. & Hutchison, G. B. Lead time in breast cancer detection and implications for periodicity of screening. Am. J. Epidemiol. 100, 357–366 (1974).
Shen, Y. & Zelen, M. Screening sensitivity and sojourn time from breast cancer early detection clinical trials: mammograms and physical examinations. J. Clin. Oncol. 19, 3490–3499 (2001).
Weedon-Fekjær, H., Vatten, L. J., Aalen, O. O., Lindqvist, B. & Tretli, S. Estimating mean sojourn time and screening test sensitivity in breast cancer mammography screening: new results. J. Med. Screen. 12, 172–178 (2005).
Hannum, G. et al. Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell 49, 359–367 (2013).
Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 14, 1–20 (2013).
Yang, Z. et al. Correlation of an epigenetic mitotic clock with cancer risk. Genome Biol. 17, 1–18 (2016).
Youn, A. & Wang, S. The MiAge Calculator: a DNA methylation-based mitotic age calculator of human tissue types. Epigenetics 13, 192–206 (2018).
Zhu, T., Tong, H., Du, Z., Beck, S. & Teschendorff, A. E. An improved epigenetic counter to track mitotic age in normal and precancerous tissues. Nat. Commun. 15, 4211 (2024).
Zhou, W. et al. DNA methylation loss in late-replicating domains is linked to mitotic cell division. Nat. Genet. 50, 591–602 (2018).
Teschendorff, A. E. A comparison of epigenetic mitotic-like clocks for cancer risk prediction. Genome Med. 12, 1–17 (2020).
Gabbutt, C. et al. Fluctuating methylation clocks for cell lineage tracing at high temporal resolution in human tissues. Nat. Biotechnol. 40, 720–730 (2022).
Gabbutt, C. et al. Evolutionary dynamics of 1,976 lymphoid malignancies predict clinical outcome. medRxiv 2023.11.10.23298336 (2023).
Teschendorff, A. E. On epigenetic stochasticity, entropy and cancer risk. Philos. Trans. R. Soc. B 379, 20230054 (2024).
Koboldt, D. C. et al. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Johnson, K. C., Houseman, E. A., King, J. E. & Christensen, B. C. Normal breast tissue DNA methylation differences at regulatory elements are associated with the cancer risk factor age. Breast Cancer Res. 19, 1–11 (2017).
Holm, K. et al. An integrated genomics analysis of epigenetic subtypes in human breast tumors links DNA methylation patterns to chromatin states in normal mammary cells. Breast Cancer Res. 18, 1–20 (2016).
Luo, Y. et al. Regional methylome profiling reveals dynamic epigenetic heterogeneity and convergent hypomethylation of stem cell quiescence-associated genes in breast cancer following neoadjuvant chemotherapy. Cell Biosci. 9, 16 (2019).
Nishimura, T. et al. Evolutionary histories of breast cancer and related clones. Nature 620, 607–614 (2023).
Ciriello, G. et al. Comprehensive molecular portraits of invasive lobular breast cancer. Cell 163, 506–519 (2015).
van der Veer, E. L. et al. Causes and consequences of delayed diagnosis in breast cancer screening with a focus on mammographic features and tumour characteristics. Eur. J. Radiol. 167, 111048 (2023).
Danielsen, H. E., Pradhan, M. & Novelli, M. Revisiting tumour aneuploidy — the place of ploidy assessment in the molecular era. Nat. Rev. Clin. Oncol. 13, 291–304 (2016).
Ricke, R. M., van Ree, J. H. & van Deursen, J. M. Whole chromosome instability and cancer: a complex relationship. Trends Genet. 24, 457–466 (2008).
Chia, S. K. et al. A 50-gene intrinsic subtype classifier for prognosis and prediction of benefit from adjuvant tamoxifen. Clin. Cancer Res. 18, 4465–4472 (2012).
Wallden, B. et al. Development and verification of the PAM50-based Prosigna breast cancer gene signature assay. BMC Med. Genom. 8, 1–14 (2015).
Rakha, E. A. et al. Breast cancer prognostic classification in the molecular era: the role of histological grade. Breast Cancer Res. 12, 1–12 (2010).
Wiesner, F. G. et al. Ki-67 as a prognostic molecular marker in routine clinical use in breast cancer patients. Breast 18, 135–141 (2009).
Carter, C. L., Allen, C. & Henson, D. E. Relation of tumor size, lymph node status, and survival in 24,740 breast cancer cases. Cancer 63, 181–187 (1989).
Loftus, L. V., Amend, S. R. & Pienta, K. J. Interplay between cell death and cell proliferation reveals new strategies for cancer therapy. Int. J. Mol. Sci. 23, 4723 (2022).
Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).
Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
Desmedt, C. et al. Abstract S6-2: characterization of different foci of multifocal breast cancer using genomic, transcriptomic and epigenomic data. Cancer Res. 72, S6–2 (2012).
Reyngold, M. et al. Remodeling of the methylation landscape in breast cancer metastasis. PLoS One 9, e103896 (2014).
Garcia-Recio, S. et al. Multiomics in primary and metastatic breast tumors from the AURORA US network finds microenvironment and epigenetic drivers of metastasis. Nat. Cancer 4, 128–147 (2023).
Cheung, K. J. & Ewald, A. J. A collective route to metastasis: seeding by tumor cell clusters. Science 352, 167–169 (2016).
Fan, C. et al. Concordance among gene-expression–based predictors for breast cancer. N. Engl. J. Med. 355, 560–569 (2006).
Haque, R. et al. Impact of breast cancer subtypes and treatment on survival: an analysis spanning two decades. Cancer Epidemiol. Biomark. Prev. 21, 1848–1855 (2012).
Aran, D., Sirota, M. & Butte, A. J. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).
Teschendorff, A. E., Zhu, T., Breeze, C. E. & Beck, S. EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data. Genome Biol. 21, 221 (2020).
Welch, H. G. & Black, W. C. Overdiagnosis in cancer. J. Natl. Cancer Inst. 102, 605–613 (2010).
Li, J. et al. Molecular differences between screen-detected and interval breast cancers are largely explained by PAM50 subtypes. Clin. Cancer Res. 23, 2584–2592 (2017).
Maley, C. C. et al. Classifying the evolutionary and ecological features of neoplasms. Nat. Rev. Cancer 17, 605–619 (2017).
Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Colaprico, A. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 44, e71–e71 (2015).
Fredlund, E. et al. The gene expression landscape of breast cancer is shaped by tumor protein p53 status and epithelial-mesenchymal transition. Breast Cancer Res. 14, 1–13 (2012).
Sherman, M. E. et al. The Susan G. Komen for the Cure Tissue Bank at the IU Simon Cancer Center: a unique resource for defining the “molecular histology” of the breast. Cancer Prev. Res. 5, 528–535 (2012).
Yang, Z., Jones, A., Widschwendter, M. & Teschendorff, A. E. An integrative pan-cancer-wide analysis of epigenetic enzymes reveals universal patterns of epigenomic deregulation in cancer. Genome Biol. 16, 140 (2015).
Chen, J. et al. An epigenome-wide analysis of socioeconomic position and tumor DNA methylation in breast cancer patients. Clin. Epigenet. 15, 68 (2023).
Davis, S. & Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 23, 1846–1847 (2007).
Zhou, W., Triche, T. J. Jr, Laird, P. W. & Shen, H. SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 46, e123–e123 (2018).
Loyfer, N. et al. A DNA methylation atlas of normal human cell types. Nature 613, 355–364 (2023).
Seal, R. L. et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 51, D1003–D1009 (2023).
Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).
Gendoo, D. M. et al. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics 32, 1097–1099 (2016).
Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34, 267–273 (2003).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
Bialic, M., Al Ahmad Nachar, B., Koźlak, M., Coulon, V. & Schwob, E. Measuring S-phase duration from asynchronous cells using dual EdU-BrdU pulse-chase labeling flow cytometry. Genes 13, 408 (2022).
Bhatt, R. et al. Estimation of age of onset and progression of breast cancer by absolute risk dependent on polygenic risk score and other risk factors. Cancer 130, 1590–1599 (2024).
Hannon, E. et al. Assessing the co-variability of DNA methylation across peripheral cells and tissues: Implications for the interpretation of findings in epigenetic epidemiology. PLoS Genet. 17, e1009443 (2021).
Monyak, D. L. Analysis scripts and documentation for “Mapping the Temporal Landscape of Breast Cancer Using Epigenetic Entropy”. v1.0. Zenodo. https://doi.org/10.5281/zenodo.16813782 (2025).
Acknowledgements
We gratefully recognize our funders who provided support for this work: the National Institutes of Health (grant R01-CA271237 to M.D.R. and L.J.G.; grant U2C-CA233254 to E.S.H.; grant U54-CA217376 to D.S.) and the Breast Cancer Research Foundation (grant BCRF-19-074 to E.S.H.).
Author information
Contributions
M.D.R., D.S., J.R.M., E.S.H., and L.J.G. conceived the study and secured funding. M.D.R., D.L.M., and S.T.H. developed and optimized the analytical methods. D.L.M., G.G., and S.T.H. curated the datasets, maintained the code base, and performed the computational analyses. D.L.M. and S.T.H. prepared the figures and illustrations. All authors contributed to the interpretation of the results. M.D.R., D.S., and J.R.M. supervised the project. D.L.M., M.D.R., and D.S. drafted the manuscript, and all authors revised it critically and approved the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks Andrew Teschendorff and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling Editor: Johannes Stortz. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Monyak, D.L., Holloway, S.T., Gumbert, G.J. et al. Mapping the temporal landscape of breast cancer using epigenetic entropy. Commun Biol 8, 1477 (2025). https://doi.org/10.1038/s42003-025-08867-2