Abstract
Li-Fraumeni syndrome (LFS) confers high lifetime cancer risk due to germline TP53 pathogenic variants (PV). A comprehensive surveillance regimen, termed the ‘Toronto Protocol’, has been adopted for early tumor detection, demonstrating improved survival among TP53 PV carriers. However, the protocol’s “one-size-fits-all” approach fails to consider individual cancer risk. To personalize screening, we developed a support vector machine model to predict early onset of primary tumors (age < 6 years) using peripheral blood methylation data of TP53 PV carriers (n = 237). Validation (n = 64) and external testing (n = 79) showed AUROC = 0.928 [0.835–1.000], F1-score = 0.692 [0.435–0.867], and NPV = 0.984 [0.946–1.000]. The model achieved 91% accuracy, correctly classifying 90% of patients with cancer before the age of six and 87% of cancer-free individuals in the external test set. Our tool enables risk stratification for early-onset malignancies to optimize clinical surveillance and improve patient outcomes.
Introduction
Li-Fraumeni syndrome (LFS) (OMIM#151623) is an autosomal dominant cancer predisposition syndrome characterized by germline pathogenic variants (PV) of the TP53 tumor suppressor gene in approximately 80% of cases1. Affected individuals are predisposed to a spectrum of early-onset cancers, and have an increased risk of developing multiple malignancies, even in the absence of a family history of cancer. The population carrier rate of pathogenic TP53 variants is estimated to be between 1:500 and 1:50002,3,4. Since the initial link between germline TP53 variants and LFS was discovered in 19905, studies of this phenotypically heterogeneous disorder have led to a better understanding of the mechanisms of cancer susceptibility and p53 function, the role of genomic instability in cancer predisposition, and the introduction of the “Toronto Surveillance Protocol”—a multi-modal screening approach to clinical surveillance for early tumor detection in TP53 variant carriers6.
Studies implementing comprehensive life-long clinical surveillance protocols for TP53 PV carriers have demonstrated the ability to detect asymptomatic low-grade tumors and low-stage malignant tumors, leading to improved survival for these patients6,7,8,9. However, the spectrum of malignancies and age of cancer onset in LFS is heterogeneous10, which complicates strategies for early tumor detection. Current surveillance strategies utilize a one-size-fits-all approach and fail to consider an individual patient’s cancer risk. They are also intensive and burdensome for patients and the healthcare system, requiring annual (adults) or biannual (children) ultrasounds, whole-body and brain MRIs, blood tests, and physical examinations from birth, or at first confirmation of TP53 variant status11,12,13.
Although numerous studies have attempted to determine cancer risk factors or cancer incidence in an entire population, only a handful incorporate genome-level data to inform cancer surveillance recommendations. Previous studies have leveraged blood-derived DNA methylation data to develop measurements of biological aging defined by the difference between epigenetic and chronological aging, termed intrinsic epigenetic age acceleration (IEAA) and extrinsic epigenetic age acceleration (EEAA), respectively, for the prediction of cancer risk14,15,16. However, recent research demonstrates that the adult epigenetic clock is not a good predictor for children17 as the margin of error is 2-3 years18. In addition, risk models for sporadic cancers are not directly transferable to hereditary cancers, like those occurring in LFS, because of clinical and molecular differences19,20. In particular, individuals with TP53 PV have a considerably earlier age of cancer onset compared to their sporadic counterparts8,21. Among individuals with LFS, missense TP53 variants22,23, especially those with loss-of-function (LOF) effects, are associated with early cancer onset24. Although a handful of modifiers, including mir-60525, PIN3, MDM2 SNP30926, p.Arg72Pro polymorphism of p5327, and telomere length28 have been associated with age of onset in LFS, they are insufficient for predicting early cancer onset at an individual level. Similarly, the performance of models leveraging family history of cancer for penetrance estimates of time to first primary diagnosis in LFS, suggests low clinical utility29.
In this work, we implemented a strategy to estimate the risk of early-onset malignancies in TP53 PV carriers. Among TP53 PV carriers, 51% of diagnoses are estimated to occur during early childhood30. The most characteristic pediatric cancers of LFS are adrenocortical carcinoma (ACC), choroid plexus carcinoma (CPC), rhabdomyosarcoma (RMS), and medulloblastoma (MB), with the majority arising prior to the age of six years30,31. Nearly a quarter of LFS patients are diagnosed with cancer before six years of age, during which time cancer screening presents unique challenges for the patient, including the administration of general anesthesia for whole-body and brain MRIs30,32. Early detection and intervention are critical in these patients as they are at the highest risk of a secondary malignancy, treatment-related morbidity, and loss of life years33. Stratifying TP53 PV carriers based on how likely they are to develop an early-onset malignancy can alleviate the burden of unnecessary screening for patients, their families, and the healthcare system.
Studies have shown that mutant p53 reprograms cancer cell metabolism by decreasing levels of critical metabolites like α-ketoglutarate and S-adenosylmethionine (SAM)34. This metabolic shift disrupts the function of TET enzymes, which are essential for DNA demethylation, while simultaneously enhancing the activity of DNA methyltransferases responsible for DNA methylation35. As a result, these alterations lead to abnormal DNA methylation patterns that silence tumor suppressor genes and promote a malignant cellular phenotype36. Thus, we retrospectively performed DNA methylation profiling on peripheral blood leukocytes (PBL) collected as part of a long-term follow-up of a large, multi-institutional cohort of LFS patients. We established the proof-of-principle for a targeted and patient-specific cancer screening process by developing a DNA methylation-based predictive model that accurately stratifies TP53 PV carriers based on how likely they are to develop cancer before six years of age. Not only will this improve patient outcomes by identifying individuals who would benefit from high-intensity surveillance at very young ages, it will also reduce the resources and psychosocial burdens associated with unnecessary screening.
Results
Predicting early cancer onset before the age of six years
We tested the ability of our model to predict whether an individual will be diagnosed with cancer before or after the age of six years on a validation set (n = 64), in addition to an external test set (n = 79) unseen during the training phase of the algorithm. We achieved an AUROC of 0.938 [95% CI, 0.851–1.000], F1-score of 0.842 [95% CI, 0.600–1.000] and a NPV of 0.981 [95% CI, 0.939–1.000] on our validation set (Table 1 and Fig. 2), misclassifying only one patient who was diagnosed with cancer before the age of six (FN). In addition, we falsely classified two patients in our validation set as having cancer prior to the age of six (FP), when in fact they developed cancer later in life, one at 15 years, and the other at 6 years 3 months, very close to our age cut-off of 6 years. In our external test set, we achieved an AUROC of 0.928 [95% CI, 0.835–1.000], F1-score of 0.692 [95% CI, 0.435–0.867], and a NPV of 0.984 [95% CI, 0.946–1.000], misclassifying only one patient who developed cancer before the age of six years (Table 2 and Fig. 3). In addition, we falsely classified seven patients in our external test set as having cancer prior to the age of six years when in fact three patients developed cancer later in life and four remain cancer-free to date, with clinical follow-up until at least 6 years of age. Importantly, we correctly classified all patients sampled prior to cancer diagnosis, including a patient who we predicted would develop cancer before the age of six, 7 months prior to their cancer diagnosis of an ACC at 1.6 years of age. To provide further confidence in the robustness of our model, we built two additional models to predict early cancer onset using cut-offs of four and five years, both of which performed similarly to, but not as well as, the model predicting cancer onset before the age of six years (Supplementary Figs. 1–4).
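As a worked check of how the reported F1-score and NPV relate to the confusion matrix, the external test counts can be reconstructed from figures stated in the text: 79 patients, 10 of whom developed cancer before age six, with one false negative and seven false positives. The sketch below is illustrative only; the function and variable names are ours and are not taken from the study's code.

```python
def f1_and_npv(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1-score, negative predictive value) from confusion-matrix counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    npv = tn / (tn + fn)
    return f1, npv


# External test set counts, reconstructed from the text (an assumption of this sketch).
n_total, n_early = 79, 10   # cohort size, patients with cancer before age six
fn, fp = 1, 7               # reported false negatives and false positives
tp = n_early - fn           # true positives
tn = n_total - n_early - fp # true negatives

f1, npv = f1_and_npv(tp, fp, fn, tn)
print(f"F1 = {f1:.3f}, NPV = {npv:.3f}")  # F1 = 0.692, NPV = 0.984
```

Both values agree with the point estimates reported for the external test set.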
We also analyzed changes in the prediction probability over time for patients with samples available at multiple timepoints (Supplementary Fig. 5). One patient (ID: 109185), diagnosed before age six, had a high predicted probability at diagnosis, and the prediction became negative one year after treatment, suggesting reversible methylation changes. All other patients diagnosed after six years showed negative predictions regardless of sample age; the one exception, patient (ID: 15010), was a cancer-free individual falsely predicted positive at 19 years.
Mitigating against potential model biases
We did not observe any patterns between the patients classified as false positives (FP) or false negatives (FN) and characteristics of the patient’s TP53 variant (i.e., functional effect, location, or specific variant), age, cancer type, and/or age of sample collection. An important consideration in developing our model is the variable age of sample collection and its high correlation with the age of cancer onset (R = 0.81, Fig. 1C), making it a potential confounder. The average difference between the age of cancer onset and age at sample collection in our cohort was 8.06 years [95% CI, 6.62–9.49]; this was not significantly different (p = 0.424, two-sided Mann–Whitney U-test) between the patients that developed cancer before the age of six (9.56 years [95% CI, 6.18–12.94]) and the patients that developed cancer after the age of six (7.42 years [95% CI, 4.04–10.80]). We found that methylation of the features in our model was very weakly correlated with age of sample collection (RPC1 = 0.314); this is in contrast to the 4228 age-associated probes removed during our preprocessing step, which we found to be highly associated with age of sample collection (RPC1 = −0.786; Supplementary Fig. 6). To further confirm that our model was not confounded by age at sample collection, we evaluated our algorithm on cancer-free TP53 PV carriers that encompass a wide range of ages at sample collection (0–76 years), with clinical follow-up until at least 6 years of age. Our model correctly classified all the cancer-free individuals in our validation set (n = 21) and 87% in the external test set (n = 30); risk scores for these individuals were far below the decision boundary, regardless of age at sample collection (Figs. 2 and 3). Moreover, the misclassified cancer-free TP53 PV carriers had ages of sample collection before (n = 3) and after (n = 1) six years, further confirming the predictions are not biased by age of sample collection.
A Summary of data workflow by TP53 status, methylation data, and cancer status. B Distribution of age of cancer diagnosis colored by training, validation, test, or control dataset. The red dotted line indicates our threshold of cancer onset at the age of six years. C A scatter plot of the age of cancer onset (years) in relation to the age of sample collection (years), colored by training, validation, or external test dataset. The red dotted line indicates our threshold of cancer onset at the age of six years.
A Receiver operating characteristic (ROC) curve for the validation cohort. TP true positive, TN true negative, FN false negative, FP false positive, AUROC area under the receiver operator characteristic curve. B Predicted probability of early cancer onset on the validation set for patients with cancer, using a decision boundary of 0.34. The number beside the falsely predicted patients indicates the true age of cancer onset. C Predicted probability of early cancer onset for the null cohort (patients without cancer). The number beside the falsely predicted patients indicates the age of the sample collection. Red = optimized decision threshold, blue = optimal sensitivity. Source data are provided as a Source Data file.
A ROC curve for the external test cohort. TP true positive, TN true negative, FN false negative, FP false positive, AUROC area under the receiver operator characteristic curve. B Predicted probability of early cancer onset on the test set for patients with cancer, using a decision boundary of 0.34. The number beside the falsely predicted patients indicates the true age of cancer onset. C Predicted probability of early cancer onset for the null cohort (patients without cancer). The number beside the falsely predicted patients indicates the age of the sample collection. Red = optimized decision threshold, blue = optimal sensitivity. Source data are provided as a Source Data file.
Use of clinical characteristics and TP53 PV for predicting early cancer onset
In addition to methylation, we evaluated the use of sex, family history of cancer, systemic treatment status, and characteristics of TP53 PV, including variant location (i.e., DNA binding domain, tetramerization domain, oligomerization domain, N-terminal) and variant type (i.e., splice, deletion, missense, nonsense, and frameshift) as features in our model. The distribution of TP53 PVs in the LFS cohort indicates that missense mutations within the DNA binding domain are most prevalent, with C > T transitions representing the most frequent change (Supplementary Fig. 7). Across all combinations of clinical and TP53 variant-related features, we found comparable performance by using methylation alone (AUROC = 0.906 [95% CI, 0.805–1.000], F1-score = 0.581 [95% CI, 0.348–0.774], NPV = 0.983 [95% CI, 0.944–1.000]). However, the best performing model included TP53 PV and achieved an AUROC of 0.928 [95% CI, 0.835–1.000], F1-score of 0.692 [95% CI, 0.435–0.867], and a NPV of 0.984 [95% CI, 0.946–1.000] (Supplementary Fig. 8). We assessed feature importance using Shapley values (Supplementary Fig. 9) and found a gradual decline in importance across the ranked features (Fig. 4A). Among the TP53-associated features, the presence or absence of a missense mutation had the highest importance for predicting early cancer onset (Fig. 4B), which is in accordance with previous literature22,23. Use of TP53 PV without gene-based methylation data achieved an AUROC of 0.522 [95% CI, 0.498–0.546], suggesting poor discriminatory power based on TP53 status alone. Including sex as a feature in the model did not result in a significant change to model performance. The addition of family history of cancer changed the AUROC by 0.030 [95% CI, −0.145 to 0.195] and the NPV by −0.052 [95% CI, −0.125 to 0.042], compared to using methylation alone.
However, it should also be noted that the features used for family history of cancer may possess inherent biases, as the number of individuals with a known history of cancer for each patient’s family is variable. Interestingly, we found systemic treatment status decreased AUROC by 0.161 [95% CI, 0.075–0.241], relative to using methylation alone.
A Model feature importance (SHAP value) for early cancer onset prediction, across all the features. B Feature importance for the TP53 variant-related features. OD oligomerization domain, DBD DNA binding domain, TA transactivation domain. C Feature importance for the top 20 features. Source data are provided as a Source Data file.
Interpretation of methylation features predictive of early cancer onset
During the feature selection process, we determined an LFS-specific signature for early cancer onset by identifying differentially methylated sites between variant and wild-type TP53 carriers with histologically comparable malignancies. Compared to randomly sampled probes, we found leveraging probes associated with TP53 status consistently yielded the highest AUROC and F1-score on the test set (Supplementary Fig. 10). LFS-associated probes in the 3’UTR demonstrated superior performance compared to models that included probes from all functional regions, with and without PCA transformation (Supplementary Fig. 11). We found the model probes were particularly enriched at chromosome 17, where TP53 is located, but were also distributed throughout the genome with the exception of chromosomes 9 and 15 (Supplementary Fig. 12A). The methylation features were enriched at H3K36me3 histone marks and genes that are highly expressed in blood cells (Supplementary Fig. 12B). Among the most important features were cancer-related genes RET and LEF1, and genes involved with Golgi-associated vesicles like GOPC, SPPL2B, and MAP6 (Fig. 4C). Pathway analysis revealed that these probes are enriched for biological processes involved in the cellular response to interleukin-4 (IL-4; p = 6.212 × 10⁻³). In particular, this involved the genes LEF1, TCF7, PTPN2, and CD300LF. IL-4 has a well-established role in inducing B-cell activation and differentiation of naive T-cells to effector T-cells, as well as inhibitory effects on neutrophils37. Our model probes also demonstrate distinct clustering patterns that align with tissue of origin for the cancer developed (Supplementary Fig. 13).
To understand the early cancer onset signature in a broader disease context, we compared methylation of the model features across TP53 PV carriers, wildtype TP53 carriers with malignancies histologically comparable to those in LFS, sporadic cancers (i.e., breast38 and colon cancer39), immune-related diseases (i.e., rheumatoid arthritis40 and multiple sclerosis41) and healthy controls. We found that TP53 PV carriers with early cancer onset clustered together, further away from the other patients and healthy controls40 (Supplementary Fig. 14).
Immune cell deconvolution of peripheral blood DNA methylation
We trained a model without methylation, leveraging cellular composition, TP53 mutation and demographic variables and found it performed with an AUROC of 0.75 [95% CI, 0.456–0.97], F1-score of 0.615 [95% CI, 0.25–0.889], and NPV of 0.907 [95% CI, 0.81–0.977] on the validation set, and an AUROC of 0.794 [95% CI, 0.641–0.92], F1-score of 0.267 [95% CI, 0–0.556], and NPV of 0.774 [95% CI, 0.817–0.959] on the test set (Supplementary Fig. 15). We investigated the relationship between early cancer onset in LFS and immune cell proportions derived from the methylation data. We found TP53 PV carriers who developed an early onset malignancy had increased B-cells and CD8+ T-cells and decreased neutrophils, compared to TP53 PV carriers with later cancer onset (Fig. 5A). With respect to our model, the predicted probability of early cancer onset and true labels are weakly, positively correlated with B-cells (p = 8.6 × 10⁻⁵; R = 0.23) and negatively correlated with neutrophils (p = 8.6 × 10⁻⁵; R = −0.22; Fig. 5B). However, we found that inclusion of cell proportions as features in our model changed the AUROC and F1-score by −0.014 [95% CI, −0.06 to 0.109] and −0.012 [95% CI, −0.121 to 0.145], respectively, and increased NPV by 0.011 [95% CI, 0.171–0.149] (Supplementary Fig. 8). Furthermore, immune cell proportions did not significantly differ between variant TP53 carriers and wildtype TP53 carriers with histologically comparable malignancies (Supplementary Fig. 16).
A Comparison of the centered log ratio transform of the immune cell proportions inferred from deconvolution of bulk methylation signal in LFS patients with cancer before the age of six (n = 96) to those that developed cancer after the age of six (n = 209) or not at all (n = 155). Two-sided Mann–Whitney U-tests were used for pairwise comparisons, with Bonferroni adjustment for multiple-hypothesis correction. Box plots display the median (50th percentile, center line), the interquartile range (IQR; box spans the 25th–75th percentiles, Q1–Q3), and whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. B Correlation between estimated immune cell proportions and the predicted probability of cancer onset before the age of six years. Source data are provided as a Source Data file.
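The centered log ratio (CLR) transform mentioned in the caption above maps a vector of cell-type proportions to log-ratios against their geometric mean, so that compositional data can be compared with standard tests. A minimal sketch follows; the pseudocount used to guard against zero proportions and the example values are assumptions for illustration, not details taken from the study.

```python
import math


def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's cell-type proportions.

    The pseudocount (an assumption of this sketch) avoids log(0) for cell
    types estimated at zero; the vector is re-closed to sum to one first.
    """
    shifted = [p + pseudocount for p in proportions]
    total = sum(shifted)
    logs = [math.log(p / total) for p in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical proportions for five cell types in one sample.
example = [0.05, 0.10, 0.15, 0.60, 0.10]
transformed = clr(example)
```

By construction the transformed values of each sample sum to zero, and the ordering of cell types by abundance is preserved.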
Discussion
In this study, we have demonstrated the utility of an AI tool that leverages DNA methylation of PBL as a minimally invasive strategy for personalizing cancer screening of LFS patients. Having taken a rigorous approach to building our model, we developed a robust machine learning tool that is able to predict whether cancer is likely to occur prior to the age of six years in TP53 PV carriers (Fig. 6). This is an interesting and much-needed endeavor for several reasons. Firstly, previous approaches leveraging epigenetic aging are not reflective of cancer onset in TP53 PV carriers (Supplementary Fig. 17). Secondly, the ability to predict an early onset of malignancy can greatly reduce the burden of unnecessary screening measures. The number of patients correctly predicted to develop cancer after the age of six years in our validation and external test sets suggests that clinically deploying our model could result in a potential 93% reduction in unnecessary screening, with the associated psychological burden and logistical complications for young patients and their families. This benefit is in addition to substantial savings to the healthcare system from a reduction in use of MRIs and other imaging modalities and other clinical diagnostic interventions. Though any misclassifications are not ideal, this is still substantially lower than the false positive rates (FPR) noted in clinical surveillance studies9. Similarly, the patients our model misclassified as healthy made up only 10% of the early-onset patients in our external test set, in line with current surveillance strategies in clinical use6,11,42. These findings also have potentially broader implications by providing useful insight into how PBL-derived DNA methylation data can be leveraged for cancer surveillance in other cancer predisposition scenarios.
Our pipeline starts with the blood draw in the clinic, followed by DNA methylation profiling, which is input for a machine learning model that predicts the probability of cancer occurrence before six years of age. The prediction can then be used to personalize cancer surveillance for LFS patients into low, standard (Toronto), or high intensity surveillance protocols.
Prospective studies of blood-derived methylation in lung, colorectal, and breast cancer suggest these changes are inherent characteristics that exist long before cancer development38,39,43. These changes could be due to systemic effects driven by TP53 PV or due to the existence of germline modifiers (e.g., SNPs) with effects that are reflected in the methylome. Immune cell proportions estimated from bulk methylation suggest an association of early cancer onset with increased B-cells and decreased neutrophils; however, it is worth noting this may also be influenced by the patient’s state at sample collection, including their age, cancer, or treatment status. In our study, we found methylation probes associated with TP53 status in the 3’UTR are predictive of early cancer onset in LFS patients. Altered methylation at the 3’UTR can lead to changes in post-transcriptional control of regulatory regions by disrupting microRNAs, RNA-binding proteins, poly(A) signals, and m6A sites, leading to aberrant gene expression programs that drive early cancer development44,45. This is further supported by previous work that suggests a role for microRNAs in LFS, in particular the association of miR-605 with accelerated age of tumor onset25 and hypermethylation of mir34a with poorer overall survival in TP53 PV carriers36. Feature analysis suggests aberrant methylation could lock cells into a pro-tumorigenic state by enhancing Golgi-mediated secretion of oncogenic factors and reinforcing an immune-suppressive environment through IL-4 signaling. It is worth noting, however, that Shapley values assume feature independence, which can lead to misleading attribution of feature importance under multicollinearity, so they should be interpreted with caution46. Further validation and experimental studies are needed to uncover the biological mechanisms behind the methylation signature and identify ways to leverage them for personalized therapeutics.
While identifying the specific mechanisms underlying early tumor onset is important for understanding the evolution of LFS-associated cancers, it is beyond the scope of this study, which is intended to introduce a novel clinical decision support tool.
In our analysis, we have accounted for many potential confounders. Leaving confounders unchecked may result in biased predictions that will not generalize47. Correcting for the batch, array, and age at sample collection confounders is not as simple as controlling for a single variable in the model, given the nature of our surveillance protocol and the data collection process. The main limitation of this retrospective study is the timing of blood sampling with respect to cancer diagnosis and how it occurs in a variable fashion as part of long-term follow-up. As a result, there exists a high correlation between the age of cancer onset and the age at sample collection for a given individual in our cohort, whereby directly controlling for the age at sample collection in the model would result in heavily biased coefficients. In order to demonstrate that our model is not stratifying patients by the age of sample collection, we accurately classified cancer-free individuals irrespective of the variable age of sample collection (0–76 years) and we correctly classified all patients sampled prior to diagnosis, including a patient who developed cancer at 19 months whose sample was drawn 7 months prior. However, we emphasize the need for standardized sample collection, larger cohorts, and a prospective study designed to validate and further refine the findings. Benchmarking against existing standard-of-care tests is necessary to accurately assess the potential gains of AI-based blood DNA methylation predictions over current clinical surveillance practices.
In this study, we have established that DNA methylation of PBL can be leveraged to identify LFS patients at risk of an early-onset malignancy. We have devised a model that estimates the risk of cancer onset before the age of six years using patient blood methylation profiles. Our model achieved an accuracy of 91% on a test set from an institution distinct from those that generated the training and validation data, misclassifying only one patient with cancer before the age of six. This has the potential to not only relieve the cost and burden of surveillance for patients who are not at risk of early childhood cancers, but also reinforce the justification for intensive surveillance for those TP53 PV carriers who are predicted to develop cancer in this earlier age group.
Methods
Ethics
This study was conducted in accordance with the ethical principles of the Declaration of Helsinki. Molecular profiling of all LFS patients was approved by the SickKids institutional review board (#1000051699). The NCI LFS cohort study was approved by the NCI institutional review board (ClinicalTrials.gov identifier NCT01443468). Written informed consent was signed by all participants or their legal guardians prior to sample collection. Sample sizes were determined by the availability of banked blood samples, corresponding to the number of eligible TP53 variant carriers who presented to and consented at participating hospitals. No statistical method was used to predetermine sample size.
Study population
The cohort consists of 497 consecutively ascertained TP53 PV carriers from 248 families collected primarily from four different centers (The Hospital for Sick Children, Toronto; Princess Margaret Cancer Centre, Toronto; University of Utah, Salt Lake City; National Cancer Institute, Bethesda), with clinical follow-up until at least the age of six years. Bisulfite treatment and methylation profiling were performed at The Centre for Applied Genomics (TCAG)20 on DNA (1 μg) extracted from PBL of 398 out of 497 individuals with available DNA and clinical information (Supplementary Data 1). The 398 individuals consisted of 46 that developed their first cancer prior to the time of blood draw (0-11.8 years), 209 that developed their first cancer following the time of blood draw, 130 that had no history of cancer and 13 with unknown age of sample collections (Fig. 1A). The ratio of TP53 PV carriers sampled prior to diagnosis to those sampled following diagnosis is similar between patients that developed cancer before and after our cutoff of six years (Supplementary Fig. 18). The age of sample collection of the cancer-free individuals ranges from 0 years to 76 years with a mean age of 21.4 [18.8–24] years (Supplementary Fig. 19). Notably, the mean age of sample collection in the training set of 27 [95% CI, 24.6–29.4] years, is not significantly different from the mean age of sample collection of the validation set 27.8 years [95% CI, 22.6–33.4] (p = 0.67, two-sided Mann–Whitney U-Test) and the test set 28 years [95% CI, 24.3–31.4] (p = 0.55, two-sided Mann–Whitney U-Test). In addition, we profiled 48 technical replicates and 17 biological replicates. Technical replicates are samples from the same patient that do not vary by age of sample collection. Biological replicates are samples from the same patient, taken at different times, with variable ages of sample collection. 
Overall, our data recapitulates the distribution of age of first cancer onset that is well-documented in population-level studies of LFS18. In particular, this includes an initial peak in the distribution of age of cancer onset (Fig. 1B) in which 19% (72/380) of first cancer diagnoses occur before the age of six years. In addition, we performed methylation on a separate cohort of 132 patients and family members with wild-type TP53 in whom histologically comparable malignancies developed.
Training data
The training set (n = 237) consisted of 53 TP53 PV carriers with a cancer diagnosis prior to the age of six, 105 with cancer after six years of age, and 79 with no cancer diagnosis. DNA methylation of PBL was initially collected on 111 TP53 PV carriers in 2015 using the Illumina HumanMethylation450 (450k) array48. The Illumina 450k technology allowed us to obtain methylation probe level values on >480,000 CpG sites and account for 99% and 96% of reference genes and methylation islands, respectively. In addition to the methylation samples obtained in 2015, we also collected methylation on 126 TP53 PV carriers from the National Cancer Institute (NCI; Bethesda, MD) and the Hospital for Sick Children (SickKids; Toronto, ON) between November 2019 and July 2020 using the Illumina EPIC BeadChip (850k) technology49. The EPIC microarray covers ~850,000 CpG methylation sites50,51,52,53; this includes over 90% of the CpGs found on the 450k array, with an additional 333,265 sites.
Validation data
The validation cohort (n = 64) consisted of 9 TP53 PV carriers with cancer prior to six years, 34 with cancer after six years, and 21 with no cancer diagnosis. The patient samples were profiled for PBL DNA methylation between 2019 and 2020 using the Illumina 850k technology.
External test data
We evaluated our model on an external test cohort (n = 79) from the University of Utah Health (Salt Lake City, UT). The external test cohort was profiled using the Illumina 850k technology and preprocessed separately from the training and validation data. It consisted of 10 TP53 PV carriers who developed cancer prior to six years of age, 39 who developed cancer after age six, and 30 who had no cancer diagnosis at the time of analysis.
Preprocessing and outlier removal
Since the 850k array demonstrates high reproducibility at the 450k CpG sites54, we took the intersection of the probes on both array types (n = 452,453). We used a single-sample method called normal-exponential out-of-band (ssNoob) with R v3.6.1, using the package minfi v1.44.055,56 for the joint analysis and normalization of data from the 450k and 850k platforms. Following principal component analysis (PCA) transformation of the training, validation, and test data, samples within 3 standard deviations of the mean of PC1 (μPC1 ± 3σ) and PC2 (μPC2 ± 3σ) were retained; this resulted in the removal of 6 outliers.
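The outlier filter described above can be sketched as follows: project samples onto the first two principal components and retain those within three standard deviations of the mean on both. This is a minimal NumPy illustration under our own assumptions (SVD-based PCA on a samples × probes matrix of beta values, synthetic data, hypothetical function name); it is not the study's R implementation.

```python
import numpy as np


def pca_inlier_mask(beta: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Boolean mask of samples within n_sd SDs of the mean on both PC1 and PC2.

    beta: array of shape (n_samples, n_probes).
    """
    centered = beta - beta.mean(axis=0)
    # PCA via singular value decomposition; u * s gives the sample scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]
    mu = scores.mean(axis=0)
    sd = scores.std(axis=0, ddof=1)
    return np.all(np.abs(scores - mu) <= n_sd * sd, axis=1)


# Synthetic example: 60 samples, 200 probes, one grossly shifted sample.
rng = np.random.default_rng(0)
beta = rng.normal(0.5, 0.05, size=(60, 200))
beta[0] += 2.0
keep = pca_inlier_mask(beta)
filtered = beta[keep]
```

In this toy example the shifted sample dominates PC1 and is excluded; real data would typically call for inspecting the PC1/PC2 scatter before discarding samples.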
Array confounder correction
To correct for potential biases between the 450k and 850k platforms, we constructed linear models for each probe (excluding technical replicates), transforming methylation values from the 450k to the 850k platform using 30 technical replicates. The models were then applied to the remaining 450k data (n = 85), in order to map them onto the 850k space (Supplementary Fig. 20).
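As a rough illustration of the per-probe mapping, the sketch below fits an ordinary least squares line for each probe and applies it to a new 450k sample. It assumes the technical replicates provide paired 450k and 850k values for every probe; the function names, the dictionary layout, and the clipping of mapped values to [0, 1] are assumptions made for this example and may differ from the models in the authors' repository.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0  # constant probe: fall back to the mean
    return my - b * mx, b

def fit_probe_models(beta_450k, beta_850k):
    """Fit one linear model per probe.
    Both inputs map probe ID -> list of values over the paired replicates."""
    return {p: fit_line(beta_450k[p], beta_850k[p]) for p in beta_450k}

def map_to_850k(models, sample_450k):
    """Apply the per-probe models to one 450k sample (probe ID -> value).
    Clipping to [0, 1] is an assumption for beta-valued methylation."""
    mapped = {}
    for probe, value in sample_450k.items():
        a, b = models[probe]
        mapped[probe] = min(1.0, max(0.0, a + b * value))
    return mapped
```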
Batch confounder correction
Batch artifacts were addressed using PCA, a technique proven effective in time-course experiments57 and gene expression data58 to distinguish noise from signal. This allowed us to identify and remove the principal component most correlated with batch effects, using only the training data. Validation and test data were subsequently projected onto this corrected PCA space (Supplementary Fig. 21). Our array and batch effect correction models are available in our GitHub repository.
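The selection step can be sketched as follows, assuming PC scores have already been obtained from a PCA fitted on the training data and that batch is coded as 0/1 per sample. The helper names and the use of absolute Pearson correlation as the selection criterion are illustrative assumptions; the source does not specify how the correlation with batch was measured.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def batch_pc(train_scores, batch):
    """Index of the PC whose training scores correlate most with batch.
    train_scores: one list of PC scores per training sample."""
    n_pc = len(train_scores[0])
    cors = [abs(pearson([row[j] for row in train_scores], batch))
            for j in range(n_pc)]
    return max(range(n_pc), key=cors.__getitem__)

def drop_pc(scores, j):
    """Remove component j from every sample's score vector."""
    return [row[:j] + row[j + 1:] for row in scores]
```

In this sketch the component index is chosen on training samples only, and the same index would then be dropped from validation and test scores projected with the training loadings.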
Wildtype TP53 carriers
We also performed PBL DNA methylation profiling on 132 wildtype TP53 carriers in whom histologically comparable malignancies developed (Supplementary Fig. 22B)36; this group consisted of 92 individuals who developed cancer and 40 cancer-free family members. The average age of cancer onset for the wildtype TP53 carriers who developed cancer was 13.7 years, ranging from 0 to 60 years (Supplementary Fig. 22C). For a small subset (n = 12), the age of cancer onset was not known.
‘LFS’ signature
Given that mutant p53 alters DNA methylation to favor a malignant cell fate35,36, we performed feature selection by identifying differentially methylated regions (DMRs) associated with germline TP53 status, termed an ‘LFS’ signature. Using Bumphunter v1.40.059, we determined DMRs between TP53 PV carriers in our training set (n = 237) and individuals with wildtype TP53 (n = 132), who served as controls. DMRs were calculated with array type (450k or EPIC) and cancer status (unaffected or cancer) as covariates, 50 bootstrap randomizations, and a DNAm difference threshold of 0.05. We identified 5580 probes associated with TP53 status using an FDR cutoff of 0.05. To understand whether the ‘LFS’ signature was tagging a particular functional region of a gene or represented a genome-wide signature, we performed the following steps: (1) randomly sampled 100 sets of probes from each functional region of a gene (3’UTR, Body, 1st Exon, 5’UTR, TSS200 and TSS1500) and from our ‘LFS’ signature; (2) fit an SVM with a radial kernel to each set of randomly sampled probes; and (3) assessed the average prediction performance (Supplementary Fig. 23).
Methylation feature selection
We restricted the number of features in our training objective to remove spurious correlations that could lead to overfitting, reducing more than 400,000 probes to ~100 using a mixture of biological rationale and data-driven approaches, as follows. As a preliminary step to mitigate potential bias attributed to the age at sample collection60, we compiled 4228 methylation probes reported in the literature to be associated with aging14,17,61. We verified that these 4228 probes were indeed associated with aging in our data and subsequently removed them, which resulted in age not appearing as a substantial contributor (Supplementary Fig. 6). We further reduced the number of features by selecting the 5580 probes that make up the ‘LFS’ signature (Supplementary Fig. 24). Next, we determined the functional region of a gene most predictive of early cancer onset by subsetting the 5580 probes by each functional region and fitting a support vector machine (SVM) with a radial kernel for each region-specific probe set. Through this, we determined that the probes in the 3’UTR were the most predictive of early cancer onset (Supplementary Fig. 24A). Subsetting by probes in the 3’UTR of a gene reduced the size of the methylation profiles from 5580 to 134 probes (Supplementary Table 1). Finally, we aggregated the 134 probes by taking the average methylation value across probes that fall in the same gene (IlluminaHumanMethylation450kanno.ilmn12.hg19 v0.6.1), resulting in 129 gene-wise features to be used in the final model.
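The final gene-wise aggregation step amounts to averaging probe values that share a gene annotation. A minimal sketch follows, assuming a hypothetical `probe_to_gene` lookup built from the array annotation; the function name, the probe and gene identifiers, and the decision to skip unannotated probes are illustrative assumptions.

```python
from collections import defaultdict

def aggregate_by_gene(sample, probe_to_gene):
    """Average methylation values across probes annotated to the same gene.
    sample: probe ID -> methylation value for one individual.
    probe_to_gene: probe ID -> gene symbol (hypothetical annotation lookup)."""
    grouped = defaultdict(list)
    for probe, value in sample.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without an annotation are skipped
            grouped[gene].append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}
```

Applied to the 134 selected 3’UTR probes, an aggregation of this kind yields one feature per gene, which is how a probe set can shrink to a slightly smaller set of gene-wise features.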
Model development
To predict the probability of cancer onset before the age of six years, we trained eight different models (caret v6.0.94): random forest (randomForest v4.7.1), gradient boosted tree (xgboost v1.7.9), gradient boosting machine (gbm v2.1.8), elastic net regularized linear model (glmnet v4.1.8), SVM with a linear kernel (e1071 v1.7.13), SVM with a radial kernel, SVM with a polynomial kernel, and feed-forward neural network (neuralnet v1.44.2) with a single hidden layer (Supplementary Fig. 24B). Hyperparameters for each model were tuned using 5-fold cross-validation optimized for area under the receiver operating characteristic curve (AUROC) using the R package ROCR v1.0.11, and predicted probabilities were subsequently calibrated using Platt scaling62. Confidence intervals were calculated by averaging performance over 10 random seeds. The best-performing model, selected based on AUROC, was the SVM with a radial kernel. We also built two additional models to estimate the probability of cancer onset before the ages of four and five using the same feature selection, model, and training procedure (Fig. S1–S4).
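To illustrate the calibration step, the sketch below fits a sigmoid to raw classifier scores by gradient descent on the log-loss. This is a simplified form of Platt scaling: it omits the smoothed targets of the original method, and the function names, learning rate, and iteration count are assumptions chosen for the example rather than settings from the study.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by minimising log-loss.
    scores: raw decision values; labels: 0/1 outcomes."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log-loss wrt the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)
```

The fitted pair `(a, b)` would be estimated on held-out scores and then reused unchanged when calibrating predictions for new samples.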
Covariates analysis
In addition to leveraging methylation data in the model, we evaluated the predictive gain/loss of including the following covariates: (1) sex, which is of particular interest since we use germline methylation profiles, and methylation is known to vary with sex63; (2) TP53 variant location (i.e., DNA binding domain, tetramerization domain, oligomerization domain, N-terminal) and TP53 variant type (i.e., splice, deletion, missense, nonsense, frameshift); (3) family history of cancer, which is particularly relevant to hereditary CPS like LFS where multiple individuals within a single family can develop cancer (Supplementary Fig. 25), encoded using (i) an indicator variable for each individual based on whether or not there exist family member(s) with cancer, and (ii) a ratio of cancer to cancer-free individuals within each family; (4) systemic treatment status, a binary variable indicating whether the patient received treatment for a previous or active cancer prior to sample collection; and (5) the estimated proportion of immune cells (CD4+ T-cells, CD8+ T-cells, B-cells, natural killer cells, monocytes, neutrophils) calculated by deconvolution of bulk methylation data using the IDOL algorithm64 (Supplementary Fig. 8). Due to the compositional nature of the estimated immune proportion data, we transformed the data with the centered log-ratio (CLR) and handled zeroes by adding a pseudocount of 1 × 10⁻⁶.
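The CLR transform subtracts the mean log-proportion from each log-proportion, so the transformed values sum to zero. A minimal sketch follows; adding the pseudocount to every component, rather than only to zero entries, is an assumption, since the text does not specify which was done.

```python
from math import log

def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's cell-type proportions.
    The pseudocount is added to every component (an assumption) so that
    zero proportions have a finite logarithm."""
    logs = [log(p + pseudocount) for p in proportions]
    centre = sum(logs) / len(logs)  # log of the geometric mean
    return [value - centre for value in logs]
```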
Metrics and thresholding
A false positive (FP) is a patient predicted to have cancer prior to the age of six years who developed cancer later in life or not at all. A false negative (FN), which is of particular clinical relevance, is a patient predicted to have a low risk of cancer before the age of six years who in fact developed cancer prior to the age of six. We summarized the operating characteristics of our SVM with the radial kernel using a receiver operating characteristic (ROC) curve and the AUROC. In the context of predicting early cancer onset, FN are particularly detrimental; failing to identify a patient at risk of early-onset cancer can delay critical surveillance and timely intervention. To address this, we optimized the decision boundary in the validation set by minimizing the total cost associated with each threshold, weighting FN twice as heavily as FP (an FN:FP cost ratio of 2:1). The resulting decision boundary of 0.34 was then used to evaluate performance metrics (accuracy, sensitivity, specificity, negative predictive value, and F1-score), ensuring that our model’s predictions are reliable and clinically relevant.
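A cost-weighted threshold search of this kind can be sketched as follows. The grid of candidate thresholds, the convention that a probability at or above the threshold counts as a positive prediction, and the tie-breaking toward the lowest threshold are all assumptions made for illustration.

```python
def total_cost(probs, labels, threshold, fn_cost=2.0, fp_cost=1.0):
    """Weighted misclassification cost at one threshold.
    A sample is predicted positive when its probability >= threshold."""
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return fn_cost * fn + fp_cost * fp

def best_threshold(probs, labels, grid=None):
    """Return the grid threshold with the lowest total cost.
    Ties are resolved in favour of the lowest threshold."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return min(grid, key=lambda t: total_cost(probs, labels, t))
```

Raising `fn_cost` relative to `fp_cost` pushes the selected threshold downward, which trades additional false positives for fewer missed early-onset cases.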
Feature analysis
We calculated the enrichment of the model probes for each chromosome as:
$$\text{enrichment} = \frac{\text{\# of model probes}}{\text{\# of probes in 3’UTR}}$$
We performed a feature enrichment analysis of the 129 model features using ten genomic properties, which included proximity to histone marks, open chromatin, and expression levels from blood-derived samples. ChromImpute p-value signal tracks (bigwig files) were downloaded from the Roadmap Consortium (https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidatedImputed/) for the following genomic properties: H3K4me1, H3K4me3, H3K27me3, H3K9me3, H3K27ac, H3K36me3, DNase, H2A.Z, H3K79me2, and RNA-sequencing, in 23 blood-derived samples (E062, E034, E045, E033, E044, E043, E039, E041, E042, E040, E037, E048, E038, E047, E029, E031, E035, E051, E050, E036, E032, E046, and E030). The model’s probes were compared to 100,000 methylation probes randomly sampled with replacement. p-values were calculated using a two-sided Mann–Whitney U-test, and Cohen’s d metric was used to determine the effect size. SHAP (SHapley Additive exPlanations) values were calculated on the test set with the Kernel SHAP method using shapviz v0.6.0, using the training data as the background dataset. An ordered pathway analysis was performed using gProfiler with the genes used in the model, in order of decreasing importance based on their SHAP values.
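For reference, the effect size used in this comparison can be computed as in the sketch below. It uses the pooled-standard-deviation form of Cohen's d; the text does not state which variant was applied, so this choice and the function name are assumptions.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation (one common variant).
    x, y: signal values for the two groups being compared."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)
```

Here `x` would hold a genomic signal at the model probes and `y` the same signal at the randomly sampled background probes.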
Epigenetic age acceleration
We used the R package methylclock v1.4.065 to estimate biological age and evaluate epigenetic age acceleration using DNA methylation clocks: Horvath, Hannum, Levine, PedBE, Wu, TL, BLUP, and EN.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The methylation data generated in this study have been deposited in the European Genome-Phenome Archive (EGA) under the accession number EGAS00001007075. The data is available under restricted access to protect participant privacy and ensure compliance with ethical and legal requirements. Access can be obtained by submitting a written request to the study team, demonstrating external peer-reviewed funding, obtaining approval from the SickKids Research Ethics Board, and, for external investigators, completing a material transfer agreement and ethics board approval at their institution. Data enquiries should be directed to David Malkin (david.malkin@sickkids.ca). Data access will be granted within 2–4 weeks upon completion of all necessary requirements. Publicly available datasets used for the disease comparison analysis are available at the GEO database under the accession numbers: GSE51032 [10.18632/oncotarget.15573], GSE51057 [10.1093/jnci/djz065], GSE130029, GSE130030, GSE43976, GSE106648 [10.1016/j.ebiom.2019.04.042], and GSE42861 [10.1038/nbt.2487]. Source data are provided with this paper.
Code availability
Code for the analysis and prediction pipeline is available in our GitHub repository: github.com/vsubasri/LFS-Early-Cancer-Onset-Prediction66.
References
Malkin, D. Li-Fraumeni syndrome. Genes Cancer 2, 475–484 (2011).
de Andrade, K. C. et al. Higher-than-expected population prevalence of potentially pathogenic germline TP53 variants in individuals unselected for cancer history. Hum. Mutat. 38, 1723–1730 (2017).
Olivier, M., Hollstein, M. & Hainaut, P. TP53 mutations in human cancers: origins, consequences, and clinical use. Cold Spring Harb. Perspect. Biol. 2, a001008 (2010).
de Andrade, K. C. et al. Variable population prevalence estimates of germline TP53 variants: A gnomAD-based analysis. Hum. Mutat. 40, 97–105 (2019).
Malkin, D. et al. Germ line p53 mutations in a familial syndrome of breast cancer, sarcomas, and other neoplasms. Science 250, 1233–1238 (1990).
Villani, A. et al. Biochemical and imaging surveillance in germline TP53 mutation carriers with Li-Fraumeni syndrome: 11 year follow-up of a prospective observational study. Lancet Oncol. 17, 1295–1305 (2016).
Grasparil, A. D. 2nd, Gottumukkala, R. V., Greer, M.-L. C. & Gee, M. S. Whole-Body MRI surveillance of cancer predisposition syndromes: current best practice guidelines for use, performance, and interpretation. AJR Am. J. Roentgenol. 215, 1002–1011 (2020).
Kratz, C. P. et al. Cancer Screening Recommendations for Individuals with Li-Fraumeni Syndrome. Clin. Cancer Res. 23, e38–e45 (2017).
Ballinger, M. L. et al. Baseline surveillance in Li-Fraumeni syndrome using whole-body magnetic resonance imaging: a meta-analysis. JAMA Oncol. 3, 1634–1639 (2017).
Hisada, M., Garber, J. E., Li, F. P., Fung, C. Y. & Fraumeni, J. F. Multiple primary cancers in families with Li-Fraumeni syndrome. J. Natl. Cancer Inst. 90, 606–611 (1998).
Frebourg, T. et al. Guidelines for the Li-Fraumeni and heritable TP53-related cancer syndromes. Eur. J. Hum. Genet. 28, 1379–1386 (2020).
van Engelen, K. et al. Tumor surveillance for children and adolescents with cancer predisposition syndromes: The psychosocial impact reported by adolescents and caregivers. Pediatr. Blood Cancer 68, e29021 (2021).
McBride, K. A. et al. Psychosocial morbidity in TP53 mutation carriers: Is whole-body cancer screening beneficial?. Fam. Cancer 16, 423–432 (2017).
Levine, M. E. et al. DNA methylation age of blood predicts future onset of lung cancer in the Women’s Health Initiative. Aging 7, 690–700 (2015).
Yang, Z. et al. Correlation of an epigenetic mitotic clock with cancer risk. Genome Biol. 17, 205 (2016).
Espín-Pérez, A. et al. Peripheral blood DNA methylation profiles predict future development of B-cell Non-Hodgkin Lymphoma. NPJ Precis. Oncol. 6, 53 (2022).
McEwen, L. M. et al. The PedBE clock accurately estimates DNA methylation age in pediatric buccal cells. Proc. Natl. Acad. Sci. USA 117, 23329–23335 (2020).
Nichols, K. E., Malkin, D., Garber, J. E., Fraumeni, J. F. Jr & Li, F. P. Germ-line p53 mutations predispose to a wide spectrum of early-onset cancers. Cancer Epidemiol. Biomark. Prev. 10, 83–87 (2001).
Fang, W.-L. et al. Molecular and survival differences between familial and sporadic gastric cancers. Biomed. Res. Int. 2013, 396272 (2013).
Subasri, V. et al. Multiple germline events contribute to cancer development in patients with Li-Fraumeni syndrome. Cancer Research Communications 3, 738–754 (2023).
Frank, S. A. Age-specific incidence of inherited versus sporadic cancers: a test of the multistage theory of carcinogenesis. Proc. Natl. Acad. Sci. USA 102, 1071–1075 (2005).
Nichols, K. E. & Malkin, D. Genotype versus phenotype: the Yin and Yang of germline TP53 mutations in Li-Fraumeni syndrome. J. Clin. Oncol. 33, 2331–2333 (2015).
Zerdoumi, Y. et al. Drastic effect of germline TP53 missense mutations in Li-Fraumeni patients. Hum. Mutat. 34, 453–461 (2013).
de Andrade, K. C. et al. Cancer incidence, patterns, and genotype–phenotype associations in individuals with pathogenic or likely pathogenic germline TP53 variants: an observational cohort study. Lancet Oncol. 22, 1787–1798 (2021).
Id Said, B. & Malkin, D. A functional variant in miR-605 modifies the age of onset in Li-Fraumeni syndrome. Cancer Genet. 208, 47–51 (2015).
Marcel, V., Palmero, E. I. & Falagan-Lotsch, P. TP53 PIN3 and MDM2 SNP309 polymorphisms as genetic modifiers in the Li–Fraumeni syndrome: impact on age at first diagnosis. J. Med. Genet. 46, 766–772 (2009).
Bougeard, G. et al. Impact of the MDM2 SNP309 and p53 Arg72Pro polymorphism on age of tumour onset in Li-Fraumeni syndrome. J. Med. Genet. 43, 531–533 (2006).
Tabori, U., Nanda, S., Druker, H., Lees, J. & Malkin, D. Younger age of cancer initiation is associated with shorter telomere length in Li-Fraumeni syndrome. Cancer Res. 67, 1415–1418 (2007).
Shin, S. J. et al. Penetrance estimates over time to first and second primary cancer diagnosis in families with Li-Fraumeni syndrome: a single institution perspective. Cancer Res. 80, 347–353 (2020).
Amadou, A., Achatz, M. I. W. & Hainaut, P. Revisiting tumor patterns and penetrance in germline TP53 mutation carriers: temporal phases of Li-Fraumeni syndrome. Curr. Opin. Oncol. 30, 23–29 (2018).
Carta, R. et al. Cancer predisposition syndromes and medulloblastoma in the molecular era. Front. Oncol. 10, 566822 (2020).
Bougeard, G. et al. Revisiting Li-Fraumeni syndrome from TP53 mutation carriers. J. Clin. Oncol. 33, 2345–2352 (2015).
Villani, A. et al. Biochemical and imaging surveillance in germline TP53 mutation carriers with Li-Fraumeni syndrome: a prospective observational study. Lancet Oncol. 12, 559–567 (2011).
Janic, A., Abad, E. & Amelio, I. Decoding p53 tumor suppression: a crosstalk between genomic stability and epigenetic control? Cell Death Differ. 32, 1–8 (2024).
Morris, J. P. et al. α-Ketoglutarate links p53 to cell fate during tumour suppression. Nature 573, 595–599 (2019).
Samuel, N. et al. Genome-wide DNA methylation analysis reveals epigenetic dysregulation of MicroRNA-34A in TP53-associated cancer susceptibility. J. Clin. Oncol. 34, 3697–3704 (2016).
Heeb, L. E. M., Egholm, C. & Boyman, O. Evolution and function of interleukin-4 receptor signaling in adaptive immunity and neutrophils. Genes Immun. 21, 143–149 (2020).
Xu, Z., Sandler, D. P. & Taylor, J. A. Blood DNA methylation and breast cancer: a prospective case-cohort analysis in the sister study. J. Natl Cancer Inst. 112, 87–94 (2020).
Onwuka, J. U. et al. A panel of DNA methylation signature from peripheral blood may predict colorectal cancer susceptibility. BMC Cancer 20, 692 (2020).
Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 31, 142–147 (2013).
Ewing, E. et al. Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during multiple sclerosis progression. EBioMedicine 43, 411–423 (2019).
Bojadzieva, J. et al. Whole body magnetic resonance imaging (WB-MRI) and brain MRI baseline surveillance in TP53 germline mutation carriers: experience from the Li-Fraumeni syndrome education and early detection (LEAD) clinic. Fam. Cancer 17, 287–294 (2018).
Dugué, P.-A. et al. Biological aging measures based on blood DNA methylation and risk of cancer: a prospective study. JNCI Cancer Spectr. 5, kaa109 (2021).
McGuire, M. H. et al. Pan-cancer genomic analysis links 3’UTR DNA methylation with increased gene expression in T cells. EBioMedicine 43, 127–137 (2019).
Wei, W. et al. Comprehensive characterization of posttranscriptional impairment-related 3′-UTR mutations in 2413 whole genomes of cancer patients. npj Genom. Med. 7, 1–12 (2022).
Huang, X. & Marques-Silva, J. The inadequacy of Shapley values for explainability. Preprint at https://arxiv.org/abs/2302.08160 (2023).
Skelly, A. C., Dettori, J. R. & Brodt, E. D. Assessing bias: the importance of considering confounding. Evid. Based Spine Care J. 3, 9 (2012).
Bibikova, M. et al. High density DNA methylation array with single CpG site resolution. Genomics 98, 288–295 (2011).
Infinium MethylationEPIC Kit. Methylation profiling array for EWAS. https://www.illumina.com/products/by-type/microarray-kits/infinium-methylation-epic.html.
Wang, M. & Lemos, B. Ribosomal DNA harbors an evolutionarily conserved clock of biological aging. Genome Res. 29, 325–333 (2019).
Bock, C. et al. BiQ Analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 21, 4067–4068 (2005).
Bock, C. Analysing and interpreting DNA methylation data. Nat. Rev. Genet. 13, 705 (2012).
Teschendorff, A. E. et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450k DNA methylation data. Bioinformatics 29, 189–196 (2013).
Moran, S., Arribas, C. & Esteller, M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics 8, 389–399 (2016).
Fortin, J.-P., Triche, T. J. Jr & Hansen, K. D. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics 33, 558–560 (2017).
Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–1369 (2014).
Alter, O., Brown, P. O. & Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 97, 10101–10106 (2000).
Nielsen, T. O. et al. Molecular characterisation of soft tissue tumours: a gene expression study. Lancet 359, 1301–1307 (2002).
Jaffe, A. E. et al. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int. J. Epidemiol. 41, 200–209 (2012).
Johnson, A. A. et al. The role of DNA methylation in aging, rejuvenation, and age-related disease. Rejuvenation Res. 15, 483 (2012).
Alisch, R. S. et al. Age-associated DNA methylation in pediatric populations. Genome Res. 22, 623–632 (2012).
Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. Proceedings of the 22nd International Conference on Machine Learning—ICML ’05 (ICML, 2005);https://doi.org/10.1145/1102351.1102430.
Liu, J., Morgan, M., Hutchison, K. & Calhoun, V. D. A study of the influence of sex on genome wide methylation. PLoS One. 5, e10028 (2010).
Salas, L. A. et al. An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray. Genome Biol. 19, 64 (2018).
Pelegí-Sisó, D., de Prado, P., Ronkainen, J., Bustamante, M. & González, J. R. methylclock: a bioconductor package to estimate DNA methylation age. Bioinformatics 37, 1759–1760 (2021).
Subasri, V. LFS-early-cancer-onset-prediction: DNA methylation predicts early onset of primary tumor in patients with Li-Fraumeni syndrome. Github https://repos.ecosyste.ms/hosts/GitHub/repositories/vsubasri%2FLFS-Early-Cancer-Onset-Prediction.
Acknowledgements
This study was funded with support from the Terry Fox Research Institute New Frontiers Program Project (#1081, D.M.) and Canadian Institutes for Health Research Foundation Scheme Grant (#143234, D.M.). We thank The Centre for Applied Genomics (TCAG) DNA sequencing and synthesis facility for their sequencing services. The SickKids Cancer Sequencing (KiCS) program is supported by the Garron Family Cancer Centre with funds from the SickKids Foundation. T.J.P. is supported by the Canada Research Chairs Program and a Senior Investigator Award from the Ontario Institute for Cancer Research. D.M. is supported in part by the CIBC Children’s Foundation Chair in Child Health Research. V.S. is supported by an Ontario Graduate Scholarship and a Vector Institute Research Grant. Samples and data from J.R.H. were supplied by the Children’s Cancer Centre Tissue Bank at the Murdoch Children's Research Institute and The Royal Children’s Hospital (RCH; www.mcri.edu.au/childrenscancercentretissuebank). Establishment and running of the Children’s Cancer Centre Tissue Bank is made possible through generous support by CIKA (Cancer In Kids at RCH; http://www.cika.org.au), Leukemia Auxiliary at RCH (LARCH), the Murdoch Children's Research Institute, and RCH. J.R.H. is supported by grants from the McClurg Foundation, Hospital Research Foundation, Robert Connor Dawes Foundation, and My Room Children’s Cancer Charity. The work of K.C.A., P.P.K., and S.A.S. was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute. We would also like to thank the late Ana Novokmet for her critical contributions to this study.
Author information
Contributions
V.S., T.G., J.R.H., E.C., C.P., C.E., J.L.F., K.E.N., J.A., W.K., H.G., J.L., N.A., L.B., K.C.A., P.P.K., S.A.S., J.D.S., and D.M. prepared samples and performed clinical characterization. V.S. and B.B. processed the data and developed the model with guidance from A.G. V.S. and B.L. performed feature interpretation analyses. L.E. provided support for modeling and analytical strategies. V.S., B.B., D.M., and A.G. wrote the manuscript with contributions from L.E., J.R.H., K.E.N., K.C.A., P.P.K., S.A.S., J.D.S., T.J.P., and A.V. A.G. and D.M. supervised the research. All authors approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Antoinette Perry, Soeren Lukassen and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Subasri, V., Brew, B., Laverty, B. et al. Peripheral blood DNA methylation predicts the early onset of primary tumor in TP53 mutation carriers. Nat Commun 16, 7976 (2025). https://doi.org/10.1038/s41467-025-62894-5
DOI: https://doi.org/10.1038/s41467-025-62894-5