Introduction

In humans and other mammals, the epigenetic modification of DNA methylation (DNAm) predominantly occurs at cytosine-phosphate-guanine (CpG) dinucleotide sites, and it is essential for regulating gene expression, growth, development, and disease1. Age-associated methylation alterations are related to diseases such as osteoporosis, neurodegenerative disease, diabetes, and cancers2,3. One way to assess age-related disease risks is to measure biological age with epigenetic clocks4. In addition to age, sex also has a strong impact on methylation variation. Sex-associated methylation patterns contribute to sex-biased diseases, such as liver diseases5, autoimmune diseases6, and neurological disorders7. Sex-associated methylation patterns are observed on autosomes8 and the X chromosome9. In fact, DNA methylation is one of the key mechanisms regulating X-chromosome inactivation (XCI) in females, which balances X-linked gene dosage between the sexes10. Some genes that escape from XCI may be expressed at a double dosage and contribute to the female bias in diseases such as autoimmune diseases6. Because age- and sex-associated DNA methylation plays important regulatory roles in aging and disease, it merits in-depth study.

Various detection methods with different coverages and resolutions have been developed to investigate the human DNA methylome. The most widely used Illumina 450 K11 and 850 K12 arrays cover approximately 450,000 and 850,000 CpGs, respectively, focusing on CpG islands and other regulatory regions13. However, these methylation arrays represent only a very small fraction (~3%) of the 28 million CpG sites in the human genome. Most studies of DNA methylation in relation to aging and sex have adopted these arrays, so the influence of the remaining CpGs in the human genome is less well understood.

Moreover, methylation patterns are known to be tissue- and cell-specific14,15. Plasma cell-free DNA (cfDNA) is derived from cellular processes such as apoptosis, necrosis, and active secretion16, which enables non-invasive, real-time monitoring of physiological and pathological conditions in the human body. Deconvolution algorithms based on DNA methylation atlases of human tissues and cell types have been used to estimate the tissue and cell-type origins of cfDNA17,18. However, previous cfDNA research has predominantly focused on disease-specific cfDNA profiles and tissue-of-origin results19,20, and less attention has been paid to age- and sex-related cfDNA methylation profiles. Teo et al. studied cfDNA nucleosome signals as an aging biomarker21. Shtumpf et al. constructed aging clocks based on cfDNA fragment sizes and nucleosome distances22. Li et al. used plasma samples from three age groups of healthy individuals to identify age-related CpGs and developed a cfDNA methylation age prediction model; however, their model used only the CpG sites covered by the 450 K array23. Additionally, studies of sex-differential DNA methylation have relied primarily on whole blood and tissue samples and were often based on 450 K and 850 K arrays9,24,25,26.

In this study, we conduct a comprehensive profiling of CpG methylation characteristics related to age and sex at the whole-genome level, using plasma cell-free DNA whole-genome bisulfite sequencing (WGBS) data from 98 generally healthy adults (52 females and 46 males) aged 22 to 77. A cfDNA epigenetic clock based on 125 CpG sites is developed and validated on an independent dataset. Furthermore, we find some age and sex differences in the tissue and cell-type origins of plasma cfDNA. The detection of individuals with abnormal tissue-of-origin results shows the potential of cfDNA as a biomarker for health monitoring. Our findings provide a foundation for future research on human cfDNA methylation and liquid biopsies.

Methods

Sample collection and ethics statement

This study was approved by the Institutional Review Board on Bioethics and Biosafety of BGI (BGI-IRB 21157-T2). From December 2021 to December 2022, adult participants were recruited during their physical examinations, including young, middle-aged, and elderly individuals, with an almost equal number of males and females in all age groups (Fig. 1a). Women who were pregnant or lactating, and anyone who had fever symptoms, had recently undergone a surgical procedure, or had been diagnosed with infectious diseases, cancer, or other severe diseases, were excluded from this study. A total of 98 participants (52 females and 46 males) aged 22 to 77 were included in this study. There were about 10 males and 10 females in each 10-year age group (Fig. 1a). We collected basic information about their sex, age, and past medical history through a questionnaire. Their physical examination results were also collected. For each participant, a peripheral blood sample was collected using an EDTA blood collection tube. The study was conducted in accordance with the Declaration of Helsinki, and all participants signed informed consent. All data were de-identified prior to analysis by removing direct identifiers such as names and medical record numbers. Each participant was assigned a unique study code for research use and to protect privacy. Written informed consent was obtained from all participants for the publication of the de-identified data.

Fig. 1: Overview of study design and participant cohort (n = 98).

a The recruitment of participants and the workflow of whole genome bisulfite sequencing (WGBS) of plasma cfDNA. The library preparation method is based on a single-stranded library preparation technique. b Identification of age- and sex-associated CpGs on autosomes using linear regression. The related genes are further analyzed based on their rank by number of CpGs and on the enriched pathways. c Sex difference analysis of methylation patterns on the X chromosome. d Tissue-of-origin analysis of cfDNA and exploration of the influences of age, sex, and individual variance. (Fig. 1 was created with Biorender.com).

Cell-free DNA extraction

Plasma isolation was performed via a two-step centrifugation procedure within 4 h after blood sampling. In the first step, the blood was centrifuged at 1600 g for 10 min at 4 °C. In the second step, the upper layer of plasma was centrifuged again at 16,000 g for 10 min at 4 °C to remove cellular debris. The resultant supernatant plasma was then stored at −80 °C before cfDNA extraction. For each sample, 0.5–1 mL plasma was used for cfDNA extraction using MagPure Circulating DNA KF Kit (Magen, China) according to the manufacturer’s instructions.

Library preparation for WGBS

The input cfDNA amount for the library preparation was 15.23 ± 6.64 ng (mean ± SD). The extracted cfDNA was bisulfite treated and purified using EZ-96 DNA Methylation Kit (Zymo Research). To evaluate bisulfite conversion efficiency, 0.05 ng of lambda DNA (New England Biolabs, #N3011S) was added to each reaction as an unmethylated control before bisulfite treatment. Subsequently, WGBS sequencing libraries were prepared utilizing a modified single-stranded library preparation method of the SPlinted Ligation Adapter Tagging (SPLAT)27,28,29. Briefly, the double-stranded DNA (dsDNA) was denatured into single-stranded DNA (ssDNA) at high temperatures; then adapters containing six random bases were annealed and ligated to both ends of ssDNA; finally, the ligation product was amplified through PCR and the barcode sequences of samples were introduced through PCR primers. Notably, the processes include the utilization of a single-stranded DNA binding protein (ET SSB) to stabilize the presence of ssDNA in solution28. A one-step adapter ligation reaction was used and the adapters were specifically tailored for the DNBSEQ platform (MGI)30. Subsequently, the libraries were subjected to 100-bp paired-end (PE) sequencing using the DNBSEQ platform with a sequencing depth of >30× for each sample.

WGBS data processing

We used Fastp (0.19.5)31 to process the raw sequencing data with default parameters, including trimming adapters, filtering out low-quality reads, and discarding reads with a high proportion of undetermined nucleotides (Ns). Subsequently, the pre-processed reads were aligned to the human reference genome (GRCh38.p14) using BitMapperBS (v1.0.2.3)32 with default settings. Following alignment, PCR duplicates were removed using sambamba (v0.8.2)33. After removal of the duplicated reads, the median sequencing depth across samples was 31.19× (Supplementary Data 1). Depth was calculated from the sequenced reads and bases, and bases in the overlapping region of a read pair were counted twice.

We then used MethylDackel (v0.5.1) (https://github.com/dpryan79/MethylDackel) to calculate the methylation value at each CpG site. In brief, the methylation value is the ratio of the number of reads supporting C (methylated) to the total number of reads supporting C (methylated) or T (unmethylated) at that site. Methylation values range from 0 to 1, with 0 indicating no DNA methylation and 1 indicating complete DNA methylation. CpGs with less than 5× coverage were labeled as NA, and CpGs with NA values in more than 10% of samples were removed from further analyses. The remaining NA values were imputed with the impute.knn function (k = 10) in R. CpGs located on chromosome Y were not investigated in this study. Ultimately, we obtained a comprehensive DNA methylation profile consisting of 23,510,673 CpG sites on autosomes and 996,907 CpG sites on the X chromosome.
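The per-site calculation and the two filters described above can be illustrated with a minimal Python sketch. It is not the MethylDackel implementation; the function names and the choice of None to represent NA are assumptions made for illustration only.

```python
from typing import List, Optional

MIN_COVERAGE = 5             # sites covered by fewer reads are treated as NA
MAX_MISSING_FRACTION = 0.10  # CpGs that are NA in more than 10% of samples are dropped


def methylation_rate(methylated: int, unmethylated: int) -> Optional[float]:
    """Return C / (C + T) for one CpG in one sample, or None (NA) if coverage < 5."""
    coverage = methylated + unmethylated
    if coverage < MIN_COVERAGE:
        return None
    return methylated / coverage


def keep_cpg(rates_across_samples: List[Optional[float]]) -> bool:
    """Keep a CpG only if at most 10% of its per-sample values are NA."""
    missing = sum(1 for rate in rates_across_samples if rate is None)
    return missing / len(rates_across_samples) <= MAX_MISSING_FRACTION
```

For example, a site with 8 methylated and 2 unmethylated reads has a methylation value of 0.8, whereas a site with 3 methylated reads and 1 unmethylated read is set to NA because its coverage is below 5.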

Genomic distribution analysis

Manhattan plots were generated using the R package CMplot (https://github.com/YinLiLin/R-CMplot). The annotatr R package34 was used to annotate features such as CpG regions and gene regions. CpG regions include open sea regions, CpG islands, CpG shelves, and CpG shores. CpG shores are defined as the 2 Kb upstream/downstream from the ends of the CpG islands, less the CpG islands. CpG shelves are defined as another 2 Kb upstream/downstream of the farthest upstream/downstream limits of the CpG shores, less the CpG islands and CpG shores. The remaining genomic regions comprise the open sea annotation. Gene regions include 1–5 Kb upstream of the TSS (1to5kb), 3′ untranslated regions (3′UTR), 5′ untranslated regions (5′UTR), exons, introns, promoters (<1 Kb upstream of the TSS), and enhancers.
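The island, shore, shelf, and open sea definitions above amount to a distance rule relative to the nearest CpG island. The following Python sketch illustrates that rule under simplifying assumptions: islands are given as inclusive (start, end) coordinate pairs on a single chromosome, and the function name and labels are hypothetical rather than part of the annotation package.

```python
from typing import List, Tuple

SHORE_WIDTH = 2000  # 2 Kb flanking each CpG island
SHELF_WIDTH = 2000  # a further 2 Kb beyond each shore

# When regions from different islands overlap, the more specific label wins.
PRIORITY = {"island": 3, "shore": 2, "shelf": 1, "open_sea": 0}


def classify_cpg(position: int, islands: List[Tuple[int, int]]) -> str:
    """Label a CpG position as island, shore, shelf, or open_sea."""
    best = "open_sea"
    for start, end in islands:
        if start <= position <= end:
            label = "island"
        else:
            distance = start - position if position < start else position - end
            if distance <= SHORE_WIDTH:
                label = "shore"
            elif distance <= SHORE_WIDTH + SHELF_WIDTH:
                label = "shelf"
            else:
                label = "open_sea"
        if PRIORITY[label] > PRIORITY[best]:
            best = label
    return best
```

With a single island at 10,000 to 11,000, a CpG at position 9,000 falls in the shore, one at 7,000 in the shelf, and one at 5,000 in the open sea.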

Age-associated CpGs analysis

We fitted a linear regression model for each CpG site, using the glm function (family = gaussian) in R with a two-tailed t-test on the regression coefficient, to identify age-associated CpG sites (Eq. 1). The p-values were adjusted using the Benjamini–Hochberg (BH) method35,36, and all CpG sites with an adjusted p-value (Padj) < 0.05 were defined as age-associated.

$$\mathrm{Methylation\ rate} \sim \mathrm{Age} \tag{1}$$

The reported age-associated CpGs data were obtained from the EWAS Atlas database, with the traits of aging (https://ngdc.cncb.ac.cn/ewas/browse?traitList=aging).
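As an illustration of the multiple-testing step, the Benjamini–Hochberg adjustment can be written in a few lines of Python. This is a sketch of the standard procedure, not the code used in the study; the function name is an assumption.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A CpG site would then be called age-associated when its adjusted value is below 0.05.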

Development of cfDNA DNAm epigenetic clock

The methylation values of the 3047 age-associated CpGs were standardized using the R language function scale() from the base package. To build the cfDNA methylation epigenetic clock we implemented an elastic net regression model, using the methodology described by Horvath37. The elastic net models were generated using the “glmnet” package in R, using the functions of cv.glmnet and predict.glmnet. The elastic net approach combines Ridge and LASSO regression with an alpha parameter of 0 for Ridge and 1 for LASSO. Here, the elastic net alpha parameter was set to 0.5. The minimal lambda was calculated using 10-fold cross-validation using the “glmnet” package. A transformed version of chronological age was regressed on DNAm levels at all included CpG sites. Given the limited sample size, we used a previously described cross-validation scheme (leave-one-out cross-validation, LOOCV) for determining unbiased estimates of the accuracy of our cfDNA methylation epigenetic clock23. The cross-validation procedure reports the unbiased estimates of age correlation r, which is defined as Pearson correlation between the actual age and the predicted value (the DNAm age), and the median absolute error (MAE).
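The evaluation scheme described above (leave-one-out predictions scored by Pearson's r and the median absolute error) can be sketched in Python as follows. The elastic net itself is not reimplemented here: the fit and predict arguments are hypothetical placeholders for whatever model is being evaluated, and all function names are assumptions.

```python
import math
import statistics
from typing import Callable, List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between actual and predicted ages."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median of |actual - predicted| over all samples."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))


def loocv_predictions(features: List[list], ages: List[float],
                      fit: Callable, predict: Callable) -> List[float]:
    """Hold out each sample once, fit on the rest, and predict the held-out sample."""
    predictions = []
    for i in range(len(ages)):
        train_x = features[:i] + features[i + 1:]
        train_y = ages[:i] + ages[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, features[i]))
    return predictions
```

Because each prediction comes from a model that never saw the held-out sample, the resulting r and MAE are not inflated by fitting and testing on the same individuals.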

Sex-associated CpGs analysis

We fitted a linear regression model for each CpG site, using the glm function (family = gaussian) in R with a two-tailed t-test on the regression coefficient, to identify sex-associated CpG sites (Eq. 2). Sex was coded as 0 for males and 1 for females. The p-values were adjusted using the Benjamini–Hochberg (BH) method. CpG sites on autosomes with an adjusted p-value (Padj) < 0.05 were defined as sex-associated CpGs on autosomes, and CpG sites on the X chromosome with Padj < 10⁻⁶ were defined as sex-associated CpGs on the X chromosome.

$$\mathrm{Methylation\ rate} \sim \mathrm{Sex} \tag{2}$$

The reported sex-associated CpGs data were obtained from the EWAS Atlas database, with the traits of gender (https://ngdc.cncb.ac.cn/ewas/browse?traitList=gender).
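With sex coded as 0 for males and 1 for females, the slope of the regression in Eq. 2 equals the difference between the female and male mean methylation rates, and the intercept equals the male mean. The Python sketch below fits a single-predictor least-squares model and returns the t-statistic of the slope; it illustrates the arithmetic only and is not the glm call used in the study. The function name is an assumption, and the conversion of t to a p-value is omitted.

```python
import math
from typing import Sequence, Tuple


def ols_slope_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x and return (slope, intercept, t-statistic of slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, slope / standard_error
```

For methylation rates of 0.2, 0.3, and 0.4 in three males and 0.6, 0.7, and 0.8 in three females, the slope is 0.4 (female mean minus male mean) and the intercept is 0.3 (male mean).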

Enrichment analysis

We fitted a negative binomial regression model to identify genes enriched in age- or sex-associated CpGs, adjusting for gene length and the total number of CpG sites per gene. The p-values were corrected for multiple testing using the Benjamini–Hochberg (BH) method.

Functional annotation of enriched Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways was performed with Metascape (http://metascape.org), using a hypergeometric test and BH correction with the default parameters.

For tissue enrichment analysis, we used the Human Protein Atlas (HPA) database38 and the TissueEnrich tool39 (https://tissueenrich.gdcb.iastate.edu/) to test our gene sets for enrichment of tissue-specific genes, using a hypergeometric test and BH correction with the default parameters.
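The one-tailed hypergeometric test underlying these enrichment analyses asks how likely it is to see at least k annotated genes in a gene set of size n, given that K of the N background genes carry the annotation. A minimal Python sketch follows; it is not the Metascape or TissueEnrich implementation, and the function and parameter names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N background genes, K of them annotated."""
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return favourable / total
```

For instance, with N = 10 background genes, K = 4 annotated genes, and a gene set of n = 3, the probability of observing at least k = 2 annotated genes is 40/120, or about 0.33.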

cfDNA tissue deconvolution analysis

We used two methods, each based on a DNA methylation atlas of human cell types17,18, to profile the tissues of origin of cfDNA in the 98 samples. The first was developed by Moss et al. and includes an extensive analysis of 7890 differentially methylated CpG sites across 25 unique human tissues and cell types17. The second was constructed by Loyfer et al. and enables the quantification of 39 tissues/cell types18. Using wgbstools (https://github.com/nloyfer/wgbs_tools), we quantified the relative contributions of various cell types to the plasma cfDNA.
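Reference-based deconvolution of this kind is commonly formulated as a non-negative least-squares problem: find non-negative cell-type fractions whose weighted mixture of atlas methylation values best matches the observed cfDNA methylation. The Python sketch below solves a toy version by projected gradient descent. It is an illustration under that assumption only, not the wgbstools algorithm or the code used in the study, and the function name, iteration count, and toy atlas are hypothetical.

```python
from typing import List


def deconvolve(atlas: List[List[float]], sample: List[float],
               iterations: int = 2000) -> List[float]:
    """Estimate non-negative cell-type fractions that sum to 1.

    atlas[i][j] is the reference methylation of CpG i in cell type j;
    sample[i] is the observed cfDNA methylation at CpG i.
    """
    n_cpgs, n_types = len(atlas), len(atlas[0])
    # The squared Frobenius norm bounds the curvature of the least-squares
    # objective, so its reciprocal is a safe fixed step size.
    step = 1.0 / sum(value * value for row in atlas for value in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [sum(atlas[i][j] * weights[j] for j in range(n_types)) - sample[i]
                    for i in range(n_cpgs)]
        gradient = [sum(atlas[i][j] * residual[i] for i in range(n_cpgs))
                    for j in range(n_types)]
        # Gradient step, then projection back onto non-negative values.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    total = sum(weights)
    if total == 0:
        return weights
    return [w / total for w in weights]
```

With a toy atlas of three CpGs and two cell types, a sample mixed from 70% of the first type and 30% of the second is recovered as approximately [0.7, 0.3].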

Statistics and reproducibility

There were no technical replicates for the WGBS of the 98 samples. The statistical models for the age- and sex-associated CpG analyses, and the methods for developing the cfDNA DNAm epigenetic clock, enrichment analysis, and cfDNA tissue deconvolution analysis, are described in detail in the sections above. Two-tailed tests were used for the linear regression models. Differences in CpG patterns among gene regions and CpG regions were assessed with the chi-square test with BH correction. The two-tailed Wilcoxon rank-sum test was used to compare tissue/cell-type-derived cfDNA proportions between females and males in different age groups. Spearman’s correlation was used to test the relationship between the hepatocyte-derived cfDNA proportion and ALT, AST, GGT, and HDLC.

Results

Study design and cfDNA methylation profiling

A total of 98 generally healthy volunteers, including 52 females and 46 males, are recruited in this study. Their ages range from 22 to 77 (Fig. 1a and Supplementary Data 1). We collect their basic information (age, sex, medical histories) and physical examination results. Peripheral blood samples are collected, and plasma cell-free DNA (cfDNA) is used for WGBS with a sequencing depth of ~30× for each sample (Fig. 1a). The WGBS libraries are prepared based on SPlinted Ligation Adapter Tagging (SPLAT)27,28,29, which has a relatively small sequencing bias compared with some conventional library preparation methods (Supplementary Fig. 1, Supplementary Data 2). The fragment size distributions and the basic statistics on the overall quality of WGBS are summarized in Supplementary Fig. 1 and Supplementary Data 1 and 2.

In the cfDNA methylation analysis, CpGs with coverage below 5× are filtered out, and CpGs with missing values in more than 10% of samples are excluded. After data quality control, a comprehensive whole-genome DNA methylation profile is established, encompassing 23,510,673 CpGs on autosomes and 996,907 CpGs on the X chromosome, corresponding to about 80% of the CpGs in the human genome. These qualified CpGs are used in subsequent analyses of age- and sex-associated methylation patterns (Fig. 1b–d).

Age-associated DNA methylation patterns

We identify 3047 CpGs on autosomes and one CpG on the X chromosome whose methylation rates are significantly associated with age (Padj < 0.05, linear regression) (Supplementary Data 3 and Supplementary Data 4). The Manhattan plot is shown in Fig. 2a. For all these CpGs, the Pearson correlation coefficient between methylation rate and age satisfies |r| > 0.4 (Supplementary Data 4). The age-associated CpG on the X chromosome, chrX:97837735 (GRCh38.p14), resides in an open sea region (>4 kb from CpG islands) and an intergenic region (>3 kb from the nearest annotated gene). This locus is neither covered by the Illumina 450 K/850 K arrays nor annotated as a regulatory element in enhancer-gene link databases (e.g., EpiMap40, ABC maps41). The specific function of this site cannot be determined. Here, we only study and discuss the autosomal age-associated CpGs.

Fig. 2: Genome-wide identification of age-associated CpGs, genes and pathways.

a Manhattan plot showing the distribution of 3047 age-associated CpGs (Padj < 0.05) across all the autosomes. The linear regression model is used, and the P values for the regression coefficients are derived from two-tailed t-tests. The P values are adjusted using the Benjamini–Hochberg (BH) method. The numbers of age-associated CpGs in 10 Mb bins are shown on the chromosomes at the bottom of the Manhattan plot, with the color bar on the right side. The methylation rates of age-associated CpGs are negatively (94%) or positively (6%) correlated with age. Two examples of CpGs (chr10:13527445 and chr5:141040234) are shown in (b, c), with the methylation rate negatively and positively correlated with age, respectively. The methylation rate of the CpG site for each sample is shown as a dot. The red line is the regression line based on the linear regression model (with 95% confidence intervals shown as shaded areas), representing the relationship between the methylation rates and the ages. The P values are adjusted using the Benjamini–Hochberg (BH) method. d Ranking of the 1587 genes based on the number of age-associated CpGs per gene. Among the 62 genes that have 5 or more age-associated CpGs, 16 genes (such as IFT80 and RILPL1) have not previously been reported to be age-related at the DNA methylation level (EWAS Atlas database). These genes are labeled with gene names, and those genes enriched with age-associated CpGs are labeled in red. The top-ranked gene, FIGN, is previously known to be age-related at the DNA methylation level and is enriched with age-associated CpGs. e KEGG enrichment analysis of the 1587 age-associated genes using the hypergeometric test (one-tailed). The P values without adjustment are shown here, and the exact P values and adjusted P values are in Supplementary Data 7. f The cfDNA methylation epigenetic clock with 125 CpGs. The plot shows the relationship between chronological age and epigenetic age. The blue line is the regression line of epigenetic age on chronological age. Pearson’s correlation coefficient (r) and median absolute error (MAE) are denoted. Leave-one-out cross-validation (LOOCV) is used to determine the accuracy of the cfDNA methylation epigenetic clock.

Of the 3047 age-associated CpGs, 2854 CpGs exhibit negative correlations (Fig. 2b) and 193 CpGs show positive correlations with age (Fig. 2c) (Supplementary Data 4). In other words, for most of the age-associated CpGs, the methylation levels tend to decrease with age. This result is consistent with previous findings obtained through microarray analyses42,43. We further explore the genomic distributions of those CpGs in relation to their nearest genes or CpG islands. The age-associated CpGs are enriched in promoters, 5’ UTRs, enhancers, CpG islands, and CpG shores, while they are underrepresented in 3’ UTRs, introns, and open sea (Supplementary Fig. 2a, b and Supplementary Data 5).

On the one hand, we find that only 71 and 208 of the age-associated CpGs in this study are covered by the 450 K and 850 K arrays, respectively (Supplementary Data 4). By comparing against the EWAS Atlas database, we find that 55 of the 71 CpGs on the 450 K array and 134 of the 208 CpGs on the 850 K array are annotated as age-associated CpGs in the database (Supplementary Data 4). The consistency of these age-associated CpG sites with those discovered by microarray assays indicates the reliability of our results. On the other hand, because of the limited CpG coverage of the 450 K and 850 K arrays, the remaining age-associated CpGs in our study have not yet been reported as age-associated CpGs in the EWAS Atlas database. We believe our identified age-associated CpGs provide a valuable resource for future studies to elucidate epigenetic regulation in aging.

These 3047 age-associated CpGs are mapped to 1587 genes (Fig. 2d and Supplementary Data 6). The number of age-associated CpGs per gene varies widely: 1168 genes (73%) contain only one age-associated CpG, 357 genes (23%) contain 2 to 4 age-associated CpGs, and 62 genes (4%) contain 5 or more age-associated CpGs. The FIGN gene has the highest number of age-associated CpGs (58). Two of these 58 CpGs (cg15148145 and cg16532938) are covered by the 850 K array and annotated as age-related in the EWAS database (Supplementary Data 4). Among the 62 genes with 5 or more age-associated CpGs, 46 are annotated as age-related in the EWAS database, while the remaining 16 genes, such as IFT80 and RILPL1, have not previously been found to be age-related at the DNA methylation level (Fig. 2d, Supplementary Data 4). IFT80 negatively regulates osteoclast differentiation44, and knockout of IFT80 in a mouse model caused an osteoporosis phenotype45,46. Considering the regulatory effects of IFT80 on osteoclasts, we speculate that osteoporosis in older adults may be influenced by age-associated methylation changes. Additionally, recent research has identified a relationship between RILPL1 and oculopharyngodistal myopathy (OPDM)47, a rare adult-onset hereditary muscle disease with symptoms that progressively worsen with age48. The discovery that methylation levels of CpGs in RILPL1 decline with age may give some clues to the pathogenesis and progression of OPDM.

We also apply a negative binomial regression model to identify genes enriched in age-associated CpGs (Fig. 2d and Supplementary Data 6), adjusting for gene length and total number of CpG sites per gene. We identify 57 genes significantly associated with age, and 19 genes (e.g., FIGN, TENM2) have been previously reported in aging-related studies (Supplementary Data 4). The remaining genes that are not enriched in age-associated CpGs, or with small numbers of age-associated CpGs might still play important roles in aging. For example, the genes of NEFL, NELL1, and PDGFC contain only one age-associated CpG (Supplementary Data 4). They were already known as age-related genes with some other CpGs being reported49. Moreover, the proteins encoded by these genes are significantly dysregulated in Alzheimer’s disease (AD) patients50. Our results indicate that the DNA methylation alteration of these genes may be involved in the neurodegenerative processes. Further investigation of these age-related genes (Supplementary Data 6), especially the newly discovered genes, is important.

To understand the functions and biological processes of all 1587 genes, KEGG pathway and GO enrichment analyses are performed (Fig. 2e, Supplementary Fig. 2c and Supplementary Data 7). The enriched KEGG pathways include the cAMP signaling51, TNF signaling52, TRP channel regulation51, cancer53, GnRH secretion54, and neurodegeneration55 pathways. These pathways have previous evidence of age-associated DNA methylation alteration or are known to be related to aging.

A methylation epigenetic clock based on plasma cfDNA

Epigenetic clocks based on plasma cell-free DNA (cfDNA) methylation have not been extensively explored23. Here, we develop a cfDNA methylation epigenetic clock using elastic net regression on the 3047 age-associated CpGs. The final age-prediction model includes a set of 125 CpG sites and achieves a relatively high level of accuracy: the correlation coefficient (r) is 0.91 and the median absolute error (MAE) is 3.74 years (Fig. 2f, Supplementary Data 8). The 125 CpGs and their model coefficients are provided in Supplementary Data 9. To further validate this cfDNA methylation clock, we test it on publicly available cfDNA WGBS data from 23 healthy individuals (GSE186458). The external data also show a strong correlation (r = 0.94) between chronological age and biological age, with an MAE of 9.44 years (Supplementary Fig. 3, Supplementary Data 10). To the best of our knowledge, the majority of sites selected for our cfDNA methylation clock are novel and not present in existing DNAm clock algorithms. Only two of the CpG sites are included in Hannum’s blood-based clock (composed of 71 DNAm sites)56, and three are included in Horvath’s Skin & Blood clock (comprising 391 DNAm sites)57 (Supplementary Data 11). The small overlap can be attributed to several factors, including the sample materials, methylation detection methods (WGBS vs Illumina 450 K/850 K array), and population demographics (Chinese vs European/American, and different age ranges).

Sex-associated DNA methylation patterns on autosomes

We identify 1053 CpGs on autosomes whose methylation rates are significantly associated with sex (Padj < 0.05, linear regression) (Supplementary Data 12). The Manhattan plot is shown in Fig. 3a. Notably, only seven of the 1053 CpGs are also identified as age-associated CpGs (Supplementary Fig. 4, Supplementary Data 3, 12), revealing that most of the reported age- and sex-associated CpGs are independent. We find that six of the seven CpGs are in FIGN, which contains the highest number of age-associated CpGs (Fig. 2d, Supplementary Data 6). The two CpGs in FIGN (cg15148145 and cg16532938) are annotated as age- and sex-associated in the EWAS database. Here, the six CpGs not covered by the 850 K arrays are discovered to be simultaneously associated with age and sex in our study.

Fig. 3: Genome-wide identification of autosomal sex-associated CpGs, genes and pathways.

a Manhattan plot showing the distribution of 1053 sex-associated CpGs (Padj < 0.05) across all the autosomes. The linear regression model is used, and the P values for the regression coefficients are derived from two-tailed t-tests. The P values are adjusted using the Benjamini–Hochberg (BH) method. The numbers of sex-associated CpGs in 10 Mb bins are shown on the chromosomes at the bottom of the Manhattan plot, with the color bar on the right side. b, d Ranking of genes based on the number of sex-associated CpG sites per gene. The genes containing CpGs with higher methylation rates in females than in males are shown in (b). The genes containing CpGs with higher methylation rates in males than in females are shown in (d). The genes containing 5 or more sex-associated CpG sites are labeled with the gene names, and genes enriched with sex-associated CpGs are labeled in brown. The methylation rates of sex-associated CpGs of two example genes: c LINC01597 and e PTPRN2/LOC105375614. Box plots show median ± interquartile range (IQR) and 1.5 IQR ranges (whiskers). f KEGG and g GO enrichment analysis of the genes containing the 1053 autosomal sex-associated CpGs, using the hypergeometric test (one-tailed). The P values without adjustment are shown in (f, g). The P values and adjusted P values are in Supplementary Data 17.

The 1053 sex-associated CpGs with higher methylation rates in females are named as female-higher methylation positions (HMPs), and those with higher methylation rates in males as male-HMPs. There are 727 female-HMPs (69%) and 326 male-HMPs (31%), consistent with previous reports that most sex-associated CpG sites are more methylated in females than in males58,59. Similar to the finding about age-associated CpGs, only a limited number of these sex-associated CpGs are covered by the 450 K and 850 K arrays, 17 and 33, respectively (Supplementary Data 12). The EWAS database shows that 11 out of 17, and 14 out of 33 of CpGs have previously been found to be sex-associated by DNA methylation arrays. We evaluate the distribution of the 1053 sex-associated CpGs based on their relation to the nearest genes or CpG islands34. The sex-associated CpGs are enriched in promoters, exons, enhancers, CpG islands, and CpG shores, while they are underrepresented in the introns, open sea, and CpG shelves (Supplementary Fig. 5a, b, Supplementary Data 5).

The 1053 sex-associated CpGs are mapped to 324 genes (Supplementary Data 12). The genes containing only female-HMPs are defined as female-HMGs. Likewise, the genes containing only male-HMPs are defined as male-HMGs. The remaining genes, which contain both female-HMPs and male-HMPs, are defined as mix-HMGs. In our results, there are 236 female-HMGs (Supplementary Data 13), 82 male-HMGs (Supplementary Data 14), and 6 mix-HMGs (Supplementary Data 15). The genes enriched with female- or male-HMPs are labeled in brown in Fig. 3b, d, and summarized in Supplementary Data 13 and Supplementary Data 14. To our knowledge, most of the CpGs are newly discovered as sex-associated in this study; at the gene level, however, a considerable number of genes are already known to be sex-associated. In a previous study based on whole blood samples using the 850 K array, 16 of the 236 female-HMGs were reported to have higher methylation in females, and 2 of the 82 male-HMGs were reported to have higher methylation in males24. Additionally, transcriptome analysis across various tissues60 has revealed that 61 of the 236 female-HMGs (19%) display sex-biased expression, and 20 of the 82 male-HMGs (24%) display sex-biased expression.

In female-HMGs, the LINC01597 gene contains the highest number of female-HMPs (33) (Fig. 3b, c, Supplementary Data 13, Supplementary Data 16). Previously, transcriptome analysis of brain tissue showed that this gene had a lower expression level in females than in males61. We speculate that the methylation pattern differences in LINC01597 between sexes may play a role in regulating gene expression. In male-HMGs, the PTPRN2/LOC105375614 gene contains the highest number of male-HMPs (22) (Fig. 3d, e, Supplementary Data 14, Supplementary Data 16). These CpGs are located within the protein-coding gene PTPRN2 and the long non-coding RNA (lncRNA) gene LOC105375614, on the minus strand and the plus strand of the genome, respectively. PTPRN2 is important in the secretion of hormones and neurotransmitters62. In females, but not in males, it influenced the secretion of the pituitary hormones luteinizing hormone (LH) and follicle-stimulating hormone (FSH), and thus affected fertility in a mouse model63. Previous research has also reported higher methylation levels of PTPRN2 in males than in females in whole blood samples and brain tissues24,64. Given that lncRNA expression is often involved in the regulation of DNA methylation and gene expression65,66,67, the detailed regulatory mechanism and the interaction between lncRNAs and their target genes need further investigation.

Next, we analyze the 324 genes containing sex-associated CpGs by KEGG pathway and GO enrichment (Fig. 3f, g, Supplementary Data 17). The enriched KEGG pathways include the MAPK signaling pathway, calcium signaling pathway, salivary secretion, and morphine addiction, all of which have been reported to differ between males and females at the DNA methylation level7,68,69. It is worth noting that among the enriched GO functions, four are related to neural functions (Fig. 3g). Concordantly, these genes are significantly enriched in genes predominantly expressed in the cerebral cortex (Supplementary Fig. 6a, b). Previous research has demonstrated sex differences in the brain epigenome and transcriptome in neuropsychiatric disorders61,70, and the sex-associated genes mentioned above, such as LINC01597 and PTPRN2, have displayed sex differences in DNA methylation64 and gene expression61 in neuropsychiatric disorders. Our study reveals that such methylation differences in brain-related functions also exist in healthy people. In addition, PTPRN2 plays an important role in insulin secretion in response to glucose stimuli71. Sex differences in glucose metabolism are well known, and diabetes more frequently affects males72. This difference is also reflected in the enriched KEGG pathway of glycolysis/gluconeogenesis (Fig. 3f). To sum up, we find many sex-associated genes with diverse molecular and biological functions, and their functions are relevant to sex-biased diseases.

Sex-associated DNA methylation on the X chromosome

On the X chromosome, we identify 638,599 CpGs (64%) significantly associated with sex (Padj < 0.05, linear regression). These methylation differences are related to X chromosome inactivation (XCI). When applying a more stringent threshold (Padj < 10−6), we identify 29,446 CpGs (5%) with significant sex differences. In the subsequent analysis, we focus on these more significant sex-associated CpGs.

To illustrate the differences in CpG methylation rates between sexes, we plot the average methylation rates of all CpG sites on the X chromosome in males against those in females (Fig. 4a, Supplementary Data 18). Similar to the display in a previous study9, this plot reveals five methylation patterns: red dots (pattern A, 28,464 CpGs) show significantly higher methylation in females (Padj < 10−6), likely reflecting XCI; dark blue dots (pattern B, 13,018 CpGs) indicate hypomethylation (methylation rate < 0.25) in both sexes (Padj > 10−6), suggesting potential escape from XCI in females; orange dots (pattern C, 561,972 CpGs) represent hypermethylation (methylation rate > 0.75) in both sexes (Padj > 10−6); purple dots (pattern D, 265,880 CpGs) show significantly higher methylation in males (Padj < 10−6); and gray dots (pattern other, 121,573 CpGs) represent methylation rates ranging from 0.25 to 0.75 (Padj > 10−6).
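The five-way assignment described above can be written as a simple decision rule. The following is a minimal sketch, not the code used in this study: the function name `classify_x_cpg` and its parameter names are hypothetical, and the handling of edge cases the text does not specify (for example, a non-significant CpG that is below 0.25 in one sex and above 0.75 in the other) is an assumption that places them in "other".

```python
def classify_x_cpg(female_rate, male_rate, padj,
                   alpha=1e-6, low=0.25, high=0.75):
    """Assign an X-chromosome CpG to pattern A, B, C, D, or "other".

    female_rate, male_rate: average methylation rates (0 to 1) per sex.
    padj: adjusted P value for the sex difference.
    """
    if padj < alpha:
        if female_rate > male_rate:
            return "A"        # significantly higher in females
        if male_rate > female_rate:
            return "D"        # significantly higher in males
        return "other"        # assumption: equal means are left unassigned
    if female_rate < low and male_rate < low:
        return "B"            # hypomethylated in both sexes
    if female_rate > high and male_rate > high:
        return "C"            # hypermethylated in both sexes
    return "other"            # assumption: all remaining CpGs


# Hypothetical values for illustration only.
patterns = [
    classify_x_cpg(0.50, 0.10, 1e-9),
    classify_x_cpg(0.10, 0.05, 0.30),
    classify_x_cpg(0.90, 0.85, 0.50),
    classify_x_cpg(0.30, 0.80, 1e-8),
    classify_x_cpg(0.50, 0.50, 0.20),
]
```

Testing significance before the rate thresholds means that a CpG with a significant sex difference is always assigned to A or D, regardless of its absolute methylation level.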

Fig. 4: Characterization of X-chromosome sex-associated CpGs, genes and pathways.

a Comparison of DNA methylation rates in females and males for the CpGs on the X chromosome, revealing CpGs under XCI that are significantly (Padj < 10−6) more methylated in females than in males, colored red (pattern A); CpGs escaping XCI with methylation rates < 0.25 in both sexes and no significant sex difference (Padj > 10−6), colored dark blue (pattern B); CpGs with methylation rates > 0.75 in both sexes and no significant sex difference (Padj > 10−6), colored orange (pattern C); CpGs significantly more methylated in males than in females (Padj < 10−6), colored purple (pattern D); and CpGs with methylation rates ranging from 0.25 to 0.75 (Padj > 10−6), colored gray (pattern other). Distribution of methylation patterns in gene regions (b) and CpG regions (c). Significant differences among patterns (chi-square test, two-tailed test, adjusted P < 0.001) exist in all tested gene regions and CpG regions (Supplementary Data 5). KEGG enrichment analysis for genes with pattern A and pattern B in 5’ UTRs (d) and promoters (e), using the hypergeometric test (one-tailed test). The P values without adjustment are shown in (d, e). The P values and adjusted P values are in Supplementary Data 20, 21. The genes in the four XCI status categories are labeled in red (XCI), yellow (escape XCI), black (variable XCI), and blue (unknown). The genes labeled in blue in patterns A and B currently have an unknown XCI status.

We then evaluate the distribution of the methylation patterns based on their relation to the nearest genes or CpG islands34 (Fig. 4b, c, Supplementary Data 5). Patterns A and B are clearly enriched in promoters, 5’ UTRs, enhancers, CpG islands, and CpG shores, while patterns C, D, and other are clearly enriched in introns and the open sea (Fig. 4b, c, Supplementary Data 5). This distribution aligns with previous findings that loci with lower methylation in males (patterns A and B) are typically in promoters9. In fact, a considerable number of genes display multiple patterns, including A, B, C, and D, indicating a high degree of complexity in the regulatory mechanisms of gene expression on the X chromosome (Supplementary Fig. 7, Supplementary Data 19).

Given the regulatory role of XCI and the significant impact of the promoter and 5’ UTR on gene expression regulation73, we apply KEGG pathway enrichment analysis to the genes with CpGs in patterns A and B that lie in the promoter and 5’ UTR regions (Fig. 4d, e, Supplementary Data 20, 21). The genes in pattern A are enriched in pathways such as the NF-κB pathway74, Primary Immunodeficiencies75, and Polycomb repressive complexes76, which are known to be associated with XCI. Most of the genes in the enriched pathways are classified as XCI genes and some are defined as variable escape genes (Fig. 4d, e), according to combined survey approaches for XCI status77,78. Five genes (PABPC5, PABPC1L2A, NCBP2L, IL2RG, and GABRQ) enriched in these pathways have an unknown status in the current catalog of XCI status78,79. The genes in pattern B are enriched in the JAK-STAT signaling pathway and Neuroactive ligand-receptor interaction. Three genes (CSF2RA, CRLF2, IL3RA) in the JAK-STAT signaling pathway are located in the pseudoautosomal region PAR1 and one gene (IL9R) is in the PAR2 region. The detected genes in the PAR1 region are usually reported to be XCI escape genes72; however, the XCI status of CRLF2 has not been thoroughly studied and classified. Although this gene is involved in diseases such as leukemia and autoimmune disease (MalaCards Version 5.23), there is limited research on its methylation level. Notably, the JAK-STAT pathway is implicated in various physiological and pathological processes80, including autoimmune diseases that predominantly affect women6. Three genes in the other enriched pathway, Neuroactive ligand-receptor interaction, namely GRPR81, P2RY882, and P2RY1083, are also involved in autoimmune diseases. Beyond these genes, numerous other genes containing sex-specific CpG sites may escape or variably escape from XCI and need further investigation, especially the miRNA genes, whose XCI status has not been systematically studied79,84.
Here, we provide a list of all genes with patterns A and B in the promoter and 5’UTR regions for future studies (Supplementary Data 22). Our results demonstrate that cfDNA methylation patterns likely reflect XCI status, and provide epigenetic evidence to support conventional understandings. The non-invasive test of cfDNA methylome may be useful for XCI status analysis for future disease studies.
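For readers unfamiliar with the one-tailed hypergeometric test used for the KEGG enrichment in Fig. 4d, e, the unadjusted P value can be computed as in the sketch below. It is illustrative only and does not reproduce the enrichment tool used in this study: the function name `hypergeom_enrichment_p`, its argument names, and the toy counts are hypothetical, and the choice of background gene set is an assumption left to the caller.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-tailed hypergeometric P value, P(X >= k).

    N: number of genes in the background set.
    K: number of background genes annotated to the pathway.
    n: number of genes in the query list (e.g. pattern A promoter genes).
    k: number of query genes annotated to the pathway.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., min(n, K) pathway genes.
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


# Toy counts for illustration only: 10 background genes, 4 in the pathway,
# 3 query genes of which 2 fall in the pathway.
p_value = hypergeom_enrichment_p(k=2, n=3, K=4, N=10)
```

This returns the raw P value only; a multiple-testing adjustment across pathways would be applied separately to obtain adjusted P values such as those reported in Supplementary Data 20, 21.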

Characterization of cfDNA tissues of origin in generally healthy individuals

We perform cfDNA tissues-of-origin profiling for 98 samples using two deconvolution methods based on the DNA methylation atlases of human tissues and cell types developed by Moss et al.17 and Loyfer et al.18. Blood cells, such as granulocytes, erythroid progenitors, monocytes/macrophages, and NK cells, are the dominant cfDNA origins (Fig. 5a, b, Supplementary Fig. 8a, b, Supplementary Data 23), which is consistent with previous studies17,18. There are some differences between the results of the two deconvolution methods. We therefore focus on the generally consistent results for the blood cells and tissues that are verified by both methods.

Fig. 5: Cell-type origin analysis of the plasma cfDNA based on the methylation profile (deconvolution method: Loyfer et al.18).

a Cell-type composition of plasma cfDNA for each individual (n = 98). b Cellular contributors to cfDNA (median value of 98 samples). c Box plot comparing the hepatocyte-derived cfDNA proportions of females (rose red) and males (blue) in different age groups. The numbers of individual samples in each group are shown in Fig. 1a. Box plots show median ± interquartile range (IQR) and 1.5 IQR ranges (whiskers). The P values were calculated by the Wilcoxon rank-sum test (two-tailed). d–g Spearman’s correlation between the hepatocyte-derived cfDNA proportion and ALT, AST, GGT, and HDLC, respectively. h–l The rank of samples based on cfDNA derived from monocytes/macrophages, granulocytes, erythroid progenitor cells, megakaryocytes, and hepatocytes. The special symbols triangle (Δ), square (□), hollow circle (○), solid circle (●), and star (✩) in (a) represent the corresponding participants in (h–l).

We find that the increase in the relative proportion of granulocyte-derived cfDNA with age is significant only in females, not in males (Supplementary Fig. 9a–d, Supplementary Data 23). This phenomenon may be associated with the modulatory effects of estrogen on neutrophil apoptosis85, as estrogen declines with age in women. Using the deconvolution method developed by Moss et al.17, we find that females have a higher relative proportion of monocyte/macrophage-derived cfDNA than males in the relatively young groups (age: 20–30 and 31–40), but not in the older groups (age: 41–50, 51–60, and >60) (Supplementary Fig. 10a, Supplementary Data 23). A significant decrease of monocyte/macrophage-derived cfDNA with age is also found only in the female groups (Supplementary Fig. 10b, c, Supplementary Data 23). The observed variations in monocyte/macrophage-derived cfDNA are probably associated with the sex difference in monocyte counts86 and monocyte cytotoxic activity87. However, these findings about monocyte/macrophage-derived cfDNA are not supported by the method developed by Loyfer et al.18. Therefore, further investigation is needed to reach a solid conclusion.

In addition to the blood-cell-derived cfDNA, the hepatocyte-derived cfDNA proportion is the highest among all organs. Age- and sex-related variations in cfDNA origins are found in hepatocytes (Fig. 5c, Supplementary Fig. 8c, Supplementary Data 23). Compared with relatively young people, hepatocyte-derived cfDNA is higher in older males, but not in older females. A significant sex difference (P < 0.05) is found in the old group (age: 51–60), but not in the younger groups (age: 20–30, 31–40, and 41–50). The prevalence of chronic liver diseases, such as metabolic dysfunction-associated steatotic liver disease (MASLD) and hepatocellular carcinoma, increases with age, especially in those above age 5088, and is much higher in males than in females89. Cell death and tissue injuries in old males may contribute to the higher level of hepatocyte-derived cfDNA.

Next, we compare the relative proportion of hepatocyte-derived cfDNA with the blood biochemical test results. We find that the hepatocyte-derived cfDNA shows positive correlations with the levels of alanine aminotransferase (ALT), aspartate aminotransferase (AST), and gamma-glutamyl transpeptidase (GGT) (Fig. 5d–f, Supplementary Fig. 8d–f, Supplementary Data 24), which is consistent with the positive correlations found in COVID-19 patients19. Moreover, the hepatocyte-derived cfDNA shows a negative correlation with high-density lipoprotein cholesterol (HDLC) (Fig. 5g, Supplementary Fig. 8g), an important biomarker related to liver function. Notably, our research is based on generally healthy participants rather than patients. These results indicate that cfDNA may be a promising and sensitive biomarker in the evaluation of liver health, in both patients and generally healthy individuals.
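The rank-based correlation used for these comparisons (Spearman’s correlation, Fig. 5d–g) can be illustrated with a short, dependency-free sketch. It is not the analysis code of this study: the function names `average_ranks` and `spearman_rho` are hypothetical, the values are invented for illustration, and no P value is computed here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Invented values for illustration only (not data from this study):
# hepatocyte-derived cfDNA proportion and ALT level (U/L) in four samples.
hepatocyte_fraction = [0.01, 0.02, 0.03, 0.05]
alt_level = [12.0, 18.0, 25.0, 70.0]
rho = spearman_rho(hepatocyte_fraction, alt_level)
```

Because the statistic depends only on ranks, a single extreme value such as a strongly elevated enzyme level cannot dominate the estimate the way it could with a Pearson correlation on raw values.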

The plasma cfDNA tissue origins exhibit some inter-individual variation. Although the participants are generally healthy, some of them may still have health problems or non-severe diseases, which may lead to abnormal deviations in cfDNA origins. For example, some participants show abnormally high proportions of cfDNA derived from monocytes/macrophages (EB069), granulocytes (EB080), erythroid progenitors (EB005), megakaryocytes (EB005), and hepatocytes (EB071) (Fig. 5a, h–l, Supplementary Fig. 8a, h–k). To explore the possible reasons for these outliers, we investigate their physical examination and questionnaire information.

We find that EB069 (Fig. 5h, Supplementary Fig. 8h) was the only participant who had a history of gout and was undergoing gout-specific medication at the time of sampling. As monocytes are known to be involved in the inflammatory processes in gout pathology90, this participant may exhibit enhanced monocyte-mediated immune responses, which would increase monocyte-derived cfDNA. For another participant, EB080 (Fig. 5i, Supplementary Fig. 8i), the chest CT scan revealed chronic pulmonary inflammatory lesions. In addition, this participant had a history of hyperlipidemia and hypertension, which have also been widely recognized to trigger inflammatory pathways, resulting in heightened neutrophil production and mobilization91,92. Participant EB005 (Fig. 5j, k, Supplementary Fig. 8j) was diagnosed with thrombocytosis and is undergoing treatment. Thrombocytosis is characterized by the overproduction of platelets by megakaryocytes. The clonal expansion of hematopoietic stem cells in thrombocytosis may result in increased production of hematopoietic cell lineages93, such as erythroid progenitors. However, overproduction of erythroid progenitors is not typically a prominent feature of thrombocytosis, and more studies are needed to understand this phenomenon. Participant EB071 (Fig. 5l, Supplementary Fig. 8k), who has the highest level of hepatocyte-derived cfDNA, has elevated liver enzyme levels, with ALT at 70.9 U/L and AST at 76.5 U/L, both exceeding the normal range (0–40 U/L).

Discussion

Our research demonstrates the great value of whole-genome studies of DNA methylation. Only 7% of the age- and sex-associated CpG sites on the autosomes identified in our study are included in the 450 K/850 K arrays. To the best of our knowledge, many of the CpGs are newly discovered to be age- and sex-related at the DNA methylation level, although many related genes and pathways have been shown to be age- or sex-related through other detection methods, such as transcriptome and proteome analysis. For FIGN, the gene with the largest number of age-associated CpGs, which also contains 6 age- and sex-associated CpGs, previous research has reported links to aging and sex. FIGN shows sex-specific deviations in centenarians with decelerated aging25. A methylated genomic unit (DMU) specific to long-lived men in an intergenic region near FIGN has also been discovered94. Its expression level is relatively high in the ovary, tibial nerve, and artery (GTEx Analysis Release V10), and this gene is also associated with diseases such as polycystic ovary syndrome, Parkinson’s disease, and pulmonary hypertension (MalaCards Version 5.23). Our analysis provides an epigenetic perspective for understanding some sex-biased and/or aging-related diseases. Moreover, we discover that many lncRNA genes contain a large number of age- and sex-associated CpGs. The regulation of lncRNA gene expression by DNA methylation may affect the downstream regulation of lncRNA-targeted genes67,95, which points to a more complex regulatory mechanism.

Plasma cfDNA is a very special and valuable sample material for health and disease studies, for it provides a non-invasive approach to measure DNA methylation alterations in various tissues and cell types in the human body16,17. For example, early cancer screening and diagnosis are studied by detecting the tumor-derived cfDNA and localizing potential tumors16. CfDNA methylation has also been explored to evaluate tissue (e.g., neutrophils, adipocytes, heart, lung, liver, and kidney) injuries in COVID-1919,96, and many other diseases97,98,99. These results improve our understanding of cfDNA methylation signatures for identifying tissue-specific injuries and systemic pathological conditions.

In the past decade, DNA methylation patterns have been employed to measure biological age accurately100. Like other classical epigenetic clocks, we believe cfDNA methylation data can also be trained to estimate age acceleration and predict aging-associated diseases and mortality risk. In addition to the hematopoietic cell types, cfDNA carries aging signals from tissues and organs. These aging and tissue injury signatures may enable risk stratifications for the general population, facilitating personalized interventions (e.g., lifestyle modification, clinical therapies) to mitigate disease progression2.

This study has several limitations. For example, ethnic differences in human DNA methylation have been widely reported, while this study only focused on the Chinese population. The sample size in this study is relatively small; expanding the participants to a wider cohort will enable a more comprehensive understanding of aging, sex-biased, and disease-related methylation profiles. Moreover, the selection of CpGs and the predictive accuracy of the epigenetic clock can be further refined by using other modeling algorithms, and its application in disease prediction and health monitoring also needs further exploration. We anticipate that future cfDNA methylation studies will provide more thorough insight into methylation variations and promote the practical application of cfDNA signatures as biomarkers.

Conclusion

In the present study, we use cell-free DNA (cfDNA) whole-genome bisulfite sequencing (WGBS, ~30 X) to comprehensively investigate the epigenetic signatures of methylation that correlate with age and sex. Our analysis reveals 3047 CpGs and 1587 genes on autosomes that exhibit significant associations with age. We provide a list of genes with methylation alterations, including IFT80 and RILPL1, which are related to aging diseases. Based on the age-associated CpGs, we develop a relatively accurate methylation epigenetic clock (R = 0.91, MAE = 3.74 years) utilizing 125 CpG sites, thereby expanding the research on epigenetic clocks by using cfDNA. Additionally, we identify 1053 sex-associated CpG sites and 324 genes on autosomes. The sex-associated genes are relevant to sex-biased pathways and diseases, including those of neural functions, psychiatric disorders101, and diabetes102. We demonstrate that cfDNA methylation patterns on the X chromosome could also indicate XCI status, and that the XCI escape genes involved in modulating immune responses could be found through the analysis of methylation patterns. Furthermore, our study discovers age- and sex-associated cfDNA features in the tissues of origin, including the relative proportions derived from granulocytes, monocytes/macrophages, and hepatocytes. Through the cfDNA profiling of the general population, we detect four samples that deviate from the others. They have abnormally high relative proportions of cfDNA derived from certain cell types, which reflect their health problems. Although this analysis needs a more comprehensive health data survey with larger sample sizes for further validation, our findings highlight the importance of age and sex in influencing cfDNA characteristics and display the great potential of cfDNA methylation as a biomarker in clinical applications.