Abstract
Cell-free DNA (cfDNA) consists of degraded DNA fragments released into body fluids. The genetic and pathological information it carries makes it useful for prenatal testing and early tumor detection. However, the mechanisms behind cfDNA biology are largely unknown. In this study, for the first time, we conduct a genome-wide association study (GWAS) to explore the genetic basis of cfDNA end motif frequencies, termed cfGWAS, in 28,016 pregnant women. We identify 15 study-wide significant loci, including the well-known cfDNA-related genes DFFB and DNASE1L3, as well as novel genes potentially involved in cfDNA biology, such as PANX1 and DNASE1L1. The findings are further verified through three independent GWAS studies and experimental validation in knockout mice and cell lines. Subsequent analyses reveal strong causal relationships of leukocytes, especially neutrophils, with cfDNA features. In summary, we introduce the cfGWAS, revealing the genetic basis of cfDNA biology on a genome-wide scale. Novel knowledge uncovered by this study promises to revolutionize liquid biopsy technology and lead to potential new drugs targeting certain diseases. Given that millions of cfDNA whole-genome sequencing datasets have been generated from clinical testing, the potential of this paradigm is enormous.
Introduction
Cell-free DNA (cfDNA) consists of DNA fragments released from cells, often through processes such as apoptosis, necrosis, active secretion, and the specialized cell death process known as NETosis1,2. These DNA fragments circulate freely in bodily fluids such as blood, saliva, and urine, reflecting the biological status of cells or tissues. Analysis of cfDNA has extensive applications in non-invasive prenatal testing (NIPT) for detecting fetal abnormalities3. It has also emerged as a valuable biomarker for early cancer screening and monitoring4,5, as well as for detecting organ rejection in transplantation6. Abnormalities in cfDNA have been observed in autoimmune diseases such as systemic lupus erythematosus (SLE)7. Additionally, cfDNA shows promise as a tool for diagnosing and managing infectious diseases8,9.
Commonly used characteristics of cfDNA include concentration, fragment length, end motifs, jagged ends, and nucleosome footprints10. In healthy individuals, cfDNA concentration is low, but it significantly increases in conditions such as cancer, inflammation, and tissue damage11,12,13,14. Therefore, measuring cfDNA concentration can help with disease prediction and monitoring. The length of plasma cfDNA fragments typically ranges from 50 to 600 base pairs, though both shorter and longer fragments have also been observed15,16, and the distribution may vary with individual differences, physiological states, and disease conditions10. Researchers have analyzed fragment lengths to capture disease signals, such as aberrant shortening of cfDNA in cancer patients17,18. End motifs refer to the nucleotide composition at the 5’ end of cfDNA, generated during cfDNA digestion. Different cfDNA end motifs reflect nuclease activity, pathophysiological processes, and tissue-specific DNA fragmentation patterns. Specifically, DFFB, DNASE1L3, and DNASE1 preferentially generate A-end, C-end, and T-end motifs, respectively11. Recent studies have highlighted cfDNA end motifs as emerging biomarkers for oncology and transplantation monitoring due to their frequency shifts under pathological conditions and their tissue-specific profiles19.
These cell-free DNA features arise from the degradation of DNA both inside and outside of cells, and some may correlate due to shared biological processes10,20,21. Understanding the genes responsible for cell-free DNA fragmentation and their functions is essential for elucidating the biological mechanisms that influence cfDNA fragmentomic profiles in both healthy and disease states. Currently, there is a lack of effective methods for identifying these genes, which has restricted previous studies to examining those associated with cell death and nuclease activity. Nucleases are the main actors in DNA degradation. Some other enzymes involved in NETosis, such as neutrophil elastase and myeloperoxidase, also contribute to cfDNA release22. Earlier research has focused on a limited number of genes through targeted analysis23,24,25,26. To date, only three genes, DFFB, DNASE1, and DNASE1L3, all of which encode nucleases, have been confirmed to impact cfDNA digestion and fragment characteristics27. Studies on nuclease-deficient mice have revealed their role in cfDNA degradation and in shaping fragmentomics11,28,29. It has been demonstrated that cfDNA is first generated intracellularly by DFFB, DNASE1L3, and other nucleases, followed by extracellular fragmentation by circulating DNASE1L3 and DNASE130. Despite these findings, the genome-wide genetic basis and additional key genes involved in regulating cfDNA formation remain unclear. Identifying more genes involved in the cfDNA fragmentation process could significantly enhance our understanding of cfDNA dynamics and their roles in various physiological and pathological conditions. Expanding the genetic landscape of cfDNA fragmentation will provide deeper insights into cfDNA’s functions and its potential applications in disease diagnosis and monitoring.
Genome-wide association study (GWAS) is a research method used to identify genetic loci that are statistically associated with specific traits or diseases31. Over the past two decades, GWAS have successfully identified hundreds of thousands of variants linked to numerous human traits and diseases, including height, weight, cardiovascular diseases, tumors, and autoimmune diseases32,33,34. As the most comprehensive repository of genotype-trait associations, by May 2024, the GWAS catalog contained nearly 7000 publications and over 600,000 association signals, serving as a valuable resource for exploring the genetic background of phenotypes35.
In previous works, our group has proven that with appropriate algorithms and methodology, cfDNA could be utilized as a resource for high-quality genetic analysis, including variant detection, allele frequency estimation, genetic structure analysis, and GWAS36,37,38,39,40,41. In this study, we leveraged genetic and phenotypic data from 28,016 Chinese pregnant women to conduct the first genome-wide association study on cfDNA motif frequencies, termed cfGWAS (short for cfDNA-GWAS). We used the same copy of whole-genome-sequencing data from cfDNA to extract both genotype information and cell-free DNA features, identifying genetic loci associated with these cfDNA molecular characteristics. A comprehensive overview figure has been provided to illustrate the overall design of this work (Fig. 1).
A total of 28,016 pregnant women were included in this study, with clinical phenotypes from prenatal care and sequencing data from non-invasive prenatal testing (NIPT) collected. Fragmentation characteristics of cell-free DNA (cfDNA) were extracted from the sequencing data, and after genotype imputation, genotype data were obtained for genome-wide association studies (GWAS). A total of 104 phenotypes and 256 motifs were subjected to GWAS as phenotypic variables, discovering the novel associations of PANX1 and DNASE1L1 genes with the cfDNA end motifs. Experiments with gene knockout (KO) mice and cell lines were conducted to validate these associations. Additionally, a series of post-GWAS analyses was performed by integrating the one-sample 104 phenotypes GWAS results, reported associated loci from the GWAS catalog, and diseases reported in the OMIM database, to explore the biological mechanisms, causal relationship, and gene pleiotropy of cfDNA fragmentation characteristics. The icon of “Public databases” is the official icon of the GWAS catalog database. GWAS Catalog logo is available for re-use under CC-BY-4.0, https://creativecommons.org/licenses/by/4.0/. Created in BioRender. Su, S. (2025) https://BioRender.com/m1tkhdf.
Key findings from our study include the identification of 15 study-wide significant loci linked to 176 cell-free DNA end motifs, the majority of which were novel associations to cfDNA features, with rediscovery of well-known cfDNA-related genes such as DFFB and DNASE1L3, and the discovery of novel genes like PANX1 and DNASE1L1. These results were validated through independent replication studies and experimental validation in knockout mice and cell lines. Furthermore, our one-sample integrative analysis revealed strong genetic correlations between cfDNA features and phenotypes, such as body mass index (BMI), weight, white blood cell count, and neutrophil count. Given the widespread clinical use of cfDNA whole-genome sequencing, our work underscores the paradigm-shifting potential of cfGWAS in medical research and practice.
Results
Study participants
Participants were recruited from Wuhan Children’s Hospital during routine pregnancy examinations between 2017 and 2020. The only inclusion criterion was the availability of NIPT genotype data. We excluded pregnant women carrying fetuses with trisomies and those with a sequencing depth below 0.05× or a mapping rate under 90%, resulting in a final cohort of 38,668 individuals. All samples were used for genotype imputation to maximize accuracy. Please note that not all 38,668 pregnant women had complete phenotype data, such as age and NIPT test timing. After excluding samples lacking this information, 28,016 samples remained for further analysis. For these 28,016 samples, we provided detailed information, including year of pregnancy, maternal age, gestational week at the time of the NIPT test, type of blood tube used, sex of newborn, and fetal fraction (Supplementary Fig. 1A–F). Approximately 40% of the samples were collected in 2020, with the remainder collected between 2017 and 2019. The average maternal age was 29 years, and the average gestational age at the time of NIPT testing was 16.16 weeks. The male-to-female newborn ratio was 1.10, and the average fetal fraction was 9.8%, consistent with previous reports of 10%42,43.
Genotype and phenotype data
Notably, starting in December 2018, an upgraded NIPT product led to increased sequencing depth. After excluding 1719 samples with unknown collection dates, there were 10,795 pregnant women with an average sequencing depth of 0.1049× before December 2018, and there were 26,154 participants with an average depth of 0.1644× afterward, resulting in an overall average depth of 0.1525× (Supplementary Fig. 1G). The sequencing read length remained consistent at 35 base pairs (bp).
Following STITCH imputation (see the “Methods” section), we obtained a total of 8,134,302 single-nucleotide polymorphisms (SNPs). After filtering out SNPs with an imputation info score < 0.4, a Hardy–Weinberg equilibrium (HWE) p-value < 1e-6, and a minor allele frequency (MAF) < 0.05, 2,897,824 high-quality SNPs remained. To assess imputation accuracy, we calculated Pearson’s correlation between the imputed genotype dosages and the original high-depth genotypes from 30 Chinese samples in the 1000 Genomes Project (1KGP). On average, the imputation accuracy (R) was 90.46%, with an R² of 82.79% (Supplementary Fig. 1H). Additional imputation evaluation results, including analyses using other datasets and SNP-level performance, are detailed in our previous paper36,41. It is notable that the equivalent sample size for GWAS is the product of the true sample size and the imputation accuracy R² value.
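The SNP filters and the equivalent-sample-size rule described above can be illustrated with a minimal sketch. The `Snp` record, its field names, and the example variant identifiers are hypothetical and are not taken from the study pipeline; only the three cut-offs, the sample size of 28,016, and the mean R² of 82.79% come from the text.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical per-SNP quality record (field names are illustrative)."""
    rsid: str
    info: float   # imputation info score
    hwe_p: float  # Hardy-Weinberg equilibrium p-value
    maf: float    # minor allele frequency


def passes_qc(snp, min_info=0.4, min_hwe_p=1e-6, min_maf=0.05):
    """Keep a SNP only if it is not removed by any of the three stated filters."""
    return snp.info >= min_info and snp.hwe_p >= min_hwe_p and snp.maf >= min_maf


def effective_sample_size(n_samples, imputation_r2):
    """Equivalent GWAS sample size: true sample size times imputation R^2."""
    return n_samples * imputation_r2


# Toy records with made-up identifiers and values.
snps = [
    Snp("rs_example_1", info=0.92, hwe_p=0.31, maf=0.21),   # passes all filters
    Snp("rs_example_2", info=0.35, hwe_p=0.50, maf=0.10),   # fails info score
    Snp("rs_example_3", info=0.80, hwe_p=1e-9, maf=0.30),   # fails HWE
    Snp("rs_example_4", info=0.75, hwe_p=0.20, maf=0.01),   # fails MAF
]
kept = [s.rsid for s in snps if passes_qc(s)]

# Using the reported cohort size and mean imputation R^2.
n_eff = effective_sample_size(28016, 0.8279)
```

Under these figures, the equivalent sample size is roughly 23,200, which is the sense in which imputation error reduces statistical power relative to directly genotyped data.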
Additionally, we calculated the Pearson’s correlation (R²) between the alternate allele frequencies of these high-quality SNPs in our dataset and those in the East Asian samples from the 1000 Genomes Project (Supplementary Fig. 1I). An R² of 0.985 demonstrates a high concordance between our imputed variants and those from the 1KGP.
The phenotypes of interest are cell-free DNA molecular features, specifically the frequencies of 4-mer end motifs. These motifs represent the first four nucleotides at the 5’ end of each plasma DNA strand, leading to a total of 256 distinct phenotypes. We presented the distributions of the 256 motif frequencies using bar plots and box plots (Supplementary Fig. 2). Overall, the distributions of motif frequencies are consistent with previously reported patterns11. For example, the motif CCCA exhibits the highest frequency, C-end motifs tend to have higher average frequencies than others, and motifs containing CG dinucleotides generally show lower frequencies.
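As an illustration of how the 256 phenotypes can be derived, the sketch below counts the first four nucleotides of each fragment and converts the counts into frequencies. It assumes that each input string is already a fragment sequence written 5’ to 3’; strand handling, alignment filtering, and the function name are assumptions and do not describe the authors' actual extraction code.

```python
from collections import Counter
from itertools import product

# All 4^4 = 256 possible 4-mer end motifs over A, C, G, T.
ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]


def end_motif_frequencies(fragments):
    """Return the frequency of each 4-mer 5' end motif across fragments.

    Fragments shorter than 4 nt or with non-ACGT bases in the first
    four positions are skipped.
    """
    counts = Counter()
    for seq in fragments:
        motif = seq[:4].upper()
        if len(motif) == 4 and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {m: 0.0 for m in ALL_MOTIFS}
    return {m: counts[m] / total for m in ALL_MOTIFS}


# Toy input: three usable fragments and one with ambiguous bases.
freqs = end_motif_frequencies(["CCCAGTTA", "CCCATTGA", "ACGTAACC", "NNNNACGT"])
```

Each of the 256 resulting frequencies per sample is then treated as one quantitative trait in the association analysis.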
GWAS identified 15 highly credible loci
In the GWAS analysis, we included maternal age, gestational week at NIPT screening, sequencing depth, and the first 10 principal components (PCs) of the genotype data as covariates. Since not all 38,668 participants had complete covariate information, the adjusted sample size was reduced to 28,016. At the genome-wide significance threshold of 5e−08, we identified a total of 507 significant motif-locus associations for the 256 end motif frequencies, involving 32 unique loci and 208 end motifs (Supplementary Data 1). At the stricter study-wide significance threshold of 1.95e−10 (=5e−8/256), there were 345 significant associations, involving 15 loci and 176 motifs (Supplementary Data 2). We regarded these 15 loci as a group of highly credible signals and used them in the follow-up analyses that do not rely on the complete GWAS summary statistics. To assess the reliability of our GWAS results, we drew quantile–quantile plots (QQ-plots) and calculated the genomic inflation factor (GIF) for each motif (Supplementary Fig. 3). The GIF values were consistently around 1 across all 256 motifs, indicating no substantial inflation and suggesting well-conducted GWAS analyses.
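The study-wide threshold and the genomic inflation factor can be reproduced with a short sketch. The GIF is computed here in the conventional way, as the median 1-df chi-square statistic implied by the p-values divided by its expected median under the null; the text does not state the exact formula used, so this is an assumption, as are the function names.

```python
from statistics import NormalDist, median


def study_wide_threshold(genome_wide=5e-8, n_traits=256):
    """Bonferroni-style correction of the genome-wide threshold for 256 motifs."""
    return genome_wide / n_traits


def genomic_inflation_factor(p_values):
    """Median observed 1-df chi-square divided by the expected null median.

    Each two-sided p-value is converted to z^2 via the normal quantile
    function; p-values must lie in (0, 1].
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.25) ** 2  # about 0.455 for 1 df
    return median(chi2) / expected_median


threshold = study_wide_threshold()

# Evenly spaced p-values mimic a well-calibrated null distribution.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
gif = genomic_inflation_factor(null_p)
```

A GIF close to 1, as in this null example, is what the QQ-plots in Supplementary Fig. 3 are reported to show across the 256 motifs.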
To provide informative visualization of our GWAS results, we synthesized the summary statistics for the 256 motifs by selecting the minimum p-value for each SNP across all motifs, which was used to generate a Manhattan plot (Fig. 2A). The most significant locus identified is PANX1, with the lead SNP rs1138800 (p-value = 7.67e−298, beta = 3.65e−01, se = 9.76e−03). Following PANX1 is DFFB, with lead SNP rs34725762 (p-value = 1.82e−286, beta = 4.53e−01, se = 1.24e−02). The subsequent two most significant loci were DNASE1L3 (rs12633655, p-value = 5.13e-30, beta = −1.12e-01, se = 9.82e−03) and DNASE1L1 (rs2283762, p-value = 8.44e−25, beta = 1.95e−01, se = 1.89e−02). For more details of each signal, we provided the regional plots (Supplementary Fig. 4). Other loci reaching the study-wide significant threshold include SMIM24, PSMD3, FCHO2, PADI4, ABO, and so on. For each locus, we counted the number of its associated motifs (Fig. 2B). The top five loci with the largest number of associated motifs were PANX1 (136), DFFB (123), PSMD3 (21), DNASE1L3 (17), and FCHO2 (9). From the perspective of end motifs, we summed up the number of significant loci for motifs with the same starting nucleotide. The counts for ‘A’, ‘T’, ‘C’, and ‘G’ were 90, 97, 80, and 78, respectively.
A A comprehensive Manhattan plot integrating the GWAS results for 256 motifs, each dot represents the minimum p-value of a SNP associated with the 256 motifs; several gene names are annotated at the top of significant loci; the association of motifs beginning with the four nucleotide bases A, T, C, and G with these loci is indicated by differently colored diamonds. B A bar plot presents the number of associated motifs that hit on each locus, with the frequency of each start nucleotide A, T, C, and G differing by distinct color. C The Venn diagram depicting the unique and shared study-wide significant loci by different replication studies.
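The summary used for the Manhattan plot, taking the minimum p-value for each SNP across all motifs, can be sketched as follows. The nested-dictionary layout, the function name, and the SNP identifiers are hypothetical; real summary statistics would normally be read from per-motif files.

```python
def min_p_across_motifs(pvals_by_motif):
    """For each SNP, keep the smallest p-value and the motif it came from.

    `pvals_by_motif` maps motif -> {snp_id: p_value}.
    Returns {snp_id: (min_p, motif)}.
    """
    best = {}
    for motif, snp_pvals in pvals_by_motif.items():
        for snp, p in snp_pvals.items():
            if snp not in best or p < best[snp][0]:
                best[snp] = (p, motif)
    return best


# Toy summary statistics for two motifs and two made-up SNPs.
toy = {
    "CCCA": {"snp_a": 1e-12, "snp_b": 0.20},
    "AAAT": {"snp_a": 3e-4, "snp_b": 5e-9},
}
summary = min_p_across_motifs(toy)
```

Keeping the motif alongside the minimum p-value makes it possible to color each locus by the starting nucleotide of its top motif, as in the plot annotation.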
We then focused on highlighting the biological functions of four key loci: PANX1, DFFB, DNASE1L3, and DNASE1L1. The gene PANX1 encodes Pannexin-1, a cell membrane channel protein that mediates intracellular–extracellular molecule exchange and participates in cell signaling, apoptosis, inflammation, and development44,45. The gene DFFB (DNA fragmentation factor subunit beta) encodes a protein essential for apoptosis. Specifically, DFFB is a key component of the DNA fragmentation complex, responsible for cleaving DNA into fragments during apoptosis30. The gene DNASE1L3 encodes deoxyribonuclease 1-like 3, a nuclease that degrades DNA. It is predominantly expressed in the spleen and lymph nodes, and is crucial for clearing debris from apoptotic cells, thereby preventing autoimmune responses triggered by self-DNA and maintaining tissue homeostasis30. Similarly, the gene DNASE1L1 encodes deoxyribonuclease 1-like 1, another member of the deoxyribonuclease I family. However, no established links exist between DNASE1L1 and diseases caused by deficient chromatin clearance during apoptosis or necrosis in vivo46.
While DFFB and DNASE1L3 are established apoptotic nucleases crucial for cfDNA fragmentation30, we turned to the novel gene PANX1 for further investigation. To this end, we conducted GWAS replication in three independent cohorts and performed experimental validation using both knockout models and knockout cell lines.
Information for GWAS replication studies
Three independent replication cohorts were used to validate the study-wide associated loci identified in this study. In the Chinese NIPT replication cohort, we recruited 25,578 pregnant women who underwent NIPT tests at Wuhan Children’s Hospital between 2020 and 2021. We provided information on the recruitment year and age distribution of these samples (Supplementary Fig. 5A and B). The Dutch replication cohort was obtained from a recent GWAS of NIPT samples47. That study performed GWAS analysis for six cfDNA features, including diversity of bincount, fetal concentration, mitochondrial DNA concentration, fragment size diversity, motif diversity score, and total cfDNA concentration. The sample sizes in GWAS analysis for each phenotype are provided in Supplementary Fig. 5C. To validate our GWAS findings in the general population, rather than in pregnant women, we introduced a natural population replication cohort of 442 healthy individuals, consisting of 206 males and 261 non-pregnant females, with an average age of 34.8 (Supplementary Fig. 5D and E).
Replication on integrated study-wide associated loci
We first examined the 15 study-wide significant loci identified by integrating GWAS results across all 256 motifs. At a replication significance threshold of 1e−5, 14 (93.3%), 14 (93.3%), and 3 (20.0%) loci were replicated in the Chinese NIPT, Dutch NIPT, and natural population cohorts, respectively (Fig. 2C, Supplementary Data 2). For example, the loci replicated in the Chinese NIPT cohort included PANX1, DFFB, DNASE1L3, and MSR1; those replicated in the Dutch NIPT cohort included PANX1, DFFB, DNASE1L3, and DNASE1L1; and the three loci replicated in the natural population cohort were PANX1, DFFB, and PADI4. The replication rates were notably higher in the pregnant cohorts than in the non-pregnant cohort, likely due to the larger sample sizes of the former. These findings suggest that the identified genetic associations are not limited to the pregnant population but are also present in the general population.
Overall, three loci (PANX1, DFFB, and PADI4) were consistently replicated across all three studies, indicating shared genetic associations between pregnant and non-pregnant individuals. Additionally, 13 loci (86.7%) were replicated by at least two studies, while all loci were replicated by at least one study. Ten loci—including ABO, MSR1, and FCHO2—were replicated by two NIPT studies but not in the natural population, suggesting they represent pregnancy-specific genetic associations. Notably, FCHO2 is highly expressed in female reproductive organs, including the uterus, vagina, and cervix, as shown in Supplementary Fig. 6 from the GTEx database (https://gtexportal.org/home/).
Replication on individual motif-locus associations
We assessed the replication status of the 345 study-wide significant motif-locus association pairs using the GWAS summary statistics of the corresponding motifs in the Chinese NIPT and natural population replication study. Overall, 256 associations (74.2%) were replicated in the NIPT study and 44 (12.8%) in the natural population study, with a total of 259 unique associations replicated in at least one cohort (Supplementary Data 3, Supplementary Fig. 7). For the 256 and 44 replicated motif-locus associations in the Chinese NIPT and natural population cohorts, respectively, we further examined whether their top SNPs showed consistent effect directions between the discovery and replication studies (Supplementary Fig. 8, Supplementary Data 3). The scatter plots demonstrated that all replicated motif-locus associations (solid dots) exhibited consistent effect directions across the corresponding studies. For each motif, we calculated Pearson’s correlation coefficients (R) between genetic effects in the discovery study and replication studies (Supplementary Fig. 9). Among the 176 significant motifs, 159 showed R > 0.5, and 143 showed R > 0.75 in the NIPT cohort, while 132 and 112 met these thresholds in the natural population cohort, respectively. The NIPT cohort exhibited a higher degree of consistency in both the direction and magnitude of genetic effects, whereas the natural population cohort showed lower concordance, likely due to its smaller sample size and the non-pregnant samples.
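The two consistency checks described above, agreement in effect direction and Pearson's correlation between discovery and replication effect sizes, can be sketched as below. The function names and the toy beta values are illustrative only and are not data from the study.

```python
from math import sqrt


def same_direction_fraction(discovery_betas, replication_betas):
    """Fraction of SNPs whose effect has the same sign in both studies."""
    pairs = list(zip(discovery_betas, replication_betas))
    agree = sum(1 for d, r in pairs if d * r > 0)
    return agree / len(pairs)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy effect sizes for three made-up SNPs of one motif.
discovery = [0.10, -0.20, 0.30]
replication = [0.05, -0.10, -0.02]
direction_agreement = same_direction_fraction(discovery, replication)
r_perfect = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Applied per motif, a high R indicates that both the sign and the relative magnitude of genetic effects carry over to the replication cohort.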
Of the 345 significant motif-locus associations identified in this study, 341 (98.84%) could be replicated in at least one of the six cfDNA phenotypes in the Dutch cohort (Supplementary Data 3). Notably, the “cleave-site motif diversity” phenotype appeared most frequently in both metrics (Supplementary Fig. 10), suggesting it has the strongest association with cfDNA end motif frequencies among the six features analyzed.
Experimental verification in PANX1 knockout mice and cell lines
To validate the role of PANX1 in cell-free DNA generation and clearance, we conducted knockout (KO) experiments in both mice and cell lines. We compared plasma cfDNA motifs between Panx1 KO mice (n = 6) and wild-type (WT) mice (n = 6), as well as cfDNA motifs in cell culture supernatants from PANX1 KO and WT Jurkat cell lines (5 replicates each) (Methods).
In mice, the motif diversity score (MDS) of plasma cfDNA was significantly lower in Panx1 KO mice (mean: 0.923, range: 0.916–0.932) compared to WT mice (mean: 0.930, range: 0.928–0.935, t(10) = −2.78, p-value = 1.95e−02, mean difference = −0.007, 95% CI = [−0.0128, −0.0014]) (Fig. 3A). This alteration in cfDNA motif profiles after Panx1 knockout supports the gene’s key role in cfDNA fragmentation. We further examined cfDNA motifs associated with PANX1 genetic variants identified in our genome-wide association study. The frequency of motifs positively correlated with PANX1 increased significantly in Panx1 KO mice (mean: 9.71, range: 9.03−10.23) compared to WT mice (mean: 9.11, range: 8.67–9.70, t(10) = 2.35, p-value = 4.03e−02, mean difference = 0.594, 95% CI = [0.0321, 1.1564]) (Fig. 3B). Conversely, the frequency of negatively correlated motifs decreased in Panx1 KO mice (mean: 10.00, range: 8.91–11.31) compared to WT mice (mean: 11.43, range: 10.33–12.35, t(10) = −2.82, p-value = 1.83e−02, mean difference = −1.437, 95% CI = [−2.5737, −0.2999]) (Fig. 3C). These findings suggest the impact of PANX1 activity on cfDNA motif patterns.
A Box plot showing the MDS of cfDNA for plasma samples from wild-type (WT) and Panx1 KO mice. B and C Accumulated frequency of motifs showing significant positive and negative correlations, respectively, with SNPs in the PANX1 gene from GWAS analysis in plasma samples from WT and Panx1 KO mice. D Box plot of the MDS of cfDNA for cell culture supernatant from WT and PANX1 KO cell lines. E and F Accumulated frequency of motifs with significant positive and negative correlations, respectively, with SNPs in the PANX1 gene from GWAS analysis in cell culture supernatants from WT and PANX1 KO cell lines. These box plots are defined as follows: the center line represents the median, the box limits indicate the 25th and 75th percentiles, and the whiskers extend to the minimum and maximum values. Statistical significance, determined by a two-sided t-test, is indicated by asterisks with the following conventions: *p < 0.05, **p < 0.01, and ***p < 0.001. Created in BioRender. Su, S. (2025) https://BioRender.com/m1tkhdf.
Similar trends were observed in the PANX1 KO cell line experiments, with an enhanced alteration in motif patterns. The mean MDS decreased from 0.965 in WT cells to 0.956 in KO cells (t(8) = −17.04, p-value = 1.43e−07, mean difference = −0.009, 95% CI = [−0.0102, −0.0078]) (Fig. 3D). The frequency of motifs positively correlated with PANX1 increased significantly in KO cells (mean: 7.02, range: 6.93–7.22) compared to WT cells (mean: 6.10, range: 6.01–6.20, t(8) = 13.9, p-value = 7.12e−07, mean difference = 0.930, 95% CI = [0.7757, 1.0855]) (Fig. 3E). Conversely, the frequency of negatively correlated motifs decreased in KO cells (mean: 16.26, range: 15.91–16.46) compared to WT cells (mean: 18.09, range: 17.83–18.35, t(8) = −14.2, p-value = 5.90e−07, mean difference = −1.829, 95% CI = [−2.1261, −1.5319]) (Fig. 3F). These results from both in vivo and in vitro experiments consistently demonstrate the significant role of PANX1 in shaping cfDNA fragmentation patterns.
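For readers unfamiliar with the quantities reported in the knockout comparisons, the following sketch shows one common way to compute them. It assumes that the MDS is the Shannon entropy of the 256 motif frequencies normalized by log(256), which is the usual definition in the cfDNA literature but is not spelled out in this text, and that the group comparison is a pooled-variance two-sample t-test, which is consistent with the reported degrees of freedom (for example, t(10) for 6 versus 6 mice). All function names and inputs are illustrative.

```python
from math import log, sqrt
from statistics import mean, variance


def motif_diversity_score(freqs):
    """Normalized Shannon entropy of motif frequencies (assumed MDS definition).

    Returns a value near 1 when all motifs are equally frequent and
    near 0 when a single motif dominates.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * log(p) for p in probs) / log(len(freqs))


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, df


mds_uniform = motif_diversity_score([1.0] * 256)          # all motifs equal
mds_single = motif_diversity_score([1.0] + [0.0] * 255)   # one motif only

# Toy groups (not study data): first argument plays the role of KO, second WT.
t_stat, dof = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With the first group as KO and the second as WT, a negative t statistic corresponds to a lower value in the knockout, matching the sign convention of the mean differences reported for the mouse experiments.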
The fold change of cfDNA concentration in each mouse relative to the mean cfDNA concentration in WT mice was calculated. We observed a significant increase in the fold change of cfDNA concentration in Panx1 KO mice (mean: 1.80, range: 1.30–2.90) compared to WT mice (mean: 1.00, range: 0.69–1.43, t(10) = 3.08, p-value = 1.17e−02, mean difference = 0.805, 95% CI = [0.2225, 1.3882]), suggesting that knockout of the Panx1 gene influenced the release of cfDNA into plasma (Supplementary Fig. 11A). In the cell lines, the difference did not reach statistical significance, but we observed a trend toward increased cfDNA concentration in PANX1 KO cells (mean fold change: 1.10, range: 1.05–1.15) compared to WT cells (mean fold change: 1.00, range: 0.86–1.10, t(8) = 2.10, p-value = 6.89e−02, mean difference = 0.098, 95% CI = [−0.0096, 0.2065]) (Supplementary Fig. 11B). The results of both in vivo and in vitro KO experiments suggest that the PANX1 gene plays a crucial role in the generation of cfDNA, and that knockout of the PANX1 gene leads to an increased release of cfDNA.
cfDNA end motif frequencies are heritable
Heritability measures the proportion of phenotypic variance explained by genotypes, thus quantifying the genetic influence on phenotypes. We calculated the heritability of each motif using linkage disequilibrium (LD) score regression (Supplementary Data 4, Supplementary Fig. 12). Notably, among the 256 motifs, 47 motifs exhibited a heritability exceeding 10%, while 78 motifs fell within the range of 5–10%, suggesting a substantial genetic component in their variation. We also provide the number of genome-wide and study-wide significant loci associated with each motif (Supplementary Data 4, Supplementary Fig. 12). As expected, the heritability is generally positively proportional to the number of associated loci.
Enrichment in hematological cell types
Our heritability partition analysis revealed potentially significant contributions of cell type-specific elements to the heritability of cfDNA end motifs. At a suggestive significance threshold of 1e−2, we identified 323 significant cell-type-motif associations, involving 82 unique cell types and tissues (Supplementary Data 5, Supplementary Table 1, Supplementary Fig. 13A). Interestingly, the most frequently implicated cell types and tissues in these associations were parotid gland (33 motifs), followed by salivary glands (17 motifs), neutrophils (17 motifs), blood (16 motifs), and phagocytes (14 motifs), all of which belong to the digestive and blood/immune categories. The most prominent associations were observed between the CGAG motif and two tissues: the parotid gland (p-value = 2.57e−5) and the salivary glands (p-value = 5.68e−5). Others include CACC with blood (p-value = 1.12e−4) and GTTA with neutrophils (p-value = 2.27e−4) (Supplementary Fig. 13B). These findings align with the known role of phagocytes (including neutrophils and monocytes) as both sources and clearers of cfDNA during immune responses48. Our analysis identified an enrichment of hematological cell types, particularly immune cells, in the heritability of cfDNA motif features, indicating their active involvement in cfDNA processing. However, the associations of the parotid and salivary glands with cfDNA remain underexplored and lack relevant reports, warranting further in-depth investigation. It should be noted that these results were based on a nominal significance threshold of 0.05 without strict multiple testing correction. Therefore, the findings should be interpreted with caution and regarded as exploratory, providing preliminary indications of potentially associated tissues and cell types.
Pathway-based association analysis
The pathway-based analysis results for all 256 end motifs are provided in Supplementary Data 6. At a suggestive significance threshold of 1e−3, a total of 770 motif-pathway pairs reached significance. Of these, 174, 308, 177, and 111 pairs involved motifs starting with ‘A’, ‘T’, ‘C’, and ‘G’, respectively (Supplementary Fig. 14A). These significant relationships encompassed 79 unique pathways, categorized into groups such as cell death and apoptosis, immune response, and cancer-related pathways, among others (Supplementary Table 2).
Among the identified pathways, the top four with the most significant motifs included the Biocarta SET pathway, leading with 83 associated motifs, followed by the Biocarta Mitochondria pathway (78 motifs), the Biocarta DNA fragment pathway (76 motifs), and Reactome apoptosis-induced DNA fragmentation (50 motifs). Specifically, the Biocarta SET pathway is important in regulating B cells by providing essential survival signals that prevent inappropriate apoptosis. The Biocarta Mitochondria pathway sheds light on how mitochondria contribute to cellular functions like metabolism and apoptosis. Furthermore, both the Biocarta DNA fragment pathway and Reactome apoptosis-induced DNA fragmentation are implicated in DNA breakdown, a pivotal process in apoptosis regulation and cell death.
Moreover, we provided the distribution of pathway-based analysis p-values for all motifs through a box plot, sorted by the median (Supplementary Fig. 14B). Notably, the top pathways are almost consistent with the above observations, including the Biocarta DNA fragment pathway, Biocarta SET pathway, Reactome apoptosis-induced DNA fragmentation, Reactome the NLRP3 inflammasome, and Biocarta Mitochondria pathway. In conclusion, the pathway-based association analysis illuminated the biological processes, such as DNA fragmentation, apoptosis, and the mitochondrial pathway, in governing cfDNA generation and clearance.
Genetic correlation with some pregnancy phenotypes
In a parallel study with the same group of pregnant women, we have performed genetic analysis on their 104 pregnancy phenotypes, including laboratory tests during pregnancy (e.g., hematology, liver function), maternal information (e.g., height, BMI), and neonatal outcomes (e.g., birthweight, birth length)36. We conducted a genetic correlation analysis between the 256 end motifs and the 104 pregnancy phenotypes to assess the overall genetic similarity between each pair (Supplementary Data 7, Fig. 4A). At the strictest Bonferroni-corrected significance threshold of 1.88e−6 (=0.05/(256*104)), we observed a total of 13 significant motif–phenotype pairs, involving maternal BMI and maternal weight. Furthermore, when we relaxed the threshold to a more lenient suggestive level of 1e−3, a total of 138 motif–phenotype pairs showed significant correlations (Supplementary Table 3). To further ensure the reliability of the results, we retained only pregnancy phenotypes with more than five associated motifs, resulting in seven phenotypes (Supplementary Fig. 15): maternal BMI (43 motifs), maternal weight (37 motifs), white blood cell count (13 motifs), neutrophil count (13 motifs), lymphocyte percentage (9 motifs), neutrophil percentage (8 motifs), and uric acid (6 motifs). These genetic correlation results were based on a suggestive significance threshold of 1e−3, not subjected to strict multiple testing correction. Thus, the findings should be interpreted with caution and considered exploratory.
A A heatmap illustrating the pairwise genetic correlation p-values between 104 pregnancy phenotypes and 256 motifs; positive and negative genetic correlations are represented in red and blue, respectively. The motifs have been hierarchically clustered, and pregnancy phenotypes are sorted by category; different phenotype categories are distinguished by color coding on the axes. Several regions with strong genetic correlations are magnified and displayed on the right. B The results of the Mendelian randomization analysis, highlighting 10 phenotypes that exhibit Bonferroni-corrected significant causal relationships with motifs; motifs are categorized into four groups (A, T, C and G) according to the start nucleotide; for each motif group (n = 64), the beta values from the IVW analysis are depicted in box plots on the right side, with the center line, box limits, and whiskers representing the median, 25th/75th percentiles, and 10th/90th percentiles, respectively. The number of significant motifs is illustrated in bar charts on the left side, and the total number of significant motifs in each group is displayed in bar charts at the top right corner. C The results of the colocalization analysis between motifs and pregnancy phenotypes; for each significant locus, a large colored dot indicates that the locus is also present in the GWAS results of the corresponding pregnancy phenotype; the bar plots above the dot matrix display the number of colocalized motifs for each phenotype, with different colors distinguishing motifs from different groups (A, T, C and G); the total number of motifs colocalized with phenotypes for each locus is shown to the right of the dot matrix; the total number of colocalized motifs in each motif group is displayed at the upper left corner of the dot matrix.
Causal effects of hematological cells on cfDNA motifs
We conducted bi-directional Mendelian randomization analyses to explore the potentially causal relationships between pregnancy phenotypes and end motif frequencies. Initially, we treated pregnancy phenotypes as exposures and motifs as outcomes (denoted as P → M) and subsequently reversed the roles (denoted as M → P). At the Bonferroni-corrected significance threshold of 3.55e−6 based on the inverse variance weighted (IVW) method, our P → M analysis identified 168 potentially causal associations involving 10 pregnancy phenotypes, 7 of which belonged to the hematological category; these associations were further confirmed using MRlap, which accounts for sample overlap and other biases (Supplementary Data 8, 9, Fig. 4B). Specifically, 91 out of 168 (54.2%) identified causal relationships were between end motifs and leukocytes, including the white blood cell count (30 causalities), percentage of neutrophils (23), percentage of lymphocytes (19), and neutrophil count (19). In addition, the 57 causal associations between BMI and motifs also constitute a substantial proportion of the results. Motifs starting with ‘A’, ‘T’, ‘C’, and ‘G’ were involved in 36, 67, 48, and 17 of these causal associations, respectively.
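To make the IVW step concrete, the sketch below computes a fixed-effect IVW estimate from per-variant summary statistics using the standard textbook formula. It is an illustration under stated assumptions, not the authors' pipeline: the function name and the toy numbers are hypothetical, and real analyses also involve instrument selection and clumping, which are omitted here.

```python
import math
from statistics import NormalDist


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted (IVW) causal estimate.

    Each list holds one value per genetic instrument. Returns the
    causal effect estimate, its standard error, and a two-sided p-value.
    """
    numerator = sum(bx * by / (se * se)
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx * bx / (se * se)
                      for bx, se in zip(beta_exposure, se_outcome))
    beta = numerator / denominator
    se = math.sqrt(1.0 / denominator)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(beta / se)))
    return beta, se, p_value


# Toy instruments (hypothetical values): outcome effects are half the exposure effects.
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.10, 0.075]
se_y = [0.01, 0.01, 0.01]
ivw_beta, ivw_se, ivw_p = ivw_estimate(bx, by, se_y)
is_significant = ivw_p < 3.55e-6  # Bonferroni threshold used in the text
```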
The relationship between BMI, weight, and cfDNA during pregnancy has been extensively researched. Multiple studies have identified an association between increasing maternal weight and a decrease in cell-free fetal DNA fractions43,49,50,51. Additionally, other studies have shown that total cfDNA levels are elevated in obese pregnant women, primarily due to increased necrosis and apoptosis of adipose tissue52,53,54. The relationship between leukocytes and cfDNA has also been extensively studied and can be explained by cellular apoptosis, infection and inflammatory responses, and DNA clearance mechanisms55,56. Leukocytes, particularly neutrophils, are the major contributors to cfDNA in cancer samples, accounting for ~76% of cfDNA57. Specifically, activated neutrophils undergo NETosis, a specialized form of cell death that forms neutrophil extracellular traps (NETs), significantly increasing cfDNA levels58. Moreover, cfDNA released during exercise mainly originates from extramedullary polymorphonuclear neutrophils, influenced by physical impact, hypoxia, and elevated core body temperature59.
To further ensure the robustness of the 168 causal associations, we performed genetic pleiotropy and heterogeneity tests (Supplementary Data 9). At the significance threshold of 1e−3, none of the causal relationships showed evidence of genetic pleiotropy, with the smallest p-value being 5.86e−3. We additionally provided the fitted regression lines for causal pairs of phenotypes and motifs based on the MR-Egger method (Supplementary Fig. 16). The intercepts of the lines were close to zero, again indicating the absence of pleiotropic effects and supporting the reliability of the causal effects in the MR analysis. Notably, no associations showed heterogeneity (p-value < 1e−3) that might affect the overall findings.
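The MR-Egger intercept mentioned above comes from a weighted regression of outcome effects on exposure effects that, unlike IVW, is not forced through the origin. The following is a minimal sketch of the point estimates only, assuming weights of 1/SE² on the outcome effects and the usual orientation of variants so that exposure effects are positive. The function name and data are hypothetical, and the significance test of the intercept is omitted.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Weighted least-squares fit: beta_outcome = intercept + slope * beta_exposure.

    An intercept near zero suggests no directional pleiotropy.
    Returns (intercept, slope).
    """
    xs, ys, ws = [], [], []
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        sign = -1.0 if bx < 0 else 1.0   # orient each variant so that bx >= 0
        xs.append(sign * bx)
        ys.append(sign * by)
        ws.append(1.0 / (se * se))
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("exposure effects must not all be identical")
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Toy data (hypothetical): outcome = 0.02 + 0.5 * exposure after orientation.
egger_intercept, egger_slope = mr_egger(
    [0.1, 0.2, 0.3, -0.4], [0.07, 0.12, 0.17, -0.22], [0.01] * 4
)
```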
We further compared the effect sizes from MR with those from linear regression analysis for the associations between phenotypes and motifs. For both the IVW and MR-Egger methods, the estimated effect directions for the majority of phenotype–motif pairs were consistent with those from regression analysis (Supplementary Fig. 17). The correlation coefficient between the MR-Egger beta and the regression beta was R = 0.577, and that between the IVW beta and the regression beta was R = 0.558. These results indicate moderate agreement between the MR and regression estimates, while also reflecting the differences between the methods.
To ensure the validity of our P → M MR results, we performed reversed M → P analysis, with the cfDNA motifs as exposures and the phenotypes as outcomes. At the Bonferroni-corrected significance threshold of 3.27e−6, we found no causal effects of motifs on phenotypes, with the smallest p-value being 1.28e−5 (TGCC → red cell distribution width-standard deviation). This phenomenon is consistent with our prior knowledge that cfDNA features are often considered as consequences rather than causes; the occurrences of some biological or physiological processes (e.g., immune response) may result in alteration of hematological tests (e.g., neutrophil count), causing the change of cfDNA molecular signatures (e.g., concentration, motifs).
Our bi-directional MR analysis, which controlled for both genetic pleiotropy and heterogeneity effects, highlighted the causal effects of hematological indicators, especially immune cells, on cfDNA molecular features, aligning with the understanding that most cfDNA originates from leukocytes.
Colocalized signals between cfDNA motifs and pregnancy phenotypes
To identify shared causal genetic effects on both end motifs and pregnancy phenotypes, we conducted GWAS-GWAS colocalization analysis. This analysis revealed 130 colocalized motif-phenotype pairs at a posterior probability of H4 (one common causal variant) >0.75, distributed across five loci: PSMD3 (94 pairs), FCHO2 (12 pairs), ZFPM2 (9 pairs), ABO (9 pairs), and DEFA10P (6 pairs) (Fig. 4C, Supplementary Fig. 18, Supplementary Data 10). Notably, the phenotypes with the highest number of colocalized signals were primarily related to hematological traits (leukocytes, thrombocytes, and erythrocytes) and protein levels (globulin and albumin).
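The source does not state which colocalization software was used. As an illustration of how a posterior probability for H4 can be obtained, the sketch below follows the widely used approximate Bayes factor formulation for two traits measured on the same SNPs in one locus. The prior probabilities (1e−4, 1e−4, 1e−5) and the prior effect standard deviation (0.15) are assumptions taken from common defaults, and all function names and z-scores are hypothetical.

```python
import math


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); requires a > b (true when the locus has >= 2 SNPs)."""
    return a + math.log1p(-math.exp(b - a))


def wakefield_labf(beta, se, prior_sd=0.15):
    """Per-SNP log approximate Bayes factor for a quantitative trait."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 given per-SNP log ABFs for two traits."""
    both = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = log_sum_exp(labf1), log_sum_exp(labf2), log_sum_exp(both)
    log_h = [
        0.0,                                                       # H0: no association
        math.log(p1) + s1,                                         # H1: trait 1 only
        math.log(p2) + s2,                                         # H2: trait 2 only
        math.log(p1) + math.log(p2) + log_diff_exp(s1 + s2, s12),  # H3: distinct variants
        math.log(p12) + s12,                                       # H4: one shared variant
    ]
    norm = log_sum_exp(log_h)
    return [math.exp(h - norm) for h in log_h]


# Toy locus (hypothetical z-scores): both traits peak at the third SNP.
SE = 0.02
motif_labf = [wakefield_labf(z * SE, SE) for z in (0.5, 2.0, 6.0, 2.0, 0.5)]
pheno_labf = [wakefield_labf(z * SE, SE) for z in (0.0, 1.5, 5.5, 1.5, 0.5)]
shared = coloc_posteriors(motif_labf, pheno_labf)
colocalized = shared[4] > 0.75  # PP.H4 cut-off used in the text
```

When the two traits peak at different SNPs, the same function shifts most of the posterior mass to H3 rather than H4.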
The most frequent colocalizations appear at the PSMD3 locus, linking cfDNA motifs and leukocyte traits. The gene PSMD3 (proteasome 26S subunit, non-ATPase 3) plays critical roles in protein degradation, in maintaining cellular health, and in regulating various cellular processes. Silencing of PSMD3 inhibits cell proliferation and induces apoptosis, as evidenced by the significant reduction in breast cancer cell proliferation and colony formation following Psmd3 knockdown60. Previous GWAS studies have extensively reported its associations with leukocyte traits61,62,63,64 and immune diseases (e.g., asthma, allergic disease)65,66.
Additionally, from the perspective of end motifs, T-end exhibited the largest number of colocalized signals (91), followed by C-end (17), A-end (13), and G-end (9). These findings provide evidence for a shared genetic background influencing both cfDNA end motifs and pregnancy phenotypes. The significant enrichment of hematological and protein-related traits in the colocalized signals suggests their close relationship with cfDNA molecular features.
Pleiotropy analyses with the GWAS catalog
From the GWAS catalog database, we extracted traits that exhibited pleiotropy with the study-wide associated loci reported in this study (Supplementary Data 11). We focused on traits with a high proportion of pleiotropic loci (ratio > 4%) as well as those with a large absolute number of pleiotropic loci (count > 5) (Supplementary Fig. 19A). In summary, these primarily included a series of hematological traits. Several platelet-related indicators were highlighted due to their high pleiotropy ratios, including platelet forward scatter and platelet counts at different stages of pregnancy. In addition, a range of red and white blood cell-related traits, such as hemoglobin, neutrophil, and lymphocyte, also showed considerable pleiotropy. The genetic overlap between these leukocyte traits and cfDNA motifs provides further genetic evidence for their strong correlation. Furthermore, notable results also included body measurement traits such as height and BMI, as well as total cholesterol levels and low-density lipoprotein cholesterol levels.
Since we did not have cholesterol data for the pregnancy phenotypes, we did not investigate its association with cfDNA motifs. However, previous studies have reported that stressed individuals with cfDNA levels exceeding 1000 showed elevated cholesterol and LDL cholesterol levels67. Additionally, plasma cfDNA has been proposed as a novel biomarker for monitoring and diagnosing cardiovascular diseases68. A recent work based on cfDNA GWAS also reported pleiotropic effects of identified SNPs on cfDNA molecular features and cardiovascular diseases47.
Pleiotropy analyses with OMIM compendium
Among the 50 mapped genes with study-wide significant SNPs associated with cfDNA end motif frequencies, 15 genes across 10 loci have documented disease associations in the OMIM database (Supplementary Table 4, Supplementary Fig. 19B). These associated diseases are distributed across different organ systems, including nervous system and developmental disorders, blood and immune system disorders, and cardiovascular and circulatory system disorders.
Among these diseases, we observed that some are linked to cfDNA characteristics, including systemic lupus erythematosus (SLE, associated with DNASE1L3) and Barrett’s esophagus (BE, associated with MSR1). In detail, SLE is an autoimmune disease characterized by inflammation and the production of autoantibodies. There is increasing evidence that failure in the clearance of cfDNA by deoxyribonucleases (DNases), particularly DNASE1L3, can lead to the generation of anti-DNA antibodies and SLE69,70,71. The plasma cfDNA levels are closely related to the disease severity of SLE and can potentially be used as a biomarker for SLE diagnosis. The gene MSR1, macrophage scavenger receptor 1, plays an essential inflammatory role in multiple processes, including immunity, lung and liver disease, and cancer72. Previous studies have reported that mutations in MSR1 are associated with the presence of Barrett’s esophagus73. CfDNA analysis has become a promising approach for monitoring neoplastic progression in Barrett’s esophagus74. The identification of these associations between motif-associated genes and genetic disorders suggests a complex interplay between cfDNA end motifs, genetic variations, and human health.
Discussion
cfDNA molecular features, such as fragment size, end motifs, and methylation patterns, are valuable for understanding various biological processes and disease states. These characteristics of cfDNA molecules have also revolutionized liquid biopsy technology, facilitating non-invasive diagnosis, monitoring disease progression, and predicting patient outcomes. Uncovering key genes involved in cfDNA generation and characteristics provides insights into cfDNA biology and potential therapeutic targets. However, few studies have explored the genetic mechanisms underlying cfDNA molecular signatures. In this study, we analyzed sequencing data from 28,016 Chinese pregnant women, using whole-genome cfDNA data to extract both genotype and cfDNA features. For the first time, we conducted a genome-wide association study, identifying 15 study-wide significant genetic loci linked to 176 motifs. The most significant loci include PANX1, DFFB, DNASE1L3, and DNASE1L1, of which PANX1 and DNASE1L1 are novel modifiers of cfDNA fragmentation. These findings were further validated through independent GWAS analyses and experimental designs, strengthening their biological significance. By in-depth comparison of the cfGWAS results with our parallel pregnancy phenotype GWAS study, we observed genetic correlations between maternal BMI, maternal weight, uric acid, and cfDNA. Additionally, we revealed a strong causal relationship between leukocytes, especially neutrophils, and cfDNA features.
Although the sample size in our study does not reach the scale of hundreds of thousands, cfDNA screenings such as NIPT, tumor liquid biopsies, and pathogen liquid biopsies have generated vast amounts of genomic data. Consequently, the potential of our cfGWAS paradigm is enormous. For example, by June 2023, over 40 million pregnant women worldwide had undergone NIPT testing, resulting in a substantial amount of sequence data. This sample volume far exceeds that of participants with general WGS data75. With the large sample sizes, we believe that cfGWAS has the potential to identify more credible genes associated with cfDNA molecular features, enhancing our understanding of cfDNA generation and clearance. This knowledge could open new avenues for potential clinical applications of cfDNA, such as in liquid biopsy. For instance, circulating tumor DNA (ctDNA) is often present in low quantities, making it challenging to collect sufficient blood for accurate detection, particularly in cases with small tumors. Therefore, strategies to protect cell-free DNA from degradation or to reduce its clearance could ensure adequate concentrations of ctDNA, enabling sensitive and robust liquid biopsy tests even with small blood volumes76. Furthermore, our research could contribute to the discovery of therapeutic targets for diseases influenced by abnormal DNA concentrations, such as SLE and gout. This includes insights into drug targets that accelerate the clearance of cell-free DNA. CfDNA end motifs have been increasingly explored as biomarkers in cancer detection and transplantation monitoring. Our team has recently uncovered their role in predicting preeclampsia during early pregnancy77. Pregnancy also offers a valuable model to study tissue-of-origin dynamics (maternal vs. fetal)19. To our knowledge, this is the first study to explore the causal relationships between cfDNA end motifs and physiological and biochemical phenotypes in a pregnant population. 
We found that T-end motifs exhibited the most causal associations and colocalizations with pregnancy phenotypes. This may be related to placenta-specific nuclease activity, hormonal changes, or maternal immune adaptation during pregnancy. While our current analysis cannot fully explain this observation, it highlights a promising area for further investigation into the biological and clinical significance of cfDNA end motif dynamics in pregnancy.
Given the widespread clinical use of cfDNA, the significant discoveries from cfGWAS and their potential applications will not only revolutionize our understanding of cfDNA biology but also open new avenues for future clinical applications.
Limited by the sequencing strategy of NIPT technology (single-end 35 bp), we only used end motifs as a snapshot of cfDNA features. Although different cfDNA features were strongly correlated, a cfGWAS of other features, including fragment length, concentration, jagged ends, and nucleosome footprints, would provide a more comprehensive investigation. Furthermore, our discovery set relied on a cohort of pregnant women, so it is unclear how many findings generalize to all people and how many are specific to pregnancy. Large-scale independent studies on non-pregnant cohorts are necessary to resolve this uncertainty. Currently, no robust approaches are available for inferring kinship relationships based on NIPT-like ultra-low-depth sequencing data. Neither BaseVar-called variants nor STITCH-imputed genotypes yielded reliable kinship estimates. Applying additional quality control to retain more reliable variants, such as filtering by INFO score, yielded more reasonable estimates of relatedness. Given that the cohort consisted predominantly of middle-aged pregnant women, we expect the proportion of close relatives to be very small and their influence on the GWAS results to be negligible. Nevertheless, the development of appropriate methods for kinship estimation under ultra-low sequencing depth remains an important area for future research. In addition, our discovery GWAS had a relatively small sample size of approximately 28,000 pregnant women. The statistical power of GWAS is heavily dependent on sample size, with larger samples yielding higher power to detect genetic signals. We acknowledge the importance of amassing larger sample sizes for more robust and comprehensive genetic investigations in the future, from which we anticipate uncovering additional genetic associations that can further illuminate the complex genetic basis of cfDNA features.
Notably, at the time of NIPT testing, the fetal DNA fraction in maternal plasma is approximately 10%, with half of it being identical to the maternal DNA. In our genotype imputation and GWAS analyses, we did not distinguish between maternal and fetal DNA, but rather treated all cfDNA as maternal. While this approach simplifies the analysis, it may introduce minor biases, particularly for loci with strong placental or fetal-specific signals. Recently, a newly developed genotype imputation method, QUILT278, allows simultaneous imputation of maternal and fetal genomes. In ongoing work, we are applying this method to our NIPT dataset to separate maternal and fetal genomes for future GWAS and post-GWAS analyses, which will help further refine the findings. Moreover, we observed that a small subset of motifs yielded negative SNP heritability estimates using LD score regression. While this is likely due to statistical noise rather than biological factors, the exact cause remains unclear and warrants further investigation in larger cohorts or with alternative methods. In this study, we validated the impact of the PANX1 gene on cfDNA end motif frequencies and cfDNA concentration using knockout mouse models and cell lines. However, the cell-free DNA concentrations we measured in mouse plasma and cell culture supernatants do not distinguish between nuclear and mitochondrial DNA, whereas our GWAS study focuses exclusively on nuclear DNA. Further studies are needed to fully uncover the biological mechanisms of these genes on cfDNA. Validation of other candidate genes in knockout models would also be necessary.
Methods
Experimental model and study participant details
Subjects
The participants were recruited from Wuhan Children’s Hospital during their routine pregnancy examinations from 2017 to 2020. A variety of tests were conducted throughout the entire duration of pregnancy, including non-invasive prenatal testing (NIPT), biochemistry assessments, oral glucose tolerance tests, ultrasound screenings, and more. NIPT is a screening test for fetal chromosomal disorders that analyzes cell-free DNA from a blood sample taken from the mother early in pregnancy. The test is typically performed between weeks 12 and 22 of gestation, though the exact timing may vary by country. Before week 12, the fetal fraction is usually too low to detect genetic abnormalities, while after week 22, terminating a pregnancy due to trisomy can pose greater health risks to the mother. Each pregnant participant provided informed consent before enrollment. This study received approval from the Institutional Review Boards (IRB) of both Wuhan Children’s Hospital (2021R062-E03) and the Bioethics and Biosafety of BGI (BGI-IRB 21088-T2). Additionally, authorization was obtained from the National Human Genetic Resources Management Office (Approval No. [2021] CJ2002). Our sole inclusion criterion was the availability of NIPT genotype data. Pregnant women carrying trisomic fetuses, along with those whose sequencing depth was below 0.05× or whose mapping rate was below 90%, were excluded, leaving a final cohort of 38,668 individuals. These individuals were then used for genotype imputation.
Pregnancy phenotypes
Throughout the approximately 40-week gestation period, pregnant women undergo a range of routine tests, including blood, liver function, and kidney function tests. In our prior investigation, we curated 104 biochemical examination metrics and birth outcome indicators and conducted a comprehensive genome-wide association analysis36. A detailed list of these phenotypes, including full name, abbreviation, and phenotypic category, is provided in Supplementary Data 7.
Independent cohort for replication study
To validate our GWAS findings on cfDNA features, we conducted replication studies using three independent cohorts: two NIPT cohorts and one natural population. The first replication study involved Chinese NIPT samples collected from Wuhan Children’s Hospital during 2020 and 2021. We included samples without abnormal fetuses, with a sequencing depth >0.05×, a mapping rate exceeding 90%, and explicit information regarding age, sampling year, and gestational week of the NIPT tests. Ultimately, 25,578 pregnant women were retained for GWAS analysis in the replication study. We note that these samples do not overlap with those from the discovery study. The second replication study utilized Dutch NIPT samples47, in which NIPT cfDNA served as the genotype, while cfDNA molecular features were treated as phenotypes. They investigated six phenotypes: bincount diversity, fetal concentration, mitochondrial DNA concentration, fragment size diversity, motif diversity score, and total cfDNA concentration. We downloaded the full summary statistics for these six phenotypes to determine if our significant loci were also identified in their study. The third replication study comprised 442 natural samples, including both males and females, recruited during health examinations at the Dapeng outpatient department in Shenzhen from 2021 to 2022. Each participant provided informed consent before enrollment. This study received ethical approval from the Institutional Review Boards of the Bioethics and Biosafety of BGI (BGI-IRB 21157-T2). The plasma cfDNA was used for whole genome sequencing, using the DNBSEQ platform with paired-end 100 bp mode and an average sequencing depth of ~35×.
Murine models
The animal study was approved by the Institutional Review Boards of Bioethics and Biosafety of BGI (BGI-IRB A24009) and the Institutional Animal Care and Use Committee (IACUC) of Cyagen (GACU24-SY028). The Panx1-deficient mouse model (C57BL/6JCya) was created by CRISPR/Cas-mediated genome engineering at Cyagen. The gene knockout mice were first verified by PCR and gel electrophoresis, revealing the target fragment sizes, and then Sanger sequencing of the PCR products was used to confirm the designed deletion of the locus. The sequencing results of the plasma cfDNA of the Panx1-deficient mice were also checked to confirm that no reads mapped to the deleted region of the reference genome (Supplementary Fig. 20). Housing conditions included a 12-h light–dark cycle, controlled temperature and humidity, and pelleted rodent chow. Eight-week-old male and female mice were used for the experiment. Sex differences were not analyzed for Panx1-deficient mice.
Cell line models
A PANX1-knockout Jurkat cell line model was created by CRISPR/Cas-mediated genome engineering at Cyagen.
Method details
Library preparation and whole-genome sequencing
Peripheral blood (5 mL) was collected from pregnant women using specialized vacuum blood collection tubes. EDTA tubes were transported at low temperatures (2–8 °C), with plasma separation completed within 8 h. For ambient temperature circulating tubes (K tubes, G tubes), samples were transported at 6–35 °C, and plasma separation was performed within 96 h. The blood was centrifuged at 1600×g for 10 min at 4 °C to collect the upper plasma layer, followed by a second centrifugation at 16,000×g for 10 min to further purify the plasma. Cell-free DNA was extracted from the plasma using a nucleic acid extraction kit, and a DNA library was prepared through end repair, adapter ligation, and PCR amplification. After passing quality control, the DNA libraries were pooled and circularized, and DNA nanoballs were generated via rolling circle replication. Single-end sequencing was performed on a BGISEQ-500 sequencer using the combinatorial probe-anchor synthesis method with a read length of 35 bp. Since NIPT testing is typically conducted around 16 weeks of gestation when the fetal DNA content is ~10%, with half of it being identical to maternal DNA, we did not differentiate between maternal and fetal DNA in subsequent analyses. Instead, we treated all DNA as maternal for analysis purposes.
We used the fastp software79 to perform quality control on the sequencing data stored in FASTQ format. This involved the removal of adapter sequences and the elimination of low-quality reads. Following this, we used BWA80 to align the quality-controlled reads to the hg38 reference genome81, then converted the aligned reads into BAM format and sorted them. Duplicate reads were removed from the sorted BAM files using the Samtools rmdup tool82. Additionally, we used the GATK83 BaseRecalibrator to build a base quality score recalibration (BQSR) model from the sorted BAM files using known variant sites. The GATK ApplyBQSR tool was then used to apply the recalibration to the sorted BAM files, producing recalibrated BAM files and their index files. Finally, Samtools stats was used to generate comprehensive statistics for the recalibrated BAM files.
Variant detection and genotype imputation
Given the ultra-low depth of NIPT sequencing data, traditional analysis workflows designed for high-depth sequencing, such as standard variant detection, are inadequate. To ensure precise variant discovery, we used BaseVar (https://github.com/ShujiaHuang/basevar), which was developed for calling variants in ultra-low-pass (<1.0×) sequencing data. Additionally, because the NIPT sequencing data cover only about 10% of the genome per sample, there was a significant amount of missing genotype information. To ensure the data were suitable for genetic analysis, we applied genotype imputation using STITCH84 (with parameters K = 10, nGen = 16,000), which provided allele dosages for the imputed genotypes. To assess the accuracy of this imputation, we randomly selected 30 Chinese samples from the high-coverage 1000 Genomes Project (1KGP) dataset as a reference set. These samples were down-sampled to ~0.1× coverage and imputed alongside the original NIPT data. We then calculated the squared Pearson correlation (R²) between the imputed dosages and the high-coverage genotypes to determine imputation accuracy.
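The accuracy metric described above can be sketched as follows for a single variant. This is a minimal illustration: the function name and the example dosages are hypothetical, and how R² is aggregated across variants or frequency bins is not specified in the text.

```python
def squared_pearson(imputed_dosages, true_genotypes):
    """Squared Pearson correlation (R²) between imputed dosages (0 to 2)
    and high-coverage genotypes (0, 1, or 2) for one variant."""
    n = len(imputed_dosages)
    if n != len(true_genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in imputed_dosages)
    syy = sum((y - mean_y) ** 2 for y in true_genotypes)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    return sxy * sxy / (sxx * syy)


# Hypothetical variant observed in five down-sampled reference samples.
r2_example = squared_pearson([0.1, 0.9, 1.8, 1.1, 0.0], [0, 1, 2, 1, 0])
```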
CfDNA end motifs
In accordance with the previous definition19, plasma DNA end motifs were identified by analyzing the first four nucleotides (4-mer) at the 5’ ends of plasma DNA fragments aligned to the reference genome. With four possible nucleotides at each of the four positions, there are 256 possible 4-mer motifs. We calculated the frequency of each motif and normalized these frequencies by the total number of fragment ends sequenced. To assess the distribution of motif frequencies, we employed a metric called the motif diversity score (MDS), based on normalized Shannon entropy. MDS is defined by Eq. (1):
$$\mathrm{MDS} = \sum_{i=1}^{256} \frac{-P_i \log(P_i)}{\log(256)} \qquad (1)$$
where $P_i$ is the frequency of the $i$th motif. The MDS value ranges from 0 to 1, where a higher MDS indicates greater diversity and randomness in the motif distribution. In contrast, a lower MDS reflects a more skewed distribution of motif frequencies. Therefore, a decreasing MDS suggests increasing skewness in the frequency distribution of end motifs, while an increasing MDS indicates that motif frequencies are becoming more uniform.
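The motif frequencies and the MDS described above (Shannon entropy of the 256 motif frequencies divided by log 256) can be computed as in the sketch below. It assumes that each input string is already the 5’-end sequence of a fragment in fragment orientation; handling of strand and alignment details is omitted, and all names are illustrative.

```python
import math
from collections import Counter
from itertools import product

ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 4-mers


def motif_frequencies(fragment_5p_sequences):
    """Frequency of each 4-mer end motif, normalized by the number of usable ends."""
    counts = Counter()
    for seq in fragment_5p_sequences:
        motif = seq[:4].upper()
        if len(motif) == 4 and set(motif) <= set("ACGT"):  # skip N and short reads
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable fragment ends")
    return {m: counts[m] / total for m in ALL_MOTIFS}


def motif_diversity_score(freqs):
    """Normalized Shannon entropy of the motif frequencies (0 = skewed, 1 = uniform)."""
    entropy = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return entropy / math.log(len(ALL_MOTIFS))


# Hypothetical reads: only the first four bases of each are used.
example_freqs = motif_frequencies(["CCCAGTTA", "CCCATTGA", "TGCCAGTA", "NNNNACGT"])
example_mds = motif_diversity_score(example_freqs)
```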
Murine sample collection
Mice were terminally anesthetized with CO2, after which the eyeballs were removed and blood was collected in EDTA tubes (KANGJIAN). Six Panx1−/− mice (3 males and 3 females) and six WT mice (3 males and 3 females) were used. The blood was first centrifuged at 1600×g for 10 min at 4 °C to collect the upper layer of plasma, and then the plasma was centrifuged at 16,000×g for 10 min at 4 °C to remove cellular debris. The supernatant plasma samples were stored at −80 °C before cfDNA extraction.
Cell line culture
Gene knockout cell lines were confirmed by sequencing. WT and PANX1 knockout cells were cultured in RPMI 1640 medium supplemented with 10% fetal bovine serum at 37 °C in a humidified atmosphere with 5% CO2. During sample preparation, the cell culture supernatant was first centrifuged at 1600×g for 10 min at 4 °C, and then centrifuged again at 16,000×g for 10 min at 4 °C to remove cells and cellular debris.
Cell-free DNA extraction and quantification
Cell-free DNA was extracted from the blood plasma and cell culture supernatant samples using the MagPure Circulating DNA KF Kit (MD5432-02, Magen). For each sample, 100 μL of mouse plasma or 500 μL of cell culture supernatant was used for cfDNA extraction. The cfDNA concentrations were measured with a Qubit™ 4.0 Fluorometer (Invitrogen) using the Qubit dsDNA High Sensitivity Assay Kit (Q32854, Invitrogen).
Statistics and Reproducibility
Genome-wide association study
We used the genotype dosage (a quantitative value between 0 and 2) imputed by STITCH as genotype data, and the frequencies of the 256 cfDNA end motifs served as phenotypes for the genome-wide association analysis. To control for population stratification, we performed principal component analysis (PCA) on the population genetic variation identified by BaseVar using PLINK285, selecting the top 10 principal components as covariates. Furthermore, maternal age, gestational age at NIPT testing, and sequencing depth were included as additional covariates. Based on the STITCH imputation results, we retained SNPs with a minor allele frequency (MAF) greater than 0.05, a Hardy–Weinberg equilibrium (HWE) p-value >1e−6, and an imputation info score >0.4. A SNP list was then created, and only the SNPs in this list (using the --extract parameter) were included in the GWAS analysis. Additional arguments included --pheno-quantile-normalize and --covar-variance-standardize, which were used to transform motif frequencies to a standard normal distribution and to standardize quantitative covariates to have a mean of zero and variance of one. We set the significance threshold at 5e−8 for genome-wide significance and 1.95e−10 (=5e−8/256) for study-wide significance.
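The two transformations and the two thresholds described above can be illustrated as follows. This is a sketch of the general idea rather than a reimplementation of the PLINK2 options: the rank offset of 0.5, the tie handling, and the use of the sample standard deviation are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def inverse_normal_transform(values):
    """Rank-based transform of a phenotype to an approximately standard normal
    distribution. Ties are broken by input order (an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        transformed[idx] = normal.inv_cdf((rank - 0.5) / n)
    return transformed


def standardize(values):
    """Scale a quantitative covariate to mean zero and unit (sample) variance."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        raise ValueError("covariate is constant")
    return [(v - mu) / sd for v in values]


GENOME_WIDE_P = 5e-8
STUDY_WIDE_P = GENOME_WIDE_P / 256   # 256 motifs tested, about 1.95e-10

# Hypothetical motif frequencies and maternal ages for three samples.
motif_freq_norm = inverse_normal_transform([0.0123, 0.0101, 0.0115])
age_std = standardize([27.0, 31.0, 35.0])
```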
Validation in replication studies
To validate the results of the GWAS analysis, we utilized three independent replication cohorts and conducted validation through two approaches, based on integrated study-wide associated loci and individual motif–locus associations, respectively. We first focused on the integrated loci. The GWAS summary statistics of 256 motif frequencies from the discovery study were merged and integrated by selecting the minimum p-value across all 256 phenotypes. The GWAS summary statistics from the three replication studies underwent the same merging procedure before proceeding with replication validation. Subsequently, we iterated through a total of 345 motif–locus associations across various motifs in the discovery study to determine whether they could be validated in the corresponding motifs in the three replication cohorts.
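The merging step described above, which keeps the minimum p-value per SNP across all motif GWAS, can be sketched as follows. The nested-dictionary input layout and the SNP identifiers are assumptions made for illustration; real summary statistics would be read from files.

```python
def merge_min_p(per_motif_pvalues):
    """Collapse per-motif summary statistics into one record per SNP.

    per_motif_pvalues maps motif -> {snp_id: p_value}. The result maps
    snp_id -> (minimum p-value, motif that produced it).
    """
    merged = {}
    for motif, snp_pvalues in per_motif_pvalues.items():
        for snp_id, p_value in snp_pvalues.items():
            if snp_id not in merged or p_value < merged[snp_id][0]:
                merged[snp_id] = (p_value, motif)
    return merged


# Hypothetical p-values for two motifs and two SNPs.
merged_stats = merge_min_p({
    "CCCA": {"snp_a": 3e-12, "snp_b": 0.20},
    "TGCC": {"snp_a": 1e-4, "snp_b": 2e-9},
})
```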
In the Chinese NIPT cohort, we applied the same SNP filtering and analytical procedures as those used in the discovery study described above, which yielded a GWAS result comprising 2,794,215 SNPs. In the natural population cohort, the cfDNA end motif extraction method was the same as described above. We retained SNPs with a MAF > 0.01, HWE p-value > 1e−6, and a genotype missing rate <10% for the GWAS analysis. The first 10 principal components from PCA, along with age and sex, were included as covariates. Gestational age at blood draw and sequencing depth were not included, as these were not NIPT samples. The Dutch cohort study did not provide GWAS summary statistics on motif frequency but instead included six cfDNA molecular feature phenotypes, so each of the six features was used for replication when verifying each study-wide motif–locus association. We also recorded which specific feature the SNPs supporting the validation originated from.
To define a successfully replicated signal, we extended each significantly associated locus identified in the discovery cohort by 500 kb on both sides and examined whether any SNPs within this extended region in the replication cohorts met the following criteria: (1) a p-value < 1e−5 and (2) linkage disequilibrium (r² > 0.1) with a significant SNP within the corresponding discovery locus. Additionally, in the replication study based on motif frequency GWAS (Chinese NIPT and natural population cohorts), we assessed the consistency of the effect directions for the top SNP in each discovery locus.
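The replication rule can be expressed compactly. The sketch below assumes that, for each SNP in the replication cohort, the highest r² with any significant SNP in the discovery locus has already been computed; the `RepSnp` record and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RepSnp:
    # Hypothetical replication-cohort record; max_r2 is assumed precomputed
    # as the highest r^2 with any significant SNP in the discovery locus.
    chrom: str
    pos: int
    p: float
    max_r2: float


def is_replicated(chrom, start, end, rep_snps,
                  flank=500_000, p_max=1e-5, r2_min=0.1):
    """Return True if any replication SNP inside the locus extended by
    `flank` bp on both sides has p < p_max and r^2 > r2_min."""
    lo = max(0, start - flank)
    hi = end + flank
    return any(
        s.chrom == chrom and lo <= s.pos <= hi
        and s.p < p_max and s.max_r2 > r2_min
        for s in rep_snps
    )
```

Both criteria must hold for the same SNP; a SNP with a small p-value that is not in LD with the discovery signal does not count as replication.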
Heritability partition
To further understand the genetic architecture of the cfDNA molecular features and the polygenic contributions to the heritability of different genetic components, including cell type-specific elements, we performed a heritability partition analysis. Specifically, to partition heritability from GWAS summary statistics of the 256 end motifs, we used stratified LD score regression (stratified LDSC)86, which can account for linkage disequilibrium. Stratified LDSC converts cell-type/tissue-trait associations into cell-type/tissue-specific SNP functional annotations and estimates the extent to which different cell-type/tissue-specific functional annotations explain trait heritability. The cell type-specific expression data were obtained from the Genotype-Tissue Expression (GTEx) project87 and the Franke lab dataset88. In total, the annotations cover 205 tissues and cell types from nine categories: adipose, blood/immune, cardiovascular, central nervous system, digestive, endocrine, liver, musculoskeletal/connective, and others. The LD reference panel was the 1000 Genomes Phase 3 East Asian population.
Pathway-based analysis
Pathway enrichment analysis highlights pathways enriched with candidate genes, which helps to reveal the mechanisms underlying the traits. Using GWAS summary statistics for the 256 motifs, we performed pathway enrichment analysis with the Pathway Scoring Algorithm (PASCAL) software, a tool for gene scoring and pathway analysis from GWAS results89. The pathway databases included Reactome90, the Biocarta database (http://www.biocarta.com/), and KEGG91. For each of the 256 motifs, we computed the median pathway score and prioritized pathways in ascending order based on their scores, with special attention to those with higher rankings.
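One way to read the prioritization step is sketched below. It assumes that each pathway has one score per motif, that smaller scores indicate stronger enrichment (as for p-values), and that the median is taken per pathway across motifs; these are assumptions made for illustration, and the function name and data layout are hypothetical.

```python
from statistics import median


def rank_pathways(scores_by_pathway):
    """Rank pathways by their median score across motifs, ascending.

    scores_by_pathway: dict mapping pathway name -> list of per-motif scores.
    Returns a list of (pathway, median score), smallest median first.
    """
    medians = {
        pathway: median(scores)
        for pathway, scores in scores_by_pathway.items()
        if scores  # skip pathways without any score
    }
    return sorted(medians.items(), key=lambda item: item[1])


# Toy scores for three pathways over three motifs (values are made up).
toy_scores = {
    "pathway_A": [0.20, 0.04, 0.50],
    "pathway_B": [0.001, 0.002, 0.90],
    "pathway_C": [0.30, 0.40, 0.35],
}
ranking = rank_pathways(toy_scores)
```

Using the median rather than the minimum makes the ranking less sensitive to a single motif with an extreme score.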
Genetic correlation
Genetic correlation quantifies the overall genetic similarity between two complex traits. We investigated the genetic correlation between the 256 end motif frequencies and 104 pregnancy phenotypes measured in the same group of individuals. Specifically, we used LD score regression to perform the genetic correlation analysis, with the East Asian population from the 1000 Genomes Project as the reference panel92.
Mendelian randomization
Mendelian randomization (MR) is a methodological approach used in epidemiology and genetics to investigate causal relationships between exposure factors and outcomes. In this study, we performed bi-directional MR analysis using end motifs and pregnancy phenotypes as either exposure or outcome variables. Specifically, we conducted LD clumping to select only independent SNPs, using a 10,000 kb window and a clumping r² threshold of 0.1. LD was calculated based on the East Asian population from the 1000 Genomes Project. To infer causality, we employed the inverse variance weighted (IVW) method93 and the MR-Egger method94 using the R function TwoSampleMR::mr95. We only selected exposure variables with more than three significant SNPs after LD clumping and harmonization with outcome data to ensure the reliability of the MR analysis.
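For orientation, the core of the IVW estimator can be written in a few lines. The sketch below implements the textbook fixed-effect IVW estimate with first-order weights; it is not the TwoSampleMR implementation, which may differ in how the standard error is computed, and the function name and the `min_snps` argument are hypothetical. The minimum of four instruments mirrors the "more than three SNPs" rule stated above.

```python
import math


def ivw_estimate(beta_exp, beta_out, se_out, min_snps=4):
    """Fixed-effect inverse variance weighted MR estimate.

    beta_exp: SNP effects on the exposure
    beta_out: SNP effects on the outcome
    se_out:   standard errors of the SNP-outcome effects
    Returns (causal estimate, standard error).
    """
    n = len(beta_exp)
    if not (n == len(beta_out) == len(se_out)):
        raise ValueError("all inputs must have the same length")
    if n < min_snps:
        raise ValueError("too few instruments for a reliable estimate")
    weights = [1.0 / (se * se) for se in se_out]
    # Weighted regression of outcome effects on exposure effects
    # through the origin.
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exp, beta_out))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exp))
    return numerator / denominator, math.sqrt(1.0 / denominator)
```

If every SNP–outcome effect is exactly half the corresponding SNP–exposure effect, the estimate returned is 0.5, which is a convenient sanity check.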
In the scenario where pregnancy phenotypes are the exposures and motif frequencies are the outcomes (denoted as P → M), we included 55 exposure variables and 256 outcomes, with a significance threshold of 3.55e−6 (=0.05/(55*256)) after Bonferroni correction. In the reverse scenario, where motif frequencies are the exposures and pregnancy phenotypes are the outcomes (denoted as M → P), we included 147 exposures and 104 outcomes, with a significance threshold of 3.27e−6 (=0.05/(147*104)) after Bonferroni correction. We applied the IVW, weighted median, and MR-Egger methods simultaneously, and defined a pair as significant when the IVW results met the significance threshold. Due to sample overlap between the cfDNA motif GWAS and pregnancy phenotype GWAS, and to minimize potential bias in MR-Egger results, we additionally performed the MRlap method96. MRlap is a two-sample MR method that accounts for sample overlap, weak instrument bias, and winner’s curse.
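The two Bonferroni thresholds follow directly from the number of exposure–outcome pairs tested in each direction. The arithmetic can be checked with a short sketch; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_exposures, n_outcomes):
    """Significance threshold after correcting for every exposure-outcome pair."""
    return alpha / (n_exposures * n_outcomes)


# Pregnancy phenotypes as exposures, motifs as outcomes: 55 * 256 = 14,080 tests.
p_to_m = bonferroni_threshold(0.05, 55, 256)
# Motifs as exposures, pregnancy phenotypes as outcomes: 147 * 104 = 15,288 tests.
m_to_p = bonferroni_threshold(0.05, 147, 104)

print(f"{p_to_m:.2e}")  # 3.55e-06
print(f"{m_to_p:.2e}")  # 3.27e-06
```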
Colocalization analysis
Colocalization analysis is a statistical method used to determine whether two or more traits share the same causal genetic variant(s) within a specific genomic region. This approach is essential when different GWAS identify overlapping signals, suggesting a potential shared genetic architecture between traits. To investigate whether the cfDNA end motifs and pregnancy phenotypes have overlapping genetic effects due to the same causal variant, we performed colocalization analysis using the R package coloc97. The prior probability that a SNP is associated with only the end motif or only the phenotype was set to 1e−4 for each trait, and the prior probability that a SNP is associated with both traits was set to 1e−5. To visualize the colocalized regions between end motifs and phenotypes, we used the R package locuscomparer98.
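To show how the three priors enter the calculation, the sketch below combines per-SNP log approximate Bayes factors into posterior probabilities for the five colocalization hypotheses (H0: no association; H1 and H2: association with one trait only; H3: two distinct causal variants; H4: one shared causal variant), following the published framework of Giambartolomei et al.97. It is a simplified re-implementation for illustration, not the coloc package code; the function names are hypothetical, and the per-SNP log Bayes factors are assumed to be given.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def logdiffexp(x, y):
    """log(exp(x) - exp(y)); returns -inf when the difference vanishes."""
    if y >= x:
        return float("-inf")
    return x + math.log1p(-math.exp(y - x))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log Bayes factors.

    labf1, labf2: log approximate Bayes factors for trait 1 and trait 2,
    one value per SNP, in the same SNP order.
    """
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need two equally long lists with at least two SNPs")
    s1 = logsumexp(labf1)
    s2 = logsumexp(labf2)
    s12 = logsumexp([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                                      # H0
        math.log(p1) + s1,                                        # H1
        math.log(p2) + s2,                                        # H2
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),   # H3
        math.log(p12) + s12,                                      # H4
    ]
    norm = logsumexp(log_h)
    return [math.exp(h - norm) for h in log_h]


# Toy example: both traits peak at the same SNP, then at different SNPs.
shared = coloc_posteriors([10.0, 0.0, 0.0], [10.0, 0.0, 0.0])
distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 10.0, 0.0])
```

In the toy example, the posterior mass concentrates on H4 when both traits peak at the same SNP, whereas H3 outweighs H4 when the peaks fall on different SNPs.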
Pleiotropy analysis with public databases
In this section, we assessed the pleiotropic effects of motif-associated SNPs/genes based on the GWAS catalog35 and OMIM compendium99. First, we downloaded the latest association data from the GWAS catalog (as of September 29, 2025) and performed initial filtering, retaining only SNPs reaching genome-wide significance (p-value < 5e−8). We then defined a search region based on each study-wide association locus associated with cfDNA motifs, extending 100 kb upstream and downstream. For each identified search region, we extracted all significant SNPs from the GWAS catalog data and removed duplicates based on the PubMed ID and associated trait to ensure each trait was recorded only once per study. Then, for each GWAS catalog trait, we counted the total number of recorded associated loci and the number of loci shared with end motifs, and calculated the corresponding ratio. Trait categories were matched using the EFO mappings file from the GWAS Catalog. When a trait was mapped to multiple EFO terms, we prioritized categories not labeled as “Other trait”, “Other measurement”, or “NR”.
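The region search and de-duplication steps can be sketched as follows. The sketch assumes that catalog records are available as dictionaries with chromosome, position, p-value, PubMed ID, and trait; the key names and function names are hypothetical and do not correspond to the column names of the GWAS catalog download. Only the steps that are fully specified in the text are shown, so the per-trait ratio is omitted.

```python
from collections import defaultdict


def hits_by_locus(catalog_rows, motif_loci, flank=100_000, p_max=5e-8):
    """For each motif locus, collect unique (PubMed ID, trait) pairs having a
    genome-wide significant catalog SNP within `flank` bp of the locus.

    catalog_rows: list of dicts with keys chrom, pos, p, pmid, trait
    motif_loci:   list of (locus name, chrom, start, end)
    """
    hits = {}
    for name, chrom, start, end in motif_loci:
        lo, hi = max(0, start - flank), end + flank
        # A set keeps each (study, trait) pair only once per region.
        hits[name] = {
            (row["pmid"], row["trait"])
            for row in catalog_rows
            if row["p"] < p_max and row["chrom"] == chrom
            and lo <= row["pos"] <= hi
        }
    return hits


def loci_per_trait(hits):
    """Count, for each trait, the number of motif loci it shares."""
    counts = defaultdict(int)
    for pairs in hits.values():
        for trait in {trait for _, trait in pairs}:
            counts[trait] += 1
    return dict(counts)
```

A trait reported by several studies in the same region still counts as one shared locus in `loci_per_trait`, because traits are collapsed per region before counting.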
OMIM (Online Mendelian Inheritance in Man) is a comprehensive database of human genes and genetic disorders. We created a list for all mapped genes of SNPs associated with cfDNA end motifs and searched for their related diseases in the OMIM database.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data that support the findings of this study have been deposited into the CNGB Sequence Archive (CNSA)100 of the China National GeneBank DataBase (CNGBdb)101 with accession number CNP0005734. The summary statistics of Dutch NIPT replication cohort are available in the LocusZoom database under the following accession codes: cfDNA concentration (fetal) [https://my.locuszoom.org/gwas/964912/data/], cfDNA concentration (total) [https://my.locuszoom.org/gwas/310546/data/], cfDNA concentration (mitochondrial) [https://my.locuszoom.org/gwas/326281/data/], cfDNA fragment size diversity (unfiltered) [https://my.locuszoom.org/gwas/443494/data/], cfDNA cleave-site motif diversity (6 bp Purine/Pyrimidine) [https://my.locuszoom.org/gwas/416825/data/], and cfDNA bincount diversity [https://my.locuszoom.org/gwas/694315/data/]. The cell type-specific expression annotations used in the stratified LDSC analyses were obtained from https://console.cloud.google.com/storage/browser/broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores under the filename Multi_tissue_gene_expr_EAS_1000G_v3_ldscores.tar.gz. The LD reference panel for stratified LDSC was downloaded from Zenodo (https://doi.org/10.5281/zenodo.7768714). The association data used for pleiotropy analysis are available from the GWAS catalog (https://www.ebi.ac.uk/gwas/docs/file-downloads) under the filename All associations v1.0.2. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
Wan, J. C. M. et al. Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nat. Rev. Cancer 17, 223–238 (2017).
Thierry, A. R., El Messaoudi, S., Gahan, P. B., Anker, P. & Stroun, M. Origins, structures, and functions of circulating DNA in oncology. Cancer Metastasis Rev. 35, 347–376 (2016).
Chiu, R. W. et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study. BMJ 342, c7401 (2011).
Cisneros-Villanueva, M. et al. Cell-free DNA analysis in current cancer clinical trials: a review. Br. J. Cancer 126, 391–400 (2022).
Medina, J. E. et al. Cell-free DNA approaches for cancer early detection and interception. J. Immunother. Cancer 11, https://doi.org/10.1136/jitc-2022-006013 (2023).
Knight, S. R., Thorne, A. & Lo Faro, M. L. Donor-specific cell-free DNA as a biomarker in solid organ transplantation. A systematic review. Transplantation 103, 273–283 (2019).
Chan, R. W. et al. Plasma DNA aberrations in systemic lupus erythematosus revealed by genomic and methylomic sequencing. Proc. Natl. Acad. Sci. USA 111, E5302–E5311 (2014).
Nomura, J. et al. Rapid detection of invasive Mycobacterium chimaera disease via a novel plasma-based next-generation sequencing test. BMC Infect. Dis. 19, 371 (2019).
Zheng, Y. et al. Development and clinical validation of a droplet digital PCR assay for detecting Acinetobacter baumannii and Klebsiella pneumoniae in patients with suspected bloodstream infections. Microbiologyopen 10, e1247 (2021).
Qi, T. et al. Cell-free DNA fragmentomics: the novel promising biomarker. Int. J. Mol. Sci. 24, https://doi.org/10.3390/ijms24021503 (2023).
Zhou, Z. et al. Fragmentation landscape of cell-free DNA revealed by deconvolutional analysis of end motifs. Proc. Natl. Acad. Sci. USA 120, e2220982120 (2023).
Koukourakis, M. I. et al. Circulating plasma cell-free DNA (cfDNA) as a predictive biomarker for radiotherapy: results from a prospective trial in head and neck cancer. Cancer Diagn. Progn. 3, 551–557 (2023).
Li, L. et al. Serum cell-free DNA-based detection of Mycobacterium avium complex infection. Am. J. Respir. Crit. Care Med. 209, 1246–1254 (2024).
Lehmann, J., Giaglis, S., Kyburz, D., Daoudlarian, D. & Walker, U. A. Plasma mtDNA as a possible contributor to and biomarker of inflammation in rheumatoid arthritis. Arthritis Res. Ther. 26, 97 (2024).
Tivey, A., Church, M., Rothwell, D., Dive, C. & Cook, N. Circulating tumour DNA—looking beyond the blood. Nat. Rev. Clin. Oncol. 19, 600–612 (2022).
Yu, S. C. Y. et al. Single-molecule sequencing reveals a large population of long cell-free DNA molecules in maternal plasma. Proc. Natl. Acad. Sci. USA 118, https://doi.org/10.1073/pnas.2114937118 (2021).
Lapin, M. et al. Fragment size and level of cell-free DNA provide prognostic information in patients with advanced pancreatic cancer. J. Transl. Med. 16, 300 (2018).
Thierry, A. R. Circulating DNA fragmentomics and cancer screening. Cell Genom. 3, 100242 (2023).
Jiang, P. et al. Plasma DNA End-motif profiling as a fragmentomic marker in cancer, pregnancy, and transplantation. Cancer Discov. 10, 664–673 (2020).
Moser, T., Kühberger, S., Lazzeri, I., Vlachos, G. & Heitzer, E. Bridging biological cfDNA features and machine learning approaches. Trends Genet. 39, 285–307 (2023).
Zhu, G. et al. Tissue-specific cell-free DNA degradation quantifies circulating tumor DNA burden. Nat. Commun. 12, 2229 (2021).
Pisareva, E. et al. Neutrophil extracellular traps have auto-catabolic activity and produce mononucleosome-associated circulating DNA. Genome Med. 14, 135 (2022).
Jacobson, M. D., Weil, M. & Raff, M. C. Programmed cell death in animal development. Cell 88, 347–354 (1997).
Mannherz, H. G., Peitsch, M. C., Zanotti, S., Paddenberg, R. & Polzar, B. A new function for an old enzyme: the role of DNase I in apoptosis. Curr. Top. Microbiol. Immunol. 198, 161–174 (1995).
Rodriguez, A. M. et al. Identification, localization, and expression of two novel human genes similar to deoxyribonuclease I. Genomics 42, 507–513 (1997).
Widlak, P., Li, P., Wang, X. & Garrard, W. T. Cleavage preferences of the apoptotic endonuclease DFF40 (caspase-activated DNase or nuclease) on naked DNA and chromatin substrates. J. Biol. Chem. 275, 8226–8232 (2000).
Han, D. S. C. & Lo, Y. M. D. The nexus of cfDNA and nuclease biology. Trends Genet. 37, 758–770 (2021).
Sin, S. T. et al. Effects of nucleases on cell-free extrachromosomal circular DNA. JCI Insight 7, https://doi.org/10.1172/jci.insight.156070 (2022).
Chen, M. et al. Fragmentomics of urinary cell-free DNA in nuclease knockout mouse models. PLoS Genet. 18, e1010262 (2022).
Han, D. S. C. et al. The biology of cell-free DNA fragmentation and the roles of DNASE1, DNASE1L3, and DFFB. Am. J. Hum. Genet. 106, 202–214 (2020).
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Prim. 1, 59 (2021).
Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat. Genet. 52, 572–581 (2020).
Tcheandjieu, C. et al. Large-scale genome-wide association study of coronary artery disease in genetically diverse populations. Nat. Med. 28, 1679–1692 (2022).
Borrego-Yaniz, G. et al. Risk loci involved in giant cell arteritis susceptibility: a genome-wide association study. Lancet Rheumatol. 6, e374–e383 (2024).
Cerezo, M. et al. The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity. Nucleic Acids Res. 53, D998–D1005 (2025).
Xiao, H. et al. Genetic analyses of 104 phenotypes in 20,900 Chinese pregnant women reveal pregnancy-specific discoveries. Cell Genom. 4, 100633 (2024).
Zhu, H. et al. Novel insights into the genetic architecture of pregnancy glycemic traits from 14,744 Chinese maternities. Cell Genom. 4, 100631 (2024).
Gu, Y. et al. Genetic architecture and risk prediction of gestational diabetes mellitus in Chinese pregnancies. Nat. Commun. 16, 4178 (2025).
Wei, Y. et al. Genome-wide association studies of thyroid-related hormones, dysfunction, and autoimmunity among 85,421 Chinese pregnancies. Nat. Commun. 15, 8004 (2024).
Liu, S. et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell 175, 347–359.e314 (2018).
Liu, S. et al. Utilizing non-invasive prenatal test sequencing data for human genetic investigation. Cell Genom. 4, 100669 (2024).
Deng, C. et al. Maternal and fetal factors influencing fetal fraction: a retrospective analysis of 153,306 pregnant women undergoing noninvasive prenatal screening. Front. Pediatr. 11, 1066178 (2023).
Wang, E. et al. Gestational age and maternal weight effects on fetal cell-free DNA in maternal plasma. Prenat. Diagn. 33, 662–666 (2013).
Whyte-Fagundes, P. & Zoidl, G. Mechanisms of pannexin1 channel gating and regulation. Biochim. Biophys. Acta Biomembr. 1860, 65–71 (2018).
Santavanond, J. P. et al. The small molecule raptinal can simultaneously induce apoptosis and inhibit PANX1 activity. Cell Death Dis. 15, 123 (2024).
Ueki, M. et al. Evaluation of all nonsynonymous single-nucleotide polymorphisms in the gene encoding human deoxyribonuclease I-like 1, possibly implicated in the blocking of endocytosis-mediated foreign gene transfer. DNA Cell Biol. 33, 79–87 (2014).
Linthorst, J., Nivard, M. & Sistermans, E. A. GWAS shows the genetics behind cell-free DNA and highlights the importance of p.Arg206Cys in DNASE1L3 for non-invasive testing. Cell Rep. 43, 114799 (2024).
Blander, J. M. The many ways tissue phagocytes respond to dying cells. Immunol. Rev. 277, 158–173 (2017).
Juul, L. A., Hartwig, T. S., Ambye, L., Sørensen, S. & Jørgensen, F. S. Noninvasive prenatal testing and maternal obesity: a review. Acta Obstet. Gynecol. Scand. 99, 744–750 (2020).
Mhatre, M. et al. The effect of maternal obesity on placental cell-free DNA release in a mouse model. Reprod. Sci. (Thousand Oaks, CA) 26, 1218–1224 (2019).
Stupak, A., Kwaśniewski, W., Goździcka-Józefiak, A. & Kwaśniewska, A. The influence of maternal obesity on cell-free fetal DNA and blood pressure regulation in pregnancies with hypertensive disorders. Medicina (Kaunas) 57, https://doi.org/10.3390/medicina57090962 (2021).
Haghiac, M. et al. Increased death of adipose cells, a path to release cell-free DNA into systemic circulation of obese women. Obesity (Silver Spring) 20, 2213–2219 (2012).
Lapaire, O. et al. Significant correlation between maternal body mass index at delivery and in the second trimester, and second trimester circulating total cell-free DNA levels. Reprod. Sci. (Thousand Oaks, CA) 16, 274–279 (2009).
Vora, N. L. et al. A multifactorial relationship exists between total circulating cell-free DNA levels and maternal BMI. Prenat. Diagn. 32, 912–914 (2012).
Moss, J. et al. Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 9, 5068 (2018).
Leal, A. et al. White blood cell and cell-free DNA analyses for detection of residual disease in gastric cancer. Nat. Commun. 11, 525 (2020).
Mattox, A. K. et al. The origin of highly elevated cell-free DNA in healthy individuals and patients with pancreatic, colorectal, lung, or ovarian cancer. Cancer Discov. 13, 2166–2179 (2023).
Kustanovich, A., Schwartz, R., Peretz, T. & Grinshpun, A. Life and death of circulating cell-free DNA. Cancer Biol. Ther. 20, 1057–1067 (2019).
Fridlich, O. et al. Elevated cfDNA after exercise is derived primarily from mature polymorphonuclear neutrophils, with a minor contribution of cardiomyocytes. Cell Rep. Med. 4, 101074 (2023).
Fararjeh, A. S. et al. Proteasome 26S Subunit, non-ATPase 3 (PSMD3) regulates breast cancer by stabilizing HER2 from degradation. Cancers (Basel) 11, https://doi.org/10.3390/cancers11040527 (2019).
Kamatani, Y. et al. Genome-wide association study of hematological and biochemical traits in a Japanese population. Nat. Genet. 42, 210–215 (2010).
Astle, W. J. et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429.e1419 (2016).
Chen, M. H. et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell 182, 1198–1213.e1114 (2020).
Kachuri, L. et al. Genetic determinants of blood-cell traits influence susceptibility to childhood acute lymphoblastic leukemia. Am. J. Hum. Genet. 108, 1823–1835 (2021).
Ferreira, M. A. et al. Shared genetic origin of asthma, hay fever and eczema elucidates allergic disease biology. Nat. Genet. 49, 1752–1757 (2017).
Zhu, Z. et al. Shared genetics of asthma and mental health disorders: a large-scale genome-wide cross-trait analysis. Eur. Respir. J. 54, https://doi.org/10.1183/13993003.01507-2019 (2019).
Shan, M. A. et al. Cell-free DNA as a potential diagnostic biomarker in academic stress: a case-control study in young adults. Saudi J. Biol. Sci. 31, 103933 (2024).
Qian, B. et al. Plasma cell-free DNA as a novel biomarker for the diagnosis and monitoring of atherosclerosis. Cells 11, https://doi.org/10.3390/cells11203248 (2022).
Al-Mayouf, S. M. et al. Loss-of-function variant in DNASE1L3 causes a familial form of systemic lupus erythematosus. Nat. Genet. 43, 1186–1188 (2011).
Gerovska, D. & Araúzo-Bravo, M. J. Systemic lupus erythematosus patients with DNASE1L3 deficiency have a distinctive and specific genic circular DNA profile in plasma. Cells 12, https://doi.org/10.3390/cells12071061 (2023).
Mathapathi, S. & Chu, C. Q. Contribution of impaired DNASE1L3 activity to anti-DNA autoantibody production in systemic lupus erythematosus. Rheumatol. Immunol. Res. 3, 17–22 (2022).
Gudgeon, J., Marín-Rubio, J. L. & Trost, M. The role of macrophage scavenger receptor 1 (MSR1) in inflammatory disorders and cancer. Front. Immunol. 13, 1012002 (2022).
Orloff, M. et al. Germline mutations in MSR1, ASCC1, and CTHRC1 in patients with Barrett esophagus and esophageal adenocarcinoma. JAMA 306, 410–419 (2011).
Rumiato, E. et al. Detection of genetic alterations in cfDNA as a possible strategy to monitor the neoplastic progression of Barrett’s esophagus. Transl. Res. 190, 16–24.e11 (2017).
Shendure, J., Findlay, G. M. & Snyder, M. W. Genomic medicine-progress, pitfalls, and promise. Cell 177, 45–57 (2019).
Martin-Alonso, C. et al. Priming agents transiently reduce the clearance of cell-free DNA to improve liquid biopsies. Science (New York, NY) 383, eadf2341 (2024).
Zhang, H. et al. Detection of early-onset severe preeclampsia by cell-free DNA fragmentome. Preprint at medRxiv https://doi.org/10.1101/2024.03.22.24304708 (2024).
Li, Z., Albrechtsen, A. & Davies, R. W. Flexible read-aware genotype imputation from sequence using biobank sized reference panels. Nat. Commun. https://doi.org/10.1038/s41467-025-67218-1 (2025).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Davies, R. W., Flint, J., Myers, S. & Mott, R. Rapid genotype imputation from sequence without reference panels. Nat. Genet. 48, 965–969 (2016).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, https://doi.org/10.1186/s13742-015-0047-8 (2015).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Consortium, G. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science (New York, NY) 348, 648–660 (2015).
Fehrmann, R. S. et al. Gene expression analysis identifies global gene dosage sensitivity in cancer. Nat. Genet. 47, 115–125 (2015).
Alonso-Gonzalez, A., Calaza, M., Rodriguez-Fontenla, C. & Carracedo, A. Gene-based analysis of ADHD using PASCAL: a biological insight into the novel associated genes. BMC Med. Genom. 12, 143 (2019).
Milacic, M. et al. The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res. 52, D672–D678 (2024).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Burgess, S., Butterworth, A. & Thompson, S. G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet. Epidemiol. 37, 658–665 (2013).
Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int. J. Epidemiol. 44, 512–525 (2015).
Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. eLife 7, https://doi.org/10.7554/eLife.34408 (2018).
Mounier, N. & Kutalik, Z. Bias correction for inverse variance weighting Mendelian randomization. Genet. Epidemiol. 47, 314–331 (2023).
Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet. 10, e1004383 (2014).
Liu, B., Gloudemans, M. J., Rao, A. S., Ingelsson, E. & Montgomery, S. B. Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 51, 768–769 (2019).
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
Wang, W. et al. The China National GeneBank Sequence Archive (CNSA) 2024 update. Hortic. Res. 12, uhaf036 (2025).
Chen, F. Z. et al. CNGBdb: China National GeneBank DataBase. Yi chuan = Hereditas 42, 799–809 (2020).
Acknowledgements
This study was supported by Shenzhen Medical Research Fund (B2404004), National Key Research and Development Plan of China (2022YFC2704703), Hubei Provincial Natural Science Foundation of China (2025AFD674), Shenzhen Science and Technology Program (SYSPG20241211173852024), National Key Research and Development Program of China (2023YFC2605400, 2022YFC2502402), Open Research Project in State Key Laboratory of Vascular Homeostasis and Remodeling (Peking University) (2025-SKLVHR-013), and Key-Area Research and Development Program of Guangdong Province (2023B0303040001).
Author information
Authors and Affiliations
Contributions
X.J., H.H.Z., H.Q.Z., and A.Z. conceived the study, designed the research program, and managed the project. H.X., M.Y., and A.Z. collected the data. L.X.L., J.Z., R.Z., and Y.W. preprocessed the data and finished the quality control. H.H.Z., L.X.L., X.Z., Ying L., L.W., and Y.J. performed the statistical analysis and results visualization. Y.Z., S.Z., R.O., F.L., C.X., and G.Z. performed the biological experiments. H.Q.Z., X.L., L.G.L., Yu L., and Y.H. analyzed the experimental data. H.H.Z., L.X.L., X.Z., Ying L., H.Q.Z., Y.Z., L.W., and Y.W. wrote the manuscript. X.X. and J.W. provided overall project oversight. All authors participated in revising the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Victoria Jackson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhu, H., Zhang, Y., Li, L. et al. cfGWAS reveal genetic basis of cell-free DNA end motifs. Nat Commun 17, 1714 (2026). https://doi.org/10.1038/s41467-025-67940-w
DOI: https://doi.org/10.1038/s41467-025-67940-w






