Introduction

Pancreatic cancer is one of the most aggressive cancers in China, with a 5-year survival rate of only 10%1,2. Owing to its high malignancy, pancreatic cancer remains a major cause of cancer-related mortality3,4. Pancreatic ductal adenocarcinoma (PDAC) accounts for 90% of all pancreatic cancer cases2,5. Smoking, nonhereditary or chronic pancreatitis, chronic diabetes mellitus, obesity, non-type O blood group, and age are potential risk factors for PDAC3. In addition, germline mutations in genes such as BRCA2, BRCA1, CDKN2A, ATM, STK11, PRSS1, MLH1, and PALB2 are associated with pancreatic adenocarcinoma (PAAD)3. To date, there is no reliable screening test for pancreatic cancer, and most patients do not have evident symptoms until the advanced stage6. Surgical resection remains the main treatment for pancreatic cancer, but only 10–20% of patients are eligible for it6. Carbohydrate antigen 19-9 (CA19-9) and carcinoembryonic antigen (CEA) are considered biomarkers of pancreatic cancer7. CA19-9 also plays an important role in guiding surgical decisions and the use of adjuvant therapy and in detecting post-operative tumor recurrence, but its utility is limited because 10% of patients do not secrete the antigen8. As a biomarker, CA19-9 lacks sensitivity and specificity, and it is elevated in benign pancreatic diseases and other gastrointestinal malignancies9. CEA is likewise neither sensitive nor specific, and it is elevated in alcoholic cirrhosis, hepatitis, and biliary disease10,11.

Cell-free DNA (cfDNA) is fragmented (approximately 150–350 bases) and typically double-stranded12. Most cfDNA is released from hematopoietic cells, and a portion is released from cancer cells13,14. CfDNA has been found to be more abundant in patients with gastrointestinal cancer than in healthy controls, and its level was higher in the malignant group than in the benign group15. CfDNA features are closely related to the early genesis of cancer16; accordingly, multiple studies have indicated that cfDNA can be used for early detection of cancer, including liver cancer17,18,19,20, lung cancer21,22, breast cancer23, urothelial bladder carcinoma24, colorectal cancer25, Hodgkin’s lymphoma26, and pancreatic cancer27,28,29.

CfDNA levels are elevated in pancreatic cancer16, making cfDNA a potential diagnostic biomarker. CfDNA offers several advantages: its detection technology is well established, and its relative stability enables consistent testing30,31,32. Several studies have investigated various cfDNA-based features, such as fragmentomics, mutations, and methylation, to develop diagnostic models for pancreatic cancer. For instance, one study developed a cancer diagnostic model using cfDNA fragmentation profiles, achieving sensitivities ranging from 57% to 99% at a specificity of 98%33. Additionally, copy number alterations (CNAs) detected via cfDNA have been applied to identify various cancers, including pancreatic cancer34. Among methylation-based approaches, cfDNA 5-hydroxymethylcytosine (5hmC) features have shown strong performance in identifying early-stage pancreatic cancer29. Combining circulating tumor DNA (ctDNA) with protein biomarkers has yielded high diagnostic accuracy for detecting PDAC35. These findings highlight the potential of cfDNA-based approaches as valuable tools for the early detection and diagnosis of pancreatic cancer. However, relying on a single biomarker for diagnosis has inherent limitations. Integrating multiple cfDNA-based features has the potential to enhance diagnostic accuracy and mitigate these constraints.

In this work, we performed a large-scale, multi-center cohort study and employed next-generation sequencing (NGS) to acquire plasma cfDNA end motif36, nucleosome footprint (NF)13, fragmentation33,37,38, and copy number alteration (CNA) profiles from all enrolled cases. Predictive features were selected using the least absolute shrinkage and selection operator (LASSO). Based on these features, we developed a weighted diagnostic model (PCM score) and a prognostic evaluation model (PCP score).

Results

Characteristic signatures of cfDNA

CfDNA fragment size was measured in plasma samples from patients with pancreatic cancer, pancreatic benign tumor (PBT), or chronic pancreatitis (CP) and from healthy controls (HC). The fragmentation profiles were consistent among non-cancer cases (PBT, CP, and HC) but varied significantly in patients with pancreatic cancer (Fig. 1a, Supplementary Fig. 1a). Notably, cfDNA fragments were shorter in pancreatic cancer patients than in PBT, CP, and healthy controls: the median cfDNA fragment size was 175 bp (range 154–197 bp) in pancreatic cancer, compared with 182 bp (range 165–198 bp) in CP + PBT and 186 bp (range 160–203 bp) in healthy controls (Fig. 1a, Supplementary Fig. 1a). Among patients with pancreatic cancer, fragment size was not influenced by age, gender, or the level of CA125, CA19-9, or CEA, but was significantly associated with AJCC stage (Supplementary Fig. 1b). In the PBT group, cfDNA fragment size was unaffected by age, gender, or the level of CA125, CA19-9, or CEA (Supplementary Fig. 1c).

Fig. 1: cfDNA fragment, motif, nucleosome footprint signatures in healthy controls, CP, PBT and pancreatic cancer patients.

a Size distributions of cfDNA fragments in participants of healthy controls, CP + PBT and pancreatic cancer (The Z-score indicates the ratio of short fragments to long fragments); b KEGG pathway analysis of NF difference between healthy controls and pancreatic cancer. Hypergeometric test was used to detect whether a specific gene set is significantly enriched; c Plasma cfDNA end motif features distribution in healthy controls, CP + PBT and pancreatic cancer. d Size distributions of cfDNA fragments in different subtypes of pancreatic cancer, PBT, CP, and healthy controls. Box plots indicate median (middle line), 25%, 75% percentile (box) and minimum and maximum (whiskers) as well as outliers (single points). e CNA features in participants of healthy controls, CP + PBT, and pancreatic cancer. Source data are provided as a Source Data file. PDAC pancreatic ductal adenocarcinoma, ASCP adenosquamous carcinoma of the pancreas, IPMN intraductal papillary mucinous neoplasm, PNET pancreatic neuroendocrine tumor, SCN serous cystic neoplasm.

KEGG pathway analysis revealed that differentially expressed NF genes were enriched in several cancer-related pathways, including the hedgehog, VEGF, MAPK, TGF-β, and Wnt signaling pathways (Fig. 1b). Unsupervised hierarchical clustering demonstrated a clear distinction between healthy controls, CP, PBT, and pancreatic cancer (Fig. 1c). Fragment lengths decreased progressively with increasing malignancy (Fig. 1d). Additionally, CNA analysis showed that pancreatic cancer patients had more CNAs than PBT and CP patients, with healthy individuals having the fewest (Fig. 1e).

Patients and cohorts

All cases were divided into 4 cohorts: the Training cohort (432 cases), the Testing cohort (267 cases), External Validation cohort 1 (129 cases), and External Validation cohort 2 (139 cases) (Fig. 2). The Training cohort was used to construct the PCM and PCP scoring systems. The 422 patients with pancreatic cancer or PBT comprised five subtypes: PDAC, adenosquamous carcinoma of the pancreas (ASCP), intraductal papillary mucinous neoplasm (IPMN), pancreatic neuroendocrine tumor (PNET), and serous cystic neoplasm (SCN). Pancreatic cancer cases comprised PDAC and ASCP, while PBT cases included IPMN, PNET, and SCN. We used computer-generated random numbers to assign patients from Changhai Hospital to the Training cohort (n = 272) and the Testing cohort (n = 98). External Validation cohort 1 consisted of patients from the Affiliated Hospital of Qingdao University, and External Validation cohort 2 consisted of patients from The Second Affiliated Hospital of Shandong University and The Second Hospital, Cheeloo College of Medicine, Shandong University. The healthy controls were randomly distributed among the Training cohort, the Testing cohort, and the Validation cohorts. The levels of CA19-9, CA125, and CEA across patient groups are presented in Supplementary Fig. 2, and the TNM stage distribution for all patients is detailed in Supplementary Fig. 3.

Fig. 2: Model construction of PCM score.

Patients from 3 hospitals were enrolled in our study. CNA, fragment size, motif, and NF features of plasma cfDNA were used to build a classifier.

Establishment of PCM score

The workflow for constructing the diagnostic model is shown in Fig. 2. All participants were divided into four cohorts: the Training cohort, the Testing cohort, and two External Validation cohorts. In the Training cohort, cfDNA was analyzed using low-pass whole-genome sequencing (WGS), and the PCM score was constructed from CNA, fragment, motif, and NF signatures. We constructed 4 models to distinguish pancreatic cancer from non-cancer cases (PBT, CP, and healthy individuals). In the Training cohort, the combined model (PCM score) showed an AUC of 0.975 (95% CI: 0.961–0.988), compared with 0.973 (95% CI: 0.959–0.986) for NF, 0.858 (95% CI: 0.823–0.894) for motif, and 0.968 (95% CI: 0.952–0.983) for fragment (Fig. 3a). In the Testing cohort, the combined model showed an AUC of 0.979 (95% CI: 0.961–0.998) (Fig. 3b). In External Validation cohort 1 and External Validation cohort 2, the combined model showed AUCs of 0.992 (95% CI: 0.983–1) and 0.986 (95% CI: 0.97–1), respectively (Fig. 3c, d). The combined model outperformed the individual feature models across all four cohorts. Detailed information on the performance of CNA in distinguishing the different groups is shown in Supplementary Table 1.
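
For readers less familiar with the AUC values reported here, the following is a minimal sketch of how an AUC can be computed from two groups of model scores, using the rank-based (Mann-Whitney) interpretation. The function name and the example scores are hypothetical and are not study data; the sketch does not reproduce the confidence-interval calculation.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive case scores higher
    than a random negative case; ties count as half a win."""
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores for illustration only (not study data).
cancer_scores = [0.91, 0.85, 0.40, 0.77]
non_cancer_scores = [0.10, 0.35, 0.45, 0.20]
auc = roc_auc(cancer_scores, non_cancer_scores)  # 15 of 16 pairs are ordered correctly
```

An AUC of 1 would mean every cancer case scores above every non-cancer case, and 0.5 corresponds to chance-level ranking.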

Fig. 3: Performance evaluation of NF, motif, fragment features, and combined model (PCM score) in classification of pancreatic cancer to non-cancer subgroups.

a ROC curve analysis for the NF, motif, fragment and combined model (PCM score) in Training cohort. b ROC curve analysis for the NF, motif, fragment or combined model in Testing cohort. c External validation cohort 1. d External validation cohort 2. Source data are provided as a Source Data file.

Our combined model (PCM score) distinguished pancreatic cancer from healthy controls (HC) with an AUC of 0.990 (95% CI: 0.983–0.997) in the Combined cohort (Testing cohort plus two External Validation cohorts) (Fig. 4a), and resectable-stage (stage I/II) pancreatic cancer from healthy controls with an AUC of 0.994 (95% CI: 0.989–0.999) in the Combined cohort (Fig. 4b). The PCM score distinguished pancreatic cancer from PBT with an AUC of 0.886 (95% CI: 0.835–0.936), compared with an AUC of 0.819 (95% CI: 0.755–0.883) for CA19-9 (Fig. 4c). The model distinguished CA19-9-negative pancreatic cancer from HC with an AUC of 0.990 (95% CI: 0.977–1) (Fig. 4d).

Fig. 4: Performance of NF, motif, fragment features and combined model (PCM score) in the diagnosis of pancreatic cancer.

a ROC curve analysis for the NF, motif, fragment or combined model (PCM score) in distinguishing pancreatic cancer and HC. b ROC curve analysis for the NF, motif, fragment or combined model (PCM score) in distinguishing early stage (stage I,II) of pancreatic cancer and HC. c ROC curve analysis for the NF, motif, fragment, combined model (PCM score), and CA19-9 in distinguishing pancreatic cancer and PBT. d ROC curve analysis for the NF, motif, fragment, and combined model (PCM score) in distinguishing CA19-9 negative pancreatic cancer and HC. Source data are provided as a Source Data file. HC healthy control.

The performance of the PCM score for staged pancreatic cancer versus non-cancer (including PBT, CP, and HC) and pancreatic cancer versus healthy is summarized in Table 1. Additionally, Table 2 compares the performance of the PCM score and CA19-9 in differentiating staged pancreatic cancer from benign pancreatic diseases (PBT and CP). As shown in Table 2, the PCM score outperformed CA19-9 across both the Testing cohort and the two External Validation cohorts. Notably, the PCM scoring system demonstrated a superior ability to accurately differentiate early-stage pancreatic cancer compared to CA19-9, highlighting its potential as a more reliable diagnostic tool.

Table 1 Performance of PCM score in the diagnosis of pancreatic cancer
Table 2 Performance of PCM score in the diagnosis of pancreatic cancer compared with CA19-9 (Pancreatic cancer vs PBT + CP)

The PCM score demonstrated high sensitivity in detecting pancreatic cancer, with positive detection rates of 92% for PDAC patients and 100% for ASCP patients. In contrast, the positive detection rates for PBT subtypes and HC were below 40% (Supplementary Fig. 4). Additionally, the PCM score was significantly higher in pancreatic cancer cases compared to non-cancer groups (Supplementary Fig. 5). Plasma samples from patients with other cancer types revealed that the Logistic score was notably elevated in pancreatic cancer compared to both other cancers and HC (Supplementary Fig. 6). When combining CA19-9 with the PCM score, the diagnostic performance improved further. The PCM score and CA19-9 combination distinguished pancreatic cancer from PBT and CP with AUCs of 0.936, 0.968, and 0.864 in the Training, Testing, and External Validation cohorts, respectively, compared to AUCs of 0.888, 0.942, and 0.841 for the PCM score alone (Supplementary Fig. 7a). This combination also exhibited superior performance in identifying early-stage (stage I and II) pancreatic cancer from PBT and CP (Supplementary Fig. 7b) and in distinguishing pancreatic cancer from CP (Supplementary Fig. 7c).

Establishment of PCP score

We investigated the relationship between cfDNA features and prognosis in pancreatic cancer using both the Training cohort and the Combined cohort (which included the Testing cohort and two External Validation cohorts). Using end motif, fragment, and nucleosome footprint features, we developed a prognostic model and introduced the Pancreatic Cancer Prognostic (PCP) score. Kaplan–Meier survival analyses were conducted for both cohorts based on the PCP score. The log-rank test showed a significant difference in median overall survival between the high and low PCP score groups in both the Training cohort (p < 0.0001) and the Combined cohort (p < 0.0001) (Fig. 5a, b). Similarly, recurrence-free survival was significantly longer in the low PCP score group than in the high PCP score group in both the Training cohort (p < 0.0001) and the Combined cohort (p < 0.0001) (Fig. 5c, d). When patients who died or experienced recurrence within 1 year were defined as high-risk, External Validation cohort 1 contained 19 high-risk patients, of whom 17 were correctly identified by our threshold, an identification accuracy of 89.5%. External Validation cohort 2 contained 18 high-risk patients, all of whom were correctly identified by our threshold, an accuracy of 100%. The associations of the PCP score and clinicopathological characteristics with overall survival and recurrence-free survival are shown in Supplementary Tables 2 and 3.
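
As an illustration of the survival analysis described above, the following is a minimal sketch of the Kaplan–Meier product-limit estimator, followed by the identification-rate arithmetic for External Validation cohort 1. The function name and the toy follow-up data are hypothetical; the log-rank test and the PCP threshold itself are not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up duration per patient (for example, in days)
    events : 1 if the event (death or recurrence) was observed, 0 if censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events occurring exactly at time t
        leaving = 0    # patients leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy follow-up data for illustration only (not study data).
toy_curve = kaplan_meier(times=[5, 8, 8, 12], events=[1, 0, 1, 1])

# Identification rate reported for External Validation cohort 1: 17 of 19.
rate_cohort1 = round(100 * 17 / 19, 1)
```

In practice a survival analysis library would normally be used; the sketch only makes the estimator's bookkeeping explicit.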

Fig. 5: Kaplan–Meier analyses of PCP score.

a Kaplan–Meier analyses of PCP score with overall survival in Training cohort. b Kaplan–Meier analyses of PCP score with overall survival in Combined cohort (Testing cohort + External validation cohort 1 + External validation cohort 2). c Kaplan–Meier analyses of PCP score with recurrence-free survival in Training cohort. d Kaplan–Meier analyses of PCP score with recurrence-free survival in Combined cohort. In all panels, the log-rank test was used to compare the survival distributions of the two groups. Source data are provided as a Source Data file.

Discussion

Pancreatic cancer is notorious for its high malignancy and poor prognosis39, with the majority of patients diagnosed at an advanced stage due to the lack of early symptoms. As such, early detection is crucial for reducing mortality rates. Currently, the blood-based marker CA19-9 is the most widely used biomarker for pancreatic cancer diagnosis. However, its relatively low sensitivity (79%–81%) and specificity (82%–90%) limit its effectiveness, particularly in early-stage detection40.

CfDNA levels are elevated in pancreatic cancer16, making cfDNA a potential diagnostic biomarker. CfDNA offers several advantages: its detection technology is well established, and its relative stability enables consistent testing30,31,32. Several studies have investigated various cfDNA-based features, such as fragmentomics, mutations, and methylation, to develop diagnostic models for pancreatic cancer. For example, Bie et al. adapted an enzyme-mediated methylation sequencing method and developed a model for cancer detection that integrates genome-wide cfDNA methylation, fragmentation, and copy number alteration (CNA) characteristics41. Ju et al. investigated cfDNA fragmentomic characteristics against nucleosome positioning patterns in hematopoietic cells and developed a cancer diagnostic model based on cfDNA fragmentomic metrics42. Christopher et al. developed A-plus, which enhanced sensitivity over that achieved with aneuploidy alone at matched specificities43. DNA methylation can also affect the length of cfDNA fragments: An et al. found that DNA methylation might regulate cfDNA fragmentation and developed a cfDNA end-preference-based metric for cancer diagnosis44. Another study using methylation-based cfDNA features constructed a four-gene methylation panel with a sensitivity of 100% and a specificity of 90%27. Additionally, Liu et al. enrolled patients with more than 50 types of cancer (including pancreatic cancer) and, using methylation signatures in cfDNA, achieved high sensitivity in detecting early-stage pancreatic cancer45. Another study used a cfDNA methylation signature and achieved a sensitivity of 83.7% in detecting pancreatic cancer46. Combining cfDNA methylation markers with protein biomarkers, such as CA19-9 and TIMP1, significantly improved diagnostic accuracy47. Zill et al. conducted a prospective analysis of five genes (KRAS, TP53, APC, FBXW7, and SMAD4) in tumor tissues and ctDNA from 26 pancreatic cancer patients; the diagnostic accuracy of ctDNA sequencing was 97.7%, with an average sensitivity of 92.3% and a specificity of 100% for the five genes48.

In this study, we developed a cfDNA-based diagnostic and prognostic model using four different cfDNA features: fragment length, nucleosome footprint, end motif, and CNA. These features demonstrated significant differences among groups, with shorter cfDNA fragment lengths observed in pancreatic cancer patients compared to those with benign pancreatic tumors, suggesting increased cfDNA fragmentation with tumor malignancy. The PCM score effectively distinguished between pancreatic cancer and PBT, as well as early-stage pancreatic cancer from healthy individuals. Importantly, the cfDNA features correlated with prognosis, with a high PCP score indicating high risk.

Previous studies on early pancreatic cancer diagnosis have focused primarily on PDAC and have excluded benign pancreatic tumors. However, distinguishing between malignant and benign pancreatic tumors is challenging using imaging techniques and often requires pathological confirmation. Traditional liquid biopsy methods, including CA19-9, perform poorly in distinguishing pancreatic cancer from benign tumors. In our analysis, the AUC of CA19-9 for differentiating pancreatic cancer from PBT was 0.819, with 26.7% of pancreatic cancer patients testing negative and 19.1% of chronic pancreatitis patients testing positive for CA19-9. Misdiagnosis based on CA19-9 alone is a significant concern, as elevated levels are observed in many benign conditions. By incorporating cfDNA features, our model achieved an AUC of 0.886 for distinguishing pancreatic cancer from PBT, representing a promising approach for differentiating pancreatic cancer from other pancreatic diseases.

Although cfDNA has shown promise in early pancreatic cancer detection, other biomarkers, such as circulating tumor cells (CTCs) and ctDNA, have also been explored. However, the low abundance of CTCs in early-stage cancer and the lack of validated biomarkers for cell selection limit their utility49. Similarly, ctDNA is unstable and present at low concentrations in early-stage cancer, further constraining its diagnostic potential15. As a result, current techniques for using ctDNA as a standalone diagnostic marker for early-stage pancreatic cancer remain insufficiently developed50. Other groups have used different biomarkers for the diagnosis of pancreatic cancer; for instance, some studies have investigated extracellular vesicle long RNA51 and exosomal microRNAs52. Based on extracellular vesicle long RNA profiling, Shulin Yu et al. developed a d-signature model for PDAC detection that identified early-stage pancreatic cancer (stage I/II) with an AUC of 0.94951. Compared with other studies, our study has the following advantages: (1) whereas others often focus on PDAC, we also collected pancreatic benign tumors, so that we not only detect pancreatic cancer but also differentiate between cancer and non-cancer cases; (2) we validated our model in multicenter cohorts; (3) the PCP score is associated with overall survival, allowing for prognostic prediction; and (4) our study used four different types of cfDNA features, allowing a more comprehensive reflection of the differences in cfDNA among populations. However, our study has limitations. First, it was a retrospective study and lacked a prospective cohort. Second, although our multicenter cohort covered patients from different regions of China, extending validation to other countries or ethnic populations would enhance the model’s applicability.

Pancreatic cancer patients generally have a poor prognosis. Our PCP score, based on cfDNA features, was associated with survival outcomes, with higher scores indicating worse prognosis, which confirms that cfDNA features are related to prognosis. However, we did not investigate which specific features contribute most to poor outcomes; this is an avenue for future research.

In conclusion, we developed a cfDNA-based diagnostic and prognostic model for pancreatic cancer, validated across multiple independent cohorts. Our PCM score system, integrating CNA, NF, fragmentation, and end motif features, demonstrated high accuracy in distinguishing malignant from benign conditions and was predictive of patient outcomes. Additionally, combining PCM score with CA19-9 significantly improved diagnostic performance, reinforcing the importance of CA19-9 as a biomarker in pancreatic cancer diagnosis.

Methods

Patients

From April 2021 through November 2021, we retrospectively collected a total of 975 cases for this study. Eight cases were excluded according to the eligibility criteria (Fig. 2), leaving 967 cases for analysis: 422 with pancreatic cancer or PBT, 47 with CP, and 498 healthy controls. Among the patients, 370 were recruited from Changhai Hospital (Shanghai, China), 45 from The Affiliated Hospital of Qingdao University (Shandong Province, China), and 54 from The Second Affiliated Hospital of Shandong University and The Second Hospital, Cheeloo College of Medicine, Shandong University (Shandong Province, China). Healthy controls were recruited at five geographic centers during regular physical examinations and had no history of pancreatic or other systemic diseases. The size of the training cohort was determined to provide a power of 80% at a two-sided type I error rate of 0.05, which required at least 174 participants per group (actual enrollment: 432 participants). Detailed information on all patients and healthy controls is listed in Supplementary Data 1. The institutional review boards of all participating hospitals reviewed and approved the study protocol. We followed up all pancreatic cancer patients until January 2023, with a median follow-up duration of 443 days.

Plasma sample collection and cfDNA isolation

Blood samples were collected from patients and healthy controls in 10 ml EDTA-coated Vacutainer tubes. For all patients enrolled in our study, blood samples were collected before treatment. Plasma was separated by centrifugation at 1600 × g for 10 min and then at 16,000 × g for 10 min (Eppendorf 5810 R/5427 R, Germany), and the plasma samples were stored at −80 °C. The MagMAX Cell-Free DNA Isolation Kit (Thermo Fisher Scientific, USA) was used to isolate cfDNA according to the manufacturer’s instructions on a DNA purification instrument (Thermo Kingfisher FLEX, USA). The concentration of the DNA product was then measured with a Qubit3 Fluorometer (Thermo, USA), and the size of the DNA fragments was determined with a fragment analyzer (Agilent, USA). The research protocol was approved by the Shanghai Changhai Hospital Ethics Committee (CHEC2018-112) and the Research Ethics Committee of The Second Hospital of Shandong University (KYLL2024446), and written informed consent was provided by every participant.

Whole-genome sequencing and data processing

Sequencing libraries were prepared using 5 ng DNA. DNA samples were subjected to end-repair/dA-tailing (5X ER/A-Tailing Enzyme Mix) and adaptor ligation (WGS Ligase). The adaptor sequence was specifically designed for the Illumina CN500 platform. After purification with Agencourt AMPure XP beads (Beckman Coulter, USA), libraries were quantified with the KAPA Library Quantification Kit (Kapa Biosystems, USA), and their size was confirmed using a Bioanalyzer (Agilent, USA). Sequencing libraries were pooled in equal amounts. WGS at an average coverage of 1.5X was performed on the Illumina CN500 platform using 2 × 36 bp paired-end sequencing.

Fastq files were processed with fastp (https://github.com/OpenGene/fastp) to remove adaptor and end sequences, together with sequences shorter than 25 bp, to obtain clean data. Clean data were aligned to the human reference genome GRCh37 using bwa-aln (https://github.com/lh3/bwa). Duplicate reads were marked with sambamba (https://github.com/biod/sambamba/). Samtools (http://samtools.sourceforge.net/) was used to calculate the mapping rate, duplicate rate, and genome coverage. Samples with a mapping rate above 90%, a duplicate rate below 25%, and coverage above 50% passed quality control. The bam files were further filtered with Samtools to remove unmapped reads, low-quality reads, marked duplicates, and read pairs without a perfect match between read 1 and read 2.
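
The quality-control thresholds above can be expressed as a simple per-sample filter. This is a minimal sketch, assuming the three metrics have already been computed and are supplied as fractions between 0 and 1; the function name and the example values are hypothetical.

```python
def passes_qc(mapping_rate, duplicate_rate, genome_coverage):
    """Apply the three thresholds stated in the text.

    All arguments are fractions in [0, 1]; the comparisons are strict
    ("above 90%", "below 25%", "above 50%").
    """
    return (
        mapping_rate > 0.90
        and duplicate_rate < 0.25
        and genome_coverage > 0.50
    )


# Hypothetical metrics for illustration only (not study data).
samples = {
    "sample_A": (0.97, 0.08, 0.62),
    "sample_B": (0.97, 0.30, 0.62),  # fails on duplicate rate
    "sample_C": (0.88, 0.05, 0.70),  # fails on mapping rate
}
passed = [name for name, metrics in samples.items() if passes_qc(*metrics)]
```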

We conducted low-pass whole-genome sequencing on all collected samples. The sequencing data allowed us to analyze multiple features, including CNA, nucleosome footprint, end motif, and fragmentation. Individuals from each cohort were randomly assigned to the training and testing cohorts. During training, we applied the LASSO regression algorithm to each genomic feature to reduce dimensionality and extract markers, and then used the SVM algorithm to build the optimal model for each genomic feature. Finally, a logistic model was used to integrate the three genomic features and CNA.

The feature selection procedure was as follows; all steps were conducted in the Training cohort.

Procedure for quantifying fragments33,53

  1. The whole genome was divided into 3055 regions, each 1 Mbp in length.

  2. After alignment to the reference genome, the genomic location of each DNA fragment was identified and corrected.

  3. Fragments between 90 and 150 bp were defined as short fragments, and those between 151 and 220 bp were defined as long fragments.

  4. The ratio of short fragments to long fragments was calculated.

Feature selection results: Regions on the Y chromosome and regions with no detected DNA fragment coverage were removed from the initial 3055 regions, leaving 2890 regions. LASSO was then used for feature selection, resulting in 154 regions, which were used to construct the model.
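
The fragment quantification steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: fragments are supplied as hypothetical `(chrom, start, end)` tuples, each fragment is assigned to a 1 Mbp region by its start coordinate, the length bounds are treated as inclusive, and regions with no long fragments are omitted to avoid division by zero.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000      # 1 Mbp regions, as in the text
SHORT_RANGE = (90, 150)   # short fragments, inclusive bounds (assumption)
LONG_RANGE = (151, 220)   # long fragments, inclusive bounds (assumption)


def short_long_ratio(fragments, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): short/long ratio} for each region.

    fragments: iterable of (chrom, start, end) with end - start = length.
    """
    counts = defaultdict(lambda: [0, 0])  # [n_short, n_long] per region
    for chrom, start, end in fragments:
        length = end - start
        key = (chrom, start // bin_size)  # assumption: bin by start position
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
        # fragments outside both ranges are ignored
    return {k: s / l for k, (s, l) in counts.items() if l > 0}


# Hypothetical fragments for illustration only (not study data).
example = [
    ("chr1", 100, 220),              # 120 bp, short
    ("chr1", 500, 680),              # 180 bp, long
    ("chr1", 900, 1090),             # 190 bp, long
    ("chr1", 1_000_500, 1_000_640),  # 140 bp, short, second region
    ("chr1", 1_200_000, 1_200_200),  # 200 bp, long, second region
    ("chr1", 10, 60),                # 50 bp, ignored
]
ratios = short_long_ratio(example)
```

The resulting per-region ratios form the feature vector that would then be passed to feature selection.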

Procedure for quantifying end motif36,54

  1. The DNA fragments were aligned to the reference genome to determine the start and end positions of each fragment, and correction was performed.

  2. The 4-mer nucleotide sequence at the 5′ end of each fragment was counted.

  3. There are 256 possible 4-mer sequences; the proportion of each was calculated.

Feature selection: LASSO was used for feature selection, resulting in 33 motifs, which were used to construct the model.
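
The end motif quantification can be sketched as below. This is a minimal illustration, assuming the 5′ end sequence of each fragment has already been extracted as a string; the function name is hypothetical, and skipping ends that contain ambiguous bases (such as N) is an assumption rather than something stated in the text.

```python
from collections import Counter
from itertools import product

# All 4**4 = 256 possible 4-mer motifs, in a fixed order.
ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]


def end_motif_proportions(five_prime_ends):
    """Return {motif: proportion} over all 256 4-mers.

    five_prime_ends: iterable of strings whose first 4 bases are the
    5' end of a fragment. Ends shorter than 4 bases or containing
    non-ACGT characters are skipped (assumption).
    """
    counts = Counter()
    for seq in five_prime_ends:
        motif = seq[:4].upper()
        if len(motif) == 4 and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid 4-mer end motifs found")
    return {m: counts[m] / total for m in ALL_MOTIFS}


# Hypothetical fragment ends for illustration only (not study data).
props = end_motif_proportions(["CCCAGT", "CCCATT", "ACGTAA", "NNNNAA", "ccca"])
```

Motifs that never occur receive a proportion of 0, so every sample yields a vector of the same length.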

Procedure for quantifying nucleosome footprint13,55

  1. Acquisition of the promoter regions of the whole genome: Promoter regions were identified using the transcription start sites (TSS) of the main transcripts of reference genes published in the UCSC database (https://genome.ucsc.edu/), with 2500 bp extended upstream and downstream as the promoter region of the gene.

  2. Defining the central and peripheral regions: The central region of the promoter was defined as the 250 bp immediately adjacent to the transcription start site of the gene, while the peripheral region extended 2500 bp on either side of the central region. This classification is based on the observation that actively transcribed genes tend to have sparser nucleosome distribution near the TSS, making the DNA there more susceptible to degradation once in the bloodstream. As a result, sequencing depth is expected to be lower in the central region than in the peripheral region.

  3. Obtaining region coverage and sequencing depth: The software “Bedtools” (v1.6.2) was used to calculate region coverage, while “featureCounts” (v2.19.1) was used to determine the sequencing depth for each region, which was then converted to FPKM.

  4. Quantifying the differences in nucleosome distribution for each gene: The nucleosome distribution difference score was calculated as the sequencing depth of the peripheral region (FPKM) minus that of the central region (FPKM). This score represents the distribution of nucleosomes and the transcriptional activity of the gene.

  5. Gene filtering and model construction: In the Training cohort, after removing housekeeping genes and silenced genes from the whole genome, 20315 genes remained. Genes covered in at least 90% of the samples were retained, and the rank-sum test was applied to calculate p-values. Genes with p ≤ 0.01 were retained, resulting in 428 genes. LASSO was then used for dimensionality reduction, ultimately selecting 102 genes for model construction.
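
The nucleosome distribution difference score can be sketched as follows. This is a minimal illustration using the standard FPKM definition (fragments per kilobase of region per million mapped fragments). The function names are hypothetical, and the region lengths are left as parameters because the exact layout of the central and peripheral windows is not fully specified; the defaults of 250 bp and 5000 bp are illustrative assumptions only.

```python
def fpkm(fragment_count, region_length_bp, total_mapped_fragments):
    """Fragments per kilobase of region per million mapped fragments."""
    if region_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("region length and library size must be positive")
    return fragment_count * 1e9 / (region_length_bp * total_mapped_fragments)


def nf_score(central_count, peripheral_count, total_mapped_fragments,
             central_bp=250, peripheral_bp=5000):
    """Peripheral FPKM minus central FPKM for one gene promoter.

    A larger score means relatively lower depth near the TSS, which the
    text associates with sparser nucleosomes and active transcription.
    central_bp and peripheral_bp defaults are assumptions for illustration.
    """
    central = fpkm(central_count, central_bp, total_mapped_fragments)
    peripheral = fpkm(peripheral_count, peripheral_bp, total_mapped_fragments)
    return peripheral - central


# Hypothetical counts for illustration only (not study data).
score = nf_score(central_count=5, peripheral_count=200,
                 total_mapped_fragments=1_000_000)
```

With these toy counts the central region has an FPKM of 20 and the peripheral region an FPKM of 40, giving a score of 20.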

Procedure for CNA score calculation56

The human genome was divided into 20-kbp bins. To avoid the high variation in CNA calls associated with small bins, adjacent bins that met the requirements were merged; a certain margin of error was allowed during merging, and only merged regions of at least 2 Mbp were reported, with shorter regions filtered out. The average sequencing depth of each bin was calculated and corrected for GC content (the average depth of bins at each GC content was calculated, and the overall average depth of all bins was then used to correct the sequencing depth). For each region, a baseline was established from the mean and variance of the average sequencing depth in healthy individuals in the training cohort, and these values were used to calculate a Z-score for the region. Based on the distribution of Z-scores in healthy individuals in the training cohort, a Z-score greater than 2 or less than −2 was defined as the threshold for a significant difference: regions with a Z-score greater than 2 were called copy number amplifications, and those with a Z-score less than −2 were called copy number deletions. Adjacent regions with the same direction of copy number alteration were merged, with a tolerance set for regions not covered by sequencing data. A CNA region was reported only if at least 70% of it showed copy number alteration in the same direction and its length was greater than 2 Mbp. The tumor suppressor gene (TSG) and oncogene (OG) CNA scores were then calculated using the previously reported equation56. The equation for the CNA score is shown in Supplementary Table 4.
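
The Z-score thresholding and merging logic can be sketched as below. This is a simplified illustration under stated assumptions, not the study's implementation: depths are assumed to be GC-corrected already, the Z-score uses the sample standard deviation of the healthy reference, only strictly contiguous bins in the same direction are merged (the 70% tolerance for uncovered or discordant bins is omitted), and all function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def bin_zscores(sample_depths, healthy_depths):
    """Z-score of each bin against a healthy reference.

    sample_depths : list of depths, one per bin, for the test sample
    healthy_depths: list of per-sample depth lists for healthy controls
    """
    zscores = []
    for i, depth in enumerate(sample_depths):
        reference = [h[i] for h in healthy_depths]
        mu, sd = mean(reference), stdev(reference)
        zscores.append((depth - mu) / sd if sd > 0 else 0.0)
    return zscores


def call_segments(zscores, threshold=2.0, bin_size=20_000,
                  min_length=2_000_000):
    """Merge contiguous bins with the same CNA direction and keep
    segments of at least min_length. Returns (start_bin, end_bin, kind)
    with end_bin exclusive."""
    segments = []
    start, direction = 0, 0
    # A trailing neutral value closes a run that reaches the last bin.
    for i, z in enumerate(list(zscores) + [0.0]):
        d = 1 if z > threshold else (-1 if z < -threshold else 0)
        if d != direction:
            if direction != 0 and (i - start) * bin_size >= min_length:
                segments.append((start, i, "gain" if direction > 0 else "loss"))
            start, direction = i, d
    return segments


# Toy data for illustration only (not study data).
toy_z = bin_zscores([18, 25], [[10, 20], [12, 20], [14, 20]])
toy_segments = call_segments([0] * 5 + [3] * 100 + [0] * 5 + [-3] * 50 + [0] * 5)
```

In the toy profile, the 100-bin gain spans 2 Mbp and is reported, while the 50-bin loss spans only 1 Mbp and is filtered out.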

The LASSO implementation process was as follows. A penalty function was constructed to obtain a more refined model in which some coefficients are compressed and others are set to zero: variables with large parameter estimates are compressed less, while variables with small parameter estimates are compressed to 0, which achieves feature dimensionality reduction. This process was implemented using the LassoCV() function of the ‘sklearn’ package in Python. The LASSO inputs for NF, fragment, and end motif features were derived exclusively from the training cohort samples. For the NF input, we used the expression values of each gene per sample after filtering for a p-value less than 0.01. The fragment input was defined as the ratio of short to long fragments within each genomic region per sample, also filtered by a p-value threshold of <0.01. For the motif input, we calculated the proportion of various end motifs in each sample after filtering for a p-value less than 0.05. Features with non-zero coefficients in the ‘lasso.coef’ output were retained as the final selected features after dimensionality reduction.
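To illustrate why an L1 penalty sets some coefficients exactly to zero, the Python 3 sketch below applies the soft-thresholding operator, which gives the LASSO solution in the textbook special case of orthonormal feature columns. It is not the LassoCV() procedure used in the study; the feature names, coefficient values, and penalty are made up for illustration.

```python
def soft_threshold(beta, lam):
    """Shrink an unpenalized coefficient toward zero by lam, clipping at zero."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0


def select_features(unpenalized_coefs, lam):
    """Return {feature: shrunken coefficient}, keeping non-zero coefficients only."""
    shrunk = {name: soft_threshold(b, lam) for name, b in unpenalized_coefs.items()}
    return {name: b for name, b in shrunk.items() if b != 0.0}


# Hypothetical unpenalized estimates for four features.
coefs = {"gene_a": 2.0, "gene_b": -1.5, "gene_c": 0.25, "gene_d": -0.125}
selected = select_features(coefs, lam=0.5)
# gene_a and gene_b are shrunk but kept; gene_c and gene_d are set to zero and dropped.
```

Retaining only the features with non-zero coefficients mirrors the selection rule stated in the text.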

The support vector machine (SVM) method was implemented for individual genomic feature-based model construction, based on three parameters: (1) C: Penalty coefficient; (2) Kernel function; and (3) Gamma. The input was the sample data of the training cohort, with the features of each dimension filtered by LASSO dimensionality reduction. The GridSearchCV() function of the ‘sklearn’ package in Python was used to find the optimal combination of the three parameters in the training cohort. The optimal combination was determined by 10-fold cross-validation, that is, the training cohort samples were divided into 10 equal parts, of which 9 were used for model fitting and the remaining 1 was used to verify performance. Each dimension was trained separately to determine its optimal combination of parameters. The identified optimal parameters were applied directly to the independent validation cohort samples. Finally, the predicted value of each sample in each of the three dimensions was the output.
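The 10-fold split and the enumeration of parameter combinations can be sketched in plain Python 3 as follows. The grid values are hypothetical (the text does not list the candidate values), and the splitter does not shuffle samples; the study itself relied on GridSearchCV() rather than code like this.

```python
from itertools import product


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, valid_idx) pairs for k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


# Hypothetical candidate values for the three SVM parameters named in the text.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": [0.01, 0.1],
}
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
```

Each combination would be fitted on the 9 training folds and scored on the held-out fold, and the combination with the best average score across the 10 folds would be kept.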

Construction of PCM score

We integrated the fragment model, motif model, NF model, and CNA score using logistic regression to construct a combined method, the PCM scoring system (Fig. 2). The PCM score includes three components. The first component is the CNA score, which was described above. The second component is a logistic regression formula that allows the biomarkers, as a group, to discriminate between pancreatic cancer and non-pancreatic cancer cases. The Wilcoxon rank-sum test was used to compare the two datasets, pancreatic cancer vs non-pancreatic cancer. LASSO was applied for feature selection in the Training cohort. Features used for model construction are shown in Supplementary Tables 5–7. Data were normalized using Z-scores in Python. Support vector machine (SVM) was implemented for individual genomic feature-based model construction, based on three parameters: (1) C: Penalty coefficient; (2) Kernel function; and (3) Gamma. For the Training cohort, 10-fold cross-validation was employed to find the best combination of the parameters. The cutoff value was set at the point with the best diagnostic accuracy in the testing cohort. To obtain the best diagnostic model, a logistic regression model was generated using the results of the three individual models as input features. The Logistic Score was calculated as below.

$$\mathrm{Logistic\ Score} = \exp(\mathrm{Z})/(1+\exp(\mathrm{Z})), \quad \mathrm{where}\ \mathrm{Z} = -4.58 + (2.13 \times \mathrm{NF}) + (3.26 \times \mathrm{Motif}) + (2.85 \times \mathrm{Fragment})$$
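The Logistic Score formula translates directly into Python 3, as sketched below. The coefficients are those given in the equation; the function name and the assumption that the three inputs are the per-sample outputs of the NF, motif, and fragment models are illustrative.

```python
import math


def logistic_score(nf, motif, fragment):
    """Logistic Score from the three individual model outputs."""
    z = -4.58 + 2.13 * nf + 3.26 * motif + 2.85 * fragment
    return math.exp(z) / (1 + math.exp(z))
```

The score always lies strictly between 0 and 1 and increases with each of the three inputs, since all three coefficients are positive.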

The third component is individual genomic feature score (Single Score), calculated with the below formula.

$$\mathrm{Single\ Score} = \sum_{i \in \{\mathrm{NF},\, \mathrm{Motif},\, \mathrm{Fragment}\}} 0.25 \times \left(\mathrm{sign}(\mathrm{score}_{i} - \mathrm{cutoff}_{i}) + 1\right)$$
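A short Python 3 sketch of the Single Score follows. The per-feature cutoffs are not stated in this section, so the values in the example are placeholders; `sign` is taken to return −1, 0, or 1.

```python
FEATURES = ("NF", "Motif", "Fragment")


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def single_score(scores, cutoffs):
    """Sum 0.25 * (sign(score_i - cutoff_i) + 1) over the three features."""
    return sum(0.25 * (sign(scores[f] - cutoffs[f]) + 1) for f in FEATURES)


# Placeholder cutoffs and scores, for illustration only.
example_cutoffs = {"NF": 0.5, "Motif": 0.5, "Fragment": 0.5}
example_scores = {"NF": 0.9, "Motif": 0.2, "Fragment": 0.5}
example = single_score(example_scores, example_cutoffs)
```

Each feature contributes 0.5 when its score exceeds its cutoff, 0.25 when it equals the cutoff, and 0 otherwise, so the Single Score ranges from 0 to 1.5.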

Finally, the CNA Score, Logistic Score, and Single Score were combined in a multivariate linear equation to generate the final PCM Score. The optimal cutoff of the PCM Score was 0.75, determined by Youden’s index. A PCM Score ≥ 0.75 was regarded as positive; otherwise the result was negative.

$$\mathrm{PCM\ Score} = 0.5 \times \left(\mathrm{sign}(\mathrm{Logistic\ Score} - \mathrm{cutoff}_{\mathrm{Logistic\ Score}}) + 1\right) + 0.5 \times \left(\mathrm{sign}(\mathrm{CNA\ Score} - \mathrm{cutoff}_{\mathrm{CNA\ Score}}) + 1\right) + \mathrm{Single\ Score}$$
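The PCM Score equation and the 0.75 decision rule can be sketched in Python 3 as below. The cutoffs for the Logistic Score and CNA Score are not given in this section, so they are passed in as parameters; the function names are illustrative.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pcm_score(logistic, cna, single, logistic_cutoff, cna_cutoff):
    """Combine the Logistic Score, CNA Score and Single Score into the PCM Score."""
    return (
        0.5 * (sign(logistic - logistic_cutoff) + 1)
        + 0.5 * (sign(cna - cna_cutoff) + 1)
        + single
    )


def pcm_call(score, cutoff=0.75):
    """Apply the reported rule: PCM Score >= 0.75 is positive."""
    return "positive" if score >= cutoff else "negative"
```

Because the Single Score ranges from 0 to 1.5 and the other two terms each contribute between 0 and 1, the PCM Score ranges from 0 to 3.5.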

The detailed equation for calculating the PCM score is shown in Supplementary Table 4.

The equation of the model combining PCM with CA19-9 was: PCM score + log10(CA19-9), with CA19-9 measured in U/ml.
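As a small worked example in Python 3, the combined value can be computed as below. The function name is hypothetical, and rejecting non-positive CA19-9 values is an assumption, since the text does not say how such values were handled.

```python
import math


def pcm_with_ca199(pcm_score, ca199_u_per_ml):
    """Return PCM score + log10(CA19-9), with CA19-9 given in U/ml."""
    if ca199_u_per_ml <= 0:
        # Assumption: the text does not specify handling of non-positive values.
        raise ValueError("CA19-9 must be positive to take log10")
    return pcm_score + math.log10(ca199_u_per_ml)
```

For instance, a PCM score of 1.5 with a CA19-9 level of 100 U/ml gives 1.5 + 2 = 3.5.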

Construction of PCP score

We constructed the PCP scoring system using fragment, motif, and NF features of cfDNA. Pancreatic cancer patients with follow-up data were included in the analysis. Samples were separated into two groups: patients with recurrence or death within 1 year were classified as high-risk, while those without recurrence or death were classified as low-risk. Features with significant p-values (p < 0.01) were retained and then further selected with LASSO. The samples in the Training cohort were comparable to those used in the PCM score model. Due to the absence of prognostic information in some samples, the remaining samples were grouped into the Combined cohort. The selected features used for model construction are listed in Supplementary Tables 8–10. Each of the three indicators was modeled independently using SVM, and the final integration was achieved through logistic regression.

$$\mathrm{PCP\ score} = \exp(\mathrm{Z})/(1+\exp(\mathrm{Z})), \quad \mathrm{where}\ \mathrm{Z} = -1.97 + (2.75 \times \mathrm{NF}) + (0.47 \times \mathrm{Motif}) + (0.51 \times \mathrm{Fragment})$$
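The PCP score can be computed with the same logistic form, sketched here in Python 3 with the coefficients from the equation. The function name and the interpretation of the inputs as per-sample outputs of the three SVM models are assumptions for illustration.

```python
import math


def pcp_score(nf, motif, fragment):
    """PCP score from the NF, motif and fragment model outputs."""
    z = -1.97 + 2.75 * nf + 0.47 * motif + 0.51 * fragment
    return math.exp(z) / (1 + math.exp(z))
```

With these coefficients, a unit change in the NF input moves Z by 2.75, compared with 0.47 and 0.51 for the motif and fragment inputs.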

Statistical analysis

The Wilcoxon rank-sum test was applied to compare continuous variables between two groups, and Fisher’s exact test was applied to categorical variables. P values were calculated using Python (version 2.7.14), and p < 0.05 was considered statistically significant. The area under the curve (AUC) was used to evaluate model performance. ROC curves were generated using the ‘pROC’ package (v1.16.2) in R (v3.6.3); ‘datatable’ (v1.14.2) was used to process the data and ‘pwr’ (v1.3.0) was used to perform the power analysis, also in R (v3.6.3). Survival curves were generated according to the Kaplan–Meier method and compared using the log-rank test. LASSO and SVM algorithms were run with ‘sklearn’ in Python (version 2.7.14).
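For readers who want to see what the AUC and a Youden's-index cutoff measure, a minimal Python 3 sketch is given below. It is not the ‘pROC’ workflow used in the study; the function names and the toy data are illustrative, and labels are assumed to be 1 for cancer and 0 for non-cancer.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1 for score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j


# Toy data, for illustration only.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
```

When several cutoffs tie on J, this sketch keeps the lowest one; other tie-breaking conventions are equally defensible.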

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.