Introduction

Lung cancer is the leading cause of cancer-related mortality worldwide, causing 1.8 million deaths in 2022 and accounting for 18.1% of all cancer-related deaths1. The five-year survival rate for lung cancer patients is only 20%, and this exceptionally high mortality can be attributed to late diagnosis2. In contrast, the five-year survival rate can reach 70–90% when non-small cell lung cancer (NSCLC) is detected early3. Nevertheless, approximately 75% of NSCLC cases are diagnosed at advanced stages2. Early detection therefore plays a critical role in reducing lung cancer mortality.

Current diagnostic methods for lung cancer include serum biomarkers, sputum cytology, X-rays, and computed tomography (CT) scans4. Low-dose computed tomography (LDCT) is a reliable tool for early lung cancer screening and has decreased the mortality rate by 20% in high-risk populations5,6. However, the high false-positive rate and overdiagnosis associated with LDCT limit its diagnostic accuracy: 96.4% of the pulmonary nodules detected by LDCT were ultimately confirmed to be false positives7. Serum biomarkers such as carcinoembryonic antigen (CEA) are widely used in clinical practice for lung cancer diagnosis; however, CEA performs poorly, with abnormal serum levels detected in only 61.9% of NSCLC patients8. Consequently, there is an urgent demand for accurate and non-invasive diagnostic tools to improve the detection of early lung cancer.

Non-invasive detection of genomic and epigenomic alterations in circulating cell-free DNA (cfDNA) offers a promising avenue for early lung cancer detection9,10,11. cfDNA is extracellular DNA in the blood that originates from normal or tumor cells, and its concentration is significantly elevated in cancer patients12,13,14. cfDNA carries the genetic and epigenetic alterations specific to tumors, such as sequence mutations and methylation changes. Notably, these alterations, particularly DNA methylation changes, can be detected in cfDNA even at precancerous and early tumor stages15,16. DNA methylation is an important epigenetic modification that plays a key role in cell development, gene expression, and genome stability17. Aberrant DNA methylation is present in almost all tumors, with global hypomethylation and promoter hypermethylation widely recognized as hallmarks of various tumors18,19. cfDNA methylation is therefore a potential biomarker for early cancer screening, and several diagnostic assays based on DNA methylation biomarkers have already been approved by the U.S. Food and Drug Administration20.

cfDNA concentration and the DNA methylation status of the prostaglandin E receptor 4 gene (PTGER4), Ras association domain family member 1 isoform A (RASSF1A), short stature homeobox 2 (SHOX2), and H4 clustered histone 6 (H4C6) have each been identified as valuable biomarkers for lung cancer diagnosis in several studies21,22,23,24. However, the combined detection of these five biomarkers for lung cancer diagnosis has rarely been reported. Given the complex tumor microenvironment and the heterogeneity of lung cancer development, a single circulating biomarker may lack sufficient diagnostic accuracy. In this study, we used qPCR to examine the promoter methylation of SHOX2, RASSF1A, PTGER4, and H4C6, and measured cfDNA concentrations, in a total of 261 plasma samples from lung cancer patients and healthy controls. Our aim was to evaluate the diagnostic accuracy of the combined use of these five biomarkers in lung cancer.

Materials and methods

Patients recruited

This study recruited 179 lung cancer patients and 82 healthy controls from the First Affiliated Hospital of USTC. All lung cancer cases were histologically confirmed, and peripheral blood samples were collected before treatment. Controls were recruited during routine physical examinations, and participants with a history of cancer were excluded. The study was approved by the Institutional Review Board of the First Affiliated Hospital of USTC in accordance with relevant ethical guidelines, and informed consent was obtained from all participants.

Sample collection and storage

Peripheral blood was collected in EDTA anticoagulant tubes, stored at 4 °C after collection, and processed within 4 h. Whole blood was centrifuged at 1,600 × g for 10 min, and the supernatant was centrifuged again at 16,000 × g for 10 min at 4 °C. The resulting supernatant was transferred to a 2 mL EP tube and stored at −80 °C until DNA extraction.

DNA isolation and bisulfite conversion

Cell-free DNA was extracted from 4 mL of plasma using the Magnetic Serum/Plasma DNA Maxi Kit (Cat# DP710, TIANGEN Biotechnology, Beijing, China) according to the manufacturer’s protocol, with a final elution volume of 55 µL. cfDNA concentrations were measured using the Qubit dsDNA High Sensitivity Assay Kit (Cat# Q33231, Thermo Fisher Scientific, MA, USA). Bisulfite conversion of cfDNA was performed using the EZ DNA Methylation-Gold Kit (Cat# D5005, ZYMO, CA, USA) according to the manufacturer’s protocol. Briefly, cfDNA was isolated and purified from 4 mL of plasma using magnetic beads, and unmethylated cytosine residues were then converted to uracil by the bisulfite reaction. Finally, the purified bisulfite-modified DNA was eluted in 10.5 µL of M-Elution Buffer.

DNA methylation analysis

DNA methylation was analysed using quantitative PCR (qPCR). Each PCR reaction had a total volume of 15 µL, comprising 7.5 µL of reaction buffer, 2.5 µL of primer mixture, and 5 µL of bisulfite-modified eluted DNA. qPCR was performed in 96-well plates on the Applied Biosystems 7500 (ABI-7500) platform (Applied Biosystems, Foster City, California). β-Actin (ACTB) was used as the endogenous reference control. Amplification was carried out using the following profile: 98 °C for 5 min, followed by 50 cycles of 95 °C for 10 s, 58 °C for 35 s, and 40 °C for 5 s. The primers used for qPCR are shown in Table 1. The ACTB cycle threshold (Ct) values of all samples were within the acceptable range. For each gene, a relative methylation value was calculated as follows:

Table 1 Primer sequences of SHOX2, PTGER4, RASSF1A, H4C6 and ACTB genes.

\(\text{Methylation}_{\text{gene}} = \frac{1}{2^{\Delta \text{CT}_{\text{gene}}}}\),

where \(\Delta \text{CT}_{\text{gene}} = \text{CT}_{\text{gene}} - \text{CT}_{\text{ACTB}}\).
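As a minimal illustration of this formula, the relative methylation value can be computed directly from the two cycle thresholds. This is a Python sketch rather than the authors' code; the function name and the Ct values are hypothetical.

```python
def relative_methylation(ct_gene: float, ct_actb: float) -> float:
    """Relative methylation = 1 / 2**(CT_gene - CT_ACTB)."""
    delta_ct = ct_gene - ct_actb
    return 1.0 / (2.0 ** delta_ct)


# Hypothetical Ct values, for illustration only.
example = relative_methylation(ct_gene=32.0, ct_actb=29.0)
print(example)  # 0.125
```

A target that crosses the threshold three cycles later than ACTB therefore receives a relative methylation value of 1/8.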

Machine learning algorithm for feature selection

LASSO (Least Absolute Shrinkage and Selection Operator) is a variable selection method proposed in 1996, which can remove unimportant variables by penalizing the magnitude of the coefficients. The Boruta algorithm is another method used to identify the most important features. Compared to traditional regression methods, LASSO and Boruta can better select the features that are most closely related to the disease. In this study, we performed LASSO and Boruta feature selection using the glmnet package25 and Boruta package26 to filter and identify the most relevant features, respectively. LASSO analysis was performed using the glmnet package with parameters set as family = “binomial”, nfolds = 3, type.measure = “class”. Boruta analysis was performed using the Boruta package with parameters configured as doTrace = 2, maxRuns = 500, getImp = getImpRfZ.

Minimizing confounding bias

To minimize the impact of confounding factors, we employed the hold-out method to randomly select 80% of the samples from the experimental group as the training set. This process was repeated 100 times to mitigate the influence of outlier samples. Propensity Score Matching (PSM) is a technique designed to alleviate the interference caused by extraneous biases and irrelevant variables by matching treated subjects with one or more control subjects based on similar propensity scores. To further reduce the influence of age and other irrelevant variables, we implemented a 1:2 PSM using the R MatchIt package between the lung cancer and non-cancer groups. The propensity scores were generated using a logistic regression model to identify matched samples.
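The repeated hold-out procedure described above can be sketched as follows. This is a Python illustration under stated assumptions: the study used R, and the function name, the seed, and the rounding rule for the training-set size are hypothetical, since the text does not specify them. Propensity score matching is not covered by this sketch.

```python
import random


def repeated_holdout(sample_ids, train_fraction=0.8, repeats=100, seed=0):
    """Return a list of (train, validation) lists, one pair per random split."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    # Assumed rounding rule: nearest integer number of training samples.
    n_train = round(train_fraction * len(ids))
    splits = []
    for _ in range(repeats):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


# Ten hypothetical sample IDs, three repeats.
splits = repeated_holdout(range(10), repeats=3)
for train, validation in splits:
    print(len(train), len(validation))  # 8 2 on every repeat
```

Repeating the split many times means that no single unusual sample decides which features or coefficients look important.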

Lung cancer detection model development

Generalized linear models (GLMs) extend traditional linear models by accommodating response variables whose error distributions are not normal. We developed the GLM using the train function from the caret package, with resampling configured through trainControl using the following parameters: method = “repeatedcv”, number = 10, repeats = 5, summaryFunction = twoClassSummary, classProbs = TRUE.
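The resampling scheme implied by these parameters (10 folds, repeated 5 times) can be sketched in Python as below. The study itself used caret in R; this unstratified index generator, its function name, and its seed are illustrative assumptions, and it does not reproduce caret's internal fold assignment.

```python
import random


def repeated_kfold(n_samples, n_folds=10, repeats=5, seed=0):
    """Return (train_idx, test_idx) pairs; n_folds * repeats pairs in total."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices round-robin into nearly equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            pairs.append((train_idx, test_idx))
    return pairs


# 184 is the training-set size reported in this study.
pairs = repeated_kfold(n_samples=184)
print(len(pairs))  # 50
```

Each of the 50 resamples fits the model on roughly nine tenths of the training set and scores it on the remaining tenth, so the averaged ROC metric is less sensitive to any one partition.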

Statistical analysis

R (version 4.1.1) and RStudio were used for statistical analysis. The Mann–Whitney U test was used to compare continuous variables between lung cancer samples and control samples, and the Kruskal–Wallis test was used for comparisons among multiple groups. Receiver operating characteristic (ROC) curves, area under the curve (AUC) values, sensitivity, specificity, and accuracy were calculated using the pROC package27. Cut-off values were determined using the Youden index. The DeLong test was used to compare the AUCs of different models. For all tests, p < 0.05 was considered significant.
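To make the ROC quantities concrete, the following Python sketch computes an AUC and a Youden-index cut-off from labels and scores. It is an illustration only and not the pROC implementation: the function names and the data are hypothetical, and it assumes that a higher score indicates lung cancer.

```python
def roc_points(labels, scores):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is called positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < thr)
        points.append((thr, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(labels, scores):
    """Pick the threshold maximising J = sensitivity + specificity - 1."""
    return max(roc_points(labels, scores), key=lambda p: p[1] + p[2] - 1)


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = lung cancer, 0 = control) and model scores.
labels = [0, 0, 0, 1, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.5, 0.7, 0.75, 0.8, 0.9, 0.95]
print(auc(labels, scores))            # 0.9
print(youden_cutoff(labels, scores))  # (0.5, 1.0, 0.75)
```

The Youden index weights sensitivity and specificity equally, which is why the chosen cut-off here accepts one false positive in order to keep all five positives.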

Results

Patient information and cfDNA concentration distribution

In this study, we collected plasma samples from 82 non-cancer controls and 179 lung cancer patients. The samples were divided into two cohorts: 184 samples in the training set and 77 samples in the validation set. Detailed characteristics are presented in Table 2. Previous studies have shown that cfDNA concentrations are higher in cancer patients and tend to increase with disease progression. In this study, cfDNA concentration was significantly higher in the lung cancer group than in controls (Fig. 1A). The concentration in stage Tis (tumor in situ) was lower than in the other stages and tended to increase with disease progression, although no significant difference was observed between stages I, II and higher stages (Fig. 1B). We then classified the lung cancer samples into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), and no significant difference in cfDNA concentration was observed between the two subtypes (Fig. 1C). Because age is a known risk factor for lung cancer, patients were divided at the median age to analyze the association between cfDNA concentration and age; however, there was no significant variation in cfDNA concentration between the age groups (Fig. 1D). There was also no significant difference in cfDNA concentration between women and men (Fig. 1E). These findings indicate that cfDNA concentration in lung cancer tends to increase with disease progression but is independent of age, gender, and lung cancer subtype.

Table 2 Patient information of training set and validation set.
Fig. 1

Distribution of cfDNA concentration among different groups. (A) The distribution of cfDNA concentration in lung cancer patients and non-cancer controls. (B) The distribution of cfDNA concentration at different stages. (C) The distribution of cfDNA concentration in different subtypes. (D) The distribution of cfDNA concentration in different age subgroups. (E) The distribution of cfDNA concentration in patients of different genders.

Diagnostic accuracy of the four methylation biomarkers for lung cancer detection

To analyze methylation changes in lung cancer, we measured the methylation levels of the four methylation biomarkers by qPCR. The methylation levels of two markers differed significantly between the lung cancer group and the control group, with SHOX2 hypermethylated and PTGER4 hypomethylated in the lung cancer group (Fig. 2A and B), whereas RASSF1A and H4C6 showed no significant differences between the two groups (Fig. 2C and D). To determine the diagnostic value of the four methylation biomarkers, we performed receiver operating characteristic (ROC) curve analysis to evaluate their ability to distinguish the lung cancer group from the control group. The area under the curve (AUC) values for distinguishing lung cancer from non-cancer controls for SHOX2, PTGER4, RASSF1A, and H4C6 were 0.7462, 0.5967, 0.5506, and 0.5094, respectively (Fig. 3A–D). Because cfDNA concentration was also significantly elevated in lung cancer, we calculated its AUC for lung cancer diagnosis as well. The AUC was 0.6017, indicating that cfDNA concentration also has potential for lung cancer detection (Fig. 3E).

Fig. 2

Distribution of methylation levels in the lung cancer and non-cancer groups. (A) The methylation levels of SHOX2 in the lung cancer and non-cancer groups. (B) The methylation levels of PTGER4 in the lung cancer and non-cancer groups. (C) The methylation levels of RASSF1A in the lung cancer and non-cancer groups. (D) The methylation levels of H4C6 in the lung cancer and non-cancer groups.

Fig. 3

Diagnostic accuracy of cfDNA concentration and the methylation of the four genes in the training set. (A) ROC curves of SHOX2 methylation in lung cancer detection. (B) ROC curves of PTGER4 methylation in lung cancer detection. (C) ROC curves of RASSF1A methylation in lung cancer detection. (D) ROC curves of H4C6 methylation in lung cancer detection. (E) ROC curves of cfDNA concentration in lung cancer detection.

Development of a prediction model based on the methylation biomarkers and cfDNA concentration for lung cancer diagnosis

To improve the accuracy of lung cancer diagnosis, we used machine learning methods to evaluate the performance of a panel combining the four methylation biomarkers and cfDNA concentration. First, we used the Boruta and LASSO algorithms to assess the importance of cfDNA concentration and the four methylation biomarkers in the training set. The results indicated that all five biomarkers showed good diagnostic potential for lung cancer (Fig. 4A and B). Next, we developed a lung cancer risk model using a generalized linear model, with the following equation:

Fig. 4

Models developed for lung cancer detection. (A) The importance scores of cfDNA concentration and the methylation of the four genes based on the Boruta algorithm. The importance score reflects the relative contribution of each feature to the model’s predictive performance, with higher values indicating greater significance. (B) Non-zero coefficients were selected by ten-fold cross-validation at the minimum λ value in the LASSO algorithm. The binomial deviance serves as the cross-validation error metric, where smaller values indicate better model performance. λ directly controls the penalty strength: as λ increases, more coefficients are shrunk to zero. (C) ROC curves of the lung cancer detection models in the training set. (D) ROC curves of the 3-feature model and the 5-feature model.

The probability of lung cancer = \(\frac{e^{x}}{1+e^{x}}\),

where e is the base of the natural logarithm and \(x = 6.4667 + 0.0493 \times \text{cfDNA concentration} - 0.1516 \times \text{Methylation}_{\text{PTGER4}} - 0.2220 \times \text{Methylation}_{\text{SHOX2}} - 0.0381 \times \text{Methylation}_{\text{H4C6}} - 0.0155 \times \text{Methylation}_{\text{RASSF1A}}\). We then evaluated the performance of the model in the training set, where it achieved an AUC of 0.8012, demonstrating good potential for lung cancer detection (Fig. 4C). Because RASSF1A and H4C6 methylation showed no statistically significant differences between non-cancer and lung cancer samples, we excluded these two genes and constructed a 3-feature model using the same methodology. The 3-feature model achieved an AUC of 0.7875, which was lower than that of the original 5-feature model (Fig. 4D). These findings suggest that including RASSF1A and H4C6 methylation can enhance the diagnostic accuracy of the model for lung cancer detection.
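The risk equation above can be evaluated with a few lines of Python. This is a hedged sketch and not the authors' code: the intercept and coefficients are copied from the text, while the function name, the dictionary keys, and the input values are hypothetical. The units and scaling of the inputs are assumed to match those used when the model was fitted, which the text does not state.

```python
import math

# Intercept and coefficients as reported in the risk equation.
INTERCEPT = 6.4667
COEFFICIENTS = {
    "cfdna_concentration": 0.0493,
    "methylation_ptger4": -0.1516,
    "methylation_shox2": -0.2220,
    "methylation_h4c6": -0.0381,
    "methylation_rassf1a": -0.0155,
}


def lung_cancer_probability(features):
    """Logistic transform of the linear predictor x."""
    x = INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)
    # 1 / (1 + e^-x) is algebraically identical to e^x / (1 + e^x).
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical inputs, chosen only to show the call; not patient data.
sample = {
    "cfdna_concentration": 10.0,
    "methylation_ptger4": 20.0,
    "methylation_shox2": 15.0,
    "methylation_h4c6": 5.0,
    "methylation_rassf1a": 5.0,
}
print(lung_cancer_probability(sample))
```

The resulting probability would then be compared with a cut-off, for example one chosen by the Youden index, to classify a sample.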

Validation of the prediction model in an independent cohort

To evaluate the effectiveness of our model, we tested it in the independent validation set by calculating the lung cancer risk of each sample. The results were consistent with those in the training set, with an AUC of 0.8436 (95% CI: 0.7565–0.9306), a sensitivity of 77.36%, and a specificity of 91.67% (Fig. 5A). To further assess the robustness of the model and to examine whether it is independent of clinical variables, we performed clinical subgroup analyses on the training and validation sets. We analyzed the distribution of risk scores across subgroups defined by age, gender, stage, and histological subtype. Risk scores were significantly associated with age in the lung cancer group, with older patients exhibiting higher risk scores than younger patients (Fig. 5B). No significant differences in risk scores were observed between genders (Fig. 5C). However, notable variations emerged across subtypes and stages: the risk scores of the LUAD group were significantly lower than those of the LUSC group (Fig. 5D), and early-stage patients had markedly lower scores than advanced-stage patients (Fig. 5E). Furthermore, risk scores increased progressively with tumor stage. In summary, these results demonstrate that the model has strong potential for lung cancer detection.

Fig. 5

Evaluation of the lung cancer detection model in an independent validation set. (A) ROC curves of the lung cancer detection models in the validation set. (B) The distribution of model prediction scores in different age groups. (C) The distribution of model prediction scores by gender. (D) The distribution of model prediction scores in different subtypes. (E) The distribution of model prediction scores at different stages.

Discussion

Lung cancer remains the leading cause of cancer-related mortality worldwide, mainly because the lack of obvious clinical symptoms and of effective screening methods in the early stages leads to diagnosis at advanced stages. Early diagnosis is the key to reducing lung cancer mortality, which underscores the clinical significance of developing accurate non-invasive diagnostic methods. In this study, we measured four methylation biomarkers and cfDNA concentration in plasma and constructed a diagnostic model for lung cancer. The AUCs of this model in the training and validation cohorts were 0.8012 and 0.8436, respectively, indicating good discrimination between lung cancer patients and controls.

Plasma cfDNA mainly originates from the hematopoietic system, but the proportions of its cellular sources vary considerably across pathological conditions and body fluids28. In healthy individuals, 55% of plasma cfDNA comes from leukocytes, 30% from erythroid progenitor cells, and 10% from vascular endothelial cells29. In cancer patients, tumor-derived cfDNA accounts for 0.01% to 90% of total cfDNA, with the proportion increasing with tumor progression30,31. Previous studies have shown that cfDNA concentration is a potential early screening marker for lung cancer. Qi et al.32 achieved an AUC of 0.8777 for gastric cancer detection using cfDNA concentration, and Mirtavoos-Mahyari et al.33 achieved an AUC of 0.98 for differentiating the recurrence probability of lung cancers. In this study, although the diagnostic performance of cfDNA concentration was better than that of most of the methylation markers, with an AUC of 0.6467, its accuracy was lower than in previous studies, probably because most patients had stage I disease and the proportion of tumor-derived cfDNA is low in early-stage patients.

cfDNA methylation plays a critical role in early tumorigenesis and is a promising biomarker for early cancer detection. In this study, we analyzed the methylation levels of the SHOX2, RASSF1A, PTGER4, and H4C6 genes, which have shown significant diagnostic potential for lung cancer in previous studies. SHOX2 methylation testing has shown excellent results in the screening and diagnosis of lung cancer and has been approved by the National Medical Products Administration (NMPA)22,34. Christoph Kneip et al. demonstrated that plasma-based SHOX2 DNA methylation could be used as a biomarker to distinguish malignant lung disease from controls with a sensitivity of 60% and a specificity of 90%. Cancer in patients with stage II (72%), III (55%), and IV (83%) disease was detected with higher sensitivity than in stage I patients. Small cell lung cancer (80%) and squamous cell carcinoma (63%) were detected with the highest sensitivity compared with adenocarcinomas34. RASSF1A is a tumor suppressor gene that was hypermethylated in 63% of NSCLC patients, and its hypermethylation is one of the common epigenetic inactivation events in human cancers35. PTGER4 belongs to the G-protein-coupled receptor family, which influences cellular physiological and pathological processes by activating endogenous G proteins to transmit downstream signals36. The H4C6 gene encodes a replication-dependent histone and is hypermethylated in various cancers, including lung cancer. Dong et al.21 demonstrated that H4C6 methylation accurately distinguished lung cancer patients from those with benign pulmonary diseases, with an AUC of 0.98, a specificity of 96.7%, and a sensitivity of 87.0%. In our study, the AUCs for SHOX2, PTGER4, RASSF1A, and H4C6 methylation in lung cancer diagnosis were 0.7462, 0.5967, 0.5506, and 0.5094, respectively. Among these, only SHOX2 exhibited satisfactory diagnostic accuracy, which might be related to the sample type and pathological stage. Of the 179 lung cancer samples we collected, 84.9% were stage Tis or stage I, and early-stage patients had lower cfDNA levels. Additionally, the detection performance of DNA methylation markers varies significantly across sample sources. For example, the AUC of SHOX2 methylation for diagnosing lung cancer in serum, pleural fluid, and bronchial lavage fluid was 0.62, 0.70, and 0.94, respectively37. Performance in serum was lower than in pleural fluid and bronchial lavage fluid; however, pleural fluid and bronchial lavage fluid were unavailable for most patients, whereas serum was available for all patients.

Multi-biomarker panels may offer a viable solution to the suboptimal performance of single biomarkers. Previous studies have demonstrated that multi-marker panels significantly improve the efficiency of cancer diagnosis. Weiss et al. achieved an AUC of 0.88 for lung cancer diagnosis using SHOX2 and PTGER4 methylation38. Similarly, Wei et al. reported an AUC of 0.938 when combining PTGER4, RASSF1A, and SHOX2 methylation, which was better than that of the single-marker assays39. In the study by Jiaping Zhao et al.40, the AUC values for distinguishing lung adenocarcinoma from healthy controls for SHOX2 and RASSF1A methylation were 0.751 and 0.747, respectively, and the combined methylation panel of both biomarkers yielded an AUC of 0.814. In the study by Wenhai Huang et al.41, the AUC values for distinguishing lung cancer from benign lung diseases for SHOX2 and PTGER4 methylation were 0.8514 and 0.8466, respectively, and the combined methylation panel of both biomarkers yielded an AUC of 0.921. In this study, the combined model of SHOX2, PTGER4, RASSF1A, and H4C6 methylation and cfDNA concentration enhanced diagnostic accuracy compared with the individual methylation markers (AUC 0.8012 vs. 0.7462, 0.5967, 0.5506, and 0.5094).

This study has several limitations. First, the sample size was small, and larger studies are needed to confirm the results. Second, samples were limited to healthy individuals and patients diagnosed with lung cancer. The absence of samples from patients with non-malignant pulmonary disease, combined with incomplete data on lifestyle factors and comorbidities, limits our ability to evaluate the model’s performance in non-malignant pulmonary conditions or high-risk populations. Finally, since RASSF1A and H4C6 methylation levels are elevated in multiple cancer types, our model may lack specificity for tumor localization and could potentially yield false-positive results for other cancers.

Conclusions

In summary, the combination of plasma cfDNA concentration and the methylation levels of SHOX2, RASSF1A, PTGER4, and H4C6 demonstrated high diagnostic efficacy for lung cancer. This non-invasive assay based on the four methylation markers and cfDNA concentration has potential clinical applications and can be used alone or in combination with current diagnostic methods to improve the overall efficiency of lung cancer diagnosis.