Introduction

Lung cancer represents a major health issue and is one of the leading causes of cancer-related mortality worldwide1,2. Non-small cell lung cancer (NSCLC) constitutes the predominant subtype of lung cancer, representing approximately 85% of all cases3,4,5. Among NSCLC cases, lung adenocarcinoma (LUAD) is the most prevalent type6,7,8. Recent global data shows a steady increase in the incidence of LUAD, particularly among women, gradually surpassing squamous cell carcinoma as the most common type of lung cancer9. In 2020, lung cancer cases worldwide reached 2.2 million, with LUAD making up a significant proportion, especially among women, where its incidence reached 57%9. The high morbidity and mortality associated with LUAD not only severely impact patients’ quality of life but also place considerable strain on public health systems.

As early symptoms of lung cancer are often subtle, many patients are diagnosed at advanced stages, complicating both treatment and prognosis10,11,12. Current treatment options, including surgery, chemotherapy, and targeted therapies, show limited efficacy, particularly in advanced stages of the disease13,14. Additionally, the heterogeneity of LUAD further complicates treatment outcomes15,16. Despite progress in genomic research, targeted therapies, and immunotherapy, the mechanisms underlying LUAD progression and drug resistance remain insufficiently understood. With the rapid advancement of precision medicine, statistical and computational algorithms have been increasingly employed to identify novel diagnostic and therapeutic targets17,18. This study addresses this gap by employing bioinformatics analysis to identify potential biomarkers and diagnostic and therapeutic targets. By integrating machine learning techniques, we aim to enhance the diagnostic accuracy of LUAD and uncover key genes associated with prognosis and therapeutic response, ultimately improving patient outcomes.

The main goal of this study was to identify and analyze differentially expressed genes (DEGs) in LUAD using an integrated bioinformatics approach. Datasets from multiple Gene Expression Omnibus (GEO) databases were merged, and batch effect correction techniques were applied to enhance the consistency of gene expression across samples. LUAD requires the identification of new biomarkers and therapeutic targets to address its clinical challenges. In this study, 645 DEGs were identified, and the association between gene modules and clinical features was explored through weighted gene co-expression network analysis (WGCNA), revealing underlying biological processes. Additionally, a robust machine learning diagnostic model was constructed to identify eight core genes with significant predictive power by integrating these DEGs. The study further investigated the expression, prognosis, immune infiltration, and drug sensitivity of the ST14 gene among the core genes in LUAD. Quantitative Reverse Transcription Polymerase Chain Reaction (qRT-PCR) experiments were conducted to validate the expression of the eight core genes in LUAD cell lines, Beas-2B control cells, and 30 pairs of LUAD tumors with adjacent normal tissues. The novelty of this study lies in the integration of DEG identification, machine learning, and single-cell analysis to construct a powerful diagnostic model, exploring the immune landscape and drug sensitivity of LUAD, and offering new insights for future diagnostic and therapeutic strategies, as well as personalized medicine.

Materials and methods

Data acquisition

Clinicopathologic information and genome-wide expression data were obtained from five LUAD cohorts (GSE32665, GSE68465, GSE115002, GSE32863, and GSE75037) through the GEO database19 (https://www.ncbi.nlm.nih.gov/geo/). The “sva” package was used to correct for batch effects between the GSE32665 and GSE68465 datasets20. First, the expression data files were read and merged on their intersecting genes; the ComBat method was then applied for batch correction, and the batch-corrected expression data were output. In addition, single-cell RNA sequencing (scRNA-seq) data from 9 LUAD patients (GSE189357) were obtained from the GEO database. mRNA expression data and corresponding clinical information for LUAD patients were obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/)21. The workflow of this study is illustrated in Fig. S1.
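The merge-then-correct step can be illustrated with a minimal Python sketch. It is not the ComBat procedure from the “sva” package: it only restricts the datasets to their shared genes and removes each batch's per-gene mean, whereas ComBat also adjusts variances and applies empirical Bayes shrinkage. The function names and all expression values are hypothetical.

```python
from statistics import mean

def intersect_genes(*datasets):
    """Return the sorted list of genes present in every dataset."""
    common = set(datasets[0])
    for data in datasets[1:]:
        common &= set(data)
    return sorted(common)

def center_by_batch(datasets):
    """Merge datasets on shared genes and remove each batch's per-gene mean.

    Simplified stand-in for batch correction (location shift only);
    it is NOT an implementation of ComBat.
    """
    merged = {}
    for gene in intersect_genes(*datasets):
        # Grand mean across all batches, restored after centering.
        grand = mean(v for data in datasets for v in data[gene])
        merged[gene] = [
            v - mean(data[gene]) + grand
            for data in datasets
            for v in data[gene]
        ]
    return merged

# Toy values (hypothetical, not taken from the GEO cohorts).
batch_a = {"ST14": [5.0, 7.0], "FHL1": [2.0, 4.0], "ONLY_A": [1.0, 1.0]}
batch_b = {"ST14": [9.0, 11.0], "FHL1": [6.0, 8.0]}
merged = center_by_batch([batch_a, batch_b])
```

After centering, both toy batches share the same per-gene mean, and the gene found in only one batch is dropped.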

In this study, samples of LUAD tissues and corresponding adjacent non-cancerous tissues were obtained from 30 lung cancer patients at Fujian Provincial Cancer Hospital. The study received approval from the hospital’s Ethics Committee (Approval No. K2023-417-01), and informed consent was obtained from all participants before their inclusion in the study.

WGCNA analysis

WGCNA (version 1.73)22 was conducted to construct a co-expression network for LUAD samples. Normalized expression data were imported, low-variance genes were filtered, and sample group information was extracted. Genes and samples that did not meet quality control standards were excluded based on missing value assessment. A sample clustering tree was generated using hierarchical clustering, and outlier samples were removed by applying a predefined cut height. Clinical trait data were aligned with the expression matrix, followed by re-clustering and heatmap visualization to evaluate sample–trait relationships. To identify an appropriate soft-thresholding power, a range of candidate values was evaluated according to scale-free topology criteria. A weighted adjacency matrix was then constructed and converted into a topological overlap matrix to measure gene co-expression similarity. Genes were clustered into modules using hierarchical clustering with a minimum module size of 60. Modules with similar expression patterns were merged based on eigengene correlation using a threshold of 0.25, resulting in the final module structure. Pearson correlation coefficients were calculated to assess associations between module eigengenes and clinical traits, and the results were visualized in a module–trait correlation heatmap. Module membership and gene significance (GS) were further analyzed to identify key modules and hub genes associated with clinical features.
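The adjacency and topological overlap steps can be sketched in plain Python. The sketch assumes an unsigned network, which the text does not specify, and uses the soft-thresholding power of 4 reported in the Results; the helper names and the toy expression profiles are hypothetical, and the code does not call the WGCNA package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expr, beta):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta with a zero diagonal."""
    n = len(expr)
    return [[0.0 if i == j else abs(pearson(expr[i], expr[j])) ** beta
             for j in range(n)] for i in range(n)]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j] for u in range(n))
                tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom

# Three hypothetical gene profiles across four samples.
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with the first gene
    [1.0, -1.0, 1.0, -1.0],  # weakly correlated with the other two
]
adj = adjacency(expr, beta=4)
tom = topological_overlap(adj)
```

Raising the correlation to the power beta suppresses weak correlations (here |r| of about 0.45 becomes 0.04), which is why the two tightly co-expressed genes end up with a much higher overlap than either has with the third.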

Diagnostic models of intersecting genes generated by an integrated machine learning-based approach and analysis of model genes

The LUAD data were analyzed using 113 machine learning methods. First, the gene expression data for both the training and validation groups were processed to ensure that genes common to both groups were retained, followed by centering and normalization of the data. Feature selection was then performed using machine learning algorithms such as random forest, glmnet, and xgboost to identify important genes, which were subsequently used to construct predictive models. The performance of each model was measured primarily by calculating the Area Under the Curve (AUC) value and comparing results between the training and validation sets. Risk scores for each sample were calculated, and classification predictions were made. Heatmaps were generated to visualize the AUC values of different models, with the best-performing model selected. The selected features were combined using multivariate logistic regression models to further enhance prediction accuracy. Additionally, the Receiver Operating Characteristic (ROC) curve analysis of the GEO dataset was conducted using the “pROC” package (version 1.18.5). The risk matrix file was read, and dataset ID information was extracted. Sample grouping information was then extracted and labeled as 1 and 0. For each model, ROC curves were generated, and the AUC values were determined, with confidence intervals obtained using the bootstrap method. The ROC curves for each dataset were plotted, with AUC values and 95% confidence intervals labeled.
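The AUC and its bootstrap confidence interval can be illustrated with a short Python sketch. It does not reproduce the “pROC” implementation: the AUC is computed from pairwise comparisons of cases and controls, and the interval is a percentile interval from stratified resampling, which is an assumption because the text does not state the resampling scheme. The labels and risk scores are hypothetical.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Ties between a case and a control count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    pos_idx = [i for i, l in enumerate(labels) if l == 1]
    neg_idx = [i for i, l in enumerate(labels) if l == 0]
    stats = []
    for _ in range(n_boot):
        # Stratified resampling keeps both classes in every replicate.
        idx = (rng.choices(pos_idx, k=len(pos_idx))
               + rng.choices(neg_idx, k=len(neg_idx)))
        stats.append(auc([labels[i] for i in idx], [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical grouping (1 = tumor, 0 = normal) and risk scores.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
point = auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
```

In the toy data, 15 of the 16 case-control pairs are ordered correctly, so the point estimate is 15/16.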

Volcano plots of the model genes were drawn using the “ggplot2” (version 3.5.1) and “ggrepel” packages, while the gene expression data were analyzed using the “limma” (version 3.62.1), “reshape2” (version 1.4.4), “ggpubr”, and “PerformanceAnalytics” packages. The gene expression data and associated gene lists were read to extract the expression values of the model genes. The sample grouping information was extracted, and the data were formatted for boxplot generation. Boxplots were plotted using ggboxplot to compare gene expression between groups, with significance levels determined by the Wilcoxon test. Correlation plots were generated using the “PerformanceAnalytics” package to display the relationships among the expression levels of the model genes. Finally, the protein interaction networks of the model genes were explored using the GeneMANIA website23 (https://genemania.org/) to gain deeper insights into gene interrelationships and biological functions.

Single-cell sequencing analysis

An in-depth analysis of model genes was conducted using single-cell RNA sequencing. The “Seurat” package (version 5.1.0) was applied to preprocess the data by removing low-quality cells, normalizing the data, and selecting highly variable genes. Principal component analysis (PCA) was then performed to reduce the data’s dimensionality, followed by the application of the Harmony algorithm to correct for batch effects. Cells were subsequently clustered using the t-SNE method to identify distinct cell populations. For cell type annotation, the “SingleR” package (version 2.8.0) was employed, integrating multiple reference datasets. The expression of model genes across different cell clusters was then visualized using scatter plots (FeaturePlot) and bubble plots (DotPlot). In the scatter plot, gene expression levels were represented by a green-to-red gradient, while the bubble plot displayed gene expression intensity across clusters. These visualizations showed the distribution and expression differences of the target genes across distinct cell populations. For the single-cell analysis of the ST14 gene, eight distinct single-cell datasets (GSE117570, GSE127465, GSE146100, GSE143423, GSE150660, GSE131907, GSE162498, GSE139955) were analyzed using the TISCH database24 (http://tisch.comp-genomics.org/search-gene/).

Expression, prognosis, and enrichment analysis of ST14

The analysis of ST14 expression and prognosis was conducted using data from the TCGA database. The “limma” and “ggplot2” packages were employed to analyze expression differences of the target gene ST14. First, gene expression data were processed to extract the ST14 gene, determine the sample sizes in the normal and tumor groups, and apply a log2 transformation. Box plots were used to visualize the expression differences of ST14 between the two groups, with statistical significance determined by the Wilcoxon test. Differential analysis was also conducted on paired samples using the “limma” and “ggpubr” packages, and paired difference analysis plots were generated to illustrate the expression variations of the target gene between the groups. Survival analysis of LUAD data was performed using the “survival” and “survminer” packages. Additionally, Gene Set Enrichment Analysis (GSEA) was applied to LUAD data, with samples grouped based on target gene expression. LogFC values were calculated, significantly enriched pathways were identified, and enrichment maps were generated. Gene Set Variation Analysis (GSVA) was also performed on LUAD data25, with gene set scores calculated for each group. Differential analysis was conducted, and significant pathways were displayed in bar charts.
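As an illustration of the survival analysis step, the following Python sketch computes a Kaplan-Meier survival curve. The text names only the “survival” and “survminer” packages, so the choice of the Kaplan-Meier estimator is an assumption; the function name, follow-up times, and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event (death) was observed at times[i]
    and 0 if the observation was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored subjects leave the risk set without lowering the curve, which is why the survival estimate drops only at the three observed event times in the toy data.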

Immune infiltration analysis and drug sensitivity analysis

The relationship between the ST14 gene and tumor microenvironment (TME) score was analyzed using the “ggplot2” and “limma” packages in R. Wilcoxon tests were performed to assess the statistical significance. Next, the relationship between ST14 and immune cell infiltration was examined. Box-and-whisker plots were created to show gene expression differences under varying immune cell infiltration types. The Spearman correlation between ST14 expression and immune cell infiltration types was calculated, and the results were reported. The association between ST14 and immune checkpoint (ICP) genes was additionally assessed, and a correlation matrix was plotted to highlight potential associations. Furthermore, the association between ST14 and tumor mutation burden (TMB) was analyzed, with scatter plots illustrating the correlation using Spearman correlation analysis. Lastly, the relationship between ST14 and immune scoring was examined using the “limma” and “ggpubr” packages, with violin plots generated and significance indicated. The “oncoPredict” package26,27 was applied to examine the relationship between ST14 and drug sensitivity, and the final drug sensitivity analysis results were output.
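The Spearman correlation used for the immune infiltration and TMB analyses can be sketched in Python as the Pearson correlation of average ranks. The helper names and the toy values are hypothetical and do not come from the TCGA data.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ST14 expression and immune cell fractions for five samples.
st14 = [1.0, 2.0, 3.0, 4.0, 5.0]
fraction = [0.02, 0.01, 0.04, 0.03, 0.05]
rho = spearman(st14, fraction)
```

Because the calculation depends only on ranks, any monotonic rescaling of the expression values (such as a log2 transformation) leaves rho unchanged.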

Cell culture and qRT-PCR analysis

In this study, the Beas-2B, H1299, A549, PC9, and HCC827 cell lines were obtained from Procell (Wuhan, China). For cell resuscitation, frozen cell stocks were removed from a −80 °C freezer and rapidly thawed in a 37 °C water bath. After thawing, the cells were immediately transferred to RPMI-1640 medium supplemented with 10% fetal bovine serum and 1% penicillin/streptomycin. The cells were cultured in 10 cm dishes at 37 °C in a humidified incubator with 5% CO2. The medium was replaced regularly, and cell growth was monitored to ensure that the cells remained in the logarithmic growth phase. Total RNA was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. The extracted RNA was then reverse transcribed into complementary DNA (cDNA) using the PrimeScript RT kit (Takara). Quantitative reverse transcription polymerase chain reaction (qRT-PCR) was subsequently performed using the Takara SYBR Green method. The results were analyzed using the 2^−ΔΔCt method, and all data were normalized to GAPDH as an internal control. The resulting relative gene expression values were used to evaluate the differential expression of genes across cell lines. Primer sequences used in the qRT-PCR experiments are provided in Table S4.
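The relative quantification step can be written out as a short Python sketch. The function name and the Ct values are hypothetical; the calculation follows the standard form in which the target gene is first normalized to the reference gene (GAPDH here) and then to the control sample.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method."""
    # dCt: target gene Ct minus reference gene Ct, within each sample.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # ddCt: dCt of the sample of interest minus dCt of the control sample.
    return 2 ** -(delta_sample - delta_control)

# Hypothetical Ct values: a target gene and GAPDH in a LUAD cell line
# and in the control cell line.
fold_change = relative_expression(
    ct_target_sample=22.0, ct_ref_sample=18.0,
    ct_target_control=24.0, ct_ref_control=18.0,
)
```

With these toy values the target gene crosses the threshold two cycles earlier in the LUAD line after normalization, which corresponds to a fourfold higher relative expression.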

Statistical analysis

Data processing, analysis, and visualization were performed using R software (version 4.4.2) and GraphPad Prism 9. Quantitative variables were analyzed by independent samples t-tests, and statistical significance was set at P < 0.05 for comparisons between groups.

Results

GEO-LUAD database DEGs identification and WGCNA analysis

Initially, the GSE32665 and GSE68465 datasets were integrated, and batch correction was applied to eliminate differences between the datasets. This correction significantly improved the consistency of gene expression distributions and aligned the datasets with the principal component space (Fig. S2). A total of 645 DEGs were detected using the criteria of |logFC| > 1 and adjusted p < 0.05 (Table S1), comprising 358 upregulated genes and 287 downregulated genes. The distribution of significantly up- and downregulated genes was visualized in a volcano plot (Fig. 1A), while the expression patterns of DEGs across different samples were illustrated in a heatmap (Fig. 1B). Subsequently, WGCNA was performed. A sample clustering tree, based on gene expression data, was constructed to detect potentially anomalous samples, providing a framework for data quality control in subsequent analyses (Fig. S3A). Additionally, the clustering tree, combined with a trait heatmap, revealed expression similarities between samples and their alignment with clinical features, establishing a foundation for module construction and trait correlation analysis (Fig. S3B). When the soft threshold β was set to 4, the network exhibited scale-free distribution characteristics (Fig. 1C). The clustering tree of module eigengenes further elucidated the relationships and similarities among different modules (Fig. S3C). Through the construction of co-expression modules and the aggregation of similar modules, a total of 10 gene modules were discovered (Fig. 1D). Correlation analysis between clinical traits and gene modules revealed that the red module demonstrated the strongest association with gene features (Cor = 0.28, p = 4e−13) (Fig. 1E). Further analysis of module gene significance confirmed that the red module exhibited the highest gene significance, indicating its strong correlation with the target trait (Fig. 1F). 
Additionally, a scatterplot illustrated the correlation between the module membership values of genes within the module and their gene significance in the treatment group (Fig. S3D). The red module comprised 493 genes in total (Table S2). Venn diagram analysis of the intersection between the red module genes and the DEGs identified 66 shared genes (Fig. 1G) (Table S3). The Gene Ontology (GO) analysis of intersecting genes identified vesicle-mediated transport in synapse and vesicle organization as the most enriched biological processes (BP). For cellular components (CC), the most prominent terms were early endosome, postsynaptic specialization, and neuron-to-neuron synapse. Regarding molecular functions (MF), the analysis highlighted peptide binding and amide binding as the most significant terms (Fig. S4A). The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of intersecting genes identified the most significantly enriched pathways as the renin-angiotensin system, TGF-beta signaling pathway, and starch and sucrose metabolism (Fig. S4B)28,29,30.
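The DEG selection criteria and the Venn intersection described above can be sketched in Python as follows. The function name, the gene table, and the module gene set are hypothetical, and the adjusted p-values are taken as given rather than computed.

```python
def select_degs(table, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Split genes into up- and downregulated sets.

    table maps gene -> (logFC, adjusted p-value); a gene is kept when
    |logFC| > lfc_cutoff and adjusted p < padj_cutoff.
    """
    up = {g for g, (lfc, padj) in table.items()
          if padj < padj_cutoff and lfc > lfc_cutoff}
    down = {g for g, (lfc, padj) in table.items()
            if padj < padj_cutoff and lfc < -lfc_cutoff}
    return up, down

# Hypothetical differential expression results.
table = {
    "ST14": (1.8, 0.001),
    "FHL1": (-2.4, 0.0004),
    "GENE_X": (0.6, 0.01),  # significant but below the fold-change cutoff
    "GENE_Y": (1.5, 0.20),  # large change but not significant
}
# Hypothetical genes assigned to a co-expression module.
module_genes = {"ST14", "FHL1", "GENE_Z"}

up, down = select_degs(table)
shared = (up | down) & module_genes  # Venn intersection of DEGs and module genes
```

Only genes that pass both thresholds and also belong to the module are carried forward, mirroring how the 66 shared genes were obtained from the 645 DEGs and the 493 red-module genes.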

Fig. 1

Results of differential gene expression analysis and WGCNA module construction. (A) Volcano plot depicting DEGs; (B) Heatmap demonstrating the expression patterns of DEGs; (C) Soft-threshold screening results; (D) Module results of gene clustering tree and dynamic tree cut delineation; (E) Association heatmap of clinical features and modules; (F) Histogram of gene significance of different modules, with the red module having the highest significance; (G) Venn diagram showing the gene intersection of DEG and WGCNA results.

Machine learning construction of diagnostic models and analysis of model genes

For the GEO datasets, training and test sets were constructed, and 113 predictive models were fitted using machine learning algorithms. The C-index of each model was subsequently calculated for both the training and validation sets. The results indicated that the optimal model was the combination of Stepglm[backward] and RF, achieving the highest average C-index (0.999) (Fig. 2A). Based on this model, eight core genes were identified: BCMO1, FHL1, ST14, PPAP2C, SNCAIP, EML1, F10, and ADM2. The ROC curves demonstrated that the optimal model exhibited excellent performance in the training set and three validation sets (GSE75037, GSE32863, and GSE115002) (Fig. 2B–E). The AUC for the training set and two validation sets (GSE75037 and GSE32863) was 1.000, while the AUC for the GSE115002 validation set was 0.996. These results highlight the model’s exceptional diagnostic accuracy and robust generalization performance. Furthermore, the confusion matrix confirmed the model’s strong classification ability across different datasets (Fig. S5). Volcano and box plots revealed differential gene expression in the tumor group, where ST14, BCMO1, ADM2, and PPAP2C were upregulated, while FHL1, EML1, F10, and SNCAIP were downregulated (Fig. 2F,G). Additionally, correlation analysis among the model genes was conducted (Fig. 2H). Figure 2I illustrates the ROC curves of the selected genes, with all genes achieving AUC values greater than 0.7. Among these, ST14 demonstrated the highest AUC (0.842), indicating its superior predictive performance. Moreover, the protein interaction network of the identified genes was constructed (Fig. 2J), further elucidating their functional relationships and biological relevance.

Fig. 2

A model was built and validated by a machine learning-based approach. (A) The best model was identified using 113 different machine learning algorithms; (B–E) ROC curves of each dataset and training set, with the AUC values close to 1.0, indicating that the model has good discriminatory ability; (F) Volcano plot of the model genes; (G) Expression levels of the model genes in normal and tumor samples; (H) Correlation analysis between the model genes; (I) Model genes’ ROC curves, with ST14 gene showing the best performance; (J) Protein interaction network of model genes. *p < 0.05; **p < 0.01; ***p < 0.001.

Single-cell analysis of model genes

The characterization of model genes in nine LUAD samples from the GSE189357 dataset was investigated using single-cell sequencing analysis. Quality control and PCA were conducted as part of the single-cell RNA sequencing workflow. Figure S6A illustrates the distribution of gene feature counts, total RNA counts, and mitochondrial gene ratios. Figure S6B highlights the highly loaded genes within the first four principal components of PCA. Figure S6C provides a PCA heatmap, revealing clustering patterns of gene expression across principal components, while Fig. S6D evaluates the significance of these components through p-values, serving as a guide for subsequent analyses. T-SNE plots identified 29 cellular subpopulations, highlighting their distribution characteristics and heterogeneity in two-dimensional space (Fig. 3A). Additionally, t-SNE plots revealed the distribution of distinct cell types in the single-cell RNA sequencing data, including monocytes, stromal cells, dendritic cells (DCs), fibroblasts, epithelial cells, macrophages, and BM cells, each demonstrating clear clustering patterns (Fig. 3B). The expression distribution of five key model genes—FHL1, SNCAIP, EML1, F10, and ADM2—was further analyzed within single-cell t-SNE plots (Fig. 3C). Bubble plots were utilized to illustrate the expression of these genes across various cellular subpopulations (Fig. 3D). In these plots, the vertical axis represents cellular subpopulations, the horizontal axis denotes the genes, the bubble size reflects the expression proportion of the genes in corresponding subpopulations, and the bubble color indicates the average expression level. Notably, FHL1 exhibited a higher expression proportion and average expression level in several subpopulations, such as subpopulations 8 and 12.

Fig. 3

Single-cell RNA sequencing analysis of LUAD samples in the GSE189357 dataset. (A) T-SNE plot illustrating 29 distinct cell subpopulations; (B) T-SNE plot depicting the distribution pattern of cell types; (C) T-SNE plot showing the expression distribution of five model genes (FHL1, SNCAIP, EML1, F10, ADM2). (D) Bubble plot showing the proportion of expression (bubble size) and the average expression level (color intensity) of the five model genes in different cell subpopulations.

Expression profile, prognosis and enrichment analysis of ST14

Since the ST14 gene exhibited the best diagnostic efficacy in the model (AUC = 0.842), its role in LUAD was further investigated. As shown in Fig. 4A, the expression of the ST14 gene was higher in LUAD tumor tissues compared to normal tissues. Figure 4B further revealed that ST14 expression was also markedly higher in paired tumor tissues relative to normal tissues. Additionally, the data in Fig. 4C indicated that elevated ST14 expression may be linked to a poorer prognosis. GO analysis (Fig. 4D) revealed that the low ST14 expression group was notably enriched in immune-related biological processes, such as lymphocyte-mediated immunity and adaptive immune response, while the high ST14 expression group was primarily enriched in processes related to cell division, including mitotic nuclear division, mitochondrial-associated cellular components, and sister chromatid segregation. KEGG analysis (Fig. 4E) revealed that the high ST14 expression group was notably enriched in pathways involved in DNA replication, base excision repair, fructose and mannose metabolism, glutathione metabolism, and the pentose phosphate pathway. In contrast, the low ST14 expression group was significantly enriched in immune-related pathways, including hematopoietic cell lineage, cytokine-cytokine receptor interactions, chemokine signaling, systemic lupus erythematosus, and intestinal immune networks for IgA production. Figure 4F presents a bar plot of t-values for GSVA (Gene Set Variation Analysis) scores across different GO functions in ST14 samples. The t-values reflect the extent of enrichment or suppression of each gene set. Numerous GO terms associated with metabolic processes, cell differentiation, and immune responses exhibit upregulation (red), while terms related to calcium ion transport display downregulation (green). Gray bars indicate no significant change in these terms. 
Specifically, the top three upregulated GO terms include branching involved in labyrinthine layer morphogenesis, hexose catabolic process, and monosaccharide catabolic process. Figure 4G illustrates the t-value bar plots of GSVA scores for various KEGG pathways in ST14 samples. The t-values represent the degree of enrichment or inhibition of each pathway. Many KEGG pathways related to metabolic processes, signaling, and immune responses show upregulation (red), while others associated with cell migration and immune responses are downregulated (green). Gray bars indicate no significant changes in these pathways. Specifically, the top three upregulated KEGG pathways include pentose phosphate pathway, glycosphingolipid biosynthesis, and fructose and mannose metabolism.

Fig. 4

Analysis of the role of ST14 gene in LUAD. (A) The variation in ST14 gene expression between tumor and normal tissues; (B) The expression variation of the ST14 gene in paired tumor and normal tissues; (C) The differences in survival rates based on varying ST14 expression levels (p = 0.020); (D) The GO analysis bar graphs; (E) The KEGG analysis bar graphs; (F) GSVA analysis showing t-value bar graphs for different GO functions in ST14 samples, reflecting the degree of enrichment or inhibition of each gene set; (G) GSVA analysis showing t-value bar graphs for different KEGG pathways in ST14 samples, reflecting the degree of enrichment or inhibition of each pathway. ***p < 0.001.

Immunoinfiltration analysis of ST14

With the increasing focus on the TME, accumulating evidence suggests that tumor-infiltrating immune cells play a critical role in cancer initiation, progression, and aggressiveness31,32. The correlation between ST14 and the TME was further explored. The analysis demonstrated that the ST14 high-expression group had significantly higher scores in StromalScore, ImmuneScore, and ESTIMATEScore compared to the low-expression group (Fig. 5A). Figure 5B illustrates the differences in immune cell infiltration between the ST14 expression groups across multiple immune cell types. Specifically, the ST14 low-expression group exhibited higher infiltration levels of Plasma cells, T cells gamma delta, and Mast cells resting, while the ST14 high-expression group had higher infiltration levels of T cells regulatory (Tregs), Macrophages M0, and Mast cells activated. Further correlation analysis demonstrated significant associations between ST14 expression and various immune cell types (Fig. 5C). Notably, ST14 showed a positive correlation with Macrophages M0 (r = 0.27, p = 3.8e−09), activated Mast cells (r = 0.15, p = 0.0012), NK cells resting (r = 0.099, p = 0.033), and T cells regulatory (Tregs) (r = 0.21, p = 8.3e−06), while it exhibited a negative correlation with resting Mast cells (r = − 0.12, p = 0.0081), Plasma cells (r = − 0.18, p = 8.9e−05), and T cells gamma delta (r = − 0.21, p = 6.3e−06). Additionally, a significant correlation between ST14 expression and tumor mutation burden was observed (r = 0.12, p = 0.0057) (Fig. 5C). The lollipop plot (Fig. 5D) further highlighted the significant correlations between multiple immune cell types and ST14 gene expression. 
ST14 was positively correlated with Macrophages M0 (r = 0.27, p < 0.001), T cells regulatory (Tregs) (r = 0.21, p < 0.001), activated Mast cells (r = 0.15, p = 0.001), and NK cells resting (r = 0.10, p = 0.033), while it was negatively correlated with resting Mast cells (r = − 0.12, p = 0.008), Plasma cells (r = − 0.18, p < 0.001), and T cells gamma delta (r = − 0.21, p < 0.001). Furthermore, 12 ICP genes associated with ST14 were identified, with CD276 exhibiting the highest correlation coefficient (COR = 0.47) (Fig. 5E). Finally, the relationship between ST14 expression and immunotherapy efficacy (PD1 and CTLA4 treatment) was analyzed (Fig. 5F), and the results indicated a strong correlation between ST14 expression and the treatment efficacy of PD1 and CTLA4.

Fig. 5

Analysis of the role of ST14 gene in TME. (A) The comparison of StromalScore, ImmuneScore, and ESTIMATEScore between groups with high and low ST14 expression; (B) The variation in proportions of high and low ST14 expression groups among multiple immune cell types; (C) Correlation analysis shows significant associations between ST14 and different immune cell types; (D) Lollipop plot demonstrating the correlation between ST14 and multiple immune cell types; (E) Correlation of 12 ICP genes associated with ST14 is demonstrated; (F) Impact of ST14 high and low expression groups on immunotherapy efficacy is shown. **p < 0.01; ***p < 0.001.

Drug sensitivity analysis and single-cell analysis of ST14

Based on the expression levels of ST14, LUAD samples were classified into high and low expression groups to examine their relationship with drug sensitivity (Fig. 6). The analysis indicated that the low ST14 expression group exhibited lower IC50 values for Linsitinib, Lenalidomide, Doxorubicin, Suboxone, Rucaparib, Palbociclib, OSI-027, and GSK2126458, suggesting increased sensitivity to these drugs. In contrast, the high ST14 expression group showed lower IC50 values for Cadrine, ERK_6604, BMS-754807, and AZD2014, indicating greater sensitivity to these drugs. To investigate the expression pattern of ST14 across various cellular subpopulations in NSCLC, the single-cell database TISCH (http://tisch.comp-genomics.org/search-gene/) was used to analyze eight different single-cell datasets (GSE117570, GSE127465, GSE146100, GSE143423, GSE150660, GSE131907, GSE162498, GSE139955). As illustrated in Fig. 7, ST14 is predominantly expressed in monocytes, macrophages, malignant cells, and epithelial cells.

Fig. 6

Drug sensitivity analysis based on ST14 expression. The figure shows the IC50 value, and the lower the IC50 value, the higher the drug sensitivity.

Fig. 7

Expression patterns of ST14 gene in different NSCLC single cell datasets. The figure shows the ST14 expression levels of various cell populations.

Model genes expression in LUAD cell lines and tissues

The expression levels of BCMO1, FHL1, ST14, PPAP2C, SNCAIP, EML1, F10, and ADM2 in LUAD cell lines (PC9, A549, H1299, HCC827) and Beas-2B control cells were analyzed using qRT-PCR. The results demonstrated that BCMO1, ST14, PPAP2C, and ADM2 were upregulated in LUAD cell lines, whereas FHL1, SNCAIP, EML1, and F10 were downregulated (Fig. 8A). Additionally, in 30 paired LUAD tumor and adjacent normal tissue samples, a similar expression pattern was observed: BCMO1, ST14, PPAP2C, and ADM2 were upregulated in tumor tissues, while FHL1, SNCAIP, EML1, and F10 were downregulated (Fig. 8B). Furthermore, immunohistochemical data from the HPA database corroborated the upregulation of ST14 and PPAP2C in lung cancer tissues, whereas FHL1, SNCAIP, and EML1 were found to be downregulated (Fig. S7).

Fig. 8

Gene expression analysis in LUAD cell lines and tumor tissues. (A) qRT-PCR was performed to detect the expression of BCMO1, FHL1, ST14, PPAP2C, SNCAIP, EML1, F10, and ADM2 in LUAD cell lines versus Beas-2B control cells. (B) qRT-PCR analysis of gene expression in 30 paired LUAD tumor and adjacent normal tissues. *p < 0.05; **p < 0.01; ****p < 0.0001.

Discussion

LUAD is a major subtype of NSCLC33,34 characterized by its high aggressiveness and poor prognosis. Despite advances in targeted therapies and immunotherapies, the heterogeneity of LUAD presents significant challenges to treatment efficacy and patient prognosis35,36. Therefore, the identification of stable and clinically relevant molecular markers is essential for advancing individualized diagnosis and treatment.

The aim of this study was to systematically identify key genes associated with disease progression and immune regulation in LUAD by integrating multiple bioinformatics strategies and to evaluate their potential utility in diagnosis, prognosis, and therapeutic targeting. To achieve this, two LUAD gene expression datasets (GSE32665 and GSE68465) were integrated, and batch effect correction was performed, which markedly improved the consistency of gene expression across datasets. A total of 645 DEGs, including 358 upregulated and 287 downregulated genes, were identified through differential expression analysis, providing a foundation for subsequent investigations. Based on these DEGs, a weighted gene co-expression network was constructed, resulting in the identification of ten functional modules. Among them, the red module demonstrated the strongest correlation with LUAD clinical traits and showed a high level of consistency in both gene significance and module membership, suggesting that it may be enriched with functionally relevant genes. The intersection of the red module and DEGs was then used for machine learning modeling, leading to the development of a high-performance diagnostic model with an average AUC of 0.999 and the identification of eight core diagnostic genes.
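
For readers unfamiliar with the AUC metric reported above, the sketch below shows one way to compute the area under the ROC curve by sweeping the decision threshold and applying the trapezoidal rule. The labels, scores, and function name are hypothetical, and this is not the study's modeling pipeline.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via a threshold sweep and trapezoids.

    labels: 1 for tumor, 0 for normal. scores: higher means more tumor-like.
    Samples with tied scores are processed together as one ROC step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)

    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every sample sharing the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc


# Hypothetical model scores for three tumor and three normal samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print("AUC:", roc_auc(labels, scores))
```

An AUC of 1.0 means every tumor sample scores above every normal sample, and 0.5 corresponds to chance-level ranking.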

To evaluate the biological relevance of these genes, single-cell RNA sequencing data were incorporated to assess their expression across distinct cellular subpopulations. The results revealed cell type–specific expression patterns within the tumor microenvironment, suggesting potential involvement in immunoregulatory processes and clinical applicability. Notably, ST14 exhibited consistent patterns across differential expression analysis, WGCNA, machine learning, and single-cell analysis, with its high expression significantly associated with poor prognosis, further supporting its potential as a diagnostic and prognostic biomarker for LUAD. Although certain analyses in this study remain at a preliminary, exploratory stage (particularly the single-cell validation, which was limited to expression-level assessment), the overall findings provide a solid foundation for future mechanistic research and clinical translation.

The findings of this study offer insight into the molecular mechanisms of LUAD and highlight the potential of specific pathways in guiding therapeutic strategies. KEGG pathway analysis of intersecting genes identified three key pathways: the renin-angiotensin system (RAS), the TGF-β signaling pathway, and carbohydrate metabolism, all of which are closely associated with tumor progression and patient prognosis. The RAS, traditionally recognized for its role in regulating blood pressure and fluid balance, has also gained attention for its involvement in cancer biology. RAS components have been shown to promote tumor growth and metastasis through mechanisms such as angiogenesis and immunomodulation37. The significant enrichment of RAS-related genes in LUAD observed in our analysis suggests that targeting this pathway may offer novel therapeutic strategies, particularly in patients with elevated expression of RAS components. TGF-β, in turn, can induce epithelial-to-mesenchymal transition (EMT), enhancing metastatic potential, and can modulate immune responses within the tumor microenvironment38,39,40. Finally, carbohydrate metabolic pathways, especially those involving fructose and mannose metabolism, have emerged as key regulators of cancer cell proliferation and survival41. Alterations in metabolic pathways are increasingly recognized as hallmarks of cancer, with metabolic reprogramming enabling tumor cells to thrive in nutrient-poor environments42,43,44,45. Taken together, the identification of these pathways deepens our understanding of LUAD biology and may inform the development of new targeted therapies.

Previous studies have demonstrated that ST14 is linked to tumor progression and poor prognosis in multiple cancer types46,47,48. The association between elevated ST14 expression and reduced OS in LUAD patients further highlights its clinical relevance, underscoring the need for further investigation into its role in tumor biology. The immune status of LUAD is increasingly acknowledged as a key factor in tumor progression and patient prognosis49. Our analysis highlighted a strong correlation between ST14 expression and several immune cell types, notably M0 macrophages, activated mast cells, and regulatory T cells (Tregs). M0 macrophages exhibit dual roles in the tumor microenvironment, displaying both pro- and anti-tumorigenic activities depending on their polarization status50,51,52,53. The positive correlation between ST14 and M0 macrophages suggests that ST14 may influence macrophage polarization and function, potentially regulating tumor progression and patient prognosis. This association implies that ST14 could contribute to the formation of a tumor-promoting environment by modulating macrophage polarization. Previous studies have linked macrophage infiltration to poor prognosis across multiple cancer types54,55. The relationship between ST14 and Tregs is particularly noteworthy. Tregs are known to suppress anti-tumor immune responses, enabling tumor cells to evade immune surveillance56,57. The positive correlation observed in our study indicates that high ST14 expression may increase Treg populations, thereby exacerbating immune evasion in LUAD. This finding aligns with existing literature, which highlights the role of Tregs in promoting tumor growth and metastasis58,59. Moreover, the association between ST14 and activated mast cells adds further complexity to the immune landscape of LUAD. Mast cells participate in various immune responses, including angiogenesis and tissue remodeling60,61,62, and their activation has been linked to tumor progression63,64,65. 
The positive correlation between ST14 and activated mast cells suggests that ST14 may influence mast cell activity, potentially creating a microenvironment conducive to tumor growth. This is consistent with findings that mast cell infiltration is associated with poor prognosis in various malignancies66,67. In summary, our findings emphasize the role of ST14 in shaping the LUAD immune microenvironment, particularly through its interactions with M0 macrophages, Tregs, and activated mast cells. These insights enhance our understanding of LUAD biology and suggest potential therapeutic targets for modulating the immune response in this malignancy. Future studies should focus on elucidating the precise mechanisms by which ST14 affects immune cell dynamics and explore its potential as a biomarker for immunotherapy response.
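
The kind of association discussed above, between ST14 expression and an estimated immune cell fraction, can be illustrated with a rank correlation. The sketch below assumes Spearman's coefficient, which is not stated in this section, and uses invented values and function names. It shows the calculation only and does not reproduce the study's results.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same value.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values: ST14 expression and an immune cell fraction.
st14_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
cell_fraction = [0.20, 0.10, 0.40, 0.30, 0.50]
print("Spearman rho:", spearman(st14_expr, cell_fraction))
```

A correlation of this kind describes co-variation across samples and does not by itself establish that ST14 drives changes in immune cell abundance.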

Immune checkpoints play a critical regulatory role in tumor immune surveillance68. Among these checkpoint molecules, PD-1 (programmed cell death protein 1) and CTLA-4 (cytotoxic T lymphocyte-associated protein 4) are the most extensively studied69,70. It has been demonstrated that the expression of these molecules influences the efficacy of immunotherapy71. We investigated the association between ST14 expression and the efficacy of immunotherapy, specifically anti-PD-1 and anti-CTLA-4 treatments. The results demonstrated a significant correlation between ST14 expression and therapeutic outcomes for anti-PD-1 and anti-CTLA-4 therapy, implying that ST14 could serve as a potential biomarker for predicting response to these therapies. Additionally, enrichment analysis showed that ST14 is associated with various biological processes and pathways, particularly those related to cell division and metabolic processes. These findings position ST14 not only as a potential prognostic marker but also as a candidate target for LUAD treatment. The observed differences in drug sensitivity based on ST14 expression levels further emphasize its potential role in personalized medicine, suggesting that ST14 could guide therapeutic decisions and improve patient prognosis. Moreover, gene expression was examined at single-cell resolution across multiple cell types derived from the tumor microenvironment. In tumor samples with significant cellular heterogeneity, single-cell data were reduced in dimensionality and visualized using the t-SNE method. The results revealed expression patterns of six model genes (FHL1, SNCAIP, EML1, F10, ADM2, and ST14), with particular emphasis on ST14. In single-cell t-SNE maps, ST14 expression was predominantly concentrated in monocytes, macrophages, malignant cells, and epithelial cells, suggesting that ST14 may play a role in the tumor immune microenvironment. 
These findings indicate that ST14 may regulate interactions between immune cells and tumor cells, contributing to tumor progression. Finally, the expression of eight model genes was analyzed in LUAD cell lines (PC9, A549, H1299, HCC827) and Beas-2B control cells, as well as in LUAD tumors and adjacent normal tissues using qRT-PCR. The expression patterns of these genes provide valuable insights into the pathogenesis of lung cancer and may serve as a theoretical foundation for developing novel therapeutic targets.

Although this study systematically constructed and validated potential biomarkers for LUAD through the integration of multi-omics data, several limitations remain. First, the analysis primarily relied on publicly available datasets, which may be subject to sample selection bias, incomplete clinical annotation, and inherent data heterogeneity. Second, although key genes were cross-validated in multiple independent datasets, supporting experimental evidence from prospective clinical samples is still lacking. Additionally, the single-cell RNA sequencing analysis was limited to expression distribution, without further investigation into the specific roles of these genes in cell fate determination, signaling pathway regulation, or immune function. Therefore, future studies should focus on validation in larger, independent prospective cohorts and incorporate functional experiments to further elucidate the biological roles of the identified genes in LUAD progression and therapeutic response.

Conclusion

This study underscores the utility of LUAD biomarkers in enhancing diagnostic precision and prognostic stratification through a comprehensive analysis of publicly accessible datasets, complemented by the application of WGCNA and machine learning models. Notably, the core gene ST14 was significantly correlated with clinical features and immune cell infiltration, highlighting its potential role within the tumor microenvironment (TME). Moreover, the integration of single-cell RNA sequencing data enabled a detailed exploration of gene expression heterogeneity across various cell types. While these findings pave the way for further investigation of therapeutic potential and drug sensitivity, the inherent limitations of the study should be acknowledged. Overall, this work contributes valuable insights into the biology of LUAD and its prospective clinical applications.