Introduction

Lung and bronchus cancers continue to outpace other cancers in projected cancer deaths in recent years1, with an estimated 127,070 Americans expected to succumb to lung cancer annually. This figure nearly matches the combined death toll for the next three most common cancers: colon, pancreatic and breast. Approximately 85% of lung cancer diagnoses are of the non-small cell lung (NSCLC) variety2. NSCLC is further categorized into three main subtypes: lung adenocarcinomas (LUADs, 40–50% of lung cancers), squamous cell carcinomas (20–25% of lung cancers), and large cell carcinomas (10–15% of lung cancers), of which LUAD is the most prominent subtype, even in non-smokers (< 100 cigarettes)3. Lung cancer exhibits extensive heterogeneity, and a higher tumor burden compared to many other cancer types. Among NSCLC subtypes, LUADs harbor more therapeutically actionable aberrations4,5. Coupled with the clinical effectiveness of targeted therapies in LUAD, there has been a notable shift towards a driver oncogenic paradigm in the treatment of NSCLC6,7.

In recent years, the advent of updated tyrosine kinase inhibitors (TKIs) targeting EGFR and ALK alterations combined with the introduction of PD1 and PD-L1 inhibitors in 2015, has progressed the development of therapies for advanced and metastatic LUAD. However, earlier stage LUAD has seen few recent advances in therapy8. As many patients do not present with symptoms at early-stage, vigilance for early detection is key9. For those with high risk of lung cancer, a low dose computed tomography screening program augmented with risk prediction models may serve as the best course of action for overall mortality reduction and population-wide early detection10. The stagnation of progress in identifying early-stage treatments for LUAD patients highlights the need for an in-depth systems-level analysis of the complex genetic and molecular interactions that allow for the initiation and progression of LUAD tumor biology in earlier stages (stages 1 and 2).

Prevalent practice involves the use of TNM staging to inform patient prognosis; however, relying solely on anatomical factors alone may not accurately predict patient prognosis. Additionally, patients at the same stage may not respond to treatments the same because of patient-specific genetics and tumor heterogeneity. To better understand the role of LUAD tumor biology in earlier stages (where tumor growth is more contained with limited local invasion, reflecting a relatively preserved tissue structure) division of patients into accurate subtypes informing on survival and influencing treatment paradigms is needed11,12. Expression-based survival signatures are essential for identifying gene expression markers linked to survival outcomes and have been widely used to predict overall and disease-free survival or stratify NSCLC risk/recurrence13,14,15,16,17,18,19,20,21, but still have limitations in reproducibility and clinical test accuracy. These approaches aid in tailoring therapies to tumor-specific characteristics and are crucial for defining patient subtypes in early-stage LUAD. Most notably, Song22, Shedden23 and Soltis24 identified several molecular subtypes either alone or combined with relevant clinical characteristics, revealing expression signatures that are predictive of patient survival outcomes.

In this study, we applied an integrative systems biology approach to explore a LUAD transcript co-expression network, identifying potential biomarkers linked to patient survival. Differentially expressed gene transcripts, identified as network hubs (top 10% of module members ranked by module eigengene-based connectivity) within five selected modules (those most correlated with overall patient survival and tumor staging), were collectively analyzed using Receiver Operating Characteristic (ROC) curves to enhance survival prediction in LUAD patients. Last, we compared our survival prediction targets to existing LUAD prognostic and survival signatures presented by Song, Shedden, and Soltis22,23,24. Our approach demonstrates a gene combination superior in survival prediction to the comparators, nominating novel prognostic biomarkers underscoring mechanisms underlying LUAD tumor cell heterogeneity, proliferation, metastasis, and patient survival.

Materials and methods

RNA-seq data collection and cleaning

TCGA LUAD patient RNA-seq data was acquired from GDC (Genomic Data Commons) via the TCGA data portal: https://portal.gdc.cancer.gov. FPKM values (fragments per kilobase of transcript per million mapped fragments) from the TCGA-LUAD project were downloaded and log2 transformed for each participant along with clinical data totaling 502 primary tumor samples. This log2 transformation was chosen to ensure that the expression data adhere to assumptions of normality and are less susceptible to high dynamic range variances often observed in RNA-seq data. Staging criteria, based on TNM classification, are detailed in Table 1, distinguishing early (stages I–II) from advanced LUAD (stages III–IV). In preparation for ANOVA and parallel bootstrap regression analysis, we first removed samples that did not have staging (n = 8), age (n = 19), sex (n = 0) or t-staging (n = 3) trait data. Age and sex were regressed out during the analysis to account for potential confounding effects, ensuring that survival prediction is based solely on gene expression without demographic biases. We enforced a ≥ 50% missingness restriction by removing transcripts rows with ≥ 50% zero FPKM values. Seven samples with an absolute standardized connectivity more than three standard deviations above the mean were termed outliers and removed from the cohort. Four samples termed “Do not use” by the sample quality GDC whitelist were also removed from the cohort25. Patients without clinical data were also excluded. In total, 41 samples and 46,997 transcripts were expunged. Two methods were used to assess the quality of the dataset before and after data cleaning (1) MBatch, a web-based analysis tool (http://bioinformatics.mdanderson.org/tcgabatcheffects) and (2) a multidimensional scaling analysis adapting the plotMDS function from the R statistical program package limma (Supplementary Fig. 1)26. An overview of patient characteristics, demographic and clinical staging data is provided (Supplementary Tables 1 and 2).

Table 1 TCGA clinical trait outcomes and descriptions for 461 primary tumor RNA abundance samples.

Network construction

Key modules were identified through correlation with all available LUAD patient traits. Co-expression network analysis was performed via the clustering method WGCNA v1.70.3 (Weighted Gene Correlation Network Analysis) R package27. 13,486 gene rows and 461 samples were used for network construction. The ideal soft threshold for the scale-free topology and adjacency calculation was determined graphically (β = 10.2; scale-free fit > 0.8; mean connectivity = 48.9). For automatic network construction, module detection relied on the blockwiseModules function, used with detailed specifications shown in Supplementary Table 3. Genes that were not correlated to a module of at least 100 gene products were assigned the color grey (not in a module). Significance of module correlation to traits was defined by a student’s p-value ≤ 0.05 for biweight midcorrelation (bicor) of the first principal component of a module (module eigengene, ME) to each trait encoded numerically either as a continuous or binary variable, with bicor from − 1 to + 1 as a heatmap scale (Supplementary Fig. 2). Bicor was used instead of Pearson correlation to calculate robust correlations while reducing susceptibility to outliers28. Bicor and associated Student p-values are ideal for summarizing correlation robustly because RNA-seq data regularly has high dynamic range and variance across samples. To ensure robustness, we confirmed that DESeq229 normalized counts and FPKM-based networks were similar via module preservation analysis and by Fisher’s exact test for overrepresentation of count-based modules into the FPKM network modules. Differentially expressed (see below Sect. 2.4) hub genes, defined as the top 10% by connectivity within key modules, and anti-correlated modules were prioritized based on module membership (kME) scores and associations with key clinical traits (stage and survival), to identify those most biologically relevant to LUAD outcomes.

Gene ontology enrichment analysis

Gene ontology enrichment analysis was performed by GOparallel (https://github.com/edammer/GOparallel), on WGCNA module member gene symbols. BaderLab’s monthly updated .GMT formatted ontology gene lists, and additional curated pathways from molecular signature MSig30,31 C2 database, Reactome32 and WikiPathways33 were used as references to determine the association of network modules with assembled lists focusing broadly on regulatory target gene sets, chemical and genetic perturbations, and canonical pathways. Module member gene lists were tested for statistical overrepresentation in select biological processes, molecular functions, and cellular components. Specifically, one tailed-Fisher’s exact test (FET) for hypergeometric overlap was used to assess significant enrichment of module members (gene names) with genes in each GO term, determining if the observed overlap is greater than expected by chance. Additionally, module-specific gene lists were subjected to the same hypergeometric overlap test against experimentally confirmed target genes for selected miRNA cancer driver candidates34,35,36. All reported Fisher exact p-values are adjusted to false discovery rate (FDR) by the Benjamini–Hochberg method to ensure robust significance levels (see Supplemental Materials).

Differential expression analysis

To identify significant differences in gene expression across tumor stages, we conducted a one-way ANOVA on log2-transformed RNA-seq data, enabling the testing of the null hypothesis that mean gene expression is equal across stages 1 (n = 246), 2 (n = 115), 3 (n = 78), and 4 (n = 22). Differential expression analyses included pairwise comparisons between these stages to highlight key gene expression changes relevant to disease progression. Comparisons were conducted between adjacent and non-adjacent stages (e.g., stage 1 vs. stage 2, stage 2 vs. stage 3, stage 1 vs. stage 4, stage 1 vs all other stages) to capture a comprehensive view of gene expression dynamics.

Separate subgroup analyses within each stage were performed to account for variability among clinical groups. Following the ANOVA, Tukey’s Honest Significant Difference (HSD) post-hoc test was applied for pairwise stage comparisons to maintain control over family-wise error rates. Pairwise comparisons, equivalent to t-tests when conducted post-ANOVA, were adjusted using the Benjamini-Hochberg (BH) correction to manage the false discovery rate (FDR) across all genes.

Significance thresholds for differential expression were set at a p-value ≤ 0.05 and a log2 fold change (L2FC) cutoff of |0.5|, selected to capture biologically meaningful expression changes. These criteria were chosen to balance the detection of subtle but potentially significant biological changes with a conservative FDR limit. The ANOVA pipeline, including gene-specific analysis, FDR adjustment, and L2FC computation, was automated to standardize comparisons across groups. The 8 genes making up the reported LUAD prognostic signature were tested for significance not only as above, but also in the DESeq2 framework using downloaded counts and a FDR-corrected Wald test.

ROC testing of combinatorial transcript normalized abundance ratios as predictors of survival

In our survival analysis, we focused on overall survival (OS), defined as the duration from diagnosis to death from any cause, with censoring for patients who were alive at their last recorded follow-up. OS data and survival status were obtained from Liu37, who provided the detailed data structure used in this study. Specifically, death events are coded as 1 for deceased patients and 0 for those alive, while OS.time represents survival duration in days, calculated as the maximum of either the last contact date or the death date (information is detailed in Table 1). This structured approach provides a standardized and clinically relevant endpoint, effectively capturing disease progression and survival outcomes in LUAD.

To evaluate the predictive potential of combinatorial transcript ratios for LUAD patient survival, we conducted a three-step normalization of gene-level data using CLUSTER 3.0 (http://bonsai.hgc.jp/~mdehoon/software/cluster/command.txt) as described in Karn38. Normalized expression levels for differentially expressed and hub transcripts (considered as the top 10 percent of module members (i.e., transcripts) ranked by kME) were summed and ratioed in combinations of 1:1, 2:2, 3:3 or 4:4. Prognostic ratios of sums were calculated for all samples. Each high-expression/high-survival candidate gene’s normalized abundance (sum) was divided by within-sample low-expression/high-survival gene abundance (sum), therefore opposing alterations in sample would amplify sensitivity for survival prediction and ROC specificity if the expression patterns were net coordinated within and across samples, thereby enhancing correlation of the transcript abundance combination to survival measure(s). Corrected and complete OS for the TCGA LUAD cases by Liu37, was used to inform linear model fits of each set of ratio calculations (prognostic predictor) to survival outcomes at 5 survival time points (12 months, 18 months, 3 years, 5 years and 10 years)37. The R glm function, with binomial variance family function and default linkage (binomial), was used to assess prediction accuracy. The pROC R package was utilized to test and plot each ROC curve using a binary OS time-dependent value and ordered glm-fit predictor. Area under the curve (AUC) was used to rank the top predictor ratios for specific time points and as an average AUC for the 5 time points. AUC 95% confidence interval (CI) calculations were made using the R verification package function ci. As previously described in Dill39 iterative assembly of combinations of gene transcript normalized abundance ratios was performed with transcripts nominated via our pipeline, with initial nomination focusing on transcripts that are deemed differentially expressed between stage 1 and stage 2 comparison and present as hub genes from anticorrelated modules in the network. Pipeline-nominated transcript lists from M3-like and M6-like modules assembled into prognostic indicator ratios were tested and by three subsequent rounds of ROC analysis as described in Dill39 with noted distinctions. Additional nomination tactics (e.g., literature, pathway databases, the OncoScore package and top survival-associated miRNAs) were used for additional rounds ROC curve analysis, attempting to achieve higher predictive performance (i.e., AUC) and are included in the supplemental materials.

Comparison of seminal survival related signatures and our top-performing survival signature

To evaluate the performance of our 8-gene survival prediction signature relative to existing LUAD survival-related signatures (e.g. Song, Shedden, and Soltis signatures), we conducted a systematic analysis that included data normalization, prognostic ratio calculations, ROC analysis, and heatmap visualization of correlations. A Research Compendium (RC) supporting this analysis, was developed to best adhere to FAIR (Findability, Accessibility, Interoperability, and Reuse of digital assets) principles40, includes all code and data and, is available online (https://github.com/cdyoung2/LUAD_Signature).

For each LUAD sample, we calculated prognostic ratios using gene-specific abundances derived from each signature. Prognostic ratios were calculated as the sum of expression for high-expression/high-survival genes divided by the sum of low-expression/high-survival genes within each sample. This ratio-based approach amplifies the survival prediction sensitivity by leveraging coordinated expression patterns within each signature. To assess the predictive accuracy of each signature at multiple survival intervals, we conducted ROC analysis using our five survival time points. ROC curves were generated using the pROC package in R, and the AUC was calculated at each time point as a measure of predictive accuracy. We then compared AUCs across the signatures using DeLong’s test, a non-parametric approach to evaluate whether differences in AUC values were statistically significant.

To visualize the associations between gene signatures and survival-related modules, we generated heatmaps representing bicor correlations and their statistical significance. Each heatmap displays the strength and direction of the correlation between each gene signature and specific LUAD-associated biological modules, with particular emphasis on the M6 and M3 modules due to their established relevance to survival outcomes. The bicor (biweight midcorrelation) method was applied to compute robust correlations between gene signatures and survival-related modules. This method, known for its resistance to the influence of outliers, was chosen to accommodate the high dynamic range and variance typical of RNA-seq data. Statistical significance for each bicor was evaluated as a Student’s p-value, with a significance threshold of p ≤ 0.05, enabling the identification and ranking of modules significantly associated with each survival signature. The bicors were visualized as a heatmap using the WGCNA labeledHeatmap function to capture both the magnitude and direction of correlations through color intensity, providing an intuitive, comparative overview of each signature’s alignment with LUAD-associated biological pathways, and enabling for a comprehensive view of the relationships between gene signatures and core LUAD survival-related phenotypes.

Validation of the 8-gene signature in CPTAC and GSE31210 LUAD datasets

To evaluate the generalizability of the 8-gene survival signature, we conducted external validation using two independent datasets: the CPTAC LUAD RNA-seq dataset (accessed via cBioPortal) and GSE31210, a microarray-based expression dataset profiled using the Affymetrix Human Genome U133 Plus 2.0 platform. This analysis aimed to assess the robustness of our transcriptomic signature across distinct platforms and patient populations.

Both TCGA and CPTAC datasets were processed using a harmonized pipeline to ensure comparability. Briefly, gene identifiers were standardized, genes with > 50% missingness were removed, expression values were log2-transformed (excluding zero values), samples were median-centered, and data were scaled to account for platform-specific differences. CPTAC samples were matched with corresponding clinical outcome data, and 18-month overall survival (OS) was defined as the primary endpoint. We computed ratio-based signatures derived from the current study as well as comparator signatures from Song et al. (2022), Shedden et al. (2008), and Soltis et al. (2023). Predictive performance was assessed using ROC and AUC analyses. All code, data processing scripts, and analysis pipelines are publicly available at https://github.com/cdyoung2/LUAD_Signature.

To validate performance on a different platform, we applied the same analysis framework to GSE31210. Raw series matrix and annotation files were downloaded from GEO. Probe-level expression data were mapped to gene symbols based on the GPL570 annotation platform and collapsed to gene-level by averaging across probes. Normalization followed a three-step CLUSTER 3.0-based approach consistent with prior work: log2 transformation, sample-wise median centering etc. The 8-gene ratio was calculated identically to the TCGA analysis, and ROC analysis was performed using 18-month OS as the endpoint.

We note important differences in platform and cohort composition between datasets. While TCGA and CPTAC employed RNA-seq, GSE31210 was profiled using microarray. Additionally, GSE31210 consists predominantly of early-stage, non-smoking LUAD patients from a Japanese cohort, in contrast to the more heterogeneous U.S.-based cohorts in TCGA and CPTAC, which include a broader distribution of disease stage and smoking status. These biological and technical distinctions may contribute to observed differences in predictive performance and are considered in the interpretation of results.

Single-cell RNA-seq data processing and visualization

To explore the tumor microenvironmental (TME) context of our 8-gene LUAD signature, we analyzed publicly available single-cell RNA-seq data from GSE123902. Raw dense matrix files from 17 LUAD tumor samples were downloaded and processed using the Seurat v5.0.1 R package. Each matrix was transposed and filtered to create individual Seurat objects using standard quality control filters (min.cells = 3, min.features = 200). Data were normalized, scaled, and reduced using PCA and UMAP. Seurat objects were merged into a global dataset and clustered using a resolution of 0.5. Expression of the LUAD 8-gene signature (ATP6V0E1, SVBP, HSDL1, UBTD1, GNPNAT1, XRCC2, TFAP2A, PPP1R13L) was visualized using DotPlots and FeaturePlots across clusters. Genes not detected in at least two clusters or lacking variability were excluded from violin visualizations. Cluster identities were inferred based on canonical markers from the original study and immune/stromal/tumor annotation patterns.

Results

Transcriptomic network associated with LUAD survival and staging

RNA-abundance data from 461 LUAD primary tumors (TCGA) were analyzed for associations with clinical outcomes (Table 1, Supplementary Tables 1, 2). Using WGCNA, 18 modules were identified (Supplementary Fig. 3) and ranked by the number of genes contained within those modules from largest to smallest (M1–M18) (Supplementary Table 4). A comprehensive list of all gene transcripts and module membership via correlations (kME) of each gene transcript to the 18 module eigengenes (MEs, calculated as the first principal component of each module across samples) is provided in Supplementary Table 5. MEs were organized by pairwise correlation as a measure of module relatedness and plotted as a dendrogram and heatmap (Supplementary Fig. 4A). To further validate robustness of our network, we performed a module preservation analysis using normalized counts obtained from DESeq2. Networks from normalized counts closely resembled those from FPKM data (Zsummary > 10, p < 1 × 10–22). Overrepresentation analysis confirmed bicor correlations produced similar outcomes, supporting the use of FPKM or DESeq2 normalized data in subsequent analyses.

To address gaps in early LUAD progression, we focused on modules significantly linked to staging and survival. We identified 11 modules (p ≤ 0.01) correlated with staging traits, either positively (red), or negatively (blue). Furthermore, we identified 7 modules to be significantly correlated with survival traits, and 6 modules (M1, M3, M6, M7, M9 and M11) significantly correlated to both survival and staging traits (Supplementary Fig. 2; Fig. 1). The network identifies survival- and staging-associated modules, revealing two opposing gene co-expression clusters: M1, M6, M9, M16 vs. M3, M7 (Supplementary Table 5). The two module clusters showed an antagonistic expression pattern visualized by robust correlation to multiple early staging traits (Stages: T1, 1, N0, and other binary classifications; Supplementary Figs. 2, 4) as well as their correlations to continuous clinical traits representing survival or tumor progression times. Network inter-module correlation and anticorrelation are further detailed in the correlation heatmap (Supplementary Fig. 4B) and kME table (Supplementary Table 5). Transcripts in the kME table are ordered from most to least representative module members within each module and a green–yellow–red color scale was used to identify positive correlation (red) and negative correlation (green) patterns between modules (highlighting top-ranked hubs best representing each module’s expression pattern). The modules identified with significant stage or survival/progression trait correlation(s) are considered candidate neighborhoods of the gene transcript network, in which exist possible drivers of LUAD initiation or progression, or genes that are signatures correlating to patient survival. However, causation cannot be directly inferred from the network’s undirected co-expression-based connections.

Fig. 1
figure 1

Co-expression network analysis of LUAD transcripts identified modules and module-trait correlations relevant to LUAD. The heatmap displays 18 modules identified. A Kruskal Wallis test among the LUAD case groups with significance set to 0.05, revealed 11 modules were significantly associated with early stage, 7 modules were associated with survival traits, 5 modules (M1, M3, M6, M9, and M16) were associated with both staging and survival (SAS) traits. Overlaid numbers are Student’s p values for bicor significance of trait correlation to the module eigengenes. Module-trait bicor color scale (− 1, blue; 0, white; + 1 red) indicates modules with significant Student’s p cluster together.

Gene ontology (GO) enrichment analysis assessed the biological coherence of network modules (Top 5 biological processes; Table 2). Key processes included RNA splicing (M1), immune response (M5, M9, M16), lipid metabolism (M6), cell cycle (M3), oxidative phosphorylation (M7), and nucleic acid metabolism (M11). Modules M3 and M7 correlated positively with advanced pathological stage, later-stage disease, and mortality (1 = death), and negatively with early-stage disease.

Table 2 Ontology enrichment modules for MEs of interest with correlation to survival or staging clinical traits.

Five modules dually correlated to staging and survival are enriched with differentially expressed genes (DEGs)

To further elucidate the mechanisms of LUAD initiation and progression, as well as their roles in mortality, we decided to focus on one primary staging clinical trait (pathological staging) and one primary survival trait (OS; binary trait indicating death)) for additional analysis. Pathological staging and OS were selected for their accuracy and clinical relevance in assessing disease extent and treatment impact. Modules M1, M3, M6, M9 and M16, designated as staging and survival (SAS) modules, were significantly correlated with pathologic staging, and were inversely correlated with stage 1 status versus later stages (Fig. 1). We identified M1, M6, M9 and M16 as negatively correlated with increased pathologic staging and OS. Conversely, M3 was positively correlated with increased pathological staging and OS. To explore the strong anticorrelated nature of M3 to M6, M1, M9 and M16 in our SAS-modules, a one-way ANOVA was used to compare LUAD modules (overall weighted expression patterns) across staging groups, and as expected, expression across stages was significant for the SAS-modules (Fig. 2A). Volcano plots (Fig. 2B) show 556 DEGs in stage 1 versus 2, including 336 in our SAS-modules. Eighty-six genes were upregulated (M3) and 280 were downregulated (M1, M6, M9) between the stage 1 and stage 2 groups. Fifty transcripts meeting our criteria for differential expression between stage 1 and stage 2 are detailed in Table 3. Differential expression analysis via DESeq2 confirmed a majority of differentially expressed genes were preserved across both methods. Additionally, a T-test between OS binary outcome groups was used to explore any significant difference between groups that suffered a death event and those that were alive at last follow-up within the SAS-modules. In Fig. 3, SAS-modules are also shown to be significantly differentially expressed between OS groups (boxplots for the other 13 modules are shown in Supplementary Fig. 5 in Supplementary Materials). Modules M1, M6, M9 and M16 all display a similar expression pattern in both OS and pathologic staging, where module gene expression is increased in living and stage 1 samples. Conversely, M3 gene expression detailed decreased gene expression across those same sample groups. Moreover, there was a notable difference in the rate of mortality between stage 1 samples (~ 24%) and later staged samples (~ 50%). To further assess expression changes in early-stage modules, we analyzed DEGs (n = 119, Table 3) between stage 1 and all later stages (Fig. 2A). SAS-modules enriched DEGs were evaluated as possible module driver genes among module transcripts and are bold in Table 3. To further understand the mechanisms in which these possible drivers are modulated we hypothesized that SAS hub + DEGs serve as targets of upstream dysregulated miRNAs, evaluated in the supplemental material results (Supplementary Tables 6, 7, Supplementary Figs. 6 and 7).

Fig. 2
figure 2

Volcano plots of differentially expressed genes (DEGs) in LUAD transcript cases. (A) A one-way ANOVA was used to compare LUAD modules (overall weighted expression patterns) across staging groups. Expression across stages was significant for the 5 SAS-modules (M1, M3, M6, M9 and M16). (B) LUAD stage 1 case groups were compared to LUAD later stage case groups for 6 SAS-modules in panel B. Log2 (fold change) for each comparison versus Benjamini–Hochberg FDR is plotted, and gene transcripts are colored by module membership. DEGs were counted for > 50% change (vertical cut-off lines at x =  ± 0.58) and FDR < 5% in each individual comparison. No transcripts passed our conditions for upregulation and downregulation in M16.

Table 3 Differentially expressed genes between stage 1 and stage 2 groups (1-way ANOVA) and stage 1 versus all other stages (2-sample T-test) for 5 modules of interest.
Fig. 3
figure 3

Differences in eigengene values for SAS-modules across LUAD cases. Box plots displaying module eigengene values among the five SAS-modules (M1, M3, M6, M9 and M16) are displayed with Students t-test assessed significant differences between OS(N = 461).

Survival significance and prognostic implications of anti-correlated modules for LUAD

Two strongly anti-correlated modules (M6: red and M3: brown) (Supplementary Fig. 4B and Supplementary Table 5) from our opposing SAS-module signatures were selected for further survival analysis. Module M3 correlated positively with a death event (OS) and advanced pathological and clinical stages, while M6 correlated negatively with a death event and later pathological stages but positively with early clinical stage (Fig. 1). Kaplan–Meier (KM) analysis (Fig. 4A) distinguished patients by SAS-module expression. Hazard ratios (HRs) confirmed M3’s opposing HR to M1, M6, M9 & M16 modules. Following M3 module survival association, hub status DEGs which differentiate stage 1 versus stage 2 tumor samples (Table 3) were evaluated via KM analysis (Fig. 4B). Significantly higher survival with lower expression was seen in all M3 differentially expressed hub genes (MYBL2 p < 0.0001, HR = 2.28, CI = 1.49–2.49; KIF4A p = 0.00085, HR = 1.67, CI = 1.23–2.27; RRM2 p < 0.0001, HR = 2.24, CI = 1.65–3.05; DEPDC1 p < 0.0001, HR = 2.07, CI = 1.49–2.87; ANLN p < 0.0001, HR = 2.52, CI = 1.85–3.42).

Fig. 4
figure 4

Survival analysis reveals association between SAS-module eigengene values and overall survival in lung cancer. (A) Two anti-correlated modules (M3 and M6) driving opposing genetic SAS-module signatures were selected for survival analysis. The M3 module was positively associated with OS and while the M6 module was negatively associated with a death event (OS). Kaplan–Meier (KM) survival analysis significantly distinguished patients separated by high and low module expression in all SAS-modules. The red line in each Kaplan–Meier plot represents survival of cases in the higher expression tier and the black line represents survival in cases with lower expression. A Wald test determined p-values and hazard ratio HR scores with 95% CIs. (B) Significantly higher survival with lower expression was seen in all M3 differentially expressed hub genes (MYBL2 p < 0.0001, HR = 2.28, CI = 1.49–2.49; KIF4A p = 0.00085, HR = 1.67, CI = 1.23–2.27; RRM2 p < 0.0001, HR = 2.24, CI = 1.65–3.05; DEPDC1 p < 0.0001, HR = 2.07, CI = 1.49–2.87; ANLN p < 0.0001, HR = 2.52, CI = 1.85–3.42).

Network-nominated ROC testing of multi-gene ratios

We hypothesized that the positive and negative correlation to two key traits and module association with survival candidate genes shown from our previous results could represent genes that interact in determining survival and progression of LUAD patients. To test this hypothesis, we employed ROC curve analysis of ratios of these genes where (1) the numerator genes, when having higher expression, are associated with higher survival and lower stage, and (2) the denominator genes impact survival and reflect the inverse pattern of expression across LUAD tumor stages.

Differentially expressed hub genes (top 10% by kME) from anti-correlated M6 (red) and M3 (brown) modules were selected as LUAD prognostic candidates for the initial round of ROC testing (Table 3; bold transcripts in Stage1-versus-stage2). Given the normalization procedure shown in the methods, and network-defined determination of positions in numerator (high-expression, high-survival) or denominator (low-expression, high-survival). Larger ratios were found to predict better survival in this model.

In the initial round of analysis, 6287 relevant differentially expressed gene combinations were evaluated, covering all possible 1:1, 2:2, 3:3 and 4:4 combination prognostic ratios (Supplementary Table 8). All AUCs were > 56% on average across all 5 timepoints, suggesting the selection of hub status DEGs that differentiate stage 1 versus stage 2 tumor samples may impact survival in the expected direction. The best overall mean AUC across all combinations for all timepoints was FAM189A2 + CYP4B1/TFAP2A + ANLN with a maximum AUC of 84.62 at 10 years and 71.67 at 12 months. The top-ranked mean AUC excluding the 10- and 5-year timepoints was 71.48% (FAM189A2 + TMEM59/TFAP2A + ANLN) with the top ratio at 12 months (72.12%; Fig. 5A). The best combination regardless of timepoint was CYP4B1/RRM2 with an AUC of 87.65% at 10 years, although many ratios were similar. In addition, Supplementary Table 8 details ratio associated p-value, CI, accuracy, sensitivity, and specificity.

Fig. 5
figure 5

Top performing survival predictors by ROC AUC. ROC curves predicting OS were generated on combinatorial ratios of SAS nominated miRNAs, first using normalized transcript abundances in N = 457 LUAD cases with OS traits. (A) In round 1, transcripts consisting of M3 and M6 DEGs and hubs nominated by Kaplan–Meier distinction of OS via their levels in independent RNA-Seq were tested in 6827 combinatorial ratios for OS prediction using all LUAD cases (top performer shown in panel A). (B) Round 2 added nominations top OS.time correlated transcripts. (C) Round 3 added top OS.time correlated miRNA transcripts (D) Round 4 ROC tested whether round 1, + top-performing miRNA nominations from round 3, literature round 2 associated genes, and pathway implicated genes by round 3.

To improve on our test, we generated a volcano plot using columns B and C in Supplementary Table 9 as our X and Y axis, followed by selection of the extremes in the first and second quadrants. We selected 10 genes from the right extreme (M6) and eight from the left (M3); two genes from round 1 persisted (KIF4A and ANLN). In total, we proceeded with 20 genes, 10 M3-like candidate denominators and 10 M6-like candidate numerators.

A total of 60,626 gene ratios (1:1, 2:2, 3:3, 4:4) were evaluated by ROC AUC (Supplementary Table 10) in Round 2. As expected, prognostic ratios, enriched with OS.time-correlated genes, showed higher AUCs across all time points, averaging > 60%. The top combination, BEND5/GNPNAT1, achieved an 87.58% AUC at 10 years, though statistical power declined due to fewer uncensored samples at this timepoint. The top performing ratio in the cohort excluding the 10-year timepoint (12 months–5 year), ATP6V0E1 + SVBP + HSDL1/CENPE + GNPNAT1 + PPP1R13L (AUC = 76.69%) outperformed the next top ratio ATP6V0E1 + SVBP + HSDL1 + UBTD1/CENPE + GNPNAT1 + TFAP2A + PPP1R13L by only 0.05% and the 150th top AUC by less than 1.0%. Figure 5C illustrates the results of Round 3, in which top OS.time-correlated miRNA transcripts were incorporated into ROC testing, further refining the model’s predictive power. The mean AUC determined top ratio when excluding the 10-year timepoint was ZMYND12 + BEND5 + HSDL1 + UBTD1/GNPNAT1 + TFAP2A + KIF20B + PPP1R13L (all timepoints: AUC = 74.77%; excluding 10-year: AUC = 74.06%). When excluding both the 5 and 10 year timepoint the ATP6V0E1 + SVBP + HSDL1 + UBTD1/GNPNAT1 + XRCC2 + TFAP2A + PPP1R13L ratio assumes the top position with an AUC of 75.54% (Fig. 5B). In an effort to further improve on our prediction, we sorted by 12 month to 3 year and 12 month to 5 year mean AUC and selected the genes that appeared in > 20% of the top 100 performing ratios (most frequently from the numerator: ATP6V0E1, HSDL1, SVBP, UBTD1, BEND5, PEBP1, TMEM59 and denominator: GNPNAT1, TFAP2A, PPP1R13L, ANLN, KIF4A) for another round of ROC AUC analysis communicated in Supplemental Materials.

Survival-related signature comparison reveals a robust prognostic signature for LUAD

Many seminal survival-related signatures have been developed, but a comprehensive clinical grade survival-related gene signature remains elusive. We asked how the targets predicting survival identified in our study would perform in comparison to existing LUAD prognostic and survival metrics. So, we selected three established signatures22,23,24 for comparison to our round 2 signature calculating each across 480 TCGA LUAD FPKM cases. Recapitulating ME correlations to survival traits (Fig. 6A, left-panel), trait correlations were extended to the four signatures for Soltis, the current study, Song, and Shedden studies (Fig. 6A, right-panel). Correlation to OS, PFI, and DSS was highly significant for the current study signature, and Song and Shedden signatures also correlated significantly with survival and progression traits. Note M6 and M3, outlined in Fig. 6, are the basis for numerator and denominator genes in the current study’s signature. While our signature correlates with survival traits, we considered all four signatures as additional traits for correlation to the 18 modules (Fig. 6B, left-panel). Modules M3 and M6 were the most significant correlates for all signatures except Soltis. Notably, Song’s signature for poor prognosis was inverted to achieve comparable positive correlation considering survival rather than positive correlation to inflammatory response. Interestingly, as modules are ordered by their correlation distance, closest neighbors (M1, M13, M17 and M16) to our modules contributing genes of our prognostic signature (M3 and M6) do not generally correlate in the same direction to signatures (within row). Since M6 and M3 have the highest correlation significance to three of the four signatures, they represent transcripts with a strong and specific relationship to survival. However, the network more generally has most modules correlating significantly to multiple of the four signatures, indicating broad biology captured by these signatures, which are restricted to a small subset of genes. The Soltis’ signature, which was developed to aid in subtype classification of LUAD, has its best module correlation to our M1 SAS-module, with its gene ontology enriching in ATP synthesis coupled electron transport and mRNA processing as top biological processes. This divergence from top modules’ correlation significance of the other three signatures to M6 and M3 likely indicates that the subtype-distinguishing genes profiled in the Soltis signature are less involved with M3 and M6 biological processes like cell cycle and mRNA processing (M3). The Shedden and Song signatures also correlate more broadly to other modules in the same direction as our gene signature, and, they further correlate to each other as a block (Fig. 6B, right-panel), suggesting that Song’s inflammatory response genes associate with and predict survival like our signature and Shedden, but the underlying genes in the signatures are largely different, with our signature’s eight genes overlapping with none of the genes in the other three signatures. Indeed, Shedden and Song signatures correlate well to this study’s signature (0.44 and 0.49), having been generated and tested independently in different LUAD data. Remarkably, the current study round 2 optimized signature, comprising 8 unique and novel genes not present in the Shedden (459 genes), Soltis (155 genes), or Song (11 genes) signatures, compares favorably as a survival signature to these established models.

Fig. 6.
figure 6

480 LUAD case correlations recalculated to network modules for four curated LUAD prognostic signatures of Soltis, et al., this study, the inverse signature of Song, et al., and the 459 gene signature of Shedden, et al. (A) Survival case correlations. MEs were recalculated based on the N = 461 (457 miRNAs matched to 461 transcripts in network) TCGA network in the full 480 LUAD tumor samples curated from GDC (see RC), and their relatedness plotted as a correlation distance dendrogram (top left), MEs recalculated were correlated using bicor to the uncensored quantitative survival (days) traits relating to overall survival (OS), progression free interval (PFI), disease free interval (DFI), and disease specific survival (DSS), as well as days to any new tumor event in patients after biopsy (lower left of panel). Right, the four curated signatures were likewise correlated to the same six survival traits quantified in days. Any significant bicor is annotated with Student’s p value for the correlation over the heatmap. Bicor scale is − 0.5 to + 0.5 for (A). A correlation distance-based dendrogram among the four signatures is displayed at the top right. (B) The four prognostic signatures were correlated using bicor to the 18 recalculated module eigengenes (left), and to each other using Pearson correlation (right, Pearson rho displayed as bold text, with Student’s p value below in italics). Any significant bicor (left) is annotated with its corresponding Student’s p value for the correlation over the heatmap. Both bicor and Pearson scales are − 1.0 to + 1.0 for (B). In both panels, key survival modules M3 and M6 are highlighted by a dashed hot pink outline around their survival trait (A) and signature (B) correlations.

We validated survival prediction via gene expression ratios by first testing TCGA-LUAD normalized expression data against disease-specific survival (DSS), defined as death with concurrent “with tumor” status or disease-specific mortality. ROC analysis of the top-performing 8-gene ratio predictor (Fig. 5D), using DSS instead of OS, demonstrated similar performance at the 18-month timepoint. The gene ratio identified from LUAD network analysis predicted both OS and DSS, supporting its reproducibility and clinical relevance as a survival predictor in LUAD patients. To further assess generalizability, we performed cross-dataset validation using two independent cohorts. In the CPTAC LUAD RNA-seq dataset (accessed via cBioPortal), predictive performance for 18-month survival remained robust (AUC = 70.59%), closely matching TCGA (AUC = 70.84%), further confirming reproducibility. In GSE31210, a microarray-based dataset, the 8-gene ratio achieved moderate predictive accuracy (AUC = 64.52%), supporting the signature’s cross-platform applicability despite expected variation due to differences in data generation, platform, processing, and cohort composition.

Single-cell RNA sequencing of gene signature expression across LUAD tumor microenvironment

Following quality control and preprocessing, UMAP dimensionality reduction and clustering identified diverse cell populations within the tumor, including putative epithelial, immune, and stromal compartments in GSE123902. Expression of the 8 signature genes was assessed across clusters and individual samples. Of the eight genes, five (ATP6V0E1, GNPNAT1, HSDL1, TFAP2A, and UBTD1) showed detectable expression in multiple clusters, while three genes (PPP1R13L, SVBP, and XRCC2) exhibited sparse or undetectable expression across samples. UMAP feature plots revealed the spatial distribution of signature gene expression across individual cells, highlighting cluster-specific localization (Supplementary Fig. 8). Dot plot visualization summarized the average expression and percent-positive cells for each gene by cluster (Fig. 7). Together, these findings suggest that the 8-gene signature likely reflects contributions from multiple cell types within the LUAD microenvironment, including tumor epithelial and immune compartments.

Fig. 7
figure 7

Dot plot summarizing the expression of the eight-gene LUAD survival signature across annotated cell clusters in single-cell RNA sequencing data from 17 LUAD tumors (GSE123902). Dot size indicates the proportion of cells within a cluster expressing the gene (percent expressed), and dot color reflects scaled average expression. TFAP2A, GNPNAT1, HSDL1, UBTD1, and SVBP show selective expression in epithelial and stromal clusters, while PPP1R13L and XRCC2 are more broadly distributed. ATP6V0E1 exhibits relatively low expression overall. These data highlight the potential contribution of both tumor-intrinsic and microenvironmental compartments to the signature. A UMAP visualization of cell clustering and identity annotations is provided in Supplementary Fig. 8.

Discussion

Well-executed transcript co-expression networks provide a systems biology framework of modules which have coherent biology inherent in their organization. They can be correlated to disease or sample-specific traits with little concern for false positive correlation. This permits biological insight into the complex network of molecules, while further helping researchers nominate novel biomarkers of disease, drug targets, and mechanistically important key drivers of disease processes41,42. Systems biology approaches, such as WGCNA can help untangle the biological complexity of the LUAD malignant state. While LUAD has been well-studied, the complex interactions of mRNA and their influence on survival are not well understood. This study produced a systems-view of the LUAD transcript network as a resource, and we demonstrate a use case in which we nominate influential candidate gene combinations of mRNA able to nominally predict survival (mRNA/miRNA combinations; supplemental materials). Furthermore, our mRNA signature demonstrates its competitiveness by comparing favorably to well-established mRNA signatures, with some comprising over 50-fold as many genes.

Co-expression analysis revealed 18 modules, many of which correlated with key clinical outcomes. Transcripts correlated across multiple samples are the basis of co-expression networks; those genes that are tightly interconnected (e.g., high module membership, kME) have similar expression patterns, and upstream of this shared expression pattern, one can infer common transcription factors, miRNAs or epigenetic mechanisms43. Seven of 18 modules correlated with our selected traits of interest (pathological staging and OS). Six had negative correlation (M16, M5, M9, M10, M6, M1) and one (M3), positive correlation (Fig. 1). Five of the seven modules were identified as strongly significant across both OS and pathological staging and termed SAS-modules. SAS-modules were significant for several biological processes as mentioned previously (Table 2), namely cell cycle and DNA metabolic related processes (M3), immune activation (M9 and M16), and solute transport across membranes and molecular metabolism processes (M6 and M1).

Our results suggest two competing gene expression patterns distinguishing early from later-stage LUAD. Early-stage patients exhibit elevated expression of genes involved in RNA processing, surfactant homeostasis, chemical homeostasis, and innate immunity (M1, M5, M9, M6, M11), which decline with advancing stage. The association of cell cycle activity with worsening stage and survival supports the well-established role of growth-promoting alterations in tumor progression. Additionally, increased metabolic activity linked to excretion and fluid transport (M6), positively correlated with lower stage and improved survival, highlights early immune system activation in lung cancer initiation44. Two SAS-modules (M3 and M6) were revealed to be anti-correlated across OS and staging. We therefore termed modules of transcripts that have low-expression and are associated with high-survival M3-like, and those with high-expression associated with high-survival as M6-like (M1, M6, M9, M16). SAS-module genes with hub and DEG status are considered vital drivers of early LUAD. We hypothesized that SAS-modules, positively correlated to later-stage clinical traits and negatively correlated with longer survival time, are enriched with suitable targets for potential therapeutic correction. SAS-modules could potentially be regulated via miRNA-mediated downregulation by significantly negatively correlated miRNAs, as mentioned in the supplementary materials.

In an effort to add in possible causality and directionality we included expression of upstream regulators of mRNAs where possible to perform ROC analysis, with AUC evaluation of predictor strength as described in Dill39. We began with a logical selection of transcripts that were both DEGs and hub genes of M3(-like) and M6(-like) modules (round 1) and evolved our original selection of transcripts by selecting genes significantly correlated to OS.time (round 2). Round 1 gene selection was validated by > 55% AUC across all timepoints. The top ranked ratio nominated by AUC for round 2 (Supplementary Table 10) included genes, ATP6V0E1, GNPNAT1, TFAP2A and PPP1R13L that were maintained in top-ranked ratio AUCs throughout subsequent rounds of ROC analysis (genes SVBP, UBTD1 and XRCC2 did not persist through subsequent rounds). In round 2, our top ratio selected by mean-AUC across all timepoints, ATP6V0E1 + SVBP + HSDL1 + UBTD1/GNPNAT1 + XRCC2 + TFAP2A + PPP1R13L (Fig. 5B) contained a known driver of lysosome and phagosome maturation in LUAD cells45, ATP6V0E1. TFEB, a known regulator of ATP6V0E1 was also selected for round 4 ROC (Supplemental Materials) testing but was not present in our top genes ranked by mean AUC.

Conversely, our denominator contains two recognized autophagy-related genes, TFAP2A and PPP1R13L46,47,48,49. Reports on TFAP2A’s effects on tumorgenicity have been contradictory. In some cancers, TFAP2A expression acts like a tumor suppressor, while in others it has tumorigenic potential50. In our review of the literature, TFAP2A in LUAD leans towards tumor-promotion via numerous recorded mechanisms51,52,53,54. PPP1R13L/iASPP is a well-known oncogene, but recent studies have indicated it may have an proapoptotic role via its interaction with NF-κBp6554. In our study, PPP1R13L was routinely enriched in the low-expression high-survival group of our ratios suggesting more than likely, an oncogenic function in LUAD. PPP1R13L/iASPP is a well-known oncogene, but recent studies have indicated it may have an proapoptotic role via its NF-κBp65 interaction54. Presence of active p53 is a driver of PPP1R13L expression and regulation and suggests why a later well-performing nominated gene, GADD45GIP1, detailed in the supplemental materials, a binding partner of all GADD45 isoforms, is routinely enriched in top predictor gene combinations55,56. Interestingly, a ratio containing only known interactors PPP1R13L and GADD45GIP1 did not surpass 60% AUC at any time-point, suggesting that other genes were additive. Multiple reports have identified GNPNAT1, a key enzyme in the hexosamine biosynthetic pathway (HSB) as a prognostic marker in LUAD57,58. HSB intermediates and the end-product UDP-GlcNAc serve as key substrates in many biochemical pathways, allow for crosstalk across many tumor promoting processes, and may serve as a sensor of energy availability59. Of the remaining genes in our signature, XRCC2, a key DNA repair gene in the homologous recombination pathway, is overexpressed in LUAD and promotes tumor cell migration and invasion via vimentin stabilization. Elevated XRCC2 expression is associated with poorer overall survival, supporting its role as both a prognostic marker and potential therapeutic target in LUAD60. SVBP, though less studied in LUAD, encodes a protein involved in tubulin detyrosination, a post-translational modification critical for microtubule stability61,62. Aberrant tubulin dynamics have been linked to chemoresistance and metastatic potential in NSCLC, and SVBP was consistently expressed at lower levels in high-survival ratios in our analysis. HSDL1, a peroxisomal enzyme involved in lipid metabolism, has been implicated in the immunological and metabolic landscape of LUAD. One study identified HSDL1 as part of a lipid metabolism–related survival signature, with elevated expression associated with poorer prognosis in LUAD patients63,64. Additionally, HSDL1 expression negatively correlates with anti-tumor immune infiltration and positively with tumor-promoting immune cell types64,65. Lastly, UBTD1, a ubiquitin-like protein that interacts with MDM2 to stabilize p53, also regulates TGF-β signaling and mechanotransduction pathways. In LUAD, reduced UBTD1 expression promotes tumor aggressiveness via RhoA activation and epithelial-to-mesenchymal transition (EMT), suggesting its function as a tumor suppressor may be context-dependent66. Top ranked mean combinatorial ratio AUCs did not reach clinical test accuracy (85% +) but were considered fair predictors of survival at all timepoints. ROC analysis provided an avenue for assessment of additive or synergistic transcript level contributions to OS, and its prediction in LUAD. Further investigation into the network module member genes’ intra- and inter-module connectivity may highlight relevant biology for LUAD initiation, progression and module association to survival and staging traits. Additionally, while clinical application requires further refinement, this gene signature could inform personalized treatment strategies by guiding the selection of chemotherapeutic regimens or adjusting treatment intensity based on prognosis, potentially improving outcomes for patients with poorer prognoses. Our correlation analysis provides a bridge between our knowledge of LuCa literature and current survival signatures, to showcase an integrated approach to understanding LUAD. Additionally, it may offer insights into future studies, to understand more deeply the divergent performance of these signature correlation patterns and their variation in ROC-AUC. An interesting observation from the correlation heatmap (Fig. 6) is that M3 brown and M6 red modules are highly negative or positively correlated for three of the four signatures. The differences between these signatures can be attributed to the differences in their design. These variations stem from their deviation in distinct conceptual frameworks, that are formed from slightly different predictive factors and gene selection or retention criteria. One such difference is the anticorrelation of the Song signature (best visualized directly from the RC code, but visible here as opposing red/blue for most modules). This should be attributable to an opposing outcome that the signature is intended to predict, most likely severity rather than survival. Song’s signature anticorrelation to the others might reflect it’s intended use to predict poor prognosis and drug therapy response prediction, and not survival, as death or progression is the flip side of the same coin.

Recalculations in 480 TCGA cases provide some replication and enable quantitative benchmarking against the Song, Shedden, and Soltis signatures. Our signature correlates most strongly with Song, followed by Shedden, with Soltis being a distant third. Although not specifically selected for, our findings reinforce the link between immune-related genes (IRGs) and LUAD prognosis and survival—two related but distinct attributes. Notably, three of our signature genes (ATP6V0E1, TFAP2A via GSDMD interaction, and PP1R13L) have documented roles in the inflammatory response.

In addition to tumor-intrinsic gene expression programs, recent studies underscore the importance of the TME in shaping disease progression and therapeutic response in early-stage LUAD. Immune-suppressive or immune-excluded TMEs—characterized by stromal remodeling, regulatory immune populations, and impaired CD8⁺ T-cell infiltration—are associated with poor prognosis and resistance to immunotherapy55,67,68,69. To contextualize our 8-gene survival signature within this framework, we assessed cell-type–specific expression of the signature genes using single-cell RNA-seq data from LUAD tumors (GSE123902). Five genes (TFAP2A, GNPNAT1, HSDL1, UBTD1, SVBP) were detectably expressed in specific epithelial or stromal-immune clusters, suggesting the signature reflects both tumor-intrinsic and microenvironmental features.

This interpretation is supported by prior literature: TFAP2A activates ITGB4, promoting epithelial growth and suppressing CD4⁺/CD8⁺ T-cell infiltration via NF-κB; PPP1R13L (iASPP) modulates NF-κB and mTOR signaling with implications for tumor–immune crosstalk; GNPNAT1 correlates with immune infiltration metrics and poor prognosis in LUAD; and ATP6V0E1, a V-ATPase subunit, contributes to acidification of the TME, which can impair immune activity and facilitate metastasis. These data suggest our transcript signature may capture both oncogenic signaling and microenvironmental interactions that jointly influence LUAD survival.

Collectively, these findings align with emerging models of LUAD pathogenesis in which tumor-intrinsic transcriptional programs interact with the TME to shape disease progression and prognosis. Our validation in two independent cohorts—CPTAC (RNA-seq) and GSE31210 (microarray)—further supports the cross-platform and cross-cohort robustness of the 8-gene signature. While microarray data inherently contains less missingness but greater measurement variability, the 8-gene signature retained predictive performance across this platform, supporting its robustness. These results highlight the potential utility of ratio-based transcriptomic predictors that integrate signals across cellular compartments and may serve as scalable biomarkers for early-stage LUAD risk stratification. However, these insights must be interpreted in the context of study limitations, as outlined below.

Limitations

Our identification of candidate biomarker ratios from gene expression sets has limitations. While the best ratio AUCs exceed 97% at the 10-year timepoint (Round 2), they do not surpass 70% at earlier timepoints. Notably, 95% confidence intervals suggest that the low number of observations at 10 years does not impact power. ROC curve performance may be constrained by compensatory changes in co-expressed genes, which modulate rather than directly influence survival. Additionally, OS.time correlations with gene expression were limited to ± 0.3 for mRNAs, suggesting that current ratio AUCs may be near their maximum achievable values.

Furthermore, TCGA tumor profiles represent pre-treatment status, limiting our ability to capture post-treatment transcriptomic changes that may affect therapy response or survival interpretation. Additionally, candidate selection in was constrained by the computational demands of unbiased ROC testing, with Round 4 alone generating over 200,000 combinations from 17 numerator and 13 denominator candidates. Machine learning could enhance transcript and combination selection, enabling more complex, weighted ratios and prioritizing OS-time correlated genes across diverse pathways. LUAD cases in TCGA may also exhibit greater heterogeneity than breast cancer subtypes like triple-negative breast cancer, which Dill39. used for survival prediction modeling. Lastly, validation was limited by the small number of TCGA-LUAD cases with paired miRNA and mRNA data, necessitating an alternative survival metric. The recalculation method used here is adaptable to other datasets and may facilitate module reproduction for comparative analyses.

In conclusion, this study utilized a network-based approach to identify stage- and survival-associated biomarker genes influencing LUAD patient outcomes. Our integrative systems-biology framework provides a foundation for further exploration of LUAD initiation and progression. Pilot analyses of the network reveal that transcript modules targeted by miRNAs do not simply follow expected feed-forward regulation but instead exhibit complex interactions. Our optimized systems analysis pipeline, detailed in the supplemental materials, serves as a resource for the LUAD research community by refining prognostic gene and miRNA combinations. The top-performing combinations involve lysosomal and autophagic processes, the hexosamine biosynthesis pathway, an apoptosis inhibitor, and the pro-oncogenic transcription factor TFAP2A. While these genes are well-established in LUAD, their combinatorial power for survival prediction suggests potential synergistic interactions that warrant further investigation. Comparative analysis underscores our signature’s robustness, demonstrating stronger correlations with survival traits than existing signatures while relying on fewer genes, enhancing clinical utility. Further exploration of intra- and inter-module connectivity may refine a prognostic signature and uncover additional biological insights for LUAD.