Introduction

Lung adenocarcinoma (LUAD) is currently the most prevalent histological subtype of non-small cell lung cancer (NSCLC)1. Lobectomy with systematic lymph node dissection (LND) is considered the standard treatment for early-stage LUAD2. In recent years, the use of sublobar resection with selective LND or without LND and stereotactic body radiation therapy (SBRT) has been increasing for early-stage NSCLC despite the potential risk of lymph node metastasis (LNM) that may affect prognosis3,4,5. The advantage of limited surgery is, of course, that it preserves lung function and reduces perioperative morbidity and mortality6. At the same time, SBRT, as a very effective and non-invasive modality, is very attractive to patients7. If N0 status is reliable, some early-stage patients, particularly those with comorbidities, could benefit from limited surgery or SBRT. If patients have a high risk of LNM, they may benefit from treatment intensification, such as lobectomy with mediastinal node dissection or adjuvant chemotherapy, to reduce recurrence and improve survival outcomes. Accurate staging to confirm N0 status is therefore crucial. 18F-FDG PET/CT has emerged as an effective staging modality in oncology; nevertheless, the prevalence of occult lymph node metastasis (OLM) in LUAD is still as high as 14%–21%8,9. Thus, a new noninvasive approach is urgently needed for accurately predicting OLM in clinical N0 (cN0) LUAD patients staged by PET/CT.

Radiomics, which can provide nonvisual information by extracting high-dimensional quantitative features from images in a non-invasive manner10,11, has shown great promise in clinical diagnosis, prognosis assessment, and prediction of treatment12,13,14. For OLM prediction, numerous studies have shown that the radiomics signature of the primary tumor has significant incremental diagnostic value over clinicopathologic risk factors in patients with oral cancers15,16 and NSCLC17,18. However, the biological significance of radiomics features remains poorly understood. Thus, despite their potential, radiomics-based prediction models are often criticized for a lack of interpretability. Existing radiogenomics studies have attempted to interpret the biological underpinnings of radiomics prediction by decoding the correlations between radiomics phenotypes and gene expression patterns19,20. For example, Perez-Johnston et al.20 investigated the associations between CT-based radiomic consensus clustering and genomic data and outcomes in patients with stage I LUAD. However, these previous reports lacked experimental verification of the identified genes. So far, no published study has investigated the feasibility of radiogenomics for the prediction of OLM. Therefore, this study aimed to: (1) explore the value of a radiomics signature based on multicenter and bimodal images in predicting OLM and prognosis of cN0 LUAD patients; (2) explore the relationships between OLM-associated radiomics phenotypes and gene expression profiles; (3) assess the reproducibility of the identified radiomics-associated key genes using a public database, clinical tissue samples and in vitro experiments.

In this study, we show that the PET/CT-based radiomics score (Radscore) accurately predicts OLM and is further identified as an independent predictor of recurrence in cN0 LUAD. We also find that OLM-related radiomics patterns are mostly associated with immune response. These findings suggest that the proposed Radscore has the potential to guide tailored treatments in cN0 LUAD. Besides, the hypothesis that radiomics features reflect the underlying genomic patterns was supported in this study.

Materials and methods

Cohort

We retrospectively collected consecutive patients with cN0 LUAD who underwent surgical resection with systematic LND between May 2012 and July 2021 at the First Affiliated Hospital of Wenzhou Medical University (FAHWMU) as the primary cohort. The primary cohort was randomly divided into a training cohort (TC) and an internal validation cohort (IVC) at a ratio of 7:3. Participants with cN0 LUAD who underwent radical surgery from August 2021 to June 2022 at FAHWMU formed the prospective test cohort (PTC). Patients for the external test cohort (ETC) were retrospectively selected from Shaoxing People’s Hospital (SPH) and Wenzhou Central Hospital (WCH) from January 2017 to August 2021. Patients in the primary cohort with complete follow-up data, collected between May 2012 and May 2019, formed the prognostic assessment cohort (PAC). The endpoint was recurrence-free survival (RFS), which was defined as the time interval between the date of surgery and the time of recurrence, death, or the last follow-up. We performed radiogenomic analysis on an additional cohort (a subset of the PTC), the radiogenomic cohort (RGC), using radiomics features from PET/CT and corresponding RNA sequencing (RNA-seq) data from the collected tissue samples. The TC was used to build the model, and the IVC was used to check for overfitting or underfitting during model training. The PTC and ETC were used to verify the generalization ability of the model. The PAC was used to evaluate the prognostic value of the Radscore. The RGC was used to identify radiomics-associated pathways and key genes.

The criteria for cN0 were defined as a short-axis diameter <10 mm for all lymph nodes and focal FDG uptake lower than the surrounding background on PET/CT21. OLMP was defined as nodal metastasis that evaded preoperative imaging but was unexpectedly discovered during surgery. The exclusion criteria were: (1) a history of other malignancies; (2) multiple primary lung cancers; (3) receipt of neoadjuvant therapy; (4) poor PET/CT image quality; (5) an interval longer than 3 weeks between PET/CT examination and surgery (Fig. E1 [online]). Tumor staging was performed according to the 8th edition of the Union for International Cancer Control TNM classification. The relevant clinical data were collected, including patient age, sex, smoking status, serum tumor markers, pathological subtypes of LUAD, pathological tumor size (Psize), tumor location and imaging phenotypes. Psize was defined as the maximum tumor diameter based on the pathological report. Ethics approval was obtained from institutional review boards at the three participating institutions. Informed consent was waived for the retrospective cohorts (including the primary cohort, ETC and PAC), while written informed consent was obtained from all patients in the prospective cohorts (including the PTC and RGC).

Public database

We downloaded the RNA-seq data and corresponding clinical information of LUAD patients from the Cancer Genome Atlas (TCGA) (https://tcga-data.nci.nih.gov/tcga/), and excluded cases meeting any of the following criteria: (1) a history of other malignancies; (2) presence of distant metastasis or uncertainty about distant metastasis; (3) unclear nodal staging. The criteria are detailed in Fig. E2 (online). Here, we chose overall survival (OS) as the endpoint because the TCGA-LUAD profiles include only OS data.

Clinical tissue samples

Fresh tumor tissue samples and paracancerous tissue samples were obtained from patients with cN0 LUAD during radical surgery at FAHWMU between July 2021 and August 2022. All collected samples were promptly stored in a −80 °C freezer. In total, 56 pairs of fresh tumor tissue samples and corresponding adjacent normal tissue samples were collected. Non-adenocarcinoma tissues and tissues with RNA degradation over time were excluded; the remaining 32 tissue pairs were used for RNA sequencing and RT-qPCR verification in this study.

In addition, formalin-fixed paraffin-embedded (FFPE) tumor tissue blocks were obtained from patients with cN0 LUAD at FAHWMU between August 2021 and June 2022. Collecting discarded tissue samples was approved by the Institutional Review Board of FAHWMU, and informed consent was obtained from patients.

PET/CT image acquisition

PET/CT scanning was performed using a Gemini TF 64 (Philips, the Netherlands) at FAHWMU or a General Electric Discovery Elite (Waukesha, WI, 40 patients) at SPH and WCH. All patients fasted for at least 6 h, and blood glucose levels were checked (<110 mg/dl) before 18F-FDG injection (3.7–5.55 MBq/kg). Whole-body PET/CT scans (skull base to mid-thigh) were performed, starting approximately an hour after injection. CT images of the PET/CT were used for attenuation correction of PET images and anatomical orientation. Further detailed acquisition parameters are summarized in Table E1 (online).

PET/CT radiomic analysis

Tumor segmentation, image processing and feature extraction

Tumor segmentation and extraction of radiomics features were performed using LIFEx version 7.0.16 software (http://www.lifexsoft.org). The PET or CT images in DICOM format were imported into the software, and the volumes of interest (VOIs) were semi-automatically delineated slice-by-slice using 3D drawing tools by two experienced nuclear medicine physicians who were both blinded to the pathological and prognostic information. A threshold of 40% of the maximum standardized uptake value (SUVmax) was applied to optimize the VOI on the PET images. Isotropic resampling to a voxel size of 2 × 2 × 2 mm was performed on all PET/CT images. For PET images, intensity discretization reduced the continuous scale to 64 bins, with intensity rescaling bounds ranging from 0 to 20. For CT images, intensity discretization reduced the continuous scale to 400 bins with absolute scale bounds between −1000 and 3000 HU. After image preprocessing, 128 radiomics features were extracted: 64 features based on PET and 64 features from CT.
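
As a rough illustration of the fixed-bin-number discretization described above, the following Python sketch maps an intensity onto a bin index between absolute bounds. The function name discretize and the clipping of out-of-range values are assumptions made for illustration only; the internal implementation in LIFEx may differ.

```python
import math


def discretize(value, lower, upper, n_bins):
    """Map an intensity to a bin index in 1..n_bins using fixed absolute bounds.

    Values outside [lower, upper] are clipped to the bounds. This clipping is
    an assumption; the software used in the study may treat such voxels
    differently.
    """
    clipped = min(max(value, lower), upper)
    bin_width = (upper - lower) / n_bins
    index = math.floor((clipped - lower) / bin_width) + 1
    # The upper bound itself is assigned to the last bin.
    return min(index, n_bins)


# PET: 64 bins over 0 to 20 SUV, i.e. a bin width of 0.3125 SUV.
pet_bin = discretize(5.0, 0.0, 20.0, 64)
# CT: 400 bins over -1000 to 3000 HU, i.e. a bin width of 10 HU.
ct_bin = discretize(40.0, -1000.0, 3000.0, 400)
```

With these settings, each PET bin spans 0.3125 SUV units and each CT bin spans 10 HU.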

Feature selection, radscore building, and diagnostic validation

Radiomic features with good interobserver reproducibility (intraclass correlation coefficient [ICC] > 0.75) were retained for subsequent investigation. All extracted radiomics features were normalized with a Z-score before further analysis. We applied minimum redundancy maximum relevance (mRMR) and the least absolute shrinkage and selection operator (LASSO) algorithm to select the most powerful features in the training cohort. The optimal value of λ in LASSO logistic regression was determined by 10-fold cross-validation as the value that yielded the minimum binomial deviance. Furthermore, we performed bootstrapping validation (100 bootstrap resamples) to reduce overfitting bias. A radiomics signature was established as a linear combination of the selected radiomics features weighted by their corresponding coefficients in LASSO regression, and the radiomics score (Radscore) was then calculated for each patient. Considering the potential predictive value of some clinical factors, we also developed and validated a clinical model by univariate and stepwise multivariate logistic regression analysis. Then, a nomogram based on the multivariable logistic regression was constructed by incorporating the Radscore and independent clinical predictors.
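
The Z-score normalization and the linear-combination form of the Radscore can be sketched as follows. This is a minimal illustration, not the study's model: the feature names, coefficients and intercept are hypothetical placeholders (the actual formula is given in Appendix E1), and the use of the sample standard deviation estimated on the training cohort is an assumption.

```python
from statistics import mean, stdev


def zscore_params(training_values):
    """Return (mean, sample SD) of one feature, estimated on training data."""
    return mean(training_values), stdev(training_values)


def radscore(features, params, coefficients, intercept):
    """Intercept plus the coefficient-weighted sum of z-scored features."""
    score = intercept
    for name, coef in coefficients.items():
        mu, sd = params[name]
        score += coef * (features[name] - mu) / sd
    return score


# Hypothetical training values for two placeholder features.
training = {"feat_a": [1.0, 2.0, 3.0], "feat_b": [10.0, 20.0, 30.0]}
params = {name: zscore_params(values) for name, values in training.items()}

# Hypothetical coefficients and intercept, for illustration only.
coefficients = {"feat_a": 0.5, "feat_b": -0.25}
patient = {"feat_a": 3.0, "feat_b": 10.0}
score = radscore(patient, params, coefficients, intercept=-1.0)
```

Estimating the normalization parameters on the training cohort and reusing them for the validation and test cohorts avoids leaking information from those cohorts into the model.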

Model performance

The predictive performance of the models in the training, internal validation, prospective test, external test and radiogenomic cohorts was assessed using receiver operating characteristic (ROC) curve analysis. DeLong’s test was applied to determine whether the area under the curve (AUC) values differed significantly between models. Additionally, the nomogram was evaluated by calibration curves and decision curve analysis (DCA).
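
For readers unfamiliar with the AUC, the sketch below computes it through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name roc_auc and the toy data are illustrative assumptions; the study itself used the "pROC" package in R.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    labels: 1 for positive cases (e.g. OLMP), 0 for negative cases.
    Tied scores contribute 0.5 to the count.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

This pairwise form is quadratic in the number of cases, which is acceptable for cohorts of a few hundred patients.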

Radiogenomics mapping

LncRNA and mRNA sequencing

The transcriptome sequencing of lncRNA and mRNA was performed using the collected LUAD tissues and adjacent normal tissues. In brief, total RNA was first extracted from tissues using the Magzol Reagent (Magen, China). Ribosomal RNAs (rRNAs) were removed from total RNA, and the remaining RNA was fragmented to ~200 bp. Next, the purified RNA fragments were subjected to first- and second-strand cDNA synthesis followed by adaptor ligation and low-cycle enrichment according to the instructions of the NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina (NEB, USA). The purified library products were assessed using the Agilent 2200 TapeStation and Qubit (Thermo Fisher Scientific, USA). Finally, all the libraries were sequenced by Illumina (Illumina, USA) with paired-end 150 bp at Ribobio Co., Ltd (Ribobio, China). Raw data obtained from RNA-Seq were processed to generate clean data before subsequent analyses.

Identification of differentially expressed genes (DEGs)

Analysis of differential expression of lncRNA and mRNA (LUAD tissues vs paracancerous tissues, LUAD tissues of OLMN vs LUAD tissues of OLMP) was performed using DESeq2, and significance was defined as P < 0.1 for lncRNA and P < 0.05 for mRNA. Common differential genes were confirmed after intersection analysis.

Gene co-expression network analysis

To further explore the biological mechanisms of radiomics features for predicting OLM and obtain gene expression modules, we performed weighted gene co-expression network analysis (WGCNA) based on the RGC. WGCNA is a systems biology approach for identifying correlation patterns among genes across samples, so it can detect modules of highly correlated genes. Co-expression gene networks were generated using the R package WGCNA based on DEGs. First, hierarchical cluster analysis was conducted to check whether the data contained outlier samples. Next, a suitable soft threshold was selected to ensure a scale-free network distribution. Then, based on a topological overlap matrix (TOM)-based dissimilarity measure (1 − TOM), genes were clustered into network modules by using the average-linkage hierarchical clustering method. The modules were then established using the dynamic tree cut algorithm. Finally, the correlation between modules and traits was calculated to identify the key modules.
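
The soft-thresholding and topological overlap steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: an unsigned network (adjacency equal to the absolute correlation raised to the power β), a hand-written correlation matrix, and an arbitrary β; the study used the R package WGCNA, whose defaults and options may differ.

```python
def soft_threshold_adjacency(cor, beta):
    """Unsigned adjacency: |correlation| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """Topological overlap matrix (TOM) for an adjacency with zero diagonal."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Connection strength shared through all other genes u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = (shared + adj[i][j]) / denom
    return tom


# Toy 3-gene correlation matrix with all pairwise correlations equal to 0.5.
cor = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
adjacency = soft_threshold_adjacency(cor, beta=2)  # beta is illustrative
tom = topological_overlap(adjacency)
# Dissimilarity (1 - TOM) is the input to hierarchical clustering.
dissimilarity = [[1.0 - value for value in row] for row in tom]
```

Raising correlations to a power suppresses weak correlations while keeping strong ones, which is what pushes the network toward a scale-free topology.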

GO and KEGG analysis of key modules

Gene ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed on the genes from the mRNA modules most associated with radiomics features and Radscore in WGCNA analysis.

Hub genes identification

Protein–protein interaction (PPI) analysis was conducted using the STRING database (https://string-db.org/) and visualized with Cytoscape software (Version 3.9.1), based on mRNAs within the modules most positively and most negatively correlated with Radscore in WGCNA. To screen the hub-mRNAs of the PPI network, a degree of 10 was set as the cutoff criterion. Additionally, lncRNAs with the highest intramodular connectivity were defined as intramodular hub genes. Thus, hub-lncRNAs from the modules most positively and most negatively correlated with Radscore were screened using both |cor.geneModuleMembership| (|MM|) > 0.9 and |cor.geneTraitSignificance| (|GS|) > 0.6 as thresholds.
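
The hub-lncRNA screening rule amounts to a joint filter on the absolute module membership and gene significance. A minimal sketch follows; the gene identifiers and values are hypothetical placeholders, not results from the study.

```python
def select_hub_lncrnas(stats, mm_cutoff=0.9, gs_cutoff=0.6):
    """Return genes with |MM| > mm_cutoff and |GS| > gs_cutoff, sorted by name.

    stats maps a gene name to a (module membership, gene significance) pair.
    """
    return sorted(
        gene for gene, (mm, gs) in stats.items()
        if abs(mm) > mm_cutoff and abs(gs) > gs_cutoff
    )


# Hypothetical values for four placeholder lncRNAs.
example = {
    "LNC_A": (0.95, -0.70),  # passes both thresholds
    "LNC_B": (0.92, 0.40),   # fails the GS threshold
    "LNC_C": (-0.93, 0.65),  # passes both thresholds (absolute values)
    "LNC_D": (0.85, 0.90),   # fails the MM threshold
}
hubs = select_hub_lncrnas(example)
```

Absolute values are used so that genes strongly anti-correlated with the module eigengene or the trait are retained as well.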

Coexpression analysis and the competitive endogenous RNA network

To clarify co-expressed hub mRNA–hub mRNA, hub mRNA–hub lncRNA, or hub lncRNA–hub lncRNA pairs, we calculated Pearson correlation coefficients to determine the significantly correlated pairs. Next, we constructed a ceRNA network using bioinformatics tools based on the hub-lncRNAs, miRNAs and hub-mRNAs. First, the miRcode database (http://www.mircode.org/) was used to match hub-lncRNAs and miRNAs. Second, the miRDB (http://www.mirdb.org/), TargetScan (http://www.targetscan.org) and miRTarBase (http://mirtarbase.mbc.nctu.edu.tw/) databases were used to predict the target genes of miRNAs. Subsequently, the interaction results between miRNAs and hub-lncRNAs or hub-mRNAs were integrated to construct the ceRNA network, which was visualized by Cytoscape v3.9.1.

Cell culture

All cell lines were purchased from the Cell Resource Center of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (Shanghai, China). Human normal lung epithelial cells (BEAS-2B), PC9 and A549 cells were cultured in Roswell Park Memorial Institute (RPMI)-Dulbecco’s modified Eagle’s medium (DMEM) containing 10% fetal bovine serum (FBS). H1299, H1975, and HCC827 cells were cultured in RPMI-1640 medium with 10% FBS (Thermo Fisher Scientific). All cells were cultured at 37 °C in a humidified cell incubator with 5% CO2.

Immunohistochemical (IHC) staining and scoring

Tissues were fixed in 10% paraformaldehyde at room temperature, embedded in paraffin blocks and then cut into 5-μm-thick sections. The specimens were incubated with primary antibodies overnight at 4 °C, and then the signal was detected by the appropriate secondary antibodies. These slides were then stained using 3,3′-diaminobenzidine chromogen solution and counterstained with haematoxylin. IHC staining results were scored by two experienced pathologists in a blinded manner, and total scores were calculated by multiplying staining intensity scores by the positive staining area scores. For each slide, five randomly selected fields were evaluated, and the average value was adopted as the final score. Staining intensity: 0 point for no positive staining (negative), 1 point for light yellow (weak positive), 2 points for brown yellow (positive) and 3 points for tan (strong positive). Positive staining area: 1 point (≤25%), 2 points (26%–50%), 3 points (51%–75%), and 4 points (>75%).
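
The scoring scheme above can be written out as a short calculation. In this sketch, the per-field product is computed first and then averaged over the fields, and percentages falling between the stated integer boundaries (for example 25.5%) are assigned to the higher category; both choices are assumptions, as is the example data.

```python
def area_score(percent_positive):
    """Convert the percentage of positively stained area to a 1-4 score."""
    if percent_positive <= 25:
        return 1
    if percent_positive <= 50:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def ihc_score(fields):
    """Average of (intensity score x area score) over the evaluated fields.

    fields: list of (intensity score 0-3, percent positive area) pairs.
    """
    totals = [intensity * area_score(percent) for intensity, percent in fields]
    return sum(totals) / len(totals)


# Five hypothetical fields from one slide: products are 4, 12, 1, 6 and 0.
slide = [(2, 30), (3, 80), (1, 10), (2, 60), (0, 5)]
final_score = ihc_score(slide)
```

Under this scheme the final score ranges from 0 (no staining) to 12 (strong staining over more than 75% of the area).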

Quantitative real-time polymerase chain reaction (qRT-PCR)

Total RNAs were extracted from cell lines and tissue samples using TRIzol reagent (Sangon Biotech), which were subsequently reverse-transcribed into complementary DNAs (cDNAs). Next, SYBR Green Real-Time PCR Master Mix (TaKaRa) was used for qRT-PCR. Primer sets were purchased from Shanghai Boyun Biotech Co., Ltd (Shanghai, China). The primer sequences are listed in Supplementary Data 1. The MIR600HG expression was normalized against U6 expression, and the expression of other lncRNAs in this study was normalized against endogenous glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene expression. The results were calculated by the comparative 2^(−ΔΔCt) method.
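
The comparative Ct calculation can be made explicit with a short sketch. The function and argument names are illustrative, and the Ct values are invented for the example; the reference gene would be U6 or GAPDH as described above.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the comparative Ct method.

    delta Ct       = Ct(target) - Ct(reference gene), computed per group
    delta-delta Ct = delta Ct(sample) - delta Ct(control)
    fold change    = 2 ** (-delta-delta Ct)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_sample - delta_control)


# Invented Ct values: the target amplifies two cycles later in the sample
# (relative to the reference gene) than in the control.
fold_change = relative_expression(26.0, 18.0, 24.0, 18.0)
```

A fold change below 1 indicates lower expression in the sample than in the control; here a ΔΔCt of 2 corresponds to a four-fold reduction.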

RNA interference (RNAi) and plasmid transfection

The overexpression plasmids and small interfering RNAs (siRNAs) targeting human MIR600HG were purchased from RiboBio (Guangzhou, China). pcDNA3.1 and nontargeted siRNA (siRNA negative control [SI-NC]) were used as the negative controls. The sequences of the MIR600HG-targeting siRNAs were as follows (5′−3′): GCTACTACTTACTACCTTT (SI-1), GGAGAAAGATGGCCTGATA (SI-2) and GGCTAAAGAGCATCATACA (SI-3). Transfection of plasmids or siRNAs in LUAD cells was conducted using Lipofectamine 3000 (Carlsbad, CA, USA) according to the manufacturer’s instructions. qRT-PCR was used to verify the transfection efficiency 24 h after transfection.

Clone formation assay

48 h post-siRNA and post-plasmid transfection, cells were inoculated into six-well plates (1000 cells/well) and cultivated overnight. The medium was substituted with fresh medium every 2 days to maintain cell growth for 7 days. At the end of the experiment, colonies were fixed with 4% paraformaldehyde for 15 min and stained with 0.1% crystal violet for 10 min at room temperature. All assays were repeated in triplicate.

EdU cell proliferation assay

Cell proliferation was also detected using 5-ethynyl-2′-deoxyuridine (EdU) Cell Proliferation Assay Kit (RiboBio, Guangzhou, China) according to the manufacturer’s protocol. Briefly, H1299 and A549 cells were seeded into 96-well plates overnight at a density of 5000 cells/well after 48 hours of transfection. 50 μM of EdU was added to the culture medium and incubated for 2 h. Next, the cells were fixed using paraformaldehyde solution (4%) for 30 min, incubated with glycine for 5 min and then permeabilized with 0.5% Triton-X100 for 10 minutes. Subsequently, the cells were stained with 1× Apollo for 30 minutes and Hoechst 33342 for another 30 minutes in the dark. The percentages of EdU-positive cells (red) to total Hoechst 33342-positive cells (blue) were calculated.

Wound-healing migration assay

The transfected cells were seeded into six-well plates and cultured until 90–100% confluence. Then, a 200-μL sterile pipette tip was used to scratch the cell monolayer, and cell debris was removed by washing with PBS. After that, cells were cultured in medium with low FBS concentration (0.5%). The wound at the 0, 24 h and 48 h time points was photographed using a Nikon microscope (Tokyo, Japan), and the scratch area was measured using ImageJ software (NIH).

Transwell migration and invasion assays

24-well transwell chambers (8 µm; Corning Inc., NY) were used for the migration and invasion assays. Transfected cells (5000 cells/well) were seeded into the upper chambers in serum-free medium, and medium containing 10% FBS was added to the lower chambers. After 24 h incubation, migrated or invaded cells were fixed with 4% paraformaldehyde, stained with crystal violet and then imaged under a microscope (Nikon, Tokyo, Japan). The invasion assay protocol was similar to the migration assay, except that the bottom of the upper chamber was first coated with 1 mg/ml Matrigel.

Study design

As shown in Fig. 1, the overall framework of this study consists of three main parts. First, we developed a PET/CT radiomics signature for predicting OLM, and the predictive efficacy was evaluated in the IVC, PTC, ETC and RGC. Prognostic stratification based on the identified signature was evaluated. Second, we linked OLM-related radiomics phenotypes with RNA-seq co-expressed gene modules, and gene set enrichment analysis and screening of the key genes were conducted. Third, we validated the reproducibility of OLM-related key genes on the public database, clinical tissue samples and in vitro experiments. Radiomic features are shown in Supplementary Data 2. The radiomics quality score (RQS) was applied to assess the radiomics workflow11, and the score calculated for this study was 78% (28 out of 36 points) (Table E2).

Fig. 1: Overview of the study.

OLM occult lymph node metastasis, OLMN occult lymph node metastasis negative, OLMP occult lymph node metastasis positive, cN0 clinical N0, LUAD lung adenocarcinoma, TCGA The Cancer Genome Atlas, WGCNA weighted gene co-expression network analysis, IHC immunohistochemistry, RT-qPCR reverse transcription and quantitative real-time polymerase chain reaction.

Statistics and reproducibility

SPSS version 25.0, GraphPad Prism version 7.0, Origin version 2018 and R software version 4.1.0 were used for data analysis and graphing. Categorical data were presented using counts and percentages, and numerical data were presented as mean ± standard deviation. Categorical data were analyzed with the chi-square test or Fisher’s exact test. Numerical data were analyzed using the unpaired t-test, Mann–Whitney U test, ANOVA or Kruskal–Wallis test. All radiomics features were normalized by Z-score normalization. Univariate and multivariate logistic regression analyses were conducted. We used the “glmnet” package for LASSO, the “pROC” package for ROC analysis, the “rms” package for nomogram and calibration analysis, the “ResourceSelection” package for the Hosmer–Lemeshow test, the “rmda” package for DCA, and the “WGCNA” package for WGCNA. Volcano plots and Venn diagrams were drawn on the bioinformatics platform (https://www.bioinformatics.com.cn). The optimal cutoff values for continuous variables were identified using the X-tile program (version 3.6.1)22. OS and RFS curves were generated using the Kaplan–Meier (K-M) method and compared using a log-rank test, and univariable and multivariable Cox analyses were performed to identify independent predictors of RFS. All tests were two-sided, and a P value < 0.05 was considered statistically significant. All assays were repeated 3–6 times. *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001; ns, P > 0.05.
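
To make the Kaplan–Meier estimate concrete, the sketch below computes the product-limit survival curve from follow-up times and event indicators. It is a minimal illustration with invented data; the function name is an assumption, and it is not the software used in the study.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct event time.

    times:  follow-up time for each patient (e.g. months).
    events: 1 if the event (e.g. recurrence or death) occurred, 0 if censored.
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after time t.
        at_risk -= n_removed
    return curve


# Invented follow-up data for five patients.
curve = kaplan_meier([6, 12, 12, 20, 24], [1, 1, 0, 1, 0])
```

In this toy example the estimated survival falls to 0.8, 0.6 and 0.3 at months 6, 12 and 20, while the censored patients only reduce the number at risk.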

Results

Characteristics of the cohorts

Baseline characteristics of patients in the TC, IVC, PTC and ETC are detailed in Table 1. A total of 437 patients were identified as the primary cohort (7:3 ratio, TC = 307, IVC = 130), 100 patients were enrolled in the PTC, and 100 patients were included in the ETC (SPH = 59, WCH = 41). The proportions of patients with occult lymph node metastasis positive (OLMP) status in the TC, IVC, PTC and ETC were 18.89%, 18.46%, 23.00% and 22.00%, respectively. Comparison of characteristics between occult lymph node metastasis negative (OLMN) and OLMP groups in the four cohorts is shown in Table E3 (online). A total of 286 patients were enrolled in the PAC (OLMN = 221, OLMP = 65). The lncRNA/mRNA expression file was obtained by high-throughput RNA-seq of 30 fresh samples (20 LUAD tissues [OLMN = 10, OLMP = 10] and 10 paracancerous tissues [Normal = 10]) from 20 patients with cN0 LUAD. The 20 patients (10 males; average age 65.95 ± 10.17) served as the RGC. A detailed description of the PAC and RGC is given in Table E4 (online).

Table 1 Baseline characteristics of patients in four study cohorts

Radiomic analysis

Radscore building and validation

After image segmentation and preprocessing, a total of 128 quantitative imaging features were extracted from PET/CT images. Because all extracted features had inter-observer ICC values > 0.75, all 128 features were included in subsequent analysis. Following the mRMR and LASSO process, eight radiomic features remained, including three PET features (GLZLM_ZLNU.PET, GLZLM_SZE.PET, GLCM_Energy[=AngularSecondMoment].PET) and five CT features (GLRLM_HGRE.CT, NGLDM_Contrast.CT, GLZLM_LZLGE.CT, CONVENTIONAL_Humax, SHAPE_Sphericity[onlyFor3DROI].CT) (Fig. 2A–C). Then, the eight selected features were used to construct a radiomics signature by a logistic regression algorithm (see the formula of Radscore in Appendix E1 [online]). ROC analysis was performed to evaluate the predictive performance of the Radscore for the presence of OLM in the TC, IVC, PTC and ETC. The respective values of AUC, accuracy, specificity, sensitivity, positive and negative predictive values were 0.82, 84.48%, 69.88%, 72.64%, 70.69% and 71.86% in the TC; 0.81, 62.50%, 88.68%, 83.85%, 88.11% and 84.59% in the IVC; 0.78, 82.61%, 71.43%, 74.00%, 72.15% and 73.31% in the PTC; and 0.79, 72.73%, 75.64%, 75.00%, 75.48% and 75.16% in the ETC (Fig. 2D, Table 2).
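
For reference, the threshold-dependent metrics reported above follow from the 2 × 2 confusion matrix of predicted versus pathologic OLM status. The sketch below shows the standard definitions on invented data; the function name and the example labels are assumptions and do not reproduce the study's results.

```python
def diagnostic_metrics(predicted, actual):
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels.

    predicted, actual: sequences of 1 (positive) and 0 (negative).
    Assumes each class and each predicted label occurs at least once,
    so that no denominator is zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Invented example with 2 true positives, 1 false positive,
# 1 false negative and 4 true negatives.
metrics = diagnostic_metrics([1, 1, 0, 0, 1, 0, 0, 0],
                             [1, 0, 0, 0, 1, 1, 0, 0])
```

Unlike the AUC, these values depend on the Radscore cutoff used to label a patient as predicted OLMP.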

Fig. 2: Radiomics and survival analysis for cN0 LUAD patients.

A, B Radiomics feature selection using LASSO logistic regression and C the histogram of Radscore based on the selected features. D ROC curves of the Radscore in four cohorts. E–H The AUCs of clinical, radiomic and combined nomogram models in the E training cohort, F internal validation cohort, G prospective test cohort and H external test cohort. RFS curves according to I pathologic OLM status and J Radscore-predicted OLM status in the prognostic assessment cohort. K Forest plot of multivariate Cox regression analysis (n = 286 biologically independent samples). cN0 clinical N0, LUAD lung adenocarcinoma, LASSO least absolute shrinkage and selection operator, ROC receiver operating characteristic, AUC area under the curve, RFS recurrence-free survival.

Table 2 Performance evaluation of three models in four cohorts

Nomogram construction and evaluation

A combined nomogram was constructed based on Radscore and clinical features (Fig. E3A [online], Table E5 [online]). Calibration curves and DCA showed good concordance and clinical applicability of the nomogram (Fig. E3B–F [online]). In addition, ROC curve analysis was used to evaluate the performance of the radiomics (AUCs: 0.78–0.82), clinical (AUCs: 0.59–0.68) and nomogram (AUCs: 0.76–0.83) models in the four cohorts (Fig. 2E–H). Furthermore, we performed DeLong’s test to compare the AUC of the radiomics model with that of the combined model in the training cohort. The difference was not significant (P = 0.3188), which implied that clinical features did not substantially contribute to the combined model.

Survival prediction

The 2-year RFS rate was 98.5% in the OLMN group and 69.7% in the OLMP group. The K-M curves (Fig. 2I, J) showed that, whether grouped by pathologic OLM status or by Radscore-predicted OLM status, the OLMP group had a shorter time to recurrence than the OLMN group (log-rank test, P < 0.0001). In addition, all clinical parameters and the Radscore were assessed by univariate and multivariate Cox regression analysis. Psize (P = 0.02), T-staging (P = 0.011) and Radscore (P < 0.001) were statistically significant in univariate analysis, while only Radscore (HR 1.798, 95% CI 1.454–2.225; P < 0.001) was identified as an independent predictor of recurrence risk in cN0 LUAD (Fig. 2K).

Radiogenomics mapping

Identification of DEG

To reduce the computational load, DEG analysis was first performed for the preliminary screening of OLM-related genes. Heatmaps and volcano plots of DEGs are shown in Figs. E4 and E5 (online). Eventually, 542 DElncRNAs and 1348 DEmRNAs in the intersections of the two comparisons were retained for further analysis (Fig. 3B, C).

Fig. 3: Radiogenomics mapping.

A ROC curves of the Radscore in the RGC. B, C Venn diagram showing the overlap of identified DEGs from the two comparisons for B lncRNAs and C mRNAs (LUAD tissues vs paracancer tissues, LUAD tissues of OLMN vs LUAD tissues of OLMP). D, E Heatmaps of module-trait relationships based on D lncRNA data and E mRNA data. The x axis represents Radscore and its component radiomics features, and the y axis represents co-expressed gene modules with various colors. Each cell includes the correlation coefficient R (top) and P value (bottom). R > 0 indicates positive correlation, represented by red; R < 0 indicates negative correlation, represented by blue; and a deeper color indicates greater correlation. F, G GO and KEGG enrichment analysis of green module genes. ROC receiver operating characteristic, Radscore radiomics score, RGC radiogenomic cohort, DEG differentially expressed genes, LUAD lung adenocarcinoma, OLMN occult lymph node metastasis negative, OLMP occult lymph node metastasis positive, para-Ca paracancer, ME module eigengene.

Construction of coexpression modules

Prior to WGCNA, ROC curve analysis was performed in the RGC (AUC = 0.80), again suggesting that the proposed Radscore has good predictive performance (Fig. 3A and Table E6 [online]). Then, the expression matrices of the 542 DElncRNAs and 1348 DEmRNAs were each used to find key modules associated with Radscore and OLM-related radiomics features (ORRF) through WGCNA. WGCNA groups genes with similar expression patterns together to form a gene module. As shown in the heatmaps of the module-trait relationships for lncRNA (Fig. 3D) and mRNA (Fig. 3E), there were significant correlations between modules and Radscore and its component radiomics features, except for SHAPE_Sphericity[onlyFor3DROI].CT. The results showed that Radscore was significantly related to the blue, brown and red lncRNA modules (r = −0.63, −0.57 and 0.65, respectively), and significantly correlated with the blue, brown, green and red mRNA modules (r = −0.58, −0.59, 0.72 and 0.53, respectively).

Enrichment analysis

GO and KEGG enrichment analyses were performed to provide biological interpretations of the genes in the most related modules. The green module showed the strongest correlation with Radscore, CONVENTIONAL_Humax, GLZLM_LZLGE.CT and GLZLM_ZLNU.PET (Fig. 3E), and genes in the green module were mainly enriched in immune-related functions and pathways, such as T-cell activation and natural killer cell-mediated cytotoxicity (Fig. 3F, G). The blue, brown, magenta, and midnight blue modules showed a strong correlation with ORRF, and the results of the enrichment analysis are summarized in Fig. E7 (online).

Hub genes identification and network analysis

As illustrated in Fig. 4A, B, 10 hub mRNAs and 8 hub lncRNAs were identified in the modules most positively and negatively correlated with Radscore. Moreover, the co-expression (hub-lncRNAs and hub-mRNAs) and ceRNA networks are shown in Fig. 4C, D, and the detailed description can be found in Appendix E2 (online).

Fig. 4: Hub genes identification and network analysis.

A The PPI network of 10 candidate hub mRNAs from green and brown modules. B Eight hub lncRNAs from blue and red modules (|MM| > 0.9 and |GS| > 0.6). C Co-expressions between 8 hub-lncRNAs and 10 hub-mRNAs. D The OLM-associated ceRNA network in LUAD. Triangle, ellipse and diamond represent hub-lncRNAs, miRNAs and hub-mRNAs, respectively. E, F The expression (E) and OS curves (F) of OLM-related hub genes (MIR600HG, FAM13A-AS1, AQP4, and GRIA1) in LUAD based on the TCGA database (n = 284). PPI protein-protein interaction, MM cor.geneModuleMembership, GS cor.geneTraitSignificance, OLM occult lymph node metastasis, LUAD lung adenocarcinoma, TCGA The Cancer Genome Atlas. All tests were two-sided.
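The hub-lncRNA criterion in panel B (|MM| > 0.9 and |GS| > 0.6) amounts to a simple two-threshold filter. The sketch below assumes that module membership (MM) and gene significance (GS) have already been computed for each candidate gene. The gene identifiers, the values and the function name `select_hub_genes` are hypothetical.

```python
def select_hub_genes(stats, mm_cutoff=0.9, gs_cutoff=0.6):
    """Return genes whose |MM| and |GS| both exceed the given cutoffs.

    stats maps a gene identifier to a (MM, GS) tuple. Absolute values are
    used so that strongly negative correlations also qualify.
    """
    return sorted(
        gene
        for gene, (mm, gs) in stats.items()
        if abs(mm) > mm_cutoff and abs(gs) > gs_cutoff
    )


# Toy values (hypothetical), not results from the study.
candidates = {
    "lncRNA_1": (0.95, -0.70),  # passes both thresholds
    "lncRNA_2": (0.92, 0.40),   # fails the GS threshold
    "lncRNA_3": (-0.93, 0.65),  # passes both (negative MM)
    "lncRNA_4": (0.85, 0.80),   # fails the MM threshold
}

hubs = select_hub_genes(candidates)
print(hubs)
```

Both conditions must hold: a gene has to be central to its module (high |MM|) and strongly associated with the trait of interest (high |GS|).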

Validation of hub genes

Validation in the TCGA database

As shown in Fig. 4E, the expression levels of MIR600HG, FAM13A-AS1, AQP4 and GRIA1 decreased progressively across adjacent normal tissues (Normal), tumor tissues without LNM (N–) and tumor tissues with LNM (N+), with statistically significant differences between each pair of groups. Survival analysis (Fig. 4F) showed that LUAD patients with high expression of MIR600HG, FAM13A-AS1, AQP4 and GRIA1 had significantly longer OS (P = 0.012, 0.022, 0.005 and 0.002, respectively). The results for the remaining hub genes are provided in Figs. E8 and E9 (online).

Validation of clinical tissue samples

The validation of hub-lncRNAs by RT-qPCR and of hub-mRNAs by immunohistochemistry (IHC) is detailed in Appendix E3 (online). The RT-qPCR results showed that the expression level of MIR600HG decreased progressively across the Normal, OLMN and OLMP groups, with statistically significant differences between each pair of groups, consistent with the TCGA results above (Fig. 5A). The results for the other hub-lncRNAs are also shown in Fig. 5A. Representative IHC staining images of hub-mRNAs in the three groups are shown in Fig. E10 (online).

Fig. 5: Validation of Hub Genes.

A The expression of hub-lncRNAs in OLMN LUAD tissue samples, OLMP LUAD tissue samples and corresponding noncancerous tissue samples (normal) was detected using RT-qPCR. B The levels of MIR600HG expression in 5 human LUAD cell lines and 1 human bronchial epithelial cell line were measured by RT-qPCR. C, D Knockdown efficiency of MIR600HG in C A549 and D H1299 cells determined by RT-qPCR. E, F Validation of the pcDNA3.1-MIR600HG over-expression vector in E A549 and F H1299 cells as determined by RT-qPCR. LUAD lung adenocarcinoma, OLMN occult lymph node metastasis negative, OLMP occult lymph node metastasis positive, RT-qPCR reverse transcription and quantitative real-time polymerase chain reaction. Horizontal lines (A–F) represent the mean, and error bars represent the standard deviation (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001, nsP > 0.05, unpaired two-sided t test). Each experiment was repeated three times independently (n = 3 biologically independent experiments).

Verification in vitro

The above results indicated that MIR600HG is closely related to OLM and to the prognosis of LUAD. We therefore carried out further in vitro experiments on this gene. MIR600HG expression levels in LUAD cell lines and the efficiency of knockdown and overexpression are shown in Fig. 5B–F (Appendix E4 [online]). MIR600HG knockdown notably enhanced the proliferation, migration and invasion of LUAD cells, whereas MIR600HG overexpression inhibited cell proliferation, migration and invasion (Fig. 6A–D).

Fig. 6: Validation of MIR600HG.

A–D Effects of MIR600HG on LUAD cell proliferation, migration and invasion. A EdU proliferation assay, B colony formation assay, C transwell migration and invasion assay, D wound scratch assay in SI-NC or SI-1/SI-3 transfected A549 and H1299 cells, and in pcDNA3.1-MIR600HG transfected A549 and H1299 cells. LUAD lung adenocarcinoma, EdU 5-ethynyl-2′-deoxyuridine. In the boxplots, horizontal lines (A/C/D) represent the mean, and error bars represent the standard deviation. *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001, nsP > 0.05, unpaired two-sided t test. Each experiment was repeated three times independently (n = 3 biologically independent experiments). (×100; scale bars indicate 50 μm).

Discussion

Radiomics phenotypes have shown promising value in predicting OLM in cN0 LUAD23,24,25. However, these studies were limited by single-modality images, relatively small sample sizes and an unclear biological meaning of OLM-related radiomics phenotypes. Our proposed Radscore, based on dual-modality images (PET and CT) and multiple test cohorts, showed excellent robustness and outperformed existing models, achieving AUCs of 0.82, 0.81, 0.78 and 0.79 in the TC, IVC, PTC and ETC, respectively. Moreover, predictive rules integrating clinical parameters could also be used to calculate the probability of OLM, resulting in AUCs of 0.76–0.83 in the four cohorts. However, the efficacy of the clinical model fluctuated in our validation and test cohorts, yielding AUCs of only 0.59–0.68, which seems to reflect the subjectivity of feature evaluation. Therefore, we compared the performance of different models in the training cohort. Interestingly, compared with the radiomics model, the nomogram model did not improve the classification performance (P = 0.3188), which implied that clinical features did not substantially contribute to the combined model. According to the cutoff Radscore value, patients were divided into high-risk and low-risk groups for OLM. The high-risk group might require aggressive lymph node dissection or even adjuvant therapies, while the low-risk group might need only regional lymph node dissection, lymph node sampling, or no lymph node dissection.
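As a minimal sketch of the two quantities discussed above, the AUC can be computed as the fraction of (OLM-positive, OLM-negative) patient pairs in which the positive patient has the higher score, with ties counted as one half, and risk groups follow from a single cutoff. The scores, labels and the cutoff of 0.5 below are hypothetical toy values; the study's actual cutoff is not restated here, and the function names `auc` and `stratify` are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for OLM-positive, 0 for OLM-negative.
    Each positive/negative pair scores 1 if the positive case has the
    higher score, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def stratify(scores, cutoff):
    """Assign 'high' risk to scores at or above the cutoff, else 'low'."""
    return ["high" if s >= cutoff else "low" for s in scores]


# Toy data (hypothetical): eight patients.
toy_scores = [0.82, 0.35, 0.67, 0.15, 0.58, 0.40, 0.91, 0.22]
toy_olm = [1, 0, 1, 0, 0, 1, 1, 0]

toy_auc = auc(toy_scores, toy_olm)
groups = stratify(toy_scores, cutoff=0.5)  # cutoff chosen for illustration
print(f"AUC = {toy_auc:.4f}")
print(groups)
```

The pairwise definition is equivalent to the area under the empirical ROC curve and makes explicit that the AUC depends only on the ranking of scores, whereas the risk groups depend on the chosen cutoff.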

Several previous studies developed radiomics prognostic models that could predict the outcome of patients with lung cancer26,27,28. Unlike previous investigations, we applied OLM as a stratifying factor and explored the survival prediction value according to pathologic and Radscore-predicted OLM status. The results suggested that OLM was closely associated with postoperative recurrence and metastasis, which in turn supported the ability of Radscore to predict OLM accurately. Furthermore, we found that Radscore was the only independent predictor of RFS in LUAD.

A growing number of radiogenomic analyses indicate that imaging features are related to biological pathways or processes in lung cancer, such as the association of normalized SUVmax with epithelial-mesenchymal transition19,29. WGCNA has been demonstrated to be a useful tool to confirm image-to-molecular feature associations30. Hence, we used WGCNA in the RGC to uncover the biological basis of our proposed signature and ORRF. The WGCNA results demonstrated that, except for SHAPE_Sphericity[onlyFor3DROI].CT, the ORRF and Radscore were significantly correlated with co-expressed gene modules of lncRNA and mRNA. Enrichment analysis of the module genes showed that the most prominent enrichment was in pathways involved in immune responses: Radscore, CONVENTIONAL_Humax, GLZLM_LZLGE.CT and GLZLM_ZLNU.PET were related to T-cell activation and natural killer cell-mediated cytotoxicity, whereas NGLDM_Contrast.CT was related to the regulation of leukocyte-mediated immunity and T-cell-mediated cytotoxicity. Many studies have confirmed that immune responses are closely related to tumor invasion and metastasis31,32. Recently, increasing evidence suggests that LNM plays an important role in generating anti-tumor immunity in NSCLC33,34. These findings imply that patients at high risk of OLM may benefit from immunotherapy to improve clinical outcomes.

To further explore the potential mechanisms of OLM, we constructed co-expression and ceRNA networks based on the hub genes. The co-expression analysis found several highly correlated (|r| ≥ 0.9, P < 0.01) mRNA–lncRNA, mRNA–mRNA or lncRNA–lncRNA pairs, including LOC105376718 and FAM13A-AS1, FLJ22447 and LOC105378963, FLJ22447 and MIR31HG, FLJ22447 and SAA1, LOC105378963 and LOC105378964, and MIR31HG and SAA1. The ceRNA network analysis showed that 3 hub-lncRNAs (MIR600HG, MIR31HG and FAM13A-AS1), 17 miRNAs and 5 hub-mRNAs were involved in the ceRNA network related to OLM of LUAD. LncRNAs are considered a major part of the ceRNA network and have attracted increasing attention. MIR31HG has already been reported to be involved in the ceRNA network and to regulate the migration and invasion of lung cancer cells35. Although the effects of MIR600HG and FAM13A-AS1 in lung cancer have not been reported, studies have shown that MIR600HG could suppress the metastasis of colorectal cancer cells36 and that FAM13A-AS1 could regulate proliferation and apoptosis of cervical cancer cells37.
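The pair selection described above can be sketched as an all-pairs Pearson screen. This sketch applies only the |r| ≥ 0.9 threshold; the accompanying P < 0.01 criterion is omitted for brevity. The gene labels and expression values are toy data, and `strong_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strong_pairs(expr, r_cutoff=0.9):
    """List (gene_a, gene_b, r) for every pair with |r| >= r_cutoff.

    expr maps a gene identifier to its expression values across the same
    ordered set of samples. The significance test is not included here.
    """
    pairs = []
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= r_cutoff:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy expression profiles (hypothetical) over six samples.
toy_expr = {
    "gene_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gene_2": [2.0, 4.0, 6.5, 8.0, 10.0, 12.2],  # tracks gene_1
    "gene_3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],    # mirrors gene_1
    "gene_4": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],    # weakly related
}

pairs = strong_pairs(toy_expr)
for a, b, r in pairs:
    print(a, b, r)
```

Using the absolute value keeps both positively and negatively co-expressed pairs, which matters for a network that mixes lncRNAs and mRNAs.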

To verify the reproducibility of these hub genes, we validated our findings using a public database, clinical tissue samples and in vitro experiments. Verification in the TCGA database showed that the expression levels of MIR600HG, FAM13A-AS1, AQP4 and GRIA1 were significantly lower in LUAD tumor tissues than in normal lung tissue, and that low expression of these genes was closely related to OLM and poor prognosis of early-stage LUAD. Therefore, these four key RNAs may play an important role in OLM of early-stage LUAD, which could provide new ideas for early diagnosis and treatment of LNM in LUAD. Consistent with the TCGA results, RT-qPCR showed that the expression level of MIR600HG was significantly lower in LUAD tumor tissues, and that low expression of MIR600HG was closely related to OLM and poor prognosis of early-stage LUAD. In vitro experiments revealed that MIR600HG could suppress proliferation, migration and invasion of LUAD cells. In short, MIR600HG, as a tumor suppressor gene, could serve as a potential marker for early diagnosis of LNM and for prognosis in LUAD, and may be a potential target for future treatment of LUAD metastasis.

Our study has some limitations. First, the sample size was still not large enough, although the PTC and ETC were used to enhance the statistical power. Second, the robustness of radiomics features still needs to be improved because of the diverse PET/CT scanners and acquisition protocols, although image preprocessing and feature normalization were performed. Third, the number of patients and samples collected in the RGC was small owing to the difficulty of collecting fresh tissue samples. Fourth, we did not perform in-depth mechanistic studies, although we provide multilevel evidence that links ORRF with lncRNA and mRNA in LUAD. Fifth, we included only patients with LUAD, which limits applicability in daily clinical practice because some pulmonary nodules are too small for their pathology to be determined before surgery.

In summary, our study showed that the PET/CT-based Radscore can be used to predict OLM and serves as an independent prognostic factor for postoperative recurrence in patients with cN0 LUAD. In addition, the Radscore, which captures information on OLM and prognosis, was primarily associated with immune regulation, such as T-cell activation and natural killer cell-mediated cytotoxicity. Furthermore, four key genes (MIR600HG, FAM13A-AS1, AQP4 and GRIA1), especially MIR600HG, play important roles in OLM and prognosis of early-stage LUAD, which could provide new ideas for early diagnosis and treatment of LNM in LUAD.