Introduction

Gastric cancer (GC) is one of the most common malignant tumors worldwide, ranking fifth in incidence and fourth in cancer-related mortality1,2. Despite advances in treatment, GC remains a major clinical challenge due to its low 5-year survival rate among digestive system malignancies and its high propensity for distant metastasis3,4. Clinically, early-stage GC is often asymptomatic or presents with nonspecific gastrointestinal symptoms, leading to frequent misdiagnosis as gastritis or other benign conditions5,6. As a result, most patients are diagnosed at an advanced stage. GC is characterized by significant heterogeneity7. Pathologically, it includes several subtypes such as adenocarcinoma, squamous cell carcinoma, and carcinoid tumors1. From a molecular perspective, The Cancer Genome Atlas (TCGA) categorizes GC into four major subtypes: Epstein–Barr virus (EBV)-positive, microsatellite instability (MSI), genomically stable (GS), and chromosomal instability (CIN)8,9. Numerous studies have demonstrated that these molecular subtypes exhibit distinct prognoses and responses to immunotherapy or targeted therapies. However, a robust and clinically applicable molecular classification system or predictive model for guiding individualized treatment remains lacking10. Therefore, it is crucial to further elucidate the molecular mechanisms underlying GC and to promote the development of reliable biomarkers and personalized therapeutic strategies, which may significantly enhance diagnostic accuracy and treatment efficacy in GC management.

Cancer stem cells (CSCs) are a distinct subpopulation of cancer cells with unlimited self-renewal and differentiation potential. They play critical roles in tumor initiation, progression, metastasis, and therapeutic resistance11. In GC, beyond the classical CSCs, recent transcriptomic analyses—particularly at the single-cell level—have identified cancer cell subsets with high stemness scores12,13. Although these cells may not fully meet the phenotypic or functional criteria of traditional gastric cancer stem cells (GCSCs), they display pronounced stemness-like features and contribute similarly to tumor aggressiveness, immune evasion, and drug resistance14,15. As a result, high-stemness GC cells have garnered increasing research interest. Given their malignant molecular characteristics and transcriptional profiles, these cells represent a biologically and clinically relevant subpopulation. Investigating their regulatory networks, functional roles, and interactions with the tumor microenvironment (TME) may enhance our understanding of GC heterogeneity and uncover novel therapeutic targets16,17,18.

Based on these insights, we identified a subset of malignant GC cells with high stemness scores (HighStem) by applying CytoTRACE to single-cell RNA sequencing (scRNA-seq) data. We characterized the biological features of these cells and explored their interactions with other cell types within the TME. Through high-dimensional WGCNA (hdWGCNA) and multiple machine learning approaches, we further identified five core marker genes of HighStem cells: APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1. Using these genes as features, we constructed an accurate predictive model for HighStem cell identification within a benchmark machine learning framework. Together, our findings offer insights into the molecular landscape of high-stemness GC cells and may provide potential targets and strategies for improving the clinical diagnosis and treatment of gastric cancer.

Results

Identification of cell populations

Using the Seurat pipeline at a clustering resolution of 0.8, scRNA-seq analysis of gastric cancer samples revealed 38 transcriptionally distinct clusters (Fig. 1A). Based on canonical marker genes, these clusters were annotated into major cell types including T cells, B cells, NK cells, monocytes, macrophages, dendritic cells, epithelial cells, fibroblasts, endothelial cells, and others (Fig. 1B). Marker gene expression supported accurate cell-type classification (Fig. 1C). For example, CD3D and CD8A marked T cells, NKG7 and KLRD1 marked NK cells, CD68 and CD163 marked macrophages, EPCAM and CDH1 marked epithelial cells, while CD1C, FCER1A, and CLEC9A were enriched in dendritic cells. To visualize gene expression patterns across the UMAP space, gene density plots were generated for selected marker genes (Fig. 1D).

Fig. 1: Identification and annotation of cell populations in GC tissue.

A UMAP plot showing 38 transcriptionally distinct cell clusters identified by Seurat. B Cell type annotation based on canonical marker genes. C Dot plot displaying representative marker gene expression across major cell types. Dot size indicates the percentage of expressing cells; color denotes relative expression level. D Gene density plots show the distribution of selected marker genes in the UMAP space.

Identification and characterization of HighStem malignant gastric cancer cells

Given that GC originates from epithelial cells, we first isolated epithelial subpopulations for CNV analysis (Fig. 2A). Cluster 3 exhibited significantly lower CNV scores compared to clusters 1, 2, and 4 (Fig. 2B), indicating a lack of large-scale chromosomal alterations. Therefore, cluster 3 was annotated as normal epithelial cells, while the remaining epithelial clusters were considered malignant GC cells. To evaluate stemness, we applied CytoTRACE, which estimates cellular differentiation potential based on transcriptional diversity. As shown in Fig. 2C, malignant epithelial cells exhibited a continuum of differentiation states. Based on the distribution of CytoTRACE scores, cells were stratified into three groups: LowStem, DTStem (intermediate), and HighStem, corresponding to the bottom 25%, middle 50%, and top 25% of the score range, respectively (Fig. 2D). UMAP visualization demonstrated distinct spatial separation of these subgroups (Fig. 2E), and HighStem cells displayed significantly elevated CytoTRACE scores compared to the other groups (Fig. 2F).
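The three-group stratification described above can be illustrated with a short Python sketch. It assumes the 25%/50%/25% split refers to quartiles of the score distribution (the text could also be read as fractions of the numeric score range), and the scores and the function name `stratify_by_stemness` are hypothetical, not taken from the study.

```python
from statistics import quantiles


def stratify_by_stemness(scores):
    """Label each score LowStem / DTStem / HighStem using the 1st and 3rd quartiles."""
    # Assumption: cut points are the 25th and 75th percentiles of the scores.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for s in scores:
        if s <= q1:
            labels.append("LowStem")
        elif s >= q3:
            labels.append("HighStem")
        else:
            labels.append("DTStem")
    return labels


# Hypothetical CytoTRACE-like scores in [0, 1]; not data from the study.
scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
labels = stratify_by_stemness(scores)
print(list(zip(scores, labels)))
```

With real data the scores would come from the CytoTRACE output for malignant epithelial cells only, so that the quartiles are not shifted by non-malignant populations.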

Fig. 2: Identification and characterization of high-stemness malignant GC cells.

A Heatmap of CNV signals across epithelial clusters. B Violin plot shows CNV score differences among epithelial clusters. C CytoTRACE plot displaying stemness score distribution in malignant epithelial cells. D Histogram of CytoTRACE score-based subgroup classification: LowStem, DTStem, and HighStem. E UMAP visualization of malignant cells colored by stemness subgroups. F Box plot comparing CytoTRACE scores among subgroups. G UMAP showing scPagwas-derived tumor relevance scores (TRS). H Box plot of TRS values among stemness subgroups. I Correlation between CytoTRACE stemness scores and TRS values. J UMAP plots of stemness score, TRS, and their combined distribution.

To further explore the genetic basis underlying stemness, we performed scPagwas analysis. The tumor relevance score (TRS), representing the degree of GWAS signal enrichment at the single-cell level, was visualized across the UMAP space (Fig. 2G). HighStem cells exhibited significantly higher TRS scores (Fig. 2H), suggesting a potential genetic contribution to their stemness phenotype. Consistently, UMAP plots of stemness score, TRS score, and their combination revealed overlapping high-scoring regions (Fig. 2J). Finally, correlation analysis revealed a strong positive relationship between CytoTRACE-based stemness scores and TRS values (Fig. 2I), supporting the hypothesis that stemness in malignant GC cells may be driven by inherited genetic regulatory programs.
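As a minimal illustration of the correlation step, the following Python sketch computes a Pearson coefficient between per-cell stemness scores and TRS values. The text does not state which correlation coefficient was used, so Pearson is an assumption here, and the numbers are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-cell values; not data from the study.
stemness = [0.10, 0.25, 0.40, 0.60, 0.85]
trs = [0.02, 0.05, 0.04, 0.09, 0.12]
print(round(pearson_r(stemness, trs), 3))
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.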

HighStem cells exhibit enhanced intercellular communication, active signaling pathways, and distinct metabolic reprogramming

To explore the functional characteristics of HighStem GC cells, we analyzed intercellular communication patterns using the CellChat framework. Compared with the LowStem and DTStem groups, HighStem cells exhibited markedly increased interaction frequency and strength with diverse cell types in the TME, particularly macrophages, endothelial cells, and fibroblasts (Fig. 3A, B). Outgoing and incoming signaling analysis further revealed that HighStem cells were active hubs in multiple signaling pathways, suggesting enhanced crosstalk potential (Fig. 3C).

Fig. 3: Cell–cell crosstalk profiling of HighStem cells.

A, B Cell–cell interaction networks showing total interaction counts and strengths, highlighting HighStem cells. C Incoming vs. outgoing interaction strength across cell types. D, E Increased ligand–receptor interaction pairs involving HighStem cells compared to LowStem. F, G Key ligand–receptor pairs enriched in HighStem-related interactions. H Pathway activity heatmap showing upregulation of oncogenic signals in HighStem cells. I Bubble plots illustrate metabolic pathway enrichment, with broad metabolic activation in HighStem cells.

Further quantitative comparison of ligand–receptor interactions showed that HighStem cells engaged in significantly more communication events than LowStem cells, both as ligand providers and receptor recipients (Fig. 3D, E). Key ligand–receptor pairs enriched in HighStem cells included MIF–CD74, NAMPT–INSR, and MDK–SDC1, indicating their involvement in immune modulation, stress responses, and stemness maintenance (Fig. 3F, G).

Pathway enrichment analysis highlighted that HighStem cells were positively associated with signaling cascades such as PI3K, WNT, TGF-β, and JAK–STAT, all of which are known to promote stemness and tumor progression (Fig. 3H). Moreover, metabolic profiling revealed that HighStem cells exhibited upregulation of a wide range of metabolic pathways, including glutathione metabolism, fatty acid metabolism, steroid biosynthesis, and glycosaminoglycan biosynthesis, indicating active metabolic reprogramming that supports their aggressive phenotype (Fig. 3I).

Identification of HighStem gene co-expression modules via hdWGCNA

To uncover gene expression programs associated with HighStem cells, we performed hdWGCNA. A soft-thresholding power of 6 was selected based on scale-free topology and connectivity criteria (Fig. 4A). Hierarchical clustering identified five distinct co-expression modules, each represented by a unique color (Fig. 4B). Module eigengene (ME) analysis revealed the key genes contributing to each module (Fig. 4C). UMAP projection of module expression showed the distribution of each module across malignant cells (Fig. 4D). Among them, the brown, green, yellow, and blue modules showed distinct expression patterns across cells with different stemness states.

Fig. 4: Identification of HighStem-associated gene co-expression modules using hdWGCNA.

A Soft-thresholding power selection based on scale-free topology and connectivity. B Dendrogram showing five gene co-expression modules. C Module eigengene-based bar plots highlighting top-contributing genes. D UMAP projections of module expression across malignant cells. E Correlation matrix among modules. F Dot plot showing enrichment of brown, green, and yellow modules in HighStem cells.

Correlation analysis between modules revealed moderate co-expression relationships, with the turquoise module showing less connectivity to others (Fig. 4E). Importantly, the brown, green and yellow modules were specifically enriched in HighStem cells, both in terms of average expression and proportion of expressing cells (Fig. 4F). The 246 genes included in these modules are shown in Supplementary Table 1.

HighStem signature gene selection

To obtain a robust HighStem gene set, we initially identified the 171 genes showing the strongest positive correlation with the HighStem phenotype from three hdWGCNA modules (brown, green, and yellow) in 3367 HighStem cells using Pearson correlation analysis (Fig. 5A, Supplementary Table 2), collectively defined as the HighStem activity gene set. Spatial transcriptomics analysis demonstrated that HighStem activity scores were predominantly enriched within tumor regions (Fig. 5B). Consistently, RNA-seq data from the TCGA cohort revealed that HighStem activity was significantly elevated in GC tissues compared to adjacent normal tissues (Fig. 5C). All GC samples were subsequently scored for HighStem activity and divided into high- and low-activity groups based on the optimal cutoff values. Kaplan–Meier survival analysis showed that patients with high HighStem activity had significantly shorter overall survival than those with low activity (Fig. 5D), highlighting its prognostic relevance.
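The survival comparison above rests on the Kaplan–Meier estimator, which can be sketched in a few lines of Python. The follow-up times and event flags below are invented, the function name `kaplan_meier` is ours, and the sketch covers only the survival curve itself, not the optimal-cutoff search or the significance test used in the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event.

    times  : follow-up time per patient
    events : 1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up (months) for one activity group; not TCGA data.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)
```

Running the estimator separately on the high- and low-activity groups would give the two curves that a plot such as Fig. 5D compares.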

Fig. 5: HighStem signature gene selection and validation.

A Genes positively correlated with the HighStem phenotype across three hdWGCNA modules (top 100 genes displayed). B Spatial transcriptomics showing tumor-enriched HighStem activity. C Comparison of HighStem activity between normal and tumor tissues in TCGA. D Kaplan–Meier survival curves stratified by HighStem activity score. E, F Feature importance ranking based on the Boruta algorithm. G Gene importance ranking from Decision Tree analysis. H Feature selection via LASSO regression. I Error rate curve and feature ranking from Random Forest analysis. J Feature importance ranking from GBM analysis. K Feature selection and optimal subset identification using ABESS. L Upset plot showing the intersection of top genes identified by six machine learning algorithms, with five shared hub genes highlighted.

Six machine learning algorithms were then applied to the HighStem activity gene set to prioritize candidate genes. Boruta, a random forest-based wrapper method, ranked gene importance by contribution to classification accuracy, evaluating feature relevance through comparison with shadow features (Fig. 5E, F, Supplementary Table 3), and decision tree (DT) analysis provided a clear hierarchical view of gene importance based on node-splitting criteria (Fig. 5G, Supplementary Table 4). LASSO regression applied L1 regularization for dimensionality reduction, screening the most valuable predictors while minimizing overfitting (Fig. 5H, Supplementary Table 5). Random forest (RF) analysis evaluated the importance of each gene based on the mean decrease in accuracy, with a stable decline in model error observed as the number of decision trees increased (Fig. 5I, Supplementary Table 6). The gradient boosting machine (GBM), an ensemble boosting method, iteratively fitted weak learners to the residuals to improve model accuracy and prioritized candidate genes based on feature importance (Fig. 5J, Supplementary Table 7). Additionally, ABESS selected the optimal gene subset by evaluating different feature combinations and retaining the one with the lowest loss, enhancing the reliability of feature screening (Fig. 5K, Supplementary Table 8). By integrating results from all six algorithms, five consistently selected hub genes (APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1) were identified (Fig. 5L).
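The integration step amounts to intersecting the gene lists returned by the individual algorithms. The Python sketch below shows that logic only; the per-algorithm lists are invented for illustration (the `GENE_*` entries are placeholders, and the real lists are in the Supplementary Tables), and the function name `consensus_genes` is hypothetical.

```python
def consensus_genes(selections):
    """Return the genes selected by every algorithm, sorted alphabetically."""
    gene_sets = [set(genes) for genes in selections.values()]
    return sorted(set.intersection(*gene_sets))


# Hypothetical selections; GENE_* names are placeholders, not study results.
hub = ["APMAP", "CDKN2A", "TSPAN6", "MAPRE1", "GLB1"]
selections = {
    "Boruta": hub + ["GENE_A", "GENE_B"],
    "DT": hub + ["GENE_A"],
    "LASSO": hub + ["GENE_C"],
    "RF": hub + ["GENE_B", "GENE_C"],
    "GBM": hub + ["GENE_D"],
    "ABESS": hub,
}
print(consensus_genes(selections))
```

A strict intersection is the most conservative choice: a gene dropped by any single algorithm is excluded, which favors stability over sensitivity.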

Validation of HighStem hub genes and machine learning model performance

To investigate the spatial expression characteristics of the HighStem hub genes, we first visualized their expression density across all single cells. The results showed that APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1 were broadly expressed in the tumor tissue microenvironment, with relatively high signal intensity in specific cell populations (Fig. 6A). When focusing on malignant epithelial cells, these five genes demonstrated expression patterns that closely overlapped with the distribution of stemness scores, as indicated by CytoTRACE, suggesting that their expression was tightly associated with the high-stemness phenotype (Fig. 6B). Next, we evaluated the predictive performance of each gene in distinguishing malignant from non-malignant cells at the single-cell level. ROC curve analysis showed that all five genes had moderate to strong discriminatory power, with AUCs ranging from 0.709 to 0.832, among which APMAP (AUC = 0.832) and MAPRE1 (AUC = 0.827) showed the highest accuracy (Fig. 6C).
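For a single gene, the AUC reported above equals the probability that a randomly chosen positive cell has higher expression than a randomly chosen negative cell. A minimal Python sketch of that rank-based definition follows; the expression values and labels are invented, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical normalized expression of one gene and cell labels (1 = positive).
expression = [0.1, 0.4, 0.35, 0.8]
cell_labels = [0, 0, 1, 1]
print(roc_auc(expression, cell_labels))
```

On single-cell data with many thousands of cells, a sort-based implementation would be preferable to the pairwise loop.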

Fig. 6: Validation of HighStem hub genes and evaluation of machine learning classifier performance.

A Expression density plots of five HighStem hub genes in all cells. B Expression density in malignant epithelial cells, showing spatial overlap with stemness scores. C ROC curves show the single-cell classification performance of each gene. D Benchmark comparison of eight machine learning models. E Precision–recall and ROC curves of top-performing models. F ROC and PRAUC of the SVM model in development and test cohorts. G Confusion matrices of the training and test sets. H Decision curve analysis of the final classifier. I SHAP summary plot showing the contribution of each gene. J Predicted probability plots based on normalized gene expression. K SHAP dependency plots illustrate the relationship between feature values and model output.

Furthermore, to develop a reliable classifier for identifying HighStem cells, we systematically compared the performance of eight machine learning algorithms, including support vector machine (SVM), random forest (ranger), XGBoost, and decision tree (rpart), among others. Benchmark analysis showed that the SVM model achieved the highest average AUC across cross-validation folds, indicating excellent and stable predictive performance (Fig. 6D, Supplementary Table 9). The SVM model also demonstrated superior precision–recall and ROC curve performance (Fig. 6E), with a final AUC of 0.973 in the independent test set (Fig. 6F). Confusion matrix analysis revealed high classification accuracy in both training and test cohorts, with balanced sensitivity and specificity (Fig. 6G). Decision curve analysis further supported the clinical utility of the SVM model, with the proposed classifier showing a significantly higher net benefit across a wide range of thresholds (Fig. 6H). SHAP analysis identified APMAP, MAPRE1, GLB1, TSPAN6, and CDKN2A as the top contributors to model output (Fig. 6I). Predicted probability plots demonstrated a strong positive correlation between normalized gene expression and the predicted HighStem probability (Fig. 6J). Similarly, SHAP dependency plots revealed that increased expression levels of these genes positively contributed to the classification score, further supporting their functional relevance in defining HighStem GC cells (Fig. 6K).
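The net benefit plotted in a decision curve has a simple closed form: at threshold probability p_t, net benefit = TP/n − (FP/n) × p_t/(1 − p_t). The Python sketch below evaluates it for a classifier and for the "treat all" reference strategy; the predicted probabilities and labels are invented, and the function names are ours rather than part of any package used in the study.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of calling a case positive when its probability >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy that calls every case positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted HighStem probabilities and true labels.
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]
print(net_benefit(probs, truth, 0.5), net_benefit_treat_all(truth, 0.5))
```

A decision curve repeats this calculation over a grid of thresholds and plots the model against the "treat all" and "treat none" (net benefit of zero) references.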

Validation of HighStem hub gene expression and prognostic significance

To further validate the HighStem hub genes, we assessed their expression and clinical relevance using bulk RNA-seq data from TCGA. All five genes—APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1—were significantly upregulated in GC tissues compared to adjacent normal tissues (Fig. 7A). These findings were further confirmed in a paired GC cohort, where consistent overexpression of APMAP, CDKN2A, MAPRE1, and GLB1 was observed, while TSPAN6 showed an upward trend that did not reach statistical significance (Fig. 7B). Time-dependent ROC analysis further supported the diagnostic potential of these genes, with APMAP demonstrating the highest AUC (0.934), followed by TSPAN6 (0.836), MAPRE1 (0.801), CDKN2A (0.750), and GLB1 (0.697) (Fig. 7C). We next explored the prognostic value of these genes. K–M survival analysis revealed that high expression of TSPAN6 (p = 0.018) and MAPRE1 (p = 0.027) was significantly associated with worse overall survival in GC patients (Fig. 7D). GLB1 and APMAP showed borderline significance (p = 0.053 and p = 0.079, respectively), while CDKN2A did not reach statistical significance. These results highlight the potential of the five hub genes as diagnostic and prognostic biomarkers for high-stemness GC cells.

Fig. 7: Validation of HighStem hub genes in TCGA datasets.

A Expression levels of APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1 in normal and GC tissues. B Paired expression comparison in matched normal and GC samples. C Time-dependent ROC curves showing the diagnostic performance (AUC) of the five hub genes. D K–M survival curves stratified by gene expression levels in TCGA-GC cohort.

Five core marker genes were positively correlated with stem cell markers

To relate the core marker genes (APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1) to tumor stemness, we first analyzed their association with tumor stem cell-related markers, specifically components of the JAK1-STAT3 signaling pathway, using the TIMER 2.0 platform. The analysis revealed that APMAP (also known as C20orf3), GLB1, TSPAN6, and MAPRE1 were positively correlated with both JAK1 and STAT3 expression levels. In contrast, CDKN2A showed no significant correlation with either JAK1 or STAT3 (Fig. 8A).

Fig. 8: Analysis of the correlation between core marker genes and stem cell markers.

A Correlation of APMAP (also known as C20orf3), GLB1, TSPAN6, and MAPRE1 with JAK1 and STAT3 expression levels, analyzed using the TIMER 2.0 platform. B RT-PCR analysis of HighStem signature gene expression following knockdown or overexpression of the core genes in the SGC7901 and HGC-27 GC cell lines.

In addition, we explored the functional involvement of these core genes in tumor stemness by assessing the expression of HighStem signature genes following knockdown or overexpression of the core genes in the SGC7901 and HGC-27 GC cell lines. The signature genes examined included JAK1 (a tyrosine kinase that mediates cytokine signaling), STAT3 (a transcription factor crucial for cell proliferation and stemness maintenance), Hippo (a signaling pathway that restricts organ size and regulates stem cell self-renewal), YAP1 (a key effector of the Hippo pathway involved in cell growth and survival), and WNT3A (a ligand in the Wnt signaling pathway essential for stem cell regulation and tumor progression). The mRNA expression levels of these genes were downregulated upon knockdown of the core genes and upregulated in cells overexpressing the core genes (Fig. 8B).

Knockdown of core genes suppressed JAK1-STAT3 pathway

The JAK-STAT3 signaling pathway plays a pivotal role in the progression of various cancers by promoting cell proliferation, survival, immune evasion, and the maintenance of cancer stem cell properties. Aberrant activation of this pathway has been closely associated with tumor development and poor clinical outcomes in multiple malignancies, including GC.

Then, we analyzed the protein expression levels of components of the JAK1-STAT3 pathway in SGC7901 and HGC-27 GC cell lines following the knockdown of the core marker genes. The results demonstrated that in both SGC7901 and HGC-27 cell lines, knockdown of the core marker genes—APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1—led to a marked reduction in the protein expression of JAK1 and STAT3 (Fig. 9A–E), suggesting that these genes may act as upstream regulators of the JAK-STAT3 axis in GC cells.

Fig. 9: Effect of core gene knockdown on the JAK1-STAT3 pathway.

A–E Western blot analysis of the JAK1-STAT3 pathway in SGC7901 and HGC-27 GC cell lines following knockdown of the core marker genes.

Knockdown of core genes enhances drug sensitivity in GC cell lines

5-Fluorouracil (5-FU) and cisplatin are widely used chemotherapeutic agents in the treatment of GC and other solid tumors. 5-FU functions primarily as a pyrimidine analog that inhibits thymidylate synthase, thereby disrupting DNA synthesis and inducing apoptosis in rapidly dividing cells. Cisplatin exerts its antitumor effect by forming DNA crosslinks, which interfere with DNA replication and transcription, ultimately triggering cell death. However, resistance to these agents remains a major clinical challenge, often associated with tumor stemness and molecular alterations.

Further, we investigated the role of the core marker genes in modulating chemotherapy sensitivity. Upon treatment with 5-FU or cisplatin, knockdown of the core genes enhanced the sensitivity of both SGC7901 and HGC-27 GC cell lines to these agents. Notably, silencing of TSPAN6 and MAPRE1 led to a pronounced inhibition of cell proliferation under chemotherapeutic stress (Fig. 10A–D), suggesting that these genes may contribute to drug resistance mechanisms and could serve as potential therapeutic targets to overcome chemoresistance.

Fig. 10: Core gene knockdown enhances drug sensitivity in gastric cancer cell lines.

A, B CCK-8 assays were carried out to assess the sensitivity of HGC-27 GC cells to 5-FU and cisplatin after knockdown of the core genes (APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1). C, D CCK-8 assays were carried out to assess the sensitivity of SGC7901 GC cells to 5-FU and cisplatin after knockdown of the core genes (APMAP, CDKN2A, TSPAN6, MAPRE1, and GLB1).

Discussion

Tumor cell stemness has attracted increasing attention in recent years. Many cancer cells acquire enhanced proliferative capacity and immune evasion ability through the activation of embryonic development-related gene programs19,20. In 2006, the American Association for Cancer Research (AACR) defined CSCs as a subpopulation of tumor cells with self-renewal ability and the potential to generate heterogeneous tumor progeny. Experts at the time also suggested that CSCs may be inherently resistant to conventional therapies, making tumor stemness one of the emerging hallmarks of cancer21.

To date, extensive efforts have been made to identify markers associated with high-stemness tumor cells22. Surface markers such as CD44, CD24, CD29, CD90, and CD133 have been widely used to isolate CSCs from various cancers and cell lines23,24. With the rapid development of transcriptome-based analytical tools such as CytoTRACE25, mRNAsi (OCLR)26, and StemSC27, numerous potential markers have been further identified. For example, Yibo Fan et al. demonstrated that SOX9 maintains stem-like properties in advanced gastric cancer28, while Xiaoli Liu et al. showed that the activation of the Hippo-YAP1 signaling pathway upregulates FOXP4 to sustain gastric cancer stemness, highlighting FOXP4 as both a biomarker and a therapeutic target29. In addition, several non-coding RNAs such as lncRNA HCP5 (ref. 30) and circSLC4A7 (ref. 31) have also been implicated in stemness regulation. Despite these advances, few studies have systematically screened for high-stemness markers in GC or comprehensively evaluated their predictive value. To address this gap, we applied CytoTRACE to quantitatively assess stemness in malignant GC cells and further identified five robust HighStem signature genes through hdWGCNA combined with multiple machine learning algorithms. A predictive model with high classification accuracy was then constructed using a benchmark machine learning framework. This study not only provides a novel strategy for identifying high-stemness GC cells but also offers new insights into addressing the current clinical challenges in GC treatment.

In our study, five HighStem hub genes were identified, namely APMAP, MAPRE1, GLB1, TSPAN6, and CDKN2A. APMAP (Adipocyte Plasma Membrane Associated Protein) encodes a protein predominantly localized to the plasma membrane and endoplasmic reticulum of adipocytes32,33,34. Recent studies suggest that APMAP plays a potential regulatory role in promoting tumor progression, particularly through the induction of epithelial-mesenchymal transition (EMT) in certain tumor types35,36,37. Notably, EMT is closely linked to tumor stemness. During EMT, tumor cells lose epithelial traits and acquire mesenchymal features, including enhanced migratory capacity and stem cell-like properties38,39. Conversely, CSCs are often characterized by high expression of EMT-related signaling pathways, such as TGF-β (ref. 40), Notch (ref. 41), Wnt (ref. 42), and Hippo/YAP1 (ref. 43). Interestingly, these pathways were also enriched in the HighStem subpopulation in our cell–cell communication analysis. Furthermore, APMAP had the highest SHAP value among all features, indicating that it made the strongest positive contribution to predicting the HighStem phenotype and supporting its potential role as a key regulator of stemness in GC.

MAPRE1 (Microtubule-Associated Protein RP/EB Family Member 1) encodes a protein that regulates microtubule dynamics and was originally identified through its interaction with the adenomatous polyposis coli (APC) gene44. It plays a critical role in maintaining microtubule organization and chromosomal stability45,46. The aberrant expression of MAPRE1 has been implicated in the pathogenesis of various malignancies by disrupting these essential cellular processes47,48,49. In GC, a study by Ye Feng et al. reported CNV of the MAPRE1 gene, suggesting its potential genomic instability50. These features may contribute to the self-renewal capacity and undifferentiated phenotype observed in HighStem GC cells.

The GLB1 gene encodes β-galactosidase, a lysosomal enzyme responsible for the degradation of specific glycolipids and glycoproteins51. Although GLB1 is widely recognized as a canonical marker of cellular senescence52,53, its high expression in the HighStem cell population appears paradoxical. However, it is important to note that GLB1, beyond serving as a senescence marker, plays broader biological roles in intracellular metabolism, glycolipid catabolism, and lysosomal homeostasis54,55. Interestingly, a study by Monique Bernard et al. highlighted that both cancer stem cells and senescent cells exhibit enhanced stress resistance, suggesting that elevated GLB1 expression may reflect a quiescent or dormant cellular state56. In this context, GLB1 may help HighStem cells reduce proliferation and metabolic burden under adverse conditions, thereby facilitating long-term survival and potentially contributing to tumor recurrence or metastasis at later stages. Moreover, emerging evidence indicates that certain cells in EMT or high-stemness states can concurrently activate senescence-associated signaling pathways, such as p16 and p21 (refs. 57,58), implying that HighStem cells are not constantly in a proliferative state. Instead, they may undergo dynamic transitions between dormancy and reactivation in response to tumor microenvironmental cues. This regulated balance between quiescence and activation could represent a self-protective mechanism by which tumor cells maintain homeostasis and stemness potential.

Similarly, CDKN2A (Cyclin-Dependent Kinase Inhibitor 2A) is a well-established tumor suppressor gene involved in the regulation of the cell cycle, induction of cellular senescence, and inhibition of tumorigenesis59,60,61. While the CDKN2A gene can exert antiproliferative effects when properly transcribed and translated, numerous studies have shown that in many tumor types, it is frequently subject to genetic alterations such as mutations, deletions, or epigenetic modifications—particularly promoter hypermethylation—which result in reduced expression or complete functional loss62,63,64. This downregulation contributes to uncontrolled cell proliferation and tumor progression. In GC, CDKN2A promoter hypermethylation has been associated with the loss of p16^INK4a protein expression65,66, enabling GC cells to escape cell cycle control and acquire a highly proliferative phenotype. Moreover, copy number loss of CDKN2A has been identified as a potential biomarker for predicting hematogenous metastasis in GC patients67. These findings align, to a certain extent, with the biological features of high-stemness GC cells, which are characterized by enhanced proliferative capacity and metastatic potential. It is also important to note that, although our study observed elevated CDKN2A mRNA levels in HighStem cells, this may not directly reflect functional protein expression due to possible post-transcriptional regulatory mechanisms such as RNA methylation, miRNA-mediated repression, or impaired translation. Therefore, when interpreting the role of CDKN2A in the context of tumor stemness, both transcriptomic and proteomic levels should be considered.

TSPAN6 (tetraspanin 6) is a member of the transmembrane 4 superfamily, also known as the tetraspanin family. Current studies have reported conflicting roles of TSPAN6 across different cancer types68,69. For instance, in colorectal cancer, TSPAN6 is considered to function as a tumor suppressor, and its downregulation has been associated with tumor progression70. In contrast, in glioblastoma, high TSPAN6 expression has been linked to malignant progression and poor patient prognosis, suggesting a context-dependent role in tumor biology71. Although TSPAN6 has been relatively understudied in GC, other members of the tetraspanin family, such as TSPAN8 (ref. 72), TSPAN1 (ref. 73), CD151 (ref. 74), and TSPAN4 (ref. 75), have been shown to be highly expressed in GC tissues and are closely associated with enhanced proliferation, migration, and invasiveness of GC cells. Given the functional similarity within the tetraspanin family, it is plausible that TSPAN6 may play a comparable role in GC. However, further experimental validation is required to clarify its function and clinical relevance in GC.

The JAK-STAT3 signaling pathway has long been recognized as a central mediator in cancer stemness, tumor cell proliferation, and anti-apoptosis processes76,77,78. Our data indicate that APMAP, TSPAN6, MAPRE1, and GLB1 are positively correlated with the expression levels of JAK1 and STAT3, suggesting that these genes may promote GC stemness by modulating the JAK-STAT3 axis. Notably, although CDKN2A did not show a direct correlation with either JAK1 or STAT3, its known role in cell cycle regulation may contribute indirectly to the modulation of cell differentiation and proliferation. Thus, these core genes likely interact in a complex molecular network to regulate tumor stemness and promote cancer progression. Further experimental validation demonstrated that knockdown of these core genes resulted in a significant reduction in JAK1 and STAT3 protein expression, underscoring their role as upstream regulators in the JAK-STAT3 pathway. This finding not only enhances our understanding of the involvement of these genes in maintaining tumor stemness but also suggests their potential as therapeutic targets, especially in strategies targeting cancer stemness and overcoming therapeutic resistance. In terms of chemoresistance, knockdown of these genes notably increased the sensitivity of GC cells to common chemotherapy agents such as 5-FU and cisplatin. This suggests that these genes play a crucial role not only in maintaining tumor stemness but also in mediating drug resistance mechanisms. Specifically, the silencing of TSPAN6 and MAPRE1 led to a pronounced inhibition of cell proliferation under chemotherapy stress, further supporting their potential as targets for overcoming chemoresistance.

In summary, the five identified HighStem hub genes may contribute to maintaining stemness in gastric cancer through diverse yet complementary mechanisms. These findings provide a foundation for future studies aimed at validating their roles and exploring their potential as biomarkers or therapeutic targets in GC.

Methods

Data collection and processing

scRNA-seq datasets (GSE183904, ref. 7; GSE206785, ref. 79) and stRNA-seq data (GSE251950, ref. 80) were obtained from GEO. Bulk RNA-seq data were retrieved from TCGA, and GWAS summary statistics for scPagwas analysis were downloaded from the IEU OpenGWAS database. Dataset details are provided in Supplementary Table 10. Cells with >20% mitochondrial content or <200 detected genes were excluded. Genes expressed in ≥3 cells and within 200–7000 counts were retained, yielding 269,213 high-quality cells from 88 samples for analysis. Data were processed using Seurat, including normalization, PCA, UMAP, and clustering (resolution = 0.8), with batch correction via Harmony. Cell types were annotated using known markers. For stRNA-seq, SCTransform normalization and unsupervised clustering defined spatial domains, supported by H&E staining and marker gene expression. Spatial patterns were visualized with “SpatialDimPlot” and “SpatialFeaturePlot”.
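The quality-control thresholds above can be illustrated with a minimal sketch. The study used Seurat in R; the plain-Python version below is only an illustration, not the authors' code. It assumes that the 200–7000 window applies to the number of genes detected per cell, and the names `CellQC`, `passes_qc`, and `genes_to_keep`, as well as the toy barcodes and gene names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str      # hypothetical cell identifier
    n_genes: int      # number of genes detected in the cell
    pct_mito: float   # percentage of counts from mitochondrial genes


def passes_qc(cell, min_genes=200, max_genes=7000, max_pct_mito=20.0):
    """Keep cells inside the gene-count window and at or below the mitochondrial cutoff.

    Assumption: the 200-7000 window is interpreted as genes detected per cell.
    """
    return min_genes <= cell.n_genes <= max_genes and cell.pct_mito <= max_pct_mito


def genes_to_keep(cells_per_gene, min_cells=3):
    """Keep genes detected in at least `min_cells` cells."""
    return sorted(g for g, n in cells_per_gene.items() if n >= min_cells)


# Toy input for illustration only; these are not study data.
cells = [
    CellQC("cell_1", 1500, 5.2),    # passes all thresholds
    CellQC("cell_2", 150, 3.0),     # fewer than 200 detected genes
    CellQC("cell_3", 2500, 35.0),   # more than 20% mitochondrial content
    CellQC("cell_4", 9000, 4.0),    # above the assumed upper bound
]
kept_cells = [c.barcode for c in cells if passes_qc(c)]
kept_genes = genes_to_keep({"geneA": 120, "geneB": 2, "geneC": 3})
```

In this toy input only `cell_1` survives filtering, and `geneB` is dropped because it is detected in fewer than three cells.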

Inference of copy number variations

Copy number variations (CNVs) were inferred from scRNA-seq data using the inferCNV R package. To reduce technical variability, normalization procedures were applied, and malignant cells were analyzed relative to normal reference cells to detect regions with abnormal expression indicative of genomic instability81,82. A CNV score was then calculated to quantify the extent of deviation in each cell from the reference baseline. Malignant cells were extracted according to CNV-based classification, yielding 13,483 tumor cells for downstream analysis.
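The text does not give the exact formula for the CNV score. One common choice, assumed here purely for illustration, is the mean squared deviation of a cell's per-gene inferred values from the average profile of the normal reference cells. The function names and the toy values below are hypothetical and are not inferCNV output.

```python
def reference_baseline(reference_cells):
    """Average per-gene value across the normal reference cells."""
    n = len(reference_cells)
    return [sum(column) / n for column in zip(*reference_cells)]


def cnv_score(cell_profile, baseline):
    """Mean squared deviation of one cell from the reference baseline.

    Assumption: this formula is one plausible scoring scheme, not
    necessarily the one used in the study.
    """
    if len(cell_profile) != len(baseline):
        raise ValueError("cell profile and baseline must cover the same genes")
    squared = [(x - b) ** 2 for x, b in zip(cell_profile, baseline)]
    return sum(squared) / len(squared)


# Toy values for illustration only.
normal_cells = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
baseline = reference_baseline(normal_cells)
score_tumor = cnv_score([1.5, 1.0, 0.5], baseline)   # gain and loss present
score_normal = cnv_score([1.0, 1.0, 1.0], baseline)  # matches the baseline
```

A cell that matches the reference scores zero, while gains and losses both raise the score because deviations are squared.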

Stemness scoring of malignant gastric cancer cells

To evaluate cellular stemness, we applied the CytoTRACE algorithm to malignant GC cells identified through CNV analysis. CytoTRACE is a computational framework that infers the differentiation state of individual cells based on transcriptional diversity, under the premise that less differentiated cells express a broader array of genes. Unlike traditional stemness assessments, CytoTRACE does not depend on predefined gene sets or prior biological assumptions, making it broadly applicable across diverse cell types and tissues25,83. Based on the quartile distribution of CytoTRACE scores, cells were stratified into three groups: high stemness (top 25%), dynamic transition stemness (25–75%), and low stemness (bottom 25%)84.
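The quartile-based stratification can be sketched as follows. This is a minimal illustration that assumes CytoTRACE scores are already available as a list of numbers; the label `Transition` and the function name `stratify_by_quartile` are hypothetical, and the handling of scores that fall exactly on a quartile boundary is an assumption.

```python
import statistics


def stratify_by_quartile(scores):
    """Label each score as HighStem (top quartile), LowStem (bottom quartile),
    or Transition (middle 50%)."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    labels = []
    for s in scores:
        if s >= q3:
            labels.append("HighStem")
        elif s <= q1:
            labels.append("LowStem")
        else:
            labels.append("Transition")
    return labels


# Toy scores for illustration only.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_labels = stratify_by_quartile(toy_scores)
```

With these eight toy scores, the two lowest are labeled LowStem, the two highest HighStem, and the remaining four Transition.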

scPagwas analysis

To explore the genetic underpinnings of cellular stemness and tumor progression, we employed scPagwas, an integrative computational framework implemented in the “scPagwas” R package. This method enables the integration of scRNA-seq data with GWAS summary statistics to uncover trait-associated genetic variants that potentially influence cell fate decisions at single-cell resolution. In this study, we focused on mapping genes associated with stemness scores to GWAS summary data derived from large-scale population studies85. By linking cell-type-specific gene expression patterns to genomic loci associated with cancer-related traits, scPagwas allowed us to identify candidate genetic variants that may drive intercellular heterogeneity in stemness and contribute to tumor evolution.

High-dimensional WGCNA (hdWGCNA) analysis

As CytoTRACE provides a global stemness score for individual cells without pinpointing gene-level expression patterns across subpopulations, we employed high-dimensional weighted gene co-expression network analysis (hdWGCNA) to further elucidate the transcriptional characteristics of malignant GC cells. A weighted co-expression network was constructed by calculating pairwise gene expression correlations, and genes were clustered into distinct co-expression modules86,87. To identify modules associated with tumor progression, we performed module–trait relationship analysis, focusing on stemness and metastatic phenotypes. Modules showing strong correlations with high-stemness or metastatic features were considered functionally relevant. Within these key modules, hub genes were defined based on high intra-module connectivity, representing potential core regulators that may drive tumor heterogeneity and malignant progression in GC.
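The idea of a weighted co-expression network and connectivity-based hub genes can be sketched in a few lines. This is a simplified illustration and not the hdWGCNA implementation: it assumes an unsigned network with adjacency |r|^β, a soft-thresholding power β = 6 chosen arbitrarily, and a module that is already defined. Gene names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def intramodular_connectivity(expr, module, beta=6):
    """Sum of soft-thresholded adjacencies |r|**beta to the other module genes.

    Assumption: unsigned network and beta=6 are illustrative choices.
    """
    connectivity = {}
    for g in module:
        connectivity[g] = sum(
            abs(pearson(expr[g], expr[h])) ** beta for h in module if h != g
        )
    return connectivity


# Toy expression vectors (one value per pseudo-sample); not study data.
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [1, 2, 3, 5, 4],
    "geneC": [1, 3, 2, 5, 4],
}
conn = intramodular_connectivity(expr, ["geneA", "geneB", "geneC"])
hub = max(conn, key=conn.get)
```

Here `geneB` correlates at 0.9 with both other genes and therefore has the highest connectivity, so it would be the hub of this toy module.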

Cell–cell interaction analysis

To investigate intercellular communication within the tumor microenvironment, we utilized the CellChat R package, which infers ligand–receptor interactions based on scRNA-seq data88. Communication networks were constructed to delineate signaling exchanges among annotated cell populations. Visualization of interaction strength and frequency between specific cell types was performed using the netVisual_circle function, which provides a circular plot representing outgoing and incoming signaling patterns. To further dissect individual signaling pathways, the netVisual_bubble function was applied, generating bubble plots that highlight key ligand–receptor pairs and their associated signaling axes.

Screening of HighStem (top 25% CytoTRACE-scored cells) signature genes

To identify key signature genes of the HighStem subpopulation, we employed an integrated machine learning approach combining six algorithms: random forest, LASSO, Boruta, decision tree (DT), Adaptive Best Subset Selection (ABESS), and Gradient Boosting Machine (GBM). These methods were chosen for their complementary strengths in feature selection and bias reduction.

Random forest89 and Boruta90 (a random forest-based wrapper algorithm) were used to rank gene importance. LASSO regression91 applied L1 regularization to eliminate redundant features, while decision tree92 analysis provided interpretable hierarchical classification. ABESS93, a recently developed algorithm that performs optimal subset selection with theoretical guarantees, was introduced to further refine the gene set while avoiding overfitting or selection bias. Meanwhile, the GBM algorithm94, an ensemble learning method that builds additive models in a forward stage-wise fashion using decision trees, was applied to capture complex nonlinear relationships between genes and stemness phenotypes. Genes identified by all six methods were considered hub HighStem markers, and their intersection was visualized using a Venn diagram.
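The consensus step, in which only genes selected by every method are retained, amounts to a set intersection. The sketch below assumes that each method has already produced its own list of selected genes; the function name `consensus_genes` and the placeholder gene lists are hypothetical and do not reproduce the study's results.

```python
def consensus_genes(selections):
    """Return the genes selected by every feature-selection method."""
    gene_sets = [set(genes) for genes in selections.values()]
    return sorted(set.intersection(*gene_sets))


# Placeholder selections for illustration only.
toy_selections = {
    "random_forest": ["G1", "G2", "G3", "G4"],
    "LASSO": ["G1", "G2", "G5"],
    "Boruta": ["G1", "G2", "G3"],
    "decision_tree": ["G2", "G1", "G6"],
    "ABESS": ["G1", "G2", "G4"],
    "GBM": ["G1", "G2", "G3", "G5"],
}
hub_markers = consensus_genes(toy_selections)
```

Only `G1` and `G2` appear in all six toy lists, so they form the consensus set.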

Machine learning benchmark models for HighStem features

To identify the most effective predictive model for HighStem features at the single-cell level, we benchmarked nine machine learning algorithms using the “mlr3” R package. The models included k-nearest neighbor (KNN), linear discriminant analysis (LDA), naive Bayes (NB), random forest (Ranger), recursive partitioning and regression trees (RPART), support vector machine (SVM), and extreme gradient boosting (XGBoost). Cells classified as HighStem were used as positive samples, and LowStem cells served as controls. The dataset was randomly split into training (80%) and test (20%) sets. Hyperparameter tuning was conducted using five-fold internal cross-validation (CV), and model generalization was evaluated via ten-fold external CV. The model with the highest average area under the curve (AUC) was selected as the optimal framework for HighStem signature prediction.
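The data-splitting and evaluation logic can be sketched without any modeling library. The study used mlr3 in R; the version below only illustrates the mechanics of an 80/20 split, k-fold index generation, and AUC as a rank statistic. All function names are hypothetical, and the random seed is an arbitrary choice.

```python
import random


def train_test_split(indices, test_fraction=0.2, seed=0):
    """Shuffle indices and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(indices, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 100 cells, 80/20 split, five inner folds on the training set.
train_idx, test_idx = train_test_split(range(100))
inner_folds = list(k_folds(train_idx, 5))
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In a full benchmark, each candidate model would be tuned on the inner folds and scored by its average AUC across the outer folds; the toy AUC here is 0.75 because three of the four positive-negative pairs are ranked correctly.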

SHAP analysis for feature interpretation

To interpret the contributions of individual genes to the HighStem prediction model, we applied SHAP (SHapley Additive exPlanations) analysis95. SHAP is a model-agnostic method that quantifies the impact of each feature on the model’s output, based on cooperative game theory. This approach enables both global and local interpretation of feature importance. We computed SHAP values for all samples to evaluate the contribution of each gene to model predictions. The mean absolute SHAP value (mean_phi) for each gene was used as an indicator of its overall importance, with higher values reflecting greater average influence on classification outcomes. Genes with the highest mean SHAP values were considered the most influential predictors and were selected for further biological interpretation.
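The ranking by mean absolute SHAP value reduces to a column-wise average once per-cell SHAP values are available. The sketch below assumes a matrix with one row per cell and one column per gene; the function name, gene names, and values are hypothetical, and computing the SHAP values themselves is outside its scope.

```python
def mean_abs_shap(shap_values, feature_names):
    """Mean absolute SHAP value per feature across all samples (rows)."""
    n_samples = len(shap_values)
    return {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }


# Toy SHAP matrix for illustration only: three cells, two genes.
toy_shap = [
    [0.4, -0.1],
    [-0.2, 0.1],
    [0.3, 0.0],
]
mean_phi = mean_abs_shap(toy_shap, ["geneA", "geneB"])
ranked_genes = sorted(mean_phi, key=mean_phi.get, reverse=True)
```

Taking absolute values before averaging ensures that a gene pushing predictions strongly in both directions is still ranked as influential; in this toy matrix `geneA` ranks first.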

Cell culture

The SGC7901 and HGC-27 GC cell lines, originally from the American Type Culture Collection (Manassas, VA), were maintained in our laboratory and grown in RPMI 1640 supplemented with 2 mM glutamine, 10 mM 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES), pH 7.4, and 10% fetal bovine serum at 37 °C in a humidified incubator with 5% CO2.

Transfection with siRNA

Expression of each core gene was silenced by siRNA transfection using Lipofectamine 2000 (Invitrogen). Cells were seeded in 6-well plates at 50–70% confluency and transfected the following day. For each well, 5 μL of siRNA targeting the core gene (final concentration 50 nM) was diluted in 125 μL of Opti-MEM (Gibco), and separately, 5 μL of Lipofectamine 2000 was diluted in another 125 μL of Opti-MEM. After a 5-min incubation at room temperature, the two solutions were combined and incubated for 20 min to form siRNA-lipid complexes. The complexes were then added to the cells in antibiotic-free complete medium. After 4–6 h, the medium was replaced with fresh complete medium. Cells were harvested 24–48 h post-transfection for RNA and protein extraction to evaluate gene silencing efficiency. The siRNA sequences are listed in Supplementary Table 11.

Western blot

Cells were transiently transfected or treated with 5-fluorouracil (5-FU) and cisplatin, then lysed in RIPA buffer. The lysates were subjected to immunoprecipitation with anti-Flag antibody. The lysates and immunoprecipitates were separated by 12.5% SDS-PAGE, transferred onto PVDF membranes, and probed with the indicated antibodies. The following antibodies were used: JAK1 (1:1000, Santa Cruz) and STAT3 (1:1000, Santa Cruz). HRP-conjugated secondary antibodies were obtained from GE Healthcare Life Sciences, and the light emission was quantified with a Lumino image analyzer LAS-1000 (FUJI, Japan). Signal quantification was performed using ImageJ software.

Real-time PCR

Total RNA was extracted from cells using TRIzol reagent (Life Technologies). TaqMan probes (Applied Biosystems) were used for the detection of miRs, as described by the manufacturer, with β-actin as the endogenous control. For mRNA-level analysis, cDNA was generated using SuperScript II reverse transcriptase and poly-dT primers (Invitrogen). Real-time PCR was performed using SYBR Green Master Mix (Invitrogen) and the ABI 7900HT Fast Real-Time PCR System (Applied Biosystems). Primers are listed in Supplementary Table 12. S18 and β-actin were used as endogenous controls.

Statistical analysis

All data processing, statistical analyses, and visualizations were performed using R software (version 4.1.3). Group comparisons for continuous variables were conducted using either the Wilcoxon rank-sum test or Student’s t test, depending on the distribution. Categorical variables were compared using the chi-squared test or Fisher’s exact test, as appropriate. Multiple testing correction was applied using the False Discovery Rate (FDR) method. Pearson correlation analysis was used to evaluate associations between continuous variables. All statistical tests were two-sided, and a p-value < 0.05 was considered statistically significant.
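The text specifies FDR correction without naming the procedure. Assuming the widely used Benjamini-Hochberg method (the default meaning of "BH" or "fdr" in R's p.adjust), the adjustment can be sketched as follows; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the study's "FDR method" is taken to mean Benjamini-Hochberg.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_adj = benjamini_hochberg(toy_p)
significant = [p_adj < 0.05 for p_adj in toy_adj]
```

In this toy example three raw p-values fall below 0.05, but only the smallest remains significant after adjustment (adjusted value 0.04).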