Introduction

Cholangiocarcinoma (CCA) is a rare and aggressive cancer that arises in the bile ducts, either inside the liver (intrahepatic CCA, or ICCA) or outside of it (extrahepatic CCA, or ECCA). Several risk factors contribute to its development, such as primary sclerosing cholangitis, chronic liver conditions, bile duct cysts, liver fluke infections, and hepatitis B or C infections1,2.

One of the major challenges in treating CCA is that it is frequently diagnosed at an advanced stage, limiting treatment options and contributing to poor outcomes3. Despite advances in cancer therapies, progress in treating CCA has been slow, and the prognosis remains grim. The 5-year survival rate for unresectable ICCA is less than 5–10%, and even for those undergoing liver resection, only 20–35% survive beyond five years4. Given these dismal statistics, there is a pressing need for new therapies to combat this lethal disease.

Predicting clinical outcomes can significantly guide treatment strategies, and the use of biomarkers and prognostic models, particularly those leveraging machine learning, has gained attention in various cancers5. Machine learning, a subset of artificial intelligence, uses statistical methods, probabilistic models, and optimization strategies to enable computers to learn from historical data and uncover subtle patterns within large and/or complex datasets. This ability makes it especially valuable in the medical field, where analyzing intricate genomic and proteomic data is often critical. Consequently, machine learning plays a significant role in cancer diagnosis, treatment, and prognosis6,7.

Multiple prognostic models have been proposed to predict the survival of ICCA patients8,9,10. However, traditional approaches, such as the Cox proportional hazards model and the Kaplan-Meier (KM) estimator, often fall short in capturing the complex and nonlinear characteristics of medical data11. Additionally, most published conventional prognostic models have shown limited performance when subjected to external validation12. In contrast, machine learning methods, such as Random Survival Forests and Least Absolute Shrinkage and Selection Operator (LASSO), offer increased flexibility, allowing for the modeling of intricate interactions and hidden patterns with greater accuracy11,13,14. Still, prognostic models for ICCA that incorporate gene expression data and machine learning remain limited.

The tumor microenvironment (TME)—the surrounding non-cancerous cells, extracellular matrix, signaling molecules, and blood vessels—plays a critical role in tumor behavior15,16, and there is compelling evidence linking TME to ICCA progression17,18,19. The TME has demonstrated substantial prognostic significance, with immune infiltration, stromal activation, and immunosuppressive mechanisms emerging as critical factors in predicting patient outcomes20,21. Despite this, there remains a lack of prognostic models specifically focused on TME-related gene expression for ICCA.

In this study, we identified differentially expressed genes (DEGs) by comparing tumor tissues with adjacent non-tumor (NT) tissues in an ICCA dataset. Using KM survival and Cox regression analyses, we developed a prognostic model based on the expression of four DEGs. We then confirmed the expression of these DEGs by multiplex fluorescent immunohistochemistry (mfIHC). Finally, we validated this model with two additional ICCA datasets and further explored its connection to TME characteristics.

Materials and methods

Transcriptomic analysis dataset collection and pre-processing

All transcriptomic datasets were sourced from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). The microarray dataset E-MTAB-6389 (ref. 22), which includes 78 ICCA and 31 NT samples, was processed using the Robust Multi-array Average (RMA) algorithm in the R package “oligo” (v1.64.1) for background adjustment and quantile normalization. Probe annotation was conducted using the GPL17585 platform file from GEO, and the resulting expression data were standardized via z-score normalization. For the CCA microarray dataset GSE89749 (ref. 2), containing 118 tumor samples, we applied the “lumi” R package to adjust for background noise, followed by log2(count + 1) transformation and z-score normalization.

The RNA-seq dataset GSE107943 (ref. 23), which includes 30 ICCA and 27 NT samples, was pre-processed by converting Reads Per Kilobase per Million mapped reads (RPKM) to Transcripts Per Million (TPM), followed by z-score normalization. In all three cohorts, patients with an overall survival of less than 30 days were excluded. A summary of the final samples used is presented in Table S1.
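The RPKM-to-TPM conversion and z-score normalization described above can be illustrated with a minimal sketch. It assumes that TPM is obtained by rescaling each sample's RPKM values to sum to one million and that z-scores are computed per gene across samples with the sample standard deviation; the source does not state either detail, and the toy values and function names are hypothetical.

```python
from statistics import mean, stdev


def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to 1e6 (TPM)."""
    total = sum(rpkm)
    return [value / total * 1e6 for value in rpkm]


def zscore(values):
    """Center values on their mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - mu) / sd for v in values]


# Hypothetical toy matrix: rows are samples, columns are genes.
rpkm_by_sample = [
    [10.0, 30.0, 60.0],
    [5.0, 5.0, 90.0],
    [20.0, 40.0, 40.0],
]

tpm_by_sample = [rpkm_to_tpm(sample) for sample in rpkm_by_sample]

# Assumption: each gene (column) is standardized across samples.
tpm_by_gene = [list(column) for column in zip(*tpm_by_sample)]
z_by_gene = [zscore(gene_values) for gene_values in tpm_by_gene]
```

Because TPM only rescales within a sample, the relative ordering of genes inside each sample is unchanged; the z-score step is what puts genes on a comparable scale across cohorts.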

DEG analysis

To identify DEGs between ICCA and NT samples in the E-MTAB-6389 dataset, we utilized the R package “limma” (v3.56.2), applying thresholds of |log2 fold change| > 1 and false discovery rate (FDR) < 0.05.
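The two DEG thresholds can be sketched as a simple filter. This example assumes the FDR is a Benjamini-Hochberg adjusted P-value, which the source does not specify; the gene names, fold changes, and P-values are hypothetical, and the moderated statistics that limma computes are not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, lfc_cut=1.0, fdr_cut=0.05):
    """Keep genes with |log2 fold change| > lfc_cut and FDR < fdr_cut."""
    fdr = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, lfc, q in zip(genes, log2fc, fdr)
        if abs(lfc) > lfc_cut and q < fdr_cut
    ]


# Hypothetical toy input.
genes = ["geneA", "geneB", "geneC", "geneD"]
log2fc = [2.1, -1.6, 0.4, 1.3]
pvalues = [0.001, 0.010, 0.020, 0.300]

degs = select_degs(genes, log2fc, pvalues)
```

In this toy input, geneC fails the fold-change threshold and geneD fails the FDR threshold, so only geneA and geneB are retained.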

Multivariate Cox analysis

We conducted multivariate Cox regression analysis using the R package “finalfit” (v1.0.8), including both forward and backward selection, and visualized the results with the “forestplot” R package (v3.1.3).

Prognostic model construction

DEGs identified from the E-MTAB-6389 dataset underwent KM survival analysis and univariate Cox regression using the ‘survdiff’ function within the R package “survival” (v3.7-0). Genes with a P-value < 0.1 in both analyses were further evaluated using LASSO Cox regression with the “glmnet” R package (v4.1-8). The identified genes are detailed in Table S2. Subsequently, we employed stepwise Cox regression to optimize the model by incorporating the expression of each gene24,25. Genes that significantly enhanced model accuracy were selected to build the final gene set-based prognostic signature for ICCA (GPSICCA) model. The risk score was computed by multiplying the expression level of each marker gene by its regression coefficient from stepwise Cox regression and summing the products.

To validate the model’s predictive capability, we tested it on the GSE89749 and GSE107943 cohorts. The optimal cutoff for high-risk and low-risk ICCA patients was determined using the “surv_cutpoint” function from the R package “survminer” (v0.4.9). KM survival curves were visualized using the “survival” and “survminer” R packages, while heatmaps and bar plots were generated using the “pheatmap” (v1.0.12) and “graphics” (v4.3.0) R packages. The receiver operating characteristic (ROC) analysis was performed using the “timeROC” (v0.4) R package.
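The idea behind choosing a risk-score cutoff can be sketched as a scan over candidate cutpoints that keeps the one giving the largest two-group log-rank statistic. This is a simplified stand-in, not the procedure implemented in “surv_cutpoint”; the scores, follow-up times, event indicators, and the minimum group size of two are all hypothetical.

```python
def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, in_high) if ti >= t and g)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(
            1 for ti, e, g in zip(times, events, in_high) if ti == t and e and g
        )
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def best_cutpoint(scores, times, events, min_group=2):
    """Return (cutoff, statistic) maximizing the log-rank statistic."""
    best_cut, best_stat = None, -1.0
    for cut in sorted(set(scores)):
        in_high = [s > cut for s in scores]
        n_high = sum(in_high)
        # Skip splits that leave either group too small.
        if n_high < min_group or len(scores) - n_high < min_group:
            continue
        stat = logrank_chi2(times, events, in_high)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


# Hypothetical toy cohort: risk score, follow-up (months), event (1 = death).
scores = [0.2, 0.5, 0.9, 1.4, 2.1, 2.8, 3.3, 3.9]
times = [60, 55, 48, 50, 12, 9, 15, 6]
events = [0, 0, 1, 0, 1, 1, 1, 1]

cutoff, statistic = best_cutpoint(scores, times, events)
```

Because the cutoff is chosen to maximize separation, the P-value at the selected cutoff is optimistic unless it is corrected for the multiple candidate cutpoints, which is one reason validation in independent cohorts matters.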

TME feature analysis

To analyze TME features, we assessed 75, 111, and 30 tumor samples from the E-MTAB-6389, GSE89749, and GSE107943 datasets, respectively, for stromal and immune scores using the R packages “ESTIMATE” (v1.0.13) and “GSVA” (v1.50.0) (Table S3). The Pearson correlation between the risk score and stromal or immune scores was visualized using the R package “ggpubr” (v0.6.0). Survival analysis based on stromal and immune scores was carried out using “survival” and “survminer”.

For cell type delineation in the TME, we used the R package “xCell” (v1.1.0), with the results outlined in Table S4. Panimmune gene set and immunomodulatory gene analyses were performed using “GSVA,” and the results are presented in Table S5 and Table S6, respectively. For both “xCell” and “GSVA” analyses, the following thresholds were applied: overall survival (log-rank test, P < 0.05) and GPSICCA risk score (Pearson correlation test, |r| ≥ 0.40, P < 0.05). The findings were visualized using “forestplot”.
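The correlation threshold applied to the “xCell” and “GSVA” results can be illustrated with a short sketch that computes Pearson's r between the risk score and each feature and keeps features with |r| ≥ 0.40. The significance test (P < 0.05) and the log-rank filter are omitted here for brevity, and the feature names and values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_features(risk_scores, features, r_cut=0.40):
    """Return {feature name: r} for features with |r| >= r_cut."""
    selected = {}
    for name, values in features.items():
        r = pearson_r(risk_scores, values)
        if abs(r) >= r_cut:
            selected[name] = r
    return selected


# Hypothetical toy data: one risk score and three enrichment scores per patient.
risk_scores = [1, 2, 3, 4, 5, 6]
features = {
    "cell_type_A": [2, 4, 5, 4, 5, 7],  # positively correlated
    "cell_type_B": [3, 1, 4, 1, 5, 2],  # weakly correlated
    "cell_type_C": [6, 5, 4, 3, 2, 1],  # negatively correlated
}

selected = correlated_features(risk_scores, features)
```

Using the absolute value of r keeps both positively and negatively associated features, which matches the way both directions are reported in the forest plots.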

mfIHC

Samples from five patients (2 male and 3 female, aged 49 to 71) who were diagnosed with ICCA and underwent curative-intent resection at our hospital between 2019 and 2023 were used for mfIHC analysis. ICCA paraffin sections (5 μm thick) were deparaffinized with xylene and rehydrated with gradient ethanol solutions. The samples were then rinsed with ddH2O, treated with 3% hydrogen peroxide to block endogenous peroxidase activity, and subjected to antigen retrieval. Blocking was performed using 5% goat serum, after which the sections were incubated with a primary antibody at 37 °C for 2 h, washed with PBST (PBS + 0.1% Tween 20), and incubated with HRP-conjugated secondary antibody solution for 30 min. Next, the sections were washed with PBST again and treated with one fluorophore (RS0039, Immunoway) for 5 min, followed by washing with PBST. Subsequently, the sections were subjected to antigen retrieval again, and the IHC steps were repeated with another primary antibody. This cycle was repeated for every primary antibody. Finally, the sections were washed with PBST and mounted with a 4′,6-diamidino-2-phenylindole (DAPI, stains nuclei)-containing mounting medium (P0131, Beyotime). The sections were imaged with a Zeiss LSM 900 confocal microscope. Information on antibodies is presented in Table 1.

Table 1 Antibody information for mfIHC.

Results

Establishment of the GPSICCA prognostic model

To create a prognostic model for ICCA, we first analyzed the E-MTAB-6389 dataset to identify DEGs between ICCA tumor and NT samples. These genes were then assessed using survival and univariate Cox regression analyses, and the resulting 86 genes were further subjected to LASSO regression analysis. Twelve of these genes were selected for subsequent stepwise multivariate Cox regression analysis (Fig. 1A and B, Table S2). From this analysis, 7 genes were identified, among which 4 key genes with hazard ratios (HR) greater than 1 were chosen to construct the optimized GPSICCA model (Fig. 1C, Table S2). The risk score was calculated using the following formula: risk score = (1.1771 × expression of COL4A1) + (0.6895 × expression of GULP1) + (0.7011 × expression of ITGA6) + (0.6168 × expression of STC1). We subsequently validated the presence of these genes in ICCA samples using mfIHC. This analysis revealed strong expression of the genes in ICCA samples (Fig. 2). Consistently, CXCL17, which promotes the progression of liver cancer and regulates immune infiltration26,27, was also highly expressed in ICCA samples (Fig. 2).
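The risk score formula above is a weighted sum and can be written as a short sketch. The coefficients are those reported in the text; the patient's expression values, the cutoff of 0.0, and the rule that a score above the cutoff is high risk are hypothetical placeholders, since the cohort-specific cutoffs are not given here.

```python
# Regression coefficients reported for the four signature genes.
COEFFICIENTS = {
    "COL4A1": 1.1771,
    "GULP1": 0.6895,
    "ITGA6": 0.7011,
    "STC1": 0.6168,
}


def gpsicca_risk_score(expression):
    """Weighted sum of normalized expression values for the four genes."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def assign_group(score, cutoff):
    """Assumption: scores strictly above the cutoff are labeled high risk."""
    return "high" if score > cutoff else "low"


# Hypothetical z-scored expression values for one patient.
patient = {"COL4A1": 0.5, "GULP1": -0.2, "ITGA6": 1.0, "STC1": 0.0}

score = gpsicca_risk_score(patient)
group = assign_group(score, cutoff=0.0)  # hypothetical cutoff
```

Because all four coefficients are positive, higher expression of any signature gene raises the score, consistent with each gene having a hazard ratio greater than 1.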

Fig. 1

Development of the GPSICCA prognostic model. (A) LASSO regression plot based on selected DEGs. (B) Cross-validation plot for the LASSO regression analysis. (C) Forest plot displaying the four genes used to create the GPSICCA model.

Fig. 2

mfIHC of signature genes for constructing the GPSICCA model. mfIHC data showing the expression of COL4A1, GULP1, ITGA6, STC1, CXCL17 in ICCA samples. DAPI stains nuclei.

Performance evaluation of the GPSICCA model

To validate the GPSICCA model’s predictive accuracy regarding clinical outcomes in ICCA patients, survival analyses based on risk scores were conducted. The results showed that patients with a high-risk score (N = 55) in the E-MTAB-6389 cohort had significantly worse overall survival compared to those with a low-risk score (N = 116) (HR = 3.594; P < 0.0001; Fig. 3A). Similarly, in the GSE89749 dataset, patients in the high-risk group (N = 229) had a poorer prognosis than those in the low-risk group (N = 64) (HR = 5.59; P = 0.001; Fig. 3B). In the GSE107943 cohort, patients were also divided into high-risk (N = 21) and low-risk (N = 66) groups, with high-risk patients demonstrating worse survival outcomes (HR = 4.669; P = 0.003; Fig. 3C). The relationship between the expression levels of the 4 prognostic genes, GPSICCA risk score, and clinical outcomes in these cohorts is depicted in Figs. 3D-F. We further evaluated the prognostic performance of the GPSICCA model using ROC analysis. The results demonstrated that the model performed satisfactorily in the E-MTAB-6389 and GSE89749 datasets, particularly in predicting 3-year and 5-year survival. However, its performance in the GSE107943 dataset was less robust, possibly due to the limited number of patients (Fig. 3G-I). These findings suggest that the GPSICCA model effectively predicts survival in ICCA patients.

Fig. 3

Evaluation of the GPSICCA model’s predictive power for ICCA patient survival. (A-C) KM survival curves comparing overall survival in ICCA patients with high and low GPSICCA risk scores in three cohorts: E-MTAB-6389 (A), GSE89749 (B), and GSE107943 (C). (D-F) Heatmaps and bar charts showing the relationship between the expression levels of the four prognostic genes and GPSICCA risk scores, along with survival status in the E-MTAB-6389 (D), GSE89749 (E), and GSE107943 cohorts (F). (G-I) ROC curves indicating the prognostic performance of the GPSICCA model in the E-MTAB-6389 (G), GSE89749 (H), and GSE107943 cohorts (I).

GPSICCA risk score correlates with immune or stromal score

Since COL4A1, ITGA6, and STC1 have been linked to TME in other contexts28,29,30, and TME is closely associated with CCA progression, we explored the connection between the GPSICCA risk score and immune/stromal scores, aiming to delineate the mechanism by which our model predicts the survival of ICCA patients. Using the ESTIMATE package, we categorized TME into immune and stromal subcomponents, calculating their respective scores using GSVA. Pearson correlation analysis of 216 patients across 3 cohorts revealed a weak but significant positive correlation between GPSICCA risk score and immune score (cor = 0.22, P = 0.0012, Fig. 4A) and a strong positive correlation with stromal score (cor = 0.58, P < 2.2e-16, Fig. 4B). Furthermore, both immune and stromal scores were significantly associated with overall survival in ICCA patients (HR = 2.071; P = 0.004 for immune score, Fig. 4C; HR = 1.943; P = 0.004 for stromal score, Fig. 4D). These results suggest that a high GPSICCA score is indicative of a tumor-promoting TME in ICCA.

Fig. 4

Association between GPSICCA risk scores, immune and stromal scores, and ICCA patient survival. (A, B) Scatter plots illustrating significant positive correlations between GPSICCA risk scores and immune scores (A) as well as stromal scores (B) in ICCA patients. (C, D) KM survival curves showing the overall survival of ICCA patients with high and low immune scores (C) and stromal scores (D).

Association of GPSICCA risk score with TME cell types, panimmune gene sets, and immunomodulatory (IM) genes in ICCA samples

To further understand the relationship between the GPSICCA score and specific TME cell types, we utilized the “xCell” algorithm. The analysis identified 8 cell types significantly associated with the GPSICCA score. Among these, Th2 cells, mesangial cells, astrocytes, smooth muscle cells, and melanocytes exhibited a positive correlation, while natural killer T (NKT) cells, class-switched memory B cells, and osteoblasts were negatively associated with GPSICCA scores (Fig. 5A).

Fig. 5

Correlation of GPSICCA risk scores with TME cell types, panimmune gene sets, and IM genes in ICCA samples. (A-C) Forest plots depicting the correlation of GPSICCA scores with specific TME cell types (A), panimmune gene sets (B), and IM genes (C).

Next, we examined panimmune gene sets31 and IM genes32 potentially linked to the GPSICCA score. Our findings revealed that 17 gene sets were significantly correlated with the GPSICCA score. Of the 15 positively correlated gene sets, 4 were related to interferon (IFN) responses, while natural killer (NK) and Th17 cell-related gene sets were negatively correlated with the GPSICCA score (Fig. 5B). Notably, gene sets related to NK cells showed differential correlations, as CD56dim NK cells had a positive association. Additionally, the expression levels of IL10, ENTPD1, ITGB2, and TNFRSF14 were positively correlated with the GPSICCA score, while TGFB1, PDCD1, and EDNRB exhibited negative correlations (Fig. 5C). Collectively, these results provide insights into the underlying mechanisms and potential targets associated with the GPSICCA score.

Discussion

In this study, we developed a prognostic model for predicting the survival of ICCA patients by using a publicly available ICCA dataset. The model was constructed using the expression profiles of four DEGs. Its predictive capability was further validated in two additional ICCA cohorts, demonstrating its robustness in different datasets.

The GPSICCA prognostic model, developed using machine learning, may mark a significant improvement in predicting outcomes for ICCA patients by integrating novel gene expression signatures and TME characteristics, features often overlooked by conventional models. By incorporating four key genes (COL4A1, GULP1, ITGA6, and STC1) identified through rigorous statistical analysis, the model captures crucial molecular determinants of prognosis. The GPSICCA risk score also correlates with immune and, more strongly, stromal components of the TME, as well as with specific immune cell types and immunomodulatory genes, reinforcing its biological relevance. This comprehensive integration enhances predictive accuracy and enables consistent stratification of patient survival across multiple independent cohorts.

Unlike traditional prognostic systems for various cancers that typically stratify patients into three distinct categories—poor, intermediate, and favorable prognosis33,34—the GPSICCA model introduces a two-tier risk classification. Based on the expression of four key genes, the GPSICCA score clearly divides patients into high- and low-risk groups, as consistently demonstrated across multiple independent cohorts. This binary stratification enhances clinical utility by simplifying risk assessment and reducing ambiguity in patient categorization, which may ultimately facilitate more decisive therapeutic planning.

Among the four genes included in the GPSICCA risk score, three—COL4A1, ITGA6, and STC1—are consistently implicated in cancer. COL4A1 encodes a collagen protein and acts as an oncogene in hepatocellular carcinoma by activating the FAK-Src signaling35. COL4A1 also appears to be oncogenic as its overexpression promotes the proliferation of breast and oral squamous cancer cells36,37. Similarly, ITGA6, which encodes α6-Integrin, is a crucial cell adhesion molecule that regulates the malignant behaviors of various cancers, such as breast cancer38, pancreatic cancer39, and ovarian cancer40. ITGA6 promotes proliferation, metastasis, and drug resistance in these cancers41. STC1 also facilitates metastasis and chemoresistance of ovarian cancer cells by controlling the FOXC2/ITGB6 signaling42, promotes metastasis of breast cancer by stimulating the EGFR-ERK-S100A4 signaling43, and predicts poor survival in colorectal cancer patients44. In line with these roles, our analysis found that these genes are overexpressed in ICCA and are linked to unfavorable clinical outcomes, suggesting their oncogenic potential in ICCA, though their precise function in ICCA cells remains to be explored.

In contrast, GULP1 appears to have a context-dependent role in cancer. In urothelial carcinoma, it functions as a tumor suppressor through the NRF2-KEAP1 signaling pathway45. GULP1 also acts as a tumor suppressor in ovarian cancer, with reduced expression due to promoter methylation46. In ICCA, however, GULP1 is upregulated and correlated with poor survival, suggesting it may have an oncogenic function in this context. Its impact on ICCA requires further functional validation.

Notably, the expression of these four genes overlaps with that of the TME regulator CXCL17 (ref. 47), indicating their involvement in the TME of ICCA, particularly COL4A1, ITGA6, and STC1, which are known to influence TME in other cancers29,48,49. The role of GULP1 in TME remains unclear. Our analysis revealed that the GPSICCA risk score is positively correlated with immune and stromal scores, both of which predict poor survival outcomes in ICCA patients. This implies that these genes may influence immune cell infiltration and stromal remodeling in ICCA tissues.

Among the TME cell types correlated with the GPSICCA risk score, Th2 cells had the most significant positive association. The role of Th2 cells in the TME varies across cancers; in colon and pancreatic cancers, they have anti-tumor effects, while in breast cancer, inhibiting Th2-mediated immunity improves response to immunotherapy50,51. In ICCA, the positive association of Th2 cells with the GPSICCA score suggests they may be tumor-promoting, though further investigation is needed. Conversely, the negative association with NKT cells, which typically have anti-cancer functions52, suggests that these cells may be suppressed by the signature genes in the ICCA microenvironment.

Correlation analysis with panimmune gene sets revealed a significant link between the GPSICCA score and IFN responses. Although IFNs generally have anti-tumor effects, in some cases they may promote immune evasion by cancer cells53, raising the possibility that IFN signaling could be tumor-promoting in ICCA. This requires further study. Only two panimmune gene sets, regulating NK cells and Th17 cells, were negatively correlated with the GPSICCA score. NK cells are known to inhibit CCA54,55, while Th17 cells can have both pro- and anti-tumor roles56, though their function in ICCA remains to be elucidated.

Regarding specific immunomodulatory genes, ENTPD1 showed the strongest positive correlation with the GPSICCA score, indicating a potential oncogenic role in ICCA. Consistent with this, ENTPD1, expressed by regulatory T cells (Tregs), enhances the growth of hepatic metastatic tumors by promoting tumor cell proliferation via scavenging extracellular ATP57,58. EDNRB, which showed the strongest negative correlation with the GPSICCA score, is considered a tumor suppressor in other cancers, with higher expression linked to better survival in triple-negative breast cancer and hepatocellular carcinoma59,60. Although the specific roles of these genes in ICCA remain unclear, they could serve as promising therapeutic targets for ICCA patients.

While our prognostic model for ICCA demonstrates significant value in predicting patient outcomes, further validation in larger patient cohorts is necessary to confirm its broader clinical applicability. In particular, it remains to be determined whether the model is universally applicable across all ICCA patients or whether its prognostic value is limited to specific subgroups, such as those undergoing treatments with curative intent, like surgical resection. Additionally, deeper exploration into the roles of the four signature genes will provide critical insights into the underlying mechanisms that drive ICCA progression and the model’s predictive power. To improve the biological relevance and interpretability of the model, it is also crucial to link the aberrant expression of these genes to specific TME cell populations using single-cell profiling techniques. Understanding these cancer-related genes will help establish a stronger foundation for utilizing this model in clinical settings and may reveal therapeutic opportunities for targeting these genes in ICCA treatment.

In summary, this study introduces a novel prognostic model that could enhance ICCA patient stratification and treatment strategies. Moreover, our findings shed light on the molecular basis of ICCA, emphasizing the therapeutic potential of the identified signature genes for future clinical interventions.