Decoding the ERS–CAF immunoregulatory axis via multimodal AI and its pan-cancer prognostic and therapeutic predictive value

Zheng, Bo-Wen; Xia, Chao; Tang, Ming; Huang, Wei; Zheng, Bo-Yv; Niu, Hua-Qing; Li, Jing; Zhang, Tao-Lan; Zhou, Hong; Zou, Ming-Xiang

doi:10.1038/s41746-026-02388-w

Article
Open access
Published: 30 January 2026

npj Digital Medicine volume 9, Article number: 199 (2026)

Abstract

Endoplasmic reticulum stress-related cancer-associated fibroblasts (ERS–CAF) remodel the tumor microenvironment and drive immune exclusion and therapy resistance in chordoma, yet routine and non-invasive readouts of this biology are lacking. We hypothesized that standard pre-operative MRI and H&E whole-slide images (WSI) encode image-based surrogates of ERS–CAF-driven immunoregulation that can be learned and generalized across cancers. Three bulk-transcriptomic reference scores were defined for surrogate supervision, capturing ERS-program activity, ERS–CAF-immune ligand-receptor crosstalk and microenvironmental heterogeneity. In 126 chordoma cases, a stage-wise multimodal framework integrating calibrated WSI attention, gated radiopathomic fusion and domain alignment showed strong concordance with molecular profiles, independent prognostic value and biologically specific localization to fibrotic immune-excluded regions. These associations generalized in zero-shot analyses to TCGA pan-cancer cohorts. An MRI-only distilled model preserved most predictive performance with substantial gains in efficiency, supporting scalable non-invasive clinical application.

Introduction

Chordoma exhibits an immunologically complex tumor microenvironment (TME), marked by dense desmoplastic matrix, CAF activity, and immune exclusion, with limited standard systemic options and nontrivial local-regional failure despite aggressive therapy^1,2,3,4,5,6. Early ultrastructural studies already highlighted chordoma’s matrix-rich architecture and physaliphorous phenotype¹; contemporary series and consensus statements emphasize that, while maximal safe resection and high-dose particle radiotherapy (proton/carbon) can improve control, local recurrence remains frequent (often ~40–60% at 5–10 years depending on site and series), and conventional chemotherapy is largely ineffective^2,3,4.

While our previous work first demonstrated the presence of ER stress-related CAF (ERS–CAF) in chordoma, their roles in other tumor types, particularly stroma-rich cancers, remain unclear^3,4. Beyond chordoma, prototypical stroma-dominant cancers such as PDAC exemplify an immune-stromal ecosystem with CAF-driven immunosuppression and poor response to systemic therapy, underscoring a broader need for TME-aware biomarkers⁷. In colorectal cancer (CRC), ER-stress programs have been linked to tumor immune dysfunction/exclusion and can stratify putative benefit from immune-checkpoint blockade, with ZNF703 proposed as a candidate immunotherapy target⁸. Routine, repeatable, and non-invasive TME profiling would therefore be valuable; yet biopsy-centric assays are invasive, scarce, and susceptible to spatial/temporal sampling bias that complicates prognostication and treatment selection⁹. Against this backdrop, radiology-pathology integration has emerged as a “digital biopsy” paradigm: recent studies and reviews show that fusing radiomics with pathomics yields image-derived surrogates that better reflect microenvironmental biology and improve risk stratification and treatment prediction compared with single-modality models^10,11. Multimodal survival/prediction in computational pathology has established strong transformer-based fusion baselines, notably MCAT¹² and PORPOISE¹³, which integrate WSI with molecular profiles to improve risk stratification and provide cross-modal interpretability. In contrast, our setting treats bulk RNA-seq as supervision only (to define mechanistic surrogate targets) and focuses on non-invasive decoding from routine images (MRI + H&E); nevertheless, we include an MCAT-style co-attention transformer fusion baseline under the same protocol to contextualize the value of our gating design. 
In chordoma specifically, the immune microenvironment literature—including early immunotherapy case series and small trials—further motivates imaging biomarkers that report on CAF-immune crosstalk to guide selection and triage^5,14.

Cancer-associated fibroblasts (CAFs) are not a monolith; single-cell and spatial studies reveal conserved, interconvertible subtypes—including contractile myCAFs, inflammatory iCAFs, and antigen-presenting apCAFs/MHC-II⁺ phenotypes—whose proportions track disease stage and therapeutic response across tumor types^15,16,17,18. A recent pan-cancer single-cell spatial multi-omics study further resolved conserved spatial CAF subtypes and their cellular neighborhoods, linking CAF spatial organization to immune phenotypes and clinical outcomes, which motivates spatially coherent attention patterns as a plausible readout of CAF neighborhood biology¹⁹. Through ligand-receptor (LR) crosstalk with myeloid and lymphoid populations, CAFs remodel extracellular matrix and cytokine milieus to enforce immune exclusion and therapy resistance: canonical axes include CXCL12-CXCR4, which sequesters effector T cells in the stroma and whose inhibition synergizes with checkpoint blockade, and CCL2-CCR2, which recruits immunosuppressive monocytes/macrophages^20,21,22. Consistently, single-cell analyses of CRC liver metastasis implicate myCAFs in ECM remodeling and pro-metastatic niches and highlight specific ligand-receptor programs, reinforcing the relevance of our LR crosstalk surrogate and heterogeneity target H as mechanistic readouts²³.

Mechanistically, iCAF programs can be induced by IL-1 → LIF → JAK/STAT signaling and antagonized by TGF-β, offering a molecular switch for inflammatory vs. myofibroblastic states and explaining spatial gradients around tumor glands^15,24. These observations motivate biology-aware priors for our imaging surrogates: contemporary single-cell communication frameworks formalize LR evidence at different levels—CellPhoneDB accounts for multimeric receptor/ligand architecture and tests enrichment from scRNA-seq, CellChat models signaling probability and pathway information flow, and NicheNet links ligands to target genes in receivers via prior signaling/regulatory graphs—providing principled scaffolds to curate ERS–CAF-immune axes (e.g., CXCL12-CXCR4, CCL2-CCR2)^25,26,27,28. Together, these data support our premise that routine images may encode stable, mechanism-grounded surrogates of ERS–CAF activity and its downstream immunoregulation.

We hypothesize that routine images—pre-operative MRI and H&E whole-slide images (WSIs)—encode stable surrogates of ERS–CAF-driven immunoregulation that can be learned and generalized across cancers. This premise is grounded in two observations. First, quantitative radiomics systematically captures phenotypes related to desmoplasia, necrosis, edema, and vascularity, and these imaging phenotypes correlate with molecular programs and outcomes across tumor types²⁹. Second, modern computational pathology has shown that WSIs contain sufficient signal to recover transcriptomic and genomic states—e.g., the HE2RNA model infers RNA-seq expression from H&E³⁰, and deep learning can predict microsatellite instability (MSI) directly from histology^31,32—indicating that tissue morphology embeds rich molecular proxies.

To train imaging surrogates that are explicitly mechanism-aware, we construct three bulk-transcriptomic reference scores (surrogate targets) per patient: (i) ERS–CAF abundance as single-sample gene-set activity using ssGSEA, a robust rank-based scoring of pathway activation³³, with optional cross-check via GSVA for sensitivity analysis³⁴; (ii) ERS–CAF-immune LR-interaction strength computed from curated ligand-receptor pairs (ERS–CAF ligands and immune-cell receptors) assembled from single-cell communication resources (CellChat/CellPhoneDB/NicheNet; see Methods) and aggregated at the sample level; and (iii) microenvironmental heterogeneity as Shannon diversity over immune/stromal cell fractions estimated by digital cytometry (CIBERSORTx)³⁵. These targets span abundance (ERS–CAF), crosstalk (LR), and composition (heterogeneity), offering complementary supervision that encourages the model to recover biologically specific aspects of the TME rather than generic tumor burden. Because stromal programs (e.g., CAF/iCAF/myCAF axes) are conserved across stroma-rich cancers, we expect surrogates learned in chordoma to exhibit pan-cancer portability when applied zero-shot to external cohorts. Importantly, these endpoints are bulk-derived molecular surrogates rather than directly measured biological quantities with spatial resolution. Our goal is therefore to learn image-based predictors of these clinically informative transcriptomic states, and we validate spatial plausibility using independent pathology annotations of stromal remodeling and immune exclusion.
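The heterogeneity target (iii) can be illustrated with a minimal sketch, assuming the deconvolution step returns one non-negative fraction per cell type and that the natural logarithm is used (the text does not state the log base). The function name `shannon_diversity` and the example fractions are hypothetical.

```python
import numpy as np

def shannon_diversity(fractions, eps=1e-12):
    """Shannon diversity (natural log) of a vector of non-negative cell fractions."""
    p = np.asarray(fractions, dtype=float)
    if p.ndim != 1 or np.any(p < 0) or p.sum() <= 0:
        raise ValueError("fractions must be a 1-D non-negative vector with positive sum")
    p = p / p.sum()   # renormalize in case fractions do not sum to exactly 1
    p = p[p > eps]    # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# Illustrative (made-up) cell-fraction vectors for two samples.
even_mix = [0.25, 0.25, 0.25, 0.25]    # four cell types, equally abundant
dominated = [0.97, 0.01, 0.01, 0.01]   # one cell type dominates

h_even = shannon_diversity(even_mix)
h_dominated = shannon_diversity(dominated)
print(f"H(even) = {h_even:.3f}, H(dominated) = {h_dominated:.3f}")
```

An evenly mixed microenvironment attains the maximum value log(K) for K cell types, while a sample dominated by one population scores close to zero.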

Our study proceeds in five steps: (i) we define and compute these molecular gold standards from bulk RNA-seq; (ii) we train a multimodal AI with CLIP-guided pathomics, radiomics, and mid-level gated fusion to predict the gold standards from images; (iii) we demonstrate prognostic value in an internal chordoma cohort; (iv) we assess pan-cancer generalization and treatment predictivity in TCGA stroma-rich cancers; and (v) we distill the model to MRI-only for deployability.

Our methodological contribution is system-level and mechanism-aware: we formulate non-invasive decoding as multi-target regression to mechanism-defined transcriptomic surrogates (ERS–CAF program activity, ERS–CAF-immune LR crosstalk, and microenvironmental heterogeneity), and we introduce a stage-wise curriculum that (Stage I) injects a biological text prior via target-specific prompt banks and monotone similarity-label alignment to steer tile selection, (Stage II) learns complementary radiology-pathology fusion with robustness to missing modalities, and (Stage III) improves multi-site generalization with stable, non-adversarial second-order alignment evaluated by leave-site-out tests. We further differentiate from path-transcriptomics by predicting three interpretable mechanistic surrogates rather than high-dimensional gene expression, and we provide prompt-perturbation and spatial plausibility evidence to validate that text guidance is mechanism-consistent rather than a generic MIL heuristic.

Results

We analyzed a 126-patient internal chordoma cohort with matched pre-operative MRI (T1, T2, contrast-enhanced), diagnostic H&E WSIs, bulk RNA-seq, and longitudinal outcomes (OS, PFS). Inclusion/exclusion, imaging protocols, segmentation/tiling QC and RNA-seq processing are detailed in Section 9. For external validation, we applied the frozen model to stroma-rich TCGA cancers (PAAD, STAD, COADREAD) with WSIs, RNA-seq, and clinical endpoints.

Molecular surrogates of the ERS–CAF axis

We summarize supervision targets using cohort-wide distributions and pairwise associations (Fig. 1). Distributions are unimodal and approximately symmetric, and the three surrogates show moderate concordance, supporting complementary supervision signals. Exact coefficients (N=126): r (ERS–CAF, LR) = 0.48, r (ERS–CAF, H) = 0.20, r (LR, H) = 0.27 (all FDR < 0.05).

**Fig. 1: Molecular surrogate landscape (N = 126).**

We anchor the continuous “molecular” scores to visible histomorphology by testing whether attention hotspots co-localize with pathologist-annotated fibrosis/immune-exclusion. Board-certified pathologists (blinded to model outputs) graded fibrosis and immune exclusion per patient on a 0–3 scale; we report: (i) IoU (Jaccard) between the top-K attention mask and the fibrosis ROI; (ii) Hit-rate for immune-exclusion point annotations falling inside the top-K attention. Analyses are stratified as Low (grades 0–1) vs High (grades 2–3) and compared to a randomized baseline (attention shuffled).

As shown in Fig. 2: (i) Fibrosis. Mean IoU increased from 0.28 in Low (grades 0–1) to 0.47 in High (grades 2–3), while the randomized baseline averaged 0.12 (all N = 126); High vs Low: p < 10⁻⁶; High vs baseline: p < 10⁻⁸. (ii) Immune exclusion. Mean hit-rate increased from 0.36 in Low to 0.55 in High, with a baseline of 0.18; High vs Low: p < 10⁻⁶; High vs baseline: p < 10⁻⁸. (iii) Pass criteria. Co-localization metrics are significantly above the randomized baseline and monotone with grade, satisfying the pre-registered gate.

Attention hotspots faithfully concentrate on desmoplastic and immune-excluded regions when grades are high, tying the model’s molecular surrogates to recognizable histology and strengthening biological plausibility.
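The two co-localization metrics can be sketched as follows, assuming attention scores and annotations live on the same 2-D tile grid. The helper names (`top_k_mask`, `iou`, `hit_rate`), the grid size, and the toy annotations are hypothetical; the shuffled-attention baseline mirrors the randomized control described above.

```python
import numpy as np

def top_k_mask(attention, k):
    """Boolean mask marking the k highest-attention tiles of a 2-D tile grid."""
    flat = attention.ravel()
    k = int(min(max(k, 1), flat.size))
    idx = np.argpartition(flat, -k)[-k:]   # indices of the k largest scores
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(attention.shape)

def iou(mask_a, mask_b):
    """Intersection over union (Jaccard index) of two boolean masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def hit_rate(mask, points):
    """Fraction of (row, col) point annotations that fall inside the mask."""
    if len(points) == 0:
        return float("nan")
    rows, cols = zip(*points)
    return float(mask[list(rows), list(cols)].mean())

# Toy 4x4 tile grid: attention increases toward the bottom row.
attention = np.arange(16, dtype=float).reshape(4, 4)
fibrosis_roi = np.zeros((4, 4), dtype=bool)
fibrosis_roi[2:, :] = True                           # bottom two rows annotated
exclusion_points = [(3, 0), (3, 3), (0, 0), (2, 1)]  # point annotations

hot = top_k_mask(attention, k=4)
roi_iou = iou(hot, fibrosis_roi)
roi_hit = hit_rate(hot, exclusion_points)

# Randomized baseline: shuffle attention across tiles, then recompute.
rng = np.random.default_rng(0)
shuffled = rng.permutation(attention.ravel()).reshape(attention.shape)
baseline_iou = iou(top_k_mask(shuffled, k=4), fibrosis_roi)
print(roi_iou, roi_hit, baseline_iou)
```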

Multimodal decoding from MRI and H&E

Using 5-fold patient-level cross-validation stratified by site/scanner, the proposed CLIP-guided WSI branch plus MRI radiomics with gated fusion achieved the best agreement with molecular surrogates across all targets (Table 1). Relative to strong single-modality baselines, fusion improved macro-average Pearson correlation by +0.07–+0.12 absolute and reduced calibration error (slope 0.97 ± 0.06 vs. 0.88 ± 0.07 for the best single-modality model). RMSE showed consistent reductions (median −9.4% vs. WSI-only CLIP and −14.6% vs. MRI-only; extended table). Predicted vs. observed scatter and calibration plots exhibited tight linearity without gross heteroscedasticity.
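For readers who want to reproduce the agreement metrics on their own predictions, a minimal sketch follows. It assumes the calibration slope is the ordinary least-squares slope of observed on predicted scores (the text does not specify the estimator) and that macro-r is the unweighted mean of per-target Pearson r. The data are synthetic, not study data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def macro_r(pred, obs):
    """Unweighted mean of per-target Pearson r; inputs have shape (n, targets)."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.mean([pearson_r(pred[:, j], obs[:, j]) for j in range(obs.shape[1])]))

def calibration_slope(pred, obs):
    """OLS slope of observed on predicted (assumed definition); 1.0 is ideal."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    pc = pred - pred.mean()
    return float((pc * (obs - obs.mean())).sum() / (pc ** 2).sum())

def rmse(pred, obs):
    """Root-mean-square error."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

# Synthetic stand-in for out-of-fold predictions of three targets.
rng = np.random.default_rng(0)
obs = rng.normal(size=(126, 3))
pred = 0.8 * obs + rng.normal(scale=0.5, size=obs.shape)

print("macro-r:", round(macro_r(pred, obs), 3))
print("slope (target 0):", round(calibration_slope(pred[:, 0], obs[:, 0]), 3))
print("RMSE:", round(rmse(pred, obs), 3))
```

Under this definition, a slope below 1 indicates predictions that are more spread out than the observed scores, and a slope above 1 indicates predictions that are compressed toward the mean.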

Table 1 Cross-validated prediction of transcriptomic surrogates on the 126-patient chordoma cohort

Table 2 benchmarks CLIP guidance and fusion gating against modern WSI MIL baselines, a histology SSL encoder, an HE2RNA-style image-to-transcriptomics pipeline, and alternative fusion strategies under identical fivefold splits and QC. (i) WSI-only. CLIP-guided MIL is the best WSI-only model (Macro-r = 0.610 ± 0.046), improving over ABMIL (Δ+0.051) and outperforming CLAM/DSMIL/TransMIL (best Macro-r = 0.541) and CTransPath+MIL (0.548), with consistent gains across ERS–CAF, LR, and H. (ii) Image-to-transcriptomics. HE2RNA-style WSI → expr → score underperforms (Macro-r = 0.480 ± 0.061), suggesting that predicting low-dimensional mechanism-defined surrogates is more data-efficient than high-dimensional expression regression in small-N settings. (iii) Fusion. Gated fusion improves over late fusion, concat+MLP, and transformer fusion (best alternative Macro-r = 0.604 ± 0.045), and CORAL adds further gains (Δ + 0.034), supporting complementary modality use plus stable multi-site alignment.

Table 2 Stronger WSI and fusion baselines under the same protocol (Chordoma, N = 126)

Ablations: (1) Replacing CLIP-guided attention with vanilla ABMIL reduced WSI-only performance by 0.05 to 0.06 absolute r across targets, indicating gains from mechanism-aware prompts. (2) Removing CORAL from fusion reduced macro-r by 0.03 (Table 1), consistent with site/scanner harmonization benefits. (3) Modality dropout (q = 0.2) improved robustness (single-modality r drop <0.04) without harming fusion accuracy.

Prognostic value in chordoma

We summarize risk stratification with two concise plots (Fig. 3). The combined forest plot shows adjusted HRs (OS & PFS) for ERS–CAF (T3 vs T1), LR (T3 vs T1), and H (per SD); the time-dependent C-index curve demonstrates a consistent improvement from 0.64 to 0.71 at 12 months when adding AI scores to the clinical model.

**Fig. 3: Concise prognostic summary (N = 126).**
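As an illustration of the discrimination metric, the sketch below implements the plain Harrell concordance index for right-censored data. It is a simplification: the analysis above reports a time-dependent C-index, which additionally restricts or weights pairs by a time horizon. The toy follow-up times and risk scores are made up.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's C: share of comparable pairs in which the earlier failure has higher risk."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                  # a censored subject cannot anchor a pair
        for j in range(n):
            if time[i] < time[j]:     # subject i failed while j was still at risk
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: months of follow-up, event indicator (1 = event, 0 = censored).
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
c_perfect = harrell_c_index(time, event, risk=[4, 3, 2, 1])
c_reversed = harrell_c_index(time, event, risk=[1, 2, 3, 4])
c_constant = harrell_c_index(time, event, risk=[1, 1, 1, 1])
print(c_perfect, c_reversed, c_constant)
```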

To validate the necessity and stability of text priors, we compare: (i) no text (NONE), (ii) fixed curated prompts (CUR), and (iii) learnable CoOp (COOP); we further probe sensitivity to synonym prompts (SYN), antonym/negation (ANTI), and out-of-domain phrases (OOD). For each prompt variant and target (ERS–CAF, LR, H), we compute 5-fold Pearson r against the transcriptomic gold standards using the WSI branch in isolation (radiomics kept fixed).

We observe that: (i) Guidance helps. COOP achieves the highest mean r across all targets, CUR ranks second, and NONE trails (Fig. 4a). (ii) Stability to synonyms. SYN prompts cluster tightly around CUR with limited inter-fold spread (Fig. 4b), indicating robustness to wording. (iii) Sensitivity to wrong priors. OOD and especially ANTI degrade performance toward/ below the no-text baseline, confirming that mechanistically inconsistent prompts harm learning. (iv) Mechanistic localization. Attention maps are more focal under CUR (Fig. 4c) and diffuse without text (Fig. 4d), supporting biology-aware tile selection.

**Fig. 4: Stage I (CLIP guidance) evidence.**

Did Stage II (gated fusion) learn complementarity? As shown in Fig. 5: (i) Weight migration follows clinical intuition: sacrum/small-volume/<50 y strata show WSI-dominant gating (mean WSI weights 0.60–0.61), while skull-base/large-volume/>65 y strata shift toward MRI (weights 0.55–0.62), and fusion achieves the best macro-r in every stratum (Fig. 5a, b). (ii) Robust under missing modalities: when randomly dropping WSI tiles, fusion’s macro-r declines from 0.68 to 0.64 at 80% loss (Δr = 0.04) and to 0.61 at 100%, whereas WSI-only falls from 0.61 to 0.46 (Fig. 5c). When removing MRI sequences (T1/T2/CE), fusion drops to 0.66 at 2/3 missing and 0.62 at 100% missing, while MRI-only collapses, confirming that fusion falls back to the intact modality (Fig. 5d). (iii) Pass criteria: the ≥80% missing drop is controlled (Δr ≤ 0.05); gating shifts are directionally consistent with phenotype/geometry expectations, which is evidence of learned complementarity rather than single-modality dominance.

**Fig. 5: Stage II (gated fusion) learns complementary use of WSI and MRI.**

Is the gating shift an artifact of radiomics features? To clarify, the gate is trained on held-out CV folds and is not a hard-coded preference for either modality. If MRI radiomics were systematically unreliable, we would expect the gate to collapse toward WSI across strata and under MRI-sequence ablations. Instead, Fig. 5 shows bidirectional and clinically sensible shifts: skull-base/larger-volume/older strata allocate higher MRI weight, while sacrum/smaller-volume/younger strata allocate higher WSI weight. Moreover, under missing-modality stress tests, fusion falls back to the intact modality and degrades smoothly rather than exhibiting a WSI-only collapse: when WSI tiles are removed, fusion performance remains relatively stable (macro-r 0.68 → 0.64 at 80% loss), and when MRI sequences are removed, fusion similarly degrades but remains functional (0.66 at 2/3 missing; 0.62 at 100% missing), which is inconsistent with the gate being forced into WSI due to fragile MRI features.
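The fallback behaviour can be made concrete with a minimal sketch of a two-way softmax gate with availability masking. This is an illustrative assumption, not the architecture used in the study: the real gate is learned and conditioned on the inputs, whereas here the gate logits are passed in directly, and the function name `gated_fusion` is hypothetical.

```python
import numpy as np

def gated_fusion(z_wsi, z_mri, gate_logits, available=(True, True)):
    """Convex combination of two modality embeddings with availability masking.

    gate_logits: two unnormalized gate scores (WSI, MRI); in a trained model these
    would come from a small gating network (assumed, not shown here).
    """
    logits = np.asarray(gate_logits, dtype=float).copy()
    avail = np.asarray(available, dtype=bool)
    if not avail.any():
        raise ValueError("at least one modality must be available")
    logits[~avail] = -np.inf                     # a missing modality gets zero weight
    logits -= logits[avail].max()                # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    z = np.stack([np.asarray(z_wsi, dtype=float), np.asarray(z_mri, dtype=float)])
    z = np.where(avail[:, None], z, 0.0)         # ignore placeholder values (e.g., NaN)
    return weights @ z, weights

z_wsi = np.array([1.0, 2.0])
z_mri = np.array([3.0, 4.0])

# Both modalities present, equal gate scores: a simple average.
fused_both, w_both = gated_fusion(z_wsi, z_mri, gate_logits=[0.0, 0.0])

# MRI missing: all weight moves to WSI even if the gate favoured MRI.
fused_wsi, w_wsi = gated_fusion(z_wsi, np.full(2, np.nan), gate_logits=[0.0, 5.0],
                                available=(True, False))
print(fused_both, w_both, fused_wsi, w_wsi)
```

Masking before the softmax guarantees that the weights of the available modalities still sum to one, so the fused embedding falls back to the intact modality rather than being scaled down.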

Did Stage III (CORAL) actually remove site/domain confounding? Following the design: (1) Leave-site-out (LSO) validation: train on 4 sites and test on the held-out site; report Macro-r before/after CORAL per site. (2) Embedding visualization: 2D embeddings colored by site before/after CORAL; quantify mixing via kNN (k = 10) site-label entropy (higher = better mixing). We found that: (i) Embedding mixing increased substantially. The 10-NN site-label entropy rose from 0.053 (before) to 0.854 (after), indicating strong cross-site mixing in the post-alignment space (Fig. 6a, b). (ii) LSO performance improved across all sites. Macro-r (mean ± sd) improved from {0.612 ± 0.036, 0.583 ± 0.041, 0.598 ± 0.039, 0.571 ± 0.040, 0.626 ± 0.034} to {0.664 ± 0.032, 0.642 ± 0.035, 0.651 ± 0.033, 0.636 ± 0.036, 0.673 ± 0.030} for Sites A-E (Fig. 6c). (iii) Pass criteria met. (1) LSO Macro-r gains are consistent and clinically meaningful; (2) the embedding shows clear cross-site-mixing post-CORAL, matching the goal of statistical domain alignment without adversarial training.
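The mixing statistic can be sketched as below, assuming the entropy of neighbor site labels is normalized by the log of the number of sites so that values lie in [0, 1] (the normalization is an assumption; the text only states that higher means better mixing). The embeddings here are synthetic.

```python
import numpy as np

def knn_site_entropy(embeddings, sites, k=10):
    """Mean normalized entropy of site labels among each sample's k nearest neighbors."""
    emb = np.asarray(embeddings, dtype=float)
    labels, codes = np.unique(np.asarray(sites), return_inverse=True)
    n, n_sites = emb.shape[0], len(labels)
    if n_sites < 2 or k >= n:
        raise ValueError("need at least two sites and k < number of samples")
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    nn = np.argpartition(d2, k, axis=1)[:, :k]           # k nearest neighbors per row
    entropies = np.empty(n)
    for i in range(n):
        counts = np.bincount(codes[nn[i]], minlength=n_sites)
        p = counts[counts > 0] / k
        entropies[i] = -(p * np.log(p)).sum()
    return float(entropies.mean() / np.log(n_sites))

rng = np.random.default_rng(0)

# Site-clustered embeddings: two sites far apart in feature space.
clustered = np.vstack([rng.normal(size=(30, 2)), rng.normal(size=(30, 2)) + 100.0])
clustered_sites = np.array(["A"] * 30 + ["B"] * 30)
entropy_clustered = knn_site_entropy(clustered, clustered_sites, k=10)

# Well-mixed embeddings: site labels unrelated to position.
mixed = rng.normal(size=(60, 2))
mixed_sites = np.array(["A", "B"] * 30)
entropy_mixed = knn_site_entropy(mixed, mixed_sites, k=10)
print(entropy_clustered, entropy_mixed)
```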

**Fig. 6: Stage III (CORAL) reduces domain clustering and improves leave-site-out performance.**

CORAL reduces second-order (covariance) discrepancies among sites, yielding embeddings that are less site-identifiable but more biology-consistent. The concomitant LSO gains suggest that alignment corrects nuisance variation rather than suppressing signal.
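A minimal sketch of the second-order objective follows, using the standard pairwise CORAL form (squared Frobenius distance between feature covariances, scaled by 1/(4d²)). How the study combines more than two sites is not stated here, so the two-domain version is shown as an assumption.

```python
import numpy as np

def covariance(x):
    """Unbiased feature covariance of an (n, d) matrix."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0, keepdims=True)
    return xc.T @ xc / (x.shape[0] - 1)

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    d = source.shape[1]
    diff = covariance(source) - covariance(target)
    return float((diff ** 2).sum() / (4.0 * d * d))

rng = np.random.default_rng(0)
site_a = rng.normal(size=(50, 4))

loss_same = coral_loss(site_a, site_a)           # identical statistics
loss_shifted = coral_loss(site_a, site_a + 5.0)  # mean shift only
loss_scaled = coral_loss(site_a, 2.0 * site_a)   # covariance inflated fourfold
print(loss_same, loss_shifted, loss_scaled)
```

The mean-shifted copy incurs essentially zero loss, which illustrates why an alignment of this kind targets covariance structure only and leaves other kinds of domain shift untouched.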

Pan-cancer generalization and treatment predictivity

To assess cross-cancer portability, we performed a zero-shot evaluation on three stroma-rich TCGA cohorts: TCGA-PAAD (pancreatic adenocarcinoma; N = 150)³⁶, TCGA-STAD (stomach adenocarcinoma; N = 295)³⁷, and TCGA-COADREAD (colorectal adenocarcinoma; N = 276)³⁸. Diagnostic H&E WSIs and matched bulk RNA-seq were obtained from the NCI Genomic Data Commons (GDC)³⁹, and survival endpoints were standardized following TCGA-CDR recommendations when applicable⁴⁰. Importantly, the chordoma-trained model was applied frozen (no TCGA fine-tuning, calibration, or hyperparameter selection on TCGA). Due to heterogeneous radiology availability across TCGA projects, this analysis uses WSI input only (radiology withheld) and compares predicted image-derived surrogates against the per-cohort transcriptomic reference scores computed by the same pipeline (ssGSEA/CIBERSORTx/LR aggregation; Section 9).

Zero-shot application to TCGA cohorts yielded significant concordance between image-derived and transcriptomic surrogates (Table 3; Fig. 7). For PAAD, Pearson r was 0.42 (ERS–CAF), 0.39 (LR), and 0.36 (H); for STAD, 0.38/0.35/0.33; for COADREAD, 0.35/0.31/0.29 (all FDR < 0.01). Figure 7 visualizes the patient-level agreement underlying Table 3 and includes bootstrap confidence intervals and a permutation control (shuffled predictions) to demonstrate that the observed concordance is not attributable to chance alignment.
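The two uncertainty checks mentioned for Fig. 7 can be sketched as a percentile bootstrap and a label-shuffling permutation test. The resampling counts, the two-sided p-value convention, and the synthetic data are assumptions for illustration; they are not the study's settings.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def bootstrap_ci(pred, ref, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r, resampling patients with replacement."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(pred)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = pearson_r(pred[idx], ref[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def permutation_p(pred, ref, n_perm=1000, seed=0):
    """Two-sided permutation p-value: shuffle predictions to break the pairing."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    rng = np.random.default_rng(seed)
    observed = pearson_r(pred, ref)
    null = np.array([pearson_r(rng.permutation(pred), ref) for _ in range(n_perm)])
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, float(p)

# Synthetic cohort with a moderate true correlation.
rng = np.random.default_rng(1)
ref = rng.normal(size=150)
pred = 0.5 * ref + rng.normal(size=150)

ci_lo, ci_hi = bootstrap_ci(pred, ref)
r_obs, p_val = permutation_p(pred, ref)
print(f"r = {r_obs:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}], permutation p = {p_val:.4f}")
```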

**Fig. 7: Zero-shot pan-cancer generalization on TCGA.**

Table 3 External generalization and deployability

Beyond molecular concordance, the AI surrogates remained clinically informative in external cohorts: they were prognostic in PAAD (C-index +0.04 over clinical-only) and STAD (+0.03). Where treatment outcome annotations were available from TCGA clinical fields curated in TCGA-CDR⁴⁰, the LR surrogate stratified chemotherapy response in PAAD (AUC 0.66 ± 0.03; adjusted OR per SD 1.53 [1.18–1.99]). In COADREAD, the directionality and magnitude of the decoded ER stress-immune associations are consistent with prior CRC studies linking ER-stress signatures to immune dysfunction/exclusion and immunotherapy-related phenotypes, including an ERS gene scoring system that stratifies putative ICB benefit and nominates ZNF703 as a candidate target⁸.

Toward deployability: MRI-only knowledge distillation

An MRI-only student distilled from the multimodal teacher retained 94% of macro-r (teacher 0.68 ± 0.03 vs. student 0.64 ± 0.03), with calibration slope 0.94 ± 0.08. Inference was 4.6× faster and used 2.9× less GPU memory (Table 3), supporting non-invasive, single-modality deployment.

Why does TCGA performance drop? The TCGA setting introduces a large distribution shift even within histology: differences in tumor histoarchitecture (e.g., glandular vs. physaliphorous patterns), stroma localization, slide preparation/scanning, and cohort composition (tumor purity and spatial sampling mismatch between the WSI region and bulk RNA-seq). Since TCGA evaluation uses WSI-only, the observed drop cannot be attributed to MRI → CT radiomics shift. To quantify the magnitude of the cross-cancer shift, Table 4 contrasts in-distribution chordoma WSI-only performance with TCGA mean concordance (and mean explained variance).

Table 4 Diagnostic of TCGA performance drop (WSI-only)

Discussion

This study demonstrates that routine images (pre-operative MRI and H&E WSIs) can non-invasively decode the ERS–CAF-centric immunoregulatory axis and yield image-derived molecular surrogates that generalize across cancers and stratify outcomes. On the internal 126-patient chordoma cohort, the multimodal model—comprising CLIP-guided pathomics, radiomics, and gated mid-level fusion—showed the strongest agreement with transcriptomic gold standards and improved calibration versus single-modality baselines. Stage-wise analyses clarified the sources of gains: Stage I text guidance increased task-specific attention fidelity and robustness to wording; Stage II fusion learned clinically sensible complementarity between WSI and MRI and degraded gracefully under missing modalities; Stage III CORAL reduced site confounding and improved leave-site-out (LSO) macro-r across all contributing centers. Beyond correlation, attention hotspots co-localized with pathologist-annotated fibrosis and immune exclusion, strengthening the biological plausibility of the decoded surrogates. Zero-shot application to stroma-rich TCGA cancers retained significant concordance, and image-derived scores improved discrimination and decision utility for prognosis and treatment stratification.

We emphasize that the novelty is not in introducing any single building block in isolation, but in an end-to-end, mechanism-aware formulation and training curriculum for decoding the ERS–CAF immunoregulatory axis from routine images under small-N, multi-site constraints. We adopt CORAL as a stable, non-adversarial alignment objective that can be validated transparently by leave-site-out tests; adversarial alignment methods can be effective but are often sensitive to domain imbalance and optimization instability in limited-data clinical settings. Accordingly, we focus on rigorous generalization evidence (LSO gains and site-mixing improvements) and robustness diagnostics (missing-modality stress tests) to support deployability.

The findings are consistent with a model in which ERS–CAF programs reshape the TME through ligand-receptor (LR) crosstalk that recruits and repatterns immune cell populations and remodels the extracellular matrix. The observed monotone associations—higher ERS–CAF/LR scores with worse outcomes, and higher microenvironmental heterogeneity (H) with partial protection—track known mechanisms of desmoplasia and immune exclusion. Two independent lines of evidence support biological specificity: (i) text-guided attention concentrates on histologic patterns congruent with ERS–CAF activation (e.g., collagen-dense stroma, immune-poor niches), whereas wrong or out-of-domain prompts degrade accuracy; (ii) the LR-interaction surrogate aligns with curated axes and with immune-cell fractions inferred from bulk deconvolution, suggesting that imaging surrogates are not merely capturing generic tumor burden but immunoregulatory state.

Non-invasive surrogates derived from routine images can augment clinical decision-making in several ways. First, they enable triage for biopsy: patients with high predicted ERS–CAF/LR activity could be prioritized for confirmatory molecular assays when tissue is scarce. Second, they support risk-adapted surveillance: image-derived risk tiers improved time-dependent C-index and decision curves, indicating practical net benefit across clinically relevant thresholds. Third, they offer a path toward treatment selection and enrichment: the LR surrogate separated chemotherapy responders in external cancers, motivating prospective enrichment strategies or adaptive trial designs. Finally, distillation to an MRI-only student retained most accuracy with markedly reduced compute/memory, suggesting a feasible deployment route in settings where WSIs are not routinely digitized or cross-modality acquisition is staggered.

This is a retrospective, single-disease training study with N = 126 internal patients, which limits precision for subgroup analyses and may inflate optimism despite careful cross-validation. Imaging modality heterogeneity (MRI internally, CT common in TCGA) introduces a shift that we partially mitigated by modality-agnostic radiomic families but did not eliminate. External zero-shot correlations in TCGA are moderate, so our claim is not that images can fully reconstruct bulk molecular scores. Rather, the aim is to recover actionable, low-dimensional surrogates that generalize sufficiently to stratify outcomes and treatment response. Improving cross-cancer molecular fidelity via per-cancer calibration or lightweight adaptation is a natural next step.

Pathologist grades (fibrosis and immune exclusion) remain semi-quantitative and may vary across readers; although we observed monotone trends and strong baselines, richer spatial annotations and inter-rater modeling would further solidify claims. Critically, we lack direct spatial ground truth for ER stress specifically within CAFs (e.g., multiplex IHC/IF for HSPA5/ATF4/XBP1s co-stained with COL1A1/ACTA2/PDGFRB, or spatial transcriptomics), so our spatial validation uses downstream morphology (fibrosis/immune exclusion) rather than a cell-resolved ER-stress label; acquiring such assays is an important direction for prospective validation.

The LR curation relies on current communication databases and chordoma single-cell priors; incompleteness or context-dependence could bias the surrogate. CORAL reduces second-order site effects but may not remove higher-order domain shifts; while LSO gains were consistent, multi-center prospective validation is needed. Finally, association does not imply causation: improved prognosis/treatment stratification does not, on its own, establish that modifying the decoded pathway will change outcomes. In addition, extending the method to privacy-preserving learning frameworks such as federated learning is an interesting research direction⁴¹.

Methods

We formalize non-invasive decoding of the ER-stressed cancer-associated fibroblast (ERS–CAF) immunoregulatory axis as a supervised, multi-target regression from routine images to transcriptomic gold standards. For each patient i ∈ {1, …, N}, we observe pre-operative MRI (R_i), H&E WSI (W_i), bulk RNA-seq (x_i), and a site/scanner domain label d_i ∈ {1, …, D}. The system contains: (i) a CLIP-guided pathomics branch that converts WSI tiles into task-specific slide embeddings using a curated prompt bank capturing ERS–CAF biology; (ii) a radiomics branch that summarizes MRI with harmonized quantitative features; (iii) a mid-level gated fusion that integrates branches and predicts three continuous targets: ERS–CAF abundance, ERS–CAF-immune ligand-receptor (LR) interaction strength, and TME heterogeneity. The training proceeds in three stages: Stage I calibrates the CLIP-guided WSI branch to molecular scores via a monotone similarity-label alignment; Stage II fits the multimodal regressor with a heteroscedastic multi-task loss; Stage III performs stable, non-adversarial domain harmonization by second-order (CORAL) alignment. Our framework is shown in Fig. 8.

Notation and data schema

For patient i, the data comprise MRI volumes ${R}_{i}\in {{\mathbb{R}}}^{H\times W\times Z\times C}$ with C = 3 channels (T1/T2/contrast-enhanced), WSI W_i, RNA-seq ${x}_{i}\in {{\mathbb{R}}}^{G}$, clinical covariates c_i, survival (T_i, Δ_i), treatment vector t_i, and domain d_i ∈ {1, …, D}. The three continuous supervision targets form ${S}_{i}^{\star }={[{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)]}^{\top }\in {{\mathbb{R}}}^{3}$; predictions are ${\widehat{S}}_{i}\in {{\mathbb{R}}}^{3}$.

For each target j ∈ {ERS − CAF, LR, H}, we define a prompt set ${{\mathcal{Q}}}^{(j)}={\{{q}_{m}^{(j)}\}}_{m=1}^{{M}_{j}}$ and encode it using the CLIP text encoder f_T into unit-norm vectors

$${{\bf{t}}}_{m}^{(j)}=\frac{{f}_{T}({q}_{m}^{(j)})}{\parallel {f}_{T}({q}_{m}^{(j)}){\parallel }_{2}}\in {{\mathbb{R}}}^{d},\,m=1,\ldots ,{M}_{j}.$$

(1)

We also define a target-specific prompt-mixture text embedding

$${\bar{{\bf{t}}}}^{(j)}=\mathop{\sum }\limits_{m=1}^{{M}_{j}}{\gamma }_{m}^{(j)}\,{{\bf{t}}}_{m}^{(j)},\,{\gamma }_{m}^{(j)}\ge 0,\,\mathop{\sum }\limits_{m}{\gamma }_{m}^{(j)}=1,$$

(2)

and the concatenated text context used in fusion gating,

$$\bar{{\bf{t}}}=\left[{\bar{{\bf{t}}}}^{({\rm{ERS}}-{\rm{CAF}})};\,{\bar{{\bf{t}}}}^{({\rm{LR}})};\,{\bar{{\bf{t}}}}^{(H)}\right].$$

(3)

We use ${{\bf{t}}}_{m}^{(j)}$, ${\bar{{\bf{t}}}}^{(j)}$, and $\bar{{\bf{t}}}$ consistently throughout; we avoid overloaded alternatives (e.g., f^(j)) to prevent ambiguity.

Bulk-transcriptomic reference scores (surrogate supervision targets)

ERS–CAF abundance

With curated gene set G_ERS−CAF ⊆ {1, …, G} and per-sample gene ranking π_i, define the ssGSEA running sum

$${S}_{{\rm{ERS}}-{\rm{CAF}}}(i)=\mathop{\sum }\limits_{k=1}^{G}{\Delta }_{i}(k),$$

(4)

$${\Delta }_{i}(k)=\frac{{\sum }_{g\in {G}_{{\rm{ERS}}-{\rm{CAF}}}}{\bf{1}}\{{{\rm{rank}}}_{i}(g)\le k\}\,| {x}_{i,g}{| }^{p}}{{\sum }_{g\in {G}_{{\rm{ERS}}-{\rm{CAF}}}}| {x}_{i,g}{| }^{p}}-\frac{{\sum }_{g\notin {G}_{{\rm{ERS}}-{\rm{CAF}}}}{\bf{1}}\{{{\rm{rank}}}_{i}(g)\le k\}}{G-| {G}_{{\rm{ERS}}-{\rm{CAF}}}| }.$$

(5)

where rank_i(g) is the position of g in π_i (descending); p ∈ [0, 2]. Scores are z-scored cohort-wise.
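
As an illustration of Eqs. (4) and (5), the following minimal NumPy sketch computes the summed running-sum difference for one sample. The function name `ssgsea_score` and its argument layout are hypothetical, the exponent `p` must be supplied by the caller (the text only states p ∈ [0, 2]), and the cohort-wise z-scoring is left to a later step; this is not the authors' released implementation.

```python
import numpy as np

def ssgsea_score(x, gene_set_idx, p):
    """Sum over ranks k of the hit-minus-miss running sum (Eqs. 4-5) for one sample.

    x            : expression vector of length G
    gene_set_idx : integer indices of the genes in the curated set
    p            : weighting exponent (caller-supplied; hypothetical argument name)
    """
    x = np.asarray(x, dtype=float)
    n_genes = x.size
    order = np.argsort(-x, kind="stable")            # gene indices, descending expression
    in_set = np.zeros(n_genes, dtype=bool)
    in_set[np.asarray(gene_set_idx, dtype=int)] = True
    in_set_sorted = in_set[order]                    # set membership along the ranking
    hit_w = np.where(in_set_sorted, np.abs(x[order]) ** p, 0.0)
    p_hit = np.cumsum(hit_w) / hit_w.sum()           # first term of Eq. 5
    p_miss = np.cumsum(~in_set_sorted) / (n_genes - in_set.sum())  # second term of Eq. 5
    return float(np.sum(p_hit - p_miss))             # Eq. 4
```

A gene set concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score. The sketch assumes the set is non-empty, does not cover all genes, and has non-zero total weight.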

ERS–CAF-immune LR interaction

Let ${\mathcal{C}}$ denote the set of immune cell types and ${\mathcal{L}}=\{(l,r,c)\}$ the curated triplets (ERS–CAF ligand l, immune receptor r on cell type $c\in {\mathcal{C}}$). With z-scored expressions ${\widetilde{x}}_{i,\cdot }$ and deconvolved fractions p_{i,c},

$${S}_{{\rm{LR}}}(i)=\frac{1}{Z}\mathop{\sum }\limits_{(l,r,c)\in {\mathcal{L}}}{\omega }_{lrc}\,{\widetilde{x}}_{i,l}\,{\widetilde{x}}_{i,r}\,{p}_{i,c},\,Z=\mathop{\sum }\limits_{(l,r,c)}{\omega }_{lrc},$$

(6)

where ω_lrc ≥ 0 are evidence-based (or uniform) weights.
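
Eq. (6) can be sketched for a single patient as follows. The dictionary-based inputs, the function name `lr_score`, and the placeholder gene and cell-type labels in the usage note are illustrative assumptions, not the curated triplets used in the study.

```python
import numpy as np

def lr_score(x_tilde, fractions, triplets, weights=None):
    """Weighted ligand-receptor product-sum of Eq. 6 for one patient.

    x_tilde   : dict mapping gene name -> z-scored expression
    fractions : dict mapping immune cell type -> deconvolved fraction
    triplets  : list of (ligand, receptor, cell_type) tuples
    weights   : optional non-negative weights, one per triplet (uniform if None)
    """
    if weights is None:
        weights = np.ones(len(triplets))
    weights = np.asarray(weights, dtype=float)
    terms = np.array([x_tilde[l] * x_tilde[r] * fractions[c] for l, r, c in triplets])
    return float(np.sum(weights * terms) / np.sum(weights))  # normalisation by Z
```

For example, with placeholder inputs `x_tilde = {"L1": 1.0, "R1": 2.0, "L2": -1.0, "R2": 1.0}`, `fractions = {"T": 0.5, "M": 0.25}` and the triplets `[("L1", "R1", "T"), ("L2", "R2", "M")]`, uniform weights give (1.0 − 0.25)/2 = 0.375.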

Microenvironmental heterogeneity

From CIBERSORTx proportions ${\{{p}_{i,k}\}}_{k=1}^{K}$ (∑_k p_{i, k} = 1),

$$H(i)=-\mathop{\sum }\limits_{k=1}^{K}{p}_{i,k}\log {p}_{i,k},\,\,{\rm{effective\; richness}}\,\exp \{H(i)\}.$$

(7)
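
A minimal sketch of Eq. (7), assuming the deconvolved fractions for one patient are available as a non-negative vector; the function name is hypothetical, and the convention 0 · log 0 = 0 is applied for absent cell types.

```python
import numpy as np

def shannon_heterogeneity(p):
    """Return (H, exp(H)) for a vector of cell-type fractions (Eq. 7)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                 # guard against fractions that do not sum exactly to 1
    nz = p[p > 0]                   # 0 * log 0 is treated as 0
    H = float(-np.sum(nz * np.log(nz)))
    return H, float(np.exp(H))      # entropy and effective richness
```

A uniform mixture over K cell types yields H = log K and an effective richness of K, whereas a sample dominated by a single cell type yields H = 0 and an effective richness of 1.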

Noise floor of computed supervision targets

The supervision targets ${S}_{i}^{(j)}\in \{{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)\}$ are computed from bulk RNA-seq using (1) gene-set scoring (ssGSEA), (2) digital cytometry (CIBERSORTx), and (3) an aggregated LR scoring rule. These procedures are validated and widely used, but they are nevertheless approximations to latent biology and introduce label noise. Therefore, our model learns an image-based surrogate of a bulk-derived surrogate, and performance should be interpreted relative to this noise.

Let ${S}_{i}^{* (j)}$ denote the latent (unobserved) sample-level biological state for target j and assume

$${S}_{i}^{(j)}={S}_{i}^{* (j)}+{\varepsilon }_{i}^{(j)},\,{\mathbb{E}}\left[{\varepsilon }_{i}^{(j)}\right]=0,$$

(8)

where ${\varepsilon }_{i}^{(j)}$ captures scoring/deconvolution error and finite-sample noise in bulk measurements. Define the reliability (signal fraction) of the computed target as

$${\rho }_{j}\,=\,\frac{\mathrm{Var}({S}^{* (j)})}{\mathrm{Var}({S}^{(j)})}\in (0,1].$$

(9)

Then, for any predictor ${\widehat{S}}^{(j)}$ (including our image-derived predictor) whose output is uncorrelated with the label noise ${\varepsilon }^{(j)}$, the observable correlation is upper bounded by the noise ceiling

$${\rm{Corr}}({\widehat{S}}^{(j)},{S}^{(j)})\,\le \,\sqrt{{\rho }_{j}},$$

(10)

i.e., even a perfect predictor of ${S}^{* (j)}$ cannot exceed $\sqrt{{\rho }_{j}}$ correlation with the noisy computed label ${S}^{(j)}$.
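
The bound in Eq. (10) follows in one step from Eqs. (8) and (9). The short derivation below is a sketch under the added assumption that the label noise ${\varepsilon }^{(j)}$ is uncorrelated with the predictor ${\widehat{S}}^{(j)}$, which is plausible when the predictor sees only images but is not guaranteed:

$$\mathrm{Cov}({\widehat{S}}^{(j)},{S}^{(j)})=\mathrm{Cov}({\widehat{S}}^{(j)},{S}^{* (j)})+\mathrm{Cov}({\widehat{S}}^{(j)},{\varepsilon }^{(j)})=\mathrm{Cov}({\widehat{S}}^{(j)},{S}^{* (j)}),$$

$${\rm{Corr}}({\widehat{S}}^{(j)},{S}^{(j)})=\frac{\mathrm{Cov}({\widehat{S}}^{(j)},{S}^{* (j)})}{\sqrt{\mathrm{Var}({\widehat{S}}^{(j)})\,\mathrm{Var}({S}^{(j)})}}={\rm{Corr}}({\widehat{S}}^{(j)},{S}^{* (j)})\,\sqrt{\frac{\mathrm{Var}({S}^{* (j)})}{\mathrm{Var}({S}^{(j)})}}={\rm{Corr}}({\widehat{S}}^{(j)},{S}^{* (j)})\,\sqrt{{\rho }_{j}}\le \sqrt{{\rho }_{j}}.$$

Equality holds only when the predictor is perfectly correlated with the latent state ${S}^{* (j)}$.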

We estimate ρ_j (or lower bounds thereof) via pipeline perturbations that produce two independently computed proxies S^(j) and ${\widetilde{S}}^{(j)}$ from the same RNA-seq:

ERS–CAF score stability: ssGSEA vs. GSVA (or other rank-based scoring), yielding ${\rm{Corr}}({S}_{{\rm{ERS}}-{\rm{CAF}}},{\widetilde{S}}_{{\rm{ERS}}-{\rm{CAF}}})$.
Heterogeneity stability: compute H from alternative deconvolution tools (e.g., EPIC/quanTIseq/xCell) and compare to CIBERSORTx-derived H.
LR score stability: compare the simple weighted product-sum rule to an alternative bulk-mode LR scoring (e.g., CellChat bulk formulation), reporting ${\rm{Corr}}({S}_{{\rm{LR}}},{\widetilde{S}}_{{\rm{LR}}})$.

These inter-pipeline concordances provide conservative empirical lower bounds on ρ_j and contextualize attainable prediction accuracy.

WSI pathomics with CLIP-guided MIL

WSI captures micro-ecology (CAF morphology, collagen deposition, immune exclusion). A CLIP prior and mechanism-driven prompts steer attention toward tiles that instantiate ERS–CAF biology. Standard pathology MIL learns attention weights purely from slide-level labels; in contrast, our tile weighting ${\alpha }_{it}^{(j)}$ is explicitly grounded in a target-specific biological text prior through prompt-similarity scoring (Eqns. (12–14)), and Stage I enforces a monotone similarity-label alignment before downstream regression. This design makes the inductive bias inspectable (via prompts) and falsifiable (via synonym/OOD/antonym perturbations in Fig. 4). Moreover, unlike path-transcriptomics approaches that predict large gene expression vectors, we predict three mechanism-defined surrogates (ERS–CAF activity, LR crosstalk, heterogeneity), which are more stable under limited N and directly aligned with our biological hypothesis.

After stain normalization and tissue detection, W_i is tiled at 20× magnification into ${\{{W}_{it}\}}_{t=1}^{{T}_{i}}$. The CLIP image encoder f_I yields unit-norm tile embeddings

$${{\bf{v}}}_{it}=\frac{{f}_{I}({W}_{it})}{\parallel {f}_{I}({W}_{it}){\parallel }_{2}}\in {{\mathbb{R}}}^{d}.$$

(11)

For each target j ∈ {ERS − CAF, LR, H}, a prompt set ${{\mathcal{Q}}}^{(j)}={\{{q}_{m}^{(j)}\}}_{m=1}^{{M}_{j}}$ is encoded by CLIP text encoder f_T into unit vectors

$${{\bf{t}}}_{m}^{(j)}=\frac{{f}_{T}({q}_{m}^{(j)})}{\parallel {f}_{T}({q}_{m}^{(j)}){\parallel }_{2}},\,{\bar{{\bf{t}}}}^{(j)}=\mathop{\sum }\limits_{m=1}^{{M}_{j}}{\gamma }_{m}^{(j)}{{\bf{t}}}_{m}^{(j)},\,\,{\gamma }_{m}^{(j)}\ge 0,\,\mathop{\sum }\limits_{m}{\gamma }_{m}^{(j)}=1.$$

(12)

Prompt-similarity attention over tiles uses a log-sum-exp temperature τ > 0:

$${s}_{it}^{(j)}=\frac{1}{\tau }\log \mathop{\sum }\limits_{m=1}^{{M}_{j}}\exp \left(\frac{{{\bf{v}}}_{it}^{\top }{{\bf{t}}}_{m}^{(j)}}{\tau }\right),\,{\alpha }_{it}^{(j)}=\frac{\exp ({s}_{it}^{(j)})}{{\sum }_{{t}^{{\prime} }}\exp ({s}_{i{t}^{{\prime} }}^{(j)})},\,{{\bf{p}}}_{i}^{(j)}=\mathop{\sum }\limits_{t=1}^{{T}_{i}}{\alpha }_{it}^{(j)}\,{{\bf{v}}}_{it}.$$

(13)
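
The pooling in Eq. (13) can be sketched in NumPy as below, assuming tile and prompt embeddings have already been extracted and L2-normalized. Function and variable names are hypothetical; the log-sum-exp and softmax are computed in a numerically stable form that is mathematically equivalent to the expressions above.

```python
import numpy as np

def l2_normalize(a, axis=-1):
    """Scale vectors to unit Euclidean norm along the given axis."""
    return a / np.linalg.norm(a, axis=axis, keepdims=True)

def prompt_attention_pool(V, T, tau):
    """Prompt-similarity attention pooling for one slide and one target (Eq. 13).

    V   : (n_tiles, d) unit-norm tile embeddings
    T   : (n_prompts, d) unit-norm prompt embeddings
    tau : temperature > 0
    """
    sims = V @ T.T / tau                                   # cosine similarities / tau
    m = sims.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(sims - m).sum(axis=1))   # stable log-sum-exp over prompts
    s = lse / tau                                          # tile scores s_it
    a = np.exp(s - s.max())                                # stable softmax over tiles
    alpha = a / a.sum()                                    # attention weights alpha_it
    pooled = alpha @ V                                     # slide embedding p_i^(j)
    return s, alpha, pooled
```

Because the pooled vector is a convex combination of unit-norm tile embeddings, its norm cannot exceed 1, which is consistent with the similarity range stated later for Stage I.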

The slide-level pathomics representation concatenates task-specific embeddings:

$${{\bf{p}}}_{i}=\left[{{\bf{p}}}_{i}^{({\rm{ERS}}-{\rm{CAF}})};\,{{\bf{p}}}_{i}^{({\rm{LR}})};\,{{\bf{p}}}_{i}^{(H)}\right]\in {{\mathbb{R}}}^{3{d}_{p}}.$$

(14)

MRI radiomics branch

MRI encodes macro-phenotypes (shape, intensity, texture) correlated with stromal content and edema, complementing WSI micro-ecology.

After resampling to isotropic voxels and sequence-wise z-normalization, PyRadiomics produces ${\phi }_{{\rm{rad}}}({R}_{i})\in {{\mathbb{R}}}^{{P}_{r}}$; stable features (ICC threshold) define ${{\bf{r}}}_{i}={[{\phi }_{{\rm{rad}}}({R}_{i})]}_{{\mathcal{I}}}$. Site/scanner effects are harmonized via ComBat:

$${{\bf{r}}}_{i}\leftarrow {\rm{ComBat}}({{\bf{r}}}_{i}| {d}_{i}),\,{{\bf{z}}}_{i}^{({\rm{rad}})}={E}_{{\rm{rad}}}({{\bf{r}}}_{i})\in {{\mathbb{R}}}^{{d}_{r}},$$

(15)

with E_rad an MLP (layer norm, dropout).

Gated mid-level fusion and multi-head regression

Radiomics and pathomics carry complementary, scale-separated signals; prompts provide a mechanism-aware context vector that can modulate fusion. Encode pathomics by E_wsi:

$${{\bf{z}}}_{i}^{({\rm{wsi}})}={E}_{{\rm{wsi}}}({{\bf{p}}}_{i})\in {{\mathbb{R}}}^{{d}_{p}},\,\bar{{\bf{t}}}=[{\bar{{\bf{t}}}}^{({\rm{ERS}}-{\rm{CAF}})};{\bar{{\bf{t}}}}^{({\rm{LR}})};{\bar{{\bf{t}}}}^{(H)}].$$

(16)

A gate controls cross-modal contribution:

$${{\bf{z}}}_{i}=\sigma \left({{\bf{W}}}_{g}\left[{{\bf{z}}}_{i}^{({\rm{rad}})};{{\bf{z}}}_{i}^{({\rm{wsi}})};\bar{{\bf{t}}}\right]+{{\bf{b}}}_{g}\right)\odot \left[{{\bf{z}}}_{i}^{({\rm{rad}})};{{\bf{z}}}_{i}^{({\rm{wsi}})}\right],$$

(17)

where σ is element-wise sigmoid and ⊙ the Hadamard product. A shared trunk f with three linear readouts yields

$${\widehat{s}}_{i}^{(j)}={h}^{(j)}({{\bf{z}}}_{i})={{\bf{w}}}_{j}^{\top }f({{\bf{z}}}_{i})+{b}_{j},\,j\in \{{\rm{ERS}}-{\rm{CAF}},{\rm{LR}},H\}.$$

(18)
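
A minimal sketch of the gate in Eq. (17) and the linear readouts in Eq. (18). The trunk f is not specified in detail here, so the readout function simply takes the trunk output as input; all names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def gated_fusion(z_rad, z_wsi, t_bar, W_g, b_g):
    """Eq. 17: element-wise gate over the concatenated radiomics/pathomics latents.

    W_g has shape (d_r + d_p, d_r + d_p + d_text); b_g has shape (d_r + d_p,).
    """
    feats = np.concatenate([z_rad, z_wsi])       # [z_rad; z_wsi]
    gate_in = np.concatenate([feats, t_bar])     # the text context enters the gate only
    gate = sigmoid(W_g @ gate_in + b_g)          # one value in (0, 1) per latent dimension
    return gate * feats                          # Hadamard product

def linear_heads(h, W_out, b_out):
    """Eq. 18: three linear readouts applied to the trunk output h = f(z)."""
    return W_out @ h + b_out                     # one row of W_out per target
```

Note that the prompt context modulates how much of each latent dimension passes through but is not itself part of the fused vector.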

Training curriculum and objectives

Directly optimizing many objectives is fragile^42,43,44. We therefore adopt a three-stage curriculum, ensuring stable convergence while preserving the intended inductive biases.

Stage I: CLIP similarity-label alignment (WSI branch)

For each target j, define the WSI similarity

$${u}_{i}^{(j)}={{\bf{p}}}_{i}^{(j)\top }{\bar{{\bf{t}}}}^{(j)}\in [-1,1],$$

(19)

where ${\bar{{\bf{t}}}}^{(j)}$ is the prompt-mixture text embedding, and the min-max-scaled target

$${y}_{i}^{(j)}=\,\mathrm{minmax}\,\left({S}_{i}^{(j)}\right)=\frac{{S}_{i}^{(j)}-{\min }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}}{{\max }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}-{\min }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}+\epsilon }\in [0,1],$$

(20)

where ${\rm{minmax}}$ scales per-cohort to [0, 1], ${\mathcal{T}}$ denotes the training split of the current outer fold (to avoid leakage) and ϵ = 10⁻⁸. A monotone affine-sigmoid calibrator maps similarity to the label scale:

$${g}_{j}(u)=\sigma ({a}_{j}u+{b}_{j}),\,{a}_{j}\ge 0.$$

(21)

The WSI alignment loss is

$${{\mathcal{L}}}_{{\rm{WSI}}}=\mathop{\sum }\limits_{j}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({g}_{j}({u}_{i}^{(j)})-{y}_{i}^{(j)}\right)}^{2}.$$

(22)

where σ(⋅) is the logistic function and a_j, b_j are learned scalars, with a_j ≥ 0 enforcing monotonicity. During Stage I, (f_I, f_T) are kept frozen; E_wsi, the prompt-mixture weights ${\gamma }_{m}^{(j)}$, and (a_j, b_j) are optimized.
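
The Stage I quantities in Eqs. (20) to (22) can be sketched as follows. The softplus reparameterization used to keep the slope non-negative is an assumption made for this illustration (the text only requires a_j ≥ 0), and all function names are hypothetical.

```python
import numpy as np

def minmax_fit(S_train, eps=1e-8):
    """Fit min-max scaling on the training split only and return the transform (Eq. 20)."""
    lo, hi = float(np.min(S_train)), float(np.max(S_train))
    return lambda S: (np.asarray(S, dtype=float) - lo) / (hi - lo + eps)

def calibrator(u, a_raw, b):
    """Monotone affine-sigmoid calibrator g_j (Eq. 21)."""
    a = np.log1p(np.exp(a_raw))                  # softplus: assumed way to keep a >= 0
    return 1.0 / (1.0 + np.exp(-(a * u + b)))

def wsi_alignment_loss(U, Y, a_raw, b):
    """Eq. 22: per-target mean squared error, summed over targets.

    U, Y : (N, 3) similarities and min-max-scaled targets; a_raw, b : (3,)
    """
    return float(np.sum(np.mean((calibrator(U, a_raw, b) - Y) ** 2, axis=0)))
```

Because the scaling is fitted on the training split, held-out targets can fall slightly outside [0, 1]; this does not affect training, since held-out labels are used only for evaluation.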

Stage II: Multimodal regression (fusion model)

We fit the fused model to the continuous molecular targets with heteroscedastic weighting:

$${{\mathcal{L}}}_{{\rm{task}}}=\mathop{\sum }\limits_{j\in \{{\rm{ERS}}-{\rm{CAF}},{\rm{LR}},H\}}\left[\frac{1}{2{\sigma }_{j}^{2}}\cdot \frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({\widehat{s}}_{i}^{(j)}-{S}_{i}^{(j)}\right)}^{2}+\log {\sigma }_{j}\right],$$

(23)

where σ_j > 0 are learned noise scales and ${S}_{i}^{(j)}$ denotes the jth entry of ${S}_{i}^{\star }$. Parameters updated include E_rad, E_wsi, fusion gate (W_g, b_g), trunk f, and readouts (w_j, b_j); the Stage I prompt mixtures ${\gamma }_{m}^{(j)}$ are fine-tuned.
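
Eq. (23) can be written compactly as below. Parameterizing the noise scale through log σ_j, which keeps σ_j positive without constraints, is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def heteroscedastic_loss(S_hat, S, log_sigma):
    """Uncertainty-weighted multi-task loss of Eq. 23.

    S_hat, S  : (N, 3) predictions and targets
    log_sigma : (3,) log noise scales, one per target (assumed parameterization)
    """
    S_hat = np.asarray(S_hat, dtype=float)
    S = np.asarray(S, dtype=float)
    log_sigma = np.asarray(log_sigma, dtype=float)
    mse = np.mean((S_hat - S) ** 2, axis=0)                  # per-target MSE
    return float(np.sum(0.5 * np.exp(-2.0 * log_sigma) * mse + log_sigma))
```

For a fixed per-target error, setting the derivative with respect to σ_j to zero gives σ_j² equal to that target's mean squared error, so noisier targets are automatically down-weighted while the log σ_j term prevents the scales from growing without bound.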

Stage III: Site/scanner harmonization (CORAL)

To mitigate domain shift without adversarial instability, we align the second-order statistics of the fused features. Let μ_d and Σ_d denote the mean and covariance of z_i within domain d, and μ, Σ the global mean and covariance. The CORAL loss is

$${{\mathcal{L}}}_{{\rm{coral}}}=\mathop{\sum }\limits_{d=1}^{D}\left(\parallel {\mu }_{d}-\mu {\parallel }_{2}^{2}+\parallel {\Sigma }_{d}-\Sigma {\parallel }_{F}^{2}\right),$$

(24)

where ∥ ⋅ ∥_F is Frobenius norm. The Stage III objective is

$$\mathop{\min }\limits_{\Theta }\,{{\mathcal{L}}}_{{\rm{task}}}+{\lambda }_{{\rm{coral}}}{{\mathcal{L}}}_{{\rm{coral}}}+{\lambda }_{{\rm{reg}}}\parallel \Theta {\parallel }_{2}^{2},$$

(25)

with weight decay λ_reg ≥ 0 and λ_coral ≥ 0 chosen on validation folds.
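
The alignment term in Eq. (24) can be sketched as follows for a batch of fused features with domain labels. The biased (divide-by-n) covariance estimator is an assumption of this illustration, and the function name is hypothetical.

```python
import numpy as np

def coral_loss(Z, domains):
    """Eq. 24: squared deviation of per-domain mean and covariance from global statistics.

    Z       : (N, k) fused features
    domains : length-N array of domain labels
    """
    Z = np.asarray(Z, dtype=float)
    domains = np.asarray(domains)
    mu = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False, bias=True)       # global covariance (biased estimator)
    total = 0.0
    for d in np.unique(domains):
        Zd = Z[domains == d]
        mu_d = Zd.mean(axis=0)
        Sigma_d = np.cov(Zd, rowvar=False, bias=True)
        total += np.sum((mu_d - mu) ** 2) + np.sum((Sigma_d - Sigma) ** 2)  # L2 + Frobenius
    return float(total)
```

The loss is zero when every domain shares the global mean and covariance and grows when a site is shifted or rescaled; as noted in the limitations, it is blind to differences that appear only in higher-order statistics.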

We employ dropout in the encoders and trunk, and modality dropout during Stages II and III (with probability q, one branch’s latent is zeroed) to ensure single-modality robustness.
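
A minimal sketch of modality dropout for one training sample. The text states only that one branch's latent is zeroed with probability q; choosing between the two branches uniformly at random is an assumption of this illustration.

```python
import numpy as np

def modality_dropout(z_rad, z_wsi, q, rng):
    """With probability q, zero exactly one branch latent (branch chosen uniformly: assumed)."""
    if rng.random() < q:
        if rng.random() < 0.5:
            z_rad = np.zeros_like(z_rad)     # simulate missing MRI features
        else:
            z_wsi = np.zeros_like(z_wsi)     # simulate missing pathology features
    return z_rad, z_wsi
```

At inference time no dropout is applied; a missing modality can then be represented by the same zero vector the model encountered during training.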

MRI-only Knowledge Distillation

For fully non-invasive deployment when pathology is unavailable, we distill the trained teacher ${F}_{{\Theta }^{\star }}$ into an MRI-only student ${F}_{{\Theta }_{S}}$:

$${{\mathcal{L}}}_{\mathrm{KD}}=\mathop{\sum }\limits_{j\in \{\,\mathrm{ERS}-\mathrm{CAF}\,,\mathrm{LR},H\}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({\widehat{s}}_{i,S}^{(j)}-{\widehat{s}}_{i,T}^{(j)}\right)}^{2},$$

(26)

where ${\widehat{s}}_{i,T}^{(j)}={F}_{{\Theta }^{\star }}^{(j)}({R}_{i},{W}_{i})$ and ${\widehat{s}}_{i,S}^{(j)}={F}_{{\Theta }_{S}}^{(j)}({R}_{i})$. The student reuses the MRI radiomics encoder and the fusion trunk (restricted to the MRI branch).
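
Eq. (26) is a plain regression of student outputs onto frozen teacher outputs, sketched below with hypothetical names. Teacher predictions are treated as fixed targets, so no gradient would flow into the teacher in a training implementation.

```python
import numpy as np

def kd_loss(student_pred, teacher_pred):
    """Eq. 26: per-target mean squared error between student and teacher, summed over targets.

    student_pred, teacher_pred : (N, 3) arrays of predicted scores
    """
    student_pred = np.asarray(student_pred, dtype=float)
    teacher_pred = np.asarray(teacher_pred, dtype=float)
    return float(np.sum(np.mean((student_pred - teacher_pred) ** 2, axis=0)))
```

Since the targets are the teacher's predictions rather than the transcriptomic scores, distillation in this form needs no RNA-seq for the cases used to train the student, only paired MRI and WSI inputs for the teacher.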

Validation and Clinical Modeling

We conduct fivefold patient-level cross-validation stratified by domain d_i with nested hyperparameter selection; leakage is prevented at the patient level. Predictive fidelity is summarized by Pearson correlation r, RMSE, MAE, and calibration slope/intercept between ${\widehat{S}}_{i}$ and ${S}_{i}^{\star }$. Clinical relevance is assessed by Cox models with covariate adjustment:

$$h(t| {\widehat{S}}_{i},{{\bf{c}}}_{i})={h}_{0}(t)\exp \left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{i};{{\bf{c}}}_{i}]\right),$$

(27)

using the partial log-likelihood

$$\ell ({\boldsymbol{\theta }})=\mathop{\sum }\limits_{i:{\Delta }_{i}=1}\left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{i};{{\bf{c}}}_{i}]-\log \mathop{\sum }\limits_{j\in {\mathcal{R}}({T}_{i})}\exp \left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{j};{{\bf{c}}}_{j}]\right)\right),$$

(28)

where ${\mathcal{R}}({T}_{i})$ is the risk set. We report hazard ratios (HRs) with 95% CIs, time-dependent AUC, and C-index. Treatment relevance is tested via logistic or interaction Cox models with treatment vector t_i.
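
For concreteness, the partial log-likelihood in Eq. (28) can be evaluated as in the sketch below, taking the risk set to be all patients whose observed time is at least T_i. No correction for tied event times is applied, and the function name and array layout are illustrative assumptions; in practice a dedicated survival library would be used for fitting.

```python
import numpy as np

def cox_partial_loglik(theta, X, time, event):
    """Eq. 28 with risk set R(T_i) = {j : T_j >= T_i}.

    theta : (k,) coefficients
    X     : (N, k) rows [S_hat_i ; c_i]
    time  : (N,) observed times
    event : (N,) event indicators (1 = event, 0 = censored)
    """
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    eta = X @ np.asarray(theta, dtype=float)          # linear predictor
    ll = 0.0
    for i in np.flatnonzero(event == 1):              # sum over observed events only
        at_risk = time >= time[i]
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return float(ll)
```

With θ = 0, each event contributes minus the log of its risk-set size, which gives a quick sanity check on the risk-set construction.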

Given the limited cohort size (N = 126), we explicitly constrain model capacity and enforce strict evaluation hygiene. CLIP image/text encoders are kept frozen; the trainable components are limited to lightweight prompt-mixture weights, shallow encoders for radiomics/pathomics, a mid-level gating network, and three linear heads. All preprocessing/statistics are fit within each training fold only (e.g., min–max scaling for Stage I targets; ComBat parameters; feature standardization), and performance is reported on held-out patients only. We further apply dropout and weight decay throughout, and use modality dropout during Stage II/III to prevent single-modality shortcut learning and to improve robustness under missing modalities.

Experimental setup

We assemble a consecutive, single-disease cohort of 126 patients with matched pre-operative MRI (T1, T2, contrast-enhanced), diagnostic H&E WSIs, bulk RNA-seq, and clinical follow-up. Exclusion criteria are: (i) imaging outside protocol; (ii) failed ROI/tiling quality control; (iii) RNA-seq library size or mapping rate below thresholds; and (iv) missing outcomes. Primary endpoints are overall survival (OS) and progression-free survival (PFS). Molecular supervision yields three continuous targets per patient i: ${S}_{i}^{\star }={[{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)]}^{\top }$ (see “Bulk-transcriptomic reference scores”).

For generalization, we evaluate on TCGA stroma-rich cancers (PAAD, STAD, COADREAD) with WSIs, RNA-seq, and clinical data. The multimodal model is frozen and applied zero-shot to generate ${\widehat{S}}_{i}$, then compared to per-cancer molecular gold standards. When imaging modality differs (e.g., CT), we restrict radiomic families to modality-agnostic statistics.

Preprocessing and Quality Control

Volumes are resampled to 1 mm isotropic, bias-field corrected, and z-standardized per sequence. Tumor ROIs are delineated and adjudicated. PyRadiomics extracts first-order, shape, and texture features (GLCM/GLRLM/GLSZM/NGTDM) plus LoG/wavelet; site/scanner effects are harmonized by ComBat; features with ICC < 0.75 are removed, yielding ${{\bf{r}}}_{i}\in {{\mathbb{R}}}^{{P}_{r}}$.

End-to-end 3D CNN/Transformer encoders are attractive^45,46. However, in our setting (N = 126 chordoma, multi-site scanners, three MRI sequences), fully learned 3D representation learning is high-variance and prone to overfitting without substantial pretraining and careful harmonization. We therefore use radiomics as a data-efficient and reproducible summary of macro-scale tumor phenotypes (shape/texture/heterogeneity), then learn a non-linear encoder E_rad( ⋅ ) jointly with the multimodal trunk and gate.

WSIs undergo Macenko stain normalization, tissue masking (Otsu+morphology), tiling at ×20 into 224 × 224 px patches (stride 224), and tile QC (tissue coverage ≥50%, blur/pen-mark removal). Retained tiles form ${\{{W}_{it}\}}_{t=1}^{{T}_{i}}$.

S_ERS−CAF is the ssGSEA score on a curated ERS–CAF gene set (cohort z-score); S_LR aggregates ERS–CAF ligand-immune receptor pairs weighted by immune fractions p_{i,c} from CIBERSORTx; H(i) is the Shannon diversity over deconvolved cell fractions (see “Bulk-transcriptomic reference scores”).

Ethics approval and consent to participate

This study was approved by the Institutional Ethics Committee of The First Affiliated Hospital, University of South China, Hunan, P.R. China (Ethics Approval No. 2025LL0206001). Written informed consent was obtained from all enrolled patients in accordance with the Declaration of Helsinki.

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to the current Administrative Regulations on Human Genetic Resources (HGR) from China’s Ministry of Science and Technology, but are available from the corresponding author on reasonable request.

Code availability

Custom code and scripts were developed for data preprocessing, feature extraction, model training, evaluation, and statistical analysis in this study. For the purpose of peer review, an anonymized version of the code and related resources is available at: https://anonymous.4open.science/r/Anonymous_code-ERS-CAF/README.md. The full source code will be made publicly available upon acceptance of the manuscript. All experiments were implemented in Python (≥3.7) using PyTorch (≥2.1.0), torchvision (≥0.16.0), numpy (≥1.24), pandas (≥2.0), scikit-learn (≥1.3), pyyaml (≥6.0), tqdm (≥4.66), matplotlib (≥3.7), transformers (≥4.41), and open_clip_torch (≥2.24.0). Unless otherwise specified, default parameters provided by the respective libraries were used, and key model architectures, hyperparameters, and task-specific variables are described in the Methods section and Supplementary Information.

References

1. Murad, T. M. & Murthy, M. S. N. Ultrastructure of a chordoma. Cancer 25, 1204–1215 (1970).
2. Niu, H., Zheng, B., Zou, M. & Zheng, B. Complex immune microenvironment of chordoma: a road map for future treatment. J. Immunother. Cancer 12, e009313 (2024).
3. Zhang, T. et al. Hypoxic upregulation of ier2 increases paracrine GMFG signaling of endoplasmic reticulum stress-caf to promote chordoma progression via targeting ITGB1. Adv. Sci. 11, e2405421 (2024).
4. Zhang, T. et al. Integrating single-cell and spatial transcriptomics reveals endoplasmic reticulum stress-related caf subpopulations associated with chordoma progression. Neuro-Oncol. 26, 295–308 (2024).
5. Liang, S. et al. The immune microenvironment in chordoma: implications for future treatment. World Neurosurg. 204, 124589 (2025).
6. Zheng, B. & Guo, W. Multi-omics analysis unveils the role of inflammatory cancer-associated fibroblasts in chordoma progression. J. Pathol. 265, 69–83 (2025).
7. Luo, W. et al. Tumor immune microenvironment-based therapies in pancreatic ductal adenocarcinoma. J. Exp. Clin. Cancer Res. 43, 92 (2024).
8. Wang, H. et al. Characterization of endoplasmic reticulum stress unveils ZNF703 as a promising target for colorectal cancer immunotherapy. J. Transl. Med. 21, 713 (2023).
9. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).
10. Vaidya, P. et al. Computationally integrating radiology and pathology image features for predicting treatment benefit and outcome in lung cancer. npj Precis. Oncol. 9, 161 (2025).
11. Kang, W. et al. Application of radiomics-based multiomics combinations in the tumor microenvironment and cancer prognosis. J. Transl. Med. 21, 566 (2023).
12. Chen, R. J. et al. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images. In Proceedings of the IEEE/CVF international conference on computer vision, 4015–4025 (2021).
13. Chen, R. J. et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer cell 40, 865–878 (2022).
14. Xu, J. et al. The role of tumor immune microenvironment in chordoma: promising immunotherapy strategies. Front. Immunol. 14, 1257254 (2023).
15. Öhlund, D. et al. Distinct populations of inflammatory fibroblasts and myofibroblasts in pancreatic cancer. J. Exp. Med. 214, 579–596 (2017).
16. Elyada, E. et al. Cross-species single-cell analysis of pancreatic ductal adenocarcinoma reveals antigen-presenting cancer-associated fibroblasts. Cancer Discov. 9, 1102–1123 (2019).
17. Cords, L. et al. Cancer-associated fibroblast classification in single-cell datasets across 19 cancer types. Nat. Commun. 14, 4929 (2023).
18. Forsthuber, A. et al. Cancer-associated fibroblast subtypes modulate the tumor microenvironment and patient outcome. Nat. Commun. 15, 53908 (2024).
19. Liu, Y. et al. Conserved spatial subtypes and cellular neighborhoods of cancer-associated fibroblasts revealed by single-cell spatial multi-omics. Cancer Cell 43, 905–924 (2025).
20. Feig, C. et al. Targeting cxcl12 from fap-expressing carcinoma-associated fibroblasts synergizes with anti–PD–L1 immunotherapy in pancreatic cancer. Proc. Natl. Acad. Sci. 110, 20212–20217 (2013).
21. Biasci, D. et al. Cxcr4 inhibition in human pancreatic and colorectal cancers induces an integrated immune response. Proc. Natl. Acad. Sci. 117, 28960–28970 (2020).
22. Mao, X. et al. Crosstalk between cancer-associated fibroblasts and immune cells in the tumor microenvironment. J. Hematol. Oncol. 14, 73 (2021).
23. Zhan, Y. et al. Single-cell transcriptomics reveals intratumor heterogeneity and the potential roles of cancer stem cells and mycAFs in colorectal cancer liver metastasis and recurrence. Cancer Lett. 612, 217452 (2025).
24. Biffi, G. et al. Il1-induced jak/stat signaling is antagonized by tgfβ to shape CAF heterogeneity in pancreatic ductal adenocarcinoma. Cancer Discov. 9, 282–301 (2019).
25. Efremova, M., Vento-Tormo, M., Teichmann, S. A. & Vento-Tormo, R. Cellphonedb: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes. Nat. Protoc. 15, 1484–1506 (2020).
26. Jin, S. et al. Inference and analysis of cell–cell communication using cellchat. Nat. Commun. 12, 1088 (2021).
27. Browaeys, R., Saelens, W. & Saeys, Y. Nichenet: modeling intercellular communication by linking ligands to target genes. Nat. Methods 17, 159–162 (2020).
28. Chen, C. et al. Crosstalk between cancer-associated fibroblasts and regulated cell death in tumors: insights into apoptosis, autophagy, ferroptosis, and pyroptosis. Cell Death Discov. 10, 123 (2024).
29. Aerts, H. J. W. L. et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5, 4006 (2014).
30. Schmauch, B. et al. A deep learning model to predict RNA-seq expression of tumoral genes from whole-slide images. Nat. Commun. 11, 3877 (2020).
31. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
32. Echle, A. et al. Deep learning for the prediction of microsatellite instability from histopathology images of colorectal cancer: a systematic developmental and validation study. Lancet Oncol. 22, 162–172 (2021).
33. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102, 15545–15550 (2005).
34. Hänzelmann, S., Castelo, R. & Guinney, J. Gsva: gene set variation analysis for microarray and rna-seq data. BMC Bioinformatics 14, 7 (2013).
35. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
36. Cancer Genome Atlas Research Network Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell 32, 185–203.e13 (2017).
37. Cancer Genome Atlas Research Network Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202–209 (2014).
38. Cancer Genome Atlas Network Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
39. Jensen, M. A., Ferretti, V., Grossman, R. L. & Staudt, L. M. The NCI genomic data commons as an engine for precision medicine. Blood 130, 453–459 (2017).
40. Liu, J. et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173, 400–416.e11 (2018).
41. Xiao, C. et al. Confusion-resistant federated learning via diffusion-based data harmonization on non-iid data. Adv. Neural Inf. Process. Syst. 37, 137495–137520 (2024).
42. Yang, Z., Wang, H., Liu, Y. & Zhang, F. Cfdformer: medical image segmentation based on cross fusion dual attention network. Biomed. Signal Process. Control 101, 107208 (2025).
43. Jiang, Y. et al. Self-paced learning for images of antinuclear antibodies. IEEE Trans. Med. Imaging 44, (2025).
44. Yao, J., Li, C. & Xiao, C. Swift sampler: efficient learning of sampler by 10 parameters. Adv. Neural Inf. Process. Syst. 37, 59030–59053 (2024).
45. Zhang, F., Gu, Z. & Wang, H. Decoding with structured awareness: integrating directional, frequency-spatial, and structural attention for medical image segmentation. Preprint at https://arxiv.org/abs/2512.05494 (2025).
46. Wang, Y., Wang, H. & Zhang, F. A medical image segmentation model with auto-dynamic convolution and location attention mechanism. Comput. Methods Prog. Biomed. 261, 108593 (2025).


Acknowledgements

Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, and Department of Orthopedics, Wuhan Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, contributed equally as first affiliation. We gratefully acknowledge Professor Yongbin Liu from the School of Computer Science at the University of South China, as well as Master’s students Kefan Wu and Yangji Chen, for their contributions to the AI modeling and validation aspects of this research. This work was supported by the National Natural Science Foundation of China (W2523095, 82003802 and 82473965 to T.L.Z., 82404690 to C.X. and 82002364 to M.X.Z.), China Scholarship Council (202106370071 to B.W.Z.), China Postdoctoral Science Foundation (2025M782004 to B.W.Z.), Excellent & Innovative Talent Program, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology (082550320 to B.W.Z.), Hubei Province High-Level Postdoctoral Talent Introduction Program (2005HBBHJD018 to B.W.Z.), Natural Science Foundation of Hunan Province (2019JJ50542 and 2023JJ50156 to T.L.Z., 2023JJ40596 to C.X., 2023JJ40587 to W.H. and 2021JJ40509 to M.X.Z.), Hunan Provincial Natural Science Foundation for Excellent Youth Scholars (2023JJ20035 to M.X.Z.), the Science and Technology Innovation Program of Hunan Province (2023RC3172 to M.X.Z.), Clinical Research 4310 Program of the First Affiliated Hospital of the University of South China (20224310NHYCG04 to T.L.Z. and 20224310NHYCG03 to H.Z.), Project for Clinical Research of Hunan Provincial Health Commission (20201978 to T.L.Z., 20201962 to C.X., and 20201956 to M.X.Z.), Research Foundation of Education Bureau of Hunan Province (22B0441 to W.H.), Medical-Engineering Interdisciplinary Research Program of the First Affiliated Hospital, University of South China (IRP-M&E-2025-06 to M.X.Z.), Henan Province Key Science and Technology Research Project (252102311060 to H.Q.N.) and Xiaohe Technology Talent Project of Hengyang city (C.X.).

Author information

These authors contributed equally: Bo-Wen Zheng, Chao Xia, Ming Tang.

Authors and Affiliations

Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, Hunan, China
Bo-Wen Zheng, Chao Xia, Ming Tang & Ming-Xiang Zou
Department of Orthopedics, Wuhan Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
Bo-Wen Zheng
Musculoskeletal Tumor Center, Peking University People’s Hospital, Peking University, Beijing, Beijing, China
Bo-Wen Zheng
Department of Spine Surgery, The Second Xiangya Hospital, Central South University, Changsha, Hunan, China
Ming Tang & Jing Li
The First Affiliated Hospital, Health Management Center, Hengyang Medical School, Hengyang, Hunan, China
Wei Huang
Department of Orthopedics Surgery, General Hospital of the Central Theater Command, Wuhan, Hubei, China
Bo-Yv Zheng
Department of Ophthalmology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China
Hua-Qing Niu
Department of Pharmacy, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, Hunan, China
Tao-Lan Zhang
Department of Radiology, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, Hunan, China
Hong Zhou

Contributions

B.W.Z., C.X., and M.T. are co-first authors and contributed equally to this work. T.L.Z., H.Z., and M.X.Z. are co-corresponding authors and contributed equally to this work. B.W.Z., C.X., T.L.Z., and M.X.Z. contributed to the conception and design of the study. B.W.Z., C.X., M.T., W.H., B.Y.Z., H.Q.N., J.L., T.L.Z., H.Z., and M.X.Z. did the data analysis and interpretation. B.W.Z., C.X., M.T., W.H., T.L.Z., H.Z. and M.X.Z. contributed to drafting and revision of the manuscript. All authors were involved in writing the paper and had final approval of the submitted and published versions.

Corresponding authors

Correspondence to Tao-Lan Zhang, Hong Zhou or Ming-Xiang Zou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zheng, BW., Xia, C., Tang, M. et al. Decoding the ERS–CAF immunoregulatory axis via multimodal AI and its pan-cancer prognostic and therapeutic predictive value. npj Digit. Med. 9, 199 (2026). https://doi.org/10.1038/s41746-026-02388-w

Received: 11 November 2025
Accepted: 17 January 2026
Published: 30 January 2026
Version of record: 04 March 2026
DOI: https://doi.org/10.1038/s41746-026-02388-w
