Introduction

Chordoma exhibits an immunologically complex tumor microenvironment (TME) marked by dense desmoplastic matrix, cancer-associated fibroblast (CAF) activity, and immune exclusion. Standard systemic options are limited, and local-regional failure remains nontrivial despite aggressive therapy1,2,3,4,5,6. Early ultrastructural studies already highlighted chordoma’s matrix-rich architecture and physaliphorous phenotype1; contemporary series and consensus statements emphasize that, while maximal safe resection and high-dose particle radiotherapy (proton/carbon) can improve control, local recurrence remains frequent (often ~40–60% at 5–10 years depending on site and series), and conventional chemotherapy is largely ineffective2,3,4.

While our previous work first demonstrated the presence of ER stress-related CAF (ERS–CAF) in chordoma, their roles in other tumor types, particularly stroma-rich cancers, remain unclear3,4. Beyond chordoma, prototypical stroma-dominant cancers such as PDAC exemplify an immune-stromal ecosystem with CAF-driven immunosuppression and poor response to systemic therapy, underscoring a broader need for TME-aware biomarkers7. In colorectal cancer (CRC), ER-stress programs have been linked to tumor immune dysfunction/exclusion and can stratify putative benefit from immune-checkpoint blockade, with ZNF703 proposed as a candidate immunotherapy target8. Routine, repeatable, and non-invasive TME profiling would therefore be valuable; yet biopsy-centric assays are invasive, scarce, and susceptible to spatial/temporal sampling bias that complicates prognostication and treatment selection9. Against this backdrop, radiology-pathology integration has emerged as a “digital biopsy” paradigm: recent studies and reviews show that fusing radiomics with pathomics yields image-derived surrogates that better reflect microenvironmental biology and improve risk stratification and treatment prediction compared with single-modality models10,11. Multimodal survival/prediction in computational pathology has established strong transformer-based fusion baselines, notably MCAT12 and PORPOISE13, which integrate WSI with molecular profiles to improve risk stratification and provide cross-modal interpretability. In contrast, our setting treats bulk RNA-seq as supervision only (to define mechanistic surrogate targets) and focuses on non-invasive decoding from routine images (MRI + H&E); nevertheless, we include an MCAT-style co-attention transformer fusion baseline under the same protocol to contextualize the value of our gating design. 
In chordoma specifically, the immune microenvironment literature—including early immunotherapy case series and small trials—further motivates imaging biomarkers that report on CAF-immune crosstalk to guide selection and triage5,14.

Cancer-associated fibroblasts (CAFs) are not a monolith; single-cell and spatial studies reveal conserved, interconvertible subtypes—including contractile myCAFs, inflammatory iCAFs, and antigen-presenting apCAFs/MHC-II+ phenotypes—whose proportions track disease stage and therapeutic response across tumor types15,16,17,18. A recent pan-cancer single-cell spatial multi-omics study further resolved conserved spatial CAF subtypes and their cellular neighborhoods, linking CAF spatial organization to immune phenotypes and clinical outcomes, which motivates spatially coherent attention patterns as a plausible readout of CAF neighborhood biology19. Through ligand-receptor (LR) crosstalk with myeloid and lymphoid populations, CAFs remodel extracellular matrix and cytokine milieus to enforce immune exclusion and therapy resistance: canonical axes include CXCL12-CXCR4, which sequesters effector T cells in the stroma and whose inhibition synergizes with checkpoint blockade, and CCL2-CCR2, which recruits immunosuppressive monocytes/macrophages20,21,22. Consistently, single-cell analyses of CRC liver metastasis implicate myCAFs in ECM remodeling and pro-metastatic niches and highlight specific ligand-receptor programs, reinforcing the relevance of our LR crosstalk surrogate and heterogeneity target H as mechanistic readouts23.

Mechanistically, iCAF programs can be induced by IL-1 → LIF → JAK/STAT signaling and antagonized by TGF-β, offering a molecular switch for inflammatory vs. myofibroblastic states and explaining spatial gradients around tumor glands15,24. These observations motivate biology-aware priors for our imaging surrogates: contemporary single-cell communication frameworks formalize LR evidence at different levels—CellPhoneDB accounts for multimeric receptor/ligand architecture and tests enrichment from scRNA-seq, CellChat models signaling probability and pathway information flow, and NicheNet links ligands to target genes in receivers via prior signaling/regulatory graphs—providing principled scaffolds to curate ERS–CAF-immune axes (e.g., CXCL12-CXCR4, CCL2-CCR2)25,26,27,28. Together, these data support our premise that routine images may encode stable, mechanism-grounded surrogates of ERS–CAF activity and its downstream immunoregulation.

We hypothesize that routine images—pre-operative MRI and H&E whole-slide images (WSIs)—encode stable surrogates of ERS–CAF-driven immunoregulation that can be learned and generalized across cancers. This premise is grounded in two observations. First, quantitative radiomics systematically captures phenotypes related to desmoplasia, necrosis, edema, and vascularity, and these imaging phenotypes correlate with molecular programs and outcomes across tumor types29. Second, modern computational pathology has shown that WSIs contain sufficient signal to recover transcriptomic and genomic states—e.g., the HE2RNA model infers RNA-seq expression from H&E30, and deep learning can predict microsatellite instability (MSI) directly from histology31,32—indicating that tissue morphology embeds rich molecular proxies.

To train imaging surrogates that are explicitly mechanism-aware, we construct three bulk-transcriptomic reference scores (surrogate targets) per patient: (i) ERS–CAF abundance as single-sample gene-set activity using ssGSEA, a robust rank-based scoring of pathway activation33, with optional cross-check via GSVA for sensitivity analysis34; (ii) ERS–CAF-immune LR-interaction strength computed from curated ligand-receptor pairs (ERS–CAF ligands and immune-cell receptors) assembled from single-cell communication resources (CellChat/CellPhoneDB/NicheNet; see Methods) and aggregated at the sample level; and (iii) microenvironmental heterogeneity as Shannon diversity over immune/stromal cell fractions estimated by digital cytometry (CIBERSORTx)35. These targets span abundance (ERS–CAF), crosstalk (LR), and composition (heterogeneity), offering complementary supervision that encourages the model to recover biologically specific aspects of the TME rather than generic tumor burden. Because stromal programs (e.g., CAF/iCAF/myCAF axes) are conserved across stroma-rich cancers, we expect surrogates learned in chordoma to exhibit pan-cancer portability when applied zero-shot to external cohorts. Importantly, these endpoints are bulk-derived molecular surrogates rather than directly measured biological quantities with spatial resolution. Our goal is therefore to learn image-based predictors of these clinically informative transcriptomic states, and we validate spatial plausibility using independent pathology annotations of stromal remodeling and immune exclusion.
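As a minimal illustration of target (iii), the sketch below computes Shannon diversity over a vector of cell-type fractions. It assumes natural logarithms and renormalization of the fractions to sum to one; the function name and the example fractions are hypothetical and are not taken from our pipeline.

```python
import math

def shannon_heterogeneity(fractions, eps=1e-12):
    """Shannon diversity (natural log) over non-negative cell-type fractions.

    Assumption: fractions are renormalized to sum to 1 before scoring.
    """
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")
    probs = [f / total for f in fractions]
    # Terms with p close to 0 contribute nothing (limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probs if p > eps)

# Hypothetical deconvolution outputs for two samples with four cell types.
uniform = shannon_heterogeneity([0.25, 0.25, 0.25, 0.25])  # maximally mixed
skewed = shannon_heterogeneity([0.97, 0.01, 0.01, 0.01])   # one dominant type
```

An evenly mixed sample attains the maximum value log(K) for K cell types, whereas a sample dominated by one population scores close to zero.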

In summary, our contributions are as follows: (i) we define and compute these molecular gold standards from bulk RNA-seq; (ii) we train a multimodal AI with CLIP-guided pathomics, radiomics, and mid-level gated fusion to predict the gold standards from images; (iii) we demonstrate prognostic value in an internal chordoma cohort; (iv) we assess pan-cancer generalization and treatment predictivity in TCGA stroma-rich cancers; and (v) we distill the model to MRI-only for deployability.

Our methodological contribution is system-level and mechanism-aware: we formulate non-invasive decoding as multi-target regression to mechanism-defined transcriptomic surrogates (ERS–CAF program activity, ERS–CAF-immune LR crosstalk, and microenvironmental heterogeneity), and we introduce a stage-wise curriculum that (Stage I) injects a biological text prior via target-specific prompt banks and monotone similarity-label alignment to steer tile selection, (Stage II) learns complementary radiology-pathology fusion with robustness to missing modalities, and (Stage III) improves multi-site generalization with stable, non-adversarial second-order alignment evaluated by leave-site-out tests. We further differentiate from path-transcriptomics by predicting three interpretable mechanistic surrogates rather than high-dimensional gene expression, and we provide prompt-perturbation and spatial plausibility evidence to validate that text guidance is mechanism-consistent rather than a generic MIL heuristic.

Results

We analyzed a 126-patient internal chordoma cohort with matched pre-operative MRI (T1, T2, contrast-enhanced), diagnostic H&E WSIs, bulk RNA-seq, and longitudinal outcomes (OS, PFS). Inclusion/exclusion, imaging protocols, segmentation/tiling QC and RNA-seq processing are detailed in Section 9. For external validation, we applied the frozen model to stroma-rich TCGA cancers (PAAD, STAD, COADREAD) with WSIs, RNA-seq, and clinical endpoints.

Molecular surrogates of the ERS–CAF axis

We summarize supervision targets using cohort-wide distributions and pairwise associations (Fig. 1). Distributions are unimodal and approximately symmetric, and the three surrogates show weak-to-moderate positive correlations, supporting complementary supervision signals. Exact coefficients (N = 126): r(ERS–CAF, LR) = 0.48, r(ERS–CAF, H) = 0.20, r(LR, H) = 0.27 (all FDR < 0.05).
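For readers who wish to reproduce this kind of summary, the following sketch shows a Pearson correlation and a Benjamini-Hochberg adjustment in plain Python. We assume here that the reported FDR values come from a Benjamini-Hochberg correction, which the text does not state explicitly; the function names and the example p-values are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for the three pairwise tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```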

Fig. 1: Molecular surrogate landscape (N = 126).

Top: distributions of the three transcriptomic surrogates (z-scores). Middle: pairwise scatter plots with r and two-sided p annotated. Bottom: Pearson correlation matrix with on-cell coefficients. a ERS-CAF abundance. b LR interaction. c H heterogeneity. d ERS-CAF vs. LR (scatter). e ERS-CAF vs. H (scatter). f LR vs. H (scatter). g Correlation heatmap (r).

We anchor the continuous “molecular” scores to visible histomorphology by testing whether attention hotspots co-localize with pathologist-annotated fibrosis/immune-exclusion. Board-certified pathologists (blinded to model outputs) graded fibrosis and immune exclusion per patient on a 0–3 scale; we report: (i) IoU (Jaccard) between the top-K attention mask and the fibrosis ROI; (ii) Hit-rate for immune-exclusion point annotations falling inside the top-K attention. Analyses are stratified as Low (grades 0–1) vs High (grades 2–3) and compared to a randomized baseline (attention shuffled).
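The two co-localization metrics can be sketched on a tile grid as follows. This is a simplified illustration that assumes attention is stored as a 2D grid of tile scores and that ROIs and point annotations are expressed as (row, column) tile coordinates; the helper names are hypothetical.

```python
def top_k_mask(attention, k):
    """Set of (row, col) positions of the k highest-attention tiles."""
    cells = [(value, (r, c))
             for r, row in enumerate(attention)
             for c, value in enumerate(row)]
    cells.sort(key=lambda item: item[0], reverse=True)
    return {pos for _, pos in cells[:k]}

def iou(mask_a, mask_b):
    """Intersection-over-union (Jaccard index) of two tile sets."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0

def hit_rate(points, mask):
    """Fraction of annotated points that fall inside the mask."""
    return sum(p in mask for p in points) / len(points) if points else 0.0

# Toy 3x3 attention grid with a hypothetical fibrosis ROI and two points.
attention = [[0.9, 0.8, 0.1],
             [0.7, 0.2, 0.1],
             [0.1, 0.1, 0.0]]
mask = top_k_mask(attention, k=3)
fibrosis_iou = iou(mask, {(0, 0), (0, 1), (1, 1)})
exclusion_hit = hit_rate([(0, 0), (2, 2)], mask)
```

A randomized baseline of the kind described above could be obtained by shuffling the attention values across tiles before calling `top_k_mask`.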

As shown in Fig. 2: (i) Fibrosis. Mean IoU increased from 0.28 in Low (grades 0–1) to 0.47 in High (grades 2–3), while the randomized baseline averaged 0.12 (all N = 126). High vs. Low: p < 10⁻⁶; High vs. baseline: p < 10⁻⁸. (ii) Immune exclusion. Mean hit-rate increased from 0.36 in Low to 0.55 in High, with a baseline of 0.18; High vs. Low: p < 10⁻⁶; High vs. baseline: p < 10⁻⁸. (iii) Pass criteria. Co-localization metrics are significantly above the randomized baseline and monotone with grade, satisfying the pre-registered gate.

Fig. 2: Morphology co-localization.

Overlays illustrate alignment (left) versus diffusion (right). Quantitatively, IoU and hit-rate increase with pathologist grade and are markedly above randomized baselines. a High fibrosis: attention aligns with the pathologist ROI (contour). b Low fibrosis: attention is diffuse and poorly overlaps the ROI. c Fibrosis co-localization (IoU) by strata and baseline. d Immune-exclusion co-localization (hit-rate) by strata and baseline.

Attention hotspots concentrate on desmoplastic and immune-excluded regions when grades are high, tying the model’s molecular surrogates to recognizable histology and strengthening biological plausibility.

Multimodal decoding from MRI and H&E

Using 5-fold patient-level cross-validation stratified by site/scanner, the proposed CLIP-guided WSI branch plus MRI radiomics with gated fusion achieved the best agreement with molecular surrogates across all targets (Table 1). Relative to strong single-modality baselines, fusion improved macro-average Pearson correlation by +0.07–+0.12 absolute and reduced calibration error (slope 0.97 ± 0.06 vs. 0.88 ± 0.07 for the best single-modality model). RMSE showed consistent reductions (median −9.4% vs. WSI-only CLIP and −14.6% vs. MRI-only; extended table). Predicted vs. observed scatter and calibration plots exhibited tight linearity without gross heteroscedasticity.
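The headline metrics can be illustrated with a short sketch. We assume here that macro-r is the unweighted mean of per-target Pearson correlations and that the calibration slope is the least-squares slope of observed on predicted scores; both definitions are our reading of the text rather than a statement of the exact evaluation code, and the toy values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def calibration_slope(pred, obs):
    """Slope b of the least-squares line obs = a + b * pred (ideal: 1)."""
    n = len(pred)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var = sum((p - mp) ** 2 for p in pred)
    return cov / var

def macro_r(pred_by_target, obs_by_target):
    """Unweighted mean of per-target Pearson r (assumed macro definition)."""
    rs = [pearson_r(pred_by_target[t], obs_by_target[t])
          for t in obs_by_target]
    return sum(rs) / len(rs)

# Toy predictions for two targets: one perfectly aligned, one inverted.
pred = {"ERS-CAF": [1, 2, 3, 4], "LR": [1, 2, 3, 4]}
obs = {"ERS-CAF": [2, 4, 6, 8], "LR": [4, 3, 2, 1]}
toy_macro = macro_r(pred, obs)
toy_slope = calibration_slope(pred["ERS-CAF"], obs["ERS-CAF"])
```

A slope below 1 indicates predictions that are too extreme relative to the reference scores, and a slope above 1 indicates predictions that are too compressed.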

Table 1 Cross-validated prediction of transcriptomic surrogates on the 126-patient chordoma cohort

Table 2 benchmarks CLIP guidance and fusion gating against modern WSI MIL baselines, a histology SSL encoder, an HE2RNA-style image-to-transcriptomics pipeline, and alternative fusion strategies under identical fivefold splits and QC. (i) WSI-only. CLIP-guided MIL is the best WSI-only model (Macro-r = 0.610 ± 0.046), improving over ABMIL (Δ+0.051) and outperforming CLAM/DSMIL/TransMIL (best Macro-r = 0.541) and CTransPath+MIL (0.548), with consistent gains across ERS–CAF, LR, and H. (ii) Image-to-transcriptomics. HE2RNA-style WSI → expr → score underperforms (Macro-r = 0.480 ± 0.061), suggesting that predicting low-dimensional mechanism-defined surrogates is more data-efficient than high-dimensional expression regression in small-N settings. (iii) Fusion. Gated fusion improves over late fusion, concat+MLP, and transformer fusion (best alternative Macro-r = 0.604 ± 0.045), and CORAL adds further gains (Δ + 0.034), supporting complementary modality use plus stable multi-site alignment.

Table 2 Stronger WSI and fusion baselines under the same protocol (Chordoma, N = 126)

Ablations: (1) Replacing CLIP-guided attention with vanilla ABMIL reduced WSI-only performance by 0.05 to 0.06 absolute r across targets, indicating gains from mechanism-aware prompts. (2) Removing CORAL from fusion reduced macro-r by 0.03 (Table 1), consistent with site/scanner harmonization benefits. (3) Modality dropout (q = 0.2) improved robustness (single-modality r drop < 0.04) without harming fusion accuracy.

Prognostic value in chordoma

We summarize risk stratification with two concise plots (Fig. 3). The combined forest plot shows adjusted HRs (OS & PFS) for ERS–CAF (T3 vs T1), LR (T3 vs T1), and H (per SD); the time-dependent C-index curve demonstrates a consistent improvement from 0.64 to 0.71 at 12 months when adding AI scores to the clinical model.

Fig. 3: Concise prognostic summary (N = 126).

Left: combined forest plot (single chart) with OS (circles) and PFS (squares) hazard ratios and 95% CIs on a log scale; the vertical line marks HR = 1. Right: time-dependent C-index comparing clinical-only vs. clinical+AI models; at 12 months, the C-index rises from 0.64 to 0.71 (Δ + 0.07).
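For orientation, the sketch below computes Harrell's concordance index for right-censored data. This is a simplification: it is not the time-dependent estimator used for the curve in Fig. 3, and the function name and toy data are hypothetical. It only illustrates what a concordance of 0.64 versus 0.71 measures, namely the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed event times.

```python
def concordance_index(times, events, risks):
    """Harrell's C for right-censored data; higher risk should fail earlier.

    A pair (i, j) is comparable when patient i had an observed event and
    times[i] < times[j]. Ties in predicted risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up in months, event indicator, predicted risk.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_bad = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
c_flat = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```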

To validate the necessity and stability of text priors, we compare: (i) no text (NONE), (ii) fixed curated prompts (CUR), and (iii) learnable CoOp (COOP); we further probe sensitivity to synonym prompts (SYN), antonym/negation (ANTI), and out-of-domain phrases (OOD). For each prompt variant and target (ERS–CAF, LR, H), we compute 5-fold Pearson r against the transcriptomic gold standards using the WSI branch in isolation (radiomics kept fixed).

We observe that: (i) Guidance helps. COOP achieves the highest mean r across all targets, CUR ranks second, and NONE trails (Fig. 4a). (ii) Stability to synonyms. SYN prompts cluster tightly around CUR with limited inter-fold spread (Fig. 4b), indicating robustness to wording. (iii) Sensitivity to wrong priors. OOD and especially ANTI degrade performance toward or below the no-text baseline, confirming that mechanistically inconsistent prompts harm learning. (iv) Mechanistic localization. Attention maps are more focal under CUR (Fig. 4c) and diffuse without text (Fig. 4d), supporting biology-aware tile selection.

Fig. 4: Stage I (CLIP guidance) evidence.

Left: curated/learnable prompts improve concordantly across ERS–CAF, LR, and H; synonyms remain stable while OOD/antonyms deteriorate. Right: text guidance drives focused tile attention, whereas no-text yields diffuse, low-contrast maps. a Prompt × target performance (mean r, 5-fold). b Macro-r across folds by prompt group. c Attention overlay with curated prompts. d Attention overlay with no text.

Did Stage II (gated fusion) learn complementarity? As shown in Fig. 5: (i) Weight migration follows clinical intuition: sacrum/small-volume/<50y strata show WSI-dominant gating (mean WSI weights 0.60–0.61), while skull-base/large-volume/>65y strata shift toward MRI (weights 0.55–0.62), and fusion achieves the best macro-r in every stratum (Fig. 5a, b). (ii) Robust under missing modalities: when randomly dropping WSI tiles, fusion’s macro-r declines from 0.68 to 0.64 at 80% loss (Δr = 0.04) and to 0.61 at 100%, whereas WSI-only falls from 0.61 to 0.46 (Fig. 5c). When removing MRI sequences (T1/T2/CE), fusion drops to 0.66 at 2/3 missing and 0.62 at 100% missing, while MRI-only collapses, confirming that fusion falls back to the intact modality (Fig. 5d). (iii) Pass criteria: the drop at ≥80% missing is controlled (Δr ≤ 0.05); gating shifts are directionally consistent with phenotype/geometry expectations, which is evidence of learned complementarity rather than single-modality dominance.

Fig. 5: Stage II (gated fusion) learns complementary use of WSI and MRI.

Left: fusion allocates higher WSI weight in sacrum/small-volume/<50y groups, consistent with imaging/pathology priors. Right: fusion maintains macro-r with Δr ≤ 0.04 even at 80% WSI loss, and ≤ 0.06 when two MRI sequences are missing; single-modality lines degrade markedly. a Gating weights across strata (WSI vs MRI). b Performance across strata (macro-r, Fusion vs WSI-only vs MRI-only). c Robustness to WSI tile missing (0–100%). d Robustness to MRI sequence missing (0–100%).

Is gating shift an artifact of radiomics features? To clarify, the gate is trained on held-out CV folds and is not a hard-coded preference for either modality. If MRI radiomics were systematically unreliable, we would expect the gate to collapse toward WSI across strata and under MRI-sequence ablations. Instead, Fig. 5 shows bidirectional and clinically sensible shifts: skull-base/larger-volume/older strata allocate higher MRI weight, while sacrum/smaller-volume/younger strata allocate higher WSI weight. Moreover, under missing-modality stress tests, fusion falls back to the intact modality and degrades smoothly rather than exhibiting a WSI-only collapse: when WSI tiles are removed, fusion performance remains relatively stable (macro-r 0.68 → 0.64 at 80% loss), and when MRI sequences are removed, fusion similarly degrades but remains functional (0.66 at 2/3 missing; 0.62 at 100% missing), which is inconsistent with the gate being forced into WSI due to fragile MRI features.
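The fallback behavior discussed here can be illustrated with a toy two-modality gate. The sketch assumes a softmax gate over one scalar logit per modality and a convex combination of same-sized embeddings, with the logit of a missing modality masked out; this is one plausible form of gated fusion, not necessarily the exact architecture, and all names and numbers are hypothetical.

```python
import math

def gate_weights(logit_wsi, logit_mri, has_wsi=True, has_mri=True):
    """Softmax gate over two modality logits; a missing modality gets weight 0."""
    if not (has_wsi or has_mri):
        raise ValueError("at least one modality must be present")
    logits = [logit_wsi if has_wsi else -math.inf,
              logit_mri if has_mri else -math.inf]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

def fuse(z_wsi, z_mri, weights):
    """Convex combination of two same-length modality embeddings."""
    return [weights[0] * a + weights[1] * b for a, b in zip(z_wsi, z_mri)]

# Both modalities present with equal logits: the gate splits evenly.
balanced = gate_weights(0.0, 0.0)
# MRI missing: all weight moves to the intact WSI branch.
wsi_only = gate_weights(0.4, 0.0, has_mri=False)
fused = fuse([1.0, 2.0], [3.0, 4.0], balanced)
```

Because the weights are produced by the gate rather than fixed, shifts across strata reflect what the gate learned from data, and masking simply renormalizes over the modalities that remain.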

Did Stage III (CORAL) actually remove site/domain confounding? Following the design: (1) Leave-site-out (LSO) validation: train on 4 sites and test on the held-out site; report Macro-r before/after CORAL per site. (2) Embedding visualization: 2D embeddings colored by site before/after CORAL; quantify mixing via kNN (k = 10) site-label entropy (higher = better mixing). We found that: (i) Embedding mixing increased substantially. The 10-NN site-label entropy rose from 0.053 (before) to 0.854 (after), indicating strong cross-site mixing in the post-alignment space (Fig. 6a, b). (ii) LSO performance improved across all sites. Macro-r (mean ± sd) improved from {0.612 ± 0.036, 0.583 ± 0.041, 0.598 ± 0.039, 0.571 ± 0.040, 0.626 ± 0.034} to {0.664 ± 0.032, 0.642 ± 0.035, 0.651 ± 0.033, 0.636 ± 0.036, 0.673 ± 0.030} for Sites A–E (Fig. 6c). (iii) Pass criteria met. (1) LSO Macro-r gains are consistent and clinically meaningful; (2) the embedding shows clear cross-site mixing post-CORAL, matching the goal of statistical domain alignment without adversarial training.
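The mixing statistic can be sketched as follows. For each embedded sample we take its k nearest neighbors (Euclidean distance), compute the Shannon entropy of their site labels, and average over samples. Dividing by the log of the number of sites, so that the score lies in [0, 1], is an assumption made for this illustration; the function name and toy points are hypothetical.

```python
import math
from collections import Counter

def knn_site_entropy(points, sites, k=10, normalize=True):
    """Mean Shannon entropy of site labels among each point's k nearest neighbors."""
    entropies = []
    for i, p in enumerate(points):
        # Distances to every other point, nearest first.
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        neighbor_sites = [sites[j] for _, j in dists[:k]]
        counts = Counter(neighbor_sites)
        total = len(neighbor_sites)
        entropies.append(-sum((c / total) * math.log(c / total)
                              for c in counts.values()))
    mean_entropy = sum(entropies) / len(entropies)
    n_sites = len(set(sites))
    if normalize and n_sites > 1:
        mean_entropy /= math.log(n_sites)  # assumed normalization to [0, 1]
    return mean_entropy

# Two well-separated site clusters versus interleaved sites on a line.
clustered_pts = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
clustered = knn_site_entropy(clustered_pts, ["A"] * 3 + ["B"] * 3, k=2)
mixed_pts = [(float(x), 0.0) for x in range(6)]
mixed = knn_site_entropy(mixed_pts, ["A", "B"] * 3, k=2)
```

Site-separated embeddings give an entropy near zero because every neighborhood is drawn from a single site, and the score rises as neighborhoods become more mixed.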

Fig. 6: Stage III (CORAL) reduces domain clustering and improves leave-site-out performance.

Top: site-colored embeddings show strong pre-alignment clustering that mixes after CORAL; the average 10-NN site-label entropy increases markedly (cf. text). Bottom: in LSO testing, Macro-r improves consistently for all sites after CORAL. a Embedding before CORAL (colored by site). b Embedding after CORAL (colored by site). c LSO Macro-r by site (mean ± sd over folds; before vs. after).

CORAL reduces second-order (covariance) discrepancies among sites, yielding embeddings that are less site-identifiable but more biology-consistent. The concomitant LSO gains suggest that alignment corrects nuisance variation rather than suppressing signal.
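To make the second-order alignment concrete, the sketch below implements the standard two-domain CORAL penalty, the squared Frobenius norm of the difference between feature covariance matrices scaled by 1/(4d²). How this penalty is combined across the five sites is not specified here, so the pairwise form is an assumption, and the toy features are hypothetical.

```python
def covariance(rows):
    """Unbiased sample covariance matrix of row-vector features."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)]
            for a in range(d)]

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    frob_sq = sum((cs[a][b] - ct[a][b]) ** 2
                  for a in range(d) for b in range(d))
    return frob_sq / (4 * d * d)

# Toy 2-D features from two hypothetical sites.
site_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
site_b_shifted = [[x + 5.0, y - 2.0] for x, y in site_a]  # mean shift only
site_b_scaled = [[2.0 * x, 2.0 * y] for x, y in site_a]   # covariance change
loss_shifted = coral_loss(site_a, site_b_shifted)
loss_scaled = coral_loss(site_a, site_b_scaled)
```

The penalty ignores pure mean shifts and responds only to differences in covariance, which is why it addresses second-order site effects and may leave higher-order shifts untouched.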

Pan-cancer generalization and treatment predictivity

To assess cross-cancer portability, we performed a zero-shot evaluation on three stroma-rich TCGA cohorts: TCGA-PAAD (pancreatic adenocarcinoma; N = 150)36, TCGA-STAD (stomach adenocarcinoma; N = 295)37, and TCGA-COADREAD (colorectal adenocarcinoma; N = 276)38. Diagnostic H&E WSIs and matched bulk RNA-seq were obtained from the NCI Genomic Data Commons (GDC)39, and survival endpoints were standardized following TCGA-CDR recommendations when applicable40. Importantly, the chordoma-trained model was applied frozen (no TCGA fine-tuning, calibration, or hyperparameter selection on TCGA). Due to heterogeneous radiology availability across TCGA projects, this analysis uses WSI input only (radiology withheld) and compares predicted image-derived surrogates against the per-cohort transcriptomic reference scores computed by the same pipeline (ssGSEA/CIBERSORTx/LR aggregation; Section 9).

Zero-shot application to TCGA cohorts yielded significant concordance between image-derived and transcriptomic surrogates (Table 3; Fig. 7). For PAAD, Pearson r was 0.42 (ERS–CAF), 0.39 (LR), and 0.36 (H); for STAD, 0.38/0.35/0.33; for COADREAD, 0.35/0.31/0.29 (all FDR < 0.01). Figure 7 visualizes the patient-level agreement underlying Table 3 and includes bootstrap confidence intervals and a permutation control (shuffled predictions) to demonstrate that the observed concordance is not attributable to chance alignment.

Fig. 7: Zero-shot pan-cancer generalization on TCGA.

The chordoma-trained model is applied as-is (frozen weights, no TCGA tuning) to TCGA-PAAD/STAD/COADREAD WSIs. Patient-level scatter plots visualize concordance against transcriptomic reference scores for the three targets. Concordance is summarized with 95% CIs, and permutation controls show correlations are not explained by chance alignment. a TCGA-PAAD: zero-shot concordance (patient-level). b TCGA-STAD: zero-shot concordance (patient-level). c TCGA-COADREAD: zero-shot concordance (patient-level). d Pearson r with 95% CIs (Fisher transform; by cancer and target). e Permutation control: shuffled predictions yield r ≈ 0.
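The two controls in panels d and e can be sketched as follows: a Fisher-transform confidence interval for a Pearson correlation, and a permutation test that shuffles predictions against fixed reference scores. The number of permutations, the random seed, and the two-sided p-value with add-one correction are illustrative choices, not a description of the exact analysis settings.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for Pearson r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def permutation_p(pred, ref, n_perm=1000, seed=0):
    """Two-sided permutation p-value for |r| under shuffled predictions."""
    rng = random.Random(seed)
    observed = abs(pearson_r(pred, ref))
    shuffled = list(pred)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(shuffled, ref)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Interval for the PAAD ERS-CAF correlation reported in the text (r = 0.42, N = 150).
low, high = fisher_ci(0.42, 150)
# Toy perfectly concordant predictions for the permutation control.
ref = list(range(30))
pred = [2 * v + 1 for v in ref]
p_value = permutation_p(pred, ref, n_perm=200)
```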

Table 3 External generalization and deployability

Beyond molecular concordance, the AI surrogates remained clinically informative in external cohorts: they were prognostic in PAAD (C-index + 0.04 over clinical-only) and STAD (+0.03). Where treatment outcome annotations were available from TCGA clinical fields curated in TCGA-CDR40, the LR surrogate stratified chemotherapy response in PAAD (AUC 0.66 ± 0.03; adjusted OR per SD 1.53 [1.18–1.99]). In COADREAD, the directionality and magnitude of the decoded ER stress-immune associations are consistent with prior CRC studies linking ER-stress signatures to immune dysfunction/exclusion and immunotherapy-related phenotypes, including an ERS gene scoring system that stratifies putative ICB benefit and nominates ZNF703 as a candidate target8.

Toward deployability: MRI-only knowledge distillation

An MRI-only student distilled from the multimodal teacher retained 94% of macro-r (teacher 0.68 ± 0.03 vs. student 0.64 ± 0.03), with calibration slope 0.94 ± 0.08. Inference was 4.6× faster and used 2.9× less GPU memory (Table 3), supporting non-invasive, single-modality deployment.
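As a rough illustration of regression distillation, the sketch below mixes a term that pulls the student toward the teacher's predictions with a term that pulls it toward the transcriptomic reference. The mixing weight alpha and the use of mean squared error for both terms are assumptions made for this example; the exact distillation objective is not restated here. The final line reproduces the retention arithmetic from the reported macro-r values.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, target, alpha=0.5):
    """alpha * MSE(student, teacher) + (1 - alpha) * MSE(student, target).

    Assumption: alpha is a hypothetical mixing weight chosen for illustration.
    """
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, target)

# A student that matches both teacher and reference incurs zero loss.
perfect = distillation_loss([1.0, 2.0], [1.0, 2.0], [1.0, 2.0])
# Toy mismatch: squared error 1 to the teacher and 4 to the reference.
mismatch = distillation_loss([0.0], [1.0], [2.0], alpha=0.5)

# Retention of macro-r by the student relative to the teacher (0.64 vs. 0.68).
retention = 0.64 / 0.68
```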

Why does TCGA performance drop? The TCGA setting introduces a large distribution shift even within histology: differences in tumor histoarchitecture (e.g., glandular vs. physaliphorous patterns), stroma localization, slide preparation/scanning, and cohort composition (tumor purity and spatial sampling mismatch between the WSI region and bulk RNA-seq). Since TCGA evaluation uses WSI-only, the observed drop cannot be attributed to MRI → CT radiomics shift. To quantify the magnitude of the cross-cancer shift, Table 4 contrasts in-distribution chordoma WSI-only performance with TCGA mean concordance (and mean explained variance).

Table 4 Diagnostic of TCGA performance drop (WSI-only)

Discussion

This study demonstrates that routine images (pre-operative MRI and H&E WSIs) can non-invasively decode the ERS–CAF-centric immunoregulatory axis and yield image-derived molecular surrogates that generalize across cancers and stratify outcomes. On the internal 126-patient chordoma cohort, the multimodal model—comprising CLIP-guided pathomics, radiomics, and gated mid-level fusion—showed the strongest agreement with transcriptomic gold standards and improved calibration versus single-modality baselines. Stage-wise analyses clarified the sources of gains: Stage I text guidance increased task-specific attention fidelity and robustness to wording; Stage II fusion learned clinically sensible complementarity between WSI and MRI and degraded gracefully under missing modalities; Stage III CORAL reduced site confounding and improved leave-site-out (LSO) macro-r across all contributing centers. Beyond correlation, attention hotspots co-localized with pathologist-annotated fibrosis and immune exclusion, strengthening the biological plausibility of the decoded surrogates. Zero-shot application to stroma-rich TCGA cancers retained significant concordance, and image-derived scores improved discrimination and decision utility for prognosis and treatment stratification.

We emphasize that the novelty is not in introducing any single building block in isolation, but in an end-to-end, mechanism-aware formulation and training curriculum for decoding the ERS–CAF immunoregulatory axis from routine images under small-N, multi-site constraints. We adopt CORAL as a stable, non-adversarial alignment objective that can be validated transparently by leave-site-out tests; adversarial alignment methods can be effective but are often sensitive to domain imbalance and optimization instability in limited-data clinical settings. Accordingly, we focus on rigorous generalization evidence (LSO gains and site-mixing improvements) and robustness diagnostics (missing-modality stress tests) to support deployability.

The findings are consistent with a model in which ERS–CAF programs reshape the TME through ligand-receptor (LR) crosstalk that recruits and repatterns immune cell populations and remodels the extracellular matrix. The observed monotone associations—higher ERS–CAF/LR scores with worse outcomes, and higher microenvironmental heterogeneity (H) with partial protection—track known mechanisms of desmoplasia and immune exclusion. Two independent lines of evidence support biological specificity: (i) text-guided attention concentrates on histologic patterns congruent with ERS–CAF activation (e.g., collagen-dense stroma, immune-poor niches), whereas wrong or out-of-domain prompts degrade accuracy; (ii) the LR-interaction surrogate aligns with curated axes and with immune-cell fractions inferred from bulk deconvolution, suggesting that imaging surrogates are not merely capturing generic tumor burden but immunoregulatory state.

Non-invasive surrogates derived from routine images can augment clinical decision-making in several ways. First, they enable triage for biopsy: patients with high predicted ERS–CAF/LR activity could be prioritized for confirmatory molecular assays when tissue is scarce. Second, they support risk-adapted surveillance: image-derived risk tiers improved time-dependent C-index and decision curves, indicating practical net benefit across clinically relevant thresholds. Third, they offer a path toward treatment selection and enrichment: the LR surrogate separated chemotherapy responders in external cancers, motivating prospective enrichment strategies or adaptive trial designs. Finally, distillation to an MRI-only student retained most accuracy with markedly reduced compute/memory, suggesting a feasible deployment route in settings where WSIs are not routinely digitized or cross-modality acquisition is staggered.

This is a retrospective, single-disease training study with N = 126 internal patients, which limits precision for subgroup analyses and may inflate optimism despite careful cross-validation. Imaging modality heterogeneity (MRI internally, CT common in TCGA) introduces a shift that we partially mitigated by modality-agnostic radiomic families but did not eliminate. External zero-shot correlations in TCGA are moderate, so our claim is not that images can fully reconstruct bulk molecular scores. Rather, the aim is to recover actionable, low-dimensional surrogates that generalize sufficiently to stratify outcomes and treatment response. Improving cross-cancer molecular fidelity via per-cancer calibration or lightweight adaptation is a natural next step.

Pathologist grades (fibrosis and immune exclusion) remain semi-quantitative and may vary across readers; although we observed monotone trends and strong baselines, richer spatial annotations and inter-rater modeling would further solidify claims. Critically, we lack direct spatial ground truth for ER stress specifically within CAFs (e.g., multiplex IHC/IF for HSPA5/ATF4/XBP1s co-stained with COL1A1/ACTA2/PDGFRB, or spatial transcriptomics), so our spatial validation uses downstream morphology (fibrosis/immune exclusion) rather than a cell-resolved ER-stress label; acquiring such assays is an important direction for prospective validation.

The LR curation relies on current communication databases and chordoma single-cell priors; incompleteness or context-dependence could bias the surrogate. CORAL reduces second-order site effects but may not remove higher-order domain shifts; while LSO gains were consistent, multi-center prospective validation is needed. Finally, association does not imply causation: improved prognosis/treatment stratification does not, on its own, establish that modifying the decoded pathway will change outcomes. In addition, extending the method to privacy-preserving learning frameworks such as federated learning is an interesting research direction41.

Methods

We formalize non-invasive decoding of the ER-stressed cancer-associated fibroblast (ERS–CAF) immunoregulatory axis as a supervised, multi-target regression from routine images to transcriptomic gold standards. For each patient i ∈ {1, …, N}, we observe pre-operative MRI (Ri), H&E WSI (Wi), bulk RNA-seq (xi), and a site/scanner domain label di ∈ {1, …, D}. The system contains: (i) a CLIP-guided pathomics branch that converts WSI tiles into task-specific slide embeddings using a curated prompt bank capturing ERS–CAF biology; (ii) a radiomics branch that summarizes MRI with harmonized quantitative features; (iii) a mid-level gated fusion that integrates branches and predicts three continuous targets: ERS–CAF abundance, ERS–CAF-immune ligand-receptor (LR) interaction strength, and TME heterogeneity. Training proceeds in three stages: Stage I calibrates the CLIP-guided WSI branch to molecular scores via a monotone similarity-label alignment; Stage II fits the multimodal regressor with a heteroscedastic multi-task loss; Stage III performs stable, non-adversarial domain harmonization by second-order (CORAL) alignment. Our framework is shown in Fig. 8.

Fig. 8: Stage-wise multimodal pipeline for decoding ERS–CAF-centric TME surrogates from routine images.

Notation and data schema

For patient i, the data comprise MRI volumes \({R}_{i}\in {{\mathbb{R}}}^{H\times W\times Z\times C}\) with C = 3 (T1/T2/contrast-enhanced), WSI Wi, RNA-seq \({x}_{i}\in {{\mathbb{R}}}^{G}\), clinical covariates ci, survival (Ti, Δi), treatment vector ti, and domain di ∈ {1, …, D}. The three continuous supervision targets form \({S}_{i}^{\star }={[{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)]}^{\top }\in {{\mathbb{R}}}^{3}\); predictions are \({\widehat{S}}_{i}\in {{\mathbb{R}}}^{3}\).

For each target j ∈ {ERS−CAF, LR, H}, we define a prompt set \({{\mathcal{Q}}}^{(j)}={\{{q}_{m}^{(j)}\}}_{m=1}^{{M}_{j}}\) and encode it using the CLIP text encoder fT into unit-norm vectors

$${{\bf{t}}}_{m}^{(j)}=\frac{{f}_{T}({q}_{m}^{(j)})}{\parallel {f}_{T}({q}_{m}^{(j)}){\parallel }_{2}}\in {{\mathbb{R}}}^{d},\,m=1,\ldots ,{M}_{j}.$$
(1)

We also define a target-specific prompt-mixture text embedding

$${\bar{{\bf{t}}}}^{(j)}=\mathop{\sum }\limits_{m=1}^{{M}_{j}}{\gamma }_{m}^{(j)}\,{{\bf{t}}}_{m}^{(j)},\,{\gamma }_{m}^{(j)}\ge 0,\,\mathop{\sum }\limits_{m}{\gamma }_{m}^{(j)}=1,$$
(2)

and the concatenated text context used in fusion gating,

$$\bar{{\bf{t}}}=\left[{\bar{{\bf{t}}}}^{({\rm{ERS}}-{\rm{CAF}})};\,{\bar{{\bf{t}}}}^{({\rm{LR}})};\,{\bar{{\bf{t}}}}^{(H)}\right].$$
(3)

We use \({{\bf{t}}}_{m}^{(j)}\), \({\bar{{\bf{t}}}}^{(j)}\), and \(\bar{{\bf{t}}}\) consistently throughout; we avoid overloaded alternatives (e.g., f(j)) to prevent ambiguity.

Bulk-transcriptomic reference scores (surrogate supervision targets)

ERS–CAF abundance

With curated gene set \({G}_{{\rm{ERS}}-{\rm{CAF}}}\subseteq \{1,\ldots ,G\}\) and per-sample gene ranking πi, define the ssGSEA running sum

$${S}_{{\rm{ERS}}-{\rm{CAF}}}(i)=\mathop{\sum }\limits_{k=1}^{G}{\Delta }_{i}(k),$$
(4)
$${\Delta }_{i}(k)=\frac{{\sum }_{g\in {G}_{{\rm{ERS}}-{\rm{CAF}}}}{\bf{1}}\{{{\rm{rank}}}_{i}(g)\le k\}\,| {x}_{i,g}{| }^{p}}{{\sum }_{g\in {G}_{{\rm{ERS}}-{\rm{CAF}}}}| {x}_{i,g}{| }^{p}}-\frac{{\sum }_{g\notin {G}_{{\rm{ERS}}-{\rm{CAF}}}}{\bf{1}}\{{{\rm{rank}}}_{i}(g)\le k\}}{G-| {G}_{{\rm{ERS}}-{\rm{CAF}}}| }.$$
(5)

where ranki(g) is the position of gene g in πi (descending order) and p ∈ [0, 2] is the weighting exponent. Scores are z-scored cohort-wise.
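As a minimal sketch of Eqs. (4) and (5), the running-sum score for one sample might be computed as follows. The function name `ssgsea_score` and the default exponent `p=0.25` are illustrative assumptions, not the authors' implementation; cohort-wise z-scoring would be applied afterwards across patients.

```python
import numpy as np

def ssgsea_score(x, gene_set_idx, p=0.25):
    """Running-sum enrichment score for one sample, following Eqs. (4)-(5).

    x            : 1-D array of expression values for G genes (one sample).
    gene_set_idx : indices of the genes in the curated set.
    p            : weighting exponent in [0, 2] (0.25 is an assumed default).
    """
    x = np.asarray(x, dtype=float)
    G = x.size
    in_set = np.zeros(G, dtype=bool)
    in_set[np.asarray(gene_set_idx)] = True

    order = np.argsort(-x, kind="stable")          # descending ranking pi_i
    hit = in_set[order]
    weights = np.where(hit, np.abs(x[order]) ** p, 0.0)

    # Weighted ECDF of in-set genes minus uniform ECDF of out-of-set genes.
    # Assumes the in-set weights do not all vanish and the set is a strict subset.
    p_hit = np.cumsum(weights) / weights.sum()
    p_miss = np.cumsum(~hit) / (G - hit.sum())
    return float(np.sum(p_hit - p_miss))           # Eq. (4): sum over k of Delta_i(k)
```

With p = 0 and five genes, a set occupying the top two ranks gives a positive score and a set occupying the bottom two ranks gives the mirrored negative score.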

ERS–CAF-immune LR interaction

Let \({\mathcal{C}}\) denote immune cell types and \({\mathcal{L}}=\{(l,r,c)\}\) the curated triplets (ERS–CAF ligand l, immune receptor r on cell type c). With z-scored expressions \({\widetilde{x}}_{i,\cdot }\) and deconvolved fractions pi,c,

$${S}_{{\rm{LR}}}(i)=\frac{1}{Z}\mathop{\sum }\limits_{(l,r,c)\in {\mathcal{L}}}{\omega }_{lrc}\,{\widetilde{x}}_{i,l}\,{\widetilde{x}}_{i,r}\,{p}_{i,c},\,Z=\mathop{\sum }\limits_{(l,r,c)}{\omega }_{lrc},$$
(6)

where ωlrc ≥ 0 are evidence-based (or uniform) weights.
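The weighted product-sum in Eq. (6) can be sketched for a single patient as below. The dictionary-based inputs, the function name `lr_score`, and the gene and cell-type labels in the usage note are hypothetical placeholders rather than the curated triplets used in the study.

```python
import numpy as np

def lr_score(x_tilde, frac, triplets, weights=None):
    """Eq. (6) for one patient.

    x_tilde  : dict mapping gene -> z-scored expression.
    frac     : dict mapping immune cell type -> deconvolved fraction p_{i,c}.
    triplets : list of (ligand, receptor, cell_type) tuples.
    weights  : optional non-negative evidence weights; uniform if None.
    """
    if weights is None:
        weights = np.ones(len(triplets))
    weights = np.asarray(weights, dtype=float)
    terms = np.array([x_tilde[l] * x_tilde[r] * frac[c] for l, r, c in triplets])
    return float(np.dot(weights, terms) / weights.sum())   # division by Z
```

For example, with two placeholder triplets `("L1", "R1", "T")` and `("L2", "R2", "M")`, uniform weights simply average the two product terms.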

Microenvironmental heterogeneity

From CIBERSORTx proportions \({\{{p}_{i,k}\}}_{k=1}^{K}\) (∑k pi,k = 1),

$$H(i)=-\mathop{\sum }\limits_{k=1}^{K}{p}_{i,k}\log {p}_{i,k},\,\,{\rm{effective\; richness}}\,\exp \{H(i)\}.$$
(7)
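A short sketch of Eq. (7) follows; the function name and the renormalization step are assumptions added for illustration, and zero fractions are handled with the usual convention 0 · log 0 = 0.

```python
import numpy as np

def shannon_heterogeneity(p, eps=1e-12):
    """Shannon entropy H of deconvolved cell fractions (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()              # guard against fractions that do not sum exactly to 1
    nz = p > eps                 # skip zero entries: 0 * log 0 is taken as 0
    return float(-np.sum(p[nz] * np.log(p[nz])))
```

For K equally abundant cell types the entropy is log K, so the effective richness exp{H} equals K; a sample dominated by a single cell type has H = 0.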

Noise floor of computed supervision targets

The supervision targets \({S}_{i}^{(j)}\in \{{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)\}\) are computed from bulk RNA-seq using (1) gene-set scoring (ssGSEA), (2) digital cytometry (CIBERSORTx), and (3) an aggregated LR scoring rule. These procedures are validated and widely used, but they are nevertheless approximations to latent biology and introduce label noise. Therefore, our model learns an image-based surrogate of a bulk-derived surrogate, and performance should be interpreted relative to this noise.

Let \({S}_{i}^{* (j)}\) denote the latent (unobserved) sample-level biological state for target j and assume

$${S}_{i}^{(j)}={S}_{i}^{* (j)}+{\varepsilon }_{i}^{(j)},\,{\mathbb{E}}\left[{\varepsilon }_{i}^{(j)}\right]=0,$$
(8)

where \({\varepsilon }_{i}^{(j)}\) captures scoring/deconvolution error and finite-sample noise in bulk measurements. Define the reliability (signal fraction) of the computed target as

$${\rho }_{j}\,=\,\frac{\mathrm{Var}({S}^{* (j)})}{\mathrm{Var}({S}^{(j)})}\in (0,1].$$
(9)

Then, assuming \({\varepsilon }^{(j)}\) is uncorrelated with both \({S}^{* (j)}\) and the predictor, the observable correlation of any predictor \({\widehat{S}}^{(j)}\) (including our image-derived predictor) is upper bounded by the noise ceiling

$${\rm{Corr}}({\widehat{S}}^{(j)},{S}^{(j)})\,\le \,\sqrt{{\rho }_{j}},$$
(10)

i.e., even a perfect predictor of S*(j) cannot exceed \(\sqrt{{\rho }_{j}}\) correlation with the noisy computed label S(j).
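The noise ceiling in Eq. (10) can be checked numerically under the additive model of Eq. (8). The sketch below assumes Gaussian latent states and independent Gaussian label noise, which is a simplification chosen for illustration; the function name is hypothetical.

```python
import numpy as np

def simulate_noise_ceiling(rho, n=200_000, seed=0):
    """Correlation between a perfect predictor of S* and the noisy label S.

    Var(S*) = 1 and Var(eps) = (1 - rho) / rho, so Var(S*) / Var(S) = rho.
    """
    rng = np.random.default_rng(seed)
    s_star = rng.standard_normal(n)                       # latent state S*
    noise_sd = np.sqrt((1.0 - rho) / rho)
    s_obs = s_star + noise_sd * rng.standard_normal(n)    # computed label S
    return float(np.corrcoef(s_star, s_obs)[0, 1])
```

With reliability ρ = 0.64, the simulated correlation of the perfect predictor settles near √ρ = 0.8, which is the most any predictor uncorrelated with the label noise could reach.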

We estimate ρj (or lower bounds thereof) via pipeline perturbations that produce two independently computed proxies S(j) and \({\widetilde{S}}^{(j)}\) from the same RNA-seq:

  • ERS–CAF score stability: ssGSEA vs. GSVA (or other rank-based scoring), yielding \({\rm{Corr}}({S}_{{\rm{ERS}}-{\rm{CAF}}},{\widetilde{S}}_{{\rm{ERS}}-{\rm{CAF}}})\).

  • Heterogeneity stability: compute H from alternative deconvolution tools (e.g., EPIC/quanTIseq/xCell) and compare to CIBERSORTx-derived H.

  • LR score stability: compare the simple weighted product-sum rule to an alternative bulk-mode LR scoring (e.g., CellChat bulk formulation), reporting \({\rm{Corr}}({S}_{{\rm{LR}}},{\widetilde{S}}_{{\rm{LR}}})\).

These inter-pipeline concordances provide conservative empirical lower bounds on ρj and contextualize attainable prediction accuracy.

WSI pathomics with CLIP-guided MIL

WSI captures micro-ecology (CAF morphology, collagen deposition, immune exclusion). A CLIP prior and mechanism-driven prompts steer attention toward tiles that instantiate ERS–CAF biology. Standard pathology MIL learns attention weights purely from slide-level labels; in contrast, our tile weighting \({\alpha }_{it}^{(j)}\) is explicitly grounded in a target-specific biological text prior through prompt-similarity scoring (Eqns. (12–14)), and Stage I enforces a monotone similarity-label alignment before downstream regression. This design makes the inductive bias inspectable (via prompts) and falsifiable (via synonym/OOD/antonym perturbations in Fig. 4). Moreover, unlike path-transcriptomics approaches that predict large gene expression vectors, we predict three mechanism-defined surrogates (ERS–CAF activity, LR crosstalk, heterogeneity), which are more stable under limited N and directly aligned with our biological hypothesis.

After stain normalization and tissue detection, we tile Wi at 20× into \({\{{W}_{it}\}}_{t=1}^{{T}_{i}}\). The CLIP image encoder fI yields unit-norm tile embeddings

$${{\bf{v}}}_{it}=\frac{{f}_{I}({W}_{it})}{\parallel {f}_{I}({W}_{it}){\parallel }_{2}}\in {{\mathbb{R}}}^{d}.$$
(11)

For each target j ∈ {ERS−CAF, LR, H}, a prompt set \({{\mathcal{Q}}}^{(j)}={\{{q}_{m}^{(j)}\}}_{m=1}^{{M}_{j}}\) is encoded by the CLIP text encoder fT into unit vectors

$${{\bf{t}}}_{m}^{(j)}=\frac{{f}_{T}({q}_{m}^{(j)})}{\parallel {f}_{T}({q}_{m}^{(j)}){\parallel }_{2}},\,{\bar{{\bf{t}}}}^{(j)}=\mathop{\sum }\limits_{m=1}^{{M}_{j}}{\gamma }_{m}^{(j)}{{\bf{t}}}_{m}^{(j)},\,\,{\gamma }_{m}^{(j)}\ge 0,\,\mathop{\sum }\limits_{m}{\gamma }_{m}^{(j)}=1.$$
(12)

Prompt-similarity attention over tiles uses a log-sum-exp temperature τ > 0:

$${s}_{it}^{(j)}=\frac{1}{\tau }\log \mathop{\sum }\limits_{m=1}^{{M}_{j}}\exp \left(\frac{{{\bf{v}}}_{it}^{\top }{{\bf{t}}}_{m}^{(j)}}{\tau }\right),\,{\alpha }_{it}^{(j)}=\frac{\exp ({s}_{it}^{(j)})}{{\sum }_{{t}^{{\prime} }}\exp ({s}_{i{t}^{{\prime} }}^{(j)})},\,{{\bf{p}}}_{i}^{(j)}=\mathop{\sum }\limits_{t=1}^{{T}_{i}}{\alpha }_{it}^{(j)}\,{{\bf{v}}}_{it}.$$
(13)

The slide-level pathomics representation concatenates task-specific embeddings:

$${{\bf{p}}}_{i}=\left[{{\bf{p}}}_{i}^{({\rm{ERS}}-{\rm{CAF}})};\,{{\bf{p}}}_{i}^{({\rm{LR}})};\,{{\bf{p}}}_{i}^{(H)}\right]\in {{\mathbb{R}}}^{3d}.$$
(14)
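The prompt-similarity pooling of Eq. (13) can be sketched with plain NumPy as below. The embeddings are assumed to be precomputed and unit-normalized (no CLIP model is loaded here), the temperature default is arbitrary, and the scaling of the log-sum-exp by 1/τ follows Eq. (13) as written.

```python
import numpy as np

def prompt_attention_pool(V, T, tau=0.5):
    """Pool tile embeddings with prompt-similarity attention, Eq. (13).

    V   : (T_i, d) unit-norm tile embeddings v_it.
    T   : (M_j, d) unit-norm prompt embeddings t_m^(j).
    tau : temperature (0.5 is an assumed value).
    Returns (alpha, pooled) with alpha of shape (T_i,) and pooled of shape (d,).
    """
    sims = (V @ T.T) / tau                                  # (T_i, M_j)
    m = sims.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(sims - m).sum(axis=1))    # stable log-sum-exp over prompts
    s = lse / tau                                           # s_it^(j), as in Eq. (13)
    s = s - s.max()                                         # stabilize the softmax over tiles
    alpha = np.exp(s) / np.exp(s).sum()
    return alpha, alpha @ V                                 # weights and pooled p_i^(j)
```

Running this once per target and concatenating the three pooled vectors gives the slide-level representation of Eq. (14). Because the weights are a convex combination, the pooled vector has norm at most 1.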

MRI radiomics branch

MRI encodes macro-phenotypes (shape, intensity, texture) correlated with stromal content and edema, complementing WSI micro-ecology.

After resampling to isotropic voxels and sequence-wise z-normalization, PyRadiomics produces \({\phi }_{{\rm{rad}}}({R}_{i})\in {{\mathbb{R}}}^{{P}_{r}}\); stable features (ICC threshold) define \({{\bf{r}}}_{i}={[{\phi }_{{\rm{rad}}}({R}_{i})]}_{{\mathcal{I}}}\). Site/scanner effects are harmonized via ComBat:

$${{\bf{r}}}_{i}\leftarrow {\rm{ComBat}}({{\bf{r}}}_{i}| {d}_{i}),\,{{\bf{z}}}_{i}^{({\rm{rad}})}={E}_{{\rm{rad}}}({{\bf{r}}}_{i})\in {{\mathbb{R}}}^{{d}_{r}},$$
(15)

with Erad an MLP (layer norm, dropout).

Gated mid-level fusion and multi-head regression

Radiomics and pathomics carry complementary, scale-separated signals; prompts provide a mechanism-aware context vector that can modulate fusion. Encode pathomics by Ewsi:

$${{\bf{z}}}_{i}^{({\rm{wsi}})}={E}_{{\rm{wsi}}}({{\bf{p}}}_{i})\in {{\mathbb{R}}}^{{d}_{p}},\,\bar{{\bf{t}}}=[{\bar{{\bf{t}}}}^{({\rm{ERS}}-{\rm{CAF}})};{\bar{{\bf{t}}}}^{({\rm{LR}})};{\bar{{\bf{t}}}}^{(H)}].$$
(16)

A gate controls cross-modal contribution:

$${{\bf{z}}}_{i}=\sigma \left({{\bf{W}}}_{g}\left[{{\bf{z}}}_{i}^{({\rm{rad}})};{{\bf{z}}}_{i}^{({\rm{wsi}})};\bar{{\bf{t}}}\right]+{{\bf{b}}}_{g}\right)\odot \left[{{\bf{z}}}_{i}^{({\rm{rad}})};{{\bf{z}}}_{i}^{({\rm{wsi}})}\right],$$
(17)

where σ is the element-wise sigmoid and ⊙ denotes the Hadamard product. A shared trunk f with three linear readouts yields

$${\widehat{s}}_{i}^{(j)}={h}^{(j)}({{\bf{z}}}_{i})={{\bf{w}}}_{j}^{\top }f({{\bf{z}}}_{i})+{b}_{j},\,j\in \{{\rm{ERS}}-{\rm{CAF}},{\rm{LR}},H\}.$$
(18)
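A forward pass through Eqs. (17) and (18) might look as follows. The parameter shapes, the `tanh` trunk, and the function names are assumptions made for this sketch; the text specifies only a shared trunk f with three linear readouts.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_fusion(z_rad, z_wsi, t_bar, W_g, b_g):
    """Eq. (17): a text-conditioned gate rescales the concatenated latents.

    W_g has shape (d_r + d_p, d_r + d_p + len(t_bar)); b_g has shape (d_r + d_p,).
    """
    z_cat = np.concatenate([z_rad, z_wsi])
    gate = sigmoid(W_g @ np.concatenate([z_cat, t_bar]) + b_g)
    return gate * z_cat                       # Hadamard product

def linear_readouts(z, W_out, b_out, trunk=np.tanh):
    """Eq. (18): shared trunk (tanh is a stand-in) followed by three linear heads."""
    return W_out @ trunk(z) + b_out           # shape (3,): ERS-CAF, LR, H
```

With zero gate parameters every gate value is 0.5, so both modalities are attenuated equally; learned weights let the prompt context shift that balance per feature.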

Training curriculum and objectives

Directly optimizing many objectives is fragile42,43,44. We therefore adopt a three-stage curriculum, ensuring stable convergence while preserving the intended inductive biases.

Stage I: CLIP similarity-label alignment (WSI branch)

For each target j, define the WSI similarity between the pooled slide embedding and the prompt-mixture text embedding \({\bar{{\bf{t}}}}^{(j)}\), together with the min–max-scaled label:

$${u}_{i}^{(j)}={{\bf{p}}}_{i}^{(j)\top }{\bar{{\bf{t}}}}^{(j)}\in [-1,1],$$
(19)
$${y}_{i}^{(j)}=\,\mathrm{minmax}\,\left({S}_{i}^{(j)}\right)=\frac{{S}_{i}^{(j)}-{\min }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}}{{\max }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}-{\min }_{{i}^{{\prime} }\in {\mathcal{T}}}\,{S}_{{i}^{{\prime} }}^{(j)}+\epsilon }\in [0,1],$$
(20)

where \({\rm{minmax}}\) scales per-cohort to [0, 1], \({\mathcal{T}}\) denotes the training split of the current outer fold (to avoid leakage), and \(\epsilon ={10}^{-8}\). A monotone affine-sigmoid calibrator maps similarity to the label scale:

$${g}_{j}(u)=\sigma ({a}_{j}u+{b}_{j}),\,{a}_{j}\ge 0.$$
(21)

The WSI alignment loss is

$${{\mathcal{L}}}_{{\rm{WSI}}}=\mathop{\sum }\limits_{j}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({g}_{j}({u}_{i}^{(j)})-{y}_{i}^{(j)}\right)}^{2}.$$
(22)

where σ(·) is the logistic function; aj, bj are learned scalars with aj ≥ 0 to enforce monotonicity. During Stage I, (fI, fT) are kept frozen; Ewsi, prompt-mixture weights \({\gamma }_{m}^{(j)}\), and (aj, bj) are optimized.
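The Stage I pieces in Eqs. (20) to (22) can be sketched as below. The softplus parameterization that keeps the slope non-negative is an assumption (the text only requires aj ≥ 0), and all function names are illustrative.

```python
import numpy as np

def minmax_fit(s_train, eps=1e-8):
    """Fit Eq. (20) on the training split only and return a scaler for any split."""
    lo, hi = float(np.min(s_train)), float(np.max(s_train))
    return lambda s: (np.asarray(s, dtype=float) - lo) / (hi - lo + eps)

def calibrator(u, a_raw, b):
    """Eq. (21): sigmoid(a*u + b) with a = softplus(a_raw) >= 0 (assumed parameterization)."""
    a = np.log1p(np.exp(a_raw))
    return 1.0 / (1.0 + np.exp(-(a * u + b)))

def wsi_alignment_loss(U, Y, a_raw, b):
    """Eq. (22): per-target mean squared error, summed over targets.

    U, Y : (N, 3) similarities and scaled labels; a_raw, b : (3,) per-target scalars.
    """
    return float(np.sum(np.mean((calibrator(U, a_raw, b) - Y) ** 2, axis=0)))
```

Because the scaler is fitted on the training split, held-out labels can fall slightly outside [0, 1]; this is expected and avoids leakage.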

Stage II: Multimodal regression (fusion model)

We fit the fused model to the continuous molecular targets with heteroscedastic weighting:

$${{\mathcal{L}}}_{{\rm{task}}}=\mathop{\sum }\limits_{j\in \{{\rm{ERS}}-{\rm{CAF}},{\rm{LR}},H\}}\left[\frac{1}{2{\sigma }_{j}^{2}}\cdot \frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({\widehat{s}}_{i}^{(j)}-{S}_{i}^{(j)}\right)}^{2}+\log {\sigma }_{j}\right],$$
(23)

where σj > 0 are learned noise scales and \({S}_{i}^{(j)}\) denotes the jth entry of \({S}_{i}^{\star }\). Parameters updated include Erad, Ewsi, fusion gate (Wg, bg), trunk f, and readouts (wj, bj); the Stage I prompt mixtures \({\gamma }_{m}^{(j)}\) are fine-tuned.
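Eq. (23) can be written compactly by learning log σj instead of σj, which keeps the noise scales positive without constraints. That reparameterization and the function name are assumptions of this sketch.

```python
import numpy as np

def heteroscedastic_loss(pred, target, log_sigma):
    """Eq. (23) with sigma_j = exp(log_sigma_j).

    pred, target : (N, 3) predictions and molecular targets.
    log_sigma    : (3,) learned log noise scales.
    """
    mse = np.mean((pred - target) ** 2, axis=0)                 # per-target MSE
    return float(np.sum(0.5 * np.exp(-2.0 * log_sigma) * mse + log_sigma))
```

Setting the derivative with respect to σj to zero gives σj² equal to the per-target mean squared error, so harder (noisier) targets are automatically down-weighted.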

Stage III: Site/scanner harmonization (CORAL)

To mitigate domain shift without adversarial instability, we align the second-order statistics of the fused features. Let μd and Σd denote the mean and covariance of zi within domain d, and μ, Σ the global mean and covariance. The CORAL loss is

$${{\mathcal{L}}}_{{\rm{coral}}}=\mathop{\sum }\limits_{d=1}^{D}\left(\parallel {\mu }_{d}-\mu {\parallel }_{2}^{2}+\parallel {\Sigma }_{d}-\Sigma {\parallel }_{F}^{2}\right),$$
(24)

where \(\parallel \cdot {\parallel }_{F}\) is the Frobenius norm. The Stage III objective is

$$\mathop{\min }\limits_{\Theta }\,{{\mathcal{L}}}_{{\rm{task}}}+{\lambda }_{{\rm{coral}}}{{\mathcal{L}}}_{{\rm{coral}}}+{\lambda }_{{\rm{reg}}}\parallel \Theta {\parallel }_{2}^{2},$$
(25)

with weight decay λreg ≥ 0 and λcoral ≥ 0 chosen on validation folds.
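A NumPy sketch of the alignment penalty in Eq. (24) is given below. It uses biased (population) covariances and is meant only to illustrate the quantity being minimized; in training, the same expression would be computed on mini-batches inside an autodiff framework, which is not shown.

```python
import numpy as np

def coral_loss(Z, domains):
    """Eq. (24): squared distance of per-domain mean/covariance to the global ones.

    Z       : (N, k) fused features z_i, with k >= 2.
    domains : (N,) integer domain labels d_i.
    """
    Z = np.asarray(Z, dtype=float)
    domains = np.asarray(domains)
    mu = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False, bias=True)
    loss = 0.0
    for d in np.unique(domains):
        Zd = Z[domains == d]
        mu_d = Zd.mean(axis=0)
        Sigma_d = np.cov(Zd, rowvar=False, bias=True)
        loss += np.sum((mu_d - mu) ** 2) + np.sum((Sigma_d - Sigma) ** 2)  # L2 + Frobenius
    return float(loss)
```

The penalty vanishes when every domain shares the global first- and second-order statistics and grows with any site-specific shift or rescaling of the fused features.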

We employ dropout in encoders and trunk, and modality dropout during Stage II/III (with probability q, one branch’s latent is zeroed) to ensure single-modality robustness.
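Modality dropout as described can be sketched in a few lines. Choosing the dropped branch uniformly at random is an assumption, since the text states only that one branch's latent is zeroed with probability q.

```python
import numpy as np

def modality_dropout(z_rad, z_wsi, q, rng):
    """With probability q, zero one branch latent (branch picked uniformly; assumed)."""
    if rng.random() < q:
        if rng.random() < 0.5:
            z_rad = np.zeros_like(z_rad)
        else:
            z_wsi = np.zeros_like(z_wsi)
    return z_rad, z_wsi
```

This would be applied per sample during training only; at inference both latents pass through unchanged unless a modality is genuinely missing.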

MRI-only Knowledge Distillation

For fully non-invasive deployment when pathology is unavailable, we distill the trained teacher \({F}_{{\Theta }^{\star }}\) into an MRI-only student \({F}_{{\Theta }_{S}}\):

$${{\mathcal{L}}}_{\mathrm{KD}}=\mathop{\sum }\limits_{j\in \{\,\mathrm{ERS}-\mathrm{CAF}\,,\mathrm{LR},H\}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({\widehat{s}}_{i,S}^{(j)}-{\widehat{s}}_{i,T}^{(j)}\right)}^{2},$$
(26)

where \({\widehat{s}}_{i,T}^{(j)}={F}_{{\Theta }^{\star }}^{(j)}({R}_{i},{W}_{i})\) and \({\widehat{s}}_{i,S}^{(j)}={F}_{{\Theta }_{S}}^{(j)}({R}_{i})\). The student reuses the MRI radiomics encoder and the fusion trunk (restricted to the MRI branch).

Validation and Clinical Modeling

We conduct fivefold patient-level cross-validation stratified by domain di with nested hyperparameter selection; leakage is prevented at the patient level. Predictive fidelity is summarized by Pearson correlation r, RMSE, MAE, and calibration slope/intercept between \({\widehat{S}}_{i}\) and \({S}_{i}^{\star }\). Clinical relevance is assessed by Cox models with covariate adjustment:

$$h(t| {\widehat{S}}_{i},{{\bf{c}}}_{i})={h}_{0}(t)\exp \left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{i};{{\bf{c}}}_{i}]\right),$$
(27)

using the partial log-likelihood

$$\ell ({\boldsymbol{\theta }})=\mathop{\sum }\limits_{i:{\Delta }_{i}=1}\left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{i};{{\bf{c}}}_{i}]-\log \mathop{\sum }\limits_{j\in {\mathcal{R}}({T}_{i})}\exp \left({{\boldsymbol{\theta }}}^{\top }[{\widehat{S}}_{j};{{\bf{c}}}_{j}]\right)\right),$$
(28)

where \({\mathcal{R}}({T}_{i})\) is the risk set. We report hazard ratios (HRs) with 95% CIs, time-dependent AUC, and C-index. Treatment relevance is tested via logistic or interaction Cox models with treatment vector ti.
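The partial log-likelihood of Eq. (28) can be evaluated directly, as in the sketch below. It treats ties in the Breslow manner (all subjects with time ≥ Ti are in the risk set) and is intended for illustration; in practice a survival library would be used for fitting and inference.

```python
import numpy as np

def cox_partial_loglik(theta, X, time, event):
    """Eq. (28) for design matrix X = [S_hat; c], follow-up times, and event flags."""
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    eta = X @ np.asarray(theta, dtype=float)        # linear predictor theta^T [S_hat; c]
    ll = 0.0
    for i in np.flatnonzero(event):                 # sum over observed events only
        at_risk = time >= time[i]                   # risk set R(T_i)
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return float(ll)
```

With θ = 0 and three uncensored patients at distinct times, the risk sets have sizes 3, 2, and 1, so the log-likelihood equals −log 6.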

Given the limited cohort size (N = 126), we explicitly constrain model capacity and enforce strict evaluation hygiene. CLIP image/text encoders are kept frozen; the trainable components are limited to lightweight prompt-mixture weights, shallow encoders for radiomics/pathomics, a mid-level gating network, and three linear heads. All preprocessing/statistics are fit within each training fold only (e.g., min–max scaling for Stage I targets; ComBat parameters; feature standardization), and performance is reported on held-out patients only. We further apply dropout and weight decay throughout, and use modality dropout during Stage II/III to prevent single-modality shortcut learning and to improve robustness under missing modalities.

Experimental setup

We assemble a consecutive, single-disease cohort of 126 patients with matched pre-operative MRI (T1, T2, contrast-enhanced), diagnostic H&E WSIs, bulk RNA-seq, and clinical follow-up. Exclusion criteria are: (i) imaging outside protocol; (ii) failed ROI/tiling quality control; (iii) RNA-seq library size or mapping rate below thresholds; and (iv) missing outcomes. Primary endpoints are overall survival (OS) and progression-free survival (PFS). Molecular supervision yields three continuous targets per patient i: \({S}_{i}^{\star }={[{S}_{{\rm{ERS}}-{\rm{CAF}}}(i),{S}_{{\rm{LR}}}(i),H(i)]}^{\top }\) (cf. Section 9).

For generalization, we evaluate on TCGA stroma-rich cancers (PAAD, STAD, COADREAD) with WSIs, RNA-seq, and clinical data. The multimodal model is frozen and applied zero-shot to generate \({\widehat{S}}_{i}\), then compared to per-cancer molecular gold standards. When imaging modality differs (e.g., CT), we restrict radiomic families to modality-agnostic statistics.

Preprocessing and Quality Control

Volumes are resampled to 1 mm isotropic, bias-field corrected, and z-standardized per sequence. Tumor ROIs are delineated and adjudicated. PyRadiomics extracts first-order, shape, and texture features (GLCM/GLRLM/GLSZM/NGTDM) plus LoG/wavelet; site/scanner effects are harmonized by ComBat; features with ICC < 0.75 are removed, yielding \({{\bf{r}}}_{i}\in {{\mathbb{R}}}^{{P}_{r}}\).

End-to-end 3D CNN/Transformer encoders are an attractive alternative45,46. However, in our setting (N = 126 chordoma, multi-site scanners, three MRI sequences), fully learned 3D representation learning is high-variance and prone to overfitting without substantial pretraining and careful harmonization. We therefore use radiomics as a data-efficient and reproducible summary of macro-scale tumor phenotypes (shape/texture/heterogeneity), then learn a non-linear encoder Erad(·) jointly with the multimodal trunk and gate.

WSIs undergo Macenko stain normalization, tissue masking (Otsu+morphology), tiling at ×20 into 224 × 224 px patches (stride 224), and tile QC (tissue coverage ≥50%, blur/pen-mark removal). Retained tiles form \({\{{W}_{it}\}}_{t=1}^{{T}_{i}}\).
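The tiling and coverage rule can be sketched on a binary tissue mask as below. The mask is assumed to be already computed at the tiling resolution (the Otsu and morphology steps, blur detection, and pen-mark removal are not shown), and the function name is hypothetical.

```python
import numpy as np

def tile_grid(mask, tile=224, stride=224, min_coverage=0.5):
    """Return top-left (y, x) coordinates of tiles with enough tissue.

    mask : 2-D boolean array, True where tissue was detected.
    """
    mask = np.asarray(mask, dtype=bool)
    H, W = mask.shape
    coords = []
    for y in range(0, H - tile + 1, stride):
        for x in range(0, W - tile + 1, stride):
            if mask[y:y + tile, x:x + tile].mean() >= min_coverage:   # tissue coverage
                coords.append((y, x))
    return coords
```

On a 448 × 448 mask whose left half is tissue, only the two left-hand tiles are retained.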

SERS−CAF is ssGSEA on a curated ERS–CAF gene set (cohort z-score); SLR aggregates ERS–CAF ligand-immune receptor pairs weighted by immune fractions pi,c from CIBERSORTx; H(i) is Shannon diversity over deconvolved cell fractions (Section 9).

Ethics approval and consent to participate

This study was approved by the Institutional Ethics Committee of The First Affiliated Hospital, University of South China, Hunan, P.R. China (Ethics Approval No. 2025LL0206001). Written informed consent was obtained from all enrolled patients in accordance with the Declaration of Helsinki.