Fig. 5: Overview of the three-stage multimodal prognosis framework.

From: Deep multimodal fusion of patho-radiomic and clinical data for enhanced survival prediction for colorectal cancer patients

(1) Modality processing: WSIs are tiled, stain-normalized, and encoded with CTransPath; patch features are aggregated by gated-attention MIL, modulated by mask-derived saliency. 3D CT volumes are encoded with Swin UNETR using ROI-gated and peritumoral-ring pooling plus compact shape/texture descriptors. Clinical/genomic variables are encoded by a GCN over the prior graph \(A\). Segmentation serves as a structural prior: it gates the WSI attention and enables the CT ROI/peritumoral pooling and shape/texture descriptors. (2) Fusion and uncertainty: the modality vectors \(z_{\mathrm{WSI}}\), \(z_{\mathrm{CT/Endo}}\), and \(z_{\mathrm{GCN}}\) are fused via low-rank multimodal fusion (LMF, rank \(r\)) into \(h\), with auxiliary per-modality heads. MC Dropout (optionally deep ensembles) yields a calibrated prediction \(\hat{y}\) and variance \(\sigma^2\). Training proceeds in three stages: self-supervised visual pretraining (DINOv2), domain adaptation (Barlow Twins), and supervised fine-tuning with knowledge distillation. (3) Prediction and evaluation: an MLP survival head outputs risk/endpoint estimates; evaluation covers segmentation (Dice/IoU), biomarker AUCs, prognosis (C-index), and therapy-response hazard ratios. Fusion remains tri-modal (WSI/CT/GCN); segmentation is not a modality.
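The gated-attention MIL pooling in stage (1) follows the standard gated-attention formulation (Ilse et al., 2018); a minimal PyTorch sketch is below. The 768-dimensional CTransPath feature size and the additive saliency prior on the attention logits are assumptions for illustration, since the figure does not specify how the mask-derived saliency enters the attention.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated-attention MIL pooling: learns per-patch attention weights and
    returns their weighted sum as the slide-level embedding. `saliency` is a
    hypothetical mask-derived prior that modulates the attention logits."""
    def __init__(self, d_in=768, d_attn=256):   # 768 = CTransPath feature dim (assumed)
        super().__init__()
        self.V = nn.Linear(d_in, d_attn)        # tanh branch
        self.U = nn.Linear(d_in, d_attn)        # sigmoid gate branch
        self.w = nn.Linear(d_attn, 1)

    def forward(self, patches, saliency=None):  # patches: (n_patches, d_in)
        logits = self.w(torch.tanh(self.V(patches)) * torch.sigmoid(self.U(patches)))
        if saliency is not None:                # assumed: additive log-prior from the mask
            logits = logits + saliency.unsqueeze(-1)
        attn = torch.softmax(logits, dim=0)     # (n_patches, 1)
        return (attn * patches).sum(dim=0), attn  # slide embedding z_WSI, attention map
```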

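Stage (2)'s low-rank fusion avoids materializing the full outer-product tensor of the three modality vectors by projecting each with rank-\(r\) factors and multiplying the projections. The sketch below assumes the standard LMF parameterization (Liu et al., 2018); dimensions and initialization are illustrative, not the authors' settings.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Low-rank multimodal fusion (LMF): the outer product
    z_WSI (x) z_CT (x) z_GCN is never formed; each modality is projected
    by rank-r factors and the projections are combined multiplicatively."""
    def __init__(self, dims, d_out, rank):
        super().__init__()
        # one factor per modality: (rank, d_m + 1, d_out); the +1 appends a
        # constant 1 so unimodal and lower-order terms survive the product
        self.factors = nn.ParameterList(
            nn.Parameter(0.1 * torch.randn(rank, d + 1, d_out)) for d in dims
        )
        self.rank_weights = nn.Parameter(torch.ones(rank))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, zs):  # zs: list of (batch, d_m) modality embeddings
        fused = None
        for z, W in zip(zs, self.factors):
            z1 = torch.cat([z, torch.ones(z.size(0), 1, device=z.device)], dim=1)
            proj = torch.einsum('bd,rdo->rbo', z1, W)   # (rank, batch, d_out)
            fused = proj if fused is None else fused * proj
        # weighted sum over the rank dimension yields the fused vector h
        return torch.einsum('r,rbo->bo', self.rank_weights, fused) + self.bias

# hypothetical usage: fuse the three modality embeddings into h
# h = LowRankFusion(dims=[512, 512, 128], d_out=256, rank=4)([z_wsi, z_ct, z_gcn])
```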
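The MC Dropout estimate of \(\hat{y}\) and \(\sigma^2\) amounts to keeping dropout stochastic at test time and averaging repeated forward passes. A generic sketch follows; the model, input, and sample count are placeholders, not the paper's configuration.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=30):
    """Monte Carlo Dropout: run T stochastic forward passes with dropout
    active; the sample mean approximates y_hat, the sample variance sigma^2."""
    model.eval()                          # freeze batch-norm and other layers
    for m in model.modules():             # ...but keep dropout stochastic
        if isinstance(m, torch.nn.Dropout):
            m.train()
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)   # y_hat, sigma^2
```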