Fig. 3: Multimodal multiple instance learning allows the prediction of targets using imaging data. | Nature Communications

From: A multimodal dataset for precision oncology in head and neck cancer

A Multiple instance learning (MIL) pipeline. Tissue in WSIs is segmented and subsequently sampled as patches. These patches are encoded, for example with the UNI architecture22, and used as input for MIL together with a specified target. Using the CLAM framework20, we can retrieve attention scores (blue: low attention, red: high attention). Scale bar is 1 cm for the WSI and its attention-labeled counterpart. B Slide-level AUC values for localization prediction on the test dataset (N = 10 each), color-coded for supervised (blue) and self-supervised (red) encoding backbones. All backbones are based on convolutional neural networks except UNI, which is based on vision transformers. Boxplots show the Q1–Q3 interval with the median; whiskers extend to 1.5 × the inter-quartile (Q1–Q3) range. C Most-attended patches of three test WSIs for each localization tested (oropharynx, larynx, and oral cavity). Note the presence of gland tissue in oral cavity-derived samples. Scale bar indicates 30 μm. D Multimodal integration of different imaging data sources. We use separate encodings for WSIs (pink) and TMAs (green) with the UNI encoder for MIL. E Slide-level AUC values for survival prediction on the test dataset (N = 10 each). Boxplots show the Q1–Q3 interval with the median; whiskers extend to 1.5 × the inter-quartile (Q1–Q3) range. Scale bar indicates 1 cm for the WSI and 1 mm for TMAs. F Attention scores and their frequency across information-containing groups and modalities for the test dataset. Source data are provided as a Source Data file.
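The attention-based pooling step described in panel A can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy with random, untrained parameters solely to show the gated-attention mechanism that attention-based MIL frameworks such as CLAM build on, where each patch embedding (e.g. from an encoder such as UNI) receives an attention weight and the slide-level representation is their weighted sum. The dimensions and parameter names (`V`, `U`, `w`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one slide is a "bag" of N patch embeddings,
# each a D-dimensional vector produced by a patch encoder.
N, D, H = 8, 1024, 64
patches = rng.standard_normal((N, D))

# Illustrative, untrained parameters for a gated-attention scorer.
V = rng.standard_normal((D, H)) * 0.01
U = rng.standard_normal((D, H)) * 0.01
w = rng.standard_normal((H, 1)) * 0.01

# Gated attention: score_i = w^T (tanh(V^T h_i) * sigmoid(U^T h_i))
gate = np.tanh(patches @ V) * (1.0 / (1.0 + np.exp(-(patches @ U))))
scores = gate @ w                      # (N, 1) unnormalized attention

# Softmax over the patches of the bag (numerically stabilized).
attn = np.exp(scores - scores.max())
attn /= attn.sum()                     # per-patch weights, sum to 1

# Slide-level embedding: attention-weighted sum of patch embeddings.
slide_embedding = (attn * patches).sum(axis=0)   # shape (D,)

print(attn.ravel())        # per-patch attention (blue→red in panel A)
print(slide_embedding.shape)
```

In training, the attention parameters are learned jointly with a classifier on the slide embedding, so only a slide-level label is needed; the per-patch weights are what is visualized as the attention heatmap.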