Fig. 1: Overview of the proposed Decoupled Multimodal Representation Fusion (MODES) framework.

From: A Representation Fusion Framework for Decoupling Diagnostic Information in Multimodal Learning

a Clinicians use different diagnostic modalities to form a holistic view of patient health and make a clinical diagnosis. b Overview of MODES: the fine-tuned unimodal encoders learn decoupled shared and modality-specific representations. These representation components can then be used by unimodal generators to reconstruct samples. c The fused representations separate the information that is unique to each modality from the shared information. The cMRI-specific representation encodes information such as the anatomy and size of the heart, while the ECG-specific representation encodes information about the electrical activity of the heart. The fused representation can be used by downstream models to predict a variety of diagnostic phenotypes or diagnoses, and offers interpretability into the predictive power of each modality. d The masking component learns the appropriate size of each shared and modality-specific representation. The final size reflects the amount of information embedded in each subspace and can vary depending on the pair of modalities considered. e MODES learns to embed cross-modal information into the shared space using unimodal encoders. This can be used to infer phenotypes associated with a missing modality, or to estimate the range of possible samples for that modality. The three icons in Fig. 1a are from www.flaticon.com by various authors (Vectors Tank, Linector), used under the Flaticon Free License.
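To make the decoupling described in panels b, d, and e concrete, the sketch below illustrates one plausible reading: each unimodal encoder emits a shared and a modality-specific component, learnable masks softly gate the effective size of each subspace, and aligning the shared components lets one encoder stand in for a missing modality. All names (ModesEncoder, the latent sizes, the MSE alignment loss) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the MODES decoupling idea; not the paper's code.
import torch
import torch.nn as nn

class ModesEncoder(nn.Module):
    """Encode one modality into shared and modality-specific subspaces."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.to_shared = nn.Linear(128, latent_dim)
        self.to_specific = nn.Linear(128, latent_dim)
        # Learnable gates (panel d): sigmoid(mask) acts as a soft choice
        # of how many dimensions each subspace actually uses.
        self.shared_mask = nn.Parameter(torch.zeros(latent_dim))
        self.specific_mask = nn.Parameter(torch.zeros(latent_dim))

    def forward(self, x):
        h = self.backbone(x)
        z_shared = self.to_shared(h) * torch.sigmoid(self.shared_mask)
        z_specific = self.to_specific(h) * torch.sigmoid(self.specific_mask)
        return z_shared, z_specific

# Example: cMRI-like and ECG-like inputs with different dimensionalities.
enc_cmri, enc_ecg = ModesEncoder(256, 64), ModesEncoder(512, 64)
x_cmri, x_ecg = torch.randn(8, 256), torch.randn(8, 512)
zs_cmri, zu_cmri = enc_cmri(x_cmri)
zs_ecg, zu_ecg = enc_ecg(x_ecg)

# Fused representation for downstream prediction (panel c): the shared
# information plus each modality's unique component.
fused = torch.cat([(zs_cmri + zs_ecg) / 2, zu_cmri, zu_ecg], dim=-1)

# Aligning shared components across encoders (panel e) would allow one
# modality's encoder to approximate the shared information of a missing
# modality at inference time; MSE is one plausible alignment objective.
align_loss = nn.functional.mse_loss(zs_cmri, zs_ecg)
```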