Fig. 1: Multimodal deep learning system and dataset.

a The multimodal architecture is composed of two parts: a tower stack to parse a variable number of digital histopathology slides and another tower stack to merge the resultant features and predict binary outcomes. b The training of the self-supervised model of the image tower stack.