Fig. 5: The overview of our method.
From: Large-scale long-tailed disease diagnosis on radiology images

The three panels show the proposed visual encoders, the fusion module, and the knowledge enhancement strategy, respectively. a The three types of visual encoder: ResNet-based, ViT-based, and ResNet-ViT-mixing. b The architecture of the transformer-based fusion module, which enables case-level information fusion. c The knowledge enhancement strategy. We first pre-train a text encoder on additional medical knowledge using contrastive learning, leveraging synonyms, descriptions, and hierarchy. We then treat the text embeddings as a natural classifier to guide diagnosis classification.
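To make the idea in panel c concrete, the following is a minimal, dependency-free sketch of using text embeddings as a classifier: each disease class is represented by a text embedding, and a case-level visual feature is scored against every class embedding. The caption does not specify the scoring function, so the cosine similarity, the `temperature` value, the sigmoid (multi-label) output, the function names, the class names, and the toy vectors are all illustrative assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def sigmoid(x):
    """Logistic function, mapping a score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify_with_text_embeddings(case_feature, class_text_embeddings, temperature=0.1):
    """Score one case-level feature against each class text embedding.

    Assumption: cosine similarity divided by a temperature, followed by a
    per-class sigmoid, so that several diseases can be predicted at once.
    """
    feature = l2_normalize(case_feature)
    probabilities = {}
    for class_name, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(feature, text))
        probabilities[class_name] = sigmoid(cosine / temperature)
    return probabilities


# Hypothetical 3-dimensional embeddings; a real text encoder would produce
# much higher-dimensional vectors.
class_text_embeddings = {
    "pneumonia": [1.0, 0.0, 0.0],
    "fracture": [0.0, 1.0, 0.0],
}

# Hypothetical fused case-level feature, as the fusion module might output.
case_feature = [0.9, 0.1, 0.0]

probs = classify_with_text_embeddings(case_feature, class_text_embeddings)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

Because the class "weights" here are text embeddings rather than a learned linear layer, a new class could in principle be scored by embedding its name or description, provided the text encoder places related medical terms close together.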