This work investigates how much biomedical image encoders rely on text supervision. The proposed RAD-DINO, pretrained solely on unimodal imaging data, achieves similar or greater performance than state-of-the-art multimodal models on various benchmarks.
- Fernando Pérez-García
- Harshita Sharma
- Ozan Oktay