Fig. 1: Overview of 3DINO methodology and large pretraining dataset. | npj Digital Medicine

From: A generalizable 3D framework and model for self-supervised learning in medical imaging


a 3DINO combines an image-level objective and a patch-level objective. Original volumes are randomly augmented twice to create global crops and eight times to yield local crops. The image-level objective is computed by distilling the class token representations between the student and the exponential moving average (EMA) teacher networks. The patch-level objective is computed between patch representations at masked regions of the student network input and the corresponding unmasked EMA teacher representations. L_CE denotes the cross-entropy loss; the final 3DINO loss is the sum of the image-level distillation and patch-level reconstruction objectives. b Breakdown of the large multimodal, multi-organ pretraining dataset of 100,000 3D scans covering more than 10 organs from 35 publicly available and internal studies (number of volumes per modality and anatomical location/organ; MRI = 70,434 volumes, CT = 27,815, PET = 566). c Original image, principal component analysis (PCA) of patch-level representations, and multi-head self-attention (MHSA) attention map, visualized for three image planes. Rows, in order: BraTS T1-weighted, T2-weighted, and two patients from BTCV. PCA visualizations are obtained per image from the patch-level representation vectors. The first PCA component (by explained variance) is used to mask the image background (white) with a simple threshold; the next three components are normalized and mapped to RGB channels. MHSA attention maps are obtained from the class token of the final 3DINO-ViT layer. Images were not registered to atlases for visualization or training/testing.
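The combined objective in panel a (image-level distillation of class tokens plus patch-level reconstruction at masked positions, with an EMA teacher) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names, temperatures, and tensor shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_ce(teacher_logits, student_logits, t_temp=0.04, s_temp=0.1):
    # Cross-entropy L_CE between the sharpened (low-temperature)
    # teacher distribution and the student distribution.
    t = softmax(teacher_logits / t_temp)
    log_s = np.log(softmax(student_logits / s_temp) + 1e-12)
    return float(-(t * log_s).sum(axis=-1).mean())

def dino3d_loss(t_cls, s_cls, t_patch, s_patch, mask):
    # Image-level objective: distill class-token representations
    # from the EMA teacher to the student.
    image_loss = distill_ce(t_cls, s_cls)
    # Patch-level objective: student outputs at masked patch positions
    # against the corresponding unmasked teacher patch outputs.
    patch_loss = distill_ce(t_patch[mask], s_patch[mask])
    # Final 3DINO loss: sum of the two objectives.
    return image_loss + patch_loss

def ema_update(teacher_w, student_w, momentum=0.996):
    # Teacher weights track the student as an exponential moving average.
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

In practice each volume contributes two global crops and eight local crops; the loss above is averaged over all teacher/student crop pairings.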
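The PCA rendering in panel c (first component thresholded to mask background, next three components normalized to RGB) can be reproduced in outline as below; the threshold value and normalization scheme are assumptions for illustration.

```python
import numpy as np

def pca_rgb(patch_feats, bg_threshold=0.0):
    """patch_feats: (n_patches, d) array of patch-level representations.

    Returns an (n_patches, 3) RGB array (background patches white)
    and a boolean foreground mask.
    """
    X = patch_feats - patch_feats.mean(axis=0)
    # PCA via SVD; rows of Vt are components sorted by explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:4].T                 # first four PCA components
    fg = proj[:, 0] > bg_threshold      # component 1 separates background
    rgb = proj[:, 1:4]                  # components 2-4 carry the colors
    # Min-max normalize each channel over foreground patches only.
    lo, hi = rgb[fg].min(axis=0), rgb[fg].max(axis=0)
    rgb = np.clip((rgb - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    rgb[~fg] = 1.0                      # render background as white
    return rgb, fg
```

Reshaping `rgb` back to the patch grid of one image plane yields visualizations like those in panel c.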