Fig. 3

Visualization and interpretation of learned features and attention maps across folds. (A) UMAP visualizations of the image embeddings extracted by the fine-tuned ResNet-50 on the five-fold test sets. Each UMAP was fitted on the corresponding training-set embeddings with location labels as supervision. (B) Scaled last-layer attention maps of the transformer encoder contributing to the final classification embedding, averaged over groups of augmented images. Each row corresponds to the features of one image passed to the transformer, annotated with its location. The peritoneum is the most salient location, contributing more to the final prediction than any other location.
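
A minimal sketch of the panel-A procedure, assuming umap-learn and hypothetical arrays of ResNet-50 embeddings; the function name and shapes are illustrative, not the authors' exact code.

```python
# Supervised UMAP per fold: fit on training embeddings with location labels,
# then project the held-out test embeddings of the same fold.
import numpy as np
import umap

def embed_fold(train_feats, train_locations, test_feats, seed=0):
    """Fit a supervised UMAP on training-set embeddings (location labels as
    supervision) and transform the test-set embeddings of one fold."""
    reducer = umap.UMAP(n_components=2, random_state=seed)
    reducer.fit(train_feats, y=train_locations)   # supervised fit on the training set
    return reducer.transform(test_feats)          # project unseen test embeddings

# Example with random stand-in features (2048-dim ResNet-50 pooled outputs)
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 2048)).astype(np.float32)
train_locations = rng.integers(0, 4, size=500)    # integer-coded locations
test_feats = rng.normal(size=(100, 2048)).astype(np.float32)
test_xy = embed_fold(train_feats, train_locations, test_feats)
print(test_xy.shape)  # (100, 2)
```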
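
A hypothetical sketch of the panel-B computation: take the last-layer attention that the classification token pays to each input-image token, average over heads and over the augmented copies of each image group, and scale each row. The token layout, grouping along the batch dimension, and min-max scaling are assumptions for illustration.

```python
import numpy as np

def cls_attention_per_location(attn_last, n_augments):
    """attn_last: (n_groups * n_augments, n_heads, n_tokens, n_tokens) attention
    weights from the transformer's last layer; token 0 is the classification token."""
    cls_to_tokens = attn_last[:, :, 0, 1:]         # attention from CLS to image-feature tokens
    per_pass = cls_to_tokens.mean(axis=1)          # average over attention heads
    grouped = per_pass.reshape(-1, n_augments, per_pass.shape[-1]).mean(axis=1)  # average over augmented group
    lo = grouped.min(axis=1, keepdims=True)
    hi = grouped.max(axis=1, keepdims=True)
    return (grouped - lo) / (hi - lo + 1e-8)       # scale each row to [0, 1]

# Example with random stand-in weights: 6 groups x 4 augmentations,
# 8 heads, 1 CLS token + 5 image-feature tokens
attn = np.random.rand(6 * 4, 8, 6, 6)
print(cls_attention_per_location(attn, n_augments=4).shape)  # (6, 5)
```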