Fig. 1

Overview of the complete pipeline. Two trainings take place sequentially: first the autoencoder (top), then the ViT (bottom) on the image reconstructions. The autoencoder consists of a symmetric encoder-decoder setup of three layers each, with linear projections to and from the embedding space. The final layer of the decoder ends with a sigmoid activation to constrain the pixel intensities to the range [0, 1]. The interpretability elements are marked with gray boxes with dashed borders. These elements (latent space embeddings, reconstructed images, and attention maps) increase transparency into model behaviour by providing a basis for further analysis.
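As a rough illustration of the autoencoder described above, the sketch below implements a symmetric three-layer encoder and decoder with linear projections to and from the latent space and a sigmoid on the final decoder layer. The layer widths, latent dimension, ReLU activations, and weight initialisation are all assumptions for illustration only; the caption specifies only the layer count, the linear projections, and the final sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def affine(d_in, d_out):
    # Small random weights and zero biases (hypothetical initialisation).
    return rng.normal(0.0, 0.05, (d_in, d_out)), np.zeros(d_out)

# Hypothetical layer widths; only "three layers each" comes from the caption.
dims = [784, 256, 128, 64]
latent_dim = 32

enc = [affine(a, b) for a, b in zip(dims[:-1], dims[1:])]      # 3 encoder layers
to_latent = affine(dims[-1], latent_dim)                        # linear projection in
from_latent = affine(latent_dim, dims[-1])                      # linear projection out
dec = [affine(a, b) for a, b in zip(dims[::-1][:-1], dims[::-1][1:])]  # 3 decoder layers

def forward(x):
    h = x
    for W, b in enc:
        h = relu(h @ W + b)
    z = h @ to_latent[0] + to_latent[1]          # latent embedding (interpretability output)
    h = z @ from_latent[0] + from_latent[1]
    for i, (W, b) in enumerate(dec):
        h = h @ W + b
        # Sigmoid only on the final layer, keeping reconstructions in [0, 1].
        h = sigmoid(h) if i == len(dec) - 1 else relu(h)
    return z, h

z, x_hat = forward(rng.random((4, 784)))
```

The latent embeddings `z` and the reconstructions `x_hat` correspond to the gray dashed boxes in the figure; the reconstructions would then be fed to the ViT in the second training stage.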