Fig. 2. Overview of the proposed framework.

Our proposed framework consists of three main parts: 1) TANQ-based image translation from \(\text{ceT}_1\) to \(\text{hrT}_2\) images; 2) multi-view pseudo-\(\text{hrT}_2\) representation via CycleGAN; and 3) construction of a VS/cochlea segmentation model using the multi-view pseudo-\(\text{hrT}_2\) images and self-training with real \(\text{hrT}_2\) images. Specifically, TANQ partitions the features according to the \(\text{ceT}_1\) labels in both the encoder and the decoder and applies target-aware normalization to each partition. It also includes an additional decoder, SegDecoder. The encoder \(E\) extracts features from both the real \(\text{ceT}_1\) images and the pseudo-\(\text{hrT}_2\) images, and then computes a contrastive loss between features selected via a sorted attention matrix.
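To make the target-aware normalization concrete, the following is a minimal PyTorch sketch. Since the exact formulation is not spelled out above, it assumes a label-conditioned modulation in the style of spatially adaptive normalization (SPADE): features are instance-normalized and then rescaled/shifted per region using maps predicted from the \(\text{ceT}_1\) label. The module and parameter names (`TargetAwareNorm`, `num_classes`, `hidden`) are illustrative, not part of the original method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetAwareNorm(nn.Module):
    """Sketch of label-conditioned (target-aware) normalization.

    Assumption: features are instance-normalized, then modulated by
    scale/shift maps predicted from the one-hot ceT1 label, so the
    target regions (VS, cochlea) and background are normalized with
    different statistics.
    """

    def __init__(self, num_features: int, num_classes: int = 3, hidden: int = 64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        # Small conv net maps the one-hot label map to modulation parameters.
        self.shared = nn.Sequential(
            nn.Conv2d(num_classes, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, label_onehot: torch.Tensor) -> torch.Tensor:
        # Resize the label map to the current feature resolution.
        label = F.interpolate(label_onehot, size=x.shape[-2:], mode="nearest")
        h = self.shared(label)
        # Region-wise modulation of the normalized features.
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```

Because the same module appears in both the encoder and the decoder, each resolution level would hold its own instance, conditioned on the label map downsampled to that level.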
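The contrastive step with a sorted attention matrix can likewise be sketched. The version below is an assumption in the spirit of query-selected attention: rows of the \(\text{ceT}_1\) self-attention matrix are ranked by entropy, the \(k\) most focused queries are kept, and an InfoNCE loss pulls together the corresponding \(\text{ceT}_1\) and pseudo-\(\text{hrT}_2\) features. All names (`sorted_attention_contrastive_loss`, `k`, `tau`) and the entropy-based ranking are hypothetical details, not confirmed by the text.

```python
import torch
import torch.nn.functional as F

def sorted_attention_contrastive_loss(feat_src, feat_tgt, k=256, tau=0.07):
    """Contrastive loss over features selected via a sorted attention matrix.

    feat_src, feat_tgt: (B, C, H, W) features from encoder E for the real
    ceT1 image and its pseudo-hrT2 translation; k must not exceed H * W.
    """
    B, C, H, W = feat_src.shape
    src = feat_src.flatten(2).transpose(1, 2)   # (B, HW, C)
    tgt = feat_tgt.flatten(2).transpose(1, 2)   # (B, HW, C)

    # Self-attention over source positions.
    attn = torch.softmax(src @ src.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)

    # Rank query rows by entropy; low entropy = focused, discriminative.
    entropy = -(attn * attn.clamp_min(1e-8).log()).sum(-1)              # (B, HW)
    idx = entropy.argsort(dim=1)[:, :k]                                  # (B, k)

    # Aggregate features with the k selected attention rows.
    attn_k = torch.gather(attn, 1, idx.unsqueeze(-1).expand(-1, -1, H * W))
    q = F.normalize(attn_k @ src, dim=-1)   # (B, k, C) anchors from ceT1
    v = F.normalize(attn_k @ tgt, dim=-1)   # (B, k, C) positives from pseudo-hrT2

    # InfoNCE: matching positions are positives, other selections are negatives.
    logits = (q @ v.transpose(1, 2)) / tau                               # (B, k, k)
    labels = torch.arange(k, device=logits.device).expand(B, -1)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```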
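Finally, the self-training stage on real \(\text{hrT}_2\) images (part 3) commonly amounts to confidence-thresholded pseudo-labeling. The sketch below shows one standard variant under that assumption; the threshold value, the `ignore_index` convention, and the function name are illustrative rather than the paper's specified procedure.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(model, hrT2_batch, threshold=0.9, ignore_index=255):
    """Confidence-thresholded pseudo-labels for self-training (sketch).

    Assumed scheme: the segmenter trained on multi-view pseudo-hrT2 images
    predicts on real hrT2 scans; only voxels whose softmax confidence
    exceeds `threshold` keep their label, the rest are ignored in the
    next training round.
    """
    model.eval()
    probs = torch.softmax(model(hrT2_batch), dim=1)   # (B, num_classes, H, W)
    conf, pseudo = probs.max(dim=1)                   # per-voxel confidence / label
    pseudo[conf < threshold] = ignore_index           # mask low-confidence voxels
    return pseudo  # train with CrossEntropyLoss(ignore_index=ignore_index)
```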