Figure 2
From: Multimodal surface-based transformer model for early diagnosis of Alzheimer’s disease

The proposed middle-fusion attention model architecture begins by extracting patches from the icospheres, which are then flattened and embedded into a shared embedding space. In this architecture, the features derived from each PET imaging modality occupy one of the channels (C) within the PET module, where C is 1 or 2, matching the number of PET radiotracers used. A self-attention mechanism is applied in the transformer block to capture dependencies within each modality. Next, the mix-transformer block performs modality fusion by employing cross-attention operations to capture relationships across modalities. The fused outputs are then concatenated and passed through a classifier to generate class probabilities.
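The fusion scheme described above (per-modality self-attention, followed by cross-attention between modalities, then concatenation) can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the authors' implementation: the patch counts, embedding dimension, single-head attention, and the two-way cross-attention with concatenation are assumptions chosen to make the data flow concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def self_attention_block(x, Wq, Wk, Wv):
    # Self-attention within a single modality (transformer block)
    return attention(x @ Wq, x @ Wk, x @ Wv)

def cross_attention_block(x_a, x_b, Wq, Wk, Wv):
    # Cross-attention: queries from modality A attend to modality B
    return attention(x_a @ Wq, x_b @ Wk, x_b @ Wv)

rng = np.random.default_rng(0)
n_patches, d = 8, 16  # assumed sizes for illustration only

# Flattened, embedded surface patches for two modalities
# (e.g. MRI and one PET radiotracer, i.e. C = 1)
mri = rng.normal(size=(n_patches, d))
pet = rng.normal(size=(n_patches, d))

def W():
    # Random projection weights standing in for learned parameters
    return rng.normal(size=(d, d)) / np.sqrt(d)

# Intra-modality self-attention (transformer block)
mri_sa = self_attention_block(mri, W(), W(), W())
pet_sa = self_attention_block(pet, W(), W(), W())

# Middle fusion via cross-attention in both directions (mix-transformer
# block), then concatenation before the classifier head
mri_fused = cross_attention_block(mri_sa, pet_sa, W(), W(), W())
pet_fused = cross_attention_block(pet_sa, mri_sa, W(), W(), W())
fused = np.concatenate([mri_fused, pet_fused], axis=-1)
print(fused.shape)  # (8, 32): n_patches x 2d after concatenation
```

In practice the concatenated representation would feed a learned classifier head producing class probabilities; here only the attention-based fusion path is sketched.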