Fig. 10: Overview of the A-VPT framework.
From: Anatomy-guided visual prompt tuning for cross-modal breast cancer understanding

Left: three input modalities, mammography (MG), ultrasound (US), and MRI, each paired with anatomy maps (glandular, fatty, ductal). Center: the Anatomy-Guided Prompt Generator pools the tissue maps into region embeddings, passes them through MLPs, and produces anatomy-aware prompt tokens via a rank-\(r\) projection (<2% trainable parameters). These tokens interact with the visual tokens inside a frozen ViT-B/16 encoder through the Prompt-Token Interaction Module (PIM), which consists of token → prompt attention, prompt → token attention, and a gated residual fusion (\(\alpha^{(\ell)} = \sigma(\cdot)\)). Bottom: cross-modal anatomical alignment (\(\mathcal{L}_{\mathrm{RCL}},\,\mathcal{L}_{\mathrm{TXT}}\)) harmonizes tissue semantics across MG, US, and MRI. Right: output heads for classification and segmentation.
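
The data flow in the center panel can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`masked_pool`, `low_rank_prompts`, `attend`, `pim_layer`) are hypothetical, the MLPs and learned attention projections are omitted, attention is single-head, "token → prompt" is read as visual tokens querying the prompts, and the argument of the gate \(\sigma(\cdot)\), which the caption leaves unspecified, is assumed here to be a linear function of the mean prompt context.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_pool(masks, feats, eps=1e-8):
    # Pool patch features into one embedding per tissue map.
    # masks: (R, N) soft tissue maps over N patches; feats: (N, d).
    return (masks @ feats) / (masks.sum(axis=1, keepdims=True) + eps)

def low_rank_prompts(region_emb, A, B):
    # Rank-r projection: A is (d_in, r), B is (r, d), with r much smaller than d.
    return region_emb @ A @ B

def attend(queries, context):
    # Scaled dot-product attention without learned projections (assumption).
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def pim_layer(tokens, prompts, w_gate, b_gate):
    tok_ctx = attend(tokens, prompts)   # token -> prompt attention, (N, d)
    prm_ctx = attend(prompts, tokens)   # prompt -> token attention, (P, d)
    # Gate alpha = sigmoid(.); the linear form of the argument is an assumption.
    alpha = 1.0 / (1.0 + np.exp(-(tok_ctx.mean(axis=0) @ w_gate + b_gate)))
    tokens_out = tokens + alpha * tok_ctx    # gated residual fusion
    prompts_out = prompts + prm_ctx          # plain residual update (assumption)
    return tokens_out, prompts_out, alpha
```

With a strongly negative gate bias, \(\alpha^{(\ell)}\) approaches zero and the visual tokens pass through unchanged, so under this reading the gate controls how much anatomical context is mixed into the frozen encoder's tokens at each layer.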