Introduction

Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide1. Early and accurate detection from medical imaging, such as mammography, ultrasound, and MRI, is crucial for improving survival rates. While recent advances in deep learning and vision transformers (ViTs)2,3 have achieved remarkable progress in computer-aided diagnosis, these models still struggle to generalize across imaging modalities. This limitation arises from the inherent heterogeneity in anatomical structure, imaging physics, and lesion appearance across modalities4.

Vision-language models (VLMs) such as CLIP5 and MedCLIP6 have demonstrated promising capabilities in aligning medical images with textual knowledge, enabling zero-shot or few-shot diagnosis. However, these models are typically trained with global image-level supervision and lack fine-grained anatomical awareness. As a result, they often fail to capture subtle yet clinically critical variations—such as microcalcifications or architectural distortions—that are highly localized and context-dependent. In medical imaging, such omissions can lead to severe misinterpretations and undermine clinical reliability7.

To address this challenge, we introduce Anatomy-guided Visual Prompt Tuning (A-VPT), a novel framework that injects explicit anatomical priors into pre-trained VLMs. Unlike conventional visual prompt tuning (VPT), which employs static or randomly initialized prompts, A-VPT dynamically generates structured prompts guided by anatomical segmentation maps or learned tissue embeddings. This design is motivated by a critical insight: while imaging physics vary drastically (e.g., X-ray attenuation in mammography vs. acoustic reflection in ultrasound), the underlying biological topology remains invariant. By explicitly anchoring the model to these stable anatomical structures, the prompts serve as a semantic bridge, enabling the model to generalize across heterogeneous modalities despite their distinct visual characteristics. Consequently, these anatomy-aware prompts modulate the visual representation space to emphasize semantically consistent regions across modalities, enabling robust feature transfer from mammography to ultrasound or MRI.

Technically, A-VPT integrates three key components: (1) an Anatomical Region Encoder that encodes glandular, fatty, and ductal tissue distributions into prompt tokens; (2) a Cross-Modal Alignment Adapter that ensures shared anatomical semantics across imaging modalities via contrastive supervision; and (3) a Prompt Interaction Module that allows hierarchical fusion between anatomy-aware prompts and visual tokens at multiple transformer layers. This design provides both parameter efficiency and interpretability, making it well-suited for large-scale clinical deployment.

Empirically, we evaluate A-VPT on three breast imaging benchmarks: INbreast, BUSI, and Duke-Breast-MRI. A-VPT achieves consistent gains over state-of-the-art baselines in both classification and lesion segmentation tasks, while requiring less than 2% of the tunable parameters of full fine-tuning. Furthermore, qualitative analyses reveal that our anatomy-guided prompts lead to more localized and clinically meaningful attention maps, effectively bridging the gap between visual representation learning and radiological reasoning. Our contributions are summarized as follows:

  • We propose A-VPT, a novel anatomy-guided visual prompt tuning framework that injects explicit structural priors into pre-trained VLMs, enabling anatomy-aware feature modulation for breast cancer analysis.

  • We design a Cross-Modal Anatomical Alignment Module that harmonizes tissue-level semantics across mammography, ultrasound, and MRI through contrastive prompt learning, significantly improving modality generalization and diagnostic robustness.

  • We develop a hierarchical Prompt Interaction Mechanism that fuses anatomy-aware prompts and visual tokens across multiple transformer layers, allowing interpretable and anatomically consistent attention propagation.

  • We conduct extensive experiments on three benchmark datasets (INbreast, BUSI, and Duke-Breast-MRI), demonstrating that A-VPT achieves state-of-the-art performance with less than 2% of the tunable parameters required by full fine-tuning, while providing superior interpretability and clinical relevance.

Deep learning in breast cancer imaging

Deep learning has become the cornerstone of automated breast cancer analysis, achieving remarkable success in lesion detection, segmentation, and classification. Convolutional neural networks (CNNs) have been widely adopted for mammography and ultrasound interpretation, yet their limited receptive field and handcrafted preprocessing pipelines restrict their adaptability across modalities8. More recently, transformer-based models have shown strong performance on large-scale medical datasets by capturing long-range dependencies. Despite these advances, most existing methods rely heavily on modality-specific fine-tuning, leading to degraded generalization when encountering unseen imaging protocols or anatomical variations9. Addressing this limitation requires structural priors that can guide the model toward consistent anatomical reasoning across modalities.

Vision-language models in medical imaging

VLMs, such as CLIP5 and ALIGN10, have demonstrated impressive cross-modal alignment between images and text. In medical imaging, several studies have extended this paradigm to clinical contexts, including MedCLIP6 and BioViL11. These models leverage paired image–report data to align visual representations with textual semantics, enabling zero-shot classification or report generation. However, current medical VLMs primarily perform global alignment without explicit anatomical awareness. This lack of structure-level correspondence limits their interpretability and cross-modality robustness, particularly in breast imaging, where subtle tissue boundaries and regional heterogeneity play a critical diagnostic role12. Our work departs from previous approaches by embedding anatomical knowledge directly into the prompt space, effectively bridging the gap between global visual-language alignment and fine-grained anatomical reasoning.

Prompt tuning and parameter-efficient adaptation

Parameter-efficient fine-tuning (PEFT) techniques have emerged as powerful alternatives to full model retraining, reducing computational cost while maintaining performance13. VPT14 extends this concept to the vision domain by introducing learnable prompts at the input token level, enabling task adaptation without modifying backbone weights. Subsequent works have explored hierarchical, generative, and dynamic prompt designs, achieving significant progress in natural image benchmarks15,16. However, these methods remain modality-agnostic and neglect anatomical constraints critical to medical imaging. In contrast, our A-VPT framework incorporates anatomy-guided prompt generation and hierarchical fusion to enhance both interpretability and transferability across breast imaging modalities.

Results

Experimental setup

Datasets

We conduct experiments on three public breast imaging datasets that cover complementary modalities and acquisition protocols: INbreast17, BUSI18, and Duke-Breast-MRI19,20. Together, they provide a comprehensive benchmark for evaluating cross-modal generalization and anatomy-aware adaptation.

Implementation details

All experiments are implemented in PyTorch 2.1 and executed on NVIDIA A100 40GB GPUs. We use a frozen ViT-B/16 backbone pre-trained via CLIP unless otherwise specified. Only the prompt generator, cross-attention layers, and task heads are trainable, accounting for 1.87% of total parameters. Our experiments use a subset of the data rather than the entire dataset.

Training protocol

Images are resized to 224 × 224, normalized per modality, and augmented with random rotation (±10°), horizontal flipping, and adaptive histogram equalization. Optimization uses AdamW with a learning rate of \(1 \times 10^{-4}\), a weight decay of \(1 \times 10^{-4}\), and cosine annealing for 50 epochs. Warm-up is applied for the first 5 epochs. The batch size is 32 for 2D datasets and 8 for MRI slices. The prompt dimension is d = 768, the number of prompt tokens is K = 6, and the low-rank factor is r = 8 (justification provided in “Ablation studies”).
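The schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: the text does not state the warm-up shape or the final learning rate, so the linear warm-up and the decay to zero (`min_lr`) are assumptions, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=50, min_lr=0.0):
    """Learning rate at a 0-indexed epoch: warm-up followed by cosine annealing.

    Assumptions (not stated in the text): the warm-up is linear and the
    cosine phase decays towards `min_lr` = 0.
    """
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One value per epoch for the 50-epoch run described in the text.
schedule = [learning_rate(epoch) for epoch in range(50)]
```

In a PyTorch setup the same shape would typically be obtained by chaining a warm-up scheduler with a cosine scheduler; the closed form above is only meant to make the numbers in the text concrete.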

Cross-modal training

During training, we alternate between modality batches (MG, US, MRI) within each epoch to maintain balance. The temperature τ in Eq. (12) is set to 0.07. The loss weights in Eq. (17) are \(\lambda_{\rm{seg}} = 1.0\), \(\lambda_{\rm{rcl}} = 0.2\), \(\lambda_{\rm{txt}} = 0.1\), and \(\lambda_{\rm{lap}} = 0.05\). For ablations without text supervision, \({{\mathcal{L}}}_{{\rm{TXT}}}\) is disabled.
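To make the role of the temperature and the loss weights concrete, the following sketch shows one common way such terms are combined. Eq. (12) and Eq. (17) are not reproduced in this section, so the InfoNCE form of the contrastive term, the unit weight on a classification term `cls`, and all function names are assumptions for illustration only.

```python
import math

TAU = 0.07  # temperature reported in the text
LOSS_WEIGHTS = {"seg": 1.0, "rcl": 0.2, "txt": 0.1, "lap": 0.05}  # reported weights


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, tau=TAU):
    """InfoNCE loss for one anchor (assumed form of the contrastive term)."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, negative) / tau for negative in negatives]
    peak = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def total_loss(terms, weights=LOSS_WEIGHTS, use_text=True):
    """Weighted sum of loss terms.

    `terms` maps names ('cls', 'seg', 'rcl', 'txt', 'lap') to scalar values.
    The unit weight on 'cls' is an assumption; `use_text=False` mirrors the
    ablation in which the text loss is disabled.
    """
    loss = terms.get("cls", 0.0)
    for name, weight in weights.items():
        if name == "txt" and not use_text:
            continue
        loss += weight * terms.get(name, 0.0)
    return loss
```

With τ = 0.07, a cosine similarity gap of 1.0 between the positive and a negative becomes a logit gap of about 14.3, which is why small temperatures sharpen the contrast between matched and mismatched regions.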

Evaluation protocol

We adopt a strict 5-fold cross-validation on each dataset and report the mean and standard deviation across folds. For cross-modal transfer, we train on one modality (e.g., mammography) and directly test on another (e.g., ultrasound) without any fine-tuning. All reported metrics are averaged over three random seeds to ensure reproducibility.
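The fold construction and the mean/standard-deviation reporting can be sketched as follows. This is an illustrative outline only: the text does not say whether folds are split at the patient level or stratified by label, so the sketch simply shuffles sample indices, and the use of the sample standard deviation is likewise an assumption. The names `k_fold_indices` and `summarize` are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: a plain shuffled split by sample index; a clinical study
    would usually group by patient to avoid leakage across folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

For the cross-modal transfer setting, the same summary would be applied to scores obtained by training on one modality and evaluating on another without fine-tuning.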

Baselines

We compare A-VPT against: (1) full fine-tuning of ViT-B/16; (2) adapter-tuning; (3) LoRA; (4) visual prompt tuning (VPT); (5) cross-modal transformer (CMT); and (6) MedCLIP. All baselines are re-implemented under identical training and data preprocessing conditions for fair comparison.

Quantitative results

Breast cancer classification

Table 1 summarizes the image-level diagnostic results on the INbreast and BUSI datasets. Overall, transformer-based architectures consistently outperform traditional CNN baselines such as ResNet-50 and DenseNet-121, confirming the advantage of self-attention in modeling long-range dependencies within mammographic and ultrasound imagery. Among parameter-efficient methods, VPT and CoOp outperform Adapter and LoRA, indicating the effectiveness of prompt-based modulation for medical domain adaptation.

Table 1 Comparison of classification performance on INbreast and BUSI datasets

Our proposed A-VPT achieves the highest performance across all metrics, reaching an AUC of 97.8% on INbreast and 95.7% on BUSI, with an average F1-score improvement of +1.1% over the best baseline (CoOp). The improvement is especially pronounced in BUSI, where ultrasound data exhibit strong speckle noise and variable intensity distributions. This validates that the anatomy-guided prompts help the model distinguish subtle glandular and ductal structures even under heterogeneous acquisition conditions. We also observe a notable reduction in variance across folds (standard deviation <0.3), indicating stable training dynamics despite the frozen backbone. These results suggest that incorporating structural priors effectively mitigates overfitting and enhances the diagnostic reliability of VLMs.

Lesion segmentation

As shown in Table 2, A-VPT surpasses all existing segmentation networks, including strong transformer-based models such as Swin-U-Net and MedT. Compared to the fully fine-tuned Swin-Transformer, A-VPT achieves higher Dice (+2.1%) and lower Hausdorff distance (−0.9 mm) on INbreast, while using fewer than 2% of the tunable parameters. The improvement is even more significant on the BUSI dataset (+1.4% Dice), which involves severe domain shifts due to varying probe angles and tissue echogenicity. This confirms that our anatomy-aware prompt generator (Eq. 4) enables spatially adaptive feature recalibration that aligns transformer attention with meaningful glandular and lesion boundaries.

Table 2 Segmentation results on INbreast, BUSI, and Duke-Breast-MRI datasets

On the Duke-Breast-MRI dataset, A-VPT maintains consistent performance, outperforming the strong self-configuring nnU-Net by +2.1% Dice and reducing HD95 by 0.7 mm. Unlike pixel-level augmentations, the anatomical priors guide the model to focus on structural correspondences such as the periductal and perilesional regions, leading to more stable lesion delineation across patients. Qualitatively, attention visualizations (section “Interpretability and visualization”) show that A-VPT localizes microcalcifications and spiculated lesion boundaries more accurately than existing methods, confirming that the prompts act as anatomy-specific attention anchors.
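For readers less familiar with the segmentation metric reported above, the Dice coefficient is 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. The sketch below computes it on flattened binary masks; the function name and the convention for two empty masks are assumptions, and HD95 is not covered here.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Convention assumed here: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / size_sum
```

A reported gain of +2.1% Dice therefore corresponds to a larger overlap between predicted and annotated lesion pixels relative to their combined size.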

Cross-modal generalization

The cross-domain evaluation in Tables 3 and 4 highlights the capacity of A-VPT to generalize across imaging modalities without retraining. Conventional fine-tuning approaches suffer large performance drops (e.g., ViT-B/16 Fine-tuned loses over 15% AUC when transferring from MG → US), underscoring the difficulty of adapting global representations to heterogeneous modalities. While recent multimodal models like MedCLIP and CMT partially alleviate this issue through shared latent alignment, they still rely on paired data and struggle to maintain lesion-level consistency.

Table 3 Cross-modal generalization (part 1): MG → US and MG → MRI
Table 4 Cross-modal generalization (part 2): US → MRI

By contrast, A-VPT achieves substantial gains in all transfer directions, improving AUC by +3.4% (MG → US), +3.1% (MG → MRI), and +3.7% (US → MRI) over the next-best baseline. This improvement originates from our Cross-Modal Anatomical Alignment (section “Cross-modal anatomical alignment (CMAA)”), which enforces tissue-level semantic alignment via contrastive supervision. Notably, A-VPT exhibits balanced performance across both classification (AUC) and segmentation (Dice) metrics, suggesting that anatomy-guided prompts not only preserve semantic coherence but also enhance spatial reasoning when domain gaps are large. Such robustness implies that structural priors serve as transferable anchors that regularize the representation space, enabling A-VPT to operate as a truly universal, anatomy-aware foundation model for breast cancer imaging.

Results summary

To sum up, these quantitative results establish three key findings: (1) anatomy-guided prompts significantly enhance both accuracy and stability in low-data regimes; (2) cross-modal contrastive alignment provides robust generalization across imaging physics; and (3) A-VPT achieves these gains with remarkable parameter efficiency, indicating its potential for large-scale deployment in real clinical systems.

Ablation studies

To better understand the contribution of each module in our framework, we perform extensive ablation experiments on the INbreast and BUSI datasets. Unless otherwise specified, all results are reported as mean (±std) over three random seeds. Each variant is trained under identical hyperparameters and data splits, using the same frozen ViT-B/16 backbone as the main experiments to ensure fair comparison.

Effect of anatomy-guided prompts

As shown in Table 5, incorporating anatomical priors consistently improves both AUC and F1 scores. Random and static prompts achieve similar performance, as they lack spatial context. Adding anatomy-derived prompts without dynamic fusion yields minor improvement, while our full anatomy-guided dynamic prompts deliver the largest gain (+0.9% AUC). This verifies that guiding the prompt space with explicit tissue structures significantly enhances discriminative feature modulation.

Table 5 Ablation: prompt types on INbreast (AUC/F1)

Effect of cross-modal alignment losses

Table 6 examines the role of the two alignment objectives introduced in the section “Cross-modal anatomical alignment (CMAA).” Both region-level contrastive learning (\({{\mathcal{L}}}_{{\rm{RCL}}}\)) and text-guided supervision (\({{\mathcal{L}}}_{{\rm{TXT}}}\)) independently enhance cross-modal transfer, but their combination achieves the best generalization, yielding a +4.1% AUC improvement over the no-alignment baseline. This confirms that combining visual and textual semantics helps establish robust, anatomy-level correspondences across imaging modalities.

Table 6 Ablation: cross-modal alignment (MG → US)

Impact of topological smoothing (\({{\mathcal{L}}}_{{\rm{LAP}}}\))

To validate the necessity of the topological smoothing loss, we conducted an ablation by removing \({{\mathcal{L}}}_{{\rm{LAP}}}\) (setting \(\lambda_{\rm{lap}} = 0\)). As shown in Table 7, removing this term leads to a slight drop in classification AUC (−0.3%) but a more noticeable degradation in segmentation consistency (HD95 increases by 1.2 mm). This indicates that while the model can still learn discriminative features without smoothing, \({{\mathcal{L}}}_{{\rm{LAP}}}\) is crucial for enforcing spatial continuity in the learned prompts, preventing the model from overfitting to disjoint local noise.

Table 7 Ablation: impact of topological smoothing loss (\({{\mathcal{L}}}_{{\rm{LAP}}}\)) on INbreast

Hyperparameter sensitivity analysis

We further analyze the sensitivity of A-VPT to the weights of the alignment losses (\(\lambda_{\rm{rcl}}\) and \(\lambda_{\rm{txt}}\)). We performed a grid search on the validation set, varying \(\lambda_{\rm{rcl}} \in \{0.1, 0.2, 0.5\}\) and \(\lambda_{\rm{txt}} \in \{0.05, 0.1, 0.5\}\). Results indicate that performance is relatively stable within a reasonable range (e.g., AUC fluctuates within ±0.2% for \(\lambda_{\rm{rcl}} \in [0.1, 0.5]\)). However, setting the weights too high (>0.5) forces the model to prioritize alignment over task-specific discrimination, leading to performance drops. Our selected values (\(\lambda_{\rm{rcl}} = 0.2\), \(\lambda_{\rm{txt}} = 0.1\)) yield the best trade-off.
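The grid search over the two alignment weights can be outlined as below. The grids are taken from the text; everything else is a stand-in. In particular, `placeholder_validation_score` is a made-up function used only so the sketch runs, and it does not represent any measured AUC. In practice `evaluate` would train a model with the given weights and return its validation metric.

```python
from itertools import product

LAMBDA_RCL_GRID = (0.1, 0.2, 0.5)   # values listed in the text
LAMBDA_TXT_GRID = (0.05, 0.1, 0.5)  # values listed in the text


def grid_search(evaluate, rcl_grid=LAMBDA_RCL_GRID, txt_grid=LAMBDA_TXT_GRID):
    """Evaluate every (lambda_rcl, lambda_txt) pair and return the best one.

    `evaluate` is any callable returning a validation score (higher is better).
    """
    results = {
        (lam_rcl, lam_txt): evaluate(lam_rcl, lam_txt)
        for lam_rcl, lam_txt in product(rcl_grid, txt_grid)
    }
    best = max(results, key=results.get)
    return best, results


def placeholder_validation_score(lam_rcl, lam_txt):
    """NOT real data: a toy surface used only to exercise grid_search."""
    return -((lam_rcl - 0.2) ** 2) - (lam_txt - 0.1) ** 2


best_pair, all_scores = grid_search(placeholder_validation_score)
```

Selecting weights on the validation set, as the text describes, keeps the test folds untouched during tuning.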

Layer-wise prompt injection strategy

The results in Table 8 indicate that injecting prompts into all layers provides moderate benefits over shallow or deep insertion, confirming that different transformer blocks encode complementary information. Our hierarchical fusion strategy further improves performance, as it allows progressive refinement of anatomical cues across depth. This demonstrates that layer-wise prompt propagation helps the model learn both low-level texture cues (early layers) and high-level lesion semantics (later layers) in a coordinated manner.

Table 8 Ablation: injection depth. AUC on INbreast; Dice on BUSI

Number of prompt tokens and rank factor

As summarized in Table 9, performance improves as the number of prompt tokens K increases up to six, after which it saturates. A small K limits the diversity of anatomical representation, while an excessively large K introduces redundancy and minor instability during training. Although K = 8 and K = 10 offer similar accuracy, they introduce additional computational overhead in the PIM module. Following the principle of Occam’s razor and prioritizing parameter efficiency, we select K = 6 as the operating point at which the model reaches its best performance with the fewest tokens. Similarly, low-rank adaptation with r = 8 provides the best trade-off between accuracy and parameter efficiency.

Table 9 Ablation: number of prompt tokens K and low-rank factor r (INbreast)

Overall analysis

Across all ablations, three major findings emerge: (1) the anatomy-guided prompt mechanism is the dominant contributor to performance gains, proving that structure-aware conditioning is crucial for breast imaging; (2) cross-modal alignment losses further extend model generalization beyond modality boundaries; and (3) hierarchical layer-wise prompt injection and compact low-rank tuning ensure that A-VPT achieves these improvements efficiently and stably. Together, these results validate that each module in A-VPT is both necessary and complementary, yielding a robust and interpretable framework for cross-modal breast cancer understanding.

More analysis

Figure 1 demonstrates a clear and monotonic improvement from random or static prompts to our anatomy-guided variant. The gains are consistent across both datasets, showing that embedding explicit tissue priors into the prompt space helps the model attend to diagnostically relevant regions such as glandular and ductal structures.

Fig. 1: Effect of anatomy-guided prompts.

Comparison across random, static, and anatomy-aware prompt designs on INbreast and BUSI datasets.

As shown in Fig. 2, both the region-level contrastive loss (\({{\mathcal{L}}}_{{\rm{RCL}}}\)) and the text-grounded loss (\({{\mathcal{L}}}_{{\rm{TXT}}}\)) enhance cross-modality transfer individually, while their combination yields the largest improvement. This complementary effect highlights that anatomical and linguistic supervision jointly enforce consistent semantics across imaging domains.

Fig. 2: Impact of cross-modal alignment.

Performance improvement from region-level (\({{\mathcal{L}}}_{{\rm{RCL}}}\)) and text-level (\({{\mathcal{L}}}_{{\rm{TXT}}}\)) objectives on MG → US transfer.

In Fig. 3, injecting prompts only into early or late transformer layers offers modest benefits, whereas full-layer insertion improves stability. Our hierarchical fusion strategy further boosts both AUC and Dice, confirming that multi-depth prompt propagation facilitates progressive reasoning from low-level textures to high-level lesion semantics.

Fig. 3: Layer-wise prompt injection.

Comparison of early, middle, late, and hierarchical injection strategies on INbreast (AUC) and BUSI (Dice).

Figure 4 reveals that AUC rises with more prompt tokens but saturates around K = 6, while trainable parameters grow linearly. This indicates diminishing returns beyond moderate prompt counts and emphasizes that the semantic quality of prompts is more impactful than quantity. We therefore adopt K = 6 and rank factor r = 8 for the best balance between accuracy and computational efficiency.

Fig. 4: Prompt tokens vs. parameter efficiency.

Trade-off between the number of prompt tokens (K) and the proportion of trainable parameters.

Interpretability and visualization

To further demonstrate the interpretability and robustness of A-VPT, we present a series of qualitative visualizations across mammography, ultrasound, and MRI modalities. All figures are produced under identical preprocessing and visualization protocols, with unified intensity windowing and annotation schemes. These analyses highlight how anatomy-guided prompts (specifically targeting glandular, fatty, and ductal regions) improve lesion localization, structural reasoning, and cross-modal semantic alignment.

Comparison of mammography

Figure 5 compares U-Net, nnU-Net, MedT, and our A-VPT. Traditional convolutional models exhibit strong bias toward high-contrast regions and often fail in dense-glandular backgrounds, producing incomplete contours. Transformer-based MedT captures long-range context but shows contour drift due to the absence of anatomical priors. By contrast, A-VPT leverages anatomy-guided prompts to preserve morphological structure, producing smooth and accurate boundaries that align with expert annotations. These results confirm that structural priors regularize attention behavior and enhance spatial fidelity.

Fig. 5: Comparison of segmentation results on mammography.

Eight different cases segmented by U-Net, nnU-Net, MedT, and our A-VPT. The red dashed contour denotes the ground truth. A-VPT produces the most anatomically consistent lesion boundaries with clear margin delineation. Note that all comparison methods utilize the same frozen ViT-B/16 backbone where applicable.

Cross-modal embedding alignment

To evaluate feature coherence across modalities, Fig. 6 shows t-SNE projections of tissue embeddings from mammography (MG), ultrasound (US), and MRI. Baseline models form disjoint clusters, indicating poor cross-domain consistency. In contrast, A-VPT achieves compact, overlapping distributions for corresponding tissue classes (fatty, glandular, ductal), demonstrating the success of the cross-modal contrastive objective \({{\mathcal{L}}}_{{\rm{RCL}}}\). The resulting unified feature manifold captures shared tissue semantics independent of modality differences.

Fig. 6: Cross-modal embedding alignment.

t-SNE projections of tissue embeddings from mammography (MG), ultrasound (US), and MRI modalities. Colors represent tissue types (fatty, glandular, ductal). A-VPT yields compact, overlapping clusters across modalities, evidencing consistent tissue semantics.

Prompt-guided attention maps (I)

Figure 7 visualizes attention responses under three prompting strategies: Random Prompt, Static Prompt, and Anatomy-Guided Prompt. Random prompts show diffuse, non-specific activations; static prompts partially focus on the lesion but lack precise boundaries. In contrast, A-VPT concentrates attention along glandular ducts and peritumoral margins, highlighting semantically meaningful regions that correspond to radiological findings. These maps show that anatomy-aware prompt tokens guide the model to attend to clinically relevant structures, improving both interpretability and diagnostic trustworthiness.

Fig. 7: Prompt-guided attention maps.

Visual comparison between random prompt, static prompt, and anatomy-guided prompt (A-VPT) across different mammography cases. The contours mark top-10% attention regions.
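The "top-10% attention regions" mentioned in the caption can be obtained by thresholding an attention map at its 90th percentile. The sketch below shows one way to do this; how the authors computed the contours is not specified, so the global percentile threshold and the function name `top_fraction_mask` are assumptions.

```python
def top_fraction_mask(attention, fraction=0.10):
    """Binary mask marking the highest `fraction` of values in a 2D map.

    `attention` is a list of rows of floats. Ties at the threshold are all
    kept, so the mask can cover slightly more than `fraction` of the pixels.
    """
    flat = sorted((value for row in attention for value in row), reverse=True)
    n_keep = max(1, round(fraction * len(flat)))
    threshold = flat[n_keep - 1]  # smallest value that is still kept
    return [[1 if value >= threshold else 0 for value in row] for row in attention]
```

The boundary of the resulting mask is what a contour overlay would trace on the image.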

Prompt-guided attention maps (II)

A complementary visualization in Fig. 8 provides soft plasma-style heatmaps of the same mammography crops. A-VPT produces broader and more coherent attention distributions, emphasizing perilesional zones that correspond to early tumor infiltration. This indicates that anatomy-guided prompts transfer clinical priors such as shape regularity and glandular topology into the transformer’s attention mechanism. Unlike random or static prompts, A-VPT maintains contextual integrity across samples, providing intuitive insight into how the model reasons spatially.

Fig. 8: Alternative prompt-attention visualization.

Soft plasma-style heatmaps compare the effect of random, static, and anatomy-guided prompts. A-VPT consistently emphasizes the lesion core and margins, whereas other prompts yield dispersed activation.

MRI dynamic contrast enhancement

Figure 9 shows post-contrast MRI slices. Across all patients, the model accurately localizes enhancing regions (purple overlays) that correspond to ground-truth lesions (red dashed contours). A-VPT successfully tracks dynamic enhancement and preserves lesion geometry under temporal intensity variation. This result demonstrates that the anatomy-guided prompting enables temporally consistent segmentation in volumetric MRI, a key advantage for longitudinal breast cancer monitoring.

Fig. 9: Breast MRI (DCE) post-contrast comparison.

Post-contrast early-phase scans with overlays. Purple: predicted mask; red dashed: ground truth. The model accurately captures enhancing lesions while preserving anatomical context.

Across all modalities and visualization forms, A-VPT exhibits three major interpretability advantages: (1) Anatomical fidelity—its predictions align closely with real tissue structures and lesion morphologies; (2) Cross-modal coherence—its learned features remain semantically consistent across MG, US, and MRI; and (3) Transparent reasoning—the learned prompts explicitly reveal where and how the model attends. Together, these findings confirm that A-VPT bridges quantitative accuracy and qualitative interpretability, offering a trustworthy and generalizable framework for multimodal breast cancer analysis.

Parameter efficiency and complexity

To evaluate the computational efficiency of our A-VPT framework, we compare its trainable parameter count, inference latency, and GPU memory usage against full fine-tuning and other PEFT baselines. All experiments are conducted on a single NVIDIA A100 GPU (40 GB), using the same frozen ViT-B/16 backbone and batch size of 32. Latency is measured as the average forward pass time per image (in milliseconds), while memory refers to peak GPU memory usage during inference.

Parameter efficiency

Table 10 shows that A-VPT requires only 1.87% of the total parameters compared to full fine-tuning. Unlike Adapter-Tuning or LoRA, which introduce additional linear projection layers into the transformer blocks, our approach confines learnable parameters to lightweight prompt generators and cross-attention adapters. This design maintains parameter compactness while enhancing expressiveness through anatomy-guided structural priors.
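The trainable-parameter fraction and the implied reduction factor can be checked with simple arithmetic. In the sketch below, only the 1.87% figure comes from the text; the per-module counts in the usage example are hypothetical round numbers chosen for readability, not the model's actual parameter counts.

```python
def trainable_fraction(param_counts, trainable_names):
    """Fraction of parameters that receive gradient updates.

    `param_counts` maps module names to parameter counts; `trainable_names`
    lists the modules that are not frozen.
    """
    total = sum(param_counts.values())
    trainable = sum(param_counts[name] for name in trainable_names)
    return trainable / total


REPORTED_FRACTION = 0.0187  # 1.87%, as reported in the text
# Full fine-tuning updates 100% of parameters, so the reduction is 1 / fraction.
reduction_factor = 1.0 / REPORTED_FRACTION  # roughly 53x fewer trainable parameters

# Hypothetical counts (illustration only, not the real model sizes).
example_counts = {
    "backbone": 980,
    "prompt_generator": 10,
    "cross_attention": 6,
    "task_heads": 4,
}
example_fraction = trainable_fraction(
    example_counts, ["prompt_generator", "cross_attention", "task_heads"]
)
```

In a PyTorch model the same fraction would typically be computed by summing `numel()` over parameters with and without `requires_grad`.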

Table 10 Parameter efficiency and complexity comparison

Inference latency

In terms of runtime, A-VPT introduces minimal computational overhead. The average inference latency increases by only 1.2 ms compared to full fine-tuning, despite the inclusion of multi-level prompt interactions. This is largely due to the low-dimensional prompt tokens (K = 6) and efficient cross-attention fusion, which incur negligible matrix multiplication cost within transformer layers.

Memory usage

Memory consumption remains nearly identical across all PEFT methods. The slight increase (from 9.3 to 9.9 GB) stems from temporary storage of prompt embeddings and cross-modal alignment features during inference. Importantly, A-VPT achieves this efficiency without any approximation or pruning, ensuring numerical stability and reproducibility under identical backbone configurations.

Computational cost of PIM

A natural concern is the cost of the bidirectional cross-attention in PIM. However, the additional cost is small relative to self-attention. Standard self-attention has a complexity of \({\mathcal{O}}({N}^{2}d)\), where N is the number of visual tokens (e.g., 196 for 224 × 224 images) and d is the token dimension. In contrast, our PIM involves cross-attention between N visual tokens and K prompt tokens in both directions, with complexity \({\mathcal{O}}(2\cdot N\cdot K\cdot d)\). Since we set K = 6 (significantly smaller than N = 196), the ratio of PIM cost to standard self-attention is approximately 2K/N = 12/196 ≈ 0.06. Therefore, the theoretical latency increase is marginal, which aligns with our empirical observation of only a ~1.2 ms increase in inference time (Table 10).
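The ratio quoted above follows directly from the two complexity expressions, since (2·N·K·d) / (N²·d) = 2K/N. The short sketch below reproduces the arithmetic; it counts only patch tokens (no class token) and ignores constant factors such as projection layers, and the function names are hypothetical.

```python
def num_patch_tokens(image_size=224, patch_size=16):
    """Number of visual tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side


def pim_to_self_attention_ratio(n_tokens, k_prompts):
    """Ratio (2 * N * K * d) / (N^2 * d), which simplifies to 2K / N."""
    return 2 * k_prompts / n_tokens


N = num_patch_tokens()                     # 14 * 14 = 196 tokens for ViT-B/16
ratio = pim_to_self_attention_ratio(N, 6)  # 12 / 196, about 0.06
```

Because the ratio scales linearly with K, doubling the number of prompt tokens would roughly double the relative overhead of the interaction module.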

Discussion

Overall, A-VPT achieves the best balance between trainable parameter efficiency and computational scalability. It retains the full representational power of large ViT backbones while reducing the number of trainable parameters roughly 50-fold (1.87% of those updated in full fine-tuning). This efficiency, coupled with the model’s interpretability and generalization advantages, makes A-VPT highly practical for real-world deployment in multimodal medical imaging systems.

Discussion

One of the most significant findings of this work is that structural priors, when introduced through anatomy-guided prompts, not only improve quantitative accuracy but also enhance model interpretability. While traditional PEFT methods (e.g., LoRA, VPT) treat prompt learning as an abstract optimization process detached from domain semantics, our anatomy-aware design explicitly ties the prompts to human-understandable anatomical features. This linkage enables the model to reason in a manner that is visually and clinically verifiable. As shown in the section “Interpretability and visualization,” the attention maps generated by A-VPT correspond to meaningful glandular and peritumoral regions, bridging the long-standing gap between performance and transparency in deep medical vision models.

Cross-modality learning remains a key challenge in medical imaging, where heterogeneous acquisition physics often lead to domain shifts. Our experiments demonstrate that anatomy-guided prompting effectively harmonizes tissue-level semantics across mammography, ultrasound, and MRI. This result suggests that embedding anatomical priors into the prompt space provides an implicit form of modality normalization, aligning visual representations around common structural cues. Such alignment is particularly valuable in clinical workflows where multi-modality data fusion is increasingly prevalent.

In addition to interpretability, parameter efficiency is crucial for real-world deployment. A-VPT achieves a balance between computational efficiency and predictive accuracy by requiring less than 2% of the tunable parameters of full fine-tuning while maintaining comparable inference speed. This efficiency not only reduces the cost of model adaptation for new tasks but also enables large-scale deployment in resource-limited healthcare settings. The framework’s modular design further allows integration with emerging vision-language foundation models, suggesting a promising path toward scalable and explainable multimodal systems.

Despite its advantages, A-VPT still relies on the availability of coarse anatomical maps or precomputed tissue priors. A potential concern is the sensitivity of the model to the quality of these input atlases, which may contain noise in clinical practice. However, our framework exhibits intrinsic robustness against imperfect priors. Since the anatomy-aware prompts interact with visual tokens via learnable cross-attention, the attention mechanism acts as a soft filter, allowing the model to dynamically suppress inconsistent or noisy anatomical cues. Furthermore, the topology-aware smoothing loss (\({{\mathcal{L}}}_{{\rm{LAP}}}\)) regularizes the prompt embeddings, preventing the model from overfitting to local segmentation artifacts. Therefore, A-VPT maintains stable performance even when anatomical inputs are coarse or noisy.

Future work could explore self-supervised or generative approaches to learn these priors directly from raw data, reducing dependence on external segmentation tools. Moreover, extending the anatomy-guided prompting mechanism to 3D volumetric transformers or video-based clinical imaging could further improve temporal consistency and interpretability. Finally, integrating textual or radiology-report guidance into the prompting process would open new directions for explainable visual-language reasoning in medical AI.

In this paper, we presented A-VPT, a novel anatomy-guided visual prompt tuning framework for cross-modal breast cancer understanding. Unlike existing PEFT approaches that rely on purely data-driven prompts, A-VPT embeds explicit anatomical priors into the prompt space, guiding transformers to focus on clinically meaningful structures. Through a series of experiments across mammography, ultrasound, and MRI, we showed that A-VPT achieves state-of-the-art accuracy while maintaining exceptional parameter efficiency and interpretability.

Qualitative analyses revealed that the anatomy-guided prompts produce localized, semantically coherent attention maps and harmonize tissue embeddings across modalities. Quantitative evaluations confirmed that A-VPT consistently outperforms both full fine-tuning and other lightweight adaptation baselines with less than 2% trainable parameters. These results suggest that integrating anatomy-driven contextualization into prompt tuning offers a scalable and interpretable solution for multimodal medical imaging.

Beyond breast cancer applications, we believe the principles of A-VPT—namely, structural guidance, prompt efficiency, and cross-modal consistency—can generalize to a wide range of clinical imaging domains. As foundation models continue to evolve, anatomy-guided prompting may serve as a critical bridge between large-scale representation learning and human-centered medical reasoning. We hope this work inspires further exploration into interpretable, efficient, and anatomically grounded adaptation strategies for next-generation medical AI.

Methods

We present A-VPT (Anatomy-guided Visual Prompt Tuning), a parameter-efficient framework that injects explicit anatomical priors into a frozen vision(-language) backbone through structured, layer-wise prompts. Our goal is to achieve cross-modal robustness for breast imaging (mammography, ultrasound, MRI) while preserving interpretability and minimizing trainable parameters. The overall workflow is shown in Fig. 10.

Fig. 10: Overview of the A-VPT framework.

Left: three input modalities (MG/US/MRI) with anatomy maps (glandular/fatty/ductal). Center: the Anatomy-Guided Prompt Generator pools tissue maps into region embeddings, transforms them by MLPs, and produces anatomy-aware prompt tokens via a rank-r projection (<2% trainable params). These tokens interact with visual tokens inside a frozen ViT-B/16 encoder through the Prompt–Token Interaction Module (PIM): token → prompt attention, prompt → token attention, and a gated residual fusion (\({\alpha }^{(\ell )}=\sigma (\cdot )\)). Bottom: cross-modal anatomical alignment (\({{\mathcal{L}}}_{{\rm{RCL}}},\,{{\mathcal{L}}}_{{\rm{TXT}}}\)) harmonizes MG/US/MRI tissue semantics. Right: output heads for classification and segmentation.

Problem setup and notation

Let \({\mathcal{M}}=\{{\rm{MG}},\,{\rm{US}},\,{\rm{MRI}}\}\) denote imaging modalities. An input image \({x}^{(m)}\in {{\mathbb{R}}}^{H\times W\times C}\) from modality \(m\in {\mathcal{M}}\) is fed to a frozen ViT-style encoder Φ. We assume access to either (i) coarse anatomical maps (segmentation masks or tissue probability maps), or (ii) a learnable anatomy extractor (section “Anatomical region encoder (ARE)”) that produces region confidence fields \({A}^{(m)}\in {[0,1]}^{H\times W\times K}\) over K anatomical regions (e.g., K = 3, corresponding to glandular, fatty, and ductal tissues). Downstream tasks include image-level diagnosis (classification), lesion localization (detection/segmentation), or report grounding with an optional text encoder Ψ.

We tokenize \({x}^{(m)}\) into N patches: \({\{{u}_{i}\}}_{i=1}^{N}\), \({u}_{i}\in {{\mathbb{R}}}^{{P}^{2}C}\), and embed them by a frozen patch projection \({W}_{e}\in {{\mathbb{R}}}^{({P}^{2}C)\times d}\) to obtain

$${z}_{i}^{(0)}={u}_{i}{W}_{e}+{{\rm{PE}}}_{i},\,i=1,\ldots ,N,$$
(1)

where PE is positional encoding and d is the hidden size. We prepend a class token \({z}_{{\rm{cls}}}^{(0)}\). The backbone has L transformer blocks. A-VPT learns a small set of parameters to generate anatomy-aware prompt tokens \({\{{P}^{(\ell )}\}}_{\ell =1}^{L}\), which interact with visual tokens at each layer while the backbone weights remain frozen.

Anatomical region encoder (ARE)

Motivation

Cross-modality distribution shifts often arise from different imaging physics; however, anatomical organization (tissue composition and spatial layout) remains semantically consistent across modalities. We encode this structure into region-aware embeddings that steer the prompts.

Construction

Given \({A}^{(m)}\in {[0,1]}^{H\times W\times K}\), we compute soft region pools on image features. Let \(\phi :{{\mathbb{R}}}^{H\times W\times C}\to {{\mathbb{R}}}^{{H}^{{\prime} }\times {W}^{{\prime} }\times d}\) be a frozen, shallow visual stem (specifically, the patch embedding layer of the ViT backbone in our implementation). Denote \({F}^{(m)}=\phi ({x}^{(m)})\) and let \({A}_{\downarrow }^{(m)}\in {[0,1]}^{{H}^{{\prime} }\times {W}^{{\prime} }\times K}\) be bilinearly downsampled region confidences. For region k:

$${\widehat{a}}_{k}^{(m)}\,=\,\frac{{\sum }_{p}{A}_{\downarrow }^{(m)}(p,k)\,{F}^{(m)}(p)}{{\sum }_{p}{A}_{\downarrow }^{(m)}(p,k)+\epsilon }\,\in {{\mathbb{R}}}^{d},$$
(2)

where p indexes spatial locations and ϵ > 0 avoids division by zero. Stacking \({\widehat{a}}_{k}^{(m)}\) yields \({\widehat{A}}^{(m)}\in {{\mathbb{R}}}^{K\times d}\).
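Eq. (2) is a confidence-weighted average of stem features per region. A minimal sketch in plain Python follows, assuming the feature map and the downsampled confidences have already been flattened over spatial locations; the function name and the toy inputs are illustrative, not part of the original implementation.

```python
def soft_region_pool(features, confidences, eps=1e-6):
    """Confidence-weighted average of features per region, as in Eq. (2).

    features:    list of P feature vectors (length d), one per spatial location.
    confidences: list of P rows, each holding K region confidences in [0, 1].
    Returns a K x d nested list of pooled region vectors.
    """
    num_regions = len(confidences[0])
    dim = len(features[0])
    pooled = []
    for k in range(num_regions):
        weight_sum = sum(row[k] for row in confidences)
        pooled.append([
            sum(conf[k] * feat[j] for conf, feat in zip(confidences, features))
            / (weight_sum + eps)
            for j in range(dim)
        ])
    return pooled


# Toy input: 4 locations, d = 2, K = 2 regions.
F = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
A_hat = soft_region_pool(F, A)
print(A_hat)
```

A region whose confidences are all zero yields a zero vector rather than a division error, which is the role of ϵ in the denominator.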

We then pass \({\widehat{A}}^{(m)}\) through a lightweight Anatomy MLP \({g}_{\theta }\) (two linear layers with GELU and LayerNorm):

$${E}^{(m)}\,=\,{g}_{\theta }\,\left({\widehat{A}}^{(m)}\right)\in {{\mathbb{R}}}^{K\times d},$$
(3)

which forms our region embeddings. Parameters θ are trainable and constitute a small fraction of the overall model.

Anatomy-guided prompt generator

Prompt bank

For each transformer layer \(\ell \in \{1,\ldots ,L\}\), we synthesize K prompt tokens from region embeddings via layer-specific linear maps \({W}_{p}^{(\ell )}\in {{\mathbb{R}}}^{d\times d}\) and biases \({b}^{(\ell )}\in {{\mathbb{R}}}^{d}\):

$${P}^{(\ell )}\,=\,{\rm{LN}}\,\left({E}^{(m)}{W}_{p}^{(\ell )}+{\bf{1}}{b}^{(\ell )\top }\right)\,\in \,{{\mathbb{R}}}^{K\times d}.$$
(4)

To improve stability and limit parameters, we share \({W}_{p}^{(\ell )}\) across contiguous layer groups (e.g., stages) and use rank-r factorization \({W}_{p}^{(\ell )}={U}^{(\ell )}{V}^{(\ell )\top }\) with \(r\ll d\).
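To give a sense of the savings from the rank-r factorization, the sketch below counts the parameters of a single prompt map with and without it. The hidden size d = 768 is that of ViT-B/16; the rank r = 8 is a hypothetical value chosen for illustration, since the rank used in the experiments is not stated here.

```python
from typing import Optional


def prompt_map_params(d: int, rank: Optional[int] = None) -> int:
    """Parameter count of one prompt map W_p (d x d) plus its bias b (d)."""
    if rank is None:
        return d * d + d       # dense W_p
    return 2 * d * rank + d    # W_p = U V^T with U and V of shape d x r


d = 768   # hidden size of ViT-B/16
r = 8     # hypothetical rank, for illustration only
dense = prompt_map_params(d)
low_rank = prompt_map_params(d, r)
print(dense, low_rank)
```

Sharing the map across a group of layers divides the total further by the group size, since one factorized map then serves every layer in the group.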

Topology-aware smoothing: Since anatomy varies smoothly, we regularize prompts across adjacent regions using a graph Laplacian \({\bf{L}}\in {{\mathbb{R}}}^{K\times K}\) (section “Task heads and losses”). This encourages neighboring tissue prompts to encode consistent context.

Prompt–token interaction (PIM)

Motivation

Prior VPT concatenates prompts and tokens and relies on self-attention to mix them. Medical imaging benefits from directed interactions where prompts explicitly query or explain anatomy-specific evidence.

Cross-attention update

At layer \(\ell \), given visual tokens \({Z}^{(\ell -1)}\in {{\mathbb{R}}}^{(N+1)\times d}\) and prompts \({P}^{(\ell )}\in {{\mathbb{R}}}^{K\times d}\), we compute two directed attentions:

(i) Token → Prompt (evidence aggregation):

$${{\rm{Attn}}}_{t\to p}\,=\,{\rm{softmax}}\,\left(\frac{{Z}^{(\ell -1)}{W}_{Q}^{(\ell )}{\left({P}^{(\ell )}{W}_{K}^{(\ell )}\right)}^{\top }}{\sqrt{d}}\right)\in {{\mathbb{R}}}^{(N+1)\times K},$$
(5)
$${\widetilde{P}}^{(\ell )}\,=\,{P}^{(\ell )}\,+\,{{\rm{Attn}}}_{t\to p}^{\top }\,\left({Z}^{(\ell -1)}{W}_{V}^{(\ell )}\right)\,{W}_{O}^{(\ell )}.$$
(6)

(ii) Prompt → Token (guidance infusion):

$${{\rm{Attn}}}_{p\to t}\,=\,{\rm{softmax}}\,\left(\frac{{\widetilde{P}}^{(\ell )}{\bar{W}}_{Q}^{(\ell )}{\left({Z}^{(\ell -1)}{\bar{W}}_{K}^{(\ell )}\right)}^{\top }}{\sqrt{d}}\right)\in {{\mathbb{R}}}^{K\times (N+1)},$$
(7)
$${\widehat{Z}}^{(\ell )}\,=\,{Z}^{(\ell -1)}\,+\,{{\rm{Attn}}}_{p\to t}^{\top }\,\left({\widetilde{P}}^{(\ell )}{\bar{W}}_{V}^{(\ell )}\right)\,{\bar{W}}_{O}^{(\ell )}.$$
(8)

Gated residual fusion: To maintain stability with a frozen backbone, we gate the prompt-induced update:

$${\alpha }^{(\ell )}\,=\,\sigma \,\left({{\rm{MLP}}}_{g}\,\left([{\rm{mean}}({\widetilde{P}}^{(\ell )}),\,{z}_{{\rm{cls}}}^{(\ell -1)}]\right)\right)\in (0,1),$$
(9)
$${Z}^{(\ell )}\,=\,(1-{\alpha }^{(\ell )})\,{Z}^{(\ell -1)}\,+\,{\alpha }^{(\ell )}\,{\widehat{Z}}^{(\ell )},$$
(10)

where \([\cdot ,\cdot ]\) denotes concatenation and σ is the sigmoid. This specific combination concatenates the global visual context (\({z}_{{\rm{cls}}}^{(\ell -1)}\)) with the aggregated anatomical guidance (\({\rm{mean}}({\widetilde{P}}^{(\ell )})\)), allowing the gating mechanism to dynamically calibrate the injection strength based on both the current semantic state and the available anatomical evidence. Eqs. (6)–(10) are implemented with multi-head projections; all projection matrices and \({{\rm{MLP}}}_{g}\) are trainable and small.
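The PIM update of Eqs. (5)–(10) can be traced end to end on small matrices. The following is a minimal single-head sketch in plain Python, not the multi-head implementation used in the experiments. It assumes a row-wise softmax, random stand-in projection matrices, and a single linear layer in place of \({{\rm{MLP}}}_{g}\); all helper names are illustrative.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def pim_layer(Z, P, W, w_gate, b_gate=0.0):
    """One PIM update. Z: (N+1) x d tokens (row 0 is the class token), P: K x d prompts."""
    scale = math.sqrt(len(Z[0]))
    # Eq. (5): token -> prompt attention, shape (N+1) x K.
    logits = matmul(matmul(Z, W["Q"]), transpose(matmul(P, W["K"])))
    attn_tp = softmax_rows([[v / scale for v in row] for row in logits])
    # Eq. (6): prompts aggregate token evidence.
    P_tilde = add(P, matmul(matmul(transpose(attn_tp), matmul(Z, W["V"])), W["O"]))
    # Eq. (7): prompt -> token attention, shape K x (N+1).
    logits = matmul(matmul(P_tilde, W["Qb"]), transpose(matmul(Z, W["Kb"])))
    attn_pt = softmax_rows([[v / scale for v in row] for row in logits])
    # Eq. (8): tokens receive prompt guidance.
    Z_hat = add(Z, matmul(matmul(transpose(attn_pt), matmul(P_tilde, W["Vb"])), W["Ob"]))
    # Eq. (9): scalar gate from [mean(P_tilde), z_cls]; one linear layer stands in for MLP_g.
    mean_p = [sum(col) / len(P_tilde) for col in zip(*P_tilde)]
    gate_in = mean_p + Z[0]
    alpha = 1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(w_gate, gate_in)) + b_gate)))
    # Eq. (10): convex combination of the previous and the prompt-updated tokens.
    Z_new = [[(1 - alpha) * z + alpha * zh for z, zh in zip(rz, rzh)]
             for rz, rzh in zip(Z, Z_hat)]
    return Z_new, P_tilde, alpha


rng = random.Random(0)
N, K, d = 4, 3, 8
Z = rand_matrix(rng, N + 1, d, 1.0)
P = rand_matrix(rng, K, d, 1.0)
W = {name: rand_matrix(rng, d, d) for name in ("Q", "K", "V", "O", "Qb", "Kb", "Vb", "Ob")}
w_gate = [rng.gauss(0.0, 0.1) for _ in range(2 * d)]
Z_new, P_tilde, alpha = pim_layer(Z, P, W, w_gate)
print(len(Z_new), len(Z_new[0]), len(P_tilde), 0.0 < alpha < 1.0)
```

Because the gate is a single scalar per layer, Eq. (10) reduces to the identity map when α approaches zero, which is what keeps the frozen backbone's behavior intact early in training.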

Cross-modal anatomical alignment (CMAA)

Motivation

We align tissue-level semantics across modalities to achieve robust transfer. We consider two supervision sources: (a) image–report text pairs via a frozen text encoder Ψ (if available), and (b) region-type labels or pseudo-labels shared across modalities.

Region-level contrast

Let \({E}^{(m)}\) be region embeddings (Eq. 3). We project them to a contrastive space with \({h}_{\phi }:{{\mathbb{R}}}^{d}\,\to \,{{\mathbb{R}}}^{{d}_{c}}\), and \({\ell }_{2}\)-normalize:

$${r}_{k}^{(m)}\,=\,\frac{{h}_{\phi }\,\left({E}_{k}^{(m)}\right)}{{\left\Vert {h}_{\phi }\left({E}_{k}^{(m)}\right)\right\Vert }_{2}}\in {{\mathbb{R}}}^{{d}_{c}}.$$
(11)

For a minibatch \({\mathcal{B}}\) and temperature τ > 0, we maximize agreement between the same region type across modalities (positives) and push away others (negatives):

$${{\mathcal{L}}}_{\mathrm{RCL}}\,=\,-\mathop{\sum }\limits_{(i,k)\in {\mathcal{B}}}\log \frac{{\sum }_{(j,{k}^{+})\in {\mathcal{P}}(i,k)}\exp \,\left(\langle {r}_{k}^{({m}_{i})},{r}_{{k}^{+}}^{({m}_{j})}\rangle /\tau \right)}{{\sum }_{(j,t)\in {\mathcal{N}}(i,k)}\exp \,\left(\langle {r}_{k}^{({m}_{i})},{r}_{t}^{({m}_{j})}\rangle /\tau \right)},$$
(12)

where \({\mathcal{P}}(i,k)\) collects positives that share the same anatomy type as (i, k) but come from other modalities or views; \({\mathcal{N}}(i,k)\) includes all negatives. Region types can be derived from atlas tags or by clustering \({E}^{(m)}\) with EMA-updated centroids.
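A small numerical sketch of Eq. (12) is given below. It follows the equation literally, with positives in the numerator and negatives only in the denominator, and it assumes that negatives are all batch entries with a different region type. The temperature value, the function names, and the toy embeddings are illustrative.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def region_contrastive_loss(embeddings, region_types, modalities, tau=0.1):
    """Eq. (12) on a small batch of projected region vectors.

    Positives share the anchor's region type but come from another modality.
    Negatives (an assumption of this sketch) are entries with a different region type.
    """
    r = [l2_normalize(v) for v in embeddings]
    loss = 0.0
    for i in range(len(r)):
        pos, neg = 0.0, 0.0
        for j in range(len(r)):
            if i == j:
                continue
            sim = math.exp(sum(a * b for a, b in zip(r[i], r[j])) / tau)
            if region_types[j] == region_types[i] and modalities[j] != modalities[i]:
                pos += sim
            elif region_types[j] != region_types[i]:
                neg += sim
        if pos > 0.0 and neg > 0.0:   # skip anchors lacking positives or negatives
            loss -= math.log(pos / neg)
    return loss


types = ["glandular", "fatty", "glandular", "fatty"]
mods = ["MG", "MG", "US", "US"]
aligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.9, 0.1]]
loss_aligned = region_contrastive_loss(aligned, types, mods)
loss_mixed = region_contrastive_loss(mixed, types, mods)
print(loss_aligned < loss_mixed)
```

With this literal form the loss can be negative, since the ratio inside the logarithm is not bounded by one; only its ordering between well-aligned and poorly aligned embeddings matters for the illustration.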

Text-grounded alignment (optional)

When reports are available, we encode anatomy phrases (e.g., “ductal tissue,” “glandular density”) via Ψ, obtain text vectors \({t}_{c}\), and add a CLIP-style objective:

$${{\mathcal{L}}}_{\mathrm{TXT}}\,=\,-\mathop{\sum }\limits_{(i,k)}\log \frac{\exp \,\left(\langle {r}_{k}^{({m}_{i})},{t}_{c(i,k)}\rangle /{\tau }_{t}\right)}{{\sum }_{{c}^{{\prime} }}\exp \,\left(\langle {r}_{k}^{({m}_{i})},{t}_{{c}^{{\prime} }}\rangle /{\tau }_{t}\right)}.$$
(13)
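Eq. (13) is a softmax cross-entropy over region-to-phrase similarities. A minimal sketch follows, assuming that the text vectors are also unit-normalized (as is usual for CLIP-style objectives) and using a hypothetical temperature of 0.07; the names and toy vectors are illustrative.

```python
import math


def text_alignment_loss(region_vecs, text_vecs, labels, tau_t=0.07):
    """Eq. (13): cross-entropy over cosine similarities between regions and phrases.

    labels[i] is the index c(i, k) of the phrase that matches region vector i.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    texts = [unit(t) for t in text_vecs]
    loss = 0.0
    for vec, c in zip(region_vecs, labels):
        r = unit(vec)
        logits = [sum(a * b for a, b in zip(r, t)) / tau_t for t in texts]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # stable log-sum-exp
        loss += log_z - logits[c]
    return loss


regions = [[1.0, 0.1], [0.1, 1.0]]
phrases = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for encoded anatomy phrases
loss_match = text_alignment_loss(regions, phrases, labels=[0, 1])
loss_swap = text_alignment_loss(regions, phrases, labels=[1, 0])
print(loss_match < loss_swap)
```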

Task heads and losses

Classification

We apply a linear head on \({z}_{{\rm{cls}}}^{(L)}\) for image-level diagnosis with cross-entropy:

$${{\mathcal{L}}}_{\mathrm{CLS}}\,=\,-\mathop{\sum }\limits_{c=1}^{C}{y}_{c}\log {\widehat{y}}_{c},\,\widehat{y}=\mathrm{softmax}({W}_{\mathrm{cls}}{z}_{\mathrm{cls}}^{(L)}),$$
(14)

where y is the one-hot label.

Segmentation

For pixel/patch-level tasks, we upsample token features and use a light decoder (frozen or tiny trainable) to predict masks \(\widehat{M}\). We combine Dice and BCE:

$${{\mathcal{L}}}_{{\rm{SEG}}}\,=\,1-\frac{2\langle \widehat{M},M\rangle +\epsilon }{\parallel \widehat{M}{\parallel }_{1}+\parallel M{\parallel }_{1}+\epsilon }\,+\,\beta \,{\rm{BCE}}(\widehat{M},M).$$
(15)
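The combined segmentation loss of Eq. (15) can be sketched on flattened masks as follows. The sketch assumes that BCE is averaged over pixels and that predictions are clipped away from 0 and 1 for numerical safety; neither detail is specified above, and the function name is illustrative.

```python
import math


def dice_bce_loss(pred, target, beta=1.0, eps=1e-6):
    """Eq. (15) on flattened masks: soft Dice loss plus beta-weighted mean BCE."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(abs(p) for p in pred) + sum(abs(t) for t in target)
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)

    def clip(p):
        # Keep probabilities strictly inside (0, 1) to avoid log(0).
        return min(max(p, 1e-7), 1.0 - 1e-7)

    bce = -sum(
        t * math.log(clip(p)) + (1.0 - t) * math.log(1.0 - clip(p))
        for p, t in zip(pred, target)
    ) / len(pred)
    return dice + beta * bce


mask = [1.0, 0.0, 1.0, 0.0]
loss_perfect = dice_bce_loss(mask, mask)
loss_inverted = dice_bce_loss([0.0, 1.0, 0.0, 1.0], mask)
print(loss_perfect < loss_inverted)
```

The Dice term is driven by region overlap and the BCE term by per-pixel errors, so β controls how much weight per-pixel calibration receives relative to overlap.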

Topology-aware prompt smoothing

Let \({P}^{(\ell )}\in {{\mathbb{R}}}^{K\times d}\) and let \({\bf{L}}\in {{\mathbb{R}}}^{K\times K}\) be the graph Laplacian over regions (adjacent tissues are connected), with L denoting the number of layers. We regularize:

$${{\mathcal{L}}}_{\mathrm{LAP}}\,=\,\frac{1}{L}\mathop{\sum }\limits_{\ell =1}^{L}\mathrm{tr}\,\left({({P}^{(\ell )})}^{\top }{\bf{L}}\,{P}^{(\ell )}\right).$$
(16)
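For an unnormalized Laplacian built from a symmetric adjacency matrix, the trace term in Eq. (16) equals a weighted sum of squared differences between prompts of adjacent regions, which is why it acts as a smoothness penalty. The sketch below checks this identity for one layer on a toy graph; the adjacency pattern and the function names are hypothetical, and Eq. (16) additionally averages this quantity over layers.

```python
def laplacian_from_adjacency(W):
    """Unnormalized graph Laplacian D - W for a symmetric adjacency matrix W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def smoothing_penalty(P, Lap):
    """tr(P^T Lap P) for one layer's prompts P (K x d)."""
    K, d = len(P), len(P[0])
    return sum(P[i][c] * Lap[i][j] * P[j][c]
               for i in range(K) for j in range(K) for c in range(d))


def pairwise_form(P, W):
    """Equivalent form: 0.5 * sum_{i,j} W_ij * ||p_i - p_j||^2."""
    K = len(P)
    return 0.5 * sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(P[i], P[j]))
                     for i in range(K) for j in range(K))


# Hypothetical adjacency for K = 3 regions: region 0 borders regions 1 and 2.
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
P = [[1.0, 2.0], [1.5, 2.0], [0.0, 0.0]]
Lap = laplacian_from_adjacency(W)
penalty = smoothing_penalty(P, Lap)
print(penalty, pairwise_form(P, W))
```

In this toy case the prompt of region 2 differs strongly from that of region 0, so that edge dominates the penalty, while the nearly identical prompts of regions 0 and 1 contribute little.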

Overall objective: The final loss is

$${\mathcal{L}}={{\mathcal{L}}}_{\mathrm{CLS}}\,+\,{{\rm{\lambda }}}_{\mathrm{seg}}{{\mathcal{L}}}_{\mathrm{SEG}}\,+\,{{\rm{\lambda }}}_{\mathrm{rcl}}{{\mathcal{L}}}_{\mathrm{RCL}}\,+\,{{\rm{\lambda }}}_{\mathrm{txt}}{{\mathcal{L}}}_{\mathrm{TXT}}\,+\,{{\rm{\lambda }}}_{\mathrm{lap}}{{\mathcal{L}}}_{\mathrm{LAP}}.$$
(17)

We select the λ weights on the validation set; \({{\mathcal{L}}}_{{\rm{TXT}}}\) is used only when reports exist.

Training protocol and parameter efficiency

Frozen backbone

All ViT/Swin blocks remain frozen. Trainable modules include: (i) Anatomy MLP \({g}_{\theta }\), (ii) prompt maps \(\{{W}_{p}^{(\ell )},{b}^{(\ell )}\}\) with low-rank factorization, (iii) PIM projections \(\{{W}_{Q}^{(\ell )},{W}_{K}^{(\ell )},{W}_{V}^{(\ell )},{W}_{O}^{(\ell )},{\bar{W}}_{Q}^{(\ell )},{\bar{W}}_{K}^{(\ell )},{\bar{W}}_{V}^{(\ell )},{\bar{W}}_{O}^{(\ell )}\}\) and gate MLP, (iv) small task heads, and (v) contrastive projector \({h}_{\phi }\). This amounts to <2% trainable parameters of full fine-tuning in our experiments.

Optimization

We use AdamW with cosine decay, warm-up, and gradient clipping. To stabilize contrastive training, we maintain modality-specific memory queues of region embeddings for hard negatives.

Inference and interpretability

At test time, we run ARE to obtain \({E}^{(m)}\), synthesize \(\{{P}^{(\ell )}\}\), and perform layer-wise PIM updates. For interpretability, we compute prompt-guided attention rollout. Let \({{\bf{A}}}_{p\to t}^{(\ell )}\) be prompt → token attention (averaged across heads). We define a normalized relevance map:

$${\bf{R}}\,=\,\mathrm{Norm}\,\left(\mathop{\prod }\limits_{{\rm{\ell }}=1}^{{\rm{L}}}({\bf{I}}+\gamma \,{{\bf{A}}}_{p\to t}^{(\ell )})\right),$$
(18)

where γ scales prompt influence. Upsampling R to image space yields anatomy-aligned saliency that radiologists can verify.

Complexity analysis

Time/Memory

Compared to vanilla VPT (which concatenates K prompts), our PIM adds two cross-attention blocks per layer with \({\mathcal{O}}(KN)\) attention cost (small since \(K\ll N\)). Low-rank maps and stage-sharing keep parameters linear in \(dr\) with \(r\ll d\).

Why anatomy-guided prompts work

Inductive bias

Prompts are context tokens that reshape attention. By tying them to region embeddings (Eq. 4) and aligning across modalities (Eq. 12), we enforce a stable anatomical basis that persists across imaging physics. Gated fusion (Eq. 10) prevents prompt overreach, preserving backbone priors while enabling targeted adaptation.

Ethics approval and consent to participate

All procedures involving human data were conducted in accordance with the Declaration of Helsinki. This study exclusively analyzed publicly available, anonymized datasets (INbreast, BUSI, and Duke-Breast-MRI), which were collected with prior approval from the respective institutional review boards and with informed consent obtained by the original investigators. Therefore, no additional ethical approval or informed consent was required for the present study.