Anatomically-guided Masked Autoencoder with Domain-Adaptive Prompting (AMAP) for multimodal cerebral aneurysm detection and segmentation

Huang, Mingxuan; Liu, Tiantian; Zhang, Jiayin; Su, Xiaoming; Chen, Hanlin; Li, Miao; Guo, Jinghan; Zu, Kaiyang; Chen, Xiaofeng; Su, Yanguo; Cong, Hengri; Yan, Long; Yan, Tianyi; Deng, Yiming

doi:10.1038/s41746-025-02188-8

Download PDF

Article
Open access
Published: 08 December 2025

Anatomically-guided Masked Autoencoder with Domain-Adaptive Prompting (AMAP) for multimodal cerebral aneurysm detection and segmentation

Mingxuan Huang^1,2^na1,
Tiantian Liu³^na1,
Jiayin Zhang⁴^na1,
Xiaoming Su⁵^na1,
Hanlin Chen^1,2,
Miao Li^1,2,
Jinghan Guo^1,2,
Kaiyang Zu⁶,
Xiaofeng Chen⁵,
Yanguo Su⁵,
Hengri Cong^1,2,
Long Yan⁴,
Tianyi Yan³ &
…
Yiming Deng^1,2

npj Digital Medicine volume 9, Article number: 20 (2026) Cite this article

3974 Accesses
Metrics details

Subjects

Abstract

Intracranial cerebral aneurysms are life-threatening vascular abnormalities whose rupture may result in subarachnoid hemorrhage, stroke, or death. Detecting and delineating aneurysms, particularly those under 5 mm, is essential for risk assessment and treatment planning but remains difficult for current AI approaches. Existing methods often fail to identify small aneurysms, mis-segment vascular bifurcations, and show reduced performance across imaging centers and modalities. We introduce AMAP (Anatomically-guided Masked Autoencoder with domain-adaptive Prompting), a framework for reliable cerebral aneurysm analysis. AMAP incorporates three key components: (1) anatomy-guided MAE pretraining, which directs self-supervised reconstruction toward cerebrovascular structures and captures subtle aneurysm morphology; (2) domain-adaptive prompting, which combines global vascular priors with case-specific prompts to enhance robustness across domains; and (3) boundary-aware contrastive learning with GS-EMA, which aligns vessel boundaries and mitigates false positives at bifurcations. Experiments on three public datasets (ADAM, IntrA, CQ500) and additional unseen domains demonstrate that AMAP surpasses CNN-, Transformer-, and foundation-based baselines, as well as domain generalization methods. It achieves 3−5% higher Dice scores, reduces false positives per case by about 20%, and improves calibration. Qualitative results further show accurate boundary preservation and consistent detection of small aneurysms overlooked by other methods. These findings suggest that AMAP is a step toward trustworthy and clinically applicable AI for aneurysm screening.

Automated, anatomy-based, heuristic post-processing reduces false positives and improves interpretability of deep learning intracranial aneurysm detection models

Article Open access 22 December 2025

Confocal Raman microspectroscopy imaging reveals structural protein reorganization in human intracranial aneurysm tissue

Article Open access 23 December 2025

Intracranial aneurysm CTA images and 3D models dataset with clinical morphological and hemodynamic data

Article Open access 12 November 2024

Introduction

Cerebral aneurysms are abnormal dilatations of intracranial arteries that carry a high risk of rupture, often leading to subarachnoid hemorrhage and severe morbidity or mortality¹. Early and accurate detection, together with precise segmentation, is therefore critical for patient management, surgical planning, and risk stratification². With the increasing use of CTA (computed tomography angiography) and TOF-MRA (time-of-flight magnetic resonance angiography), automated algorithms have drawn considerable interest as tools to assist radiologists in aneurysm detection and measurement³.

Recent progress in deep learning has brought automated anomaly detection closer to clinical practice. CNN-based methods have achieved encouraging results in aneurysm screening^4,5, while transformer architectures and self-supervised strategies such as Vision Transformers (ViT)⁶ and Masked Autoencoders (MAE)⁷ have demonstrated strong representation learning capabilities. At the same time, universal medical foundation models⁸ are beginning to connect natural and medical imaging domains. Despite these advances, several critical challenges remain.

Domain generalization remains limited⁹. Most existing models are trained on single-center, homogeneous datasets, and often perform poorly when applied to external cohorts¹⁰. Small aneurysm detection is difficult¹¹. Many aneurysms are only a few millimeters in size and may be obscured within complex vascular structures, leading to high false positive (FP) rates and reduced localization accuracy¹². In addition, while vision foundation models can improve accuracy, their lack of interpretability¹³ continues to hinder clinical adoption. These challenges highlight the need for a robust, domain-adaptive, and anatomically aware framework.

To address these issues, we present AMAP (Anatomically-guided Masked Autoencoder with Domain-Adaptive Prompting), which integrates three complementary components:Anatomically-guided MAE Pre-training. We use large-scale unlabeled CTA/TOF-MRA data with anatomy-aware masking, directing the reconstruction toward arterial regions. This encourages the encoder to learn vascular-specific priors and improves sensitivity to subtle aneurysm structures¹⁴. Prompt-guided Fine-tuning. Building on prompt learning¹⁵, we introduce lesion-aware prompts (e.g. “focus near arterial bifurcation”) that guide the model to attend to aneurysm-prone regions. This improves localization with limited annotations and supports cross-domain adaptability. Boundary-aware Contrastive Domain Generalization. We design a GS-EMA enhanced contrastive strategy that emphasizes boundary features, enabling domain-invariant vascular representations and reducing false positives at bifurcations¹⁶.

By combining these strategies, AMAP enhances both detection and segmentation of aneurysms, reducing FP rates while improving robustness across domains. On TOF-MRA benchmarks, AMAP achieves 4−8% Dice improvement with substantially fewer false positives, while maintaining stable sensitivity in multi-center validation. In addition, our prompt-guided visualizations provide interpretable outputs that align with radiological practice, narrowing the gap between AI predictions and clinical application.

Our contributions can be summarized as follow: We introduce an anatomically-guided MAE pre-training strategy that directs self-supervised learning toward arterial regions, significantly improving aneurysm detection sensitivity. We design a prompt-guided fine-tuning module that enables accurate localization with limited annotations and enhances cross-domain adaptability. We propose a boundary-aware contrastive generalization method with GS-EMA optimization, which strengthens robustness across scanners and reduces false positives. We demonstrate that AMAP achieves state-of-the-art performance on multiple CTA/TOF-MRA datasets, with higher Dice scores, lower FP/case, and clinically interpretable prompt-based visualizations.

Early CAD systems for IA relied on handcrafted features; recent works leverage deep CNNs and 3D transformers on CTA/TOF-MRA, reporting promising sensitivity yet still facing high false-positive (FP) rates for small lesions and across-site shifts. A 2025 scoping review of 36 IA studies highlighted inconsistent handling of confounders, limited external validation, and a dearth of prospective designs³. Methodologically, deformable-attention multi-scale 3D detectors/segmenters have improved sensitivity on CTA while mitigating class imbalance and extreme scale variance¹⁷. For TOF-MRA, clinical platforms showed high detection accuracy under curated settings¹⁸, and multi-class vascular/pathology segmentation pipelines were explored to support downstream aneurysm analysis¹⁹. On angiography, automated morphology evaluation reached high accuracy for aneurysm measurement²⁰. Patch-wise hybrid pipelines combining classical vesselness cues with deep models further reduced FPs in opportunistic screening. External replication and transferability remain challenging, as demonstrated by cross-site reproductions of IA segmentation. In clinical workflow studies, DL assistance modestly improved radiologist sensitivity on routine MRI, and “one-click” IA diagnosis/size measurement tools were prototyped. Broadly, recent meta-analyses confirm performance gains but also emphasize heterogeneity and deployment gaps.

The segment-anything paradigm catalyzed promptable segmentation in medicine. While the original SAM shows large variance on medical tasks²¹, medical adaptations dramatically improve utility: MedSAM establishes a universal medical segmentation foundation model^22,23, EMedSAM targets efficiency via compact encoders and med-adapters²⁴, and SAM-Med3D scales promptable segmentation to volumetric data with large 3D corpora^25,26. Latest extensions (e.g., MedSAM2) push 3D video/temporal consistency and improved prompting²⁷. Beyond pure segmentation, vision-language medical foundations (e.g. CLIP derivatives) exhibit strong zero/few-shot generalization²⁸. Prompt learning is increasingly adopted as a parameter-efficient adapter: embedded/visual prompt tuning enhances calibration and region focus in few-shot regimes. These advances support our AMAP design where lesion-aware prompts focus attention near bifurcations and thin vessels while keeping adaptation light-weight and domain adaptive.

Compared with prior IA detectors and segmenters^17,18,19,20, foundation/promptable models^{22,23,24,25,26,27,28,29,30,31,32}, and MIM/MAE-based SSL⁷, our AMAP is unique in jointly (i) injecting vascular anatomy during MAE pretraining, (ii) employing lesion-aware domain-adaptive prompts for few-shot cross-site adaptation, and (iii) enforcing boundary-aware contrastive DG with GS-EMA. This combination directly addresses the small-lesion sensitivity/FP trade-off and cross-site brittleness observed in recent evaluations, while remaining parameter-efficient and promptable for practical clinical workflows.

Despite recent progress, prior aneurysm methods typically optimize one axis at a time (pretraining, adaptation, or DG)^{5,15,17,19,22,25,33,34,35,36}. As summarized in Table 1, AMAP is the only approach jointly unifying anatomy-guided self-supervised pretraining, parameter-efficient prompt tuning, and explicit boundary-aware domain generalization.

Table 1 Comparison to recent aneurysm-specific methods

Full size table

Results

Main results

Table 2 summarizes AMAP against representative baselines across three public datasets (ADAM, IntrA, CQ500). We report 95% confidence intervals (CIs) and paired t-tests with FDR correction versus the best baseline (Med-MAE). We also include computational complexity: parameter counts (Params), FLOPs, and peak inference GPU memory (Memory). AMAP achieves the best performance on all metrics; its Dice is significantly higher than every baseline (p < 0.05), while simultaneously attaining the lowest FP/case (0.89), indicating a strong precision-recall balance. Notably, AMAP has a parameter budget comparable to UNETR and far smaller than MedNeXt, yet delivers markedly superior accuracy, highlighting the design efficiency.

Table 2 Comprehensive comparison on ADAM, IntrA, and CQ500

Full size table

Classical CNN backbones (3D U-Net, V-Net, nnU-Net) provide solid reference performance but remain limited on small aneurysms, with Dice in the low-to-mid 70s and FP/case above 1.2. Transformer-based models (UNETR, Swin-UNETR, TransUNet) improve sensitivity and boundary delineation, yet their generic pretraining and lack of anatomical bias restrict further gains. Modern large-kernel CNNs (MedNeXt) reach competitive Dice (81.0%) but still yield elevated FP rates.

Foundation and promptable models: Foundation models (MedSAM, SAM-Med3D, MedSAM2) highlight the benefit of prompt-based interaction, attaining Dice around 78–81% and FROC-AUC ~0.80. However, without lesion-specific prompts, these models sometimes over-highlight vascular segments, leading to unstable FP/case ( >1.2). BiomedCLIP with linear probing underperforms due to lack of 3D adaptation, indicating the need for task-specific prompting. Efficient Prompt Tuning (EPT) narrows the gap but falls short of AMAP since it does not incorporate anatomy-aware or domain-adaptive design.

Vanilla MAE on unlabeled volumes improves representation quality and yields Dice scores around 80%. Med-MAE³⁷, which adopts knowledge-guided masking, further boosts Dice to 81.4%, showing the benefit of aligning pretraining with medical priors. Yet, both rely on uniform or heuristic masking. In contrast, AMAP explicitly biases learning toward vessels and boundaries, bringing ~3 additional Dice points and lower HD95, demonstrating that vascular-focused reconstruction better preserves aneurysm morphology.

DG methods such as Meta-DG³⁸, DG-SOTA³⁹, and TENT³⁶ partially alleviate cross-site shifts, raising sensitivity and reducing FP relative to plain baselines. Their gains, however, remain modest (Dice 78–81%) and vary across datasets. This limitation arises because DG strategies emphasize global invariance but overlook vascular boundaries where aneurysm cues appear. Our boundary-aware contrastive alignment with GS-EMA stabilizes cross-domain representations and reduces FP/case below 0.9, a clear improvement over the next-best DG baseline (1.15 FP/case).

Dedicated pipelines (multi-scale deformable 3D detector¹⁷, TOF-MRA clinical platform¹⁸, joint vessel+aneurysm segmentation¹⁹) outperform generic backbones in sensitivity (up to 80.5%). Yet, they are often tailored to a single modality and lack generalization across domains. They also do not offer the parameter efficiency or adaptability of AMAP.

AMAP surpasses all baselines with Dice 84.6%, HD95 3.9 mm, sensitivity 83.1%, FP/case 0.89, and FROC-AUC 0.861. Improvements are consistent across ADAM, IntrA, and CQ500, confirming strong cross-domain robustness. The ~20% reduction in FP/case compared to strong baselines (e.g., Med-MAE, DG-SOTA) demonstrates that anatomy-guided masking and boundary-aware contrastive learning effectively address the trade-off between small-lesion sensitivity and false positives. In addition, domain-adaptive prompts further support stability under multi-center shifts.

Three key lessons emerge: (i) Anatomy-aware pretraining matters: vascular-focused MAE learns more discriminative features than generic MIM. (ii) Prompts must adapt to domain and case: static prompts (MedSAM) plateau quickly, while AMAP’s adaptive prompts capture both instance and domain variation. (iii) Boundaries drive generalization: aligning vascular boundaries across domains reduces FP and improves stability, a factor overlooked in prior DG work. Together, these findings suggest that future IA models should integrate anatomy priors, adaptive prompts, and explicit boundary alignment to maximize clinical reliability.

Ablation studies

To disentangle the contribution of each module in AMAP, we perform ablation studies on the IntrA dataset (CTA) using the same backbone (ViT-3D) and evaluation protocol as Sec. 2.1. Results are summarized in Table 3.

Table 3 Ablation study on the IntrA dataset

Full size table

Replacing vanilla MAE with our anatomy-guided variant improves Dice by +2.1% and reduces FP/case by ~0.14. This shows that vascular-focused pretraining yields stronger features for small and morphologically diverse aneurysms.

Adding shared prompts increases Dice to 83.0%, and instance-dynamic prompts bring an additional + 0.7%. These results suggest that global vascular priors and case-specific guidance complement one another. Prompts thus act as anatomical anchors, reducing inter-patient variance.

Incorporating vesselness-driven attention bias lowers FP/case from 0.97 to 0.93. This indicates that anatomically weighted attention steers the model toward true vascular abnormalities rather than background noise.

Boundary-aware DG without GS-EMA already reduces FP/case to 0.91, highlighting the role of boundary alignment in domain robustness. Adding GS-EMA further boosts performance, with the full AMAP system reaching 85.0% Dice and pushing FP/case below 0.9, the best across all configurations.

Three conclusions emerge: (i) anatomy-aware masking is the largest contributor, confirming the value of domain-specific self-supervision; (ii) domain-adaptive prompts progressively refine lesion localization, with shared and instance prompts working synergistically; (iii) boundary-aware DG with GS-EMA enhances cross-domain stability and reduces false positives. Together, these modules form a complementary pipeline that consistently improves segmentation and detection.

Cross-domain Generalization and Robustness

CTA vs. TOF-MRA. We train and test within individual modalities to quantify modality gaps. Table 4 shows substantial CTA-TOF differences; AMAP outperforms baselines on both, benefitting from anatomy priors (unit-agnostic) and boundary-aware DG.

Table 4 Performance by modality (CTA vs. TOF-MRA)

Full size table

Figures 1–2 present cross-domain transfer and robustness experiments. We simulate real-world deployment by training models on one dataset (source) and testing on an unseen dataset (target), capturing shifts in scanners, protocols, and populations.

**Fig. 1: Cross-domain Dice performance.**

**Fig. 2: Robustness under degradation.**

CNN and Transformer baselines (nnU-Net, UNETR, Swin-UNETR) show severe degradation on unseen domains, with Dice drops exceeding 20% relative to in-domain results. Foundation and promptable models (MedSAM, BiomedCLIP) do not yield consistent improvements, as their generic prompts fail to adapt to domain-specific intensity variations. Self-supervised approaches (Vanilla MAE, Med-MAE) reduce the drop to ~16%, confirming the value of unsupervised pretraining. However, they remain fragile at vascular boundaries, producing FP/case > 1.1.

Meta-learning DG and DG-SOTA pipelines raise cross-domain performance, but the gains are limited. For example, DG-SOTA reaches Dice 79.4% when trained on IntrA and tested on CQ500, yet FP/case stays above 1.1. TENT (test-time entropy minimization) offers small improvements but becomes unstable when batch statistics vary sharply. These results suggest that global feature alignment alone is insufficient for cerebrovascular tasks, where small lesion cues are highly sensitive to scanner-specific differences.

Our ablated variant (Boundary-DG w/o GS-EMA) achieves Dice ~80% with FP/case <1.0, outperforming prior DG baselines. This shows that aligning vascular boundary representations across domains directly reduces false positives and improves sensitivity to small lesions.

The complete AMAP framework achieves 83.4% Dice on CQ500 → ADAM transfer with only 0.86 FP/case, while maintaining the lowest average Relative Performance Drop (RPD, 11.2%) and Expected Calibration Error (ECE, 6.9%). Compared with the strongest baseline (Med-MAE or DG-SOTA), AMAP reduces FP/case by ~20% and cuts calibration error by half, delivering both higher accuracy and more reliable confidence estimates.

(i) Anatomy-aware alignment is essential: vascular-specific pretraining and boundary-guided contrastive learning are critical to mitigating distribution shifts. (ii) GS-EMA improves stability: adaptive momentum smooths training across heterogeneous domains and avoids collapse seen with fixed EMA. (iii) Clinical reliability: lower FP/case and improved calibration enable safer integration of AMAP predictions into screening workflows, reducing false alarms while maintaining sensitivity.

To further examine case-level robustness, we visualize Dice distributions with violin plots (Fig. 3). CNN and Transformer baselines (e.g., nnU-Net, UNETR) display wide and skewed distributions, reflecting unstable patient-level performance. Med-MAE and DG-SOTA improve mean Dice but retain long tails, with some cases below 70%. In contrast, AMAP produces a concentrated distribution with a higher median and lower variance, showing that anatomy-aware pretraining and boundary-guided alignment not only improve average accuracy but also stabilize predictions across patients. Such consistency is crucial for clinical deployment, where reliability at the individual level is as important as overall mean performance.

**Fig. 3: Distribution of cross-domain Dice scores across methods.**

Cross-domain Generalization. Table 5 reports train-on-source, test-on-unseen-target results with 95% CIs and p-values (AMAP vs. DG SOTA). All baselines degrade notably under domain shift (Avg RPD > 15%). In contrast, AMAP shows the strongest robustness with only 11.2% Avg RPD and significantly outperforms the second-best DG SOTA across all transfer directions (p < 0.05).

Table 5 Cross-domain generalization and robustness

Full size table

Visualization analysis

Figures 4 –6 present qualitative comparisons of aneurysm segmentation. In Fig. 4, AMAP produces masks that closely align with the ground truth, whereas baselines such as EPT and 3D U-Net either miss small lesions or spill into adjacent vessels. Figure 5 illustrates cross-modality robustness: under CTA, TOF-MRA, and T2-FLAIR, AMAP consistently preserves aneurysm morphology and suppresses false positives, while competing methods show clear degradation. Additional cases in Fig. 6 further confirm that AMAP delivers more reliable localization and smoother aneurysm boundaries than other approaches.

**Fig. 4: Case-wise aneurysm segmentation comparison.**

**Fig. 5: Cross-modality visualization on Brain CTA, TOF-MRA, and T2-FLAIR.**

**Fig. 6: Additional qualitative comparison across multiple cases.**

These qualitative observations support the quantitative findings: (i) AMAP preserves boundaries with higher precision, (ii) generalizes more reliably across modalities and cases, and (iii) produces outputs that are clinically interpretable and directly useful to radiologists.

We calculated an “Attention-GT IOU" metric. For this, we extracted the attention maps from the final encoder block (where prompt influence is strongest), binarized them (top 25% of attention), and computed the Intersection over Union (IOU) with the ground truth aneurysm mask. A “Hit" was defined as an IOU > 0.5. As shown in Table 6, the baseline ViT and even the Vanilla MAE models have very poor attention alignment; their attention is diffuse and rarely “hits" the target. In sharp contrast, our AMAP (with prompts) achieves a mean IOU of 0.72, with 88.4% of cases registering as a “hit". This numerically confirms that our domain-adaptive, lesion-aware prompts are successfully steering the model’s focus to the true pathology.

Table 6 Quantitative Analysis of Prompt-Guided Attention Fidelity

Full size table

Clinical relevance and case studies

Beyond benchmark performance, we examine the clinical relevance of AMAP in real-world aneurysm screening and treatment planning. Intracranial aneurysms are often small (2–5 mm) and easily overlooked, while false positives can burden radiologists and increase unnecessary follow-up. Thus, both sensitivity and specificity are crucial for clinical use.

Case studies show that AMAP detects aneurysms as small as 2.3 mm in diameter (Fig. 4, Case C), which baseline models frequently miss. Accurate delineation of such small lesions is critical for early intervention and risk stratification, directly impacting patient outcomes.

In practice, over-segmentation or misclassification at vascular bifurcations is a major source of false alarms. Our boundary-aware contrastive learning reduces these errors (average FP/case < 0.9, see Sec. 2.3), easing radiologists’ workload by directing attention to true pathological findings.

We evaluated AMAP on datasets from different hospitals and imaging protocols (CTA, TOF-MRA, T2-FLAIR). Unlike prior methods that degrade under distribution shifts, AMAP maintains stable sensitivity and calibration across domains (Fig. 5). Such robustness is essential for multi-center deployment, where scanners and acquisition settings vary widely.

Prompt-guided visualizations highlight aneurysm-prone regions (e.g., bifurcations and curved arterial segments), providing intuitive cues that help radiologists verify AI predictions. Expert feedback indicates that AMAP’s outputs are both accurate and interpretable, addressing the trust gap associated with black-box models.

These case studies demonstrate that AMAP enhances clinical workflow in three key ways: (i) reliable detection of small aneurysms that are often missed, (ii) substantial reduction of false positives that decrease reading burden, and (iii) robustness across centers and modalities, supporting scalable deployment. Together, these features move AMAP closer to practical integration into computer-assisted diagnostic systems for cerebrovascular disease.

Subgroup Analysis (Location and Size). We further stratify performance by aneurysm location and size. As shown in Table 7, AMAP outperforms baselines across ICA, MCA, AComA, and PComA. More importantly, Table 8 shows sensitivity by aneurysm diameter: while Med-MAE’s sensitivity drops sharply for <3 mm lesions (68.5%), AMAP sustains 80.2% sensitivity with FP/case < 1.0, underscoring our anatomy-guided and boundary-aware design advantages for the hardest micro-aneurysm regime.

Table 7 Subgroup analysis by aneurysm location on ADAM (DSC %)

Full size table

Table 8 Sensitivity (%) and FP/case by aneurysm size (all datasets)

Full size table

Discussion

We introduced AMAP, an anatomically-guided masked autoencoder with domain-adaptive prompting and boundary-aware generalization, designed for cerebral aneurysm detection and segmentation. Across three public datasets (ADAM, IntrA, CQ500) and multiple unseen domains, AMAP consistently outperformed CNN- and Transformer-based models, foundation and promptable architectures, and domain generalization baselines. Both quantitative metrics (Dice, HD95, FP/case, FROC-AUC) and qualitative visualizations (Sec. 2.4) demonstrate its ability to detect small aneurysms, suppress false positives, and remain robust under domain shifts.

Performance gains arise from three complementary design choices: (i) anatomy-guided pretraining, which directs self-supervised learning toward vascular regions and yields representations sensitive to aneurysm morphology; (ii) domain-adaptive prompting, which combines global vascular priors with case-specific features, supporting fine-tuning across heterogeneous domains; and (iii) boundary-aware contrastive learning with GS-EMA, which enforces cross-domain consistency at vessel boundaries, a common source of false positives. Together, these modules form a unified framework that balances sensitivity and specificity–two essential requirements in clinical practice. This anatomically-specific design allows AMAP to outperform large, general-purpose foundation models (like MedSAM), which, despite larger pretraining, lack the inductive bias needed to reliably identify small, morphologically diverse vascular lesions across different imaging protocols.

As shown in Sec. 2.5, AMAP improves early detection of small aneurysms, reduces false alarms, and generalizes across scanners and imaging protocols. These properties are especially important for deployment in multi-center settings, where differences in acquisition and patient populations often limit the reliability of existing models. By generating prompt-guided, interpretable outputs, AMAP also enhances clinical trust, an important step toward real-world integration.

Despite its promising results, our study has several limitations that frame the agenda for future work. First, and most importantly, this work relies on retrospective, publicly available datasets. These cohorts may contain selection biases (e.g., enrichment of positive cases or exclusion of low-quality scans). Therefore, the performance of AMAP in a real-world, prospective clinical setting is unverified. Second, we did not perform a comparative reader study against radiologists. Such a study is the gold standard to quantify any true clinical utility, such as improved sensitivity or reduced reading time, and to evaluate the model in an AI-assisted workflow (AMAP + radiologist). Third, our model was trained only on aneurysm data. Its behavior on unseen pathologies, particularly aneurysm mimics (e.g., infundibula, vascular loops) that are a known source of false positives, or in the presence of severe imaging artifacts (e.g., motion, metal). Third, our model was trained only on aneurysm data. Its behavior on unseen pathologies, particularly aneurysm mimics (e.g., infundibula, vascular loops) that are a known source of false positives, or in the presence of severe imaging artifacts (e.g., motion, metal). Fourth, our quantitative evaluation was focused on CTA and TOF-MRA. Extension to other modalities, such as digital subtraction angiography (DSA) and 4D flow MRI, is needed. Finally, while AMAP reduces FPs, achieving near-zero FPs for large-scale screening remains a challenge, and the current pipeline’s reliance on vesselness preprocessing may limit end-to-end efficiency.

Our future work will directly address these limitations. The immediate priority is to conduct a prospective clinical trial to validate AMAP on consecutive, real-world cases. This trial will also facilitate the necessary reader study. From a technical standpoint, we will extend AMAP to hemodynamic modeling (e.g., CFD-based rupture risk prediction) and integrate image features with clinical data for comprehensive risk stratification. We will also explore the practical challenges of clinical workflow integration (e.g., PACS deployment, real-time overlays) and the associated ethical considerations (managing FP/FN risks, ensuring algorithmic fairness) and regulatory pathways (e.g., FDA/CE approval). A full cost-effectiveness analysis remains a long-term goal.

AMAP brings algorithmic advances closer to clinical translation by jointly optimizing anatomical priors, adaptive prompting, and boundary-aware generalization. These insights may guide future work on anatomically informed foundation models for medical imaging and contribute to the development of reliable AI for cerebrovascular disease management.

We introduced AMAP, an anatomically-guided masked autoencoder with domain-adaptive prompting and boundary-aware generalization for cerebral aneurysm detection and segmentation. By incorporating vascular priors into pretraining, applying adaptive prompts during fine-tuning, and enforcing cross-domain boundary consistency with GS-EMA, AMAP improves sensitivity, boundary accuracy, and false positive control compared with CNN, Transformer, foundation, and DG baselines. Comprehensive experiments on three public datasets and multiple unseen domains confirm its robustness and clinical relevance, especially in detecting small aneurysms and reducing reading burden in multi-center settings.

This work underscores the value of anatomically informed self-supervision, adaptive prompting, and boundary-aware generalization in developing reliable methods for medical imaging. As discussed earlier, significant limitations remain, chiefly the reliance on retrospective data and the lack of a prospective reader study against clinical experts. AMAP represents a step toward clinically deployable aneurysm screening systems, but its true clinical utility is yet to be proven. Future efforts will focus on prospective validation, extend the framework to additional modalities such as DSA and 4D flow MRI, integrate multimodal clinical features for risk prediction, and evaluate workflow integration through prospective trials. The principles demonstrated here may inform broader progress toward anatomically grounded and domain-robust foundation models for healthcare.

Methods

We propose AMAP (Anatomically-guided Masked Autoencoder with Domain-Adaptive Prompting), a three-stage framework designed for robust intracranial aneurysm (IA) detection and segmentation under cross-site shifts. AMAP integrates (i) anatomy-guided masked pretraining, (ii) prompt-guided fine-tuning, and (iii) boundary-aware domain generalization with GS-EMA, to learn vascular-specific representations, localize small lesions with higher precision, and maintain stable performance across domains. An overview of the framework is shown in Fig. 7.

Preliminaries and notation

Let ${\bf{I}}\in {{\mathbb{R}}}^{H\times W\times D}$ be a 3D CTA or TOF-MRA volume normalized to [0, 1]. We compute anatomy priors: Vesselness map V ∈ [0, 1]^H×W×D (e.g., via multi-scale Hessian/Frangi or a pretrained vessel segmenter) highlighting arterial structures. Centerline map C ∈ [0, 1]^H×W×D (e.g., skeletonization of vessels). Boundary band ${{\mathcal{B}}}_{r}=\{x| {\rm{dist}}(x,\partial {\rm{vessel}})\le r\}$ computed by distance transform, capturing vessel boundaries where IA morphology is informative.

A 3D ViT-style encoder⁶ tokenizes I into non-overlapping patches of size p³, yielding $N=\frac{H}{p}\cdot \frac{W}{p}\cdot \frac{D}{p}$ tokens ${\{{{\bf{x}}}_{i}\}}_{i = 1}^{N}$, each embedded to ${{\mathbb{R}}}^{d}$.

Stage I: Anatomically-guided MAE pretraining

We adapt MAE⁷ to prioritize vascular anatomy. Instead of uniform random masking, we define a vessel-biased mask sampler that increases the probability of masking tokens located in (or near) vessels so that reconstruction pressure is concentrated on cerebrovascular regions.

Anatomy-biased mask probability: Let v_i = pool(V) and ${b}_{i}={\rm{pool}}({\mathbb{1}}[x\in {{\mathcal{B}}}_{r}])$ be average vesselness and boundary-band indicators pooled over the spatial extent of token i. We define a mask probability

$${\pi }_{i}=\sigma \left(\alpha \cdot {\tilde{v}}_{i}+\beta \cdot {\tilde{b}}_{i}+\gamma \right),\qquad {\tilde{v}}_{i}=\frac{{v}_{i}-{\mu }_{V}}{{\sigma }_{V}},{\tilde{b}}_{i}=\frac{{b}_{i}-{\mu }_{B}}{{\sigma }_{B}},$$

(1)

where σ( ⋅ ) is the logistic function and (μ_V, σ_V), (μ_B, σ_B) are running statistics. We sample a binary mask m_i ~ Bernoulli(π_i) with global ratio ρ (e.g., ρ ≈ 0.75) by temperature-scaling α, β, γ during warmup to satisfy ${\mathbb{E}}[\frac{1}{N}{\sum }_{i}{m}_{i}]=\rho$.

Vascular-focused reconstruction loss: Given masked tokens {x_i: m_i = 1} as targets, the decoder reconstructs ${\hat{{\bf{x}}}}_{i}$. We restrict loss to vascular regions:

$${{\mathcal{L}}}_{{\rm{rec}}}=\frac{1}{Z}\mathop{\sum }\limits_{i=1}^{N}{m}_{i}\cdot {w}_{i}\cdot {\ell }_{\delta }\left({\hat{{\bf{x}}}}_{i},{{\bf{x}}}_{i}\right)\quad {\rm{with}}\quad {w}_{i}={\lambda }_{v}{\tilde{v}}_{i}+{\lambda }_{b}{\tilde{b}}_{i}+{\lambda }_{0},$$

(2)

where ℓ_δ is the Huber loss, w_i emphasizes vessel and boundary tokens, and Z normalizes ∑_im_iw_i. This anatomy-aware MAE compels the encoder to capture fine vascular morphology and boundaries crucial for IA cues.

Stage II: Prompt-guided fine-tuning

We inject lesion-aware prompts to steer attention to IA-prone regions (bifurcations, high-curvature segments) while enabling parameter-efficient adaptation across domains.

Prompt parameterization: We use two prompt families concatenated to the token stream at every transformer block (deep prompting):

$${{\bf{P}}}_{{\rm{shared}}}\in {{\mathbb{R}}}^{{K}_{s}\times d}\quad ({\rm{global}},{\rm{learnable}},{\rm{vascular}}\,{\rm{priors}})$$

(3)

$${{\bf{P}}}_{{\rm{inst}}}={f}_{\theta }\left(\phi ({\bf{V}}),\phi ({\bf{C}}),{\rm{Hist}}({\bf{I}})\right)\in {{\mathbb{R}}}^{{K}_{i}\times d}\quad ({\rm{instance}}-{\rm{conditioned}})$$

(4)

where ϕ( ⋅ ) is global average pooling (or strided pooling) to stable low-dim descriptors; Hist(I) is intensity histogram (e.g., 32 bins). The MLP f_θ maps image-specific anatomy/style to dynamic prompts.

Domain-adaptive prompting: We further make prompts domain-adaptive. A style encoder g_ψ produces a domain code ${{\bf{z}}}_{d}={g}_{\psi }({\bf{I}})\in {{\mathbb{R}}}^{r}$ (no site labels needed). A gating head generates mixture coefficients ${\boldsymbol{\gamma }}={\rm{softmax}}({\bf{W}}{{\bf{z}}}_{d})$ and forms

$${\bf{P}}={\gamma }_{1}\,{{\bf{P}}}_{{\rm{shared}}}+{\gamma }_{2}\,{{\bf{P}}}_{{\rm{inst}}},\qquad {\gamma }_{1}+{\gamma }_{2}=1.$$

(5)

This yields domain-adaptive prompts that interpolate between global vascular priors and instance/style-conditioned hints.

Attention biasing with anatomy priors: To bias self-attention toward vessels, we add a spatial bias derived from the upsampled vesselness map V. For a head with attention logits A, we modify:

$${\tilde{{\bf{A}}}}_{ij}={{\bf{A}}}_{ij}+\eta \cdot {r}_{j},\qquad {r}_{j}={\rm{stopgrad}}\left({\rm{clip}}\left({\rm{pool}}({\bf{V}})\right)\right),$$

(6)

where η is learnable (per head or shared). Empirically, Eq. (6) improves small-IA recall without incurring heavy compute.

Task heads and supervision: We attach a UNet-like lightweight decoder for voxel-wise IA segmentation $\hat{{\bf{Y}}}\in {[0,1]}^{H\times W\times D}$ and a 3D detection head (anchor-free center-point heatmap) for IA candidates $\hat{{\bf{Q}}}\in {[0,1]}^{{H}^{{\prime} }\times {W}^{{\prime} }\times {D}^{{\prime} }}$. Supervision:

$${{\mathcal{L}}}_{{\rm{seg}}}={\lambda }_{{\rm{dice}}}\cdot {\rm{DiceLoss}}(\hat{{\bf{Y}}},{\bf{Y}})+{\lambda }_{{\rm{bce}}}\cdot {\rm{BCE}}(\hat{{\bf{Y}}},{\bf{Y}}),$$

(7)

$${{\mathcal{L}}}_{{\det }}={\rm{Focal}}(\hat{{\bf{Q}}},{\bf{Q}})+{\lambda }_{{\rm{box}}}\cdot {\rm{SmoothL1}}(\hat{{\bf{B}}},{\bf{B}}),$$

(8)

with Y the IA mask, Q center heatmap, and $(\hat{{\bf{B}}},{\bf{B}})$ predicted/gt box parameters when available.

Prompt regularization and attention alignment: We stabilize prompts and encourage anatomical focus:

$${{\mathcal{L}}}_{{\rm{prompt}}}=\parallel {{\bf{P}}}_{{\rm{shared}}}{\parallel }_{2}^{2}+{\mathbb{E}}\left[\parallel {{\bf{P}}}_{{\rm{inst}}}{\parallel }_{2}^{2}\right],$$

(9)

$${{\mathcal{L}}}_{{\rm{align}}}={\rm{KL}}\left({\rm{softmax}}(\tilde{{\bf{A}}}/{\tau }_{a})\,\parallel \,{\rm{softmax}}({\bf{R}}/{\tau }_{a})\right),$$

(10)

where R are attention targets derived from V (or bifurcation priors), and τ_a is temperature.

Stage III: Boundary-aware DG with GS-EMA

We address distribution shifts via a student–teacher scheme with a G radient-Stable EMA (GS-EMA) teacher and a boundary-aware contrastive objective.

GS-EMA teacher: Let θ_t be student parameters at step t. We maintain a teacher ${\bar{\theta }}_{t}$ updated by a momentum m_t that depends on the gradient norm to avoid oscillations across heterogeneous domains:

$${m}_{t}={\rm{clip}}\left(1-\kappa \,/\,(\epsilon +\parallel {{\bf{g}}}_{t}{\parallel }_{2})\,,{m}_{\min },{m}_{\max }\right),\quad {\bar{\theta }}_{t}={m}_{t}\,{\bar{\theta }}_{t-1}+(1-{m}_{t})\,{\theta }_{t},$$

(11)

where g_t is the EMA of recent gradients, κ > 0 controls adaptivity, and $0 < {m}_{\min }\le {m}_{\max } < 1$ (e.g., ${m}_{\min }=0.90$, ${m}_{\max }=0.999$). This GS-EMA damps unstable updates typical in multi-site training and improves teacher consistency.

Boundary feature extraction: Given predictions $\hat{{\bf{Y}}}$ (student) and ${\hat{{\bf{Y}}}}^{{\rm{T}}}$ (teacher) with sigmoid σ, we compute boundary maps

$${{\bf{B}}}_{{\rm{stu}}}={\left\Vert \nabla \sigma (\hat{{\bf{Y}}})\right\Vert }_{1},\qquad {{\bf{B}}}_{{\rm{tea}}}={\left\Vert \nabla \sigma ({\hat{{\bf{Y}}}}^{{\rm{T}}})\right\Vert }_{1},$$

(12)

and obtain embedding vectors via a small projector h_ξ with global pooling:

$${{\bf{z}}}_{{\rm{stu}}}={h}_{\xi }({{\bf{B}}}_{{\rm{stu}}}),\quad {{\bf{z}}}_{{\rm{tea}}}={h}_{\xi }({{\bf{B}}}_{{\rm{tea}}}).$$

(13)

Boundary-aware contrastive objective (cross-domain): For a minibatch containing samples from multiple domains, we form positives across domains that have confident teacher boundaries and similar pseudo-labels (threshold τ_c). Using InfoNCE:

$${{\mathcal{L}}}_{{\rm{con}}}=-\sum _{i}\log \frac{\exp \left(\langle {{\bf{z}}}_{{\rm{stu}}}^{i},{{\bf{z}}}_{{\rm{tea}}}^{i+}\rangle /{\tau }_{c}\right)}{{\sum }_{j\in {\mathcal{N}}(i)}\exp \left(\langle {{\bf{z}}}_{{\rm{stu}}}^{i},{{\bf{z}}}_{{\rm{tea}}}^{j}\rangle /{\tau }_{c}\right)},$$

(14)

where 〈 ⋅ , ⋅ 〉 is cosine similarity, ${\mathcal{N}}(i)$ includes cross-domain negatives (background/other structures). This explicitly aligns boundary representations across domains, improving FP control around small IAs.

Consistency and boundary fidelity: We further enforce prediction consistency and boundary sharpness:

$${{\mathcal{L}}}_{{\rm{cons}}}={\rm{KL}}\left(\sigma (\hat{{\bf{Y}}})\,\parallel \,\sigma ({\hat{{\bf{Y}}}}^{{\rm{T}}})\right)+{\lambda }_{{\rm{detc}}}\,{\rm{KL}}\left(\hat{{\bf{Q}}}\,\parallel \,{\hat{{\bf{Q}}}}^{{\rm{T}}}\right),$$

(15)

$${{\mathcal{L}}}_{{\rm{bndry}}}={\left\Vert \nabla \sigma (\hat{{\bf{Y}}})-\nabla {\bf{Y}}\right\Vert }_{1},$$

(16)

where the second term uses labels when available; otherwise replace Y by teacher boundaries with confidence weighting.

Overall objective

The overall loss over labeled source and (optional) unlabeled target data is

$${\mathcal{L}}={\lambda }_{{\rm{rec}}}\,{{\mathcal{L}}}_{{\rm{rec}}}+{\lambda }_{{\rm{seg}}}\,{{\mathcal{L}}}_{{\rm{seg}}}+{\lambda }_{{\det }}\,{{\mathcal{L}}}_{{\det }}+{\lambda }_{{\rm{prompt}}}\,{{\mathcal{L}}}_{{\rm{prompt}}}+{\lambda }_{{\rm{align}}}\,{{\mathcal{L}}}_{{\rm{align}}}+{\lambda }_{{\rm{con}}}\,{{\mathcal{L}}}_{{\rm{con}}}+{\lambda }_{{\rm{cons}}}\,{{\mathcal{L}}}_{{\rm{cons}}}+{\lambda }_{{\rm{bndry}}}\,{{\mathcal{L}}}_{{\rm{bndry}}}.$$

(17)

Stage I minimizes ${{\mathcal{L}}}_{{\rm{rec}}}$ (Eq. (2)); Stage II optimizes ${{\mathcal{L}}}_{{\rm{seg}}}+{{\mathcal{L}}}_{{\det }}+{{\mathcal{L}}}_{{\rm{prompt}}}+{{\mathcal{L}}}_{{\rm{align}}}$; Stage III adds ${{\mathcal{L}}}_{{\rm{con}}}+{{\mathcal{L}}}_{{\rm{cons}}}+{{\mathcal{L}}}_{{\rm{bndry}}}$ with GS-EMA (Eq. (11)).

Loss Weights. All loss weights λ are selected via grid search on the validation set to balance multi-task convergence; the same configuration generalizes across all three datasets without per-dataset tuning. Table 9 details ranges and final values.

Table 9 Loss hyperparameters for Eq. (17): search ranges and final values

Full size table

Training details

Backbone: We adopt a 3D ViT encoder⁶ with patch size p = 16, embedding dimension d = 768, and L = 12 transformer blocks. The MAE decoder is lightweight (width 512, depth 4). The segmentation branch uses a UNet-like decoder with skip connections from token features through learned upsampling. For detection, we employ a center-heatmap head at 1/4 resolution.

Preprocessing: Volumes are resampled to isotropic 0.5–0.8 mm spacing and cropped to 128³ or 160³ regions around vessels using vesselness maps V. Intensities are clipped to the [0.5, 99.5] percentiles and z-scored. Vesselness V and centerlines C are computed offline and cached. Data augmentation includes 3D flips, small rotations (≤10°), elastic deformations, gamma jitter, and cutout outside vascular regions.

MAE training: The masking ratio is ρ = 0.75 with a vessel-biased sampler (Eq. (1)). The reconstruction loss uses Huber δ = 0.5 with weights (λ_v, λ_b, λ₀) = (0.7, 0.2, 0.1). Optimization is performed with AdamW, base learning rate 1.5 × 10⁻⁴, cosine decay, batch size B = 8 on 4 × GPUs, and 200 epochs.

Fine-tuning: Prompts include K_s = 8 shared and K_i = 8 instance prompts. The style encoder g_ψ is a 3D CNN with global pooling. Attention bias η is initialized to 0.1 and updated during training. Loss weights are set as λ_dice = 1.0, λ_bce = 0.5, focal α = 0.25, γ = 2.0, λ_box = 1.0, λ_prompt = 1e − 4, and λ_align = 0.1. Training runs for 150 epochs with learning rate 1e − 4.

Domain generalization with GS-EMA: We set $({m}_{\min },{m}_{\max })=(0.90,0.999)$, κ = 0.1, and ϵ = 1e − 3. The confidence threshold for teacher boundaries is τ_c = 0.7, and the contrastive loss temperature is τ_c = 0.07. Loss weights are λ_con = 0.1, λ_cons = 0.05, and λ_bndry = 0.1. Strong and weak views use intensity/color jitter and elastic/noise, respectively.

Complexity: AMAP introduces: (i) a small MAE decoder used only in Stage I, (ii) K_s + K_i deep prompts per block, and (iii) a lightweight boundary projector. At inference, runtime is essentially unchanged compared with the fine-tuned backbone, since prompts and attention biases are algebraic and the projector contributes <1% FLOPs.

Design rationale and ablations

Why anatomy-guided masking? It concentrates reconstruction capacity on vessels and boundaries where aneurysm cues appear, avoiding wasted effort on irrelevant tissue and improving sensitivity to small lesions.

Why prompts? Domain-adaptive prompts encode instance- and style-specific priors without heavy fine-tuning, enabling cross-site adaptation with limited data and more controllable attention.

Why boundary-aware DG? Aligning boundaries rather than global features reduces false positives along tortuous vessels and stabilizes morphology under protocol variations.

Ablation settings: We evaluate (a) mask sampler variants, (b) prompt families (P_shared vs. P_inst), (c) attention bias on/off, (d) GS-EMA vs. fixed-momentum EMA, (e) boundary-aware vs. standard contrastive learning, and (f) detection head enabled vs. removed.

Stage I (MAE Pre-training): Figure 8 shows the reconstruction loss for our anatomy-guided MAE pre-training. Both the training loss (solid orange) and validation loss (dashed blue) decrease smoothly and converge over 200 epochs. The validation loss closely tracks the training loss, indicating that the encoder is stably learning to capture robust vascular representations without significant overfitting. This provides a strong weight initialization for the downstream tasks.

Stage II (Prompt-Guided Finetuning): Figure 9 illustrates the finetuning stage. The segmentation loss (left axis) for both training (solid orange) and validation (dashed blue) steadily decreases. Concurrently, the validation Dice score (right axis, solid orange) rises rapidly from its initial baseline and begins to plateau after ~100 epochs, converging at a stable high value. This demonstrates a successful and stable adaptation of the pre-trained encoder to the specific segmentation task.

Stage III (Boundary-Aware DG): Finally, Figure 10 shows the convergence of the domain generalization stage. The total loss (left axis), which incorporates the segmentation and boundary-aware contrastive objectives, steadily converges. Critically, the validation Dice score (right axis) shows a further slight improvement from the plateau of Stage II, before stabilizing at its final, highest performance level. Together, these three figures confirm that our multi-stage training approach is stable, robust, and that each stage progressively and verifiably contributes to the model’s final performance.

Datasets and preprocessing

We evaluate AMAP on three publicly available intracranial aneurysm datasets, ensuring reproducibility and comparability with prior studies. All datasets include CTA or TOF-MRA volumes with expert-annotated aneurysm masks and are widely used benchmarks for IA detection and segmentation.

The ADAM dataset⁴⁰, released for the MICCAI 2020 challenge, contains 113 TOF-MRA volumes collected from multiple hospitals. Each case is annotated voxel-wise by neuroradiologists. Following the official split, we use 93 volumes for training and 20 for testing. Aneurysm sizes range from 2 mm to 20 mm, making the dataset particularly challenging for small-lesion detection.

The IntrA dataset⁴¹ includes 103 CTA volumes with voxel-level annotations. Aneurysms occur at diverse arterial locations (ICA, MCA, AComA, PComA, basilar), covering a wide morphological spectrum. Following ref. ⁴¹, we split the dataset into 73 training, 10 validation, and 20 testing cases. Compared with ADAM, IntrA emphasizes cross-location variability and contains both ruptured and unruptured aneurysms.

The CQ500 dataset⁴², originally curated for cranial pathology detection, is a large-scale CTA cohort. A subset (~490 cases) has been re-annotated for vascular analysis, including IA masks from open-source repositories⁴³. We randomly divide the data into 350 training, 70 validation, and 70 testing scans. CQ500 provides a robust benchmark for cross-domain evaluation, as it spans >20 hospitals in India with heterogeneous imaging protocols.

All CTA and TOF-MRA volumes are resampled to isotropic 0.5 mm spacing and intensity-clipped to the [0.5, 99.5] percentile of HU (CTA) or signal intensities (MRA). Vessels are enhanced with a multi-scale Frangi filter to generate vesselness maps V, and centerlines C are extracted by skeletonization. Cropping to 128³ or 160³ regions around vessels is performed using vesselness-based bounding boxes. Data augmentation includes 3D rotation ( ± 10°), elastic deformation, intensity scaling, gamma correction, and cutout outside vascular regions. These steps standardize inputs and support consistent multi-dataset evaluation.

Experiments setup

All experiments were run on an NVIDIA A100 cluster (8 × A100 GPUs, 80 GB each) using CUDA 12.0 and PyTorch 2.2. To ensure reproducibility, we fixed random seeds across NumPy, PyTorch, and data loaders. Mixed-precision training (FP16) with Apex was applied to reduce memory usage. Depending on dataset size and task setup, training required 36–72 h.

The anatomy-guided MAE was pretrained on the combined training sets (ADAM, IntrA, CQ500) without labels. Volumes were resampled to isotropic 0.5 mm and cropped into 160³ patches. We used a patch size of 16³, embedding dimension d = 768, and 12 transformer blocks. The masking ratio was 75% with vessel- and boundary-biased sampling (Sec. 4.2). The decoder had four layers (width 512). Optimization used AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.05), an initial learning rate of 1.5 × 10⁻⁴ with cosine decay, and batch size 8 over 8 GPUs. Pretraining ran for 200 epochs with gradient accumulation of 2 steps.

For IA detection and segmentation, the encoder was initialized with pretrained weights and augmented with shared and instance-dynamic prompts (Sec. 4.3). Each block used K_s = 8 shared and K_i = 8 instance prompts. The segmentation head followed a UNet-like decoder, while the detection head adopted an anchor-free center heatmap. Optimization used AdamW with learning rate 1.0 × 10⁻⁴, weight decay 1 × 10⁻⁴, and batch size 4. Training ran for 150 epochs with early stopping on validation Dice. Loss weights were λ_dice = 1.0, λ_bce = 0.5, λ_prompt = 1e − 4, and λ_align = 0.1. Attention bias scaling (Eq. (6)) was initialized to 0.1 and updated during training.

To simulate domain shifts, datasets were divided by institution and protocol. We applied GS-EMA teacher-student training (Sec. 4.4) on unlabeled validation domains. EMA momentum was bounded in [0.90, 0.999] with adaptive coefficient κ = 0.1. The boundary-aware contrastive loss used temperature τ_c = 0.07 and positive sampling threshold 0.7. Loss weights were λ_con = 0.1, λ_cons = 0.05, and λ_bndry = 0.1. Training ran for 80 epochs with initialization from Stage II. Strong and weak augmentations included 3D elastic deformation, random bias fields, Gaussian noise, and intensity jitter.

All stages used data augmentation: random rotation ( ± 10°), flipping, scaling (0.9–1.1 × ), elastic deformation, gamma correction, Gaussian noise (σ = 0.01–0.05), and cutout outside vessels. Dropout (p = 0.1) and stochastic depth (p = 0.2) were applied in transformers. Weight decay and prompt norm penalties (Eq. (2)) stabilized training.

At test time, we used sliding-window inference with 0.5 overlap and ensembled predictions from three augmentations. Outputs combined segmentation probability maps and detection heatmaps, followed by connected-component filtering (minimum volume > 50 voxels) to suppress false positives. Average inference time was ~0.9 s per scan on one A100 GPU, enabling near real-time use.

Evaluation metrics

We adopt standard metrics for segmentation, detection, and domain generalization.

Segmentation: Dice Similarity Coefficient (DSC):

$${\rm{DSC}}=\frac{2| P\cap G| }{| P| +| G| },$$

where P and G are predicted and ground-truth masks. 95% Hausdorff Distance (HD95): 95th percentile of boundary distances between prediction and ground truth. Sensitivity and Specificity: proportion of correctly detected aneurysm voxels and correctly rejected non-aneurysm voxels.

Detection: Sensitivity@FP/case: sensitivity at fixed FP counts per case (0.5, 1, 2 FP/case), common in medical detection benchmarks. FROC: sensitivity-FP/case curve; we report FROC-AUC. Average Precision (AP): lesion-level precision-recall AUC.

Cross-domain Generalization: Domain-wise Dice and HD95: broken down by test domain (e.g., ADAM → IntrA, IntrA → CQ500). Relative Performance Drop (RPD):

$${\rm{RPD}}=\frac{{{\rm{DSC}}}_{{\rm{in}}}-{{\rm{DSC}}}_{{\rm{out}}}}{{{\rm{DSC}}}_{{\rm{in}}}}\times 100 \% ,$$

where DSC_in and DSC_out are in-domain and cross-domain Dice. Expected Calibration Error (ECE): measures calibration of detection confidence across domains.

All results are reported as mean ± standard deviation across test folds. Statistical comparisons use two-tailed paired t-tests, with p < 0.05 considered significant¹².

Baseline Configurations: To ensure fairness and reproducibility, all baselines follow public defaults or are tuned on our validation sets. Key hyperparameters are listed in Table 10.

Table 10 Key hyperparameters for baselines

Full size table

Ethics approval and consent to participate

This study used publicly available, anonymized datasets (ADAM, IntrA, CQ500). As all data were fully de-identified and collected under existing ethical approvals by the original dataset providers, no institutional review board (IRB) approval or additional informed consent was required.

Data availability

The datasets analyzed in this study are publicly available: ADAM Challenge dataset (https://adam.isi.uu.nl/.), IntrA dataset⁴¹, and CQ500 dataset⁴². Processed data used in this study are available from the corresponding author upon reasonable request. The implementation of AMAP, including training and evaluation scripts, will be made publicly available upon publication.

Code availability

The implementation of AMAP, including training and evaluation scripts, will be made publicly available upon publication.

References

Wiebers, D. O. et al. Unruptured intracranial aneurysms: natural history, clinical outcome, and risks of surgical and endovascular treatment. Lancet 362, 103–110 (2003).
Article PubMed Google Scholar
Daga, P., Kumar, R. & Zhang, J. Deep learning for rupture risk stratification of cerebral aneurysms. Stroke 56, 1003–1012 (2025).
Google Scholar
Joo, B. Methodological challenges in deep learning-based detection of intracranial aneurysms: a scoping review. Neurointervention 20, 52–65 (2025).
Article PubMed PubMed Central Google Scholar
Nakao, T. et al. Deep neural network-based computer-assisted detection of cerebral aneurysms in mr angiography. J. Magn. Reson. Imaging 47, 948–953 (2018).
Article PubMed Google Scholar
Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
Article PubMed Google Scholar
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (IEEE, 2021).
He, K. et al. Masked autoencoders are scalable vision learners. In Conference on Computer Vision and Pattern Recognition (CVPR), 15979−15988 (IEEE, 2022).
Zhou, T. et al. A universal medical image segmentation framework with vision foundation models. Nat. Mach. Intell. 5, 780–791 (2023).
Google Scholar
Shen, C. et al. Domain generalization for medical imaging classification with meta-learning. Med. Image Anal. 78, 102406 (2022).
Google Scholar
Vach, M., Richter, S. & Ecker, A. Reproducibility of deep learning-based intracranial aneurysm segmentation across institutions. NeuroImag. Clin. 44, 103673 (2024).
Google Scholar
Liu, F. et al. Small lesion detection in medical images using deep learning. Pattern Recognit. 115, 107885 (2021).
Google Scholar
Delfan, S., Mohammadi, N. & Rezaei, A. Artificial intelligence for brain aneurysm detection: a meta-analysis. Eur. Radiol. 35, 2156–2170 (2025).
Google Scholar
Holzinger, A. et al. Explainable ai methods in medical imaging. Nat. Rev. Methods Prim. 2, 1–13 (2022).
Google Scholar
Ceballos-Arroyo, A. M. et al. Anatomically-guided masked autoencoder pre-training for aneurysm detection. arXiv https://doi.org/10.48550/arXiv.2502.21244 (2025).
Zu, R., Zhao, M. & Zhang, H. Efficient prompt tuning for medical image classification. In MICCAI, 560–570 (Springer, 2024).
Lin, F. et al. Gs-ema: Integrating gradient surgery exponential moving average with boundary-aware contrastive learning for enhanced domain generalization in aneurysm segmentation. arXiv https://doi.org/10.48550/arXiv.2402.15239 (2024).
Ceballos-Arroyo, A. M. et al. Vessel-aware aneurysm detection using multi-scale deformable 3d attention. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 754–765 (Springer, 2024).
Li, Y. et al. Deep learning-based platform performs high detection sensitivity of intracranial aneurysms in 3d brain tof-mra: an external clinical validation study. Int. J. Med. Inform. 188, 105487 (2024).
Article PubMed Google Scholar
Kiewitz, J. et al. Deep learning-based multiclass segmentation in aneurysmal subarachnoid hemorrhage. Front. Neurol. 15, 1490216 (2024).
Article PubMed PubMed Central Google Scholar
Nishi, H. et al. Deep learning-based cerebral aneurysm segmentation and morphological analysis with three-dimensional rotational angiography. J. NeuroInt. Surg. 16, 197–203 (2024).
Article Google Scholar
Mazurowski, M. A. et al. Segment anything model for medical image analysis: an experimental study. Radiol. Artif. Intell. 5, e230217 (2023).
Google Scholar
Ma, J. et al. Segment anything in medical images. Nat. Commun. 15, 6543 (2024).
Google Scholar
Ma, J. et al. Segment anything in medical images. Nat. Commun. 15, 654 (2023).
Dong, G. et al. An efficient segment anything model for the medical image segmentation (emedsam). Sci. Rep. 14, 9425 (2024).
Wang, H. et al. Sam-med3d: Towards general-purpose segmentation models for volumetric medical images. Sci. Rep. 14, 19425 (2023).
Wang, G. et al. Sam-med3d-moe: Towards a non-forgetting segment anything model via mixture of experts for 3d medical image segmentation. arXiv https://doi.org/10.48550/arXiv.2407.04938 (2024).
Ma, J. et al. Medsam2: Segment anything in 3d medical images and videos. arXiv https://doi.org/10.48550/arXiv.2504.03600 (2025).
Hartsock, J., Fang, E. & Xu, M. Vision-language models in radiology: a systematic review. Insights Imag. 15, 64 (2024).
Google Scholar
Wei, Y. et al. More-brain: Routed mixture of experts for interpretable and generalizable cross-subject fmri visual decoding. arXiv https://arxiv.org/abs/2505.15946 (2025).
Xiao, X. et al. Describe anything in medical images. arXiv https://arxiv.org/abs/2505.05804 (2025).
Xiao, X. et al. Hgtdp-dta: Hybrid graph-transformer with dynamic prompt for drug-target binding affinity prediction. In International Conference on Neural Information Processing, 340–354 (Springer, 2024).
Wei, Y. et al. 4d multimodal co-attention fusion network with latent contrastive alignment for alzheimer’s diagnosis. arXiv https://arxiv.org/abs/2504.16798 (2025).
Tang, Y. et al. Swin unetr: swin transformers for semantic segmentation of brain tumors in mri images. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2021. Lecture Notes in Computer Science, (eds. Crimi, A., Bakas, S.) 12962 (Springer (Springer, 2022).
Xie, B., Zhang, K. & Liu, Z. Knowledge-guided masked image modeling for radiology representation learning. Med. Image Anal. 89, 102907 (2024).
Google Scholar
Oh, H., Park, J. & Lee, D. Meta-learning based domain generalization for medical image classification. Med. Image Anal. 87, 102830 (2023).
Google Scholar
Wang, D., Shelhamer, E., Liu, S., Olshausen, B. & Darrell, T. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations (ICLR) (ICLR, 2021).
Gupta, A., Osman, I., Shehata, M. S., Braun, W. J. & Feldman, R. E. Medmae: A self-supervised backbone for medical imaging tasks. Computation 13, 88 (2025).
Article Google Scholar
Li, D., Yang, Y., Song, Y.-Z. & Hospedales, T. Learning to generalize: meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, 3490−3497 (AAAI, 2018).
Jahanifar, M. et al. Domain generalization in computational pathology: survey and guidelines. ACM Comput. Surv. 57, 1–37 (2025).
Article Google Scholar
Timmins, K. et al. The Aneurysm Detection And Segmentation (Adam) Challenge Dataset. https://adam.isi.uu.nl/ (2020).
Qin, C. et al. Intra: An intracranial aneurysm dataset for clinical evaluation of detection and segmentation algorithms. Sci. Data 8, 47 (2021).
Google Scholar
Chilamkurthy, S. et al. Development and validation of deep learning algorithms for detection of critical findings in head ct scans. Lancet 392, 2388–2396 (2018).
Article PubMed Google Scholar
Yang, D., Xu, H. & Zhao, Y. Open-Source Intracranial Aneurysm Annotations on the cq500 Dataset. https://github.com/ycchen218/CQ500-IA (2022).

Download references

Acknowledgements

We would like to thank the contributors of the ADAM, IntrA, and CQ500 datasets for making their data publicly available, which enabled this research.This work was supported by the Beijing-Tianjin-Hebei Basic Research Cooperation Project (Grant No. H2024102009), the Natural Science Foundation of Beijing Municipality (Grant No. L242045), the Continuing Education Center of the National Health (Grant No. GWJJ2023100103), the Key-Area Research and Development Program of Guangdong Province (Grant No. 2023B0303030002), the STI 2030-Major Projects (Grant No. 2022ZD0208500), and the National Natural Science Foundation of China (Grant No. 62336002, 62406025).

Author information

These authors contributed equally: Mingxuan Huang, Tiantian Liu, Jiayin Zhang, Xiaoming Su.

Authors and Affiliations

Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, 100070, Beijing, China
Mingxuan Huang, Hanlin Chen, Miao Li, Jinghan Guo, Hengri Cong & Yiming Deng
China National Clinical Research Center for Neurological Diseases, Beijing, 100070, Beijing, China
Mingxuan Huang, Hanlin Chen, Miao Li, Jinghan Guo, Hengri Cong & Yiming Deng
School of Medical Technology, Beijing Institute of Technology, Beijing, 100081, Beijing, China
Tiantian Liu & Tianyi Yan
Department of Neurosurgery, The Second Hospital of Jilin University, Changchun, 130041, Jilin, China
Jiayin Zhang & Long Yan
Zhangjiakou First Hospital, Zhangjiakou, 075061, Hebei, China
Xiaoming Su, Xiaofeng Chen & Yanguo Su
School of Basic Medical Sciences, Capital Medical University, Beijing, 100069, Beijing, China
Kaiyang Zu

Authors

Mingxuan Huang
View author publications
Search author on:PubMed Google Scholar
Tiantian Liu
View author publications
Search author on:PubMed Google Scholar
Jiayin Zhang
View author publications
Search author on:PubMed Google Scholar
Xiaoming Su
View author publications
Search author on:PubMed Google Scholar
Hanlin Chen
View author publications
Search author on:PubMed Google Scholar
Miao Li
View author publications
Search author on:PubMed Google Scholar
Jinghan Guo
View author publications
Search author on:PubMed Google Scholar
Kaiyang Zu
View author publications
Search author on:PubMed Google Scholar
Xiaofeng Chen
View author publications
Search author on:PubMed Google Scholar
Yanguo Su
View author publications
Search author on:PubMed Google Scholar
Hengri Cong
View author publications
Search author on:PubMed Google Scholar
Long Yan
View author publications
Search author on:PubMed Google Scholar
Tianyi Yan
View author publications
Search author on:PubMed Google Scholar
Yiming Deng
View author publications
Search author on:PubMed Google Scholar

Contributions

M.H., T.L., J.Z. and X.S. contributed equally to this work. Y.D., T.Y., L.Y. and H.R.C. contributed equally to this work and are all corresponding authors. M.H., X.S., K.Z. and J.G. conceptualized the study, designed the methodology, and participated in securing research funding (Conceptualization, Methodology, Funding acquisition). T.L., X.C. and H.L.C. carried out data acquisition, curation, and investigation (Investigation, Data curation) and provided key resources, instruments, and technical support (Resources, Software). J.Z., M.L. and Y.S. drafted the initial manuscript and generated visualizations (Writing – Original Draft, Visualization). Y.D., T.Y., L.Y. and H.R.C. supervised the project, coordinated collaborations, and ensured administrative support (Supervision, Project administration). All authors contributed to reviewing and revising the manuscript critically for important intellectual content (Writing – Review & Editing) and approved the final version for submission.

Corresponding authors

Correspondence to Hengri Cong, Long Yan, Tianyi Yan or Yiming Deng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Huang, M., Liu, T., Zhang, J. et al. Anatomically-guided Masked Autoencoder with Domain-Adaptive Prompting (AMAP) for multimodal cerebral aneurysm detection and segmentation. npj Digit. Med. 9, 20 (2026). https://doi.org/10.1038/s41746-025-02188-8

Download citation

Received: 29 September 2025
Accepted: 18 November 2025
Published: 08 December 2025
Version of record: 07 January 2026
DOI: https://doi.org/10.1038/s41746-025-02188-8

Subjects

Abstract

Similar content being viewed by others

Automated, anatomy-based, heuristic post-processing reduces false positives and improves interpretability of deep learning intracranial aneurysm detection models

Confocal Raman microspectroscopy imaging reveals structural protein reorganization in human intracranial aneurysm tissue

Intracranial aneurysm CTA images and 3D models dataset with clinical morphological and hemodynamic data

Introduction

Results

Main results

Ablation studies

Cross-domain Generalization and Robustness

Visualization analysis

Clinical relevance and case studies

Discussion

Methods

Preliminaries and notation

Stage I: Anatomically-guided MAE pretraining

Stage II: Prompt-guided fine-tuning

Stage III: Boundary-aware DG with GS-EMA

Overall objective

Training details

Design rationale and ablations

Datasets and preprocessing

Experiments setup

Evaluation metrics

Ethics approval and consent to participate

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links