Introduction

Breast cancer is one of the most significant health challenges globally, with millions of new cases diagnosed each year1,2,3. Early detection is crucial to improving patient outcomes4, and mammography serves as a cornerstone tool for the detection and diagnosis of breast cancer5,6. As a specialized radiographic technique, mammography provides critical visual evidence to detect early signs of cancer, such as masses or calcifications7. However, interpreting mammograms is inherently complex and highly dependent on the expertise of qualified radiologists. Even among experts, diagnoses can involve subjective judgments and inconsistencies. Therefore, developing high-quality mammogram datasets tailored to support model training and evaluation has become an essential requirement to advance intelligent applications in this domain.

In recent years, the rapid development of Large Vision-Language Models (LVLMs) has demonstrated substantial potential in multimodal learning. General-domain LVLMs, such as Flamingo8, BLIP-29, LLaVA10, and MiniGPT-411, have shown remarkable performance in tasks like image-text generation, visual question answering (VQA), and image captioning. These models leverage large-scale pretraining datasets and cross-modal alignment techniques to enhance generalization capabilities in both vision and language tasks. In the medical domain, specifically tailored LVLMs have further expanded this application scope. For example, RadFM12 optimizes PMC-LLaMA13 with 16 million radiology image-text pairs, while Med-Flamingo14 enhances OpenFlamingo-9B15 using billions of biomedical image-text pairs to meet the demands of medical image analysis and report generation. LLaVA-Med16 extends LLaVA10 by incorporating a biomedical figure caption dataset extracted from PubMed Central17, significantly improving biomedical image understanding and open-domain dialog capabilities. MedVInT18 strengthens medical imaging analysis and open-ended VQA through visual instruction fine-tuning on 227,000 vision-question-answer pairs from the PMC-VQA18 dataset. Furthermore, the latest GMAI-VL19, developed using the proposed GMAI-VL-5.5M dataset, advances research in multimodal medical representation. Despite these advancements, the application of LVLMs in specific medical fields, such as mammography, remains limited.

To validate the VQA performance of existing LVLMs on mammograms, we investigate a variety of datasets that have supported advancements in mammographic image analysis. Early datasets like MIAS20 and INbreast21 laid the foundation, with MIAS providing 322 images with basic annotations and INbreast offering 410 high-resolution images with detailed lesion labels. Despite their contributions, these datasets are limited in size and diversity. CBIS-DDSM22, built on the earlier DDSM dataset, introduced standardized mammograms with detailed annotations such as lesion segmentation and Breast Imaging-Reporting and Data System (BI-RADS23) descriptors, significantly improving the quality of mammogram data. More recent datasets, such as VinDr-Mammo24, CSAW-M25, KAU-BCMD26, and BMCD27, provide thousands of mammograms with comprehensive annotations, addressing the need for larger and more diverse data. Specialized datasets like CDD-CESM28, which focuses on contrast-enhanced spectral mammography (CESM) with over 1,000 annotated images, are designed for diagnostics in dense breast tissue. In contrast, DMID29 emphasizes multimodal breast imaging, combining digital mammography with other imaging techniques and detailed annotations to support research on integrating complementary imaging modalities for better detection of breast cancer. Large-scale datasets, such as EMBED30, which includes a total of 3.4 million images (with around 480,000 currently available), and the Radiological Society of North America (RSNA) Breast Cancer Detection Challenge dataset31, which contains around 54,700 images, offer substantial resources for AI development. These datasets, with their diverse scales and imaging modalities, have laid a solid foundation for advancing mammogram analysis. However, their lack of design specificity for the VQA task limits their applicability in vision-language research.

In addition, we explore existing medical VQA datasets, such as VQA-RAD32 and SLAKE33, which laid the groundwork for medical VQA research. VQA-RAD focuses on radiological images but covers a limited range of anatomical regions, while SLAKE expands its coverage to areas such as the pelvis and neck, though its data diversity remains constrained. Additionally, benchmark datasets like MIMIC-CXR34 and CheXpert35 in the chest radiographic imaging domain have supported numerous multimodal studies. PMC-VQA18 has further contributed to the field by incorporating various imaging modalities and tasks. OmniMedVQA36 and the recent GMAI-VL-5.5M19 dataset attempt to bridge this gap by covering 12 and 13 imaging modalities, respectively, along with multiple anatomical regions and medical professional tasks, making them the largest and most diverse medical VQA datasets to date. Despite these advancements, these datasets contain limited or no mammogram data, which is insufficient for evaluating or enabling model performance in interpreting mammograms.

To address this gap, we introduce MammoVQA, a VQA dataset specifically focused on mammograms. MammoVQA is a multimodal medical imaging dataset built to meet the unique requirements of mammography, comprising 15 mammogram datasets and 565,092 question-answer pairs that cover clinically relevant tasks such as BI-RADS classification, breast density assessment, and abnormality detection.

Beyond simply providing a benchmark, the central contribution of this work lies in rigorously evaluating the capabilities of existing LVLMs in interpreting mammography images. While the general-domain models have shown outstanding performance in general VQA tasks, their performance on mammograms remains underexplored. Similarly, the medical-domain models, although tailored for medical applications, have not been extensively tested in mammography-specific contexts. This gap motivates our study: Do LVLMs truly possess sufficient interpretation ability when faced with clinically relevant mammography tasks?

To validate the efficacy of MammoVQA, we systematically evaluate the performance of 12 state-of-the-art LVLMs on this dataset, including 6 general-domain models and 6 medical-domain models. LVLMs face significant challenges when addressing mammography-related questions: almost all models perform comparably to random guessing across the various question topics, indicating that they cannot interpret mammograms effectively. These issues likely stem from the models' lack of sufficient mammogram data during the training phase. To investigate the impact of domain-specific adaptation on performance, we fine-tune LLaVA-NeXT37 on the MammoVQA training set, aiming to further enhance its performance in mammography-related tasks.

Furthermore, we train and evaluate vision-only models (ResNet-5038 and DINOv239) with linear probing, as well as ViLT40, a multimodal framework that does not use a large language model, on MammoVQA. Unlike LVLMs, the outputs of the vision-only and ViLT-based models are closed-set, meaning their answer spaces are constrained to predefined options. Although these models exhibit a certain level of competitiveness on MammoVQA, their overall performance remains significantly inferior to that of the domain-optimized LVLMs, and the external validation particularly highlights the robustness of the fine-tuned model, LLaVA-Mammo. These findings indicate that LVLMs have substantial potential for the VQA task on mammograms, as they can handle more open-ended and complex question-answering scenarios. However, they also highlight the limitations of existing LVLMs, particularly in tasks requiring highly domain-specific knowledge, where further optimization and adaptation are still needed.

In this work, we make the following contributions:

  • We introduce a VQA benchmark designed specifically for mammogram interpretation, MammoVQA, which is created by integrating 15 public mammogram datasets. This comprehensive benchmark combines 131,847 images with 420,923 QA pairs for image-level cases and 72,518 examinations (475,971 images) with 144,169 QA pairs for exam-level cases. By providing both an evaluation framework for assessing LVLMs’ mammogram interpretation capabilities and structured data for domain adaptation, MammoVQA establishes a critical infrastructure for advancing research in AI-assisted mammogram interpretation.

  • We conduct a systematic evaluation of 6 general-domain and 6 medical-domain LVLMs on MammoVQA, revealing that their mammogram interpretation performance is statistically indistinguishable from random guessing. This striking inadequacy highlights the challenges of mammogram interpretation, where subtle anatomical patterns and complex contextual relationships across multiple views differ fundamentally from those in other imaging domains.

  • We propose a domain-optimized model, LLaVA-Mammo, which significantly outperforms both general and medical LVLMs in mammogram VQA tasks. In internal validation, LLaVA-Mammo achieves gains of 38.29% average absolute accuracy (dataset level) and 19.66% average weighted accuracy (question topic level) over the best recent high-performance model. External validation across 4 independent external datasets demonstrates that LLaVA-Mammo outperforms the best recent high-performance model by 19.87% average absolute accuracy (dataset level) and 21.21% average weighted accuracy (question topic level). This advancement demonstrates the critical importance of domain-specific design for medical AI and provides a foundation for future development of vision-language systems in mammogram interpretation.

Results

Constructing MammoVQA from public mammogram datasets

Recognizing the scarcity of image-text data for mammography, we aggregate a large number of classification datasets from this domain and convert them into the VQA format to form our MammoVQA dataset. MammoVQA comprises 15 mammogram datasets released by various authoritative medical institutions (shown in Fig. 1a), resulting in a diverse set of images that enable models to learn more generalized representations and validate their performance across a wide range of heterogeneous data. Importantly, all images are sourced from real medical settings, ensuring that MammoVQA is closely aligned with real-world applications. Through manual verification, we re-examined all mammograms to ensure the absence of image corruption, unreadable files, or visual abnormalities. Furthermore, we verified that all bounding box coordinates precisely delineate the target mass regions and confirmed that all classification labels are accurately mapped to our predefined unified label space. The dataset covers 9 question topics (shown in Fig. 1b and detailed in Supplementary Table 8), including but not limited to BI-RADS classification, density assessment, and abnormality detection, fully reflecting the diversity and complexity of mammography. MammoVQA includes 131,847 images and 420,923 QA pairs for image-level cases, as well as 72,518 examinations (475,971 images) and 144,169 QA pairs for exam-level cases, establishing a large-scale dataset.

Fig. 1: Overview of MammoVQA.

a Dataset composition statistics, which describe the number of images contained in each sub-dataset of MammoVQA (* indicates the number of examinations) and the distribution of corresponding question topics. b Hierarchical taxonomy of 9 clinically validated question topics organized by diagnostic workflow stages. c The sub-datasets of MammoVQA are categorized into three types according to the format of the provided labels. d An example of question-answer pair generation: for each question topic, four question formats are generated using GPT-4o, and one is randomly selected as the {Question} of the question-answer pair. The {Options} are then randomly ordered from the candidate options and filled into the template to form the final question-answer pair. e An example of a question and its corresponding answer for each of the 9 question topics in MammoVQA.

Systematic evaluation of existing LVLMs

To assess the capabilities of existing LVLMs in interpreting mammograms, we select 12 models pre-trained on large-scale datasets. These models, with similar scales and distinct characteristics, have gained widespread recognition for their performance. We conduct zero-shot experiments on the MammoVQA benchmark to evaluate their ability to interpret mammograms. The selected models include 6 general-domain models, namely MiniGPT-4-7B11, BLIP-2-11B9, InstructBLIP-7B41, LLaVA-NeXT-Interleave-7B42, InternVL3-8B43, and Qwen2.5-VL-7B44, along with 6 medical-domain models, including LLaVA-Med-7B16, RadFM-14B12, Med-Flamingo-7B14, MedVInT-TD-7B18, MedDr-40B45, and MedGemma-4B46.

By selecting these diverse models, we aim to comprehensively evaluate their performance in mammogram interpretation. Notably, the pre-training datasets of RadFM and MedDr include a small number of mammograms.

To establish performance benchmarks, we calculate the accuracy of a random guess based on the number of answer categories for each question topic. To thoroughly evaluate model performance and avoid assessment bias due to single-category predictions, we employ both absolute accuracy and weighted accuracy metrics. We also evaluate the macro-F1 score, which exhibits an overall trend consistent with that of weighted accuracy. To maintain conciseness in the main text, the full F1 results are provided exclusively in the supplementary materials. Through detailed analysis of weighted accuracy, we observe a notable phenomenon: the majority of LVLMs perform close to random guess levels across various question topics.
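To make the two headline metrics concrete, the sketch below shows one way to compute them along with the random-guess baseline. This is our reading of the metric names (absolute accuracy as plain accuracy, weighted accuracy as the mean of per-class recalls, and a 1/K baseline for a K-option topic), not the paper's reference implementation.

```python
from collections import defaultdict

def absolute_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of exactly matching answers."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so a model that always predicts one
    class scores only 1/K regardless of class imbalance."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def random_guess_accuracy(num_options):
    """Expected accuracy of a uniform random guess over K options."""
    return 1.0 / num_options
```

Under these definitions, a model that answers 'A' on every case of an imbalanced topic can score high absolute accuracy while its weighted accuracy collapses toward the random-guess level, which is why the paper reports both.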

InternVL3 and MedGemma show significant advantages in some tasks. Specifically, in the pathology (breast) task, these 2 models outperform random guessing by 5.82% and 9.80%, respectively. In the pathology (finding) task, the advantages further expand to 10.42% and 13.50%. Moreover, in the abnormality (breast) task, they exceed random guessing by 2.57% and 3.55%, respectively. In the abnormality (finding) task, they also outperform other models (except BLIP-2) by approximately 3–4%.

The experimental results show that most LVLMs perform at near-random levels in mammogram interpretation. Even better-performing models (e.g., InternVL3 and MedGemma) still exhibit insufficient accuracy in most tasks, indicating a lack of domain-specific interpretation of breast lesion patterns. This limitation highlights the importance of improving model performance in mammogram interpretation, which is crucial to achieving reliable AI-assisted early detection of breast cancer.

How does MammoVQA boost the LVLMs’ interpretation ability on mammograms?

To further enhance model performance on MammoVQA, we fine-tune the LLaVA-NeXT model on the MammoVQA training set, resulting in LLaVA-Mammo (Fig. 2a). Through fine-tuning on the MammoVQA dataset, the model better learns the associations between mammogram features and category words, while adapting to the unique semantics and question types of the MammoVQA task. Moreover, the fine-tuned LLaVA-Mammo provides a strong baseline for subsequent research, promoting further development of the mammogram visual question-answering task.

Fig. 2: Overview of domain-optimized models, output samples, and distribution.

a Architecture of LLaVA-Mammo. b Architecture of ViLT-Mammo. c Architecture of ViLT-Mammo-Expert. d Architecture of R50-Mammo and DINOv2-Mammo. e A VQA sample on MammoVQA. The robot icon and doctor icon are from Flaticon.com, created by Freepik. f The distribution of answer predictions by LLaVA-Mammo on MammoVQA's internal benchmark set. Source data are provided with this paper.

LLaVA-Mammo demonstrates absolute superiority over existing LVLMs on the MammoVQA internal benchmark set (an output example is shown in Fig. 2e, and the overall performance in Fig. 3a, b). Specifically, in the background tissue task, the weighted accuracy of LLaVA-Mammo reaches 54.46%, which is 13.01% higher than InternVL3, the best-performing of the other LVLMs (and 21.13% higher than a random guess). In the view task, the weighted accuracy of LLaVA-Mammo reaches 98.45%, which is 44.43% higher than MedDr, the best-performing of the other LVLMs (and 48.45% higher than a random guess). In the abnormality (finding) task, the weighted accuracy of LLaVA-Mammo reaches 13.61%, which is 4.89% higher than the best-performing model. In the subtlety, pathology (breast), pathology (finding), masking potential, BI-RADS (breast), density (breast), laterality, and abnormality (breast) tasks, LLaVA-Mammo achieves weighted accuracies of 37.23%, 46.34%, 70.60%, 27.02%, 32.31%, 66.85%, 99.97%, and 6.72%, respectively. Except for the relatively lower abnormality (breast) task, these results exceed the best competing model by 12.93%, 3.21%, 23.77%, 9.46%, 13.65%, 35.04%, and 45.95% (and a random guess by 20.56%, 13.01%, 37.27%, 14.52%, 15.64%, 41.85%, and 49.97%), respectively. Overall, for the image-level case, the average absolute accuracy and weighted accuracy of LLaVA-Mammo reach 73.89% and 50.32%, which are 38.07% and 19.66% higher than MedGemma, the best-performing LVLM (the weighted accuracy is 25.66% higher than a random guess), showing a significant improvement.

Fig. 3: Overview of performance.

a Weighted accuracy across question topics. b Absolute accuracy across sub-datasets. c Category distribution of question topics in MammoVQA. d–f Comparison of two top existing and three domain-optimized models on MammoVQA benchmarks, showing weighted accuracy and absolute accuracy with 95% confidence intervals (calculated via 9999 bootstrap iterations), where the sample size n represents the total number of test cases. Error bars represent the range (mean ± CI). Dashed (---) and solid (-) lines indicate mean absolute and weighted accuracies, excluding exam-level question topics and the EMBED sub-dataset. Source data are provided with this paper. Detailed performance metrics are available in Supplementary Tables 1–7.
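The 95% confidence intervals described in the figure caption can be reproduced with a percentile bootstrap over test cases. The sketch below is a plausible reconstruction only: the resampling unit, the percentile method, and the metric function are assumptions beyond the stated 9999 iterations.

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=9999, alpha=0.05, seed=0):
    """Percentile bootstrap: resample test cases with replacement,
    recompute the metric n_boot times, return the (alpha/2, 1-alpha/2)
    percentiles of the resulting score distribution."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    return (scores[int((alpha / 2) * n_boot)],
            scores[int((1 - alpha / 2) * n_boot)])

# metric can be any accuracy-like function; plain accuracy as an example
accuracy = lambda t, p: sum(a == b for a, b in zip(t, p)) / len(t)
```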

For the exam case, only LLaVA-Mammo, LLaVA-NeXT-interleave, InternVL3, Qwen2.5-VL, RadFM, Med-Flamingo, and MedGemma can process the multi-image input. In the BI-RADS (exam) task, LLaVA-Mammo achieves a weighted accuracy that is 1.07% lower than that of Qwen2.5-VL and is 0.85% higher than a random guess. In the density (exam) task, the weighted accuracy is 65.91%, which is 33.70% and 40.91% higher than the best-performing MedGemma among LVLMs and random guess, respectively.

From the perspective of sub-datasets, LLaVA-Mammo achieves average absolute accuracies of 74.33% and 68.49% on the breast and exam datasets, which are 38.29% and 44.48% higher than MedGemma, the best-performing of the other LVLMs.

It is widely recognized that specialized models adopting closed-set outputs outperform large language models (LLMs) with open-ended outputs in classification tasks47. To verify whether this holds for our MammoVQA dataset constructed from classification datasets, we train 2 vision-only models and 2 ViLT40-based models with DINOv239 as the vision backbone on the MammoVQA training set.

For vision-only models, we perform linear probing on ResNet-5038 and DINOv239 using a multi-classification-head-per-model setup (Fig. 2d). For architectures based on ViLT, we first train ViLT-Mammo (Fig. 2b) with the same multi-head approach to handle multiple question types concurrently. Subsequently, we train ViLT-Mammo-Expert (Fig. 2c) using a single-classification-head-per-model strategy, where a dedicated classification head is optimized for each question type to improve specialization. Overall, DINOv2 achieves the best performance, followed by the two ViLT-based models with comparable results, while ResNet-50 performs the worst.
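The multi-classification-head linear-probing setup can be illustrated as follows: a frozen backbone produces shared features, and one lightweight linear head per question topic is trained on top of them. Everything concrete here (the random-projection stand-in for the backbone, the feature dimension, the topics, and the training loop) is an illustrative assumption, not the paper's experimental configuration.

```python
import math
import random

rng = random.Random(0)
FEAT_DIM = 16

def frozen_backbone(pixel_vec):
    """Stand-in for a frozen vision encoder (ResNet-50 / DINOv2 in the
    paper): a fixed, seed-determined random projection of the input."""
    proj_rng = random.Random(42)
    proj = [[proj_rng.gauss(0, 1) for _ in range(FEAT_DIM)]
            for _ in range(len(pixel_vec))]
    return [sum(x * proj[i][j] for i, x in enumerate(pixel_vec))
            for j in range(FEAT_DIM)]

class LinearHead:
    """One softmax-regression head; only this part is trained (the probe)."""
    def __init__(self, num_classes):
        self.W = [[0.0] * num_classes for _ in range(FEAT_DIM)]
    def logits(self, f):
        return [sum(f[i] * self.W[i][c] for i in range(FEAT_DIM))
                for c in range(len(self.W[0]))]
    def predict(self, f):
        z = self.logits(f)
        return z.index(max(z))
    def step(self, f, label, lr=0.05):
        # one SGD step on the softmax cross-entropy loss for this head
        z = self.logits(f)
        m = max(z)
        exp = [math.exp(v - m) for v in z]
        s = sum(exp)
        p = [v / s for v in exp]
        for c in range(len(p)):
            grad = p[c] - (1.0 if c == label else 0.0)
            for i in range(FEAT_DIM):
                self.W[i][c] -= lr * grad * f[i]

# one head per question topic, all sharing the same frozen features
heads = {"density": LinearHead(4), "laterality": LinearHead(2)}
images = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(12)]
feats = [frozen_backbone(img) for img in images]
labels = {"density": [rng.randrange(4) for _ in range(12)],
          "laterality": [rng.randrange(2) for _ in range(12)]}
for _ in range(200):
    for topic, head in heads.items():
        for f, y in zip(feats, labels[topic]):
            head.step(f, y)
```

The single-head "Expert" variant simply dedicates one such model (backbone plus a single head) to each question topic instead of sharing features across heads.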

Under the same image-text input setting, a comparison between ViLT-based models and LLaVA-Mammo reveals that, aside from the simplest view and laterality tasks where performance is comparable, LLaVA-Mammo outperforms them on the remaining tasks by approximately 6–20% in weighted accuracy. In summary, the weighted accuracy of LLaVA-Mammo on question topics is on average 6.72% and 7.82% higher than that of ViLT-Mammo and ViLT-Mammo-Expert, respectively, and its absolute accuracy is on average 7.09% and 7.01% higher. From the perspective of sub-datasets, its absolute accuracy is on average 9.72% and 7.13% higher.

The results show that ResNet-50 performs poorly; statistical analysis reveals that its predictions are consistently biased toward a single class. This suggests that the features extracted by ResNet-50 may not be directly suitable for mammography classification tasks. In contrast, DINOv2 achieves performance second only to LLaVA-Mammo. Since MammoVQA is essentially a pure visual classification task, it is intuitive that DINOv2 outperforms ViLT-based models, which must learn joint representations of image and text: the textual modality may introduce irrelevant noise, and the language model could interfere with visual features. The superior performance of LLaVA-Mammo can be attributed to the powerful generalization capability of its large language model.

External validation

To evaluate the generalizability and reliability of LLaVA-Mammo, we conduct external validation using four independent datasets (DBT48, LAMIS49, MM50, and NLBS51). The external benchmark sets encompass six question topics: BI-RADS (breast), breast density, view, laterality, pathology (breast), and pathology (finding). External validation across 4 external datasets (shown in Fig. 3f) demonstrates that LLaVA-Mammo outperforms the best-performing model by 19.87% in absolute accuracy at the dataset level, and by 26.39% in absolute accuracy and 21.21% in weighted accuracy at the question-topic level. These findings demonstrate that LLaVA-Mammo maintains strong generalization capabilities, consistently outperforming existing models across all tasks and showing superior performance compared to domain-specialized models, thus indicating robust cross-dataset adaptability and reliability for diverse mammogram interpretation tasks.

Discussion

Importance of MammoVQA

MammoVQA is a large-scale mammogram VQA dataset, comprising 131,847 images with 420,923 QA pairs for image-level cases and 72,518 examinations (475,971 images) with 144,169 QA pairs for exam-level cases. MammoVQA addresses pivotal challenges in medical multimodal AI by: (1) establishing a dedicated evaluation framework for assessing LVLMs’ diagnostic capabilities in mammogram interpretation, (2) providing curated training data to adapt general-purpose LVLMs to the specialized domain of breast imaging through structured QA pairs, (3) emulating clinical reading paradigms by including both single-view cases and multi-view cases that reflect radiologists’ reliance on composite information for accurate diagnosis, and (4) creating the foundational infrastructure for developing human-AI collaborative systems where models can assist in preliminary screening while maintaining physician oversight for final decisions.

Importantly, all questions in MammoVQA are closed-ended and formulated as fundamental classification tasks. We adopt this simple but critical format, as poor performance on these basic tasks would indicate even greater challenges for more complex, open-ended VQA scenarios. The systematic evaluation of LVLMs on such fundamental tasks is not just a technical exercise, but has tangible clinical implications: As LVLMs demonstrate broad potential in medical image analysis, their ability to reliably interpret medical images directly impacts the safety, generalizability, and trustworthiness of future clinical decision support systems. LVLMs not only have the potential to enable more natural human-computer interactions, such as generating imaging descriptions and preliminary assessments through conversational interfaces, but could also significantly improve the efficiency and interpretability of medical report generation. For these reasons, before deploying such models in real-world applications, it is essential to rigorously and systematically evaluate their foundational capabilities, particularly in high-stakes domains such as healthcare. Performance in basic classification tasks serves as a critical indicator of visual-language alignment and generalization ability in medical concepts.

Performance analysis

Through detailed analysis of the answer distributions of MedDr and LLaVA-NeXT-Interleave, we find that their prediction patterns are distinctive. For pathology questions, when the true answer is 'normal', the models tend to predict correctly, while for other categories the predictions appear random. In the abnormality task, they likewise show high accuracy for 'normal' samples, but for abnormal samples they tend to predict the most common 'mass' category. We believe that the strong performance of MedDr may be attributed to the inclusion of mammogram data in its training set, while the strong performance of LLaVA-NeXT-Interleave lacks a clear explanation.

In the BI-RADS (exam) task, LLaVA-Mammo's weighted accuracy is 1.07% lower than that of Qwen2.5-VL, indicating that fine-tuning does not give LLaVA-Mammo the ability to diagnose a patient's BI-RADS score from multiple images. Based on the good results in the density (exam) task, we can confirm that the model can extract key information from multiple images. However, since judging the BI-RADS score requires very detailed information, and different images in the same exam may vary greatly in the key details they show, we believe one problem is that LLaVA-Mammo cannot determine which image's information to rely on. Second, we find that LLaVA-Mammo performs very well on tasks such as view and laterality, which require only macro-level information, but poorly on problems such as identifying abnormalities in breast images, which require fine-grained detail. This shows that LLaVA-Mammo's ability to capture detailed information is weak, which we consider another problem.

ViLT-Mammo-Expert performs at the level of a random guess in the background tissue and masking potential tasks. We attribute this to the imbalanced distribution of the training data. In contrast, ViLT-Mammo achieves good weighted accuracy on these 2 tasks, outperforming a random guess by 19.37% and 9.20%, respectively. We believe this is because the background tissue, masking potential, and density (breast) tasks share commonalities, as they are all related to the density of breast glands. The multi-classification-head design of ViLT-Mammo allows for knowledge sharing, which mitigates the impact of data imbalance.

Reliability analysis of LLaVA-Mammo

We select 2 top-performing models (InternVL3 and MedGemma) from 12 chosen LVLMs and compare their performance on question topics with 3 image-text domain-optimized models. In Fig. 3c, the outer bars represent the results of absolute accuracy, while the inner bars represent the results of weighted accuracy. The proximity of the two bars reflects the reliability of the model’s predictions: the closer the bars, the more stable the model’s predictions. From the figure, it can be observed that for the view and laterality tasks, the two bars for all 5 models are very close, indicating that these tasks are relatively simple and the models can handle them well. However, for the abnormality (breast) and abnormality (finding) tasks, the gaps between the two bars for all 5 models are significant, reflecting the complexity of abnormality tasks, particularly in identifying abnormality in mammograms, where the models’ reliability still has considerable room for improvement.

Further analysis of Fig. 3d reveals that the balance of the distribution of training data has a significant impact on the performance of the trained model. The figure shows that the more balanced the training data distribution, the smaller the gap between the two bars for the 3 trained models, indicating that the balanced training data improves the reliability of model prediction. Furthermore, Fig. 3e shows that LLaVA-Mammo consistently outperforms ViLT-based models on the finding dataset, and the overall performance of the finding dataset is significantly higher than that of the breast dataset. This result reflects LLaVA-Mammo’s limitations in extracting detailed information from breast images.

From Fig. 2f, we can observe that, apart from the view and laterality tasks, the labels of other tasks exhibit a gradual progression from mild to severe. Although the prediction distributions for background tissue and BI-RADS (exam) are less ideal due to uneven training data distribution and relatively high task difficulty, the prediction distributions for other tasks show clear trend characteristics. This indicates that LLaVA-Mammo indeed possesses the ability to distinguish different features in mammograms, especially when handling tasks with varying degrees of severity, where the model can capture hierarchical information in the data.

Limitations and future works

Our study has four primary limitations. First, computational constraints limit our experiments to Vicuna-7B, preventing verification of the scaling law hypothesis with larger LVLMs. Second, performance bias emerges from the imbalance of training data in the question topics. Third, MammoVQA’s classification-based design limits answers to closed categories, restricting LVLMs’ open-ended reasoning potential. Fourth, no systematic investigation was conducted to identify optimal model architectures or training strategies that would maximize MammoVQA’s effectiveness.

These limitations motivate four research priorities: (1) scaling to larger architectures, (2) developing robust data balancing methods, (3) constructing open-ended mammogram VQA datasets to properly assess LVLMs’ mammography interpretation abilities, and (4) using the MammoVQA dataset to develop lightweight LVLMs via knowledge distillation techniques, optimizing for real-time clinical mammogram QA applications.

It is crucial to emphasize that MammoVQA is fundamentally a technical benchmark designed to evaluate AI model performance, not a study comparing diagnostic accuracy against experienced radiologists. Therefore, any reported ‘superior accuracy’ should be strictly interpreted as superior technical performance on this specific benchmark task and must not be misconstrued as evidence of superior clinical diagnostic utility. Superior performance on the MammoVQA benchmark is a necessary initial step. Nevertheless, it is critically important to recognize that this represents technical proficiency rather than proven clinical utility. The definitive evidence for any model’s diagnostic value must ultimately be established through prospective, patient-centered clinical studies.

Methods

This project has been reviewed and approved by the Human and Artefacts Research Ethics Committee (HAREC). The protocol number is HREP-2025-0025.

MammoVQA construction

As shown in Fig. 1c, we use the classification labels of each dataset to categorize all datasets based on the label types. Specifically, datasets where each label corresponds to an individual image are categorized as breast datasets. In contrast, datasets where the labels correspond to an entire examination are classified as exam datasets, provided that each examination contains no more than 15 images. The VinDr-Mammo and RSNA datasets include labels for both examinations and individual images; to ensure data balance, these datasets are treated as breast datasets. For the breast datasets, if bounding boxes and corresponding labels for the findings are provided, the finding regions are cropped and used to construct the finding datasets. Additionally, since the finding dataset only includes confirmed abnormal cases and lacks normal cases, random crops of the same size as the findings are taken from the original images to create normal cases. This structured approach ensures that MammoVQA is well organized and covers a wide range of mammography-related tasks, facilitating effective model evaluation. All datasets are presented in Tables 1 and 2.
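The routing and cropping rules above can be summarized in a short sketch; the field names and dictionary representation here are invented for illustration and do not reflect the source datasets' actual schemas.

```python
MAX_EXAM_IMAGES = 15

def categorize(dataset):
    """Route a source dataset to 'breast' or 'exam' following the rules
    in the text (dataset is a dict with assumed, illustrative keys)."""
    if dataset.get("has_both_levels"):          # e.g. VinDr-Mammo, RSNA
        return "breast"                         # treated as breast for balance
    if (dataset["label_level"] == "exam"
            and dataset["images_per_exam"] <= MAX_EXAM_IMAGES):
        return "exam"
    return "breast"

def crop_finding(image, bbox):
    """Crop one annotated finding region (image as a nested list of rows).
    A same-size random crop of a normal image would supply the 'normal'
    class; that step is omitted here."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]
```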

Table 1 Download links and characteristics of MammoVQA internal datasets
Table 2 Download links and characteristics of MammoVQA external datasets

MammoVQA splits

In constructing MammoVQA, the internal data are divided into training, validation, and internal benchmark sets in a 7:1:2 ratio. This division applies to the 11 internal datasets, ensuring that each sub-dataset is proportionally represented across all three sets. Additionally, MammoVQA includes 4 external datasets (as shown in Fig. 1a) that are reserved exclusively for external validation, providing an evaluation framework that assesses model generalization across diverse data distributions. The division also guarantees comprehensive coverage of all question topics, such as BI-RADS classification and density assessment, within each internal split. This is particularly important because not every mammogram carries labels for every question topic. By ensuring a balanced distribution of question topics across the splits, the design maximizes the representation of all labels, enables models to learn and be evaluated on multiple tasks effectively, and minimizes the risk of certain sub-datasets or question topics disproportionately influencing model performance. Maintaining proportional representation of sub-datasets and balanced coverage of question topics also helps mitigate potential data biases, providing a more realistic reflection of a model’s ability to handle the diversity and complexity inherent in mammogram analysis.
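A stratified 7:1:2 split of this kind can be sketched as follows. This is a minimal illustration rather than the actual MammoVQA split code; `key` is assumed to return the stratum identifier for a sample, e.g. its (sub-dataset, question topic) pair:

```python
import random
from collections import defaultdict

def stratified_split(samples, key, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split samples 7:1:2 within each stratum, so that every
    sub-dataset/question-topic combination is proportionally
    represented in the train, validation, and benchmark sets."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder -> internal benchmark
    return train, val, test
```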

Question-Answer pair generation

To construct the question-answer (QA) pairs based on the identified question topics, we leverage the category information of each question topic to guide the process, as shown in Fig. 1d. For each question topic, we identify whether it originates from a breast dataset, a finding dataset, or an exam dataset, and accordingly use GPT-4o52 to generate four corresponding question templates. For question topics from the breast dataset and finding dataset, the templates use terms such as ‘image’ or ‘mammogram’, whereas, for exam datasets, these terms are replaced with ‘exam’ to maintain consistency with the dataset context. To better evaluate the performance of LVLMs, we construct our QA pairs in a multiple-choice format. Specifically, for each dataset entry in MammoVQA, we construct prompts based on whether the question topic corresponds to a single-choice or multiple-choice question. For single-choice questions, the prompt is designed to ensure concise and evaluable answers, adopting the following structure: ‘This is a mammography-related medical question with several options, only one of which is correct. Select the correct answer and respond with just the chosen option, without any further explanation. ### Question: {Question} ### Options: {Options}. ### Answer:’. For multiple-choice questions, a similar structure is used with slight modifications to the phrasing of the instructions. Here, {Question} represents the question generated by GPT-4o, and {Options} is a randomized list of all possible options for the corresponding question topic. For example, for the question topic ‘Laterality’, {Options} could take the form of ‘A: Left, B: Right’ or ‘A: Right, B: Left’, depending on the random shuffle. Randomizing the order of the options is crucial to avoid biases where LVLMs might consistently predict the same choice (e.g., ‘A’) due to a tendency toward fixed option orders, thus ensuring more realistic performance evaluation metrics. 
By carefully designing the prompts and randomizing the orders of the options, our objective is to minimize biases and maximize the evaluability of the responses of the LVLMs, thus improving the reliability of our benchmark results. The examples of QA pairs of each question topic can be viewed in Fig. 1e.
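The single-choice prompt construction and option shuffling described above can be sketched as follows. The template wording is quoted from the text; the function name and structure are illustrative, not the actual generation code:

```python
import random

SINGLE_CHOICE_TEMPLATE = (
    "This is a mammography-related medical question with several options, "
    "only one of which is correct. Select the correct answer and respond with "
    "just the chosen option, without any further explanation. "
    "### Question: {question} ### Options: {options}. ### Answer:"
)

def build_single_choice_prompt(question, option_texts, rng=None):
    # Randomize option order to avoid positional bias (e.g. a model that
    # always answers 'A'), then label the shuffled options A, B, C, ...
    rng = rng or random.Random()
    opts = list(option_texts)
    rng.shuffle(opts)
    labeled = ", ".join(f"{chr(ord('A') + i)}: {o}" for i, o in enumerate(opts))
    return SINGLE_CHOICE_TEMPLATE.format(question=question, options=labeled)
```

For the ‘Laterality’ topic, for example, the `{Options}` slot becomes either ‘A: Left, B: Right’ or ‘A: Right, B: Left’ depending on the shuffle.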

Evaluation metrics

We employ three evaluation metrics in our study:

$$\text{Absolute Accuracy}=\frac{1}{N}\sum _{i=1}^{N}scor{e}_{i}$$
(1)
$$\text{Weighted Accuracy}=\frac{\sum _{i=1}^{N}{w}_{i}\cdot scor{e}_{i}}{\sum _{i=1}^{N}{w}_{i}}\quad \text{where}\quad {w}_{i}=\frac{1}{|{{\mathcal{C}}}_{i}|}$$
(2)

where N is the total number of samples, \(scor{e}_{i}\in \{0,1\}\) indicates whether the i-th prediction is correct, and \(|{{\mathcal{C}}}_{i}|\) is the number of samples in the i-th sample’s category. Additionally, we report Macro F1-score results in the supplementary tables for comprehensive performance evaluation.
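Under these definitions, both metrics reduce to a few lines of code. This is a minimal sketch, assuming `scores` is the per-sample 0/1 correctness list and `categories` the per-sample category labels:

```python
from collections import Counter

def absolute_accuracy(scores):
    # Eq. (1): mean of the per-sample 0/1 correctness scores
    return sum(scores) / len(scores)

def weighted_accuracy(scores, categories):
    # Eq. (2): each sample is weighted by 1/|C_i|, the inverse size of its
    # category, so every category contributes equally regardless of size
    counts = Counter(categories)
    weights = [1.0 / counts[c] for c in categories]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Weighted accuracy thus counteracts class imbalance: a majority category cannot dominate the score.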

Evaluation method

Since the outputs of LVLMs are open-ended text, to obtain a reliable and accurate measure of model performance, we adopt a two-step evaluation approach for single-choice questions. First, we use difflib.SequenceMatcher to calculate the similarity ratio (2M/T, where M is the total number of matching characters and T is the combined length of the two texts) between the predicted text and each option, and then sort the options in descending order of similarity. If a single option has the highest similarity, we select it as the final output. If multiple options tie for the highest similarity, we recompute the similarity with the fuzzywuzzy library, which is based on Levenshtein distance. If the option with the highest similarity is still not unique, the output is recorded as ‘make no choice’ and judged as incorrect. For multiple-choice questions, we use a keyword-matching method: a prediction is considered correct if and only if all the correct answers appear in the model’s output.
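The two-step matching procedure can be sketched as follows. Because fuzzywuzzy is a third-party dependency, the tie-breaking step below substitutes a plain dynamic-programming Levenshtein distance for its similarity call; the control flow otherwise follows the procedure described above:

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    # Standard DP edit distance (stand-in for fuzzywuzzy's Levenshtein-based score)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_option(prediction, options):
    # Step 1: rank options by difflib similarity ratio (2M/T)
    ratios = {o: SequenceMatcher(None, prediction, o).ratio() for o in options}
    best = max(ratios.values())
    tied = [o for o, r in ratios.items() if r == best]
    if len(tied) == 1:
        return tied[0]
    # Step 2: break ties by Levenshtein distance (smaller = more similar)
    dists = {o: levenshtein(prediction, o) for o in tied}
    best_d = min(dists.values())
    tied2 = [o for o, d in dists.items() if d == best_d]
    # Still ambiguous -> 'make no choice', scored as incorrect
    return tied2[0] if len(tied2) == 1 else "make no choice"

def match_multiple_choice(prediction, correct_answers):
    # Keyword matching: correct iff every correct answer appears in the output
    return all(ans in prediction for ans in correct_answers)
```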

Model training detail and hyperparameter setting

To obtain the LLaVA-Mammo model, we adopt Low-Rank Adaptation (LoRA)53 to fine-tune the LLM component of LLaVA-NeXT while freezing the parameters of the other parts of the model. Specifically, we set the LoRA hyperparameters as follows: both lora_alpha and lora_r were set to 8, and lora_dropout was set to 0.05. During training, the total number of epochs was set to 1, the batch size was 16, the learning rate was 2 × 10−5, and the maximum text length was 32768. The entire fine-tuning process took approximately 10 days on four NVIDIA L20 GPUs (48 GB each). For the two ViLT-based models and two vision-only models, we froze the parameters of the vision backbone during training. The number of training epochs and the batch size were the same as those for LLaVA-Mammo (1 and 16, respectively), and the learning rate was set to 0.001. All training processes were implemented using Python 3.9, PyTorch 2.5.1, and CUDA 12.2.
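For context on the two LoRA hyperparameters, the adapter adds a low-rank term to each frozen weight, computing y = W x + (alpha/r) · B A x, where A is r × d_in and B is d_out × r and only A and B are trained. A dependency-free sketch of this forward pass (illustrative only, with alpha = r = 8 as in our setting, so the scaling factor is 1):

```python
def matvec(M, v):
    # Plain matrix-vector product over nested lists
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=8, r=8):
    # y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

Because A and B together hold far fewer parameters than W, fine-tuning only them is what makes adapting the LLM component tractable on four 48 GB GPUs.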

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.