Introduction: the “big data dream” in pathology

Artificial intelligence (AI) is rapidly transforming pathology, offering unprecedented opportunities to improve diagnostic quality and standardization, particularly in complex, high-impact disease areas such as breast cancer1,2. Major drivers of this transformation are the so-called “foundation models”, i.e., self-supervised AI systems trained on large datasets, with pathology whole slide images (WSIs) as a central data source3,4. A key feature of these models is their potential adaptation to diverse downstream tasks (e.g., slide-level classification, region-of-interest analysis, survival prediction, and biomarker discovery)5. Digital pathology enthusiasts and technology companies anticipate that these models will perform diagnostic tasks with accuracy comparable to, or even exceeding, that of a human pathologist6,7,8,9. At the same time, novel AI tools are showing promising results in predicting molecular alterations directly from WSIs, including breast cancer subtyping and molecular status prediction, suggesting a potential to complement, or even anticipate, traditional biomarker testing4,10.

Following the advances in various fields of image recognition (e.g. image search, surveillance, facial/object detection, autonomous driving), the dominant view in digital pathology has been that larger models trained on more data will lead to better performance11,12. This assumption has fueled a renewed emphasis on the “big data” paradigm and the pursuit of increasingly complex model architectures, with the expectation that data scale alone would ensure robustness and generalizability for clinical use13. Notably, this mirrors the early days of genomics, when comprehensive approaches like whole-genome sequencing (WGS) and “multi-omics” were hailed as the ultimate tools for precision oncology14. Yet, over time, the field shifted toward more pragmatic, focused strategies such as targeted sequencing panels, which proved more efficient and clinically actionable15,16,17. In the field of digital pathology, however, the allure of big data has been reborn, raising important questions about whether this approach will ultimately yield similar adjustments toward streamlined, purpose-driven solutions18,19.

While many academic centers and pathology or biobank consortia now generate impressive volumes of data, these resources remain largely underutilized in the development of clinically meaningful and deployable AI solutions20,21,22,23. One of the main barriers is the lack of high-quality data curation, harmonization, and annotation, which are essential, albeit resource-intensive, steps24. Consequently, the indiscriminate accumulation of data, especially from digitized retrospective tissue archives, often yields datasets that are difficult to use, marked by inconsistent metadata and variable quality25. These limitations can lead models not only to overfit, but also to underperform due to label noise, domain shift, or bias, ultimately compromising clinical utility26. Adding to these challenges, the infrastructure required to train and deploy such models is not only computationally intensive and costly, but also environmentally unsustainable27,28. To develop clinically impactful solutions for breast cancer management, the field of computational pathology must progressively move toward smarter, task-oriented approaches that prioritize efficiency and scalability.

Limits of large foundation models in breast cancer

In breast cancer, where molecular profiling is essential for diagnosis, prognosis, and therapeutic decision-making, the implementation of foundation models in routine pathology is evolving. Unlike traditional deep learning models trained on narrow, single-purpose datasets, foundation models are designed to exploit extremely large and diverse bodies of data, often including millions of WSIs from heterogeneous cohorts, precisely to improve generalizability and robustness across tumor types, populations, and institutions29,30. One of the key advantages of foundation models lies in their capacity to act as universal feature extractors, adaptable across multiple downstream tasks and cancer types, rather than requiring bespoke training for each endpoint31. Moreover, it is essential to distinguish between vision-only models and multimodal models that integrate both histological images and textual data32,33. The latter, image-text aligned models, have shown superior generalization capabilities in recent studies, outperforming vision-only approaches in the prediction of clinical biomarkers and subtypes34. For example, a recent study directly compared visual and multimodal foundation models across several cancer types, demonstrating consistent gains in performance and task adaptability when textual data were incorporated35.

However, variability in WSI quality, staining protocols, scanner devices, and tumor morpho-biological features can compromise model reliability36. This is particularly relevant for rare histologic subtypes, such as micropapillary or apocrine carcinomas, which, while underrepresented in large datasets, may possess highly specific morphological features37,38,39. From a deep learning perspective, this morphological distinctiveness might even facilitate tumor recognition by foundation models, potentially reducing the number of training examples required for effective learning40. Nevertheless, the need for ad hoc studies focusing on rare variants remains, especially when the goal is regulatory-grade validation1,41.

Beyond model performance, there are systemic barriers to clinical integration that extend to digital infrastructure, data governance, and computational resources42,43. Many pathology departments, particularly in non-academic or resource-limited settings, still operate with fragmented information technology (IT) systems and laboratory information systems (LIS) that are ill-suited for AI-based workflows44. These challenges, shared across all AI tools and not just foundation models, are part of the broader effort of digital transition in pathology, which involves standardization of data formats, secure storage, interoperability, and scalable GPU infrastructure45. It has recently been emphasized how these challenges can delay or hinder the deployment of AI solutions even in well-resourced centers46,47,48. An overview of recent foundation models and their applications in breast cancer is provided in Table 112,49,50,51,52,53.

Table 1 Foundation models applied to breast cancer whole-slide images
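
To make the “universal feature extractor” idea above concrete, the sketch below freezes a generic pretrained backbone as a stand-in for a pathology foundation model encoder and attaches a lightweight gated-attention pooling head for slide-level prediction. The backbone, tile size, and head dimensions are illustrative assumptions, not the architecture of any model listed in Table 1.

```python
# Minimal sketch: frozen "foundation" encoder + lightweight slide-level head.
# The ResNet-50 backbone is only a stand-in for a pathology foundation model
# encoder; in practice the tile embeddings would come from a pretrained model.
import torch
import torch.nn as nn
from torchvision import models

class FrozenTileEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)      # placeholder weights
        backbone.fc = nn.Identity()                   # keep 2048-d embeddings
        for p in backbone.parameters():
            p.requires_grad = False                   # encoder stays frozen
        self.backbone = backbone

    def forward(self, tiles):                         # tiles: (n_tiles, 3, 224, 224)
        with torch.no_grad():
            return self.backbone(tiles)               # (n_tiles, 2048)

class GatedAttentionHead(nn.Module):
    """Gated attention pooling over tile embeddings for a slide-level label."""
    def __init__(self, dim=2048, hidden=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, feats):                         # feats: (n_tiles, dim)
        a = self.attn_w(self.attn_v(feats) * self.attn_u(feats))
        a = torch.softmax(a, dim=0)                   # per-tile attention weights
        slide_embedding = (a * feats).sum(dim=0)      # attention-weighted pooling
        return self.classifier(slide_embedding), a    # logits + tile attention

encoder, head = FrozenTileEncoder(), GatedAttentionHead()
tiles = torch.randn(32, 3, 224, 224)                  # toy bag of 32 tiles
logits, attention = head(encoder(tiles))
print(logits.shape, attention.shape)                  # (2,) and (32, 1)
```

Only the small attention head would be trained for each downstream task, which is what makes the frozen-encoder setup attractive when labeled slides are scarce.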

Optimized models for breast cancer molecular pathology

In response to the limitations of large-scale foundation models, a new generation of optimized AI models is gaining momentum53,54. These models are intentionally designed to be compact, task-specific, and clinically aligned, offering a pragmatic alternative for AI integration in breast cancer diagnostics35. Rather than attempting to capture the full morphological spectrum of disease, they are trained to perform well-defined diagnostic or predictive tasks, such as hormone receptor (HR), HER2, or Ki-67 status assessment55,56,57, typically in the early-stage setting, where treatment decisions are based on precise immunohistochemical stratification. This task-specific paradigm has been the dominant approach in computational pathology since its inception, well before the advent of foundation models.

Although the field has recently embraced large-scale pathology foundation models (PFMs) for their promise of general-purpose adaptability, early experience suggests that their complexity, data requirements, and computational cost may limit their immediate clinical applicability58. Recent work with PFMs has demonstrated strong generalization across multiple tumor types, including rare cancers and out-of-distribution cohorts (AUC ≈ 0.95), but performance gaps remain for certain rare variants, emphasizing that purely generalist solutions may not fully meet clinical needs. These observations underscore the ongoing relevance of task-oriented models, which can be optimized for specific diagnostic or predictive endpoints and deployed efficiently in real-world workflows58.

Some models have also been developed to infer actionable genomic alterations, such as germline BRCA mutations, directly from histopathological slides59. Faycal et al. introduced a convolutional neural network (CNN) trained on H&E slides from triple-negative breast cancer cases to predict BRCA mutational status60. Similarly, Bergstrom et al. proposed a deep learning model to predict homologous recombination deficiency (HRD) by integrating histological and genomic features, achieving area under the curve (AUC) values ranging from 0.78 to 0.8761. Rather than directly identifying a unique phenotype, these models estimate the probability of HRD based on morphologic features that co-occur with the alteration in the training set61. As such, they hold promise as screening tools or decision aids to prioritize confirmatory sequencing, particularly in resource-limited settings, but they should not be considered a replacement for molecular testing.
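
Because screening-style predictors such as the HRD model above are typically reported as slide-level AUCs, a minimal sketch of how such a classifier could be evaluated is shown below on purely synthetic slide-level features; the feature dimensionality, logistic-regression head, and bootstrap confidence interval are illustrative assumptions, not the published pipelines.

```python
# Sketch: evaluating a compact biomarker classifier (e.g., HRD "high vs. low")
# on synthetic slide-level feature vectors, reporting AUC with a bootstrap CI.
# All data here are simulated; no published model or dataset is reproduced.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_slides, n_features = 400, 64                     # hypothetical slide embeddings
X = rng.normal(size=(n_slides, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=n_slides) > 0).astype(int)  # toy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, scores)
boot = []
for _ in range(1000):                              # bootstrap the held-out set
    idx = rng.integers(0, len(y_te), len(y_te))
    if len(np.unique(y_te[idx])) < 2:              # AUC needs both classes present
        continue
    boot.append(roc_auc_score(y_te[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {auc:.2f} (95% bootstrap CI {lo:.2f}-{hi:.2f})")
```

Reporting an interval rather than a single AUC makes it easier to judge whether a screening tool is reliable enough to triage cases for confirmatory sequencing.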

The benefits of compact models are both technical and clinical. By avoiding the computational burden of large foundation architectures, optimized models offer faster inference times, lower hardware requirements, and greater ease of deployment in real-world pathology workflows62,63. One example is Orpheus, a multimodal deep learning model trained to infer the Oncotype DX Recurrence Score from H&E-stained WSIs64. This tool demonstrated the ability to stratify patients by risk of recurrence, independent of molecular surrogate markers, opening the door to histology-based decision support in settings where molecular assays are unavailable or cost-prohibitive.
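
As a rough illustration of why compact backbones ease deployment, the toy benchmark below compares parameter counts and CPU inference latency of a larger versus a smaller generic encoder; both architectures are stand-ins chosen for the example and are unrelated to Orpheus or any model cited above.

```python
# Toy comparison of model size and CPU inference latency for a larger vs. a
# compact backbone; the architectures are generic stand-ins, not the models
# discussed in the text, and absolute timings depend on the host hardware.
import time
import torch
from torchvision import models

def profile(model, name, batch):
    model.eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        model(batch)                                   # warm-up pass
        t0 = time.perf_counter()
        for _ in range(5):
            model(batch)
        latency = (time.perf_counter() - t0) / 5
    print(f"{name:>12}: {n_params:6.1f} M params, {latency * 1000:7.1f} ms / batch")

batch = torch.randn(8, 3, 224, 224)                    # 8 tiles per forward pass
profile(models.resnet50(weights=None), "ResNet-50", batch)
profile(models.mobilenet_v3_small(weights=None), "MobileNetV3", batch)
```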

These advances extend beyond biomarker prediction. Recently, RlapsRisk BC, a deep learning model developed to assess metastatic relapse risk in early-stage ER-positive/HER2-negative breast cancer, demonstrated that WSIs alone can predict 5-year metastasis-free survival (MFS) with a concordance index (C-index) of 0.81, outperforming traditional clinico-pathological models (C-index 0.76, p < 0.05)65. Notably, combining AI-derived risk with clinical features improved both sensitivity and specificity in patient stratification. Importantly, expert review of model-identified high-impact regions confirmed that the predictions were grounded in recognizable histological features, reinforcing the model’s interpretability and biological plausibility. Together, these examples highlight the clinical promise of optimized AI models for both molecular classification and outcome prediction directly from standard histological slides.
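
For readers unfamiliar with the headline metric in this comparison, a minimal, self-contained implementation of the concordance index for right-censored data is sketched below on synthetic risk scores; it simply counts concordant comparable pairs and is intended only to illustrate how the metric is computed, not to reproduce the cited analysis.

```python
# Minimal concordance index (C-index) for right-censored survival data.
# A pair (i, j) is comparable if the subject with the shorter follow-up time
# experienced the event; it is concordant if that subject also received the
# higher predicted risk. Ties in risk contribute 0.5. Illustrative only.
import numpy as np

def concordance_index(times, events, risks):
    times, events, risks = map(np.asarray, (times, events, risks))
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                          # censored subjects cannot anchor a pair
        for j in range(n):
            if times[j] > times[i]:           # j outlived i, so the pair is comparable
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0         # higher risk for the earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Synthetic example: higher risk scores track shorter metastasis-free survival.
# For simplicity, censoring is simulated only through the event indicator.
rng = np.random.default_rng(1)
risk = rng.uniform(size=200)
time_to_event = rng.exponential(scale=1.0 / (0.2 + risk))
event = rng.uniform(size=200) < 0.7                         # ~30% censoring
print(f"C-index: {concordance_index(time_to_event, event, risk):.2f}")
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the reported gain from 0.76 to 0.81 represents a meaningful improvement in risk discrimination.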

Model distillation and deployment

In breast cancer pathology, the successful adoption of AI depends less on abstract performance metrics and more on its ability to deliver actionable, explainable, and accessible solutions within real-world diagnostic settings66,67,68. Even high-performing foundation models often fail to deliver when they cannot be integrated into routine workflows, explained to clinicians, or accessed by institutions with limited technical infrastructure69,70. Explainability is increasingly recognized as a prerequisite for clinical adoption of AI models, particularly in scenarios where model outputs appear to exceed human perception. Traditional methods such as attention mapping, saliency maps, Grad-CAM, and concept attribution techniques can localize regions or features that most strongly influence model predictions, providing visual cues that support human verification. However, recent evidence has highlighted important limitations. The Explainability Paradox71 showed that different explanation methods may produce inconsistent or even contradictory outputs, and that pathologists vary widely in how they interpret and trust these explanations. Moreover, explainability methods can be sensitive to small perturbations, may highlight non-causal artifacts, and do not necessarily clarify why a given pattern is predictive, raising concerns about stability and fidelity.

Beyond explainability, the emerging concept of causability emphasizes that clinicians should be able to interact with AI systems by formulating “what-if” questions, exploring counterfactuals, and investigating how changes in input would alter predictions72. Such interactive and human-in-the-loop approaches may improve understanding, enable error analysis, and foster trust in algorithmic recommendations. This is particularly crucial when AI models appear to outperform human observers, as in the case of the Quantitative Continuous Scoring (QCS) model for TROP2 expression in lung cancer, which provides a reproducible and continuous score that can be cross-validated by experts73. Ensuring that predictions are not only accurate but also interpretable and biologically plausible is essential to support safe deployment and clinician acceptance in real-world workflows.

Among the techniques that enable the development of optimized models, distillation stands out for its translational value74,75. Rather than learning from raw data alone, the distilled (student) model learns from the outputs of a larger, pretrained (teacher) model, inheriting key insights while shedding unnecessary computational weight74,75. This approach is particularly suited to breast cancer, where molecular features must be interpreted consistently and rapidly across diverse institutional contexts76. Distilled models are easier to interpret, update, and validate, making them well aligned with regulatory requirements and clinical expectations. Moreover, their simplicity fosters transparency, a prerequisite for clinical trust and broader adoption, especially when AI is used to predict therapeutic biomarkers or to perform risk stratification69,75,77,78. Examples such as compact models distilled for microsatellite instability (MSI) prediction in colorectal cancer, breast cancer risk estimation directly from H&E slides, and the QCS model for TROP2 expression in lung cancer demonstrate that distillation is more than technical optimization79,80. This approach to TROP2 quantification is also likely to play a role in breast cancer81.
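
A minimal sketch of the teacher-student distillation step described above is given below: a compact student is trained against the temperature-softened outputs of a frozen teacher plus the ground-truth labels. The toy architectures, temperature, and loss weighting are illustrative assumptions rather than any published recipe.

```python
# Sketch of knowledge distillation: a compact student learns from the softened
# outputs of a frozen teacher plus the hard labels. Architectures, temperature,
# and loss weights are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 2)).eval()
student = nn.Sequential(nn.Linear(2048, 128), nn.ReLU(), nn.Linear(128, 2))
for p in teacher.parameters():
    p.requires_grad = False                           # teacher stays frozen

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Soft targets: temperature-scaled KL between teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)    # standard supervised term
    return alpha * soft + (1 - alpha) * hard

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(100):                               # toy loop on random features
    feats = torch.randn(16, 2048)                     # e.g., slide/tile embeddings
    labels = torch.randint(0, 2, (16,))
    with torch.no_grad():
        t_logits = teacher(feats)
    loss = distillation_loss(student(feats), t_logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final toy loss: {loss.item():.3f}")
```

Only the student is shipped to the clinical site, which is what keeps inference cheap while retaining much of the teacher's behavior.
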
The deployment of AI models for diagnostic and predictive biomarker workflows raises several ethical challenges65,82. When algorithms are used to stratify patients for targeted therapies or inclusion in clinical trials, such as with QCS of TROP2 expression, there is a risk that black-box predictions may drive critical clinical decisions without sufficient human interpretability or confirmatory testing. This can amplify biases present in the training data, potentially leading to systematic over- or under-treatment of specific patient subgroups83,84. Ethical deployment therefore requires rigorous external validation, prospective studies, and robust quality control pipelines that allow clinicians to audit model outputs and compare them against visual inspection or orthogonal molecular assays where feasible85. Furthermore, transparent reporting of model development, dataset composition, and performance on diverse populations is essential to ensure equity, reproducibility, and patient safety65,84. These considerations are central to building trustworthy AI systems that complement, rather than replace, expert judgment in pathology.

Bias and fairness considerations

Bias can enter AI-based computational pathology workflows at multiple levels: dataset composition (e.g., over-representation of certain tumor subtypes, demographics, or staining protocols), label quality, model architecture, and evaluation metrics86,87. Importantly, “human-in-the-loop” approaches, while valuable for improving interpretability and trust, can unintentionally amplify bias if the human feedback reflects pre-existing diagnostic conventions or subjective patterns, reinforcing rather than correcting model errors88. Similarly, targeted models optimized for specific biomarkers may perform well in the training domain but fail to generalize to under-represented populations or rare morphologies89. Mitigation strategies include curating diverse and representative training datasets, using bias-aware metrics (e.g., subgroup performance reporting), and performing external validation across multiple institutions84. Regular auditing and monitoring of deployed models are also recommended to detect and correct bias drift over time, ensuring equitable performance for all patient subgroups83,90. Emerging explainability frameworks can further help identify model weaknesses and spurious correlations before deployment91.
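
As one concrete form of the bias-aware metrics mentioned above, the sketch below reports AUC separately per subgroup (here, hypothetical scanner/site groups) on synthetic predictions and flags large gaps; the subgroup definitions and the 0.05 gap threshold are assumptions for illustration only.

```python
# Sketch of subgroup performance reporting: compute AUC per subgroup and flag
# large gaps relative to the overall cohort. Data, subgroup names, and the
# 0.05 gap threshold are synthetic / illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "label": rng.integers(0, 2, n),
    "subgroup": rng.choice(["scanner_A", "scanner_B", "site_C"], n),
})
# Simulate a model that performs slightly worse on one subgroup.
noise = np.where(df["subgroup"] == "site_C", 0.9, 0.4)
df["score"] = df["label"] + rng.normal(0, noise, n)

overall_auc = roc_auc_score(df["label"], df["score"])
print(f"overall AUC: {overall_auc:.2f}")
for name, grp in df.groupby("subgroup"):
    auc = roc_auc_score(grp["label"], grp["score"])
    flag = "  <-- review" if overall_auc - auc > 0.05 else ""
    print(f"{name:>10}: AUC {auc:.2f} (n={len(grp)}){flag}")
```

The same pattern extends to other stratifiers (age group, ethnicity, staining batch) and can be rerun periodically on deployed models to detect bias drift.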

Conclusion and future directions: precision over power

AI is entering a new phase in breast cancer pathology, characterized by a shift in focus from technological scale to clinical precision. The limitations of large foundation models, including challenges in integration, interpretability, and consistency across real-world clinical settings, underscore the need for a more pragmatic and clinically oriented approach2,12,69,92. Task-oriented AI models, built through techniques such as model distillation, weak supervision, and modular training, represent a viable and scalable alternative. These optimized systems can support the prediction of key biomarkers, risk stratification, and surrogate molecular signatures, offering a pathway to enhance diagnostic workflows and guide personalized treatment decisions (Fig. 1). Future progress will depend on the quality of the datasets, validation across institutions, and collaboration between computational scientists and pathology teams93,94,95,96. Regulatory clarity and clinical trust are essential to ensure the safe deployment and widespread adoption of AI technologies. Ultimately, by aligning AI development with the specific needs of oncology, the field can progress beyond proof-of-concept stages toward real-world impact, delivering accessible, explainable, and clinically meaningful innovations in breast cancer diagnostics.

Fig. 1: Evolving AI paradigms from “big data” to precision pathology in breast cancer.

The figure illustrates the evolution from large foundation models, often characterized by opaque “black-box” behavior, high complexity, and limited generalizability in real-world settings, toward the current focus on more task-oriented, clinically optimized AI systems. The latter are designed around specific diagnostic or predictive objectives, offering clinically interpretable outputs and seamless integration with laboratory information systems (LIS). These models enable more reliable and scalable approaches to the prediction of biomarkers such as HER2-low, Ki-67, PIK3CA, ESR1, and gBRCA. Emerging strategies center on whole-slide image (WSI) analysis, integration with clinical metadata, weak or unsupervised learning, and modular training to enhance real-world performance.