Introduction

Hormone receptor-positive disease without HER2 overexpression or amplification (HR + /HER2-) is the most common subtype of early breast cancer (EBC), accounting for approximately 70% of diagnoses1. A major challenge in the management of this disease has been identifying the cancers for which adjuvant chemotherapy meaningfully reduces the risk of recurrence. Risk stratification of HR + /HER2- EBC relies upon the integration of traditional clinicopathological features (e.g., tumor size, nodal status, Nottingham grade) with multigene assays to estimate the risk of recurrence and personalize adjuvant therapy. Among the commercially available assays, Oncotype DX (ODX) ® (Exact Sciences, Madison, WI) is the most extensively validated and widely used in clinical practice. By measuring the transcriptional abundance of 16 genes, including ESR1, PGR, HER2, MKI67, and MMP11, against the abundance of five reference genes using reverse transcription quantitative real-time PCR2, ODX calculates a recurrence score (RS) ranging from zero to 100 with both prognostic and predictive value2,3,4,5,6,7,8,9.

Substantial clinical evidence from retrospective and prospective trials has shown that ODX can improve clinical decision-making in breast cancer. Retrospective analyses of the NSABP B142 and TransATAC5 trials demonstrated the prognostic value of ODX in stratifying the risk of recurrence for HR + /HER2- EBC patients. Similarly, analyses of the NSABP B206 and SWOG88147 clinical trials established the predictive value of ODX by uncovering a survival benefit with the addition of adjuvant chemotherapy to endocrine therapy for patients with a high risk of disease relapse. These studies provided the rationale for the prospective evaluation of ODX in the TAILORx8 (>10,000 patients with node-negative disease) and RxPONDER9 (5083 patients with one to three positive lymph nodes) trials and established ODX as the preferred genomic assay for adjuvant treatment-decision-making in HR + /HER2- EBC10,11.

While guidelines have recommended the use of ODX or other assays for more than a decade10,11,12, reimbursement restrictions and global accessibility barriers have limited universal adoption13. Beyond the United States, the cost of around 4000 USD per sample14,15 and turnaround time delaying the start of therapy have created barriers to adoption, despite analyses indicating downstream savings from more tailored adjuvant therapy16. Some efforts have been undertaken to develop nomograms based on clinical and pathologic features annotated during the standard of care, aiming to predict ODX scores17. However, such tools require manual extraction of relevant inputs from the unstructured electronic healthcare record and leave room for improvement in terms of performance, with the assay itself still providing greater cost-effectiveness than these tools16.

We investigated the use of whole-slide images (WSIs) from routinely available formalin-fixed paraffin-embedded (FFPE) tissue slides stained with hematoxylin and eosin (H&E) to predict RS. As previous studies have demonstrated, these slides can be effectively analyzed using deep learning algorithms to predict relapse risk18,19,20,21,22,23,24,25,26. Such algorithms have already been approved for colorectal cancer27,28 in Europe, though their widespread adoption is yet to be realized. One possible reason for this delay could be the limited clinical validation against the standard of care29. However, the field of deep learning is progressing rapidly. Recently, two techniques have markedly enhanced system performance: transformers and self-supervised learning30 (SSL). Furthermore, recent studies have shown that integrating histopathologic imaging with additional modalities, such as genomics, text, and clinical imaging, uncovers intermodal relationships and often improves predictive performance31,32,33,34.

In this study, we develop Orpheus, a multimodal deep learning model to infer the ODX RS from H&E-stained whole-slide images and validate it across three independent patient cohorts for the identification of high-risk patients. Moreover, a head-to-head comparison is made between the Orpheus and ODX RS to identify patients with documented metastatic recurrence. This work advances a multimodal machine learning paradigm in precision oncology, applies it to accurately infer the ODX RS from routine histopathology images, and outperforms the ODX RS in identifying the risk of metastatic recurrence in patients with low ODX RS. This study has the potential to extend access to the established and exploratory applications of the well-validated Recurrence Score and to refine sub-stratification for patients treated using the current standard paradigm.

Results

Data assembly

We assembled three independent cohorts comprising 6172 patients with HR + /HER2- EBC with surgically resected primary tumors (Fig. 1a). Tissue samples were subjected to H&E staining and immunohistochemical (IHC) analysis for hormone receptors and HER2 according to ASCO/CAP guidelines35,36, and samples were submitted for calculation of RS per clinical practice. For a subset, genomic data from clinical MSK-IMPACT targeted sequencing were also available (Fig. 1b). These derivative data were subsequently digitized (Fig. 1c) and used for multimodal modeling (Fig. 1d). We curated a retrospective cohort of 5145 (Fig. 1e) patients with HR + /HER2- EBC (MSK-BRCA; Fig. 1a; Supp. Figure 1) for model training, validation, and testing, whose primary tumors had H&E-stained FFPE tissue specimens available, textual pathology reports, and targeted panel sequencing for a subset (n = 481; Fig. 1b). We allocated these patients a priori into either a withheld test set (20%) or a set used for training and validation (80%; Supp. Tab. 1). Moreover, we assembled two additional independent cohorts of WSIs derived from patients with HR+/HER2- EBC, IEO-BRCA (452 patients) and MDX-BRCA37 (575 patients), for external validation. The patient’s age, sex, and race are reported in Supp. Tab. 2. A patient is considered high-risk with a molecular RS > 25, following TAILORx8.

Fig. 1: Developing a multimodal transformer model for breast cancer risk.
Fig. 1: Developing a multimodal transformer model for breast cancer risk.
Full size image

Early-stage breast tumors are a resected, b profiled histologically, c digitized, and d used for downstream modeling of recurrence risk. e Number of pathologic slides, pathology reports, and patients included in each split. f Histogram depicting number of slides with a given number of tiles. g Tissue detection, tessellation, transformer-based modeling of CTransPath-derived tile embeddings, pathology report scraping, tokenization and transformer-based modeling, nuclear segmentation for interpretation, tensor fusion for multimodal integration. Created in BioRender. Boehm, K. (2025) https://biorender.com/s94u717.

Model training

We developed a transformer model to directly regress the ODX RS from WSIs of EBC. To train this architecture, we employed a two-step process. First, we projected each slide’s tissue-containing tiles (Fig. 1f) into an informative space using a frozen model trained using SSL on over 30,000 slides (Fig. 1g)38. Subsequently, we adapted a transformer architecture39, which was previously validated in a large multicenter study of colorectal cancer40, to map the phenotypic-genotypic correlation between the extracted features and the ODX RS (Fig. 1g). The unimodal and multimodal models were trained to regress RS as a continuous variable (Fig. 1g).

Deep learning infers recurrence risk score from whole-slide images

First, we develop and test the WSI-based model across the three cohorts to measure the generalizability of its performance. In the withheld MSK-BRCA test set, the unimodal WSI-based model achieved a Pearson correlation of 0.60 (95% C.I. 0.55–0.65, p < 10-4; Supp. Fig. 2a) and concordance correlation coefficient (CCC) of 0.57 (95% C.I. 0.52–0.62), along with the area under the precision-recall curve (AUPRC) of 0.55 (95% C.I. 0.47–0.64; Supp. Fig. 2d) and area under the receiver operating characteristic curve (AUROC) of 0.85 (95% C.I. 0.81–0.88; Fig. 2a) for high-risk disease. In the external IEO-BRCA test set, the same model achieved a Pearson correlation of 0.60 (95% C.I. 0.55–0.65; p < 10-4; Supp. Fig. 2b) and CCC of 0.58 (95% C.I. 0.52–0.63; Fig. 4b) along with AUPRC of 0.69 (95% C.I. 0.61–0.76; Supp. Fig. 2e) and AUROC of 0.81 (95% C.I. 0.77–0.85; Fig. 2b). In the external MDX-BRCA test set, which used an inferred, ODX-like RS (see Methods), the same model achieved a Pearson correlation of 0.58 (95% C.I. 0.53–0.63; p < 10-4; Supp. Figure 2c) and CCC of 0.40 (95% C.I. 0.35–0.45) along with AUPRC of 0.71 (95% C.I. 0.65–0.78; Supp. Fig. 2f) and AUROC of 0.80 (95% C.I. 0.76–0.84; Fig. 2c). Full results are detailed in the other panels of Supp. Figs. 2, 3 and in Supp. Data 1. In summary, the WSI-based model robustly infers RS and accurately identifies high-risk diseases across three test cohorts derived from different medical centers and countries.

Fig. 2: Orpheus performance for TAILORx risk stratification.
Fig. 2: Orpheus performance for TAILORx risk stratification.
Full size image

ac The whole-slide image (WSI)-based model reliably identifies high-risk disease (RS > 25) as defined by TAILORx across the MSK-BRCA (n = 1029), IEO-BRCA (n = 452), and MDX-BRCA (n = 572) test cohorts. d The multimodal model outperforms the WSI- and text-based unimodal models. Error bars by 1000-fold bootstrapping. e, f The multimodal model outperforms a clinicogenomic nomogram in identifying high-risk diseases. g Calibration plot and h predicted score frequencies. All results are shown for test sets.

Deep learning infers recurrence risk scores from text-based reports

Second, we develop a unimodal text report-based model that achieves a Pearson correlation of 0.59 (95% C.I. 0.53–0.65, p < 10-4) and CCC of 0.53 (95% C.I. 0.47–0.58; Supp. Fig. 4c) along with AUPRC of 0.53 (95% C.I. 0.45 - 0.61; Supp. Fig. 4f) and AUROC of 0.81 (95% C.I. 0.76–0.85; Supp. Fig. 4i) in the MSK-BRCA test set. Full results are detailed in Supp. Fig. 4 and Supp. Data 1. For the subset of male patients in the MSK-BRCA test set, the model demonstrates comparable performance (Supp. Tab. 3).

Multimodal integration of images and text improves recurrence risk score prediction

Finally, we evaluate whether the multimodal model integrating WSIs and text-based reports, Orpheus, improves predictive performance relative to the unimodal models. In the MSK-BRCA test set, the multimodal model achieved a Pearson correlation of 0.70 (Supp. Fig. 5a; 95% C.I. 0.65–0.74, p < 10-4) and CCC of 0.67 (95% C.I. 0.62–0.72). For classification of high-risk (RS > 25) disease, the AUPRC was 0.65 (p < 10-4; 95% C.I. 0.57–0.72), with a macro-averaged F1 score of 0.75 (Supp. Figure 5b). The CCC and Pearson’s correlation based on multimodal scores were higher than those based on unimodal scores (Fig. 2d; Supp. Fig. 5e). The AUROC was 0.88 (Supp. Fig. 5d; 95% C.I. 0.86–0.91, p < 10-4). A confusion matrix for the withheld test set is depicted in Supp. Fig. 5e, showing minimal misclassification between extreme categories, with moderate errors between intermediate and extreme categories (p < 10-4).

Multimodal recurrence risk model outperforms clinicopathologic nomogram

Next, we compare Orpheus to the state-of-the-art nomogram17 for predicting the ODX RS. Specifically, we analyzed the subset of the MSK-BRCA test set with available tumor grades and IHC-derived HR status in the text report as extracted by regular expressions. We compared the ability to discriminate high-risk disease of a nomogram based on clinical and pathologist-annotated features17 to that of the Orpheus, the multimodal (Fig. 2e; Supp. Fig. 6a), text-based (Supp. Fig. 6b, d), and whole-slide image-based (Supp. Fig. 6c, e) model. Orpheus achieved an AUROC of 0.89 and AUPRC of 0.67 (95% C.I. 0.580.75), the vision model achieved an AUPRC of 0.55 (95% C.I. 0.46–0.65), and the language model achieved an AUPRC of 0.57 (95% C.I. 0.48–0.66). By comparison, the nomogram17 achieved an AUPRC of 0.40 (95% C.I. 0.32–0.50). For the multimodal model, we suggest an operating point of 31.8 with 93% precision and 23% recall (Fig. 2f). The model was well calibrated for risk stratification based on >25 as a high-risk threshold, with most predicted scores being a low-risk (Fig. 2g, h).

Orpheus outperforms molecular risk score to identify patients with recurrent disease

Using 6814 hand-annotated cases, we tuned a large language model (see “Methods”) to infer recurrences from the electronic medical record, achieving an accuracy of 0.96 (Supp. Fig. 7). By analyzing cases with at least two years of follow-up and ODX RS values ≤ 25, we first used linear models to develop Orpheus+ models to infer risk of distant recurrence from H&E-stained WSIs. In the combined MSK-BRCA test and validation set, Orpheus+ ascertained the risk of distant recurrence with an AUROC of 0.77 (95% C.I. 0.68, 0.85)). This was superior to the Oncotype DX ® RS itself, which was uninformative in the RS ≤ 25 cohort (AUROC = 0.51 (95% C.I. 0.41, 0.61)). Scores differed significantly for cases with or without metastatic recurrence for Orpheus+ (p ≤ 1e-4), but not for ODX RS (p > 0.05; Fig. 3a).

Fig. 3: Orpheus+ performance for identifying distant recurrence.
Fig. 3: Orpheus+ performance for identifying distant recurrence.
Full size image

a, b Orpheus+ risk scores versus Oncotype DX ® (ODX, N = 1464) and Multiplex DX ® (MDX, N = 575) recurrence score (RS) values for cases with or without distant recurrence. p-values by Mann–Whitney-Wilcoxon test, two-sided. c Time-dependent areas under the receiver operating characteristic curve for Orpheus+ and ODX RS scores against recurrence in the MSK-BRCA test set for cases with ODX RS ≤ 25, with associated calibration plots. The mean value represents AUC over 2–8 years. d Receiver operating characteristic curves for Orpheus+ and MDX RS scores against recurrence in the MDX-BRCA test/validation set, with associated calibration plots. **** denotes p ≤ 1e−4, *** denotes p ≤ 1e-3, ** denotes p ≤ 1e-2, and * denotes – ≤ 5e-2. In box plots, boxes denote 25th–75th percentiles with lines at the median, whiskers denote the range without outliers, and individual points denote outliers. Exact p-values, top to bottom, for a are 5e-7 and 0.80, and for b, are 0.0013 and 0.0011.

Next, we calculated the dynamic time-dependent AUROC41 for Orpheus+ in a time-frame from two to eight years after surgery, accounting for censoring events to capture time-to-event dynamics encompassing the clinically relevant 5-year prediction window. Orpheus+ achieved a mean time-dependent AUROC of 0.75, whereas ODX RS attained a value of 0.49 (Fig. 3c). In the MDX-BRCA test set, we observed comparable identification of recurrence risk by Orpheus+ (AUROC = 0.64 (95% C.I. 0.56, 0.72)) to that by the Oncotype-like Multiplex DX ® laboratory assay risk score (AUROC = 0.64 (95% C.I. 0.57, 0.70)) itself (Fig. 3b, d). These results show the potential of Orpheus to improve the identification of recurrence risk for patients classified as low risk by TAILORx categories. Considering patients with any ODX RS (rather than only ≤25), the Orpheus+ performance is preserved, with a mean time-dependent AUROC of 0.72, and ODX RS achieves a value of 0.59 (Supp. Fig. 8a). Considering only patients in the MSK-BRCA test set (rather than the combined test and validation set) for cases with ODX RS ≤ 25, the Orpheus+ performance is preserved, with a mean time-dependent AUROC of 0.88, and ODX RS achieves a value of 0.50 (Supp. Fig. 8a). The characteristics of the distant recurrences in the test set are shown in Supp. Data 2. For full results, see Supp. Fig. 8.

Multimodal interpretability shows concordance with biological processes

Next, we sought to understand the decision-making process of the model’s predictions using Orpheus’ innate interpretability mechanisms. Specifically, using the attention mechanism of the image and text transformers of Orpheus, we visualized the importance of features within the WSIs and pathology reports. Visualizing the attention of the slide (Fig. 4a), the model designates most tiles as background with low-attention scores (Fig. 4b). Higher-attention tiles tend to contain invasive and in situ carcinoma compared to lower-attention tiles, which are more likely to contain fat and stroma (Fig. 4c; Supp. Fig. 9). Analogously to the tiles, the importance of word tokens comprising the synoptic pathology report can be analyzed (Supp. Fig. 10a). For the attention of the reports, analysis shows that words around immunohistochemical analyses for estrogen and progesterone receptors and lymphovascular invasion tend to have highest mean relative attention, alongside punctuation and descriptions of Nottingham grade (Supp. Fig. 10b). Ablating grade or progesterone receptor status from the text report decremented performance in the test set (Supp. Fig. 11). Analyzing the latent space of the learned embeddings of the trained models reveals separation by histologic grade (Supp. Fig. 12a) and progesterone receptor expression (Supp. Fig. 12b) in the MSK-BRCA test set, with the gradients appearing along a learned, lyre-shaped manifold for the multimodal model. The same was observed for the ODX RS itself (Supp. Fig. 12c). We further tested the association of predicted multimodal scores with genomic features. Limiting to cases with MSK-IMPACT, predicted RS was higher for tumors with TP53 mutations, MYC amplifications, PIK3CA amplifications, and BRCA2 mutations (Supp. Fig. 12d–g), and it trended slightly higher for specimens with a greater fraction of genome altered (Supp. Fig. 12h). In the test set alone, the relationships for TP53 and a fraction of genome altered persisted (Supp. Fig. 13).

Fig. 4: Cellular and transcriptomic correlates of risk.
Fig. 4: Cellular and transcriptomic correlates of risk.
Full size image

a Saliency map of contributory foreground tiles, with one tile-attention value pair denoted by the arrows and predicted score. b Histogram of tile-attention values. c The five highest- and lowest-attention tiles at greater magnification. d Association of cellular features with high- and low-risk tissue. Hypothesis testing was performed with the two-sided Mann–Whitney U-test with corrections for multiple testing. e High and f low relative abundance of inflammatory cells. g High and h low standard deviation of neoplastic nuclear area. il Quantification of stromal fraction (SF), tumor cell proliferation (Prolif), lymphocyte infiltrating signature score (LISS), and lymphocyte fraction (LF) for predicted low- and high-risk patients (50 each) depicted in blue and orange, respectively, in the MSK-BRCA cohort. p-values are generated using an independent two-sided t-test. In box plots, boxes denote 25th–75th percentiles, whiskers denote the range without outliers, and individual points denote outliers. Scale bars denote 64 µm.

Orpheus provides histologic characterization of high-risk disease

To further explore the model’s capability of correlating histologic features with ODX RS, we analyze the most-attended tiles42 for high- and low-risk disease. The nuclei of these tiles were segmented43, and derivative features of cell type proportions and cellular morphology were tabulated (Fig. 4d). This reveals a relative abundance of inflammatory cells (Fig. 4e, f) and neoplastic cells along with the standard deviation of the neoplastic nuclear area (Fig. 4g, h) as some of the features differ significantly between the groups. Analysis of the tumor microenvironment reveals that high-risk disease exhibited greater stromal fraction (p < 10-4, n = 100) (Fig. 4i), tumor cell proliferation (p < 10-4, n = 100) (Fig. 4j), lymphocyte infiltration signature (p = 2 × 10-3, n = 100) (Fig. 4k), and leukocyte fraction (p < 10-4, n = 100) (Fig. 4l). Extending the tumor microenvironment analysis to external cohorts corroborated these results, especially for tumor cell proliferation which exhibited a significant difference between the predicted high- and low-risk disease patients (p < 10-4, n = 100) in all three cohorts (Supp. Fig. 14). As a further study of differences, we also trained a generative model to synthesize fields of view for informative tiles for high- and low-risk disease (Supp. Fig. 15). Tiles conditioned on the high-risk class depict confluent clusters of tumor cells with moderate to marked nuclear pleomorphism and prominent nucleoli, and tiles conditioned on the low-risk class depicted trabeculae and clusters of tumor cells with moderate nuclear pleomorphism and inconspicuous nucleoli. Tiles conditioned on the background class depicted stroma without epithelial cells.

Orpheus as a conceptual triaging tool for low- and high-risk disease

We test the utility of Orpheus as a pre-screening tool to reduce the load of laboratory testing for breast cancer recurrence risk in clinical workflows. First, we conducted a sensitivity analysis to evaluate the performance of the predicted recurrence risk score in identifying low-risk patients, defined as those with a risk recurrence score <11. The analysis yielded a sensitivity for the test set of MSK-BRCA (n = 1029, 16% predicted low-risk), IEO-BRCA (n = 452, 6% predicted low-risk), and MDX-BRCA (n = 572, 1% predicted low-risk), of 0.90 (Supp. Fig. 16a), 0.96 (Supp. Fig. 16b) and 0.99 (Supp. Fig. 16c), respectively, for the low-risk subgroup. Second, we conducted a specificity analysis to evaluate the performance of the predicted recurrence risk score in identifying high-risk patients, defined as those with a risk recurrence score >25 and who are most likely to benefit from adjuvant chemotherapy8. This resulted in a specificity of 0.94 (Supp. Fig. 16d), 0.68 (Supp. Fig. 16e), and 0.62 (Supp. Fig. 16f) to predict the high-risk subgroup for the test set of MSK-BRCA (n = 1029, 11% predicted high-risk), IEO-BRCA (n = 452, 45% predicted high-risk) and MDX-BRCA (n = 572, 61% predicted high-risk), respectively. Next, we repeated the analyses stratified by age, specifically patients above 50 years of age (Supp. Fig. 17) and patients below or equal to 50 years of age (Supp. Fig. 18), and nodal status, with similar performance metrics in all cohorts regardless of age and nodal status following the TAILORx risk groups8 (Supp. Data 3). Finally, we analyzed the model’s performance on the intermediate-risk (RS 11–25) subgroup using the AUROC, Cohen’s Kappa, F1 score, accuracy, and Matthew’s Correlation Coefficient (Supp. Tab. 4), utilizing additional clinically relevant thresholds of 10, 15 and 25 to binarize the risk predictions8. When observing the model’s performance, specifically on the intermediate-risk group (RS 11–25), we observe a substantial decrease in all metrics compared to all risk groups (RS 0–100).

In summary, Orpheus accurately identifies patients with high-risk diseases, as defined by TAILORx, with a high degree of confidence. The model shows potential to guide adjuvant chemotherapy decisions without the need for multigene assay testing (Fig. 5). Specifically, adjuvant chemotherapy could be selectively recommended for a subset of patients classified as high-risk with high confidence. This approach could streamline treatment decisions and reduce the need for additional testing, ultimately improving patient care and resource allocation. Furthermore, for patients who are treated per TAILORx with adjuvant chemoendocrine or endocrine therapy based on ODX RS results, Orpheus identifies distant metastatic recurrences more accurately than ODX RS itself in the test set. With further validation, this prognostic value has the potential to refine patient selection for personalization of adjuvant treatments and follow-up strategies.

Fig. 5: Potential clinical use case of the Orpheus recurrence risk prediction model.
Fig. 5: Potential clinical use case of the Orpheus recurrence risk prediction model.
Full size image

The Orpheus multimodal prediction model for recurrence risk prediction is potentially capable of guiding decision-making for adjuvant cytotoxic chemotherapy alongside adjuvant endocrine therapy for predicted low- and high-risk patients. The model is within scope for early-stage hormone receptor-positive (HR+) and HER2- breast cancer patients. Created in BioRender. Marra, A. (2025) https://BioRender.com/l96q019.

Discussion

The ODX RS is among the most extensively validated predictive and prognostic biomarkers, and given this widespread application, new applications continue to be developed. Our model, Orpheus, provides a way to approximate the continuous RS from routine H&E WSIs, extending the ODX RS and its myriad applications to centers where it is not feasible financially or logistically to order the laboratory assay itself13. In one example application, Orpheus can precisely identify approximately one-quarter of patients with high-risk disease as defined by TAILORx without the need for ODX RS, with superior discrimination of this class compared to a state-of-the-art nomogram, which integrates clinicopathologic features such as IHC-derived progesterone/estrogen receptor positivity, tumor size, lobular versus ductal histology, Nottingham grade, and age17. This would potentially enable physicians to forgo molecular testing in selected cases. Orpheus has the added advantage of not requiring manual curation of these features from the healthcare record. By inferring the continuous RS rather than identifying risk categories, Orpheus further enables emerging applications that tools such as the nomogram would not support, such as identification of risk of local recurrence, clinical trial eligibility44,45, or defining populations that will benefit from the use of neoadjuvant systemic therapies beyond the currently used clinical characteristics46. We further show that this correlation with RS corresponds to meaningful prognostication: for patients who are treated per TAILORx with adjuvant chemoendocrine or endocrine therapy based on ODX RS results, Orpheus identifies distant metastatic recurrences more accurately than ODX RS itself in the test set. With further validation, this prognostic value has implications ranging from tailoring the frequency of surveillance imaging and use in patient decision-making around treatment escalation and compliance with adjuvant endocrine therapy.

Moreover, the findings that multimodal approaches significantly outperform unimodal models further strengthen the broader perspective32,47,48 that integrating multimodal real-world data is a promising direction for AI in oncology. Orpheus is a flexible machine learning framework building on validated unimodal transformer-based architectures and data integration paradigms from the field of sentiment analysis, and the lightweight framework flexibly accepts tokens from the rapidly evolving foundation models for subsequent integration, allowing machine learning practitioners to harness the representative power of deep learning with small cohorts comprising only thousands of patients.

From a biological perspective, our study corroborates via orthogonal histopathologic and inferred transcriptomic analyses that greater immune infiltration portends higher-risk disease. This finding is in accordance with prior studies that tumor-infiltrating lymphocytes are a negative prognostic factor in HR + /HER2- EBC and may be associated with higher RS49,50,51,52. By estimating transcriptomic programs from images using our validated model53, proliferation was also found to be higher in our analysis of patients with predicted high risk, correlating with grade, the MKI67 gene included in the calculation of the RS, the TP53 mutations, and perhaps explaining the empiric association of more heterogeneous areas and perimeters of cancer cells with higher-risk disease. Finally, the greater inferred stromal fraction for higher-risk disease is possibly related to cancer associated fibroblasts and provides support for this line of inquiry54,55 Together, these findings show that deep learning interpretability is greatly improved by orthogonal molecular and cell-level data, which in turn can yield hypothesis driven insights for biological discovery. Future work should include spatial transcriptomic data for direct characterization of clonal heterogeneity and immune programming.

Taken together, this study advances an improved platform to approximate the ODX RS from routine histopathologic WSIs, outperforming a leading existing method in the identification of high-risk disease and—critically—identifying metastatic recurrences for cases with low ODX RS values more accurately than the ODX RS itself. The multimodal model improves performance when pathology text reports are available and is a lightweight and flexible machine learning architecture suitable for application to biomarkers for other histologies. The orthogonal histopathologic and transcriptomic analyses corroborate proliferation and tumor-infiltrating lymphocytes as markers of higher risk. Clinically, Orpheus has the potential to both expand access to precision medicine through ODX RS approximation and enhance its efficacy by identifying patients at risk of distant metastatic recurrence, even among those deemed low-risk by TAILORx-guided treatment.

Methods

Ethics statement

This study complies with all relevant ethical regulations and was approved by the Institutional Review Boards of Memorial Sloan Kettering Cancer Center and IEO Milan, along with the Ethics Commission of the Medical Faculty of the Technical University Dresden. For targeted clinical sequencing, patients provided informed consent for enrollment on a prospective protocol as overseen by the Memorial Sloan Kettering Cancer Center Institutional Review Board. Otherwise, informed consent was waived for this retrospective study, and participants were not compensated.

Statistics and reproducibility

This study was conducted retrospectively. Therefore, no statistical method was used to predetermine sample size, the experiments were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment. The data is split into a training and validation cohort for training of Orpheus, MSK-BRCA, and external validation cohorts, IEO-BRCA and MDX-BRCA. Model training included both male and female cases irrespective of age and nodal status, followed by stratified analyses to evaluate performance in each subgroup. Sex was determined based on sex assigned at birth reported by the institutional database. Before any analysis, the MSK-BRCA cohort was split into a training/validation set and a withheld test set using 80% and 20% of patient IDs, respectively. The model evaluation consisted of two steps: first, comparing Orpheus’s predicted risk scores against OncotypeDX recurrence scores using correlation metrics and risk group classification performance; second, assessing both Orpheus and ODX RS predictions for distant metastatic recurrence through calibration and time-dependent analyses. Unless otherwise specified, we consistently define high-risk as scores >25 and low-risk as scores ≤25, following TAILORx thresholds. This standardized cut-off applies to both Orpheus and OncotypeDX recurrence scores, independent of age or nodal status, throughout the entire manuscript for direct comparability between the predicted Orpheus scores and OncotypeDX recurrence scores. Additional statistical and experimental details, including sample inclusion and exclusion criteria, are provided in the subsequent cohort curation, model training, and model evaluation sections. For Fig. 4c, micrographs were selected by attention scores of one random slide. For Fig. 4e–h, representative fields of view with high or low values of the feature in question were selected. The experiments were not repeated to generate different micrographs.

Cohort curation

For the MSK cohort, cases from 2013-2020 were selected according to Supp. Fig. 1 for this retrospective analysis. All cases were pathologically confirmed HR+/HER2- invasive breast carcinoma without distinction by specific histologic subtype. Pathology reports, and slides were joined by the surgical pathology part number rather than case or block. The synoptic pathology text report for the part used to calculate the RS was included, with fields such as histologic subtype, HR and HER2 IHC percent positivity, histologic grade, anatomic site, and DCIS and LCIS. Oncotype DX results were recorded manually from the healthcare record by a medical oncologist. Regular expressions were used to extract progesterone, estrogen, and HER2 receptor percentage positivity from the pathology text reports. For the external validation cohorts, the cohort from the European Institute of Oncology (IEO-BRCA, Milan, Italy) contained a total of 456 early-stage breast cancer patients who received the official Oncotype DX test. Only histopathology slides and corresponding clinicopathological variables were available for analysis. After filtering down patients based on histology slide availability, 452 patients in the IEO-BRCA cohort were available for external validation. The cohort from MultiplexDX (MDX-BRCA, Bratislava, Slovakia) contained a total of 1013 early-stage breast cancer patients originally obtained for a retrospective study37 from Biobank Graz of the Medical University of Graz, (n = 390, Graz, Austria) and PATH Biobank (n = 592, Munich, Germany), The Biomedical Research Institute of Málaga (n = 27, IBIMA-CIMES-UMA, Malaga, Spain), and a commercial company (n = 4, AMBIO). Histopathology slides, corresponding clinicopathological variables, and research-based Oncotype DX scores derived from RNA-sequencing were available for analysis. After filtering down patients based on histology slide availability and nodal, ER, PR, and HER2 status, which would have been eligible for Oncotype DX, 575 patients in the MDX-BRCA cohort remained for external validation. The research-based scores were calculated using the GeneFu Bioconductor package56, based on the original algorithm to calculate the OncotypeDX score2. Because research-based versions of OncotypeDX use different data inputs (e.g., microarray/RNA-seq) compared to the official OncotypeDX (e.g., RT-qPCR), this may result in scaling effects when comparing research-based scores with official scores, as demonstrated by the OPTIMA trial group57. Consequently, we used their outlined approach57, where they provide a linear equation that models the relationship between research-based versus true OncotypeDX, to rescale the research-based recurrence score into a more realistic range. All included cases were hormone receptor-positive and HER2-negative as defined by the American Society of Clinical Oncology/College of American Pathologists clinical practice guidelines35,36. Experienced breast cancer pathologists from each Institution reviewed the case to confirm the diagnosis of invasive breast cancer and receptor status.

Training setup of vision, language, and multimodal models

For vision model training, images were preprocessed using STAMP with 1.14 microns per pixel, a tile edge length of 224 pixels, and Macenko normalization58. Tiles were embedded using CTransPath38. After a fully connected layer to project the CTransPath-derived tokens into 512-dimensional space and a ReLU, two PyTorch TransformerEncoderLayers with dimensionality 512 and eight heads were stacked before a final LayerNorm and projection to scalar space. No activation function was used, and no positional encodings were used. For training, a maximum learning rate of 2e-5 with a linear warmup of 1000 steps, learning rate decay by a factor of 0.9999 every step, and L2 decay of 2e-5 were used. A batch size of one slide per GPU across two GPUs with accumulated gradients over four batches was used, with gradients clipped at 0.5. The model was trained for up to 50 epochs with early stopping. This was implemented in PyTorch-Lightning and ran on two NVIDIA Tesla V100 GPUs (CUDA 12.1) on a cluster running Linux. Metrics were tracked using Weights and Biases, and the model with the lowest validation mean squared error was chosen for downstream use. During inference, attention rollout42. was used to attribute attention to each input token, and multiple predictions with dropout enabled were used to estimate confidence intervals. Regression was used instead of classification because classification discards information, and it substantially outperformed classification in our validation set53. For cases with multiple slides corresponding to the same pathologic specimen, all the tiles were bagged prior to any transformer-based analysis.

For language model training, the HuggingFace model tsantos/PathologyBERT was tuned using the HuggingFace BertForSequenceClassification, Trainer, and AutoTokenizer with a batch size of eight-part descriptions per GPU, four Nvidia Tesla V100 GPUs, four gradient accumulations per backprop, a learning rate of 2e-5, L2 decay of 0.01, and ten training epochs. Prior to tokenization, the text corresponding to the part used to measure the Oncotype score was extracted using regular expressions. Addenda, when available, was concatenated to the part description. Names, initials, and the most common logistical comments were removed prior to tokenization. The tsantos/PathologyBERT tokenizer was not modified. The model with the lowest validation loss was chosen for downstream use. During inference, attention rollout was used to attribute attention to each input token, and multiple predictions with dropout enabled were used to estimate confidence intervals.

For multimodal model training, multiple architectures, including simple concatenation of embeddings before dense layers, attention-based integration of unimodal embeddings, or averaging of scores, were explored using the validation set. The final model chosen took the pre-computed unimodal embeddings as input, projected them into 96-dimensional space, performed tensor fusion by prepending unity and taking the Cartesian product59, applied 30% dropout, and passed through a small 96-dimensional regression head to yield a scalar regression score. Finally, linear regression trained on the training set was used to calibrate the weight of the unimodal and multimodal scores to yield a final score. Multiple predictions with dropout enabled were used to estimate confidence intervals. UMAP plots were generated using the Python umap software package with 10 neighbors and min_dist 0.5 for all plots fit on the training set, and only test sets are shown.

Model evaluation

When multiple slides were available for a single measured score, and all contained relevant tissue as per preprocessing, the relevant tissue tiles were bagged prior to inference or training by the vision model. Models were evaluated primarily by Pearson correlation and associated significance and concordance correlation coefficient60. 95% confidence intervals were calculated using bootstrapping (random sampling with replacement) 1000 times. Areas under the precision-recall and receiver operating characteristic curve were calculated using binary thresholding of high- and low-risk disease. Significance was established using 1000-fold permutation tests. We analyzed operating points on the precision-recall curve by varying the threshold from greatest to lowest and tabulating the respective precision and recall for each value. F1 scores were calculated using the weighted average. Confusion matrices were established using the three risk categories (<11, 11–25, >25), and significance was established using McNemar’s test of homogeneity. To further evaluate the performance of our model, we utilized the AUROC, F1 score, accuracy, Cohen’s Kappa, and Matthew’s Correlation Coefficient (MCC). These metrics were assessed across all risk groups (0–100) and specifically within the intermediate-risk group (11–25). To investigate the model’s performance using various clinically-defined thresholds in Supp. Data 3, we applied binary risk stratification (low versus high) at cut-off values of 10, 15, and 25, analyzing both the entire population and subgroups stratified by age and nodal status. This approach allowed us to thoroughly examine the model’s predictive capabilities across populations with different definitions of risk and to evaluate its performance within the clinically challenging intermediate-risk group. To compare Orpheus against the state-of-the-art clinical nomogram, logistic regression for high-risk disease was performed using the formula established by Orucevic et al.17, including age, histological classification, tumor grade, PR regression status, and tumor size, with corresponding coefficients and an intercept term. The features were extracted using regular expressions from the pathology report. Reports with failed extraction (e.g., due to the absence of hormone receptor annotation in the pathology report or unusual formatting precluding the extraction of tumor size) were excluded from the comparison analysis. Precision-recall analyses were performed using the predicted score [0, 100], in the case of our transformer regression, and the logistic regression score [0, 1] in the case of the nomogram’s logistic regression formula. Summarizing, we evaluated Orpheus through a two-pronged approach. First, we assessed its ability to predict ODX RS values by comparing sample-level continuous risk predictions using Pearson correlation and concordance correlation coefficient. We then evaluated its performance in classifying patients into TAILORx risk groups, measuring agreement with the gold-standard RS categorization through multiple metrics: AUROC, AUPRC, F1 score, accuracy, Cohen’s Kappa, and Matthew’s Correlation Coefficient. In the second step, we compared both Orpheus and RS against actual clinical outcomes, specifically distant metastatic recurrence. This comparison utilized calibration plots and time-dependent AUROCs, which account for censoring events to capture time-to-event dynamics beyond the clinically relevant 5-year prediction window.

Patient outcome modeling

We developed a method to identify primary breast cancer patients with metastatic recurrence from a combination of medical oncology notes, radiology notes, tumor marker lab values and internal referral data using natural language processing (NLP) and machine learning methods. Briefly, a pretrained Clinical longformer model pretrained on MIMIC III data (https://huggingface.co/yikuan8/Clinical-Longformer) was selected to predict metastatic recurrence from medical oncology notes. Our model was finetuned on a note dataset consisting of the note closest to the follow-up date of the patient and a random sample of three notes following the date of surgery for N = 6814 patients with disease status labels curated from the Breast Disease Management Team at Memorial Sloan Kettering. Notes prior to the date of recurrence were labeled ‘early’ while notes following the date of local and metastatic recurrence were labeled ‘local’ and ‘metastatic’ respectively. On a held-out dataset of N = 1363 primary breast cancer patients, the finetuned Clinical Longformer model was tested. Ensembling was performed: specifically, the local and metastatic probabilities of the NLP model were further included as features of a Random Forest model that also included outputs from a metastatic sites model (Clinical BERT), progression probability (RoBERTa), positive tumor markers (CEA, CA 15-3, CA 125), and indicators of internal referral. The MDX-BRCA cases were manually annotated as distant, non-recurrent, or uncertain recurrences, where solely cases with ≥24 months follow up time were considered for these analyses. Distant metastatic recurrences in the MSK-BRCA test set were manually validated and tabulated. For the subsequent outcome modeling, bagged WSI-based embeddings were generated for each case. Using the same train/validation/test split of the MSK-BRCA dataset, a logistic regression model implemented in scikit-learn with C = 1e-5 was trained with class weights calculated by the scikit-learn compute_class_weight utility given the class imbalance. Calibration was performed using the sigmoid method with five-fold cross-validation in the training set. The model was applied to the combined MSK-BRCA test and MSK-BRCA validation sets and the MDX-BRCA test set. Orpheus+ models were trained separately for any recurrence, locoregional recurrence, and distant metastatic recurrence. Calibration curves were plotted using the corresponding scikit-learn functions, comparing predicted probabilities from Orpheus with observed outcomes, accounting for censoring events. Moreover, the time-dependent area under the receiver operating characteristic (AUROC) curves were plotted using scikit-survival with three-month bins from 24 to 96 months, accounting for censoring events to capture time-to-event dynamics beyond the clinically relevant 5-year prediction window. Mann–Whitney U-tests were used to compare inferred risk scores for cases with, and without, the type of recurrence under analysis.

Model interpretability

To interpret model predictions, we employed multiple analytical approaches: visualizing attention patterns in text reports for the language component of the model, examining nuclear features, generating tissue representations, quantifying tumor microenvironment components, and conducting gene association analyses across risk groups for the vision component of the model.

For report visualization, Django was used to colorize each token in the description by relative attention. A similar color scale in HSV space was applied to tint each tile by its relative attention, with absolute attention plotted on corresponding histograms with logarithmic counts displayed. The five tiles with the highest and lowest absolute attention are displayed for quality control by users.

For nuclear analysis, the three most informative tiles from each slide with the highest 100 and lowest 100 predicted scores were identified. HoVerNet43 with PanNuke-derived weights was used for instance segmentation, and quantitative features such as solidity, area, and perimeter were calculated for the outline of each nucleus. Within each tile, summative statistics were used to aggregate these features. The feature with the highest variance inflation factor was iteratively removed until the highest variance inflation factor was 10 or less. Comparisons for the values of these remaining features between the high- and low-risk groups were made with the two-sided Mann–Whitney U-test with corrections for multiple testing. For the volcano plot, log fold change was calculated as the base-two logarithm of the mean value for the high-risk group over the same for the low-risk group. Tiles with fewer than 50 total nuclei were excluded. For comparisons involving standard deviations, tiles with fewer than ten of the relevant cell types were excluded.

For the generation of tissue representations across the risk groups, we identified the highest-attention tiles from cases with measured recurrence scores below 11 or above 25 as chosen by an Attn-MIL model61 trained on the training set. Subsequently, using these tiles and a random sampling of low-attention tiles across the same slides, we used Studio GAN62 to train a ReACGAN architecture with big_resnet backbone, batch size of 36 per GPU across four NVIDIA Tesla V100 GPUs, and default loss parameters. The conditional architecture encoded three classes: high score, low score, and background. Spectrum plots and canvases were generated as per the default StudioGAN code.

For the quantification of the tumor microenvironment, we used a pretrained deep learning model for the quantification of the tumor microenvironment for the top 50 predicted high- and low-risk patients by the recurrence risk vision prediction model, specifically for the stromal fraction (SF) and leukocyte fraction (LF) as assessed via DNA methylation analysis, lymphocyte infiltrating signature score (LISS) and proliferation (Prolif) as measured by RNA expression63. The deep learning regression model was trained on whole-slide images from a breast cancer cohort from The Cancer Genome Atlas (TCGA) in a weakly-supervised setting53 using the open-source biomarker data from Thorsson et al.63 Statistical significance is measured by an independent t-test, indicating a difference in sample means between predicted high- and low-risk patients (p < 0.05). The scores for the tumor microenvironment quantification are inferred based on the same tile embeddings (1×768) which were used in training the vision model for the recurrence risk prediction.

To analyze the association of genetic patterns with predicted risk groups, the specimens were sequenced by MSK-IMPACT64, annotated by OncoKB65, and accessed via cBioPortal66. Only specimens attributed to a patient without other specimens were considered to avoid part-level mismatch, and specimens across the MSK-BRCA training, validation, and test set were included. Variants in the most commonly altered genes in the MSKCC Clinical Sequencing Cohort (TP53, PTEN, BRCA2, FGF4, KMT2C, PAK1, FGF19, CDH1, MYC, ARID1A, FGF3, GATA3, CBFB, CCND1, RUNX1, PIK3CA, FGFR1, NSD3, MAP3K1) were considered, provided that the variant in question (e.g., TP53 SNV) occurred in at least two of the considered samples. Passenger mutations were ignored. Using Bonferroni correction, genes associated with high- or low-risk status based on the measured, ground truth recurrence score with a significance of q = 0.05 were identified. The fraction of genome altered, tumor mutational burden, and mutation count were analyzed using the Pearson correlation with the same significance and correction.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.