Main

Breast cancer is the most commonly diagnosed cancer worldwide, representing a major public health challenge1. Population-based mammography screening has proven to be the most effective way to detect breast cancer early and reduce mortality2,3. Yet disparities in patient outcomes persist: women with dense breast tissue, which can mask cancer lesions in mammograms, face higher cancer risk and a greater likelihood of missed cancer diagnoses, and Black women in the USA experience significantly higher breast cancer mortality despite a lower incidence compared with white women. This racial disparity is linked not only to differences in tumour biology but also to systemic barriers that result in reduced access and follow-up, and delayed diagnoses, for Black women4. Differences in outcomes based on race and breast density have led the US Preventive Services Task Force to call for more inclusive and effective screening strategies for these increased-risk groups. Artificial intelligence (AI) has shown strong potential to improve screening outcomes, including increases in the cancer detection rate (CDR) with no increase or a decrease in the recall rate (RR)5,6,7,8,9,10, and early indications that improved outcomes generalize to a limited set of subpopulations11. However, such large-scale evaluations have been conducted exclusively in European settings with biennial or triennial population-based invitations to screening and double reading9,10 of full-field digital mammography. This paradigm differs substantially from US practice, with its annual, opportunistic screening and single-reading workflows with digital breast tomosynthesis (DBT).

With the USA representing one of the largest and most diverse screening populations, and performing approximately 40% of worldwide screening mammograms each year12,13, this study, called AI-Supported Safeguard Review Evaluation (ASSURE), addresses an important evidence gap. We evaluate the real-world deployment and clinical use of a validated14 DBT-compatible AI-driven workflow, tailored for single-reading paradigms, at scale for over half a million women across 109 sites. Clinical outcomes were stratified by breast density and racial subgroups to assess whether outcomes were equitable across groups at increased risk of their cancer being missed (for example, women with dense breasts) or at increased risk of poor cancer outcomes (for example, Black women).

Results

Real-world deployment of the multistage AI-driven workflow was conducted at five practices across the USA (109 sites, 96 radiologists) in a diverse, nationally distributed (California, Delaware, Maryland and New York) outpatient imaging setting. The multistage AI-driven workflow aids the radiologist at two points (Fig. 1a): first, by interpreting the mammogram with a computer-aided detection and diagnosis (CADe/x) device (DeepHealth Breast AI version 2.x); and second, by an AI-supported safeguard review (SafeGuard Review). The CADe/x device provides an overall four-level category of suspicion for cancer (minimal, low, intermediate and high) and localized bounding boxes for suspicious lesions14. The SafeGuard Review routes exams that score above a predetermined DeepHealth Breast AI threshold but were not recalled by the interpreting radiologist to a breast imaging specialist (reviewer) for a second review. Reviewers were selected by the breast imaging practice leadership based on experience and clinical performance record. If the reviewer agreed with the AI and found the exam suspicious, they provided feedback on the exam to the interpreting radiologist, who made the final recall decision. The standard-of-care workflow consisted of single reading of DBT exams without the multistage AI-driven workflow.

Fig. 1: Details of study design and timeframe.

a, During the standard-of-care period, patients followed a typical screening workflow; during the multistage AI-driven workflow period, a CADe/x device (DeepHealth Breast AI) was added for the initial reader and, if routed by SafeGuard Review, a safeguard review was performed by a breast imaging specialist to detect possible missed cancers. b, Periods during which exams were collected for the standard-of-care and multistage AI-driven workflow cohorts. BCSC, Breast Cancer Surveillance Consortium.

Data from the multistage AI-driven workflow were included from 3 August 2022 to 31 December 2022, after a 2-month training period, and compared with a standard-of-care cohort before deployment, from 1 September 2021 to 19 May 2022 (Fig. 1b). In both cohorts, radiologists had access to non-AI-based computer-aided detection outputs. A prospective consecutive case series design was selected for this investigation for two reasons: (1) to capture the real-world impact of the device when used for routine reading in a clinical setting; and (2) because a double-blind randomized controlled trial was not possible, as the reading radiologists could not use the device while blinded.

The primary outcomes of the ASSURE study were unadjusted CDRs, RRs and positive predictive values of recall (PPV1) before and after deployment of the multistage AI-driven workflow for the overall screening population and for the prespecified subpopulations of women with dense breasts and Black, non-Hispanic women. Secondarily, unadjusted and adjusted CDR, RR and PPV1 were investigated before and after AI deployment for the whole population and all subpopulations. Adjusted analyses used generalized linear models with generalized estimating equations, controlling for race and ethnicity, breast density, and age, and grouping by interpreting radiologist, as in previous studies15,16,17. The study was powered to detect a change in the CDR and in PPV1 in the whole population and to detect a change in the CDR in all prespecified race and ethnicity and breast density subpopulations of interest.

Patient characteristics

This study included 579,583 exams: 370,692 (64%) in the standard of care and 208,891 (36%) in the multistage AI-driven workflow. Exams were included only if they were bilateral DBT exams from an eligible manufacturer. A flow chart of exam exclusions based on study and product inclusion and exclusion criteria is shown in Fig. 2. The same exclusion criteria were applied to both cohorts even though the AI algorithm did not process the standard-of-care exams. Only a small number of DBT exams did not meet the device inclusion criteria (standard of care, 6,772 (1.5%); multistage AI-driven workflow, 2,678 (1.1%)). Population demographics, including patient age, race and ethnicity, and breast density, were similar between the cohorts (Table 1). Of the 208,891 exams that went through the multistage AI-driven workflow, 16,763 underwent an additional safeguard review (8.0% of all exams). Zero adverse events were reported during the study period. Practice-specific clinical performance and differences in demographics between practices are presented in Supplementary Tables 4 and 5, respectively.

Fig. 2: Case collection and exclusion diagram showing counts of exams and their reasons for exclusion from the analysis.

Exam excl., exam exclusion criteria; Pt. excl., patient exclusion criteria; Rad. excl., radiologist exclusion criteria. Exclusions because an exam was not, or would not have been, accepted by DeepHealth Breast AI reflect product-level requirements (see Supplementary Note 1 for more detail).

Table 1 Characteristics of 579,583 screening mammograms interpreted from September 2021 to December 2022

Screening performance of the multistage AI-driven workflow

In the whole population, compared with the standard of care, the multistage AI-driven workflow cohort was associated with an absolute increase in the CDR (Δ0.99 cancers per 1,000 exams = 21.6%, 95% confidence interval (CI) 12.9–31.0%, P < 0.001), RR (Δ0.60 recalls per 100 exams = 5.7%, 95% CI 4.1–7.3%, P < 0.001) and PPV1 (Δ0.66 cancers per 100 recalls = 15.0%, 95% CI 7.0–23.7%, P < 0.001) (Fig. 3 and Table 2). All prespecified subpopulations had a higher CDR (Δ0.73–1.23 cancers per 1,000 exams = 20.4–22.7%, P ≤ 0.045) associated with the multistage AI-driven workflow (see Table 2 for values for prespecified subpopulations and Supplementary Table 1 for values for additional subpopulations). All prespecified subpopulations also had a higher RR (Δ0.48–0.99 recalls per 100 exams = 5.0–9.2%, P ≤ 0.001), except women in the ‘other race’ category (Δ0.31 recalls per 100 exams = 2.6%, P = 0.135). CDR increases were greater than RR increases in all cases, resulting in a significant improvement in PPV1 in 4 of the 7 populations of interest: the whole population; white, non-Hispanic women (Δ0.95 cancers per 100 recalls = 16.0%, 95% CI 3.7–29.7%, P = 0.010); women with non-dense breasts (Δ0.74 cancers per 100 recalls = 13.8%, 95% CI 2.8–26.1%, P = 0.014); and women with dense breasts (Δ0.56 cancers per 100 recalls = 15.3%, 95% CI 3.9–27.8%, P = 0.008). In the other three subpopulations (Black, non-Hispanic women; Hispanic women; and women in the ‘other race’ category), a similar trend of a non-significantly higher PPV1 was observed for the multistage AI-driven workflow cohort; however, the study was not powered to detect an increase in PPV1 in any of the subpopulations. The distribution of cancers across AI suspicion levels did not change between cohorts (Supplementary Table 2).
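
As an illustrative check, the whole-population CDR comparison can be approximately reproduced from the rounded rates and cohort sizes reported above (the cancer counts below are reconstructed from rounded rates, so they are approximations, not the study's exact values):

```python
# Approximate two-proportion comparison of CDRs between cohorts, using
# rounded rates from the text; counts are reconstructed, not exact.
import numpy as np
from statsmodels.stats.proportion import proportions_chisquare

nobs = np.array([370_692, 208_891])              # standard of care, AI workflow
cancers = np.round(np.array([4.59, 5.58]) / 1_000 * nobs).astype(int)

chi2, pval, _ = proportions_chisquare(cancers, nobs)
print(cancers / nobs * 1_000)                    # ~[4.59, 5.58] per 1,000 exams
print(pval)                                      # consistent with P < 0.001
```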

Fig. 3: Impact of the multistage AI-driven workflow on breast cancer screening outcomes.

CDR, RR and PPV1 in the standard-of-care versus the multistage AI-driven workflow cohort across the whole population, in individual race and ethnicity subpopulations, and by breast density. See Table 2 for numerator and denominator values. Data are presented as the unadjusted rate, and lines are the 95% Agresti–Coull CIs. All standard-of-care (grey) and multistage AI-driven workflow (purple) paired comparisons indicated with an asterisk are significant (*P < 0.05) under an unadjusted one-sided chi-squared comparison (see Table 2 for exact P values). See Supplementary Table 1 for details of CDR, RR and PPV1 and comparisons for other demographic groups.
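
The Agresti–Coull intervals shown in the figure can be computed directly with statsmodels; the counts below are illustrative placeholders, not the study's values:

```python
# 95% Agresti–Coull confidence interval for a proportion (here, a CDR),
# as used for the error bars in Fig. 3; counts are illustrative only.
from statsmodels.stats.proportion import proportion_confint

lower, upper = proportion_confint(count=1_166, nobs=208_891,
                                  alpha=0.05, method="agresti_coull")
print(lower * 1_000, upper * 1_000)   # CI expressed per 1,000 exams
```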

Table 2 Outcome metrics for standard of care versus the multistage AI-driven workflow, and unadjusted estimates of the percentage change

Adjusted results, which simultaneously accounted for age, race and ethnicity, breast density, and the radiologist reading the study, showed an overall marginal effect for the CDR of 1.29 cancers per 1,000 exams (95% CI 0.35–2.23, P = 0.007), for RR of 0.72 recalls per 100 exams (95% CI 0.03–1.41, P = 0.04) and for PPV1 of 0.92 cancers per 100 recalls (95% CI 0.07–1.78, P = 0.03). The consistency of these adjusted effects with the unadjusted findings indicates that the observed improvements in CDR and cancer detection efficiency are robust, even after controlling for potential confounding variables such as patient age, race and ethnicity, breast density, and the reading radiologist. In addition, interaction terms between the multistage AI-driven workflow and patient factors such as age, race and ethnicity, and breast density were not statistically significant, with the exception of the interaction term for the multistage AI-driven workflow in women aged ≥80 years; however, this population was small (N = 15,550). Together, these results suggest comparable performance of the multistage AI-driven workflow across all patient subpopulations. See Supplementary Table 3 for full details on all terms in the adjusted models.

Discussion

This large real-world study demonstrated that a multistage AI-driven workflow for screening mammography, deployed across several diverse US screening practices, was associated with improved CDR across all prespecified breast density and race and ethnicity subpopulations. For the overall population, the CDR increased by 0.99 per 1,000 screens (4.59 to 5.58, P < 0.001). PPV1 also improved for the whole population and all powered subpopulations of interest in both the unadjusted and adjusted analyses. While the RR increased by 5.7% overall (10.6 to 11.1, P < 0.001), the increase in PPV1 suggests that the additional recalls and diagnostic evaluations were appropriate because they led to a higher rate of additional cancer diagnoses. Increases in CDR held for women with dense and non-dense breasts, as well as for Black, non-Hispanic; Hispanic; and white, non-Hispanic women. Our results suggest that the multistage AI-driven workflow would not widen existing disparities in US screening outcomes, but rather could provide equitable benefits across key subpopulations of women. This level of increase in the CDR represents a potential additional 34,097 cancers found through early breast cancer screening over the 43 million mammograms performed in the USA each year, assuming that 80% of these are screening mammograms13.
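
As a rough check on this projection (a back-of-envelope calculation using the rounded ΔCDR of 0.99 per 1,000; the reported figure of 34,097 presumably reflects the unrounded rate difference):

43,000,000 mammograms × 0.80 screening share × 0.99/1,000 ≈ 34,056 additional cancers per year.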

The overall CDR increase observed here of 21.6% is greater than the estimated increase in CDR (11%) associated with double reading 100% of exams in the USA18, highlighting the efficiency of combining a CADe/x device with a safeguard review in which only 8% of cases required a second review. This CDR increase is in addition to that already expected from a transition from full-field digital mammography to DBT of approximately 36% (ref. 19). Finally, the CDR increase was greater than that reported in ref. 5, which found an increase of 0.7 cancers per 1,000 screens, and that in ref. 9, which found a 17% increase in CDR relative to a double-reading standard-of-care cohort. The study in ref. 5 was a prospective trial of 16,000 exams implementing an additional review process, analogous to the SafeGuard Review presented here, but in a European screening setting with double reading of full-field digital mammograms in women with 2-year screening intervals. References 9,10 demonstrated that, in the European double-reading setting, replacing one of the two readers with AI can achieve an increase in CDR or a non-inferior CDR, respectively, alongside a decrease in the RR. However, double reading is not standard in the USA, so results from Europe are difficult to compare directly with those from the USA. These differing results highlight the importance of demonstrating the effectiveness of AI-assisted screening across varied populations and within the context of different workflows, screening paradigms and algorithm versions.

The CDR was 22.7% higher for women with dense breasts with versus without the multistage AI-driven workflow, suggesting that it may help address concerns about missed cancers in this subpopulation. With new US federal mandates requiring that women be informed of their density category after each screening mammogram20,21, the multistage AI-driven workflow may represent a welcome option for women with dense breasts. These results contrast with those recently reported in ref. 6, which showed a non-significant improvement in CDR in dense breasts over a large age-restricted (50–69 years) prospective European cohort; however, that study used a different AI algorithm and a different workflow in which AI assistance was added to double reading.

Black and Hispanic women showed large relative improvements in their CDR (20.4% and 21.8%, respectively). Absolute increases in CDR were smaller for Black, non-Hispanic and Hispanic women than for white, non-Hispanic women, which can be explained by the lower reported cancer incidence in Black, non-Hispanic and Hispanic women than in white, non-Hispanic women22,23, a pattern also seen in our data (Fig. 3). One of the driving forces for the recent revision of the US Preventive Services Task Force screening recommendations to a starting age of 40 years rather than 50 years was to improve health equity in breast cancer outcomes, especially for Black women24. By increasing the CDR, our study suggests that the multistage AI-driven workflow may facilitate the detection of cancers in earlier screening exams for racial and ethnic minorities, a population that has historically faced breast cancer diagnosis at later stages with worse morbidity and mortality24.

The clinically meaningful and statistically significant increase in PPV1 in the whole population, and the same trend observed across all subpopulations of interest, indicate that the additional recalls made with the multistage AI-driven workflow detected additional cancers at a higher rate than the standard of care. Although the absolute increase in PPV1 was smaller for Black, non-Hispanic women than for white, non-Hispanic women (0.60 versus 0.95), the adjusted model did not demonstrate a statistically significant difference in the impact of the multistage AI-driven workflow on different racial and ethnic subpopulations. This suggests that, when demographic and radiologist-level factors are controlled, the relationship between the multistage AI-driven workflow and CDR, RR and PPV1 is similar for all subpopulations.

The strengths of our study include that it is one of the largest real-world US studies evaluating mammography screening with AI so far, spanning 4 states, 109 individual sites and 96 individual radiologists. Most previous studies measuring CDR with DBT have been small and performed predominantly in academic research centres2,3. In contrast, our study represents real-world evidence collected from a large number of geographically diverse outpatient imaging centres and may better reflect the average US patient experience. The combination of (1) a CADe/x device applied to all cases and (2) a safeguard review by an expert reviewer for high-suspicion cases interpreted as normal by the initial radiologist is unique, particularly in a single-reading paradigm. The second-stage SafeGuard Review provides a process analogous to the consensus review in double-reading screening programmes, in which all exams are read by at least two radiologists. However, in our workflow, only a small set of patients (8%) at highest risk of having cancer are double read. This enables nearly the full cancer detection benefit of double reading for <10% of the added effort, plus the cost of the software. To reduce radiologist-level factors, only radiologists who interpreted a minimum number of exams in both cohorts, and only exams from sites that were present in both cohorts, were included. As such, the sites, interpreting radiologists and patient characteristics are comparable in the two cohorts. Furthermore, a 2-month learning curve period before starting the post-intervention period was used, similar to previous studies25. Finally, we observed similar changes in CDR, RR and PPV1 across the radiology practices (Supplementary Table 4), indicating that the AI algorithm and SafeGuard Review workflow generalize across the diverse set of practices investigated.

There are also several limitations to our study. First, there were insufficient follow-up data after screening to report sensitivity, specificity, false-negative rates, interval cancers or cancer stage at diagnosis. However, previous work comparing radiologist performance with versus without this CADe/x device (in both cases without the SafeGuard Review component) showed that radiologists improved sensitivity (80.8% without versus 89.6% with the device, P < 0.01) without reducing specificity (75.1% without versus 76.0% with the device, P = 0.65)14. In addition, the same study showed that radiologists reading with DeepHealth Breast AI had improved sensitivity across all lesion sizes and pathologies (invasive versus non-invasive), and ref. 26 reported similar distributions of invasive and triple-negative cancers using the SafeGuard Review workflow described here compared with cancers identified without AI assistance. Second, it was not possible to separate the clinical impact of the CADe/x device from that of the SafeGuard Review owing to the unique aspects of the AI-driven workflow (for example, integration with existing imaging viewing software; workflow paths that include both the CADe/x and SafeGuard Review devices on a single exam; and user training and knowledge of both devices). Our results are therefore applicable only to the device under investigation. Third, we chose not to correct for multiple comparisons because our outcomes were highly correlated (for example, Black, non-Hispanic women were also included in the whole population; CDR, RR and PPV1 are related through radiologist behaviour; and so on). However, we do account for correlation in the data through the adjusted generalized estimating equations models, and these adjusted results support the conclusions drawn from the unadjusted results. Finally, the cohorts were divided into two sequential groups in this real-world observational study, which does not control for unknown biases and confounders in the patient groups as a randomized trial would. However, the study prioritized external generalizability by assessing the AI workflow in a real-world clinical setting, thus avoiding biases that could arise from a highly controlled interventional study. Comparison of demographics nevertheless showed similar patient characteristics between groups, and these main confounders were controlled for in the adjusted analysis.

In summary, the ASSURE study presents large-scale, real-world evidence that using a multistage AI-driven workflow is associated with improved mammography screening performance for the population as a whole and across density and key race and ethnicity subpopulations. These results demonstrate that the multistage AI-driven workflow can provide significant and equitable cancer detection benefits to women.

Methods

Data were collected in compliance with the Health Insurance Portability and Accountability Act and under Advarra institutional review board approval (DH-ACC-001-030623) with a waiver of consent. A multistage AI-driven workflow for breast cancer screening was prospectively deployed in the USA at five practices (109 sites, 96 radiologists) in a diverse, nationally distributed (California, Delaware, Maryland and New York) outpatient imaging setting. All radiologists were board certified and Mammography Quality Standards Act (MQSA) qualified, and no trainees were included in this study. A mixture of breast imaging specialists and general radiologists was included. Our primary outcomes were unadjusted CDR, RR and PPV1 before and after deployment of the multistage AI-driven workflow for the overall screening population, and for the key subpopulations of women with dense breasts and Black, non-Hispanic women. Secondarily, adjusted and unadjusted CDR, RR and PPV1 were investigated before and after AI deployment for all subpopulations, including women with non-dense breasts; Hispanic women; white, non-Hispanic women; and women whose race and ethnicity was not Black, non-Hispanic; Hispanic; or white, non-Hispanic (other race); and multivariable adjusted CDR, RR and PPV1 estimates were obtained.

Multistage AI-driven workflow

The multistage AI-driven workflow consists of two components (Fig. 1a): interpreting the mammogram with a computer-aided detection and diagnosis (CADe/x) device (DeepHealth Breast AI version 2.x, DeepHealth) and an AI-supported SafeGuard Review. The previously validated CADe/x device showed improved performance for both general radiologists and breast imaging specialists in a reader study14. The SafeGuard Review routes exams above a predetermined DeepHealth Breast AI threshold that were not recalled by the interpreting radiologist for review by a breast imaging specialist (reviewer). Reviewers were selected by the breast imaging practice leadership based on experience and clinical performance record. If the reviewer agreed with the AI and found the exam suspicious, they discussed the exam with the interpreting radiologist, who made the final recall decision.
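
The routing rule at the heart of the SafeGuard Review can be summarized in a few lines. The sketch below is illustrative only: the score field, threshold handling and exam representation are assumptions, not the product's actual implementation.

```python
# Minimal sketch of the SafeGuard Review routing rule described above.
# All fields and the threshold are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class ScreeningExam:
    suspicion_score: float        # DeepHealth Breast AI output (assumed scalar)
    radiologist_recalled: bool    # initial reader's recall (BI-RADS 0) decision

def route_to_safeguard_review(exam: ScreeningExam, threshold: float) -> bool:
    """Route exams scored above threshold that the reader did not recall."""
    return exam.suspicion_score >= threshold and not exam.radiologist_recalled
```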

Study design

All screening exams at the five practices during the study period were eligible for inclusion in the study. Exams between 1 September 2021 and 19 May 2022 did not receive the multistage AI-driven workflow and formed the standard-of-care comparison cohort. The multistage AI-driven workflow was deployed on all exams satisfying the product instructions for use from 20 May 2022 to 31 December 2022 (Fig. 1b). Data were collected from 3 August 2022 to 31 December 2022, starting 2 months after deployment to allow radiologists to adapt to the new technology (multistage AI-driven workflow cohort). Radiologists in both cohorts had access to non-AI-based computer-aided detection outputs (ImageChecker, Hologic). For both periods, AI suspicion levels were determined with DeepHealth Breast AI 2.x for screening exams that resulted in a cancer finding.

Exam eligibility

Exams were included if they met all exam, patient and radiologist criteria (Fig. 2); an illustrative filter is sketched below. Exam criteria included: bilateral screening DBT without implants or additional diagnostic imaging; American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) interpretation of 0, 1 or 2; valid breast density; compatibility with DeepHealth Breast AI 2.x; and meeting DeepHealth Breast AI 2.x input requirements (see Supplementary Note 1 for details). Patient criteria: ≥35 years old and self-reported as female. Radiologist criteria: interpreted screening mammograms during both study periods based on the MQSA-required minimum of 960 every 2 years17 (that is, 372 exams during the standard-of-care period and 175 exams during the multistage AI-driven workflow period); this excluded 83 radiologists and the 14,472 (5.7%) exams they read.
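
A minimal sketch of how these criteria could be applied to an exam-level table; every column name is a hypothetical stand-in, not the study's actual schema:

```python
# Illustrative eligibility filter over an exam-level table; all column names
# are hypothetical stand-ins for the criteria listed above.
import pandas as pd

def eligible_exams(exams: pd.DataFrame) -> pd.DataFrame:
    """Apply the exam- and patient-level inclusion criteria."""
    keep = (
        exams["bilateral_screening_dbt"]            # bilateral screening DBT
        & ~exams["has_implants"]                    # no implants
        & ~exams["additional_diagnostic_imaging"]   # screening only
        & exams["birads"].isin([0, 1, 2])           # BI-RADS 0, 1 or 2
        & exams["breast_density"].notna()           # valid breast density
        & exams["ai_compatible"]                    # meets device requirements
        & (exams["age"] >= 35)                      # patient criteria
        & (exams["sex"] == "F")
    )
    return exams[keep]
```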

Data collection

Examination-level, patient-level and outcome-level data were collected from screening mammograms during both study periods. Exam data collected included screening BI-RADS assessment and breast density (non-dense: BI-RADS A, fatty, or B, scattered fibroglandular; versus dense: C, heterogeneously dense, or D, extremely dense) as reported by the interpreting radiologist. Patient data collected included self-reported sex, age at exam, and self-reported race and ethnicity (Asian; Black, non-Hispanic; Hispanic; Native American; Pacific Islander; white, non-Hispanic; multiracial (listed more than one race) or other; or declined to specify). Owing to sample size limitations, women who identified as Asian; Native American; Pacific Islander; multiracial or other; or who declined to specify were combined for some analyses into a category called other race. Four exams missing breast density were excluded. Adverse events were monitored as part of post-market surveillance activities.

Metrics

Metrics were calculated based on the Breast Cancer Surveillance Consortium Standard Definitions v3.1 and the BI-RADS Atlas 5th edition, and included CDR, RR and PPV1 (refs. 27,28). The CDR was defined as the number of BI-RADS 0 (positive) exams with a malignant biopsy (invasive lobular carcinoma, invasive ductal carcinoma, ductal carcinoma in situ) divided by the total number of exams multiplied by 1,000. The RR was defined as the percentage of screening exams that were positive. The PPV1 was defined as the percentage of positive exams that resulted in a malignant biopsy.
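A minimal sketch of these metric definitions, assuming an exam-level table with boolean flags for a positive (BI-RADS 0) screen and a subsequent malignant biopsy (the column names are hypothetical):

```python
# BCSC-style screening metrics: CDR per 1,000 exams, RR per 100 exams,
# PPV1 per 100 recalls. Column names are hypothetical stand-ins.
import pandas as pd

def screening_metrics(exams: pd.DataFrame) -> dict:
    """Compute CDR, RR and PPV1 from exam-level flags."""
    n_exams = len(exams)
    recalled = exams["birads0"]                       # positive screen
    cancers = exams["birads0"] & exams["malignant_biopsy"]
    return {
        "CDR": 1_000 * cancers.sum() / n_exams,       # cancers per 1,000 exams
        "RR": 100 * recalled.sum() / n_exams,         # recalls per 100 exams
        "PPV1": 100 * cancers.sum() / recalled.sum(), # cancers per 100 recalls
    }
```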

Statistical analysis

Descriptive statistics (unadjusted mean and 95% CI29) were used to evaluate the CDR, RR and PPV1 in both cohorts for the whole population and for all subpopulations. Chi-squared tests were used to compare unadjusted CDR, RR and PPV1 estimates for the multistage AI-driven workflow across the whole population and in the subpopulations of interest (Black, non-Hispanic women; Hispanic women; white, non-Hispanic women; women with non-dense breasts; and women with dense breasts). As these are real-world data, and because the results are correlated rather than independent, we did not correct for multiple comparisons. To account for the correlated nature of the data and to test whether the multistage AI-driven workflow showed differences between subpopulations, generalized linear models with generalized estimating equations were used to estimate multivariable adjusted CDR, RR and PPV1, fit with terms for covariates known to influence screening performance (race and ethnicity, breast density, and age) and grouped on interpreting radiologist to account for radiologist-level effects on screening metrics15,16,17. To evaluate differences in multistage AI-driven workflow performance across subpopulations, terms were included for the cohort and for the interaction between cohort and each of the subpopulation terms (for example, multistage AI-driven workflow: Black, non-Hispanic; multistage AI-driven workflow: dense). The number of exams undergoing the SafeGuard Review workflow and their outcomes were also reported. All analyses were performed with Python 3.10 (packages: statsmodels, scipy) with a critical P value of 0.05.
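
A hedged sketch of how such a model can be specified in statsmodels (the study's exact model code is not published; all column names and the exchangeable working correlation are assumptions):

```python
# Binomial GEE clustered on the interpreting radiologist, with
# cohort-by-subpopulation interaction terms; column names are illustrative.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_adjusted_recall_model(exams: pd.DataFrame):
    """Fit an adjusted RR model; analogous models apply to CDR and PPV1."""
    model = smf.gee(
        "recalled ~ cohort * (race_ethnicity + dense + age_group)",
        groups="radiologist_id",              # cluster on radiologist
        data=exams,                           # one row per screening exam
        family=sm.families.Binomial(),        # binary outcome
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    return model.fit()                        # interactions test subgroup effects
```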

Power analysis

A post-hoc sample size calculation was completed based on two proportions, two-sided power analysis to determine the sample size required to address the primary outcome of the CDR across the whole population and in subpopulations of interest. Assuming a base CDR of 5 cancers per 1,000 exams, a 23% increase in th CDR from the standard of care to the multistage AI-driven workflow, α = 0.05, β = 0.2 and sampling ratio of 1.8 standard-of-care exams for each multistage AI-driven workflow exam, 94,822 exams were required in the standard of care and 52,679 exams for the multistage AI-driven workflow cohort. Using the same approach to determine the sample size required to evaluate PPV1, from a base PPV1 of 4% and a 15% increase between cohorts, 25,595 recalls were required in the standard of care and 14,219 recalls required in the multistage AI-driven workflow cohort.
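
The standard unpooled normal-approximation formula for two proportions with unequal allocation reproduces these counts; the sketch below is an assumed but numerically consistent formulation, as the exact software used is not stated:

```python
# Two-proportion sample size, unpooled normal approximation, with an
# n1/n2 allocation ratio; reproduces the counts reported above.
from math import ceil
from scipy.stats import norm

def two_proportion_n(p1, rel_increase, ratio, alpha=0.05, power=0.80):
    """Return (n1, n2) with n1 = ratio * n2 (standard of care, AI workflow)."""
    p2 = p1 * (1 + rel_increase)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n2 = z**2 * (p1 * (1 - p1) / ratio + p2 * (1 - p2)) / (p1 - p2) ** 2
    return ceil(ratio * n2), ceil(n2)

print(two_proportion_n(0.005, 0.23, 1.8))  # CDR: (94822, 52679) exams
print(two_proportion_n(0.04, 0.15, 1.8))   # PPV1: (25595, 14219) recalls
```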

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.