Introduction

The use of artificial intelligence (AI) tools in breast imaging has increased rapidly, offering significant potential to improve breast cancer detection and management. The Food and Drug Administration (FDA) has approved several AI tools for screening mammography, with applications in triaging exams, detecting and classifying lesions, and assessing breast density1. With increasing imaging volume and continued demand for breast cancer screening, radiomics and AI studies have demonstrated that AI-enhanced workflows hold promise for improving diagnostic accuracy and efficiency2.

Nonetheless, successfully implementing AI tools in clinical practice requires participation from all stakeholders, including radiologists, developers, policymakers, and patients3. Given that screening examinations can cause anxiety, understanding patients’ reactions to the incorporation of AI in this space is crucial4. Patient surveys reveal varying levels of trust in and acceptance of AI tools in breast imaging, shaped by factors such as education level and prior knowledge about AI5,6. While many patients express positive attitudes toward radiologist-AI collaboration, concerns surrounding trust, transparency, and accountability persist5,6,7,8. Critically, little research has examined patients’ perceptions of AI in breast imaging, or imaging more broadly, with respect to how results are communicated in a patient portal setting.

With the movement toward transparency and patient-centered care, patient portals have played an increasingly important role in healthcare delivery, enabling patients to send messages, schedule appointments, and access lab and imaging results9. As more AI tools are implemented, it is important to establish best practices for communicating these results to patients. For a traditional radiologist report, writing with patients as the audience, such as by reducing medical jargon, can facilitate communication and comprehension10. For an AI-interpreted imaging report, however, there are additional considerations about how to present information effectively and ensure patient autonomy11. In short, while AI disclosure is ethically essential, inadequate guidance on interpreting AI results may inadvertently lead to further patient anxiety and confusion.

The purpose of this study was to illustrate how contextual information can modify patient attitudes; our hope is that these results help inform what information is provided to patients in their portals regarding their screening mammograms. In the US, screening mammograms for average-risk women are conducted annually starting at age 40, with some variation among societal guidelines. We compare participants’ hypothetical decision-making and attitudes when presented with a Breast Imaging Reporting and Data System (BI-RADS) 1 radiologist report (a report concluding no breast cancer was found) and variations of an AI-interpreted report. Our specific hypotheses are listed in Supplemental Table 1.

Table 1: Demographic characteristics

Results

Primary outcome

Litigation

As seen in Fig. 1 and Supplementary Tables 3 and 4, consideration of a lawsuit after a BI-RADS 1 determination, upon discovering breast cancer the following year, increased when AI results were included relative to No AI, p = 0.04977. Consideration of a lawsuit increased from 39.7% with No AI (control) to 46.9% when the AI score of 0 was presented alone, but returned to the control level (40.0%) when both the cutoff threshold of 30 and the FOR of 0.2% were provided; all other conditions in which AI agreed with the radiologist showed higher percentages pursuing litigation compared to control, p = 0.3734. Once AI flagged the case as abnormal, consideration of a lawsuit increased to 60.7% when an AI score of 31 was provided alone, decreasing to 56.1% when paired with the threshold and further to 31.6% when the FDR of 95% was provided, p = 0.0005. Similarly, when the AI score of 50 was presented alone, consideration of a lawsuit was 48.3%; this increased to 51.9% when paired with the cutoff threshold but decreased to 36.8% when the FDR was provided, p = 0.014.

Fig. 1: Considering a Lawsuit after a BI-RADS 1 determination upon discovering breast cancer the following year.

X-axis is experimental condition, broken down by three column panels (No AI, Not Flagged for Abnormality, Flagged for Abnormality). The 12 AI conditions comprise four abnormality scores (0, 29, 31, 50) crossed with: 1) the abnormality cutoff threshold of “30” (Cut), and 2) the error rates (FOR, false omission rate, 0.2%; FDR, false discovery rate, 95%). Y-axis is the percentage of participants who indicated they would consult an attorney to consider suing the radiologist for the BI-RADS 1 interpretation. The red dashed line is the control reference. Black dots are means with 95% confidence intervals.

Secondary outcomes

Second opinion

As seen in Fig. 2A and Supplementary Tables 3 and 5, desire to follow up with a different radiologist (second opinion) after a BI-RADS 1 determination increased when AI results were included relative to No AI, p = 0.0003. Desire for a second opinion increased from 8.3% with No AI (control) to 27.2% when the AI score of 29 was presented alone, rose to 68.0% when paired with the cutoff threshold of 30, and decreased slightly to 62.5% when the FOR of 0.2% was included; notably, the latter two were relatively high even though AI agreed with the radiologist. Once AI flagged the case as abnormal, desire for a second opinion increased to 70.2% when an AI score of 31 was provided alone and to 81.3% when paired with the threshold, p = 0.0003; it decreased to 61% when the FDR of 95% was provided, p = 0.0008. Likewise, when the AI score of 50 was presented alone, desire for a second opinion was 68.9%, which increased to 74.0% when paired with the threshold and decreased to 60.9% when the FDR was provided, p = 0.0215.

Fig. 2: Participant Follow-up Behaviors: Second Opinion and Second Look.

A, B Follow-up seeking behaviors. X-axis is experimental condition, broken down by three column panels (No AI, Not Flagged for Abnormality, Flagged for Abnormality). The 12 AI conditions comprise four abnormality scores (0, 29, 31, 50) crossed with: 1) the abnormality cutoff threshold of “30” (Cut), and 2) the error rates (FOR, false omission rate, 0.2%; FDR, false discovery rate, 95%). Y-axis is the percentage of participants who indicated they would pursue the follow-up behavior (0–100%). Red dashed line is the control reference. Black dots are means with 95% confidence intervals.

Second look

As seen in Fig. 2B and Supplementary Tables 3 and 5, desire to follow up with the same radiologist (second look) after a BI-RADS 1 determination increased when AI results were included relative to No AI, p = 0.0003. Desire for a second look increased from 14.3% with No AI (control) to 30.7% when the AI score of 29 was presented alone, rose to 62.1% when paired with the cutoff threshold of 30, and decreased slightly to 57.5% when the FOR of 0.2% was included; notably, the latter two were relatively high even though AI agreed with the radiologist. Once AI flagged the case as abnormal, desire for a second look increased to 73.2% when an AI score of 31 was provided alone and to 81.3% when paired with the threshold, p = 0.0003; it decreased to 54.1% when the FDR of 95% was provided, p = 0.0002. Likewise, when the AI score of 50 was presented alone, desire for a second look was 65.8%, which increased to 71.8% when paired with the threshold and decreased to 63.6% when the FDR was provided, but this failed to reach statistical significance, p = 0.09889.

Primary care physician (PCP) follow-up

As seen in Fig. 3A and Supplementary Tables 3 and 5, desire to follow up with the PCP after a BI-RADS 1 determination increased when AI results were included relative to No AI, p = 0.0002. Desire for PCP follow-up increased from 24.5% with No AI (control) to 42.9% when the AI score of 29 was presented alone, rose to 80.4% when paired with the cutoff threshold of 30, and decreased slightly to 71.1% when the FOR of 0.2% was included; notably, all three were relatively high even though AI agreed with the radiologist. Once AI flagged the case as abnormal, desire for PCP follow-up increased to 82.8% when an AI score of 31 was provided alone and to 84.9% when paired with the cutoff threshold of 30, p = 0.0002; it decreased to 68.4% when the FDR of 95% was provided, p = 0.0027. Likewise, when the AI score of 50 was presented alone, desire for PCP follow-up was 79.2%, which increased to 80.8% when paired with the cutoff threshold and decreased to 73.7% when the FDR was provided, but this failed to reach statistical significance, p = 0.104.

Fig. 3: Participant Follow-up Behaviors: Primary Care Physician and More Accurate Imaging.

A, B Follow-up seeking behaviors. X-axis is experimental condition, broken down by three column panels (No AI, Not Flagged for Abnormality, Flagged for Abnormality). The 12 AI conditions comprise four abnormality scores (0, 29, 31, 50) crossed with: 1) the abnormality cutoff threshold of “30” (Cut), and 2) the error rates (FOR, false omission rate, 0.2%; FDR, false discovery rate, 95%). Y-axis is the percentage of participants who indicated they would pursue the follow-up behavior (0–100%). Red dashed line is the control reference. Black dots are means with 95% confidence intervals.

Request for more accurate follow-up imaging (MAFI)

As seen in Fig. 3B and Supplementary Tables 3 and 5, desire for MAFI after a BI-RADS 1 determination increased when AI results were included relative to No AI, p = 0.0002. Desire for MAFI was 10.8% with No AI (control); it increased to 21.6% when the AI score of 29 was presented alone, 45.6% when paired with the cutoff threshold of 30, and 52.3% when the FOR of 0.2% was included, despite AI agreeing with the radiologist. Once AI flagged the case as abnormal, desire for MAFI increased to 58.9% when an AI score of 31 was provided alone; this rose to 67.3% when paired with the threshold of 30, p = 0.0002, but decreased to 44.7% when the FDR of 95% was provided, p = 0.0005. Likewise, when the AI score of 50 was presented alone, desire for MAFI was 46.3%, which increased to 64.4% when the threshold was provided but decreased to 53.4% when the FDR was provided, p = 0.0497.

Trust

As seen in Fig. 4A and Supplementary Tables 3 and 6, trust in the radiologist and in AI after a BI-RADS 1 determination varied depending on whether AI flagged a case and what corresponding information was provided, p = 0.0002. Trust in both the radiologist and AI generally decreased as AI scores increased, p = 0.0002. When the FDR of 95% was provided for scores of 31 and 50, trust in the radiologist was elevated relative to providing only the score and threshold (+12.6, p = 0.000167 and +5.35, p = 0.0498, respectively). Conversely, trust in AI decreased relative to providing only the score and threshold (−24.8, p = 0.000167 and −23.7, p = 0.000176, respectively). Trust in the radiologist was higher than trust in AI across all experimental conditions, p = 0.0002.

Fig. 4: Trust in Radiologist and AI & Concern about Breast Cancer.

A Trust in radiologist and AI. B Concern about breast cancer. X-axis is experimental condition, broken down by column panels (No AI (B only), Not Flagged for Abnormality, Flagged for Abnormality). The 12 AI conditions comprise four abnormality scores (0, 29, 31, 50) crossed with: 1) the abnormality cutoff threshold of “30” (Cut), and 2) the error rates (FOR, false omission rate, 0.2%; FDR, false discovery rate, 95%). Y-axis is (A) a magnitude scale from 0 (not at all) to 100 (very much) and (B) a Likert scale (−50, greatly decrease; 0, no change; 50, greatly increase). A Blue dots are means with 95% confidence intervals denoting trust in the radiologist; red dots are means with 95% confidence intervals denoting trust in AI. B Black dots are means with 95% confidence intervals. Red dashed line is the control reference.

Concern about breast cancer

As seen in Fig. 4B and Supplementary Tables 3 and 6, concern about breast cancer after a BI-RADS 1 determination increased when AI results were included relative to No AI, p = 0.0002. Concern increased from −20.9 with No AI (control) to −9.6 when the AI score of 29 was presented alone, rose to 5.1 when paired with the cutoff threshold of 30, and was similar at 3.5 when the FOR of 0.2% was included; notably, all three were elevated relative to control despite AI agreeing with the radiologist. Once AI flagged the case as abnormal, concern increased to 13.8 when an AI score of 31 was provided alone and to 17.4 when paired with the threshold, p = 0.0002; it decreased to 1.8 when the FDR of 95% was provided, p = 0.0002. A similar pattern was observed for an AI score of 50: concern was 12.1 with the score alone, increased to 14.0 when paired with the threshold, and decreased to 7.3 when the FDR was provided, p = 0.0146.

Financial Reasoning

As seen in Supplemental Table 7, willingness to get a second opinion, a follow-up with the same radiologist, a follow-up with the ordering physician, and additional imaging varied by AI information (as noted in the sections above). Willingness to pay out of pocket for these services also varied. Interestingly, approximately two-thirds of respondents who wanted these follow-up services, but only if they did not have to pay for them, indicated unaffordability as the reason.

Discussion

In this experiment, we demonstrate that participants’ hypothetical decision-making behaviors when presented with a BI-RADS 1 radiologist report alongside an AI report vary based on the AI score, flagging status, and contextual information for interpreting AI triage. Specifically, participants’ consideration of a lawsuit increased with higher AI scores, and this was mitigated when the FDR was provided. Furthermore, participants’ requests for additional imaging and follow-up (from the same radiologist, a different radiologist, and the ordering physician) increased when AI was included relative to no AI. Desire for follow-up also increased as the AI score increased, though this was often mitigated by providing the FDR (and sometimes the FOR). Of note, the FDR used in this study was very high, likely exceeding what participants would expect, whereas the FOR was low. This difference may explain why the FDR had a greater effect on participants’ responses, as its magnitude was more salient. Trust in the radiologist and the AI tool also varied depending on the contextual information provided, though trust was always higher for the radiologist across all conditions, likely related to algorithm aversion as described by Dietvorst et al.12. The majority of participants who indicated they would request follow-up only if they were not responsible for the costs cited an inability, rather than an unwillingness, to pay out of pocket.

These findings are important to consider in the context of prior studies probing patient attitudes toward AI in breast imaging. For example, semi-structured interviews showed that trust in AI relied on clear communication about its application and assurance that a radiologist was involved in the assessment process; participants further expressed that they would rather tolerate anxiety from callbacks than risk a missed cancer diagnosis7. Abnormal mammography findings can already be a source of anxiety13, and the integration of AI findings into the patient portal, which 95% of the public reportedly desires14, may heighten concern if the radiologist and AI reports conflict. Indeed, a recent patient survey found that 91.5% of respondents were concerned about AI misdiagnosis15, while another study observed that health communication is seen as more reliable when recipients are told the information comes from a physician rather than from AI16. Another survey found that patients preferred AI to be used as a second reader rather than as a standalone interpreter6. Our study demonstrates that contextualizing findings by providing the FDR (and sometimes the FOR) can reduce the inclination to pursue follow-up care or a lawsuit. Providing contextual information to interpret AI findings can be one way to reduce uncertainty and concern.

Additionally, Pesapane et al. found that most respondents approved of AI as a supplementary tool but held both software developers and radiologists accountable for AI errors17. Emerging guidelines emphasize the importance of appropriate disclosure: the Good Machine Learning Practice (GMLP) principles from the FDA and collaborating agencies recommend that “users are provided ready access to clear, contextually relevant information that is appropriate for the intended audience.”18

Human factors of AI implementation can impact patient outcomes and legal liability. One study demonstrated that incorrect AI results could worsen radiologist performance; critically, however, implementation strategies mitigated those effects19. A similar consideration translates to the interaction between patient and AI, recognizing that how AI information is presented in patient portals holds downstream implications for imaging volume, healthcare expenditure, and—as shown in the current study—litigation. Indeed, another vignette experiment demonstrated that when a radiologist provided a false-negative report despite AI correctly observing a pathology, mock jurors were more likely to side with the radiologist-defendant if AI accuracy data was provided20. Implementation strategies addressing human factors could reduce existing concerns surrounding trust, transparency, accountability, and legal liability5,6,7,8.

Lastly, our study highlights that financial considerations shape participants’ decision-making. This finding is important to contextualize against disparities in imaging acquisition and interpretation21. In breast imaging, marginalized populations—including racial and ethnic minorities and those uninsured or of lower socioeconomic status—bear a disproportionate burden of cancer morbidity and mortality due to social determinants22. For instance, newer advances in breast imaging, such as digital breast tomosynthesis, are not universally available and often have higher costs23. Similarly, incorporating AI tools may introduce new financial barriers and exacerbate existing disparities, as our study demonstrates the mismatch between the desire for follow-up and the inability to afford it.

This study assessed participants’ decision-making behaviors based on hypothetical imaging results bearing no actual consequence on or relationship to the participant, which limits ecological validity. Real-world constraints such as time, financial barriers, and other logistical considerations affect patients’ ability to seek second opinions and further imaging, and participants may not accurately predict their own behavior. By collapsing the inherently complex process of litigation into a single yes/no item regarding attorney consultation, the overall rates may not be generalizable, although the comparisons between conditions still likely apply to real-world medical malpractice litigation; the purpose of this study was to examine how attitudes toward litigation changed across conditions, not to estimate how many people would pursue legal action. The demographic composition of the study population, as well as the inclusion criteria focused on English-speaking women living in the US, limits the generalizability of the study’s findings; it is also unclear to what extent they would generalize to legal systems outside the US. In addition, financial reasoning could only be described here without inference, as these results were conditioned on responses to a prior item.

Additionally, the experimental conditions included only a BI-RADS 1 radiologist report with two subthreshold and two suprathreshold AI scores. Other combinations of concordance and discordance between radiologist and AI were not assessed. There was also no condition in which the radiologist was given additional information from a source other than AI. Future studies may benefit from including a condition in which a second radiologist provides feedback to the interpreting radiologist, and power analyses for determining the sample size of such research could be based on the effect sizes observed here. Inclusion of additional comprehension check items to ensure that participants understand the interpretation of the FDR and FOR would also be helpful, as would examining participants’ attitudes toward and experience with AI as potential moderators. Finally, this study was designed only to examine overall effects, rather than to explore potential mechanisms, which remain an avenue for future research.

Our study demonstrates that disclosing AI mammography results can impact patients’ interest in pursuing a medical malpractice claim against a radiologist; however, this can be mitigated by providing greater contextual information about the AI system. Moreover, access to AI results impacts patients’ behaviors pertaining to follow-up and checking. Best practices are needed to engage and inform patients about the application of AI tools in their care24. AI outputs that include “flags” and specific AI scores should be accompanied by their corresponding error rates so patients can better understand that these scores carry inherent uncertainty.

Materials and methods

Participants and setting

Participants were recruited online through Prolific and were eligible if they identified as a female, English-speaking adult living in the United States (US) with at least one prior mammogram. In total, n = 1623 participants were included (n = 361 and n = 802 were excluded for failing a comprehension check and a manipulation check, respectively). Demographic information is shown in Table 1. A target of approximately 80 or more completed observations per condition was set based on practical (i.e., budget) constraints. This would allow an effect size of approximately d = 0.65 to be detected, assuming 0.80 power and an alpha of 0.05 with 28 comparisons using the Šidák correction.
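As a rough check of this logic, the per-comparison alpha and implied sample size can be reproduced with a standard power calculation. The sketch below is illustrative only; it assumes a two-sample, two-tailed comparison and uses statsmodels, neither of which the paper specifies for this step.

```python
# Illustrative sketch only (assumption: a two-sample, two-tailed t-test power
# model; the paper does not report the tool used for this calculation).
from statsmodels.stats.power import TTestIndPower

m = 28                                   # number of planned comparisons
alpha_sidak = 1 - (1 - 0.05) ** (1 / m)  # Sidak-corrected per-comparison alpha

n_per_group = TTestIndPower().solve_power(
    effect_size=0.65,        # target detectable Cohen's d
    alpha=alpha_sidak,
    power=0.80,
    alternative="two-sided",
)
print(f"alpha = {alpha_sidak:.5f}, n per group ~ {n_per_group:.0f}")
# alpha ~ 0.00183, n ~ 75 per group, consistent with the target of
# roughly 80 completed observations per condition
```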

Procedure

Online participants who passed screening and provided consent were directed to a Qualtrics survey, where they were randomized to one of 13 conditions. Participants were compensated $1.25 for completing the survey and $0.14 if they screened out for never having had a mammogram. Participants were prevented from participating more than once. The protocol was approved by the Brown University Health Institutional Review Board (IRB009824).

Vignette design and experimental conditions

All participants were shown a vignette asking them to imagine they had recently received their annual screening mammogram and logged into their patient portal to access the results. In all conditions, the hypothetical results consisted of a standard radiologist’s report with a BI-RADS 1 (Negative) determination and a 2D mammogram. No additional information was provided in the control condition.

In the remaining twelve conditions, participants were provided an additional AI-interpreted report with one of four AI scores: 0 (Not flagged for abnormality), 29 (Not flagged for abnormality), 31 (Flagged for abnormality), or 50 (Flagged for abnormality). For each AI score there were three information conditions: (1) the AI score alone; (2) the AI score plus the abnormality cutoff threshold of 30 on a 0–100 scale; or (3) the AI score, the cutoff threshold, and an error rate: the False Discovery Rate (FDR, the complement of the positive predictive value, 1 − PPV) for the AI score = 31 or 50 conditions, or the False Omission Rate (FOR, the complement of the negative predictive value, 1 − NPV) for the AI score = 0 or 29 conditions (Supplemental Table 2). For all 12 AI conditions, the 2D mammogram included the corresponding AI information (see Appendix II for full vignettes). The resulting 13-cell design is sketched below.
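A minimal sketch of the condition structure (illustrative pseudocode, not the study's assignment software) makes the design concrete: one control cell plus 4 AI scores crossed with 3 levels of contextual information.

```python
# Illustrative sketch of the 13-condition design: control plus
# 4 AI scores x 3 levels of contextual information.
CUTOFF = 30

conditions = [{"name": "No AI (control)"}]
for score in (0, 29, 31, 50):
    flagged = score > CUTOFF                  # 31 and 50 are flagged; 0 and 29 are not
    rate = ("FDR", "95%") if flagged else ("FOR", "0.2%")
    conditions.append({"score": score, "flagged": flagged})                    # score alone
    conditions.append({"score": score, "flagged": flagged, "cutoff": CUTOFF})  # + cutoff
    conditions.append({"score": score, "flagged": flagged, "cutoff": CUTOFF,
                       "error_rate": rate})                                    # + error rate

assert len(conditions) == 13
```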

To maximize ecological validity, the hypothetical AI’s performance was based on a commercially available AI mammography interpretation software, Lunit INSIGHT MMG version 1.1.7.1 (Lunit, Seoul, South Korea)25. Using the optimal threshold of 30 with 72.38% sensitivity and 92.86% specificity described by Seker et al., and assuming a breast cancer prevalence of 0.57%, the corresponding FDR and FOR were 95% and 0.2%, respectively. These were the FDR and FOR values given and explained to participants25.
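For reference, these error rates follow directly from Bayes’ rule applied to the reported operating point (sensitivity Se = 0.7238, specificity Sp = 0.9286, prevalence p = 0.0057):

$$\mathrm{FDR} = 1-\mathrm{PPV} = 1-\frac{\mathrm{Se}\,p}{\mathrm{Se}\,p+(1-\mathrm{Sp})(1-p)} = 1-\frac{0.7238\times 0.0057}{0.7238\times 0.0057+0.0714\times 0.9943}\approx 0.95,$$

$$\mathrm{FOR} = 1-\mathrm{NPV} = 1-\frac{\mathrm{Sp}(1-p)}{\mathrm{Sp}(1-p)+(1-\mathrm{Se})\,p} = 1-\frac{0.9286\times 0.9943}{0.9286\times 0.9943+0.2762\times 0.0057}\approx 0.002.$$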

Measures

Checking behavior (i.e., behavior to reduce anxiety)

Four questions probed hypothetical decisions participants would make after receiving the mammogram report: Would you request 1) “a second opinion from another radiologist?” 2) “a follow-up meeting with the radiologist who wrote your original report?” 3) “a follow-up consult with your primary care physician (including OB/GYN)?” and 4) “follow-up imaging that is more accurate even if a physician says it is not necessary for you?” For each question, participants could choose between: 1) “No”, 2) “Yes, even if I have to pay for it out of pocket”, or 3) “Yes, but only if I don’t pay for it (if, for example, insurance pays)”. If they chose option 3, they were asked why: 1) “I cannot afford to pay” or 2) “I do not want to pay.”

Litigation

As the primary outcome, to measure participants’ inclination to sue the radiologist, participants were told, “Imagine one year later you receive a new mammogram with a new report from a different radiologist that indicates evidence of potential breast cancer. A biopsy confirms that breast cancer is present and is stage 3 (which indicates advanced cancer). Would you consult an attorney to explore suing the original radiologist for not having found the cancer in the previous year?” (yes/no).

Trust

Two questions measured participants’ trust in the decisions of the radiologist and AI system depicted in the vignette, from 0 (not at all) to 100 (very much): “Do you ultimately trust the…” 1) “radiologist’s decision?” 2) “AI’s decision?”

Concern

One question measured concern about breast cancer: “How, if at all, would these results change your concern about having breast cancer?” (−50, greatly decrease; 0, no change; 50, greatly increase).

Comprehension and manipulation check

All participants had to correctly identify “No mammographic evidence of malignancy” as a comprehension check. All participants in AI conditions had to correctly recall the AI score (0, 29, 31, or 50) and the “Flagged”/“Not flagged” for abnormality determination. Those in the cutoff conditions had to recall the correct cutoff of “30.” Those provided a FOR or FDR had to correctly recall “The AI had a False Omission Rate of 0.2%” or “The AI had a False Discovery Rate of 95%”, respectively.

Statistical analyses

All analyses were conducted using SAS software 9.4 (SAS Institute, Cary, NC). Attitudes (the dependent variables) were modeled by experimental condition (the independent variable) using generalized linear models assuming a normal or binary distribution, as appropriate, with the GLIMMIX procedure, using contrast comparisons reflecting the hypotheses. Trust in the radiologist and AI, an exploratory aim, was modeled as an interaction with condition (i.e., condition × rater, AI vs. radiologist) using generalized estimating equations with sandwich estimation. All interval estimates were calculated at 95% confidence, and alpha was set at the 0.05 level. P-values were adjusted for multiple comparisons using the Benjamini–Hochberg false discovery rate method. Tests of superiority corresponding with hypotheses were one-tailed; otherwise, two-tailed tests were used.
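For readers replicating the multiplicity correction outside SAS, the following is a minimal sketch of the Benjamini–Hochberg adjustment, shown here in Python via statsmodels with hypothetical raw p-values; the study’s analyses themselves were run in SAS 9.4.

```python
# Minimal sketch of Benjamini-Hochberg FDR adjustment (hypothetical raw
# p-values for illustration only; the study's analyses were run in SAS 9.4).
from statsmodels.stats.multitest import multipletests

p_raw = [0.0001, 0.004, 0.021, 0.30]  # hypothetical raw contrast p-values
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="fdr_bh")

for raw, adj, sig in zip(p_raw, p_adj, reject):
    print(f"raw p = {raw:.4f}  BH-adjusted p = {adj:.4f}  reject H0: {sig}")
```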