Table 3: Outcome metrics for standard double reading versus double reading plus the AI-assisted additional-reader workflow

From: Prospective implementation of AI-assisted screen reading to improve early detection of breast cancer

| Variable | Double reading, Num/Denom | Double reading, value (95% CI) | Double reading plus the AI-assisted additional-reader workflow, Num/Denom | Double reading plus the AI-assisted additional-reader workflow, value (95% CI) | Difference (AI workflow minus double reading) |
|---|---|---|---|---|---|
| Results of phase 1, pilot rollout (1 site, 1 additional arbitrator, additional arbitration cases were single read), n = 3,746 screens | | | | | |
| CDR (per 1,000 cases) | 48/3,746 | 12.8 (9.7–16.9) | 54/3,746 | 14.4 (11.1–18.8) | 1.6ᵃ |
| RR (%) | 250/3,746 | 6.7 (5.9–7.5) | 256/3,746 | 6.8 (6.1–7.7) | 0.2 |
| Sen (%) | 48/58 | 82.8 (71.1–90.4) | 54/58 | 93.1 (83.6–97.3) | 10.3ᵃ |
| Spec (%) | 3,486/3,688 | 94.5 (93.7–95.2) | 3,486/3,688 | 94.5 (93.7–95.2) | 0.0 |
| PPV (%) | 48/250 | 19.2 (14.8–24.5) | 54/256 | 21.1 (16.5–26.5) | 1.9 |
| Arbitration rate (%) | 114/3,746 | 3.0 (2.5–3.6) | 510/3,746 | 13.6 (12.6–14.8) | 10.6 |
| Positive discordance rate (%) | – | – | 396/3,746 | 10.6 (9.6–11.6) | – |
| RR of additional arbitration (%) | – | – | 6/396 | 1.5 (0.7–3.3) | – |
| PPV of additional arbitration (%) | – | – | 6/6 | 100 (61.0–100) | – |
| Results of phase 2, extended pilot (4 sites, 3 additional arbitrators, all additional arbitration cases were read by each additional reader), n = 9,112 screens | | | | | |
| CDR (per 1,000 cases) | 126/9,112 | 13.8 (11.6–16.4) | 139/9,112 | 15.3 (12.9–18.0) | 1.4ᵃ |
| RR (%) | 639/9,112 | 7.0 (6.5–7.6) | 661/9,112 | 7.3 (6.7–7.8) | 0.2 |
| Sen (%) | 126/145 | 86.9 (80.4–91.4) | 139/145 | 95.9 (91.3–98.1) | 9.0ᵃ |
| Spec (%) | 8,454/8,967 | 94.3 (93.8–94.7) | 8,445/8,967 | 94.2 (93.7–94.6) | −0.1 |
| PPV (%) | 126/639 | 19.7 (16.8–23.0) | 139/661 | 21.0 (18.1–24.3) | 1.3 |
| Arbitration rate (%) | 270/9,112 | 3.0 (2.6–3.3) | 1,294/9,112 | 14.2 (13.5–14.9) | 11.2 |
| Positive discordance rate (%) | – | – | 1,024/9,112 | 11.2 (10.6–11.9) | – |
| RR of additional arbitration (%) | – | – | 22/1,024 | 2.1 (1.4–3.2) | – |
| PPV of additional arbitration (%) | – | – | 13/22 | 59.1 (38.7–76.7) | – |
| Results of phase 3, live use in standard clinical practice (4 sites, 3 additional arbitrators, additional arbitration cases were single read), n = 15,953 screens | | | | | |
| CDR (per 1,000 cases) | 238/15,953 | 14.9 (13.2–16.9) | 249/15,953 | 15.6 (13.8–17.7) | 0.7ᵃ |
| RR (%) | 1,228/15,953 | 7.7 (7.3–8.1) | 1,276/15,953 | 8.0 (7.6–8.4) | 0.3 |
| Sen (%) | 238/253 | 94.1 (90.4–96.4) | 249/253 | 98.4 (96.0–99.4) | 4.3ᵃ |
| Spec (%) | 14,710/15,700 | 93.7 (93.3–94.1) | 14,673/15,700 | 93.5 (93.1–93.8) | −0.2 |
| PPV (%) | 238/1,228 | 19.4 (17.3–21.7) | 249/1,276 | 19.5 (17.4–21.8) | 0.1 |
| Arbitration rate (%) | 529/15,953 | 3.3 (3.0–3.6) | 1,715/15,953 | 10.8 (10.3–11.2) | 7.4 |
| Positive discordance rate (%) | – | – | 1,186/15,953 | 7.4 (7.0–7.9) | – |
| RR of additional arbitration (%) | – | – | 48/1,186 | 4.0 (3.1–5.3) | – |
| PPV of additional arbitration (%) | – | – | 11/48 | 22.9 (13.3–36.5) | – |
Num, numerator; Denom, denominator; CI, confidence interval; CDR, cancer detection rate; RR, recall rate; Sen, sensitivity; Spec, specificity; PPV, positive predictive value. See metric definitions in Methods.
ᵃ The two-sided McNemar test used to assess CDR and Sen differences between double reading and double reading plus the AI-assisted additional-reader workflow gave P values of 0.031, 0.0002 and 0.001 for phases 1, 2 and 3, respectively. The McNemar test is based on the binomial distribution. Continuity correction was applied.
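The core metrics in Table 3 can be recomputed from four counts per reading strategy. The sketch below is illustrative only. It assumes the standard definitions (CDR and recall rate over all screens, sensitivity over all cancers, specificity over all non-cancers, PPV over recalls), which match the table's numerators and denominators. It also assumes Wilson score 95% intervals, because the table does not name its interval method. `wilson_ci` and `screening_metrics` are hypothetical helpers, not the authors' code.

```python
from __future__ import annotations

import math

# Two-sided 95% standard-normal quantile.
Z95 = 1.959963984540054


def wilson_ci(x: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval for the proportion x / n (assumed CI method)."""
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error when x == 0 or x == n.
    return max(0.0, centre - half), min(1.0, centre + half)


def screening_metrics(screens: int, cancers: int, recalls: int,
                      detected: int) -> dict[str, tuple]:
    """Hypothetical helper: Table 3-style metrics for one reading strategy.

    screens  -- screens read in the phase (CDR and RR denominator)
    cancers  -- cancers in the cohort (sensitivity denominator)
    recalls  -- screens recalled (PPV denominator)
    detected -- cancers among the recalled screens
    Returns {metric: (num, denom, value, ci_low, ci_high)}, values rounded to 1 dp.
    """
    negatives = screens - cancers                      # specificity denominator
    true_negatives = negatives - (recalls - detected)  # non-cancers not recalled
    definitions = {
        "CDR (per 1,000)": (detected, screens, 1000),
        "RR (%)": (recalls, screens, 100),
        "Sen (%)": (detected, cancers, 100),
        "Spec (%)": (true_negatives, negatives, 100),
        "PPV (%)": (detected, recalls, 100),
    }
    results = {}
    for name, (num, den, scale) in definitions.items():
        low, high = wilson_ci(num, den)
        results[name] = (num, den, round(scale * num / den, 1),
                         round(scale * low, 1), round(scale * high, 1))
    return results


# Phase 1 counts from Table 3 (3,746 screens, 58 cancers).
for label, recalls, detected in [("Double reading", 250, 48),
                                 ("Plus AI workflow", 256, 54)]:
    print(label)
    for name, row in screening_metrics(3746, 58, recalls, detected).items():
        print(f"  {name}: {row}")
```

Run on the phase 1 counts, the sketch gives a double-reading sensitivity of 82.8% with an interval of 71.1–90.4. The Difference column appears to be the workflow value minus the double-reading value, taken before rounding and expressed in the metric's own units (per 1,000 screens or percentage points). For example, the phase 1 recall rates of 6.83 and 6.67 give a difference of about 0.2.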
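The add-on rows follow from simple count arithmetic. In every phase, the workflow's arbitrations equal the standard arbitrations plus the positive-discordance screens (114 + 396 = 510 in phase 1). Its recalls and detected cancers equal the double-reading totals plus those added by additional arbitration (250 + 6 = 256 and 48 + 6 = 54). The sketch below encodes those identities. It assumes, as the counts imply, that the workflow only adds arbitrations and recalls on top of double reading. `PhaseCounts` and `addon_summary` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PhaseCounts:
    """Counts for one phase; field names are hypothetical labels for Table 3 rows."""
    screens: int
    standard_arbitrations: int    # arbitrations under standard double reading
    additional_arbitrations: int  # positive-discordance screens sent to additional arbitration
    dr_recalls: int               # recalls under double reading
    dr_detected: int              # cancers detected by double reading
    extra_recalls: int            # recalls added by additional arbitration
    extra_detected: int           # cancers among those added recalls


def addon_summary(c: PhaseCounts) -> dict[str, float]:
    """Derive the add-on rows and combined-workflow counts, assuming the
    workflow only adds arbitrations and recalls on top of double reading."""
    total_arbitrations = c.standard_arbitrations + c.additional_arbitrations
    return {
        "arbitration rate, DR (%)": 100 * c.standard_arbitrations / c.screens,
        "arbitration rate, DR + AI (%)": 100 * total_arbitrations / c.screens,
        "positive discordance rate (%)": 100 * c.additional_arbitrations / c.screens,
        "RR of additional arbitration (%)": 100 * c.extra_recalls / c.additional_arbitrations,
        "PPV of additional arbitration (%)": 100 * c.extra_detected / c.extra_recalls,
        "recalls, DR + AI": c.dr_recalls + c.extra_recalls,
        "cancers detected, DR + AI": c.dr_detected + c.extra_detected,
    }


# Phase 3 counts taken from Table 3.
phase3 = PhaseCounts(screens=15_953, standard_arbitrations=529, additional_arbitrations=1_186,
                     dr_recalls=1_228, dr_detected=238, extra_recalls=48, extra_detected=11)
for name, value in addon_summary(phase3).items():
    # Rates are floats (print to 1 dp); combined counts are ints.
    print(f"{name}: {value:.1f}" if isinstance(value, float) else f"{name}: {value}")
```

One consequence of these identities is that the arbitration-rate difference equals the positive discordance rate: 10.6, 11.2 and 7.4 percentage points in phases 1 to 3.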
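The footnoted P values can be checked with an exact McNemar test. Under the null hypothesis, this test treats the discordant screens (cancer detected by one strategy but not the other) as Binomial(b + c, 0.5). This minimal sketch infers its inputs from the counts. It assumes that no cancer detected by double reading was missed by the combined workflow, so c = 0 and b is the number of additional cancers (6, 13 and 11). It uses the exact binomial form with no continuity correction. `mcnemar_exact` is a hypothetical helper, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from discordant-pair counts b and c.

    Under H0 the b + c discordant pairs split as Binomial(b + c, 0.5); the
    two-sided P value doubles the smaller tail, capped at 1. No continuity
    correction is applied.
    """
    n = b + c
    k = min(b, c)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Discordant counts inferred from Table 3 (assumption): b = cancers detected only
# with the AI-assisted workflow, c = cancers detected only by double reading.
phases = {"phase 1": (54 - 48, 0), "phase 2": (139 - 126, 0), "phase 3": (249 - 238, 0)}
for name, (b, c) in phases.items():
    print(f"{name}: b={b}, c={c}, P = {mcnemar_exact(b, c):.5f}")
```

With c = 0, the P value reduces to 2 × 0.5^b. For b = 6, 13 and 11 this gives 0.03125, about 0.00024 and about 0.00098.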