Table 2 Comparison of patient-level diagnostic performance between the AI model and PI-RADS in all datasets

From: Automated MRI system for clinically significant prostate cancer detection: development, validation, and real-world implementation

 

|  | AUC (95%CI) | Z value | P value^a | Sensitivity | Specificity | Accuracy | PPV | NPV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AI model |  |  |  |  |  |  |  |  |
| Training set | 0.94(0.93–0.95) | - | - | 0.86(1345/1567) | 0.92(2519/2727) | 0.90(3864/4294) | 0.87(1345/1553) | 0.92(2519/2741) |
| Validation set | 0.88(0.85–0.91) | - | - | 0.90(239/267) | 0.74(157/211) | 0.83(396/478) | 0.82(239/293) | 0.85(157/185) |
| Test set 1-5 | 0.93(0.91–0.95) | - | - | 0.90(319/355) | 0.87(419/484) | 0.88(738/839) | 0.83(319/384) | 0.92(419/455) |
| Test set 1 | 0.93(0.90–0.96) | - | - | 0.93(111/119) | 0.81(103/127) | 0.87(214/246) | 0.82(111/135) | 0.93(103/111) |
| Test set 2 | 0.86(0.81–0.92) | - | - | 0.83(64/77) | 0.76(60/79) | 0.79(124/156) | 0.77(64/83) | 0.82(60/73) |
| Test set 3 | 0.90(0.79–1.0) | - | - | 0.83(10/12) | 0.86(25/29) | 0.85(35/41) | 0.71(10/14) | 0.93(25/27) |
| Test set 4 | 0.92(0.84–0.99) | - | - | 0.79(15/19) | 0.91(29/32) | 0.86(44/51) | 0.83(15/18) | 0.88(29/33) |
| Test set 5 | 0.96(0.94–0.99) | - | - | 0.93(119/128) | 0.93(202/217) | 0.93(321/345) | 0.89(119/134) | 0.96(202/211) |
| Test set TCIA | 0.83(0.78–0.88) |  |  | 0.75(109/146) | 0.76(81/106) | 0.75(190/252) | 0.81(109/134) | 0.69(81/118) |
| PI-RADS |  |  |  |  |  |  |  |  |
| Training set | 0.90(0.89–0.91) | 8.674 | <0.001 | 0.91(1423/1567) | 0.74(2028/2727) | 0.80(3451/4294) | 0.67(1423/2122) | 0.93(2028/2172) |
| Validation set | 0.85(0.81–0.88) | 1.878 | 0.060 | 0.93(247/267) | 0.47(99/211) | 0.72(346/478) | 0.69(247/359) | 0.83(99/119) |
| Test set 1-5 | 0.93(0.92–0.95) | 0.274 | 0.784 | 0.98(347/355) | 0.65(316/484) | 0.79(663/839) | 0.67(347/515) | 0.98(316/324) |
| Test set 1 | 0.91(0.88–0.95) | 0.827 | 0.408 | 0.97(116/119) | 0.64(81/127) | 0.80(197/246) | 0.72(116/162) | 0.96(81/84) |
| Test set 2 | 0.90(0.85–0.94) | 1.343 | 0.179 | 0.99(76/77) | 0.41(32/79) | 0.69(108/156) | 0.62(76/123) | 0.97(32/33) |
| Test set 3 | 0.93(0.87–1.0) | 0.662 | 0.508 | 1.00(12/12) | 0.62(18/29) | 0.73(30/41) | 0.52(12/23) | 1.00(18/18) |
| Test set 4 | 0.93(0.86–1.0) | 0.403 | 0.687 | 1.00(19/19) | 0.62(20/32) | 0.76(39/51) | 0.61(19/31) | 1.00(20/20) |
| Test set 5 | 0.96(0.94–0.98) | 0.003 | 0.998 | 0.97(124/128) | 0.76(165/217) | 0.84(289/345) | 0.70(124/176) | 0.98(165/169) |
| Test set TCIA | 0.85(0.80–0.89) | 1.153 | 0.249 | 0.94(143/152) | 0.42(45/108) | 0.72(188/260) | 0.69(143/206) | 0.83(45/54) |

  1. AI artificial intelligence, AUC area under the curve, CI confidence interval, PI-RADS prostate imaging reporting and data system, PPV positive predictive value, NPV negative predictive value.
  2. ^a P-values calculated from the DeLong test between the AI model and PI-RADS scores.
  3. AUCs are calculated from continuous scores. For the AI model, sensitivity/specificity/accuracy/PPV/NPV in each test set are computed at a per-dataset threshold selected by maximizing Youden’s J; for PI-RADS, a fixed threshold of ≥3 is used. The pooled ‘Test set 1-5’ metrics are micro-averaged by summing TP/FP/TN/FN across the datasets at their respective thresholds.
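
The arithmetic described in note 3 can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code. The confusion counts are read off the AI model's sensitivity and specificity fractions in the table (for example, Test set 1: 111/119 gives TP = 111, FN = 8; 103/127 gives TN = 103, FP = 24). The names `Counts`, `AI_TEST_SETS`, and `youden_threshold` are hypothetical, and the Youden search is a generic version that assumes a patient is called positive when the score is at or above the cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Patient-level confusion counts for one dataset (hypothetical helper)."""
    tp: int
    fn: int
    tn: int
    fp: int

    def __add__(self, other: "Counts") -> "Counts":
        # Micro-averaging: add raw counts, then compute metrics once.
        return Counts(self.tp + other.tp, self.fn + other.fn,
                      self.tn + other.tn, self.fp + other.fp)

    def metrics(self) -> dict:
        total = self.tp + self.fn + self.tn + self.fp
        return {
            "sensitivity": self.tp / (self.tp + self.fn),
            "specificity": self.tn / (self.tn + self.fp),
            "accuracy": (self.tp + self.tn) / total,
            "ppv": self.tp / (self.tp + self.fp),
            "npv": self.tn / (self.tn + self.fn),
        }


# AI model counts for test sets 1-5, derived from the table's fractions.
AI_TEST_SETS = {
    "Test set 1": Counts(tp=111, fn=8, tn=103, fp=24),
    "Test set 2": Counts(tp=64, fn=13, tn=60, fp=19),
    "Test set 3": Counts(tp=10, fn=2, tn=25, fp=4),
    "Test set 4": Counts(tp=15, fn=4, tn=29, fp=3),
    "Test set 5": Counts(tp=119, fn=9, tn=202, fp=15),
}


def youden_threshold(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    Assumes labels are 0/1 with both classes present, and that a case is
    predicted positive when score >= cutoff. Ties keep the lowest cutoff.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Pool test sets 1-5 by summing counts across datasets.
pooled = sum(AI_TEST_SETS.values(), Counts(0, 0, 0, 0))

if __name__ == "__main__":
    for name, value in pooled.metrics().items():
        print(f"{name}: {value:.2f}")
```

Summing the five per-dataset counts gives 319/355, 419/484, 738/839, 319/384, and 419/455, which are the fractions reported in the AI model's "Test set 1-5" row. Because each dataset contributes in proportion to its size, the pooled values are dominated by the larger cohorts (test sets 1 and 5) rather than being a simple mean of the five per-dataset rates.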