Table 2 Comparison of patient-level diagnostic performance between the AI model and PI-RADS in all datasets

From: Automated MRI system for clinically significant prostate cancer detection development validation and real-world implementation

Dataset	AUC (95% CI)	Z value	P value^a	Sensitivity	Specificity	Accuracy	PPV	NPV
AI model
Training set	0.94(0.93–0.95)	-	-	0.86(1345/1567)	0.92(2519/2727)	0.90(3864/4294)	0.87(1345/1553)	0.92(2519/2741)
Validation set	0.88(0.85–0.91)	-	-	0.90(239/267)	0.74(157/211)	0.83(396/478)	0.82(239/293)	0.85(157/185)
Test set 1-5	0.93(0.91–0.95)	-	-	0.90(319/355)	0.87(419/484)	0.88(738/839)	0.83(319/384)	0.92(419/455)
Test set 1	0.93(0.90–0.96)	-	-	0.93(111/119)	0.81(103/127)	0.87(214/246)	0.82(111/135)	0.93(103/111)
Test set 2	0.86(0.81–0.92)	-	-	0.83(64/77)	0.76(60/79)	0.79(124/156)	0.77(64/83)	0.82(60/73)
Test set 3	0.90(0.79–1.00)	-	-	0.83(10/12)	0.86(25/29)	0.85(35/41)	0.71(10/14)	0.93(25/27)
Test set 4	0.92(0.84–0.99)	-	-	0.79(15/19)	0.91(29/32)	0.86(44/51)	0.83(15/18)	0.88(29/33)
Test set 5	0.96(0.94–0.99)	-	-	0.93(119/128)	0.93(202/217)	0.93(321/345)	0.89(119/134)	0.96(202/211)
Test set TCIA	0.83(0.78–0.88)	-	-	0.75(109/146)	0.76(81/106)	0.75(190/252)	0.81(109/134)	0.69(81/118)
PI-RADS
Training set	0.90(0.89–0.91)	8.674	<0.001	0.91(1423/1567)	0.74(2028/2727)	0.80(3451/4294)	0.67(1423/2122)	0.93(2028/2172)
Validation set	0.85(0.81–0.88)	1.878	0.060	0.93(247/267)	0.47(99/211)	0.72(346/478)	0.69(247/359)	0.83(99/119)
Test set 1-5	0.93(0.92–0.95)	0.274	0.784	0.98(347/355)	0.65(316/484)	0.79(663/839)	0.67(347/515)	0.98(316/324)
Test set 1	0.91(0.88–0.95)	0.827	0.408	0.97(116/119)	0.64(81/127)	0.80(197/246)	0.72(116/162)	0.96(81/84)
Test set 2	0.90(0.85–0.94)	1.343	0.179	0.99(76/77)	0.41(32/79)	0.69(108/156)	0.62(76/123)	0.97(32/33)
Test set 3	0.93(0.87–1.00)	0.662	0.508	1.00(12/12)	0.62(18/29)	0.73(30/41)	0.52(12/23)	1.00(18/18)
Test set 4	0.93(0.86–1.00)	0.403	0.687	1.00(19/19)	0.62(20/32)	0.76(39/51)	0.61(19/31)	1.00(20/20)
Test set 5	0.96(0.94–0.98)	0.003	0.998	0.97(124/128)	0.76(165/217)	0.84(289/345)	0.70(124/176)	0.98(165/169)
Test set TCIA	0.85(0.80–0.89)	1.153	0.249	0.94(143/152)	0.42(45/108)	0.72(188/260)	0.69(143/206)	0.83(45/54)

AI artificial intelligence, AUC area under the curve, CI confidence interval, PI-RADS prostate imaging reporting and data system, PPV positive predictive value, NPV negative predictive value.
^aP values were calculated with the DeLong test comparing the AUCs of the AI model and PI-RADS scores.
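The Z and P columns can be related with a short calculation. The sketch below assumes the reported P values are two-sided tail probabilities of a standard normal distribution evaluated at the DeLong Z statistic; the table does not state this explicitly, but the assumption reproduces the reported values to three decimals. The function name `two_sided_p` is illustrative and does not come from the article.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard-normal test statistic z (assumed form)."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), with Phi the standard normal CDF.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z values copied from the PI-RADS rows of the table.
reported_z = {
    "Validation set": 1.878,
    "Test set 1-5": 0.274,
    "Test set TCIA": 1.153,
}

for name, z in reported_z.items():
    print(f"{name}: Z = {z}, P = {two_sided_p(z):.3f}")
```

Because the DeLong statistic is approximately standard normal under the null hypothesis of equal AUCs, no additional library is needed for this check.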
AUCs are calculated from continuous scores. For the AI model, sensitivity/specificity/accuracy/PPV/NPV in each test set are computed at a per-dataset threshold selected by maximizing Youden’s J; for PI-RADS, a fixed threshold of ≥3 is used. The pooled ‘Test set 1-5’ metrics are micro-averaged by summing TP/FP/TN/FN across the datasets at their respective thresholds.
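The pooling and threshold selection described in the note above can be sketched as follows. The confusion counts are derived from the AI-model rows of the table (FN = positives minus TP, FP = negatives minus TN). The helper names `Counts`, `metrics`, and `youden_threshold` are illustrative, and the toy scores at the end are hypothetical values, not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one dataset (illustrative helper)."""
    tp: int
    fn: int
    tn: int
    fp: int

    def __add__(self, other: "Counts") -> "Counts":
        return Counts(self.tp + other.tp, self.fn + other.fn,
                      self.tn + other.tn, self.fp + other.fp)


def metrics(c: Counts) -> dict:
    """Patient-level metrics computed from confusion counts."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "accuracy": (c.tp + c.tn) / total,
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def youden_threshold(scores, labels):
    """Return (t, J) maximizing J = sensitivity + specificity - 1,
    calling a case positive when score >= t. Ties keep the lowest t."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# AI-model counts per test set, derived from the table's n/N fractions.
AI_TEST_SETS = {
    "Test set 1": Counts(tp=111, fn=8, tn=103, fp=24),
    "Test set 2": Counts(tp=64, fn=13, tn=60, fp=19),
    "Test set 3": Counts(tp=10, fn=2, tn=25, fp=4),
    "Test set 4": Counts(tp=15, fn=4, tn=29, fp=3),
    "Test set 5": Counts(tp=119, fn=9, tn=202, fp=15),
}

# Micro-average: sum the counts first, then compute each metric once.
pooled = sum(AI_TEST_SETS.values(), Counts(0, 0, 0, 0))
print(pooled)
print({name: round(value, 2) for name, value in metrics(pooled).items()})

# Hypothetical toy scores, only to show the threshold search.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(youden_threshold(toy_scores, toy_labels))
```

Summing counts before dividing weights each dataset by its size, so the pooled row can differ from a simple mean of the per-dataset values.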
