Table 1 Model performance.

Metric	Internal dataset: Stanford	Internal dataset: Stanford (real prevalence)	External dataset: Intermountain	External dataset: Intermountain (real prevalence)
Accuracy	0.77 [0.76–0.78]	0.81 [0.80–0.82]	0.78 [0.77–0.78]	0.80 [0.79–0.81]
AUROC	0.84 [0.82–0.87]	0.84 [0.79–0.90]	0.85 [0.81–0.88]	0.85 [0.80–0.90]
Specificity	0.82 [0.81–0.83]	0.82 [0.82–0.83]	0.80 [0.79–0.81]	0.81 [0.80–0.82]
Sensitivity	0.73 [0.72–0.74]	0.75 [0.73–0.77]	0.75 [0.74–0.76]	0.75 [0.73–0.77]
PPV/precision	0.81 [0.80–0.81]	0.47 [0.45–0.48]	0.77 [0.76–0.78]	0.44 [0.43–0.46]
NPV	0.75 [0.74–0.76]	0.94 [0.94–0.95]	0.78 [0.77–0.79]	0.94 [0.94–0.95]

Model performance on the internal test set (Stanford) and the external test set (Intermountain), reported with 95% confidence intervals at a probability threshold of 0.55, the value that maximizes both sensitivity and specificity on the Stanford validation dataset. For the "real prevalence" columns, bootstrapping is used to resample the test sets to the real-world prevalence of PE (between 14% and 22%).
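
The drop in PPV and the rise in NPV in the real-prevalence columns follow from Bayes' rule: sensitivity and specificity stay roughly constant under resampling, while the predictive values depend on prevalence. The sketch below illustrates this relationship. It is not the authors' bootstrapping procedure; the function name `ppv_npv` is hypothetical, and the prevalence of 0.18 is an assumed midpoint of the 14–22% range.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by Bayes' rule at a given prevalence."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity from the Stanford real-prevalence column.
# prevalence=0.18 is an assumption (midpoint of the stated 14-22% range).
ppv, npv = ppv_npv(sensitivity=0.75, specificity=0.82, prevalence=0.18)
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")
```

Under these assumptions the computed values are approximately 0.48 and 0.94, close to the 0.47 and 0.94 reported in the table.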
