Fig. 3
From: Generalizable deep neural networks for image quality classification of cervical images

Classification performance metrics on the “Internal Validation Set” (“Test Set 2”) for the models investigated. The models are arranged from top to bottom in order of decreasing performance. Panel (a) shows the discrete classification metrics: %extreme misclassifications (% ext. mis.), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ). Panel (b) shows the Kappa metrics (linear and quadratic weighted). Panel (c) shows the area under the receiver operating characteristic curve (AUROC) for each of the low quality (LQ) versus rest and high quality (HQ) versus rest categories. While our top models performed similarly on the continuous metrics (panels b and c), the discrete metrics (panel a) separated the top-performing model from its competitors. Our best-performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total % ext. mis. of 2.8%. The model ranking is consistent with the ranking observed on the “Model Selection Set” (“Test Set 1”) (Supp. Fig. 1).
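For readers who want to reproduce metrics of this kind on their own predictions, the following is a minimal sketch and not the authors' evaluation code. It assumes three ordinal quality classes encoded as 0 (LQ), 1 (intermediate) and 2 (HQ). It also assumes that %HQ as LQ and %LQ as HQ are normalized by the number of true HQ and true LQ images respectively, and that % ext. mis. is normalized by the total number of images; the caption does not state these denominators. All function names are hypothetical.

```python
# Assumed ordinal encoding of the quality classes (not given in the caption).
LQ, MID, HQ = 0, 1, 2


def confusion(y_true, y_pred, k=3):
    """Return a k x k confusion matrix with rows = truth, columns = prediction."""
    m = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def discrete_metrics(y_true, y_pred):
    """Panel (a)-style metrics, under the denominator assumptions stated above."""
    m = confusion(y_true, y_pred)
    n = len(y_true)
    return {
        "pct_hq_as_lq": 100.0 * m[HQ][LQ] / sum(m[HQ]),
        "pct_lq_as_hq": 100.0 * m[LQ][HQ] / sum(m[LQ]),
        "pct_ext_mis": 100.0 * (m[HQ][LQ] + m[LQ][HQ]) / n,
    }


def weighted_kappa(y_true, y_pred, power=1, k=3):
    """Cohen's weighted kappa; power=1 is linear, power=2 is quadratic."""
    m = confusion(y_true, y_pred, k)
    n = len(y_true)
    row = [sum(m[i]) for i in range(k)]
    col = [sum(m[i][j] for i in range(k)) for j in range(k)]
    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            observed += w * m[i][j]
            expected += w * row[i] * col[j] / n  # chance agreement
    return 1.0 - observed / expected


def auroc_one_vs_rest(y_true, scores, positive):
    """AUROC for `positive` vs. rest via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of images, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.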