Abstract
Successful translation of artificial intelligence (AI) models into clinical practice, across clinical domains, is frequently hindered by the lack of image quality control. Diagnostic models are often trained on images with no denotation of image quality in the training data; this, in turn, can lead to misclassifications by these models when implemented in the clinical setting. In the case of cervical images, quality classification is a crucial task to ensure accurate detection of precancerous lesions or cancer; this is true for both gynecologic-oncologists’ (manual) and diagnostic AI models’ (automated) predictions. Factors that impact the quality of a cervical image include but are not limited to blur, poor focus, poor light, noise, obscured view of the cervix due to mucus and/or blood, improper position, and over- and/or under-exposure. Utilizing a multi-level image quality ground truth denoted by providers, we generated an image quality classifier following a multi-stage model selection process that investigated several key design choices on a multi-heterogeneous “SEED” dataset of 40,534 images. We subsequently validated the best model on an external dataset (“EXT”), comprising 1,340 images captured using a different device and acquired in different geographies from “SEED”. We assessed the relative impact of various axes of data heterogeneity, including device, geography, and ground-truth rater, on model performance. Our best performing model achieved an area under the receiver operating characteristics curve (AUROC) of 0.92 (low quality, LQ vs. rest) and 0.93 (high quality, HQ vs. rest), and a minimal total %extreme misclassification (%EM) of 2.8% on the internal validation set. Our model also generalized well externally, achieving corresponding AUROCs of 0.83 and 0.82, and a %EM of 3.9% when tested out-of-the-box on the external validation (“EXT”) set. Additionally, our model was geography agnostic, with no meaningful difference in performance across geographies, did not exhibit catastrophic forgetting upon retraining with new data, and mimicked the overall/average ground truth rater behavior well. Our work represents one of the first efforts at generating and externally validating an image quality classifier across multiple axes of data heterogeneity to aid in the visual diagnosis of cervical precancer and cancer. We hope that this will motivate the accompaniment of adequate guardrails for AI-based pipelines to account for image quality and generalizability concerns.
Introduction
While there has been a notable surge in the development of artificial intelligence (AI) and deep learning (DL) pipelines for several clinical tasks1,2,3,4, translation of these models into clinical practice remains sparse. Particularly concerning is the extent to which models fail to perform in prospective studies for clinical translation, owing to differences between prospectively acquired, real-world clinical data and the data used for training. One unique challenge that has consistently hindered clinical AI deployment is the frequent lack of image quality control, especially in clinical domains that involve multiple and/or custom image capture devices5,6,7,8. In these domains, diagnostic AI models are often trained on images with no filtering for quality in the training data; this, in turn, can lead to misclassifications when these models are implemented, with potentially adverse patient outcomes. Cervical cancer screening is one such domain where image quality is of particular importance, for several reasons: 1. only a small portion of the cervix is typically of interest for diagnosis; 2. the area of interest is frequently difficult to visualize and properly position; and 3. there are often multiple devices with different characteristics and multiple image takers in a given health region. For cervical screening, visual evaluation plays a crucial role as a triage step after a primary screening test (ideally, an HPV test) is positive; it is used to determine the need for and site of biopsies for histological confirmation and/or the adequacy of ablative treatment.
In the context of this work, image quality refers to the visual attributes of an image including the technical characteristics that determine its clarity and fidelity to the original subject9. In the case of cervical images, quality classification is a crucial task to ensure accurate screening and diagnosis of cervical cancer; this is true for both gynecologic-oncologists’ (manual) and diagnostic AI models’ (automated) predictions. Factors that may impact the quality of a cervical image include but are not limited to blur, poor focus, poor light, noise, obscured view of the cervix due to mucus and/or blood, improper position of the speculum, insufficient magnification, glare, specular reflection, and over- and/or under-exposure. There is a paucity of work in the current DL and medical image classification literature that assesses clinical image quality; most current pipelines therefore lack an image quality check and tend to perform poorly on poor quality images5,6,7,8.
Cervical cancer ranks as the fourth most prevalent cause of cancer-related morbidity and mortality worldwide, with around 90% of the 300,000 deaths per year occurring in low-resource settings10,11,12. Despite a strong understanding of the causal pathway, predominantly attributed to Human Papillomavirus (HPV)11,13,14, effective control of cervical cancer remains elusive, particularly in low-resource settings15. To assess the risk of HPV-positive individuals, low-resource settings commonly employ visual inspection with acetic acid (VIA) as a triage method16,17. However, numerous studies have indicated that visual evaluation by healthcare providers exhibits suboptimal accuracy and repeatability18,19, creating a necessity for automated tools that can more consistently evaluate cervical lesions and direct the appropriate treatment protocol. To this end, we had previously generated a multiclass diagnostic classifier able to classify the appearance of the cervix into “normal”, “indeterminate” and “precancer/cancer” categories20.
Crucially, both manual and automated evaluation of cervical images, as captured by colposcopes, cell phone cameras or other devices, is dependent on quality; even highly trained healthcare providers such as colposcopists and gynecologic-oncologists are unable to confidently ascertain the cancer status of a cervix from poor quality images21. Therefore, there is a need for an accurate and generalizable image quality classifier to ensure that only images deemed of sufficient quality undergo diagnostic classification and evaluation, whether manual or automated. For instance, our diagnostic classifier20 utilized only images that were labelled “intermediate” or “high” quality for diagnostic classification. Our goal is to filter out the “low” quality images by prompting the user to retake an image if it is deemed of poor quality, and only pass through the “intermediate” and “high” quality images, i.e., images deemed to be of sufficient quality for downstream diagnostic classification. In this work, we implemented a multi-stage model selection approach utilizing a collated, multi-heterogeneous dataset to generate a multi-class image quality classifier able to classify images into “low”, “intermediate” and “high” quality categories. We subsequently validated this classifier on an external, out-of-distribution (OOD) dataset, assessing the relative impacts of various axes of data heterogeneity, including device-, geography-, and ground truth rater-level heterogeneity, on the performance of our best quality classifier model.
Our work makes several important conclusions regarding the performance of our quality classifier model, which, we believe, hold relevance across multiple clinical domains even outside of cervical imaging:
1. Object Detection: Model performance improves after employing a trained bounding box detector to bound and crop the cervix from images and training/testing on the bound and cropped images.
2. Generalizability:
   a. Device-level heterogeneity: Our model performs strongly out of the box on an external dataset comprising images from a different device.
   b. Geography-level heterogeneity: Our model is geography agnostic, meaning that geography-level heterogeneity has no meaningful impact on model performance.
   c. Label/ground truth rater-level heterogeneity: Our model strongly mimics the overall/average rater behavior; it discriminates the important boundary classes (“low” and “high” quality) well and reasonably captures the degree of uncertainty seen with the “intermediate” class.
Materials and methods
Dataset
Included studies
We utilized two groups of datasets in this study: (1) a collated, multi-device (cervigram, DSLR, J5, S8) and multi-geography (Costa Rica, USA, Europe, Nigeria) dataset, labelled “SEED”, which comprised a convenience sample combining six distinct studies—Natural History Study (NHS), ASC-US/LSIL Triage Study for Cervical Cancer (ALTS), Costa Rica Vaccine Trial (CVT), Biopsy Study in the US (Biop), Biopsy Study in Europe (D Biop)20 and Project Itoju22, and (2) an external dataset, labelled “EXT”, comprising images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic) collected as part of the HPV-Automated Visual Evaluation (PAVE) study23 (Table 1, Fig. 1). The “SEED” dataset comprised a total of 40,534 images, while the “EXT” dataset comprised 1,340 images (Table 1).
Overview of dataset and model optimization strategy. We utilized a collated multi-device and multi-geography dataset, labelled “SEED” (orange panel), for model training and selection, and subsequently validated the performance of our chosen best-performing model on an external dataset, labelled “EXT” (blue panel), comprising images from a new device and new geographies (see Table 1 and METHODS for detailed descriptions and breakdown of the datasets by ground truth). We split the “SEED” dataset 10% : 1% : 79% : 10% into train : validation : Test 1 (“Model Selection Set”) : Test 2 (“Internal Validation”) sets, and subsequently investigated the intersection of the model design choices in the bottom table on the train and validation sets. The models were ranked based on classification performance on the “Model Selection Set”, captured by the metrics highlighted on the center green panel. The “Internal Validation” set was subsequently utilized to further verify and confirm the ranked order of the models from the “Model Selection Set”. Finally, we validated the performance of our top model on “EXT”, conducting both an external validation and an interrater study (see METHODS). CE: cross entropy; QWK: quadratic weighted kappa; MSE: mean squared error; AUROC: area under the receiver operating characteristics curve.
Ground truth delineation
The ground truth quality labels for the images in the “SEED” and “EXT” datasets were assigned by four healthcare providers into four categories, using the following guidelines: “unusable” (where the images were either not of the cervix, used Lugol’s iodine for visual inspection, included a green filter, were post-surgery or post-ablation, and/or consisted of an upload artifact), “unsatisfactory” (where major technical quality factors such as blur, poor focus, poor light, obstructed view of the cervix due to mucus or blood, improper position, or over- and/or under-exposure did not allow for a visual diagnostic evaluation), “limited” (where certain technical quality factors still impacted image quality but a visual diagnostic evaluation was possible) and “evaluable” (where there were no technical factors affecting the quality of the image and a visual diagnosis was possible). Each of the raters was a licensed physician, board certified in gynecology or gynecologic oncology, with more than 20 years of experience in their field, as well as specific expertise in HPV epidemiology. Three of the raters labelled images in the “SEED” and “EXT” datasets, while one rater labelled images only in the “EXT” dataset. The four-level ground truth mapping was converted into three levels: “low quality” (which combined the “unusable” and “unsatisfactory” categories), “intermediate quality” (“limited” category) and “high quality” (“evaluable” category). The rationale for combining the bottom two quality categories is twofold: first, since both “unusable” and “unsatisfactory” images cannot undergo visual diagnostic evaluation, we expect these images to be filtered out by the quality classifier and new images retaken for the patient; second, combining the lower two categories ensured a better dataset balance given the large number of “intermediate quality” (“limited” category) and “high quality” (“evaluable” category) images. Since both “intermediate” and “high” quality images can be visually evaluated by providers, we expect automated classifiers trained on these images to correspondingly provide diagnostic predictions. The breakdown of the final three-level ground truths in each dataset is highlighted on Table 1.
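The collapse from the four-level rater labels to the three-level training labels can be summarized as a simple lookup; the snippet below is a minimal sketch assuming an integer encoding of 0/1/2 for the three levels (the string keys are illustrative, not literal values taken from the dataset files).

```python
# Four-level rater labels collapsed to the three-level scheme used for training.
FOUR_TO_THREE = {
    "unusable": "low quality",
    "unsatisfactory": "low quality",
    "limited": "intermediate quality",
    "evaluable": "high quality",
}

# Assumed integer encoding used in the illustrative snippets that follow:
# 0 = low quality, 1 = intermediate quality, 2 = high quality.
CLASS_INDEX = {"low quality": 0, "intermediate quality": 1, "high quality": 2}

def map_label(raw_label: str) -> int:
    """Collapse a rater's four-level quality label to a three-level class index."""
    return CLASS_INDEX[FOUR_TO_THREE[raw_label.strip().lower()]]
```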
Ethics
All study participants signed a written informed consent prior to enrollment and sample collection. All studies were reviewed and approved by the Institutional Review Boards of the National Cancer Institute (NCI) and the National Institutes of Health (NIH). The “EXT” studies were approved by country-specific IRBs in Cambodia and the Dominican Republic. All experiments and methods were performed in accordance with the relevant guidelines and regulations.
Model training and analysis
Utilizing a three-level ground truth of “low”, “intermediate” and “high” quality images, we investigated the design of an image quality classifier on the “SEED” dataset and externally validated the best model on the “EXT” dataset. We implemented our model design and selection approach in four distinct steps: 1. model development, 2. internal validation, 3. external validation and 4. interrater performance.
Model development
We conducted our experiments in multiple rounds, incorporating the intersections of model choices across several key model design choice categories (Fig. 1); these included different model architectures (densenet12124, resnet5025), loss functions (standard cross-entropy, quadratic weighted kappa26, and mean-squared error losses) and dataset balancing strategies (balanced sampling, balanced loss). Our design choices here were informed by prior work20 highlighting the utility of these choices across medical imaging domains, and specifically for the cervical domain.
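For readers less familiar with the quadratic weighted kappa loss26, the snippet below is a minimal PyTorch sketch of its general form; it is illustrative only, and the exact loss implementation used for our runs is available in the linked repository. The key property is that the penalty grows quadratically with the distance between the true and predicted classes, so extreme (“low” vs. “high” quality) confusions are penalized most heavily.

```python
import torch

def qwk_loss(probs: torch.Tensor, targets: torch.Tensor, num_classes: int = 3,
             eps: float = 1e-8) -> torch.Tensor:
    """Quadratic weighted kappa (QWK) style loss, sketched after de la Torre et al.

    probs:   (N, K) softmax probabilities output by the model.
    targets: (N,)   integer class labels (0 = low, 1 = intermediate, 2 = high).
    """
    idx = torch.arange(num_classes, device=probs.device, dtype=probs.dtype)
    # Quadratic penalty matrix: w[i, j] = (i - j)^2 / (K - 1)^2
    weights = (idx.view(-1, 1) - idx.view(1, -1)) ** 2 / (num_classes - 1) ** 2

    one_hot = torch.nn.functional.one_hot(targets, num_classes).to(probs.dtype)
    observed = one_hot.t() @ probs  # soft confusion matrix, shape (K, K)
    # Expected confusion under independence of the true and predicted marginals.
    expected = one_hot.sum(0).view(-1, 1) * probs.sum(0).view(1, -1) / probs.shape[0]

    kappa = 1 - (weights * observed).sum() / ((weights * expected).sum() + eps)
    return 1 - kappa  # minimizing (1 - kappa) maximizes weighted agreement
```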
ROUND 1: Training set size
In the first round, our initial runs were aimed at investigating the impact of dataset size on model performance. We conducted model training runs that used either a high (65%) or low (10%) proportion of “SEED” data for training, and subsequently compared several key classification performance metrics between the two sets of runs using paired samples t-tests adjusted for multiple comparisons by the Bonferroni correction.
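The statistical comparison in this round can be reproduced, in outline, with standard tooling; the sketch below pairs each model configuration's metric values from the high- and low-proportion runs and applies paired t-tests with a Bonferroni-adjusted threshold. The metric values shown are placeholders, not results from this study.

```python
import numpy as np
from scipy import stats

# Placeholder metric values, paired by model configuration:
# (high-proportion training runs, low-proportion training runs).
metrics = {
    "auroc": (np.array([0.91, 0.92, 0.90]), np.array([0.91, 0.91, 0.90])),
    "kappa": (np.array([0.64, 0.66, 0.63]), np.array([0.64, 0.65, 0.62])),
}

alpha = 0.05
bonferroni_alpha = alpha / len(metrics)  # adjust for the number of metrics compared

for name, (high, low) in metrics.items():
    t_stat, p_value = stats.ttest_rel(high, low)  # paired samples t-test
    print(f"{name}: t = {t_stat:.2f}, p = {p_value:.3f}, "
          f"significant = {p_value < bonferroni_alpha}")
```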
ROUND 2: Cervix detection
In the second round, we investigated the impact of cervix detection on quality classifier performance, comparing model performance before and after cervix detection. The expected workflow in our overall multistep pipeline includes, in sequence, 1. image capture, 2. cervix detection, 3. image quality classification, 4. diagnostic classification, and 5. appropriate treatment as directed. In this pipeline, cervix detection can be considered a preprocessing task that bounds and crops the cervix for input into the downstream classifiers. Given that healthcare providers look only at the cervix enclosed within its circumferential boundary, both for determination of visual image quality and for visual determination of precancer status via aceto-whitening near the transformation zone, our decision to bound and crop the cervix and pass only the cropped image into the downstream classifiers was intuitive and justified.
We used a YOLOv527 model architecture pretrained on the COCO dataset to train our custom cervix detector. Human-annotated ground truth bounding boxes were available for images that were split into 60% train, 10% validation, 20% test 1 and 10% test 2 sets. The detector was trained for 100 epochs and achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.954, indicating a very high level of performance. We subsequently compared several key classification metrics between quality classifier model runs before and after cervix detection by conducting paired samples t-tests, adjusted for multiple comparisons by the Bonferroni correction (Fig. 2).
(a) Comparison of model performances with (green bars) and without (red bars) cervix detection. The bars report mean values of the corresponding metrics on the x-axis across all models. Results from paired samples t-tests adjusted by the Bonferroni correction (t-statistic, p-value) are highlighted in the text above the bars, demonstrating statistically significant improvements in model performance with cervix detection. (b) (i) Bounding boxes generated from running the cervix detector, highlighted in white, around 50 randomly selected images from the external (“EXT”) dataset. The cervix detector utilized a YOLOv5 architecture trained on “SEED” dataset images. (ii) Bound and cropped images of the cervix, which are passed on to the diagnostic classifier.
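A cropping step of this kind can be sketched with the standard YOLOv5 hub interface; the snippet below is illustrative only, and the weights filename is a hypothetical placeholder for the trained cervix detector.

```python
import torch
from PIL import Image

# Load a custom-trained YOLOv5 detector via the standard torch.hub interface.
# "cervix_detector.pt" is a hypothetical path to trained weights.
detector = torch.hub.load("ultralytics/yolov5", "custom", path="cervix_detector.pt")

def crop_cervix(image_path: str) -> Image.Image:
    """Return the image cropped to the highest-confidence detected bounding box."""
    image = Image.open(image_path).convert("RGB")
    results = detector(image)
    boxes = results.xyxy[0]  # (n, 6): x1, y1, x2, y2, confidence, class
    if len(boxes) == 0:
        return image  # no detection: fall back to the full frame
    best = boxes[boxes[:, 4].argmax()]
    x1, y1, x2, y2 = (int(v) for v in best[:4].tolist())
    return image.crop((x1, y1, x2, y2))
```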
Model selection and internal validation
Our final model runs utilized the full 40,534 image “SEED” dataset with a split of 10% : 1% : 79% : 10% for training : validation : test 1 (model selection set) : test 2 (internal validation set), and iterated across all combinations of the design choices highlighted on Fig. 1. The specific configurations are highlighted on Table 2. All images were cropped with bounding boxes generated from a YOLOv527 model trained for cervix detection as noted above. RGB images were used for training, since the primary visual indicators of precancerous status in an image of the cervix require the presence of color (e.g., aceto-whitening near the transformation zone following application of acetic acid, growth or ulceration, vascular abnormalities); subtle color differences reflect underlying physiological and pathological changes associated with precancer/cancer. All models were trained for 75 epochs with a batch size (BS) of 8, a learning rate (LR) of 10⁻⁵, and an LR scheduler (ReduceLROnPlateau) which reduced the LR by a factor of 10 if no improvement was seen in the validation metric for 10 epochs. These choices of a low LR with an LR scheduler, together with the BS and number of epochs, balanced model performance, training time, and available memory capacity, and ensured that all our models reached convergence. We used the summed normal and precancer AUC on the validation set as the early stopping criterion during training. Before training, images were resized to 256 × 256 pixels and scaled to intensity values from 0 to 1. During training, affine transformations were applied to the images for data augmentation. We initialized all model architectures with ImageNet pretrained weights. Additionally, we implemented Monte Carlo (MC) dropout28 in order to alleviate overfitting and regularize the learning process by randomly removing neural connections from the model29. Spatial dropout at a rate of 0.1 was applied after each dense layer for the densenet121 models, and after each residual block for the resnet50 models. The final model prediction was generated by keeping dropout active at inference and averaging the predictions from 50 forward passes; each model’s prediction can thus be thought of as the average of 50 MC samples, analogous to averaging 50 repeat runs of the model.
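The configuration above translates into a fairly standard PyTorch training setup; the sketch below is a simplified illustration under stated assumptions (a single dropout layer before the classifier head rather than the spatial dropout placement described in the text), and the exact implementation is available in the linked repository.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES, LR, BATCH_SIZE, EPOCHS = 3, 1e-5, 8, 75

# ImageNet-pretrained densenet121 with a three-class head; dropout is kept
# available at inference time for Monte Carlo (MC) sampling.
model = models.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(model.classifier.in_features, NUM_CLASSES),
)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)  # reduce LR tenfold on plateau

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ToTensor(),  # scales intensities to [0, 1]
])

def mc_predict(x: torch.Tensor, n_samples: int = 50) -> torch.Tensor:
    """Average softmax over n stochastic forward passes (MC dropout)."""
    model.eval()
    for m in model.modules():  # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        return torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)]).mean(0)
```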
In the internal validation stage, we ranked our final models in order of performance on the “Model Selection Set” (“Test Set 1” = 32,100 images). We subsequently confirmed the performance of these models on the previously held aside “Internal Validation Set” (“Test Set 2” = 3,975 images). We ranked our models based on area under the receiver operating characteristics curve (AUROC), kappa (linear and quadratic weights), as well as %extreme misclassifications (%EM, representing the proportion of images with a two-class misclassification), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ) (Fig. 3).
Classification performance metrics on the “Internal Validation Set” (“Test Set 2”) for the models investigated. The models are arranged from top to bottom in order of decreasing performance. Specifically, (a) highlights the discrete classification metrics: %extreme misclassifications (% ext. mis.), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ), (b) highlights the kappa metrics (linear, quadratic weighted) and (c) highlights the area under the receiver operating characteristics curve (AUROC) for each of the low quality (LQ) versus rest and high quality (HQ) versus rest categories. While our top models overall performed reasonably similarly in terms of the continuous metrics (panels b and c), the discrete metrics (panel a) separated out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. The model ranking is consistent with the ranking observed on the “Model Selection Set” (“Test Set 1”) (Supp. Fig. 1).
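The discrete metrics used for ranking can be read directly off the three-class confusion matrix; the sketch below assumes the 0/1/2 class encoding used earlier and, as one plausible reading of the definitions above, normalizes each percentage by the total image count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def discrete_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """%EM, %HQ-as-LQ and %LQ-as-HQ from integer labels (0=LQ, 1=IQ, 2=HQ)."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])  # rows: truth, cols: prediction
    n = cm.sum()
    return {
        "%EM": 100 * (cm[0, 2] + cm[2, 0]) / n,   # two-class (extreme) misclassifications
        "%HQ as LQ": 100 * cm[2, 0] / n,          # true high quality predicted low quality
        "%LQ as HQ": 100 * cm[0, 2] / n,          # true low quality predicted high quality
    }
```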
Finally, to aid better visualization of predictions at the individual model level, we generated Fig. 4, which compared model predictions across 60 images for the ranked model list. To generate this comparison, we first summarized each model’s output as a continuous severity \(score\). Specifically, we utilized the ordinality of our problem and defined the continuous severity \(score\) as a weighted average of the class indices, using the softmax probability of each class \(i\) (\(p_{i}\)) as the weights, as described in Eq. 3, where \(k\) is the number of classes:

\[score=\sum_{i=0}^{k-1} i \cdot p_{i} \qquad (3)\]
Model-level comparison across investigated models on the “Internal Validation Set” (“Test Set 2”). 60 images were randomly selected from this set (see METHODS/Model Training and Analysis/Model Selection and Internal Validation) and arranged in order of increasing mean score within each ground truth class in the top row (labelled “Ground Truth”). The model predicted class for each of these 60 images is highlighted in the bottom rows for each of the investigated models, where the images follow the same order as the top row. The color coding in the top row represents the ground truth, while that in the bottom 12 rows represents the model predicted class. Red: Low Quality, Gray: Intermediate, and Green: High Quality, as highlighted in the legend. As we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve.
Put another way, the \(score\) is equivalent to the expected value of a random variable that takes values equal to the class labels, and the probabilities are the model’s softmax probability at index \(i\) corresponding to class label \(i\). For a three-class model, the values lie in the range 0 to 2. We next computed the average of the \(score\) for each image across all models and arranged the images in order of increasing average \(score\) within each class. From this \(score\)-ordered list, we randomly selected 20 images per class, maintaining the distribution of mean scores within each class, and arranged the images in order of increasing average \(score\) within each class in the top row of Fig. 4, color coded by ground truth. We subsequently compared the predicted class across the 12 models for each of these 60 images (bottom 12 rows of Fig. 4), maintaining the images in the same order as the ground truth row and color-coded by model predicted class. The image panels on the top of Fig. 4 depict select images with relevant metadata.
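As a concrete illustration of this definition (assuming class indices 0, 1 and 2 for “low”, “intermediate” and “high” quality), the \(score\) can be computed directly from the softmax output:

```python
import numpy as np

def severity_score(softmax_probs: np.ndarray) -> np.ndarray:
    """Expected class index under the softmax distribution; shape (N, k) -> (N,)."""
    k = softmax_probs.shape[1]
    return softmax_probs @ np.arange(k)

# A confident "high quality" prediction scores ~1.75; an ambiguous one scores ~1.0.
print(severity_score(np.array([[0.05, 0.15, 0.80],
                               [0.30, 0.40, 0.30]])))
```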
External validation
Because our internal validation set shared similar characteristics with our training data (i.e., similar devices and geographies), our next stage consisted of validating our best performing model on external data (“EXT”). Our external test set (“EXT” = “Test Set 3”) comprised images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic (DR)).
First, to get a sense of the dataset distributions of the “SEED” and “EXT” datasets, including the distributions by device and geography, we ran out-of-the-box (OOB) inference with our best performing model on “Test Set 2” (“Internal Validation Set”) from the “SEED” dataset and on the full “EXT” dataset. We subsequently plotted UMAPs of the resulting features, i.e., a dimension-reduced representation of the features output from the model’s inference run, color-coded by dataset, device, and geography, respectively (Fig. 5).
Uniform manifold approximation and projections (UMAP) highlighting the relative distributions of the datasets, devices and geographies investigated in this work. Each subplot highlights a different representation of the UMAP, where the color coding (highlighted in the corresponding legend at the top of each subplot) is at the (a) dataset-level (seed vs. external), (b) device-level and (c) geography-level. The datasets and devices occupy distinct clusters in (a) and (b), while the geographies are all clustered together within the same device in (c).
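The projection in Fig. 5 can be reproduced, in outline, with the umap-learn package; the sketch below assumes the penultimate-layer features and the per-image device labels have been saved during inference, and the file names are hypothetical placeholders.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

features = np.load("inference_features.npy")  # assumed (N, D) penultimate-layer features
devices = np.load("device_labels.npy")        # assumed per-image device label

# Two-dimensional UMAP embedding of the model features.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(features)

for device in np.unique(devices):
    mask = devices == device
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=3, label=str(device))
plt.legend()
plt.title("UMAP of quality-classifier features, colored by device")
plt.show()
```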
We further tested the impact of device- and geography-level heterogeneity on our model performance via three distinct sets of investigations: (i) out-of-the-box (OOB) inference on “EXT”; (ii) device-level retraining: adding multi-geography “EXT” images to “SEED” in a 65% : 10% : 25% ratio of train : validation : test and training on the full collated dataset; and (iii) geography-level retraining: adding either Cambodia or DR “EXT” images to “SEED” in separate experiments and training on the full collated dataset. For (i), the OOB model run, we investigated performance on both the full “EXT” test set and the individual geographies, Cambodia and DR. For (ii), the device-level retraining run, we investigated performance on both the “EXT” test set and “SEED” Test Set 2, to assess the possibility of performance degradation (catastrophic forgetting) on “SEED” data upon retraining.
Interrater assessment
Finally, we conducted an interrater assessment of model performance with respect to the ground truth denoted by two different raters on 100 newly acquired, external (“EXT”) dataset images (device = IRIS colposcope, geography = Cambodia). Rater 1 was one of several raters who had labelled images in the “SEED” dataset on which the model was trained, while “Rater 2” was a completely new rater. We specifically investigated the OOB performance of our best performing model (which was trained on “SEED”), on the 100 “EXT” images with respect to each of the individual rater’s ground truth, computing key classification metrics (AUROC, %EM) (Fig. 7a) and ROC curves for each (Fig. 7b,c). Further, we investigated the degree of concordance between the two raters’ ground truths and the corresponding model predictions on each of the 100 images using a rater-level confusion matrix color-coded by model prediction.
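The per-rater metrics reduce to standard one-vs-rest computations against each rater's labels; the sketch below, using the same 0/1/2 encoding as above, illustrates how the LQ-vs-rest and HQ-vs-rest AUROCs can be obtained from the model's softmax outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def boundary_aurocs(y_true: np.ndarray, softmax_probs: np.ndarray) -> dict:
    """One-vs-rest AUROCs for the boundary classes (0 = low, 2 = high quality)."""
    return {
        "LQ vs rest": roc_auc_score((y_true == 0).astype(int), softmax_probs[:, 0]),
        "HQ vs rest": roc_auc_score((y_true == 2).astype(int), softmax_probs[:, 2]),
    }

# Evaluate the same model predictions against each rater's ground truth separately,
# e.g. boundary_aurocs(rater1_labels, probs) and boundary_aurocs(rater2_labels, probs).
```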
Results and discussion
In this work, we implemented a multi-stage model selection approach to generate an image quality classifier utilizing a multi-device and multi-geography “SEED” dataset, and subsequently validated the best performing model on an external “EXT” dataset, assessing the relative impact of device-level, geography-level, and rater heterogeneity on our model.
Model development
ROUND 1: Training set size
Supp. Table 1 highlights that using a high proportion (65%) of all available data for training, instead of a low proportion (10%), did not meaningfully improve or alter model performance. We consequently chose to limit all subsequent experiments to a low proportion (10%) of the data to save on training time, optimize available memory capacity and improve computational efficiency. Conceptually, this is consistent with our expectation: given the large size of our “SEED” dataset (40,534 images), even 10% of the dataset amounts to 4,053 training images, which is a reasonably large number for this task.
ROUND 2: Cervix detection
Figure 2 highlights statistically significant improvements in model performance with cervix detection across several key classification metrics, including linear kappa (LK), quadratic weighted kappa (QWK), accuracy, area under the receiver operating characteristics curve (AUROC) and area under the precision recall curve (AUPRC). We consequently chose to limit our final set of model selection experiments to models utilizing images bound and cropped following cervix detection. Conceptually, this is consistent with our expectation since our raters primarily utilized the location around the cervical os, which largely encompassed the circumferential boundary around the cervix, for determining quality ground truths, given that this is the region of the cervix that is used to visually determine cervical precancer and cancer.
Model selection and internal validation
Figure 3 depicts the rank order of our final models on “Test Set 2” (“Internal Validation Set”). This rank order is consistent with the rank order of models on “Test Set 1” (“Model Selection Set”) (Supp. Fig. 1), demonstrating that the performance differences between the models are driven by the design choices and not by chance or the specific composition of the individual test sets. Figure 3 highlights that while all models perform well and reasonably similarly in terms of the continuous metrics (the top models have similar AUROCs ~ 0.92 and kappa ~ 0.65), the discrete metrics (%EM, %LQ as HQ and %HQ as LQ) effectively discriminate between models and separate out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. Finally, Fig. 4 provides a more granular view of this difference in performance between the models, demonstrating that as we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve. This is consistent with our expectations given the design choices investigated in our model selection: incorporation of an “intermediate” class, together with MC dropout and a loss function (QWK) that penalizes misclassifications between the extreme classes, ensures that we deal effectively with ambiguous cases at the class boundaries. Our best model (“Model 1” on Figs. 3 and 4) utilizes densenet121 as the architecture, quadratic weighted kappa as the loss function and balanced loss as the balancing strategy. Even though we used RGB images as input, our general model selection and validation approach is independent of color and should apply widely, even across grayscale images in diagnostic radiology.
External validation
The UMAPs on Fig. 5a, b highlight that the “EXT” dataset and its corresponding IRIS colposcope device occupy similar regions to the DSLR and J5 clusters from the “SEED” dataset, while Fig. 5c highlights the geography level distribution. Taken together, Fig. 5a–c suggest that 1. while there is device-level heterogeneity within the data, the likely impact on model performance on “EXT” should be minimal given its proximity to “SEED”; and 2. that geography should not play a role in model performance, given that within the same device, different geographies do not occupy distinct clusters on Fig. 5c, unlike the corresponding device level clusters on Fig. 5b.
Figure 6 highlights that our model demonstrated strong out-of-the-box (OOB) performance on external data: AUROC of 0.83 (LQ vs. rest) and 0.82 (HQ vs. rest), and a %EM of 3.9% (Fig. 6a.i, blue bars), consistent with our expectation from Fig. 5; these values further improved upon retraining with “EXT” images added to “SEED” (AUROC = 0.95, 0.88 respectively; %EM = 1.8%) (Fig. 6a.i, orange bars). Additionally, we found that 1. retraining using external data did not adversely affect performance on “SEED”, i.e., there was no catastrophic forgetting (AUROC = 0.92, 0.93 respectively; %EM = 3.2%) (Fig. 6a.i, yellow bars), and 2. our model is geography agnostic: OOB performance did not meaningfully differ between Cambodia and Dominican Republic (DR) images (Fig. 6b.i, light and dark blue bars), and models trained on Seed + Cambodia images performed strongly on DR images and vice versa (Fig. 6b.i, light and dark green bars). Taken together, Figs. 5 and 6 suggest that there is no impact of geography-level heterogeneity on model performance, and that while there is some degree of device-level heterogeneity as captured by the UMAPs, performance on our external device (IRIS colposcope) is strong.
External validation of our best performing model on “EXT” dataset. Panel (a) highlights the strong out-of-the-box (OOB) performance of our model, where area under the receiver operating characteristics curve (AUROC) = 0.83 (low quality, LQ vs. rest) and 0.82 (high quality, HQ vs. rest), and %extreme misclassification (%Ext. Mis.) = 3.9% (a.i, blue bars), with the corresponding confusion matrix and ROC curve in (ii). Panel (a) further highlights the improvement in performance upon retraining, where AUROC = 0.95, 0.88 respectively; %Ext. Mis. = 1.8% on “EXT” test set (a.i, orange bars) and the absence of catastrophic forgetting, where AUROC = 0.92, 0.93 respectively; %Ext. Mis. = 3.2% on “SEED” Test Set 2 (a.i, yellow bars; confusion matrix and ROC curves in iii). Panel (b) highlights that our model is geography agnostic, with no meaningful difference in OOB performance on “EXT” between Cambodia (Cam.) and Dominican Republic (DR) (b.i, light and dark blue bars) and strong performance on DR for models trained on “SEED” + Cambodia and vice versa (b.i. light and dark green bars; confusion matrices and ROC curves depicted in ii and iii respectively).
Interrater assessment
Figure 7 demonstrates that our model mimics overall rater behavior well. Our model demonstrated strong performance OOB against each individual rater’s ground truth (Rater 1: AUROC = 0.96, 0.85 respectively, and %EM = 2%; Rater 2: AUROC = 0.87, 0.80 respectively, and %EM = 8%). Rater 1 was involved in the generation of ground truths within “SEED”, meaning that the model had “seen” Rater 1 ground truth patterns in the “SEED” data, while Rater 2 was a completely new rater. Our model correctly predicted 85% of cases where both raters agreed on either “low quality” or “high quality” images, making grave errors on only two images. The “intermediate” class is known to be highly uncertain among raters, given that there is generally disagreement among healthcare providers as to what defines an “intermediate” or limited quality image. We found that our model uniquely captured this rater-level uncertainty and disagreement with the “intermediate” class and largely erred on the side of caution; images predicted as “low” quality by our model were largely deemed “low” quality by at least one rater, and vice versa, while images deemed “intermediate” by one rater and “high” quality by the other largely had a mix of “intermediate” and “high quality” model predictions. This pattern of model performance is optimal for our use case, since we expect to utilize our model to filter out “low” quality images, while allowing “intermediate” and “high” quality images to pass through to diagnostic classification.
Interrater assessment of our best performing model on 100 newly acquired “EXT” dataset images (device = IRIS colposcope, geography = Cambodia), with respect to the ground truth denoted by two different raters. Rater 1 was one of the raters who had labelled images in the “SEED” dataset on which the model was trained, while Rater 2 was a completely new rater. Our model demonstrated strong performance out-of-the-box (OOB) on each individual rater’s ground truth, where for Rater 1: area under the receiver operating characteristics curve (AUROC) = 0.96 (low quality, LQ vs. rest), 0.85 (high quality, HQ vs. rest) respectively, and %extreme misclassifications (%Ext. Mis.) = 2% (panel (a), blue bars; ROC curves in panel (b)); and for Rater 2: AUROC = 0.87, 0.80 respectively, and %Ext. Mis. = 8% (panel (a), red bars; ROC curves in panel (b)). Panel (c) highlights the degree of concordance between the two raters’ ground truths (x-axis: Rater 1; y-axis: Rater 2) and the corresponding model prediction on each of the 100 images using a confusion matrix color-coded by model prediction (Red: low quality; Gray: intermediate quality and Green: high quality).
Diagnostic classifier performance by quality
To further shed light on the motivation behind the image quality classifier and the experiments reported here, we investigated the performance of our downstream diagnostic classifier within each quality ground truth class. Since our downstream diagnostic classification dataset utilized both “Intermediate” and “High Quality” images for training and testing, we can examine the diagnostic classifier model predictions on its test set with respect to the diagnostic classification ground truth within each of these two quality classes and determine whether image quality impacts diagnostic classifier model predictions. A detailed description of our diagnostic classifier model and the multi-heterogeneous (multi-device, multi-geography) dataset that was used can be found in our prior work20. The results from this analysis are reported in Fig. 8, which utilized the test set from the diagnostic classifier comprising 10,420 images.
Analysis of diagnostic classifier performance by image quality class. Specifically, the x-axis represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”). Within each of the six coordinates (reflecting the six combinations of quality and diagnostic classifier ground truths), each color-coded bubble represents the diagnostic classifier model predictions, with the relative sizes of the bubbles indicating the relative ratio of predictions for each class within each of the six coordinates. The number in the center of each bubble represents the number of images predicted to belong to the diagnostic class indicated by the bubble’s color, as highlighted in the legend at the top, where Green: Normal, Gray: Gray Zone/Indeterminate, and Red: Precancer+.
The x-axis of the bubble plot in Fig. 8 represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”). Within each of the six coordinates (reflecting the six combinations of quality and diagnostic classifier ground truths), each color-coded bubble represents the diagnostic classifier model predictions, with the relative sizes of the bubbles indicating the relative ratio of predictions for each class within each of the six coordinates. Crucially, none of the images labelled “Intermediate” quality are predicted to be “Normal” by the diagnostic classifier; this leads to a large discrepancy of 249 images which are ground-truth labelled as “Normal” but are predicted to be largely “Gray Zone/Indeterminate” (80%) or “Precancer+” (20%) by the diagnostic classifier. This is a critical finding, since it signifies that the quality of a cervical image is important for downstream diagnostic classification. It appears that poorer quality images are deemed to be pathologic by the diagnostic classifier regardless of ground truth, thereby reinforcing the need for a dedicated deep-learning-based model to filter out poorer quality images prior to diagnostic classification.
Analysis of performance by available quality factor
The dataset utilized in this work had a small number of “Low Quality” images for which we had a further denotation of the specific quality factor that caused the image to be rated as poor quality (n = 785 in the combined test set). In order to aid the explainability of our model, we assessed the performance of our top performing quality classifier (Model 1 in Figs. 3 and 4) within each of these categories by calculating the accuracy of model predictions, i.e., the proportion of images within each specific quality factor category that were correctly predicted as “Low Quality” by the model. Figure 9 depicts the bar plot of the accuracy within each specific quality category, highlighting that our quality classifier model performs well in filtering out post-iodine images as well as images where the view of the cervix is obscured due to the position of the cervix or an unspecified reason, while it does less well on images where the view of the cervix is obscured by mucus or blood.
Analysis of quality classifier performance by available quality factor, where each bar represents the accuracy of the best performing quality classifier model (Model 1 in Figs. 3 and 4) within each specific quality factor category as denoted on the x-axis. The total number of images in each category is denoted at both the bottom and top of each bar. On the x-axis, “Obscured” indicates that the view of the cervix is obscured by the factor denoted in parentheses.
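This per-factor breakdown amounts to a grouped accuracy computation over the low-quality subset; the sketch below assumes a simple tabular layout with hypothetical column names and placeholder rows.

```python
import pandas as pd

# Hypothetical per-image table for the low-quality subset: the annotated quality
# factor and the model's predicted class (0 = low, 1 = intermediate, 2 = high).
df = pd.DataFrame({
    "quality_factor": ["post-iodine", "obscured (mucus/blood)", "blur"],  # placeholder rows
    "predicted_class": [0, 1, 0],
})

# Accuracy per factor = fraction of images correctly predicted as low quality (class 0).
accuracy_by_factor = (
    df.assign(correct=df["predicted_class"].eq(0))
      .groupby("quality_factor")["correct"]
      .mean()
)
print(accuracy_by_factor)
```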
Conclusion
To successfully translate AI pipelines to clinical practice, models must be designed with guardrails built in to deal with poor quality images. Domains such as cervical cancer screening, which involve a variety of image capture devices as well as providers with varying levels of skill in capturing images, are particularly prone to image quality concerns that may adversely impact diagnostic evaluation. In this work, we tackle the image quality problem head-on by generating and externally validating a multiclass image quality classifier able to classify images of the cervix into “low”, “intermediate” and “high” quality categories. We subsequently highlight that our best performing model generalizes well, performing strongly across multiple axes of data heterogeneity, including device, geography, and ground truth rater.
Our choice of three classes for image quality classification was motivated by two reasons. First, three classes capture the most accurate representation of true image quality for cervical images as encountered by raters/providers in clinic, with the “intermediate” class effectively capturing the true ambiguity, and they integrate seamlessly into our overall workflow: images predicted as “low” quality would be filtered out by our model, with the provider prompted to retake the image for the patient until the prediction is no longer “low” quality; only images deemed to be of sufficient quality (“intermediate” and “high” quality categories) would be passed on to downstream diagnostic classification. Second, by incorporating a three-class classifier with a loss function (QWK) that severely penalizes extreme misclassifications (i.e., LQ as HQ and vice versa) with quadratic weights, we further ensure greater separation between, and stronger discrimination of, the “low” and “high” quality boundary classes.
Despite the heterogeneous nature of our datasets, our work may be limited by the number of external devices utilized and the number of image takers. Additionally, we use RGB images for input, although our approach is not dependent on color and should extend to grayscale imaging domains. Forthcoming work will further evaluate our retraining approaches and assess model performance on additional external devices and image takers. Future work will also optimize our model for use on edge devices, thereby promoting clinical translation.
Our investigation of quality classifier performance across the various axes of heterogeneity present within our data underscores the importance of assessing the portability, or generalizability, of AI models designed for clinical deployment30,31,32. We posit that for any model to be deployed successfully in the clinic, portability concerns must be adequately acknowledged. This is particularly true in light of the FDA’s October 2023 guidance on effective implementation of AI/DL models33, which proposes the need to adapt models to data distribution shifts. We hope that our work will set a standard for the importance of assessing model performance against the axes of heterogeneity present in the dataset, across clinical domains, and will motivate the accompaniment of adequate guardrails for AI-based pipelines to account for concerns relating to image quality and generalizability.
Data availability
The repository used to train and generate results can be found at https://github.com/QTIM-Lab/image_quality_classifier.
References
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Hannun, A. Y. et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat. Med. 25, 65–69 (2019).
Piccialli, F., Somma, V. D., Giampaolo, F., Cuomo, S. & Fortino, G. A survey on deep learning in medicine: Why, how and when?. Inf. Fusion 66, 111–137 (2021).
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
Sabottke, C. F. & Spieler, B. M. The effect of image resolution on deep learning in radiography. Radiol. Artif. Intell. 2, e190015 (2020).
Abdelhafiz, D., Yang, C., Ammar, R. & Nabavi, S. Deep convolutional neural networks for mammography: Advances, challenges and applications. BMC Bioinform. 20, 1–20 (2019).
Wright, A. I., Dunn, C. M., Hale, M., Hutchins, G. G. A. & Treanor, D. E. The effect of quality control on accuracy of digital pathology image analysis. IEEE J. Biomed. Heal. Inform. 25, 307–314 (2021).
Dodge, S. & Karam, L. Understanding how image quality affects deep neural networks. In 8th International Conference on Quality of Multimedia Experience QoMEX 2016 (2016) https://doi.org/10.1109/QOMEX.2016.7498955.
Pratt, W. K. Digital Image Processing. (2007) https://doi.org/10.1002/0470097434.
Wentzensen, N. et al. Accuracy and efficiency of deep-learning–based automation of dual stain cytology in cervical cancer screening. JNCI J. Natl. Cancer Inst. 113, 72–79 (2021).
de Martel, C., Plummer, M., Vignat, J. & Franceschi, S. Worldwide burden of cancer attributable to HPV by site, country and HPV type. Int. J. cancer 141, 664–670 (2017).
Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
Schiffman, M. et al. Carcinogenic human papillomavirus infection. Nat. Rev. Dis. Primers 2, 1–20 (2016).
Schiffman, M. H. et al. Epidemiologic evidence showing that human papillomavirus infection causes most cervical intraepithelial neoplasia. JNCI J. Natl. Cancer Inst. 85, 958–964 (1993).
Schiffman, M., Castle, P. E., Jeronimo, J., Rodriguez, A. C. & Wacholder, S. Human papillomavirus and cervical cancer. Lancet 370, 890–907 (2007).
Belinson, J. Cervical cancer screening by simple visual inspection after acetic acid. Obstet. Gynecol. 98, 441–444 (2001).
Ajenifuja, K. O. et al. A population-based study of visual inspection with acetic acid (VIA) for cervical screening in rural Nigeria. Int. J. Gynecol. Cancer 23, 507–512 (2013).
Massad, L. S., Jeronimo, J. & Schiffman, M. Interobserver agreement in the assessment of components of colposcopic grading. Obstet. Gynecol. 111, 1279–1284 (2008).
Silkensen, S. L., Schiffman, M., Sahasrabuddhe, V. & Flanigan, J. S. Is it time to move beyond visual inspection with acetic acid for cervical cancer screening?. Glob. Heal. Sci. Pract. 6, 242–246 (2018).
Ahmed, S. R. et al. Reproducible and clinically translatable deep neural networks for cervical screening. Sci. Rep. 13, 1–18 (2023).
Levitz, D., Angara, S., Jeronimo, J., Rodriguez, A. C., de Sanjose, S., Antani, S. & Schiffman, M. W. A survey of image quality defects and their effect on the performance of an automated visual evaluation classifier. In Optics and Biophotonics in Low-Resource Settings VIII (2022).
Desai, K. T. et al. Design and feasibility of a novel program of cervical screening in Nigeria: self-sampled HPV testing paired with visual triage. Infect. Agent. Cancer 15, 60 (2020).
de Sanjosé, S. et al. Design of the HPV-automated visual evaluation (PAVE) study: validating a novel cervical screening strategy. Elife 12 (2023).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2261–2269 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 770–778 (2015).
de la Torre, J., Puig, D. & Valls, A. Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognit. Lett. 105, 144–154 (2018).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-December, 779–788 (2016).
Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In 33rd International Conference on Machine Learning, ICML 2016, 3, 1651–1660 (2015).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Ahmed, S. R. et al. Assessing generalizability of an AI-based visual test for cervical cancer screening. medRxiv 10, 2023.09.26.23295263 (2023).
Egemen, D. et al. Artificial intelligence–based image analysis in clinical testing: lessons from cervical cancer screening. JNCI J. Natl. Cancer Inst. https://doi.org/10.1093/JNCI/DJAD202 (2023).
Perkins, R. B. et al. Use of risk-based cervical screening programs in resource-limited settings. Cancer Epidemiol. 84, 102369 (2023).
CDRH Issues Guiding Principles for Predetermined Change Control Plans for Machine Learning-Enabled Medical Devices | FDA. https://www.fda.gov/medical-devices/medical-devices-news-and-events/cdrh-issues-guiding-principles-predetermined-change-control-plans-machine-learning-enabled-medical.
Acknowledgements
We acknowledge Farideh Almani from the National Cancer Institute for image review; Dr. Ernestina Hernandez, Dr Lina José and Dr. Flavia Antigua for image capture; as well as the following list comprising the Takeo Physicians (HOURT, Kay; KHIM, Thou; KON, Korng; CHA, Sang Hak; KON, Kim Chhorng; SOY, Sokhoeun; NOUN, Chan Reatrey; THUN, Laiky; KIM, Monyrathna; LY, Sovannrathanak; SREANG, Sovannareth) and Midwives (SOK, Tonh; KAY, Sokny; CHHENG, Sreimom; SOK, Nissay; CHEA, Savet; SORN, Vantha; KUNG, Saroeun; KOM, Sokunthea; UN, Sophea; SO, Kalyane; LONG, Samphors; SEM, Sreypov; NHEM, Sokha) team at the Provincial Referral Hospital and Fanine, participating in the PAVE study in Cambodia.
Author information
Authors and Affiliations
Consortia
Contributions
Study concept and design: S.R.A. Data collection: all authors. Data analysis and interpretation: S.R.A. Drafting of the manuscript: S.R.A. Critical revision of the manuscript for important intellectual content and final approval: all authors. Supervision: J.K.-C.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ahmed, S.R., Befano, B., Egemen, D. et al. Generalizable deep neural networks for image quality classification of cervical images. Sci Rep 15, 6312 (2025). https://doi.org/10.1038/s41598-025-90024-0