Introduction

While there has been a notable surge in the development of artificial intelligence (AI) and deep learning (DL) pipelines for several clinical tasks1,2,3,4, translation of these models into clinical practice remains sparse. Particularly concerning is the extent to which models fail to perform in prospective studies intended for clinical translation, owing to differences between prospectively acquired, real-world clinical data and the data on which the models were trained. One unique challenge that has consistently hindered clinical AI deployment is the frequent lack of image quality control, especially in clinical domains that involve multiple and/or custom image capture devices5,6,7,8. In these domains, diagnostic AI models are often trained on images that have not been filtered for quality; this, in turn, can lead to misclassifications when the models are deployed and, ultimately, to adverse patient outcomes. Cervical cancer screening is one such domain where image quality is of particular importance, for several reasons: 1. only a small portion of the cervix is typically of interest for diagnosis; 2. the area of interest is frequently difficult to visualize and position properly; and 3. there are often multiple devices with different characteristics and multiple image takers in a given health region. For cervical screening, visual evaluation plays a crucial role as a triage step after a primary screening test (ideally, an HPV test) is positive; it is used to determine the need for and site of biopsies for histological confirmation and/or the adequacy of ablative treatment.

In the context of this work, image quality refers to the visual attributes of an image including the technical characteristics that determine its clarity and fidelity to the original subject9. In the case of cervical images, quality classification is a crucial task to ensure accurate screening and diagnosis of cervical cancer; this is true for both gynecologic-oncologists’ (manual) and diagnostic AI models’ (automated) predictions. Factors that may impact the quality of a cervical image include but are not limited to blur, poor focus, poor light, noise, obscured view of the cervix due to mucus and/or blood, improper position of the speculum, insufficient magnification, glare, specular reflection, and over- and/or under-exposure. There is a paucity of work in the current DL and medical image classification literature that assesses clinical image quality; most current pipelines therefore lack an image quality check and tend to perform poorly on poor quality images5,6,7,8.

Cervical cancer ranks as the fourth most prevalent cause of cancer-related morbidity and mortality worldwide, with around 90% of the 300,000 deaths per year occurring in low-resource settings10,11,12. Despite a strong understanding of the causal pathway, predominantly attributed to Human Papillomavirus (HPV)11,13,14, effective control of cervical cancer remains elusive, particularly in low-resource settings15. To assess the risk of HPV-positive individuals, low-resource settings commonly employ visual inspection with acetic acid (VIA) as a triage method16,17. However, numerous studies have indicated that visual evaluation by healthcare providers exhibits suboptimal accuracy and repeatability18,19, creating a necessity for automated tools that can more consistently evaluate cervical lesions and direct the appropriate treatment protocol. To this end, we had previously generated a multiclass diagnostic classifier able to classify the appearance of the cervix into “normal”, “indeterminate” and “precancer/cancer” categories20.

Crucially, both manual and automated evaluation of cervical images, as captured by colposcopes, cell phone cameras or other devices, are dependent on quality; even highly trained healthcare providers such as colposcopists and gynecologic-oncologists are unable to confidently ascertain the cancer status of a cervix from poor quality images21. Therefore, there is a need for an accurate and generalizable image quality classifier to ensure that only images deemed of sufficient quality undergo diagnostic classification and evaluation, whether manual or automated. For instance, our diagnostic classifier20 utilized only images labelled “intermediate” or “high” quality for diagnostic classification. Our goal is to filter out the “low” quality images by prompting the user to retake an image if it is deemed of poor quality, and to pass through only the “intermediate” and “high” quality images, i.e., images deemed to be of sufficient quality for downstream diagnostic classification. In this work, we implemented a multi-stage model selection approach utilizing a collated, multi-heterogeneous dataset to generate a multi-class image quality classifier able to classify images into “low”, “intermediate” and “high” quality categories. We subsequently validated this classifier on an external, out-of-distribution (OOD) dataset, assessing the relative impacts of various axes of data heterogeneity, including device-, geography-, and ground truth rater-level heterogeneity, on the performance of our best quality classifier model.

Our work reaches several important conclusions regarding the performance of our quality classifier model, which, we believe, hold relevance across multiple clinical domains even outside of cervical imaging:

1. Object Detection: Model performances improve after employing a trained bounding box detector to bound and crop the cervix from images and training/testing on the bound and cropped images.

2. Generalizability:

  a. Device-level heterogeneity: Our model performs strongly out of the box on an external dataset comprising images from a different device.

  b. Geography-level heterogeneity: Our model is geography agnostic, meaning that there is no impact of geography-level heterogeneity on model performance.

  c. Label/Ground Truth rater-level heterogeneity: Our model strongly mimics the overall/average rater behavior; it discriminates the important boundary classes (“low” and “high” quality) well and reasonably captures the degree of uncertainty seen with the “intermediate” class.

Materials and methods

Dataset

Included studies

We utilized two groups of datasets in this study: (1) a collated, multi-device (cervigram, DSLR, J5, S8) and multi-geography (Costa Rica, USA, Europe, Nigeria) dataset, labelled “SEED”, which comprised a convenience sample combining six distinct studies—Natural History Study (NHS), ASC-US/LSIL Triage Study for Cervical Cancer (ALTS), Costa Rica Vaccine Trial (CVT), Biopsy Study in the US (Biop), Biopsy Study in Europe (D Biop)20 and Project Itoju22; and (2) an external dataset, labelled “EXT”, comprising images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic) collected as part of the HPV-Automated Visual Evaluation (PAVE) study23 (Table 1, Fig. 1). The “SEED” dataset comprised a total of 40,534 images, while the “EXT” dataset comprised 1,340 images (Table 1).

Table 1 Detailed breakdown of full dataset including “SEED” and “EXT” by ground truth class and characteristics (study, device, and geography), highlighting both the number of images and relative percentage.
Fig. 1

Overview of dataset and model optimization strategy. We utilized a collated multi-device and multi-geography dataset, labelled “SEED” (orange panel), for model training and selection, and subsequently validated the performance of our chosen best-performing model on an external dataset, labelled “EXT” (blue panel), comprising images from a new device and new geographies (see Table 1 and METHODS for detailed descriptions and breakdown of the datasets by ground truth). We split the “SEED” dataset 10% : 1% : 79% : 10% into train : validation : Test 1 (“Model Selection Set”) : Test 2 (“Internal Validation”), and subsequently investigated the intersection of the model design choices in the bottom table on the train and validation sets. The models were ranked based on classification performance on the “Model Selection Set”, captured by the metrics highlighted on the center green panel. The “Internal Validation” set was subsequently utilized to further verify and confirm the ranked order of the models from the “Model Selection Set”. Finally, we validated the performance of our top model on “EXT”, conducting both an external validation and an interrater study (see METHODS). CE: cross entropy; QWK: quadratic weighted kappa; MSE: mean squared error; AUROC: area under the receiver operating characteristics curve.

Ground truth delineation

The ground truth quality labels for the images in the “SEED” and “EXT” datasets were assigned by four healthcare providers into four categories, using the following guidelines: “unusable” (where the images were either not of the cervix, used Lugol’s iodine for visual inspection, included a green filter, were post-surgery or post-ablation, and/or consisted of an upload artifact), “unsatisfactory” (where major technical quality factors such as blur, poor focus, poor light, obstructed view of the cervix due to mucus or blood, improper position, or over- and/or under-exposure did not allow for a visual diagnostic evaluation), “limited” (where certain technical quality factors still impacted image quality but a visual diagnostic evaluation was possible) and “evaluable” (where there were no technical factors affecting the quality of the image and a visual diagnosis was possible). Each of the raters was a licensed physician, board certified in gynecology or gynecologic oncology, with more than 20 years of experience in their field as well as specific expertise in HPV epidemiology. Three of the raters labelled images in the “SEED” and “EXT” datasets, while one rater labelled images only in the “EXT” dataset. The four-level ground truth mapping was converted into three levels: “low quality” (which combined the “unusable” and “unsatisfactory” categories), “intermediate quality” (“limited” category) and “high quality” (“evaluable” category). The rationale for combining the bottom two quality categories is twofold: first, since both “unusable” and “unsatisfactory” images cannot undergo visual diagnostic evaluation, we expect these images to be filtered out by the quality classifier and new images retaken for the patient; second, combining the lower two categories ensured a better dataset balance given the large number of “intermediate quality” (“limited” category) and “high quality” (“evaluable” category) images. Since both “intermediate” and “high” quality images can be visually evaluated by providers, we expect automated classifiers trained on these images to correspondingly provide diagnostic predictions. The breakdown of the final three-level ground truths in each dataset is highlighted in Table 1.
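For concreteness, the snippet below is a minimal sketch of the four-to-three level label collapse described above; the category strings follow the text, while the function name and mapping dictionary are illustrative rather than the authors' code.

```python
# Minimal sketch of the four-to-three level quality label collapse described above.
# Category strings follow the text; the function and dictionary names are illustrative.

FOUR_TO_THREE = {
    "unusable":       "low quality",          # combined with "unsatisfactory"
    "unsatisfactory": "low quality",
    "limited":        "intermediate quality",
    "evaluable":      "high quality",
}

def collapse_quality_label(four_level_label: str) -> str:
    """Map a rater-assigned four-level label to the three-level ground truth."""
    return FOUR_TO_THREE[four_level_label.lower()]

if __name__ == "__main__":
    for raw in ["unusable", "unsatisfactory", "limited", "evaluable"]:
        print(f"{raw:>14} -> {collapse_quality_label(raw)}")
```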

Ethics

All study participants provided written informed consent prior to enrollment and sample collection. All studies were reviewed and approved by the Institutional Review Boards of the National Cancer Institute (NCI) and the National Institutes of Health (NIH). The “EXT” studies were approved by country-specific IRBs from Cambodia and the Dominican Republic. All experiments and methods were performed in accordance with the relevant guidelines and regulations.

Model training and analysis

Utilizing a three-level ground truth of “low”, “intermediate” and “high” quality images, we investigated the design of an image quality classifier on the “SEED” dataset and externally validated the best model on the “EXT” dataset. We implemented our model design and selection approach in four distinct steps: 1. model development, 2. internal validation, 3. external validation and 4. interrater performance.

Model development

We conducted our experiments in multiple rounds, incorporating the intersections of model choices across several key model design choice categories (Fig. 1); these included different model architectures (densenet12124, resnet5025), loss functions (standard cross-entropy, quadratic weighted kappa26, and mean-squared error losses) and dataset balancing strategies (balanced sampling, balanced loss). Our design choices here were informed by prior work20 highlighting the utility of these choices across medical imaging domains, and specifically for the cervical domain.
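To make the search space concrete, below is a minimal sketch of enumerating the intersection of these design choices; the dictionary keys and run-naming scheme are illustrative assumptions, not the authors' code.

```python
# Sketch of enumerating the intersection of design choices named above
# (architectures x loss functions x balancing strategies). The configuration
# keys and run-naming scheme are illustrative.
from itertools import product

ARCHITECTURES = ["densenet121", "resnet50"]
LOSSES        = ["cross_entropy", "quadratic_weighted_kappa", "mean_squared_error"]
BALANCING     = ["balanced_sampling", "balanced_loss"]

configs = [
    {"arch": a, "loss": l, "balance": b, "run_name": f"{a}__{l}__{b}"}
    for a, l, b in product(ARCHITECTURES, LOSSES, BALANCING)
]

# 2 x 3 x 2 = 12 candidate configurations, consistent with the 12 ranked models
# shown later in the model selection figures.
print(f"{len(configs)} candidate model configurations")
for cfg in configs:
    print(cfg["run_name"])
```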

ROUND 1: Training set size

In the first round, our initial runs were aimed at investigating the impact of dataset size on model performance. We conducted model training runs that used either a high (65%) or low (10%) proportion of “SEED” data for training, and subsequently compared several key classification performance metrics between the two sets of runs using paired samples t-tests adjusted for multiple comparisons by the Bonferroni correction.
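The comparison described above can be illustrated with the short sketch below: for each metric, a paired-samples t-test across matched model runs (high- vs. low-proportion training) with a Bonferroni adjustment across metrics. The metric names and values are made-up placeholders.

```python
# Illustrative sketch of the paired comparison described above. Each array holds
# one value per matched model configuration; the numbers are placeholders.
import numpy as np
from scipy import stats

metrics_high = {"auroc": np.array([0.91, 0.92, 0.90, 0.93]),
                "kappa": np.array([0.64, 0.66, 0.63, 0.67])}
metrics_low  = {"auroc": np.array([0.90, 0.92, 0.91, 0.92]),
                "kappa": np.array([0.63, 0.66, 0.64, 0.66])}

n_comparisons = len(metrics_high)   # Bonferroni factor = number of metrics tested
for name in metrics_high:
    t_stat, p_raw = stats.ttest_rel(metrics_high[name], metrics_low[name])
    p_adj = min(1.0, p_raw * n_comparisons)   # Bonferroni-adjusted p-value
    print(f"{name}: t = {t_stat:.2f}, adjusted p = {p_adj:.3f}")
```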

ROUND 2: Cervix detection

In the second round, we investigated the impact of cervix detection on quality classifier performance, comparing model performance before and after cervix detection. The expected workflow in our overall multistep pipeline includes, in sequence, 1. image capture, 2. cervix detection, 3. image quality classification, 4. diagnostic classification, and 5. appropriate treatment as directed. In our overall pipeline, cervix detection can be considered a preprocessing task that bounds and crops the cervix for input into the downstream classifiers. Given that healthcare providers look only at the cervix, enclosed within its circumferential boundary, both when determining visual image quality and when visually determining precancer status via aceto-whitening near the transformation zone, our decision to bound and crop the cervix and pass only the cropped image into the downstream classifiers was intuitive and justified.

We used a YOLOv527 model architecture pretrained on the COCO dataset to train our custom cervix detector. Human-annotated ground truth bounding boxes were available for images that were split into 60% train, 10% validation, 20% test 1 and 10% test 2 sets. The detector was trained for 100 epochs and achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.954, indicating a very high level of performance. We subsequently compared several key classification metrics between quality classifier model runs before and after cervix detection by conducting paired samples t-tests, adjusted for multiple comparisons by the Bonferroni correction (Fig. 2).
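As an illustration of this preprocessing step, the sketch below loads a trained YOLOv5 detector through the public ultralytics/yolov5 torch.hub entry point and crops the input image to the detected box; the weights path, file names, and the choice to keep the highest-confidence box are assumptions for illustration.

```python
# Minimal sketch of applying a trained YOLOv5 detector to bound and crop the
# cervix before quality classification. Weights path and file names are hypothetical.
import torch
from PIL import Image

# Load custom YOLOv5 weights via the public ultralytics/yolov5 hub entry point.
detector = torch.hub.load("ultralytics/yolov5", "custom", path="cervix_detector.pt")

def crop_cervix(image_path: str) -> Image.Image:
    """Return the image cropped to the highest-confidence detected cervix box."""
    results = detector(image_path)
    boxes = results.xyxy[0]              # columns: x1, y1, x2, y2, confidence, class
    if len(boxes) == 0:
        raise ValueError(f"No cervix detected in {image_path}")
    x1, y1, x2, y2, conf, _ = boxes[boxes[:, 4].argmax()].tolist()
    return Image.open(image_path).crop((int(x1), int(y1), int(x2), int(y2)))

if __name__ == "__main__":
    cropped = crop_cervix("example_cervigram.jpg")   # hypothetical file name
    cropped.save("example_cervigram_cropped.jpg")
```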

Fig. 2

(a) Comparison of model performances with (green bars) and without (red bars) cervix detection. The bars report mean values of the corresponding metrics on the x-axis across all models. Results from paired samples t-tests adjusted for multiple comparisons by the Bonferroni correction (t-statistic, p-value) are highlighted in the text above the bars, demonstrating statistically significant improvements in model performance with cervix detection. (b) (i) Bounding boxes (highlighted in white) generated by running the cervix detector on 50 randomly selected images from the external (“EXT”) dataset; the cervix detector utilized a YOLOv5 architecture trained on “SEED” dataset images. (ii) Bound and cropped images of the cervix, which are passed on to the diagnostic classifier.

Model selection and internal validation

Our final model runs utilized the full 40,534-image “SEED” dataset with a split of 10% : 1% : 79% : 10% for training : validation : test 1 (model selection set) : test 2 (internal validation set) and iterated across all combinations of the design choices highlighted in Fig. 1. The specific configurations are highlighted in Table 2. All images were cropped with bounding boxes generated from a YOLOv527 model trained for cervix detection as noted above. RGB images were used for training, since the primary visual indicators of precancerous status in an image of the cervix require the presence of color (e.g., aceto-whitening near the transformation zone following application of acetic acid, growth or ulceration, vascular abnormalities); subtle color differences reflect underlying physiological and pathological changes associated with precancer/cancer. All models were trained for 75 epochs with a batch size (BS) of 8, a learning rate (LR) of 10⁻⁵, and an LR scheduler (ReduceLROnPlateau) which reduced the LR by a factor of 10 if no improvement was seen in the validation metric for 10 epochs. Our choices of a low LR with an LR scheduler, together with the BS and number of epochs, balanced model performance, training time, and available memory capacity, and ensured that all our models reached convergence. We used the summed normal and precancer AUC on the validation set as the early stopping criterion during training. Before training, images were resized to 256 × 256 pixels and scaled to intensity values from 0 to 1. During training, affine transformations were applied to the images for data augmentation. We initialized all model architectures with ImageNet pretrained weights. Additionally, we implemented Monte Carlo (MC) dropout28 in order to alleviate overfitting and regularize the learning process by randomly removing neural connections from the model29. Spatial dropout at a rate of 0.1 was applied after each dense layer for the densenet121 models, and after each residual block for the resnet50 models. The final model prediction was generated from the dropout-trained models by averaging the inference predictions over 50 forward passes with dropout active; each model’s prediction can thus be thought of as the average of 50 MC samples, analogous to averaging 50 repeat runs of the model.
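The sketch below illustrates MC-dropout inference as described above: dropout is kept stochastic at test time and softmax outputs are averaged over 50 forward passes. The backbone wiring (a single dropout layer before the densenet121 classifier head) is a simplification of the per-block spatial dropout in the text, and all names are illustrative.

```python
# Simplified sketch of Monte Carlo dropout inference: predictions are averaged
# over 50 stochastic forward passes. The dropout placement is a simplification
# of the spatial dropout described in the text, not the authors' exact model.
import torch
import torch.nn as nn
from torchvision import models

class QualityClassifier(nn.Module):
    def __init__(self, num_classes: int = 3, p_drop: float = 0.1):
        super().__init__()
        self.backbone = models.densenet121(weights="IMAGENET1K_V1")  # ImageNet-pretrained
        in_features = self.backbone.classifier.in_features
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(in_features, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Average softmax probabilities over n_samples stochastic forward passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()          # keep dropout stochastic while other layers stay in eval mode
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)   # shape: (batch, num_classes)

if __name__ == "__main__":
    model = QualityClassifier()
    dummy = torch.rand(2, 3, 256, 256)   # images resized to 256 x 256, scaled to [0, 1]
    print(mc_dropout_predict(model, dummy, n_samples=50))
```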

Table 2 Configurations of the final set of runs investigated during model selection and internal validation, where each model comprised a unique combination of architecture, loss function and balancing strategy.

In the internal validation stage, we ranked our final models in order of performance on the “Model Selection Set” (“Test Set 1” = 32,100 images). We subsequently confirmed the performance of these models on the previously held aside “Internal Validation Set” (“Test Set 2” = 3,975 images). We ranked our models based on area under the receiver operating characteristics curve (AUROC), kappa (linear and quadratic weights), as well as %extreme misclassifications (%EM, representing the proportion of images with a two-class misclassification), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ) (Fig. 3).
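As an illustration of how these ranking metrics can be computed from model outputs, a short sketch using scikit-learn follows; the label encoding (0 = low, 1 = intermediate, 2 = high quality), the example arrays, and the choice of denominator for the per-class misclassification rates are assumptions.

```python
# Illustrative computation of the ranking metrics listed above from integer class
# labels and softmax probabilities. The example arrays are made up.
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

y_true = np.array([0, 2, 1, 2, 0, 1, 2, 0])
y_prob = np.array([[0.80, 0.15, 0.05], [0.10, 0.20, 0.70], [0.20, 0.60, 0.20],
                   [0.05, 0.25, 0.70], [0.60, 0.30, 0.10], [0.30, 0.50, 0.20],
                   [0.10, 0.10, 0.80], [0.20, 0.10, 0.70]])
y_pred = y_prob.argmax(axis=1)

auroc_lq  = roc_auc_score((y_true == 0).astype(int), y_prob[:, 0])   # LQ vs. rest
auroc_hq  = roc_auc_score((y_true == 2).astype(int), y_prob[:, 2])   # HQ vs. rest
kappa_lin = cohen_kappa_score(y_true, y_pred, weights="linear")
kappa_qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")

pct_em       = 100 * np.mean(np.abs(y_pred - y_true) == 2)           # two-class errors
# Denominator choice (all images) is an assumption; the paper may normalize per class.
pct_hq_as_lq = 100 * np.mean((y_true == 2) & (y_pred == 0))
pct_lq_as_hq = 100 * np.mean((y_true == 0) & (y_pred == 2))

print(f"AUROC LQ vs rest: {auroc_lq:.2f}, HQ vs rest: {auroc_hq:.2f}")
print(f"Kappa linear: {kappa_lin:.2f}, quadratic: {kappa_qwk:.2f}")
print(f"%EM: {pct_em:.1f}, %HQ as LQ: {pct_hq_as_lq:.1f}, %LQ as HQ: {pct_lq_as_hq:.1f}")
```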

Fig. 3

Classification performance metrics on the “Internal Validation Set” (“Test Set 2”) for the models investigated. The models are arranged from top to bottom in order of decreasing performance. Specifically, (a) highlights the discrete classification metrics: %extreme misclassifications (% ext. mis.), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ); (b) highlights the Kappa metrics (linear, quadratic weighted); and (c) highlights the area under the receiver operating characteristics curve (AUROC) for each of the low quality (LQ) versus rest and high quality (HQ) versus rest categories. While our top models overall performed reasonably similarly in terms of the continuous metrics (panels b and c), the discrete metrics (panel a) separated out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. The model ranking is consistent with the ranking observed on the “Model Selection Set” (“Test Set 1”) (Supp. Fig. 1).

Finally, to aid better visualization of predictions at the individual model level, we generated Fig. 4, which compares model predictions across 60 images for the ranked list of models. To generate this comparison, we first summarized each model’s output as a continuous severity \(score\). Specifically, we utilized the ordinality of our problem and defined the continuous severity \(score\) as a weighted average using the softmax probability of each class \(i\) (\({p}_{i}\)) as described in Eq. 3, where \(k\) = number of classes:

Fig. 4

Model-level comparison across the investigated models on the “Internal Validation Set” (“Test Set 2”). 60 images were randomly selected from this set (see METHODS/Model Training and Analysis/Model Selection and Internal Validation) and arranged in order of increasing mean score within each ground truth class in the top row (labelled “Ground Truth”). The class predicted by each investigated model for each of these 60 images is highlighted in the bottom rows, where the images follow the same order as the top row. The color coding in the top row represents the ground truth, while that in the bottom 12 rows represents the model-predicted class: Red: Low Quality, Gray: Intermediate, and Green: High Quality, as highlighted in the legend. As we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve.

$$score= \sum_{i=0}^{k-1}{p}_{i} \times i$$

Put another way, the \(score\) is equivalent to the expected value of a random variable that takes values equal to the class labels, with probabilities given by the model’s softmax probability at index \(i\) corresponding to class label \(i\). For a three-class model, the values lie in the range 0 to 2. We next computed the average \(score\) for each image across all models and ordered the images by increasing average \(score\) within each class. From this \(score\)-ordered list, we randomly selected 20 images per class, maintaining the distribution of mean scores within each class, and arranged the selected images in order of increasing average \(score\) within each class in the top row of Fig. 4, color-coded by ground truth. We subsequently compared the predicted class across the 12 models for each of these 60 images (bottom 12 rows of Fig. 4), maintaining the images in the same order as the ground truth row and color-coding each cell by the model-predicted class. The image panels at the top of Fig. 4 depict select images with relevant metadata.
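A small sketch of the severity score defined above is given below; it computes the expected class index under the softmax distribution, so a three-class model yields values in [0, 2]. The example probabilities are made up.

```python
# Sketch of the continuous severity score defined above: score = sum_i p_i * i
# over the k classes (i = 0 .. k-1). Example probabilities are placeholders.
import numpy as np

def severity_score(softmax_probs: np.ndarray) -> np.ndarray:
    """Expected class index under each row's softmax distribution."""
    k = softmax_probs.shape[1]
    class_indices = np.arange(k)          # 0 .. k-1
    return softmax_probs @ class_indices

probs = np.array([[0.70, 0.20, 0.10],    # confidently low quality  -> score ~0.4
                  [0.10, 0.80, 0.10],    # intermediate             -> score ~1.0
                  [0.05, 0.15, 0.80]])   # high quality             -> score ~1.75
print(severity_score(probs))
```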

External validation

Because our internal validation set shared similar characteristics with our training data (i.e., similar devices and geographies), our next stage consisted of validating our best performing model on external data (“EXT”). Our external test set (“EXT” = “Test Set 3”) comprised images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic (DR)).

First, to get a sense of the distributions of the “SEED” and “EXT” datasets, including the distributions by device and geography, we ran out-of-the-box (OOB) inference with our best performing model on “Test Set 2” (“Internal Validation Set”) from the “SEED” dataset and on the full “EXT” dataset. We subsequently plotted UMAPs of the resulting features, a dimension-reduced representation of the features output by the model during inference, color-coded by dataset, device, and geography, respectively (Fig. 5).
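The sketch below illustrates this kind of feature-space visualization with the umap-learn package: model features are reduced to two dimensions and color-coded by metadata. The feature array, labels, and file names are randomly generated placeholders, not the study data.

```python
# Sketch of reducing model features to 2-D with UMAP and color-coding by metadata.
# The features and labels here are random placeholders.
import numpy as np
import umap                       # umap-learn package
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 1024))                 # e.g., pooled backbone features
dataset_labels = rng.choice(["SEED", "EXT"], size=500)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)

for name in np.unique(dataset_labels):
    mask = dataset_labels == name
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=5, label=name)
plt.legend(title="Dataset")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.savefig("umap_by_dataset.png", dpi=200)
```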

Fig. 5

Uniform manifold approximation and projections (UMAP) highlighting the relative distributions of the datasets, devices and geographies investigated in this work. Each subplot highlights a different representation of the UMAP, where the color coding (highlighted in the corresponding legend at the top of each subplot) is at the (a) dataset-level (seed vs. external), (b) device-level and (c) geography-level. The datasets and devices occupy distinct clusters in (a) and (b), while the geographies are all clustered together within the same device in (c).

We further tested the impact of device- and geography-level heterogeneity on our model performance via three distinct sets of investigations: (i) out-of-the-box (OOB) inference on “EXT”; (ii) device-level retraining: adding multi-geography “EXT” images to “SEED” in a 65% : 10% : 25% ratio of train : validation : test and training on the full collated dataset; and (iii) geography-level retraining: adding either Cambodia or DR “EXT” images to “SEED” in separate experiments and training on the full collated dataset. For (i), the OOB model run, we investigated performance on both the full “EXT” test set and the individual geographies, Cambodia and DR. For (ii), the device-level retraining run, we investigated performance on both the “EXT” test set and “SEED” Test Set 2, to assess the possibility of performance degradation (catastrophic forgetting) on “SEED” data upon retraining.
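As a rough illustration of the device-level retraining split in (ii), the sketch below divides the “EXT” images 65% / 10% / 25% into train / validation / test and pools the train portion with “SEED” training images, under the reading that the stated ratio applies to the added “EXT” images; the file lists and helper names are hypothetical.

```python
# Illustrative sketch of assembling the device-level retraining split. The
# interpretation of the 65/10/25 ratio and all file names are assumptions.
import numpy as np

def split_ext(ext_items, seed=0, fractions=(0.65, 0.10, 0.25)):
    rng = np.random.default_rng(seed)
    items = np.array(ext_items)
    rng.shuffle(items)
    n_train = int(fractions[0] * len(items))
    n_val   = int(fractions[1] * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

ext_images = [f"ext_{i:04d}.jpg" for i in range(1340)]      # 1,340 "EXT" images
ext_train, ext_val, ext_test = split_ext(ext_images)

seed_train = [f"seed_{i:05d}.jpg" for i in range(4053)]     # placeholder "SEED" train list
combined_train = list(seed_train) + list(ext_train)
print(len(ext_train), len(ext_val), len(ext_test), len(combined_train))
```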

Interrater assessment

Finally, we conducted an interrater assessment of model performance with respect to the ground truth denoted by two different raters on 100 newly acquired, external (“EXT”) dataset images (device = IRIS colposcope, geography = Cambodia). Rater 1 was one of several raters who had labelled images in the “SEED” dataset on which the model was trained, while Rater 2 was a completely new rater. We specifically investigated the OOB performance of our best performing model (which was trained on “SEED”) on the 100 “EXT” images with respect to each individual rater’s ground truth, computing key classification metrics (AUROC, %EM) (Fig. 7a) and ROC curves for each (Fig. 7b,c). Further, we investigated the degree of concordance between the two raters’ ground truths and the corresponding model predictions on each of the 100 images using a rater-level confusion matrix color-coded by model prediction.
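A minimal sketch of this interrater comparison is shown below: per-rater AUROC and %extreme misclassifications for the model, plus a rater-versus-rater confusion matrix. The label encoding (0 = low, 1 = intermediate, 2 = high quality) and all data are made-up placeholders.

```python
# Sketch of the interrater comparison: per-rater AUROC and %EM for the model,
# plus a Rater 1 vs. Rater 2 agreement matrix. Labels and probabilities are random.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(1)
model_prob = rng.dirichlet(np.ones(3), size=100)     # placeholder softmax outputs
model_pred = model_prob.argmax(axis=1)
rater1 = rng.integers(0, 3, size=100)                # placeholder rater labels
rater2 = rng.integers(0, 3, size=100)

for name, rater in [("Rater 1", rater1), ("Rater 2", rater2)]:
    auroc_lq = roc_auc_score((rater == 0).astype(int), model_prob[:, 0])
    auroc_hq = roc_auc_score((rater == 2).astype(int), model_prob[:, 2])
    pct_em   = 100 * np.mean(np.abs(model_pred - rater) == 2)
    print(f"{name}: AUROC LQ vs rest {auroc_lq:.2f}, HQ vs rest {auroc_hq:.2f}, %EM {pct_em:.1f}")

# Rows: Rater 1, columns: Rater 2; Fig. 7c additionally colors each cell by the
# model's prediction.
print(confusion_matrix(rater1, rater2))
```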

Results and discussion

In this work, we implemented a multi-stage model selection approach to generate an image quality classifier utilizing a multi-device and multi-geography “SEED” dataset, and subsequently validated the best performing model on an external “EXT” dataset, assessing the relative impact of device-level, geography-level, and rater-level heterogeneity on our model.

Model development

ROUND 1: Training set size

Supp. Table 1 highlights that using a high proportion (65%) of all available data for training instead of a low proportion (10%) did not meaningfully improve or alter model performance. We consequently chose to limit all subsequent experiments to a low proportion (10%) of the data to save training time, optimize available memory capacity and improve computational efficiency. Conceptually, this is consistent with our expectation: given the large size of our “SEED” dataset (40,534 images), even 10% of the dataset amounts to roughly 4,053 training images, which is a reasonably large number for this task.

ROUND 2: Cervix detection

Figure 2 highlights statistically significant improvements in model performance with cervix detection across several key classification metrics, including linear kappa (LK), quadratic weighted kappa (QWK), accuracy, area under the receiver operating characteristics curve (AUROC) and area under the precision recall curve (AUPRC). We consequently chose to limit our final set of model selection experiments to models utilizing images bound and cropped following cervix detection. Conceptually, this is consistent with our expectation: our raters primarily utilized the region around the cervical os, largely encompassed by the circumferential boundary of the cervix, when determining quality ground truths, given that this is the region of the cervix used to visually determine cervical precancer and cancer.

Model selection and internal validation

Figure 3 depicts the rank order of our final models on “Test Set 2” (“Internal Validation Set”). This rank order is consistent with the rank order of models on “Test Set 1” (“Model Selection Set”) (Supp. Fig. 1), demonstrating that the performance differences between the models are driven by the design choices and not by chance or by the specific composition of the individual test sets. Figure 3 highlights that while all models perform well and reasonably similarly in terms of the continuous metrics (the top models have similar AUROCs ~ 0.92 and Kappa ~ 0.65), the discrete metrics (%EM, %LQ as HQ and %HQ as LQ) effectively discriminate between models and separate out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. Finally, Fig. 4 provides a more granular view of this difference in performance between the models, demonstrating that as we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve. This is consistent with our expectations given the design choices investigated in our model selection: incorporation of an “intermediate” class, together with MC dropout and a loss function (QWK) that penalizes misclassifications between the extreme classes, ensures that we effectively handle ambiguous cases at the class boundaries. Our best model (“Model 1” in Figs. 3 and 4) utilizes densenet121 as the architecture, quadratic weighted kappa as the loss function and balanced loss as the balancing strategy. Even though we used RGB images as input, our general model selection and validation approach is independent of color and should apply widely, even to grayscale images in diagnostic radiology.

External validation

The UMAPs in Fig. 5a,b highlight that the “EXT” dataset and its corresponding IRIS colposcope device occupy regions similar to the DSLR and J5 clusters from the “SEED” dataset, while Fig. 5c highlights the geography-level distribution. Taken together, Fig. 5a–c suggest that 1. while there is device-level heterogeneity within the data, its likely impact on model performance on “EXT” should be minimal given the proximity of “EXT” to “SEED”; and 2. geography should not play a role in model performance, given that, within the same device, different geographies do not occupy distinct clusters in Fig. 5c, unlike the corresponding device-level clusters in Fig. 5b.

Figure 6 highlights that our model demonstrated strong out-of-the-box (OOB) performance on external data: AUROC of 0.83 (LQ vs. rest) and 0.82 (HQ vs. rest), and a %EM of 3.9% (Fig. 6a.i, blue bars), consistent with our expectation from Fig. 5; these values further improved upon retraining with “EXT” images added to “SEED” (AUROC = 0.95, 0.88 respectively; %EM = 1.8%) (Fig. 6a.i, orange bars). Additionally, we found that 1. retraining using external data did not adversely affect performance on “SEED”, i.e., there was no catastrophic forgetting (AUROC = 0.92, 0.93 respectively; %EM = 3.2%) (Fig. 6a.i, yellow bars), and 2. our model is geography agnostic: OOB performance did not meaningfully differ between Cambodia and Dominican Republic (DR) images (Fig. 6b.i, light and dark blue bars), and models trained on “SEED” + Cambodia images performed strongly on DR images and vice versa (Fig. 6b.i, light and dark green bars). Taken together, Figs. 5 and 6 suggest that there is no impact of geography-level heterogeneity on model performance, and while there is some degree of device-level heterogeneity as captured by the UMAPs, performance on our external device (IRIS colposcope) is strong.

Fig. 6

External validation of our best performing model on “EXT” dataset. Panel (a) highlights the strong out-of-the-box (OOB) performance of our model, where area under the receiver operating characteristics curve (AUROC) = 0.83 (low quality, LQ vs. rest) and 0.82 (high quality, HQ vs. rest), and %extreme misclassification (%Ext. Mis.) = 3.9% (a.i, blue bars), with the corresponding confusion matrix and ROC curve in (ii). Panel (a) further highlights the improvement in performance upon retraining, where AUROC = 0.95, 0.88 respectively; %Ext. Mis. = 1.8% on “EXT” test set (a.i, orange bars) and the absence of catastrophic forgetting, where AUROC = 0.92, 0.93 respectively; %Ext. Mis. = 3.2% on “SEED” Test Set 2 (a.i, yellow bars; confusion matrix and ROC curves in iii). Panel (b) highlights that our model is geography agnostic, with no meaningful difference in OOB performance on “EXT” between Cambodia (Cam.) and Dominican Republic (DR) (b.i, light and dark blue bars) and strong performance on DR for models trained on “SEED” + Cambodia and vice versa (b.i. light and dark green bars; confusion matrices and ROC curves depicted in ii and iii respectively).

Interrater assessment

Figure 7 demonstrates that our model mimics overall rater behavior well. Our model demonstrated strong OOB performance on each individual rater’s ground truth (Rater 1: AUROC = 0.96, 0.85 respectively, and %EM = 2%; Rater 2: AUROC = 0.87, 0.80 respectively, and %EM = 8%). Rater 1 was involved in the generation of ground truths within “SEED”, meaning that the model had “seen” Rater 1 ground truth patterns in the “SEED” data, while Rater 2 was a completely new rater. Our model correctly predicted 85% of cases where both raters agreed on either “low quality” or “high quality” images, making grave errors on only two images. The “intermediate” class is known to be highly uncertain among raters, given that there is generally disagreement among healthcare providers as to what defines an “intermediate” or limited quality image. We found that our model uniquely captured this rater-level uncertainty and disagreement with the “intermediate” class and largely erred on the side of caution; images predicted as “low” quality by our model were largely deemed “low” quality by at least one rater, and vice versa, while images deemed “intermediate” by one rater and “high” quality by the other largely had a mix of “intermediate” and “high quality” model predictions. This pattern of model performance is optimal for our use case, since we expect to utilize our model to filter out “low” quality images while allowing “intermediate” and “high” quality images to pass through to diagnostic classification.

Fig. 7

Interrater assessment of our best performing model on 100 newly acquired “EXT” dataset images (device = IRIS colposcope, geography = Cambodia), with respect to the ground truth denoted by two different raters. Rater 1 was one of the raters who had labelled images in the “SEED” dataset on which the model was trained, while Rater 2 was a completely new rater. Our model demonstrated strong performance out-of-the-box (OOB) on each individual rater’s ground truth, where for Rater 1: area under the receiver operating characteristics curve (AUROC) = 0.96 (low quality, LQ vs. rest) and 0.85 (high quality, HQ vs. rest), and %extreme misclassifications (%Ext. Mis.) = 2% (panel (a), blue bars; ROC curves in panel (b)); and for Rater 2: AUROC = 0.87, 0.80 respectively, and %Ext. Mis. = 8% (panel (a), red bars; ROC curves in panel (b)). Panel (c) highlights the degree of concordance between the two raters’ ground truths (x-axis: Rater 1; y-axis: Rater 2) and the corresponding model prediction on each of the 100 images using a confusion matrix color-coded by model prediction (Red: low quality; Gray: intermediate quality; and Green: high quality).

Diagnostic classifier performance by quality

To further shed light on the motivation behind the image quality classifier and the experiments reported here, we investigated the performance of our downstream diagnostic classifier within each quality ground truth class. Since our downstream diagnostic classification dataset utilized both “Intermediate” and “High Quality” images for training and testing, we can examine the diagnostic classifier model predictions on its test set with respect to the diagnostic classification ground truth within each of these two quality classes and determine whether image quality impacts diagnostic classifier model predictions. A detailed description of our diagnostic classifier model and the multi-heterogeneous (multi-device, multi-geography) dataset used can be found in20. The results from this analysis are reported in Fig. 8, which utilized the diagnostic classifier’s test set comprising 10,420 images.

Fig. 8

Analysis of diagnostic classifier performance by image quality class. The x-axis represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”). Within each of the six coordinates (reflecting the six combinations of quality and diagnostic classifier ground truths), each color-coded bubble represents the diagnostic classifier model predictions, with the relative sizes of the bubbles indicating the relative ratio of predictions for each class within each of the six coordinates. The number in the center of each bubble represents the count of images predicted as the diagnostic class of the given color, as highlighted in the legend at the top, where Green: Normal, Gray: Gray Zone/Indeterminate, and Red: Precancer+.

The x-axis of the bubble plot in Fig. 8 represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”); within each of the six coordinates, the color-coded bubbles represent the diagnostic classifier model predictions, with bubble size indicating the relative ratio of predictions for each class. Crucially, none of the images labelled “Intermediate” quality are predicted to be “Normal” by the diagnostic classifier; this leads to a large discrepancy of 249 images that are ground-truth labelled as “Normal” but are predicted to be largely “Gray Zone/Indeterminate” (80%) or “Precancer+” (20%) by the diagnostic classifier. This is a critical finding, since it signifies that the quality of a cervical image is important for downstream diagnostic classification. It appears that poorer quality images are deemed pathologic by the diagnostic classifier regardless of ground truth, thereby reinforcing the need for a dedicated deep learning-based model to filter out poorer quality images prior to diagnostic classification.

Analysis of performance by available quality factor

The dataset utilized in this work had a small number of “Low Quality” images for which the specific quality factor that led to the image being rated as poor quality was additionally annotated (n = 785 in the combined test set). To aid the explainability of our model, we assessed the performance of our top performing quality classifier (Model 1 in Figs. 3 and 4) within each of these categories by calculating the accuracy of model predictions, i.e., the proportion of images within each specific quality factor category that were correctly predicted as “Low Quality” by the model. Figure 9 depicts the bar plot of the accuracy within each specific quality factor category, highlighting that our quality classifier model performs well in filtering out post-iodine images as well as images where the view of the cervix is obscured due to the position of the cervix or an unspecified reason, while it does less well on images where the view of the cervix is obscured by mucus or blood.
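A short sketch of this per-factor accuracy computation is shown below: among “Low Quality” images annotated with a specific quality factor, the fraction the model also predicts as “Low Quality”. The data frame contents and factor names are made-up placeholders.

```python
# Sketch of the per-quality-factor accuracy analysis: fraction of annotated
# "Low Quality" images (class 0) that the model also predicts as class 0.
# The example data are placeholders.
import pandas as pd

df = pd.DataFrame({
    "quality_factor": ["post-iodine", "post-iodine", "obscured (mucus/blood)",
                       "obscured (position)", "obscured (mucus/blood)", "blur"],
    "model_pred":     [0, 0, 1, 0, 2, 0],   # 0 = low, 1 = intermediate, 2 = high quality
})

accuracy_by_factor = (
    df.assign(correct=lambda d: d["model_pred"] == 0)
      .groupby("quality_factor")["correct"]
      .agg(accuracy="mean", n="size")
)
print(accuracy_by_factor)
```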

Fig. 9

Analysis of quality classifier performance by available quality factor, where each bar represents the accuracy of the best performing quality classifier model (Model 1 in Figs. 3 and 4) within each specific quality factor category, as denoted on the x-axis. The total number of images in each category is denoted at both the bottom and top of each bar. On the x-axis, “Obscured” indicates that the view of the cervix is obscured by the factor denoted in parentheses.

Conclusion

To successfully translate AI pipelines to clinical practice, models must be designed with guardrails built in to deal with poor quality images. Domains such as cervical cancer screening, which involve a variety of image capture devices as well as image takers of varying skill, are particularly prone to image quality concerns that may adversely impact diagnostic evaluation. In this work, we tackle the image quality problem head on by generating and externally validating a multiclass image quality classifier able to classify images of the cervix into “low”, “intermediate” and “high” quality categories. We subsequently highlight that our best performing model generalizes well, performing strongly across multiple axes of data heterogeneity, including device, geography, and ground truth rater.

Our choice of three classes for image quality classification was motivated by two reasons. First, three classes most accurately represent true image quality for cervical images as encountered by raters/providers in the clinic, with the “intermediate” class effectively capturing the true ambiguity, and integrate seamlessly into our overall workflow: images predicted as “low” quality would be filtered out by our model, with the provider prompted to retake the image for the patient until the prediction is no longer “low” quality; only images deemed to be of sufficient quality (“intermediate” and “high” quality categories) would be passed on to downstream diagnostic classification. Second, by incorporating a three-class classifier with a loss function (QWK) that severely penalizes extreme misclassifications (i.e., LQ as HQ and vice versa) with quadratic weights, we further ensure greater separation between, and stronger discrimination of, the “low” and “high” quality boundary classes.

Despite the heterogeneous nature of our datasets, our work may be limited by the number of external devices utilized and the number of image takers. Additionally, we use RGB images as input, although, as noted above, our approach is independent of color and should extend to grayscale images. Forthcoming work will further evaluate our retraining approaches and assess model performance on additional external devices and image takers. Future work will also optimize our model for use on edge devices, thereby promoting clinical translation.

Our investigation of quality classifier performance across the various axes of heterogeneity present within our data underscores the importance of assessing the portability, or generalizability, of AI models designed for clinical deployment30,31,32. We posit that for any model to be deployed successfully in the clinic, portability concerns must be adequately acknowledged. This is particularly true in light of the FDA’s October 2023 guidance on effective implementation of AI/DL models33, which proposes the need to adapt models to data distribution shifts. We hope that our work will set a standard for assessing model performance against the axes of heterogeneity present in the dataset, across clinical domains, and will motivate the accompaniment of adequate guardrails for AI-based pipelines to account for concerns relating to image quality and generalizability.