Introduction

While there has been a notable surge in the development of artificial intelligence (AI) and deep learning (DL) pipelines for several clinical tasks1,2,3,4, translation of these models into clinical practice remains sparse. Particularly concerning is the extent to which models fail to perform in prospective studies intended for clinical translation, owing to differences between prospectively acquired, real-world clinical data and the data on which the models were trained. One unique challenge that has consistently hindered clinical AI deployment is the frequent lack of image quality control, especially in clinical domains that involve multiple and/or custom image capture devices5,6,7,8. In these domains, diagnostic AI models are often trained on images that have not been filtered for quality; this, in turn, can lead to misclassifications when the models are deployed and, ultimately, to adverse patient outcomes. Cervical cancer screening is one such domain where image quality is of particular importance, for several reasons: 1. only a small portion of the cervix is typically of interest for diagnosis; 2. the area of interest is frequently difficult to visualize and position properly; and 3. there are often multiple devices with different characteristics and multiple image takers in a given health region. For cervical screening, visual evaluation plays a crucial role as a triage step after a primary screening test (ideally, an HPV test) is positive; it is used to determine the need for and site of biopsies for histological confirmation and/or the adequacy of ablative treatment.

In the context of this work, image quality refers to the visual attributes of an image including the technical characteristics that determine its clarity and fidelity to the original subject9. In the case of cervical images, quality classification is a crucial task to ensure accurate screening and diagnosis of cervical cancer; this is true for both gynecologic-oncologists’ (manual) and diagnostic AI models’ (automated) predictions. Factors that may impact the quality of a cervical image include but are not limited to blur, poor focus, poor light, noise, obscured view of the cervix due to mucus and/or blood, improper position of the speculum, insufficient magnification, glare, specular reflection, and over- and/or under-exposure. There is a paucity of work in the current DL and medical image classification literature that assesses clinical image quality; most current pipelines therefore lack an image quality check and tend to perform poorly on poor quality images5,6,7,8.

Cervical cancer ranks as the fourth most prevalent cause of cancer-related morbidity and mortality worldwide, with around 90% of the 300,000 deaths per year occurring in low-resource settings10,11,12. Despite a strong understanding of the causal pathway, predominantly attributed to Human Papillomavirus (HPV)11,13,14, effective control of cervical cancer remains elusive, particularly in low-resource settings15. To assess the risk of HPV-positive individuals, low-resource settings commonly employ visual inspection with acetic acid (VIA) as a triage method16,17. However, numerous studies have indicated that visual evaluation by healthcare providers exhibits suboptimal accuracy and repeatability18,19, creating a necessity for automated tools that can more consistently evaluate cervical lesions and direct the appropriate treatment protocol. To this end, we had previously generated a multiclass diagnostic classifier able to classify the appearance of the cervix into “normal”, “indeterminate” and “precancer/cancer” categories20.

Crucially, both manual and automated evaluation of cervical images, as captured by colposcopes, cell phone cameras or other devices, are dependent on quality; even highly trained healthcare providers such as colposcopists and gynecologic-oncologists are unable to confidently ascertain the cancer status of a cervix from poor quality images21. Therefore, there is a need for an accurate and generalizable image quality classifier to ensure that only images deemed of sufficient quality undergo diagnostic classification and evaluation, whether manual or automated. For instance, our diagnostic classifier20 utilized only images labelled “intermediate” or “high” quality for diagnostic classification. Our goal is to filter out the “low” quality images by prompting the user to retake an image if it is deemed of poor quality, and to pass through only the “intermediate” and “high” quality images, i.e., images deemed to be of sufficient quality for downstream diagnostic classification. In this work, we implemented a multi-stage model selection approach utilizing a collated, multi-heterogeneous dataset to generate a multi-class image quality classifier able to classify images into “low”, “intermediate” and “high” quality categories. We subsequently validated this classifier on an external, out-of-distribution (OOD) dataset, assessing the relative impacts of various axes of data heterogeneity, including device-, geography-, and ground truth rater-level heterogeneity, on the performance of our best quality classifier model.

Our work reaches several important conclusions regarding the performance of our quality classifier model, which, we believe, hold relevance across multiple clinical domains even outside of cervical imaging:

1. Object Detection: Model performances improve after employing a trained bounding box detector to bound and crop the cervix from images and training/testing on the bound and cropped images.

2. Generalizability:

  a. Device-level heterogeneity: Our model performs strongly out of the box on an external dataset comprising images from a different device.

  b. Geography-level heterogeneity: Our model is geography agnostic, meaning that there is no impact of geography-level heterogeneity on model performance.

  c. Label/Ground Truth rater-level heterogeneity: Our model strongly mimics the overall/average rater behavior; it discriminates the important boundary classes (“low” and “high” quality) well and reasonably captures the degree of uncertainty seen with the “intermediate” class.

Materials and methods

Dataset

Included studies

We utilized two groups of datasets in this study: (1) a collated, multi-device (cervigram, DSLR, J5, S8) and multi-geography (Costa Rica, USA, Europe, Nigeria) dataset, labelled “SEED”, which comprised a convenience sample combining six distinct studies—Natural History Study (NHS), ASC-US/LSIL Triage Study for Cervical Cancer (ALTS), Costa Rica Vaccine Trial (CVT), Biopsy Study in the US (Biop), Biopsy Study in Europe (D Biop)20 and Project Itoju22; and (2) an external dataset, labelled “EXT”, comprising images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic) collected as part of the HPV-Automated Visual Evaluation (PAVE) study23 (Table 1, Fig. 1). The “SEED” dataset comprised a total of 40,534 images, while the “EXT” dataset comprised 1,340 images (Table 1).

Table 1 Detailed breakdown of full dataset including “SEED” and “EXT” by ground truth class and characteristics (study, device, and geography), highlighting both the number of images and relative percentage.
Fig. 1

Overview of dataset and model optimization strategy. We utilized a collated multi-device and multi-geography dataset, labelled “SEED” (orange panel), for model training and selection, and subsequently validated the performance of our chosen best-performing model on an external dataset, labelled “EXT” (blue panel), comprising images from a new device and new geographies (see Table 1 and METHODS for detailed descriptions and breakdown of the datasets by ground truth). We split the “SEED” dataset 10% : 1% : 79% : 10% into train : validation : Test 1 (“Model Selection Set”) : Test 2 (“Internal Validation”), and subsequently investigated the intersection of the model design choices in the bottom table on the train and validation sets. The models were ranked based on classification performance on the “Model Selection Set”, captured by the metrics highlighted on the center green panel. The “Internal Validation” set was subsequently utilized to further verify and confirm the ranked order of the models from the “Model Selection Set”. Finally, we validated the performance of our top model on “EXT”, conducting both an external validation and an interrater study (see METHODS). CE: cross entropy; QWK: quadratic weighted kappa; MSE: mean squared error; AUROC: area under the receiver operating characteristics curve.

Ground truth delineation

The ground truth quality labels for the images in the “SEED” and “EXT” datasets were assigned by four healthcare providers into four categories, using the following guidelines: “unusable” (where the images were either not of the cervix, used Lugol’s iodine for visual inspection, included a green filter, were post-surgery or post-ablation, and/or consisted of an upload artifact), “unsatisfactory” (where major technical quality factors such as blur, poor focus, poor light, obstructed view of the cervix due to mucus or blood, improper position, or over- and/or under-exposure did not allow for a visual diagnostic evaluation), “limited” (where certain technical quality factors still impacted image quality but a visual diagnostic evaluation was possible) and “evaluable” (where there were no technical factors affecting the quality of the image and a visual diagnosis was possible). Each of the raters was a licensed physician, board certified in gynecology or gynecologic oncology, with more than 20 years of experience in their field as well as specific expertise in HPV epidemiology. Three of the raters labelled images in the “SEED” and “EXT” datasets, while one rater labelled images only in the “EXT” dataset. The four-level ground truth mapping was converted into three levels: “low quality” (which combined the “unusable” and “unsatisfactory” categories), “intermediate quality” (“limited” category) and “high quality” (“evaluable” category). The rationale for combining the bottom two quality categories is twofold: first, since both “unusable” and “unsatisfactory” images cannot undergo visual diagnostic evaluation, we expect these images to be filtered out by the quality classifier and new images retaken for the patient; second, combining the lower two categories ensured a better dataset balance given the large number of “intermediate quality” (“limited” category) and “high quality” (“evaluable” category) images. Since both “intermediate” and “high” quality images can be visually evaluated by providers, we expect automated classifiers trained on these images to correspondingly provide diagnostic predictions. The breakdown of the final three-level ground truths in each dataset is highlighted in Table 1.
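For concreteness, the snippet below is a minimal sketch of the four-to-three level label collapse described above; the category strings follow the text, while the function name and mapping dictionary are illustrative rather than the authors' code.

```python
# Minimal sketch of the four-to-three level quality label collapse described above.
# Category strings follow the text; the function and dictionary names are illustrative.

FOUR_TO_THREE = {
    "unusable":       "low quality",          # combined with "unsatisfactory"
    "unsatisfactory": "low quality",
    "limited":        "intermediate quality",
    "evaluable":      "high quality",
}

def collapse_quality_label(four_level_label: str) -> str:
    """Map a rater-assigned four-level label to the three-level ground truth."""
    return FOUR_TO_THREE[four_level_label.lower()]

if __name__ == "__main__":
    for raw in ["unusable", "unsatisfactory", "limited", "evaluable"]:
        print(f"{raw:>14} -> {collapse_quality_label(raw)}")
```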

Ethics

All study participants provided written informed consent prior to enrollment and sample collection. All studies were reviewed and approved by the Institutional Review Boards of the National Cancer Institute (NCI) and the National Institutes of Health (NIH). The “EXT” studies were approved by country-specific IRBs from Cambodia and the Dominican Republic. All experiments and methods were performed in accordance with the relevant guidelines and regulations.

Model training and analysis

Utilizing a three-level ground truth of “low”, “intermediate” and “high” quality images, we investigated the design of an image quality classifier on the “SEED” dataset and externally validated the best model on the “EXT” dataset. We implemented our model design and selection approach in four distinct steps: 1. model development, 2. internal validation, 3. external validation and 4. interrater performance.

Model development

We conducted our experiments in multiple rounds, incorporating the intersections of model choices across several key model design choice categories (Fig. 1); these included different model architectures (densenet12124, resnet5025), loss functions (standard cross-entropy, quadratic weighted kappa26, and mean-squared error losses) and dataset balancing strategies (balanced sampling, balanced loss). Our design choices here were informed by prior work20 highlighting the utility of these choices across medical imaging domains, and specifically for the cervical domain.
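To make the search space concrete, below is a minimal sketch of enumerating the intersection of these design choices; the dictionary keys and run-naming scheme are illustrative assumptions, not the authors' code.

```python
# Sketch of enumerating the intersection of design choices named above
# (architectures x loss functions x balancing strategies). The configuration
# keys and run-naming scheme are illustrative.
from itertools import product

ARCHITECTURES = ["densenet121", "resnet50"]
LOSSES        = ["cross_entropy", "quadratic_weighted_kappa", "mean_squared_error"]
BALANCING     = ["balanced_sampling", "balanced_loss"]

configs = [
    {"arch": a, "loss": l, "balance": b, "run_name": f"{a}__{l}__{b}"}
    for a, l, b in product(ARCHITECTURES, LOSSES, BALANCING)
]

# 2 x 3 x 2 = 12 candidate configurations, consistent with the 12 ranked models
# shown later in the model selection figures.
print(f"{len(configs)} candidate model configurations")
for cfg in configs:
    print(cfg["run_name"])
```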

ROUND 1: Training set size

In the first round, our initial runs were aimed at investigating the impact of dataset size on model performance. We conducted model training runs that used either a high (65%) or low (10%) proportion of “SEED” data for training, and subsequently compared several key classification performance metrics between the two sets of runs using paired samples t-tests adjusted for multiple comparisons by the Bonferroni correction.
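The comparison described above can be illustrated with the short sketch below: for each metric, a paired-samples t-test across matched model runs (high- vs. low-proportion training) with a Bonferroni adjustment across metrics. The metric names and values are made-up placeholders.

```python
# Illustrative sketch of the paired comparison described above. Each array holds
# one value per matched model configuration; the numbers are placeholders.
import numpy as np
from scipy import stats

metrics_high = {"auroc": np.array([0.91, 0.92, 0.90, 0.93]),
                "kappa": np.array([0.64, 0.66, 0.63, 0.67])}
metrics_low  = {"auroc": np.array([0.90, 0.92, 0.91, 0.92]),
                "kappa": np.array([0.63, 0.66, 0.64, 0.66])}

n_comparisons = len(metrics_high)   # Bonferroni factor = number of metrics tested
for name in metrics_high:
    t_stat, p_raw = stats.ttest_rel(metrics_high[name], metrics_low[name])
    p_adj = min(1.0, p_raw * n_comparisons)   # Bonferroni-adjusted p-value
    print(f"{name}: t = {t_stat:.2f}, adjusted p = {p_adj:.3f}")
```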

ROUND 2: Cervix detection

In the second round, we investigated the impact of cervix detection on quality classifier performance, comparing model performance before and after cervix detection. The expected workflow in our overall multistep pipeline includes, in sequence, 1. image capture, 2. cervix detection, 3. image quality classification, 4. diagnostic classification, and 5. appropriate treatment as directed. In our overall pipeline, cervix detection can be considered a preprocessing task that bounds and crops the cervix for input into the downstream classifiers. Given that healthcare providers look only at the cervix, enclosed within its circumferential boundary, both when determining visual image quality and when visually determining precancer status via aceto-whitening near the transformation zone, our decision to bound and crop the cervix and pass only the cropped image into the downstream classifiers was intuitive and justified.

We used a YOLOv527 model architecture pretrained on the COCO dataset to train our custom cervix detector. Human-annotated ground truth bounding boxes were available for images that were split into 60% train, 10% validation, 20% test 1 and 10% test 2 sets. The detector was trained for 100 epochs and achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.954, indicating a very high level of performance. We subsequently compared several key classification metrics between quality classifier model runs before and after cervix detection by conducting paired samples t-tests, adjusted for multiple comparisons by the Bonferroni correction (Fig. 2).
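As an illustration of this preprocessing step, the sketch below loads a trained YOLOv5 detector through the public ultralytics/yolov5 torch.hub entry point and crops the input image to the detected box; the weights path, file names, and the choice to keep the highest-confidence box are assumptions for illustration.

```python
# Minimal sketch of applying a trained YOLOv5 detector to bound and crop the
# cervix before quality classification. Weights path and file names are hypothetical.
import torch
from PIL import Image

# Load custom YOLOv5 weights via the public ultralytics/yolov5 hub entry point.
detector = torch.hub.load("ultralytics/yolov5", "custom", path="cervix_detector.pt")

def crop_cervix(image_path: str) -> Image.Image:
    """Return the image cropped to the highest-confidence detected cervix box."""
    results = detector(image_path)
    boxes = results.xyxy[0]              # columns: x1, y1, x2, y2, confidence, class
    if len(boxes) == 0:
        raise ValueError(f"No cervix detected in {image_path}")
    x1, y1, x2, y2, conf, _ = boxes[boxes[:, 4].argmax()].tolist()
    return Image.open(image_path).crop((int(x1), int(y1), int(x2), int(y2)))

if __name__ == "__main__":
    cropped = crop_cervix("example_cervigram.jpg")   # hypothetical file name
    cropped.save("example_cervigram_cropped.jpg")
```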

Fig. 2

(a) Comparison of model performances with (green bars) and without (red bars) cervix detection. The bars report mean values of the corresponding metrics on the x-axis across all models. Results from paired samples t-tests adjusted for multiple comparisons by the Bonferroni correction (t-statistic, p-value) are highlighted in the text above the bars, demonstrating statistically significant improvements in model performance with cervix detection. (b) (i) Bounding boxes (highlighted in white) generated by running the cervix detector on 50 randomly selected images from the external (“EXT”) dataset; the cervix detector utilized a YOLOv5 architecture trained on “SEED” dataset images. (ii) Bound and cropped images of the cervix, which are passed on to the diagnostic classifier.

Model selection and internal validation

Our final model runs utilized the full 40,534-image “SEED” dataset with a split of 10% : 1% : 79% : 10% for training : validation : test 1 (model selection set) : test 2 (internal validation set) and iterated across all combinations of the design choices highlighted in Fig. 1. The specific configurations are highlighted in Table 2. All images were cropped with bounding boxes generated from a YOLOv527 model trained for cervix detection as noted above. RGB images were used for training, since the primary visual indicators of precancerous status in an image of the cervix require the presence of color (e.g., aceto-whitening near the transformation zone following application of acetic acid, growth or ulceration, vascular abnormalities); subtle color differences reflect underlying physiological and pathological changes associated with precancer/cancer. All models were trained for 75 epochs with a batch size (BS) of 8, a learning rate (LR) of 10⁻⁵, and an LR scheduler (ReduceLROnPlateau) which reduced the LR by a factor of 10 if no improvement was seen in the validation metric for 10 epochs. Our choices of a low LR with an LR scheduler, together with the BS and number of epochs, balanced model performance, training time, and available memory capacity, and ensured that all our models reached convergence. We used the summed normal and precancer AUC on the validation set as the early stopping criterion during training. Before training, images were resized to 256 × 256 pixels and scaled to intensity values from 0 to 1. During training, affine transformations were applied to the images for data augmentation. We initialized all model architectures with ImageNet pretrained weights. Additionally, we implemented Monte Carlo (MC) dropout28 in order to alleviate overfitting and regularize the learning process by randomly removing neural connections from the model29. Spatial dropout at a rate of 0.1 was applied after each dense layer for the densenet121 models, and after each residual block for the resnet50 models. The final model prediction was generated from the dropout-trained models by averaging the inference predictions over 50 forward passes with dropout active; each model’s prediction can thus be thought of as the average of 50 MC samples, analogous to averaging 50 repeat runs of the model.
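The sketch below illustrates MC-dropout inference as described above: dropout is kept stochastic at test time and softmax outputs are averaged over 50 forward passes. The backbone wiring (a single dropout layer before the densenet121 classifier head) is a simplification of the per-block spatial dropout in the text, and all names are illustrative.

```python
# Simplified sketch of Monte Carlo dropout inference: predictions are averaged
# over 50 stochastic forward passes. The dropout placement is a simplification
# of the spatial dropout described in the text, not the authors' exact model.
import torch
import torch.nn as nn
from torchvision import models

class QualityClassifier(nn.Module):
    def __init__(self, num_classes: int = 3, p_drop: float = 0.1):
        super().__init__()
        self.backbone = models.densenet121(weights="IMAGENET1K_V1")  # ImageNet-pretrained
        in_features = self.backbone.classifier.in_features
        self.backbone.classifier = nn.Sequential(
            nn.Dropout(p_drop), nn.Linear(in_features, num_classes)
        )

    def forward(self, x):
        return self.backbone(x)

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Average softmax probabilities over n_samples stochastic forward passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()          # keep dropout stochastic while other layers stay in eval mode
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)   # shape: (batch, num_classes)

if __name__ == "__main__":
    model = QualityClassifier()
    dummy = torch.rand(2, 3, 256, 256)   # images resized to 256 x 256, scaled to [0, 1]
    print(mc_dropout_predict(model, dummy, n_samples=50))
```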

Table 2 Configurations of the final set of runs investigated during model selection and internal validation, where each model comprised a unique combination of architecture, loss function and balancing strategy.

In the internal validation stage, we ranked our final models in order of performance on the “Model Selection Set” (“Test Set 1” = 32,100 images). We subsequently confirmed the performance of these models on the previously held aside “Internal Validation Set” (“Test Set 2” = 3,975 images). We ranked our models based on area under the receiver operating characteristics curve (AUROC), kappa (linear and quadratic weights), as well as %extreme misclassifications (%EM, representing the proportion of images with a two-class misclassification), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ) (Fig. 3).
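As an illustration of how these ranking metrics can be computed from model outputs, a short sketch using scikit-learn follows; the label encoding (0 = low, 1 = intermediate, 2 = high quality), the example arrays, and the choice of denominator for the per-class misclassification rates are assumptions.

```python
# Illustrative computation of the ranking metrics listed above from integer class
# labels and softmax probabilities. The example arrays are made up.
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

y_true = np.array([0, 2, 1, 2, 0, 1, 2, 0])
y_prob = np.array([[0.80, 0.15, 0.05], [0.10, 0.20, 0.70], [0.20, 0.60, 0.20],
                   [0.05, 0.25, 0.70], [0.60, 0.30, 0.10], [0.30, 0.50, 0.20],
                   [0.10, 0.10, 0.80], [0.20, 0.10, 0.70]])
y_pred = y_prob.argmax(axis=1)

auroc_lq  = roc_auc_score((y_true == 0).astype(int), y_prob[:, 0])   # LQ vs. rest
auroc_hq  = roc_auc_score((y_true == 2).astype(int), y_prob[:, 2])   # HQ vs. rest
kappa_lin = cohen_kappa_score(y_true, y_pred, weights="linear")
kappa_qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")

pct_em       = 100 * np.mean(np.abs(y_pred - y_true) == 2)           # two-class errors
# Denominator choice (all images) is an assumption; the paper may normalize per class.
pct_hq_as_lq = 100 * np.mean((y_true == 2) & (y_pred == 0))
pct_lq_as_hq = 100 * np.mean((y_true == 0) & (y_pred == 2))

print(f"AUROC LQ vs rest: {auroc_lq:.2f}, HQ vs rest: {auroc_hq:.2f}")
print(f"Kappa linear: {kappa_lin:.2f}, quadratic: {kappa_qwk:.2f}")
print(f"%EM: {pct_em:.1f}, %HQ as LQ: {pct_hq_as_lq:.1f}, %LQ as HQ: {pct_lq_as_hq:.1f}")
```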

Fig. 3

Classification performance metrics on the “Internal Validation Set” (“Test Set 2”) for the models investigated. The models are arranged from top to bottom in order of decreasing performance. Specifically, (a) highlights the discrete classification metrics: %extreme misclassifications (% ext. mis.), %high quality misclassified as low quality (%HQ as LQ) and %low quality misclassified as high quality (%LQ as HQ); (b) highlights the Kappa metrics (linear, quadratic weighted); and (c) highlights the area under the receiver operating characteristics curve (AUROC) for each of the low quality (LQ) versus rest and high quality (HQ) versus rest categories. While our top models overall performed reasonably similarly in terms of the continuous metrics (panels b and c), the discrete metrics (panel a) separated out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. The model ranking is consistent with the ranking observed on the “Model Selection Set” (“Test Set 1”) (Supp. Fig. 1).

Finally, to aid better visualization of predictions at the individual model level, we generated Fig. 4, which compares model predictions across 60 images for the ranked list of models. To generate this comparison, we first summarized each model’s output as a continuous severity \(score\). Specifically, we utilized the ordinality of our problem and defined the continuous severity \(score\) as a weighted average using the softmax probability of each class \(i\) (\({p}_{i}\)) as described in Eq. 3, where \(k\) = number of classes:

Fig. 4

Model-level comparison across the investigated models on the “Internal Validation Set” (“Test Set 2”). 60 images were randomly selected from this set (see METHODS/Model Training and Analysis/Model Selection and Internal Validation) and arranged in order of increasing mean score within each ground truth class in the top row (labelled “Ground Truth”). The class predicted by each investigated model for each of these 60 images is highlighted in the bottom rows, where the images follow the same order as the top row. The color coding in the top row represents the ground truth, while that in the bottom 12 rows represents the model-predicted class: Red: Low Quality, Gray: Intermediate, and Green: High Quality, as highlighted in the legend. As we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve.

$$score= \sum_{i=0}^{k-1}{p}_{i} \times i$$

Put another way, the \(score\) is equivalent to the expected value of a random variable that takes values equal to the class labels, with probabilities given by the model’s softmax probability at index \(i\) corresponding to class label \(i\). For a three-class model, the values lie in the range 0 to 2. We next computed the average \(score\) for each image across all models and ordered the images by increasing average \(score\) within each class. From this \(score\)-ordered list, we randomly selected 20 images per class, maintaining the distribution of mean scores within each class, and arranged the selected images in order of increasing average \(score\) within each class in the top row of Fig. 4, color-coded by ground truth. We subsequently compared the predicted class across the 12 models for each of these 60 images (bottom 12 rows of Fig. 4), maintaining the images in the same order as the ground truth row and color-coding each cell by the model-predicted class. The image panels at the top of Fig. 4 depict select images with relevant metadata.
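A small sketch of the severity score defined above is given below; it computes the expected class index under the softmax distribution, so a three-class model yields values in [0, 2]. The example probabilities are made up.

```python
# Sketch of the continuous severity score defined above: score = sum_i p_i * i
# over the k classes (i = 0 .. k-1). Example probabilities are placeholders.
import numpy as np

def severity_score(softmax_probs: np.ndarray) -> np.ndarray:
    """Expected class index under each row's softmax distribution."""
    k = softmax_probs.shape[1]
    class_indices = np.arange(k)          # 0 .. k-1
    return softmax_probs @ class_indices

probs = np.array([[0.70, 0.20, 0.10],    # confidently low quality  -> score ~0.4
                  [0.10, 0.80, 0.10],    # intermediate             -> score ~1.0
                  [0.05, 0.15, 0.80]])   # high quality             -> score ~1.75
print(severity_score(probs))
```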

External validation

Because our internal validation set shared similar characteristics with our training data (i.e., similar devices and geographies), our next stage consisted of validating our best performing model on external data (“EXT”). Our external test set (“EXT” = “Test Set 3”) comprised images from a new device (IRIS colposcope) and new geographies (Cambodia, Dominican Republic (DR)).

First, to get a sense of the distributions of the “SEED” and “EXT” datasets, including the distributions by device and geography, we ran out-of-the-box (OOB) inference with our best performing model on “Test Set 2” (“Internal Validation Set”) from the “SEED” dataset and on the full “EXT” dataset. We subsequently plotted UMAPs of the resulting features, a dimension-reduced representation of the features output by the model during inference, color-coded by dataset, device, and geography, respectively (Fig. 5).
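The sketch below illustrates this kind of feature-space visualization with the umap-learn package: model features are reduced to two dimensions and color-coded by metadata. The feature array, labels, and file names are randomly generated placeholders, not the study data.

```python
# Sketch of reducing model features to 2-D with UMAP and color-coding by metadata.
# The features and labels here are random placeholders.
import numpy as np
import umap                       # umap-learn package
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 1024))                 # e.g., pooled backbone features
dataset_labels = rng.choice(["SEED", "EXT"], size=500)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)

for name in np.unique(dataset_labels):
    mask = dataset_labels == name
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=5, label=name)
plt.legend(title="Dataset")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.savefig("umap_by_dataset.png", dpi=200)
```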

Fig. 5

Uniform manifold approximation and projections (UMAP) highlighting the relative distributions of the datasets, devices and geographies investigated in this work. Each subplot highlights a different representation of the UMAP, where the color coding (highlighted in the corresponding legend at the top of each subplot) is at the (a) dataset-level (seed vs. external), (b) device-level and (c) geography-level. The datasets and devices occupy distinct clusters in (a) and (b), while the geographies are all clustered together within the same device in (c).

We further tested the impact of device- and geography-level heterogeneity on our model performance via three distinct sets of investigations: (i) out-of-the-box (OOB) inference on “EXT”; (ii) device-level retraining: adding multi-geography “EXT” images to “SEED” in a 65% : 10% : 25% ratio of train : validation : test and training on the full collated dataset; and (iii) geography-level retraining: adding either Cambodia or DR “EXT” images to “SEED” in separate experiments and training on the full collated dataset. For (i), the OOB model run, we investigated performance on both the full “EXT” test set and the individual geographies, Cambodia and DR. For (ii), the device-level retraining run, we investigated performance on both the “EXT” test set and “SEED” Test Set 2, to assess the possibility of performance degradation (catastrophic forgetting) on “SEED” data upon retraining.
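As a rough illustration of the device-level retraining split in (ii), the sketch below divides the “EXT” images 65% / 10% / 25% into train / validation / test and pools the train portion with “SEED” training images, under the reading that the stated ratio applies to the added “EXT” images; the file lists and helper names are hypothetical.

```python
# Illustrative sketch of assembling the device-level retraining split. The
# interpretation of the 65/10/25 ratio and all file names are assumptions.
import numpy as np

def split_ext(ext_items, seed=0, fractions=(0.65, 0.10, 0.25)):
    rng = np.random.default_rng(seed)
    items = np.array(ext_items)
    rng.shuffle(items)
    n_train = int(fractions[0] * len(items))
    n_val   = int(fractions[1] * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

ext_images = [f"ext_{i:04d}.jpg" for i in range(1340)]      # 1,340 "EXT" images
ext_train, ext_val, ext_test = split_ext(ext_images)

seed_train = [f"seed_{i:05d}.jpg" for i in range(4053)]     # placeholder "SEED" train list
combined_train = list(seed_train) + list(ext_train)
print(len(ext_train), len(ext_val), len(ext_test), len(combined_train))
```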

Interrater assessment

Finally, we conducted an interrater assessment of model performance with respect to the ground truth denoted by two different raters on 100 newly acquired, external (“EXT”) dataset images (device = IRIS colposcope, geography = Cambodia). Rater 1 was one of several raters who had labelled images in the “SEED” dataset on which the model was trained, while Rater 2 was a completely new rater. We specifically investigated the OOB performance of our best performing model (which was trained on “SEED”) on the 100 “EXT” images with respect to each individual rater’s ground truth, computing key classification metrics (AUROC, %EM) (Fig. 7a) and ROC curves for each (Fig. 7b,c). Further, we investigated the degree of concordance between the two raters’ ground truths and the corresponding model predictions on each of the 100 images using a rater-level confusion matrix color-coded by model prediction.
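A minimal sketch of this interrater comparison is shown below: per-rater AUROC and %extreme misclassifications for the model, plus a rater-versus-rater confusion matrix. The label encoding (0 = low, 1 = intermediate, 2 = high quality) and all data are made-up placeholders.

```python
# Sketch of the interrater comparison: per-rater AUROC and %EM for the model,
# plus a Rater 1 vs. Rater 2 agreement matrix. Labels and probabilities are random.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(1)
model_prob = rng.dirichlet(np.ones(3), size=100)     # placeholder softmax outputs
model_pred = model_prob.argmax(axis=1)
rater1 = rng.integers(0, 3, size=100)                # placeholder rater labels
rater2 = rng.integers(0, 3, size=100)

for name, rater in [("Rater 1", rater1), ("Rater 2", rater2)]:
    auroc_lq = roc_auc_score((rater == 0).astype(int), model_prob[:, 0])
    auroc_hq = roc_auc_score((rater == 2).astype(int), model_prob[:, 2])
    pct_em   = 100 * np.mean(np.abs(model_pred - rater) == 2)
    print(f"{name}: AUROC LQ vs rest {auroc_lq:.2f}, HQ vs rest {auroc_hq:.2f}, %EM {pct_em:.1f}")

# Rows: Rater 1, columns: Rater 2; Fig. 7c additionally colors each cell by the
# model's prediction.
print(confusion_matrix(rater1, rater2))
```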

Results and discussion

In this work, we implemented a multi-stage model selection approach to generate an image quality classifier utilizing a multi-device and multi-geography “SEED” dataset, and subsequently validated the best performing model on an external “EXT” dataset, assessing the relative impact of device-level, geography-level, and rater-level heterogeneity on our model.

Model development

ROUND 1: Training set size

Supp. Table 1 highlights that using a high proportion (65%) of all available data for training instead of a low proportion (10%) did not meaningfully improve or alter model performance. We consequently chose to limit all subsequent experiments to a low proportion (10%) of the data to save training time, optimize available memory capacity and improve computational efficiency. Conceptually, this is consistent with our expectation: given the large size of our “SEED” dataset (40,534 images), even 10% of the dataset amounts to roughly 4,053 training images, which is a reasonably large number for this task.

ROUND 2: Cervix detection

Figure 2 highlights statistically significant improvements in model performance with cervix detection across several key classification metrics, including linear kappa (LK), quadratic weighted kappa (QWK), accuracy, area under the receiver operating characteristics curve (AUROC) and area under the precision recall curve (AUPRC). We consequently chose to limit our final set of model selection experiments to models utilizing images bound and cropped following cervix detection. Conceptually, this is consistent with our expectation: our raters primarily utilized the region around the cervical os, largely encompassed by the circumferential boundary of the cervix, when determining quality ground truths, given that this is the region of the cervix used to visually determine cervical precancer and cancer.

Model selection and internal validation

Figure 3 depicts the rank order of our final models on “Test Set 2” (“Internal Validation Set”). This rank order is consistent with the rank order of models on “Test Set 1” (“Model Selection Set”) (Supp. Fig. 1), demonstrating that the performance differences between the models are driven by the design choices and not by chance or by the specific composition of the individual test sets. Figure 3 highlights that while all models perform well and reasonably similarly in terms of the continuous metrics (the top models have similar AUROCs ~ 0.92 and Kappa ~ 0.65), the discrete metrics (%EM, %LQ as HQ and %HQ as LQ) effectively discriminate between models and separate out the top performing model from its competitors. Our best performing model achieved an AUROC of 0.92 (LQ vs. rest) and 0.93 (HQ vs. rest), and a minimal total %EM of 2.8%. Finally, Fig. 4 provides a more granular view of this difference in performance between the models, demonstrating that as we go from the worst model at the bottom to the best model at the top, identification and discrimination of both “intermediate” and “high” quality images steadily improve. This is consistent with our expectations given the design choices investigated in our model selection: incorporation of an “intermediate” class, together with MC dropout and a loss function (QWK) that penalizes misclassifications between the extreme classes, ensures that we effectively handle ambiguous cases at the class boundaries. Our best model (“Model 1” in Figs. 3 and 4) utilizes densenet121 as the architecture, quadratic weighted kappa as the loss function and balanced loss as the balancing strategy. Even though we used RGB images as input, our general model selection and validation approach is independent of color and should apply widely, even to grayscale images in diagnostic radiology.

External validation

The UMAPs in Fig. 5a,b highlight that the “EXT” dataset and its corresponding IRIS colposcope device occupy regions similar to the DSLR and J5 clusters from the “SEED” dataset, while Fig. 5c highlights the geography-level distribution. Taken together, Fig. 5a–c suggest that 1. while there is device-level heterogeneity within the data, its likely impact on model performance on “EXT” should be minimal given the proximity of “EXT” to “SEED”; and 2. geography should not play a role in model performance, given that, within the same device, different geographies do not occupy distinct clusters in Fig. 5c, unlike the corresponding device-level clusters in Fig. 5b.

Figure 6 highlights that our model demonstrated strong out-of-the-box (OOB) performance on external data: AUROC of 0.83 (LQ vs. rest) and 0.82 (HQ vs. rest), and a %EM of 3.9% (Fig. 6a.i, blue bars), consistent with our expectation from Fig. 5; these values further improved upon retraining with “EXT” images added to “SEED” (AUROC = 0.95, 0.88 respectively; %EM = 1.8%) (Fig. 6a.i, orange bars). Additionally, we found that 1. retraining using external data did not adversely affect performance on “SEED”, i.e., there was no catastrophic forgetting (AUROC = 0.92, 0.93 respectively; %EM = 3.2%) (Fig. 6a.i, yellow bars), and 2. our model is geography agnostic: OOB performance did not meaningfully differ between Cambodia and Dominican Republic (DR) images (Fig. 6b.i, light and dark blue bars), and models trained on “SEED” + Cambodia images performed strongly on DR images and vice versa (Fig. 6b.i, light and dark green bars). Taken together, Figs. 5 and 6 suggest that there is no impact of geography-level heterogeneity on model performance, and while there is some degree of device-level heterogeneity as captured by the UMAPs, performance on our external device (IRIS colposcope) is strong.

Fig. 6

External validation of our best performing model on “EXT” dataset. Panel (a) highlights the strong out-of-the-box (OOB) performance of our model, where area under the receiver operating characteristics curve (AUROC) = 0.83 (low quality, LQ vs. rest) and 0.82 (high quality, HQ vs. rest), and %extreme misclassification (%Ext. Mis.) = 3.9% (a.i, blue bars), with the corresponding confusion matrix and ROC curve in (ii). Panel (a) further highlights the improvement in performance upon retraining, where AUROC = 0.95, 0.88 respectively; %Ext. Mis. = 1.8% on “EXT” test set (a.i, orange bars) and the absence of catastrophic forgetting, where AUROC = 0.92, 0.93 respectively; %Ext. Mis. = 3.2% on “SEED” Test Set 2 (a.i, yellow bars; confusion matrix and ROC curves in iii). Panel (b) highlights that our model is geography agnostic, with no meaningful difference in OOB performance on “EXT” between Cambodia (Cam.) and Dominican Republic (DR) (b.i, light and dark blue bars) and strong performance on DR for models trained on “SEED” + Cambodia and vice versa (b.i. light and dark green bars; confusion matrices and ROC curves depicted in ii and iii respectively).

Interrater assessment

Figure 7 demonstrates that our model mimics overall rater behavior well. Our model demonstrated strong OOB performance on each individual rater’s ground truth (Rater 1: AUROC = 0.96, 0.85 respectively, and %EM = 2%; Rater 2: AUROC = 0.87, 0.80 respectively, and %EM = 8%). Rater 1 was involved in the generation of ground truths within “SEED”, meaning that the model had “seen” Rater 1 ground truth patterns in the “SEED” data, while Rater 2 was a completely new rater. Our model correctly predicted 85% of cases where both raters agreed on either “low quality” or “high quality” images, making grave errors on only two images. The “intermediate” class is known to be highly uncertain among raters, given that there is generally disagreement among healthcare providers as to what defines an “intermediate” or limited quality image. We found that our model uniquely captured this rater-level uncertainty and disagreement with the “intermediate” class and largely erred on the side of caution; images predicted as “low” quality by our model were largely deemed “low” quality by at least one rater, and vice versa, while images deemed “intermediate” by one rater and “high” quality by the other largely had a mix of “intermediate” and “high quality” model predictions. This pattern of model performance is optimal for our use case, since we expect to utilize our model to filter out “low” quality images while allowing “intermediate” and “high” quality images to pass through to diagnostic classification.

Fig. 7

Interrater assessment of our best performing model on 100 newly acquired “EXT” dataset images (device = IRIS colposcope, geography = Cambodia), with respect to the ground truth denoted by two different raters. Rater 1 was one of the raters who had labelled images in the “SEED” dataset on which the model was trained, while Rater 2 was a completely new rater. Our model demonstrated strong performance out-of-the-box (OOB) on each individual rater’s ground truth, where for Rater 1: area under the receiver operating characteristics curve (AUROC) = 0.96 (low quality, LQ vs. rest) and 0.85 (high quality, HQ vs. rest), and %extreme misclassifications (%Ext. Mis.) = 2% (panel (a), blue bars; ROC curves in panel (b)); and for Rater 2: AUROC = 0.87, 0.80 respectively, and %Ext. Mis. = 8% (panel (a), red bars; ROC curves in panel (b)). Panel (c) highlights the degree of concordance between the two raters’ ground truths (x-axis: Rater 1; y-axis: Rater 2) and the corresponding model prediction on each of the 100 images using a confusion matrix color-coded by model prediction (Red: low quality; Gray: intermediate quality; and Green: high quality).

Diagnostic classifier performance by quality

To further shed light on the motivation behind the image quality classifier and the experiments reported here, we investigated the performance of our downstream diagnostic classifier within each quality ground truth class. Since our downstream diagnostic classification dataset utilized both “Intermediate” and “High Quality” images for training and testing, we can examine the diagnostic classifier model predictions on its test set with respect to the diagnostic classification ground truth within each of these two quality classes and determine whether image quality impacts diagnostic classifier model predictions. A detailed description of our diagnostic classifier model and the multi-heterogeneous (multi-device, multi-geography) dataset used can be found in20. The results from this analysis are reported in Fig. 8, which utilized the diagnostic classifier’s test set comprising 10,420 images.

Fig. 8

Analysis of diagnostic classifier performance by image quality class. The x-axis represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”). Within each of the six coordinates (reflecting the six combinations of quality and diagnostic classifier ground truths), each color-coded bubble represents the diagnostic classifier model predictions, with the relative sizes of the bubbles indicating the relative ratio of predictions for each class within each of the six coordinates. The number in the center of each bubble represents the count of images predicted as the diagnostic class of the given color, as highlighted in the legend at the top, where Green: Normal, Gray: Gray Zone/Indeterminate, and Red: Precancer+.

The x-axis of the bubble plot in Fig. 8 represents the image quality label/ground truth (“Intermediate” and “High Quality”) while the y-axis represents the diagnostic classifier label/ground truth (“Normal”, “Gray Zone/Indeterminate” and “Precancer+”); within each of the six coordinates, the color-coded bubbles represent the diagnostic classifier model predictions, with bubble size indicating the relative ratio of predictions for each class. Crucially, none of the images labelled “Intermediate” quality are predicted to be “Normal” by the diagnostic classifier; this leads to a large discrepancy of 249 images that are ground-truth labelled as “Normal” but are predicted to be largely “Gray Zone/Indeterminate” (80%) or “Precancer+” (20%) by the diagnostic classifier. This is a critical finding, since it signifies that the quality of a cervical image is important for downstream diagnostic classification. It appears that poorer quality images are deemed pathologic by the diagnostic classifier regardless of ground truth, thereby reinforcing the need for a dedicated deep learning-based model to filter out poorer quality images prior to diagnostic classification.

Analysis of performance by available quality factor

The dataset utilized in this work had a small number of “Low Quality” images for which the specific quality factor that led to the image being rated as poor quality was additionally annotated (n = 785 in the combined test set). To aid the explainability of our model, we assessed the performance of our top performing quality classifier (Model 1 in Figs. 3 and 4) within each of these categories by calculating the accuracy of model predictions, i.e., the proportion of images within each specific quality factor category that were correctly predicted as “Low Quality” by the model. Figure 9 depicts the bar plot of the accuracy within each specific quality factor category, highlighting that our quality classifier model performs well in filtering out post-iodine images as well as images where the view of the cervix is obscured due to the position of the cervix or an unspecified reason, while it does less well on images where the view of the cervix is obscured by mucus or blood.
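A short sketch of this per-factor accuracy computation is shown below: among “Low Quality” images annotated with a specific quality factor, the fraction the model also predicts as “Low Quality”. The data frame contents and factor names are made-up placeholders.

```python
# Sketch of the per-quality-factor accuracy analysis: fraction of annotated
# "Low Quality" images (class 0) that the model also predicts as class 0.
# The example data are placeholders.
import pandas as pd

df = pd.DataFrame({
    "quality_factor": ["post-iodine", "post-iodine", "obscured (mucus/blood)",
                       "obscured (position)", "obscured (mucus/blood)", "blur"],
    "model_pred":     [0, 0, 1, 0, 2, 0],   # 0 = low, 1 = intermediate, 2 = high quality
})

accuracy_by_factor = (
    df.assign(correct=lambda d: d["model_pred"] == 0)
      .groupby("quality_factor")["correct"]
      .agg(accuracy="mean", n="size")
)
print(accuracy_by_factor)
```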

Fig. 9

Analysis of quality classifier performance by available quality factor, where each bar represents the accuracy of the best performing quality classifier model (Model 1 in Figs. 3 and 4) within each specific quality factor category, as denoted on the x-axis. The total number of images in each category is denoted at both the bottom and top of each bar. On the x-axis, “Obscured” indicates that the view of the cervix is obscured by the factor denoted in parentheses.

Conclusion

To successfully translate AI pipelines to clinical practice, models must be designed with guardrails built in to deal with poor quality images. Domains such as cervical cancer screening, which involve a variety of image capture devices as well as image takers of varying skill, are particularly prone to image quality concerns that may adversely impact diagnostic evaluation. In this work, we tackle the image quality problem head on by generating and externally validating a multiclass image quality classifier able to classify images of the cervix into “low”, “intermediate” and “high” quality categories. We subsequently highlight that our best performing model generalizes well, performing strongly across multiple axes of data heterogeneity, including device, geography, and ground truth rater.

Our choice of three classes for image quality classification was motivated by two reasons. First, three classes most accurately represent true image quality for cervical images as encountered by raters/providers in the clinic, with the “intermediate” class effectively capturing the true ambiguity, and integrate seamlessly into our overall workflow: images predicted as “low” quality would be filtered out by our model, with the provider prompted to retake the image for the patient until the prediction is no longer “low” quality; only images deemed to be of sufficient quality (“intermediate” and “high” quality categories) would be passed on to downstream diagnostic classification. Second, by incorporating a three-class classifier with a loss function (QWK) that severely penalizes extreme misclassifications (i.e., LQ as HQ and vice versa) with quadratic weights, we further ensure greater separation between, and stronger discrimination of, the “low” and “high” quality boundary classes.

Despite the heterogeneous nature of our datasets, our work may be limited by the number of external devices utilized and the number of image takers. Additionally, we use RGB images as input, although, as noted above, our approach is independent of color and should extend to grayscale images. Forthcoming work will further evaluate our retraining approaches and assess model performance on additional external devices and image takers. Future work will also optimize our model for use on edge devices, thereby promoting clinical translation.

Our investigation of quality classifier performance across the various axes of heterogeneity present within our data underscores the importance of assessing the portability, or generalizability, of AI models designed for clinical deployment30,31,32. We posit that for any model to be deployed successfully in the clinic, portability concerns must be adequately acknowledged. This is particularly true in light of the FDA’s October 2023 guidance on effective implementation of AI/DL models33, which proposes the need to adapt models to data distribution shifts. We hope that our work will set a standard for assessing model performance against the axes of heterogeneity present in the dataset, across clinical domains, and will motivate the accompaniment of adequate guardrails for AI-based pipelines to account for concerns relating to image quality and generalizability.