Introduction

Human papillomavirus (HPV) infection is a highly prevalent sexually transmitted disease, affecting over 80% of sexually active individuals at some point in their lives1. It is frequently associated with transient, mostly asymptomatic infection. However, in certain cases, chronic persistent infection can develop, increasing the risk of neoplastic transformation across diverse anatomical locations2. One possible mechanism contributing to carcinogenesis is integration of viral DNA into the host cell genome, which triggers uncontrolled proliferation and impairs DNA repair mechanisms2. A key manifestation of active HPV infection is the development of low-grade squamous intraepithelial lesion (LSIL), which resolves spontaneously in the majority of cases. In some cases, however, LSIL can progress to high-grade squamous intraepithelial lesion (HSIL). HSIL is associated with a significantly higher risk of progression to invasive squamous carcinoma and is therefore considered a precancerous lesion3.

Understanding this carcinogenesis process is essential, since prompt treatment of HSIL represents a pivotal opportunity to reduce the burden of HPV-associated squamous cancers3. Cervical cancer is the model oncogenic disease and the most important HPV-related neoplasia2. A number of established initiatives already target both primary prevention (through immunization) and secondary prevention (through treatment of precursor dysplastic lesions)4. However, HPV infection is ubiquitous and its impact extends beyond the cervix: its carcinogenic effect can manifest in various pelvic areas, including the vagina and vulva in women, the penis in men, and the perianal area and anus5,6.

Given the need to inspect finer anatomical details, there has been growing interest in using high-resolution colposcopes to assess not only the female genital tract but also the anal region. Colposcopy of both the cervical and anal regions enables detailed, high-resolution assessment of these areas and precise targeting of biopsies and treatment procedures under direct visualization7,8. The prevailing recommendation is to perform colposcopic assessment following an abnormal cytology result and/or detection of a high-risk HPV type in the cervical/anal area8,9. Although this procedure provides the highest diagnostic and therapeutic yield, it has a significant learning curve. The resulting limited expertise can lead to a shortage of physicians who are technically proficient at raising suspicion and providing accurate diagnoses, especially in early stages10. In the particular case of high-resolution anoscopy (using colposcopes or anoscopes), the insufficient number of trained proctologists may result in gynecologists performing both cervical and anal assessment, given their greater familiarity with HPV-related dysplastic lesions.

In contexts with suboptimal diagnostic accuracy and high interobserver variability, artificial intelligence (AI) models could enhance the cost-effectiveness of these procedures11. The abundance of colposcopy images further supports the use of AI tools for image analysis, particularly convolutional neural networks (CNN), whose design is inspired by pattern analysis in the human visual cortex. Researchers are currently leveraging this technology in the perineal region to improve the accuracy of diagnosing HPV-induced lesions using colposcopy/anoscopy12,13,14. The models published so far focus only on detecting and differentiating lesions in one specific region, either the cervix or the anal canal13,14,15,16,17,18,19,20,21. However, high performance in one area may not translate to similar effectiveness in the other, and AI-enhanced ubiquitous diagnostic tools (trained on data from both regions) are currently lacking.

The aim of this study is to develop and validate a CNN for automatic differentiation of cervical and anal squamous cancer precursors during high-resolution colposcopy/anoscopy.

Methods

Study design and categorization of the lesions

We included high-resolution colposcopies performed at Centro Materno Infantil do Norte [CMIN] (Porto, Portugal) [n = 70] using a Zeiss FC 150 colposcope, and high-resolution anoscopies performed at Groupe Hospitalier Paris Saint-Joseph [GHPSJ] (Paris, France) [n = 177], Instituto de Infecciologia Emílio Ribas [IFER] (São Paulo, Brazil) [n = 54] and Wake Forest University [WFU] (North Carolina, USA) [n = 13] using, respectively, a THD® Proctostation HRA Module videoproctoscope (THD SpA, Correggio, Italy), a Kolplast colposcope (Kolplast CIA, São Paulo, Brazil) and a Zeiss FC 150 colposcope (Carl Zeiss Meditec AG, Jena, Germany). The included procedures were conducted and recorded between 2020 and 2023. The collected videos were then segmented into still frames using VLC Media Player.

The dataset consisted of a total of 88,073 frames of HPV-induced dysplastic lesions, with 45,726 labelled as LSIL and 42,347 labelled as HSIL. This binary classification was determined based on the corresponding histopathology reports from biopsied or treated lesions during colposcopy or anoscopy procedures. Cytological samples were not used to establish the ground truth.

The number of biopsies/treated lesions varied for each procedure. In cases involving multiple biopsies, the biopsy sites were documented, and the histological findings were matched with the corresponding video frames. Any cases with uncertainty about the correlation between the biopsy site and the image were excluded from the analysis to ensure rigor and prevent misclassification.

We divided the full dataset into two parts: a training/validation set and a testing set, with 79,265 (90%) and 8808 (10%) frames, respectively. We used the testing set to assess the global performance of the model. The dataset methodology is summarized in Fig. 1.
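The 90/10 division described above can be illustrated with a minimal sketch. The text does not state how frames were assigned to each part, so the sketch assumes a random, frame-level split stratified by label; the function name `stratified_split`, the seed and the toy labels are hypothetical and are not the study data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio similar in both parts.

    Assumption: the split is made at frame level, independently for each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 520 LSIL frames and 480 HSIL frames.
labels = ["LSIL"] * 520 + ["HSIL"] * 480
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

A frame-level split like this one can place frames from the same exam in both parts; a split by procedure would instead group the indices by exam before shuffling.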

Fig. 1

Study design. AUC-PR area under the precision-recall curve; AUC-ROC area under the conventional receiver operating characteristic curve; NPV negative predictive value, PPV positive predictive value.

Due to the retrospective nature of the data collection, this study followed a non-interventional approach, and no modifications to therapeutic practices were made as a result of the study. Approval was obtained before the study began from the ethics committees of Groupe Hospitalier Paris Saint-Joseph, Instituto de Infecciologia Emílio Ribas, and Hospital Universitário Santo António (IRB 00012157, SPTC 81/2023, IRB 2023.157(131-DEFI/123-CE), respectively). The study was carried out in accordance with the principles of the Declaration of Helsinki.

Colposcopy and anoscopy protocol

In each center, colposcopy and anoscopy procedures were performed by expert physicians according to current best practices. Each procedure can be divided into at most four stages: an initial examination without staining, followed by examination with 3% acetic acid, optionally Lugol’s iodine, and finally therapeutic manipulation (e.g. laser ablation, plasma coagulation or surgical ablation). The dataset included frames from these four categories, with each procedure potentially encompassing any combination of them.

Development of DL model and performance analysis

A ResNet10 model pre-trained on ImageNet-1K (a large collection of labelled images used to train object recognition models) was used to build this CNN22. The early layers of the model were kept in order to reuse the features it had already learned, but the final fully connected layers were removed and replaced with new fully connected layers that adapt the model to LSIL vs HSIL classification. The added layers consist of two main blocks, each comprising a fully connected layer followed by a dropout layer to mitigate the risk of overfitting. These blocks are followed by a dense layer sized according to the number of categories (2). We tuned hyperparameters such as the learning rate (0.0001), batch size (32), and number of epochs (5) by trial and error to achieve the best performance. FFMPEG, Pandas, and Pillow were used for data preparation. We implemented the model in PyTorch 2.2.2 and ran it on a system equipped with dual 2.1 GHz Intel Xeon Gold 6130 processors (Intel, Santa Clara, CA, USA) and dual NVIDIA Quadro RTX A6000 graphics cards (NVIDIA Corporation, Santa Clara, CA, USA). A probability of being LSIL or HSIL was calculated for each frame, and the CNN’s final classification for each frame was the category with the highest probability. The model’s classification was compared with the current gold standard, the corresponding histopathological classification (Fig. 2).
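The per-frame decision rule (a probability for each category, with the final label being the more probable one) can be sketched as follows. This is a simplified illustration rather than the study code: it assumes the final dense layer outputs two raw scores that are converted to probabilities with a softmax, which the text does not state explicitly, and the names `softmax`, `classify_frame` and the example scores are hypothetical.

```python
import math

CLASSES = ("LSIL", "HSIL")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_frame(logits):
    """Return the predicted category and the probability of each category."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], dict(zip(CLASSES, probs))


# Hypothetical raw scores for one frame, ordered as in CLASSES.
label, probabilities = classify_frame([0.2, 1.7])
print(label, probabilities)
```

The predicted label for each frame is then compared with the histopathological label of the corresponding lesion.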

Fig. 2

Examples of how the algorithm estimated probability of being LSIL (low-grade intraepithelial squamous lesion) vs HSIL (high-grade intraepithelial squamous lesion). Every frame was categorized into one of the previously mentioned categories based on which had the highest probability. The classification provided by the Convolutional Neural Network (CNN) was then compared to the histopathological classification (upper left corner), which was considered the gold standard. Blue bars represent correct CNN predictions, while red bars represent wrong ones.

Statistics and reproducibility

The model was assessed during the training/validation phase (to assess robustness) and during the test phase (to assess overall performance). During the training/validation phase, 90% of the data was divided into three folds of equal size using a StratifiedKFold division, and three iterations were run. In each iteration, the model was trained on two folds and validated on the remaining one, with a different fold held out for validation each time. During the test phase, the remaining 10% of the data was used to independently assess the performance of the CNN. Computational performance was also evaluated by measuring the algorithm’s processing time for all frames in the test set.
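The threefold scheme can be illustrated with a short, dependency-free sketch that mimics the idea of a stratified k-fold division: each fold receives a similar share of each class, and each fold is held out once for validation. It is a simplified stand-in, not the scikit-learn implementation and not the study code; the function name `stratified_kfold`, the seed and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=3, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold, preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal the shuffled indices of each class round-robin into the folds.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold is used once for validation; the others are used for training.
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx


# Toy labels (hypothetical): 54 LSIL frames and 36 HSIL frames.
labels = ["LSIL"] * 54 + ["HSIL"] * 36
splits = list(stratified_kfold(labels))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Averaging the metrics obtained on the three validation folds gives the cross-validated estimates, while the held-out 10% is kept for the final test.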

We performed statistical analysis using scikit-learn v0.22.2 (https://scikit-learn.org/0.22/)23. We also generated heatmaps to assess which image features contributed most to the CNN prediction. Examples of a cervical and an anal frame are shown in Fig. 3.

Fig. 3

Examples of generated heatmaps in an anal (1) and a cervical (2) frame to assess which characteristics most significantly contributed to Convolutional Neural Network (CNN) prediction. These identified areas can assist the physician in assessing why the Artificial Intelligence (AI) model made its prediction and may guide targeted biopsies by highlighting regions with a higher probability of lesions.

Results

We included a total of 88,073 high-resolution colposcopy and anoscopy still frames, from 3 different devices.

From the total dataset, 79,265 frames were used to train the model (GHPSJ = 31,086, IFER = 22,738, CMIN = 20,393, WFU = 5048), whereas the remaining 8808 frames were used to independently test the model (GHPSJ = 3393, IFER = 2587, CMIN = 2300, WFU = 528 frames).

The total dataset incorporated 45,726 LSIL (41,153 in training/validation, 4573 in testing set) and 42,347 HSIL (38,112 in training/validation, 4235 in testing set) labeled frames.

In terms of procedures, the training/validation set included frames from 165 exams, while the testing set included frames from 155.

1. Training/Validation set

Table 1 displays the number of frames, patients, devices, regions and lesion (LSIL and HSIL) numbers for each fold, during cross-validation (training/validation phase).

Table 1 Number of frames, patients and types of CE device per group, which was divided in training/validation (90% of patients, including a threefold cross validation) vs. test group (10% of remaining).

Regarding performance metrics for HSIL differentiation (Table 2), the average sensitivity was 98.1% (95% CI 97.6–98.5%) and the average specificity was 97.4% (95% CI 96.0–98.8%). The average PPV was 97.2% (95% CI 95.8–98.7%) and the average NPV was 98.2% (95% CI 97.7–98.6%). The average overall accuracy was 97.7% (95% CI 97.2–98.6%). The mean AUC-ROC and AUC-PR were both 0.98 ± 0.01. Table 3 details the performance metrics for each fold. Figure 4 shows the discriminatory capacity of the model during threefold cross-validation, as reflected by the ROC and precision-recall curves.

Table 2 Confusion matrices for each cross-validation run (training-validation phase) and in test phase.
Table 3 Data was divided in training/validation and test groups. During training/validation phase, three iterations were conducted, each with unique frame distribution.
Fig. 4

1: Area under the conventional receiver operating characteristic curve (AUC-ROC); 2: area under the precision-recall curve (AUC-PR) of CNN performance in differentiating HSIL from LSIL in cervical and anal colposcopic/anoscopic still frames. In this case, sensitivity (or recall) is the proportion of HSIL cases that were correctly identified by the CNN. Specificity is the proportion of LSIL cases that were correctly identified. Precision is the proportion of frames predicted as HSIL by the CNN that were truly HSIL.

2. Testing set

In the testing phase, the performance metrics for HSIL differentiation were as follows: sensitivity of 99.0%, specificity of 97.8%, and PPV and NPV of 97.6% and 99.0%, respectively. The overall accuracy was 98.3%.
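For reference, the reported metrics follow from the four cells of a binary confusion matrix in which HSIL is the positive class. The sketch below uses hypothetical counts chosen only for illustration, not the study's confusion matrix, and the function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic metrics, with HSIL as the positive class.

    tp: HSIL frames predicted as HSIL    fn: HSIL frames predicted as LSIL
    tn: LSIL frames predicted as LSIL    fp: LSIL frames predicted as HSIL
    """
    return {
        "sensitivity": tp / (tp + fn),  # recall for HSIL
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # precision for HSIL
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for 200 frames (100 HSIL, 100 LSIL).
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```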

Discussion

This study introduces the first ubiquitous deep learning model worldwide that can detect and differentiate HPV-related dysplastic lesions in two distinct areas: the cervix and the anal canal. The model’s predictions are highly accurate and hold great potential for practical use in live clinical scenarios. This cross-zone interoperable model represents a novel advancement in computer-aided detection (CADe) and diagnosis (CADx) systems by enabling effective analysis across anatomically distinct regions. This approach offers an original solution to improve the accuracy and efficiency of endoscopic and magnified evaluation of these regions, providing a more versatile diagnostic tool for practitioners.

One of the primary strengths of this AI model is that it was developed using frames of histologically confirmed lesions from two different anatomical zones. This ensures the model is trained on both regions, unlike existing models, which focus on detecting only cervical or only anal lesions. Another key strength of the model is its multicentricity and interoperability, having been trained on data provided by four centers and three distinct devices used for endoscopic evaluation of the cervical and anal regions. This approach generated a more heterogeneous dataset, incorporating different populations from Europe and the Americas, which probably improves external validity, brings the model closer to real-life scenarios and narrows the gap between development and clinical practice.

Adhering to FAIR criteria is essential in the current development of AI software to enhance clinical practice24. CNN compatibility with multiple devices is mandatory in order to facilitate validation and extend clinical practice and research across multiple clinical settings. The interoperability of this model is therefore a significant advantage, elevating it to a higher level of technological readiness. Moreover, our group has been developing AI models for HPV-related dysplastic lesions, initially focusing on the anal canal, then the cervix, and now a single ubiquitous model capable of calculating predictions for both regions. This approach adheres to the principle of reusability and may facilitate the development of more robust and efficient AI models. The principles of findability and accessibility were also respected through the reproducible and consistent collection of data.

Comparing the models published so far is challenging, as comparing performance metrics alone may not provide an accurate assessment (Table 4). The methodologies used in each study vary significantly, making direct comparisons difficult. For the cervix, several CNNs have been published for differentiating HSIL. Miyagi et al. reported 80% sensitivity and 88% specificity (fivefold cross-validation), using a dataset of non-stained LSIL and HSIL (two categories) frames17. This study involved a low number of patients (330) and used only one frame per colposcopy. Yuan et al. achieved 85% sensitivity and 85% specificity (train-test-validation 80-10-10%; these metrics relate to distinguishing HSIL from other categories), using a dataset of normal, LSIL and HSIL (three categories) stained frames19. This study included a large number of patients (11,198) and used frames stained with acetic acid and with Lugol’s iodine. Xue et al. reported 66% sensitivity and 90% specificity (train-test-validation 70-10-20%; these metrics relate to distinguishing HSIL from other categories), using a dataset that included normal, LSIL, HSIL and cancer (four categories) non-stained frames18. This study involved a larger number of patients (19,435) but relied on frame annotation. Chen et al. achieved 88% sensitivity and 94% specificity (train-test-validation 60-20-20%), using both stained and non-stained LSIL and HSIL frames, with multiple frames per exam from a total of 6002 patients15. Fang et al. reported 82% sensitivity for detecting HSIL in a dataset of non-stained frames (normal, LSIL, HSIL and cervical cancer) from 1189 patients16. Lastly, Mascarenhas et al. reported 99.7% sensitivity and 98.6% specificity (train-test 90-10%), using a dataset of LSIL and HSIL (two categories) from non-stained, stained and post-manipulated frames14. The dataset comprised a higher number of frames containing dysplastic lesions (22,693), from 70 patients.

For the anal canal, to our knowledge, only our group has published evidence on the development of AI models. Our studies have shown significant progress: from a pilot study reporting 91.4% sensitivity and 89.7% specificity (train-test 90-10%), using a dataset of 5026 LSIL and HSIL (two categories) frames20; to a subsequent study achieving 96.5% sensitivity and 94.3% specificity (fivefold cross-validation), using a dataset of 27,770 frames and maintaining high performance metrics across categories21; and finally, to our latest study, which used frames from high-resolution colposcopes and anoscopes13. For detection of HSIL, that model reported 93.6% sensitivity and 95.7% specificity (train-test 80-20%), from a total of 57,882 frames across 151 exams.

Table 4 Summarized published deep learning models developed for differentiation of HPV-related dysplastic lesions in the cervical and anal regions.

From the data science methodology and analysis perspective, some key points of the study should be mentioned. We strictly included only frames of lesions that were later confirmed through histological analysis (ground truth). These frames came from the entire endoscopic examination, encompassing non-stained, stained and post-manipulation frames, which can be particularly useful for physicians, as the model’s diagnostic performance may not be compromised by the presence of stains, blood or burnt tissue. Moreover, we included a diverse dataset with images from various angles, implemented a proper train-test split, and avoided data annotation. This model represents a pioneering approach, as it was trained simultaneously on still frames showing HPV dysplastic lesions (LSIL or HSIL) from two different anatomical regions: the cervix and the anus. It demonstrated high performance metrics in the test set, achieving 99.0% sensitivity and 97.8% specificity. We also performed cross-validation during the training/validation phase. The average metrics were similarly high (mean sensitivity 98.1% and mean specificity 97.4%), indicating robustness across different frame distributions.

The limitations of this study should be acknowledged: its retrospective nature, the lack of a procedural split between the training/validation and testing sets, and the potential demographic bias associated with the absence of patient-level data (due to GDPR restrictions). These limitations may contribute to a risk of overfitting. Consequently, the findings of this study cannot be broadly generalized or directly applied to clinical settings, and more prospective, multicentric studies are still needed to determine whether the use of this AI model can significantly improve the diagnosis and treatment of HPV-related dysplastic lesions. Since our dataset preparation did not involve manual data annotation, and due to the inherent black-box nature of these models, we excluded more complex cases with simultaneous presence of LSIL and HSIL lesions in the same frame, which is also a limitation that can compromise the external validity of the study’s results. For similar reasons, we excluded frames where both lesions and instruments (e.g. forceps used for traction to better expose a lesion) were present simultaneously. We acknowledge that ensuring the model can function without interference from such instruments is important for its real-world applicability. Additionally, from a gynecological/proctological perspective, the ideal CADe/CADx system would be capable of detecting and differentiating LSIL, HSIL, and non-dysplastic lesions. Due to limitations in our current dataset, implementing a three-class model at this stage was not feasible.

In conclusion, the development of efficient, interoperable AI models, built with minimal selection bias and diverse datasets, is essential for implementing this technology in real clinical scenarios. Using a single ubiquitous model for both anatomical regions can be more versatile and efficient in clinical practice than employing two separate models. Future research will prioritize the validation of CADe/CADx models in prospective, multicentric and real-time clinical contexts, including tandem comparative evaluation of AI-enhanced performance against the diagnostic performance of conventional high-resolution anoscopy and/or colposcopy. Building on this need, this multicentric study represents a necessary intermediate step and introduces an innovative model for the detection and differentiation of HPV-related dysplastic lesions in the two main anatomical areas assessed with colposcopes and anoscopes. This development may improve the clinical outcomes and cost-effectiveness of these procedures, potentially making them accessible to a larger portion of the population.

Due to the retrospective nature of the study, the ethics committees of Groupe Hospitalier Paris Saint-Joseph, Instituto de Infecciologia Emílio Ribas and Hospital Universitário Santo António (IRB 00012157, SPTC 81/2023, IRB 2023.157(131-DEFI/123-CE), respectively) waived the need to obtain informed consent.