Introduction

Colon capsule endoscopy (CCE) is a non-invasive procedure with advantages in diagnosing, monitoring, and managing colorectal diseases compared to optical colonoscopy (OC), flexible sigmoidoscopy, and computed tomographic colonography (CTC)1,2,3. Clinical trials show CCE outperforms CTC in detecting polyps larger than 6 mm and is non-inferior for polyps larger than 10 mm4. In cases of incomplete OC, CCE has a higher diagnostic yield than CTC for polyps of any size5,6. Patients also prefer CCE due to its lower complication rate compared to OC, supporting its broader adoption. However, several challenges limit its widespread use, including the dependency on bowel cleansing quality, logistical issues in capsule handling, labor-intensive image review (approximately 12,000 images per investigation), low-resolution imaging, and low completion rates. To address these, we developed an AI-enhanced wireless capsule featuring real-time image processing, dual-mode imaging (white-light and narrow-band), and bi-directional communication with personal devices for reporting findings7. This novel design significantly improves hardware and software capabilities, enhancing diagnostic accuracy and enabling real-time AI analysis. Despite these advancements, gaps in clinical implementation persist, raising feasibility concerns among professionals8.

The CCE pathway includes bowel preparation, capsule ingestion, manual image analysis, and post-procedure care. This study focuses on fully integrating AI into the image analysis stage, enabling autonomous detection, localization, and characterization of findings using advanced algorithms. Characterization involves analyzing abnormalities’ morphology (e.g., size) and histopathological (HP) properties, such as neoplastic features. Building on previous work in polyp detection and localization5,6,9,10,11,12,13,14,15,16,17, we incorporate size estimation and HP analysis to further optimize CCE’s workflow, advancing toward full AI integration in routine clinical practice.

While AI-based detection and classification of colorectal polyps in optical colonoscopy is well-studied18, few studies (besides our own) have explored similar questions in CCE investigations19,20,21. This is largely due to the lack of publicly available datasets for CCE, unlike optical colonoscopy databases such as Kvasir (A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection). Additionally, CCE's reliance on white-light imaging (WLI) alone results in lower-resolution images, posing significant challenges for AI algorithms to achieve the robustness required for routine clinical practice.

This study builds on the “Danish CareForColon2015 trial (cfc2015),” launched in 2021 as part of the Danish Colorectal Cancer Screening program. As the largest randomized controlled trial on CCE22, its primary aim was to compare detected colorectal cancers and intermediate- or high-risk adenomas between intervention and control groups. Secondary aims included evaluating patient acceptability, complication and completion rates, interval CRC rates, patient-reported outcomes (PRO), long-term cancer incidence, social inequality, CCE applicability, and cost-effectiveness. Of the 370,306 individuals invited to screening, 2015 FIT-positive patients underwent CCE investigations, forming the basis for developing and validating the AI algorithms in this study. Further details can be found at https://clinicaltrials.gov/ct2/show/NCT04049357 or in one of our recent publications15.

Contribution

A sketch of the workflow associated with CCE's pathway automation is presented in Figs. 1 and 2. The steps shown in blue have been published in previous works5,6,11,13,14, while those in orange are under development. This paper presents the steps highlighted in green, which focus on recognizing important abnormalities, estimating their size, and defining their histopathology.

Fig. 1

AI-based optimization of CCE’s pathway for image analysis. Blue boxes represent the algorithms that are completed (previously published). Green boxes represent the algorithms discussed in this work, and orange boxes represent the future work.

Fig. 2

Sequence of neural networks (NN) for recognition (NasNetLarge), characterization (VGG16) and polyp size estimation (AID-U-Net) algorithms.

This paper is organized as follows: We first introduce a deep neural network capable of detecting abnormalities with high sensitivity and specificity. Detected abnormalities are then processed by two parallel algorithms for size estimation and characterization, with details of their explainability (XAI) criteria provided in the article. Finally, we evaluate the outcomes of these algorithms, highlight their strengths and limitations, and conclude on the feasibility of optimizing CCE image analysis with AI.

Methods and results

Code availability

Code for the recognition, characterization, and size estimation algorithms developed in this study is available to interested readers upon request to the corresponding author. Consortium agreements signed with the funding agencies, i.e., the European Union and the UK Research and Innovation Office (co-funded), prohibit us from sharing the code in public repositories.

Ethics

The study was approved by the Regional Health Research Ethics Committee (journal number S-20190100), was registered with the Regional Data Protection Agency (journal number 19/29858), as well as with ClinicalTrials.gov (identifier NCT04049357). All participants received verbal and written study information prior to participation, and signed informed consent was obtained from each individual. The study was conducted in accordance with the Declaration of Helsinki.

Recognition

Our prior research on colorectal polyp detection and localization using an enhanced ZF-Net achieved \(98.0\%\) accuracy, \(98.1\%\) sensitivity, and \(96.3\%\) specificity on a dataset of approximately 800 images (400 with polyps, 400 normal mucosa)12. While this remains one of the top-performing networks in the literature, deploying it in the cfc2015 trial risked missing up to four cancers among the 2015 FIT-positive patients, given the prevalence of CRC in this population. Additionally, its findings serve as input for the size estimation and characterization algorithms, necessitating a new DNN with higher negative predictive value (NPV), sensitivity, and specificity. After evaluating leading architectures such as ResNet50 and InceptionV3, we selected NasNetLarge as the backbone for abnormality recognition.

To adapt the network's architecture and apply transfer learning for the purpose of this study, we modified the last 20 learnable layers and froze the parameters of the remaining layers. The learning rate was initially set to \(1\hbox{e}{-}3\) and adaptively reduced every 2 epochs during training until the validation criteria were met. Training was limited to a maximum of 6 epochs, with a mini-batch size of 10 and a validation frequency of 798. The dataset, containing images of both colorectal polyps (1751) and normal mucosa (1672) (Fig. 3), was augmented by horizontal and vertical random reflection, random scaling, random translation, and random rotation, all drawn from a continuous uniform distribution, without affecting the contents or the size of the images. This augmentation resulted in a database containing 5838 images of colorectal polyps and 5573 images of normal mucosa. The dataset was split into \(70\%\) for training and validation, with the remaining \(30\%\) allocated to testing. NasNetLarge with this configuration achieved a sensitivity of \(99.9\%\), a specificity of \(99.4\%\), and an NPV of \(99.8\%\) on the test set, implying that fewer than one cancer among the cfc2015 trial subjects would be missed.
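
A minimal TensorFlow/Keras sketch of this transfer-learning setup is given below. The folder name, augmentation ranges, and the learning-rate drop factor are illustrative assumptions; only the hyperparameters quoted above (initial rate of \(1\hbox{e}{-}3\), reduction every 2 epochs, a maximum of 6 epochs, and a mini-batch size of 10) are taken from the study.

```python
import tensorflow as tf

base = tf.keras.applications.NASNetLarge(
    weights="imagenet", include_top=False, input_shape=(331, 331, 3))
for layer in base.layers[:-20]:          # freeze all but (roughly) the last 20 layers
    layer.trainable = False

model = tf.keras.Sequential([
    # Random reflection, rotation, scaling and translation, mirroring the augmentation above.
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.05, 0.05),
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),   # NASNet expects [-1, 1] inputs
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),         # polyp vs. normal mucosa
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

def drop_every_two_epochs(epoch, lr):    # adaptive learning-rate reduction (factor assumed)
    return lr * 0.5 if epoch > 0 and epoch % 2 == 0 else lr

# "cce_frames/" is a hypothetical folder with one sub-folder per class; the study's
# separate 30% test set is omitted here for brevity.
train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "cce_frames", validation_split=0.3, subset="both", seed=1,
    image_size=(331, 331), batch_size=10, label_mode="categorical")

model.fit(train_ds, validation_data=val_ds, epochs=6,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(drop_every_two_epochs)])
```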

To evaluate whether the network had learned sufficient features and whether knowledge had been transferred, we exposed the recognition network to an additional set of images not previously used for either training or testing. Manual examination by trained CCE readers initially classified three cases as either inflammation or normal tissue, a diagnosis confirmed by specialists. However, during a routine quality check in the cfc2015 trial, medical experts identified these cases as cancers. To evaluate our DNN's performance, the misdiagnosed images were included in the test set, where the network correctly flagged them as significant findings, demonstrating its generalization capability. Following this incident, the cfc2015 trial experts reexamined all patient images to ensure no critical findings were missed.

Fig. 3

Samples of retrieved CCE images used for training and validation of the abnormality detection algorithm.

Explainable AI (XAI)

To enhance interpretability and trust in our recognition algorithm, three classes of XAI methods for image processing were integrated: CartoonX23, Pixel Rate-Distortion Explanation (Pixel RDE)23, and Class Activation Mapping (CAM) with its variants, such as GradCAM24.

Starting with CAM-based techniques, the explanations from GradCAM, GradCAM++, AblationCAM, and RandomCAM reveal that the proposed recognition model extracts the most representative features for the classification task, i.e., the existence of a polyp vs. no polyp. The remove-and-debias parameters nullify the effect of leaking data by replacing removed pixels with a weighted average of their neighbors. Percentile values of [20, 40, 95] provide the top \(20\%\), \(40\%\), and \(95\%\) of the most important regions identified by the various CAM techniques that correlate with the target class. A single score measures the average impact on model confidence by contrasting the removal of the least and most important regions of an image. Our overall analysis showed that among the four CAM-based techniques, GradCAM++ explanations map the exact features from the last convolutional layer, which mainly focus on polyps within the image. This is highlighted by the score obtained for each explanation (Fig. 4), hence justifying the decision of the network.
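
For readers who want to reproduce this kind of explanation, the snippet below is a from-scratch Grad-CAM sketch in TensorFlow rather than the exact tooling used in the study; the layer name and class index in the usage comment are hypothetical.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Return a coarse heat map (values in [0, 1]) for one image of shape (H, W, 3)."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)      # d(class score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))      # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                             # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Hypothetical usage: grad_cam(recognition_model, frame, "last_conv_layer", class_index=0)
```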

To further confirm the validity of the explanations, we also used CartoonX and pixel rate-distortion explanations. Unlike CAM-based models, which rely on output feature maps, CartoonX and pixel RDE explore input features, making them model-independent explainable techniques. For a faithful comparison between the different methods, we used the same polyp image when comparing the outputs of CartoonX and pixel RDE with the CAM-based techniques. As evident from Fig. 4, both CartoonX and pixel RDE explanations detect the main region of the polyp, being the most critical area of the image. CartoonX provides effective input features due to its ability to extract relevant and piecewise smooth image segments, rather than focusing on sparsely distributed pixel regions. Comparison between CartoonX and pixel RDE explanations reveals that CartoonX explanations show lower distortions for image classification compared to Pixel RDEs, confirming the findings of the original study25. This is because CartoonX provides piecewise smooth explanations, effectively uncovering meaningful patterns that are less apparent with pixel RDE and CAM-based techniques.

Fig. 4

Examples of explainability, including heat maps (top two), GradCAM (middle), and CartoonX and Pixel RDE (bottom), illustrating the network's decision-making process. C-Polyp and No-C-Polyp stand for colorectal polyp and no colorectal polyp, respectively.

Size estimation

The size estimation algorithm builds on our previous work developing AID-U-Net, a novel semantic segmentation network13,14,26. AID-U-Net incorporates direct contracting and expansive paths, along with unique sub-contracting and sub-expansive paths, achieving superior performance (F1 score: \(88.1\%\)) compared to U-Net (\(81.1\%\)) and U-Net++ (\(87.6\%\)), without requiring pre-trained backbones. The optimal architecture for segmenting CCE images was AID-U-Net(2,2), with a depth of two for both the direct path and sub-path. For detailed architecture and performance insights, we refer readers to our earlier work13.

We applied AID-U-Net(2,2) to an augmented dataset of 5838 colorectal polyp images, achieving correct segmentation in \(81\%\) (4685 images). By comparison, U-Net and U-Net++ achieved \(61\%\) and \(72\%\), respectively. Incorrect segmentations fell into three scenarios: (1) missing a region of interest (ROI, e.g., a polyp), (2) segmenting the wrong region, or (3) splitting a single ROI into multiple segments. Assuming each image contained one ROI and summing the estimated sizes of all segmented regions improved segmentation accuracy to \(88\%\). Examples of segmented polyps, along with their bounding boxes and fitted ellipses, are shown in Fig. 5.

Fig. 5

Examples of the outcome from the segmentation and size estimation algorithm. Purple regions represent pixels associated with detected abnormalities, and the white region is the cropped-out version of the same region. Red rectangles represent bounding boxes encompassing the important findings, e.g., polyps.

Precise size estimation requires polyp depth relative to the camera lens, which is unavailable in CCE. Therefore, our algorithm estimates size by calculating the ratio of the largest diameter of the fitted ellipse around the segmented polyp to the total image size, excluding peripheral information (e.g., date and time). Using this approach, we achieved a perfect match with size estimates from the RAPID Reader27 used by trained CCE readers.
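
A minimal sketch of this ratio-based estimate, assuming a binary segmentation mask and a fixed border margin for the peripheral overlay, could look as follows (scikit-image's ellipse fit via regionprops stands in for the study's implementation):

```python
import numpy as np
from skimage import measure

def relative_polyp_size(mask: np.ndarray, border: int = 16) -> float:
    """Largest fitted-ellipse diameter of the segmented polyp over the usable image width."""
    core = mask[border:-border, border:-border]      # drop peripheral overlay (date/time)
    labels = measure.label(core)
    if labels.max() == 0:                            # nothing segmented
        return 0.0
    regions = measure.regionprops(labels)
    # If the ROI was split into several segments, the pipeline sums the segment size
    # estimates; for brevity we keep the single largest fitted ellipse here.
    major_axis = max(r.major_axis_length for r in regions)
    return major_axis / core.shape[1]

# Example: ratio = relative_polyp_size(segmentation_output > 0.5)
```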

Using histopathology reports as the gold standard, the size estimation algorithm mapped polyp sizes from segmented CCE images to pathology outcomes. Currently, 280 polyps are matched between the two datasets (CCE vs. pathology). We observed that CCE generally overestimates polyp sizes compared to pathology reports. The best regression model, using a fine Gaussian support vector machine (SVM), achieved a root mean squared error (RMSE) of approximately 6 mm. Despite this error, a richer database of matched polyps, currently being expanded through the cfc2015 trial, is expected to improve accuracy.
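
The regression step can be sketched with scikit-learn as below; the data are synthetic placeholders standing in for the 280 matched polyps, and the kernel parameters are not the tuned values from the study.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data; C and gamma are illustrative, not the tuned values.
cce_mm = np.random.uniform(2, 30, size=(280, 1))                 # CCE-derived sizes (mm)
hp_mm = 0.8 * cce_mm[:, 0] + np.random.normal(0, 3, size=280)    # synthetic pathology sizes

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", gamma="scale", C=10.0))
pred = cross_val_predict(model, cce_mm, hp_mm, cv=5)
print(f"RMSE: {np.sqrt(mean_squared_error(hp_mm, pred)):.1f} mm")
```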

Characterization

Colorectal polyps are classified as neoplastic or non-neoplastic, and this classification, alongside size, histology, and location (distal vs. proximal colon), guides patient management and treatment success. Our characterization algorithm is a binary classifier that takes CCE images flagged as important findings and categorizes them as neoplastic or non-neoplastic.

The dataset includes 479 images of polyps observed during CCE, resected, and matched post-polypectomy. Of these, 317 were neoplastic and 162 non-neoplastic. To address the small sample size and class imbalance (roughly a 2:1 ratio), we augmented the dataset fourfold using random horizontal/vertical reflections, scaling, translation, and rotation without altering image content or size. Although only the polyp segments should inform pathology, we used entire image frames because many featured small polyps occupying minimal space.

Using the same training settings as the recognition DNN, we implemented a VGG16 network as the backbone for the characterization algorithm. The dataset was split into \(70\%\) for training/validation and \(30\%\) for testing, yielding a binary classifier with \(84.3\%\) sensitivity, \(81.5\%\) specificity, and \(82\%\) accuracy.
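
A minimal Keras sketch of such a VGG16-based binary head is given below; the input size, frozen depth, and dense-head layout are assumptions rather than the study's exact configuration.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                  input_shape=(224, 224, 3))
for layer in vgg.layers[:-4]:            # keep only the last convolutional block trainable
    layer.trainable = False

characterizer = tf.keras.Sequential([
    vgg,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2, activation="softmax"),   # neoplastic vs. non-neoplastic
])
characterizer.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                      loss="categorical_crossentropy", metrics=["accuracy"])
```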

To enhance the explainability of the VGG16 characterization network, we assigned relevance to the learned patterns using pattern attribution techniques such as Layer-wise relevance propagation (LRP). LRP interprets neural network predictions by attributing relevance scores to individual input features, such as pixels in an image28,29. Starting from the output of the network, relevance is traced backward, layer by layer, down to the input, while ensuring that the total relevance remains constant across layers. This process highlights which input features are most influential in the prediction. We implemented the following LRP variants, as shown in Fig. 6 (a minimal numerical sketch of the epsilon rule is given after the list):

  • Epsilon Rule: incorporating a small positive constant \(\epsilon\) to prevent division by small values when distributing relevance proportionally to the weighted activations of neurons, reducing noise in the relevance attribution. We set \(\epsilon = 0.1\).

  • Alpha-Beta Rule: splitting relevance between positive (\(\alpha\)) and negative (\(\beta\)) contributions, where \(\alpha +\beta =1\), allowing flexible focus on supportive or opposing evidence. We set \(\alpha =1\) and \(\beta =0\) so that only positive contributions are highlighted.

  • Gamma Epsilon Rule: enhancing relevance for positive contributions by applying a factor \(\gamma >0\), to emphasize important features. Further, we set \(\epsilon = 0.1\) to avoid division by small values.

  • PatternNet: rather than attributing relevance to individual features based on activations or weights, PatternNet focuses on the specific patterns in the input data that are most aligned with the network's learned features.

  • Pattern attribution: relevance is attributed not just to the individual input features but to the patterns that are learned by the network, highlighting their contribution to the final decision.
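
As referenced above, the following is a minimal NumPy sketch of the \(\epsilon\)-rule for a stack of dense layers; in practice relevance is propagated through VGG16's convolutional layers with dedicated tooling (e.g., a library such as iNNvestigate), and the weights, biases, and activations here are placeholders.

```python
import numpy as np

def lrp_epsilon(weights, biases, activations, relevance_out, eps=0.1):
    """Propagate relevance from the output back to the input through dense layers.

    weights[l], biases[l] : parameters of layer l
    activations[l]        : inputs to layer l (activations[0] is the network input)
    relevance_out         : relevance at the output (e.g. the predicted class score)
    """
    relevance = relevance_out
    for W, b, a in zip(reversed(weights), reversed(biases), reversed(activations)):
        z = a @ W + b                                   # pre-activations of this layer
        z = z + eps * np.where(z >= 0, 1.0, -1.0)       # epsilon stabiliser
        s = relevance / z                               # relevance per output neuron
        relevance = a * (s @ W.T)                       # redistribute onto the inputs
    return relevance                                    # same shape as the network input
```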

In Fig. 6, the regions marked in red have a significant influence on the network’s decision, while the regions marked in blue have a lesser impact. From this figure, we can see that PatternAttribution offers a clearer visualization of the features influencing the polyp characterization network’s predictions. Unlike other methods that focus on pixel-level or activation-level relevance, PatternAttribution captures the relationship between input patterns and the model’s decision-making process, providing a more intuitive and comprehensive understanding of how the network made its decisions.

Fig. 6

Layer-wise relevance propagation (LRP) on the VGG16 characterization network, used to classify polyps as neoplastic versus non-neoplastic.

To enhance performance by incorporating polyp texture, we also applied neural style transfer (NST), a technique effective in image stylization30,31. By leveraging Gram matrices from different convolutional layers, we captured a stationary, multi-scale representation of texture through filter response correlations, independent of global image arrangement. Since neoplastic and non-neoplastic polyps exhibit distinct textures, this texture-based information supplemented the VGG16 classification, utilizing Gram matrix outputs as inner products of vectorized feature maps across layers.
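
The Gram-matrix texture descriptor can be sketched as follows; the choice of VGG16 layers is an assumption, and any set of convolutional layers could be used.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
texture_model = tf.keras.Model(
    vgg.input,
    [vgg.get_layer(name).output for name in ("block1_conv2", "block3_conv3")])

def gram_matrices(image_batch):
    """Per-layer Gram matrices: inner products of vectorised feature maps."""
    grams = []
    for fmap in texture_model(image_batch):
        b, h, w, c = fmap.shape
        flat = tf.reshape(fmap, (b, h * w, c))               # (batch, positions, channels)
        gram = tf.matmul(flat, flat, transpose_a=True)       # (batch, channels, channels)
        grams.append(gram / tf.cast(h * w, tf.float32))      # normalise by map size
    return grams

# Example: gram_matrices(tf.random.uniform((1, 224, 224, 3)))
```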

Discussion and conclusions

The Danish National Institute of Health's Technology Assessment Board recently decided against recommending AI for colonoscopy in diagnosing neoplastic disease. Their decision was based on two main reasons: insufficient evidence, with only a meta-analysis of two trials from the same authors, sponsored by the manufacturer, and outdated clinical guidelines that require the removal of all polyps, including many insignificant ones. Current guidelines mandate the removal of all colorectal polyps, which means AI's ability to detect even insignificant polyps would increase the treatment burden on patients and the healthcare sector. Despite growing interest in AI solutions in radiology, the Danish Treatment Council's decision highlights the healthcare sector's unreadiness to adopt AI for gastrointestinal disorder detection. This is due to the lack of robust, generalizable models in gastroenterology, unlike in radiology, where extensive public databases exist and AI-based reports provide detailed diagnostic information.

Clinical trials like CFC2015 and initiatives in NHS Scotland (ScotCap)32 and NHS England33 to implement colon capsule endoscopy in primary care will increase data availability. Additionally, research like this offers AI-based solutions comparable to those in radiology, bridging the gap in AI deployment between gastroenterology and radiology. The goal is to develop algorithms that generate comprehensive patient reports, similar to those in other clinical fields where AI is more established.

To achieve this goal, each algorithm (Fig. 1) must perform reliably. CCE videos are a sequence of images sampled at a variable frequency. The main difference between still-image analysis and video analysis is that the temporal information carried by the video, i.e., the order of the images in the sequence, is partially discarded during still-image analysis. While some tasks, such as tracking the path of the endoscopic capsule (localization of important findings) based on feature point tracking, require the temporal information of the video, other tasks, such as polyp recognition, characterization, or size estimation, do not necessarily benefit from its inclusion. We reported these findings in one of our previous studies12, where the performance of the AI algorithms on still-image and video analysis was similar.

The recognition network has shown exceptional sensitivity, specificity, and NPV, making it ready for external validation with over 3000 new CCE videos from the ScotCap trial. For each patient, the network will identify candidate images with polyps and other significant findings. Inclusion of XAI techniques such as heat maps, CartoonX, Pixel RDE, and GradCAM explanations enhances both interpretability and trust in the algorithm's decisions. This is particularly important for misclassified cases and for those such as the one shown in Fig. 4 (second row): despite correctly classifying the image as one containing a polyp, the DNN based its decision partly on regions that feature normal mucosa.

The size estimation algorithm's first component uses the segmentation results from AID-U-Net(2,2). This network outperformed U-Net and U-Net++ and matched the \(84\%\) accuracy of UNetResNet, despite having fewer parameters. The second component is a Gaussian SVM-based regression estimator that maps CCE findings to histopathology sizes. Our results showed a systematic overestimation by CCE, with a 6 mm size estimation error (RMSE) compared to histopathology.

Several studies have shown that CCE often overestimates polyp sizes compared to freshly retrieved (OC) and formalin-fixed (histopathology) polyps, especially for polyps smaller than 6 or 10 mm10,17. This discrepancy between CCE and OC is due to differences in morphological assessment, with polyps appearing more “pedunculated” in CCE and more “flat” in OC, likely due to colon inflation during OC34. The 6 mm size estimation error (RMSE) between CCE and histopathology is partly due to a small dataset, which will improve with more data. Switching from a regression-based to a classification-based size estimator, dividing sizes into four classes (\(\le 6\,\hbox{mm}\), \(7\,\hbox{mm} \le \cdots <10\,\hbox{mm}\), \(10\,\hbox{mm} \le \cdots < 20\,\hbox{mm}\), and \(\ge 20\,\hbox{mm}\)) as shown in Fig. 7, can reduce patient classification uncertainty. However, there still remains a gap between the performance of our size estimation algorithm and AI-assisted colonoscopy34.
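
A minimal sketch of such a classification-based estimator, binning matched sizes into the four classes and tabulating agreement, is given below with placeholder values:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def size_class(size_mm: float) -> int:
    """Map a polyp size in mm to the four classes of Fig. 7."""
    if size_mm <= 6:
        return 0
    if size_mm < 10:
        return 1
    if size_mm < 20:
        return 2
    return 3

# Placeholder matched sizes (mm); real values come from the CCE pipeline and pathology.
cce_mm = [4.0, 8.5, 12.0, 25.0, 5.5]
hp_mm = [3.0, 6.0, 11.0, 22.0, 9.0]
cm = confusion_matrix([size_class(s) for s in hp_mm],
                      [size_class(s) for s in cce_mm], labels=list(range(4)))
print(cm)   # rows: histopathology class, columns: CCE-estimated class
```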

Fig. 7

Confusion matrix of estimated polyp size categorized in four classes (\(\hbox{size} \le 6\,\hbox{mm}\), \(7\,\hbox{mm} \le \hbox{size} <10\,\hbox{mm}\), \(10\,\hbox{mm} \le \hbox{size} < 20\,\hbox{mm}\), and \(\hbox{size} \ge 20\,\hbox{mm}\)) for CCE versus HP.

Knowing the exact location of a polyp is very important for the follow-up colonoscopy. In one of our previous studies11, we developed a novel localization technique using feature point tracking that addressed this issue. However, this algorithm required a considerable amount of time (on the order of hours) to run and to reconstruct the path the capsule had taken through the GI tract. Knowing the precise location of a polyp instantly from CCE videos remains an open challenge, and we therefore provide an alternative solution. By detecting anatomical landmarks such as flexures, we can report the approximate location of polyps, i.e., ascending, transverse, or descending colon, and guide the colonoscopist to the resection site.

The two strategies for the characterization algorithm, training a VGG16 network and using neural style transfer for texture analysis, have been effective. The inclusion of layer-wise relevance propagation-based XAI, which attributes relevance scores to individual input features, has helped capture the important regions of the image. Despite the clear advantage of PatternAttribution over the other LRP-based algorithms, it can be observed in Fig. 6 that in one case (top middle) the network highlighted regions that feature normal mucosa. We anticipate that increasing the number of images and including larger polyps will improve the algorithm's performance. Analyzing texture information of segmented polyps and the surrounding tissue will help quantify vascularity, aiding in distinguishing between the neoplastic and non-neoplastic classes.

While we are externally validating our algorithms with ScotCap data, improving their performance remains a priority. Annotating data and achieving interobserver and intraobserver agreement in CCE and OC evaluations require hybrid active learning approaches. Strategies like Self-Supervised Learning (SSL), Balance Exploration and Exploitation (BEE), and conformal prediction (CP) can reduce the number of samples needed by querying the teacher network for labels. Future work includes completing all algorithms, external validation with ScotCap data, and enhancing algorithm performance. We also plan to adopt radiology workflow solutions, such as PACS and DICOM, and to transfer CCE-generated non-DICOM images to cloud-based PACS for real-time analysis, report generation, and sharing with external medical professionals.