Introduction

Colorectal cancer (CRC) is the third most common malignancy and the second leading cause of cancer-related mortality globally1. Early detection and removal of precancerous adenomas via colonoscopy are crucial for reducing CRC incidence and mortality2. Despite the proven effectiveness of colonoscopy in CRC prevention, 26% of adenomas and 9% of advanced adenomas are missed during colonoscopy, largely due to the substantial variability in colonoscopy inspection quality3. Missed lesions may promote the development of post-colonoscopy CRC (PCCRC)4. Therefore, various quality indicators for colonoscopy, such as withdrawal time, bowel preparation quality, and cecal intubation rate, have been established to increase the adenoma detection rate (ADR) and reduce the risk of PCCRC5,6,7.

Colonoscopy withdrawal time is defined as the period spent withdrawing the colonoscope while inspecting the colonic mucosa from the cecum to the anus in the absence of an intervention6. Current guidelines recommend a minimum withdrawal time of 6 min to ensure thorough mucosal observation and enhance ADR8,9,10,11. According to recent studies, increasing the withdrawal time to 8–13 min could further improve the ADR, as it allows for a more meticulous examination of the colonic mucosa12,13,14,15,16. However, withdrawal time comprises various components, including time spent on non-informative and defective observations, handling of foreign bodies, and qualified mucosal observation17. Enhancing qualified mucosal observation is more effective in increasing the ADR than simply prolonging the withdrawal time18. Moreover, improving qualified mucosal exposure can significantly reduce the total inspection time without compromising lesion detection19,20. These findings underscore the importance of focusing on qualified mucosal observation time (QMOT) rather than on the total withdrawal time. However, manual quantification of QMOT during routine practice demands substantial human resources and is prone to interobserver variability.

The integration of artificial intelligence (AI) into medical imaging has shown great promise for enhancing diagnostic accuracy and procedural quality21,22,23. In colonoscopy, AI has primarily focused on real-time polyp or adenoma detection and characterization, demonstrating its potential for standardizing colonoscopy quality and reducing interobserver variability24,25,26. However, little research has investigated the use of AI to measure QMOT or validated such a tool as a quality control metric for colonoscopy.

This study aimed to develop an AI system, named the quality assessment master (QAMaster), for the automatic calculation of QMOT during colonoscopy withdrawal by leveraging Vision Transformer (ViT) models (Fig. 1)27, and to validate its accuracy and effectiveness. Through comprehensive validation, our study demonstrates that QAMaster is a potentially practical tool for assessing colonoscopy quality.

Fig. 1: Illustration of the overall pipeline for the construction and evaluation of the Quality Assessment Master (QAMaster).

Data of the colonoscopic withdrawal images were collected and annotated. QAMaster, which consists of two models—Model I for image quality assessment and Model II for anatomical landmark identification—was then trained. Afterward, QAMaster was validated using both internal and external testing datasets. The qualified mucosal observation time (QMOT) was automatically calculated using QAMaster, with a threshold of QMOT ≥ 90 s set as the cut-off, corresponding to an adenoma detection rate (ADR) of 26.05%. Finally, the clinical value of QMOT was assessed using a prospective cohort study.

Results

QAMaster development and evaluation

Four architectures, namely ViT, ResNet-50, DenseNet-121, and EfficientNet-b2, were evaluated for developing QAMaster. Compared with convolutional neural network architectures, the ViT-based models demonstrated a slight performance advantage, achieving the highest macro-AUC of 0.980 and micro-AUC of 0.983 in image quality assessment (Supplementary Table 1). For cecum identification, the ViT-based model also outperformed the others, with the highest AUC of 0.992 and an accuracy of 0.975 (Supplementary Table 2).

The accuracy of Model I in the internal testing dataset (20 patients, 19,467 images) ranged from 0.911 to 0.999 for the six types of images (in vitro, non-informative, foreign body, intervention, defective, and qualified), and the macro-AUC was 0.980 (range, 0.962–0.999). In the external testing dataset (20 patients, 19,328 images), the accuracy ranged from 0.961 to 0.995 and the macro-AUC was 0.990 (range, 0.981–0.996). In the prospective testing dataset (20 patients, 24,568 images), the accuracy ranged from 0.954 to 0.990, and the macro-AUC was 0.991 (range, 0.985–0.999) (Fig. 2a–i and Supplementary Tables 3–5). Representative predicted images are shown in Supplementary Fig. 1a–f, and the classification of images from a representative patient into the six types is shown in Fig. 2j. In the 10 videos, the proportion of each label identified by Model I was comparable to that of the endoscopists’ annotations (Fig. 2k).

Fig. 2: Performance of Model I for image quality analysis.

a Accuracy, sensitivity, and specificity of Model I on the internal testing dataset. b Confusion matrix of Model I on the internal testing dataset. c AUC performance for classifying multiclass labels of Model I on the internal testing dataset. d Accuracy, sensitivity, and specificity of Model I on the external testing dataset. e Confusion matrix of Model I on the external testing dataset. f AUC performance for classifying multiclass labels of Model I on the external testing dataset. g Accuracy, sensitivity, and specificity of Model I on the prospective testing dataset. h Confusion matrix of Model I on the prospective testing dataset. i AUC performance for classifying multiclass labels of Model I on the prospective testing dataset. j Distribution of Model I in recognizing each classification in a representative withdrawal video. k Comparison between Model I and expert endoscopists in recognition of each classification in the 10 withdrawal videos. AUC area under the curve.

The accuracy of Model II in identifying the cecum in the internal testing dataset (20 patients, 1751 images) was 0.963, with an AUC of 0.997. In the external testing dataset (20 patients, 1657 images), the accuracy was 0.947 and the AUC was 0.996. In the prospective testing dataset (20 patients, 1552 images), the accuracy was 0.958 and the AUC was 0.977 (Fig. 3a–i and Supplementary Table 6). Typical visualizations of Model II recognizing the cecum and other sites are shown in Supplementary Fig. 2a, b. Identification of the cecum images collected from a representative patient is shown in Fig. 3j. Notably, the withdrawal time predicted by QAMaster was highly correlated with that determined by the endoscopists (Pearson correlation coefficient 0.991, P < 0.001, Fig. 3k).

Fig. 3: Performance of Model II for landmark identification.

a Accuracy, sensitivity, and specificity of Model II on the internal testing dataset. b Confusion matrix of Model II on the internal testing dataset. c AUC performance for classifying landmarks of Model II on the internal testing dataset. d Accuracy, sensitivity, and specificity of Model II on the external testing dataset. e Confusion matrix of Model II on the external testing dataset. f AUC performance for classifying landmarks of Model II on the external testing dataset. g Accuracy, sensitivity, and specificity of Model II on the prospective testing dataset. h Confusion matrix of Model II on the prospective testing dataset. i AUC performance for classifying landmarks of Model II on the prospective testing dataset. j Distribution of Model II in recognizing each classification in a representative withdrawal video. k Correlation between the predicted withdrawal time of QAMaster and endoscopists in 100 withdrawal videos. AUC area under the curve, PCC Pearson correlation coefficient.

Clinical effectiveness of QMOT in prospective cohort

A total of 482 patients were enrolled in the prospective cohort analysis (Supplementary Fig. 3). The baseline characteristics are shown in Supplementary Table 7. The mean QMOT was 78.93 s, and the mean withdrawal time was 365.48 s.

To simplify the utilization of QMOT, the actual QMOT calculated by QAMaster was converted into categorical variables at 30-s intervals. A significant positive correlation was observed between QMOT and ADR, with a Spearman’s rank correlation coefficient of 0.999 (Fig. 4a, P < 0.001). The ADR was 8.33% in patients with <30-s QMOT (n = 24), 16.25% in patients with 30–60-s QMOT (n = 160), 26.05% in patients with 60–90-s QMOT (n = 142), 32.98% in patients with 90–120-s QMOT (n = 94), 37.84% in patients with 120–150-s QMOT (n = 37), and 48.00% in patients with ≥150-s QMOT (n = 25). To achieve an ADR of at least 25%, QMOT ≥90 s was set as the cut-off (at which the ADR was 26.05%) (Fig. 4a).

Fig. 4: Clinical value of QMOT in ADR.

a Correlation between ADR and QMOT classification, assessed using Spearman’s rank correlation. The dotted lines represent an ADR of 25%, and the point on the curve with an ADR of 26.05% was used to define the optimal QMOT cut-off (90 s). b Comparison of ADR between the low-QMOT and high-QMOT groups using the chi-squared test. c Comparison of ADR between total withdrawal time subgroups (≥6 min vs. <6 min) within the high-QMOT group. d Association between risk factors and adenoma detection assessed using multivariate logistic regression analysis. Data are presented as ORs, 95% CIs, and P values. ADR adenoma detection rate, SCC Spearman’s rank correlation coefficient, QMOT qualified mucosal observation time.

The enrolled patients were divided into two groups: low QMOT (<90 s) and high QMOT (≥90 s). The baseline characteristics of the patients and endoscopists in the two groups are summarized in Supplementary Table 8. Compared with those in the low-QMOT group, colonoscopies in the high-QMOT group were more often performed in the early session of the half-day (73.08% vs. 60.43%, P = 0.009), by female endoscopists (60.26% vs. 47.24%, P = 0.010), and by physicians with ≥4000 colonoscopy cases (81.41% vs. 71.47%, P = 0.025) (Supplementary Tables 8 and 9). The ADR in the high-QMOT group was significantly higher than that in the low-QMOT group (36.54% vs. 19.94%, P < 0.001; Fig. 4b). Multivariate logistic regression showed that high QMOT was independently associated with adenoma detection (odds ratio (OR), 2.02; 95% CI, 1.23–3.33) and polyp detection (OR, 2.21; 95% CI, 1.41–3.48, Table 1). Notably, high QMOT was significantly correlated with the detection of diminutive (OR, 3.93; 95% CI, 2.09–7.39) and small adenomas (OR, 1.76; 95% CI, 1.02–3.03, Table 1). The association between high QMOT and ADR was consistent across the right, transverse, and left colon (Table 1). Subgroup analyses showed that the ADR in the high-QMOT group was markedly higher than that in the low-QMOT group across patient age and sex, examination time for colonoscopy, endoscopist sex and experience, and withdrawal time (Supplementary Tables 10–21 and Supplementary Figs. 4a–f and 5). Notably, the ADR of the high-QMOT group was significantly higher than that of the low-QMOT group in the subgroup with withdrawal time ≥6 min (41.91% vs. 22.09%, P = 0.006, Supplementary Table 21). We further compared the ADR between total withdrawal time subgroups (<6 min and ≥6 min) within the high-QMOT group: the ADR for total withdrawal time ≥6 min was higher than that for total withdrawal time <6 min (41.91% vs. 25.49%, P = 0.069), although the difference was not statistically significant (Fig. 4c and Supplementary Table 22). These results suggest that, with this threshold, QMOT may serve as an effective quality indicator for colonoscopy.

Table 1 Primary and secondary outcomes between low-QMOT and high-QMOT groups

Risk factors for adenoma detection

Finally, we conducted logistic regression analyses to identify variables associated with adenoma detection. Univariate logistic regression identified the following risk factors for adenoma detection: age ≥50 years (OR, 4.36; 95% CI, 2.51–7.59), male sex (OR, 1.94; 95% CI, 1.27–2.96), QMOT ≥90 s (OR, 2.31; 95% CI, 1.51–3.53), total withdrawal time ≥6 min (OR, 1.94; 95% CI, 1.28–2.93), and endoscopist experience of ≥4000 cases (OR, 1.88; 95% CI, 1.11–3.17, Supplementary Table 23). Multivariate logistic regression revealed the following independent risk factors for adenoma detection: age ≥50 years (OR, 4.39; 95% CI, 2.48–7.78), male sex (OR, 2.02; 95% CI, 1.29–3.16), and QMOT ≥90 s (OR, 2.02; 95% CI, 1.23–3.33) (Fig. 4d and Supplementary Table 23).

Discussion

In this study, we developed and evaluated the accuracy and effectiveness of the QAMaster for colonoscopic examinations. QAMaster leveraged the ViT models to provide real-time image quality assessment and landmark identification, thereby automatically calculating QMOT. ADR in the high-QMOT group was significantly higher than that in the low-QMOT group. Multivariate logistic regression analysis indicated that QMOT was a potentially more effective quality control indicator than the total withdrawal time.

Quality control during colonoscopy plays a crucial role in increasing the ADR while reducing the incidence of PCCRC5,6. Studies indicate that a more prolonged colonoscopy withdrawal time (8–13 min) allows for a higher ADR12,13,14,15,16. However, even during a colonoscopy with prolonged withdrawal time, adequate QMOT may not be achieved19,20. Calculating QMOT manually typically requires significant human resources, is challenging to implement broadly12,28, and can introduce high variability and inconsistency. QAMaster, based on AI algorithms, automated the calculation of QMOT and performed accurately and consistently across multiple still-image and video datasets, supporting high-quality mucosal observation during withdrawal.

Previous studies have used AI algorithms to record QMOT. For instance, Lux et al. developed an AI model to automatically calculate QMOT and found that the AI-predicted time had smaller absolute differences from the measured time than the human-reported time29. Similarly, Lui et al. developed an AI model to record QMOT by excluding the time of inadequate views (including water absorption, significant redness, incomplete inflation, light reflection, excrement blocking, and air bubble blocking), identifying four quintiles of QMOT, and demonstrating a positive correlation between adenoma detection and higher quintiles30. Compared with previous studies, our image quality assessment model classified images in greater detail, ensuring homogeneity within each category and enhancing the model’s effectiveness. We applied six distinct labels in this study, namely in vitro, non-informative, foreign body, intervention, defective, and qualified, each with predefined visual criteria. Although all images were annotated through majority agreement by three experienced endoscopists, the distinction between labels may vary slightly depending on human interpretation, which can influence the final QMOT measurement. In future work, we will further refine and standardize annotation protocols, possibly incorporating semi-automated labeling tools and multi-institutional validation, to improve reproducibility and generalizability. Despite this subjectivity, QAMaster demonstrated consistently high performance across both internal and external validation datasets, suggesting that the current annotation framework is sufficiently robust for a generalizable and effective QMOT assessment system.

Moreover, none of the previous studies identified a suitable QMOT threshold or investigated its clinical significance in real-world clinical settings. Our study addressed this gap by identifying a reasonable QMOT threshold that provides a significant improvement over previous methods. We found that QMOT ≥90 s was positively associated with adenoma detection (OR, 2.02; 95% CI, 1.23–3.33) in multivariate logistic regression. Subgroup analyses demonstrated the stability of the association between QMOT ≥90 s and ADR across patient age and sex, examination time for colonoscopy, endoscopist sex and experience, and withdrawal time. The established QMOT threshold offered several advantages. On one hand, the ADR in the high-QMOT group was significantly higher than that in the low-QMOT group, and QMOT ≥90 s was an independent risk factor for adenoma detection. When total withdrawal time, colonoscopy experience, and QMOT were included in the regression model, only QMOT ≥90 s remained significantly positively associated with ADR, even though the other factors were positively correlated with ADR in univariate analyses. This suggests that QMOT could be more important than total withdrawal time for quality control during colonoscopy. On the other hand, in the group with withdrawal time ≥6 min, the ADR was significantly higher in the high-QMOT subgroup than in the low-QMOT subgroup, further highlighting the importance of QMOT over total withdrawal time in improving ADR. Our current definition of QMOT is intentionally strict, as it was designed to capture only frames with stable, clear, and informative mucosal visualization while excluding frames affected by camera movement, non-informative content, foreign bodies, or transitional views. This conservative approach ensures high frame quality and consistency but may lead to relatively low QMOT values, as observed in our results. In future work, we will explore more flexible and dynamic criteria, such as criteria that tolerate slight movement or incorporate a continuous quality score, to better align the QMOT metric with practical endoscopic performance while maintaining its objectivity and reliability. Because of the strict definition of QMOT, endoscopists are required to maintain a relatively stable withdrawal or adopt a “withdrawal–stop to inspect–withdrawal” approach during the procedure to enable real-time application of QAMaster.

The AI architecture employed in QAMaster offers several advantages over traditional methods. In particular, the ViT models utilized in our system excel at processing complex visual data, enabling precise real-time assessment of image quality and landmark identification. Compared with several deep convolutional neural networks, the ViT models significantly enhanced the accuracy and robustness of the system, ensuring reliable performance across diverse clinical settings. The heat maps generated by QAMaster suggest that its strong performance can be attributed to the ability of ViT models to capture global information across the entire image27. This integration of AI into colonoscopy represents a notable advancement, promoting a higher ADR and more comprehensive mucosal inspection, and ultimately improving CRC screening and prevention.

Our study had certain limitations. First, the exclusion of frames containing foreign bodies or defects may lead to underestimation of potentially informative images, as such frames can still contribute to lesion detection in real-world clinical practice. In future work, we plan to quantify the proportion of the visual field obscured by foreign matter or defects and explore threshold values that do not compromise diagnostic accuracy. Moreover, our current model does not distinguish between standard-view and magnified-view modes, which may affect the granularity of time measurement for lesion search versus detailed inspection. We will incorporate view-mode differentiation, as previously reported30, in subsequent model iterations to further refine the assessment of effective mucosal observation time. Another limitation is that the system’s performance was based on data from Olympus endoscopes. Considering the widespread market share of Olympus endoscopes and the homogeneity of endoscopic images, QAMaster may be adapted to other endoscope systems through transfer learning, ensuring broader applicability. Furthermore, while QAMaster has demonstrated potential in improving ADR, its effect on the detection of high-risk precancerous lesions, long-term clinical outcomes, and cost-effectiveness remains to be validated31. We plan to conduct randomized controlled trials to comprehensively assess its impact on high-risk lesion detection and to further establish its clinical value in enhancing the quality and effectiveness of colonoscopy.

In conclusion, we proposed the QAMaster system, which would offer a more precise and effective method for colonoscopy quality control through automatic calculation of QMOT. The system addressed the limitations of total withdrawal time metrics and could provide a robust tool for enhancing the effectiveness of colonoscopic examinations. Future research should focus on broadening the validation of QAMaster across diverse clinical settings and exploring its long-term impact on patient outcomes.

Methods

Study design and datasets

A stepwise AI validation study was conducted in two hospitals, namely Nanjing Drum Tower Hospital (NJDTH) and Nanjing Gaochun People’s Hospital (GCPH) in China. QAMaster consisted of two models, namely Model I for image quality assessment and Model II for anatomical landmark identification. First, we retrospectively collected 93,726 images from NJDTH, which were randomly divided into a training dataset with 57,235 images (61.07%) from 64 patients, a validation dataset with 17,024 images (18.16%) from 16 patients, and an internal testing dataset with 19,467 images (20.77%) from 20 patients to develop and validate Model I. Similarly, 11,167 images from NJDTH were randomly split into a training dataset with 7712 images (69.06%) from 3013 patients, a validation dataset with 1704 images (15.26%) from 755 patients, and an internal testing dataset with 1751 images (15.68%) from 20 patients to develop and validate Model II. The inclusion criteria for the retrospective datasets were as follows: (1) colonoscopy examinations conducted at one of the two institutions between January 2021 and June 2021; (2) availability of clinical information, images, and videos at the time of diagnosis; and (3) all endoscopic procedures performed with the Olympus 290 series system (Olympus Optical, Tokyo, Japan). The exclusion criteria were: (1) inflammatory bowel disease, (2) familial polyposis syndrome, and (3) history of colorectal surgery.

Subsequently, we collected three independent datasets each for further validation of Model I (retrospective external testing dataset 1: 19,328 images of 20 patients from GCPH; prospective testing dataset 1: 24,568 images of 20 patients from NJDTH; and video testing dataset 1: 10 videos of 10 patients from NJDTH) and Model II (retrospective external testing dataset 2: 1657 images of 20 patients from GCPH; prospective testing dataset 2: 1552 images of 20 patients from NJDTH; and video testing dataset 2: 100 videos of 100 patients from NJDTH) (Supplementary Fig. 6 and Supplementary Table 24). The training, validation, and testing datasets were divided at the patient level to ensure their independence. Finally, a prospective observational study was conducted to evaluate the clinical significance of QMOT in real-world clinical scenarios.

The study protocol was reviewed and approved by the Medical Ethics Committee of Nanjing Drum Tower Hospital (no. 2023-175-02). Written informed consent was obtained from all the prospectively recruited patients. For retrospectively collected data, the ethics committee waived the requirement for informed consent since only de-identified information was used. All procedures adhered to the principles of the Declaration of Helsinki.

Training and testing datasets

To generate the training and testing datasets, both still images and videos from the enrolled patients were utilized. All videos, except those designated for video testing, were first converted into frames, and the irrelevant borders were cropped to retain the internal view of the colon. OpenCV’s perceptual hash (pHash) algorithm was then employed to identify and eliminate duplicate images, thereby reducing redundancy.
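As an illustration (not the authors’ code), a minimal version of this preprocessing pipeline might look as follows; the crop box, Hamming-distance threshold, and file path are hypothetical, and the pHash implementation lives in OpenCV’s contrib img_hash module.

```python
# Minimal sketch: frame extraction, border cropping, and pHash deduplication.
# Requires opencv-contrib-python for the cv2.img_hash module.
import cv2

def extract_frames(video_path, crop_box=(60, 20, 1220, 1060), max_hamming=0):
    """Yield cropped, non-duplicate frames from a withdrawal video."""
    hasher = cv2.img_hash.PHash_create()      # 64-bit perceptual hash
    seen = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x0, y0, x1, y1 = crop_box             # drop borders, keep the colon view
        frame = frame[y0:y1, x0:x1]
        h = hasher.compute(frame)
        # frames within max_hamming bits of a stored hash count as duplicates
        if any(hasher.compare(h, prev) <= max_hamming for prev in seen):
            continue
        seen.append(h)
        yield frame
    cap.release()
```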

For the training and testing of Model I, the following datasets were used (Supplementary Fig. 6):

(1) Training dataset: 3926 in vitro images, 20,439 non-informative images, 11,207 foreign body images, 3396 intervention images, 13,644 defective images, and 4623 qualified images from 64 patients at NJDTH.

(2) Validation dataset: 731 in vitro images, 6595 non-informative images, 2846 foreign body images, 1366 intervention images, 4186 defective images, and 1300 qualified images from 16 patients at NJDTH.

(3) Internal testing dataset: 1191 in vitro images, 7000 non-informative images, 4144 foreign body images, 1257 intervention images, 4511 defective images, and 1364 qualified images from 20 patients at NJDTH.

(4) External testing dataset: 688 in vitro images, 5736 non-informative images, 3931 foreign body images, 1735 intervention images, 4451 defective images, and 2788 qualified images from 20 patients at GCPH.

(5) Prospective testing dataset: 4451 in vitro images, 7803 non-informative images, 4444 foreign body images, 848 intervention images, 4001 defective images, and 2921 qualified images from 20 patients at NJDTH.

(6) Video testing dataset: 10 videos from 10 patients, randomly selected from the prospective testing dataset at NJDTH, used to evaluate the consistency between Model I and endoscopists in the real-time recognition of each label.

For the training and testing of Model II, the following datasets were used:

(1) Training dataset: 4668 cecum images and 4748 other-site images from 3768 patients at NJDTH.

(2) Internal testing dataset: 763 cecum images and 988 other-site images from 20 patients at NJDTH.

(3) External testing dataset: 644 cecum images and 1013 other-site images from 20 patients at GCPH.

(4) Prospective testing dataset: 599 cecum images and 953 other-site images from 20 patients at NJDTH.

(5) Video testing dataset: 100 videos from 100 patients, randomly selected from the dataset established in the prospective observational study, used to evaluate the correlation between the withdrawal time predicted by QAMaster and that determined by endoscopists.

All training, validation, and testing datasets were divided at the patient level to ensure their independence; a sketch of such a split is shown below.
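As an illustration only, a patient-level split can be obtained with scikit-learn’s GroupShuffleSplit; the annotations.csv file and its column names are hypothetical.

```python
# Patient-level split sketch: no patient appears in more than one subset.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

images = pd.read_csv("annotations.csv")       # columns: path, label, patient_id
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(images, groups=images["patient_id"]))
train_df, test_df = images.iloc[train_idx], images.iloc[test_idx]
# sanity check: the two subsets share no patients
assert set(train_df["patient_id"]).isdisjoint(set(test_df["patient_id"]))
```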

Model design and training

QAMaster consists of two deep learning models, both based on ViT27. QAMaster preprocesses each image by resizing and then normalizing it. Each RGB image is cropped and split into a batch of 16 × 16-pixel non-overlapping patches before being fed into the ViT architecture. To retain the maximum amount of visual information, all colonoscopy images were preserved in their original three-channel RGB format throughout the preprocessing and training pipeline; no conversion to grayscale or other color spaces was performed. This decision was motivated by the fact that color cues can be clinically relevant in endoscopic imaging and may improve feature extraction, particularly for differentiating mucosal texture, vascular patterns, and artifacts. Once normalized, the images are partitioned into distinct, non-overlapping windows, and the content of these windows is converted into token embeddings. To provide spatial context, these embeddings are complemented with positional embeddings32. We employed standard learnable 1D positional embeddings, consistent with the original ViT design: each patch embedding was augmented with a positional embedding of the same dimensionality before being fed into the Transformer encoder. These positional embeddings were initialized as trainable parameters and added element-wise to the sequence of patch embeddings. This design allows the model to capture spatial information across image patches, which is critical for accurate visual representation and downstream classification. The refined representations then proceed to the deeper layers of the model, enabling sophisticated visual analysis. All model parameters were pretrained on ImageNet before being fine-tuned on our collected dataset. To enhance the model’s robustness, we applied mixup augmentation during training33. To improve training efficiency, an increasing number of training images was used to determine the optimal sample size of the training datasets. An early stopping strategy was used to prevent overfitting: after each epoch, the model’s performance on the validation set was evaluated, and if no improvement in validation accuracy was observed for five consecutive epochs, training was terminated. The model weights from the epoch with the highest validation accuracy were retained for final evaluation on the test set. This strategy ensured robust generalization while avoiding unnecessary over-training. The stability of the models was evaluated with five-fold cross-validation and different random seeds. Specific training details are provided in Supplementary Note 1, Supplementary Figs. 7 and 8, and Supplementary Tables 25–28. Both models were trained and evaluated using four GeForce RTX 2080 Ti graphics processing units (NVIDIA Corporation, Santa Clara, California, USA). The optimal hyperparameters were selected using the validation datasets.
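The sketch below condenses this recipe (an ImageNet-pretrained ViT fine-tuned with mixup and accuracy-based early stopping) for illustration; the timm model name, hyperparameters, and the train_loader/val_loader objects are assumptions, not the exact configuration used in the study.

```python
# Illustrative fine-tuning loop: pretrained ViT, mixup, early stopping.
import copy
import numpy as np
import timm
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=6).to(device)   # six quality labels
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def mixup(x, y, alpha=0.2, num_classes=6):
    """Blend image pairs and their one-hot labels (mixup augmentation)."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0), device=x.device)
    y = F.one_hot(y, num_classes).float()
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

best_acc, best_state, patience, bad_epochs = 0.0, None, 5, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:                 # assumed DataLoader of (image, label)
        x, y = x.to(device), y.to(device)
        x, y_soft = mixup(x, y)
        # soft-target cross-entropy over the mixed labels
        loss = (-y_soft * F.log_softmax(model(x), dim=-1)).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:               # assumed validation DataLoader
            pred = model(x.to(device)).argmax(dim=-1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    acc = correct / total
    if acc > best_acc:                        # keep the best validation epoch
        best_acc, best_state, bad_epochs = acc, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stop after 5 epochs w/o improvement
            break
model.load_state_dict(best_state)             # weights for final test evaluation
```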

Evaluation of QAMaster performance

The performance of Model I was assessed on the internal, external, and prospective testing datasets. For each class, multi-label classification was evaluated using the confusion matrix, accuracy, sensitivity, specificity, precision, and F1-score at binary decision thresholds. Receiver operating characteristic (ROC) analysis with area under the curve (AUC) calculation was performed, and macro-AUC and micro-AUC were used to assess the aggregate performance of multi-label classification. Five-fold cross-validation and different random seeds were also used to evaluate the performance of Model I. The 95% confidence intervals (CIs) of accuracy, sensitivity, specificity, and precision were computed with the Clopper–Pearson method, and the 95% CIs of AUC and F1-score were calculated using bootstrapping with 1000 resamples.
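For illustration (not the study’s code), these interval estimates can be computed as follows; y_true and y_score are hypothetical NumPy arrays of per-image binary labels and model scores.

```python
# Clopper-Pearson CI for proportions and a 1000-resample bootstrap CI for AUC.
import numpy as np
from scipy.stats import beta
from sklearn.metrics import roc_auc_score

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for a binomial proportion (k successes of n trials)."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:    # a resample needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```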

Model II was tested on the internal, external, and prospective testing datasets, and its performance was evaluated with the confusion matrix, AUC, accuracy, sensitivity, specificity, precision, and F1-score with 95% CIs.

The metrics were calculated using the following formulas:

$${\rm{Accuracy}}=\,\frac{{\rm{true\; positive\; images}}\left({\rm{cases}}\right)+{\rm{true\; negative\; images}}({\rm{cases}})}{{\rm{total\; images}}({\rm{cases}})}$$
(1)
$${\rm{Sensitivity}}=\,\frac{{\rm{true\; positive\; images}}({\rm{cases}})}{{\rm{true\; positive\; images}}\left({\rm{cases}}\right)+{\rm{false\; negative\; images}}({\rm{cases}})}$$
(2)
$${\rm{Specificity}}=\,\frac{{\rm{true\; negative\; images}}({\rm{cases}})}{{\rm{true\; negative\; images}}\left({\rm{cases}}\right)+{\rm{false\; positive\; images}}({\rm{cases}})}$$
(3)
$${\rm{Precision}}=\,\frac{{\rm{true\; positive\; images}}({\rm{cases}})}{{\rm{true\; positive\; images}}\left({\rm{cases}}\right)+{\rm{false\; positive\; images}}({\rm{cases}})}$$
(4)
$${\rm{F}}1-{\rm{score}}=\,\frac{2\times {\rm{Precision}}\times {\rm{Sensitivity}}}{{\rm{Precision}}+{\rm{Sensitivity}}}$$
(5)
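A minimal helper computing Eqs. (1)–(5) from per-class (one-vs-rest) counts, shown for illustration only:

```python
# Eqs. (1)-(5) from true/false positive and negative counts for one class.
def classification_metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)                  # Eq. (1)
    sensitivity = tp / (tp + fn)                                   # Eq. (2)
    specificity = tn / (tn + fp)                                   # Eq. (3)
    precision   = tp / (tp + fp)                                   # Eq. (4)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # Eq. (5)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}
```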

Procedures

To establish Model I for image quality assessment, colonoscopy images were annotated with the following six labels: (1) in vitro: images captured outside the body; (2) non-informative: images captured too close to the colon wall, out of focus, containing numerous artifacts, or over- or under-exposed; (3) foreign body: images containing bubbles, fluid, or fecal material; (4) intervention: images showing flushing, chromoendoscopy, biopsy, or any other instruments; (5) defective: images with a small number of artifacts, slightly blurry areas, or mild overexposure but still showing part of the mucosa; and (6) qualified: images with clearly visible mucosa or vessels, moderate lighting, and no artifacts17. For Model II (anatomical landmark identification), images were annotated as showing either the cecum (ileocecal valve, appendiceal orifice, and cecal caput) or other sites6. All images were annotated by three endoscopists, each with >10 years of experience in colonoscopy, and only images for which at least two endoscopists reached a consensus were included.

To test QAMaster, images from the internal, external, and prospective testing datasets were first used to evaluate the classification performance of Models I and II. Subsequently, videos from video testing dataset 1 were used to analyze the accuracy of Model I in calculating the duration of the different labels in real time. Videos from video testing dataset 2 were used to assess the performance of QAMaster in calculating the withdrawal time (Supplementary Movies 1 and 2).

QMOT was calculated using QAMaster as follows: the first cecum frame identified by Model II determined the start time of colonoscopy withdrawal, and the in vitro frames recognized by Model I determined the end time of withdrawal. The total numbers of withdrawal frames and qualified frames during the withdrawal procedure were then calculated using Model I.

$${\rm{The\; total\; withdrawal\; time}}={\rm{the\; end\; time\; of\; withdrawal}}-{\rm{the\; start\; time\; of\; withdrawal}}-{\rm{time\; of\; the\; intervention}}$$
(6)

$${\rm{QMOT}}=\,\frac{{\rm{qualified\; frames}}}{{\rm{total\; withdrawal\; frames}}}\times {\rm{the\; total\; withdrawal\; time}}$$
(7)

To facilitate the utilization of QMOT, every 30 s of QMOT was treated as a class (i.e., 0–30 s corresponded to a QMOT category of 30 s, 30–60 s to a category of 60 s, and so on).
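Expressed in code for illustration, Eqs. (6) and (7) and the 30-s binning might be implemented as below; the per-frame label names, frame rate, and boundary handling are assumptions.

```python
# QMOT from per-frame model outputs (Eqs. (6)-(7)) plus the 30-s binning.
import math

def qmot_seconds(labels, cecum_idx, in_vitro_idx, fps, intervention_s):
    """labels: Model I label per frame; indices mark withdrawal start/end."""
    withdrawal = labels[cecum_idx:in_vitro_idx]
    total_withdrawal_s = len(withdrawal) / fps - intervention_s    # Eq. (6)
    qualified = sum(1 for lbl in withdrawal if lbl == "qualified")
    return qualified / len(withdrawal) * total_withdrawal_s       # Eq. (7)

def qmot_category(qmot_s):
    """Map QMOT to 30-s classes: 0-30 s -> 30, 30-60 s -> 60, and so on."""
    return 30 * max(1, math.ceil(qmot_s / 30))
```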

To evaluate the clinical value of QMOT, we conducted a single-center prospective cohort study at NJDTH. We recruited consecutive patients aged 18–75 years who provided informed consent to undergo colonoscopy between July 1 and October 15, 2023. The exclusion criteria were inflammatory bowel disease, familial polyposis syndrome, history of colorectal surgery, known or suspected bowel obstruction or perforation, and pregnancy or lactation.

Data collection

Basic demographic characteristics of the patients, including age, sex, indication for colonoscopy, type of sedation, recruitment of patients, and time of colonoscopy, were recorded before colonoscopy. Time of colonoscopy was divided into two groups according to the end time of the procedure: the early group comprised procedures in the early session of each half day (8:00 AM to 10:59 AM or 1:00 PM to 3:29 PM), and the late group comprised procedures in the late session of each half day (11:00 AM to 12:59 PM or 3:30 PM to 5:29 PM). The Boston Bowel Preparation Scale (BBPS) score was used to assess bowel preparation34. Information regarding endoscopist age, sex, years of practice, experience, and colonoscopies per year was also obtained. The withdrawal time was defined as the time from the cecum to the anus, excluding intervention time (polyp resection, biopsy, mucosal cleaning, and observation using chromoendoscopy), and was automatically computed by QAMaster. After the colonoscopy, whole-procedure videos were collected for subsequent analysis, and two endoscopists rechecked the cecal intubation time and withdrawal time reported by QAMaster.

Outcomes

The primary outcomes for the performance of QAMaster were AUC, accuracy, sensitivity, and specificity. For the prospective observational study, the primary outcome was the ADR, defined as the proportion of patients with one or more histologically confirmed adenomas. Secondary outcomes included the ADR for adenomas of different sizes (diminutive, ≤5 mm; small, >5 to <10 mm; large, ≥10 mm) and locations (the right colon defined as the cecum to the ascending colon, the transverse colon as the hepatic flexure to the splenic flexure, and the left colon as the descending colon to the rectum), the polyp detection rate for polyps of different sizes and locations, the advanced ADR, and the sessile serrated lesion detection rate in patients above and below the determined QMOT threshold. Advanced adenomas were defined as those with a size of ≥10 mm, a villous component, or high-grade dysplasia. Polyps were diagnosed based on endoscopic findings, and adenomas and sessile serrated lesions were diagnosed based on the World Health Organization criteria and the pathology reports from NJDTH.

Statistical analysis

We analyzed the first 250 patients recruited in the prospective study to estimate the sample size. A sample size of 466 was required for a two-sided 95% CI with a width of 0.1, given the Spearman’s rank correlation of 0.715 observed in these 250 patients (PASS, version 15.0.5). Considering that approximately 5% of patients might be excluded from the analysis, the target sample size was set to 492. Variables were compared using the chi-square test or Fisher’s exact test for categorical variables and Student’s t test for continuous variables. We calculated the ADR of each QMOT class and performed correlation analysis using Spearman’s rank correlation. Logistic regression was used to evaluate the association between adenoma detection and QMOT above or below the determined threshold, after adjusting for patient age and sex, indication for colonoscopy, use of analgesia, examination time for colonoscopy, bowel preparation, type of instrument, withdrawal time, endoscopist age and sex, years of practice, endoscopist experience, and colonoscopies per year. All statistical tests were two-sided, and P < 0.05 was considered statistically significant. All analyses were conducted using R software (version 4.3.3).
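Although the analyses were performed in R, an equivalent Python sketch of the adjusted logistic regression (with hypothetical column names and only a subset of the covariates listed above) is:

```python
# Adjusted logistic regression for adenoma detection; ORs with 95% CIs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cohort.csv")   # hypothetical one-row-per-patient table
fit = smf.logit(
    "adenoma_detected ~ qmot_ge_90s + age_ge_50 + male"
    " + withdrawal_ge_6min + endoscopist_cases_ge_4000",
    data=df,
).fit()
or_table = np.exp(pd.concat([fit.params, fit.conf_int()], axis=1))
or_table.columns = ["OR", "2.5%", "97.5%"]
print(or_table)                  # e.g., the OR for qmot_ge_90s with its 95% CI
```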