Introduction

Early disease screening is crucial for improving patient prognosis and reducing healthcare costs1,2, as detecting diseases at an early, more treatable stage can significantly decrease morbidity and mortality rates3,4. This is particularly true for colorectal lesions, including precancerous lesions and colorectal cancer (CRC), where early detection through screening enables timely intervention and more favorable prognoses4,5,6,7. Endoscopy, as a non-invasive and highly accurate diagnostic tool, allows direct visualization of the gastrointestinal tract8. It is instrumental in identifying precancerous lesions and early-stage cancers, facilitating timely treatment. Endocytoscopy (EC) is an advanced endoscopic system that enables the observation of cellular atypia and glandular structures in ultra-high-magnification (e.g., ×520) images after appropriate staining9,10. Additionally, EC combined with narrow-band imaging (EC-NBI) enables visualization of microvessels on the surface of colorectal lesions without staining, thereby avoiding the potential health risks associated with the staining process. Studies have demonstrated that EC-NBI predicts the histopathological diagnosis of colorectal lesions with efficacy comparable to EC with staining11. Compared with non-magnified NBI or magnifying NBI (ME-NBI, with magnification of approximately ×80–100), the ultra-high magnification of EC-NBI permits more precise observation of surface vascular patterns, facilitating more accurate prediction of colorectal lesion histopathology. However, accurate EC diagnosis based on surface microvessels depends heavily on the experience of the endoscopist, making it challenging for novices to achieve sufficient diagnostic accuracy12.

To address this issue, computer-aided diagnosis (CAD) based on EC-NBI has been developed to help endoscopists attain results comparable to those of experienced professionals13,14,15. For clinical application, the existing CAD model serves as a binary classifier: it can distinguish non-neoplastic colorectal lesions from neoplastic ones, but it cannot make the finer distinction between adenomas and invasive cancers12. In clinical practice with EC-NBI, it is difficult for endoscopists to distinguish adenomas, particularly high-grade adenomas, from invasive cancers15. There is therefore an urgent need for a three-class CAD model that provides three predicted outcomes: non-neoplastic lesions, adenomas, and invasive cancers.

Deep learning has emerged as an extremely powerful approach for extracting valuable information from medical images and has shown tremendous promise in diagnostic prediction, particularly for digestive diseases16,17,18. In contrast to the traditional machine learning methods previously used for EC diagnosis12, deep learning eliminates the need for manual feature extraction19. By adaptively learning feature extraction from extensive training datasets in an end-to-end manner, deep learning models can achieve far superior disease prediction compared with traditional machine learning approaches.

In this work, we develop a data-driven three-class CAD model for colorectal lesion diagnosis from EC-NBI images, leveraging deep learning techniques. Our model demonstrates high accuracy in classifying non-neoplastic lesions, adenomas, and invasive cancers. The distinguishing feature of our method is the integration of masked autoencoders20, which introduces significant improvements. Encouraged by the success of the multi-stage continuing pre-training (CPT) strategy for large language models, we mimic that process and pre-train the model with different datasets, which ultimately improves its recognition of colorectal lesions. Specifically, in the first stage of generalized network pre-training, we conduct unsupervised pre-training on a vast number of endoscopic images. In the second stage of in-domain CPT, we continue pre-training the model on an appropriate amount of unlabeled EC-NBI data in an unsupervised manner to further enhance its feature extraction ability. In the final supervised fine-tuning (SFT) stage, we introduce the concept of supervised deep clustering (SDC), which uses expectation maximization (EM) to improve downstream classification accuracy through feature clustering. After training, we validated the performance of our model in a multi-center cohort. From diverse quantitative indicators to human-machine competitions, the performance of our model comprehensively outperforms that of other advanced models and even exceeds the diagnostic performance of endoscopists. The successful implementation of our model not only enhances the diagnostic efficiency of endoscopists but also, to a certain degree, propels endoscopic diagnosis toward the level of real-time biopsy diagnosis.

Results

Patient characteristics

For the training and internal validation of the proposed model, a total of 32,142 EC-NBI images from 723 patients with colorectal lesions were collected in this retrospective study at the First Hospital of Jilin University (Changchun, China) (see Fig. 1). These data were divided into a training cohort (8615 EC-NBI images from 406 lesions of 315 patients) and an internal validation cohort (23,527 EC-NBI images from 498 lesions of 408 patients), with a specific time point serving as the demarcation. For the generalization verification of the model, a total of 2749 EC-NBI images from 152 lesions of 90 patients with colorectal lesions were collected as an external validation cohort at the MEIHEKOU Center Hospital (Meihekou, China). The statistical characteristics of patients and lesions in each cohort are presented in Tables 1 and 2, respectively. Detailed data can be found in Supplementary Data S1 and S2.

Fig. 1
figure 1

Flowchart of the dataset.

Table 1 The basic characteristics of the patients in training and validation cohorts
Table 2 The basic characteristics of the lesions in training and validation cohorts

Development of the proposed deep learning-based model

The overall study design is presented in Fig. 2. A deep learning model was trained with a multi-stage CPT strategy and validated on a multi-center cohort of EC-NBI images to provide three predicted outcomes: non-neoplastic lesions, adenomas, and invasive cancers. We conducted large-scale generalized unsupervised pre-training on almost 480,000 endoscopic images compiled from several public datasets21,22,23,24,25,26,27,28,29,30,31. These datasets comprise colonoscopy, gastroscopy, and a limited number of esophageal images and videos, primarily in white-light modality with some fluorescence and narrow-band imaging; from the videos, we extracted one frame per second to enrich the dataset. We then performed unsupervised CPT on unlabeled EC-NBI images. Finally, using the ground-truth labels, we fine-tuned the EC-NBI image classification model in a supervised manner.

Fig. 2: Research flowchart.
figure 2

a Study design and the flowchart of the multi-stage continuing pre-training (CPT) strategy. b The model evaluation via different metrics.

The proposed model makes accurate diagnoses at the image level

Given that each lesion is associated with several EC-NBI images, we evaluated the model’s performance at both the EC-NBI image level and the lesion level. As depicted in Fig. 3, the proposed deep learning model classifies the three classes with remarkable accuracy in both the training and validation cohorts at the image level. In the internal validation cohort, the area under the curve (AUC) is 0.939 (95% confidence interval (CI): 0.936–0.942) for non-neoplastic lesions, 0.917 (0.914–0.921) for adenomas, and 0.969 (0.967–0.972) for invasive cancers. In the external validation cohort, the AUC is 0.837 (0.823–0.851) for non-neoplastic lesions, 0.827 (0.813–0.841) for adenomas, and 0.908 (0.897–0.919) for invasive cancers (Fig. 3a). Furthermore, the confusion matrices for the training cohort and both the internal and external validation cohorts provide a more detailed view of the model’s performance (Fig. 3b).

Fig. 3: The performance indicators of the model at different levels in both the independent internal and external validation cohort.
figure 3

At the image level, we quantitatively assess the model performance via the receiver operating characteristic (ROC) curves (a) and confusion matrices (b) of the training and validation cohorts. Further, through the T-SNE method, we reduce the dimensionality of the sample features and visualize them, so that the clustering between different classes can be observed (c); the corresponding silhouette scores are quantitatively calculated. The same calculations and visualizations are displayed at the lesion level (d–f).

Typically, the performance of a disease model depends on the separability of images within the feature space. We first extract the feature vector produced by the model for each sample. We then apply T-SNE for dimensionality reduction to visualize the aggregation of features and calculate the corresponding silhouette score, which quantitatively reflects the quality of the clustering: the larger the value, the better the separability of the features. Specifically, in Fig. 3c, we obtain a silhouette score of 0.4724 in the training cohort, 0.4284 in the internal validation cohort, and 0.4622 in the external validation cohort.
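A minimal sketch of this separability analysis is given below, using scikit-learn’s T-SNE and silhouette score. The names `model`, `model.encoder`, and `val_loader` are placeholders, not the authors’ code, and whether the score is computed on the raw features or on the 2-D embedding is not specified in the text; the sketch uses the embedding.

```python
# Sketch: T-SNE visualization and silhouette score of per-image features.
import numpy as np
import torch
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

@torch.no_grad()
def collect_features(model, loader, device="cuda"):
    feats, labels = [], []
    for images, targets in loader:
        f = model.encoder(images.to(device))   # assumed encoder output
        feats.append(f.flatten(1).cpu().numpy())  # one vector per image
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

features, labels = collect_features(model, val_loader)
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
print("silhouette score:", silhouette_score(embedded, labels))
```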

The proposed model makes accurate diagnoses at the lesion level

We further evaluated the performance of our model at the lesion level. The final prediction for a lesion is determined through voting among the images of that lesion, and the lesion feature is computed by averaging all the related image features. As a result, a small number of inaccurately predicted images can be disregarded, leading to marginally higher performance at the lesion level than at the image level (Fig. 3d, e). This can also be verified intuitively from the T-SNE feature visualization (Fig. 3f). Specifically, we obtain a silhouette score of 0.6914 in the training cohort, 0.8116 in the internal validation cohort, and 0.6673 in the external validation cohort. These silhouette scores are considerably higher than those obtained at the image level (Fig. 3c), indicating that the voting mechanism makes features more separable at the lesion level. Note that our method underperformed on the external validation set compared with the internal validation set at both the image and lesion levels. A potential explanation for this discrepancy is the inconsistency in imaging devices between the external and internal cohorts: variations in image quality (e.g., resolution, lighting, and noise profiles) caused by different devices lead to a mismatch between the external data distribution and the training distribution, thereby diminishing the model’s generalization performance.
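The voting mechanism itself is simple; the following sketch shows one possible implementation with hypothetical names (the class coding and feature dimension are illustrative, not taken from the paper).

```python
# Sketch: lesion-level prediction by majority vote over per-image predictions,
# with the lesion feature taken as the mean of per-image features.
from collections import Counter
import numpy as np

def lesion_level_prediction(image_preds, image_feats):
    """image_preds: per-image class labels for one lesion;
    image_feats: (n_images, d) array of per-image features."""
    majority_class = Counter(image_preds).most_common(1)[0][0]  # vote
    lesion_feature = image_feats.mean(axis=0)                   # averaged feature
    return majority_class, lesion_feature

# Example: five images of one lesion; one mispredicted image is outvoted.
preds = [1, 1, 2, 1, 1]            # illustrative coding: 1 = adenoma
feats = np.random.rand(5, 768)
cls, feat = lesion_level_prediction(preds, feats)
assert cls == 1
```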

The proposed model outperforms traditional deep learning approaches

We systematically evaluated various state-of-the-art (SOTA) deep learning models, including ResNeSt32, ViT-B/1633, ConvNeXt34, MambaOut35, and MaxViT36. As shown in Fig. 4, at both the image level and the lesion level, some metrics of specific methods are on a par with ours or slightly superior for certain disease categories (for example, at the image level MaxViT surpasses our model in PPV for invasive cancers, and at the lesion level its sensitivity for invasive cancers is marginally better than ours). However, taking the area enclosed by the five indicators as the comprehensive performance measure (reported under each radar chart in Fig. 4), the area of our method is optimal in most cases; this can also be observed from the fact that the blue lines in Fig. 4 lie essentially on the outermost periphery. Detailed quantitative metrics are provided in Tables S1 and S2 in the Supplementary Information File.

Fig. 4: Comparison experiments of different methods.
figure 4

a At the image level, the performance of different comparison methods on different disease types in the training cohort. b At the image level, the performance of different comparison methods on different disease types in the internal validation cohort. c At the image level, the performance of different comparison methods on different disease types in the external validation cohort. d At the lesion level, the performance of different comparison methods on different disease types in the training cohort. e At the lesion level, the performance of different comparison methods on different disease types in the internal validation cohort. f At the lesion level, the performance of different comparison methods on different disease types in the external validation cohort. Note: the closer to the periphery, the better the performance. The numerical comparison below each radar chart represents the area enclosed by the five indicators under the different methods; the larger the area, the higher the comprehensive performance of the method.

Human-machine collaboration can improve the accuracy of endoscopists

To further validate the clinical utility of our model, we evaluated the capacity of the deep learning model to enhance endoscopists’ performance in disease prediction with and without the assistance of artificial intelligence (AI). To illustrate the influence of the model on individualized assessment, four endoscopists independently performed the histopathological diagnosis of each lesion in both the internal and external validation cohorts. Subsequently, on the same day, they re-evaluated their diagnoses after considering the predictions of our trained AI model. The evaluation metrics encompass accuracy, recall, F1-score, specificity, sensitivity, and positive and negative predictive values (PPV and NPV). The results are presented in Fig. 5.

Fig. 5: Human-machine collaboration.
figure 5

Performance comparison between our model and endoscopists, as well as the improvement in endoscopists’ predictions with AI assistance. “(AI)” indicates that the endoscopist was assisted by AI. The solid dark-red curves represent AI-assisted endoscopist diagnoses, the dashed light-red curves represent endoscopist-only diagnoses, and the solid blue curves represent AI-only diagnoses (our method).

As depicted in Fig. 5, the diagnostic performance of all four endoscopists improved measurably with AI assistance on the internal validation dataset. This is substantiated by the observation that the solid dark-red curves (AI-assisted endoscopist diagnoses) consistently encompass the dashed light-red curves (baseline endoscopist performance) across all diagnostic dimensions. Furthermore, the AI-only method demonstrated superior predictive accuracy compared with AI-assisted endoscopists in the vast majority of evaluated metrics across the three disease categories, the exception being the third endoscopist, whose sensitivity for invasive carcinoma marginally exceeded that of the AI-only approach.

External validation revealed analogous trends: AI assistance enhanced diagnostic discrimination for all endoscopists. However, both the AI-only system and the endoscopists exhibited modest declines across all metrics, attributable to distributional shifts in the external dataset. Notably, while the AI-only approach outperformed the first and second endoscopists (junior clinicians with 3 years and 1 year of endoscopic experience, respectively), the third and fourth endoscopists (senior clinicians with 9 and 7 years of experience) achieved comparable or marginally superior diagnostic accuracy on specific metrics. This discrepancy likely stems from the differential integration of clinical expertise: senior endoscopists’ experiential knowledge compensated for dataset heterogeneity more effectively than that of their junior counterparts. (Detailed experimental results are presented in Supplementary Data S3 and S4.)

Interpretation of the proposed deep learning-based model

A deep learning-based classification model typically consists of a feature extractor and a classifier, and such models are often difficult to interpret. On the one hand, the extracted features are not interpretable in clinical practice; on the other hand, because of classifier errors, there is a certain perturbation between the features and the final prediction, meaning that good features do not necessarily yield correct prediction results. To address the uninterpretability of the extracted features, we employed the Gradient-weighted Class Activation Mapping (Grad-CAM) method to generate class activation maps, which show which image regions play a key role in the model’s output for a particular category, using features from different layers of our model (Fig. 6a). From these, we can see that the model mainly focuses on the blood vessels, which is consistent with clinical knowledge. We also output the class activation maps of every transformer block in our model; as the blocks deepen, the model’s region of interest gradually converges to the vascular area in the EC images (see Fig. S1 in the Supplementary Information File). By perturbing the labels, we can see that the class activation maps change accordingly: for the non-neoplastic category, the model mainly focuses on fuzzy microvessels with unclear margins; for the adenoma category, on dense, regular micronetworks with clear margins; and for the invasive cancer category, on irregular, thick, and sparse microvessels in the EC images (see Fig. S2 in the Supplementary Information File).
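As a hedged illustration, Grad-CAM on a ViT-style encoder can be run with the open-source pytorch-grad-cam package (pip install grad-cam); the paper does not name its implementation, and the layer handle `model.blocks[-1].norm1` and the class index are assumptions.

```python
# Sketch: Grad-CAM on a transformer encoder via pytorch-grad-cam.
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def reshape_transform(tensor, h=14, w=14):
    # Drop the CLS token and restore the (C, H, W) layout Grad-CAM expects.
    result = tensor[:, 1:, :].reshape(tensor.size(0), h, w, tensor.size(2))
    return result.permute(0, 3, 1, 2)

target_layers = [model.blocks[-1].norm1]            # assumed last block handle
cam = GradCAM(model=model, target_layers=target_layers,
              reshape_transform=reshape_transform)
heatmap = cam(input_tensor=image_batch,             # (B, 3, 224, 224)
              targets=[ClassifierOutputTarget(2)])  # e.g., invasive cancer class
```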

Fig. 6: Model interpretability.
figure 6

a The class activation mapping of the model when predicting samples of different categories, obtained through Gradient-weighted Class Activation Mapping (Grad-CAM). b Bee swarm summary plot of feature importance based on Shapley Additive exPlanations (SHAP) analysis. The bee swarm plot displays an information-dense summary of how the top features in a dataset affect the model’s output. Each observation in the data is represented by a single dot on each feature row. The vertical axis represents the features, sorted from top to bottom according to their importance as predictors. The position of a dot on a feature row is determined by the SHAP value of the corresponding feature, and the accumulation of dots on each row illustrates density. The feature value determines the color of the dots, with red indicating high feature values and blue indicating low feature values. c Feature importance plot. Passing the SHAP value matrix to the bar plot function creates a global feature importance plot for each class, where the global importance of each feature for a class is the average absolute SHAP value of that feature over all given samples.

For the relationship between features and classification output, we employed an explanatory model named Shapley Additive exPlanations (SHAP)37,38, an additive feature-attribution method. It explains the model’s predicted values as the cumulative sum of attribution values assigned to each input feature. A large SHAP value highlights a feature’s strong influence on the classification prediction, enabling researchers and practitioners to better understand the model’s behavior and make more informed decisions. In this work, we computed the feature importance and impact using SHAP for each class and ranked the features in descending order of importance. The results are presented as a beeswarm summary plot (Fig. 6b) and a feature importance plot (Fig. 6c). For example, “Feature 520” has a significant impact on the prediction of non-neoplastic lesions, “Feature 385” on adenomas, and “Feature 340” on invasive cancers, which can be used to gauge the reliability of prediction results. In clinical practice, after endoscopists capture EC images containing potential abnormalities, heatmaps can be applied to enhance the interpretability of the model, providing visual explanations that support more accurate and reliable diagnostic assessments.
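A minimal sketch of one way such an analysis could be set up is shown below, applying SHAP’s KernelExplainer to the classifier head on top of precomputed feature vectors (so names like “Feature 520” index positions in that vector). The function `clf_fn`, the arrays `train_features`/`val_features`, and the background size are assumptions, not the authors’ pipeline; with multi-class output, `shap_values` is returned as one array per class.

```python
# Sketch: SHAP feature attribution for the classifier head over feature vectors.
import numpy as np
import torch
import shap

def clf_fn(x):                        # classifier head wrapped as a NumPy function
    with torch.no_grad():
        logits = classifier(torch.from_numpy(x).float())
        return torch.softmax(logits, dim=1).numpy()

background = shap.sample(train_features, 100)    # background distribution
explainer = shap.KernelExplainer(clf_fn, background)
shap_values = explainer.shap_values(val_features[:200])

# Beeswarm summary for one class (e.g., invasive cancers), as in Fig. 6b.
shap.summary_plot(shap_values[2], val_features[:200])
```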

Discussion

It is well known that EC offers the highest magnification among endoscopic imaging modalities, holding significant potential for achieving ‘optical biopsy’. From non-neoplastic lesions to adenomas and invasive cancers, the microvessels become increasingly thick and irregular as the grade of lesion dysplasia increases (Fig. 6a). Precise real-time endoscopic diagnosis of colorectal lesions is critical for endoscopists when selecting treatment strategies, yet it is difficult for endoscopists, especially novices, to make accurate diagnoses using EC-NBI. To address this issue, we collected approximately 480,000 publicly available endoscopic images to pre-train the foundation of the large model, and then collected 8615 EC-NBI images (from 315 patients) at the First Hospital of Jilin University for in-domain CPT and SFT. Ultimately, we developed a computer-aided diagnosis model based on EC-NBI for predicting the classification of colorectal lesions. In contrast to previous machine learning-based binary classification CAD models, our model effectively discriminates non-neoplastic lesions, adenomas, and invasive cancers, and can be integrated into the clinical workflow to provide real-time diagnostic assistance (an example of real-time diagnosis is given in Supplementary Movie 1). By rapidly determining lesion pathology, endoscopists can optimize treatment strategies and enhance clinical outcomes.

Approximately 90% of colorectal lesions detected during colonoscopy are diminutive (≤5 mm) or small (≤10 mm) polyps, primarily non-neoplastic with minimal malignant potential39. In line with the Preservation and Incorporation of Valuable Endoscopic Innovations (PIVI) guidelines introduced by the American Society for Gastrointestinal Endoscopy for optimal management, “resect-and-discard” and “leave-in-situ” strategies effectively reduce unnecessary pathological examinations and associated risks40. Generally, hyperplastic polyps (HPs), which account for the majority of non-neoplastic lesions, exhibit fine and fuzzy microvessels. While NBI, ME-NBI, and EC-NBI have all been shown to identify characteristic HPs effectively, EC-NBI has demonstrated the ability to distinguish diminutive HPs with reticular capillary patterns, which account for over 10% of HPs41; EC-NBI shows these HPs with unclear-margin microvessels, which can be challenging for endoscopists to discern. Our results demonstrate that our model differentiates these HPs better than endoscopists, and that endoscopists assisted by the CAD model classify lesions significantly more accurately (Fig. 5). Since about 80 to 95% of colorectal cancers develop from adenomas, distinguishing adenoma from invasive cancer is crucial because it directly determines the choice of endoscopic or surgical treatment. Accurate prediction of adenomas and invasive cancers requires an endoscopic diagnostic approach with high sensitivity and specificity. Our deep learning model demonstrated excellent diagnostic performance in predicting the different histopathologies of colorectal lesions: at the image level, the accuracy of predicting non-neoplastic lesions, adenomas, and invasive cancers is 0.904, 0.856, and 0.944 in the internal validation set, and 0.877, 0.853, and 0.975 in the external validation set, respectively.

In addition, the diagnostic accuracy of EC-NBI is significantly related to the experience of the endoscopist. A previous study showed that the accuracy of EC-NBI in diagnosing adenomas is sharply lower for non-expert endoscopists (63.3% [59.1%–67.5%]) than for expert endoscopists (84.2% [81.2%–86.8%])12. Our study likewise indicates a significant difference in EC-NBI diagnostic performance among endoscopists of different levels. Nevertheless, with the aid of our CAD model, the diagnostic accuracy of endoscopists improved. For instance, in the internal validation set, for non-neoplastic lesions the sensitivity remains unchanged while the specificity increases by approximately 4 percentage points; for adenomas, the specificity remains unchanged while the sensitivity increases by up to 5 percentage points; and for invasive cancers, all indicators improve further. Thus, the CAD model can significantly improve the accuracy and homogeneity of EC-NBI in the diagnosis of colorectal lesions and achieve the purpose of optical biopsy. Its use eliminates the potential risks associated with traditional biopsy, streamlines the time-consuming process of pathological diagnosis, and enables prompt diagnosis and removal of adenomas and colorectal cancers. Meanwhile, the strategy of observation for certain non-neoplastic lesions reduces costs and avoids the risks associated with polypectomy.

When developing the deep learning model, we trained a Transformer-based neural network using a multi-stage CPT strategy and demonstrated that this approach is superior to traditional neural networks and to models without multi-stage CPT in predicting EC-NBI diagnoses (Table S5 in the Supplementary Information File). Compared with traditional machine learning, our method can exploit large amounts of labeled and unlabeled clinical data and achieves far higher performance. Compared with traditional convolutional neural networks, our model is the first large model for EC-NBI, with more than 80 million parameters, which greatly boosts performance, as seen in Tables S1 and S2 in the Supplementary Information File. Pre-training on a large amount of unlabeled data provides a very robust starting point for subsequent SFT, enabling SFT to be completed in a few epochs. Additionally, during the brief SFT period, the supervised clustering loss greatly promotes the aggregation of samples in the feature space and improves the recognition ability of the model, as seen in Table S4 in the Supplementary Information File.

Although our method can significantly enhance the diagnostic performance of colorectal lesions with EC-NBI images and, to a certain extent, improve the discriminatory ability of diagnosticians at different levels, this study has several limitations. (1) The data are subject to several sources of bias. The training data come only from the First Hospital of Jilin University, leading to an institution-specific distribution, and the model’s performance varies considerably between the internal and external validation sets (Fig. 4) owing to differences in image acquisition systems and annotation standards. Moreover, since both hospitals are in Northeast China, the data have regional characteristics, and the model’s cross-regional generalization, especially to diverse populations, needs further study. Clinical data also naturally lack balance in factors such as gender, age, and lesion features, introducing bias; however, this bias reflects clinical reality, and the model’s results are recognized by endoscopists (Fig. 5), demonstrating robustness in real-world scenarios. (2) Owing to the limited number of lesions, the CAD model currently cannot distinguish superficial from deep submucosal invasive cancer. We will address this in future work by expanding the training data with a larger and more diverse set of relevant cases from multiple hospitals. (3) External validation shows that disparities in image quality across hospitals, arising from regional variations and equipment differences such as resolution, illumination, and noise profiles, adversely affect model performance. To address this, we propose two complementary strategies: expanding multi-center datasets to increase the diversity of the training set, or implementing an EC-NBI image quality control system at the model’s input stage to enforce rigorous standardization of input data. (4) This study is a retrospective analysis and lacks prospective clinical validation. However, we have developed a real-time EC-NBI diagnostic system that has demonstrated effective diagnostic performance in laboratory settings; pending approval from the relevant regulatory bodies, this system will be deployed clinically to enable prospective validation, a key objective of our future research.

Methods

Patients and lesions

We retrospectively collected the clinical data and endocytoscopic images of colorectal lesions that were pathologically diagnosed after endocytoscopic observation between June 2023 and October 2024. Exclusion criteria were: a history of inflammatory bowel disease, radiation therapy for colorectal cancer, special pathological types such as malignant lymphoma, submucosal lesions, sessile serrated lesions (SSLs), and those lesions with endocytoscopic images of poor quality.

Inclusion and ethics statement

The study protocol was approved by the Ethics Committee of the First Hospital of Jilin University and the Ethics Committee of the MEIHEKOU Center Hospital. Informed consent from patients was exempted because of the retrospective nature of this study.

Development of deep learning model

Inspired by the observation that proper model initialization can significantly enhance performance in downstream tasks20, we developed a multi-stage CPT strategy for diagnostic decision-making with EC-NBI images. As shown in Fig. 2a, in the first stage we pre-trained the encoder of our model on a substantial volume of unlabeled endoscopy images (480,000) compiled from several public datasets21,22,23,24,25,26,27,28,29,30,31. These datasets include colonoscopy, gastroscopy, and a limited number of esophageal images and videos, primarily in white-light modality with some fluorescence and narrow-band imaging. All still images were used directly; for videos, we extracted one frame per second to enrich the dataset. This pre-training enables the model to acquire superior feature extraction capabilities. We employed the Masked Image Modeling framework42 to train an image restoration network: using this large-scale dataset, the network was trained to reconstruct images in which 75% of the content had been masked. We used the MAE encoder as the backbone. In the second stage, we continued pre-training the model in the same way on all the unlabeled EC-NBI images in the training dataset (8615 images).
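The one-frame-per-second video sampling could be implemented as in the following sketch; the use of OpenCV and the fallback frame rate are assumptions, since the paper does not name a tool.

```python
# Sketch: extract one frame per second from an endoscopy video.
import os
import cv2

def extract_frames(video_path, out_dir, every_sec=1.0):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS is missing
    step = max(1, int(round(fps * every_sec)))     # frames between samples
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```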

In the third stage, after model pre-training, we fine-tuned the pre-trained encoder on the labeled EC-NBI dataset in a supervised manner. This SFT process, combined with a straightforward classifier, was tailored to optimize the model for the downstream task of colorectal lesion diagnosis from EC-NBI images. Specifically, we incorporated the concept of SDC, based on expectation maximization43,44, into the fine-tuning process. By maintaining class centers, we ensured that samples within the same class were as close as possible in the feature space, while samples from different classes were maximally distant. This encourages the fine-tuned feature extractor to generate features that are distinctly separable in the feature space.

Backbone

In this study, we chose the MAE encoder as the backbone of our model. MAE is an unsupervised technique that learns image representations by recovering partially masked images, thereby effectively capturing both local and global information in the data. It makes full use of unlabeled data, which is particularly advantageous in fields where labeled data are expensive. Following the suggestion in the original paper20, we randomly occlude 75% of the image patches and feed the remaining 25% of visible patches into the model, which is trained to predict and restore the masked content.
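The random masking step follows the shuffling scheme of the original MAE paper; the sketch below illustrates it with assumed tensor shapes (14×14 patches of a 224-pixel image).

```python
# Sketch: MAE-style random masking that hides 75% of the patches.
import torch

def random_masking(x, mask_ratio=0.75):
    """x: (B, N, D) patch embeddings; returns visible patches and shuffle order."""
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)      # per-patch random scores
    ids_shuffle = torch.argsort(noise, dim=1)      # patches with low scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return x_visible, ids_shuffle                  # encoder only sees the 25%

patches = torch.randn(8, 196, 768)                 # 14x14 patches, ViT-B width
visible, order = random_masking(patches)
assert visible.shape == (8, 49, 768)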

Supervised deep clustering

The primary objective of classification problems lies in extracting features that are separable within the feature space. Although traditional supervised classification methods may perform well on specific test sets, they often generalize poorly, in large part because the features they extract are not readily distinguishable in the feature space, so new samples are prone to misclassification. To obtain a feature extractor that yields easily distinguishable features, the fundamental principle is to minimize the intra-class distance and maximize the inter-class distance. In our work, we incorporate the concept of expectation maximization to alternately optimize our objective through two steps: expectation and maximization. By reducing the distance between samples and their corresponding class centers and increasing the distance between samples and the centers of other classes, we achieve intra-class compactness and inter-class separation, enabling excellent classification performance even with a simple classifier. Experiments demonstrate that supervised deep clustering enhances the accuracy of downstream classification tasks and helps the encoder learn sample-specific representations; a sketch of the alternation is given below.
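The following is a minimal sketch of the EM-style alternation under our own naming: the E-step re-estimates class centers from the current features, and the M-step updates the encoder to pull samples toward their own center. The cross-entropy surrogate shown here is a stand-in; the exact clustering loss used by the paper is Eq. (5) below.

```python
# Sketch: alternating center estimation (E-step) and feature update (M-step).
import torch
import torch.nn.functional as F

@torch.no_grad()
def e_step(features, labels, num_classes):
    # Class centers as the normalized mean feature of each class
    # (computed over the full training set so every class is present).
    centers = torch.stack([features[labels == k].mean(0)
                           for k in range(num_classes)])
    return F.normalize(centers, dim=1)

def m_step_loss(features, labels, centers):
    # Pull each sample toward its own center and away from the others.
    sims = F.normalize(features, dim=1) @ centers.t()   # (B, K) similarities
    return F.cross_entropy(sims, labels)
```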

Supervised fine-tuning via downstream dataset

After the dual-stage continuing pre-training, we obtain a good initial weight for the model and a feature extractor with strong feature extraction capabilities. To achieve good performance on the downstream EC-NBI diagnostic task, we first added a simple classifier, consisting of one fully connected hidden layer, after the pre-trained feature extractor. We then used the labeled EC-NBI training data to fine-tune the model, optimizing not only the classification loss but also the supervised deep clustering loss, as sketched below.
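One way to realize this setup is sketched below; the hidden width, pooling choice, and the `sdc_loss` helper (implemented after Eq. (5) below) are assumptions for illustration.

```python
# Sketch: classifier head and one SFT step combining both losses.
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(
    nn.Linear(768, 256),   # one fully connected hidden layer (width assumed)
    nn.GELU(),
    nn.Linear(256, 3),     # non-neoplastic / adenoma / invasive cancer
)

def sft_step(images, labels, encoder, classifier, centers):
    feats = encoder(images).mean(dim=1)   # global pooling over patch tokens
    logits = classifier(feats)
    # Joint objective of Eq. (4): cross-entropy plus the SDC term.
    return F.cross_entropy(logits, labels) + sdc_loss(feats, labels, centers)
```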

Loss Function

In the pre-training stage, we trained our encoder on endoscopic images and EC-NBI images in turn. We employed the combination of the \(L_{1}\) loss and a perceptual loss \(L_{pl}\) as the overall loss function \(L_{Pre\_train}\):

$$L_{Pre\_train}=L_{1}+L_{pl}$$
(1)
$$L_{1}=\frac{1}{N}\sum_{i=1}^{N}\left|X_{i}-\bar{X}_{i}\right|$$
(2)
$$L_{pl}=\frac{1}{N}\sum_{i=1}^{N}\left|VGG_{X_{i}}-VGG_{\bar{X}_{i}}\right|$$
(3)
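A minimal sketch of Eqs. (1)–(3) in PyTorch is shown below; the choice of VGG-16 layers for the perceptual term is an assumption (the paper only states that an ImageNet-pretrained VGG is used), and inputs are assumed already normalized.

```python
# Sketch: pixel-wise L1 plus VGG-feature (perceptual) L1 pre-training loss.
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen VGG-16 feature extractor (layer cutoff is an assumption).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def pretrain_loss(pred, target):
    l1 = (pred - target).abs().mean()               # Eq. (2)
    lpl = (vgg(pred) - vgg(target)).abs().mean()    # Eq. (3)
    return l1 + lpl                                 # Eq. (1)
```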

Here, \(X_{i}\) is the ith EC-NBI image, \(\bar{X}_{i}\) is the ith predicted output, and \(N\) is the number of EC-NBI images. In the third stage of model fine-tuning on the labeled EC-NBI dataset, we employed not only the cross-entropy \(L_{CE}\) as the classification loss but also a supervised deep clustering loss \(L_{SDC}\). The final loss function in the SFT stage can be formulated as follows:

$$L_{SFT}=L_{CE}+L_{SDC}$$
(4)
$$L_{SDC}=\frac{1}{N}\sum_{i=1}^{N}\frac{e^{1-\mathbf{f}_{i}\mathbf{c}_{i}^{\mathrm{T}}}}{\sum_{k=1}^{K}e^{1-\mathbf{f}_{i}\mathbf{c}_{k}^{\mathrm{T}}}}$$
(5)

Here, \(\mathbf{f}_{i}\) is the feature of the ith EC-NBI image, \(\mathbf{c}_{i}\) is the center of the class to which the ith image belongs, \(\mathbf{c}_{k}\) is the center of the kth of the \(K\) classes, and \(\mathrm{T}\) denotes transposition.
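A direct transcription of Eq. (5) into PyTorch follows; feature and center normalization details are not specified in the text, so this is a sketch rather than the authors’ implementation.

```python
# Sketch: supervised deep clustering loss of Eq. (5).
import torch

def sdc_loss(feats, labels, centers):
    """feats: (B, D) features; labels: (B,) class ids; centers: (K, D)."""
    sims = feats @ centers.t()                   # f_i c_k^T for every class k
    weights = torch.exp(1.0 - sims)              # e^{1 - f_i c_k^T}
    own = weights[torch.arange(len(labels)), labels]  # own-class term
    return (own / weights.sum(dim=1)).mean()     # softmax of dissimilarity
```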

Experimental implementation

This experiment is divided into three stages: large-scale unlabeled endoscopic data for network pre-training, continued unsupervised pre-training on the unlabeled downstream training set, and SFT for the downstream task. All images are resized to 224\(\times\)224 and standardized with the mean and standard deviation of ImageNet45. We set the batch size to 128 and used AdamW as the optimizer. In the network pre-training stage, a small learning rate of \(10^{-6}\) is used for a 2-epoch warm-up. The difference between stages one and two is that we conduct 50 epochs of pre-training in the first stage and 100 epochs in the second. We used a cosine scheduler whose maximum learning rates are \(10^{-4}\) for stage one and \(10^{-5}\) for stage two, with a minimum learning rate of \(10^{-5}\) for both. During pre-training, all data also undergo random cropping, color jittering, and rotation. In addition, the perceptual loss is computed with a VGG network pretrained on ImageNet. For the downstream fine-tuning, the cosine scheduler’s maximum learning rate is \(10^{-4}\) and its minimum is \(10^{-6}\), and we set the number of training epochs to 10.
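The stage-one schedule (2-epoch warm-up at \(10^{-6}\), then cosine decay from \(10^{-4}\) to \(10^{-5}\) over the remaining 48 epochs) could be composed from standard PyTorch schedulers as follows; the composition itself is our reconstruction, not the authors’ code.

```python
# Sketch: AdamW with linear warm-up followed by cosine annealing (stage one).
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

optimizer = AdamW(model.parameters(), lr=1e-4)
steps = len(train_loader)                          # iterations per epoch
warmup = LinearLR(optimizer, start_factor=1e-6 / 1e-4, total_iters=2 * steps)
cosine = CosineAnnealingLR(optimizer, T_max=48 * steps, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[2 * steps])
# Call scheduler.step() once per training iteration.
```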

Evaluation of model accuracy for disease diagnosis

Image-level diagnosis

After comprehensive network training, we employed receiver operating characteristic (ROC) analysis to assess the performance of the proposed model in predicting the types and probabilities of colorectal diseases on the internal validation cohort (23,527 EC-NBI images from 408 patients) and the external validation cohort (2749 EC-NBI images from 90 patients). When calculating the various metrics, the optimal cutoff value is determined by the Youden index at the image level in the training cohort, i.e., the cutoff that maximizes the sum of sensitivity and specificity. We computed evaluation metrics including the AUC, sensitivity, specificity, and positive and negative predictive values.
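The Youden-index cutoff selection can be sketched with scikit-learn’s ROC utilities as below, applied per class in a one-vs-rest fashion on the training-cohort scores.

```python
# Sketch: pick the cutoff that maximizes Youden's J on the training cohort.
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                      # J = sensitivity + specificity - 1
    return thresholds[np.argmax(j)]    # threshold maximizing J
```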

Lesion-level diagnosis

Clinically, multiple EC-NBI images can be obtained for a single lesion. In this paper, the lesion-level inference is obtained by voting over the results of the lesion’s EC-NBI images: we predict the result of each image separately and take the most frequently predicted class as the final prediction for that lesion. Additionally, we evaluated the ability of the deep learning model to enhance endoscopists’ performance in predicting colorectal lesions with and without AI assistance. Endoscopists (1st doctor: 3 years of experience; 2nd doctor: 1 year; 3rd doctor: 9 years; 4th doctor: 7 years) independently assessed the intestinal disease status of each lesion and then, after considering the predictions of the AI model, re-evaluated the histologic classification on the same day, demonstrating the impact of the model on individual assessments.

Interpretation of the model

To address the issue of black-box predictions in deep learning models, we first apply Grad-CAM to explore the class activation mapping of each transformer block of the proposed model. Grad-CAM is a model-agnostic interpretability technique that generates class-specific localization maps by computing a weighted sum of the feature maps of a chosen layer, highlighting the regions of an input image that are important for a particular prediction. Subsequently, SHAP, a game-theoretic approach, was employed to analyze the features generated by the proposed model. SHAP values offer an in-depth understanding of the contribution of each feature to the model’s predictions, providing a clear and interpretable view of the decision-making process. By calculating the SHAP value for each feature, we can determine the importance and impact of individual features on the predicted outcomes: high SHAP values signify a significant influence of a feature on the prediction, while low values imply a minimal impact. SHAP analysis was conducted with the SHAP library (v0.42.1).

Improvement of representation learning

Typically, the performance of a disease model depends on the separability of images within the feature space. If image features are closer together within the same class and farther apart between different classes in the feature space, the model’s predictions will be more reliable and exhibit better generalization performance. To evaluate the model’s representation learning ability, we visualize the clustering effect of image features. Specifically, we convert the encoder’s output features into one-dimensional vectors through global pooling and then apply T-SNE for dimensionality reduction to create feature visualizations. Additionally, we calculate the silhouette score to assess the effectiveness of each method in separating different disease types. The silhouette score measures the similarity of an object to its cluster compared to other clusters, providing an evaluation of clustering quality. Higher silhouette scores indicate better model performance in differentiating between various disease types, reflecting more reliable and generalizable prediction results.

Ablation study

In this paper, our method is a complex approach that combines a multi-stage pre-training strategy and supervised deep clustering. To explore the impact of different components on the model’s performance, we conducted ablation experiments for different parts. The comprehensive quantitative results are shown in Table S3, S4, and S5 in the Supplementary Information File.

Ablation study on the pre-training dataset: To explore the role of large-scale endoscopic images in the first-stage pre-training, we replaced them with a large-scale natural-image dataset, ImageNet, which contains approximately one million images. Keeping all other hyperparameters and settings fixed, we conducted the same pre-training. Our method exceeded ImageNet pre-training by approximately 30.7% in accuracy, as shown in Table S3 in the Supplementary Information File. This indicates that, for the magnifying endoscopy classification task, closely related ordinary endoscopic data are more helpful than general pre-training datasets such as ImageNet.

Ablation study on supervised deep clustering (SDC): To explore the role of SDC in the model, we conducted a new experiment without the deep clustering loss. The comparison shows that with the deep clustering loss, all quantitative indicators of our method improve by more than approximately 10%, as shown in Table S4 in the Supplementary Information File. This demonstrates that supervised deep clustering extracts more effective feature representations by reducing the feature differences of samples within the same category, thereby helping the model achieve better classification results.

Ablation study on the multi-stage training strategy: During multi-stage training, the first-stage pre-training is of tremendous help, effectively assisting the model in reaching a better performance state, while the second stage makes fuller use of the downstream magnifying endoscopy data. The multi-stage pre-training strategy improves the model’s accuracy by approximately 31.9% compared with training without it, and the improvement in the other indicators is even more significant, as shown in Table S5 in the Supplementary Information File.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.