Background & Summary

Colorectal cancer is the second most common cause of cancer-related deaths worldwide1. With the growing adoption of colonoscopy screening and advancements in AI-assisted diagnosis, the detection rate of colorectal neoplastic lesions (CNLs) has been steadily increasing2,3,4. Endoscopic submucosal dissection (ESD) is widely recognized as the primary treatment modality for large lesions (>2 cm) that require en bloc resection, flat non-granular laterally spreading tumors (LST-NG), especially those of the pseudo-depressed type, early-stage cancers with superficial submucosal invasion, fibrotic mucosal tumors, and recurrent cancers following endoscopic resection5. Compared with traditional surgery, ESD is less invasive and offers patients a better postoperative quality of life6. However, ESD requires advanced endoscopic skills, including precise instrument manipulation based on lesion morphology, optimal endoscope positioning, and meticulous regulation of gas volume in different segments of the colorectum. Moreover, accurate identification of lesion type and extent from real-time endoscopic images is crucial7.

Endoscopist training typically progresses from theoretical knowledge to practical application on animal models before advancing to supervised procedures on patients and eventual independent practice. However, limited access to animal models and expert mentorship creates significant challenges for novice practitioners, prolonging their learning curve and potentially increasing patient risk during procedures8.

Surgical videos offer a more objective and comprehensive record of intraoperative events compared to procedural documentation alone, allowing physicians to critically analyze their surgical techniques and correlate findings with patient outcomes9,10. Artificial intelligence (AI)-driven video analysis of surgical procedures provides real-time feedback and decision support to surgeons by dissecting the progression of surgical steps on a second-by-second basis11. This technology can also identify potential risks, improving surgical safety and patient care quality12,13. For novice practitioners, AI can support skill assessment through simulated environments and analyzed operation videos, contributing to a shortened learning curve14. As a result, the field of intelligent surgical process analysis is attracting growing interest from both computer scientists and medical professionals.

Currently, research on phase recognition in ESD treatment of early gastrointestinal cancers is limited. Cao et al. annotated approximately 50 ESD video instances (201,026 frames) covering gastric, esophageal, and colon lesions to develop an AI cognitive assistance system that showed promising training results in animal studies15. Furube et al. annotated 94 esophageal ESD videos and developed a phase recognition system that achieved over 90% accuracy in two independent tests16. A recent multicenter study analyzed 195 ESD videos from seven hospitals worldwide, with its AI model achieving 89.84%, 80.62% and 74.61% accuracy in esophageal, stomach and colorectal ESD tests, respectively17. However, these ESD videos were not made public, and the procedural classifications remain inadequate. Although the multicenter study included chromoendoscopy for lesion observation, this step was not separated from other phases, and recognition of key steps such as intraoperative traction and hemostasis still needs improvement.

Considering the limited availability of public datasets for automatic phase identification in ESD for CNLs, this study aimed to establish a publicly accessible database for ESD phase recognition. To this end, we meticulously annotated 30 ESD endoscopic videos of CNLs to support more efficient analysis of the ESD process. The primary contributions include:

  1. Video collection: A total of 30 surgical videos of ESD for CNLs were collected, covering several anatomical regions: 3 cases in the ileocecum, 13 in the transverse colon, 6 in the ascending colon, 3 in the sigmoid colon, and 5 in the rectum.

  2. Comprehensive labeling: 130,298 frames were annotated, with each frame assigned to a specific procedural phase.

We believe that making this dataset publicly available will substantially advance AI research related to ESD and promote the translation of these technologies into clinical settings.

Methods

Data collection

The study analyzed data collected during standard endoscopic submucosal dissection procedures conducted at Renji Hospital’s Endoscopy Department (Shanghai Jiao Tong University School of Medicine) from May through October 2024. Patients scheduled for standard ESD procedures were approached during preoperative consultations and informed about the optional research component. Each participant signed a consent form permitting their procedure recordings to be used for research purposes and openly published after data collection; participants were explicitly informed that they could decline research participation while still receiving standard clinical treatment. All personal identifiers (such as ID, date, and facial close-ups) were subsequently anonymized to protect confidentiality. Ethical clearance was granted by Renji Hospital’s Ethics Committee (Reference: LY2024-271-B). The analysis encompassed 30 ESD procedure recordings captured with the Olympus CV-260/290 endoscopy platform and IMH-200 image management hub at 1920 × 1080 resolution and 50 frames per second (fps). Table 1 lists the endoscopic instruments used during these interventions. All footage was screened to remove imagery from outside the digestive tract, and identifying elements such as patient numbers and timestamps were obscured. A highly qualified senior endoscopist, Dr. Li Xiaobo, performed all the procedures included in this investigation.

Table 1 The instruments used in the ESD procedures.

Annotations of video

To facilitate dataset annotation, our team created a methodical protocol for documenting ESD procedures (Fig. 1, Table 2). This protocol divides the ESD process into 8 phases: (1) Preparation: the interval during which clinicians adjust endoscopic equipment or exchange instruments; (2) Estimation: the initial evaluation of lesion characteristics through white light examination, magnification techniques, and NBI chromoendoscopy, with optional indigo carmine or crystal violet application (not mandatory for all ESD cases); (3) Marking: the identification of the lesion boundary, followed by placement of multiple circumferential electrocautery marks positioned roughly 5 mm from the affected area (not required for all ESD procedures); (4) Injection: the submucosal administration of a combined solution containing physiological saline, adrenaline, or hyaluronic acid derivatives to elevate the targeted tissue layers; (5) Incision: the circumferential cutting of the mucosa at a distance of 5 mm from either the lesion itself or the previously marked region; (6) ESD: the progressive detachment of the submucosal layer from the underlying muscularis propria until complete removal and retrieval of the target tissue, with bleeding controlled by electrocoagulation or thermal forceps as necessary and traction-assisted methods employed when indicated; (7) Vessel treatment: the management of remaining vascular structures or hemorrhage points on the exposed surface using thermal biopsy instruments; (8) Clip: the application of hemostatic clips for wound closure or management of suspected perforation. Because of the substantial volume of the dataset, videos were downsampled to 1 fps by extracting the first frame of each second for subsequent annotation. When a phase transition occurred between consecutive sampled frames, the phase boundary was determined from the predominant surgical activity observed in each sampled frame. Expert annotators assigned phase labels according to the most clinically significant activity present at each timestamp. Each frame was classified into exactly one of the 8 phases, determined by identifying the commencing and concluding frames of each phase.
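The sampling and labeling scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, the phase names follow the eight-phase protocol, and the segment tuples assume that start and end values index the 1 fps sampled sequence and are inclusive, which the text does not specify.

```python
from typing import List, Optional, Tuple

# Phase names taken from the eight-phase protocol in the text.
PHASES = [
    "Preparation", "Estimation", "Marking", "Injection",
    "Incision", "ESD", "Vessel treatment", "Clip",
]


def sampled_frame_indices(total_frames: int, fps: int = 50) -> List[int]:
    """Return the index of the first native frame in each second of video."""
    return list(range(0, total_frames, fps))


def expand_segments(
    segments: List[Tuple[str, int, int]], n_frames: int
) -> List[str]:
    """Expand (phase, start, end) segments into one label per sampled frame.

    Assumption: start and end are inclusive indices into the 1 fps sequence.
    """
    labels: List[Optional[str]] = [None] * n_frames
    for phase, start, end in segments:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        for i in range(start, end + 1):
            if labels[i] is not None:
                raise ValueError(f"frame {i} has more than one phase")
            labels[i] = phase
    if any(label is None for label in labels):
        raise ValueError("some frames have no phase label")
    return [label for label in labels if label is not None]
```

The overlap and gap checks enforce the rule that every sampled frame belongs to exactly one phase.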

Fig. 1 Characteristic visual examples of each of the eight endoscopic procedural phases.

Table 2 Definition of each phase of ESD for CNLs.

Data Records

The dataset underwent quality verification through multiple validation steps. The complete package, accessible as a compressed file on Figshare18, contains the 30 original endoscopic recordings together with phase annotations covering 130,298 frames. For each phase segment within a recording, a plain-text annotation specifies the beginning and concluding frames. The Figshare content is arranged hierarchically: the top level consists of numerically labeled directories (1–30), each corresponding to one procedural case, and each case directory holds the unprocessed endoscopic footage (in .mp4 format) and the corresponding phase annotation files (in .txt format). Fig. 2 shows the cumulative duration of each phase across the complete video collection, and Fig. 3 illustrates the distribution of phase annotations over the course of each recording.
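As a rough guide to working with this layout, the following Python sketch indexes the extracted archive by case number. The function name `index_cases` is hypothetical, and because the file names inside each case directory are not stated here, the sketch matches files by extension only.

```python
from pathlib import Path
from typing import Dict, List


def index_cases(root: str) -> Dict[int, Dict[str, List[Path]]]:
    """Map each numeric case directory to its .mp4 and .txt files."""
    index: Dict[int, Dict[str, List[Path]]] = {}
    for case_dir in Path(root).iterdir():
        # Only numerically named directories are treated as cases.
        if case_dir.is_dir() and case_dir.name.isdigit():
            index[int(case_dir.name)] = {
                "videos": sorted(case_dir.glob("*.mp4")),
                "labels": sorted(case_dir.glob("*.txt")),
            }
    # Sort numerically so that case 10 follows case 9, not case 1.
    return dict(sorted(index.items()))
```

Calling `index_cases` on the extraction directory should yield keys 1 through 30 if the archive matches the description above; the internal format of the .txt files should be checked against the files themselves before parsing.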

Fig. 2 Cumulative duration (in seconds) of each annotated phase across the complete collection of thirty annotated endoscopic submucosal dissection recordings.
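Because annotation was performed at 1 fps, each labeled frame stands for one second of video, so totals of this kind reduce to counting labels. A minimal sketch, assuming each recording is already available as a list with one phase label per sampled frame (the function name `phase_seconds` is hypothetical):

```python
from collections import Counter
from typing import Dict, Iterable, List


def phase_seconds(videos: Iterable[List[str]]) -> Dict[str, int]:
    """Sum the number of 1 fps frames (seconds) per phase over all videos."""
    totals: Counter = Counter()
    for labels in videos:
        totals.update(labels)  # one labeled frame counts as one second
    return dict(totals)
```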

Fig. 3 Graphical representation of phase annotations throughout the thirty ESD procedures. Video durations vary between cases and have been standardized to aid visual comparison.
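The caption does not say how durations were standardized. One simple possibility, shown here purely as an assumption, is to resample each per-second label sequence onto a fixed number of bins by index, so that recordings of different lengths share a common horizontal axis. The function name and the default of 100 bins are illustrative.

```python
from typing import List


def resample_labels(labels: List[str], n_bins: int = 100) -> List[str]:
    """Resample a label sequence to n_bins entries by proportional indexing."""
    if not labels:
        raise ValueError("labels must not be empty")
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    n = len(labels)
    # Bin i takes the label at the proportional position i / n_bins.
    return [labels[(i * n) // n_bins] for i in range(n_bins)]
```

Index-based resampling keeps labels categorical; interpolation would not be meaningful for phase names.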

Technical Validation

Dataset characteristics

The collection encompasses 30 individual procedure videos with accompanying phase classification labels, gathered from an equal number of clinical cases. Participants had a mean age of 58.7 years (SD = 8.60). The demographic distribution included 11 women and 19 men, all of Chinese nationality.

Data validation

Our classification protocol implemented a three-step verification process. In the preliminary stage, two clinically trained data specialists (Tang Cao and Jinneng Wang) independently classified two video sequences containing 3119 and 5185 frames, respectively, following the established annotation protocol. Agreement between the annotators was measured with Cohen’s kappa, yielding coefficients of 0.955 and 0.959, respectively (p < 0.001), which demonstrates excellent inter-rater consistency in applying the classification framework. After this reliability check, both annotators proceeded to independently categorize the remaining 28 procedural recordings. Following completion of all annotations, quality verification was conducted by two senior endoscopists with substantial clinical expertise (Xiaobo Li and Qingwei Zhang). This validation procedure combined visual inspection with clinical expertise to confirm the accuracy of the procedural phase designations.
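For readers who wish to reproduce an agreement check of this kind, Cohen’s kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of frames on which the two annotators agree and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration of that formula, not the software used in this study; the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same frames."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty frame set")
    n = len(rater_a)
    # Observed agreement: fraction of frames with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)
```

For example, with labels `["ESD", "ESD", "Clip", "Clip"]` and `["ESD", "ESD", "Clip", "ESD"]`, p_o is 0.75 and p_e is 0.5, giving κ = 0.5.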

Limitations of the dataset

A primary constraint of our dataset is its single-institution origin, which restricts the diversity of acquisition parameters because the equipment configuration, including endoscope models and illumination systems, was standardized. Future work will focus on expanding the repository and establishing collaborations with additional clinical facilities. Despite these constraints, as a publicly available resource for ESD procedure documentation, we believe this database will make valuable contributions toward advancing automated recognition of endoscopic submucosal dissection procedural phases.