Background & Summary

Endoscopic spine surgery (ESS) is a minimally invasive surgical technique commonly applied to procedures including decompression of spinal nerves, removal of herniated discs, and spinal fusion1. This advanced surgical approach allows surgeons to operate with minimal disruption to surrounding tissues, resulting in less postoperative pain, faster recovery, reduced scarring, and shorter hospital stays compared with traditional open surgery2,3,4. However, the steep learning curve remains a significant obstacle to the widespread adoption of these techniques by surgeons5,6. The technically intricate nature of the procedures, together with the limited operational space, requires surgeons to manipulate surgical instruments with a high degree of precision and control. Junior surgeons often lack the surgical experience needed to manage patients with diverse pathological presentations, and complications such as epidural hematoma or significant bleeding pose serious risks to patient safety7. In recent years, the rapid advancement of artificial intelligence (AI) has driven innovation in disease diagnosis and surgical assistance systems based on endoscopic data8,9,10,11,12. Furthermore, computer-assisted interventions can enhance intraoperative navigation, automate image interpretation, and enable the operation of robotic-assisted tools13. Developing intelligent systems capable of real-time decision-making during surgery is therefore a key direction for the future of endoscopic spine surgery, with great potential to enhance the intelligence, safety, and efficiency of these procedures.

The automatic segmentation of surgical instrument boundaries is the first step toward an intelligent surgical assistance system for ESS and a crucial prerequisite for efficient instrument tracking and pose estimation during surgery. However, surgeons often need to switch instruments frequently and adjust the position of the access port in real time, which can deform the instruments’ appearance within the field of view. Moreover, unlike laparoscopic surgery, ESS is performed in a fluid-irrigated environment. The diversity of surgical instruments and the uncertainties of the surgical environment (such as blood and bubble interference, tissue occlusion, and overexposure) pose greater challenges to the tracking and rapid adaptation capabilities of computer-assisted systems. Deep learning algorithms currently represent the most advanced approach to automatic instrument boundary segmentation in medical imaging14,15,16,17,18,19. Models such as U-Net, CaraNet, CENet, nnU-Net and YOLOv11x share key strengths in medical instrument segmentation: they effectively capture global information, offer versatility in training, and excel at the precise segmentation of both large and small objects. Their combined advances in context awareness and feature extraction make them particularly well suited to the complex challenges of instrument segmentation in medical imaging17,20,21,22,23,24.

Large-scale, high-quality datasets are the cornerstone of successful deep learning models. Balu et al.25 released a simulated minimally invasive spinal surgery video dataset (SOSpine), which includes corresponding surgical outcomes and annotated surgical instruments; this dataset was used to train neurosurgery residents through a validated minimally invasive spine surgery cadaveric dura repair simulator. However, the field of ESS still lacks specialized, publicly accessible image datasets. We have therefore created the first image dataset of endoscopic surgical instruments based on real surgical procedures (Spine Endoscopic Atlas, SEA). The dataset comprises a total of 48,510 images from two centers, covering both cervical and lumbar endoscopic surgeries performed through both small and large working channels. We segmented the surgical instrument boundaries in a subset of the images and categorized them meticulously. The large amount of unsegmented image data provides abundant training material for semi-supervised learning: using a small set of segmented, annotated images as supervision, a model can exploit the unsegmented images to learn structural features automatically, significantly reducing the annotation workload26,27,28. Moreover, by combining segmented and unsegmented data, a model can learn more diverse features, enhancing its ability to recognize surgical instruments under various conditions29. This use of unsegmented data also helps prevent overfitting to the limited annotated data, improving generalization to new data. We believe that publishing this dataset will greatly assist researchers in enhancing existing algorithms, integrating data from multiple centers, and improving the intelligence of ESS more efficiently.
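As an illustration of how the unsegmented images could be exploited, the sketch below shows one common semi-supervised strategy (confidence-thresholded pseudo-labelling) in PyTorch. It is not part of the dataset release or of our baseline pipeline; the model, the data loaders, and the hyper-parameters (confidence threshold, loss weight) are placeholders chosen for illustration only.

```python
# Minimal pseudo-labelling sketch (one possible semi-supervised strategy, not the
# authors' pipeline). Assumes a PyTorch segmentation model `model` returning
# per-pixel class logits, a labelled batch (images, integer masks) and an
# unlabelled batch of images only.
import torch
import torch.nn.functional as F

def semi_supervised_step(model, optimizer, labelled_batch, unlabelled_images,
                         conf_thresh=0.9, unlabelled_weight=0.5):
    images, masks = labelled_batch
    supervised_loss = F.cross_entropy(model(images), masks)

    # Generate pseudo-labels on unsegmented frames; keep only confident pixels.
    with torch.no_grad():
        probs = F.softmax(model(unlabelled_images), dim=1)
        conf, pseudo = probs.max(dim=1)

    logits_u = model(unlabelled_images)
    pixel_loss = F.cross_entropy(logits_u, pseudo, reduction="none")
    unsupervised_loss = (pixel_loss * (conf > conf_thresh)).mean()

    loss = supervised_loss + unlabelled_weight * unsupervised_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```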

Method

Ethics statement

The ethics committee of Shenzhen Nanshan People’s Hospital (Ethics ID: ky-2024-101601) granted a waiver of informed consent for this retrospective study. Because all data were strictly anonymized after video export, with no identifiable or sensitive patient information involved, the committee approved the use of the anonymized data for research purposes. The approval also allows the open publication of the dataset.

Data collection

This study retrospectively collected endoscopic surgery videos of the cervical and lumbar spine from 119 patients. All operations were performed by senior surgeons at two medical centers between January 1, 2022 and December 31, 2023. The recordings were captured with standard endoscopic equipment from STORZ (IMAGE1 S) and L’CARE, with all videos recorded at 1080p or 720p and a frame rate of 60 frames per second. The endoscopes were manufactured by Joimax and Spinendos, respectively. To preserve the originality and completeness of the data, no preprocessing was performed before storage: the videos were not denoised, cropped, color-adjusted, or otherwise modified, so that all potentially important intraoperative details were retained as the most authentic foundation for subsequent image analysis and model training. All videos and images were automatically anonymized at export from the recording devices and during frame extraction. This procedure ensured that the final dataset contains no personally identifiable information and that individual participants cannot be traced from the stored data.

Data processing, annotation and quality assessment

Image data

We created the SEA dataset from the collected video data. An expert with extensive experience in image processing selected segments from each video in which instruments appeared frequently. One frame per second was extracted from these segments, with priority given to images that presented segmentation challenges. Frames in which no instrument appeared in the field of view were excluded. Each sample was stored as an independent entry in the corresponding folder.
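For readers who wish to reproduce a comparable one-frame-per-second sampling from their own recordings, the following OpenCV sketch illustrates the idea. The exact extraction tool we used is not prescribed here, and the file names (e.g., "P7-2.mp4") are hypothetical.

```python
# Illustrative 1-fps frame extraction with OpenCV; recordings in this dataset
# were captured at 60 fps, so roughly every 60th frame is kept.
import cv2

def extract_frames(video_path, out_prefix):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 60   # fall back to 60 fps if metadata is missing
    step = int(round(fps))                  # keep one frame per second
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_prefix}-{saved:04d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical file name): extract_frames("P7-2.mp4", "P7-2")
```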

The appearance of surgical instruments can vary with viewing perspective: the shape, size, contours, and visible parts of an instrument may change with the endoscopic field of view. The images were therefore first classified by the size of the surgical working channel. Images in which the instrument boundaries were continuous and clear were categorized as “Normal scenario”, while images in which the instruments were obscured by blood, bubbles, or tissue were classified as “Difficult scenario”. The six main instruments appearing in the dataset were further sorted into designated folders. After classification, two experts reviewed the dataset repeatedly. It is worth noting that, despite efforts to minimize subjective errors through repeated re-evaluation, certain errors caused by lighting conditions (such as overexposure or underexposure) may still persist.

Segmentation of images

The segmentation was performed using LabelMe (v5.0.2) and 3D Slicer (v5.0.2). Before annotation began, multiple preparatory meetings were held to explain in detail the annotation standards for the various scenarios. These sessions included detailed guidelines and hands-on practice with representative examples, and feedback was provided to ensure that the annotators reached a consistent understanding of the criteria. In the initial stage, two annotators independently labeled the same set of 200 images, followed by a consistency analysis of the results. These annotations were then reviewed by two junior reviewers, and any annotations that did not meet the standards were returned for re-labeling until approved. Formal annotation then began, with the annotators dividing the remaining images between them. Throughout the process, the two junior reviewers continued to assess the annotations; when uncertain cases were encountered, a senior reviewer conducted a final review to determine whether any annotations needed to be returned for revision. The workflow is shown in Fig. 1. Notably, for instruments partially obscured by blood, tissue, or lighting issues, clinicians were still required to outline the full contours of the instruments as accurately as possible; when most of an instrument was unclear, the segmentation focused on the visible parts. Figure 2 illustrates examples of segmentation results for various types of surgical instruments.
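For users working with the LabelMe annotations, the sketch below shows how a polygon annotation can be rasterized into a binary mask. The field names follow the standard LabelMe JSON export ("shapes", "points", "imageHeight", "imageWidth"); masks produced with 3D Slicer are distributed directly in NRRD format and do not require this step.

```python
# Hedged sketch: rasterizing a LabelMe polygon annotation into a binary mask.
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path):
    with open(json_path) as f:
        ann = json.load(f)
    h, w = ann["imageHeight"], ann["imageWidth"]
    mask = Image.new("L", (w, h), 0)        # blank single-channel canvas
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape.get("shape_type", "polygon") == "polygon":
            pts = [tuple(p) for p in shape["points"]]
            draw.polygon(pts, outline=1, fill=1)
    return np.array(mask, dtype=np.uint8)   # 1 = instrument, 0 = background
```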

Fig. 1
figure 1

Illustration of the instrument annotation, review, and revision process.

Fig. 2
figure 2

Instrument segmentation of ESS using LabelMe and 3D Slicer. (a) grasping forceps, (b) bipolar, (c) drill, (d) scissor, (e) dissector, (f) punch. (a) to (c) were segmented using LabelMe, while (d) to (f) were segmented using 3D Slicer.

Data statistics

Tables 1, 2 give an overview of the data records and image dimensions in the dataset. The dataset comprises a total of 48,510 images. The images come in three different dimensions and are stored in both JPG and PNG formats, with a combined size of 9.99 GB. The segmentation masks are stored in NRRD format only, with a total size of 0.08 GB.

Table 1 Overview of data record in the Spine Endoscopic Atlas dataset.
Table 2 Image dimensions in the Spine Endoscopic Atlas dataset.

Figure 3a shows the proportion of images in the two scenarios, with a clear bias towards difficult scenarios in the dataset. We did not apply any specific measure to address this imbalance: a model that performs well only in normal scenarios may fall short in practical applications, whereas a larger number of images from difficult scenarios gives the model more training data under these complex conditions, enhancing its robustness and generalization capability.

Fig. 3
figure 3

(a) Number of images in the Spine Endoscopic Atlas dataset for normal scenario and difficult scenario; (b) number of images for each instrument category in the Spine Endoscopic Atlas dataset.

In a fluid-irrigated environment, limited instrument visibility can pose a significant challenge. Figure 4 illustrates different types of challenging situations. When an instrument occupies less than 20% of the field of view, it is considered to have a small proportion. The dataset does not provide individual annotations for each specific difficulty, as these challenges rarely occur in isolation; they typically arise from a combination of factors occurring simultaneously, such as limited visibility, obstructed instrument operation, and difficulty in controlling bleeding.
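The "small proportion" criterion above can be checked programmatically from a segmentation mask. The following sketch assumes the mask has already been loaded as a binary NumPy array (for example from an NRRD file via the pynrrd package); the 20% threshold mirrors the definition used in this work.

```python
# Flag frames in which the instrument covers less than 20% of the field of view.
import numpy as np

def is_small_proportion(mask: np.ndarray, threshold: float = 0.20) -> bool:
    """Return True if the instrument mask occupies less than `threshold` of the frame."""
    fraction = np.count_nonzero(mask) / mask.size
    return fraction < threshold
```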

Fig. 4
figure 4

Difficult scenarios of surgical instrument segmentation during endoscopic spinal surgery. (a) bubbles, (b) working channel interference, (c) bleeding, (d) underexposed, (e) overexposed, (f) bone debris, (g) small proportion of instrument, (h) tissue obstruction, (i) multiple difficult scenarios.

Figure 3b shows the distribution of the different types of surgical instruments in the dataset. Grasping forceps and bipolar had the largest numbers of images, at 4,918 and 2,933, respectively. The numbers for drill and punch were slightly lower, reflecting their use only in specific surgical steps. The dataset contains only 407 images of scissors and 241 images of dissectors, indicating that these instruments are used relatively infrequently during surgery.

Data description

The dataset is available in the Figshare repository30. The folder structure is shown in Fig. 5. The image data are divided into two main folders: “classified” and “unclassified”. To increase the diversity of recording equipment and samples, data from two medical centers were included. Within the “classified” folder, the data are further divided into “big channel” and “small channel” subfolders according to the working-channel diameter (6.0 mm and 3.7 mm, respectively), and then by surgical region into “cervical” and “lumbar” folders. Each patient is treated as an individual sample stored in a separate folder; samples from medical centers 1 and 2 are labeled “P + Patient ID” and “T + Patient ID”, respectively. Inside each sample folder, images are classified into “Normal scenario” and “Difficult scenario” subfolders, within which the images and their corresponding annotation files are organized under instrument-specific subfolders (e.g., “bipolar”, “grasping forceps”, “drill”, “dissector”, “punch”, “scissor”). Both image and annotation files follow the naming convention “P/T + Patient ID + video index + frame index” (e.g., P7-2-0204). All unclassified data come from medical center 1 and provide researchers with a rich source of raw data for developing more precise algorithms for automatically identifying and classifying complex medical images and instrument features.
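As a convenience, the sketch below walks the published folder layout and parses the file-naming convention described above. The path components ("classified", scenario and instrument subfolders) follow the description in this section, but users should verify them against the Figshare release; the function name and output fields are our own illustrative choices.

```python
# Index classified images by center, patient, video, frame and scenario,
# based on the "P/T + Patient ID + video index + frame index" naming scheme
# (e.g. "P7-2-0204").
import re
from pathlib import Path

STEM_PATTERN = re.compile(r"^(?P<center>[PT])(?P<patient>\d+)-(?P<video>\d+)-(?P<frame>\d+)$")

def index_classified_images(root):
    records = []
    for img in Path(root, "classified").rglob("*"):
        if img.suffix.lower() not in {".jpg", ".png"}:
            continue
        m = STEM_PATTERN.match(img.stem)
        if m is None:
            continue
        records.append({
            "path": str(img),
            "center": "center1" if m["center"] == "P" else "center2",
            "patient": m["patient"],
            "video": m["video"],
            "frame": m["frame"],
            "scenario": "difficult" if "Difficult" in str(img) else "normal",
        })
    return records
```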

Fig. 5
figure 5

Overview of the Spine Endoscopic Atlas dataset’s structure.

Technical Validation

We performed a consistency analysis on 200 samples annotated by the two annotators in the early stages, as shown in Table 3. In Stage 1, we conducted an initial consistency assessment of the first 100 cases completed by the two annotators. The results were then reviewed by the two junior reviewers and returned for revision as needed until all cases were approved. The process then advanced to Stage 2, in which the annotators annotated an additional 100 new cases and a second consistency assessment was carried out. The evaluation metrics were the Dice coefficient (DC) and Intersection over Union (IoU). In Stage 1, the DC was 0.8743 and the IoU was 0.8021; in Stage 2, both metrics improved markedly, with the DC reaching 0.9550 and the IoU 0.9166, indicating enhanced consistency and reliability of the annotations. This two-stage validation not only helped identify and correct early inconsistencies in the annotation process but also drove continuous improvement in annotation quality. Through iterative reviews and quantitative assessment, we ensured that the final dataset maintains high standards of accuracy and consistency, which is critical for downstream tasks such as model training, evaluation, and clinical application.
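The inter-annotator agreement values can be reproduced with a standard pairwise comparison of the two annotators' binary masks; the snippet below is a generic implementation sketch rather than our exact evaluation script, and it assumes the masks have already been loaded as NumPy arrays of identical shape.

```python
# Pairwise Dice coefficient and IoU between two binary masks.
import numpy as np

def dice_and_iou(mask_a: np.ndarray, mask_b: np.ndarray, eps: float = 1e-7):
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dice = (2.0 * intersection + eps) / (a.sum() + b.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou
```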

Table 3 Inter-annotator consistency test between two annotators.

To compare the performance of different deep learning algorithms for instrument segmentation on SEA, we trained and tested five models: U-Net, CaraNet, CENet, nnU-Net and YOLOv11x. Each model was selected for its suitability for medical image segmentation and its proven performance in related tasks.

CENet

CENet is designed to capture both local and global features by integrating context encoding with convolutional operations20,31, making it highly effective for tasks where precise segmentation of instruments is required in complex environments. The context module encodes rich global information, which is particularly important for handling scenarios with occlusions, bubbles, or low contrast, as highlighted in the dataset.

U-Net

U-Net is one of the most widely used architectures for medical image segmentation due to its encoder-decoder structure with skip connections21,32. It excels at preserving spatial information during down-sampling, which is crucial for accurate segmentation of surgical instruments, even when only a small portion of the instrument is visible.

CaraNet

CaraNet is an innovative medical image segmentation model that significantly enhances segmentation accuracy and detail capture by incorporating a reversed attention mechanism and context information enhancement module22. The model excels at capturing image details and edges, making it suitable for medical image segmentation tasks in complex backgrounds.

nnU-Net

nnU-Net is an adaptive framework based on the U-Net architecture23,33, specifically designed for medical image segmentation tasks. Compared to traditional U-Net models, nnU-Net not only provides the model architecture but also includes a complete automated pipeline for preprocessing, training, inference, and post-processing, thereby simplifying the adaptation to different datasets.

YOLOv11x

YOLOv11x is the latest real-time object detection model in the YOLO series24, combining advanced architectural design with strong feature extraction capabilities. Compared with YOLOv8, YOLOv11 achieves higher mean average precision (mAP) on the COCO dataset while improving computational efficiency.

Validation procedure

The SEA was divided into a training set (80%) and two test sets (20% in total), the latter comprising a local test set (10%) and an external test set (10%). The training set was derived exclusively from Medical Center 1, whereas data from Medical Center 2 were reserved for external validation and excluded from training. Each model was trained and tested according to this split. For training, we used the Adam optimizer with a learning-rate scheduler to adjust the learning rate dynamically. Cross-entropy loss combined with Dice loss was used as the loss function to handle class imbalance, as some instrument categories (such as dissector and scissor) had fewer samples.
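A minimal sketch of this training objective is given below: a combined cross-entropy and soft-Dice loss optimized with Adam and a learning-rate scheduler. The loss weights, learning rate, and scheduler type shown here are illustrative placeholders, not the exact hyper-parameters used for each of the five models.

```python
# Hedged sketch of the combined cross-entropy + Dice objective described above (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    def __init__(self, ce_weight=1.0, dice_weight=1.0, smooth=1e-5):
        super().__init__()
        self.ce_weight, self.dice_weight, self.smooth = ce_weight, dice_weight, smooth

    def forward(self, logits, target):
        # logits: (N, C, H, W); target: (N, H, W) integer class labels
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)
        intersection = (probs * one_hot).sum(dims)
        cardinality = probs.sum(dims) + one_hot.sum(dims)
        dice = 1.0 - ((2.0 * intersection + self.smooth) / (cardinality + self.smooth)).mean()
        return self.ce_weight * ce + self.dice_weight * dice

# Example setup (placeholder hyper-parameters):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# criterion = CEDiceLoss()
```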

Performance metrics

The Dice coefficient and Intersection over Union (IoU) were used to evaluate the performance of the five deep learning models on the instrument segmentation task in SEA. Both metrics measure the similarity between the predicted results and the ground-truth labels, with higher values indicating better predictive performance. The Dice coefficient handles class imbalance more effectively, while IoU more directly reflects the overlap area as a proportion of the union.
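For reference, the standard definitions of the two metrics for a predicted mask P and a ground-truth mask G are:

$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad \mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}.$$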

Results

On the local test set, nnU-Net performed best, with a Dice coefficient of 0.9753 and an IoU of 0.9531, slightly higher than the other models. On the external test set, nnU-Net’s Dice coefficient and IoU remained relatively high at 0.9376 and 0.8966, respectively, demonstrating better generalization ability, whereas U-Net, CaraNet and YOLOv11x showed a significant decline in performance. The detailed results are given in Table 4, and Figs. 6, 7 illustrate the automatic segmentation quality of the different models.

Table 4 Performance evaluation of U-Net, CaraNet, CENet, nnU-Net and YOLOv11x models in instrument segmentation for endoscopic spine surgery.
Fig. 6
figure 6

Qualitative comparison of U-Net, CaraNet, CENet, nnU-Net and YOLOv11x automatic segmentation of instruments in local test set.

Fig. 7
figure 7

Qualitative comparison of U-Net, CaraNet, CENet, nnU-Net and YOLOv11x automatic segmentation of instruments in external test set.

Overall, model performance declined on the external test set, especially for U-Net, revealing that when training data come from a single center or have a concentrated distribution, performance deteriorates significantly on data from other sources. The new environment clearly challenges the models’ generalization ability. Consequently, the SEA is of significant value for improving models and advancing the intelligent development of minimally invasive spine surgery.

Usage Notes

This study introduces the first comprehensive instrument dataset for endoscopic spine surgery (ESS), designed to accelerate the development of AI-driven solutions in minimally invasive spinal interventions. The dataset is uniquely positioned to support a spectrum of downstream tasks critical to intelligent surgical systems, including real-time tool tracking, context-aware navigation, procedural action recognition, safety landmark detection, and 3D pose estimation of instruments. These capabilities address fundamental challenges in ESS, such as maintaining spatial awareness in confined anatomical spaces and mitigating risks associated with instrument-tissue interactions.

The clinical significance of this resource extends beyond conventional instrument segmentation. By providing high-fidelity annotations across diverse surgical scenarios (cervical/lumbar procedures, variable channel sizes), the dataset enables systematic investigation of AI’s role in enhancing surgical precision—particularly in reducing positional errors during decompression—and improving safety through early detection of hazardous instrument trajectories. Furthermore, the inclusion of temporal and environmental variability (fluid artifacts, tissue occlusion patterns) facilitates the development of robust models capable of adaptive performance under realistic surgical conditions.

Future efforts will be implemented in phases to expand the dataset’s clinical and technical impact. We plan to collaborate with at least three tertiary hospitals to incorporate their ESS surgical videos and clinical data, covering diverse regions (North and South China), surgical teams (senior experts and residents), and a greater variety of surgical instruments (such as suturing devices and ultrasonic bone cutters), thereby enhancing the dataset’s diversity and representativeness. Building upon the existing annotations, we will introduce additional surgical phase labels (dural incision, nerve decompression) to support the development of surgical risk prediction models. These efforts will position the dataset as a critical resource for advancing intelligent ESS systems, enhancing surgical precision, safety, and efficiency worldwide.