Background & Summary

Minimally invasive surgery has significantly improved patient outcomes by reducing postoperative recovery times and complications compared to traditional open surgical procedures1. Techniques such as laparoscopic surgery enable surgeons to perform operations through small incisions, employing specialized instruments to access internal organs with minimal tissue damage. Advances beyond traditional laparoscopy have led to the development of robot-assisted surgery, significantly enhancing surgical precision, dexterity, and accessibility to anatomically challenging regions2. A prominent example is laparoscopic and robotic cholecystectomy—the surgical removal of the gallbladder—which is routinely performed worldwide, accounting for over 750,000 procedures annually in the United States alone1.

Despite these technological advances, the healthcare sector continues to face a critical shortage of surgical personnel. The increasing demand for surgical procedures, driven by an aging population and broader access to healthcare, has significantly strained existing medical infrastructure. Studies predict a shortage of approximately 20,000 general surgeons in the United States by 20303. This shortage exacerbates the burden on surgical teams, resulting in extended working hours, heightened stress levels, surgeon burnout, and potentially compromised patient care quality4.

To mitigate these challenges, computer-assisted surgical systems leveraging existing robotic surgical infrastructure have emerged as promising solutions to address the shortage of skilled surgical personnel5. Assistive technologies, such as automated surgical workflow analysis, can streamline clinical documentation and significantly reduce postoperative workload6. Early autonomous surgical robotic systems have already demonstrated their capability to perform complex, short-duration tasks with high precision under controlled conditions7, highlighting their potential to alleviate surgical workloads by automating longer-horizon workflows such as individual surgical phases and, in the longer term, contributing to more comprehensive procedural support.

Open-source datasets such as Cholec808, CholecT509, and HeiCo10 have substantially advanced surgical workflow analysis by providing labeled surgical phases and actions for laparoscopic procedures. However, these datasets are restricted to laparoscopic video data, lacking essential kinematic information required to develop autonomous robotic surgical policies. Other datasets, including JIGSAWS11, designed primarily for gesture recognition in simplified tabletop settings, and CRCD12, focused on robotic clutch and camera operations, contain visual and kinematic data but are limited in task variability and scale. These limitations restrict their effectiveness in training autonomous surgical systems capable of generalizing to complex, realistic surgical scenarios. Developing autonomous systems for long-horizon tasks via imitation learning demands extensive datasets with numerous diverse task executions13. While extensive datasets supporting complex, long-horizon imitation learning exist in general robotics14,15, such comprehensive datasets are notably absent in surgical robotics, representing a critical barrier to progress in autonomous surgery.

To overcome these limitations, this paper introduces both a novel robotic surgical dataset and an associated structured data acquisition approach explicitly designed for imitation learning of long-horizon surgical phases, forming the basis for training SRT-H16, a framework for executing such surgical phases fully autonomously. Our systematic data acquisition includes extensive visual and kinematic recordings from 34 ex vivo porcine cholecystectomies, comprising over 18,000 surgical task demonstrations, specifically targeting the critical clipping and cutting phase across 17 distinct surgical tasks (see Fig. 1). Crucially, our dataset captures not only optimal task executions but also corrective maneuvers, enabling imitation-learning policies to robustly handle real-world surgical variability and recover from errors. This comprehensive dataset and data acquisition approach provide an essential foundation for developing next-generation policies that can autonomously execute surgical workflows at the phase level and, eventually, achieve procedure-level autonomy. This data collection approach could also be used to record other related multi-step, long-horizon surgical workflows, such as suturing or stapling. Moreover, the integration of kinematic information provides an opportunity to analyze the impact of multi-modal inputs on model robustness and reliability, e.g., in surgical workflow analysis, particularly in scenarios involving visual occlusions common in surgical environments. Finally, the dataset’s design naturally supports research in surgical tool pose estimation, further broadening its applicability and potential impact in surgical robotics research.

Fig. 1

Overview of the ImitateCholec Dataset. The multi-modal ImitateCholec dataset comprises over 18,000 demonstrations across 17 surgical tasks (≈ 20 hours of data), integrating endoscopic video and instrument kinematics. The dataset is designed for imitation learning of long-horizon surgical workflows and uniquely includes both ideal executions and annotated recovery maneuvers for robust error detection and correction. Further use cases comprise surgical workflow analysis and surgical tool pose estimation.

Methods

Data Acquisition

The following outlines the data acquisition process used at Johns Hopkins University for collecting visual and kinematic data during the clipping and cutting phases of ex vivo porcine cholecystectomy. All gallbladder organs were sourced from Animal Technologies, Inc. (Tyler, TX, USA). The data acquisition description is structured into four parts: the hardware and software setup, the experimental protocol, and the characteristics of the recorded vision and kinematic data.

Setup

For the data acquisition, we utilized a da Vinci Research Kit (dVRK) Si system17. A dVRK Si has three robotic arms for manipulation tasks (referred to as patient side manipulators (PSMs): PSM1, PSM2, PSM3) and one robotic arm for holding the endoscopic camera (referred to as endoscope camera manipulator (ECM)).

A 12 mm, 0-degree stereo endoscope (Intuitive part number: 380609) is used for collecting video data capturing the surgical target area. Wrist cameras were mounted on the surgical instruments, 3.5 cm from the base of the instrument shafts, to improve the depth perception for the autonomous robot policy, as shown in7.

The clipping and cutting are performed by PSM1 with a medium clip applier (Intuitive part number: 420327) and curved scissors (Intuitive part number: 420178). PSM2, equipped with Cadière forceps (Intuitive part number: 420049), is used to grasp the neck of the gallbladder and manipulate it to better separate and expose the tubes. To replicate the positioning of the key anatomical areas (gallbladder, cystic duct, and cystic artery) in human cholecystectomy procedures, the Cadière forceps on PSM3 holds the ex vivo tissue statically in an almost vertical position during the entire procedure. Once the optimal positioning for the data acquisition is found, PSM3 and the ECM are kept stationary. Additionally, a custom computer-aided design (CAD)-modeled rack is placed below the ex vivo tissue to further hold it in place and prevent tearing of the tissue. For standardized port locations of the surgical arms, a CAD-modeled open structure, based on the port locations of a human cholecystectomy, is positioned between the ex vivo tissue and the surgical robot. The open design specifically facilitates frequent clip reloading and tool switching. Throughout the procedure, a clamp was applied to the end of the cystic duct to prevent bile (fluid stored in the gallbladder) from flowing out of the opening of the cystic duct. The surgical setup is shown in Fig. 2.

During the data acquisition, a data collector sits at the teleoperation console, which is displayed in Fig. 2A. Through the virtual reality (VR) headset connected to the teleoperation console, the data collector has a 3D view of the surgical scene from the da Vinci Si stereo endoscope, similar to the view in the original da Vinci teleoperation console. In the VR headset, we projected a center line (indicating the desired position for the tubes) and two bounding boxes (for PSM1 and PSM2) into the view of the data collector (see Fig. 3) to facilitate the standardized positioning of the two instruments at the start and end of certain recordings. These standardized positions are later referred to as the home positions for PSM1 and PSM2.

The master tool manipulators (MTMs), MTML and MTMR, are used to control the two selected instruments, here the instruments on PSM1 and PSM2. The foot pedals, here from a da Vinci Classic / S foot pedal tray, can be used to readjust the camera and activate the clutch. The remaining foot pedals are used to start and stop the different types of recordings without moving out of the teleoperation console. During the data acquisition, all data streams were recorded concurrently at 30 Hz via the robot operating system (ROS)18 using a publisher-subscriber pattern, i.e., with a separate ROS topic for each stream.

Fig. 2

Surgical Robot Setup. (a) The teleoperation console, consisting of the MTMs (MTML and MTMR), the VR headset for 3D vision of the da Vinci Si endoscope input, the foot pedals for camera and clutch control, and another pedal to start the optimal and recovery demonstration recordings. (b) The dVRK Si surgical robot, consisting of the three surgical arms (PSMs: PSM1-PSM3), with Cadière forceps on PSM2 and PSM3 for grasping the gallbladder, and PSM1 carrying a medium clip applier for clipping and curved scissors for cutting. A 12 mm, 0-degree stereo endoscope is attached to the ECM. (c) An enlarged view of the ex vivo tissue placement together with the open structure; the instruments used are shown at the top. Note that this image shows the ex vivo tissue in a relaxed position before it is set into a vertical position with PSM3. Below the ex vivo tissue is a rack that supports the tissue and prevents it from tearing when it is later placed into a vertical position.

Fig. 3

PSM1 & PSM2 Home Positions. The home positions for both instruments are displayed, which are added to the view of the data collector within the VR headset for a standardized data acquisition.

Protocol

The ex vivo porcine tissues used in this study consist of undissected gallbladders that remain attached to the liver. Prior to the clipping and cutting phase, the tissues must undergo dissection. The Calot’s triangle dissection is performed by blunt dissection using Maryland forceps (Intuitive part number: 420172) to achieve the critical view of safety (CVS), ensuring clear visibility of both the cystic duct and artery for safe clipping and cutting. When ex vivo tissues showed abnormal anatomical structures, such as a cystic artery crossing the cystic duct or branching anomalies (see Fig. 4), these tissues were excluded from the study. Thus, only tissues with two separable tubes were selected. Approximately 10% of tissues were excluded based on such anatomical abnormalities.

Fig. 4

Examples of Excluded Anomalous Anatomy. The images show examples of ex vivo tissues that were excluded due to anomalous anatomy. This includes cases where the cystic duct and artery cross each other, as well as cases where the cystic artery exhibits branching.

Once the gallbladders were dissected and the CVS achieved, data acquisition was performed by two experienced, non-clinical research assistants, each trained by a surgical resident with extensive expertise in performing cholecystectomies. The clipping and cutting phase was segmented into 17 fine-grained surgical tasks, based on the terminology established in the 2021 SAGES consensus19, as illustrated in Fig. 5. First, the left tube (typically the cystic duct) is targeted, and after the cut, the focus shifts to the right tube (typically the cystic artery). The initial surgical task starts with both instruments in their standardized home positions and consists of grasping the gallbladder neck with the Cadière forceps while retracting the gallbladder neck. In task 2, the first clip is placed on the bottom of the left tube, followed by task 3, where the clip applier goes back to the home position to reload a new clip. At the end of the going back task, the clip applier is at the home position, and the gallbladder neck is relaxed by releasing the retraction to prevent tissue damage. Tasks 4 and 6 involve placing the second and third clips on the left tube, with the second clip placed close to the first clip and the third clip placed at the top of the left tube. From task 4 onward, all clipping and cutting tasks include the retraction of the gallbladder neck. Tasks 5 and 7 are the going back tasks after the second and third clip, respectively, and are identical in their execution to task 3. After the three clips on the left tube are set, the cutting of the left tube is performed in task 8, followed by going back to the PSM1 home position in task 9. The clipping and cutting are then repeated for the right tube.

Fig. 5

Clipping & Cutting Phase Workflow. Visualization of the 17 surgical tasks. From grasping the gallbladder to clipping and cutting the left tube (typically the cystic duct) as well as the right tube (typically the cystic artery). After each clipping or cutting task follows a going back to the home position task for reloading a clip or exchanging the instrument.

In addition to the optimally executed surgical tasks described above, alternative versions of all grasping, clipping, and cutting tasks were also recorded. These alternative versions begin with atypical instrument placements to simulate scenarios in which an error has occurred and needs to be corrected. The alternative executions are referred to as recovery demonstrations, while the standard executions are referred to as optimal demonstrations.

To maximize the number of demonstrations per ex vivo tissue and preserve the condition of the tissue, specific techniques were employed. Each surgical task was recorded multiple times on the same tissue before moving on to the next task. For clipping, clips with a disabled latching mechanism were used, allowing repetitive clipping motions without fully locking onto the duct or artery. For cutting motions, the scissors were positioned around the tubes but not fully closed, keeping the tubes in the same condition for subsequent demonstrations. Only in the final demonstration of each clipping or cutting task, which was recorded continuously with the subsequent going back task, was an unmodified clip used or the cut actually performed. To prevent tissue degradation during the data acquisition, which can take multiple hours, the tissues were kept moist throughout the experiments.

Vision Data

The frames of the da Vinci Si stereo endoscopic video were saved, for both the left and the right image, with a downsampled resolution of 960 × 540 from the original resolution of 1920 × 1080. This reduces storage demands while retaining the level of detail required by modern computer vision models. The wrist camera frames, from PSM1 and PSM2, were recorded at a resolution of 640 × 480. Both the da Vinci and wrist camera frames were saved as JPG files.

Kinematic Data

The kinematic data collected from the da Vinci robotic system includes measurements from the two active PSMs (PSM1 and PSM2), the stationary positioning of PSM3, and the ECM. For each manipulator, the Cartesian pose of the instrument tip is recorded as a 7-dimensional vector, including 3D translation coordinates (x,y,z) in meters and a quaternion (x,y,z,w) representing orientation. The ECM tip serves as the reference frame for this setup, with the kinematic data from each PSM measured w.r.t. the ECM frame, as illustrated in Fig. 6. The local instrument tip Cartesian pose represents the pose w.r.t. the remote center of motion (RCM) position of its arm, which was measured for the ECM and all PSMs.

Fig. 6

PSM Poses. Illustration of the PSM1 and PSM2 poses in the camera (ECM) reference frame.

In addition to capturing the Cartesian poses, the joint states of each arm—which represent the rotational and translational configurations of the manipulator’s joints—were recorded to comprehensively characterize the kinematic state of the da Vinci Si surgical system. The ECM comprises four joints, denoted as \({\{{\theta }_{i}\}}_{i=1}^{4}\), with joints 1-2 being rotational (in radians), joint 3 being translational (in millimeters), and joint 4 being rotational. Each of the PSMs includes six joints, denoted as \({\{{\theta }_{i}\}}_{i=1}^{6}\), with rotational joints expressed in radians and the prismatic joint (joint 3) expressed in millimeters. For each PSM, the jaw opening angle θ is also provided, expressed in radians.

Besides the end-effector tip poses of the surgical instruments, the kinematics of the set-up joints (SUJs) for the ECM and all PSMs are included. The SUJ kinematics are given as base Cartesian poses, encoded as 7-dimensional vectors comprising 3D translation components (x, y, z) in meters and orientation in the form of a quaternion (x, y, z, w). Furthermore, the SUJ joint configurations are recorded as 4-dimensional vectors of joint values \({\{{\theta }_{i}\}}_{i=1}^{4}\), with the first joint being translational and expressed in millimeters, and the remaining joints being rotational and expressed in radians.
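As a worked sketch of the pose representation above, the following converts a 7-dimensional pose (3D translation in meters plus an (x, y, z, w) quaternion) into a 4 × 4 homogeneous transform and composes two tip poses expressed in the ECM frame. The numeric poses are illustrative values, not entries from the dataset.

```python
import numpy as np

def pose_to_matrix(t, q):
    """Convert a 7-D pose (translation [x, y, z] in meters and
    quaternion [x, y, z, w]) into a 4x4 homogeneous transform."""
    x, y, z, w = q / np.linalg.norm(q)  # normalize for safety
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: relative transform between the PSM1 and PSM2 tips,
# both expressed in the ECM reference frame (illustrative values).
T_ecm_psm1 = pose_to_matrix(np.array([0.01, 0.02, 0.10]),
                            np.array([0.0, 0.0, 0.0, 1.0]))
T_ecm_psm2 = pose_to_matrix(np.array([-0.02, 0.01, 0.12]),
                            np.array([0.0, 0.0, 1.0, 0.0]))
T_psm1_psm2 = np.linalg.inv(T_ecm_psm1) @ T_ecm_psm2
```

The same composition applies to the SUJ base poses, which share the 7-dimensional encoding.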

Data Curation

All demonstrations were systematically reviewed to ensure correctness and completeness across all recorded modalities, including video and kinematic data. Recordings were temporally trimmed to exclude non-informative segments at the beginning or end, thereby preserving only the relevant portions of each demonstration. Since the actual tissue cutting occurs exclusively in the final repetition of each cutting task, the corresponding kinematic trajectories were augmented by linearly interpolating the jaw opening value toward a fully closed state to simulate the instrument’s closing motion. Manual labeling was not required, as each surgical task was recorded independently. Consequently, demonstrations within each ex vivo tissue sample are organized by surgical task, enabling direct use of folder names as task labels where needed.
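The jaw-closing augmentation can be sketched as a linear ramp appended to the recorded jaw trajectory. The number of interpolation steps and the closed-jaw value here are illustrative assumptions, not the exact parameters used during curation.

```python
import numpy as np

def close_jaw(jaw_angles, n_interp=15, closed_value=0.0):
    """Append a linear ramp from the last recorded jaw opening angle
    to a fully closed value, simulating the final cutting motion.
    n_interp and closed_value are illustrative choices."""
    ramp = np.linspace(jaw_angles[-1], closed_value, n_interp + 1)[1:]
    return np.concatenate([jaw_angles, ramp])

jaw = np.array([0.6, 0.58, 0.55])        # jaw opening angles in radians
augmented = close_jaw(jaw, n_interp=5)   # ends fully closed at 0.0
```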

Data Records

The dataset is hosted on the Johns Hopkins Research Data Repository20. The repository is publicly accessible without user registration. The data is organized in a five-level directory hierarchy, as illustrated in Fig. 7: Level 1 folders correspond to individual data collectors. Within each collector folder (Level 2), there is one subfolder per ex vivo tissue sample. Each tissue folder (Level 3) contains subfolders for each of the 17 surgical tasks. Only tissues 1, 11, 12, and 16 are missing one or more task recordings; all others include the full set of 17 optimal tasks. When recovery demonstrations were recorded, they were saved in a new surgical task folder named after the surgical task with the suffix _recovery. Within each task folder (Level 4), each demonstration is stored in a separate folder named according to its recording start timestamp. The number of demonstrations per task varies by tissue and task. Each demonstration folder (Level 5) contains a kinematic.csv file and four image subfolders: da_vinci_stereo_left, da_vinci_stereo_right, endo_psm1, and endo_psm2. Images are saved as JPG files named frame_<frame_id>.jpg, with <frame_id> running from 0 to the last frame index. Table 1 provides a detailed description of each column in kinematic.csv.
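The five-level hierarchy and the documented file naming (kinematic.csv, frame_<frame_id>.jpg, the four camera subfolders) can be traversed with a short Python sketch. The camera folder names match the description above; the collector, tissue, task, and timestamp folder names used in any example call are hypothetical.

```python
from pathlib import Path
import csv

CAMERA_DIRS = ["da_vinci_stereo_left", "da_vinci_stereo_right",
               "endo_psm1", "endo_psm2"]

def load_demonstration(demo_dir):
    """Load one demonstration folder: kinematic rows as dicts plus
    numerically sorted frame paths for each of the four cameras."""
    demo_dir = Path(demo_dir)
    with open(demo_dir / "kinematic.csv", newline="") as f:
        kinematics = list(csv.DictReader(f))
    frames = {
        cam: sorted((demo_dir / cam).glob("frame_*.jpg"),
                    key=lambda p: int(p.stem.split("_")[1]))
        for cam in CAMERA_DIRS
    }
    return kinematics, frames

def iter_demonstrations(root):
    """Yield (collector, tissue, task, demo_path) for every
    demonstration in the five-level hierarchy."""
    for collector in sorted(Path(root).iterdir()):
        if not collector.is_dir():
            continue
        for tissue in sorted(collector.iterdir()):
            for task in sorted(tissue.iterdir()):
                for demo in sorted(task.iterdir()):
                    yield collector.name, tissue.name, task.name, demo
```

Because each task folder name doubles as the task label, the third tuple element returned by iter_demonstrations can be used directly for supervised training.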

Fig. 7

Dataset Folder Structure Overview. The folder structure of the ImitateCholec dataset consists of five levels: (1) collectors, (2) tissues, (3) surgical tasks, (4) demonstrations, and (5) modalities.

Table 1 Feature Descriptions for the da Vinci Robot System. Description of the features, their dimensions, and units in the da Vinci robot system. For all features with an asterisk *, there exist current values and target values. The same features were recorded for PSM3, which was kept stationary during the demonstrations.

The resulting dataset consists of a total of 34 ex vivo tissues. In total, it contains 13,188 optimal and 5,155 recovery demonstrations, amounting to about 20 hours of data. A dataset summary can be found in Table 2 and more details on the demonstration distribution are shown in Fig. 8.

Table 2 Dataset Summary. Statistics for the data collected by the two data collectors.
Fig. 8

Number of Demonstrations per Surgical Task. Number of demonstrations recorded for each of the 17 surgical tasks, for the optimal and recovery demonstrations of data collector 1 and data collector 2.

Technical Validation

Validation of Vision Data

To ensure dataset reliability and quality, we implemented comprehensive validation procedures focusing initially on visual data. Each demonstration was carefully reviewed to confirm it depicted a complete and correctly executed task within the defined task boundaries (see Fig. 9). Specifically, the first and last frames of each video were examined to confirm proper initiation and completion, ensuring that instruments started in the designated start pose and successfully completed the intended task. Additionally, each recording was checked to confirm it contained no segments from previous or subsequent tasks, maintaining strict adherence to the predefined task boundaries.

Fig. 9

Camera Input. Images from the wrist cameras (PSM2 and PSM1) with the left image from the da Vinci Si stereo endoscope.

Validation of Kinematic Data

Subsequently, we assessed the kinematic domain to detect potential artifacts, such as sudden spikes in instrument motion (see Fig. 10). For each demonstration, we calculated the delta in translational distances and angular orientations between consecutive time points, enabling the identification of abrupt trajectory changes. Across individual tissues, the average maximum translational distances recorded were 1.72 mm for PSM1 and 1.60 mm for PSM2. The average maximum angular orientation change was 6.64° for PSM1 (with one notable outlier at 12.32°) and 2.77° for PSM2.
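The per-step delta computation described above can be sketched as follows, given one instrument's translation and quaternion sequences. The spike thresholds in flag_spikes are illustrative assumptions, not the criteria used for the dataset.

```python
import numpy as np

def trajectory_deltas(positions, quaternions):
    """Per-step translational distance (meters) and angular change
    (degrees) between consecutive samples of one instrument tip.
    positions: (N, 3) array; quaternions: (N, 4) array in (x, y, z, w)."""
    d_trans = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    q = quaternions / np.linalg.norm(quaternions, axis=1, keepdims=True)
    # Angle between consecutive orientations: 2 * arccos(|<q_t, q_{t+1}>|)
    dots = np.abs(np.sum(q[:-1] * q[1:], axis=1)).clip(0.0, 1.0)
    d_ang = np.degrees(2.0 * np.arccos(dots))
    return d_trans, d_ang

def flag_spikes(d_trans, d_ang, trans_thresh=0.002, ang_thresh=10.0):
    """Indices whose motion exceeds illustrative thresholds
    (2 mm translation or 10 degrees rotation per 30 Hz step)."""
    return np.where((d_trans > trans_thresh) | (d_ang > ang_thresh))[0]
```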

Fig. 10

Kinematics (Optimal and Recovery Demonstrations). Examples of optimal and recovery demonstration kinematics recorded for the main tasks, i.e., grasping, clipping, cutting (with going back recorded only for the optimal demonstrations). In both subfigures, red indicates the PSM2 trajectory and blue indicates the PSM1 trajectory. The overlaid trajectories highlight consistent, convergent paths for optimal demonstrations versus the more heterogeneous, divergent paths of recovery demonstrations.

In a final detailed validation step, we examined outlier cases by selecting the demonstrations with the 20 shortest and longest execution times and cumulative path lengths for both PSM1 and PSM2 (see Fig. 11). These demonstrations underwent a thorough review to confirm their correctness in both modalities and to ensure that the visual recordings correspond to the overall kinematic sequence. Movements observed in the videos were manually compared to their corresponding kinematic trajectories to confirm consistency between both modalities. All reviewed outliers exhibited correct and consistent behavior.

Fig. 11

Execution Time & Trajectory Lengths Statistics. Histograms of total execution time, and PSM1/PSM2 trajectory lengths (rows) across tasks—grasping, clipping, cutting, and going back (columns)—for optimal and recovery demonstrations (going back only in optimal). PSM2 handles grasping; PSM1 performs clipping and cutting.

Beyond this subset check, the internal consistency of the dataset is further supported by prior work16 using these recordings, in which learned policies successfully reproduced the clipping and cutting phase on eight unseen ex vivo tissues, indicating that the visual and kinematic modalities are sufficiently coherent for learning robust phase-level behaviors.

Validation of Task Labels

Demonstration labels were directly assigned based on their parent surgical task folders. Validation of these labels was performed through the previously described visual inspections, particularly by evaluating the start and end frames and verifying the appropriate folder placement. These steps ensured each demonstration accurately represented its intended task, confirming label correctness implicitly through this broader validation process.

Limitations of the Dataset

The ImitateCholec dataset has several limitations that should be acknowledged. The dataset was collected using a single surgical robotic system (dVRK Si). Thus, its generalizability to other robotic setups remains uncertain. Although ImitateCholec captures multiple fine-grained tasks involved in clipping and cutting, its restriction to this specific phase of cholecystectomy limits its representation of the full procedural complexity seen in other surgical contexts.

During initial data acquisition, several discrepancies arose in the positioning and orientation of wrist-mounted cameras. Until tissue 4, wrist cameras were positioned closer than the default 3.5 cm distance. For tissue 2, improper mounting of the PSM2 wrist camera caused a 180-degree rotation. Tissues 3 to 9 exhibited variations in PSM1 wrist camera rotation, resulting in altered perspectives. These inconsistencies can be addressed through rectification and augmentation strategies, with detailed transformation procedures and implementation code available in our public repository.

The dataset does not include hardware-level synchronization across sensors. Although all streams were sampled concurrently, the cameras and kinematic sources do not share hardware clocks or trigger signals, and exact multi-view or video-kinematic synchronization cannot be guaranteed.

Despite these limitations, ImitateCholec lays a strong foundation for advancing the development of autonomous robotic systems through imitation learning by providing data for the entire clipping and cutting phase for cholecystectomy.

Usage Notes

The dataset is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use and distribution provided the original work is properly cited.

This dataset is specifically developed to support imitation learning in the context of long-horizon surgical workflows, encompassing entire surgical phases composed of sequential surgical tasks. It can be used to train custom policies via imitation learning to autonomously perform the clipping and cutting phase of cholecystectomy on ex vivo porcine tissue. Moreover, the dataset facilitates co-training for novel surgical tasks—a strategy demonstrated to improve task performance and data efficiency in recent imitation learning studies21,22. The dataset provides core surgical primitives (grasping, retraction, clipping, cutting) that recur across many procedures, enabling co-trained models to learn transferable behaviors that support generalization beyond the specific phase captured here. Using the action representations proposed in7,16, this data can also be used for other dual-arm surgical systems that support Cartesian space control, such as a da Vinci Classic system.

As the dataset is designed for imitation learning, the surgical phases are not captured as continuous recordings but are instead segmented into fine-grained task-specific sequences. To use the dataset for surgical workflow analysis over the entire clipping and cutting phase, neighboring task recordings can be concatenated to synthetically reconstruct full surgical phase recordings. This allows for the simulation of task transitions and the modeling of phase-level surgical workflows. A protocol modification was introduced starting from tissue 9: the retraction of the gallbladder neck was moved from task 2 (clipping first clip left tube) to task 1 (grasping gallbladder). This adjustment aimed to enhance safety and success rates in imitation learning but introduced an inconsistency in the dataset, complicating workflow analysis due to altered task boundaries. To maintain consistency across the dataset, we recommend excluding task 1 from the annotated tasks and starting the workflow analysis from task 2.
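Concatenating neighboring task recordings into a synthetic phase-level sequence can be sketched as follows, assuming a workflow-ordered list of task folder names (the names in the usage example are hypothetical; the actual folder naming follows the structure in Fig. 7). In line with the recommendation above, the task order can begin at task 2.

```python
from pathlib import Path
import csv

def concatenate_tasks(tissue_dir, task_order):
    """Synthetically reconstruct a phase-level recording by
    concatenating one demonstration per task, in workflow order.
    Returns the concatenated kinematic rows and a per-step task
    label for workflow analysis. task_order is a list of task
    folder names; missing tasks are skipped."""
    rows, labels = [], []
    for task in task_order:
        task_dir = Path(tissue_dir) / task
        if not task_dir.is_dir():
            continue
        demos = sorted(task_dir.iterdir())  # timestamp-named folders
        if not demos:
            continue
        with open(demos[0] / "kinematic.csv", newline="") as f:
            demo_rows = list(csv.DictReader(f))
        rows.extend(demo_rows)
        labels.extend([task] * len(demo_rows))
    return rows, labels
```

Selecting demos[0] takes the earliest demonstration of each task; sampling a random demonstration per task instead would yield many distinct synthetic phase recordings from the same tissue.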

The dataset also enables the development of surgical tool pose estimation by providing video data and corresponding kinematic recordings of the surgical instruments.

For general clinical use, the inclusion of recovery demonstrations further enables the analysis of failure states, providing valuable insight into error recognition and correction within the surgical workflow. This could support future trainee-guidance systems in training scenarios (e.g., cadaveric setups) by suggesting simple corrective primitives before the clipping or cutting step, such as moving either instrument slightly upward, downward, or toward/away from the surgeon, and by notifying the trainee when an erroneous action has been performed.