Abstract
This paper introduces the Trauma THOMPSON dataset, designed to advance AI-driven decision support for life-saving interventions (LSIs) in emergency care, particularly in resource-constrained humanitarian settings. The dataset comprises 3,717 high-resolution egocentric video clips of both regular and just-in-time (JIT) procedures. The JIT procedures consist of videos of the same LSI procedures performed with makeshift tools and are useful for studying human medical commonsense. Each clip is annotated by medical professionals in verb-noun format, such as “take scalpel” and “make incision”. In addition to action segments, the dataset includes annotations for medical visual question answering (MVQA), hand maneuvers, and object detection. Ultimately, the dataset and its annotations can be used to train an AI agent that advises first responders in the field on what to do next with the resources at hand. We provide benchmarks for action recognition, anticipation, and MVQA using state-of-the-art machine learning models.
Background & Summary
Recent advances in generative AI and machine learning (ML) have enabled the development of foundation models for virtual agents and assistants capable of supporting real-time user performance in unstructured, uncontrolled, and highly variable tasks across diverse domains1,2. These models are designed to process and analyze textual, visual, or multimodal data and to output useful information. Training these large models typically follows a two-stage process: pretraining on a vast amount of data in a self-supervised or unsupervised fashion, then fine-tuning on annotated, domain-specific datasets to perform particular downstream tasks3,4,5. This paradigm ensures that the large models can extract common textual and visual features and develop a broad understanding of general knowledge, while remaining able to adapt to domain-specific problems. One such case, and the focus of this paper, is high-stakes applications in medicine, healthcare, and critical care. As AI researchers strive to integrate multimodal intelligence into the critical care workflow, datasets with high specificity and quality annotations will be pivotal in bridging the gap between theoretical advancements and practical applications. For an AI mentor, recognizing the current action, anticipating the next action, and answering medical questions are among the most important functions for guiding inexperienced care providers through critical care tasks. Robust hand tracking and precise object detection enable skill evaluation and tool localization, reducing human error and supporting safer and more effective care.
In the development of AI agents, domain-specific datasets are essential for effectively addressing tasks within specialized fields such as medicine. While generalized datasets are useful for pretraining and capturing broad patterns of human behavior, they often lack the granularity required to navigate the complex, nuanced scenarios encountered in clinical environments. Medical tasks—ranging from diagnostics and surgical guidance to critical care management—demand datasets that accurately represent procedural intricacies, patient variability, and the use of specialized equipment. For instance, datasets like MIMIC-III6 and DAISI7 have played key roles in advancing AI for electronic health record analysis and surgical instruction, respectively. However, these datasets largely omit critical care and trauma scenarios.
To develop AI mentors capable of assisting inexperienced caregivers during hands-on procedures, it is crucial that such systems can perceive, recognize, and predict human actions in real time. Egocentric datasets have been proposed for this purpose. For example, the EPIC-KITCHENS 100 Dataset8 provides large-scale and first-person video recordings of everyday kitchen tasks annotated with fine-grained action labels, enabling models to learn context-aware behavior prediction. In the medical domain, the Egosurgery-Phase dataset9 addresses a longstanding gap in surgical phase recognition for open surgery. Comprising 15 hours of head-mounted egocentric video across nine surgical phases and enriched with eye gaze data, it offers a valuable foundation for modeling surgical attention and decision-making. Similarly, MedVidCL10 supports AI development for surgical diagnostics and training. However, these datasets are typically collected in structured, controlled environments—primarily operating rooms—and do not reflect the urgency, improvisation, or contextual constraints common in emergency medicine. As a result, their applicability to real-world, resource-limited, or humanitarian settings remains limited. The lack of datasets capturing the spontaneity and variability of real trauma scenarios continues to be a barrier to AI generalization in emergency care.
Beyond datasets, existing AI systems designed for surgical mentoring have focused largely on planned procedures in well-resourced environments. For instance, the Virtual Operative Assistant11 delivers real-time feedback to neurosurgery trainees during tumor resections in a VR environment, significantly enhancing technical performance. Other systems provide spatial navigation support during surgery12,13 or enable post-operative performance evaluation14,15,16. However, these technologies are typically limited to operating rooms17 or laboratory simulations7,18,19,20,21,22 and do not extend to unpredictable, low-resource environments. Despite the critical importance of timely interventions in trauma and emergency medicine, AI systems capable of guiding life-saving procedures under austere or chaotic conditions are still lacking. This is particularly problematic given that many emergency care providers are undertrained in life-saving interventional (LSI) procedures when operating in such environments23,24,25.
The Trauma THOMPSON dataset26,27,28 fills these gaps by offering a highly specialized and annotated video dataset tailored to trauma and emergency care. It not only helps improve AI model performance but also keeps predictions grounded in clinical practice. The Trauma THOMPSON dataset is a collection of annotated video clips designed to advance research and development in autonomous AI mentorship for humanitarian medicine. To the best of our knowledge, Trauma THOMPSON is unprecedented in terms of scale, real-world settings, unique challenges, and practical applicability. The highlights of this dataset are as follows. We created the first egocentric-view dataset focused on operational medicine to assist field medics and guide inexperienced users in performing emergency care procedures. Designed for resource-constrained, uncontrolled, and urgent scenarios, the dataset extends a previous iteration and includes 220 videos comprising 3,717 annotated clips covering five unscripted life-saving procedures: cricothyroidotomy (CR), tube thoracostomy (CT), tourniquet application (TQ), intraosseous infusion (IO), and needle thoracostomy (ND). It includes not only regular LSI procedures but also just-in-time (JIT) procedures, in which the same emergency procedures are performed with unconventional makeshift tools. The JIT procedures make the dataset more challenging by introducing greater variability in tools and environments, and they are useful for studying human medical commonsense. In addition, we include annotations for medical visual question answering (MVQA)29, hand maneuvers, and object detection. MVQA can serve as a clinical decision support tool, allowing caregivers to extract critical insights from medical imagery through natural language interaction. The dataset is openly accessible and freely available to foster the development of AI applications in medical care.
We benchmarked the dataset on action recognition, action anticipation, and MVQA with multiple algorithms, demonstrating how machine learning models can leverage the dataset’s annotations to predict therapeutic actions essential for humanitarian medicine and resuscitative care.
Methods
Data Collection
This study was approved by the Institutional Review Board (IRB) under protocol number 223046, reviewed and overseen by the Geneva Foundation team. Participants were medical professionals recruited through internal invitations at the Madigan Army Medical Center and the University of Calgary based on their clinical expertise. All participants provided written informed consent prior to participation, agreeing to be video-recorded from an egocentric perspective while performing simulated medical procedures for research and dataset development purposes. To ensure privacy and confidentiality, all video data underwent human review, and any video showing personally identifiable features, such as tattoos, was removed. No physician-identifying or personally traceable information was included in the final dataset.
Procedures Identification
Regular Procedures
A team of subject matter experts (SMEs) developed the Trauma THOMPSON dataset. The team included surgeons, critical care physicians, and emergency medicine practitioners, each with 5-20 years of experience in deployed or trauma care settings. They identified essential procedures for Tactical Combat Casualty Care (TCCC)30, such as cricothyroidotomy and tourniquet application, and compiled a task list describing the fundamental steps for the successful performance of each procedure during video collection. Additionally, a focus group of 15-30 SMEs collaborated to establish a consensus on the dataset’s content and best practices. TCCC is structured around the MARCH algorithm (Massive Hemorrhage, Airway, Respiration, Circulation, Hypothermia/Head Injury)31. The MARCH framework prioritizes life-saving interventions performed under fire or during tactical field care, such as controlling major bleeding, securing airways, treating chest injuries, managing shock, and preventing hypothermia. A survey was conducted to assess agreement on various TCCC procedures and skills. After its completion, the SMEs engaged in discussions to refine the rankings, identify any missing procedures, and establish optimal practices for performing them. This process resulted in a finalized list of five procedures for the Trauma THOMPSON dataset based on the MARCH algorithm, as shown in Fig. 1: TQ for Massive Hemorrhage, CR and ND for Respiration, CT for Airway, and IO for Circulation. Detailed instructions outlining the fundamental steps for collecting these procedures are given in subsequent sections.
Fig. 1: The Trauma THOMPSON dataset.
Just-in-time Procedures (JIT)
In addition to standard techniques, the medical team with extensive field experience developed improvised methods to perform the same five life-saving procedures using alternative tools and materials readily available in resource-constrained environments. These improvised approaches, guided by physicians’ expert judgment and adaptability, reflect the ingenuity required in emergency care and underscore the importance of designing AI systems that can understand and adapt to non-standard techniques, ensuring broader applicability in diverse operational contexts. To support this, the dataset was further enriched with videos demonstrating “just-in-time” procedures utilizing improvised, non-traditional equipment. These recordings captured creative solutions such as using belts or clothing with a screwdriver to fashion tourniquets, scissors for incision and expansion during tube thoracostomy, and screwdrivers to aid in tube insertion. For improvised needle cricothyroidotomy, a needle was used in place of a standard incision and tube for emergency airway access, and manual intraosseous needle placement was demonstrated in the absence of a functioning needle driver.
Recording
We focused on recording natural and unscripted life-saving intervention (LSI) procedures from a first-person perspective, capturing actions such as operating medical tools, searching for items, reconsidering decisions, and managing unexpected challenges. To achieve this, commercial GoPro Hero7 Black cameras (GoPro, San Mateo, California) were mounted on the participants’ heads to capture egocentric views. The surgeons positioned the cameras at a 20-30° angle relative to their foreheads so that the hands were typically centered in the frame and clearly visible during procedures. Filming was conducted across various simulation models and environments, including those that resemble field conditions. All recordings were de-identified to protect privacy. The videos were captured in 1080p resolution to maintain high-quality visuals.
Annotation
The annotation process followed common practices used in activity recognition datasets, specifically the EPIC-KITCHENS dataset. Annotators were instructed to identify fine-grained medical actions, limiting each to a maximum duration of 5 seconds for easier data processing. Longer actions were divided into sub-actions or multiple video clips. The vocabulary for the medical actions is drawn from terminology common in TCCC. The dataset annotations include the start and end timestamps, as well as the actions represented as verb-noun pairs for each video clip (e.g., take scalpel, incise skin), as adopted by the EPIC-KITCHENS dataset. The process is illustrated in Fig. 2. Medical professionals annotated each procedural step by specifying the precise timestamps in mm:ss format and the corresponding actions. The annotated data were then reviewed by peers to ensure the accuracy of timestamping and video segmentation. The annotation work was carried out by project managers and research assistants with specialized training and 2-5 years of experience in military medicine from The Geneva Foundation and Madigan Army Medical Center.
Fig. 2: Illustration of the annotation process.
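The timestamp handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`start`, `end`, `verb`, `noun`) is hypothetical, and the mechanical splitting shown here stands in for the annotators' manual division of long actions into sub-actions or multiple clips.

```python
def mmss_to_seconds(stamp: str) -> int:
    """Convert an 'mm:ss' timestamp into whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def split_segment(start: int, end: int, max_len: int = 5) -> list[tuple[int, int]]:
    """Split [start, end] into consecutive pieces of at most max_len seconds."""
    if end <= start:
        raise ValueError("end must be later than start")
    pieces = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + max_len, end)
        pieces.append((cursor, nxt))
        cursor = nxt
    return pieces


# Hypothetical annotation record; the field names are assumptions for illustration.
annotation = {"start": "01:12", "end": "01:24", "verb": "incise", "noun": "skin"}
start_s = mmss_to_seconds(annotation["start"])
end_s = mmss_to_seconds(annotation["end"])
clips = split_segment(start_s, end_s)
for clip_start, clip_end in clips:
    print(annotation["verb"], annotation["noun"], clip_start, clip_end)
```

A 12-second segment is cut into pieces of 5, 5, and 2 seconds, so every resulting clip respects the 5-second limit.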
Data quality assurance
To ensure the accuracy of annotations, each procedure’s actions are annotated by three medical professionals. One annotator generates the initial annotations, while the other two reviewers validate them. Since the annotations are estimates and precise timestamps for each procedure cannot be guaranteed, a method is proposed to calculate annotation accuracy.
The actual timestamp, $t_a$, is defined as the average of the timestamps provided by the annotator and the reviewers. Let $t_o$ represent the timestamp from the annotator, $t_{r_i}$ the timestamp from reviewer $i$, and $n_r$ the number of reviewers. The actual timestamp is calculated as:
$$t_a = \frac{t_o + \sum_{i=1}^{n_r} t_{r_i}}{n_r + 1}$$
For each clip, $t_{as}$ and $t_{ae}$ represent the actual start and end times, while $t_{os}$ and $t_{oe}$ denote the original start and end times determined by the annotator. The annotation accuracy for each clip is computed by dividing the overlapping time between the original and actual timestamps by the actual clip duration. The overlapping time is determined using:
$$t_{\text{overlap}} = \max\bigl(0,\ \min(t_{ae}, t_{oe}) - \max(t_{as}, t_{os})\bigr)$$
The accuracy for a clip, $p_i$, is then calculated as:
$$p_i = \frac{t_{\text{overlap}}}{t_{ae} - t_{as}}$$
Finally, the average annotation accuracy (acc) is computed using:
$$\text{acc} = \frac{1}{N} \sum_{i=1}^{N} p_i$$
where N is the total number of clips. Then, the final average accuracy is computed as:
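As an illustration of the procedure described in this section, the following minimal Python sketch computes the consensus timestamps, the overlap, the per-clip accuracy, and the mean over clips. The function names and the tuple layout are assumptions made for this example and are not taken from the dataset tooling.

```python
def consensus(annotator_t: float, reviewer_ts: list[float]) -> float:
    """Average the annotator's timestamp with all reviewer timestamps."""
    return (annotator_t + sum(reviewer_ts)) / (len(reviewer_ts) + 1)


def clip_accuracy(t_os: float, t_oe: float,
                  reviewer_starts: list[float],
                  reviewer_ends: list[float]) -> float:
    """Overlap between original and consensus intervals, divided by the
    consensus (actual) clip duration."""
    t_as = consensus(t_os, reviewer_starts)
    t_ae = consensus(t_oe, reviewer_ends)
    if t_ae <= t_as:
        raise ValueError("consensus clip has non-positive duration")
    overlap = max(0.0, min(t_ae, t_oe) - max(t_as, t_os))
    return overlap / (t_ae - t_as)


def mean_accuracy(clips: list[tuple[float, float, list[float], list[float]]]) -> float:
    """Average the per-clip accuracy over all clips."""
    return sum(clip_accuracy(*clip) for clip in clips) / len(clips)


# Hypothetical clip: annotator marks 10 s to 15 s, two reviewers differ slightly.
example = (10.0, 15.0, [10.0, 11.0], [15.0, 16.0])
print(round(clip_accuracy(*example), 4))
```

With one annotator and two reviewers, the consensus interval in this example shifts by a third of a second, and the per-clip accuracy is the fraction of the consensus interval that the annotator's original interval covers.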
Medical Visual Question Answering
The MVQA dataset is built upon the egocentric video dataset and is enriched with detailed annotations in the form of natural language questions paired with corresponding plausible answers. These annotations focus on clinically relevant aspects and are designed to simulate the types of inquiries a medical AI mentor might encounter when assessing a situation or guiding a procedure. The questions are not procedure-specific but rather general and orthogonal to the procedure type. As illustrated in Fig. 3, each annotated scene includes a visual frame from the egocentric video, accompanied by multiple questions, each with a set of 3 to 5 potential answer choices. The questions fall into three types based on their reasoning complexity. Descriptive questions involve direct observation of visible elements in the scene, such as “What limb is injured?” with the answer “Right arm.” Interpretive questions require analytical reasoning to infer the presence or absence of certain conditions, such as “Is there any bleeding?” with the answer “No.” Finally, contextual questions require an understanding of spatial or procedural relationships, such as “What is the current action?” with the answer “Take tourniquet.” The MVQA dataset enables the development and evaluation of AI systems capable of interpreting emergency scenarios and reasoning about injuries, even when faced with uncertainty or incomplete visual cues.
Fig. 3: Illustration of Visual Question Answering.
Hand Maneuvers and Object Detection
To efficiently generate high-quality bounding box annotations, a human-in-the-loop strategy is employed, combining manual labeling with automated tracking. Annotators manually label hands and objects every 10-30 frames, while intermediate frames are automatically annotated using CSRT trackers32 provided through OpenCV. We annotated for the left and right hands, as well as 15 distinct medical instruments, as illustrated in Fig. 4. The hand annotations include bounding boxes, object classes, and object IDs for tracking, whereas the object annotations include bounding boxes and object classes for detection. Vision-language models (VLMs) were used to identify hands and detect objects that are crucial in clinical settings, particularly where precision is vital. Reliable hand tracking could enable the evaluation of procedural skills based on hand maneuvers in real time, providing feedback on bimanual coordination and task performance33,34. Additionally, integrating object detection with natural language interfaces allows healthcare providers to query AI systems about tool locations, thereby reducing cognitive effort and the likelihood of errors. Recent models such as Florence-235 and F-VLM36 have showcased strong object recognition capabilities37, underscoring the potential of unified VLMs to support diverse visual tasks in medical workflows.
Fig. 4: Hand and object annotations.
Data Records
We have archived a total of 37 data files for the Trauma THOMPSON dataset on Harvard Dataverse38, available at https://doi.org/10.7910/DVN/V5BTRU. The dataset is split into multiple archive files to comply with the platform’s file size limitations. Each archive contains a subset of the videos and annotations relevant to the respective tasks. The dataset includes video recordings in MP4 format and annotations supporting multiple downstream tasks in computer vision and multimodal learning. Annotations for action recognition and anticipation tasks are provided in CSV format. Each row in the CSV file corresponds to a labeled temporal segment, including the video identifier, timestamp ranges, and associated action labels. Separate CSV files are included for training, validation, and test splits. Annotations for the MVQA task are stored in a single .json file. Each entry includes the question, answer, relevant frame identifiers, and associated metadata. This file allows mapping clinical questions to specific visual contexts in the videos and is designed to support multimodal question answering models. Hand annotations are stored in COCO-style .json format. Each file includes bounding boxes for left and right hands across frames, along with instance-level segmentation (where applicable). These annotations support the training of models for motion analysis and skill assessment in emergency medical procedures. Object annotations are formatted in YOLO-compatible .txt files. Each file corresponds to a single frame and contains bounding box coordinates and class labels for surgical tools and other relevant objects. These annotations facilitate real-time tool detection for clinical decision support.
A README file is included in the root directory of the dataset, providing detailed documentation of each annotation format, label taxonomy, and code snippets to assist with data loading and preprocessing. This dataset supports a broad range of AI tasks including action recognition, anticipation, visual question answering, hand motion analysis, and object detection, making it a comprehensive resource for multimodal learning in emergency medicine scenarios.
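As a hedged illustration of how the YOLO-compatible object labels might be consumed, the sketch below converts one label line into pixel coordinates. It assumes the conventional YOLO text layout (class index followed by normalized center x, center y, width, and height) and the 1920 × 1080 frame size; the class index in the example is hypothetical, and the README in the dataset remains the authoritative description of the format.

```python
def yolo_to_pixels(line: str, img_w: int = 1920, img_h: int = 1080):
    """Convert 'class cx cy w h' (normalized) to (class, (x_min, y_min, x_max, y_max))."""
    cls, cx, cy, w, h = line.split()
    cx_px = float(cx) * img_w
    cy_px = float(cy) * img_h
    w_px = float(w) * img_w
    h_px = float(h) * img_h
    box = (cx_px - w_px / 2, cy_px - h_px / 2,
           cx_px + w_px / 2, cy_px + h_px / 2)
    return int(cls), box


# Hypothetical label line: class 3, box centered in the frame.
class_id, box = yolo_to_pixels("3 0.5 0.5 0.25 0.5")
print(class_id, box)
```

Keeping the conversion in one small function makes it easy to swap the frame size if frames are resized before training.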
Data Overview
Dataset statistics
Currently, the dataset includes 220 videos covering five emergency care procedures, with a total of 3,717 fully annotated video clips. For classification tasks, the class distribution reflects the real-world frequency of each procedure. From the 1080p videos (1920 × 1080), more than 593,000 interventional frames have been extracted. The dataset is divided into training and testing sets at the video level using a random 80-20 split, following common practice in other datasets to ensure sufficient data for model training while reserving a representative portion for evaluation. The training set consists of 2,477 clips of regular procedures, while the testing set is split into two parts containing 683 clips of regular procedures and 557 clips of JIT procedures. The regular procedure subset includes 42 verb classes, 42 noun classes, and 124 action classes (unique verb-noun pairs), and the JIT procedure subset includes 28 verb classes, 32 noun classes, and 86 action classes. In total, there are 45 verb classes, 49 noun classes, and 162 action classes in the Trauma THOMPSON dataset.
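Splitting at the video level keeps all clips from one recording on the same side of the split, which avoids leaking near-identical frames into the test set. A minimal sketch of such a split is shown below; the `video_id` field, the seed, and the rounding rule are assumptions for illustration, and the released CSV files already define the actual splits.

```python
import random


def split_by_video(clips: list[dict], test_fraction: float = 0.2, seed: int = 0):
    """Randomly assign whole videos (not individual clips) to train or test."""
    video_ids = sorted({clip["video_id"] for clip in clips})
    rng = random.Random(seed)
    rng.shuffle(video_ids)
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_videos = set(video_ids[:n_test])
    train = [c for c in clips if c["video_id"] not in test_videos]
    test = [c for c in clips if c["video_id"] in test_videos]
    return train, test


# Toy data: 10 hypothetical videos with 3 clips each.
toy_clips = [{"video_id": f"vid{v:02d}", "clip": k} for v in range(10) for k in range(3)]
train_clips, test_clips = split_by_video(toy_clips)
print(len(train_clips), len(test_clips))
```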
Video duration
The dataset contains a total video duration of 11,528 seconds (roughly 192 minutes). Average video durations are computed separately for the regular and JIT scenarios across the five medical procedures. For regular procedures, the average durations are 76.49 seconds for CR, 121.19 seconds for CT, 48.36 seconds for IO, 41.18 seconds for ND, and 41.26 seconds for TQ, resulting in an overall average duration of 64.96 seconds. In comparison, JIT procedures are consistently shorter, with 42.40 seconds for CR, 69.66 seconds for CT, 38.11 seconds for IO, 26.67 seconds for ND, and 38.13 seconds for TQ, yielding an overall average duration of 42.48 seconds. These results indicate that JIT demonstrations emphasize fast task execution with makeshift tools under time-critical conditions.
Action annotation diversity
The unique action classes per procedure are computed to compare action and task complexity between regular and JIT procedures. In the regular procedures, CR has 52 unique actions, CT has 49, IO has 38, ND has 23, and TQ has 18. In the JIT procedures, CR has 42 unique actions, CT has 38, IO has 16, ND has 11, and TQ has 28. These results indicate differences in the distribution of action classes across procedure types and conditions.
Annotation accuracy statistics
Given the dataset’s large size, our dataset reviewers were instructed to randomly evaluate 60 videos following the specified guidelines. The review results show a temporal accuracy of 99.4%, with action, verb, and noun label accuracies of 97.2%, 97.2%, and 97.7% respectively.
Technical Validation
Dataset class distribution
Fig. 5 illustrates the distribution of regular procedures while Fig. 6 shows the distribution of JIT procedures. The two figures demonstrate the imbalanced nature of the current dataset, which results from the data being collected from an unscripted source, thereby reflecting the true frequency of procedures in real-world settings. However, it is crucial to recognize that as class imbalance in the training data increases, the performance of algorithms typically declines. Therefore, working with an imbalanced dataset presents greater challenges for algorithm development compared to a balanced one.
Fig. 5: Action frequency of the regular procedures.
Fig. 6: Action frequency of the JIT procedures.
Action class co-occurrences
Fig. 7 illustrates the distribution of verb-verb, noun-noun, and verb-noun pair frequencies in the regular procedures. Verbs such as ‘take’, ‘remove’, and ‘drop’ frequently co-occur with many different nouns. This trend aligns with the action patterns commonly observed during the LSI procedures. The JIT procedures follow a similar pattern.
Fig. 7: Frequency of verb-noun class co-occurrences in regular procedures.
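Verb-noun co-occurrence counts of this kind can be derived directly from the action labels. The sketch below is a minimal illustration that assumes each label is a single verb followed by a noun phrase; the short label list is made up from example actions mentioned in this paper and is not a sample of the real annotations.

```python
from collections import Counter

# Toy action labels in verb-noun format (illustrative only).
actions = ["take scalpel", "incise skin", "take tourniquet", "drop scalpel", "take scalpel"]

# Split each label once, so multi-word nouns stay intact.
pairs = Counter(tuple(label.split(" ", 1)) for label in actions)
verbs = Counter(verb for verb, _ in pairs.elements())
nouns = Counter(noun for _, noun in pairs.elements())

print("action classes:", len(pairs))
print("verb classes:", len(verbs))
print("noun classes:", len(nouns))
print("nouns seen with 'take':", sorted(n for (v, n) in pairs if v == "take"))
```

The number of distinct keys in `pairs` corresponds to the action classes (unique verb-noun pairs), while the verb and noun counters give the per-class frequencies.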
Benchmark results and discussion
Action Recognition
Table 1 shows the performance of six action recognition models on the Trauma THOMPSON dataset: MViT v2, Uniformer v2, VideoSwin, TimeSFormer, VideoMAE, and LaViLa. The results provide insight into their effectiveness across different training and testing conditions. Overall, MViT v2 and Uniformer v2 emerge as the strongest models, consistently achieving the highest Top-1 and Top-5 accuracy scores. In contrast, TimeSFormer performs the worst. VideoSwin, VideoMAE, and LaViLa demonstrate moderate performance, though their accuracy degrades on challenging data variations.
Generalization depends strongly on the training data and on the test set. When models are trained on regular procedures, they perform well on regular test data but exhibit a sharp decline in accuracy when tested on JIT data. For instance, VideoSwin’s Top-1 accuracy drops from 45.10% on regular data to just 3.85% on JIT data, a steep decline attributable to unseen data. Even the highest-performing model, MViT v2, sees a substantial performance drop when moving from regular training to JIT testing, suggesting that JIT data presents unique challenges such as rare actions, occlusions, or unconventional tools.
Although training on Combined data improves JIT accuracy, JIT test results remain consistently lower than regular test results across all models, reinforcing the difficulty of recognizing actions in JIT scenarios. This suggests potential domain gaps in temporal dynamics or scene context, making it harder for models to generalize. Another key observation is the gap between Top-1 and Top-5 accuracy scores. For example, MViT v2 achieves 90.38% Top-5 accuracy on JIT when trained on Combined data, but its Top-1 accuracy remains at 50.96%. This indicates that models frequently rank the correct action within their top five predictions but struggle with precise Top-1 classification, likely due to fine-grained action distinctions in JIT data.
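For readers less familiar with the metric, Top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The following minimal sketch shows the computation on made-up scores; it is not the evaluation code used for the benchmarks.

```python
def top_k_accuracy(scores: list[list[float]], labels: list[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up class scores for three samples and four classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.50, 0.30, 0.15, 0.05],  # true class 1 is ranked second
    [0.40, 0.10, 0.30, 0.20],  # true class 3 is ranked third
]
labels = [1, 1, 3]

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, labels, k))
```

A large gap between Top-1 and Top-5 therefore means the correct class is often a near miss: ranked highly, but behind a visually similar alternative.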
Among the individual models, MViT v2 stands out as the best overall performer, particularly excelling in JIT generalization when trained on Combined data. This suggests that its multi-scale modeling mechanism contributes to its robustness. Uniformer v2 also shows strong performance, likely benefiting from its unified architecture for spatial and temporal modeling. In contrast, VideoSwin and VideoMAE achieve moderate results but struggle with JIT data, suggesting they may lack the necessary mechanisms to handle within-domain variability. TimeSFormer performs the worst, with very low scores on JIT testing (e.g., 0.51% Top-1 accuracy when trained on regular data).
Fig. 8a shows the confusion matrix for the top-performing model, MViT v2, on the action recognition task. The dark diagonal pattern reflects the model’s high accuracy in correctly predicting many classes. The action labels are sorted by frequency, with common classes at the top and rarer ones at the bottom. Notably, the lower section of the matrix contains fewer dark regions, highlighting the model’s difficulty in accurately recognizing less frequent actions. Fig. 9 presents the Top-1 accuracy of verbs, nouns, and full actions across five emergency procedures for each model, with the numbers summarized in Table 2. The consistent shapes across radar plots suggest similar performance trends among models for different procedures. However, no single model dominates across all procedures, indicating that the highest-performing model overall does not necessarily perform best on every individual task.
Fig. 8: Confusion matrices of action recognition and action anticipation with MViT v2 on regular procedures.
Fig. 9: Action recognition Top-1 accuracies of verb, noun, and action by each type of procedure.
Action Anticipation
Table 3 summarizes the benchmarking results for action anticipation across the same six models. In the regular train-test configuration, MViT v2 outperforms others with the highest Top-1 and Top-5 accuracies of 60.12% and 87.02%, respectively. Uniformer v2 follows closely with 56.25% Top-1 and 84.70% Top-5 accuracy. VideoSwin and VideoMAE deliver moderate performance, while TimeSFormer lags significantly with only 28.44% Top-1 accuracy. When evaluated on JIT data using models trained solely on regular data, all models suffer a sharp drop in performance—MViT v2 achieves just 8.42% Top-1 accuracy, and Uniformer v2 shows similar degradation. This highlights the challenge of generalizing anticipation models to JIT contexts without diverse training data.
Under the combined test setup (testing on both regular and JIT data with regular training), MViT v2 again leads with 53.50% Top-1 and 78.71% Top-5 accuracy. Uniformer v2 remains competitive, while TimeSFormer continues to perform the weakest. These patterns align with those observed in the regular test scenario, reinforcing the limitations of training solely on regular procedure data for anticipating actions in varied environments.
Overall, action anticipation follows a similar trend to action recognition. MViT v2 and Uniformer v2 consistently outperform other models. However, the overall accuracy for anticipation is generally lower, especially under JIT conditions, emphasizing the added difficulty of forecasting future actions compared to recognizing current ones. Training with combined datasets proves beneficial, markedly improving the models’ ability to generalize across both regular and JIT scenarios.
Fig. 8b displays the confusion matrix for MViT v2 on the anticipation task, showing reduced classification accuracy for less frequent classes—an issue also observed in recognition. Fig. 10 compares the performance of all models on the five emergency procedures, with the numbers summarized in Table 4. Similar trends appear in the recognition task, although anticipation results tend to be lower overall.
Fig. 10: Action anticipation Top-1 accuracies of verb, noun, and action by each type of procedure.
Action Recognition and Action Anticipation with Large VLMs
Large VLMs have recently attracted considerable interest for vision tasks, so Table 5 presents the performance of three large VLMs on the action recognition and action anticipation tasks: LLaVA v1.6 7B39,40, Qwen2.5 VL 7B41, and Gemma3 4B42. These models were fine-tuned using QLoRA (Quantized Low-Rank Adaptation)43, enabling efficient training on a single GPU. For the recognition and anticipation tasks, we reformulated the video-based classification problem as an image understanding task. Specifically, we sampled 9 evenly spaced frames per video clip and arranged them in a 3 × 3 grid layout to compose a single grid image. Natural language prompts were used to embed the target labels, allowing the large VLMs to interpret them within a language-conditioned vision framework.
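The frame-grid construction can be illustrated with a short sketch. It assumes that "evenly spaced" includes the first and last frames of the clip and that tiles are placed in row-major order; the tile size of 640 × 360 is a hypothetical choice, and the authors' exact sampling rule may differ.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int = 9) -> list[int]:
    """Pick num_samples frame indices spread evenly from the first to the last frame."""
    if num_frames < 1:
        raise ValueError("clip must contain at least one frame")
    if num_samples == 1:
        return [0]
    return [(i * (num_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


def grid_offsets(tile_w: int, tile_h: int, cols: int = 3, rows: int = 3) -> list[tuple[int, int]]:
    """Top-left pixel offset of each tile in a rows x cols grid, in row-major order."""
    return [(c * tile_w, r * tile_h) for r in range(rows) for c in range(cols)]


# A hypothetical 121-frame clip with tiles resized to 640 x 360.
indices = evenly_spaced_indices(121)
offsets = grid_offsets(640, 360)
for frame_index, (x, y) in zip(indices, offsets):
    print(f"frame {frame_index} -> paste at ({x}, {y})")
```

Each sampled frame would then be resized to the tile size and pasted at its offset to form the single grid image passed to the VLM.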
For data preparation, all models were implemented with the HuggingFace transformers library. We used the BLIP model to generate a descriptive caption for each video clip. The dataset was converted to the chat template specified by the transformers library and augmented by oversampling less frequent classes to balance the training data.
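Oversampling of infrequent classes can be done in several ways. The sketch below shows one simple variant, random duplication until every class matches the size of the largest class; the balancing target, the `label` field, and the seed are assumptions for illustration, since the exact oversampling scheme is not specified here.

```python
import random
from collections import Counter, defaultdict


def oversample(samples: list[dict], seed: int = 0) -> list[dict]:
    """Duplicate random samples of minority classes until all classes are equally frequent."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample["label"]].append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy training set with an imbalanced label distribution.
toy = ([{"label": "take scalpel"}] * 5
       + [{"label": "incise skin"}] * 2
       + [{"label": "drop scalpel"}])
balanced = oversample(toy)
print(Counter(sample["label"] for sample in balanced))
```

Oversampling is applied to the training data only, so evaluation still reflects the natural class distribution.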
We used QLoRA for all three models to allow fine tuning on a single GPU. All models were initialized as 4 bit quantized models via BitsAndBytesConfig. A common SFTConfig (Supervised Fine Tuning Configuration) was defined to run 15 epochs of supervised fine tuning with gradient accumulation, gradient checkpointing, a fused AdamW optimizer under a constant learning rate schedule, and bfloat16 precision. The LoRA adapter configurations were similar. For LLaVA v1.6 and Gemma3, we attached a LoRA adapter to all linear modules with a rank of 16, an alpha of 16 (for LLaVA), and 5 percent dropout. For Qwen2.5 VL, we applied a LoRA adapter more specifically to the model’s query and value projection layers with a rank of 8, alpha of 16, and 5 percent dropout. A custom collate function was used for each model to apply the appropriate chat template, tokenize text, preprocess images, and mask padding tokens, label tokens, or special image token IDs as required by the model architecture. Finally, we used TRL’s SFTTrainer for the training.
For action recognition, LLaVA-v1.6-7B achieves the highest Top-1 accuracy across all evaluation settings: 45.56% on the regular test set, 26.93% on the Just-in-Time (JIT) set, and 43.10% on the combined set. These results indicate that LLaVA-v1.6-7B is the most effective large VLM among those evaluated for recognizing previously seen actions, as well as generalizing across different procedural settings. Qwen2.5-VL-7B and Gemma3-4B also perform competitively, but consistently fall short of LLaVA’s accuracy, especially on the regular and combined splits. The strong results on the JIT set suggest that large VLMs possess robust zero-shot generalization abilities, making them promising candidates for settings where annotated data is scarce or procedural variations are frequent.
For action anticipation, the overall performance is lower across all models, reflecting the increased difficulty of predicting future actions. Qwen2.5-VL-7B leads slightly with the best combined set accuracy (21.77%), although LLaVA-v1.6-7B attains the highest score on the JIT subset (15.75%). This marginal advantage on JIT data suggests that LLaVA may possess stronger generalization capabilities to out-of-distribution or previously unseen procedural variations, possibly due to its larger visual-text alignment capacity. However, none of the large VLMs surpass their own recognition performance, nor do they outperform vision-only models like MViT v2 and Uniformer v2 (see Tables 1 and 3), which were optimized specifically for temporal video understanding.
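For reference, Top-1 accuracy on the regular, JIT, and combined sets can be computed as sketched below. The example assumes that a generated answer counts as correct when it matches the ground-truth verb-noun label after trimming whitespace and lowercasing, and that the combined set pools the samples of both subsets; the function names, the matching rule, and the toy records are illustrative assumptions rather than the evaluation code used for Table 5.

```python
def top1_accuracy(predictions, targets):
    """Percentage of predictions that match their target label."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    # Assumption: case-insensitive exact match on the verb-noun label.
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return 100.0 * correct / len(targets)


def accuracy_by_split(records):
    """records: iterable of (split, predicted_label, target_label) triples."""
    pairs_by_split = {}
    for split, predicted, target in records:
        pairs_by_split.setdefault(split, []).append((predicted, target))
        # The combined set pools the samples of every split.
        pairs_by_split.setdefault("combined", []).append((predicted, target))
    return {
        name: top1_accuracy(*zip(*pairs))
        for name, pairs in pairs_by_split.items()
    }


# Toy predictions; labels are placeholders, not dataset statistics.
records = [
    ("regular", "take scalpel", "take scalpel"),
    ("regular", "make incision", "make incision"),
    ("regular", "make incision", "take scalpel"),
    ("jit", "Take Scalpel ", "take scalpel"),
    ("jit", "make incision", "take scalpel"),
]
results = accuracy_by_split(records)
print(results)
```

Because the combined score pools samples, it is a size-weighted average of the two subsets and therefore sits closer to the score of the larger subset.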
Medical Visual Question Answering
Table 6 presents a comparative analysis of six fine-tuned MVQA models with varying sizes: ViLT-B/3244, BLIP45, Florence246, LLaVA-v1.6-7B39,40, Qwen2.5-VL-7B41, and Gemma3-4B42. Among them, LLaVA-v1.6, Qwen2.5-VL, and Gemma3 were fine-tuned using the QLoRA (Quantized Low-Rank Adaptation) approach43, enabling efficient model adaptation on a single GPU. BLIP achieved the highest accuracy at 88.64%, showcasing strong MVQA performance with a relatively moderate parameter size. Florence2 followed closely with an accuracy of 87.86%. LLaVA-v1.6 and Qwen2.5-VL achieved accuracies of 85.57% and 83.29%, respectively, while Gemma3 attained 72.04%. ViLT-B/32, the smallest model with only 87 million parameters, provided a lightweight option with a respectable accuracy of 79.88%.
Usage Notes
The Trauma THOMPSON dataset is openly available for research and development purposes. The README file in our GitHub repository gives detailed instructions on how to process the dataset for different tasks, including preprocessing steps and annotation formats. The code for processing the videos, extracting relevant features, and building models is available in the same repository, allowing researchers to work with the dataset efficiently and implement their own machine learning pipelines. All users of the dataset are required to fill out a consent form, located under the “Terms” section, after which the raw videos and annotations become available for download.
Data availability
The dataset is openly available on Harvard Dataverse at https://doi.org/10.7910/DVN/V5BTRU.
Code availability
The code for processing the Trauma THOMPSON dataset and reproducing the experimental benchmarks is available in our GitHub repository (https://github.com/purdue-isat/TT). This repository enables researchers to replicate results and build upon the dataset for further advancements in action recognition, action anticipation, and MVQA.
References
Schwartz, S., Yaeli, A. & Shlomov, S. Enhancing Trust in LLM-Based AI Automation Agents: New Considerations and Future Challenges, ArXiv:2308.05391 [cs] https://doi.org/10.48550/arXiv.2308.05391 (2023).
Qiu, J. et al. LLM-based agentic systems in medicine and healthcare. Nature Machine Intelligence 6, 1418–1420, https://doi.org/10.1038/s42256-024-00944-1 (2024).
Chen, T. et al. Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning, https://doi.org/10.48550/ARXIV.2003.12862 (2020).
Wang, L. et al. Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies, ArXiv:2410.19878 [cs] https://doi.org/10.48550/arXiv.2410.19878 (2024).
Tay, Y. et al. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, ArXiv:2109.10686 [cs] https://doi.org/10.48550/arXiv.2109.10686 (2022).
Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 160035, https://doi.org/10.1038/sdata.2016.35 (2016).
Rojas-Muñoz, E., Couperus, K. & Wachs, J. DAISI: Database for AI Surgical Instruction ArXiv:2004.02809 [cs, eess] (2020).
Damen, D. et al. The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines ArXiv:2005.00343 [cs] (2020).
Fujii, R., Hatano, M., Saito, H. & Kajita, H. EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos. In Linguraru, M. G. et al. (eds.) Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, vol. 15006, 187–196, https://doi.org/10.1007/978-3-031-72089-5_18 (Springer Nature Switzerland, Cham, 2024).
Gupta, D., Attal, K. & Demner-Fushman, D. A dataset for medical instructional video classification and question answering. Scientific Data 10, 158, https://doi.org/10.1038/s41597-023-02036-y (2023).
Mirchi, N. et al. The Virtual Operative Assistant: An explainable artificial intelligence tool for simulation-based training in surgery and medicine. PLOS ONE 15, e0229596, https://doi.org/10.1371/journal.pone.0229596 (2020).
Auloge, P. et al. Augmented reality and artificial intelligence-based navigation during percutaneous vertebroplasty: a pilot randomised clinical trial. European Spine Journal 29, 1580–1589, https://doi.org/10.1007/s00586-019-06054-6 (2020).
Jha, S. & MB, N. The Essence of the surgical navigation system using artificial intelligence and augmented reality. International Research Journal of Engineering and Technology (IRJET) 06 (2019).
Bissonnette, V. et al. Artificial Intelligence Distinguishes Surgical Training Levels in a Virtual Reality Spinal Task. Journal of Bone and Joint Surgery 101, e127, https://doi.org/10.2106/JBJS.18.01197 (2019).
Ward, T. M. et al. Surgical data science and artificial intelligence for surgical education. Journal of Surgical Oncology 124, 221–230, https://doi.org/10.1002/jso.26496 (2021).
Fazlollahi, A. M. et al. Effect of Artificial Intelligence Tutoring vs Expert Instruction on Learning Simulated Surgical Skills Among Medical Students: A Randomized Clinical Trial. JAMA Network Open 5, e2149008, https://doi.org/10.1001/jamanetworkopen.2021.49008 (2022).
Novaes, M. & Basu, A. Disruptive technologies: Present and future. In Fundamentals of Telemedicine and Telehealth, 305–330, https://doi.org/10.1016/B978-0-12-814309-4.00014-8 (2020).
Rojas, E., Couperus, K. & Wachs, J. The AI-Medic: an artificial intelligent mentor for trauma surgery. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 9, 1–9, https://doi.org/10.1080/21681163.2020.1835548 (2020).
Zhang, J., Nie, Y., Chang, J. & Zhang, J. J. Surgical Instruction Generation with Transformers, ArXiv:2107.06964 [cs] (2021).
Xu, M., Islam, M. & Ren, H. Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches ArXiv:2207.00113 [cs] (2022).
Vannaprathip, N., Haddawy, P., Schultheis, H. & Suebnukarn, S. SDMentor: A virtual reality-based intelligent tutoring system for surgical decision making in dentistry. Artificial Intelligence in Medicine 162, 103092, https://doi.org/10.1016/j.artmed.2025.103092 (2025).
Caballero, D., Sánchez-Margallo, J. A., Pérez-Salazar, M. J. & Sánchez-Margallo, F. M. Applications of Artificial Intelligence in Minimally Invasive Surgery Training: A Scoping Review. Surgeries 6, 7, https://doi.org/10.3390/surgeries6010007 (2025).
Remley, M., Paul, L. & Riesberg, J. Prolonged Casualty Care Guidelines (CPG ID:91) (2021).
Bhattarai, H. K., Bhusal, S., Barone-Adesi, F. & Hubloue, I. Prehospital Emergency Care in Low- and Middle-Income Countries: A Systematic Review. Prehospital and Disaster Medicine 38, 495–512, https://doi.org/10.1017/S1049023X23006088 (2023).
Kinder, F., Mehmood, S., Hodgson, H., Giannoudis, P. & Howard, A. Barriers to Trauma Care in South and Central America: a systematic review. European Journal of Orthopaedic Surgery & Traumatology 32, 1163–1177, https://doi.org/10.1007/s00590-021-03080-3 (2022).
Jiang, N. et al. Baseline Models for Action Recognition of Unscripted Casualty Care Dataset. In Waiter, G. et al. (eds.) Medical Image Understanding and Analysis, vol. 14122, 215–227, https://doi.org/10.1007/978-3-031-48593-0_16 (Springer Nature Switzerland, Cham, 2024).
Birch, E. et al. Trauma THOMPSON: Clinical Decision Support for the Frontline Medic. Military Medicine 188, 208–214, https://doi.org/10.1093/milmed/usad087 (2023).
Zhuo, Y., W. Kirkpatrick, A., Couperus, K., Tran, O. & Wachs, J. The Trauma THOMPSON Challenge Report MICCAI 2023. In Bao, R., Grant, E., Kirkpatrick, A., Wachs, J. & Ou, Y. (eds.) AI for Brain Lesion Detection and Trauma Video Action Recognition, vol. 14567, 61–71, https://doi.org/10.1007/978-3-031-71626-3_8 (Springer Nature Switzerland, Cham, 2025).
Zhuo, Y. et al. Overview of the Trauma THOMPSON Challenge at MICCAI 2023. In Bao, R., Grant, E., Kirkpatrick, A., Wachs, J. & Ou, Y. (eds.) AI for Brain Lesion Detection and Trauma Video Action Recognition, vol. 14567, 47–60, https://doi.org/10.1007/978-3-031-71626-3_7 (Springer Nature Switzerland, Cham, 2025).
Butler, F. K., Hagmann, J. & Butler, E. G. Tactical Combat Casualty Care in Special Operations. Military Medicine 161, 3–16, https://doi.org/10.1093/milmed/161.suppl_1.3 (1996).
Schauer, S. G. et al. Hypothermia in the Combat Trauma Population. Prehospital Emergency Care 27, 934–940, https://doi.org/10.1080/10903127.2022.2119315 (2023).
Lukežič, A., Vojíř, T., Čehovin, L., Matas, J. & Kristan, M. Discriminative Correlation Filter with Channel and Spatial Reliability. International Journal of Computer Vision 126, 671–688, https://doi.org/10.1007/s11263-017-1061-3 (2018).
Azari, D. P. et al. Modeling Surgical Technical Skill Using Expert Assessment for Automated Computer Rating. Annals of Surgery 269, 574–581, https://doi.org/10.1097/SLA.0000000000002478 (2019).
Mackenzie, C. F. et al. Enhanced Training Benefits of Video Recording Surgery With Automated Hand Motion Analysis. World Journal of Surgery 45, 981–987, https://doi.org/10.1007/s00268-020-05916-1 (2021).
Xiao, B. et al. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, ArXiv:2311.06242 [cs] https://doi.org/10.48550/arXiv.2311.06242 (2023).
Kuo, W., Cui, Y., Gu, X., Piergiovanni, A. & Angelova, A. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models, https://doi.org/10.48550/ARXIV.2209.15639 (2022).
Feng, Y. et al. Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation, ArXiv:2504.09480 [cs] https://doi.org/10.48550/arXiv.2504.09480 (2025).
Zhuo, Y. Trauma THOMPSON Dataset, https://doi.org/10.7910/DVN/V5BTRU (2025).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual Instruction Tuning, https://doi.org/10.48550/ARXIV.2304.08485 (2023).
Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved Baselines with Visual Instruction Tuning, ArXiv:2310.03744 [cs] https://doi.org/10.48550/arXiv.2310.03744 (2024).
Bai, S. et al. Qwen2.5-VL Technical Report, ArXiv:2502.13923 [cs] https://doi.org/10.48550/arXiv.2502.13923 (2025).
Gemma Team et al. Gemma 3 Technical Report, https://doi.org/10.48550/ARXIV.2503.19786 (2025).
Dettmers, T., Pagnoni, A., Holtzman, A. & Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs, ArXiv:2305.14314 [cs] https://doi.org/10.48550/arXiv.2305.14314 (2023).
Kim, W., Son, B. & Kim, I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Meila, M. & Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research, 5583–5594 (PMLR, 2021).
Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, https://doi.org/10.48550/ARXIV.2201.12086 (2022).
Yuan, L. et al. Florence: A New Foundation Model for Computer Vision, ArXiv:2111.11432 [cs] https://doi.org/10.48550/arXiv.2111.11432 (2021).
Acknowledgements
This work was partially supported by the Center for AI and Robotic Excellence in medicine (CARE) at Purdue University and Indiana University School of Medicine. This work was also supported by the US Army Medical Research and Development Command under Contract No. W81XWH21C0119 and by the National Science Foundation under Grant NSF #2140612. The views, opinions, and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision unless so designated by other documentation.
Author information
Contributions
The study was conceptualized and led by K.C., C.C, A.W.K., and J.W. The Purdue team (Y.Z., E.Z., X.Y., A.P., W.F., X.C., and J.W.) developed and ran the algorithms and conducted detailed annotations of hands and medical instruments. Authors from the University of Calgary (A.W.K., J.M.), Madigan Army Medical Center (K.C., C.C., O.T., J.B., D.D., C.G., R.C., E.B.), and The Geneva Foundation (K.C., C.C., O.T., J.B., D.D.) contributed to the creation of egocentric medical procedure videos and associated ground truth annotations. All authors contributed to the writing and editing of the manuscript and approved the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhuo, Y., Zhang, E., Yu, X. et al. An Egocentric Life-Saving Interventional Procedure Dataset of Actions, Medical Questions, Maneuvers and Tools. Sci Data 13, 51 (2026). https://doi.org/10.1038/s41597-025-06365-y
DOI: https://doi.org/10.1038/s41597-025-06365-y