Background & Summary

Recent advancements in generative AI and machine learning (ML) have enabled the development of foundational models for virtual agents and assistants capable of supporting real-time user performance in unstructured, uncontrolled and highly variable tasks across diverse domains1,2. These models are designed to process and analyze textual, visual, or multimodal data to output useful information. The training of these large models typically follows a two-stage process: pretraining on a vast amount of data in a self-supervised or unsupervised fashion, followed by finetuning on domain-specific, annotated datasets to perform particular downstream tasks3,4,5. This paradigm ensures that the large models can extract common textual and visual features and develop a broad understanding of general knowledge, while remaining capable of adapting to domain-specific problems. One particular case, which is the focus of this paper, is high-stakes applications in the medical, healthcare and critical care domain. As AI researchers strive to integrate multimodal intelligence into the critical care workflow, datasets with high specificity and quality annotations will be pivotal in bridging the gap between theoretical advancements and practical applications. For an AI mentor, recognizing the current action, anticipating the next action, and answering medical questions are among the most important functions for guiding inexperienced care providers through critical care tasks. Robust hand tracking and precise object detection enable skill evaluation and tool localization, reducing human error and supporting safer and more effective care.

In the development of AI agents, domain-specific datasets are essential for effectively addressing tasks within specialized fields such as medicine. While generalized datasets are useful for pretraining and capturing broad patterns of human behavior, they often lack the granularity required to navigate the complex, nuanced scenarios encountered in clinical environments. Medical tasks—ranging from diagnostics and surgical guidance to critical care management—demand datasets that accurately represent procedural intricacies, patient variability, and the use of specialized equipment. For instance, datasets like MIMIC-III6 and DAISI7 have played key roles in advancing AI for electronic health record analysis and surgical instruction, respectively. However, these datasets largely omit critical care and trauma scenarios.

To develop AI mentors capable of assisting inexperienced caregivers during hands-on procedures, it is crucial that such systems can perceive, recognize, and predict human actions in real time. Egocentric datasets have been proposed for this purpose. For example, the EPIC-KITCHENS 100 Dataset8 provides large-scale and first-person video recordings of everyday kitchen tasks annotated with fine-grained action labels, enabling models to learn context-aware behavior prediction. In the medical domain, the Egosurgery-Phase dataset9 addresses a longstanding gap in surgical phase recognition for open surgery. Comprising 15 hours of head-mounted egocentric video across nine surgical phases and enriched with eye gaze data, it offers a valuable foundation for modeling surgical attention and decision-making. Similarly, MedVidCL10 supports AI development for surgical diagnostics and training. However, these datasets are typically collected in structured, controlled environments—primarily operating rooms—and do not reflect the urgency, improvisation, or contextual constraints common in emergency medicine. As a result, their applicability to real-world, resource-limited, or humanitarian settings remains limited. The lack of datasets capturing the spontaneity and variability of real trauma scenarios continues to be a barrier to AI generalization in emergency care.

Beyond datasets, existing AI systems designed for surgical mentoring have focused largely on planned procedures in well-resourced environments. For instance, the Virtual Operative Assistant11 delivers real-time feedback to neurosurgery trainees during tumor resections in a VR environment, significantly enhancing technical performance. Other systems provide spatial navigation support during surgery12,13 or enable post-operative performance evaluation14,15,16. However, these technologies are typically limited to operating rooms17 or laboratory simulations7,18,19,20,21,22 and do not extend to unpredictable, low-resource environments. Despite the critical importance of timely interventions in trauma and emergency medicine, AI systems capable of guiding life-saving procedures under austere or chaotic conditions are still lacking. This is particularly problematic given that many emergency care providers are undertrained in life-saving interventional (LSI) procedures when operating in such environments23,24,25.

The Trauma THOMPSON dataset26,27,28 fills those gaps by offering a highly specialized and annotated video dataset tailored to trauma and emergency care. It not only contributes to improving AI models’ performance but also ensures that their predictions are grounded in the clinical world. The Trauma THOMPSON dataset is a collection of annotated video clips designed to advance research and development in autonomous AI mentorship for humanitarian medicine. To the best of our knowledge, Trauma THOMPSON is unprecedented in terms of scale, real-world settings, unique challenges, and practical applicability. The highlights of this dataset are as follows. We created the first egocentric-view dataset focused on operational medicine to assist field medics and guide inexperienced users in performing emergency care procedures. Designed for resource-constrained, uncontrolled, and urgent scenarios, the dataset extends a previous iteration and includes 220 videos comprising 3,717 annotated clips covering five unscripted life-saving procedures: cricothyroidotomy (CR), tube thoracostomy (CT), tourniquet application (TQ), intraosseous infusion (IO), and needle thoracostomy (ND). It includes not only regular LSI procedures, but also just-in-time (JIT) procedures, which use unconventional makeshift tools to perform emergency procedures. The addition of JIT procedures creates extra challenges for the dataset by introducing more variability in tools and environments and is useful for studying human medical commonsense. In addition, we include annotations for medical visual question answering (MVQA)29, hand maneuvers, and object detection. MVQA can serve as a clinical decision support tool, allowing caregivers to extract critical insights from medical imagery through natural language interaction. The dataset is openly accessible and freely available to foster the development of AI applications in medical care.
We benchmarked the dataset on action recognition, action anticipation, and MVQA with multiple algorithms, demonstrating how machine learning models can leverage the dataset’s annotations to predict therapeutic actions essential for humanitarian medicine and resuscitative care.

Methods

Data Collection

This study was approved by the Institutional Review Board (IRB) under protocol number 223046, reviewed and overseen by the Geneva Foundation team. Participants were medical professionals recruited through internal invitations at the Madigan Army Medical Center and the University of Calgary based on their clinical expertise. All participants provided written informed consent prior to participation, agreeing to be video-recorded from an egocentric perspective while performing simulated medical procedures for research and dataset development purposes. To ensure privacy and confidentiality, all video data underwent human review, and any video with personally identifiable features, such as tattoos, was removed. No physician-identifying or personally traceable information was included in the final dataset.

Procedures Identification

Regular Procedures

The Trauma THOMPSON dataset was developed by a team of subject matter experts (SMEs), including surgeons, critical care physicians, and emergency medicine practitioners, each with 5-20 years of experience in deployed or trauma care settings. They identified essential procedures for Tactical Combat Casualty Care (TCCC)30, such as cricothyroidotomy and tourniquet application, and compiled a task list describing the fundamental steps for the successful performance of the procedures during video collection. Additionally, a focus group of 15-30 SMEs collaborated to establish a consensus on the dataset’s content and best practices. TCCC is structured around the MARCH algorithm (Massive Hemorrhage, Airway, Respiration, Circulation, Hypothermia/Head Injury)31. The MARCH framework prioritizes life-saving interventions performed under fire or during tactical field care, such as controlling major bleeding, securing airways, treating chest injuries, managing shock, and preventing hypothermia. A survey was conducted to assess agreement on various TCCC procedures and skills. After its completion, the SMEs engaged in discussions to refine the rankings, identify any missing procedures, and establish optimal practices for performing them. This process resulted in a finalized list of 5 procedures for the Trauma THOMPSON dataset based on the MARCH algorithm, as shown in Fig. 1: TQ for Massive Hemorrhage, CR for Airway, CT and ND for Respiration, and IO for Circulation. Detailed instructions outlining the fundamental steps for successfully collecting these procedures are elaborated in subsequent sections.

Fig. 1

The Trauma THOMPSON Dataset.

Just-in-time Procedures (JIT)

In addition to standard techniques, the medical team with extensive field experience developed improvised methods to perform the same five life-saving procedures using alternative tools and materials readily available in resource-constrained environments. These improvised approaches, guided by physicians’ expert judgment and adaptability, reflect the ingenuity required in emergency care and underscore the importance of designing AI systems that can understand and adapt to non-standard techniques, ensuring broader applicability in diverse operational contexts. To support this, the dataset was further enriched with videos demonstrating “just-in-time” procedures utilizing improvised, non-traditional equipment. These recordings captured creative solutions such as using belts or clothing with a screwdriver to fashion tourniquets, scissors for incision and expansion during tube thoracostomy, and screwdrivers to aid in tube insertion. For improvised needle cricothyroidotomy, a needle was used in place of a standard incision and tube for emergency airway access, and manual intraosseous needle placement was demonstrated in the absence of a functioning needle driver.

Recording

We focused on recording natural and unscripted life-saving intervention (LSI) procedures from a first-person perspective, capturing actions such as operating medical tools, searching for items, reconsidering decisions, and managing unexpected challenges. To achieve this, commercial GoPro Hero7 Black cameras (GoPro, San Mateo, California) were mounted on the participants’ heads to capture egocentric views. The surgeons positioned the cameras at a 20-30° angle relative to their foreheads to ensure optimal video capture, with the hands typically centered in the frame for clear visualization during procedures. Filming was conducted across various simulation models and environments, including those that resemble field conditions. All recordings were de-identified to protect privacy. The videos were captured in 1080p resolution to maintain high-quality visuals.

Annotation

The annotation process followed common practices used in activity recognition datasets, specifically the EPIC-KITCHENS dataset. Annotators were instructed to identify fine-grained medical actions, limiting each to a maximum duration of 5 seconds for easier data processing. Longer actions were divided into sub-actions or multiple video clips. The vocabulary for the medical actions is drawn from terminology common in TCCC. The dataset annotations include the start and end timestamps, as well as the actions represented as verb-noun pairs for each video clip (e.g., take scalpel, incise skin), as adopted by the EPIC-KITCHENS dataset. The process is illustrated in Fig. 2. Medical professionals annotated each procedural step by specifying the precise timestamps in mm:ss format and the corresponding actions. The annotated data were then reviewed by peers to ensure the accuracy of timestamping and video segmentation. The annotation work was carried out by project managers and research assistants with specialized training and 2-5 years of experience in military medicine from The Geneva Foundation and Madigan Army Medical Center.

Fig. 2

Illustration of the annotation process.
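The timestamp convention and the 5-second limit described above can be illustrated with a minimal sketch. It assumes that an over-long action is cut into consecutive clips of at most 5 seconds; the text also allows division into sub-actions, which this sketch does not attempt. The names `Clip`, `parse_mmss`, and `split_action` are hypothetical and not part of the released dataset tools.

```python
from dataclasses import dataclass

MAX_CLIP_SECONDS = 5  # maximum clip duration stated in the annotation guidelines


@dataclass
class Clip:
    start: int  # seconds from the start of the video
    end: int
    verb: str
    noun: str


def parse_mmss(stamp: str) -> int:
    """Convert an 'mm:ss' timestamp into whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def split_action(start: str, end: str, verb: str, noun: str) -> list[Clip]:
    """Cut one annotated action into consecutive clips of at most MAX_CLIP_SECONDS.

    Uniform cutting is an assumption made for illustration only.
    """
    t0, t1 = parse_mmss(start), parse_mmss(end)
    if t1 <= t0:
        raise ValueError("end timestamp must be after start timestamp")
    clips = []
    while t0 < t1:
        t_next = min(t0 + MAX_CLIP_SECONDS, t1)
        clips.append(Clip(t0, t_next, verb, noun))
        t0 = t_next
    return clips


if __name__ == "__main__":
    # A 12-second "incise skin" action becomes three clips.
    for clip in split_action("01:02", "01:14", "incise", "skin"):
        print(clip)
```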

Data quality assurance

To ensure the accuracy of annotations, each procedure’s actions are annotated by three medical professionals. One annotator generates the initial annotations, while the other two reviewers validate them. Since the annotations are estimates and precise timestamps for each procedure cannot be guaranteed, a method is proposed to calculate annotation accuracy.

The actual timestamp, t_a, is defined as the average of the timestamps provided by the annotator and the reviewers. Let t_o represent the timestamp from the annotator, t_{ri} the timestamp from reviewer i, and n_r the number of reviewers. The actual timestamp is calculated as:

$${t}_{a}=\frac{1}{{n}_{r}+1}\left(\mathop{\sum }\limits_{i=1}^{{n}_{r}}{t}_{ri}+{t}_{o}\right).$$
(1)

For each clip, t_{as} and t_{ae} represent the actual start and end times, while t_{os} and t_{oe} denote the original start and end times determined by the annotator. The annotation accuracy for each clip is computed by dividing the overlapping time between the original and actual timestamps by the actual clip duration. The overlapping time is determined using:

$${t}_{start}=\max ({t}_{os},{t}_{as}),\quad {t}_{end}=\min ({t}_{oe},{t}_{ae}).$$
(2)

The accuracy for a clip, p_i, is then calculated as:

$${p}_{i}=\frac{{t}_{end}-{t}_{start}}{{t}_{ae}-{t}_{as}}.$$
(3)

Finally, the average annotation accuracy (acc) is computed using:

$$\,{\rm{acc}}\,=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\,{p}_{i},$$
(4)

where N is the total number of clips. Because clips differ in length, the final average accuracy weights each clip by its actual duration, with t_{as} and t_{ae} taken from clip i:

$$\,{\rm{acc}}\,=\frac{{\sum }_{i=1}^{N}{p}_{i}\,({t}_{ae}-{t}_{as})}{{\sum }_{i=1}^{N}({t}_{ae}-{t}_{as})}.$$
(5)
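As a worked illustration of Eqs. (1) to (5), the following minimal sketch computes the consensus timestamps, the per-clip accuracy, and both the plain and duration-weighted averages. The function names and example numbers are hypothetical. Clamping a negative overlap to zero is an added assumption for intervals that do not intersect; the equations themselves do not state how that case is handled.

```python
def consensus(annotator: float, reviewers: list[float]) -> float:
    """Eq. (1): mean of the annotator's and the reviewers' timestamps."""
    return (annotator + sum(reviewers)) / (len(reviewers) + 1)


def clip_accuracy(t_os: float, t_oe: float, t_as: float, t_ae: float) -> float:
    """Eqs. (2)-(3): overlap of original and actual intervals over actual duration."""
    t_start = max(t_os, t_as)
    t_end = min(t_oe, t_ae)
    overlap = max(0.0, t_end - t_start)  # assumption: no overlap counts as zero
    return overlap / (t_ae - t_as)


def mean_accuracy(clips: list[tuple[float, float, float, float]]) -> float:
    """Eq. (4): unweighted mean of the per-clip accuracies."""
    return sum(clip_accuracy(*c) for c in clips) / len(clips)


def weighted_accuracy(clips: list[tuple[float, float, float, float]]) -> float:
    """Eq. (5): per-clip accuracies weighted by actual clip duration."""
    num = sum(clip_accuracy(*c) * (c[3] - c[2]) for c in clips)
    den = sum(c[3] - c[2] for c in clips)
    return num / den


if __name__ == "__main__":
    # Clip 1: annotator marks 12-15 s; two reviewers mark starts 10 and 11, ends 15 and 15.
    t_as = consensus(12.0, [10.0, 11.0])   # actual start
    t_ae = consensus(15.0, [15.0, 15.0])   # actual end
    clip1 = (12.0, 15.0, t_as, t_ae)
    # Clip 2: everyone agrees on 20-22 s.
    clip2 = (20.0, 22.0, 20.0, 22.0)
    print(mean_accuracy([clip1, clip2]), weighted_accuracy([clip1, clip2]))
```

The weighted form gives longer clips more influence, so a timing error on a short clip lowers the final score less than the same relative error on a long one.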

Medical Visual Question Answering

The MVQA dataset is built upon the egocentric video dataset and is enriched with detailed annotations in the form of natural language questions paired with corresponding plausible answers. These annotations focus on clinically relevant aspects and are designed to simulate the types of inquiries a medical AI mentor might encounter when assessing a situation or guiding a procedure. The questions are not procedure-specific but rather general and orthogonal to the procedure type. As illustrated in Fig. 3, each annotated scene includes a visual frame from the egocentric video, accompanied by multiple questions and a set of 3 to 5 potential answer choices. The questions can be categorized into three types based on their reasoning complexity. Descriptive questions involve direct observation of visible elements in the scene, such as “What limb is injured?” with the answer “Right arm.” Interpretive questions require analytical reasoning to infer the presence or absence of certain conditions, such as “Is there any bleeding?” with the answer “No.” Finally, contextual questions require an understanding of spatial or procedural relationships, such as “What is the current action?” with the answer “Take tourniquet.” Together, these annotations enable the development and evaluation of AI systems capable of interpreting emergency scenarios and reasoning about injuries, even when faced with uncertainty or incomplete visual cues.

Fig. 3

Illustration of Visual Question Answering.

Hand Maneuvers and Object Detection

To efficiently generate high-quality bounding box annotations, a human-in-the-loop strategy is employed, combining manual labeling with automated tracking. Annotators manually label hands and objects every 10-30 frames, while intermediate frames are automatically annotated using CSRT trackers32 provided through OpenCV. We annotated for the left and right hands, as well as 15 distinct medical instruments, as illustrated in Fig. 4. The hand annotations include bounding boxes, object classes, and object IDs for tracking, whereas the object annotations include bounding boxes and object classes for detection. Vision-language models (VLMs) were used to identify hands and detect objects that are crucial in clinical settings, particularly where precision is vital. Reliable hand tracking could enable the evaluation of procedural skills based on hand maneuvers in real time, providing feedback on bimanual coordination and task performance33,34. Additionally, integrating object detection with natural language interfaces allows healthcare providers to query AI systems about tool locations, thereby reducing cognitive effort and the likelihood of errors. Recent models such as Florence-235 and F-VLM36 have showcased strong object recognition capabilities37, underscoring the potential of unified VLMs to support diverse visual tasks in medical workflows.

Fig. 4

Hand and objects annotations.
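The human-in-the-loop scheme described above, with manual boxes on keyframes and CSRT tracking in between, might be sketched as follows. This is an illustration rather than the authors' annotation tool. It assumes the `opencv-contrib-python` package, where the CSRT tracker is exposed as `cv2.TrackerCSRT_create` or, in some builds, `cv2.legacy.TrackerCSRT_create`. The helper names and the `(x, y, w, h)` box layout in `manual_boxes` are hypothetical.

```python
def keyframe_indices(num_frames: int, stride: int) -> list[int]:
    """Frames an annotator would label by hand (every `stride` frames)."""
    return list(range(0, num_frames, stride))


def make_csrt_tracker():
    """Create a CSRT tracker; its location in the cv2 namespace varies by build."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    if hasattr(cv2, "TrackerCSRT_create"):
        return cv2.TrackerCSRT_create()
    return cv2.legacy.TrackerCSRT_create()


def propagate_boxes(frames, manual_boxes):
    """Fill in boxes between manually labeled keyframes for one hand or object.

    frames:       sequence of BGR images (NumPy arrays).
    manual_boxes: dict mapping frame index -> (x, y, w, h) in integer pixels.
    Returns a dict mapping frame index -> box, or None where tracking failed.
    """
    boxes = {}
    tracker = None
    for idx, frame in enumerate(frames):
        if idx in manual_boxes:
            # Re-initialize on every manual label so tracking drift cannot accumulate.
            tracker = make_csrt_tracker()
            tracker.init(frame, manual_boxes[idx])
            boxes[idx] = manual_boxes[idx]
        elif tracker is not None:
            ok, box = tracker.update(frame)
            boxes[idx] = tuple(int(v) for v in box) if ok else None
    return boxes
```

Frames where the tracker reports failure are left as `None` so that an annotator can revisit them instead of accepting an unreliable box.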

Data Records

We have archived a total of 37 data files for the Trauma THOMPSON dataset on Harvard Dataverse, available at https://doi.org/10.7910/DVN/V5BTRU (ref. 38). The dataset is split into multiple archive files to comply with the platform’s file size limitations. Each archive contains a subset of the videos and annotations relevant to the respective tasks. The dataset includes video recordings in MP4 format and annotations supporting multiple downstream tasks in computer vision and multimodal learning.

Annotations for action recognition and anticipation tasks are provided in CSV format. Each row in the CSV file corresponds to a labeled temporal segment, including the video identifier, timestamp ranges, and associated action labels. Separate CSV files are included for training, validation, and test splits.

Annotations for the MVQA task are stored in a single .json file. Each entry includes the question, answer, relevant frame identifiers, and associated metadata. This file allows mapping clinical questions to specific visual contexts in the videos and is designed to support multimodal question answering models.

Hand annotations are stored in COCO-style .json format. Each file includes bounding boxes for left and right hands across frames, along with instance-level segmentation (where applicable). These annotations support the training of models for motion analysis and skill assessment in emergency medical procedures.

Object annotations are formatted in YOLO-compatible .txt files. Each file corresponds to a single frame and contains bounding box coordinates and class labels for surgical tools and other relevant objects. These annotations facilitate real-time tool detection for clinical decision support.
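For readers unfamiliar with YOLO-style label files, the sketch below converts one label line into pixel coordinates. It assumes the conventional layout `class x_center y_center width height`, with coordinates normalized to the image size, and the 1920 × 1080 frame resolution reported for the videos. The README shipped with the dataset remains the authoritative description of the format, and the function name is hypothetical.

```python
def yolo_to_pixels(line: str, img_w: int = 1920, img_h: int = 1080):
    """Convert 'class xc yc w h' (normalized) to (class_id, x_min, y_min, x_max, y_max)."""
    parts = line.split()
    class_id = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return class_id, x_min, y_min, x_max, y_max


if __name__ == "__main__":
    # A box centered in the frame, a quarter of the width and half of the height.
    print(yolo_to_pixels("3 0.5 0.5 0.25 0.5"))
```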

A README file is included in the root directory of the dataset, providing detailed documentation of each annotation format, label taxonomy, and code snippets to assist with data loading and preprocessing. This dataset supports a broad range of AI tasks including action recognition, anticipation, visual question answering, hand motion analysis, and object detection, making it a comprehensive resource for multimodal learning in emergency medicine scenarios.

Data Overview

Dataset statistics

Currently, the dataset includes 220 videos covering five emergency care procedures, with a total of 3,717 fully annotated video clips. For classification tasks, the class distribution reflects the real-world frequency of each procedure. From the 1080p videos (1920 × 1080), a total of over 593,000 interventional frames have been extracted. The dataset is divided into training and testing sets at the video level using a random 80-20 split, following common practice in other datasets to ensure sufficient data for model training while reserving a representative portion for evaluation. The training set consists of 2,477 clips of regular procedures, while the testing set is split into two parts, containing 683 clips of regular procedures and 557 clips of JIT procedures. The regular procedure dataset includes 42 verb classes, 42 noun classes, and 124 action classes (unique verb-noun pairs), and the JIT procedure dataset includes 28 verb classes, 32 noun classes, and 86 action classes. In total, there are 45 verb classes, 49 noun classes, and 162 action classes in the Trauma THOMPSON dataset.
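Splitting at the video level means that all clips cut from one recording land in the same partition, which prevents near-duplicate frames from leaking between training and testing. A minimal sketch of such a split is given below. The seed, the function name, and the video identifiers are illustrative assumptions; the released CSV files define the actual partition.

```python
import random


def split_by_video(video_ids, test_fraction=0.2, seed=0):
    """Randomly assign whole videos to train or test; clips inherit their video's split."""
    unique = sorted(set(video_ids))       # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return set(unique[n_test:]), set(unique[:n_test])


if __name__ == "__main__":
    ids = [f"video_{i:03d}" for i in range(220)]  # hypothetical identifiers
    train, test = split_by_video(ids)
    print(len(train), len(test))
```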

Video duration

The dataset contains a total video duration of 11,528 seconds (approximately 192 minutes). Average video durations are computed separately for the regular and JIT scenarios across the five medical procedures. For regular procedures, the average durations are 76.49 seconds for CR, 121.19 seconds for CT, 48.36 seconds for IO, 41.18 seconds for ND, and 41.26 seconds for TQ, resulting in an overall average duration of 64.96 seconds. In comparison, JIT procedures are consistently shorter, with 42.40 seconds for CR, 69.66 seconds for CT, 38.11 seconds for IO, 26.67 seconds for ND, and 38.13 seconds for TQ, yielding an overall average duration of 42.48 seconds. These results indicate that JIT demonstrations emphasize fast task execution with makeshift tools under time-critical conditions.

Action annotation diversity

The unique action classes per procedure are computed to compare action and task complexity between regular and JIT procedures. In the regular procedures, CR has 52 unique actions, CT has 49, IO has 38, ND has 23, and TQ has 18. In the JIT procedures, CR has 42 unique actions, CT has 38, IO has 16, ND has 11, and TQ has 28. These results indicate differences in the distribution of action classes across procedure types and conditions.

Annotation accuracy statistics

Given the dataset’s large size, our dataset reviewers were instructed to randomly evaluate 60 videos following the specified guidelines. The review results show a temporal accuracy of 99.4%, with action, verb, and noun label accuracies of 97.2%, 97.2%, and 97.7% respectively.

Technical Validation

Dataset class distribution

Fig. 5 illustrates the distribution of regular procedures while Fig. 6 shows the distribution of JIT procedures. The two figures demonstrate the imbalanced nature of the current dataset, which results from the data being collected from an unscripted source, thereby reflecting the true frequency of procedures in real-world settings. However, it is crucial to recognize that as class imbalance in the training data increases, the performance of algorithms typically declines. Therefore, working with an imbalanced dataset presents greater challenges for algorithm development compared to a balanced one.

Fig. 5

Action frequency of the regular procedures.

Fig. 6

Action frequency of the JIT procedures.

Action class co-occurrences

Fig. 7 illustrates the distribution of verb-verb, noun-noun, verb-noun pair frequencies of the regular procedures. It is apparent that verbs such as ‘take’, ‘remove’, and ‘drop’ frequently co-occur with various nouns. This trend aligns with the action patterns commonly observed during the LSI procedures. The JIT procedures follow a similar pattern.

Fig. 7

Frequency of verb-noun class co-occurrences in regular procedures.
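Co-occurrence statistics of this kind can be derived directly from the verb-noun labels. The sketch below counts each verb-noun pair and the number of distinct nouns each verb appears with. The function names and example labels are illustrative and not taken from the annotation files.

```python
from collections import Counter, defaultdict


def verb_noun_counts(actions):
    """Count how often each (verb, noun) pair occurs."""
    return Counter((verb, noun) for verb, noun in actions)


def nouns_per_verb(actions):
    """Number of distinct nouns each verb co-occurs with."""
    nouns = defaultdict(set)
    for verb, noun in actions:
        nouns[verb].add(noun)
    return {verb: len(ns) for verb, ns in nouns.items()}


if __name__ == "__main__":
    labels = [("take", "scalpel"), ("incise", "skin"),
              ("take", "tourniquet"), ("take", "scalpel")]
    print(verb_noun_counts(labels))
    print(nouns_per_verb(labels))
```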

Benchmark results and discussion

Action Recognition

Table 1 shows the performance of six action recognition models on the Trauma THOMPSON dataset: MViT v2, Uniformer v2, VideoSwin, TimeSFormer, VideoMAE, and LaViLa. The results offer insight into their effectiveness across different training and testing conditions. Overall, MViT v2 and Uniformer v2 emerge as the strongest models, consistently achieving the highest Top-1 and Top-5 accuracy scores. In contrast, TimeSFormer performs the worst. VideoSwin, VideoMAE and LaViLa demonstrate moderate performance, though they suffer when tested on challenging data variations.

Table 1 Model Performance for Top 1 and Top 5 Accuracy of Action Recognition.

Depending on the training data, the models differ drastically in how well they generalize to each test set. When models are trained on regular procedures, they perform well on regular test data but exhibit a sharp decline in accuracy when tested on JIT data. For instance, VideoSwin’s Top-1 accuracy drops dramatically from 45.10% on regular data to just 3.85% on JIT data, highlighting a significant decline due to unseen data. Even the highest-performing model, MViT v2, sees a substantial performance drop when moving from regular training to JIT testing, suggesting that JIT data presents unique challenges such as rare actions, occlusions, or unconventional tools.

Even when models are trained on Combined data, which improves JIT performance, JIT test results remain consistently lower than regular test results across all models, reinforcing the difficulty of recognizing actions in JIT scenarios. This suggests potential domain gaps in temporal dynamics or scene context, making it harder for models to generalize. Another key observation is the gap between Top-1 and Top-5 accuracy scores. For example, MViT v2 achieves 90.38% Top-5 accuracy on JIT when trained on Combined data, but its Top-1 accuracy remains at 50.96%. This indicates that models frequently rank the correct action within their top five predictions but struggle with precise Top-1 classification, likely due to fine-grained action distinctions in JIT data.
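To make the Top-1 versus Top-5 distinction concrete, the following minimal sketch computes top-k accuracy from per-class scores: a prediction counts as correct when the true label is among the k highest-scoring classes. The function name and the toy scores are illustrative and are not drawn from the benchmark.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2],   # true class 1: ranked first
              [0.5, 0.3, 0.2],   # true class 1: ranked second
              [0.2, 0.3, 0.5]]   # true class 0: ranked last
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

In the toy example the second sample is counted only once k reaches 2, which mirrors the situation described above where the correct action is ranked highly but is not the first choice.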

Among the individual models, MViT v2 stands out as the best overall performer, particularly excelling in JIT generalization when trained on Combined data. This suggests that its multi-scale modeling mechanism contributes to its robustness. Uniformer v2 also shows strong performance, likely benefiting from its unified architecture for spatial and temporal modeling. In contrast, VideoSwin and VideoMAE achieve moderate results but struggle with JIT data, suggesting they may lack the necessary mechanisms to handle within-domain variability. TimeSFormer performs the worst, with very low scores on JIT testing (e.g., 0.51% Top-1 accuracy when trained on regular data).

Fig. 8a shows the confusion matrix for the top-performing model, MViT v2, on the action recognition task. The dark diagonal pattern reflects the model’s high accuracy in correctly predicting many classes. The action labels are sorted by frequency, with common classes at the top and rarer ones at the bottom. Notably, the lower section of the matrix contains fewer dark regions, highlighting the model’s difficulty in accurately recognizing less frequent actions. Fig. 9 presents the Top-1 accuracy of verbs, nouns, and full actions across five emergency procedures for each model, with the numbers summarized in Table 2. The consistent shapes across radar plots suggest similar performance trends among models for different procedures. However, no single model dominates across all procedures, indicating that the highest-performing model overall does not necessarily perform best on every individual task.

Fig. 8

Confusion matrices of action recognition and action anticipation with MViT v2 on regular procedures.

Fig. 9

Action recognition Top 1 accuracies of verb, noun, and action by each type of procedure.

Table 2 Performance of Top1 and Top5 Accuracy of Action Recognition.

Action Anticipation

Table 3 summarizes the benchmarking results for action anticipation across the same six models. In the regular train-test configuration, MViT v2 outperforms others with the highest Top-1 and Top-5 accuracies of 60.12% and 87.02%, respectively. Uniformer v2 follows closely with 56.25% Top-1 and 84.70% Top-5 accuracy. VideoSwin and VideoMAE deliver moderate performance, while TimeSFormer lags significantly with only 28.44% Top-1 accuracy. When evaluated on JIT data using models trained solely on regular data, all models suffer a sharp drop in performance—MViT v2 achieves just 8.42% Top-1 accuracy, and Uniformer v2 shows similar degradation. This highlights the challenge of generalizing anticipation models to JIT contexts without diverse training data.

Table 3 Model Performance for Top 1 and Top 5 Accuracy of Action Anticipation.

Under the combined test setup (testing on both regular and JIT data with regular training), MViT v2 again leads with 53.50% Top-1 and 78.71% Top-5 accuracy. Uniformer v2 remains competitive, while TimeSFormer continues to perform the weakest. These patterns align with those observed in the regular test scenario, reinforcing the limitations of training solely on regular procedure data for anticipating actions in varied environments.

Overall, action anticipation follows a similar trend to action recognition. MViT v2 and Uniformer v2 consistently outperform other models. However, the overall accuracy for anticipation is generally lower, especially under JIT conditions, emphasizing the added difficulty of forecasting future actions compared to recognizing current ones. Training with combined datasets proves beneficial, markedly improving the models’ ability to generalize across both regular and JIT scenarios.

Fig. 8b displays the confusion matrix for MViT v2 on the anticipation task, showing reduced classification accuracy for less frequent classes—an issue also observed in recognition. Fig. 10 compares the performance of all models on the five emergency procedures, with the numbers summarized in Table 4. Similar trends appear in the recognition task, although anticipation results tend to be lower overall.

Fig. 10

Action anticipation Top 1 accuracies of verb, noun, and action by each type of procedure.

Table 4 Performance of Top1 and Top5 Accuracy of Action Anticipation.

Action Recognition and Action Anticipation with Large VLMs

Large VLMs have recently attracted considerable interest for solving vision tasks, so in Table 5 we present the performance of three large VLMs on the action recognition and action anticipation tasks: LLaVA v1.6 7B39,40, Qwen2.5 VL 7B41, and Gemma3 4B42. These models were fine-tuned using QLoRA (Quantized Low Rank Adaptation)43, enabling efficient training on a single GPU. For the recognition and anticipation tasks, we reformulated the video-based classification problem into an image understanding task. Specifically, we sampled 9 evenly spaced frames per video clip and arranged them in a 3 × 3 grid layout to compose a single grid image. Natural language prompts were used to embed the target labels, allowing the large VLMs to interpret them within a language-conditioned vision framework.
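The frame-grid construction can be sketched as follows, assuming NumPy and frames of equal size. Whether the first and last frames of a clip are included, and whether frames are resized before tiling, is not specified in the text; this sketch includes both endpoints and performs no resizing. The function names are hypothetical.

```python
import numpy as np


def sample_indices(num_frames: int, n: int = 9) -> list[int]:
    """Pick n evenly spaced frame indices spanning the whole clip (endpoints included)."""
    return [int(i) for i in np.linspace(0, num_frames - 1, n).round()]


def make_grid(frames, rows: int = 3, cols: int = 3) -> np.ndarray:
    """Tile rows*cols equally sized H x W x C frames into one (rows*H) x (cols*W) x C image."""
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    grid_rows = [np.concatenate(frames[r * cols:(r + 1) * cols], axis=1)
                 for r in range(rows)]
    return np.concatenate(grid_rows, axis=0)


if __name__ == "__main__":
    clip = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(81)]  # dummy frames
    picked = [clip[i] for i in sample_indices(len(clip))]
    print(make_grid(picked).shape)
```

Frames are placed in reading order, left to right and top to bottom, so the temporal order of the clip is preserved in the grid.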

Table 5 LVLM Performance (%) on action recognition, anticipation and VQA tasks.

For data preparation, all models were implemented with the HuggingFace transformers library. We used the BLIP model to generate a descriptive caption for each video clip. The dataset was converted to the chat template specified by the transformers library and augmented by oversampling less frequent classes to balance the training data.
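Oversampling of infrequent classes can be illustrated with the sketch below, which duplicates randomly chosen minority-class samples until every class matches the size of the largest one. The exact balancing ratio used for the benchmark is not stated, so full balancing is an assumption, as are the function name and the example file names.

```python
import random
from collections import defaultdict


def oversample(samples, seed=0):
    """Duplicate minority-class samples until all classes match the largest class.

    samples: list of (item, label) pairs. Returns a new shuffled list.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    data = [("clip_a.mp4", "take scalpel"), ("clip_b.mp4", "take scalpel"),
            ("clip_c.mp4", "take scalpel"), ("clip_d.mp4", "incise skin")]
    print(oversample(data))
```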

We used QLoRA for all three models to allow fine-tuning on a single GPU. All models were initialized as 4-bit quantized models via BitsAndBytesConfig. A common SFTConfig (Supervised Fine-Tuning Configuration) was defined to run 15 epochs of supervised fine-tuning with gradient accumulation, gradient checkpointing, a fused AdamW optimizer under a constant learning rate schedule, and bfloat16 precision. The LoRA adapter configurations were similar across models. For LLaVA v1.6 and Gemma3, we attached a LoRA adapter to all linear modules with a rank of 16, an alpha of 16 (for LLaVA), and 5% dropout. For Qwen2.5 VL, we restricted the LoRA adapter to the model’s query and value projection layers, with a rank of 8, an alpha of 16, and 5% dropout. A custom collate function was used for each model to apply the appropriate chat template, tokenize text, preprocess images, and mask padding tokens, label tokens, or special image token IDs as required by the model architecture. Finally, we trained all models with TRL’s SFTTrainer.
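The masking step of such a collate function can be illustrated in plain Python. This is a simplified sketch, not the collate functions used in this work: the helper names and token IDs are hypothetical, and the chat templating, tokenization, and image preprocessing steps are omitted. It relies on the convention that label positions set to -100 are ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_batch(sequences, pad_token_id):
    """Right-pad every token-ID sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_token_id] * (max_len - len(seq)) for seq in sequences]

def mask_labels(input_ids, pad_token_id, image_token_ids=()):
    """Build labels from input IDs, replacing padding and image placeholder
    tokens with IGNORE_INDEX so they do not contribute to the loss."""
    masked = {pad_token_id, *image_token_ids}
    return [
        [IGNORE_INDEX if token in masked else token for token in seq]
        for seq in input_ids
    ]

# Hypothetical IDs: 0 is the padding token, 9 is an image placeholder token.
batch = pad_batch([[5, 9, 7], [4]], pad_token_id=0)
labels = mask_labels(batch, pad_token_id=0, image_token_ids=[9])
```

If a tokenizer reuses its end-of-sequence token as the padding token, this simple rule would also mask genuine end-of-sequence labels, so a real collate function may need to use the attention mask instead.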

For action recognition, LLaVA-v1.6-7B achieves the highest Top-1 accuracy across all evaluation settings: 45.56% on the regular test set, 26.93% on the Just-in-Time (JIT) set, and 43.10% on the combined set. These results indicate that LLaVA-v1.6-7B is the most effective large VLM among those evaluated for recognizing previously seen actions, as well as generalizing across different procedural settings. Qwen2.5-VL-7B and Gemma3-4B also perform competitively, but consistently fall short of LLaVA’s accuracy, especially on the regular and combined splits. The strong results on the JIT set suggest that large VLMs possess robust zero-shot generalization abilities, making them promising candidates for settings where annotated data is scarce or procedural variations are frequent.

For action anticipation, the overall performance is lower across all models, reflecting the increased difficulty of predicting future actions. Qwen2.5-VL-7B leads slightly with the best combined set accuracy (21.77%), although LLaVA-v1.6-7B attains the highest score on the JIT subset (15.75%). This marginal advantage on JIT data suggests that LLaVA may generalize better to out-of-distribution or previously unseen procedural variations, possibly due to its larger visual-text alignment capacity. However, none of the large VLMs matches its own recognition performance on the anticipation task, nor does any of them outperform vision-only models such as MViT v2 and Uniformer v2 (see Tables 1 and 3), which were optimized specifically for temporal video understanding.

Medical Visual Question Answering

Table 6 presents a comparative analysis of six fine-tuned MVQA models with varying sizes: ViLT-B/3244, BLIP45, Florence246, LLaVA-v1.6-7B39,40, Qwen2.5-VL-7B41, and Gemma3-4B42. Among them, LLaVA-v1.6, Qwen2.5-VL, and Gemma3 were fine-tuned using the QLoRA (Quantized Low-Rank Adaptation) approach43, enabling efficient model adaptation on a single GPU. BLIP achieved the highest accuracy at 88.64%, showcasing strong MVQA performance with a relatively moderate parameter size. Florence2 followed closely with an accuracy of 87.86%. LLaVA-v1.6 and Qwen2.5-VL achieved accuracies of 85.57% and 83.29%, respectively, while Gemma3 attained 72.04%. ViLT-B/32, the smallest model with only 87 million parameters, provided a lightweight option with a respectable accuracy of 79.88%.

Table 6 VQA model performance comparison.

Usage Notes

The Trauma THOMPSON dataset is openly available for research and development purposes. The README file in our GitHub repository gives detailed instructions on how to process the dataset for different tasks, including preprocessing steps and annotation formats. The code for processing the videos, extracting relevant features, and building models is available in the same repository, allowing researchers to work with the dataset efficiently and implement their own machine learning pipelines. All users of the dataset are required to fill out a consent form, located under the “Terms” section, after which the raw videos and annotations become available for download.