Fig. 2: Decoding surgical subphases from videos.
From: A vision transformer for decoding surgeon activity from surgical videos

a–c, SAIS is trained on video samples exclusively from USC and evaluated on those from USC (a), SAH (b) and HMH (c). Results are shown as the average (±1 standard deviation) across ten Monte Carlo cross-validation splits. d, We trained ablated variants of SAIS to quantify the marginal contribution of each of its components to its positive predictive value (PPV). We removed test-time augmentation (‘without TTA’), RGB frames (‘without RGB’), flow maps (‘without flow’) and the self-attention mechanism (‘without SA’). We found that self-attention and the dual-modality input (RGB frames and flow maps) are the greatest contributors to PPV. e, We benchmarked SAIS against an I3D model when decoding subphases from entire VUA videos without human supervision. Each box reflects the quartiles of the results, and the whiskers extend to 1.5× the interquartile range.
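As a minimal sketch of how the summary statistics in this figure might be computed (not the authors' code), the snippet below aggregates hypothetical per-split PPV values: the mean ±1 standard deviation over ten Monte Carlo cross-validation splits (panels a–d), and box-plot quartiles with whiskers extending to the most extreme points within 1.5× the interquartile range (panel e). The PPV values and function names are illustrative assumptions.

```python
import numpy as np

def monte_carlo_summary(scores):
    """Mean ± 1 sample standard deviation across cross-validation
    splits, as reported in panels a-d."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

def box_whisker_extents(scores):
    """Quartiles and whisker limits for the box plots in panel e.
    Whiskers reach the most extreme data points lying within
    1.5x the interquartile range (IQR) of the box edges."""
    scores = np.asarray(scores, dtype=float)
    q1, q2, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    low = scores[scores >= q1 - 1.5 * iqr].min()
    high = scores[scores <= q3 + 1.5 * iqr].max()
    return (q1, q2, q3), (low, high)

# Hypothetical per-split PPV values, for illustration only.
ppv_per_split = [0.86, 0.84, 0.88, 0.85, 0.87,
                 0.83, 0.89, 0.86, 0.85, 0.87]
mean, sd = monte_carlo_summary(ppv_per_split)
print(f"PPV = {mean:.2f} ± {sd:.2f}")  # prints "PPV = 0.86 ± 0.02"
print(box_whisker_extents(ppv_per_split))
```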