Fig. 2

This flowchart shows how data moves through our pipeline for phase recognition in SICS. First, ophthalmologists manually annotate the 105 videos in our collection, and the ground truth is generated from these annotations. The raw videos are processed into I3D features, which serve as input to the MS-TCN++ architecture. The red box provides further details on MS-TCN++, adapted from the original publication40. The model first generates an initial coarse prediction in the Prediction Stage, which is subsequently refined through N_r Refinement Stages. Each stage in our setup consists of 13 layers and uses dilated convolutions to integrate an increasing temporal context. The phase predictions output by the network are then compared with the ground truth to calculate various performance metrics (e.g. accuracy).
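
As a rough illustration of two steps in the caption, the sketch below estimates how much temporal context a stack of dilated convolutions covers and computes frame-wise accuracy against the ground truth. It is not the authors' implementation. The kernel size of 3 and the dilation factor that doubles with each layer (1, 2, 4, ...) are assumptions borrowed from common MS-TCN-style configurations, and the function names `receptive_field` and `frame_accuracy` are hypothetical.

```python
from typing import Sequence


def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Frames seen by one output frame after `num_layers` dilated layers.

    Assumption (not stated in the caption): layer l uses dilation 2**l,
    so each layer adds (kernel_size - 1) * 2**l frames of context.
    """
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        field += (kernel_size - 1) * dilation
    return field


def frame_accuracy(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of frames whose predicted phase matches the annotation."""
    if len(predicted) != len(truth):
        raise ValueError("prediction and ground truth must have equal length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


if __name__ == "__main__":
    # Context covered by a 13-layer stage under the assumptions above.
    print(receptive_field(13))
    # Toy example with integer phase labels per frame.
    print(frame_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))
```

Under these assumptions, a 13-layer stage covers 16,383 feature frames, which suggests why deeper stages can take long stretches of a surgical video into account when labelling a single frame.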