Figure 2
From: Video-based formative and summative assessment of surgical tasks using deep learning

Results for the primary PC datasets. (a) Actual vs. predicted FLS scores for all ten training sessions combined. The histograms show the frequency of samples at each score. The network tends to slightly inflate its score predictions, which causes some trials close to the cut-off to cross it (shown in red). Because the classification analysis was conducted separately, this inflation does not affect the pass/fail prediction accuracy. (b) ROC curves. The blue line is the average of the 10 running sessions, each of which is shown in gray. The yellow line represents random chance. (c) Question–answer trust plots for each class. The VBA-Net has high trustworthiness for true predictions, i.e., the Softmax probabilities are close to 1.0 for the majority of the samples, as shown in green. In contrast, the network is cautious about wrong predictions, i.e., the Softmax probabilities lie close to the threshold of 0.5 and do not accumulate at the extreme end of 0.0, as illustrated in red.
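As an illustration of how an averaged ROC curve like the blue line in panel (b) could be obtained, the sketch below interpolates each run's curve onto a shared false-positive-rate grid and takes the pointwise mean. The caption does not say how the averaging was actually done, so this vertical-averaging scheme is an assumption, and the names `roc_points` and `mean_roc` are hypothetical helpers, not part of the VBA-Net code.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep the decision threshold and return (fpr, tpr) for one run."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    y_sorted = y_true[order]
    s_sorted = scores[order]
    tps = np.cumsum(y_sorted)    # true positives at each cut
    fps = np.cumsum(~y_sorted)   # false positives at each cut
    # Keep only the last index of each distinct score so ties form one point.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tpr = np.r_[0.0, tps[last] / y_sorted.sum()]
    fpr = np.r_[0.0, fps[last] / (~y_sorted).sum()]
    return fpr, tpr

def mean_roc(runs, grid_size=101):
    """Average ROC curves from several runs on a common FPR grid.

    `runs` is a list of (y_true, scores) pairs, one per run (assumed layout).
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    tprs = []
    for y_true, scores in runs:
        fpr, tpr = roc_points(y_true, scores)
        tprs.append(np.interp(grid, fpr, tpr))
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[0] = 0.0  # pin the curve to the origin
    # Trapezoidal area under the averaged curve.
    auc = float(np.sum(np.diff(grid) * (mean_tpr[1:] + mean_tpr[:-1]) / 2.0))
    return grid, mean_tpr, auc
```

Each per-run curve from `roc_points` would correspond to one gray line, and the diagonal from (0, 0) to (1, 1) to the random-chance reference.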