Fig. 2 | Scientific Reports

Fig. 2

From: An open-source tool for automated human-level circling behavior detection

Human F1 scores. (A) Treating one independent observer as the gold standard for another reveals that humans show substantial variability in labeling circling behavior. In particular, although the average F1 scores for the three pairs (AB, BC, CA) are similar (0.53, 0.52, 0.49), the distribution of scores across videos for pair CA differs significantly from those of the other two pairs (p = 3.5E−2 versus AB and p = 1.4E−4 versus BC), whereas pairs AB and BC do not differ significantly from each other (p = 0.28). (B) Scoring independent observers' labels against another observer (left columns) or against consensus labels (agreement among 3 observers, right columns) produces similar results (p = 0.2), as does comparing our two human data subsets (train versus test subset, p = 0.65 and 0.75). Pooled pairwise F1 scores averaged 0.51 (95% CI 0.47–0.55) in the training set and 0.53 (0.41–0.62) in the testing set. Scoring against consensus occurrences, in which all observers mark a complete circle within 0.1 s of one another, produced similar scores of 0.51 (0.44–0.57) in the training set and 0.53 (0.38–0.65) in the testing set. Each point in a column represents a single video. Labeler-video combinations for which the F1 score is undefined (i.e., both scorer and ground truth marked no circling instances) are not displayed for either paired or consensus scoring but were included in bootstrapping for the purpose of calculating confidence intervals.
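The pairwise scoring described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `pairwise_f1`, the greedy one-to-one matching rule, and the reuse of the 0.1 s consensus window as the matching tolerance for pairwise scoring are all assumptions made for illustration. The sketch treats one observer's timestamps as ground truth, counts matched events as true positives, and returns `None` for the undefined case in which neither observer marked any circling.

```python
from typing import Optional, Sequence


def pairwise_f1(
    scorer: Sequence[float],
    truth: Sequence[float],
    tol: float = 0.1,  # assumed tolerance in seconds (borrowed from the consensus window)
) -> Optional[float]:
    """F1 of one observer's circle timestamps scored against another's.

    Returns None when both observers marked no instances, since F1 is
    undefined in that case.
    """
    if len(scorer) == 0 and len(truth) == 0:
        return None

    truth_sorted = sorted(truth)
    used = [False] * len(truth_sorted)
    true_pos = 0

    # Hypothetical matching rule: each scorer event claims the earliest
    # unused ground-truth event lying within `tol` seconds of it.
    for t in sorted(scorer):
        for i, g in enumerate(truth_sorted):
            if not used[i] and abs(t - g) <= tol:
                used[i] = True
                true_pos += 1
                break

    false_pos = len(scorer) - true_pos
    false_neg = len(truth) - true_pos
    # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero here because
    # at least one of the two label lists is non-empty.
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)
```

Because F1 is not symmetric in how errors are labeled but is symmetric in value under this one-to-one matching, swapping which observer serves as the gold standard exchanges false positives and false negatives while leaving the score unchanged.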
