Fig. 5
From: An open-source tool for automated human-level circling behavior detection

Dataset size performance comparison. (A) Labeling performance (error, in pixels) for each of 10 trained networks on datasets of progressively smaller sizes. All dataset sizes resulted in greater labeling error than the Full Dataset model (dashed horizontal line), particularly for frames not seen during training (test frames). Notably, this trend was not monotonic: the set of quarter-dataset models performed better on test frames, on average, than the set of half-dataset models. Root-mean-squared errors on training-set frames were (mean and 95% CI) 9.29 (8.13–10.73), 9.84 (8.53–11.7), and 11.02 (9.11–12.91) pixels for the half-, quarter-, and eighth-sized datasets, respectively. For unseen frames, these errors increased to 19.37 (16.92–22.28), 12.3 (10.51–14.4), and 14.34 (12.66–15.98) pixels. The dashed horizontal line represents the Full Dataset model's training-frame error (7.82 pixels). (B) To determine whether these changes in labeling quality impacted behavioral detection, we applied the optimized Box-Angle method to the keypoint tracking produced by each network at each dataset size. Within a dataset, the true-positive, false-positive, and false-negative counts for each video were summed to calculate a representative F1 score, plotted here as individual dots for the half-, quarter-, and eighth-sized datasets. The resulting distributions are compared to scores from the Full Dataset network (left column) and to independent human scores (right). As elsewhere, video-net combinations for which the F1 score is undefined are included in confidence interval calculations but not displayed as individual datapoints. These smaller datasets underperformed the Full Dataset network (p = 0.03, 0.03, 0.02) as well as human labels (p = 1.7E−4, 1.4E−4, 3.9E−5), indicating that even small reductions in keypoint tracking quality can impact behavioral detection. *p < 0.05, ***p < 0.001.
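
For reference, the F1 scores in (B) follow the standard definition F1 = 2TP / (2TP + FP + FN), computed here from summed true-positive, false-positive, and false-negative counts. The sketch below is a minimal, hypothetical illustration of that calculation, assuming counts are pooled before the score is computed and that undefined scores (zero denominator) are excluded from plotting, as the caption describes; it is not taken from the tool's codebase, and all function and variable names are illustrative.

# Minimal illustrative sketch (assumptions noted above): computing an F1
# score from summed true-positive / false-positive / false-negative counts.

def f1_from_counts(tp: int, fp: int, fn: int):
    """Return F1 = 2*TP / (2*TP + FP + FN), or None when undefined."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return None  # undefined F1: kept for interval calculations, not plotted
    return 2 * tp / denom

def pooled_f1(per_video_counts):
    """Sum (tp, fp, fn) counts across videos, then compute a single F1."""
    tp = sum(c[0] for c in per_video_counts)
    fp = sum(c[1] for c in per_video_counts)
    fn = sum(c[2] for c in per_video_counts)
    return f1_from_counts(tp, fp, fn)

# Example with hypothetical counts for three videos scored by one network:
print(pooled_f1([(12, 1, 2), (8, 0, 3), (0, 2, 1)]))  # ~0.816
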