Supplementary Figure 2: Cross-validating model parameters and testing deeper networks.
From: DeepLabCut: markerless pose estimation of user-defined body parts with deep learning

a: Average training and test error for the same 6 splits with 10% training-set size as in Fig. 2f, with standard augmentation (i.e., scaling), augmentation by rotations, and augmentation by rotations and translations (see Methods). Although augmentation yields 8 and 24 times as many training samples (rotations and rotations + translations, respectively), the training and test errors remain comparable. b: Average training and test error for the same 3 splits with 50% training-set size for three different architectures: ResNet-50, ResNet-101, and ResNet-101ws, where part loss layers are added to the conv4 bank (ref. 29). For these networks the training error is strongly reduced and the test performance modestly improved, indicating that the deeper networks do not over-fit (but also do not offer a radical improvement). Results are averaged over 3 splits; individual simulation results are shown as faint lines. The deeper networks reach human-level accuracy on the test set. The data for ResNet-50 are also depicted in Fig. 2d. c: Cross-validating model parameters for ResNet-50 and 50% training-set fraction. We varied the distance variable ϵ, which determines the width of the score-map target around the ground-truth location during training, with the scale variable set to 100% (otherwise the scale ratio of the output layer was set to 80% relative to the input image size). Varying the distance parameter only mildly improves test performance (after 500k training steps). The average performance for scale 0.8 and ϵ = 17 is indicated by horizontal lines (from Fig. 2d). In particular, for smaller distance parameters the RMSE increases and learning proceeds much more slowly (c, d). d-e: Evolution of the training and test errors at various stages of network training for the distance variables ϵ corresponding to c.
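To illustrate the role of the distance variable ϵ, the following is a minimal sketch (not the DeepLabCut implementation) of how a binary score-map target controlled by such a threshold could be constructed: output cells whose centers lie within ϵ pixels of the ground-truth keypoint are marked positive, so larger ϵ broadens the template around the ground truth. The function name make_scoremap and the parameters stride and epsilon are illustrative assumptions; the actual score-map generation is described in the Methods.

import numpy as np

def make_scoremap(image_shape, keypoint_xy, stride=8, epsilon=17.0):
    """Binary score-map target (illustrative sketch, not the paper's code).

    image_shape : (height, width) of the input image
    keypoint_xy : (x, y) ground-truth location in image coordinates
    stride      : assumed down-sampling factor of the output layer
    epsilon     : distance threshold setting the width of the template
    """
    h, w = image_shape[0] // stride, image_shape[1] // stride
    # Centers of the output-map cells, mapped back to image coordinates.
    xs = (np.arange(w) + 0.5) * stride
    ys = (np.arange(h) + 0.5) * stride
    grid_x, grid_y = np.meshgrid(xs, ys)
    # Mark cells within epsilon pixels of the ground-truth keypoint as positive.
    dist = np.sqrt((grid_x - keypoint_xy[0]) ** 2 + (grid_y - keypoint_xy[1]) ** 2)
    return (dist <= epsilon).astype(np.float32)

# Example: with a larger epsilon, more output cells are labeled positive,
# which widens the training signal around the ground-truth location.
target = make_scoremap((480, 640), keypoint_xy=(320.0, 240.0), epsilon=17.0)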