Extended Data Fig. 6: Active learning speed across different sample sizes.
From: A-SOiD, an active-learning platform for expert-guided, data-efficient discovery of behavior

To estimate the total time required to run larger datasets with A-SOiD, we measured the runtime of our active-learning regime (max iterations = 20, max number of samples per iteration = 200, initial ratio = 0.01) across a range of subsets (0.3 to 1.0) of the CalMS21 dataset (number of features = 100). We then fit a linear function to the measurements to estimate performance with increasing sample sizes. a) Total time A-SOiD takes to complete 20 iterations, including feature extraction of the train set. Given the fit, every 1,000 new samples increase the runtime by 3 seconds; running 1 million samples takes roughly 53 min. b) Isolated feature extraction speed for each subset. Given the fit, every 1,000 samples increase the runtime by 2 seconds (1 million samples, about 28 min). Notably, we considerably optimized feature extraction by employing just-in-time compilation with numba 0.52.0 (https://github.com/numba/numba). However, feature extraction, which is run once at the beginning, remains the main speed bottleneck. The vertical dotted line indicates the original size of the dataset. Each subset was repeated 3 times, and the speed was averaged across seeds. Error bars represent the standard deviation.
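
The extrapolation described above amounts to a first-degree polynomial fit of runtime against sample count, evaluated at larger sample sizes. Below is a minimal sketch of that procedure, not the A-SOiD code itself; the sample counts and timings are illustrative placeholders rather than the measured values reported in the figure.

```python
import numpy as np

# Hypothetical measurements: (number of samples, runtime in seconds).
# In the figure, these would be the timed subsets (0.3 to 1.0) of CalMS21.
n_samples = np.array([150_000, 250_000, 350_000, 450_000, 500_000])
runtime_s = np.array([500.0, 800.0, 1100.0, 1400.0, 1550.0])

# Fit runtime = slope * n_samples + intercept.
slope, intercept = np.polyfit(n_samples, runtime_s, deg=1)

# Seconds added per 1,000 new samples, and projected runtime at 1 million samples.
per_1k = slope * 1_000
projected_1m_min = (slope * 1_000_000 + intercept) / 60
print(f"~{per_1k:.1f} s per 1,000 samples; ~{projected_1m_min:.0f} min at 1 million samples")
```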