Extended Data Fig. 1: Results on six fine-grained visual recognition benchmarks.
From: Emulating human-like adaptive vision for efficient and flexible machine visual perception

a, Quantitative comparisons of AdaptiveNN and conventional non-adaptive models: Top-1 validation accuracy versus average computational cost per inference. Datasets: CUB-200-2011 (ref. 100), NABirds (ref. 101), Oxford-IIIT Pet (ref. 102), Stanford Dogs (ref. 103), Stanford Cars (ref. 104) and FGVC-Aircraft (ref. 105). Results are means ± standard deviations over five independent trials with different random seeds. Non-adaptive models with varying costs are obtained by modifying model size and input resolution. Here the maximum number of fixations is set to two, which is generally sufficient for these recognition tasks. b–e, Qualitative evaluation of the visual fixations chosen by AdaptiveNN-DeiT-S on four datasets: CUB-200-2011, Oxford-IIIT Pet, Stanford Cars and FGVC-Aircraft. The visualizations follow the setup established in Fig. 3a. Images adapted from ref. 112 under a Creative Commons license CC0 1.0; ref. 114 under a Creative Commons license CC BY-SA 4.0; ref. 115 under a Creative Commons license CC0 1.0; refs. 105,116.