Fig. 4: Networks trained to estimate F0 directly from sound waveforms exhibit less human-like pitch behavior.

a Schematic of model structure. The model architecture was identical to that depicted in Fig. 1a, except that the hardwired cochlear input representation was replaced by a layer of one-dimensional convolutional filters operating directly on sound waveforms. The first-layer filter kernels were optimized for the F0 estimation task along with the rest of the network weights. We trained the ten best networks from our architecture search with these learnable first-layer filters.

b Best frequencies (sorted from lowest to highest) of the 100 learned filters for each of the ten network architectures, plotted in magenta. For comparison, the best frequencies of the 100 cochlear filters in the hardwired peripheral model are plotted in black.

c Effect of the learned cochlear filters on network behavior in all five main psychophysical experiments (see Fig. 2a–e): F0 discrimination as a function of harmonic number and phase (Expt. a), pitch estimation of alternating-phase stimuli (Expt. b), pitch estimation of frequency-shifted complexes (Expt. c), pitch estimation of complexes with individually mistuned harmonics (Expt. d), and frequency discrimination with pure and transposed tones (Expt. e). Lines plot means across the ten networks; error bars plot 95% confidence intervals bootstrapped across the ten networks.

d Comparison of human-model similarity metrics for each psychophysical experiment between networks trained with either the hardwired cochlear model (black) or the learned cochlear filters (magenta). Asterisks indicate the statistical significance of two-sample t-tests comparing the two cochlear-model conditions: ***p < 0.001, *p = 0.016. Error bars indicate 95% confidence intervals bootstrapped across the ten network architectures.
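The front end described in panel a can be sketched as a bank of one-dimensional convolutional filters applied directly to the waveform. This is an illustrative sketch only: the filter count (100) matches panel b, but the kernel length, input length, and random initialization are assumptions, and in the actual networks the kernels are optimized for the F0 estimation task rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_filters, kernel_len = 100, 512   # 100 filters as in panel b; kernel length assumed
kernels = rng.standard_normal((n_filters, kernel_len)) * 0.01  # learnable in practice

def learned_front_end(waveform):
    # Convolve the raw waveform with every first-layer filter ("same"-length
    # outputs), yielding a (n_filters, n_samples) array for the rest of the net.
    return np.stack([np.convolve(waveform, k, mode="same") for k in kernels])

x = rng.standard_normal(4000)      # stand-in for a short sound waveform
features = learned_front_end(x)
print(features.shape)              # (100, 4000)
```

In the hardwired configuration this convolutional layer is replaced by the fixed cochlear filterbank; here its kernels are free parameters trained end to end.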
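One plausible way to obtain the "best frequencies" plotted in panel b is to take the frequency at which each learned kernel's magnitude response peaks. The sketch below demonstrates this on a synthetic cosine kernel; the 20 kHz sample rate and FFT length are assumptions for illustration, not values from the paper.

```python
import numpy as np

def best_frequency(kernel, sample_rate, n_fft=4096):
    # Frequency (Hz) at which the filter's magnitude response is maximal.
    spectrum = np.abs(np.fft.rfft(kernel, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Sanity check: a 1 kHz cosine kernel should have a best frequency near 1000 Hz.
sr = 20000                          # sample rate assumed for illustration
t = np.arange(512) / sr
bf = best_frequency(np.cos(2 * np.pi * 1000 * t), sr)
print(bf)
```

Sorting these values from lowest to highest across the 100 filters yields a curve directly comparable to the cochlear filterbank's best frequencies.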
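The statistics in panels c and d (bootstrapped 95% confidence intervals across the ten networks, and two-sample t-tests between the two cochlear-model conditions) can be sketched as follows. The similarity scores below are hypothetical stand-ins, not results from the figure; only the procedure is meant to match the caption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_boot=10000):
    # 95% CI of the mean, resampling the ten networks with replacement,
    # as for the error bars in panels c and d.
    means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    return np.percentile(means, [2.5, 97.5])

# Hypothetical per-network human-model similarity scores (illustrative only).
hardwired = rng.normal(0.9, 0.03, size=10)
learned = rng.normal(0.6, 0.10, size=10)

lo, hi = bootstrap_ci(learned)
t, p = stats.ttest_ind(hardwired, learned)  # two-sample t-test, as in panel d
print(lo, hi, p)
```

With only ten networks per condition, the bootstrap resamples architectures rather than trials, so the intervals reflect variability across architectures.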