Fig. 2: Hierarchy of layers in DNNs correlates with the AN–midbrain–STG ascending auditory pathway.
From: Dissecting neural computations in the human auditory pathway using deep neural networks for speech

a, Normalized BPS of the best-performing neural encoding model based on each single layer of the HuBERT model (maximum over delay window lengths). Magenta bars indicate CNN output layers; cyan bars indicate transformer layers. Red stars indicate the best model for each area; black dots indicate models that were not statistically different from the best model (P > 0.05, two-sided paired t-test; n = 50 neurons for the AN, n = 100 neurons for the IC, n = 53 electrodes for the HG and n = 144 electrodes for the STG). From left to right: AN, IC, HG and STG (the same applies to each row in b and c). b, Averaged TRF weights (absolute beta weights of the spectrotemporal encoding model) in speech-responsive units/electrodes of each area (mean ± s.e.m.; light-shaded areas indicate randomly permuted distributions; black dots indicate time points with TRF weights significantly higher than chance level; two-sided t-test, P < 0.05, Bonferroni-corrected for 20 time points). c, Normalized BPS of the best-performing neural encoding model (maximum over single layers and delay window lengths) for different areas of the pathway. The color key indicates the different layer types (CNN supervised, CNN layers from the supervised Deep Speech 2 or HuBERT model; CNN-SSL, CNN layers from the self-supervised Wav2Vec 2 or HuBERT model; LSTM supervised, LSTM layers from Deep Speech 2; Transformer SSL + FT, transformer layers from the self-supervised and fine-tuned Wav2Vec 2 model; Transformer SSL, transformer layers from the self-supervised Wav2Vec 2 or HuBERT model; Transformer supervised, transformer layers from the purely supervised HuBERT model; CNN random, CNN layers from the randomized HuBERT model; Transformer random, transformer layers from the randomized HuBERT model). Red stars indicate the best model for each area; black dots indicate models that were not statistically different from the best model (P > 0.05, two-sided paired t-test).
Dashed horizontal line indicates the baseline model using full acoustic–phonetic features. For a and c, the box plot shows the first and third quartiles across electrodes (orange line indicates the median; black line indicates the mean; whiskers indicate the 5th and 95th percentiles). a.u., arbitrary units; ECoG, electrocorticography; Spect, spectrogram; feat., features; DS2, Deep Speech 2; W2V, Wav2Vec 2; HuB., HuBERT; W2V-A, Wav2Vec 2 ASR supervised model; Tr., transformer; Sup., supervised; Ran., randomized.
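The layer-comparison statistic used throughout panels a and c (best layer by normalized BPS, then a two-sided paired t-test of every other layer against it, flagging layers with P > 0.05 as not statistically different) can be sketched as follows. This is an illustrative reconstruction on simulated data, not the authors' code; the array shapes, seed, and the "boosted" layer index are assumptions for the example.

```python
import numpy as np
from scipy import stats  # for the paired t-test (scipy.stats.ttest_rel)

rng = np.random.default_rng(0)
n_units, n_layers = 50, 12  # hypothetical: e.g., 50 AN units, 12 model layers

# Simulated normalized BPS for each unit under each layer's encoding model
bps = rng.normal(loc=0.5, scale=0.1, size=(n_units, n_layers))
bps[:, 7] += 0.08  # make one layer clearly best (illustrative only)

# Best layer = highest mean normalized BPS across units (the red star)
best = int(np.argmax(bps.mean(axis=0)))

# Two-sided paired t-test of every other layer against the best layer;
# layers with P > 0.05 are "not statistically different" (the black dots)
not_different = [
    layer for layer in range(n_layers)
    if layer != best
    and stats.ttest_rel(bps[:, layer], bps[:, best]).pvalue > 0.05
]
print(best, not_different)
```

The test is paired because each unit contributes one BPS value per layer, so per-unit differences are what the comparison is defined over; the Bonferroni correction mentioned for panel b would instead divide the threshold by the number of comparisons (0.05/20 for the 20 TRF time points).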