Abstract
Transforming continuous acoustic speech signals into discrete linguistic meaning is a remarkable computational feat accomplished by both the human brain and modern artificial intelligence. A key scientific question is whether these biological and artificial systems, despite their different architectures, converge on similar strategies to solve this challenge. Although automatic speech recognition systems now achieve human-level performance, research on their parallels with the brain has been limited by biologically implausible, non-causal models and comparisons that stop at predicting brain activity without detailing the alignment of the underlying representations. Furthermore, studies using text-based models overlook the crucial acoustic stages of speech processing. Here we bridge these gaps: using high-resolution intracranial recordings and a causal, recurrent automatic speech recognition model, we uncover a striking correspondence between the brain’s processing hierarchy and the model’s internal representations. Specifically, we demonstrate a deep alignment in their algorithmic approach: neural activity in distinct cortical regions maps topographically to corresponding model layers, and critically, the representational content at each stage follows a parallel progression from acoustic to phonetic, lexical and semantic information. This work thus moves beyond demonstrating simple model–brain alignment to specifying the shared underlying representations at each stage of processing, providing direct evidence that both systems converge on a similar computational strategy for transforming sound into meaning.
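To make the layer-to-region mapping described above concrete, the minimal Python sketch below illustrates one standard way such an alignment can be quantified: fit a regularized encoding model from each model layer's activations to each electrode's response, then assign the electrode to its best-predicting layer. This is an illustration on synthetic placeholder data under assumed settings (ridge regression, five-fold cross-validation, no temporal lags, invented array shapes and variable names), not the authors' actual pipeline.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
T, n_layers, n_units, n_electrodes = 2000, 6, 64, 10

# Placeholder model activations: one (time x units) matrix per layer.
layer_activations = [rng.standard_normal((T, n_units)) for _ in range(n_layers)]

# Placeholder neural responses: each electrode is driven mainly by one layer,
# so the analysis below should recover a layer-to-electrode mapping.
true_layer = rng.integers(0, n_layers, size=n_electrodes)
neural = np.stack(
    [layer_activations[l] @ rng.standard_normal(n_units) + 0.5 * rng.standard_normal(T)
     for l in true_layer],
    axis=1,
)  # shape: (time, electrodes)

# For every electrode, regress its response onto each layer's activations and
# keep the layer with the highest cross-validated prediction accuracy.
best_layer = np.empty(n_electrodes, dtype=int)
for e in range(n_electrodes):
    scores = [cross_val_score(Ridge(alpha=1.0), X, neural[:, e], cv=5, scoring="r2").mean()
              for X in layer_activations]
    best_layer[e] = int(np.argmax(scores))

print("assigned layer per electrode:", best_layer)
print("ground-truth layer:         ", true_layer)

In the study itself, the responses would be high-gamma iEEG envelopes, the predictors would be time-aligned activations from the recurrent speech recognition model, and prediction accuracy would be assessed with temporal response function models and appropriate noise-ceiling correction rather than this single-lag ridge fit.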
Data availability
Although the iEEG recordings used in this study cannot be made publicly available owing to patient privacy restrictions, they can be requested from the author (N.M.). Source data are provided with this paper.
Code availability
Acknowledgements
This study was funded by the National Institute on Deafness and Other Communication Disorders (grant no. R01DC014279 to N.M.). S.B. was also supported by the National Institute on Deafness and Other Communication Disorders, grant no. R01DC019979. The funders had no role in the study design, data collection and analysis, decision to publish or manuscript preparation.
Author information
Authors and Affiliations
Contributions
Conceptualization (M.K. and N.M.); methodology (M.K., G.M., S.T., B.K. and N.M.); data collection (S.B. and A.D.M.); data analysis (M.K., G.M., S.T. and N.M.); writing—original draft (M.K. and N.M.); and writing—revision (G.M., S.T., B.K. and N.M.).
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Yuanning Li, Jonathan Venezia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–3.
Source data
Source Data Fig. 1
Data for plotting Fig. 1.
Source Data Fig. 2
Data for plotting Fig. 2.
Source Data Fig. 3
Data for plotting Fig. 3.
Source Data Fig. 4
Data for plotting Fig. 4.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Keshishian, M., Mischler, G., Thomas, S. et al. Parallel hierarchical encoding of linguistic representations in the human auditory cortex and recurrent automatic speech recognition systems. Nat Mach Intell (2026). https://doi.org/10.1038/s42256-026-01185-0
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s42256-026-01185-0


