Abstract
Comparing information structures between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision–language DNN models, such as contrastive language–image pretraining (CLIP), with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates human visual perception. However, interpreting the results from such comparisons is inherently limited owing to the ‘black box’ nature of DNNs. To address this, here we combine model–brain fitness analyses with human brain lesion data to examine how disrupting the communication pathway between the visual and language systems causally affects the ability of vision–language DNNs to explain VOTC activity. Across four diverse datasets, CLIP consistently captured unique variance in VOTC neural representations relative to both label-supervised (ResNet) and unsupervised (MoCo) models. This advantage tended to be left-lateralized at the group level, aligning with the human language network. Analyses of 33 patients who had experienced a stroke revealed that reduced white matter integrity between the VOTC and the language region in the left angular gyrus correlated with decreased CLIP–brain correspondence and increased MoCo–brain correspondence, indicating a dynamic influence of language processing on VOTC activity. These findings support the integration of language modulation into neurocognitive models of human vision, reinforcing concepts from vision–language DNN models. The sensitivity of model–brain similarity to specific brain lesions demonstrates that leveraging manipulations of the human brain is a promising framework for evaluating and developing brain-like computer models.
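For readers unfamiliar with the model–brain fitness approach, the sketch below illustrates the core comparison in the spirit of representational similarity analysis (ref. 60): representational dissimilarity matrices (RDMs) from a vision–language model (CLIP) and a purely visual model (MoCo) are each correlated with an RDM computed from VOTC response patterns. All feature and voxel matrices are random placeholders and the function names are ours; this is an illustrative sketch, not the authors' analysis code.

```python
# Illustrative RSA sketch (not the authors' code): compare the representational
# geometry of two models against a (placeholder) VOTC response matrix.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 100
clip_features = rng.standard_normal((n_stimuli, 512))   # placeholder CLIP embeddings
moco_features = rng.standard_normal((n_stimuli, 2048))  # placeholder MoCo embeddings
votc_patterns = rng.standard_normal((n_stimuli, 300))   # placeholder VOTC voxel patterns

def rdm(x):
    """Condensed representational dissimilarity vector (1 - Pearson r between rows)."""
    return pdist(x, metric="correlation")

neural_rdm = rdm(votc_patterns)
for name, feats in [("CLIP", clip_features), ("MoCo", moco_features)]:
    rho, p = spearmanr(rdm(feats), neural_rdm)
    print(f"{name}-VOTC RSA: rho = {rho:.3f}, p = {p:.3g}")
```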
Data availability
The data that support the findings of this study are available via figshare at https://doi.org/10.6084/m9.figshare.29531288.v3 (ref. 70). The original neuroimaging data are not publicly available owing to ethical constraints. De-identified data may be accessed by researchers who meet the criteria upon reasonable request: study 1 via the corresponding author (ybi@pku.edu.cn); study 2 via the Ethics Committee of the First Hospital of Shanxi Medical University (phone: +86 351 4639242) or the corresponding author (ybi@pku.edu.cn). Eligible requests will receive a response within 2 weeks.
Code availability
The custom code that supports the findings of this study is available via figshare at https://doi.org/10.6084/m9.figshare.29531288.v3 (ref. 70).
References
Bao, P., She, L., McGill, M. & Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature 583, 103–108 (2020).
Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).
Schrimpf, M. et al. Brain-score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://doi.org/10.1101/407007 (2018).
Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Kriegeskorte, N. et al. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60, 1126–1141 (2008).
Ungerleider, L. G. & Haxby, J. V. ‘What’ and ‘where’ in the human brain. Curr. Opin. Neurobiol. 4, 157–165 (1994).
Dobs, K., Martinez, J., Kell, A. J. E. & Kanwisher, N. Brain-like functional specialization emerges spontaneously in deep neural networks. Sci. Adv. 8, eabl8913 (2022).
Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).
Vinken, K., Prince, J. S., Konkle, T. & Livingstone, M. S. The neural code for ‘face cells’ is not face-specific. Sci. Adv. 9, eadg1736 (2023).
Prince, J. S., Alvarez, G. A. & Konkle, T. Contrastive learning explains the emergence and function of visual category-selective regions. Sci. Adv. 10, eadl1776 (2024).
Wang, A. Y., Kay, K., Naselaris, T., Tarr, M. J. & Wehbe, L. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023).
Zhou, Q., Du, C., Wang, S. & He, H. CLIP-MUSED: CLIP-guided multi-subject visual neural information semantic decoding. In Proc. 12th International Conference on Learning Representations (eds Kim, B. et al.) https://openreview.net/pdf?id=lKxL5zkssv (ICLR, 2024).
Doerig, A. et al. High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 7, 1220–1234 (2025).
Conwell, C., Prince, J. S., Hamblin, C. J. & Alvarez, G. A. Controlled assessment of CLIP-style language-aligned vision models in prediction of brain and behavioral data. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo, 2023).
Luo, A. F., Henderson, M. M., Wehbe, L. & Tarr, M. J. Brain diffusion for visual exploration: cortical discovery using large-scale generative models. In Proc. 37th International Conference on Neural Information Processing Systems (eds Oh, A. et al.) 75740–75781 (Curran Associates, 2023).
Luo, A. F., Henderson, M. M., Tarr, M. J. & Wehbe, L. BrainSCUBA: fine-grained natural language captions of visual cortex selectivity. In Proc. 12th International Conference on Learning Representations (eds Kim, B. et al.) https://openreview.net/pdf?id=mQYHXUUTkU (ICLR, 2024).
Lupyan, G. The centrality of language in human cognition. Lang. Learn. 66, 516–553 (2016).
Thierry, G. Neurolinguistic relativity: how language flexes human perception and cognition. Lang. Learn. 66, 690–713 (2016).
Gilbert, A. L., Regier, T., Kay, P. & Ivry, R. B. Whorf hypothesis is supported in the right visual field but not the left. Proc. Natl Acad. Sci. USA 103, 489–494 (2006).
Drivonikou, G. V. et al. Further evidence that Whorfian effects are stronger in the right visual field than the left. Proc. Natl Acad. Sci. USA 104, 1097–1102 (2007).
Winawer, J. et al. Russian blues reveal effects of language on color discrimination. Proc. Natl Acad. Sci. USA 104, 7780–7785 (2007).
Ting Siok, W. et al. Language regions of brain are operative in color perception. Proc. Natl Acad. Sci. USA 106, 8140–8145 (2009).
Martinovic, J., Paramei, G. V. & MacInnes, W. J. Russian blues reveal the limits of language influencing colour discrimination. Cognition 201, 104281 (2020).
Fedorenko, E., Piantadosi, S. T. & Gibson, E. A. Language is primarily a tool for communication rather than thought. Nature 630, 575–586 (2024).
Maier, M. & Abdel Rahman, R. No matter how: top-down effects of verbal and semantic category knowledge on early visual perception. Cogn. Affect. Behav. Neurosci. 19, 859–876 (2019).
Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A. & Konkle, T. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nat. Commun. 15, 9383 (2024).
Allen, E. J. et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).
Conwell, C. et al. Monkey See, model knew: large language models accurately predict visual brain responses in humans and non-human primates. Preprint at bioRxiv https://doi.org/10.1101/2025.03.05.641284 (2025).
Kriegeskorte, N., Mur, M. & Bandettini, P. A. Representational similarity analysis – connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 249 (2008).
Fu, Z. et al. Different computational relations in language are captured by distinct brain systems. Cereb. Cortex 33, 997–1013 (2023).
Liu, B. et al. Object knowledge representation in the human visual cortex requires a connection with the language system. PLoS Biol. 23, e3003161 (2025).
Hebart, M. N. et al. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife 12, e82580 (2023).
Güntürkün, O., Ströckens, F. & Ocklenburg, S. Brain lateralization: a comparative perspective. Physiol. Rev. 100, 1019–1063 (2020).
Wilke, M. & Lidzba, K. LI-tool: a new toolbox to assess lateralization in functional MR-data. J. Neurosci. Methods 163, 128–136 (2007).
Seghier, M. L. Laterality index in functional MRI: methodological issues. Magn. Reson. Imaging 26, 594–601 (2008).
Fedorenko, E., Hsieh, P. J., Nieto-Castañón, A., Whitfield-Gabrieli, S. & Kanwisher, N. New method for fMRI investigations of language: defining ROIs functionally in individual subjects. J. Neurophysiol. 104, 1177–1194 (2010).
Oliva, A. & Torralba, A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001).
Hua, K. et al. Tract probability maps in stereotaxic spaces: analyses of white matter anatomy and tract-specific quantification. Neuroimage 39, 336–347 (2008).
Brown, T. B. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
Mu, N., Kirillov, A., Wagner, D. & Xie, S. SLIP: self-supervision meets language-image pre-training. In European Conference on Computer Vision (eds Avidan, S. et al.) 529–544 (Springer, 2022).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning: Proc. Machine Learning Research Vol. 139 (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
Gelman, S. A. & Roberts, S. O. How language shapes the cultural inheritance of categories. Proc. Natl Acad. Sci. USA 114, 7900–7907 (2017).
Unger, L. & Fisher, A. V. The emergence of richly organized semantic knowledge from simple statistics: a synthetic review. Dev. Rev. 60, 100949 (2021).
Xu, Y., He, Y. & Bi, Y. A tri-network model of human semantic processing. Front. Psychol. 8, 1538 (2017).
Seghier, M. L. The angular gyrus: multiple functions and multiple subdivisions. Neuroscientist 19, 43–61 (2013).
Xu, Y. et al. Doctor, teacher, and stethoscope: neural representation of different types of semantic relations. J. Neurosci. 38, 3303–3317 (2018).
Schwartz, M. F. et al. Neuroanatomical dissociation for taxonomic and thematic knowledge in the human brain. Proc. Natl Acad. Sci. USA 108, 8520–8524 (2011).
Zhang, W., Xiang, M. & Wang, S. The role of left angular gyrus in the representation of linguistic composition relations. Hum. Brain Mapp. 43, 2204–2217 (2022).
Price, A. R., Bonner, M. F., Peelle, J. E. & Grossman, M. Converging evidence for the neuroanatomic basis of combinatorial semantics in the angular gyrus. J. Neurosci. 35, 3276–3284 (2015).
Lupyan, G., Rahman, R. A., Boroditsky, L. & Clark, A. Effects of language on visual perception. Trends Cogn. Sci. 24, 930–944 (2020).
Mattioni, S. et al. Categorical representation from sound and sight in the ventral occipito-temporal cortex of sighted and blind. Elife 9, e50732 (2020).
van den Hurk, J., Van Baelen, M. & Op de Beeck, H. P. Development of visual category selectivity in ventral visual cortex does not require visual experience. Proc. Natl Acad. Sci. USA 114, E4501–E4510 (2017).
Wang, X. et al. How visual is the visual cortex? Comparing connectional and functional fingerprints between congenitally blind and sighted individuals. J. Neurosci. 35, 12545–12559 (2015).
Ricciardi, E., Bonino, D., Pellegrini, S. & Pietrini, P. Mind the blind brain to understand the sighted one! Is there a supramodal cortical functional architecture? Neurosci. Biobehav. Rev. 41, 64–77 (2014).
Bi, Y., Wang, X. & Caramazza, A. Object domain and modality in the ventral visual pathway. Trends Cogn. Sci. 20, 282–290 (2016).
Peelen, M. V. & Downing, P. E. Category selectivity in human visual cortex: beyond visual object recognition. Neuropsychologia 105, 177–183 (2017).
Mahon, B. Z. et al. Action-related properties shape object representations in the ventral stream. Neuron 55, 507–520 (2007).
Striem-Amit, E. et al. Functional connectivity of visual cortex in the blind follows retinotopic organization principles. Brain 138, 1679–1695 (2015).
Burton, H., Snyder, A. Z. & Raichle, M. E. Resting state functional connectivity in early blind humans. Front. Syst. Neurosci. 8, 51 (2014).
Ashburner, J. & Friston, K. J. Unified segmentation. Neuroimage 26, 839–851 (2005).
Chen, X., Xie, S. & He, K. An empirical study of training self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9640–9649 (IEEE, 2021).
Kriegeskorte, N., Goebel, R. & Bandettini, P. Information-based functional brain mapping. Proc. Natl Acad. Sci. USA 103, 3863–3868 (2006).
Vallat, R. Pingouin: statistics in Python. J. Open Source Softw. 3, 1026 (2018).
Fonov, V. et al. Unbiased average age-appropriate atlases for pediatric studies. Neuroimage 54, 313–327 (2011).
Xia, M., Wang, J. & He, Y. BrainNet Viewer: a network visualization tool for human brain connectomics. PLoS ONE 8, e68910 (2013).
Yan, C. G., Wang, X. D., Zuo, X. N. & Zang, Y. F. DPABI: data processing and analysis for (resting-state) brain imaging. Neuroinformatics 14, 339–351 (2016).
Chen, H. Language modulates vision: evidence from neural networks and human brain-lesion models. figshare https://doi.org/10.6084/m9.figshare.29531288.v3 (2025).
Stoinski, L. M., Perkuhn, J. & Hebart, M. N. THINGSplus: new norms and metadata for the THINGS database of 1854 object concepts and 26,107 natural object images. Behav. Res. Methods 56, 1583–1603 (2024).
Acknowledgements
This research was supported by grants from the STI2030-Major Project 2021ZD0204100 (grant no. 2021ZD0204104 to Y.B.); the National Natural Science Foundation of China (grant nos. 31925020 and 82021004 to Y.B.; grant no. 62376009 to Y.Z.; grant no. 32171052 to Xiaosha Wang; grant no. 62406020 to W.H.); the Fundamental Research Funds for the Central Universities (Y.B.); and the PKU-BingJi Joint Laboratory for Artificial Intelligence (Y.Z.). The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank H. Yang, Z. Xiong, Z. Fu and H. Wen for their valuable comments on earlier drafts of the manuscript.
Author information
Authors and Affiliations
Contributions
Y.B., Y.Z. and W.H. conceived the study. H.C., B.L., Xiaosha Wang and Xiaochun Wang designed the experiment. H.C., B.L. and S.W. implemented and conducted the experiments. H.C. and B.L. analysed the data. H.C. and Y.B. wrote the initial draft. All authors reviewed and edited the Article.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Human Behaviour thanks Guadalupe Dávila, Francesca Setti and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 White matter integrity of the bilateral VOTC–LeftAG tract predicts model–brain correspondence of CLIP and MoCo (n = 33 patients).
Correlation coefficients and one-tailed P values are displayed on the plots (d.f. = 30). a. Partial correlations between bilateral VOTC–LeftAG tract integrity and model–brain correspondence, controlling for lesion volume. Both sentence description effects (CLIP-specific) and MoCo-specific effects correlate significantly with VOTC–LeftAG tract integrity. b. Validation analysis using connections with the right AG shows no significant relationships, confirming left-lateralized pathway specificity.
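The partial correlations reported here control for lesion volume and use one-tailed tests with d.f. = 30 (n = 33 patients, one covariate). A minimal sketch of such an analysis, using the Pingouin package cited in the references, is shown below; the patient table and its column names are hypothetical placeholders, not the study data.

```python
# Minimal sketch of a lesion-volume-controlled partial correlation (not the
# authors' code); random placeholder values stand in for the patient measures.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
n = 33  # number of patients, matching the caption
df = pd.DataFrame({
    "fa_votc_leftag": rng.uniform(0.3, 0.6, n),  # FA of the VOTC-LeftAG tract
    "clip_brain_z": rng.normal(0.1, 0.05, n),    # Fisher z CLIP-brain RSA effect
    "lesion_volume": rng.uniform(1.0, 50.0, n),  # covariate: total lesion volume
})

# Partial correlation controlling for lesion volume, one-tailed (greater),
# leaving d.f. = n - 3 = 30 as in the caption.
res = pg.partial_corr(data=df, x="fa_votc_leftag", y="clip_brain_z",
                      covar="lesion_volume", method="pearson",
                      alternative="greater")
print(res[["n", "r", "p-val"]])
```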
Extended Data Fig. 2 Low-level versus higher-level visual feature dependencies in WM–neural representation relationships (n = 33 patients).
Pearson correlations (d.f. = 30) between left (a) or bilateral (b) VOTC–AG tract FA values and model–brain correspondence, controlling for lesion volume. The left panel shows that GIST effects (low-level visual features) exhibit a negative trend with tract integrity, whereas the right panel demonstrates that MoCo-specific effects (controlling for CLIP, ResNet and GIST) correlate significantly with WM integrity. Correlation coefficients and one-tailed P values are displayed on the plots.
Extended Data Fig. 3 Voxel-based FA–symptom mapping (VFSM) results for model–brain correspondence (n = 33 patients).
Whole-brain correlation analyses (Pearson’s correlations) examine the relationships between FA values of each voxel and Fisher z-transformed RSA correlations in VOTC across patients. Left column shows correlations with sentence description effects (CLIP-specific effect); middle column displays MoCo-specific effects; right column indicates voxels with significant correlations for both conditions. Colour bars represent t-statistics from correlation analyses controlling for lesion volume. Results are thresholded at voxel-level P < 0.005, one-tailed, and cluster-level FWE-corrected P < 0.05. Axial slices displayed in MNI coordinate space.
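A rough sketch of the voxelwise correlation underlying VFSM is given below, assuming FA maps have already been stacked into a patients-by-voxels array. For brevity it correlates Fisher z-transformed RSA scores with FA directly and applies only the voxel-level threshold; the lesion-volume covariate and the cluster-level FWE correction described in the caption are omitted, and all arrays are random placeholders.

```python
# Sketch of voxel-based FA-symptom mapping (VFSM), simplified and with
# placeholder data; not the authors' pipeline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients, n_voxels = 33, 10000
fa_maps = rng.uniform(0.2, 0.8, (n_patients, n_voxels))  # voxelwise FA values
rsa_r = rng.uniform(-0.2, 0.4, n_patients)                # per-patient RSA correlations

# Fisher z-transform the RSA correlations before correlating across patients.
rsa_z = np.arctanh(rsa_r)

# Vectorised Pearson correlation of each voxel's FA with the RSA scores.
fa_c = fa_maps - fa_maps.mean(axis=0)
z_c = rsa_z - rsa_z.mean()
r = (fa_c * z_c[:, None]).sum(axis=0) / (
    np.sqrt((fa_c ** 2).sum(axis=0)) * np.sqrt((z_c ** 2).sum()))

# Convert to t-statistics (d.f. = n - 2) and apply the voxel-level threshold.
dof = n_patients - 2
t = r * np.sqrt(dof / (1 - r ** 2))
p_one_tailed = stats.t.sf(t, dof)
significant = p_one_tailed < 0.005
print(f"{significant.sum()} voxels pass the voxel-level threshold")
```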
Supplementary information
Supplementary Information
Supplementary Figs. 1 and 2 and Tables 1 and 2.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, H., Liu, B., Wang, S. et al. Combined evidence from artificial neural networks and human brain-lesion models reveals that language modulates vision in human perception. Nat Hum Behav (2025). https://doi.org/10.1038/s41562-025-02357-5
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41562-025-02357-5