Combined evidence from artificial neural networks and human brain-lesion models reveals that language modulates vision in human perception

Abstract

Comparing information structures between deep neural networks (DNNs) and the human brain has become a key method for exploring their similarities and differences. Recent research has shown better alignment of vision–language DNN models, such as contrastive language–image pretraining (CLIP), with the activity of the human ventral occipitotemporal cortex (VOTC) than earlier vision models, supporting the idea that language modulates human visual perception. However, interpreting the results of such comparisons is inherently limited owing to the ‘black box’ nature of DNNs. To address this, we combined model–brain fitness analyses with human brain-lesion data to examine how disrupting the communication pathway between the visual and language systems causally affects the ability of vision–language DNNs to explain VOTC activity. Across four diverse datasets, CLIP consistently captured unique variance in VOTC neural representations relative to both label-supervised (ResNet) and unsupervised (MoCo) models. This advantage tended to be left-lateralized at the group level, aligning with the human language network. Analyses of 33 patients who had experienced a stroke revealed that reduced white matter integrity between the VOTC and the language region in the left angular gyrus correlated with decreased CLIP–brain correspondence and increased MoCo–brain correspondence, indicating a dynamic influence of language processing on VOTC activity. These findings support the integration of language modulation into neurocognitive models of human vision, reinforcing concepts from vision–language DNN models. The sensitivity of model–brain similarity to specific brain lesions demonstrates that leveraging manipulations of the human brain is a promising framework for evaluating and developing brain-like computational models.
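
The model–brain fitness analyses referred to in the abstract rest on representational similarity analysis (RSA; see refs 7 and 31 in the reference list): pairwise stimulus dissimilarity matrices are computed separately from model embeddings and from voxel activity patterns, and the correlation between the two matrices quantifies model–brain correspondence. The following minimal Python sketch illustrates that logic only; the array shapes, random inputs and variable names are illustrative assumptions, not the authors' pipeline (the actual code is linked under 'Code availability').

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(patterns):
        # Lower-triangle representational dissimilarity vector:
        # one entry per stimulus pair, 1 - Pearson r between rows.
        return pdist(patterns, metric='correlation')

    # Hypothetical inputs: one row per stimulus (same stimuli in both).
    clip_features = np.random.rand(100, 512)   # e.g. CLIP image embeddings
    votc_patterns = np.random.rand(100, 2000)  # e.g. VOTC voxel patterns

    # Model-brain correspondence: rank correlation of the two RDMs.
    rho, p = spearmanr(rdm(clip_features), rdm(votc_patterns))
    print(f'model-brain RSA: rho = {rho:.3f}')

The same machinery extends to the unique-variance claim: partialling one model's RDM out of another's before correlating with the brain RDM isolates model-specific contributions.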

Fig. 1: Overview of the fMRI datasets, vision models and study 1 analysis schema.
Fig. 2: Intercorrelations among vision model RDMs and their alignment with human behaviour.
Fig. 3: Language effect in VOTC across datasets.
Fig. 4: Study 2 analysis workflow linking WM integrity and model–brain correspondence in patients with chronic stroke.
Fig. 5: WM integrity of left VOTC–left AG tract predicts model–brain correspondence of CLIP and MoCo (n = 33 patients).
Fig. 6: Validation analyses using vision models trained on the identical dataset.

Data availability

The data that support the findings of this study are available via figshare at https://doi.org/10.6084/m9.figshare.29531288.v3 (ref. 70). The original neuroimaging data are not publicly available owing to ethical constraints. De-identified data may be accessed by researchers who meet the criteria upon reasonable request: study 1 via the corresponding author (ybi@pku.edu.cn); study 2 via the Ethics Committee of the First Hospital of Shanxi Medical University (phone: +86 351 4639242) or the corresponding author (ybi@pku.edu.cn). Eligible requests will receive a response within 2 weeks.

Code availability

The custom code that supports the findings of this study is available via figshare at https://doi.org/10.6084/m9.figshare.29531288.v3 (ref. 70).

References

  1. Bao, P., She, L., McGill, M. & Tsao, D. Y. A map of object space in primate inferotemporal cortex. Nature 583, 103–108 (2020).

  2. Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).

  3. Schrimpf, M. et al. Brain-Score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://doi.org/10.1101/407007 (2018).

  4. Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

  5. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).

  6. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  7. Kriegeskorte, N. et al. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron 60, 1126–1141 (2008).

  8. Ungerleider, L. G. & Haxby, J. V. ‘What’ and ‘where’ in the human brain. Curr. Opin. Neurobiol. 4, 157–165 (1994).

  9. Dobs, K., Martinez, J., Kell, A. J. E. & Kanwisher, N. Brain-like functional specialization emerges spontaneously in deep neural networks. Sci. Adv. 8, eabl8913 (2022).

  10. Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).

  11. Vinken, K., Prince, J. S., Konkle, T. & Livingstone, M. S. The neural code for ‘face cells’ is not face-specific. Sci. Adv. 9, eadg1736 (2023).

  12. Prince, J. S., Alvarez, G. A. & Konkle, T. Contrastive learning explains the emergence and function of visual category-selective regions. Sci. Adv. 10, eadl1776 (2024).

  13. Wang, A. Y., Kay, K., Naselaris, T., Tarr, M. J. & Wehbe, L. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023).

  14. Zhou, Q., Du, C., Wang, S. & He, H. CLIP-MUSED: CLIP-guided multi-subject visual neural information semantic decoding. In Proc. 12th International Conference on Learning Representations (eds Kim, B. et al.) https://openreview.net/pdf?id=lKxL5zkssv (ICLR, 2024).

  15. Doerig, A. et al. High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 7, 1220–1234 (2025).

  16. Conwell, C., Prince, J. S., Hamblin, C. J. & Alvarez, G. A. Controlled assessment of CLIP-style language-aligned vision models in prediction of brain and behavioral data. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo, 2023).

  17. Luo, A. F., Henderson, M. M., Wehbe, L. & Tarr, M. J. Brain diffusion for visual exploration: cortical discovery using large-scale generative models. In Proc. 37th International Conference on Neural Information Processing Systems (eds Oh, A. et al.) 75740–75781 (Curran Associates, 2023).

  18. Luo, A. F., Henderson, M. M., Tarr, M. J. & Wehbe, L. BrainSCUBA: fine-grained natural language captions of visual cortex selectivity. In Proc. 12th International Conference on Learning Representations (eds Kim, B. et al.) https://openreview.net/pdf?id=mQYHXUUTkU (ICLR, 2024).

  19. Lupyan, G. The centrality of language in human cognition. Lang. Learn. 66, 516–553 (2016).

  20. Thierry, G. Neurolinguistic relativity: how language flexes human perception and cognition. Lang. Learn. 66, 690–713 (2016).

  21. Gilbert, A. L., Regier, T., Kay, P. & Ivry, R. B. Whorf hypothesis is supported in the right visual field but not the left. Proc. Natl Acad. Sci. USA 103, 489–494 (2006).

  22. Drivonikou, G. V. et al. Further evidence that Whorfian effects are stronger in the right visual field than the left. Proc. Natl Acad. Sci. USA 104, 1097–1102 (2007).

  23. Winawer, J. et al. Russian blues reveal effects of language on color discrimination. Proc. Natl Acad. Sci. USA 104, 7780–7785 (2007).

  24. Siok, W. T. et al. Language regions of brain are operative in color perception. Proc. Natl Acad. Sci. USA 106, 8140–8145 (2009).

  25. Martinovic, J., Paramei, G. V. & MacInnes, W. J. Russian blues reveal the limits of language influencing colour discrimination. Cognition 201, 104281 (2020).

  26. Fedorenko, E., Piantadosi, S. T. & Gibson, E. A. Language is primarily a tool for communication rather than thought. Nature 630, 575–586 (2024).

  27. Maier, M. & Abdel Rahman, R. No matter how: top-down effects of verbal and semantic category knowledge on early visual perception. Cogn. Affect. Behav. Neurosci. 19, 859–876 (2019).

  28. Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A. & Konkle, T. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nat. Commun. 15, 9383 (2024).

  29. Allen, E. J. et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).

  30. Conwell, C. et al. Monkey see, model knew: large language models accurately predict visual brain responses in humans and non-human primates. Preprint at bioRxiv https://doi.org/10.1101/2025.03.05.641284 (2025).

  31. Kriegeskorte, N., Mur, M. & Bandettini, P. A. Representational similarity analysis – connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 249 (2008).

  32. Fu, Z. et al. Different computational relations in language are captured by distinct brain systems. Cereb. Cortex 33, 997–1013 (2023).

  33. Liu, B. et al. Object knowledge representation in the human visual cortex requires a connection with the language system. PLoS Biol. 23, e3003161 (2025).

  34. Hebart, M. N. et al. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife 12, e82580 (2023).

  35. Güntürkün, O., Ströckens, F. & Ocklenburg, S. Brain lateralization: a comparative perspective. Physiol. Rev. 100, 1019–1063 (2020).

  36. Wilke, M. & Lidzba, K. LI-tool: a new toolbox to assess lateralization in functional MR-data. J. Neurosci. Methods 163, 128–136 (2007).

  37. Seghier, M. L. Laterality index in functional MRI: methodological issues. Magn. Reson. Imaging 26, 594–601 (2008).

  38. Fedorenko, E., Hsieh, P. J., Nieto-Castañón, A., Whitfield-Gabrieli, S. & Kanwisher, N. New method for fMRI investigations of language: defining ROIs functionally in individual subjects. J. Neurophysiol. 104, 1177–1194 (2010).

  39. Oliva, A. & Torralba, A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001).

  40. Hua, K. et al. Tract probability maps in stereotaxic spaces: analyses of white matter anatomy and tract-specific quantification. Neuroimage 39, 336–347 (2008).

  41. Brown, T. B. et al. Language models are few-shot learners. In Proc. 34th International Conference on Neural Information Processing Systems: Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).

  42. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).

  43. Mu, N., Kirillov, A., Wagner, D. & Xie, S. SLIP: self-supervision meets language-image pre-training. In European Conference on Computer Vision (eds Avidan, S. et al.) 529–544 (Springer, 2022).

  44. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning: Proc. Machine Learning Research Vol. 139 (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  45. Gelman, S. A. & Roberts, S. O. How language shapes the cultural inheritance of categories. Proc. Natl Acad. Sci. USA 114, 7900–7907 (2017).

  46. Unger, L. & Fisher, A. V. The emergence of richly organized semantic knowledge from simple statistics: a synthetic review. Dev. Rev. 60, 100949 (2021).

  47. Xu, Y., He, Y. & Bi, Y. A tri-network model of human semantic processing. Front. Psychol. 8, 1538 (2017).

  48. Seghier, M. L. The angular gyrus: multiple functions and multiple subdivisions. Neuroscientist 19, 43–61 (2013).

  49. Xu, Y. et al. Doctor, teacher, and stethoscope: neural representation of different types of semantic relations. J. Neurosci. 38, 3303–3317 (2018).

  50. Schwartz, M. F. et al. Neuroanatomical dissociation for taxonomic and thematic knowledge in the human brain. Proc. Natl Acad. Sci. USA 108, 8520–8524 (2011).

  51. Zhang, W., Xiang, M. & Wang, S. The role of left angular gyrus in the representation of linguistic composition relations. Hum. Brain Mapp. 43, 2204–2217 (2022).

  52. Price, A. R., Bonner, M. F., Peelle, J. E. & Grossman, M. Converging evidence for the neuroanatomic basis of combinatorial semantics in the angular gyrus. J. Neurosci. 35, 3276–3284 (2015).

  53. Lupyan, G., Rahman, R. A., Boroditsky, L. & Clark, A. Effects of language on visual perception. Trends Cogn. Sci. 24, 930–944 (2020).

  54. Mattioni, S. et al. Categorical representation from sound and sight in the ventral occipito-temporal cortex of sighted and blind. eLife 9, e50732 (2020).

  55. van den Hurk, J., Van Baelen, M. & Op de Beeck, H. P. Development of visual category selectivity in ventral visual cortex does not require visual experience. Proc. Natl Acad. Sci. USA 114, E4501–E4510 (2017).

  56. Wang, X. et al. How visual is the visual cortex? Comparing connectional and functional fingerprints between congenitally blind and sighted individuals. J. Neurosci. 35, 12545–12559 (2015).

  57. Ricciardi, E., Bonino, D., Pellegrini, S. & Pietrini, P. Mind the blind brain to understand the sighted one! Is there a supramodal cortical functional architecture? Neurosci. Biobehav. Rev. 41, 64–77 (2014).

  58. Bi, Y., Wang, X. & Caramazza, A. Object domain and modality in the ventral visual pathway. Trends Cogn. Sci. 20, 282–290 (2016).

  59. Peelen, M. V. & Downing, P. E. Category selectivity in human visual cortex: beyond visual object recognition. Neuropsychologia 105, 177–183 (2017).

  60. Mahon, B. Z. et al. Action-related properties shape object representations in the ventral stream. Neuron 55, 507–520 (2007).

  61. Striem-Amit, E. et al. Functional connectivity of visual cortex in the blind follows retinotopic organization principles. Brain 138, 1679–1695 (2015).

  62. Burton, H., Snyder, A. Z. & Raichle, M. E. Resting state functional connectivity in early blind humans. Front. Syst. Neurosci. 8, 51 (2014).

  63. Ashburner, J. & Friston, K. J. Unified segmentation. Neuroimage 26, 839–851 (2005).

  64. Chen, X., Xie, S. & He, K. An empirical study of training self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9640–9649 (IEEE, 2021).

  65. Kriegeskorte, N., Goebel, R. & Bandettini, P. Information-based functional brain mapping. Proc. Natl Acad. Sci. USA 103, 3863–3868 (2006).

  66. Vallat, R. Pingouin: statistics in Python. J. Open Source Softw. 3, 1026 (2018).

  67. Fonov, V. et al. Unbiased average age-appropriate atlases for pediatric studies. Neuroimage 54, 313–327 (2011).

  68. Xia, M., Wang, J. & He, Y. BrainNet Viewer: a network visualization tool for human brain connectomics. PLoS ONE 8, e68910 (2013).

  69. Yan, C. G., Wang, X. D., Zuo, X. N. & Zang, Y. F. DPABI: data processing and analysis for (resting-state) brain imaging. Neuroinformatics 14, 339–351 (2016).

  70. Chen, H. Language modulates vision: evidence from neural networks and human brain-lesion models. figshare https://doi.org/10.6084/m9.figshare.29531288.v3 (2025).

  71. Stoinski, L. M., Perkuhn, J. & Hebart, M. N. THINGSplus: new norms and metadata for the THINGS database of 1854 object concepts and 26,107 natural object images. Behav. Res. Methods 56, 1583–1603 (2024).

Acknowledgements

This research was supported by grants from the STI2030-Major Project 2021ZD0204100 (grant no. 2021ZD0204104 to Y.B.); the National Natural Science Foundation of China (grant nos. 31925020 and 82021004 to Y.B.; grant no. 62376009 to Y.Z.; grant no. 32171052 to Xiaosha Wang; grant no. 62406020 to W.H.); the Fundamental Research Funds for the Central Universities (Y.B.); and the PKU-BingJi Joint Laboratory for Artificial Intelligence (Y.Z.). The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank H. Yang, Z. Xiong, Z. Fu and H. Wen for their valuable comments on earlier drafts of the manuscript.

Author information

Contributions

Y.B., Y.Z. and W.H. conceived the study. H.C., B.L., Xiaosha Wang and Xiaochun Wang designed the experiment. H.C., B.L. and S.W. implemented and conducted the experiments. H.C. and B.L. analysed the data. H.C. and Y.B. wrote the initial draft. All authors reviewed and edited the Article.

Corresponding authors

Correspondence to Xiaochun Wang, Yixin Zhu or Yanchao Bi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Human Behaviour thanks Guadalupe Dávila, Francesca Setti and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 White matter integrity of the bilateral VOTC–left AG tract predicts model–brain correspondence of CLIP and MoCo (n = 33 patients).

Correlation coefficients and one-tailed P values are displayed on the plots (d.f. = 30). a. Partial correlations between bilateral VOTC–left AG tract integrity and model–brain correspondence, controlling for lesion volume. Both the sentence-description effect (CLIP-specific) and the MoCo-specific effect correlate significantly with VOTC–left AG tract integrity. b. A validation analysis using connections with the right AG shows no significant relationships, confirming the specificity of the left-lateralized pathway.
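
For orientation, partial correlations of this kind can be computed in outline with Pingouin, the Python statistics package the paper cites (ref. 66). The sketch below uses random placeholder data; the column names and the one-tailed direction are illustrative assumptions, not the authors' code.

    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'tract_fa': rng.random(33),       # WM integrity (FA) per patient
        'clip_brain_z': rng.random(33),   # Fisher z of CLIP-brain RSA
        'lesion_volume': rng.random(33),  # nuisance covariate
    })

    # One-tailed partial correlation controlling for lesion volume;
    # with n = 33 and one covariate, d.f. = n - 3 = 30, as in the caption.
    res = pg.partial_corr(data=df, x='tract_fa', y='clip_brain_z',
                          covar='lesion_volume', alternative='greater')
    print(res[['n', 'r', 'p-val']])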

Extended Data Fig. 2 Low-level versus higher-level visual feature dependencies in WM–neural representation relationships (n = 33 patients).

Pearson correlations (d.f. = 30) between left (a) or bilateral (b) VOTC–AG tract FA (fractional anisotropy) values and model–brain correspondence, controlling for lesion volume. The left panel shows that GIST effects (low-level visual features) exhibit a negative trend with tract integrity, whereas the right panel demonstrates that MoCo-specific effects (controlling for CLIP, ResNet and GIST) correlate significantly with WM integrity. Correlation coefficients and one-tailed P values are displayed on the plots.

Extended Data Fig. 3 Voxel-based FA–symptom mapping (VFSM) results for model–brain correspondence (n = 33 patients).

Whole-brain correlation analyses (Pearson's correlations) examine the relationships between the FA value of each voxel and Fisher z-transformed RSA correlations in the VOTC across patients. The left column shows correlations with sentence-description effects (CLIP-specific effects); the middle column displays MoCo-specific effects; the right column indicates voxels with significant correlations for both conditions. Colour bars represent t-statistics from correlation analyses controlling for lesion volume. Results are thresholded at voxel-level P < 0.005 (one-tailed) and cluster-level FWE-corrected P < 0.05. Axial slices are displayed in MNI coordinate space.
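
A note on the Fisher z-transformation used above: it maps correlation coefficients onto an approximately normal scale before they enter across-patient analyses, and in Python it is simply the inverse hyperbolic tangent. A one-line sketch with hypothetical values:

    import numpy as np

    r_per_patient = np.array([0.12, 0.30, -0.05])  # hypothetical RSA r values
    z_per_patient = np.arctanh(r_per_patient)      # z = 0.5 * ln((1 + r) / (1 - r))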

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2 and Tables 1 and 2.

Reporting summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, H., Liu, B., Wang, S. et al. Combined evidence from artificial neural networks and human brain-lesion models reveals that language modulates vision in human perception. Nat Hum Behav (2025). https://doi.org/10.1038/s41562-025-02357-5
