Compact deep neural network models of the visual cortex

Abstract

A powerful approach to understanding the computations carried out by the visual cortex is to build models that predict neural responses to arbitrary images. Deep neural networks (DNNs) have emerged as the leading predictive models1,2, yet their underlying computations remain buried beneath millions of parameters. Here we challenge the need for models at this scale by seeking predictive and parsimonious DNN models of the primate visual cortex. We first built a highly predictive DNN model of neural responses in macaque visual area V4 by alternating data collection and model training in adaptive closed-loop experiments. We then compressed this large, black-box DNN model, which comprised 60 million parameters, to identify compact models with 5,000 times fewer parameters yet comparable accuracy. This dramatic compression enabled us to investigate the inner workings of the compact models. We discovered a salient computational motif: compact models share similar filters in early processing, but individual models then specialize their feature selectivity by ‘consolidating’ this shared high-dimensional representation in distinct ways. We examined this consolidation step in a dot-detecting model neuron, revealing a computational mechanism that leads to a testable circuit hypothesis for dot-selective V4 neurons. Beyond V4, we found strong model compression for macaque visual areas V1 and IT (inferior temporal cortex), revealing a general computational principle of the visual cortex. Overall, our work challenges the notion that large DNNs are necessary to predict individual neurons and establishes a modelling framework that balances prediction and parsimony.
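The compression step summarized above is specified in the paper's Methods. Purely as a hypothetical sketch of the underlying idea (none of the models, dimensions, or names below come from the paper), a small 'student' model can be fit to reproduce a large 'teacher' model's predicted responses on probe stimuli:

```python
import numpy as np

# Toy illustration (not the authors' code): distil a large "teacher"
# response model into a much smaller "student" by fitting the student
# to the teacher's outputs on many probe stimuli.
rng = np.random.default_rng(0)

# Hypothetical teacher: a wide random-feature model mapping
# 16-dimensional stimulus vectors to one model neuron's response.
W1 = rng.normal(size=(16, 512))
w2 = rng.normal(size=512) / np.sqrt(512)

def teacher(x):
    return np.maximum(x @ W1, 0.0) @ w2  # ReLU features, linear readout

# Student: 16x fewer hidden units; fit its readout in closed form so
# that student(x) approximates teacher(x) on the probe set.
X = rng.normal(size=(5000, 16))
P1 = rng.normal(size=(16, 32))
H = np.maximum(X @ P1, 0.0)
p2, *_ = np.linalg.lstsq(H, teacher(X), rcond=None)

def student(x):
    return np.maximum(x @ P1, 0.0) @ p2

# Despite the parameter reduction, the student tracks the teacher on
# held-out stimuli.
X_test = rng.normal(size=(1000, 16))
r = np.corrcoef(teacher(X_test), student(X_test))[0, 1]
```

The paper's actual pipeline compresses convolutional networks trained on neural data; this sketch conveys only why a far smaller model can approximate a larger model's input–output behaviour.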


Fig. 1: Identifying compact models of macaque V4 neurons.
Fig. 2: Experimentally validating the stimulus preferences of compact models.
Fig. 3: Compact models specialize their feature selectivity via a consolidation step.
Fig. 4: Uncovering the computations of a dot-detecting compact model.

Data availability

Raw data including spike timing for all recording sessions are available97. Processed responses for model training and evaluation as well as stimulus images are available on GitHub (https://github.com/cowleygroup/V4_compact_models). Source data are provided with this paper.

Code availability

All spike sorting was performed using custom Matlab software available on GitHub (https://github.com/smithlabvision/spikesort). Model weights and code are available on GitHub (https://github.com/cowleygroup/V4_compact_models).

References

  1. Yamins, D. L. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).

  2. Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 1, 417–446 (2015).

  3. Heeger, D. J. Half-squaring in responses of cat striate cells. Vis. Neurosci. 9, 427–443 (1992).

  4. Rust, N. C., Schwartz, O., Movshon, J. A. & Simoncelli, E. P. Spatiotemporal elements of macaque V1 receptive fields. Neuron 46, 945–956 (2005).

  5. Pillow, J. W. et al. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 454, 995–999 (2008).

  6. Vintch, B., Movshon, J. A. & Simoncelli, E. P. A convolutional subunit model for neuronal responses in macaque V1. J. Neurosci. 35, 14829–14841 (2015).

  7. Ustyuzhaninov, I. et al. Digital twin reveals combinatorial code of non-linear computations in the mouse primary visual cortex. Preprint at bioRxiv https://doi.org/10.1101/2022.02.10.479884 (2022).

  8. Maheswaranathan, N. et al. Interpreting the retinal neural code for natural scenes: from computations to neurons. Neuron 111, 2742–2755 (2023).

  9. David, S. V., Hayden, B. Y. & Gallant, J. L. Spectral receptive field properties explain shape selectivity in area V4. J. Neurophysiol. 96, 3492–3505 (2006).

  10. Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

  11. Cadieu, C. et al. A model of V4 shape selectivity and invariance. J. Neurophysiol. 98, 1733–1750 (2007).

  12. Bahri, Y., Dyer, E., Kaplan, J., Lee, J. & Sharma, U. Explaining neural scaling laws. Proc. Natl Acad. Sci. USA 121, e2311878121 (2024).

  13. Liu, Z. et al. A ConvNet for the 2020s. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11966–11976 (IEEE, 2022).

  14. Kaplan, J. et al. Scaling laws for neural language models. Preprint at arXiv https://doi.org/10.48550/arXiv.2001.08361 (2020).

  15. Schrimpf, M. et al. Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron 108, 413–423 (2020).

  16. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at arXiv https://doi.org/10.48550/arXiv.1503.02531 (2015).

  17. Han, S., Mao, H. & Dally, W. J. Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In Proc. 4th International Conference on Learning Representations (ICLR, 2016).

  18. Butts, D. A. Data-driven approaches to understanding visual neuron activity. Annu. Rev. Vis. Sci. 5, 451–477 (2019).

  19. Pierzchlewicz, P. et al. Energy guided diffusion for generating neurally exciting images. In Proc. Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (NeurIPS, 2023).

  20. Bashivan, P., Kar, K. & DiCarlo, J. J. Neural population control via deep image synthesis. Science 364, eaav9436 (2019).

  21. DiMattina, C. & Zhang, K. Adaptive stimulus optimization for sensory systems neuroscience. Front. Neural Circuits 7, 101 (2013).

  22. Pillow, J. W. & Park, M. in Closed Loop Neuroscience (ed. El Hady, A.) Ch. 1 (Elsevier, 2016).

  23. Seung, H. S., Opper, M. & Sompolinsky, H. Query by committee. In Proc. Fifth Annual Workshop on Computational Learning Theory 287–294 (Association for Computing Machinery, 1992).

  24. Cowley, B. & Pillow, J. W. High-contrast “gaudy” images improve the training of deep neural network models of visual cortex. In Proc. Advances in Neural Information Processing Systems 33 (eds Larochelle, H. et al.) (NeurIPS, 2020).

  25. Frankle, J. & Carbin, M. The lottery ticket hypothesis: finding sparse, trainable neural networks. In Proc. 7th International Conference on Learning Representations 8954–8995 (ICLR, 2019).

  26. He, Y., Zhang, X. & Sun, J. Channel pruning for accelerating very deep neural networks. In Proc. IEEE International Conference on Computer Vision 1398–1406 (IEEE, 2017).

  27. Wang, T. et al. Large-scale calcium imaging reveals a systematic V4 map for encoding natural scenes. Nat. Commun. 15, 6401 (2024).

  28. Du, F., Núñez-Ochoa, M. A., Pachitariu, M. & Stringer, C. Towards a simplified model of primary visual cortex. Preprint at bioRxiv https://doi.org/10.1101/2024.06.30.601394 (2024).

  29. Kamali, F., Suratgar, A. A., Menhaj, M. & Abbasi-Asl, R. Compression-enabled interpretability of voxelwise encoding models. PLoS Comput. Biol. 21, e1012822 (2025).

  30. Cowley, B. R., Williamson, R. C., Acar, K., Smith, M. A. & Yu, B. M. Adaptive stimulus selection for optimizing neural population responses. In Proc. Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (NeurIPS, 2017).

  31. Willeke, K. F. et al. Deep learning-driven characterization of single cell tuning in primate visual area V4 unveils topological organization. Preprint at bioRxiv https://doi.org/10.1101/2023.05.12.540591 (2023).

  32. Szegedy, C. et al. Intriguing properties of neural networks. In Proc. 2nd International Conference on Learning Representations (ICLR, 2014).

  33. Guo, C. et al. Adversarially trained neural representations may already be as robust as corresponding biological neural representations. In Proc. 39th International Conference on Machine Learning (eds Chaudhuri, K. et al.) 8072–8081 (PMLR, 2022).

  34. Berardino, A., Ballé, J., Laparra, V. & Simoncelli, E. Eigen-distortions of hierarchical representations. In Proc. Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (NeurIPS, 2017).

  35. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM https://doi.org/10.1145/3065386 (2017).

  36. DiCarlo, J. J., Zoccolan, D. & Rust, N. C. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).

  37. Dapello, J. et al. Simulating a primary visual cortex at the front of cnns improves robustness to image perturbations. In Proc. Advances in Neural Information Processing Systems 33 (eds Larochelle, H. et al.) (NeurIPS, 2020).

  38. Federer, C., Xu, H., Fyshe, A. & Zylberberg, J. Improved object recognition using neural networks trained to mimic the brain’s statistical properties. Neural Netw. 131, 103–114 (2020).

  39. Kornblith, S., Norouzi, M., Lee, H. & Hinton, G. Similarity of neural network representations revisited. In Proc. 36th International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 3519–3529 (PMLR, 2019).

  40. Majaj, N. J., Hong, H., Solomon, E. A. & DiCarlo, J. J. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).

  41. Cadena, S. A. et al. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Comput. Biol. 15, e1006897 (2019).

  42. Conway, B. R. The organization and operation of inferior temporal cortex. Annu. Rev. Vis. Sci. 4, 381–402 (2018).

  43. Pasupathy, A., Popovkina, D. V. & Kim, T. Visual functions of primate area V4. Annu. Rev. Vis. Sci. 6, 363–385 (2020).

  44. Heeger, D. J. Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9, 181–197 (1992).

  45. Carandini, M. & Heeger, D. J. Normalization as a canonical neural computation. Nat. Rev. Neurosci. 13, 51–62 (2012).

  46. Coen-Cagli, R., Kohn, A. & Schwartz, O. Flexible gating of contextual influences in natural vision. Nat. Neurosci. 18, 1648 (2015).

  47. Burg, M. F. et al. Learning divisive normalization in primary visual cortex. PLoS Comput. Biol. 17, e1009028 (2021).

  48. Ruff, D. A. & Cohen, M. R. A normalization model suggests that attention changes the weighting of inputs between visual areas. Proc. Natl Acad. Sci. USA 114, E4085–E4094 (2017).

  49. Verhoef, B.-E. & Maunsell, J. H. Attention-related changes in correlated neuronal activity arise from normalization mechanisms. Nat. Neurosci. 20, 969–977 (2017).

  50. Ungerleider, L. G., Galkin, T. W., Desimone, R. & Gattass, R. Cortical connections of area V4 in the macaque. Cereb. Cortex 18, 477–499 (2008).

  51. Semedo, J. D. et al. Feedforward and feedback interactions between visual cortical areas use different population activity patterns. Nat. Commun. 13, 1099 (2022).

  52. Jun, N. Y. et al. Coordinated multiplexing of information about separate objects in visual cortex. eLife 11, e76452 (2022).

  53. Day-Cooney, J., Cone, J. J. & Maunsell, J. H. Perceptual weighting of V1 spikes revealed by optogenetic white noise stimulation. J. Neurosci. 42, 3122–3132 (2022).

  54. Shahbazi, E., Ma, T., Pernuš, M., Scheirer, W. & Afraz, A. Perceptography unveils the causal contribution of inferior temporal cortex to visual perception. Nat. Commun. 15, 3347 (2024).

  55. Smith Breault, M. SciDraw: monkey brain. Zenodo https://doi.org/10.5281/zenodo.3926117 (2025).

  56. Stan, P. L. & Smith, M. A. Recent visual experience reshapes V4 neuronal activity and improves perceptual performance. J. Neurosci. 44, e1764232024 (2024).

  57. Issar, D., Williamson, R. C., Khanna, S. B. & Smith, M. A. A neural network for online spike classification that improves decoding accuracy. J. Neurophysiol. 123, 1472–1485 (2020).

  58. Pospisil, D. A. & Bair, W. The unbiased estimation of the fraction of variance explained by a model. PLoS Comput. Biol. 17, e1009212 (2021).

  59. Cadena, S. A. et al. Diverse task-driven modeling of macaque v4 reveals functional specialization towards semantic tasks. PLoS Comput. Biol. 20, e1012056 (2024).

  60. Sponheim, C. et al. Longevity and reliability of chronic unit recordings using the Utah, intracortical multi-electrode arrays. J. Neural Eng. 18, 066044 (2021).

  61. Degenhart, A. D. et al. Stabilization of a brain–computer interface via the alignment of low-dimensional spaces of neural activity. Nat. Biomed. Eng. 4, 672–685 (2020).

  62. Kleiner, M., Brainard, D. & Pelli, D. What’s new in psychtoolbox-3? Perception 36, 1–16 (2007).

  63. Cohen, M. R. & Kohn, A. Measuring and interpreting neuronal correlations. Nat. Neurosci. 14, 811–819 (2011).

  64. Walker, E. Y. et al. Inception loops discover what excites neurons most using deep predictive models. Nat. Neurosci. 22, 2060–2065 (2019).

  65. Thomee, B. et al. YFCC100M: the new data in multimedia research. Commun. ACM 59, 64–73 (2016).

  66. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

  67. Lewi, J., Butera, R. & Paninski, L. Sequential optimal design of neurophysiology experiments. Neural Comput. 21, 619–687 (2009).

  68. Benda, J., Gollisch, T., Machens, C. K. & Herz, A. V. From response to stimulus: adaptive sampling in sensory physiology. Curr. Opin. Neurobiol. 17, 430–436 (2007).

  69. DiMattina, C. & Zhang, K. Active data collection for efficient estimation and comparison of nonlinear neural models. Neural Comput. 23, 2242–2288 (2011).

  70. Klindt, D., Ecker, A. S., Euler, T. & Bethge, M. Neural system identification for large populations separating “what” and “where”. In Proc. Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) (NeurIPS, 2017).

  71. Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).

  72. Doerig, A. et al. The neuroconnectionist research programme. Nat. Rev. Neurosci. 24, 431–450 (2023).

  73. Chollet, F. et al. Keras. Keras https://keras.io (2015).

  74. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) (NeurIPS, 2019).

  75. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  76. Howard, A. G. et al. Mobilenets: efficient convolutional neural networks for mobile vision applications. Preprint at arXiv https://doi.org/10.48550/arXiv.1704.04861 (2017).

  77. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. 2017 IEEE Conference on Computer Vision and Pattern Recognition 2261–2269 (IEEE, 2017).

  78. Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proc. 2017 IEEE Conference on Computer Vision and Pattern Recognition 1800–1817 (IEEE, 2017).

  79. Szegedy, C., Ioffe, S., Vanhoucke, V. & Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proc. Thirty-First AAAI Conference on Artificial Intelligence 4278–4284 (AAAI, 2017).

  80. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at arXiv https://doi.org/10.48550/arXiv.1409.1556 (2014).

  81. Zoph, B., Vasudevan, V., Shlens, J. & Le, Q. V. Learning transferable architectures for scalable image recognition. In Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 8697–8710 (IEEE, 2018).

  82. Kubilius, J. et al. CORnet: modeling the neural mechanisms of core object recognition. Preprint at bioRxiv https://doi.org/10.1101/408385 (2018).

  83. Salman, H., Ilyas, A., Engstrom, L., Kapoor, A. & Madry, A. Do adversarially robust imagenet models transfer better? In Proc. Advances in Neural Information Processing Systems 33 (eds Larochelle, H. et al.) (NeurIPS, 2020).

  84. Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci. USA 118, e2014196118 (2021).

  85. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. 32nd International Conference on Machine Learning (eds Bach, F. & Blei, D.) 448–456 (PMLR, 2015).

  86. Lurz, K.-K. et al. Generalization in data-driven models of primary visual cortex. In Proc. 9th International Conference on Learning Representations (ICLR, 2021).

  87. Turishcheva, P., Burg, M. F., Sinz, F. H. & Ecker, A. S. Reproducibility of predictive networks for mouse visual cortex. In Proc. Advances in Neural Information Processing Systems 37 (eds Globerson, A. et al.) (NeurIPS, 2024).

  88. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).

  89. Han, S., Pool, J., Tran, J., & Dally, W. Learning both weights and connections for efficient neural network. In Proc. Advances in Neural Information Processing Systems 28 (eds Cortes, C. et al.) (NeurIPS, 2015).

  90. Luo, J.-H., Wu, J. & Lin, W. ThiNet: a filter level pruning method for deep neural network compression. In Proc. IEEE International Conference on Computer Vision 5068–5076 (IEEE, 2017).

  91. Li, H., Kadav, A., Durdanovic, I., Samet, H. & Graf, H. P. Pruning filters for efficient convnets. In Proc. 5th International Conference on Learning Representations 1683–1695 (ICLR, 2017).

  92. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6572 (2015).

  93. Veerabadran, V. et al. Subtle adversarial image manipulations influence both human and machine perception. Nat. Commun. 14, 4933 (2023).

  94. Morcos, A., Raghu, M. & Bengio, S. Insights on representational similarity in neural networks with canonical correlation. In Proc. Advances in Neural Information Processing Systems 31 (eds Bengio, S. et al.) (NeurIPS, 2018).

  95. Williams, A. H., Kunz, E., Kornblith, S. & Linderman, S. Generalized shape metrics on neural representations. In Proc. Advances in Neural Information Processing Systems 34 (eds Ranzato, M. et al.) (NeurIPS, 2021).

  96. Pattisapu, S. SciDraw: monkey brain. Zenodo https://doi.org/10.5281/zenodo.17553661 (2025).

  97. Cowley, B. R., Stan, P. L., Pillow, J. W. & Smith, M. A. Data from “Compact deep neural network models of visual cortex”. Carnegie Mellon University https://doi.org/10.1184/R1/30500090 (2025).

  98. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition 2818–2826 (IEEE, 2016).

  99. Pospisil, D. A., Pasupathy, A. & Bair, W. ‘Artiphysiology’ reveals V4-like shape tuning in a deep network trained for image classification. eLife 7, e38242 (2018).

  100. Olah, C. et al. Zoom in: an introduction to circuits. Distill 5, e00024-001 (2020).

  101. Goh, G. et al. Multimodal neurons in artificial neural networks. Distill 6, e30 (2021).

  102. Ponce, C. R. et al. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177, 999–1009 (2019).

  103. Gallant, J. L., Connor, C. E., Rakshit, S., Lewis, J. W. & Van Essen, D. C. Neural responses to polar, hyperbolic, and cartesian gratings in area V4 of the macaque monkey. J. Neurophysiol. 76, 2718–2739 (1996).

  104. Pasupathy, A. & Connor, C. E. Responses to contour features in macaque area V4. J. Neurophysiol. 82, 2490–2502 (1999).

  105. Tanigawa, H., Lu, H. D. & Roe, A. W. Functional organization for color and orientation in macaque V4. Nat. Neurosci. 13, 1542–1548 (2010).

  106. Nandy, A. S., Sharpee, T. O., Reynolds, J. H. & Mitchell, J. F. The fine structure of shape tuning in area V4. Neuron 78, 1102–1115 (2013).

  107. Li, M., Liu, F., Juusola, M. & Tang, S. Perceptual color map in macaque visual area V4. J. Neurosci. 34, 202–217 (2014).

  108. Okazawa, G., Tajima, S. & Komatsu, H. Image statistics underlying natural texture selectivity of neurons in macaque V4. Proc. Natl Acad. Sci. USA 112, E351–E360 (2015).

  109. Lieber, J. D., Oleskiw, T. D., Simoncelli, E. P. & Movshon, J. A. Responses of neurons in macaque V4 to object and texture images. Preprint at bioRxiv https://doi.org/10.1101/2024.02.20.581273 (2025).

  110. Carlson, E. T., Rasquinha, R. J., Zhang, K. & Connor, C. E. A sparse object coding scheme in area V4. Curr. Biol. 21, 288–293 (2011).

  111. Abbasi-Asl, R. et al. The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. Preprint at bioRxiv https://doi.org/10.1101/465534 (2018).

  112. Hofer, H., Carroll, J., Neitz, J., Neitz, M. & Williams, D. R. Organization of the human trichromatic cone mosaic. J. Neurosci. 25, 9669–9679 (2005).

  113. Brainard, D. H. Color and the cone mosaic. Annu. Rev. Vis. Sci. 1, 519–546 (2015).

  114. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Proc. Workshop at 2nd International Conference on Learning Representations (eds. Bengio, Y. & LeCun, Y.) (ICLR, 2014).

Acknowledgements

We are grateful to S. Schmitt for assistance with data collection and spike sorting; to our animal care staff; and N. Rafidi for providing comments on the manuscript. We thank S. Pattisapu for the macaque illustration in Fig. 1a; we used SciDraw for the illustrations of the monkey96 and the brain55 in Fig. 1a. Images in experiments come from the YFCC100M dataset65; these original experimental stimuli have been replaced with close matches of copyright-free equivalents from Adobe Stock. Original experimental images can be found in the publicly available data repository97. This work was supported by a C.V. Starr Fellowship and the Pershing Square Innovation Fund to B.R.C.; a US National Institutes of Health (NIH) grant (F31EY031975) to P.L.S.; a Simons Collaboration on the Global Brain Investigator Award (SCGB AWD543027), NIH BRAIN Initiative grants (NS104899 and R01EB026946) and a U19 NIH-NINDS BRAIN Initiative Award (5U19NS104648) to J.W.P.; an NIH grant (R01EY029250) to M.A.S.; and an NIH grant (R01EY037194) to M.A.S. and B.R.C.

Author information

Authors and Affiliations

Authors

Contributions

B.R.C., P.L.S., J.W.P. and M.A.S. conceived of and designed the study. B.R.C. and P.L.S. designed and performed the closed-loop experiments. B.R.C. provided image stimuli. P.L.S. recorded the electrophysiological data. B.R.C. designed, trained and analysed the models. B.R.C. wrote the manuscript with input from P.L.S., J.W.P. and M.A.S.

Corresponding authors

Correspondence to Benjamin R. Cowley or Matthew A. Smith.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Heterogeneity of stimulus encoding in V4 populations within and across sessions, as well as model improvements for prediction.

a. Ideally, we would train the deep ensemble model on recorded responses of the same neurons to tens of thousands of images (each with multiple repeats), but this is difficult: recording the same neuron in an awake animal across multiple sessions cannot be guaranteed with current electrophysiological techniques, such as an electrode array, as the array may shift between sessions or degrade in quality owing to the gradual build-up of scar tissue. Indeed, after automated spike sorting of our V4 responses, we found that the number of neurons differed across sessions, indicating that at least subsets of neurons were not the same across sessions. Each trace is the number of neurons for one of the four datasets recorded from three monkeys (M1, M2-2020, M2-2021 and M3; M2-2020 and M2-2021 were recordings from the same monkey 5 months apart, and M1, M2 and M3 correspond to identifiers WE, PE and RA; see Methods). b. One possible way to overcome the small number of training images per neuron (1,000–3,000 images per session) is to leverage signal correlations across neurons within a session. In other words, if two neurons have similar stimulus tuning (for example, a preference for dots), the model may reuse its computations to predict both neurons, reducing its number of parameters and thus needing less training data. To test this, we computed the signal correlations of V4 neurons for the four test recording sessions (one per monkey dataset). Lines indicate medians. The signal correlations between pairs of V4 neurons within a session were low (median ρ = 0.11 across neurons and sessions). Thus, within a session, neurons had diverse tuning preferences; we found the same result for the compact models (Fig. 3b). This motivated us to ask whether we could “stitch” neurons across sessions for training. c. 
One possibility to increase the number of training images per neuron is to show only one repeat per image, without inter-stimulus intervals of blank grey screens59; however, the repeat-to-repeat variability is often too large for reliable parameter estimation (Supplementary Fig. 2). Another possibility is to repeat a small number of images each session and then match pairs of recorded units across sessions based on how correlated the pair’s responses are to the images20. A drawback to this approach is that the matching may rely too strongly on a small number of images (25 images used in ref. 20), resulting in chosen neurons that all have similar stimulus tuning (see Supplementary Fig. 2). In addition, repeating the same set of images each session takes away possible presentations of unique images: presenting the same set of 25 images would otherwise take the place of ~1,000 unique images in our experiments. To avoid these drawbacks, we sought a way to train the deep ensemble model using all available recorded units across sessions and animals. We reasoned that re-inserting an electrode array each session (that is, recording from a new set of neurons each session) would not provide enough recording time to present the number of training images needed to fit the diversity of stimulus tuning across all V4 neurons (b). Instead, we opted for chronically implanted electrode arrays that would likely record from the same neurons between sessions, with some neurons lost and added over time (a). To verify this, we computed the signal correlation between matched pairs of neurons between the first recording session (session 0) and each kth recording session. Because images differed across sessions, we used the deep ensemble’s predicted responses to 5,000 natural images to compute the signal correlations. 
For matched pairs, we started with the neuron pair with the largest signal correlation and iteratively added the pair with the next-largest signal correlation, ignoring any neurons already in a matched pair; lines indicate medians, shaded areas denote bootstrapped 90% confidence intervals (n = 25–89; exact corresponding values in a). We found that sessions nearest to the first recording (1 ≤ k ≤ 5) often had stable signal correlations, which eventually decreased or varied as the time between sessions increased. Still, the signal correlations within animals were often larger than the signal correlations of pairs matched between animals (rightmost dots, where each dot denotes a median and error bars indicate bootstrapped 90% confidence intervals over n neurons). The signal correlations between M2-2020 and M2-2021 were small (rightmost dots, ‘M2’20’ and ‘M2’21’), indicating that the 5 months between these two recording periods were enough to record from a seemingly new set of neurons; we therefore treat these two recording periods separately. These results suggest that, for the most part, recording sessions share a subset of the same neurons; in other words, the stimulus tunings of the recorded V4 population remain relatively stable across training and test sessions. We leverage this fact to train on neurons across multiple sessions, stitched together via linear mappings, and evaluate on held-out recording sessions (where neurons may differ in number or identity). d–f. Three modelling improvements that contributed to our 30% boost in prediction performance (Fig. 1b). To reduce the computational burden, we trained 5 ensemble members instead of the 25 members used for the deep ensemble model (Fig. 1b); this explains the small decreases in prediction performance between Fig. 1b and d–f. In all panels, lines denote medians, error bars denote 90% bootstrapped confidence intervals, n = 219 neurons. 
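The greedy pairing procedure just described can be written down compactly. The following is a hypothetical re-implementation from the caption's description, not the authors' code; the function name `greedy_match` and the toy two-session data are illustrative:

```python
import numpy as np

# Match units across two sessions by signal correlation, greedily taking
# the best remaining pair at each step, as described in the caption.
def greedy_match(R0, Rk):
    """R0: (n0, n_images) responses in session 0; Rk: (nk, n_images)
    responses in session k. Returns a list of (i, j, rho) matches."""
    n0 = R0.shape[0]
    # cross-session signal correlations: C[i, j] = corr(R0[i], Rk[j])
    C = np.corrcoef(np.vstack([R0, Rk]))[:n0, n0:]
    pairs, used0, usedk = [], set(), set()
    # walk pairs from largest to smallest correlation, skipping any unit
    # that already belongs to a match
    for flat in np.argsort(C, axis=None)[::-1]:
        i, j = np.unravel_index(flat, C.shape)
        if i not in used0 and j not in usedk:
            pairs.append((int(i), int(j), float(C[i, j])))
            used0.add(i)
            usedk.add(j)
    return pairs

# Toy check: session k holds the same 5 units, permuted and noisy; the
# greedy matches should recover the permutation.
rng = np.random.default_rng(1)
R0 = rng.normal(size=(5, 200))
perm = np.array([2, 0, 4, 1, 3])
Rk = R0[perm] + 0.1 * rng.normal(size=(5, 200))
pairs = greedy_match(R0, Rk)
```

In the experiments, the responses fed to this kind of matching were the deep ensemble's predicted responses to 5,000 natural images, since the images shown differed across sessions.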
First, we asked whether placing nonlinearities between ResNet50 features and V4 responses led to better prediction. We trained a 5-member deep ensemble with different numbers of skip connection layers (d, ‘ReLUs, nonlinear’); we chose 4 skip connection layers for our final deep ensemble model (Fig. 1b). We then asked if such nonlinearity was necessary by substituting all ReLU activation functions with linear (identity) functions (d, ‘no ReLUs, linear-only’), converting the deep ensemble members to entirely linear models. For reference, we considered prediction performance for linear mappings identified via ridge regression (‘linear-ridge’) or a factorized mapping (‘linear-factorized’, see Methods section ‘Predicting V4 responses’). Thus, using a nonlinear mapping (that is, a deep ensemble with ReLU activation functions) between task-driven DNN features (from ResNet50) and V4 responses better predicted responses than using a purely linear mapping typical of most studies10,15 or linear-only ensemble DNNs. This suggests the deep ensemble model captures important nonlinearities of V4 processing. Second, as expected, training on an increasing number of recording sessions led to better prediction (e). For this analysis, we varied the number of training sessions k by randomly choosing a subset of k sessions out of the 44 sessions used for training. Third, using an ensemble of DNNs versus a single DNN increased prediction performance, especially for a smaller number of training sessions (f). This increase is primarily because ensembling overcomes overfitting: each DNN in the ensemble has a different random initialization and overfits differently to the small amount of training data. Averaging over the ensemble averages away these differences in overfitting. For this analysis, we varied the number of ensemble DNNs by training a model with 40 ensemble DNNs and taking subsets of ensemble DNNs; for each subset, we recomputed prediction performance. 
To confirm that using an ensemble versus a single model is even more effective with less training data, we trained the ensembles on either all 44 recording sessions or 15 randomly-chosen recording sessions. Our final choice of 25 ensemble DNNs ensured good prediction and good estimation of ensemble disagreement (used for active learning).
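The ensemble prediction and the disagreement signal used for active learning amount to a mean and a spread over members (a minimal sketch; `models` stands in for the trained ensemble DNNs, and the names are ours):

```python
import numpy as np

def ensemble_predict(models, images):
    """Average member predictions; the spread across members ('disagreement')
    is what the active-learning loop uses to choose the next images."""
    preds = np.stack([m(images) for m in models])  # (n_members, n_images, n_neurons)
    return preds.mean(axis=0), preds.std(axis=0)
```

Averaging cancels member-specific overfitting noise, while the standard deviation flags images on which the members disagree.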

Source data

Extended Data Fig. 2 Comparing our noise-corrected R2 values versus those reported in previous studies.

a. Prediction performance (noise-corrected R2) on held-out V4 responses from our dataset for different task-driven and data-driven DNN models as well as our hybrid deep ensemble model (green). In addition to the task-driven, data-driven, and hybrid models in Fig. 1b, we also evaluated NASNetMobile81 and InceptionV398, as well as a recently-proposed model architecture called EGGnet19. We considered two versions of EGGnet. The first is a pre-trained EGGnet (‘EGGnet, pre-train’), trained to predict V4 responses from a different dataset31. We fixed the weights of its core network and used a factorized linear mapping to read out its embedding in the same way as for the task-driven DNNs. The poor performance of this task-driven EGGnet was not unexpected, as we also find that training on one set of V4 neurons fails to generalize to other sets of V4 neurons (see panel f). The second version was a data-driven EGGnet trained on our 44 recording sessions using the same procedure as for the other data-driven models. Again, we observed lower prediction performance for data-driven EGGnet than for the other data-driven models. This may stem from using a factorized linear mapping versus an attention layer readout19, especially considering EGGnet’s output embedding has a much larger number of spatial dimensions (92 × 92 × 64) than our other data-driven model (28 × 28 × 100). Lines denote medians; error bars are 90% confidence intervals, n = 219 neurons. b. The V4 responses in panel a were spike-sorted with an automated pipeline using a pre-trained convolutional neural network for classification of spikes57. Here, we tested whether this automated pipeline differed too much from manual spike sorting by a human expert. We manually spike-sorted the four held-out test sessions, removing any false spikes or noisy electrode channels. 
We then used the deep ensemble model—trained on all other sessions whose responses were only spike-sorted by our automated pipeline—to predict these manually spike-sorted responses. Same format as in a, n = 219 neurons. We found only a small decrease in noise-corrected R2 (ΔR2 < 0.02). This indicates that our automated spike-sorting pipeline had a reliability on par with that of an expert human annotator; this, along with our criteria for including electrode units as a “neuron” (see Methods), gave us strong confidence that our responses faithfully reflected the activity of the recorded V4 neurons. c. To evaluate prediction performance, we used a newly-proposed metric \({R}_{{\rm{ER}}}^{2}\)58 (mathematically defined in Methods) that is a consistent statistical estimator of the noise-corrected R2 with expected responses (ER). We compared the \({R}_{{\rm{ER}}}^{2}\) metric to the commonly-used BrainScore R2 metric, which computes an R2 ceiling by splitting repeats into two halves, corrected with the Spearman-Brown procedure15. The BrainScore R2 is computed with the following pseudocode:
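A Python sketch of this split-half procedure (our own rendering, not the authors' exact pseudocode; `responses` is an images × repeats array for one neuron and `pred` is the model's predictions):

```python
import numpy as np

def brainscore_r2(responses, pred, n_splits=100, seed=0):
    """Split-half, Spearman-Brown-corrected noise ceiling, then ceiling-corrected R^2.

    responses : (n_images, n_repeats) repeated responses for one neuron
    pred      : (n_images,) model predictions
    """
    rng = np.random.default_rng(seed)
    n_img, n_rep = responses.shape
    ceilings = []
    for _ in range(n_splits):
        perm = rng.permutation(n_rep)
        half1 = responses[:, perm[: n_rep // 2]].mean(axis=1)
        half2 = responses[:, perm[n_rep // 2:]].mean(axis=1)
        r = np.corrcoef(half1, half2)[0, 1]
        # Spearman-Brown correction for the split-half reliability
        ceilings.append(2 * r / (1 + r))
    ceiling = np.mean(ceilings)
    r_model = np.corrcoef(pred, responses.mean(axis=1))[0, 1]
    return r_model ** 2 / ceiling
```
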

While both R2 estimators should converge on the same R2 value given enough data58, it was unclear whether these estimators differed for our V4 dataset with its numbers of images and repeats. We re-evaluated the deep ensemble model predictions with the BrainScore R2 metric and found little difference between the two estimators (compare with results in a; same format, n = 219 neurons). This indicates that the BrainScore R2 metric and the noise-corrected \({R}_{{\rm{ER}}}^{2}\) we use are nearly identical for our V4 data, suggesting that our number of repeats (<10) is sufficient for estimating noise-corrected R2. d. A recent study reported a BrainScore R2 = 0.89 using ResNet50-robust to predict V4 responses20, substantially larger than our reported noise-corrected R2 = 0.48 for ResNet50-robust and R2 = 0.62 for the deep ensemble model for our data (a) as well as the reported BrainScore R2 ≈ 0.6 for the top-performing task-driven DNNs on the BrainScore leaderboard for V4 data15. Using the Bashivan et al. V4 data (n = 131 neurons), we largely reproduced their R2 results (same format as in a), where ResNet50-robust was the most predictive model with noise-corrected R2 = 0.83. Because this V4 dataset only had enough images per neuron to train a factorized linear mapping (and not the shared weights of the deep ensemble), our deep ensemble model’s prediction performance was on par with that of its ResNet50 backbone DNN (‘ensemble’ and ‘ResNet50’). We suspect that with more training sessions, the deep ensemble model would outperform task-driven DNN models, as we see in our V4 data (a). An open question is how many V4 neurons would be needed to train a data-driven foundation model, such as our deep ensemble model, until this model could reliably predict out-of-the-box a newly recorded V4 neuron with a single recording session. e. 
Another recent study reported performance of ResNet50-robust predicting V4 responses as a test correlation ρ = 0.4 and a fraction explainable variance explained FEVE = 0.12; FEVE is similar to our noise-corrected R2 as well as the BrainScore R2. We predicted this study’s V4 responses (n = 255 neurons) and, consistent with their reported FEVE values, also found a low noise-corrected R2 = 0.15 (same format as in a). This, along with the results in d, suggests that noise-corrected R2 may vary considerably across studies, making within-study comparisons more important. We investigate the reasons for these differences in R2 in Supplementary Fig. 2. The deep ensemble model (trained on our V4 data and fit with a factorized linear mapping between its embeddings and V4 responses) was outperformed by the task- and data-driven DNN models (‘ensemble’ furthest left) but had close performance to its ResNet50 backbone DNN (‘ResNet50’ also had R2 < 0.1). A previous study19 found EGGnet, trained on this same V4 data, to outperform ResNet50-robust; here, we find the opposite (‘EGGnet’ left of ‘ResNet50-robust’). This is likely for two reasons. First, we use a factorized linear mapping to fit DNN features to neural responses instead of an attention layer (used by ref. 19); because the output embedding of EGGnet has a high dimensionality (92 pixels × 92 pixels × 96 channels), the linear mapping has a large number of parameters to fit. Second, we use a pre-trained version of EGGnet and thus there may be overfitting to the same training data. Given the differences in reported noise-corrected R2 values in our work (a), in Bashivan et al. (d), and in Cadena et al. (e), we further investigated the reasons for these differences (see Supplementary Fig. 2 for detailed analyses). 
We found that these R2 differences primarily arise from the number of repeats for training images, the size of the neurons’ spatial receptive fields relative to the presented images, and how similar the tuning is across neurons. f. The deep ensemble model outperformed other models on our own V4 data but not on the V4 data from other studies (a versus d and e); we wondered if this could be explained by the fact that the ensemble DNNs were trained only on our V4 data. To test this, we trained the deep ensemble model on our V4 data from different subsets of animals and then evaluated the model on responses from individual held-out test sessions. Each bar represents the median R2 across neurons from monkey 1 (M1), monkey 2 (M2), or monkey 3 (M3), corresponding to identifiers WE, PE, and RA; error bars denote bootstrapped 90% confidence intervals, n = 33, 89, 42 neurons for M1, M2, M3, respectively. As expected, training on data from a single animal best predicted its corresponding test data ({M1}, {M2}, {M3}). Training on data from two animals did not improve prediction performance for the test data from the left-out animal (for instance, evaluating M3 test data for training on {M1,M2}). These results suggest that to achieve an out-of-the-box performance better than that of a task-driven DNN, data-driven foundation DNN models will need to train on sessions from a large number of animals (or many different V4 neurons). For predicting responses from a targeted set of recorded V4 neurons, one may rely on a data-driven approach when prediction is most important (as in our work to identify the underlying computations), whereas a task-driven approach may be more appropriate when recording time is limited.
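The factorized linear mapping used throughout these comparisons can be sketched for a single neuron as follows (a minimal sketch; the shapes and names are ours):

```python
import numpy as np

def factorized_readout(features, spatial_mask, channel_weights, bias=0.0):
    """Factorized linear mapping from a DNN activation map to one neuron.

    features        : (H, W, C) DNN features for one image
    spatial_mask    : (H, W) where the neuron 'looks'
    channel_weights : (C,) which features the neuron is tuned to
    Factorizing the full (H, W, C) weight tensor as mask x weights cuts the
    parameter count from H*W*C to H*W + C.
    """
    pooled = np.einsum('hwc,hw->c', features, spatial_mask)  # spatial pooling
    return float(pooled @ channel_weights + bias)
```
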

Source data

Extended Data Fig. 3 Response-maximizing synthesized images for all compact models.

For each compact model, we computed its response-maximizing synthesized image via gradient ascent techniques (Fig. 2a). Here, we show these images for all compact models (one image per compact model). For visual clarity, we loosely grouped images into categories by eye. We make no claims about grouping of V4 neurons; in fact, the mean signal correlation squared across all pairs of compact models was low (ρ2 = 0.11 between model responses to 10,000 natural images, see Fig. 3b, ‘output’). This remained true when controlling for spatial receptive field location and size (Fig. 3b, ‘layer 5’). Thus, the stimulus preferences across V4 neurons appear to be largely heterogeneous, allowing for a highly-expressive set of features for downstream processing in IT and other brain areas36,43,99,100,101,102. The image statistics of the preferred images of V4 neurons—edges, curves, textures, dots, etc.—largely match those observed in many previous studies9,27,30,103,104,105,106,107,108,109, including those that synthesize preferred images19,20,31,110,111. Our work provides three novelties: We consider response-maximizing synthesized images in color (which, to our knowledge, have not yet been identified for V4); we explore a large number of natural images (500,000 images) to find the natural images that maximize a neuron’s response (Fig. 2b); and we identify slight perturbations of natural images that elicit large changes in responses (ε-perturbed images, see Fig. 2d–f). A salient observation is the large extent to which greens, magentas, and yellows appear in our synthesized preferred images. Many V4 neurons have strong preferences for color27,105,107; a preference for greens and magentas versus blues may arise from the fact that primate retina has many more L and M cones (that detect red and green hues) compared to S cones (that detect blue hues)112,113. 
An exciting future direction is to characterize the strength of color processing in V4 and how it relates to natural image statistics. Overall, our response-maximizing synthesized images (MSIs) (also known as most exciting inputs) largely agree with those of other studies19,20,31, though they are perhaps not as crisp or object-detailed. Our MSIs have shape 112 pixels × 112 pixels (versus the typically displayed 224 pixels × 224 pixels MSIs) and are in color (versus the greyscale used in previous studies), which may explain the differences in crispness. It is reassuring that the MSIs of our 5-layer compact networks have similar statistics to those of MSIs derived from large task-driven DNNs, as MSIs from the latter may more strongly reflect the internal image statistics priors of task-driven DNNs rather than the true preferences of V4 neurons. For example, a V4 neuron may truly prefer two dots, but the task-driven DNN may embellish its MSI with two eyes (likely the most common occurrence of two dots in a large image dataset). We make no claims that our MSI approach is the best, but rather use the MSI approach to experimentally validate the predicted stimulus preferences of the compact models. A head-to-head comparison is needed to determine which type of MSIs drive V4 responses the strongest, which should also include the response-maximizing natural images chosen from a large candidate pool (500,000 images or greater).
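The gradient-ascent synthesis of an MSI can be sketched as follows (a toy sketch; in practice `grad_fn` would come from backpropagating the compact model's response with respect to the image, and the step size and iteration count here are illustrative):

```python
import numpy as np

def synthesize_msi(grad_fn, shape, n_steps=200, lr=0.1, seed=0):
    """Gradient-ascent sketch of a most exciting input (MSI).

    grad_fn(img) returns d(response)/d(img); for a real compact model this
    gradient is obtained by backpropagation.
    """
    rng = np.random.default_rng(seed)
    img = 0.5 + 0.05 * rng.normal(size=shape)   # start near mid-grey
    for _ in range(n_steps):
        img += lr * grad_fn(img)                # ascend the response
        img = np.clip(img, 0.0, 1.0)            # keep pixels in a valid range
    return img
```
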

Extended Data Fig. 4 V4 responses to maximizing images.

The response-maximizing images of the compact models drove larger responses compared to those of randomly-chosen natural images (Fig. 2b,c). Here, we take a closer look at all responses to these maximizing images as well as compare the statistics between the response-maximizing natural images, response-maximizing synthesized images, and the natural images with the largest responses. a. In our validation experiments, we matched each compact model with a corresponding V4 neuron and tested whether the maximizing images predicted by the compact models strongly drove V4 responses. One possibility is that the compact models identified maximizing images that strongly drove all V4 neurons together without specificity for each V4 neuron’s stimulus preference—for example, an image with high contrast likely evokes larger responses for all neurons than an image with low contrast. To see if this were the case, we examined the repeat-averaged V4 responses (left heatmap for monkey 3, RA, and right heatmap for monkey 2, PE) to response-maximizing natural images, response-maximizing synthesized images, and randomly-chosen natural images from our validation experiments (Fig. 2b,c). For each V4 neuron, we normalized its responses between 0 and 1 and sorted neurons (rows) based on the average normalized response to the natural images. For monkey 3 (left heatmap), we trained 42 compact models on data from previous recording sessions and probed the compact models for their response-maximizing natural and synthesized images. We then presented these probing images in a following session, matching neuron to compact model via the randomly-chosen natural images (see Methods). 
We organized the columns for the response-maximizing natural images such that the first 10 columns were responses to the maximizing natural images predicted by the first compact model; the next 10 columns were for the second compact model, and so on; the response-maximizing synthesized images were ordered in the same way. For the natural images, we sorted columns based on the mean response across all V4 neurons, where the largest mean response was the rightmost column. We organized the response heatmap for monkey 2 (right heatmap) in a similar way; for this animal, only one response-maximizing natural image and one response-maximizing synthesized image were generated for each of 50 compact models. We found that it was not the case that the response-maximizing images for one compact model drove the entire V4 population (red squares in columns are sparse) and that few neurons were activated by all response-maximizing images (red squares in rows are sparse). We also found similar response patterns to response-maximizing natural images and response-maximizing synthesized images (left heatmap, compare heatmaps between natural images and synthesized images; this was more difficult to see in the right heatmap due to one image per compact model), consistent with both sets of response-maximizing images having similar image statistics. Our main finding was that the response-maximizing images—predicted by the compact models trained only on previous sessions—yielded larger responses than responses to randomly-chosen natural images (compare left to right in each heatmap, also see Fig. 2b,c). b. We noticed that some of the responses to randomly-chosen natural images were larger than the response-maximizing images predicted by the compact models (black dots, ‘natural images with largest responses’); each dot denotes the repeat-averaged response of one V4 neuron to one image. 
Because the randomly-chosen natural images could include images similar to the response-maximizing images, we wondered to what extent these natural images with largest responses resembled the response-maximizing images. For 8 example neurons, we found close matches in image statistics between response-maximizing natural images (orange), response-maximizing synthesized images (red), and natural images with largest responses (black). For example, V4 neuron 1 preferred images with buttons and circles as well as images with round human faces and curved objects. V4 neuron 2 preferred images with a vertical object with a high-spatial frequency texture. V4 neuron 3 preferred images with dots as well as images with eyes and far away objects. This further increased our confidence that the preferred stimuli as predicted by the compact models were good approximations for the true preferred stimuli of the real V4 neurons. Images were reproduced from Adobe Stock.
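The row normalization and sorting used for these response heatmaps can be sketched as follows (a minimal sketch; the names are ours):

```python
import numpy as np

def response_heatmap(R, natural_cols):
    """Normalize each neuron's responses to [0, 1], then sort neurons (rows)
    by their mean normalized response to the natural-image columns.

    R : (n_neurons, n_images) repeat-averaged responses
    """
    lo = R.min(axis=1, keepdims=True)
    hi = R.max(axis=1, keepdims=True)
    norm = (R - lo) / (hi - lo)
    order = np.argsort(norm[:, natural_cols].mean(axis=1))
    return norm[order]
```
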

Source data

Extended Data Fig. 5 Further experimental validations of the compact models.

Our compact models’ predictions held up to experimental validation (Fig. 2), including identifying response-maximizing natural and synthesized images. Here, we compare which type of these images better drives V4 responses. In addition, we present results of further experimental validations, including identifying saliency maps and “ε-perturbed smooth” images. a-c. We found that the compact models predicted much stronger V4 responses to response-maximizing synthesized images versus response-maximizing natural images (a, dots above dashed line; difference of means between maximizing synthesized and maximizing natural: 111.7 spikes/sec, p < 0.002, permutation test, n = 78). However, this was not the case for the real responses—both types of images evoked similarly large responses (b, dots hug dashed line, difference of means not significant, p = 0.836, permutation test, n = 78). This difference largely stems from the compact models’ inability to predict V4 responses to maximizing synthesized images (c, red dots, coefficient of determination R2 = −26.0, not noise-corrected) whereas their prediction for maximizing natural images remains relatively intact (c, orange dots, coefficient of determination R2 = 0.2, not noise-corrected). Each dot denotes the response of a single neuron, averaged over repeats and maximizing images (if multiple maximizing images were shown for a V4 neuron, see Fig. 2b,c). This inability to predict maximizing synthesized images was not unexpected—the optimizing procedure had full access to every weight and every pixel to optimize an image tailored to that compact model, and some “adversarial” noise was expected32,34. Moreover, the resulting synthesized images were well outside the distribution of training images, and we would expect poor prediction for these outlying regions of image space. This was one motivation for training our deep ensemble model with closed-loop active learning (Fig. 
1c) in which we trained the model on out-of-distribution images. We were also surprised that the maximizing natural images evoked V4 responses as large as those to maximizing synthesized images (b). This suggests that one cannot rule out choosing from a large pool of candidate images (in this case, 500,000 candidate images) to maximally drive V4 responses. d. A commonly-used approach to explain a DNN’s output is to identify which parts of an input image are the most relevant or “salient” for the DNN’s prediction; this approach is called saliency analysis114. One implementation is to smooth a small patch of the image (small orange circles in example images denote smoothed patches; these circles were not present in actual stimuli) and see if the resulting response is larger or smaller than the response to the original image. An increase in response (pink) indicates that the visual feature within the smoothed patch is distracting, as removing this feature leads to a larger response. Likewise, a decrease in response (green) indicates a salient or excitatory visual feature. We fed as input a set of images where the (i,j)th image had a smoothed image patch centered at (i,j). We then formed a heatmap of the resulting responses r based on the (i,j)th image patch (forming a ‘saliency heatmap’). For this example image of a squirrel and the chosen compact model, the most salient features are the eyes (green regions), while the most distracting features are the fur texture and the edges around the left eye (pink regions). The squirrel images were reproduced from Adobe Stock. e. In our validation experiments, we probed the trained compact models to identify maximizing natural images (Fig. 2b). For each maximizing natural image, we computed the saliency heatmap of the compact model (‘compact model prediction’) following the procedure in d. 
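The patch-smoothing saliency procedure can be sketched as follows (a simplified sketch that replaces each circular patch with its local mean as a stand-in for Gaussian smoothing; the names and parameters are ours):

```python
import numpy as np

def saliency_heatmap(model, image, patch_radius=5, stride=8):
    """Occlusion-style saliency: smooth one local patch at a time and record
    the change in the model's response (>0 distracting, <0 salient)."""
    H, W = image.shape
    base = model(image)
    yy, xx = np.ogrid[:H, :W]
    heat = np.zeros((H // stride, W // stride))
    for a, i in enumerate(range(0, H, stride)):
        for b, j in enumerate(range(0, W, stride)):
            mask = (yy - i) ** 2 + (xx - j) ** 2 <= patch_radius ** 2
            probe = image.copy()
            probe[mask] = image[mask].mean()  # stand-in for Gaussian smoothing
            heat[a, b] = model(probe) - base
    return heat
```
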
We then used these predictions to smooth the 25 non-overlapping image patches that led to the largest changes in responses (where the number 25 was chosen as a compromise between covering as much of the image as possible and staying within recording time constraints). In a following session, we showed each ‘base’ image as well as the 25 smoothed images, each with one smoothed image patch. For each example image shown here, we matched a V4 neuron with the image’s corresponding compact model (in the same way as in Fig. 2b,c) and computed the resulting saliency heatmap for V4 neurons (rightmost panels). Responses were z-scored using the mean and standard deviation estimated with the V4 neuron’s responses to all natural images shown in the session. We found that V4 neurons did vary their responses to local smoothing and that these changes in responses largely matched those predicted by the compact models. Thus, for a given image, a V4 neuron’s response can be suppressed and excited by different local visual features; our compact models can be used to predict which features at which locations do so. The natural images were reproduced from Adobe Stock. f. Inspired by the saliency approach in d and e, we experimentally validated our compact models by having them predict which visual features of an image to smooth in order to minimize a V4 neuron’s response. We first began with the compact model’s maximizing natural image as a base image. Then, in a greedy manner, we iteratively chose the image patch to smooth that led to the most suppressed response as predicted by the compact model (‘smooth-’, see orange circle). Successive iterations added image patches that did not overlap with previously-chosen image patches. This led to an image with specific visual features smoothed away. Bottom inset: a sequence of images for which a base image is cumulatively smoothed at different patches determined by the compact model’s predictions; the final smoothed image is the rightmost. 
Images were reproduced from Adobe Stock. g. Example base images (left, top row, ‘maximizing base images’), smoothed versions to minimize the model output response as predicted by a compact model (left, middle row, ‘smoothed- images’), and images for which randomly-chosen patches were smoothed as a control (left, bottom row, ‘randomly-smoothed images’). The randomly-smoothed and smoothed- images had the same number of pixels smoothed. These example maximizing base images and randomly-smoothed images elicited similarly large responses from a V4 neuron (right, black versus green dots, p = 0.70, two-sided permutation test, n = 10) whereas the smoothed- image led to a substantially smaller response (black versus blue dots, p < 0.001, permutation test, n = 10, significance denoted by an asterisk). Each dot denotes the repeat-averaged response to one image. The response-maximizing base images were reproduced from Adobe Stock. The smoothed- and randomly smoothed images were adapted from Adobe Stock. h. Responses for all V4 neurons from two recording sessions (each session from a different animal). V4 neurons were matched to compact models via held-out images (same procedure as used in Fig. 2b–f). For one session, only one base image was shown per compact model; for these images, dots denote repeat-averaged responses with no error bars. For the other session, we presented 10 base images (and their smoothed- and randomly-smoothed counterparts) per neuron. For this session, dots denote the average response over the 10 images, and error bars denote 1 s.e.m. of the repeat-averaged responses over the 10 images. Neurons were sorted based on mean response to the base images. 
We found that responses to smoothed- images were roughly half the size of responses to the base images across V4 neurons (blue versus black dots, normalized percent change computed as \( \% \Delta \bar{r}=100\cdot [\bar{r}({\rm{smoothed-}})-\bar{r}({\rm{base}})]/\bar{r}({\rm{base}})\): mean \( \% \Delta \bar{r}\pm \) s.e.: −46.1% ± 3.5%, where \(\bar{r}\) denotes a neuron’s repeat-averaged response). There was little to no decrease between randomly-smoothed and base images (green versus black dots, mean \( \% \Delta \bar{r}\pm \) s.e.: −3.7% ± 1.9%). Thus, confirmed via experimental validation, the compact models accurately predicted which visual features (and their spatial locations) were most salient to the V4 neurons. This provides further evidence that the compact models accurately capture the stimulus preferences of V4 neurons—and which visual features in those preferred stimuli are most important to the V4 neuron.
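The greedy selection of non-overlapping patches in panel f can be sketched as follows (a minimal sketch; local-mean smoothing stands in for the Gaussian smoothing used in the experiments, and the names are ours):

```python
import numpy as np

def greedy_smooth(model, image, masks, n_patches):
    """Greedy 'smoothed-' image: iteratively blur the non-overlapping patch
    that most suppresses the model's predicted response.

    masks : list of boolean (H, W) arrays, one candidate patch each
    """
    img = image.copy()
    used = np.zeros(image.shape[:2], dtype=bool)
    for _ in range(n_patches):
        best, best_resp = None, np.inf
        for m in masks:
            if (m & used).any():
                continue  # patches must not overlap previous choices
            probe = img.copy()
            probe[m] = img[m].mean()  # stand-in for Gaussian smoothing
            r = model(probe)
            if r < best_resp:
                best, best_resp = m, r
        if best is None:
            break
        img[best] = img[best].mean()
        used |= best
    return img
```
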

Source data

Extended Data Fig. 6 Compressing models of V1, V4, and IT neurons as well as DNN units.

a-c. Given the large amount of compression observed for shared compact models trained on the predicted responses of task-driven DNN models for our V4 response dataset (Fig. 3c), we wondered to what extent we could compress these models on different response datasets from brain areas along the visual stream. To this end, we compressed ResNet50-robust predicting macaque V1, V4, and IT datasets (Fig. 3d–f); here, we measure the compressibility of three other task-driven DNNs: ResNet50, CORnet-S, and VGG19 (the data-driven shared compact models, purple traces, are the same as in Fig. 3d–f). We trained shared compact models via distillation, varying the number of filters in the first three “core” layers (see Methods). We then compared the noise-corrected R2 of these shared compact models with the full prediction performance of their task-driven DNN counterparts (rightmost dots). Interestingly, some compact models outperformed the full task-driven DNNs (b, blue, green, mauve traces above corresponding dots), as is sometimes observed with knowledge distillation16. Overall, the general trends of ResNet50-robust hold for all of the other task-driven DNNs: Task-driven DNNs of cortical neurons are highly compressible, with the most compression achieved for V1, followed by V4, and then IT. We verified that these compact models achieve similar prediction as large task-driven DNNs while being orders of magnitude smaller by submitting the compact models to the public benchmark BrainScore; the compact models are now state-of-the-art at predicting the corresponding V4 and IT datasets (Supplementary Table 2). Traces and dots denote medians; error bars denote bootstrapped 90% confidence intervals, n = 115, 88, 168 neurons for a, b, c, respectively. Arrows denote the smallest number of filters needed to achieve a prediction performance no more than 5% below that for 200 filters/core layer. d. 
We also evaluated the compressibility of the task-driven DNNs on the V4 response dataset from Bashivan et al.20 (n = 131 neurons), which we analyzed in comparison to our own V4 response dataset (Extended Data Fig. 2). Similar to all other datasets (Fig. 3c–f), we found high compressibility (arrows at 10-15 filters). These results provide further evidence that task-driven DNN models of V4 neurons can be reduced in size by orders of magnitude. e. Given the compressibility of models of V4 neurons, we wondered to what extent we could compress a model of responses from DNN units (that is, internal units from an intermediate layer of a task-driven DNN), treating these units as V4 neurons. To test this, we trained shared compact models to predict responses of 219 DNN internal units (matching the number of our recorded V4 neurons in Fig. 3c). DNN units were chosen as the hidden units centrally located in the activity map (that is, for 14 × 14 pixel activity maps, we chose hidden units with spatial locations [7,7]) with the largest activity variance over 5,000 natural images; layers matched those most predictive of our V4 responses (see Methods). We trained these shared compact models (same architecture as those in a–d) to predict DNN unit responses taken directly from a task-driven DNN to 12 million images and varied the number of filters in each of the “core” layers. We computed the raw R2 between DNN unit responses and the predicted responses of the shared compact models; because the DNN units were deterministic, we did not need a noise-corrected R2 metric. Despite there being no noise in the DNN unit responses, the 200-filter shared compact models failed to achieve good prediction performance for most task-driven DNNs (‘200’, traces below R2 = 0.6; traces indicate median R2s and error bars denote 90% confidence intervals over 219 units). Moreover, the DNN units were not easily compressible (arrows at 75-100 filters). 
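The selection of central, high-variance DNN units described above can be sketched as follows (a minimal sketch; the names are ours):

```python
import numpy as np

def pick_central_units(act, n_units):
    """Pick centrally-located DNN channels with the largest activity variance.

    act : (n_images, H, W, C) activations from one task-driven DNN layer
    """
    _, H, W, _ = act.shape
    center = act[:, H // 2, W // 2, :]        # responses of the central units
    var = center.var(axis=0)
    idx = np.argsort(var)[::-1][:n_units]     # highest-variance channels first
    return center[:, idx], idx
```
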
That an individual DNN unit is not compressible likely arises from the fact that these DNN units must carry out complex computations yet read out from only ~1,000 channels of an upstream layer to aid in performing object recognition; V4 comprises hundreds of millions of neurons and may not require each individual V4 neuron to carry out many computations on its own. f. We wondered whether the DNN units were truly less compressible than V4 neurons or whether they only appeared less compressible because our R2 metric did not include a noise ceiling; in other words, the DNN unit responses may have had spurious nonlinearities that would otherwise be hidden by a noise-corrected R2. To see if this were the case, we first fit factorized linear mappings from the activity of task-driven DNNs (same layers and including the same DNN units as in e) to the 219 V4 neurons from the 4 test recording sessions. In other words, this setting differs from e only in that a linear mapping is appended to all DNN units from that layer of the task-driven DNN. We then distilled this model with shared compact models, following the same procedure as in e. In contrast to the DNN units, we observed strong compressibility: The shared compact models with 200 filters had good prediction (rightmost end of traces, all traces have R2 > 0.6), strong plateauing for larger numbers of filters (for example, between 50 and 200 filters/core layer), and only a small number of filters was needed to be within a 5% drop from the performance of 200 filters/core layer (arrows), consistent with the V4 results across datasets (b and d, as well as Fig. 3c). These results rule out the possibility that DNN units fail to compress because of small, spurious nonlinearities. Instead, it appears that linear readouts from DNN units that align with V4 neurons are more compressible than the DNN units themselves. 
An exciting possibility is that V4 may comprise expressive yet simple functions to ensure accurate stimulus encoding with robustness to noise, and that task-driven DNNs have only partially arrived at these compressible representations (that is, they are still susceptible to adversarial noise, etc.). By optimizing linear readouts of DNN units to achieve high levels of compression, one may achieve a task-driven representation more similar to that of V4. Future experiments are needed to ensure subtle nonlinearities of V4 neurons are not hidden behind repeat-to-repeat variability.

Source data

Extended Data Fig. 7 Identifying real V4 dot detectors with experimental validation.

To confirm the presence of dot-detecting V4 neurons, we ran validation experiments specifically tailored for compact models that resemble dot detectors. We first identified compact models by training on previous recording sessions. From the identified compact models, we chose the 5 compact models that most resembled dot detectors based on their stimulus preferences (response-maximizing natural and synthesized images) and their responses to artificial dot stimuli (see Fig. 4a,b and Extended Data Fig. 8). We note that the dot-detecting compact model in Fig. 4 was not among the five chosen, as that compact model was matched to a V4 neuron from another animal. In a subsequent recording session, we presented the response-maximizing natural and synthesized images of the five chosen models as well as artificial dot stimuli (same dot stimuli as in Fig. 4b and Extended Data Fig. 8b). We identified the five recorded V4 neurons that best matched the predictions of the five compact models (by computing the noise-corrected R2 from all other images shown in the session, same procedure as in Fig. 2b–f) and show their responses here. a. The response-maximizing natural images (left, examples) and response-maximizing synthesized images (middle, examples) chosen from the five compact models tended to drive stronger responses than randomly chosen natural images (right, ‘resp.-max. synth. images’ and ‘resp.-max. natural images’ dots more to the right than ‘natural’ black dots). Each dot is the repeat-averaged response to one image; lines denote medians. All response-maximizing stimuli yielded median responses significantly greater than the median response to natural images (p < 0.02, one-sided permutation test, n = 10, asterisks) except one set of maximizing synthesized stimuli (bottom row, V4 neuron 5, red dots, p = 0.922, one-sided permutation test, n = 10). The response-maximizing natural images were reproduced from Adobe Stock. b.
Real V4 responses to the artificial dot stimuli that varied in dot location (left), dot size (middle) and number of dots (right). Same format as in Fig. 4b and Extended Data Fig. 8b. Dot locations were subsampled to 28 × 28 locations to limit the number of images. Error bars in ‘vary dot number’ denote 1 s.e.m. across 10 different images, where each image had the same number of dots but in randomly chosen, non-overlapping locations within 25 pixels of the preferred dot location. We found that these V4 neurons had preferred dot locations (left column), preferred dot sizes (middle column) and preferred numbers of dots (right column), consistent with these V4 neurons being dot detectors. Together, these results provide strong evidence for the presence of dot detectors in V4. We observed diverse selectivity to dot size, including V4 neurons selective to the tiniest dots (neurons 1, 2 and 5) and to small dots (neurons 3 and 4). Similarly, we observed selectivity for one dot (neurons 2 and 4), two dots (neuron 3) and three or more dots (neurons 1 and 5). Thus, even within the class of dot detectors, there appears to be large diversity in stimulus preferences.

Source data

Extended Data Fig. 8 Explaining a dot-detecting compact model’s selectivity for dot number.

To understand the inner computations of a compact model, we chose to investigate a particular compact model that resembled a dot detector (Fig. 4). We focused on the model’s selectivity to dot size (Fig. 4b) and uncovered a simple computation by isolating the filters that contributed to dot size selectivity (Fig. 4c–h). However, we suspected this compact model was also selective to the number of dots, as its preferred stimuli typically had two to five dots in the image (Fig. 4a). In addition, it was unclear to what extent dots fully explained the model’s responses, which likely also depend on other, more complicated feature processing (for example, the texture surrounding the dot in the response-maximizing synthesized images, Fig. 4a). Here, we investigate this model’s preference for dots and focus on how dot number selectivity may arise in a compact model. a. Diagram of the compact model that resembles a dot detector. We expected to see dot-like filters (that is, center-surround filters with middle excitation surrounded by inhibition, or vice versa) in layers 1–3 but found none. This led us to identify the corner-curvature and large-edge-detecting filters that ultimately contributed to dot size selectivity (Fig. 4c–h). b. Besides varying dot size, we also varied the location of a small dot with a radius of 5 pixels (left, ‘vary dot location’) and the number of dots (right, ‘vary dot number’). For dot number, we placed dots greedily, choosing each next dot’s location to maximize the model’s response. This compact model (same as in a and Fig. 4) preferred two to three dots to the center right (with dot radii of 5 pixels). c. In the same manner as we identified filters contributing to dot size selectivity (Fig. 4c,d), we ablated each filter and measured the model’s dot number invariance.
A dot number invariance of 1 indicates that, after ablating a filter, the model is no longer able to detect different numbers of dots; a dot number invariance of 0 indicates no change in the model’s output after filter ablation (that is, dot number selectivity remains intact). Similar to dot size selectivity (Fig. 4e), we found that the filters contributing most resided in layers 4 and 5, consistent with the idea that a compact model’s specialization occurs in its consolidation step. d. To better understand the chosen compact model’s responses to dots versus other image features, we fed in images from different stimulus classes, including gratings, plaids, dots and combinations of the three. Background colors were sampled from colors found in natural images. We found that punctate black dots yielded the largest responses (‘black dots’) but that the response range was well above that for natural images (compare ‘black dots’ and ‘natural’), suggesting that black dots alone did not capture the full response profile of the model. Dots that varied in size and color (‘dots’), as well as different combinations of gratings and plaids (with different orientations and spatial frequencies), better matched the response distribution of natural images. For the combination of dots, gratings and plaids, each feature (dots with different sizes and colors, black dots, gratings and plaids) was randomly added with probability 0.5. Natural greyscale images (‘natural greyscale’) tended to drive weaker responses than natural images (‘natural’), suggesting that the compact model encodes color. The natural greyscale images were adapted from Adobe Stock. The natural images were reproduced from Adobe Stock. e.
To assess the importance of dots to the chosen compact model, we used the compact model’s responses to each stimulus class in d as training data to train a “student” DNN model (same architecture as the distilled model with 5 layers and 100 filters per layer, see Methods) to match the original compact model (the “teacher” model) as closely as possible. We evaluated each student model’s prediction performance (noise-corrected R2) on held-out real V4 responses to natural images for the V4 neuron corresponding to the chosen compact model. For each student model, we trained on the same amount of data (12 million images), where each image was either artificially generated (dots, gratings, plaids, …) or from our image dataset (greyscale natural and natural images). Each bar denotes the noise-corrected R2 for this particular V4 neuron (n = 1). We found that training separately on gratings, plaids, gratings and plaids, or black dots led to poor prediction performance compared to the original (orange bar). Increasing the complexity of the artificial stimuli to include dots of different sizes and colors (‘dots’), as well as adding gratings and plaids to the background (‘dots+gratings’, ‘dots+plaids’, ‘dots+grats+plaids’), improved performance, especially for dots and plaids. The prediction performance for training on dots and plaids was only 90% of the original performance (‘dots+plaids’ versus ‘original’). This suggests that stimuli that strongly drive the compact model (dots) and stimuli that weakly drive it (plaids) are both important for capturing the response distribution of natural images (compare response distributions in d). Finally, training on greyscale natural images yielded worse performance than training on colorful natural images (grey bar versus black bar), indicating the importance of color for the compact model.
We note that the noise-corrected R2 = 0.4 for this neuron suggests that the neuron itself uses computations not captured by the compact model, although the stimulus preferences are likely the same (Fig. 2). f. To understand the specific computations underlying dot number selectivity, we fed in three input images with different numbers of dots (‘input image’). We then observed the resulting activity and filter weights for layer 5 and the spatial readout layer. We noticed that the input to layer 5 (‘layer 5 input’) appeared to detect the presence of a dot at any given location (matching our intuition for identifying a single dot in Fig. 4f–h). The layer 5 filters (chosen as those with the largest dot number invariance in c) appeared to extract specific patterns of the detected dots, with sparse weights that have large magnitudes on the perimeters (for example, the first filter in layer 5 has large excitatory weights on the four corners). This spacing of the weights enables the filters to detect multiple dots spaced apart while not responding to dots too close together (as these would not align with the filter weights). After convolution with the layer 5 filters, some filters formed large excitatory regions around the dots whereas others formed smaller regions of inhibition (‘layer 5 conv. output’). After a linear combination of these filters (by multiplying with the readout matrix Wreadout) and the ReLU activation function, we found that most of the dot regions were extinguished, with some filters activated by the boundary (or shell) of the dot region (‘spatial readout layer input’). This activity was then excited or suppressed by spatial readout filters that act as spatial receptive fields. After taking a linear combination over filters (‘sum over filters’), we observed that a single small dot yielded both excitation and inhibition (top activity map), whereas three dots led to large activity around the region of the dots (middle activity map).
Many dots led to little or no activity (bottom activity map). Summing over spatial information across pixels recreated the selectivity to dot number (rightmost plot). g. We illustrate the computational motifs in f with a conceptual diagram. The key concept is that the model identifies a region of dots with filters that have sparse weights (encouraging distance between dots), extracts the shell around this region (by creating a larger excitatory region and a smaller inhibitory region), and queries the size of this shell. Too small a region (a single dot) leads to weak activity and inhibition (rightmost, top activity map). Too large a region leads to weak excitation, as the shell of the region falls outside the spatial receptive field. Only an appropriately sized region (that is, a specific number of dots) will fit within the spatial receptive field, leading to strong excitation. These results suggest two key computations for dot number selectivity: (1) filters with sparse weights that identify spaced-apart dots and (2) extracting “regions” of interest and then identifying the boundary of these regions. If the region of interest is too large, the region’s boundary will fall outside the spatial receptive field, yielding a small response. These computations may comprise recurring themes in the visual cortex for other visual features.
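The single-filter ablation analysis in c can be sketched as follows. Here `model_fn`, the variance-based invariance score and the toy linear model in the test are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def invariance_after_ablation(model_fn, stimuli, filter_weights, filt_idx):
    """Ablate one filter and measure how much tuning collapses.

    model_fn(stimuli, weights) -> responses over the dot-number stimulus set.
    Returns ~1 if ablating the filter abolishes tuning (response variance
    drops to zero) and ~0 if tuning is unchanged.
    """
    baseline = model_fn(stimuli, filter_weights)
    ablated_weights = filter_weights.copy()
    ablated_weights[filt_idx] = 0.0          # zero the filter's weights
    ablated = model_fn(stimuli, ablated_weights)
    return 1.0 - ablated.var() / (baseline.var() + 1e-12)
```

Ablating a filter that carries all of the stimulus-driven variance yields an invariance near 1; ablating a filter that responds identically to every stimulus yields an invariance near 0.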

Source data

Extended Data Fig. 9 Other dot-detecting compact models share similar computations for dot size selectivity.

We chose to investigate a single compact model that strongly resembled a dot detector (Fig. 4), but we found other compact models whose preferred stimuli contained dots (Fig. 2a). We wondered whether these other dot-detecting compact models used computations for dot size selectivity similar to those of the compact model we investigated. We chose two other compact models, performed the same analyses as for our chosen compact model (Fig. 4), and found that all three models used similar computations. This suggests that these compact models converged to the same solutions despite different random initializations and different target V4 neurons. a. Response-maximizing natural and synthesized images for a second compact model. The response-maximizing natural images were reproduced from Adobe Stock. b. Stimulus tuning to artificial dot stimuli (same stimuli as in Fig. 4b and Extended Data Fig. 8b). This compact model preferred a dot to the bottom left with a dot size of 5 pixels and 3–5 dots. c. We computed the dot size invariance (DSI) for each filter by ablating that filter and comparing the model’s outputs to its outputs with no ablation (see Methods, same procedure as in Fig. 4c,d). Filters with the largest DSIs (that is, filters that strongly contribute to dot size selectivity) were in the deeper layers (4, 5 and the spatial readout). d. We performed cumulative ablation (ablating all previously chosen filters and choosing as the next filter to ablate the one that led to the smallest increase in DSI; same procedure as in Fig. 4e) on the filters in layer 3 and found that roughly 10 filters together strongly contributed to dot size selectivity (rightmost dots). e. We investigated 6 filters that contributed to dot size selectivity; the other contributing filters appeared to perform computations redundant with those of the chosen filters. Much like the first chosen compact model (Fig.
4f), we found 4 excitatory filters that detected corners of dots (‘layer 3 filters’, top 4 filters with pink weights) and 2 inhibitory filters that detected large edges (‘layer 3 filters’, bottom 2 filters with green weights). In response to an image of a small dot, summing the output of these filters (‘layer 3 conv. output’) led to a large amount of overlap for the outputs of the excitatory filters but not for the inhibitory filters. The resulting large activity after layer 4 (‘layer 4 output’), once summed across pixels, would lead to a dot being detected. We note that the layer 4 filter (‘layer 4 filter’) has sparse weights that likely promote the detection of multiple dots, similar to what we found for the first chosen compact model (Extended Data Fig. 8). f. Same analysis of the compact model’s internal activations as in e, except for an input image of a large dot. Here, summing across filter outputs causes overlap between excitatory and inhibitory activity, which cancel each other out. This leads to weak output activity (‘layer 4 output’) that results in no dot being detected. Comparing the results of a–f with the results for the first chosen compact model (Fig. 4), we conclude that both compact models use almost identical computations for dot size selectivity. g. Same analyses as in a–f, except for a different compact model. We found that this third compact model’s computations were largely similar to those of the other two dot-detecting compact models. The response-maximizing natural images were reproduced from Adobe Stock.
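The greedy cumulative-ablation procedure in d can be sketched as follows, where `dsi_fn` is an assumed callable returning the dot size invariance after ablating a given set of filters together:

```python
def greedy_cumulative_ablation(dsi_fn, n_filters):
    """Cumulative ablation: at each step, additionally ablate the filter
    whose removal least increases the DSI, until all filters are ablated.

    dsi_fn(ablated_set) -> DSI after ablating every filter in the set.
    Returns the ablation order and the DSI after each step.
    """
    ablated, order, dsi_trace = set(), [], []
    while len(ablated) < n_filters:
        candidates = [f for f in range(n_filters) if f not in ablated]
        # next filter = the one whose additional ablation keeps DSI smallest
        best = min(candidates, key=lambda f: dsi_fn(ablated | {f}))
        ablated.add(best)
        order.append(best)
        dsi_trace.append(dsi_fn(ablated))
    return order, dsi_trace
```

The filters ablated last under this ordering are those that contribute most to dot size selectivity, so a sharp late rise in the DSI trace identifies the small group of strongly contributing filters.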

Source data

Supplementary information

Supplementary Information

Supplementary Figs. 1–4 and Supplementary Tables 1 and 2.

Reporting Summary

Source data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Cowley, B.R., Stan, P.L., Pillow, J.W. et al. Compact deep neural network models of the visual cortex. Nature (2026). https://doi.org/10.1038/s41586-026-10150-1
