Convolutional architectures are cortex-aligned de novo

A preprint version of the article is available at bioRxiv.

Abstract

What underlies the emergence of cortex-aligned representations in deep neural network models of vision? Earlier work suggested that shared architectural constraints were a major factor, but the success of widely varied architectures after pretraining raises critical questions about the importance of architectural constraints. Here we show that in wide networks with minimal training, architectural inductive biases have a prominent role. We examined networks with varied architectures but no pretraining and quantified their ability to predict image representations in the visual cortices of monkeys and humans. We found that cortex-aligned representations emerge in convolutional architectures that combine two key manipulations of dimensionality: compression in the spatial domain, through pooling, and expansion in the feature domain by increasing the number of channels. We further show that the inductive biases of convolutional architectures are critical for obtaining performance gains from feature expansion—dimensionality manipulations were relatively ineffective in other architectures and in convolutional models with targeted lesions. Our findings suggest that the architectural constraints of convolutional networks are sufficiently close to the constraints of biological vision to allow many aspects of cortical visual representation to emerge even before synaptic connections have been tuned through experience.
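The two dimensionality manipulations described above can be made concrete in a short sketch. This is an illustrative numpy toy, not the authors' released implementation (see Code availability): an untrained convolution with random, never-learned weights expands the feature dimension (more channels), while non-overlapping max pooling compresses the spatial dimension. All layer sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conv2d(x, n_out, k=3):
    """Untrained (random-weight) valid convolution followed by ReLU.
    x: (C, H, W) -> (n_out, H-k+1, W-k+1). Weights are fixed, never learned."""
    C, H, W = x.shape
    w = rng.standard_normal((n_out, C, k, k)) / np.sqrt(C * k * k)
    out = np.empty((n_out, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = x[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return np.maximum(out, 0.0)

def max_pool(x, s=2):
    """Spatial compression: non-overlapping s x s max pooling."""
    C, H, W = x.shape
    return x[:, :H - H % s, :W - W % s].reshape(C, H // s, s, W // s, s).max(axis=(2, 4))

# Feature expansion (more channels) combined with spatial compression (pooling).
image = rng.standard_normal((3, 16, 16))
h = max_pool(random_conv2d(image, n_out=64))   # 64 channels on a 7x7 grid
h = max_pool(random_conv2d(h, n_out=256))      # 256 channels on a 2x2 grid
print(h.shape)
```

The spatial grid shrinks at each stage while the channel count grows, which is the combination of constraints the paper identifies as critical.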


Fig. 1: Convolutional model architecture and evaluation framework.
Fig. 2: Dimensionality expansion strongly improves the encoding performance of convolutional neural networks.
Fig. 3: Encoding performance remains high after dimensionality reduction.
Fig. 4: Two-dimensional visualization of image representations in a high-dimensional untrained convolutional network.
Fig. 5: Analysis of critical architectural components of the convolutional network.
Fig. 6: Image classification performance for the untrained convolutional network.

Data availability

The NSD is available at https://naturalscenesdataset.org/. The THINGS fMRI dataset is available at https://openneuro.org/datasets/ds004192. The monkey electrophysiology dataset is available as part of the Brain-Score GitHub package at https://github.com/brain-score. The Places dataset is available at http://places2.csail.mit.edu/index.html.

Code availability

The Expansion model, as well as code for all analyses, is available via GitHub at https://github.com/akazemian/untrained_models_of_visual_cortex (ref. 61).

References

  1. Carandini, M. et al. Do we know what the early visual system does? J. Neurosci. 25, 10577–10597 (2005).

  2. Jones, J. P. & Palmer, L. A. The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol. 58, 1187–1211 (1987).

  3. Movshon, J. A., Thompson, I. D. & Tolhurst, D. J. Spatial summation in the receptive fields of simple cells in the cat’s striate cortex. J. Physiol. 283, 53–77 (1978).

  4. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).

  5. Agrawal, P., Stansbury, D., Malik, J. & Gallant, J. L. Pixels to voxels: modeling visual representation in the human brain. Preprint at https://arxiv.org/abs/1407.5104 (2014).

  6. Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).

  7. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

  8. Chen, Z. & Bonner, M. F. Universal dimensions of visual representation. Sci. Adv. 11, eadw7697 (2025).

  9. Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A. & Konkle, T. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nat. Commun. 15, 9383 (2024).

  10. Elmoznino, E. & Bonner, M. F. High-performing neural network models of visual cortex benefit from high latent dimensionality. PLoS Comput. Biol. 20, e1011792 (2024).

  11. Saxe, A., Nelli, S. & Summerfield, C. If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 22, 55–67 (2021).

  12. Serre, T. Deep learning: the good, the bad, and the ugly. Annu. Rev. Vis. Sci. 5, 399–426 (2019).

  13. Storrs, K. R., Kietzmann, T. C., Walther, A., Mehrer, J. & Kriegeskorte, N. Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting. J. Cogn. Neurosci. 10, 1–21 (2021).

  14. Cao, R. & Yamins, D. Explanatory models in neuroscience, part 1: taking mechanistic abstraction seriously. Cogn. Syst. Res. 87, 101244 (2024).

  15. Cao, R. & Yamins, D. Explanatory models in neuroscience, part 2: functional intelligibility and the contravariance principle. Cogn. Syst. Res. 85, 101200 (2024).

  16. Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022).

  17. Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 1, 417–446 (2015).

  18. Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).

  19. Yamins, D. L. K. & DiCarlo, J. J. Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365 (2016).

  20. Bruna, J. & Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1872–1886 (2013).

  21. Pogoncheff, G. et al. Explaining V1 properties with a biologically constrained deep learning architecture. Adv. Neural Inf. Process. Syst. 36, 13908–13930 (2023).

  22. Yue, X., Pourladian, I. S., Tootell, R. B. H. & Ungerleider, L. G. Curvature-processing network in macaque visual cortex. Proc. Natl Acad. Sci. USA 111, e3467–e3475 (2014).

  23. Yue, X., Robert, S. & Ungerleider, L. G. Curvature processing in human visual cortical areas. NeuroImage 222, 117295 (2020).

  24. Majaj, N. J., Hong, H., Solomon, E. A. & DiCarlo, J. J. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).

  25. Allen, E. J. et al. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nat. Neurosci. 25, 116–126 (2022).

  26. Hebart, M. N. et al. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife 12, e82580 (2023).

  27. Schrimpf, M. et al. Brain-Score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://doi.org/10.1101/407007 (2018).

  28. Casper, S. Frivolous units: wider networks are not really that wide. In Proc. 35th AAAI Conference on Artificial Intelligence 6921–6929 (AAAI, 2021).

  29. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B. & Liao, Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput. 14, 503–519 (2017).

  30. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2018).

  31. Oliva, A. & Torralba, A. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001).

  32. Portilla, J. & Simoncelli, E. P. A parametric texture model based on joint statistics of complex wavelet coefficients. Int. J. Comput. Vis. 40, 49–70 (2000).

  33. Cordonnier, J.-B., Loukas, A. & Jaggi, M. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations (2020).

  34. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021).

  35. Ingrosso, A. & Goldt, S. Data-driven emergence of convolutional structure in neural networks. Proc. Natl Acad. Sci. USA 119, e2201854119 (2022).

  36. Jarrett, K., Kavukcuoglu, K., Ranzato, M. & LeCun, Y. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision 2146–2153 (IEEE, 2009).

  37. Saxe, A. M. et al. On random weights and unsupervised feature learning. In Proc. 28th International Conference on Machine Learning 2 (2011).

  38. Cao, Y.-H. & Wu, J. A random CNN sees objects: one inductive bias of CNN and its applications. In AAAI Conference on Artificial Intelligence 194–202 (AAAI, 2022).

  39. Baek, S., Song, M., Jang, J., Kim, G. & Paik, S.-B. Face detection in untrained deep neural networks. Nat. Commun. 12, 7328 (2021).

  40. Geiger, F., Schrimpf, M., Marques, T. & DiCarlo, J. J. Wiring up vision: minimizing supervised synaptic updates needed to produce a primate ventral stream. In International Conference on Learning Representations (2022).

  41. Cadena, S. A. et al. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Comput. Biol. 15, e1006897 (2019).

  42. Shi, J. et al. Comparison against task driven artificial neural networks reveals functional properties in mouse visual cortex. Adv. Neural Inf. Process. Syst. 32, 5674–5774 (2019).

  43. Chang, H. & Futagami, K. Reinforcement learning with convolutional reservoir computing. Appl. Intell. 50, 2400–2410 (2020).

  44. Jaeger, H. The ‘Echo State’ Approach to Analysing and Training Recurrent Neural Networks—With an Erratum Note GMD Technical Report 148 (German National Research Center for Information Technology, 2001).

  45. Mei, S. & Montanari, A. The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2022).

  46. Mei, S., Misiakiewicz, T. & Montanari, A. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Appl. Comput. Harmon. Anal. 59, 3–84 (2022).

  47. Rahimi, A. & Recht, B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 20, 1177–1184 (2007).

  48. Tong, Z. & Tanaka, G. Reservoir computing with untrained convolutional neural networks for image recognition. In Proc. 24th International Conference on Pattern Recognition 1289–1294 (IEEE, 2018).

  49. Teney, D., Nicolicioiu, A. M., Hartmann, V. & Abbasnejad, E. Neural redshift: random networks are not random functions. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 4786–4796 (IEEE, 2024).

  50. Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. Adv. Neural Inf. Process. Syst. 31, 8580–8589 (2018).

  51. Bonner, M. F. & Epstein, R. A. Object representations in the human brain reflect the co-occurrence statistics of vision and language. Nat. Commun. 12, 4081 (2021).

  52. Doerig, A. et al. High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 7, 1220–1234 (2025).

  53. Wang, A. Y., Kay, K., Naselaris, T., Tarr, M. J. & Wehbe, L. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nat. Mach. Intell. 5, 1415–1426 (2023).

  54. Babadi, B. & Sompolinsky, H. Sparseness and expansion in sensory representations. Neuron 83, 1213–1226 (2014).

  55. Cayco-Gajic, N. A. & Silver, R. A. Re-evaluating circuit mechanisms underlying pattern separation. Neuron 101, 584–602 (2019).

  56. Sakai, J. How synaptic pruning shapes neural wiring during development and, possibly, in disease. Proc. Natl Acad. Sci. USA 117, 16096–16099 (2020).

  57. PyTorch image models. timmdocs https://timm.fast.ai/ (2022).

  58. Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Proc. Computer Vision—ECCV 2014 (eds Fleet, D. et al.) 740–755 (Springer, 2014).

  59. Kay, K. N., Rokem, A., Winawer, J., Dougherty, R. F. & Wandell, B. A. GLMdenoise: a fast, automated technique for denoising task-based fMRI data. Front. Neurosci. 7, 247 (2013).

  60. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).

  61. Kazemian, A. akazemian/untrained_models_of_visual_cortex: initial release. Zenodo https://doi.org/10.5281/ZENODO.16920087 (2025).

Acknowledgements

This work was supported in part by a JHU Catalyst award to M.F.B.

Author information

Authors and Affiliations

Authors

Contributions

A.K. and M.F.B. wrote the paper. E.E. provided feedback on the analyses and writing. A.K. performed the research and analysed the data. M.F.B. supervised the research. A.K., E.E. and M.F.B. conceived of the research.

Corresponding authors

Correspondence to Atlas Kazemian or Michael F. Bonner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks David Klindt and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Dimensionality expansion strongly improves the encoding performance of convolutional neural networks in another large-scale fMRI dataset.

a) The effects of dimensionality expansion were examined for a convolutional neural network, a deep fully connected network, and a vision transformer. b) These plots illustrate the encoding performance for all three architectures as a function of dimensionality expansion. There was no pre-training for these networks. Encoding performance was evaluated for regions V1 to V4 from the large-scale THINGS fMRI dataset. The x-axis plots the number of random features in the output layer, and the y-axis shows the encoding score for predicting image-evoked cortical responses. The gray dashed line indicates the performance of the best performing convolutional layer of pre-trained AlexNet. The convolutional architecture without pre-training attained large performance gains as a function of dimensionality expansion. In contrast, the other two architectures showed much less improvement as the number of output features was expanded. The encoding performance plots show the mean performance across voxels from all participants, and the error bars denote 97.5% confidence intervals from 1,000 bootstrap samples. c) A summary statistic for the effect of dimensionality expansion was calculated as the difference in performance between the highest- and lowest-dimensional version of each architecture, using the mean performance across voxels from all participants. The performance of the convolutional model was strongly modulated by the dimensionality of its random feature space. In contrast, the effect of dimensionality expansions was much weaker in the other two architectures.
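The encoding-score analysis above rests on a cross-validated linear reweighting of model features to predict voxel responses. A minimal sketch of such a pipeline, using closed-form ridge regression on toy data (this is illustrative only; the regularization strength, data sizes, and scoring choice below are assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Toy data: model features X for a set of images, simulated voxel responses y.
n_train, n_test, d = 200, 50, 30
X = rng.standard_normal((n_train + n_test, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n_train + n_test)

# Fit the linear reweighting on training images, score on held-out images.
w = ridge_fit(X[:n_train], y[:n_train], alpha=10.0)
pred = X[n_train:] @ w
# Encoding score: correlation between predicted and held-out responses.
score = np.corrcoef(pred, y[n_train:])[0, 1]
print(round(score, 2))
```

In the wide untrained networks studied here, the feature matrix X would contain many random features per image, and the reweighting step is what lets expansion translate into better predictions.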

Extended Data Fig. 2 The benefits of dimensionality expansion depend on linear reweighting.

The effects of dimensionality expansion were examined for a convolutional neural network, a deep fully connected network, and a vision transformer. These plots illustrate representational similarity analysis (RSA) scores for all three architectures as a function of dimensionality expansion. There was no pre-training for these networks. a, RSA scores were evaluated for regions V4 and IT in the monkey electrophysiology data. The x-axis plots the number of random features in the output layer, and the y-axis shows the RSA score. The gray dashed line indicates the performance of the best performing convolutional layer of pre-trained AlexNet. As expected, the RSA scores do not systematically increase as a function of dimensionality expansion. This contrasts with the effect of dimensionality expansion observed for the encoding scores in Fig. 2, and it demonstrates that a linear reweighting procedure is needed to yield performance gains from dimensionality expansion in untrained networks. These plots show the mean RSA correlations across all participants. b, These plots show the same analyses as in panel a but for regions along the ventral stream in the human fMRI data.
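Unlike the encoding analysis, RSA compares representational geometries directly, with no fitted reweighting. A schematic numpy version (illustrative, not the authors' code): build a representational dissimilarity matrix (RDM) for each system, then correlate their upper triangles with a rank (Spearman-style) correlation.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson r between the
    response patterns for each pair of stimuli. responses: (n_stim, n_units)."""
    return 1.0 - np.corrcoef(responses)

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (no tie handling,
    which is adequate for continuous dissimilarities)."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def rsa_score(model_resp, brain_resp):
    """Correlate the off-diagonal entries of the two RDMs.
    Note: no linear reweighting of model features is involved."""
    n = model_resp.shape[0]
    iu = np.triu_indices(n, k=1)
    return spearman(rdm(model_resp)[iu], rdm(brain_resp)[iu])

rng = np.random.default_rng(2)
brain = rng.standard_normal((20, 100))
model = brain @ rng.standard_normal((100, 100)) * 0.1 \
        + 0.1 * rng.standard_normal((20, 100))
print(round(rsa_score(model, brain), 2))
```

Because no parameters are fit, simply adding random features cannot raise this score the way it raises the encoding score, which is the contrast this figure demonstrates.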

Extended Data Fig. 3 Comparisons with other trained deep neural network architectures.

a) These plots show the encoding performance of state-of-the-art trained architectures for comparison with the untrained networks and trained AlexNet. For each untrained architecture, these plots show performance at both the smallest and largest levels of dimensionality. In addition to AlexNet, these plots include more modern architectures, specifically ViT and BarlowTwins, pre-trained on ImageNet. Encoding performance was evaluated for regions V4 and IT in the monkey electrophysiology data. The encoding performance plots in this panel and in panel b show the mean performance across units/voxels from all participants, and the error bars denote 97.5% confidence intervals from 1,000 bootstrap samples. The blue dashed line shows the mean noise ceiling across units/voxels. b) These plots show the same analyses as in panel a but for regions along the ventral stream in the human fMRI data.

Extended Data Fig. 4 Variance partitioning shows that the Expansion model explains the same variance as trained AlexNet.

Variance partitioning was used to determine whether the Expansion model explains the same variance in cortical responses as trained AlexNet. These analyses were performed using the largest convolutional model, and they show that in both the monkey electrophysiology data (a) and human fMRI data (b) the explained variance of the Expansion model is fully shared with AlexNet. As expected, AlexNet explains additional unique variance that is not shared with the Expansion model.

Extended Data Fig. 5 Improvements in encoding performance are linked to increases in latent dimensionality.

a) These plots illustrate the encoding performance of the networks from Fig. 2 as a function of their latent dimensionality. Encoding performance was evaluated for regions V4 and IT in the monkey electrophysiology data. Plotting conventions are the same as in Fig. 2, except that here the x-axis plots the number of principal components that account for 85% of variance in the output layer. The results show that improvements in encoding performance are closely linked to increases in the latent dimensionality of a network’s natural image representations. However, note that measures of latent dimensionality are architecture-dependent, and thus different architectures with the same level of latent dimensionality can nonetheless differ in encoding performance. b) These plots show the same analyses as in panel a but for regions along the ventral stream in the human fMRI data.
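The latent-dimensionality measure used on the x-axis can be sketched as follows (an illustrative numpy version; the toy matrices are assumptions): count the principal components needed to account for 85% of the variance in a network's activation matrix.

```python
import numpy as np

def latent_dim(features, threshold=0.85):
    """Number of principal components that account for `threshold` of the
    variance in a (n_samples, n_features) activation matrix."""
    centered = features - features.mean(axis=0)
    var = np.linalg.svd(centered, compute_uv=False) ** 2  # PC variances
    frac = np.cumsum(var) / var.sum()
    return int(np.searchsorted(frac, threshold) + 1)

rng = np.random.default_rng(3)
# Low-dimensional latent structure embedded in a wide feature space,
# versus unstructured features of the same ambient width.
low = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 300))
wide = rng.standard_normal((500, 300))
print(latent_dim(low), latent_dim(wide))  # low-rank data needs far fewer PCs
```

This makes concrete the caption's caveat: two matrices with the same ambient width (300 features here) can have very different latent dimensionality.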

Extended Data Fig. 6 Analysis of architectural components in the final layer of the convolutional network.

a-b) These plots show the performance of the largest untrained convolutional network after altering key architectural components in the final layer only. Panel a shows results for macaque IT, and panel b shows results for the human high-level ventral stream. These plots show that ablating the nonlinearity in the final layer (No ReLU) has little effect, which means that the crucial nonlinear operations are those that occur in earlier layers. On the other hand, ablating the convolution and pooling operations (Linear Projection) reduces performance, demonstrating the importance of these operations in the final layer. They also show that removing the spatial locality of the convolutional filters in the final layer (No Spatial Continuity) results in an overall decrease in encoding performance, demonstrating that even in the highest layer of the network it is beneficial to compute spatially local representations. Plotting conventions are the same as in Fig. 2.

Extended Data Fig. 7 Effect of pre-defined wavelets in the first layer of the untrained convolutional neural network.

These plots show the effect of using pre-defined wavelets in the first layer of the untrained convolutional network. Encoding performance is shown for macaque IT (a) and the human high-level ventral stream (b). For comparison, we examined a model with 3,000 randomly initialized filters in layer 1 (the number of random filters was maximized within computational memory limits). As in Fig. 2, these plots show how encoding performance changes as a function of dimensionality expansion in the final convolutional layer. The x-axis plots the number of random features in the output layer, and the y-axis shows the encoding score for predicting image-evoked cortical responses. The gray dashed line indicates the performance of the best-performing convolutional layer of pre-trained AlexNet. There is a small but consistent drop in performance for the fully random model across both datasets, demonstrating that, overall, the network benefits from the implementation of pre-defined wavelets in its first layer.

Extended Data Fig. 8 Different activation functions yield similar encoding performance for the untrained convolutional neural network.

The effects of using different nonlinear activation functions in our untrained convolutional network were explored for monkey IT (a) and the human high-level ventral stream (b). These plots illustrate encoding performance for models with different nonlinear activation functions with otherwise identical architectures, all containing 10^5 features in their output layer. For comparison, this plot also includes a network without any nonlinear activation functions (the Linear model). The y-axis shows the encoding score for predicting image-evoked cortical responses. The results demonstrate that the inclusion of nonlinearities is critical, but various types of nonlinearities yield similar levels of performance. ReLU = rectified linear unit, GELU = Gaussian error linear unit, ELU = exponential linear unit.

Extended Data Fig. 9 Different random initialization methods yield similar encoding performance for the untrained convolutional neural network.

The effects of initializing the random features of our untrained convolutional network using different methods were explored for monkey IT (a) and human ventral visual stream (b). These plots illustrate encoding performance for identical architectures with different random initialization types. As in Fig. 2, the x-axis plots the number of random features in the output layer, and the y-axis shows the encoding score for predicting image-evoked cortical responses. There is variation in encoding performance for different initialization methods in models with low dimensionality (on the left side of the x-axis). However, at higher levels of dimensionality, these performance differences diminish. This indicates that the type of initialization has minimal impact on encoding performance in the presence of model expansion.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kazemian, A., Elmoznino, E. & Bonner, M.F. Convolutional architectures are cortex-aligned de novo. Nat Mach Intell 7, 1834–1844 (2025). https://doi.org/10.1038/s42256-025-01142-3

