Abstract
In natural visually guided behavior, observers must separate relevant information from a barrage of irrelevant information. Many studies have investigated the neural underpinnings of this ability using artificial stimuli presented on blank backgrounds. Natural images, however, contain task-irrelevant background elements that might interfere with the perception of object features. Recent studies suggest that visual feature estimation can be modeled through the linear decoding of task-relevant information from visual cortex. So, if the representations of task-relevant and irrelevant features are not orthogonal in the neural population, then variation in the task-irrelevant features would impair task performance. We tested this hypothesis using human psychophysics and monkey neurophysiology combined with parametrically variable naturalistic stimuli. We demonstrate that (1) the neural representation of one feature (the position of an object) in visual area V4 is orthogonal to those of several background features, (2) the ability of human observers to precisely judge object position was largely unaffected by those background features, and (3) many features of the object and the background (and of objects from a separate stimulus set) are orthogonally represented in V4 neural population responses. Our observations are consistent with the hypothesis that orthogonal neural representations can support stable perception of object features despite the richness of natural visual scenes.
Introduction
A major function of the visual system is to infer properties of currently relevant stimuli without interference from the tremendous amount of task-irrelevant information that bombards our retinas. Many laboratory studies of the neural basis of this ability use, for good reasons, relatively simple stimuli1,2,3,4,5,6,7,8,9. An advantage of this approach is experimental control: one can parametrically vary stimuli and completely specify the input to the visual system. However, a downside of using such stimuli is that their simplicity prevents them from fully illuminating the neural algorithms by which the brain sorts through the large quantity of visual information characteristic of natural viewing (see simulations in Ruff et al.10).
In contrast to simple artificial stimuli, natural images can vary in many features, and these features are jointly encoded by the responses of populations of neurons in visual cortex11,12,13,14,15,16. It is well known that neurons in visual cortex are tuned for various features4,17, so a single neuron's response may not allow unique identification of multiple image feature values. In general, tuning for simple features is thought to be independent, meaning that, for example, a neuron's tuning for orientation does not predict its tuning for spatial frequency18,19,20. In this case, a population composed of neurons that individually confound multiple stimulus features will still carry sufficient information as a group to allow robust estimation of any one feature10. Whether neural populations encode different features of natural images, which themselves contain many more statistical dependencies than typical artificial stimuli, in a similarly independent way remains unknown.
Knowing how neural population responses to different image features covary for natural images is important because this has profound implications for how those features can guide behavior. A given feature can guide behavior in a way that is uncorrupted by other features only if its representation is independent from those other features. We can test for independence by viewing feature representations in a high-dimensional space in which each dimension represents the firing rate of one neuron in a population21,22. Because tuning curves are smooth and continuous, systematically varying one scene feature, such as the position of a banana, traces out a continuous trajectory in the population space, which can typically be approximated by a line23,24. If just one feature varies, then the value of that feature can be read out by projecting the population response onto this line (i.e., linear decoding).
If tuning for two features is independent across neurons, then we expect that systematically varying the second feature would move the population response in an orthogonal direction10. Such orthogonality would mean that the two features can each be linearly decoded without interference from the other. However, if tuning for the two features were highly correlated (e.g., all neurons that prefer vertical orientations also prefer high spatial frequencies), varying the two features would trace out similar lines, and it would be difficult, if not impossible, to read out the two features independently. Intermediate cases are also possible, in which the representations for different features trace out somewhat separate trajectories that are neither close to co-linear nor orthogonal.
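To make this geometric picture concrete, the following sketch simulates a small population whose neurons have independent linear tuning to two features. The population size, tuning model, and noise level are invented for illustration and are not fit to the recorded data; the point is only that independent tuning yields nearly orthogonal encoding axes, so projecting responses onto one axis reads out that feature with little interference from the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 100

# Illustrative assumption: each neuron responds linearly to two features,
# with tuning slopes drawn independently across neurons.
slopes_pos = rng.normal(size=n_neurons)  # tuning to the task-relevant feature (object position)
slopes_bg = rng.normal(size=n_neurons)   # tuning to a task-irrelevant feature (background rotation)

def population_response(pos, bg, noise=0.5):
    """Noisy population response to a stimulus with the given feature values."""
    return slopes_pos * pos + slopes_bg * bg + noise * rng.normal(size=n_neurons)

# Direction in population space swept out by varying each feature on its own.
axis_pos = slopes_pos / np.linalg.norm(slopes_pos)
axis_bg = slopes_bg / np.linalg.norm(slopes_bg)

# With independent tuning, the two encoding axes are close to orthogonal.
angle = np.degrees(np.arccos(np.clip(axis_pos @ axis_bg, -1.0, 1.0)))
print(f"angle between feature axes: {angle:.1f} deg")

# Projecting responses onto the position axis reads out position with little
# interference from the background feature.
for bg in (-1.0, 0.0, 1.0):
    decoded = [population_response(pos, bg) @ axis_pos for pos in np.linspace(-1, 1, 5)]
    print(f"background = {bg:+.1f}, decoded positions: {np.round(decoded, 2)}")
```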
If a population of neurons contains the stimulus representation that mediates behavior, and the representations of multiple features are not orthogonal or close to it in this population, then we expect behavioral measurements requiring judgments about these features to be error prone. Specifically, changing a task-irrelevant feature would perturb estimates of a task-relevant feature, to the detriment of visual performance, and lead to misjudgments of the task-relevant feature. Therefore, and following Hong et al.25, we reasoned that variation in task-irrelevant features of a natural scene should not impair performance on a visual task involving judgments about a task-relevant feature if two conditions are met: visual information is read out of a neural population in a way that approximates a linear decoder, and the representations of relevant and irrelevant features are orthogonal in the relevant neural populations.
Here, we studied how the representation of the position of a foreground target object depends on variations in other scene features. The first experiment examined the effect of the position of background objects in the scene, using both neural population recordings in monkeys and human psychophysics. The second used neural recordings in monkeys and considered a broader range of scene variations. Because we are ultimately interested in how neural population responses support behavior in the natural environment, we employed naturalistic stimuli in these first two experiments. In a third experiment, we measured the link between neural responses and behavior using somewhat complex but not fully naturalistic stimuli that parametrically varied across many feature dimensions.
We leveraged the power of computer graphics to take parametric control of stimulus features in naturalistic stimuli, enabling us to vary many naturalistic stimulus dimensions and to test the hypothesis that observers' ability to make fine perceptual distinctions in a threshold-level judgment task will not be perturbed by task-irrelevant variations in stimuli and backgrounds if the neural representations of task-relevant and irrelevant features are orthogonal. Using a combination of human psychophysics and monkey neurophysiology, we demonstrate that (1) the population representation of a target object's position in V4 is orthogonal to those of several background features, (2) the ability of human subjects to make precise perceptual judgments about object position was largely unaffected by task-irrelevant variation in those background features, and (3) many features of the object and the background (position, color, luminance, rotation, and depth) in these naturalistic images, and also of artificial stimuli in which many features are parametrically varied, are independently decodable from V4 population responses. We also examined the monkeys' behavior when estimating features of the artificial stimuli. Together, these observations support the idea that orthogonal neuronal representations enable stable perception of objects and features despite the tremendous irrelevant variation inherent in natural scenes.
Results
Central hypothesis: orthogonal representations enable observers to ignore irrelevant visual information
We tested the hypothesis that task-irrelevant information will not affect a perceptual judgment if the representations of the task-relevant and irrelevant features are orthogonal25. Figure 1a-b depict how we used computer graphics to parametrically vary different scene features, such as the position of a central object, the rotational position of objects in the background, and the depth position of objects in the background. Using this set of stimulus variations, consider the effect of varying the position of the object on a hypothetical neural population response as illustrated in the left panel of Fig. 1c. Each point in the plot represents the noisy population response to one presentation of an image, illustrating how varying object position against a fixed background can trace out a line in the high-dimensional neural population space. For this background, the position of the object could be read out by projecting the population response onto the line shown (labeled 'position axis for one background' in the figure). The middle panel of Fig. 1c shows how varying the background objects could affect this line in an orthogonal manner. In this case, the line tracing out the neural population response to the object at various positions is shifted in a direction orthogonal to the position axis shown in the left panel. Although a different line is swept out by varying object position against this second background, projecting onto the line for the first background continues to decode the object position accurately. If, on the other hand, changing the background causes a change in the position axis that is not orthogonal (right panel of Fig. 1c), projecting onto the line for the first background will not provide an accurate linear position readout. Thus, we test the hypothesis that changing an irrelevant background feature (e.g., the position of background objects) will not impact the perception of the task-relevant feature if the population-response changes caused by the irrelevant background are orthogonal to those caused by the relevant feature (Fig. 1c middle).
Naturalistic stimuli with parameterizable properties
To test these predictions, we created naturalistic stimuli with many parameterizable, interpretable features (Fig. 1a-b). We parametrically varied the position of a central object (banana) and the rotation and depth of background objects (leaves and branches) set against a larger fixed contextual scene (rocks, moss-covered stumps, mountains, and skyline). The context was consistent across images to anchor the representations of object and background features, which means that the position of the central object could be judged relative to the edge of the monitor, the fixation point, the contextual elements, or a combination. We presented these stimuli within the joint receptive fields of recorded V4 neurons (Supp. Fig. 1) while each of the two monkeys fixated on a central point. In sum, we recorded V4 responses in 26 experimental sessions across two animals (85–94 visually responsive multi-units per session). Most of the units in our measured population were sensitive both to variations in the position of the central object and to variations in background rotation and depth (Supp. Fig. 1d).
Stimulus design and hypotheses about how neural representations enable generalizable decoding. (a) We generated photorealistic images in which we permuted the features of the central object (the banana) and background objects (sticks and leaves) using a Blender-based image generation pipeline that gave us control over central- and background-object properties (their position, size, pose, color, depth, luminance, etc.). The distant background (here referred to as the "context") is a static cue made of rock and grassy textures that were not varied. The monkey was rewarded for fixating a central point. Because the receptive fields of the recorded visual neurons were at several degrees of eccentricity (Supp. Fig. 1), the stimuli were placed within those receptive fields rather than centered over the fixation point. (b) Example images showing variations in three parameters: central-object position in the horizontal direction, a rotation of the background objects (leaves and branches), and the depth of the background objects. Five values of each of the three parameters were chosen for each monkey based on receptive field properties (see below), yielding an image set of 5 × 5 × 5 = 125 images. The context was held constant across all images. (c) Hypothesized implications of the neural formatting of visual information for the ability to decode a visual feature. Consider the responses of a population of neurons in a high-dimensional space in which the response of each neuron is one dimension. The population responses to a series of stimuli that differ only in one parameter (e.g., the position of the central object) change smoothly in this space (left). Responses to a set of stimuli that differ in the same parameter but also have, for example, a difference in the background will trace out a different path in this space (e.g., the red points in the center and right panels). Relative to the first (blue) path, changing the same parameter on a different background could change the population response in a parallel way; more specifically, changing the background could move the population along a dimension orthogonal to the dimension encoding the parameter of interest (center). This scenario would enable linear decoding of the parameter of interest invariant to background changes. Alternatively, the direction that encodes the parameter of interest could depend on the background (right). Under the linear readout hypothesis, in this case, varying the background would impair the ability of a population of neurons to support psychophysical estimation of the parameter of interest.
V4 neurons robustly encode stimulus position for each stimulus background
We first measured the extent to which V4 neurons encode object position by linearly decoding that position for each unique background stimulus. This decoding ability is a prerequisite for meaningful tests of the orthogonality of the population representation when other image features are varied. Figure 2a shows that, for each unique background configuration (rotation and depth), V4 neurons from a single session support good linear decoding of the position of the object (each of the 25 panels in Fig. 2a is for a specific background configuration; each gray point in the panels shows the decoded object position for a single presentation; the mean and standard deviation for each position, shown as open circles and error bars, summarize our ability to predict the position of the object from the activity of V4 neurons). The numbers at the upper left of each panel provide the correlation between the predicted and actual object stimulus position (mean performance = 0.698).
Object position decoding from V4 population responses is consistent across background variations. (a) We can linearly decode object position for each background stimulus (for the example session shown here). Each panel represents a unique configuration of background rotation and depth, with rows representing variations in rotation and columns representing variations in depth. Each gray point shows the decoded position for a single image presentation in this session. These points depict the actual object position (x-axis, in visual degrees relative to the center of the image) and the decoded position (y-axis) using a separate, cross-validated linear decoder for each unique background. The open circles represent the trial-averaged predicted position (vertical error bars indicate the standard deviation). The number in the top-left is the correlation between the actual and decoded positions, and the yellow dashed line is a linear fit (relative to the null hypothesis that the correlation is 0, i.e., a 'constant model', the p-values for rejecting this hypothesis range from 2.5 × 10⁻¹⁵ to 1.4 × 10⁻⁶ for this session). The dark gray to yellow gradient of the open circles is a redundant cue in the figure that also conveys stimulus position variation. The gray dashed line represents the identity. (b) Position decoding is largely consistent across background variations. This plot is in the same format as those in A. Here, a condition-general decoder that ignores variations in the background and incorporates all stimulus presentations is used (r = 0.73; vs. constant model, p < 0.001). We also computed the general decoder using the minimum number of presentations across all 25 specific decoders in A (60 trials) for 100 folds and found similar results (r = 0.72; vs. constant model, p < 0.001). Compare with Supp. Fig. 2a and b for background rotation and depth decoding. (c) Distribution of specific decoder accuracies (correlation) across all sessions for each monkey (each session contributed 25 values to the histogram). Blue and red arrows represent the median accuracy (0.662 for monkey 1, 0.703 for monkey 2). The box plots above the histograms summarize general decoder accuracy across sessions for each monkey. The central line in each box plot indicates the median (0.735 for monkey 1, 0.724 for monkey 2), box edges indicate the 25th and 75th percentiles, whiskers indicate minimum and maximum values, and + symbols indicate outliers. We found similar results for the trial-count-matched general decoders (median 0.732 for monkey 1, 0.722 for monkey 2). Compare with Supp. Fig. 2c and d for background rotation and depth decoding. (d) Error in decoding object position (across background variations) for each trial compared with the error in decoding background rotation. The correlation between the two types of errors is very small (r = −0.083), although it is statistically significant (p = 0.001). See Supp. Fig. 2e for comparison with error in trial-wise background depth decoding.
V4 representations of stimulus position and background features are approximately orthogonal
Because the first condition, that we could use a linear decoder to estimate the position of the object in each background (Fig. 2a), was met, we proceeded to test the second condition of our orthogonality hypothesis: that the neural representation of the position of the central object in V4 is approximately orthogonal to the representations of features of the background objects (depth and rotation of the leaves and branches) in our stimuli. We tested this in five ways.
First, we asked how well a single, condition-general linear decoder could read out object position across the various background conditions. This worked well. Indeed, across sessions, we could decode object position (Fig. 2b), background rotation (Supp. Fig. 2a), and background depth (Supp. Fig. 2b) accurately across variation in the values of the other two features. We trained these decoders either with all presentations in a leave-one-out fashion or with presentation counts matched to the condition-specific decoders, with qualitatively similar results (see Methods for the details of this comparison). The ability of a single decoder to read out banana position across all background variations is similar to that of the decoders that were optimized for each unique background stimulus, and it is high across all sessions for both monkeys (Fig. 2c). This result suggests that the population of V4 neurons we measured encodes all of the tested features well and nearly orthogonally.
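The background-specific and condition-general decoding analyses summarized in Fig. 2a-c can be sketched as follows. This is a schematic reimplementation in Python with synthetic placeholder data; the array shapes, variable names, and use of ordinary least-squares regression are illustrative assumptions rather than a description of the actual analysis code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data standing in for one session: responses is (n_trials, n_units);
# object_pos, bg_rotation, bg_depth give the stimulus parameters on each trial.
rng = np.random.default_rng(1)
n_trials, n_units = 1000, 90
object_pos = rng.choice(np.linspace(-1, 1, 5), size=n_trials)
bg_rotation = rng.choice(np.arange(5), size=n_trials)
bg_depth = rng.choice(np.arange(5), size=n_trials)
responses = (np.outer(object_pos, rng.normal(size=n_units)) +
             np.outer(bg_rotation, rng.normal(size=n_units)) +
             np.outer(bg_depth, rng.normal(size=n_units)) +
             rng.normal(size=(n_trials, n_units)))

# Background-specific decoders: one cross-validated linear fit per unique
# background configuration (cf. the 25 panels of Fig. 2a).
specific_r = np.zeros((5, 5))
for rot in range(5):
    for dep in range(5):
        idx = (bg_rotation == rot) & (bg_depth == dep)
        pred = cross_val_predict(LinearRegression(), responses[idx], object_pos[idx], cv=5)
        specific_r[rot, dep] = np.corrcoef(pred, object_pos[idx])[0, 1]

# Condition-general decoder: a single fit that ignores the background labels
# and uses every presentation (cf. Fig. 2b).
pred_general = cross_val_predict(LinearRegression(), responses, object_pos, cv=10)
general_r = np.corrcoef(pred_general, object_pos)[0, 1]
print(f"median specific r = {np.median(specific_r):.2f}, general r = {general_r:.2f}")
```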
Second, we found that, on a trial-by-trial basis, errors in the decoded estimates of object position are not correlated with errors in decoding of background rotation and depth (Fig. 2d and Supp. Fig. 2e). This lack of correlation also suggests that V4 representations of object position are independent of the representations of background rotation and depth.
Third, to probe the orthogonality of the representations in more detail, we also tested the performance of condition-specific decoders using responses obtained in other background conditions (cross-condition decoders; Supp. Fig. 3). In the case of perfect orthogonality, these decoders would perform as well on data from other background conditions as on held-out data from the condition they were trained on. We found that decoding accuracy did decrease slightly as the background feature value deviated further from the one the decoder was trained on. This is a strong test of orthogonality, however, and the largest effect, at the furthest difference in background feature value, is a modest decrease of ~ 0.15 in decoding accuracy. This suggests that the object position representations are largely, but not perfectly, orthogonal to the effect of background feature changes (see Discussion).
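A sketch of the cross-condition test, continuing the illustrative conventions above: a position decoder is trained on trials from one background feature value and evaluated on trials from every other value, and performance is summarized by the distance between training and test background values. The function below is a schematic, not the exact analysis code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cross_condition_accuracy(responses, object_pos, bg_value):
    """Train a position decoder on one background value, test on all others.

    Returns a dict mapping the |test - train| background difference to the mean
    correlation between decoded and actual object position.
    """
    levels = np.unique(bg_value)
    by_distance = {}
    for train_level in levels:
        train = bg_value == train_level
        model = LinearRegression().fit(responses[train], object_pos[train])
        for test_level in levels:
            if test_level == train_level:
                continue  # within-condition performance is assessed by cross-validation instead
            test = bg_value == test_level
            r = np.corrcoef(model.predict(responses[test]), object_pos[test])[0, 1]
            by_distance.setdefault(abs(test_level - train_level), []).append(r)
    return {d: float(np.mean(rs)) for d, rs in sorted(by_distance.items())}
```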
Fourth, we compared the weights assigned to each neuron when constructing each linear decoder. We found no significant correlations between the weights assigned while decoding any of the features (position, rotation, or depth; Supp. Fig. 4a). There was also no detectable relationship between each neuron’s sensitivity to a feature and its decoding weight (Supp. Fig. 4d).
Finally, if the representations of stimulus position and background parameters are orthogonal, then the decoders optimized for each unique stimulus (Fig. 2a) should be mutually aligned in neural population space. Put another way, if the representation of stimulus position is robust to variation in the background, this representation should vary along the same direction across backgrounds. To probe this, we calculated the line in neural population space that best explains population responses to each stimulus position for each background. These lines are depicted in Fig. 3a for each unique background condition (plotted in the first two principal components of the population neural responses for visualization purposes only; each dim point is a trial, each bright point is the average population response to a particular object position, and gray-to-yellow point colors represent the five object position values). For each unique background, the angle in the full population space between its decoder and the decoder for the reference background configuration (shown in the center panel of Fig. 3a and marked with ∗; chosen to define the origin of the angular measure) is indicated in degrees. The distribution of these angles for this example session (Fig. 3b) and across all sessions in both animals (Fig. 3c) is skewed toward much smaller angles than expected by chance: the gray distributions in Fig. 3c have medians of ~ 90°, both for decoders trained on shuffled (randomizing trial labels within background configuration) responses to each unique background condition (labeled "shuffle"; dark gray) and for angles between random vectors in a response space with the same dimensionality as the neural decoding space (labeled "random"; light gray).
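The decoder-angle analysis of Fig. 3 can be sketched as follows. The commented-out lines indicate how per-background decoder axes would be compared to a reference axis; `responses_by_bg` and `positions_by_bg` are hypothetical containers, not variables from the paper, and the random-vector control illustrates why chance angles concentrate near 90°.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def decoder_axis(responses, positions):
    """Unit-norm weight vector of a linear object-position decoder."""
    w = LinearRegression().fit(responses, positions).coef_
    return w / np.linalg.norm(w)

def angle_deg(u, v):
    """Angle between two unit vectors, in degrees."""
    return np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))

# For each background configuration, compare its decoder to the decoder of a
# reference configuration (the panel marked with * in Fig. 3a). The dicts below
# are assumed placeholders keyed by background configuration.
# ref_axis = decoder_axis(responses_by_bg[ref_bg], positions_by_bg[ref_bg])
# angles = [angle_deg(decoder_axis(responses_by_bg[bg], positions_by_bg[bg]), ref_axis)
#           for bg in responses_by_bg if bg != ref_bg]

# Chance level: angles between random unit vectors of the same dimensionality
# cluster near 90 degrees, as do decoders fit to label-shuffled data.
rng = np.random.default_rng(2)
def random_angle(n_units=90):
    u, v = rng.normal(size=(2, n_units))
    return angle_deg(u / np.linalg.norm(u), v / np.linalg.norm(v))

print("median random-vector angle:", np.median([random_angle() for _ in range(1000)]))
```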
Together, these recording results support the idea that the neural representation of stimulus position is close to orthogonal to the representations of background rotation and depth in V4.
Object position axes across background variations are aligned with each other. Since object position decoding is tolerant to background variations, we tested whether the linear decoding axes for each background configuration were aligned by visualizing the decoders in the first two principal components of the neural response space. These dimensions were computed for the full set of neural responses obtained in each session. (a) As with Fig. 2a, each panel represents a unique configuration of the background rotation and depth with rows representing variations in rotation and columns representing variations in depth. Each dim point represents a single image presentation, and bright points represent trial-averaged responses. Gray-to-yellow gradient represents monotonic variation in object position. A gradient line was fit to the responses for each background condition, shown here in two dimensions for illustration. The lines shown have been normalized to have the same length in the projected space shown. The text label at the top left represents the relative angle between each decoder and the central decoder (the middle background condition plot, marked with ∗) calculated in the full dimensional space of responses used for decoding. (b) Distribution of angles in A as a histogram. Arrow at the top represents the median angle for this session (7.78°). (c) Distribution of relative decoder angles across all object position decoders (like those in A) across sessions for both monkeys (blue and red distributions). Blue and red arrows represent the median of angles across sessions (16.4° for monkey 1, 12.07° for monkey 2). Dark gray distribution represents the angles of object position decoders after shuffling the position values for each trial (median 88.13°, shown as dark gray arrow). Light gray distribution represents the angles between randomly chosen vectors of the same dimensionality as the neural population space (median 89.99°, shown as light gray arrow).
Human subjects discriminate stimulus position robustly with respect to background variation
Our central hypothesis predicts that when the representations of two stimulus features are orthogonal in the brain, varying one should not impact the ability of subjects to discriminate the other. We tested this hypothesis by measuring the ability of human observers to discriminate the position of the central object in our stimuli amid variation in the background rotation and depth. Using a threshold paradigm, we tested this idea for stimulus step sizes that approach the limits of perception.
Human psychophysics experiments suggest that object position discrimination is unaffected by stimulus background variation. Since object position representation in V4 neurons is robust to background variation, we tested whether changing background properties affects the decoding of object position in humans. (a) Human psychophysical task. Two images containing an object were presented, with two masks in between. Participants were instructed to report whether the relative position of the object in image 2 was left or right of that in image 1. In blocks, background rotation and/or depth were held constant or varied as described below. (b) Averaged (across participants, N = 10) psychometric functions for each background variation condition (individual participant performance shown in Supp. Fig. 5). Three background conditions were tested: black: no background variation between the two presentations of the object on each trial; red: background rotation changed randomly across the two presentations of the object, but background depth was held fixed; blue: both background rotation and depth were randomized across the two presentations of the object. Across participants, object position change detection performance was similar across the three background variation conditions (paired t-tests on thresholds between each pair, p > 0.05, Bonferroni corrected). Bayes Factor analysis for the main effect of background variations suggests strong evidence for the absence of an effect of background variation on choices (1/BF = 34.27; see Methods for details). This value of 1/BF provides support for the absence of a background effect in the human psychophysics. (c) To compare human behavior and monkey electrophysiology, we selected stimulus presentations in the monkey experiments to approximate the three background variation levels used in the human psychophysics experiment (see Methods for details of sample matching). We trained linear discriminants to separate trials into right or left position shift, for each background variation condition. During training, the trials with the object in the central position were randomly assigned to be left or right. Classifier performance also did not differ substantially across the background variation conditions (paired t-tests on thresholds between each pair, p > 0.05, Bonferroni corrected). Similar to (b) above, Bayes Factor analysis suggests strong evidence for the absence of an effect of background (1/BF = 26.25 for monkey 1 and 1/BF = 8.99 for monkey 2). These values of 1/BF provide support for the absence of a background effect in the neural decoding.
Human subjects viewed two images of the object, separated by two different masks (Fig. 4a), and reported whether the object in the second image was positioned to the left or right of the object in the first presentation. The offset between the two object positions was varied systematically to allow us to calculate a discrimination threshold for each background variation condition. Across blocks of trials, we varied the amount of within-trial image-to-image variability in the background objects across the two presentations of the central object. The context (the rocky and grassy textures) was held consistent across images to give the variations in the object and background features a frame of reference. If the representation of object position were not orthogonal to that of the background features, intrusion of background variation into the decoding of position would manifest as an elevated discrimination threshold26,27. However, consistent with the idea that the orthogonal representations of object position and background features that we found in the neural recordings enable background-independent perception, introducing variability into the background did not significantly impact the position discrimination performance of human subjects (Fig. 4b and Supp. Fig. 5). To make a direct comparison between parameter decoding of neural representations and human psychophysical performance, we partitioned neural data into "no variation", "rotation variation only", and "depth and rotation variation" groups and trained linear discriminants to classify the left or right position difference between pairs of presentations (Fig. 4c). As with the human psychophysical performance, the cross-validated discrimination performance of these classifiers did not differ across the three background variation groups. Thresholds for the neural classifiers are higher than those for the human subjects (note the difference in the x-axis scale between Fig. 4b and c), which is expected given that neuronal responses were measured for peripheral stimuli and that we recorded from a small subset of the neurons that could support psychophysical performance.
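A minimal sketch of how a position-discrimination threshold could be estimated from the choice data, assuming a cumulative-Gaussian psychometric function fit with SciPy. The paper's actual fitting procedure and threshold criterion may differ; here the slope parameter serves as the sensitivity summary.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(offset, mu, sigma, lapse):
    """P(report 'right') as a function of the signed position offset (deg)."""
    return lapse + (1 - 2 * lapse) * norm.cdf(offset, loc=mu, scale=sigma)

def fit_threshold(offsets, chose_right):
    """Fit a cumulative-Gaussian psychometric function and return its slope parameter.

    offsets: signed position difference between interval 2 and interval 1 on each
    trial; chose_right: 0/1 responses. Both are assumed arrays for illustration.
    """
    popt, _ = curve_fit(psychometric, offsets, chose_right,
                        p0=[0.0, 0.2, 0.02],
                        bounds=([-1.0, 1e-3, 0.0], [1.0, 2.0, 0.1]))
    mu, sigma, lapse = popt
    return sigma

# Fitting one threshold per participant and per background-variation condition,
# then comparing conditions with paired t-tests (Bonferroni corrected), mirrors
# the analysis summarized in Fig. 4b.
```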
At least ten object and background features are represented approximately orthogonally in V4
To test the extent to which the orthogonality of representations of different features generalizes to other features of the object and background in our stimuli, we measured V4 responses to a large image set where the color, luminance, position, rotation, and depth of both the background and object each took one of two values (this yields 2¹⁰ = 1,024 unique images; Fig. 5a). If any two parameters are encoded orthogonally in neural population space, then it should be possible to linearly decode those parameters successfully despite the variation in the others. Conversely, a decoder trained on one parameter should not provide information about the others. To test these predictions, we trained linear decoders for each object or background feature and then tested our ability to decode each of the ten features with each decoder.
Orthogonal representations for variations in up to 10 object and background features. We generated a large image set where the color, luminance, position, rotation, and depth of both the background and object each took one of two values, yielding 2¹⁰ = 1,024 images. We collected V4 population responses to these images as in Supp. Fig. 1c. (a) Example images illustrating object and background parameter variation. (b) If two features are encoded orthogonally (independently) in neural population space, then a decoder trained on one feature should not support decoding of the other feature. We trained linear decoders of V4 responses for each object and background feature (x-axis) and tested the ability to decode each of the other features. The diagonal entries provide, for each feature, the correlation between the decoded and actual feature parameter values for a decoder trained on that feature. Correlations were obtained through cross-validation. Decoding performance was above chance (median correlation of 0.61) for all features (p < 0.0001; t-test across folds). The off-diagonal values depict the performance of a decoder trained on one feature (x-axis) for decoding another (y-axis). This cross-decoding performance was not distinguishable from chance (median correlation of 0.008) except in the case of the color of the object and background (r = 0.36).
Each of the ten features was encoded in the V4 population despite the variation in the other features, meaning that the correlation between the actual value of the feature parameter and the value predicted by a cross-validated linear decoder was above chance (diagonals in Fig. 5b). In addition, the correlation between a given parameter and the value predicted by a decoder trained on a different parameter (off-diagonals in Fig. 5b) was indistinguishable from chance except in one case (the colors of the central object and background objects). These observations suggest that a population of V4 neurons can encode a relatively large number of natural scene parameters independently, enabling observers to avoid distraction by task-irrelevant stimulus features. The observed interaction between the color of the central object and the color of the background objects presents an opportunity for future work to test the prediction that task-irrelevant variation in background object color should affect psychophysical discrimination of central object color. This outcome would be consistent with the results of Singh et al.27.
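The cross-decoding analysis of Fig. 5b (and of Supp. Fig. 6b) can be sketched as a function that trains one linear decoder per feature and evaluates every decoder against every feature. Input names and shapes are assumptions for illustration, not the actual analysis code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def cross_decoding_matrix(responses, features):
    """Correlation of decoded vs. actual values for every train/test feature pair.

    responses: (n_trials, n_units) array; features: dict of feature name -> (n_trials,)
    binary parameter values. Entry [i, j] is the correlation between feature j and the
    output of a decoder trained on feature i (cross-validated on the diagonal).
    """
    names = list(features)
    n = len(names)
    matrix = np.zeros((n, n))
    for i, train_name in enumerate(names):
        y_train = features[train_name]
        # Diagonal: cross-validated decoding of the trained feature.
        pred_cv = cross_val_predict(LinearRegression(), responses, y_train, cv=10)
        matrix[i, i] = np.corrcoef(pred_cv, y_train)[0, 1]
        # Off-diagonals: apply the same decoder to every other feature.
        pred = LinearRegression().fit(responses, y_train).predict(responses)
        for j, test_name in enumerate(names):
            if j != i:
                matrix[i, j] = np.corrcoef(pred, features[test_name])[0, 1]
    return names, matrix
```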
Object features represented orthogonal to a task-relevant feature do not affect behavioral estimation
The results of these first two experiments demonstrate that a feature of naturalistic scenes (object position) that is represented orthogonally to the features of its background can be perceptually estimated independently of those background features. To explore the generality of the results of the experiments presented above, we also analyzed published data from our lab to test the ideas in the context of within-object features. Specifically, we analyzed recordings from the same monkeys while they viewed isolated objects on a gray background (a detailed explanation of the stimuli, task, and methods and a different analysis of a subset of these data are reported elsewhere28). We generated 50 three-dimensional shapes that varied in size, color, orientation, curvature, thickness, and several other features (Supp. Fig. 6a). We flashed them while the monkeys fixated on a central dot (as with the images with the banana above). We analyzed the orthogonality of the shape features using cross-validated cross-decoding (Supp. Fig. 6b; compare to Fig. 5). We found that 116 of the 120 feature pairs were encoded orthogonally. The four feature pairs that could be cross-decoded had been varied in the limited stimulus set in a correlated manner by chance (see Methods for details).
To test whether an orthogonally represented feature affects the monkeys' behavioral estimates of shape, we trained the monkeys to estimate the axial curvature of new 3D objects (Supp. Fig. 6c). The curvature was varied continuously, and the behavioral report was also given on a continuous, analog scale. The monkeys were rewarded based on their estimation accuracy. Consistent with our hypothesis, the monkeys' curvature estimation behavior was invariant to task-irrelevant features that were represented orthogonally to curvature, including color, thickness, length, and gloss (Supp. Fig. 6d).
Discussion
Using a combination of multi-neuron electrophysiology in monkeys and human psychophysics, we tested the hypothesis that irrelevant visual features, whether in the object of interest or in the background of a scene, will not interfere with the perception of a target feature when their representations are orthogonal in visual cortex. We demonstrated that (1) in monkey area V4, the representation of object position is orthogonal to the representations of many irrelevant features of that object and the background, (2) the threshold for human observers to judge a change in object position was unaffected by the variations in the background stimulus that were shown neuronally to have orthogonal representations in monkey V4, and (3) the ability for monkey observers to estimate the curvature of an object was unaffected by irrelevant features of that object that were represented orthogonally in V4.
Mechanisms supporting orthogonality
Our study quantifies how neural populations represent multiple naturalistic stimulus variations, but it does not provide direct insight into how the encoding and processing of visual stimuli produce those representations. Under biologically realistic assumptions, simulations show that although it is possible to learn about the independence/orthogonality of feature representations within a population from small population recordings, it is generally not possible to characterize the role of each recorded neuron10.
It is possible that other feature pairs that we did not study, such as purely lateral shifts in background objects relative to the central object, may interact either behaviorally or neurally. In recent years, many studies have demonstrated that neural networks trained to categorize natural images produce representations that strongly resemble neural representations in the ventral visual stream12,15,29,30,31. These models provide an opportunity to understand the conditions under which aspects of natural stimuli are most likely to be represented orthogonally and which aspects might best be targeted in future studies to probe potential failures of orthogonality25,32,33,34,35,36,37. Additionally, our strong test of orthogonality based on cross-condition decoding revealed slight deviations from orthogonality of the object position representation relative to background variations (Supp. Fig. 3). We predict that future studies on the capacity of a neural population for encoding naturalistic visual features (discussed further below) will shed light on the limits of orthogonal representations. We hope our results will produce a productive coupling of computational analysis of the mechanisms by which orthogonal representations emerge with behavioral experiments using the same parametrically varied computer graphics stimuli.
We also note that figure-ground segmentation is relevant to the ability to judge features of an object independent of the background7,38,39. Segregation of figure from ground does not guarantee that the representation of features of the ground does not influence the representation of the figure (or vice-versa). An exciting avenue for future work will be understanding, in a much wider set of natural scenes than we considered, the relationship between figure-ground segmentation, orthogonal representations of the features of a figure and ground, and perception of a figure independent of its background.
Relationship to the notion of untangling and representational geometry
The conditions under which objects can be disambiguated from neural population responses have been studied using the concept of untangling40,41,42,43. Untangling has been primarily discussed in the context of object classification. The hypothesis is that different objects can be appropriately classified (e.g. discriminating images of bananas from images of leaves) when the neural population representations of those objects are linearly separable in the face of irrelevant variations in the images (e.g. changes in position, orientation, size, or background). Support for this hypothesis comes from the observation that as one moves from early to late stages of the primate ventral visual stream, representations of different object categories become more linearly separable25,33,42,44,45. Progress has been made in understanding how the tuning functions and mixed selectivities of neurons support untangled population representations46,47,48.
The untangling framework has been extended to address the structure of neural populations that represent object categories more generally by characterizing the geometry of the high-dimensional representational manifolds34,35,49,50,51,52. The capacity and dynamics of representational geometries in visual cortex correlate with classification behavior42,53,54, in parietal and prefrontal cortices with perceptual decision-making23,55,56,57, and in motor cortex with control of muscle activity58,59,60,61.
Other studies consistent with this line of thinking have also considered features and have demonstrated that neural responses to relevant and distracting features of simple stimuli are linearly separable in the brain areas (or analogous layers of deep network models of vision) that are thought to mediate that aspect of vision35,36,40,44,62. Indeed, a previous study in our lab also found a relationship between our ability to linearly decode visual information from the activity of neural populations in monkeys and the ability of human observers to discriminate the same stimuli32. Our conclusions are also consistent with those reached in a study that considered neuronal representations in V4 and IT and behavioral estimates of the properties of objects presented against natural image backgrounds25. That study found an increase in the orthogonality of representation from V4 to IT and good behavioral estimation of the orthogonally represented properties. Given those results, it seems possible that our stimuli would have revealed increased orthogonality in areas further along the processing hierarchy than the V4 site of our electrode array; such a result would not change the general conclusions we draw about the relation between orthogonality and behavioral performance.
Opportunities from studying parameterizable naturalistic images
The present study extends the measurements of the relationship between the neural untangling of lower-level features and visually guided behavior to features in naturalistic images. For our naturalistic scenes, we did not find that variation in background object positions perturbed the perceptual representation of foreground object position. Interestingly, there are known cases with simple stimuli where the position of some scene elements influences the position (or motion) judgments of other elements. Both the Poggendorff illusion63 and the Flash Grab Effect64 could be taken as examples of such interactions. What needs to be true of the scene for the visual system to generate maximally robust perceptual representations, particularly whether this is related to the statistical structure of natural scenes, remains an interesting and open question.
More generally, our view is that the perception of object features in complex natural images provides increased power for testing the untangling hypothesis in the context of feature decoding. Unlike the case of simpler stimuli, the number of task-irrelevant features available for manipulation is larger and is likely to more fully challenge the coding capacity of neural populations whose representations are of limited dimensionality50. Furthermore, visual distractors (like variation in the background) heavily influence scene categorization performance in artificial stimuli but not natural stimuli, suggesting that orthogonal feature representations in natural stimuli are more resilient to noise34,65. Studying the relationship between neurons and visually guided behavior using parameterizable naturalistic images solves many of the challenges inherent in using simple artificial stimuli on the one hand or natural images on the other1,2,30,66,67,68. The graphics-generated stimuli we employ strike a balance between the experimental control available through parameterization and the ability to measure principles governing neural responses to and perception of features of natural images that are difficult or impossible to glean using artificial stimuli.
Opportunities from cross-species investigations of visual perception
Our results highlight the power of pairing neural population recordings in animals with behavior in humans for understanding the neural basis of visual perception. We showed that the same principle (orthogonally represented features do not interfere perceptually) can be gleaned from this cross-species approach as from simultaneous recordings in behaving monkeys. Although simultaneously recording neurons and measuring behavior has many advantages, comparison with human performance provides some assurance that the neural results obtained in an animal model generalize to humans. In addition, our approach links observations from the more peripheral visual field locations, where, for technical reasons, the neural recordings are most often made, to the central visual field locations that are typically the focus of studies with human subjects.
Since the monkeys were rewarded simply for fixating during the recordings for our first two experiments, our experiments focus on neural population activity that is stimulus-driven rather than reflecting internally driven processes like attention or motivation. In future work, it will be interesting to merge our knowledge of how stimulus-driven and internal processes combine to influence neuronal responses and performance on visual tasks.
Conclusion
Our results provide behavioral and neurophysiological evidence supporting the powerful untangling hypothesis. They extend the study of untangling to representations of features of objects and backgrounds and demonstrate the value of parameterizable naturalistic images for studying the neural basis of visual perception. They also suggest a promising future for investigating the neural basis of perceptual and cognitive phenomena by leveraging the complementary strengths of multiple species.
Methods
Experimental models and subject details
Monkey electrophysiology
Two adult male rhesus monkeys (Macaca mulatta, 10 and 11 kg) were implanted with titanium head posts before behavioral training. Subsequently, multielectrode arrays were implanted in cortical area V4 identified by visualizing the sulci and using stereotactic coordinates. The monkeys were sourced from Alpha Genesis, Inc. All animal procedures were approved by the Institutional Animal Care and Use Committees of the University of Pittsburgh and Carnegie Mellon University, and all training, surgery, and experimentation methods were performed in accordance with the relevant guidelines and regulations. Additionally, this study is reported in accordance with ARRIVE animal use and reporting guidelines.
Human psychophysics
This study was preregistered at ClinicalTrials.gov, NCT number NCT05004649, https://clinicaltrials.gov/ct2/show/NCT05004649. The experimental protocols were approved by the University of Pennsylvania Institutional Review Board, and all recruitment and experimentation methods were performed in accordance with the relevant guidelines and regulations. Participants were invited to volunteer to participate in this study. Participants provided informed consent and filled out a lab participant survey. We also screened for visual acuity using a Snellen eye chart and for color deficiencies using the Ishihara plate test. Participants were excluded prior to the experiment if their best-corrected visual acuity was worse than 20/40 in either eye or if they made any errors on the Ishihara plate test.
Participants were excluded after the conclusion of their first session if their horizontal position discrimination threshold in the no variation condition (see description of conditions below) was higher than 0.6° of visual angle, and participants excluded at this point did not participate in any further experimental sessions.
Experimental design
Image generation (for both human psychophysics and monkey electrophysiology)
All the stimuli were variants of the same natural visual scene: a square image with a central object (a banana) presented on an approximately circular array of overlapping background objects (made up of overlapping branches and leaves). These objects were rendered against a distant and static context (rocky and grassy textures) which served as a consistent cue to estimate the spatial parameters (position, orientation, depth). The central object and/or the background objects changed in horizontal position, rotation, and/or depth across different stimuli. In the larger set of stimuli (detailed below) the luminance and color of the central and background objects also changed. The central object and background objects are presented in the context of other objects (a rock ledge, a skyline, and three moss-covered stumps) that remain unchanged across all stimulus conditions. This natural visual scene was created using Blender, an open-source 3D creation suite (https://www.blender.org, Version 2.81a). The object and background parameters were varied using ISET3d, an open-source software package (https://github.com/ISET/iset3d) that works with a modified version of PBRT (https://github.com/scienstanford/pbrt-v3-spectral; unmodified version at https://github.com/mmp/pbrt-v3).
The images created using ISET3d were converted to RGB images using custom software (Natural Image Thresholds; https://github.com/AmyMNi/NaturalImageThresholds) written using MATLAB (MathWorks; Natick, MA) and based on the software package Virtual World Color Constancy (github.com/BrainardLab/VirtualWorldColorConstancy). Natural Image Thresholds is dependent on routines from the Psychophysics Toolbox (http://psychtoolbox.org), ISET3d (https://github.com/ISET/iset3d), ISETBio (http://github.com/isetbio/isetbio), PBRT (https://github.com/scienstanford/pbrt-v3-spectral; unmodified version at https://github.com/mmp/pbrt-v3), and the Palamedes Toolbox (palamedestoolbox.org).
To convert a hyperspectral image created using ISET3d to an RGB image for presentation on the calibrated monitor, the hyperspectral image data were first used to compute LMS cone excitations. The LMS cone excitations were converted to a metameric rendered image in the RGB color space of the monitor, based on the monitor calibration data. A scale factor was applied to this image so that its maximum RGB value was 1 and the image was then gamma corrected, again using monitor calibration data. This process was completed separately for the two different monitors used, one for the psychophysics and one for the neurophysiology.
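A schematic version of this conversion is sketched below, assuming the cone fundamentals and the spectral power distributions of the monitor primaries are supplied as matrices. The actual pipeline used ISETBio routines and measured gamma tables from the monitor calibration; the simple power-law gamma here is a stand-in for those measurements.

```python
import numpy as np

def hyperspectral_to_rgb(hyperspectral, cone_fundamentals, monitor_spd, gamma=2.2):
    """Convert a hyperspectral image to a gamma-corrected RGB image (illustrative sketch).

    hyperspectral:      (H, W, n_wavelengths) radiance image
    cone_fundamentals:  (3, n_wavelengths) L, M, S cone sensitivities
    monitor_spd:        (n_wavelengths, 3) spectral power of the R, G, B primaries
    """
    # Cone excitations at each pixel: (H, W, 3).
    lms = hyperspectral @ cone_fundamentals.T
    # Matrix mapping linear monitor RGB settings to the LMS they produce, and its inverse.
    rgb_to_lms = cone_fundamentals @ monitor_spd          # (3, 3)
    lms_to_rgb = np.linalg.inv(rgb_to_lms)
    rgb_linear = lms @ lms_to_rgb.T
    # Scale so the maximum value is 1, clip out-of-gamut values, then gamma correct.
    rgb_linear = np.clip(rgb_linear / rgb_linear.max(), 0.0, 1.0)
    return rgb_linear ** (1.0 / gamma)
```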
Monkey electrophysiology
Array implantation, task parameters
Both animals were implanted with titanium headposts before behavioral training. After training, microelectrode arrays were implanted in area V4 (96 recording sites; Blackrock Microsystems). Array placement was guided by stereotactic coordinates and visual inspection of the sulci and gyri. The monkeys were trained to perform a fixation task along with other behavioral tasks that were not relevant to this study. The stimulus images used in this study were not displayed outside of the context of this task. The monkeys fixated a central spot for a pre-stimulus blank period of 150–400 ms followed by stimulus presentations (200–250 ms) interleaved with blank intervals (200–250 ms). The stimuli were presented one at a time at a peripheral location that overlapped the receptive fields of the recorded neurons. In each trial, 6–8 stimuli were presented, after which the monkey received a liquid reward for having maintained fixation on the central spot until the end of the stimulus presentations. If the monkey broke fixation before the end of the stimulus presentations, the trial was terminated. The intertrial interval was at least 500 ms. The stimuli were presented pseudo-randomly.
The visual stimuli were presented on a calibrated (X-Rite calibrator) 24" ViewPixx LCD monitor (1920 × 1080 pixels; 120 Hz refresh rate) placed 54 cm (monkey 1) or 56 cm (monkey 2) from the monkey, using custom software written in MATLAB (Psychophysics Toolbox; Brainard, 1997; Pelli, 1997). Eye position was monitored using an infrared eye tracker (EyeLink 1000; SR Research). Eye position (1,000 samples/s), neuronal activity (30,000 samples/s), and the signal from a photodiode (30,000 samples/s), which was used to align neuronal responses to stimulus presentation times, were recorded using Blackrock CerePlex hardware.
Neural responses
The filtered electrical activity (bandpass 250–5000 Hz) was thresholded at 2–3% RMS value for each recording site and the threshold crossing timestamps were saved (along with the raw electrical signal, waveforms at each crossing, and other signals). Spikes were not sorted for these experiments, and ‘unit’ refers to the multiunit activity at each recording electrode. The stimulus-evoked firing rate of each V4 unit was calculated based on the spike count responses between 50 and 250 ms after stimulus onset to account for V4 response latency. The baseline firing rates were calculated based on the spike count responses in the 100 ms time period before the onset of the stimulus.
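The response windows described above can be expressed as a small helper that converts threshold-crossing times into evoked and baseline rates for each stimulus presentation. This is an illustrative sketch; the variable names are assumptions, and only the window definitions (50–250 ms after onset; 100 ms before onset) come from the text.

```python
import numpy as np

def rates_per_presentation(spike_times, stim_onsets,
                           evoked=(0.050, 0.250), baseline=(-0.100, 0.0)):
    """Evoked and baseline firing rates (spikes/s) for one unit.

    spike_times: threshold-crossing times (s); stim_onsets: stimulus onset times (s).
    """
    spike_times = np.asarray(spike_times)
    evoked_rates, baseline_rates = [], []
    for t0 in stim_onsets:
        n_evoked = np.sum((spike_times >= t0 + evoked[0]) & (spike_times < t0 + evoked[1]))
        n_base = np.sum((spike_times >= t0 + baseline[0]) & (spike_times < t0 + baseline[1]))
        evoked_rates.append(n_evoked / (evoked[1] - evoked[0]))
        baseline_rates.append(n_base / (baseline[1] - baseline[0]))
    return np.array(evoked_rates), np.array(baseline_rates)
```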
Neuron exclusion
For each unit in an experimental session, the average stimulus-evoked responses across all stimuli were compared with the average baseline activity. The unit was included in further analyses if the average evoked activity was at least 1.1x the baseline activity. This lenient inclusion criterion was chosen because, for the chosen experimental design and stimuli, dimensionality-reduced decoding analyses are resilient to noise and benefit from information distributed across many neurons. Each recording experiment yielded data from 90 to 95 units (mean 94.1).
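Given per-presentation evoked and baseline rates (e.g., from the sketch above), the inclusion rule can be written as a one-line filter. Array names are illustrative assumptions; the 1.1× criterion is from the text.

```python
import numpy as np

def included_units(evoked_by_unit, baseline_by_unit, criterion=1.1):
    """Indices of units whose mean evoked rate is at least `criterion` x the baseline rate.

    evoked_by_unit, baseline_by_unit: (n_units, n_presentations) arrays of firing rates.
    """
    evoked_mean = np.asarray(evoked_by_unit).mean(axis=1)
    baseline_mean = np.asarray(baseline_by_unit).mean(axis=1)
    return np.flatnonzero(evoked_mean >= criterion * baseline_mean)
```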
Receptive field mapping
A set of 2D closed contours, 3D solid objects, and black-and-white Gabor images were flashed as described above in the lower left quadrant of the screen. The positions and sizes were chosen manually across several experiments to home in on the receptive fields of each V4 recording site. Typically, a grid of 5 × 5 positions and two image sizes were chosen to overlap partially. The spikes were counted within a 50–250 ms window after stimulus onset, and a RF heat map was constructed for each site. The center of mass of this heat map was chosen as the center of the RF, and an ellipse was fit to circumscribe the central two standard deviations. This resulted in centers and extents of the RF of each recording site. The naturalistic image sets for the experiments described below were scaled such that the circular aperture within which the background objects were contained fully overlapped the population RF. This necessitated that the image boundary exceeded the RF of some neurons, but the image information outside of the circular aperture was held constant across images.
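A simplified sketch of the RF-mapping computation: the center of mass of the response heat map gives the RF center, and the response-weighted spread gives a two-standard-deviation extent. The paper fit an ellipse, which would additionally capture RF orientation; the axis-aligned version below is a reduced illustration with assumed input conventions.

```python
import numpy as np

def rf_center_and_extent(heatmap, x_deg, y_deg):
    """Receptive-field center (center of mass) and 2-SD extent from a response heat map.

    heatmap: (n_y, n_x) mean responses across the probe grid;
    x_deg, y_deg: probe positions (degrees of visual angle) along each axis.
    """
    w = np.clip(heatmap - heatmap.min(), 0, None)
    w = w / w.sum()
    xs, ys = np.meshgrid(x_deg, y_deg)
    cx, cy = np.sum(w * xs), np.sum(w * ys)
    sx = np.sqrt(np.sum(w * (xs - cx) ** 2))
    sy = np.sqrt(np.sum(w * (ys - cy) ** 2))
    return (cx, cy), (2 * sx, 2 * sy)
```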
Experiment 1: Effect of task-irrelevant stimulus changes on the ability of V4 neurons to encode a feature of interest about the central object
The first goal of the electrophysiology experiments was to determine whether information about the chosen parameter of the central object (banana position) could be decoded without interference from information about the distracting parameters (background object rotation and depth). To do this, we systematically varied the horizontal position of the object and the background parameters in an uncorrelated fashion. The values and ranges of the object and background parameters were customized for each monkey such that there was a differential response to each condition on average across all other conditions, i.e., a 3-way ANOVA showed a significant main effect for object position and for each of the two background parameters (p < 0.01). Five values of object position, background depth, and background rotation were chosen and permuted, yielding 125 image stimuli. Further details of the stimuli can be found in Fig. 1 and Supp. Fig. 1, as well as in the associated code and data repositories.
The data were collected in 26 recording experiments (17 sessions across 11 days from monkey 1 and 9 sessions across 8 days from monkey 2). Recording experiments with fewer than three repetitions per stimulus image were excluded. Therefore, each stimulus was presented between 3 and 16 times, yielding between 381 and 2084 presentations (mean 831).
Experiment 2: relationships between multiple object and background feature dimensions
The second goal of the monkey electrophysiology was to determine whether different visual features are encoded orthogonally in neuronal population responses. Therefore, we measured responses to stimuli that varied many features of the central object (banana), including its horizontal position, depth, orientation, and two surface parameters (color and luminance). We also independently varied the same five features of the background objects (branches and leaves). We used two values for each of the ten features, chosen so that the ten features were comparably decodable by the population of V4 neurons (see Fig. 5). We aimed to measure responses to five repetitions of each of the 2¹⁰ = 1,024 stimuli; in practice, each stimulus image was repeated two to three times. Because of the large dataset required for this experiment, the data analyzed in Fig. 5 were collected from one session from monkey 1.
Experiment 3: relationship between multiple object feature dimensions and their influence on behavior
We repeated the fixation experiment in the same monkeys, displaying parametrically generated 3D objects that varied in up to 16 parameters. Details of shape generation have been published elsewhere28. Unlike in experiment 2, we did not generate all permutations of the 16 parameters; instead, we generated 50 objects with random values of those features (Supp. Figure 6a). Stimuli were repeated at least five times. We collected data from the two monkeys in six and eight experimental sessions, respectively.
Please refer to the publication cited above28 for details of the curvature estimation experiment. Briefly, we generated a base shape with a random set of features as above and displayed it at one of 20 values of axial curvature for 500–800 ms before displaying a 140° arc in the upper hemifield. When the fixation point disappeared, the monkey made a saccade to this arc to indicate its curvature estimate, with leftward saccades indicating a straight shape and rightward saccades a curved shape. The monkey was rewarded based on the error of its behavioral estimate. For each behavioral session, we tested up to four random base shapes simultaneously, such that the shape varied in several features (including axial curvature) across trials. In subsets of sessions, we also tested a single base shape with variation in only one feature (in-plane orientation or color) across trials. The monkeys’ curvature estimation behavior was not affected by trial-to-trial variations in single features or in multiple features.
Human psychophysics
Apparatus
A calibrated LCD color monitor (27-inch NEC MultiSync PA271Q QHD Color Critical Desktop W-LED Monitor with SpectraView Engine; NEC Display Solutions) displayed the stimuli in an otherwise dark room, after participants dark-adapted in the experimental room for a minimum of 5 min. The monitor was driven at a pixel resolution of 1920 × 1080, with a refresh rate of 60 Hz and with 8-bit resolution for each RGB channel. The host computer for this monitor was an Apple Macintosh with an Intel Core i7 processor. The head position of each participant was stabilized using a chin cup (Headspot, UHCOTech, Houston, TX). The participant’s eyes were centered horizontally and vertically with respect to the monitor, which was 75 cm from the participant’s eyes. The participant indicated their responses using a Logitech F310 gamepad controller.
Stimulus parameters
The entire image subtended ~ 8° in width and height. The central object subtended ~ 4° in the longest dimension, and the circular array of background objects (branches and leaves) subtended ~ 5° of visual angle. The images were created using ISET3d at a resolution of 1920 × 1920 with 100 samples per pixel, at 31 equally spaced wavelengths between 400 nm and 700 nm.
Psychophysical task
The psychophysical task was a two-interval forced choice (2AFC) task with one stimulus per interval. Each stimulus interval had a duration of 250 ms. Stimuli were presented at the center of the monitor. Between the two stimulus intervals, two masks were shown in succession at the center of the monitor (Fig. 5). Each mask was presented for a duration of 400 ms, for a total interstimulus interval of 800 ms (see Session organization below for mask details). Display times are approximate, as the actual display times were quantized by the hardware to integer multiples of the 16.67 ms frame duration.
The participant’s task was to determine whether the central object presented in the second interval was to the left or to the right of the one presented in the first interval. Following the two intervals, the participant had an unlimited amount of time to press one of two response buttons on a gamepad to indicate their choice. Feedback was provided via auditory tones. Trials were separated by an intertrial interval of approximately one second.
The experimental programs can be found in the custom software package Natural Image Thresholds (https://github.com/AmyMNi/NaturalImageThresholds). They were written in MATLAB (MathWorks; Natick, MA) and were based on the software package Virtual World Color Constancy (github.com/BrainardLab/VirtualWorldColorConstancy). They rely on routines from the Psychophysics Toolbox (http://psychtoolbox.org) and mgl (http://justingardner.net/doku.php/mgl/overview).
Session organization
The first experimental session for each participant included participant enrollment procedures (informed consent, vision tests, etc.; see Participants above for details) as well as familiarization trials (see next paragraph) and lasted one and a half hours. The additional experimental sessions lasted approximately one hour each.
For the first session only, the participant began with 30 familiarization trials. The familiarization trials comprised, in order: 10 randomly selected easy trials (the largest position-change comparisons), 10 randomly selected medium-difficulty trials (the 4th and 5th largest position-change comparisons), and 10 randomly selected trials from all possible position-change comparisons. The familiarization trials did not include any task-irrelevant variability, and data from these trials were not saved.
In each session, there were two reference positions for the object, and for each reference position there were 11 comparison positions: five comparison positions in the positive horizontal direction, five comparison positions in the negative horizontal direction, and a comparison position of 0 indicating no change. On each trial, one interval contained one of the two reference stimuli and the other interval contained one of that reference stimulus’s comparison stimuli. The order in which these two stimuli were presented within a trial was selected randomly on each trial.
A block of trials consisted of presentation of the 11 comparison positions for each of the two reference positions, for a total of 22 trials per block. The trials within a block were run in randomized order, and each block was completed before the next block began. Each block was repeated 7 times in a run of trials, for a total of 154 trials per run.
Within each run of 154 trials, a single background variation condition was studied. There were three such conditions, described in more detail below: “no variation”, “rotation only”, and “rotation and depth”. Two runs for each of the three conditions were completed in each experimental session, and except as noted in the results, each subject completed 6 sessions. The six runs were conducted in random order within each session and were separated by breaks of at least one minute, during which the participant was encouraged to stand or stretch as needed; the next run was initiated when the participant was ready.
Additionally, each session began with four practice trials (including in the first experimental session, where these practice trials were preceded by the familiarization trials as described). Each run after the first also started with one practice trial. The practice trials were all easy trials as described above and did not include any task-irrelevant variability. The data from the practice trials were saved. The maximum variation in background features was matched to the maximum variation in the neurophysiology experiments but was sampled more finely, as described for each of the variation conditions below.
For the “no variation” condition, there were no changes to the background objects (the branches and leaves). These runs determined the participant’s threshold for discriminating the horizontal position of the central object without any task-irrelevant stimulus variation.
The “rotation only” runs introduced variability in a single task-irrelevant feature: the rotation of the background objects. For each trial, a single rotation amount was drawn randomly from a pool of 51 rotations, and the background objects (leaves and sticks) in the stimulus were all rotated by that amount around their own centers. The rotation was drawn separately (randomly, with replacement) for each of the two stimuli presented on a trial (the reference position stimulus and the comparison position stimulus). Thus, subjects had to judge the position of the central object across a change in the background, so any effect of background variation on the positional representation of the central object would be expected to elevate threshold. The pool of 51 rotations comprised a rotation of zero (no change to the background objects), 25 equally spaced rotations in the clockwise direction at 2° intervals, and 25 equally spaced rotations in the counterclockwise direction at 2° intervals.
“Rotation and depth” runs had variation in two task-irrelevant features: the rotation and the depth of the background objects. For these runs there was again a pool of 51 rotations, but along with their rotation, the background objects also varied in depth. There were 51 possible depth amounts (one depth amount of zero, 25 equally spaced depth amounts in the positive depth direction, and 25 equally spaced depth amounts in the negative direction; depth amounts ranged from −500 mm to 500 mm in the rendering scene space). One of the 51 images had a rotation of zero and a depth amount of zero. For the remaining 50 images in the pool, each of the remaining 50 rotation amounts was randomly assigned (without replacement) to one of the remaining 50 depth amounts. The same depth shift was applied to each of the background objects. From this pool of 51 images, a single image was randomly drawn (with replacement) for each of the two stimuli presented in the trial.
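A MATLAB sketch of this pool is below; the 20 mm depth step is an assumption that follows from 25 equally spaced positive depths up to 500 mm, and the variable names are illustrative.

% 51 rotations (deg) and 51 depths (mm): zero plus 25 steps in each direction.
rng(4);
rotations = [0, 2:2:50, -(2:2:50)];
depths    = [0, linspace(20, 500, 25), linspace(-20, -500, 25)];
% Pair zero rotation with zero depth; randomly pair the remaining 50 rotations
% with the remaining 50 depths without replacement.
idx  = randperm(50);
pool = [0, 0; rotations(2:end)', depths(idx + 1)'];      % 51 x 2 [rotation, depth]
% Each of the two stimuli on a trial independently draws one entry (with replacement).
trialDraws = pool(randi(51, 1, 2), :);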
Finally, as noted above (see Psychophysical task), two masks were shown per trial during the interstimulus interval. All masks across all background variation conditions were created from the same distribution of stimuli (stimuli with “no variation”, thus containing no task-irrelevant noise). To create each of the two masks, first the central object positions in the first and second intervals of the trial were determined. The two stimuli that matched the central object positions in the first and second intervals were then used to create the trial masks. For each of these two stimuli, the average intensity was calculated in each RGB channel per 16 × 16 block of the stimulus. Next, each 16 × 16 block of a mask was randomly drawn from the corresponding block-averaged values of one of these two stimuli. Thus, the two masks shown per trial were each a random mixture of 16 × 16 blocks from stimuli with the two central object positions for that trial.
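A sketch of this mask construction in MATLAB is given below, assuming the two position-matched “no variation” stimuli are available as RGB arrays (imgA, imgB); the names and stand-in images are illustrative.

% imgA/imgB: the two stimuli whose central-object positions match the two intervals.
rng(5);
imgA = rand(1920, 1920, 3);  imgB = rand(1920, 1920, 3);    % stand-in RGB images
blk  = 16;  nB = 1920 / blk;
mask = zeros(size(imgA));
for r = 1:nB
    for c = 1:nB
        rows = (r-1)*blk + (1:blk);  cols = (c-1)*blk + (1:blk);
        src  = imgA;  if rand < 0.5, src = imgB; end        % randomly pick a source stimulus
        for ch = 1:3                                        % fill the block with the source's
            mask(rows, cols, ch) = mean(src(rows, cols, ch), 'all');   % block-averaged RGB value
        end
    end
end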
Statistical analysis and quantification
Monkey electrophysiology
Cross-validated parameter decoding (Fig. 2)
First, the response matrix (multiunit spike rates for each site for each image stimulus presentation) was reduced to 10 dimensions of activity. This ensured sufficient dimensionality for decoding object and background parameters and explained between 87.8% and 94.8% (mean 91.2% across sessions) of the variance across stimulus responses. Parameter decoding without dimensionality reduction produced qualitatively similar results. Then, for each background condition (each unique combination of background rotation and depth; “specific” decoding), the object position on each presentation/trial was decoded from neural responses using regression weights learned from all other trials (leave-one-out cross-validation). Decoding accuracy was defined as the correlation between actual and decoded values, so that perfect decoding would result in an accuracy of 1 and chance decoding in an accuracy of 0. We did not encounter decoding accuracies below 0. We also calculated other decoding performance measures, such as mean squared error and cosine distance. Although these measures are more sensitive to specific kinds of decoding error, their estimates of aggregate performance were qualitatively similar to those of the correlation-based measure.
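The decoding pipeline can be sketched in MATLAB (Statistics Toolbox pca, regress, and corr) as below; the stand-in responses and variable names are illustrative, and in the actual analysis this procedure is applied separately to each background condition for the specific decoders.

% R: presentations x units spike-rate matrix; objPos: object position per presentation.
rng(6);
nPres = 600;  nUnits = 94;
objPos = randi(5, nPres, 1);
R = objPos * ones(1, nUnits) + 5*randn(nPres, nUnits);   % stand-in responses
[~, score] = pca(R);
Z = score(:, 1:10);                                      % reduce to 10 dimensions
pred = zeros(nPres, 1);
for k = 1:nPres                                          % leave-one-out cross-validation
    train   = setdiff(1:nPres, k);
    b       = regress(objPos(train), [ones(numel(train), 1), Z(train, :)]);
    pred(k) = [1, Z(k, :)] * b;
end
accuracy = corr(pred, objPos);    % 1 = perfect decoding, 0 = chance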
The same procedure was repeated for general decoding, in which the background parameters were ignored (Fig. 2b). We also trained general decoders using random subsets of trials across the dataset to match the training sets of the specific decoders. Specifically, we subsampled from the full dataset, regardless of background feature values, the minimum number of trials that any of the 25 specific decoders was trained on, trained a position decoder, and tested it on the rest of the trials. We repeated this for 100 subsamples and averaged the decoded predictions across folds. Across folds, the performance of the trial-matched general decoders was not significantly different from either the general decoder trained on all the data at once in a leave-one-out fashion or the distribution of specific decoder performances (paired t-test, p > 0.05).
Error in decoding was defined as the difference between the predicted object position and the actual position (Fig. 2d). The same procedure for specific and general decoding was also repeated for each of the two background conditions (Supp. Figure 2). We also compared the linear weights of each neuron for each general decoder (Supp. Figure 3a-c). For this comparison, we normalized the range of feature values between 0 and 1 and averaged the weights for each neuron across folds. We also compared these weights to the feature sensitivity of each neuron (calculated in Supp. Figure 1d; plotted in Supp. Figure 3d-f).
To compare the performance of the specific decoders across conditions, we split the trials for each unique background condition into two sets. We trained and tested the specific decoders within (self) and across (cross) conditions and plotted the mean decoding performance difference across folds (Supp. Figure 3). We plotted these cross-decoding differences separately for background rotation and depth changes. We did not distinguish between increasing and decreasing values of background rotation or depth, i.e., a condition difference of 1 denotes that the decoder was tested on a background condition that was one away from the condition it was trained on. Since there were five levels of variation in each background condition, there were 5 pairs of conditions that were 0 levels away (self-decoder); 8 pairs, 1 away; 6 pairs, 2 away; 2 pairs, 3 away; 1 pair, 4 away. We observed slight deviations from the self-decoder as the level of variation increased.
Angle calculation (Fig. 3)
To calculate the angle between the specific decoders, an n-dimensional line was fit to the dimensionality-reduced responses for each condition, and its unit direction vector was found. The angle between each specific decoder and the decoder for the central condition was calculated as the arccosine of the dot product of the two unit vectors.
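In MATLAB, with u and v denoting the unit vectors of two fitted lines in the 10-dimensional space (stand-in vectors below):

rng(7);
u = randn(10, 1);  u = u / norm(u);      % unit vector of one specific decoder
v = randn(10, 1);  v = v / norm(v);      % unit vector of the central-condition decoder
angleDeg = acosd(dot(u, v));             % arccosine of the dot product, in degrees
% (use abs(dot(u, v)) if the fitted line directions are arbitrary in sign)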
Linear discriminant analysis and comparison with human psychophysics (Fig. 4c)
To directly compare human psychophysical discrimination accuracy with the decoding results, we matched the three blocked conditions – no background variation, rotation only, and rotation and depth variation – by subsampling trials from the 5 × 5 × 5 stimulus set from Experiment 1. For the three conditions, we either found all pairs of trials, pairs of trials that varied in rotation only (by holding background depth at the central value), or pairs of trials that varied in depth only (by holding background rotation at the central value). Then, for 200 folds, we sampled a maximum of 500 pairs of trials and assigned a left or right choice according to the object positions in those trials. If the positions were identical, we randomly assigned the choice for that pair. We then collated the responses across the pairs of trials and fit a linear discriminant in a leave-one-out fashion to predict the correct choice. The classification prediction accuracy for each of the three blocked conditions was calculated independently.
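A simplified MATLAB sketch of the discriminant step is below; here the two responses in a trial pair are simply concatenated into one feature vector, which is an assumption about how the responses were collated, and the subsampling over 200 folds is omitted.

% Zpair: pairs x 20 matrix (10-D responses of the two trials in each pair, concatenated);
% choice: 1 if the object in the second trial was to the right of the first, else 0.
rng(8);
nPairs = 500;
Zpair  = randn(nPairs, 20);                               % stand-in paired responses
choice = double(randn(nPairs, 1) + 0.5*Zpair(:, 1) > 0);  % stand-in choices
correct = false(nPairs, 1);
for k = 1:nPairs                                          % leave-one-out classification
    train      = setdiff(1:nPairs, k);
    mdl        = fitcdiscr(Zpair(train, :), choice(train));
    correct(k) = predict(mdl, Zpair(k, :)) == choice(k);
end
accuracy = mean(correct);                                 % proportion correct for this condition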
Cross-decoding analysis (Fig. 5)
For experiment 2, even though only two values were used for each of the five object and five background parameters, we used linear regression rather than classification with discriminant analysis, for comparability with the decoding analyses in the previous experiment. Although each stimulus image was repeated only 2–3 times, because each parameter could take one of two values, every unique pair of images was informative about at least one parameter change. To enable cross-decoding, we altered the cross-validation procedure. For each parameter pair, for each of 100 folds, we randomly split all image presentations evenly into training and testing sets (uneven splits also produced qualitatively similar results). We then trained a linear regression model for one parameter using the training trials and used it to predict the values of the other parameters for the held-out testing trials. The decoding accuracy was calculated as the average correlation across folds between the actual and decoded parameter values. Since each parameter decoder was trained while ignoring all other parameter variations, the diagonals in Fig. 5b are akin to the general decoder accuracy for those parameters, and the off-diagonals correspond to how well those general decoders are aligned with the representations of the other parameters. The diagonal correlations were all significantly above 0 (p < 10^−80; t-test across folds), and none of the off-diagonal correlations were significant except the cross-decoding of background and object color. We repeated the same procedure for data from experiment 3, i.e., the responses to the 3D shapes from Srinath et al., 2024, for the 16 shape features. Unlike in experiment 2, since the 16 features were not permuted but chosen randomly, 4 of the 120 possible feature pairs were correlated in the stimulus set. Therefore, the cross-decoding accuracy of the four sets of features that could be cross-decoded (Z orientation-color R at 0.24, curvature-surface gloss at 0.21, curvature-thickness X1 at 0.21, and thickness Y2-thickness Y3 at 0.23) can be trivially explained by correlations between those features in the stimulus set (0.29, 0.25, 0.43, and 0.43, respectively).
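The cross-decoding step for one pair of parameters can be sketched in MATLAB as below; the stand-in population encodes both parameters along different (nearly orthogonal) axes, so the self-decoding accuracy should be high and the cross-decoding accuracy near zero. All names are illustrative.

% R: presentations x units responses; pA, pB: two-valued levels of parameters A and B.
rng(9);
nPres = 2000;  nUnits = 94;
pA = randi(2, nPres, 1) - 1;  pB = randi(2, nPres, 1) - 1;
wA = randn(1, nUnits);  wB = randn(1, nUnits);    % stand-in coding axes for A and B
R  = pA*wA + pB*wB + 0.5*randn(nPres, nUnits);
selfAcc = zeros(100, 1);  crossAcc = zeros(100, 1);
for f = 1:100                                     % 100 random, even train/test splits
    idx   = randperm(nPres);
    train = idx(1:nPres/2);  test = idx(nPres/2 + 1:end);
    b     = regress(pA(train), [ones(numel(train), 1), R(train, :)]);   % train on A
    pred  = [ones(numel(test), 1), R(test, :)] * b;
    selfAcc(f)  = corr(pred, pA(test));           % akin to a diagonal entry of Fig. 5b
    crossAcc(f) = corr(pred, pB(test));           % akin to an off-diagonal (cross-decoding) entry
end
meanSelf = mean(selfAcc);  meanCross = mean(crossAcc);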
Human psychophysics (Fig. 4a-b)
Per session, the participant’s threshold for discriminating object position was measured for each background variation condition. First, for each comparison position, the proportion of trials on which the participant responded that the comparison stimulus was located to the right of the reference stimulus was calculated. Next, these proportions were fit, as a function of comparison position, with a cumulative normal function using the Palamedes Toolbox (http://www.palamedestoolbox.org). To estimate all four parameters of the psychometric function (threshold, slope, lapse rate, and guess rate), the lapse rate was constrained to be equal to the guess rate and to lie in the range [0, 0.05], and the maximum likelihood fit was determined. The threshold was calculated as the difference between the stimulus levels at performance levels of 0.7602 and 0.5, as determined by the cumulative normal fit.
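As a sketch of this fit in MATLAB, below the cumulative normal is fit by maximum likelihood with fminsearch rather than with the Palamedes routines used in the study; the stand-in data, the parameterization (log sd, clamped lapse), and the threshold read-out are assumptions consistent with the description above.

% levels: comparison minus reference position; nRight/nTotal: "rightward" counts per level.
rng(10);
levels = linspace(-0.3, 0.3, 11);
nTotal = 14 * ones(size(levels));
nRight = binornd(nTotal, normcdf(levels, 0, 0.08));        % stand-in data
% Parameters p = [mean, log(sd), lapse]; lapse = guess rate, clamped to [0, 0.05].
lam  = @(p) min(max(p(3), 0), 0.05);
pf   = @(p, x) lam(p) + (1 - 2*lam(p)) .* normcdf(x, p(1), exp(p(2)));
nll  = @(p) -sum(nRight .* log(pf(p, levels) + eps) + ...
                 (nTotal - nRight) .* log(1 - pf(p, levels) + eps));
pFit = fminsearch(nll, [0, log(0.1), 0.01]);               % maximum likelihood fit
% Threshold: stimulus level at 76.02% performance minus the level at 50% (the fitted mean).
x76 = fzero(@(x) pf(pFit, x) - 0.7602, pFit(1) + exp(pFit(2)));
threshold = x76 - pFit(1);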
We calculated Bayes factors using the MATLAB bayesFactor Toolbox69, which computes Bayes factors (BF) for ANOVA designs as detailed in Rouder et al.70. We tested the hypothesis that variations in the background conditions significantly affected choices. First, we calculated the BF across subjects for the full model, i.e., a model with main effects of object position and background variation condition and their interaction. Then, to isolate the effect of background variation, we repeated the ANOVA while excluding the background variation terms (restricted model). We then calculated the ratio of the BF of the full model to that of the restricted model and inverted it to quantify evidence for the absence of the effect. We also repeated this procedure for the monkey electrophysiology data that accompany this analysis (Fig. 4).
Data availability
The data and code that generate the figures in this study have been deposited in a public GitHub repository: https://github.com/ramanujansrinath/UntanglingBananas. MATLAB code for creating and displaying the images for the human psychophysical experiments, as well as for analyzing the raw data from these experiments, can be found at https://github.com/AmyMNi/NaturalImageThresholds. Requests for further information should be directed to and will be fulfilled by the corresponding author, David H. Brainard (brainard@psych.upenn.edu), in consultation with the other authors.
References
Martinez-Garcia, M., Bertalmío, M. & Malo, J. In praise of artifice reloaded: Caution with natural image databases in modeling vision. Front. Neurosci. 13, 8 (2019).
Rust, N. C. & Movshon, J. A. In praise of artifice. Nat. Neurosci. 8, 1647–1650 (2005).
Pasupathy, A. & Connor, C. E. Population coding of shape in area V4. Nat. Neurosci. 5, 1332–1338 (2002).
Gallant, J. L., Braun, J. & Van Essen, D. C. Selectivity for polar, hyperbolic, and cartesian gratings in macaque visual cortex. Science 259, 100–103 (1993).
Leopold, D. A. & Logothetis, N. K. Activity changes in early visual cortex reflect monkeys’ percepts during binocular rivalry. Nature 379, 549–553 (1996).
Peterhans, E. & von der Heydt, R. Subjective contours—bridging the gap between psychophysics and physiology. Trends Neurosci. 14, 112–119 (1991).
von der Heydt, R., Peterhans, E. & Baumgartner, G. Illusory contours and cortical neuron responses. Science 224, 1260–1262 (1984).
Snow, J. C. & Culham, J. C. The treachery of images: How realism influences brain and behavior. Trends Cogn. Sci. 25, 506–519 (2021).
Peters, B. & Kriegeskorte, N. Capturing the objects of vision with neural networks. Nat. Hum. Behav. 5, 1127–1144 (2021).
Ruff, D. A., Ni, A. M. & Cohen, M. R. Cognition as a window into neuronal population space. Annu. Rev. Neurosci. 41, 77–97 (2018).
Cadieu, C. et al. A model of V4 shape selectivity and invariance. J. Neurophysiol. 98, 1733–1750 (2007).
Oleskiw, T. D., Nowack, A. & Pasupathy, A. Joint coding of shape and blur in area V4. Nat. Commun. 9, 466 (2018).
Kim, T., Bair, W. & Pasupathy, A. Neural coding for shape and texture in Macaque Area V4. J. Neurosci. 39, 4760–4774 (2019).
Yamane, Y. et al. Population coding of figure and ground in natural image patches by V4 neurons. PLoS ONE 15, e0235128 (2020).
Srinath, R. et al. Early emergence of solid shape coding in natural and deep network vision. Curr. Biol. 31, 51–65e5 (2021).
Hatanaka, G. et al. Processing of visual statistics of naturalistic videos in macaque visual areas V1 and V4. Brain Struct. Funct. 227, 1385–1403 (2022).
Hubel, D. H. & Wiesel, T. N. Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195, 215–243 (1968).
Born, R. T. & Tootell, R. B. Spatial frequency tuning of single units in macaque supragranular striate cortex. Proc. Natl. Acad. Sci. 88, 7066–7070 (1991).
Nauhaus, I., Nielsen, K. J., Disney, A. A. & Callaway, E. M. Orthogonal micro-organization of orientation and spatial frequency in primate primary visual cortex. Nat. Neurosci. 15, 1683–1690 (2012).
Everson, R. M. et al. Representation of spatial frequency and orientation in the visual cortex. Proc. Natl. Acad. Sci. 95, 8334–8338 (1998).
Kohn, A. et al. Principles of corticocortical communication: Proposed schemes and design considerations. Trends Neurosci. 43, 725–737 (2020).
Vyas, S., Golub, M. D., Sussillo, D. & Shenoy, K. V. Computation through neural population dynamics. Annu. Rev. Neurosci. 43, 249–275 (2020).
Okazawa, G., Hatch, C. E., Mancoo, A., Machens, C. K. & Kiani, R. Representational geometry of perceptual decisions in the monkey parietal cortex. Cell 184, 3748-3761.e18 (2021).
Misaki, M., Kim, Y., Bandettini, P. A. & Kriegeskorte, N. Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage 53, 103–118 (2010).
Hong, H., Yamins, D. L. K., Majaj, N. J. & DiCarlo, J. J. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19, 613–622 (2016).
Reynolds, D. & Singh, V. Characterization of human lightness discrimination thresholds for independent spectral variations. Preprint at https://doi.org/10.1101/2023.06.16.545355 (2023).
Singh, V., Burge, J. & Brainard, D. H. Equivalent noise characterization of human lightness constancy. J. Vis. 22, 2 (2022).
Srinath, R., Czarnik, M. M. & Cohen, M. R. Coordinated response modulations enable flexible use of visual information. Preprint at https://doi.org/10.1101/2024.07.10.602774 (2024).
Bashivan, P., Kar, K. & DiCarlo, J. J. Neural population control via deep image synthesis. Science 364, eaav9436 (2019).
Cowley, B. R., Stan, P. L., Pillow, J. W. & Smith, M. A. Compact deep neural network models of visual cortex. Preprint at https://doi.org/10.1101/2023.11.22.568315 (2023).
Pospisil, D. A., Pasupathy, A. & Bair, W. ‘Artiphysiology’ reveals V4-like shape tuning in a deep network trained for image classification. eLife 7, e38242 (2018).
Kramer, L. E., Chen, Y. C., Long, B., Konkle, T. & Cohen, M. R. Contributions of early and mid-level visual cortex to high-level object categorization. Preprint at https://doi.org/10.1101/2023.05.31.541514 (2023).
Majaj, N. J., Hong, H., Solomon, E. A. & DiCarlo, J. J. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. J. Neurosci. 35, 13402–13418 (2015).
Chung, S., Lee, D. D. & Sompolinsky, H. Classification and geometry of general perceptual manifolds. Phys. Rev. X 8, 031003 (2018).
Cohen, U., Chung, S., Lee, D. D. & Sompolinsky, H. Separability and geometry of object manifolds in deep neural networks. Nat. Commun. 11, 746 (2020).
Chung, S., Lee, D. D. & Sompolinsky, H. Linear readout of object manifolds. Phys. Rev. E 93, 060301 (2016).
Ni, A. M., Huang, C., Doiron, B. & Cohen, M. R. A general decoding strategy explains the relationship between behavior and correlated variability. eLife 11, e67258 (2022).
Tsao, T. & Tsao, D. Y. A topological solution to object segmentation and tracking. Proc. Natl. Acad. Sci. U. S. A. 119, e2204248119 (2022).
Luongo, F. J. et al. Mice and primates use distinct strategies for visual segmentation. eLife 12, e74394 (2023).
DiCarlo, J. J. & Cox, D. D. Untangling invariant object recognition. Trends Cognit. Sci. 11, 333–341 (2007).
Rust, N. C. & Dicarlo, J. J. Selectivity and tolerance (‘invariance’) both increase as visual information propagates from cortical area V4 to IT. J. Neurosci. 30, 12978–12995 (2010).
DiCarlo, J. J., Zoccolan, D. & Rust, N. C. How does the brain solve visual object recognition? Neuron 73, 415–434 (2012).
Pagan, M., Urban, L. S., Wohl, M. P. & Rust, N. C. Signals in inferotemporal and perirhinal cortex suggest an untangling of visual target information. Nat. Neurosci. 16, 1132–1139 (2013).
Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. U. S. A. 111, 8619–8624 (2014).
Hénaff, O. J., Goris, R. L. T. & Simoncelli, E. P. Perceptual straightening of natural videos. Nat. Neurosci. 22, 984–991 (2019).
Kriegeskorte, N. & Wei, X. X. Neural tuning and representational geometry. Nat. Rev. Neurosci. 22, 703–718 (2021).
Fusi, S., Miller, E. K. & Rigotti, M. Why neurons mix: High dimensionality for higher cognition. Curr. Opin. Neurobiol. 37, 66–74 (2016).
Rigotti, M. et al. The importance of mixed selectivity in complex cognitive tasks. Nature 497, 585–590 (2013).
Saxena, S. & Cunningham, J. P. Towards the neural population doctrine. Curr. Opin. Neurobiol. 55, 103–111 (2019).
Chung, S. & Abbott, L. F. Neural population geometry: An approach for understanding biological and artificial neural networks. Curr. Opin. Neurobiol. 70, 137–144 (2021).
Jazayeri, M. & Ostojic, S. Interpreting neural computations by examining intrinsic and embedding dimensionality of neural activity. Curr. Opin. Neurobiol. 70, 113–120 (2021).
Yuste, R. From the neuron doctrine to neural networks. Nat. Rev. Neurosci. 16, 487–497 (2015).
Stringer, C., Pachitariu, M., Steinmetz, N., Carandini, M. & Harris, K. D. High-dimensional geometry of population responses in visual cortex. Nature 571, 361–365 (2019).
Rajalingham, R. et al. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. J. Neurosci. 38, 7255–7269 (2018).
Ehrlich, D. B. & Murray, J. D. Geometry of neural computation unifies working memory and planning. Proc. Natl. Acad. Sci. 119, e2115610119 (2022).
Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
Bernardi, S. et al. The geometry of abstraction in the hippocampus and prefrontal cortex. Cell 183, 954–967e21 (2020).
Russo, A. A. et al. Motor Cortex embeds muscle-like commands in an untangled population response. Neuron 97, 953–966e8 (2018).
Gallego, J. A. et al. Cortical population activity within a preserved neural manifold underlies multiple motor behaviors. Nat. Commun. 9, 4233 (2018).
Sussillo, D., Churchland, M. M., Kaufman, M. T. & Shenoy, K. V. A neural network that finds a naturalistic solution for the production of muscle activity. Nat. Neurosci. 18, 1025–1033 (2015).
Shenoy, K. V., Sahani, M. & Churchland, M. M. Cortical control of arm movements: A dynamical systems perspective. Annu. Rev. Neurosci. 36, 337–359 (2013).
Khaligh-Razavi, S. M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).
Day, R. H. & Dickinson, R. G. The components of the Poggendorff Illusion. Br. J. Psychol. 67, 537–552 (1976).
Cavanagh, P. & Anstis, S. The flash grab effect. Vis. Res. 91, 8–20 (2013).
Zhou, H., Friedman, H. S. & von der Heydt, R. Coding of border ownership in monkey visual cortex. J. Neurosci. 20, 6594–6611 (2000).
Maheswaranathan, N. et al. Interpreting the retinal neural code for natural scenes: From computations to neurons. Neuron 111, 2742–2755e4 (2023).
Ding, X. et al. Information geometry of the retinal representation manifold. Preprint at https://doi.org/10.1101/2023.05.17.541206 (2023).
Felsen, G. & Dan, Y. A natural approach to studying vision. Nat. Neurosci. 8, 1643–1646 (2005).
Krekelberg, B. klabhub/bayesFactor: Bayes only. Zenodo https://doi.org/10.5281/zenodo.13744717 (2024).
Rouder, J. N., Morey, R. D., Speckman, P. L. & Province, J. M. Default Bayes factors for ANOVA designs. J. Math. Psychol. 56, 356–374 (2012).
Acknowledgements
We are grateful to K. McKracken for providing technical assistance and to Douglas Ruff, Cheng Xue, and Lily Kramer for comments on an earlier version of this manuscript and for suggestions regarding data analysis. This work was supported by an Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Sciences, LLC program (to R.S.); the Simons Foundation (Simons Collaboration on the Global Brain award 542961SPI to M.R.C.; postdoctoral fellowship to A.M.N.); and the National Institutes of Health (awards R01EY022930, R01EY034723, and RF1NS121913 to M.R.C.; K99NS118117 to A.M.N.; and K99EY035362 to R.S.).
Author information
Contributions
Conceptualization - A.M.N., M.R.C., D.H.B.; Methodology - R.S., A.M.N., M.R.C., D.H.B.; Software - R.S., A.M.N., D.H.B.; Formal analysis - R.S., A.M.N., C.M., M.R.C., D.H.B.; Investigation - R.S., A.M.N., C.M.; Data curation - R.S., A.M.N., C.M.; Writing – original draft - R.S., M.R.C., D.H.B.; Writing – review & editing - R.S., M.R.C., D.H.B.; Visualization - R.S.; Supervision - M.R.C., D.H.B.; Project administration - M.R.C., D.H.B.; Funding acquisition - R.S., A.M.N., M.R.C., D.H.B.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Srinath, R., Ni, A.M., Marucci, C. et al. Orthogonal neural representations support perceptual judgments of natural stimuli. Sci Rep 15, 5316 (2025). https://doi.org/10.1038/s41598-025-88910-8