Main

Natural visual scenes contain strong positive stimulus correlations in both space and time3. According to the prominent efficient coding hypothesis1,2, the retina’s function is to encode stimulus information without wasting resources on signalling this inherent redundancy of natural scenes. Thus, to reduce the redundancy, the retina should decorrelate its output, the spiking activity of retinal ganglion cells, at least as much as the intrinsic noise in the system permits while retaining stimulus information4,6. In addition to this intuitive rationale, the popularity of the efficient coding hypothesis is based on its success in explaining characteristics of the early visual system, including centre-surround receptive fields4 and the emergence and spatial organization of retinal cell types7,8,9,10.

However, the decorrelation prediction of efficient coding has so far only been tested with stimuli that at most share some statistical similarities with natural scenes11,12,13,14, such as static images, sometimes including object movement. Instead, the natural retinal input is dynamically structured by eye and head movements that rapidly shift the retinal image5. Such gaze shifts can induce robust response transients at fixation onset in neurons at the early stages of the visual system15, thus shaping the encoding of natural scenes. Here we therefore sought to study whether retinal redundancy reduction and decorrelation hold for natural stimuli that include gaze shifts and whether stimulus correlations are efficiently discarded by the retina.

Redundancy in natural-video responses

We recorded ganglion cell spiking activity from isolated marmoset retinas with multielectrode arrays in response to natural videos generated by shifting photographic images according to natural gaze traces (Fig. 1a). The traces had been measured from head-fixed marmosets viewing natural scenes16 and contained both saccades and fixational eye movements. From the recordings, we functionally identified the four numerically dominant ganglion cell types of the primate retina, ON and OFF parasol cells, as well as ON and OFF midget cells, by their characteristic response kinetics, receptive-field sizes and the tiling of visual space by receptive fields of a given type (Fig. 1c, Extended Data Fig. 1).

Fig. 1: Correlations and redundancy in primate ganglion cell responses to natural videos.
figure 1

a, Marmoset-specific videos, generated by shifting natural images according to gaze traces recorded from head-fixed marmosets. Each image was presented for 1 s as marked by blue lines. The receptive field of a sample ON parasol ganglion cell is overlaid on the sample images shown. b, Spike raster of the sample ON parasol cell for 30 trials in response to the stimulus in a. c, Receptive-field mosaic of simultaneously recorded ON parasol cells from the peripheral marmoset retina, with the sample cell highlighted. d, Receptive fields and firing-rate profiles of two neighbouring ON parasol cells (top) and two neighbouring OFF midget cells (bottom) with resulting activity correlation coefficients (corr.). e, Correlation coefficients for ganglion cell pairs under the natural video as a function of receptive-field distance (number of pairs specified in the figure). For reference, black lines show the correlation between stimulus pixels. f, Same as e, but for pairs of ON and OFF cells. g, Histograms of response reliability under natural videos and white noise, measured by the coefficient of determination between firing rates of even and odd repeats. Number of cells: n = 41/34/38/53 for ON parasol/OFF parasol/ON midget/OFF midget. Note that more trials for firing-rate evaluation were available under white noise, contributing to higher reliability values as compared to natural videos. h, Fractional redundancy as a function of receptive-field distance for cell pairs of the same type responding to the natural video (coloured traces) or white noise (grey). i, Relationship between correlation and fractional redundancy. For e,f,h,i, lines represent binned averages for pairs at similar x coordinates (with 95% confidence intervals) for simultaneously recorded cell pairs, and data are from three retinas.

Source Data

Natural videos generated strong and reliable responses (Fig. 1b), which often displayed considerable correlations for pairs of neighbouring cells of the same type (Fig. 1d). Especially ON parasol cells frequently showed simultaneous firing-rate peaks (Fig. 1d, top). Correspondingly, pairwise correlations for ON parasol cells were nearly as high as the corresponding light-intensity correlation in the stimulus (Fig. 1e), thus showing almost no decorrelation. This included cell pairs with neighbouring receptive fields (typically distances below approximately 300 µm), as well as across larger distances. By contrast, OFF midget cells displayed a high degree of decorrelation (Fig. 1e), with firing events of neighbouring cells often occurring for distinct fixations (Fig. 1d, bottom). Correlations for pairs of OFF parasol and ON midget cells, respectively, lay between these two extremes. OFF parasol cells displayed more decorrelation than their ON counterparts but were more strongly correlated than OFF midget cells. Thus, the expected decorrelation is not seen in all ganglion cell types but ranges from strong decorrelation, as in OFF midget cells, to essentially no decorrelation, as in ON parasol cells.

Substantial positive correlations were also found for pairs of parasol cells with opposing contrast preference—that is, an ON cell and an OFF cell (Fig. 1f)—indicating a common pattern in the co-activation of parasol cells within and across types. Pairs of ON and OFF midget cells, on the other hand, showed essentially no correlation or, at small distances, negative correlation, as should be expected when one cell responds to increases and the other to decreases in light intensity.

For the case of noiseless transmission channels, decorrelation is a direct prediction of the efficient coding hypothesis2,6, as any statistical dependencies between the system’s output components reduce the entropy of the joint output patterns and thus prevent the system from using its full coding capacity. In the presence of intrinsic noise at the system’s input stage or during processing, a certain level of correlations in the output may help preserve information in accordance with efficient coding4,6, as correlated activity allows averaging to increase the signal-to-noise ratio. However, under the present stimulation conditions in the photopic range with natural contrast values well above detection threshold, the retina can be assumed in a low-noise regime, as evident in the reliable spiking responses (Fig. 1b) with Fano factors generally below unity over individual fixation periods (Extended Data Fig. 2e). Thus, averaging over cells offers minimal benefit for efficient information transmission.

To further check whether the observed cell-type-specific correlations could be consistent with compensating for noise, we measured the response reliability for repeated presentations of the same video sequence by the coefficient of determination between firing-rate profiles for even versus odd stimulus repeats. We found that responses of parasol cells were at least as reliable as responses of midget cells. In particular, ON parasol cells generally displayed the highest level of reliability under natural stimulation among the four analysed types (Fig. 1g). Thus, the most strongly correlated cell type was also the most reliable one, which is inconsistent with the idea that correlations in parasol cells would arise to counteract noise in the signals that they encode.

In principle, correlations could also contribute to efficient stimulus encoding in the form of stimulus-independent so-called noise correlations17,18. However, noise correlations were small in our data and typically negligible compared to stimulus-induced correlations (Extended Data Fig. 2a–c), thus indicating that correlations are not part of a synergistic encoding scheme, but imply redundancy17. Note also that we here measured correlations in trial-averaged firing rates and thereby obtain a measure that is largely independent of noise correlations.

The deficiency in decorrelation in parasol cells indicates that their representation of natural scenes contains considerable redundancy. To directly assess whether this is indeed the case, we evaluated the fractional redundancy12,19 of a given cell pair by quantifying the stimulus information provided by the joint responses of the pair and relating it to the single-cell information obtained from its constituent cells (Extended Data Fig. 2d). The fractional redundancy is zero if cells contribute independent information and takes positive values if the information carried by a cell pair falls below the sum of the single-cell information values, up to a maximum of unity if one of the cells adds no new information. Indeed, we found that fractional redundancy values could be substantial for our data, in particular for ON, but also for OFF parasol cells at short distances, indicating that more than 20% of single-cell information could be redundant (Fig. 1h). By contrast, OFF midget cell pairs displayed much less and often no redundancy. Moreover, for each cell type, the fractional redundancy was tightly connected to the measured correlation values (Fig. 1i), confirming response correlations as a source of redundancy. We also found that the spatiotemporal structure of natural stimuli is essential for the high redundancy values. Under a repeated presentation of spatiotemporal white noise, all four ganglion cell types had consistently low redundancy (Fig. 1h), which was also reflected in low cell-pair correlation values (Fig. 1i).

Spatial contrast triggers correlations

The idea of retinal decorrelation is typically associated with the centre-surround receptive fields of ganglion cells4. Yet these considerations generally assume a linear receptive field that acts as a spatial stimulus filter, whereas ganglion cells often display nonlinear processing in the receptive field, which can lead to cell-type-specific sensitivity to spatial contrast on spatial scales below the receptive-field size20. Such spatial contrast, which is high when edges or textures are present in natural scenes, can particularly drive parasol cell responses in the macaque retina21. We therefore sought to identify whether sensitivity to spatial contrast directly influenced the pairwise response correlations.

In ON parasol cells, fixations with high spatial contrast led to stronger responses that were also more correlated than for fixations with comparable light intensity but low spatial contrast (Fig. 2a,b). These effects also existed in OFF parasol and ON midget cells, albeit to a lesser degree, but not in OFF midget cells (Fig. 2b). For pairs of ON and OFF parasol cells, we also observed stronger correlations for high-spatial-contrast fixations, but not for pairs of ON and OFF midget cells (Fig. 2c,d). Thus, spatial stimulus structure can promote response correlations for certain cell pairs (Fig. 2e, top). This seems to be mediated by nonlinear processing in ganglion cell receptive fields, as the same analysis with predictions of linear–nonlinear (LN) models, fitted to the ganglion cells and capturing the encoding properties of their linear receptive fields, failed to reproduce the spatial-contrast-dependent correlation differences observed across cell types (Fig. 2e, bottom).

Fig. 2: Spatial contrast in natural videos leads to concerted responses within and across ganglion cell types of primate retina.
figure 2

a, Responses of two neighbouring ON parasol cells to fixations with similar light intensity but either high (top) or low spatial contrast (SC) (bottom). b, Partial correlations for each cell type (mean ± 95% confidence interval), separating the pairwise correlations into contributions from fixation periods with high versus low spatial contrast. c, Same as a for a pair of ON and OFF parasol cells. d, Same as b, but between types of different response polarity. e, Median differences between high- and low-spatial-contrast partial correlations across types (top) and predicted differences calculated with fitted LN models (bottom). Increases in correlation due to spatial contrast were statistically significant (one-sided Wilcoxon sign-rank test) for ON parasol (ONp, P = 3.5 × 10−27), OFF parasol (OFFp, P = 8.1 × 10−5), ON midget (ONm, P = 7.4 × 10−9) and ON versus OFF parasol (OvOp, P = 4.8 × 10−20) cell pairs, but not for OFF midget (OFFm, P = 0.93) or ON versus OFF midget (OvOm, P = 0.99; here, correlations even decreased slightly but significantly) cell pairs (cell pair numbers shown in Fig. 1e,f). Error bars are median ± robust confidence interval (95%), and data are from three retinas.

Source Data

Comparison of marmoset and mouse retina

To assess whether spatial-contrast-dependent correlations generalize across species, we also recorded from ganglion cells in the isolated mouse retina (Extended Data Figs. 35), to which we presented natural videos generated by pairing horizontal gaze traces recorded from freely moving mice22 with natural images from a standard database. We functionally identified (Extended Data Fig. 3) the four types of alpha ganglion cells, which are among the most accessible and widely studied mouse ganglion cell types23,24,25. These cells can be identified by their characteristic visual response properties (Extended Data Fig. 4). Moreover, transient and sustained alpha cell types seem to be orthologs of the primate parasol and midget cell types, respectively, as indicated by transcriptome analysis26. Our analysis of correlations and redundancy showed striking similarities between marmoset and mouse ganglion cells (Extended Data Fig. 5). In particular, we found substantial pairwise correlations under the natural video for certain types, but not others. Sustained-OFFα cells were strongly decorrelated, whereas the other three cell types displayed sizeable pairwise correlations. Positive correlations also occurred across ON and OFF types for transient alpha cells.

The pairwise correlations were tightly linked to redundancy in the retinal output, and the high response reliability (Extended Data Fig. 5b) with Fano factors mostly below unity (Extended Data Fig. 3g) again indicated a low-noise regime. As in the marmoset retina, the cells with the highest correlation and redundancy (Extended Data Fig. 5d,f), here transient-OFFα cells, displayed much more reliable responses than the most decorrelating ones (Extended Data Fig. 3h), here sustained-OFFα cells. Moreover, stronger correlations were generally associated with higher spatial contrast, in particular at short retinal distances, and this spatial-contrast-dependence of pairwise correlations was not well captured by LN models (Extended Data Fig. 5h–j). Thus, the cell-type-dependent deficiency in redundancy reduction during natural videos and the correlation-boosting characteristics of high-spatial-contrast fixations seem to be general phenomena across species.

Subunit models for natural scenes

To investigate how spatial contrast influences response correlations, we aimed at capturing the spatial-contrast sensitivity of the cells under natural stimuli in a computational model. We used a subunit model, which partitions the receptive field of a ganglion cell into smaller subunits whose outputs are nonlinearly summed. The subunits are thought to correspond to bipolar cells that provide excitatory input to ganglion cells27,28,29. To overcome challenges of previous approaches for fitting subunit models to experimental data14,28,30, such as the reliance on white-noise stimulation, we developed a new parameterized subunit model, which we call the subunit grid model (Fig. 3b). The model contains a set of identical subunits with centre-surround receptive fields and semiregular spacing for each ganglion cell. It can be efficiently fitted to responses obtained under flashed sinusoidal gratings of varying orientation and spatial frequency (Extended Data Fig. 6), which are potent stimuli for driving ganglion cells.

Fig. 3: The subunit grid model captures the nonlinear receptive field and responses to natural images.
figure 3

a, Left, schematic LN model, depicting a centre-surround spatial filter (top, here shown in one spatial dimension) and a subsequent nonlinear transformation (bottom). Right, sample stimulus frames of the spatiotemporal white noise used for fitting the LN model. b, Left, schematic subunit grid (SG) model, which sums nonlinearly transformed signals from several identical centre-surround subunits. Right, sample stimulus frames of the flashed gratings, used for fitting the SG model. c, Receptive-field mosaics of different ganglion cell types (left), with highlighted sample cells (red outlines) and their linear spatial filters (right). Darker pixels in spatial filters denote larger (positive) values. d, Obtained subunit layouts for the sample cells (left), spatial profiles of subunits (middle) and subunit nonlinearities (right). For the subunit maps, each circle corresponds to the 2σ Gaussian contour of the subunit centre, with the saturation of the outline denoting the subunit weight. Spatial profiles and nonlinearities are shown as mean and 95% confidence interval across all cells of the same type. e, Sample natural images that were flashed onto the retina (left). Output from the linear filter of the LN model (middle) and from the summed nonlinear subunits of the SG models (right) plotted against natural image responses for a single cell. ρ denotes the Spearman correlation. f, Comparison of model performance (measured by the absolute value of the Spearman correlation as in (e)) for ON parasol (ONp, n = 63), ON midget (ONm, n = 31), OFF parasol (ONp, n = 67) and OFF midget (OFFm, n = 79) cells from three retinas. Grey dot marks all cells that were unclassified (n = 193) but reliable. Error bars are median ± robust confidence interval (95%). Norm., normalized.

Source Data

The obtained subunit models showed component differences between ganglion cell types (Fig. 3c,d). For example, subunit nonlinearities of ON parasol cells had particularly high thresholds, showing stronger rectification than for OFF parasol cells, the opposite of what was expected from findings in the macaque retina21. Midget cells generally also showed substantial rectification, consistent with findings in the peripheral macaque retina31,32, but OFF midget cells additionally displayed a more linear regime around the origin of the subunit nonlinearity. Obtained subunit diameters for OFF parasol cells (around 30–40 µm; Extended Data Fig. 6i) and OFF midget cells (around 20 µm) roughly matched data on dendritic field sizes of the putative presynaptic bipolar cells in the peripheral marmoset retina33 (around 30 µm for type 3 diffuse bipolar cells and 15–20 µm for flat midget bipolar cells, respectively). For ON midget cells, on the other hand, subunits were often surprisingly large (around 50 µm), indicating that they do not represent individual bipolar cell inputs, potentially because subunit size was not well constrained by the data for these cells. Alternatively, this could reflect several midget bipolar cells combining to form individual subunits30,34 or signals from type-6 diffuse bipolar cells, which also provide input to ON midget cells35 and which have dendritic diameters of 40–80 µm in the marmoset retina33. Cell-type-specific differences in nonlinear components were also prominent in the mouse retina (Extended Data Fig. 7). Besides rectification in most cell types, we observed prominent saturation of subunit signals in transient-OFFα cells. This subunit nonlinearity, found particularly for dorsal transient-OFFα cells (Extended Data Fig. 8), is consistent with increased sensitivity to spatial homogeneity36. For most cell types, subunit models captured responses to flashed natural images better than linear receptive fields for both marmoset (Fig. 3e,f) and mouse retina (Extended Data Fig. 7). Thus, cell-type-specific models of the nonlinear receptive field can reflect retinal processing of spatial contrast in naturalistic stimuli, indicating that the different nonlinear characteristics may help explain differences in spatial-contrast-driven correlations between ganglion cell types.

To extend the analysis to dynamic stimuli, we added temporal filters to both the centre and surround of the subunits and fitted spatiotemporal subunit grid models to ganglion cell responses under sinusoidal gratings flickering in rapid succession (Fig. 4a). The obtained spatiotemporal models captured natural video responses for different types of ganglion cells, in both marmoset and mouse (Fig. 4b,c). Subunit grid models improved over simple LN models for most cell types, except for mouse sustained- and transient-OFFα cells, by reproducing additional response peaks. Response predictions of the subunit grid model also outperformed those of alternative subunit identification schemes, such as spike-triggered non-negative matrix factorization28 and spike-triggered clustering30 (Extended Data Figs. 9 and 10).

Fig. 4: The subunit grid model captures the retinal output under natural videos.
figure 4

a, Left, schematic of the flickering-gratings stimulus. Middle, subunit layout for a single ON parasol cell, shown as in Fig. 3d, and corresponding centre and surround spatial and temporal components. Right, LN model components determined with white-noise stimulation, matching roughly the union of subunit centres and their temporal profile. b, LN and SG model predictions (coloured traces) of ganglion cell responses (grey histograms) to natural videos for sample cells of different types in marmoset and mouse retina. Arrows mark response peaks captured by the SG model but not by the LN model. c, Median model performance for all identified types in marmoset and mouse. d, Left, schematic depiction of DoG LN model, showing elliptical outlines of centre and surround and corresponding temporal components. Right, comparison of median model performance between DoG LN and SG models for different cell types. e, Deviations of pairwise correlation coefficients for model predictions from the actual correlations in the data. For ce, error bars are median ± 95% robust confidence interval and number of cells (and corresponding cell pairs) are ON parasol n = 41 (269), OFF parasol n = 34 (176), ON midget n = 37 (330), OFF midget n = 53 (494), transient-OΝα n = 18 (57), transient-OFFα n = 43 (315), sustained-ONα n = 88 (1923), sustained-OFFα n = 76 (889) from three marmoset and four mouse retina pieces (eight pieces for LN model evaluation with n values given in Extended Data Fig. 5).

Source Data

Moreover, the obtained models reproduced the measured response correlations well. In particular, cell-type-specific correlations predicted by subunit grid models were much closer to the data than for LN models (Fig. 4e), which tended to overestimate these correlations as previously reported11. The lower predicted response correlations by the subunit grid model might seem counterintuitive: subunits confer sensitivity to spatial contrast, which, as we have seen, boosts correlations in the data (Fig. 2). However, this discrepancy can be explained by the inclusion of surround suppression in the subunit grid model through the subunit surround. LN models, particularly when fitted to white-noise stimuli, may underestimate the strength of the receptive-field surround37.

Nonlinearities drive correlations

To distinguish the effects of surround suppression and spatial-contrast sensitivity conveyed by the subunits, we compared the subunit grid models to difference-of-Gaussians (DoG) LN models fitted directly to the flickering grating responses, the same stimulus used for the subunit grid. DoG LN models showed better response predictions than white-noise-fitted LN models but were still outperformed by subunit grid models for certain cell types (Fig. 4d). Pairwise correlations estimated by DoG LN models matched those of the subunit grid models (Fig. 4e), confirming that surround suppression is essential for reducing the overestimation of correlations by the standard LN model and capturing the correct range of response correlations. However, the DoG LN model lacks the required spatial nonlinearities, and we therefore used the subunit grid model to investigate the observed dependence of response correlations on spatial stimulus structure.

To investigate how the nonlinear receptive field contributes to the response correlations, we separated the fixations for each cell pair according to how important nonlinear spatial processing was for determining the cells’ responses (Fig. 5a). Specifically, we tagged those fixations as nonlinear for which the predictions of the (spatially nonlinear) subunit grid model differed most from the predictions of the (spatially linear) DoG LN model. For these ‘maximally differentiating fixations’, the subunit grid model displayed superior model predictions compared to the DoG LN models for certain cell types, such as ON and OFF parasol cells in the marmoset and sustained-ONα cells in the mouse (Fig. 5b), for which capturing receptive-field nonlinearities thus mattered most. These cell types had also shown strong spatial-contrast dependence of the pairwise response correlation (Fig. 2b and Extended Data Fig. 5i).

Fig. 5: Correlated activity results from fixations that evoke nonlinear responses.
figure 5

a, Top, sample image trajectory for the presentation of a single image. Middle, corresponding model predictions for two neighbouring OFF parasol cells (receptive fields in inset) for the DoG LN and the subunit grid models. Shaded areas mark two consecutive fixations, the first classified as differentiating (the two model predictions diverge) and the second as non-differentiating (model predictions align). Bottom, responses of the two cells during the same period. b, Comparison of average model performances (R2) for the subunit grid model and the DoG LN model during the top 20% maximally differentiating fixations. c, Contributions of linear and nonlinear fixations to the total pairwise correlations for the natural video (mean with 95% confidence intervals). d, Relationship between receptive-field nonlinearity, calculated as the ratio of subunit grid over DoG LN model performance during maximally differentiating fixations, and overall pairwise response decorrelation, calculated as the difference between stimulus and response correlations relative to the stimulus correlations. For cells with zero DoG LN performance, the ratio was set to the maximum value measured across cells of the same type. * denotes significant Spearman correlation (p = 0.037, two-sided permutation test). For b and d, error bars are median ± 95% robust confidence interval and number of cell pairs are ON parasol n = 269, OFF parasol n = 176, ON midget n = 355, OFF midget n = 494, transient-OΝα n = 63, transient-OFFα n = 315, sustained-ONα n = 2040, sustained-OFFα n = 889 from three marmoset and four mouse retina pieces (eight pieces for response decorrelation with n values given in Extended Data Fig. 5).

Source Data

Thus, to determine whether the nonlinear receptive field alone might explain the difference in contributions of high- and low-contrast stimulus segments to the response correlations, we split the set of all fixations into those in which predictions of a linear and a nonlinear receptive field matched best (‘linear fixations’) and those in which the two predictions most diverged (‘nonlinear fixations’). We then assessed the relative importance of the linear and nonlinear fixations for the correlated spiking activity by calculating the contribution of each subset of fixations to the total correlation. Indeed, for parasol cells in the marmoset and nonlinear alpha cells in the mouse, nonlinear fixations contributed the most to the overall pairwise correlations, as indicated by the larger partial correlations for this set of fixations (Fig. 5c), whereas partial correlations for linear fixations were much more similar between cell types. Moreover, it is the nonlinear fixations that were responsible for the positive correlations of ON and OFF parasol cells (Fig. 5c). By contrast, linear cells, such as marmoset OFF midget cells and mouse sustained-OFFα cells, displayed more balanced partial correlations between linear and nonlinear fixations, indicating that responses during both sets of fixations contributed equally to the total correlation for these cells. We therefore conclude that fixations containing salient spatial structure, which drives particularly the nonlinear components of receptive fields, elicit concerted responses for specific types of retinal ganglion cells. Thus, across both marmoset and mouse, ganglion cells with stronger receptive-field nonlinearities tend to perform less stimulus decorrelation during stimulation with natural gaze dynamics (Fig. 5d).

Discussion

We provide direct evidence that redundancy reduction in the retina is violated in a cell-type-specific manner under natural stimuli that include gaze dynamics. Although some ganglion cell types displayed substantial decorrelation and redundancy reduction, others showed highly correlated activity. The correlations led to redundant representations and were particularly pronounced when the stimulus shifted to a new fixation that contained high spatial contrast. This concerted activity originated in nonlinear processing in the receptive fields of retinal ganglion cells, a processing feature that has been absent in many considerations of efficient coding and redundancy reduction in the retina4,7,9,19. Under the global changes in spatial stimulus patterns induced by gaze shifts, nonlinear receptive fields become simultaneously activated in a way that is not effectively suppressed by surround mechanisms. This co-activation occurs for a range of distances, as well as across preferred contrast polarity, thus even creating seemingly paradoxical positive correlations between ON and OFF cells.

The correlated activity of parasol ganglion cells in the primate retina and transient alpha cells in mouse challenges the efficient coding hypothesis. Although complete decorrelation is predicted by efficient coding only when there is no noise before the output stage of the system4, it seems unlikely that the observed high correlations directly counteract noise for efficient signal transmission. First, stimulation with temporal dynamics from natural gaze shifts drives responses with high reliability and signal-to-noise ratio. Second, the most correlated cell types show particularly reliable responses compared to the least correlated ones. Thus, a cell-type-specific role of correlations for counteracting noise is not supported. It remains possible, however, that correlations could support coding efficiency for large populations of several ganglion cell types. For example, correlations between ON parasol cells might counteract noise in midget cells for a joint efficient stimulus encoding, in particular because noise may be shared between parasol and midget cells38,39. However, it seems unclear whether their distinct downstream pathways and their differences in conduction velocity40 may support a joint coding scheme of parasol and midget cells.

The analysis of decorrelation and redundancy does not hinge on the specific stimulus aspects represented by the cells’ activity or whether cells can be described by a linear receptive field. If the task were, for example, to encode high-frequency spatial contrast without redundancy, lateral inhibition that is as sensitive to spatial contrast as centre excitation could decorrelate responses even with nonlinear receptive fields. However, ganglion cell receptive-field surrounds may differ substantially from the centre in their spatial nonlinearities31,41. To further investigate the relationship between spatial nonlinearities and redundancy reduction, it would thus be interesting to analyse how spatially nonlinear ganglion cell models should be structured to optimize coding efficiency42.

Earlier investigations of correlated retinal activity had often focused on spontaneous activity or artificial stimuli38,43,44,45,46, such as white noise. Notably, in the context of artificial stimuli, nonlinearities of receptive fields had previously been associated with strengthening decorrelation14, in contrast to our finding with natural stimuli. Studies of salamander retina with natural stimuli had also observed considerable correlations12,47, although not connected to spatial nonlinearities or gaze dynamics. Also on the basis of salamander retina, fixational eye movements had been proposed to contribute to decorrelation48, yet our data show strong correlations for stimuli that contained measured fixational eye movements.

Correlated retinal activity has been indicated to play a role in increasing spatial resolution43 and error correction47. Because retinal circuit nonlinearities have been associated with computations underlying visual feature detection49,50, we hypothesize that the response correlations in nonlinear cell types aid in signalling the detection of a relevant visual feature in natural scenes. For example, we found that mammalian direction-selective (DS) retinal ganglion cells, which are a prime example of feature detectors, have strongly nonlinear receptive fields, and their strong pronounced pairwise response correlations could even exceed stimulus correlations (Extended Data Fig. 11). Correlations may be particularly important for tagging the relevant feature, such as local spatial contrast or the preferred motion signal, and distinguishing it from changes in illumination of the receptive field. Although a single neuron’s firing rate might be confounded by light intensity or other stimulus dimensions to which the neuron is sensitive, the feature of interest may be isolated by combining the activity from groups of neurons51. Further insight into the functional consequences of correlated activity may come from assessing their dependence on stimulus context, such as average light level. Spatial nonlinearities in ganglion cell receptive fields, for example, may decrease at lower light levels52, which should result in decreased stimulus-induced correlations, and noise correlations may become more prevalent18.

Our observations of stronger nonlinearities in marmoset ON than OFF parasol cells differ from previous findings that OFF parasol cells are the more nonlinear ones in the macaque21,53. In our data, the stronger nonlinearities of ON parasol cells were observed in the subunit models fitted to flashed gratings, as well as in those fitted to flickering gratings; in the stronger improvements of response predictions for natural stimuli when nonlinearities were included; and in responses to reversing gratings (for example, Extended Data Fig. 9b). It seems feasible that this represents a species difference between macaque and marmoset but could also depend on experimental conditions, such as illumination level or stimulation of the receptive-field surround21,31,52,54. Generally, differences in nonlinearities between ON and OFF channels in the retina seem to be species and cell-type dependent25,55,56 and may reflect differences in visual tasks57.

Efficient coding is often considered a natural assumption for sensory systems because of the need to preserve energy associated with neuronal activity58. However, whether the retinal output is energy-efficient in vivo has been debated59. Moreover, feature detection might have different requirements than general information transmission, such as robustness or future prediction60, which could lead to deviations from efficient coding. In this context, the retinal code may multiplex correlated nonlinear responses containing feature information with decorrelated baseline activity. Our findings indicate that the retinal output can maintain efficiency in various stimulus contexts while being robust for feature detection. Energy constraints could also be addressed by other mechanisms, such as making responses transient, allowing the visual system to detect important features promptly. Thus, the different information channels of the retina may balance energy conservation and robust feature detection on the basis of their respective visual tasks.

Methods

Tissue preparation and electrophysiology

We recorded spiking activity from retinas of three adult male marmoset monkeys (Callithrix jacchus), 12, 13 and 18 years of age, using a single piece of retina from each animal. No previous determination of sample size was used. The retinal tissue was obtained immediately after euthanasia from animals used by other researchers, in accordance with national and institutional guidelines and as approved by the institutional animal care committee of the German Primate Center and by the responsible regional government office (Niedersächsisches Landesamt für Verbraucherschutz und Lebensmittelsicherheit, permit number 33.19-42502-04-20/3458). After enucleation, the eyes were dissected under room light, and the cornea, lens and vitreous humour were carefully removed. The resulting eyecups were then transferred into a light-tight container containing oxygenated (95% O2 and 5% CO2) Ames’ medium (Sigma-Aldrich) supplemented with 4 mM D-glucose (Carl Roth) and buffered with 20–22 mM NaHCO3 (Merck Millipore) to maintain a pH of 7.4. The container was gradually heated to 33 °C, and after at least an hour of dark adaptation, the eyecups were dissected into smaller pieces. All retina pieces used in this study came from the peripheral retina (7–10 mm distance to the fovea). The retina was separated from the pigment epithelium just before the start of each recording. All reported marmoset data are from pieces for which a 5% contrast full-field modulation at 4 Hz produced at least a 10 spikes per second modulation in the average ON parasol spike rate. This ensured high quality and light sensitivity of the analysed retina pieces (Extended Data Fig. 12).

We also recorded spiking activity from 12 retina pieces of eight wild-type female mice (C57BL/6J) between 7 and 15 weeks old (except for one 23-week-old mouse). No previous determination of sample size was used. All mice were housed on a 12-hour light/dark cycle. The ambient conditions in the animal housing room were kept at around 21 °C (20–24 °C) temperature and near 50% (45–65%) humidity. Experimental procedures were in accordance with national and institutional guidelines and approved by the institutional animal care committee of the University Medical Center Göttingen, Germany. We cut the globes along the ora serrata and then removed the cornea, lens and vitreous humour. The resulting eyecups were hemisected to allow two separate recordings. On the basis of anatomical landmarks, we performed the cut along the horizontal midline and marked dorsal and ventral eyecups. Before the start of each recording, we isolated retina pieces from the sclera and pigment epithelium.

For both marmoset and mouse retina recordings, we placed retina pieces ganglion-cell-side-down on planar multielectrode arrays (Multichannel Systems; 252 electrodes; 10 or 30 μm electrode diameter, either 60 or 100 μm minimal electrode distance) with the help of a semipermeable dialysis membrane (Spectra Por) stretched across a circular plastic holder (removed before the recording). The arrays were coated with poly-D-lysine (Merck Millipore). For some marmoset recordings, we used 60-electrode perforated arrays61. Dissection and mounting were performed under infrared light (using LEDs with peak intensity at 850 nm) on a stereo-microscope equipped with night-vision goggles. Throughout the recordings, retina pieces were continuously superfused with oxygenated Ames’ solution flowing at 8–9 ml min−1 for the marmoset or 5–6 ml min−1 for the mouse retina. The solution was heated to a constant temperature of 33–35 °C through an inline heater in the perfusion line and a heating element below the array.

Extracellular voltage signals were amplified, bandpass filtered between 300 Hz and 5 kHz and digitized at 25 kHz sampling rate. We used Kilosort62 for spike sorting. To ease manual curation, we implemented a channel-selection step from Kilosort2 by discarding channels that contained only a few threshold crossings. We curated the output of Kilosort through phy, a graphical user interface for visualization and selected only well-separated units with clear refractory periods in the autocorrelograms. In a few cases, we had to merge units with temporally misaligned templates; we aligned the spike times by finding the optimal shift through the cross-correlation of the misaligned templates.

Visual stimulation

Visual stimuli were sequentially presented to the retina through a gamma-corrected monochromatic white organic LED monitor (eMagin) with 800 × 600 square pixels and 85 Hz (marmoset) or 75 Hz (mouse) refresh rate. The monitor image was projected through a telecentric lens (Edmund Optics) onto the photoreceptor layer, and each pixel’s side measured 7.5 μm on the retina or 2.5 μm for some marmoset recordings in which we used a different light-projection setup61. All stimuli were presented on a background of low photopic light levels, and their mean intensity was always equal to the background. To estimate isomerization rates of photoreceptors, we measured the output spectrum of the projection monitors and the irradiance at the site of the retina and combined this information with the absorbance profile63 and peak sensitivities of the opsins (543–563 nm for different marmoset M-cones and 499 nm for marmoset rods64,65; 498 nm for mouse rods) and with the collecting areas of photoreceptors, using 0.5 μm2 for mouse rods66, 0.37 μm2 for marmoset cones and 1.0 μm2 for marmoset rods, applying here the values from macaque cones and rods67,68. For the marmoset, the background light intensity resulted in approximately 3,000 photoisomerizations per M-cone per second and approximately 6,000 photoisomerizations per rod per second, and for the mouse, approximately 4,000 photoisomerizations per rod per second. We fine-tuned the focus of stimuli on the photoreceptor layer before the start of each experiment by visual monitoring through a light microscope and by inspection of spiking responses to contrast-reversing gratings with a bar width of 30 μm.

Receptive-field characterization

To characterize functional response properties of the recorded ganglion cells, we used a spatiotemporal binary white-noise stimulus (100% contrast) consisting of a checkerboard layout with flickering squares, ranging from 15 to 37.5 μm on the side in different recordings. The stimulus update rate ranged from 21.25 to 85 Hz. Each stimulus cycle consisted of a varying training stimulus and a repeated test stimulus, with 18–55 cycles presented in total. The training stimulus duration ranged from 45 to 144 s in different experiments. The test stimulus consisted of a fixed white-noise sequence ranging from 16 s to 18 s, which we used here to determine noise entropies and noise correlations.

We calculated spike-triggered averages (STAs) over a 500 ms time window and extracted spatial and temporal filters for each cell as previously described69. In brief, the temporal filter was calculated from the average of spatial STA elements whose absolute peak intensity exceeded 4.5 robust standard deviations of all elements. The robust standard deviation of a sample is defined as 1.4826 times the median absolute deviation of all elements, which aligns with the standard deviation for a normal distribution. The spatial receptive field was obtained by projecting the spatiotemporal STA on the temporal filter. We also calculated spike-train autocorrelation functions under white noise, using a discretization of 0.5 ms. For plotting and subsequent analyses, all autocorrelations were normalized to unit sum.

For each cell, a contour was used to summarize the spatial receptive field. We upsampled the spatial receptive field to single-pixel resolution and then blurred it with a circular Gaussian of σ = 4 pixels. We extracted receptive-field contours using MATLAB’s ‘contourc’ function at 25% of the maximum value in the blurred filter. In some cases, noisy STAs would cause the contour to contain points that lay further away from the actual spatial receptive field. Thus, we triaged the contour points and removed points that exceeded 20 robust standard deviations of all distances between neighbours of the points that were used to define the contour. This process typically resulted in a single continuous area without holes. The centre of each receptive field was defined as the median of all contour points, and its area was determined by the area enclosed by the contour.

Ganglion cell type identification

We used responses to a barcode stimulus70 to cluster cells into functional types in each single recording. The barcode pattern had a length of 12,750 (or 12,495) µm and was generated by superimposing sinusoids of different spatial frequencies (f) with a 1/f weighting. The constituent sinusoids had spatial frequencies between 1/12,750 (or 1/12,495) and 1/120 μm−1 (separated by 1/12,750 or 1/12,495 μm−1 steps, respectively) and had pseudorandom phases. The final barcode pattern was normalized so that the brightest (and dimmest) values corresponded to 100% (and −100%) Weber contrast from the background. The pattern moved horizontally across the screen at a constant speed of 1,275 (or 1,125) μm s−1, and the stimulus was repeated 10–20 times. Obtained spike trains were converted into firing rates using 20 ms time bins and Gaussian smoothing with σ = 20 ms. We quantified cell reliability with a symmetrized coefficient of determination (R2), as described previously36. We only included cells with a symmetrized R2 value of at least 0.1 and that were not putative DS cells (see below).

We used average responses to the barcode stimulus to generate a pairwise similarity matrix, as described previously70. We defined the similarity between each pair of cells as the peak of the cross-correlation function (normalized by the standard deviations of the two signals) between the spike rate profiles of the two cells. To obtain a final similarity matrix, we multiplied the entries of the barcode similarity matrix with the entries of three more similarity matrices, obtained from receptive-field response properties. The first two were generated by computing pairwise correlations between both the temporal filters and the autocorrelation functions of each cell. The third one used receptive-field areas and was defined as the ratio of the minimum of the two areas over their maximum (Jaccard index).

We converted the combined similarity matrix to a distance matrix by subtracting each entry from unity. We then computed a hierarchical cluster tree with MATLAB’s ‘linkage’ function, using the largest distance between cells from two clusters as a measure for cluster distance (complete linkage). The tree was used to generate 20–50 clusters; we chose the number depending on the number of recorded cells. This procedure yielded clusters with uniform temporal components and autocorrelations and with minimally overlapping receptive fields but typically resulted in oversplitting functional ganglion cell types. Thus, we manually merged clusters with at least two cells on the basis of the similarity of properties used for clustering and on receptive-field tiling. To incorporate cells that were left out of the clustering because of the barcode quality criterion, we expanded the clusters obtained after merging. An unclustered cell was assigned to a cluster if its Mahalanobis distance from the centre of the cluster was at most 5 but at least 10 for all other clusters. Our method could consistently identify types with tiling receptive fields forming a mosaic over the recorded area. This was generally the case for parasol and midget cells in the marmoset and the different alpha cells in the mouse retina, which are the ones primarily analysed in this work. Note, though, that mosaics are typically incomplete because of missed cells whose spikes were not picked up by the multielectrode array (or not sufficiently strong to allow reliable spike sorting). As is common with this recording technique, missed cells are more frequent for some cell types than for others, and this recording bias renders, for example, midget cell mosaics less complete than those of parasol cells. The present analyses, however, do not rely on the recovery of complete mosaics, as the pairwise investigations of correlation and redundancy only require sufficient sampling of pairs at different distances.

Matching cell types to mouse ganglion cell databases

We validated the consistency of cell type classification by examining cell responses to a chirp stimulus24, which was not used for cell clustering. Light-intensity values of the chirp stimulus ranged from complete darkness to the maximum brightness of our stimulation screen. The stimulus was presented 10–20 times. For the mouse retina, the parameters of the chirp stimulus matched the original description, which allowed us to compare cell responses to calcium traces in a database of classified retinal ganglion cells24. To convert spike rates to calcium signals, we convolved our spiking data with the calcium kernel reported in the original paper. We then computed correlations to the average traces of each cluster in the database.

For some mouse experiments, we also used responses to spot stimuli23. In brief, we flashed one-second-long spots over the retina at different locations and with five different spot diameters (100, 240, 480, 960 and 1,200 μm). Between spot presentations, illumination was set to complete darkness, and the spots had an intensity of 100–200 photoisomerizations per rod per second. For each cell, we estimated a response centre by identifying which presented spot location yielded the strongest responses when combining all five spot sizes. We only used cells whose estimated response centre for the spots lay no further than 75 μm from the receptive-field centre as determined with white-noise stimulation. To calculate similarities to the cell types in the available database23, we concatenated firing-rate responses to the five spot sizes and then calculated correlations with the available database templates. We also applied saccade-like shifted gratings to detect image-recurrence-sensitive cells in mouse retinas as previously described36,71. These cells correspond to the transient-OFFα cells in the mouse retina.

Natural videos, LN model predictions and response correlations

For the marmoset retina, we constructed natural videos in similar fashion as previously done for the macaque retina30,72. In brief, the videos consisted of 347 grayscale images that were shown for 1 s each and jittered according to measurements of eye movements obtained from awake, head-fixed marmoset monkeys16 (graciously provided by J. L. Yates and J. F. Mitchell; personal communication). These eye movement data had been collected at a scale of about 1.6 arcmin per pixel, which roughly corresponds to 2.67 µm on the marmoset retina, using a retinal magnification factor of 100 μm deg−1 (ref. 73). To align the sampled traces with the resolution of our projection system, we adjusted the pixel size of the gaze traces to 2.5 µm when presenting the natural video. We furthermore resampled the original 1,000 Hz gaze traces to produce a video with a refresh rate of 85 Hz. The presented natural video consisted of 30–35 cycles of varying training and repeated test stimuli. Test stimuli consisted of 22 distinct natural images, using the original grayscale images (graciously provided by J. L. Yates and J. F. Mitchell; personal communication) viewed by the marmosets during eye movement tracing (mean intensity −10% relative to background; 38% average contrast, calculated as the standard deviation across all pixels for each image). Each test image was paired with a unique movement trajectory given by the marmoset eye movements. For each training stimulus cycle, we presented 40 images out of the 325 remaining images (sampled with replacement), each paired with a unique movement trajectory. These 325 images were obtained from the van Hateren database74, were multiplicatively scaled to have the same mean intensity as the background and had an average contrast of 45%.

For the mouse retina, we applied a similar procedure. In brief, the videos consisted of the same 325 images from the van Hateren database as used for the marmoset training stimulus, shown for 1 s each and jittered according to the horizontal gaze component22 of freely moving mice (graciously provided by A. Meyer; personal communication). We resampled the original 60 Hz gaze traces to produce a video with a refresh rate of 75 Hz. For our recordings, the one-dimensional gaze trajectory was randomly assigned to one of four orientations (0, 45, 90 or 135 degrees) for each 1 s image presentation. The amplitude of the original movement was transformed into micrometre on the retina using a retinal magnification factor of 31 μm deg−1 for the mouse. All images were multiplicatively scaled to have the same mean intensity as the background. Test stimuli consisted of 30–35 cycles of 25 distinct natural images, paired with unique movement trajectories. The training stimuli consisted of batches of 35 images out of the remaining 300 (sampled with replacement), each paired with a unique movement trajectory.

For the model-based analyses of the responses to the natural videos, we applied a temporal binning corresponding to the stimulus update frequency (85 Hz for the marmoset and 75 Hz for the mouse) and used the spike count in each bin. To extract firing rates for the test stimuli, we averaged the binned responses over repeats. Furthermore, to eliminate cells with noisy responses, we only used cells for subsequent analyses with a symmetrized R2 of at least 0.2 between even and odd trials of the test set.

All model predictions for natural videos used the stimulus training part for estimating an output nonlinearity and the test part for evaluation of model performance. For the LN model, we obtained the spatiotemporal stimulus filter (decomposed into a spatial and a temporal filter as explained in ‘Receptive-field characterization’) from the spatiotemporal white-noise experiments but estimated the nonlinearity from the natural-video data. To do so, we projected the video frames onto the upsampled spatial filter (to single-pixel resolution) and then convolved the result with the temporal filter. The output nonlinearity was obtained as a histogram (40 bins containing the same number of data points across the range of filtered video-stimulus signals) containing the average filtered signal and the average corresponding spike count. To apply the nonlinearity to the test data, we used linear interpolation of histogram values. We estimated model performance using the coefficient of determination between model prediction and measured firing rate to obtain the fraction of explained variance (R2). Negative values were clipped to zero.

We calculated video response correlations, using the trial-averaged firing rates of the test stimulus, as the Pearson correlation coefficient between the firing rates for each cell pair of the same type (as well as across specific types). We performed the same analyses for model predictions and for calculating correlations inherent to the test stimulus, where we calculated pairwise correlations of the light intensity of 5,000 randomly selected pixels. Decorrelation was defined for each cell pair as the difference between stimulus and response correlation relative to stimulus correlation for a pixel distance matching that of the actual cell pair distance. To generate correlation–distance curves (Fig. 1e), we sorted cell pairs by ascending distance and averaged pair correlations over groups of 20–60 pairs (depending on cell type, using fewer pairs per bin when the number of available cells was small).

Spike-train information and fractional redundancy

To estimate whether pairwise correlations led to coding redundancy, we quantified the information contained in ganglion cell spike trains by measuring entropies of response patterns in temporal-frequency space by evaluating the Fourier transforms of the response patterns75,76. For temporal patterns that are sufficiently long compared to the time scales of correlations, this approach allows treating the different frequency modes independently and approximating, through the central limit theorem, the empirical distribution of Fourier components by normal distributions whose entropies can be analytically computed75,76. This greatly reduces the sampling problem of information-theoretic evaluations encountered by direct methods of computing entropies through empirical frequencies of different response patterns77.

Here we applied the method to spike-train responses (binned at 0.4 ms) from the repeated parts of the natural video (or white noise) and divided them into 0.8-s-long non-overlapping sections separately for each stimulus trial. This process yielded 27 (or 31) sections for the marmoset (or mouse) natural video and 20–22 sections for white noise for each trial (around 55 white-noise trials for marmoset and 40 for mouse recordings). The selection of the section length aimed at having comparable numbers of sections per trial and trials per section (both around 30) in the natural video analysis to mitigate bias effects from limited data in the calculation of information rates. For each section (s) and trial (t), we performed a Fourier transform to obtain complex-valued frequency coefficients that were then separated into real-valued cosine (ccos) and sine (csin) coefficients for each frequency (f). For a single-cell analysis, we then estimated signal (Hsignal) and noise (Hnoise) entropies by computing the variance of those coefficients either over sections of a given trial or over trials for a given section, respectively, and then averaging the variances over the remaining dimension (that is, trials or sections, respectively):

$${H}_{\text{signal}}=\frac{1}{2}{\log }_{2}[2{\rm{\pi }}e({V}_{\cos }^{s}+{V}_{\sin }^{s})]$$
$${H}_{\text{noise}}(f)=\frac{1}{2}{\log }_{2}[2{\rm{\pi }}e({V}_{\cos }^{t}+{V}_{\sin }^{t})]$$

where, for example, \({V}_{\sin }^{s}={\langle {{\rm{V}}{\rm{a}}{\rm{r}}({c}_{\sin }(f))}_{s}\rangle }_{t}\) denotes the variance of sine coefficients (subscript ‘sin’) over sections (superscript ‘s’), averaged over trials.

The frequency-resolved information rate was calculated as the difference of signal and noise entropies, normalized by the duration of the applied response sections to obtain information per time. The total information rate was then obtained as the sum over frequencies. For this sum, we applied an upper cutoff at 200 Hz, because signal and noise entropies had converged to the same baseline level by then.

For estimating the information content of a cell pair, we proceeded analogously, but instead of computing the variances of the sine and cosine coefficients directly, we first gathered the sine and cosine coefficients from both cells to compile the corresponding 4 × 4 covariance matrix over sections (or trials) and averaged the covariance matrices over trials (or sections). We then used the four eigenvalues (\({\lambda }_{k}\)) of each averaged covariance matrix to calculate the response entropy for a cell pair (separately for signal and noise):

$${H}_{\text{pair}}(f)=\frac{1}{2}{\log }_{2}\left[2{\rm{\pi }}e\left(\mathop{\sum }\limits_{k=1}^{4}{\lambda }_{k}(f)\right)\right]$$

Information rates were again obtained as the difference between signal and noise entropy, summed over frequencies and normalized by the duration of the response sections. To check for bias from finite data in the calculation of information rates78, we also computed information rates for different fractions of the full dataset but observed little systematic dependence on the size of the data fraction. This is for two reasons. First, the possibility to obtain entropies analytically only after estimating the variances of the Fourier components greatly limits the sampling problem, and second, the comparable numbers of trials and sections used in the estimation of entropies mean that any residual bias is of similar scale for signal and noise entropies, thus leading to at least partial cancellation.

To obtain the contributions of individual frequency bands to the information rates, we used the same approach as above separately for each frequency component without summation over frequencies.

Fractional redundancy for a cell pair (i, j) was calculated on the basis of a previous definition12,19 as the difference between the sum of single-cell information values (Ii and Ij) and pair information (Iij) normalized by the minimum single-cell information:

$${C}_{{ij}}\,=\,\frac{{I}_{i}\,+\,{I}_{j}-{I}_{{ij}}}{\min ({I}_{i},\,{I}_{j})}$$

Other definitions of redundancy, in particular in the context of efficient coding, are based on a comparison of the actual information passed through an information channel (here corresponding to the joint responses of the cell pair and their information rate Iij) and the channel capacity: that is, the maximum information that the channel could supply2,79. In practice, however, channel capacity is difficult to assess and requires fundamental assumptions about the neural code and attainable firing rates. Instead, the comparison of Iij to the sum of single-cell information rates, as used here, can be thought of as capturing whether the capacity as specified by the constraints of the individual cells’ response characteristics is fully exhausted by the joint responses. Thus, fractional redundancy is sensitive to inefficient use of the channel capacity that stems from correlation but not from inefficient coding by individual cells.

Analysis of fixations and spatial contrast

To investigate the effects of spatial contrast on response correlations, we divided the test part (repeated image sequences) of the natural video into distinct fixations by detecting saccadic transitions. To do so, we first marked each time point when a new image was presented as a transition. In each image presentation, we calculated the distance between consecutive positions to estimate the instantaneous eye velocity and used MATLAB’s ‘findpeaks’ function to obtain high-velocity transitions. We constrained peak finding for the marmoset (and mouse) to a minimum peak time interval of 47 (and 53) ms and a minimum amplitude of 10 (and 300) deg s−1. This process yielded 80 fixations for the marmoset and 68 fixations for the mouse video. Fano factors were computed for individual fixations. To reduce effects of nonstationary activity, we included here only cells with a positive symmetrized coefficient of determination between the firing-rate profiles of the first half and second half of trials. To mitigate noise from fixations with no or little activity, we excluded, for each cell, fixations with fewer than three spikes on average and report the average Fano factor over fixations, weighted by the mean spike count.

For each video frame and each ganglion cell, spatial contrast was calculated as described previously36 using the standard deviation of pixels inside the cell’s receptive field, weighted by the receptive-field profile. For each fixation, we assigned to each cell the median spatial contrast of all frames during the fixation period. We also assigned a linear activation per fixation, estimated by filtering video frames with the spatial filter obtained from white noise and taking the median over all fixation frames.

To reduce effects of the light level on the analysis of spatial contrast, we aimed at separating the fixations into high-spatial-contrast and low-spatial-contrast groups while balancing the linear activation between the groups. For a pair of cells, we therefore sorted all fixations of the test set by the average linear activation across both cells and paired neighbouring fixations in this sorted list. This led to 40 pairs (34 for the mouse), and for each pair, we assigned the fixation with the higher spatial contrast to the high-spatial-contrast group and the other fixation to the low-spatial-contrast group. To expand the pairwise correlation (rpair) into high- and low-spatial-contrast parts, we split the numerator of the Pearson correlation coefficient so that rpair = rhigh + rlow, with

$${r}_{{\rm{high}}}=\frac{{\sum }_{i\in {\rm{high}}}({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{({N}_{{\rm{frames}}}-1){\sigma }_{X}\,{\sigma }_{Y}}$$

with x and y corresponding to the responses of the two cells and i indexing the frames of the natural video, the sum here running over the frames from high-spatial-contrast fixations and Nframes denoting the total number of frames. Mean (\(\bar{x}\), \(\bar{y}\)) and standard deviation (\({\sigma }_{X}\), \({\sigma }_{Y}\)) values correspond to the length of the entire test part of the video.

Extraction of DS ganglion cells

To identify DS ganglion cells in the mouse retina, we used drifting sinusoidal gratings of 100% contrast, 240 μm spatial period and a temporal frequency of 0.6 Hz, moving along eight different, equally spaced directions. We analysed cell responses as previously described36. Cells with a mean firing rate of at least 1 Hz and a direction selectivity index (DSI) of at least 0.2 (significant at 1% level) were considered putative DS cells. The DSI was defined as the magnitude of the normalized complex sum \(\sum _{\theta }{r}_{\theta }{e}^{i\theta }\,/\sum _{\theta }{r}_{\theta }\), with θ specifying the drift direction and rθ the average (across trials) spike count during the grating presentation for direction θ (excluding the first grating period). The preferred direction was obtained as the argument of the same sum. The statistical significance of the DSI was determined through a Monte Carlo permutation approach28,36.

To separate ON from ON–OFF DS cells, we used a moving-bar stimulus. The bar (width: 300 μm, length: 1,005 μm) had 100% contrast and was moved parallel to the bar orientation in eight different directions with a speed of 1,125 μm s−1. We extracted a response profile to all bar directions through singular value decomposition, as previously described24 and calculated an ON–OFF index to determine whether cells responded only to the bar onset (ON) or to both onset and offset (ON–OFF). Cells with an ON–OFF index (computed as the difference of onset and offset spike-count responses divided by their sum) above 0.4 were assigned as ON DS cells and were grouped into three clusters on the basis of their preferred directions.

Flashed gratings

Depending on the experiment, we generated 1,200 to 2,400 different sinusoidal gratings with 25 or 30 different spatial frequencies (f), with half-periods between 15 and 1,200 μm, roughly logarithmically spaced. For each grating, we generated 12 or 10 equally spaced orientations (θ) and four or eight equally spaced spatial phases (\(\varphi \)). For a given grating, the contrast value for each pixel with (x, y) coordinates were generated according to the following equation:

$$C(x,y)=\sin (2{\rm{\pi }}f(x\cos \theta +y\sin \theta )+\varphi )$$

Gratings were presented as 200 ms flashes on the retina, separated by a 600 or 800 ms grey screen. The order of presentation was pseudorandom. We collected spike-count responses to the flashes by counting spikes during stimulus presentation for the marmoset or 20 ms after stimulus onset up to 20 ms after stimulus offset for the mouse. We used tuning surfaces to summarize responses (Extended Data Fig. 6), which we generated by averaging responses over trials and spatial phases for each frequency–orientation pair. In the mouse recordings, in which we typically collected four to five trials per grating, we calculated symmetrized R2 values for the spike counts, and we only used cells with an R2 of at least 0.2 for further analyses. In marmoset recordings, we typically collected one to two trials per grating, and we thus used no exclusion criterion.

DoG subunits

The subunit grid model consists of DoG subunits, and fitting its parameters to data is facilitated by an analytical solution of the DoG activation by a grating. The latter was obtained by considering the grating activations of both centre and surround elliptical Gaussians on the basis of previous calculations80, as described in the following. The DoG receptive field was defined with these parameters: standard deviations \({\sigma }_{x}\) and \({\sigma }_{y}\) at the x and y axis, the orientation of the x axis θDoG, the spatial scaling for the subunit surround ks and a factor determining the relative strength of the surround ws. Concretely, the response of a DoG receptive field (rDoG) centred at (xo, yo) to a parametric sinusoidal grating (\(f,\theta ,\varphi \)) is

$${r}_{{\rm{D}}{\rm{o}}{\rm{G}}}(\,f,\theta ,\varphi \,;{x}_{{\rm{o}}},{y}_{{\rm{o}}},{\sigma }_{x},{\sigma }_{y},{{\theta }_{{\rm{D}}{\rm{o}}{\rm{G}}},k}_{{\rm{s}}},{w}_{{\rm{s}}})={A}_{{\rm{D}}{\rm{o}}{\rm{G}}}(\,f;{\sigma }_{x},{\sigma }_{y},{{\theta }_{{\rm{D}}{\rm{o}}{\rm{G}}},k}_{{\rm{s}}},{w}_{{\rm{s}}})\times \cos {\Theta }_{{\rm{D}}{\rm{o}}{\rm{G}}}(\,f,\theta ,\varphi \,;{x}_{{\rm{o}}},{y}_{{\rm{o}}})$$

with the amplitude ADoG given by

$${A}_{{\rm{DoG}}}(\,f;\sigma ,{k}_{{\rm{s}}},{w}_{{\rm{s}}})={e}^{-2{\rm{\pi }}{\sigma }_{{\rm{DoG}}}^{2}{f}^{2}}-{w}_{{\rm{s}}}{e}^{-2{\rm{\pi }}{({k}_{{\rm{s}}}{\sigma }_{{\rm{DoG}}})}^{2}{f}^{2}}$$

with

$${\sigma }_{{\rm{DoG}}}=\sqrt{{\sigma }_{y}^{2}\,{\sin }^{2}(\theta +{\theta }_{{\rm{DoG}}})+{\sigma }_{x}^{2}\,{\cos }^{2}(\theta +{\theta }_{{\rm{DoG}}})}$$

The receptive-field phase \({\Theta }_{{\rm{DoG}}}\) is given by

$${\Theta }_{{\rm{D}}{\rm{o}}{\rm{G}}}(\,f,\theta ,\varphi \,;{x}_{{\rm{o}}},{y}_{{\rm{o}}})=2{\rm{\pi }}f\sqrt{{x}_{{\rm{o}}}^{2}+{y}_{{\rm{o}}}^{2}\,}\cos \left(\theta -{\tan }^{-1}\frac{{y}_{{\rm{o}}}}{{x}_{{\rm{o}}}}\right)+\varphi -{\rm{\pi }}/2$$

DoG LN model

We fitted parameterized DoG LN models to the measured grating responses. The full model combined the DoG receptive-field activation with an output nonlinearity, for which we chose a logistic function \(N(x)={(1+{e}^{-x})}^{-1}\). The model response (R), denoting the modelled neuron’s firing rate, was thus given by

$$R=aN({\beta }_{{\rm{DoG}}}{r}_{{\rm{DoG}}}+{\gamma }_{{\rm{DoG}}})$$

where \({\beta }_{{\rm{DoG}}}\) and \({\gamma }_{{\rm{DoG}}}\) are parameters determining the steepness and threshold of the output nonlinearity and \(a\) is a response scaling factor.

All model parameters (\({x}_{{\rm{o}}},{y}_{{\rm{o}}},{\sigma }_{x},{\sigma }_{y},{\theta }_{{\rm{DoG}}},{k}_{{\rm{s}}},{w}_{{\rm{s}}},\,{{\beta }_{{\rm{DoG}}},\gamma }_{{\rm{DoG}}},a\)) were optimized simultaneously by minimizing the negative Poisson log-likelihood, using constrained gradient descent in MATLAB with the following constraints: \({\sigma }_{x},{\sigma }_{y} > 7.5\,{\rm{\mu }}{\rm{m}},\,-{\rm{\pi }}/4 < {\theta }_{{\rm{DoG}}} < {\rm{\pi }}/4,\) \(1 < {k}_{s} < 6,a > 0\). Each trial was used independently for fitting.

Subunit grid model

We fitted all subunit grid models with 1,200 potential subunit locations, placed on a hexagonal grid around a given receptive-field centre location. The centre was taken as the centre of a fitted DoG model. The subunits were spaced 16 μm apart. Each subunit had a circular DoG profile with a standard deviation of \(\sigma \) (centre Gaussian) and centred at (\({x}_{{\rm{os}}},{y}_{{\rm{os}}}\)), and its activation in response to a grating was given by

$${r}_{{\rm{s}}}(\,f,\theta ,\varphi \,;{x}_{{\rm{o}}{\rm{s}}},{y}_{{\rm{o}}{\rm{s}}},\sigma ,{k}_{{\rm{s}}},{w}_{{\rm{s}}})={A}_{{\rm{s}}}(\,f;\sigma ,{k}_{{\rm{s}}},{w}_{{\rm{s}}})\times \cos {\Theta }_{{\rm{s}}}(\,f,\theta ,\varphi \,;{x}_{{\rm{o}}{\rm{s}}},{y}_{{\rm{o}}{\rm{s}}})$$

where both amplitude and phase are given by the DoG receptive-field formulas with \({\sigma }_{x}={\sigma }_{y}=\sigma \).

The full response model was

$${R}_{{\rm{SG}}}=\,G\left(\mathop{\sum }\limits_{s=1}^{{N}_{{\rm{sub}}}}{w}_{{\rm{s}}}N(\beta {r}_{{\rm{s}}}+\gamma )\right)$$

where \(N(x)={(1+{e}^{-x})}^{-1}\) is a logistic function, \(\beta \) and \(\gamma \) are parameters determining the steepness and threshold of the subunit nonlinearity, Nsub is the number of subunits with non-zero weights, \({w}_{{\rm{s}}}\) are positive subunit weights, and \(G\) is a Naka–Rushton output nonlinearity \(G(x)=a{x}^{n}/({x}^{n}+{k}^{n})+b\), with non-negative parameters \({{\rm{\theta }}}_{{\bf{out}}}=(a,b,n,k)\).

Fitting and model selection

We optimized subunit grid models using the stochastic optimization method ADAM81 with the following parameters: batch size = 64, β1 = 0.9, β2 = 0.999, ε = 10−6. For the learning rate (η), we used a schedule with a Gaussian profile of μ = Nepochs/2 and σ = Nepochs/5: this led to a learning rate that was low in the beginning of the training, peaked midway and was lowered again towards the end. Peak learning rate was set to ηmax = 0.005. The number of epochs (Nepochs) was fixed for all cells to 4 × 105/Ntrials, with Ntrials representing the number of all grating presentations used for fitting, which typically resulted in 50–150 epochs.

To enforce parameter constraints during fitting, such as non-negativity, we used projected gradient descent. We also aimed at regularizing the parameter search in a way that non-zero subunit weights were penalized more strongly when other subunits with non-zero weights were spatially close. We therefore introduced a density-based regularizer that controlled the coverage of the receptive field with a flexible number of subunits (Extended Data Fig. 6).

Concretely, the cost function we minimized was

$$-\frac{1}{{N}_{{\rm{sp}}}}{\rm{ln}}L({{\bf{s}}}_{{\bf{G}}},{{\bf{r}}}_{{\bf{G}}}\,;{{\boldsymbol{\theta }}}_{{\bf{p}}},{\bf{w}})+\lambda \mathop{\sum }\limits_{s=1}^{{N}_{{\rm{sub}}}}{w}_{s}\sum _{i\ne s}\frac{{w}_{i}}{{d}_{si}^{2}}$$

with Nsp being the total number of spikes, L the Poisson likelihood, sG the vector of all grating parameters used, rG the corresponding spike-count response vector, \({{\rm{\theta }}}_{{\bf{p}}}\) = (\(\sigma ,{k}_{s},{w}_{s},\,\beta ,\,\gamma ,\,{{\rm{\theta }}}_{{\bf{out}}}\)) all the shared model parameters, and \({\bf{w}}=({w}_{1},\,\ldots ,\,{w}_{{N}_{{\rm{sub}}}})\) the vector containing all subunit weights. \(\lambda \) controls the regularization strength, which depends on the pairwise subunit distances dsi.

After the end of the optimization, we pruned subunit weights with small contributions or weights that ended up outside the receptive field. To do so, we first set to zero every weight smaller than 5% of the maximum subunit weight. We then fitted a two-dimensional Gaussian to an estimate of the receptive field, obtained by summing subunit receptive fields weighted by the subunit weights. The weight corresponding to any subunit centre lying more than 2.5σ outside that Gaussian was set to zero. To ensure proper scaling of the output nonlinearity after weight pruning, we refitted the output nonlinearity parameters along with a global scaling factor for the weights.

We typically fitted six models per cell with different regularization strengths \(\lambda \) ranging from 10−6 to 5 × 10−4. To select for the appropriate amount of regularization, we only accepted models that yielded at least three subunits and had a low receptive-field coverage (less than 3; see below). If no eligible model was fitted, cells were excluded from further analyses. Among the remaining models, we selected the one that minimized the Bayesian information criterion, which we defined for the subunit grid model as

$${N}_{{\rm{sub}}}{\rm{ln}}({N}_{{\rm{data}}})-2{\rm{ln}}(L)$$

where Nsub is again the number of subunits with non-zero weights, Ndata is the number of grating-response pairs used to fit the model and L the likelihood of the fitted model. The selected model balanced good prediction performance and realistic receptive-field substructure. Note, though, that the actual size and layout of the subunits might not be critical to obtain good model performance31 as long as appropriate spatial nonlinearities are included, and that the subunit nonlinearities are generally better constrained by the data than the subunits themselves (Extended Data Fig. 6).

Parameter characterization of the subunit grid model

To summarize how densely subunits covered a cell’s receptive field, we defined a measure for the subunit coverage. It was calculated as the ratio A/B, where A was the subunit diameter (4σ of the centre Gaussian) and B was the average distance between subunit centre points. For a particular cell, the average subunit distance was calculated as the average over all nearest-neighbour distances, weighted by each pair’s average subunit weight. If fewer than three subunits had non-zero weights in the model, no coverage value was computed.

To plot and characterize subunit nonlinearities, we first added an offset so that an input of zero corresponded to zero output. We then scaled the nonlinearities so that the maximum value over the input range [−1, 1] was unity. Following offsetting and scaling, we calculated nonlinearity asymmetries to quantify the response linearity of subunits as (1 − M)/(1 + M), where M is the absolute value of the minimum of the nonlinearity over the input range \([-1,\,1]\).

Natural images and response predictions

We flashed a series of 220 (or 120) natural images to the retina, as described previously36. We used images from the van Hateren database, which were cropped to their central 512 × 512 pixel square and presented over the multielectrode array at single-pixel resolution. All images were multiplicatively scaled to have the same mean intensity as the background. Interspersed with the natural images, we also presented artificial images. The images were generated as black-and-white random patterns at a single-pixel level and then blurred with Gaussians of eight different spatial scales29, but the corresponding responses were not analysed as part of this study. All images were flashed for 200 ms, separated by either 600 or 800 ms of grey background illumination. Images were flashed in a randomized order, and we typically collected eight trials per image. Average spike counts were calculated in the same way as in the case of flashed gratings, and only cells with symmetrized R2 of at least 0.2 were used for further analyses.

To calculate response predictions for models built with white noise, we used the output of spatial filters applied to the natural images. The filters were upsampled to match the resolution of the presented images and normalized by the sum of their absolute values. For models obtained from responses to flashed gratings, DoG receptive fields were instantiated at single-pixel resolution, and the natural images were then projected onto the DoG receptive fields. For the subunit model, the subunit filter outputs were passed through the fitted subunit nonlinearity and then summed while applying the subunit weights. The performance for each model was calculated as the Spearman rank correlation ρ between the model output (without an explicit output nonlinearity) and cell responses to the natural images28.

Flickering gratings and spatiotemporal DoG LN models

We generated 3,000 (or 4,800) different gratings with 25 (or 30) different spatial frequencies, between 7.5 and 1,200 μm half-periods, roughly logarithmically spaced. For each grating, we generated 20 orientations and six (or eight) spatial phases. The gratings were presented in pseudorandom sequence, updated at a 85 (or 75) Hz refresh rate. Every 6,120 (or 3,600) frames, we interleaved a unique sequence of 1,530 (or 1,200) frames that was repeated throughout the recording to evaluate response quality.

We fitted a spatiotemporal DoG LN model to the grating responses. The temporal filters spanned a duration of 500 ms and were modelled as a linear combination of ten basis functions. The response delay was accounted for with two square basis functions for each of the two frames before a spike. The remaining eight were chosen from a raised cosine basis, with peaks ranging from 0 to 250 ms before a spike.

Concretely, the spatiotemporal DoG model had the form

$$R=aN({{\bf{r}}}_{{\rm{C}}}^{{\rm{T}}}{{\bf{k}}}_{{\rm{Ct}}}+{{\bf{r}}}_{{\rm{S}}}^{{\rm{T}}}{{\bf{k}}}_{{\rm{St}}}+b)$$

where \(N(x)={(1+{e}^{-x})}^{-1}\) is a logistic function, kCt and kSt are separate temporal filters for the centre and the surround, b determines the baseline activation, and \(a\) is a response scaling factor. The vectors rC and rS contain DoG receptive-field activations for 500 ms before a particular frame and were calculated as for the flashed gratings. The model was fitted with nonlinear constrained optimization, with DoG constraints identical to the case of flashed gratings and \(a > 0\).

Spatiotemporal subunit grid model

We also fitted a spatiotemporal subunit grid model to the grating responses. Our strategy was similar to the case of flashed gratings. We fitted each subunit grid model with 1,200 subunit locations, placed in a hexagonal grid around a given receptive-field centre location. The centre was taken as the fitted centre of the DoG model. The subunits were spaced 16 μm apart. The model response (R) was given by

$$R=G\left(\mathop{\sum }\limits_{s=1}^{{N}_{{\rm{sub}}}}{w}_{{\rm{s}}}N({{\bf{r}}}_{{\rm{C}}}^{{\rm{T}}}{{\bf{k}}}_{{\rm{Ct}}}+{{\bf{r}}}_{{\rm{S}}}^{{\rm{T}}}{{\bf{k}}}_{{\rm{St}}}+\gamma )\right)$$

where \({w}_{s}\) are non-negative subunit weights, \(N(x)\) is a logistic function, kCt and kSt are separate temporal filters for the centre and the surround shared across all subunits, and \(\gamma \) determines the nonlinearity threshold. We used a Naka–Rushton output nonlinearity \(G(x)=a{x}^{n}/({x}^{n}+{k}^{n})\), with non-negative parameters \({{\rm{\theta }}}_{{\bf{out}}}=(a,n,k)\). The vectors rc (rs) contain Gaussian centre (surround) subunit activations for 500 ms before a particular frame and for each subunit. The parameters required to fit DoG subunits are the standard deviation of the centre, the scaling for the subunit surround and a factor determining the relative strength of the surround.

We used stochastic gradient descent with the ADAM optimizer to fit spatiotemporal models. The parameters were the same as in the flashed-grating models, except for the batch size = 2,000 and ηmax = 0.02. We used the same learning schedule for η and the same regularization to control for subunit density as in the case of flashed gratings.

Natural video predictions of grating-fitted models

To obtain natural video predictions for models built from flickering gratings, we instantiated receptive fields, as well as subunit filters, at single-pixel resolution. Again, we projected video frames on centre and surround filters separately, convolved each result with the corresponding temporal filter and summed the two outputs for obtaining the final filter output. For subunit grid models, the subunit nonlinearity fitted from the gratings was applied to the linear subunit outputs, which were then summed with the non-negative weights to obtain the final activation signal. The training part of the video was used for estimating an output nonlinearity using maximum likelihood (under Poisson spiking) for both the DoG LN and the subunit grid models. The nonlinearity had the same parametric form as in the model fit with gratings. Unlike the models applied to natural images, in which the Bayesian information criterion was applied, we here used the training set to select the appropriate regularization strength by finding the maximum of the log-likelihood among the eligible models (with at least three subunits and a receptive-field coverage below 3). If no eligible model was fitted, cells were excluded from further analyses. Model performance was estimated for the test set using the coefficient of determination between model prediction and measured firing rate as a fraction of explained variance (R2). Negative values of R2 were clipped to zero.

To better differentiate DoG and subunit grid model performance, we selected fixations on the basis of model predictions. For each cell pair, we selected the 20% of the fixations for which the deviation in the predictions of the two models, averaged over the two cells, was largest. For a single cell and a single fixation, the deviations were calculated as the absolute value of model differences normalized by the cell’s overall response range (maximum minus minimum during the test part of the video) and averaged over all frames of the fixation. Performance of both models (R2) was then compared to the frame-by-frame neural response on these fixations and averaged over the two cells. The selection of maximally differentiating fixations does not favour either model a priori, because it is only based on how much model predictions differ and not on their performance in explaining the data.

Similar to the spatial contrast analysis, we expanded the pairwise correlation (rpair) into linear and nonlinear contributions by splitting the numerator of the Pearson correlation coefficient so that rpair = rnonlinear + rlinear. For a pair of cells, we sorted all fixations (in descending order) by the average deviation of model predictions. We assigned the first half of the fixations to the nonlinear group and the remaining ones to the linear group.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.