Abstract
What humans look at and do when freely viewing a scene is not well understood. We measure observer eye movements under different instructions while participants view a customized set of image pairs containing small visual alterations that greatly change scene interpretation (Winograd images). We show that free-viewing fixations resemble those of observers describing scenes but differ from those of observers counting or searching for objects. Fixations are more often directed toward people and objects whose removal most alters scene interpretation, rather than toward the most salient or meaningfully judged regions (meaning maps), or objects perceived to be grasped or gazed at. Small image changes that modify scene understanding (Winograd images), but not salience or meaning maps, alter fixation patterns. By instructing observers to describe scenes while fixating on objects either relevant or irrelevant to scene understanding, we demonstrate that free-viewing eye movements are functionally important for accurate scene comprehension. Thus, an important default function of free-viewing eye movements is to comprehend scenes.
Introduction
Eye movements are a critical part of human vision. Ever since Buswell (1935)1 and Yarbus (1967)2 conducted their classic eye-tracking studies, researchers have tried to understand what drives and influences eye movements. One of the prominent theories proposed two decades ago is that people make eye movements to salient regions. Saliency was originally defined (Itti et al.3) in terms of a combination of low-level features such as color, intensity, and orientation at any given location relative to its immediate surroundings, with many subsequent elaborations of the model3,4,5,6,7,8. Since then, studies have shown that when humans perform specific tasks such as searching for an object or person9,10,11,12,13, executing a motor action14,15,16, navigating in an environment17,18, or identifying faces19, they do not look at the most salient object/region. Instead, humans look at locations that contain objects or visual features relevant to the task20,21,22,23,24,25,26,27,28,29,30,31, and/or that allow maximizing task accuracy19,32,33. When instructed to artistically evaluate or describe scenes using keywords, observers fixate on objects more than salient regions34.
Humans are not always engaged in specific tasks. They might look through a bus window, sit on a park bench, wait for a restaurant table, and explore a scene with no particular task. This is often called free viewing4,35. Fixations during free viewing have been commonly used to support the theory that people direct eye movements to salient regions36,37,38,39,40,41,42. More recently, studies have shown that even during free viewing, people do not direct their eyes to salient regions but rather to objects43 or meaningful regions measured by observers’ subjective judgments of the meaningfulness of segmented local patches of scenes (meaning maps; subsequently referred to as locally meaningful regions)44,45.
Here, we introduce a new theory. We hypothesize that, during free viewing, humans aim to understand a scene. To do so rapidly, they direct their eyes to regions critical to scene understanding rather than to salient regions or to segmented regions locally judged to be meaningful. An important distinction between a locally meaningful region and one critical to scene understanding is that the latter considers how the region, person, or object contributes to understanding the whole scene. For example, Fig. 1a illustrates the concept with an image described by observers as: Someone replacing batteries of a TV remote control. The highly contrasting blue paper is the most salient object, and the sunglasses are the most locally meaningful object. However, the inverted remote with its battery cap open is the object most critical to the scene’s understanding. How would one measure the contribution of an object to the scene’s understanding? We propose a new method by having observers describe the scene and assessing the impact of removing one object at a time on the scene descriptions. Removing the remote control from the image in Fig. 1a significantly alters the description of the scene, whereas eliminating the sunglasses or the blue paper does not (see Fig. 1a, right side). Thus, our theory hypothesizes that the remote control, critical to scene understanding, would attract more human fixations than the blue paper or the sunglasses. Furthermore, our second hypothesis is that fixating the object critical to scene understanding has a functional role in extracting the information required to understand the scene. The theory predicts that directing the high-resolution fovea to the remote control would be important to understand the scene (accurately describe it), while fixating on the blue paper would not.
Although previous work has related eye movements to scene comprehension when observers are instructed to describe scenes46,47,48, these studies have not examined how free-viewing fixations relate to the contributions of each object to scene understanding, evaluated other tasks or models, or experimentally investigated the causal functional importance of fixations on scene understanding.
a Example of a scene where the object corresponding to the most locally meaningful patch/region (meaning maps), the most salient object, and the object most critical to understanding the scene are different. The most salient object is the contrasting blue paper, while the most locally meaningful patch corresponds to the sunglasses. The image is described as: Someone replacing batteries of a TV remote control. The hands, the batteries, and the TV remote control are objects critical to understanding the scene, while the salient blue paper and sunglasses are irrelevant to the scene’s meaning. This image is not part of the study but is included as a simple example to illustrate the concept of objects critical to scene understanding. b First row: Winograd image pair created to dissociate saliency from relevance to scene understanding with corresponding descriptions from observers. The projector is relevant to the scene understanding for the left image, while the clothes are relevant for the right image. Second row: The prediction of the GBVS saliency model for both Winograd images is similar, even though the descriptions and objects relevant to scene understanding vary. c Procedure for the four experimental conditions (free viewing, scene description, object search, and counting objects) of the eye movement study with 50 participants per condition (between-subjects design). Consent was obtained from the people in the featured image for its publication.
Our initial methodology and analysis focused on predicting human fixations on different objects. In the latter part of the paper, we generalize the analysis to assess whether the theoretical framework can predict the well-documented frequent fixations on people (bodies and faces49,50; for example, deleting the hand with the batteries in Fig. 1a would maximally alter the description of the scene). We also assess how social cues (gaze and grasp cues) fit the theoretical framework.
One of the main challenges in evaluating the hypothesis that eye movements are directed to people or objects that maximize scene understanding is that in most image data sets51,52,53,54,55,56 the person or object critical to scene understanding is often also the most salient object and the locally most meaningful region. This is a consequence of photographers’ practice of typically positioning the people or objects critical to the scene’s understanding as salient and central in the photograph57. To overcome this challenge, we developed a new dataset that decouples salient regions from regions that contribute to scene understanding.
We created pairs of real scenes that were minimally altered visually but resulted in large changes in the scene’s understanding (Fig. 1b; left) while maintaining the saliency maps of the scenes and the local meaningfulness of each region. These small visual changes across image pairs could be manipulations of an object’s position, its substitution by another object, or an actor’s posture. We refer to these image pairs as Winograd Images (WI), inspired by the Winograd schema developed for sentences for which minor modifications of a word in a sentence lead to a large change in the sentence’s meaning58. To determine which objects were critical to the scene’s understanding, we assessed how deleting each object impacted the accuracy of scene descriptions relative to the intact images.
To test our first hypothesis, we had different groups of observers view our Winograd pair image dataset following different condition instructions (experiment 1): search for a specific object (object search; OS), describe the scene (scene description; SD), count the number of objects (counting objects; CO), or, critically, free viewing (no instructions; FV). If observers execute eye movements during free viewing to understand the scene, we would expect several specific results. First, fixation patterns during free viewing should be similar to those obtained from observers during the scene description condition, which explicitly requires the observer to understand a scene to describe it. Fixation locations during free viewing should be less similar to the fixations of observers executing other specific tasks that do not require understanding the scene, such as searching for an object or counting objects. Second, Winograd image pairs, which alter scene understanding, should result in varying fixation patterns (across Winograd images) for the scene description and free viewing conditions, but less so for the object search and counting conditions. Third, fixations during the free viewing and scene description conditions should be directed to objects critical to scene understanding. In contrast, fixations during object search would be directed to the object searched for (target), and fixations during object counting should be spread equally among objects. As benchmark comparisons, we also evaluated the ability of a saliency model (Graph-Based Visual Saliency (GBVS)4), a human eye movement-trained deep neural network (DeepGaze59), and measurements of the most meaningful local regions (meaning maps44) to predict the fixations across Winograd pairs.
We hypothesize that the objects critical to scene understanding in our data would be fixated more frequently than the most salient or locally meaningful region (as measured by meaning maps) for the free viewing and scene description conditions.
To test the functional role of fixations during free viewing in scene understanding, we conducted a second experiment. Observers described a scene while maintaining fixation on an object critical to scene understanding or on another object irrelevant to scene comprehension. If fixations play a functional role in rapidly extracting the information necessary to understand a scene, then scene descriptions should be more accurate when observers maintain fixation on objects relevant to scene understanding than when they fixate on objects irrelevant to comprehending the scene.
We show that observers’ free-viewing fixations are more similar to fixations of observers instructed to describe the scenes than to observers’ fixations when counting objects or searching for specific objects. Small visual alterations to images that modify a scene’s understanding but not its most salient regions or its meaning map (Winograd image pairs) change where humans most frequently fixate. Free-viewing fixations are more frequently directed to people and objects critical to the understanding of a scene (elements that, when erased from the scene, maximally alter the scene’s description) rather than the most salient region, the most meaningfully judged scene region (meaning map), or the object perceived to be grasped or gazed at. The theoretical framework also explains the higher frequency of fixations on people than on objects for most scenes because, when people are erased, scene descriptions are maximally altered. A temporal analysis shows that the first few fixations are most frequently directed to people and objects perceived to be grasped or gazed at, while later fixations increasingly focus on people and objects relevant to scene understanding. We also show that observers’ scene description accuracy is higher while maintaining fixation on objects relevant to scene understanding than on objects irrelevant to it, suggesting that eye movements during free viewing are functionally important to comprehend scenes accurately.
Results
Free viewing fixations are most similar to scene description
We created a total of 18 Winograd pairs60. Fifty participants took part in each of the four conditions (a between-subjects design, with a total of 200 participants). Within each condition, participants were assigned randomly and equally to view one set of Winograd pairs (18 image trials). Figure 1c details the procedural flow of the four conditions.
Our primary analyses removed fixations/model predictions on human faces or bodies in the scenes. This pre-processing enabled us to focus on fixations on objects that were critical versus those irrelevant to scene understanding, rather than on human figures, which were critical across all Winograd image pairs. Figure 2a shows an example of the fixation heat maps (see methods for details) generated using 25 participants for each of the four conditions. The gray regions indicate fixations on people that were not considered in our primary analysis (but see further below for an analysis including fixations on people). Figure 2b shows the correlation of fixation heat maps across conditions (for 25 observers) computed for each of the 36 images (18 pairs) in our dataset. The fixation heat maps for free viewing were significantly more similar to the fixation heat maps from the scene description condition (FV-SD, r = 0.54) than to those from the object search (FV-OS, r = 0.32, bootstrap resampling of observers and images for all analyses, p < 0.001 vs. FV-SD, Cohen’s d = 1.28) or counting objects (FV-CO, r = 0.42, p < 0.001 vs. FV-SD, Cohen’s d = 0.75, see Figure Supplementary 1a). To compare the across-condition fixation heat map correlations to an upper bound, we estimated a within-condition (FV-FV) correlation of fixation heat maps, each using distinct subgroups of n=12 observers (Fig. 2c, see methods). As with the n=25 analysis, we found higher FV-SD fixation heat map correlations (r = 0.4) than those for FV-OS (r = 0.23, p < 0.001, Cohen’s d = 1.19) and FV-CO (r = 0.29; p < 0.001, Cohen’s d = 0.9). Notably, the FV-SD fixation heat map correlation was the closest to the within-condition (FV-FV) inter-observer (see dotted lines in Fig. 2c) heat map correlations (Δr = rFV-FV - rFV-SD = 0.06) relative to the other condition comparisons (the FV-OS vs. FV-FV correlations: Δr = 0.26; and the FV-CO vs. FV-FV correlations: Δr = 0.17).
a Example of the fixation heat maps generated from 25 participants for all four conditions. The gray heat map regions indicate regions containing people and were excluded from the initial fixation analyses. b Correlation between free-viewing fixation heat maps and fixation heat maps from other conditions for each image in our dataset (36 images). The lighter the color, the higher the correlation. Three sample images are shown. There was a trend of higher correlations between the free viewing (FV) fixation heat maps and those of the scene description (SD) condition than with the heat maps of the object search (OS) and counting objects (CO) conditions. c The average correlation between the free-viewing fixation heat maps and scene description fixation heat maps (12 participants and 36 images) was significantly higher than the correlation with the fixation heat maps of the OS and CO conditions (p < 0.001). The dotted line shows the inter-observer correlation of fixation heat maps from groups of 12 observers; FV-FV. (***p < 0.001). A one-tailed bootstrapped analysis was conducted to test the significance of all the results. In c, the line within the box indicates the median. The box spans the interquartile range (IQR), and the whiskers extend to the most extreme values within 1.5 × IQR across images. Consent was obtained from the person in the featured image for its publication.
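The across-condition comparisons above reduce to pixel-wise Pearson correlations between heat maps, with person regions masked out. A minimal sketch under those assumptions (the function and the `exclude_mask` argument are hypothetical names, not the study’s code; heat maps are taken to be 2D NumPy arrays of smoothed fixation density):

```python
import numpy as np

def heatmap_correlation(map_a, map_b, exclude_mask=None):
    """Pearson correlation between two fixation heat maps.

    map_a, map_b: 2D arrays of smoothed fixation density.
    exclude_mask: optional boolean array; True pixels (e.g., those
    covering people) are dropped before correlating, mirroring the
    person-exclusion step described in the text.
    """
    a, b = map_a.ravel().astype(float), map_b.ravel().astype(float)
    if exclude_mask is not None:
        keep = ~exclude_mask.ravel()
        a, b = a[keep], b[keep]
    # Standardize, then the mean of the product is Pearson's r.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))
```

Averaging this quantity over images within a condition pair (e.g., FV-SD) gives the bar heights compared in Fig. 2c.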
Changes in scene understanding with small visual manipulations influence observer fixations
We compared the measured human fixation patterns across the Winograd image pairs for different conditions with other existing fixation prediction models: a purely low-level saliency model (GBVS)4, a neural network model trained on human fixations and images to predict fixations on novel images (DeepGaze II)59, meaning maps that quantify the meaningfulness of local regions of scenes44, and our newly proposed scene understanding maps that quantify the contribution of each object to the meaning of the entire scene. Figure 3a, b provides a flow chart of all the models we have used in this comparison.
a Procedural workflow for creating our proposed scene understanding maps (SUM): The impact of an object on scene understanding was quantified by comparing the similarity (rated by humans) between descriptions that people provide when the object was present and those when the object was removed. See Figure Supplementary 1b, c for details on the consistency of LLMs with human raters on rating the similarity of the descriptions. The SUM was generated by placing a 2D Gaussian on each object, with its amplitude determined by the impact of the object on the scene understanding and with a standard deviation determined by the size of the bounding box around the center of each object (SD range: 0.5 to 1.5 dva) (43, see methods for details). b Workflow for creating maps for other fixation prediction models used in this study: meaning maps (top), DeepGaze (middle) maps, and GBVS (bottom) maps. The grey regions in the heat maps indicate the predictions that fall on people in the scene and were excluded from the initial analyses (see further below for results that include fixations on people). Consent was obtained from the person in the featured image for its publication.
To measure the contribution of an object to scene understanding, we developed a technique that measures the impact of an object present in the scene on the description provided by participants. Figure 3a illustrates the procedure with an example. The original descriptions for the intact (i.e., no object removed) images were compared with the descriptions of the images after the object removal to quantify the object’s impact on scene understanding. If the object is irrelevant to scene understanding, the scene descriptions with the object present or removed would be similar. In contrast, if the object is critical to scene understanding, removing it would greatly impact the scene descriptions. The similarity of the scene descriptions can be measured by relying on a separate group of participants who view sentence pairs (without the image) and rate their similarity. Quantifying the impact of removing each object on the scene description allows us to create a map that visualizes the objects’ contribution to scene understanding (scene understanding map, SUM). The SUM can be used to predict fixation distributions and the most fixated object (see methods and Fig. 4a). Similar SUM results were obtained using similarity ratings of descriptions based on automated Large Language Models (LLMs) based metrics (cosine similarity computed with the LLM embeddings of the sentences) rather than human ratings. The human-LLM rating agreement (rGemini-Human = 0.7, rGPT4-Human = 0.72) was comparable to the average human-human (r = 0.75) agreement in the similarity ratings of these descriptions (see Figure Supplementary 1b, c).
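The SUM construction described above can be sketched as a sum of per-object 2D Gaussians. In this minimal illustration the dictionary keys (`center`, `sigma`, `impact`) are hypothetical names: `impact` stands for a dissimilarity score between descriptions of the intact image and of the image with that object removed (from human ratings or LLM-embedding cosine distance), and `sigma` is derived from the object’s bounding-box size:

```python
import numpy as np

def scene_understanding_map(shape, objects):
    """Sketch of a scene understanding map (SUM).

    shape: (height, width) of the image in pixels.
    objects: list of dicts with keys
      'center' (x, y) pixel coordinates of the object,
      'sigma'  Gaussian SD in pixels (from the bounding-box size),
      'impact' effect of removing the object on scene descriptions.
    Returns an H x W map summing one Gaussian per object, with
    amplitude proportional to the object's impact.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    sum_map = np.zeros(shape)
    for obj in objects:
        cx, cy = obj["center"]
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        sum_map += obj["impact"] * np.exp(-d2 / (2 * obj["sigma"] ** 2))
    return sum_map
```

An object whose removal greatly alters descriptions (high `impact`) thus produces a high peak in the SUM, which is what makes the map predictive of the most fixated object.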
a Examples of measured and predicted fixation heat maps for a Winograd image pair. The grey regions in the heat maps indicate the fixations/model predictions that fall on people in the scene and were excluded from the primary analyses (see further below for results with fixations/model predictions to people included). b The correlation between Winograd pairs for free-viewing fixation heat maps (25 participants per image across 18 pairs of Winograd images) was low, suggesting that people look at different locations for each Winograd pair. Similarly low correlations result from the scene description condition. The correlations between fixation heat maps for Winograd image pairs in the object search (p < 0.001) and counting objects (p = 0.018) conditions were significantly greater than those for free-viewing. None of the existing fixation prediction models (GBVS saliency, meaning maps, and DeepGaze) predicts the low correlation observed in free viewing (p < 0.001). However, the scene understanding maps showed a low correlation across the Winograd image pairs. The shaded area for each bar shows the correlations observed for random image pairing (rather than Winograd pairs). The box plot shows an image-level distribution of the Winograd correlations, with the middle line indicating the median, the box width indicating the IQR, and the whiskers indicating the 1.5*IQR. (****p < 0.0001; *p < 0.05). A one-tailed bootstrapped analysis was conducted to test the significance of all the results. In b, the line within the box indicates the median. The box spans the interquartile range (IQR), and the whiskers extend to the most extreme values within 1.5 × IQR across images. Consent was obtained from the person in the featured image for its publication.
We hypothesized that altering the scene understanding through small visual manipulations would change the human fixations in both the free viewing and scene description conditions. Thus, we predicted that the correlation of human fixation distributions across Winograd image pairs would be smaller than that for the saliency, local meaningful regions, and DeepGaze prediction maps. Similar to the human fixation distributions, our hypothesis predicts that the correlation across the scene understanding maps of Winograd image pairs would also be low. In addition, human fixation heat maps would be more similar (higher correlation) across the Winograd image pairs for the object search and counting objects conditions compared to the free viewing and scene description conditions.
Figure 4a shows examples of the predicted fixation heat maps of different models and measured fixation heat maps for humans for the four conditions for a Winograd image pair. The correlation between the human fixation heat maps of Winograd images for the free viewing condition (Fig. 4b) was significantly lower than for the object search (p < 0.001, Cohen’s d = 1.43) and counting objects (p = 0.018, Cohen’s d = 0.93) conditions. The correlation of Winograd image fixation heat maps for the scene description condition was significantly lower than for the free viewing condition (p = 0.034, Cohen’s d = 0.53). The correlations of fixation prediction heat maps between Winograd images for saliency, meaning maps, and DeepGaze were significantly higher (p < 0.001 vs. free-viewing for all comparisons, Cohen’s d = 1.86, 1.68, 1.25, respectively) than the correlations observed for free-viewing human fixation heat maps between Winograd pairs. In contrast, the correlation of the fixation predictions of Winograd image pairs for the SUM was the lowest and not different from that observed for human fixations for the free viewing and scene description conditions (p = 0.19, Cohen’s d = 0.54 for free viewing; p = 0.51, Cohen’s d = 0.12 for scene description condition).
These variations in Winograd pair fixation heat map correlations across conditions cannot be accounted for by differences in inter-observer variability (rinter-observer = 0.45 for free viewing; rinter-observer = 0.47 for scene descriptions; rinter-observer = 0.49 for object search; rinter-observer = 0.38 for counting objects condition, see Figure Supplementary 2a, dotted lines). In addition, as a control comparison, the correlations observed for a random pairing of images (shaded area in Fig. 4b) do not result in the ordering observed in Winograd image correlations across conditions.
Even though the SUM is most helpful in predicting the most frequently fixated object, it also attains accuracy comparable to that of the meaning maps and DeepGaze at predicting all human fixations, and better accuracy than GBVS (shuffled AUROC analysis; see methods for implementation details and Figure Supplementary 2b for the results).
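A shuffled AUROC scores a prediction map by whether its values at the current image’s fixations (positives) exceed its values at fixations borrowed from other images (negatives), which discounts center bias shared across images. A minimal Mann-Whitney-style sketch (the function name and array layout are hypothetical; the study’s exact implementation is in its methods):

```python
import numpy as np

def shuffled_auc(pred_map, fixations, other_fixations):
    """Shuffled AUROC sketch for one image.

    pred_map: 2D array of model predictions.
    fixations: (N, 2) array of (row, col) fixation pixels on this image.
    other_fixations: (M, 2) array of fixation pixels taken from other
    images, serving as center-bias-matched negatives.
    """
    pos = pred_map[fixations[:, 0], fixations[:, 1]]
    neg = pred_map[other_fixations[:, 0], other_fixations[:, 1]]
    # Mann-Whitney formulation of the AUC, counting ties as 0.5.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A value of 0.5 means the map predicts this image’s fixations no better than fixations pooled from other images; 1.0 means perfect separation.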
Observers look at regions/objects critical to scene understanding during free viewing
To assess where human observers fixate most frequently during free viewing, we categorized objects based on their contribution to the scene understanding map (see Fig. 3a). We categorized objects as the most relevant to scene understanding (SU-relevant) if erasing them from the scene resulted in the largest change in the participants’ scene description. Similarly, we categorized objects as irrelevant to scene understanding (SU-irrelevant) if erasing them from the scene did not result in large changes in the participants’ scene description (see methods; this analysis also excluded people in scenes, see further below for analysis including people). To control for low-level visual properties of objects and object types, we designed our stimuli so that the same objects were SU-relevant for one Winograd image (example: projector in Fig. 5a; left, clothes in Fig. 5a; right) and SU-irrelevant to the complementary Winograd image (example: clothes in Fig. 5a; left, projector in Fig. 5a; right). Across the Winograd images, SU-relevant and SU-irrelevant categories were the same set of objects, providing a strong control.
a Bounding boxes for objects in each Winograd pair were manually created, and the object categories were assigned to them. The scene understanding relevant (SU-relevant) and the SU-irrelevant categories have the same set of objects across each Winograd pair. b Fixation frequencies based on the object categories in free viewing and scene description conditions across 25 participants per image (and 36 images) show that the SU-relevant object was most fixated (p = 0.005 for comparison of fixation frequency to SU-relevant and DeepGaze categories in free viewing; p < 0.001 for all other comparisons). The search target was most fixated in the object search condition (p < 0.001 for all comparisons). The SU-relevant object was also the most fixated in the counting objects condition, although the difference relative to the SU-irrelevant category was smaller (p = 0.029). c ΔCFR,I: The difference in cumulative fixation frequency observed (for 25 participants per image, 36 images) between SU-relevant and SU-irrelevant object categories was highest in the scene description condition (different from free viewing at 6th fixation onward, p = 0.043), followed by free viewing (different from counting objects at 4th fixation onward, p = 0.049). No significant difference was observed across any fixation for the other two conditions. (****p < 0.0001; ***p < 0.001; **p < 0.01; *p < 0.05). A one-tailed bootstrapped analysis was conducted to test the significance of all the results. All analyses had their significance levels corrected for the False Discovery Rate (28 comparisons, 20 shown here). In b, the box plot shows a participant-level distribution of fixation frequencies, and the line within the box indicates the median. The box spans the interquartile range (IQR), and the whiskers extend to the most extreme values within 1.5 × IQR. In c, the central measure is the mean, and the error bars show the 68% bootstrap confidence interval.
Consent was obtained from the person in the featured image for its publication.
We measured the frequency of human fixations on SU-relevant and SU-irrelevant objects for each condition. We also quantified the frequency of human fixations on the top predictions of the GBVS saliency model, DeepGaze, and the most locally meaningful region based on meaning maps. Figure 5b shows the fixation frequency for each object category averaged across all images and participants. Results are shown for the four conditions. Observers fixated on objects more frequently when they were critical to scene understanding than when they were irrelevant for both the free viewing (SU-relevant vs. SU-irrelevant objects, p < 0.001, Cohen’s d = 1.41, adjusted p-values (q-value) reported for False Discovery Rate (FDR) at α = 0.05 for 28 comparisons; 20 comparisons in Fig. 5b and four comparisons in Fig. 6d and four comparisons in Fig. 8c) and scene description conditions (p < 0.001, Cohen’s d = 1.70). Furthermore, in the free viewing and scene description conditions, the SU-relevant objects were fixated significantly more often than any of the top GBVS salient (p < 0.001 for both conditions, Cohen’s d = 1.41, 1.46, respectively), DeepGaze (p = 0.005, Cohen’s d = 0.90 and p < 0.001, Cohen’s d = 0.98 respectively), or locally meaningful objects (p < 0.001 for both conditions, Cohen’s d = 1.46, 1.51, respectively). In the search condition, the searched target was fixated the most frequently (p < 0.001, Cohen’s d > 1.50 for all comparisons). Still, a smaller but significant difference was observed in fixation frequency between SU object categories (higher frequency to SU-relevant vs. SU-irrelevant objects) for the object search (p = 0.048, Cohen’s d = 0.150) and the counting conditions (p = 0.029, Cohen’s d = 0.790).
a Procedure to determine objects (left) that were perceived to be grasped/gazed at. Participants were asked to click on box locations they perceived to be grasped/gazed at by the person in each scene. Examples of images where the participants perceived the grasped/gazed-at object to be the same as the SU-relevant object (right, top 2) and images where it was different from the SU-relevant object (right, bottom 2, No gaze image subset). The top left icon is from Flaticon.com (icon ID 47865). b Fixation heat map correlations across conditions (25 participants per image, 20 images) for the No gaze image subset were similar to the pattern of correlations for the entire image set (p = 0.03 for SD-FV vs CO-FV; p < 0.001 for SD-FV vs OS-FV) (Fig. 2c); c Fixation heat map correlations between Winograd image pairs (25 participants per image, 7 image pairs) for the different conditions of the No gaze image set were similar to the entire image set (p = 0.037 vs CO and p = 0.006 for OS, p < 0.001 for all other comparisons with FV; Fig. 4b). d Fixation frequencies for object categories for the No Gaze image subset (25 participants per image, 20 images) show that the SU-relevant category is fixated more frequently than objects perceived to be grasped/gazed at (top row) in SD and FV conditions (p < 0.001 for all comparisons). This result holds even when all images were included (bottom row). However, the SU-relevant and SU-irrelevant object difference was not significant for OS (p = 0.282) and CO (p = 0.170) conditions in the No Gaze image subset. (****p < 0.0001; ***p < 0.001; **p < 0.01; *p < 0.05 from one-tailed bootstrap). Comparisons in d No Gaze subset used adjusted p-values with False Discovery Rate (n = 24, some in Figure Supplementary 4a). Box, IQR, and whiskers as defined in other figures. IQR across images for b and c and across participants for d. Consent was obtained from the person in the featured image for its publication.
To assess how observer fixation preferences develop through time, we analyzed the cumulative fixation frequencies to SU-relevant and SU-irrelevant objects for all conditions as a function of fixation number after image onset. For the free viewing and scene description conditions, the cumulative fixation frequency difference across SU-relevant and SU-irrelevant objects (ΔCFR,I) became significant after the 3rd fixation (p < 0.001, Cohen’s d = 1.24, 1.29, respectively, Figure Supplementary 3a, adjusted p-values (q-value) reported for FDR at α = 0.05 for all 40 comparisons in Figure Supplementary 3a). No statistically significant difference was observed for the search and counting objects conditions up to the 10th fixation. Figure 5c compares ΔCFR,I as a function of fixation number for all conditions. The ΔCFR,I in the scene description significantly deviated from that in free-viewing at the 6th fixation (p = 0.043, Cohen’s d = 0.43, adjusted p-values (q-value) reported for FDR at α = 0.05 for all 30 comparisons in Fig. 5c), while the free-viewing deviated significantly from the object counting condition at the 5th fixation (p = 0.049, Cohen’s d = 0.93). No significant difference in ΔCFR,I was observed between object counting and object search conditions. An analysis that used time-weighted fixations rather than fixation frequency resulted in similar findings (Figure Supplementary 3b, c).
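The cumulative measure ΔCFR,I used in this temporal analysis can be sketched as follows. This minimal illustration assumes each trial is an ordered list of fixation labels, with `'relevant'` and `'irrelevant'` as hypothetical label names for fixations landing on SU-relevant and SU-irrelevant objects (all other fixations are simply something else):

```python
def delta_cfr_i(trials, max_fix=10):
    """Cumulative fixation-frequency difference (delta CFR,I) sketch.

    trials: list of per-trial fixation label sequences.
    For each fixation index n = 1..max_fix, computes, per trial, the
    fraction of the first n fixations landing on SU-relevant objects
    minus the fraction landing on SU-irrelevant objects, then averages
    the difference across trials.
    """
    out = []
    for n in range(1, max_fix + 1):
        rel, irr = [], []
        for labels in trials:
            head = labels[:n]
            if not head:  # trial with fewer fixations than n
                continue
            rel.append(head.count("relevant") / len(head))
            irr.append(head.count("irrelevant") / len(head))
        out.append(sum(rel) / len(rel) - sum(irr) / len(irr))
    return out
```

Plotting this quantity against fixation number, per condition, yields curves of the kind compared in Fig. 5c, where ΔCFR,I grows with fixation number for free viewing and scene description but stays near zero for search and counting.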
The role of gaze, head, and hand position in directing eye movements
Many of the images in our experiment show one or more individuals directing their gaze, head, or hands toward objects that are typically relevant to scene understanding. Thus, a possible explanation for our results is that observers’ fixations on SU-relevant locations and objects are a byproduct of the well-documented finding that observers often follow the gaze31,61,62,63,64 or anticipated actions65,66,67 of others.
To show that SU-relevant fixations are not solely a byproduct of following gaze, head, or hand direction, or body posture, but rather reflect eye movements to objects critical to scene understanding, we conducted an experiment to assess which objects observers perceived to be grasped or gazed at in each scene. Figure 6a (left) details the experimental procedure. A separate group of participants (n = 25) viewed the Winograd images with unlimited time and were asked to click on the box they perceived to be grasped or gazed at by the person in the scene. Figure 6a (right) shows examples of scenes where an SU-relevant object was perceived by the majority of participants to be grasped or gazed at (top two images), and scenes where it was not (bottom two images). For 20 of the 36 images, the SU-relevant object was not considered the object being grasped or gazed at by the person in the image. We refer to these scenes as the No Gaze image subset.
Analysis of the No Gaze image subset showed a similar pattern of results to our main findings (Figs. 2c, 4b, 5b). Observer fixation heat maps for free viewing were significantly more similar to the fixation heat maps from the scene description condition (FV-SD, r = 0.5) than to those from the object search (FV-OS, r = 0.32, p < 0.001, Cohen’s d = 1.28) or counting objects (FV-CO, r = 0.41, p = 0.030, Cohen’s d = 0.71) conditions, as shown in Fig. 6b.
The observer fixation heat map correlation across Winograd images (Winograd correlation, from seven pairs in the 20 images) for the free viewing condition did not differ from the scene description condition (p = 0.22, Cohen’s d = 0.56) but was significantly lower than that observed for the object search (p = 0.006, Cohen’s d = 1.34) and counting objects (p = 0.037, Cohen’s d = 1.03) conditions (Fig. 6c). The Winograd correlation for the existing fixation prediction models (GBVS, meaning maps, and DeepGaze; Fig. 6c) was significantly higher than the human correlations (p < 0.001 vs. free viewing for all comparisons; Cohen’s d = 1.91, 1.70, 1.23, respectively), while the correlation for the scene understanding map (SUM) was the lowest and did not differ from that observed in human fixation heat maps for the free viewing and scene description conditions (p > 0.9 in both conditions; Cohen’s d = 0.20 and 0.18, respectively).
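The fixation heat map correlations used throughout this section can be sketched as follows: fixations are binned into a pixel grid, smoothed with a Gaussian, and two maps are compared with a Pearson correlation over pixels. This is a minimal numpy-only illustration; the smoothing bandwidth (`sigma`) is a hypothetical parameter, not the value used in the study.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian blur (numpy only, kernel truncated at 3 sigma)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    # convolve rows, then columns
    img = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, img)
    return img

def fixation_heat_map(fixations, shape, sigma=30):
    """Smoothed fixation density map from (x, y) pixel coordinates."""
    counts = np.zeros(shape)
    for x, y in fixations:
        counts[int(y), int(x)] += 1
    return gaussian_smooth(counts, sigma)

def heat_map_correlation(map_a, map_b):
    """Pearson r between two heat maps, computed over pixels."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])
```

Under this sketch, the Winograd correlation is `heat_map_correlation` applied to the heat maps of the two images of a pair: near-identical fixation patterns yield r close to 1, while spatially disjoint patterns yield r near 0.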
In addition, and similar to what we observed in the entire dataset, observers in the No Gaze subset also executed more frequent fixations on SU-relevant objects than on SU-irrelevant objects in the free viewing and scene description conditions (p < 0.001 for both conditions; Cohen’s d = 1.26, 1.62, respectively). However, unlike the entire dataset, the No Gaze subset showed no significant difference between the fixation frequencies for the SU-relevant and SU-irrelevant object categories in the object search (p = 0.282, Cohen’s d = 0.14) and counting objects (p = 0.170, Cohen’s d = 0.61) conditions, suggesting that for those two conditions, fixations on SU-relevant objects were driven mainly by gaze cues (i.e., by images for which gaze cues point to the SU-relevant objects and which were not part of the No Gaze subset). Finally, fixation frequencies for other object categories and the cumulative fixation frequency analysis showed similar results for the No Gaze subset as for the entire image dataset (see Figure Supplementary 4a, b).
To further assess whether observer eye movements are guided more by objects critical to scene understanding (SU-relevant objects) than by objects perceived to be grasped/gazed at, we directly compared the frequency of fixations on the two categories. For both the entire image dataset and the No Gaze image subset, SU-relevant objects were fixated more frequently than objects perceived to be grasped/gazed at (Fig. 6d; adjusted p-values (q-values) reported for FDR at α = 0.05 for all 24 comparisons in the No Gaze image subset and 16 comparisons in Figure Supplementary 4a) in the free viewing (p < 0.001 for both image sets; Cohen’s d = 1.19, 1.33, respectively), scene description (p < 0.001 for both image sets; Cohen’s d = 0.92, 1.20, respectively), and counting objects (p = 0.009, Cohen’s d = 0.86 for the entire image set; p = 0.007, Cohen’s d = 0.99 for the No Gaze subset) conditions, but not in the object search condition (p > 0.7 for both image sets; Cohen’s d = 0.15, 0.20, respectively).
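The one-tailed bootstrap tests reported here and in the figure captions can be sketched as a paired resampling over images. This is our own minimal illustration; the resampling unit (images) and iteration count are assumptions, not taken from the paper:

```python
import numpy as np

def one_tailed_bootstrap(a, b, n_boot=10_000, seed=0):
    """One-tailed bootstrap p-value for mean(a) > mean(b).

    a, b: paired per-image values (e.g., fixation frequencies for two
    object categories). Images are resampled with replacement, keeping
    the pairing; the p-value is the fraction of resamples in which the
    difference of means is not positive.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    idx = rng.integers(0, a.size, size=(n_boot, a.size))  # shared indices keep pairs
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return float((diffs <= 0).mean())
```

With clearly separated samples the p-value approaches zero; with identical samples every resampled difference is zero and the one-tailed p-value is 1.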
Together, all the analyses suggest that during free viewing or while describing scenes, observers most frequently direct their eyes not just to objects inferred to be grasped or gazed at but to objects critical to understanding the scene.
Causal influence on scene understanding of fixating critical objects
More frequent fixations on SU-relevant objects during the free viewing and scene description conditions might imply that processing those objects with the high-resolution fovea is functionally important to maximize understanding of the scene. We separately analyzed the scene description condition trials in which observers correctly described the scenes and those in which they described them incorrectly (using a threshold based on the cosine similarity of LLM embeddings of the descriptions; see the methods section for details). Across all images, the difference in fixation frequency for SU-relevant vs. SU-irrelevant objects was significantly greater for trials that resulted in correct descriptions than for trials with incorrect descriptions (p = 0.004, Cohen’s d = 0.53; see Figure Supplementary 5a for examples of descriptions near the threshold used for classifying a description as correct/incorrect, and Figure Supplementary 5b, c for the overall results and for variations of the adopted threshold). This association between eye movements and scene descriptions has been shown in previous studies46,47.
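The correct/incorrect split rests on a cosine-similarity threshold over LLM embeddings of the descriptions. A minimal sketch follows, in which the embedding vectors and the threshold value of 0.8 are hypothetical placeholders (the paper derives its threshold as described in the methods):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify_description(desc_emb, gold_embs, threshold=0.8):
    """Label a description correct if its mean cosine similarity to the
    gold standard embeddings reaches the threshold (0.8 is illustrative)."""
    mean_sim = float(np.mean([cosine_similarity(desc_emb, g) for g in gold_embs]))
    return mean_sim >= threshold, mean_sim
```

An embedding identical to the gold standards scores 1.0 and is labeled correct; an orthogonal embedding scores 0.0 and is labeled incorrect.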
However, the association does not necessarily imply a causal influence of fixating SU-relevant objects on accurate scene understanding. An alternative explanation is that observers process the entire scene and extract its meaning before making eye movements. Under this alternative, observer fixations on SU-relevant objects merely follow from the already extracted scene understanding, and fixations on the SU-relevant objects are not required to understand the scenes accurately.
We conducted a separate experiment to evaluate the causal influence of fixations on SU-relevant objects on scene understanding. Two new groups of 15 observers maintained fixation on either the SU-relevant or the SU-irrelevant object during a 500 ms presentation of the same test images used in experiment 1. The observers then described what was happening in the scene. The collected descriptions were compared to the gold standard descriptions (descriptions from observers with unlimited time and eye movements) using sentence similarity measures based on the embeddings of Large Language Models (LLMs; see methods for validation of Gemini similarity scores against human ratings). Figure 7a, b shows the procedural workflow of the experiment and an example description from observers fixating on the SU-relevant or SU-irrelevant object for a sample image. Figure 7c, d shows that observers’ descriptions were more similar to the gold standard description of the scene when they fixated on SU-relevant objects than when they fixated on SU-irrelevant objects (p < 0.001, Cohen’s d = 1.18). For reference, Fig. 7d shows an upper bound on similarity, given by the average score across different gold standard descriptions (different observers describing the scene while exploring the image with unlimited time). A lower bound on similarity scores was estimated by permuting scene descriptions across images and comparing them to the gold standards of unmatched images (see methods; see also Figure Supplementary 5d for similar results obtained using embeddings from the GPT-4 LLM rather than Gemini).
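The permutation-based lower bound can be sketched as follows, assuming a precomputed matrix of similarities between every description and every image's gold standard; this is our illustration, not the authors' code:

```python
import numpy as np

def permutation_lower_bound(sim_matrix, n_perm=1000, seed=0):
    """Chance-level similarity by permuting descriptions across images.

    sim_matrix[i, j]: similarity of image i's description to image j's
    gold standard. Each permutation pairs descriptions with the gold
    standards of other images; accidental matched pairings are excluded.
    """
    rng = np.random.default_rng(seed)
    n = sim_matrix.shape[0]
    vals = []
    for _ in range(n_perm):
        perm = rng.permutation(n)
        mismatched = perm != np.arange(n)  # drop accidental matches
        if mismatched.any():
            vals.append(sim_matrix[np.arange(n)[mismatched], perm[mismatched]].mean())
    return float(np.mean(vals))
```

For a toy matrix with diagonal (matched) similarity 1.0 and off-diagonal similarity 0.2, the lower bound recovers 0.2, the chance-level similarity for unmatched image-description pairs.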
a A between-subjects (n = 15 per group) experimental design was used for the forced fixation experiment. Participants were instructed to maintain fixation on SU-relevant or SU-irrelevant objects for 500 ms, while real-time eye tracking ensured that fixation was maintained. They were then asked to provide a best-guess description of what was happening in the scene. b Examples showing that observers' descriptions were more similar to the gold standard meaning of a scene when they fixated on the SU-relevant objects. c Scatter plot of the similarity scores (Gemini; see Figure Supplementary 1b, c for a comparison of Gemini similarity scores and human ratings) of the scene descriptions relative to the gold standard for each image (averaged across observers) while fixating SU-relevant objects vs. while fixating SU-irrelevant objects. d Average similarity score relative to the gold standard description for the descriptions provided by participants fixating on either SU-relevant or SU-irrelevant objects (15 participants per fixation location per image, 36 images). The similarity of the descriptions was significantly higher (p < 0.001) when participants fixated on the SU-relevant locations than on the SU-irrelevant locations. The top dashed line indicates an upper bound established by the average similarity score across gold standards (inter-observer agreement of descriptions with unlimited time). The bottom dashed line indicates a lower bound calculated by permuting descriptions across images and comparing them to the gold standards of unmatched images. ***** (p < 0.00001). A one-tailed bootstrap analysis was conducted to test the significance of all results. In d, the line within the box indicates the median, the box spans the interquartile range (IQR), and the whiskers extend to the most extreme values within 1.5 × IQR across images. Consent was obtained from the person in the featured image for its publication.
Together, the results suggest that fixating on SU-relevant objects during free viewing causally leads to more accurate descriptions of the scenes than when fixating on SU-irrelevant objects.
Fixations on people and their role in the understanding of scenes
All analyses presented so far involved fixations on objects in the scene and excluded fixations on people (hands, faces, bodies) to isolate the influence of SU-relevant objects on fixations; these objects varied across Winograd images, whereas the people in the scene were present in both members of a pair. However, the larger literature documenting the importance of fixations on people’s heads49,50,68 and hands69 motivated us to examine how such fixations fit within the theoretical framework of eye movements that maximize scene understanding.
We first re-estimated the SUMs by collecting descriptions of the scenes in which people or people’s hands had been deleted (in addition to the deletion of objects). Figure 8a presents a sample SUM showing that the person’s hands were the most critical element in the scene (their removal altered the scene description the most). We found that people were the most critical element in 77.78% (28 images) of the images in the dataset, and the SU-relevant object in the remaining images.
a Scene understanding map (top right) when the removal of people from the scenes was included in the procedure. The SUM predicts people as the most fixated, followed by the SU-relevant object. Similar fixations were observed in both the free viewing (FV) and scene description (SD) conditions, which differed from those in the object search (OS) and counting objects (CO) conditions (25 participants per image, 36 images). b After including people in the scenes, fixation heat maps for the FV condition correlated most with the SD condition (p < 0.001 for all comparisons). c Fixation frequency on people in the scenes was highest in the SD condition (p < 0.001 vs. FV), followed by FV (p < 0.001 vs. CO). d Cumulative fixations for the No Gaze image subset (20 images) show a significantly higher fixation frequency on people than on SU-relevant objects for the FV (p = 0.021 for the first fixation) and SD (p = 0.009 for the first fixation) conditions, from the first fixation onward. The SUM with people also predicts a higher fixation frequency on people. Initially, observers fixated on the to-be-grasped/gazed-at objects significantly more than on the SU-relevant objects in the FV (p = 0.011 and p = 0.016 for the first two fixations, respectively) and SD (p = 0.011 and p = 0.003 for the first two fixations, respectively) conditions. However, the cumulative fixations on SU-relevant objects become significantly higher than those on the to-be-grasped/gazed-at objects from the 6th (FV, p = 0.027) or 7th (SD, p = 0.037) fixation onward. No significant differences were observed at any fixation for OS and CO. (****p < 0.0001; ***p < 0.001; **p < 0.01; *p < 0.05; one-tailed bootstrap). Comparisons in d used p-values adjusted with the False Discovery Rate (n = 30). In b and c, the line within the box indicates the median; box, IQR, and whiskers as defined in other figures. In d, the central measure is the mean, and error bars show the 68% bootstrap confidence interval.
Consent was obtained from the person in the featured image for its publication.
We then repeated all analyses from the previous sections, this time including fixations on people in the scenes. As with the initial analyses excluding fixations on people, the fixation heat maps for free viewing correlated more strongly with those of the scene description condition (FV-SD, r = 0.66) than with those of the counting objects and object search conditions (Fig. 8b; p < 0.001 for both comparisons; Cohen’s d = 1.50, 1.70, respectively). Including fixations on people increased the correlations across fixation heat maps of the Winograd image pairs for all conditions and images. Although the overall pattern of results matched the analysis excluding fixations on people (see Figure Supplementary 6a), the fixation heat map correlation across the Winograd image pairs for the counting objects condition was no longer significantly different from that for the free viewing condition (p = 0.26, Cohen’s d = 0.45).
We also re-analyzed the eye movement data for the four conditions to assess the fixation frequency on people in the scenes (their hands, bodies, or faces; see methods). Observers fixated on people most frequently in the scene description condition (Fig. 8c; p < 0.001 vs. free viewing, Cohen’s d = 1.45). Fixations on people during free viewing were in turn significantly more frequent than in the object counting and search conditions (p < 0.001 for both comparisons; Cohen’s d = 1.59, 1.65, respectively).
We calculated cumulative fixation frequencies to understand the temporal dynamics of fixation selection on the different scene components (people, SU-relevant objects, objects perceived to be grasped or gazed at, and SU-irrelevant objects). We used the No Gaze image subset, for which the perceived grasped/gazed-at objects were distinct from the SU-relevant objects (see Figure Supplementary 6b for similar results for the images in our dataset where the grasped/gazed-at objects were the same as the SU-relevant objects). Figure 8d shows that, starting with the first fixation, observers most frequently fixated on people (or on the highest value of the scene understanding map estimated with the inclusion of people; see SU-relevant including people in Fig. 8d; adjusted p-values (q-values) reported for FDR at α = 0.05 for all 30 comparisons) during free viewing and scene description. In these conditions, early fixations (first and second) were also directed to objects perceived to be grasped or gazed at significantly more often than to the SU-relevant object category (free viewing: p = 0.011, Cohen’s d = 1.3 for the first fixation and p = 0.016, Cohen’s d = 1.13 for the second fixation; scene description: p = 0.011, Cohen’s d = 1.26 for the first fixation and p = 0.003, Cohen’s d = 1.23 for the second fixation). Subsequently, cumulative fixations on SU-relevant objects became significantly higher than those on the perceived grasped/gazed-at objects from the 6th fixation onward for free viewing (p = 0.027 at the 6th fixation, Cohen’s d = 0.95) and from the 7th fixation onward for scene description (p = 0.039 at the 7th fixation, Cohen’s d = 0.97). We observed no statistically significant differences in cumulative fixations across scene components for the object counting and search conditions.
Similar to the AUROC analysis without people, AUROC analysis including fixations on people also resulted in a significant difference between SUM and GBVS saliency maps (p < 0.001, Cohen’s d = 1.87) and no statistically significant difference for DeepGaze, meaning maps, and SUMs, though SUMs were lower than meaning maps and DeepGaze (p > 0.09 for both comparisons; see Figure Supplementary 6c).
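The AUROC analyses quantify how well a map (SUM, GBVS, meaning map, or DeepGaze) separates fixated from non-fixated locations. The following is a minimal sketch using the rank-sum (Mann-Whitney) identity, with uniformly sampled pixels as negatives; this negative-sampling scheme is an assumption of the illustration, not necessarily the scheme used in the paper:

```python
import numpy as np

def map_auroc(pred_map, fixated_xy, n_negatives=1000, seed=0):
    """AUROC of a prediction map for discriminating fixated pixels
    (positives) from uniformly sampled pixels (negatives)."""
    rng = np.random.default_rng(seed)
    h, w = pred_map.shape
    pos = np.array([pred_map[y, x] for x, y in fixated_xy], dtype=float)
    neg = pred_map[rng.integers(0, h, n_negatives),
                   rng.integers(0, w, n_negatives)].astype(float)
    scores = np.concatenate([pos, neg])
    ranks = np.empty(scores.size)
    ranks[np.argsort(scores)] = np.arange(1, scores.size + 1)
    # Mann-Whitney U of the positives, normalized to [0, 1]
    u = ranks[: pos.size].sum() - pos.size * (pos.size + 1) / 2
    return float(u / (pos.size * neg.size))
```

A map whose high values coincide with the fixated locations yields an AUROC near 1; a map unrelated to the fixations yields an AUROC near 0.5 (chance).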
Discussion
Eye movements during free viewing are directed to objects critical to understanding scenes
The purpose of our study was to understand the image properties that guide eye movements during free viewing of scenes. Although there has been a long-held view that low-level saliency drives eye movements during free viewing36,37,40,41,42, recent studies have supported the idea that eye movements are directed to regions/patches of a scene judged to be meaningful44,45,70. Here, we explored the hypothesis that humans aim to maximize scene understanding accuracy during free viewing and move their eyes to the elements (people and objects) critical to scene understanding. The most locally meaningful region, object, or person (the meaning map maximum) is not necessarily the most critical person/object for understanding a scene. The latter depends on relationships between people, objects, and regions that contribute to the overall understanding of the scene, relationships that meaning maps do not capture.
We first investigated human eye movements to objects under different instruction conditions with newly designed stimuli (Winograd images). The Winograd image pairs varied the objects critical to scene understanding while maintaining visual saliency and the meaningfulness of local regions. Three findings support our hypothesis. First, fixation distributions during free viewing were most similar to those of observers describing the scenes; their correlation was comparable to the inter-observer fixation distribution agreement for the free viewing condition and to previously reported correlations between the best models and human fixations (e.g., a GBVS-human fixation correlation of 0.45, ref. 71). Second, small visual alterations that changed the meaning of the entire scene (Winograd images60) influenced human fixations during free viewing. Third, fixations were more frequently directed to the objects critical to understanding the scene than to the most salient or most locally meaningful regions.
Previous studies have shown how scene gist, object spatial relations, and object function relations guide eye movements when observers are searching for objects28,72,73,74,75,76,77,78,79,80,81. However, this had not been demonstrated for free viewing due to the difficulty of determining what components of the semantic information present in the scene were relevant to free viewing eye movement guidance and how to quantify these semantic components. Here, we developed a method to quantify the semantic role of an object in free viewing by measuring how the object’s removal alters scene understanding (scene description). This is distinct from recent efforts to quantify the semantic relationship among objects in scenes (concept maps82), which do not define their relationship to the scene’s understanding. Such concept maps do not predict the object observers fixate the most nor the influence of the Winograd images on human fixations in our study (see Figure Supplementary 7a–c).
Eye movements to objects critical to understanding scenes vs. objects perceived to be grasped/gazed at
Some of our images contained a person directing their gaze or head toward the object critical to scene understanding, or contained visible hands, which could suggest that observer eye movements simply reflect the well-documented behavior of following the gaze of other people in the scenes or of fixating on objects perceived to be grasped/gazed at31,61,62,63,64,65,66,67.
However, we analyzed images with no grasp/gaze cues and directly compared fixations on objects judged to be grasped/gazed at vs. objects critical to scene understanding. Our findings show that objects critical to scene understanding are frequently looked at even when the gaze is judged to be directed elsewhere or the to-be-grasped/gazed-at object differs from the SU-relevant object. Gaze, head, and body cues are nonetheless important predictors of the spatial location of these critical objects, and observers initially fixate on perceived-to-be-grasped or gazed-at objects, consistent with gaze cueing studies83,84. Even without grasp/gaze cues, however, observers’ eye movements eventually fixate on the SU-relevant objects. We argue that objects critical to scene understanding are a general concept guiding eye movement fixations during free viewing, and that the gaze of others most often, but not always, serves as a predictive cue to the location of critical objects.
Functional role of fixations to maximize scene understanding
We found that trials in which a person did not fixate on the critical object were associated with less accurate scene descriptions. A follow-up experiment that manipulated the point of fixation in the scene evaluated the causal influence of fixation on extracting the information required to understand a scene. We demonstrated that fixating on the object critical to scene understanding yielded more accurate scene descriptions than fixating on a different, irrelevant object. This suggests that observers’ fixations on objects during free viewing have a functional purpose to maximize scene understanding. This is consistent with many recent findings demonstrating the functional perceptual role of saccades17,19,32,64,85, micro-saccades86, and smooth pursuit eye movements14,87.
Our functional evaluation compared only a fixation on the object critical to scene understanding with a fixation on another object at a comparable distance from the center of the image; we did not evaluate an array of fixations covering the entire image. It might not always be necessary to fixate on the center of an object: fixating close to the critical object (within 2 degrees) may also yield maximal accuracy in describing the scenes, and for some scenes, center-of-mass fixations between the critical object and the hands might also support accurate scene description32,88,89.
The findings should not be interpreted to suggest that observers do not use peripheral information to guide their eye movements. For example, observers might identify a tool in the visual periphery but might not be able to specify the tool type. The peripheral information can guide the eye movement to the tool, and only after fixating on the object can the observer identify it as a screwdriver and more accurately describe the scene.
The results do not suggest that a single fixation is sufficient to understand a scene. As with other tasks with scenes90,91,92,93,94, many of our scenes might require multiple fixations to extract their meaning accurately. The planning of multiple fixations might also consider the motor cost of saccades and result in fixating irrelevant objects placed between two objects critical to scene understanding95,96.
The role of fixations on people and social cues in the theoretical framework of scene understanding
That humans look at other people in scenes has been a universal finding across the eye movement literature2,49,50,68,69. Our results replicate this classic finding: during free viewing and scene description, people in the scenes were fixated the most, well above any objects, including those critical to scene understanding. These findings are consistent with the theoretical framework of eye movements aimed at maximizing scene understanding: when we re-created the scene understanding maps (SUMs) to assess the influence of people, erasing people from the scenes had the largest impact on the scene description.
Comparing scene understanding maps to saliency models, DeepGaze, and meaning maps
A saliency model based on low-level image features4 failed to predict the high frequency of human fixations on objects critical to understanding the scene; low-level saliency models do not incorporate object relationships or tasks. Meaning maps44 also fall short for our images because they consider the local meaningfulness of objects and not their contributions to understanding the entire scene. DeepGaze59 is trained on human fixations and image features and thus, in principle, should be able to learn to predict fixations on the objects critical to scene understanding. Its weaker prediction of the most frequent human fixations for our images may reflect its training data, which might not include sufficient samples of images depicting complex behaviors such as those in our Winograd images. Additionally, DeepGaze’s emphasis on fixation density maps may make it less capable of accurately predicting the most frequently fixated object. SUM’s advantage over the other models, however, does not extend to full human fixation density maps: SUMs were marginally better (though not significantly different) than DeepGaze and meaning maps when considering only fixations on objects, and marginally lower (though not significantly different) than the two models when including fixations on people.
Importantly, even if DeepGaze predicted each scene’s most fixated object, it would not provide a unified theoretical understanding of what drives certain objects to be fixated the most for each image. Our experimental investigation provides such a theoretical account. It explains free viewing and frequent fixations on people and objects in terms of their contributions to understanding the scene.
Why do people seek to understand scenes during free viewing?
Our findings suggest that when observers view scenes without a specific task instructed by the investigator (free viewing), seeking visual information to understand the scene is one of the default tasks they engage in. Understanding a scene is essential for contextualizing the current visual input, making inferences about likely past events that have led to the current visual state, and, importantly, predicting future events97,98,99,100. Making inferences requires an observer to consider all likely possibilities for that scene. In this framework, eye movements during free viewing of scenes are directed to locations that reduce the uncertainty of the possible states of the visual world101,102, consistent with concepts of information sampling and curiosity103.
Our findings also show a smaller but significant effect of fixating on objects critical to the scene’s understanding, even for the object-counting and search conditions. Our analysis reveals that these effects were primarily driven by grasp/gaze cues.
Consistent with our findings suggesting that understanding a scene is one of the default observer tasks during free viewing, a recent fMRI study showed that the semantic information of a scene (represented by the embeddings of a deep neural network visual-language model) predicted brain activity of people passively (no task) viewing scenes better than traditional object labels104.
Generalization to other image types and influence of individual and cultural differences
All scenes in the Winograd pairs feature one or more individuals or a person’s hand, implying an action or social interaction. These types of images are fundamental to human daily life. However, the understanding of these scenes can often change due to the presence or absence of a single object, thus requiring observers to explore the image with their eye movements. How would the analyses apply to images that contain objects spread across the image (e.g., a prototypical kitchen or forest scene) with no implied future or past actions or social interactions? For such simpler images, there might not be a single object that is critical to their understanding. Furthermore, observers might be able to rapidly understand the scene through peripheral visual processing, without eye movement exploration105,106,107. Thus, we might expect fixations for those images to have a lower perceptual function and the most frequently fixated object to be less related to its relevance to scene understanding. Additionally, unlike the Winograd image dataset, many real-world scenes feature multiple people, some of whom are critical to understanding the scene while others are not. It is likely that SUMs, but not meaning maps, would best predict which person or people in such scenes are fixated the most, but this remains to be tested.
In addition, a limitation of our approach is that it might not capture individual108 and cross-cultural differences109 in scene understanding that might influence the most fixated object in scenes. Individual and culture-specific SUMs may be a possible direction for predicting such inter-individual and inter-cultural differences in fixation behaviors.
To conclude, our findings suggest that during free viewing of scenes, eye movements are directed to people and objects that are critical to understanding the scene, rather than to regions that are visually salient or judged to be meaningful. These fixations serve a perceptual function, as they causally improve the accuracy of scene descriptions. The theory of eye movements that maximize scene understanding, together with its empirical implementation through scene understanding maps, provides a unified account of free viewing of scenes: it predicts which objects are fixated frequently and explains both the higher frequency of fixations on people over objects and the frequent fixations on objects perceived to be grasped or gazed at. Together, our findings suggest that an important default task for the human brain during free viewing is comprehending the visual world.
Methods
Winograd image pairs (WI)
The experiment stimuli include 20 pairs of Winograd images photographed on the University of California, Santa Barbara campus. The images include indoor, outdoor, and table scenes. Each pair contained nearly the same set of objects in the same positions; the small visual changes across a pair could be a manipulation of an object’s position, its substitution by another object, or a change in an actor’s posture. The pairs were split into two sets: Winograd Set1 and Winograd Set260. The changes greatly alter how the scene is described while aiming to preserve low-level saliency and meaning maps; however, replacing an object or varying its location inevitably introduces some changes in the saliency and/or meaning maps.
Each image pair was carefully curated to ensure that at least five random people described the two images of the pair differently, while descriptions were consistent for a given image. Two pairs of Winograd images were removed from the dataset because their gold standard descriptions were inconsistent and vague. Of the 180 descriptions in the final dataset (36 remaining images × 5 descriptions each), 149 were consistent with their corresponding image (at least three consistent descriptions per image). All of these images feature a person (or parts of a body) and tend to depict some future action or behavior that shapes the scene’s understanding.
Participant information and informed consent
For all eye-tracking experiments, the participants (260 in total) were undergraduate and graduate students in Psychological & Brain Sciences at UCSB. Participants did not know a priori about the hypothesis or the details of the experiment. Although the main study was conducted with participants ranging from 18 to 30 years of age, we expect the main findings to generalize to other ages. We collected self-reported ethnicity and gender data but, due to a technical issue, have these data for only 100 of the 260 observers. We had consent to collect the gender and ethnicity data. We also collected Amazon Mechanical Turk data from US workers (276 participants in total) for our online studies, but we did not collect ethnicity or gender data from those participants. Race, gender, and ethnicity were not considered in the study design. Our main hypothesis, that the eyes are directed to regions/objects that maximize scene understanding, is presumed to apply to people of all genders and ethnicities.
Participants provided written consent before participating in all experiments. All procedures were approved by the Office of Research and Human Subjects, University of California, Santa Barbara. The authors affirm that all participants who took part as actors in the Winograd scenes provided informed consent for the publication of all images in the dataset.
Eye tracking and experimental setup
Eye movements were recorded using an EyeLink 1000+ desktop-mount eye tracker (spatial resolution: 0.01°) at a sampling rate of 500 Hz. Participants sat 75 cm from a 19-inch monitor, so that the screen subtended a visual angle of 26.6° × 21.8° at 1280 × 1024 pixels. The height of a stimulus image was approximately 12.7° of visual angle. Head movements were minimized using a chin and forehead rest. A velocity threshold of 22°/sec and an acceleration threshold of 4000°/sec² were used to detect saccades. Eye movements were recorded from the left eye. The experiment was controlled with SR Research Experiment Builder software.
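The Methods report the detection thresholds but not the detection code. As a minimal sketch (assuming gaze samples in pixel coordinates and the screen geometry above; the function name and defaults are illustrative, not the EyeLink parser itself), velocity/acceleration-based saccade flagging could look like:

```python
import numpy as np

def detect_saccades(x, y, fs=500, deg_per_px=26.6 / 1280,
                    vel_thresh=22.0, acc_thresh=4000.0):
    """Flag samples exceeding velocity/acceleration thresholds.

    x, y: gaze positions in pixels; fs: sampling rate (Hz).
    Thresholds are in deg/s and deg/s^2, as in the Methods.
    """
    # convert pixel coordinates to degrees of visual angle
    xd, yd = np.asarray(x) * deg_per_px, np.asarray(y) * deg_per_px
    vx, vy = np.gradient(xd) * fs, np.gradient(yd) * fs   # deg/s
    speed = np.hypot(vx, vy)
    accel = np.abs(np.gradient(speed)) * fs               # deg/s^2
    return (speed > vel_thresh) | (accel > acc_thresh)
```

A stationary trace yields no flagged samples, while an abrupt position jump exceeds both thresholds.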
Eye movement experiment with four conditions
In the first experiment, participants viewed the images under four conditions in a between-subjects design: free viewing, scene description, object search, and object counting. Each condition involved 50 participants, each of whom completed 18 trials viewing one set of Winograd images drawn from all pairs (25 participants per image). Each image was shown for two seconds and was followed by condition-specific instructions. Below, we describe the instructions given to participants in each of the four conditions.
Free viewing
Participants were told to view the scene naturally. No explicit task was given to them.
Scene description
Participants were instructed to describe the presented scenes. After they viewed the scene, they typed their description with no time limit.
Object search
Participants were instructed to search for an object within the scene. Before the start of the trial, they were shown the name of the object to search for. After the trial, they reported whether the object was on the left or right side of the image. By design, the searched object appeared at the same location in both images of the Winograd pair.
Counting objects
Participants were instructed to count the objects on the left and right sides of each image. After the trial, they reported which side of the image contained more objects (see Fig. 1b for the experiment’s procedural flow diagram).
Fixation prediction models
The study used four fixation prediction models. The models’ predicted fixation heat maps were compared against the fixation heat maps aggregated across participants in each of our eye-tracking experiments.
Graph-based visual saliency (GBVS)
A bottom-up saliency model proposed by Harel et al.4 constructs a computational graph using Markov chains on top of the extracted image features to generate a heat map of possible fixation locations. Our study uses an implementation provided by Kümmerer110 for the GBVS model. We chose the GBVS model because it is one of the top saliency models111 relying purely on low-level image features to compute its saliency map, true to the original definition of saliency.
DeepGaze
DeepGaze is a neural network model trained on image features extracted by the VGG-19 convolutional neural network (CNN)112, together with human fixations recorded while free viewing those images59. Given an image, the model produces a heat map of possible fixation locations. This study uses an implementation provided by Kümmerer113.
Meaning maps
Meaning maps is a crowd-sourced model developed by Henderson et al.44 that uses subjective ratings of how meaningful people find local circular patches of an image. Each image was divided into overlapping circular patches, which were then randomized across all images. Different individuals rated small subsets of these patches for meaningfulness. The procedure ensured that each patch had three raters, and the overlap between patches ensured that any region of the image received ratings from 27 raters. To create meaning maps for all the images in our study, 48 Amazon Mechanical Turk participants each rated approximately 300 circular patches, each 3 dva in size. The final result was a map of meaningful locations in a scene, serving as a predictor of where people might fixate while viewing the scene. The procedure proposed by Henderson et al.44 also included creating and rating 7 dva patches to compose a 7 dva meaning map, which is averaged with the 3 dva meaning map. We collected 7 dva meaning maps but did not use them in our presented meaning maps: because the image height in our study was 12.7°, every 7 dva patch contained multiple recognizable objects, making most image regions meaningful. The 7 dva meaning maps were therefore nearly uniform, and averaging them with the 3 dva maps did not change the resulting meaning maps. Our presented results thus use the 3 dva maps, but including the 7 dva maps did not alter the results. Refer to Fig. 3b for a procedural flow chart of all these models.
Scene understanding map
The scene understanding map visualizes each object’s contribution to understanding a scene by quantifying the change in participants’ descriptions of the scene after the object is removed. Creating a scene understanding map requires several steps: (1) creating images with individual objects removed; (2) collecting descriptions of images with objects removed; (3) establishing a gold standard description for the intact image; (4) assessing the similarity of descriptions of the object-removed scenes to the gold standard description of the intact image; (5) generating the scene understanding map. Below, we describe each of these steps.
Creating images with individual objects removed
Each image of the 18 Winograd pairs (36 images) was digitally manipulated to create versions with one object removed at a time. The resulting total number of images was 330. Each image had a total of 5 to 10 objects removed. Figure 3a shows an example of an image with some of its digitally manipulated versions. The photo editor app on the Samsung Galaxy S21 (version 3.4.2.43) was used to remove objects from these scenes.
Descriptions of images with objects removed
The dataset was split across the two Winograd sets, so participants viewed only one image of each Winograd pair. To ensure that participants did not see more than one version of any image, each set was divided into 11 unequal groups, each containing exactly one copy (original or manipulated) of each image. Consequently, each participant provided descriptions for at most 18 images. One hundred and ten Amazon Mechanical Turk participants took part in this study. We collected five descriptions for each image (and for each of its digitally manipulated copies) in the dataset. Before the online experiment, five images were used as a pre-test to ensure that observers followed the task instructions. Observers who did not accurately describe the pre-test images were not allowed to proceed to the main test with the 18 images.
Gold standard descriptions
To identify the impact of an object on scene understanding, we compared the descriptions of the original image to those collected for the manipulated image with that specific object removed. We collected five descriptions for each of the original Winograd images. The gold standard description was defined as the best of the five descriptions for each image. To establish which description was best, a new group of fifty Amazon Mechanical Turk participants (twenty-five per Winograd image set) selected the best description for each image of the original 18 Winograd pairs. Participants were randomly assigned to one of the Winograd sets (18 images). In each trial, the participant was shown an image with 7 descriptions (the 5 descriptions given for that image plus two random descriptions from other images). Participants had to select the description that best described the presented scene. The two random descriptions were used to identify participants who did not follow the instructions; all data from participants who selected random descriptions were excluded from the study. The description receiving the most votes was chosen as the gold standard description for each image.
Description similarity ratings to determine the contribution of objects to scene description
Eighteen Amazon Mechanical Turk participants were asked to rate the similarity of the descriptions using a scale from 1 to 10, with 10 indicating highly similar and 1 indicating very low similarity. In each trial, the participant saw the gold standard description for an image, followed by descriptions corresponding to each object being removed from the scene. Since there were five such description sets for each Winograd image, each participant had to finish 180 trials (36 images * 5 sets of descriptions). Refer to Fig. 3a for the procedural flow chart of this process.
Heat map generation for scene understanding map
To generate a heat map that visualizes the contribution of each object in the scene to the scene description, we inverted the rating scale so that dissimilar descriptions received higher scores. We then computed the median rating score for the descriptions obtained when each object was removed. The median score was assigned to all pixels within the object’s bounding box (defined in Methods; refer to Fig. 3a). To highlight relative contributions within each image, we normalized the ratings by subtracting the lowest object rating from all the other objects’ ratings and then dividing the map by the highest score for each image (scale-inverted normalized ratings). Following Stoll et al.43, we generated the preferred fixation locations in the images using the scale-inverted, normalized ratings corresponding to the removal of each object from the scene. Using the bounding box of each object, we modeled the ratings as 2D Gaussians centered on each object, with amplitude equal to the scale-inverted normalized rating and horizontal and vertical standard deviations set to a fraction of the bounding-box size (0.29 along the x-direction and 0.34 along the y-direction). Figure 3a shows an example of a scene understanding map.
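The map-generation steps above can be sketched as follows (an illustrative implementation, not the authors’ code; the `objects` data structure and function name are assumptions):

```python
import numpy as np

def scene_understanding_map(shape, objects, fx=0.29, fy=0.34):
    """Build a scene understanding map: one 2D Gaussian per object,
    amplitude = scale-inverted normalized rating, standard deviation =
    fraction of the bounding-box size (0.29 in x, 0.34 in y).

    objects: list of dicts with 'bbox' = (x0, y0, w, h) in pixels and
    'rating' = median inverted-similarity rating for that object's removal.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    ratings = np.array([o['rating'] for o in objects], float)
    # normalize: subtract the lowest rating, divide by the per-image maximum
    ratings = ratings - ratings.min()
    if ratings.max() > 0:
        ratings = ratings / ratings.max()
    sum_map = np.zeros(shape)
    for o, amp in zip(objects, ratings):
        x0, y0, bw, bh = o['bbox']
        cx, cy = x0 + bw / 2, y0 + bh / 2      # bounding-box center
        sx, sy = fx * bw, fy * bh              # Gaussian spread
        sum_map += amp * np.exp(-((xs - cx) ** 2 / (2 * sx ** 2)
                                  + (ys - cy) ** 2 / (2 * sy ** 2)))
    return sum_map
```

After normalization, the lowest-rated object contributes zero amplitude, so the map peaks at the object whose removal most changed the descriptions.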
Defining SU-relevant and SU-irrelevant objects
Objects in each scene were categorized based on their scene understanding map score as SU-relevant or SU-irrelevant. Objects in the SU-relevant category had the highest impact on participants’ scene descriptions when erased (they were the most critical to scene understanding among the objects in that image). Objects whose erasure did not affect participants’ scene descriptions belonged to the SU-irrelevant category. For each Winograd pair, the same set of objects belonged to the SU-relevant category in one image and to the SU-irrelevant category in the other image of the pair (refer to Fig. 3a). Our analyses focused on objects that were SU-relevant in one Winograd image and SU-irrelevant in the other; we did not analyze objects that had a low impact on participants’ scene descriptions in both images of a pair.
Measuring objects to be grasped and/or gazed at in the dataset
Fifty Amazon Mechanical Turk participants were divided equally between the Winograd sets and were shown the images with the objects in the scene covered by black boxes. This was done to isolate the perceived grasped or gazed-at object from any contextual information that could influence participants’ judgments. Participants were asked to click on the box they perceived to be grasped or gazed at by the person in the scene. For each image, the box with the maximum number of selections was taken as the object perceived to be grasped or gazed at. Figure 6a shows the experimental procedure and some examples. Images in which the SU-relevant object differed from the object perceived to be grasped/gazed at constitute the No Gaze image subset.
Forced fixation scene description experiment
Sixty participants were split into four equal groups, and each group saw 18 trials. Two groups were assigned to images from Winograd set 1, while the others were assigned to set 2. A between-subjects design across the Winograd pairs ensured that observers did not use their knowledge of one image to interpret its corresponding pair. Within each set, the two groups were asked to fixate on nine SU-relevant and nine SU-irrelevant locations. Each group saw a unique combination of image and fixation location. The images were presented for 500 ms. All trials where eye movements (saccades larger than one dva) were detected within the presentation interval were discarded. On average, the discarded trials accounted for thirteen percent of all trials, resulting in thirteen descriptions per image and fixation location. Participants described each scene after it was presented. Because of the atypical nature of the task, observers were instructed to provide their best guess of what was happening in the scene.
Repeating analyses including fixations/predictions on people in the scenes
We repeated the procedure to generate scene understanding maps (SUMs) but incorporated the influence of people in the scenes. Similarly, we also included the fixation predictions of the other models on people. We generated the measured fixation heat maps, including fixations on people. With these updated fixation/prediction heat maps, we recomputed the correlations of fixation heat maps across conditions and within the Winograd pairs for each condition. We also added fixations on people and the prediction of SU-relevant with the people category in the cumulative fixation distribution plots. Finally, we performed the AUROC analysis for all fixation prediction models, including fixations/predictions on people.
Dependent variables
Heat map generation for measured fixations
Each fixation was rendered as a one-pixel white dot on a blank canvas at its x and y location, and the canvas was convolved with a Gaussian kernel with a standard deviation of 0.5° of visual angle (using OpenCV’s GaussianBlur function) to generate the heat map.
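A minimal sketch of this procedure (the Methods used OpenCV’s GaussianBlur; scipy’s `gaussian_filter` shown here performs an equivalent Gaussian convolution, and the default pixel scale assumes the display geometry described above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_heatmap(fixations, shape, sigma_deg=0.5, px_per_deg=1280 / 26.6):
    """Place a one-pixel dot at each (x, y) fixation and blur the canvas
    with a Gaussian of 0.5 deg visual angle standard deviation."""
    canvas = np.zeros(shape, float)
    for x, y in fixations:
        canvas[int(round(y)), int(round(x))] += 1.0   # row = y, col = x
    return gaussian_filter(canvas, sigma=sigma_deg * px_per_deg)
```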
Selecting top predictions from fixation models
Fixation prediction models can sometimes predict high-contrast regions that do not correspond to important objects. Top predictions were identified as the maximum locations of the convolution between the prediction map and a uniform, normalized circular patch 2 dva in size. If the model’s prediction landed on an object, the object’s bounding box was used to represent the prediction; if not, a bounding box with the average dimensions of all the bounding boxes in the image was generated.
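An illustrative sketch of the peak-selection step, assuming the patch radius is supplied in pixels (the function name and interface are our own, not the authors’ code):

```python
import numpy as np
from scipy.signal import fftconvolve

def top_prediction(pred_map, radius_px):
    """Smooth a prediction map with a uniform, normalized circular patch
    (2 dva in the Methods; here the radius is given in pixels) and return
    the (row, col) of the maximum of the smoothed map."""
    ys, xs = np.mgrid[-radius_px:radius_px + 1, -radius_px:radius_px + 1]
    patch = (xs ** 2 + ys ** 2 <= radius_px ** 2).astype(float)
    patch /= patch.sum()                      # normalize to unit mass
    smoothed = fftconvolve(pred_map, patch, mode='same')
    return np.unravel_index(np.argmax(smoothed), smoothed.shape)
```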
AUROC analysis for the fixation prediction models
The shuffled AUC (sAUC) technique114 was used for ROC analysis to evaluate the performance of prediction models in predicting human fixation heat maps. For each image and observer combination, the prediction maps were thresholded at different levels, and the true positives (TPs) at each level were calculated by counting the number of fixations that fell in the areas above the thresholded level. The false positives (FPs) at each threshold level were calculated by sampling fixations (the same number of fixations as the positive set) from the same observer in other images (the negative set) in the dataset, and counting the number of these fixations that fell in areas above the threshold level. The sAUC was then calculated by finding the area under the ROC curve (TPs vs FPs) for each observer-image combination. The sAUCs were averaged across all observers and images to get a final value for each map. The error bars for the sAUCs were obtained from bootstrapping the images and participants for 1000 trials.
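A simplified sketch of the sAUC computation for a single observer-image combination (an illustration of the thresholding logic described above, not the authors’ implementation; fixations are given as pixel coordinates):

```python
import numpy as np

def shuffled_auc(pred_map, pos_fix, neg_fix, n_thresh=100):
    """Shuffled AUC: true positives are the observer's fixations on this
    image; false positives are the same observer's fixations sampled from
    other images (the negative set). Fixations are (row, col) pixels."""
    pos = np.array([pred_map[r, c] for r, c in pos_fix])
    neg = np.array([pred_map[r, c] for r, c in neg_fix])
    thresholds = np.linspace(pred_map.max(), pred_map.min(), n_thresh)
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    # trapezoidal area under the ROC curve (TPs vs FPs)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```

A map whose high values cover the positive fixations but not the negative set yields an sAUC near 1; chance performance yields 0.5.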
Frequency of fixations/fixation time to object categories
Each scene had a maximum of 10 bounding boxes corresponding to different objects or regions. These bounding boxes were assigned specific identifiers or categories based on their relevance to scene understanding, as measured by the scene understanding map (SU-relevant and SU-irrelevant), the top predictions of the fixation prediction models (e.g., most salient), and condition-specific identifiers (e.g., the search target in the object search condition). We counted the frequency of fixations, or summed the fixation times, for these object/region categories across all participants. If an object category contained more than one object in a given image, we averaged the fixations/fixation times over those objects.
Inter-observer correlation of fixation maps
For each condition, the 25 participants were divided into two random groups of 12 participants each, and the correlation between the two groups’ heat maps was computed for each image. This process was repeated for 1000 trials, and the average correlation across all images and trials was used as the inter-observer correlation for each condition. The correlations across Winograd images were originally computed using 25 participants per group, whereas the inter-observer correlations use only half as many participants. To make a valid comparison with the inter-observer correlation, we therefore also computed the Winograd correlation with half the number of participants: we used 1000 combinations of 12 participants (with non-repeating participants) out of the 25 and calculated the mean fixation heat map correlation across all images for each condition. We performed a similar analysis for the across-condition correlations (Fig. 2c), taking 1000 combinations of 12 participants (with non-repeating participants) in each condition across all images to compute the mean correlations.
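The split-half logic can be sketched as follows (illustrative only; group size and iteration count are parameters, and Pearson correlation on the flattened mean maps stands in for the heat map correlation):

```python
import numpy as np

def interobserver_correlation(obs_maps, n_iter=100, group_size=12, rng=None):
    """Randomly split observers into two groups of `group_size`, average
    each group's fixation heat maps, and correlate the two group maps.

    obs_maps: array of shape (n_observers, H, W). Returns mean Pearson r
    over `n_iter` random splits."""
    rng = np.random.default_rng(rng)
    n = obs_maps.shape[0]
    rs = []
    for _ in range(n_iter):
        perm = rng.permutation(n)
        a = obs_maps[perm[:group_size]].mean(0).ravel()
        b = obs_maps[perm[group_size:2 * group_size]].mean(0).ravel()
        rs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(rs))
```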
Computing Similarity using Large Language Models (LLMs)
To compute the similarity of experiment participants’ descriptions to the gold standard (defined in the Methods above), we used the embeddings of a large language model (Gemini115). The model API converts text into learned feature embeddings, and a cosine similarity score was used to measure the similarity between these embeddings. To ensure consistency, we repeated the same analyses using the embeddings of another large language model, GPT4116. We used cosine similarity to assess the agreement between the LLM similarity metric and the human ratings obtained for all comparisons in the object erasure experiment.
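Given embedding vectors returned by either model’s API (not shown here), the similarity computation itself is a plain cosine score, e.g.:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. a
    participant's description embedding vs. the gold standard embedding."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```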
Computing the lower and upper bound for similarity scores
Upper bound: Each Winograd image had five gold standard descriptions (defined in the methods above). A pairwise similarity comparison of all five gold standard descriptions was computed using the embedding similarity measure. The average similarity score across all pairwise comparisons across all images constitutes the upper bound for the similarity scores.
Lower bound: The gold standard description of each image was compared, using the LLM embeddings, to descriptions obtained in the forced fixation scene description experiment for another random image in the dataset. The lower bound was computed by averaging the similarity scores over 1000 such trials.
Fixations executed by observers who incorrectly described the scene
We calculated the fixation frequency distribution for object categories for images with correct and those with incorrect descriptions. The classification of the descriptions as correct or incorrect was based on human judgment similarity ratings relative to the gold standard description, and similarity ratings based on the embeddings of a Large Language Model, Gemini115.
For the LLM similarity measure, the participants whose description rating fell below one standard deviation (SD) from the mean rating were classified as participants who did not correctly describe the scene, and the rest were grouped as those who correctly described the scene. We also investigated how the analysis varied with different thresholds for categorizing descriptions: SD cutoffs ranging from 0 to 2.5 below the mean similarity rating for each image.
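The SD-cutoff classification can be sketched as follows (illustrative; assumes one array of similarity scores per image, with the cutoff expressed in standard deviations below the mean):

```python
import numpy as np

def classify_descriptions(sim_scores, sd_cutoff=1.0):
    """Label descriptions correct (True) or incorrect (False): scores more
    than `sd_cutoff` standard deviations below the mean similarity rating
    are classified as incorrect."""
    s = np.asarray(sim_scores, float)
    threshold = s.mean() - sd_cutoff * s.std()
    return s >= threshold
```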
We compared the LLM similarity measures to human classification of correct and incorrect descriptions. The first author and three other research assistants (RAs) from the lab judged the correctness (binary classification: correct or incorrect) of these descriptions (based on the gold standard descriptions). A majority decision across the four raters was used to assign the final correct/incorrect classification. Ties were resolved through discussion to reach a consensus agreement.
Statistical analyses
Data analysis tools
Python (3.6 or above) and PsychoPy (2020 or above) were used to set up the experiments, and other Python libraries (jsonlines, json_lines, pandas) were used to handle data. Amazon Mechanical Turk was used to conduct the online studies. Python libraries such as numpy, pandas, and scipy were used for statistical analysis and significance testing, and matplotlib, cv2, and seaborn were used for plotting and visualizing data and images.
Bootstrap resampling
Error bars for all analyses were obtained using bootstrap resampling of participants, images, descriptions, and ratings. Below, we describe the general procedure. We created 100,000 resamples (with replacement) of images and participants for all fixation frequency analyses. We used 10,000 observer/image resamples for the fixation heat map correlation analyses because of the computational cost of calculating heat map correlations. In the analysis quantifying inter-observer fixation heat map correlations, we picked 10 random samples of 12 participants from the 25; for each of the 10 samples, we used only 1000 bootstrap resamples (n = 12) due to computational constraints. The reported Cohen’s d was computed across either the image-level or the participant-level distribution of the data, depending on the analysis.
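A generic percentile-bootstrap sketch of the kind of resampling described (the function name and interface are our own, not the authors’ code):

```python
import numpy as np

def bootstrap_ci(values, n_resamples=10000, stat=np.mean, alpha=0.05,
                 rng=None):
    """Percentile bootstrap: resample the data with replacement, recompute
    the statistic each time, and return the (alpha/2, 1 - alpha/2)
    percentile interval for use as error bars."""
    rng = np.random.default_rng(rng)
    values = np.asarray(values)
    stats = np.array([stat(rng.choice(values, size=len(values), replace=True))
                      for _ in range(n_resamples)])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```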
False discovery rate
All analyses involving moderate to large sets of comparisons had their significance levels corrected for the False Discovery Rate (FDR, α = 0.05) using the Benjamini-Hochberg method117, and the adjusted p-values (q-values) were reported. For the cumulative fixation distributions, the comparison between SU-relevant and SU-irrelevant objects at each successive fixation depended on previous comparisons; we therefore applied a simple, conservative modification of the FDR procedure, the Benjamini-Yekutieli procedure118. All significance tests were one-tailed.
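A sketch of the Benjamini-Hochberg adjustment producing q-values (an equivalent routine is available as `scipy.stats.false_discovery_control` in recent SciPy versions; this standalone version is for illustration only):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment: return q-values (adjusted
    p-values) in the original order of `pvals`."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    # raw BH ratios p_(k) * m / k for the sorted p-values
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity: each q is the min of itself and all larger ranks
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(q, 0, 1)
    return out
```

A comparison is then declared significant when its q-value falls below the chosen FDR level (here α = 0.05).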
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The images created as part of the study have been deposited in a Mendeley repository119(https://doi.org/10.17632/z6jb259pcd.1). The repository also contains preprocessed data for plotting the fixation distribution across object categories and the cumulative fixation line plots. The code to access and visualize eye movement data is provided in a GitHub repository60 (https://doi.org/10.5281/zenodo.17374055).
Code availability
The code to generate SUMs for the images in our dataset is provided in the GitHub repository60 along with a README file that explains how to access the data and run the code.
References
Buswell, G. T. How people look at pictures: a study of the psychology and perception in art. How people look at pictures: a study of the psychology and perception in art (Univ. Chicago Press, Oxford, England, 1935).
Yarbus, A. L. Eye Movements During Perception of Complex Objects. In Yarbus, A. L. (ed.) Eye Movements and Vision, 171–211 (Springer US, Boston, MA, 1967). https://doi.org/10.1007/978-1-4899-5379-7_8.
Itti, L., Koch, C. & Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20, 1254–1259 (1998).
Harel, J., Koch, C. & Perona, P. Graph-based visual saliency. Advances in Neural Information Processing Systems 19 (2006).
Zhang, L., Tong, M. H., Marks, T. K., Shan, H. & Cottrell, G. W. Sun: A bayesian framework for saliency using natural statistics. J. Vis. 8, 32–32 (2008).
Bruce, N. D. B. & Tsotsos, J. K. Saliency, attention, and visual search: An information theoretic approach. J. Vis. 9, 5–5 (2009).
Zhang, J. & Sclaroff, S. Saliency detection: A boolean map approach. In Proceedings of the IEEE International Conference on Computer Vision, 153–160 (2013).
Erdem, E. & Erdem, A. Visual saliency estimation by nonlinearly integrating features using region covariances. J. Vis. 13, 11–11 (2013).
Findlay, J. M. & Gilchrist, I. D. Chapter 13 - eye guidance and visual search. In Underwood, G. (ed.) Eye Guidance in Reading and Scene Perception, 295–312 (Elsevier Science Ltd, Amsterdam, 1998). https://www.sciencedirect.com/science/article/pii/B9780080433615500146.
Eckstein, M. P., Beutter, B. R. & Stone, L. S. Quantifying the performance limits of human saccadic targeting during visual search. Perception 30, 1389–1401 (2001).
Einhäuser, W., Rutishauser, U. & Koch, C. Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. J. Vis. 8, 2–2 (2008).
Malcolm, G. L. & Henderson, J. M. Combining top-down processes to guide eye movements during real-world scene search. J. Vis. 10, 4–4 (2010).
Wischnewski, M. & Peelen, M. V. Causal neural mechanisms of context-based object recognition. Elife 10, e69736 (2021).
de Brouwer, A. J., Flanagan, J. R. & Spering, M. Functional use of eye movements for an acting system. Trends Cogn. Sci. 25, 252–263 (2021).
Fooken, J., Kreyenmeier, P. & Spering, M. The role of eye movements in manual interception: A mini-review. Vis. Res. 183, 81–90 (2021).
Hayhoe, M. M. & Lerch, R. A. Visual guidance of natural behavior https://oxfordre.com/psychology/view/10.1093/acrefore/9780190236557.001.0001/acrefore-9780190236557-e-848 (2022).
Hayhoe, M. & Ballard, D. Eye movements in natural behavior. Trends Cogn. Sci. 9, 188–194 (2005).
Hayhoe, M. M. Vision and action. Annu. Rev. Vis. Sci. 3, 389–413 (2017).
Peterson, M. F. & Eckstein, M. P. Looking just below the eyes is optimal across face recognition tasks. Proc. Natl. Acad. Sci. 109, E3314–E3323 (2012).
Oliva, A., Konkle, T., Greene, M. R. & Torralba, A. Not all scene categories are created equal: The role of object and layout diagnosticity in scene gist understanding. J. Vis. 6, 464–464 (2006).
Eckstein, M. P., Drescher, B. A. & Shimozaki, S. S. Attentional cues in real scenes, saccadic targeting, and bayesian priors. Psychological Sci. 17, 973–980 (2006).
Neider, M. B. & Zelinsky, G. J. Scene context guides eye movements during visual search. Vis. Res. 46, 614–621 (2006).
Tatler, B. W., Hayhoe, M. M., Land, M. F. & Ballard, D. H. Eye guidance in natural vision: Reinterpreting salience. J. Vis. 11, 5–5 (2011).
Schütz, A. C., Braun, D. I. & Gegenfurtner, K. R. Eye movements and perception: A selective review. J. Vis. 11, 9–9 (2011).
Schütz, A. C., Trommershäuser, J. & Gegenfurtner, K. R. Dynamic integration of information about salience and value for saccadic eye movements. Proc. Natl. Acad. Sci. 109, 7547–7552 (2012).
Hayhoe, M. & Ballard, D. Modeling task control of eye movements. Curr. Biol. 24, R622–R628 (2014).
Foulsham, T. Eye movements and their functions in everyday tasks. Eye 29, 196–199 (2015).
Koehler, K. & Eckstein, M. P. Beyond scene gist: Objects guide search more than scene background. J. Exp. Psychol.: Hum. Percept. Perform. 43, 1177 (2017).
Eckstein, M. P., Koehler, K., Welbourne, L. E. & Akbas, E. Humans, but not deep neural networks, often miss giant targets in scenes. Curr. Biol. 27, 2827–2832.e3 (2017).
Chakravarthula, P. N., Tsank, Y. & Eckstein, M. P. Eye movement strategies in face ethnicity categorization vs. face identification tasks. Vis. Res. 186, 59–70 (2021).
Han, N. X. & Eckstein, M. P. Head and body cues guide eye movements and facilitate target search in real-world videos. J. Vis. 23, 5–5 (2023).
Najemnik, J. & Geisler, W. S. Optimal eye movement strategies in visual search. Nature 434, 387–391 (2005).
Hoppe, D. & Rothkopf, C. A. Multi-step planning of eye movements in visual search. Sci. Rep. 9, 144 (2019).
Einhäuser, W., Spain, M. & Perona, P. Objects predict fixations better than early saliency. J. Vis. 8, 18–18 (2008).
Parkhurst, D., Law, K. & Niebur, E. Modeling the role of salience in the allocation of overt visual attention. Vis. Res. 42, 107–123 (2002).
Bruce, N., & Tsotsos, J. Saliency based on information maximization. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Vol. 18, 155–162 (MIT Press, 2005).
Itti, L. Chapter 94 - models of bottom-up attention and saliency. In Itti, L., Rees, G. & Tsotsos, J. K. (eds.) Neurobiology of Attention, 576–582 (Academic Press, Burlington, 2005). https://www.sciencedirect.com/science/article/pii/B9780123757319500987.
Le Meur, O., Thoreau, D., Le Callet, P. & Barba, D. A spatio-temporal model of the selective human visual attention. In IEEE International Conference on Image Processing 2005, vol. 3, III–1188 (IEEE, 2005).
Peters, R. J., Iyer, A., Itti, L. & Koch, C. Components of bottom-up gaze allocation in natural images. Vis. Res. 45, 2397–2416 (2005).
Le Meur, O., Le Callet, P. & Barba, D. Predicting visual fixations on video based on low-level visual features. Vis. Res. 47, 2483–2498 (2007).
Borji, A. & Itti, L. State-of-the-art in visual attention modeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 185–207 (2012).
Borji, A., Sihite, D. N. & Itti, L. What stands out in a scene? a study of human explicit saliency judgment. Vis. Res. 91, 62–77 (2013).
Stoll, J., Thrun, M., Nuthmann, A. & Einhäuser, W. Overt attention in natural scenes: Objects dominate features. Vis. Res. 107, 36–48 (2015).
Henderson, J. M., Hayes, T. R., Rehrig, G. & Ferreira, F. Meaning Guides Attention during Real-World Scene Description. Sci. Rep. 8, 13504 (2018).
Peacock, C. E., Hayes, T. R. & Henderson, J. M. The role of meaning in attentional guidance during free viewing of real-world scenes. Acta Psychologica 198, 102889 (2019).
Coco, M. I. & Keller, F. Scan patterns predict sentence production in the cross-modal processing of visual scenes. Cogn. Sci. 36, 1204–1223 (2012).
Coco, M. I. & Keller, F. Classification of visual and linguistic tasks using eye-movement features. J. Vis. 14, 11–11 (2014).
Esaulova, Y., Penke, M. & Dolscheid, S. Describing events: Changes in eye movements and language production due to visual and conceptual properties of scenes. Front. Psychol. 10, 835 (2019).
Cerf, M., Frady, E. P. & Koch, C. Faces and text attract gaze independent of the task: Experimental data and computer model. J. Vis. 9, 10–10 (2009).
Birmingham, E., Bischof, W. F. & Kingstone, A. Saliency does not account for fixations to eyes within social scenes. Vis. Res. 49, 2992–3000 (2009).
Judd, T., Ehinger, K., Durand, F. & Torralba, A. Learning to predict where humans look. In 2009 IEEE 12th International Conference on Computer Vision, 2106–2113 (IEEE, 2009).
Judd, T., Durand, F. & Torralba, A. A benchmark of computational models of saliency to predict human fixations. In MIT Technical Report (2012).
Borji, A. & Itti, L. CAT2000: a large-scale fixation dataset for boosting saliency research. Preprint at arXiv:1505.03581 (2015).
Koehler, K., Guo, F., Zhang, S. & Eckstein, M. P. What do saliency models predict? J. Vis. 14, 14–14 (2014).
Bylinskii, Z., Isola, P., Bainbridge, C., Torralba, A. & Oliva, A. Intrinsic and extrinsic effects on image memorability. Vis. Res. 116, 165–178 (2015).
Xu, J., Jiang, M., Wang, S., Kankanhalli, M. S. & Zhao, Q. Predicting human gaze beyond pixels. J. Vis. 14, 28–28 (2014).
Hoh, W. K., Zhang, F.-L. & Dodgson, N. A. Salient-centeredness and saliency size in computational aesthetics. ACM Trans. Appl. Percept. 20, 1–23 (2023).
Levesque, H., Davis, E. & Morgenstern, L. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (2012).
Linardos, A., Kümmerer, M., Press, O. & Bethge, M. DeepGaze IIE: calibrated prediction in and out-of-domain for state-of-the-art saliency modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12919–12928 (2021).
Murlidaran, S. & Eckstein, M. P. Eye movements during free viewing to maximize scene understanding. Available at https://github.com/shravan1394/WinogradDataset (2025).
Kleinke, C. L. Gaze and eye contact: a research review. Psychological Bull. 100, 78 (1986).
Brooks, R. & Meltzoff, A. N. The development of gaze following and its relation to language. Developmental Sci. 8, 535–543 (2005).
Castelhano, M. S., Wieth, M. & Henderson, J. M. I see what you see: Eye movements in real-world scenes are affected by perceived direction of gaze. In Attention in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint: 4th International Workshop on Attention in Cognitive Systems, WAPCV 2007 Hyderabad, India, January 8, 2007 Revised Selected Papers 4, 251–262 (Springer, 2007).
Han, N. X. & Eckstein, M. P. Inferential eye movement control while following dynamic gaze. eLife 12, e83187 (2023).
Eshuis, R., Coventry, K. R. & Vulchanova, M. Predictive eye movements are driven by goals, not by the mirror neuron system. Psychological Sci. 20, 438–440 (2009).
Fawcett, C. & Gredebäck, G. Infants use social context to bind actions into a collaborative sequence. Developmental Sci. 16, 841–849 (2013).
Gredebäck, G. & Falck-Ytter, T. Eye movements during action observation. Perspect. Psychological Sci. 10, 591–598 (2015).
Borovska, P. & de Haas, B. Faces in scenes attract rapid saccades. J. Vis. 23, 11–11 (2023).
Fausey, C. M., Jayaraman, S. & Smith, L. B. From faces to hands: Changing visual input in the first two years. Cognition 152, 101–107 (2016).
Peacock, C. E., Hayes, T. R. & Henderson, J. M. Meaning guides attention during scene viewing, even when it is irrelevant. Atten., Percept., Psychophys. 81, 20–34 (2019).
Borji, A., Sihite, D. N. & Itti, L. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Trans. Image Process. 22, 55–69 (2012).
Torralba, A., Oliva, A., Castelhano, M. S. & Henderson, J. M. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Rev. 113, 766 (2006).
Henderson, J. M., Malcolm, G. L. & Schandl, C. Searching in the dark: Cognitive relevance drives attention in real-world scenes. Psychonomic Bull. Rev. 16, 850–856 (2009).
Võ, M. L.-H. & Henderson, J. M. Does gravity matter? Effects of semantic and syntactic inconsistencies on the allocation of attention during scene perception. J. Vis. 9, 24–24 (2009).
Wolfe, J. M., Võ, M. L.-H., Evans, K. K. & Greene, M. R. Visual search in scenes involves selective and nonselective pathways. Trends Cogn. Sci. 15, 77–84 (2011).
Castelhano, M. S. & Heaven, C. Scene context influences without scene gist: Eye movements guided by spatial associations in visual search. Psychonomic Bull. Rev. 18, 890–896 (2011).
Mack, S. C. & Eckstein, M. P. Object co-occurrence serves as a contextual cue to guide and facilitate visual search in a natural viewing environment. J. Vis. 11, 9–9 (2011).
Castelhano, M. S. & Witherspoon, R. L. How you use it matters: Object function guides attention during visual search in scenes. Psychological Sci. 27, 606–621 (2016).
Eckstein, M. P. Probabilistic computations for attention, eye movements, and search. Annu. Rev. Vis. Sci. 3, 319–342 (2017).
Võ, M. L.-H., Boettcher, S. E. & Draschkow, D. Reading scenes: How scene grammar guides attention and aids perception in real-world environments. Curr. Opin. Psychol. 29, 205–210 (2019).
Goettker, A., Pidaparthy, H., Braun, D. I., Elder, J. H. & Gegenfurtner, K. R. Ice hockey spectators use contextual cues to guide predictive eye movements. Curr. Biol. 31, R991–R992 (2021).
Hayes, T. R. & Henderson, J. M. Looking for semantic similarity: what a vector-space model of semantics can tell us about attention in real-world scenes. Psychological Sci. 32, 1262–1270 (2021).
Friesen, C. K. & Kingstone, A. The eyes have it! Reflexive orienting is triggered by nonpredictive gaze. Psychonomic Bull. Rev. 5, 490–495 (1998).
Han, N. X. & Eckstein, M. P. Gaze-cued shifts of attention and microsaccades are sustained for whole bodies but are transient for body parts. Psychonomic Bull. Rev. 29, 1854–1878 (2022).
Rucci, M. & Poletti, M. Control and functions of fixational eye movements. Annu. Rev. Vis. Sci. 1, 499–518 (2015).
Ko, H.-K., Poletti, M. & Rucci, M. Microsaccades precisely relocate gaze in a high visual acuity task. Nat. Neurosci. 13, 1549–1553 (2010).
Spering, M., Schütz, A. C., Braun, D. I. & Gegenfurtner, K. R. Keep your eyes on the ball: smooth pursuit eye movements enhance prediction of visual motion. J. Neurophysiol. 105, 1756–1767 (2011).
Zelinsky, G. J. A theory of eye movements during target acquisition. Psychological Rev. 115, 787 (2008).
Eckstein, M. P., Schoonveld, W., Zhang, S., Mack, S. C. & Akbas, E. Optimal and human eye movements to clustered low value cues to increase decision rewards during search. Vis. Res. 113, 137–154 (2015).
Hollingworth, A. & Henderson, J. M. Accurate visual memory for previously attended objects in natural scenes. J. Exp. Psychol.: Hum. Percept. Perform. 28, 113 (2002).
Brenner, E., Granzier, J. J. & Smeets, J. B. Perceiving colour at a glimpse: The relevance of where one fixates. Vis. Res. 47, 2557–2568 (2007).
Huebner, G. M. & Gegenfurtner, K. R. Effects of viewing time, fixations, and viewing strategies on visual memory for briefly presented natural objects. Q. J. Exp. Psychol. 63, 1398–1413 (2010).
Gegenfurtner, K. R. The interaction between vision and eye movements. Perception 45, 1333–1357 (2016).
Koehler, K. & Eckstein, M. P. Temporal and peripheral extraction of contextual cues from scenes during visual search. J. Vis. 17, 16–16 (2017).
Araujo, C., Kowler, E. & Pavel, M. Eye movements during visual search: The costs of choosing the optimal path. Vis. Res. 41, 3613–3625 (2001).
Kowler, E. Eye movements: The past 25 years. Vis. Res. 51, 1457–1483 (2011).
Loschky, L. C., Larson, A. M., Smith, T. J. & Magliano, J. P. The scene perception & event comprehension theory (SPECT) applied to visual narratives. Top. Cogn. Sci. 12, 311–351 (2020).
Loschky, L. C. et al. The role of event understanding in guiding attentional selection in real-world scenes: the scene perception & event comprehension theory (SPECT). Preprint at PsyArXiv (2024).
Berlot, E., Schmitt, L., Huber-Huber, C., Peelen, M. & de Lange, F. I see! How narrative meaning influences gaze behaviour. Conference on Cognitive Computational Neuroscience (2023).
Roth, N., McLaughlin, J., Obermayer, K. & Rolfs, M. Gaze behavior reveals expectations of potential scene changes. Psychological Sci. 35, 1350–1363 (2024).
Gottlieb, J., Oudeyer, P.-Y., Lopes, M. & Baranes, A. Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends Cogn. Sci. 17, 585–593 (2013).
Gottlieb, J. & Oudeyer, P.-Y. Towards a neuroscience of active sampling and curiosity. Nat. Rev. Neurosci. 19, 758–770 (2018).
Baranes, A., Oudeyer, P.-Y. & Gottlieb, J. Eye movements reveal epistemic curiosity in human observers. Vis. Res. 117, 81–90 (2015).
Doerig, A. et al. High-level visual representations in the human brain are aligned with large language models. Nat. Mach. Intell. 7, 1220–1234 (2025).
Li, F. F., VanRullen, R., Koch, C. & Perona, P. Rapid natural scene categorization in the near absence of attention. Proc. Natl. Acad. Sci. 99, 9596–9601 (2002).
Oliva, A. & Torralba, A. Building the gist of a scene: The role of global image features in recognition. Prog. Brain Res. 155, 23–36 (2006).
Jonnalagadda, A., Wang, W. Y., Manjunath, B. & Eckstein, M. P. FoveaTer: foveated transformer for image classification. Preprint at arXiv:2105.14173 (2021).
Kollenda, D., Reher, A.-S. & de Haas, B. Individual gaze predicts individual scene descriptions. Sci. Rep. 15, 9443 (2025).
Chua, H. F., Boland, J. E. & Nisbett, R. E. Cultural variation in eye movements during scene perception. Proc. Natl. Acad. Sci. 102, 12629–12633 (2005).
Kümmerer, M. Saliency models implementation. https://github.com/matthias-k/pysaliency (2015).
Bruce, N. D., Wloka, C., Frosst, N., Rahman, S. & Tsotsos, J. K. On computational modeling of visual saliency: Examining what’s right, and what’s left. Vis. Res. 116, 95–112 (2015).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at arXiv:1409.1556 (2014).
Kümmerer, M. DeepGaze II. https://github.com/matthias-k/DeepGaze (2016).
Borji, A., Tavakoli, H. R., Sihite, D. N. & Itti, L. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, 921–928 (2013).
Reid, M. et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at arXiv:2403.05530 (2024).
Achiam, J. et al. GPT-4 technical report. Preprint at arXiv:2303.08774 (2023).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B (Methodol.) 57, 289–300 (1995).
Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
Murlidaran, S. & Eckstein, M. P. Eye movements during free viewing to maximize scene understanding, https://doi.org/10.1167/jov.24.10.1189 (2025).
Acknowledgements
This study was supported by the Institute for Collaborative Biotechnologies (ICB) cooperative agreement W911NF-19-2-0026 and the Noyce Foundation (M.P.E.). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the US Government. The US Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.
Author information
Contributions
The authors, S.M. and M.P.E., contributed equally to the design of the experiments, data analysis, and writing of the paper. S.M. conducted the experiments and analyzed the data under M.P.E.'s supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Murlidaran, S., Eckstein, M.P. Eye movements during free viewing to maximize scene understanding. Nat Commun 17, 940 (2026). https://doi.org/10.1038/s41467-025-67673-w