Introduction

Object recognition, a fundamental aspect of visual perception, operates at varying levels of abstraction, including the super-ordinate (e.g., animate vs. inanimate), mid-level (e.g., face vs. body), and sub-ordinate (e.g., face identity) categories. Numerous studies have examined hierarchical object recognition1,2,3,4. Spatial frequency (SF), ranging from low (coarse shapes) to high (fine details), also strongly influences visual perception and plays a pivotal role in the neural representation of objects5,6,7,8. However, the role of SF in hierarchical object recognition is not well understood. Understanding this role is essential for uncovering the neural mechanisms that enable rapid and efficient visual processing in both natural and artificial vision systems. Clarifying how SF bands contribute to distinct levels of abstraction could inform both neuroscientific theory and practical applications in AI and visual perception models.

Biologically, our visual system is adept at utilizing different SF bands for various categorization tasks, implying an evolutionary adaptation to process visual information efficiently. This adaptability is evident in tasks like house vs. flower categorization, which is more effective at high SFs (HSF) (SF > 24 cycles image−1), or flower vs. face categorization, which shows better performance at low SFs (LSF) (SF < 8 cycles image−1)9. These distinctions imply that our visual system may have evolved to use specific SF bands as shortcuts for different categorization tasks, enabling rapid and contextually appropriate responses. Such preferences underscore the importance of SF in optimizing neural resources for various recognition tasks, depending on the category demands. Moreover, studies show that SF can be explicitly decoded from neuronal responses in the occipitotemporal cortex10 and the inferior temporal (IT) cortex11. Cheung and Bar showed that the LSF bands of an image (SF < 8 cycles image−1) drive top-down predictions that facilitate object recognition12. The middle frequencies (16–24 cycles image−1) have been shown to be important for natural scene categorization (subjects had to determine the presence or absence of a car in a natural scene)13. Using a delayed match-to-sample task with randomly filtered stimuli, Caplette et al. showed that objects are better represented in the middle SF bands (13.8–24 cycles image−1)14. Middle and high frequencies have also been suggested as the best SF bands for face recognition: studies show that the appropriate frequency band for face recognition is between 5 and 25 cycles face−1 15,16,17,18,19. These findings collectively indicate that the visual system’s reliance on distinct SF bands for different tasks likely reflects an adaptive optimization of processing efficiency. By selectively recruiting certain SF ranges, such as LSF for broad categorizations or middle-to-high SFs for detailed recognition like faces, the brain minimizes computational demands and enhances speed in processing complex visual scenes. This selective SF utilization points to a neural mechanism for prioritizing certain visual features based on task requirements and underscores the importance of SF in shaping hierarchical perception.

Moreover, psychophysical evidence has shown poor performance and higher reaction time in super-ordinate level categorization (living vs. non-living) in LSF20, suggesting the necessity of finer shape information content for super-ordinate level categorization. However, Ashtiani et al. suggest the better performance of super-ordinate level categorization (animal/non-animal) in LSFs compared to the mid and sub-ordinate levels, which appear to require finer shape information in higher SFs and take longer to be processed21. Studies indicate that the visual system uses various SF bands to perform the categorization tasks, which could be affected by the task and involved objects, suggesting a biological basis for the differential impact of SF bands on various category levels. Therefore, visual perception is greatly affected by both the SF contents and the level of abstraction. Despite the extensive examination of object recognition at various abstraction levels, the specific role of SF in modulating hierarchical categorization has not been fully explored. By addressing this interaction, this study aims to shed light on how the visual system dynamically allocates processing resources across SF bands to achieve optimal recognition across different levels, revealing underlying mechanisms in neural processing.

The IT cortex lies at the end of the ventral visual pathway in nonhuman primates and contains various category-selective neurons1,22,23,24. A faster representation of more abstract versions25,26 and mid-level categories compared to super- and sub-ordinate levels4 has been suggested in this region; however, the effect of SF on the representation of the categories in different levels of abstraction requires further investigation. This study is motivated by the hypothesis that different SF bands play distinct roles in the biological process of object categorization. Specifically, we explore how low spatial frequencies (LSF, 1–10 cycles image−1) and high spatial frequencies (HSF, 18–75 cycles image−1) differentially contribute to categorizing objects at various levels of abstraction (super-ordinate, mid-level, and sub-ordinate). To examine the effect of SF content on categorical representation at multiple levels, we recorded the responses of IT and STS neurons to various visual stimuli in three levels of abstraction and three levels of SF content (i.e., intact, high, and low). We show that mid-level categories (e.g., face-body) are represented at all levels of SF content, while in sub-ordinate levels (e.g., identity), only the intact stimuli are represented in IT and STS neurons. We observed that the representation of animacy level (i.e., super-ordinate level) requires fine shape information (HSF contents), whereas there is no super-ordinate level information in LSF stimuli. Therefore, both HSF and LSF information are necessary for the sub-ordinate representation, while only the HSF information is essential for the super-ordinate levels of abstraction. Thus, our findings suggest the necessity of finer shape information (i.e., HSF content) for global object categorization (super-ordinate). Furthermore, fine and coarse shape information (i.e., HSF and LSF, respectively) are critical for finer categories in visual processing. Finally, we conducted a human psychophysics task to evaluate our findings in human decision-making. We found that, as in the neurophysiological data, super-ordinate categorization can be done in HSF as well as in intact stimuli, while LSF decreases the performance. At the sub-ordinate level, removing either the LSF or the HSF components degrades the performance. Furthermore, a faster reaction time for mid-level stimuli is observed in all SF ranges.

Results

To investigate the impact of SF content on hierarchical object categorization using neural activities of the IT and STS areas, we designed a rapid serial visual presentation (RSVP)27,28,29 (stimulus duration, 50 ms; interstimulus interval, 450 ms) task (Fig. 1b). Eighty-one images of real-world objects in three SF levels (Fig. 1c) were displayed to passive-viewing monkeys while individual IT neurons were recorded. The cells were located across the IT cortex and STS (Fig. 1a). For simplicity, we refer to this combined area as “IT” throughout the paper; however, “IT” here includes all recorded neurons from both the IT cortex and the STS. Results rely on the neural population response analysis for all stimuli. The stimuli were images of eighty-one grayscale natural and artificial objects. The organization of visual object categories at three levels of abstraction enables us to examine the category representation at different levels of abstraction hierarchy for each SF content (Fig. 1c). We divided the stimuli into animate and inanimate categories. We then divided the animate stimuli into faces and bodies, and human faces into female and male faces4. This categorical sort is compatible with prior studies about natural categorical representations in the monkey and human IT24,30,31,32. Throughout the results, only neurons that were responsive across all stimuli were included in the analysis to ensure reliability. A neuron was considered responsive if its average firing rate between 70 and 170 ms after the onset of all stimuli was significantly higher than its firing rate during a baseline period, spanning from 50 ms before to 50 ms after stimulus onset. In total, there exist 330 responsive neurons (45 were recorded from the STS), and all population analyses are conducted on all 330 responsive neurons.
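The responsiveness criterion above compares each neuron's firing rate in the 70–170 ms response window with its rate in the baseline window. The text does not name the statistical test, so the following is only a minimal sketch that assumes a one-sided paired sign-flip permutation test; the function name, the `alpha` threshold, and the per-stimulus rate lists are hypothetical.

```python
import random
from statistics import mean


def is_responsive(response_rates, baseline_rates, n_perm=2000, alpha=0.05, seed=0):
    """Return True if response-window rates exceed baseline rates.

    response_rates / baseline_rates: one firing rate per stimulus presentation
    (hypothetical layout), measured in the 70-170 ms and the -50 to +50 ms
    windows respectively.
    ASSUMPTION: a one-sided paired sign-flip permutation test; the paper does
    not state which test was used.
    """
    diffs = [r - b for r, b in zip(response_rates, baseline_rates)]
    observed = mean(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_perm + 1)
    return p_value < alpha
```

Under this sketch, the population would be formed by keeping only the neurons for which `is_responsive` returns True.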

Fig. 1: Paradigm and stimuli.

a Recording area from inferior temporal (IT) and superior temporal sulcus (STS) neurons in macaque monkeys’ brains. b Schematic of a rapid serial visual presentation (RSVP) task, with raw images of different spatial frequencies (SFs) shown to the monkey. c Hierarchical category structure of stimuli at three SF levels. Three category levels are defined: (1) super-ordinate level: animate vs. inanimate. (2) mid-level: face vs. body, human body vs. animal body, animal face vs. human face, and natural inanimate vs. artificial inanimate. (3) sub-ordinate level: individual human identity. Three SFs are defined: low SF (LSF), intact, and high SF (HSF). Blue, green, and orange are associated with the super-ordinate, sub-ordinate, and mid-level hierarchical levels, respectively. d Peristimulus time histogram (PSTH) for all hierarchical levels, from right to left for Intact, LSF, and HSF. These are color-coded as follows: animate (aqua), inanimate (blue), face (light orange), body (red), identity1 (lawn green), and identity2 (dark green). The shaded areas indicate standard deviation.

Global category information needs fine shape information

To show the effect of SF filtering on IT representations, we investigate the changes in neuron responses in LSF and HSF compared to the intact representations using the neuron response modulation index (RMI). RMI is the difference between two variables (e.g., firing rates for LSF and intact) normalized to their sum. Figure 2a shows the index for various categories under LSF and HSF conditions. HSF affects the neuron responses significantly more than LSF (animate: LSF = 0.006 ± 0.01 (p = 0.16), HSF = 0.063 ± 0.01 (p = 2e-9), and HSF > LSF: p = 5e-50; inanimate: LSF = 0.01 ± 0.013 (p = 0.22), HSF = 0.080 ± 0.01 (p = 7e-11), and HSF > LSF: p = 7e-44; face: LSF = 0.025 ± 0.01 (p = 0.01), HSF = 0.066 ± 0.01 (p = 9e-9), and HSF > LSF: p = 1e-44; body: LSF = −0.003 ± 0.01 (p = 0.72), HSF = 0.068 ± 0.01 (p = 2e-11), and HSF > LSF: p = 1e-46). While the impact of HSF (i.e., removing LSF components) does not vary significantly among the categories (p = 0.80, one-way analysis of variance (ANOVA)), LSF (i.e., removing HSF components) significantly affects the neuronal responses only for the face category.
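As a small illustration of the index described above (a difference normalized to the sum), the sketch below computes an RMI per neuron and averages it over a population. The order of subtraction is an assumption, since the text does not state which condition is subtracted from which, and the function names are hypothetical.

```python
from statistics import mean


def response_modulation_index(rate_a, rate_b):
    """Difference of two firing rates normalized to their sum.

    ASSUMPTION: rate_a minus rate_b; the sign convention is not given in the
    text. Returns 0.0 when both rates are zero to avoid dividing by zero.
    """
    total = rate_a + rate_b
    if total == 0:
        return 0.0
    return (rate_a - rate_b) / total


def population_rmi(rates_a, rates_b):
    """Mean RMI over neurons, given one mean firing rate per neuron and condition."""
    return mean(response_modulation_index(a, b) for a, b in zip(rates_a, rates_b))
```

For non-negative firing rates the index is bounded between −1 and 1, and it is 0 when the two conditions evoke the same rate, which makes values comparable across neurons with different overall firing rates.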

Fig. 2: The effect of SF on hierarchical category representation.

a Response modulation index (RMI) across all neurons. The error bars and shaded areas show the standard deviation of the index measured using the bootstrap method. HSF (removing LSF components) affects the responses across all categories more than LSF (removing high SF (HSF) components). b–d Support vector machine (SVM) decoding accuracy on population responses in three levels of abstraction: animate vs. inanimate (A-I), face vs. body (F-B), and Identity (Id) in three SF levels. The colors represent three hierarchies (blue for A-I, orange for F-B, and green for Id). The gray area in each panel represents the early and late phases, corresponding to the onset and peak of each SF. While the mid-level abstraction is roughly unaffected by SF filtering, the performance of sub-ordinate categorization is highly degraded. On the other hand, super-ordinate categorization is degraded only in the LSF condition (removing HSF components). This observation suggests the necessity of all SF spectrums for fine and only HSF for coarse category representation.

Next, the impact of SF filtering on population decoding performance is investigated using a support vector machine (SVM) classifier. The impact of SF on 2D hierarchical category representation is illustrated in Supplementary Fig. 1. The time course of the accuracy of the classifier on each abstraction level using the leave-p-out method is illustrated in Fig. 2b–d. The peak accuracy rate for all levels of abstraction in intact (mid = 0.92 ± 0.025, super = 0.90 ± 0.018, sub = 0.81 ± 0.038) shows the generalizability of representations in full SF. In LSF (Fig. 2c), the performance of mid-level (face vs. body) is less affected compared to the performances of super- and sub-ordinate categorizations, which are highly decreased (mid = 0.82 ± 0.032, super = 0.64 ± 0.028, sub = 0.56 ± 0.041), suggesting the importance of HSF for both super- and sub-ordinate levels. On the other hand, in HSF (Fig. 2d), the accuracy rate of both mid-level and super-ordinate categories is close to the intact condition, while the performance in sub-ordinate level is decreased as in LSF (mid = 0.87 ± 0.025, super = 0.85 ± 0.026, sub = 0.58 ± 0.050). These observations suggest the robustness of mid-level decoding capabilities in all SF ranges, while HSF is necessary for the super-ordinate level, and both LSF and HSF carry vital information for sub-ordinate categorization. While the SVM method shows the generalizability of the representations, it reduces each sample to correctly or wrongly classified and does not consider their relative distance. To overcome this limit, we use the Separability Index (SI)4 to better understand the impact of SF on population decoding capabilities in three levels of abstraction.
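The leave-p-out evaluation mentioned above can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM, so this is not the classifier used in the study; the value of p, the function names, and the data layout (one population response vector per stimulus) are assumptions.

```python
from itertools import combinations


def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample.

    NOTE: a stand-in for the SVM used in the paper, chosen only to keep
    this sketch free of third-party dependencies.
    """
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


def leave_p_out_accuracy(x, y, p=2):
    """Average accuracy over every way of holding out p samples.

    x: list of population response vectors (one per stimulus, hypothetical layout).
    y: list of category labels. p=2 is an assumed value.
    """
    indices = range(len(x))
    correct = 0
    total = 0
    for held_out in combinations(indices, p):
        held = set(held_out)
        train_x = [x[i] for i in indices if i not in held]
        train_y = [y[i] for i in indices if i not in held]
        if len(set(train_y)) < len(set(y)):
            continue  # skip splits that remove an entire class from training
        for i in held_out:
            correct += nearest_centroid_predict(train_x, train_y, x[i]) == y[i]
            total += 1
    return correct / total
```

Running such a procedure on the responses in successive time bins would yield an accuracy time course of the kind plotted in Fig. 2b–d.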

Using a Separability Index (SI)4 that determines the proportion of between-category and within-category distances of stimuli based on the IT responses, we showed the strength and reliability of category discrimination in IT neural populations. Figure 3 (a-left) demonstrates the time course of SI for three abstraction levels for the intact SF condition. This figure illustrates that SI reaches significant values in all categories after the stimulus onset. Figure 3 (a-right) shows that the SI value for mid-level is significantly higher than super-ordinate and sub-ordinate levels (SI value for mid-level = 1.67 ± 0.09, super-ordinate = 0.27 ± 0.01, sub-ordinate = 0.32 ± 0.03, mid-level vs. super- or sub-ordinate p < 0.001, super-ordinate vs. sub-ordinate p = 0.11). To analyze the time course of category representations in different SF bands, time courses and SI values in LSF and HSF are presented in Fig. 3b, c. In the LSF, only the SI value for the mid-level is significant (SI value in LSF for mid-level = 0.87 ± 0.05, super-ordinate = 0.11 ± 0.06, sub-ordinate = 0.15 ± 0.11; compared to baseline mid-level p < 0.001, super-ordinate p = 0.09, and sub-ordinate p = 0.47; onset time for mid-level = 99.81 ± 1.63).
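One way to read the SI description above is as a ratio of the mean between-category distance to the mean within-category distance. The sketch below follows that reading; the ratio form, the use of 1 − r (Pearson) as the distance, and the function names are assumptions, and the baseline offset described in the Fig. 3 caption is not applied here.

```python
from itertools import combinations
from math import sqrt


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def separability_index(patterns, labels):
    """Mean between-category distance divided by mean within-category distance.

    patterns: one population response vector per stimulus.
    labels: category label per stimulus.
    ASSUMPTIONS: ratio form and correlation distance (1 - r); the exact
    definition is given in the paper's Methods and reference 4.
    """
    within, between = [], []
    for i, j in combinations(range(len(patterns)), 2):
        distance = 1.0 - pearson_r(patterns[i], patterns[j])
        if labels[i] == labels[j]:
            within.append(distance)
        else:
            between.append(distance)
    return (sum(between) / len(between)) / (sum(within) / len(within))
```

Unlike classification accuracy, a distance-based index of this kind keeps growing as categories move further apart, which is the property the text relies on when it contrasts SI with the SVM result.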

Fig. 3: Time course and separability index (SI) values for each level of abstraction in intact (I), low spatial frequency (L), and high SF (H) stimuli for the population of recorded IT neurons.

a Time course of the SI for the three hierarchy levels for intact stimuli (left). The time courses were offset by the mean value of the index at the 50 ms to 1 ms interval before the stimulus onset. Shaded areas describe SD and are measured using the bootstrap method. The bar graphs (right) show the SI values at (70–320) ms after stimulus onset for animate-inanimate, face-body, and Identity at intact. b, c Time course and SI values for the three levels of abstraction for LSF and HSF stimuli, respectively. The colors represent three hierarchies (blue for inanimate vs. animate, orange for face vs. body, and green for identity). Error bars and shaded areas indicate the standard deviation of the mean estimated by bootstrapping the stimulus set. The gray region in each panel indicates the early and late phases, which are associated with the onset and peak of each SF. The significance of each comparison against the mid-level is given above the bars (p > 0.05 indicated by n.s. for “not significant”, p < 0.05 indicated by *, p < 0.01 indicated by **, and p < 0.001 indicated by ***).

As shown in Fig. 3b, in the LSF condition, SI reaches significant values at mid-level abstraction after stimulus onset, while it does not reach significant values at the super-ordinate and sub-ordinate levels of abstraction. Concerning LSF content (Fig. 3b right), the SI value is still significant at the mid-level; however, we failed to find any statistically significant discrimination at the super-ordinate and sub-ordinate levels. In the HSF (Fig. 3c), SI reaches significant values at the mid-level and super-ordinate levels of abstraction (SI value in HSF for mid-level = 1.44 ± 0.08, super-ordinate = 0.29 ± 0.02; compared to baseline mid-level and super-ordinate p < 0.001). In contrast, SI does not reach significant values at the sub-ordinate level of abstraction (SI value in HSF for sub-ordinate = 0.15 ± 0.05; compared to baseline p = 0.24). Further, as illustrated in Fig. 3, the SI value for the mid-level is significantly higher than that for the super-ordinate and the sub-ordinate levels in all SFs (in all SFs, mid-level vs. super- or sub-ordinate p < 0.001). Furthermore, the sub-ordinate level does not contain significant information in LSF or HSF. Therefore, we concluded that mid- and super-ordinate level representations were maintained after LSF removal, while only mid-level representations persisted when HSF was removed. In the intact stimuli, all abstraction levels were represented. Nevertheless, the information represented for the mid-level is significantly higher than for the super-ordinate and sub-ordinate levels.

The results of this section indicate that the mid-level categories are well represented in all SF bands in the IT cortex. However, the presence of HSF contents is critical for representing information at the super-ordinate level since the removal of the HSF contents (LSF condition) significantly impacts the super-ordinate level information. The representation of the sub-ordinate level is highly affected by both LSF and HSF filtering, suggesting that all SF bands are necessary for the sub-ordinate level. The observations suggest that the global object information requires finer shape information (HSF), and the fine object information requires a variety of SF contents ranging from LSF to HSF. The results presented in the paper are consistent when examining individual animals, with Supplementary Fig. 2 showing the main finding separately for the two monkeys in this study.

Since IT is a large area, the recording location plays a crucial role in our study. For this reason, we pinpointed neurons spread across the anterior IT (AIT), central IT (CIT), and posterior IT (PIT) regions. Interestingly, our findings remained consistent across different recording sites as outlined in Supplementary Fig. 3.

Interaction between SF and category in the mid-level of abstraction

To individually investigate the representation of mid-level categories, i.e., body and face, the time course of the SI of these categories and the animate category was calculated against the inanimate (Fig. 4). These calculations are done for HSF, Intact, and LSF stimuli. According to Fig. 4a, the discrimination of body is higher than face and animate categories in intact (peak SI value for animate = 0.27 ± 0.01, face = 0.71 ± 0.03, and body = 1.12 ± 0.05; body vs. face p < 0.001, face vs. animate p < 0.001). This pattern is repeated for HSF stimuli (peak SI value for animate = 0.29 ± 0.02, face = 0.47 ± 0.02, and body = 1.26 ± 0.05; body vs. face p < 0.001 and face vs. animate p < 0.001) (Fig. 4c). In LSF (Fig. 4b), the SI value for body drops significantly while the face information is comparable to HSF (peak SI value for animate is not significant against the baseline with p = 0.09, face = 0.45 ± 0.04, and body = 0.32 ± 0.02; peak SI value for the body in intact vs. LSF p < 0.001 and HSF vs. LSF p < 0.001; face in intact vs. HSF p < 0.001 and HSF vs. LSF p = 0.28).

Fig. 4: Temporal dynamics of the separability index (SI) for face, body, and animate vs. inanimate, and the magnitude of SI differences between the intact and both HSF and LSF conditions.

a Time courses of SI for three categories (face vs. inanimate, body vs. inanimate, and animate vs. inanimate) for Intact (I), b for low spatial frequency (LSF, illustrated by L), and c for high spatial frequency (HSF, illustrated by H). The time courses were offset by the mean value of the index at the 1–50 ms interval from stimulus onset. Shaded areas describe standard deviation, measured using the bootstrap method. d This bar plot shows the magnitude of intact SI difference against HSF and LSF at (70–320) ms for animate vs. inanimate, face vs. inanimate, and body vs. inanimate. Error bars and shaded areas indicate the standard deviation of the mean calculated by bootstrapping the stimulus set. The significance of each comparison is given above the bars (p > 0.05 indicated by n.s. for “not significant,” p < 0.05 indicated by *, p < 0.01 indicated by **, and p < 0.001 indicated by ***).

To investigate the effect of SF filtering on information, we use the SF modulation index (SMI). In SMI, to measure the impact of SF on the information of face, body, and animate categories, SI values in HSF and LSF are evaluated against the intact condition. SMI shows the proportion of information in the given category that HSF or LSF contents do not contain. Consequently, a larger SMI value for a given SF band and category indicates that the neural population does not rely on that SF content to represent the category. In contrast, smaller values suggest that the neural population uses that SF content to represent the category. Figure 4d shows the SMI value for three categories. Left to right, the first two bar plots illustrate the SMI for the animate category in LSF and HSF, respectively. As mentioned, the animate group has no significant SI value in LSF, so removing HSF content considerably degrades animate category discrimination. However, the SMI value shows no significant difference between intact and HSF (animate SMI value for HSF = −0.01 ± 0.02 and p = 0.18). Face information loss in LSF is slightly greater than in HSF (SMI for LSF = 0.26 ± 0.03 and HSF = 0.19 ± 0.02; SMI in LSF vs. HSF p = 0.03). The statistical test implies that the IT neuronal population represents the face category in both LSF and HSF with roughly similar information.

Furthermore, the body discrimination in HSF has been maintained and improved, while in LSF, the information drops significantly (SMI value for LSF = 0.54 ± 0.01 with p < 0.001 and HSF = −0.06 ± 0.02 with p < 0.01; SMI value in HSF vs. LSF p < 0.001). This observation suggests that the body is well represented in HSF. Face and body observations demonstrate an interaction between SF and category, even at the same level of abstraction (mid-level here).
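The SMI compares the SI under a filtered condition with the SI under the intact condition. The exact formula is given in the paper's Methods and is not reproduced in this section, so the sketch below assumes a normalized difference, (intact − filtered) / (intact + filtered); the function names are hypothetical. With the peak body SI values quoted above (intact 1.12, LSF 0.32, HSF 1.26), this assumed form gives roughly 0.56 and −0.06, in the neighborhood of the reported body SMI values, but that agreement should not be taken as confirmation of the definition.

```python
def sf_modulation_index(si_intact, si_filtered):
    """Information lost under SF filtering, relative to the intact condition.

    ASSUMPTION: normalized difference (intact - filtered) / (intact + filtered).
    Positive values mean the filtered stimuli carry less category information
    than the intact ones; negative values mean they carry more.
    """
    total = si_intact + si_filtered
    if total == 0:
        return 0.0
    return (si_intact - si_filtered) / total


def smi_per_condition(si_intact, si_by_condition):
    """SMI for each filtered condition, e.g. {"LSF": 0.32, "HSF": 1.26}."""
    return {
        name: sf_modulation_index(si_intact, si_value)
        for name, si_value in si_by_condition.items()
    }
```

A call such as `smi_per_condition(1.12, {"LSF": 0.32, "HSF": 1.26})` returns one index per filtered condition, which mirrors the paired bars per category in Fig. 4d.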

The unsupervised data-driven approach confirms the interaction between SF and abstraction level

Hierarchical clustering analysis is an unsupervised method that assumes some categorical structure for the data, yet it does not imply any unique grouping into categories31. Clustering scores can be viewed as an unsupervised grouping representation. If the clustering score for a given category is high, then the representation of the samples within that category will be more similar than that of the samples in other categories. Accordingly, each category could be analyzed separately based on the clustering scores.

Hierarchical cluster trees have been computed for the IT response patterns, as shown in Fig. 5a; furthermore, they have been computed in the early and late phases of the responses for each SF. In the early phase for intact, 80–110 ms after stimulus onset, the colored bullet points on the clustering tree show that the face and inanimate categories started to be clustered. Furthermore, the body category was clustered at the late phase, i.e., 155–185 ms, and the face and inanimate categories were also clustered at that time interval. A similar pattern can be observed in HSF and LSF. To quantify these observations, each row in Fig. 5b describes the clustering score for each category calculated between 70 and 320 ms after the stimulus onset for intact, LSF, and HSF contents, respectively. Furthermore, clustering scores for all categories in the late and early phases are reported in Table 1. According to Fig. 5b, the clustering score of the face is significantly higher than that of the body in intact stimuli (clustering score of face = 0.096 ± 0.005 and body = 0.086 ± 0.005; face vs. body p = 0.02). These patterns are preserved in both HSF and LSF (clustering score in LSF for face = 0.055 ± 0.004 and body = 0.048 ± 0.002; face vs. body p = 0.03; in HSF face = 0.068 ± 0.004 and body = 0.052 ± 0.003; face vs. body p < 0.001). Furthermore, the clustering score for body and face (i.e., mid-level) is significantly higher than other categories, consistent with the previous supervised approach (clustering score for intact: animate = 0.021 ± 0.009, inanimate = 0.056 ± 0.003, identity1 = 0.014 ± 0.002, identity2 = 0.018 ± 0.005; face/body vs. animate/inanimate/identity1/identity2 p < 0.001; for LSF: animate = 0.010 ± 0.001, inanimate = 0.009 ± 0.002, identity1 = 0.032 ± 0.002, identity2 = not significant; face/body vs. animate/inanimate/identity1/identity2 p < 0.001; for HSF: animate = 0.035 ± 0.001, inanimate = 0.047 ± 0.001, identity1 = 0.053 ± 0.002, identity2 = 0.013 ± 0.001; face vs. animate/inanimate/identity1/identity2 p < 0.001 and body vs. animate/identity2 p < 0.01, body vs. inanimate p = 0.09, body vs. identity1 p = 0.43). The clustering score of identity1 is significantly higher in LSF and HSF than the intact (intact = 0.014 ± 0.002, LSF = 0.032 ± 0.002, HSF = 0.053 ± 0.002; HSF vs. LSF p < 0.001, LSF vs. intact p < 0.001). Similar to the supervised method, the clustering score of the super-ordinate categories (animate and inanimate) in LSF is significantly lower than intact and HSF (intact: animate = 0.021 ± 0.009, inanimate = 0.056 ± 0.003, LSF: animate = 0.010 ± 0.001, inanimate = 0.009 ± 0.002, and HSF: animate = 0.035 ± 0.001, inanimate = 0.047 ± 0.001; intact/HSF vs. LSF p value for both categories <0.001). According to these observations, an interaction between SF and abstraction level could improve or degrade a category’s discriminability in neural representation. The time windows for the early and late phases (as shown in Figs. 5 and 6 and Table 1) were chosen according to the onset and peak times of each frequency level to more effectively highlight any differences.
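The tree-building step behind these results (successively merging the two closest clusters, starting from single-stimulus clusters, with 1 − r as the dissimilarity) can be sketched as below. Average linkage is an assumption inferred from the "mean neural distance" wording in the Fig. 5 caption, the function names are hypothetical, and the clustering score itself is not computed because its definition is not given in this section.

```python
from math import sqrt


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def average_linkage_tree(patterns):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance).

    patterns: one population response vector per stimulus.
    ASSUMPTION: average linkage over correlation distance (1 - r).
    """
    n = len(patterns)
    dist = {
        (i, j): 1.0 - pearson_r(patterns[i], patterns[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    clusters = [(i,) for i in range(n)]  # start from single-stimulus clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_distances = [
                    dist[(min(i, j), max(i, j))]
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                average = sum(pair_distances) / len(pair_distances)
                if best is None or average < best[0]:
                    best = (average, a, b)
        average, a, b = best
        merges.append((clusters[a], clusters[b], average))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```

Each entry in the returned list corresponds to one node of the tree, and its distance corresponds to the node height on the vertical axis of a dendrogram.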

Fig. 5: Hierarchical clustering and clustering scores of inferior temporal (IT) responses for Intact, LSF, and HSF.

a Hierarchical clustering of IT responses for intact (I), low spatial frequency (L), and high spatial frequency (H). Hierarchical clustering analysis was employed for neural responses to determine if IT response patterns are grouped similarly to natural categories. Early and late intervals are the same, as stated in Fig. 3. This analysis successively merges the two closest clusters starting from single sample clusters by measuring dissimilarities to generate a hierarchy of clusters. Vertical axes represent the mean neural distance among the stimuli of subclusters (dissimilarity: 1 – r, where r is the Pearson correlation coefficient). On the tree nodes, every colorful bullet point indicates the clustering label determined by majority voting. b The bar graphs show the clustering scores at (70–320) ms for the animate (Aqua), inanimate (Blue), face (Orange), body (Red), identity1 (LawnGreen), and identity2 (DarkGreen). Each row corresponds to I, L, and H from top to bottom. The clustering score was offset by the mean value of the clustering score at 50 ms to 1 ms before stimulus onset. Error bars indicate the standard deviation of the mean estimated by bootstrapping the stimulus set.

Table 1 Degrees of the match between tree nodes and categories, reconstructed from responses of inferior temporal (IT) cells in the early and late phase of neural responses for Intact, Low Spatial Frequency (SF) and High SF stimuli
Fig. 6: Representational dissimilarity matrices (RDMs) and the time course of correlations with the model for monkeys’ inferior temporal (IT) cortex responses at three spatial frequencies (SFs) and different time intervals.

a RDM for monkeys’ IT cortex responses at three SFs and different intervals. For each pair of stimuli, each RDM color encodes the dissimilarity of the two response patterns elicited by the stimuli in monkey IT at different SFs. The dissimilarity measure is 1 – r, where r is the Pearson correlation coefficient over time. The color code displays percentiles for each RDM, which are computed independently and ranked from 0 to 100. The rows show the RDM for Intact (left to right; 80–110 and 155–185 ms, respectively), LSF (left to right; 85–115 and 165–195 ms, respectively), and HSF (left to right; 95–125 and 175–205 ms, respectively) stimuli. b The time course of correlation with the model, for animate vs. inanimate (blue), face vs. body (orange), and identity (green) categories in three SFs. The time courses were offset by the mean value of the correlation at 50 ms to 1 ms before the stimulus onset. Time windows were selected based on the onset and peak times for each frequency level to better reveal differences. The shaded areas indicate standard deviation.

Representational dissimilarity matrices (RDMs) are constructed by comparing all stimuli to one another. As a multivariate method related to population vector analysis, an RDM can reveal information about distributed representation patterns across brain regions. By applying RDMs, we can examine the primary representational organization of information within the brain activity patterns, providing a framework to test assumptions about how that information is structured. The underlying premise of the RDM is that stimuli with more similar representations are harder to decode. Figure 6a shows the empirical RDMs averaged across stimuli for intact, LSF, and HSF in three intervals. Each cell in the matrix describes the dissimilarity measured by correlation distance (i.e., 1 – r, where r is the Pearson correlation coefficient) between the IT activation patterns for one pair of stimuli. In this figure, in the first row for the intact stimuli, the 80–110 ms interval is representative of onset latency; whenever one stimulus is a face and the other belongs to a non-face category, the pair shows considerable dissimilarity. In the second interval, 155–185 ms, the within-dissimilarity for face and body categories decreases, and the between-dissimilarity becomes more prominent; this pattern is also observed for the body versus inanimate categories. Also, face identities have small within-dissimilarity and considerable between-dissimilarity.
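The RDM construction described above, one correlation distance (1 – r) per stimulus pair, can be written compactly. The sketch below assumes one population response vector per stimulus for a given time window; the function names and data layout are hypothetical.

```python
from math import sqrt


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def representational_dissimilarity_matrix(patterns):
    """Symmetric matrix of correlation distances (1 - r) between stimuli.

    patterns: one response vector per stimulus (hypothetical layout, e.g. the
    mean firing rate of each recorded neuron within one time window).
    """
    n = len(patterns)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = 1.0 - pearson_r(patterns[i], patterns[j])
            matrix[i][j] = distance
            matrix[j][i] = distance  # dissimilarity is symmetric
    return matrix
```

When the stimuli are ordered by category, low within-category and high between-category dissimilarity appears as dark blocks along the diagonal, which is the visual pattern the text describes for Fig. 6a.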

The second row shows the RDMs for LSF stimuli in which, during 85–115 ms after stimulus onset, the within-dissimilarity of the inanimate group is decreased. In addition, at 165–195 ms, the within-dissimilarity of faces and non-faces is reduced, and the dissimilarity between them increases. Moreover, the between-dissimilarity of faces and bodies is increased, and their within-dissimilarity is decreased. Finally, the third row shows the RDMs for HSF stimuli. During the 95–125 ms interval, the within-dissimilarity between the human face and body is lower than for other stimuli. In addition, during 175–205 ms, within-dissimilarity for bodies and faces is diminished, and between-dissimilarity increases.

The separation of face versus inanimate can be observed. The RDMs illustrate the representation patterns for several abstraction categories at various SFs. The distinctness and dissimilarity of categories at mid-level abstraction can also be seen by inspecting the late-phase RDMs at both high and low SFs. Furthermore, comparing each late-phase RDM with the corresponding early-phase one at each SF shows increasing category separation based on dissimilarity. Nevertheless, in these RDMs, the contrast between categories at the high and low levels of abstraction is unclear.

Figure 6b shows the correlation with a model, i.e., the Pearson correlation of the empirical RDMs with the reference RDM for each hierarchy level at each SF (see “Methods”); the correlation value relates directly to the separability of categories. Although the mid-level (face-body) correlation is maintained in both HSF and LSF, the correlation values are lower than for intact stimuli. However, the correlation with the model at the super-ordinate level (animate vs. inanimate) and the sub-ordinate level (face identity) is preserved in HSF and eliminated in LSF.
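The RDM construction and the comparison with a reference RDM described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the reference RDM is assumed to be a binary same-category/different-category matrix, and the toy population responses are synthetic.

```python
import numpy as np

def correlation_distance_rdm(patterns):
    """patterns: (n_stimuli, n_neurons) population responses.
    Returns the (n_stimuli, n_stimuli) matrix of 1 - Pearson r."""
    return 1.0 - np.corrcoef(patterns)

def binary_model_rdm(labels):
    """Assumed reference RDM: 0 for same-category pairs, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def rdm_model_correlation(rdm, model):
    """Pearson correlation over the upper triangle (diagonal excluded)."""
    upper = np.triu_indices_from(rdm, k=1)
    return np.corrcoef(rdm[upper], model[upper])[0, 1]

# Synthetic example: two categories, three stimuli each, 40 "neurons".
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
prototypes = rng.normal(size=(2, 40))
patterns = prototypes[labels] + 0.3 * rng.normal(size=(6, 40))

rdm = correlation_distance_rdm(patterns)
score = rdm_model_correlation(rdm, binary_model_rdm(labels))
```

A higher `score` indicates that between-category pairs are more dissimilar than within-category pairs, which is the sense in which the correlation value relates to category separability.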

Behavioral evidence supports the effect of SF on hierarchical object representation

To complement the neurophysiological findings, we conducted a psychophysical task to investigate how SF impacts object recognition at different levels of abstraction. Our goal was to examine whether our findings on the role of SF in the hierarchical organization of object categories and their representation in the IT cortex would also manifest in the behavioral responses of human subjects. In the task, participants were presented with a series of SF-filtered images of objects from different categories at varying levels of abstraction. We used the same stimulus set as in our neurophysiological experiment, except for the faces, which were replaced with six new faces of six different identities of both male and female genders (see “Methods”). We used a forced two-choice paradigm, in which participants were asked whether the presented stimulus belonged to a specific category or identity (for identities, we used the names assigned to them in the training phase). Each trial consisted of the question for 500 ms, 25 ms of stimulus presentation, 500 ms of the mask, 500 ms of a blank page, and the yes/no answer. Participants were asked to answer as accurately and quickly as possible, and their reaction times and answers were collected. Figure 7a shows the task procedure and the added face stimuli.

Fig. 7: Human psychophysics.

a A forced two-choice task is designed to evaluate the impact of the spatial frequency (SF) on hierarchical object recognition (upper panel). Each trial consists of a question (500 ms), stimulus (25 ms), mask (500 ms), blank (500 ms), and answer (yes/no). We also changed the face stimuli to six new faces (bottom panel). New faces in intact, LSF, and HSF are illustrated here. The answers and reaction times of 21 participants are recorded. b Categorization accuracy at three levels of abstraction separated by SF (I: intact, L: LSF, H: HSF). The mid-level categorization accuracy (averaged across subjects) is higher than that of the other two levels in all SF contents. The error bars show the standard deviation. Consistent with our neural results, the categorization performance at the super-ordinate level in HSF is the same as intact and higher than LSF, and the performance at the sub-ordinate level degrades in both LSF and HSF. c, d Average reaction times (across subjects) for correct (c) and wrong (d) answers. The mid-level categorization is faster in correct answers for intact as well as LSF and HSF. Furthermore, the super-ordinate categorization happens faster than the sub-ordinate one in all SF contents.

Figure 7b shows the average accuracy of participants separated by level of abstraction and SF content. Consistent with our findings, the accuracy of participants at the mid-level of abstraction is significantly higher than at the other two levels for intact, LSF, and HSF stimuli (intact: mid = 0.92 ± 0.02 > super-ordinate = 0.89 ± 0.01 (p = 0.03) and > sub-ordinate = 0.78 ± 0.03 (p = 0.001); LSF: mid = 0.84 ± 0.03 > super-ordinate = 0.73 ± 0.02 (p = 5e-4) and > sub-ordinate = 0.58 ± 0.02 (p = 8e-5); HSF: mid = 0.91 ± 0.02 > super-ordinate = 0.87 ± 0.02 (p = 0.03) and > sub-ordinate = 0.70 ± 0.02 (p = 6e-5)). At the super-ordinate level, there is no significant difference in categorization accuracy between intact and HSF (intact = 0.89 ± 0.01, HSF = 0.87 ± 0.02, p = 0.21), while both are significantly higher than LSF (LSF = 0.73 ± 0.02 < HSF (p = 6e-5) and < intact (p = 5e-5)). Finally, at the sub-ordinate level, removing either the HSF components (LSF stimuli) or the LSF components (HSF stimuli) significantly degrades the categorization performance (intact = 0.78 ± 0.03 > LSF = 0.58 ± 0.02 (p = 1e-4) and > HSF = 0.70 ± 0.02 (p = 6e-5)). These findings are consistent with our neurophysiological observations, where HSF is necessary for the super-ordinate level, while sub-ordinate representation relies on both LSF and HSF.

Reaction times for correct and wrong answers are shown separately in Fig. 7c, d (c: correct, d: wrong). Reaction times for the correct answers show the mid-level temporal advantage: in all SF contents, the reaction time for the mid-level is significantly shorter than for the other two levels (intact: mid = 0.39 ± 0.03 < super-ordinate = 0.53 ± 0.04 (p = 2e-4) and < sub-ordinate = 0.72 ± 0.08 (p = 1e-4); LSF: mid = 0.42 ± 0.04 < super-ordinate = 0.63 ± 0.05 (p = 2e-4) and < sub-ordinate = 0.80 ± 0.09 (p = 6e-5); HSF: mid = 0.45 ± 0.04 < super-ordinate = 0.54 ± 0.04 (p = 0.003) and < sub-ordinate = 0.77 ± 0.09 (p = 0.005)). Furthermore, the reaction time for the super-ordinate category is significantly shorter than for the sub-ordinate one (intact: p = 0.001, LSF: p = 0.009, HSF: p = 0.01). The reaction times of wrong answers show no similar trend. Considering the reaction time at each level of abstraction for both correct and wrong answers, no significant impact of SF on the reaction times is observed except for the super-ordinate category, where the reaction time in LSF is significantly longer than in intact and HSF.

To overcome the low within-category variance caused by the limited number of stimuli, we repeated the psychophysical study with a more diverse stimulus set. The new stimulus set consists of 10 categories in total, five animate (bird, mammal, reptile, insect, and fish) and five inanimate (fruit, chair, clock, car, and house) categories at the super-ordinate level. All 10 categories are used for mid-level categorization. In total, there are 150 stimuli (50 in each SF level). Finally, the sub-ordinate level is the same as in the previous experiment, with 10 new faces (see “Methods”). The average accuracy and reaction times of the participants are illustrated in Fig. 8. The average accuracy of participants supports our previous findings, where the categorization accuracy at the mid-level is significantly higher than at the super- and sub-ordinate levels (intact: mid = 0.95 ± 0.01 > super-ordinate = 0.80 ± 0.02 (p = 2 × 10−5) and > sub-ordinate = 0.87 ± 0.02 (p = 6 × 10−4); LSF: mid = 0.86 ± 0.02 > super-ordinate = 0.70 ± 0.01 (p = 6 × 10−5) and > sub-ordinate = 0.52 ± 0.02 (p = 3 × 10−5); HSF: mid = 0.94 ± 0.01 > super-ordinate = 0.80 ± 0.01 (p = 4 × 10−5) and > sub-ordinate = 0.68 ± 0.02 (p = 3 × 10−5)). At the super-ordinate level, there is no significant difference between intact and HSF (p = 0.70), while the accuracy in LSF is significantly lower than in intact (p = 3 × 10−5) and HSF (p = 5 × 10−5). At the sub-ordinate level, both LSF (p = 2 × 10−5) and HSF (p = 8 × 10−5) conditions significantly degraded the categorization accuracy of participants compared to intact stimuli. Therefore, the new experiment supports the generalization of our findings. Unlike the categorization accuracy, the reaction times differ from the previous experiment, and we observed a significant impact of SF on the reaction time at the super- and sub-ordinate levels of abstraction (one-way ANOVA p value for super-ordinate: 0.01, mid-level: 0.05, and sub-ordinate: 0.02). Furthermore, the reaction times are in line with the categorization accuracy, where higher accuracies are accompanied by shorter reaction times.

Fig. 8: Psychophysical study with a diverse stimulus set.

The stimulus set includes 10 categories (5 animate and 5 inanimate) and 10 sub-ordinate face stimuli. a Categorization accuracy (averaged across subjects) for intact, low spatial frequency (SF), and high SF stimuli. The mid-level categorization accuracy is significantly higher than that of the super-ordinate and sub-ordinate levels in all SF contents. Error bars indicate standard deviation. b Average reaction times (across subjects) for correct answers. Reaction times are consistent with categorization accuracy, showing shorter times for higher accuracies. Significant differences in reaction times are observed across SF levels for the mid- and sub-ordinate levels of abstraction.

Discussion

In this paper, we studied the effect of SF on visual object category representation at three levels of abstraction by analyzing the neural responses of the IT cortex and STS of macaque monkeys. To the best of our knowledge, this study is the first attempt to investigate the effect of SF on the hierarchical representation of categories in neuronal space. Our dataset contains mid-level categories (face and body) that form two super-ordinate categories (animate and inanimate) and a sub-ordinate level (i.e., the identity of faces) (Fig. 1). We found that the mid-level information (i.e., face vs. body) is present in both LSF and HSF. However, the presence of HSF is necessary for representing the super-ordinate level in IT neurons. In addition, the identity information (the sub-ordinate level) was absent in both LSF and HSF contents (Figs. 2, 3). Given the presence of mid-SF information in both HSF and LSF bands, our study suggests the potential importance of these frequency bands in mid-level categorization. However, it is essential to clarify that our work specifically focuses on the analysis of HSF and LSF content, and we do not directly examine mid-SF content in this study. Thus, the information of mid-level categories does not directly depend on HSF or LSF contents. In contrast, sub-ordinate level representation in the IT cortex required the full range of SF content, since any SF filtering degraded the identity information. Our results are consistent across the two monkeys (see Supplementary Fig. 2) and across recording locations within the IT and STS (see Supplementary Fig. 3). Finally, we verified our observations with two human psychophysics tasks, one with the same stimulus set and one with a larger number of stimuli. In the psychophysics tasks, we observed that the performance of mid-level categorization is significantly higher than that of the super- and sub-ordinate levels. The super-ordinate level categorization was only affected by LSF filtering (removing HSF components), while both LSF and HSF filtering decreased the performance at the sub-ordinate level. Furthermore, using the reaction times of the correct answers, we observed the mid-level temporal advantage in intact, LSF, and HSF in one task.

Object categorization and its neural correlates in the ventral visual pathway have been widely studied22,24,31,33. Studies show the selective response of IT cells to specific categories at various levels of abstraction23,24,34,35. However, the processing order of the various abstraction levels is still being debated. The response of IT neurons to human and monkey faces shows that the monkey-human discrimination happens faster than the discrimination of face identities25,26. On the one hand, numerous psychophysical and single-neuron recording studies show a faster representation of mid-level categories than of super- or sub-ordinate levels4. On the other hand, this claim is challenged by several studies showing faster perception at the super-ordinate level36,37,38,39.

SF can affect categorization performance. For example, Rotshtein et al.9 suggest that house-flower and face-house categorization are easier in HSF, while flower-face and gender categorizations are easier in LSF. Our results confirm the studies that emphasize the role of HSF content in super-ordinate categorization2,20. Nevertheless, Ashtiani et al. showed that low-frequency information is sufficient for the super-ordinate level21. This contradiction could be due to the categories used and the specific paradigm design, which relies on very fast presentation and a block-based experiment4. Furthermore, in object recognition tasks, specifically animal detection in the work of Ashtiani et al.21, subjects could rely on different parts of objects for effective categorization, which could be the source of the contradiction. Our findings also show that, for face categorization at the mid-level, the information loss due to HSF filtering (removal of LSF components) is the same as that due to LSF filtering; however, HSF filtering of the body stimuli at the mid-level of abstraction preserves the amount of information relative to intact stimuli. This observation supports the evidence suggesting a special neural mechanism for face representation in the IT cortex21. The effect of SF filtering on face information is also compatible with psychophysical studies in which middle-frequency bands are more critical for face perception than LSF and HSF40,41,42,43. According to these studies, LSF or HSF filtering degrades the amount of information in a face object, similar to our findings, where both LSF and HSF contents carry less information for the face category than the intact faces (Fig. 4).

Craddock et al. investigated the effect of SF on categorization using EEG recording20. They used two levels of abstraction: (i) a gender classification task as a mid-level categorization and (ii) a living vs. non-living classification task as a super-ordinate categorization task. They found that HSF content removal impairs both mid- and super-ordinate-level categorizations. However, no significant interaction between task and SF was observed20,44. Unlike EEG studies, psychophysical studies show the impact of SF on categorization at various hierarchical levels. Our IT spiking activity study also supports the psychophysical results. This discrepancy could originate from differences between the signals recorded with EEG and extracellular techniques. EEG signals combine synaptic inputs, neuronal outputs, synchrony, and spatial alignment in the neuronal population45,46,47. A deeper understanding of the mapping between EEG and spiking activity is needed to explain this discrepancy.

Our results are robust to several potential confounding factors. First, all stimuli (intact or filtered) were corrected in contrast and illumination so that the results could not be attributed to basic stimulus characteristics (see “Methods”). Second, as we move from the super-ordinate to the mid- to the sub-ordinate level of abstraction, the within-category heterogeneity decreases. Therefore, the effect of SF on the hierarchical representation could not be due to the stimulus diversity or within-level dissimilarity of stimuli. Third, our observations could not be attributed to the number of stimuli per category, since the number of stimuli per involved category in each experiment was equalized by random sampling of stimuli without replacement (see “Methods”). Fourth, both supervised and unsupervised methods were used to confirm the observations. We employed SVM and SI as two supervised methods for investigating the amount of information in neural responses, and hierarchical clustering as an unsupervised method. Therefore, the observations could not be shaped by the specific characteristics of the analysis method. Ruling out these confounds increases the reliability of our results on the effect of SF on the hierarchical representation of categories in the IT cortex.

The recording area is uniformly distributed across the IT cortex from posterior IT to anterior IT. It also includes the superior temporal sulcus. Our analysis is based on the neurons that were visually responsive across all stimuli (see “Methods”). Since we have face-, body-, and inanimate-selective neurons (47 face-, 82 body-, 25 natural-, and 36 man-made-selective neurons; responses are averaged from 70 ms to 170 ms after stimulus onset), the recording area could include the face patches but is not limited to them. Since determining the exact location of face or body patches requires functional magnetic resonance imaging data, we cannot determine the exact recording locations relative to these patches.

To better understand the role of category-specific neural tuning in our findings, we analyzed the distribution of stimulus category preferences among individual neurons: 47 neurons preferred faces, 72 preferred bodies, 25 favored natural objects, and 24 favored man-made objects. This diverse distribution suggests that our population-level results are not solely driven by a preference for face stimuli but rather reflect a broader spectrum of category-specific tuning across neurons. We then focused on face- and body-selective neuron subpopulations to examine how SF filtering influences category encoding at different levels (see Supplementary Fig. 4). In face-selective neurons, we observed a small but significant decoding of animate-inanimate distinctions in the LSF condition, although this effect was weaker than in the intact and HSF conditions. Identity-level decoding in face-selective neurons was only evident in the intact condition, underscoring the necessity for both LSF and HSF information for fine-grained categorization. In contrast, body-selective neurons did not show significant identity decoding across any SF condition. Therefore, while mid-level distinctions (e.g., face vs. body) remained robust across all SF conditions, superordinate categorizations relied more heavily on HSF, and identity information was present only in face-selective neurons, requiring both LSF and HSF for effective encoding.

The number of stimuli per category-SF condition was small because we studied SF and hierarchical representation simultaneously with a limited number of stimuli. Therefore, we only investigated one category pair per abstraction level. The small stimulus set affects both the super- and sub-ordinate levels of abstraction, and their generalization is not as strong as that of the mid-level. However, as illustrated in Fig. 3 and the human psychophysics task with a larger number of stimuli, the generalization of super-, mid-, and sub-ordinate categorization is high enough for the analysis. More importantly, while the stimulus set does not cover the full spectrum of abstraction levels, we are confident that the abstraction level of the mid-level categories (face vs. body) lies between the super- and sub-ordinate levels. Furthermore, the two identity classes also differ in gender. However, even if the discrimination is based on gender, it is still finer than the face vs. body categorization. From the SF point of view, only two SF bands exist, i.e., LSF and HSF, and the middle SF band was absent from our stimulus set. A more balanced stimulus set with three levels of SF filtering is needed to fully understand the effect of SF filtering and the importance of each SF band for the hierarchical representation of categories in the ventral visual pathway.

One important consideration in our study is the use of pixel-based filtering for SF manipulations. While this approach ensures uniformity across stimuli, it introduces certain limitations that must be acknowledged. Specifically, because our filtering operates in pixel units, the real-world size of an object in the image (e.g., a face versus a body) is not directly taken into account. This means that objects of different physical sizes are treated equally in terms of spatial frequency content, potentially leading to variations in the level of detail preserved across stimuli, especially when comparing objects at different levels of abstraction. For instance, an LSF image of a body might resemble what a monkey would see at a distance of 50 m, while a low-spatial-frequency image of a face might preserve more details, similar to viewing a face from a shorter distance. This discrepancy arises because the filtering is applied in pixel units, not real-world distances such as millimeters or degrees of visual angle. While this is a common practice in visual neuroscience studies, it may result in conclusions that are influenced by the scale of the photographs used, rather than solely by the spatial frequency properties of the objects themselves. We note that our power spectrum analysis indicates that the overall distribution of SF components remains consistent across object categories, regardless of their real-world size.

In light of our findings, the broader implications of this study extend beyond the specifics of SF effects on category representation in the IT cortex. Firstly, our research offers valuable insights into the fundamental mechanisms of visual perception and cognition. Understanding how different levels of shape detail (coarse vs. fine) are processed in the brain can significantly advance our comprehension of visual processing, which has implications for various fields such as cognitive neuroscience and psychology. Moreover, the distinct roles of HSF and LSF in category representation can inform the development of more brain-like models for visual recognition in artificial intelligence and machine learning, potentially leading to improvements in technologies like facial recognition systems and automated image categorization.

In summary, we investigated the effect of SF on hierarchical object representation in the macaque IT cortex and found that super-ordinate representation is highly dependent on the HSF band rather than the LSF band. On the other hand, sub-ordinate categories need all SF contents. These findings suggest that shape boundaries are enough for coarse categories to be represented in the IT cortex, while the IT cortex needs all fine and coarse shape information to represent finer categories. The dependence of categorization on SF provides a mechanism to use various SF bands in hierarchical category perception and behavior.

Methods

Animals and recordings

We analyzed the responses of neurons in the IT cortex and STS of two male macaque monkeys (10 and 11 kg and 11 and 12 years old). All experimental procedures followed the National Institutes of Health Guide for the Care and Use of Laboratory Animals and the Society for Neuroscience Guidelines and Policies. The Institute for Research in Fundamental Sciences committee approved the protocols for both monkeys’ experimental, surgical, and behavioral procedures. We have complied with all relevant ethical regulations for animal use. To place a recording chamber in a subsequent surgery, magnetic resonance imaging and CT scans were performed to identify the prelunate gyrus and arcuate sulcus. All surgical procedures were performed under strict aseptic conditions and isoflurane anesthesia. A custom-made stainless-steel chamber was implanted into each animal before behavioral training. Titanium screws and dental acrylics were used to attach the chamber to the skull. A craniotomy was performed for both monkeys within the 30 × 70 mm chamber (5 mm to 30 mm A/P and 0 mm to 23 mm M/L).

During the experiment, animals were seated in custom-made primate chairs, with their heads restrained and a tube delivering juice rewards inserted into their mouths. Eye position was monitored and stored at 2 kHz using an infrared optical eye tracking system (EyeLink 1000 Plus Eye Tracker, SR Research Ltd, Ottawa, CA). It was mounted in front of the monkey, and eye movements were captured by the EyeLink PM-910 Illuminator Module and EyeLink 1000 Plus Camera (SR Research Ltd, Ottawa, CA). Custom software written in MATLAB using the MonkeyLogic toolbox controlled stimulus presentation and juice delivery. We presented visual stimuli to the animal on a 24-in LED-lit monitor (Asus VG248QE: 1920 × 1080, 144 Hz) set at 65.5 cm from its eyes. The actual time the stimulus appeared on the monitor was recorded using a photodiode (OSRAM Opto Semiconductors, Sunnyvale, CA).

An electrode, securely attached to a recording chamber, was positioned within the craniotomy area using the Narishige two-axis platform, facilitating continuous adjustment of electrode positioning. To establish contact with, or minimally penetrate, the dura, a 28-gauge guide tube was introduced via a manual oil hydraulic micromanipulator from Narishige, Tokyo, Japan. For extracellular recording of neural activity in both monkeys, varnish-coated tungsten microelectrodes (FHC, Bowdoinham, ME) with an impedance between 0.2 and 1 MΩ (measured at 1 kHz) and a shank diameter ranging from 200 to 250 µm were inserted into the brain. Single-electrode recordings were conducted using a pre-amplifier and amplifier (Resana, Tehran, Iran), with filtering parameters set between 300 Hz and 5 kHz for spikes and 0.1 Hz and 9 kHz for local field potentials. Continuous data were digitized and stored at a sampling rate of 40 kHz for subsequent offline spike sorting and data analysis. Identification of IT was based on patterns of gray and white matter, its stereotaxic location, position relative to nearby sulci, and response properties of encountered units.

Stimulus set and task paradigm

The stimulus set consisted of 81 grayscale photographs of various objects in three SFs (HSF: 27, LSF: 27, intact: 27) centered on a gray background. The images used in the study had a size of 500 × 500 pixels (5° × 5°). Specifically, human faces were represented with a width ranging from 330 to 350 pixels (3.3°–3.5°) and a height of 500 pixels (5°). Human body images had a width ranging from 160 to 200 pixels (1.6°–2°) and a height of 500 pixels (5°). Animal body and face images were standardized to 500 × 500 pixels (5° × 5°), except for one animal face with a height of 300 pixels (3°). Additionally, other images in the set had a minimum size of 175 pixels (1.75°) in width and 200 pixels (2°) in height. Images were displayed at the center of a monitor and were scaled to fit in a 5° window.

There are two identities (one male and one female) with three faces per identity, which form the sub-ordinate categories. At the mid-level, there are nine face stimuli (six human faces from the two identities mentioned above and three animal faces), six bodies (three human and three animal bodies), and six natural and six man-made stimuli. At the super-ordinate level, the face and body categories form the animate category, and the natural and man-made categories are combined to form the inanimate category. To present a stimulus set that could be reliably recorded from each neuron, we used rapid serial visual presentation (RSVP). Each recording session consisted of 5 blocks. In each block, all stimuli were shown in a pseudo-random order. The stimulus duration and interstimulus interval were 50 ms and 450 ms, respectively. The monkeys were required to maintain fixation within a 2° window at the center of the screen. They were rewarded with juice every 1.5 to 2 s for maintaining fixation.

Human psychophysics

We designed two psychophysics tasks to evaluate our neurophysiological findings. The first task used the same stimulus set as our neurophysiological experiment, except for the faces, to verify our findings in human perception. We replaced six face stimuli with six new faces of different identities and assigned a name to each of the six identities (three males and three females). We also created a mask for each stimulus by scrambling its pixels. The responses of 21 human subjects (12 males and nine females) were collected. All participants signed a consent form at the beginning of the experiment, and the study protocol was approved by the Institute for Research in Fundamental Sciences committee. All ethical regulations relevant to human research participants were followed. The task consisted of a training phase for identities, followed by the main phase. In the training phase, the participant observed each face stimulus in intact form for an arbitrary time to learn the name assigned to each face. The main phase started right after training and consisted of a forced two-choice (yes/no) categorization task. Each trial started with the question for 500 ms, followed by the stimulus for 25 ms, then the mask for 500 ms, a blank page for 500 ms, and finally, the answer options (yes/no) appeared on the screen. The trial ended when the participant pressed the right (for yes) or left (for no) key. The questions depended on the category. For natural and man-made categories, the question was whether the stimulus belonged to the animate or inanimate category (super-ordinate level). For the body, in addition to the super-ordinate level question, participants were asked whether the stimulus belonged to the face or body category (mid-level). Finally, for face stimuli, in addition to the two aforementioned question types, there was a question about the identity of the face.
The second experiment used the same protocol as the first experiment, with more diverse categories. It consisted of 10 categories organized into five animate (bird, mammal, reptile, insect, and fish) and five inanimate (fruit, chair, clock, car, and house) categories. In the super-ordinate and mid-level, there are 20 (20 in each SF content) stimuli uniformly distributed in categories. At the sub-ordinate level, we included the faces of 10 individuals (five males and five females, different from the first task). The remaining details are the same as in the first task.

Spatial frequency filtering of stimulus

Each stimulus has three versions with respect to SF: intact, HSF, and LSF. A band-pass 2D Butterworth filter is designed to construct each filtered stimulus by multiplying a high-pass Butterworth filter with a low-pass one. The high-pass and low-pass filters are constructed in the frequency domain using the following formulas.

$$\begin{array}{cc}{H}_{H}\left(u,v\right)=\frac{1}{1+{\left(\frac{\sqrt{{u}^{2}+{v}^{2}}}{{f}_{c}}\right)}^{2d}} & {H}_{L}\left(u,v\right)=\frac{1}{1+{\left(\frac{{f}_{c}}{\sqrt{{u}^{2}+{v}^{2}}}\right)}^{2d}}\end{array}$$
(1)
$$H={H}_{H}\times {H}_{L}$$
(2)

where HH and HL are the high- and low-pass filters in the frequency domain, u and v are frequency indices, fc is the cut-off frequency of the corresponding filter, d is the filter order, and H is the final band-pass filter employed to construct the stimulus set. As written, HH attenuates frequencies above its cut-off and HL attenuates frequencies below its cut-off, so H passes the band between the two cut-offs. For LSF images, the cut-off frequencies of the low-pass and high-pass filters are 1 and 10 cycles image−1, respectively. To construct HSF images, cut-off frequencies of 18 and 75 cycles image−1 are used for the low- and high-pass filters, respectively. To equalize the luminance across stimuli, the average gray level of each image is shifted to the middle of the range as follows.

$$I\left(i,j\right)=I\left(i,j\right)-\frac{1}{{N}_{I}}{\sum}_{i,j}I\left(i,j\right)+128$$
(3)

where I(i, j) is the gray level of the pixel located at the i’th row and the j’th column and NI is the total number of pixels in the image. To equalize the contrast, all image pixels are standardized by the STD of all pixels in that image and multiplied by a fixed factor as follows.

$$I\left(i,j\right)=\frac{I(i,j)}{{\sigma }_{I}}\times 8$$
(4)

where σI is the STD of all pixels in the image.
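The filtering and normalization steps in Eqs. (1)–(4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to generate the stimuli: the function names are hypothetical, the filter order is assumed to be 2 because the text does not report it, frequencies are expressed in cycles per image, and the contrast scaling of Eq. (4) is assumed to act on the mean-subtracted image so that the mean gray level of Eq. (3) is preserved.

```python
import numpy as np

def butterworth_bandpass(shape, f_lo, f_hi, order=2):
    """Transfer function passing roughly f_lo..f_hi (cycles per image).
    `order` plays the role of d; its value here is an assumption."""
    rows, cols = shape
    u = np.fft.fftfreq(rows) * rows          # vertical frequency index
    v = np.fft.fftfreq(cols) * cols          # horizontal frequency index
    radius = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)
    n = 2 * order
    # 1 / (1 + (radius / f_hi)^(2d)): attenuates frequencies above f_hi.
    attenuate_above = 1.0 / (1.0 + (radius / f_hi) ** n)
    # 1 / (1 + (f_lo / radius)^(2d)), rewritten to be defined at radius 0.
    attenuate_below = radius ** n / (radius ** n + f_lo ** n)
    return attenuate_above * attenuate_below

def filter_image(image, f_lo, f_hi, order=2):
    transfer = butterworth_bandpass(image.shape, f_lo, f_hi, order)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * transfer))

def equalize_luminance_contrast(image, mean_level=128.0, contrast=8.0):
    """Mean gray level 128 (Eq. 3) and pixel STD 8 (Eq. 4); applying the
    scaling to the centered image is an assumption about the order."""
    centered = image - image.mean()
    return centered / centered.std() * contrast + mean_level

# Synthetic example image (random noise stands in for a photograph).
rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(256, 256))
lsf = equalize_luminance_contrast(filter_image(image, 1, 10))
hsf = equalize_luminance_contrast(filter_image(image, 18, 75))
transfer = butterworth_bandpass((256, 256), 1, 10)
```

With the cut-offs reported above, the LSF version keeps roughly 1–10 cycles per image and the HSF version roughly 18–75 cycles per image, and both end up with the same mean gray level and contrast.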

The temporal dynamic calculation for category information in the IT population

Responses of 379 neurons (261 from Monkey 1 and 118 from Monkey 2) were recorded as the monkeys viewed a rapid presentation of different natural and artificial visual stimuli. Then, the spiking activities were extracted using the ROSS toolbox48. At each time point, we represented each stimulus by a vector whose elements are the average firing rates of the recorded single neurons. For each time point and a given neuron, the average firing rate of that neuron in a 50 ms window around the time point is calculated. Therefore, at each time point, each stimulus can be represented as a point in N-dimensional space, where N is the number of recorded neurons. In this way, each stimulus is represented in the population of N neurons.

$${S}_{i}(t)=[{r}_{1}(t),{r}_{2}(t),\ldots ,{r}_{N}(t)]$$
(5)

where \({S}_{i}(t)\) is the representation of stimulus i at time t and \({r}_{n}(t)\) is the average response of the n’th neuron in a 50 ms window around t.
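As a minimal sketch of Eq. 5, the population vector at one time point can be computed from spike times as follows. The data layout (one array of spike times per neuron, in ms relative to stimulus onset, pooled over `n_trials` presentations) and the function name are assumptions made for illustration.

```python
import numpy as np

def population_vector(spike_times_ms, t_ms, n_trials=1, window_ms=50.0):
    """Return S(t): mean firing rate (spikes/s) of each neuron in a window centred on t_ms.

    spike_times_ms: sequence with one array of spike times (ms) per neuron,
    pooled over n_trials presentations of the same stimulus (hypothetical layout).
    """
    half = window_ms / 2.0
    rates = []
    for spikes in spike_times_ms:
        spikes = np.asarray(spikes, dtype=float)
        # Count spikes in the half-open window [t - 25 ms, t + 25 ms).
        count = np.count_nonzero((spikes >= t_ms - half) & (spikes < t_ms + half))
        rates.append(count / n_trials / (window_ms / 1000.0))
    return np.array(rates)
```

Evaluating this function on a grid of time points yields one N-dimensional trajectory per stimulus.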

The advantages of population representation have been studied in many theoretical and experimental works49,50,51,52, in which signal correlation in the neural population increases coding performance for object discrimination. We normalized each neuron’s responses with the z-score procedure, subtracting the mean and dividing by the standard deviation across trials. Furthermore, to achieve reliable results, we included only the responsive neurons in the analysis. A neuron is considered responsive if its firing rate from 70 to 170 ms after stimulus onset is significantly greater than that from 50 ms before to 50 ms after stimulus onset across all stimuli. We used a non-parametric two-tailed Wilcoxon signed-rank test with a significance level of 0.05 (false discovery rate corrected with the Benjamini/Hochberg method53) to find responsive neurons. We found 330 responsive neurons in total (223 from Monkey 1 and 107 from Monkey 2), 45 of which were recorded from the STS. All population analyses are conducted on these 330 responsive neurons.
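The normalization and the multiple-comparison correction can be illustrated with a short NumPy sketch. It assumes that per-neuron p values from the signed-rank test are already available (the test itself is not reimplemented here); the function names and the `(n_trials, n_neurons)` array layout are hypothetical.

```python
import numpy as np

def zscore_across_trials(responses):
    """z-score each neuron (column) across trials (rows)."""
    mean = responses.mean(axis=0)
    std = responses.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant neurons at zero
    return (responses - mean) / std

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of tests that survive Benjamini-Hochberg FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Step-up rule: reject every hypothesis up to the largest passing rank.
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True
    return reject
```

Neurons whose entry in the returned mask is `True` would be the ones retained as responsive.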

Low dimensional representation

We embedded the data into a low-dimensional space with linear and nonlinear approaches. Principal component analysis (PCA) was applied as a linear method to illustrate the separation of IT responses for different categories in a two-dimensional space. PCA uses the eigenvectors of the covariance matrix of the samples to transform the data from the high-dimensional to the lower-dimensional neural space. We calculated the principal components of the neural responses for an early and a late interval, 80–110 ms and 155–185 ms after stimulus onset. The first two components, corresponding to the two largest eigenvalues, are then taken as a 2D representation of the high-dimensional neural responses. The PCA algorithm is applied to all stimuli simultaneously; thus, the PC dimensions are the same for different category comparisons. The variance explained by the first two dimensions is 38%.
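The PCA step described above can be sketched with an eigendecomposition of the sample covariance matrix. The function name and the `(n_stimuli, n_neurons)` input layout are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project X (n_stimuli, n_neurons) onto its first two principal components.

    Returns the 2-D scores and the fraction of variance they explain.
    """
    Xc = X - X.mean(axis=0)                 # centre each neuron
    cov = np.cov(Xc, rowvar=False)          # (n_neurons, n_neurons) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :2]
    explained = eigvals[:2].sum() / eigvals.sum()
    return scores, explained
```

Fitting on all stimuli at once, as in the text, means that the two returned axes are shared by every category comparison.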

Category information using separability index and classification accuracy

The separability index (SI), which measures the discrimination of two categories (e.g., face and body) from the population responses, is adapted from our previous work4. SI is based on the within-category and between-category scatter matrices of the samples and is defined as the ratio of the norm of the between-category scatter matrix to the norm of the within-category scatter matrix. Here, we used the Frobenius norm for the scatter matrices. Employing SI as a category separation measure in neuronal populations has several advantages. It can be used for high-dimensional data (330 responsive neurons in this study). It takes both the variance and covariance of categories into account. It can also be employed for multi-class scenarios. This metric is computed for the high- and low-dimensional neural responses using a 50 ms sliding time window with a 1 ms stride. It is then smoothed with a Gaussian window with σ = 10 ms. For classification accuracy, an SVM classifier with a linear kernel54 is trained using the same sliding windows as for SI. Bootstrap sampling is used to evaluate the SVM classifier: training samples are selected with replacement, and the remaining samples are used as test samples. The statistical analysis of both SI and SVM is carried out empirically, as stated in the next section. For both SI and SVM, when the number of samples per class is not equal, the larger classes are randomly subsampled in each run to equalize the number of samples per class.
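A minimal sketch of the SI computation for one time window is given below. The weighting of the between-category scatter by class size follows the usual discriminant-analysis convention and is an assumption; the exact normalization is specified in the cited previous work and is not reproduced here.

```python
import numpy as np

def separability_index(X, labels):
    """Ratio of Frobenius norms of between- and within-category scatter matrices.

    X: (n_samples, n_neurons) population responses; labels: category of each sample.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        diff = (mu - grand_mean)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)  # between-category scatter (assumed weighting)
        centered = Xc - mu
        S_w += centered.T @ centered          # within-category scatter
    return np.linalg.norm(S_b, "fro") / np.linalg.norm(S_w, "fro")
```

Because the loop runs over all unique labels, the same function applies unchanged to multi-class comparisons.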

Statistics and reproducibility

Unless stated otherwise, all statistical analyses are based on the bootstrap method described here. A bootstrapping process55 is employed to calculate standard deviations, confidence intervals, and p values. All calculations were repeated 1000 times, each on a random selection of stimuli. The confidence intervals are calculated empirically from quantiles across the bootstrap runs; the 95% confidence interval therefore spans the 0.025 to the 0.975 quantile. The p values are then calculated from the bootstrap distribution of a given index. The p value for a given index for a specific category is calculated as \(\frac{r}{n}\), where r is the number of values lower (higher) than zero (or another value if stated), and n is the total number of runs. To compare an index between two categories, the index values are first subtracted, and the difference is compared with zero as described above.
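The bootstrap procedure can be sketched as follows for a generic index. The function name, the `statistic` callable, and the fixed random seed are illustrative assumptions; the p value shown is for the hypothesis that the index exceeds `null_value`.

```python
import numpy as np

def bootstrap_summary(data, statistic, n_runs=1000, null_value=0.0, seed=0):
    """Bootstrap STD, 95% confidence interval, and empirical p value of an index.

    data: array whose first axis indexes stimuli; statistic: function mapping a
    resampled array to a scalar index (both hypothetical placeholders).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    runs = np.array([
        statistic(data[rng.integers(0, n, size=n)])  # resample with replacement
        for _ in range(n_runs)
    ])
    ci = np.quantile(runs, [0.025, 0.975])
    # p = r / n, where r counts the runs that fall below the null value.
    p_value = np.count_nonzero(runs < null_value) / n_runs
    return runs.std(), ci, p_value
```

To compare two categories, the same function would be applied to the per-run difference of the two index values.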

Neuron response modulation index and SF modulation index

To measure the effect of SF on neuron responses and category information, we defined the response modulation index (RMI) and the SF modulation index (SMI), which quantify the neuron response change and the information loss caused by SF filtering, respectively. To calculate the SMI for the LSF or HSF band, the difference in SI between the intact and the filtered (LSF or HSF) condition is normalized by the sum of the two SI values; the RMI is computed in the same way from firing rates. Mathematically, \({\mathrm{SMI}}_{\mathrm{LSF}}\) and \({\mathrm{SMI}}_{\mathrm{HSF}}\) are defined as follows (the same method is applied for the RMI in Fig. 2a).

$$\begin{array}{cc}{{{{\rm{SMI}}}}}_{{{{\rm{LSF}}}}}=\frac{{{{{\rm{SI}}}}}_{{{{\rm{Intact}}}}}-{{{{\rm{SI}}}}}_{{{{\rm{LSF}}}}}}{{{{{\rm{SI}}}}}_{{{{\rm{Intact}}}}}+{{{{\rm{SI}}}}}_{{{{\rm{LSF}}}}}} & {{{{\rm{SMI}}}}}_{{{{\rm{HSF}}}}}=\frac{{{{{\rm{SI}}}}}_{{{{\rm{Intact}}}}}-{{{{\rm{SI}}}}}_{{{{\rm{HSF}}}}}}{{{{{\rm{SI}}}}}_{{{{\rm{Intact}}}}}+{{{{\rm{SI}}}}}_{{{{\rm{HSF}}}}}}\end{array}$$
(6)

Higher values of SMI for LSF (HSF) for a given category contrast, e.g., face vs. body, indicate greater information loss in the LSF (HSF) band and, equivalently, the importance of HSF (LSF) for discriminating face from body. Conversely, a low absolute value of SMI indicates that SF filtering does not substantially affect the information.
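Eq. 6 reduces to a one-line contrast ratio. The sketch below uses a hypothetical function name and works on scalars or on NumPy arrays (e.g., SI time courses); the numbers in the usage note are made up for illustration.

```python
import numpy as np

def modulation_index(intact, filtered):
    """(intact - filtered) / (intact + filtered), as in Eq. 6.

    Pass SI values to obtain the SMI, or firing rates to obtain the RMI.
    """
    intact = np.asarray(intact, dtype=float)
    filtered = np.asarray(filtered, dtype=float)
    return (intact - filtered) / (intact + filtered)
```

For example, an intact SI of 3.0 and an LSF SI of 1.0 give an SMI of 0.5, whereas equal SI values give 0, meaning that filtering did not change the information.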

Clustering

We applied hierarchical clustering in each SF band in the early phase (80–110 ms after stimulus onset), the late phase (155–185 ms after stimulus onset), and the interval of 70–170 ms after stimulus onset. We employed agglomerative hierarchical clustering to compute the tree structure56. The advantage of this method is that it is an unsupervised analysis with no prior assumption on data representations. The likeness score between a category and the tree is estimated as the average of two ratios24:

$$\begin{array}{c}{r}_{1}=\frac{\text{number of category members under the node}}{\text{total members of the category}},\\ {r}_{2}=\frac{\text{number of category members under the node}}{\text{total stimuli under the node}}\end{array}$$

The clustering score is the average of \({r}_{1}\) and \({r}_{2}\).
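For a single node of the tree, the score can be sketched as below. The function name and the representation of nodes and categories as collections of stimulus indices are assumptions; how the node is selected from the tree is not covered by this sketch.

```python
def clustering_score(node_members, category_members):
    """Average of r1 (share of the category under the node) and r2 (purity of the node)."""
    node = set(node_members)
    category = set(category_members)
    overlap = len(node & category)
    r1 = overlap / len(category)  # category members under the node / category size
    r2 = overlap / len(node)      # category members under the node / node size
    return (r1 + r2) / 2
```

A node containing exactly the members of a category scores 1, and a node sharing no member with the category scores 0.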

Representational dissimilarity matrices

The representational dissimilarity matrix (RDM) is computed by comparing all pairwise combinations of stimuli. The distance between two activation patterns is calculated as 1 − r, where r is the Pearson correlation coefficient57. All RDMs are rank-normalized to the range 0–100.
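A minimal sketch of the RDM computation follows. The rank normalization shown here (ordinal ranks of the off-diagonal entries rescaled to 0–100, with ties broken arbitrarily) is one plausible reading of the text and should be treated as an assumption, as are the function names.

```python
import numpy as np

def rdm(patterns):
    """1 - Pearson r between all pairs of rows of patterns (n_stimuli, n_neurons)."""
    return 1.0 - np.corrcoef(patterns)

def rank_normalize(matrix):
    """Replace off-diagonal dissimilarities by their ranks, rescaled to [0, 100]."""
    n = matrix.shape[0]
    iu = np.triu_indices(n, k=1)
    values = matrix[iu]
    ranks = np.argsort(np.argsort(values))  # ordinal ranks 0 .. m-1
    scaled = 100.0 * ranks / max(len(values) - 1, 1)
    out = np.zeros((n, n), dtype=float)
    out[iu] = scaled
    out[(iu[1], iu[0])] = scaled            # mirror to keep the matrix symmetric
    return out
```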

Correlation with the model

The correlation of the empirical RDMs with three theoretical models was calculated per time point: a stimulus animacy model, a model that separates face versus body stimuli, and a model based on each face identity58. These models predict the dissimilarity of IT activation patterns for each stimulus pair56. The correlation between a model and the empirical RDMs reveals the extent to which the “representational structure” defined by that model is present in the IT activation patterns58. Each empirical RDM is tested for its ability to explain the reference RDM for each category. We define the theoretical RDM model for each abstraction level (animate-inanimate, face-body, and identity) as a matrix in which entries for between-category pairs are one and entries for within-category pairs are zero. The correlation with a model is then determined by the correlation between the reference RDM and the theoretical RDM model for each category. Correlation with a model is computed for the RDMs at each time step within a 50 ms window31,58.
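The binary model RDMs and their comparison with an empirical RDM can be sketched as follows. The type of correlation is not specified in this section, so the use of a Pearson correlation over the upper-triangle entries is an assumption, as are the function names.

```python
import numpy as np

def model_rdm(labels):
    """Binary model RDM: 1 for between-category pairs, 0 for within-category pairs."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def model_correlation(empirical_rdm, model):
    """Correlate the off-diagonal upper-triangle entries of the two RDMs (Pearson, assumed)."""
    iu = np.triu_indices(empirical_rdm.shape[0], k=1)
    return np.corrcoef(empirical_rdm[iu], model[iu])[0, 1]
```

Passing animacy, face/body, or identity labels to `model_rdm` would give the three models; repeating `model_correlation` for the RDM of each time window gives a time course per model.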

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.